+------------------------------------+
| WikiAnswers Paraphrase Dataset 1.0 |
+------------------------------------+

Authors:

    Anthony Fader (afader@cs.washington.edu)
    Luke Zettlemoyer
    Oren Etzioni

This file describes the WikiAnswers Paraphrase Dataset. The dataset contains 
approximately 18 million word-aligned {question1, question2} pairs. 

If you use this data in your research, please use the following citation:

@inproceedings{Fader13,
    author    = {Anthony Fader and Luke Zettlemoyer and Oren Etzioni},
    title     = {{Paraphrase-Driven Learning for Open Question Answering}},
    booktitle = {Proceedings of the 51st Annual Meeting of the Association for 
                 Computational Linguistics},
    year      = {2013}
}

+------------+
| Background |
+------------+

WikiAnswers (http://wiki.answers.com) is a website where users can post 
questions and answers about almost any topic. A unique feature is that users
can tag two questions as equivalent and merge them together. For example,
here is a page listing the revisions to "What is the date of birth for Malia 
Obama?" 

    http://wiki.answers.com/Q/Special:Changes&cv=question:What_is_the_date_of_birth_for_Malia_Obama

We crawled these pages over a period of a month in 2012 and scraped the 
paraphrases from pages like the one above. 

This dataset contains these paraphrases, their word alignments, and basic 
NLP-processed versions of the questions (tokenization, part-of-speech 
tagging, and lemmatization). There are about 2.5 million distinct questions 
and 18 million distinct paraphrase pairs.

Here is some example data:

    question:

        What are the green blobs in plant cells?

    paraphrases (lemmatized):

        a green substance in the plant cell be the ?
        be cytoplasm a green cell part in certain plant cell ?
        package of green coloring in plant cell ?
        part of the plant cell where the cell get it green color ?
        the green part in a plant be call ?
        the green part of a plant cell ?
        the part of the plant cell that make the plant green be call ?
        what be green part call ?
        what be green part in plant cell ?
        what be the green body in a plant cell ?
        what be the green machine within a plant cell ?
        what be the green part of a plant cell ?
        what be the green part of plant cell ?
        what be the green substance in plant cell ?
        what be the name of the green thing in the plant cell ?
        what be the part of plant cell that give it green color ?
        what be the part of the cell that produce the green color of the plant ?
        what be the part of the plant cell that make the green color ?
        what be to part of the plant cell ?
        what cell part do plant have that enable the plant to be give a green color ?
        what in a plant cell that be green ?
        what part of the cell be large and green ?
        what part of the plant cell turn it green ?


+---------------+
| questions.txt |
+---------------+

The file questions.txt contains four tab-separated columns of text in the form
(question, tokens, pos-tags, lemmas). For example (each column is shown on its
own line for readability):

    question: Are liposaccharides protiens?
    tokens:   Are liposaccharides protiens ?
    pos-tags: VBP NNS NNS .
    lemmas:   be liposaccharide protien ?

The tokens, pos-tags, and lemmas were produced using the Stanford CoreNLP 
tools. The code and models used are: stanford-corenlp-1.3.4.jar and 
stanford-corenlp-1.3.4-models.jar. The properties used were:

    annotators = tokenize,ssplit,pos,lemma
    ssplit.eolonly = true
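
The four-column format described above can be read with a few lines of Python.
This is only a minimal sketch: the names QuestionRecord and parse_question_line
are made up for illustration, and it assumes that each record sits on one line
and that the tokens, pos-tags, and lemmas columns are space-separated (as the
example record suggests).

```python
from typing import List, NamedTuple


class QuestionRecord(NamedTuple):
    """One row of questions.txt (hypothetical container, not part of the dataset)."""
    question: str
    tokens: List[str]
    pos_tags: List[str]
    lemmas: List[str]


def parse_question_line(line: str) -> QuestionRecord:
    # Assumption: exactly four tab-separated columns per line.
    question, tokens, pos_tags, lemmas = line.rstrip("\n").split("\t")
    # Assumption: items inside the last three columns are space-separated.
    return QuestionRecord(
        question=question,
        tokens=tokens.split(" "),
        pos_tags=pos_tags.split(" "),
        lemmas=lemmas.split(" "),
    )


# The example record from this section, joined with tabs.
line = (
    "Are liposaccharides protiens?\t"
    "Are liposaccharides protiens ?\t"
    "VBP NNS NNS .\t"
    "be liposaccharide protien ?\n"
)
record = parse_question_line(line)
print(list(zip(record.tokens, record.pos_tags, record.lemmas)))
```

For the full file, the same function could be applied to each line of
open("questions.txt", encoding="utf-8"); the encoding is an assumption.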
    

+---------------------+
| word_alignments.txt |
+---------------------+

The file word_alignments.txt contains three tab-separated columns of text
in the form (question1, question2, word-alignments). Each question column is a
space-separated list of lemmatized tokens. The word-alignment column is a 
space-separated list of word-index alignments. Here is an example record:

    question1:       how many people live in racine ?
    question2:       what be the population of racine ?
    word-alignments: 0-0 1-1 2-2 2-3 3-3 4-4 5-5 6-6

This record corresponds to the following word alignment:

         0    1     2     3       4    5    6
        how  many people live     in racine ?
         |    |     | \   |       |    |    |
         |    |     |  \  |       |    |    |
         |    |     |   \ |       |    |    |
        what  be   the population of racine ?
         0    1     2      3      4    5    6

Each word alignment pair is in the form i-j, where i is the index in question1
and j is the index in question2. The indexes start at 0. Some words may not be
aligned to any word.
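
The i-j format above can be turned into index pairs and then into aligned word
pairs. The following is a minimal sketch, not the code used to build the
dataset; the function name parse_alignment_record is hypothetical, and it
assumes one record per line.

```python
from typing import List, Tuple


def parse_alignment_record(line: str) -> Tuple[List[str], List[str], List[Tuple[int, int]]]:
    # Three tab-separated columns: question1, question2, word-alignments.
    q1, q2, alignments = line.rstrip("\n").split("\t")
    tokens1 = q1.split(" ")
    tokens2 = q2.split(" ")
    pairs = []
    for item in alignments.split():
        # Each item is "i-j": i indexes question1, j indexes question2 (0-based).
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return tokens1, tokens2, pairs


# The example record from this section, joined with tabs.
line = (
    "how many people live in racine ?\t"
    "what be the population of racine ?\t"
    "0-0 1-1 2-2 2-3 3-3 4-4 5-5 6-6"
)
tokens1, tokens2, pairs = parse_alignment_record(line)

# Look up the words behind each index pair.
aligned_words = [(tokens1[i], tokens2[j]) for i, j in pairs]

# Words in question1 that have no alignment (none in this record).
aligned_indexes1 = {i for i, _ in pairs}
unaligned1 = [t for k, t in enumerate(tokens1) if k not in aligned_indexes1]

print(aligned_words)
print(unaligned1)
```

Note that one word can take part in several pairs: here "people" (index 2) is
aligned to both "the" and "population".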

The paraphrases in word_alignments.txt appear in both orders, (q1, q2, w) and
(q2, q1, w'). 
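
Because every paraphrase appears in both orders, code that wants to visit each
paraphrase only once needs an order-independent key. One possible sketch
follows; the helper unordered_pair_key is hypothetical. This README does not
say that w' is simply w with the indexes swapped, so the sketch only
deduplicates the question pairs and makes no assumption about the alignments.

```python
from typing import Tuple


def unordered_pair_key(q1: str, q2: str) -> Tuple[str, str]:
    # Sort the two questions so that (q1, q2) and (q2, q1) give the same key.
    a, b = sorted((q1, q2))
    return (a, b)


# The same paraphrase in both orders, as it would appear in the file.
records = [
    ("how many people live in racine ?", "what be the population of racine ?"),
    ("what be the population of racine ?", "how many people live in racine ?"),
]

distinct_pairs = {unordered_pair_key(q1, q2) for q1, q2 in records}
print(len(distinct_pairs))
```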

The word-alignments were created using MGIZA++, called via Moses. The input
corpus to MGIZA++ was (q1, q2) and (q2, q1) for each paraphrase. The command
to create the alignments is:

    perl $MOSES/scripts/training/train-model.perl \
        -mgiza \
        -mgiza-cpus 8 \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        --score-options='--GoodTuring' \
        --extract-options='--IncludeSentenceId' \
        -parallel