+------------------------------------+
| WikiAnswers Paraphrase Dataset 1.0 |
+------------------------------------+

Authors:

    Anthony Fader (afader@cs.washington.edu)
    Luke Zettlemoyer
    Oren Etzioni

This file describes the WikiAnswers Paraphrase Dataset. The dataset contains
approximately 18 million word-aligned {question1, question2} pairs.

If you use this data in your research, please use the following citation:

    @inproceedings{Fader13,
        author    = {Anthony Fader and Luke Zettlemoyer and Oren Etzioni},
        title     = {{Paraphrase-Driven Learning for Open Question Answering}},
        booktitle = {Proceedings of the 51st Annual Meeting of the
                     Association for Computational Linguistics},
        year      = {2013}
    }

+------------+
| Background |
+------------+

WikiAnswers (http://wiki.answers.com) is a website where users can post
questions and answers about almost any topic. A unique feature is that users
can tag two questions as equivalent and merge them together. For example, here
is a page listing the revisions to "What is the date of birth for Malia
Obama?"

    http://wiki.answers.com/Q/Special:Changes&cv=question:What_is_the_date_of_birth_for_Malia_Obama

We crawled these pages over a period of a month in 2012 and scraped the
paraphrases from pages like the one above. This dataset contains these
paraphrases, their word alignments, and basic NLP processed versions of the
questions (tokenization, tagging, and lemmatization). There are about 2.5
million distinct questions and 18 million distinct paraphrase pairs.

Here is some example data:

    question:

        What are the green blobs in plant cells?

    paraphrases (lemmatized):

        a green substance in the plant cell be the ?
        be cytoplasm a green cell part in certain plant cell ?
        package of green coloring in plant cell ?
        part of the plant cell where the cell get it green color ?
        the green part in a plant be call ?
        the green part of a plant cell ?
        the part of the plant cell that make the plant green be call ?
        what be green part call ?
        what be green part in plant cell ?
        what be the green body in a plant cell ?
        what be the green machine within a plant cell ?
        what be the green part of a plant cell ?
        what be the green part of plant cell ?
        what be the green substance in plant cell ?
        what be the name of the green thing in the plant cell ?
        what be the part of plant cell that give it green color ?
        what be the part of the cell that produce the green color of the plant ?
        what be the part of the plant cell that make the green color ?
        what be to part of the plant cell ?
        what cell part do plant have that enable the plant to be give a green color ?
        what in a plant cell that be green ?
        what part of the cell be large and green ?
        what part of the plant cell turn it green ?

+---------------+
| questions.txt |
+---------------+

The file questions.txt contains four tab-separated columns of text in the form
(question, tokens, pos-tags, lemmas). For example:

    question: Are liposaccharides protiens?
    tokens:   Are liposaccharides protiens ?
    pos-tags: VBP NNS NNS .
    lemmas:   be liposaccharide protien ?

The tokens, pos-tags, and lemmas were produced using the Stanford CoreNLP
tools. The code and models used are:

    stanford-corenlp-1.3.4.jar
    stanford-corenlp-1.3.4-models.jar

The properties used were:

    annotators = tokenize,ssplit,pos,lemma
    ssplit.eolonly = true

+---------------------+
| word_alignments.txt |
+---------------------+

The file word_alignments.txt contains three tab-separated columns of text in
the form (question1, question2, word-alignments). Each question column is a
space-separated list of lemmatized tokens. The word-alignment column is a
space-separated list of word-index alignments. Here is an example record:

    question1:       how many people live in racine ?
    question2:       what be the population of racine ?
    word-alignments: 0-0 1-1 2-2 2-3 3-3 4-4 5-5 6-6

This record corresponds to the following word alignment:

    0    1    2          3          4  5      6
    how  many people     live       in racine ?
    |    |    |  \       |          |  |      |
    |    |    |     \    |          |  |      |
    |    |    |        \ |          |  |      |
    what be   the        population of racine ?
    0    1    2          3          4  5      6

Each word alignment pair is in the form i-j, where i is the index in question1
and j is the index in question2. The indexes start at 0. Some words may not be
aligned to any word.

The paraphrases in word_alignments.txt appear in both orders, (q1, q2, w) and
(q2, q1, w').

The word-alignments were created using MGIZA++, called via Moses. The input
corpus to MGIZA++ was (q1, q2) and (q2, q1) for each paraphrase. The command
to create the alignments is:

    perl $MOSES/scripts/training/train-model.perl \
        -mgiza \
        -mgiza-cpus 8 \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        --score-options='--GoodTuring' \
        --extract-options='--IncludeSentenceId' \
        -parallel
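
+----------------------------+
| Example: reading the files |
+----------------------------+

The column layouts described for questions.txt and word_alignments.txt can be
read with a few lines of Python. The following is a minimal sketch and is not
part of the dataset distribution. All function and class names in it
(parse_question_line, parse_alignment_line, aligned_words, unaligned_indexes,
iter_alignments) are hypothetical helpers introduced for illustration, and the
UTF-8 file encoding is an assumption that this file does not confirm.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Question(NamedTuple):
    text: str
    tokens: List[str]
    pos_tags: List[str]
    lemmas: List[str]


class AlignedPair(NamedTuple):
    question1: List[str]              # lemmatized tokens of question1
    question2: List[str]              # lemmatized tokens of question2
    alignment: List[Tuple[int, int]]  # (i, j) pairs, 0-based


def parse_question_line(line: str) -> Question:
    """Parse one questions.txt line: question, tokens, pos-tags, lemmas."""
    text, tokens, pos_tags, lemmas = line.rstrip("\n").split("\t")
    return Question(text, tokens.split(), pos_tags.split(), lemmas.split())


def parse_alignment_line(line: str) -> AlignedPair:
    """Parse one word_alignments.txt line: question1, question2, alignments."""
    q1, q2, links = line.rstrip("\n").split("\t")
    pairs = []
    for link in links.split():
        i, j = link.split("-")  # "i-j": i indexes question1, j question2
        pairs.append((int(i), int(j)))
    return AlignedPair(q1.split(), q2.split(), pairs)


def aligned_words(pair: AlignedPair) -> List[Tuple[str, str]]:
    """Return the (question1 word, question2 word) pair for each link."""
    return [(pair.question1[i], pair.question2[j]) for i, j in pair.alignment]


def unaligned_indexes(pair: AlignedPair) -> Tuple[List[int], List[int]]:
    """Return the indexes in each question that take part in no link."""
    used1 = {i for i, _ in pair.alignment}
    used2 = {j for _, j in pair.alignment}
    free1 = [i for i in range(len(pair.question1)) if i not in used1]
    free2 = [j for j in range(len(pair.question2)) if j not in used2]
    return free1, free2


def iter_alignments(path: str) -> Iterator[AlignedPair]:
    """Stream records from a file; UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_alignment_line(line)


# The example records from this file, written as tab-separated lines.
EXAMPLE_QUESTION_LINE = (
    "Are liposaccharides protiens?\tAre liposaccharides protiens ?\t"
    "VBP NNS NNS .\tbe liposaccharide protien ?"
)
EXAMPLE_ALIGNMENT_LINE = (
    "how many people live in racine ?\t"
    "what be the population of racine ?\t"
    "0-0 1-1 2-2 2-3 3-3 4-4 5-5 6-6"
)

if __name__ == "__main__":
    pair = parse_alignment_line(EXAMPLE_ALIGNMENT_LINE)
    for word1, word2 in aligned_words(pair):
        print(word1, "<->", word2)
```

Because word_alignments.txt holds millions of records, the sketch streams the
file line by line through iter_alignments rather than loading it into memory.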