PPDB: The Paraphrase Database
Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch
We present the 1.0 release of our paraphrase database, PPDB. Its English
portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73
million phrasal and 8 million lexical paraphrases, as well as 140 million
paraphrase patterns, which capture many meaning-preserving syntactic
transformations. The paraphrases are extracted from bilingual parallel corpora
totaling over 100 million sentence pairs and over 2 billion English words. We
also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each
paraphrase pair in PPDB contains a set of associated scores, including
paraphrase probabilities derived from the bitext data and a variety of
monolingual distributional similarity scores computed from the Google n-grams
and the Annotated Gigaword corpus. Our release includes pruning tools that
allow users to determine their own precision/recall tradeoff.
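As a rough illustration of the precision/recall tradeoff mentioned above, the following minimal sketch filters paraphrase pairs by a score threshold. It is not the released pruning tool: the `Paraphrase` record, the single combined `score` field, the `prune` function, and the toy data are all hypothetical and stand in for whatever scores a user chooses to threshold on.

```python
from typing import List, NamedTuple


class Paraphrase(NamedTuple):
    """Hypothetical record for one paraphrase pair (not the PPDB file format)."""
    phrase: str
    paraphrase: str
    score: float  # assumed combined score; higher means more reliable


def prune(pairs: List[Paraphrase], min_score: float) -> List[Paraphrase]:
    """Keep only the pairs whose score is at least min_score."""
    return [p for p in pairs if p.score >= min_score]


# Toy, made-up pairs and scores for illustration only.
pairs = [
    Paraphrase("thrown into jail", "imprisoned", 0.9),
    Paraphrase("thrown into jail", "arrested", 0.6),
    Paraphrase("thrown into jail", "thrown", 0.1),
]

strict = prune(pairs, min_score=0.8)  # fewer pairs, favoring precision
loose = prune(pairs, min_score=0.5)   # more pairs, favoring recall
```

Raising the threshold keeps fewer but more reliable pairs; lowering it keeps more pairs at the cost of admitting noisier ones.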