Machine Translation of Arabic Dialects

Rabih Zbib1, Erika Malchiodi1, Jacob Devlin1, David Stallard1, Spyros Matsoukas1, Richard Schwartz1, John Makhoul1, Omar F. Zaidan2, Chris Callison-Burch2
1Raytheon BBN Technologies, 2Johns Hopkins University


Abstract

Arabic dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialect sentences are selected from a large corpus of Arabic web text, and translated using Mechanical Turk. We use this crowdsourced data to build Dialect Arabic MT systems. Small amounts of dialect data have a dramatic impact on the quality of translation. When translating Egyptian and Levantine test sets, our Dialect Arabic MT system performs 6.3 and 7.0 BLEU points higher than a Modern Standard Arabic MT system trained on a 150-million-word Arabic-English parallel corpus.