Arabic Dialect Processing
The existence of dialects for any language constitutes a challenge for NLP in
general, since it adds another set of dimensions of variation from a known
standard. The problem is particularly interesting and challenging in Arabic and
its different dialects, where the divergence from the standard could, in some
linguistic theories, warrant classification as a different language. This
problem would not be as pronounced if standard Arabic were a living
language; however, it is not. Any realistic and practical approach to processing
Arabic will have to account for dialectal usage, since it is so pervasive. In
this tutorial, we will attempt to highlight different dialectal phenomena,
how they diverge from the standard, and why they pose challenges to NLP. Our
tutorial will have four parts. First, we will give background on the
issues in standard Arabic NLP. Then we will give a high-level
generic view of dialects and the aspects of them that are of interest to
the NLP community, addressing both text and speech issues in addition to
standardization issues. We will focus in depth on two aspects of dialect
processing in the third and fourth parts of the tutorial, namely, dialectal
morphology and dialectal syntactic parsing. Throughout the presentation we will
make references to the different resources available and draw contrastive links
with standard Arabic and English. Moreover, we will discuss annotation
standards as exemplified in the Treebank. We will provide links to recent
publications and available toolkits/resources for all four sections.
It is worth noting that:
· This tutorial is designed for computer scientists and linguists alike.
· No knowledge of Arabic is required (though we recommend taking a look at Nizar Habash’s Arabic NLP tutorial, www.ccls.columbia.edu/cadim/presentations.html, which will be reviewed in the first quarter of the tutorial).
Outline:
40:00 Review of Basic Arabic NLP with a segue into dialect from a sociolinguistic and political perspective
40:00 Generic dialectal issues from an NLP perspective
Orthography
Phonetics
Phonology
Morphology
Syntax
Semantics
40:00 Focus on Dialectal Morphology
40:00 Focus on Dialectal Syntactic Parsing
Bios
Mona Diab received her PhD in 2003 in
the Linguistics department and UMIACS,
Dr. Diab served as co-chair, together with Kareem Darwish and Nizar Habash,
of the Workshop on Computational Approaches to Semitic Languages
(ACL 2005). She was also a senior
member in the 2005 JHU summer workshop on Parsing Arabic Dialects. In 2005, she co-founded the Columbia Arabic
Dialect Modeling (CADIM) group together with Nizar Habash and Owen Rambow. She
has published over 20 articles in different conferences, journals and
workshops. Mona has presented her work in numerous lectures and tutorials both
for academic and industrial audiences.
Nizar Habash received his PhD in 2003 from the Computer Science Department,
Dr. Habash served as
co-chair for the Workshop on Computational Approaches to Semitic Languages (ACL
2005) and also the Workshop on Machine Translation for Semitic Languages (MT
Summit 2003). In 2005, he co-founded the
Columbia Arabic Dialect Modeling (CADIM) group.
He is the vice-president of the Semitic Language Special Interest Group
in the Association of
Dr. Habash has published
over 20 articles in international conferences and journals and has given
numerous lectures and tutorials for academic and industrial audiences.
Mona’s website: http://www.cs.columbia.edu/~mdiab
Nizar's website: http://www.nizarhabash.com
CADIM website: http://www.ccls.columbia.edu/cadim