Segmentation of clitics has been shown to improve accuracy on a variety of Arabic NLP tasks. However, state-of-the-art Arabic word segmenters are either limited to formal Modern Standard Arabic, performing poorly on Arabic text featuring dialectal vocabulary and grammar, or rely on linguistic knowledge that is hand-tuned for each dialect. We extend an existing MSA segmenter with a simple domain adaptation technique and new features in order to segment informal and dialectal Arabic text. Experiments show that our system outperforms existing systems on newswire, broadcast news and Egyptian dialect, improving segmentation F1 score on a recently released Egyptian Arabic corpus to 95.1%, compared to 90.8% for another segmenter designed specifically for Egyptian Arabic.
Segmentation of words, clitics, and affixes is essential for a number of natural language processing (NLP) applications, including machine translation, parsing, and speech recognition [1, 14, 9]. Segmentation is a common practice in Arabic NLP due to the language’s morphological richness. Specifically, clitic separation has been shown to improve performance on Arabic parsing [5] and Arabic-English machine translation [8]. However, the variety of Arabic dialects presents challenges in Arabic NLP. Dialectal Arabic contains non-standard orthography, vocabulary, morphology, and syntax. Tools that depend on corpora or grammatical properties that only consider formal Modern Standard Arabic (MSA) do not perform well when confronted with these differences. The creation of annotated corpora in dialectal Arabic [11] has promoted the development of new systems that support dialectal Arabic, but these systems tend to be tailored to specific dialects and require separate efforts for Egyptian Arabic, Levantine Arabic, Maghrebi Arabic, etc.
We present a single clitic segmentation model that is accurate on both MSA and informal Arabic. The model is an extension of the character-level conditional random field (CRF) model of Green and DeNero (2012). Our work goes beyond theirs in three aspects. First, we handle two Arabic orthographic normalization rules that commonly require rewriting of tokens after segmentation. Second, we add new features that improve segmentation accuracy. Third, we show that dialectal data can be handled in the framework of domain adaptation. Specifically, we show that even simple feature space augmentation [3] yields significant improvements in task accuracy.
We compare our work to the original Green and DeNero model and two other Arabic segmentation systems: the MADA+TOKAN toolkit v. 3.1 [6] and its Egyptian dialect variant, MADA-ARZ v. 0.4 [7]. We demonstrate that our system achieves better performance across the board, beating all three systems on MSA newswire, informal broadcast news, and Egyptian dialect. Our segmenter achieves a 95.1% F1 segmentation score evaluated against a gold standard on Egyptian dialect data, compared to 90.8% for MADA-ARZ and 92.9% for Green and DeNero. In addition, our model decodes input an order of magnitude faster than either version of MADA. Like the Green and DeNero system, but unlike MADA and MADA-ARZ, our system does not rely on a morphological analyzer, and can be applied directly to any dialect for which segmented training data is available. The source code is available in the latest public release of the Stanford Word Segmenter (http://nlp.stanford.edu/software/segmenter.shtml).
A CRF model [10] defines a distribution $p(\mathbf{y} \mid \mathbf{x}; \theta)$, where $\mathbf{x}$ is the observed input sequence and $\mathbf{y}$ is the sequence of labels we seek to predict. Green and DeNero use a linear-chain model with $\mathbf{x}$ as the sequence of input characters, and $\mathbf{y}$ chosen according to the decision rule
$$\hat{\mathbf{y}} \;=\; \arg\max_{\mathbf{y}}\; p(\mathbf{y} \mid \mathbf{x}; \theta) \;=\; \arg\max_{\mathbf{y}}\; \theta^\top \Phi(\mathbf{x}, \mathbf{y}),$$
where $\Phi$ is the feature map defined in Section 2.1. Their model classifies each $y_i$ as one of I (continuation of a segment), O (whitespace outside any segment), B (beginning of a segment), or F (pre-grouped foreign characters).
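For reference, the underlying linear-chain distribution takes the standard log-linear form; decomposing the global feature map $\Phi$ into local feature functions over windows of four consecutive labels makes the third-order Markov structure described in Section 2.1 explicit. (The decomposition below is the standard presentation of such a model, not notation taken from Green and DeNero.)

$$p(\mathbf{y} \mid \mathbf{x}; \theta) \;=\; \frac{1}{Z(\mathbf{x})} \exp\Big( \theta^\top \Phi(\mathbf{x}, \mathbf{y}) \Big), \qquad \Phi(\mathbf{x}, \mathbf{y}) \;=\; \sum_{i} \phi\big(\mathbf{x},\, i,\, y_{i-3}, y_{i-2}, y_{i-1}, y_i\big).$$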
Our segmenter expands this label space in order to handle two Arabic-specific orthographic rules. In our model, each $y_i$ can take on one of the six values $\{\text{I}, \text{O}, \text{B}, \text{F}, \text{RewAl}, \text{RewTa}\}$; the two rewrite labels are described below, followed by a short decoding sketch:
RewAl indicates that the current character, which is always the Arabic letter <l>, starts a new segment and should additionally be transformed into the definite article <al—> when segmented. This type of transformation occurs after the prefix <li—> ‘‘to’’.
RewTa indicates that the current character, which is always the Arabic letter <t>, is a continuation but should be transformed into the letter <T> when segmented. Arabic orthography rules restrict the occurrence of <T> to the word-final position, writing it instead as <t> whenever it is followed by a suffix.
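To make the two rewrite labels concrete, here is a minimal decoding sketch showing how a predicted label sequence could be turned into segments while applying both rewrites. It is an illustration only, not the released implementation; in particular, the Buckwalter-style spellings used for the rewrites ("Al" for the definite article, "p" for ta marbuta) and the example token are assumptions.

```python
# Minimal sketch: convert characters plus predicted labels into segments,
# applying the RewAl and RewTa rewrites described above. Illustrative only;
# the concrete Buckwalter-style characters ("Al" for the definite article,
# "p" for ta marbuta) are assumptions, not the released implementation.

def labels_to_segments(chars, labels):
    """chars: list of characters; labels: one of I, O, B, F, RewAl, RewTa per character."""
    segments, current = [], []
    for ch, lab in zip(chars, labels):
        if lab == "O":              # whitespace outside any segment
            if current:
                segments.append("".join(current))
                current = []
        elif lab in ("B", "F"):     # segment start (F treated like B here for simplicity)
            if current:
                segments.append("".join(current))
            current = [ch]
        elif lab == "RewAl":        # new segment; rewrite bare l into the article Al
            if current:
                segments.append("".join(current))
            current = ["A", ch]     # assumed Buckwalter spelling of the definite article
        elif lab == "RewTa":        # continuation; rewrite t back into ta marbuta
            current.append("p")     # assumed Buckwalter ta marbuta
        else:                       # "I": continuation of the current segment
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments

# Hypothetical example: "lldrAsp" ("to the study"), with labels predicting a split
# after the preposition li- and a rewrite of the second l into the article Al:
# labels_to_segments(list("lldrAsp"), ["B", "RewAl", "I", "I", "I", "I", "I"])
# -> ["l", "AldrAsp"]
```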
The model of Green and DeNero is a third-order (i.e., 4-gram) Markov CRF, employing the following indicator features:
a five-character window around the current character: for each $i$ and each offset $j \in \{-2, -1, 0, 1, 2\}$, the triple $(j, x_{i+j}, y_i)$
$n$-grams consisting of the current character and up to three preceding characters: for each $i$ and each $n$ with $1 \le n \le 4$, the character-sequence/label-sequence pair $(x_{i-n+1} \cdots x_i,\; y_{i-n+1} \cdots y_i)$
whether the current character is punctuation
whether the current character is a digit
the Unicode block of the current character
the Unicode character class of the current character
In addition to these, we include two other types of features motivated by specific errors the original system made on Egyptian dialect development data:
Word length and position within a word: for each $i$, the pairs $(\ell, y_i)$, $(a, y_i)$, and $(b, y_i)$, where $\ell$, $a$, and $b$ are the total length of the word containing $x_i$, the number of characters after $x_i$ in the word, and the number of characters before $x_i$ in the word, respectively. Some incorrect segmentations produced by the original system could be ruled out with the knowledge of these statistics.
First and last two characters of the current word, separately influencing the first two labels and the last two labels: for each word consisting of characters $x_1 x_2 \cdots x_m$, the tuples $(x_1, x_2, x_{m-1}, x_m, y_1, y_2)$ and $(x_1, x_2, x_{m-1}, x_m, y_{m-1}, y_m)$. This set of features addresses a particular dialectal Arabic construction, the negation <mA>- + [verb] + <-^s>, which requires a matching prefix and suffix to be segmented simultaneously. This feature set also allows the model to take into account other interactions between the beginning and end of a word, particularly those involving the definite article <al—>.
A notable property of this feature set is that it remains highly dialect-agnostic, even though our additional features were chosen in response to errors made on text in Egyptian dialect. In particular, it does not depend on the existence of a dialect-specific lexicon or morphological analyzer. As a result, we expect this model to perform similarly well when applied to other Arabic dialects.
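To make the feature set concrete, the following sketch computes the character-level observation features described above for position $i$ of a word. The template names, padding convention, and the simplified handling of the Unicode block feature are assumptions for illustration; in the actual CRF these observation features are conjoined with label $n$-grams as described in Section 2.1.

```python
# Sketch of the observation features described in this section (window, character
# n-grams, punctuation/digit/character-class, word length/position, word edges).
# Template names and conventions are illustrative assumptions, not the exact
# templates of the released segmenter; the Unicode block feature is approximated
# here by the Unicode general category.
import unicodedata

def character_features(word, i):
    feats = []
    ch = word[i]
    # Five-character window around the current character ("#" pads past word edges).
    for j in range(-2, 3):
        k = i + j
        feats.append(f"win[{j}]={word[k] if 0 <= k < len(word) else '#'}")
    # n-grams of the current character and up to three preceding characters.
    for n in range(1, 5):
        if i - n + 1 >= 0:
            feats.append(f"ngram={word[i - n + 1:i + 1]}")
    # Punctuation, digit, and character-class features.
    cat = unicodedata.category(ch)
    feats.append(f"is_punct={cat.startswith('P')}")
    feats.append(f"is_digit={ch.isdigit()}")
    feats.append(f"char_class={cat}")
    # New features: word length and the character's position within the word.
    feats.append(f"word_len={len(word)}")
    feats.append(f"chars_after={len(word) - i - 1}")
    feats.append(f"chars_before={i}")
    # New features: first and last two characters of the word, active only near
    # the word edges, so that prefix and suffix decisions can interact.
    if len(word) >= 4 and (i < 2 or i >= len(word) - 2):
        side = "start" if i < 2 else "end"
        feats.append(f"edges={word[:2]}|{word[-2:]}|{side}")
    return feats
```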
Table 1: Development set results.

| Model | Training Data | ATB F1 (%) | BN F1 (%) | ARZ F1 (%) | ATB TEDEval (%) | BN TEDEval (%) | ARZ TEDEval (%) |
|---|---|---|---|---|---|---|---|
| GD | ATB | 97.60 | 94.87 | 79.92 | 98.22 | 96.81 | 87.30 |
| GD | BNARZ | 97.28 | 96.37 | 92.90 | 98.05 | 97.45 | 95.01 |
| Rew | ATB | 97.55 | 94.95 | 79.95 | 98.72 | 97.45 | 87.54 |
| Rew | BN | 97.58 | 96.60 | 82.94 | 98.75 | 98.18 | 89.43 |
| Rew | BNARZ | 97.30 | 96.09 | 92.64 | 98.59 | 97.91 | 95.03 |
| RewDA | BNARZ | 97.71 | 96.57 | 93.87 | 98.79 | 98.14 | 95.86 |
| RewDAFeat | BNARZ | 98.36 | 97.35 | 95.06 | 99.14 | 98.57 | 96.67 |
In this work, we train our model to segment Arabic text drawn from three domains: newswire, which consists of formal text in MSA; broadcast news, which contains scripted, formal MSA as well as extemporaneous dialogue in a mix of MSA and dialect; and discussion forum posts written primarily in Egyptian dialect.
The approach to domain adaptation we use is that of feature space augmentation [3]. Each indicator feature from the model described in Section 2.1 is replaced by $K+1$ features in the augmented model, where $K$ is the number of domains from which the data is drawn (here, $K = 3$). These features consist of the original feature and $K$ ''domain-specific'' features, one for each of the $K$ domains, each of which is active only when both the original feature is present and the current text comes from its assigned domain.
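As a concrete illustration of the augmentation, here is a sparse-feature rendering of the scheme just described; the domain names and key formats are hypothetical.

```python
# Sketch of feature-space augmentation [3]: each original indicator feature is
# kept as a shared copy and duplicated as a domain-specific copy. In the full
# (dense) view every feature has K+1 copies; in this sparse rendering only the
# shared copy and the copy for the current domain are emitted, since the other
# domain-specific copies are inactive. Domain names and key formats are
# illustrative assumptions.

DOMAINS = ("newswire", "broadcast", "forum")   # K = 3 domains used in this work

def augment(features, domain):
    """features: iterable of indicator feature strings; domain: source of the text."""
    assert domain in DOMAINS
    augmented = []
    for f in features:
        augmented.append(f"shared:{f}")    # shared copy, active in every domain
        augmented.append(f"{domain}:{f}")  # copy active only when the text is from `domain`
    return augmented

# e.g. augment(["win[0]=l"], "forum") -> ["shared:win[0]=l", "forum:win[0]=l"]
```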
Table 2: Test set results.

| Model | ATB F1 (%) | BN F1 (%) | ARZ F1 (%) | ATB TEDEval (%) | BN TEDEval (%) | ARZ TEDEval (%) |
|---|---|---|---|---|---|---|
| MADA | 97.36 | 94.54 | 78.35 | 97.62 | 96.96 | 86.78 |
| MADA-ARZ | 92.83 | 91.89 | 90.76 | 91.26 | 91.10 | 90.39 |
| GDRewDAFeat | 98.30 | 97.17 | 95.13 | 99.10 | 98.42 | 96.75 |
We train and evaluate on three corpora: parts 1–3 of the newswire Arabic Treebank (ATB; LDC2010T13, LDC2011T09, LDC2010T08), the Broadcast News Arabic Treebank (BN; LDC2012T07), and parts 1–8 of the BOLT Phase 1 Egyptian Arabic Treebank (ARZ; LDC2012E{93,98,89,99,107,125}, LDC2013E{12,21}). These correspond respectively to the domains in Section 2.2. We target the segmentation scheme used by these corpora (leaving morphological affixes and the definite article attached). For the ATB, we use the same split as Chiang et al. (2006). For each of the other two corpora, we split the data into 80% training, 10% development, and 10% test in chronological order by document. (These splits are publicly available at http://nlp.stanford.edu/software/parser-arabic-data-splits.shtml.) We train the Green and DeNero model and our improvements using L-BFGS with $\ell_2$ regularization.
We use two evaluation metrics in our experiments. The first is an F1 precision-recall measure, ignoring orthographic rewrites. F1 scores provide a more informative assessment of performance than word-level or character-level accuracy scores, as over 80% of tokens in the development sets consist of only one segment, with an average of one segmentation every 4.7 tokens (or one every 20.4 characters).
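For clarity, here is one plausible way to compute such a segmentation F1 score. The choice to score intra-token split points, with tokens given in their original surface spelling (rewrites undone) and segments joined by "+", is an assumption made for illustration rather than necessarily the exact scoring used in our experiments.

```python
# Sketch of a precision/recall/F1 computation over clitic segmentation decisions.
# Gold and predicted tokens are given in their unrewritten surface spelling with
# segments joined by "+"; F1 is computed over intra-token split points, so
# orthographic rewrites are ignored. This formulation is an illustrative assumption.

def split_points(segmented_token):
    """'w+Al+ktAb' -> {1, 3}: character offsets of the segment boundaries."""
    parts = segmented_token.split("+")
    offsets, pos = set(), 0
    for part in parts[:-1]:
        pos += len(part)
        offsets.add(pos)
    return offsets

def segmentation_f1(gold_tokens, pred_tokens):
    tp = fp = fn = 0
    for gold, pred in zip(gold_tokens, pred_tokens):
        g, p = split_points(gold), split_points(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
```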
The second metric we use is the TEDEval metric [13]. TEDEval was developed to evaluate joint segmentation and parsing in Hebrew, which requires a greater variety of orthographic rewrites than those possible in Arabic. (In order to evaluate segmentation in isolation, we convert each segmented sentence from both the model output and the gold standard to a flat tree with all segments descending directly from the root.) Its edit distance-based scoring algorithm is robust enough to handle the rewrites produced by both MADA and our segmenter.
Table 3: Running times.

| Model | ATB | BN | ARZ |
|---|---|---|---|
| MADA | 705.6 ± 5.1 | 472.0 ± 0.8 | 767.8 ± 1.9 |
| MADA-ARZ | 784.7 ± 1.6 | 492.1 ± 4.2 | 779.0 ± 2.7 |
| GDRewDAFeat | 90.0 ± 1.0 | 59.5 ± 0.3 | 72.7 ± 0.2 |
Table 1 contains results on the development set for the model of Green and DeNero and our improvements. Using domain adaptation alone helps performance on two of the three datasets (with a statistically insignificant decrease on broadcast news), and our additional features further improve segmentation on all three datasets. Table 2 shows the segmentation scores our model achieves when evaluated on the three test sets, as well as the results for MADA and MADA-ARZ. Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics. In addition, our segmenter is faster than MADA. Table 3 compares the running times of the three systems. Our segmenter achieves a speedup of at least 7x over MADA and MADA-ARZ on all datasets.
We sampled 100 errors randomly from all errors made by our final model (trained on all three datasets with domain adaptation and additional features) on the ARZ development set; see Table 4. These errors fall into three general categories:
typographical errors and annotation inconsistencies in the gold data;
errors that can be fixed with a fuller analysis of just the problematic token, and therefore represent a deficiency in the feature set; and
errors that would require additional context or sophisticated semantic awareness to fix.
Of the 100 errors we sampled, 33 are due to typographical errors or inconsistencies in the gold data. We classify 7 as typos and 26 as annotation inconsistencies, although the distinction between the two is murky: typos are intentionally preserved in the treebank data, but segmentation of typos varies depending on how well they can be reconciled with standard Arabic orthography. Four of the seven typos are the result of a missing space, such as:
<yas|har-bi-al-layAlI> ‘‘staysawakeatnight’’ (<yashar> + <bi-> + <al-layAlI>)
<‘amilatnA-’an> ‘‘madeus’’ (<‘amilat> + <—nA> + <’an>)
The first example is segmented in the Egyptian treebank but is left unsegmented by our system; the second is left as a single token in the treebank but is split into the above three segments by our system.
Of the annotation inconsistencies that do not involve typographical errors, a handful are segmentation mistakes; however, in the majority of these cases, the annotator chose not to segment a word for justifiable but arbitrary reasons. In particular, a few colloquial ‘‘filler’’ expressions are sometimes not segmented, despite being compound Arabic words that are segmented elsewhere in the data. These include <rabbinA> ‘‘[our] Lord’’ (oath); <‘indamA> ‘‘when’’/‘‘while’’; and <_hallIk> ‘‘keep’’/‘‘stay’’. Also, tokens containing foreign words are sometimes not segmented, despite carrying Arabic affixes. An example of this is <wamistur> ‘‘and Mister [English]’’, which could be segmented as <wa>- + <mistur>.
Table 4: Categories of 100 sampled errors on the ARZ development set.

| Category | # of errors |
|---|---|
| Abnormal gold data | 33 |
|   Typographical error | 7 |
|   Annotation inconsistency | 26 |
| Need full-token features | 36 |
| Need more context | 31 |
|   <wlA> | 5 |
|   <—nA>: verb/pron | 7 |
|   <—y>: nisba/pron | 4 |
|   other | 15 |
In 36 of the 100 sampled errors, we conjecture that the presence of the error indicates a shortcoming of the feature set, resulting in segmentations that make sense locally but are not plausible given the full token. Two examples of these are:
<wafi.tarIqaT> ''and in the way'' segmented as <wa>- + <fi.tarIqaT> (correct analysis is <wa>- + <fi—> + <.tarIqaT>). <f.tr> ''break''/''breakfast'' is a common Arabic root, but the presence of <q> should indicate that <f.tr> is not the root in this case.
<walAyuhimmhum> ‘‘and it’s not important to them’’ segmented as <wa>- + <li—> + <—ayuhimm> + <—hum> (correct analysis is <wa>- + <lA> + <yuhimm> + <—hum>). The 4-character window <lAyh> occurs commonly with a segment boundary after the <l>, but the segment <—ayuhimm> is not a well-formed Arabic word.
In the remaining 31 of 100 errors, external context is needed. In many of these, it is not clear how to address the error without sophisticated semantic reasoning about the surrounding sentence.
One token accounts for five of these errors: <wlA>, which in Egyptian dialect can be analyzed as <wa>- + <lA> ‘‘and [do/does] not’’ or as <wallA> ‘‘or’’. In a few cases, either is syntactically correct, and the meaning must be inferred from context.
Two other ambiguities are a frequent cause of error and seem to require sophisticated disambiguation. The first is <—nA>, which is both a first person plural object pronoun and a first person plural past tense ending. The former is segmented, while the latter is not. An example of this is the pair <‘ilmunA> ‘‘our knowledge’’ (<‘ilmu> + <—nA>) versus <‘alimnA> ‘‘we knew’’ (one segment). The other is <—y>, which is both a first person singular possessive pronoun and the nisba adjective ending (which turns a noun into an adjective meaning ‘‘of or related to’’); only the former is segmented. One example of this distinction that appeared in the development set is the pair <maw.dU‘I> ‘‘my topic’’ (<maw.dU‘> + <—y>) versus <maw.dU‘Iy> ‘‘topical’’, ‘‘objective’’.
In this paper we demonstrate substantial gains on Arabic clitic segmentation for both formal and dialectal text using a single model with dialect-independent features and a simple domain adaptation strategy. We present a new Arabic segmenter which performs better than tools employing sophisticated linguistic analysis, while also decoding substantially faster (a speedup of at least 7x over MADA and MADA-ARZ). We evaluate our segmenter on broadcast news and Egyptian Arabic due to the current availability of annotated data in these domains. However, as data for other Arabic dialects and genres becomes available, we expect that the model's simplicity and the domain adaptation method we use will allow the system to be applied to these dialects with minimal effort and without a loss of performance in the original domains.
We thank the three anonymous reviewers, and Reut Tsarfaty for valuable correspondence regarding TEDEval. The second author is supported by a National Science Foundation Graduate Research Fellowship. This work was supported by the Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program through IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA or the US government.