Automatic Morphological Enrichment of a Morphologically Underspecified Treebank
Sarah Alkuhlani, Nizar Habash and Ryan Roth
In this paper, we study the problem of automatic enrichment of a
morphologically underspecified treebank for Arabic, a morphologically rich
language. We show that we can map from a tagset of six tags to one with 485
tags at an accuracy of 94%-95%. We can also identify the unspecified
lemmas in the treebank with an accuracy of over 97%. Furthermore, we demonstrate
that using our automatic annotations improves the performance of a
state-of-the-art Arabic morphological tagger. Our approach combines a variety
of techniques, ranging from corpus-based statistical models to linguistic rules
that target specific phenomena. These results suggest that the cost of treebanking
can be reduced by designing underspecified treebanks that can be subsequently
enriched automatically.