Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
{msalameh,gkondrak}@ualberta.ca
National Research Council Canada, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
Colin.Cherry@nrc-cnrc.gc.ca
Morphological segmentation is an effective sparsity reduction strategy for statistical machine translation (SMT) involving morphologically complex languages. When translating into a segmented language, an extra step is required to desegment the output; previous studies have desegmented the 1-best output from the decoder. In this paper, we expand our translation options by desegmenting n-best lists or lattices. Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM). We investigate this technique in the context of English-to-Arabic and English-to-Finnish translation, showing significant improvements in translation quality over desegmentation of 1-best decoder outputs.
Morphological segmentation is considered to be indispensable when translating between English and morphologically complex languages such as Arabic. Morphological complexity leads to much higher type-to-token ratios than in English, which can create sparsity problems during translation model estimation. Morphological segmentation addresses this issue by splitting surface forms into meaningful morphemes, while also performing orthographic transformations to further reduce sparsity. For example, the Arabic noun lldwl “to the countries” is segmented as l+ “to” Aldwl “the countries”. When translating from Arabic, this segmentation process is performed as input preprocessing and is otherwise transparent to the translation system. However, when translating into Arabic, the decoder produces segmented output, which must be desegmented to produce readable text. For example, l+ Aldwl must be converted to lldwl.
Desegmentation is typically performed as a post-processing step that is independent from the decoding process. While this division of labor is useful, the pipeline approach may prevent the desegmenter from recovering from errors made by the decoder. Despite the efforts of the decoder’s various component models, the system may produce mismatching segments, such as s+ hzymp, which pairs the future particle s+ “will” with a noun hzymp “defeat”, instead of a verb. In this scenario, there is no right desegmentation; the post-processor has been dealt a losing hand.
In this work, we show that it is possible to maintain the sparsity-reducing benefit of segmentation while translating directly into unsegmented text. We desegment a large set of possible decoder outputs by processing n-best lists or lattices, which allows us to consider both the segmented and desegmented output before locking in the decoder’s decision. We demonstrate that significant improvements in translation quality can be achieved by training a linear model to re-rank this transformed translation space.
Translating into morphologically complex languages is a challenging and interesting task that has received much recent attention. Most techniques approach the problem by transforming the target language in some manner before training the translation model. They differ in what transformations are performed and at what stage they are reversed. The transformation might take the form of a morphological analysis or a morphological segmentation.
Many languages have access to morphological analyzers, which annotate surface forms with their lemmas and morphological features. Bojar (2007) incorporates such analyses into a factored model, to either include a language model over target morphological tags, or model the generation of morphological features. Other approaches train an SMT system to predict lemmas instead of surface forms, and then inflect the SMT output as a post-processing step [21, 7, 11, 10]. Alternatively, one can reparameterize existing phrase tables as exponential models, so that translation probabilities account for source context and morphological features [16, 30]. Of these approaches, ours is most similar to the translate-then-inflect approach, except we translate and then desegment. In particular, Toutanova et al. (2008) inflect and re-rank n-best lists in a similar manner to how we desegment and re-rank n-best lists or lattices.
Instead of producing an abstract feature layer, morphological segmentation transforms the target sentence by segmenting relevant morphemes, which are then handled as regular tokens during alignment and translation. This is done to reduce sparsity and to improve correspondence with the source language (usually English). Such a segmentation can be produced as a byproduct of analysis [24, 2, 9], or may be produced using an unsupervised morphological segmenter such as Morfessor [20, 7]. Work on target language morphological segmentation for SMT can be divided into three subproblems: segmentation, desegmentation and integration. Our work is concerned primarily with the integration problem, but we will discuss each subproblem in turn.
The usefulness of a target segmentation depends on its correspondence to the source language. If a morphological feature does not manifest itself as a separate token in the source, then it may be best to leave its corresponding segment attached to the stem. A number of studies have looked into what granularity of segmentation is best suited for a particular language pair [24, 2, 7, 9]. Since our focus here is on integrating segmentation into the decoding process, we simply adopt the segmentation strategies recommended by previous work: the Penn Arabic Treebank scheme for English-Arabic [9], and an unsupervised scheme for English-Finnish [7].
Desegmentation is the process of converting segmented words into their original surface form. For many segmentations, especially unsupervised ones, this amounts to simple concatenation. However, more complex segmentations, such as the Arabic tokenization provided by MADA [12], require further orthographic adjustments to reverse normalizations performed during segmentation. Badr et al. (2008) present two Arabic desegmentation schemes: table-based and rule-based. El Kholy and Habash (2012) provide an extensive study on the influence of segmentation and desegmentation on English-to-Arabic SMT. They introduce an additional desegmentation technique that augments the table-based approach with an unsegmented language model. Salameh et al. (2013) replace rule-based desegmentation with a discriminatively-trained character transducer. In this work, we adopt the Table+Rules approach of El Kholy and Habash (2012) for English-Arabic, while concatenation is sufficient for English-Finnish.
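To make the contrast concrete, here is a minimal Python sketch of a table-based desegmenter with a rule-based fallback, in the spirit of the Table+Rules approach; the table entry and the single orthographic rule are illustrative assumptions rather than the actual MADA mappings.

```python
def desegment(morphemes, table):
    """Join a sequence of segmented morphemes into one surface word (sketch)."""
    key = " ".join(morphemes)
    if key in table:                       # table lookup on mappings seen in training
        return table[key]
    # Rule-based fallback: strip the "+" boundary markers and concatenate,
    # then apply a simple (assumed) orthographic adjustment for l+ Al...
    word = "".join(m.strip("+") for m in morphemes)
    if word.startswith("lAl"):
        word = "ll" + word[3:]
    return word

# Hypothetical table entry harvested from segmented/unsegmented training pairs.
table = {"b+ syArp +h": "bsyArth"}

print(desegment(["l+", "Aldwl"], table))        # rule fallback -> lldwl
print(desegment(["b+", "syArp", "+h"], table))  # table lookup -> bsyArth
```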
Work on integration attempts to improve SMT performance for morphologically complex target languages by going beyond simple pre- and post-processing. Oflazer and Durgar El-Kahlout (2007) desegment 1000-best lists for English-to-Turkish translation to enable scoring with an unsegmented language model. Unlike our work, they replace the segmented language model with the unsegmented one, allowing them to tune the linear model parameters by hand. We use both segmented and unsegmented language models, and tune automatically to optimize BLEU.
Like us, Luong et al. (2010) tune on unsegmented references, and translate with both segmented and unsegmented language models for English-to-Finnish translation. (Tuning on unsegmented references does not require substantial modifications to the standard SMT pipeline. For example, Badr et al. (2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU.) However, they adopt a scheme of word-boundary-aware morpheme-level phrase extraction, meaning that target phrases include only complete words, though those words are segmented into morphemes. This enables full decoder integration, where we do n-best and lattice re-ranking. But it also comes at a substantial cost: when target phrases include only complete words, the system can only generate word forms that were seen during training. In this setting, the sparsity reduction from segmentation helps word alignment and target language modeling, but it does not result in a more expressive translation model. Furthermore, it becomes substantially more difficult to have non-adjacent source tokens contribute morphemes to a single target word. For example, when translating “with his blue car” into the Arabic bsyArth AlzrqA’, the target word bsyArth is composed of three tokens: b+ “with”, syArp “car” and +h “his”. With word-boundary-aware phrase extraction, a phrase pair containing all of “with his blue car” must have been seen in the parallel data to translate the phrase correctly at test time. With lattice desegmentation, we need only to have seen AlzrqA’ “blue” and the three morphological pieces of bsyArth for the decoder and desegmenter to assemble the phrase.
Our goal in this work is to benefit from the sparsity-reducing properties of morphological segmentation while simultaneously allowing the system to reason about the final surface forms of the target language. We approach this problem by augmenting an SMT system built over target segments with features that reflect the desegmented target words. In this section, we describe our various strategies for desegmenting the SMT system’s output space, along with the features that we add to take advantage of this desegmented view.
The two obvious baseline approaches each decode using one view of the target language. The unsegmented approach translates without segmenting the target. This trivially allows for an unsegmented language model and never makes desegmentation errors. However, it suffers from data sparsity and poor token-to-token correspondence with the source language.
The one-best desegmentation approach segments the target language at training time and then desegments the one-best output in post-processing. This resolves the sparsity issue, but does not allow the decoder to take into account features of the desegmented target. To the best of our knowledge, we are the first group to go beyond one-best desegmentation for English-to-Arabic translation. In English-to-Finnish, although alternative integration strategies have seen some success [20], the current state of the art performs one-best desegmentation [7].
The one-best approach can be extended easily by desegmenting n-best lists of segmented decoder output. Doing so enables the inclusion of an unsegmented target language model, and with a small amount of bookkeeping, it also allows the inclusion of features related to the operations performed during desegmentation (see Section 3.4). With new features reflecting the desegmented output, we can re-tune our enhanced linear model on a development set. Following previous work, we will desegment 1000-best lists [24].
Once n-best lists have been desegmented, we can tune on unsegmented references as a side benefit. This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). In particular, it could address issues with translation length mismatch. Previous work that has tuned on unsegmented references has reported mixed results [2, 20].
An n-best list reflects a tiny portion of a decoder’s search space, typically fixed at 1000 hypotheses. Lattices (or forests, for hierarchical and syntactic decoders) can represent an exponential number of hypotheses in a compact structure. In this section, we discuss how a lattice from a multi-stack phrase-based decoder such as Moses [17] can be desegmented to enable word-level features.
A phrase-based decoder produces its output from left to right, with each operation appending the translation of a source phrase to a growing target hypothesis. Translation continues until each source word has been covered exactly once [19].
The search graph of a phrase-based decoder can be interpreted as a lattice, which can be interpreted as a finite state acceptor over target strings. In its most natural form, such an acceptor emits target phrases on each edge, but it can easily be transformed into a form with one edge per token, as shown in Figure 1a. This is sometimes referred to as a word graph [32], although in our case the segmented phrase table also produces tokens that correspond to morphemes.
Our goal is to desegment the decoder’s output lattice, and in doing so, gain access to a compact, desegmented view of a large portion of the translation search space. This can be accomplished by composing the lattice with a desegmenting transducer that consumes morphemes and outputs desegmented words. This transducer must be able to consume every word in our lattice’s output vocabulary. We define a word using the following regular expression:
word → [prefix]* [stem] [suffix]* | [prefix]+ [suffix]+     (1)
where [prefix], [stem] and [suffix] are non-overlapping sets of morphemes, whose members are easily determined using the segmenter’s segment boundary markers. (Throughout this paper, we use “+” to mark morphemes as prefixes or suffixes, as in w+ or +h. In Equation 1 only, we overload “+” as the Kleene cross, denoting one or more repetitions.) The second disjunct of Equation 1 covers words that have no clear stem, such as the Arabic lh “for him”, segmented as l+ “for” +h “him”. Equation 1 may need to be modified for other languages or segmentation schemes, but our techniques generalize to any definition that can be written as a regular expression.
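As an illustration, the following Python sketch encodes Equation 1 over marker-annotated tokens; the class labels P, T and S and the helper names are our own.

```python
import re

def morph_class(token):
    """Classify a token by its segmentation boundary markers."""
    if token.endswith("+"):
        return "P"   # prefix, e.g. l+, b+, w+
    if token.startswith("+"):
        return "S"   # suffix, e.g. +h
    return "T"       # stem (also complete words, punctuation, numbers)

# Equation 1: word -> prefix* stem suffix*  |  prefix+ suffix+
WORD = re.compile(r"^(P*TS*|P+S+)$")

def is_word(tokens):
    return bool(WORD.match("".join(morph_class(t) for t in tokens)))

print(is_word(["l+", "Aldwl"]))   # True: prefix stem
print(is_word(["l+", "+h"]))      # True: prefix suffix, no stem
print(is_word(["+h", "l+"]))      # False
```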
A desegmenting transducer can be constructed by first encoding our desegmenter as a table that maps morpheme sequences to words. Regardless of whether the original desegmenter was based on concatenation, rules or table-lookup, it can be encoded as a lattice-specific table by applying it to an enumeration of all words found in the lattice. We can then transform that table into a finite state transducer with one path per table entry. Finally, we take the closure of this transducer, so that the resulting machine can transduce any sequence of words. The desegmenting transducer for our running example is shown in Figure 1b. Note that tokens requiring no desegmentation simply emit themselves. The lattice (Figure 1a) can then be desegmented by composing it with the transducer (1b), producing a desegmented lattice (1c). This is a natural place to introduce features that describe the desegmentation process, such as scores provided by a desegmentation table, which can be incorporated into the desegmenting transducer’s edge weights.
We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model. In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n-1) tokens. If this property does not hold, then nodes must be split until it does. (Alternatively, the LM composition can be done dynamically, effectively decoding the lattice with a beam or cube-pruned search [13].) This property is maintained by the decoder’s recombination rules for the segmented LM, but it is not guaranteed for the desegmented LM. Indeed, the expanded word-level context is one of the main benefits of incorporating a word-level LM. Fortunately, LM annotation, as well as any necessary lattice modifications, can be performed simultaneously by composing the desegmented lattice with a finite state acceptor encoding the LM [26].
In summary, we are given a segmented lattice, which encodes the decoder’s translation space as an acceptor over morphemes. We compose this acceptor with a desegmenting transducer, and then with an unsegmented LM acceptor, producing a fully annotated, desegmented lattice. Instead of using a tool kit such as OpenFst [1], we implement both the desegmenting transducer and the LM acceptor programmatically. This eliminates the need to construct intermediate machines, such as the lattice-specific desegmenter in Figure 1b, and facilitates working with edges annotated with feature vectors as opposed to single weights.
Lattice desegmentation is a non-local lattice transformation. That is, the morphemes forming a word might span several edges, making desegmentation non-trivial. Luong et al. (2010) address this problem by forcing the decoder’s phrase table to respect word boundaries, guaranteeing that each desegmentable token sequence is local to an edge. Inspired by the use of non-local features in forest decoding [14], we present an algorithm to find chains of edges that correspond to desegmentable token sequences, allowing lattice desegmentation with no phrase-table restrictions. This algorithm can be seen as implicitly constructing a customized desegmenting transducer and composing it with the input lattice on the fly.
Before describing the algorithm, we define some notation. An input morpheme lattice is a triple ⟨n0, N, E⟩, where N is a set of nodes, E is a set of edges, and n0 ∈ N is the start node that begins each path through the lattice. Each edge e ∈ E is a 4-tuple ⟨from, to, tok, w⟩, where from and to are head and tail nodes, tok is a single token accepted by this edge, and w is the (potentially vector-valued) edge weight. Tokens are drawn from one of three non-overlapping morpho-syntactic sets, tok ∈ Prefix ∪ Stem ∪ Suffix, where tokens that do not require desegmentation, such as complete words, punctuation and numbers, are considered to be in Stem. It is also useful to consider the set out(n) of all outgoing edges for a node n.
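A possible in-memory rendering of this notation, with names of our own choosing (Edge, frm, to, out):

```python
from collections import namedtuple, defaultdict

# One lattice edge: head node, tail node, emitted token, (possibly vector-valued) weight.
Edge = namedtuple("Edge", ["frm", "to", "tok", "weight"])

class Lattice:
    """A morpheme lattice: a start node plus a set of edges, indexed for out(n)."""
    def __init__(self, start, edges):
        self.start = start
        self.edges = list(edges)
        self._out = defaultdict(list)
        for e in self.edges:
            self._out[e.frm].append(e)

    def out(self, node):
        """The set of all outgoing edges for a node."""
        return self._out[node]

# A toy fragment: 0 --l+--> 1 --Aldwl--> 2
lat = Lattice(0, [Edge(0, 1, "l+", 0.0), Edge(1, 2, "Aldwl", 0.0)])
print(lat.out(0))   # [Edge(frm=0, to=1, tok='l+', weight=0.0)]
```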
With this notation in place, we can define a chain to be a sequence of edges [e1 … el] such that to(ei) = from(ei+1) for 1 ≤ i < l. We denote singleton chains with [e], and when unambiguous, we abbreviate longer chains with their start and end nodes, written b:n. A chain is valid if it emits the beginning of a word as defined by the regular expression in Equation 1. A valid chain is complete if its edges form an entire word, and if it is part of a path through the lattice that consists only of words. In Figure 1a, the complete chains include [AlTfl] and [b+ lEbp]. The path restriction on complete chains forces words to be bounded by other words in order to be complete. (Sentence-initial suffix morphemes and sentence-final prefix morphemes represent a special case that we omit for the sake of brevity; lacking stems, they are left segmented.) For example, if we removed the edge [AlTfl] from Figure 1a, then [b+ lEbp] would cease to be a complete chain, but it would still be a valid chain. Note that in the finite-state analogy, the path restriction is implicit in the composition operation.
Algorithm 3.3 desegments a lattice by finding all complete chains and replacing each one with a single edge. It maintains a work list of nodes that lie on the boundary between words, and for each node on this list, it launches a depth-first search to find all complete chains extending from it. The search recognizes a valid chain b:n to be complete by finding an edge e such that appending e to b:n forms a chain, but not a valid one. By inspection of Equation 1, this can only happen when a prefix or stem follows a stem or suffix, which always marks a word boundary. The chains found by this search are desegmented and then added to the output lattice as edges. The nodes at the end points of these chains are added to the work list, as they lie at word boundaries by definition. Note that although this algorithm creates completely new edges, the resulting node set N′ will be a subset of the input node set N. The complement N \ N′ will consist of nodes that are word-internal in all paths through the input lattice, such as node 1 in Figure 1a.
Algorithm 3.3: Lattice desegmentation

  Input: a segmented lattice ⟨n0, N, E⟩
  // Initialize output lattice and work list
  N′ ← {}; E′ ← {}; WorkList ← [n0]
  while WorkList is not empty do
      n ← pop(WorkList)
      // Work on each node only once
      if n ∈ N′ then continue
      N′ ← N′ ∪ {n}
      // Initialize the chain stack
      ChainStack ← {}
      for each edge e ∈ out(n) do
          if [e] is valid then push [e] onto ChainStack
      // Depth-first search for complete chains
      while ChainStack is not empty do
          pop a chain b:n′ from ChainStack
          // Attempt to extend the chain
          for each edge e ∈ out(n′) do
              if b:n′ + [e] is valid then
                  push b:n′ + [e] onto ChainStack
              else
                  mark b:n′ as complete
          // Desegment complete chains
          if b:n′ is complete then
              E′ ← E′ ∪ {deseg(b:n′)}
              add n′ to WorkList
  return the desegmented lattice ⟨n0, N′, E′⟩
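The following runnable Python sketch implements the same chain-finding search; it reuses the Edge structure from the earlier sketch, substitutes plain concatenation for deseg(), assumes the lattice-final node marks a word boundary, and is meant to illustrate the control flow rather than reproduce the actual implementation.

```python
import re
from collections import namedtuple

Edge = namedtuple("Edge", ["frm", "to", "tok", "weight"])   # as in the earlier sketch

def out(edges, node):
    return [e for e in edges if e.frm == node]

def cls(tok):
    """Morpheme class read off the boundary markers: P(refix), S(uffix) or sTem."""
    return "P" if tok.endswith("+") else "S" if tok.startswith("+") else "T"

VALID = re.compile(r"^(P*(TS*)?|P+S+)$")     # the beginning of a word allowed by Equation 1
COMPLETE = re.compile(r"^(P*TS*|P+S+)$")     # an entire word (Equation 1)

def signature(chain):
    return "".join(cls(e.tok) for e in chain)

def desegment_lattice(start, edges):
    """Replace every complete chain of morpheme edges with a single word edge."""
    new_edges, done, work = [], set(), [start]
    while work:
        n = work.pop()
        if n in done:                          # work on each node only once
            continue
        done.add(n)
        stack = [[e] for e in out(edges, n) if VALID.match(signature([e]))]
        while stack:                           # depth-first search for complete chains
            chain = stack.pop()
            end = chain[-1].to
            boundary = not out(edges, end)     # assumption: the final node ends a word
            for e in out(edges, end):
                if VALID.match(signature(chain + [e])):
                    stack.append(chain + [e])  # attempt to extend the chain
                else:
                    boundary = True            # the next token starts a new word
            if boundary and COMPLETE.match(signature(chain)):
                # Concatenation stands in for the real desegmenter; weights are summed.
                word = "".join(e.tok.strip("+") for e in chain)
                new_edges.append(Edge(n, end, word, sum(e.weight for e in chain)))
                work.append(end)
    return start, new_edges

# Toy lattice: 0 --l+--> 1 --Aldwl--> 2 --.--> 3
toy = [Edge(0, 1, "l+", 0.0), Edge(1, 2, "Aldwl", 0.0), Edge(2, 3, ".", 0.0)]
print(desegment_lattice(0, toy)[1])
# [Edge(frm=0, to=2, tok='lAldwl', weight=0.0), Edge(frm=2, to=3, tok='.', weight=0.0)]
```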
Programmatic composition of a lattice with an n-gram LM acceptor is a well-understood problem. We use a dynamic program to enumerate all (n-1)-word contexts leading into a node, and then split the node into multiple copies, one for each context. With each node corresponding to a single LM context, annotation of outgoing edges with n-gram LM scores is straightforward.
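A minimal sketch of this node-splitting step for the bigram case, where the relevant context is just the previous word; the stand-in LM scorer and the lattice encoding are illustrative assumptions.

```python
from collections import defaultdict, namedtuple

Edge = namedtuple("Edge", ["frm", "to", "tok", "weight"])

def split_for_bigram_lm(start, edges, lm_logprob):
    """Split each node by its one-word history so bigram LM scores can be added to edges."""
    out = defaultdict(list)
    for e in edges:
        out[e.frm].append(e)
    new_edges, agenda, seen = [], [(start, "<s>")], set()
    while agenda:
        node, hist = agenda.pop()
        if (node, hist) in seen:
            continue
        seen.add((node, hist))
        for e in out[node]:
            lm = lm_logprob(hist, e.tok)   # score the bigram (hist, tok)
            new_edges.append(Edge((node, hist), (e.to, e.tok), e.tok, e.weight + lm))
            agenda.append((e.to, e.tok))
    return (start, "<s>"), new_edges

# Two competing first words force node 1 to be split into two context-specific copies.
toy = [Edge(0, 1, "lldwl", 0.0), Edge(0, 1, "ldwl", 0.0), Edge(1, 2, ".", 0.0)]
for e in split_for_bigram_lm(0, toy, lambda hist, word: -1.0)[1]:
    print(e)
```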
Our re-ranker has access to all of the features used by the decoder, in addition to a number of features enabled by desegmentation.
We use a table-based desegmentation method for Arabic, which is based on segmenting an Arabic training corpus and memorizing the observed transformations to reverse them later. Finnish does not require a table, as all words can be desegmented with simple concatenation. The Arabic table consists of X → Y entries, where X is a target morpheme sequence and Y is a desegmented surface form. Several entries may share the same X, resulting in multiple desegmentation options. For the sake of symmetry with the unambiguous Finnish case, we augment Arabic n-best lists or lattices with only the most frequent desegmentation Y. (Allowing the re-ranker to choose between multiple Ys is a natural avenue for future work.) We provide the desegmentation score log p(Y | X), estimated from frequencies in the segmented training corpus, as a feature that indicates the entry’s ambiguity in the training data. (We also experimented with an additional desegmentation-based feature, but observed no improvement in translation quality.) When an X is missing from the table, we fall back on a set of desegmentation rules [9] and this feature is set to 0. This feature is always 0 for English-Finnish.
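A small sketch of how such a table and its score feature might be built; the counts are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (morpheme sequence, surface form) pairs read off a segmented corpus.
pairs = [("l+ Aldwl", "lldwl")] * 9 + [("l+ Aldwl", "lAldwl")]

table = defaultdict(Counter)
for seg, surface in pairs:
    table[seg][surface] += 1

def best_desegmentation(seg):
    """Return the most frequent surface form for seg and its log-probability score."""
    options = table[seg]
    surface, freq = options.most_common(1)[0]
    return surface, math.log(freq / sum(options.values()))

print(best_desegmentation("l+ Aldwl"))   # ('lldwl', log 0.9)
```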
One advantage of our approach is that it allows discontiguous source words to translate into a single target word. In order to maintain some control over this powerful capability, we create three binary features that indicate the contiguity of a desegmentation. The first feature indicates that the desegmented morphemes were translated from contiguous source words. The second indicates that the source words contained a single discontiguity, as in a word-by-word translation of the “with his blue car” example from Section 2.2. The third indicates two or more discontiguities.
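A sketch of how these indicators could be computed from the source positions covered by the phrases that produced a desegmented word; the position bookkeeping is assumed.

```python
def contiguity_features(source_positions):
    """Binary indicators over the source word positions behind one desegmented word."""
    pos = sorted(set(source_positions))
    gaps = sum(1 for a, b in zip(pos, pos[1:]) if b - a > 1)
    return {"contiguous": gaps == 0, "one_gap": gaps == 1, "many_gaps": gaps >= 2}

# "with his blue car": b+ <- "with"(0), +h <- "his"(1), syArp <- "car"(3); "blue"(2) intervenes.
print(contiguity_features([0, 1, 3]))
# {'contiguous': False, 'one_gap': True, 'many_gaps': False}
```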
A 5-gram LM trained on unsegmented target text is used to assess the fluency of the desegmented word sequence.
We train our English-to-Arabic system using 1.49 million sentence pairs drawn from the NIST 2012 training set, excluding the UN data. This training set contains about 40 million Arabic tokens before segmentation, and 47 million after segmentation. We tune on the NIST 2004 evaluation set (1353 sentences) and evaluate on NIST 2005 (1056 sentences). As these evaluation sets are intended for Arabic-to-English translation, we select the first English reference to use as our source text.
Our English-to-Finnish system is trained on the same Europarl corpus as Luong et al. (2010) and Clifton and Sarkar (2011), which has roughly one million sentence pairs. We also use their development and test sets (2000 sentences each).
For Arabic, morphological segmentation is performed by MADA 3.2 [12], using the Penn Arabic Treebank (PATB) segmentation scheme as recommended by El Kholy and Habash (2012). For both segmented and unsegmented Arabic, we further normalize the script by converting the various forms of Alif and Ya to bare Alif and dotless Ya. To generate the desegmentation table, we analyze the segmentations from the Arabic side of the parallel training data to collect mappings from morpheme sequences to surface forms.
For Finnish, we adopt the Unsup L-match segmentation technique of Clifton and Sarkar (2011), which uses Morfessor [8] to analyze the 5,000 most frequent Finnish words. The analysis is then applied to the Finnish side of the parallel text, and a list of segmented suffixes is collected. To improve coverage, words are further segmented according to their longest matching suffix from the list. As this segmentation performs no orthographic normalizations, it can be reversed with simple concatenation.
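The longest-suffix-match step can be sketched as follows; the suffix list shown is illustrative, whereas the real inventory comes from Morfessor’s analysis of the 5,000 most frequent words.

```python
def lmatch_segment(word, suffixes):
    """Split off the longest matching suffix from an assumed suffix inventory."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], "+" + suffix]
    return [word]

suffixes = {"ssa", "lla", "a"}                # illustrative Finnish-style suffixes
print(lmatch_segment("talossa", suffixes))    # ['talo', '+ssa']
print(lmatch_segment("auto", suffixes))       # ['auto']  (no matching suffix)
```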
We align the parallel data with GIZA++ [22] and decode using Moses [17]. The decoder’s log-linear model includes a standard feature set. Four translation model features encode phrase translation probabilities and lexical scores in both directions. Seven distortion features encode a standard distortion penalty as well as a bidirectional lexicalized reordering model. A KN-smoothed 5-gram language model is trained on the target side of the parallel data with SRILM [29]. Finally, we include word and phrase penalties. The decoder uses the default parameters for English-to-Arabic, except that the maximum phrase length is set to 8. For English-to-Finnish, we follow Clifton and Sarkar (2011) in setting the hypothesis stack size to 100, distortion limit to 6, and maximum phrase length to 20.
The decoder’s log-linear model is tuned with MERT [23]. Re-ranking models are tuned using a batch variant of hope-fear MIRA [5, 4], using the n-best variant for n-best desegmentation, and the lattice variant for lattice desegmentation. MIRA was selected over MERT because we have an in-house implementation that can tune on lattices very quickly. During development, we confirmed that MERT and MIRA perform similarly, as is expected with fewer than 20 features. Both the decoder’s log-linear model and the re-ranking models are trained on the same development set. Historically, we have not seen improvements from using different tuning sets for decoding and re-ranking. Lattices are pruned to a density of 50 edges per word before re-ranking.
We test four different systems. Our first baseline is Unsegmented, where we train on unsegmented target text, requiring no desegmentation step. Our second baseline is 1-best Deseg, where we train on segmented target text and desegment the decoder’s 1-best output. Starting from the system that produced 1-best Deseg, we then output either 1000-best lists or lattices to create our two experimental systems. The 1000-best Deseg system desegments, augments and re-ranks the decoder’s 1000-best list, while Lattice Deseg does the same in the lattice. We augment n-best lists and lattices using the features described in Section 3.4. (Development experiments on a small-data English-to-Arabic scenario indicated that the Desegmentation Score was not particularly useful, so we exclude it from the main comparison, but include it in the ablation experiments.)
We evaluate our system using BLEU [25] and TER [28]. Following Clark et al. (2011), we report average scores over five random tuning replications to account for optimizer instability. For the baselines, this means 5 runs of decoder tuning. For the desegmenting re-rankers, this means 5 runs of re-ranker tuning, each working on -best lists or lattices produced by the same (representative) decoder weights. We measure statistical significance using MultEval [6], which implements a stratified approximate randomization test to account for multiple tuning replications.
Tables 1 and 2 report results averaged over 5 tuning replications on English-to-Arabic and English-to-Finnish, respectively. In all scenarios, both 1000-best Deseg and Lattice Deseg significantly outperform the 1-best Deseg baseline.
For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. Lattice desegmentation is significantly better than 1000-best desegmentation.
For English-to-Finnish, the Unsup L-match segmentation with 1-best desegmentation does not improve over the unsegmented baseline. The segmentation may be addressing issues with model sparsity, but it is also introducing errors that would have been impossible had words been left unsegmented. In fact, even with our lattice desegmenter providing a boost, we are unable to see a significant improvement over the unsegmented model. As we attempted to replicate the approach of Clifton and Sarkar (2011) exactly by working with their segmented data, this difference is likely due to changes in Moses since the publication of their result. Nonetheless, the 1000-best and lattice desegmenters both produce significant improvements over the 1-best desegmentation baseline, with Lattice Deseg achieving a 1-point improvement in TER. These results match the established state-of-the-art on this data set, but also indicate that there is still room for improvement in identifying the best segmentation strategy for English-to-Finnish translation.
We also tried a similar Morfessor-based segmentation for Arabic, which has an unsegmented test set BLEU of 32.7. As in Finnish, the 1-best desegmentation using Morfessor did not surpass the unsegmented baseline, producing a test BLEU of only 31.4 (not shown in Table 1). Lattice desegmentation was able to boost this to 32.9, slightly above 1-best desegmentation, but well below our best MADA desegmentation result of 34.4. There appears to be a large advantage to using MADA’s supervised segmentation in this scenario.
Table 1: English-to-Arabic results, averaged over 5 tuning replications.

Model | Dev BLEU | Test BLEU | Test TER
---|---|---|---
Unsegmented | 24.4 | 32.7 | 49.4
1-best Deseg | 24.4 | 33.4 | 48.6
1000-best Deseg | 25.0 | 34.0 | 48.0
Lattice Deseg | 25.2 | 34.4 | 47.7
Table 2: English-to-Finnish results, averaged over 5 tuning replications.

Model | Dev BLEU | Test BLEU | Test TER
---|---|---|---
Unsegmented | 15.4 | 15.1 | 70.8
1-best Deseg | 15.3 | 14.8 | 71.9
1000-best Deseg | 15.4 | 15.1 | 71.5
Lattice Deseg | 15.5 | 15.1 | 70.9
We conducted an ablation experiment on English-to-Arabic to measure the impact of the various features described in Section 3.4. Table 3 compares different combinations of features using lattice desegmentation. The unsegmented LM alone yields a 0.4 point improvement over the 1-best desegmentation score. Adding contiguity indicators on top of the unsegmented LM results in another 0.6 point improvement. As anticipated, the tuner assigns negative weights to discontiguous cases, encouraging the re-ranker to select a safer translation path when possible. Judging from the output on the NIST 2005 test set, the system uses these discontiguous desegmentations very rarely: only 5% of desegmented tokens align to discontiguous source phrases. Adding the desegmentation score to these two feature groups does not improve performance, confirming the results we observed during development. The desegmentation score would likely be useful in a scenario where we provide multiple desegmentation options to the re-ranker; for now, it indicates only the ambiguity of a fixed choice, and is likely redundant with information provided by the language model.
Table 3: English-to-Arabic BLEU for lattice desegmentation with different feature groups.

Features | Dev BLEU | Test BLEU
---|---|---
1-best Deseg | 24.5 | 33.4
+ Unsegmented LM | 24.9 | 33.8
+ Contiguity | 25.2 | 34.4
+ Desegmentation Score | 25.2 | 34.3
In order to better understand the source of our improvements in the English-to-Arabic scenario, we conducted an extensive manual analysis of the differences between 1-best and lattice desegmentation on our test set. We compared the output of the two systems using the Unix tool wdiff, which transforms a solution to the longest-common-subsequence problem into a sequence of multi-word insertions and deletions [15]. We considered adjacent insertion-deletion pairs to be (potentially phrasal) substitutions, and collected them into a file, omitting any unpaired insertions or deletions. We then sampled 650 cases where the two sides of the substitution were deemed to be related, and divided these cases into categories based on how the lattice desegmentation differs from the one-best desegmentation. We consider a phrase to be correct only if it can be found in the reference.
Table 4 breaks down per-phrase accuracy according to four manually-assigned categories: (1) clitical – the two systems agree on a stem, but at least one clitic, often a prefix denoting a preposition or determiner, was dropped, added or replaced; (2) lexical – a word was changed to a morphologically unrelated word with a similar meaning; (3) inflectional – the words have the same stem, but different inflection due to a change in gender, number or verb tense; (4) part-of-speech – the two systems agree on the lemma, but have selected different parts of speech.
For each case covering a single phrasal difference, we compare the phrases from each system to the reference. We report the number of instances where each system matched the reference, as well as cases where they were both incorrect. The majority of differences correspond to clitics, whose correction appears to be a major source of the improvements obtained by lattice desegmentation. This category is challenging for the decoder because English prepositions tend to correspond to multiple possible forms when translated into Arabic. It also includes the frequent cases involving the nominal determiner prefix Al “the” (left unsegmented by the PATB scheme), and the sentence-initial conjunction w+ “and”. The second most common category is lexical, where the unsegmented LM has drastically altered the choice of translation. The remaining categories show no major advantage for either system.
Table 4: Manual analysis of phrasal differences between 1-best and lattice desegmentation, by category.

Category | Lattice Deseg correct | 1-best Deseg correct | Neither correct
---|---|---|---
Clitical | 157 | 71 | 79
Lexical | 61 | 39 | 80
Inflectional | 37 | 32 | 47
Part-of-speech | 19 | 17 | 11
We have explored deeper integration of morphological desegmentation into the statistical machine translation pipeline. We have presented a novel, finite-state-inspired approach to lattice desegmentation, which allows the system to account for a desegmented view of many possible translations, without any modification to the decoder or any restrictions on phrase extraction. When applied to English-to-Arabic translation, lattice desegmentation results in a 1.0 BLEU point improvement over one-best desegmentation, and a 1.7 BLEU point improvement over unsegmented translation. We have also applied our approach to English-to-Finnish translation, and although segmentation in general does not currently help, we are able to show significant improvements over a 1-best desegmentation baseline.
In the future, we plan to explore introducing multiple segmentation options into the lattice, and the application of our method to a full morphological analysis (as opposed to segmentation) of the target language. Eventually, we would like to replace the functionality of factored translation models [18] with lattice transformation and augmentation.
Thanks to Ann Clifton for generously providing the data and segmentation for our English-to-Finnish experiments, and to Marine Carpuat and Roland Kuhn for their helpful comments on an earlier draft. This research was supported by the Natural Sciences and Engineering Research Council of Canada.