Crowdsourcing is a viable mechanism for creating training data for machine translation. It offers a low-cost, fast-turnaround way of processing large volumes of data. However, when compared to professional translation, naive collection of translations from non-professionals yields low-quality results. Careful quality control is necessary for crowdsourcing to work well. In this paper, we examine the challenges of a two-step collaboration process in which non-professionals translate and then post-edit. We develop graph-based ranking models that automatically select the best output from multiple redundant versions of translations and edits, bringing translation quality closer to that of professionals.
Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora. In theory, SMT can be applied to any language pair, but in practice it produces state-of-the-art results only for language pairs with ample training data, like English-Arabic, English-Chinese, and French-English. For many minority or ‘low resource’ languages with insufficient data, SMT faces a severe bottleneck, which drastically limits the languages to which it can be successfully applied. Because of this, collecting parallel corpora for low resource languages has become an interesting research challenge. There are various options for creating training data for new language pairs. Past approaches have examined harvesting translated documents from the web [30, 36, 32], or discovering parallel fragments from comparable corpora [26, 1, 31]. Until relatively recently, little consideration has been given to creating parallel data from scratch, because the cost of hiring professional translators is prohibitively high. For instance, Germann (2001) hoped to hire professional translators to create a modest-sized 100,000-word Tamil-English parallel corpus, but was stymied by the costs and by the difficulty of finding good translators willing to take on a short-term commitment.
Recently, crowdsourcing has opened the possibility of translating large amounts of text at low cost using non-professional translators. Facebook localized its web site into different languages using volunteers [35]. DuoLingo turns translation into an educational game, and translates web content using its language learners [37].
Rather than relying on volunteers or gamification, NLP research into crowdsourcing translation has focused on hiring workers on the Amazon Mechanical Turk (MTurk) platform [9]. This setup presents unique challenges, since it typically involves non-professional translators whose language skills are varied, and since it sometimes involves participants who try to cheat to get the small financial reward [43]. A natural approach for trying to shore up the skills of weak bilinguals is to pair them with a native speaker of the target language to edit their translations. We review relevant research from NLP and human-computer interaction (HCI) on collaborative translation processes in Section 2. To sort good translations from bad, researchers often solicit multiple, redundant translations and then build models to try to predict which translations are the best, or which translators tend to produce the highest quality translations.
The contributions of this paper are:
An analysis of the difficulties posed by a two-step collaboration between editors and translators in Mechanical Turk-style crowdsourcing environments. Editors vary in quality, and poor editing can be difficult to detect.
A new graph-based algorithm for selecting the best translation among multiple translations of the same input. This method takes into account the collaborative relationship between the translators and the editors.
In the HCI community, several researchers have proposed protocols for collaborative translation efforts [25, 24, 16, 15]. These have focused on an iterative collaboration between monolingual speakers of the two languages, facilitated by a machine translation system. These studies are similar to ours in that they rely on native speakers’ understanding of the target language to correct the disfluencies in poor translations. However, in our setup the poor translations are produced by bilingual individuals who are weak in the target language, whereas in their experiments the translations are the output of a machine translation system. (A variety of HCI and NLP studies have confirmed the efficacy of having monolingual or bilingual individuals post-edit machine translation output [8, 19, 12]. Past NLP work has also examined automatic post-editing [18].)
Another significant difference is that the HCI studies assume cooperative participants. For instance, Hu et al. (2010) recruited volunteers from the International Children’s Digital Library [13] who were all well intentioned and participated out of a sense of altruism and to build a good reputation among the other volunteer translators at childrenslibrary.org. Our setup uses anonymous crowd workers hired on Mechanical Turk, whose motivation to participate is financial. Bernstein et al. (2010) characterized the problems with hiring editors via MTurk for a word processing application: workers were either lazy (making only minimal edits) or overly zealous (making many unnecessary edits). Bernstein et al. (2010) addressed this problem with a three-step find-fix-verify process. In the first step, workers clicked on a word or phrase that needed to be corrected. In the next step, a separate group of workers proposed corrections to the problematic regions that had been identified by multiple workers in the first pass. In the final step, other workers validated whether the proposed corrections were good.
Most NLP research into crowdsourcing has focused on Mechanical Turk, following pioneering work by Snow et al. (2008) who showed that the platform was a viable way of collecting data for a wide variety of NLP tasks at low cost and in large volumes. They further showed that non-expert annotations are similar to expert annotations when many non-expert labelings for the same input are aggregated, through simple voting or through weighting votes based on how closely non-experts matched experts on a small amount of calibration data. MTurk has subsequently been widely adopted by the NLP community and used for an extensive range of speech and language applications [7].
Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low-cost way of creating large volumes of translations [9, 3]. For instance, Zbib et al. (2012, 2013) translated 1.5 million words of Levantine Arabic and Egyptian Arabic, and showed that a statistical translation system trained on the dialectal data outperformed a system trained on 100 times more MSA data. Post et al. (2012) used MTurk to create parallel corpora for six Indian languages for less than $0.01 per word. MTurk workers translated more than half a million words’ worth of Malayalam in less than a week. Several researchers have examined the use of active learning to further reduce the cost of translation [2, 4, 6]; crowdsourcing allowed real studies to be conducted, whereas most past active learning experiments were only simulated. Pavlick et al. (2014) conducted a large-scale demographic study of the languages spoken by workers on MTurk by translating 10,000 words in each of 100 languages. Chen and Dolan (2012) examined the steps necessary to build a persistent multilingual workforce on MTurk.
This paper is most closely related to previous work by Zaidan and Callison-Burch (2011), who showed that non-professional translators could approach the level of professional translators. They solicited multiple redundant translations from different Turkers for a collection of Urdu sentences that had been previously professionally translated by the Linguistic Data Consortium. They built a model to predict, on a sentence-by-sentence and Turker-by-Turker basis, which was the best translation and who was the best translator. They also hired US-based Turkers to edit the translations, since the translators were largely based in Pakistan and exhibited errors characteristic of speakers of English as a second language. Zaidan and Callison-Burch (2011) observed only modest improvements when incorporating these edited translations into their model. We attempt to analyze why this is, and we propose a new model to better leverage their data.
Table 1: An unedited Turker translation, a post-edited version, and a professional (LDC) reference translation of the same Urdu sentence.

Urdu translator: According to the territory’s people the pamphlets from the Taaliban had been read in the announcements in all the mosques of the Northern Wazeerastan.

English post-editor: According to locals, the pamphlet released by the Taliban was read out on the loudspeakers of all the mosques in North Waziristan.

LDC professional: According to the local people, the Taliban’s pamphlet was read over the loudspeakers of all mosques in North Waziristan.
We conduct our experiments using the data collected by Zaidan and Callison-Burch (2011). This data set consists of 1,792 Urdu sentences from a variety of news and online sources, each paired with English translations provided by non-professional translators on Mechanical Turk.
Each Urdu sentence was translated redundantly by 3 distinct translators, and each translation was edited by 3 separate (native English-speaking) editors to correct for grammatical and stylistic errors. In total, this gives us 12 non-professional English candidate sentences (3 unedited, 9 edited) per original Urdu sentence. 52 different Turkers took part in the translation task, each translating 138 sentences on average. In the editing task, 320 Turkers participated, averaging 56 sentences each. For comparison, the data also includes 4 different reference translations for each source sentence, produced by professional translators.
Table 1 gives an example of an unedited translation, an edited translation, and a professional translation for the same sentence. The translations provided by translators on MTurk are generally done conscientiously, preserving the meaning of the source sentence, but they typically contain simple mistakes like misspellings, typos, and awkward word choice. English-speaking editors, despite having no knowledge of the source language, are able to fix these errors. In this work, we show that the two-heads collaboration design, pairing non-professional Urdu translators with non-professional English editors, yields better translated output than either group working in isolation, and better approximates the quality of professional translators.
We know from inspection that translations seem to improve with editing (Table 1). Given the data from MTurk, we explore whether this is the case in general: Do all translations improve with editing? To what extent do the individual translator and the individual editor affect the quality of the final sentence?
We use translation edit rate (TER) as a measure of translation similarity. TER represents the amount of change necessary to transform one sentence into another, so a low TER means the two sentences are very similar. To capture the quality (“professionalness”) of a translation, we take the average TER of the translation against each of our gold translations. That is, for a translation $t$ with professional references $g_1, \dots, g_4$, we define the TER of the translation as

$$\mathrm{TER}(t) = \frac{1}{4}\sum_{i=1}^{4} \mathrm{TER}(t, g_i) \qquad (1)$$

where a lower TER is indicative of a higher quality (more professional-sounding) translation.
We first look at editors along two dimensions: their aggressiveness and their effectiveness. Some editors may be very aggressive (they make many changes to the original translation) but still be ineffective (they fail to bring the quality of the translation closer to that of a professional). We measure aggressiveness by looking at the TER between the pre- and post-edited versions of each editor’s translations; higher TER implies more aggressive editing. To measure effectiveness, we look at the change in TER that results from the editing: a negative change means the editor effectively improved the quality of the translation, while a positive change means the editing actually moved the translation further from our gold standard.
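To make these measurements concrete, the following sketch shows how the per-translation quality score of Equation (1) and the two editor statistics could be computed. It is illustrative rather than the exact code used in our experiments: it assumes sacrebleu’s TER implementation as a stand-in scorer, and the helper names are hypothetical.

```python
# Illustrative sketch of the quality, aggressiveness, and effectiveness measures.
from statistics import mean
from sacrebleu.metrics import TER

_ter_metric = TER()

def ter(hypothesis: str, reference: str) -> float:
    """TER between one hypothesis and one reference (lower means more similar)."""
    return _ter_metric.sentence_score(hypothesis, [reference]).score

def translation_quality(candidate: str, gold_references: list) -> float:
    """Eq. (1): average TER of a candidate against all professional references."""
    return mean(ter(candidate, g) for g in gold_references)

def editor_aggressiveness(pre_edit: str, post_edit: str) -> float:
    """TER between the pre- and post-edited versions; higher = more aggressive editing."""
    return ter(post_edit, pre_edit)

def editor_effectiveness(pre_edit: str, post_edit: str, gold_references: list) -> float:
    """Change in quality caused by editing; negative values mean the edit helped."""
    return (translation_quality(post_edit, gold_references)
            - translation_quality(pre_edit, gold_references))
```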
Figure 1 shows the relationship between these two qualities for individual editor/translation pairs. We see that while most translations require only a few edits, there are a large number of translations which improve substantially after heavy editing. This trend conforms to our intuition that editing is most useful when the translation has much room for improvement, and opens the question of whether good editors can offer improvements to translations of all qualities.
To address this question, we split our translations into 5 bins, based on their TER. We also split our editors into 5 bins, based on their effectiveness (i.e. the average amount by which their editing reduces TER). Figure 2 shows the degree to which editors at each level are able to improve the translations from each bin. We see that good editors are able to make improvements to translations of all qualities, but that good editing has the greatest impact on lower quality translations. This result suggests that finding good editor/translator pairs, rather than good editors and good translators in isolation, should produce the best translations overall. Figure 3 gives an example of how an initially medium-quality translation, when combined with good editing, produces a better result than the higher-quality translation paired with mediocre editing.
The problem definition of the crowdsourcing translation task is straightforward: given a set of candidate translations for a source sentence, we want to choose the best output translation.
This output translation is the result of the combined translation and editing stages. Therefore, our method operates over a heterogeneous network that includes translators and post-editors as well as the translated sentences that they produce. We frame the problem as follows. We form two graphs: the first graph ($G_T$) represents Turkers (translator/editor pairs) as nodes; the second graph ($G_C$) represents candidate translated and post-edited sentences (henceforth “candidates”) as nodes. These two graphs, $G_T$ and $G_C$, are combined as subgraphs of a third graph ($G$). Edges in $G$ connect author pairs (nodes in $G_T$) to the candidates that they produced (nodes in $G_C$). Together, $G_C$, $G_T$, and $G$ define a co-ranking problem [39, 41, 40] with linkage establishment [38, 42], which we define formally as follows.
Let $G = \langle V, E \rangle$ denote the heterogeneous graph with nodes $V$ and edges $E$, where $V = V_C \cup V_T$ and $E = E_C \cup E_T \cup E_{CT}$. $G$ is divided into three subgraphs, $G_C$, $G_T$, and $G_{CT}$. $G_C = \langle V_C, E_C \rangle$ is a weighted undirected graph representing the candidates and their lexical relationships to one another. Let $V_C$ denote a collection of translated and edited candidates, and $E_C$ the lexical similarity between the candidates (see Section 4.3 for details). $G_T = \langle V_T, E_T \rangle$ is a weighted undirected graph representing collaborations between Turkers. $V_T$ is the set of translator/editor pairs. Edges $E_T$ connect translator/editor pairs which share a translator and/or an editor. Each collaboration (i.e. each node in $G_T$) produces a candidate (i.e. a node in $G_C$). $G_{CT}$ is an unweighted bipartite graph that ties $G_C$ and $G_T$ together and represents “authorship”. The graph consists of nodes $V_{CT} = V_C \cup V_T$ and edges $E_{CT}$ connecting each candidate with its authoring translator/post-editor pair. The three sub-networks ($G_C$, $G_T$, and $G_{CT}$) are illustrated in Figure 4.
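As an illustration of how the three sub-networks can be assembled from the crowdsourced data, the sketch below builds $G_C$, $G_T$, and $G_{CT}$ with networkx. The record fields (`candidate`, `translator`, `editor`) are hypothetical placeholders for the actual data schema, not the format of the released data set.

```python
# Illustrative construction of the three sub-networks G_C, G_T, and G_CT.
import networkx as nx
from itertools import combinations

# Each record: one post-edited candidate plus the workers who produced it.
records = [
    {"candidate": "sentence A ...", "translator": "t1", "editor": "e1"},
    {"candidate": "sentence B ...", "translator": "t1", "editor": "e2"},
    {"candidate": "sentence C ...", "translator": "t2", "editor": "e2"},
]

G_C = nx.Graph()   # candidates; similarity-weighted edges are added later (Eq. 8)
G_T = nx.Graph()   # translator/editor pairs, weighted by shared workers
G_CT = nx.Graph()  # bipartite authorship graph

for r in records:
    pair = (r["translator"], r["editor"])
    G_C.add_node(r["candidate"])
    G_T.add_node(pair)
    G_CT.add_edge(r["candidate"], pair)   # authorship edge

# Collaboration edges: pairs that share a translator and/or an editor.
for p, q in combinations(G_T.nodes, 2):
    shared = len(set(p) & set(q))
    if shared > 0:
        G_T.add_edge(p, q, weight=shared)
```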
The framework includes three random walks, one on $G_C$, one on $G_T$, and one on $G_{CT}$. A random walk on a graph is a Markov chain whose states are the vertices of the graph. It can be described by a stochastic square matrix, whose dimension is the number of vertices in the graph and whose entries describe the transition probabilities from one vertex to the next. The mutual reinforcement framework couples the two random walks on $G_C$ and $G_T$, which rank candidates and Turkers in isolation. The ranking method allows us to obtain a global ranking by taking into account the intra- and inter-component dependencies. In the following sections, we describe how we obtain the rankings on $G_C$ and $G_T$, and then discuss how the two are coupled.
Our algorithm aims to capture the following intuitions. A candidate is important if 1) it is similar to many of the other proposed candidates and 2) it is authored by better qualified translators and/or post-editors. Analogously, a translator/editor pair is believed to be better qualified if 1) the editor is collaborating with a good translator and vice versa and 2) the pair has authored important candidates. This ranking schema is a reinforced process across the heterogeneous graphs. We use two vectors $\mathbf{c}$ and $\mathbf{t}$ to denote the saliency scores of candidates and Turker pairs. The above-mentioned intuitions can be formulated as follows:
Homogeneity. We use an adjacency matrix $\mathbf{M}$ to describe the homogeneous affinity between candidates and $\mathbf{N}$ to describe the affinity between Turkers:

$$\mathbf{M} \in \mathbb{R}^{|V_C| \times |V_C|}, \qquad \mathbf{N} \in \mathbb{R}^{|V_T| \times |V_T|} \qquad (2)$$

where $|V_C|$ is the number of vertices in the candidate graph and $|V_T|$ is the number of vertices in the Turker graph. The adjacency matrix $\mathbf{M}$ denotes the transition probabilities between candidates, and analogously matrix $\mathbf{N}$ denotes the affinity between Turker collaboration pairs.
Heterogeneity. We use adjacency matrices $\mathbf{W}$ and $\hat{\mathbf{W}}$ to describe the authorship relation between an output candidate and the Turker pair that produced it, from both the candidate-to-Turker and the Turker-to-candidate perspectives:

$$\mathbf{W} \in \mathbb{R}^{|V_C| \times |V_T|}, \qquad \hat{\mathbf{W}} \in \mathbb{R}^{|V_T| \times |V_C|} \qquad (3)$$
All affinity matrices will be defined in the next section. By fusing the above equations, we obtain the following iterative calculation in matrix form. For the numerical computation of the saliency scores, the initial scores of all sentences and Turkers are set to 1, and the following two steps are alternated until convergence, after which the highest-scoring candidate is selected.
Step 1: compute the saliency scores of the candidates, and then normalize using the $\ell_1$ norm:

$$\mathbf{c} = (1-\lambda)\,\mathbf{M}\mathbf{c} + \lambda\,\mathbf{W}\mathbf{t}, \qquad \mathbf{c} \leftarrow \frac{\mathbf{c}}{\|\mathbf{c}\|_1} \qquad (4)$$
Step 2: compute the saliency scores of the Turker pairs, and then normalize using the $\ell_1$ norm:

$$\mathbf{t} = (1-\lambda)\,\mathbf{N}\mathbf{t} + \lambda\,\hat{\mathbf{W}}\mathbf{c}, \qquad \mathbf{t} \leftarrow \frac{\mathbf{t}}{\|\mathbf{t}\|_1} \qquad (5)$$
where $\lambda$ specifies the relative contribution to the saliency scores of the homogeneous affinity versus the heterogeneous affinity. In order to guarantee the convergence of the iterative process, we must force the transition matrices to be stochastic and irreducible [20]; $\mathbf{c}$ and $\mathbf{t}$ are therefore normalized after each iteration of Equations (4) and (5).
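A compact numpy sketch of the alternation between Steps 1 and 2 is given below. It assumes that the affinity matrices of Section 4.3 have already been normalized so that each update is a proper probabilistic step, and it should be read as an illustration of Equations (4) and (5) rather than a reference implementation.

```python
# Sketch of the coupled co-ranking updates of Eqs. (4)-(5).
import numpy as np

def co_rank(M, N, W, W_hat, lam=0.1, tol=1e-6, max_iter=100):
    """M: |V_C| x |V_C| candidate affinities (stochastic); N: |V_T| x |V_T| Turker
    affinities (stochastic); W: |V_C| x |V_T| candidate-to-Turker transitions;
    W_hat: |V_T| x |V_C| Turker-to-candidate transitions; lam: coupling lambda."""
    n_c, n_t = M.shape[0], N.shape[0]
    c = np.ones(n_c) / n_c          # saliency of candidates (uniform start)
    t = np.ones(n_t) / n_t          # saliency of Turker pairs
    for _ in range(max_iter):
        c_new = (1 - lam) * M @ c + lam * W @ t          # Step 1
        c_new /= np.abs(c_new).sum()                     # l1 normalization
        t_new = (1 - lam) * N @ t + lam * W_hat @ c_new  # Step 2
        t_new /= np.abs(t_new).sum()
        converged = (np.abs(c_new - c).sum() < tol and np.abs(t_new - t).sum() < tol)
        c, t = c_new, t_new
        if converged:
            break
    return c, t
```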
The standard PageRank algorithm starts from an arbitrary node and, at each step, either follows a random out-going edge (according to the weighted transition matrix) or jumps to a random node (treating all nodes with equal probability).
In a simple random walk, all nodes of the transition matrix are assumed to be equiprobable before the walk starts. Then $\mathbf{c}$ and $\mathbf{t}$ are calculated as:
$$\mathbf{c} = \mu\,\mathbf{M}\mathbf{c} + \frac{1-\mu}{|V_C|}\,\mathbf{1} \qquad (6)$$

and

$$\mathbf{t} = \mu\,\mathbf{N}\mathbf{t} + \frac{1-\mu}{|V_T|}\,\mathbf{1} \qquad (7)$$
where $\mathbf{1}$ is a vector with all elements equal to 1, whose size corresponds to $|V_C|$ or $|V_T|$, and $\mu$ is the damping factor, usually set to 0.85, as in the PageRank algorithm.
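For completeness, the single-graph walk of Equations (6) and (7) can be sketched as a damped power iteration. This is again illustrative; `M` stands for either normalized homogeneous affinity matrix ($\mathbf{M}$ or $\mathbf{N}$).

```python
# Sketch of the damped random walk in Eqs. (6)-(7), PageRank-style power iteration.
import numpy as np

def random_walk(M, mu=0.85, tol=1e-8, max_iter=1000):
    """M: a stochastic transition matrix over the graph's vertices."""
    n = M.shape[0]
    score = np.ones(n) / n
    teleport = np.ones(n) / n            # uniform jump distribution, i.e. 1/|V|
    for _ in range(max_iter):
        new = mu * M @ score + (1 - mu) * teleport
        if np.abs(new - score).sum() < tol:
            return new
        score = new
    return score
```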
We now introduce the calculation of the affinity matrices, including the homogeneous affinities (i.e., $\mathbf{M}$ and $\mathbf{N}$) and the heterogeneous affinities (i.e., $\mathbf{W}$ and $\hat{\mathbf{W}}$).
As discussed, we model the collection of candidates as a weighted undirected graph, $G_C$, in which nodes represent candidate sentences and edges represent lexical relatedness. We define an edge’s weight to be the cosine similarity between the candidates represented by the nodes that it connects. The adjacency matrix $\mathbf{M}$ describes such a graph, with each entry corresponding to the weight of an edge:

$$M_{ij} = \mathrm{sim}(c_i, c_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{\|\vec{v}_i\|\,\|\vec{v}_j\|} \qquad (8)$$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity and $\vec{v}_i$ is a term vector corresponding to candidate $c_i$. We treat a candidate as a short document and weight each term with tf.idf [23], where tf is the term frequency and idf is the inverse document frequency.
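The candidate affinity matrix of Equation (8) can be computed, for example, with scikit-learn’s tf.idf vectorizer and cosine similarity. The sketch below is one possible implementation, not necessarily the one used in our experiments.

```python
# Candidate affinity matrix M (Eq. 8): cosine similarity between tf.idf vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def candidate_affinity(candidates):
    """candidates: list of candidate sentence strings; returns the raw matrix M."""
    vectors = TfidfVectorizer().fit_transform(candidates)   # one row per candidate
    M = cosine_similarity(vectors)
    np.fill_diagonal(M, 0.0)   # no self-loops
    # M is normalized to a stochastic transition matrix before the random walk.
    return M
```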
The Turker graph, $G_T$, is an undirected graph whose edges represent “collaboration.” Formally, let $\tau_i$ and $\tau_j$ be two translator/editor pairs; we say that pair $\tau_i$ “collaborates with” pair $\tau_j$ (and therefore, there is an edge between $\tau_i$ and $\tau_j$) if $\tau_i$ and $\tau_j$ share either a translator or an editor (or share both a translator and an editor). Let the function $f(\tau_i, \tau_j)$ denote the number of such “collaborations” (shared workers) between $\tau_i$ and $\tau_j$:

$$f(\tau_i, \tau_j) = |\{\text{workers shared by } \tau_i \text{ and } \tau_j\}| \in \{0, 1, 2\} \qquad (9)$$

The adjacency matrix $\mathbf{N}$ is then defined as

$$N_{ij} = f(\tau_i, \tau_j) \qquad (10)$$
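A straightforward way to build $\mathbf{N}$ is to count shared workers directly, as in the following sketch (illustrative; translator/editor pairs are represented as tuples of worker IDs).

```python
# Turker collaboration matrix N (Eqs. 9-10): entry (i, j) counts the workers
# (translator and/or editor) shared by pairs tau_i and tau_j.
import numpy as np

def turker_affinity(pairs):
    """pairs: list of (translator_id, editor_id) tuples, one per node in G_T."""
    n = len(pairs)
    N = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                N[i, j] = len(set(pairs[i]) & set(pairs[j]))   # 0, 1, or 2
    return N
```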
In the bipartite candidate-Turker graph $G_{CT}$, the entry $A_{ij}$ is an indicator denoting whether candidate $c_i$ was generated by pair $\tau_j$:

$$A_{ij} = \begin{cases} 1 & \text{if } c_i \text{ is authored by } \tau_j \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

Through $\mathbf{A}$ we define the weight matrices $\mathbf{W}$ and $\hat{\mathbf{W}}$, containing the conditional probabilities of transitions from $c_i$ to $\tau_j$ and vice versa:

$$W_{ij} = \frac{A_{ij}}{\sum_{k} A_{ik}}, \qquad \hat{W}_{ji} = \frac{A_{ij}}{\sum_{k} A_{kj}} \qquad (12)$$
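The sketch below constructs the indicator matrix of Equation (11) and the two normalized transition matrices of Equation (12). The `author_of` mapping is a hypothetical stand-in for the authorship records.

```python
# Authorship indicator A (Eq. 11) and conditional transition matrices (Eq. 12).
import numpy as np

def authorship_matrices(candidates, pairs, author_of):
    """author_of: dict mapping each candidate to its (translator, editor) pair."""
    A = np.zeros((len(candidates), len(pairs)))
    pair_index = {p: j for j, p in enumerate(pairs)}
    for i, c in enumerate(candidates):
        A[i, pair_index[author_of[c]]] = 1.0

    row_sums = A.sum(axis=1, keepdims=True)
    col_sums = A.sum(axis=0, keepdims=True)
    W = A / np.where(row_sums == 0, 1.0, row_sums)          # P(pair | candidate)
    W_hat = (A / np.where(col_sums == 0, 1.0, col_sums)).T  # P(candidate | pair)
    return A, W, W_hat
```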
We are interested in testing our random walk method, which incorporates information from both the candidate translations and the Turkers. We test two versions of our proposed collaborative co-ranking method: 1) based on the unedited translations only and 2) based on the edited sentences produced by translator/editor collaborations.
Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score [27] for one professional translator (P1) using the other three (P2,3,4) as a reference set. We repeat the process four times, scoring each professional translator against the others, to calculate the expected range of professional quality translation. In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. Therefore, each number reported in our experimental results is an average of four numbers, corresponding to the four possible ways of choosing 3 of the 4 reference sets. This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
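The leave-one-out evaluation can be sketched as follows, using sacrebleu’s corpus-level BLEU; the data structures are illustrative and the function name is hypothetical.

```python
# Average BLEU over the four ways of choosing 3 of the 4 reference sets.
from statistics import mean
import sacrebleu

def avg_leave_one_out_bleu(system_outputs, reference_sets):
    """system_outputs: selected translations, one per source sentence.
    reference_sets: list of 4 lists of reference sentences (one per professional)."""
    scores = []
    for held_out in range(len(reference_sets)):
        refs = [r for k, r in enumerate(reference_sets) if k != held_out]
        scores.append(sacrebleu.corpus_bleu(system_outputs, refs).score)
    return mean(scores)
```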
Table 2: BLEU scores of the reference translations, oracles, baselines, and our graph-based ranking methods, averaged over the four ways of choosing 3 of the 4 reference sets.

| Method | BLEU |
| --- | --- |
| Reference (Avg.) | 42.51 |
| Oracle (Seg-Trans) | 44.93 |
| Oracle (Seg-Trans+Edit) | 48.44 |
| Oracle (Turker-Trans) | 38.66 |
| Oracle (Turker-Trans+Edit) | 39.16 |
| Random | 30.52 |
| Lowest TER | 35.78 |
| Graph Ranking (Trans) | 38.88 |
| Graph Ranking (Trans+Edit) | 41.43 |
As a naive baseline, we choose one candidate translation at random for each input Urdu sentence. To establish an upper bound for our methods, and to determine if there exist high-quality Turker translations at all, we compute four oracle scores. The first oracle operates at the segment level on the sentences produced by translators only: for each source segment, we choose from the translations the one that scores highest (in terms of BLEU) against the reference sentences. The second oracle is applied similarly, but chooses from the candidates produced by the collaboration of translator/post-editor pairs. The third oracle operates at the worker level: for each source segment, we choose from the translations the one provided by the worker whose translations (over all sentences) score the highest on average. The fourth oracle also operates at the worker level, but selects from sentences produced by translator/post-editor collaborations. These oracle methods represent ideal solutions under our scenario. We also examine two voting-inspired methods. The first method selects the translation with the minimum average TER [33] against the other translations; intuitively, this would represent the “consensus” translation. The second method selects the translation generated by the Turker who, on average, provides translations with the minimum average TER.
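The first voting-inspired baseline can be sketched as below; it reuses the `ter` helper from the earlier sketch (a sacrebleu-based stand-in), and the selection logic is illustrative.

```python
# Consensus baseline: pick the candidate with the lowest average TER against
# the other candidates for the same source sentence.
from statistics import mean

def consensus_candidate(candidates):
    """candidates: the redundant translations of one source sentence."""
    if len(candidates) == 1:
        return candidates[0]
    def avg_ter_to_others(i):
        others = candidates[:i] + candidates[i + 1:]
        return mean(ter(candidates[i], o) for o in others)   # ter() sketched earlier
    best_index = min(range(len(candidates)), key=avg_ter_to_others)
    return candidates[best_index]
```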
A summary of our results is given in Table 2. As expected, random selection yields poor performance, with a BLEU score of 30.52. The oracles indicate that there is usually an acceptable translation from the Turkers for any given sentence. Since the oracles select from a small group of only 4 translations per source segment, they are not overly optimistic, and rather reflect the true potential of the collected translations. On average, the reference translations give a score of 42.38. To put this in perspective, the output of a state-of-the-art machine translation system (the syntax-based variant of Joshua) achieves a score of 26.91, as reported in [43]. The approach which selects the translation with the minimum average TER [33] against the other translations (the “consensus” translation) achieves a BLEU score of 35.78.
Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to Zaidan and Callison-Burch (2011)’s reported score of 28.13, which they achieved using a linear feature-based classifier. Their linear classifier achieved a reported score of 39.06 when combining information from both translators and editors. (Note that the data we use in our experiments are slightly different, since we discard nearly 100 NULL sentences from the raw data. We do not re-implement this baseline, but report the results from the paper directly; according to our experiments, most of the results generated by the baselines and oracles are very close to the previously reported values.) In contrast, our proposed graph-based ranking framework achieves a score of 41.43 when using the same information. This boost in BLEU score confirms our intuition that the hidden collaboration network between candidate translations and translator/editor pairs is indeed useful.
There are two parameters in our experimental setup: $\mu$ controls the probability of starting a new random walk and $\lambda$ controls the coupling between the candidate and Turker sub-graphs. We set the damping factor $\mu$ to 0.85, following the standard PageRank paradigm. In order to determine a value for $\lambda$, we used the average BLEU, computed against the professional reference translations, as a tuning metric. We experimented with values of $\lambda$ ranging from 0 to 1, with a step size of 0.05 (Figure 5). Small values place little emphasis on the candidate/Turker coupling, whereas larger values rely more heavily on the co-ranking. Overall, we observed better performance with values within the range 0.05-0.15. This suggests that both sources of information, the candidate itself and its authors, are important for the crowdsourcing translation task. In all of our reported results, we used $\lambda = 0.1$.
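The tuning procedure amounts to a simple sweep over $\lambda$, as sketched below. It reuses the `co_rank` and `avg_leave_one_out_bleu` helpers from the earlier sketches, and the grouping of candidates by source sentence is a hypothetical data structure.

```python
# Sweep lambda from 0 to 1 in steps of 0.05 and keep the best average BLEU.
import numpy as np

def tune_lambda(M, N, W, W_hat, candidate_texts, candidate_groups, reference_sets):
    """candidate_groups: for each source sentence, the indices of its candidates
    in candidate_texts (and hence in the saliency vector c)."""
    best_lam, best_bleu = None, -1.0
    for lam in np.arange(0.0, 1.0001, 0.05):
        c, _ = co_rank(M, N, W, W_hat, lam=lam)              # sketched earlier
        selected = [candidate_texts[max(group, key=lambda i: c[i])]
                    for group in candidate_groups]
        bleu = avg_leave_one_out_bleu(selected, reference_sets)  # sketched earlier
        if bleu > best_bleu:
            best_lam, best_bleu = lam, bleu
    return best_lam, best_bleu
```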
We examine the relative contribution of each component of our approach to the overall performance. We first examine the centroid-based ranking on the candidate sub-graph ($G_C$) alone to see the effect of voting among translated sentences; we denote this strategy as plain ranking. Then we incorporate the standard random walk on the Turker graph ($G_T$) to include the structural information, but without yet including any collaboration information; that is, we incorporate information from $G_C$ and $G_T$ without including edges linking the two together. The co-ranking paradigm is exactly the same as the framework described in Section 3.2, but with simplified structures.
Finally, we examine the two-step collaboration based candidate-Turker graph using several variations on edge establishment. As before, the nodes are the translator/post-editor working pairs. We investigate three settings in which 1) edges connect two nodes when they share only a translator, 2) edges connect two nodes when they share only a post-editor, and 3) edges connect two nodes when they share either a translator or a post-editor. These results are summarized in Table 3.
Table 3: BLEU scores for each component of our approach and for the different edge-establishment strategies.

| Method | BLEU |
| --- | --- |
| Plain ranking | 38.89 |
| w/o collaboration | 38.88 |
| Shared translator | 41.38 |
| Shared post-editor | 41.29 |
| Shared Turker | 41.43 |
Interestingly, we observe that when modeling the linkage between collaboration pairs, connecting Turker pairs which share either a translator or a post-editor achieves better performance than connecting only pairs that share a translator or only pairs that share an editor. This result supports the intuition that a denser collaboration matrix helps propagate saliency to good translators and post-editors, and hence provides better predictions of candidate quality.
We have proposed an algorithm for using a two-step collaboration between non-professional translators and post-editors to obtain professional-quality translations. Our method, based on a co-ranking model, selects the best crowdsourced translation from a set of candidates, and is capable of selecting translations which approach professional quality.
Crowdsourcing can play a pivotal role in future efforts to create parallel translation datasets. In addition to its benefits of cost and scalability, crowdsourcing provides access to languages that currently fall outside the scope of statistical machine translation research. In future work on crowdsourced translation, further benefits in quality improvement and cost reduction could stem from 1) building ground truth data sets based on high-quality Turkers’ translations and 2) identifying when sufficient data has been collected for a given input, to avoid soliciting unnecessary redundant translations.
This material is based on research sponsored by a DARPA Computer Science Study Panel phase 3 award entitled “Crowdsourcing Translation” (contract D12PC00368). The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements by DARPA or the U.S. Government. This research was supported by the Johns Hopkins University Human Language Technology Center of Excellence and through gifts from Microsoft, Google and Facebook.