Expectations of Word Sense in Parallel Corpora

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch
Johns Hopkins University


Abstract

Given a parallel corpus, if two distinct words in language A, a1 and a2, are aligned to the same word b1 in language B, then this might signal that b1 is polysemous, or it might signal that a1 and a2 are synonyms. Both assumptions have been put forward in the literature, and each has supported successful work. We investigate these assumptions, along with other questions of word sense, by sampling parallel sentences that contain tokens of the same English type and asking how often those tokens mean the same thing when they are: 1. aligned to the same foreign type; and 2. aligned to different foreign types. Results for French-English and Chinese-English parallel corpora show similar behavior: synonymy is only very weakly the more prevalent scenario, and both cases occur regularly.
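
The two conditions compared above can be illustrated with a minimal sketch. It assumes word alignments have already been reduced to (sentence id, English type, foreign type) triples. That input format, the function name pair_by_alignment, and the toy data are all hypothetical and are not taken from the paper. The sketch only partitions pairs of same-type English tokens by whether their aligned foreign types match; judging whether the two tokens share a sense is a separate annotation step.

```python
from collections import defaultdict
from itertools import combinations


def pair_by_alignment(aligned_tokens):
    """Split pairs of same-type English tokens into two conditions.

    aligned_tokens: iterable of (sentence_id, english_type, foreign_type)
    triples. This input format is an assumption made for illustration.
    Returns (same, different): lists of (english_type, sent_id_1, sent_id_2).
    """
    by_english = defaultdict(list)
    for sent_id, english, foreign in aligned_tokens:
        by_english[english].append((sent_id, foreign))

    same, different = [], []
    for english, occurrences in by_english.items():
        # Every unordered pair of tokens sharing this English type.
        for (s1, f1), (s2, f2) in combinations(occurrences, 2):
            bucket = same if f1 == f2 else different
            bucket.append((english, s1, s2))
    return same, different


# Toy, invented alignments (not data from the paper).
toy = [
    (1, "bank", "banque"),
    (2, "bank", "banque"),
    (3, "bank", "rive"),
    (4, "run", "courir"),
]
same_pairs, different_pairs = pair_by_alignment(toy)
```

In a study of this kind, pairs would then be sampled from each list and shown to annotators, who judge whether the two English tokens carry the same sense.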