Learning phonetic categories is one of the first steps to learning a language, yet is hard to do using only distributional phonetic information. Semantics could potentially be useful, since words with different meanings have distinct phonetics, but it is unclear how many word meanings are known to infants learning phonetic categories. We show that attending to a weaker source of semantics, in the form of a distribution over topics in the current context, can lead to improvements in phonetic category learning. In our model, an extension of a previous model of joint word-form and phonetic category inference, the probability of word-forms is topic-dependent, enabling the model to find significantly better phonetic vowel categories and word-forms than a model with no semantic knowledge.
Infants begin learning the phonetic categories of their native language in their first year (19; 29; 51). In theory, semantic information could offer a valuable cue for phoneme induction by helping infants distinguish between minimal pairs, as linguists do (48). (The models in this paper do not distinguish between phonetic and phonemic categories, since they do not capture phonological processes, which are also absent from our synthetic data; we therefore use the terms interchangeably.) However, due to a widespread assumption that infants do not know the meanings of many words at the age when they are learning phonetic categories (see 42 for a review), most recent models of early phonetic category acquisition have explored the phonetic learning problem in the absence of semantic information (8; 9; 11; 26; 50).
Models without any semantic information are likely to underestimate infants’ ability to learn phonetic categories. Infants learn language in the wild, and quickly attune to the fact that words have (possibly unknown) meanings. The extent of infants’ semantic knowledge is not yet known, but existing evidence shows that six-month-olds can associate some words with their referents (4; 46; 47), leverage non-acoustic contexts such as objects or articulations to distinguish similar sounds (44; 52), and map meaning (in the form of objects or images) to new word-forms in some laboratory settings (15; 16; 39). These findings indicate that young infants are sensitive to co-occurrences between linguistic stimuli and at least some aspects of the world.
In this paper we explore the potential contribution of semantic information to phonetic learning by formalizing a model in which learners attend to the word-level context in which phones appear (as in the lexical-phonetic learning model of 11) and also to the situations in which word-forms are used. The modeled situations consist of combinations of categories of salient activities or objects, similar to the activity contexts explored by Roy et al. (37), e.g., 'getting dressed' or 'eating breakfast'. We assume that child learners are able to infer a representation of the situational context from their non-linguistic environment. However, in our simulations we approximate the environmental information by running a topic model (5) over a corpus of child-directed speech to infer a topic distribution for each situation. These topic distributions are then used as input to our model to represent situational contexts.
The situational information in our model is similar to that assumed by theories of cross-situational word learning (14; 40; 53), but our model does not require learners to map individual words to their referents. Even in the absence of word-meaning mappings, situational information is potentially useful because similar-sounding words uttered in similar situations are more likely to be tokens of the same lexeme (containing the same phones) than similar-sounding words uttered in different situations.
In simulations of vowel learning, inspired by Vallabha et al. (50) and Feldman et al. (11), we show a clear improvement over previous models in both phonetic and lexical (word-form) categorization when situational context is used as an additional source of information. This improvement is especially noticeable when the word-level context is providing less information, arguably the more realistic setting. These results demonstrate that relying on situational co-occurrence can improve phonetic learning, even if learners do not yet know the meanings of individual words.
Infants attend to distributional characteristics of their input (24; 23), leading to the hypothesis that phonetic categories could be acquired on the basis of bottom-up distributional learning alone (8; 50; 26). However, this would require sound categories to be well separated, which often is not the case—for example, see Figure 1, which shows the English vowel space that is the focus of this paper.
Recent work has investigated whether infants could overcome such distributional ambiguity by incorporating top-down information, in particular, the fact that phones appear within words. At six months, infants begin to recognize word-forms such as their name and other frequently occurring words (21; 18), without necessarily linking a meaning to these forms. This “protolexicon” can help differentiate phonetic categories by adding word contexts in which certain sound categories appear (42; 12). To explore this idea further, Feldman et al. (11) implemented the Lexical-Distributional (LD) model, which jointly learns a set of phonetic vowel categories and a set of word-forms containing those categories. Simulations showed that the use of lexical context greatly improved phonetic learning.
Our own Topic-Lexical-Distributional (TLD) model extends the LD model to include an additional type of context: the situations in which words appear. To motivate this extension and clarify the differences between the models, we now provide a high-level overview of both models; details are given in Sections 3 and 4.
Both the LD and TLD models are computational-level models of phonetic (specifically, vowel) categorization in which phones (vowels) are presented to the model in the context of words. (For a related model that also tackles the word segmentation problem, see Elsner et al. (10). In a model of phonological learning, Fourtassi and Dupoux (submitted) show that semantic context information similar to that used here remains useful despite segmentation errors.) The task is to infer a set of phonetic categories and a set of lexical items on the basis of the data observed for each word token $x_i$. In the original LD model, the observations for token $x_i$ are its frame $f_i$, which consists of a list of consonants and slots for vowels, and the list of vowel tokens $x_{ij}$. (The TLD model includes additional observations, described below.) A single vowel token, $x_{ij}$, is a two-dimensional vector representing the first two formants (peaks in the frequency spectrum, ordered from lowest to highest). For example, a token of the word kitty would have the frame $f_i = $ k_t_, containing two consonant phones, /k/ and /t/, with two vowel phone slots, and two vowel formant vectors, $x_{i1}$ and $x_{i2}$. (In simulations we also experiment with frames in which consonants are not represented perfectly.)
Given the data, the model must assign each vowel token to a vowel category, $x_{ij} = c$. Both the LD and the TLD models do this using intermediate lexemes, $\ell$, which contain vowel category assignments, $\ell_j = c$, as well as a frame $\ell_f$. If a word token is assigned to a lexeme, $x_i = \ell$, the vowels within the word are assigned to that lexeme's vowel categories, $x_{ij} = \ell_j$. (The notation is overloaded: $x_{ij}$ refers both to the vowel formants and the vowel category assignment, and $x_i$ refers both to the token identity and its assignment to a lexeme.) The word and lexeme frames must match, $f_i = \ell_f$.
Lexical information helps with phonetic categorization because it can disambiguate highly overlapping categories, such as the ae and eh categories in Figure 1. A purely distributional learner who observes a cluster of data points in the ae-eh region is likely to assume all these points belong to a single category because the distributions of the categories are so similar. However, a learner who attends to lexical context will notice a difference: contexts that only occur with ae will be observed in one part of the ae-eh region, while contexts that only occur with eh will be observed in a different (though partially overlapping) space. The learner then has evidence of two different categories occurring in different sets of lexemes.
Simulations with the LD model show that using lexical information to constrain phonetic learning can greatly improve categorization accuracy (11), but it can also introduce errors. When two word tokens contain the same consonant frame but different vowels (i.e., minimal pairs), the model is more likely to categorize those two vowels together. Thus, the model has trouble distinguishing minimal pairs. Although young children also have trouble with minimal pairs (41; 45), the LD model may overestimate the degree of the problem. We hypothesize that if a learner is able to associate words with the contexts of their use (as children likely are), this could provide a weak source of information for disambiguating minimal pairs even without knowing their exact meanings. That is, if the learner hears two k_t tokens with different vowels in two different situational contexts, they are likely to be different lexical items (and their vowels different phones), despite the lexical similarity between them.
To demonstrate the benefit of situational information, we develop the Topic-Lexical-Distributional (TLD) model, which extends the LD model by assuming that words appear in situations, analogous to documents in a topic model. Each situation $h$ is associated with a mixture of topics $\theta_h$, which is assumed to be observed. Thus, for the $i$th token in situation $h$, denoted $x_{hi}$, the observed data are its frame $f_{hi}$, vowels $x_{hij}$, and topic vector $\theta_h$.
From an acquisition perspective, the observed topic distribution represents the child’s knowledge of the context of the interaction: she can distinguish bathtime from dinnertime, and is able to recognize that some topics appear in certain contexts (e.g. animals on walks, vegetables at dinnertime) and not in others (few vegetables appear at bathtime). We assume that the child would learn these topics from observing the world around her and the co-occurrences of entities and activities in the world. Within any given situation, there might be a mixture of different (actual or possible) topics that are salient to the child. We assume further that as the child learns the language, she will begin to associate specific words with each topic as well.
Thus, in the TLD model, the words used in a situation are topic-dependent, implying meaning, but without pinpointing specific referents. Although the model observes the distribution of topics in each situation (corresponding to the child observing her non-linguistic environment), it must learn to associate each (phonetically and lexically ambiguous) word token with a particular topic from that distribution. The occurrence of similar-sounding words in different situations with mostly non-overlapping topics will provide evidence that those words belong to different topics and that they are therefore different lexemes. Conversely, potential minimal pairs that occur in situations with similar topic distributions are more likely to belong to the same topic and thus the same lexeme.
Although we assume that children infer topic distributions from the non-linguistic environment, we will use transcripts from CHILDES to create the word/phone learning input for our model. These transcripts are not annotated with environmental context, but Roy et al. (37) found that topics learned from similar transcript data using a topic model were strongly correlated with immediate activities and contexts. We therefore obtain the topic distributions used as input to the TLD model by training an LDA topic model (5) on a superset of the child-directed transcript data we use for lexical-phonetic learning, dividing the transcripts into small sections (the 'documents' in LDA) that serve as our distinct situations. As noted above, the learned document-topic distributions are treated as observed variables in the TLD model to represent the situational context. The topic-word distributions learned by LDA are discarded, since these are based on the (correct and unambiguous) words in the transcript, whereas the TLD model is presented with phonetically ambiguous versions of these word tokens and must learn to disambiguate them and associate them with topics.
In this section we describe more formally the generative process for the LD model (11), a joint Bayesian model over phonetic categories and a lexicon, before describing the TLD extension in the following section.
The set of phonetic categories and the lexicon are both modeled using non-parametric Dirichlet process (DP) priors, which allow a potentially infinite number of categories or lexemes. A DP is parametrized as $\mathrm{DP}(\alpha, H)$, where $\alpha$ is a real-valued hyperparameter and $H$ is a base distribution. $H$ may be continuous, as when it generates phonetic categories in formant space, or discrete, as when it generates lexemes as lists of phonetic categories.

A draw from a DP, $G \sim \mathrm{DP}(\alpha, H)$, returns a distribution over a countably infinite set of draws from $H$, i.e., a discrete distribution over a set of categories or lexemes generated by $H$. In the mixture model setting, the category assignments are generated from $G$, with the datapoints themselves generated by the corresponding mixture components. If $H$ is infinite, the support of the DP is likewise infinite. During inference, we marginalize over $G$.
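To make the DP machinery concrete, here is a minimal sketch (ours, not part of the paper; all names are illustrative) of drawing category assignments with $G$ marginalized out, using the Chinese restaurant process representation:

```python
import numpy as np

def crp_assignments(n_items, alpha, rng=None):
    """Draw cluster assignments from DP(alpha, H) with G marginalized out,
    via the Chinese restaurant process. H only matters when parameters are
    later drawn for each cluster, not for the assignments themselves."""
    rng = rng or np.random.default_rng(0)
    counts = []                                # counts[k] = items in cluster k
    assignments = []
    for i in range(n_items):
        # existing cluster k w.p. counts[k]/(i + alpha); new w.p. alpha/(i + alpha)
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                   # open a new cluster
        else:
            counts[k] += 1
        assignments.append(int(k))
    return assignments

print(crp_assignments(10, alpha=1.0))          # e.g. [0, 0, 1, 0, 2, ...]
```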
Following previous models of vowel learning (8; 50; 26; 9), we assume that vowel tokens are drawn from a Gaussian mixture model. The Infinite Gaussian Mixture Model (IGMM) (35) includes a DP prior, as described above, in which the base distribution $H_c$ generates multivariate Gaussians drawn from a Normal Inverse-Wishart (NIW) prior. (This compound distribution is equivalent to $\Sigma_c \sim \mathrm{IW}(\Lambda_0, \nu_0)$, $\mu_c \mid \Sigma_c \sim \mathcal{N}(\mu_0, \Sigma_c/\kappa_0)$.) Each observation, a formant vector $x_{ij}$, is drawn from the Gaussian corresponding to its category assignment $c_{ij}$:

$$G_c \sim \mathrm{DP}(\alpha_c, H_c) \tag{1}$$
$$\mu_c, \Sigma_c \sim H_c = \mathrm{NIW}(\mu_0, \kappa_0, \Lambda_0, \nu_0) \tag{2}$$
$$c_{ij} \sim G_c \tag{3}$$
$$x_{ij} \mid c_{ij} \sim \mathcal{N}(\mu_{c_{ij}}, \Sigma_{c_{ij}}) \tag{4}$$
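As a sketch of the generative process in Eqs. (1)–(4), the following code (ours; hyperparameter values are illustrative, not the paper's settings) draws one vowel category from the NIW base distribution and then samples formant tokens from it:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)

def sample_category(mu0, kappa0, lambda0, nu0):
    """One draw from H_c = NIW(mu0, kappa0, Lambda0, nu0): a covariance
    from the inverse-Wishart, then a mean given that covariance."""
    sigma_c = invwishart.rvs(df=nu0, scale=lambda0, random_state=rng)
    mu_c = rng.multivariate_normal(mu0, sigma_c / kappa0)
    return mu_c, sigma_c

# Illustrative 2-D (F1, F2) hyperparameters:
mu0 = np.array([500.0, 1500.0])
lambda0 = np.diag([1.0e4, 1.0e5])
mu_c, sigma_c = sample_category(mu0, kappa0=1.0, lambda0=lambda0, nu0=4.0)

# Vowel tokens x_ij assigned to this category (Eq. 4):
tokens = rng.multivariate_normal(mu_c, sigma_c, size=5)
```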
The above model generates a category assignment $c_{ij}$ for each vowel token $x_{ij}$. This is the baseline IGMM model, which clusters vowel tokens using bottom-up distributional information only; the LD model adds top-down information by assigning categories at the level of the lexicon rather than at the token level.
In the LD model, vowel phones appear within words drawn from the lexicon. Each such lexeme is represented as a frame $\ell_f$ plus a list of vowel categories $\ell_j$. Lexeme assignments for each token are drawn from a DP with a lexicon-generating base distribution $H_\ell$. The category for each vowel token in the word is determined by the lexeme; the formant values are drawn from the corresponding Gaussian as in the IGMM:
$$G_\ell \sim \mathrm{DP}(\alpha_\ell, H_\ell) \tag{5}$$
$$x_i = \ell \sim G_\ell \tag{6}$$
$$x_{ij} \mid \ell_j \sim \mathcal{N}(\mu_{\ell_j}, \Sigma_{\ell_j}) \tag{7}$$
$H_\ell$ generates lexemes by first drawing the number of phones from a geometric distribution and the number of consonant phones from a binomial distribution. The consonants are then generated from a DP with a uniform base distribution (but note that they are fixed at inference time, i.e., observed categorically), while the vowel phones are generated by the IGMM DP above, $\ell_j \sim G_c$.
Note that two draws from $H_\ell$ may result in identical lexemes; these are nonetheless treated as separate (homophone) lexemes.
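The following sketch (ours; parameter values are illustrative, and the consonant DP is simplified to uniform draws) shows the shape of the lexeme base distribution $H_\ell$ just described:

```python
import numpy as np

def sample_lexeme_frame(p_len, p_cons, n_cons_types, rng):
    """Draw a lexeme skeleton from (a simplification of) H_l: length from
    a geometric distribution, number of consonants from a binomial,
    consonant identities uniform (standing in for the consonant DP),
    with '_' marking vowel slots to be filled by draws from the IGMM DP G_c."""
    n_phones = rng.geometric(p_len)
    n_cons = rng.binomial(n_phones, p_cons)
    slots = rng.permutation([True] * n_cons + [False] * (n_phones - n_cons))
    return ['C%d' % rng.integers(n_cons_types) if s else '_' for s in slots]

print(sample_lexeme_frame(0.3, 0.5, 24, np.random.default_rng(3)))
# e.g. ['C7', '_', 'C19', '_']
```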
The TLD model retains the IGMM vowel phone component, but extends the lexicon of the LD model by adding topic-specific lexicons, which capture the notion that lexeme probabilities are topic-dependent. Specifically, the TLD model replaces the Dirichlet process lexicon with a Hierarchical Dirichlet Process (HDP; Teh (43)). In the HDP lexicon, a top-level global lexicon is generated as in the LD model. Topic-specific lexicons are then drawn from the global lexicon, each containing a subset of the global lexicon (though since the size of the global lexicon is unbounded, so are the topic-specific lexicons). These topic-specific lexicons are used to generate the tokens in a similar manner to the LD model. The number of lower-level topic-lexicons is fixed, matched to the number of topics in the LDA model used to infer the topic distributions (see Section 6.4).
More formally, the global lexicon is generated as a top-level DP, $G_\ell \sim \mathrm{DP}(\alpha_\ell, H_\ell)$ (see Section 3.2; recall that $H_\ell$ includes draws from the IGMM over vowel categories). $G_\ell$ is in turn used as the base distribution in the topic-level DPs, $G_k \sim \mathrm{DP}(\alpha_k, G_\ell)$. In the Chinese Restaurant Franchise metaphor often used to describe HDPs, $G_\ell$ is a global menu of dishes (lexemes). The topic-specific lexicons are restaurants, each with its own distribution over dishes; this distribution is defined by seating customers (word tokens) at tables, each of which serves a single dish from the menu: all tokens at the same table are assigned to the same lexeme $\ell$. Inference (Section 5) is defined in terms of tables rather than lexemes; if multiple tables draw the same dish from $G_\ell$, tokens at these tables share a lexeme.
In the TLD model, tokens appear within situations, each of which has a distribution over topics $\theta_h$. Each token has a co-indexed topic assignment variable, $z_{hi}$, drawn from $\theta_h$, designating the topic-lexicon from which the table for $x_{hi}$ is to be drawn. The formant values $x_{hij}$ are drawn in the same way as in the LD model, given the lexeme assignment at the table. This results in the following model, shown in Figure 2:
$$G_\ell \sim \mathrm{DP}(\alpha_\ell, H_\ell) \tag{8}$$
$$G_k \sim \mathrm{DP}(\alpha_k, G_\ell) \tag{9}$$
$$z_{hi} \sim \mathrm{Mult}(\theta_h) \tag{10}$$
$$x_{hi} = \ell \sim G_{z_{hi}} \tag{11}$$
$$x_{hij} \mid \ell_j \sim \mathcal{N}(\mu_{\ell_j}, \Sigma_{\ell_j}) \tag{12}$$
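Here is a minimal sketch (ours; not the paper's implementation) of how a single token is generated under Eqs. (8)–(12): draw a topic from the observed $\theta_h$, then seat the token at a table in that topic's restaurant; a new table would draw its lexeme from the global menu $G_\ell$, which is elided here:

```python
import numpy as np

def generate_token(theta_h, topic_tables, alpha_k, rng):
    """topic_tables[k] is a list of per-table customer counts for topic k.
    Returns the sampled topic z and table index t for one token."""
    z = rng.choice(len(theta_h), p=theta_h)        # Eq. (10)
    tables = topic_tables[z]
    probs = np.array(tables + [alpha_k], dtype=float)
    probs /= probs.sum()                           # CRP over tables, Eq. (11)
    t = rng.choice(len(probs), p=probs)
    if t == len(tables):
        tables.append(1)   # new table: its dish (lexeme) comes from G_l
    else:
        tables[t] += 1
    return int(z), int(t)

rng = np.random.default_rng(4)
topic_tables = [[], [], []]                        # three empty topic-restaurants
theta_h = [0.7, 0.2, 0.1]                          # observed situation topics
print([generate_token(theta_h, topic_tables, alpha_k=1.0, rng=rng)
       for _ in range(5)])
```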
We use Gibbs sampling to infer three sets of variables in the TLD model: assignments to vowel categories in the lexemes, assignments of tokens to topics, and assignments of tokens to tables (from which the assignment to lexemes can be read off).
Each vowel in the lexicon must be assigned to a category in the IGMM. The posterior probability of a category assignment combines the DP prior over categories and the likelihood of the observed vowels belonging to that category. We use $X_{\ell j}$ to denote the set of vowel formants at position $j$ in word tokens that have been assigned to lexeme $\ell$. Then,
$$P(\ell_j = c \mid X_{\ell j}, \ldots) \;\propto\; P(\ell_j = c)\; p(X_{\ell j} \mid \ell_j = c, \ldots) \tag{13}$$
The first (DP prior) factor is defined as:
$$P(\ell_j = c) = \begin{cases} \dfrac{n_c}{\sum_{c'} n_{c'} + \alpha_c} & \text{if } c \text{ is an existing category} \\[6pt] \dfrac{\alpha_c}{\sum_{c'} n_{c'} + \alpha_c} & \text{if } c \text{ is a new category} \end{cases} \tag{14}$$
where $n_c$ is the number of other vowels in the lexicon assigned to category $c$. Note that there is always positive probability of creating a new category.
The likelihood of the vowels is calculated by marginalizing over all possible means and variances of the Gaussian category parameters, given the NIW prior. For a single point (i.e., if $|X_{\ell j}| = 1$), this predictive posterior takes the form of a multivariate Student-$t$ distribution; for the more general case see Feldman et al. (11), Eq. B3.
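For reference, here is a sketch (ours) of the standard NIW posterior predictive density that this likelihood term in Eq. (13) relies on, using the generic conjugate-update formulas rather than the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_t

def niw_posterior_predictive(x_new, X, mu0, kappa0, lambda0, nu0):
    """Predictive density p(x_new | X) under a Normal Inverse-Wishart
    prior, marginalizing over the category's mean and covariance
    (standard conjugate result)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                    # scatter matrix
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    lambda_n = (lambda0 + S
                + (kappa0 * n / kappa_n) * np.outer(xbar - mu0, xbar - mu0))
    df = nu_n - d + 1                                # Student-t degrees of freedom
    shape = lambda_n * (kappa_n + 1) / (kappa_n * df)
    return multivariate_t(loc=mu_n, shape=shape, df=df).pdf(x_new)
```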
We jointly sample $t_{hi}$ and $z_{hi}$, the variables assigning tokens to tables and topics. Resampling the table assignment includes the possibility of changing to a table with a different lexeme or drawing a new table with a previously seen or novel lexeme. The joint conditional probability of a table and topic assignment, given all other current token assignments, is:
$$P(t_{hi} = t, z_{hi} = k \mid \ldots) \;\propto\; P(z_{hi} = k \mid \theta_h)\; P(t_{hi} = t \mid z_{hi} = k, \ldots)\; p(\vec{x}_{hi} \mid \ell_t, \ldots) \tag{15}$$
The first factor, the prior probability of topic $k$ in situation $h$, is given by $\theta_{hk}$, obtained from the LDA. The second factor is the prior probability of assigning token $x_{hi}$ to table $t$ with lexeme $\ell$ given topic $k$. It is given by the HDP, and depends on whether the table $t$ exists in the HDP topic-lexicon for $k$ and, likewise, whether any table in the topic-lexicon has the lexeme $\ell$:
$$P(t_{hi} = t \mid z_{hi} = k, \ldots) \propto \begin{cases} \dfrac{n_t}{n_k + \alpha_k} & \text{if } t \text{ exists in topic } k \\[6pt] \dfrac{\alpha_k}{n_k + \alpha_k} \cdot \dfrac{m_\ell}{m + \alpha_\ell} & \text{if } t \text{ is new and } \ell \text{ exists} \\[6pt] \dfrac{\alpha_k}{n_k + \alpha_k} \cdot \dfrac{\alpha_\ell}{m + \alpha_\ell} & \text{if } t \text{ and } \ell \text{ are both new} \end{cases} \tag{16}$$
Here $n_t$ is the number of other tokens at table $t$, $n_k$ is the total number of tokens in topic $k$, $m_\ell$ is the number of tables across all topics with the lexeme $\ell$, and $m$ is the total number of tables.
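A sketch (ours; variable names hypothetical) of the unnormalized seating probabilities in Eq. (16):

```python
def table_prior(tables_k, lexeme_tables, total_tables, alpha_k, alpha_l):
    """Unnormalized seating probabilities for a token in topic k: each
    existing table t (with n_t customers), or a new table serving either
    an existing lexeme l (with m_l tables across topics) or a new lexeme.
    tables_k: per-table counts in topic k; lexeme_tables: {lexeme: m_l}."""
    n_k = sum(tables_k)
    denom_k = n_k + alpha_k
    denom_l = total_tables + alpha_l
    probs = {('table', t): n_t / denom_k for t, n_t in enumerate(tables_k)}
    new_table = alpha_k / denom_k
    for lex, m_l in lexeme_tables.items():           # reuse an existing lexeme
        probs[('new_table', lex)] = new_table * m_l / denom_l
    probs[('new_table', '<new lexeme>')] = new_table * alpha_l / denom_l
    return probs
```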
The third factor, the likelihood of the vowel formants $\vec{x}_{hi}$ in the categories given by the lexeme $\ell$, has the same form as the likelihood of vowel categories when resampling lexeme vowel assignments. However, here it is calculated over the set of vowels in the token assigned to each vowel category (i.e., the vowels at indices $j$ where $\ell_j = c$). For a new lexeme, we approximate the likelihood using 100 samples drawn from the prior, each weighted by 1/100 (28).
The three hyperparameters governing the HDP over the lexicon, $\alpha_\ell$ and $\alpha_k$, and the DP over vowel categories, $\alpha_c$, are estimated using a slice sampler. The remaining hyperparameters, for the vowel category and lexeme priors, are set to the same values used by Feldman et al. (11).
We test our model on situated child-directed speech, taken from the C1 section of the Brent corpus in CHILDES (6; 20). This corpus consists of transcripts of speech directed at infants between the ages of 9 and 15 months, captured in a naturalistic setting as parent and child went about their day. This ensures variability of situations.
Utterances with unintelligible words or quotes are removed. We restrict the corpus to content words by retaining only words tagged as adj, n, part and v (adjectives, nouns, particles, and verbs). This is in line with evidence that infants distinguish content and function words on the basis of acoustic signals (38). Vowel categorization improves when attending only to more prosodically and phonologically salient tokens (1), which generally appear within content, not function words. The final corpus consists of 13138 tokens and 1497 word types.
The transcripts do not include phonetic information, so, following Feldman et al. (11), we synthesize the formant values using data from Hillenbrand et al. (17). This dataset consists of a set of 1669 manually gathered formant values from 139 American English speakers (men, women and children) for 12 vowels. For each vowel category, we construct a Gaussian from the mean and covariance of the datapoints belonging to that category, using the first and second formant values measured at steady state. We also construct a second dataset using only datapoints from adult female speakers.
Each word in the dataset is converted to a phonemic representation using the CMU pronunciation dictionary, which returns a sequence of Arpabet phoneme symbols. If there are multiple possible pronunciations, the first one is used. Each vowel phoneme in the word is then replaced by formant values drawn from the corresponding Hillenbrand Gaussian for that vowel.
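A sketch of this synthesis step (ours; `hb_data` below is a toy stand-in for the Hillenbrand measurements, and the pronunciation is assumed to come from a CMU dictionary lookup):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-in for the Hillenbrand data: Arpabet vowel -> (n, 2) array of
# steady-state (F1, F2) measurements. Real values would be read from the
# published dataset.
hb_data = {
    'IH': rng.normal([430.0, 2000.0], [40.0, 150.0], size=(50, 2)),
    'IY': rng.normal([300.0, 2300.0], [30.0, 150.0], size=(50, 2)),
}

# One Gaussian per vowel category, from the mean and covariance of its points:
vowel_gaussians = {v: (pts.mean(axis=0), np.cov(pts, rowvar=False))
                   for v, pts in hb_data.items()}

def synthesize(pron):
    """Replace each vowel phoneme in an Arpabet pronunciation with formant
    values sampled from that vowel's Gaussian; consonants stay symbolic."""
    out = []
    for ph in pron:
        base = ph.rstrip('012')              # strip stress markers
        if base in vowel_gaussians:
            mu, cov = vowel_gaussians[base]
            out.append(rng.multivariate_normal(mu, cov))
        else:
            out.append(ph)
    return out

print(synthesize(['K', 'IH1', 'T', 'IY0']))  # kitty: K, [F1 F2], T, [F1 F2]
```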
The Arpabet encoding used in the phonemic representation includes 24 consonants. We construct datasets both using the full set of consonants—the ‘C24’ dataset—and with less fine-grained consonant categories. Distinguishing all consonant categories assumes perfect learning of consonants prior to vowel categorization and is thus somewhat unrealistic (29), but provides an upper limit on the information that word-contexts can give.
In the ‘C15’ dataset, the voicing distinction is collapsed, leaving 15 consonant categories. The collapsed categories are B/P, G/K, D/T, CH/JH, V/F, TH/DH, S/Z, SH/ZH, and R/L, while HH, M, NG, N, W, and Y remain separate phonemes. This dataset mirrors the finding of Mani and Plunkett (22) that 12-month-old infants are not sensitive to voicing mispronunciations.
The ‘C6’ dataset distinguishes only six coarse consonant categories, corresponding to stops (B, P, G, K, D, T), affricates (CH, JH), fricatives (V, F, TH, DH, S, Z, SH, ZH, HH), nasals (M, NG, N), liquids (R, L), and semivowels/glides (W, Y). This dataset makes minimal assumptions about the consonant categories that infants could use in this learning setting.
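A sketch of the consonant-collapsing step (ours; the mappings are transcribed directly from the dataset descriptions above):

```python
# C15: collapse the voicing distinction; HH, M, NG, N, W, Y map to themselves.
C15_PAIRS = [('B', 'P'), ('G', 'K'), ('D', 'T'), ('CH', 'JH'), ('V', 'F'),
             ('TH', 'DH'), ('S', 'Z'), ('SH', 'ZH'), ('R', 'L')]
C15 = {ph: '/'.join(pair) for pair in C15_PAIRS for ph in pair}

# C6: six coarse consonant classes.
C6 = {}
for cls, phones in [('STOP', 'B P G K D T'), ('AFFRICATE', 'CH JH'),
                    ('FRICATIVE', 'V F TH DH S Z SH ZH HH'),
                    ('NASAL', 'M NG N'), ('LIQUID', 'R L'),
                    ('GLIDE', 'W Y')]:
    for ph in phones.split():
        C6[ph] = cls

def collapse(frame, mapping):
    """Map each consonant in a frame to its coarse class; '_' marks vowel slots."""
    return [mapping.get(ph, ph) for ph in frame]

print(collapse(['B', '_', 'T'], C6))   # bat's frame -> ['STOP', '_', 'STOP']
```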
Decreasing the number of consonants increases the ambiguity in the corpus: bat not only shares a frame (b_t) with boat and bite, but also, in the C15 dataset, with put, pad and bad (b/p_d/t), and in the C6 dataset, with dog and kite, among many others (STOP_STOP). Table 1 shows the percentage of types and tokens that are ambiguous in each dataset, that is, words whose frames match multiple word types. Note that we always evaluate against the gold word identities, even when these are not distinguished in the model's input. These datasets are intended to evaluate the degree of reliance on consonant information in the LD and TLD models, and to what extent the topics in the TLD model can replace this information.
The input to the TLD model includes a distribution over topics for each situation, which we infer in advance from the full Brent corpus (not only the C1 subset) using LDA. Each transcript in the Brent corpus captures about 75 minutes of parent-child interaction, so multiple situations will be included in each file. The transcripts do not delimit situations, so we do this somewhat arbitrarily by splitting each transcript after 50 CDS utterances, resulting in 203 situations for the Brent C1 dataset. In addition to function words, we also remove the five most frequent content words (be, go, get, want, come). On average, situations are only 59 words long, reflecting the relative lack of content words in CDS utterances.
We infer 50 topics for this set of situations using the MALLET toolkit (25). Hyperparameters are inferred, which leads to a dominant topic that includes mainly light verbs (have, let, see, do). The other topics are less frequent but capture more specific semantic content (e.g., yummy, peach, cookie, daddy, bib in one topic; shoe, let, put, hat, pants in another). The word-topic assignments are used to calculate the unsmoothed situation-topic distributions used by the TLD model.
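The paper uses MALLET for this step; as an illustrative stand-in (ours; `situations` is assumed to be a list of 50-utterance transcript chunks, each a list of content-word tokens, and the smoothing behavior differs from the unsmoothed distributions described above), gensim's LDA produces the same kind of situation-topic vectors:

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(situations)
bows = [dictionary.doc2bow(s) for s in situations]
lda = models.LdaModel(bows, num_topics=50, id2word=dictionary,
                      passes=10, random_state=0)

# theta_h for each situation h, used as observed input to the TLD model;
# the topic-word distributions themselves are discarded.
thetas = [lda.get_document_topics(bow, minimum_probability=0.0)
          for bow in bows]
```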
Table 1: Number of input word types and distinct frames in each dataset, and the percentage of types and tokens that are ambiguous.

| Dataset | C24 | C15 | C6 |
|---|---|---|---|
| Input Types | 1487 | 1426 | 1203 |
| Frames | 1259 | 1078 | 702 |
| Ambig Types % | 27.2 | 42.0 | 80.4 |
| Ambig Tokens % | 41.3 | 56.9 | 77.2 |
We evaluate against adult categories, i.e., the 'gold standard', since all learners of a language eventually converge on similar categories. (Since ours is not a model of the learning process, we do not compare the time course of learning in infants to that of the algorithm.) We evaluate both the inferred phonetic categories and the inferred words using the clustering evaluation measure V-Measure (VM; 36). (Other clustering measures, such as 1-1 matching and pairwise precision and recall (accuracy and completeness), showed the same trends, but VM has been demonstrated to be the most stable measure when comparing solutions with varying numbers of clusters (7).) VM is the harmonic mean of two components, similar to F-score, where the components (VC and VH) are measures of cross entropy between the gold and model categorizations. For vowels, VM measures how well the inferred phonetic categorization matches the gold categories; for lexemes, it measures whether tokens have been assigned to the same lexemes both by the model and the gold standard. Words are evaluated against gold orthography, so homophones, e.g. hole and whole, are distinct gold words.
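As a concrete illustration of the metric (our example, with toy labels rather than actual results), scikit-learn provides V-measure and its two components:

```python
from sklearn.metrics import (v_measure_score, homogeneity_score,
                             completeness_score)

# Gold vowel categories vs. a hypothetical model clustering:
gold  = ['ae', 'ae', 'eh', 'eh', 'iy', 'iy']
model = [0, 0, 0, 1, 2, 2]

print(homogeneity_score(gold, model))   # VH: "precision"-like component
print(completeness_score(gold, model))  # VC: "recall"-like component
print(v_measure_score(gold, model))     # harmonic mean of the two
```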
We compare all three models—TLD, LD, and IGMM—on the vowel categorization task, and TLD and LD on the lexical categorization task (since IGMM does not infer a lexicon). The datasets vary along two dimensions: whether vowel tokens are synthesized from the categories of all speakers or of adult female speakers only, and the coarseness of the observed consonant categories. Each condition (model, vowel speakers, consonant set) is run five times, using 1500 iterations of Gibbs sampling with hyperparameter sampling. Overall, we find that TLD outperforms the other models in both tasks, across all conditions.
Vowel categorization results are shown in Figure 3. IGMM performs substantially worse than both TLD and LD, with scores more than 30 points lower than the best results for these models, clearly showing the value of the protolexicon and replicating the results found by Feldman et al. (11) on this dataset. Furthermore, TLD consistently outperforms the LD model, finding better phonetic categories, both for vowels generated from the combined categories of all speakers (‘all’) and vowels generated from adult female speakers only (‘w’), although the latter are clearly much easier for both models to learn. Both models perform less well when the consonant frames provide less information, but the TLD model performance degrades less than the LD performance.
Both the TLD and the LD models find ‘supervowel’ categories, which cover multiple vowel categories and are used to merge minimal pairs into a single lexical item. Figure 4 shows example vowel categories inferred by the TLD model, including two supervowels. The TLD supervowels are used much less frequently than the supervowels found by the LD model, containing, on average, only two-thirds as many tokens.
Figure 5 shows that TLD also outperforms LD on the lexeme/word categorization task. Again performance decreases as the consonant categories become coarser, but the additional semantic information in the TLD model compensates for the lack of consonant information. In the individual components of VM, TLD and LD have similar VC (“recall”), but TLD has higher VH (“precision”), demonstrating that the semantic information given by the topics can separate potentially ambiguous words, as hypothesized.
Overall, the contextual semantic information added in the TLD model leads to both better phonetic categorization and to a better protolexicon, especially when the input is noisier, using degraded consonants. Since infants are not likely to have perfect knowledge of phonetic categories at this stage, semantic information is a potentially rich source of information that could be drawn upon to offset noise from other domains. The form of the semantic information added in the TLD model is itself quite weak, so the improvements shown here are in line with what infant learners could achieve.
Language acquisition is a complex task, in which many heterogeneous sources of information may be useful. In this paper, we investigated whether contextual semantic information could be of help when learning phonetic categories. We found that this contextual information can improve phonetic learning performance considerably, especially in situations where there is a high degree of phonetic ambiguity in the word-forms that learners hear. This suggests that previous models that have ignored semantic information may have underestimated the information that is available to infants. Our model illustrates one way in which language learners might harness the rich information that is present in the world without first needing to acquire a full inventory of word meanings.
The contextual semantic information that the TLD model tracks is similar to that potentially used in other linguistic learning tasks. Theories of cross-situational word learning (40; 53) assume that sensitivity to situational co-occurrences between words and non-linguistic contexts is a precursor to learning the meanings of individual words. Under this view, contextual semantics is available to infants well before they have acquired large numbers of semantic minimal pairs. However, recent experimental evidence indicates that learners do not always retain detailed information about the referents that are present in a scene when they hear a word (27; 49). This evidence poses a direct challenge to theories of cross-situational word learning. Our account does not necessarily require learners to track co-occurrences between words and individual objects, but instead focuses on more abstract information about salient events and topics in the environment; it will be important to investigate to what extent infants encode this information and use it in phonetic learning.
Regardless of the specific way in which infants encode semantic information, our method of adding this information by using LDA topics from transcript data was shown to be effective. This method is practical because it can approximate semantic information without relying on extensive manual annotation.
The LD model extended the phonetic categorization task by adding word contexts; the TLD model presented here goes even further, adding larger situational contexts. Both forms of top-down information help the low-level task of classifying acoustic signals into phonetic categories, furthering a holistic view of language learning with interaction across multiple levels.
This work was supported by EPSRC grant EP/H050442/1 and a James S. McDonnell Foundation Scholar Award to the final author.