We propose a spectral approach for unsupervised constituent parsing that comes with theoretical guarantees on latent structure recovery. Our approach is grammarless: we directly learn the bracketing structure of a given sentence without using a grammar model. The main algorithm is based on lifting the concept of additive tree metrics for structure learning of latent trees, developed in the phylogenetic and machine learning communities, to the case where the tree structure varies across examples. Although finding the "minimal" latent tree is NP-hard in general, we show that for projective trees it can be found efficiently using bilexical parsing algorithms. Empirically, our algorithm performs favorably compared to the constituent-context model of Klein and Manning (2002), without the need for careful initialization.
Solutions to the problem of grammar induction have been sought since the early days of computational linguistics, and the problem is interesting from both cognitive and engineering perspectives. Cognitively, it is more plausible to assume that children observe only the terminal strings of parse trees and not the parse trees themselves, which makes the unsupervised setting a better model for studying language acquisition. From the engineering perspective, training data for unsupervised parsing exists in abundance (i.e. sentences and part-of-speech tags), and is much cheaper to obtain than the syntactically annotated data required for supervised training.
Most existing solutions treat the problem of unsupervised parsing by assuming a generative process over parse trees, e.g. probabilistic context-free grammars [Jelinek et al.1992] and the constituent-context model [Klein and Manning2002]. Learning then reduces to finding a set of parameters, estimated by identifying a local maximum of an objective function such as the likelihood [Klein and Manning2002] or a variant of it [Smith and Eisner2005, Cohen and Smith2009, Headden et al.2009, Spitkovsky et al.2010b, Gillenwater et al.2010, Golland et al.2012]. Unfortunately, finding the global maximum for these objective functions is usually intractable [Cohen and Smith2012], which often leads to severe local optima problems (but see Gormley and Eisner, 2013). Thus, strong experimental results are often achieved through initialization techniques [Klein and Manning2002, Gimpel and Smith2012], incremental dataset use [Spitkovsky et al.2010a], and other specialized techniques for avoiding local optima such as count transforms [Spitkovsky et al.2013]. These approaches, while empirically promising, generally lack theoretical justification.
On the other hand, recently proposed spectral methods approach the problem via restriction of the PCFG model [Hsu et al.2012] or matrix completion [Bailly et al.2013]. These novel perspectives offer strong theoretical guarantees but are not designed to achieve competitive empirical results.
In this paper, we suggest a different approach that provides a first step toward bridging this theory-experiment gap. More specifically, we approach unsupervised constituent parsing from the perspective of structure learning, as opposed to parameter learning. We associate each sentence with an undirected latent tree graphical model, which is a tree consisting of both observed variables (corresponding to the words in the sentence) and an additional set of latent variables that are unobserved in the data. This undirected latent tree is then directed via a direction mapping to give the final constituent parse.
In our framework, parsing reduces to finding the best latent structure for a given sentence. However, due to the presence of latent variables, structure learning of latent trees is substantially more complicated than in observed models. As before, one solution would be local search heuristics.
Intuitively, however, latent tree models encode low rank dependencies among the observed variables permitting the development of “spectral” methods that can lead to provably correct solutions. In particular we leverage the concept of additive tree metrics [Buneman1971, Buneman1974] in phylogenetics and machine learning that can create a special distance metric among the observed variables as a function of the underlying spectral dependencies [Choi et al.2011, Song et al.2011, Anandkumar et al.2011, Ishteva et al.2012]. Additive tree metrics can be leveraged by “meta-algorithms” such as neighbor-joining [Saitou and Nei1987] and recursive grouping [Choi et al.2011] to provide consistent learning algorithms for latent trees.
Moreover, we show that it is desirable to learn the “minimal” latent tree based on the tree metric (“minimum evolution” in phylogenetics). While this criterion is in general NP-hard [Desper and Gascuel2005], for projective trees we find that a bilexical parsing algorithm can be used to find an exact solution efficiently [Eisner and Satta1999].
Unlike in phylogenetics and graphical models, where a single latent tree is constructed for all the data, in our case, each part of speech sequence is associated with its own parse tree. This leads to a severe data sparsity problem even for moderately long sentences. To handle this issue, we present a strategy that is inspired by ideas from kernel smoothing in the statistics community [Zhou et al.2010, Kolar et al.2010b, Kolar et al.2010a]. This allows principled sharing of samples from different but similar underlying distributions.
We provide theoretical guarantees on the recovery of the correct underlying latent tree and characterize the associated sample complexity of our technique. Empirically, we evaluate our method on data in English, German, and Chinese. Our algorithm performs favorably compared to Klein and Manning's (2002) constituent-context model (CCM), without the need for careful initialization. In addition, we analyze CCM's sensitivity to initialization and compare our results to Seginer's algorithm [Seginer2007].
In this section, we detail the learning setting and the conditional latent tree model whose structure we learn.
Let $w = (w_1, \ldots, w_\ell)$ be a vector of words corresponding to a sentence of length $\ell$. Each $w_j$ is represented by a vector $x_j \in \mathbb{R}^d$ for $j \in \{1, \ldots, \ell\}$. The vector $x_j$ is an embedding of the word in some space, chosen from a fixed dictionary that maps word types to $\mathbb{R}^d$. In addition, let $t = (t_1, \ldots, t_\ell)$ be the associated vector of part-of-speech (POS) tags (i.e. $t_j$ is the POS tag of $w_j$).
In our learning algorithm, we assume that examples of the form $(w^{(i)}, t^{(i)})$ for $i \in \{1, \ldots, N\}$ are given, and the goal is to predict a bracketing parse tree for each of these examples. The word embeddings are used during the learning process, but the final decoder that the learning algorithm outputs maps a POS tag sequence to a parse tree. While ideally we would want to use the word information in decoding as well, much of the syntax of a sentence is determined by the POS tags, and a relatively high level of accuracy can be achieved by learning, for example, a supervised parser from POS tag sequences.
Just like our decoder, our model assumes that the bracketing of a given sentence is a function of its POS tags. The POS tags are generated from some distribution, followed by a deterministic generation of the bracketing parse tree. Then, latent states are generated for each bracket, and finally, the latent states at the yield of the bracketing parse tree generate the words of the sentence (in the form of embeddings). The latent states are represented by vectors $z \in \mathbb{R}^m$, where $m < d$.
For intuition, consider a simple tag sequence consisting of a verb followed by a determiner and a noun (the verb's direct object). Two candidate constituent parse structures are shown in Figure 2; the correct one is boxed in green (the other in red). Recall that our training data contains word sequences with this tag sequence.
Intuitively, the words in such phrases exhibit dependencies that can reveal the parse structure. The determiner and the direct object are correlated, in that the choice of determiner depends on the plurality of the noun. However, the choice of verb is mostly independent of the determiner. We could thus conclude that the determiner and the noun should be closer in the parse tree than the verb and the determiner, giving us the correct structure. Informally, the latent state corresponding to the bracket spanning the determiner and the noun would store information about the plurality of the noun, the key to the dependence between the determiner and the noun. It would then be reasonable to assume that the verb and the determiner are independent given this latent state.
Following this intuition, we propose to model the distribution over the latent bracketing states and words for each tag sequence as a latent tree graphical model, which encodes conditional independences among the words given the latent states.
Let $x = (x_1, \ldots, x_\ell)$ represent the word embeddings and $z = (z_1, \ldots, z_H)$ represent the latent states of the bracketings, where $H$ is the number of latent nodes. Then, according to our base model it holds that:
$p(x_1, \ldots, x_\ell, z_1, \ldots, z_H \mid t) \;=\; \prod_{i=1}^{H} p(z_i \mid z_{\pi(i)}, \theta(t)) \, \prod_{j=1}^{\ell} p(x_j \mid z_{\pi(j)}, \theta(t))$   (1)
where $\pi(\cdot)$ returns the parent node index of its argument in the latent tree corresponding to tag sequence $t$ (at this point, $\pi$ refers to an arbitrary direction of the undirected tree $u(t)$). If $z_i$ is the root, then $p(z_i \mid z_{\pi(i)}, \theta(t)) = p(z_i \mid \theta(t))$. All the $x_j$ are assumed to be leaves, while all the $z_i$ are internal (i.e. non-leaf) nodes. The parameters $\theta(t)$ control the conditional probability tables. We do not commit to a certain parametric family, but see more about the assumptions we make about $\theta(t)$ in §3.2. The parameter space is denoted $\Theta$. The model assumes a factorization according to a latent-variable tree. The latent variables can incorporate various linguistic properties, such as head information, valence of the dependency being generated, and so on. This information is expected to be learned automatically from data.
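To make the factorization in Eq. 1 concrete, the sketch below builds a tiny latent tree with four observed leaves and two latent nodes, fills in random conditional probability tables, and evaluates the joint of Eq. 1 as well as the marginal over the latent states. The tree shape, the use of discrete observations (the model itself allows arbitrary embeddings), the alphabet sizes, and all variable names are illustrative assumptions, not the paper's setup.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected latent tree, arbitrarily rooted at z1 for the factorization:
# leaves x1, x2 attach to z1; leaves x3, x4 attach to z2; z2's parent is z1.
parent = {"z2": "z1", "x1": "z1", "x2": "z1", "x3": "z2", "x4": "z2"}
m, k = 2, 3  # number of latent states and of observed symbols (toy sizes)

def random_cpt(rows, cols):
    """Random conditional probability table; rows index the parent's state."""
    cpt = rng.random((rows, cols))
    return cpt / cpt.sum(axis=1, keepdims=True)

p_root = random_cpt(1, m)[0]                       # p(z1)
cpt = {node: random_cpt(m, m if node.startswith("z") else k)
       for node in parent}                         # p(node | its latent parent)

def joint(x, z):
    """Eq. 1: the root prior times one conditional per edge of the tree."""
    prob = p_root[z["z1"]]
    for node, par in parent.items():
        val = z[node] if node.startswith("z") else x[node]
        prob *= cpt[node][z[par], val]
    return prob

def marginal(x):
    """p(x): sum the joint over all assignments to the latent states."""
    return sum(joint(x, {"z1": a, "z2": b})
               for a, b in itertools.product(range(m), repeat=2))

x = {"x1": 0, "x2": 2, "x3": 1, "x4": 1}
print("p(x) =", marginal(x))
# sanity check: the marginals over all observed assignments sum to one
total = sum(marginal(dict(zip(["x1", "x2", "x3", "x4"], vals)))
            for vals in itertools.product(range(k), repeat=4))
print("sum over all x:", round(total, 6))
```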
Our generative model deterministically maps a POS sequence to a bracketing via an undirected latent-variable tree. The orientation of the tree is determined by a direction mapping, which is fixed during learning and decoding. This means our decoder first identifies (given a POS sequence) an undirected tree, and then orients it by applying this mapping (see below).
Define $\mathcal{U}$ to be the set of undirected latent trees in which all internal nodes have degree exactly three (i.e. they correspond to binary bracketings), and in which, for any $u \in \mathcal{U}$, the bracketing obtained from $u$ via the direction mapping is projective (explained later in this section). In addition, let $\mathcal{B}$ be the set of binary bracketings. The complete generative model that we follow is then:
Generate a tag sequence $t = (t_1, \ldots, t_\ell)$.
Decide on $u(t) \in \mathcal{U}$, the undirected latent tree that $t$ maps to.
Set the directed tree by applying the direction mapping to $u(t)$.
Set the bracketing $B \in \mathcal{B}$ induced by the directed tree.
Generate a tuple $(x_1, \ldots, x_\ell, z_1, \ldots, z_H)$ according to Eq. 1.
See Figure 1 (left) for an example.
Generating a bracketing via an undirected tree enables us to build on existing methods for structure learning of latent-tree graphical models [Choi et al.2011, Anandkumar et al.2011]. Our learning algorithm focuses on recovering the undirected tree of the generative model described above. This undirected tree is converted into a directed tree by applying the direction mapping, which works in three steps:
It first chooses a top bracket $((1, R), (R+1, \ell))$, where $R$ is the mid-point of the bracket and $\ell$ is the length of the sentence.
It marks the edge that splits the tree according to the top bracket as the "root edge" (marked in red in Figure 1, center).
It then creates the directed tree from $u(t)$ by directing the tree outward from the root edge, as shown in Figure 1 (center).
The resulting directed tree is a binary bracketing parse tree. As implied by the above definition of the direction mapping, selecting which edge is the root can be interpreted as determining the top bracket of the constituent parse. For example, in Figure 1, the "root" edge partitions the leaves into precisely the top bracket. As indicated in the previous section, we restrict the set of undirected trees to those for which the resulting directed tree is projective, i.e. has no crossing brackets. In §4.1, we discuss an effective heuristic to find the top bracket without supervision.
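As an illustration of these three steps, the sketch below directs a small undirected latent tree outward from a chosen root edge and reads off the resulting brackets. The adjacency representation, the function name h_dir, and the encoding of brackets as leaf spans are our own assumptions for illustration, not the paper's implementation.

```python
def h_dir(adj, leaves, root_edge):
    """Direct an undirected latent tree outward from `root_edge` and return the
    set of brackets (as inclusive leaf spans).  `adj` maps node -> neighbors,
    `leaves` lists the word nodes in sentence order, `root_edge` is a pair of
    adjacent nodes.  Names and representation are illustrative assumptions."""
    a, b = root_edge

    def yield_of(node, came_from):
        # Word positions reachable from `node` without crossing `came_from`;
        # this is the yield of `node` once every edge is directed away from
        # the root edge.
        if node in leaves:
            return [leaves.index(node)]
        out = []
        for nb in adj[node]:
            if nb != came_from:
                out.extend(yield_of(nb, node))
        return sorted(out)

    brackets = set()

    def collect(node, came_from):
        span = yield_of(node, came_from)
        brackets.add((min(span), max(span)))
        for nb in adj[node]:
            if nb != came_from:
                collect(nb, node)

    collect(a, b)                          # one side of the root edge ...
    collect(b, a)                          # ... and the other (the top bracket split)
    brackets.add((0, len(leaves) - 1))     # the whole sentence
    return brackets

# toy 4-word sentence with latent nodes z1, z2; the chosen root edge is (z1, z2)
adj = {"z1": ["w0", "w1", "z2"], "z2": ["w2", "w3", "z1"],
       "w0": ["z1"], "w1": ["z1"], "w2": ["z2"], "w3": ["z2"]}
print(sorted(h_dir(adj, ["w0", "w1", "w2", "w3"], ("z1", "z2"))))
# -> [(0, 0), (0, 1), (0, 3), (1, 1), (2, 2), (2, 3), (3, 3)], i.e. ((w0 w1)(w2 w3))
```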
Our goal is to recover the undirected tree $u(t)$ for a tag sequence $t$ using the data $\{(w^{(i)}, t^{(i)})\}_{i=1}^{N}$. To get an intuition about the algorithm, consider a partition of the set of examples such that each section in the partition contains examples with an identical sequence of part-of-speech tags. Assume for this section that each such set is large (we address the data sparsity issue in §3.4).
We can then proceed by learning how to map a POS sequence $t$ to a tree $u(t)$ by focusing only on the examples in the corresponding section of the partition.
Directly attempting to maximize the likelihood unfortunately results in an intractable optimization problem, and greedy heuristics are often employed [Harmeling and Williams2011]. Instead, we propose a method that is provably consistent and returns a tree that can be mapped to a bracketing using the direction mapping.
If all the variables were observed, then the Chow-Liu algorithm [Chow and Liu1968] could be used to find the most likely tree structure. The Chow-Liu algorithm essentially computes the distances among all pairs of variables (the negative of the mutual information) and then finds the minimum-cost tree. However, the fact that the $z_i$ are latent makes this strategy substantially more complicated. In particular, it becomes challenging to compute the distances among pairs of latent variables. What is needed is a "special" distance function that allows us to reverse engineer the distances among the latent variables given the distances among the observed variables. This is the key idea behind the additive tree metrics that are the basis of our approach.
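For readers less familiar with the Chow-Liu algorithm, here is a minimal sketch of the fully observed case: estimate pairwise mutual information from samples and take the minimum-cost spanning tree under the distance d = -I. This is background only, not part of our algorithm; the binned plug-in estimator and the toy chain data are our own choices.

```python
import numpy as np

def mutual_information(a, b, bins=4):
    """Plug-in estimate of I(A; B) from two 1-d samples, after equal-width binning."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum())

def min_cost_tree(cost):
    """Prim's algorithm over a dense cost matrix (upper triangle used)."""
    n = cost.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: cost[min(e), max(e)])
        edges.append((min(i, j), max(i, j)))
        in_tree.add(j)
    return edges

def chow_liu(samples):
    """Chow-Liu tree over the columns of `samples` (n_samples x n_variables)."""
    n_vars = samples.shape[1]
    cost = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            cost[i, j] = -mutual_information(samples[:, i], samples[:, j])
    return min_cost_tree(cost)

# toy chain X0 -> X1 -> X2: the recovered edges should be (0, 1) and (1, 2)
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = x0 + 0.3 * rng.normal(size=2000)
x2 = x1 + 0.3 * rng.normal(size=2000)
print(chow_liu(np.column_stack([x0, x1, x2])))
```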
In the following sections, we describe the key steps to our method. §3.1 and §3.2 largely describe existing background on additive tree metrics and latent tree structure learning, while §3.3 and §3.4 discuss novel aspects that are unique to our problem.
Let $u(t)$ be the true undirected tree of a sentence with tag sequence $t$, and assume its nodes are indexed so that the first $\ell$ nodes correspond to the observed words and the remaining nodes to the latent states. Furthermore, let $i$ and $j$ refer to nodes in the undirected tree (either observed or latent). We assume the existence of a distance function $d(i, j)$ that allows us to compute distances between pairs of nodes. For example, as we see in §3.2, we will define the distance to be a function of the covariance matrix $\Sigma(i, j)$. Thus, if $i$ and $j$ are both observed variables, the distance can be directly computed from the data.
Moreover, the metrics we construct are such that they are tree additive, defined below:
Definition 1. A function $d$ is an additive tree metric [Erdõs et al.1999] for the undirected tree $u$ if it is a distance metric (i.e. it satisfies $d(i, j) = 0$ if and only if $i = j$, obeys the triangle inequality, and is symmetric), and furthermore the following relation holds:
$d(i, j) \;=\; \sum_{(a, b) \in \mathrm{path}(i, j; u)} d(a, b)$   (2)
where $\mathrm{path}(i, j; u)$ is the set of all the edges in the (undirected) path from $i$ to $j$ in the tree $u$.
As we describe below, given the tree structure, the additive tree metric property allows us to compute “backwards” the distances among the latent variables as a function of the distances among the observed variables.
Define $D$ to be the distance matrix among all the variables, i.e. $D_{ij} = d(i, j)$, and consider its word-word, latent-word, and latent-latent sub-blocks (the latent-word block being the transpose of the word-latent block). In addition, since the tag sequence is assumed to be known from context, we drop the dependence on it from the notation.
Given the fact that the distance between a pair of nodes is a function of the random variables they represent (according to the true model), only the word-word sub-block can be empirically estimated from data. However, if the underlying tree structure is known, then Definition 1 can be leveraged to compute the latent-word and latent-latent sub-blocks, as we show below.
We first show how to compute $d(i, j)$ for all pairs $(i, j)$ such that $i$ and $j$ are adjacent to each other in $u$, based only on observed nodes. It then follows that the other elements of the distance matrix can be computed based on Definition 1. To show how to compute distances between adjacent nodes, consider two cases: (1) $(i, j)$ is a leaf edge; (2) $(i, j)$ is an internal edge.
(Figure 3 illustrates the two cases and the associated partitions of the leaves.)
Case 1: leaf edge. Assume without loss of generality that $i$ is the leaf and $j$ is an internal latent node. Then $j$ must have exactly two other neighbors $a$ and $b$. Let $A$ denote the set of nodes that are closer to $a$ than to $b$, and similarly let $B$ denote the set of nodes that are closer to $b$ than to $a$. Let $\mathcal{A}^*$ and $\mathcal{B}^*$ denote all the leaves (word nodes) in $A$ and $B$, respectively. Then, using path additivity (Definition 1), it can be shown that for any $a^* \in \mathcal{A}^*$ and $b^* \in \mathcal{B}^*$ it holds that:
$d(i, j) \;=\; \tfrac{1}{2}\left(d(i, a^*) + d(i, b^*) - d(a^*, b^*)\right)$   (3)
Note that the right-hand side only depends on distances between observed random variables.
Case 2: internal edge. Both $i$ and $j$ are internal nodes. In this case, $i$ has exactly two other neighbors $a$ and $b$, and similarly, $j$ has exactly two other neighbors $c$ and $g$. Let $A$ denote the set of nodes closer to $a$ than to $b$, and define $B$, $C$, and $G$ analogously. Let $\mathcal{A}^*$, $\mathcal{B}^*$, $\mathcal{C}^*$, and $\mathcal{G}^*$ refer to the leaves in $A$, $B$, $C$, and $G$, respectively. Then for any $a^* \in \mathcal{A}^*$, $b^* \in \mathcal{B}^*$, $c^* \in \mathcal{C}^*$, and $g^* \in \mathcal{G}^*$, it can be shown that:
$d(i, j) \;=\; \tfrac{1}{2}\left(d(a^*, c^*) + d(b^*, g^*) - d(a^*, b^*) - d(c^*, g^*)\right)$   (4)
One can obtain a more robust empirical estimate by averaging over all valid choices of $a^*$ and $b^*$ in Eq. 3 and all valid choices of $a^*$, $b^*$, $c^*$, and $g^*$ in Eq. 4 [Desper and Gascuel2005].
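As a concrete check of Eqs. 3 and 4, the sketch below builds a small tree with known edge lengths, computes the leaf-leaf distances by path additivity (Definition 1), and then recovers a leaf edge and the internal edge from those leaf-leaf distances alone, averaging over the valid representative leaves as suggested above. The tree shape and edge lengths are arbitrary toy choices.

```python
import itertools
import numpy as np

# Toy unrooted binary tree over leaves 0..3 with internal nodes "i" and "j":
#   0 - i, 1 - i, i - j, j - 2, j - 3   (edge lengths chosen arbitrarily)
edge_len = {(0, "i"): 1.0, (1, "i"): 2.0, ("i", "j"): 0.5, ("j", 2): 1.5, ("j", 3): 0.7}
adj = {}
for (a, b), w in edge_len.items():
    adj.setdefault(a, {})[b] = w
    adj.setdefault(b, {})[a] = w

def dist(a, b, came_from=None):
    """Path-additive distance (Eq. 2): sum of edge lengths on the unique a-b path."""
    if a == b:
        return 0.0
    for nb, w in adj[a].items():
        if nb == came_from:
            continue
        d = dist(nb, b, a)
        if d is not None:
            return w + d
    return None

leaves = [0, 1, 2, 3]
D = np.array([[dist(a, b) for b in leaves] for a in leaves])  # observed block only

# Eq. 3 (leaf edge 0-i): the other neighbors of i lead to A* = {1} and B* = {2, 3}
eq3 = np.mean([0.5 * (D[0, a] + D[0, b] - D[a, b]) for a in [1] for b in [2, 3]])
print("recovered d(0, i):", eq3, " true:", edge_len[(0, "i")])

# Eq. 4 (internal edge i-j): A* = {0}, B* = {1} (sides of i), C* = {2}, G* = {3} (sides of j)
eq4 = np.mean([0.5 * (D[a, c] + D[b, g] - D[a, b] - D[c, g])
               for a, b, c, g in itertools.product([0], [1], [2], [3])])
print("recovered d(i, j):", eq4, " true:", edge_len[("i", "j")])
```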
In constructing our distance metric, we begin with the following assumption on the distribution in Eq. 1 (analogous to the assumptions made in Anandkumar et al., 2011).
Assume that for every latent node $z_i$ with latent parent $z_{\pi(i)}$, $\mathbb{E}[z_i \mid z_{\pi(i)}] = A^{(i)} z_{\pi(i)}$, where $A^{(i)}$ has rank $m$.
Similarly, assume that for every observed node $x_j$ with latent parent $z_{\pi(j)}$, $\mathbb{E}[x_j \mid z_{\pi(j)}] = C^{(j)} z_{\pi(j)}$, where $C^{(j)}$ has rank $m$.
Also assume that $\mathbb{E}[z_i z_i^{\top}]$ has rank $m$ for every latent node $z_i$.
Note that the matrices $A^{(i)}$ and $C^{(j)}$ are a direct function of $\theta(t)$, but we do not specify a model family for $\theta(t)$. The only restriction is in the form of the above assumption. If $x_j$ and $z_i$ were discrete, represented as binary indicator vectors, the above assumption would correspond to requiring all conditional probability tables in the latent tree to have rank $m$. Assumption 1 allows the $x_j$ to be high-dimensional features, as long as the expectation requirement above is satisfied. Similar assumptions are made in spectral parameter learning methods, e.g. Hsu et al. (2009), Bailly et al. (2009), Parikh et al. (2011), and Cohen et al. (2012).
Furthermore, Assumption 1 makes it explicit that, regardless of the size of $d$, the relationships among the variables in the latent tree are restricted to be of rank $m$, and are thus low rank since $m < d$. To leverage this low-rank structure, we propose using the following additive metric, a normalized variant of that in Anandkumar et al. (2011):
$d_{\mathrm{spectral}}(i, j) \;=\; -\log \Lambda_m\!\left(\Sigma(i, j)\right) + \tfrac{1}{2}\log \Lambda_m\!\left(\Sigma(i, i)\right) + \tfrac{1}{2}\log \Lambda_m\!\left(\Sigma(j, j)\right)$   (5)
where $\Lambda_m(M)$ denotes the product of the top $m$ singular values of $M$, and $\Sigma(i, j)$ is the uncentered cross-covariance matrix of the variables at nodes $i$ and $j$.
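A minimal sketch of the spectral distance, assuming the normalization reconstructed in Eq. 5 above (negative log of the top-m singular value product of the cross-covariance, plus half of the corresponding terms for the two auto-covariances). The function names and the synthetic low-rank data are our own.

```python
import numpy as np

def lambda_m(M, m):
    """Product of the top-m singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.prod(s[:m]))

def spectral_distance(Sigma_ij, Sigma_ii, Sigma_jj, m):
    """Eq. 5 (as reconstructed here): a normalized log singular-value distance."""
    return (-np.log(lambda_m(Sigma_ij, m))
            + 0.5 * np.log(lambda_m(Sigma_ii, m))
            + 0.5 * np.log(lambda_m(Sigma_jj, m)))

# toy usage: embeddings at two word positions, driven by a shared rank-m factor
rng = np.random.default_rng(0)
n, d, m = 500, 8, 2
z = rng.normal(size=(n, m))                              # latent factor
xi = z @ rng.normal(size=(m, d)) + 0.1 * rng.normal(size=(n, d))
xj = z @ rng.normal(size=(m, d)) + 0.1 * rng.normal(size=(n, d))
Sigma_ij = xi.T @ xj / n                                 # uncentered cross-covariance
Sigma_ii = xi.T @ xi / n
Sigma_jj = xj.T @ xj / n
print(spectral_distance(Sigma_ij, Sigma_ii, Sigma_jj, m))
```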
We can then show that this metric is additive for the latent tree under Assumption 1; a proof is in the supplementary material for completeness. From here on, we use $d(i, j)$ to denote $d_{\mathrm{spectral}}(i, j)$, since that is the metric we use in our learning algorithm.
It has been shown [Rzhetsky and Nei1993] that for any additive tree metric, the underlying tree $u$ can be recovered by solving:
$\hat{u} \;=\; \arg\min_{u' \in \mathcal{U}} c(u') \;=\; \arg\min_{u' \in \mathcal{U}} \sum_{(i, j) \in E(u')} d(i, j)$   (6)
where $E(u')$ is the set of pairs of nodes that are adjacent to each other in $u'$, and $d(i, j)$ for pairs involving latent nodes is computed using Eq. 3 and Eq. 4.
Note that the metric we use in defining $c(u')$ is based on expectations under the true distribution. In practice, the true distribution is unknown, and therefore we use an approximation $\hat{d}$ of the distance metric $d$. As we discussed in §3.1, all elements of the distance matrix are functions of observable quantities if the underlying tree $u$ is known. However, only the word-word sub-block can be directly estimated from the data without knowledge of the tree structure.
This subtlety makes solving the minimization problem in Eq. 6 NP-hard [Desper and Gascuel2005] if $u'$ is allowed to be an arbitrary undirected tree. However, if we restrict $u'$ to be in $\mathcal{U}$, as we do above, then the minimization over $\mathcal{U}$ can be solved using the bilexical parsing algorithm of Eisner and Satta (1999). This is because the computation of the other sub-blocks of the distance matrix depends only on the partitions of the nodes shown in Figure 3 into the sets $\mathcal{A}^*$, $\mathcal{B}^*$, $\mathcal{C}^*$, and $\mathcal{G}^*$, and not on the entire tree structure.
Therefore, the procedure to find a bracketing for a given POS tag sequence is to first estimate the word-word distance sub-block from raw text data (see §3.4), and then solve the optimization problem $\hat{u} = \arg\min_{u' \in \mathcal{U}} \hat{c}(u')$ using a variant of the Eisner-Satta algorithm, where $\hat{c}(u')$ is identical to $c(u')$ in Eq. 6 with $d$ replaced by $\hat{d}$.
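We do not reproduce the Eisner-Satta dynamic program here. As a hedged stand-in that is only feasible for short sentences, the sketch below enumerates all projective binary bracketings, converts each to its unrooted latent tree, scores it with the cost of Eq. 6 using the reconstructed Eqs. 3 and 4 applied to an estimated word-word distance block, and returns the minimizer. The representations and names are our own; it assumes at least three words.

```python
import numpy as np
from itertools import product

def bracketings(i, j):
    """All rooted binary bracketings of the span [i, j); a single leaf is its index."""
    if j - i == 1:
        yield i
        return
    for k in range(i + 1, j):
        for left in bracketings(i, k):
            for right in bracketings(k, j):
                yield (left, right)

def unrooted_tree(brack):
    """Adjacency of the unrooted latent tree: internal nodes get negative ids and
    the degree-2 root of the rooted bracketing is contracted away."""
    adj, next_id = {}, -1

    def build(node):
        nonlocal next_id
        if isinstance(node, int):
            adj.setdefault(node, [])
            return node
        left, right = build(node[0]), build(node[1])
        me, next_id = next_id, next_id - 1
        adj[me] = [left, right]
        adj[left].append(me)
        adj[right].append(me)
        return me

    root = build(brack)
    a, b = adj.pop(root)
    adj[a].remove(root); adj[b].remove(root)
    adj[a].append(b); adj[b].append(a)
    return adj

def side_leaves(adj, start, avoid):
    """Leaves (non-negative ids) reachable from `start` without passing `avoid`."""
    seen, stack, out = {avoid}, [start], []
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v >= 0:
            out.append(v)
        stack.extend(adj[v])
    return out

def edge_distance(adj, D, i, j):
    """Estimate d(i, j) for adjacent nodes via Eq. 3 (leaf edge) or Eq. 4
    (internal edge), averaged over all representative leaves."""
    if i >= 0 or j >= 0:                                  # leaf edge
        leaf, z = (i, j) if i >= 0 else (j, i)
        n1, n2 = [n for n in adj[z] if n != leaf]
        A, B = side_leaves(adj, n1, z), side_leaves(adj, n2, z)
        vals = [0.5 * (D[leaf, a] + D[leaf, b] - D[a, b]) for a, b in product(A, B)]
    else:                                                 # internal edge
        n1, n2 = [n for n in adj[i] if n != j]
        n3, n4 = [n for n in adj[j] if n != i]
        A, B = side_leaves(adj, n1, i), side_leaves(adj, n2, i)
        C, G = side_leaves(adj, n3, j), side_leaves(adj, n4, j)
        vals = [0.5 * (D[a, c] + D[b, g] - D[a, b] - D[c, g])
                for a, b, c, g in product(A, B, C, G)]
    return float(np.mean(vals))

def best_projective_tree(D):
    """Minimize the cost of Eq. 6 by brute force over projective binary trees."""
    ell = D.shape[0]
    best_cost, best_adj = float("inf"), None
    for brack in bracketings(0, ell):
        adj = unrooted_tree(brack)
        edges = {tuple(sorted((v, n))) for v in adj for n in adj[v]}
        cost = sum(edge_distance(adj, D, a, b) for a, b in edges)
        if cost < best_cost:
            best_cost, best_adj = cost, adj
    return best_adj, best_cost

# usage sketch: D would be the estimated word-word block of spectral distances
rng = np.random.default_rng(0)
M = rng.random((4, 4))
D = (M + M.T) / 2
np.fill_diagonal(D, 0.0)
print(best_projective_tree(D))
```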
Inputs: Set of examples $(w^{(i)}, t^{(i)})$ for $i \in \{1, \ldots, N\}$, a kernel $k_\gamma$, an integer $m$.
Data structures: For each example $i$ and each pair of word positions $(j_1, j_2)$, an (uncentered) covariance matrix $\hat{\Sigma}^{(i)}(j_1, j_2)$ and a distance $\hat{d}^{(i)}(j_1, j_2)$.
Algorithm:
(Covariance estimation)
For each example $i$ and each pair of positions $(j_1, j_2)$, estimate the covariance matrix as a kernel-weighted average of the outer products of word embeddings at position pairs drawn from all training examples, with the weights given by $k_\gamma$ and normalized to sum to one.
Compute the distances $\hat{d}^{(i)}(j_1, j_2)$ using Eq. 5.
(Uncover structure)
Find $\hat{u}^{(i)} = \arg\min_{u' \in \mathcal{U}} \hat{c}(u')$, and for the $i$th example, return the structure obtained by applying the direction mapping to $\hat{u}^{(i)}$.
Summary. We first defined a generative model that describes how a sentence, its sequence of POS tags, and its bracketing are generated (§2.3). First an undirected tree $u(t)$ is generated (only as a function of the POS tags), and then it is mapped to a bracketing using the direction mapping. We then showed that we can define a distance metric between nodes in the undirected tree such that minimizing the total cost over trees leads to a recovery of $u(t)$. This distance metric can be computed based only on the text, without needing to identify the latent information (§3.2). If the true distance metric is known with respect to the true distribution that generates the words in a sentence, then $u(t)$ can be fully recovered by optimizing the cost function $c(u)$ in Eq. 6. However, in practice the distance metric must be estimated from data, as discussed below.
We now address the data sparsity problem: the number of examples sharing a given POS sequence can be very small, and therefore estimating the distances for each POS sequence separately can be problematic. (This data sparsity problem is quite severe; for example, the Penn treebank [Marcus et al.1993] has a total of 43,498 sentences with 42,246 unique POS tag sequences, an average of roughly 1.04 sentences per sequence.)
In order to estimate the distances from data, we need to estimate the covariance matrices $\Sigma(i, j)$ in Eq. 5 for every pair of word positions $(i, j)$.
To motivate our solution, consider estimating the covariance matrix between a determiner and the noun that follows it in some tag sequence $t$. The number of training sentences with exactly this tag sequence may be insufficient for an accurate empirical estimate. However, consider another tag sequence $t'$ in which a determiner is also followed by a noun, possibly at different positions. Although $t$ and $t'$ are not identical, it is likely that the corresponding covariance matrices are similar, because the determiner and the noun appear in a similar syntactic context. In contrast, the covariance matrix of a determiner-noun pair whose members appear in a different syntactic context should not be expected to be similar.
The observation that the covariance matrices depend on local syntactic context is the main driving force behind our solution. The local syntactic context acts as an "anchor," which enhances or replaces a word index in a sentence with its local syntactic context. More formally, an anchor is a function that maps a word index and a sequence of POS tags to a local context; the specific anchor we use is described together with the kernel in §4.1. The covariance matrices are then estimated using kernel smoothing [Hastie et al.2009], where the smoother tests similarity between the different anchors.
The full learning algorithm is given in Figure 3.3. The first step in the algorithm is to estimate the covariance matrix block $\hat{\Sigma}(j_1, j_2)$ for each training example and each pair of preterminal positions $(j_1, j_2)$. Instead of computing this block only from the examples that share the given tag sequence, the algorithm uses all position pairs from all training examples. It averages the empirical covariance matrices from these contexts using a kernel weight, which gives a similarity measure between the position pair in the given example and a position pair in another example. The kernel bandwidth $\gamma$ is a user-specified parameter that controls how inclusive the kernel will be with respect to other examples (see §4.1 for a concrete example). Note that the learning algorithm ensures that the kernel weight equals one when the two position pairs and their anchors are identical.
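A sketch of the covariance-smoothing step described above. The data layout, the variable names, and the crude binary kernel used for the demonstration are placeholders (the kernel we actually use is described in §4.1); this is an illustration of the pooling idea, not the paper's implementation.

```python
import numpy as np

def smoothed_covariance(j1, j2, t, data, kernel):
    """Kernel-smoothed estimate of the uncentered covariance Sigma(j1, j2) for the
    position pair (j1, j2) of tag sequence t.  `data` is a list of
    (embeddings, tags) pairs, where embeddings is a (length x d) array;
    `kernel(j1, j2, t, j1p, j2p, tp)` returns a non-negative similarity weight."""
    d = data[0][0].shape[1]
    num, den = np.zeros((d, d)), 0.0
    for emb, tags in data:
        for j1p in range(len(tags)):
            for j2p in range(len(tags)):
                if j1p == j2p:
                    continue
                w = kernel(j1, j2, t, j1p, j2p, tags)
                if w > 0.0:
                    num += w * np.outer(emb[j1p], emb[j2p])
                    den += w
    return num / den if den > 0 else num

# toy usage with a crude kernel: weight 1 when the two POS tags and their order match
def toy_kernel(j1, j2, t, j1p, j2p, tp):
    same_tags = (t[j1], t[j2]) == (tp[j1p], tp[j2p])
    same_dir = (j1 < j2) == (j1p < j2p)
    return 1.0 if same_tags and same_dir else 0.0

rng = np.random.default_rng(0)
data = [(rng.normal(size=(3, 5)), ["DT", "NN", "VBD"]),
        (rng.normal(size=(4, 5)), ["DT", "JJ", "NN", "VBD"])]
Sigma = smoothed_covariance(0, 1, ["DT", "NN", "VBD"], data, toy_kernel)
print(Sigma.shape)   # (5, 5): pooled over matching (DT, NN) position pairs
```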
Once the empirical estimates for the covariance matrices are obtained, a variant of the Eisner-Satta algorithm is used, as mentioned in §3.3.
Our main theoretical guarantee is that Algorithm 1 will recover the correct tree with high probability, if the given top bracket is correct and if we obtain enough examples from the model in §2. We state the guarantee informally below; the hidden constants and the full proof are in the supplementary material.
Informally, the guarantee is as follows. Let $\hat{u}(t)$ denote the estimated tree for tag sequence $t$ and let $u(t)$ be the correct tree. Suppose the number of training examples is sufficiently large, where the required amount depends on the latent and observed dimensions $m$ and $d$, on the smallest relevant singular values of the cross-covariance matrices, on the gap between the cost of the correct tree and the cost of the best incorrect tree, and on a quantity, defined in the supplementary material, that is a function of the underlying distribution over the tag sequences and the kernel bandwidth $\gamma$. Then, with high probability, $\hat{u}(t) = u(t)$.
Thus, the sample complexity of our approach depends on the dimensionality of the latent and observed states ($m$ and $d$), on the underlying singular values of the cross-covariance matrices, and on the difference between the cost of the true tree and the cost of the incorrect trees.
We report results on three different languages: English, German, and Chinese. For English we use the Penn treebank [Marcus et al.1993], with sections 2–21 for training and section 23 for final testing. For German and Chinese we use the Negra treebank and the Chinese treebank, respectively; the first 80% of the sentences are used for training and the last 20% for testing. All punctuation is removed from the data (we make brief use of punctuation for our top-bracket heuristic, detailed below, before removing it).
We primarily compare our method to the constituent-context model (CCM) of Klein and Manning (2002). We also compare our method to the algorithm of Seginer (2007).
Our algorithm requires the top bracket in order to direct the latent tree. In practice, we find it with the following three-step heuristic (a code sketch is given after the list):
If there exists a comma/semicolon/colon at index $k$ that has at least one verb before it and both a noun and a subsequent verb after it, then return the bracket that splits the sentence at $k$ as the top bracket. (Pick the rightmost comma/semicolon/colon if multiple indices satisfy the criterion.)
Otherwise, find the first non-participle verb (say at index $k$) and return the top bracket that splits the sentence at that verb.
If no verb exists, return a default top bracket.
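The following sketch implements the heuristic above on a POS/token sequence in which punctuation tokens are still present. The tag tests (tags starting with "V" or "N", treating VBG/VBN as participles) and the exact split conventions are our own assumptions about details the text leaves open.

```python
def top_bracket(tags, punct=(",", ";", ":")):
    """Heuristic top-bracket split point for a POS/token sequence `tags`, following
    the three steps above.  The returned index k encodes the bracketing
    ((0, k-1), (k, n-1)).  The tag tests and split conventions are assumptions."""
    n = len(tags)
    is_verb = lambda x: x.startswith("V")
    is_noun = lambda x: x.startswith("N")

    # Step 1: rightmost punctuation mark with a verb before it and a noun
    # followed (later) by a verb after it.
    for k in range(n - 1, -1, -1):
        if tags[k] in punct and any(is_verb(x) for x in tags[:k]):
            after = tags[k + 1:]
            noun_pos = [i for i, x in enumerate(after) if is_noun(x)]
            if noun_pos and any(is_verb(x) for x in after[noun_pos[0] + 1:]):
                return k + 1          # split just after the punctuation mark
    # Step 2: split immediately before the first non-participle verb.
    for k, x in enumerate(tags):
        if is_verb(x) and x not in ("VBG", "VBN"):
            return max(k, 1)
    # Step 3: no verb; fall back to splitting off the first token.
    return 1

print(top_bracket(["DT", "NN", "VBD", "DT", "NN"]))   # -> 2 (split before the verb)
```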
As mentioned earlier, each word embedding $x_j$ can be an arbitrary feature vector. For all languages, we use Brown clustering [Brown et al.1992] to construct a feature vector in which the first elements indicate which mergeable clusters the word belongs to, and the last elements indicate the cluster identity. For English, more sophisticated word embeddings are easily obtainable, and we experiment with the neural word embeddings of Turian et al. (2010), of length 50. We also explored two types of CCA embeddings, OSCCA and TSCCA, given in Dhillon et al. (2012). The OSCCA embeddings behaved better, so we only report their results.
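One way to realize the Brown-cluster feature vector described above is sketched below: indicators for the coarser (mergeable) clusters a word passes through, obtained as prefixes of its cluster bit string, followed by a one-hot indicator of the final cluster identity. The prefix lengths and the toy cluster inventory are illustrative assumptions.

```python
import numpy as np

def brown_feature(bitstring, all_clusters, prefix_lens=(2, 4, 6)):
    """Indicator features from a Brown cluster bit string: one block per prefix
    length (which coarser, mergeable cluster the word falls into) followed by a
    one-hot block for the full cluster identity.  An illustrative construction."""
    blocks = []
    for p in prefix_lens:
        prefixes = sorted({c[:p] for c in all_clusters})
        vec = np.zeros(len(prefixes))
        vec[prefixes.index(bitstring[:p])] = 1.0
        blocks.append(vec)
    identity = np.zeros(len(all_clusters))
    identity[sorted(all_clusters).index(bitstring)] = 1.0
    blocks.append(identity)
    return np.concatenate(blocks)

clusters = ["0010", "0011", "0100", "1100", "1101"]   # toy cluster inventory
print(brown_feature("0011", clusters, prefix_lens=(1, 2)))
```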
For our experiments, we use a kernel $k_\gamma$, where $\gamma$ denotes the user-specified bandwidth. The kernel is non-zero if and only if the tags at the two anchor positions in one example are identical to the tags at the corresponding positions in the other example, and the direction (linear order) between the two positions is identical in both examples. Note that the kernel is not binary, unlike the theoretical kernel in the supplementary material: its value decays as a function of the difference between the distance separating the two positions in one example and the distance separating them in the other. Our experiments show that using such a graded weight does better in practice than a constant weight of 1.
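A sketch of a kernel with the stated properties: zero unless the POS tags at the two positions and their direction match across the examples, and otherwise decaying with the difference between the two position offsets. The truncated linear decay scaled by the bandwidth is our assumption; only the qualitative behavior is taken from the text.

```python
def kernel(j1, j2, t, j1p, j2p, tp, gamma):
    """Smoothing kernel sketch: non-zero only if the POS tags at the two positions
    match across the examples and the direction (linear order) matches; the value
    decays with the difference between the two position offsets.  The truncated
    linear decay and the scaling by the bandwidth `gamma` are assumptions."""
    tags_match = (t[j1], t[j2]) == (tp[j1p], tp[j2p])
    dir_match = (j1 < j2) == (j1p < j2p)
    if not (tags_match and dir_match):
        return 0.0
    gap = abs(abs(j1 - j2) - abs(j1p - j2p))
    return max(0.0, 1.0 - gap / gamma)

# identical anchors get weight 1, as required by the learning algorithm
print(kernel(0, 1, ["DT", "NN", "VBD"], 0, 1, ["DT", "NN", "VBD"], gamma=2.0))  # 1.0
print(kernel(0, 1, ["DT", "NN", "VBD"], 0, 2, ["DT", "JJ", "NN"], gamma=2.0))   # 0.5
```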
| Length | CCM | CCM-U | CCM-OB | CCM-UB |
|---|---|---|---|---|
|  | 72.5 | 57.1 | 58.2 | 62.9 |
|  | 54.1 | 36 | 24 | 23.7 |
|  | 50 | 34.7 | 19.3 | 19.1 |
|  | 47.2 | 30.7 | 16.8 | 16.6 |
|  | 44.8 | 29.6 | 15.3 | 15.2 |
|  | 26.3 | 13.5 | 13.9 | 13.8 |
Columns, left to right: English (NN-O, NN, CC-O, CC, BC-O, BC, CCM), German (BC-O, BC, CCM), and Chinese (BC-O, BC, CCM). Rows are grouped into training-set and test-set results.

|  | NN-O | NN | CC-O | CC | BC-O | BC | CCM | BC-O | BC | CCM | BC-O | BC | CCM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train | 70.9 | 69.2 | 70.4 | 68.7 | 71.1 | 69.3 | 72.5 | 64.6 | 59.9 | 62.6 | 64.9 | 57.3 | 46.1 |
|  | 55.1 | 53.5 | 53.2 | 51.6 | 53.0 | 51.5 | 50 | 52.7 | 48.7 | 47.9 | 51.4 | 46 | 22.4 |
|  | 46.1 | 44.5 | 43.6 | 41.9 | 43.3 | 41.8 | 26.3 | 46.7 | 43.6 | 19.8 | 42.6 | 38.6 | 15 |
| test | 69.2 | 66.7 | 68.3 | 65.5 | 68.9 | 66.1 | 70.5 | 66.4 | 61.6 | 64.7 | 58.0 | 53.2 | 40.7 |
|  | 60.3 | 58.3 | 58.6 | 56.4 | 58.6 | 56.5 | 53.8 | 57.5 | 53.5 | 49.6 | 54.3 | 49.4 | 35.9 |
|  | 54.1 | 52.3 | 52.3 | 50.3 | 51.9 | 50.2 | 50.4 | 52.8 | 49.2 | 48.9 | 49.7 | 45.5 | 20.1 |
|  | 50.8 | 49.0 | 48.6 | 46.6 | 48.3 | 46.6 | 47.4 | 50.0 | 46.8 | 45.6 | 46.7 | 42.7 | 17.8 |
|  | 48.1 | 46.3 | 45.6 | 43.7 | 45.4 | 43.8 | 44.9 | 48.3 | 45.4 | 21.9 | 44.6 | 40.7 | 16.1 |
|  | 45.5 | 43.8 | 43.0 | 41.1 | 42.7 | 41.1 | 26.1 | 46.9 | 44.1 | 20.1 | 42.2 | 38.6 | 14.3 |
For CCM, we found that if the full dataset (all sentence lengths) is used in training, then performance degrades when evaluating on shorter sentences. We therefore restrict the data used with CCM to sentences whose length does not exceed the maximal sentence length being evaluated. This does not happen with our algorithm, which manages to leverage lexical information whenever more data is available. We therefore use the full data for our method for all lengths.
We also experimented with the original POS tags and the universal POS tags of Petrov et al. (2011). Here, we found that our method does better with the universal POS tags. For CCM, we also experimented with the original parts of speech, the universal tags (CCM-U), the cross-product of the original parts of speech with the Brown clusters (CCM-OB), and the cross-product of the universal tags with the Brown clusters (CCM-UB). The results in Table 1 indicate that the vanilla setting is the best for CCM.
Thus, for all results, we use universal tags for our method and the original POS tags for CCM. We believe that our approach replaces the need for fine-grained POS tags with lexical information. CCM, on the other hand, is fully unlexicalized.
Our method requires two parameters: the latent dimension $m$ and the bandwidth $\gamma$. CCM also has two parameters: the numbers of extra constituent and distituent counts used for smoothing. For both methods, we chose the best parameters for short sentences on the English Penn Treebank training data and used this setting for all other experiments; this results in a single fixed choice of $m$ and $\gamma$ for our method and of the extra constituent/distituent counts for CCM. We also tried letting CCM choose different hyperparameters for different sentence lengths based on dev-set likelihood, but this gave worse results than holding them fixed.
Table 2 summarizes our results. CCM is used with the initializer proposed in Klein and Manning (2002) (we used the implementation available at http://tinyurl.com/lhwk5n6). NN, CC, and BC indicate the performance of our method with neural embeddings, CCA embeddings, and Brown clustering, respectively, using the top-bracket heuristic described in §4.1. NN-O, CC-O, and BC-O indicate that the oracle (i.e. true top bracket) was used instead. For our method, test-set results are obtained by running the algorithm in Figure 3.3 on the test sentences (except that the distances are computed using the training data).
For English, while CCM behaves better for short sentences, our algorithm is more robust on longer sentences. This is especially noticeable at the longest lengths, where CCM breaks down and our algorithm remains stable. We find that the neural embeddings modestly outperform the CCA and Brown cluster embeddings.
The results for German are similar, except that CCM breaks down at shorter sentence lengths. For Chinese, our method substantially outperforms CCM for all lengths; note that CCM performs very poorly there, obtaining low accuracy even for short sentences. We did not have neural embeddings for German and Chinese (the embedding type that worked best for English) and thus only used Brown cluster embeddings.
For English, the disparity between NN-O (oracle top bracket) and NN (heuristic top bracket) is small, suggesting that our top-bracket heuristic is quite effective. However, for German and Chinese, BC-O performs substantially better than BC, suggesting that a better top-bracket heuristic would further increase performance.
The EM algorithm for CCM requires very careful initialization, which is described in Klein and Manning (2002). If, on the other hand, random initialization is used, the performance of CCM varies greatly. Figure 4 shows a histogram of performance on short sentences for different random initializers. As one can see, for some restarts CCM obtains very low accuracies due to local optima. Our method does not suffer from local optima and thus does not require careful initialization.
Our approach is not directly comparable to Seginer's because he uses punctuation, while we use POS tags. Using Seginer's parser, we were able to obtain results on the training sets for English, German, and Chinese.
Thus, while Seginer’s method performs better on English, our approach performs 2-3 points better on German, and both methods give similar performance on Chinese.
We described a spectral approach for unsupervised constituent parsing that comes with theoretical guarantees on latent structure recovery. Empirically, our algorithm performs favorably compared to the CCM of Klein and Manning (2002), without the need for careful initialization.
Acknowledgements: This work is supported by NSF IIS1218282, NSF IIS1111142, NIH R01GM093156, and the NSF Graduate Research Fellowship Program under Grant No. 0946825 (NSF Fellowship to APP).