We propose an online learning algorithm based on tensor-space models. A tensor-space model represents data in a compact way, and via rank-1 approximation the weight tensor can be made highly structured, resulting in a significantly smaller number of free parameters to be estimated than in comparable vector-space models. This regularizes the model complexity and makes the tensor model highly effective in situations where a large feature set is defined but very limited resources are available for training. We apply the proposed algorithm to a parsing task and show that, even with very little training data, the learning algorithm based on a tensor model performs well and gives significantly better results than standard learning algorithms based on traditional vector-space models.
Many NLP applications use models that try to incorporate a large number of linguistic features so that as much human knowledge of language as possible can be brought to bear on the prediction task. This, however, makes training the model parameters a challenging problem, since the amount of labeled training data is usually small compared to the size of the feature set, and the feature weights therefore cannot be estimated reliably.
Most traditional models are linear models, in the sense that both the features of the data and model parameters are represented as vectors in a vector space. Many learning algorithms applied to NLP problems, such as the Perceptron [Collins2002], MIRA [Crammer et al.2006, McDonald et al.2005, Chiang et al.2008], PRO [Hopkins and May2011], RAMPION [Gimpel and Smith2012] etc., are based on vector-space models. Such models require learning individual feature weights directly, so that the number of parameters to be estimated is identical to the size of the feature set. When millions of features are used but the amount of labeled data is limited, it can be difficult to precisely estimate each feature weight.
In this paper, we shift the model from vector-space to tensor-space. Data can be represented in a compact and structured way using tensors as containers. Tensor representations were applied to computer vision problems [Hazan et al.2005, Shashua and Hazan2005] and information retrieval [Cai et al.2006a] some time ago; more recently, they have also been applied to parsing [Cohen and Collins2012, Cohen and Satta2013] and semantic analysis [Van de Cruys et al.2013]. A linear tensor model represents both features and weights in tensor-space, hence the weight tensor can be factorized and approximated by a linear sum of rank-1 tensors. This low-rank approximation imposes structural constraints on the feature weights and can be regarded as a form of regularization. With this representation, we no longer need to estimate individual feature weights directly but only a small number of “bases” instead. This property makes the tensor model very effective when training a large number of feature weights in a low-resource environment. On the other hand, tensor models have many more degrees of “design freedom” than vector-space models. While this makes them very flexible, it also creates much difficulty in designing an optimal tensor structure for a given training set.
We give a detailed description of the tensor-space model in Section 2. Several issues that arise in constructing the tensor model are addressed in Section 3. A tensor weight learning algorithm is then proposed in Section 4. Finally, we present experimental results on a parsing task, together with analysis, in Section 5.
Most learning algorithms for NLP problems are based on vector-space models, which represent data as feature vectors and try to learn a feature weight vector such that a linear model is able to discriminate between, say, good and bad hypotheses. While this is a natural way of representing data, it is not the only choice. Below, we reformulate the model from vector space to tensor space.
A tensor is a multidimensional array, and is a generalization of commonly used algebraic objects such as vectors and matrices. Specifically, a vector is a 1st-order tensor, a matrix is a 2nd-order tensor, and data organized as a rectangular cuboid is a 3rd-order tensor, etc. In general, a $D$-th order tensor is represented as $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$, and an entry in $\mathcal{A}$ is denoted by $A_{i_1 i_2 \cdots i_D}$. The different dimensions of a tensor are named the modes of the tensor.
Using a $D$-th order tensor as a container, we can assign each feature of the task a $D$-dimensional index in the tensor and represent the data as tensors. Of course, shifting from a vector to a tensor representation entails several additional degrees of freedom, e.g., the order of the tensor and the sizes of its modes, which must be addressed when selecting a tensor model. This will be done in Section 3.
Just as a matrix can be decomposed as a linear combination of several rank-1 matrices via SVD, tensors also admit decompositions into linear combinations of “rank-1” tensors. (The form of tensor decomposition used here is the CANDECOMP/PARAFAC (CP) decomposition [Kolda and Bader2009]. Another popular form is the Tucker decomposition, which decomposes a tensor into a core tensor multiplied by a matrix along each mode. We focus only on the CP decomposition in this paper.) A $D$-th order tensor $\mathcal{T}$ is rank-1 if it can be written as the outer product of $D$ vectors, i.e.

$$\mathcal{T} = v_1 \circ v_2 \circ \cdots \circ v_D,$$
where $v_d \in \mathbb{R}^{n_d}$, $d = 1, \dots, D$. A $D$-th order tensor $\mathcal{A}$ can be factorized into a sum of component rank-1 tensors as

$$\mathcal{A} = \sum_{r=1}^{R} v_1^r \circ v_2^r \circ \cdots \circ v_D^r,$$
where $R$, called the rank of the tensor, is the minimum number of rank-1 tensors whose sum equals $\mathcal{A}$. Via this decomposition, one may approximate a tensor by the sum of the $H$ major rank-1 tensors with $H < R$.
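For concreteness, the following NumPy sketch (mode sizes and values are arbitrary, chosen only for illustration) builds a 3rd-order rank-1 tensor from an outer product and a CP-style sum of such rank-1 terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 5, 6          # mode sizes (arbitrary for illustration)

# A rank-1 3rd-order tensor is the outer product of three vectors.
v1, v2, v3 = rng.normal(size=n1), rng.normal(size=n2), rng.normal(size=n3)
rank1 = np.einsum('i,j,k->ijk', v1, v2, v3)
assert rank1.shape == (n1, n2, n3)

# A CP-style decomposition writes a tensor as a sum of H rank-1 terms.
H = 3
factors = [(rng.normal(size=n1), rng.normal(size=n2), rng.normal(size=n3))
           for _ in range(H)]
T = sum(np.einsum('i,j,k->ijk', a, b, c) for a, b, c in factors)

# Each entry is a sum over components of products of factor entries.
i, j, k = 1, 2, 3
assert np.isclose(T[i, j, k], sum(a[i] * b[j] * c[k] for a, b, c in factors))
```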
In tensor space, a linear model may be written (ignoring a bias term) as

$$f(W; X) = \langle W, X \rangle = \sum_{i_1, \dots, i_D} W_{i_1 \cdots i_D}\, X_{i_1 \cdots i_D},$$

where $X \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ is the feature tensor, $W$ is the corresponding weight tensor, and $\langle \cdot, \cdot \rangle$ sums the entries of the Hadamard (entry-wise) product $W \odot X$. If $W$ is further decomposed as the sum of $H$ major component rank-1 tensors, i.e. $W = \sum_{h=1}^{H} w_1^h \circ w_2^h \circ \cdots \circ w_D^h$, then

$$f(W; X) = \sum_{h=1}^{H} \left\langle w_1^h \circ w_2^h \circ \cdots \circ w_D^h,\; X \right\rangle \qquad (1)$$
$$\phantom{f(W; X)} = \sum_{h=1}^{H} X \times_1 w_1^h \times_2 w_2^h \cdots \times_D w_D^h, \qquad (2)$$

where $\times_d$ is the $d$-mode product operator between a $D$-th order tensor $\mathcal{A}$ and a vector $v$ of dimension $n_d$, yielding a $(D-1)$-th order tensor such that

$$\left(\mathcal{A} \times_d v\right)_{i_1 \cdots i_{d-1} i_{d+1} \cdots i_D} = \sum_{i_d = 1}^{n_d} A_{i_1 \cdots i_{d-1} i_d i_{d+1} \cdots i_D}\, v_{i_d}.$$
The linear tensor model is illustrated in Figure 1.
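To make Equations (1)–(2) concrete, the small NumPy check below (sizes and values arbitrary) confirms that contracting the feature tensor with the factor vectors mode by mode gives the same score as materializing $W$ and summing the Hadamard product:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, n3 = 5, 6, 7           # mode sizes
H = 2                          # number of rank-1 components

X = rng.normal(size=(n1, n2, n3))                       # feature tensor
ws = [(rng.normal(size=n1), rng.normal(size=n2), rng.normal(size=n3))
      for _ in range(H)]                                # factor vectors

# Score 1: build W explicitly and sum the entries of the Hadamard product.
W = sum(np.einsum('i,j,k->ijk', a, b, c) for a, b, c in ws)
score_full = np.sum(W * X)

# Score 2: contract X with the factor vectors mode by mode (d-mode products),
# which only ever touches the small factor vectors, never the full W.
score_modes = sum(np.einsum('ijk,i,j,k->', X, a, b, c) for a, b, c in ws)

assert np.isclose(score_full, score_modes)
```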
So what is the advantage of learning with a tensor model instead of a vector model? Consider the case where we have defined 1,000,000 features for our task. A vector-space linear model then requires estimating 1,000,000 free parameters. However, if we use a 2nd-order tensor model, organize the features into a $1000 \times 1000$ matrix $X$, and use just one rank-1 matrix $u \circ v$ to approximate the weight tensor, then the linear model becomes

$$f(u, v; X) = u^{T} X v,$$

where $u, v \in \mathbb{R}^{1000}$. That is to say, we now only need to estimate 2000 parameters!
In general, suppose $N$ features are defined for a learning problem, and we (i) organize the feature set as a $D$-th order tensor with mode sizes $n_1, \dots, n_D$ and (ii) use $H$ component rank-1 tensors to approximate the corresponding target weight tensor. Then the total number of parameters to be learned for this tensor model is $H \sum_{d=1}^{D} n_d$, which is usually much smaller than the $N$ required by a traditional vector-space model. Therefore we expect the tensor model to be more effective in a low-resource training environment.
Specifically, a vector-space model assumes each feature weight to be a “free” parameter, and estimating them reliably can therefore be hard when training data are insufficient or the feature set is huge. By contrast, a linear tensor model only needs to learn a small set of “bases” of the feature weights instead of the individual weights directly. The weight corresponding to feature $X_{i_1 \cdots i_D}$ in the tensor model is expressed as

$$W_{i_1 \cdots i_D} = \sum_{h=1}^{H} w^h_{1, i_1}\, w^h_{2, i_2} \cdots w^h_{D, i_D}, \qquad (3)$$

where $w^h_{d, i_d}$ is the $i_d$-th element of the vector $w_d^h$.
In other words, a true feature weight is now approximated by a set of bases. This reminds us of the well-known low-rank matrix approximation of images via SVD; we apply a similar technique here to approximate the target feature weights, which becomes possible only after we shift from vector-space to tensor-space models.
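As a numerical illustration of this analogy (using a synthetic, approximately low-rank target matrix rather than actual feature weights), a truncated SVD keeps only the top-$H$ rank-1 components and already reconstructs the matrix well with far fewer parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
# A synthetic "target" weight matrix with an approximately rank-2 structure.
W_true = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 100)) \
         + 0.01 * rng.normal(size=(100, 100))

U, s, Vt = np.linalg.svd(W_true, full_matrices=False)
for H in (1, 2, 3):
    # Keep only the H largest singular components (a sum of H rank-1 matrices).
    W_hat = (U[:, :H] * s[:H]) @ Vt[:H, :]
    err = np.linalg.norm(W_true - W_hat) / np.linalg.norm(W_true)
    print(f"H={H}: relative error {err:.4f}, parameters {H * sum(W_true.shape)}")
```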
This approximation can be treated as a form of model regularization, since the weight tensor is represented in a constrained form and made highly structured via the rank-1 tensor approximation. Of course, as we reduce the model complexity, e.g. by choosing a smaller and smaller $H$, the model’s expressive ability is weakened at the same time. We will elaborate on this point in Section 3.1.
To apply a tensor model, we first need to convert the feature vector into a feature tensor $X$. Once the structure of $X$ is determined, the structure of the weight tensor $W$ is fixed as well. As mentioned in Section 2.1, a tensor model has many more degrees of “design freedom” than a vector model, which makes the problem of finding a good tensor structure a nontrivial one.
The order of a tensor affects the model in two ways: the expressiveness of the model and the number of parameters to be estimated. We assume $H = 1$ in the analysis below, noting that one can always add as many rank-1 component tensors as needed to approximate a tensor with arbitrary precision.
Obviously, the 1st-order tensor (vector) model is the most expressive, since it is structureless and any arbitrary set of numbers can always be represented exactly as a vector. The 2nd-order rank-1 tensor (rank-1 matrix) is less expressive, because not every set of numbers can be organized into a rank-1 matrix. In general, a $(D-1)$-th order rank-1 tensor is more expressive than a $D$-th order rank-1 tensor, as a lower-order tensor imposes fewer structural constraints on the set of numbers it can express. We formally state this fact as follows:
A set of real numbers that can be represented by a $D$-th order rank-1 tensor can also be represented by a $(D-1)$-th order rank-1 tensor, provided the two tensors have the same volume. But the reverse is not true.
See appendix. ∎
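As a concrete illustration of this asymmetry (our example, not from the appendix): any rank-1 matrix $u v^{T}$ lists the same numbers as the vector $(u_1 v_1, u_1 v_2, \dots, u_m v_n)$, but the numbers $1, 2, 3, 5$ cannot be arranged into any rank-1 $2 \times 2$ matrix, since a rank-1 matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ requires $ad = bc$, and no split of $\{1, 2, 3, 5\}$ into two pairs has equal products ($1 \cdot 2 = 2$ vs. $3 \cdot 5 = 15$; $1 \cdot 3 = 3$ vs. $2 \cdot 5 = 10$; $1 \cdot 5 = 5$ vs. $2 \cdot 3 = 6$).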
On the other hand, the tensor order also affects the number of parameters to be trained. Assuming that a $D$-th order tensor has equal size $n$ on each mode (we will elaborate on this point in Section 3.2) and the volume (number of entries) of the tensor is fixed as $V$, so that $n = V^{1/D}$, the total number of parameters of the model is $D\,V^{1/D}$. This is a convex function of $D$, and the minimum is reached at either $D^{*} = \lfloor \ln V \rfloor$ or $D^{*} = \lceil \ln V \rceil$ (the optimal integer solution can be determined simply by comparing the two function values).
Therefore, as $D$ increases from 1 to $D^{*}$, we lose more and more of the expressive power of the model but reduce the number of parameters to be trained. However, it would be a bad idea to choose a $D$ beyond $D^{*}$, since beyond that point the parameter count increases again while the expressiveness keeps decreasing. The optimal tensor order depends on the nature of the actual problem, and we tune this hyper-parameter on a held-out set.
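As a quick numerical check of this trade-off (the volume $V$ below is arbitrary), the parameter count $D\,V^{1/D}$ can be evaluated for each integer order, and its minimizer indeed lands next to $\ln V$:

```python
import math

V = 1_000_000                       # tensor volume (number of entries), arbitrary here
counts = {D: D * V ** (1 / D) for D in range(1, 31)}
D_best = min(counts, key=counts.get)

print(f"ln V = {math.log(V):.2f}")          # continuous minimizer of D * V^(1/D)
print(f"best integer order D* = {D_best}, parameters ~ {counts[D_best]:.1f}")
print(f"D = 1 (vector model) would need {counts[1]:.0f} parameters")
```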
The size of each tensor mode, $n_d$, determines the structure of feature weights a tensor model can precisely represent, as well as the number of parameters to estimate (we again assume $H = 1$ in the analysis below). For example, if the tensor order is 2 and the volume is 12, then we can choose mode sizes $3 \times 4$ or $2 \times 6$, among others. In the $3 \times 4$ case, the sets of numbers that can be precisely represented are those that divide into 3 groups of 4 numbers each, the groups being scaled versions of one another. Similarly, in the $2 \times 6$ case the numbers must divide into 2 groups with different scales. Obviously, the two choices also lead to different numbers of free parameters (7 vs. 8).
Given $D$ and the number of features $N$, there are many possible combinations of mode sizes $n_1, \dots, n_D$, and the optimal combination should ideally be determined by the structure of the target feature weights. However, it is hard to know this structure before learning, and it would be impractical to try every possible combination of mode sizes. We therefore choose the mode sizes so as to minimize the total number of parameters, namely we solve the problem:

$$\min_{n_1, \dots, n_D} \sum_{d=1}^{D} n_d \quad \text{s.t.} \quad \prod_{d=1}^{D} n_d \ge N.$$
The optimal solution is reached when $n_1 = n_2 = \cdots = n_D = N^{1/D}$. Of course $N^{1/D}$ is not guaranteed to be an integer, so we choose each $n_d \in \{\lfloor N^{1/D} \rfloor, \lceil N^{1/D} \rceil\}$ such that $\prod_{d=1}^{D} n_d \ge N$ and $\sum_{d=1}^{D} n_d$ is minimized. The extra entries of the tensor correspond to no features and are used just for padding. Since each $n_d$ has only two possible values, we can simply enumerate all $2^D$ (usually a small number of) combinations and pick the one that satisfies the conditions above. This way $n_1, \dots, n_D$ are fully determined.
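This rule is straightforward to implement; the sketch below (our own illustration, not the authors’ code, and using a hypothetical feature count since the exact number is not reproduced here) enumerates the floor/ceiling combinations and keeps the cheapest feasible one:

```python
import itertools
import math

def choose_mode_sizes(num_features: int, order: int) -> tuple:
    """Pick mode sizes n_1..n_D minimizing sum(n_d) s.t. prod(n_d) >= num_features."""
    root = num_features ** (1.0 / order)
    lo, hi = math.floor(root), math.ceil(root)
    best = None
    # Enumerate all floor/ceiling combinations (at most 2^D of them).
    for combo in itertools.product((lo, hi), repeat=order):
        if math.prod(combo) >= num_features:
            if best is None or sum(combo) < sum(best):
                best = combo
    return best

# Example with a hypothetical feature count (for illustration only).
for D in (2, 3, 4):
    sizes = choose_mode_sizes(1_330_000, D)
    print(D, sizes, sum(sizes))
```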
Here we are only following the principle of minimizing the number of parameters. While this strategy may work well with a small amount of training data, it is not guaranteed to be the best strategy in all cases; in particular, when more data is available we might want to increase the number of parameters, making the model more complex so that the data can be modeled more precisely. Ideally, the mode sizes should be adaptive to the amount of training data as well as to the properties of the target weights. A theoretically guaranteed optimal approach to determining the mode sizes remains an open problem and will be explored in future work.
The impact of using $H$ rank-1 component tensors is obvious: a larger $H$ increases the model complexity and makes the model more expressive, since we are able to approximate the target weight tensor with smaller error. As a trade-off, the number of parameters and the training complexity increase. To find the optimal value of $H$ for a given problem, we tune this hyper-parameter on a held-out set as well.
Finally, we need to find a way to map the original feature vector to a tensor, i.e. to associate each feature with an index in the tensor. Assuming the tensor volume equals the number of features $N$, there are $N!$ possible mappings, an intractable number even for modestly sized feature sets, which makes a brute-force search impractical. In choosing a mapping, however, we hope to arrange the features so that the corresponding target weight tensor has an approximately low-rank structure, in which case it can be well approximated by very few component rank-1 tensors.
Unfortunately, we have no knowledge of the target weights in advance, since they are what we need to learn in the first place. As a way out, we first run a simple vector-model-based learning algorithm (say the Perceptron) on the training data and estimate a weight vector, which serves as a “surrogate” weight vector. We then use this surrogate vector to guide the design of the mapping. Ideally, we would like a permutation of the surrogate weights that maps to a tensor of rank as low as possible. However, matrix rank minimization is in general a hard problem [Fazel2002]. Therefore, we follow the approximate algorithm given in Figure 4 and illustrated there via an example.
Figure 4: Approximate algorithm for mapping the surrogate weight vector to a tensor.

Require: tensor order $D$, tensor volume $V$, mode sizes $n_1, \dots, n_D$, surrogate weight vector $\tilde{w}$.
1. Let $\tilde{w}^{+}$ be the non-negative part of $\tilde{w}$ and $\tilde{w}^{-}$ be the negative part of $\tilde{w}$.
2. Set $u^{+} = \mathrm{sort}(\tilde{w}^{+})$ in descending order and $u^{-} = \mathrm{sort}(\tilde{w}^{-})$ in ascending order.
3. Construct a single vector from $u^{+}$ and $u^{-}$.
4. Map each entry of this vector to the tensor entry whose $D$-dimensional index is determined by the entry’s position in the sorted vector.
Basically, what the algorithm does is divide the surrogate weights into hierarchical groups such that groups on the same level are approximately proportional to each other. Using these groups as units, we are able to “fill” the tensor in a hierarchical way. The resulting tensor will have an approximately low-rank structure, provided the sorted feature weights have roughly group-wise proportional relations.
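A much-simplified NumPy sketch of this idea (it keeps only the sort-then-fill spirit; the exact hierarchical grouping of Figure 4 is not reproduced, and the handling of signs here is an assumption):

```python
import numpy as np

def approximate_mapping(surrogate_w, shape):
    """Map each feature to a tensor index by sorting the surrogate weights.

    Simplified heuristic in the spirit of the text: non-negative weights are
    sorted in descending order, negative weights in ascending order, and the
    features are then placed into the tensor cells in that order (row-major).
    Assumes prod(shape) >= number of features; surplus cells are padding.
    """
    pos = np.where(surrogate_w >= 0)[0]
    neg = np.where(surrogate_w < 0)[0]
    order = np.concatenate([pos[np.argsort(-surrogate_w[pos])],
                            neg[np.argsort(surrogate_w[neg])]])
    return {feat: np.unravel_index(cell, shape) for cell, feat in enumerate(order)}

# Tiny demonstration: smoothly decaying (non-negative) surrogate weights,
# placed into a 3x4 matrix with this mapping, give a nearly rank-1 matrix.
rng = np.random.default_rng(3)
w = rng.exponential(size=12)
index_of = approximate_mapping(w, (3, 4))
M = np.zeros((3, 4))
for feat, idx in index_of.items():
    M[idx] = w[feat]
print(np.linalg.svd(M, compute_uv=False))   # first singular value typically dominates
```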
For comparison, we also experimented with a trivial solution that maps the features to the tensor simply in sequential order, namely the first feature in the vector is mapped to the first tensor entry (in index order), the second feature to the second entry, and so on. This of course ignores correlations between features, since the original feature order in the vector could be totally meaningless, and this strategy is not expected to be a good solution for vector-to-tensor mapping.
We now turn to the problem of learning the feature weight tensor. Here we propose an online learning algorithm similar to MIRA but modified to accommodate tensor models.
Let the model be $f(W; X(x, z)) = \langle W, X(x, z) \rangle$, where $W$ is the weight tensor and $X(x, z)$ is the feature tensor for an input-output pair $(x, z)$. Training samples $(x_t, y_t)$, where $x_t$ is the input and $y_t$ is the reference or oracle hypothesis, are fed to the weight learning algorithm in sequential order. At time $t$ the model makes a prediction $z_t$ from a set of candidate hypotheses, and the weight tensor is updated by solving the following problem:

$$\min_{W_{t+1}} \; \frac{1}{2} \left\| W_{t+1} - W_t \right\|^2 + C \xi
\quad \text{s.t.} \quad L_t \le \xi, \;\; \xi \ge 0, \qquad (4)$$

where $W_{t+1}$ is a decomposed weight tensor, i.e. $W_{t+1} = \sum_{h=1}^{H} w^h_{1,t+1} \circ \cdots \circ w^h_{D,t+1}$, and

$$L_t = f\!\left(W_{t+1}; X(x_t, z_t)\right) - f\!\left(W_{t+1}; X(x_t, y_t)\right) + \rho(y_t, z_t)$$

is the structured hinge loss, with $\rho(y_t, z_t)$ a measure of the discrepancy between $y_t$ and $z_t$.
This problem setting follows the same “passive-aggressive” strategy as the original MIRA. To optimize the vectors $w^h_d$, we use an iterative strategy similar to that proposed in [Cai et al.2006b]. Basically, instead of optimizing all the vectors together, we optimize them in turn: while one vector is being updated, the rest are fixed. For the problem setting given above, each of the sub-problems to be solved is convex, and according to [Cai et al.2006b] the objective function value decreases after each individual weight update, so this procedure eventually converges.
We now give this procedure in more detail. Denote the mode-$d$ vector of the $h$-th component tensor at time $t$ by $w^h_{d,t}$. We update the vectors in turn in the following order: $w^1_{1,t}, \dots, w^1_{D,t}, w^2_{1,t}, \dots, w^2_{D,t}, \dots, w^H_{D,t}$. Once a vector has been updated, it is fixed for the remaining updates at time $t$.
By way of notation, define the partial mode product

$$\phi^h_{d,t}(x, z) \;=\; X(x, z) \times_1 w^h_{1,t+1} \cdots \times_{d-1} w^h_{d-1,t+1} \times_{d+1} w^h_{d+1,t} \cdots \times_D w^h_{D,t} \;\in\; \mathbb{R}^{n_d}, \qquad (9)$$

i.e. the feature tensor multiplied along every mode except $d$, using the already-updated vectors for modes $1, \dots, d-1$ and the not-yet-updated vectors for modes $d+1, \dots, D$.
In order to update $w^h_{d,t}$ to obtain $w^h_{d,t+1}$, the sub-problem to solve is

$$\min_{w^h_{d,t+1}} \; \frac{1}{2} \left\| w^h_{d,t+1} - w^h_{d,t} \right\|^2 + C \xi
\quad \text{s.t.} \quad L_t \le \xi, \;\; \xi \ge 0,$$

where $L_t$ is the structured hinge loss viewed as a function of $w^h_{d,t+1}$ alone, all other vectors being held fixed. Letting

$$\Delta\phi^h_{d,t} \;=\; \phi^h_{d,t}(x_t, y_t) - \phi^h_{d,t}(x_t, z_t)$$

and collecting into a constant $\ell_t$ the terms that do not depend on $w^h_{d,t+1}$, we may compactly write

$$L_t \;=\; \ell_t - \left(w^h_{d,t+1}\right)^{T} \Delta\phi^h_{d,t}.$$

This convex optimization problem is just like the original MIRA and may be solved in a similar way. The resulting update for $w^h_{d,t}$ is

$$w^h_{d,t+1} \;=\; w^h_{d,t} + \min\!\left\{ C, \; \frac{L_t\!\left(w^h_{d,t}\right)}{\left\| \Delta\phi^h_{d,t} \right\|^2} \right\} \Delta\phi^h_{d,t}, \qquad (10)$$

where $L_t(w^h_{d,t})$ is the loss evaluated at the current (pre-update) vector; the update is applied only when this loss is positive.
The initial vectors cannot all be made zero, since otherwise the mode products in Equation (9) would be all zero and the model would never get a chance to be updated. Therefore, we initialize the entries of each $w^h_{d,0}$ uniformly, scaled such that the Frobenius norm of the weight tensor is unity.
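The following is a minimal sketch of one such update step for a single rank-1 component ($H = 1$), following the reconstructed equations above; the loss $\rho$, the cap $C$, and the tensor sizes are placeholders, and this is not the authors’ implementation:

```python
import numpy as np

def mode_contract(X, vecs, skip=None):
    """Contract X with vecs[k] along every mode k (except `skip`, if given)."""
    out = X
    for k in sorted(range(len(vecs)), reverse=True):
        if k == skip:
            continue
        out = np.tensordot(out, vecs[k], axes=([k], [0]))
    return out

def tmira_step(vecs, X_oracle, X_pred, rho, C=1.0):
    """One T-MIRA update for a single rank-1 component (H = 1).

    Each factor vector w_d is updated in turn with a passive-aggressive step,
    holding the other factor vectors fixed.
    """
    for d in range(len(vecs)):
        score_pred = float(mode_contract(X_pred, vecs))
        score_oracle = float(mode_contract(X_oracle, vecs))
        loss = rho + score_pred - score_oracle
        if loss <= 0:                       # passive: no margin violation
            continue
        delta = (mode_contract(X_oracle, vecs, skip=d)
                 - mode_contract(X_pred, vecs, skip=d))
        denom = float(delta @ delta)
        if denom > 0:
            tau = min(C, loss / denom)      # aggressive: minimal corrective step
            vecs[d] = vecs[d] + tau * delta
    return vecs

# Initialization: uniform entries chosen so the weight tensor has unit
# Frobenius norm (all-zero vectors would make every mode product vanish).
shape = (5, 6, 7)                           # mode sizes, arbitrary here
vecs = [np.full(n, n ** -0.5) for n in shape]

rng = np.random.default_rng(4)
X_oracle, X_pred = rng.normal(size=shape), rng.normal(size=shape)
vecs = tmira_step(vecs, X_oracle, X_pred, rho=1.0)
```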
We call the algorithm above “Tensor-MIRA” and abbreviate it as T-MIRA.
In this section we show empirical results of the training algorithm on a parsing task. We used the Charniak parser [Charniak et al.2005] for our experiments, and used the proposed algorithm to train the reranking feature weights. For comparison, we also investigated training the reranker with the Perceptron and MIRA.
To simulate a low-resource training environment, our training sets were selected from sections 2-9 of the Penn WSJ treebank; section 24 was used as the held-out set and section 23 as the evaluation set. We applied the default settings of the parser. Around 1.33 million features in all are defined for reranking, and the $n$-best list size for reranking is set to 50. We selected the parse with the highest $f$-score from the 50-best list as the oracle.
We would like to observe from the experiments how the amount of training data as well as different settings of the tensor degrees of freedom affects the algorithm performance. Therefore we tried all combinations of the following experimental parameters:
| Parameters | Settings |
|---|---|
| Training data (WSJ sections) | Sec. 2, 2-3, 2-5, 2-9 |
| Tensor order ($D$) | 2, 3, 4 |
| # rank-1 tensors ($H$) | 1, 2, 3 |
| Vector-to-tensor mapping | approximate, sequential |
Here “approximate” and “sequential” mean using, respectively, the algorithm given in Figure 4 and the sequential mapping mentioned in Section 3.4. According to the strategy given in Section 3.2, once the tensor order and the number of features are fixed, the sizes of the modes and the total number of parameters to estimate are fixed as well, as shown in the table below:
| Tensor order ($D$) | Size of modes | Number of parameters |
|---|---|---|
| 2 | 1155 × 1155 | 2310 |
| 3 | 110 × 110 × 111 | 331 |
| 4 | 34 × 34 × 34 × 34 | 136 |
The $f$-scores on the held-out and evaluation sets given by T-MIRA, as well as the Perceptron and MIRA baselines, are given in Table 5. From the results, we make the following observations:
When very little labeled data is available for training (compared with the number of features), T-MIRA performs much better than the vector-based models MIRA and Perceptron. However, as the amount of training data increases, the advantage of T-MIRA fades away and the vector-based models catch up. This is because the weight tensors learned by T-MIRA are highly structured, which significantly reduces the model/training complexity and makes learning very effective in a low-resource environment; but as the amount of data increases, the more complex and expressive vector-based models fit the data better, whereas further improvement of the tensor model is impeded by its structural constraints, making it insensitive to the increase in training data.
To further contrast the behavior of T-MIRA, MIRA and the Perceptron, we plot the $f$-scores on both the training and held-out sets given by these algorithms after each training epoch in Figure 7. The plots are for the experimental setting with mapping = approximate, # rank-1 tensors = 2, tensor order = 2, and training data = sections 2-3. It is clearly seen that both MIRA and the Perceptron do much better than T-MIRA on the training set. Nevertheless, with a huge number of parameters fit to a limited amount of data, they tend to overfit and give much worse results on the held-out set than T-MIRA does.
As an aside, observe that MIRA consistently outperformed Perceptron, as expected.
Properties of linear tensor model: The heuristic vector-to-tensor mapping strategy given by Figure 4 gives consistently better results than the sequential mapping strategy, as expected.
To compare the two strategies further, Figure 8 plots the 20 largest singular values of the matrices to which the surrogate weights (given by the Perceptron after one epoch) are mapped by the two strategies (from the experiment with training data sections 2-5). From the contrast between the largest singular value and the remaining ones, it can be seen that the matrix generated by the first strategy approximates a low-rank structure much better than that of the second strategy. Therefore, the performance of T-MIRA is influenced significantly by the way features are mapped to the tensor. If the corresponding target weight tensor has an internal structure that makes it approximately low-rank, the learning procedure becomes more effective.
The best results are consistently given by the 2nd-order tensor models, and the differences between the 3rd and 4th order tensors are not significant. As discussed in Section 3.1, although the 3rd and 4th order tensors have fewer parameters, the benefit of reduced training complexity does not compensate for the loss of expressiveness. A 2nd-order tensor has already reduced the number of parameters from the original 1.33 million to only 2310, and it does not help to reduce the number of parameters further by using higher-order tensors.
As the amount of training data increases, there is a trend that the best results come from models with more rank-1 component tensors. Adding more rank-1 tensors increases the model’s complexity and expressiveness, making the model more adaptive to larger data sets.
Held-out $f$-scores (training data: Sec. 2):

| Rank-1 tensors ($H$) | Tensor order ($D$) | Approximate | Sequential |
|---|---|---|---|
| 1 | 2 | 89.43 | 89.21 |
| 1 | 3 | 89.16 | 88.90 |
| 1 | 4 | 89.22 | 88.89 |
| 2 | 2 | 89.16 | 89.13 |
| 2 | 3 | 89.21 | 88.88 |
| 2 | 4 | 89.24 | 88.88 |
| 3 | 2 | 89.27 | 89.15 |
| 3 | 3 | 89.14 | 88.87 |
| 3 | 4 | 89.24 | 88.99 |

Evaluation score: 89.83 (approximate), 89.69 (sequential). MIRA: 88.57. Perceptron: 88.23.
Held-out $f$-scores (training data: Sec. 2-3):

| Rank-1 tensors ($H$) | Tensor order ($D$) | Approximate | Sequential |
|---|---|---|---|
| 1 | 2 | 89.26 | 89.2 |
| 1 | 3 | 89.06 | 89.01 |
| 1 | 4 | 89.12 | 88.82 |
| 2 | 2 | 89.33 | 89.24 |
| 2 | 3 | 89.11 | 88.94 |
| 2 | 4 | 89.19 | 88.95 |
| 3 | 2 | 89.18 | 89.19 |
| 3 | 3 | 89.14 | 88.91 |
| 3 | 4 | 89.15 | 88.98 |

Evaluation score: 90.02 (approximate), 89.82 (sequential). MIRA: 89.00. Perceptron: 88.59.
Held-out $f$-scores (training data: Sec. 2-5):

| Rank-1 tensors ($H$) | Tensor order ($D$) | Approximate | Sequential |
|---|---|---|---|
| 1 | 2 | 89.40 | 89.23 |
| 1 | 3 | 89.44 | 89.03 |
| 1 | 4 | 89.17 | 88.93 |
| 2 | 2 | 89.5 | 89.24 |
| 2 | 3 | 89.37 | 88.98 |
| 2 | 4 | 89.18 | 88.94 |
| 3 | 2 | 89.47 | 89.16 |
| 3 | 3 | 89.32 | 89.01 |
| 3 | 4 | 89.18 | 88.85 |

Evaluation score: 89.96 (approximate), 89.78 (sequential). MIRA: 89.49. Perceptron: 89.10.
Held-out $f$-scores (training data: Sec. 2-9):

| Rank-1 tensors ($H$) | Tensor order ($D$) | Approximate | Sequential |
|---|---|---|---|
| 1 | 2 | 89.43 | 89.21 |
| 1 | 3 | 89.23 | 88.92 |
| 1 | 4 | 89.06 | 88.94 |
| 2 | 2 | 89.37 | 89.23 |
| 2 | 3 | 89.23 | 88.94 |
| 2 | 4 | 89.1 | 88.93 |
| 3 | 2 | 89.44 | 89.23 |
| 3 | 3 | 89.22 | 88.95 |
| 3 | 4 | 89.06 | 88.93 |

Evaluation score: 89.95 (approximate), 89.84 (sequential). MIRA: 89.95. Perceptron: 89.77.
In this paper, we reformulated the traditional linear vector-space models as tensor-space models and proposed an online learning algorithm named Tensor-MIRA. A tensor-space model is a compact representation of data, and via rank-1 tensor approximation the weight tensor can be made highly structured, so that the number of parameters to be trained is significantly reduced. This can be regarded as a form of model regularization. Therefore, compared with traditional vector-space models, learning in tensor space is very effective when a large feature set is defined but only a small amount of training data is available. Our experimental results corroborate this argument.
As mentioned in Section 3.2, one interesting problem that merits further investigation is how to determine the optimal mode sizes. The challenge of applying a tensor model comes from finding a proper tensor structure for a given problem, and the key to solving this problem is to find a balance between the model complexity (indicated by the order and the sizes of the modes) and the number of parameters. Developing a theoretically guaranteed approach to finding the optimal structure for a given task would make the tensor model not only perform well in low-resource environments but also adaptive to larger data sets.
This work was partially supported by IBM via DARPA/BOLT contract number HR0011-12-C-0015 and by the National Science Foundation via award number IIS-0963898.
For $D = 2$, it is obvious that if a set of real numbers can be represented by a rank-1 matrix, it can always be represented by a vector, but the reverse is not true.
For $D > 2$, suppose the set can be represented by a $(D-1)$-th order rank-1 tensor, namely

$$\mathcal{T}^{D-1} = v_1 \circ v_2 \circ \cdots \circ v_{D-1}. \qquad (11)$$

For any mode $d$ of $\mathcal{T}^{D-1}$, the slices along that mode are proportional to one another, the constants of proportionality being determined by the entries of $v_d$. Therefore the factorization (11) is unique for a given set of numbers (up to the ordering of the entries within each $v_d$, which simply assigns the numbers different indices in the tensor, and up to scaling constants that cancel across modes), due to the pairwise proportional constraints imposed by the rank-1 structure.
If the same set can also be represented by a $D$-th order rank-1 tensor $\mathcal{T}^{D} = u_1 \circ u_2 \circ \cdots \circ u_{D}$, where the $u_d$ have a definition similar to the $v_d$, then it must be the case that some factor vector of $\mathcal{T}^{D-1}$, reshaped into a matrix of appropriate dimensions, is itself a rank-1 matrix formed from two of the factor vectors of $\mathcal{T}^{D}$, e.g.

$$\operatorname{mat}(v_d) = u_a\, u_b^{T}, \qquad (12)$$

since otherwise the same set of numbers would be represented by a different set of factors than those given in Equation (11). Therefore, in order for the tensor $\mathcal{T}^{D}$ to represent the same set of real numbers that $\mathcal{T}^{D-1}$ represents, there needs to exist a factor vector of $\mathcal{T}^{D-1}$ that can be represented by a rank-1 matrix as indicated by Equation (12), which is in general not guaranteed.
On the other hand, if the set can be represented by a $D$-th order rank-1 tensor, namely

$$\mathcal{T}^{D} = u_1 \circ u_2 \circ \cdots \circ u_{D},$$

then we can simply merge two of the modes: let

$$v_1 = \operatorname{vec}\left(u_1 \circ u_2\right)$$

and

$$v_d = u_{d+1}, \quad d = 2, \dots, D-1.$$

Hence the same set of numbers can also be represented by the $(D-1)$-th order rank-1 tensor $\mathcal{T}^{D-1} = v_1 \circ v_2 \circ \cdots \circ v_{D-1}$, which has the same volume. ∎