Unsupervised domain adaptation often relies on transforming the instance representation. However, most such approaches are designed for bag-of-words models, and ignore the structured features present in many problems in NLP. We propose a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection. Applied to the task of fine-grained part-of-speech tagging on a dataset of historical Portuguese, marginalized structured dropout yields state-of-the-art accuracy while increasing speed by more than an order of magnitude over previous work.
Unsupervised domain adaptation is a fundamental problem for natural language processing, as we hope to apply our systems to datasets unlike those for which we have annotations. This is particularly relevant as labeled datasets become stale in comparison with rapidly evolving social media writing styles [8], and as there is increasing interest in natural language processing for historical texts [23]. While a number of different approaches for domain adaptation have been proposed [21, 26], they tend to emphasize bag-of-words features for classification tasks such as sentiment analysis. Consequently, many approaches rely on each instance having a relatively large number of active features, and fail to exploit the structured feature spaces that characterize syntactic tasks such as sequence labeling and parsing [25].
As we will show, substantial efficiency improvements can be obtained by designing domain adaptation methods for learning in structured feature spaces. We build on work from the deep learning community, in which denoising autoencoders are trained to remove synthetic noise from the observed instances [11]. By using the autoencoder to transform the original feature space, one may obtain a representation that is less dependent on any individual feature, and therefore more robust across domains. Chen et al. (2012) showed that such autoencoders can be learned even as the noising process is analytically marginalized; the idea is similar in spirit to feature noising [29]. While the marginalized denoising autoencoder (mDA) is considerably faster than the original denoising autoencoder, it requires solving a system of equations that can grow very large, as realistic NLP tasks can involve very large numbers of features.
In this paper we investigate noising functions that are explicitly designed for structured feature spaces, which are common in NLP. For example, in part-of-speech tagging, Toutanova et al. (2003) define several feature “templates”: the current word, the previous word, the suffix of the current word, and so on. For each feature template, there are thousands of binary features. To exploit this structure, we propose two alternative noising techniques: (1) feature scrambling, which randomly chooses a feature template and randomly selects an alternative value within the template, and (2) structured dropout, which randomly eliminates all but a single feature template. We show how it is possible to marginalize over both types of noise, and find that the solution for structured dropout is substantially simpler and more efficient than the mDA approach of Chen et al. (2012), which does not consider feature structure.
We apply these ideas to fine-grained part-of-speech tagging on a dataset of Portuguese texts from the years 1502 to 1836 [9], training on recent texts and evaluating on older documents. Both structure-aware domain adaptation algorithms perform as well as standard dropout, and better than the well-known structural correspondence learning (SCL) algorithm [1], but structured dropout is more than an order of magnitude faster. As a secondary contribution of this paper, we demonstrate the applicability of unsupervised domain adaptation to the syntactic analysis of historical texts.
In this section we first briefly describe the denoising autoencoder [12], its application to domain adaptation, and the analytic marginalization of noise [4]. Then we present three versions of marginalized denoising autoencoders (mDA) by incorporating different types of noise, including two new noising processes that are designed for structured features.
Assume instances $\mathbf{x}_1, \ldots, \mathbf{x}_n$, which are drawn from both the source and target domains. We will "corrupt" these instances by adding different types of noise, and denote the corrupted version of $\mathbf{x}_i$ by $\tilde{\mathbf{x}}_i$. Single-layer denoising autoencoders reconstruct the corrupted inputs with a projection matrix $\mathbf{W}: \mathbb{R}^d \rightarrow \mathbb{R}^d$, where $d$ is the number of features, which is estimated by minimizing the squared reconstruction loss

$$\mathcal{L}(\mathbf{W}) = \frac{1}{2n} \sum_{i=1}^{n} \lVert \mathbf{x}_i - \mathbf{W}\tilde{\mathbf{x}}_i \rVert^2 . \quad (1)$$

If we write the design matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, and we write its corrupted version $\tilde{\mathbf{X}}$, then the loss in (1) can be written as

$$\mathcal{L}(\mathbf{W}) = \frac{1}{2n} \operatorname{tr}\left[ (\mathbf{X} - \mathbf{W}\tilde{\mathbf{X}})^\top (\mathbf{X} - \mathbf{W}\tilde{\mathbf{X}}) \right] . \quad (2)$$

In this case, we have the well-known closed-form solution for this ordinary least squares problem:

$$\mathbf{W} = \mathbf{P}\mathbf{Q}^{-1}, \quad (3)$$

where $\mathbf{Q} = \tilde{\mathbf{X}}\tilde{\mathbf{X}}^\top$ and $\mathbf{P} = \mathbf{X}\tilde{\mathbf{X}}^\top$. After obtaining the weight matrix $\mathbf{W}$, we can insert nonlinearity into the output of the denoiser, such as $\tanh(\mathbf{W}\mathbf{X})$. It is also possible to apply stacking, by passing this vector through another autoencoder [4]. In pilot experiments, this slowed down estimation and had little effect on accuracy, so we did not include it.
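To make the estimation concrete, the following is a minimal NumPy sketch of this single-layer denoising autoencoder with explicit dropout corruption. The function name, the number of corrupted copies, and the small ridge term used to keep $\mathbf{Q}$ invertible are our own illustrative choices, not part of the original method.

```python
import numpy as np

def denoising_autoencoder(X, p=0.5, n_copies=5, seed=0):
    """Explicit (non-marginalized) single-layer denoising autoencoder.

    X : d x n design matrix of instances from source and target domains.
    p : probability that each feature is zeroed out (dropout corruption).
    Returns the projection matrix W minimizing ||X - W X_tilde||^2.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    P = np.zeros((d, d))  # accumulates X X_tilde^T over corrupted copies
    Q = np.zeros((d, d))  # accumulates X_tilde X_tilde^T over corrupted copies
    for _ in range(n_copies):  # several corrupted copies reduce variance
        X_tilde = X * (rng.random(X.shape) > p)  # keep each entry w.p. 1 - p
        P += X @ X_tilde.T
        Q += X_tilde @ X_tilde.T
    # Closed-form OLS solution W = P Q^{-1} (Eq. 3); a tiny ridge keeps Q invertible.
    W = np.linalg.solve(Q + 1e-8 * np.eye(d), P.T).T
    return W

# The adapted representation concatenates the original features with tanh(W @ X).
```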
Structured prediction tasks often have many more features than simple bag-of-words representations, and performance relies on these rarer features. In a naive implementation of the denoising approach, both $\mathbf{W}$ and $\mathbf{Q}$ will be dense matrices with dimensionality $d \times d$, which is far too large for the feature spaces in our experiments. To solve this problem, Chen et al. (2012) propose to use a set of pivot features, and train the autoencoder to reconstruct the pivots from the full set of features. Specifically, the corrupted input $\tilde{\mathbf{x}}$ is divided into several subsets. We obtain a projection matrix for each subset by reconstructing the pivot features from the features in this subset; we can then use the sum of all reconstructions as the new features.
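A sketch of this pivot-based, high-dimensional variant appears below. The subset partitioning, the helper names, and the use of a single corrupted copy are simplifying assumptions for illustration, not the exact recipe of Chen et al. (2012).

```python
import numpy as np

def pivot_mda_explicit(X, pivot_idx, n_subsets=10, p=0.5, seed=0):
    """High-dimensional variant: reconstruct only the pivot features from each
    subset of the corrupted full feature set, then sum the reconstructions.

    X         : d x n design matrix of binary features.
    pivot_idx : row indices of the pivot features.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    X_pivot = X[pivot_idx]                     # reconstruction targets
    X_tilde = X * (rng.random(X.shape) > p)    # one corrupted copy, for brevity
    h = np.zeros((len(pivot_idx), n))
    for idx in np.array_split(rng.permutation(d), n_subsets):
        Xs_tilde = X_tilde[idx]                # corrupted features in this subset
        P = X_pivot @ Xs_tilde.T               # pivots x subset
        Q = Xs_tilde @ Xs_tilde.T + 1e-8 * np.eye(len(idx))
        Ws = np.linalg.solve(Q, P.T).T         # projection for this subset
        h += Ws @ X[idx]                       # apply to clean features; sum reconstructions
    return np.tanh(h)                          # nonlinear output, appended to the base features
```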
In the standard denoising autoencoder, we need to generate multiple versions of the corrupted data to reduce the variance of the solution [12]. But Chen et al. (2012) show that it is possible to marginalize over the noise, analytically computing the expectations of both $\mathbf{P}$ and $\mathbf{Q}$, and computing

$$\mathbf{W} = E[\mathbf{P}] \, E[\mathbf{Q}]^{-1}, \quad (4)$$

where $E[\mathbf{P}] = \sum_{i=1}^{n} E[\mathbf{x}_i \tilde{\mathbf{x}}_i^\top]$ and $E[\mathbf{Q}] = \sum_{i=1}^{n} E[\tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\top]$. This is equivalent to corrupting the data $m \rightarrow \infty$ times. The computation of these expectations depends on the type of noise.
Chen et al. (2012) used dropout noise for domain adaptation, which we briefly review. We then describe two novel types of noise that are designed for structured feature spaces, and explain how they can be marginalized to efficiently compute $\mathbf{W}$.
In dropout noise, each feature is set to zero with probability $p$. If we define the scatter matrix of the uncorrupted input as $\mathbf{S} = \mathbf{X}\mathbf{X}^\top$, the solutions under dropout noise are

$$E[\mathbf{Q}]_{\alpha,\beta} = \begin{cases} (1-p)^2 \, \mathbf{S}_{\alpha,\beta} & \text{if } \alpha \neq \beta \\ (1-p) \, \mathbf{S}_{\alpha,\beta} & \text{if } \alpha = \beta \end{cases} \quad (5)$$

and

$$E[\mathbf{P}]_{\alpha,\beta} = (1-p) \, \mathbf{S}_{\alpha,\beta}, \quad (6)$$

where $\alpha$ and $\beta$ index two features. The form of these solutions means that computing $\mathbf{W} = E[\mathbf{P}]E[\mathbf{Q}]^{-1}$ requires solving a system of equations equal to the number of features (in the naive implementation), or several smaller systems of equations (in the high-dimensional version). Note also that $p$ is a tunable parameter for this type of noise.
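The marginalized computation can be written in a few lines. The sketch below assumes a dense scatter matrix and adds a small ridge term (our own choice) before solving the system in Eq. (4).

```python
import numpy as np

def mda_dropout(X, p=0.5):
    """Marginalized denoising autoencoder with standard dropout noise (Eqs. 5-6).

    E[P] = (1 - p) * S, and E[Q] equals (1 - p)^2 * S off the diagonal and
    (1 - p) * S on the diagonal, where S = X X^T is the scatter matrix.
    """
    d = X.shape[0]
    S = X @ X.T
    EP = (1.0 - p) * S
    EQ = (1.0 - p) ** 2 * S
    np.fill_diagonal(EQ, (1.0 - p) * np.diag(S))
    # W = E[P] E[Q]^{-1} (Eq. 4): still a dense d x d linear system to solve.
    return np.linalg.solve(EQ + 1e-8 * np.eye(d), EP.T).T
```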
In many NLP settings, we have several feature templates, such as previous-word, middle-word, next-word, etc., with only one feature per template firing on any token. We can exploit this structure by using an alternative dropout scheme: for each token, choose exactly one feature template to keep, and zero out all other features that consider this token (transition feature templates are not considered for dropout). Assuming we have $K$ feature templates, this noise leads to very simple solutions for the marginalized matrices $E[\mathbf{P}]$ and $E[\mathbf{Q}]$,

$$E[\mathbf{P}]_{\alpha,\beta} = \frac{1}{K}\,\mathbf{S}_{\alpha,\beta} \quad (7)$$

$$E[\mathbf{Q}]_{\alpha,\beta} = \begin{cases} \frac{1}{K}\,\mathbf{S}_{\alpha,\beta} & \text{if } \alpha = \beta \\ 0 & \text{otherwise.} \end{cases} \quad (8)$$

For $E[\mathbf{P}]$, we obtain a scaled version of the scatter matrix, because in each instance $\mathbf{x}_i$, there is exactly a $\frac{1}{K}$ chance that each individual feature survives dropout. $E[\mathbf{Q}]$ is diagonal, because for any off-diagonal entry $(\alpha, \beta)$, at least one of $\alpha$ and $\beta$ will drop out for every instance. We can therefore view the projection matrix $\mathbf{W} = E[\mathbf{P}]E[\mathbf{Q}]^{-1}$ as a row-normalized version of the scatter matrix $\mathbf{S}$. Put another way, the contribution of $x_\beta$ to the reconstruction of $x_\alpha$ is equal to the co-occurrence count of $\alpha$ and $\beta$, divided by the count of $\beta$.
Unlike standard dropout, there are no free hyper-parameters to tune for structured dropout. Since $E[\mathbf{Q}]$ is a diagonal matrix, we eliminate the cost of matrix inversion (or of solving a system of linear equations). Moreover, to extend mDA to high-dimensional data, we no longer need to divide the corrupted input into several subsets: $\mathbf{W}$ is simply a matrix with one row per pivot feature and one column per feature.
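Because $E[\mathbf{Q}]$ is diagonal, the entire projection reduces to elementwise division by feature counts. The sketch below assumes a dense scatter matrix for clarity; in practice only the pivot rows of $\mathbf{W}$ would be computed, and the zero-count guard is our own addition.

```python
import numpy as np

def mda_structured_dropout(X):
    """Marginalized structured dropout (Eqs. 7-8).

    E[P] = S / K and E[Q] = diag(S) / K, so the 1/K factors cancel and
    W[a, b] = S[a, b] / S[b, b]: the co-occurrence count of features a and b,
    divided by the count of b.  No linear system needs to be solved.
    """
    S = X @ X.T                       # scatter matrix of the uncorrupted input
    counts = np.diag(S).astype(float) # feature counts S[b, b]
    counts[counts == 0.0] = 1.0       # guard against features that never fire
    return S / counts[np.newaxis, :]  # scale each column b by 1 / S[b, b]
```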
For intuition, consider standard feature dropout with $p = \frac{K-1}{K}$. This will look very similar to structured dropout: the matrix $E[\mathbf{P}]$ is identical, and $E[\mathbf{Q}]$ has off-diagonal elements which are scaled by $(1-p)^2 = \frac{1}{K^2}$, which goes to zero as $K$ grows large. However, by including these elements, standard dropout is considerably slower, as we show in our experiments.
A third alternative is to "scramble" the features by randomly selecting alternative features within each template. For a feature $\alpha$ belonging to a template $t$, with probability $p$ we will draw a noise feature $\alpha'$ also belonging to $t$, according to some distribution $q_t$. In this work, we use a uniform distribution, in which $q_{t,\alpha'} = \frac{1}{\#(t)}$, where $\#(t)$ is the number of features in template $t$. However, the solutions below will also hold for other scrambling distributions, such as mean-preserving distributions.
Again, it is possible to analytically marginalize over this noise. Recall that $E[\mathbf{Q}] = \sum_{i=1}^{n} E[\tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\top]$. An off-diagonal entry of $E[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top]$ which involves features $\alpha$ and $\beta$ belonging to different templates ($t_\alpha \neq t_\beta$) can take four different values ($x_\alpha$ denotes feature $\alpha$ in $\mathbf{x}$):

- $x_\alpha x_\beta$ if both features are unchanged, which happens with probability $(1-p)^2$;
- $q_{t_\alpha,\alpha} \, q_{t_\beta,\beta}$ if both features are chosen as noise features, which happens with probability $p^2$;
- $x_\alpha \, q_{t_\beta,\beta}$ or $x_\beta \, q_{t_\alpha,\alpha}$ if one feature is unchanged and the other one is chosen as the noise feature, each of which happens with probability $p(1-p)$.

The diagonal entries take the first two values above, with probability $1-p$ and $p$ respectively. All other entries are zero (only one feature belonging to the same template will fire in $\mathbf{x}$). We can use similar reasoning to compute the expectation of $\mathbf{P} = \sum_{i=1}^{n} \mathbf{x}_i \tilde{\mathbf{x}}_i^\top$. With probability $1-p$, the original features are preserved, and we add the outer-product $\mathbf{x}_i \mathbf{x}_i^\top$; with probability $p$, we add the outer-product $\mathbf{x}_i \mathbf{q}^\top$, where $\mathbf{q}$ stacks the template-wise noise distributions. Therefore $E[\mathbf{P}]$ can be computed as the sum of these terms.
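For concreteness, here is a sketch of the scrambling corruption itself (not its marginalization), assuming the features are grouped into template index sets; the function and variable names are illustrative.

```python
import numpy as np

def scramble(x, templates, p=0.1, rng=None):
    """Feature scrambling noise for a single instance.

    x         : length-d binary feature vector with exactly one active feature
                per template.
    templates : list of index arrays, one per (non-transition) feature template.
    With probability p, the active feature of a template is replaced by one
    drawn from the same template; q_t is uniform over the template here.
    """
    rng = rng or np.random.default_rng()
    x_tilde = x.copy()
    for idx in templates:
        if rng.random() < p:
            x_tilde[idx] = 0              # drop the original feature in this template
            x_tilde[rng.choice(idx)] = 1  # draw the noise feature from q_t
    return x_tilde

# Averaging x_tilde @ x_tilde.T and x @ x_tilde.T over many corrupted copies
# gives Monte Carlo estimates of E[Q] and E[P] that can be checked against
# the analytic expressions above.
```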
We compare these methods on historical Portuguese part-of-speech tagging, creating domains over historical epochs.
We use the Tycho Brahe corpus to evaluate our methods. The corpus contains a total of 1,480,528 manually tagged words. It uses a set of 383 tags and is composed of various historical Portuguese texts from 1502 to 1836. We divide the texts into fifty-year periods to create different domains. Table 1 presents statistics of the datasets. We hold out 5% of the data as development data to tune parameters. The two most recent domains (1800-1849 and 1750-1849) are treated as source domains, and the other domains are target domains. This scenario is motivated by the goal of training a tagger on a modern news corpus and applying it to historical documents.
Table 1: Number of tokens in each domain, by genre.

| Dataset | Total | Narrative | Letters | Dissertation | Theatre |
|---|---|---|---|---|---|
| 1800-1849 | 125719 | 91582 | 34137 | 0 | 0 |
| 1750-1799 | 202346 | 57477 | 84465 | 0 | 60404 |
| 1700-1749 | 278846 | 0 | 130327 | 148519 | 0 |
| 1650-1699 | 248194 | 83938 | 115062 | 49194 | 0 |
| 1600-1649 | 295154 | 117515 | 115252 | 62387 | 0 |
| 1550-1599 | 148061 | 148061 | 0 | 0 | 0 |
| 1500-1549 | 182208 | 126516 | 0 | 55692 | 0 |
| Overall | 1480528 | 625089 | 479243 | 315792 | 60404 |
We use a conditional random field tagger, choosing CRFsuite because it supports arbitrary real-valued features [20], with SGD optimization. Following the work of Nogueira Dos Santos et al. (2008) on this dataset, we apply the feature set of Ratnaparkhi (1996), which consists of several feature templates that together generate a large number of binary features. Following Blitzer et al. (2006), we take as pivot features those that appear frequently across all the domains.
We compare mDA with three alternative approaches. The baseline trains a CRF tagger on the source domain and tests on the target domain with only the base features. We also include PCA, which projects the entire dataset onto a low-dimensional subspace (while still including the original features). Finally, we compare against Structural Correspondence Learning (SCL; Blitzer et al., 2006), another feature learning algorithm. In all cases, we include the entire dataset when computing the feature projections; we also conducted experiments using only the test and training data for the feature projections, with very similar results.
All hyper-parameters are tuned on the development data from the training set. We try a range of low dimensionalities for PCA. Following Blitzer (2008), we perform feature centering/normalization, as well as rescaling, for SCL; the best SCL dimensionality and rescaling factor match those of the original paper. For mDA, the corruption level is tuned separately for dropout noise and for scrambling noise; structured dropout noise has no free hyper-parameters.
Table 2 presents results for the different domain adaptation tasks. We also compute the transfer ratio, defined as the accuracy of the adapted model divided by the accuracy of the unadapted baseline, shown in Figure 1. The generally positive trend of these graphs indicates that adaptation becomes progressively more important as we select test sets that are more temporally remote from the training data.
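For example, when training on 1800-1849 and testing on 1500-1549 (Table 2), structured dropout reaches 88.95% accuracy against a baseline of 85.58%, a transfer ratio of roughly 1.04.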
In general, mDA outperforms SCL and PCA, the latter of which shows little improvement over the base features. The various noising approaches for mDA give very similar results. However, structured dropout is orders of magnitude faster than the alternatives, as shown in Table 3. The scrambling noise is the most time-consuming, with cost dominated by a matrix multiplication.
Table 2: Part-of-speech tagging accuracy for each domain adaptation task.

| Task | baseline | PCA | SCL | mDA (dropout) | mDA (structured) | mDA (scrambling) |
|---|---|---|---|---|---|---|
| from 1800-1849 | | | | | | |
| 1750 | 89.12 | 89.09 | 89.69 | 90.08 | 90.08 | 90.01 |
| 1700 | 90.43 | 90.43 | 91.06 | 91.56 | 91.57 | 91.55 |
| 1650 | 88.45 | 88.52 | 87.09 | 88.69 | 88.70 | 88.57 |
| 1600 | 87.56 | 87.58 | 88.47 | 89.60 | 89.61 | 89.54 |
| 1550 | 89.66 | 89.61 | 90.57 | 91.39 | 91.39 | 91.36 |
| 1500 | 85.58 | 85.63 | 86.99 | 88.96 | 88.95 | 88.91 |
| from 1750-1849 | | | | | | |
| 1700 | 94.64 | 94.62 | 94.81 | 95.08 | 95.08 | 95.02 |
| 1650 | 91.98 | 90.97 | 90.37 | 90.83 | 90.84 | 90.80 |
| 1600 | 92.95 | 92.91 | 93.17 | 93.78 | 93.78 | 93.71 |
| 1550 | 93.27 | 93.21 | 93.75 | 94.06 | 94.05 | 94.02 |
| 1500 | 89.80 | 89.75 | 90.59 | 91.71 | 91.71 | 91.68 |
Table 3: Timing comparison of the feature learning methods.

| Method | PCA | SCL | mDA (dropout) | mDA (structured) | mDA (scrambling) |
|---|---|---|---|---|---|
| Time | 7,779 | 38,849 | 8,939 | 339 | 327,075 |
Most previous work on domain adaptation focused on the supervised setting, in which some labeled data is available in the target domain [16, 5, 10]. Our work focuses on unsupervised domain adaptation, where no labeled data is available in the target domain. Several representation learning methods have been proposed to solve this problem. In structural correspondence learning (SCL), the induced representation is based on the task of predicting the presence of pivot features. Autoencoders apply a similar idea, but use the denoised instances as the latent representation [28, 12, 4]. Within the context of denoising autoencoders, we have focused on dropout noise, which has also been applied as a general technique for improving the robustness of machine learning, particularly in neural networks [13, 29].
On the specific problem of sequence labeling, Xiao and Guo (2013) proposed a supervised domain adaptation method by using a log-bilinear language adaptation model. Dhillon et al. (2011) presented a spectral method to estimate low-dimensional, context-specific word representations for sequence labeling. Huang and Yates [14, 15] used an HMM to learn latent representations, and then leveraged the Posterior Regularization framework to incorporate specific biases. Unlike these methods, our approach uses a standard CRF, but with transformed features.
Our evaluation concerns syntactic analysis of historical text, which is a topic of increasing interest for NLP [23]. Pennacchiotti and Zanzotto (2008) find that part-of-speech tagging degrades considerably when applied to a corpus of historical Italian. Moon and Baldridge (2007) tackle the challenging problem of tagging Middle English, using techniques for projecting syntactic annotations across languages. Prior work on the Tycho Brahe corpus applied supervised learning to a random split of test and training data [17, 7]; they did not consider the domain adaptation problem of training on recent data and testing on older historical text.
Denoising autoencoders provide an intuitive solution for domain adaptation: transform the features into a representation that is resistant to the noise that may characterize the domain adaptation process. The original implementation of this idea applied the noise explicitly [12]; later work showed that dropout noise could be analytically marginalized [4]. We take another step towards simplicity by showing that structured dropout can make marginalization even easier, obtaining dramatic speedups without sacrificing accuracy.
Acknowledgments: We thank the reviewers for useful feedback. This research was supported by National Science Foundation award 1349837.