Fast Easy Unsupervised Domain Adaptation
with Marginalized Structured Dropout

Yi Yang    Jacob Eisenstein
School of Interactive Computing
Georgia Institute of Technology
{yiyang, jacobe}@gatech.edu
Abstract

Unsupervised domain adaptation often relies on transforming the instance representation. However, most such approaches are designed for bag-of-words models, and ignore the structured features present in many problems in NLP. We propose a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection. Applied to the task of fine-grained part-of-speech tagging on a dataset of historical Portuguese, marginalized structured dropout yields state-of-the-art accuracy while increasing speed by more than an order-of-magnitude over previous work.

1 Introduction

Unsupervised domain adaptation is a fundamental problem for natural language processing, as we hope to apply our systems to datasets unlike those for which we have annotations. This is particularly relevant as labeled datasets become stale in comparison with rapidly evolving social media writing styles [8], and as there is increasing interest in natural language processing for historical texts [23]. While a number of different approaches for domain adaptation have been proposed [21, 26], they tend to emphasize bag-of-words features for classification tasks such as sentiment analysis. Consequently, many approaches rely on each instance having a relatively large number of active features, and fail to exploit the structured feature spaces that characterize syntactic tasks such as sequence labeling and parsing [25].

As we will show, substantial efficiency improvements can be obtained by designing domain adaptation methods for learning in structured feature spaces. We build on work from the deep learning community, in which denoising autoencoders are trained to remove synthetic noise from the observed instances [11]. By using the autoencoder to transform the original feature space, one may obtain a representation that is less dependent on any individual feature, and therefore more robust across domains. Chen et al. (2012) showed that such autoencoders can be learned even as the noising process is analytically marginalized; the idea is similar in spirit to feature noising [29]. While the marginalized denoising autoencoder (mDA) is considerably faster than the original denoising autoencoder, it requires solving a system of equations that can grow very large, as realistic NLP tasks can involve $10^5$ or more features.

In this paper we investigate noising functions that are explicitly designed for structured feature spaces, which are common in NLP. For example, in part-of-speech tagging, Toutanova et al. (2003) define several feature “templates”: the current word, the previous word, the suffix of the current word, and so on. For each feature template, there are thousands of binary features. To exploit this structure, we propose two alternative noising techniques: (1) feature scrambling, which randomly chooses a feature template and randomly selects an alternative value within the template, and (2) structured dropout, which randomly eliminates all but a single feature template. We show how it is possible to marginalize over both types of noise, and find that the solution for structured dropout is substantially simpler and more efficient than the mDA approach of Chen et al. (2012), which does not consider feature structure.

We apply these ideas to fine-grained part-of-speech tagging on a dataset of Portuguese texts from the years 1502 to 1836 [9], training on recent texts and evaluating on older documents. Both structure-aware domain adaptation algorithms perform as well as standard dropout — and better than the well-known structural correspondence learning (SCL) algorithm [1] — but structured dropout is more than an order-of-magnitude faster. As a secondary contribution of this paper, we demonstrate the applicability of unsupervised domain adaptation to the syntactic analysis of historical texts.

2 Model

In this section we first briefly describe the denoising autoencoder [12], its application to domain adaptation, and the analytic marginalization of noise [4]. Then we present three versions of marginalized denoising autoencoders (mDA) by incorporating different types of noise, including two new noising processes that are designed for structured features.

2.1 Denoising Autoencoders

Assume instances $\mathbf{x}_1, \ldots, \mathbf{x}_n$, which are drawn from both the source and target domains. We will "corrupt" these instances by adding different types of noise, and denote the corrupted version of $\mathbf{x}_i$ by $\tilde{\mathbf{x}}_i$. Single-layer denoising autoencoders reconstruct the corrupted inputs with a projection matrix $\mathbf{W} : \mathbb{R}^d \rightarrow \mathbb{R}^d$, which is estimated by minimizing the squared reconstruction loss

\mathcal{L} = \frac{1}{2} \sum_{i=1}^{n} \lVert \mathbf{x}_i - \mathbf{W}\tilde{\mathbf{x}}_i \rVert^2 .   (1)

If we write $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, and write its corrupted version as $\tilde{\mathbf{X}}$, then the loss in (1) can be written as

\mathcal{L}(\mathbf{W}) = \frac{1}{2n} \mathrm{tr}\left[ (\mathbf{X} - \mathbf{W}\tilde{\mathbf{X}})^{\top} (\mathbf{X} - \mathbf{W}\tilde{\mathbf{X}}) \right] .   (2)

In this case, we have the well-known closed-form solution for this ordinary least squares problem:

\mathbf{W} = \mathbf{P}\mathbf{Q}^{-1} ,   (3)

where $\mathbf{Q} = \tilde{\mathbf{X}}\tilde{\mathbf{X}}^{\top}$ and $\mathbf{P} = \mathbf{X}\tilde{\mathbf{X}}^{\top}$. After obtaining the weight matrix $\mathbf{W}$, we can insert nonlinearity into the output of the denoiser, such as $\tanh(\mathbf{W}\mathbf{X})$. It is also possible to apply stacking, by passing this vector through another autoencoder [4]. In pilot experiments, this slowed down estimation and had little effect on accuracy, so we did not include it.
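To make this concrete, the following is a minimal NumPy sketch (not our released implementation) of the single-layer denoising autoencoder in Equations (1)-(3), using explicitly sampled corruptions; the corruption function, the number of corrupted copies, and the small ridge term added for numerical stability are illustrative assumptions.

import numpy as np

def denoising_autoencoder(X, corrupt, n_copies=5, reg=1e-5):
    """Estimate W = P Q^{-1} from explicitly corrupted copies of X.
    X: d x n data matrix; corrupt: function returning a corrupted copy of X."""
    d = X.shape[0]
    P = np.zeros((d, d))
    Q = np.zeros((d, d))
    for _ in range(n_copies):            # average over several corrupted versions
        X_tilde = corrupt(X)
        P += X @ X_tilde.T               # accumulates X X~^T
        Q += X_tilde @ X_tilde.T         # accumulates X~ X~^T
    W = P @ np.linalg.inv(Q + reg * np.eye(d))   # ridge term keeps Q invertible
    return np.tanh(W @ X)                # nonlinearity applied to the reconstruction

Here corrupt could be, for example, lambda X: X * (np.random.rand(*X.shape) > 0.9) for dropout noise with p = 0.9.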

High-dimensional setting

Structured prediction tasks often involve many more features than simple bag-of-words representations, and performance often depends on rare features. In a naive implementation of the denoising approach, both $\mathbf{P}$ and $\mathbf{Q}$ will be dense matrices with dimensionality $d \times d$, which would amount to roughly $10^{11}$ elements in our experiments. To solve this problem, Chen et al. (2012) propose to use a set of pivot features, and train the autoencoder to reconstruct the pivots from the full set of features. Specifically, the corrupted input is divided into $S$ subsets, $\tilde{\mathbf{x}}_i = [\tilde{\mathbf{x}}_i^{(1)}, \ldots, \tilde{\mathbf{x}}_i^{(S)}]$. We obtain a projection matrix $\mathbf{W}^{(s)}$ for each subset by reconstructing the pivot features from the features in that subset; we can then use the sum of all reconstructions as the new features, $\tanh\left(\sum_{s=1}^{S} \mathbf{W}^{(s)} \mathbf{X}^{(s)}\right)$.
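The sketch below illustrates this pivot-based variant; pivot_mda, the subsets argument, and the generic solve_W routine are names invented for illustration, standing in for whatever solver produces each per-subset projection onto the pivot features.

import numpy as np

def pivot_mda(X, pivot_idx, subsets, solve_W):
    """X: d x n feature matrix; pivot_idx: row indices of the r pivot features;
    subsets: list of row-index arrays partitioning the feature space;
    solve_W: routine returning an r x |subset| projection onto the pivots."""
    h = np.zeros((len(pivot_idx), X.shape[1]))
    for idx in subsets:
        W_s = solve_W(X[pivot_idx, :], X[idx, :])  # reconstruct pivots from this subset
        h += W_s @ X[idx, :]                       # sum the per-subset reconstructions
    return np.tanh(h)                              # new feature representation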

Marginalized Denoising Autoencoders

In the standard denoising autoencoder, we need to generate multiple versions of the corrupted data $\tilde{\mathbf{X}}$ to reduce the variance of the solution [12]. But Chen et al. (2012) show that it is possible to marginalize over the noise, analytically computing expectations of both $\mathbf{P}$ and $\mathbf{Q}$, and computing

\mathbf{W} = E[\mathbf{P}] \, E[\mathbf{Q}]^{-1} ,   (4)

where $E[\mathbf{P}] = \sum_{i=1}^{n} E[\mathbf{x}_i \tilde{\mathbf{x}}_i^{\top}]$ and $E[\mathbf{Q}] = \sum_{i=1}^{n} E[\tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^{\top}]$. This is equivalent to corrupting the data $m \rightarrow \infty$ times. The computation of these expectations depends on the type of noise.

2.2 Noise distributions

Chen et al. (2012) used dropout noise for domain adaptation, which we briefly review. We then describe two novel types of noise that are designed for structured feature spaces, and explain how they can be marginalized to efficiently compute 𝐖.

Dropout noise

In dropout noise, each feature is set to zero with probability $p > 0$. If we define the scatter matrix of the uncorrupted input as $\mathbf{S} = \mathbf{X}\mathbf{X}^{\top}$, the solutions under dropout noise are

E[\mathbf{Q}]_{\alpha,\beta} =
\begin{cases}
(1-p)^2 \, \mathbf{S}_{\alpha,\beta} & \text{if } \alpha \neq \beta \\
(1-p) \, \mathbf{S}_{\alpha,\beta} & \text{if } \alpha = \beta ,
\end{cases}   (5)

and

E[\mathbf{P}]_{\alpha,\beta} = (1-p) \, \mathbf{S}_{\alpha,\beta} ,   (6)

where $\alpha$ and $\beta$ index two features. The form of these solutions means that computing $\mathbf{W}$ requires solving a system of equations whose size is equal to the number of features (in the naive implementation), or several smaller systems of equations (in the high-dimensional version). Note also that $p$ is a tunable parameter for this type of noise.
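As an illustration, here is a naive (non-pivot) sketch of the marginalized dropout solution in Equations (4)-(6); the small ridge term used to keep $E[\mathbf{Q}]$ well conditioned is an added assumption, and a practical implementation would restrict $E[\mathbf{P}]$ to the pivot rows as described above.

import numpy as np

def mda_dropout(X, p=0.9, reg=1e-5):
    """Marginalized denoising autoencoder with dropout noise (naive d x d version).
    X: d x n uncorrupted data matrix; p: dropout probability."""
    d = X.shape[0]
    S = X @ X.T                                   # scatter matrix S = X X^T
    EQ = (1 - p) ** 2 * S                         # off-diagonal entries of E[Q]
    np.fill_diagonal(EQ, (1 - p) * np.diag(S))    # diagonal entries scale by (1 - p)
    EP = (1 - p) * S                              # E[P]
    W = EP @ np.linalg.inv(EQ + reg * np.eye(d))  # W = E[P] E[Q]^{-1}
    return np.tanh(W @ X)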

Structured dropout noise

In many NLP settings, we have several feature templates, such as previous-word, middle-word, next-word, etc., with only one feature per template firing on any token. We can exploit this structure by using an alternative dropout scheme: for each token, choose exactly one feature template to keep, and zero out all other features that consider this token (transition feature templates such as $\langle y_t, y_{t-1} \rangle$ are not considered for dropout). Assuming we have $K$ feature templates, this noise leads to very simple solutions for the marginalized matrices $E[\mathbf{P}]$ and $E[\mathbf{Q}]$,

E[\mathbf{Q}]_{\alpha,\beta} =
\begin{cases}
0 & \text{if } \alpha \neq \beta \\
\frac{1}{K} \mathbf{S}_{\alpha,\beta} & \text{if } \alpha = \beta
\end{cases}   (7)

E[\mathbf{P}]_{\alpha,\beta} = \frac{1}{K} \mathbf{S}_{\alpha,\beta} .   (8)

For $E[\mathbf{P}]$, we obtain a scaled version of the scatter matrix, because in each instance $\tilde{\mathbf{x}}$, there is exactly a $1/K$ chance that each individual feature survives dropout. $E[\mathbf{Q}]$ is diagonal, because for any off-diagonal entry $E[\mathbf{Q}]_{\alpha,\beta}$, at least one of $\alpha$ and $\beta$ will drop out for every instance. We can therefore view the projection matrix $\mathbf{W}$ as a row-normalized version of the scatter matrix $\mathbf{S}$. Put another way, the contribution of $\beta$ to the reconstruction of $\alpha$ is equal to the co-occurrence count of $\alpha$ and $\beta$, divided by the count of $\beta$.

Unlike standard dropout, there are no free hyper-parameters to tune for structured dropout. Since $E[\mathbf{Q}]$ is a diagonal matrix, we eliminate the cost of matrix inversion (or of solving a system of linear equations). Moreover, to extend mDA to high-dimensional data, we no longer need to divide the corrupted input $\tilde{\mathbf{x}}$ into subsets ($E[\mathbf{P}]$ is an $r \times d$ matrix, where $r$ is the number of pivots).

For intuition, consider standard feature dropout with $p = \frac{K-1}{K}$. This will look very similar to structured dropout: the matrix $E[\mathbf{P}]$ is identical, and $E[\mathbf{Q}]$ has off-diagonal elements which are scaled by $(1-p)^2$, which goes to zero as $K$ grows. However, by including these elements, standard dropout is considerably slower, as we show in our experiments.
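A minimal sketch makes the simplification explicit: since the $1/K$ factors cancel, $\mathbf{W}$ is just the scatter matrix with each column divided by the corresponding feature count, and no linear system needs to be solved. The guard against zero counts is an added assumption; in the high-dimensional setting one would keep only the pivot rows of $\mathbf{S}$.

import numpy as np

def mda_structured_dropout(X):
    """Marginalized structured dropout, Equations (7)-(8).
    X: d x n matrix of binary structured features (one active feature per template)."""
    S = X @ X.T                            # scatter matrix; S[b, b] is the count of feature b
    counts = np.maximum(np.diag(S), 1.0)   # avoid division by zero for unseen features
    W = S / counts[np.newaxis, :]          # W[a, b] = S[a, b] / S[b, b]
    return np.tanh(W @ X)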

Scrambling noise

A third alternative is to "scramble" the features by randomly selecting alternative features within each template. For a feature $\alpha$ belonging to a template $F$, with probability $p$ we will draw a noise feature $\beta$ also belonging to $F$, according to some distribution $q$. In this work, we use a uniform distribution, in which $q_\beta = \frac{1}{|F|}$. However, the solutions below also hold for other scrambling distributions, such as mean-preserving distributions.

Again, it is possible to analytically marginalize over this noise. Recall that $E[\mathbf{Q}] = \sum_{i=1}^{n} E[\tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^{\top}]$. An off-diagonal entry in the matrix $\tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^{\top}$ which involves features $\alpha$ and $\beta$ belonging to different templates ($F_\alpha \neq F_\beta$) can take four different values ($\mathbf{x}_{i,\alpha}$ denotes feature $\alpha$ in $\mathbf{x}_i$):

  • $\mathbf{x}_{i,\alpha} \mathbf{x}_{i,\beta}$ if both features are unchanged, which happens with probability $(1-p)^2$.

  • $1$ if both features are chosen as noise features, which happens with probability $p^2 q_\alpha q_\beta$.

  • $\mathbf{x}_{i,\alpha}$ or $\mathbf{x}_{i,\beta}$ if one feature is unchanged and the other is chosen as the noise feature, which happens with probability $p(1-p)q_\beta$ or $p(1-p)q_\alpha$, respectively.

The diagonal entries take the first two values above, with probability $1-p$ and $p q_\alpha$ respectively. All other entries are zero (only one feature belonging to the same template will fire in $\mathbf{x}_i$). We can use similar reasoning to compute the expectation of $\mathbf{P}$. With probability $(1-p)$, the original features are preserved, and we add the outer product $\mathbf{x}_i \mathbf{x}_i^{\top}$; with probability $p$, we add the outer product $\mathbf{x}_i \mathbf{q}^{\top}$. Therefore $E[\mathbf{P}]$ can be computed as the sum of these terms.
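For completeness, the sketch below computes the marginalized scrambling-noise expectations under the uniform within-template distribution; representing templates as lists of feature row indices that cover all noised features is an assumption of this illustration, and features exempt from noise (such as transition features) are not handled here.

import numpy as np

def scrambling_expectations(X, templates, p=0.1):
    """Marginalized expectations E[P] and E[Q] for scrambling noise.
    X: d x n binary feature matrix; templates: list of row-index arrays, one per
    template (exactly one feature per template fires on each token); p: noise level."""
    d, n = X.shape
    q = np.zeros(d)
    for idx in templates:
        q[idx] = 1.0 / len(idx)            # uniform distribution q within each template

    S = X @ X.T                            # uncorrupted scatter matrix
    counts = X.sum(axis=1)                 # per-feature counts, sum_i x_{i,alpha}

    # E[P] = (1 - p) * sum_i x_i x_i^T + p * sum_i x_i q^T
    EP = (1 - p) * S + p * np.outer(counts, q)

    # Cross-template entries of E[Q], summed over the n instances
    EQ = ((1 - p) ** 2 * S
          + n * p ** 2 * np.outer(q, q)
          + p * (1 - p) * (np.outer(counts, q) + np.outer(q, counts)))
    for idx in templates:                  # same-template off-diagonals are zero
        EQ[np.ix_(idx, idx)] = 0.0
    EQ[np.diag_indices(d)] = (1 - p) * counts + p * n * q   # diagonal entries
    return EP, EQ

The projection matrix then follows as $\mathbf{W} = E[\mathbf{P}]\, E[\mathbf{Q}]^{-1}$, as in Equation (4).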

3 Experiments

We compare these methods on historical Portuguese part-of-speech tagging, creating domains over historical epochs.

3.1 Experiment setup

Datasets

We use the Tycho Brahe corpus to evaluate our methods. The corpus contains a total of 1,480,528 manually tagged words. It uses a set of 383 tags and is composed of various historical Portuguese texts from 1502 to 1836. We divide the texts into fifty-year periods to create different domains. Table 1 presents some statistics of the datasets. We hold out 5% of the data as development data to tune parameters. We treat the most recent domain (1800-1849) as the source domain in one setting, and the union of the two most recent periods (1750-1849) in another; the remaining domains are target domains. This scenario is motivated by training a tagger on a modern news text corpus and applying it to historical documents.

Dataset      Total      Narrative   Letters    Dissertation   Theatre
1800-1849    125719     91582       34137      0              0
1750-1799    202346     57477       84465      0              60404
1700-1749    278846     0           130327     148519         0
1650-1699    248194     83938       115062     49194          0
1600-1649    295154     117515      115252     62387          0
1550-1599    148061     148061      0          0              0
1500-1549    182208     126516      0          55692          0
Overall      1480528    625089      479243     315792         60404

Table 1: Statistics of the Tycho Brahe Corpus (number of tokens per period and genre)

CRF tagger

We use a conditional random field tagger, choosing CRFsuite [20] because it supports arbitrary real-valued features, with SGD optimization. Following the work of Nogueira Dos Santos et al. (2008) on this dataset, we apply the feature set of Ratnaparkhi (1996). There are 16 feature templates and 372,902 features in total. Following Blitzer et al. (2006), we consider pivot features that appear more than 50 times in all the domains. This leads to a total of 1,572 pivot features in our experiments.
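To illustrate the structure these templates impose, here is a hypothetical sketch of a few templates in the style of Ratnaparkhi (1996), not the exact 16-template set used in our experiments; each template contributes exactly one binary feature per token, which is precisely the structure exploited by structured dropout.

def token_features(sent, t):
    """Binary features for token t of sentence sent (a list of word strings);
    each dictionary key belongs to exactly one template."""
    w = sent[t]
    prev_w = sent[t - 1] if t > 0 else "<s>"
    next_w = sent[t + 1] if t + 1 < len(sent) else "</s>"
    return {
        "word=" + w: 1.0,            # current-word template
        "prev_word=" + prev_w: 1.0,  # previous-word template
        "next_word=" + next_w: 1.0,  # next-word template
        "prefix3=" + w[:3]: 1.0,     # prefix template
        "suffix3=" + w[-3:]: 1.0,    # suffix template
    }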

Methods

We compare mDA with three alternative approaches. The baseline is a CRF tagger trained on the source domain and tested on the target domain using only the base features. We also include PCA, which projects the entire dataset onto a low-dimensional subspace (while still including the original features). Finally, we compare against structural correspondence learning (SCL; Blitzer et al., 2006), another feature learning algorithm. In all cases, we include the entire dataset to compute the feature projections; we also conducted experiments using only the test and training data for feature projections, with very similar results.

Parameters

All hyper-parameters are tuned on the development data held out from the training set. For PCA, we consider projection dimensionalities K from 10 to 2000. Following Blitzer (2008), we perform feature centering/normalization, as well as rescaling, for SCL. The best parameters for SCL are dimensionality K=25 and rescale factor α=5, which are the same as in the original paper. For mDA, the best corruption level is p=0.9 for dropout noise, and p=0.1 for scrambling noise. Structured dropout noise has no free hyper-parameters.

3.2 Results

Table 2 presents results for different domain adaptation tasks. We also compute the transfer ratio, defined as the ratio of adaptation accuracy to baseline accuracy, shown in Figure 1. The generally positive trend of these graphs indicates that adaptation becomes progressively more important as we select test sets that are more temporally remote from the training data.

In general, mDA outperforms SCL and PCA, the latter of which shows little improvement over the base features. The various noising approaches for mDA give very similar results. However, structured dropout is orders of magnitude faster than the alternatives, as shown in Table 3. The scrambling noise is the most time-consuming, with its cost dominated by a matrix multiplication.

Task    baseline   PCA     SCL     mDA
                                   dropout   structured   scrambling
from 1800-1849
1750    89.12      89.09   89.69   90.08     90.08        90.01
1700    90.43      90.43   91.06   91.56     91.57        91.55
1650    88.45      88.52   87.09   88.69     88.70        88.57
1600    87.56      87.58   88.47   89.60     89.61        89.54
1550    89.66      89.61   90.57   91.39     91.39        91.36
1500    85.58      85.63   86.99   88.96     88.95        88.91
from 1750-1849
1700    94.64      94.62   94.81   95.08     95.08        95.02
1650    91.98      90.97   90.37   90.83     90.84        90.80
1600    92.95      92.91   93.17   93.78     93.78        93.71
1550    93.27      93.21   93.75   94.06     94.05        94.02
1500    89.80      89.75   90.59   91.71     91.71        91.68

Table 2: Accuracy results for adaptation from labeled data in 1800-1849, and in 1750-1849.
Figure 1: Transfer ratio for adaptation to historical text
Method   PCA     SCL      mDA
                          dropout   structured   scrambling
Time     7,779   38,849   8,939     339          327,075

Table 3: Time, in seconds, to compute the feature transformation.

4 Related Work

Domain adaptation

Most previous work on domain adaptation focused on the supervised setting, in which some labeled data is available in the target domain [16, 5, 10]. Our work focuses on unsupervised domain adaptation, where no labeled data is available in the target domain. Several representation learning methods have been proposed to solve this problem. In structural correspondence learning (SCL), the induced representation is based on the task of predicting the presence of pivot features. Autoencoders apply a similar idea, but use the denoised instances as the latent representation [28, 12, 4]. Within the context of denoising autoencoders, we have focused on dropout noise, which has also been applied as a general technique for improving the robustness of machine learning, particularly in neural networks [13, 29].

On the specific problem of sequence labeling, Xiao and Guo (2013) proposed a supervised domain adaptation method using a log-bilinear language adaptation model. Dhillon et al. (2011) presented a spectral method to estimate low-dimensional, context-specific word representations for sequence labeling. Huang and Yates [14, 15] used an HMM to learn latent representations, and then leveraged the Posterior Regularization framework to incorporate specific biases. Unlike these methods, our approach uses a standard CRF, but with transformed features.

Historical text

Our evaluation concerns syntactic analysis of historical text, which is a topic of increasing interest for NLP [23]. Pennacchiotti and Zanzotto (2008) find that part-of-speech tagging degrades considerably when applied to a corpus of historical Italian. Moon and Baldridge (2007) tackle the challenging problem of tagging Middle English, using techniques for projecting syntactic annotations across languages. Prior work on the Tycho Brahe corpus applied supervised learning to a random split of test and training data [17, 7]; they did not consider the domain adaptation problem of training on recent data and testing on older historical text.

5 Conclusion and Future Work

Denoising autoencoders provide an intuitive solution for domain adaptation: transform the features into a representation that is resistant to the noise that may characterize the domain adaptation process. The original implementation of this idea produced this noise directly [12]; later work showed that dropout noise could be analytically marginalized [4]. We take another step towards simplicity by showing that structured dropout can make marginalization even easier, obtaining dramatic speedups without sacrificing accuracy.

Acknowledgments

We thank the reviewers for useful feedback. This research was supported by National Science Foundation award 1349837.

References

  • [1] J. Blitzer, M. Dredze, and F. Pereira (2007). Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. Prague, Czech Republic.
  • [2] J. Blitzer, R. McDonald, and F. Pereira (2006). Domain adaptation with structural correspondence learning. In EMNLP '06, Stroudsburg, PA, USA, pp. 120–128.
  • [3] J. Blitzer (2008). Domain adaptation of natural language processing systems. Ph.D. thesis, University of Pennsylvania.
  • [4] M. Chen, Z. Xu, K. Weinberger, and F. Sha (2012). Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML '12), pp. 767–774.
  • [5] H. Daumé III (2007). Frustratingly easy domain adaptation.
  • [6] P. S. Dhillon, D. P. Foster, and L. H. Ungar (2011). Multi-view learning of word embeddings via CCA. Vol. 24, pp. 199–207.
  • [7] C. N. Dos Santos, R. L. Milidiú, and R. P. Rentería (2008). Portuguese part-of-speech tagging using entropy guided transformation learning. In Computational Processing of the Portuguese Language, pp. 143–152.
  • [8] J. Eisenstein (2013). What to do about bad language on the internet. Atlanta, GA.
  • [9] P. Faria (2010). Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html
  • [10] J. R. Finkel and C. D. Manning (2009). Hierarchical Bayesian domain adaptation. pp. 602–610.
  • [11] X. Glorot, A. Bordes, and Y. Bengio (2011). Deep sparse rectifier networks. Vol. 15, pp. 315–323.
  • [12] X. Glorot, A. Bordes, and Y. Bengio (2011). Domain adaptation for large-scale sentiment classification: a deep learning approach. pp. 513–520.
  • [13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
  • [14] F. Huang and A. Yates (2009). Distributional representations for handling sparsity in supervised sequence-labeling. pp. 495–503.
  • [15] F. Huang and A. Yates (2012). Biased representation learning for domain adaptation. pp. 1313–1323.
  • [16] J. Jiang and C. Zhai (2007). Instance weighting for domain adaptation in NLP.
  • [17] F. N. Kepler and M. Finger (2006). Comparing two Markov methods for part-of-speech tagging of Portuguese. In Advances in Artificial Intelligence – IBERAMIA-SBIA 2006, pp. 482–491.
  • [18] T. Moon and J. Baldridge (2007). Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. pp. 390–399.
  • [19] C. Nogueira Dos Santos, R. L. Milidiú, and R. P. Rentería (2008). Portuguese part-of-speech tagging using entropy guided transformation learning. In PROPOR '08, Berlin, Heidelberg, pp. 143–152.
  • [20] N. Okazaki (2007). CRFsuite: a fast implementation of conditional random fields (CRFs).
  • [21] S. J. Pan and Q. Yang (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), pp. 1345–1359.
  • [22] M. Pennacchiotti and F. M. Zanzotto (2008). Natural language processing across time: an empirical investigation on Italian. In Advances in Natural Language Processing, pp. 371–382.
  • [23] M. Piotrowski (2012). Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies 5(2), pp. 1–157.
  • [24] A. Ratnaparkhi (1996). A maximum entropy model for part-of-speech tagging.
  • [25] N. A. Smith (2011). Linguistic structure prediction. Synthesis Lectures on Human Language Technologies 4(2), pp. 1–274.
  • [26] A. Søgaard (2013). Semi-supervised learning and domain adaptation in natural language processing. Synthesis Lectures on Human Language Technologies 6(2), pp. 1–103.
  • [27] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network.
  • [28] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008). Extracting and composing robust features with denoising autoencoders. pp. 1096–1103.
  • [29] S. I. Wang, M. Wang, S. Wager, P. Liang, and C. D. Manning (2013). Feature noising for log-linear structured prediction.
  • [30] M. Xiao and Y. Guo (2013). Domain adaptation for sequence labeling tasks with a probabilistic language adaptation model. Vol. 28, pp. 293–301.