Tagging The Web: Building A Robust Web Tagger with Neural Network

Ji Ma, Yue Zhang and Jingbo Zhu
Northeastern University, China
Singapore University of Technology and Design
majineu@gmail.com
yue_zhang@sutd.edu.sg
zhujingbo@mail.neu.edu.cn

Abstract

In this paper,* we address the problem of web-domain POS tagging using a two-phase approach. The first phase learns representations that capture regularities underlying web text. These representations are integrated as features into a neural network that serves as the scorer for an easy-first POS tagger. Parameters of the neural network are trained using guided learning in the second phase. Experiments on the SANCL 2012 shared task show that our approach achieves 93.27% average tagging accuracy, which is the best accuracy reported so far on this data set, higher than those given by ensembled syntactic parsers.

* This work was done while the first author was visiting SUTD.

1 Introduction

Analysing and extracting useful information from the web has become an increasingly important research direction for the NLP community, where many tasks require part-of-speech (POS) tagging as a fundamental preprocessing step. However, state-of-the-art POS taggers in the literature [5, 23] are mainly optimized on the Penn Treebank (PTB), and when shifted to web data, their tagging accuracies drop significantly [18].

The problem we face here can be considered as a special case of domain adaptation, where we have access to labelled data on the source domain (PTB) and unlabelled data on the target domain (web data). Exploiting useful information from the web data can be the key to improving web domain tagging. Towards this end, we adopt the idea of learning representations which has been demonstrated useful in capturing hidden regularities underlying the raw input data (web text, in our case).

Our approach consists of two phases. In the pre-training phase, we learn an encoder that converts web text into an intermediate representation, which acts as useful features for prediction tasks. We integrate the learned encoder with a set of well-established features for POS tagging [21, 5] in a single neural network, which is applied as the scorer of an easy-first POS tagger. We choose the easy-first tagging approach because it has been demonstrated to give higher accuracies than standard left-to-right POS taggers [23, 15].

In the fine-tuning phase, the parameters of the network are optimized on a set of labelled training data using guided learning. The learned model preserves the property of preferring to tag easy words first. To our knowledge, we are the first to investigate guided learning for neural networks.

The idea of learning representations from unlabelled data and then fine-tuning a model with such representations according to some supervised criterion has been studied before [26, 6, 8]. While most previous work focuses on in-domain sequential labelling or cross-domain classification tasks, we are the first to learn representations for web-domain structured prediction. Previous work treats the learned representations either as model parameters that are further optimized in supervised fine-tuning [6] or as fixed features that are kept unchanged [26, 8]. In this work, we investigate both strategies and give empirical comparisons in the cross-domain setting. Our results suggest that while both strategies improve in-domain tagging accuracy, keeping the learned representations unchanged consistently results in better cross-domain accuracies.

We conduct experiments on the official data set provided by the SANCL 2012 shared task [18]. Our method achieves a 93.27% average accuracy across the web-domain, which is the best result reported so far on this data set, higher than those given by ensembled syntactic parsers. Our code will be publicly available at https://github.com/majineu/TWeb.

2 Learning from Web Text

Unsupervised learning is often used for training encoders that convert the input data to abstract representations (i.e. encoding vectors). Such representations capture hidden properties of the input, and can be used as features for supervised tasks [3, 20]. Among the many proposed encoders, we choose the restricted Boltzmann machine (RBM), which has been successfully used in many tasks [14, 10]. In this section, we give some background on RBMs and then show how they can be used to learn representations of the web text.

2.1 Restricted Boltzmann Machine

The RBM is a type of graphical model that contains two layers of binary stochastic units, v ∈ {0,1}^V and h ∈ {0,1}^H, corresponding to a set of visible and hidden variables, respectively. The RBM defines the joint probability distribution over v and h through an energy function

E(\mathbf{v}, \mathbf{h}) = -\mathbf{c}^{\top}\mathbf{h} - \mathbf{b}^{\top}\mathbf{v} - \mathbf{h}^{\top}\mathbf{W}\mathbf{v},    (1)

which is parameterized by a visible bias b ∈ ℝ^V, a hidden bias c ∈ ℝ^H and a weight matrix W ∈ ℝ^{H×V}. The joint distribution P(v, h) is given by

P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr),    (2)

where Z is the partition function.

The affine form of E with respect to v and h implies that the visible variables are conditionally independent of each other given the hidden units, and vice versa. This yields the conditional distributions:

P(\mathbf{v} \mid \mathbf{h}) = \prod_{j=1}^{V} P(v_j \mid \mathbf{h}), \qquad P(\mathbf{h} \mid \mathbf{v}) = \prod_{i=1}^{H} P(h_i \mid \mathbf{v})
P(v_j = 1 \mid \mathbf{h}) = \sigma\bigl(b_j + \mathbf{W}_{\cdot j}^{\top}\mathbf{h}\bigr)    (3)
P(h_i = 1 \mid \mathbf{v}) = \sigma\bigl(c_i + \mathbf{W}_{i\cdot}\mathbf{v}\bigr)    (4)

Here σ denotes the sigmoid function. The parameters of the RBM, θ = {b, c, W}, can be trained efficiently using contrastive divergence (CD) learning; see [11] for a detailed description of CD.
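To make the CD procedure concrete, the following is a minimal sketch of a binary RBM trained with one step of contrastive divergence (CD-1), following Eqs. (1)-(4). The layer sizes, learning rate and toy data are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary RBM with the energy of Eq. (1), trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_hidden, n_visible))  # weight matrix W
        self.b = np.zeros(n_visible)   # visible bias b
        self.c = np.zeros(n_hidden)    # hidden bias c
        self.lr = lr

    def cd1(self, v0):
        """One contrastive-divergence step on a binary visible vector v0."""
        ph0 = sigmoid(self.c + self.W @ v0)                    # P(h=1|v0), Eq. (4)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)  # sample h ~ P(h|v0)
        pv1 = sigmoid(self.b + self.W.T @ h0)                  # P(v=1|h), Eq. (3)
        v1 = (self.rng.random(pv1.shape) < pv1).astype(float)  # sample the reconstruction
        ph1 = sigmoid(self.c + self.W @ v1)
        # approximate gradient of the log-likelihood
        self.W += self.lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        self.b += self.lr * (v0 - v1)
        self.c += self.lr * (ph0 - ph1)

# toy usage on random binary vectors
rng = np.random.default_rng(1)
rbm = RBM(n_visible=20, n_hidden=8)
data = (rng.random((100, 20)) < 0.3).astype(float)
for _ in range(5):
    for v in data:
        rbm.cd1(v)
```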

2.2 Encoding Web Text with RBM

Most of the indicative features for POS disambiguation can be found in the words and word combinations within a local context [21, 5]. Inspired by this observation, we apply the RBM to learn feature representations from word n-grams. More specifically, given the i-th word w_i of a sentence, we apply RBMs to model the joint distribution of the n-gram (w_{i-l}, ..., w_{i+r}), where l and r denote the left and right window sizes, respectively. Note that the visible units of an RBM are binary, while in our case each visible variable corresponds to a word, which may take tens of thousands of different values. Therefore, the RBM needs to be re-factorized to make inference tractable.

We utilize the Word Representation RBM (WRRBM) factorization proposed by Dahl et al. (2012). The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders.

Let w_k be the k-th entry of the lexicon L, and \mathbf{w}_k be its one-hot representation (i.e., only the k-th component of \mathbf{w}_k is 1 and all the others are 0). Let \mathbf{v}^{(j)} denote the j-th visible variable of the WRRBM, which is a vector of length |L|; \mathbf{v}^{(j)} = \mathbf{w}_k then means that the j-th word in the n-gram is w_k. Let \mathbf{D} ∈ ℝ^{D×|L|} be a projection matrix, so that \mathbf{D}\mathbf{w}_k projects w_k into a D-dimensional real-valued vector (embedding). For each position j there is a weight matrix \mathbf{W}^{(j)} ∈ ℝ^{H×D}, which models the interaction between the hidden layer and the word projection at position j. The visible biases are shared across positions (\mathbf{b}^{(j)} = \mathbf{b} for all j), and the energy function is:

E(\mathbf{v}, \mathbf{h}) = -\mathbf{c}^{\top}\mathbf{h} - \sum_{j=1}^{n}\bigl(\mathbf{b}^{\top}\mathbf{v}^{(j)} + \mathbf{h}^{\top}\mathbf{W}^{(j)}\mathbf{D}\mathbf{v}^{(j)}\bigr),    (5)

which yields the conditional distributions:

P(\mathbf{v} \mid \mathbf{h}) = \prod_{j=1}^{n} P(\mathbf{v}^{(j)} \mid \mathbf{h}), \qquad P(\mathbf{h} \mid \mathbf{v}) = \prod_{i=1}^{H} P(h_i \mid \mathbf{v})
P(h_i = 1 \mid \mathbf{v}) = \sigma\Bigl(c_i + \sum_{j=1}^{n}\mathbf{W}^{(j)}_{i\cdot}\mathbf{D}\mathbf{v}^{(j)}\Bigr)    (6)
P(\mathbf{v}^{(j)} = \mathbf{w}_k \mid \mathbf{h}) = \frac{1}{Z}\exp\bigl(\mathbf{b}^{\top}\mathbf{w}_k + \mathbf{h}^{\top}\mathbf{W}^{(j)}\mathbf{D}\mathbf{w}_k\bigr)    (7)

Again Z is the partition function.

The parameters {b, c, D, W^(1), ..., W^(n)} can be trained using a Metropolis-Hastings-based CD variant, and the learned word representations also capture certain syntactic information; see Dahl et al. (2012) for more details.
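As a concrete illustration of this parameterization, the sketch below computes the hidden-unit activations of Eq. (6) and the energy of Eq. (5) using a shared projection matrix D and position-dependent weights W^(j). The toy lexicon and dimensions are assumptions, and the Metropolis-Hastings-based CD training is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
lexicon = {"the": 0, "dog": 1, "barks": 2, "loudly": 3}   # toy lexicon L
L_size, D_dim, H, n = len(lexicon), 5, 6, 3

D = 0.01 * rng.standard_normal((D_dim, L_size))                   # shared projection D
Ws = [0.01 * rng.standard_normal((H, D_dim)) for _ in range(n)]   # W^(1), ..., W^(n)
b = np.zeros(L_size)                                              # shared visible bias b
c = np.zeros(H)                                                   # hidden bias c

def one_hot(word):
    v = np.zeros(L_size)
    v[lexicon[word]] = 1.0
    return v

def hidden_probs(ngram):
    """P(h_i = 1 | v) of Eq. (6) for an n-gram of words."""
    act = c.copy()
    for j, word in enumerate(ngram):
        # position-dependent weights applied to the shared word projection
        act += Ws[j] @ (D @ one_hot(word))
    return sigmoid(act)

def energy(ngram, h):
    """Energy of Eq. (5) for a visible n-gram and a binary hidden vector h."""
    e = -c @ h
    for j, word in enumerate(ngram):
        v_j = one_hot(word)
        e -= b @ v_j + h @ Ws[j] @ (D @ v_j)
    return e

print(hidden_probs(["the", "dog", "barks"]))
```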

Note that one can stack standard RBMs on top of a WRRBM to construct a Deep Belief Network (DBN). With greedy layer-wise training [10, 2], DBNs are capable of modelling higher-order non-linear relations among the inputs, and have been demonstrated to improve performance on many computer vision tasks [10, 2, 13]. However, in this work we do not observe further improvement from employing DBNs. This may partly be because, unlike in computer vision tasks, the input structure of POS tagging or other sequential labelling tasks is relatively simple, and a single non-linear layer is enough to model the interactions within the input [27].

3 Neural Network for POS Disambiguation

We integrate the learned WRRBM into a neural network, which serves as a scorer for POS disambiguation. The main challenge in designing the network structure is the following: on the one hand, we want the model to take advantage of the information provided by the learned WRRBM, which reflects general properties of web text, so that the model generalizes well to the web domain; on the other hand, we also want to improve the model's discriminative power by utilizing well-established POS tagging features, such as those of Ratnaparkhi (1996).

Our approach is to leverage the two sources of information in one neural network by combining them through a shared output layer, as shown in Figure 1. Below the output layer, the network consists of two modules: the web-feature module, which incorporates knowledge from the pre-trained WRRBM, and the sparse-feature module, which makes use of other POS tagging features.

3.1 The Web-Feature Module

Figure 1: The proposed neural network. The web-feature module (lower left) and sparse-feature module (lower right) are combined by a shared output layer (upper).

The web-feature module, shown in the lower left part of Figure 1, consists of an input layer and two hidden layers. The input to this module is the word n-gram (w_{i-l}, ..., w_{i+r}), whose form is identical to that of the training data of the pre-trained WRRBM.

The first layer is a linear projection layer, in which each word in the input is projected into a D-dimensional real-valued vector using the projection operation described in Section 2.2. The output of this layer, \mathbf{o}_{w1}, is the concatenation of the projections of w_{i-l}, ..., w_{i+r}:

\mathbf{o}_{w1} = \bigl(\mathbf{M}_{w1}\mathbf{w}_{i-l};\ \ldots;\ \mathbf{M}_{w1}\mathbf{w}_{i+r}\bigr)    (8)

Here \mathbf{M}_{w1} denotes the parameters of the first layer of the web-feature module, which is a D×|L| projection matrix.

The second layer is a sigmoid layer to model non-linear relations between the word projections:

\mathbf{o}_{w2} = \sigma(\mathbf{M}_{w2}\mathbf{o}_{w1} + \mathbf{b}_{w2})    (9)

Parameters of this layer include a bias vector \mathbf{b}_{w2} ∈ ℝ^H and a weight matrix \mathbf{M}_{w2} ∈ ℝ^{H×nD}.

The web-feature module enables us to exploit the learned WRRBM in various ways. First, it allows us to incorporate knowledge from the WRRBM incrementally. We can choose to use only the word representations of the learned WRRBM, which is achieved by initializing only the first layer of the web-feature module with the projection matrix \mathbf{D} of the learned WRRBM:

\mathbf{M}_{w1} \leftarrow \mathbf{D}    (10)

Alternatively, we can choose to use the hidden states of the WRRBM, which can be treated as the representations of the input n-gram. This can be achieved by also initializing the parameters of the second layer of the web-feature module using the position-dependent weight matrix and hidden bias of the learned WRRBM:

\mathbf{b}_{w2} \leftarrow \mathbf{c}    (11)
\mathbf{M}_{w2} \leftarrow (\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(n)})    (12)

Second, the web-feature module allows us to compare whether or not to further adjust the pre-trained representations in the supervised fine-tuning phase, which corresponds to the supervised learning strategies of Collobert et al. (2011) and Turian et al. (2010), respectively. To our knowledge, no investigation of this issue has been presented in the literature.
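To make these two options concrete, the sketch below implements the forward pass of the web-feature module (Eqs. 8-9) together with the two initialization choices (Eqs. 10-12). The function names and fallback layer sizes are assumptions for illustration; D_pre, Ws_pre and c_pre stand in for the parameters of a pre-trained WRRBM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def web_module_forward(ngram_ids, M_w1, M_w2, b_w2):
    """Forward pass of the web-feature module for word indices
    (w_{i-l}, ..., w_{i+r}) into the lexicon."""
    # projection layer, Eq. (8): concatenate the D-dimensional projections
    o_w1 = np.concatenate([M_w1[:, k] for k in ngram_ids])
    # sigmoid layer, Eq. (9)
    return sigmoid(M_w2 @ o_w1 + b_w2)

def init_from_wrrbm(D_pre, Ws_pre=None, c_pre=None, n=3, hidden=300):
    """Word-level initialization copies only D (Eq. 10); n-gram-level
    initialization also copies the position-dependent weights and the
    hidden bias (Eqs. 11-12). n and hidden are assumed defaults used
    only when the second layer is randomly initialized."""
    M_w1 = D_pre.copy()                                  # Eq. (10)
    if Ws_pre is None:                                   # word representations only
        M_w2 = 0.01 * np.random.randn(hidden, n * D_pre.shape[0])
        b_w2 = np.zeros(hidden)
    else:                                                # full n-gram representation
        M_w2 = np.concatenate(Ws_pre, axis=1)            # (W^(1), ..., W^(n)), Eq. (12)
        b_w2 = c_pre.copy()                              # Eq. (11)
    return M_w1, M_w2, b_w2
```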

3.2 The Sparse-Feature Module

The sparse-feature module, shown in the lower right part of Figure 1, is designed to incorporate commonly used tagging features. The input to this module is a vector of boolean values Φ(x) = (f_1(x), ..., f_k(x)), where x denotes the partially tagged input sentence and f_i(x) denotes a feature function, which returns 1 if the corresponding feature fires and 0 otherwise. The first layer of this module is a linear transformation layer, which converts the high-dimensional sparse vector into a fixed-dimensional real-valued vector:

\mathbf{o}_{s} = \mathbf{M}_{s}\Phi(x) + \mathbf{b}_{s}    (13)

Depending on the specific task being considered, the output of this layer can be further fed to other non-linear layers, such as a sigmoid or hyperbolic tangent layer, to model more complex relations. For POS tagging, we found that a simple linear layer yields satisfactory accuracies.

The web-feature and sparse-feature modules are combined by a linear output layer, as shown in the upper part of Figure 1. The value of each unit in this layer denotes the score of the corresponding POS tag.

\mathbf{o}_{o} = \mathbf{M}_{o}(\mathbf{o}_{w};\ \mathbf{o}_{s}) + \mathbf{b}_{o}    (14)

In some circumstances, a probability distribution over POS tags may be a preferable form of output. Such a distribution can be easily obtained by adding a softmax layer on top of the output layer to perform local normalization, as done by Collobert et al. (2011).
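A minimal sketch of the combined scorer (Eqs. 13-14) is given below; the function signature and the optional normalization flag are illustrative assumptions.

```python
import numpy as np

def score_tags(o_w, phi_x, M_s, b_s, M_o, b_o, normalize=False):
    """o_w: output of the web-feature module (Eq. 9); phi_x: sparse boolean
    feature vector of the partially tagged sentence."""
    o_s = M_s @ phi_x + b_s                          # sparse-feature module, Eq. (13)
    o_o = M_o @ np.concatenate([o_w, o_s]) + b_o     # shared output layer, Eq. (14)
    if normalize:                                    # optional softmax for a tag distribution
        e = np.exp(o_o - o_o.max())
        return e / e.sum()
    return o_o                                       # one score per POS tag
```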

Algorithm 1: Easy-first POS tagging
Require: x, a sentence of m words w_1, ..., w_m
Ensure: the tag sequence of x
 1: U ← [w_1, ..., w_m]                  // untagged words
 2: while U ≠ [] do
 3:   (ŵ, t̂) ← argmax_{(w,t) ∈ U×T} S(w, t)
 4:   ŵ.t ← t̂
 5:   U ← U \ [ŵ]                        // remove ŵ from U
 6: end while
 7: return [w_1.t, ..., w_m.t]

4 Easy-first POS tagging with Neural Network

The neural network proposed in Section 3 is used for POS disambiguation by the easy-first POS tagger. Parameters of the network are trained using guided learning, where learning and search interact with each other.

4.1 Easy-first POS tagging

Pseudo-code of easy-first tagging is shown in Algorithm 1. Rather than tagging a sentence from left to right, easy-first tagging is based on a deterministic process that repeatedly selects the easiest word to tag, where "easiness" is evaluated by a statistical model. At each step, the algorithm applies a scorer, the neural network in our case, to assign a score to each possible word-tag pair (w, t), and then selects the highest-scoring pair (ŵ, t̂), i.e., it tags ŵ with t̂. The algorithm repeats until all words are tagged.
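The decoding loop can be sketched as follows; the score argument stands for the neural-network scorer of Section 3, and its signature is an assumption for illustration.

```python
def easy_first_tag(sentence, tagset, score):
    """Greedy easy-first decoding (Algorithm 1). score(sentence, tags, i, t)
    is an assumed interface returning the model score of tagging word i with t
    given the current partial tagging."""
    tags = [None] * len(sentence)            # no word is tagged yet
    untagged = set(range(len(sentence)))
    while untagged:
        # select the highest-scoring (word, tag) pair among untagged words
        i_best, t_best = max(
            ((i, t) for i in untagged for t in tagset),
            key=lambda pair: score(sentence, tags, pair[0], pair[1]),
        )
        tags[i_best] = t_best                # tag the selected word
        untagged.remove(i_best)
    return tags
```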

4.2 Training

The training algorithm repeats for several iterations over the training data, which is a set of sentences labelled with gold standard POS tags. In each iteration, the procedure shown in Algorithm 2 is applied to each sentence in the training set.

At each step during the processing of a training example, the algorithm calculates a margin loss based on two word-tag pairs, (w̄, t̄) and (ŵ, t̂) (lines 4-6). (w̄, t̄) denotes the word-tag pair with the highest model score among those that are inconsistent with the gold standard, while (ŵ, t̂) denotes the pair with the highest model score among those that are consistent with the gold standard. If the loss is zero, the algorithm continues to process the next untagged word. Otherwise, the parameters are updated using back-propagation.

The standard back-propagation algorithm [22] cannot be applied directly, because the standard loss is calculated from a single input vector. This condition does not hold in our case: ŵ and w̄ may refer to different words, which means that the margin loss in line 6 of Algorithm 2 is calculated from two different input vectors, corresponding to ŵ and w̄, respectively.

We solve this problem by decomposing the margin loss in line 6 into two parts:

  • 1 + nn(w̄, t̄), which is associated with w̄;

  • −nn(ŵ, t̂), which is associated with ŵ.

In this way, two separate back-propagation updates can be used to update the model's parameters (lines 8-11). For the special case where ŵ and w̄ refer to the same word w, it can easily be verified that the two separate back-propagation updates are equivalent to standard back-propagation with a loss of 1 + nn(w, t̄) − nn(w, t̂) on the input w.

The algorithm proposed here belongs to a general framework named guided learning, where search and learning interact with each other. The algorithm learns not only a local classifier but also the inference order. While previous work [23, 29, 9] applies guided learning to train linear classifiers using variants of the perceptron algorithm, we are the first to combine guided learning with a neural network, using a margin loss and a modified back-propagation algorithm.

Algorithm 2: Training over one sentence
Require: (x, t), a tagged sentence; neural net nn
Ensure: updated neural net nn
 1: U ← [w_1, ..., w_m]                           // untagged words
 2: R ← [(w_1, t_1), ..., (w_m, t_m)]             // reference
 3: while U ≠ [] do
 4:   (w̄, t̄) ← argmax_{(w,t) ∈ (U×T) \ R} nn(w, t)
 5:   (ŵ, t̂) ← argmax_{(w,t) ∈ R} nn(w, t)
 6:   loss ← max(0, 1 + nn(w̄, t̄) − nn(ŵ, t̂))
 7:   if loss > 0 then
 8:     ê ← nn.BackPropErr(ŵ, −nn(ŵ, t̂))
 9:     ē ← nn.BackPropErr(w̄, 1 + nn(w̄, t̄))
10:     nn.Update(ŵ, ê)
11:     nn.Update(w̄, ē)
12:   else
13:     U ← U \ {ŵ};  R ← R \ {(ŵ, t̂)}
14:   end if
15: end while
16: return nn
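The following sketch mirrors Algorithm 2 with the decomposed margin loss. The nn.score, nn.backprop_err and nn.update methods are assumed interfaces corresponding to nn(·), BackPropErr and Update in the pseudocode; they are not the API of the released implementation.

```python
def train_one_sentence(sentence, gold_tags, tagset, nn):
    """One pass of Algorithm 2 over a gold-tagged sentence."""
    untagged = set(range(len(sentence)))
    reference = {i: gold_tags[i] for i in untagged}
    tags = [None] * len(sentence)
    while untagged:
        # highest-scoring pair inconsistent with the gold standard (line 4)
        w_bar, t_bar = max(
            ((i, t) for i in untagged for t in tagset if t != reference[i]),
            key=lambda p: nn.score(sentence, tags, p[0], p[1]),
        )
        # highest-scoring pair consistent with the gold standard (line 5)
        w_hat = max(untagged, key=lambda i: nn.score(sentence, tags, i, reference[i]))
        t_hat = reference[w_hat]
        loss = max(0.0, 1.0 + nn.score(sentence, tags, w_bar, t_bar)
                        - nn.score(sentence, tags, w_hat, t_hat))
        if loss > 0:
            # back-propagate the two parts of the decomposed margin loss
            # separately, each through its own input (lines 8-11)
            e_hat = nn.backprop_err(sentence, tags, w_hat, t_hat,
                                    -nn.score(sentence, tags, w_hat, t_hat))
            e_bar = nn.backprop_err(sentence, tags, w_bar, t_bar,
                                    1.0 + nn.score(sentence, tags, w_bar, t_bar))
            nn.update(sentence, tags, w_hat, t_hat, e_hat)
            nn.update(sentence, tags, w_bar, t_bar, e_bar)
        else:
            # the margin is satisfied: commit the easiest gold-consistent tag
            tags[w_hat] = t_hat
            untagged.remove(w_hat)
            del reference[w_hat]
    return nn
```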

5 Experiments

         Training set   Dev set                     Test set
         WSJ-Train      Emails   Weblogs  WSJ-dev   Answers  Newsgroups  Reviews  WSJ-test
#Sen     30,060         2,450    1,016    1,336     1,744    1,195       1,906    1,640
#Words   731,678        29,131   24,025   32,092    28,823   20,651      28,086   35,590
#Types   35,933         5,478    4,747    5,889     4,370    4,924       4,797    6,685
Table 1: Statistics of the labelled data. #Sen denotes the number of sentences. #Words and #Types denote the number of words and of unique word types, respectively.
Emails Weblogs Answers Newsgroups Reviews
#Sen 1,194,173 524,834 27,274 1,000,000 1,965,350
#Words 17,047,731 10,365,284 424,299 18,424,657 29,289,169
#Types 221,576 166,515 33,325 357,090 287,575
Table 2: Statistics of the raw unlabelled data.
Features  Templates
unigram   H(w_i), C(w_i), L(w_i), L(w_{i-1}), L(w_{i+1}), t_{i-2}, t_{i-1}, t_{i+1}, t_{i+2}
bigram    L(w_i)L(w_{i-1}), L(w_i)L(w_{i+1}), t_{i-2}t_{i-1}, t_{i-1}t_{i+1}, t_{i+1}t_{i+2},
          L(w_i)t_{i-2}, L(w_i)t_{i-1}, L(w_i)t_{i+1}, L(w_i)t_{i+2}
trigram   L(w_i)t_{i-2}t_{i-1}, L(w_i)t_{i-1}t_{i+1}, L(w_i)t_{i+1}t_{i+2}
Table 3: Feature templates, where w_i denotes the current word. H(w) and C(w) indicate whether w contains a hyphen and upper-case letters, respectively. L(w) denotes the lowercased form of w.

5.1 Setup

Our experiments are conducted on the data set provided by the SANCL 2012 shared task, which aims at building a single robust syntactic analysis system for the web domain. The data set contains labelled data for both the source domain (the Wall Street Journal portion of the Penn Treebank) and the target domain (the web). The web-domain data is further divided into five sub-domains: emails, weblogs, business reviews, newsgroups and Yahoo! Answers. Emails and weblogs are used as the development sets, while reviews, newsgroups and Yahoo! Answers are used as the final test sets. Participants are not allowed to use web-domain labelled data for training. In addition to the labelled data, a large amount of unlabelled web-domain data is also provided. Statistics of the labelled and unlabelled data are summarized in Table 1 and Table 2, respectively.

The raw web-domain data contains much noise, including spelling errors, emoticons and inconsistent capitalization. Following some participants [12], we apply simple preprocessing steps to the input of the development and test sets (these steps make use of no POS knowledge and do not bring any unfair advantage over the participants):

  • Neutral quotes are transformed into opening or closing quotes.

  • Tokens starting with “www.”, “http.” or ending with “.org”, “.com” are converted to a “#URL” symbol.

  • Repeated punctuation marks such as “!!!!” are collapsed into one.

  • Left brackets such as “<”, “{” and “[” are converted to “-LRB-”. Similarly, right brackets are converted to “-RRB-”.

  • Upper-cased words that contain more than 4 letters are lowercased.

  • Consecutive occurrences of one or more digits within a word are replaced with “#DIG”.

We apply the same preprocessing steps to all the unlabelled data. In addition, following Dahl et al. (2012) and Turian et al. (2010), we also lowercase all the unlabelled data and remove sentences that contain less than 90% a-z letters. A rough sketch of these normalization steps is given below.
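The sketch implements only the token-level rules listed above (quote disambiguation is omitted, since it needs sentence context); the function name and the exact regular expressions are assumptions.

```python
import re

def normalize_token(tok):
    """Rough token-level normalization following the preprocessing list above."""
    if tok.startswith("www.") or tok.startswith("http") \
            or tok.endswith(".org") or tok.endswith(".com"):
        return "#URL"
    if tok in ("<", "{", "["):
        return "-LRB-"
    if tok in (">", "}", "]"):
        return "-RRB-"
    tok = re.sub(r"([!?.,])\1+", r"\1", tok)   # collapse repeated punctuation
    tok = re.sub(r"\d+", "#DIG", tok)          # digit sequences -> #DIG
    if len(tok) > 4 and tok.isupper():
        tok = tok.lower()                      # long all-upper-case words
    return tok

print([normalize_token(t) for t in ["http://t.co/x", "Soooo!!!!", "ABCDEF", "12345abc"]])
```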

Tagging performance is evaluated according to the official evaluation metrics of SANCL 2012. Tagging accuracy is defined as the percentage of words (punctuation included) that are correctly tagged. Average accuracies are calculated across the web-domain data.

We trained the WRRBM on web-domain data of different sizes (number of sentences). The data sets are generated by first concatenating all the cleaned unlabelled data, then selecting sentences evenly across the concatenated file.

For each data set, we investigate an extensive set of combinations of hyper-parameters: the n-gram window (l,r) in {(1,1),(2,1),(1,2),(2,2)}; the hidden layer size in {200,300,400}; the learning rate in {0.1,0.01,0.001}. All these parameters are selected according to the averaged accuracy on the development set.
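The selection over this grid can be sketched as follows; the training and evaluation callables are placeholders (assumptions), and only the grid itself follows the text.

```python
import itertools

WINDOWS = [(1, 1), (2, 1), (1, 2), (2, 2)]      # (l, r) n-gram windows
HIDDEN_SIZES = [200, 300, 400]
LEARNING_RATES = [0.1, 0.01, 0.001]

def select_hyper_params(train_wrrbm, dev_accuracy):
    """train_wrrbm and dev_accuracy are user-supplied callables."""
    best_cfg, best_acc = None, -1.0
    for (l, r), h, lr in itertools.product(WINDOWS, HIDDEN_SIZES, LEARNING_RATES):
        model = train_wrrbm(left=l, right=r, hidden=h, lr=lr)
        acc = dev_accuracy(model)               # averaged accuracy on the dev sets
        if acc > best_acc:
            best_cfg, best_acc = ((l, r), h, lr), acc
    return best_cfg, best_acc
```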

5.2 Baseline

We reimplemented the greedy easy-first POS tagger of Ma et al. (2013), which is used for all the experiments. While the tagger of Ma et al. (2013) utilizes a linear scorer, our tagger adopts the neural network as its scorer. The neural network of our baseline tagger only contains the sparse-feature module. We use this baseline to examine the performance of a tagger trained purely on the source domain. Feature templates are shown in Table 3, which are based on those of Ratnaparkhi (1996) and Shen et al. (2007).

Accuracies of the baseline tagger are shown in the upper part of Table 6. Compared with the official baseline (row 4 of Table 6), which is evaluated on the output of the BerkeleyParser [16, 17], our baseline tagger achieves comparable accuracies on both the source and target domain data. With data preprocessing, the average accuracy improves to 92.02% on the target-domain test sets. This is consistent with previous work (Le Roux et al., 2012), which found that for noisy data such as web-domain text, data cleaning is an effective and necessary step.

5.3 Exploring the Learned Knowledge

Figure 2: Tagging accuracies on the source-domain data. “word” and “ngram” denote using word representations and n-gram representations, respectively. “fixed” and “adjust” denote that the learned representations are kept unchanged or further adjusted in supervised learning, respectively.
Figure 3: Accuracies on the email domain.
Figure 4: Accuracies on the weblog domain.

As mentioned in Section 3.1, the knowledge learned by the WRRBM can be investigated incrementally: using the word representations, which corresponds to initializing only the projection layer of the web-feature module with the projection matrix of the learned WRRBM, or using the n-gram-level representation, which corresponds to initializing both the projection and sigmoid layers of the web-feature module with the learned WRRBM. In each case, there are two training strategies, depending on whether the learned representations are further adjusted or kept unchanged during the fine-tuning phase. Experimental results under the four combined settings on the development sets are shown in Figures 2, 3 and 4, where the x-axis denotes the size of the training data and the y-axis denotes tagging accuracy.

5.3.1 Effect of the Training Strategy

From Figure 2 we can see that when knowledge from the pre-trained WRRBM is incorporated, both training strategies (“word-fixed” vs “word-adjusted”, “ngram-fixed” vs “ngram-adjusted”) improve accuracies on the source domain, which is consistent with previous findings (Turian et al., 2010; Collobert et al., 2011). In addition, adjusting the learned representations or keeping them fixed does not make much difference in tagging accuracy.

On the web-domain data, shown in Figures 3 and 4, we find that leaving the learned representations unchanged (“word-fixed”, “ngram-fixed”) yields consistently larger performance gains. This result is to some degree expected. Intuitively, unsupervised pre-training moves the parameters of the WRRBM towards a region where the properties of the web-domain data are properly modelled. However, since fine-tuning is conducted with respect to the source domain, adjusting the parameters of the pre-trained representation towards optimizing source-domain tagging accuracy would disrupt its ability to model the web-domain data. Therefore, a better idea is to keep the representation unchanged, so that we can learn a function that maps general web-text properties to syntactic categories.

5.3.2 Word and N-gram Representation

From Figures 2, 3 and 4, we can see that adopting the n-gram-level representation consistently achieves better performance than using word representations only (“word-fixed” vs “ngram-fixed”, “word-adjusted” vs “ngram-adjusted”). This result illustrates that the n-gram-level knowledge captures more complex interactions in the web text, which cannot be recovered from word embeddings alone. A similar result was reported by Dahl et al. (2012), who found that using both the word embeddings and the hidden units of a tri-gram WRRBM as additional features for a CRF chunker yields larger improvements than using word embeddings only.

method        all    non-oov  oov
baseline      89.81  92.42    65.64
word-adjust   +0.09  -0.05    +1.38
word-fix      +0.11  +0.13    +1.73
ngram-adjust  +0.53  +0.52    +0.53
ngram-fix     +0.69  +0.60    +2.30
Table 4: Performance on the email domain (accuracy %). Rows below the baseline give absolute changes relative to it.

Finally, more detailed accuracies under the four settings on the email domain are shown in Table 4. We can see that the improvement from using word representations mainly comes from better accuracy on out-of-vocabulary (oov) words. By contrast, using n-gram representations improves performance on both oov and non-oov words.

5.4 Effect of Unlabelled Domain Data

                RBM-E  RBM-W  RBM-M
+acc%  Emails   +0.73  +0.37  +0.69
       Weblogs  +0.31  +0.52  +0.54
cov%   Emails   95.24  92.79  93.88
       Weblogs  90.21  97.74  94.77
Table 5: Effect of unlabelled data. “+acc%” denotes the improvement in tagging accuracy and “cov%” denotes the lexicon coverage.
System                 Answers  Newsgroups  Reviews  WSJ-test  Avg
baseline-raw           89.79    91.36       89.96    97.09     90.31
baseline-clean         91.35    92.06       92.92    97.09     92.02
best-clean             92.50    93.83       93.64    97.44     93.27
baseline-official      90.20    91.24       89.33    97.08     90.26
Le Roux et al. (2012)  91.79    93.81       93.11    97.29     92.90
Tang et al. (2012)     91.76    92.91       91.94    97.49     92.20
Table 6: Main results. “baseline-raw” and “baseline-clean” denote the performance of our baseline tagger on the raw and cleaned data, respectively. “best-clean” is the best performance achieved using a 4-gram WRRBM. The lower part shows the accuracies of the official baseline and of the top 2 participants.

In some circumstances, we may know beforehand that the target-domain data belongs to a certain sub-domain, such as the email domain. In such cases, it might be desirable to train the WRRBM using data from that sub-domain only. We conduct experiments to test whether using target-domain data to train the WRRBM yields better performance than using mixed data from all sub-domains.

We trained three WRRBMs on the email-domain data (RBM-E), the weblog-domain data (RBM-W) and mixed-domain data (RBM-M), respectively, with each data set consisting of 300k sentences. Tagging performance and lexicon coverage for each data set on the development sets are shown in Table 5. We can see that using the target-domain data achieves improvements similar to those of the mixed data. However, for the email domain, RBM-W yields a much smaller improvement than RBM-E, and vice versa. From the lexicon coverages, we can see that the sub-domains vary significantly. These results suggest that mixed data achieves almost as good performance as target sub-domain data, while yielding a much more robust tagger across all sub-domains.

5.5 Final Results

The best result, achieved using a 4-gram WRRBM (w_{i-2}, ..., w_{i+1}) with 300 hidden units learned from 1,000k web-domain sentences, is shown in row 3 of Table 6. The performance of the top 2 systems of the SANCL 2012 task is also shown in Table 6. Our greedy tagger achieves 93.27% tagging accuracy, which is significantly better than the baseline's 92.02% accuracy (p < 0.05 by McNemar's test). Moreover, we achieve the highest tagging accuracy reported so far on this data set, surpassing those achieved by parser combinations based on self-training [24, 12]. In addition, unlike Le Roux et al. (2012), we do not use any external resources in data cleaning.

6 Related Work

Learning representations has been intensively studied for computer vision tasks [2, 13]. In NLP, there is also much work along this line. In particular, Collobert et al. (2011) and Turian et al. (2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling. Yang et al. (2013) induce bilingual word embeddings for word alignment. Zheng et al. (2013) investigate Chinese character embeddings for joint word segmentation and POS tagging. While those approaches mainly explore token-level representations (word or character embeddings), using a WRRBM allows us to utilize both word and n-gram representations.

Titov (2011) and Glorot et al. (2011) propose to learn representations from a mixture of source- and target-domain unlabelled data to improve cross-domain sentiment classification. Titov (2011) also proposes a regularizer to constrain the inter-domain variability; in particular, the regularizer aims to minimize the Kullback-Leibler (KL) distance between the marginal distributions of the learned representations on the source and target domains.

Their work differs from ours in that their approaches learn representations from the feature vectors for sentiment classification, which may have thousands of dimensions. Such high-dimensional input gives rise to high computational cost, and it is not clear whether those approaches can be applied to large-scale unlabelled data with hundreds of millions of training examples. Our method learns representations from word n-grams only, with n ranging from 3 to 5, and can thus easily be applied to large-scale data. In addition, while Titov (2011) and Glorot et al. (2011) use the learned representations to improve cross-domain classification tasks, we are the first to apply them to cross-domain structured prediction.

Blitzer et al. (2006) propose to induce shared representations for domain adaptation based on the alternating structure optimization (ASO) method of Ando and Zhang (2005). The idea is to project the original feature representations into low-dimensional representations, which yield a high-accuracy classifier on the target domain. The new representations are induced from auxiliary tasks defined on unlabelled data, together with a dimensionality reduction technique. Such auxiliary tasks can be specific to the supervised task, and as pointed out by Plank (2009), defining them is a non-trivial engineering problem for many NLP tasks. Compared with Blitzer et al. (2006), the advantage of using RBMs is that they learn representations in a purely unsupervised manner, which is much simpler.

Regarding the use of neural networks for sequential labelling, our approach shares similarities with that of Collobert et al. (2011). In particular, we both use a non-linear layer to model complex relations underlying word embeddings. However, our network differs from theirs in the following respects. Collobert et al. (2011) model the dependency between neighbouring tags in a generative manner, by employing a transition score A_{ij}; training this score involves a forward process of complexity O(nT^2), where T denotes the number of tags. Our model captures such dependencies in a discriminative manner, simply by adding tag-related features to the sparse-feature module. In addition, Collobert et al. (2011) train their network by maximizing the training-set likelihood, while we minimize a margin loss using guided learning.

7 Conclusion

We built a web-domain POS tagger using a two-phase approach. We used a WRRBM to learn representations of web text and incorporated these representations into a neural network, which is trained using guided learning for easy-first POS tagging. Experiments showed that our approach achieves significant improvements in tagging web-domain text. In addition, we found that keeping the learned representations unchanged yields better performance than further optimizing them on the source-domain data. We release our tools at https://github.com/majineu/TWeb.

For future work, we would like to apply the two-phase approach to more challenging tasks, such as web-domain syntactic parsing. We believe that high-accuracy web-domain taggers and parsers would benefit a wide range of downstream tasks such as machine translation.

Acknowledgements

We would like to thank Hugo Larochelle for his advice on re-implementing the WRRBM. We also thank Nan Yang, Shujie Liu and Tong Xiao for fruitful discussions, and the three anonymous reviewers for their insightful suggestions. This research was supported by the National Science Foundation of China (61272376; 61300097), research grant T2MOE1301 from the Singapore Ministry of Education (MOE) and start-up grant SRG ISTD2012038 from SUTD.

References

  • [1] R. Ando and T. Zhang (2005). A high-performance semi-supervised learning method for text chunking. Ann Arbor, Michigan, pp. 1–9.
  • [2] Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle (2007). Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt and T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19, pp. 153–160.
  • [3] Y. Bengio (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), pp. 1–127.
  • [4] J. Blitzer, R. McDonald and F. Pereira (2006). Domain adaptation with structural correspondence learning. Sydney, Australia, pp. 120–128.
  • [5] M. Collins (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In EMNLP '02, Stroudsburg, PA, USA, pp. 1–8.
  • [6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
  • [7] G. E. Dahl, R. P. Adams and H. Larochelle (2012). Training restricted Boltzmann machines on word observations. In ICML '12, New York, NY, USA, pp. 679–686.
  • [8] X. Glorot, A. Bordes and Y. Bengio (2011). Domain adaptation for large-scale sentiment classification: a deep learning approach. pp. 513–520.
  • [9] Y. Goldberg and M. Elhadad (2010). An efficient algorithm for easy-first non-directional dependency parsing. In HLT '10, Stroudsburg, PA, USA, pp. 742–750.
  • [10] G. E. Hinton, S. Osindero and Y. Teh (2006). A fast learning algorithm for deep belief nets. Neural Computation 18(7), pp. 1527–1554.
  • [11] G. E. Hinton (2002). Training products of experts by minimizing contrastive divergence. Neural Computation 14(8), pp. 1771–1800.
  • [12] J. Le Roux, J. Foster, J. Wagner, R. S. Z. Kaljahi and A. Bryl (2012). DCU-Paris13 systems for the SANCL 2012 shared task. Montréal, Canada, pp. 1–4.
  • [13] H. Lee, R. Grosse, R. Ranganath and A. Y. Ng (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. pp. 609–616.
  • [14] H. Lee, P. Pham, Y. Largman and A. Ng (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems 22, pp. 1096–1104.
  • [15] J. Ma, J. Zhu, T. Xiao and N. Yang (2013). Easy-first POS tagging and dependency parsing with beam search. Sofia, Bulgaria, pp. 110–114.
  • [16] S. Petrov, L. Barrett, R. Thibaux and D. Klein (2006). Learning accurate, compact, and interpretable tree annotation. Sydney, Australia, pp. 433–440.
  • [17] S. Petrov and D. Klein (2007). Improved inference for unlexicalized parsing. Rochester, New York, pp. 404–411.
  • [18] S. Petrov and R. McDonald (2012). Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
  • [19] B. Plank (2009). Structural correspondence learning for parse disambiguation. pp. 37–45.
  • [20] M. Ranzato, C. Poultney, S. Chopra and Y. LeCun (2007). Efficient learning of sparse representations with an energy-based model. In B. Schölkopf, J. Platt and T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19, pp. 1137–1144.
  • [21] A. Ratnaparkhi (1996). A maximum entropy model for part-of-speech tagging.
  • [22] D. E. Rumelhart, G. E. Hinton and R. J. Williams (1988). Learning representations by back-propagating errors. In J. A. Anderson and E. Rosenfeld (Eds.), Neurocomputing: Foundations of Research, pp. 696–699.
  • [23] L. Shen, G. Satta and A. Joshi (2007). Guided learning for bidirectional sequence classification. Prague, Czech Republic, pp. 760–767.
  • [24] B. Tang, M. Jiang and H. Xu (2012). Vanderbilt's systems for the SANCL 2012 shared task. Montréal, Canada.
  • [25] I. Titov (2011). Domain adaptation by constraining inter-domain variability of latent feature representation. Portland, Oregon, USA, pp. 62–71.
  • [26] J. Turian, L. Ratinov and Y. Bengio (2010). Word representations: a simple and general method for semi-supervised learning. Uppsala, Sweden, pp. 384–394.
  • [27] M. Wang and C. D. Manning (2013). Effect of non-linear deep architecture in sequence labeling.
  • [28] N. Yang, S. Liu, M. Li, M. Zhou and N. Yu (2013). Word alignment modeling with context dependent deep neural network. Sofia, Bulgaria, pp. 166–175.
  • [29] Y. Zhang and S. Clark (2011). Syntax-based grammaticality improvement using CCG and guided search. Edinburgh, Scotland, UK, pp. 1147–1157.
  • [30] X. Zheng, H. Chen and T. Xu (2013). Deep learning for Chinese word segmentation and POS tagging. Seattle, Washington, USA, pp. 647–657.