Sprinkling Topics for Weakly Supervised Text Classification

Swapnil Hingmire^{1,2} (swapnil.hingmire@tcs.com)

Sutanu Chakraborti^{2} (sutanuc@cse.iitm.ac.in)

^{1} Systems Research Lab, Tata Research Development and Design Center, Pune, India

^{2} Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India

LDA generative process:

1. For each topic t, draw a distribution over words: \phi_{t}\sim\mathrm{Dirichlet}(\beta_{w})

2. For each document d\in D:

a. Draw a vector of topic proportions: \theta_{d}\sim\mathrm{Dirichlet}(\alpha_{t})

b. For each word w at position n in d:

i. Draw a topic assignment: z_{d,n}\sim\mathrm{Multinomial}(\theta_{d})

ii. Draw a word: w_{d,n}\sim\mathrm{Multinomial}(\phi_{z_{d,n}})

Abstract

Supervised text classification algorithms require a large number of documents labeled by humans, which involves a labor-intensive and time-consuming process. In this paper, we propose a weakly supervised algorithm in which supervision comes in the form of labels assigned to Latent Dirichlet Allocation (LDA) topics. We then use this weak supervision to “sprinkle” artificial words into the training documents, so that the inferred topics align with the underlying class structure of the corpus through higher order word associations. We evaluate this approach to improving text classification performance on three real world datasets.

1 Introduction

In supervised text classification, the learner (a program) takes human-labeled documents as input and learns a decision function that can classify a previously unseen document into one of the predefined classes. Usually, a large number of human-labeled documents is needed for the learner to classify unseen documents with adequate accuracy. Unfortunately, labeling a large number of documents is a labor-intensive and time-consuming process.

In this paper, we propose a text classification algorithm based on Latent Dirichlet Allocation (LDA) [] which does not need labeled documents. LDA is an unsupervised probabilistic topic model that is widely used to discover the latent semantic structure of a document collection by modeling the words in its documents. Blei et al. [] used LDA topics as features in text classification, but they used labeled documents to learn the classifier. sLDA [], DiscLDA [] and MedLDA [] are a few extensions of LDA which model both class labels and words in the documents. These models can be used for text classification, but they need expensive labeled documents.

An approach that is less demanding in terms of knowledge engineering is ClassifyLDA (Hingmire et al., 2013). In this approach, a topic model is constructed with LDA on a given set of unlabeled training documents, and an annotator then assigns a class label to some topics based on their most probable words. These labeled topics are used to create a new topic model in which topics are better aligned to class labels. A class label is assigned to a test document on the basis of its most prominent topics. We extend the ClassifyLDA algorithm by “sprinkling” topics into unlabeled documents.

Sprinkling [] integrates class labels of documents into Latent Semantic Indexing (LSI) []. The basic idea is to encode class labels as artificial words which are “sprinkled” (appended) to training documents. As LSI uses higher order word associations [], sprinkling artificial words gives a better, class-enriched latent semantic structure. However, Sprinkled LSI is a supervised technique and hence requires expensive labeled documents. This paper revolves around the idea of labeling topics (which are far fewer in number than documents), as in ClassifyLDA, and using these labeled topics for sprinkling.

As in ClassifyLDA, we ask an annotator to assign class labels to a set of topics inferred on the unlabeled training documents. We use the labeled topics to find the probability distribution of each training document over the class labels. For each class label, we create a set of artificial words and add (or sprinkle) them to the document. The number of such artificial words is proportional to the probability of the class label generating the document. We then infer a set of topics on the sprinkled training documents. As LDA uses higher order word associations [] while discovering topics, we hypothesize that sprinkling will improve the text classification performance of ClassifyLDA. We experimentally verify this hypothesis on three real world datasets.
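The sprinkling step described above can be illustrated with a minimal Python sketch. It rests on several assumptions that this section does not specify: a document's class probability is taken to be the normalized sum of its topic proportions over the topics labeled with that class, each class is encoded as a single repeated artificial token, and `max_terms` is a hypothetical cap on how many tokens are sprinkled per class. The function names are illustrative only.

```python
def class_distribution(topic_proportions, topic_labels):
    """Estimate a document's distribution over class labels.

    Assumption: sum the topic proportions of all topics that share a
    class label, then normalize. Unlabeled topics (None) are ignored.
    """
    scores = {}
    for proportion, label in zip(topic_proportions, topic_labels):
        if label is not None:
            scores[label] = scores.get(label, 0.0) + proportion
    total = sum(scores.values())
    if total == 0:
        return {}
    return {label: score / total for label, score in scores.items()}


def sprinkle(tokens, class_probs, max_terms=10):
    """Append artificial class words to a tokenized document.

    The number of copies of each artificial word is proportional to
    the class probability (rounded to the nearest integer).
    """
    sprinkled = list(tokens)
    for label, prob in sorted(class_probs.items()):
        count = round(prob * max_terms)
        # Hypothetical token format; any string outside the real
        # vocabulary would serve the same purpose.
        sprinkled.extend([f"__class_{label}__"] * count)
    return sprinkled


# Usage: three topics, one of which the annotator left unlabeled.
probs = class_distribution([0.5, 0.25, 0.25], ["sports", None, "politics"])
document = sprinkle(["goal", "match"], probs, max_terms=9)
```

A topic model inferred on documents augmented in this way sees the artificial words co-occurring with the real words of each class, which is the mechanism the hypothesis above relies on.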

2 Related Work

Several researchers have proposed semi-supervised text classification algorithms with the aim of reducing the time, effort and cost involved in labeling documents. These algorithms can be broadly divided into three categories depending on how supervision is provided. In the first category, a small set of labeled documents and a large set of unlabeled documents are used to learn a classifier. The semi-supervised text classification algorithms proposed in [], [], [] and [] are a few examples of this type. However, these algorithms are sensitive to the initial labeled documents and to the hyper-parameters of the algorithm.

In the second category, supervision comes in the form of labeled words (features). [] and [] are a few examples of this type. An important limitation of these algorithms is the difficulty of coming up with a small set of words to present to the annotators for labeling. Also, a human annotator may discard or mislabel a polysemous word, which may affect the performance of the text classifier.

The third category of semi-supervised text classification algorithms is based on active learning. In active learning, particular unlabeled documents or features are selected and submitted as queries to an oracle (e.g. a human annotator). [], [], [] are a few examples of active learning based text classification algorithms. However, these algorithms are sensitive to the sampling strategy used to query documents or features.

In our approach, an annotator does not label documents or words; instead, she labels a small set of interpretable topics which are inferred in an unsupervised manner. These topics are very few compared to the number of documents. As the most probable words of the topics are representative of the dataset, there is no need for the annotator to search for the right set of features for each class. As LDA topics are semantically more meaningful than individual words and can be acquired easily, our approach overcomes the limitations of the semi-supervised methods discussed above.

3 Background

3.1 LDA

LDA is an unsupervised probabilistic generative model for collections of discrete data such as text documents. The generative process of LDA can be described as follows:
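The generative process can also be written as a short sampling routine. The following is a minimal sketch using only the Python standard library, under the assumption of symmetric Dirichlet priors with scalar hyper-parameters `alpha` and `beta` (the paper's \alpha_{t} and \beta_{w} may be vectors). The function names and default values are illustrative, not taken from the paper.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus of word ids from the LDA generative process."""
    rng = random.Random(seed)
    # Step 1: for each topic t, draw a word distribution phi_t.
    phi = [sample_dirichlet(rng, beta, vocab_size)
           for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Step 2a: draw topic proportions theta_d for the document.
        theta = sample_dirichlet(rng, alpha, num_topics)
        doc = []
        for _ in range(doc_length):
            # Step 2b-i: draw a topic assignment z_{d,n} from theta_d.
            z = rng.choices(range(num_topics), weights=theta)[0]
            # Step 2b-ii: draw a word w_{d,n} from phi_{z_{d,n}}.
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            doc.append((z, w))
        corpus.append(doc)
    return phi, corpus


# Usage: 3 documents of 5 words each, 2 topics, vocabulary of 10 word ids.
phi, corpus = generate_corpus(3, 5, 2, 10)
```

Inference in LDA runs in the opposite direction: only the words are observed, and the topic assignments, topic proportions and word distributions are estimated from them.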
