Discussion forums have evolved into a dependable source of knowledge to solve common problems. However, only a minority of the posts in discussion forums are solution posts. Identifying solution posts from discussion forums, hence, is an important research problem. In this paper, we present a technique for unsupervised solution post identification leveraging a so far unexplored textual feature, that of lexical correlations between problems and solutions. We use translation models and language models to exploit lexical correlations and solution post character respectively. Our technique is designed to not rely much on structural features such as post metadata since such features are often not uniformly available across forums. Our clustering-based iterative solution identification approach based on the EM-formulation performs favorably in an empirical evaluation, beating the only unsupervised solution identification technique from literature by a very large margin. We also show that our unsupervised technique is competitive against methods that require supervision, outperforming one such technique comfortably.
Discussion forums have become a popular knowledge source for finding solutions to common problems. StackOverflow11http://www.stackoverflow.com, a popular discussion forum for programmers is among the top-100 most visited sites globally22http://www.alexa.com/siteinfo/stackoverflow.com. Now, there are discussion forums for almost every major product ranging from automobiles33http://www.cadillacforums.com/ to gadgets such as those of Mac44https://discussions.apple.com/ or Samsung55http://www.galaxyforums.net/. These typically start with a registered user posting a question/problem66We use problem and question, as well as solution and answer interchangeably in this paper. to which other users respond. Typical response posts include solutions or clarification requests, whereas feedback posts form another major category of forum posts. As is the case with any community of humans, discussion forums have their share of inflammatory remarks too. Mining problem-solution pairs from discussion forums has attracted much attention from the scholarly community in the recent past. Since the first post most usually contains the problem description, identifying its solutions from among the other posts in the thread has been the focus of many recent efforts (e.g., [8, 9]). Extracting problem-solution pairs from forums enables the usage of such knowledge in knowledge reuse frameworks such as case-based reasoning [11] that use problem-solution pairs as raw material. In this paper, we address the problem of unsupervised solution post identification77This problem has been referred to as answer extraction by some papers earlier. However, we use solution identification to refer to the problem since answer and extraction have other connotations in the Question-Answering and Information Extraction communities respectively. from discussion forums.
Among the first papers to address the solution identification problem was the unsupervised approach proposed by [4]. It employs a graph propagation method that prioritizes posts that are (a) more similar to the problem post, (b) more similar to other posts, and (c) authored by a more authoritative user, to be labeled as solution posts. Though seen to be effective in identifying solutions from travel forums, the first two assumptions, (a) and (b), were seen to be not very reliable in solution identification in other kinds of discussion boards. [3] reports a study that illustrates that non-solution posts are, on an average, as similar to the problem as solution posts in technical forums. The second assumption (i.e., (b) above) was also not seen to be useful in discussion forums since posts that are highly similar to other posts were seen to be complaints, repetitive content being more pervasive among complaint posts than solutions [2]. Having exhausted the two obvious textual features for solution identification, subsequent approaches have largely used the presence of lexical cues signifying solution-like narrative (e.g., instructive narratives such as ”check the router for any connection issues”) as the primary content-based feature for solution identification.
All solution identification approaches since [4] have used supervised methods that require training data in the form of labeled solution and non-solution posts. The techniques differ from one another mostly in the non-textual features that are employed in representing posts. A variety of high precision assumptions such as solution post typically follows a problem post [15], solution posts are likely to be within the first few posts, solution posts are likely to have been acknowledged by the problem post author [3], users with high authoritativeness are likely to author solutions [9], and so on have been seen to be useful in solution identification. Being supervised methods, the above assumptions are implicitly factored in by including the appropriate feature (e.g., post position in thread) in the feature space so that the learner may learn the correlation (e.g., solution posts typically are among the first few posts) using the training data. Though such assumptions on structural features, if generic enough, may be built into unsupervised techniques to aid solution identification, the variation in availability of such features across forums limits the usage of models that rely heavily on structural features. For example, some forums employ chronological order based flattening of threads [16] making reply-to information unavailable; models that harness reply-to features would then have limited utility on identifying solutions within such flattened threads. On medical forums, privacy considerations may force forum data to be dumped without author information, making a host of author-id based features unavailable. On datasets that contain data from across forums, the model may have to be aware of the absence of certain features in subsets of the data, or be modeled using features that are available on all threads.
Our Contribution: We propose an unsupervised method for solution identification. The cornerstone of our technique is the usage of a hitherto unexplored textual feature, lexical correlations between problems and solutions, that is exploited along with language model based characterization of solution posts. We model the lexical correlation and solution post character using regularized translation models and unigram language models respectively. To keep our technique applicable across a large variety of forums with varying availability of non-textual features, we design it to be able to work with minimal availability of non-textual features. In particular, we show that by using post position as the only non-textual feature, we are able to achieve accuracies comparable to supervision-based approaches that use many structural features [2].
In this section, we provide a brief overview of previous work related to our problem. Though most of the answer/solution identification approaches proposed so far in literature are supervised methods that require a labeled training corpus, there are a few that require limited or no supervision. Table 1 provides an overview of some of the more recent solution identification techniques from literature, with a focus on some features that we wish to highlight. The common observation that most problem-solving discussion threads have a problem description in the first post has been explicitly factored into many techniques; knowing the problem/question is important for solution identification since author relations between problem and other posts provide valuable cues for solution identification. Most techniques use a variety of such features as noted in Section 1. SVMs have been the most popular method for supervised and semi-supervised learning for the task of solution identification.
Paper Reference | Supervision | Assumptions on | Features other than | Learning |
Problem Position | Post Content Used | Technique | ||
[15] | Supervised | First Post likely | HMM assumes | Naive Bayes |
to be problem | solution follows problem | & HMM | ||
[7] | Supervised | First Post | Post Position, Author, | CRFs |
Context Posts | ||||
[10] | Supervised | None | Post Position, Author, | MaxEnt, |
Previous Posts, Profile etc. | SVM, CRF | |||
[9] | Supervised | First Post | Post Position, Author, | SVM |
Author Authority | ||||
[3] | Supervised | First Post | Post Position, Author, Problem | SVM |
Author’s activities wrt Post | ||||
[2] | Limited | First Post | Post Position/Rating, Author, | SVMs & |
Supervision | Author Rating, Post Ack | Co-Training | ||
[4] | Unsupervised | None | Author, Author Authority, | Graph |
Relation to Problem Author | Propagation | |||
Our Method | Unsupervised | First Post | Post Position | Translation |
Models & LM |
Of particular interest to us are approaches that use limited or no supervision, since we focus on unsupervised solution identification in this paper. The only unsupervised approach for the task, that from [4], uses a graph propagation method on a graph modeled using posts as vertices, and relies on the assumptions that posts that bear high similarity to the problem and other posts and those authored by authoritative users are more likely to be solution posts. Some of those assumptions, as mentioned in Section 1, were later found to be not generalizable to beyond travel forums. The semi-supervised approach presented in [2] uses a few labeled threads to bootstrap SVM based learners which are then co-trained in an iterative fashion. In addition to various features explored in literature, they use acknowledgement modeling so that posts that have been acknowledged positively may be favored for being labeled as solutions.
We will use translation and language models in our method for solution identification. Usage of translation models for modeling the correlation between textual problems and solutions have been explored earlier starting from the answer retrieval work in [18] where new queries were conceptually expanded using the translation model to improve retrieval. Translation models were also seen to be useful in segmenting incident reports into the problem and solution parts [5]; we will use an adaptation of the generative model presented therein, for our solution extraction formulation. Entity-level translation models were recently shown to be useful in modeling correlations in QA archives [17].
Let a thread from a discussion forum be made up of posts. Since we assume, much like many other earlier papers, that the first post is the problem post, the task is to identify which among the remaining posts are solutions. There could be multiple (most likely, different) solutions within the same thread. We may now model the thread as post pairs, each pair having the problem post as the first element, and one of the remaining posts (i.e., reply posts in ) as the second element. Let be the set of such problem-reply pairs from across threads in the discussion forum. We are interested in finding a subset of such that most of the pairs in are problem-solution pairs, and most of those in are not so. In short, we would like to find problem-solution pairs from such that the F-measure88http://en.wikipedia.org/wiki/F1_score for solution identification is maximized.
Central to our approach is the assumption of lexical correlation between the problem and solution texts. At the word level, this translates to assuming that there exist word pairs such that the presence of the first word in the problem part predicts the presence/absence of the second word in the solution part well. Though not yet harnessed for solution identification, the correlation assumption is not at all novel. Infact, the assumption that similar problems have similar solutions (of which the correlation assumption is an offshoot) forms the foundation of case-based reasoning systems [11], a kind of knowledge reuse systems that could be the natural consumers of problem-solution pairs mined from forums. The usage of translation models in QA retrieval [18, 17] and segmentation [5] were also motivated by the correlation assumption. We use an IBM Model 1 translation model [1] in our technique; simplistically, such a model may be thought of as a 2-d associative array where the value is directly related to the probability of occuring in the problem when occurs in the solution.
In K-Means | In Our Approach | |
Data | Multi-dimensional Points | pairs |
Cluster Model | Respective Centroid Vector | Respective and Models for each cluster |
Initialization | Random Choice of Centroids | Models learnt using pairs labeled |
using the Post Position of | ||
E-Step | ||
(Sec 4.3.1), and learn solution word | ||
source probabilities (Sec 4.3.2) | ||
M-Step | Re-learn and using pairs labeled | |
and using pairs labeled (Sec 4.3.3) | ||
Output | The clustering of points | pairs labeled as |
Consider a unigram language model that models the lexical characteristics of solution posts, and a translation model that models the lexical correlation between problems and solutions. Our generative model models the reply part of a pair (in which is a solution) as being generated from the statistical models in as follows.
For each word occuring in ,
Choose
If , Choose
Else, Choose
where denotes the multionomial distribution obtained from conditioned over the words in the post ; this is obtained by assigning each candidate solution word a weight equal to , and normalizing such weights across all solution words. In short, each solution word is assumed to be generated from the language model or the translation model (conditioned on the problem words) with a probability of and respectively, thus accounting for the correlation assumption. The generative model above is similar to the proposal in [5], adapted suitably for our scenario. We model non-solution posts similarly with the sole difference being that they would be sampled from the analogous models and that characterize behavior of non-solution posts.
Example: Consider the following illustrative example of a problem and solution post:
Problem: I am unable to surf the web on the BT public wifi.
Solution: Maybe, you should try disconnecting and rejoining the network.
Of the solution words above, generic words such as try and should could probably be explained by (i.e., sampled from) the solution language model, whereas disconnect and rejoin could be correlated well with surf and wifi and hence are more likely to be supported better by the translation model.
We propose a clustering based approach so as to cluster each of the pairs into either the solution cluster or the non-solution cluster. The objective function that we seek to maximize is the following:
(1) |
indicates the conformance of the pair (details in Section 4.3.1) with the generative model that uses the and models as the language and translation models respectively. The clustering based approach labels each pair as either solution (i.e., ) or non-solution (i.e., ). Since we do not know the models or the labelings to start with, we use an iterative approach modeled on the EM meta-algorithm [6] involving iterations, each comprising of an E-step followed by the M-step. For simplicity and brevity, instead of deriving the EM formulation, we illustrate our approach by making an analogy with the popular K-Means clustering [13] algorithm that also uses the EM formulation and crisp assignments of data points like we do. K-Means is a clustering algorithm that clusters objects represented as multi-dimensional points into clusters where each cluster is represented by the centroid of all its members. Each iteration in K-Means starts off with assigning each data object to its nearest centroid, followed by re-computing the centroid vector based on the assignments made. The analogy with K-Means is illustrated in Table 2.
Though the analogy in Table 2 serves to provide a high-level picture of our approach, the details require further exposition. In short, our approach is a 2-way clustering algorithm that uses two pairs of models, and , to model solution pairs and non-solution pairs respectively. At each iteration, the post-pairs are labeled as either solution () or non-solution () based on which pair of models they better conform to. Within the same iteration, the four models are then re-learnt using the labels and other side information. At the end of the iterations, the pairs labeled are output as solution pairs. We describe the various details in separate subsections herein.
As outlined in Table 2, each pair would be assigned to one of the classes, solution or non-solution, based on whether it conforms better with the solution models (i.e., & ) or non-solution models ( & ), as determined using the function, i.e.,
falls out of the generative model:
where denotes the probability of from and denotes the probability of from the multinomial distribution derived from conditioned over the words in , as in Section 4.2.
Since the language and translation models operate at the word level, the objective function entails that we let the models learn based on their fractional contribution of the words from the language and translation models. Thus, we estimate the proportional contribution of each word from the language and translation models too, in the E-step. The fractional contributions of the word in the pair labeled as solution (i.e., ) is as follows:
The fractional contributions are just the actual supports for the word , normalized by the total contribution for the word from across the two models. Similar estimates, and are made for reply words from pairs labeled . In our example from Section 4.2, words such as rejoin are likely to get higher scores due to being better correlated with problem words and consequently better supported by the translation model; those such as try may get higher scores.
We use the labels and reply-word source estimates from the E-step to re-learn the language and translation models in this step. As may be obvious from the ensuing discussion, those pairs labeled as solution pairs are used to learn the and models and those labeled as non-solution pairs are used to learn the models with subscript . We let each reply word contribute as much to the respective language and translation models according to the estimates in Section 4.3.2. In our example, if the word disconnect is assigned a source probability of and for the translation and language models respectively, the virtual document-pair from that goes into the training of the respective model would assume that disconnect occurs in with a frequency of ; similarly, the respective would account for disconnect with a frequency of . Though fractional word frequencies are not possible in real documents, statistical models can accomodate such fractional frequencies in a straightforward manner. The language models are learnt only over the parts of the pairs since they are meant to characterize reply behavior; on the other hand, translation models learn over both and parts to model correlation.
Regularizing the models: In our formulation, the language and translation models may be seen as competing for ”ownership” of reply words. Consider the post and reply vocabularies to be of sizes and respectively; then, the translation model would have variables, whereas the unigram language model has only variables. This gives the translation model an implicit edge due to having more parameters to tune to the data, putting the language models at a disadvantage. To level off the playing field, we use a regularization99We use the word regularization in a generic sense to mean adapting models to avoid overfitting; in particular, it may be noted that we are not using popular regularization methods such as L1-regularization. operation in the learning of the translation models. The IBM Model 1 learning process uses an internal EM approach where the E-step estimates the alignment vector for each problem word; this vector indicates the distribution of alignments of the problem word across the solution words. In our example, an example alignment vector for wifi could be: . Our regularization method uses a parameter to discard the long tail in the alignment vector by resetting entries having a value to followed by re-normalizing the alignment vector to add up to 1.0. Such pruning is performed at each iteration in the learning of the translation model, so that the following M-steps learn the probability matrix according to such modified alignment vectors.
The semantics of the parameter may be intuitively outlined. If we would like to allow alignment vectors to allow a problem word to align with upto two reply words, we would need to set to a value close to ; ideally though, to allow for the mass consumed by an almost inevitable long tail of very low values in the alignment vector, we would need to set it to slightly lower than , say .
[t]
Input. , a set of pairs
Output. , the set of identified solution pairs
{code}
Initialization
\uln.
\uln.
\uln.
\uln. Learn & using pairs labeled
\uln. Learn & using pairs labeled
EM Iterations
\uln.
E-Step:
\uln.
\uln.
\uln.
\uln.
M-Step:
\uln. Learn & from pairs labeled
using the estimates
\uln. Learn & from pairs labeled
using the estimates
Output
\uln. Output pairs from with
as
K-Means clustering mostly initializes centroid vectors randomly; however, it is non-trivial to initialize the complex translation and language models randomly. Moreover, an initialization such that the and models favor the solution pairs more than the non-solution pairs is critical so that they may progressively lean towards modeling solution behaviour better across iterations. Towards this, we make use of a structural feature; in particular, adapting the hypothesis that solutions occur in the first posts (Ref. [3]), we label the pairs that have the the reply from the second post (note that the first post is assumed to be the problem post) in the thread as a solution post, and all others as non-solution posts. Such an initialization along with uniform reply word source probabilities is used to learn the initial estimates of the , , and models to be used in the E-step for the first iteration. We will show that we are able to effectively perform solution identification using our approach by exploiting just one structural feature, the post position, as above. However, we will also show that we can exploit other features as and when available, to deliver higher accuracy clusterings.
The overall method comprising the steps that have been described is presented in Algorithm 4.3.3. The initialization using the post position (Ref. Sec 4.3.4) is illustrated in Lines 1-5, whereas the EM-iterations form Steps 6 through 12. Of these, the E-step incorporates labeling (Line 8) as described in Sec 4.3.1 and reply-word source estimation (Line 10) detailed in Sec 4.3.2. The models are then re-learnt in the M-Step (Lines 11-12) as outlined in Sec 4.3.3. At the end of the iterations that may run up to times if the labelings do not stabilize earlier, the pairs labeled are output as identified solutions (Line 13).
Time Complexity: Let denote , and the number of unique words in each problem and reply post be and respectively. We will denote the vocabulary size of problem posts as and that of reply posts as . Learning of the language and translation models in each iteration costs and respectively (assuming the translation model learning runs for iterations). The E-step labeling and source estimation cost each. For iterations of our algorithm, this leads to an overall complexity of .
We use a crawl of 140k threads from Apple Discussion forums1010http://discussions.apple.com. Out of these, 300 threads (comprising 1440 posts) were randomly chosen and each post was manually tagged as either solution or non-solution by the authors of [2] (who were kind enough to share the data with us) with an inter-annotator agreement1111http://en.wikipedia.org/wiki/Cohen’s_kappa of 0.71. On an average, of replies in each thread and of first replies were seen to be solutions, leading to an F-measure of for our initialization heuristic. We use the F-measure1212http://en.wikipedia.org/wiki/F1_score for solution identification, as the primary evaluation measure. While we vary the various parameters separately in order to evaluate the trends, we use a dataset of 800 threads (containing the 300 labeled threads) and set and unless otherwise mentioned. Since we have only 300 labeled threads, accuracy measures are reported on those (like in [2]). We pre-process the post data by stemming words [14].
Technique | Precision | Recall | F-Measure | |
Unsupervised Graph Propagation [4] | 29.7 % | 55.6 % | 38.7 % | |
Our Method with only Translation Models () | 41.8 % | 86.8 % | 56.5 % | |
Our Method with only Language Models () | 63.2 % | 62.1 % | 62.6 % | |
Our Method with Both Models () | 61.3 % | 66.9 % | 64.0 % | |
Methods using Supervision [2] | ANS CT | 40.6 % | 88.0 % | 55.6 % |
ANS-ACK PCT | 56.8 % | 84.1 % | 67.8% |
ProblemWord, SolutionWord | ||
---|---|---|
network, guest | 0.0754 | |
connect, adaptor | 0.0526 | |
wireless, adaptor | 0.0526 | |
translat, shortcut | 0.0492 | |
updat, rebuilt | 0.0405 | |
SolutionWord | ||
your | 0.0115 | |
try | 0.0033 | |
router | 0.0033 | |
see | 0.0033 | |
password | 0.0023 |
In this study, we compare the performance of our method under varying settings of against the only unsupervised approach for solution identification from literature, that from [4]. We use an independent implementation of the technique using Kullback-Leibler Divergence [12] as the similarity measure between posts; KL-Divergence was seen to perform best in the experiments reported in [4].
Table 3 illustrates the comparative performance on various quality metrics, of which F-Measure is typically considered most important. Our pure-LM1313Language Model setting (i.e., ) was seen to perform up to F-Measure points better than the pure-TM1414Translation Model setting (i.e., ), whereas the uniform mix is seen to be able to harness both to give a point (i.e., ) improvement over the pure-LM case. The comparison with the approach from [4] illustrates that our method is very clearly the superior method for solution identification outperforming the former by large margins on all the evaluation measures, with the improvement on F-measure being more than points.
Comparison wrt Methods from [2]: Table 3 also lists the performance of SVM-based methods from [2] that use supervised information for solution identification, to help put the performance of our technique in perspective. Of the two methods therein, ANS CT is a more general method that uses two views (structural and lexical) of solutions which are then co-trained. ANS-ACK PCT is an enhanced method that requires author-id information and a means of classifying posts as acknowledgements (which is done using additional supervision); a post being acknowledged by the problem author is then used as a signal to enhance the solution-ness of a post. In the absence of author information (such as may be common in privacy-constrained domains such as medical forums) and extrinsic information to enable identify acknowledgements, ANS CT is the only technique available. Our technique is seen to outperform ANS CT by a respectable margin ( F-measure points) while trailing behind the enhanced ANS-ACK PCT method with a reasonably narrow F-measure point margin. Thus, our unsupervised method is seen to be a strong competitor even for techniques using supervision outlined in [2], illustrating the effectiveness of LM and TM modeling of reply posts.
Across Iterations: For scenarios where computation is at a premium, it is useful to know how quickly the quality of solution identification stabilizes, so that the results can be collected after fewer iterations. Figure 1 plots the F-measure across iterations for the run with setting, where the F-measure is seen to stabilize in as few as 4-5 iterations. Similar trends were observed for other runs as well, confirming that the run may be stopped as early as after the fourth iteration without considerable loss in quality.
Example Estimates from LMs and TMs: In order to understand the behavior of the statistical models, we took the highest 100 entries from both and and attempted to qualitatively evaluate semantics of the words (or word pairs) corresponding to those. Though the stemming made it hard to make sense of some entries, we present some of the understandable entries from among the top-100 in Table 4. The first three entries from deal with connection issues for which adaptor or guest account related solutions are proposed, whereas the remaining have something to do with the mac translator app and rebuilding libraries after an update. The top words from include imperative words and words from solutions to common issues that include actions pertaining to the router or password.
We now analyse the performance of our approach against varying parameter settings. In particular, we vary and values and the dataset size, and experiment with some initialization variations.
Varying : is the weighting parameter that indicates the fraction of weight assigned to LMs (vis-a-vis TMs). As may be seen from Figure 4, the quality of the results as measured by the F-measure is seen to peak around the middle (i.e., ), and decline slowly towards either extreme, with a sharp decline at (i.e., pure-TM setting). This indicates that a uniform mix is favorable; however, if one were to choose only one type of model, usage of LMs is seen to be preferable than TMs.
Varying : is directly related to the extent of pruning of TMs, in the regularization operation; all values in the alignment vector are pruned. Thus, each problem word is roughly allowed to be aligned with at most solution words. The trends from Figure 4 suggests that allowing a problem word to be aligned to up to solution words (i.e., ) is seen to yield the best performance though the quality decline is graceful towards either side of the range.
Varying Data Size: Though more data always tends to be beneficial since statistical models benefit from redundancy, the marginal utility of additional data drops to very small levels beyond a point; we are interested in the amount of data beyond which the quality of solution identification flattens out. Figure 4 suggests that there is a sharp improvement in quality while increasing the amount of data from threads (i.e., pairs) to 550 ( pairs), whereas the increment is smaller when adding another 250 pairs (total of pairs). Beyond threads, the F-measure was seen to flatten out rapidly and stabilize at .
Initialization: In Apple discussion forums, posts by Apple employees that are labeled with the Apple employees tag (approximately of posts in our dataset) tend to be solutions. So are posts that are marked Helpful ( of posts) by other users. Being specific to Apple forums, we did not use them for initialization in experiments so far with the intent of keeping the technique generic. However, when such posts are initialized as solutions (in addition to first replies as we did earlier), the F-score for solution identification for our technique was seen to improve slightly, to (from ). Thus, our technique is able to exploit any extra solution identifying structural features that are available.
We considered the problem of unsupervised solution post identification from discussion forum threads. Towards identifying solutions to the problem posed in the initial post, we proposed the usage of a hitherto unexplored textual feature for the solution identification problem; that of lexical correlations between problems and solutions. We model and harness lexical correlations using translation models, in the company of unigram language models that are used to characterize reply posts, and formulate a clustering-based EM approach for solution identification. We show that our technique is able to effectively identify solutions using just one non-content based feature, the post position, whereas previous techniques in literature have depended heavily on structural features (that are not always available in many forums) and supervised information. Our technique is seen to outperform the sole unsupervised solution identification technique in literature, by a large margin; further, our method is even seen to be competitive to recent methods that use supervision, beating one of them comfortably, and trailing another by a narrow margin. In short, our empirical analysis illustrates the superior performance and establishes our method as the method of choice for unsupervised solution identification.
Exploration into the usage of translation models to aid other operations in discussion forums such as proactive word suggestions for solution authoring would be interesting direction for follow-up work. Discovery of problem-solution pairs in cases where the problem post is not known beforehand, would be a challenging problem to address.