A typical knowledge-based question answering (KB-QA) system faces two challenges: one is to transform natural language questions into their meaning representations (MRs); the other is to retrieve answers from knowledge bases (KBs) using the generated MRs. Unlike previous methods, which treat these two tasks in a cascaded manner, we present a translation-based approach that solves them in one unified framework. We translate questions to answers based on CYK parsing. Answers, as translations of the span covered by each CYK cell, are obtained by a question translation method, which first generates formal triple queries as MRs for the span based on question patterns and relation expressions, and then retrieves answers from a given KB using the generated triple queries. A linear model is defined over derivations, and minimum error rate training is used to tune feature weights based on a set of question-answer pairs. Compared to a KB-QA system using a state-of-the-art semantic parser, our method achieves better results.
Knowledge-based question answering (KB-QA) computes answers to natural language (NL) questions based on existing knowledge bases (KBs). Most previous systems tackle this task in a cascaded manner: First, the input question is transformed into its meaning representation (MR) by an independent semantic parser [26, 20, 2, 17, 4, 22, 1, 14, 3]; Then, the answers are retrieved from existing KBs using generated MRs as queries.
Unlike existing KB-QA systems, which treat semantic parsing and answer retrieval as two cascaded tasks, this paper presents a unified framework that integrates semantic parsing directly into the question answering procedure. Borrowing ideas from machine translation (MT), we treat the QA task as a translation procedure. As in MT, CYK parsing is used to parse each input question, and the answers of the span covered by each CYK cell are considered the translations of that cell; unlike MT, which translates source phrases into target translations using offline-generated translation tables, a semantic parsing-based question translation method is used to translate each span into its answers on the fly, based on question patterns and relation expressions. The final answers can be obtained from the root cell. Derivations generated during such a translation procedure are modeled by a linear model, and minimum error rate training (MERT) [21] is used to tune feature weights based on a set of question-answer pairs.
Figure 1 shows an example: the question director of movie starred by Tom Hanks is translated to one of its answers Robert Zemeckis by three main steps: (i) translate director of to itself, since this span alone yields no answer; (ii) translate movie starred by Tom Hanks to one of its answers Forrest Gump; (iii) translate director of Forrest Gump to a final answer Robert Zemeckis. Note that the updated question covered by Cell[0, 6] is obtained by combining the answers to the question spans covered by Cell[0, 1] and Cell[2, 6].
The contributions of this work are two-fold: (1) We propose a translation-based KB-QA method that integrates semantic parsing and QA in one unified framework. The benefit of our method is that we do not need to explicitly generate complete semantic structures for input questions. Moreover, answers generated during the translation procedure significantly help prune the search space. (2) We propose a robust method to transform single-relation questions into formal triple queries as their MRs, which trades off between transformation accuracy and recall using question patterns and relation expressions respectively.
Formally, given a knowledge base $\mathcal{KB}$ and an NL question $Q$, our KB-QA method generates a set of formal triples-answer pairs $\langle \mathcal{D}, A \rangle$ as derivations, which are scored and ranked by the distribution $P(\langle \mathcal{D}, A \rangle \mid \mathcal{KB}, Q)$ defined as follows:

$$P(\langle \mathcal{D}, A \rangle \mid \mathcal{KB}, Q) = \frac{\exp\{\sum_{i=1}^{M}\lambda_i \cdot h_i(\langle \mathcal{D}, A \rangle, \mathcal{KB}, Q)\}}{\sum_{\langle \mathcal{D}', A' \rangle \in \mathcal{H}(Q)} \exp\{\sum_{i=1}^{M}\lambda_i \cdot h_i(\langle \mathcal{D}', A' \rangle, \mathcal{KB}, Q)\}}$$
$\mathcal{KB}$ denotes a knowledge base that stores a set of assertions. (We use a large scale knowledge base in this paper, which contains 2.3B entities, 5.5K predicates, and 18B assertions; a 16-machine cluster is used to host and serve the whole data.) Each assertion $t \in \mathcal{KB}$ is in the form of $\langle e_{sbj}, p, e_{obj} \rangle$, where $p$ denotes a predicate, and $e_{sbj}$ and $e_{obj}$ denote the subject and object entities of $t$, each with a unique ID. (Each KB entity has a unique ID; for the sake of convenience, we omit the ID information in the rest of the paper.)
$\mathcal{H}(Q)$ denotes the search space of derivations. Each derivation $\mathcal{D}$ is composed of a set of ordered formal triples $\{t_1, \dots, t_n\}$. Each triple $t = \langle e_{sbj}, p, e_{obj} \rangle_{[i,j]}$ denotes an assertion in $\mathcal{KB}$, where $i$ and $j$ denote the beginning and end indexes of the question span from which $t$ is transformed. The order of triples in $\mathcal{D}$ denotes the order of translation steps from $Q$ to $A$. E.g., $\langle$director of, Null, director of$\rangle_{[0,1]}$, $\langle$Tom Hanks, Film.Actor.Film, Forrest Gump$\rangle_{[2,6]}$ and $\langle$Forrest Gump, Film.Film.Director, Robert Zemeckis$\rangle_{[0,6]}$ are three ordered formal triples corresponding to the three translation steps in Figure 1. We define the task of transforming question spans into formal triples as question translation. $A$ denotes one final answer of $Q$.
$h_i(\cdot)$ denotes the $i$-th feature function.
$\lambda_i$ denotes the feature weight of $h_i(\cdot)$.
According to the above description, our KB-QA method can be decomposed into four tasks: (1) search space generation for $\mathcal{H}(Q)$; (2) question translation for transforming question spans into their corresponding formal triples; (3) feature design for $h_i(\cdot)$; and (4) feature weight tuning for $\lambda_i$. We present details of these four tasks in the following subsections one-by-one.
We first present our translation-based KB-QA method in Algorithm 1, which is used to generate $\mathcal{H}(Q)$ for each input NL question $Q$.
Algorithm 1: Translation-based KB-QA (CYK-style span translation; its two halves are described below).
The first half (from Line 1 to Line 13) generates a formal triple set for each unary span $Q_{[i,j]}$, using the question translation method (Line 4), which takes $Q_{[i,j]}$ as the input. Each triple $t$ returned is in the form of $\langle e_{sbj}, p, e_{obj} \rangle$, where $e_{sbj}$'s mention occurs in $Q_{[i,j]}$, $p$ is a predicate that denotes the meaning expressed by the context of $e_{sbj}$ in $Q_{[i,j]}$, and $e_{obj}$ is an answer of $Q_{[i,j]}$ retrieved from $\mathcal{KB}$ based on $e_{sbj}$ and $p$. We describe the implementation detail of question translation in Section 2.3.
The second half (from Line 14 to Line 31) first updates the content of each bigger span $Q_{[i,j]}$ by concatenating the answers to any two consecutive smaller spans it covers (Line 18). Then, question translation is called to generate triples for the updated span (Line 19). The above operations are equivalent to answering a simplified question, which is obtained by replacing the answerable spans in the original question with their corresponding answers. The search space $\mathcal{H}(Q)$ for the entire question is returned at last (Line 31).
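The following is a minimal sketch of this CYK-style procedure, not the paper's exact Algorithm 1: `q_trans` stands in for the question translation method of Section 2.3 (returning subject-predicate-object triples, or a Null triple whose "answer" is the span itself for un-answerable spans), `score_of` stands in for the linear model of Section 2.4, and the beam size is an assumption.

```python
from itertools import product

def score_of(triples):
    # Placeholder for the linear model score over a derivation's triples.
    return 0.0

def translate(question_words, q_trans, beam_size=20):
    n = len(question_words)
    cell = {}  # cell[(i, j)]: k-best derivations (answer, triples, score) for span [i, j]

    # First half: translate every span directly with the question translation method.
    for i in range(n):
        for j in range(i, n):
            span = " ".join(question_words[i:j + 1])
            derivs = [(t[2], [t], score_of([t])) for t in q_trans(span)]
            cell[(i, j)] = sorted(derivs, key=lambda d: -d[2])[:beam_size]

    # Second half: for each bigger span, concatenate the answers of two consecutive
    # smaller spans, translate the updated (partially answered) span, keep the k-best.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            combined = list(cell[(i, j)])
            for m in range(i, j):
                for left, right in product(cell[(i, m)], cell[(m + 1, j)]):
                    updated_span = left[0] + " " + right[0]
                    for t in q_trans(updated_span):
                        triples = left[1] + right[1] + [t]
                        combined.append((t[2], triples, score_of(triples)))
            cell[(i, j)] = sorted(combined, key=lambda d: -d[2])[:beam_size]

    return cell[(0, n - 1)]  # search space for the whole question
```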
The purpose of question translation is to translate a span $Q_{[i,j]}$ to a set of formal triples. Each triple is in the form of $\langle e_{sbj}, p, e_{obj} \rangle$, where $e_{sbj}$'s mention occurs in $Q_{[i,j]}$ (for simplicity, a cleaned entity dictionary dumped from the entire $\mathcal{KB}$ is used to detect entity mentions), $p$ is a predicate that denotes the meaning expressed by the context of $e_{sbj}$ in $Q_{[i,j]}$, and $e_{obj}$ is an answer to $Q_{[i,j]}$ retrieved from $\mathcal{KB}$ using a triple query $q = \langle e_{sbj}, p, ? \rangle$. Note that if no predicate or answer can be generated, $\langle Q_{[i,j]}, Null, Q_{[i,j]} \rangle$ will be returned as a special triple, which sets $e_{obj}$ to be $Q_{[i,j]}$ itself and $p$ to be Null. This makes sure that un-answerable spans can be passed on to the higher-level operations.
Question translation assumes each span is a single-relation question (Fader et al., 2013). This assumption reduces the semantic parsing effort to minimal question units, while leaving the capability of handling multiple-relation questions (Figure 1 gives one such example) to the outer CYK-parsing-based translation procedure. Two question translation methods are presented in the rest of this subsection, based on question patterns and relation expressions respectively.
Algorithm 2: Question pattern (QP)-based question translation.
A question pattern includes a pattern string, which is composed of words and a slot symbol, and a KB predicate, which denotes the meaning expressed by the context words in the pattern string.
Algorithm 2 shows how to generate formal triples for a span $Q_{[i,j]}$ based on question patterns (QP-based question translation). For each entity mention detected in $Q_{[i,j]}$, we replace it with the slot symbol and obtain a pattern string (Line 3). If this pattern string matches one question pattern, then we construct a triple query $q$ (Line 9) using the pattern's predicate as its predicate and one of the KB entities returned by the entity disambiguation step as its subject entity (Line 6). Here, the objective of disambiguation is to output a set of disambiguated KB entities: the name of each entity returned equals the input entity mention and occurs in some assertions where the pattern's predicate is the predicate. The underlying idea is to use the context (predicate) information to help entity disambiguation. The answers of $q$ are then retrieved from $\mathcal{KB}$ (Line 10), and each of them is used to construct a formal triple that is added to the triple set of $Q_{[i,j]}$ (Lines 11 to 16). Figure 2 gives an example.
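Below is a minimal sketch of this QP-based translation step, in the spirit of Algorithm 2. The slot string "[slot]", the entity dictionary, the pattern store, and the KB lookup helpers (`entities_named`, `has_assertion`, `objects_of`) are all assumptions, not the paper's implementation.

```python
def qp_translate(span, question_patterns, entity_dict, kb):
    """Return formal triples (e_sbj, predicate, e_obj) for a single-relation span."""
    triples = []
    for mention in detect_entity_mentions(span, entity_dict):
        pattern_string = span.replace(mention, "[slot]")
        for qp_pattern, qp_predicate in question_patterns:
            if pattern_string != qp_pattern:
                continue
            # Disambiguation: keep only KB entities with this name that occur in
            # assertions whose predicate is the pattern's predicate.
            for e_sbj in disambiguate(mention, qp_predicate, kb):
                for e_obj in kb.objects_of(e_sbj, qp_predicate):  # answers of the triple query
                    triples.append((e_sbj, qp_predicate, e_obj))
    return triples

def detect_entity_mentions(span, entity_dict):
    # Simple string match against a cleaned entity dictionary (see Section 2.3).
    return [name for name in entity_dict if name.lower() in span.lower()]

def disambiguate(mention, predicate, kb):
    return [e for e in kb.entities_named(mention) if kb.has_assertion(e, predicate)]
```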
Question patterns are collected as follows: First, 5W queries, which begin with What, Where, Who, When, or Which, are selected from a large scale query log of a commercial search engine; Then, a cleaned entity dictionary is used to annotate each query by replacing all entity mentions it contains with the slot symbol, and only high-frequency query patterns which contain exactly one slot are kept; Lastly, annotators manually label the 50,000 most frequent query patterns with their corresponding predicates, and 4,764 question patterns with a single labeled predicate are obtained.
From experiments (Table 3 in Section 4.3) we can see that question pattern-based question translation achieves high end-to-end accuracy. But as human effort is needed in the mining procedure, this method cannot easily be extended to large scale. Besides, different users often express the same meaning with different NL expressions. For example, although the question Forrest Gump was directed by which moviemaker means the same as the question in Figure 2, no question pattern can cover it. We need an alternative way to alleviate this coverage issue.
To alleviate the coverage issue of the QP-based method, an alternative relation expression (RE)-based method is proposed, which is used when the QP-based method fails.
We define $RE_p$ as the relation expression set for a given KB predicate $p$. Each relation expression in $RE_p$ includes an expression string, which must contain at least one content word, and a weight, which denotes the confidence that this string can represent $p$'s meaning in NL. For example, is the director of is one relation expression string for the predicate Film.Film.Director, which means this string is usually used to express this relation (predicate) in NL.
Algorithm 3: Relation expression (RE)-based question translation.
Algorithm 3 shows how to generate triples for a question span based on relation expressions. For each possible entity mention $e_{sbj}$ and each KB predicate $p$ that is related to a KB entity whose name equals this mention, $Score(Q_{[i,j]}, e_{sbj}, p)$ is computed (Line 5) based on the similarity between the question context and $RE_p$, which measures how likely $Q_{[i,j]}$ can be transformed into a triple query $q = \langle e_{sbj}, p, ? \rangle$. If this score is larger than 0, which means there are overlaps between $Q_{[i,j]}$'s context and $RE_p$, then $q$ will be used as the triple query of $Q_{[i,j]}$, and a set of formal triples will be generated based on $q$ and $\mathcal{KB}$ (Lines 7 to 15). The computation of $Score(Q_{[i,j]}, e_{sbj}, p)$ is defined as follows:

$$Score(Q_{[i,j]}, e_{sbj}, p) = \sum_{n=1}^{5} \sum_{\omega_n} P(\omega_n \mid RE_p)$$

where $n$ is the n-gram order, which ranges from 1 to 5, $\omega_n$ is an n-gram occurring in $Q_{[i,j]}$ without overlapping with $e_{sbj}$'s mention and containing at least one content word, and $P(\omega_n \mid RE_p)$ is the posterior probability, which is computed by:

$$P(\omega_n \mid RE_p) = \frac{Count(\omega_n, RE_p)}{\sum_{\omega'_n} Count(\omega'_n, RE_p)}$$

$Count(\omega_n, RE_p)$ denotes the weighted sum of times that $\omega_n$ occurs in $RE_p$:

$$Count(\omega_n, RE_p) = \sum_{re \in RE_p} \#_{\omega_n}(re) \cdot \omega_{re}$$

where $\#_{\omega_n}(re)$ denotes the number of times that $\omega_n$ occurs in $re$'s expression string, and $\omega_{re}$ is $re$'s weight, which is decided by the relation expression extraction component.
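A minimal sketch of this matching score follows. The function names, the stopword-based content-word test, and the removal of the mention words before forming context n-grams (a simplification of "non-overlapping") are assumptions.

```python
from collections import Counter

def ngrams(words, n):
    return [" ".join(words[k:k + n]) for k in range(len(words) - n + 1)]

def has_content_word(gram):
    stopwords = {"the", "of", "by", "in", "a", "an", "is", "was", "for", "to"}
    return any(w not in stopwords for w in gram.split())

def re_score(span_words, mention_words, relation_expressions, max_order=5):
    """relation_expressions: list of (expression_string, weight) pairs for one predicate."""
    # Weighted n-gram counts over the expression set RE_p, kept per n-gram order.
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for expr, weight in relation_expressions:
        expr_words = expr.split()
        for n in range(1, max_order + 1):
            for gram in ngrams(expr_words, n):
                counts[n][gram] += weight

    # Context n-grams of the span: drop the entity mention, require a content word.
    context_words = [w for w in span_words if w not in mention_words]
    score = 0.0
    for n in range(1, max_order + 1):
        total = sum(counts[n].values())
        if total == 0:
            continue
        for gram in ngrams(context_words, n):
            if has_content_word(gram) and gram in counts[n]:
                score += counts[n][gram] / total   # posterior P(gram | RE_p)
    return score
```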
Figure 3 gives an example, where the n-grams marked with rectangles are the ones that occur in both $Q_{[i,j]}$'s context and the relation expression set of a given predicate $p$. Unlike the QP-based method, which needs a perfect match, the RE-based method allows fuzzy matching between $Q_{[i,j]}$ and $RE_p$, and records $Score(Q_{[i,j]}, e_{sbj}, p)$ (Line 13) in the generated triples, which is later used as a feature.
Relation expressions are mined as follows: Given a set of KB assertions with an identical predicate $p$, we first extract all sentences from English Wiki pages (http://en.wikipedia.org/wiki/Wikipedia:Database_download), each of which contains at least one pair of entities occurring in one assertion. Then, we extract the shortest path between the paired entities in the dependency tree of each sentence as a relation expression candidate for the given predicate. The intuition is that any sentence containing an entity pair that occurs in an assertion is likely to express the predicate of that assertion in some way. Lastly, all relation expressions extracted are filtered by heuristic rules (i.e., the frequency must be larger than 4 and the length must be shorter than 10) and then weighted by the pattern scoring methods proposed in [10, 11]. For each predicate, we only keep the relation expressions whose pattern scores are larger than a pre-defined threshold. Figure 4 gives one relation expression extraction example. The statistics and overall quality of the relation expressions are listed in Section 4.1.
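A minimal sketch of the shortest-dependency-path step is shown below. The dependency tree is represented as a list of parent indexes (-1 for the root); this representation and the helper names are assumptions rather than the paper's implementation, and any real pipeline would plug in an actual parser.

```python
def shortest_dependency_path(heads, words, i, j):
    """Return the words on the tree path between token i and token j."""
    def ancestors(k):
        path = [k]
        while heads[k] != -1:
            k = heads[k]
            path.append(k)
        return path

    anc_i, anc_j = ancestors(i), ancestors(j)
    common = next(k for k in anc_i if k in anc_j)  # lowest common ancestor
    # Walk up from i to the common ancestor, then down to j.
    path = anc_i[:anc_i.index(common) + 1] + list(reversed(anc_j[:anc_j.index(common)]))
    return [words[k] for k in path]
```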
Sometimes, a question may provide multiple constraints on its answers. movie starred by Tom Hanks in 1994 is one such question. All films given as answers to this question should satisfy the following two constraints: (1) starred by Tom Hanks; and (2) released in 1994. It is easy to see that such questions cannot be translated to single triples.
We propose a dependency tree-based method to handle such multiple-constraint questions by (i) decomposing the original question into a set of sub-questions using syntax-based patterns, and (ii) intersecting the answers of all sub-questions as the final answers of the original question. Note that question decomposition only operates on the original question and on question spans covered by complete dependency subtrees. Four syntax-based patterns (Figure 5) are used for question decomposition. If a question matches any one of these patterns, then sub-questions are generated by collecting the paths between the pattern's root and each of its subtree nodes, where each such node denotes a complete subtree with a noun, number, or question word as its root, and a preposition marked as optional in the pattern can be skipped in matching. For the question mentioned at the beginning, its two generated sub-questions are movie starred by Tom Hanks and movie starred in 1994, as its dependency form matches pattern (a). Similar ideas are used in IBM Watson [13] as well.
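A minimal sketch of step (ii), the answer intersection, is given below; `answer` stands in for applying the full KB-QA pipeline to a sub-question, and the function names are assumptions.

```python
def answer_multi_constraint(sub_questions, answer):
    """Intersect the answer sets of all sub-questions of a multiple-constraint question."""
    answer_sets = [set(answer(sq)) for sq in sub_questions]
    return set.intersection(*answer_sets) if answer_sets else set()

# e.g. answer_multi_constraint(
#     ["movie starred by Tom Hanks", "movie starred in 1994"], answer)
```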
As dependency parsing is not perfect, we also generate single triples for such questions without considering constraints, and add them to the search space for competition. A dedicated feature (Section 2.4) is used to boost triples that are converted from sub-questions generated by question decomposition: the more constraints an answer satisfies, the better. Obviously, the patterns currently used cannot cover all cases, only the most common ones. We leave a more general pattern mining method for future work.
The objective of our KB-QA system is to seek the derivation that maximizes the probability described in Section 2.1:

$$\langle \mathcal{D}, A \rangle^{*} = \operatorname*{arg\,max}_{\langle \mathcal{D}, A \rangle \in \mathcal{H}(Q)} P(\langle \mathcal{D}, A \rangle \mid \mathcal{KB}, Q)$$
We now introduce the feature sets that are used in the above linear model:
$h_{question\_word}(\cdot)$, which counts the number of original question words occurring in $\mathcal{D}$. It penalizes partially answered questions.
$h_{span}(\cdot)$, which counts the number of spans in $Q$ that are converted to formal triples. It controls the granularity of the spans used in question translation.
$h_{syntax\_subtree}(\cdot)$, which counts the number of spans in $Q$ that are (1) converted to formal triples whose predicates are not Null, and (2) covered by complete dependency subtrees at the same time. The underlying intuition is that dependency subtrees of $Q$ should be treated as units for question translation.
$h_{syntax\_constraint}(\cdot)$, which counts the number of triples in $\mathcal{D}$ that are converted from sub-questions generated by the question decomposition component.
$h_{triple\_count}(\cdot)$, which counts the number of triples in $\mathcal{D}$ whose predicates are not Null.
$h_{triple\_weight}(\cdot)$, which sums the scores of all triples in $\mathcal{D}$, i.e., $\sum_{t \in \mathcal{D}} Score(t)$.
$h_{QP\_count}(\cdot)$, which counts the number of triples in $\mathcal{D}$ that are generated by the QP-based question translation method.
$h_{RE\_count}(\cdot)$, which counts the number of triples in $\mathcal{D}$ that are generated by the RE-based question translation method.
$h_{static\_rank\_sbj}(\cdot)$, which sums the static rank scores of all subject entities in $\mathcal{D}$'s triple set, i.e., $\sum_{t \in \mathcal{D}} static\_rank(t.e_{sbj})$.
$h_{static\_rank\_obj}(\cdot)$, which sums the static rank scores of all object entities in $\mathcal{D}$'s triple set, i.e., $\sum_{t \in \mathcal{D}} static\_rank(t.e_{obj})$.
$h_{confidence\_obj}(\cdot)$, which sums the confidence scores of all object entities in $\mathcal{D}$'s triple set, i.e., $\sum_{t \in \mathcal{D}} confidence(t.e_{obj})$.
For each assertion stored in $\mathcal{KB}$, $static\_rank(e_{sbj})$ and $static\_rank(e_{obj})$ denote the static rank scores for $e_{sbj}$ and $e_{obj}$ respectively (the static rank score of an entity represents a general indicator of the overall quality of that entity); $confidence(e_{obj})$ represents the probability $p(e_{obj} \mid e_{sbj}, p)$. These three scores are used as features to rank answers generated in the QA procedure.
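A minimal sketch of scoring a derivation with this linear model is shown below, given a dictionary of feature values for one derivation; the feature names in the example are illustrative placeholders, not the paper's exact notation.

```python
def model_score(feature_values, feature_weights):
    """score(D, A) = sum_i lambda_i * h_i(D, A, KB, Q)."""
    return sum(feature_weights.get(name, 0.0) * value
               for name, value in feature_values.items())

# Illustrative feature values for a single derivation.
example_features = {
    "question_word_count": 6,   # original question words covered
    "span_count": 2,            # spans converted to formal triples
    "triple_count": 2,          # triples with non-Null predicates
    "QP_count": 1,              # triples from QP-based translation
    "RE_count": 1,              # triples from RE-based translation
}
```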
Given a set of question-answer pairs as the development (dev) set, we use the minimum error rate training (MERT) [21] algorithm to tune the feature weights $\lambda_i$ in our proposed model. The training criterion is to seek the feature weights that minimize the accumulated errors of the top-1 answers of questions in the dev set:

$$\hat{\lambda} = \operatorname*{arg\,min}_{\lambda} \sum_{i=1}^{N} Err(A_i^{ref}, \hat{A}_i; \lambda)$$
$N$ is the number of questions in the dev set, $A_i^{ref}$ is the set of correct answers used as references for the $i$-th question in the dev set, $\hat{A}_i$ is the top-1 answer candidate of the $i$-th question in the dev set based on feature weights $\lambda$, and $Err(\cdot)$ is the error function, which is defined as:

$$Err(A_i^{ref}, \hat{A}_i; \lambda) = 1 - \delta(\hat{A}_i, A_i^{ref})$$
where $\delta(\hat{A}_i, A_i^{ref})$ is an indicator function which equals 1 when $\hat{A}_i$ is included in the reference set $A_i^{ref}$, and 0 otherwise.
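A minimal sketch of this tuning criterion is given below: it only evaluates the accumulated top-1 error that MERT minimizes, while the line-search machinery of MERT itself is omitted. The `rerank` helper, which returns the top-1 answer under a given weight vector, is an assumption.

```python
def top1_error(dev_set, weights, rerank):
    """dev_set: list of (question, reference_answer_set) pairs."""
    errors = 0
    for question, references in dev_set:
        best_answer = rerank(question, weights)  # top-1 answer under these weights
        errors += 0 if best_answer in references else 1
    return errors
```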
Our work intersects with two research directions: semantic parsing and question answering.
Some previous works on semantic parsing [25, 26, 23, 27, 24, 15, 16] require manually annotated logical forms as supervision, and it is hard to extend the resulting parsers from limited domains, such as GEO, JOBS and ATIS, to open domains. Recent works [5, 18] have alleviated this issue by using question-answer pairs as weak supervision, but still suffer from using limited lexical triggers to link NL phrases to predicates. Poon (2013) has proposed an unsupervised method that adopts grounded learning to leverage the database for indirect supervision, but the transformation from NL questions to MRs heavily depends on dependency parsing results, and the KB used (ATIS) is limited as well. Kwiatkowski et al. (2013) use Wiktionary and a limited manual lexicon to map POS tags to a set of predefined CCG lexical categories, which aims to reduce the need for learning the lexicon from training data, but it still needs human effort to define lexical categories, which usually cannot cover all semantic phenomena.
Berant et al. (2013) have not only used a much larger KB, Freebase [12], but also used a bigger lexical trigger set extracted by the open IE method [19] for linking NL phrases to predicates. In comparison, our method has further advantages: (1) Question answering and semantic parsing are performed jointly under a unified framework; (2) A robust method is proposed to map NL questions to their formal triple queries, which trades off between transformation accuracy and recall by using question patterns and relation expressions in a cascaded way; and (3) We use a domain-independent feature set, which allows us to use a relatively small number of question-answer pairs to tune model parameters.
Fader et al. (2013) map questions to formal (triple) queries over a large scale, open-domain database of facts extracted from a raw corpus by ReVerb [8]. Compared to their work, our method has advantages in two aspects: (1) Instead of using facts extracted by the open IE method, we leverage a large scale, high-quality knowledge base; (2) Based on our translation-based KB-QA framework, we can handle multiple-relation questions instead of single-relation queries only.
Espana-Bonet and Comas (2012) have proposed an MT-based method for factoid QA, but MT in their work means translating questions into n-best translations, which are used for finding similar sentences in the document collection that probably contain answers. Echihabi and Marcu (2003) have developed a noisy-channel model for QA, which explains how a sentence containing an answer to a given question can be rewritten into that question through a sequence of stochastic operations. Compared to the above two MT-motivated QA works, our method uses MT methodology to translate questions to answers directly.
Following Berant et al. (2013), we use the same subset of WEBQUESTIONS (3,778 questions) as the development set (Dev) for weight tuning in MERT, and use the other part of WEBQUESTIONS (2,032 questions) as the test set (Test). Table 1 shows the statistics of this data set.
| Data Set | # Questions | Avg. # Words |
|---|---|---|
| WEBQUESTIONS | 5,810 | 6.7 |
Table 2 shows the statistics of the question patterns and relation expressions used in our KB-QA system. As all question patterns are collected with human involvement, as discussed in Section 2.3.1, their quality is very high (98%). We also sample 1,000 instances from the whole relation expression set and manually label their quality; the accuracy is around 89%. These two resources cover 566 head predicates in our KB.
| | # Entries | Accuracy |
|---|---|---|
| Question Patterns | 4,764 | 98% |
| Relation Expressions | 133,445 | 89% |
Since Berant et al. (2013) is one of the latest works that reports QA results based on a large scale, general domain knowledge base (Freebase), we consider their evaluation result on WEBQUESTIONS as our baseline.
Our KB-QA system generates the k-best derivations for each question span, where k is set to 20. The answers with the highest model scores are considered the best answers for evaluation. For evaluation, we follow Berant et al. (2013) to allow partial credit and score an answer using the F1 measure, comparing the predicted set of entities to the annotated set of entities.
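A minimal sketch of this partial-credit F1 scoring is shown below; the function name is an assumption.

```python
def f1(predicted, annotated):
    """F1 between the predicted entity set and the annotated entity set."""
    predicted, annotated = set(predicted), set(annotated)
    if not predicted or not annotated:
        return 0.0
    overlap = len(predicted & annotated)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(annotated)
    return 2 * precision * recall / (precision + recall)
```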
One difference between the two systems is the KB used. Since Freebase is completely contained in our KB, we disallow all entities which are not included in Freebase. By doing so, our KB provides the same knowledge as Freebase does, which means we do not gain any extra advantage from using a larger KB. But we still allow ourselves to use the static rank scores and confidence scores of entities as features, as described in Section 2.4.
We first show the overall evaluation results of our KB-QA system and compare them with baseline’s results on Dev and Test. Note that we do not re-implement the baseline system, but just list their evaluation numbers reported in the paper. Comparison results are listed in Table 3.
| | Dev (Accuracy) | Test (Accuracy) |
|---|---|---|
| Baseline | 32.9% | 31.4% |
| Our Method | 42.5% (+9.6%) | 37.5% (+6.1%) |
Table 3 shows that our KB-QA method outperforms the baseline on both Dev and Test. We think the potential reasons for this improvement include:
Different methods are used to map NL phrases to KB predicates. Berant et al. (2013) use a lexicon extracted from a subset of ReVerb triples [19], which is similar to the relation expression set used in our question translation. But as our relation expressions are extracted by an in-house extractor, we can record their extraction-related statistics as extra information and use them as features to measure the mapping quality. Besides, as a portion of entities in our KB are extracted from Wikipedia, we know the one-to-one correspondence between such entities and Wikipedia pages, and use this information in relation expression extraction for entity disambiguation. A lower disambiguation error rate results in better relation expressions.
Question patterns are used to map NL contexts to KB predicates. A context can be either a continuous or discontinuous phrase. Although the size of this set is limited, the patterns can actually cover head questions/queries (i.e., questions/queries with high frequency and clear patterns) very well. The underlying intuition of using patterns is that such high-frequency questions/queries should and can be treated and solved in the QA task by involving human effort at a relatively small price but with very impressive accuracy.
In order to figure out the impacts of question patterns and relation expressions, another experiment (Table 4) is designed to evaluate their independent influences, where QP only and RE only denote the results of KB-QA systems which only allow question patterns and relation expressions in question translation respectively.
| Settings | Test (Accuracy) | Test (Precision) |
|---|---|---|
| QP only | 11.8% | 97.5% |
| RE only | 32.5% | 73.2% |
From Table 4 we can see that the accuracy of RE only on Test (32.5%) is slightly better than the baseline's result (31.4%). We think this improvement comes from two aspects: (1) The quality of the relation expressions is better than the quality of the lexicon entries used in the baseline; and (2) We use the extraction-related statistics of relation expressions as features, which brings more information to measure the confidence of mapping between NL phrases and KB predicates and makes the model more flexible. Meanwhile, QP only performs worse (11.8%) than RE only, due to the coverage issue. But by comparing the precisions of these two settings, we find that QP only (97.5%) outperforms RE only (73.2%) significantly, due to its high quality. This means that how to extract high-quality question patterns is worth studying for the question answering task.
As the performance of our KB-QA system relies heavily on the k-best beam approximation, we evaluate the impact of the beam size k and list the comparison results in Figure 6. We can see that as we increase k incrementally, the accuracy increases at the same time. However, a larger k (e.g., 200) cannot bring significant improvements compared to a smaller one (e.g., 20), while it has a tremendous impact on system efficiency. So we choose k = 20 as the optimal value in the above experiments, which trades off between accuracy and efficiency.
Actually, the size of our system's search space is much smaller than that of the semantic parser used in the baseline. This is due to the fact that, if the triple queries generated by the question translation component cannot derive any answer from the KB, we discard them directly during the QA procedure. We can see that using a small k can achieve better results than the baseline, where the beam size is set to 200.
Since named entity recognizers trained on the Penn TreeBank usually perform poorly on web queries, we instead use a simple string-match method to detect entity mentions in the question using a cleaned entity dictionary dumped from our KB. One problem with doing so is the entity detection issue. For example, in the question who was Esther's husband?, we cannot detect Esther as an entity, as it is just part of an entity name. We need an ad-hoc entity detection component to handle such issues, especially for a web scenario, where users often type entity names in partial or abbreviated forms.
Some questions lack sufficient evidence to detect predicates. where is Byron Nelson 2012? is an example. Since each relation expression must contain at least one content word, this question cannot match any relation expression: except for Byron Nelson and 2012, all the other words are non-content words.
Besides, ambiguous entries contained in the relation expression sets of different predicates can bring mapping errors as well. Take the question who did Steve Spurrier play pro football for? as an example: since the unigram play exists in the relation expression sets of both Film.Film.Actor and American_Football.Player.Current_Team, we made a wrong prediction, which led to wrong answers.
Sometimes, we cannot give exact answers to superlative questions like what is the first book Sherlock Holmes appeared in?. For this example, we can give all the books in which Sherlock Holmes appeared, but we cannot rank them based on their publication dates, as we cannot automatically learn from training data the alignment between the constraint word first occurring in the question and the predicate Book.Written_Work.Date_Of_First_Publication. Although we have followed some work [22, 18] in handling such special linguistic phenomena by defining specific operators, it is still hard to cover all unseen cases. We leave this to future work as an independent topic.
This paper presents a translation-based KB-QA method that integrates semantic parsing and QA in one unified framework. Compared to the baseline system using an independent semantic parser with state-of-the-art performance, we achieve better results on a general domain evaluation set.
Several directions can be further explored in the future: (i) We plan to design a method that can extract question patterns automatically, using existing labeled question patterns and the KB as weak supervision. As discussed in the experiment section, how to mine high-quality question patterns is worth further study for the QA task; (ii) We plan to integrate an ad-hoc NER component into our KB-QA system to alleviate the entity detection issue; (iii) In fact, our proposed QA framework can be generalized to other knowledge sources besides knowledge bases as well: any method that can generate answers to questions, such as a Web-based QA approach, can be integrated into this framework by using it in the question translation component.