Answering natural language questions using the Freebase knowledge base has recently been explored as a platform for advancing the state of the art in open domain semantic parsing. Those efforts map questions to sophisticated meaning representations that are then matched against viable answer candidates in the knowledge base. Here we show that relatively modest information extraction techniques, when paired with a web-scale corpus, can outperform these sophisticated approaches by roughly 34% relative gain.
Question answering (QA) from a knowledge base (KB) has a long history within natural language processing, going back to the 1960s and 1970s, with systems such as Baseball [16] and Lunar [38]. These systems were limited to closed domains due to a lack of knowledge resources, computing power, and the ability to robustly understand natural language. With the recent growth in KBs such as DBpedia [1], Freebase [4] and Yago2 [18], it has become more practical to consider answering questions across wider domains, with commercial systems including Google Now, based on Google’s Knowledge Graph, and Facebook Graph Search, based on social network connections.
The AI community has tended to approach this problem with a focus on first understanding the intent of the question, via shallow or deep forms of semantic parsing (cf. §3 for a discussion). Typically questions are converted into some meaning representation (e.g., the lambda calculus), then mapped to database queries. Performance is thus bounded by the accuracy of the original semantic parsing and the well-formedness of the resultant database queries. (As an example, a sizable share of the errors of the CCG-backed system [24] were contributed by parsing or structural matching failures.)
The Information Extraction (IE) community approaches QA differently: first performing relatively coarse information retrieval as a way to triage the set of possible answer candidates, and only then attempting to perform deeper analysis.
Researchers in semantic parsing have recently explored QA over Freebase as a way of moving beyond closed domains such as GeoQuery [35]. While making semantic parsing more robust is a laudable goal, here we provide a more rigorous IE baseline against which those efforts should be compared: we show that “traditional” IE methodology can significantly outperform prior state-of-the-art as reported in the semantic parsing literature, with a relative gain of 34% as compared to Berant et al. (2013).
We will view a KB as an interlinked collection of “topics”. When given a question about one or several topics, we select a “view” of the KB concerning only the involved topics, then inspect every related node within a few hops of the topic node in order to extract the answer. We call such a view a topic graph and assume answers can be found within the graph. We aim to maximally automate the answer extraction process by massively combining discriminative features for both the question and the topic graph. With a high performance learner we have found that a system with millions of features can be trained within hours, yielding intuitive, human-interpretable features. For example, we learn that given a question concerning money, such as: what money is used in ukraine, the expected answer type is likely currency. We formalize this approach in §4.
One challenge for natural language querying against a KB is the relative informality of queries as compared to the grammar of a KB. For example, for the question: who cheated on celebrity A, answers can be retrieved via the Freebase relation celebrity.infidelity.participant, but the connection between the phrase cheated on and the formal KB relation is not explicit. To alleviate this problem, the best attempt so far has been to map from ReVerb [10] predicate-argument triples to Freebase relation triples [6, 2]. Note that to boost precision, ReVerb has already pruned away less frequent or less credible triples, yielding less coverage than its text source, ClueWeb. Here we instead mine relation mappings directly from ClueWeb and show that both direct relation mapping precision and downstream QA improve by a large margin. We give details in §5.
Finally, we tested our system, jacana-freebase (https://code.google.com/p/jacana), on a realistic dataset generously contributed by Berant et al. (2013), who collected thousands of commonly asked questions by crawling the Google Suggest service. Our method achieves state-of-the-art performance with an F1 of 42.0%, a roughly 34% relative increase over the previous best of 31.4%.
QA from a KB faces two prominent challenges: model and data. The model challenge involves finding the best meaning representation for the question, converting it into a query, and executing the query on the KB. Most work approaches this via the bridge of various intermediate representations, including combinatory categorial grammar (Zettlemoyer and Collins, 2005, 2007, 2009; Kwiatkowski et al., 2010, 2011, 2013), synchronous context-free grammars [37], dependency trees [27, 2], string kernels [20, 7], and tree transducers [19]. These approaches have proven effective for QA, despite the fact that most of them require hand-labeled logical annotations. More recent research has started to minimize this direct supervision by using latent meaning representations [2, 24] or distant supervision [23].
We instead attack the problem of QA from a KB from an IE perspective: we directly learn patterns of QA pairs, represented by the dependency parses of questions and the Freebase structure of answer candidates, without the use of intermediate, general-purpose meaning representations.
The data challenge is more formally framed as ontology or (textual) schema matching [17, 33, 9]: matching the structure of two ontologies/databases or (by extension) mapping between KB relations and NL text. In terms of the latter, Cai and Yates (2013) and Berant et al. (2013) applied pattern matching and relation intersection between Freebase relations and predicate-argument triples from the ReVerb OpenIE system [10]. Kwiatkowski et al. (2013) expanded their CCG lexicon with Wiktionary word tags towards more domain independence. Fader et al. (2013) learned question paraphrases by aligning multiple questions with the same answers generated by WikiAnswers. The key factor in their success is a huge text source. Our work pushes the data challenge to the limit by mining directly from ClueWeb, a 5TB collection of web data.
Finally, the KB community has developed other means for QA without semantic parsing [29, 12, 36, 39, 34]. Most of these works executed SPARQL queries on interlinked data represented as RDF (Resource Description Framework) triples, or simply performed triple matching. Heuristics and manual templates were also commonly used [8]. We propose instead to learn discriminative features from the data with shallow question analysis. The final system captures intuitive patterns of QA pairs automatically.
Our model is inspired by an intuition on how everyday people search for answers. If you asked someone: what is the name of justin bieber brother (all examples used in this paper come from the training data crawled from Google Suggest; they are lowercased and some contain typos), and gave them access to Freebase, that person might first determine that the question is about Justin Bieber (or his brother), go to Justin Bieber’s Freebase page, and search for his brother’s name. Unfortunately Freebase does not contain an exact relation called brother, but instead sibling. Thus further inference (i.e., brother → male sibling) has to be made. In the following we describe how we represent this process.
In answering our example query a person might take multiple constraints into consideration. With regard to the question, we know we are looking for the name of a person based on the following:
the dependency relations nsubj(what, name) and prep_of(name, brother) indicate that the question seeks the information of a name (we use the Stanford collapsed dependency form);
the dependency relation prep_of(name, brother) indicates that the name is about a brother (but we do not know whether it is a person name yet);
the dependency relation nn(brother, bieber) and the facts that (i) Bieber is a person and (ii) a person’s brother should also be a person indicate that the name is about a person.
This motivates the design of dependency-based features. We show one example in Figure 1(a), left side. The following linguistic information is of interest:
question word (qword), such as what/who/how many. We use a list of 9 common qwords: who, when, what, where, how, which, why, whom, whose.
question focus (qfocus), a cue of expected answer types, such as name/money/time. We keep our analysis simple and do not use a question classifier, but simply extract the noun dependent of qword as qfocus.
question verb (qverb), such as is/play/take, extracted from the main verb of the question. Question verbs are also good hints of answer types. For instance, play is likely to be followed by an instrument, a movie or a sports team.
question topic (qtopic). The topic of the question helps us find relevant Freebase pages. We simply apply a named entity recognizer to find the question topic. Note that there can be more than one topic in the question.
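To make these cues concrete, the following is a minimal Python sketch of how the four question features might be extracted; the input formats (token list, POS tags, dependency triples, NER map) are illustrative stand-ins for the Stanford CoreNLP output we actually use, and the heuristics are deliberately simplified.

```python
QWORDS = {"who", "when", "what", "where", "how", "which", "why", "whom", "whose"}

def analyze_question(tokens, pos_tags, deps, ner_tags):
    """tokens: words of the question; pos_tags: POS tag per token;
    deps: (relation, head_index, dependent_index) triples from a dependency parse;
    ner_tags: {token_index: entity_type} from an NER tagger.
    All input formats here are illustrative stand-ins for CoreNLP output."""
    feats = {"qword": None, "qfocus": None, "qverb": None, "qtopic": []}

    # qword: first token that is a known question word
    qword_idx = next((i for i, t in enumerate(tokens) if t.lower() in QWORDS), None)
    if qword_idx is not None:
        feats["qword"] = tokens[qword_idx].lower()
        # qfocus: the noun dependent of the qword (no question classifier used)
        for rel, head, dep in deps:
            if head == qword_idx and pos_tags[dep].startswith("NN"):
                feats["qfocus"] = tokens[dep].lower()
                break

    # qverb: the main verb of the question (first verb, as a simple heuristic)
    verb_idx = next((i for i, t in enumerate(pos_tags) if t.startswith("VB")), None)
    if verb_idx is not None:
        feats["qverb"] = tokens[verb_idx].lower()

    # qtopic: every named entity is a potential topic (there can be several)
    feats["qtopic"] = [ner_tags[i] for i in sorted(ner_tags)]
    return feats
```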
Then we convert the dependency parse into a more generic question graph, in the following steps:
if a node was tagged with a question feature, then replace this node with its question feature, e.g., what → qword=what;
(special case) if a qtopic node was tagged as a named entity, then replace this node with its named entity form, e.g., bieber → qtopic=person;
drop any leaf node that is a determiner, preposition or punctuation.
The converted graph is shown in Figure 1(a), right side. We call this a question feature graph, with every node and relation a potential feature for this question. Features are then extracted in the following form: with s the source and t the target node of an edge e(s, t), for every edge in the graph we extract s, t, s|t, and s|e|t as features. For the edge prep_of(qfocus=name, brother), this yields the following features: qfocus=name, brother, qfocus=name|brother, and qfocus=name|prep_of|brother.
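A minimal sketch of this edge-based extraction, assuming the question feature graph is given as (edge label, source, target) triples and using “|” as the concatenation separator:

```python
def edge_features(edges):
    """edges: list of (edge_label, source_node, target_node) tuples from the
    question feature graph, e.g. ("prep_of", "qfocus=name", "brother").
    For each edge e(s, t) we emit s, t, s|t and s|e|t as features."""
    feats = set()
    for e, s, t in edges:
        feats.update({s, t, f"{s}|{t}", f"{s}|{e}|{t}"})
    return feats

print(edge_features([("prep_of", "qfocus=name", "brother")]))
# -> {'qfocus=name', 'brother', 'qfocus=name|brother', 'qfocus=name|prep_of|brother'}
# (set order may vary)
```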
We show with examples why these features make sense later in §6, Table 6. Furthermore, the reason that we have kept some lexical features, such as brother, is that we hope to learn from training a high correlation between brother and certain Freebase relations and properties (such as sibling and male), given that we do not possess an external resource that identifies such a correlation.
[Figure 1(a): Dependency parse with annotated question features in dashed boxes (left) and the converted feature graph (right), keeping only relevant and general information about the original question. Note that the left is a real but incorrect parse.]
[Figure 1(b): A view of the Freebase graph on the Justin Bieber topic, with nodes in solid boxes and properties in dashed boxes. The hatched node, Jaxon Bieber, is the answer. Freebase uses a dummy parent node for a list of nodes with the same relation.]
Given a topic, we selectively roll out the Freebase graph by choosing those nodes within a few hops of the topic node, forming a topic graph. Besides incoming and/or outgoing relations, nodes also have properties: strings that describe attributes of a node, for instance node type, gender, or height (for a person). One major difference between relations and properties is that both arguments of a relation are nodes, while only one argument of a property is a node; the other is a string. Arguments of relations are usually interconnected, e.g., London can be the place_of_birth for Justin Bieber, or capital_of the UK. Arguments of properties are attributes that are only “attached” to certain nodes and have no outgoing edges. Figure 1(b) shows an example.
Both the relationships and properties of a node are important for identifying the answer. They connect nodes to the question and describe unique characteristics of the node. For instance, without the properties type:person and gender:male, we would not know that the node Jaxon Bieber represents a male person. These properties, along with the sibling relationship to the topic node, are important cues for answering the question. Thus for the Freebase graph, we use relations (with directions) and properties as features for each node.
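As an illustration, node features could be read off a topic graph roughly as follows; the node container and the relation/property names used in the example are hypothetical simplifications of the actual Freebase data.

```python
from dataclasses import dataclass, field

@dataclass
class TopicGraphNode:
    """Hypothetical container for one Freebase node in a topic graph."""
    incoming: list = field(default_factory=list)    # relations pointing to this node
    outgoing: list = field(default_factory=list)    # relations leaving this node
    properties: dict = field(default_factory=dict)  # e.g. {"type": "person", "gender": "male"}

def node_features(node):
    """Relations (with direction) and properties, used as features per node."""
    feats = set()
    feats.update(f"rel_in={r}" for r in node.incoming)
    feats.update(f"rel_out={r}" for r in node.outgoing)
    feats.update(f"{k}={v}" for k, v in node.properties.items())
    return feats

# Illustrative example loosely following Figure 1(b):
jaxon = TopicGraphNode(incoming=["sibling"],
                       properties={"type": "person", "gender": "male"})
print(node_features(jaxon))
```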
Additionally, we have analyzed how Freebase relations map back to the question. Some of the mappings can simply be detected as paraphrasing or lexical overlap. For example, the person.parents relationship helps answer questions about parenthood. However, most Freebase relations are framed in a way that is not commonly expressed in natural language questions. For instance, for common celebrity gossip questions like who cheated on celebrity A, it is hard for a system to find the Freebase relation celebrity.infidelity.participant as the target relation if it has not observed this pattern in training.
Thus, assuming there is an alignment model that is able to tell how likely a relation maps to the original question, we add extra alignment-based features for the incoming and outgoing relations of each node. Specifically, for each relation $R$ in a topic graph, we compute $\hat{P}(R \mid Q)$ to rank the relations. Finally, the rank of each relation (e.g., top 1/2/5/10/100 and beyond) is used as a feature instead of the raw probability. We describe such an alignment model in §5.
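A small sketch of turning a relation’s rank into such a bucketed feature; the exact bucket boundaries and feature naming here are illustrative (cf. the incoming_relation_rank features in Table 6):

```python
def rank_bucket_feature(rank, direction="incoming"):
    """Convert the alignment-model rank of a relation into a coarse feature.
    `rank` is the 1-based position of the relation when sorted by P(R|Q)."""
    for bound in (1, 2, 3, 5, 10, 100):
        if rank <= bound:
            return f"{direction}_relation_rank=top_{bound}"
    return f"{direction}_relation_rank=beyond_100"

print(rank_bucket_feature(3))   # incoming_relation_rank=top_3
```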
We combine question features and Freebase features (per node) by pairwise concatenation. In this way we hope to capture the association between question patterns and answer nodes. For instance, in a log-linear model setting, we expect to learn a high feature weight for features like:
qfocus=money|node_type=currency
and a very low weight for:
qfocus=money|node_type=person.
This combination greatly enlarges the total number of features, but owing to progress in large-scale machine learning such feature spaces are less of a concern than they once were (concrete numbers in §6, Model Tuning).
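A minimal sketch of the pairwise concatenation, again using “|” as the separator:

```python
def combined_features(question_feats, node_feats):
    """Pairwise concatenation of question features with the Freebase
    features of one candidate node, e.g. 'qfocus=money' x 'node_type=currency'."""
    return {f"{q}|{n}" for q in question_feats for n in node_feats}

print(combined_features({"qfocus=money"},
                        {"node_type=currency", "node_type=person"}))
```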
In this section we describe how a “translation” table between Freebase relations and natural language words was built.
The objective is to find the most likely relation a question prompts. For instance, for the question who is the father of King George VI, the most likely relation we look for is people.person.parents. More formally, given a question $Q$ represented as a vector of words, we want to find the relation $\hat{R}$ that maximizes the probability $P(R \mid Q)$.
More interestingly, for the question who is the father of the Periodic Table, the actual relation that encodes its original meaning is law.invention.inventor, rather than people.person.parents. This simple example shows that every part of the question can change what the question ultimately asks for. Thus we need to account for every word $w$ in $Q$. Due to the bias and incompleteness of any data source, we approximate the true probability $P(R \mid Q)$ with $\hat{P}(R \mid Q)$ under our specific model. For simplicity of computation, we assume conditional independence between words and apply Naive Bayes:

$$\hat{P}(R \mid Q) \propto \hat{P}(R) \prod_{w \in Q} \hat{P}(w \mid R),$$

where $\hat{P}(R)$ is the prior probability of a relation $R$ and $\hat{P}(w \mid R)$ is the conditional probability of word $w$ given $R$.
It is possible that we do not observe a certain relation when computing the above equation. In this case we back off to “sub-relations”: a relation $R$ is a concatenation of a series of sub-relations $r_1.r_2.\ldots.r_k$. For instance, the sub-relations of people.person.parents are people, person, and parents. Again, we assume conditional independence between sub-relations and apply Naive Bayes:

$$\hat{P}(R \mid Q) \approx \prod_{i=1}^{k} \hat{P}(r_i \mid Q) \propto \prod_{i=1}^{k} \hat{P}(r_i) \prod_{w \in Q} \hat{P}(w \mid r_i).$$

One other reason that we estimate $\hat{P}(r)$ and $\hat{P}(w \mid r)$ for sub-relations is that Freebase relations share some common structure. For instance, both people.person.parents and fictional_universe.fictional_character.parents indicate the parent relationship, but the latter is far less commonly annotated. We hope that the shared sub-relation, parents, can help better estimate $\hat{P}(w \mid r)$ for the less commonly annotated relation. Note that the backoff model yields a much smaller value than the original, due to the double product over sub-relations and words. In practice we normalize it by the number of sub-relations to keep it on the same scale as $\hat{P}(R \mid Q)$.
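The following sketch scores a relation under this model with the sub-relation backoff; the count-table layout, the smoothing constant, and the normalization detail are assumptions, not the exact implementation.

```python
import math

def score_relation(relation, question_words, rel_count, rel_word_count,
                   sub_count, sub_word_count, eps=1e-9):
    """Return log of P̂(R)·∏ P̂(w|R); back off to sub-relations if R is unseen.
    rel_count[R] and rel_word_count[R][w] are counts from the mined
    relation-sentence corpus; sub_* are the analogous sub-relation tables."""
    if relation in rel_count:
        total_rel = sum(rel_count.values())
        score = math.log(rel_count[relation] / total_rel)
        for w in question_words:
            p = rel_word_count.get(relation, {}).get(w, 0) / max(rel_count[relation], 1)
            score += math.log(p + eps)
        return score

    # Backoff: treat R as a concatenation of sub-relations r1.r2...rk and
    # normalize by k to keep the score on a comparable scale.
    subs = relation.split(".")
    total_sub = max(sum(sub_count.values()), 1)
    score = 0.0
    for r in subs:
        score += math.log(sub_count.get(r, 0) / total_sub + eps)
        for w in question_words:
            p = sub_word_count.get(r, {}).get(w, 0) / max(sub_count.get(r, 1), 1)
            score += math.log(p + eps)
    return score / len(subs)
```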
Finally, to estimate the prior and conditional probability, we need a massive data collection.
The ClueWeb09 dataset (http://lemurproject.org/clueweb09/) is a collection of 1 billion webpages (5TB of compressed raw HTML) in 10 languages, collected by Carnegie Mellon University in 2009. FACC1, the Freebase Annotation of the ClueWeb Corpus version 1 [15], contains the index and offset of Freebase entities within the English portion of ClueWeb. Of the 500 million English documents, 340 million were automatically annotated with at least one entity, with an average of 15 entity mentions per document. The precision and recall of the annotation were estimated in [32].
Given these two resources, for each binary Freebase relation we can find a collection of sentences each of which contains both of its arguments, then simply learn how words in these sentences are associated with this relation, i.e., $\hat{P}(w \mid R)$ and $\hat{P}(w \mid r)$. By counting how many times each relation was annotated, we can estimate $\hat{P}(R)$ and $\hat{P}(r)$. The learning task can be framed in the following short steps:
The extraction formed two parallel corpora, one with “relation - sentence” pairs (for estimating $\hat{P}(R)$ and $\hat{P}(w \mid R)$) and the other with “sub-relation - sentence” pairs (for $\hat{P}(r)$ and $\hat{P}(w \mid r)$). Each corpus has 1.2 billion pairs.
The tricky part was aligning these 1.2 billion pairs. Since the relations on one side of these pairs are not natural sentences, we ran the simplest IBM alignment model, Model 1 [5], to estimate the translation probability with GIZA++ [30]. To speed this up, the 1.2 billion pairs were split into 100 even chunks; we ran 5 iterations of EM on each and finally aligned the 1.2 billion pairs in both directions. To symmetrize the alignments, the common MT heuristics intersection, union, grow-diag-final, and grow-diag-final-and [22] were separately applied and evaluated later.
Treating the aligned pairs as observations, we computed the co-occurrence matrix between aligned relations and words. There were 10,484 relations and sub-relations in all, and we kept the top 20,000 words.
From the co-occurrence matrix we computed $\hat{P}(w \mid R)$, $\hat{P}(w \mid r)$, $\hat{P}(R)$, and $\hat{P}(r)$.
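A minimal sketch of this estimation step, assuming the aligned pairs have already been reduced to (relation, aligned words) observations; the data structures are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_tables(aligned_pairs):
    """aligned_pairs: iterable of (relation, [aligned words]) observations.
    Returns relative-frequency estimates of P̂(R) and P̂(w|R)."""
    rel_count = Counter()
    co_occur = defaultdict(Counter)
    for relation, words in aligned_pairs:
        rel_count[relation] += 1
        co_occur[relation].update(words)

    total = sum(rel_count.values())
    p_rel = {r: c / total for r, c in rel_count.items()}
    p_word_given_rel = {
        r: {w: c / sum(wc.values()) for w, c in wc.items()}
        for r, wc in co_occur.items()
    }
    return p_rel, p_word_given_rel
```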
[Table 1: Percentage of answer relations by the number of relation-sentence mapping pairs learned from ClueWeb, from no pairs up to more than 10,000 pairs.]
7.0% | 0.7% | 1.2% | 0.4% | 1.3% | 89.5%
Hand-checking the learned probabilities shows successes, failures, and some bias. For instance, for the film.actor.film relation (mapping from film names to actor names), the top words given by $\hat{P}(w \mid R)$ are won, star, among, and show. For the film.film.directed_by relation, some important stopwords that can indicate this relation, such as by and with, rank directly after director and direct. However, due to significant popular interest in certain news categories, and the resultant catering by websites to those information desires, we also learned, for example, a heavily correlated connection between Jennifer Aniston and celebrity.infidelity.victim, and between some other you-know-who names and celebrity.infidelity.participant.
We next formally evaluate how well the learned mapping helps predict relations from words.
Both ClueWeb and its Freebase annotation have biases. Thus we were first interested in the coverage of the mined relation mappings. As a comparison, we used a relation-mapping dataset contributed by Berant et al. (2013) and Lin and Etzioni (2012). The idea is very similar: they intersected Freebase relations with predicates in (arg1, predicate, arg2) triples extracted from ReVerb to learn the mapping between Freebase relations and triple predicates. Note the scale difference: although ReVerb was also extracted from ClueWeb09, there were only 15 million triples to intersect with the relations, while we had 1.2 billion alignment pairs. We call this dataset ReverbMapping and ours CluewebMapping.
The evaluation dataset, WebQuestions, was also contributed by Berant et al. (2013). It contains 3778 training and 2032 test questions collected from the Google Suggest service. All questions were annotated with answers from Freebase. Some questions have more than one answer, such as what to see near sedona arizona?.
We evaluated on the training set in two respects: coverage and prediction performance. We define the answer node as the node that is the answer and the answer relation as the relation from the answer node to its direct parent. Then we computed how often and how well the answer relation was triggered by ReverbMapping and CluewebMapping. Thus for the question who is the father of King George VI, we ask two questions: does the mapping (1) (coverage) contain the answer relation people.person.parents, and (2) (precision) predict the answer relation from the question?
Table 1 shows the coverage of CluewebMapping, which covers roughly 93% of all answer relations. Among these, we were able to learn the rule mapping from more than 10 thousand relation-sentence pairs for 89.5% of all answer relations. In contrast, ReverbMapping covers a smaller share of the answer relations.
Next we evaluated the prediction performance, using standard information retrieval metrics. For each question, we extracted all relations in its corresponding topic graph and ranked them by how likely each is the answer relation. For instance, for the previous example question, we want the relation people.person.parents ranked number 1. We computed standard MAP (Mean Average Precision) and MRR (Mean Reciprocal Rank), shown in Table 2(a). As a simple baseline, “word overlap” counts the overlap between relations and the question. CluewebMapping ranks each relation by $\hat{P}(R \mid Q)$. ReverbMapping does the same, except that we took a uniform distribution on $\hat{P}(R)$ and $\hat{P}(w \mid R)$, since the contributed dataset did not include co-occurrence counts to estimate these probabilities. (Note that this is not how Berant et al. (2013) originally used ReverbMapping: they employed a discriminative log-linear model to judge relations, which might yield better performance.) As a fair comparison, the ranking of CluewebMapping under a uniform distribution is also included in Table 2(a). Note that the median rank from CluewebMapping is only 12, indicating that half of all answer relations are ranked in the top 12.
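For reference, a small sketch of the MAP and MRR computation over the per-question relation rankings (these are the standard IR definitions):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """ranked_lists[i]: relations ranked for question i;
    relevant_sets[i]: the answer relation(s) for question i."""
    rr = []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        rank = next((i + 1 for i, r in enumerate(ranked) if r in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def mean_average_precision(ranked_lists, relevant_sets):
    aps = []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        hits, precisions = 0, []
        for i, r in enumerate(ranked, start=1):
            if r in relevant:
                hits += 1
                precisions.append(hits / i)
        aps.append(sum(precisions) / max(len(relevant), 1))
    return sum(aps) / len(aps)
```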
Table 2(b) further shows the percentage of answer relations with respect to their rank. CluewebMapping successfully ranked 19% of answer relations at top 1. A sample of these includes person.place_of_birth, location.containedby, country.currency_used, regular_tv_appearance.actor, etc. These percentages are good clues for feature design: for instance, we may be confident in a relation if it is ranked in the top 5 or 10 by CluewebMapping.
To conclude, we found that CluewebMapping provides satisfying coverage of the 3778 training questions: only about 7% of answer relations were missing, despite the biased nature of web data. CluewebMapping also gives reasonably good precision in its predictions, despite the noisy nature of web data. We move on to fully evaluate the final QA system.
[Ranking of answer relations. The best result for CluewebMapping was obtained with the grow-diag-final-and heuristic (row 3) when symmetrizing alignments from both directions. The last row shows the ranking of CluewebMapping under a uniform distribution (assuming counts of words and relations are not known).]
 | Median Rank | MAP | MRR
---|---|---|---
word overlap | 471 | 0.0380 | 0.0590 |
ReverbMapping | 60 | 0.0691 | 0.0829 |
CluewebMapping | 12 | 0.2074 | 0.2900 |
with uniform dist. | 61 | 0.0544 | 0.0561 |
[Percentage of answer relations by rank bucket (top 1, 2, 5, 10, 100, and beyond). w.o.: word overlap; R.M.: ReverbMapping; C.M.: CluewebMapping.]
w. o. | 3.5 | 4.7 | 2.5 | 3.9 | 4.1 | 81.3 |
R.M. | 2.6 | 9.1 | 8.6 | 26.0 | 13.0 | 40.7 |
C.M. | 19.0 | 19.9 | 8.9 | 22.3 | 7.5 | 22.4 |
We evaluate the final QA system in this section. The system of comparison is that of Berant et al. (2013).
We re-used WebQuestions, a dataset collected by Berant et al. (2013). It contains questions crawled from the Google Suggest service, with answers annotated on Amazon Mechanical Turk. All questions contain at least one answer from Freebase. This dataset has been split by Berant et al. (2013) into train-all (3778 questions) and test (2032 questions). We further randomly divided train-all into a smaller train set and a development set dev. Note that our dev set is different from that of Berant et al. (2013), but the final result on test is directly comparable. Results are reported in terms of macro F1 with partial credit (following Berant et al. (2013)) if a predicted answer list does not have a perfect match with all gold answers, since many questions in WebQuestions contain more than one answer.
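Concretely, partial credit means computing precision and recall between the predicted and gold answer sets for each question and averaging the resulting F1 scores; a minimal sketch of our reading of this metric:

```python
def macro_f1(predicted_lists, gold_lists):
    """Average per-question F1 between predicted and gold answer sets,
    giving partial credit for partially correct answer lists."""
    f1s = []
    for pred, gold in zip(predicted_lists, gold_lists):
        pred, gold = set(pred), set(gold)
        if not pred or not gold:
            f1s.append(0.0)
            continue
        p = len(pred & gold) / len(pred)
        r = len(pred & gold) / len(gold)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s)
```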
With an Information Retrieval (IR) front-end, we need to locate the exact Freebase topic node a question is about. For this purpose we used the Freebase Search API [13]. All named entities in a question (when no named entities are detected, we fall back to noun phrases) were sent to this API, which returned a ranked list of relevant topics. We also evaluated how well the Search API served the IR purpose. WebQuestions not only has answers annotated, but also the Freebase topic nodes the answers come from. Thus we evaluated the ranking of retrieval against the gold standard annotation on train-all, shown in Table 3. The top 2 results of the Search API contain the gold standard topics for more than 91% of the questions, and the top 10 results for more than 95%. We took this as a “good enough” IR front-end and used it on test.
Once a topic is obtained, we query the Freebase Topic API [14] to retrieve all relevant information, resulting in a topic graph. The API returns nearly the same information that a user viewing this topic would see in a web browser. Given that turkers annotated answers based on the topic page viewed in a browser, this supports the assumption that the same answers can be located in the topic graph, which is then passed to the QA engine for feature extraction and classification.
top | 1 | 2 | 3 | 5 | 10 |
---|---|---|---|---|---|
# | 3263 | 3456 | 3532 | 3574 | 3604 |
% | 86.4 | 91.5 | 93.5 | 94.6 | 95.4 |
 | P | R | F1
basic | 57.3 | 30.1 | 39.5 |
+ word overlap | 56.0 | 31.4 | 40.2 |
+ CluewebMapping | 59.9 | 35.4 | 44.5 |
+both | 59.0 | 35.4 | 44.3 |
We treat QA on Freebase as a binary classification task: for each node in the topic graph, we extract features and judge whether it is the answer node. Every question was processed by the Stanford CoreNLP suite with the caseless model. Then the question features (§4.1) and node features (§4.2) were combined (§4.3) for each node. The learning problem is challenging: over the questions in train there are roughly 3 million nodes in total and 7 million feature types. We employed a high-performance machine learning tool, Classias [31]; training usually took around 4 hours. We experimented with various discriminative learners on dev, including logistic regression, perceptron and SVM, and found L1-regularized logistic regression to give the best result. The L1 regularization encourages sparse features by driving feature weights towards zero, which is ideal for the over-generated feature space. After training, only a small fraction of features retained non-zero weights, a large reduction from the original feature set.
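Classias is the learner we actually use; as an illustrative stand-in with the same L1-regularized logistic regression objective, here is a scikit-learn sketch (the feature-hashing setup and the regularization strength are assumptions, not our configuration):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Each candidate node is a dict of binary features (question x node combinations).
hasher = FeatureHasher(n_features=2**22, input_type="dict")

def train(node_feature_dicts, labels):
    """labels[i] is 1 if node i is an answer node, else 0."""
    X = hasher.transform(node_feature_dicts)
    # L1 penalty drives most feature weights to zero, yielding a sparse model.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, labels)
    return clf
```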
Also, we performed an ablation test on dev of how the additional features on the mapping between Freebase relations and the original questions help, with three feature settings: 1) “basic” features include feature productions read off from the feature graph (Figure 1); 2) “+ word overlap” adds features indicating whether sub-relations overlap with the question; and 3) “+ CluewebMapping” adds the ranking of relation prediction given the question according to CluewebMapping. Table 4 shows that the additional CluewebMapping features improved overall F1 from 39.5 to 44.5, a relative improvement of about 13%: a remarkable gain given that the model had already learned a strong correlation between question types and answer types (explained further in the discussion and Table 6).
Finally, the ratio of positive to negative examples affects the final F1: the more positive examples, the lower the precision and the higher the recall. Under the original setting, negative examples vastly outnumbered positives, producing the precision and recall figures shown in Table 4. To optimize for F1, we down-sampled the negative examples to a more balanced ratio, which boosted the final F1 on dev. We report the final test result under this down-sampled training. In practice the precision/recall balance can be adjusted through the positive/negative ratio.
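A small sketch of the down-sampling step (the target ratio below is a placeholder, not the value used in our experiments):

```python
import random

def downsample_negatives(examples, neg_per_pos=5, seed=0):
    """examples: list of (features, label) pairs with label 1 for answer nodes.
    Keeps all positives and a random subset of negatives."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    keep = min(len(neg), neg_per_pos * len(pos))
    return pos + rng.sample(neg, keep)
```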
 | P | R | F1
Gold Retrieval | 45.4 | 52.2 | 48.6 |
Freebase Search API | 38.8 | 45.8 | 42.0 |
Berant et al. (2013) | - | - | 31.4 |
Table 5 gives the final F1 on test. “Gold Retrieval” always ranked the correct topic node top 1, a perfect IR front-end assumption. In a more realistic scenario, we had already found that the Freebase Search API returned the correct topic node within its top 10 results about 95% of the time (cf. Table 3), thus we also tested on the top 10 results returned by the Search API. To keep things simple, we did not perform answer voting, but simply extracted answers from the first (ranked by the Search API) topic node with predicted answer(s) found. The final F1 of 42.0 represents a relative improvement of roughly one third over the previous best result of 31.4 [2].
One question of interest is whether our system, aided by massive web data, can be fairly compared to the semantic parsing approaches (note that Berant et al. (2013) also used ClueWeb indirectly through ReVerb). Thus we removed the word overlap and CluewebMapping-based features; the resulting F1 on test was lower, mirroring the ablation on dev (Table 4).
The other question of interest is whether our system has acquired some level of “machine intelligence”: how well does it understand what the question asks? We discuss this below through feature and error analysis.
The combination of question features and Freebase node features captures some real gist of QA pattern typing, as shown in Table 6 with sampled features and weights. Our system learned, for instance, that when the question asks for geographic adjacency information (qverb=border), the correct answer relation to look for is location.adjoins. A detailed comparison with the output of Berant et al. (2013) is a work in progress and will be presented in a follow-up report.
wgt. | feature |
5.56 | qfocus=money|type=Currency
5.35 | qverb=die|type=Cause_Of_Death
5.11 | qword=when|type=datetime
4.56 | qverb=border|rel=location.adjoins
3.90 | qword=why|incoming_relation_rank=top_3
2.94 | qverb=go|qtopic=location|type=Tourist_attraction
-3.94 | qtopic=location|rel=location.imports_exports.date
-2.93 | qtopic=person|rel=education.end_date
We proposed an automatic method for question answering from a structured data source (Freebase). Our approach associates question features with answer patterns described by Freebase and has achieved state-of-the-art results on a balanced and realistic QA corpus. To compensate for the problems of domain mismatch and overfitting, we exploited ClueWeb, mined mappings between KB relations and natural language text, and showed that this helps both relation prediction and answer extraction. Our method employs relatively lightweight machinery but has good performance. We hope that this result establishes a new baseline against which semantic parsing researchers can measure their progress towards deeper language understanding and answering of human questions.
We thank the Allen Institute for Artificial Intelligence for funding this work. We are also grateful to Jonathan Berant, Tom Kwiatkowski, Qingqing Cai, Adam Lopez, Chris Callison-Burch and Peter Clark for helpful discussion and to the reviewers for insightful comments.