Information Extraction over Structured Data:
Question Answering with Freebase

Xuchen Yao    Benjamin Van Durme
Human Language Technology Center of Excellence
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD, USA
Abstract

Answering natural language questions using the Freebase knowledge base has recently been explored as a platform for advancing the state of the art in open domain semantic parsing. Those efforts map questions to sophisticated meaning representations, which are then matched against viable answer candidates in the knowledge base. Here we show that relatively modest information extraction techniques, when paired with a web-scale corpus, can outperform these sophisticated approaches by roughly 34% relative gain.


1 Introduction

Question answering (QA) from a knowledge base (KB) has a long history within natural language processing, going back to the 1960s and 1970s, with systems such as Baseball [16] and Lunar [38]. These systems were limited to closed domains due to a lack of knowledge resources, computing power, and ability to robustly understand natural language. With the recent growth in KBs such as DBPedia [1], Freebase [4] and Yago2 [18], it has become more practical to consider answering questions across wider domains, with commercial systems including Google Now, based on Google’s Knowledge Graph, and Facebook Graph Search, based on social network connections.

The AI community has tended to approach this problem with a focus on first understanding the intent of the question, via shallow or deep forms of semantic parsing (c.f. §3 for a discussion). Typically questions are converted into some meaning representation (e.g., the lambda calculus), then mapped to database queries. Performance is thus bounded by the accuracy of the original semantic parse and the well-formedness of the resultant database queries. (As an example, 50% of the errors of the CCG-backed system [24] were due to parsing or structural matching failures.)

The Information Extraction (IE) community approaches QA differently: first performing relatively coarse information retrieval as a way to triage the set of possible answer candidates, and only then attempting to perform deeper analysis.

Researchers in semantic parsing have recently explored QA over Freebase as a way of moving beyond closed domains such as GeoQuery [35]. While making semantic parsing more robust is a laudable goal, here we provide a more rigorous IE baseline against which those efforts should be compared: we show that “traditional” IE methodology can significantly outperform prior state of the art as reported in the semantic parsing literature, with a relative gain of roughly 34% in F1 compared to Berant et al. (2013).

2 Approach

We will view a KB as an interlinked collection of “topics”. When given a question about one or several topics, we select a “view” of the KB concerning only the involved topics, then inspect every node within a few hops of relations from the topic node in order to extract the answer. We call such a view a topic graph and assume answers can be found within the graph. We aim to maximally automate the answer extraction process by combining, on a massive scale, discriminative features over both the question and the topic graph. With a high-performance learner, a system with millions of features can be trained within hours, and the learned features are intuitive and human-interpretable. For example, we learn that given a question concerning money, such as what money is used in ukraine, the expected answer type is likely currency. We formalize this approach in §4.

One challenge for natural language querying against a KB is the relative informality of queries as compared to the grammar of a KB. For example, for the question who cheated on celebrity A, answers can be retrieved via the Freebase relation celebrity.infidelity.participant, but the connection between the phrase cheated on and the formal KB relation is not explicit. The best attempt so far to alleviate this problem maps ReVerb [10] predicate-argument triples to Freebase relation triples [6, 2]. Note that to boost precision, ReVerb has already pruned away less frequent or less credible triples, so its coverage is smaller than that of its text source, ClueWeb. Here we instead mine relation mappings directly from ClueWeb and show that both direct relation mapping precision and indirect QA F1 improve by a large margin (details in §5).

Finally, we tested our system, jacana-freebase (https://code.google.com/p/jacana), on a realistic dataset generously contributed by Berant et al. (2013), who collected thousands of commonly asked questions by crawling the Google Suggest service. Our method achieves state-of-the-art performance with an F1 of 42.0%, a 34% relative increase over the previous F1 of 31.4%.

3 Background

QA from a KB faces two prominent challenges: model and data. The model challenge involves finding the best meaning representation for the question, converting it into a query, and executing the query on the KB. Most work approaches this through various intermediate representations, including combinatory categorial grammar (Zettlemoyer and Collins, 2005, 2007, 2009; Kwiatkowski et al., 2010, 2011, 2013), synchronous context-free grammars [37], dependency trees [27, 2], string kernels [20, 7], and tree transducers [19]. These approaches have proven effective for QA, despite the fact that most of them require hand-labeled logical annotations. More recent research has begun to minimize this direct supervision by using latent meaning representations [2, 24] or distant supervision [23].

We instead attack the problem of QA from a KB from an IE perspective: we directly learn patterns over QA pairs, represented by the dependency parse of the question and the Freebase structure of answer candidates, without the use of intermediate, general-purpose meaning representations.

The data challenge is more formally framed as ontology or (textual) schema matching [17, 33, 9]: matching the structure of two ontologies/databases or, by extension, mapping between KB relations and natural language text. In terms of the latter, Cai and Yates (2013) and Berant et al. (2013) applied pattern matching and relation intersection between Freebase relations and predicate-argument triples from the ReVerb OpenIE system [10]. Kwiatkowski et al. (2013) expanded their CCG lexicon with Wiktionary word tags toward more domain independence. Fader et al. (2013) learned question paraphrases by aligning multiple questions with the same answers generated by WikiAnswers. The key factor in their success is having a huge text source. Our work pushes the data challenge to the limit by mining directly from ClueWeb, a 5TB collection of web data.

Finally, the KB community has developed other means for QA without semantic parsing [29, 12, 36, 39, 34]. Most of these works execute SPARQL queries on interlinked data represented as RDF (Resource Description Framework) triples, or simply perform triple matching. Heuristics and manual templates are also commonly used [8]. We propose instead to learn discriminative features from the data with shallow question analysis. The final system captures intuitive patterns of QA pairs automatically.

4 Graph Features

Our model is inspired by an intuition about how everyday people search for answers. If you asked someone: what is the name of justin bieber brother (all examples in this paper come from the training data crawled from Google Suggest; they are lowercased and some contain typos), and gave them access to Freebase, that person might first determine that the question is about Justin Bieber (or his brother), go to Justin Bieber’s Freebase page, and search for his brother’s name. Unfortunately Freebase does not contain an exact relation called brother, but instead sibling. Thus further inference (i.e., brother → male sibling) has to be made. In the following we describe how we represent this process.

4.1 Question Graph

In answering our example query a person might take into consideration multiple constraints. With regards to the question, we know we are looking for the name of a person based on the following:

  • the dependency relations nsubj(what, name) and prep_of(name, brother) indicate that the question seeks a name (we use the Stanford collapsed dependency form);

  • the dependency relation prep_of(name, brother) indicates that the name is about a brother (but we do not know whether it is a person name yet);

  • the dependency relation nn(brother, bieber) and the facts that (i) Bieber is a person and (ii) a person’s brother should also be a person indicate that the name is about a person.

This motivates the design of dependency-based features. We show one example in Figure 1(a), left side. The following linguistic information is of interest:

  • question word (qword), such as what/who/how many. We use a list of 9 common qwords: who, when, what, where, how, which, why, whom, whose.

  • question focus (qfocus), a cue of expected answer types, such as name/money/time. We keep our analysis simple and do not use a question classifier, but simply extract the noun dependent of qword as qfocus.

  • question verb (qverb), such as is/play/take, extracted from the main verb of the question. Question verbs are also good hints of answer types. For instance, play is likely to be followed by an instrument, a movie or a sports team.

  • question topic (qtopic). The topic of the question helps us find relevant Freebase pages. We simply apply a named entity recognizer to find the question topic. Note that there can be more than one topic in the question.
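
To make these four question features concrete, the following is a minimal sketch of how they could be extracted from a parsed question. The token and edge structures and the helper logic are our own simplifications, not the system's actual code.

QWORDS = {"who", "when", "what", "where", "how", "which", "why", "whom", "whose"}

def extract_question_features(tokens, edges):
    """Hypothetical sketch: tokens is a list of dicts {"word", "pos", "ner"};
    edges is a list of (relation, head_index, dependent_index) from a dependency parse."""
    feats = {}
    qword_idx = None
    # qword: the first of the 9 common question words found in the question
    for i, t in enumerate(tokens):
        if t["word"].lower() in QWORDS:
            feats["qword"] = t["word"].lower()
            qword_idx = i
            break
    # qfocus: the noun dependent of the qword, e.g. "name" in "what is the name of ..."
    if qword_idx is not None:
        for rel, head, dep in edges:
            if head == qword_idx and tokens[dep]["pos"].startswith("NN"):
                feats["qfocus"] = tokens[dep]["word"].lower()
                break
    # qverb: the main verb of the question (here simply the first verb found)
    for t in tokens:
        if t["pos"].startswith("VB"):
            feats["qverb"] = t["word"].lower()
            break
    # qtopic: every named entity mention (a question may contain more than one topic)
    feats["qtopic"] = [t["word"] for t in tokens if t["ner"] != "O"]
    return feats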

Then we convert the dependency parse into a more generic question graph, in the following steps:

  1. if a node was tagged with a question feature, replace the node with its question feature, e.g., what → qword=what;

  2. (special case) if a qtopic node was tagged as a named entity, replace the node with its named entity form, e.g., bieber → qtopic=person;

  3. drop any leaf node that is a determiner, preposition or punctuation.

The converted graph is shown in Figure 1(a), right side. We call this a question feature graph, with every node and relation a potential feature for this question. Features are then extracted in the following form: with s the source and t the target node of every edge e(s,t) in the graph, we extract s, t, s|t and s|e|t as features. For the edge prep_of(qfocus=name, brother), this yields the following features: qfocus=name, brother, qfocus=name|brother, and qfocus=name|prep_of|brother.
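
As a minimal illustration of this extraction scheme (not the system's actual code), the following sketch emits the four feature strings for every edge of a question feature graph:

def question_graph_features(edges):
    """edges: iterable of (e, s, t) triples, e.g. ("prep_of", "qfocus=name", "brother")."""
    feats = set()
    for e, s, t in edges:
        feats.add(s)                        # source node alone
        feats.add(t)                        # target node alone
        feats.add(s + "|" + t)              # source-target pair
        feats.add(s + "|" + e + "|" + t)    # pair with the edge label
    return feats

# For the single edge prep_of(qfocus=name, brother) this yields:
# {"qfocus=name", "brother", "qfocus=name|brother", "qfocus=name|prep_of|brother"}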

We show by example why these features make sense in §6, Table 6. Furthermore, the reason we keep some lexical features, such as brother, is that we hope to learn from training data a high correlation between brother and certain Freebase relations and properties (such as sibling and male), given that we do not possess an external resource to help us identify such a correlation.

Figure 1: Dependency parse and excerpted Freebase topic graph for the question what is the name of justin bieber brother.
(a) Dependency parse with annotated question features in dashed boxes (left) and the converted feature graph (right), which keeps only relevant and general information about the original question. Note that the parse on the left is the actual (but incorrect) parse.
(b) A view of the Freebase graph around the Justin Bieber topic, with nodes in solid boxes and properties in dashed boxes. The hatched node, Jaxon Bieber, is the answer. Freebase uses a dummy parent node for a list of nodes with the same relation.

4.2 Freebase Topic Graph

Given a topic, we selectively roll out the Freebase graph by choosing the nodes within a few hops of relationship to the topic node, forming a topic graph. Besides incoming and/or outgoing relationships, nodes also have properties: strings that describe attributes of a node, for instance node type, gender or height (for a person). One major difference between relations and properties is that both arguments of a relation are nodes, while only one argument of a property is a node and the other is a string. Arguments of relations are usually interconnected, e.g., London can be the place_of_birth for Justin Bieber, or capital_of the UK. Arguments of properties are attributes that are only “attached” to certain nodes and have no outgoing edges. Figure 1(b) shows an example.

Both the relationships and the properties of a node are important for identifying the answer. They connect the nodes with the question and describe some of their unique characteristics. For instance, without the properties type:person and gender:male, we would not know that the node Jaxon Bieber represents a male person. These properties, along with the sibling relationship to the topic node, are important cues for answering the question. Thus for the Freebase graph, we use relations (with directions) and properties as features for each node.
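
A minimal sketch of this node-side feature extraction is given below; the node representation is our own simplification of a topic graph node, not the Freebase Topic API format, and the relation name is illustrative.

def node_features(node):
    """node: dict with "incoming"/"outgoing" relation name lists and a "properties" dict."""
    feats = set()
    for rel in node.get("incoming", []):
        feats.add("rel_in=" + rel)        # relation plus its direction
    for rel in node.get("outgoing", []):
        feats.add("rel_out=" + rel)
    for key, value in node.get("properties", {}).items():
        feats.add(key + "=" + value)      # e.g. type=person, gender=male
    return feats

# e.g. a simplified answer node from Figure 1(b):
# node_features({"incoming": ["sibling"], "properties": {"type": "person", "gender": "male"}})
# -> {"rel_in=sibling", "type=person", "gender=male"}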

Additionally, we have analyzed how Freebase relations map back to the question. Some of the mappings can simply be detected as paraphrasing or lexical overlap. For example, the person.parents relationship helps answer questions about parenthood. However, most Freebase relations are framed in a way that is not commonly expressed in natural language questions. For instance, for common celebrity gossip questions like who cheated on celebrity A, it is hard for a system to identify the Freebase relation celebrity.infidelity.participant as the target relation if it has not observed this pattern in training.

Thus, assuming there is an alignment model that is able to tell how likely a relation maps to the original question, we add extra alignment-based features for the incoming and outgoing relations of each node. Specifically, for each relation rel in a topic graph, we compute P(rel | question) to rank the relations. Finally, the rank bucket of each relation (e.g., top 1/2/5/10/100 and beyond) is used as a feature instead of the raw probability. We describe such an alignment model in §5.
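
Converting a rank into such a bucketed feature might look like the sketch below; the exact cutoffs and feature-name prefix are illustrative rather than the system's (Table 6, for instance, also shows a top_3 bucket).

def rank_bucket_feature(rank, prefix="incoming_relation_rank"):
    """Map a relation's rank under P(rel | question) to a coarse bucket feature."""
    for cutoff in (1, 2, 5, 10, 100):
        if rank <= cutoff:
            return prefix + "=top_" + str(cutoff)
    return prefix + "=beyond_100"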

4.3 Feature Production

We combine question features and Freebase features (per node) by pairwise concatenation. In this way we hope to capture the association between question patterns and answer nodes. For instance, in a log-linear model setting, we expect to learn a high feature weight for features like:

qfocus=money|node_type=currency

and a very low weight for:

qfocus=money|node_type=person.

This combination greatly enlarges the total number of features, but owing to progress in large-scale machine learning such feature spaces are less of a concern than they once were (concrete numbers in § 6 Model Tuning).
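
As a minimal sketch (assuming the question and node feature sets from the previous sections), the production is a simple cross-product of the two feature sets:

def produce_features(question_feats, node_feats):
    """Pairwise concatenation of question features with a candidate node's features."""
    return {q + "|" + n for q in question_feats for n in node_feats}

# produce_features({"qfocus=money"}, {"node_type=currency", "node_type=person"})
# -> {"qfocus=money|node_type=currency", "qfocus=money|node_type=person"}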

5 Relation Mapping

In this section we describe how a “translation” table between Freebase relations and natural language words was built.

5.1 Formula

The objective is to find the most likely relation a question prompts. For instance, for the question who is the father of King George VI, the most likely relation we look for is people.person.parents. More formally, given a question Q with word vector 𝐰, we want to find the relation R that maximizes the probability P(R | Q).

More interestingly, for the question who is the father of the Periodic Table, the relation that encodes its intended meaning is law.invention.inventor, rather than people.person.parents. This simple example shows that every part of the question can change what the question ultimately asks for, so we need to take every word w in Q into account. Due to the bias and incompleteness of any data source, we approximate the true probability P with P~, estimated under our specific model. For simplicity of computation, we assume conditional independence between words and apply Naive Bayes:

P~(R | Q) ∝ P~(Q | R) P~(R)
          = P~(𝐰 | R) P~(R)
          ≈ ∏_w P~(w | R) P~(R)

where P~(R) is the prior probability of a relation R and P~(w | R) is the conditional probability of word w given R.

It is possible that we do not observe a certain relation R when computing the above equation. In this case we back off to the “sub-relations”: a relation R is a concatenation of a series of sub-relations, R = 𝐫 = r1.r2.r3. For instance, the sub-relations of people.person.parents are people, person, and parents. Again, we assume conditional independence between sub-relations and apply Naive Bayes:

P~_backoff(R | Q) = P~(𝐫 | Q)
                  ≈ ∏_r P~(r | Q)
                  ∝ ∏_r P~(Q | r) P~(r)
                  ≈ ∏_r ∏_w P~(w | r) P~(r)

One other reason that we estimate P~(w | r) and P~(r) for sub-relations is that Freebase relations share common structure. For instance, both people.person.parents and fictional_universe.fictional_character.parents indicate the parent relationship, but the latter is much less commonly annotated. We hope that the shared sub-relation, parents, can help produce a better estimate for the less-annotated relation. Note that the backoff model yields a much smaller value than the original, due to the double product over r and w. In practice we normalize it by the number of sub-relations to keep it on the same scale as P~(R | Q).
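
The following is a minimal sketch of such a relation ranker, assuming the co-occurrence counts of §5.2 are available; the add-one smoothing and the class interface are our own choices, not the paper's implementation.

import math

class RelationRanker:
    """Rank Freebase relations for a question by the Naive Bayes model above,
    backing off to sub-relations (dot-separated parts) for unseen relations."""

    def __init__(self, word_rel_counts, rel_counts, word_sub_counts, sub_counts, vocab_size=20000):
        self.word_rel = word_rel_counts      # {(word, relation): count}
        self.rel = rel_counts                # {relation: count}
        self.word_sub = word_sub_counts      # {(word, sub_relation): count}
        self.sub = sub_counts                # {sub_relation: count}
        self.total_rel = sum(rel_counts.values())
        self.total_sub = sum(sub_counts.values())
        self.vocab = vocab_size

    def _log_p(self, count, total):
        # add-one smoothing (our choice) to avoid log(0)
        return math.log((count + 1.0) / (total + self.vocab))

    def score(self, relation, question_words):
        if relation in self.rel:
            # log P~(R | Q) ∝ log P~(R) + sum_w log P~(w | R)
            s = self._log_p(self.rel[relation], self.total_rel)
            for w in question_words:
                s += self._log_p(self.word_rel.get((w, relation), 0), self.rel[relation])
            return s
        # backoff: sub-relations, normalized by their number to stay on the same scale
        subs = relation.split(".")
        s = 0.0
        for r in subs:
            s += self._log_p(self.sub.get(r, 0), self.total_sub)
            for w in question_words:
                s += self._log_p(self.word_sub.get((w, r), 0), self.sub.get(r, 0))
        return s / len(subs)

    def rank(self, candidate_relations, question_words):
        return sorted(candidate_relations,
                      key=lambda R: self.score(R, question_words), reverse=True)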

Finally, to estimate the prior and conditional probability, we need a massive data collection.

5.2 Steps

The ClueWeb09 dataset (http://lemurproject.org/clueweb09/) is a collection of 1 billion webpages (5TB compressed raw HTML) in 10 languages, collected by Carnegie Mellon University in 2009. FACC1, the Freebase Annotation of the ClueWeb Corpus, version 1 [15], contains the index and offsets of Freebase entities within the English portion of ClueWeb. Out of all 500 million English documents, 340 million were automatically annotated with at least one entity, with an average of 15 entity mentions per document. The precision and recall of the annotation were estimated at 80-85% and 70-85%, respectively [32].

Given these two resources, for each binary Freebase relation we can find a collection of sentences each of which contains both of its arguments, then simply learn how words in these sentences are associated with this relation, i.e., P~(w | R) and P~(w | r). By counting how many times each relation R was annotated, we can estimate P~(R) and P~(r). The learning task can be framed in the following short steps:

  1. We split each HTML document into sentences [21] using NLTK [3] and extracted those containing at least two Freebase entities with at least one directly established relation according to Freebase.

  2. The extraction formed two parallel corpora, one with “relation - sentence” pairs (for estimating P~(w | R) and P~(R)) and the other with “sub-relations - sentence” pairs (for P~(w | r) and P~(r)). Each corpus has 1.2 billion pairs.

  3. The tricky part was aligning these 1.2 billion pairs. Since the relations on one side of these pairs are not natural sentences, we ran the simplest IBM alignment Model 1 [5] to estimate translation probabilities with GIZA++ [30]. To speed this up, the 1.2 billion pairs were split into 100 even chunks. We ran 5 iterations of EM on each chunk and finally aligned the 1.2 billion pairs from both directions. To symmetrize the alignments, the common MT heuristics intersection, union, grow-diag-final, and grow-diag-final-and [22] were separately applied and evaluated later.

  4. Treating the aligned pairs as observations, we computed the co-occurrence matrix between aligned relations and words. There were 10,484 relations and sub-relations in all, and we kept the top 20,000 words.

  5. From the co-occurrence matrix we computed P~(w | R), P~(R), P~(w | r) and P~(r).
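
As a rough sketch of step 1 (not the actual pipeline), the extraction of “relation - sentence” pairs could look like the following; entity_mentions_in and freebase_relations_between stand in for lookups into the FACC1 annotations and Freebase, and are hypothetical helpers.

import nltk  # NLTK's sentence splitter is used in step 1 [21, 3]; assumes the 'punkt' model is available

def extract_relation_sentence_pairs(document_text, entity_mentions_in, freebase_relations_between):
    """Yield (relation, sentence) training pairs for the alignment step."""
    pairs = []
    for sentence in nltk.sent_tokenize(document_text):
        entities = entity_mentions_in(sentence)           # Freebase entities found in this sentence
        if len(entities) < 2:
            continue
        for i, e1 in enumerate(entities):
            for e2 in entities[i + 1:]:
                for relation in freebase_relations_between(e1, e2):
                    pairs.append((relation, sentence))    # one "relation - sentence" pair
    return pairs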

# sentences      0      10     10²    10³    10⁴    >10⁴
% of relations   7.0%   0.7%   1.2%   0.4%   1.3%   89.5%
Table 1: Percentage of answer relations (the incoming relation connected to the answer node) with respect to how many sentences we learned the relation from in CluewebMapping. For instance, the first column says there are 7% of answer relations for which we cannot find a mapping (so we had to use the backoff probability estimation); the last column says there are 89.5% of answer relations for which we were able to learn the mapping between the relation and text from more than 10 thousand relation-sentence pairs. The total number of answer relations is 7886.

Hand-checking the learned probabilities shows successes, failures, and some bias. For instance, for the film.actor.film relation (mapping from film names to actor names), the top words given by P~(w | R) are won, star, among, show. For the film.film.directed_by relation, some important stop words that could indicate this relation, such as by and with, rank directly after director and direct. However, due to significant popular interest in certain news categories, and the resulting catering by websites to those information desires, we also learned, for example, a heavily correlated connection between Jennifer Aniston and celebrity.infidelity.victim, and between some other you-know-who names and celebrity.infidelity.participant.

We next formally evaluate how well the learned mappings help predict relations from words.

5.3 Evaluation

Both ClueWeb and its Freebase annotation have biases. Thus we were first interested in the coverage of the mined relation mappings. As a comparison, we used a relation mapping dataset contributed by Berant et al. (2013) and Lin and Etzioni (2012). The idea is very similar: they intersected Freebase relations with predicates in (arg1, predicate, arg2) triples extracted by ReVerb to learn the mapping between Freebase relations and triple predicates. Note the scale difference: although ReVerb was also extracted from ClueWeb09, there were only 15 million triples to intersect with the relations, while we had 1.2 billion alignment pairs. We call this dataset ReverbMapping and ours CluewebMapping.

The evaluation dataset, WebQuestions, was also contributed by Berant et al. (2013). It contains 3778 training and 2032 test questions collected from the Google Suggest service. All questions were annotated with answers from Freebase. Some questions have more than one answer, such as what to see near sedona arizona?.

We evaluated on the training set in two respects: coverage and prediction performance. We define the answer node as the node that is the answer and the answer relation as the relation from the answer node to its direct parent. Then we computed how often and how well the answer relation was triggered by ReverbMapping and CluewebMapping. Thus for the question who is the father of King George VI, we ask two questions about each mapping: 1. (coverage) does it contain the answer relation people.person.parents? 2. (precision) does it predict the answer relation from the question?

Table 1 shows the coverage of CluewebMapping, which covers 93.0% of all answer relations. Moreover, for 89.5% of all answer relations we were able to learn the mapping from more than 10 thousand relation-sentence pairs each. In contrast, ReverbMapping covers 89.7% of the answer relations.

Next we evaluated prediction performance, using evaluation metrics from information retrieval. For each question, we extracted all relations in its corresponding topic graph and ranked them, checking where the answer relation appears in the ranking. For instance, for the previous example question, we want to rank the relation people.person.parents as number 1. We computed standard MAP (Mean Average Precision) and MRR (Mean Reciprocal Rank), shown in Table 2(a). As a simple baseline, “word overlap” counts the overlap between relations and the question. CluewebMapping ranks each relation by P~(R | Q). ReverbMapping does the same, except that we assumed a uniform distribution on P~(w | R) and P~(R), since the contributed dataset did not include co-occurrence counts to estimate these probabilities. (The way we used ReverbMapping is not how Berant et al. (2013) originally used it: they employed a discriminative log-linear model to judge relations, which might yield better performance.) As a fair comparison, the ranking of CluewebMapping under a uniform distribution is also included in Table 2(a). Note that the median rank from CluewebMapping is only 12, indicating that half of all answer relations are ranked in the top 12.
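
For concreteness, the two metrics can be computed as in the sketch below (standard IR definitions, not code from the paper); each question contributes one ranked relation list and a set of one or more answer relations.

def average_precision(ranked, relevant):
    """Average precision for one ranked list against a set of relevant items."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(len(relevant), 1)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def map_and_mrr(rankings, answer_relation_sets):
    n = len(rankings)
    mean_ap = sum(average_precision(r, a) for r, a in zip(rankings, answer_relation_sets)) / n
    mrr = sum(reciprocal_rank(r, a) for r, a in zip(rankings, answer_relation_sets)) / n
    return mean_ap, mrr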

Table 2(b) further shows the percentage of answer relations with respect to their ranking. CluewebMapping successfully ranked 19% of answer relations as top 1. A sample of these includes person.place_of_birth, location.containedby, country.currency_used, regular_tv_appearance.actor, etc. These percentages are good clues for feature design: for instance, we may be confident in a relation if it is ranked in the top 5 or 10 by CluewebMapping.

To conclude, we found that CluewebMapping provides satisfactory coverage of the 3778 training questions: only 7% of answer relations were missing, despite the biased nature of web data. CluewebMapping also gives reasonably good precision in its predictions, despite the noisy nature of web data. We move on to fully evaluate the final QA F1.

(a) Ranking of answer relations. The best result for CluewebMapping was obtained with the grow-diag-final-and heuristic (row 3) when symmetrizing alignments from both directions. The last row shows the ranking of CluewebMapping under a uniform distribution (assuming word and relation counts are not known).

                      Median Rank    MAP      MRR
word overlap              471        0.0380   0.0590
ReverbMapping              60        0.0691   0.0829
CluewebMapping             12        0.2074   0.2900
  with uniform dist.       61        0.0544   0.0561

(b) Percentage of answer relations w.r.t. rank (header). w.o.: word overlap; R.M.: ReverbMapping; C.M.: CluewebMapping.

        1      5      10     50     100    >100
w.o.    3.5    4.7    2.5    3.9    4.1    81.3
R.M.    2.6    9.1    8.6    26.0   13.0   40.7
C.M.    19.0   19.9   8.9    22.3   7.5    22.4

Table 2: Evaluation on answer relation ranking prediction on 3778 training questions.

6 Experiments

We evaluate the final F1 in this section. The system of comparison is that of Berant et al. (2013).

Data

We re-used WebQuestions, a dataset collected by Berant et al. (2013). It contains 5810 questions crawled from the Google Suggest service, with answers annotated on Amazon Mechanical Turk. All questions contain at least one answer in Freebase. The dataset is split 65%/35% into train-all and test. We further randomly divided train-all 80%/20% into a smaller train set and a development set dev. Note that our dev set is different from that of Berant et al. (2013), but the final result on test is directly comparable. Results are reported in terms of macro F1, with partial credit (following Berant et al. (2013)) if a predicted answer list does not perfectly match all gold answers, since many questions in WebQuestions have more than one answer.

Search

With an Information Retrieval (IR) front-end, we need to locate the exact Freebase topic node a question is about. For this purpose we used the Freebase Search API [13]. All named entities in a question (when no named entities are detected, we fall back to noun phrases) were sent to this API, which returned a ranked list of relevant topics. We also evaluated how well the Search API served the IR purpose. WebQuestions not only has answers annotated, but also the Freebase topic nodes the answers come from. Thus we evaluated the retrieval ranking against the gold standard annotation on train-all, shown in Table 3. The top 2 results of the Search API contain gold standard topics for more than 90% of the questions, and the top 10 results for more than 95%. We took this as a “good enough” IR front-end and used it on test.

Once a topic is obtained we query the Freebase Topic API [14] to retrieve all relevant information, resulting in a topic graph. The API returns almost identical information as displayed via a web browser to a user viewing this topic. Given that turkers annotated answers based on the topic page via a browser, this supports the assumption that the same answer would be located in the topic graph, which is then passed to the QA engine for feature extraction and classification.

top n    1      2      3      5      10
#        3263   3456   3532   3574   3604
%        86.4   91.5   93.5   94.6   95.4
Table 3: Evaluation of the Freebase Search API: how many questions' top n retrieved results contain the gold standard topic. The total number of questions is 3778 (the size of train-all). There were only 5 questions with no retrieved results.

                    P      R      F1
basic               57.3   30.1   39.5
+ word overlap      56.0   31.4   40.2
+ CluewebMapping    59.9   35.4   44.5
+ both              59.0   35.4   44.3
Table 4: F1 on dev with different feature settings.

Model Tuning

We treat QA on Freebase as a binary classification task: for each node in the topic graph, we extract features and judge whether it is the answer node. Every question was processed by the Stanford CoreNLP suite with the caseless model. Then the question features (§4.1) and node features (§4.2) were combined (§4.3) for each node. The learning problem is challenging: for the roughly 3000 questions in train, there are 3 million nodes (1000 nodes per topic graph) and 7 million feature types. We employed a high-performance machine learning tool, Classias [31]. Training usually took around 4 hours. We experimented with various discriminative learners on dev, including logistic regression, perceptron and SVM, and found L1-regularized logistic regression to give the best result. The L1 regularization encourages sparse features by driving feature weights towards zero, which is ideal for the over-generated feature space. After training, we had around 30 thousand features with non-zero weights, a 200-fold reduction from the original feature count.
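
A rough equivalent of this setup is sketched below, using scikit-learn's L1-regularized logistic regression as a stand-in for Classias and feature hashing to keep the multi-million-feature space manageable; the hyperparameters and hash size are illustrative, not the paper's.

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

def train_answer_classifier(node_feature_dicts, labels):
    """node_feature_dicts: one {feature_string: 1.0} dict per candidate node;
    labels: 1 if the node is an answer node, 0 otherwise."""
    hasher = FeatureHasher(n_features=2**22, input_type="dict")
    X = hasher.transform(node_feature_dicts)
    # the L1 penalty drives most of the over-generated feature weights to zero
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, labels)
    return hasher, clf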

We also performed an ablation test on dev to see how additional features capturing the mapping between Freebase relations and the original question help, with three feature settings: 1) “basic” features include the feature productions read off from the feature graph (Figure 1); 2) “+ word overlap” adds features indicating whether sub-relations overlap with the question; and 3) “+ CluewebMapping” adds the ranking of relation predictions given the question according to CluewebMapping. Table 4 shows that the additional CluewebMapping features improved overall F1 by 5 points, a 13% relative improvement: a remarkable gain given that the model already learned a strong correlation between question types and answer types (discussed further with Table 6 later).

Finally, the ratio of positive to negative examples affects the final F1: the more positive examples, the lower the precision and the higher the recall. Under the original setting, this ratio was about 1:275, which produced precision around 60% and recall around 35% (c.f. Table 4). To optimize for F1, we down-sampled the negative examples to 20%, i.e., a new ratio of 1:55. This boosted the final F1 on dev to 48%. We report the final test result under this down-sampled training. In practice the precision/recall balance can be adjusted via the positive/negative ratio.
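
A minimal sketch of that down-sampling step follows; random selection of 20% of the negatives is our assumption of how the sampling was done.

import random

def downsample_negatives(examples, labels, keep_negative=0.2, seed=0):
    """Keep every positive node and a random fraction of the negative nodes."""
    rng = random.Random(seed)
    kept = [(x, y) for x, y in zip(examples, labels)
            if y == 1 or rng.random() < keep_negative]
    xs, ys = zip(*kept)
    return list(xs), list(ys)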

                       P      R      F1
Gold Retrieval         45.4   52.2   48.6
Freebase Search API    38.8   45.8   42.0
Berant et al. (2013)    -      -     31.4
Table 5: F1 on test with Gold Retrieval and the Freebase Search API as the IR front end. Berant et al. (2013) actually reported accuracy on this dataset; however, since their system predicted answers for almost every question (p.c.), roughly precision = recall = F1 = accuracy for them.

Test Results

Table 5 gives the final F1 on test. “Gold Retrieval” always ranks the correct topic node first, a perfect IR front-end assumption. In a more realistic scenario, we had already established that the Freebase Search API returned the correct topic node within its top 10 results 95% of the time (c.f. Table 3), so we also tested on the top 10 results returned by the Search API. To keep things simple, we did not perform answer voting, but simply extracted answers from the first topic node (as ranked by the Search API) with predicted answer(s). The final F1 of 42.0% is a relative improvement of roughly one third over the previous best result of 31.4% [2].

One question of interest is whether our system, aided by massive web data, can be fairly compared to the semantic parsing approaches (note that Berant et al. (2013) also used ClueWeb indirectly through ReVerb). When we removed the word overlap and CluewebMapping-based features, the F1 on test was 36.9%.

The other question of interest is whether our system has acquired some level of “machine intelligence”: how well does it know what the question asks? We discuss this below through feature and error analysis.

Discussion

The combination of question and Freebase node features captures some of the real gist of QA pattern typing, shown in Table 6 with sampled features and weights. Our system learned, for instance, that when the question asks for geographic adjacency information (qverb=border), the correct answer relation to look for is location.adjoins. A detailed comparison with the output of Berant et al. (2013) is a work in progress and will be presented in a follow-up report.

wgt. feature
5.56 qfocus=money|type=Currency
5.35 qverb=die|type=Cause_Of_Death
5.11 qword=when|type=datetime
4.56 qverb=border|rel=location.adjoins
3.90 qword=why|incoming_relation_rank=top_3
2.94 qverb=go|qtopic=location|type=Tourist_attraction
-3.94 qtopic=location|rel=location.imports_exports.date
-2.93 qtopic=person|rel=education.end_date
Table 6: A sample of the top 50 most positive/negative features. Features are productions of question and node features (c.f. Figure 1).

7 Conclusion

We proposed an automatic method for question answering from a structured data source (Freebase). Our approach associates question features with answer patterns described by Freebase and achieves state-of-the-art results on a balanced and realistic QA corpus. To compensate for the problem of domain mismatch or overfitting, we exploited ClueWeb, mined mappings between KB relations and natural language text, and showed that this helps both relation prediction and answer extraction. Our method employs relatively lightweight machinery but achieves good performance. We hope that this result establishes a new baseline against which semantic parsing researchers can measure their progress towards deeper language understanding and answering of human questions.

Acknowledgments

We thank the Allen Institute for Artificial Intelligence for funding this work. We are also grateful to Jonathan Berant, Tom Kwiatkowski, Qingqing Cai, Adam Lopez, Chris Callison-Burch and Peter Clark for helpful discussion and to the reviewers for insightful comments.

References

  • [1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives (2007) DBPedia: a nucleus for a web of open data. The Semantic Web, pp. 722–735.
  • [2] J. Berant, A. Chou, R. Frostig and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs.
  • [3] S. Bird and E. Loper (2004) NLTK: the Natural Language Toolkit.
  • [4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. pp. 1247–1250.
  • [5] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra and R. L. Mercer (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2), pp. 263–311.
  • [6] Q. Cai and A. Yates (2013) Large-scale semantic parsing via schema matching and lexicon extension.
  • [7] D. L. Chen and R. J. Mooney (2011) Learning to interpret natural language navigation instructions from observations. Vol. 2, pp. 1–2.
  • [8] J. Chu-Carroll, J. Fan, B. K. Boguraev, D. Carmel, D. Sheinwald and C. Welty (2012) Finding needles in the haystack: search and candidate generation. IBM Journal of Research and Development.
  • [9] J. Euzenat and P. Shvaiko (2007) Ontology matching. Springer.
  • [10] A. Fader, S. Soderland and O. Etzioni (2011) Identifying relations for open information extraction.
  • [11] A. Fader, L. Zettlemoyer and O. Etzioni (2013) Paraphrase-driven learning for open question answering.
  • [12] A. Frank, H. Krieger, F. Xu, H. Uszkoreit, B. Crysmann, B. Jörg and U. Schäfer (2007) Question answering from structured knowledge sources. Journal of Applied Logic 5 (1), pp. 20–48.
  • [13] Freebase (2013) Website: https://developers.google.com/freebase/v1/search-overview
  • [14] Freebase (2013) Website: https://developers.google.com/freebase/v1/topic-overview
  • [15] E. Gabrilovich, M. Ringgaard and A. Subramanya (2013) FACC1: Freebase annotation of ClueWeb corpora, version 1 (release date 2013-06-26, format version 1, correction level 0). http://lemurproject.org/clueweb09/FACC1/
  • [16] B. F. Green Jr, A. K. Wolf, C. Chomsky and K. Laughery (1961) Baseball: an automatic question-answerer. pp. 219–224.
  • [17] J. R. Hobbs (1985) Ontological promiscuity.
  • [18] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. De Melo and G. Weikum (2011) Yago2: exploring and querying world knowledge in time, space, context, and many languages. pp. 229–232.
  • [19] B. K. Jones, M. Johnson and S. Goldwater (2012) Semantic parsing with Bayesian tree transducers.
  • [20] R. J. Kate and R. J. Mooney (2006) Using string-kernels for learning semantic parsers.
  • [21] T. Kiss and J. Strunk (2006) Unsupervised multilingual sentence boundary detection. Computational Linguistics 32 (4), pp. 485–525.
  • [22] P. Koehn (2010) Statistical machine translation. Cambridge University Press, New York, NY, USA.
  • [23] J. Krishnamurthy and T. M. Mitchell (2012) Weakly supervised training of semantic parsers.
  • [24] T. Kwiatkowski, E. Choi, Y. Artzi and L. Zettlemoyer (2013) Scaling semantic parsers with on-the-fly ontology matching.
  • [25] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater and M. Steedman (2010) Inducing probabilistic CCG grammars from logical form with higher-order unification. pp. 1223–1233.
  • [26] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater and M. Steedman (2011) Lexical generalization in CCG grammar induction for semantic parsing.
  • [27] P. Liang, M. I. Jordan and D. Klein (2011) Learning dependency-based compositional semantics.
  • [28] T. Lin and O. Etzioni (2012) Entity linking at web scale. pp. 84–88.
  • [29] V. Lopez, M. Pasin and E. Motta (2005) AquaLog: an ontology-portable question answering system for the semantic web. The Semantic Web: Research and Applications, pp. 546–562.
  • [30] F. J. Och and H. Ney (2003) A systematic comparison of various statistical alignment models. Computational Linguistics 29 (1), pp. 19–51.
  • [31] N. Okazaki (2009) Classias: a collection of machine-learning algorithms for classification.
  • [32] D. Orr, A. Subramanya, E. Gabrilovich and M. Ringgaard (2013) 11 billion clues in 800 million documents: a web research corpus annotated with Freebase concepts. http://googleresearch.blogspot.com/2013/07/11-billion-clues-in-800-million.html
  • [33] E. Rahm and P. A. Bernstein (2001) A survey of approaches to automatic schema matching. The VLDB Journal 10 (4), pp. 334–350.
  • [34] S. Shekarpour, A. Ngonga Ngomo and S. Auer (2013) Question answering on interlinked data.
  • [35] L. R. Tang and R. J. Mooney (2001) Using multiple clause constructors in inductive logic programming for semantic parsing. Machine Learning: ECML 2001, pp. 466–477.
  • [36] C. Unger, L. Bühmann, J. Lehmann, A. Ngonga Ngomo, D. Gerber and P. Cimiano (2012) Template-based question answering over RDF data.
  • [37] Y. W. Wong and R. J. Mooney (2007) Learning synchronous grammars for semantic parsing with lambda calculus.
  • [38] W. A. Woods (1977) Lunar rocks in natural English: explorations in natural language question answering. Linguistic Structures Processing 5, pp. 521–569.
  • [39] M. Yahya, K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp and G. Weikum (2012) Natural language questions for the web of data.
  • [40] L. S. Zettlemoyer and M. Collins (2005) Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. Uncertainty in Artificial Intelligence (UAI).
  • [41] L. S. Zettlemoyer and M. Collins (2007) Online learning of relaxed CCG grammars for parsing to logical form.
  • [42] L. S. Zettlemoyer and M. Collins (2009) Learning context-dependent mappings from sentences to logical form.