Semantic Parsing via Paraphrasing

Jonathan Berant
Stanford University
joberant@stanford.edu
&Percy Liang
Stanford University
pliang@cs.stanford.edu
Abstract

A central challenge in semantic parsing is handling the myriad ways in which knowledge base predicates can be expressed. Traditionally, semantic parsers are trained primarily from text paired with knowledge base information. Our goal is to exploit the much larger amounts of raw text not tied to any knowledge base. In this paper, we turn semantic parsing on its head. Given an input utterance, we first use a simple method to deterministically generate a set of candidate logical forms with a canonical realization in natural language for each. Then, we use a paraphrase model to choose the realization that best paraphrases the input, and output the corresponding logical form. We present two simple paraphrase models, an association model and a vector space model, and train them jointly from question-answer pairs. Our system ParaSempre improves state-of-the-art accuracies on two recently released question-answering datasets.


1 Introduction

We consider the semantic parsing problem of mapping natural language utterances into logical forms to be executed on a knowledge base (KB) [35, 36, 34, 20]. Scaling semantic parsers to large knowledge bases has attracted substantial attention recently [2, 1, 19], since it drives applications such as question answering (QA) and information extraction (IE).

Figure 1: Semantic parsing via paraphrasing: For each candidate logical form (in red), we generate canonical utterances (in purple). The model is trained to paraphrase the input utterance (in green) into the canonical utterances associated with the correct denotation (in blue).

Semantic parsers need to somehow associate natural language phrases with logical predicates, e.g., they must learn that the constructions “What does X do for a living?”, “What is X’s profession?”, and “Who is X?”, should all map to the logical predicate Profession. To learn these mappings, traditional semantic parsers use data which pairs natural language with the KB. However, this leaves untapped a vast amount of text not related to the KB. For instance, the utterances “Where is ACL in 2014?” and “What is the location of ACL 2014?” cannot be used in traditional semantic parsing methods, since the KB does not contain an entity ACL2014, but this pair clearly contains valuable linguistic information. As another reference point, out of 500,000 relations extracted by the ReVerb Open IE system [9], only about 10,000 can be aligned to Freebase [1].

In this paper, we present a novel approach for semantic parsing based on paraphrasing that can exploit large amounts of text not covered by the KB (Figure 1). Our approach targets factoid questions with a modest amount of compositionality. Given an input utterance, we first use a simple deterministic procedure to construct a manageable set of candidate logical forms (ideally, we would generate canonical utterances for all possible logical forms, but this is intractable). Next, we heuristically generate canonical utterances for each logical form based on the text descriptions of predicates from the KB. Finally, we choose the canonical utterance that best paraphrases the input utterance, and thereby the logical form that generated it. We use two complementary paraphrase models: an association model based on aligned phrase pairs extracted from a monolingual parallel corpus, and a vector space model, which represents each utterance as a vector and learns a similarity score between them. The entire system is trained jointly from question-answer pairs only.

Figure 2: The main challenge in semantic parsing is coping with the mismatch between language and the KB. (a) Traditionally, semantic parsing maps utterances directly to logical forms. (b) Kwiatkowski et al. (2013) map the utterance to an underspecified logical form, and perform ontology matching to handle the mismatch. (c) We approach the problem in the other direction, generating canonical utterances for logical forms, and use paraphrase models to handle the mismatch.

Our work relates to recent lines of research in semantic parsing and question answering. Kwiatkowski et al. (2013) first map utterances to a domain-independent intermediate logical form, and then perform ontology matching to produce the final logical form. In some sense, we approach the problem from the opposite end, using an intermediate utterance, which allows us to employ paraphrasing methods (Figure 2). Fader et al. (2013) presented a QA system that maps questions onto simple queries against Open IE extractions, by learning paraphrases from a large monolingual parallel corpus and performing a single paraphrasing step. We adopt the idea of using paraphrasing for QA, but suggest a more general paraphrase model and work against a formal KB (Freebase).

We apply our semantic parser on two datasets: WebQuestions [1], which contains 5,810 question-answer pairs with common questions asked by web users; and Free917 [2], which has 917 questions manually authored by annotators. On WebQuestions, we obtain a relative improvement of 12% in accuracy over the state-of-the-art, and on Free917 we match the current best performing system. The source code of our system ParaSempre is released at http://www-nlp.stanford.edu/software/sempre/.

2 Setup

Our task is as follows: Given (i) a knowledge base 𝒦, and (ii) a training set of question-answer pairs {(x_i, y_i)}_{i=1}^n, output a semantic parser that maps new questions x to answers y via latent logical forms z. Let ℰ denote a set of entities (e.g., BillGates), and let 𝒫 denote a set of properties (e.g., PlaceOfBirth). A knowledge base 𝒦 is a set of assertions (e_1, p, e_2) ∈ ℰ × 𝒫 × ℰ (e.g., (BillGates, PlaceOfBirth, Seattle)). We use the Freebase KB [13], which has 41M entities, 19K properties, and 596M assertions.

To query the KB, we use a logical language called simple λ-DCS. In simple λ-DCS, an entity (e.g., Seattle) is a unary predicate (i.e., a subset of ℰ) denoting a singleton set containing that entity. A property (which is a binary predicate) can be joined with a unary predicate; e.g., Founded.Microsoft denotes the entities that are Microsoft founders. The intersection operator allows us to denote the set of Seattle-born Microsoft founders: PlaceOfBirth.Seattle ⊓ Founded.Microsoft. A reverse operator reverses the order of arguments: 𝐑[PlaceOfBirth].BillGates denotes Bill Gates's birthplace (in contrast to PlaceOfBirth.Seattle). Lastly, count(Founded.Microsoft) denotes set cardinality, in this case the number of Microsoft founders. The denotation of a logical form z with respect to a KB 𝒦 is written ⟦z⟧_𝒦. For a formal description of simple λ-DCS, see Liang (2013) and Berant et al. (2013).
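
To make the semantics concrete, the following Python sketch (our illustration, not part of ParaSempre) evaluates the constructs above over a toy KB of (e1, p, e2) assertions; the entity and property names mirror the running example.

```python
# Toy interpreter for the simple lambda-DCS fragment described above (illustrative only).
KB = {
    ("BillGates", "PlaceOfBirth", "Seattle"),
    ("BillGates", "Founded", "Microsoft"),
    ("PaulAllen", "Founded", "Microsoft"),
    ("PaulAllen", "PlaceOfBirth", "Seattle"),
}

def join(p, unary):
    """p.unary: entities e1 with (e1, p, e2) in KB for some e2 in unary."""
    return {e1 for (e1, prop, e2) in KB if prop == p and e2 in unary}

def reverse(p, unary):
    """R[p].unary: entities e2 with (e1, p, e2) in KB for some e1 in unary."""
    return {e2 for (e1, prop, e2) in KB if prop == p and e1 in unary}

founders = join("Founded", {"Microsoft"})                              # Founded.Microsoft
seattle_born_founders = join("PlaceOfBirth", {"Seattle"}) & founders   # intersection
birthplace = reverse("PlaceOfBirth", {"BillGates"})                    # R[PlaceOfBirth].BillGates
print(founders, seattle_born_founders, birthplace, len(founders))      # count(Founded.Microsoft)
```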

3 Model overview

We now present the general framework for semantic parsing via paraphrasing, including the model and the learning algorithm. In Sections 4 and 5, we provide the details of our implementation.

Canonical utterance construction

Given an utterance x and the KB, we construct a set of candidate logical forms 𝒵_x, and then for each z ∈ 𝒵_x generate a small set of canonical natural language utterances 𝒞_z. Our goal at this point is only to generate a manageable set of logical forms containing the correct one, and then generate an appropriate canonical utterance from it. This strategy is feasible in factoid QA where compositionality is low, and so the size of 𝒵_x is limited (Section 4).

Paraphrasing

We score the canonical utterances in 𝒞z with respect to the input utterance x using a paraphrase model, which offers two advantages. First, the paraphrase model is decoupled from the KB, so we can train it from large text corpora. Second, natural language utterances often do not express predicates explicitly, e.g., the question “What is Italy’s money?” expresses the binary predicate CurrencyOf with a possessive construction. Paraphrasing methods are well-suited for handling such text-to-text gaps. Our framework accommodates any paraphrasing method, and in this paper we propose an association model that learns to associate natural language phrases that co-occur frequently in a monolingual parallel corpus, combined with a vector space model, which learns to score the similarity between vector representations of natural language utterances (Section 5).

Model

We define a discriminative log-linear model that places a probability distribution over pairs of logical forms and canonical utterances (c,z), given an utterance x:

$$p_\theta(c,z \mid x) = \frac{\exp\{\phi(x,c,z)^\top\theta\}}{\sum_{z'\in\mathcal{Z}_x,\; c'\in\mathcal{C}_{z'}} \exp\{\phi(x,c',z')^\top\theta\}},$$

where θ ∈ ℝ^b is the vector of parameters to be learned, and ϕ(x, c, z) is a feature vector extracted from the input utterance x, the canonical utterance c, and the logical form z. Note that the candidate logical forms 𝒵_x and their canonical utterances 𝒞_z are constructed during the canonical utterance construction phase.
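
For concreteness, this distribution can be computed as a softmax over feature scores of all candidate pairs, as in the following sketch (ours; feature extraction is abstracted into a stub passed in as `features`):

```python
import math

def candidate_distribution(x, candidates, features, theta):
    """Compute p_theta(c, z | x) over candidate (c, z) pairs (illustrative sketch).

    features(x, c, z) returns a sparse feature dict; theta maps feature names to weights.
    """
    scores = [sum(theta.get(f, 0.0) * v for f, v in features(x, c, z).items())
              for (c, z) in candidates]
    m = max(scores)                                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]
```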

The model score decomposes into two terms:

$$\phi(x,c,z)^\top\theta = \phi_{\mathrm{pr}}(x,c)^\top\theta_{\mathrm{pr}} + \phi_{\mathrm{lf}}(x,z)^\top\theta_{\mathrm{lf}},$$

where the parameters θpr define the paraphrase model (Section 5), which is based on features extracted from text only (the input and canonical utterance). The parameters θlf correspond to semantic parsing features based on the logical form and input utterance, and are briefly described in this section.

Many existing paraphrase models introduce latent variables to describe the derivation of c from x, e.g., with transformations [16, 29] or alignments [14, 6, 3]. However, we opt for a simpler paraphrase model without latent variables in the interest of efficiency.

Logical form features

The parameters θ_lf correspond to the following features adopted from Berant et al. (2013). For a logical form z, we extract the size of its denotation ⟦z⟧_𝒦. We also add all binary predicates in z as features. Moreover, we extract a popularity feature for predicates based on the number of instances they have in 𝒦. For Freebase entities, we extract a popularity feature based on the entity frequency in an entity-linked subset of ReVerb [22]. Lastly, Freebase formulas have types (see Section 4), and we conjoin the type of z with the first word of x, to capture the correlation between a word (e.g., “where”) and the Freebase type (e.g., Location).

Learning

As our training data consists of question-answer pairs (xi,yi), we maximize the log-likelihood of the correct answer. The probability of an answer y is obtained by marginalizing over canonical utterances c and logical forms z whose denotation is y. Formally, our objective function 𝒪(θ) is as follows:

$$\mathcal{O}(\theta) = \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) - \lambda\lVert\theta\rVert_1,$$
$$p_\theta(y \mid x) = \sum_{z \in \mathcal{Z}_x :\, y = \llbracket z\rrbracket_{\mathcal{K}}} \;\sum_{c \in \mathcal{C}_z} p_\theta(c,z \mid x).$$

The strength λ of the L1 regularizer is set based on cross-validation. We optimize the objective by initializing the parameters θ to zero and running AdaGrad [8]. We approximate the set of pairs of logical forms and canonical utterances with a beam of size 2,000.
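
The following sketch (ours, not the released implementation) illustrates one stochastic step on this objective: the gradient of the log marginal likelihood is the difference between feature expectations under the posterior restricted to candidates with the correct denotation and expectations under the full candidate distribution, followed by an AdaGrad update with a simple L1 soft-thresholding step. The helpers `features` and `denotation` are hypothetical stand-ins for feature extraction and executing a logical form against 𝒦.

```python
import math
from collections import defaultdict

def adagrad_l1_step(x, y, candidates, features, denotation, theta, sums_sq,
                    eta=0.1, lam=1e-5):
    """One stochastic update on example (x, y); `candidates` is a beam of (c, z) pairs.

    theta and sums_sq are defaultdict(float); `features` and `denotation` are stand-ins
    for feature extraction and query execution against the KB.
    """
    if not candidates:
        return
    phis = [features(x, c, z) for (c, z) in candidates]
    scores = [sum(theta[f] * v for f, v in phi.items()) for phi in phis]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    Z = sum(p)
    p = [pi / Z for pi in p]
    # Posterior restricted to candidates whose denotation equals the answer y.
    correct = [i for i, (c, z) in enumerate(candidates) if denotation(z) == y]
    if not correct:
        return  # no candidate reaches the gold answer; skip this example
    Zc = sum(p[i] for i in correct)
    q = {i: p[i] / Zc for i in correct}
    # Gradient of log p_theta(y | x): E_q[phi] - E_p[phi].
    grad = defaultdict(float)
    for i, phi in enumerate(phis):
        for f, v in phi.items():
            grad[f] += (q.get(i, 0.0) - p[i]) * v
    # AdaGrad step with L1 soft-thresholding (a simple proximal variant).
    for f, g in grad.items():
        sums_sq[f] += g * g
        step = eta / (1e-6 + math.sqrt(sums_sq[f]))
        w = theta[f] + step * g
        theta[f] = math.copysign(max(abs(w) - step * lam, 0.0), w)
```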

4 Canonical utterance construction

We construct canonical utterances in two steps. Given an input utterance x, we first construct a set of logical forms 𝒵x, and then generate canonical utterances from each z𝒵x. Both steps are performed with a small and simple set of deterministic rules, which suffices for our datasets, as they consist of factoid questions with a modest amount of compositional structure. We describe these rules below for completeness. Due to its soporific effect though, we advise the reader to skim it quickly.

Candidate logical forms

# | Template | Example | Question
1 | p.e | Directed.TopGun | Who directed Top Gun?
2 | p1.p2.e | Employment.EmployerOf.SteveBalmer | Where does Steve Balmer work?
3 | p.(p1.e1 ⊓ p2.e2) | Character.(Actor.BradPitt ⊓ Film.Troy) | Who did Brad Pitt play in Troy?
4 | Type.t ⊓ z | Type.Composer ⊓ SpeakerOf.French | What composers spoke French?
5 | count(z) | count(BoatDesigner.NatHerreshoff) | How many ships were designed by Nat Herreshoff?
Table 1: Logical form templates, where p, p1, p2 are Freebase properties, e, e1, e2 are Freebase entities, t is a Freebase type, and z is a logical form.

We consider logical forms defined by a set of templates, summarized in Table 1. The basic template is a join of a binary and an entity, where the binary can be either a single property p (#1 in the table) or a chain of two properties p1.p2 (#2). To handle events involving multiple arguments (e.g., “Who did Brad Pitt play in Troy?”), we introduce the template p.(p1.e1 ⊓ p2.e2) (#3), where the main event is modified by more than one entity. Logical forms can be further modified by a unary “filter”, e.g., the answer to “What composers spoke French?” is a set of composers, i.e., a subset of all people (#4). Lastly, we handle aggregation formulas for utterances such as “How many teams are in the NCAA?” (#5).

To construct candidate logical forms 𝒵_x for a given utterance x, our strategy is to find an entity in x and grow the logical form from that entity. As we show later, this procedure actually produces a set with better coverage than constructing logical forms recursively from spans of x, as is done in traditional semantic parsing. Specifically, for every span of x, we take at most 10 entities whose Freebase descriptions approximately match the span. Then, we join each entity e with all type-compatible binaries b (entities in Freebase are associated with a set of types, and properties have a type signature (t1, t2); we use these types to compute an expected type t for any logical form z), and add these logical forms to 𝒵_x (#1 and #2).

To construct logical forms with multiple entities (#3) we do the following: for any logical form z = p.p1.e1, where p1 has type signature (t1, *), we look for other entities e2 that were matched in x. Then, we add the logical form p.(p1.e1 ⊓ p2.e2), if there exists a binary p2 with a compatible type signature (t1, t2), where t2 is one of e2's types. For example, for the logical form Character.Actor.BradPitt, if we match the entity Troy in x, we obtain Character.(Actor.BradPitt ⊓ Film.Troy). We further modify logical forms by intersecting with a unary filter (#4): given a formula z with some Freebase type (e.g., People), we look at all Freebase sub-types t (e.g., Composer), and check whether one of their Freebase descriptions (e.g., “composer”) appears in x. If so, we add the formula Type.t ⊓ z to 𝒵_x. Finally, we decide whether to wrap the logical form in an aggregation (#5) by identifying whether x starts with phrases such as “how many” or “number of”.

On WebQuestions, this results in 645 formulas per utterance on average. Clearly, we can increase the expressivity of this step by expanding the template set. For example, we could handle superlative utterances (“What NBA player is tallest?”) by adding a template with an argmax operator.
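
As a rough sketch of templates #1 and #2 (our illustration; the entity matching and type-compatibility helpers are hypothetical stand-ins for the Freebase lexicon described above):

```python
def candidate_logical_forms(x, match_entities, binaries_for_entity):
    """Sketch of templates #1 and #2: grow logical forms from entities matched in x.

    match_entities(span) -> up to 10 Freebase entities whose description matches the span;
    binaries_for_entity(e) -> type-compatible binaries b, each either a single property p
    or a property chain p1.p2 (both hypothetical helpers).
    """
    tokens = x.split()
    Z = set()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            span = " ".join(tokens[i:j])
            for e in match_entities(span):
                for b in binaries_for_entity(e):
                    Z.add(f"{b}.{e}")  # e.g. Directed.TopGun or Employment.EmployerOf.SteveBalmer
    return Z
```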

Utterance generation

Template | Categ. of d(p) | Rule | Example
p.e | NP | WH d(t) has d(e) as NP ? | What election contest has George Bush as winner?
p.e | VP | WH d(t) (AUX) VP d(e) ? | What radio station serves area New-York?
p.e | PP | WH d(t) PP d(e) ? | What beer from region Argentina?
p.e | NP VP | WH d(t) VP the NP d(e) ? | What mass transportation system served the area Berlin?
𝐑[p].e | NP | WH d(t) is the NP of d(e) ? | What location is the place of birth of Elvis Presley?
𝐑[p].e | VP | WH d(t) AUX d(e) VP ? | What film is Brazil featured in?
𝐑[p].e | PP | WH d(t) d(e) PP ? | What destination Spanish steps near travel destination?
𝐑[p].e | NP VP | WH NP is VP by d(e) ? | What structure is designed by Herod?
Table 2: Generation rules for templates of the form p.e and 𝐑[p].e, based on the syntactic category of the property description. Freebase descriptions for the type, entity, and property are denoted by d(t), d(e), and d(p) respectively. The surface form of the auxiliary AUX is determined by the POS tag of the verb inside the VP tree.

While mapping general language utterances to logical forms is hard, we observe that it is much easier to generate a canonical natural language utterance of our choice given a logical form. Table 2 summarizes the rules used to generate canonical utterances from the template p.e. Questions begin with a question word, followed by the Freebase description of the expected answer type (d(t)), and then by Freebase descriptions of the entity (d(e)) and binary (d(p)). To fill in auxiliary verbs, determiners, and prepositions, we parse the description d(p) into one of NP, VP, PP, or NP VP. This determines the generation rule to be used.

Each Freebase property p has an explicit property equivalent to its reverse 𝐑[p] (e.g., ContainedBy and 𝐑[Contains]). For each logical form z, we also generate canonical utterances from the equivalent logical form in which p is replaced by its reverse. Reversed formulas have different generation rules, since entities in these formulas are in the subject position rather than the object position.

We generate the description d(t) from the Freebase description of the type of z (this handles #4). For the template p1.p2.e (#2), we have a similar set of rules, which depends on the syntax of d(p1) and d(p2) and is omitted for brevity. For the template p.(p1.e1 ⊓ p2.e2) (#3), we append a prepositional phrase containing d(e2), e.g., “What character is the character of Brad Pitt in Troy?”. Lastly, we choose the question phrase “How many” for aggregation formulas (#5), and “What” for all other formulas.

We also generate canonical utterances using an alignment lexicon, released by Berant et al. (2013), which maps text phrases to Freebase binary predicates. For a binary predicate b mapped from text phrase d(b), we generate the utterance WH d(t) d(b) d(e) ?. On the WebQuestions dataset, we generate an average of 1,423 canonical utterances c per input utterance x. In Section 6, we show that an even simpler method of generating canonical utterances by concatenating Freebase descriptions hurts accuracy by only a modest amount.
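
The rules of Table 2 can be viewed as a small lookup of surface patterns keyed by template and by the syntactic category of d(p). The following sketch (ours) illustrates this for templates p.e and 𝐑[p].e; auxiliary insertion and the split of an NP VP description into its parts are simplified away.

```python
def generate_canonical(template, category, d_t, d_p, d_e):
    """Sketch of the Table 2 generation rules for templates p.e and R[p].e.

    d_t, d_p, d_e are Freebase descriptions of the answer type, property, and entity;
    `category` is the syntactic category of d_p. Auxiliary choice and the split of an
    'NP VP' description into its NP and VP parts are simplified away in this sketch.
    """
    rules = {
        ("p.e", "NP"):       "What {t} has {e} as {p}?",
        ("p.e", "VP"):       "What {t} {p} {e}?",
        ("p.e", "PP"):       "What {t} {p} {e}?",
        ("p.e", "NP VP"):    "What {t} {p} the {e}?",
        ("R[p].e", "NP"):    "What {t} is the {p} of {e}?",
        ("R[p].e", "VP"):    "What {t} is {e} {p}?",
        ("R[p].e", "PP"):    "What {t} {e} {p}?",
        ("R[p].e", "NP VP"): "What {t} is {p} by {e}?",
    }
    return rules[(template, category)].format(t=d_t, p=d_p, e=d_e)

# Example: generate_canonical("R[p].e", "NP", "location", "place of birth", "Elvis Presley")
# -> "What location is the place of birth of Elvis Presley?"
```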

5 Paraphrasing

Once the candidate set of logical forms paired with canonical utterances is constructed, our problem is reduced to scoring pairs (c,z) based on a paraphrase model. The NLP paraphrase literature is vast and ranges from simple methods employing surface features [32], through vector space models [28], to latent variable models [6, 33, 29].

In this paper, we focus on two paraphrase models that emphasize simplicity and efficiency. This is important since for each question-answer pair, we consider thousands of canonical utterances as potential paraphrases. In contrast, traditional paraphrase detection [7] and Recognizing Textual Entailment (RTE) tasks [4] consider examples consisting of only a single pair of candidate paraphrases.

Our paraphrase model decomposes into an association model and a vector space model:

$$\phi_{\mathrm{pr}}(x,c)^\top\theta_{\mathrm{pr}} = \phi_{\mathrm{as}}(x,c)^\top\theta_{\mathrm{as}} + \phi_{\mathrm{vs}}(x,c)^\top\theta_{\mathrm{vs}}.$$

5.1 Association model

Figure 3: Token associations extracted for a paraphrase pair. Blue and dashed (red and solid) indicate positive (negative) score. Line width is proportional to the absolute value of the score.

The goal of the association model is to determine whether x and c contain phrases that are likely to be paraphrases. Given an utterance x = ⟨x_0, x_1, …, x_{n−1}⟩, we denote by x_{i:j} the span from token i to token j. For each pair of utterances (x, c), we go through all spans of x and c and identify a set of pairs of potential paraphrases (x_{i:j}, c_{i′:j′}), which we call associations. (We will describe how associations are identified shortly.) We then define features on each association; the weighted combination of these features yields a score. In this light, associations can be viewed as soft paraphrase rules. Figure 3 presents examples of associations extracted from a paraphrase pair and visualizes the learned scores. We can see that our model learns a positive score for associating “type” with “genres”, and a negative score for associating “is” with “play”.

We define associations in x and c primarily by looking up phrase pairs in a phrase table constructed using the Paralex corpus [10]. Paralex is a large monolingual parallel corpus, containing 18 million pairs of question paraphrases from wikianswers.com, which were tagged as having the same meaning by users. Paralex is suitable for our needs since it focuses on question paraphrases. For example, the phrase “do for a living” occurs mostly in questions, and we can extract associations for this phrase from Paralex. Paraphrase pairs in Paralex are word-aligned using standard machine translation methods. We use the word alignments to construct a phrase table by applying the consistent phrase pair heuristic [24] to all 5-grams. This results in a phrase table with approximately 1.3 million phrase pairs. We let 𝒜 denote this set of mined candidate associations.
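
For reference, the consistent phrase pair heuristic over a word alignment can be sketched as follows (our simplified rendering, capped at 5-grams as above): a span pair is extracted if at least one alignment link falls inside both spans and no link crosses either span boundary.

```python
def consistent_phrase_pairs(src_len, tgt_len, alignment, max_len=5):
    """Extract phrase pairs consistent with a word alignment (sketch of the heuristic).

    alignment: set of (i, j) links between source token i and target token j.
    A span pair is consistent if at least one link falls inside both spans and no link
    leaves exactly one of the two spans. Span length is capped at 5-grams as in the paper.
    """
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1 + 1, min(i1 + max_len, src_len) + 1):
            for j1 in range(tgt_len):
                for j2 in range(j1 + 1, min(j1 + max_len, tgt_len) + 1):
                    inside = [(i, j) for (i, j) in alignment
                              if i1 <= i < i2 and j1 <= j < j2]
                    crossing = [(i, j) for (i, j) in alignment
                                if (i1 <= i < i2) != (j1 <= j < j2)]
                    if inside and not crossing:
                        pairs.append(((i1, i2), (j1, j2)))
    return pairs
```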

Category | Description
Assoc. | lemma(x_{i:j}) ∧ lemma(c_{i′:j′})
Assoc. | pos(x_{i:j}) ∧ pos(c_{i′:j′})
Assoc. | lemma(x_{i:j}) = lemma(c_{i′:j′})?
Assoc. | pos(x_{i:j}) = pos(c_{i′:j′})?
Assoc. | lemma(x_{i:j}) and lemma(c_{i′:j′}) are synonyms?
Assoc. | lemma(x_{i:j}) and lemma(c_{i′:j′}) are derivations?
Deletions | Deleted lemma and POS tag
Table 3: Full feature set in the association model. x_{i:j} and c_{i′:j′} denote spans from x and c. pos(x_{i:j}) and lemma(x_{i:j}) denote the POS tag sequence and lemma sequence of x_{i:j}.

For a pair (x, c), we also consider as candidate associations the set ℬ (represented implicitly), which contains token pairs (x_i, c_j) such that x_i and c_j share the same lemma, the same POS tag, or are linked through a derivation link on WordNet [11]. This allows us to learn paraphrases for words that appear in our datasets but are not covered by the phrase table, and to handle nominalizations for phrase pairs such as “Who designed the game of life?” and “What game designer is the designer of the game of life?”.

Our model goes over all possible spans of x and c and constructs all possible associations from 𝒜 and ℬ. This results in many poor associations (e.g., “play” and “the”), but as illustrated in Figure 3, we learn weights that discriminate good from bad associations. Table 3 specifies the full set of features. Note that unlike standard paraphrase detection and RTE systems, we use lexicalized features, firing approximately 400,000 features on WebQuestions. By extracting POS features, we obtain soft syntactic rules; e.g., a feature pairing the POS sequence “JJ N” with “N” indicates that omitting adjectives before nouns is possible. Once associations are constructed, we mark tokens in x and c that were not part of any association, and extract deletion features for their lemmas and POS tags. Thus, we learn that deleting pronouns is acceptable, while deleting nouns is not.
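
A simplified sketch of this feature extraction (ours; it collapses ℬ to same-lemma spans and omits WordNet derivation links and deletion features) is given below; `lemma` and `pos` are assumed helpers returning lemma and POS-tag sequences.

```python
def association_features(x_tokens, c_tokens, phrase_table, lemma, pos, max_span=4):
    """Fire association features for spans of x and c (sketch of Table 3).

    phrase_table: set of (source lemma phrase, target lemma phrase) pairs mined from Paralex;
    lemma(tokens) / pos(tokens): lemma and POS-tag sequences (assumed helpers).
    """
    feats = set()

    def spans(tokens):
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_span, len(tokens)) + 1):
                yield tokens[i:j]

    for xs in spans(x_tokens):
        for cs in spans(c_tokens):
            lx, lc = " ".join(lemma(xs)), " ".join(lemma(cs))
            px, pc = " ".join(pos(xs)), " ".join(pos(cs))
            in_table = (lx, lc) in phrase_table
            same_lemma = lx == lc
            if not (in_table or same_lemma):
                continue                      # not a candidate association
            feats.add(f"lemma:{lx}|{lc}")     # lexicalized lemma-pair feature
            feats.add(f"pos:{px}|{pc}")       # soft syntactic rule
            if same_lemma:
                feats.add("equal_lemma")
            if px == pc:
                feats.add("equal_pos")
    return feats
```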

To summarize, the association model links phrases of two utterances in multiple overlapping ways. During training, the model learns which associations are characteristic of paraphrases and which are not.

5.2 Vector space model

The association model relies on having a good set of candidate associations, but mining associations suffers from coverage issues. We now introduce a vector space (VS) model, which assigns a vector representation for each utterance, and learns a scoring function that ranks paraphrase candidates.

We start by constructing vector representations of words. We run the word2vec tool [23] on lower-cased Wikipedia text (1.59 billion tokens), using the CBOW model with a window of 5 and hierarchical softmax. We also experiment with publicly released word embeddings [17], which were trained using both local and global context. Both result in k-dimensional vectors (k = 50). Next, we construct a vector v_x ∈ ℝ^k for each utterance x by simply averaging the vectors of all content words (nouns, verbs, and adjectives) in x.
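
A minimal sketch of this construction (ours), assuming the embeddings are available as a plain word-to-vector mapping:

```python
import numpy as np

CONTENT_TAGS = ("NN", "VB", "JJ")  # nouns, verbs, adjectives (prefix match on PTB tags)

def utterance_vector(tagged_tokens, embeddings, k=50):
    """Average the k-dimensional word vectors of content words (sketch).

    tagged_tokens: list of (word, POS) pairs; embeddings: dict word -> np.array of shape (k,).
    """
    vecs = [embeddings[w] for (w, t) in tagged_tokens
            if t.startswith(CONTENT_TAGS) and w in embeddings]
    if not vecs:
        return np.zeros(k)
    return np.mean(vecs, axis=0)
```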

We can now estimate a paraphrase score for two utterances x and c via a weighted combination of the components of the vector representations:

$$v_x^\top W v_c = \sum_{i,j=1}^{k} w_{ij}\, v_{x,i}\, v_{c,j},$$

where W ∈ ℝ^{k×k} is a parameter matrix. In terms of our earlier notation, we have θ_vs = vec(W) and ϕ_vs(x, c) = vec(v_x v_c^⊤), where vec(·) unrolls a matrix into a vector. In Section 6, we experiment with W equal to the identity matrix, constraining W to be diagonal, and learning a full W matrix.
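
The correspondence between the bilinear score and the feature representation can be checked directly, as in the following sketch (ours; the matrices are random placeholders for learned parameters):

```python
import numpy as np

def vs_features(v_x, v_c):
    """phi_vs(x, c) = vec(v_x v_c^T): the outer product unrolled into a vector."""
    return np.outer(v_x, v_c).ravel()

def vs_score(v_x, v_c, W):
    """Paraphrase score v_x^T W v_c; with theta_vs = vec(W) this equals
    vs_features(v_x, v_c) . theta_vs."""
    return float(v_x @ W @ v_c)

k = 50
v_x, v_c = np.random.rand(k), np.random.rand(k)
W_full = np.random.rand(k, k)          # full matrix (random here, learned in practice)
W_diag = np.diag(np.random.rand(k))    # diagonal variant
W_id = np.eye(k)                       # identity variant: plain dot product
assert np.isclose(vs_score(v_x, v_c, W_full), vs_features(v_x, v_c) @ W_full.ravel())
```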

The VS model can identify correct paraphrases in cases where it is hard to directly associate phrases from x and c. For example, the answer to “Where is made Kia car?” (from WebQuestions), is given by the canonical utterance “What city is Kia motors a headquarters of?”. The association model does not associate “made” and “headquarters”, but the VS model is able to determine that these utterances are semantically related. In other cases, the VS model cannot distinguish correct paraphrases from incorrect ones. For example, the association model identifies that the paraphrase for “What type of music did Richard Wagner Play?” is “What is the musical genres of Richard Wagner?”, by relating phrases such as “type of music” and “musical genres”. The VS model ranks the canonical utterance “What composition has Richard Wagner as lyricist?” higher, as this utterance is also in the music domain. Thus, we combine the two models to benefit from their complementary nature.

In summary, while the association model aligns particular phrases to one another, the vector space model provides a soft vector-based representation for utterances.

6 Empirical evaluation

In this section, we evaluate our system on WebQuestions and Free917. After describing the setup (Section 6.1), we present our main empirical results and analyze the components of the system (Section 6.2).

6.1 Setup

We use the WebQuestions dataset [1], which contains 5,810 question-answer pairs. This dataset was created by crawling questions through the Google Suggest API, and then obtaining answers using Amazon Mechanical Turk. We use the original train-test split, and divide the training set into 3 random 80%–20% splits for development. This dataset is characterized by questions that are commonly asked on the web (and are not necessarily grammatical), such as “What character did Natalie Portman play in Star Wars?” and “What kind of money to take to Bahamas?”.

Dataset | # examples | # word types
Free917 | 917 | 2,036
WebQuestions | 5,810 | 4,525
Table 4: Statistics on WebQuestions and Free917.

The Free917 dataset contains 917 questions, authored by two annotators and annotated with logical forms. This dataset contains questions on rarer topics (for example, “What is the engine in a 2010 Ferrari California?” and “What was the cover price of the X-men Issue 1?”), but the phrasing of questions tends to be more rigid compared to WebQuestions. Table 4 provides some statistics on the two datasets. Following Cai and Yates (2013), we hold out 30% of the data for the final test, and perform 3 random 80%-20% splits of the training set for development. Since we train from question-answer pairs, we collect answers by executing the gold logical forms against Freebase.

We execute λ-DCS queries by converting them into SPARQL and executing them against a copy of Freebase using the Virtuoso database engine. We evaluate our system with accuracy, that is, the proportion of questions we answer correctly. We run all questions through the Stanford CoreNLP pipeline [30, 12, 18].

We tuned the L1 regularization strength, developed features, and ran analysis experiments on the development set (averaging across random splits). On WebQuestions, without L1 regularization, the number of non-zero features was 360K; L1 regularization brings it down to 17K.

6.2 Results

We compare our system to Cai and Yates (2013) (CY13), Berant et al. (2013) (BCFL13), and Kwiatkowski et al. (2013) (KCAZ13). For BCFL13, we obtained results by running Berant et al. (2013)'s system on the datasets using the Sempre package (http://www-nlp.stanford.edu/software/sempre/).

Table 5 presents results on the test set. We achieve a substantial relative improvement of 12% in accuracy on WebQuestions, and match the best results on Free917. Interestingly, our system gets an oracle accuracy of 63% on WebQuestions compared to 48% obtained by BCFL13, where the oracle accuracy is the fraction of questions for which at least one logical form in the candidate set produced by the system is correct. This demonstrates that our method for constructing candidate logical forms is reasonable. To further examine this, we ran BCFL13 on the development set, allowing it to use only predicates from logical forms suggested by our logical form construction step. This improved oracle accuracy on the development set to 64.5%, but accuracy was 32.2%. This shows that the improvement in accuracy should not be attributed only to better logical form generation, but also to the paraphrase model.

System | Free917 | WebQuestions
CY13 | 59.0 | –
BCFL13 | 62.0 | 35.7
KCAZ13 | 68.0 | –
This work | 68.5 | 39.9
Table 5: Results on the test set.
System | Free917 | WebQuestions
Our system | 73.9 | 41.2
–VSM | 71.0 | 40.5
–Association | 52.7 | 35.3
–Paraphrase | 31.8 | 21.3
---
SimpleGen | 73.4 | 40.4
---
Full matrix | 52.7 | 35.3
Diagonal | 50.4 | 30.6
Identity | 50.7 | 30.4
---
Jaccard | 69.7 | 31.3
Edit | 40.8 | 24.8
WDDC06 | 71.0 | 29.8
Table 6: Results for ablations and baselines on the development set.

We now perform more extensive analysis of our system’s components and compare it to various baselines.

Component ablation

We ablate the association model, the VS model, and the entire paraphrase model (using only logical form features). Table 6 shows that our full system obtains the highest accuracy, and that removing the association model results in a much larger degradation compared to removing the VS model.

Utterance generation

Our system generates relatively natural utterances from logical forms using simple rules based on Freebase descriptions (Section 4). We now consider simply concatenating Freebase descriptions. For example, the logical form 𝐑[PlaceOfBirth].ElvisPresley would generate the utterance “What location Elvis Presley place of birth?”. Row SimpleGen in Table 6 demonstrates that we still get good results in this setup. This is expected given that our paraphrase models are not sensitive to the syntactic structure of the generated utterance.

Figure 4: Values of the paraphrase score v_{x_i}^⊤ W v_{c_j} for all content word tokens x_i and c_j, where W is a full matrix, a diagonal matrix, or the identity matrix. We omit scores for the words “czech” and “republic” since they appear in all canonical utterances for this example.

VS model

Our system learns parameters for a full W matrix. We now examine results when learning parameters for a full matrix W, a diagonal matrix W, and when setting W to be the identity matrix. Table 6 (third section) illustrates that learning a full matrix substantially improves accuracy. Figure 4 gives an example for a correct paraphrase pair, where the full matrix model boosts the overall model score. Note that the full matrix assigns a high score for the phrases “official language” and “speak” compared to the simpler models, but other pairs are less interpretable.

Baselines

We also compared our system to the following implemented baselines:

  • Jaccard: We compute the Jaccard score between the tokens of x and c and define ϕpr(x,c) to be this single feature (see the sketch after this list).

  • Edit: We compute the token edit distance between x and c and define ϕpr(x,c) to be this single feature.

  • WDDC06: We re-implement 13 features from Wan et al. (2006), who obtained close to state-of-the-art performance on the Microsoft Research paraphrase corpus (we implement all features that do not require dependency parsing).
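
For completeness, minimal sketches of the single-feature Jaccard and Edit baselines (ours) are given below; WDDC06 is omitted since it re-implements features from Wan et al. (2006).

```python
def jaccard_feature(x_tokens, c_tokens):
    """Single-feature baseline: Jaccard overlap between the token sets of x and c."""
    sx, sc = set(x_tokens), set(c_tokens)
    return len(sx & sc) / len(sx | sc) if sx | sc else 0.0

def edit_feature(x_tokens, c_tokens):
    """Single-feature baseline: token-level Levenshtein edit distance between x and c."""
    n, m = len(x_tokens), len(c_tokens)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x_tokens[i - 1] == c_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[n][m]
```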

Table 6 demonstrates that we improve performance over all baselines. Interestingly, Jaccard and WDDC06 obtain reasonable performance on Free917 but perform much worse on WebQuestions. We surmise this is because questions in Free917 were generated by annotators prompted by Freebase facts, whereas questions in WebQuestions originated independently of Freebase. Thus, word choice in Free917 is often close to the generated Freebase descriptions, allowing simple baselines to perform well.

Error analysis

We sampled examples from the development set to examine the main reasons ParaSempre makes errors. We notice that in many cases the paraphrase model can be further improved. For example, ParaSempre suggests that the best paraphrase for “What company did Henry Ford work for?” is “What written work novel by Henry Ford?” rather than “The employer of Henry Ford”, due to the exact match of the word “work”. Another example is the question “Where is the Nascar hall of fame?”, where ParaSempre suggests that “What hall of fame discipline has Nascar hall of fame as halls of fame?” is the best canonical utterance. This is because our simple model lets the phrase “hall of fame” in the input associate with three occurrences in the canonical utterance. Entity recognition also accounts for many errors, e.g., the entity chosen in “where was the gallipoli campaign waged?” is Gallipoli and not GallipoliCampaign. Lastly, ParaSempre does not handle temporal information, which causes errors in questions like “Where did Harriet Tubman live after the civil war?”.

7 Discussion

In this work, we approach the problem of semantic parsing from a paraphrasing viewpoint. A fundamental motivation and long-standing goal of the paraphrasing and RTE communities has been to cast various semantic applications as paraphrasing/textual entailment [4]. While it has been shown that paraphrasing methods are useful for question answering [15] and relation extraction [27], this is, to the best of our knowledge, the first paper to perform semantic parsing through paraphrasing. Our paraphrase model emphasizes simplicity and efficiency, but the framework is agnostic to the internals of the paraphrase method.

On the semantic parsing side, our work is most related to Kwiatkowski et al. (2013). The main challenge in semantic parsing is coping with the mismatch between language and the KB. In both Kwiatkowski et al. (2013) and this work, an intermediate representation is employed to handle the mismatch, but while they use a logical representation, we opt for a text-based one. Our choice allows us to benefit from the parallel monolingual corpus Paralex and from word vectors trained on Wikipedia. We believe that our approach is particularly suitable for scenarios such as factoid question answering, where the space of logical forms is somewhat constrained and a few generation rules suffice to reduce the problem to paraphrasing.

Our work is also related to Fader et al. (2013), who presented a paraphrase-driven question answering system. One can view this work as a generalization of Fader et al. along three dimensions. First, Fader et al. use a KB over natural language extractions rather than a formal KB and so querying the KB does not require a generation step – they paraphrase questions to KB entries directly. Second, they suggest a particular paraphrasing method that maps a test question to a question for which the answer is already known in a single step. We propose a general paraphrasing framework and instantiate it with two paraphrase models. Lastly, Fader et al. handle queries with only one property and entity whereas we generalize to more types of logical forms.

Since our generated questions are passed to a paraphrase model, we took a very simple approach, mostly ensuring that we preserved the semantics of the utterance without striving for the most fluent realization. Research on generation [5, 26, 31, 25] typically focuses on generating natural utterances for human consumption, where fluency is important.

In conclusion, the main contribution of this paper is a novel approach for semantic parsing based on a simple generation procedure and a paraphrase model. We achieve state-of-the-art results on two recently released datasets. We believe that our approach opens a window of opportunity for learning semantic parsers from raw text not necessarily related to the target KB. With more sophisticated generation and paraphrase, we hope to tackle compositionally richer utterances.

Acknowledgments

We thank Kai Sheng Tai for performing the error analysis. Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government. The second author is supported by a Google Faculty Research Award.

References

  • [1] J. Berant, A. Chou, R. Frostig and P. Liang(2013) Semantic parsing on Freebase from question-answer pairs. Cited by: 1, 1, 1, 2, 3, 4, 6.1, 6.2.
  • [2] Q. Cai and A. Yates(2013) Large-scale semantic parsing via schema matching and lexicon extension. Cited by: 1, 1, 6.1, 6.2.
  • [3] M. Chang, D. Goldwasser, D. Roth and V. Srikumar(2010) Discriminative learning over constrained latent representations. Cited by: 3.
  • [4] I. Dagan, D. Roth, M. Sammons and F. M. Zanzotto(2013) Recognizing textual entailment: models and applications. Morgan and Claypool Publishers. Cited by: 5, 7.
  • [5] R. Dale, S. Geldof and J. Prost(2003) CORAL: using natural language generation for navigational assistance. pp. 35–44. Cited by: 7.
  • [6] D. Das and N. A. Smith(2009) Paraphrase identification as probabilistic quasi-synchronous recognition. pp. 468–476. Cited by: 3, 5.
  • [7] B. Dolan, C. Quirk and C. Brockett(2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. Cited by: 5.
  • [8] J. Duchi, E. Hazan and Y. Singer(2010) Adaptive subgradient methods for online learning and stochastic optimization. Cited by: 3.
  • [9] A. Fader, S. Soderland and O. Etzioni(2011) Identifying relations for open information extraction. Cited by: 1.
  • [10] A. Fader, L. Zettlemoyer and O. Etzioni(2013) Paraphrase-driven learning for open question answering. Cited by: 1, 5.1, 7.
  • [11] C. Fellbaum(1998) WordNet: an electronic lexical database. MIT Press. Cited by: 5.1.
  • [12] J. R. Finkel, T. Grenager and C. Manning(2005) Incorporating non-local information into information extraction systems by Gibbs sampling. pp. 363–370. Cited by: 6.1.
  • [13] Google (2013) Freebase data dumps (2013-06-09). https://developers.google.com/freebase/data. Cited by: 2.
  • [14] A. Haghighi, A. Y. Ng and C. D. Manning(2005) Robust textual inference via graph matching. Cited by: 3.
  • [15] S. Harabagiu and A. Hickl(2006) Methods for using textual entailment in open-domain question answering. Cited by: 7.
  • [16] M. Heilman and N. A. Smith(2010) Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. pp. 1011–1019. Cited by: 3.
  • [17] E. H. Huang, R. Socher, C. D. Manning and A. Y. Ng(2012) Improving word representations via global context and multiple word prototypes. Cited by: 5.2.
  • [18] D. Klein and C. Manning(2003) Accurate unlexicalized parsing. pp. 423–430. Cited by: 6.1.
  • [19] T. Kwiatkowski, E. Choi, Y. Artzi and L. Zettlemoyer(2013) Scaling semantic parsers with on-the-fly ontology matching. Cited by: 2, 1, 1, 6.2, 7.
  • [20] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater and M. Steedman(2010) Inducing probabilistic CCG grammars from logical form with higher-order unification. pp. 1223–1233. Cited by: 1.
  • [21] P. Liang(2013) Lambda dependency-based compositional semantics. Technical report ArXiv. Cited by: 2.
  • [22] T. Lin, Mausam and O. Etzioni(2012) Entity linking at web scale. Cited by: 3.
  • [23] T. Mikolov, K. Chen, G. Corrado and J. Dean (2013) Efficient estimation of word representations in vector space. Technical report ArXiv. Cited by: 5.2.
  • [24] F. J. Och and H. Ney(2004) The alignment template approach to statistical machine translation. Computational Linguistics 30, pp. 417–449. Cited by: 5.1.
  • [25] P. Piwek and K. E. Boyer(2012) Varieties of question generation: introduction to this special issue. Dialogue and Discourse 3, pp. 1–9. Cited by: 7.
  • [26] E. Reiter, S. Sripada, J. Hunter, J. Yu and I. Davy(2005) Choosing words in computer-generated weather forecasts. Artificial Intelligence 167, pp. 137–169. Cited by: 7.
  • [27] L. Romano, M. kouylekov, I. Szpektor, I. Dagan and A. Lavelli(2006) Investigating a generic paraphrase-based approach for relation extraction. Cited by: 7.
  • [28] R. Socher, E. H. Huang, J. Pennin, C. D. Manning and A. Ng(2011) Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. pp. 801–809. Cited by: 5.
  • [29] A. Stern and I. Dagan(2011) A confidence model for syntactically-motivated entailment proofs. pp. 455–462. Cited by: 3, 5.
  • [30] K. Toutanova and C. D. Manning(2003) Feature-rich part-of-speech tagging with a cyclic dependency network. Cited by: 6.1.
  • [31] R. Turner, Y. Sripada and E. Reiter(2009) Generating approximate geographic descriptions. pp. 42–49. Cited by: 7.
  • [32] S. Wan, M. Dras, R. Dale and C. Paris(2006) Using dependency-based features to take the “para-farce” out of paraphrase. Cited by: 6.2, 5.
  • [33] M. Wang and C. D. Manning(2010) Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. pp. 1164–1172. Cited by: 5.
  • [34] Y. W. Wong and R. J. Mooney(2007) Learning synchronous grammars for semantic parsing with lambda calculus. pp. 960–967. Cited by: 1.
  • [35] J. M. Zelle and R. J. Mooney (1996) Learning to parse database queries using inductive logic programming. pp. 1050–1055. Cited by: 1.
  • [36] L. S. Zettlemoyer and M. Collins(2005) Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. pp. 658–666. Cited by: 1.