Dependency-based Compositional Semantics (DCS) is a framework of natural language semantics with easy-to-process structures and strict semantics. In this paper, we equip the DCS framework with logical inference, by defining abstract denotations as an abstraction of the computing process of denotations in the original DCS. An inference engine is built to achieve inference on abstract denotations. Furthermore, we propose a way to generate on-the-fly knowledge in logical inference, by combining our framework with the idea of tree transformation. Experiments on the FraCaS and PASCAL RTE datasets show promising results.
Dependency-based Compositional Semantics (DCS) provides an intuitive way to model the semantics of questions, by using simple dependency-like trees []. It is expressive enough to represent complex natural language queries on a relational database, yet simple enough to be latently learned from question-answer pairs. In this paper, we equip DCS with logical inference, which, from one point of view, is "the best way of testing an NLP system's semantic capacity" [].
It should be noted, however, that a framework primarily designed for question answering is not readily suited for logical inference: answers returned by a query depend on the specific database, whereas implication is independent of any database. For example, the answers to the question "What books are read by students?" should always be a subset of the answers to "What books are ever read by anyone?", no matter how we store the data of students or how many book records there are in our database.
Thus, our first step is to fix a notation that abstracts the calculation process of DCS trees, so as to clarify their meaning without the aid of any existing database. The idea is to borrow a minimal set of operators from relational algebra [], which is already sufficient to formulate the calculation in DCS, and to define abstract denotations as abstractions of the computation of denotations guided by DCS trees. Meanings of sentences can then be represented by primitive relations among abstract denotations. This formulation leaves the simplicity and computability of DCS trees largely unaffected; for example, our semantic calculation for DCS trees parallels the denotation computation in the original DCS.
An inference engine is built to handle inference on abstract denotations. Moreover, to compensate for the lack of background knowledge in practical inference, we combine our framework with the idea of tree transformation [], and propose a way of generating knowledge in logical representation from entailment rules [], which have typically been considered as syntactic rewriting rules.
We test our system on the FraCaS [] and PASCAL RTE [] datasets. The experiments show: (i) a competitive performance on the FraCaS dataset; (ii) a large impact of our automatically generated on-the-fly knowledge in achieving high recall for a logic-based RTE system; and (iii) a result that outperforms state-of-the-art RTE systems on the RTE5 data. Our whole system is publicly released and can be downloaded from http://kmcs.nii.ac.jp/tianran/tifmo/.
In this section we describe the idea of representing natural language semantics by DCS trees, and achieving inference by computing logical relations among the corresponding abstract denotations.
student:

| ARG |
|---|
| Mark |
| John |
| Emily |
| … |

book:

| ARG |
|---|
| A Tale of Two Cities |
| Ulysses |
| … |

read:

| SUBJ | OBJ |
|---|---|
| Mark | New York Times |
| Mary | A Tale of Two Cities |
| John | Ulysses |
| … | … |
DCS trees have been proposed to represent natural language semantics with a structure similar to dependency trees [] (Figure 1). For the sentence "students read books", imagine a database consisting of three tables, namely a set of students, a set of books, and a set of "reading" events (Table 1). The DCS tree in Figure 1 is interpreted as a command for querying these tables, obtaining the "reading" entries whose "SUBJ" field is a student and whose "OBJ" field is a book. The result is a set of entries, which is called a denotation.
DCS trees can be extended to represent linguistic phenomena such as quantification and coreference, with additional markers introducing additional operations on tables. Figure 2 shows an example with the quantifier "every", which is marked on the corresponding edge and interpreted as a division operator (§2.2). Optimistically, we believe DCS can provide a framework of semantic representation with sufficiently wide coverage for real-world texts.
The strict semantics of DCS trees suggests the idea of applying DCS to logical inference. This is not trivial, however, because DCS works under the assumption that a database is explicitly available. Obviously this is unrealistic for logical inference on unrestricted texts, because we cannot prepare a database for everything in the world; this considerably restricts the range of tasks to which DCS can be applied.
Our solution is to redefine DCS trees without the aid of any database, by considering each node of a DCS tree as a content word in a sentence (which may no longer correspond to a table in a specific database), while each edge represents a semantic relation between two words. The labels on the two ends of an edge, such as SUBJ (subject) and OBJ (object), are considered as semantic roles of the corresponding words (the semantic role ARG is specifically defined for denoting a nominal predicate). To formulate the database querying process defined by a DCS tree, we provide formal semantics for DCS trees by employing relational algebra [] to represent the query. As described below, we represent the meanings of sentences with abstract denotations, and logical relations among sentences are computed as relations among their abstract denotations. In this way, we can perform inference over formulas of relational algebra, without computing database entries explicitly.
Abstract denotations are formulas constructed from a minimal set of relational algebra [] operators, which is already sufficient to formulate the database queries defined by DCS trees.
For example, the semantics of “students read books” is given by the abstract denotation:
$read \cap (student_{SUBJ} \times book_{OBJ})$,

where $read$, $student$ and $book$ denote the sets represented by these words respectively, and a subscript $R$ (as in $book_{OBJ}$) represents the set considered as the domain of the semantic role $R$ (e.g. $book_{OBJ}$ is the set of books considered as objects). The operators $\cap$ and $\times$ represent intersection and Cartesian product respectively, both borrowed from relational algebra. It is not hard to see that this abstract denotation denotes the intersection of the "reading" set (as illustrated by the "read" table in Table 1) with the product of the "student" and "book" sets, which results in the same denotation as computed by the DCS tree in Figure 1. The point, however, is that the abstract denotation itself is an algebraic formula that does not depend on any concrete database.
Formally, we introduce the following constants:
$W$: a universal set containing all entities.

Content words: a content word $w$ (e.g. $read$) defines a set representing the word (e.g. the set of "reading" events in Table 1).
In addition we introduce the following functions:
$W_1 \times W_2$: the Cartesian product of two sets.

$W_1 \cap W_2$: the intersection of two sets.

$\pi_r(W)$: projection onto the domain of the semantic role $r$ (e.g. $\pi_{SUBJ}(read)$ is the set of subjects of "reading" events). Generally we admit projections onto multiple semantic roles, denoted by $\pi_S(W)$ where $S$ is a set of semantic roles.

Relabeling: renaming the semantic role of a domain of a set (e.g. the set $student$, whose only role is ARG, relabeled to the role SUBJ is written $student_{SUBJ}$).

$W_1 \div W_2$: the division operator, where $W_1 \div W_2$ is defined as the largest set $W$ which satisfies $W_2 \times W \subseteq W_1$. (If $W_1$ and $W_2$ have the same dimension, $W_1 \div W_2$ is either $\emptyset$ or the zero-dimension point set, depending on whether $W_2 \subseteq W_1$.) This is used to formulate universal quantifiers, as in "Mary loves every dog" and "books read by all students".
An abstract denotation is then defined by finitely many applications of these functions to constants or other abstract denotations.
As the semantics of DCS trees is formulated by abstract denotations, the meanings of declarative sentences are represented by statements on abstract denotations. Statements are declarations of some relations among abstract denotations, for which we consider the following set relations:
Non-emptiness $W \neq \emptyset$: the set $W$ is not empty.

Subsumption $W_1 \subseteq W_2$: the set $W_1$ is subsumed by $W_2$. (Using the division operator, subsumption can be represented by non-emptiness, since for sets $A$, $B$ of the same dimension, $A \subseteq B$ holds if and only if $B \div A \neq \emptyset$.)
Roughly speaking, these relations correspond to the logical concepts of satisfiability and entailment, respectively.
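To make the construction concrete, the following is a minimal Python sketch of how abstract denotations could be encoded as terms and statements as relations over them; the class names and the encoding itself are our illustration, not the data structures of the released system.

```python
from dataclasses import dataclass
from typing import Tuple

# Abstract denotations encoded as terms: constants (the universal set W and
# the sets denoted by content words) plus the operators borrowed from
# relational algebra. Statements are relations declared over these terms.

@dataclass(frozen=True)
class Universe:              # the universal set W
    pass

@dataclass(frozen=True)
class Word:                  # the set denoted by a content word, e.g. Word("read")
    lemma: str

@dataclass(frozen=True)
class Product:               # Cartesian product W1 x W2
    left: object
    right: object

@dataclass(frozen=True)
class Intersect:             # intersection W1 ∩ W2
    left: object
    right: object

@dataclass(frozen=True)
class Project:               # projection onto semantic roles, e.g. pi_OBJ(W)
    arg: object
    roles: Tuple[str, ...]

@dataclass(frozen=True)
class Relabel:               # a set considered as the domain of a semantic role
    arg: object
    role: str

@dataclass(frozen=True)
class Divide:                # the division operator W1 ÷ W2
    left: object
    right: object

@dataclass(frozen=True)
class NonEmpty:              # statement: W ≠ ∅
    arg: object

@dataclass(frozen=True)
class Subsume:               # statement: W1 ⊆ W2
    sub: object
    sup: object

# "students read books":  read ∩ (student_SUBJ × book_OBJ) ≠ ∅
students_read_books = NonEmpty(
    Intersect(Word("read"),
              Product(Relabel(Word("student"), "SUBJ"),
                      Relabel(Word("book"), "OBJ"))))
```

In such an encoding, the statement of a sentence is simply a value built from constructors, so manipulating it never requires enumerating database entries.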
|  | example phrase | abstract denotation / statement |
|---|---|---|
| compound noun | pet fish | |
| modification | nice day | |
| temporal relation | boys study at night | |
| relative clause | books that students read | |
| quantification | all men die | |
| hypernym | | |
| derivation | all criminals commit a crime | |
| antonym | | |
| negation | no dogs are hurt | |
Based on abstract denotations, we briefly describe our process for applying DCS to textual inference.
To obtain DCS trees from natural language, we use Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) for dependency parsing [], and convert Stanford dependencies to DCS trees by pattern matching on POS tags and dependency labels. (In [], DCS trees are learned from QA pairs and database entries; we obtain DCS trees from dependency trees to bypass the need for a concrete database.) Currently we use the following semantic roles: ARG, SUBJ, OBJ, IOBJ, TIME and MOD. The semantic role MOD is used for any restrictive modifier. Determiners such as "all", "every" and "each" trigger quantifiers, as shown in Figure 2.
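To give a flavor of this rule-based conversion, the sketch below shows one plausible form such pattern matching could take; the dependency-label table, the QUANTIFIER_TRIGGERS set, and the edge_label helper are hypothetical simplifications rather than the converter's actual rules.

```python
# Hypothetical mapping from Stanford dependency labels to semantic roles; the
# actual converter matches patterns over both POS tags and dependency labels,
# and handles many more constructions (passives, copulas, relative clauses, ...).
DEP_TO_ROLE = {
    "nsubj": "SUBJ",
    "nsubjpass": "OBJ",      # the passive subject fills the OBJ role
    "dobj": "OBJ",
    "iobj": "IOBJ",
    "tmod": "TIME",
    "amod": "MOD",
    "nmod": "MOD",
}

QUANTIFIER_TRIGGERS = {"all", "every", "each"}   # determiners that mark an edge

def edge_label(dep_label: str) -> tuple:
    """Return a (parent role, child role) pair for a dependency edge; in the
    simplest case a nominal child is attached through its ARG role."""
    return (DEP_TO_ROLE.get(dep_label, "MOD"), "ARG")
```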
(Figure 3: derivation of the statement of H from the statements of T, the lexical knowledge, and the axioms in Table 3.)
A DCS tree is defined as a rooted tree in which each node is labeled with a content word and each edge $(x, y)$ is labeled with a pair of semantic roles $(r, r')$ (the definition differs slightly from the original, mainly for the sake of simplicity and clarity). Here $x$ is the node nearer to the root. Furthermore, each edge can optionally be assigned a quantification marker.
The abstract denotation of a DCS tree can be calculated in a bottom-up manner. For example, the abstract denotation of H in Figure 2 is calculated starting from the leaf node Mary, and then:
Node love (Mary loves): $love \cap (Mary_{SUBJ} \times W_{OBJ})$

Node animal (animal that Mary loves): $animal \cap \pi_{OBJ}(love \cap (Mary_{SUBJ} \times W_{OBJ}))_{ARG}$

Node have (Tom has an animal that Mary loves): $have \cap (Tom_{SUBJ} \times (animal \cap \pi_{OBJ}(love \cap (Mary_{SUBJ} \times W_{OBJ}))_{ARG})_{OBJ})$.
Formally, suppose the root $x$ of a DCS tree $T$ is labeled with content word $w$ and has children $x_1, \ldots, x_d$, with the edges $(x, x_1), \ldots, (x, x_d)$ labeled by $(r_1, r'_1), \ldots, (r_d, r'_d)$, respectively. The abstract denotation of $T$ is defined as:

$\tau(T) = w \cap \bigl( \pi_{r'_1}(\tau(T_1))_{r_1} \times \cdots \times \pi_{r'_d}(\tau(T_d))_{r_d} \times W_{IR(w) \setminus \{r_1, \ldots, r_d\}} \bigr)$

where $T_i$ is the subtree of $T$ rooted at $x_i$, $IR(w)$ is the set of possible semantic roles for content word $w$ (e.g. $IR(read) = \{SUBJ, OBJ\}$), and $W_S$ is the product of copies of $W$ whose dimension is $|S|$ (e.g. $W_{\{SUBJ, OBJ\}} = W_{SUBJ} \times W_{OBJ}$).
When universal quantifiers are involved, we need to add division operators to the formula. If the edge $(x, x_i)$ is assigned a quantification marker (multiple quantifiers can be processed similarly), then the abstract denotation is

$\tau(T) = \tau(T \setminus T_i) \div \pi_{r'_i}(\tau(T_i))_{r_i}$

(the result depends on the order of the children $x_1, \ldots, x_d$; different orders correspond to readings of different quantifier scopes), where $T \setminus T_i$ is the same tree as $T$ except that the edge $(x, x_i)$ is removed. For example, the abstract denotation of the first sentence of T in Figure 2 (Mary loves every dog) is calculated from $love \cap (Mary_{SUBJ} \times W_{OBJ})$ (Mary loves) as $(love \cap (Mary_{SUBJ} \times W_{OBJ})) \div dog_{OBJ}$.
After the abstract denotation $\tau(T)$ is calculated, the statement representing the meaning of the sentence is defined as $\tau(T) \neq \emptyset$. For example, the statement of "students read books" is $read \cap (student_{SUBJ} \times book_{OBJ}) \neq \emptyset$, and the statement of "Mary loves every dog" is $(love \cap (Mary_{SUBJ} \times W_{OBJ})) \div dog_{OBJ} \neq \emptyset$, which is logically equivalent to $dog_{OBJ} \subseteq \pi_{OBJ}(love \cap (Mary_{SUBJ} \times W_{OBJ}))$ (by the properties of the division operator noted above).
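The bottom-up calculation can be sketched as follows; this is a simplified illustration in which the Node and Edge classes, the rendering of denotations as formula strings, and the treatment of the quantification marker are our own, and scope ambiguity from multiple quantifiers is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Edge:
    parent_role: str          # r  (role on the side of the parent node)
    child_role: str           # r' (role on the side of the child node)
    child: "Node"
    universal: bool = False   # quantification marker on this edge

@dataclass
class Node:
    word: str
    roles: Tuple[str, ...]    # IR(word), the possible semantic roles
    children: List[Edge] = field(default_factory=list)

def denotation(node: Node) -> str:
    """Render the abstract denotation of a DCS tree as a formula string."""
    if not node.children:
        return node.word
    factors, divisors, filled = [], [], set()
    for e in node.children:
        child = f"pi_{e.child_role}({denotation(e.child)})_{e.parent_role}"
        if e.universal:
            divisors.append(child)            # contributes a division operator
        else:
            factors.append(child)
            filled.add(e.parent_role)
    # roles not filled by a (non-universal) child contribute a factor W_r
    factors += [f"W_{r}" for r in node.roles if r not in filled]
    result = f"({node.word} ∩ ({' × '.join(factors)}))"
    for d in divisors:
        result = f"({result} ÷ {d})"
    return result

# "Mary loves every dog"; the statement of the sentence is denotation ≠ ∅.
tree = Node("love", ("SUBJ", "OBJ"),
            [Edge("SUBJ", "ARG", Node("Mary", ("ARG",))),
             Edge("OBJ", "ARG", Node("dog", ("ARG",)), universal=True)])
print(denotation(tree) + " ≠ ∅")
# ((love ∩ (pi_ARG(Mary)_SUBJ × W_OBJ)) ÷ pi_ARG(dog)_OBJ) ≠ ∅
```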
(Table 3: selected axioms on abstract denotations implemented in the inference engine.)
Since the meanings of sentences are represented by statements on abstract denotations, logical inference among sentences reduces to deriving new relations among abstract denotations. This is done by applying axioms to known statements; approximately 30 axioms are implemented (Table 3). These are algebraic properties of abstract denotations, among which we choose a set that can be handled efficiently and enables the most common types of inference seen in natural language.
For the example in Figure 2, by constructing the following abstract denotations:
Tom has a dog: $have \cap (Tom_{SUBJ} \times dog_{OBJ})$

Objects that Tom has: $\pi_{OBJ}(have \cap (Tom_{SUBJ} \times W_{OBJ}))$,

we can use the lexical knowledge $dog \subseteq animal$, the statements of T (i.e. the non-emptiness of the abstract denotations of "Mary loves every dog" and "Tom has a dog"), and the axioms in Table 3 (algebraic identities are also counted as axioms) to prove the statement of H, i.e. the non-emptiness of the abstract denotation of H calculated above (Figure 3).
We built an inference engine to perform logical inference on abstract denotations as above. In this logical system, we treat abstract denotations as terms and statements as atomic sentences, which are far easier to handle than first-order predicate logic (FOL) formulas. Furthermore, all implemented axioms are horn clauses, hence we can employ forward-chaining, which is very efficient.
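As an illustration of why this is efficient, here is a toy forward-chaining loop over ground statements with two horn-clause axioms (transitivity of subsumption and upward propagation of non-emptiness); the string encoding of denotations and the rule set are illustrative only, not the actual 30-axiom engine.

```python
# Facts are ground tuples over term identifiers; rules are horn clauses
# (premises, conclusion) with variables written as "?x".

RULES = [
    # transitivity of subsumption: A ⊆ B, B ⊆ C  =>  A ⊆ C
    ([("subsume", "?a", "?b"), ("subsume", "?b", "?c")], ("subsume", "?a", "?c")),
    # non-emptiness propagates upward: A ≠ ∅, A ⊆ B  =>  B ≠ ∅
    ([("nonempty", "?a"), ("subsume", "?a", "?b")], ("nonempty", "?b")),
]

def match(pattern, fact, binding):
    """Unify a rule pattern with a ground fact; return the extended variable
    binding, or None if they do not match."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if binding.get(p, f) != f:
                return None
            binding[p] = f
        elif p != f:
            return None
    return binding

def forward_chain(facts):
    """Apply the horn-clause rules until no new statement can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            bindings = [{}]
            for prem in premises:
                bindings = [b2 for b in bindings for f in facts
                            if (b2 := match(prem, f, b)) is not None]
            for b in bindings:
                new = tuple(b.get(t, t) for t in conclusion)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

# dog ⊆ animal; the denotation of "Tom has a dog" is non-empty and (by other
# monotonicity axioms, asserted here as a fact) subsumed by "Tom has an animal".
derived = forward_chain({("subsume", "dog", "animal"),
                         ("subsume", "has_a_dog", "has_an_animal"),
                         ("nonempty", "has_a_dog")})
assert ("nonempty", "has_an_animal") in derived
```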
Further extensions of our framework are made to deal with additional linguistic phenomena, as briefly explained below.
To deal with negation in our forward-chaining inference engine, we introduce one more relation on abstract denotations, namely disjointness, meaning that two abstract denotations denote disjoint sets. Using disjointness we implement two types of negation: (i) atomic negation, where for each content word we allow a negated version of that word, characterized by its disjointness from the original word; and (ii) root negation, where for a DCS tree $T$ and its denotation $\tau(T)$, the negation of $T$ is represented by a disjointness statement whose effect is that $\tau(T) = \emptyset$.
Selection operators in relational algebra select from a set the subset satisfying some specific property. This can be employed to represent linguistic phenomena such as downward monotonicity and generalized quantifiers. In the current system, we implement (i) superlatives, e.g. for "the highest mountain in Asia", and (ii) numerics, e.g. for "two pet fish". Selection operators are implemented as markers assigned to abstract denotations, with specially designed axioms; for example, superlatives are characterized by axioms relating the selected subset to the set it is selected from. New rules can be added if necessary.
We use Stanford CoreNLP to resolve coreferences [], and coreference is implemented as a special type of selection. If a node in a DCS tree belongs to a mention cluster, we take the abstract denotation of that node and make a selection from it, which is regarded as the abstract denotation of that mention. Then all selections belonging to the same mention cluster are declared to be equal.
Recognizing textual entailment (RTE) is the task of determining whether a given textual statement H can be inferred from a text passage T. For this, our primary textual inference system operates as follows:
For a T-H pair, apply dependency parsing and coreference resolution.
Perform rule-based conversion from dependency parses to DCS trees, which are translated to statements on abstract denotations.
Use statements of T and linguistic knowledge as premises, and try to prove statements of H by our inference engine.
However, this method alone does not work for real-world datasets such as PASCAL RTE [], because of the knowledge bottleneck: the lack of sufficient linguistic knowledge often causes inference to fail, so the system outputs "no entailment" for almost all pairs [].
The transparent syntax-to-semantics interface of DCS enables us to back off to shallower NLP techniques during inference to compensate for the lack of knowledge. We extract fragments of DCS trees as paraphrase candidates, translate them back to linguistic expressions, and apply distributional similarity to judge their validity. In this way, our framework combines distributional and logical semantics, which is also the main subject of [] and [].
Our full system (Figure 4) thus additionally invokes linguistic knowledge on-the-fly, as follows (a sketch of the overall procedure is given after these steps):
If H is not proven, compare DCS trees of T and H, and generate path alignments.
Aligned paths are evaluated by a similarity score to estimate their likelihood of being paraphrases. Path alignments with scores higher than a threshold are accepted.
Convert accepted path alignments into statements on abstract denotations, use them in logical inference as new knowledge, and try to prove H again.
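The overall decision procedure can be summarized by the sketch below; the function name, its arguments, and the string labels are ours for illustration, with the prover, the scored path alignments, and the threshold supplied by the components described in this and the following sections.

```python
def rte_decision(premises, goal, path_alignments, prove, threshold):
    """Sketch of the full procedure. `premises` and `goal` are sets of
    statements on abstract denotations, `path_alignments` is a list of
    (score, statements) pairs obtained from aligned paths, and `prove` is
    the forward-chaining inference engine."""
    if prove(premises, goal):                  # primary (strict) inference
        return "entailment"
    on_the_fly = set()
    for score, statements in path_alignments:  # align, score, accept
        if score >= threshold:
            on_the_fly |= statements
    # retry with the accepted alignments as new knowledge
    return "entailment" if prove(premises | on_the_fly, goal) else "no entailment"
```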
On-the-fly knowledge is generated by aligning paths in DCS trees. A path is considered to join two germs in a DCS tree, where a germ is defined as a specific semantic role of a node. For example, Figure 5 shows the DCS trees of the following sentences (a simplified pair from RTE2-dev):
T: Tropical storm Debby is blamed for deaths.
H: A storm has caused loss of life.
The germs OBJ(blame) and ARG(death) in the DCS tree of T, for instance, are joined by the underscored path. Two paths are aligned if the germs they join are aligned, and we impose constraints on aligned germs to inhibit meaningless alignments, as described below.
Two germs are aligned if they are both at leaf nodes (e.g. germs at the leaf nodes of T and H in Figure 5), or if they already have part of their meanings in common, according to some logical clues.
To formulate this properly, we define the abstract denotation of a germ, which, intuitively, represents the meaning of the germ in the specific sentence. It is defined in a top-down manner via denotations of nodes: for the root node $x_0$ of a DCS tree $T$, we define its node denotation as the denotation of the entire tree, $I_{x_0} = \tau(T)$; for a non-root node $y$ and its parent node $x$, let the edge $(x, y)$ be labeled by semantic roles $(r, r')$, and define

$I_y = \tau(T_y) \cap \pi_r(I_x)_{r'}$

where $T_y$ is the subtree of $T$ rooted at $y$.

Now for a germ $(x, r)$, i.e. the semantic role $r$ of node $x$, its denotation is defined as the projection of the node denotation onto that semantic role: $\pi_r(I_x)$.
For example, the abstract denotation of the germ OBJ(read) in Figure 1 is $\pi_{OBJ}(read \cap (student_{SUBJ} \times book_{OBJ}))$, meaning "books read by students". Similarly, the denotation of the germ OBJ(blame) in T of Figure 5 indicates the object of "blame" in the sentence "Tropical storm Debby is blamed for deaths", which is a tropical storm, is Debby, etc. Technically, each germ in a DCS tree corresponds to a variable when the DCS tree is translated to a FOL formula, and the abstract denotation of the germ corresponds to the set of consistent values [] of that variable.
The logical clue for aligning germs is: if there exists an abstract denotation, other than $W$, that is a superset of the abstract denotations of both germs, then the two germs can be aligned. A simple example is that the germ of "storm" in T can be aligned to the germ of "storm" in H, because their denotations have a common superset other than $W$, namely storm. A more complicated example is that OBJ(blame) in T and SUBJ(cause) in H can be aligned, because inference can relate both of their denotations to the storm Debby, so they also have the common superset storm. On the other hand, logical clues prevent obviously meaningless alignments, such as aligning a germ denoting deaths to one denoting storms.
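A minimal sketch of this logical clue, assuming the supersets of each germ's denotation have already been derived by the inference engine (the helper name and the germ identifiers below are hypothetical):

```python
def can_align(germ1, germ2, known_supersets, universal="W"):
    """Two germs can be aligned if their abstract denotations share a superset
    other than the universal set W; `known_supersets` maps a germ to the
    supersets of its denotation, as derived by the inference engine."""
    common = known_supersets.get(germ1, set()) & known_supersets.get(germ2, set())
    return bool(common - {universal})

# Hypothetical supersets for some germs of the example in Figure 5:
supersets = {
    "T:OBJ(blame)": {"W", "storm", "Debby"},   # the thing blamed is the storm Debby
    "H:SUBJ(cause)": {"W", "storm"},
    "T:ARG(death)": {"W", "death"},
}
assert can_align("T:OBJ(blame)", "H:SUBJ(cause)", supersets)
assert not can_align("T:ARG(death)", "H:SUBJ(cause)", supersets)
```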
Aligned paths are evaluated by a similarity score, for which we use the distributional similarity of the words that appear in the paths (§4.1). Only path alignments with high similarity scores are accepted. In addition, we only accept paths up to a bounded length, to prevent overly long paths from being aligned.
Accepted aligned paths are converted into statements, which are used as new knowledge. The conversion is done by first performing a DCS tree transformation according to the aligned paths, and then declaring a subsumption relation between the denotations of the aligned germs. For example, to apply the aligned path pair generated in Figure 5, we use it to transform T into a new tree T' (Figure 6), and the aligned germs, OBJ(blame) in T and SUBJ(cause) in T', then generate the on-the-fly knowledge that the denotation of OBJ(blame) in T is subsumed by the denotation of SUBJ(cause) in T'.
Similar to the tree-transformation-based approach to RTE [], this process can also utilize lexical-syntactic entailment rules []. Furthermore, since the on-the-fly knowledge is generated from transformed pairs of DCS trees, all contexts are preserved: in Figure 6, though the tree transformation can be seen as generated from the entailment rule "X is blamed for death → X causes loss of life", the generated on-the-fly knowledge, as shown above the trees, only fires under the additional condition that X is a tropical storm and is Debby. Hence, the process can also be used to generate knowledge from context-sensitive rules [], which are known to have higher quality [].
However, it should be noted that using on-the-fly knowledge in logical inference is not a trivial task. For example, the FOL formula of the rule "X is blamed for death → X causes loss of life" is of the form

$\forall x.\, \bigl(\exists e.\, blame(e) \wedge OBJ(e, x) \wedge \exists d.\, death(d) \wedge IOBJ(e, d)\bigr) \rightarrow \bigl(\exists e', l, f.\, cause(e') \wedge SUBJ(e', x) \wedge OBJ(e', l) \wedge loss(l) \wedge MOD(l, f) \wedge life(f)\bigr),$

which is not a horn clause. The FOL formula for the context-preserved rule in Figure 6 is even more involved. Still, it can be treated efficiently by our inference engine, because as a statement on abstract denotations the rule is an atomic sentence, which is even simpler than a horn clause.
The lexical knowledge we use consists of synonyms, hypernyms and antonyms extracted from WordNet (http://wordnet.princeton.edu/). We also add axioms on named entities, stopwords, numerics and superlatives; for example, named entities are singletons, so we add corresponding axioms.
To calculate the similarity score of a path alignment, we take the sum of the word vectors of the words on each path and compute the cosine similarity. For example, the similarity score of the alignment between the paths "OBJ(blame)-IOBJ-ARG(death)" and "SUBJ(cause)-OBJ-ARG(loss)-MOD-ARG(life)" is calculated as the cosine similarity of the vectors blame+death and cause+loss+life. Other structure in the paths, such as semantic roles, is ignored in the calculation. The word vectors we use are the word2vec vectors (http://code.google.com/p/word2vec/) (Mikolov13), and additional results are also shown using the embeddings from http://metaoptimize.com/projects/wordreprs/ (Turian10). The threshold for accepting path alignments is set based on pre-experiments on the RTE development sets.
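The score computation itself is straightforward; the sketch below uses toy 3-dimensional vectors as stand-ins for the pre-trained word vectors.

```python
import numpy as np

# Toy word vectors standing in for the pre-trained embeddings.
VEC = {
    "blame": np.array([0.9, 0.1, 0.3]),
    "death": np.array([0.2, 0.8, 0.1]),
    "cause": np.array([0.8, 0.2, 0.4]),
    "loss":  np.array([0.3, 0.7, 0.2]),
    "life":  np.array([0.1, 0.6, 0.3]),
}

def path_similarity(words1, words2):
    """Cosine similarity between the summed word vectors of two paths;
    semantic roles and other path structure are ignored."""
    v1 = sum(VEC[w] for w in words1)
    v2 = sum(VEC[w] for w in words2)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# "OBJ(blame)-IOBJ-ARG(death)" vs. "SUBJ(cause)-OBJ-ARG(loss)-MOD-ARG(life)"
score = path_similarity(["blame", "death"], ["cause", "loss", "life"])
```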
The FraCaS test suite contains 346 inference problems divided into 9 sections, each focused on a category of semantic phenomena. We use the data by [], and experiment on the first section, Quantifiers, following []. This section has 44 single-premise and 30 multi-premise problems. Most of the problems do not require lexical knowledge, so we use our primary textual inference system, without on-the-fly knowledge or WordNet, to test the performance of the DCS framework as formal semantics. To obtain the three-valued output (i.e. yes, no, and unknown), we output "yes" if H is proven, and otherwise try to prove the negation of H. To negate H, we use root negation as described in §2.5. If the negation of H is proven, we output "no"; otherwise we output "unknown".
The result is shown in Table 4. Since our system uses an off-the-shelf dependency parser, and semantic representations are obtained by a simple rule-based conversion from dependency trees, there is only one (right or wrong) interpretation in the face of ambiguous sentences. Still, our system outperforms the probabilistic CCG parser of []. Compared to [] and [], our system does not need a pre-trained alignment model, and it improves by making multi-sentence inferences. To sum up, the results show that DCS is good at handling universal quantifiers and negations.
| System | Single Prem. | Multi Prem. |
|---|---|---|
| Lewis13 | 70 | 50 |
| MacCartney07 | 84.1 | - |
| MacCartney08 | 97.7 | - |
| Our Sys. | 79.5 | 80.0 |
Most errors are due to wrongly generated DCS trees (e.g. wrongly assigned semantic roles), unimplemented quantifier triggers (e.g. "neither"), or unimplemented generalized quantifiers (e.g. "at least a few"). These could be addressed in future work.
| System | RTE2 (Prec. / Rec. / Acc.) | RTE3 (Prec. / Rec. / Acc.) | RTE4 (Prec. / Rec. / Acc.) | RTE5 (Prec. / Rec. / Acc.) |
|---|---|---|---|---|
| Primary | 70.9 / 9.8 / 52.9 | 73.2 / 7.3 / 51.1 | 89.7 / 5.2 / 52.3 | 82.6 / 6.3 / 52.5 |
| +On-the-fly | 57.6 / 66.5 / 58.8 | 63.7 / 64.6 / 63.0 | 60.0 / 57.4 / 59.6 | 69.9 / 55.7 / 65.8 |
On the PASCAL RTE datasets, strict logical inference is known to have very low recall [], so on-the-fly knowledge is crucial in this setting. We test the effect of on-the-fly knowledge on the RTE2, RTE3, RTE4 and RTE5 datasets, and compare our system with other approaches.
Results on the test data are shown in Table 5. When only primary knowledge is used in inference (the first row), recall is indeed very low; after we activate the on-the-fly knowledge, recall jumps to over 50%, with a moderate drop in precision. As a result, accuracy increases significantly.
| System | RTE2 | RTE3 | RTE4 | RTE5 |
|---|---|---|---|---|
| Bos06 | 60.6 | - | - | - |
| MacCartney08 | - | 59.4 | - | - |
| Clark08 | - | - | 56.5 | - |
| Wang10 | 63.0 | 61.1 | - | - |
| Stern11 | 61.6 | 67.1 | - | 63.5 |
| Stern12 | - | - | - | 64.0 |
| Our Sys. | 58.8 | 63.0 | 59.6 | 65.8 |
A comparison between our system and other RTE systems is shown in Table 6. Bos06 [] is a hybrid system combining deep features from a theorem prover and a model builder with shallow features such as lexical overlap and text length. MacCartney08 [] uses natural logic to calculate inference relations between two superficially aligned sentences. Clark08 [] is a logic-based system utilizing various resources including WordNet and DIRT paraphrases [], and is tolerant of partially unproven H sentences to some degree. All three systems pursue a logical approach while combining various techniques to achieve robustness. The results show that our system has comparable performance. On the other hand, Wang10 [] learns a tree-edit model from training data, and captures the entailment relation by tree edit distance. Stern11 [] and Stern12 [] extend this framework to utilize entailment rules as tree transformations. These are more tailored systems using machine learning with many handcrafted features. Still, our unsupervised system outperforms the state of the art on the RTE5 dataset.
Summing up the test data from RTE2 to RTE5, Figure 7 shows the proportion of all proven pairs and their precision. Fewer than 5% of pairs can be proven primarily, with a precision of 77%. Over 40% of pairs can be proven with one piece of on-the-fly knowledge, yet there are pairs for which more than two pieces are necessary. The precision for pairs proven with one or two pieces of on-the-fly knowledge is over 60%, which is fairly high given our rough estimation of the similarity score. As a comparison, [] studied the proportion of proven pairs and their precision when applying DIRT rules to tree skeletons in the RTE2 and RTE3 data: the proportion is 8% with precision 65% on RTE2, and 6% with precision 72% on RTE3. Applied in our logical system, the noisy on-the-fly knowledge thus achieves a precision comparable to higher-quality resources such as DIRT.
A major type of error is caused by ignoring semantic roles in the calculation of similarity scores. For example, though "Italy beats Kazakhstan" is not primarily proven from "Italy is defeated by Kazakhstan", our system does produce a path alignment between "SUBJ(beat)-OBJ" and "OBJ(defeat)-SUBJ" with a high similarity score. The impact of such errors depends on the methodology used to construct the data, though: it lowers precision on the RTE2 and RTE3 data, particularly in the "IE" subtask (where precision drops substantially), while it occurs less often in the "IR" subtask.
Finally, to see whether we simply "get lucky" on the RTE5 data in the choice of word vectors and threshold, we vary the threshold and draw the precision-recall curve, using two types of word vectors, Mikolov13 and Turian10. As shown in Figure 8, though the precision drops for Turian10, both curves show the pattern that our system keeps gaining recall while maintaining precision at a certain level. There is not much "magic" in Mikolov13 either: for over 80% of pairs, every node in the DCS tree of H can be covered by a short path that has a corresponding path in T with a similarity score above the threshold.
We have presented a method of deriving abstract denotations from DCS trees, which enables logical inference on DCS, and we have developed a textual inference system based on this framework. Experimental results have shown the power of the representation, which allows both strict inference, as on the FraCaS data, and robust reasoning, as on the RTE data.
The exploration of appropriate meaning representations for querying and reasoning over knowledge bases has a long history. Description logic, being less expressive than FOL but featuring more efficient reasoning, is used as a theoretical basis for the Semantic Web []. Ideas similar to our framework, including the use of sets in a representation that benefits efficient reasoning, are also found in the description logic and knowledge representation communities []. To our knowledge, however, their application to logical inference beyond database querying has not been much explored in the context of NLP.
The pursuit of a logic more suitable for natural language inference is not new. For instance, [] has implemented a model of natural logic []. While computationally efficient, their system leaves various inference patterns out of scope.
Much work has been done on mapping natural language into database queries []. Among these, the (λ-)DCS [] framework defines algorithms that transparently map a labeled tree to a database querying procedure. Essentially, this is possible because DCS trees restrict the querying process to a very limited subset of possible operations. Our main contribution, the abstract denotation of DCS trees, can thus be considered as an attempt to characterize a fragment of FOL that is suited to both natural language inference and a transparent syntax-semantics mapping, through the choice of operations and relations on sets.
We have demonstrated the utility of logical inference on DCS through the RTE task. A wide variety of strategies for tackling the RTE task have been investigated [], including the comparison of surface strings [], syntactic and semantic structures [], semantic vectors [] and logical representations []. Acquisition of basic knowledge for RTE is also a large stream of research []. These previous works include various techniques for acquiring and incorporating different kinds of linguistic and world knowledge, and further fight the knowledge bottleneck problem, e.g. by backing off to shallower representations.
Logic-based RTE systems employ various approaches to bridge knowledge gaps: [] proposes features from a model builder; [] proposes an abduction process; and [] shows that handcrafted rules can drastically improve the performance of a logic-based RTE system.
As such, our current RTE system is at a proof-of-concept stage, in that many of the above techniques are yet to be implemented. Nonetheless, we would like to emphasize that it already shows performance competitive with state-of-the-art systems on one dataset (RTE5). Other directions for future work include further exploitation of the new semantic representation. For example, since abstract denotations are readily suited to data querying, they can be used to verify newly generated assumptions by searching for facts in a database. This may open a way towards a hybrid approach to RTE in which logical inference is intermingled with large-scale database querying.
This research was supported by the Todai Robot Project at National Institute of Informatics.