We present a simple, data-driven approach to generation from knowledge bases (KB). A key feature of this approach is that grammar induction is driven by the extended domain of locality principle of TAG (Tree Adjoining Grammar) and takes into account both syntactic and semantic information. The resulting extracted TAG includes a unification-based semantics and can be used by an existing surface realiser to generate sentences from KB data. Experimental evaluation on the KBGen data shows that our model outperforms a data-driven generate-and-rank approach based on an automatically induced probabilistic grammar, and is comparable with a handcrafted symbolic approach.
In this paper we present a grammar-based approach for generating from knowledge bases (KB) which is linguistically principled and conceptually simple. A key feature of this approach is that grammar induction is driven by the extended domain of locality principle of TAG (Tree Adjoining Grammar) and takes into account both syntactic and semantic information. The resulting extracted TAGs include a unification-based semantics and can be used by an existing surface realiser to generate sentences from KB data.
To evaluate our approach, we use the benchmark provided by the KBGen challenge [4, 3], a challenge designed to evaluate generation from knowledge bases, where the input is a KB subset and the expected output is a complex sentence conveying the meaning represented by that input. When compared with two other systems that took part in the KBGen challenge, our system outperforms a data-driven, generate-and-rank approach based on an automatically induced probabilistic grammar and produces results comparable to those obtained by a symbolic, rule-based approach. Most importantly, we obtain these results using a general purpose approach that we believe is simpler and more transparent than current state of the art surface realisation systems generating from KB or DB data.
Our work is related to previous work on concept-to-text generation.
Earlier work on concept-to-text generation mainly focused on generation from logical forms using rule-based methods. [35] uses hand-written rules to generate sentences from an extended predicate logic formalism; [32] introduces a head-driven algorithm for generating from logical forms; [16] defines a chart-based algorithm which enhances efficiency by minimising the number of semantically incomplete phrases being built; and [31] presents an extension of the chart-based generation algorithm of [16] which supports the generation of multiple paraphrases from underspecified semantic input. In all these approaches, grammar and lexicon are developed manually and it is assumed that the lexicon associates semantic sub-formulae with natural language expressions. Our approach is similar to these approaches in that it assumes a grammar encoding a compositional semantics. It differs from them, however, in that grammar and lexicon are automatically acquired from the data.
With the development of the semantic web and the proliferation of knowledge bases, generation from knowledge bases has attracted increased interest and so-called ontology verbalisers have been proposed which support the generation of text from (parts of) knowledge bases. One main strand of work maps each axiom in the knowledge base to a clause. Thus the OWL verbaliser integrated in the Protégé tool [15] provides a verbalisation of every axiom present in the ontology under consideration and [37] describes an ontology verbaliser using XML-based generation. As discussed in [29], one important limitation of these approaches is that they assume a simple deterministic mapping between knowledge representation languages and some controlled natural language (CNL). Specifically, the assumption is that each atomic term (individual, class, property) maps to a word and each axiom maps to a sentence. As a result, the verbalisation of larger ontology parts can produce very unnatural text such as: Every cat is an animal. Every dog is an animal. Every horse is an animal. Every rabbit is an animal. More generally, the CNL-based approaches to ontology verbalisation generate clauses (one per axiom) rather than complex sentences and thus cannot adequately handle the verbalisation of more complex input such as the KBGen data, where the KB input often requires the generation of a complex sentence rather than a sequence of base clauses.
To generate more complex output from KB data, several alternative approaches have been proposed.
The MIAKT project [5] and the ONTOGENERATION project [1] use symbolic NLG techniques to produce textual descriptions from semantic information contained in a knowledge base. Both systems require some manual input (lexicons and domain schemas). More sophisticated NLG systems such as TAILOR [28], MIGRAINE [25], and STOP [30] offer tailored output based on user/patient models. While offering more flexibility and expressiveness, these systems are difficult for non-NLG experts to adapt because they require the user to understand the architecture of the NLG system [5]. Similarly, the NaturalOWL system [10] has been proposed to generate fluent descriptions of museum exhibits from an OWL ontology. This approach, however, relies on extensive manual annotation of the input data.
The SWAT project has focused on producing descriptions of ontologies that are both coherent and efficient [38]. For instance, instead of the above output, the SWAT system would generate the sentence: The following are kinds of animals: cats, dogs, horses and rabbits. In this approach too, however, the verbaliser output is strongly constrained by a simple Definite Clause Grammar covering simple clauses and sentences verbalising aggregation patterns such as the above. More generally, the sentences generated by ontology verbalisers cover a limited set of linguistic constructions; the grammar used is manually defined; and the mapping between semantics and strings is assumed to be deterministic (e.g., a verb maps to a relation and a noun to a concept). In contrast, we propose an approach which can generate complex sentences from KB data; where the grammar is acquired from the data; and where no assumption is made about the mapping between semantics and NL expressions.
Recent work has focused on data-driven generation from frames, lambda terms and database entries.
[9] describes an approach for generating from the frames produced by a dialog system. They induce a probabilistic Tree Adjoining Grammar from a training set aligning frames and sentences using the grammar induction technique of [6], and apply a beam search with weighted features learned from the training data to rank alternative expansions at each step.
Figure 1. An example kbgen scenario. Target sentence: The function of a gated channel is to release particles from the endoplasmic reticulum.

:TRIPLES (
  (|Release-Of-Calcium646| |object| |Particle-In-Motion64582|)
  (|Release-Of-Calcium646| |base| |Endoplasmic-Reticulum64603|)
  (|Gated-Channel64605| |has-function| |Release-Of-Calcium646|)
  (|Release-Of-Calcium646| |agent| |Gated-Channel64605|))
:INSTANCE-TYPES (
  (|Particle-In-Motion64582| |instance-of| |Particle-In-Motion|)
  (|Endoplasmic-Reticulum64603| |instance-of| |Endoplasmic-Reticulum|)
  (|Gated-Channel64605| |instance-of| |Gated-Channel|)
  (|Release-Of-Calcium646| |instance-of| |Release-Of-Calcium|))
:ROOT-TYPES (
  (|Release-Of-Calcium646| |instance-of| |Event|)
  (|Particle-In-Motion64582| |instance-of| |Entity|)
  (|Endoplasmic-Reticulum64603| |instance-of| |Entity|)
  (|Gated-Channel64605| |instance-of| |Entity|))
[24] focuses on generating natural language sentences from logical form (i.e., lambda terms) using a synchronous context-free grammar. They introduce a novel synchronous context free grammar formalism for generating from lambda terms; induce such a synchronous grammar using a generative model; and extract the best output sentence from the generated forest using a log linear model.
[39, 23] focus on generating from variable-free tree-structured representations such as the CLANG formal language used in the ROBOCUP competition and the database entries collected by [22] for weather forecast generation and by [8] for the air travel domain (ATIS dataset). [39] uses synchronous grammars to transform a variable-free tree-structured meaning representation into sentences. [23] uses a Conditional Random Field to generate from the same meaning representations.
Finally, more recent papers propose approaches which perform both surface realisation and content selection. [2] proposes a log-linear model which decomposes into a sequence of discriminative local decisions. The first classifier determines which records to mention; the second, which fields of these records to select; and the third, which words to use to verbalise the selected fields. [18] uses a generative model for content selection and verbalises the selected input using WASP, an existing generator. Finally, [20, 19] develop a joint optimisation approach for content selection and surface realisation using a generic, domain-independent probabilistic grammar which captures the structure of the database and the mapping from fields to strings. They intersect the grammar with a language model to improve fluency; use a weighted hypergraph to pack the derivations; and find the best derivation tree using the Viterbi algorithm.
Our approach differs from the approaches which assume variable-free tree-structured representations [39, 23] and database entries [18, 20, 19] in that it handles graph-based KB input and assumes a compositional semantics. It is closest to [9] and [24] who extract a grammar encoding syntax and semantics from frames and lambda terms respectively. It differs from the former, however, in that it enforces a tighter syntax/semantics integration by requiring that the elementary trees of our extracted grammar encode the appropriate linking information. While [9] extracts a TAG grammar associating each elementary tree with a semantics, we additionally require that these trees encode the appropriate linking between syntactic and semantic arguments, thereby restricting the space of possible tree combinations and drastically reducing the search space. Although conceptually related to [24], our approach extracts a unification-based grammar rather than one with lambda terms. The extraction process and the generation algorithms are also fundamentally different. We use a simple, mainly symbolic approach whereas they use a generative approach for grammar induction and a discriminative approach for sentence generation.
Figure 2. An example FB-LTAG with unification semantics: an S tree anchored by releases, with NP substitution nodes for the agent and object and semantics instance-of(RoC,Release-of-Calcium), object(RoC,PM), agent(RoC,GC); NP trees for a gated channel (instance-of(GC,Gated-Channel)), particles (instance-of(PM,Particle-In-Motion)) and the endoplasmic reticulum (instance-of(ER,Endoplasmic-Reticulum)); and an auxiliary VP tree anchored by from (VP -> VP* PP) with semantics base(RoC,ER). The arcs indicate the substitutions and adjunction used to derive the example sentence.
The kbgen task was introduced as a new shared task at Generation Challenges 2013 [3] (http://www.kbgen.org) and aimed to compare different generation systems on KB data. Specifically, the task is to verbalise a subset of a knowledge base. For instance, the KB input shown in Figure 1 can be verbalised as: The function of a gated channel is to release particles from the endoplasmic reticulum.
The KB subsets forming the kbgen input data were pre-selected from the AURA biology knowledge base [14], a knowledge base about biology which was manually encoded by biology teachers. It encodes knowledge about events, entities, properties and relations, where relations include event-to-entity, event-to-event, event-to-property and entity-to-property relations. AURA uses a frame-based knowledge representation and reasoning system called Knowledge Machine [7] which was translated into first-order logic with equality and from there into multiple different formats including SILK [13] and OWL2 [26]. It is available for download in various formats including OWL2 (http://www.ai.sri.com/halo/halobook2010/exported-kb/biokb.html).
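For concreteness, the pipe-delimited triple format shown in Figure 1 can be read into (subject, relation, object) tuples with a few lines of Python. The following is a minimal sketch under the assumption that the input follows exactly the format of the kbgen scenarios; the helper name parse_triples is our own.

```python
import re

def parse_triples(block: str):
    """Parse a KBGen :TRIPLES / :INSTANCE-TYPES block (as in Figure 1) into
    (subject, relation, object) tuples.  Identifiers are assumed to be
    pipe-delimited, e.g. (|Release-Of-Calcium646| |agent| |Gated-Channel64605|)."""
    triples = []
    # every parenthesised group of three |...| tokens is one triple
    pattern = r"\(\s*\|([^|]+)\|\s*\|([^|]+)\|\s*\|([^|]+)\|\s*\)"
    for match in re.finditer(pattern, block):
        triples.append(tuple(g.strip() for g in match.groups()))
    return triples

example = ("((|Release-Of-Calcium646| |agent| |Gated-Channel64605|) "
           "(|Gated-Channel64605| |instance-of| |Gated-Channel|))")
print(parse_triples(example))
# [('Release-Of-Calcium646', 'agent', 'Gated-Channel64605'),
#  ('Gated-Channel64605', 'instance-of', 'Gated-Channel')]
```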
To generate from the kbgen data, we induce a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG, [34]) augmented with a unification-based semantics [11] from the training data. We then use this grammar and an existing surface realiser to generate from the test data.
Figure 2 shows an example FB-LTAG augmented with a unification-based semantics.
Briefly, an FB-LTAG consists of a set of elementary trees which can be either initial or auxiliary. Initial trees are trees whose leaves are labeled with substitution nodes (marked with a down-arrow) or terminal categories. Auxiliary trees are distinguished by a foot node (marked with a star) whose category must be the same as that of the root node. In addition, in an FB-LTAG, each elementary tree is anchored by a lexical item (lexicalisation) and the nodes in the elementary trees are decorated with two feature structures called top and bottom which are unified during derivation. Two tree-composition operations are used to combine trees, namely substitution and adjunction. Substitution inserts a tree at a substitution node of another tree, while adjunction inserts an auxiliary tree at a node of a tree. In terms of unifications, substitution unifies the top feature structure of the substitution node with the top feature structure of the root of the tree being substituted in. Adjunction unifies the top feature structure of the root of the tree being adjoined with the top feature structure of the node being adjoined to; and the bottom feature structure of the foot node of the auxiliary tree with the bottom feature structure of the node being adjoined to.
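The substitution operation and the associated top-feature unification can be made concrete with a small sketch. The data structures and helper names below are our own simplifications, not the realiser's actual implementation; they only illustrate how a feature clash blocks a substitution.

```python
class Node:
    def __init__(self, cat, top=None, bot=None, children=None, subst=False):
        self.cat = cat                 # syntactic category, e.g. "NP"
        self.top = dict(top or {})     # top feature structure
        self.bot = dict(bot or {})     # bottom feature structure
        self.children = children or []
        self.subst = subst             # True for substitution nodes (marked with a down-arrow)

def unify(fs1, fs2):
    """Unify two flat feature structures; return None on a feature clash."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            return None
        result[feat] = val
    return result

def substitute(site, tree_root):
    """Substitute tree_root at the substitution node `site` (the substituted
    tree would then replace the site in the host tree)."""
    assert site.subst and site.cat == tree_root.cat
    top = unify(site.top, tree_root.top)
    if top is None:
        raise ValueError("substitution fails: feature clash")
    tree_root.top = top
    return tree_root

# e.g. an NP object slot requiring accusative case unifies with an NP tree
obj_slot = Node("NP", top={"case": "acc"}, subst=True)
np_tree = Node("NP", top={"num": "pl"})
print(substitute(obj_slot, np_tree).top)   # {'case': 'acc', 'num': 'pl'}
```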
In an FB-LTAG augmented with a unification-based semantics, each tree is associated with a semantics, i.e., a set of literals whose arguments may be constants or unification variables. The semantics of a derived tree is the union of the semantics of the trees contributing to its derivation modulo unification. Importantly, semantic variables are shared with syntactic variables (i.e., variables occurring in the feature structures decorating the tree nodes) so that when trees are combined, the appropriate syntax/semantics linking is enforced. For instance, given the semantics:
instance-of(RoC,Release-Of-Calcium), object(RoC,PM), agent(RoC,GC), base(RoC,ER), instance-of(ER,Endoplasmic-Reticulum), instance-of(GC,Gated-Channel), instance-of(PM,Particle-In-Motion)
the grammar will generate A gated channel releases particles from the endoplasmic reticulum but not e.g., Particles releases a gated channel from the endoplasmic reticulum.
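The effect of sharing variables between the semantics and the syntax can be sketched as follows. This is a toy illustration (our own names, not the GenI machinery): the unification performed when filling the subject and object slots produces variable bindings, and the semantics of the derived tree is the union of the instantiated literals, so only the correct agent/object linking survives.

```python
def apply_binding(literals, binding):
    """Instantiate the variables of a tree's semantics with the bindings
    produced by syntactic unification (e.g. {'X': 'GC'})."""
    return [(pred, tuple(binding.get(a, a) for a in args))
            for pred, args in literals]

# semantics of the "releases" tree: the subject NP slot carries the index X,
# the object NP slot carries the index Y (both shared with the syntax)
verb_sem = [("instance-of", ("E", "Release-Of-Calcium")),
            ("agent", ("E", "X")), ("object", ("E", "Y"))]
gc_sem = [("instance-of", ("GC", "Gated-Channel"))]
pm_sem = [("instance-of", ("PM", "Particle-In-Motion"))]

# substituting "a gated channel" into the subject slot and "particles" into
# the object slot yields the bindings X=GC and Y=PM; the semantics of the
# derived tree is the union of the instantiated literals
derived = apply_binding(verb_sem, {"X": "GC", "Y": "PM"}) + gc_sem + pm_sem
print(derived)
```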
We extract our FB-LTAG with unification semantics from the kbgen training data in two main steps. First, we align the KB data with the input string. Second, we induce a Tree Adjoining Grammar augmented with a unification-based semantics from the aligned data.
Given a sentence/input pair (S, I) provided by the KBGen challenge, the alignment procedure associates each entity and event variable in I with a substring of S. To do this, we use the entity and event lexicons provided by the kbgen organisers. The event lexicon maps event types to verbs, their inflected forms and nominalisations, while the entity lexicon maps entity types to a noun and its plural form. For instance, the lexicon entries for the event and entity types shown in Figure 1 are as shown in Figure 3.
For each entity and event variable in I, we retrieve the corresponding type (e.g., Particle-In-Motion for Particle-In-Motion64582); search the kbgen lexicon for the corresponding phrases (e.g., molecule in motion, molecules in motion); and associate the variable with the phrase in S which matches one of these phrases. Figure 3 shows an example lexicon and the resulting alignment obtained for the scenario shown in Figure 1. Note that there is not always an exact match between the phrase associated in the kbgen lexicon with a type and the phrase occurring in the training sentence. To account for this, we use additional similarity-based heuristics to identify the phrase in the input string that is most likely to be associated with a variable lacking an exact match in the input string. E.g., for entity variables (e.g., Particle-In-Motion64582), we search the input string for nouns (e.g., particles) whose overlap with the variable type (e.g., Particle-In-Motion) is not empty.
Figure 3. Example kbgen lexicon entries and the resulting alignment for the scenario in Figure 1.

Lexicon:
Particle-In-Motion: molecule in motion, molecules in motion
Endoplasmic-Reticulum: endoplasmic reticulum, endoplasmic reticulum
Gated-Channel: gated Channel, gated Channels
Release-Of-Calcium: releases, release, released, release

Alignment:
The function of a (gated channel, Gated-Channel64605) is to (release, Release-Of-Calcium646) (particles, Particle-In-Motion64582) from the (endoplasmic reticulum, Endoplasmic-Reticulum64603)
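The alignment step just described can be sketched as follows. The lexicon format, the overlap heuristic and all helper names are simplifications of our own; the sketch only mirrors the two cases discussed above (exact lexicon match, then fallback on word overlap with the type name).

```python
def align(variables, lexicon, sentence):
    """Associate each KB variable with a phrase of the (lowercased) sentence.

    variables: dict mapping a variable (e.g. 'Particle-In-Motion64582')
               to its type (e.g. 'Particle-In-Motion')
    lexicon:   dict mapping a type to the phrases listed in the kbgen lexicon
    """
    alignment = {}
    for var, vtype in variables.items():
        phrases = lexicon.get(vtype, [])
        # 1. exact match: one of the lexicon phrases occurs in the sentence
        exact = [p for p in phrases if p.lower() in sentence]
        if exact:
            alignment[var] = exact[0]
            continue
        # 2. fallback: a sentence word overlapping with the type name
        #    (e.g. 'particles' overlaps with 'Particle-In-Motion')
        type_words = {w.rstrip("s") for w in vtype.lower().split("-")}
        for word in sentence.split():
            if len(word) > 2 and word.rstrip("s") in type_words:
                alignment[var] = word
                break
    return alignment

variables = {"Particle-In-Motion64582": "Particle-In-Motion",
             "Gated-Channel64605": "Gated-Channel"}
lexicon = {"Gated-Channel": ["gated channel", "gated channels"],
           "Particle-In-Motion": ["molecule in motion", "molecules in motion"]}
sent = "the function of a gated channel is to release particles from the endoplasmic reticulum"
print(align(variables, lexicon, sent))
# {'Particle-In-Motion64582': 'particles', 'Gated-Channel64605': 'gated channel'}
```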
Figure 4. Elementary trees extracted from the scenario in Figure 1: an S tree spanning the discontiguous string the function of ... is to release ... from ..., with NP substitution nodes for the agent, object and base arguments and semantics instance-of(RoC,Release-of-Calcium), object(RoC,PM), base(RoC,ER), has-function(GC,RoC), agent(RoC,GC); and NP trees for a gated channel (instance-of(GC,Gated-Channel)), particles (instance-of(PM,Particle-In-Motion)) and the endoplasmic reticulum (instance-of(ER,Endoplasmic-Reticulum)).
To extract a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG) from the kbgen data, we parse the sentences of the training corpus; project the entity and event variables to the syntactic projection of the strings they are aligned with; and extract the elementary trees of the resulting FB-LTAG from the parse tree using semantic information. Figure 4 shows the trees extracted from the scenario given in Figure 1.
To associate each training example sentence with a syntactic parse, we use the Stanford parser. After alignment, the entity and event variables occurring in the input semantics are associated with substrings of the yield of the syntactic parse tree. We project these variables up the syntactic tree to reflect headedness. A variable aligned with a noun is projected to the NP level or to the immediately dominating PP if it occurs in the subtree dominated by the leftmost daughter of that PP. A variable aligned with a verb is projected to the first S node immediately dominating that verb or, in the case of a predicative sentence, to the root of that sentence. (Initially, we used the head information provided by the Stanford parser. In practice, however, we found that the heuristics we defined to project semantic variables to the corresponding syntactic projection were more accurate and better supported our grammar extraction process.)
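A simplified sketch of the projection heuristics over a constituency parse is given below. We use nltk's Tree class purely for illustration, and the entity heuristic is reduced to its core case (project to the lowest dominating NP); the PP condition and the predicative-sentence case described above are omitted.

```python
from nltk import Tree  # assumes the parse is available as an nltk-style Tree

def project_entity(tree, leaf_index):
    """Project an entity variable aligned with the leaf at `leaf_index` to
    its NP projection (simplified: the lowest NP dominating that leaf)."""
    leaf_pos = tree.leaf_treeposition(leaf_index)
    for i in range(len(leaf_pos) - 1, -1, -1):       # walk up towards the root
        node = tree[leaf_pos[:i]]
        if isinstance(node, Tree) and node.label() == "NP":
            return leaf_pos[:i]
    return ()

def project_event(tree, leaf_index):
    """Project an event variable aligned with the leaf at `leaf_index` to
    the first S node dominating that verb."""
    leaf_pos = tree.leaf_treeposition(leaf_index)
    for i in range(len(leaf_pos) - 1, -1, -1):
        node = tree[leaf_pos[:i]]
        if isinstance(node, Tree) and node.label() == "S":
            return leaf_pos[:i]
    return ()

parse = Tree.fromstring(
    "(S (NP (DT a) (NN gated) (NN channel)) (VP (VBZ releases) (NP (NNS particles))))")
print(parse[project_entity(parse, 2)])   # (NP (DT a) (NN gated) (NN channel))
print(parse[project_event(parse, 3)])    # the whole S tree
```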
Once entity and event variables have been projected up the parse trees, we extract elementary FB-LTAG trees and their semantics from the input scenario as follows.
First, the subtrees whose root node is indexed with an entity variable are extracted. This results in a set of NP and PP trees anchored with entity names and associated with the predication true of the indexing variable.
Second, the subtrees capturing relations between variables are extracted. To perform this extraction, each input variable v is associated with a set of dependent variables, i.e., the set of variables v' such that v is related to v' by some relation R (that is, R(v,v') occurs in the input). The minimal tree containing all and only the dependent variables of v is then extracted and associated with the set of literals R(v,v') relating v to its dependents. This procedure extracts the subtrees relating the argument variables of a semantic functor such as an event or a role, e.g., a tree describing a verb and its arguments as shown in the top part of Figure 4. Note that such a tree may capture a verb occurring in a relative or a subordinate clause (together with its arguments), thus allowing for complex sentences including a relative clause or relating a main and a subordinate clause.
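The two ingredients of this step, collecting the dependent variables with their literals and finding the minimal tree node spanning their projections, can be sketched as follows (illustrative helper names of our own; tree positions are tuples as returned by the projection step above).

```python
def dependents(var, triples):
    """Return the dependent variables of `var`, i.e. the variables it is
    related to by some relation, together with the corresponding literals."""
    deps, literals = set(), []
    for subj, rel, obj in triples:
        if subj == var and rel != "instance-of":
            deps.add(obj)
            literals.append((rel, subj, obj))
    return deps, literals

def minimal_spanning_position(positions):
    """Tree position of the smallest node dominating all given positions,
    i.e. the longest common prefix of the positions."""
    prefix = positions[0]
    for pos in positions[1:]:
        common = 0
        while common < min(len(prefix), len(pos)) and prefix[common] == pos[common]:
            common += 1
        prefix = prefix[:common]
    return prefix

triples = [("RoC", "object", "PM"), ("RoC", "agent", "GC"),
           ("RoC", "base", "ER"), ("RoC", "instance-of", "Release-Of-Calcium")]
deps, lits = dependents("RoC", triples)
print(sorted(deps))                                            # ['ER', 'GC', 'PM']
print(minimal_spanning_position([(1, 0, 0), (1, 1), (0,)]))    # () -- the root spans all three
```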
The resulting grammar extracted from the parse trees (cf. e.g., Figure 4) is a Feature-Based Tree Adjoining Grammar with a unification-based compositional semantics as described in [11]. In particular, our grammar differs from the traditional probabilistic Tree Adjoining Grammars extracted as described in, e.g., [6] in that it encodes both syntax and semantics rather than just syntax. It also differs from the semantic FB-TAG extracted by [9] in that (i) it encodes the linking between syntactic and semantic arguments; (ii) it allows for elementary trees spanning discontiguous strings (e.g., The function of X is to release Y); and (iii) it enforces the semantic principle underlying TAG, namely that an elementary tree containing a syntactic functor also contains its syntactic arguments.
To generate with the grammar extracted from the kbgen data, we use the GenI surface realiser [12]. Briefly, given an input semantics and an FB-LTAG with a unification-based semantics, GenI selects all grammar entries whose semantics subsumes the input semantics; combines these entries using the FB-LTAG combination operations (i.e., adjunction and substitution); and outputs the yield of all derived trees which are syntactically complete and whose semantics is the input semantics. To rank the generator output, we train a language model on the GENIA corpus (http://www.nactem.ac.uk/genia/), a corpus of 2000 MEDLINE abstracts about biology containing more than 400,000 words [17], and use this model to rank the generated sentences by decreasing probability.
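The reranking step amounts to scoring each candidate realisation with the language model and sorting by decreasing probability. The paper does not specify the LM toolkit, so the sketch below uses a toy add-one-smoothed bigram model as a stand-in for the model trained on GENIA; the class and function names are our own.

```python
import math
from collections import Counter

class BigramLM:
    """Toy add-one-smoothed bigram language model (stand-in for the model
    trained on the GENIA corpus)."""
    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.unigrams)

    def logprob(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(math.log((self.bigrams[(prev, word)] + 1)
                            / (self.unigrams[prev] + self.vocab))
                   for prev, word in zip(tokens, tokens[1:]))

def rank(candidates, lm):
    """Order candidate realisations by decreasing language model probability."""
    return sorted(candidates, key=lm.logprob, reverse=True)

lm = BigramLM(["a gated channel releases particles",
               "particles are released from the endoplasmic reticulum"])
print(rank(["particles releases a gated channel",
            "a gated channel releases particles"], lm))
# the fluent word order is ranked first
```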
Thus, for instance, given the input semantics shown in Figure 1 and the grammar depicted in Figure 4, the surface realiser will select all of these trees, combine them using the FB-LTAG substitution operation, and output the yield of the resulting derived tree, namely the sentence The function of a gated channel is to release particles from the endoplasmic reticulum.
However, this procedure only works if the entries necessary to generate from the given input are present in the grammar. To handle new, unseen input, we proceed in two ways. First, we try to guess a grammar entry from the shape of the input and the existing grammar. Second, we expand the grammar by decomposing the extracted trees into simpler ones.
Given the limited size of the training data, it is often the case that input from the test data will have no matching grammar unit. To handle such previously unseen input, we start by partitioning the input semantics into sub-semantics corresponding to events, entities and roles.
For each entity variable X of type Type, we create a default NP tree whose semantics is the literal instance-of(X,Type).
For event variables, we search the lexicon for an entry with a matching or similar semantics i.e., an entry with the same number and same type of literals (literals with same arity and with identical relations). When one is found, a grammar entry is constructed for the unseen event variable by substituting the event type of the matching entry with the type of the event variable. For instance, given the input semantics instance-of(C,Carry), object(C,X), base(C,Y), has-function(Z,C), agent(C,Z), this procedure will create a grammar entry identical to that shown at the top of Figure 4 except that the event type Release-of-Calcium is changed to Carry and the terminal release to the word form associated in the kbgen lexicon with this concept, namely to the verb carry.
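The matching step for event variables can be illustrated as follows: we compare the relation/arity signature of the unseen semantics with those of the extracted entries and, on a match, clone the entry while swapping in the new event type and anchor. The grammar-entry representation and helper names below are simplifications of our own.

```python
def signature(semantics):
    """Relation/arity signature of a semantics, ignoring the event type,
    e.g. [('agent', 2), ('base', 2), ('has-function', 2), ('object', 2)]."""
    return sorted((rel, len(args)) for rel, *args in semantics
                  if rel != "instance-of")

def guess_entry(unseen_sem, unseen_type, new_anchor, grammar):
    """Find a grammar entry with the same signature as the unseen input and
    clone it, swapping in the new event type and the new anchor word."""
    for entry in grammar:
        if signature(entry["semantics"]) == signature(unseen_sem):
            clone = dict(entry)
            clone["event_type"] = unseen_type   # e.g. Carry instead of Release-Of-Calcium
            clone["anchor"] = new_anchor        # e.g. 'carry' instead of 'release'
            return clone
    return None

grammar = [{"event_type": "Release-Of-Calcium", "anchor": "release",
            "semantics": [("instance-of", "RoC", "Release-Of-Calcium"),
                          ("object", "RoC", "PM"), ("base", "RoC", "ER"),
                          ("has-function", "GC", "RoC"), ("agent", "RoC", "GC")]}]
unseen = [("instance-of", "C", "Carry"), ("object", "C", "X"),
          ("base", "C", "Y"), ("has-function", "Z", "C"), ("agent", "C", "Z")]
print(guess_entry(unseen, "Carry", "carry", grammar)["anchor"])   # carry
```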
While the extracted grammar nicely captures predicate/argument dependencies, it is very specific to the items seen in the training data. To reduce overfitting, we generalise the extracted grammar by extracting from each event tree, subtrees that capture structures with fewer arguments and optional modifiers.
For each event tree T extracted from the training data which contains a subject-verb-object subtree T', we add T' to the grammar and associate it with the semantics of T minus the relations associated with the arguments that have been removed. For instance, given the extracted tree for the sentence "Aquaporin facilitates the movement of water molecules through hydrophilic channels.", this procedure will construct a new grammar tree corresponding to the subphrase "Aquaporin facilitates the movement of water molecules".
We also construct both simpler event trees and optional modifier trees by extracting, from event trees, PP trees which are associated with a relational semantics. For instance, given the tree shown in Figure 4, the PP tree associated with the relation base(RoC,ER) is removed, creating the two new trees illustrated in Figure 5: an S tree corresponding to the sentence The function of a gated channel is to release particles and an auxiliary PP tree corresponding to the phrase from the endoplasmic reticulum. Similarly, in the Aquaporin example above, a PP tree corresponding to the phrase "through hydrophilic channels" will be extracted.
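A rough sketch of this decomposition over constituency trees is given below, again using nltk's Tree class purely for illustration; the split_off_pp helper and the foot-node marking (VP*) are our own simplifications of the auxiliary-tree construction.

```python
from nltk import Tree

def split_off_pp(event_tree, pp_position):
    """Split an extracted event tree into (i) a core tree without the PP at
    `pp_position` and (ii) an auxiliary tree containing that PP, with a VP
    foot node (marked here with '*') for adjunction."""
    core = event_tree.copy(deep=True)
    pp = core[pp_position]
    del core[pp_position]
    aux = Tree("VP", [Tree("VP*", []), pp])   # VP -> VP* PP auxiliary tree
    return core, aux

tree = Tree.fromstring(
    "(VP (VB release) (NP particles) "
    "(PP (IN from) (NP (DT the) (NN endoplasmic) (NN reticulum))))")
core, aux = split_off_pp(tree, (2,))
print(core)   # (VP (VB release) (NP particles))
print(aux)    # (VP (VP* ) (PP (IN from) (NP (DT the) (NN endoplasmic) (NN reticulum))))
```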
As with the base grammar, missing grammar entries are guessed from the expanded grammar. However, we only do this when a correct grammar entry cannot be guessed from the base grammar.
Figure 5. Trees obtained by automated grammar expansion from the event tree in Figure 4: an S tree corresponding to The function of a gated channel is to release particles, with semantics instance-of(RoC,Release-of-Calcium), object(RoC,PM), has-function(GC,RoC), agent(RoC,GC); and an auxiliary tree anchored by from (VP -> VP* PP), with semantics base(RoC,ER).
We evaluate our approach on the kbgen data and compare it with the kbgen reference and two other systems that took part in the kbgen challenge.
Following a practice introduced by [2], we use the term scenario to denote a KB subset paired with a sentence. The kbgen benchmark contains 207 scenarios for training and 72 for testing. Each KB subset consists of a set of triples and each scenario contains on average 16 triples and 17 words.
We evaluate three configurations of our approach on the kbgen test data: one without grammar expansion (Base); a second with a manual grammar expansion (ManExp); and a third with automated grammar expansion (AutExp). We compare the results obtained with those obtained by two other systems participating in the KBGen challenge, namely the UDEL system, a symbolic rule-based system developed by a group of students at the University of Delaware; and the IMS system, a statistical system using a probabilistic grammar induced from the training data.
We evaluate system output automatically using the BLEU-4 modified precision score [27] with the human-written sentences as references. We also report results from a human evaluation. In this evaluation, participants were asked to rate sentences along three dimensions: fluency (Is the text easy to read?), grammaticality, and meaning similarity or adequacy (Does the meaning conveyed by the generated sentence correspond to the meaning conveyed by the reference sentence?). The evaluation was done online using the LG-Eval toolkit [21]; subjects used a sliding scale from -50 to +50; and a Latin Square experimental design was used to ensure that each evaluator saw the same number of outputs from each system and for each test set item. 12 subjects participated in the evaluation and 3 judgements were collected for each output.
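The automatic evaluation can be reproduced along the following lines. This sketch uses nltk's corpus-level BLEU as a stand-in for the BLEU-4 implementation of [27]; the smoothing choice is ours, not the paper's, and the example data are illustrative.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# references: one human-written sentence per test scenario (tokenised);
# hypotheses: the corresponding system outputs (tokenised)
references = [["the function of a gated channel is to release particles "
               "from the endoplasmic reticulum".split()]]
hypotheses = ["a gated channel releases particles from the endoplasmic reticulum".split()]

# BLEU-4 with equal n-gram weights; smoothing avoids zero scores for short
# or partially matching outputs
score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(round(score, 2))
```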
System | BLEU (All) | BLEU (Covered) | Coverage | # Trees
---|---|---|---|---
IMS | 0.12 | 0.12 | 100% | –
UDEL | 0.32 | 0.32 | 100% | –
Base | 0.04 | 0.39 | 30.5% | 371
ManExp | 0.28 | 0.34 | 83% | 412
AutExp | 0.29 | 0.29 | 100% | 477
Table 6 summarises the results of the automatic evaluation and shows the size (number of elementary TAG trees) of the grammars extracted from the kbgen data.
System | Fluency (Mean) | Fluency (Homogeneous Subset) | Grammaticality (Mean) | Grammaticality (Homogeneous Subset) | Meaning Similarity (Mean) | Meaning Similarity (Homogeneous Subset)
---|---|---|---|---|---|---
UDEL | 4.36 | A | 4.48 | A | 3.69 | A
AutExp | 3.45 | B | 3.55 | B | 3.65 | A
IMS | 1.91 | C | 2.05 | C | 1.31 | B
The average BLEU score is given with respect to all inputs (All) and to those inputs for which the systems generate at least one sentence (Covered). While both the IMS and the UDEL systems have full coverage, our Base system strongly undergenerates, failing to account for 69.5% of the test data. However, because the extracted grammar is linguistically principled and relatively compact, it is possible to manually edit it. Indeed, the ManExp results show that, by adding 41 trees to the grammar, coverage can be increased by 52.5 points to 83%. Finally, the AutExp results demonstrate that the automated expansion mechanism achieves full coverage while keeping a relatively small grammar (477 trees).
In terms of BLEU score, the best version of our system (AutExp) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system (-0.03).
In sum, our approach yields BLEU scores and coverage similar to those obtained by a handcrafted system and outperforms a probabilistic approach. One key feature of our approach is that the grammar extracted from the training data is linguistically principled in that it obeys the extended domain of locality principle of Tree Adjoining Grammar. As a result, the extracted grammar is compact and can be manually modified to fit the needs of an application, as shown by the good results obtained with the ManExp configuration.
We now turn to the results of the human evaluation. Table 7 summarises the results, with systems grouped into homogeneous subsets (letters) such that systems sharing a letter do not differ significantly (significance level: p < 0.05). We used ANOVAs and post-hoc Tukey tests to test for significance. The differences between systems are statistically significant throughout except for meaning similarity (adequacy), where UDEL and our system are on the same level. Across the metrics, our system consistently ranks second, behind the symbolic UDEL system and ahead of the statistical IMS system, thus confirming the ranking based on BLEU.
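The significance testing described above (one-way ANOVA followed by post-hoc Tukey comparisons) can be sketched as follows; the rating vectors below are purely hypothetical placeholders for the collected judgements.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# hypothetical fluency ratings (one value per collected judgement)
udel = np.array([4.5, 4.2, 4.4, 4.3, 4.6])
autexp = np.array([3.4, 3.6, 3.3, 3.5, 3.4])
ims = np.array([2.0, 1.8, 1.9, 2.1, 1.7])

# one-way ANOVA: do the system means differ at all?
print(f_oneway(udel, autexp, ims))

# post-hoc Tukey HSD: which pairs of systems differ significantly (alpha = 0.05)?
scores = np.concatenate([udel, autexp, ims])
groups = ["UDEL"] * len(udel) + ["AutExp"] * len(autexp) + ["IMS"] * len(ims)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```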
In Tree Adjoining Grammar, the extended domain of locality principle ensures that TAG trees group together in a single structure a syntactic predicate and its arguments. Moreover, the semantic principle requires that each elementary tree captures a single semantic unit. Together these two principles ensure that TAG elementary trees capture basic semantic units and their dependencies. In this paper, we presented a grammar extraction approach which ensures that extracted grammars comply with these two basic TAG principles. Using the kbgen benchmark, we then showed that the resulting induced FB-LTAG compares favorably with competing symbolic and statistical approaches when used to generate from knowledge base data.
In the current version of the generator, the output is ranked using a simple language model trained on the GENIA corpus. We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning. In the future, we plan to remedy this using a ranking approach such as those proposed in [33, 36].