Surface Realisation from Knowledge-Bases

Bikash Gyawali
Université de Lorraine, LORIA
Villers-lès-Nancy, F-54600, France
bikash.gyawali@loria.fr
   Claire Gardent
CNRS, LORIA, UMR 7503
Vandoeuvre-lès-Nancy, F-54500, France
claire.gardent@loria.fr
Abstract

We present a simple, data-driven approach to generation from knowledge bases (KB). A key feature of this approach is that grammar induction is driven by the extended domain of locality principle of TAG (Tree Adjoining Grammar); and that it takes into account both syntactic and semantic information. The resulting extracted TAG includes a unification based semantics and can be used by an existing surface realiser to generate sentences from KB data. Experimental evaluation on the KBGen data shows that our model outperforms a data-driven generate-and-rank approach based on an automatically induced probabilistic grammar; and is comparable with a handcrafted symbolic approach.

1 Introduction

In this paper we present a grammar based approach for generating from knowledge bases (KB) which is linguistically principled and conceptually simple. A key feature of this approach is that grammar induction is driven by the extended domain of locality principle of TAG (Tree Adjoining Grammar) and takes into account both syntactic and semantic information. The resulting extracted TAGs include a unification based semantics and can be used by an existing surface realiser to generate sentences from KB data.

To evaluate our approach, we use the benchmark provided by the KBGen challenge [4, 3], a challenge designed to evaluate generation from knowledge bases, where the input is a KB subset and the expected output is a complex sentence conveying the meaning represented by the input. When compared with two other systems that took part in the KBGen challenge, our system outperforms a data-driven, generate-and-rank approach based on an automatically induced probabilistic grammar, and produces results comparable to those obtained by a symbolic, rule-based approach. Most importantly, we obtain these results using a general-purpose approach that we believe is simpler and more transparent than current state-of-the-art surface realisation systems generating from KB or DB data.

2 Related Work

Our work is related to work on concept to text generation.

Earlier work on concept to text generation mainly focuses on generation from logical forms using rule-based methods. [35] uses hand-written rules to generate sentences from an extended predicate logic formalism; [32] introduces a head-driven algorithm for generating from logical forms; [16] defines a chart based algorithm which enhances efficiency by minimising the number of semantically incomplete phrases being built; and [31] presents an extension of the chart based generation algorithm presented in [16] which supports the generation of multiple paraphrases from underspecified semantic input. In all these approaches, grammar and lexicon are developed manually and it is assumed that the lexicon associates semantic sub-formulae with natural language expressions. Our approach is similar to these approaches in that it assumes a grammar encoding a compositional semantics. It differs from them however in that, in our approach, grammar and lexicon are automatically acquired from the data.

With the development of the semantic web and the proliferation of knowledge bases, generation from knowledge bases has attracted increased interest, and so-called ontology verbalisers have been proposed which support the generation of text from (parts of) knowledge bases. One main strand of work maps each axiom in the knowledge base to a clause. Thus the OWL verbaliser integrated in the Protégé tool [15] provides a verbalisation of every axiom present in the ontology under consideration, and [37] describes an ontology verbaliser using XML-based generation. As discussed in [29], one important limitation of these approaches is that they assume a simple deterministic mapping between knowledge representation languages and some controlled natural language (CNL). Specifically, the assumption is that each atomic term (individual, class, property) maps to a word and each axiom maps to a sentence. As a result, the verbalisation of larger ontology parts can produce very unnatural text such as: Every cat is an animal. Every dog is an animal. Every horse is an animal. Every rabbit is an animal. More generally, the CNL-based approaches to ontology verbalisation generate clauses (one per axiom) rather than complex sentences and thus cannot adequately handle the verbalisation of more complex input such as the KBGen data, where the KB input often requires the generation of a complex sentence rather than a sequence of base clauses.

To generate more complex output from KB data, several alternative approaches have been proposed.

The MIAKT project [5] and the ONTOGENERATION project [1] use symbolic NLG techniques to produce textual descriptions from some semantic information contained in a knowledge base. Both systems require some manual input (lexicons and domain schemas). More sophisticated NLG systems such as TAILOR [28], MIGRAINE [25], and STOP [30] offer tailored output based on user/patient models. While offering more flexibility and expressiveness, these systems are difficult to adapt by non-NLG experts because they require the user to understand the architecture of the NLG systems [5]. Similarly, the NaturalOWL system [10] has been proposed to generate fluent descriptions of museum exhibits from an OWL ontology. This approach however relies on extensive manual annotation of the input data.

The SWAT project has focused on producing descriptions of ontologies that are both coherent and efficient [38]. For instance, instead of the above output, the SWAT system would generate the sentence: The following are kinds of animals: cats, dogs, horses and rabbits. In this approach too, however, the verbaliser output is strongly constrained by a simple Definite Clause Grammar covering simple clauses and sentences verbalising aggregation patterns such as the above. More generally, the sentences generated by ontology verbalisers cover a limited set of linguistic constructions; the grammar used is manually defined; and the mapping between semantics and strings is assumed to be deterministic (e.g., a verb maps to a relation and a noun to a concept). In contrast, we propose an approach which can generate complex sentences from KB data; where the grammar is acquired from the data; and where no assumption is made about the mapping between semantics and NL expressions.

Recent work has focused on data-driven generation from frames, lambda terms and database entries.

[9] describes an approach for generating from the frames produced by a dialogue system. They induce a probabilistic Tree Adjoining Grammar from a training set aligning frames and sentences using the grammar induction technique of [6], and use a beam search with weighted features learned from the training data to rank alternative expansions at each step.

The function of a gated channel is to release particles from the endoplasmic reticulum

:TRIPLES (
(|Release-Of-Calcium646| |object| |Particle-In-Motion64582|)
(|Release-Of-Calcium646| |base| |Endoplasmic-Reticulum64603|)
(|Gated-Channel64605| |has-function| |Release-Of-Calcium646|)
(|Release-Of-Calcium646| |agent| |Gated-Channel64605|))
:INSTANCE-TYPES (
(|Particle-In-Motion64582| |instance-of| |Particle-In-Motion|)
(|Endoplasmic-Reticulum64603| |instance-of| |Endoplasmic-Reticulum|)
(|Gated-Channel64605| |instance-of| |Gated-Channel|)
(|Release-Of-Calcium646| |instance-of| |Release-Of-Calcium|))
:ROOT-TYPES (
(|Release-Of-Calcium646| |instance-of| |Event|)
(|Particle-In-Motion64582| |instance-of| |Entity|)
(|Endoplasmic-Reticulum64603| |instance-of| |Entity|)
(|Gated-Channel64605| |instance-of| |Entity|))
Figure 1: Example KBGEN Scenario

[24] focuses on generating natural language sentences from logical form (i.e., lambda terms) using a synchronous context-free grammar. They introduce a novel synchronous context free grammar formalism for generating from lambda terms; induce such a synchronous grammar using a generative model; and extract the best output sentence from the generated forest using a log linear model.

[39, 23] focus on generating from variable-free, tree-structured representations such as the CLANG formal language used in the ROBOCUP competition, the database entries collected by [22] for weather forecast generation, and those collected by [8] for the air travel domain (ATIS dataset). [39] uses synchronous grammars to transform a variable-free, tree-structured meaning representation into sentences. [23] uses a Conditional Random Field to generate from the same meaning representations.

Finally, more recent papers propose approaches which perform both surface realisation and content selection. [2] proposes a log-linear model which decomposes into a sequence of discriminative local decisions. The first classifier determines which records to mention; the second, which fields of these records to select; and the third, which words to use to verbalise the selected fields. [18] uses a generative model for content selection and verbalises the selected input using WASP-1, an existing generator. Finally, [20, 19] develop a joint optimisation approach for content selection and surface realisation using a generic, domain-independent probabilistic grammar which captures the structure of the database and the mapping from fields to strings. They intersect the grammar with a language model to improve fluency; use a weighted hypergraph to pack the derivations; and find the best derivation tree using the Viterbi algorithm.

Our approach differs from the approaches which assume variable-free, tree-structured representations [39, 23] and database entries [18, 20, 19] in that it handles graph-based KB input and assumes a compositional semantics. It is closest to [9] and [24], who extract a grammar encoding syntax and semantics from frames and lambda terms respectively. It differs from the former, however, in that it enforces a tighter syntax/semantics integration by requiring that the elementary trees of our extracted grammar encode the appropriate linking information. While [9] extracts a TAG grammar associating each elementary tree with a semantics, we additionally require that these trees encode the appropriate linking between syntactic and semantic arguments, thereby restricting the space of possible tree combinations and drastically reducing the search space. Although conceptually related to [24], our approach extracts a unification-based grammar rather than one with lambda terms. The extraction process and the generation algorithms are also fundamentally different. We use a simple, mainly symbolic approach whereas they use a generative approach for grammar induction and a discriminative approach for sentence generation.

3 The KBGen Task

[Figure 2: five elementary trees connected by dotted substitution/adjunction links. (i) An NP tree for "a gated channel", indexed GC, with semantics instance-of(GC,Gated-Channel). (ii) An S tree anchored by "releases", with NP substitution nodes indexed GC (subject) and PM (object) and semantics instance-of(RoC,Release-of-Calcium), object(RoC,PM), agent(RoC,GC). (iii) An NP tree for "particles", indexed PM, with semantics instance-of(PM,Particle-In-Motion). (iv) An auxiliary VP tree anchored by "from", with foot node VP* indexed RoC and an NP substitution node indexed ER, with semantics base(RoC,ER). (v) An NP tree for "the endoplasmic reticulum", indexed ER, with semantics instance-of(ER,Endoplasmic-Reticulum).]
Figure 2: Example FB-LTAG with Unification-Based Semantics. Dotted lines indicate substitution and adjunction operations between trees. The variables decorating the tree nodes (e.g., GC) abbreviate feature structures of the form [idx:V] where V is a unification variable shared with the semantics.

The kbgen task was introduced as a new shared task at Generation Challenges 2013 [3] (http://www.kbgen.org) and aimed to compare different generation systems on KB data. Specifically, the task is to verbalise a subset of a knowledge base. For instance, the KB input shown in Figure 1 can be verbalised as: The function of a gated channel is to release particles from the endoplasmic reticulum.

The KB subsets forming the kbgen input data were pre-selected from the AURA biology knowledge base [14], which was manually encoded by biology teachers and encodes knowledge about events, entities, properties and relations, where relations include event-to-entity, event-to-event, event-to-property and entity-to-property relations. AURA uses a frame-based knowledge representation and reasoning system called Knowledge Machine [7], which was translated into first-order logic with equality and, from there, into multiple different formats including SILK [13] and OWL2 [26]. It is available for download in various formats including OWL2 (http://www.ai.sri.com/halo/halobook2010/exported-kb/biokb.html).

4 Generating from the KBGen Knowledge-Base

To generate from the kbgen data, we induce a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG, [34]) augmented with a unification-based semantics [11] from the training data. We then use this grammar and an existing surface realiser to generate from the test data.

4.1 Feature-Based Lexicalised Tree Adjoining Grammar

Figure 2 shows an example FB-LTAG augmented with a unification-based semantics.

Briefly, an FB-LTAG consists of a set of elementary trees which can be either initial or auxiliary. Initial trees are trees whose leaves are labeled with substitution nodes (marked with a down-arrow) or terminal categories. Auxiliary trees are distinguished by a foot node (marked with a star) whose category must be the same as that of the root node. In addition, in an FB-LTAG, each elementary tree is anchored by a lexical item (lexicalisation) and the nodes in the elementary trees are decorated with two feature structures called top and bottom which are unified during derivation. Two tree-composition operations are used to combine trees, namely substitution and adjunction. While substitution inserts a tree into a substitution node of another tree, adjunction inserts an auxiliary tree into a tree. In terms of unifications, substitution unifies the top feature structure of the substitution node with the top feature structure of the root of the tree being substituted in. Adjunction unifies the top feature structure of the root of the tree being adjoined with the top feature structure of the node being adjoined to; and the bottom feature structure of the foot node of the auxiliary tree being adjoined with the bottom feature structure of the node being adjoined to.

In an FB-LTAG augmented with a unification-based semantics, each tree is associated with a semantics, i.e., a set of literals whose arguments may be constants or unification variables. The semantics of a derived tree is the union of the semantics of the trees contributing to its derivation, modulo unification. Importantly, semantic variables are shared with syntactic variables (i.e., variables occurring in the feature structures decorating the tree nodes) so that, when trees are combined, the appropriate syntax/semantics linking is enforced. For instance, given the semantics:

instance-of(RoC,Release-Of-Calcium),
object(RoC,PM), agent(RoC,GC), base(RoC,ER),
instance-of(ER,Endoplasmic-Reticulum),
instance-of(GC,Gated-Channel),
instance-of(PM,Particle-In-Motion)

the grammar will generate A gated channel releases particles from the endoplasmic reticulum but not e.g., Particles releases a gated channel from the endoplasmic reticulum.
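
To make the role of the shared indices concrete, the following minimal Python sketch (our own toy encoding, not the formalism used by the surface realiser) models an elementary tree as a semantics plus substitution slots, where each slot carries the semantic index it must unify with; the index clash is what rules out the second, incorrect combination.

from dataclasses import dataclass, field

@dataclass
class ElementaryTree:
    anchor: str                                   # lexical anchor
    semantics: list                               # literals, e.g. ("agent", "RoC", "GC")
    slots: dict = field(default_factory=dict)     # substitution slot -> required index

releases = ElementaryTree(
    anchor="releases",
    semantics=[("instance-of", "RoC", "Release-Of-Calcium"),
               ("agent", "RoC", "GC"), ("object", "RoC", "PM")],
    slots={"subject_NP": "GC", "object_NP": "PM"})

gated_channel = ElementaryTree("a gated channel", [("instance-of", "GC", "Gated-Channel")])
particles = ElementaryTree("particles", [("instance-of", "PM", "Particle-In-Motion")])

def substitute(host, slot, arg):
    """Substitution succeeds only if the argument's semantic index unifies with
    the index required by the slot; the semantics of the result is the union."""
    arg_index = arg.semantics[0][1]
    if host.slots[slot] != arg_index:
        raise ValueError(f"index clash: {slot} expects {host.slots[slot]}, got {arg_index}")
    host.semantics = host.semantics + arg.semantics
    return host

substitute(releases, "subject_NP", gated_channel)   # ok: GC fills the agent slot
# substitute(releases, "subject_NP", particles)     # fails: PM cannot fill the GC slot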

4.2 Grammar Extraction

We extract our FB-LTAG with unification semantics from the kbgen training data in two main steps. First, we align the KB data with the input string. Second, we induce a Tree Adjoining Grammar augmented with a unification-based semantics from the aligned data.

4.2.1 Alignment

Given a Sentence/Input pair (S, I) provided by the KBGen Challenge, the alignment procedure associates each entity and event variable in I with a substring of S. To do this, we use the entity and event lexicons provided by the kbgen organisers. The event lexicon maps event types to verbs, their inflected forms and nominalisations, while the entity lexicon maps entity types to a noun and its plural form. For instance, the lexicon entries for the event and entity types shown in Figure 1 are as shown in Figure 3.

For each entity and each event variable V in I, we retrieve the corresponding type (e.g., Particle-In-Motion for Particle-In-Motion64582); search the kbgen lexicon for the corresponding phrases (e.g., molecule in motion, molecules in motion); and associate V with the phrase in S which matches one of these phrases. Figure 3 shows an example lexicon and the resulting alignment obtained for the scenario shown in Figure 1. Note that there is not always an exact match between the phrase associated in the kbgen lexicon with a type and the phrase occurring in the training sentence. To account for this, we use additional similarity-based heuristics to identify the phrase in the input string that is most likely to be associated with a variable lacking an exact match. For example, for entity variables (e.g., Particle-In-Motion64582), we search the input string for nouns (e.g., particles) whose overlap with the variable type (e.g., Particle-In-Motion) is not empty.

Particle-In-Motion molecule in motion, molecules in motion
Endoplasmic-Reticulum endoplasmic reticulum, endoplasmic reticulum
Gated-Channel gated Channel, gated Channels
Release-Of-Calcium releases, release, released, release

The function of a (gated channel, Gated-Channel64605) is to (release, Release-Of-Calcium646) (particles, Particle-In-Motion64582) from the (endoplasmic reticulum, Endoplasmic-Reticulum64603 )

Figure 3: Example Entries from the kbgen Lexicon and example alignment
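
The following sketch illustrates the lookup-and-overlap strategy just described; the function and data-structure names are ours, and the overlap heuristic is deliberately simplified.

def align(variables, sentence, lexicon):
    """variables: {var: type}; lexicon: {type: [phrases]}; returns {var: phrase}."""
    lowered = sentence.lower()
    alignment = {}
    for var, vtype in variables.items():
        phrases = lexicon.get(vtype, [])
        # exact match: a lexicon phrase occurring verbatim in the sentence
        exact = [p for p in phrases if p.lower() in lowered]
        if exact:
            alignment[var] = exact[0]
            continue
        # fallback heuristic: a sentence token overlapping with the type name
        type_parts = set(vtype.lower().replace("-", " ").split())
        candidates = [t for t in lowered.split() if t.rstrip("s") in type_parts]
        if candidates:
            alignment[var] = candidates[0]
    return alignment

lexicon = {"Gated-Channel": ["gated channel", "gated channels"],
           "Particle-In-Motion": ["molecule in motion", "molecules in motion"]}
variables = {"Gated-Channel64605": "Gated-Channel",
             "Particle-In-Motion64582": "Particle-In-Motion"}
sentence = ("The function of a gated channel is to release particles "
            "from the endoplasmic reticulum")
print(align(variables, sentence, lexicon))
# {'Gated-Channel64605': 'gated channel', 'Particle-In-Motion64582': 'particles'}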

4.2.2 Inducing an FB-LTAG from the aligned data

[Figure 4: four elementary trees. (i) An S tree spanning "The function of a gated channel is to release particles from the endoplasmic reticulum", with NP substitution nodes indexed GC, PM and ER and semantics instance-of(RoC,Release-of-Calcium), object(RoC,PM), base(RoC,ER), has-function(GC,RoC), agent(RoC,GC). (ii) An NP tree for "a gated channel" with semantics instance-of(GC,Gated-Channel). (iii) An NP tree for "particles" with semantics instance-of(PM,Particle-In-Motion). (iv) An NP tree for "the endoplasmic reticulum" with semantics instance-of(ER,Endoplasmic-Reticulum).]

Figure 4: Extracted Grammar for “The function of a gated channel is to release particles from the endoplasmic reticulum”. Variable names have been abbreviated and the kbgen tuple notation converted to terms so as to fit the input format expected by our surface realiser.

To extract a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG) from the kbgen data, we parse the sentences of the training corpus; project the entity and event variables to the syntactic projection of the strings they are aligned with; and extract the elementary trees of the resulting FB-LTAG from the parse tree using semantic information. Figure 4 shows the trees extracted from the scenario given in Figure 1.

To associate each training example sentence with a syntactic parse, we use the Stanford parser. After alignment, the entity and event variables occurring in the input semantics are associated with substrings of the yield of the syntactic parse tree. We project these variables up the syntactic tree to reflect headedness. A variable aligned with a noun is projected to the NP level or to the immediately dominating PP if it occurs in the subtree dominated by the leftmost daughter of that PP. A variable aligned with a verb is projected to the first S node immediately dominating that verb or, in the case of a predicative sentence, to the root of that sentence. (Initially, we used the head information provided by the Stanford parser. In practice, however, we found that the heuristics we defined to project semantic variables to the corresponding syntactic projection were more accurate and better supported our grammar extraction process.)

Once entity and event variables have been projected up the parse trees, we extract elementary FB-LTAG trees and their semantics from the input scenario as follows.

First, the subtrees whose root node is indexed with an entity variable are extracted. This results in a set of NP and PP trees anchored with entity names and associated with the predication true of the indexing variable.

Second, the subtrees capturing relations between variables are extracted. To perform this extraction, each input variable X is associated with a set of dependent variables, i.e., the set of variables Y such that X is related to Y (R(X,Y)). The minimal tree containing all and only the dependent variables D(X) of a variable X is then extracted and associated with the set of literals Φ such that Φ = {R(Y,Z) | (Y = X ∧ Z ∈ D(X)) ∨ (Y, Z ∈ D(X))}. This procedure extracts the subtrees relating the argument variables of a semantic functor such as an event or a role, e.g., a tree describing a verb and its arguments as shown in the top part of Figure 4. Note that such a tree may capture a verb occurring in a relative or a subordinate clause (together with its arguments), thus allowing for complex sentences including a relative clause or relating a main and a subordinate clause.
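
The sketch below illustrates how the literal set attached to an event tree can be computed from the scenario relations; the helper names are ours, and for simplicity a relation linking X (in either argument position) to one of its dependents is kept, which is what also captures the has-function literal appearing in the event tree of Figure 4.

def dependents(x, relations):
    """D(x): the variables y such that some relation r(x, y) holds."""
    return {y for (_r, a, y) in relations if a == x}

def literals_for(x, relations):
    """Literals attached to the tree extracted for variable x."""
    d = dependents(x, relations)
    kept = []
    for (r, a, b) in relations:
        relates_x = x in (a, b) and (a in d or b in d)   # x with one of its dependents
        among_deps = a in d and b in d                   # between two dependents
        if relates_x or among_deps:
            kept.append((r, a, b))
    return kept

relations = [("object", "RoC", "PM"), ("base", "RoC", "ER"),
             ("has-function", "GC", "RoC"), ("agent", "RoC", "GC")]
print(literals_for("RoC", relations))
# [('object', 'RoC', 'PM'), ('base', 'RoC', 'ER'),
#  ('has-function', 'GC', 'RoC'), ('agent', 'RoC', 'GC')]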

The resulting grammar extracted from the parse trees (cf. e.g., Figure 4) is a Feature-Based Tree Adjoining Grammar with a unification-based compositional semantics as described in [11]. In particular, our grammars differ from the traditional probabilistic Tree Adjoining Grammars extracted as described in e.g., [6] in that they encode both syntax and semantics rather than just syntax. They also differ from the semantic FB-TAG extracted by [9] in that (i) they encode the linking between syntactic and semantic arguments; (ii) they allow for elementary trees spanning discontiguous strings (e.g., The function of X is to release Y); and (iii) they enforce the semantic principle underlying TAG, namely that an elementary tree containing a syntactic functor also contains its syntactic arguments.

4.3 Generation

To generate with the grammar extracted from the kbgen data, we use the GenI surface realiser [12]. Briefly, given an input semantics and an FB-LTAG with a unification-based semantics, GenI selects all grammar entries whose semantics subsumes the input semantics; combines these entries using the FB-LTAG combination operations (i.e., adjunction and substitution); and outputs the yield of all derived trees which are syntactically complete and whose semantics is the input semantics. To rank the generator output, we train a language model on the GENIA corpus (http://www.nactem.ac.uk/genia/), a corpus of 2,000 MEDLINE abstracts about biology containing more than 400,000 words [17], and use this model to rank the generated sentences by decreasing probability.

Thus, for instance, given the input semantics shown in Figure 1 and the grammar depicted in Figure 4, the surface realiser will select all of these trees; combine them using the FB-LTAG substitution operation; and output as generated sentence the yield of the resulting derived tree, namely The function of a gated channel is to release particles from the endoplasmic reticulum.
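
The final ranking step can be sketched as follows, assuming a simple add-one-smoothed bigram model (an illustration of the idea only, not GenI's code or the actual model we trained on GENIA).

import math
from collections import Counter

def train_bigram_lm(corpus_sentences):
    """Estimate an add-one-smoothed bigram model from tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(toks[:-1])                  # history counts
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams) + 1
    def logprob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        # add-one smoothing keeps unseen bigrams from zeroing out the score
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))
    return logprob

def rank(candidates, logprob):
    """Order realiser outputs by decreasing language-model probability."""
    return sorted(candidates, key=logprob, reverse=True)

logprob = train_bigram_lm(["a gated channel releases particles",
                           "the function of a channel is to release particles"])
print(rank(["the function of a gated channel is to release particles",
            "a gated channel releases particles"], logprob))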

However, this procedure only works if the entries necessary to generate from the given input are present in the grammar. To handle new, unseen input, we proceed in two ways. First, we try to guess a grammar entry from the shape of the input and the existing grammar. Second, we expand the grammar by decomposing the extracted trees into simpler ones.

4.4 Guessing new grammar entries.

Given the limited size of the training data, it is often the case that input from the test data has no matching grammar unit. To handle such previously unseen input, we start by partitioning the input semantics into sub-semantics corresponding to events, entities and roles.

For each entity variable X of type Type, we create a default NP tree whose semantics is a literal of the form instance-of(X,Type).

For event variables, we search the lexicon for an entry with a matching or similar semantics i.e., an entry with the same number and same type of literals (literals with same arity and with identical relations). When one is found, a grammar entry is constructed for the unseen event variable by substituting the event type of the matching entry with the type of the event variable. For instance, given the input semantics instance-of(C,Carry), object(C,X), base(C,Y), has-function(Z,C), agent(C,Z), this procedure will create a grammar entry identical to that shown at the top of Figure 4 except that the event type Release-of-Calcium is changed to Carry and the terminal release to the word form associated in the kbgen lexicon with this concept, namely to the verb carry.
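
A sketch of this signature-matching step is given below; the grammar-entry representation and helper names are ours.

def signature(literals):
    """The relation names other than instance-of, as a sorted list."""
    return sorted(rel for (rel, *_rest) in literals if rel != "instance-of")

def guess_entry(unseen_literals, unseen_type, grammar, kb_lexicon):
    """Clone an existing entry whose relation signature matches the unseen input."""
    target = signature(unseen_literals)
    for entry in grammar:              # entry: {'type', 'anchor', 'tree', 'semantics'}
        if signature(entry["semantics"]) == target:
            guessed = dict(entry)
            guessed["type"] = unseen_type
            # re-anchor with the word form the kbgen lexicon lists for this concept
            guessed["anchor"] = kb_lexicon[unseen_type][0]
            return guessed
    return None

# e.g. an unseen Carry event with the same relations as the Release-Of-Calcium
# entry of Figure 4 yields a cloned entry with the same tree shape, re-anchored
# with the verb listed for Carry in the kbgen lexicon.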

4.5 Expanding the Grammar

While the extracted grammar nicely captures predicate/argument dependencies, it is very specific to the items seen in the training data. To reduce overfitting, we generalise the extracted grammar by extracting from each event tree, subtrees that capture structures with fewer arguments and optional modifiers.

For each event tree τ extracted from the training data which contains a subject-verb-object subtree τ′, we add τ′ to the grammar and associate it with the semantics of τ minus the relations associated with the arguments that have been removed. For instance, given the extracted tree for the sentence "Aquaporin facilitates the movement of water molecules through hydrophilic channels", this procedure will construct a new grammar tree corresponding to the subphrase "Aquaporin facilitates the movement of water molecules".

We also construct both simpler event trees and optional modifier trees by extracting, from event trees, PP trees which are associated with a relational semantics. For instance, given the tree shown in Figure 4, the PP tree associated with the relation base(RoC,ER) is removed, thus creating two new trees as illustrated in Figure 5: an S tree corresponding to the sentence The function of a gated channel is to release particles and an auxiliary PP tree corresponding to the phrase from the endoplasmic reticulum. Similarly, in the above example, a PP tree corresponding to the phrase "through hydrophilic channels" will be extracted.
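
The PP-detachment operation can be sketched as follows, using a toy tree encoding of our own in which a PP node carries the relational literal it verbalises.

def node(label, children=None, sem=None):
    return {"label": label, "children": children or [], "sem": sem}

def detach_pps(tree):
    """Split off PP subtrees carrying a relational literal.
    Returns the reduced tree and the list of detached PP nodes."""
    kept, detached = [], []
    for child in tree["children"]:
        if child["label"] == "PP" and child["sem"] is not None:
            detached.append(child)
        else:
            sub, found = detach_pps(child)
            kept.append(sub)
            detached.extend(found)
    return node(tree["label"], kept, tree["sem"]), detached

# toy event tree: "... release particles [PP from the endoplasmic reticulum]"
vp = node("VP", [
    node("VB", [node("release")]),
    node("NP", [node("particles")]),
    node("PP", [node("IN", [node("from")]),
                node("NP", [node("the endoplasmic reticulum")])],
         sem=("base", "RoC", "ER")),
])
reduced, pps = detach_pps(vp)
# 'reduced' realises "... release particles"; each tree in 'pps' becomes an
# auxiliary tree whose semantics is the detached literal base(RoC, ER).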

As with the base grammar, missing grammar entries are guessed from the expanded grammar. However we do this only in cases where a correct grammar entry cannot be guessed from the base grammar.

[Figure 5: the two trees added by the expansion process. (i) An S tree spanning "The function of a gated channel is to release particles", with NP substitution nodes indexed GC and PM and semantics instance-of(RoC,Release-of-Calcium), object(RoC,PM), has-function(GC,RoC), agent(RoC,GC). (ii) An auxiliary VP tree (foot node VP* indexed RoC) anchored by "from", with an NP substitution node indexed ER and semantics base(RoC,ER).]

Figure 5: Trees Added by the Expansion Process

5 Experimental Setup

We evaluate our approach on the kbgen data and compare it with the kbgen reference and with two other systems that took part in the kbgen challenge.

5.1 Training and test data.

Following a practice introduced by [2], we use the term scenario to denote a KB subset paired with a sentence. The kbgen benchmark contains 207 scenarios for training and 72 for testing. Each KB subset consists of a set of triples; on average, each scenario contains 16 triples and 17 words.

5.2 Systems

We evaluate three configurations of our approach on the kbgen test data: one without grammar expansion (Base); a second with manual grammar expansion (ManExp); and a third with automated grammar expansion (AutExp). We compare the results obtained with those obtained by two other systems participating in the KBGen challenge, namely the UDEL system, a symbolic rule-based system developed by a group of students at the University of Delaware, and the IMS system, a statistical system using a probabilistic grammar induced from the training data.

5.3 Metrics.

We evaluate system output automatically, using the BLEU-4 modified precision score [27] with the human-written sentences as reference. We also report results from a human evaluation. In this evaluation, participants were asked to rate sentences along three dimensions: fluency (Is the text easy to read?), grammaticality, and meaning similarity or adequacy (Does the meaning conveyed by the generated sentence correspond to the meaning conveyed by the reference sentence?). The evaluation was done online using the LG-Eval toolkit [21]; subjects used a sliding scale from -50 to +50; and a Latin Square experimental design was used to ensure that each evaluator saw the same number of outputs from each system and for each test set item. 12 subjects participated in the evaluation and 3 judgments were collected for each output.
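
For reference, BLEU-4 scores of the kind reported below can be computed along the following lines using NLTK (the exact evaluation script used for the challenge may differ).

from nltk.translate.bleu_score import corpus_bleu

# one human-written reference per test scenario, and the corresponding system output
reference = ("the function of a gated channel is to release particles "
             "from the endoplasmic reticulum")
hypothesis = "a gated channel releases particles from the endoplasmic reticulum"

# default weights (0.25, 0.25, 0.25, 0.25) correspond to BLEU-4
print(corpus_bleu([[reference.split()]], [hypothesis.split()]))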

6 Results and Discussion

System All Covered Coverage # Trees
IMS 0.12 0.12 100%
UDEL 0.32 0.32 100%
Base 0.04 0.39 30.5% 371
ManExp 0.28 0.34 83 % 412
AutExp 0.29 0.29 100% 477
Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG Trees)

Table 6 summarises the results of the automatic evaluation and shows the size (number of elementary TAG trees) of the grammars extracted from the kbgen data.

Fluency Grammaticality Meaning Similarity
System Mean Homogeneous Subsets Mean Homogeneous Subsets Mean Homogeneous Subsets
UDEL 4.36 A 4.48 A 3.69 A
AutExp 3.45 B 3.55 B 3.65 A
IMS 1.91 C 2.05 C 1.31 B
Figure 7: Human Evaluation Results on a scale of 0 to 5. Homogeneous subsets are determined using Tukey’s Post Hoc Test with p < 0.05

The average BLEU score is given with respect to all inputs (All) and to those inputs for which the systems generate at least one sentence (Covered). While both the IMS and the UDEL systems have full coverage, our Base system strongly undergenerates, failing to account for 69.5% of the test data. However, because the extracted grammar is linguistically principled and relatively compact, it is possible to manually edit it. Indeed, the ManExp results show that, by adding 41 trees to the grammar, coverage can be increased by 52.5 points, reaching 83%. Finally, the AutExp results demonstrate that the automated expansion mechanism permits achieving full coverage while keeping a relatively small grammar (477 trees).

In terms of BLEU score, the best version of our system (AutExp) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system (-0.03).

In sum, our approach yields BLEU scores and coverage similar to those obtained by a handcrafted system and outperforms a probabilistic approach. One key feature of our approach is that the grammar extracted from the training data is linguistically principled in that it obeys the extended domain of locality principle of Tree Adjoining Grammar. As a result, the extracted grammar is compact and can be manually modified to fit the needs of an application, as shown by the good results obtained with the ManExp configuration.

We now turn to the results of the human evaluation. Table 7 summarises the results; systems are grouped under the same letter when there is no significant difference between them (significance level: p < 0.05). We used ANOVAs and post-hoc Tukey tests to test for significance. The differences between systems are statistically significant throughout, except for meaning similarity (adequacy), where UDEL and our system are on the same level. Across the metrics, our system consistently ranks second, behind the symbolic UDEL system and ahead of the statistical IMS system, thus confirming the ranking based on BLEU.

7 Conclusion

In Tree Adjoining Grammar, the extended domain of locality principle ensures that TAG trees group together in a single structure a syntactic predicate and its arguments. Moreover, the semantic principle requires that each elementary tree captures a single semantic unit. Together these two principles ensure that TAG elementary trees capture basic semantic units and their dependencies. In this paper, we presented a grammar extraction approach which ensures that extracted grammars comply with these two basic TAG principles. Using the kbgen benchmark, we then showed that the resulting induced FB-LTAG compares favorably with competing symbolic and statistical approaches when used to generate from knowledge base data.

In the current version of the generator, the output is ranked using a simple language model trained on the GENIA corpus. We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning. In the future, we plan to remedy this using a ranking approach such as that proposed in [33, 36].

References

  • [1] G. Aguado, A. Bañón, J. Bateman, S. Bernardos, M. Fernández, A. Gómez-Pérez, E. Nieto, A. Olalla, R. Plaza and A. Sánchez(1998) ONTOGENERATION: reusing domain and linguistic ontologies for spanish text generation. Vol. 98. Cited by: 2.
  • [2] G. Angeli, P. Liang and D. Klein(2010) A simple domain-independent probabilistic approach to generation. pp. 502–512. Cited by: 2, 5.1.
  • [3] E. Banik, C. Gardent and E. Kow(2013) The kbgen challenge. pp. 94–97. Cited by: 1, 3.
  • [4] E. Banik, C. Gardent, D. Scott, N. Dinesh and F. Liang(2012) KBGen: text generation from knowledge bases as a new shared task. pp. 141–145. Cited by: 1.
  • [5] K. Bontcheva and Y. Wilks(2004) Automatic report generation from ontologies: the miakt approach. Ninth International Conference on Applications of Natural Language to Information Systems (NLDB’2004), Cited by: 2.
  • [6] D. Chiang(2000) Statistical parsing with an automatically-extracted tree adjoining grammar. pp. 456–463. Cited by: 2, 4.2.2.
  • [7] P. Clark and B. Porter(1997) Building concept representations from reusable components. pp. 369–376. Cited by: 3.
  • [8] D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky and E. Shriberg(1994) Expanding the scope of the atis task: the atis-3 corpus. pp. 43–48. Cited by: 2.
  • [9] D. DeVault, D. Traum and R. Artstein(2008) Making grammar-based generation easier to deploy in dialogue systems. pp. 198–207. Cited by: 2, 2, 4.2.2.
  • [10] D. Galanis, G. Karakatsiotis, G. Lampouras and I. Androutsopoulos(2009) An open-source natural language generator for owl ontologies and its use in protégé and second life. pp. 17–20. Cited by: 2.
  • [11] C. Gardent and L. Kallmeyer(2003) Semantic construction in feature-based tag. pp. 123–130. Cited by: 4.2.2, 4.
  • [12] C. Gardent and E. Kow(2007) A symbolic approach to near-deterministic surface realisation using tree adjoining grammar. Vol. 7, pp. 328–335. Cited by: 4.3.
  • [13] B. Grosof(2012) The silk project: semantic inferencing on large knowledge. Technical report SRI. Note: \urlhttp://silk.semwebcentral.org/ Cited by: 3.
  • [14] D. Gunning, V. K. Chaudhri, P. Clark, K. Barker, S. Chaw, M. Greaves, B. Grosof, A. Leung, D. McDonald, S. Mishra, J. Pacheco, B. Porter, A. Spaulding, D. Tecuci and J. Tien(2010) Project halo update - progress toward digital aristotle. AI Magazine Fall, pp. 33–58. Cited by: 3.
  • [15] K. Kaljurand and N.E. Fuchs(2007) Verbalizing owl in attempto controlled english. Proceedings of OWLED07. Cited by: 2.
  • [16] M. Kay(1996) Chart generation. pp. 200–204. Cited by: 2.
  • [17] J. Kim, T. Ohta, Y. Tateisi and J. Tsujii(2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19 (suppl 1), pp. i180–i182. Cited by: 4.3.
  • [18] J. Kim and R. J. Mooney(2010) Generative alignment and semantic parsing for learning from ambiguous supervision. pp. 543–551. Cited by: 2, 2.
  • [19] I. Konstas and M. Lapata(2012) Concept-to-text generation via discriminative reranking. pp. 369–378. Cited by: 2, 2.
  • [20] I. Konstas and M. Lapata(2012) Unsupervised concept-to-text generation with hypergraphs. pp. 752–761. Cited by: 2, 2.
  • [21] E. Kow and A. Belz(2012) LG-eval: a toolkit for creating online language evaluation experiments.. pp. 4033–4037. Cited by: 5.3.
  • [22] P. Liang, M. I. Jordan and D. Klein(2009) Learning semantic correspondences with less supervision. pp. 91–99. Cited by: 2.
  • [23] W. Lu, H. T. Ng and W. S. Lee(2009) Natural language generation with tree conditional random fields. pp. 400–409. Cited by: 2, 2.
  • [24] W. Lu and H. T. Ng(2011) A probabilistic forest-to-string model for language generation from typed lambda calculus expressions. pp. 1611–1622. Cited by: 2, 2.
  • [25] V. Mittal, G. Carenini and J. Moore(1994) Generating patient specific explanations in migraine. Cited by: 2.
  • [26] B. Motik, P. F. Patel-Schneider, B. Parsia, C. Bock, A. Fokoue, P. Haase, R. Hoekstra, I. Horrocks, A. Ruttenberg and U. Sattler(2009) OWL 2 web ontology language: structural specification and functional-style syntax. W3C recommendation 27, pp. 17. Cited by: 3.
  • [27] K. Papineni, S. Roukos, T. Ward and W. Zhu(2002) BLEU: a method for automatic evaluation of machine translation. pp. 311–318. Cited by: 5.3.
  • [28] C.L. Paris(1988) Tailoring object descriptions to a user’s level of expertise. Computational Linguistics 14 (3), pp. 64–78. Cited by: 2.
  • [29] R. Power and A. Third(2010) Expressing owl axioms by english sentences: dubious in theory, feasible in practice. pp. 1006–1013. Cited by: 2.
  • [30] E. Reiter, R. Robertson and L.M. Osman(2003) Lessons from a failure: generating tailored smoking cessation letters. Artificial Intelligence 144 (1), pp. 41–58. Cited by: 2.
  • [31] H. Shemtov(1996) Generation of paraphrases from ambiguous logical forms. pp. 919–924. Cited by: 2.
  • [32] S. M. Shieber, G. Van Noord, F. C. Pereira and R. C. Moore(1990) Semantic-head-driven generation. Computational Linguistics 16 (1), pp. 30–42. Cited by: 2.
  • [33] E. Velldal and S. Oepen(2006) Statistical ranking in tactical generation. pp. 517–525. Cited by: 7.
  • [34] K. Vijay-Shanker and A. Joshi(1988) Feature structures based tree adjoining grammars. Budapest, Hungary. Cited by: 4.
  • [35] J. Wang(1980) On computational sentence generation from logical form. pp. 405–411. Cited by: 2.
  • [36] M. White and R. Rajkumar(2009) Perceptron reranking for ccg realization. pp. 410–419. Cited by: 7.
  • [37] G. Wilcock(2003) Talking owls: towards an ontology verbalizer. Human Language Technology for the Semantic Web and Web Services, ISWC 3, pp. 109–112. Cited by: 2.
  • [38] S. Williams and R. Power(2010) Grouping axioms for more coherent ontology descriptions. Dublin, pp. 197–202. Cited by: 2.
  • [39] Y. W. Wong and R. J. Mooney(2007) Generation by inverting a semantic parser that uses statistical machine translation.. pp. 172–179. Cited by: 2, 2.