Omni-word Feature and Soft Constraint
for Chinese Relation Extraction

Yanping Chen, Qinghua Zheng
MOEKLINNS Lab, Department of Computer Science and Technology
Xi’an Jiaotong University, China
ypench@gmail.com, qhzheng@mail.xjtu.edu.cn

Wei Zhang
Amazon.com, Inc.
wzhan@amazon.com


Abstract

Chinese is an ancient hieroglyphic language and is inattentive to structure. Therefore, segmenting and parsing Chinese are more difficult and less accurate than for English. In this paper, we propose an Omni-word feature and a soft constraint method for Chinese relation extraction. The Omni-word feature uses every potential word in a sentence as a lexical feature, reducing errors caused by word segmentation. In order to utilize the structure information of a relation instance, we discuss how soft constraints can be used to capture local dependency. Both the Omni-word feature and the soft constraint make better use of sentence information and minimize the influence of errors introduced by Chinese word segmentation and parsing. We test these methods on the ACE 2005 RDC Chinese corpus. The results show a significant improvement in Chinese relation extraction, outperforming other methods in F-score by 10% on 6 relation types and 15% on 18 relation subtypes.

1 Introduction

Information Extraction (IE) aims at extracting syntactic or semantic units with concrete concepts or linguistic functions []. Instead of dealing with whole documents, most IE systems focus on designated information, extracting named entities, relations, quantifiers or events from sentences.

The relation recognition task is to find the relationships between two entities. Successful recognition of a relation implies correctly detecting both the relation arguments and the relation type. Although this task has received extensive research attention, the performance of relation extraction is still unsatisfactory, with an F-score of 67.5% for English (23 subtypes) []. Chinese relation extraction also shows weak performance, with an F-score of about 66.6% on 18 subtypes [].

The difficulty of Chinese IE is that Chinese words are written next to each other without delimiters in between. The lack of orthographic word boundaries makes Chinese word segmentation difficult. In Chinese, a single sentence often has several segmentation paths, leading to the segmentation ambiguity problem []. The lack of delimiters also causes the out-of-vocabulary problem (OOV, also known as new word detection) []. These problems are worsened by the fact that Chinese has a large number of characters and words. Currently, the state-of-the-art Chinese OOV recognition systems achieve a recall of only about 75% []. The errors caused by segmentation and OOV accumulate and propagate to subsequent processing (e.g. part-of-speech (POS) tagging or parsing).

Therefore, Chinese relation extraction is more difficult. According to our survey, compared with the corresponding work on English, research on Chinese relation extraction has made less significant progress.

Based on the characteristics of Chinese, in this paper an Omni-word feature and a soft constraint method are proposed for Chinese relation extraction. We apply these approaches in a maximum entropy based system to extract relations from the ACE 2005 corpus. Experimental results show that our method achieves a significant improvement.

The contributions of this paper include

  1. We propose a novel Omni-word feature for Chinese relation extraction. Unlike the traditional segmentation based method, which produces a partition of the sentence, the Omni-word feature uses every potential word in a sentence as a lexical feature.

  2. To address the loose structure of Chinese, we utilize soft constraints to capture the local dependency in a relation instance. Four constraint conditions are proposed to generate combined features that capture the local dependency and maximize the classification determination.

The rest of this paper is organized as follows. Section 2 introduces the related work. The Omni-word feature and soft constraint are proposed in Section 3. We give the experimental results in Section 4 and analyze the performance in Section 5. Conclusions are given in Section 6.

2 Related Work

There are two paradigms for extracting the relationship between two entities: Open Relation Extraction (ORE) and Traditional Relation Extraction (TRE) [].

Based on massive and heterogeneous corpora, ORE systems deal with millions or billions of documents. Even when strict filtering or constraints are employed to remove redundant information, they often generate tens of thousands of relations dynamically []. The practicability of ORE systems depends on the adequateness of information in a big corpus []. Most ORE systems utilize weak supervision knowledge to guide the extraction process, such as databases [], Wikipedia [], regular expressions [], ontologies [] or knowledge bases extracted automatically from the Internet []. However, when iteratively coping with large heterogeneous data, ORE systems suffer from the “semantic drift” problem, caused by error accumulation []. Agichtein, Carlson and Fader et al. [] propose syntactic and semantic constraints to prevent this deficiency. The soft constraints proposed in this paper are combined features similar to these syntactic or semantic constraints, and will be discussed in Section 3.2.

The TRE paradigm takes hand-tagged examples as input and extracts predefined relation types []. TRE systems use techniques such as rules (regulars, patterns and propositions) [], kernel methods [], belief networks [], linear programming [], maximum entropy [] or SVMs []. Compared to ORE systems, TRE systems have a robust performance. The disadvantages of TRE systems are that a manually annotated corpus is required, which is time-consuming and costly in human labor, and that migrating between different applications is difficult. However, TRE systems are evaluable and comparable: different systems running on the same corpus can be evaluated appropriately.

In the field of Chinese relation extraction, Liu et al. [] proposed a convolution tree kernel. Combined with external semantic resources, a better performance was achieved. Che et al. [] introduced a feature based method, which utilized lexical information around entities and was evaluated with Winnow and SVM classifiers. Li and Zhang et al. [] explored the position feature between two entities; for each relation type, an SVM was trained and tested independently. Based on a deep belief network, Chen et al. [] proposed a model handling the high dimensional feature space. In addition, there are mixed models. For example, Lin et al. [] employed a model combining both the feature based and the tree kernel based methods.

Despite the popularity of kernel based methods, Huang et al. [] experimented with different kernel methods and inferred that simply migrating kernel methods from English can result in a bad performance in Chinese relation extraction. Chen and Li et al. [] also pointed out that, due to the inaccuracy of Chinese word segmentation and parsing, the tree kernel based approach is inappropriate for Chinese relation extraction. The reason that the tree kernel based approach does not achieve the same level of accuracy as in English may be that segmenting and parsing Chinese are more difficult and less accurate than processing English.

In our research, we propose an Omni-word feature and a soft constraint method. Both approaches are based on the characteristics of Chinese; therefore, better performance is expected. In the following, we introduce the feature construction, which covers the two proposed approaches.

3 Feature Construction

In this section, the employed candidate features are discussed, and four constraint conditions are proposed to transform the candidate features into combined features. The soft constraint is the method used to generate the combined features. (Where there is no ambiguity, we also use the term “soft constraint” to denote the features generated by the employed constraint conditions.)

3.1 Candidate Feature Set

In the ACE corpus, an entity is an object or set of objects in the world. An entity mention is a reference to an entity. Each entity mention is annotated with its full extent and its head, referred to as the extent mention and the head mention respectively. The extent mention includes both the head and its modifiers. Each relation has two entities as arguments: Arg-1 and Arg-2, referred to as E1 and E2. A relation mention (or instance) is the embodiment of a relation; it is represented by the sentence (or clause) in which the relation is located. In our work, we focus on the detection and recognition of relation mentions.

Relation identification is handled as a classification problem. Entity-related information (e.g. head noun, entity type, subtype, CLASS, LDCTYPE, etc.) is assumed to be known and is provided by the corpus. In our experiments, the entity type, subtype and head noun are used.

All the employed features are simply classified into five categories: Entity Type and Subtype, Head Noun, Position Feature, POS Tag and Omni-word Feature. The first four are widely used. The last one is proposed in this paper and is discussed in detail.

Entity Type and Subtype: In the ACE 2005 RDC Chinese corpus, there are 7 entity types (Person, Organization, GPE, Location, Facility, Weapon and Vehicle) and 44 subtypes (e.g. Group, Government, Continent, etc.).

Head Noun: The head noun (or head mention) of an entity mention is manually annotated. This feature is useful and widely used.

Position Feature: The position structure between the two entity mentions (extent mentions). Because entity mentions can be nested, two entity mentions may have four coarse structures: “E1 is before E2”, “E1 is after E2”, “E1 nests in E2” and “E2 nests in E1”, encoded as ‘E1_B_E2’, ‘E1_A_E2’, ‘E1_N_E2’ and ‘E2_N_E1’.
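
As a minimal illustration of this encoding (the span representation and function name are our own illustrative assumptions, assuming half-open character-offset spans for the extent mentions and that mentions either nest or do not overlap):

```python
def position_feature(e1_span, e2_span):
    """Map two extent-mention spans (start, end) to one of the four
    coarse position codes used as the singleton position feature."""
    s1, t1 = e1_span
    s2, t2 = e2_span
    if s1 <= s2 and t2 <= t1:   # E2 lies inside E1
        return 'E2_N_E1'
    if s2 <= s1 and t1 <= t2:   # E1 lies inside E2
        return 'E1_N_E2'
    return 'E1_B_E2' if t1 <= s2 else 'E1_A_E2'

# e.g. position_feature((0, 2), (5, 9)) -> 'E1_B_E2'
```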

POS Tag: In our model, we use only the adjacent POS tags, i.e. the tags of the tokens lying on the two sides of each entity mention. These POS tags are labelled by the ICTCLAS package (http://ictclas.org/). The POS tags are not used independently; each is encoded by combining the POS tag with the adjacent entity mention information. For example, ‘E1_Right_n’ means that the token to the right of the first entity is a noun (“n”).

Omni-word Feature: The notion of “word” in Chinese is vague and has never played a role in the Chinese philological tradition []. Some Chinese segmentation systems have reported precision scores above 95% []. However, for the same sentence, even native Chinese speakers often disagree on the word boundaries []. Sproat et al. [] showed that the agreement on segmentation among different native Chinese speakers is only about 75%. The word-formation of Chinese also implies that the meaning of a compound word is usually composed of the meanings of the words contained in it []. So, fragments of a phrase are also informative.

Because high precision can be achieved by using simple lexical features [], making better use of such information is beneficial. In consideration of the characteristics of Chinese, we use every potential word in a relation mention as a lexical feature. For example, the relation mention ‘台北大安森林公园’ (Taipei Daan Forest Park) has the relation type “PART-WHOLE”. The traditional segmentation method may generate four lexical features {‘台北’, ‘大安’, ‘森林’, ‘公园’}, which form a partition of the relation mention. On the other hand, the Omni-word feature, denoting all the possible words in the relation mention, may generate features such as:

  {‘台’, ‘北’, ‘大’, ‘安’, ‘森’, ‘林’, ‘公’, ‘园’, ‘台北’, ‘大安’, ‘森林’, ‘公园’, ‘森林公园’, ‘大安森林公园’} (the generated Omni-word features depend on the employed lexicon).

Most of these features are nested or overlapped with each other. Therefore, the traditional character-based or word-based features are only a subset of the Omni-word feature. To extract the Omni-word feature, only a lexicon is required: we scan the sentence and collect every lexicon entry that appears in it.
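
A minimal sketch of this lexicon scan (the function name, the max_len cutoff and the toy lexicon are illustrative assumptions, not the actual lexicon described below):

```python
def omni_word_features(text, lexicon, max_len=8):
    """Return every lexicon entry occurring anywhere in the text,
    including nested and overlapping words."""
    found = set()
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            if text[i:j] in lexicon:
                found.add(text[i:j])
    return found

toy_lexicon = {'台北', '大安', '森林', '公园', '森林公园', '大安森林公园'}
print(omni_word_features('台北大安森林公园', toy_lexicon))
# -> {'台北', '大安', '森林', '公园', '森林公园', '大安森林公园'}
```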

Because the number of lexicon entries determines the dimension of the feature space, the performance of the Omni-word feature is influenced by the lexicon being employed. In this paper, we generate the lexicon by merging two lexicons. The first lexicon is obtained by segmenting every relation instance using the ICTCLAS package and collecting every word produced by ICTCLAS. Because the ICTCLAS package was trained on an annotated corpus containing many meaningful lexicon entries, we expect this lexicon to improve the performance. The second lexicon is the Lexicon of Common Words in Contemporary Chinese (published by the Ministry of Education of the People’s Republic of China in 2008, containing 56,008 entries).

Although the Omni-word feature can be seen as a subset of the n-gram feature, it is not the same as the n-gram feature. N-gram features are more fragmented. In most instances, n-gram features have no semantic meanings attached to them and thus have varied distributions. Furthermore, for a single Chinese word, occurrences of 4 characters are frequent, and even 7 or more characters are not rare. Because Chinese has plenty of characters (currently, at least 13,000 characters are used by native Chinese speakers; see the Modern Chinese Dictionary: http://www.cp.com.cn/), when the corpus becomes larger, the n-gram (n > 4) method is difficult to adopt. On the other hand, the Omni-word feature avoids these problems and takes advantage of the characteristics of Chinese (the word-formation and the ambiguity of word segmentation).
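
For contrast, a character n-gram generator is sketched below (the helper name is ours). Most of the bigrams it produces for the earlier example cross word boundaries and carry no intended meaning, whereas the Omni-word feature keeps only lexicon entries:

```python
def char_ngrams(text, n):
    """All character n-grams of the text (no lexicon filtering)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(char_ngrams('台北大安森林公园', 2))
# Bigrams such as '安森' and '林公' cross word boundaries and are not
# meaningful words, unlike the lexicon-filtered Omni-word features.
```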

3.2 Soft Constraint

The structure information (or dependency information) of a relation instance is critical for recognition. However, even in English, “deeper” analysis (e.g. logical syntactic relations or predicate-argument structure) may suffer from worse performance caused by inaccurate chunking or parsing. Hence, the local dependency contexts around the relation arguments are more helpful []. Zhang et al. [] also showed that the Path-enclosed Tree (PT) achieves the best performance in kernel based relation extraction. In this field, the tree kernel based method commonly uses the parse tree to capture the structure information []. On the other hand, the feature based method usually uses combined features to capture such structure information [].

In the open relation extraction domain, syntactic and semantic constraints are widely employed to prevent the “semantic drift” problem. Such constraints can also be seen as structural constraints. Most of these constraints are hard constraints: any relation instance violating them (or falling below a predefined threshold) is abandoned. For example, Agichtein and Gravano [] generate patterns according to a confidence threshold (τ_t), Fader et al. [] utilize a confidence function, and Carlson et al. [] filter candidate instances and patterns using the number of times they co-occur.

Deleting relation instances is acceptable for open relation extraction because it always deals with a big data set, but it is not suitable for traditional relation extraction and will result in a low recall. Utilizing the notion of combined features [], we replace the hard constraint with the soft constraint. Each soft constraint (combined feature) has a parameter, trained by the classifier, indicating its discrimination ability. No subjective or a priori judgement is adopted to delete any potentially determinative constraint (except for the purpose of dimensionality reduction).

Most studies make use of combined features, but rarely analyze the influence of the way in which they are combined. In this paper, we use the soft constraint to model the local dependency. It is a subset of the combined features, generated by four constraint conditions: singleton, position sensitive, bin sensitive and semantic pair. For every employed candidate feature, an appropriate constraint condition is selected to combine it with additional information to maximize the classification determination.

Singleton: A feature is employed as a singleton feature when it is used without being combined with any other information. In our experiments, only the position feature is used as a singleton feature.

Position Sensitive: A position sensitive feature has a label indicating which entity mention it depends on. In our experiments, the head noun and POS tag are utilized as position sensitive features, as introduced in Section 3.1. For example, ‘台北_E1’ means that the head noun ‘台北’ depends on the first entity mention.

Semantic Pair: A semantic pair is generated by combining two semantic units. Two kinds of semantic pairs are employed, generated by combining the two entity types or the two entity subtypes into one pair. For example, ‘Person_Location’ denotes that the type of the first relation argument is “Person” (entity type) and that of the second is “Location” (entity type). Semantic pairs can capture both the semantic and structure information in a relation mention.

Bin Sensitive: In our study, the Omni-word feature is not added as a “bag of words”. To use the Omni-word feature, we segment each relation mention using the two entity mentions. Together with the two entity mentions, we get five parts: “FIRST”, “MIDDLE”, “END”, “E1” and “E2” (or fewer, if the two entity mentions are nested). Each part is taken as an independent bin, and a flag is used to distinguish them. For example, ‘台北_Bin_F’, ‘台北_Bin_E1’ and ‘台北_Bin_E’ mean that the lexicon entry ‘台北’ appears in three bins: the FIRST bin, the first entity mention (E1) bin and the END bin. They are used as three independent features.

To sum up, among the five candidate feature sets, the position feature is used as a singleton feature, both the head noun and POS tag are position sensitive, entity types and subtypes are employed as semantic pairs, and only the Omni-word feature is bin sensitive. In the following experiments, focusing on Chinese relation extraction, we analyze the performance of the candidate feature sets and study the influence of the constraint conditions.
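
To make the four constraint conditions concrete, the following sketch generates combined features for one relation instance. The instance dictionary, its field names, the bin flags and the bin pre-splitting are illustrative assumptions; the Omni-word entries per bin are assumed to have been extracted beforehand (e.g. with the lexicon scan sketched in Section 3.1):

```python
def soft_constraint_features(inst, omni_words_per_bin):
    """Generate combined features for one relation instance.
    inst: dict holding the position code and, per entity, the head
    noun, type, subtype and adjacent POS tag.
    omni_words_per_bin: dict mapping bin flags (e.g. 'F', 'M', 'E',
    'E1', 'E2') to the Omni-word entries found in that bin."""
    feats = []
    # Singleton: the position feature is used on its own.
    feats.append(inst['position'])
    # Position sensitive: head nouns and adjacent POS tags are
    # labelled with the entity mention they depend on.
    feats.append(inst['e1']['head'] + '_E1')
    feats.append(inst['e2']['head'] + '_E2')
    feats.append('E1_Right_' + inst['e1']['right_pos'])
    feats.append('E2_Right_' + inst['e2']['right_pos'])
    # Semantic pair: entity types and subtypes of the two arguments
    # are concatenated into single features.
    feats.append(inst['e1']['type'] + '_' + inst['e2']['type'])
    feats.append(inst['e1']['subtype'] + '_' + inst['e2']['subtype'])
    # Bin sensitive: each Omni-word entry is tagged with its bin flag.
    for bin_flag, words in omni_words_per_bin.items():
        for w in words:
            feats.append(w + '_Bin_' + bin_flag)
    return feats
```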

4 Experiments

In this section, the Omni-word feature and the soft constraint are evaluated and then compared with the state-of-the-art methods.

4.1 Settings and Results

We use the ACE 2005 RDC Chinese corpus, which was collected from newswires, broadcasts and weblogs, containing 633 documents with 6 major relation types and 18 subtypes. There are 8,023 relations and 9,317 relation mentions. After deleting 5 documents containing wrong annotations (DAVYZW_{20041230.1024, 20050110.1403, 20050111.1514, 20050127.1720, 20050201.1538}), we keep 9,244 relation mentions as positive instances.

To get the negative instances, each document is segmented into sentences (five punctuation marks are used as sentence boundaries: the period (。), question mark (?), exclamation mark (!), semicolon (;) and comma (,)). Sentences that do not contain any entity mention pair are deleted. For each of the remaining sentences, we iteratively extract every entity mention pair as the arguments of a relation instance for prediction. For example, suppose a sentence has three entity mentions: A, B and C. Because the relation arguments are order sensitive, six entity mention pairs can be generated: [A,B], [A,C], [B,C], [B,A], [C,A] and [C,B]. After discarding the entity mention pairs that were used as positive instances, we generated 93,283 negative relation instances labelled as “OTHER”. Then, we have 7 relation types and 19 subtypes.
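
A sketch of this pairing step, assuming each sentence is given as a list of its entity mentions and the annotated positive pairs are available as a set (all names are illustrative):

```python
from itertools import permutations

def negative_instances(sentence_mentions, positive_pairs):
    """Enumerate every ordered entity-mention pair in a sentence and
    keep those not annotated as positive relation instances."""
    return [(e1, e2, 'OTHER')
            for e1, e2 in permutations(sentence_mentions, 2)
            if (e1, e2) not in positive_pairs]

# Three mentions A, B, C yield six ordered pairs; annotated pairs are
# discarded and the remainder become 'OTHER' instances.
print(negative_instances(['A', 'B', 'C'], {('A', 'B')}))
```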

A maximum entropy multi-class classifier is trained and tested on the generated relation instances. We adopt five-fold cross validation for training and testing. Because we are interested in the 6 annotated major relation types and the 18 subtypes, we average the results of the five runs on the 6 positive relation types (and 18 subtypes) as the final performance. The F-score is computed as

F-score = (2 × Precision × Recall) / (Precision + Recall)
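
For instance, with a hypothetical precision of 80% and recall of 70%, the F-score is (2 × 80 × 70) / (80 + 70) ≈ 74.7%.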

To implement the maximum entropy model, the toolkit provided by Le [] is employed. The number of iterations is set to 30.

Five candidate feature sets are employed to generate the combined features. The entity type and subtype, head noun and position feature are referred to as thp (“thp” is an acronym of “type, head, position”; features in thp are the candidate features combined with the corresponding constraint conditions, and the following pos and ow are treated in the same way). The POS tags are referred to as pos. The Omni-word feature set is denoted by ow.

Table 1 gives the performance of our system on the 6 types and 18 subtypes. Note that, in this paper, bare numbers and numbers in parentheses represent the results on the 6 types and the 18 subtypes respectively.

Table 1: Performance on Type (Subtype)
Features P R F
thp 61.51 48.85 54.46
(52.92) (36.92) (43.49)
ow 80.16 75.45 77.74
(66.98) (54.85) (60.31)
thp+pos 83.93 77.81 80.76
(69.83) (61.63) (65.47)
thp+ow 92.40 88.37 90.34
(81.94) (70.69) (75.90)
thp+pos+ow 92.26 88.51 90.35
(80.52) (70.96) (75.44)

In Row 1, because thp contains features directly obtained from the annotated corpus, we take this performance as our referential performance. In Row 2, with only the ow feature, the F-score already reaches 77.74% on the 6 types and 60.31% on the 18 subtypes. The last row shows that adding pos has almost no effect on the performance when both thp and ow are in use. These results show that ow is effective for Chinese relation extraction.

The superiority of the Omni-word feature rests on three reasons. First, the specificity of Chinese word-formation indicates that the sub-phrases of a Chinese word (or phrase) are also informative. Second, most relation instances have limited context; the Omni-word feature, utilizing every possible word in them, is a better way to capture more information. Third, the entity mentions are manually annotated, so they can precisely segment the relation instance into the corresponding bins, and this segmentation into bins carries the sentence structure information. Therefore, the Omni-word feature with bin information makes better use of both the syntactic information and the local dependency.

4.2 Comparison

Various systems have been proposed for Chinese relation extraction. We mainly focus on systems trained and tested on the ACE corpus. Table 2 lists three such systems.

Table 2: Survey of Other Systems
System P R F
Che et al. [] 76.13 70.18 73.27
Zhang et al. [] 80.71 62.48 70.43
(77.75) (60.20) (67.86)
Liu et al. [] 81.1 61.0 69.0
(79.1) (57.5) (66.6)
Table 3: Comparing With the State-of-the-Art Methods
System Feature Set P R F
[] Ei.Type, Ei.Subtype, Order, Word±Ei,1, Word±Ei,2, POS±Ei,1, POS±Ei,2 84.81 75.69 79.99
(64.89) (52.99) (58.34)
[] Ei.Type, Ei.Subtype, 9 Position Features, Uni-Gram, Bi-Gram 79.56 72.99 76.13
(66.78) (54.56) (60.06)
Ours thp+pos+ow 92.26 88.51 90.35
(80.52) (70.96) (75.44)

Che et al. [] was implemented on the ACE 2004 corpus, with 2/3 of the data for training and 1/3 for testing. The performance was reported on 7 relation types: the 6 major relation types and the none relation (i.e. negative instances). Zhang et al. [] was based on the ACE 2005 corpus, with 75% of the data for training and 25% for testing; performances on the 7 types and 19 subtypes were given. Both of them are feature based methods. Liu et al. [] is a kernel based method evaluated on the ACE 2005 corpus; five-fold cross validation was used and performances were reported on the 6 relation types and 18 subtypes.

Differences in data preprocessing make it difficult to compare our experiments directly with others. In order to give a better comparison with the state-of-the-art methods, based on our experimental settings and data, we implemented the two feature based methods proposed by Che et al. [] and Zhang et al. [] in Table 2. The results are shown in Table 3.

In Table 3, Ei (i = 1, 2) represents an entity mention. “Order” in Che et al. [] denotes the position structure of the entity mention pair; four types of order are employed (the same as ours). Word±Ei,k and POS±Ei,k are the words and POS tags adjacent to Ei, where “±k” means the k-th word (or POS tag) after (+) or before (−) the corresponding entity mention. In this paper, k = 1 and k = 2 were used.

In Row 2, “Uni-Gram” represents the uni-gram features of the internal and external character sequences. The internal character sequences are the four entity extent and head mentions. Five kinds of external character sequences are used: one in-between character sequence between E1 and E2, and four character sequences around E1 and E2 within a given window size w_s. The w_s is set to 4. “Bi-Gram” is the 2-gram feature of the internal and external character sequences. Instead of the 4 position structures, the 9 position structures are used; please refer to Zhang et al. [] for the details of these 9 position structures.

Table 3 shows that our system outperforms the other systems in F-score by 10% on the 6 relation types and by 15% on the 18 subtypes.

For researchers who are interested in our work, the source code of our system and our implementations of Che et al. [] and Zhang et al. [] are available at https://github.com/YPench/CRDC.

5 Discussion

In this section, we analyze the influence of the employed feature sets and constraint conditions on the performance.

Most papers on relation extraction try to augment the number of employed features. In our experiments, we found that this does not always guarantee the best performance, even though the adopted classifier is claimed to handle these features independently. Because features may interact with each other in an indirect way, even with the same feature set, different constraint conditions can have a significant influence on the final performance.

In Section 3, we introduced five candidate feature sets. Instead of using them as independent features, we combined them with additional information, proposing four constraint conditions to generate the soft constraint features. Table 4 compares the performance of the candidate features when different constraint conditions are employed.

Table 4: Influence of Feature Set
No. Feature Constraint Condition Par P R F I
1 entity CLASS and LDCTYPE (1)/as singleton 21,112 60.29 42.82 50.07 -4.39
21,910 (41.70) (25.18) (31.40) -12.09
2 (1)/combined with positional Info 21,159 63.02 44.47 52.15 -2.31
22,013 (41.61) (26.31) (32.24) -11.25
3 (1)/as semantic pair 21,207 63.35 47.67 54.40 -0.06
22,068 (42.98) (31.34) (36.25) -7.24
4 Type, Subtype semantic pair (1)/as singleton 19,390 51.37 29.16 37.20 -17.26
147,435 (32.8) (18.97) (24.06) -19.43
5 (1)/combined with positional info 19,524 61.77 43.67 51.17 -3.29
20,297 (41.13) (26.83) (32.47) -11.02
6 (5)/as singleton 105,865 91.39 87.92 89.62 -0.73
121,218 (79.32) (68.73) (73.65) -1.79
7 head noun (3)/as singleton 21,450 85.66 75.74 80.40 -0.36
22,409 (64.38) (57.14) (60.55) -0.34
8 (3)/as semantic pair 77,333 83.05 73.14 77.78 -2.54
77,947 (59.70) (51.70) (55.41) -5.48
9 (5)/as singleton 100,963 92.50 88.90 90.66 +0.31
115,499 (82.63) (71.67) (76.76) +1.32
10 adjacent entity POS tag (3)/as singleton 21,450 72.66 61.16 66.41 -13.91
22,409 (62.42) (45.69) (52.76) -8.13
11 (3)/combined with entity type 22,151 80.66 71.67 75.90 -4.42
23,357 (63.41) (53.16) (57.83) -3.06
12 (5)/as singleton 106,931 92.50 88.66 90.54 +0.19
121,194 (82.04) (71.36) (76.33) +0.89
13 Omni-word feature (2)/By-Segmentation  as singleton 36,916 67.19 60.12 63.46 -14.28
41,652 (55.85) (44.50) (49.54) -10.77
14 (2)/By-Segmentation with bins 79,430 71.12 66.90 68.95 -8.79
84,715 (54.76) (43.50) (48.48) -11.83
15 (2)/By-Omni-word as singleton 47,428 69.67 63.77 66.59 -11.15
57,702 (54.85) (48.84) (51.67) -8.64
16 (5)/as singleton 57,321 91.43 86.37 88.83 -1.52
67,722 (76.43) (69.57) (72.84) -2.60

In Column 3 of Table 4 (Constraint Condition), (1), (2), (3), (4) and (5) stand for the referential feature sets in Table 1 (they denote thp, ow, thp+pos, thp+ow and thp+pos+ow respectively). The symbol “/” means that the corresponding candidate features in the referential feature set are substituted by the new constraint condition. Par in Column 4 is the number of parameters in the trained maximum entropy model, which indicates the model complexity. I in the last column is the influence on performance; “-” and “+” mean that the performance is decreased or increased.

The first observation is that combined features are more powerful than features used as singletons. The number of model parameters is increased by the combined features. This increase projects the relation extraction problem into a higher dimensional space, making the decision boundaries more flexible.

The named entities in the ACE corpus are also annotated with the CLASS and LDCTYPE labels. Zhou et al. [] showed that these labels can result in a weaker performance. Rows 1, 2 and 3 show that, no matter how they are used, the performance decreases noticeably. The performance degradation may be caused by over-fitting or data sparseness.

Most of the time, an increase in model parameters can result in a better performance. The exceptions are Row 8 and Row 11: when the two head nouns of the entity pair are combined as a semantic pair, and when the POS tag is combined with the entity type, the performance decreases. There are 7,356 head nouns in the training set. Combining two head nouns may increase the feature space by 7,356 × (7,356 − 1), i.e. roughly 54 million possible pairs. Such a large feature space makes the occurrence of features close to a random distribution, leading to worse data sparseness.

In Rows 4, 10 and 13, where these features are used as singletons, the performance degrades considerably. This means that the lack of sentence structure information on the employed features can lead to a bad performance.

Rows 9 and 12 show an interesting result. Comparing the reference set (5) with the reference set (3), the head noun and adjacent entity POS tag achieve a better performance when used as singletons. These results reflect the interactions between different features; discussion of this issue is beyond the scope of this paper. For a better demonstration of the constraint conditions, we still use Position Sensitive as the default setting for the head noun and the adjacent entity POS tag.

Rows 13 and 14 compare the Omni-word feature (By-Omni-word) with the traditional segmentation based feature (By-Segmentation). By-Segmentation denotes the traditional segmentation based feature set generated by a segmentation tool, collecting every word the tool outputs for the relation mention. Here, the ICTCLAS package is also adopted.

Conventionally, if a sentence is perfectly segmented, By-Segmentation is straightforward and effective. However, our experiments show a different picture. Rows 13 and 14 show that the Omni-word method outperforms the traditional method. Especially when the bin information is used (Row 15), the performance of the Omni-word feature increases considerably.

Row 14 shows that, compared with the traditional method, the Omni-word feature improves the F-score by about 8.79% on the 6 relation types and 11.83% on the 18 subtypes. This improvement may be attributed to the three reasons discussed in Section 4.1.

In short, Table 4 shows that the entity type and subtype maximize the performance when used as a semantic pair, the head noun and adjacent entity POS tag are best combined with positional information, and the Omni-word feature with bin information increases the performance considerably. Our model (in Section 4.1) uses these settings, which ensures that the performance of the candidate features is optimized.

6 Conclusion

In this paper, we proposed a novel Omni-word feature that takes advantage of Chinese sub-phrases. We also introduced the soft constraint method for Chinese relation recognition. The soft constraint utilizes four constraint conditions to capture the structure information in a relation instance. Both the Omni-word feature and the soft constraint make better use of the information a sentence carries, and minimize the deficiencies caused by Chinese segmentation and parsing.

The size of the employed lexicon determines the dimension of the feature space. The first impression is that more lexicon entries result in more power. However, more lexicon entries also increase the computational complexity and introduce noise. We will study this issue in future work. The notion of soft constraints can also be extended to include more patterns, rules, regular expressions or syntactic constraints that have been used for information extraction. The usability of these strategies is also left for future work.

Acknowledgments

The research was supported in part by NSF of China (91118005, 91218301, 61221063); 863 Program of China (2012AA011003); Cheung Kong Scholar’s Program; Pillar Program of NST (2012BAH16F02); Ministry of Education of China Humanities and Social Sciences Project (12YJC880117); The Ministry of Education Innovation Research Team (IRT13035).

References