Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives, but this typically requires expensive integer programming. We explore instead the use of Lagrangian relaxation to decouple the two subproblems and solve them separately. While dynamic programming is viable for bigram-based sentence compression, finding optimal compressed trees within graphs is NP-hard. We recover approximate solutions to this problem using LP relaxation and maximum spanning tree algorithms, yielding techniques that can be combined with the efficient bigram-based inference approach using Lagrange multipliers. Experiments show that these approximation strategies produce results comparable to a state-of-the-art integer linear programming formulation for the same joint inference task along with a significant improvement in runtime.
Sentence compression is a text-to-text generation task in which an input sentence must be transformed into a shorter output sentence which accurately reflects the meaning in the input and also remains grammatically well-formed. The compression task has received increasing attention in recent years, in part due to the availability of datasets such as the Ziff-Davis corpus [24] and the Edinburgh compression corpora [5], from which the following example is drawn.
Original: In 1967 Chapman, who had cultivated a conventional image with his ubiquitous tweed jacket and pipe, by his own later admission stunned a party attended by his friends and future Python colleagues by coming out as a homosexual.
Compressed: In 1967 Chapman, who had cultivated a conventional image, stunned a party by coming out as a homosexual.
A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text [36, 7] or an arc factorization over input dependency parses [16, 17, 15]. Joint methods have also been proposed that invoke integer linear programming (ILP) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies [34] or n-grams and all possible dependencies [45]. However, it is well-established that the utility of ILP for optimal inference in structured problems is often outweighed by the worst-case performance of ILP solvers on large problems without unique integral solutions. Furthermore, approximate solutions can often be adequate for real-world generation systems, particularly in the presence of linguistically-motivated constraints such as those described by Clarke and Lapata (2008), or domain-specific pruning strategies such as the use of sentence templates to constrain the output.
In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions. However, while the former problem can be solved efficiently using the dynamic programming approach of McDonald (2006), there are no efficient algorithms to recover maximum weighted non-projective subtrees in a general directed graph. Maximum spanning tree algorithms, commonly used in non-projective dependency parsing [35], are not easily adaptable to this task since the maximum-weight subtree is not necessarily a part of the maximum spanning tree.
We therefore consider methods to recover approximate solutions for the subproblem of finding the maximum weighted subtree in a graph, the most common of which is the use of a linear programming (LP) relaxation. This LP appears empirically tight for compression problems, and our experiments indicate that simply using its non-integral solutions in Lagrangian relaxation can lead to reasonable compressions. In addition, we can recover approximate solutions to this problem by using the Chu-Liu Edmonds algorithm for recovering maximum spanning trees [4, 14] over the relatively sparse subgraph defined by a solution to the relaxed LP. Our proposed approximation strategies are evaluated using automated metrics in order to address the question: under what conditions should a real-world sentence compression system prefer exact ILP inference over approximate inference? The contributions of this work include:
An empirically-useful technique for approximating the maximum-weight subtree in a weighted graph using LP-relaxed inference.
Multiple approaches to generate good approximate solutions for joint multi-structure compression, based on Lagrangian relaxation to enforce equality between the sequential and syntactic inference subproblems.
An analysis of the tradeoffs incurred by joint approaches with regard to runtime as well as performance under automated measures.
Even though compression is typically formulated as a token deletion task, it is evident that dropping tokens independently from an input sentence will likely not result in fluent and meaningful compressive text. Tokens in well-formed sentences participate in a number of syntactic and semantic relationships with other tokens, so one might expect that accounting for heterogeneous structural relationships between tokens will improve the coherence of the output sentence. Furthermore, much recent work has focused on the challenge of joint sentence extraction and compression, also known as compressive summarization [34, 2, 1, 30, 41], in which questions of efficiency are paramount due to the larger problems involved; however, these approaches largely restrict compression to pruning parse trees, thereby imposing a dependency on parser performance. We focus in this work on a sentence-level compression system to approximate the ILP-based inference of Thadani and McKeown (2013) which does not restrict compressions to follow input parses but permits the generation of novel dependency relations in output compressions.
The rest of this section is organized as follows: §2.1 provides an overview of the joint sequential and syntactic objective for compression from Thadani and McKeown (2013) while §2.2 discusses the use of Lagrange multipliers to enforce consistency between the different structures considered. Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies. Finally, §2.5 discusses the features and model training approach used in our experimental results which are presented in §3.
We begin with some notation. For an input sentence $S$ comprised of $n$ tokens including duplicates, we denote the set of tokens in $S$ by $T \triangleq \{t_i : 1 \le i \le n\}$. Let $C$ represent a compression of $S$ and let $x_i \in \{0,1\}$ denote an indicator variable whose value corresponds to whether token $t_i \in T$ is present in the compressed sentence $C$. In addition, we define bigram indicator variables $y_{ij} \in \{0,1\}$ to represent whether a particular order-preserving bigram $\langle t_i, t_j \rangle$ from $S$ is present as a contiguous bigram in $C$, as well as dependency indicator variables $z_{ij} \in \{0,1\}$ corresponding to whether the dependency arc $t_i \rightarrow t_j$ is present in the dependency parse of $C$. (Although Thadani and McKeown (2013) is not restricted to bigrams or order-preserving n-grams, we limit our discussion to this scenario as it also fits the assumptions of McDonald (2006) and the datasets of Clarke and Lapata (2006).) The score for a given compression $C$ can now be defined to factor over its tokens, n-grams and dependencies as follows:

$\text{score}(C) = \sum_{i} x_i\,\theta_{\text{tok}}(t_i) + \sum_{i<j} y_{ij}\,\theta_{\text{bgr}}(\langle t_i, t_j \rangle) + \sum_{i \ne j} z_{ij}\,\theta_{\text{dep}}(t_i \rightarrow t_j)$ (1)

where $\theta_{\text{tok}}$, $\theta_{\text{bgr}}$ and $\theta_{\text{dep}}$ are feature-based scoring functions for tokens, bigrams and dependencies respectively. Specifically, each $\theta_v(\cdot) \triangleq \mathbf{w}_v^\top \boldsymbol{\phi}_v(\cdot)$ where $\boldsymbol{\phi}_v$ is a feature map for a given variable type $v \in \{\text{tok}, \text{bgr}, \text{dep}\}$ and $\mathbf{w}_v$ is the corresponding vector of learned parameters.
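To make the factored scoring scheme concrete, the following is a minimal Python sketch of how such scoring functions could be composed; the function names and feature-map interface are illustrative assumptions, not the paper's feature set.

```python
import numpy as np

def make_scorer(w, phi):
    """Feature-based scoring function theta_v(.) = w_v . phi_v(.) for one variable type v."""
    return lambda item: float(np.dot(w, phi(item)))

def score_compression(tokens, bigrams, deps, theta_tok, theta_bgr, theta_dep):
    """Score a compression as the sum of its active token, bigram and dependency scores, as in (1)."""
    return (sum(theta_tok(t) for t in tokens)
            + sum(theta_bgr(b) for b in bigrams)
            + sum(theta_dep(d) for d in deps))
```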
The inference task involves recovering the highest scoring compression $\hat{C}$ under a particular set of model parameters $\mathbf{w}$:

$\hat{\mathbf{x}}, \hat{\mathbf{y}}, \hat{\mathbf{z}} = \arg\max_{\mathbf{x},\mathbf{y},\mathbf{z}}\ \mathbf{x}^\top \boldsymbol{\theta}_{\text{tok}} + \mathbf{y}^\top \boldsymbol{\theta}_{\text{bgr}} + \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}}$ (2)

where the incidence vector $\mathbf{x} \triangleq \langle x_i \rangle_{t_i \in T}$ represents an entire token configuration over $T$, with $\mathbf{y}$ and $\mathbf{z}$ defined analogously to represent configurations of bigrams and dependencies. $\boldsymbol{\theta}_v$ denotes a corresponding vector of scores for each variable type $v$ under the current model parameters. In order to recover meaningful compressions by optimizing (2), the inference step must ensure:
The configurations $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$ are consistent with each other, i.e., all configurations cover the same tokens.
The structural configurations $\mathbf{y}$ and $\mathbf{z}$ are non-degenerate, i.e., the bigram configuration $\mathbf{y}$ represents an acyclic path while the dependency configuration $\mathbf{z}$ forms a tree.
These requirements naturally rule out simple approximate inference formulations such as search-based approaches for the joint objective. (This work follows Thadani and McKeown (2013) in recovering non-projective trees for inference. However, recovering projective trees is tractable when a total ordering of output tokens is assumed; this will be addressed in future work.) An ILP-based inference solution is demonstrated in Thadani and McKeown (2013) that makes use of linear constraints over the boolean variables $x_i$, $y_{ij}$ and $z_{ij}$ to guarantee consistency, as well as auxiliary real-valued variables and constraints representing the flow of commodities [31] in order to establish structure in $\mathbf{y}$ and $\mathbf{z}$. In the following section, we propose an alternative formulation that exploits the modularity of this joint objective.
Dual decomposition [26] and Lagrangian relaxation in general are often used for solving joint inference problems which are decomposable into individual subproblems linked by equality constraints [27, 44, 43, 13, 32, 11, 1]. This approach permits sub-problems to be solved separately using problem-specific efficient algorithms, while consistency over the structures produced is enforced through Lagrange multipliers via iterative optimization. Exact solutions are guaranteed when the algorithm converges on a consistent primal solution, although this convergence itself is not guaranteed and depends on the tightness of the underlying LP relaxation. The primary advantage of this technique is the ability to leverage the underlying structure of the problems in inference rather than relying on a generic ILP formulation while still often producing exact solutions.
The multi-structure inference problem described in the previous section seems in many ways to be a natural fit to such an approach since output scores factor over different types of structure that comprise the output compression. Even if ILP-based approaches perform reasonably at the scale of single-sentence compression problems, the exponential worst-case complexity of general-purpose ILPs will inevitably pose challenges when scaling up to (a) handle larger inputs, (b) use higher-order structural fragments, or (c) incorporate additional models.
Consider once more the optimization problem characterized by (2). The two structural problems that need to be solved in this formulation are the extraction of a maximum-weight acyclic subsequence of bigrams $\mathbf{y}$ from the lattice of all order-preserving bigrams from $S$ and the recovery of a maximum-weight directed subtree $\mathbf{z}$. Let $\alpha(\mathbf{y})$ denote the incidence vector of tokens contained in the n-gram sequence $\mathbf{y}$ and $\beta(\mathbf{z})$ denote the incidence vector of words contained in the dependency tree $\mathbf{z}$. We can now rewrite the objective in (2) while enforcing the constraint that the words contained in the sequence $\mathbf{y}$ are the same as the words contained in the tree $\mathbf{z}$, i.e., $\alpha(\mathbf{y}) = \beta(\mathbf{z})$, by introducing a vector of Lagrange multipliers $\boldsymbol{\lambda} \in \mathbb{R}^n$. In addition, the token configuration $\mathbf{x}$ can be rewritten in the form of a weighted combination of $\alpha(\mathbf{y})$ and $\beta(\mathbf{z})$, with a ratio $\psi \in [0,1]$, to ensure its consistency with $\mathbf{y}$ and $\mathbf{z}$. This results in the following Lagrangian:

$L(\boldsymbol{\lambda}, \mathbf{y}, \mathbf{z}) = \mathbf{y}^\top \boldsymbol{\theta}_{\text{bgr}} + \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}} + \big(\psi\,\alpha(\mathbf{y}) + (1-\psi)\,\beta(\mathbf{z})\big)^\top \boldsymbol{\theta}_{\text{tok}} + \boldsymbol{\lambda}^\top \big(\alpha(\mathbf{y}) - \beta(\mathbf{z})\big)$ (3)
Finding the $\mathbf{y}$ and $\mathbf{z}$ that maximize this Lagrangian yields a dual objective, and the dual problem corresponding to the primal objective specified in (2) is therefore the minimization of this objective over the Lagrange multipliers $\boldsymbol{\lambda}$:

$\min_{\boldsymbol{\lambda}} L(\boldsymbol{\lambda}) = \min_{\boldsymbol{\lambda}} \Big( \max_{\mathbf{y}} f(\mathbf{y}, \boldsymbol{\lambda}) + \max_{\mathbf{z}} g(\mathbf{z}, \boldsymbol{\lambda}) \Big)$ (4)

where the subproblem objectives are given by $f(\mathbf{y}, \boldsymbol{\lambda}) \triangleq \mathbf{y}^\top \boldsymbol{\theta}_{\text{bgr}} + \psi\,\alpha(\mathbf{y})^\top \boldsymbol{\theta}_{\text{tok}} + \boldsymbol{\lambda}^\top \alpha(\mathbf{y})$ and $g(\mathbf{z}, \boldsymbol{\lambda}) \triangleq \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}} + (1-\psi)\,\beta(\mathbf{z})^\top \boldsymbol{\theta}_{\text{tok}} - \boldsymbol{\lambda}^\top \beta(\mathbf{z})$.
This can now be solved with the iterative subgradient algorithm illustrated in Algorithm 2.2. In each iteration $i$, the algorithm solves for $\hat{\mathbf{y}}^{(i)}$ and $\hat{\mathbf{z}}^{(i)}$ under $\boldsymbol{\lambda}^{(i)}$, then generates $\boldsymbol{\lambda}^{(i+1)}$ to penalize inconsistencies between $\alpha(\hat{\mathbf{y}}^{(i)})$ and $\beta(\hat{\mathbf{z}}^{(i)})$. When $\alpha(\hat{\mathbf{y}}^{(i)}) = \beta(\hat{\mathbf{z}}^{(i)})$, the resulting primal solution is exact, i.e., $\hat{\mathbf{y}}^{(i)}$ and $\hat{\mathbf{z}}^{(i)}$ represent the optimal structures under (2). Otherwise, if the algorithm starts oscillating between a few primal solutions, the underlying LP must have a non-integral solution, in which case approximation heuristics can be employed. (Heuristic approaches [26, 44], tightening [43] or branch and bound [11] can still be used to retrieve optimal solutions, but we did not explore these strategies here.) The application of this Lagrangian relaxation strategy is contingent upon the existence of algorithms to solve the maximization subproblems for $\hat{\mathbf{y}}$ and $\hat{\mathbf{z}}$. The following sections discuss our approach to these problems.
Algorithm 2.2: Joint inference via Lagrangian relaxation

Input: scores $\boldsymbol{\theta}$, ratio $\psi$, repetition limit $l_{\max}$, iteration limit $i_{\max}$, learning rate schedule $\boldsymbol{\eta}$
Output: token configuration $\mathbf{x}$

1:  $\boldsymbol{\lambda}^{(0)} \leftarrow \mathbf{0}$
2:  $M_{\mathbf{y}} \leftarrow \emptyset$, $M_{\mathbf{z}} \leftarrow \emptyset$, repetition count $r \leftarrow 0$
3:  for iteration $i < i_{\max}$ do
4:      $\hat{\mathbf{y}}^{(i)} \leftarrow \arg\max_{\mathbf{y}} f(\mathbf{y}, \boldsymbol{\lambda}^{(i)})$   (bigram subproblem, §2.3)
5:      $\hat{\mathbf{z}}^{(i)} \leftarrow \arg\max_{\mathbf{z}} g(\mathbf{z}, \boldsymbol{\lambda}^{(i)})$   (dependency subproblem, §2.4)
6:      if $\alpha(\hat{\mathbf{y}}^{(i)}) = \beta(\hat{\mathbf{z}}^{(i)})$ then
7:          return $\psi\,\alpha(\hat{\mathbf{y}}^{(i)}) + (1-\psi)\,\beta(\hat{\mathbf{z}}^{(i)})$
8:      end if
9:      if $\hat{\mathbf{y}}^{(i)} \in M_{\mathbf{y}}$ or $\hat{\mathbf{z}}^{(i)} \in M_{\mathbf{z}}$ then
10:         $r \leftarrow r + 1$
11:     end if
12:     if $\hat{\mathbf{y}}^{(i)} \notin M_{\mathbf{y}}$ then
13:         $M_{\mathbf{y}} \leftarrow M_{\mathbf{y}} \cup \{\hat{\mathbf{y}}^{(i)}\}$
14:     end if
15:     if $r \geq l_{\max}$ then break
16:     $M_{\mathbf{z}} \leftarrow M_{\mathbf{z}} \cup \{\hat{\mathbf{z}}^{(i)}\}$
17:     $\boldsymbol{\lambda}^{(i+1)} \leftarrow \boldsymbol{\lambda}^{(i)} - \eta_i\,\big(\alpha(\hat{\mathbf{y}}^{(i)}) - \beta(\hat{\mathbf{z}}^{(i)})\big)$
18: end for
19: return the highest-scoring token configuration recovered from the stored solutions under (2)
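As a concrete illustration, the following is a minimal Python sketch of this subgradient loop; the callback interfaces, bookkeeping, and default hyperparameters are assumptions made for illustration and do not reflect the original implementation.

```python
import numpy as np

def joint_inference(solve_bigram, solve_dependency, score_primal, n,
                    psi=0.9, i_max=100, l_max=20, eta=lambda i: 1.0 / (1 + i)):
    """Lagrangian relaxation driver for joint compression inference.

    solve_bigram(lam) -> (y, alpha): solves max_y f(y, lam), returning the token incidence vector.
    solve_dependency(lam) -> (z, beta): solves max_z g(z, lam), returning the token incidence vector.
    score_primal(x) -> float: scores a candidate token configuration under objective (2).
    """
    lam = np.zeros(n)                      # one multiplier per input token
    seen_y, seen_z, candidates, repeats = set(), set(), [], 0
    for i in range(i_max):
        y, alpha = solve_bigram(lam)       # Section 2.3
        z, beta = solve_dependency(lam)    # Section 2.4
        if np.array_equal(alpha, beta):    # subproblems agree: exact primal solution
            return psi * alpha + (1 - psi) * beta
        key_y, key_z = tuple(alpha), tuple(np.round(beta, 6))
        if key_y in seen_y or key_z in seen_z:
            repeats += 1                   # oscillation between repeated solutions
        seen_y.add(key_y)
        seen_z.add(key_z)
        candidates.append(alpha)           # bigram solutions are always valid compressions
        if repeats >= l_max:
            break
        lam -= eta(i) * (alpha - beta)     # subgradient step on the consistency constraint
    # fall back to the best stored candidate under the primal objective
    return max(candidates, key=score_primal)
```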
McDonald (2006) provides a Viterbi-like dynamic programming algorithm to recover the highest-scoring sequence of order-preserving bigrams from a lattice, either in unconstrained form or with a specific length constraint. The latter requires a dynamic programming table $T[i][r]$ which represents the best score for a compression of length $r$ ending at token $t_i$. The table can be populated using the following recurrence:

$T[i][r] = \max_{j < i}\ T[j][r-1] + \tilde{\theta}_{\text{bgr}}(\langle t_j, t_i \rangle)$

with the best compression of length $R$ recovered by backtracking from $\max_i T[i][R]$, where $R$ is the required number of output tokens and the scoring function $\tilde{\theta}_{\text{bgr}}$ is defined as

$\tilde{\theta}_{\text{bgr}}(\langle t_j, t_i \rangle) \triangleq \theta_{\text{bgr}}(\langle t_j, t_i \rangle) + \psi\,\theta_{\text{tok}}(t_i) + \lambda_i$

so as to solve $\max_{\mathbf{y}} f(\mathbf{y}, \boldsymbol{\lambda})$ from (4). This approach requires $O(n^2 R)$ time in order to identify the highest scoring sequence $\hat{\mathbf{y}}$ and corresponding token configuration $\alpha(\hat{\mathbf{y}})$.
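To make the recurrence concrete, here is a minimal Python sketch of the length-constrained Viterbi-style DP; the function names and the START-index convention are assumptions for illustration, not the original implementation.

```python
import math

def best_bigram_sequence(n, R, theta_bgr):
    """n: number of input tokens (indexed 1..n, with 0 as START); R: required output length;
    theta_bgr(j, i): adjusted bigram score for j < i, already folding in psi * theta_tok + lambda.
    Returns (score, list of selected token indices)."""
    assert 1 <= R <= n
    NEG = -math.inf
    T = [[NEG] * (R + 1) for _ in range(n + 1)]     # T[i][r]: best length-r output ending at t_i
    back = [[None] * (R + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                       # base case: t_i is the first output token
        T[i][1] = theta_bgr(0, i)
    for r in range(2, R + 1):
        for i in range(r, n + 1):                   # need at least r input tokens for length r
            for j in range(r - 1, i):
                cand = T[j][r - 1] + theta_bgr(j, i)
                if cand > T[i][r]:
                    T[i][r], back[i][r] = cand, j
    end = max(range(1, n + 1), key=lambda i: T[i][R])
    seq, i, r = [], end, R
    while r >= 1:                                   # backtrack through the table
        seq.append(i)
        i, r = back[i][r], r - 1
    return T[end][R], list(reversed(seq))
```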
The maximum-weight non-projective subtree problem over general graphs is not as easily solved. Although the maximum spanning tree for a given token configuration can be recovered efficiently, Figure 1 illustrates that the maximum-scoring subtree is not necessarily found within it. The problem of recovering a maximum-weight subtree in a graph has been shown to be NP-hard even with uniform edge weights [29].
In order to produce a solution to this subproblem, we use an LP relaxation of the relevant portion of the ILP from Thadani and McKeown (2013) by omitting integer constraints over the token and dependency variables in $\mathbf{x}$ and $\mathbf{z}$ respectively. For simplicity, however, we describe the ILP version rather than the relaxed LP in order to motivate the constraints with their intended purpose rather than their effect in the relaxed problem. The objective for this LP is given by

$\max_{\mathbf{x}, \mathbf{z}}\ \mathbf{x}^\top \tilde{\boldsymbol{\theta}}_{\text{tok}} + \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}}$ (5)

where the vector of token scores $\tilde{\boldsymbol{\theta}}_{\text{tok}}$ is redefined as

$\tilde{\theta}_{\text{tok}}(t_i) \triangleq (1 - \psi)\,\theta_{\text{tok}}(t_i) - \lambda_i$ (6)

in order to solve $\max_{\mathbf{z}} g(\mathbf{z}, \boldsymbol{\lambda})$ from (4).
Linear constraints are introduced to produce dependency structures that are close to the optimal dependency trees. First, tokens in the solution must only be active if they have a single active incoming dependency edge. In addition, to avoid producing multiple disconnected subtrees, only one dependency is permitted to attach to the root pseudo-token.
$x_j - \sum_{i} z_{ij} = 0 \quad \forall\, t_j \in T$ (7)

$\sum_{j} z_{\text{root},j} = 1$ (8)
In order to avoid cycles in the dependency tree, we include additional variables $\gamma_{ij}$ to establish single-commodity flow [31] between all pairs of tokens. These variables carry non-negative real values which must be consumed by active tokens that they are incident to.
$\sum_{i} \gamma_{ij} - \sum_{k} \gamma_{jk} = x_j \quad \forall\, t_j \in T$ (9)

$\gamma_{ij} \ge 0 \quad \forall\, i, j$ (10)
These constraints ensure that cyclic structures are not possible in the non-relaxed ILP. In addition, they serve to establish connectivity for the dependency structure since commodity can only originate in one location—at the pseudo-token root which has no incoming commodity variables. However, in order to enforce these properties on the output dependency structure, this acyclic, connected commodity structure must constrain the activation of the $z_{ij}$ variables.
$\gamma_{ij} \le \Gamma_{\max} \cdot z_{ij} \quad \forall\, i, j$ (11)

where $\Gamma_{\max}$ is an arbitrary upper bound on the value of the $\gamma_{ij}$ variables. Figure 2 illustrates how these commodity flow variables constrain the output of the ILP to be a tree. However, the effect of these constraints is diminished when solving an LP relaxation of the above problem.
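For concreteness, the following sketch shows how the objective (5) and constraints (7)–(11) could be assembled with an off-the-shelf modeling library (PuLP here). The data structures and helper names are assumptions made for illustration; the original system used a commercial ILP solver rather than this exact formulation.

```python
import pulp

def build_dependency_lp(theta_tok, theta_dep, relaxed=True):
    """theta_tok: dict token index -> score (already adjusted per eq. 6);
    theta_dep: dict (head, dependent) -> score. Token index 0 is the root pseudo-token."""
    n = len(theta_tok)                      # real tokens are 1..n
    tokens = list(range(1, n + 1))
    arcs = list(theta_dep)                  # candidate dependency arcs (head, dependent)
    var_cat = pulp.LpContinuous if relaxed else pulp.LpBinary

    prob = pulp.LpProblem("dep_subtree", pulp.LpMaximize)
    x = {j: pulp.LpVariable(f"x_{j}", 0, 1, var_cat) for j in tokens}
    z = {(h, d): pulp.LpVariable(f"z_{h}_{d}", 0, 1, var_cat) for h, d in arcs}
    g = {(h, d): pulp.LpVariable(f"g_{h}_{d}", 0, None) for h, d in arcs}  # flow, eq. (10)

    # objective (5)
    prob += (pulp.lpSum(theta_tok[j] * x[j] for j in tokens)
             + pulp.lpSum(theta_dep[a] * z[a] for a in arcs))
    # (7): an active token has exactly one active incoming arc
    for j in tokens:
        prob += pulp.lpSum(z[(h, d)] for (h, d) in arcs if d == j) == x[j]
    # (8): only one arc may leave the root pseudo-token
    prob += pulp.lpSum(z[(h, d)] for (h, d) in arcs if h == 0) == 1
    # (9): each active token consumes one unit of commodity
    for j in tokens:
        inflow = pulp.lpSum(g[(h, d)] for (h, d) in arcs if d == j)
        outflow = pulp.lpSum(g[(h, d)] for (h, d) in arcs if h == j)
        prob += inflow - outflow == x[j]
    # (11): commodity may only flow along active arcs (n serves as the arbitrary upper bound)
    for a in arcs:
        prob += g[a] <= n * z[a]
    return prob, x, z
```

The relaxed program can then be handed to any LP backend (e.g., `prob.solve()`), after which the fractional values `pulp.value(x[j])` and `pulp.value(z[a])` can be read off as the relaxed token and dependency configurations; the compression-rate constraint (12) discussed below can be added to this model in the same way.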
In the LP relaxation, $x_i$ and $z_{ij}$ are redefined as real-valued variables in $[0,1]$, potentially resulting in fractional values for dependency and token indicators. As a result, the commodity flow network is able to establish connectivity but cannot enforce a tree structure; for instance, directed acyclic structures are possible and token indicators may be partially assigned to the solution structure. This poses a challenge in implementing $\beta(\mathbf{z})$, which is needed to recover a token configuration from the solution of this subproblem.
We propose two alternative solutions to address this issue in the context of the joint inference strategy. The first is to simply use the relaxed token configuration identified by the LP in Algorithm 2.2, i.e., to set $\beta(\hat{\mathbf{z}}) \leftarrow \tilde{\mathbf{x}}$ where $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{z}}$ represent the real-valued counterparts of the incidence vectors $\mathbf{x}$ and $\mathbf{z}$ in the relaxed solution. The viability of this approximation strategy is due to the following:
The relaxed LP is empirically fairly tight, yielding integral solutions 89% of the time on the compression datasets described in §3.
The bigram subproblem is guaranteed to return a well-formed integral solution which obeys the imposed compression rate, so we are assured of a source of valid—if non-optimal—solutions in line 13 of Algorithm 2.2.
We also consider another strategy that attempts to approximate a valid integral solution to the dependency subproblem. In order to do this, we first include an additional constraint in the relaxed LP which restricts the number of tokens in the output to the specific number $R$ given by an input compression rate.
$\sum_{i} x_i = R$ (12)
The addition of this constraint to the relaxed LP reduces the rate of integral solutions drastically—from 89% to approximately 33%—but it serves to ensure that the resulting token configuration $\tilde{\mathbf{x}}$ has at least as many non-zero elements as $R$, i.e., there are at least as many tokens activated in the LP solution as are required in a valid solution.
We then construct a subgraph $G$ consisting of all dependency edges that were assigned non-zero values in the solution, assigning to each edge a score equal to the score of that edge in the LP as well as the score of its dependent word, i.e., each $z_{ij}$ in $G$ is assigned a score of $\theta_{\text{dep}}(t_i \rightarrow t_j) + \tilde{\theta}_{\text{tok}}(t_j)$. Since the commodity flow constraints in (9)–(11) ensure a connected $G$, it is therefore possible to recover a maximum-weight spanning tree from $G$ using the Chu-Liu Edmonds algorithm [4, 14]. (A detailed description of the Chu-Liu Edmonds algorithm for MSTs is available in McDonald et al. (2005).) Although the runtime of this algorithm is cubic in the size of the input graph, it is fairly speedy when applied on relatively sparse graphs such as the solutions to the LP described above. The resulting spanning tree is a useful integral approximation of $\mathbf{z}$ but, as mentioned previously, may contain more nodes than $R$ due to fractional values in $\tilde{\mathbf{x}}$; we therefore repeatedly prune leaves with the lowest incoming edge weight in the current tree until exactly $R$ nodes remain. The resulting tree is assumed to be a reasonable approximation of the optimal integral solution to this LP.
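A minimal sketch of this spanning-tree approximation follows, assuming networkx (whose `maximum_spanning_arborescence` implements Edmonds' algorithm) and a simple edge-list representation; the names and data structures are illustrative, not the original implementation.

```python
import networkx as nx

def approximate_subtree(edges, R):
    """edges: list of (head, dependent, score) for arcs with non-zero LP values, where scores
    already include the dependent's token score; node 0 is the root pseudo-token.
    Returns {dependent: head} for a pruned tree spanning exactly R tokens (plus the root)."""
    G = nx.DiGraph()
    for h, d, s in edges:
        G.add_edge(h, d, weight=s)
    # Chu-Liu Edmonds over the (connected) LP-support graph
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    # greedily prune the leaf with the lowest incoming edge weight until R tokens remain
    while tree.number_of_nodes() - 1 > R:          # -1 excludes the root pseudo-token
        leaves = [v for v in tree.nodes if tree.out_degree(v) == 0]
        worst = min(leaves, key=lambda v: G[next(tree.predecessors(v))][v]["weight"])
        tree.remove_node(worst)
    return {d: h for h, d in tree.edges}
```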
The Chu-Liu Edmonds algorithm is also employed for another purpose: when the underlying LP for the joint inference problem is not tight—a frequent occurrence in our compression experiments—Algorithm 2.2 will not converge on a single primal solution and will instead oscillate between solutions that are close to the dual optimum. We identify this phenomenon by counting repeated solutions and, if they exceed some threshold with at least one repeated solution from either subproblem, we terminate the update procedure for Lagrange multipliers and instead attempt to identify a good solution from the repeating ones by scoring them under (2). It is straightforward to recover and score a bigram configuration $\mathbf{y}$ from a token configuration $\beta(\mathbf{z})$. However, scoring solutions produced by the dynamic program from §2.3 also requires the score over a corresponding parse tree; this can be recovered by constructing a dependency subgraph over only the tokens that are active in $\alpha(\mathbf{y})$ and retrieving the maximum spanning tree for that subgraph using the Chu-Liu Edmonds algorithm.
The features used in this work are largely based on the features from Thadani and McKeown (2013).
The token feature map $\boldsymbol{\phi}_{\text{tok}}$ contains features for part-of-speech (POS) tag sequences of length up to 3 around the token, features for the dependency label of the token conjoined with its POS, lexical features for verb stems and non-word symbols, and morphological features that identify capitalized sequences, negations and words in parentheses.
The bigram feature map $\boldsymbol{\phi}_{\text{bgr}}$ contains features for POS patterns in a bigram, the labels of dependency edges incident to it, its likelihood under a Gigaword language model (LM) and an indicator for whether it is present in the input sentence.
The dependency feature map $\boldsymbol{\phi}_{\text{dep}}$ contains features for the probability of a dependency edge under a smoothed dependency grammar constructed from the Penn Treebank and various conjunctions of the following features: (a) whether the edge appears as a dependency or ancestral relation in the input parse, (b) the directionality of the dependency, (c) the label of the edge, (d) the POS tags of the tokens incident to the edge, and (e) the labels of their surrounding chunks and whether the edge remains within the chunk.
For the experiments in the following section, we trained models using a variant of the structured perceptron [10] which incorporates minibatches [50] for easy parallelization and faster convergence (we used a minibatch size of 4 in all experiments). Overfitting was avoided by averaging parameters and monitoring performance against a held-out development set during training. All models were trained using variants of the ILP-based inference approach of Thadani and McKeown (2013). We followed Martins et al. (2009) in using LP-relaxed inference during learning, assuming algorithmic separability [28] for these problems.
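As a rough illustration of this training regime, the following sketch shows a minibatched structured perceptron update with parameter averaging; the decoding callback, feature interface, and hyperparameters are assumptions for illustration, not the original training code.

```python
import numpy as np

def train_structured_perceptron(instances, decode, features, dim,
                                epochs=10, batch_size=4):
    """instances: list of (input, gold_structure); decode(x, w): highest-scoring structure under
    weights w (e.g., via LP-relaxed inference); features(x, y): joint feature vector of size dim."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)                  # running sum of weights for parameter averaging
    updates = 0
    for _ in range(epochs):
        np.random.shuffle(instances)
        for start in range(0, len(instances), batch_size):
            batch = instances[start:start + batch_size]
            delta = np.zeros(dim)
            for x, gold in batch:          # decoding within a minibatch can be parallelized
                pred = decode(x, w)
                if pred != gold:
                    delta += features(x, gold) - features(x, pred)
            w += delta / len(batch)        # aggregated update for the minibatch
            w_sum += w
            updates += 1
    return w_sum / max(updates, 1)         # averaged parameters to reduce overfitting
```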
We ran compression experiments over the newswire (NW) and broadcast news transcription (BN) corpora compiled by Clarke and Lapata (2008), which contain gold compressions produced by human annotators using only word deletion. The datasets were filtered to eliminate instances with fewer than 2 or more than 110 tokens for parser compatibility and divided into training/development/test sections following the splits from Clarke and Lapata (2008), yielding 953/63/603 instances for the NW corpus and 880/78/404 for the BN corpus. Gold dependency parses were approximated by running the Stanford dependency parser (http://nlp.stanford.edu/software/) over reference compressions.
Following evaluations in machine translation as well as previous work in sentence compression [47, 7, 34, 39, 45], we evaluate system performance using F1 metrics over n-grams and over dependency edges produced by parsing system output with RASP [3] and the Stanford parser. All ILPs and LPs were solved using Gurobi (http://www.gurobi.com), a high-performance commercial-grade solver. Following a recent analysis of compression evaluations [39] which revealed a strong correlation between system compression rate and human judgments of compression quality, we constrained all systems to produce compressed output at a specific rate—determined by the gold compressions available for each instance—to ensure that the reported differences between the systems under study are meaningful.
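For reference, n-gram F1 against a gold compression can be computed as in the following sketch (a simple illustrative implementation, not the official evaluation script):

```python
from collections import Counter

def ngram_f1(system_tokens, gold_tokens, n=2):
    """F1 over contiguous n-grams of the system and gold compressions."""
    sys_ngrams = Counter(zip(*[system_tokens[i:] for i in range(n)]))
    gold_ngrams = Counter(zip(*[gold_tokens[i:] for i in range(n)]))
    if not sys_ngrams or not gold_ngrams:
        return 0.0
    overlap = sum((sys_ngrams & gold_ngrams).values())
    precision = overlap / sum(sys_ngrams.values())
    recall = overlap / sum(gold_ngrams.values())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```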
Table 1: Compression results on the BN corpus: n-gram F1% for n = 1 to 4, syntactic relation F1% (dependencies recovered by inference, and parses of system output under the Stanford parser and RASP), and inference time.

| Objective | Technique | F1 n=1 | F1 n=2 | F1 n=3 | F1 n=4 | Dep. F1% | Stanford F1% | RASP F1% | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| n-grams | 3-LM (CL08) | 74.96 | 60.60 | 46.83 | 38.71 | - | 60.52 | 57.49 | 0.72 |
| n-grams | DP (McD06) | 78.80 | 66.04 | 52.67 | 42.39 | - | 63.28 | 57.89 | 0.01 |
| deps | LPMST | 79.61 | 64.32 | 50.36 | 40.97 | 66.57 | 66.82 | 59.70 | 0.07 |
| deps | ILP-Dep | 80.02 | 65.99 | 52.42 | 43.07 | 72.43 | 67.63 | 60.78 | 0.16 |
| joint | DP + LPMST | 79.50 | 66.75 | 53.48 | 44.33 | 64.63 | 67.69 | 60.94 | 0.24 |
| joint | DP + LP | 79.10 | 68.22 | 55.05 | 45.81 | 65.74 | 68.24 | 62.04 | 0.12 |
| joint | ILP-Joint (TM13) | 80.13 | 68.34 | 55.56 | 46.60 | 72.57 | 68.89 | 62.61 | 0.31 |
Table 2: Compression results on the NW corpus, with the same metrics as Table 1.

| Objective | Technique | F1 n=1 | F1 n=2 | F1 n=3 | F1 n=4 | Dep. F1% | Stanford F1% | RASP F1% | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| n-grams | 3-LM (CL08) | 66.66 | 51.59 | 39.33 | 30.55 | - | 50.76 | 49.57 | 1.22 |
| n-grams | DP (McD06) | 73.18 | 58.31 | 45.07 | 34.77 | - | 56.23 | 51.14 | 0.01 |
| deps | LPMST | 73.32 | 55.12 | 41.18 | 31.44 | 61.01 | 58.37 | 52.57 | 0.12 |
| deps | ILP-Dep | 73.76 | 57.09 | 43.47 | 33.44 | 65.45 | 60.06 | 54.31 | 0.28 |
| joint | DP + LPMST | 73.13 | 57.03 | 43.79 | 34.01 | 57.91 | 58.46 | 53.20 | 0.33 |
| joint | DP + LP | 72.06 | 59.83 | 47.39 | 37.72 | 58.13 | 58.97 | 53.78 | 0.21 |
| joint | ILP-Joint (TM13) | 74.00 | 59.90 | 47.22 | 37.01 | 65.65 | 61.29 | 56.24 | 0.60 |
We report results over the following systems grouped into three categories of models: tokens + n-grams, tokens + dependencies, and joint models.
3-LM: A reimplementation of the unsupervised ILP of Clarke and Lapata (2008) which infers order-preserving trigram variables parameterized with log-likelihood under an LM and a significance score for token variables inspired by Hori and Furui (2004), as well as various linguistically-motivated constraints to encourage fluency in output compressions.
LPMST: An approximate inference approach based on an LP relaxation of ILP-Dep. As discussed in §2.4, a maximum spanning tree is recovered from the output of the LP and greedily pruned in order to generate a valid integral solution while observing the imposed compression rate.
ILP-Dep: A version of the joint ILP of Thadani and McKeown (2013) without n-gram variables and corresponding features.
DP+LPMST: An approximate joint inference approach based on Lagrangian relaxation that uses DP for the maximum weight subsequence problem and LPMST for the maximum weight subtree problem.
DP+LP: Another Lagrangian relaxation approach that pairs DP with the non-integral solutions from an LP relaxation of the maximum weight subtree problem (cf. §2.4).
ILP-Joint: The full ILP from Thadani and McKeown (2013), which provides an upper bound on the performance of the proposed approximation strategies.
The learning rate schedule for the Lagrangian relaxation approaches was set for aggressive subgradient updates, while the hyperparameter $\psi$ was tuned using the development split of each corpus. We were surprised to observe that performance improved significantly when $\psi$ was set closer to 1, thereby emphasizing token features in the dependency subproblem.
Tables 1 and 2 summarize the results from our compression experiments on the BN and NW corpora respectively. Starting with the n-gram approaches, the performance of 3-LM leads us to observe that the gains of supervised learning far outweigh the utility of higher-order n-gram factorization, which is also responsible for a significant increase in wall-clock time. In contrast, DP is an order of magnitude faster than all other approaches studied here, although it is not competitive under parse-based measures such as RASP F1, which is known to correlate with human judgments of grammaticality [5].
We were surprised by the strong performance of the dependency-based inference techniques, which yielded results that approached the joint model in both n-gram and parse-based measures. The exact ILP-Dep approach halves the runtime of ILP-Joint to produce compressions that have similar (although statistically distinguishable) scores. Approximating dependency-based inference with LPMST halves runtime yet again; however, this comes at a noticeable cost in performance.
Turning to the joint approaches, the strong performance of ILP-Joint is expected; less expected is its relatively high yet practically reasonable runtime. We note, however, that these ILPs are solved using a highly-optimized commercial-grade solver that can utilize all CPU cores (16 cores in our experimental environment), while our approximation approaches are implemented as single-processed Python code without significant effort toward optimization. Comparing the two approximation strategies shows a clear performance advantage for DP+LP over DP+LPMST: the latter approach entails slower inference due to the overhead of running the Chu-Liu Edmonds algorithm at every dual update, and furthermore, the error introduced by approximating an integral solution results in a significant decrease in dependency recall. In contrast, DP+LP directly optimizes the dual problem by using the relaxed dependency solution to update Lagrange multipliers and achieves the best performance on parse-based F1 outside of the slower ILP approaches. Convergence rates also differ between the two techniques: DP+LP has a lower rate of empirical convergence (15% on BN and 4% on NW) when compared to DP+LPMST (19% on BN and 6% on NW).
Figure 3 shows the effect of input sentence length on inference time and performance for ILP-Joint and DP+LP over the NW test corpus (similar results were observed for the BN test corpus). The timing results reveal that the approximation strategy is consistently faster than the ILP solver. The variation in RASP F1 with input size indicates the viability of a hybrid approach which could balance accuracy and speed by using ILP-Joint for smaller problems and DP+LP for larger ones.
Sentence compression is one of the better-studied text-to-text generation problems and has been observed to play a significant role in human summarization [23, 22]. Most approaches to sentence compression are supervised [25, 42, 46, 36, 47, 18, 40, 9, 17, 19, 38, 15] following the release of datasets such as the Ziff-Davis corpus [24] and the Edinburgh compression corpora [5, 7], although unsupervised approaches—largely based on ILPs—have also received consideration [6, 7, 16]. Compression has also been used as a tool for document summarization [12, 49, 6, 34, 2, 48, 1, 37, 30, 41], with recent work formulating the summarization task as joint sentence extraction and compression and often employing ILP or Lagrangian relaxation. Monolingual compression also faces many obstacles common to decoding in machine translation, and a number of approaches which have been proposed to combine phrasal and syntactic models [21, 43] inter alia offer directions for future research into compression problems.
We have presented approximate inference strategies to jointly compress sentences under bigram and dependency-factored objectives by exploiting the modularity of the task and considering the two subproblems in isolation. Experiments show that one of these approximation strategies produces results comparable to a state-of-the-art integer linear program for the same joint inference task with a 60% reduction in average inference time.
The author is grateful to Alexander Rush for helpful discussions and to the anonymous reviewers for their comments. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D11PC20153. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.