Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives, but this typically requires expensive integer programming. We explore instead the use of Lagrangian relaxation to decouple the two subproblems and solve them separately. While dynamic programming is viable for bigram-based sentence compression, finding optimal compressed trees within graphs is NP-hard. We recover approximate solutions to this problem using LP relaxation and maximum spanning tree algorithms, yielding techniques that can be combined with the efficient bigram-based inference approach using Lagrange multipliers. Experiments show that these approximation strategies produce results comparable to a state-of-the-art integer linear programming formulation for the same joint inference task along with a significant improvement in runtime.
Sentence compression is a text-to-text generation task in which an input sentence must be transformed into a shorter output sentence which accurately reflects the meaning in the input and also remains grammatically well-formed. The compression task has received increasing attention in recent years, in part due to the availability of datasets such as the Ziff-Davis corpus [24] and the Edinburgh compression corpora [5], from which the following example is drawn.
Original: In 1967 Chapman, who had cultivated a conventional image with his ubiquitous tweed jacket and pipe, by his own later admission stunned a party attended by his friends and future Python colleagues by coming out as a homosexual.
Compressed: In 1967 Chapman, who had cultivated a conventional image, stunned a party by coming out as a homosexual.
A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text [36, 7] or an arc factorization over input dependency parses [16, 17, 15]. Joint methods have also been proposed that invoke integer linear programming (ILP) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies [34] or n-grams and all possible dependencies [45]. However, it is well-established that the utility of ILP for optimal inference in structured problems is often outweighed by the worst-case performance of ILP solvers on large problems without unique integral solutions. Furthermore, approximate solutions can often be adequate for real-world generation systems, particularly in the presence of linguistically-motivated constraints such as those described by Clarke and Lapata (2008), or domain-specific pruning strategies such as the use of sentence templates to constrain the output.
In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions. However, while the former problem can be solved efficiently using the dynamic programming approach of McDonald (2006), there are no efficient algorithms to recover maximum weighted non-projective subtrees in a general directed graph. Maximum spanning tree algorithms, commonly used in non-projective dependency parsing [35], are not easily adaptable to this task since the maximum-weight subtree is not necessarily a part of the maximum spanning tree.
We therefore consider methods to recover approximate solutions for the subproblem of finding the maximum weighted subtree in a graph, the most common of which is the use of a linear programming (LP) relaxation. This LP appears empirically tight for compression problems, and our experiments indicate that simply using its non-integral solutions in Lagrangian relaxation can lead to reasonable compressions. In addition, we can recover approximate solutions to this problem by using the Chu-Liu Edmonds algorithm for recovering maximum spanning trees [4, 14] over the relatively sparse subgraph defined by a solution to the relaxed LP. Our proposed approximation strategies are evaluated using automated metrics in order to address the question: under what conditions should a real-world sentence compression system prefer exact ILP inference over approximate inference? The contributions of this work include:
An empirically-useful technique for approximating the maximum-weight subtree in a weighted graph using LP-relaxed inference.
Multiple approaches to generate good approximate solutions for joint multi-structure compression, based on Lagrangian relaxation to enforce equality between the sequential and syntactic inference subproblems.
An analysis of the tradeoffs incurred by joint approaches with regard to runtime as well as performance under automated measures.
Even though compression is typically formulated as a token deletion task, it is evident that dropping tokens independently from an input sentence will likely not result in fluent and meaningful compressive text. Tokens in well-formed sentences participate in a number of syntactic and semantic relationships with other tokens, so one might expect that accounting for heterogeneous structural relationships between tokens will improve the coherence of the output sentence. Furthermore, much recent work has focused on the challenge of joint sentence extraction and compression, also known as compressive summarization [34, 2, 1, 30, 41], in which questions of efficiency are paramount due to the larger problems involved; however, these approaches largely restrict compression to pruning parse trees, thereby imposing a dependency on parser performance. We focus in this work on a sentence-level compression system to approximate the ILP-based inference of Thadani and McKeown (2013) which does not restrict compressions to follow input parses but permits the generation of novel dependency relations in output compressions.
The rest of this section is organized as follows: §2.1 provides an overview of the joint sequential and syntactic objective for compression from Thadani and McKeown (2013) while §2.2 discusses the use of Lagrange multipliers to enforce consistency between the different structures considered. Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies. Finally, §2.5 discusses the features and model training approach used in our experimental results which are presented in §3.
We begin with some notation. For an input sentence $S$ comprised of $n$ tokens including duplicates, we denote the set of tokens in $S$ by $T \triangleq \{t_i : 1 \le i \le n\}$. Let $C$ represent a compression of $S$ and let $x_i \in \{0,1\}$ denote an indicator variable whose value corresponds to whether token $t_i \in T$ is present in the compressed sentence $C$. In addition, we define bigram indicator variables $y_{ij} \in \{0,1\}$ to represent whether a particular order-preserving bigram $\langle t_i, t_j \rangle$ from $S$ is present as a contiguous bigram in $C$, as well as dependency indicator variables $z_{ij} \in \{0,1\}$ corresponding to whether the dependency arc $t_i \rightarrow t_j$ is present in the dependency parse of $C$. (Although Thadani and McKeown (2013) is not restricted to bigrams or order-preserving n-grams, we limit our discussion to this scenario as it also fits the assumptions of McDonald (2006) and the datasets of Clarke and Lapata (2006).) The score for a given compression $C$ can now be defined to factor over its tokens, n-grams and dependencies as follows:

$\text{score}(C) = \sum_{i} x_i\,\theta_{\text{tok}}(t_i) + \sum_{i<j} y_{ij}\,\theta_{\text{bgr}}(\langle t_i, t_j \rangle) + \sum_{i \ne j} z_{ij}\,\theta_{\text{dep}}(t_i \rightarrow t_j)$ (1)

where $\theta_{\text{tok}}$, $\theta_{\text{bgr}}$ and $\theta_{\text{dep}}$ are feature-based scoring functions for tokens, bigrams and dependencies respectively. Specifically, each $\theta_v(\cdot) \triangleq \mathbf{w}_v^\top \boldsymbol{\phi}_v(\cdot)$ where $\boldsymbol{\phi}_v$ is a feature map for a given variable type $v \in \{\text{tok}, \text{bgr}, \text{dep}\}$ and $\mathbf{w}_v$ is the corresponding vector of learned parameters.
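To make the factored scoring scheme concrete, the following is a minimal Python sketch of how such scoring functions could be composed; the function names and feature-map interface are illustrative assumptions, not the paper's feature set.

```python
import numpy as np

def make_scorer(w, phi):
    """Feature-based scoring function theta_v(.) = w_v . phi_v(.) for one variable type v."""
    return lambda item: float(np.dot(w, phi(item)))

def score_compression(tokens, bigrams, deps, theta_tok, theta_bgr, theta_dep):
    """Score a compression as the sum of its active token, bigram and dependency scores, as in (1)."""
    return (sum(theta_tok(t) for t in tokens)
            + sum(theta_bgr(b) for b in bigrams)
            + sum(theta_dep(d) for d in deps))
```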
The inference task involves recovering the highest scoring compression $\hat{C}$ under a particular set of model parameters $\mathbf{w}$:

$\hat{\mathbf{x}}, \hat{\mathbf{y}}, \hat{\mathbf{z}} = \arg\max_{\mathbf{x},\mathbf{y},\mathbf{z}}\ \mathbf{x}^\top \boldsymbol{\theta}_{\text{tok}} + \mathbf{y}^\top \boldsymbol{\theta}_{\text{bgr}} + \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}}$ (2)

where the incidence vector $\mathbf{x} \triangleq \langle x_i \rangle_{t_i \in T}$ represents an entire token configuration over $T$, with $\mathbf{y}$ and $\mathbf{z}$ defined analogously to represent configurations of bigrams and dependencies. $\boldsymbol{\theta}_v$ denotes a corresponding vector of scores for each variable type $v$ under the current model parameters. In order to recover meaningful compressions by optimizing (2), the inference step must ensure:
The configurations $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$ are consistent with each other, i.e., all configurations cover the same tokens.
The structural configurations $\mathbf{y}$ and $\mathbf{z}$ are non-degenerate, i.e., the bigram configuration $\mathbf{y}$ represents an acyclic path while the dependency configuration $\mathbf{z}$ forms a tree.
These requirements naturally rule out simple approximate inference formulations such as search-based approaches for the joint objective. (This work follows Thadani and McKeown (2013) in recovering non-projective trees for inference. However, recovering projective trees is tractable when a total ordering of output tokens is assumed; this will be addressed in future work.) An ILP-based inference solution is demonstrated in Thadani and McKeown (2013) that makes use of linear constraints over the boolean variables $x_i$, $y_{ij}$ and $z_{ij}$ to guarantee consistency, as well as auxiliary real-valued variables and constraints representing the flow of commodities [31] in order to establish structure in $\mathbf{y}$ and $\mathbf{z}$. In the following section, we propose an alternative formulation that exploits the modularity of this joint objective.
Dual decomposition [26] and Lagrangian relaxation in general are often used for solving joint inference problems which are decomposable into individual subproblems linked by equality constraints [27, 44, 43, 13, 32, 11, 1]. This approach permits sub-problems to be solved separately using problem-specific efficient algorithms, while consistency over the structures produced is enforced through Lagrange multipliers via iterative optimization. Exact solutions are guaranteed when the algorithm converges on a consistent primal solution, although this convergence itself is not guaranteed and depends on the tightness of the underlying LP relaxation. The primary advantage of this technique is the ability to leverage the underlying structure of the problems in inference rather than relying on a generic ILP formulation while still often producing exact solutions.
The multi-structure inference problem described in the previous section seems in many ways to be a natural fit to such an approach since output scores factor over different types of structure that comprise the output compression. Even if ILP-based approaches perform reasonably at the scale of single-sentence compression problems, the exponential worst-case complexity of general-purpose ILPs will inevitably pose challenges when scaling up to (a) handle larger inputs, (b) use higher-order structural fragments, or (c) incorporate additional models.
Consider once more the optimization problem characterized by (2). The two structural problems that need to be solved in this formulation are the extraction of a maximum-weight acyclic subsequence of bigrams $\mathbf{y}$ from the lattice of all order-preserving bigrams from $S$ and the recovery of a maximum-weight directed subtree $\mathbf{z}$. Let $\alpha(\mathbf{y})$ denote the incidence vector of tokens contained in the n-gram sequence $\mathbf{y}$ and $\beta(\mathbf{z})$ denote the incidence vector of words contained in the dependency tree $\mathbf{z}$. We can now rewrite the objective in (2) while enforcing the constraint that the words contained in the sequence $\mathbf{y}$ are the same as the words contained in the tree $\mathbf{z}$, i.e., $\alpha(\mathbf{y}) = \beta(\mathbf{z})$, by introducing a vector of Lagrange multipliers $\boldsymbol{\lambda} \in \mathbb{R}^n$. In addition, the token configuration $\mathbf{x}$ can be rewritten in the form of a weighted combination of $\alpha(\mathbf{y})$ and $\beta(\mathbf{z})$, with a ratio $\psi \in [0,1]$, to ensure its consistency with $\mathbf{y}$ and $\mathbf{z}$. This results in the following Lagrangian:

$L(\boldsymbol{\lambda}, \mathbf{y}, \mathbf{z}) = \mathbf{y}^\top \boldsymbol{\theta}_{\text{bgr}} + \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}} + \big(\psi\,\alpha(\mathbf{y}) + (1-\psi)\,\beta(\mathbf{z})\big)^\top \boldsymbol{\theta}_{\text{tok}} + \boldsymbol{\lambda}^\top \big(\alpha(\mathbf{y}) - \beta(\mathbf{z})\big)$ (3)
Finding the $\mathbf{y}$ and $\mathbf{z}$ that maximize this Lagrangian yields a dual objective, and the dual problem corresponding to the primal objective specified in (2) is therefore the minimization of this objective over the Lagrange multipliers $\boldsymbol{\lambda}$:

$\min_{\boldsymbol{\lambda}} L(\boldsymbol{\lambda}) = \min_{\boldsymbol{\lambda}} \Big( \max_{\mathbf{y}} f(\mathbf{y}, \boldsymbol{\lambda}) + \max_{\mathbf{z}} g(\mathbf{z}, \boldsymbol{\lambda}) \Big)$ (4)

where the subproblem objectives are given by $f(\mathbf{y}, \boldsymbol{\lambda}) \triangleq \mathbf{y}^\top \boldsymbol{\theta}_{\text{bgr}} + \psi\,\alpha(\mathbf{y})^\top \boldsymbol{\theta}_{\text{tok}} + \boldsymbol{\lambda}^\top \alpha(\mathbf{y})$ and $g(\mathbf{z}, \boldsymbol{\lambda}) \triangleq \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}} + (1-\psi)\,\beta(\mathbf{z})^\top \boldsymbol{\theta}_{\text{tok}} - \boldsymbol{\lambda}^\top \beta(\mathbf{z})$.
This can now be solved with the iterative subgradient algorithm illustrated in Algorithm 2.2. In each iteration $i$, the algorithm solves for $\hat{\mathbf{y}}^{(i)}$ and $\hat{\mathbf{z}}^{(i)}$ under $\boldsymbol{\lambda}^{(i)}$, then generates $\boldsymbol{\lambda}^{(i+1)}$ to penalize inconsistencies between $\alpha(\hat{\mathbf{y}}^{(i)})$ and $\beta(\hat{\mathbf{z}}^{(i)})$. When $\alpha(\hat{\mathbf{y}}^{(i)}) = \beta(\hat{\mathbf{z}}^{(i)})$, the resulting primal solution is exact, i.e., $\hat{\mathbf{y}}^{(i)}$ and $\hat{\mathbf{z}}^{(i)}$ represent the optimal structures under (2). Otherwise, if the algorithm starts oscillating between a few primal solutions, the underlying LP must have a non-integral solution, in which case approximation heuristics can be employed. (Heuristic approaches [26, 44], tightening [43] or branch and bound [11] can still be used to retrieve optimal solutions, but we did not explore these strategies here.) The application of this Lagrangian relaxation strategy is contingent upon the existence of algorithms to solve the maximization subproblems for $\hat{\mathbf{y}}$ and $\hat{\mathbf{z}}$. The following sections discuss our approach to these problems.
Algorithm 2.2: Joint inference via Lagrangian relaxation

Input: scores $\boldsymbol{\theta}$, ratio $\psi$, repetition limit $l_{\max}$, iteration limit $i_{\max}$, learning rate schedule $\boldsymbol{\eta}$
Output: token configuration $\mathbf{x}$

1:  $\boldsymbol{\lambda}^{(0)} \leftarrow \mathbf{0}$
2:  $M_{\mathbf{y}} \leftarrow \emptyset$, $M_{\mathbf{z}} \leftarrow \emptyset$, repetition count $r \leftarrow 0$
3:  for iteration $i < i_{\max}$ do
4:      $\hat{\mathbf{y}}^{(i)} \leftarrow \arg\max_{\mathbf{y}} f(\mathbf{y}, \boldsymbol{\lambda}^{(i)})$   (bigram subproblem, §2.3)
5:      $\hat{\mathbf{z}}^{(i)} \leftarrow \arg\max_{\mathbf{z}} g(\mathbf{z}, \boldsymbol{\lambda}^{(i)})$   (dependency subproblem, §2.4)
6:      if $\alpha(\hat{\mathbf{y}}^{(i)}) = \beta(\hat{\mathbf{z}}^{(i)})$ then
7:          return $\psi\,\alpha(\hat{\mathbf{y}}^{(i)}) + (1-\psi)\,\beta(\hat{\mathbf{z}}^{(i)})$
8:      end if
9:      if $\hat{\mathbf{y}}^{(i)} \in M_{\mathbf{y}}$ or $\hat{\mathbf{z}}^{(i)} \in M_{\mathbf{z}}$ then
10:         $r \leftarrow r + 1$
11:     end if
12:     if $\hat{\mathbf{y}}^{(i)} \notin M_{\mathbf{y}}$ then
13:         $M_{\mathbf{y}} \leftarrow M_{\mathbf{y}} \cup \{\hat{\mathbf{y}}^{(i)}\}$
14:     end if
15:     if $r \geq l_{\max}$ then break
16:     $M_{\mathbf{z}} \leftarrow M_{\mathbf{z}} \cup \{\hat{\mathbf{z}}^{(i)}\}$
17:     $\boldsymbol{\lambda}^{(i+1)} \leftarrow \boldsymbol{\lambda}^{(i)} - \eta_i\,\big(\alpha(\hat{\mathbf{y}}^{(i)}) - \beta(\hat{\mathbf{z}}^{(i)})\big)$
18: end for
19: return the highest-scoring token configuration recovered from the stored solutions under (2)
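As a concrete illustration, the following is a minimal Python sketch of this subgradient loop; the callback interfaces, bookkeeping, and default hyperparameters are assumptions made for illustration and do not reflect the original implementation.

```python
import numpy as np

def joint_inference(solve_bigram, solve_dependency, score_primal, n,
                    psi=0.9, i_max=100, l_max=20, eta=lambda i: 1.0 / (1 + i)):
    """Lagrangian relaxation driver for joint compression inference.

    solve_bigram(lam) -> (y, alpha): solves max_y f(y, lam), returning the token incidence vector.
    solve_dependency(lam) -> (z, beta): solves max_z g(z, lam), returning the token incidence vector.
    score_primal(x) -> float: scores a candidate token configuration under objective (2).
    """
    lam = np.zeros(n)                      # one multiplier per input token
    seen_y, seen_z, candidates, repeats = set(), set(), [], 0
    for i in range(i_max):
        y, alpha = solve_bigram(lam)       # Section 2.3
        z, beta = solve_dependency(lam)    # Section 2.4
        if np.array_equal(alpha, beta):    # subproblems agree: exact primal solution
            return psi * alpha + (1 - psi) * beta
        key_y, key_z = tuple(alpha), tuple(np.round(beta, 6))
        if key_y in seen_y or key_z in seen_z:
            repeats += 1                   # oscillation between repeated solutions
        seen_y.add(key_y)
        seen_z.add(key_z)
        candidates.append(alpha)           # bigram solutions are always valid compressions
        if repeats >= l_max:
            break
        lam -= eta(i) * (alpha - beta)     # subgradient step on the consistency constraint
    # fall back to the best stored candidate under the primal objective
    return max(candidates, key=score_primal)
```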
McDonald (2006) provides a Viterbi-like dynamic programming algorithm to recover the highest-scoring sequence of order-preserving bigrams from a lattice, either in unconstrained form or with a specific length constraint. The latter requires a dynamic programming table $T[i][r]$ which represents the best score for a compression of length $r$ ending at token $t_i$. The table can be populated using the following recurrence:

$T[i][r] = \max_{j < i}\ T[j][r-1] + \tilde{\theta}_{\text{bgr}}(\langle t_j, t_i \rangle)$

with the best compression of length $R$ recovered by backtracking from $\max_i T[i][R]$, where $R$ is the required number of output tokens and the scoring function $\tilde{\theta}_{\text{bgr}}$ is defined as

$\tilde{\theta}_{\text{bgr}}(\langle t_j, t_i \rangle) \triangleq \theta_{\text{bgr}}(\langle t_j, t_i \rangle) + \psi\,\theta_{\text{tok}}(t_i) + \lambda_i$

so as to solve $\max_{\mathbf{y}} f(\mathbf{y}, \boldsymbol{\lambda})$ from (4). This approach requires $O(n^2 R)$ time in order to identify the highest scoring sequence $\hat{\mathbf{y}}$ and corresponding token configuration $\alpha(\hat{\mathbf{y}})$.
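To make the recurrence concrete, here is a minimal Python sketch of the length-constrained Viterbi-style DP; the function names and the START-index convention are assumptions for illustration, not the original implementation.

```python
import math

def best_bigram_sequence(n, R, theta_bgr):
    """n: number of input tokens (indexed 1..n, with 0 as START); R: required output length;
    theta_bgr(j, i): adjusted bigram score for j < i, already folding in psi * theta_tok + lambda.
    Returns (score, list of selected token indices)."""
    assert 1 <= R <= n
    NEG = -math.inf
    T = [[NEG] * (R + 1) for _ in range(n + 1)]     # T[i][r]: best length-r output ending at t_i
    back = [[None] * (R + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                       # base case: t_i is the first output token
        T[i][1] = theta_bgr(0, i)
    for r in range(2, R + 1):
        for i in range(r, n + 1):                   # need at least r input tokens for length r
            for j in range(r - 1, i):
                cand = T[j][r - 1] + theta_bgr(j, i)
                if cand > T[i][r]:
                    T[i][r], back[i][r] = cand, j
    end = max(range(1, n + 1), key=lambda i: T[i][R])
    seq, i, r = [], end, R
    while r >= 1:                                   # backtrack through the table
        seq.append(i)
        i, r = back[i][r], r - 1
    return T[end][R], list(reversed(seq))
```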
The maximum-weight non-projective subtree problem over general graphs is not as easily solved. Although the maximum spanning tree for a given token configuration can be recovered efficiently, Figure 1 illustrates that the maximum-scoring subtree is not necessarily found within it. The problem of recovering a maximum-weight subtree in a graph has been shown to be NP-hard even with uniform edge weights [29].
In order to produce a solution to this subproblem, we use an LP relaxation of the relevant portion of the ILP from Thadani and McKeown (2013) by omitting integer constraints over the token and dependency variables in $\mathbf{x}$ and $\mathbf{z}$ respectively. For simplicity, however, we describe the ILP version rather than the relaxed LP in order to motivate the constraints with their intended purpose rather than their effect in the relaxed problem. The objective for this LP is given by

$\max_{\mathbf{x}, \mathbf{z}}\ \mathbf{x}^\top \tilde{\boldsymbol{\theta}}_{\text{tok}} + \mathbf{z}^\top \boldsymbol{\theta}_{\text{dep}}$ (5)

where the vector of token scores $\tilde{\boldsymbol{\theta}}_{\text{tok}}$ is redefined as

$\tilde{\theta}_{\text{tok}}(t_i) \triangleq (1 - \psi)\,\theta_{\text{tok}}(t_i) - \lambda_i$ (6)

in order to solve $\max_{\mathbf{z}} g(\mathbf{z}, \boldsymbol{\lambda})$ from (4).
Linear constraints are introduced to produce dependency structures that are close to the optimal dependency trees. First, tokens in the solution must only be active if they have a single active incoming dependency edge. In addition, to avoid producing multiple disconnected subtrees, only one dependency is permitted to attach to the root pseudo-token.
$x_j - \sum_{i} z_{ij} = 0 \quad \forall\, t_j \in T$ (7)

$\sum_{j} z_{\text{root},j} = 1$ (8)
In order to avoid cycles in the dependency tree, we include additional variables $\gamma_{ij}$ to establish single-commodity flow [31] between all pairs of tokens. These variables carry non-negative real values which must be consumed by active tokens that they are incident to.
$\sum_{i} \gamma_{ij} - \sum_{k} \gamma_{jk} = x_j \quad \forall\, t_j \in T$ (9)

$\gamma_{ij} \ge 0 \quad \forall\, i, j$ (10)
These constraints ensure that cyclic structures are not possible in the non-relaxed ILP. In addition, they serve to establish connectivity for the dependency structure since commodity can only originate in one location—at the pseudo-token root which has no incoming commodity variables. However, in order to enforce these properties on the output dependency structure, this acyclic, connected commodity structure must constrain the activation of the $z_{ij}$ variables.
$\gamma_{ij} \le \Gamma_{\max} \cdot z_{ij} \quad \forall\, i, j$ (11)

where $\Gamma_{\max}$ is an arbitrary upper bound on the value of the $\gamma_{ij}$ variables. Figure 2 illustrates how these commodity flow variables constrain the output of the ILP to be a tree. However, the effect of these constraints is diminished when solving an LP relaxation of the above problem.
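For concreteness, the following sketch shows how the objective (5) and constraints (7)–(11) could be assembled with an off-the-shelf modeling library (PuLP here). The data structures and helper names are assumptions made for illustration; the original system used a commercial ILP solver rather than this exact formulation.

```python
import pulp

def build_dependency_lp(theta_tok, theta_dep, relaxed=True):
    """theta_tok: dict token index -> score (already adjusted per eq. 6);
    theta_dep: dict (head, dependent) -> score. Token index 0 is the root pseudo-token."""
    n = len(theta_tok)                      # real tokens are 1..n
    tokens = list(range(1, n + 1))
    arcs = list(theta_dep)                  # candidate dependency arcs (head, dependent)
    var_cat = pulp.LpContinuous if relaxed else pulp.LpBinary

    prob = pulp.LpProblem("dep_subtree", pulp.LpMaximize)
    x = {j: pulp.LpVariable(f"x_{j}", 0, 1, var_cat) for j in tokens}
    z = {(h, d): pulp.LpVariable(f"z_{h}_{d}", 0, 1, var_cat) for h, d in arcs}
    g = {(h, d): pulp.LpVariable(f"g_{h}_{d}", 0, None) for h, d in arcs}  # flow, eq. (10)

    # objective (5)
    prob += (pulp.lpSum(theta_tok[j] * x[j] for j in tokens)
             + pulp.lpSum(theta_dep[a] * z[a] for a in arcs))
    # (7): an active token has exactly one active incoming arc
    for j in tokens:
        prob += pulp.lpSum(z[(h, d)] for (h, d) in arcs if d == j) == x[j]
    # (8): only one arc may leave the root pseudo-token
    prob += pulp.lpSum(z[(h, d)] for (h, d) in arcs if h == 0) == 1
    # (9): each active token consumes one unit of commodity
    for j in tokens:
        inflow = pulp.lpSum(g[(h, d)] for (h, d) in arcs if d == j)
        outflow = pulp.lpSum(g[(h, d)] for (h, d) in arcs if h == j)
        prob += inflow - outflow == x[j]
    # (11): commodity may only flow along active arcs (n serves as the arbitrary upper bound)
    for a in arcs:
        prob += g[a] <= n * z[a]
    return prob, x, z
```

The relaxed program can then be handed to any LP backend (e.g., `prob.solve()`), after which the fractional values `pulp.value(x[j])` and `pulp.value(z[a])` can be read off as the relaxed token and dependency configurations; the compression-rate constraint (12) discussed below can be added to this model in the same way.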
In the LP relaxation, $x_i$ and $z_{ij}$ are redefined as real-valued variables in $[0,1]$, potentially resulting in fractional values for dependency and token indicators. As a result, the commodity flow network is able to establish connectivity but cannot enforce a tree structure; for instance, directed acyclic structures are possible and token indicators may be partially assigned to the solution structure. This poses a challenge in implementing $\beta(\mathbf{z})$, which is needed to recover a token configuration from the solution of this subproblem.
We propose two alternative solutions to address this issue in the context of the joint inference strategy. The first is to simply use the relaxed token configuration identified by the LP in Algorithm 2.2, i.e., to set $\beta(\hat{\mathbf{z}}) \leftarrow \tilde{\mathbf{x}}$ where $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{z}}$ represent the real-valued counterparts of the incidence vectors $\mathbf{x}$ and $\mathbf{z}$ in the relaxed solution. The viability of this approximation strategy is due to the following:
The relaxed LP is empirically fairly tight, yielding integral solutions 89% of the time on the compression datasets described in §3.
The bigram subproblem is guaranteed to return a well-formed integral solution which obeys the imposed compression rate, so we are assured of a source of valid—if non-optimal—solutions in line 13 of Algorithm 2.2.
We also consider another strategy that attempts to approximate a valid integral solution to the dependency subproblem. In order to do this, we first include an additional constraint in the relaxed LP which restricts the number of tokens in the output to the specific number $R$ given by an input compression rate.
$\sum_{i} x_i = R$ (12)
The addition of this constraint to the relaxed LP reduces the rate of integral solutions drastically—from 89% to approximately 33%—but it serves to ensure that the resulting token configuration $\tilde{\mathbf{x}}$ has at least as many non-zero elements as $R$, i.e., there are at least as many tokens activated in the LP solution as are required in a valid solution.
We then construct a subgraph $G$ consisting of all dependency edges that were assigned non-zero values in the solution, assigning to each edge a score equal to the score of that edge in the LP as well as the score of its dependent word, i.e., each $z_{ij}$ in $G$ is assigned a score of $\theta_{\text{dep}}(t_i \rightarrow t_j) + \tilde{\theta}_{\text{tok}}(t_j)$. Since the commodity flow constraints in (9)–(11) ensure a connected $G$, it is therefore possible to recover a maximum-weight spanning tree from $G$ using the Chu-Liu Edmonds algorithm [4, 14]. (A detailed description of the Chu-Liu Edmonds algorithm for MSTs is available in McDonald et al. (2005).) Although the runtime of this algorithm is cubic in the size of the input graph, it is fairly speedy when applied on relatively sparse graphs such as the solutions to the LP described above. The resulting spanning tree is a useful integral approximation of $\mathbf{z}$ but, as mentioned previously, may contain more nodes than $R$ due to fractional values in $\tilde{\mathbf{x}}$; we therefore repeatedly prune leaves with the lowest incoming edge weight in the current tree until exactly $R$ nodes remain. The resulting tree is assumed to be a reasonable approximation of the optimal integral solution to this LP.
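A minimal sketch of this spanning-tree approximation follows, assuming networkx (whose `maximum_spanning_arborescence` implements Edmonds' algorithm) and a simple edge-list representation; the names and data structures are illustrative, not the original implementation.

```python
import networkx as nx

def approximate_subtree(edges, R):
    """edges: list of (head, dependent, score) for arcs with non-zero LP values, where scores
    already include the dependent's token score; node 0 is the root pseudo-token.
    Returns {dependent: head} for a pruned tree spanning exactly R tokens (plus the root)."""
    G = nx.DiGraph()
    for h, d, s in edges:
        G.add_edge(h, d, weight=s)
    # Chu-Liu Edmonds over the (connected) LP-support graph
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    # greedily prune the leaf with the lowest incoming edge weight until R tokens remain
    while tree.number_of_nodes() - 1 > R:          # -1 excludes the root pseudo-token
        leaves = [v for v in tree.nodes if tree.out_degree(v) == 0]
        worst = min(leaves, key=lambda v: G[next(tree.predecessors(v))][v]["weight"])
        tree.remove_node(worst)
    return {d: h for h, d in tree.edges}
```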
The Chu-Liu Edmonds algorithm is also employed for another purpose: when the underlying LP for the joint inference problem is not tight—a frequent occurrence in our compression experiments—Algorithm 2.2 will not converge on a single primal solution and will instead oscillate between solutions that are close to the dual optimum. We identify this phenomenon by counting repeated solutions and, if they exceed some threshold with at least one repeated solution from either subproblem, we terminate the update procedure for Lagrange multipliers and instead attempt to identify a good solution from the repeating ones by scoring them under (2). It is straightforward to recover and score a bigram configuration $\mathbf{y}$ from a token configuration $\beta(\mathbf{z})$. However, scoring solutions produced by the dynamic program from §2.3 also requires the score over a corresponding parse tree; this can be recovered by constructing a dependency subgraph over only the tokens that are active in $\alpha(\mathbf{y})$ and retrieving the maximum spanning tree for that subgraph using the Chu-Liu Edmonds algorithm.
The features used in this work are largely based on the features from Thadani and McKeown (2013).
The token feature map $\boldsymbol{\phi}_{\text{tok}}$ contains features for part-of-speech (POS) tag sequences of length up to 3 around the token, features for the dependency label of the token conjoined with its POS, lexical features for verb stems and non-word symbols, and morphological features that identify capitalized sequences, negations and words in parentheses.
The bigram feature map $\boldsymbol{\phi}_{\text{bgr}}$ contains features for POS patterns in a bigram, the labels of dependency edges incident to it, its likelihood under a Gigaword language model (LM) and an indicator for whether it is present in the input sentence.
The dependency feature map $\boldsymbol{\phi}_{\text{dep}}$ contains features for the probability of a dependency edge under a smoothed dependency grammar constructed from the Penn Treebank and various conjunctions of the following features: (a) whether the edge appears as a dependency or ancestral relation in the input parse, (b) the directionality of the dependency, (c) the label of the edge, (d) the POS tags of the tokens incident to the edge, and (e) the labels of their surrounding chunks and whether the edge remains within the chunk.
For the experiments in the following section, we trained models using a variant of the structured perceptron [10] which incorporates minibatches [50] for easy parallelization and faster convergence (we used a minibatch size of 4 in all experiments). Overfitting was avoided by averaging parameters and monitoring performance against a held-out development set during training. All models were trained using variants of the ILP-based inference approach of Thadani and McKeown (2013). We followed Martins et al. (2009) in using LP-relaxed inference during learning, assuming algorithmic separability [28] for these problems.
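As a rough illustration of this training regime, the following sketch shows a minibatched structured perceptron update with parameter averaging; the decoding callback, feature interface, and hyperparameters are assumptions for illustration, not the original training code.

```python
import numpy as np

def train_structured_perceptron(instances, decode, features, dim,
                                epochs=10, batch_size=4):
    """instances: list of (input, gold_structure); decode(x, w): highest-scoring structure under
    weights w (e.g., via LP-relaxed inference); features(x, y): joint feature vector of size dim."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)                  # running sum of weights for parameter averaging
    updates = 0
    for _ in range(epochs):
        np.random.shuffle(instances)
        for start in range(0, len(instances), batch_size):
            batch = instances[start:start + batch_size]
            delta = np.zeros(dim)
            for x, gold in batch:          # decoding within a minibatch can be parallelized
                pred = decode(x, w)
                if pred != gold:
                    delta += features(x, gold) - features(x, pred)
            w += delta / len(batch)        # aggregated update for the minibatch
            w_sum += w
            updates += 1
    return w_sum / max(updates, 1)         # averaged parameters to reduce overfitting
```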
We ran compression experiments over the newswire (NW) and broadcast news transcription (BN) corpora compiled by Clarke and Lapata (2008), which contain gold compressions produced by human annotators using only word deletion. The datasets were filtered to eliminate instances with fewer than 2 or more than 110 tokens for parser compatibility and divided into training/development/test sections following the splits from Clarke and Lapata (2008), yielding 953/63/603 instances for the NW corpus and 880/78/404 for the BN corpus. Gold dependency parses were approximated by running the Stanford dependency parser (http://nlp.stanford.edu/software/) over reference compressions.
Following evaluations in machine translation as well as previous work in sentence compression [47, 7, 34, 39, 45], we evaluate system performance using F1 metrics over n-grams and over dependency edges produced by parsing system output with RASP [3] and the Stanford parser. All ILPs and LPs were solved using Gurobi (http://www.gurobi.com), a high-performance commercial-grade solver. Following a recent analysis of compression evaluations [39] which revealed a strong correlation between system compression rate and human judgments of compression quality, we constrained all systems to produce compressed output at a specific rate—determined by the gold compressions available for each instance—to ensure that the reported differences between the systems under study are meaningful.
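For reference, n-gram F1 against a gold compression can be computed as in the following sketch (a simple illustrative implementation, not the official evaluation script):

```python
from collections import Counter

def ngram_f1(system_tokens, gold_tokens, n=2):
    """F1 over contiguous n-grams of the system and gold compressions."""
    sys_ngrams = Counter(zip(*[system_tokens[i:] for i in range(n)]))
    gold_ngrams = Counter(zip(*[gold_tokens[i:] for i in range(n)]))
    if not sys_ngrams or not gold_ngrams:
        return 0.0
    overlap = sum((sys_ngrams & gold_ngrams).values())
    precision = overlap / sum(sys_ngrams.values())
    recall = overlap / sum(gold_ngrams.values())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```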
Table 1: Compression results on the BN corpus: n-gram F1% for n = 1 to 4, syntactic relation F1% (dependencies recovered by inference, and parses of system output under the Stanford parser and RASP), and inference time.

| Objective | Technique | F1 n=1 | F1 n=2 | F1 n=3 | F1 n=4 | Dep. F1% | Stanford F1% | RASP F1% | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| n-grams | 3-LM (CL08) | 74.96 | 60.60 | 46.83 | 38.71 | - | 60.52 | 57.49 | 0.72 |
| n-grams | DP (McD06) | 78.80 | 66.04 | 52.67 | 42.39 | - | 63.28 | 57.89 | 0.01 |
| deps | LPMST | 79.61 | 64.32 | 50.36 | 40.97 | 66.57 | 66.82 | 59.70 | 0.07 |
| deps | ILP-Dep | 80.02 | 65.99 | 52.42 | 43.07 | 72.43 | 67.63 | 60.78 | 0.16 |
| joint | DP + LPMST | 79.50 | 66.75 | 53.48 | 44.33 | 64.63 | 67.69 | 60.94 | 0.24 |
| joint | DP + LP | 79.10 | 68.22 | 55.05 | 45.81 | 65.74 | 68.24 | 62.04 | 0.12 |
| joint | ILP-Joint (TM13) | 80.13 | 68.34 | 55.56 | 46.60 | 72.57 | 68.89 | 62.61 | 0.31 |
Table 2: Compression results on the NW corpus, with the same metrics as Table 1.

| Objective | Technique | F1 n=1 | F1 n=2 | F1 n=3 | F1 n=4 | Dep. F1% | Stanford F1% | RASP F1% | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| n-grams | 3-LM (CL08) | 66.66 | 51.59 | 39.33 | 30.55 | - | 50.76 | 49.57 | 1.22 |
| n-grams | DP (McD06) | 73.18 | 58.31 | 45.07 | 34.77 | - | 56.23 | 51.14 | 0.01 |
| deps | LPMST | 73.32 | 55.12 | 41.18 | 31.44 | 61.01 | 58.37 | 52.57 | 0.12 |
| deps | ILP-Dep | 73.76 | 57.09 | 43.47 | 33.44 | 65.45 | 60.06 | 54.31 | 0.28 |
| joint | DP + LPMST | 73.13 | 57.03 | 43.79 | 34.01 | 57.91 | 58.46 | 53.20 | 0.33 |
| joint | DP + LP | 72.06 | 59.83 | 47.39 | 37.72 | 58.13 | 58.97 | 53.78 | 0.21 |
| joint | ILP-Joint (TM13) | 74.00 | 59.90 | 47.22 | 37.01 | 65.65 | 61.29 | 56.24 | 0.60 |
We report results over the following systems grouped into three categories of models: tokens + n-grams, tokens + dependencies, and joint models.
3-LM: A reimplementation of the unsupervised ILP of Clarke and Lapata (2008) which infers order-preserving trigram variables parameterized with log-likelihood under an LM and a significance score for token variables inspired by Hori and Furui (2004), as well as various linguistically-motivated constraints to encourage fluency in output compressions.
LPMST: An approximate inference approach based on an LP relaxation of ILP-Dep. As discussed in §2.4, a maximum spanning tree is recovered from the output of the LP and greedily pruned in order to generate a valid integral solution while observing the imposed compression rate.
ILP-Dep: A version of the joint ILP of Thadani and McKeown (2013) without n-gram variables and corresponding features.
DP+LPMST: An approximate joint inference approach based on Lagrangian relaxation that uses DP for the maximum weight subsequence problem and LPMST for the maximum weight subtree problem.
DP+LP: Another Lagrangian relaxation approach that pairs DP with the non-integral solutions from an LP relaxation of the maximum weight subtree problem (cf. §2.4).
ILP-Joint: The full ILP from Thadani and McKeown (2013), which provides an upper bound on the performance of the proposed approximation strategies.
The learning rate schedule for the Lagrangian relaxation approaches was set for aggressive subgradient updates, while the hyperparameter $\psi$ was tuned using the development split of each corpus. We were surprised to observe that performance improved significantly when $\psi$ was set closer to 1, thereby emphasizing token features in the dependency subproblem.
Tables 1 and 2 summarize the results from our compression experiments on the BN and NW corpora respectively. Starting with the n-gram approaches, the performance of 3-LM leads us to observe that the gains of supervised learning far outweigh the utility of higher-order n-gram factorization, which is also responsible for a significant increase in wall-clock time. In contrast, DP is an order of magnitude faster than all other approaches studied here, although it is not competitive under parse-based measures such as RASP F1, which is known to correlate with human judgments of grammaticality [5].
We were surprised by the strong performance of the dependency-based inference techniques, which yielded results that approached the joint model in both n-gram and parse-based measures. The exact ILP-Dep approach halves the runtime of ILP-Joint to produce compressions that have similar (although statistically distinguishable) scores. Approximating dependency-based inference with LPMST halves runtime yet again; however, this comes at a noticeable cost in performance.
Turning to the joint approaches, the strong performance of ILP-Joint is expected; less expected is its relatively high yet practically reasonable runtime. We note, however, that these ILPs are solved using a highly-optimized commercial-grade solver that can utilize all CPU cores (16 cores in our experimental environment), while our approximation approaches are implemented as single-processed Python code without significant effort toward optimization. Comparing the two approximation strategies shows a clear performance advantage for DP+LP over DP+LPMST: the latter approach entails slower inference due to the overhead of running the Chu-Liu Edmonds algorithm at every dual update, and furthermore, the error introduced by approximating an integral solution results in a significant decrease in dependency recall. In contrast, DP+LP directly optimizes the dual problem by using the relaxed dependency solution to update Lagrange multipliers and achieves the best performance on parse-based F1 outside of the slower ILP approaches. Convergence rates also differ between the two techniques: DP+LP has a lower rate of empirical convergence (15% on BN and 4% on NW) when compared to DP+LPMST (19% on BN and 6% on NW).
Figure 3 shows the effect of input sentence length on inference time and performance for ILP-Joint and DP+LP over the NW test corpus (similar results were observed for the BN test corpus). The timing results reveal that the approximation strategy is consistently faster than the ILP solver. The variation in RASP F1 with input size indicates the viability of a hybrid approach which could balance accuracy and speed by using ILP-Joint for smaller problems and DP+LP for larger ones.
Sentence compression is one of the better-studied text-to-text generation problems and has been observed to play a significant role in human summarization [23, 22]. Most approaches to sentence compression are supervised [25, 42, 46, 36, 47, 18, 40, 9, 17, 19, 38, 15] following the release of datasets such as the Ziff-Davis corpus [24] and the Edinburgh compression corpora [5, 7], although unsupervised approaches—largely based on ILPs—have also received consideration [6, 7, 16]. Compression has also been used as a tool for document summarization [12, 49, 6, 34, 2, 48, 1, 37, 30, 41], with recent work formulating the summarization task as joint sentence extraction and compression and often employing ILP or Lagrangian relaxation. Monolingual compression also faces many obstacles common to decoding in machine translation, and a number of approaches which have been proposed to combine phrasal and syntactic models [21, 43] inter alia offer directions for future research into compression problems.
We have presented approximate inference strategies to jointly compress sentences under bigram and dependency-factored objectives by exploiting the modularity of the task and considering the two subproblems in isolation. Experiments show that one of these approximation strategies produces results comparable to a state-of-the-art integer linear program for the same joint inference task with a 60% reduction in average inference time.
The author is grateful to Alexander Rush for helpful discussions and to the anonymous reviewers for their comments. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D11PC20153. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.