Bootstrapping Lexical Choice via Multiple-Sequence Alignment

Regina Barzilay and Lillian Lee
An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi…


Expanding Paraphrase Lexicons by Exploiting Generalities
This article presents a method for systematically expanding an initial seed lexicon made up of high-quality paraphrases by automatically capturing morpho-semantic and syntactic generalizations within the lexicon and using them to leverage the power of large-scale monolingual data.
Statistical Acquisition of Content Selection Rules for Natural Language Generation
This paper presents a method to acquire content selection rules automatically from a corpus of text and associated semantics. The method is evaluated by comparing its output with information selected by human authors in unseen texts, where it was able to filter half the input data set without loss of recall.
Adding Syntax to Dynamic Programming for Aligning Comparable Texts for the Generation of Paraphrases
This paper describes an algorithm for incorporating syntactic features in the alignment process for non-parallel texts with the goal of generating novel paraphrases of existing texts using dynamic programming with alignment decision based on the local syntactic similarity between two sentences.
Curate and Generate: A Corpus and Method for Joint Control of Semantics and Style in Neural NLG
YelpNLG is presented, a corpus of 300,000 rich, parallel meaning representations and highly stylistically varied reference texts spanning different restaurant attributes, and a novel methodology that can be scalably reused to generate NLG datasets for other domains is described.
Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
This work examines a state-of-the-art structured prediction model for the alignment task, which uses a phrase-based representation and is forced to decode alignments using an approximate search approach. It proposes a straightforward exact decoding technique based on integer linear programming that yields order-of-magnitude improvements in decoding speed.
Prenominal Modifier Ordering via Multiple Sequence Alignment
A novel approach to producing a fluent ordering for a set of prenominal modifiers in a noun phrase is presented, adapting multiple sequence alignment techniques used in computational biology to the alignment of modifiers.
Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods
A comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods is conducted, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research.
A Metric for Paraphrase Detection
  • J. Cordeiro, G. Dias, P. Brazdil
  • Computer Science
    2007 International Multi-Conference on Computing in the Global Information Technology (ICCGI'07)
  • 2007
This paper proposes a new metric for unsupervised detection of paraphrases and tests it over a set of standard paraphrase corpora. The results are promising, as they outperform state-of-the-art measures developed for similar tasks.
A Survey of Paraphrasing and Textual Entailment Methods
Key ideas from the two areas of paraphrasing and textual entailment are summarized by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.
Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences
A syntax-based algorithm is presented that automatically builds Finite State Automata (word lattices) from semantically equivalent translation sets. The automata are good representations of paraphrases and can predict the correctness of alternative semantic renderings, which may be used to evaluate the quality of translations.


Bootstrapping Syntax and Recursion using Alignment-Based Learning
A new type of unsupervised learning algorithm, based on the alignment of sentences and Harris’s (1951) notion of interchangeability, is introduced; it results in a labelled, bracketed version of the corpus of natural language sentences.
Trainable Methods for Surface Natural Language Generation
Three systems for surface natural language generation are presented. Each is trainable from annotated corpora and attempts to produce a grammatical natural language phrase from a domain-specific semantic representation.
Models of translation equivalence among words
This article presents methods for biasing statistical translation models to reflect bitext properties, and shows how a statistical translation model can take advantage of preexisting knowledge that might be available about particular language pairs.
Generation that Exploits Corpus-Based Statistical Knowledge
We describe novel aspects of a new natural language generator called Nitrogen. This generator has a highly flexible input representation that allows a spectrum of input from syntactic to semantic
Extracting Paraphrases from a Parallel Corpus
This work presents an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text that yields phrasal and single word lexical paraphrasing as well as syntactic paraphrase.
The Mathematics of Statistical Machine Translation: Parameter Estimation
It is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus, given a set of pairs of sentences that are translations of one another.
Finding consensus in speech recognition: word error minimization and other applications of confusion networks
We describe a new framework for distilling information from word lattices to improve the accuracy of the speech recognition output and obtain a more perspicuous representation of a set of alternative
Exploiting a Probabilistic Hierarchical Model for Generation
Initial results are presented showing that a tree-based model derived from a tree-annotated corpus improves on a tree model derived from an unannotated corpus, and that a tree-based stochastic model with a hand-crafted grammar outperforms both.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Automatic Evaluation of Computer Generated Text: A Progress Report on the TextEval Project
The basis of this approach is the use of a standard set and the adoption of a statistical view of translation quality. This makes it possible to provide evaluations that avoid dependence on any particular theory of translation and are therefore potentially more objective than previous techniques.