A Human Evaluation of AMR-to-English Generation Systems

Emma Manning, Shira Wein, Nathan Schneider
Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only with automatic metrics such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation that collects fluency and adequacy scores, as well as categorizations of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our…

Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR

This work proposes \mathcal{M}_\beta, a decomposable metric built on two pillars that together measure the linguistic quality of the generated text, and shows that satisfying both principles benefits AMR-to-text evaluation, including making scores explainable.

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, along with a description of the data for the 2021 shared task at the associated GEM Workshop.

Meaning Representations: Recent Advances in Parsing and Generation

This course surveys recent research on neural approaches that achieve high performance on meaning representation benchmark datasets, building on technical knowledge about complex, high-performance machine learning systems.

ToxCCIn: Toxic Content Classification with Interpretability

This work proposes a technique to improve the interpretability of transformer-based models by scoring a post according to the maximum toxicity of its spans and augmenting the training process to identify the correct spans; this can produce explanations that exceed the quality of those provided by logistic regression analysis.

Promoting Graph Awareness in Linearized Graph-to-Text Generation

This work uses graph-denoising objectives implemented in a multi-task text-to-text framework and finds that these denoising scaffolds lead to substantial improvements in downstream generation in low-resource settings.

A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation – through the Lens of Semantic Similarity Rating

The usefulness of CheckList is demonstrated by designing a new metric GraCo that computes lexical cohesion graphs over AMR concepts and suggests that meaning-oriented NLG metrics can profit from graph-based metric components using AMR.

Referenceless Parsing-Based Evaluation of AMR-to-English Generation

It is found that the errors introduced by automatic AMR parsing substantially limit the effectiveness of this approach, but a manual editing study indicates that as parsing improves, parsing-based evaluation has the potential to outperform most reference-based metrics.

A Survey: Neural Networks for AMR-to-Text

This survey details neural network-based methods and the latest progress in AMR-to-Text, including AMR reconstruction and decoder optimization, and provides a summary of current techniques and an outlook for future research.

Biomedical Data-to-Text Generation via Fine-Tuning Transformers

It is shown that fine-tuned transformers are able to generate realistic, multi-sentence text from data in the biomedical domain, yet have important limitations.

A Partially Rule-Based Approach to AMR Generation

This paper presents a new approach to generating English text from Abstract Meaning Representation that is largely rule-based, supplemented only by a language model and simple statistical linearization models, allowing for more control over the output.

Why We Need New Evaluation Metrics for NLG

A wide range of metrics are investigated, including state-of-the-art word-based and novel grammar-based ones, and it is demonstrated that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG.

A Structured Review of the Validity of BLEU

The evidence supports using BLEU for diagnostic evaluation of MT systems, but does not support using it outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.

GPT-too: A Language-Model-First Approach for AMR-to-Text Generation

An alternative approach combining a strong pre-trained language model with cycle-consistency-based re-scoring is proposed; it outperforms all previous techniques on the English LDC2017T10 dataset, including recent transformer architectures.

A Study of Translation Edit Rate with Targeted Human Annotation

A new, intuitive measure for evaluating machine translation output is examined that avoids the knowledge-intensiveness of more meaning-based approaches and the labor-intensiveness of human judgments. Results indicate that HTER correlates with human judgments better than HMETEOR, and that the four-reference variants of TER and HTER correlate with human judgments as well as, or better than, a second human judgment does.
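
At its core, TER is a word-level edit distance normalized by reference length. The sketch below illustrates that core in plain Python, under the simplifying assumption that TER's block-shift operation is omitted (so this reduces to word error rate); the function name `ter_no_shifts` is hypothetical, not from the TER toolkit.

```python
def ter_no_shifts(hypothesis, reference):
    """Simplified TER: word-level Levenshtein edits (insert, delete,
    substitute) divided by reference length. Real TER additionally
    allows block shifts of phrases, which this sketch omits."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i  # delete all hypothesis words
    for j in range(len(ref) + 1):
        dp[0][j] = j  # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return dp[len(hyp)][len(ref)] / max(len(ref), 1)
```

A lower score is better: a perfect match gives 0.0, and one substitution against a three-word reference gives 1/3.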


A Call for Clarity in Reporting BLEU Scores

Pointing to the success of the parsing community in standardizing evaluation, it is suggested that machine translation researchers settle on a common BLEU scheme that does not allow user-supplied reference preprocessing, and a new tool, SACREBLEU, is provided to facilitate this.

Neural AMR: Sequence-to-Sequence Models for Parsing and Generation

This work presents a novel training procedure that overcomes the limitations posed by the small amount of labeled data and the non-sequential nature of AMR graphs, and presents strong evidence that sequence-based AMR models are robust to ordering variations of graph-to-sequence conversions.
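
Sequence-to-sequence AMR models consume a linearized token sequence rather than the graph itself. The sketch below shows one depth-first linearization of a toy AMR over a hypothetical tuple representation; it is an illustration of the general idea, not the actual preprocessing pipeline of this system (which also anonymizes entities, among other steps).

```python
def linearize(node):
    """Depth-first linearization of a toy AMR graph into a token
    sequence. A node is (variable, concept, [(role, child), ...]);
    a child may also be a bare string (a constant or a re-entrant
    variable). This is a hypothetical, simplified format."""
    var, concept, edges = node
    tokens = ["(", concept]
    for role, child in edges:
        tokens.append(role)
        if isinstance(child, tuple):
            tokens.extend(linearize(child))  # recurse into subgraph
        else:
            tokens.append(child)  # constant or re-entrant variable
    tokens.append(")")
    return tokens

# "The boy wants to go":
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))
amr = ("w", "want-01", [(":ARG0", ("b", "boy", [])),
                        (":ARG1", ("g", "go-02", [(":ARG0", "b")]))])
```

Here `linearize(amr)` yields the sequence `( want-01 :ARG0 ( boy ) :ARG1 ( go-02 :ARG0 b ) )`, which a standard encoder-decoder can treat as an input sentence.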

Bleu: a Method for Automatic Evaluation of Machine Translation

This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
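
BLEU's core computation is a geometric mean of modified (clipped) n-gram precisions multiplied by a brevity penalty. The following is a minimal sentence-level sketch in pure Python, assuming a single reference and whitespace tokenization; it is a simplified illustration, not the full corpus-level metric or any official implementation.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions for n = 1..max_n, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n])
                              for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0; clipping is what stops degenerate outputs like repeated high-frequency words from scoring well on unigram precision.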

Generating English from Abstract Meaning Representations

A method that learns to linearize the tokens of AMR graphs into an English-like order is introduced, which reduces the amount of distortion in phrase-based MT (PBMT) and increases generation quality.