CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation

@inproceedings{Ke2022CTRLEvalAU,
  title={CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation},
  author={Pei Ke and Hao Zhou and Yankai Lin and Peng Li and Jie Zhou and Xiaoyan Zhu and Minlie Huang},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2022}
}
Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide a task-agnostic evaluation result which correlates weakly with human judgments, whereas supervised ones may overfit task-specific data with poor generalization ability to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each… 

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

This work designs and synthesizes a wide range of potential errors and checks whether they result in a drop in the metric scores; it also investigates the reasons behind these blind spots and suggests practical workarounds for a more reliable evaluation of text generation.

Director: Generator-Classifiers For Supervised Language Modeling

Director is a new architecture consisting of a unified generator-classifier with both a language modeling head and a classification head for each output token; it outperforms existing model-guiding approaches in terms of both accuracy and efficiency.

UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor

This work constructs prompting templates that invoke the underlying knowledge in a Pre-trained Language Model (PLM) to calculate the document and keyword's perplexity, which can assess the document's semantic salience, thus improving the subsequent abstract generation.

Is This Abstract Generated by AI? A Research for the Gap between AI-generated Scientific Text and Human-written Scientific Text

There exists a “writing style” gap between AI-generated scientific text and human-written scientific text, which suggests that while AI has the potential to generate scientific content that is as accurate as human-written content, there is still a gap in terms of depth and overall quality.

A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models

This is the first survey paper to summarize CTG techniques from the perspective of PLMs, and it is hoped it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.

References

BLEURT: Learning Robust Metrics for Text Generation

BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

Using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics is explored, and experiments show that the evaluation metrics outperform RUBER, which is trained on static embeddings.

BARTScore: Evaluating Generated Text as Text Generation

This work conceptualizes the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models, and proposes a metric, BARTScore, with a number of variants that can be flexibly applied in an unsupervised fashion to the evaluation of text from different perspectives.

CoCon: A Self-Supervised Approach for Controlled Text Generation

This work proposes Content-Conditioner (CoCon) to control an LM's output text with a target content, at a fine-grained level and shows that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner.

CTRL: A Conditional Transformer Language Model for Controllable Generation

CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

This work develops a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks, often without need of gold reference data, and shows the uniformly designed metrics achieve stronger or comparable correlations with human judgement compared to state-of-the-art metrics.

RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

RUBER is proposed, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a ground-truth reply and a query (previous user-issued utterance).

OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including the correlation with human judgments, the generalization to different model outputs and datasets, the ability to judge story coherence, and the robustness to perturbations.

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

UNION is a learnable unreferenced metric for evaluating open-ended story generation that measures the quality of a generated story without any reference; it correlates better with human judgments and is more generalizable than existing state-of-the-art metrics.

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

A large-scale, systematic study to evaluate the existing evaluation methods for natural language generation in the context of generating online product reviews finds lexical diversity an intriguing metric that is indicative of the assessments of different evaluators.