CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation

Pei Ke, Hao Zhou, Yankai Lin, Peng Li, Jie Zhou, Xiaoyan Zhu, Minlie Huang
Annual Meeting of the Association for Computational Linguistics
Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide a task-agnostic evaluation result which correlates weakly with human judgments, whereas supervised ones may overfit task-specific data with poor generalization ability to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each… 

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

This work designs and synthesizes a wide range of potential errors, checks whether they result in a drop in the metric scores, investigates the reasons behind these blind spots, and suggests practical workarounds for a more reliable evaluation of text generation.

Director: Generator-Classifiers For Supervised Language Modeling

Director is a new architecture consisting of a unified generator-classifier with both a language modeling head and a classification head for each output token; it outperforms existing model-guiding approaches in both accuracy and efficiency.

AI vs. Human -- Differentiation Analysis of Scientific Content Generation

It is found that there exists a "writing style" gap between AI-generated scientific text and human-written scientific text, a finding that contributes to guiding the optimization of AI models to produce high-quality content and to addressing related ethical and security concerns.

UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor

This work constructs prompting templates that invoke the underlying knowledge in a pre-trained language model (PLM) to calculate the perplexity of the document and keyword, which is used to assess the document's semantic salience and thus improve the subsequent abstract generation.

A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models

This is the first survey paper to summarize CTG techniques from the perspective of PLMs; it aims to help researchers in related fields quickly track the academic frontier by providing a landscape of the area and a roadmap for future research.

Is This Abstract Generated by AI? A Research for the Gap between AI-generated Scientific Text and Human-written Scientific Text

There exists a “writing style” gap between AI-generated scientific text and human-written scientific text, which suggests that while AI has the potential to generate scientific content that is as accurate as human-written content, there is still a gap in terms of depth and overall quality.



MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

This paper investigates strategies to encode system and reference texts in order to devise a metric that shows a high correlation with human judgment of text quality, and validates the new metric, namely MoverScore, on a number of text generation tasks.

BLEURT: Learning Robust Metrics for Text Generation

BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

This work explores using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics; experiments show that the resulting metrics outperform RUBER, which is trained on static embeddings.

BERTScore: Evaluating Text Generation with BERT

This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.

BARTScore: Evaluating Generated Text as Text Generation

This work conceptualizes the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models, and proposes a metric, BARTScore, with a number of variants that can be flexibly applied in an unsupervised fashion to the evaluation of text from different perspectives.

CTRL: A Conditional Transformer Language Model for Controllable Generation

This work releases CTRL, a 1.63 billion-parameter conditional transformer language model trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

This work develops a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data, and shows that the uniformly designed metrics achieve stronger or comparable correlations with human judgement compared to state-of-the-art metrics.

Language Model Augmented Relevance Score

This paper proposes Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation that leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.

RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

This work proposes RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance).

OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including the correlation with human judgments, the generalization to different model outputs and datasets, the ability to judge story coherence, and the robustness to perturbations.