CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
@inproceedings{Ke2022CTRLEvalAU,
  title     = {CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation},
  author    = {Pei Ke and Hao Zhou and Yankai Lin and Peng Li and Jie Zhou and Xiaoyan Zhu and Minlie Huang},
  booktitle = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2022}
}
Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide a task-agnostic evaluation result which correlates weakly with human judgments, whereas supervised ones may overfit task-specific data with poor generalization ability to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each…
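To make the idea concrete, below is a minimal sketch of a reference-free, infilling-based aspect score in the spirit of CTRLEval: part of the generated text is masked, a pretrained seq2seq language model scores the masked span given the rest, and the length-normalized log-likelihood serves as the aspect score. The single-gap pattern and the choice of t5-base are illustrative assumptions, not the paper's exact pattern design or base model.

```python
# Sketch of an infilling-based, reference-free aspect score (assumed setup).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def infilling_score(context_with_gap: str, gap_text: str) -> float:
    """Average log-probability of `gap_text` filling the <extra_id_0> slot."""
    inputs = tokenizer(context_with_gap, return_tensors="pt")
    # T5's span-infilling target format: sentinel token followed by the span.
    target = tokenizer("<extra_id_0> " + gap_text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=target)
    # out.loss is the mean cross-entropy over target tokens; negate it so
    # a higher value means a more plausible infill.
    return -out.loss.item()

prefix = "The new phone has an excellent camera."
candidate = "Its battery life, however, is disappointing."
print(f"infilling score: {infilling_score(prefix + ' <extra_id_0>', candidate):.3f}")
```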
6 Citations
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
- Computer Science, ArXiv
- 2022
This work designs and synthesizes a wide range of potential errors, checks whether they result in a drop in metric scores, investigates the reasons behind these blind spots, and suggests practical workarounds for more reliable evaluation of text generation.
Director: Generator-Classifiers For Supervised Language Modeling
- Computer Science, AACL
- 2022
A new architecture, Director, consists of a unified generator-classifier with both a language-modeling head and a classification head for each output token, and outperforms existing model-guiding approaches in terms of both accuracy and efficiency.
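The following is a minimal sketch of such a generator-classifier: a shared decoder backbone with the usual language-modeling head plus a per-token classifier head whose probabilities re-weight the next-token distribution at decoding time. The GPT-2 backbone, the mixing rule, and the gamma weight are illustrative assumptions, not the Director paper's exact architecture.

```python
# Sketch of a generator-classifier with two heads over a shared backbone (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class GeneratorClassifier(nn.Module):
    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(model_name)
        hidden = self.lm.config.n_embd
        vocab = self.lm.config.vocab_size
        # Classifier head: one "is this next token desirable?" logit per vocabulary entry.
        self.cls_head = nn.Linear(hidden, vocab)

    def forward(self, input_ids, gamma: float = 1.0):
        out = self.lm(input_ids, output_hidden_states=True)
        lm_logits = out.logits[:, -1, :]                     # next-token LM logits
        h = out.hidden_states[-1][:, -1, :]                  # last hidden state
        cls_logprob = F.logsigmoid(self.cls_head(h))         # log P(desirable | token)
        # Combine LM log-probabilities with weighted classifier log-probabilities.
        return torch.log_softmax(lm_logits, dim=-1) + gamma * cls_logprob

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GeneratorClassifier()
ids = tok("The movie was", return_tensors="pt").input_ids
print(tok.decode(model(ids).argmax(dim=-1)))   # most likely next token under the blend
```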
AI vs. Human -- Differentiation Analysis of Scientific Content Generation
- Computer Science
- 2023
It is found that there exists a "writing style" gap between AI-generated scientific text and human-written scientific text, a finding that can guide the optimization of AI models to produce high-quality content and help address related ethical and security concerns.
UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor
- Computer Science, COLING
- 2022
This work constructs prompting templates that invoke the underlying knowledge in a pre-trained language model (PLM) to calculate document and keyword perplexity, which assesses a document's semantic salience and thus improves subsequent abstract generation.
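Below is a minimal sketch of this kind of prompt-based perplexity signal: wrap a document and a candidate keyword in a template and use a causal LM's perplexity on the filled template as a salience score (lower perplexity suggests higher salience). The template wording and the GPT-2 backbone are assumptions for illustration, not UPER's exact setup.

```python
# Sketch of prompt-based perplexity as a salience signal (assumed template and model).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def prompt_perplexity(document: str, keyword: str) -> float:
    prompt = f"{document} In summary, this passage is mainly about {keyword}."
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss   # mean token-level cross-entropy
    return math.exp(loss.item())

doc = "The reactor's cooling system failed during the night shift, forcing an emergency shutdown."
for kw in ["nuclear safety", "football results"]:
    print(kw, round(prompt_perplexity(doc, kw), 1))   # the relevant keyword should score lower
```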
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
- Computer Science, ArXiv
- 2022
This is the first survey paper to summarize CTG techniques from the perspective of PLMs; it aims to help researchers in related fields quickly track the academic frontier by providing a landscape of the area and a roadmap for future research.
Is This Abstract Generated by AI? A Research for the Gap between AI-generated Scientific Text and Human-written Scientific Text
- Computer Science, ArXiv
- 2023
There exists a “writing style” gap between AI-generated scientific text and human-written scientific text, which suggests that while AI has the potential to generate scientific content as accurate as human-written content, there is still a gap in depth and overall quality.
References
Showing 10 of 49 references
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
- Computer Science, EMNLP
- 2019
This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgments of text quality, and validates the new metric, MoverScore, on a number of text generation tasks.
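Below is a minimal sketch of a MoverScore-style computation: embed system and reference tokens with a contextual encoder and measure the Earth Mover's Distance between the two bags of embeddings. Uniform token weights, the cosine cost, and the DistilBERT encoder are simplifying assumptions; the paper additionally uses IDF weighting and n-gram pooling.

```python
# Sketch of an Earth Mover's Distance over contextual token embeddings (simplified).
import numpy as np
import torch
import ot  # POT: pip install POT
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = AutoModel.from_pretrained("distilbert-base-uncased").eval()

def embed(text: str) -> np.ndarray:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = enc(**ids).last_hidden_state[0]        # (tokens, dim)
    h = h / h.norm(dim=-1, keepdim=True)           # unit-normalize for cosine cost
    return h.numpy()

def mover_distance(hypothesis: str, reference: str) -> float:
    H, R = embed(hypothesis), embed(reference)
    cost = (1.0 - H @ R.T).astype(np.float64)      # cosine distance matrix
    a = np.full(len(H), 1.0 / len(H))              # uniform token mass
    b = np.full(len(R), 1.0 / len(R))
    return float(ot.emd2(a, b, cost))              # lower = closer to the reference

print(mover_distance("the cat sat on the mat", "a cat was sitting on the mat"))
```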
BLEURT: Learning Robust Metrics for Text Generation
- Computer Science, ACL
- 2020
BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.
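For reference, a minimal usage sketch of a learned metric like BLEURT via the authors' `bleurt` package (pip install git+https://github.com/google-research/bleurt). The checkpoint path below is a placeholder assumption; it must point to a downloaded BLEURT checkpoint such as BLEURT-20.

```python
# Sketch of scoring candidates against references with a BLEURT checkpoint (assumed path).
from bleurt import score as bleurt_score

scorer = bleurt_score.BleurtScorer("path/to/BLEURT-20")   # hypothetical local checkpoint path
candidates = ["The cat sat quietly on the mat."]
references = ["A cat was sitting on the mat."]
print(scorer.score(references=references, candidates=candidates))  # list of floats
```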
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings
- Computer Science, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation
- 2019
This work explores using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics; experiments show that these metrics outperform RUBER, which is trained on static embeddings.
BERTScore: Evaluating Text Generation with BERT
- Computer Science, ICLR
- 2020
This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
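A minimal usage sketch with the authors' `bert-score` package (pip install bert-score): it greedily matches candidate and reference tokens by contextual-embedding cosine similarity and reports precision, recall, and F1. The example strings are illustrative.

```python
# Sketch of computing BERTScore with the bert-score package.
from bert_score import score

candidates = ["The cat sat quietly on the mat."]
references = ["A cat was sitting on the mat."]
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1[0].item():.3f}")
```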
BARTScore: Evaluating Generated Text as Text Generation
- Computer Science, NeurIPS
- 2021
This work conceptualizes the evaluation of generated text as a text generation problem, modeled with pre-trained sequence-to-sequence models, and proposes the metric BARTScore, with a number of variants that can be flexibly applied in an unsupervised fashion to evaluate text from different perspectives.
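Below is a minimal sketch of a BARTScore-style metric: the score of a hypothesis given a source is the average token log-likelihood of the hypothesis under a pretrained seq2seq model conditioned on the source. The choice of facebook/bart-base and the source-to-hypothesis direction are illustrative assumptions; the paper defines several directional variants.

```python
# Sketch of scoring a hypothesis as the conditional log-likelihood under BART (assumed setup).
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

def bart_score(source: str, hypothesis: str) -> float:
    src = tok(source, return_tensors="pt", truncation=True)
    tgt = tok(hypothesis, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**src, labels=tgt).loss   # mean NLL over hypothesis tokens
    return -loss.item()                        # higher = more likely given the source

src = "The committee approved the budget after a long debate."
print(bart_score(src, "The budget was approved by the committee."))
print(bart_score(src, "The committee rejected the proposal immediately."))
```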
CTRL: A Conditional Transformer Language Model for Controllable Generation
- Computer Science, ArXiv
- 2019
CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
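Below is a minimal sketch of control-code conditioning in this style, using the Hugging Face port of the model: a control code prepended to the prompt steers the domain and style of the continuation. The specific control code ("Reviews") and the decoding settings are illustrative assumptions.

```python
# Sketch of generating with a control-code prefix using the CTRL port in transformers.
from transformers import CTRLLMHeadModel, CTRLTokenizer

tok = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

prompt = "Reviews The new headphones are"     # "Reviews" acts as the control code
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, repetition_penalty=1.2, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```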
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
- Computer Science, EMNLP
- 2021
This work develops a family of interpretable metrics suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data, and shows that the uniformly designed metrics achieve stronger or comparable correlations with human judgment compared to state-of-the-art metrics.
Language Model Augmented Relevance Score
- Computer Science, ACL
- 2021
This paper proposes Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation that leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
- Computer Science, AAAI
- 2018
RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, is proposed; it evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance).
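Below is a minimal sketch of such a blended score: a referenced part (embedding similarity between the generated reply and the ground-truth reply) combined with an unreferenced part (a learned query-reply scorer, which the original work trains with negative sampling; shown here untrained as a placeholder). The DistilBERT encoder, mean pooling, and the simple averaging blend are assumptions.

```python
# Sketch of a referenced + unreferenced blended dialogue score (assumed components).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = AutoModel.from_pretrained("distilbert-base-uncased").eval()

def pool(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        return enc(**ids).last_hidden_state.mean(dim=1).squeeze(0)  # mean-pooled sentence vector

# Unreferenced scorer over (query, reply) embeddings; must be trained in practice.
unref_scorer = nn.Sequential(nn.Linear(2 * 768, 128), nn.Tanh(), nn.Linear(128, 1), nn.Sigmoid())

def blended_score(query: str, reply: str, ground_truth: str) -> float:
    referenced = torch.cosine_similarity(pool(reply), pool(ground_truth), dim=0)
    unreferenced = unref_scorer(torch.cat([pool(query), pool(reply)])).squeeze()
    return float((referenced + unreferenced) / 2)        # simple mean blend

print(blended_score("How was the concert?",
                    "It was amazing, the band played for hours.",
                    "Really great, I loved every song."))
```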
OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics
- Computer Science, ACL
- 2021
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including the correlation with human judgments, the generalization to different model outputs and datasets, the ability to judge story coherence, and the robustness to perturbations.