Corpus ID: 237532546

Does Summary Evaluation Survive Translation to Other Languages?

Neslihan Iskender, Oleg V. Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller
The creation of a large summarization quality dataset is a considerable, expensive, time-consuming effort, requiring careful planning and setup. It includes producing human-written and machine-generated summaries and the evaluation of those summaries by humans, preferably linguistic experts, and by automatic evaluation tools. If such an effort is made in one language, it would be beneficial to be able to use it in other languages. To investigate how much we can trust the translation of such a dataset…

Figures and Tables from this paper


Automatically Evaluating Content Selection in Summarization without Human Models
This work capitalizes on the assumption that the distribution of words in the input and an informative summary of that input should be similar to each other, and ranks participating systems similarly to manual model-based pyramid evaluation and to manual human judgments of responsiveness.
Evaluating the Factual Consistency of Abstractive Text Summarization
A weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
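The core of BLEU can be sketched as clipped n-gram precision combined with a brevity penalty. The snippet below is a simplified single-reference illustration, not the official implementation (which supports multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams (as tuples) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with one reference: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference scores 1.0; a candidate sharing no n-grams with the reference scores 0.0.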
Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Only a small portion of research papers with human evaluation for text summarization provide information about participant demographics, task design, and experiment protocol. Additionally, many…
ROUGE: A Package for Automatic Evaluation of Summaries
Four ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, all included in the ROUGE summarization evaluation package, along with their evaluations.
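As a rough illustration of the ROUGE-N idea (a hand-rolled sketch, not the actual ROUGE package): ROUGE-N measures n-gram recall of a candidate summary against a reference.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams (as tuples) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams / total n-grams in the reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For example, `rouge_n("the cat sat", "the cat sat on the mat", 1)` matches 3 of the 6 reference unigrams, giving 0.5.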
BERTScore: Evaluating Text Generation with BERT
This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
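BERTScore's core mechanism is greedy soft matching of token embeddings by cosine similarity. The sketch below shows only that matching step on placeholder vectors (`cand_vecs` and `ref_vecs` are assumed to be lists of token embeddings); the real metric uses contextual BERT embeddings and optional IDF weighting:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_f1(cand_vecs, ref_vecs):
    """BERTScore-style greedy matching: each token is matched to its most
    similar counterpart; precision averages over candidate tokens, recall
    over reference tokens, and F1 combines the two."""
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(c, r) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)
```

Identical embedding lists yield an F1 of 1.0; in practice the embeddings come from a pretrained encoder, not toy vectors.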
Transformers: State-of-the-Art Natural Language Processing
Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
Teaching Machines to Read and Comprehend
A new methodology is defined that resolves this bottleneck and provides large-scale supervised reading comprehension data, allowing the development of a class of attention-based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.
Toward using confidence intervals to compare correlations.
  • G. Zou
  • Mathematics, Medicine
  • Psychological methods
  • 2007
The distinctive feature of this approach is its acknowledgment of the asymmetry of sampling distributions for single correlations; it requires only the availability of confidence limits for the separate correlations and a method for taking into account the dependency between correlations.
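Zou's approach builds a confidence interval for a difference of correlations from the (asymmetric) Fisher-z limits of each correlation. Below is a minimal sketch of the independent-sample case; the dependent-correlation variant needed when two metrics are compared on the same data adds a covariance term, omitted here:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a single correlation via Fisher's z
    (standard error 1/sqrt(n - 3))."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def zou_diff_ci(r1, n1, r2, n2):
    """Zou-style confidence interval for r1 - r2 with independent samples:
    combine the asymmetric single-correlation limits."""
    l1, u1 = fisher_ci(r1, n1)
    l2, u2 = fisher_ci(r2, n2)
    lower = r1 - r2 - math.sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2)
    upper = r1 - r2 + math.sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2)
    return lower, upper
```

If the interval excludes zero, the two correlations differ at the chosen confidence level.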
SummEval: Re-evaluating Summarization Evaluation
  • arXiv:2007.12626v4
  • 2020