Publications
Neural data-to-text generation: A comparison between pipeline and end-to-end architectures
TLDR
Automatic and human evaluations, together with a qualitative analysis, suggest that having explicit intermediate steps in the generation process results in better texts than those generated by end-to-end approaches.
Best practices for the human evaluation of automatically generated text
TLDR
This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature, for Natural Language Generation systems.
Scalar Diversity
We present experimental evidence showing that there is considerable variation between the rates at which scalar expressions from different lexical scales give rise to upper-bounded construals.
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
TLDR
Due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG appeared extremely confused as of 2020, and the field is in urgent need of standard methods and terminology.
Open Dutch WordNet
TLDR
Open Dutch WordNet is described, which has been derived from the Cornetto database, the Princeton WordNet and open source resources, and it has been linked to the Global Wordnet Grid.
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
TLDR
GEM, a living benchmark for Natural Language Generation (NLG), its Evaluation, and Metrics, is introduced, along with a description of the data for the 2021 shared task at the associated GEM Workshop.
Stereotyping and Bias in the Flickr30K Dataset
TLDR
Evidence against the assumption that crowdsourced descriptions of the images in the Flickr30K dataset focus only on the information that can be obtained from the image alone is presented, and a list of biases and unwarranted inferences is provided.
Measuring the Diversity of Automatic Image Descriptions
TLDR
This paper treats the production of generic descriptions as a lack of diversity in the output, quantified using established metrics and two new metrics that frame image description as a word recall task, evaluating system performance both on the head of the vocabulary and on the long tail, where performance degrades.
Sound-based distributional models
TLDR
The first results of the effort to build a perceptually grounded semantic model based on sound data collected from freesound.org show that the models are able to capture semantic relatedness, with the tag-based model scoring higher than the sound-based model and the combined model.