What Can We Do to Improve Peer Review in NLP?

Anna Rogers and Isabelle Augenstein
Peer review is our best tool for judging the quality of conference submissions, but it is becoming increasingly spurious. We argue that a part of the problem is that the reviewers and area chairs face a poorly defined task forcing apples-to-oranges comparisons. There are several potential ways forward, but the key difficulty is creating the incentives and mechanisms for their consistent implementation in the NLP community. 

What do writing features tell us about AI papers?
This work extracts a collection of writing features, and constructs a suite of prediction tasks to assess the usefulness of these features in predicting citation counts and the publication of AI-related papers, and shows that the features describe writing style more than content.
“This is a Problem, Don’t You Agree?” Framing and Bias in Human Evaluation for Natural Language Generation
Despite recent efforts reviewing current human evaluation practices for natural language generation (NLG) research, the lack of reported question wording and potential for framing effects or…
Ranking Scientific Papers Using Preference Learning
Peer review is the main quality control mechanism in academia. Quality of scientific work has many dimensions; coupled with the subjective nature of the reviewing task, this makes final decision…
Changing the World by Changing the Data
This position paper maps out the arguments for and against data curation, and argues that fundamentally the point is moot: curation already is and will be happening, and it is changing the world.
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
The largest survey of the field to date of question answering and reading comprehension, providing an overview of the various formats and domains of the current resources, and highlighting the current lacunae for future work.
Determining the Credibility of Science Communication
This work presents some first steps towards ensuring that scientific publications are credible and that scientific findings are not misrepresented, distorted, or outright misreported when communicated by journalists or the general public.
Text similarity analysis for evaluation of descriptive answers
A text-analysis-based approach for the automatic evaluation of descriptive answers in an examination, built on a Siamese Manhattan LSTM, was found to be efficient enough to be implemented in an institution or a university.
Exploratory Analysis of News Sentiment Using Subgroup Discovery
In this study, we present an exploratory analysis of a Slovenian news corpus, in which we investigate the association between named entities and sentiment in the news. We propose a methodology that…
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
The experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.


A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications
The first public dataset of scientific peer reviews available for research purposes (PeerRead v1) is presented and it is shown that simple models can predict whether a paper is accepted with up to 21% error reduction compared to the majority baseline.
PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space
The common topic model jointly models the topics common to the submission and the reviewer’s profile while relying on abstract topic vectors to achieve consistent improvements compared to the state-of-the-art.
Does My Rebuttal Matter? Insights from a Major NLP Conference
The results suggest that a reviewer’s final score is largely determined by her initial score and the distance to the other reviewers’ initial scores, which could help better assess the usefulness of the rebuttal phase in NLP conferences.
Emerging trends: Reviewing the reviewers (again)
The ACL-2019 Business meeting ended with a discussion of reviewing, which suggested the problem is not so much too many submissions, but rather, random reviewing.
Choosing How to Choose Papers
This paper presents a framework based on L(p,q)-norm empirical risk minimization for learning the community's aggregate mapping, and characterizes $p=q=1$ as the only choice that satisfies three natural axiomatic properties.
Loss Functions, Axioms, and Peer Review
This paper presents a framework inspired by empirical risk minimization (ERM) for learning the community's aggregate mapping and characterizes $p=q=1$ as the only choice of these hyperparameters that satisfies three natural axiomatic properties.
Improving Our Reviewing Processes
I. Mani. Computational Linguistics, 2011.
As I see it, there are two distinct problems to tackle: first, a lack of qualified reviewers, and second, a lack of quality control in reviews.
On peer review in computer science: analysis of its effectiveness and suggestions for improvement
The development, definition and rationale of a theoretical model for peer review processes are reported on to support the identification of appropriate metrics to assess the processes' main characteristics in order to render peer review more transparent and understandable.
A Quantitative Analysis of Peer Review
In this paper we focus on the analysis of peer reviews and reviewers' behaviour in a number of different review processes. More specifically, we report on the development, definition and rationale of…
Conference reviewing considered harmful
A model of computer systems research is developed to help prospective authors understand the often obscure workings of conference program committees, and it is argued that paper merit is likely to be Zipf-distributed, making it inherently difficult for program committees to distinguish between most papers.