Publications
Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?
TLDR
It is observed that intermediate tasks requiring high-level inference and reasoning abilities tend to work best and that target task performance is strongly correlated with higher-level abilities such as coreference resolution, but more granular correlations between probing and target task performance are not observed.
Consistency of a Recurrent Language Model with Respect to Incomplete Decoding
TLDR
It is proved that commonly used incomplete decoding algorithms - greedy search, beam search, top-k sampling, and nucleus sampling - are inconsistent, despite the fact that recurrent language models are trained to produce sequences of finite length.
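A minimal sketch (not from the paper) of the failure mode behind this result: if a truncated decoder such as top-k sampling always drops the end-of-sequence token, no sampled sequence can ever terminate, even though the model itself assigns positive probability to ending at every step. The vocabulary and distribution below are hypothetical.

```python
import numpy as np

# Hypothetical three-token vocabulary; "<eos>" marks the end of a sequence.
vocab = ["a", "b", "<eos>"]
# Toy next-token distribution: the model gives "<eos>" positive mass (0.1),
# but always ranks it below the other two tokens.
probs = np.array([0.5, 0.4, 0.1])

def top_k_filter(p, k):
    """Keep only the k most probable tokens and renormalize (top-k sampling)."""
    filtered = np.zeros_like(p)
    top = np.argsort(p)[-k:]  # indices of the k largest probabilities
    filtered[top] = p[top]
    return filtered / filtered.sum()

p_truncated = top_k_filter(probs, k=2)
print(dict(zip(vocab, np.round(p_truncated, 3))))
# {'a': 0.556, 'b': 0.444, '<eos>': 0.0}
# With "<eos>" truncated away at every step, the probability of producing a
# finite sequence is zero: the decoding algorithm is inconsistent even though
# the underlying model is not.
```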
ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation
TLDR
This work views its non-autoregressive translation system as an inference network trained to minimize the autoregressive teacher energy, achieving state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets and approaching the performance of autoregressive models.
Unsupervised Evaluation Metrics and Learning Criteria for Non-Parallel Textual Transfer
TLDR
This work considers the problem of automatically generating textual paraphrases with modified attributes or properties, focusing on the setting without parallel data, and proposes additional metrics based on semantic preservation and fluency as well as a way to combine them into a single overall score.
Towards Actual (Not Operational) Textual Style Transfer Auto-Evaluation
TLDR
The precarious current state of style transfer auto-evaluation research is laid out, and ways to aggregate the three metrics into a single evaluator are proposed.
AgreeSum: Agreement-Oriented Multi-Document Summarization
TLDR
This work creates a dataset for AgreeSum and provides annotations on article-summary entailment relations for a subset of the clusters in the dataset, in the hope that these annotations contribute to the community’s effort to improve the faithfulness of abstractive summarization.
QuALITY: Question Answering with Long Input Texts, Yes!
TLDR
QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process, is introduced to enable building and testing models on long-document comprehension.
Comparing Test Sets with Item Response Theory
TLDR
Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models.
...