What’s in Your Head? Emergent Behaviour in Multi-Task Transformer Models

@article{Geva2021WhatsIY,
  title={What’s in Your Head? Emergent Behaviour in Multi-Task Transformer Models},
  author={Mor Geva and Uri Katz and Aviv Ben-Arie and Jonathan Berant},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.06129}
}
The primary paradigm for multi-task training in natural language processing is to represent the input with a shared pre-trained language model, and add a small, thin network (head) per task. Given an input, a target head is the head that is selected for outputting the final prediction. In this work, we examine the behaviour of non-target heads, that is, the output of heads when given input that belongs to a different task than the one they were trained for. We find that non-target heads exhibit… 
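The setup described in the abstract can be made concrete with a minimal sketch, assuming a PyTorch-style implementation: one shared encoder stands in for the pre-trained language model, and one thin linear head is added per task. The encoder, task names, and dimensions below are illustrative placeholders rather than the paper’s actual models or data; the point is only that every head, target or not, can be applied to the same shared representation.

import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, task_out_dims=None):
        super().__init__()
        # Stand-in for the shared pre-trained language model.
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One small, thin head per task (task names here are hypothetical).
        task_out_dims = task_out_dims or {"span_extraction": 2, "classification": 3}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, dim) for task, dim in task_out_dims.items()}
        )

    def forward(self, input_ids, target_task):
        hidden = self.encoder(self.embed(input_ids))
        # The target head produces the final prediction for this input...
        target_logits = self.heads[target_task](hidden)
        # ...but the non-target heads can still be run on the same shared
        # representation, which is the behaviour the paper examines.
        non_target_logits = {
            task: head(hidden)
            for task, head in self.heads.items()
            if task != target_task
        }
        return target_logits, non_target_logits

model = MultiTaskModel()
input_ids = torch.randint(0, 30522, (1, 16))  # toy token IDs
target, non_target = model(input_ids, target_task="span_extraction")
print(target.shape, {task: out.shape for task, out in non_target.items()})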
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
TLDR
ExMIX (Extreme Mixture), a massive collection of 107 supervised NLP tasks across diverse domains and task families, is introduced, and a model pre-trained with a multi-task objective combining self-supervised span denoising and supervised ExMIX is proposed.
Interactively Providing Explanations for Transformer Language Models
TLDR
This work emphasizes prototype networks directly incorporated into the model architecture to explain the reasoning behind the network’s decisions, offering a better understanding of language models and drawing on human capabilities to incorporate knowledge beyond the rigid range of purely data-driven approaches.
Interactively Generating Explanations for Transformer Language Models
TLDR
This work emphasizes prototype networks directly incorporated into the model architecture to explain the reasoning process behind the network’s decisions, offering a better understanding of language models and drawing on human capabilities to incorporate knowledge beyond the rigid range of purely data-driven approaches.

References

Showing 1–10 of 41 references
A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
TLDR
A joint many-task model is introduced, together with a strategy for successively growing its depth to solve increasingly complex tasks; a simple regularization term allows all model weights to be optimized to improve one task’s loss without catastrophic interference with the other tasks.
A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning
TLDR
The Multi-Type Multi-Span Network (MTMSN) is introduced, a neural reading comprehension model that combines a multi-type answer predictor designed to support various answer types with a multi-span extraction method for dynamically producing one or multiple text spans.
Linguistic Knowledge and Transferability of Contextual Representations
TLDR
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
Multi-Task Deep Neural Networks for Natural Language Understanding
TLDR
A Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks that allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.
Evidence Sentence Extraction for Machine Reading Comprehension
TLDR
This paper focuses on extracting evidence sentences that can explain or support the answers of multiple-choice MRC tasks, where the majority of answer options cannot be directly extracted from reference documents.
Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction
TLDR
This study proposes the Query Focused Extractor (QFE) model for evidence extraction and trains it with multi-task learning alongside the QA model; inspired by extractive summarization models, and unlike existing methods, QFE sequentially extracts evidence sentences using an RNN with an attention mechanism over the question sentence.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is introduced; it pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, is introduced, along with a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations
TLDR
A layer-wise analysis of BERT's hidden states reveals that fine-tuning has little impact on the models' semantic abilities and that prediction errors can be recognized in the vector representations of even early layers.
Injecting Numerical Reasoning Skills into Language Models
TLDR
This work shows that numerical reasoning is amenable to automatic data generation, and thus one can inject this skill into pre-trained LMs, by generating large amounts of data, and training in a multi-task setup.