Corpus ID: 226226586

ABNIRML: Analyzing the Behavior of Neural IR Models

@article{MacAvaney2020ABNIRMLAT,
  title={ABNIRML: Analyzing the Behavior of Neural IR Models},
  author={Sean MacAvaney and Sergey Feldman and Nazli Goharian and Doug Downey and Arman Cohan},
  journal={ArXiv},
  year={2020},
  volume={abs/2011.00696}
}
Numerous studies have demonstrated the effectiveness of pretrained contextualized language models such as BERT and T5 for ad-hoc search. However, it is not well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic tests that allow us to probe several characteristics, such as…
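As a concrete illustration of the kind of diagnostic test the abstract alludes to, the sketch below scores each document before and after a controlled perturbation and summarizes how often the ranker's preference flips. This is a minimal sketch under assumptions, not the authors' implementation; the score function, the shuffle_words perturbation, and the delta tie margin are hypothetical stand-ins.

```python
import random
from typing import Callable, Iterable, Tuple

def paired_diagnostic(
    score: Callable[[str, str], float],      # hypothetical ranker: (query, document) -> relevance score
    pairs: Iterable[Tuple[str, str, str]],   # (query, original_doc, perturbed_doc) triples
    delta: float = 0.0,                      # margin below which a score difference counts as a tie
) -> float:
    """Return a value in [-1, 1]: +1 if the model always prefers the original document,
    -1 if it always prefers the perturbed one, and 0 if it is indifferent on average."""
    positive = negative = total = 0
    for query, original, perturbed in pairs:
        diff = score(query, original) - score(query, perturbed)
        if diff > delta:
            positive += 1
        elif diff < -delta:
            negative += 1
        total += 1
    return (positive - negative) / max(total, 1)

def shuffle_words(doc: str) -> str:
    """Example perturbation: destroy word order while keeping the bag of words."""
    words = doc.split()
    random.shuffle(words)
    return " ".join(words)
```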

Citations

Towards Axiomatic Explanations for Neural Ranking Models
TLDR: This work investigates whether the behavior of neural ranking models can be explained in terms of their congruence with well-understood principles of document ranking, using established theories from axiomatic IR.

References

Showing 1-10 of 56 references
CEDR: Contextualized Embeddings for Document Ranking
TLDR: This work investigates how two pretrained contextualized language models (ELMo and BERT) can be used for ad-hoc document ranking and proposes a joint approach that incorporates BERT's classification vector into existing neural models, outperforming state-of-the-art ad-hoc ranking baselines.
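A rough sketch of the "joint approach" the summary mentions: an existing neural ranking model's score is combined with a learned contribution from BERT's classification ([CLS]) vector. This is an assumption-laden illustration, not the CEDR codebase; JointRanker, ToyRanker, and the input shapes are hypothetical.

```python
import torch
import torch.nn as nn

class ToyRanker(nn.Module):
    """Hypothetical stand-in for an existing interaction-based ranker (e.g., KNRM-style features)."""
    def __init__(self, num_features: int = 4):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [batch, num_features] -> one relevance score per document: [batch]
        return self.linear(features).squeeze(-1)

class JointRanker(nn.Module):
    """Adds a learned contribution from BERT's [CLS] vector to an existing ranker's score."""
    def __init__(self, existing_ranker: nn.Module, cls_dim: int = 768):
        super().__init__()
        self.existing_ranker = existing_ranker
        self.cls_proj = nn.Linear(cls_dim, 1)  # projects the [CLS] vector to a scalar

    def forward(self, cls_vector: torch.Tensor, ranker_inputs: torch.Tensor) -> torch.Tensor:
        return self.existing_ranker(ranker_inputs) + self.cls_proj(cls_vector).squeeze(-1)

# Usage with random tensors standing in for real model inputs.
model = JointRanker(ToyRanker())
scores = model(torch.randn(2, 768), torch.randn(2, 4))  # shape [2]: one score per query-document pair
```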
Is Attention Interpretable?
TLDR: Attention noisily predicts input components' overall importance to a model but is by no means a fail-safe indicator; in many cases, gradient-based rankings of attention weights predict their effects better than the attention magnitudes do.
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models…
An Axiomatic Approach to Diagnosing Neural IR Models
TLDR: This work argues that diagnostic datasets grounded in axioms are a good approach to diagnosing neural IR models and empirically validates the extent to which well-known deep IR models realize the axiomatic patterns underlying the datasets.
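As a hedged illustration of what an axiom-grounded diagnostic instance can look like, the sketch below builds a TFC1-style pair (two equal-length documents that differ only in query-term frequency) and checks whether a scoring function prefers the higher-frequency one. The function names and the toy scorer are hypothetical, not taken from the paper.

```python
from typing import Callable

def tfc1_pair(query_term: str, filler: str = "the", length: int = 10):
    """Build two equal-length pseudo-documents that differ only in query-term frequency."""
    doc_low = " ".join([query_term] + [filler] * (length - 1))       # term appears once
    doc_high = " ".join([query_term] * 2 + [filler] * (length - 2))  # term appears twice
    return doc_low, doc_high

def satisfies_tfc1(score: Callable[[str, str], float], query: str,
                   doc_low: str, doc_high: str) -> bool:
    # A model realizes the axiom on this instance if the higher-frequency
    # document scores at least as high as the lower-frequency one.
    return score(query, doc_high) >= score(query, doc_low)

# A trivial term-count scorer satisfies the axiom by construction.
count_scorer = lambda q, d: float(d.split().count(q))
low, high = tfc1_pair("retrieval")
assert satisfies_tfc1(count_scorer, "retrieval", low, high)
```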
Linguistic Knowledge and Transferability of Contextual Representations
TLDR: It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
TLDR: There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where syntactic heuristics fail, can motivate and measure progress in this area.
Diagnosing BERT with Retrieval Heuristics
TLDR: This paper creates diagnostic datasets that each fulfil a retrieval heuristic (both term-matching and semantic-based) to explore what BERT is able to learn, and finds that BERT, when applied to a recently released large-scale web corpus with ad-hoc topics, does not adhere to any of the explored axioms.
Language Models and Word Sense Disambiguation: An Overview and Analysis
TLDR: An in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity reveals that, in some cases, language models come close to solving coarse-grained noun disambiguation under ideal conditions of training-data availability and computing resources.
Deeper Text Understanding for IR with Contextual Neural Language Modeling
TLDR: Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings, bringing large improvements on queries written in natural language.
Document Ranking with a Pretrained Sequence-to-Sequence Model
TLDR: Surprisingly, it is found that the choice of target tokens impacts effectiveness, even for words that are closely related semantically, which sheds some light on why the sequence-to-sequence formulation for document ranking is effective.