ABNIRML: Analyzing the Behavior of Neural IR Models
Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, and Arman Cohan. Transactions of the Association for Computational Linguistics.

Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic probes that allow us to test several characteristics—such as writing styles…

The Role of Complex NLP in Transformers for Text Ranking

The results highlight that syntactic aspects do not play a critical role in the effectiveness of re-ranking with BERT, and point instead to other mechanisms, such as query-passage cross-attention and richer embeddings that capture word meaning from aggregated context regardless of word order, as the main sources of its superior performance.

MS-Shift: An Analysis of MS MARCO Distribution Shifts on Neural Retrieval

This study demonstrates that it is possible to design more controllable distribution shifts as a tool to better understand generalization of IR models, and releases the MS MARCO query subsets, which provide an additional resource to benchmark zero-shot transfer in Information Retrieval.

Match Your Words! A Study of Lexical Matching in Neural Information Retrieval

Overall, it is shown that neural IR models fail to properly generalize term importance to out-of-domain collections or to terms rarely seen during training.

Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators

The experimental results for two different IR tasks reveal that retrieval pipelines are not robust to query variations that keep the content the same, with effectiveness drops of ∼20% on average compared with the original queries provided in the datasets.
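As a rough illustration of the kind of content-preserving query variations such a study evaluates, here is a toy generator; the function name and the specific perturbations are illustrative sketches, not the paper's actual generators, which are more sophisticated:

```python
import random

def generate_variations(query: str, seed: int = 0) -> dict:
    """Produce toy meaning-preserving variations of a query: dropped
    stopwords, shuffled word order, and lowercasing. Real variation
    generators (paraphrasing, misspellings, etc.) are more elaborate."""
    rng = random.Random(seed)
    tokens = query.split()
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    stopwords = {"the", "a", "of", "in"}
    return {
        "original": query,
        "no_stopwords": " ".join(t for t in tokens if t.lower() not in stopwords),
        "word_order": " ".join(shuffled),
        "lowercase": query.lower(),
    }

variants = generate_variations("The History of Information Retrieval")
```

Running each variant through a retrieval pipeline and comparing effectiveness against the original query is then a direct robustness probe.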

Highlighting exact matching via marking strategies for ad hoc document ranking with pretrained contextualized language models

These findings suggest that traditional information retrieval cues such as exact matching remain valuable for large pretrained contextualized models such as BERT and ELECTRA, helping them achieve higher or at least comparable performance.
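The marking idea can be sketched as follows; this toy function (the name and the marker token are illustrative, not the paper's exact implementation) wraps passage tokens that exactly match a query term so the pretrained model sees the match explicitly:

```python
def mark_exact_matches(query: str, passage: str, marker: str = "#") -> str:
    """Surround passage tokens that exactly match a query term with a
    marker token - a simplified version of marking strategies that make
    exact lexical matches visible to a contextualized ranker."""
    query_terms = {t.lower() for t in query.split()}
    marked = [
        f"{marker} {tok} {marker}" if tok.lower().strip(".,") in query_terms else tok
        for tok in passage.split()
    ]
    return " ".join(marked)

out = mark_exact_matches("neural ranking", "Neural models improve ranking quality.")
```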

Towards Axiomatic Explanations for Neural Ranking Models

This work investigates whether one can explain the behavior of neural ranking models in terms of their congruence with well understood principles of document ranking by using established theories from axiomatic IR.

How Does BERT Rerank Passages? An Attribution Analysis with Information Bottlenecks

On BERT-based models for passage reranking, it is found that BERT still cares about exact token matching for reranking; the [CLS] token mainly gathers information for predictions at the last layer; top-ranked passages are robust to token removal; and BERT fine-tuned on MSMARCO has positional bias towards the start of the passage.

The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through

Retrieval performance turns out to be influenced more by the surface form of the text than by its semantics, and an exhaustive categorization yields several classes of orthographically and semantically related, partially related, and completely unrelated neighbors.

What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary

This work proposes to interpret the vector representations produced by dual encoders by projecting them into the model’s vocabulary space, and shows that the resulting distributions over vocabulary tokens are intuitive and contain rich semantic information.

Explainability of Text Processing and Retrieval Methods: A Critical Survey

Approaches that have been applied to explain word embeddings, sequence models, attention modules, transformers, and BERT are surveyed.

CEDR: Contextualized Embeddings for Document Ranking

This work investigates how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking, and proposes a joint approach that incorporates BERT's classification vector into existing neural models, showing that it outperforms state-of-the-art ad-hoc ranking baselines.

An Axiomatic Approach to Diagnosing Neural IR Models

It is argued that diagnostic datasets grounded in axioms are a good approach to diagnosing neural IR models, and it is empirically validated to what extent well-known deep IR models are able to realize the axiomatic patterns underlying the datasets.
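An axiomatic diagnostic of this kind can be sketched with the classic TFC1 constraint (all else equal, more query-term occurrences should not lower the score); the scorer and function names below are illustrative, and the toy term-frequency scorer stands in for an arbitrary neural ranker:

```python
def tfc1_satisfied(score_fn, query: str, doc_low: str, doc_high: str) -> bool:
    """TFC1 axiom check: the document with more query-term occurrences
    (doc_high) should score at least as high as doc_low under score_fn."""
    return score_fn(query, doc_high) >= score_fn(query, doc_low)

def tf_score(query: str, doc: str) -> float:
    # Toy scorer: count of query-term occurrences in the document.
    terms = query.lower().split()
    toks = doc.lower().split()
    return sum(toks.count(t) for t in terms)

ok = tfc1_satisfied(tf_score, "cat", "a cat sat", "a cat and a cat sat")
```

A diagnostic dataset instantiates many such document pairs and reports how often the model under test satisfies the axiom.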

Diagnosing BERT with Retrieval Heuristics

This paper creates diagnostic datasets that each fulfil a retrieval heuristic (both term matching and semantic-based) to explore what BERT is able to learn, and finds that BERT, when applied to a recently released large-scale web corpus with ad-hoc topics, does not adhere to any of the explored axioms.

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.
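The shuffling perturbation itself is simple to sketch; this is a toy version (not the paper's exact pipeline) that permutes word order within each sentence while preserving the bag of words:

```python
import random

def shuffle_corpus(sentences, seed=0):
    """Randomly permute the word order inside each sentence, keeping the
    multiset of tokens intact - the kind of order-destroying perturbation
    used to pre-train order-agnostic MLMs."""
    rng = random.Random(seed)
    shuffled = []
    for s in sentences:
        toks = s.split()
        rng.shuffle(toks)
        shuffled.append(" ".join(toks))
    return shuffled

corpus = ["the quick brown fox jumps", "language models learn order"]
perturbed = shuffle_corpus(corpus)
```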

Language Models and Word Sense Disambiguation: An Overview and Analysis

An in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources.

Linguistic Knowledge and Transferability of Contextual Representations

It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Although measuring held-out accuracy has been the primary approach to evaluating generalization, it often overestimates the performance of NLP models; CheckList, a task-agnostic methodology inspired by behavioral testing in software engineering, offers a complementary way to evaluate model capabilities.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where these heuristics fail, can motivate and measure progress in this area.

Is Attention Interpretable?

While attention noisily predicts input components' overall importance to a model, it is by no means a fail-safe indicator: there are many cases in which this does not hold, and gradient-based rankings of attention weights predict their effects better than the weights' magnitudes do.
