ABNIRML: Analyzing the Behavior of Neural IR Models

Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, Arman Cohan
Transactions of the Association for Computational Linguistics
Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic probes that allow us to test several characteristics—such as writing styles… 

The Role of Complex NLP in Transformers for Text Ranking

The results highlight that syntactic aspects do not play a critical role in the effectiveness of re-ranking with BERT, and point to other mechanisms, such as query-passage cross-attention and richer embeddings that capture word meanings based on aggregated context regardless of word order, as the main contributors to its superior performance.

Match Your Words! A Study of Lexical Matching in Neural Information Retrieval

Overall, it is shown that neural IR models fail to properly generalize term importance on out-of-domain collections or terms almost unseen during training.

Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators

The experimental results for two different IR tasks reveal that retrieval pipelines are not robust to query variations that preserve the query's content, with effectiveness drops of ∼20% on average compared with the original query as provided in the datasets.

Towards Axiomatic Explanations for Neural Ranking Models

This work investigates whether one can explain the behavior of neural ranking models in terms of their congruence with well-understood principles of document ranking by using established theories from axiomatic IR.

How Does BERT Rerank Passages? An Attribution Analysis with Information Bottlenecks

On BERT-based models for passage reranking, it is found that BERT still cares about exact token matching for reranking; the [CLS] token mainly gathers information for predictions at the last layer; top-ranked passages are robust to token removal; and BERT fine-tuned on MSMARCO has positional bias towards the start of the passage.

The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through

Retrieval performance turns out to be influenced more by the surface form than by the semantics of the text, and an exhaustive categorization yields several classes of orthographically and semantically related, partially related, and completely unrelated neighbors.

Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval

This work uses a novel targeted synthetic data generation method that identifies poorly attended entities and conditions the generation episodes on those to teach neural IR to attend more uniformly and robustly to all entities in a given passage.

Sparse Pairwise Re-ranking with Pre-trained Transformers

This work investigates whether the efficiency of pairwise re-ranking can be improved by subsampling from all pairs, and evaluates three sampling methods and five preference aggregation methods.

Axiomatic Retrieval Experimentation with ir_axioms

Axiomatic approaches to information retrieval have played a key role in determining basic constraints that characterize good retrieval models. Beyond their importance in retrieval theory, axioms have



CEDR: Contextualized Embeddings for Document Ranking

This work investigates how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking and proposes a joint approach that incorporates BERT's classification vector into existing neural models, showing that it outperforms state-of-the-art ad-hoc ranking baselines.

An Axiomatic Approach to Diagnosing Neural IR Models

It is argued that diagnostic datasets grounded in axioms are a good approach to diagnosing neural IR models, and it is empirically validated to what extent well-known deep IR models are able to realize the axiomatic pattern underlying the datasets.

Diagnosing BERT with Retrieval Heuristics

This paper creates diagnostic datasets that each fulfil a retrieval heuristic (both term-matching and semantic-based) to explore what BERT is able to learn, and finds that BERT, when applied to a recently released large-scale web corpus with ad-hoc topics, does not adhere to any of the explored axioms.

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.

Language Models and Word Sense Disambiguation: An Overview and Analysis

An in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources.

Linguistic Knowledge and Transferability of Contextual Representations

It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where the heuristics fail, can motivate and measure progress in this area.

Is Attention Interpretable?

While attention noisily predicts input components' overall importance to a model, it is by no means a fail-safe indicator: there are many cases in which this does not hold, and gradient-based rankings of attention weights predict their effects better than their magnitudes do.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.