Corpus ID: 202719370

NLVR2 Visual Bias Analysis

@article{Suhr2019NLVR2VB,
  title={NLVR2 Visual Bias Analysis},
  author={Alane Suhr and Yoav Artzi},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.10411}
}
NLVR2 (Suhr et al., 2019) was designed to be robust to language bias through a data collection process that resulted in each natural language sentence appearing with both true and false labels. The process did not provide a similar measure of control for visual bias. This technical report analyzes the potential for visual bias in NLVR2. We show that some amount of visual bias likely exists. Finally, we identify a subset of the test data that allows testing model performance in a way that…
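The key contrast in the abstract is between label balance at the sentence level, which the NLVR2 collection process enforces, and at the image-pair level, which it does not. The sketch below is a rough illustration of that difference, not the report's actual analysis; the field names and label strings ("sentence", "image_pair", "label", "True"/"False") are assumed for illustration and do not match the dataset's exact schema.

# Minimal sketch (assumed schema, not the authors' code): measure how often a
# grouping key (sentence or image pair) is observed with both gold labels.
from collections import defaultdict

def label_sets(examples, key):
    """Map each value of `key` to the set of gold labels it appears with."""
    groups = defaultdict(set)
    for ex in examples:
        groups[ex[key]].add(ex["label"])
    return groups

def balanced_fraction(examples, key):
    """Fraction of groups (sentences or image pairs) seen with both labels."""
    groups = label_sets(examples, key)
    both = sum(1 for labels in groups.values() if labels == {"True", "False"})
    return both / max(len(groups), 1)

# Usage on toy data (hypothetical examples):
examples = [
    {"sentence": "There are two dogs.", "image_pair": "pair-0", "label": "True"},
    {"sentence": "There are two dogs.", "image_pair": "pair-1", "label": "False"},
    {"sentence": "One image shows a cat.", "image_pair": "pair-0", "label": "False"},
]
print(balanced_fraction(examples, "sentence"))    # 0.5 on this toy data
print(balanced_fraction(examples, "image_pair"))  # 0.5 on this toy data

By construction, the sentence-level fraction should be high on the real NLVR2 data, whereas there is no analogous guarantee at the image-pair level; that gap is what the report's visual-bias analysis probes.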
Citations

Evaluating NLP Models via Contrast Sets
Proposes a new annotation paradigm for NLP that helps close systematic gaps in test data: after a dataset is constructed, its authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
UNITER: Learning UNiversal Image-TExt Representations
Introduces UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Supplementary Material UNITER: UNiversal Image-TExt Representation Learning
This supplementary material has eight sections. Section A.1 describes the details of our dataset collection. Section A.2 describes our implementation details for each downstream task. Section A.3…
Evaluating Models’ Local Decision Boundaries via Contrast Sets
Proposes a more rigorous annotation paradigm for NLP that helps close systematic gaps in test data, recommending that dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.
UNITER: UNiversal Image-TExt Representation Learning
Introduces UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

References

VisualBERT: A Simple and Performant Baseline for Vision and Language
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
A Corpus for Reasoning about Natural Language Grounded in Photographs
Introduces a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges; evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.
Weakly Supervised Semantic Parsing with Abstract Examples
Proposes that in closed worlds with clear semantic types, many learning difficulties can be substantially alleviated by using an abstract representation in which tokens in both the language utterance and the program are lifted to an abstract form, resulting in sharing between different examples that eases training.
A Corpus of Natural Language for Visual Reasoning
Presents a method of crowdsourcing linguistically diverse data; an analysis of the data demonstrates a broad set of linguistic phenomena requiring visual and set-theoretic reasoning.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and demonstrates the generalizability of the pre-trained cross-modality model.