Corpus ID: 220714231

Spatially Aware Multimodal Transformers for TextVQA

@article{Kant2020SpatiallyAM,
  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Yash Kant and Dhruv Batra and Peter Anderson and Alexander Schwing and Devi Parikh and Jiasen Lu and Harsh Agrawal},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.12146}
}
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity…
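The core idea the abstract describes — restricting self-attention so each visual entity attends only to spatially related entities, rather than to all entities as in a fully-connected transformer — can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the single-head formulation, and the boolean relation mask are illustrative assumptions (the paper builds the mask from typed spatial relations between object and OCR boxes).

```python
import numpy as np

def spatially_masked_attention(x, mask):
    """Single-head self-attention sketch: entity i may attend to entity j
    only where mask[i, j] is True (e.g. j is spatially related to i).
    A fully-True mask recovers ordinary fully-connected attention."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (n, n) pairwise similarity
    scores = np.where(mask, scores, -1e9)  # block spatially unrelated pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                     # (n, d) attended features

# Toy example: 3 entities; entity 0 is spatially related only to itself
# and entity 1, so entity 2 cannot influence entity 0's output.
x = np.random.default_rng(0).normal(size=(3, 4))
mask = np.array([[True,  True,  False],
                 [True,  True,  True],
                 [False, True,  True]])
out = spatially_masked_attention(x, mask)
```

Because blocked pairs receive a large negative score before the softmax, their attention weight underflows to zero, so perturbing a masked-out entity leaves the attended output of the other entity unchanged.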