Corpus ID: 202539204

Supervised Multimodal Bitransformers for Classifying Images and Text

@article{Kiela2019SupervisedMB,
  title={Supervised Multimodal Bitransformers for Classifying Images and Text},
  author={Douwe Kiela and Suvrat Bhooshan and Hamed Firooz and Davide Testuggine},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.02950}
}
Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks…
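A minimal sketch of this fusion idea, assuming a ResNet image encoder and a pretrained BERT from the Hugging Face transformers library; the token ordering, pooling, and layer choices below are illustrative rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152
from transformers import BertModel


class MultimodalBitransformer(nn.Module):
    """Sketch: project pooled CNN features into BERT's token-embedding space
    and let the transformer attend jointly over image and text tokens."""

    def __init__(self, num_classes, num_image_tokens=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        cnn = resnet152(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])      # drop avgpool + fc, keep the feature map
        self.pool = nn.AdaptiveAvgPool2d((1, num_image_tokens))   # pool the map into a few "image tokens"
        self.img_proj = nn.Linear(2048, hidden)                   # map 2048-d CNN features to BERT's size
        self.classifier = nn.Linear(hidden, num_classes)
        self.num_image_tokens = num_image_tokens

    def forward(self, input_ids, attention_mask, images):
        # Raw word embeddings; position/segment embeddings are added inside BertModel.
        text_embeds = self.bert.embeddings.word_embeddings(input_ids)
        feats = self.cnn(images)                                  # (B, 2048, H, W)
        feats = self.pool(feats).flatten(2).transpose(1, 2)       # (B, num_image_tokens, 2048)
        img_embeds = self.img_proj(feats)                         # (B, num_image_tokens, hidden)
        embeds = torch.cat([text_embeds, img_embeds], dim=1)
        img_mask = torch.ones(images.size(0), self.num_image_tokens,
                              device=attention_mask.device, dtype=attention_mask.dtype)
        mask = torch.cat([attention_mask, img_mask], dim=1)
        out = self.bert(inputs_embeds=embeds, attention_mask=mask)
        return self.classifier(out.last_hidden_state[:, 0])       # classify from the [CLS] position
```

The key point is that the image contributes a handful of dense "tokens" in the same embedding space as the text, so the pretrained self-attention layers can fuse the two modalities without architectural changes to BERT.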
Deep Multi-Modal Sets
TLDR: A scalable multi-modal framework that reasons over different modalities to learn various types of tasks is demonstrated, achieving new state-of-the-art performance on two multi-modal datasets.
BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations
TLDR: BERT-gen is introduced, an architecture for text generation based on BERT that can leverage either mono- or multi-modal representations; the proposed model obtains substantial improvements over the state of the art on two established Visual Question Generation datasets.
Emerging Trends of Multimodal Research in Vision and Language
TLDR: A detailed overview of the latest trends in research pertaining to visual and language modalities is presented, looking at their applications, task formulations, and how various problems related to semantic perception and content generation are addressed.
IITK at SemEval-2020 Task 8: Unimodal and Bimodal Sentiment Analysis of Internet Memes
TLDR: The results show that a text-only approach, a simple feed-forward neural network (FFNN) with Word2vec embeddings as input, outperforms all the others and stands first in the sentiment analysis task with a relative improvement of 63% over the baseline macro-F1 score.
Multimodal Categorization of Crisis Events in Social Media
TLDR: A new multimodal fusion method that leverages both images and texts as input and introduces a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis.
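A rough sketch of how such sample-wise filtering can be implemented with cross-attention, assuming token-level features for both modalities (the cited paper's exact module and dimensions may differ):

```python
import torch
import torch.nn as nn


class CrossAttentionFilter(nn.Module):
    """Sketch: text tokens query image tokens, so uninformative or misleading
    image regions receive low attention weight in the fused representation."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        # Queries come from text, keys/values from the image modality.
        attended, weights = self.attn(text_tokens, image_tokens, image_tokens)
        return attended, weights  # weights show which image regions were actually used


# Example: 32 text tokens attending over 49 image-region features.
filt = CrossAttentionFilter()
out, w = filt(torch.randn(2, 32, 768), torch.randn(2, 49, 768))
```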
MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
TLDR: A novel framework, MISA, is proposed, which projects each modality into two distinct subspaces; together these subspaces provide a holistic view of the multimodal data and are used for the fusion that leads to task predictions.
Transformers: State-of-the-Art Natural Language Processing
TLDR: Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
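As an illustration of that unified API, the snippet below loads a pretrained checkpoint and its tokenizer and runs a single classification forward pass (the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One import path and one call pattern, regardless of the underlying architecture.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("Text that accompanies an image.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (untrained head, so roughly uniform)
```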
MORSE: MultimOdal sentiment analysis for Real-life SEttings
TLDR: This work introduces MORSE, a domain-specific dataset for MultimOdal sentiment analysis in Real-life SEttings, and proposes a novel two-step fine-tuning method for multimodal sentiment classification using transfer learning and the Transformer model architecture.
lamBERT: Language and Action Learning Using Multimodal BERT
TLDR: This study proposes the language and action learning using multimodal BERT (lamBERT) model that enables the learning of language and actions by extending the BERT model to multimodal representation and integrating it with reinforcement learning.
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function…

References

SHOWING 1-10 OF 76 REFERENCES
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR: This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
Zero-Shot Learning Through Cross-Modal Transfer
TLDR: This work introduces a model that can recognize objects in images even if no training data is available for the object class, and uses novelty detection methods to differentiate unseen classes from seen classes.
Efficient Large-Scale Multi-Modal Classification
TLDR: The results show that fusion with discretized features outperforms text-only classification at a fraction of the computational cost of full multi-modal fusion, with the additional benefit of improved interpretability.
WSABIE: Scaling Up to Large Vocabulary Image Annotation
TLDR: This work proposes a strongly performing method that scales to image annotation datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations.
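A hedged sketch of the WARP-style ranking loss behind this idea, written against precomputed label scores; the sampling budget, margin, and rank estimate here are illustrative simplifications:

```python
import torch


def warp_loss(scores, target_idx, margin=1.0, max_trials=100):
    """WARP-style loss sketch: sample negative labels until one violates the
    margin, then weight the hinge by an estimate of the true label's rank,
    so mistakes near the top of the ranked list are penalised hardest."""
    batch_size, num_labels = scores.shape
    losses = []
    for b in range(batch_size):
        pos_score = scores[b, target_idx[b]]
        for trial in range(1, max_trials + 1):
            neg = torch.randint(num_labels, (1,)).item()
            if neg == int(target_idx[b]):
                continue
            hinge = margin - pos_score + scores[b, neg]
            if hinge > 0:                                   # found a violating negative
                rank_estimate = max(1, (num_labels - 1) // trial)
                weight = sum(1.0 / j for j in range(1, rank_estimate + 1))
                losses.append(weight * hinge)
                break
    if not losses:
        return scores.sum() * 0.0                           # no violations: zero loss
    return torch.stack(losses).mean()
```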
Gated Multimodal Units for Information Fusion
TLDR: The Gated Multimodal Unit model is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities.
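A minimal sketch of such a gated unit for two modalities; the dimensions and activation choices are illustrative:

```python
import torch
import torch.nn as nn


class GatedMultimodalUnit(nn.Module):
    """Sketch of a bimodal gated unit: a learned gate decides, per dimension,
    how much each modality contributes to the fused representation."""

    def __init__(self, visual_dim, text_dim, hidden_dim):
        super().__init__()
        self.visual_fc = nn.Linear(visual_dim, hidden_dim)
        self.text_fc = nn.Linear(text_dim, hidden_dim)
        self.gate = nn.Linear(visual_dim + text_dim, hidden_dim)

    def forward(self, x_visual, x_text):
        h_v = torch.tanh(self.visual_fc(x_visual))   # visual candidate representation
        h_t = torch.tanh(self.text_fc(x_text))       # textual candidate representation
        z = torch.sigmoid(self.gate(torch.cat([x_visual, x_text], dim=-1)))
        return z * h_v + (1 - z) * h_t               # gated combination


# Example: fuse a 2048-d image vector with a 768-d text vector into 512 dims.
fused = GatedMultimodalUnit(2048, 768, 512)(torch.randn(4, 2048), torch.randn(4, 768))
```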
Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks
TLDR: This work designs a method to reuse layers trained on the ImageNet dataset to compute mid-level image representations for images in the PASCAL VOC dataset, and shows that despite differences in image statistics and tasks between the two datasets, the transferred representation leads to significantly improved results for object and action classification.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR: This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
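A small sketch of the hinge rank loss this kind of visual-semantic embedding is typically trained with, assuming image vectors have already been projected into the label-vector space; the margin and dot-product similarity are illustrative choices:

```python
import torch


def hinge_rank_loss(image_embed, label_vectors, target_idx, margin=0.1):
    """Sketch: the projected image should score higher with its own label's
    word vector than with any other label's vector, by at least a margin."""
    scores = image_embed @ label_vectors.T                 # (B, num_labels) similarity scores
    correct = scores.gather(1, target_idx.unsqueeze(1))    # (B, 1) score of the true label
    violations = (margin - correct + scores).clamp(min=0)  # hinge against every other label
    violations.scatter_(1, target_idx.unsqueeze(1), 0.0)   # do not penalise the true label itself
    return violations.sum(dim=1).mean()
```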
CentralNet: a Multilayer Approach for Multimodal Fusion
TLDR: This paper proposes a novel multimodal fusion approach, aiming to produce the best possible decisions by integrating information coming from multiple media through a central network linking the modality-specific networks.
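A hedged sketch of the central-network idea, assuming per-layer hidden states are available from each modality-specific network; the weighting scheme and dimensions are illustrative:

```python
import torch
import torch.nn as nn


class CentralNetFusion(nn.Module):
    """Sketch: a fusion stream whose layer i takes a weighted sum of its own
    previous output and the modality-specific hidden states at the same depth."""

    def __init__(self, dim, num_layers):
        super().__init__()
        self.central_layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        # One learned scalar per source (central, visual, textual) at each depth.
        self.alphas = nn.Parameter(torch.ones(num_layers, 3))

    def forward(self, visual_hiddens, text_hiddens):
        # visual_hiddens / text_hiddens: lists of per-layer hidden states, one tensor per depth.
        central = torch.zeros_like(visual_hiddens[0])
        for i, layer in enumerate(self.central_layers):
            a = self.alphas[i]
            mixed = a[0] * central + a[1] * visual_hiddens[i] + a[2] * text_hiddens[i]
            central = torch.relu(layer(mixed))
        return central
```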
Universal Language Model Fine-tuning for Text Classification
TLDR: This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.
Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures
TLDR: This survey classifies the existing approaches based on how they conceptualize the problem, viz., models that cast description as either a generation problem or a retrieval problem over a visual or multimodal representational space.