Corpus ID: 244117869

CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

Gabriel Skantze, Bram Willemsen
This paper presents CoLLIE: a simple yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected into the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this… 
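The mechanism described above (a predicted difference vector and a scaling factor applied to a frozen text embedding) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, activation functions, and the use of a single shared hidden layer are all assumptions, and the 512-dimensional embedding size merely matches a common CLIP variant.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding size of CLIP ViT-B/32 (assumed for illustration)

# Hypothetical parameters of the learned transformation function:
# a shared hidden layer feeding two heads, one for the difference
# vector and one for the scalar gate.
W_hidden = rng.normal(0.0, 0.02, (DIM, DIM))
W_delta = rng.normal(0.0, 0.02, (DIM, DIM))
w_scale = rng.normal(0.0, 0.02, (DIM,))

def collie_transform(text_emb: np.ndarray) -> np.ndarray:
    """Adjust a frozen text embedding as e' = e + s * d, then re-normalize.

    d is the predicted difference vector and s a scaling factor in (0, 1)
    that controls how strongly the adjustment is applied.
    """
    h = np.tanh(text_emb @ W_hidden)              # shared hidden representation
    d = h @ W_delta                               # predicted difference vector
    s = 1.0 / (1.0 + np.exp(-(h @ w_scale)))      # sigmoid gate: scaling factor
    adjusted = text_emb + s * d
    return adjusted / np.linalg.norm(adjusted)    # CLIP embeddings are unit-norm

# Usage: adjust one (random, stand-in) text embedding.
e = rng.normal(size=DIM)
e /= np.linalg.norm(e)
e_new = collie_transform(e)
```

In practice the parameters would be trained so that the adjusted embedding moves toward the image embeddings of the new referents, while the gate can stay near zero for language that needs no adjustment, leaving the original CLIP behavior intact.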
1 Citation
"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
This work introduces a new learning setup called Personalized Vision & Language (PerVL), with two new benchmark datasets for retrieving and segmenting user-specific ("personalized") concepts "in the wild", and proposes an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts.


Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
DeViSE: A Deep Visual-Semantic Embedding Model
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
A Large-Scale Attribute Dataset for Zero-Shot Learning
A Large-scale Attribute Dataset with 78,017 images across 230 classes and 359 attributes covering visual, semantic, and subjective properties is proposed; the authors argue that the "co-occurrence bias problem" of existing datasets, caused by the biased co-occurrence of objects, significantly hinders models from correctly learning concepts.
Language Modeling with Gated Convolutional Networks
A finite-context approach based on stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed; this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
Image Captioning with Semantic Attention
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
Referring Expression Comprehension: A Survey of Methods and Datasets
This survey examines the state-of-the-art by comparing modern approaches to the referring expression comprehension problem, and classifies methods by their mechanism to encode the visual and textual modalities.
Generalizing from a Few Examples: A Survey on Few-Shot Learning
A thorough survey to fully understand Few-Shot Learning (FSL), which categorizes FSL methods from three perspectives: data, which uses prior knowledge to augment the supervised experience; model, which uses prior knowledge to reduce the size of the hypothesis space; and algorithm, which uses prior knowledge to alter the search for the best hypothesis in the given hypothesis space.
Visual Dialog
A retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response, and a family of neural encoder-decoder models, which outperform a number of sophisticated baselines.
Natural Language Object Retrieval
Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.
Multimodal Distributional Semantics
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.