• Corpus ID: 244117869

CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

  title={CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings},
  author={Gabriel Skantze and Bram Willemsen},
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also… 
1 Citations

Figures and Tables from this paper

"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
This work proposes an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts, and demonstrates that this approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation using rich textual queries.


Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
DeViSE: A Deep Visual-Semantic Embedding Model
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
A Large-Scale Attribute Dataset for Zero-Shot Learning
A Large-scale Attribute Dataset with 78,017 images of 230 classes with 359 attributes of visual, semantic and subjective properties is proposed, which argues that the "co-occurrence bias problem" of existing datasets, which is caused by the biased co-occurring of objects, significantly hinders models from correctly learning the concept.
Language Modeling with Gated Convolutional Networks
A finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed and is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.
Image Captioning with Semantic Attention
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
Visual Dialog
A retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response, and a family of neural encoder-decoder models, which outperform a number of sophisticated baselines.
Continual Learning with Deep Generative Replay
The Deep Generative Replay is proposed, a novel framework with a cooperative dual model architecture consisting of a deep generative model ("generator") and a task solving model ("solver"), with only these two models, training data for previous tasks can easily be sampled and interleaved with those for a new task.
Generalizing from a Few Examples: A Survey on Few-Shot Learning
A thorough survey to fully understand Few-Shot Learning (FSL), and categorizes FSL methods from three perspectives: data, which uses prior knowledge to augment the supervised experience; model, which used to reduce the size of the hypothesis space; and algorithm, which using prior knowledgeto alter the search for the best hypothesis in the given hypothesis space.
Visual question answering: Datasets, algorithms, and future challenges
Natural Language Object Retrieval
Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.