Simple but Effective: CLIP Embeddings for Embodied AI
Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi
Corpus ID: 244346010
Contrastive Language-Image Pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps—yet we find that our improved baselines…
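The core EmbCLIP recipe is to feed the agent's RGB observation through a frozen pretrained visual encoder and train only a small policy head on top. The sketch below illustrates that pattern with a fixed random projection standing in for the CLIP backbone; the encoder, layer sizes, and action count are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen CLIP visual backbone: in EmbCLIP the
# observation passes through a pretrained encoder whose weights are NOT
# updated during policy training. Here a fixed random projection plays
# that role (sizes are illustrative).
W_frozen = rng.standard_normal((3 * 32 * 32, 512)) / np.sqrt(3 * 32 * 32)

def encode_observation(rgb):
    """Map a flattened RGB frame to a 512-d 'CLIP-like' embedding."""
    return np.tanh(rgb.reshape(-1) @ W_frozen)

# Only this small head would be trained (e.g. with actor-critic RL);
# 6 discrete navigation actions is an assumed example.
W_policy = rng.standard_normal((512, 6)) * 0.01

def action_logits(rgb):
    return encode_observation(rgb) @ W_policy

frame = rng.random((32, 32, 3))
print(action_logits(frame).shape)  # (6,)
```

The point of the design is that no gradients flow into the visual encoder, so general-purpose CLIP features can be reused across tasks without task-specific visual machinery.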

How Much Can CLIP Benefit Vision-and-Language Tasks?
It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
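The pretraining task described above (predicting which caption goes with which image) is a symmetric contrastive objective: matched (image, text) pairs in a batch should score higher than all mismatched pairs. A minimal NumPy sketch of that loss, with an assumed temperature value and shapes, follows.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Illustrative sketch of the CLIP objective: row i of img_emb and
    row i of txt_emb are a matched pair; all other combinations are
    negatives. Temperature 0.07 is an assumed value.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))           # diagonal entries are positives

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Symmetric: image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With identical image and text embeddings the loss is near zero, while independently random embeddings give a loss near log(batch size), which is the behavior the contrastive objective relies on.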
THDA: Treasure Hunt Data Augmentation for Semantic Navigation
Can general-purpose neural models learn to navigate? For PointGoal navigation (‘go to ∆x, ∆y’), the answer is a clear ‘yes’: mapless neural models composed of task-agnostic components (CNNs and RNNs)…
AllenAct: A Framework for Embodied AI Research
AllenAct is introduced, a modular and flexible learning framework designed with a focus on the unique requirements of Embodied AI research that provides first-class support for a growing collection of embodied environments, tasks and algorithms.
Auxiliary Tasks and Exploration Enable ObjectGoal Navigation
ObjectGoal Navigation (OBJECTNAV) is an embodied task in which agents must navigate to an object instance in an unseen environment. Prior works have shown that end-to-end OBJECTNAV agents that use…
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
This work distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). It uses the teacher model to encode category texts and the image regions of object proposals, and trains a student detector whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher.
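The mechanism described above has two parts: an alignment loss that pulls the student's region embeddings toward the teacher's, and open-vocabulary classification by similarity to category text embeddings. A minimal sketch under assumed shapes and an L1 alignment loss (names and specifics are illustrative, not the paper's exact formulation):

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def distill_and_classify(student_region_emb, teacher_region_emb, text_emb):
    """Sketch of embedding distillation for open-vocabulary detection.

    student_region_emb: (R, D) detector embeddings for R proposals
    teacher_region_emb: (R, D) teacher embeddings of the same crops
    text_emb:           (K, D) category text embeddings
    """
    s = normalize(student_region_emb)
    t = normalize(teacher_region_emb)
    c = normalize(text_emb)

    distill_loss = np.abs(s - t).mean()   # pull student toward teacher
    class_scores = s @ c.T                # (R, K) similarity to categories
    return distill_loss, class_scores.argmax(axis=1)
```

Because classification is a dot product against text embeddings rather than a fixed classifier layer, new categories can be added at inference time by encoding their names.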
CLIPort: What and Where Pathways for Robotic Manipulation
CLIPort is presented: a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter, and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instances, history, symbolic states, or syntactic structures.
Visual Room Rearrangement
The experiments show that this challenging interactive task, which involves both navigation and object interaction, is beyond the capabilities of current state-of-the-art techniques for embodied tasks, and that we are still very far from perfect performance on tasks of this kind.
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world.