Simple but Effective: CLIP Embeddings for Embodied AI

Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Contrastive language-image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, no inductive biases (such as the use of semantic maps), no auxiliary tasks during training, and no depth maps, yet we find that our improved baselines…


Offline Visual Representation Learning for Embodied Navigation

While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, OVRL’s performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

This work presents a scalable approach for learning open-world object-goal navigation (ObjectNav), the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., “find a sink”), and discovers that agents can generalize to compound instructions where a room is explicitly mentioned and to instructions where the target room must be inferred.

CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration

This paper translates the success of zero-shot vision models to the popular embodied AI task of object navigation, and finds that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift.
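The CLIP-based object localization described above can be illustrated with a minimal, hedged sketch: score each observed frame against text embeddings of candidate object names and pick the best match. The `best_object` helper and the toy two-dimensional vectors below are illustrative stand-ins for real CLIP image/text embeddings, not the CoW implementation.

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def best_object(frame_emb, object_embs):
    """Return the object name whose (hypothetical) text embedding is most
    similar to the current frame embedding, along with its cosine score."""
    frame = normalize(frame_emb)
    scores = {}
    for name, emb in object_embs.items():
        e = normalize(emb)
        scores[name] = sum(a * b for a, b in zip(frame, e))
    name = max(scores, key=scores.get)
    return name, scores[name]

# Toy usage: a frame whose embedding is close to "sink".
name, score = best_object([0.9, 0.1], {"sink": [1.0, 0.0], "bed": [0.0, 1.0]})
```

In a zero-shot navigator of this kind, a score above some threshold would trigger navigation toward the localized object, while low scores keep the agent in its exploration mode.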

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

This work investigates a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning, and introduces the PASTURE benchmark, which considers uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects.

A General Purpose Supervisory Signal for Embodied Agents

The Scene Graph Contrastive (SGC) loss is proposed, which uses scene graphs as general-purpose, training-only, supervisory signals, and uses contrastive learning to align an agent’s representation with a rich graphical encoding of its environment.

Ask4Help: Learning to Leverage an Expert for Embodied Tasks

This paper proposes Ask4Help, a policy that augments agents with the ability to request, and then use, expert assistance, thereby reducing the cost of querying the expert.

Retrospectives on the Embodied AI Workshop

A retrospective on the state of Embodied AI research is presented, and the 13 challenges presented at the Embodied AI Workshop at CVPR are grouped into three themes: visual navigation, rearrangement, and integration.

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

The proposed ProcTHOR, a framework for procedural generation of Embodied AI environments, enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks.

Emergence of Maps in the Memories of Blind Navigation Agents

It is found that blind agents are surprisingly effective navigators in new environments and utilize memory over long horizons, and there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates.

ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

This work presents a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience nor any other training on the visual environments.

Habitat: A Platform for Embodied AI Research

The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM if scaled to an order of magnitude more experience than in previous investigations; the first cross-dataset generalization experiments are also conducted.

Matterport3D: Learning from RGB-D Data in Indoor Environments

Matterport3D is introduced, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes that enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.

How Much Can CLIP Benefit Vision-and-Language Tasks?

It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.

Rearrangement: A Challenge for Embodied AI

arXiv, 2020

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
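The pretraining task described above, predicting which caption goes with which image, is typically implemented as a symmetric cross-entropy over an image-text similarity matrix, where the matched pairs lie on the diagonal. A minimal stdlib-only sketch with toy embeddings (not the actual CLIP implementation, which operates on batched tensors with a learned temperature):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match caption i
    (the diagonal of the similarity matrix) and vice versa."""
    # Pairwise similarities scaled by temperature.
    logits = [[cosine(im, tx) / temperature for tx in text_embs]
              for im in image_embs]

    def xent(rows):
        # Cross-entropy where row i's correct class is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # transpose: text-to-image view
    return 0.5 * (xent(logits) + xent(cols))
```

With perfectly aligned pairs the loss approaches zero, while shuffling the captions against the images drives it up, which is the signal that teaches the two encoders a shared embedding space.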

Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale

A large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments (ObjectGoal Navigation and Pick&Place) finds that the IL-trained agent learns efficient object-search behavior from humans.

Continuous Scene Representations for Embodied AI

Using CSR, state-of-the-art approaches for the challenging downstream task of visual room rearrangement are outperformed without any task-specific training, and the learned embeddings capture salient spatial details of the scene and show applicability to real-world data.

Stubborn: A Strong Baseline for Indoor Object Navigation

A semantic-agnostic exploration strategy (called Stubborn), requiring no learning at all, is presented that surprisingly outperforms prior work on the Habitat Challenge task of navigating to a target object in indoor environments.

THDA: Treasure Hunt Data Augmentation for Semantic Navigation

This paper shows that the key problem is overfitting in ObjectNav, and introduces Treasure Hunt Data Augmentation (THDA) to address overfitting.

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Habitat-Matterport 3D (HM3D) is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations that is ‘Pareto optimal’ in the following sense: agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D.