Simple but Effective: CLIP Embeddings for Embodied AI
```bibtex
@article{Khandelwal2021SimpleBE,
  title   = {Simple but Effective: CLIP Embeddings for Embodied AI},
  author  = {Apoorv Khandelwal and Luca Weihs and Roozbeh Mottaghi and Aniruddha Kembhavi},
  journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2021},
  pages   = {14809-14818}
}
```
Contrastive Language-Image Pretraining (CLIP) encoders have been shown to benefit a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps; yet we find that our improved baselines…
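The recipe the abstract describes is easy to picture in code: freeze a CLIP visual backbone, concatenate its image features with a goal embedding, and train only a small recurrent actor-critic on top. Below is a minimal sketch of that idea in PyTorch; the class name, goal-vocabulary size, and layer sizes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an EmbCLIP-style agent (illustrative, not the paper's code):
# a frozen CLIP backbone feeds a small recurrent actor-critic policy.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP).
import clip
import torch
import torch.nn as nn

class EmbClipStyleAgent(nn.Module):  # hypothetical name
    def __init__(self, num_actions: int, num_goals: int = 100,
                 goal_dim: int = 32, hidden: int = 512):
        super().__init__()
        # Frozen CLIP ResNet-50 visual encoder: no task-specific finetuning.
        self.backbone, _ = clip.load("RN50", device="cpu")
        for p in self.backbone.parameters():
            p.requires_grad = False
        feat_dim = self.backbone.visual.output_dim  # 1024 for RN50
        self.goal_embed = nn.Embedding(num_goals, goal_dim)  # e.g. object-goal id
        self.rnn = nn.GRU(feat_dim + goal_dim, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, num_actions)  # action logits
        self.critic = nn.Linear(hidden, 1)           # value estimate

    def forward(self, rgb, goal_id, hx=None):
        # rgb: (B, 3, 224, 224) frames, already CLIP-preprocessed.
        with torch.no_grad():
            feats = self.backbone.encode_image(rgb).float()
        x = torch.cat([feats, self.goal_embed(goal_id)], dim=-1)
        out, hx = self.rnn(x.unsqueeze(1), hx)  # one timestep per call
        h = out.squeeze(1)
        return self.actor(h), self.critic(h), hx
```

The point of the baseline is that everything task-specific lives in the few trainable layers; the visual representation is CLIP, untouched.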
50 Citations
Offline Visual Representation Learning for Embodied Navigation
- Computer Science, ArXiv
- 2022
While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, OVRL’s performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
- Computer Science, ArXiv
- 2022
This work presents a scalable approach for learning open-world object-goal navigation (ObjectNav), the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., “find a sink”), and discovers that agents can generalize to compound instructions in which a room is explicitly mentioned and to cases where the target room can be inferred.
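The "multimodal goal embeddings" in ZSON's title suggest a goal encoder in which image goals and natural-language goals land in the same CLIP space, so a single policy can consume either. A hedged sketch of such an encoder follows; the helper name is mine, not the paper's:

```python
# Hedged sketch of a multimodal goal encoder in the ZSON spirit: image goals
# and text goals are mapped into one shared CLIP embedding space.
import clip
import torch

model, preprocess = clip.load("RN50", device="cpu")

def goal_embedding(goal) -> torch.Tensor:  # hypothetical helper
    """Unit-norm CLIP embedding for either a text goal or an image goal."""
    with torch.no_grad():
        if isinstance(goal, str):                       # e.g. "find a sink"
            emb = model.encode_text(clip.tokenize([goal]))
        else:                                           # a PIL image of the goal view
            emb = model.encode_image(preprocess(goal).unsqueeze(0))
    return emb / emb.norm(dim=-1, keepdim=True)
```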
CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration
- Computer Science, ArXiv
- 2022
This paper translates the success of zero-shot vision models to the popular embodied AI task of object navigation, and finds that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift.
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
- Computer Science
- 2022
This work investigates a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning, and introduces the PASTURE benchmark, which considers uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects.
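Both CoW papers pair CLIP-based object localization with classical exploration and no additional training. The simplest version of the localization step is scoring each egocentric frame against a text prompt for the target; the sketch below shows that scoring (the papers also study finer-grained localizers, e.g. gradient-based ones, which this omits):

```python
# Hedged sketch of CLIP-based target scoring in the CoW spirit: rank
# egocentric frames by cosine similarity to a prompt for the target object.
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

def target_score(frame: Image.Image, target: str) -> float:  # hypothetical helper
    image = preprocess(frame).unsqueeze(0)
    text = clip.tokenize([f"a photo of a {target}"])
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()  # cosine similarity; higher = target likely visible
```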
A General Purpose Supervisory Signal for Embodied Agents
- Computer Science, ArXiv
- 2022
The Scene Graph Contrastive (SGC) loss is proposed, which uses scene graphs as general-purpose, training-only, supervisory signals, and uses contrastive learning to align an agent’s representation with a rich graphical encoding of its environment.
Ask4Help: Learning to Leverage an Expert for Embodied Tasks
- Computer Science, ArXiv
- 2022
This paper proposes the Ask4Help policy, which augments agents with the ability to request, and then use, expert assistance, thereby reducing the cost of querying the expert.
Retrospectives on the Embodied AI Workshop
- Computer Science, ArXiv
- 2022
A retrospective on the state of Embodied AI research is presented, and the 13 challenges presented at the Embodied AI Workshop at CVPR are grouped into three themes: visual navigation, rearrangement, and integration.
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
- Computer Science, ArXiv
- 2022
ProcTHOR, a framework for procedural generation of Embodied AI environments, enables sampling arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks.
Emergence of Maps in the Memories of Blind Navigation Agents
- Computer Science
- 2023
It is found that blind agents are surprisingly effective navigators in new environments and utilize memory over long horizons; maps and collision-detection neurons emerge in the representations of the environment that a blind agent builds as it navigates.
ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation
- Computer Science, ArXiv
- 2023
This work presents a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments.
References
Showing 1-10 of 39 references
Habitat: A Platform for Embodied AI Research
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations; the first cross-dataset generalization experiments are also conducted.
Matterport3D: Learning from RGB-D Data in Indoor Environments
- Computer Science, 2017 International Conference on 3D Vision (3DV)
- 2017
Matterport3D is introduced, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes that enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
How Much Can CLIP Benefit Vision-and-Language Tasks?
- Computer Science, ICLR
- 2022
It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
Rearrangement: A Challenge for Embodied AI
- ArXiv
- 2020
Learning Transferable Visual Models From Natural Language Supervision
- Computer Science, ICML
- 2021
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
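The pretraining task named here, predicting which caption goes with which image, amounts to a symmetric contrastive objective over a batch of (image, text) pairs; a compact sketch (in the released model the temperature is learned, so the fixed value below is an assumption):

```python
# Compact sketch of the CLIP contrastive objective: each image in a batch of
# N aligned (image, text) pairs must pick out its own caption, and vice versa.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # (N, N) cosine similarities
    targets = torch.arange(len(logits))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```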
Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
A large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments, ObjectGoal Navigation and Pick&Place, finds that the IL-trained agent learns efficient object-search behavior from humans.
Continuous Scene Representations for Embodied AI
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
Using CSR, state-of-the-art approaches for the challenging downstream task of visual room rearrangement are outperformed without any task-specific training; the learned embeddings capture salient spatial details of the scene and show applicability to real-world data.
Stubborn: A Strong Baseline for Indoor Object Navigation
- Computer Science, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
- 2022
A semantic-agnostic exploration strategy (called Stubborn), which involves no learning, is presented; it surprisingly outperforms prior work on the Habitat Challenge task of navigating to a target object in indoor environments.
THDA: Treasure Hunt Data Augmentation for Semantic Navigation
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This paper shows that the key problem in ObjectNav is overfitting, and introduces Treasure Hunt Data Augmentation (THDA) to address it.
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
- Computer Science, NeurIPS Datasets and Benchmarks
- 2021
Habitat-Matterport 3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations that is "Pareto optimal" in the following sense: agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D.