• Corpus ID: 244346010

Simple but Effective: CLIP Embeddings for Embodied AI

@article{Khandelwal2021SimpleBE,
  title={Simple but Effective: CLIP Embeddings for Embodied AI},
  author={Apoorv Khandelwal and Luca Weihs and Roozbeh Mottaghi and Aniruddha Kembhavi},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.09888}
}
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps—yet we find that our improved baselines…
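In spirit, the EmbCLIP baseline described in the abstract swaps the agent's visual encoder for a frozen CLIP backbone and feeds the resulting embedding to a lightweight recurrent policy. The sketch below illustrates that idea only; it is not the authors' released implementation (which builds on the AllenAct framework), and the class name, hidden size, and action count are illustrative assumptions.

# A minimal sketch, assuming a frozen CLIP ResNet-50 encoding each RGB frame
# and a small GRU policy acting on the embeddings. "CLIPNavPolicy" is a
# hypothetical name used only for illustration.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP


class CLIPNavPolicy(nn.Module):
    def __init__(self, num_actions=6, hidden_size=512, device="cpu"):
        super().__init__()
        # Load CLIP and freeze it; only the visual encoder is used here.
        self.backbone, self.preprocess = clip.load("RN50", device=device)
        self.backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Infer the embedding width with a dummy forward pass (1024 for RN50).
        with torch.no_grad():
            dummy = torch.zeros(1, 3, 224, 224, device=device)
            embed_dim = self.backbone.encode_image(dummy).shape[-1]
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, num_actions)   # action logits
        self.critic = nn.Linear(hidden_size, 1)            # value estimate

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 3, 224, 224), already CLIP-preprocessed.
        b, t = frames.shape[:2]
        with torch.no_grad():  # CLIP features are computed but never finetuned
            feats = self.backbone.encode_image(frames.flatten(0, 1)).float()
        out, hidden = self.rnn(feats.view(b, t, -1), hidden)
        return self.actor(out), self.critic(out), hidden

A policy like this could be trained with any standard actor-critic method; the abstract's claim is that such a frozen-CLIP observation encoding is competitive without task-specific architectures, semantic maps, auxiliary losses, or depth input.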

Citations

Offline Visual Representation Learning for Embodied Navigation
TLDR
While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, OVRL’s performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
TLDR
This work presents a scalable approach for learning open-world object-goal navigation (ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., “find a sink”) – and discovers that agents can generalize to compound instructions with a room explicitly mentioned and when the target room can be inferred.
CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration
TLDR
This paper translates the success of zero-shot vision models to the popular embodied AI task of object navigation, and finds that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift.
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
TLDR
The proposed PROCTHOR, a framework for procedural generation of Embodied AI environments, enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks.
Robots Enact Malignant Stereotypes
TLDR
This paper finds that robots powered by large datasets and Dissolution Models that contain humans risk physically amplifying malignant stereotypes in general; and recommends that robot learning methods that physically manifest stereotypes or other harmful outcomes be paused, reworked, or even wound down when appropriate, until outcomes can be proven safe, effective, and just.
Inner Monologue: Embodied Reasoning through Planning with Language Models
TLDR
This work proposes that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios, and finds that closed-loop language feedback significantly improves high-level instruction completion on three domains.
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
TLDR
Each model is pre-trained on its own dataset, and it is shown that the complete system can execute a variety of user-specified instructions in real-world outdoor environments — choosing the correct sequence of landmarks through a combination of language and spatial context — and handle mistakes.
AnyMorph: Learning Transferable Policies By Inferring Agent Morphology
TLDR
This work proposes the first reinforcement learning algorithm that can train a policy to generalize to new agent morphologies without requiring a description of the agent’s morphology in advance, and attains good performance without an explicit description of morphology.
ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
TLDR
A simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task is presented, and analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
Zero-shot object goal visual navigation
TLDR
This work proposes a zero-shot object navigation task by combining zero-shot learning with object goal visual navigation, which aims at guiding robots to find objects belonging to novel classes without any training samples, and shows that the model is less class-sensitive and generalizes better.
...

References

SHOWING 1-10 OF 41 REFERENCES
Learning to Explore using Active Neural SLAM
This work presents a modular and hierarchical approach to learn policies for exploring 3D environments, called 'Active Neural SLAM'. Our approach leverages the strengths of both classical and learning-based methods…
Habitat: A Platform for Embodied AI Research
TLDR
The comparison between learning and SLAM approaches from two recent works is revisited, and evidence is found that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations; the first cross-dataset generalization experiments are also conducted.
How Much Can CLIP Benefit Vision-and-Language Tasks?
TLDR
It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames
TLDR
It is shown that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of "ImageNet pre-training + task-specific fine-tuning" for embodied AI.
Matterport3D: Learning from RGB-D Data in Indoor Environments
TLDR
Matterport3D is introduced, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes that enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. arXiv, 2020
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Rearrangement: A Challenge for Embodied AI. arXiv, 2020
Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale
TLDR
A large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments, using a virtual teleoperation data-collection infrastructure that connects the Habitat simulator running in a web browser to Amazon Mechanical Turk, to answer the question: how does large-scale imitation learning (IL) compare to reinforcement learning (RL)?
Continuous Scene Representations for Embodied AI
TLDR
This work proposes Continuous Scene Representations (CSR), a scene representation constructed by an embodied agent navigating within a space, where objects and their relationships are modeled by continuous-valued embeddings, to embed pair-wise relationships between objects in a latent space.
...
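The CLIP reference above ("Learning Transferable Visual Models From Natural Language Supervision") describes pretraining by predicting which caption goes with which image. The following is a minimal sketch of that symmetric contrastive objective, assuming a batch of already-encoded image/text pairs and an illustrative temperature of 0.07 (not the paper's exact configuration).

# Symmetric contrastive (InfoNCE-style) loss over a cosine-similarity matrix,
# as a sketch of the pretraining task described in the CLIP reference.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings of matched image/text pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; entry [i, j] scores image i against text j.
    logits = image_emb @ text_emb.t() / temperature
    # The matching caption for each image (and vice versa) sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)       # images -> texts
    loss_texts = F.cross_entropy(logits.t(), targets)    # texts -> images
    return (loss_images + loss_texts) / 2


if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this usage example.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())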