• Corpus ID: 235765481

Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers

Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, Xiaolong Wang

We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas… 

Recent Approaches for Perceptive Legged Locomotion

As both legged robots and embedded compute have become more capable, researchers have started to focus on field deployment of these robots. Robust autonomy in unstructured environments requires

Learning to Walk by Steering: Perceptive Quadrupedal Locomotion in Dynamic Environments

A hierarchical learning framework, named PRELUDE, is presented, which decomposes the problem of perceptive locomotion into high-level decision-making to predict navigation commands and low-level gait generation to realize the target commands.

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation, outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

Coupling Vision and Proprioception for Navigation of Legged Robots

This work exploits the complementary strengths of vision and proprioception to develop a point-goal navigation system for legged robots, called VP-Nav, which outperforms both wheeled-robot baselines and ablated variants that have disjoint high-level planning and low-level control.

Robot Active Neural Sensing and Planning in Unknown Cluttered Environments

This work presents an active neural sensing approach that generates kinematically feasible viewpoint sequences for a robot manipulator with an in-hand camera, gathering the minimum number of observations needed to reconstruct the underlying environment.

Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

A novel model-based RL method, named Policy-adaptation Model-based Actor-Critic (PMAC), is proposed, which learns a policy-adapted dynamics model based on a Policy-Adaptation mechanism that dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy.

Egocentric Visual Self-Modeling for Legged Robot Locomotion

This work proposes an end-to-end approach that uses high-dimensional visual observations and action commands to train a visual self-model for legged locomotion, which learns the spatial relationship between the robot's body movement and the ground texture changes from image sequences.

Learning Semantics-Aware Locomotion Skills from Human Demonstration

This work presents a framework that learns semantics-aware locomotion skills from perception for quadrupedal robots, such that the robot can traverse complex off-road terrains with appropriate speeds and gaits using perception information.

Neural Scene Representation for Locomotion on Structured Terrain

This work proposes a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement, and shows that the proposed method outperforms classical map representations.

Human-AI Shared Control via Frequency-based Policy Dissection

The experiments show that human-AI shared control achieved by Policy Dissection in a driving task can substantially improve performance and safety in unseen traffic scenes, and suggest the promising direction of implementing human-AI shared autonomy through interpreting the learned representation of the autonomous agents.



Attention is all you need, 2017

Zero-Shot Terrain Generalization for Visual Locomotion Policies

This paper proposes an end-to-end learning approach that makes direct use of the raw exteroceptive inputs gathered from a simulated 3D LiDAR sensor, thus circumventing the need for ground-truth heightmaps or preprocessing of perception information.

Online Learning of Unknown Dynamics for Model-Based Controllers in Legged Locomotion

This work proposes to learn a time-varying, locally linear residual model along the robot’s current trajectory, to compensate for the prediction errors of the controller's model.

RMA: Rapid Motor Adaptation for Legged Robots

The Rapid Motor Adaptation (RMA) algorithm is presented to solve the problem of real-time online adaptation in quadruped robots. It is trained completely in simulation without using any domain knowledge such as reference trajectories or predefined foot trajectory generators, and is deployed on the A1 robot without any fine-tuning.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.

GLiDE: Generalizable Quadrupedal Locomotion in Diverse Environments with a Centroidal Model

This work explores how RL can be effectively used with a centroidal model to generate robust control policies for quadrupedal locomotion and shows the potential of the method by demonstrating stepping-stone locomotion, two-legged in-place balance, balance beam locomotion, and sim-to-real transfer without further adaptations.

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

This paper proposes SOHO ("Seeing Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner. It does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

This work demonstrates that imitation learning policies based on existing sensor fusion methods under-perform in the presence of a high density of dynamic agents and complex scenarios, which require global contextual reasoning, and proposes TransFuser, a novel Multi-Modal Fusion Transformer to integrate image and LiDAR representations using attention.

UniT: Multimodal Multitask Learning with a Unified Transformer

UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning, achieves strong performance on each task with significantly fewer parameters.

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

It is learned that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms and that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multi-modal transformers.