ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
- Mohit Shridhar, Jesse Thomason, D. Fox
- Computer Science · Computer Vision and Pattern Recognition
- 3 December 2019
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
CLIPort: What and Where Pathways for Robotic Manipulation
CLIPort is presented, a language-conditioned imitation learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter, and is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instances, history, symbolic states, or syntactic structures.
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
- Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew J. Hausknecht
- Computer Science · International Conference on Learning…
- 8 October 2020
ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld and then execute goals from the ALFRED benchmark in a rich visual environment, enables the creation of a new BUTLER agent whose abstract knowledge corresponds directly to concrete, visually grounded actions.
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction
INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects, is presented; its two-stage neural-network grounding model outperforms a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans.
Language Grounding with 3D Objects
- Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer
- Computer Science · Conference on Robot Learning
- 26 July 2021
A novel reasoning task is introduced that targets both visual and non-visual language about 3D objects; adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform, though a large gap remains between these models and human performance.
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
This work investigates PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation, and shows that it significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
XPose: Reinventing User Interaction with Flying Cameras
- Ziquan Lan, Mohit Shridhar, David Hsu, Shengdong Zhao
- Computer Science, Art · Robotics: Science and Systems
- 12 July 2017
A systematic user study indicates that XPose enables more successful user performance in photo-taking tasks than the touchscreen joystick interface widely used in commercial drones today.
Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction
A robot system is presented that retrieves everyday objects from unconstrained natural language descriptions, leveraging a large dataset of images labeled with text descriptions that permits unrestricted object types and referring expressions, together with a two-stage neural-network grounding pipeline.
Monocular SLAM for Real-Time Applications on Mobile Platforms
A lean pipeline for developing a monocular SLAM system on a mobile device is outlined, and the key insights and results of building such a system are highlighted.