Explore and Explain: Self-supervised Navigation and Recounting

@inproceedings{Bigazzi2021ExploreAE,
  title={Explore and Explain: Self-supervised Navigation and Recounting},
  author={Roberto Bigazzi and Federico Landi and Marcella Cornia and Silvia Cascianelli and Lorenzo Baraldi and Rita Cucchiara},
  booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
  year={2021},
  pages={1152-1159}
}
Embodied AI has recently been gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees along the way. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our…

Deep Learning for Embodied Vision Navigation: A Survey

This paper presents a comprehensive review of embodied navigation tasks and the recent progress in deep learning-based methods, covering two major tasks: target-oriented navigation and instruction-oriented navigation.

Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

A new dataset of occupancy maps is collected, built from existing datasets of 3D spaces by generating a number of possible layouts for each environment, and an exploration policy is proposed that can take advantage of previous knowledge of the environment and identify changes in the scene faster and more effectively than existing agents.

Out of the Box: Embodied Navigation in the Real World

This work describes the architectural discrepancies that damage the Sim2Real adaptation ability of models trained on the Habitat simulator and proposes a novel solution tailored towards deployment in real-world scenarios.

Embodied Navigation at the Art Gallery

This paper builds and releases a new 3D space with unique characteristics: a complete art museum, named ArtGallery3D (AG3D), which is larger and richer in visual features than existing spaces and provides very sparse occupancy information.

Learning to Select: A Fully Attentive Approach for Novel Object Captioning

This paper presents a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly.

From Show to Tell: A Survey on Image Captioning

This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compares many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.

Retrieval-Augmented Transformer for Image Captioning

This paper investigates the development of an image captioning approach with a kNN memory, which retrieves knowledge from an external corpus to aid the generation process and increase caption quality.
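
To make the retrieval step concrete, below is a minimal sketch of a kNN memory over an external caption corpus, assuming precomputed embeddings and cosine similarity; the class and argument names (KNNMemory, corpus_embeddings) are illustrative and not taken from the paper's code.

# Illustrative sketch of a kNN memory for retrieval-augmented captioning:
# given an image descriptor, retrieve the k closest captions from an external
# corpus and hand them to the generator as extra context.
import numpy as np

class KNNMemory:
    def __init__(self, corpus_embeddings, corpus_captions):
        # L2-normalize keys so that dot products equal cosine similarities
        self.keys = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
        self.captions = corpus_captions

    def retrieve(self, image_embedding, k=5):
        query = image_embedding / np.linalg.norm(image_embedding)
        scores = self.keys @ query           # cosine similarity against the whole corpus
        top_k = np.argsort(-scores)[:k]      # indices of the k most similar entries
        return [self.captions[i] for i in top_k]

# The retrieved captions would then be encoded and cross-attended to (or
# concatenated with the visual features) by the captioning Transformer.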

Focus on Impact: Indoor Exploration with Intrinsic Motivation

This work proposes to train the model with a purely intrinsic reward signal to guide exploration, based on the impact of the robot's actions on its internal representation of the environment, and replaces the traditional count-based regularization with an estimated pseudo-count of previously visited states.
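
As a rough illustration of the idea summarized above, the sketch below computes an impact-style intrinsic reward, measured as the change in the agent's internal map after an action, and discounts it by a visitation pseudo-count of the current discretized state; the grid-cell discretization and all names are assumptions, not the paper's exact formulation.

# Rough sketch of an impact-based intrinsic reward scaled by a pseudo-count.
import numpy as np
from collections import defaultdict

class ImpactReward:
    def __init__(self):
        self.visit_counts = defaultdict(float)  # pseudo-counts over discretized agent positions

    def __call__(self, map_t, map_t1, agent_cell):
        # impact: how much the agent's internal map changed after the action
        impact = np.linalg.norm(map_t1.astype(float) - map_t.astype(float))
        # discount the reward for states the agent has already visited often
        self.visit_counts[agent_cell] += 1.0
        return impact / np.sqrt(self.visit_counts[agent_cell])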

References

An Exploration of Embodied Visual Exploration

A taxonomy of existing visual exploration algorithms is presented, a standard framework for benchmarking them is created, and a thorough empirical study of four state-of-the-art paradigms is performed using the proposed framework.

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

This paper proposes a fully-attentive captioning algorithm that provides state-of-the-art performance on language generation while restricting computational demands, and incorporates a novel memory-aware encoding of image regions.

Target-driven visual navigation in indoor scenes using deep reinforcement learning

This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization, and introduces the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
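
A minimal sketch of the core architectural idea, a policy and value head conditioned on both the current observation and the goal observation, is given below in PyTorch; the layer sizes and the concatenation-based fusion are illustrative assumptions rather than the paper's actual network.

# Sketch of a goal-conditioned actor-critic head.
import torch
import torch.nn as nn

class GoalConditionedActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        # joint embedding of the current observation and the target observation
        self.encoder = nn.Sequential(nn.Linear(2 * obs_dim, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value = nn.Linear(hidden, 1)           # critic: state value

    def forward(self, obs, goal):
        h = self.encoder(torch.cat([obs, goal], dim=-1))
        return self.policy(h), self.value(h)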

Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation

Perceive, Transform, and Act (PTA) is devised: a fully-attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture to incorporate three different modalities, namely natural language, images, and discrete actions, for agent control.

Curiosity-Driven Exploration by Self-Supervised Prediction

This work formulates curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model; this formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and ignores the aspects of the environment that cannot affect the agent.
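
The sketch below illustrates this curiosity formulation: an inverse-dynamics objective shapes a feature space, and the forward model's prediction error in that space serves as the intrinsic reward. It assumes PyTorch, a discrete action space, and flattened observations; module names are illustrative, not the paper's code.

# Sketch of an intrinsic-curiosity module: reward = forward-model error in feature space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CuriosityModule(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=128):
        super().__init__()
        # phi: feature encoder, shaped by the inverse-dynamics objective
        self.phi = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # inverse model: predict the action taken between two consecutive states
        self.inverse = nn.Linear(2 * feat_dim, n_actions)
        # forward model: predict the next feature vector from features + action
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)

    def forward(self, obs, next_obs, action):
        f_t, f_t1 = self.phi(obs), self.phi(next_obs)
        action_onehot = F.one_hot(action, self.inverse.out_features).float()
        pred_f_t1 = self.forward_model(torch.cat([f_t.detach(), action_onehot], dim=-1))
        # intrinsic reward: error in predicting the consequence of the agent's own action
        reward = 0.5 * (pred_f_t1 - f_t1.detach()).pow(2).sum(dim=-1)
        inv_loss = F.cross_entropy(self.inverse(torch.cat([f_t, f_t1], dim=-1)), action)
        fwd_loss = reward.mean()  # trains only the forward model (features are detached)
        return reward.detach(), inv_loss, fwd_loss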

Gibson Env: Real-World Perception for Embodied Agents

This paper investigates developing real-world perception for active agents, proposes Gibson Environment for this purpose, and showcases a set of perceptual tasks learned therein.

Learning to Poke by Poking: Experiential Learning of Intuitive Physics

A novel approach based on deep neural networks is proposed for modeling the dynamics of the robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics.

On Evaluation of Embodied Navigation Agents

This document summarizes the consensus recommendations of a working group on empirical methodology in navigation research: it discusses different problem statements and the role of generalization, presents evaluation measures, and provides standard scenarios that can be used for benchmarking.
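
One of the evaluation measures recommended in this report is Success weighted by Path Length (SPL); a small sketch of its computation over a batch of episodes is given below (argument names are illustrative).

# Sketch of SPL: mean of success * (shortest_path / max(agent_path, shortest_path)).
def spl(successes, shortest_lengths, path_lengths):
    # successes: binary success indicators per episode
    # shortest_lengths: geodesic distances from start to goal
    # path_lengths: lengths of the paths actually taken by the agent
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)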

The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract)

The promise of ALE is illustrated by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning, and an evaluation methodology made possible by ALE is proposed.

Explainable Agents and Robots: Results from a Systematic Literature Review

A Systematic Literature Review of eXplainable Artificial Intelligence (XAI), finding that almost all of the studied papers deal with robots/agents explaining their behaviors to human users, while very few works address inter-robot (inter-agent) explainability.