Corpus ID: 218571383

Keyframing the Future: Keyframe Discovery for Visual Prediction and Planning

Karl Pertsch, Oleh Rybkin, Jingyun Yang, Shenghao Zhou, Konstantinos G. Derpanis, Kostas Daniilidis, Joseph J. Lim, Andrew Jaegle
Temporal observations such as videos contain essential information about the dynamics of the underlying scene, but they are often interleaved with inessential, predictable details. One way of dealing with this problem is by focusing on the most informative moments in a sequence. We propose a model that learns to discover these important events and the times when they occur and uses them to represent the full sequence. We do so using a hierarchical Keyframe-Inpainter (KeyIn) model that first…
Variational Predictive Routing with Nested Subjective Timescales
This work presents Variational Predictive Routing (VPR), a neural probabilistic inference system that organizes latent representations of video features in a temporal hierarchy, based on their rates of change, thus modeling continuous data as a hierarchical renewal process.
Model-Based Reinforcement Learning via Latent-Space Collocation
It is argued that it is easier to solve long-horizon tasks by planning sequences of states rather than just actions, as the effects of actions greatly compound over time and are harder to optimize.
Episodic Memory for Subjective-Timescale Models
Planning in complex environments requires reasoning over multi-step timescales. However, in model-based learning, an agent's model is more commonly defined over transitions between consecutive…
Learning Intuitive Physics with Multimodal Generative Models
This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes, using a novel See-Through-your-Skin sensor that provides high resolution multimodal sensing of contact surfaces.
Demonstration-Guided Reinforcement Learning with Learned Skills
Karl Pertsch, Youngwoon Lee, Yue Wu, Joseph J. Lim. ArXiv, 2021.
Skill-based Learning with Demonstrations (SkiLD) is proposed, an algorithm for demonstration-guided RL that efficiently leverages the provided demonstrations by following the demonstrated skills instead of the primitive actions, resulting in substantial performance improvements over prior demonstration-guided RL approaches.
Subgoal Search For Complex Reasoning Tasks
It is shown that a simple approach of generating k-th step ahead subgoals is surprisingly efficient on three challenging domains: two popular puzzle games, Sokoban and the Rubik's Cube, and an inequality proving benchmark INT.
Stochastic Image-to-Video Synthesis using cINNs
The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by independently modelling static and other video characteristics, thus laying the basis for controlled video synthesis.
Forecasting Characteristic 3D Poses of Human Actions
A probabilistic approach is proposed that first models the possible multi-modality in the distribution of characteristic poses, then samples future pose hypotheses from the predicted distribution in an autoregressive fashion to model dependencies between joints, and finally optimizes the resulting pose under bone-length and angle constraints.
Long-Horizon Visual Planning with Goal-Conditioned Hierarchical Predictors
By using both goal-conditioning and hierarchical prediction, GCPs enable us to solve visual planning tasks with much longer horizon than previously possible and enable an effective hierarchical planning algorithm that optimizes trajectories in a coarse-to-fine manner.
Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning
This work proposes that the flexibility of human physical problem solving rests on an ability to imagine the effects of hypothesized actions, while the efficiency of human search arises from rich action priors which are updated via observations of the world.


Anticipating Visual Representations from Unlabeled Video
This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, and applies recognition algorithms on the predicted representation to anticipate objects and actions.
Probabilistic Video Generation using Holistic Attribute Control
This work improves video generation consistency through temporally-conditioned sampling and improves quality by structuring the latent space with attribute controls, ensuring that attributes can be both inferred and conditioned on during learning and generation.
Decomposing Motion and Content for Natural Video Sequence Prediction
To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Time-Agnostic Prediction: Predicting Predictable Video Frames
This work decouples visual prediction from a rigid notion of time, so that time-agnostic predictors (TAP) are not tied to specific times and may instead discover predictable "bottleneck" frames no matter when they occur.
Stochastic Variational Video Prediction
This paper develops a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables, and is the first to provide effective stochastic multi-frame prediction for real-world video.
Improved Conditional VRNNs for Video Prediction
This work proposes a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences, and validates the proposal through a series of ablation experiments.
Video (language) modeling: a baseline for generative models of natural videos
For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.
Self-Supervised Visual Planning with Temporal Skip Connections
This work introduces a video prediction model that can keep track of objects through occlusion by incorporating temporal skip-connections, and demonstrates that this model substantially outperforms prior work on video prediction-based control.
Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation
A framework for subgoal generation and planning, hierarchical visual foresight (HVF), is proposed that generates subgoal images conditioned on a goal image and uses them for planning; the method naturally identifies semantically meaningful states as subgoals.
Learning Latent Dynamics for Planning from Pixels
The Deep Planning Network (PlaNet) is proposed, a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space, using a latent dynamics model with both deterministic and stochastic transition components.