
Future Object Detection with Spatiotemporal Transformers

@inproceedings{Tonderski2022FutureOD,
  title={Future Object Detection with Spatiotemporal Transformers},
  author={Adam Tonderski and Joakim Johnander and Christoffer Petersson and Kalle Åström},
  year={2022}
}
Abstract. We propose the task of Future Object Detection, in which the goal is to predict the bounding boxes for all visible objects in a future video frame. While this task involves recognizing temporal and kinematic patterns, in addition to the semantic and geometric ones, it only requires annotations in the standard form for individual (future) frames, in contrast to expensive full-sequence annotations. We propose to tackle this task with an end-to-end method, in which a detection…
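The truncated abstract points to a DETR-style detector extended with spatiotemporal reasoning. Purely as an illustration, and not the paper's actual architecture, the following PyTorch sketch shows one way a detection transformer could map a stack of past frames to class and box predictions for a future frame; the toy convolutional stem, layer sizes, and query count are assumptions, and positional encodings are omitted for brevity.

```python
# Hypothetical sketch only: a DETR-style detector over stacked past frames.
# The backbone, layer sizes, and query count are illustrative assumptions;
# spatiotemporal positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class FutureDETR(nn.Module):
    def __init__(self, num_classes, num_queries=100, d_model=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # toy stem
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True)
        self.queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 = "no object"
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -- the T observed past frames.
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))          # (B*T, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)            # (B*T, h*w, C)
        tokens = tokens.reshape(b, t * tokens.shape[1], -1)  # spatiotemporal tokens
        tgt = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(tokens, tgt)                   # decoded object queries
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = FutureDETR(num_classes=8)(torch.randn(2, 4, 3, 128, 128))
print(logits.shape, boxes.shape)  # (2, 100, 9) and (2, 100, 4)
```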

References

Showing 1-10 of 48 references

End-to-End Object Detection with Transformers

This work presents a new method that views object detection as a direct set prediction problem and demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset.
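For context on the set prediction formulation: DETR matches its fixed set of query predictions to the ground-truth boxes with a Hungarian (bipartite) matching before computing losses. A minimal sketch, assuming an L1 box cost only (the real matcher also mixes in classification and generalized-IoU costs):

```python
# Minimal sketch of bipartite (Hungarian) matching for set prediction.
# Real DETR combines class, L1, and generalized-IoU costs; this uses L1 only.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """pred_boxes: (Q, 4), gt_boxes: (G, 4) -> matched index pairs."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, G) pairwise L1 cost
    q_idx, g_idx = linear_sum_assignment(cost.numpy())
    return q_idx, g_idx

pred, gt = torch.rand(100, 4), torch.rand(5, 4)     # 100 queries, 5 GT boxes
q_idx, g_idx = hungarian_match(pred, gt)
box_loss = (pred[q_idx] - gt[g_idx]).abs().mean()   # loss only on matched pairs
```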

Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments

The problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects, is introduced, and a novel encoder-decoder architecture, STED, is presented, which combines visual and temporal features to model both object motion and ego-motion and outperforms existing approaches for MOF.
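As a loose illustration of this family of box forecasters, and not STED itself (whose visual stream over optical flow is omitted entirely here), a coordinate-only GRU encoder-decoder that autoregressively rolls out future boxes might look like:

```python
# Hypothetical coordinate-only box forecaster; STED's visual stream is omitted.
import torch
import torch.nn as nn

class BoxForecaster(nn.Module):
    def __init__(self, hidden=128, horizon=15):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(4, hidden, batch_first=True)
        self.decoder = nn.GRU(4, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 4)

    def forward(self, past_boxes):
        # past_boxes: (B, T, 4) observed (cx, cy, w, h) track coordinates.
        _, h = self.encoder(past_boxes)
        box, out = past_boxes[:, -1:], []
        for _ in range(self.horizon):     # autoregressive roll-out
            dec, h = self.decoder(box, h)
            box = box + self.head(dec)    # predict a residual displacement
            out.append(box)
        return torch.cat(out, dim=1)      # (B, horizon, 4) future boxes

future = BoxForecaster()(torch.rand(2, 10, 4))
```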

Future Video Synthesis With Object Motion Prediction

An approach that predicts future video frames, given a sequence of past frames, by decoupling the background scene from the moving objects; the model is shown to outperform the state of the art in terms of visual quality and accuracy.

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

An attention-based decoder module, similar to that in the Transformer, is presented to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion.
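The bridging idea can be pictured as ordinary cross-attention between two box representations. A minimal stand-in using PyTorch's stock nn.MultiheadAttention (the actual module adds, among other things, key-relative geometric terms), with all shapes chosen arbitrarily:

```python
# Sketch: bridging two box representations with plain cross-attention.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
anchor_feats = torch.randn(1, 100, 256)  # primary (e.g. anchor) representation
corner_feats = torch.randn(1, 300, 256)  # auxiliary (e.g. corner/center) keys
bridged, _ = attn(anchor_feats, corner_feats, corner_feats)  # (1, 100, 256)
```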

FCOS: Fully Convolutional One-Stage Object Detection

For the first time, a much simpler and more flexible detection framework that achieves improved detection accuracy is demonstrated, and it is hoped that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.
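A bare-bones sketch of an anchor-free, per-location head in the spirit of FCOS, reduced to the three final prediction convolutions (the shared conv towers, FPN levels, and training targets of the real framework are omitted):

```python
# Sketch of an FCOS-style anchor-free head: every feature-map location
# predicts class scores, (l, t, r, b) box distances, and a centerness score.
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        self.cls = nn.Conv2d(channels, num_classes, 3, padding=1)
        self.reg = nn.Conv2d(channels, 4, 3, padding=1)  # distances l, t, r, b
        self.ctr = nn.Conv2d(channels, 1, 3, padding=1)  # centerness

    def forward(self, feat):
        # feat: (B, C, H, W); one prediction per spatial location.
        return self.cls(feat), self.reg(feat).exp(), self.ctr(feat)  # exp > 0

cls, reg, ctr = FCOSHead()(torch.randn(1, 256, 32, 32))
```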

PnP-DETR: Towards Efficient Visual Analysis with Transformers

This work encapsulates the idea of reducing spatial redundancy in a novel poll-and-pool (PnP) sampling module, with which an end-to-end PnP-DETR architecture is built that adaptively allocates its computation spatially to be more efficient.
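One way to picture poll-and-pool sampling: score all tokens, keep the top-k as fine tokens, and compress the remainder into a handful of pooled tokens. The sketch below is a deliberately simplified assumption (for one, it pools over all tokens rather than only the non-polled ones, as the paper does):

```python
# Simplified poll-and-pool token sampler; names and details are assumptions.
import torch
import torch.nn as nn

class PollPool(nn.Module):
    def __init__(self, d_model=256, keep=64, pooled=16):
        super().__init__()
        self.keep, self.score = keep, nn.Linear(d_model, 1)
        self.pool = nn.Linear(d_model, pooled)  # soft assignment to pooled slots

    def forward(self, tokens):
        # tokens: (B, N, C) flattened image features.
        s = self.score(tokens).squeeze(-1)                # (B, N) token scores
        idx = s.topk(self.keep, dim=1).indices            # "poll" the top-k
        fine = torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        w = self.pool(tokens).softmax(dim=1)              # (B, N, P)
        coarse = torch.einsum('bnp,bnc->bpc', w, tokens)  # "pool" the rest
        return torch.cat([fine, coarse], dim=1)           # reduced token set

out = PollPool()(torch.randn(2, 400, 256))
print(out.shape)  # (2, 80, 256)
```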

Forecasting from LiDAR via Future Object Detection

This paper proposes an end-to-end approach to detection and motion forecasting based on raw sensor measurements as opposed to ground-truth tracks, which improves overall accuracy and prompts a rethinking of the role of explicit tracking in embodied perception.

Video Action Transformer Network

The Action Transformer model for recognizing and localizing human actions in video clips is introduced and it is shown that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.
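The query mechanism can be caricatured as one person's RoI feature cross-attending over the clip's spatiotemporal features; in this tiny sketch the shapes, the random stand-in features, and the 80-way action head are all illustrative assumptions:

```python
# Sketch of Action Transformer-style decoding: a person-specific query
# (an RoI-pooled feature) attends over spatiotemporal context tokens.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True)
person_query = torch.randn(1, 1, 256)    # RoI feature of one detected person
context = torch.randn(1, 16 * 196, 256)  # stand-in clip tokens (16 frames)
person_feat, _ = attn(person_query, context, context)
action_logits = nn.Linear(256, 80)(person_feat)  # e.g. an 80-way action head
```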

Decomposing Motion and Content for Natural Video Sequence Prediction

To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
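A toy rendering of the motion/content split, with single conv layers standing in for the paper's encoders and decoder (the actual model uses multi-scale features, recurrent motion encoding, and residual connections):

```python
# Toy motion/content decomposition for next-frame prediction: content comes
# from the last frame, motion from frame differences. Layers are stand-ins.
import torch
import torch.nn as nn

content_enc = nn.Conv2d(3, 64, 3, padding=1)
motion_enc = nn.Conv2d(3, 64, 3, padding=1)
decoder = nn.Conv2d(128, 3, 3, padding=1)

frames = torch.randn(1, 4, 3, 64, 64)  # a short input clip
content = content_enc(frames[:, -1])   # appearance of the last frame
motion = motion_enc((frames[:, 1:] - frames[:, :-1]).mean(dim=1))  # motion cue
next_frame = decoder(torch.cat([content, motion], dim=1))
```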

End-to-End Video Instance Segmentation with Transformers

A new video instance segmentation framework built upon Transformers, termed VisTR, views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem; it achieves the highest speed among all existing VIS models and the best result among methods using a single model on the YouTube-VIS dataset.
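The parallel decoding idea, sketched with a stock transformer decoder: one learned query per (frame, instance) slot, all decoded in a single pass. The encoder memory is faked with random tokens and the mask head is left out:

```python
# Sketch of VisTR-style parallel sequence decoding: one query per
# (frame, instance) slot, decoded jointly in a single transformer pass.
import torch
import torch.nn as nn

T, N, d = 6, 10, 256                 # frames, instances per frame, model width
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
queries = nn.Embedding(T * N, d)     # the instance query sequence
memory = torch.randn(1, T * 196, d)  # stand-in clip-level encoder tokens
hs = decoder(queries.weight.unsqueeze(0), memory)  # (1, T*N, d)
per_frame = hs.view(1, T, N, d)      # instance embeddings, frame by frame
```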