Future Object Detection with Spatiotemporal Transformers
@inproceedings{Tonderski2022FutureOD,
  title  = {Future Object Detection with Spatiotemporal Transformers},
  author = {Adam Tonderski and Joakim Johnander and Christoffer Petersson and Kalle {\AA}str{\"o}m},
  year   = {2022}
}
Abstract. We propose the task Future Object Detection, in which the goal is to predict the bounding boxes for all visible objects in a future video frame. While this task involves recognizing temporal and kinematic patterns, in addition to the semantic and geometric ones, it only requires annotations in the standard form for individual, single (future) frames, in contrast to expensive full sequence annotations. We propose to tackle this task with an end-to-end method, in which a detection…
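Because the task only requires standard single-frame box annotations, predictions for the future frame can be scored exactly like ordinary detections. As a rough illustration (a hypothetical sketch, not the authors' implementation), the core of such an evaluation is IoU matching between predicted and ground-truth boxes in the future frame:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_future_detections(preds, gts, thresh=0.5):
    """Greedy one-to-one matching of predicted future boxes to ground truth.

    `preds` and `gts` are lists of (x1, y1, x2, y2) boxes for the future
    frame; returns the number of true positives at the given IoU threshold.
    """
    unmatched = list(gts)
    tp = 0
    for p in preds:
        scores = [iou(p, g) for g in unmatched]
        if scores and max(scores) >= thresh:
            unmatched.pop(scores.index(max(scores)))
            tp += 1
    return tp
```

This mirrors the matching step behind standard detection metrics such as mAP; the point is that no trajectory or sequence labels are needed, only boxes in the single future frame.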
References
Showing 1–10 of 48 references
End-to-End Object Detection with Transformers
- Computer Science · ECCV
- 2020
This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset.
Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments
- Computer Science · 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2020
The problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects, is introduced and a novel encoder-decoder architecture, STED, is presented, which combines visual and temporal features to model both object-motion and ego-motion, and outperforms existing approaches for MOF.
Future Video Synthesis With Object Motion Prediction
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
An approach is presented to predict future video frames given a sequence of continuous past video frames by decoupling the background scene and moving objects; this model is shown to outperform the state of the art in terms of visual quality and accuracy.
RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder
- Computer Science · NeurIPS
- 2020
An attention-based decoder module, similar to that in the Transformer (Vaswani et al., 2017), is presented to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion.
FCOS: Fully Convolutional One-Stage Object Detection
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
For the first time, a much simpler and more flexible detection framework achieving improved detection accuracy is demonstrated, and it is hoped that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.
PnP-DETR: Towards Efficient Visual Analysis with Transformers
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work encapsulates the idea of reducing spatial redundancy into a novel poll and pool (PnP) sampling module, with which an end-to-end PnP-DETR architecture is built that adaptively allocates its computation spatially to be more efficient.
Forecasting from LiDAR via Future Object Detection
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper proposes an end-to-end approach for detection and motion forecasting based on raw sensor measurements, as opposed to ground-truth tracks, which improves overall accuracy and prompts a rethinking of the role of explicit tracking for embodied perception.
Video Action Transformer Network
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
The Action Transformer model for recognizing and localizing human actions in video clips is introduced and it is shown that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.
Decomposing Motion and Content for Natural Video Sequence Prediction
- Computer Science · ICLR
- 2017
To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
End-to-End Video Instance Segmentation with Transformers
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
A new video instance segmentation framework built upon Transformers, termed VisTR, is presented, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem; it achieves the highest speed among all existing VIS models and the best result among methods using a single model on the YouTube-VIS dataset.