Corpus ID: 204009011

CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

@article{Girdhar2020CATERAD,
  title={CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning},
  author={Rohit Girdhar and D. Ramanan},
  journal={ArXiv},
  year={2020},
  volume={abs/1910.04744}
}
Computer vision has undergone a dramatic revolution in performance, driven in large part by deep features trained on large-scale supervised datasets. However, many of these improvements have focused on static image analysis; video understanding has seen rather modest improvements. Even though new datasets and spatiotemporal models have been proposed, simple frame-by-frame classification methods often remain competitive. We posit that current video datasets are plagued with implicit…
26 Citations
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
  • 66
Compositional Video Synthesis with Action Graphs
  • 3
Hopper: Multi-hop Transformer for Spatiotemporal Reasoning
  • 1
  • Highly Influenced
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
  • 2
  • PDF
Win-Fail Action Recognition
Holistic static and animated 3D scene generation from diverse text descriptions
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
  • Highly Influenced
On Modality Bias in the TVQA Dataset
  • 1
