HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, I. Laptev, Josef Sivic
- Computer Science, IEEE International Conference on Computer Vision
- 7 June 2019
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
MovieQA: Understanding Stories in Movies through Question-Answering
- Makarand Tapaswi, Yukun Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, S. Fidler
- Computer Science, Computer Vision and Pattern Recognition
- 9 December 2015
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
Semi-supervised Learning with Constraints for Person Identification in Multimedia Data
- M. Bäuml, Makarand Tapaswi, R. Stiefelhagen
- Computer Science, IEEE Conference on Computer Vision and Pattern…
- 1 June 2013
A unified learning framework is proposed which incorporates labeled and unlabeled data, and constraints between pairs of features in the training, and is applied to train multinomial logistic regression classifiers for multi-class face recognition.
“Knock! Knock! Who is it?” probabilistic person identification in TV-series
- Makarand Tapaswi, M. Bäuml, R. Stiefelhagen
- Computer Science, IEEE Conference on Computer Vision and Pattern…
- 16 June 2012
This work models each TV series episode as a Markov Random Field, integrating face recognition, clothing appearance, speaker recognition and contextual constraints in a probabilistic manner, and formulates the identification task as an energy minimization problem.
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
- Paul Vicol, Makarand Tapaswi, Lluís Castrejón, S. Fidler
- Computer Science, IEEE/CVF Conference on Computer Vision and…
- 19 December 2017
MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.
Total Cluster: A person agnostic clustering method for broadcast videos
- Makarand Tapaswi, O. Parkhi, Esa Rahtu, Eric Sommerlade, R. Stiefelhagen, Andrew Zisserman
- Computer Science, Indian Conference on Computer Vision, Graphics…
- 14 December 2014
The extent to which faces can be clustered automatically without making an error is explored, and an extension of the clustering method to entire episodes using exemplar SVMs based on the negative training data automatically harvested from the editing structure is proposed.
Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning
- Ziad Al-Halah, Makarand Tapaswi, R. Stiefelhagen
- Computer Science, Computer Vision and Pattern Recognition
- 1 June 2016
This work proposes an approach to learn relations that couple class embeddings with their corresponding attributes, given only the name of an unseen class, which outperforms state-of-the-art methods in both predicting class-attribute associations and unsupervised ZSL by a large margin.
Situation Recognition with Graph Neural Networks
- Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, R. Urtasun, S. Fidler
- Computer Science, IEEE International Conference on Computer Vision
- 14 August 2017
A model based on Graph Neural Networks is proposed that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph and significantly outperforms existing work, as well as multiple baselines.
Video Face Clustering With Unknown Number of Clusters
- Makarand Tapaswi, M. Law, S. Fidler
- Computer Science, IEEE International Conference on Computer Vision
- 9 August 2019
Ball Cluster Learning (BCL), a supervised approach that carves the embedding space into balls of equal size, one for each cluster, is proposed; the learned ball radius is easily translated to a stopping criterion for iterative merging algorithms.
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
- Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, C. Schmid, I. Laptev
- Computer Science, Computer Vision and Pattern Recognition
- 23 February 2022
This work proposes a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding, and builds a topological map on-the-fly to enable efficient exploration in global action space.