HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

@inproceedings{Miech2019HowTo100MLA,
  title={HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips},
  author={Antoine Miech and Dimitri Zhukov and Jean-Baptiste Alayrac and Makarand Tapaswi and Ivan Laptev and Josef Sivic},
  booktitle={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={2630--2640}
}
Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. [...] Our data collection procedure is fast, scalable, and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains.
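The joint text-video embedding the abstract describes can be sketched as two projections into a shared space trained with a max-margin ranking loss over aligned (video, text) pairs. This is a minimal illustration, not the authors' implementation: the feature dimensions, projection matrices, and margin value below are invented for the sketch.

```python
import numpy as np

# Hypothetical sketch of a joint text-video embedding with a max-margin
# ranking loss. All dimensions and weights are made up for illustration.
rng = np.random.default_rng(0)

d_video, d_text, d_joint = 8, 6, 4
W_v = rng.normal(size=(d_video, d_joint))  # video projection (assumed)
W_t = rng.normal(size=(d_text, d_joint))   # text projection (assumed)

def embed(x, W):
    """Project features into the joint space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def ranking_loss(video_feats, text_feats, margin=0.2):
    """Max-margin ranking loss over a batch of aligned (video, text) pairs."""
    v = embed(video_feats, W_v)          # (n, d_joint)
    t = embed(text_feats, W_t)           # (n, d_joint)
    sim = v @ t.T                        # cosine similarities, (n, n)
    pos = np.diag(sim)                   # matched pairs sit on the diagonal
    # Hinge: penalize any negative that scores within `margin` of the positive.
    loss_vt = np.maximum(0.0, margin + sim - pos[:, None])  # video -> text
    loss_tv = np.maximum(0.0, margin + sim - pos[None, :])  # text -> video
    n = len(pos)
    off = ~np.eye(n, dtype=bool)         # exclude the positive pairs
    return (loss_vt[off].sum() + loss_tv[off].sum()) / n

batch_v = rng.normal(size=(5, d_video))
batch_t = rng.normal(size=(5, d_text))
print(ranking_loss(batch_v, batch_t))    # non-negative scalar
```

In training, the projections would be optimized so matched clip-narration pairs score higher than mismatched ones, which is what enables text-to-video retrieval by nearest-neighbor search in the joint space.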


    Citations

    Publications citing this paper (a selection from 47 total citations):

    Learning Spatiotemporal Features via Video and Text Pair Discrimination


    Multi-modal Dense Video Captioning

    • Vladimir Iashin, Esa Rahtu
    • Computer Science, Engineering
    • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
    • 2020

    HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training


    Action Modifiers: Learning from Adverbs in Instructional Videos


    Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings



    CITATION STATISTICS

    • 13 highly influenced citations

    • Averaged 23 citations per year from 2019 through 2020

    References

    Publications referenced by this paper (a selection from 80 total references):

    MSR-VTT: A Large Video Description Dataset for Bridging Video and Language


    Enhancing Video Summarization via Vision-Language Embedding


    Localizing Moments in Video with Natural Language


    Unsupervised Learning from Narrated Instruction Videos


    Movie Description


    COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis
