Retrieving and Highlighting Action with Spatiotemporal Reference

Seito Kasai, Yuchi Ishikawa, Masaki Hayashi, Yoshimitsu Aoki, Kensho Hara, Hirokatsu Kataoka
2020 IEEE International Conference on Image Processing (ICIP)
In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting. Action highlighting is a fine-grained task, compared to conventional action recognition tasks, which focus on classification or window-based localization. Leveraging weak supervision from annotated…

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
This paper proposes to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions by building a separate multi-modal embedding space for each PoS tag, which enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization
This work introduces a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms and a textual path that is jointly trained from scratch, which offers a versatile model.
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
A simple change to common loss functions used for multi-modal embeddings, inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, is introduced, which yields significant gains in retrieval performance.
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
This paper proposes a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) by a fusion strategy for efficient retrieval and explores several loss functions in training the embedding.
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
This work proposes to incorporate generative processes into the cross-modal feature embedding, through which it is able to learn not only the global abstract features but also the local grounded features of image-text pairs.
Dual Encoding for Zero-Example Video Retrieval
This paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own and establishes a new state-of-the-art for zero-example video retrieval.
Use What You Have: Video retrieval using representations from collaborative experts
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training, and demonstrates significant improvements over previously reported methods on both text-to-video and video-to-text retrieval tasks.
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
The Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending against text-domain adversarial attacks and empowers the use of visual cues to accurately resolve word dependencies in novel sentences.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.