Corpus ID: 239049626

Video and Text Matching with Conditioned Embeddings

@article{Ali2021VideoAT,
  title={Video and Text Matching with Conditioned Embeddings},
  author={Ameen Ali and Idan Schwartz and Tamir Hazan and Lior Wolf},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.11298}
}
We present a method for matching a text sentence from a given corpus to a given video clip, and vice versa. Traditionally, video and text matching is done by learning a shared embedding space, and the encoding of one modality is independent of the other. In this work, we encode the dataset samples in a way that takes the query's relevant information into account. The power of the method is demonstrated to arise from pooling the interaction data between words and frames. Since the encoding of the…
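
The abstract only sketches the mechanism, so the following is a minimal, hypothetical PyTorch sketch of query-conditioned pooling: frame features are aggregated with attention weights computed from their interactions with the words of the query sentence, and the same module can be applied in the symmetric direction. All class, variable, and dimension names here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedPooling(nn.Module):
    # Pool one modality's features conditioned on the other modality (illustrative sketch).
    def __init__(self, dim: int = 512):
        super().__init__()
        self.word_proj = nn.Linear(dim, dim)
        self.frame_proj = nn.Linear(dim, dim)

    def forward(self, frames: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, dim) features of the video clip
        # words:  (num_words, dim) features of the query sentence
        # Word-frame interaction matrix, shape (num_words, num_frames).
        interactions = self.word_proj(words) @ self.frame_proj(frames).t() / frames.size(-1) ** 0.5
        # Attention over frames for each word, averaged over words: a single
        # distribution over frames that is conditioned on the query sentence.
        attn = interactions.softmax(dim=-1).mean(dim=0)   # (num_frames,)
        return attn @ frames                              # (dim,) query-conditioned video embedding

if __name__ == "__main__":
    torch.manual_seed(0)
    pool = ConditionedPooling(dim=512)
    video = torch.randn(32, 512)       # 32 frame features
    sentence = torch.randn(12, 512)    # 12 word features
    video_given_text = pool(video, sentence)
    text_given_video = pool(sentence, video)  # symmetric direction
    # Match score between the two conditioned embeddings.
    print(F.cosine_similarity(video_given_text, text_given_video, dim=0).item())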


Citations

Latent Space Explanation by Intervention
This study aims to reveal hidden concepts by employing an intervention mechanism, based on discrete variational autoencoders, that shifts the predicted class; identifying the concepts that can alter the class provides interpretability.

References

Showing 1-10 of 54 references
Weakly Supervised Video Moment Retrieval From Text Queries
This work proposes a joint visual-semantic embedding framework that learns the notion of relevant segments from video using only video-level sentence descriptions, via Text-Guided Attention (TGA).
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
A multilevel model that integrates vision and language features earlier and more tightly than prior work is introduced; text features are injected early when generating clip proposals to eliminate unlikely clips, which speeds up processing and boosts performance.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Use What You Have: Video retrieval using representations from collaborative experts
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
Cross-Modal and Hierarchical Modeling of Video and Text
Hierarchical sequence embedding is introduced: a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, using either explicit or implicit correspondence information across the modalities.
Dual Encoding for Zero-Example Video Retrieval
This paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own, and establishes a new state of the art for zero-example video retrieval.
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
This paper proposes a novel framework that simultaneously utilizes multimodal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval, and explores several loss functions for training the embedding.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summary of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Localizing Moments in Video with Natural Language
The Moment Context Network (MCN) is proposed, which effectively localizes natural language queries in videos by integrating local and global video features over time, and outperforms several baseline methods.
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
This work focuses on video-language tasks including multimodal retrieval and video QA, and evaluates the JSFusion model on three retrieval and VQA tasks in LSMDC, where it achieves the best performance reported so far.