Corpus ID: 16007710

Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction

@article{Dong2016Word2VisualVecIA,
  title={Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction},
  author={J. Dong and Xirong Li and Cees G. M. Snoek},
  journal={arXiv: Computer Vision and Pattern Recognition},
  year={2016}
}
This paper strives to find the sentence best describing the content of an image or video. Different from existing works, which rely on a joint subspace for image / video to sentence matching, we propose to do so in a visual space only. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input based on sentence vectorization and a multi-layer perceptron. We thoroughly analyze its architectural design, by varying the sentence… Expand
38 Citations
Stacked Convolutional Deep Encoding Network For Video-Text Retrieval
  • 2
  • PDF
A unified cycle-consistent neural model for text and image retrieval
  • 3
  • Highly Influenced
Scene Retrieval for Video Summarization Based on Text-to-Image gan
  • 1
University of Amsterdam and Renmin University at TRECVID 2016: Searching Video, Detecting Events and Describing Video
  • 21
  • PDF
Joint embeddings with multimodal cues for video-text retrieval
  • 15
  • Highly Influenced
Gated Recurrent Capsules for Visual Word Embeddings
  • 2
  • Highly Influenced
  • PDF
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
  • 35
  • PDF
Weakly Supervised Video Moment Retrieval From Text Queries
  • 38
  • PDF
...
1
2
3
4
...

References

SHOWING 1-10 OF 46 REFERENCES
Learning Joint Representations of Videos and Sentences with Web Image Search
  • 54
  • Highly Influential
  • PDF
From captions to visual concepts and back
  • 1,003
  • PDF
Jointly Modeling Embedding and Translation to Bridge Video and Language
  • 383
  • Highly Influential
  • PDF
Deep Visual-Semantic Alignments for Generating Image Descriptions
  • A. Karpathy, Li Fei-Fei
  • Computer Science, Medicine
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
  • 1,879
  • PDF
Multimodal Convolutional Neural Networks for Matching Image and Sentence
  • 252
  • Highly Influential
  • PDF
Show and tell: A neural image caption generator
  • 3,806
  • Highly Influential
  • PDF
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework
  • 192
  • Highly Influential
  • PDF
The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection
  • 97
  • PDF
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
  • 470
  • PDF
...
1
2
3
4
5
...