Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks

  title={Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks},
  author={Kyunghyun Cho and Aaron C. Courville and Yoshua Bengio},
  journal={IEEE Transactions on Multimedia},
Whereas deep neural networks were first mostly used for classification tasks, they are rapidly expanding in the realm of structured output problems, where the observed target is composed of multiple random variables that have a rich joint distribution, given the input. In this paper we focus on the case where the input also has a rich structure and the input and output structures are somehow related. We describe systems that learn to attend to different places in the input, for each element of… 

Figures and Tables from this paper

Not All Attention Is Needed: Gated Attention Network for Sequence Data

A novel method to dynamically select a subset of elements to attend to using an auxiliary network, and compute attention weights to aggregate the selected elements is proposed, promising to extend the idea to more complex attention-based models, such as transformers and seq-to-seq models.

CAM-RNN: Co-Attention Model Based RNN for Video Captioning

A co-attention model based recurrent neural network (CAM-RNN) is proposed, where the CAM is utilized to encode the visual and text features, and the RNN works as the decoder to generate the video caption.

Learning by Injection: Attention Embedded Recurrent Neural Network for Amharic Text-image Recognition

This paper extends the attention mechanism for Amharic text-image recognition to include CNNs and attention embedded recurrent encoder-decoder networks that are integrated following the configuration of the seq2seq framework.

Survey on the attention based RNN model and its applications in computer vision

This survey introduces some attention based RNN models which can focus on different parts of the input for each output item, in order to explore and take advantage of the implicit relations between the input and the output items.

CaptionNet: A Tailor-made Recurrent Neural Network for Generating Image Descriptions

A novel model named CaptionNet is proposed in this work as an improved LSTM specially designed for image captioning, where only attended image features are allowed to be fed into the memory of CaptionNet through input gates, reducing the dependency on the previous predicted words.

Pay Attention to the Activations: A Modular Attention Mechanism for Fine-Grained Image Recognition

The proposed approach learns to attend to lower-level feature activations without requiring part annotations and uses those activations to update and rectify the output likelihood distribution, and demonstrates that well-known networks such as wide residual networks and ResNeXt, when augmented with the approach, systematically improve their classification accuracy and become more robust to changes in deformation and pose.

Recurrent Highway Networks with Attention Mechanism for Scene Text Recognition

Recurrent Highway Networks (RHN), as a popular architecture because of its capability of training deep structure, can preform excellently in plenty of situations and has least parameters, is employed and combined with attention mechanism to solve scene Text Recognition problem.

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

An attention-based sequence-to-sequence model that combines a convolutional neural network as a generic feature extractor with a recurrent neural network to encode both the visual information, as well as the temporal context between characters in the input image, and uses a separate recurrent network to decode the actual character sequence.

Image Input OR Video Hierarchical LSTMs with Adaptive Attention ( hLSTMat ) Feature Extraction Generated Captions Losses

A hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning that utilizes the spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention is for deciding whether to depend on the visual information or the language context information.

Deep neural network with attention model for scene text recognition

In the proposed framework, feature extraction, feature attention and sequence recognition are integrated in a jointly trainable network, and this model has the least number of parameters so far.



Explain Images with Multimodal Recurrent Neural Networks

The m-RNN model directly models the probability distribution of generating a word given previous words and the image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Long-term recurrent convolutional networks for visual recognition and description

A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Describing Videos by Exploiting Temporal Structure

This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions and proposes a temporal attention mechanism that allows to go beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.

Attention-Based Models for Speech Recognition

The attention-mechanism is extended with features needed for speech recognition and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition

  • L. Tóth
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined and report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.

Language Models for Image Captioning: The Quirks and What Works

By combining key aspects of the ME and RNN methods, this paper achieves a new record performance over previously published results on the benchmark COCO dataset, however, the gains the authors see in BLEU do not translate to human judgments.

Multiple Object Recognition with Visual Attention

The model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image and it is shown that the model learns to both localize and recognize multiple objects despite being given only class labels during training.