Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks

@article{cho_describing_multimedia,
  title={Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks},
  author={Kyunghyun Cho and Aaron C. Courville and Yoshua Bengio},
  journal={IEEE Transactions on Multimedia}
}
Whereas deep neural networks were first mostly used for classification tasks, they are rapidly expanding in the realm of structured output problems, where the observed target is composed of multiple random variables that have a rich joint distribution, given the input. In this paper we focus on the case where the input also has a rich structure and the input and output structures are somehow related. We describe systems that learn to attend to different places in the input, for each element of… 
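
The attention idea sketched in the abstract can be made concrete with a minimal NumPy illustration of soft (content-based) attention: each input annotation is scored against the decoder state, the scores are normalized with a softmax, and a context vector is formed as the weighted average. All weight matrices, names, and dimensions here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def soft_attention(annotations, state, W, U, v):
    """Score each annotation against the decoder state, then return
    the softmax-weighted average of the annotations as the context."""
    # annotations: (n_locations, d_ann), state: (d_state,)
    scores = np.tanh(annotations @ W + state @ U) @ v  # (n_locations,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over locations
    context = weights @ annotations                    # (d_ann,)
    return context, weights

rng = np.random.default_rng(0)
n, d_ann, d_state, d_att = 5, 8, 6, 4
ann = rng.standard_normal((n, d_ann))
ctx, w = soft_attention(ann,
                        rng.standard_normal(d_state),
                        rng.standard_normal((d_ann, d_att)),
                        rng.standard_normal((d_state, d_att)),
                        rng.standard_normal(d_att))
```

The decoder re-runs this at every output step, so each generated element can attend to a different part of the input.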


CAM-RNN: Co-Attention Model Based RNN for Video Captioning

A co-attention model based recurrent neural network (CAM-RNN) is proposed, where the CAM is utilized to encode the visual and text features, and the RNN works as the decoder to generate the video caption.

Learning by Injection: Attention Embedded Recurrent Neural Network for Amharic Text-image Recognition

This paper extends the attention mechanism to Amharic text-image recognition, integrating CNNs with attention-embedded recurrent encoder-decoder networks following the configuration of the seq2seq framework.

Survey on the attention based RNN model and its applications in computer vision

This survey introduces some attention based RNN models which can focus on different parts of the input for each output item, in order to explore and take advantage of the implicit relations between the input and the output items.

CaptionNet: A Tailor-made Recurrent Neural Network for Generating Image Descriptions

A novel model named CaptionNet is proposed in this work as an improved LSTM specially designed for image captioning, where only attended image features are allowed to be fed into the memory of CaptionNet through input gates, reducing the dependency on the previous predicted words.

Pay Attention to the Activations: A Modular Attention Mechanism for Fine-Grained Image Recognition

The proposed approach learns to attend to lower-level feature activations without requiring part annotations and uses those activations to update and rectify the output likelihood distribution, and demonstrates that well-known networks such as wide residual networks and ResNeXt, when augmented with the approach, systematically improve their classification accuracy and become more robust to changes in deformation and pose.

Recurrent Highway Networks with Attention Mechanism for Scene Text Recognition

Recurrent Highway Networks (RHNs), a popular architecture because of their capability of training deep structures, perform well in many situations with few parameters; here they are combined with an attention mechanism to solve the scene text recognition problem.

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

An attention-based sequence-to-sequence model is presented that combines a convolutional neural network as a generic feature extractor with a recurrent neural network to encode both the visual information and the temporal context between characters in the input image, and uses a separate recurrent network to decode the actual character sequence.

Hierarchical LSTMs with Adaptive Attention (hLSTMat)

A hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning that utilizes the spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention is for deciding whether to depend on the visual information or the language context information.
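
The "adaptive" part of such models can be pictured as a learned scalar gate that blends a visual context vector with a language context vector depending on the decoder state. The following is an illustrative sketch; the function and parameter names are assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_context(visual_ctx, language_ctx, state, w_gate):
    """Blend visual and language context vectors with a scalar gate
    beta computed from the decoder state (beta near 1 = rely on vision)."""
    beta = sigmoid(state @ w_gate)  # scalar in (0, 1)
    return beta * visual_ctx + (1.0 - beta) * language_ctx, beta

rng = np.random.default_rng(1)
d = 6
ctx, beta = adaptive_context(rng.standard_normal(d),
                             rng.standard_normal(d),
                             rng.standard_normal(d),
                             rng.standard_normal(d))
```

Because the gate is recomputed at each decoding step, the model can lean on visual features for content words and on the language context for function words.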

Deep neural network with attention model for scene text recognition

In the proposed framework, feature extraction, feature attention, and sequence recognition are integrated into a jointly trainable network, and the model has the fewest parameters reported so far.

Recurrent Temporal Sparse Autoencoder for attention-based action recognition

An unsupervised Recurrent Temporal Sparse Autoencoder (RTSAE) network is designed that learns to extract sparse key frames, sharpening discriminative capability while retaining descriptive capability and shielding against interfering information.



References

Explain Images with Multimodal Recurrent Neural Networks

The m-RNN model directly models the probability distribution of generating a word given previous words and the image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Recurrent Models of Visual Attention

A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.
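
The core of this "glimpse" idea can be illustrated with a small helper that crops a fixed-size patch around a predicted location, so only that patch is processed at high resolution. This is a hypothetical sketch of the cropping step, not the paper's implementation (which learns the locations via reinforcement learning).

```python
import numpy as np

def extract_glimpse(image, center, size):
    """Crop a size x size patch centered at `center` (row, col),
    clamping the window so it stays inside the image borders."""
    h, w = image.shape[:2]
    r0 = min(max(center[0] - size // 2, 0), h - size)
    c0 = min(max(center[1] - size // 2, 0), w - size)
    return image[r0:r0 + size, c0:c0 + size]

img = np.arange(100).reshape(10, 10)
patch = extract_glimpse(img, (2, 2), 4)     # window clamped near the corner
corner = extract_glimpse(img, (9, 9), 4)    # clamped at the opposite corner
```

The recurrent controller then consumes each glimpse in turn and emits the next location, so computation stays constant regardless of image size.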

Describing Videos by Exploiting Temporal Structure

This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.

Attention-Based Models for Speech Recognition

The attention-mechanism is extended with features needed for speech recognition and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.

Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition

L. Tóth. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined, yielding an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.

Language Models for Image Captioning: The Quirks and What Works

By combining key aspects of the ME and RNN methods, this paper achieves a new record performance over previously published results on the benchmark COCO dataset; however, the gains the authors see in BLEU do not translate to human judgments.

Multiple Object Recognition with Visual Attention

The model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image and it is shown that the model learns to both localize and recognize multiple objects despite being given only class labels during training.

Striving for Simplicity: The All Convolutional Net

It is found that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.
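
That replacement is easy to picture with a minimal NumPy sketch (not the paper's implementation): a 2×2 max-pool and a 2×2 convolution with stride 2 both halve the spatial resolution, but the convolution's downsampling weights are learnable.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling on a 2-D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def conv_stride2(x, kernel):
    """2x2 convolution with stride 2: a (learnable) downsampler
    that can stand in for max-pooling."""
    h, w = x.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = (x[i:i + 2, j:j + 2] * kernel).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(x)                        # picks the max of each block
convd = conv_stride2(x, np.full((2, 2), 0.25))  # averaging kernel, same output size
```

With a fixed averaging kernel the strided convolution reduces to average pooling; in a trained network the kernel weights are free parameters, which is what lets it match or beat max-pooling.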