Publications
How2: A Large-scale Dataset for Multimodal Language Understanding
TLDR
How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations, is introduced, along with integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization.
LIUM-CVC Submissions for WMT17 Multimodal Translation Task
TLDR
The monomodal and multimodal Neural Machine Translation systems developed by LIUM and CVC for the WMT17 Shared Task on Multimodal Translation ranked first for both the En-De and En-Fr language pairs according to the automatic evaluation metrics METEOR and BLEU.
Probing the Need for Visual Context in Multimodal Machine Translation
TLDR
This paper probes the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis in which the models are partially deprived of source-side textual context, and shows that under limited textual context, models are capable of leveraging the visual input to generate better translations.
Does Multimodality Help Human and Machine for Translation and Image Captioning?
TLDR
The systems developed by LIUM and CVC for the WMT16 Multimodal Machine Translation challenge are presented, namely phrase-based systems and attentional recurrent neural network models trained using monomodal or multimodal data.
NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems
Abstract
In this paper, we present nmtpy, a flexible Python toolkit based on Theano for training Neural Machine Translation and other neural sequence-to-sequence architectures. nmtpy decouples the …
Simultaneous Machine Translation with Visual Context
TLDR
The results show that visual context is helpful and that visually grounded models based on explicit object-region information are much better than those using the commonly employed global features, with improvements of up to 3 BLEU points under low-latency scenarios.
Multimodal Grounding for Sequence-to-sequence Speech Recognition
TLDR
This paper proposes novel end-to-end multimodal ASR systems and compares them to the adaptive approach using a range of visual representations obtained from state-of-the-art convolutional neural networks, showing that adaptive training is effective for S2S models, leading to an absolute improvement of 1.4% in word error rate.
Multimodal Attention for Neural Machine Translation
TLDR
This work assesses the feasibility of a multimodal attention mechanism that simultaneously focuses on an image and its natural language description for generating a description in another language.
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
TLDR
These experiments show that metrics usually prefer system outputs to human-authored texts, can be insensitive to correct translations of rare words, and can yield surprisingly high scores when given a single sentence as system output for the entire test set.
Multimodal machine translation through visuals and speech
TLDR
The paper concludes with a discussion of directions for future research in multimodal machine translation: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
...