• Corpus ID: 21580890

Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation

Satoshi Tsutsui and David J. Crandall
Recent work in computer vision has yielded impressive results in automatically describing images with natural language. Most of these systems generate captions in a single language, requiring multiple language-specific models to build a multilingual captioning system. We propose a very simple technique to build a single unified model across languages, using artificial tokens to control the language, making the captioning system more compact. We evaluate our approach on generating English and…
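The artificial-token idea can be sketched in a few lines: the decoder of a single shared captioning model is conditioned on a special first token that selects the output language. A minimal Python sketch, where the token names (`<en>`, `<ja>`) and helper function are illustrative assumptions, not the paper's exact implementation:

```python
def build_decoder_input(caption_tokens, language):
    """Prepend an artificial language token so one unified model
    can generate captions in any of its training languages."""
    lang_token = {"en": "<en>", "ja": "<ja>"}[language]  # assumed token names
    return [lang_token] + caption_tokens + ["<eos>"]

# The same model generates English or Japanese depending only on the
# first token it is conditioned on:
print(build_decoder_input(["a", "dog", "runs"], "en"))
# ['<en>', 'a', 'dog', 'runs', '<eos>']
```

Because the language token is just another vocabulary item, no architectural change is needed; the model learns to associate it with the target language during training.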

Figures and Tables from this paper

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards
This paper proposes to generate cross-lingual image captions with self-supervised rewards in the reinforcement learning framework to alleviate disfluency and visual irrelevancy errors and achieves significant performance improvement over state-of-the-art methods.
Deep Image Captioning: An Overview
  • I. Hrga, Marina Ivašić-Kos
  • Computer Science
    2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)
  • 2019
An overview of issues and recent image captioning research is given, with a particular emphasis on models that use the deep encoder-decoder architecture.
Encoder-Decoder Architecture for Image Caption Generation
A combined model of CNN and GRU is proposed to generate accurate image captions, achieving a BLEU-4 score of 53.5 on the MS-COCO 2017 dataset.
Deep learning approach for image captioning in Hindi language
  • Ankit Rathi
  • Computer Science
    2020 International Conference on Computer, Electrical & Communication Engineering (ICCECE)
  • 2020
The experiments showed that training the model with a single clean description per image generates higher-quality captions than a model trained with five uncleaned descriptions per image, which is the current state of the art.
VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
This work presents a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese and demonstrates that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation.
COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval
This paper proposes COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags and develops a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content.
SibNet: Sibling Convolutional Encoder for Video Captioning
This work introduces a novel Sibling Convolutional Encoder (SibNet) for visual captioning, which employs a dual-branch architecture to collaboratively encode videos and demonstrates that the proposed SibNet consistently outperforms existing methods across different evaluation metrics.
Outline to Story: Fine-grained Controllable Story Generation from Cascaded Events
This paper proposes a model, and creates datasets for it, that fine-tunes pre-trained language models on augmented sequences of outline-story pairs with a simple language-modeling objective, instantiating fine-grained controllable generation of open-domain long text where controlling inputs are represented by short text.
Multimodal Author Profiling for Twitter: Notebook for PAN at CLEF 2018
The proposed multimodal author profiling systems obtained classification accuracies of 0.7680, 0.7737, and 0.7709 for the Arabic, English, and Spanish languages, respectively, using a support vector machine.
Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter
This overview presents the framework and the results of the Author Profiling shared task at PAN 2018, to address gender identification from a multimodal perspective, where not only texts but also images are given.


Cross-Lingual Image Caption Generation
The model was designed to transfer the knowledge representation obtained from the English portion into the Japanese portion, and the resulting bilingual comparable corpus yields better performance than a monolingual corpus, indicating that image understanding using a resource-rich language benefits a resource-poor language.
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
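The target-language token described above is prepended to the source sentence itself. A small sketch of that preprocessing step, assuming the `<2xx>` token format used in this line of work (the helper name is illustrative):

```python
def tag_source(source_sentence, target_lang):
    """Prepend an artificial target-language token to the source sentence,
    so a single NMT model with a shared wordpiece vocabulary knows which
    language to translate into (enabling zero-shot directions)."""
    return f"<2{target_lang}> {source_sentence}"

print(tag_source("Hello, how are you?", "es"))
# <2es> Hello, how are you?
```

Training data for all language pairs is tagged this way and mixed together, so one set of parameters serves every translation direction.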
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Multilingual Image Description with Neural Sequence Models
An approach to multi-language image description bringing together insights from neural machine translation and neural image description is presented, finding significant and substantial improvements in BLEU4 and Meteor scores for models trained over multiple languages, compared to a monolingual baseline.
Multimodal Pivots for Image Caption Translation
This work presents an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space, and relies on available large datasets of monolingually captioned images, and on state-of-the-art convolutional neural networks to compute image similarities.
Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism
We propose multi-way, multilingual neural machine translation. The proposed approach enables a single neural translation model to translate between multiple languages, with a number of parameters that grows only linearly with the number of languages.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences markedly improved the LSTM's performance, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
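The source-reversal trick is purely a preprocessing step. A trivial sketch (function name is illustrative):

```python
def reverse_source(src_tokens):
    """Reverse the source word order (targets stay in order), so the first
    source words end up adjacent to the first target words the decoder
    must emit, shortening early source-target dependencies."""
    return src_tokens[::-1]

print(reverse_source(["I", "am", "happy"]))
# ['happy', 'am', 'I']
```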
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
Multi-Source Neural Translation
A multi-source machine translation model is built and trained to maximize the probability of a target English string given French and German sources, reporting up to +4.8 BLEU improvement on top of a very strong attention-based neural translation model.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
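The core of the metric is modified (clipped) n-gram precision combined geometrically across orders, multiplied by a brevity penalty. A simplified sentence-level sketch in Python (real implementations clip against multiple references and smooth; this is an assumption-laden illustration, not the canonical corpus-level formula):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Illustrative sentence-level BLEU: clipped n-gram precisions
    combined by geometric mean, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())      # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty order zeroes the score
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

sent = ["the", "cat", "sat", "on", "the", "mat"]
print(bleu(sent, sent))
# 1.0
```

The brevity penalty discourages short candidates that would otherwise score high precision by omitting content.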