Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

@article{Hsu2021TextFreeIS,
  title={Text-Free Image-to-Speech Synthesis Using Learned Segmental Units},
  author={Wei-Ning Hsu and David F. Harwath and Christopher Song and James R. Glass},
  journal={ArXiv},
  year={2021},
  volume={abs/2012.15454}
}
In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a…
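As a rough illustration of the two-stage design described in the abstract (not the authors' code), the sketch below shows an image captioner that emits discrete unit IDs instead of words, feeding a unit-to-spectrogram synthesizer. All module names, layer sizes, and the greedy decoding loop are illustrative assumptions.

    # Minimal sketch, assuming a pretrained image feature extractor and a small
    # unit inventory; module names, sizes, and the greedy decode loop are invented.
    import torch
    import torch.nn as nn

    NUM_UNITS = 100  # assumed size of the learned segmental-unit inventory

    class ImageToUnits(nn.Module):
        """Captioner that outputs discrete speech-unit IDs rather than text."""
        def __init__(self, img_dim=2048, hid=256):
            super().__init__()
            self.proj = nn.Linear(img_dim, hid)
            self.embed = nn.Embedding(NUM_UNITS, hid)
            self.rnn = nn.GRU(hid, hid, batch_first=True)
            self.out = nn.Linear(hid, NUM_UNITS)

        def forward(self, img_feat, max_len=20):
            h = torch.tanh(self.proj(img_feat)).unsqueeze(0)      # init decoder state from the image
            tok = torch.zeros(img_feat.size(0), 1, dtype=torch.long)
            units = []
            for _ in range(max_len):
                out, h = self.rnn(self.embed(tok), h)
                tok = self.out(out).argmax(-1)                    # greedy choice of next unit
                units.append(tok)
            return torch.cat(units, dim=1)                        # (B, max_len) unit IDs

    class UnitsToSpeech(nn.Module):
        """Toy unit-to-mel synthesizer standing in for a Tacotron-style model."""
        def __init__(self, hid=256, n_mels=80):
            super().__init__()
            self.embed = nn.Embedding(NUM_UNITS, hid)
            self.rnn = nn.GRU(hid, hid, batch_first=True)
            self.mel = nn.Linear(hid, n_mels)

        def forward(self, units):
            out, _ = self.rnn(self.embed(units))
            return self.mel(out)                                  # (B, T, n_mels) for a vocoder

    img_feat = torch.randn(1, 2048)          # pretrained image features (assumed)
    units = ImageToUnits()(img_feat)         # discrete "pseudo-text" caption
    mel = UnitsToSpeech()(units)             # spectrogram to be vocoded into audio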
End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages
TLDR
This study uses a vector-quantized variational autoencoder (VQ-VAE) model to learn the discrete representation of a speech caption in an unsupervised manner, where discrete labels are used by an image-captioning model.
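For readers unfamiliar with the mechanism, the following is a minimal sketch of VQ-VAE-style quantization that maps continuous encoder frames to discrete labels via a nearest-codebook lookup; the codebook size, dimensions, and straight-through trick shown are generic assumptions, not details taken from this paper.

    # Illustrative VQ-VAE-style quantization of speech encoder frames into discrete labels.
    import torch

    def quantize(frames, codebook):
        """frames: (T, D) encoder outputs; codebook: (K, D) learned code vectors."""
        dists = torch.cdist(frames, codebook)               # (T, K) pairwise distances
        labels = dists.argmin(dim=1)                        # nearest code index per frame
        quantized = codebook[labels]                        # replace each frame with its code
        quantized = frames + (quantized - frames).detach()  # straight-through gradient
        return labels, quantized

    codebook = torch.randn(512, 64)          # K=512 codes of dimension 64 (assumed)
    frames = torch.randn(100, 64)            # 100 continuous encoder frames
    labels, q = quantize(frames, codebook)   # labels are the discrete "caption" tokens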
A Spoken Language Dataset of Descriptions for Speech- and Percept-Based Learning
Grounded language acquisition is a major area of research combining aspects of natural language processing, computer vision, and signal processing, compounded by domain issues requiring sample…
Generative Spoken Language Modeling from Raw Audio
TLDR
This work introduces metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks, respectively: speech resynthesis and speech generation, and will open source the evaluation stack and baseline models.
fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
  • Changhan Wang, Wei-Ning Hsu, +5 authors J. Pino · ArXiv, 2021
This paper presents FAIRSEQ S^2, a FAIRSEQ extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable…
ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition (Version 2.0, final for NeurIPS)
Learning to comprehend and produce spoken languages is one of the hallmarks of human cognition, and the importance of speech communication also makes speech-based capabilities central to AI…
Semantic sentence similarity: size does not always matter
TLDR
This study addresses the question of whether visually grounded speech recognition models learn to capture sentence semantics without access to any prior linguistic knowledge, and shows that a model trained on a small image-caption database outperforms two models trained on much larger databases.
Fast-Slow Transformer for Visually Grounding Speech
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The…
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
TLDR
An overview of the evolution of visually grounded models of spoken language over the last 20 years is provided, which discusses the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work.

References

SHOWING 1-10 OF 107 REFERENCES
Learning Word-Like Units from Joint Audio-Visual Analysis
TLDR
This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set
TLDR
This paper presents an augmentation of the MSCOCO dataset where speech is added to image and text; investigating multimodal learning schemes for unsupervised speech pattern discovery is also possible with this corpus.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
TLDR
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis. Expand
Image2speech: Automatically generating audio descriptions of images
This paper proposes a new task for artificial intelligence. The image2speech task generates a spoken description of an image. We present baseline experiments in which the neural net used is a…
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Large-scale representation learning from visually grounded untranscribed speech
TLDR
A scalable method to automatically generate diverse audio for image captioning datasets via a dual encoder that learns to align latent representations from both modalities is described, and it is shown that a masked margin softmax loss for such models is superior to the standard triplet loss.
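A hedged sketch of a masked-margin-softmax-style retrieval loss for such a dual encoder is given below; the margin value, batch handling, and symmetric cross-entropy form are simplifications rather than the paper's exact formulation.

    # Dual-encoder retrieval loss sketch; margin and dimensions are assumptions.
    import torch
    import torch.nn.functional as F

    def masked_margin_softmax(img_emb, aud_emb, margin=0.2):
        """img_emb, aud_emb: (B, D) L2-normalised embeddings of paired examples."""
        sims = img_emb @ aud_emb.t()                       # (B, B) similarity matrix
        sims = sims - margin * torch.eye(sims.size(0))     # subtract margin on matched pairs
        labels = torch.arange(sims.size(0))
        # cross-entropy in both retrieval directions: image->audio and audio->image
        return F.cross_entropy(sims, labels) + F.cross_entropy(sims.t(), labels)

    img = F.normalize(torch.randn(8, 512), dim=-1)   # batch of 8 paired embeddings (assumed)
    aud = F.normalize(torch.randn(8, 512), dim=-1)
    loss = masked_margin_softmax(img, aud)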
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, then trains a neural network to map from speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
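The setup can be pictured with the toy sketch below, in which an image tagger's word probabilities serve as soft targets for a speech network trained with a multi-label objective; the vocabulary size, feature shapes, and BCE loss are assumptions for illustration.

    # Toy soft-label training sketch; all shapes and the tagger outputs are invented.
    import torch
    import torch.nn as nn

    vocab_size = 1000                             # assumed keyword vocabulary size
    speech_net = nn.Sequential(                   # stand-in for the speech branch
        nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, vocab_size))

    speech_feat = torch.randn(16, 40)             # pooled acoustic features (assumed shape)
    soft_targets = torch.rand(16, vocab_size)     # word probabilities from the visual tagger

    loss = nn.functional.binary_cross_entropy_with_logits(speech_net(speech_feat), soft_targets)
    loss.backward()                               # learns speech-to-keyword prediction without transcripts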
Language learning using Speech to Image retrieval
TLDR
This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
TLDR
A new neural text-to-speech method is presented that is able to transform text to speech in voices that are sampled in the wild, without requiring aligned phonemes or linguistic features, making TTS accessible to a wider range of applications.