Corpus ID: 247158593

Learning English with Peppa Pig

@article{Nikolaus2022LearningEW,
  title={Learning English with Peppa Pig},
  author={Mitja Nikolaus and A. Alishahi and Grzegorz Chrupała},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.12917}
}
Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual… 
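The abstract describes models that embed speech and visual data in a joint vector space by exploiting cross-modal associations. As a rough illustration only, and not the architecture used in the paper, the PyTorch sketch below pairs a speech encoder with a visual encoder and trains them with a symmetric contrastive loss over a batch of matched speech–visual pairs; all module names, feature dimensions, and hyperparameters are illustrative assumptions.

# Minimal sketch (illustrative, not the paper's implementation): a bi-encoder
# that maps speech and visual features into a shared embedding space and is
# trained with a symmetric InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, n_mels=40, dim=512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, num_layers=2, batch_first=True)

    def forward(self, mels):                      # mels: (batch, frames, n_mels)
        out, _ = self.rnn(mels)
        return F.normalize(out.mean(dim=1), dim=-1)   # mean-pool over time

class VisualEncoder(nn.Module):
    def __init__(self, feat_dim=2048, dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)      # project precomputed visual features

    def forward(self, feats):                     # feats: (batch, feat_dim)
        return F.normalize(self.proj(feats), dim=-1)

def contrastive_loss(speech_emb, visual_emb, temperature=0.07):
    # Symmetric cross-entropy over the batch similarity matrix; matching
    # speech/visual pairs sit on the diagonal.
    logits = speech_emb @ visual_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for paired clips.
speech_enc, visual_enc = SpeechEncoder(), VisualEncoder()
mels = torch.randn(8, 300, 40)     # 8 utterances, 300 frames of 40-dim mel features
frames = torch.randn(8, 2048)      # 8 pooled visual feature vectors
loss = contrastive_loss(speech_enc(mels), visual_enc(frames))
loss.backward()

Under an objective of this kind, utterances and visual clips that co-occur are pulled together in the shared space, which is what allows cross-modal retrieval to serve as the training signal.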
Citations

Word Discovery in Visually Grounded, Self-Supervised Speech Models
TLDR
It is shown that a powerful word segmentation and clustering capability emerges within the model’s self-attention heads, suggesting that the visual grounding task is a crucial component of the word discovery capability the authors observe.
ConceptBeam: Concept Driven Target Speech Extraction
TLDR
ConceptBeam is compared with two baseline methods, one based on keywords obtained from recognition systems and another based on sound source separation; it clearly outperforms both and effectively extracts speech based on the semantic representation.

References

Showing 1–10 of 57 references
Unsupervised Learning of Spoken Language with Visual Context
TLDR
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
TLDR
An overview of the evolution of visually grounded models of spoken language over the last 20 years is provided, discussing the central research questions addressed, the timeline of developments, and the datasets that enabled much of this work.
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
TLDR
It is found that not all speech frames play an equal role in the final encoded representation of a given word, but that some frames have a crucial effect on it, and it is suggested that word representations could be activated through a process of lexical competition.
Representations of language in a model of visually grounded speech signal
TLDR
An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer as the authors go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
Language learning using Speech to Image retrieval
TLDR
This work improves on existing neural network approaches to create visually grounded embeddings for spoken utterances and shows that the visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
Transfer Learning from Audio-Visual Grounding to Speech Recognition
TLDR
This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether an image and a spoken caption are semantically correlated, without using any textual transcripts.
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate…
Learning Word-Like Units from Joint Audio-Visual Analysis
TLDR
This model effectively implements a form of spoken language acquisition, in which the computer learns not only to recognize word categories by sound, but also to enrich the words it learns with semantics by grounding them in images.
A visual context-aware multimodal system for spoken language processing
TLDR
This work presents a real-time multimodal system, motivated by prior findings, that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances.
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation
TLDR
It is found that representations associated with phonetic, syllabic, and lexical units of speech indeed emerge from the audiovisual learning process, and the finding is also robust against variations in model architecture or characteristics of model training and testing data.
...