Encoding of phonology in a recurrent neural model of grounded speech

A. Alishahi, Marie Barking, Grzegorz Chrupała
We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model which processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses on how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, and the activations of the layers of the model. Via experiments with phoneme decoding and… 
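The phoneme-decoding analyses described above can be illustrated with a minimal sketch: train a simple diagnostic classifier to predict phoneme identity from feature vectors, where above-chance accuracy indicates that the representation encodes phoneme information. The snippet below uses synthetic stand-ins for MFCC frames and a nearest-centroid classifier purely for illustration; the paper itself uses real MFCC features, the model's layer activations, and supervised classifiers, and all names and dimensions here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 3 phoneme classes, 13-dim "MFCC-like" frames.
n_per_class, dim, classes = 50, 13, 3
means = rng.normal(0, 2, size=(classes, dim))          # per-class centroids
X = np.concatenate([rng.normal(means[c], 1.0, size=(n_per_class, dim))
                    for c in range(classes)])
y = np.repeat(np.arange(classes), n_per_class)

# Shuffle and split into train/test halves.
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
split = len(y) // 2
Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]

# Nearest-centroid "diagnostic classifier": decode phoneme identity
# from the feature vectors.
centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in range(classes)])
dists = np.linalg.norm(Xte[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == yte).mean()
print(f"phoneme decoding accuracy: {accuracy:.2f} (chance = {1/classes:.2f})")
```

The same probe can be run on activations from each layer of a trained model; comparing decoding accuracy across layers is what reveals where phonemic information is concentrated.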


Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech

It is found that not all speech frames play an equal role in the final encoded representation of a given word: some frames have a crucial effect on it, and it is suggested that word representations could be activated through a process of lexical competition.

Analyzing analytical methods: The case of phonology in neural models of spoken language

It is concluded that reporting analysis results with randomly initialized models is crucial, and that global-scope methods tend to yield more consistent and interpretable results; their use is recommended as a complement to local-scope diagnostic methods.

Discrete representations in neural models of spoken language

A systematic analysis of how architectural choices, the learning objective, the training dataset, and the evaluation metric affect the merits of four commonly used metrics for weakly supervised models of spoken language finds that the different evaluation metrics can give inconsistent results.

Encoding of speaker identity in a Neural Network model of Visually Grounded Speech perception

This thesis presents research on how the unique characteristics of a voice are encoded in a recurrent neural network trained on visually grounded speech signals, and finds that, in general, gender and speaker identity are encoded most prominently in the first few layers of the RNN.

Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition

This paper analyzes the learned internal representations in an end-to-end ASR model and finds remarkable consistency in how different properties are represented in different layers of the deep neural network.

A phonetic model of non-native spoken word processing

This work trains a computational model of phonetic learning, which has no access to phonology, and shows that it exhibits predictable behaviors on phone-level and word-level discrimination tasks, suggesting that phonology may not be necessary to explain some of the word-processing effects observed in non-native speakers.

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation

Representations associated with phonetic, syllabic, and lexical units of speech indeed emerge from the audiovisual learning process, and the results suggest that cross-modal and cross-situational learning may, in principle, assist early language development well beyond simply enabling the association of acoustic word forms with their referential meanings.

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is…
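The core operation of a vector quantization layer — snapping each continuous hidden vector to its nearest entry in a discrete codebook — can be sketched as below. This shows only the forward quantization step with a fixed, randomly initialized codebook; in the actual model the codebook is learned end-to-end (typically with a straight-through gradient estimator), and all sizes and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical codebook for a VQ layer: K discrete codes of dimension D.
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def vector_quantize(h, codebook):
    """Map each continuous hidden vector in h (N x D) to its nearest
    codebook entry; return the quantized vectors and the code indices."""
    dists = np.linalg.norm(h[:, None, :] - codebook[None, :, :], axis=2)
    codes = dists.argmin(axis=1)
    return codebook[codes], codes

# A batch of continuous "hidden states" stands in for RNN activations.
h = rng.normal(size=(5, D))
quantized, codes = vector_quantize(h, codebook)
print("codes:", codes)  # the discrete unit ids assigned to each frame
```

The resulting integer codes are the "discrete linguistic units": sequences of codes can then be compared against phone, syllable, or word segmentations.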

Learning English with Peppa Pig

A simple bi-modal architecture trained on the portion of the data consisting of dialog between characters, and evaluated on segments containing descriptive narration, succeeds at learning aspects of the visual semantics of spoken language.

On the difficulty of a distributional semantics of spoken language

It is conjectured that unsupervised learning of spoken-language semantics becomes possible if one abstracts away from surface variability, and possible routes toward transferring these approaches to the domain of unrestricted natural speech are suggested.

Representation of Linguistic Form and Function in Recurrent Neural Networks

A method is proposed for estimating the contribution of individual input tokens to the networks' final prediction; it shows that the visual pathway pays selective attention to lexical categories and grammatical functions that carry semantic information, and learns to treat word types differently depending on their grammatical function and their position in the sequential structure of the sentence.

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

This work presents a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes, and shows that it represents linguistic information in a hierarchy of levels.

Detection of phonological features in continuous speech using neural networks

This paper reports experiments on three phonological feature systems: the Sound Pattern of English (SPE) system; a multi-valued (MV) feature system which uses traditional phonetic categories such as manner, place, etc.; and Government Phonology, which uses a set of structured primes.

Dynamic Encoding of Acoustic Features in Neural Responses to Continuous Speech

Electroencephalography responses to continuous speech are characterized by obtaining the time-locked responses to phoneme instances (phoneme-related potential), and it is found that each instance of a phoneme in continuous speech produces multiple distinguishable neural responses occurring as early as 50 ms and as late as 400 ms after the phoneme onset.

Representations of language in a model of visually grounded speech signal

An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer higher up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.

Common Neural Basis for Phoneme Processing in Infants and Adults

It is argued that infants have access from the beginning of life to phonemic representations, which are modified not by training or explicit instruction but by the statistical distributions of the speech input, converging on the native phonemic categories.

Exploiting deep neural networks for detection-based speech recognition

Semantics guide infants' vowel learning: Computational and experimental evidence.

Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks

This work proposes a framework that facilitates better understanding of the encoded representations of sentence vectors and demonstrates the potential contribution of the approach by analyzing different sentence representation mechanisms.

Memory for Serial Order : A Network Model of the Phonological Loop and its Timing

A connectionist model of human short-term memory is presented that extends the "phonological loop" (A. D. Baddeley, 1986) to encompass serial order and learning. Psychological and neuropsychological…