• Publications
Representations of language in a model of visually grounded speech signal
An in-depth analysis of the representations used by different components of the trained model shows that encoding of semantic aspects tends to become richer higher up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue
A baseline model for reference resolution is proposed which uses a simple method to take into account shared information accumulated in a reference chain; the results show that this information is particularly important for resolving later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
This work presents a model of visually-grounded language learning, based on stacked gated recurrent neural networks, which learns to predict visual features given an image description in the form of a sequence of phonemes, and shows that it represents linguistic information in a hierarchy of levels.
Curious Topics: A Curiosity-Based Model of First Language Word Learning
The goal of this study is to show how a curious, active choice of topics by a language learner improves word learning compared to random selection.
Active Word Learning through Self-supervision
A computational study of cross-situational word learning is presented to investigate whether a curious word learner who actively influences linguistic input in each context has an advantage over a passive learner.
Discrete representations in neural models of spoken language
A systematic analysis of the impact of architectural choices, the learning objective, and the training dataset on four commonly used evaluation metrics in the context of weakly supervised models of spoken language finds that the different metrics can give inconsistent results.
On the difficulty of a distributional semantics of spoken language
It is conjectured that unsupervised learning of spoken-language semantics becomes possible if one abstracts away from surface variability, and possible routes toward transferring these approaches to the domain of unrestricted natural speech are suggested.
Emergence of language structures from exposure to visually grounded speech signal
This work introduces a multi-layer recurrent neural network model which is trained to project spoken sentences and their corresponding visual scene features into a shared semantic space, and investigates to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized.
Modeling relations in a referential game
Inspired by recent work on visual question answering using Relation Networks, this work builds and evaluates models of expression grounding that take into account interactions between elements of the visual scene, and provides an analysis of the performance and the relational representations learned.
Learning to Understand Child-directed and Adult-directed Speech
The results suggest that the observed difference between the two registers is at least partially due to their linguistic rather than acoustic properties, as the same pattern is seen in models trained on acoustically comparable synthetic speech.