Maarten Versteegh

Learn More
The Interspeech 2015 Zero Resource Speech Challenge aims at discovering subword and word units from raw speech. The challenge provides the first unified and open source suite of evaluation metrics and data sets to compare and analyse the results of unsupervised linguistic unit discovery algorithms. It consists of two tracks. In the first, a psychophysically(More)
We report on an architecture for the unsupervised discovery of talker-invariant subword embeddings. It is made out of two components: a dynamic-time warping based spoken term discovery (STD) system and a Siamese deep neural network (DNN). The STD system clusters word-sized repeated fragments in the acoustic streams while the DNN is trained to minimize the(More)
The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or from raw speech has seen an increasing interest in the past years both from a theoretical and a practical standpoint. Yet, there exists no common accepted evaluation method for the systems performing term discovery. Here, we propose such an evaluation toolbox,(More)
This paper reports on the results of the Zero Resource Speech Challenge 2015, the first unified benchmark for zero resource speech technology, which aims at the unsupervised discovery of subword and word units from raw speech. This paper discusses the motivation for the challenge, its data sets, tasks and baseline systems. We outline the ideas behind the(More)
This paper presents a deep architecture for learning a similarity metric on variablelength character sequences. The model combines a stack of character-level bidirectional LSTM’s with a Siamese architecture. It learns to project variablelength strings into a fixed-dimensional embedding space by using only information about the similarity between pairs of(More)
Infants learn language at an incredible speed, and one of the first steps in this voyage is learning the basic sound units of their native languages. It is widely thought that caregivers facilitate this task by hyperarticulating when speaking to their infants. Using state-of-the-art speech technology, we addressed this key theoretical question: Are sound(More)
Recent work has explored deep architectures for learning acoustic features in an unsupervised or weakly-supervised way for phone recognition. Here we investigate the role of the input features, and in particular we test whether standard mel-scaled filterbanks could be replaced by inherently richer representations, such as derived from an analytic scattering(More)
This paper investigates the effects of novel words on a cognitively plausible computational model of word learning. The model is first familiarized with a set of words, achieving high recognition scores and subsequently offered novel words for training. We show that the model is able to recognize the novel words as different from the previously seen words,(More)
In this paper we investigate a computational model of word learning that is cognitively plausible. The model is partly trained on incorrect form-referent pairings, modelling the input to a word-learning child that may contain such mismatches due to inattention to a joint communicative scene. We introduce a procedure of active learning, based on attested(More)