DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

  title={DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon},
  author={Robin Algayres and Tristan Ricoul and Julien Karadayi and Hugo Lauren{\c{c}}on and Salah Zaiem and Abdel-rahman Mohamed and Beno{\^i}t Sagot and Emmanuel Dupoux},
  journal={Transactions of the Association for Computational Linguistics},
Abstract Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word… 



Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints in Encoder-decoder Models

  • H. Kamper
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The encoder-decoder correspondence autoencoder is proposed, which, instead of true word segments, uses automatically discovered segments: an unsupervised term discovery system finds pairs of words of the same unknown type, and the EncDec-CAE is trained to reconstruct one word given the other as input.

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

The proposed Speech2Vec model, a novel deep neural network architecture for learning fixed-length vector representations of audio segments excised from a speech corpus, is based on an RNN Encoder-Decoder framework, and borrows the methodology of skipgrams or continuous bag-of-words for training.

Multilingual Jointly Trained Acoustic and Written Word Embeddings

It is found that phonetic supervision improves performance over character sequences, and that distinctive feature supervision is helpful in handling unseen phones in the target language.

Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings

This work proposes a novel lexical clustering model: variable-length word segments are embedded in a fixed-dimensional acoustic space in which clustering is then performed, and finds that the best methods produce clusters with wide variation in sizes, as observed in natural language.

Discriminative acoustic word embeddings: Recurrent neural network-based approaches

This paper presents new discriminative embedding models based on recurrent neural networks (RNNs) and considers training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting.
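
As a rough illustration of the contrastive idea mentioned above, the following sketch computes a hinge-style triplet loss on cosine distances between embedding vectors. The function names, the margin value, and the use of plain Python lists as embeddings are assumptions made for this example; the summary does not give the paper's exact formulation, and in practice the embeddings would come from the RNN being trained.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_hinge_loss(
    anchor: Sequence[float],
    same_word: Sequence[float],
    different_word: Sequence[float],
    margin: float = 0.5,  # hypothetical value, chosen only for illustration
) -> float:
    """Penalize triplets where the same-word pair is not closer than the
    different-word pair by at least `margin`."""
    d_same = cosine_distance(anchor, same_word)
    d_diff = cosine_distance(anchor, different_word)
    return max(0.0, margin + d_same - d_diff)
```

The loss is zero once the same-word embedding is closer to the anchor than the different-word embedding by the margin, so only violating triplets contribute to training.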

The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021.

Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

A duration-penalized dynamic programming (DPDP) procedure is proposed that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs; it gives word segmentation results comparable to state-of-the-art joint self-supervised speech segmentation models on an English benchmark.
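
The general shape of a duration-penalized dynamic program can be sketched as follows. This is a toy version under stated assumptions, not the paper's implementation: the segment cost here is a simple within-segment sum of squared errors on scalar inputs (standing in for the self-supervised scoring network), and the duration term `-weight * (dur - 1)` and all function names are illustrative choices.

```python
from typing import List, Optional, Sequence


def segment_sse(x: Sequence[float], start: int, end: int) -> float:
    """Toy segment cost: sum of squared deviations from the mean of x[start:end]."""
    seg = x[start:end]
    mean = sum(seg) / len(seg)
    return sum((v - mean) ** 2 for v in seg)


def dp_segment(
    x: Sequence[float], weight: float = 1.0, max_dur: Optional[int] = None
) -> List[int]:
    """Return segment end indices (exclusive) minimizing the total of
    segment cost plus an assumed duration term that rewards longer segments."""
    n = len(x)
    max_dur = max_dur or n
    best = [0.0] + [float("inf")] * n  # best[t]: lowest cost of segmenting x[:t]
    back = [0] * (n + 1)               # back[t]: start of the last segment ending at t
    for end in range(1, n + 1):
        for start in range(max(0, end - max_dur), end):
            dur = end - start
            cost = best[start] + segment_sse(x, start, end) - weight * (dur - 1)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Follow back-pointers from the end of the sequence to recover boundaries.
    boundaries = []
    end = n
    while end > 0:
        boundaries.append(end)
        end = back[end]
    return boundaries[::-1]


print(dp_segment([0, 0, 0, 5, 5, 5]))  # prints [3, 6]
```

Swapping `segment_sse` for a different scoring function changes what kind of unit is found, which mirrors the summary's point that the same procedure serves phone or word segmentation depending on the scorer.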

Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages

This paper aims to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages.

Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation

It is found that self-supervised contrastive adaptation outperforms adapted multilingual correspondence autoencoder and Siamese AWE models, giving the best overall results in a word discrimination task on six zero-resource languages.