The LIMSI Broadcast News transcription system
TLDR
Describes the development work involved in moving from laboratory read-speech data to real-world, or 'found', speech data in preparation for the DARPA evaluations on this task from 1996 to 1999.
The ETAPE corpus for the evaluation of speech-based TV content processing in the French language
The paper presents a comprehensive overview of existing data for the evaluation of spoken content processing in a multimedia framework for the French language. We focus on the ETAPE corpus which will…
Lightly supervised and unsupervised acoustic model training
TLDR
Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 min of manually annotated data.
Last Words: Amazon Mechanical Turk: Gold Mine or Coal Mine?
TLDR
The article sets out to define precisely what MTurk is and what it is not, in the hope of pointing out opportunities for the community to deliberately value ethics above cost savings.
Where are we in transcribing French broadcast news?
TLDR
Advances in automatic processing of broadcast news speech in French, based on recent improvements to the LIMSI English system, are described. The main difference between the English and French BN systems noted here is a 200k vocabulary to overcome the lower lexical coverage in French.
Partitioning and transcription of broadcast news data
TLDR
This paper reports on recent work in transcribing broadcast news data, including the problem of partitioning the data into homogeneous segments prior to word recognition, using a continuous mixture density, tied-state cross-word context-dependent HMM system with a 65k trigram language model.
Breaking the Unwritten Language Barrier: The BULB Project
The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order…
On designing pronunciation lexicons for large vocabulary continuous speech recognition
  • L. Lamel, G. Adda
  • Computer Science
  • Proceeding of Fourth International Conference on…
  • 3 October 1996
TLDR
Describes the American English lexicon developed primarily for the ARPA WSJ/NAB tasks, which is phonemically represented and contains alternate pronunciations for about 10% of the words.
Investigating syllabic structures and their variation in spontaneous French
TLDR
In this paper, we examine syllabic structures and their variation in a corpus of French speech drawn from broadcast radio interviews … can serve as linguistic tools for consistently exploring virtually unlimited corpora.
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
TLDR
A speech corpus collected during a realistic language documentation process, made up of 5k speech utterances in Mboshi aligned to French text translations, is presented.