Benoit Habert

Learn More
When trying to identify essential concepts and relationships in a medium-size corpus, it is not always possible to rely on statistical methods, as the frequencies are too low. We present an alternative method, symbolic, based on the simplification of parse trees. We discuss the resuits on nominal phrases of two technical corpora, analyzed by two different(More)
Medical Language Processing (MLP), especially in specific domains, requires fine-grained semantic lexica. We examine whether robust natural language processing tools used on a representative corpus of a domain help in building and refining a semantic categorization. We test this hypothesis with ZELLIG, a corpus analysis tool. The first clusters we obtain(More)
The present study focuses on automatic processing of sibling resources of audio and written documents, such as available in audio archives or for parliament debates: written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and(More)
The aim of this study is to elaborate a disfluent speech model by comparing different types of audio transcripts. The study makes use of 10 hours of French radio interview archives, involving journalists and personalities from political or civil society. A first type of transcripts is press-oriented where most disfluencies are discarded. For 10% of the(More)
The increasing use of methods in natural language processing (NLP) which are based on huge corpora require that the lexical, morphosyntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associate tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary(More)
Looking for a better understanding of spontaneous speech-related phenomena and to improve automatic speech recognition (ASR), we present here a study on the relationship between the occurrence of overlapping speech segments and disfluencies (filled pauses, repetitions, revisions) in political interviews. First we present our data, and our overlap annotation(More)
There is a constant need to extend and tune specialized vocabularies to account for new words and new word usages. This paper addresses the issue of characterizing the semantic class of such words. We test the hypothesis that the analysis of word distribution in a representative corpus, as obtained by robust NLP tools, can help identify words with similar(More)
The reported study aims at increasing our understanding of spontaneous speech-related phenomena from sibling corpora of speech and orthographic transcriptions at various levels of elaboration. It makes use of 9 hours of French broadcast interview archives, involving 10 journalists and 10 personalities from political or civil society. First we considered(More)