Harald Romsdorfer

Learn More
Communication between humans deeply relies on the capability of expressing and recognizing feelings. For this reason, research on human-machine interaction needs to focus on the recognition and simulation of emotional states, prerequisite of which is the collection of affective corpora. Currently available datasets still represent a bottleneck for the(More)
In multilingual countries, text-to-speech synthesis systems often have to deal with texts containing inclusions of multiple other languages in form of phrases, words, or even parts of words. In such multilingual cultural settings, listeners expect a high-quality text-to-speech synthesis system to read such texts in a way that the origin of the inclusions is(More)
Communication between humans deeply relies on our capability of experiencing, expressing, and recognizing feelings. For this reason, research on human-machine interaction needs to focus on the recognition and simulation of emotional states, prerequisite of which is the collection of affective corpora. Currently available datasets still represent a(More)
Data annotation is the most labor-intensive part for the acquisition of a multimodal corpus. 3D vision technology can ease the annotation process, especially when continuous surface deformations need to be extracted accurately and consistently over time. In this paper, we give an example use of such technology, namely the acquisition of an audio-visual(More)
An automatic system for segmenting speech signals used for the training of statistical prosody models is presented. Starting from a canonical transcription, the system simultaneously delivers an accurate phonetic segmentation and the matched phonetic transcription indicating pronunciation variants. Although the system is HMM-based, it uses only the speech(More)
Beamforming is crucial for hands-free mobile terminals and voice-enabled automated home environments based on distant-speech interaction to mitigate causes of system degradation, e.g., interfering noise sources or competing speakers. This paper presents an adaptation of the most common state-of-the-art broadband beamformers to uniform circular arrays, such(More)
Polyglot text-to-speech synthesis, i.e. the synthesis of sentences containing one or more inclusions from other languages, primarily depends on an accurate morphosyntactic analyzer for such mixed-lingual texts. From the output of this analyzer, the pronunciation can be derived by means of phonological transformations which are language-specific and depend(More)
Accurate, microphone-based speaker localization in real-world environments, like office spaces or meeting rooms, must be able to track a single speaker and multiple concurrent speakers in the presence of reverberations and background noise. Our Multiband Joint Position-Pitch (M-PoPi) algorithm for circular microphone arrays already shows a frame-wise(More)
A polyglot text-to-speech synthesis system which is able to read aloud mixed-lingual text has first of all to derive the correct pronunciation. This is achieved with an accurate morpho-syntactic analyzer that works simultaneously as language detector, followed by a phonological component which performs various phonological transformations. The result of(More)