Harriet J. Nock

Learn More
This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify application of audio-visual synchrony techniques to the problem of active speaker localisation(More)
In the early 1990s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, brie ̄y summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News readspeech ASR task.(More)
We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit(More)
Conversational speech exhibits considerable pronunciation variability, which has been shown to have a detrimental effect on the accuracy of automatic speech recognition. There have been many attempts to model pronunciation variation, including the use of decision-trees to generate alternate word pronunciations from phonemic baseforms. Use of such(More)
This paper considers schemes for determining which of a set of faces on screen, if any, is producing speech in a video soundtrack. Whilst motivated by the TREC 2002 (Video Retrieval Track) monologue detection task, the schemes are also applicable to voice and face-based biometrics systems, for assessing lip synchronization quality in movie editing and(More)
Accurately modelling pronunciation variability in conversational speech is an important component of an automatic speech recognition system. We describe some of the projects undertaken in this direction during and after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in JulyAugust, 1997. We first illustrate a use of(More)
In this paper we present our approach to detect monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 Video Retrieval Track (VT02), the underlying approach of synchrony between audio and video signals are also(More)
In this paper, we describe the IBM Research system for analysis, indexing, and retrieval of video, which was applied to the TREC-2002 video retrieval benchmark. The system explores methods for fully-automatic content analysis, shot boundary detection, multi-modal feature extraction, statistical modeling for semantic concept detection, and speech recognition(More)