Samuel Pachoud

In this paper, we present a spatio-temporal feature representation and a probabilistic matching function to recognise lip movements from pronounced digits. Our model (1) automatically selects spatio-temporal features extracted from 10 digit model templates and (2) matches them with probe video sequences. Spatio-temporal features embed lip movements from(More)
We present a method to group trajectories of moving objects extracted from real-world surveillance videos. The trajectories are first mapped into a low-dimensional feature space generated through linear regression. Next, the regression coefficients are clustered by a Gaussian Mixture Model initialised by K-means for improved efficiency. The model(More)
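The pipeline described in this abstract can be sketched as follows: each trajectory is regressed into a fixed-length coefficient vector, and the coefficients are then clustered with a K-means-initialised Gaussian mixture. This is a minimal illustrative sketch, not the paper's implementation; the polynomial degree, the 2-D (x, y) trajectory format, and all function names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def trajectory_features(traj, degree=3):
    """Map a trajectory (T x 2 array of (x, y) points) to a fixed-length
    feature vector: fit x(t) and y(t) with polynomials of the given degree
    and concatenate the regression coefficients."""
    t = np.linspace(0.0, 1.0, len(traj))
    cx = np.polyfit(t, traj[:, 0], degree)
    cy = np.polyfit(t, traj[:, 1], degree)
    return np.concatenate([cx, cy])

def cluster_trajectories(trajs, n_clusters=3, degree=3, seed=0):
    """Cluster variable-length trajectories: regress each into a coefficient
    vector, then fit a Gaussian Mixture Model (scikit-learn initialises the
    mixture with K-means when init_params='kmeans')."""
    X = np.stack([trajectory_features(tr, degree) for tr in trajs])
    gmm = GaussianMixture(n_components=n_clusters,
                          init_params='kmeans', random_state=seed)
    return gmm.fit_predict(X)
```

Because the regression step maps trajectories of any length to vectors of the same dimension, the clustering stage never has to compare raw trajectories directly.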
For the recognition of speech, in particular spoken digits, captured in video with audio degraded by noise, we develop a novel audio-visual fusion technique that performs significantly better than utilising either the audio or the video signal alone. Specifically, we present an audio-visual intermediate fusion strategy to locate speaker-dependent pronounced digits(More)
Human perception is multi-sensory. In particular, two of the five senses are used most often: sight and hearing. Sight, or vision, describes the ability of the eye to detect electromagnetic waves within the visible range (light) and of the brain to interpret them as images. Hearing, or audition, is the sense of sound perception and results from tiny hair fibres in(More)
We extract relevant and informative audio-visual features using multiple multi-class Support Vector Machines with probabilistic outputs, and demonstrate the approach in a noisy audio-visual speech reading scenario. We first extract visual spatio-temporal features and audio cepstral coefficients from pronounced digit sequences. Two classifiers are then(More)
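The classification scheme in this abstract, multi-class SVMs with probabilistic outputs over two modalities, can be sketched as below. This is an illustrative sketch only: the product rule used to combine the per-modality posteriors is an assumption (the abstract does not specify the fusion rule), feature extraction is assumed to happen upstream, and all names are hypothetical. scikit-learn's `SVC(probability=True)` provides Platt-scaled class posteriors.

```python
import numpy as np
from sklearn.svm import SVC

def fused_digit_classifier(Xa_train, Xv_train, y_train):
    """Train one multi-class SVM per modality (audio cepstral features,
    visual spatio-temporal features) with probabilistic outputs, and return
    a predictor that fuses the two posteriors with a product rule."""
    audio_svm = SVC(probability=True, random_state=0).fit(Xa_train, y_train)
    video_svm = SVC(probability=True, random_state=0).fit(Xv_train, y_train)

    def predict(Xa, Xv):
        # Both SVMs were trained on the same labels, so their classes_
        # arrays are aligned; multiply the per-class posteriors and take
        # the argmax as the fused decision.
        p = audio_svm.predict_proba(Xa) * video_svm.predict_proba(Xv)
        return audio_svm.classes_[np.argmax(p, axis=1)]

    return predict
```

A product rule implicitly assumes the two modalities are conditionally independent given the class; a weighted sum is a common alternative when one stream is known to be noisier.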