Ricard Marxer

The CHiME challenge series aims to advance far-field speech recognition technology by promoting research at the interface of signal processing and automatic speech recognition. This paper presents the design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially motivated …
We present a method for lead instrument separation using an available musical score that may not be properly aligned with the polyphonic audio mixture. Improper alignment degrades the performance of existing score-informed source separation algorithms. Several techniques are proposed to manage local and global misalignments, such as a score information …
This research focuses on the removal of the singing voice in polyphonic audio recordings under real-time constraints. It is based on time-frequency binary masks resulting from the combination of azimuth, phase-difference and absolute-frequency spectral bin classification with harmonic-derived masks. For the harmonic-derived masks, a pitch likelihood …
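The masking idea above can be sketched as follows. This is an illustrative combination rule only: the actual system combines azimuth, phase-difference, bin-classification and harmonic-derived cues, whereas here two hypothetical binary masks are simply ANDed and used to zero out voice bins in one STFT frame.

```python
import numpy as np

def apply_removal_mask(stft_frame, masks):
    """Suppress bins flagged as 'voice' by ALL masks (illustrative
    combination rule, not the paper's exact scheme)."""
    voice = np.logical_and.reduce(masks)     # bin is voice only if every cue agrees
    return np.where(voice, 0.0, stft_frame)  # zero out voice bins, keep the rest

frame = np.array([1.0, 2.0, 3.0, 4.0])                 # toy magnitude spectrum
azimuth_mask  = np.array([True, True, False, False])   # e.g. center-panned bins
harmonic_mask = np.array([True, False, False, True])   # e.g. bins on voice harmonics
out = apply_removal_mask(frame, [azimuth_mask, harmonic_mask])
# out = [0.0, 2.0, 3.0, 4.0]  -- only bin 0 satisfies both cues
```

Because every operation is a per-bin logical or arithmetic step with no iteration, such masking is cheap enough for the real-time constraint the abstract mentions.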
We present the use of a Tikhonov regularization based method, as an alternative to the Non-negative Matrix Factorization (NMF) approach, for source separation in professional audio recordings. This method is a direct and computationally less expensive solution to the problem, which makes it interesting in low-latency scenarios. The technique sacrifices the …
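A minimal sketch of why a Tikhonov approach is direct: the regularized least-squares problem min ||y − Bg||² + λ||g||² has the closed-form solution g = (BᵀB + λI)⁻¹Bᵀy, a single linear solve rather than NMF's iterative updates. The function below is an assumption-laden illustration (names, λ value and shapes are mine, not the paper's), and it notably does not enforce non-negativity of the gains.

```python
import numpy as np

def tikhonov_gains(B, y, lam=0.1):
    """Closed-form Tikhonov solution g = (B^T B + lam*I)^{-1} B^T y.
    One linear solve per frame (no iterations), hence low latency;
    unlike NMF, the gains are not constrained to be non-negative.
    Illustrative sketch only; the paper's formulation may differ."""
    n_c = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n_c), B.T @ y)

# toy example: 4-bin spectrum, 2 basis components
B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
y = B @ np.array([2.0, 3.0])            # mixture built with known gains
g = tikhonov_gains(B, y, lam=1e-6)      # recovers roughly [2.0, 3.0]
```

With a small λ the solve recovers the true gains almost exactly; larger λ trades accuracy for numerical stability when B is ill-conditioned.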
A system is presented that segments, clusters and predicts musical audio in an unsupervised manner, adjusting the number of (timbre) clusters instantaneously to the audio input. A sequence learning algorithm adapts its structure to a dynamically changing clustering tree. The flow of the system is as follows: 1) segmentation by onset detection, 2) timbre …
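The first stage of the pipeline, segmentation by onset detection, can be sketched with a spectral-flux detection function: the sum of positive magnitude increases between consecutive STFT frames. This is a common choice for onset detection, not necessarily the detector the paper uses, and the frame values below are invented for illustration.

```python
import numpy as np

def onset_strength(mag_spec):
    """Spectral-flux onset detection function: per frame transition,
    sum the positive magnitude increases across all frequency bins.
    (A standard detector used here for illustration; the paper's
    exact onset detector is an assumption.)"""
    diff = np.diff(mag_spec, axis=1)        # frame-to-frame change per bin
    return np.maximum(diff, 0.0).sum(axis=0)

# toy magnitude spectrogram: 2 bins x 4 frames, energy appears at frame 2
mag = np.array([[0.0, 0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0, 1.0]])
flux = onset_strength(mag)
# flux = [0.0, 2.0, 0.0] -- the peak marks the onset between frames 1 and 2
```

Peaks in the flux curve give the segment boundaries that feed the subsequent timbre-clustering stage.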
In this paper, we investigate the performance of three unsupervised classification algorithms applied to musical data. They are first evaluated on the set of feature vectors extracted directly from the original songs, and we try to highlight whether this data seems to lie on an embedded manifold or not. Furthermore, we try to enhance the …
The main assumption of our spectrum decomposition method is that the short-term Fourier transform (STFT) of our audio signal, Y, is a linear combination of N_C elementary spectra, also named basis components. This can be expressed as Y = BG, where Y ∈ R^(N_S × 1) is the spectrum at a given frame m, N_S being the size of the spectrum, and B ∈ R^(N_S × N_C) is the matrix …
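The model Y = BG above can be written out in a few lines; shapes follow the abstract (Y is N_S × 1, B is N_S × N_C), while the concrete sizes, random values and the shape of G (N_C × 1, the per-frame gains, which the truncated text has not yet defined) are assumptions for illustration.

```python
import numpy as np

N_S, N_C = 6, 3                        # spectrum size, number of basis components
rng = np.random.default_rng(0)

B = rng.random((N_S, N_C))             # basis matrix: one elementary spectrum per column
G = rng.random((N_C, 1))               # assumed per-frame gains for frame m
Y = B @ G                              # frame spectrum as a linear combination of the columns of B
```

Y is then exactly the weighted sum of the N_C basis spectra, sum_k G[k] * B[:, k], which is what "linear combination of elementary spectra" means term by term.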
Dysarthria is a speech disorder caused by difficulties in controlling the muscles, such as the tongue and lips, that are needed to produce speech. These motor difficulties cause speech to be slurred, mumbled, and spoken relatively slowly, and can also increase the likelihood of dysfluency. This includes non-speech sounds and 'stuttering', defined here …