Learn More
Separating singing voices from music accompaniment is an important task in many applications, such as music information retrieval, lyric recognition and alignment. Music accompaniment can be assumed to be in a low-rank subspace, because of its repetition structure; on the other hand, singing voices can be regarded as relatively sparse within songs. In this(More)
We describe a large audiovisual speech corpus recorded in a car environment, as well as the equipment and procedures used to build this corpus. Data are collected through a multi-sensory array consisting of eight microphones on the sun visor and four video cameras on the dashboard. The script for the corpus consists of four categories: isolated digits,(More)
Hidden markov model Artificial neural network Tandem model Gaussian mixture model supervector a b s t r a c t Acoustic Event Detection (AED) aims to identify both timestamps and types of events in an audio stream. This becomes very challenging when going beyond restricted highlight events and well controlled recordings. We propose extracting discriminative(More)
Monaural source separation is useful for many real-world applications though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction(More)
Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic cues and acoustic cues to develop their prediction of prosody for a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based(More)
In this paper, we propose a novel method for image inpainting based on a Deep Convolutional Generative Adversarial Network (DCGAN). We define a loss function consisting of two parts: (1) a contextual loss that preserves similarity between the input corrupted image and the recovered image, and (2) a perceptual loss that ensures a perceptually realistic(More)
Two transcribers have labeled prosodic events independently on a subset of Switchboard corpus using adapted ToBI (TOnes and Break Indices) system. Transcriptions of two types of pitch accents (H* and L*), phrasal accents (Hand Land nd boundary tones (H% and L%) encoded independently by two transcribers are compared for intertranscriber reliability. Two(More)
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and(More)
In this paper, we present a patch-based regression framework for addressing the human age and head pose estimation problems. Firstly, each image is encoded as an ensemble of orderless coordinate patches, the global distribution of which is described by Gaussian mixture models (GMM), and then each image is further expressed as a specific distribution model(More)
In this work, we present a SIFT-Bag based generative-to-discriminative framework for addressing the problem of video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, the distribution of which is described by a Gaussian Mixture Models (GMM). In the discriminative stage, the(More)