Separating singing voices from music accompaniment is an important task in many applications, such as music information retrieval and lyric recognition and alignment. Music accompaniment can be assumed to lie in a low-rank subspace because of its repetitive structure; singing voices, on the other hand, can be regarded as relatively sparse within songs. In this …
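A minimal sketch of the low-rank/sparse idea this abstract describes, assuming a Robust PCA style decomposition of a magnitude spectrogram; the function names, parameter defaults, and iteration count are illustrative assumptions, not the authors' implementation.

```python
# Decompose a mixture magnitude spectrogram M into a low-rank part L
# (repetitive accompaniment) and a sparse part S (voice) via a simple
# ADMM-style Robust PCA. Hypothetical sketch, not the paper's code.
import numpy as np

def svd_threshold(X, tau):
    """Singular-value soft-thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(X, tau):
    """Elementwise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam=None, mu=None, n_iter=100):
    """Decompose M ~= L + S with L low-rank and S sparse."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))          # standard default weight
    if mu is None:
        mu = 0.25 * M.size / (np.abs(M).sum() + 1e-12)
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# Usage: M would be the magnitude STFT of a song; S approximates the vocals
# and L the accompaniment, each resynthesized with the mixture phase.
```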
Keywords: hidden Markov model; artificial neural network; tandem model; Gaussian mixture model supervector. Abstract: Acoustic Event Detection (AED) aims to identify both the timestamps and the types of events in an audio stream. This becomes very challenging when going beyond restricted highlight events and well-controlled recordings. We propose extracting discriminative …
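A hedged sketch of one ingredient named only in the keywords, the GMM supervector: the stacked MAP-adapted means of a background GMM (UBM) for one audio segment. The relevance factor, model sizes, and pipeline placement are assumptions; feature extraction and the tandem ANN/HMM stage are omitted.

```python
# Illustrative GMM-supervector computation for an audio segment.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm: GaussianMixture, frames: np.ndarray, r: float = 16.0):
    """MAP-adapt the UBM means to `frames` (T x D) and stack them."""
    post = ubm.predict_proba(frames)             # T x K responsibilities
    n_k = post.sum(axis=0) + 1e-12               # soft counts per component
    f_k = post.T @ frames                        # K x D first-order statistics
    alpha = (n_k / (n_k + r))[:, None]           # relevance-factor weighting
    adapted = alpha * (f_k / n_k[:, None]) + (1.0 - alpha) * ubm.means_
    return adapted.reshape(-1)                   # supervector of length K*D

# Usage (hypothetical): fit the UBM once on background features, then compute
# one supervector per segment and feed it to a discriminative classifier, e.g.
# ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(bg_frames)
```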
Monaural source separation is useful for many real-world applications, though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose jointly optimizing deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction …
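A minimal sketch of the "extra masking layer" idea: given two network output spectra for the two sources, a soft time-frequency mask redistributes the mixture magnitude so the two estimates sum back to the input. Shapes and names are assumptions for illustration, not the paper's exact layer.

```python
import numpy as np

def masking_layer(y1, y2, mixture_mag, eps=1e-8):
    """Soft-mask layer appended after the DNN/RNN outputs (time x freq arrays)."""
    m1 = np.abs(y1) / (np.abs(y1) + np.abs(y2) + eps)   # mask values in [0, 1]
    s1_hat = m1 * mixture_mag                            # source 1 estimate
    s2_hat = (1.0 - m1) * mixture_mag                    # source 2 estimate
    return s1_hat, s2_hat
```

During joint optimization, the reconstruction loss would be computed on s1_hat and s2_hat, so gradients flow through the mask back into the network.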
Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic cues and acoustic cues to predict the prosody of a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based …
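A hedged sketch of one way the two knowledge sources could be combined: the ANN gives a posterior over prosodic labels given syntax, a per-label GMM gives an acoustic log-likelihood, and a weighted product yields the joint decision. The weighting scheme is an assumption, not necessarily the paper's coupling.

```python
import numpy as np

def combine(syntactic_posterior, acoustic_loglik, acoustic_weight=1.0):
    """syntactic_posterior: (n_labels,) probabilities; acoustic_loglik: (n_labels,)."""
    log_score = np.log(syntactic_posterior + 1e-12) + acoustic_weight * acoustic_loglik
    score = np.exp(log_score - log_score.max())      # stable renormalization
    return score / score.sum()                       # combined label posterior
```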
We describe a large audiovisual speech corpus recorded in a car environment, as well as the equipment and procedures used to build this corpus. Data are collected through a multi-sensory array consisting of eight microphones on the sun visor and four video cameras on the dashboard. The script for the corpus consists of four categories: isolated digits, …
Two transcribers have independently labeled prosodic events on a subset of the Switchboard corpus using an adapted ToBI (Tones and Break Indices) system. Transcriptions of two types of pitch accents (H* and L*), phrasal accents (H- and L-), and boundary tones (H% and L%), encoded independently by the two transcribers, are compared for inter-transcriber reliability. Two …
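A minimal sketch of one standard inter-transcriber reliability measure, Cohen's kappa, applied to two transcribers' labels for the same tokens. The abstract does not state which statistic was used here, so this is illustrative only.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    labels_a = np.asarray(labels_a); labels_b = np.asarray(labels_b)
    cats = np.unique(np.concatenate([labels_a, labels_b]))
    p_o = np.mean(labels_a == labels_b)                        # observed agreement
    p_e = sum(np.mean(labels_a == c) * np.mean(labels_b == c)  # chance agreement
              for c in cats)
    return (p_o - p_e) / (1.0 - p_e)

# e.g. cohens_kappa(["H*", "L*", "none"], ["H*", "none", "none"])
```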
In this paper, we propose a novel method for image inpainting based on a Deep Convolutional Generative Adversarial Network (DCGAN). We define a loss function consisting of two parts: (1) a contextual loss that preserves similarity between the input corrupted image and the recovered image, and (2) a perceptual loss that ensures a perceptually realistic …
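A hedged sketch of the two-part objective described in the abstract, written as plain functions over arrays. Here gen_img stands for the generator output, d_prob_real for the discriminator's "real" probability, mask for the known-pixel indicator, and the weighting lam is an assumption.

```python
import numpy as np

def contextual_loss(gen_img, corrupted_img, mask):
    """L1 distance on the uncorrupted (mask == 1) pixels only."""
    return np.sum(mask * np.abs(gen_img - corrupted_img))

def perceptual_loss(d_prob_real):
    """Penalize generated images the discriminator finds unrealistic."""
    return np.log(1.0 - d_prob_real + 1e-12)

def inpainting_loss(gen_img, corrupted_img, mask, d_prob_real, lam=0.1):
    """Combined objective: contextual term plus weighted perceptual term."""
    return contextual_loss(gen_img, corrupted_img, mask) + lam * perceptual_loss(d_prob_real)
```

In this style of inpainting, the loss is typically minimized over the generator's latent code with the generator and discriminator fixed, and the recovered region is then pasted into the corrupted image.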
In this work, we present a SIFT-Bag based generative-to-discriminative framework for video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, the distribution of which is described by a Gaussian mixture model (GMM). In the discriminative stage, the …
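A hedged sketch of the generative stage as described: each clip's SIFT descriptors (one row per keypoint, typically 128-D) are modeled with a Gaussian mixture. Descriptor extraction is assumed to be done elsewhere, and the mixture size and covariance settings are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clip_gmm(sift_descriptors: np.ndarray, n_components: int = 32) -> GaussianMixture:
    """Fit a diagonal-covariance GMM to one clip's bag of SIFT vectors (N x 128)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          reg_covar=1e-4, max_iter=200)
    return gmm.fit(sift_descriptors)

# The discriminative stage (not sketched here) would compare clips through a
# distance or kernel defined between their fitted GMMs.
```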
In this paper, we present a patch-based regression framework for addressing the human age and head pose estimation problems. First, each image is encoded as an ensemble of orderless coordinate patches, the global distribution of which is described by a Gaussian mixture model (GMM); each image is then further expressed as a specific distribution model …
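A minimal sketch of the "orderless coordinate patches" encoding mentioned above: each image becomes a set of vectors formed by a local patch's pixels concatenated with its normalized position. Patch size and stride are assumptions; the GMM summarization would follow as in the abstract.

```python
import numpy as np

def coordinate_patches(img: np.ndarray, size: int = 8, stride: int = 4):
    """Return an (n_patches x feature_dim) array of pixel+position vectors."""
    h, w = img.shape[:2]
    feats = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patch = img[y:y + size, x:x + size].reshape(-1)
            feats.append(np.concatenate([patch, [x / w, y / h]]))
    return np.asarray(feats)
```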
Monaural source separation is important for many real-world applications. It is challenging because only single-channel information is available. In this paper, we explore deep recurrent neural networks for singing voice separation from monaural recordings in a supervised setting. Deep recurrent neural networks with different temporal connections are …
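A hedged PyTorch sketch of a stacked network with a recurrent (temporal) layer and a soft-mask output, loosely mirroring the supervised separation setup this abstract describes; the layer sizes, the single RNN layer, and the module name are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VoiceSeparator(nn.Module):
    """Feed-forward encoder + recurrent layer + two-source soft mask."""
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)  # temporal connection
        self.out = nn.Linear(hidden, 2 * n_freq)             # two source spectra
        self.n_freq = n_freq

    def forward(self, mix_mag):                  # mix_mag: (batch, time, n_freq)
        h = self.pre(mix_mag)
        h, _ = self.rnn(h)
        y = self.out(h)
        y1, y2 = y[..., :self.n_freq], y[..., self.n_freq:]
        mask = torch.abs(y1) / (torch.abs(y1) + torch.abs(y2) + 1e-8)
        return mask * mix_mag, (1.0 - mask) * mix_mag        # vocal / accompaniment
```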