Corpus ID: 243848135

LiMuSE: Lightweight Multi-modal Speaker Extraction

Qinghua Liu, Yating Huang, Yunzhe Hao, Jiaming Xu, Bo Xu
In recent years, multi-modal cues, including spatial information, facial expressions and voiceprints, have been introduced to the speaker extraction task as complementary sources of information to achieve better performance. However, the front-end models for speaker extraction have become large and hard to deploy on resource-constrained devices. In this paper, we address this problem with novel model architectures and model compression techniques, and propose a lightweight… 


Dive into Big Model Training

This report explores what big model training is and how it works by diving into training objectives and training methodologies, and summarizes the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design.

Modeling the Repetition-Based Recovering of Acoustic and Visual Sources With Dendritic Neurons

The work suggests that somatodendritic neuron models offer a promising neuro-inspired learning strategy to account for the characteristics of the brain segregation capabilities as well as to make predictions on yet untested experimental settings.

Group Communication With Context Codec for Lightweight Source Separation

Two simple modules are proposed, group communication and context codec, that can be easily applied to a wide range of architectures to jointly decrease the model size and complexity without sacrificing the performance.
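The context codec compresses a long sequence of frames into a shorter summary sequence before the main processing and restores the original frame rate afterwards. A minimal sketch of that encode/decode idea, using average pooling and repetition in place of the paper's learned codec (function names and the pooling choice here are illustrative assumptions, not the actual module):

```python
import numpy as np

def context_encode(frames, context):
    """Compress every `context` consecutive frames into one summary frame
    (toy stand-in for the learned context encoder)."""
    t, d = frames.shape
    assert t % context == 0, "frame count must divide evenly into contexts"
    return frames.reshape(t // context, context, d).mean(axis=1)

def context_decode(summary, context):
    """Restore the original frame rate by repeating each summary frame
    (toy stand-in for the learned context decoder)."""
    return np.repeat(summary, context, axis=0)

x = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2 features each
z = context_encode(x, context=3)              # 2 summary frames
y = context_decode(z, context=3)              # back to 6 frames
```

The separation model then runs on the shorter summary sequence `z`, which is where the complexity savings come from; the real codec learns the compression rather than averaging.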

Ultra-Lightweight Speech Separation Via Group Communication

  • Yi Luo, Cong Han, N. Mesgarani
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
A simple model design paradigm that explicitly designs ultra-lightweight models without sacrificing the performance is provided and the group communication (GroupComm) is introduced, where a feature vector is split into smaller groups and a small processing block is used to perform inter-group communication.
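The GroupComm idea above can be sketched in a few lines: split a feature vector into K groups and let a small shared module mix information across the group axis. This toy version uses a single random linear map with a residual connection as the "communication" step; the paper's actual module is a small learned network (e.g. an RNN), so the weights and function name here are assumptions for illustration:

```python
import numpy as np

def group_comm(feature, num_groups, seed=0):
    """Split `feature` into `num_groups` groups and mix information across
    groups via a shared (here: random, fixed) linear map plus a residual."""
    d = feature.shape[-1]
    assert d % num_groups == 0, "feature dim must divide evenly into groups"
    groups = feature.reshape(num_groups, d // num_groups)  # (K, D/K)
    rng = np.random.default_rng(seed)
    # Hypothetical K x K mixing weights over the group axis.
    w = rng.standard_normal((num_groups, num_groups)) / np.sqrt(num_groups)
    mixed = w @ groups            # every group receives a summary of all others
    return (groups + mixed).reshape(d)

out = group_comm(np.ones(64), num_groups=8)
```

The parameter saving comes from the per-group processing blocks operating on D/K-dimensional vectors instead of the full D-dimensional one, with only the small communication module shared across groups.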

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

A novel AVSS model that uses speech-related visual features to isolate the target speaker, adopting a time-domain approach and building audio-visual speech separation networks with temporal convolutional network blocks.

Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

An onset/offset-based model completes a composite task, a complementary combination of speaker extraction and speaker-dependent voice activity detection, and combines voiceprint with onset/offset cues.

Speaker and Direction Inferred Dual-Channel Speech Separation

This work proposes a speaker and direction inferred speech separation network (dubbed SDNet) to solve the cocktail party problem and generates more precise perceptual representations with the help of spatial features and successfully deals with the problem of the unknown number of sources and the selection of outputs.

Multi-Stage Speaker Extraction with Utterance and Frame-Level Reference Signals

This work proposes a speaker extraction technique that performs in multiple stages to take full advantage of a short reference speech sample and, for the first time, uses frame-level sequential speech embeddings as the reference for the target speaker.

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

This work introduces VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system, and shows that such a model can be quantized as an 8-bit integer model and run in real time.
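The 8-bit integer quantization mentioned above can be illustrated with symmetric post-training quantization of a weight tensor: pick a scale from the largest absolute weight, round to int8, and dequantize with the same scale. This is a generic sketch of the technique, not VoiceFilter-Lite's actual quantization scheme:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization: map float weights to int8
    with a single per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for (or during) inference."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Storing `q` instead of `w` cuts the weight memory by 4x relative to float32, at the cost of the small rounding error visible in `w_hat`.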

Neural Architecture Search for Speech Recognition

A range of neural architecture search techniques are used to automatically learn two hyper-parameters that heavily affect the performance and model complexity of state-of-the-art factored time delay neural network (TDNN-F) acoustic models: i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer.

SpEx: Multi-Scale Time Domain Speaker Extraction Network

A time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra is proposed and achieves relative improvements over the best baseline.

Multi-Modal Multi-Channel Target Speech Separation

A general multi-modal framework for target speech separation is proposed by utilizing all the available information about the target speaker, including his/her spatial location, voice characteristics and lip movements, and a factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level.