• Publications
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
FastSpeech, a novel Transformer-based feed-forward network that generates mel-spectrograms in parallel for TTS, is proposed; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
TLDR
A self-supervised spatiotemporal learning technique that leverages the chronological order of videos to learn spatiotemporal representations by predicting the order of shuffled clips from a video.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
TLDR
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by training the model directly on the ground-truth target instead of the simplified output of a teacher model, and by introducing more variation information of speech as conditional inputs.
Investigating Capsule Networks with Dynamic Routing for Text Classification
TLDR
This work proposes three strategies to stabilize the dynamic routing process to alleviate the disturbance of some noise capsules which may contain “background” information or have not been successfully trained.
Video Question Answering via Gradually Refined Attention over Appearance and Motion
TLDR
This paper proposes an end-to-end model which gradually refines its attention over the appearance and motion features of the video using the question as guidance and demonstrates the effectiveness of the model by analyzing the refined attention weights during the question answering procedure.
Improving Automatic Source Code Summarization via Deep Reinforcement Learning
TLDR
This work incorporates the abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network), which provides the confidence of predicting the next word according to the current state and uses an advantage reward composed of the BLEU metric to train both networks.
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
TLDR
A novel Cross-Modal Interaction Network (CMIN) is introduced to consider multiple crucial factors for this challenging task, including the syntactic structure of natural language queries, long-range semantic dependencies in video context, and sufficient cross-modal interaction.
Multilingual Neural Machine Translation with Knowledge Distillation
TLDR
One model is enough to handle multiple languages, with comparable or even better accuracy than individual models, in this distillation-based approach to boost the accuracy of multilingual machine translation.
Dialogue Act Recognition via CRF-Attentive Structured Network
TLDR
This paper tackles the problem of dialogue act recognition (DAR) from the viewpoint of extending richer Conditional Random Field (CRF) structured dependencies without abandoning end-to-end training, and incorporates hierarchical semantic inference with a memory mechanism into utterance modeling at multiple levels.
MEMEN: Multi-layer Embedding with Memory Networks for Machine Comprehension
TLDR
A novel neural network architecture called Multi-layer Embedding with Memory Networks (MEMEN) is proposed for the machine reading task; it applies the classic skip-gram model to the syntactic and semantic information of words to train a new kind of embedding layer.