FastSpeech: Fast, Robust and Controllable Text to Speech
FastSpeech, a novel feed-forward network based on Transformer, is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with the ground-truth target instead of the simplified output from a teacher model, and by introducing more variation information of speech as conditional inputs.
Video Question Answering via Gradually Refined Attention over Appearance and Motion
This paper proposes an end-to-end model which gradually refines its attention over the appearance and motion features of the video using the question as guidance and demonstrates the effectiveness of the model by analyzing the refined attention weights during the question answering procedure.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
- D. Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang
- Computer Science; Computer Vision and Pattern Recognition
- 1 June 2019
A self-supervised spatiotemporal learning technique is proposed that leverages the chronological order of videos to learn spatiotemporal representations by predicting the order of shuffled clips from a video.
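The pretext task described above can be illustrated with a minimal sketch of how one training sample might be built: consecutive clips are cut from a video, shuffled, and labeled with the index of the permutation applied. This is not the authors' implementation; the function name `make_order_prediction_sample`, the parameters `clip_len` and `num_clips`, and the use of a plain list as the "video" are all assumptions made for illustration.

```python
import itertools
import random


def make_order_prediction_sample(frames, clip_len=4, num_clips=3, rng=None):
    """Build one (shuffled_clips, label) pair for clip order prediction.

    Hypothetical helper: `frames` is any sequence standing in for video
    frames. The label is the index of the permutation used to shuffle
    the clips, so a classifier could be trained to predict it.
    """
    rng = rng or random.Random()
    needed = clip_len * num_clips
    if len(frames) < needed:
        raise ValueError("video is too short for the requested clips")

    # Pick a random window and cut it into consecutive, non-overlapping clips.
    start = rng.randrange(len(frames) - needed + 1)
    clips = [
        frames[start + i * clip_len : start + (i + 1) * clip_len]
        for i in range(num_clips)
    ]

    # Enumerate all possible orderings and sample one as the target class.
    perms = list(itertools.permutations(range(num_clips)))
    label = rng.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label


if __name__ == "__main__":
    video = list(range(20))  # integers stand in for frames
    shuffled, label = make_order_prediction_sample(video, rng=random.Random(0))
    print(shuffled, label)
```

With three clips there are 3! = 6 possible orderings, so the task in this sketch reduces to a 6-way classification problem over the shuffled input.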
Improving Automatic Source Code Summarization via Deep Reinforcement Learning
- Yao Wan, Zhou Zhao, Philip S. Yu
- Computer Science; International Conference on Automated Software…
- 1 September 2018
This work incorporates an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network), in which the actor network provides the confidence of predicting the next word according to the current state, and an advantage reward composed of the BLEU metric is used to train both networks.
Investigating Capsule Networks with Dynamic Routing for Text Classification
- Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, Zhou Zhao
- Computer Science; Conference on Empirical Methods in Natural…
- 29 March 2018
This work proposes three strategies to stabilize the dynamic routing process to alleviate the disturbance of some noise capsules which may contain “background” information or have not been successfully trained.
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
- Zhu Zhang, Zhijie Lin, Zhou Zhao, Z. Xiao
- Computer Science; Annual International ACM SIGIR Conference on…
- 6 June 2019
A novel Cross-Modal Interaction Network (CMIN) is introduced to consider multiple crucial factors for this challenging task, including the syntactic structure of natural language queries, long-range semantic dependencies in video context, and sufficient cross-modal interaction.
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
- Zhou Yu, D. Xu, D. Tao
- Computer Science, Physics; AAAI Conference on Artificial Intelligence
- 6 June 2019
This work introduces ActivityNet-QA, a fully annotated, large-scale VideoQA dataset consisting of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset, and explores various video representation strategies to improve VideoQA performance.
Multilingual Neural Machine Translation with Knowledge Distillation
- Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu
- Computer Science; International Conference on Learning…
- 1 February 2019
This work proposes a distillation-based approach to boost the accuracy of multilingual machine translation, showing that one model is enough to handle multiple languages with comparable or even better accuracy than individual models.
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
- Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu
- Computer Science; AAAI Conference on Artificial Intelligence
- 19 November 2019
This paper proposes a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training, and devises a proposal generation module that aggregates context information to generate and score all candidate proposals in a single pass.