Incremental TTS for Japanese Language

@inproceedings{Yanagita2018IncrementalTF,
  title={Incremental TTS for Japanese Language},
  author={Tomoya Yanagita and Sakriani Sakti and Satoshi Nakamura},
  booktitle={INTERSPEECH},
  year={2018}
}
Simultaneous lecture translation requires speech to be translated in real time, before the speaker has spoken an entire sentence, since a long delay creates difficulties for listeners trying to follow the lecture. The challenge is to construct a full-fledged system with speech recognition, machine translation, and text-to-speech synthesis (TTS) components that can produce high-quality speech translations on the fly. Specifically for TTS, this poses problems as a conventional framework…
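
The incremental setting can be pictured as a buffer that is flushed to the synthesizer one small linguistic unit at a time instead of one sentence at a time. The sketch below is a minimal illustration only, assuming a hypothetical synthesize_unit backend and simple punctuation-based unit boundaries; it is not the system described in the paper.

```python
# Minimal sketch of incremental TTS: speak each small unit as soon as it is
# complete instead of waiting for the full sentence.
# `synthesize_unit` is a hypothetical backend standing in for a real synthesizer.
from typing import Callable, List


class IncrementalTTS:
    """Buffers incoming words and flushes them to the synthesizer per unit."""

    def __init__(self, synthesize_unit: Callable[[str], bytes],
                 boundary_tokens: tuple = ("、", "。", ",", ".")):
        self.synthesize_unit = synthesize_unit
        self.boundary_tokens = boundary_tokens
        self.buffer: List[str] = []

    def push(self, word: str) -> List[bytes]:
        """Add one incoming word; return any audio chunks produced right away."""
        self.buffer.append(word)
        if word.endswith(self.boundary_tokens):
            unit = "".join(self.buffer)
            self.buffer.clear()
            return [self.synthesize_unit(unit)]   # speak this unit immediately
        return []

    def flush(self) -> List[bytes]:
        """Synthesize whatever remains when the input stream ends."""
        if not self.buffer:
            return []
        unit = "".join(self.buffer)
        self.buffer.clear()
        return [self.synthesize_unit(unit)]


# Audio is produced per unit, long before the sentence is complete.
tts = IncrementalTTS(synthesize_unit=lambda text: f"<audio for '{text}'>".encode())
for w in ["今日は、", "音声合成の", "話を", "します。"]:
    for audio in tts.push(w):
        print(audio.decode())
```

Which unit to synthesize on (accent phrase, clause, or something larger) is exactly the latency-versus-quality trade-off that incremental TTS work for Japanese has to resolve.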

Citations

Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework

An initial step toward constructing iTTS based on an end-to-end neural framework (Neural iTTS) is taken, and the effects of various incremental units on the quality of end-to-end neural speech synthesis in both English and Japanese are investigated.

Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

This work proposes a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation, which achieves speech naturalness similar to full-sentence TTS with only a constant latency of 1-2 words.
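
As a rough picture of the prefix-to-prefix idea, the sketch below assumes a simple wait-k policy (synthesis of output word i starts once i + k input words have arrived), with a placeholder synthesize function; it is not the cited paper's implementation.

```python
# Wait-k style prefix-to-prefix scheduling: word i is spoken once i + k input
# words are visible, so latency stays at a constant k words regardless of
# sentence length. `synthesize` is a placeholder for a neural TTS call.
def wait_k_synthesis(words, k=2, synthesize=lambda w, ctx: f"audio({w})"):
    spoken = []
    for t in range(1, len(words) + 1):          # t input words have arrived
        i = len(spoken)                          # index of the next word to speak
        while i + k <= t or (t == len(words) and i < t):
            spoken.append(synthesize(words[i], words[:t]))  # condition on the visible prefix
            i = len(spoken)
    return spoken


print(wait_k_synthesis(["the", "talk", "starts", "now"], k=2))
```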

Incremental Machine Speech Chain Towards Enabling Listening While Speaking in Real-Time

This work constructs incremental ASR (ISR) and incremental TTS (ITTS) by letting both systems improve together through a short-term loop, and reveals that the proposed framework reduces delays on long utterances while keeping performance comparable to the non-incremental basic machine speech chain.

Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS

The system consists of three fully incremental neural processing modules for automatic speech recognition, machine translation, and text-to-speech synthesis; its overall latency is investigated along with module-level performance.
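
The cascade can be pictured with a toy sketch in which placeholder generators stand in for the neural ASR, MT, and TTS modules (all names and behavior here are invented for illustration); each stage starts emitting output before the previous stage has consumed the whole sentence.

```python
# Toy cascade of incremental modules: each stage consumes a stream and emits
# partial outputs as soon as it can, so downstream stages start early.
def incremental_asr(audio_frames):
    for frame in audio_frames:                  # placeholder: one frame -> one word
        yield f"src_word[{frame}]"


def incremental_mt(src_words, k=2):
    buf = []
    for w in src_words:
        buf.append(w)
        if len(buf) >= k:                       # translate with a k-word lookahead
            yield f"tgt({buf.pop(0)})"
    for w in buf:                               # flush the tail at end of input
        yield f"tgt({w})"


def incremental_tts(tgt_words):
    for w in tgt_words:
        yield f"audio({w})"


# Overall latency is governed by the per-module lookaheads, not sentence length.
for chunk in incremental_tts(incremental_mt(incremental_asr(range(5)))):
    print(chunk)
```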

Incremental Speech Synthesis For Speech-To-Speech Translation

This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.

References

(showing 1-10 of 21 references)

Simple, lexicalized choice of translation timing for simultaneous speech translation

This work proposes a method that uses lexicalized information to perform translation-unit segmentation while considering the relationship between the source and target languages, and shows that the system achieves a 20% delay reduction compared to pause-based segmentation with identical accuracy.

Real-time Incremental Speech-to-Speech Translation of Dialogs

Experimental results demonstrate that high-quality translations can be generated with the incremental approach at approximately half the latency of the non-incremental approach.

Adaptive Latency for Part-of-Speech Tagging in Incremental Text-to-Speech Synthesis

This article addresses the problem of part-of-speech (POS) tagging, a critical step for accurate grapheme-to-phoneme conversion and prosody estimation, and proposes a method based on a set of decision trees that estimate online whether a given POS tag is likely to be modified once more right-contextual information becomes available.
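
A hedged sketch of that adaptive-latency idea is given below: a classifier predicts, from local features, whether the current POS tag would be revised once more right context arrives, and synthesis of the word is delayed only when a revision is likely. The features, toy data, and use of a single scikit-learn decision tree are invented for illustration and are much simpler than the article's method.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features: (current POS id, previous POS id, words of right context already seen)
X_train = [[1, 0, 0], [1, 0, 2], [3, 1, 0], [3, 1, 1], [2, 3, 0], [2, 3, 2]]
y_train = [1, 0, 1, 0, 0, 0]    # 1 = tag likely to be revised later, 0 = stable

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)


def ready_to_synthesize(features) -> bool:
    """Synthesize now unless the tree predicts the POS tag will be revised."""
    return clf.predict([features])[0] == 0


print(ready_to_synthesize([1, 0, 2]))   # with enough right context the tag is predicted stable
```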

Simultaneous translation of lectures and speeches

It is concluded that machines can already deliver comprehensible simultaneous translation output, and that while machine performance is affected by recognition errors (and can thus be improved), human performance is limited by the cognitive challenge of performing the task in real time.

HMM training strategy for incremental speech synthesis

This study describes a voice-training procedure for HMM-based speech synthesis that explicitly integrates potential uncertainty about some contextual features, and shows that the proposed strategy outperforms the baseline technique for French.

Constructing a speech translation system using simultaneous interpretation data

This paper examines the possibility of additionally incorporating simultaneous interpretation data (produced by simultaneous interpreters) into the training of the machine translation system, and finds that, according to automatic evaluation metrics, the system achieves performance similar to that of a simultaneous interpreter with one year of experience.

Partial representations improve the prosody of incremental speech synthesis

The quality of prosodic parameter assignments generated from partial utterance specifications is analyzed in order to determine the requirements that symbolic incremental prosody modelling should meet; the analysis finds that broader, higher-level information helps to improve prosody even when lower-level information about the near future is not yet available.

Automatic sentence segmentation and punctuation prediction for spoken language translation

A novel sentence segmentation method is presented that is specifically tailored to the requirements of machine translation algorithms and is competitive with state-of-the-art approaches for detecting sentence-like units.

Decision tree usage for incremental parametric speech synthesis

Timo Baumann · 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper investigates the 'locality' of features in parametric speech synthesis voices and takes some missing steps towards better HMM state selection and prosody modelling for incremental speech synthesis.

Accent Sandhi Estimation of Tokyo Dialect of Japanese Using Conditional Random Fields

The proposed method predicted accent nucleus positions for accent phrases with 94.66% accuracy, clearly surpassing the accuracy obtained with the rule-based method, and significantly improved the naturalness of synthetic speech.
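
Purely as an illustration of the CRF formulation (not the paper's actual feature set or toolkit, which are not specified here), the sketch below labels each mora of an accent phrase as accent nucleus or not with a linear-chain CRF via the sklearn_crfsuite wrapper, using invented features and toy data.

```python
import sklearn_crfsuite


def mora_features(phrase, i):
    """Simple per-mora features; a real system would add POS, base accent type, etc."""
    return {
        "mora": phrase[i],
        "pos_in_phrase": i,
        "phrase_len": len(phrase),
        "prev": phrase[i - 1] if i > 0 else "BOS",
        "next": phrase[i + 1] if i < len(phrase) - 1 else "EOS",
    }


# One toy accent phrase per example; label "1" marks the accent nucleus mora.
phrases = [["か", "ぶ", "し", "き"], ["が", "い", "しゃ"]]
labels = [["0", "0", "0", "0"], ["1", "0", "0"]]

X = [[mora_features(p, i) for i in range(len(p))] for p in phrases]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```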