Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech

Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. Because complexity increases and efficiency decreases with every additional step, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to…

