Keikichi Hirose

This paper introduces a new open-source, WFST-based toolkit for Grapheme-to-Phoneme conversion. The toolkit is efficient and accurate, and currently supports a range of features including EM sequence alignment and several decoding techniques novel in the context of G2P. Experimental results show that a combination RNNLM system outperforms all previously reported …
This work introduces a modified WFST-based multiple-to-multiple EM-driven alignment algorithm for Grapheme-to-Phoneme (G2P) conversion, and preliminary experimental results applying a Recurrent Neural Network Language Model (RNNLM) as an N-best rescoring mechanism for G2P conversion. The alignment algorithm leverages the WFST framework and introduces several …
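The N-best rescoring idea above can be sketched as a log-domain interpolation of the baseline G2P score with an RNNLM score. This is a minimal illustration, not the paper's implementation; the function names, the toy RNNLM, and the interpolation weight are all hypothetical.

```python
def rescore_nbest(nbest, rnnlm_score, lam=0.5):
    """Re-rank G2P hypotheses by linearly interpolating the joint
    n-gram model log-probability with an RNNLM log-probability."""
    rescored = []
    for phones, g2p_logp in nbest:
        combined = (1 - lam) * g2p_logp + lam * rnnlm_score(phones)
        rescored.append((phones, combined))
    # Higher (less negative) combined score is better.
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Toy stand-in for an RNNLM: prefers shorter phoneme strings.
toy_rnnlm = lambda phones: -0.5 * len(phones)

nbest = [(("F", "OW", "T", "OW"), -4.2),
         (("F", "AA", "T", "OW"), -4.0)]
best_phones = rescore_nbest(nbest, toy_rnnlm)[0][0]
```

In a real system the RNNLM would be trained on phoneme (or joint grapheme-phoneme) sequences and the weight tuned on held-out data.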
This work investigates two related issues in the area of WFST-based G2P conversion. The first is the impact that the approach utilized to convert a target word to an equivalent finite-state machine has on downstream decoding efficiency. The second issue considered is the impact that the approach utilized to represent the joint n-gram model via the WFST …
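The first issue, converting a target word to a finite-state machine, can be illustrated with two simple encodings: a plain linear-chain acceptor, and a variant that also adds arcs for grapheme clusters to match a many-to-many alignment alphabet. This is a hedged sketch of the general idea, not the paper's actual construction; arcs are represented here as `(from_state, to_state, label)` tuples.

```python
def word_to_linear_fsa(word):
    """Linear-chain acceptor: one arc per grapheme, states 0..len(word)."""
    return [(i, i + 1, ch) for i, ch in enumerate(word)]

def word_to_cluster_fsa(word, max_len=2):
    """Variant that also adds arcs for grapheme clusters up to max_len,
    so multi-grapheme units from a many-to-many alignment can match
    directly; the larger machine trades size for fewer epsilon paths."""
    arcs = []
    for i in range(len(word)):
        for k in range(1, max_len + 1):
            if i + k <= len(word):
                arcs.append((i, i + k, word[i:i + k]))
    return arcs

linear = word_to_linear_fsa("cat")    # 3 arcs: c, a, t
cluster = word_to_cluster_fsa("cat")  # adds 'ca' and 'at'
```

Which representation decodes faster depends on how it composes with the joint n-gram model, which is precisely the tradeoff the abstract refers to.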
The process of generating the F0 contour of speech has been modeled quite accurately in mathematical terms by Fujisaki and his coworkers, but the extraction of parameters of the underlying commands from an observed F0 contour is an inverse problem that can be solved only by successive approximation. In order to guarantee an efficient …
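The forward direction of the Fujisaki model (generating an F0 contour from commands) can be sketched as follows: log F0 is a baseline plus phrase components (impulse responses of a critically damped second-order system) and accent components (differences of step responses). The parameter values below are illustrative defaults, not taken from the paper.

```python
import math

def fujisaki_f0(t, fb=120.0, phrases=((0.0, 0.5),),
                accents=((0.2, 0.6, 0.4),),
                alpha=3.0, beta=20.0, gamma=0.9):
    """Fujisaki-model F0 (Hz) at time t (s).

    phrases: (onset T0, magnitude Ap) pairs.
    accents: (onset T1, offset T2, amplitude Aa) triples.
    """
    def Gp(x):  # phrase control: impulse response, zero for x < 0
        return alpha * alpha * x * math.exp(-alpha * x) if x >= 0 else 0.0

    def Ga(x):  # accent control: step response with ceiling gamma
        if x < 0:
            return 0.0
        return min(1.0 - (1.0 + beta * x) * math.exp(-beta * x), gamma)

    lnf0 = math.log(fb)
    for t0, ap in phrases:
        lnf0 += ap * Gp(t - t0)
    for t1, t2, aa in accents:
        lnf0 += aa * (Ga(t - t1) - Ga(t - t2))
    return math.exp(lnf0)
```

The inverse problem the abstract refers to, recovering the command onsets and magnitudes from an observed contour, has no closed form and is solved by successive approximation against this forward model.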
A novel technique is developed to separate the audio sources from a single mixture. The method is based on decomposing the Hilbert spectrum (HS) of the mixed signal into independent source subspaces. Hilbert transform combined with empirical mode decomposition (EMD) constitutes HS, which is a fine-resolution time-frequency representation of a nonstationary …
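The Hilbert-spectrum building block can be illustrated independently of EMD: the Hilbert transform yields the analytic signal, whose magnitude and phase derivative give instantaneous amplitude and frequency. Below is a minimal frequency-domain implementation (the same construction `scipy.signal.hilbert` uses); EMD itself, which supplies the intrinsic mode functions, is not shown.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the frequency-domain Hilbert transform:
    zero out negative frequencies and double the positive ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

# One intrinsic mode function's contribution to the Hilbert spectrum:
fs = 1000.0
t = np.arange(0, 1, 1 / fs)
imf = np.cos(2 * np.pi * 50 * t)          # toy 50 Hz component
z = analytic_signal(imf)
amp = np.abs(z)                            # instantaneous amplitude
phase = np.unwrap(np.angle(z))
inst_freq = np.diff(phase) * fs / (2 * np.pi)  # ~50 Hz
```

Stacking `(time, inst_freq, amp)` for every IMF gives the fine-resolution time-frequency representation that the separation method then partitions into source subspaces.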
This paper proposes a technique which automatically estimates speakers' age using only acoustic, not linguistic, information from their utterances. This method is based upon speaker recognition techniques. In the current work, we first divided speakers of two databases, JNAS and S(senior)-JNAS, into two groups by listening tests. One group has only the …
Recently, a novel and structural representation of speech was proposed [1,2], where the inevitable acoustic variations caused by non-linguistic factors are effectively removed from speech. This structural representation captures only microphone- and speaker-invariant speech contrasts or dynamics and uses no absolute or static acoustic properties directly …