Improving Word Recognition in Speech Transcriptions by Decision-level Fusion of Stemming and Two-way Phoneme Pruning

  • Sunakshi Mehra, Seba Susan
We introduce an unsupervised approach for correcting highly imperfect speech transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are acquired from videos by extracting the audio with the FFmpeg framework and converting the audio to a text transcript with the Google API. The benchmark LRW dataset contains 500 word categories with 50 videos per class in mp4 format. All videos are 29 frames (1.16 seconds) long, and the word appears in the…
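To make the fusion idea concrete, here is a minimal sketch of decision-level fusion for matching one noisy transcript word against a vocabulary. The stemmer, the pruning rule (a consonant-skeleton heuristic standing in for the paper's two-way phoneme pruning), and the vocabulary are all illustrative assumptions, not the paper's exact algorithm:

```python
# Illustrative sketch, not the paper's exact method: fuse a stemming
# match and a phoneme-pruning match at the decision level.

def stem(word):
    # Crude suffix stripper standing in for a Porter-style stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def prune_phonemes(word):
    # Approximates phoneme pruning by dropping vowels (a common
    # consonant-skeleton heuristic); the paper's actual rules differ.
    return "".join(c for c in word if c not in "aeiou")

def fuse_match(transcript_word, vocabulary):
    # Decision-level fusion: accept a vocabulary word if EITHER the
    # stemmed form OR the pruned form agrees with the transcript word.
    for v in vocabulary:
        if stem(transcript_word) == stem(v):
            return v
        if prune_phonemes(transcript_word) == prune_phonemes(v):
            return v
    return None

vocab = ["absolutely", "according", "benefit"]
print(fuse_match("benefits", vocab))   # stemming branch matches
print(fuse_match("accordng", vocab))   # pruning branch matches
```

Fusing the two decisions lets each branch recover errors the other misses: stemming handles inflectional noise, while pruning tolerates dropped or misrecognized vowels.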


Weighted Ensemble of Neural and Probabilistic Graphical Models for Click Prediction
Predicting user behavior in web mining is an important concept with commercial implications. The user response to search engine results is crucial for understanding the relative popularity of…


Automatic sub-word unit discovery and pronunciation lexicon induction for ASR with application to under-resourced languages
The proposed method discovers lexicons that perform as well as baseline expert systems for Acholi, and close to this level for the other two languages, when used to train DNN-HMM ASR systems, demonstrating the method's potential to enable and accelerate ASR for under-resourced languages by eliminating the dependence on human expertise.
Alignment of Speech to Highly Imperfect Text Transcriptions
  • A. Haubold, J. Kender
  • Computer Science
  • 2007 IEEE International Conference on Multimedia and Expo
  • 2007
A novel and inexpensive approach for the temporal alignment of speech to highly imperfect transcripts from automatic speech recognition (ASR) is presented; the alignment performance is promising, showing correct matching of phonemes within 10-, 20-, and 30-second error margins.
Improved Arabic speech recognition system through the automatic generation of fine-grained phonetic transcriptions
A set of language-dependent grapheme-to-allophone rules is developed that can predict such allophonic variations and hence provide a phonetic transcription that is sensitive to the local context for the automatic speech recognition system.
Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only
This work proposes a framework to achieve unsupervised ASR on a read English speech dataset where audio and text are unaligned; semantic embeddings of audio segments are trained from the vector representations using a skip-gram model.
Multi-label Classification Models for Detection of Phonetic Features in building Acoustic Models
  • Rupam Ojha, C. Sekhar
  • Computer Science
  • 2019 International Joint Conference on Neural Networks (IJCNN)
  • 2019
Performance improvement over other phoneme recognition studies using phonetic features is obtained, and the effectiveness of the proposed approach is demonstrated on the TIMIT and Wall Street Journal corpora.
Subword based approach for grapheme-to-phoneme conversion in Bengali text-to-speech synthesis system
  • K. Ghosh, K. S. Rao
  • Computer Science
  • 2012 National Conference on Communications (NCC)
  • 2012
The proposed subword-based approach for G2P conversion in a text-to-speech (TTS) synthesis system improves the accuracy of the rule-based approach by resolving ambiguity, especially in the case of inflected or compound words.
Automatic alignment and error correction of human generated transcripts for long speech recordings
A new alignment approach for approximate transcriptions of long audio files is presented, designed to discover and correct errors in the manual transcription during the alignment process.
Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages
It is found that combining two existing target-language-only methods yields better features than either method alone, and even better results are obtained by extracting target-language bottleneck features using a model trained on other languages.
Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling
  • Siyuan Feng, Tan Lee
  • Computer Science, Engineering
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
This research addresses the problem of acoustic modeling of low-resource languages for which transcribed training data is absent, and shows that robust BNF representations can be learned by effectively leveraging transcribed speech data and well-trained automatic speech recognition systems from one or more out-of-domain languages.
Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling
PASM, a sub-word extraction method that leverages the pronunciation information of a word, is proposed; it can greatly improve upon the character-based baseline and also outperforms commonly used byte-pair encoding methods.