Improving Arabic broadcast transcription using automatic topic clustering

@inproceedings{Chu2012ImprovingAB,
  title={Improving {A}rabic broadcast transcription using automatic topic clustering},
  author={Stephen M. Chu and Lidia Mangu},
  booktitle={2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2012},
  pages={4449--4452}
}
Latent Dirichlet Allocation (LDA) has been shown to be an effective model for augmenting n-gram language models in speech recognition applications. In this work, we aim to take advantage of the framework's strong unsupervised learning ability and use it to uncover the topic structure embedded in the corpora in an entirely data-driven fashion. In addition, we describe a bi-level inference and classification method that allows topic clustering at the utterance level while preserving the document…
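
As a rough illustration of this kind of utterance-level topic clustering, the sketch below trains an LDA model with gensim and then blends each utterance's topic posterior with its parent document's posterior before picking a cluster. Everything here is an assumption for illustration: the toy corpus, the topic count, the choice of gensim, and the mixing weight alpha. It is not the authors' bi-level inference method, only a simplified stand-in for the general idea of letting short utterances inherit document-level context.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of utterances, each utterance a token list.
documents = [
    [["economic", "growth", "market"], ["trade", "deficit", "exports"]],
    [["election", "vote", "parliament"], ["minister", "policy", "reform"]],
]

# Train LDA on whole documents so the topics reflect document-level structure.
doc_tokens = [[tok for utt in doc for tok in utt] for doc in documents]
dictionary = corpora.Dictionary(doc_tokens)
lda = models.LdaModel(
    corpus=[dictionary.doc2bow(d) for d in doc_tokens],
    id2word=dictionary,
    num_topics=2,   # assumption: a real system would use many more topics
    passes=10,
    random_state=0,
)

def utterance_topic(doc, utt, alpha=0.5):
    """Assign a topic to one utterance, blending its own (sparse) topic
    posterior with its parent document's posterior; alpha is an assumed
    mixing weight, not a value from the paper."""
    doc_bow = dictionary.doc2bow([tok for u in doc for tok in u])
    doc_post = dict(lda.get_document_topics(doc_bow, minimum_probability=0.0))
    utt_post = dict(lda.get_document_topics(dictionary.doc2bow(utt),
                                            minimum_probability=0.0))
    blended = {k: alpha * utt_post.get(k, 0.0) + (1 - alpha) * doc_post.get(k, 0.0)
               for k in range(lda.num_topics)}
    return max(blended, key=blended.get)

# Cluster every utterance; these clusters would then define the data
# partitions on which per-topic n-gram LMs are trained.
for doc in documents:
    for utt in doc:
        print(utt, "-> topic", utterance_topic(doc, utt))
```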

Key Quantitative Results

  • Experiments show that optimizing the LM in the LDA topic space leads to a 5% reduction in language model perplexity (a toy sketch of such perplexity-driven weight tuning follows this list).
  • LM optimization in the topic space also yields a 0.2% absolute word error rate reduction on all three test sets, compared with LMs built on random partitions of the same training data.
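
As a concrete, hedged sketch of what "optimizing the LM in the topic space" can look like in practice, the code below tunes linear-interpolation weights over per-topic language models on held-out text via EM, a standard recipe for minimizing interpolated-LM perplexity. The unigram models, toy partitions, and held-out set are all invented for illustration; the paper's LMs are n-gram models, and nothing here reproduces its actual optimization.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, smoothing=1.0):
    """Add-one-smoothed unigram LM; stands in for the paper's n-gram LMs."""
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

# Toy per-topic training partitions (as produced by the topic clustering)
# and a held-out tuning set; all of this data is invented for illustration.
topic_texts = [["market", "trade", "growth", "market"],
               ["vote", "policy", "vote", "reform"]]
heldout = ["market", "vote", "growth"]
vocab = set(w for t in topic_texts for w in t) | set(heldout)
lms = [unigram_lm(t, vocab) for t in topic_texts]

# EM for mixture weights: the E-step fractionally assigns each held-out token
# to the component LMs; the M-step renormalizes those fractions into weights.
weights = [1.0 / len(lms)] * len(lms)
for _ in range(20):
    expected = [0.0] * len(lms)
    for w in heldout:
        probs = [wt * lm[w] for wt, lm in zip(weights, lms)]
        z = sum(probs)
        for i, p in enumerate(probs):
            expected[i] += p / z
    weights = [e / len(heldout) for e in expected]

# Held-out perplexity of the tuned interpolated LM.
log_prob = sum(math.log(sum(wt * lm[w] for wt, lm in zip(weights, lms)))
               for w in heldout)
print("tuned weights:", weights)
print("held-out perplexity:", math.exp(-log_prob / len(heldout)))
```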

Citations

Publications citing this paper.

Context dependent recurrent neural network language model

  • 2012 IEEE Spoken Language Technology Workshop (SLT), 2012
