Learn More
The n-gram language model adaptation is typically formulated using deleted interpolation under the maximum likelihood estimation framework. This paper proposes a Bayesian learning framework for n-gram statistical language model training and adaptation. By introducing a Dirichlet conjugate prior to the n-gram parameters, we formulate the deleted(More)
Target entity disambiguation (TED), the task of identifying target entities of the same domain, has been recognized as a critical step in various important applications. In this paper, we propose a graph-based model called TremenRank to collectively identify target entities in short texts given a name list only. TremenRank propagates trust within the graph,(More)
We present a topic mixture language modeling approach making use of the soft classification notion of topic models. Given a text document set, we first perform document soft classification by applying a topic modeling process such as probabilistic latent semantic analyses (PLSA) or latent Dirichlet allocation (LDA) on the dataset. Then we can derive(More)
We present a semi-supervised learning (SSL) method for building domain-specific language models (LMs) from general-domain data using probabilistic latent semantic analysis (PLSA). The proposed technique first performs topic decomposition (TD) on the combined dataset of domain-specific and general-domain data. Then it derives latent topic distribution of the(More)
We present a semi-supervised learning method for building domain-specific language models (LM) from general-domain data. This method is aimed to use small amount of domain-specific data as seeds to tap domain-specific resources residing in larger amount of general-domain data with the help of topic modeling technologies. The proposed algorithm first(More)
Natural language interface is an important research topic in the area of natural language processing (NLP). Natural language interaction with robot could be the most natural and efficient way. In order to build speech enabled human language interface of robots, our research goal is to study the problems in this area and develop technologies that can(More)
In this paper, we propose a method to extend the use of latent topics into higher order n-gram models. In training, the parameters of higher order n-gram models are estimated using discounted average counts derived from the application of probabilistic latent semantic analysis(PLSA) models on n-gram counts in training corpus. In decoding, a simple yet(More)
  • 1