Spectral clustering is a powerful clustering method for document data set. However, spectral clustering needs to solve an eigenvalue problem of the matrix converted from the similarity matrix corresponding to the data set. Therefore, it is not practical to use spectral clustering for a large data set. To overcome this problem, we propose the method to… (More)
This paper describes a system which uses a decision tree to find and classify names in Japanese texts. The decision tree uses part-of-speech, character type, and special dictionary information to determine the probability that a particular type of name opens or closes at a given position in the text. The output is generated from the consistent sequence of… (More)
In this paper, we propose a new ensemble document clustering method. The novelty of our method is the use of Non-negative Matrix Factorization (NMF) in the generation phase and a weighted hypergraph in the integration phase. In our experiment, we compared our method with some clustering methods. Our method achieved the best results .
In this paper, we improve an unsuper-vised learning method using the Expectation-Maximization (EM) algorithm proposed by Nigam et al. for text classification problems in order to apply it to word sense disambigua-tion (WSD) problems. The improved method stops the EM algorithm at the optimum iteration number. To estimate that number, we propose two methods.… (More)
In this paper, we describe a system that divides example sentences (data set) into clusters, based on the meaning of the target word, using a semi-supervised clustering technique. In this task, the estimation of the cluster number (the number of the meaning) is critical. Our system primarily concentrates on this aspect. First, a user assigns the system an… (More)
In natural language processing, it is effective to convert problems to classification problems, and to solve them by an inductive learning method. However, this strategy needs labeled training data which is fairly expensive to obtain. To overcome this problem, some learning methods using unlabeled training data have been proposed. Co-training is… (More)