Hiroyuki Shinnou

Spectral clustering is a powerful clustering method for document data sets. However, it requires solving an eigenvalue problem for a matrix derived from the similarity matrix of the data set, so it is not practical for large data sets. To overcome this problem, we propose a method to ...
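The eigenvalue step mentioned in this abstract can be illustrated with a minimal sketch. This is not the method proposed in the paper, since the abstract is cut off before describing it. It is a textbook normalized-Laplacian bipartition, and the function name `spectral_bipartition` and the toy similarity matrix are assumptions made for illustration only.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split items into two clusters by the sign of the Fiedler vector.

    Illustrative sketch only: assumes a symmetric similarity matrix
    in which every row has a positive sum.
    """
    S = np.asarray(similarity, dtype=float)
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^{-1/2} S D^{-1/2}
    laplacian = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # Full eigendecomposition; eigh returns eigenvalues in ascending order.
    # This is the step that becomes expensive as the data set grows.
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    fiedler = eigenvectors[:, 1]
    return (fiedler > 0).astype(int)


# Toy similarity matrix (hypothetical values): documents 0-2 resemble
# each other, and documents 3-5 resemble each other.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 1.0, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 1.0],
])
labels = spectral_bipartition(S)
```

Which group receives label 0 and which receives label 1 is arbitrary, because the sign of an eigenvector is not fixed. The dense eigendecomposition scales roughly cubically with the number of documents, which is the practical limitation the abstract points to.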
This paper describes a system which uses a decision tree to find and classify names in Japanese texts. The decision tree uses part-of-speech, character type, and special dictionary information to determine the probability that a particular type of name opens or closes at a given position in the text. The output is generated from ...
In this paper, we improve an unsupervised learning method using the Expectation-Maximization (EM) algorithm, proposed by Nigam et al. for text classification problems, in order to apply it to word sense disambiguation (WSD) problems. The improved method stops the EM algorithm at the optimum iteration number. To estimate that number, we propose two methods. ...
In this paper, we describe a system that divides example sentences (the data set) into clusters, based on the meaning of the target word, using a semi-supervised clustering technique. In this task, estimating the number of clusters (the number of meanings) is critical. Our system primarily concentrates on this aspect. First, a user assigns the system an ...
In this paper, we propose a practical method to detect homophone errors in Japanese texts. Detecting homophone errors is very important in Japanese revision systems because Japanese texts frequently suffer from them. To detect homophone errors, we need only solve the homophone problem. We can use the decision list to do ...
In natural language processing, it is effective to convert problems to classification problems and to solve them by an inductive learning method. However, this strategy needs labeled training data, which is fairly expensive to obtain. To overcome this problem, some learning methods using unlabeled training data have been proposed. Co-training is ...