Data Set Used
The increasing number of digitized texts presently available notably on the Web has developed an acute need in text mining techniques. Clustering systems are used more and more often in text mining, especially to analyze texts and to extract knowledge they contain. With the availability of the vast amount of clustering algorithms and techniques, it becomes… (More)
This paper deals with our research on unsupervised classification for automatic language identification purpose. The study of this new hybrid algorithm shows that the combination of the Kmeans and the artificial ants and taking advantage of an n-gram text representation is promising. We propose an alternative approach to the standard use of both algorithms.… (More)
The classification of textual documents has been widely studied. The majority of classification approaches use supervised learning methods, which are acceptable for rather small corpora allowing experts to generate representative sets of data for the training, but are not feasible for significant flows of data. Unsupervised classification methods discover… (More)
Dans cet article, nous proposons la méthode des SOM (cartes auto-organisatrices de Kohonen) pour la classification non supervisée de documents textuels basés sur les n-grammes. La même méthode basée sur les synsets de WordNet comme termes pour la représentation des documents est étudiée par la suite. Ces combinaisons sont évaluées et comparées.
With the great and rapidly growing number of documents available in digital form (Internet, library, CD-Rom…), the automatic classification of texts has become a significant research field and a fundamental task in document processing. This paper deals with unsupervised classification of textual documents also called text clustering using Self-Organizing… (More)
A great number of methods of unsupervised classifications also called clustering were applied to the textual documents. In this paper, we initially propose the method of the self-organizing maps of Kohonen for the clustering of the textual documents based on the n-grams representation. The same method based on the synsets of WordNet as terms for the… (More)