An Improved KNN Text Classification Algorithm Based on Clustering

Yong Zhou, Youwen Li, Shixiong Xia
J. Comput.
The traditional KNN text classification algorithm uses all training samples for classification, so it must handle a very large number of training samples at high computational cost, and it does not reflect the differing importance of individual samples. To address these problems, this paper proposes an improved KNN text classification algorithm based on cluster centers. First, the given training sets are compressed and the samples near the border are deleted… 


Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News

The simulation results show that the combination of the proposed algorithm and the K-Means clustering algorithm reaches an accuracy of 87% and an average F-measure of 0.8029 at the best k-value of 5, and that the computation takes 55 seconds per document.

An Improved KNN Algorithm for Text Classification

An improved KNN algorithm that calculates similarity by considering the interaction and coupling relationships within and between documents, which can overcome the shortcomings of previous algorithms and improve the accuracy of KNN text classification.

An Improved Sample Mean KNN Algorithm Based on LDA

  • H. Xue, Peiwen Wang
  • Computer Science
    2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)
  • 2019
This paper focuses on the problems of large training sets and differences in sample feature numbers, and uses an improved distance formula to calculate the k nearest neighbors.

A Machine Learning Approach for Text and Document Mining

A combination of the traditional KNN classification algorithm and the K-Means clustering algorithm is proposed to overcome the main disadvantage of KNN: high computational complexity.
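
One common way to combine the two algorithms can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any paper listed here: each class is clustered separately with a plain K-Means, the cluster centers replace the original training samples, and KNN then votes over the much smaller set of labeled centers. The function names (`kmeans`, `compress`, `knn_predict`), the Euclidean distance, and the toy two-dimensional vectors are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain K-Means; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, min(n_clusters, len(points)))
    for _ in range(n_iter):
        groups = defaultdict(list)
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(groups[i]) if groups[i] else centers[i]
                   for i in range(len(centers))]
    return centers


def compress(train, clusters_per_class=2):
    """Replace each class's samples with its K-Means centers (assumed scheme)."""
    by_label = defaultdict(list)
    for vec, label in train:
        by_label[label].append(vec)
    reduced = []
    for label, vecs in by_label.items():
        for center in kmeans(vecs, clusters_per_class):
            reduced.append((center, label))
    return reduced


def knn_predict(reduced, query, k=3):
    """Majority vote among the k labeled centers nearest to the query."""
    nearest = sorted(reduced, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated classes of 2-D "document vectors".
train = [
    ([0.0, 0.1], "a"), ([0.2, 0.0], "a"), ([0.1, 0.3], "a"), ([0.3, 0.2], "a"),
    ([5.0, 5.1], "b"), ([5.2, 4.9], "b"), ([4.9, 5.3], "b"), ([5.1, 5.0], "b"),
]
reduced = compress(train, clusters_per_class=2)
print(len(train), "samples reduced to", len(reduced), "centers")
print(knn_predict(reduced, [0.1, 0.1], k=3))
```

With this scheme the cost of classifying one document depends on the number of centers rather than the number of training samples, which is where the reduction in computation comes from; the trade-off is that detail inside each cluster is lost.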

Cluster Based Text Classification Model

The experimental results show that the proposed model outperforms the existing classification models for the task of suspicious email detection and topic categorization on the Reuters-21578 and 20 Newsgroups datasets.

SAW Classification Algorithm for Chinese Text Classification

Experiments show that the SAW classification algorithm, on the premise of ensuring precision in classification, significantly improves classification precision and recall, markedly improves the performance of information retrieval, and provides an effective means of using data for information extraction in the era of big data.

Comparative Study of Five Text Classification Algorithms with their Improvements

This work is a comprehensive study of almost all the amendments made to these five text classification algorithms: Decision Tree, Support Vector Machine, K-Nearest Neighbors, Naïve Bayes, and hidden Markov model.

Improving css-KNN Classification Performance by Shifts in Training Data

The idea is to compute training data modifications such that class-representative instances are optimized before the actual k-NN algorithm is employed. This can be useful for improving the effectiveness of other classifiers as well, and it can find applications in the domains of recommendation systems and keyword-based search.

A Hybrid Model of Clustering and Classification to Enhance the Performance of a Classifier

A hybrid model that uses the K-means clustering method to preprocess the data, which adds description to the data and improves the effectiveness of the classification task; it can be deployed with any classification algorithm to improve its performance.

Improving K-nearest neighbor efficiency for text categorization

Experimental results on the 20Newsgroup and Reuters corpora show that the proposed approach increases the performance of k-NN and reduces classification time; some improved algorithms proposed in the literature to address those shortcomings are also surveyed.



A Fast KNN Algorithm for Text Categorization

  • Yu Wang, Zheng-ou Wang
  • Computer Science
    2007 International Conference on Machine Learning and Cybernetics
  • 2007
A method called TFKNN (Tree-Fast-K-Nearest-Neighbor) is presented, which can search for the exact k nearest neighbors quickly and greatly reduces the time spent on similarity computation.

Vector-Combination-Applied KNN Method for Chinese Text Categorization

An improved KNN (k-Nearest Neighbor) method for Chinese Text Categorization is proposed, which applies vector-combination technology to extract the associated discriminating words according to the CHI statistic distribution, which indicates the relationship between words and classes.

Improving Chinese text categorization by outlier learning

In this paper, an outlier-learning-based text categorization system is proposed, in which the AdaBoost algorithm is adopted for identifying outliers; simulation results reveal that the new system is successful in improving learning performance for text categorization.

Advances in Machine Learning Based Text Categorization

It is pointed out that problems such as nonlinearity, skewed data distribution, labeling bottleneck, hierarchical categorization, scalability of algorithms, and categorization of Web pages are the key problems in the study of text categorization.


  • Lu Yu
  • Computer Science
  • 2002
The essence of VSM (vector space model), a frequently used classical text classification model, is analyzed to find the reason for its low classification precision, and a weight adjustment method is put forward in which the IDF function is replaced by the evaluation function used in feature selection.

A Clustering Algorithm Using Dynamic Nearest Neighbors Selection Model

A novel algorithm named DNNS is proposed using a Dynamic Nearest Neighbors Selection model, which improves clustering quality with an appropriate selection of nearest neighbors, and experimental results on the standard databases VOTE and ZOO demonstrate that DNNS outperforms ROCK and VBACC based on the evaluation metric fα.

A Comparative Study on Feature Selection in Text Categorization

This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when the computation of these measures is too expensive.
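
DF thresholding, as described above, can be illustrated with a short sketch. It assumes documents are already tokenized into lists of terms; the function names and the `min_df` parameter are hypothetical and chosen only for this example. A term's document frequency (DF) is the number of documents containing it, and terms below the threshold are dropped from the feature set.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def select_by_df(docs, min_df=2):
    """Keep only terms that appear in at least min_df documents."""
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= min_df}


# Toy corpus of three tokenized documents.
docs = [
    ["knn", "text", "text"],
    ["knn", "cluster"],
    ["text", "rare"],
]
print(sorted(select_by_df(docs, min_df=2)))
```

The computation needs only one pass over the corpus and no class labels, which is why it is cheaper than IG or CHI.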

Machine learning in automated text categorization

This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

A Simple and Efficient Algorithm to Classify a Large Scale of Texts