Development and Research of the Text Messages Semantic Clustering Methodology
This paper presents a document retrieval approach based on a combination of latent semantic indexing (LSI) and two different clustering algorithms. The idea is to first retrieve documents and create initial clusters based on LSI. We then use a flat clustering method to further group similar documents into clusters. The paper also presents a new algorithm for k-means clustering that addresses the greediness of the standard k-means algorithm. The main advantage of our method is that it forces the centroid vectors towards the extremities, and consequently starts from a completely different point than the standard algorithm. This also makes the algorithm less greedy than the standard one. Our experiments show that in many cases the two-step algorithm performs better than standard k-means, although it also fails in some cases. We believe our method can be used to retrieve relevant documents from a document collection.
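As a rough illustration of the two-step pipeline described above, the following minimal sketch projects a toy corpus into an LSI space with a truncated SVD and then groups the documents with k-means. It is not the algorithm proposed in the paper: the farthest-point initialization is only an assumed stand-in for the idea of starting centroids near the extremities, and the corpus, the function names (`lsi_embed`, `farthest_point_init`, `kmeans`), and all parameters are hypothetical.

```python
import numpy as np


def lsi_embed(docs, n_components=2):
    """Project documents into a low-rank LSI space via truncated SVD."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Document-term count matrix (rows: documents, columns: terms).
    X = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.lower().split():
            X[r, index[w]] += 1.0
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :n_components] * s[:n_components]
    # Length-normalize so Euclidean distance tracks cosine similarity.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.where(norms == 0, 1.0, norms)


def farthest_point_init(Z, k):
    """Pick k mutually distant points as starting centroids (assumed heuristic)."""
    # Start from the point farthest from the overall mean.
    first = int(np.argmax(np.linalg.norm(Z - Z.mean(axis=0), axis=1)))
    chosen = [first]
    while len(chosen) < k:
        # Distance from every point to its nearest already-chosen centroid.
        d = np.min(
            np.linalg.norm(Z[:, None, :] - Z[chosen][None, :, :], axis=2), axis=1
        )
        chosen.append(int(np.argmax(d)))
    return Z[chosen].copy()


def kmeans(Z, k, n_iter=50):
    """Plain Lloyd iterations starting from the farthest-point centroids."""
    centroids = farthest_point_init(Z, k)
    labels = None
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Hypothetical toy corpus with two topics.
docs = [
    "cat dog pet",
    "dog pet animal",
    "cat animal pet",
    "stock market trade",
    "market trade price",
    "stock price trade",
]
labels, _ = kmeans(lsi_embed(docs, n_components=2), k=2)
print(labels)
```

On this toy corpus the two topics share no vocabulary, so the first three documents should land in one cluster and the last three in the other. Real text messages overlap far more, and the choice of `n_components` and `k` would need tuning.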