Document clustering using word clusters via the information bottleneck method

@inproceedings{Slonim2000DocumentCU,
  title={Document clustering using word clusters via the information bottleneck method},
  author={Noam Slonim and Naftali Tishby},
  booktitle={SIGIR '00},
  year={2000}
}
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and… 
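
The word-clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a brute-force greedy merge on a made-up count table, and the names `mutual_information`, `merge_columns`, and `greedy_word_clusters` are hypothetical helpers introduced only for this example. It shows how merging word columns yields a smaller joint table p(X, Ỹ) whose mutual information with the documents never exceeds I(X; Y).

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint table joint[x][y] whose entries sum to 1."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

def merge_columns(joint, a, b):
    """Merge word columns a and b into one cluster column (appended last)."""
    keep = [j for j in range(len(joint[0])) if j not in (a, b)]
    return [[row[j] for j in keep] + [row[a] + row[b]] for row in joint]

def greedy_word_clusters(joint, n_clusters):
    """Repeatedly apply the merge that keeps the most information about X."""
    while len(joint[0]) > n_clusters:
        n = len(joint[0])
        pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
        a, b = max(
            pairs,
            key=lambda ab: mutual_information(merge_columns(joint, *ab)),
        )
        joint = merge_columns(joint, a, b)
    return joint

# Toy counts (assumed data): rows are documents X, columns are words Y.
counts = [
    [4, 3, 0, 1],
    [5, 4, 1, 0],
    [0, 1, 4, 5],
]
total = sum(map(sum, counts))
p_xy = [[c / total for c in row] for row in counts]

p_x_ytilde = greedy_word_clusters(p_xy, 2)   # joint table p(X, Ỹ)
i_full = mutual_information(p_xy)            # I(X; Y)
i_clustered = mutual_information(p_x_ytilde) # I(X; Ỹ), never above I(X; Y)
print(f"I(X;Y) = {i_full:.3f} bits, I(X;Ỹ) = {i_clustered:.3f} bits")
```

Re-evaluating the mutual information for every candidate pair is only practical for tiny tables; it is used here to keep the sketch short and easy to check by hand.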

Distributed Document Clustering Using Word-clusters

  • D. Deb, R. Angryk
  • Computer Science
    2007 IEEE Symposium on Computational Intelligence and Data Mining
  • 2007
TLDR
DIB adopts a two-stage agglomerative information bottleneck (aIB) algorithm to generate local clusters to cluster distributed documents and demonstrates the robustness, efficiency and effectiveness of this approach.

The Power of Word Clusters for Text Classification

TLDR
This work applies the information bottleneck method to find word clusters that preserve the information about document categories and uses these clusters as features for classification, and shows that when the training sample is small, word clusters can yield significant improvement in classification accuracy.

Document Clustering

TLDR
An improvement of the graph partitioning techniques used for document clustering and a completely different approach in which the words are clustered first and then the word cluster is used to cluster the documents.

A scaleable document clustering approach for large document corpora

Document clustering of scientific texts using citation contexts

TLDR
The experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles.

Clustering of Scientific Texts Using Citation Contexts

TLDR
This paper proposes a new approach for clustering scientific documents, based on the utilization of citation contexts, and hypothesizes that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation.

Unsupervised document classification using sequential information maximization

TLDR
A novel sequential clustering algorithm which is motivated by the Information Bottleneck method is presented, and it is found to be consistently superior to all the other clustering methods examined, typically by a significant margin.

Incremental Clustering Using Information Bottleneck Theory

TLDR
An improved sequential clustering algorithm (SIB) is proposed to adjust the intermediate clustering results and experimental results show that the ICIB method achieves higher accuracy and time performance than K-Means, AIB and SIB algorithms.

Automatic Document Clustering using Topic Analysis

TLDR
This work applies topic segmentation to detect topics within documents and, using term relationships, attempts to build hierarchies which represent a “real world” topic hierarchy; it also proposes two evaluation methods for document clustering systems.

Topic Hierarchy Generation for Text Segments: A Practical Web-based Approach

TLDR
This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments and addresses the problem of generating topic hierarchies for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source.
...

References

Web document clustering: a feasibility demonstration

TLDR
To satisfy the stringent requirements of the Web domain, an incremental, linear time algorithm called Suffix Tree Clustering (STC) is introduced which creates clusters based on phrases shared between documents, showing that STC is faster than standard clustering methods in this domain.

Recent trends in hierarchic document clustering: A critical review

  • P. Willett
  • Computer Science
    Inf. Process. Manag.
  • 1988

Distributional clustering of words for text classification

TLDR
This paper describes the application of Distributional Clustering to document classification and shows that it can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection.

Distributional Clustering of English Words

TLDR
Deterministic annealing is used to find lowest distortion sets of clusters: as the annealed parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data.

Cluster-based language models for distributed retrieval

TLDR
A new approach to distributed retrieval based on document clustering and language modeling is proposed and it is shown that all three methods improve the effectiveness of distributed retrieval.

Scatter/Gather: a cluster-based approach to browsing large document collections

TLDR
This work presents a document browsing technique that employs document clustering as its primary operation, and presents fast (linear time) clustering algorithms which support this interactive browsing paradigm.

Interactive Internet search through automatic clustering (poster abstract): an empirical study

TLDR
The results indicate that the subjects spent less time finding correct answers using Adaptive Search than using the search engine directly, and suggest that document clustering can be integrated into an interactive search system in such a way that it substantially helps information seekers.

Agglomerative Information Bottleneck

TLDR
A novel distributional clustering algorithm that maximizes the mutual information per cluster between data and given categories and achieves compression by three orders of magnitude while losing only 10% of the original mutual information.

Data Clustering by Markovian Relaxation and the Information Bottleneck Method

TLDR
This method combines a pairwise-based approach with a vector-quantization method, which provides a meaningful interpretation of the resulting clusters; it can cluster data with no geometric or other bias and makes no assumption about the underlying distribution.