Text clustering for topic detection

Young-Woo Seo and Katia P. Sycara
Abstract: The World Wide Web represents vast stores of information. However, the sheer amount of such information makes it practically impossible for any human user to be aware of much of it. Therefore, it would be very helpful to have a system that automatically discovers relevant, yet previously unknown information, and reports it to users in human-readable form. As the first attempt to accomplish such a goal, we proposed a new clustering algorithm and compared it with existing clustering…


Research of Automatic Topic Detection Based on Incremental Clustering

This paper proposes a new topic detection method (TPIC) based on an incremental clustering algorithm that strives to achieve a high accuracy and the capability of estimating the true number of topics in the document corpus.

A Method for Topic Detection in Great Volumes of Data

This paper explores a feature-reduction methodology that highlights the most significant topics within a document corpus by applying a clustering algorithm (X-means) to the tf-idf matrix computed from the corpus.
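As a rough illustration of the tf-idf matrix mentioned in this summary, the sketch below builds one from raw strings. The summary does not say which weighting variant the paper uses, so the choices here (length-normalised term frequency, idf = log(N / df), whitespace tokenisation) and the function name `tfidf_matrix` are assumptions, and the X-means clustering step is not shown.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the tf-idf weight of vocab[j] in docs[i].

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, rows
```

With this weighting, a term that occurs in every document gets weight zero, so only terms that distinguish documents contribute to the distances a clustering algorithm would see.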

Topic Detection Using MFSs

This paper describes the original method for extracting the set of Maximal Frequent word Sequences (MFSs), shows how it can be adapted to identify topics in a textual dataset, and demonstrates how the MFSs themselves can act as topic descriptors for the clusters.

Detecting Topic Labels for Tweets by Matching Features from Pseudo-Relevance Feedback

A novel pseudo-relevance feedback algorithm is proposed to accurately identify topic labels for short texts that robustly handles noise in both the short texts and the feedback source through a method called 'feature matching'.

Topic Detection from Short Text: A Term-based Consensus Clustering method

A Term-based Consensus Clustering (TCC) topic detection framework is developed to provide an unsupervised methodology for finding distinct topics within SMS collections, and it is demonstrated that TCC obtains the best clustering performance when observing a large number of the predefined topics across short text collections.

Chinese Text Clustering for Topic Detection Based on Word Pattern Relation

This research adopts the method of word expansion to compose relevant features into the same semantic concept, then assigns the corresponding documents to concept clusters, and finally merges the

A Comparative Study between Single-Pass Algorithm and K-means Algorithm in Web Topic Detection

The overall flow of Web topic detection is described, then the Single-Pass and K-means algorithms are compared, and the results show that the Single-Pass algorithm performs better than K-means in Web topic detection.
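For readers unfamiliar with single-pass clustering, here is a minimal sketch of the general technique, not the cited paper's implementation. Each document is compared once against the existing cluster centroids and either joins the most similar cluster or starts a new one. The bag-of-words vectors, cosine similarity, the `threshold` value, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.5):
    """Cluster docs in one pass; returns a list of lists of document indices."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, doc in enumerate(docs):
        vec = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # Centroid is kept as summed counts; cosine ignores the scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]
```

Unlike K-means, this procedure does not need the number of clusters in advance, but its output depends on the threshold and on the order in which documents arrive.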

Topic extraction in social media

The goal of the research described in this paper was to develop a prototype that can "feel the pulse of the Arabic users with regards to a certain hot topic."

Unsupervised Topic Detection in document collections: an application in marketing and business journals

A new methodology is introduced that facilitates the determination of the number of topics discussed in a given text collection, and may help both scientists and practitioners to systematically discover topics in digital information environments, as provided by the internet for instance.

Forum topic detection based on hierarchical clustering

  • Hui Li, Qing Li
  • Computer Science
    2016 International Conference on Audio, Language and Image Processing (ICALIP)
  • 2016
The principle of maximum entropy and information gain is introduced for calculating feature weights, and agglomerative hierarchical clustering (AHC) is applied to a game forum to handle sparse, short forum texts.



NewsWeeder: Learning to Filter Netnews

Principal Direction Divisive Partitioning

  • Daniel Boley
  • Computer Science
    Data Mining and Knowledge Discovery
  • 2004
A new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high dimensional Euclidean space (i.e., in which every document is a vector of real numbers) that operates by repeatedly splitting clusters into smaller clusters.

Model-based Gaussian and non-Gaussian clustering

The classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum of squares criterion and on the criterion of Friedman and Rubin (1967), but it is restricted to Gaussian distributions and it does not allow for noise.

Neural Networks for Pattern Recognition

Foundations of statistical natural language processing

This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.

Topic Detection and Tracking Pilot Study Final Report

Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem

Learning approaches for detecting and tracking news events

The authors extend existing supervised-learning and unsupervised-clustering algorithms to allow document classification based on the information content and temporal aspects of news events to be classified using manually segmented documents.

Pattern Classification

Classification • Supervised – parallelepiped – minimum distance – maximum likelihood (Bayes Rule) > non-parametric > parametric – support vector machines – neural networks – context classification •

Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper


Neural Networks: A Comprehensive Foundation

  • Simon Haykin