Because topic detection and tracking (TDT) shares challenges with information retrieval, information filtering, and information extraction over bursts of news stories, it has become a hot spot in the natural language processing community. A TDT system oriented to BBS can detect and track the particular events that netizens are paying close attention to and…
The huge amount of information stored in databases owned by corporations (e.g., retail, financial, telecom) has spurred a tremendous interest in the area of knowledge discovery and data mining. Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application…
In this paper, SOStream, a novel subspace-based algorithm for clustering high-dimensional online data streams, is presented. SOStream partitions the data space into grids and maintains a superset of all dense units in an online way. Deterministic lower and upper bounds on the selectivity of each maintained unit are also given. With…
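The grid idea in this abstract can be illustrated with a minimal sketch. It is not SOStream itself: it keeps exact counts for every subspace unit up to a small dimensionality rather than a bounded superset of dense units, and the names `GridDensity`, `width`, `max_subspace_dim`, and `threshold` are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


class GridDensity:
    """Toy online counter of grid units in low-dimensional subspaces.

    Assumption: equal-width cells and exact counts; the abstract's
    superset maintenance and selectivity bounds are not reproduced here.
    """

    def __init__(self, width, max_subspace_dim=2):
        self.width = width
        self.max_subspace_dim = max_subspace_dim
        self.counts = Counter()  # unit -> number of points seen in it
        self.n = 0               # total points seen so far

    def add(self, point):
        """Update the counts of every unit that contains this point."""
        self.n += 1
        cell = [math.floor(x / self.width) for x in point]
        dims = range(len(point))
        for k in range(1, self.max_subspace_dim + 1):
            for sub in combinations(dims, k):
                # A unit is a set of (dimension, cell index) pairs.
                unit = tuple((d, cell[d]) for d in sub)
                self.counts[unit] += 1

    def dense_units(self, threshold):
        """Return units whose share of all points is at least threshold."""
        return {u for u, c in self.counts.items() if c / self.n >= threshold}


grid = GridDensity(width=1.0, max_subspace_dim=2)
for p in [(0.1, 0.2, 5.5), (0.3, 0.4, 9.1), (0.2, 0.9, 2.2), (3.5, 0.1, 7.7)]:
    grid.add(p)
dense = grid.dense_units(threshold=0.75)
```

In this toy stream the points agree on the first two dimensions but scatter along the third, so only units in the subspace spanned by dimensions 0 and 1 pass the threshold.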
The kNN classifier is widely used in text categorization. However, kNN has large computational and storage requirements, and its performance also suffers from unevenly distributed training data. Usually, a condensing technique is used to reduce noise in the training data and to cut the cost in time and space. The traditional condensing technique picks…
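As a point of reference for what condensing means, here is a minimal sketch of Hart's classic condensed nearest neighbor rule. It is a textbook baseline, not the technique this abstract goes on to propose (the abstract is cut off before describing it); the function names and the toy data are assumptions.

```python
import math


def nearest_label(x, store):
    """Label of the stored sample closest to x (1-NN, Euclidean distance)."""
    return min(store, key=lambda s: math.dist(x, s[0]))[1]


def condense(samples):
    """Hart's condensed nearest neighbor rule.

    samples: list of (feature_tuple, label) pairs.
    Returns a subset that classifies every training sample correctly
    with 1-NN. The result depends on the order of the samples.
    """
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for x, y in samples:
            if (x, y) in store:
                continue
            # Keep only the samples the current store gets wrong.
            if nearest_label(x, store) != y:
                store.append((x, y))
                changed = True
    return store


training = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"),
]
condensed = condense(training)
```

On this toy set one prototype per class survives, which shows the reduction in stored samples that condensing aims for.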
The classification of deep Web sources is an important area in large-scale deep Web integration, which is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. To…
In this paper, a novel algorithm for clustering data streams with mixed numeric and categorical attributes (CNC-Stream) is proposed. A new entropy-based similarity measure, which determines the similarity between objects (data points in the stream or micro-clusters in memory), is also presented; it is what makes CNC-Stream work. The experiments conducted…
K-Nearest Neighbor is used broadly in text classification, but it has one deficiency: computational efficiency. In this paper, we propose a heuristic search method to find the k nearest neighbors quickly. A simulated annealing algorithm and an inverted array are used to help find the expected neighbors. Our experimental results demonstrate a significant…
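The role an inverted array can play in speeding up kNN over text might be sketched as follows. This covers only candidate pruning through an inverted index followed by exact cosine scoring; the simulated annealing search mentioned in the abstract is not reproduced, and the names `build_index`, `cosine`, and `knn` are assumptions.

```python
import heapq
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):
            index[term].add(doc_id)
    return index


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn(query, docs, index, k):
    """Score only documents sharing at least one term with the query."""
    q = Counter(query)
    candidates = set()
    for term in q:
        candidates |= index.get(term, set())
    scored = [(cosine(q, Counter(docs[i])), i) for i in candidates]
    return [i for _, i in heapq.nlargest(k, scored)]


docs = [
    ["stream", "cluster"],
    ["text", "classification", "knn"],
    ["knn", "text"],
    ["grid", "density"],
]
index = build_index(docs)
neighbors = knn(["knn", "text"], docs, index, k=2)
```

Documents that share no term with the query have zero cosine similarity, so skipping them leaves the ranking unchanged while avoiding a full scan of the training set.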