Labeling text data is time-consuming but essential for automatic text classification. In particular, manually creating multiple labels for each document may become impractical when a very large amount of data is needed to train multi-label text classifiers. To minimize the human labeling effort, we propose a novel multi-label active learning...
With the wide application of large-scale graph data such as social networks, the problem of finding the top-k shortest paths is attracting increasing attention. This paper focuses on the discovery of the top-k simple shortest paths (paths without loops). The well-known algorithm for this problem is due to Yen, and the provided worst-case bound...
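For orientation, the following is a minimal Python sketch of the classic Yen's algorithm that the abstract names as the baseline. It is not the paper's own method. The adjacency-dictionary graph format and the function names `dijkstra`, `path_cost`, and `yen_k_shortest_paths` are assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
import heapq


def dijkstra(graph, source, target, banned_nodes=frozenset(), banned_edges=frozenset()):
    """Return (cost, path) for a cheapest path, or None if target is unreachable."""
    heap = [(0, source, [source])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt in settled or nxt in banned_nodes or (node, nxt) in banned_edges:
                continue
            heapq.heappush(heap, (cost + weight, nxt, path + [nxt]))
    return None


def path_cost(graph, path):
    """Sum the edge weights along a path given as a list of nodes."""
    return sum(graph[u][v] for u, v in zip(path, path[1:]))


def yen_k_shortest_paths(graph, source, target, k):
    """Return up to k loop-free (cost, path) pairs in non-decreasing cost order."""
    first = dijkstra(graph, source, target)
    if first is None:
        return []
    accepted = [first]
    candidates = []  # min-heap of (cost, path) deviations not yet accepted
    while len(accepted) < k:
        _, last_path = accepted[-1]
        for i in range(len(last_path) - 1):
            spur_node = last_path[i]
            root = last_path[:i + 1]
            # Ban the next edge of every accepted path that shares this root,
            # so the spur path is forced to deviate at the spur node.
            banned_edges = {
                (p[i], p[i + 1])
                for _, p in accepted
                if len(p) > i + 1 and p[:i + 1] == root
            }
            # Ban the root's earlier nodes so the combined path stays simple.
            banned_nodes = frozenset(root[:-1])
            spur = dijkstra(graph, spur_node, target, banned_nodes, banned_edges)
            if spur is None:
                continue
            candidate = root[:-1] + spur[1]
            entry = (path_cost(graph, candidate), candidate)
            if entry not in candidates and entry not in accepted:
                heapq.heappush(candidates, entry)
        if not candidates:
            break
        accepted.append(heapq.heappop(candidates))
    return accepted
```

Calling `yen_k_shortest_paths(graph, "C", "H", 3)` on a dictionary such as `{"C": {"D": 3, "E": 2}, ...}` returns a list of `(cost, path)` tuples. The sketch reruns a full shortest-path search for every spur node, and that repeated search dominates the running time of this baseline.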
With the rapid growth of large graphs, we can no longer assume that a graph fits entirely in memory, so disk-based graph operations are inevitable. In this paper, we take shortest path discovery as an example to investigate the technical issues that arise when leveraging the existing infrastructure of relational databases (RDB) for graph data management...
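To make the idea of running a graph operation on top of a relational database concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is a generic set-at-a-time relaxation loop, not the technique proposed in the paper. The table names `edges` and `dist`, their columns, and the function `shortest_distance` are all assumptions for illustration, and edge weights are assumed to be non-negative.

```python
import sqlite3


def shortest_distance(conn, source, target):
    """Return the shortest distance from source to target, or None if unreachable.

    Assumes a table edges(src, dst, weight) already exists in the database.
    """
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS dist")
    cur.execute("CREATE TABLE dist (node TEXT PRIMARY KEY, d REAL NOT NULL)")
    cur.execute("INSERT INTO dist (node, d) VALUES (?, 0)", (source,))
    while True:
        # One relaxation round expressed as a join: for every node reachable
        # by one edge from an already reached node, compute the best tentative
        # distance and keep it only if it improves on the stored value.
        improved = cur.execute(
            """
            SELECT c.node, c.nd
            FROM (
                SELECT e.dst AS node, MIN(d.d + e.weight) AS nd
                FROM dist AS d
                JOIN edges AS e ON e.src = d.node
                GROUP BY e.dst
            ) AS c
            LEFT JOIN dist AS old ON old.node = c.node
            WHERE old.d IS NULL OR c.nd < old.d
            """
        ).fetchall()
        if not improved:
            break  # no distance changed, so the table has converged
        cur.executemany(
            "INSERT OR REPLACE INTO dist (node, d) VALUES (?, ?)", improved
        )
    row = cur.execute("SELECT d FROM dist WHERE node = ?", (target,)).fetchone()
    return None if row is None else row[0]
```

The loop keeps all graph state in tables and lets the database engine perform the joins, so the graph never has to be loaded into application memory. An index on `edges(src)` would be the natural first tuning step in a sketch like this.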
With the advent of cloud computing, it has become desirable to use the cloud to efficiently process complex operations on large graphs without compromising their sensitive information. This paper studies shortest distance computing in the cloud, which aims at the following goals: i) protecting outsourced graphs from neighborhood attacks, ii)...
Modern large distributed applications, such as mobile communications and banking services, require fast responses to enormous numbers of frequent query requests. This kind of application usually runs in a distributed, query-intensive data environment, where the system response time depends significantly on how the data is distributed. Motivated by the efficiency...
Alternating Decision Tree (ADTree) is a successful classification model based on boosting and has a wide range of applications. The existing ADTree induction algorithms apply a "top-down" strategy to evaluate the best split at each boosting iteration, which is very time-consuming and thus unsuitable for modeling on large data sets. This paper proposes...
Density-based clustering is a family of cluster analysis methods that can discover clusters of arbitrary shape and is insensitive to noisy data. As data sets grow larger and larger, efficient data mining algorithms are increasingly needed. In this paper, we present a new fast clustering algorithm called CURD, which means Clustering Using References...