Relevance feedback is often a critical component when designing image databases. With these databases it is difficult to specify queries directly and explicitly. Relevance feedback interactively determines a user's desired output or <i>query concept</i> by asking the user whether certain proposed images are relevant or not. For a relevance feedback…
Previous methods of distributed Gibbs sampling for LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: <i>data placement</i>, <i>pipeline processing</i>, <i>word bundling</i>, and <i>priority-based scheduling</i>. Experiments show that our strategies significantly reduce the unparallelizable…
Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurring items. Since its inception, a number of significant FIM algorithms have been developed to speed up mining performance. Unfortunately, when the dataset size is huge, both the memory use and computational cost can still be prohibitively expensive. In this work, we propose…
We propose a content-based soft annotation (CBSA) procedure for providing images with semantical labels. The annotation procedure starts with labeling a small set of training images, each with one single semantical label (e.g., forest, animal, or sky). An ensemble of binary classifiers is then trained for predicting label membership for images. The…
The proliferation of digital images and the widespread distribution of digital data made possible by the Internet have increased problems associated with copyright infringement on digital images. Watermarking schemes have been proposed to safeguard copyrighted images, but watermarks are vulnerable to image processing and geometric distortions…
Representation learning has shown its effectiveness in many tasks such as image classification and text mining. Network representation learning aims at learning distributed vector representation for each vertex in a network, which is also increasingly recognized as an important aspect for network analysis. Most network representation learning methods…
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative…
To answer user queries efficiently, a stream management system must handle continuous, high-volume, possibly noisy, and time-varying data streams. One major research area in stream management seeks to allocate resources (such as network bandwidth and memory) to query plans, either to minimize resource usage under a precision requirement, or to maximize…
This paper presents PLDA, our parallel implementation of Latent Dirichlet Allocation on MPI and MapReduce. PLDA smooths out storage and computation bottlenecks and provides fault recovery for lengthy distributed computations. We show that PLDA can be applied to large, real-world applications and achieves good scalability. We have released MPI-PLDA to open…
We propose using one-class, two-class, and multiclass SVMs to annotate images for supporting keyword retrieval of images. Providing automatic annotation requires an accurate mapping of images' low-level perceptual features (e.g., color and texture) to some high-level semantic labels (e.g., landscape, architecture, and animals). Much work has been performed…