Set-valued data, in which a set of values is associated with an individual, is common in databases ranging from market basket data, to medical databases of patients' symptoms and behaviors, to query engine search logs. Anonymizing this data is important if we are to reconcile the conflicting demands arising from the desire to release the data for study and ...
Deep-web crawling is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including ...
Similarity join is the problem of finding pairs of records with a similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes ...
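To make the problem statement concrete, here is a minimal single-machine sketch of a threshold similarity join. It is a brute-force baseline, not the ClusterJoin framework or its MapReduce partitioning, which the truncated abstract does not describe in full. The choice of Jaccard similarity, the function names `jaccard` and `similarity_join`, and the sample records are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; treated as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_join(records, threshold):
    """Return (id1, id2, score) for every pair scoring strictly above threshold.

    Brute force: compares all O(n^2) pairs, so it only illustrates the
    problem definition, not a scalable solution.
    """
    results = []
    # Sort by record id so the output order is deterministic.
    for (id1, s1), (id2, s2) in combinations(sorted(records.items()), 2):
        score = jaccard(s1, s2)
        if score > threshold:
            results.append((id1, id2, score))
    return results


# Hypothetical records: each id maps to a set of tokens.
records = {
    "r1": {"a", "b", "c"},
    "r2": {"a", "b", "d"},
    "r3": {"x", "y"},
}
pairs = similarity_join(records, 0.4)
print(pairs)
```

Here only `r1` and `r2` share enough tokens (2 of 4 distinct tokens, a score of 0.5) to pass the 0.4 threshold. The quadratic pair enumeration is exactly the cost that a distributed approach would need to avoid.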
Keyword search over entity databases (e.g., product, movie databases) is an important problem. Current techniques for keyword search on databases often return incomplete and imprecise results. On the one hand, they either require that relevant entities contain all (or most) of the query keywords, or that relevant entities and the query keywords occur ...
In comparison to the extensive body of existing work considering publish-once, static anonymization, dynamic anonymization is less well studied. Previous work, most notably m-invariance, has made considerable progress in devising a scheme that attempts to prevent individual records from being associated with too few sensitive values. We show, however, that ...
Complex Event Processing (CEP) systems are stream processing systems that monitor incoming event streams in search of user-specified event patterns. While CEP systems have been adopted in a variety of applications, the privacy implications of event pattern reporting mechanisms have yet to be studied, in stark contrast to the significant amount of attention ...
Complex Event Processing (CEP) has emerged as a technology for monitoring event streams in search of user-specified event patterns. When a CEP system is deployed in sensitive environments, the user may wish to mitigate leaks of private information while ensuring that useful non-sensitive patterns are still reported. In this paper, we consider how to suppress ...
We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative ...