Learn More
Set-valued data, in which a set of values are associated with an individual, is common in databases ranging from market basket data, to medical databases of patients' symptoms and behaviors, to query engine search logs. Anonymizing this data is important if we are to reconcile the conflicting demands arising from the desire to release the data for study and(More)
Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including(More)
We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative(More)
It is well known today that pages on the Web contain a large number of content-rich relational tables. Such tables have been systematically extracted in a number of efforts to empower important applications such as table search and schema discovery. However, a significant fraction of relational tables are <i>not</i> embedded in the standard HTML table tags,(More)
Keyword search over entity databases (e.g., product, movie databases) is an important problem. Current techniques for keyword search on databases may often return incomplete and imprecise results. On the one hand, they either require that relevant entities contain all (or most) of the query keywords , or that relevant entities and the query keywords occur(More)
Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution , and distributes(More)
In comparison to the extensive body of existing work considering publish-once, static anonymization, dynamic anonymization is less well studied. Previous work, most notably m-invariance, has made considerable progress in devising a scheme that attempts to prevent individual records from being associated with too few sensitive values. We show, however, that(More)
Complex Event Processing (CEP) Systems are stream processing systems that monitor incoming event streams in search of userspecified event patterns. While CEP systems have been adopted in a variety of applications, the privacy implications of event pattern reporting mechanisms have yet to be studied - a stark contrast to the significant amount of attention(More)