Learn More
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the(More)
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in tandem.(More)
This chapter covers the Trio database management system. Trio is a robust prototype that supports uncertain data and data lineage, along with the standard features of a relational DBMS. Trio's new ULDB data model is an extension to the relational model capturing various types of uncertainty along with data lineage, and its TriQL query language extends SQL(More)
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not " embarrassingly parallel, " the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of(More)
Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the(More)
Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting d the diameter of the graph, and n the number of nodes in the(More)
This paper explores an inherent tension in modeling and querying uncertain data: simple, intuitive representations of uncertain data capture many application requirements, but these representations are generally incomplete―standard operations over the data may result in unrepresentable types of uncertainty. Complete models are theoretically(More)
Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources(More)
This paper introduces uldbs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in tandem.(More)
We study the problem of computing query results with confidence values in ULDBs: relational databases with uncertainty and lineage. ULDBs, which subsume probabilistic databases, offer an alternative decoupled method of computing confidence values: Instead of computing confidences during query processing, compute them afterwards based on lineage. This(More)