Learn More
This chapter covers the Trio database management system. Trio is a robust prototype that supports uncertain data and data lineage, along with the standard features of a relational DBMS. Trio's new ULDB data model is an extension to the relational model capturing various types of uncertainty along with data lineage, and its TriQL query language extends SQL(More)
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the(More)
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in tandem.(More)
Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the(More)
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not " embarrassingly parallel, " the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of(More)
Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting d the diameter of the graph, and n the number of nodes in the(More)
This paper explores an inherent tension in modeling and querying uncertain data: simple, intuitive representations of uncertain data capture many application requirements, but these representations are generally incomplete―standard operations over the data may result in unrepresentable types of uncertainty. Complete models are theoretically(More)
This paper introduces ULDBs, an extension of re-lational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in tandem.(More)
We consider the problem of human-assisted graph search: given a directed acyclic graph with some (unknown) target node(s), we consider the problem of finding the target node(s) by asking an omniscient human questions of the form " Is there a target node that is reachable from the current node? ". This general problem has applications in many domains that(More)
We study the problem of computing query results with confidence values in ULDBs: relational databases with uncertainty and lineage. ULDBs, which subsume probabilistic databases, offer an alternative decoupled method of computing confidence values: Instead of computing confidences during query processing, compute them afterwards based on lineage. This(More)