Chris Jermaine

To deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system's ability to gracefully handle complex or unforeseen types of uncertainty, and does not permit the uncertainty …
Broadcast is a scalable way of disseminating data because broadcasting an item satisfies all outstanding client requests for it. However, because the transmission medium is shared, individual requests may have high response times. In this paper, we show how to minimize the average response time given multiple broadcast channels by optimally partitioning …
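To illustrate the flavor of the partitioning problem (a minimal sketch, not the paper's algorithm): assume unit-length items broadcast round-robin on each channel, so an item on a channel holding k items waits k/2 slots on average, and assume the optimal partition is contiguous once items are sorted by access probability. Under those assumptions, a small dynamic program finds the best split points; all names here are illustrative.

def partition_broadcast(probs, num_channels):
    # probs: per-item access probabilities, sorted in descending order
    n = len(probs)
    prefix = [0.0]
    for p in probs:
        prefix.append(prefix[-1] + p)

    def seg_cost(i, j):
        # expected wait contributed by items i..j-1 sharing one channel:
        # (segment length / 2) weighted by the segment's demand mass
        return (j - i) / 2.0 * (prefix[j] - prefix[i])

    INF = float("inf")
    # dp[c][j]: best cost of placing the first j items on c channels
    dp = [[INF] * (n + 1) for _ in range(num_channels + 1)]
    cut = [[0] * (n + 1) for _ in range(num_channels + 1)]
    dp[0][0] = 0.0
    for c in range(1, num_channels + 1):
        for j in range(1, n + 1):
            for i in range(c - 1, j):
                cost = dp[c - 1][i] + seg_cost(i, j)
                if cost < dp[c][j]:
                    dp[c][j], cut[c][j] = cost, i
    # recover the segment boundaries from the recorded cut points
    bounds, j = [], n
    for c in range(num_channels, 0, -1):
        bounds.append(j)
        j = cut[c][j]
    return dp[num_channels][n], sorted(bounds)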
This article describes query processing in the DBO database system. Like other database systems designed for ad hoc analytic processing, DBO is able to compute the exact answers to queries over a large relational database in a scalable fashion. Unlike any other system designed for analytic processing, DBO can constantly maintain a guess as to the final …
Since the 1970s, database systems have been "compute-centric". When a computation needs the data, it requests the data, and the data are pulled through the system. We believe that this is problematic for two reasons. First, requests for data naturally incur high latency as the data are pulled through the memory hierarchy, and second, it makes it difficult …
In online aggregation, a database system processes a user’s aggregation query in an online fashion. At all times during processing, the system gives the user an estimate of the final query result, with confidence bounds that become tighter over time. In this paper, we consider how online aggregation can be built into a MapReduce system for large-scale …
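A minimal sketch of the online-aggregation contract, assuming rows can be scanned in a statistically random order: report a running estimate of AVG(value) with a CLT-based 95% confidence interval that tightens as more rows are consumed. The function name and shuffle-based sampling are illustrative, not DBO's or the paper's actual estimators.

import math, random

def online_avg(rows, report_every=1000, z=1.96):
    random.shuffle(rows)  # online aggregation needs a random scan order
    n = total = total_sq = 0
    for v in rows:
        n += 1
        total += v
        total_sq += v * v
        if n % report_every == 0:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n)  # half-width shrinks like 1/sqrt(n)
            print("after %d rows: AVG ~ %.3f +/- %.3f" % (n, mean, half))
    return total / n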
When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the …
We consider the problem of similarity search in a very large sequence database with edit distance as the similarity measure. Given limited main memory, our goal is to develop a reference-based index that reduces the number of costly edit distance computations in order to answer a query. The idea in reference-based indexing is to select a small set of …
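The pruning machinery behind a reference-based index can be sketched in a few lines, assuming a range query with a fixed radius: precompute each sequence's distance to every reference, then use the triangle inequality to lower-bound the query distance and skip the expensive exact computation where possible. Reference selection below is plain random sampling; the selection strategy is where the paper's contribution lies, and this sketch does not reproduce it.

import random

def edit_distance(a, b):
    # standard dynamic-programming edit distance with a rolling row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(db, num_refs=4):
    refs = random.sample(db, num_refs)
    # precompute d(s, r) for every database sequence s and reference r
    table = [[edit_distance(s, r) for r in refs] for s in db]
    return refs, table

def range_query(q, db, refs, table, radius):
    dq = [edit_distance(q, r) for r in refs]
    hits = []
    for s, ds in zip(db, table):
        # triangle inequality: d(q, s) >= |d(q, r) - d(s, r)| for every reference r
        lower = max(abs(a - b) for a, b in zip(dq, ds))
        if lower <= radius and edit_distance(q, s) <= radius:
            hits.append(s)
    return hits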
While the desire to support fast, ad hoc query processing for large data warehouses has motivated the recent introduction of many new indexing structures, with a few notable exceptions (namely, the LSM-Tree [4] and the Stepped Merge Method [1]) little attention has been given to developing new indexing schemes that allow fast insertions. Since additions to …
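A minimal sketch of the insert-optimized pattern behind the LSM-Tree and the Stepped Merge Method: absorb insertions into a small in-memory buffer and periodically merge it into a large sorted run, turning many random writes into one sequential pass. The single-run design, class name, and threshold are illustrative only.

import bisect

class BufferedIndex:
    def __init__(self, buffer_limit=1024):
        self.buffer = []      # unsorted recent insertions (in memory)
        self.run = []         # sorted "on-disk" component
        self.buffer_limit = buffer_limit

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_limit:
            self._merge()

    def _merge(self):
        # one sequential pass folds the buffer into the sorted run
        self.run = sorted(self.buffer + self.run)
        self.buffer = []

    def contains(self, key):
        if key in self.buffer:
            return True
        i = bisect.bisect_left(self.run, key)
        return i < len(self.run) and self.run[i] == key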
This paper deals with detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, we define a statistical test called the "density test" for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. We define a test …
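One way to make the idea concrete (a hedged one-dimensional sketch, not the paper's actual statistic or calibration): fit a kernel density estimate on the baseline, score the observed points by their average log-density, and calibrate that score against same-size resamples of the baseline.

import math, random

def kde_logpdf(x, sample, bw):
    # Gaussian kernel density estimate fitted on the baseline sample
    c = 1.0 / (len(sample) * bw * math.sqrt(2 * math.pi))
    return math.log(sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in sample) * c)

def density_test(baseline, observed, bw=0.5, trials=200):
    score = sum(kde_logpdf(x, baseline, bw) for x in observed) / len(observed)
    null = []
    for _ in range(trials):
        resample = random.sample(baseline, len(observed))
        null.append(sum(kde_logpdf(x, baseline, bw) for x in resample) / len(resample))
    # one-sided p-value: how often does the baseline itself score this low?
    p = sum(1 for s in null if s <= score) / trials
    return p  # a small p suggests the observed points come from a changed distribution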
In applications that produce a large amount of data describing the paths of moving objects, there is a need to ask questions about the interaction of objects over a long recorded history. In this paper, we consider the problem of computing joins over massive moving object histories. The particular join that we study is the "Closest-Point-Of-Approach" join, …
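The predicate underneath such a join can be sketched directly, assuming each object moves linearly over the recorded interval: the time of closest approach minimizes the relative distance, and a pair joins when that minimum distance falls within a threshold. The brute-force pairing below ignores the spatio-temporal partitioning a join over massive histories would need.

import math

def cpa_distance(p1, v1, p2, v2, t_end):
    w0 = [a - b for a, b in zip(p1, p2)]   # relative position at t = 0
    dv = [a - b for a, b in zip(v1, v2)]   # relative velocity
    dv2 = sum(d * d for d in dv)
    # minimize |w0 + t * dv| over t; constant relative velocity means t* is closed-form
    t = 0.0 if dv2 == 0 else -sum(w * d for w, d in zip(w0, dv)) / dv2
    t = min(max(t, 0.0), t_end)            # clamp to the segment's lifetime
    return math.dist([w + t * d for w, d in zip(w0, dv)], [0.0] * len(w0))

def cpa_join(objects, t_end, threshold):
    # objects: list of (id, position, velocity); returns pairs that pass within threshold
    out = []
    for i, (ida, pa, va) in enumerate(objects):
        for idb, pb, vb in objects[i + 1:]:
            if cpa_distance(pa, va, pb, vb, t_end) <= threshold:
                out.append((ida, idb))
    return out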