Christopher Ré

Learn More
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and(More)
Over the past few years, Markov Logic Networks (MLNs) have emerged as a powerful AI framework that combines statistical and logical reasoning. It has been applied to a wide range of data management problems, such as information extraction, ontology matching, and text mining, and has become a core technology underlying several major AI projects. Because of(More)
MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is(More)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was(More)
This paper describes a new approach for computing nonnegative matrix factorizations (NMFs) with linear programming. The key idea is a data-driven model for the factorization, in which the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X , the algorithm identifies a matrix C that satisfies X(More)
A major problem in detecting events in streams of data is that the data can be imprecise (<i>e.g.</i> RFID data). However, current state-ofthe-art event detection systems such as Cayuga [14], SASE [46] or SnoopIB[1], assume the data is <i>precise</i>. Noise in the data can be captured using techniques such as hidden Markov models. Inference on these models(More)
This paper develops Jellyfish, an algorithm for solving data-processing problems with matrix-valued decision variables regularized to have low rank. Particular examples of problems solvable by Jellyfish include matrix completion problems and least-squares problems regularized by the nuclear norm or γ2-norm. Jellyfish implements a projected incremental(More)
We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue''; if two paper references are duplicates, then their associated(More)
MystiQ is a system that uses probabilistic query semantics [3] to find answers in large numbers of data sources of less than perfect quality. There are many reasons why the data originating from many different sources may be of poor quality, and therefore difficult to query: the same data item may have different representation in different sources; the(More)
The Knowledge Base Acceleration track in TREC 2012 focused on a single task: filter a time-ordered corpus for documents that are highly relevant to a predefined list of entities. KBA differs from previous filtering evaluations in two primary ways: the stream corpus is >100x larger than previous filtering collections, and the use of entities as topics(More)