Russell Sears

While the use of MapReduce systems (such as Hadoop) for large-scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for…
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This…
Machine learning systems offer unparalleled flexibility in dealing with evolving input in a variety of applications, such as intrusion detection systems and spam e-mail filtering. However, machine learning algorithms themselves can be a target of attack by a malicious adversary. This paper provides a framework for answering the question, "Can machine learning…
Recent research has explored using Datalog-based languages to express a distributed system as a set of logical invariants [2, 19]. Two properties of distributed systems proved difficult to model in Datalog. First, the state of any such system evolves with its execution. Second, deductions in these systems may be arbitrarily delayed, dropped, or reordered by…
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this demonstration, we describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the…
Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a data-centric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our hope is that this model can significantly…
Application designers must decide whether to store large objects (BLOBs) in a filesystem or in a database. Generally, this decision is based on factors such as application simplicity or manageability. Often, system performance affects these factors. Folklore tells us that databases efficiently handle large numbers of small objects, while filesystems are…
Resource managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g.,…