Spark: Cluster Computing with Working Sets
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
A view of cloud computing
Clearing the clouds away from the true potential and obstacles posed by this computing capability.
Above the Clouds: A Berkeley View of Cloud Computing
This work focuses on SaaS Providers (Cloud Users) and Cloud Providers, which have received less attention than SaaS Users, and uses the term Private Cloud to refer to internal datacenters of a business or other organization, not made available to the general public.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
Spark SQL: Relational Data Processing in Spark
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
Dominant Resource Fairness (DRF), a generalization of max-min fairness to multiple resource types, is proposed and shown to lead to better throughput and fairness than the slot-based fair sharing schemes in current cluster schedulers.
Discretized streams: fault-tolerant streaming computation at scale
D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes and tolerates stragglers. They can also be easily composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes.
Improving MapReduce Performance in Heterogeneous Environments
A new scheduling algorithm, Longest Approximate Time to End (LATE), is proposed that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling
This work proposes a simple algorithm called delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness.