Publications
Spark: Cluster Computing with Working Sets
TLDR
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
TLDR
This work presents Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
TAG: a Tiny AGgregation service for ad-hoc sensor networks
TLDR
This work presents the Tiny AGgregation (TAG) service for aggregation in low-power, distributed, wireless environments, and discusses a variety of optimizations for improving the performance and fault tolerance of the basic solution.
TinyDB: an acquisitional query processing system for sensor networks
TLDR
This work evaluates issues in the context of TinyDB, a distributed query processor for smart sensor devices, and shows how acquisitional techniques can provide significant reductions in power consumption on the authors' sensor devices.
Spark SQL: Relational Data Processing in Spark
TLDR
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.
GraphX: Graph Processing in a Distributed Dataflow Framework
TLDR
This paper introduces GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. It demonstrates that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
MLlib: Machine Learning in Apache Spark
TLDR
This work presents MLlib, Spark's open-source distributed machine learning library, which provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
Apache Spark
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
The design of an acquisitional query processor for sensor networks
TLDR
This work evaluates issues in the context of TinyDB, a distributed query processor for smart sensor devices, and shows how acquisitional techniques can provide significant reductions in power consumption on the authors' sensor devices.
TelegraphCQ: Continuous Dataflow Processing for an Uncertain World
TLDR
The next-generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly variable data streams, and leverages the PostgreSQL open source code base.