# Communication efficient algorithms for fundamental big data problems

@article{Sanders2013CommunicationEA, title={Communication efficient algorithms for fundamental big data problems}, author={Peter Sanders and Sebastian Schlag and Ingo M{\"u}ller}, journal={2013 IEEE International Conference on Big Data}, year={2013}, pages={15-23} }

Big Data applications often store or obtain their data distributed over many computers connected by a network. Since the network is usually slower than the local memory of the machines, it is crucial to process the data in such a way that not too much communication takes place. Indeed, only communication volume sublinear in the input size may be affordable. We believe that this direction of research deserves more intensive study. We give examples for several fundamental algorithmic problems…

## 32 Citations

Communication Efficient Algorithms for Distributed OLAP Query Execution

- Computer Science
- 2014

A technique to find a better partitioning of the tables in a database to allow the execution of joins without communication effort, and an algorithm that selects the first k tuples of the result set of a query with a communication effort independent from the size of the database.

Practical Massively Parallel Sorting

- Computer Science, SPAA
- 2015

The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort, which turn out to be very scalable both in theory and in practice, where they scale up to $2^{15}$ MPI processes with outstanding performance, in particular for medium-sized inputs.

Communication Efficient Algorithms for Top-k Selection Problems

- Computer Science, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2016

We present scalable parallel algorithms with sublinear per-processor communication volume and low latency for several fundamental problems related to finding the most relevant elements in a set, for…

Communication Efficient Checking of Big Data Operations

- Computer Science, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2018

These checkers cover many of the commonly used operations, including sum, average, median, and minimum aggregation, as well as sorting, union, merge, and zip, to check the correctness of operations in Big Data processing frameworks and distributed databases.

Parallel Weighted Random Sampling

- Computer Science, ESA
- 2019

This work gives efficient, fast, and practicable algorithms for sampling single items, $k$ items with/without replacement, permutations, subsets, and reservoirs, and improved sequential algorithms for alias table construction and for sampling with replacement.

Bloom Filters for ReduceBy, GroupBy and Join in Thrill

- Computer Science
- 2017

An augmented version of the detection algorithm that detects, for each key, the worker with the highest total number of occurrences; that worker is then chosen as the shuffle target for the key in the Reduce operation.

Communication-Efficient String Sorting

- Computer Science, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2020

These algorithms inspect only the characters needed to determine the sorting order, and they reduce communication volume by communicating only those characters and by sending repeated prefixes only once.

Efficient Parallel Random Sampling—Vectorized, Cache-Efficient, and Online

- Computer Science, ACM Trans. Math. Softw.
- 2018

A simple divide-and-conquer scheme is proposed that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time $O(n/p + \log p)$ on $p$ processors, i.e., it scales to massively parallel machines even for moderate values of $n$.

Robust Massively Parallel Sorting

- Computer Science, ALENEX
- 2017

This work investigates distributed memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements and designs a new variant of quicksort with fast high-quality pivot selection.

Connecting MapReduce Computations to Realistic Machine Models

- Computer Science, 2020 IEEE International Conference on Big Data (Big Data)
- 2020

This paper explains how the popular, highly abstract MapReduce model of parallel computation (MRC/MPC) can be rooted in reality by showing how to execute MapReduce computations robustly and…

## 28 References

Improving distributed join efficiency with extended bloom filter operations

- Computer Science, 21st International Conference on Advanced Information Networking and Applications (AINA '07)
- 2007

This paper presents extensions of Bloom filter operations that apply to a wide range of uses in which Bloom filters serve as compressed set representations, and points out how they improve the performance of distributed joins.

Fundamental parallel algorithms for private-cache chip multiprocessors

- Computer Science, SPAA '08
- 2008

This paper presents two sorting algorithms, a distribution sort and a mergesort, and studies sorting lower bounds in a computational model, which is called the parallel external-memory (PEM) model, that formalizes the essential properties of the algorithms for private-cache CMPs.

One is enough: distributed filtering for duplicate elimination

- Computer Science, CIKM '11
- 2011

A suite of distributed Bloom filters that exploit different ways of partitioning the event space to address the continuous nature of event delivery and are extended to support sliding window semantics.

Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1

- Computer Science, ArXiv
- 2013

This work generalizes the lower bound approach used initially for $\Theta(N^3)$ matrix multiplication to a much larger class of algorithms that may have arbitrary numbers of loops and arrays with arbitrary dimensions, as long as the index expressions are affine combinations of loop variables.

Distributed Duplicate Removal

- 2013

The distributed duplicate removal problem is concerned with the detection and subsequent elimination of all duplicate elements in a given multiset that is distributed over several computers connected…

Data streams: algorithms and applications

- Computer Science, SODA '03
- 2003

Data Streams: Algorithms and Applications surveys the emerging area of algorithms for processing data streams and associated applications, which rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity.

Theory and Practice of Bloom Filters for Distributed Systems

- Computer Science, IEEE Communications Surveys & Tutorials
- 2012

An overview of the basic and advanced probabilistic techniques is given, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.

Efficient Parallel Graph Algorithms for Coarse-Grained Multicomputers and BSP

- Computer Science, Algorithmica
- 2001

The algorithms for Problems (1)-(7) are the first practically relevant parallel algorithms for these standard graph problems, and the number of communication rounds/supersteps obtained in this paper is independent of the problem size and grows only logarithmically with respect to p.

Efficient Parallel Graph Algorithms For Coarse Grained Multicomputers and BSP

- Computer Science, ICALP
- 1997

The algorithms presented are the first practically relevant deterministic parallel algorithms for these problems that can be used on commercially available coarse-grained parallel machines, and the authors view them as an important step towards the final goal of O(1) communication rounds.

Fast, Small, Simple Rank/Select on Bitmaps

- Computer Science, SEA
- 2012

This paper presents two structures, one using the bitmap in plain form and another using a compressed form, that are simple to implement and combine much lower space overheads than previous work with excellent time performance for rank and select queries.