External memory algorithms and data structures: dealing with massive data

@article{Vitter2001ExternalMA,
  title={External memory algorithms and data structures: dealing with massive data},
  author={Jeffrey Scott Vitter},
  journal={ACM Comput. Surv.},
  year={2001},
  volume={33},
  pages={209--271}
}
  • J. Vitter
  • Published 1 June 2001
  • Computer Science
  • ACM Comput. Surv.
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this article we survey the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety…

Paradigms for Efficient Design of External Memory Algorithms

This work surveys the state of the art in the design and analysis of external memory algorithms, where the primary goal is to reduce the number of input/output (or I/O) operations, which tend to be a bottleneck in data-intensive applications.

The power of duality for prefetching and sorting with parallel disks

This paper considers parallel disk input and output separately, in particular as the prefetch scheduling problem and the output scheduling problem, respectively.

RAM-Efficient External Memory Sorting

A splitting-based algorithm, combined with existing RAM sorting techniques, is obtained, and a sorting lower bound is proved showing that in most cases the results are optimal in terms of both I/O and internal computation.

External memory pipelining made easy with TPIE

A major extension of the TPIE library is presented that includes a pipelining framework that allows for practically efficient streaming-based implementations of I/O-efficient algorithms while minimizing I/O overhead between streaming components.

Fast Concurrent Access to Parallel Disks

This work rehabilitates Aggarwal and Vitter's "single-disk multi-head" model, which allows access to D arbitrary blocks in each I/O step, and shows that a shared buffer of O(D) blocks suffices to support efficient writing.

External Memory Geometric Data Structures

The focus is on fundamental dynamic structures for one- and two-dimensional orthogonal range searching, and some of the fundamental techniques used to develop such structures are highlighted.

Efficient Algorithms and Data Structures for Massive Data Sets

  • Alka
  • Computer Science
  • ArXiv
  • 2010
This thesis proposes two variants of the W-Stream model and designs algorithms for the maximal independent set, vertex-colouring, and planar graph single source shortest paths problems on those models.

Classic and new data structure problems in external memory

This thesis shows an inherent query-insertion tradeoff of hashing in the I/O model, which implies that the buffering technique is essentially useless for hash tables, and builds a hash table that achieves the same search cost as its cache-aware version does, for all block sizes.

A Simple and Efficient Parallel Disk Mergesort

The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files.

Data Intensive Computation in a Compute/storage Hierarchy

It is shown how application of EM techniques can yield significant performance improvement for a GIS application, and that the derived cache model does not adequately represent the memory system at the cache/register level.

References

SHOWING 1-10 OF 338 REFERENCES

External memory algorithms

This tutorial surveys the state of the art in the design and analysis of external memory algorithms (also known as EM algorithms or out-of-core algorithms or I/O algorithms), and discusses a variety of problems and shows how to solve them efficiently in external memory.

The power of duality for prefetching and sorting with parallel disks

This paper considers parallel disk input and output separately, in particular as the prefetch scheduling problem and the output scheduling problem, respectively.

Online Data Structures in External Memory

A variety of on-line data structures for external memory are discussed, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees, priority search trees, and R-trees and other spatial structures.

Bulk Synchronous Parallel Algorithms for the External Memory Model

A simple, deterministic simulation technique is presented which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms that meet well-known I/O complexity lower bounds for various problems, including sorting.

Fast Concurrent Access to Parallel Disks

This work rehabilitates Aggarwal and Vitter's "single-disk multi-head" model, which allows access to D arbitrary blocks in each I/O step, and shows that a shared buffer of O(D) blocks suffices to support efficient writing.

Reducing I/O complexity by simulating coarse grained parallel algorithms

A deterministic simulation technique which transforms coarse grained multicomputer (CGM) algorithms into external memory algorithms for the parallel disk model is presented, which optimizes block-wise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory.

Blocking in Parallel Multisearch Problems

Techniques to achieve blocking for I/O as well as for communication in multisearch on the BSP and EM-BSP models are presented, and a lower bound on the number of I/O operations required for filtering n queries through a binary or multiway search tree of size m is given.

A Simple and Efficient Parallel Disk Mergesort

The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files.

Deterministic distribution sort in shared and distributed memory multiprocessors

An elegant deterministic load-balancing strategy for distribution sort is presented that is applicable to a wide variety of parallel disks and parallel memory hierarchies with both single and parallel processors, and it is shown how to sort deterministically in parallel memory hierarchies.

Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms

A simulation technique is provided that produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms; the resulting algorithms can accommodate one or multiple processors on the EM target machine, each with one or more disks, and also adapt to the disk blocking factor of the target machine.