A Study on Garbage Collection Algorithms for Big Data Environments

@article{bruno-ferreira-csur,
  title={A Study on Garbage Collection Algorithms for Big Data Environments},
  author={Rodrigo Bruno and Paulo Ferreira},
  journal={ACM Computing Surveys (CSUR)},
  pages={1--35}
}
The need to process and store massive amounts of data (Big Data) is a reality. In areas such as scientific experiments, social network management, credit card fraud detection, targeted advertising, and financial analysis, massive amounts of information are generated and processed daily to extract valuable, summarized information. Due to its fast development cycle (i.e., less expensive to develop), mainly because of automatic memory management, and rich community resources, managed object…


A Performance Comparison of Modern Garbage Collectors for Big Data Environments

This project aims to understand how different garbage collectors scale in terms of throughput, latency, and memory usage in memory-hungry environments, so that, given a platform with particular performance needs, the most suitable garbage collection algorithm can be identified.

Analysis of Garbage Collection Algorithms and Memory Management in Java

  • P. Pufek, H. Grgic, B. Mihaljević
  • Computer Science
    2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)
  • 2019
This paper explores several garbage collectors available in JDK 11, using selected benchmarking applications from the DaCapo suite to compare the number of each algorithm's collection iterations and the duration of collection time.

Runtime Object Lifetime Profiler for Latency Sensitive Big Data Applications

ROLP is a Runtime Object Lifetime Profiler that profiles application code at runtime and helps pretenuring GC algorithms allocate objects with similar lifetimes close to each other, so that overall fragmentation, GC effort, and application pauses are reduced.
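The core idea behind pretenuring can be illustrated with a small sketch (this is not ROLP's actual implementation, which works inside the JVM; all class and method names here are hypothetical): track, per allocation site, how many objects survive a young-generation collection, and once a site's observed survival rate is high enough, allocate its objects directly into the old generation.

```python
from collections import defaultdict

class PretenuringHeap:
    """Toy two-generation heap that pretenures based on per-site profiling."""

    def __init__(self, promotion_threshold=0.5, min_samples=10):
        self.young, self.old = [], []
        self.site_allocated = defaultdict(int)  # objects allocated per site
        self.site_survived = defaultdict(int)   # objects that outlived a young GC
        self.threshold = promotion_threshold
        self.min_samples = min_samples

    def allocate(self, site, obj):
        self.site_allocated[site] += 1
        seen = self.site_allocated[site]
        survival = self.site_survived[site] / seen
        # Sites whose objects tend to survive allocate straight into old space,
        # sparing them repeated copying out of the young generation.
        if seen >= self.min_samples and survival >= self.threshold:
            self.old.append((site, obj))
        else:
            self.young.append((site, obj))

    def young_gc(self, is_live):
        survivors = [(s, o) for (s, o) in self.young if is_live(o)]
        for site, _ in survivors:
            self.site_survived[site] += 1   # feed the profile
        self.old.extend(survivors)          # promote survivors
        self.young = []

# Simulated workload: "cache" objects are long-lived, "temp" objects die young.
heap = PretenuringHeap()
for i in range(1, 21):
    heap.allocate("cache", i)    # positive values: stay live
    heap.allocate("temp", -i)    # negative values: die before the GC
heap.young_gc(lambda o: o > 0)
heap.allocate("cache", 99)       # profiled as long-lived: goes directly to old
```

After the first collection, the "cache" site's survival rate is high, so subsequent allocations from it skip the young generation entirely, which is the effect ROLP aims to enable for real GC algorithms.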

Selecting a JVM Garbage Collector for Big Data and Cloud Services

This work evaluates throughput, pause time, and memory usage of existing JVM GCs using benchmark suites such as DaCapo and Renaissance, to find the trade-offs between these performance metrics.

Selecting a GC for Java Applications

A thorough evaluation of several of the most widely known and available GCs for Java in OpenJDK HotSpot using different applications, and a method to easily pick the best one, are presented.

A Study on the Causes of Garbage Collection in Java for Big Data Workloads

This study brings forth the causes that invoke the three Java garbage collectors for Big Data and non-Big-Data workloads, and concludes that allocation failure accounts for about 31–99% of total garbage collector execution time and is the most common cause of garbage collection in Big Data jobs.
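GC causes like the allocation failures discussed above are visible directly in HotSpot's GC logs, where each pause line names its cause in parentheses. A small sketch of extracting and counting them (the log lines below are hypothetical samples modeled on HotSpot's unified logging output, not real measurements):

```python
import re
from collections import Counter

# Hypothetical sample lines in the style of HotSpot unified logging (-Xlog:gc).
sample_log = """\
[0.512s][info][gc] GC(0) Pause Young (Allocation Failure) 24M->4M(256M) 3.1ms
[1.204s][info][gc] GC(1) Pause Young (Allocation Failure) 28M->6M(256M) 2.8ms
[2.051s][info][gc] GC(2) Pause Full (Ergonomics) 60M->12M(256M) 41.7ms
[3.377s][info][gc] GC(3) Pause Young (Allocation Failure) 30M->7M(256M) 3.4ms
"""

# The GC cause appears in parentheses right after the pause kind.
cause_re = re.compile(r"Pause \w+ \(([^)]+)\)")

def gc_causes(log_text):
    """Count how often each GC cause appears in the log text."""
    return Counter(m.group(1) for m in cause_re.finditer(log_text))

print(gc_causes(sample_log))
```

A study like the one above would aggregate such counts (and the associated pause durations) across whole Big Data job runs rather than four sample lines.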

Distributed garbage collection for general graphs

A scalable, cycle-collecting, decentralized, reference counting garbage collector with partial tracing, which is based on the Brownbridge system but uses four different types of references to label edges and is stable against concurrent mutation.
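Plain reference counting cannot reclaim cycles, which is the problem such collectors address. A minimal single-process sketch of cycle collection by trial deletion (much simpler than the distributed, Brownbridge-based scheme described above; names and structure here are illustrative only): subtract the references that candidates hold among themselves, and whatever remains externally referenced is live.

```python
class Node:
    """A heap object with an explicit reference count."""
    def __init__(self, name):
        self.name = name
        self.refs = []   # outgoing references
        self.rc = 0      # reference count

def add_ref(src, dst):
    src.refs.append(dst)
    dst.rc += 1

def collect_cycles(candidates):
    """Trial deletion: remove internal reference counts among the candidates;
    nodes whose count stays positive are externally reachable, and everything
    they reach transitively is live too. The rest are cyclic garbage."""
    trial = {n: n.rc for n in candidates}
    for n in candidates:
        for m in n.refs:
            if m in trial:
                trial[m] -= 1
    live = set()
    def mark(n):   # re-mark everything reachable from external references
        if n in trial and n not in live:
            live.add(n)
            for m in n.refs:
                mark(m)
    for n in candidates:
        if trial[n] > 0:
            mark(n)
    return [n for n in candidates if n not in live]

# A root keeps `a` (and through it `b`) alive; x <-> y is an unreachable cycle.
a, b, x, y = Node("a"), Node("b"), Node("x"), Node("y")
a.rc += 1            # simulated external root (e.g. a stack reference)
add_ref(a, b)
add_ref(x, y)
add_ref(y, x)
garbage = collect_cycles([a, b, x, y])
print([n.name for n in garbage])   # → ['x', 'y']
```

The distributed setting in the paper is far harder: references cross machines, mutation is concurrent, and no single node sees the whole graph, which is what motivates its four edge-label types.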

Adaptive and Concurrent Garbage Collection for Virtual Machines

An adaptive and concurrent garbage collection technique that can predict the optimal GC algorithm for a program without going through all the GC algorithms and is helpful in finding better heap size settings for improved program execution.

TeraCache: Efficient Spark Caching Over Fast Storage Devices

TeraCache is proposed, an extension of the Spark data cache that avoids the need for serialization/deserialization (serdes) by keeping all cached data on-heap but off-memory, using memory-mapped I/O (mmio).



Broom: Sweeping Out Garbage Collection from Big Data Systems

The initial results show that region-based memory management reduces emulated Naiad vertex runtime by 34% for typical data analytics jobs, and that it can be made memory-safe and inferred automatically.
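The appeal of region-based memory management in a dataflow system can be shown with a small sketch (this is not Broom's API; the `Region` class and `run_vertex` function below are hypothetical): a dataflow vertex allocates its temporaries in one region, and when the vertex finishes, the whole region is released in a single step, with no per-object GC tracing.

```python
class Region:
    """Toy region allocator: every object allocated in the region is
    freed together when the region is closed."""

    def __init__(self, name):
        self.name = name
        self.objects = []
        self.closed = False

    def alloc(self, obj):
        assert not self.closed, "cannot allocate in a closed region"
        self.objects.append(obj)
        return obj

    def close(self):
        # One bulk release instead of tracing each object individually.
        freed = len(self.objects)
        self.objects.clear()
        self.closed = True
        return freed

def run_vertex(records):
    """Hypothetical dataflow vertex: all its scratch objects share one region."""
    r = Region("vertex-scratch")
    total = sum(r.alloc((rec, rec * 2))[1] for rec in records)
    r.close()   # all per-vertex temporaries vanish at once
    return total

print(run_vertex([1, 2, 3]))   # → 12
```

Because a vertex's temporaries all die when the vertex completes, their lifetimes coincide with the region's, which is what makes the region inference mentioned above plausible.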

FACADE: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications

A novel compiler framework, called Facade, that can generate highly efficient data-manipulation code by automatically transforming the data path of an existing Big Data application, leading to significantly reduced memory management cost and improved scalability.

NG2C: pretenuring garbage collection with dynamic generations for HotSpot big data applications

NG2C, a new GC algorithm that combines pretenuring with user-defined dynamic generations, is proposed, which decreases the worst observable GC pause time and avoids object promotion and heap fragmentation, both responsible for most of the duration of HotSpot GC pause times.

Lifetime-Based Memory Management for Distributed Data Processing Systems

Deca is presented, a concrete implementation of the lifetime-based memory management framework, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end.
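The decomposition idea can be sketched in miniature (this is not Deca's implementation, which operates on JVM objects; the `LifetimeGroup` class and its record layout are assumptions for illustration): records that die together are serialized into one flat byte array instead of living as many small heap objects, so the GC sees a single object and the group's space is released all at once.

```python
import struct

class LifetimeGroup:
    """Packs fixed-size records into one bytearray; released in one step."""

    RECORD = struct.Struct("<qd")   # hypothetical record: (int64 key, float64 value)

    def __init__(self):
        self.buf = bytearray()

    def append(self, key, value):
        self.buf += self.RECORD.pack(key, value)

    def get(self, i):
        # Decode record i in place; no per-record object exists until accessed.
        return self.RECORD.unpack_from(self.buf, i * self.RECORD.size)

    def release(self):
        # The whole group's space goes away together at end of lifetime.
        self.buf = bytearray()

g = LifetimeGroup()
g.append(1, 0.5)
g.append(2, 1.5)
print(g.get(1))   # → (2, 1.5)
g.release()
```

With millions of records, the collector traces one bytearray instead of millions of objects, which is the source of the GC savings the paper targets.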

NumaGiC: a Garbage Collector for Big Data on Big NUMA Machines

NumaGiC, a GC with a mostly distributed design, improves overall performance and increases the performance of the collector itself by up to 3.6x over NAPS and up to 5.4x over Parallel Scavenge.

The Garbage Collection Handbook: The art of automatic memory management

The Garbage Collection Handbook: The Art of Automatic Memory Management brings together a wealth of knowledge gathered by automatic memory management researchers and developers over the past fifty years and addresses new challenges to garbage collection made by recent advances in hardware and software.

Profiling, what-if analysis, and cost-based optimization of MapReduce programs

This work introduces, to its knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs, which focuses on the optimization opportunities presented by the large space of configuration parameters for these programs.

Data structure aware garbage collector

This work proposes a very simple interface that requires minor programmer effort and achieves substantial performance and scalability improvements for the common use of data structures or collections to organize data on the heap.

Hive - A Warehousing Solution Over a Map-Reduce Framework

Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.

Efficient Big Data Processing in Hadoop MapReduce

This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently and highlights the similarities and differences between Hadoop MapReduce and Parallel DBMS.