A Study on Garbage Collection Algorithms for Big Data Environments

@article{Bruno2018ASO,
  title={A Study on Garbage Collection Algorithms for Big Data Environments},
  author={Rodrigo Bruno and Paulo Ferreira},
  journal={ACM Computing Surveys (CSUR)},
  year={2018},
  volume={51},
  pages={1--35}
}
The need to process and store massive amounts of data—Big Data—is a reality. In areas such as scientific experiments, social networks management, credit card fraud detection, targeted advertisement, and financial analysis, massive amounts of information are generated and processed daily to extract valuable, summarized information. Due to its fast development cycle (i.e., less expensive to develop), mainly because of automatic memory management, and rich community resources, managed object… 
Citations

An Experimental Evaluation of Garbage Collectors on Big Data Applications
TLDR
This paper conducts the first comprehensive evaluation on three popular garbage collectors, i.e., Parallel, CMS, and G1, using four representative Spark applications and obtains many findings about GC inefficiencies.
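The three collectors compared above are chosen per JVM via HotSpot command-line flags; a minimal sketch (JDK 8-era flags, since CMS was removed in JDK 14; heap sizes and jar names are placeholders):

```shell
# Choose exactly one collector per JVM; sizes below are illustrative only.
java -XX:+UseParallelGC      -Xmx8g -jar app.jar   # throughput-oriented Parallel GC
java -XX:+UseConcMarkSweepGC -Xmx8g -jar app.jar   # concurrent CMS (removed in JDK 14)
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xmx8g -jar app.jar  # region-based G1

# For Spark applications, the same flags reach executors via configuration:
spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" app.jar
```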
Analysis of Garbage Collection Algorithms and Memory Management in Java
  • P. Pufek, H. Grgic, B. Mihaljevic
  • 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2019
TLDR
This paper explores several garbage collectors available in JDK 11 using selected benchmark applications from the DaCapo suite, comparing the number of collection iterations and the duration of collection time.
Runtime Object Lifetime Profiler for Latency Sensitive Big Data Applications
TLDR
ROLP is a Runtime Object Lifetime Profiler that profiles application code at runtime and helps pretenuring GC algorithms allocate objects with similar lifetimes close to each other, so that overall fragmentation, GC effort, and application pauses are reduced.
Selecting a JVM Garbage Collector for Big Data and Cloud Services
TLDR
This work intends to evaluate throughput, pause time, and memory usage in existing JVM GCs using benchmark suites like DaCapo and Renaissance to find the trade-offs between the above mentioned performance metrics.
Selecting a GC for Java Applications
Nowadays, there are several Garbage Collector (GC) solutions that can be used in an application. Such GCs behave differently regarding several performance metrics, in particular throughput and pause times.
A Study on the Causes of Garbage Collection in Java for Big Data Workloads
TLDR
This study brings forth the causes that invoke the three Java garbage collectors for Big Data and non-Big-Data workloads, and concludes that allocation failure accounts for about 31–99 percent of total garbage-collector execution time and is the most common cause of garbage collection in Big Data jobs.
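The "allocation failure" cause above is straightforward to reproduce: when the young generation fills up, the next allocation request forces a minor collection. A minimal sketch using the standard `GarbageCollectorMXBean` API to observe this (array size and iteration count are arbitrary; run with `-Xlog:gc*` to see the "Allocation Failure" cause in the log):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCauseDemo {
    // Total number of collections across all registered collectors
    // (e.g., "G1 Young Generation" and "G1 Old Generation").
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        // Churn through ~2 GiB of short-lived arrays; once the young
        // generation fills, the next allocation fails and triggers a
        // minor collection.
        for (int i = 0; i < 32 * 1024; i++) {
            byte[] garbage = new byte[64 * 1024]; // 64 KiB, immediately unreachable
            garbage[0] = 1; // touch the array so it is actually allocated
        }
        long after = totalGcCount();
        System.out.println("collections observed during churn: " + (after - before));
    }
}
```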
Distributed garbage collection for general graphs
TLDR
A scalable, cycle-collecting, decentralized, reference counting garbage collector with partial tracing, which is based on the Brownbridge system but uses four different types of references to label edges and is stable against concurrent mutation.
GC-Wise: A Self-adaptive approach for memory-performance efficiency in Java VMs
TLDR
GC-Wise, a system to determine, at run-time, the best values for critical heap management parameters of the OpenJDK JVM, aiming to maximize memory-performance efficiency, is presented.
Say Goodbye to Off-heap Caches! On-heap Caches Using Memory-Mapped I/O
TLDR
TeraCache is proposed, an extension of the Spark data cache that avoids the need for serialization/deserialization (serdes) by keeping all cached data on-heap but off-memory, using memory-mapped I/O (mmio).
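The mmio mechanism such work builds on is available in standard Java through `FileChannel.map`; a minimal sketch (file name and mapping size are arbitrary) showing data accessed through a `MappedByteBuffer` whose backing pages are managed by the OS rather than the garbage collector:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmioSketch {
    // Write and read an int through a memory-mapped region: the mapped
    // pages live outside the GC-managed Java heap, so the collector
    // never scans the cached bytes.
    static int roundTrip() throws IOException {
        Path file = Files.createTempFile("mmio", ".bin");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE grows the file to the requested size.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putInt(0, 42);
            return buf.getInt(0);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("mapped round-trip value: " + roundTrip());
    }
}
```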

References

SHOWING 1-10 OF 81 REFERENCES
Broom: Sweeping Out Garbage Collection from Big Data Systems
TLDR
The initial results show that region-based memory management reduces emulated Naiad vertex runtime by 34% for typical data analytics jobs, and that region-based allocation could be memory-safe and inferred automatically.
FACADE: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications
TLDR
A novel compiler framework, called Facade, that can generate highly efficient data-manipulation code by automatically transforming the data path of an existing Big Data application, leading to significantly reduced memory-management cost and improved scalability.
NG2C: pretenuring garbage collection with dynamic generations for HotSpot big data applications
TLDR
NG2C, a new GC algorithm that combines pretenuring with user-defined dynamic generations, is proposed, which decreases the worst observable GC pause time and avoids object promotion and heap fragmentation both responsible for most of the duration of HotSpot GC pause times.
Pig latin: a not-so-foreign language for data processing
TLDR
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Lifetime-Based Memory Management for Distributed Data Processing Systems
TLDR
Deca is presented, a concrete implementation of the lifetime-based memory management framework, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end.
NumaGiC: a Garbage Collector for Big Data on Big NUMA Machines
TLDR
NumaGiC, a GC with a mostly-distributed design that improves overall performance and increases the performance of the collector itself by up to 3.6x over NAPS and up to 5.4x over Parallel Scavenge.
The Garbage Collection Handbook: The art of automatic memory management
TLDR
The Garbage Collection Handbook: The Art of Automatic Memory Management brings together a wealth of knowledge gathered by automatic memory management researchers and developers over the past fifty years and addresses new challenges to garbage collection made by recent advances in hardware and software.
Profiling, what-if analysis, and cost-based optimization of MapReduce programs
TLDR
This work introduces, to its knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs, which focuses on the optimization opportunities presented by the large space of configuration parameters for these programs.
Data structure aware garbage collector
TLDR
This work proposes a very simple interface that requires minor programmer effort and achieves substantial performance and scalability improvements on the common use of data structures or collections for organizing data on the heap.
Hive - A Warehousing Solution Over a Map-Reduce Framework
TLDR
Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.