Leveraging burst buffer coordination to prevent I/O interference

@article{Kougkas2016LeveragingBB,
  title={Leveraging burst buffer coordination to prevent I/O interference},
  author={Anthony Kougkas and Matthieu Dorier and Robert Latham and Robert B. Ross and Xian-he Sun},
  journal={2016 IEEE 12th International Conference on e-Science (e-Science)},
  year={2016},
  pages={371-380}
}
Concurrent accesses to the shared storage resources in current HPC machines lead to severe performance degradation caused by I/O contention. In this study, we identify some key challenges to efficiently handling interleaved data accesses, and we propose a system-wide solution to optimize global performance. We implemented and tested several I/O scheduling policies, including prioritizing specific applications by leveraging burst buffers to defer the conflicting accesses from another application… 
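The abstract describes one policy only at a high level: give one application priority on the shared storage and use burst buffers to defer another application's conflicting accesses. The Python sketch below is a toy model of that idea under stated assumptions, not the authors' implementation. The names `Request`, `DeferringScheduler`, the `priority_active` flag, and the fallback to interleaved writes when the buffer is full are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    app: str      # issuing application (hypothetical identifier)
    size_mb: int  # amount of data to write


@dataclass
class DeferringScheduler:
    """Toy model of a 'prioritize one app, defer the other' policy.

    Hypothetical sketch: writes from `priority_app` always go to the
    parallel file system (PFS); writes from other apps that arrive while
    the priority app is active are parked in a burst buffer (BB) and
    drained to the PFS once the priority app stops writing.
    """
    priority_app: str
    bb_capacity_mb: int
    burst_buffer: deque = field(default_factory=deque)
    bb_used_mb: int = 0
    pfs_log: list = field(default_factory=list)  # order in which data reaches the PFS

    def submit(self, req: Request, priority_active: bool) -> str:
        if req.app == self.priority_app:
            self.pfs_log.append(req)
            return "pfs"
        if not priority_active:
            # No conflict: flush deferred data before this new write.
            self.drain()
            self.pfs_log.append(req)
            return "pfs"
        if self.bb_used_mb + req.size_mb <= self.bb_capacity_mb:
            # Conflict: defer this write into the burst buffer.
            self.burst_buffer.append(req)
            self.bb_used_mb += req.size_mb
            return "bb"
        # Assumed fallback: the BB is full, so the write interleaves on the PFS.
        self.pfs_log.append(req)
        return "pfs-interleaved"

    def drain(self) -> int:
        """Move every deferred request from the BB to the PFS; return the count."""
        drained = 0
        while self.burst_buffer:
            req = self.burst_buffer.popleft()
            self.bb_used_mb -= req.size_mb
            self.pfs_log.append(req)
            drained += 1
        return drained


if __name__ == "__main__":
    sched = DeferringScheduler(priority_app="A", bb_capacity_mb=100)
    print(sched.submit(Request("A", 50), priority_active=True))   # pfs
    print(sched.submit(Request("B", 60), priority_active=True))   # bb
    print(sched.submit(Request("B", 60), priority_active=True))   # pfs-interleaved
    print(sched.submit(Request("B", 10), priority_active=False))  # pfs (after drain)
    print([r.size_mb for r in sched.pfs_log])                     # [50, 60, 60, 10]
```

Falling back to interleaved writes when the buffer fills is only one possible choice; a policy could instead stall the lower-priority application until space frees up.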
Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-volatile Burst Buffers
TLDR
Harmonia is introduced, a new dynamic I/O scheduler that is aware of interference, adapts to the underlying system, implements a new 2-way decision-making process and employs several scheduling policies to maximize the system efficiency and applications' performance.
Explorations of Data Swapping on Burst Buffer
  • T. Xu, Kento Sato, S. Matsuoka
  • Computer Science
    2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)
  • 2018
TLDR
It is found that most HPC applications can still achieve full performance with a buffer size far smaller than the application's total access space, which can greatly reduce the burst buffer capacity required.
BBOS: Efficient HPC Storage Management via Burst Buffer Over-Subscription
TLDR
This work adopts a BB over-subscription allocation method that allows HPC applications to use the BB only during their I/O phases, improving BB utilization, and finds that BB utilization improves by at least 2.2x and that higher, more stable checkpoint performance is guaranteed compared with other approaches.
Mapping and scheduling HPC applications for optimizing I/O
TLDR
This work proposes coupling a novel bandwidth-aware mapping algorithm with I/O list-scheduling policies to develop a cross-layer optimization solution, and shows important gains for the simple bandwidth-aware mapping solution it provides compared to its non-bandwidth-aware counterpart.
Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior
TLDR
A proactive draining scheme to manage the draining process of distributed node-local burst buffers is proposed, and an I/O provisioning model is developed to predict the minimum I/O provisioning requirement for permanent storage systems.
I/O Scheduling Strategy for Periodic Applications
TLDR
This work shows how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers, and shows that this scheduler not only has the advantage of being decentralized, thus avoiding the overhead of online schedulers, but also performs better than the other solutions.
Checkpointing Strategies for Shared High-Performance Computing Platforms
TLDR
This work considers different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing checkpoint/restart (CR)-based applications that share I/O resources, and shows that combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies can significantly improve overall application performance and maximize platform throughput.
Software-defined QoS for I/O in exascale computing
TLDR
Evaluation shows that SDQoS can effectively control the I/O bandwidth within a 5%–10% deviation and improve the performance by 20% in extreme cases.
Automatic, Application-Aware I/O Forwarding Resource Allocation
TLDR
This work implemented, evaluated, and deployed DFRA, an automatic mechanism for application-adaptive dynamic forwarding resource allocation, which improves applications' I/O performance by up to 18.9×, eliminates most inter-application I/O interference, and saved over 200 million core-hours during its 11-month test deployment on TaihuLight.
Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms
  • T. Hérault, Y. Robert, J. Dongarra
  • Computer Science
    2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2018
TLDR
It is shown that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement on the overall application performance, thereby maximizing platform throughput and minimizing the global waste on the platform.
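Several entries above refer to "optimal checkpointing periods". As background only, a common baseline for such a period is the first-order Young/Daly approximation; whether the cited works use exactly this form or a contention-adjusted variant is an assumption here. With checkpoint cost $C$ and platform mean time between failures $\mu$, the fraction of time wasted with checkpoint period $T$ (checkpoint overhead plus, on average, half a period of lost work per failure) and its minimizer are approximately:

```latex
\mathrm{Waste}(T) \approx \frac{C}{T} + \frac{T}{2\mu},
\qquad
\frac{d\,\mathrm{Waste}}{dT} = -\frac{C}{T^{2}} + \frac{1}{2\mu} = 0
\;\Longrightarrow\;
T_{\mathrm{opt}} \approx \sqrt{2\mu C}.
```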

References

SHOWING 1-10 OF 42 REFERENCES
TRIO: Burst Buffer Based I/O Orchestration
TLDR
This paper proposes a burst-buffer-based I/O orchestration framework, named TRIO, that intercepts and reshapes bursty writes into more sequential write traffic to storage servers, and demonstrates that TRIO can efficiently utilize storage bandwidth and reduce the average job I/O time by 37% for data-intensive applications in typical checkpointing scenarios.
Scheduling the I/O of HPC Applications Under Congestion
TLDR
This paper shows that the global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase the overall system throughput up to 56%.
AGIOS: Application-Guided I/O Scheduling for Parallel File Systems
TLDR
This paper improves the performance of server-side I/O scheduling on parallel file systems by transparently including information about the applications' access patterns, obtained from traces generated by the scheduler itself, without changes in application or file system.
CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination
TLDR
Experiments show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine-wide efficiency.
IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination
  • Xuechen Zhang, K. Davis, Song Jiang
  • Computer Science
    2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2010
TLDR
This paper proposes IOrchestrator, a scheme to improve the I/O performance of multi-node storage systems by orchestrating I/O services among programs when such inter-data-server coordination is dynamically determined to be cost-effective.
I/O Scheduling Service for Multi-Application Clusters
TLDR
This article presents an extension of aIOLi to address disjoint accesses generated by different concurrent applications in a cluster, and proposes a new generic framework that can be plugged into any I/O file system layer.
On the role of burst buffers in leadership-class storage systems
TLDR
It is shown that burst buffers can accelerate the application-perceived throughput to the external storage system and can reduce the amount of external storage bandwidth required to meet a desired application-perceived throughput goal.
A Novel network request scheduler for a large scale storage system
TLDR
A quantum-based Object Based Round Robin (ORR) network request scheduler (NRS) algorithm is proposed that reorders the execution of I/O requests per data object, presenting the backend storage with a workload that is easier to optimize.
CA-NFS: A congestion-aware network file system
TLDR
This work develops a holistic framework for adaptively scheduling asynchronous requests in distributed file systems and implements modifications in the Congestion-Aware Network File System (CA-NFS), an extension to the ubiquitous network file system (NFS).
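The network request scheduler entry above summarizes quantum-based Object Based Round Robin in a single sentence. A minimal Python sketch of the idea, assuming requests are `(object_id, offset)` pairs and that `quantum` caps how many requests one object may have served per turn. The function name `orr_schedule` is hypothetical, and sorting each batch by offset is an added assumption meant to make per-object access more sequential, not a detail taken from the summary.

```python
from collections import OrderedDict, deque


def orr_schedule(requests, quantum):
    """Reorder (object_id, offset) requests in an object-based round-robin way.

    Hypothetical sketch: requests are grouped into one FIFO queue per object;
    the scheduler visits objects in first-seen order, serving up to `quantum`
    requests from each before moving to the next, until all queues are empty.
    """
    if quantum < 1:
        raise ValueError("quantum must be >= 1")
    queues = OrderedDict()
    for obj, offset in requests:
        queues.setdefault(obj, deque()).append(offset)

    order = []
    while queues:
        for obj in list(queues):  # copy keys so exhausted queues can be deleted
            q = queues[obj]
            batch = [q.popleft() for _ in range(min(quantum, len(q)))]
            # Assumption: sort each batch by offset so the backend sees
            # a more sequential stream for that object.
            order.extend((obj, off) for off in sorted(batch))
            if not q:
                del queues[obj]
    return order


if __name__ == "__main__":
    reqs = [("A", 3), ("B", 1), ("A", 1), ("B", 0), ("A", 2)]
    print(orr_schedule(reqs, quantum=2))
    # [('A', 1), ('A', 3), ('B', 0), ('B', 1), ('A', 2)]
```

A larger quantum batches more requests per object (better locality for the backend) at the cost of longer waits for the other objects in the rotation.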