PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database

@article{Cao2018PolarFSAU,
  title={PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database},
  author={Wei Cao and Zhenjun Liu and Peng Wang and Sen Chen and Caifeng Zhu and Song Zheng and Yuhui Wang and Guoqing Ma},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={1849--1862}
}
PolarFS is a distributed file system with ultra-low latency and high availability, designed for the POLARDB database service, which is now available on the Alibaba Cloud. [...] To keep replica consistency while maximizing I/O throughput for PolarFS, we develop ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance capability of databases. ParallelRaft inherits the understandability and easy implementation…
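The idea of relaxing strict log ordering can be illustrated with a small sketch. It is not the ParallelRaft protocol itself: the abstract only says that out-of-order I/O completion is tolerated, so the rule used here (an entry may be applied ahead of earlier pending entries only when it writes to none of their blocks) is an assumption made for illustration, and `Entry`, `overlaps`, `can_apply` and `apply_out_of_order` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A log entry describing a block write (hypothetical layout)."""
    index: int   # position in the replicated log
    start: int   # first block written
    length: int  # number of blocks written


def overlaps(a: Entry, b: Entry) -> bool:
    """True if the two entries write to at least one common block."""
    return a.start < b.start + b.length and b.start < a.start + a.length


def can_apply(entry: Entry, earlier_pending: list[Entry]) -> bool:
    """Assumed rule: safe to apply only if no earlier pending entry overlaps."""
    return not any(overlaps(entry, e) for e in earlier_pending)


def apply_out_of_order(log: list[Entry], arrival_order: list[int]) -> list[int]:
    """Return the order in which entries get applied, given their arrival order."""
    by_index = {e.index: e for e in log}
    arrived: set[int] = set()
    applied: list[int] = []
    done: set[int] = set()
    for idx in arrival_order:
        arrived.add(idx)
        progress = True
        while progress:  # retry blocked entries after each successful apply
            progress = False
            for i in sorted(arrived - done):
                earlier = [p for p in log if p.index < i and p.index not in done]
                if can_apply(by_index[i], earlier):
                    applied.append(i)
                    done.add(i)
                    progress = True
    return applied


# Entry 2 touches blocks 8..11 and does not conflict with entry 1 (blocks 0..3),
# so it is applied as soon as it arrives. Entry 3 (blocks 2..5) overlaps entry 1
# and has to wait until entry 1 has been applied.
log = [Entry(1, 0, 4), Entry(2, 8, 4), Entry(3, 2, 4)]
print(apply_out_of_order(log, arrival_order=[2, 3, 1]))
```

With strict serialization the apply order would always be 1, 2, 3. Under the assumed overlap rule, non-conflicting writes no longer wait for their predecessors, while writes to the same blocks still keep their log order.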
SeRW: Adaptively Separating Read and Write upon SSDs of Hybrid Storage Server in Clouds
TLDR
SeRW notably improves both the overall system performance and SSD endurance without sacrificing write latency, and adaptively steers some SSD writes to idle HDDs at run time to fully exploit the I/O potential of underutilized HDDs.
PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers
TLDR
The novel architecture of PolarDB Serverless is described, which follows the disaggregation design paradigm: the CPU resource on compute nodes is decoupled from the remote memory pool and storage pool, and each resource pool grows or shrinks independently.
Taurus Database: How to be Fast, Available, and Frugal in the Cloud
TLDR
Taurus is a new multi-tenant cloud database system that separates the compute and storage layers in a similar manner to Amazon Aurora and Microsoft Socrates and provides similar benefits, such as read replica support, low network utilization, hardware sharing and scalability.
Latte: A Native Table Engine on NVMe Storage
TLDR
A lightweight native storage stack called Lightstack is proposed to minimize the software overhead of NVMe devices and has up to 3.6-6.5× the throughput of MySQL’s InnoDB and MyRocks engines, with latency as low as 28% in the same hardware environment.
Analysis of and Optimization for Write-dominated Hybrid Storage Nodes in Cloud
TLDR
By effectively offloading the right amount of write I/Os from overburdened SSDs to underutilized HDDs in WSNs, SWR is able to adequately alleviate the aforementioned problems suffered by WSNs, and significantly improves overall system performance and SSD endurance.
Experience Paper: Danaus: isolation and efficiency of container I/O at the client side of network storage
TLDR
This work developed a Danaus prototype that integrates a union filesystem with a Ceph distributed filesystem client and a configurable shared cache and achieves improved performance stability because it handles I/O with reserved per-tenant resources and avoids intensive kernel locking.
Towards Cost-Effective and Elastic Cloud Database Deployment via Memory Disaggregation
TLDR
A novel database architecture called LegoBase is proposed, which explores the co-design of database kernel and memory disaggregation, and pushes the memory management back to the database layer for bypassing the Linux I/O stack and re-using or designing (remote) memory access optimizations with an understanding of data access patterns.
Faster than Flash: An In-Depth Study of System Challenges for Emerging Ultra-Low Latency SSDs
TLDR
This work comprehensively performs empirical evaluations with 800GB ULL SSD prototypes and characterizes ULL behaviors by considering a wide range of I/O path parameters, such as different queues and access patterns, and analyzes the efficiencies and challenges of the polled-mode and hybrid polling I/O completion methods.
ArkDB: A Key-Value Engine for Scalable Cloud Storage Services
TLDR
This paper presents ArkDB, a key-value engine designed to address these challenges by combining advantages of both LSM tree and Bw-tree, and leveraging advances in hardware technologies.

References

SHOWING 1-10 OF 36 REFERENCES
Optimizing the Block I/O Subsystem for Fast Storage Devices
TLDR
This article proposes six optimizations that enable an OS to fully exploit the performance characteristics of fast storage devices and demonstrates that the overheads from the traditional storage-stack design are significant and cannot easily be overcome without modifying the hardware interface and adding new capabilities to the operating system.
Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store
TLDR
This paper explores the design of a distributed in-memory key-value store called Pilaf that takes advantage of Remote Direct Memory Access to achieve high performance with low CPU overhead and introduces the notion of self-verifying data structures that can detect read-write races without client-server coordination.
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
TLDR
This paper describes the architecture of Aurora and the design considerations leading to that architecture, and describes how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols.
MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
TLDR
MICA optimizes for multi-core architectures by enabling parallel access to partitioned data, and for efficient parallel data access, MICA maps client requests directly to specific CPU cores at the server NIC level by using client-supplied information and adopts a light-weight networking stack that bypasses the kernel.
PaxosStore: High-availability Storage Made Practical in WeChat
TLDR
A layered design of the Paxos-based storage protocol stack is proposed, where PaxosLog, the key data structure used in the protocol, is devised to bridge the programming-oriented consistent read/write to the storage-oriented Paxos procedure.
Ceph: a scalable, high-performance distributed file system
TLDR
Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
CORFU: A Shared Log Design for Flash Clusters
TLDR
CORFU organizes a cluster of flash devices as a single, shared log that can be accessed concurrently by multiple clients over the network, slashing cost, power consumption and latency by eliminating storage servers.
Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports
  • Jithin Jose, H. Subramoni, D. Panda
  • Computer Science
    2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
  • 2012
TLDR
This work introduces a hybrid transport model which takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single transport, and presents comprehensive performance analysis using micro benchmarks, application benchmarks and realistic industry workloads.
APUS: fast and scalable paxos on RDMA
TLDR
This paper presents APUS, the first RDMA-based Paxos protocol that aims to be fast and scalable to client connections and hosts, and evaluates APUS on nine widely-used server programs.
Memcached Design on High Performance RDMA Capable Interconnects
TLDR
The design extends the existing open-source Memcached software to make it RDMA capable, and a detailed performance comparison is provided between this Memcached design and unmodified Memcached using Sockets over RDMA and a 10 Gigabit Ethernet network with hardware-accelerated TCP/IP.