The Hadoop framework has been widely used in various clusters to build large-scale, high-performance systems. However, the Hadoop Distributed File System (HDFS) is designed to manage large files and suffers a performance penalty when managing a large number of small files. As a consequence, many web applications, such as WebGIS, may not benefit from Hadoop. In …
With higher reliability requirements in clusters and data centers, RAID-6 has gained popularity due to its ability to tolerate concurrent failures of any two disks, which has been shown to be of increasing importance in large-scale storage systems. Among various implementations of erasure codes in RAID-6, a typical set of codes known as Maximum Distance …
In deduplication-based backup systems, the chunks of each backup are physically scattered after deduplication, which causes a challenging fragmentation problem. The fragmentation decreases restore performance, and results in invalid chunks becoming physically scattered in different containers after users delete backups. Existing solutions attempt to rewrite …
RAID-6 is widely used to tolerate concurrent failures of any two disks to provide a higher level of reliability with the support of erasure codes. Among many implementations, one class of codes called Maximum Distance Separable (MDS) codes aims to offer data protection against disk failures with optimal …
Data deduplication has become a standard component in modern backup systems. In order to understand the fundamental tradeoffs in each of its design choices (such as prefetching and sampling), we disassemble data deduplication into a large N-dimensional parameter space. Each point in the space represents a particular combination of parameter settings and makes a tradeoff among …
Solid State Drives (SSDs) have shown promise as a replacement for traditional hard disk drives, but due to certain physical characteristics of NAND flash, there remain challenging areas for improvement and further research. We focus on the layout and management of the small amount of RAM that serves as a cache between the SSD and the system that …
The buffer cache plays an essential role in bridging the gap between the upper-level computational components and the lower-level storage devices. A good buffer cache management scheme should benefit not only the computational components but also the storage components, by reducing disk I/Os. Existing cache replacement algorithms are well …
Amid a severe energy crisis and the rapid development of cloud computing, sustainability in large data centers is receiving more attention than ever. Due to its high performance and reliability, RAID, particularly RAID-5, is widely used in these data centers. However, a challenge to the sustainability of RAID-5 is its scalability, or how to …
NAND flash-based SSDs suffer from limited lifetime because NAND flash can be programmed and erased only a limited number of times. Among various approaches to address this problem, we propose to reduce the number of writes to the flash by exploiting the content locality between the write data and its corresponding old version in the flash. This content …
This paper presents a novel block I/O scheduler designed specifically for SSDs. The scheduler leverages the rich internal parallelism of the SSD's highly parallelized architecture. It speculatively divides the entire SSD space into different subregions and dispatches requests into those subregions in a round-robin fashion at the Linux kernel block layer. In …