A case for redundant arrays of inexpensive disks (RAID)
TLDR
Five levels of RAID are introduced, giving their relative cost/performance, and are compared to an IBM 3380 and a Fujitsu Super Eagle.
RAID: high-performance, reliable secondary storage
TLDR
A comprehensive overview of disk array technology is given, and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency are discussed.
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ...
A Large-Scale Study of Failures in High-Performance Computing Systems
TLDR
Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Safe and effective fine-grained TCP retransmissions for datacenter communication
TLDR
This paper uses high-resolution timers to enable microsecond-granularity TCP timeouts and shows that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.
Informed prefetching and caching
TLDR
This paper shows how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses.
PLFS: a checkpoint filesystem for parallel applications
TLDR
A virtual parallel log-structured file system is presented that remaps an application's preferred data layout into one optimized for the underlying file system, which can reduce checkpoint time by an order of magnitude.
Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems
TLDR
This paper analyzes the Incast problem, explores its sensitivity to various system parameters, and examines the effectiveness of alternative TCP- and Ethernet-level strategies in mitigating the TCP throughput collapse.
Scalable Performance of the Panasas Parallel File System
TLDR
Performance measures of I/O, metadata, and recovery operations are presented for storage clusters that range in size from 10 to 120 storage nodes and 1 to 12 metadata nodes, with file system client counts ranging from 1 to 100 compute nodes.