A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset
Data integrity is pivotal to the usefulness of any storage system. It ensures that stored data remains free from unintended modification throughout its existence on the storage medium. Hash functions such as cyclic redundancy checks or checksums are frequently used to detect data corruption while data is in transit to permanent storage or at rest there. Without these checks, such errors usually go undetected and unreported to the system and hence are never communicated to the application. They are referred to as "silent data corruption." When an application reads corrupted or malformed data, the result is incorrect output or a failed system. Storage arrays in leadership computing facilities comprise several thousand drives, which increases the likelihood of such failures. These environments mandate a file system capable of detecting data corruption. Parallel file systems have traditionally forgone integrity checks because of their high computational cost, particularly when dealing with unaligned data requests from scientific applications. In this paper, we assess the cost of providing data integrity on a parallel file system. We present an approach that provides this capability with as low as 5% overhead for writes and 22% overhead for reads for aligned requests, and some additional cost for unaligned requests.
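To make the aligned versus unaligned distinction concrete, the following is a minimal illustrative sketch, not the implementation evaluated in this paper. It assumes one CRC32 per fixed-size block; the block size, the `ChecksummedStore` class, and its counter are hypothetical names introduced only for illustration. An aligned write can recompute checksums from the new data alone, whereas an unaligned write must first read and verify the partially overwritten edge blocks, which is one plausible source of the additional cost.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical checksum granularity, not taken from the paper


class ChecksummedStore:
    """Toy in-memory store that keeps one CRC32 per fixed-size block."""

    def __init__(self, num_blocks):
        self.data = bytearray(num_blocks * BLOCK_SIZE)
        zero_crc = zlib.crc32(bytes(BLOCK_SIZE))
        self.crcs = [zero_crc] * num_blocks
        self.blocks_verified = 0  # number of blocks read back and checked

    def _block(self, i):
        return bytes(self.data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])

    def _verify(self, i):
        self.blocks_verified += 1
        if zlib.crc32(self._block(i)) != self.crcs[i]:
            raise OSError(f"silent data corruption detected in block {i}")

    def write(self, offset, payload):
        if not payload:
            return
        end = offset + len(payload)
        if end > len(self.data):
            raise ValueError("write past end of store")
        first, last = offset // BLOCK_SIZE, (end - 1) // BLOCK_SIZE
        # Edge blocks that are only partially overwritten keep some old
        # bytes, so they are verified before the new data is merged in.
        partial = set()
        if offset % BLOCK_SIZE:
            partial.add(first)
        if end % BLOCK_SIZE:
            partial.add(last)
        for i in sorted(partial):
            self._verify(i)
        self.data[offset:end] = payload
        # Recompute the checksum of every block the write touched.
        for i in range(first, last + 1):
            self.crcs[i] = zlib.crc32(self._block(i))

    def read(self, offset, length):
        end = offset + length
        # Every block overlapping the request is verified, even if the
        # request covers only part of it.
        for i in range(offset // BLOCK_SIZE, (end - 1) // BLOCK_SIZE + 1):
            self._verify(i)
        return bytes(self.data[offset:end])


if __name__ == "__main__":
    store = ChecksummedStore(num_blocks=4)
    store.write(0, b"a" * BLOCK_SIZE)        # aligned: no blocks read back
    print("verified after aligned write:", store.blocks_verified)
    store.write(100, b"b" * BLOCK_SIZE)      # unaligned: two edge blocks read back
    print("verified after unaligned write:", store.blocks_verified)
    store.data[5000] ^= 0xFF                 # simulate a silent bit flip on the medium
    try:
        store.read(BLOCK_SIZE, BLOCK_SIZE)
    except OSError as err:
        print(err)
```

In this sketch the unaligned write spans two blocks and triggers two extra verifications that the aligned write avoids; a real parallel file system would additionally pay for network transfers and locking, which the sketch does not model.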