baijia - papers and notes

Full Version: Zhang: End-to-end data integrity for file systems: a ZFS case study.FAST'10
End-to-end Data Integrity for File Systems: A ZFS Case Study
Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. FAST'10

PDF from USENIX: http://www.usenix.org/events/fast10/tech/full_papers
Slides from USENIX: http://www.usenix.org/events/fast10/tech/slides
Video from USENIX: http://www.usenix.org/multimedia/fast10zhang

This paper studies file system data integrity in the presence of disk and
memory corruption.

The results on ZFS's ability to recover from disk corruption suggest that
checksums like those in ZFS can be trusted to ensure integrity, at least up to
the level of corruption tested in this paper.
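
As a rough illustration of how a checksum catches disk corruption, here is a
minimal sketch. It is not ZFS code: `write_block`, `read_block`, and the
dictionary standing in for a disk are hypothetical names, and CRC32 is used
only as a convenient stand-in for whatever checksum a real file system uses.
The checksum is kept apart from the data and compared again on every read.

```python
import zlib

def write_block(disk, addr, data):
    """Store the data and return its checksum, which is kept separately."""
    disk[addr] = bytearray(data)
    return zlib.crc32(data)

def read_block(disk, addr, expected_checksum):
    """Return the data only if it still matches the checksum from write time."""
    data = bytes(disk[addr])
    if zlib.crc32(data) != expected_checksum:
        raise IOError(f"checksum mismatch at block {addr}")
    return data

disk = {}  # hypothetical in-memory stand-in for a disk
checksum = write_block(disk, 0, b"hello zfs")
clean_read = read_block(disk, 0, checksum)

disk[0][3] ^= 0x01  # simulate a single bit flip on the medium
try:
    read_block(disk, 0, checksum)
    detected = False
except IOError:
    detected = True
```

Keeping the checksum away from the block it protects is the design choice that
matters here: a corrupted block cannot vouch for itself.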

Memory corruption is largely ignored, on the assumption that the system will
crash when memory is incorrect. However, the evidence in this paper shows that
memory corruption is also "common" in long-running systems. Worse, bad data in
the file system's cache is written to disk.
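
The following sketch shows why a disk-level checksum does not help in this
case. It assumes that the checksum is computed over the cached data at
write-back time; the variable names are hypothetical and CRC32 is only a
stand-in checksum. If a bit flips in memory before write-back, the checksum is
computed over the already corrupted bytes and later reads verify cleanly.

```python
import zlib

original = b"important data"
cache = bytearray(original)  # dirty page held in the file system cache
cache[0] ^= 0x20             # simulated memory bit flip before write-back

# Write-back (assumed): the checksum covers whatever is in memory right now.
on_disk = bytes(cache)
stored_checksum = zlib.crc32(on_disk)

# A later read passes verification even though the content is wrong.
verifies = zlib.crc32(on_disk) == stored_checksum
is_original = on_disk == original
```

Here `verifies` is true while `is_original` is false, so the corruption is
silent from the file system's point of view.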

When designing large-scale, long-running systems, memory corruption should be
taken into consideration as well as disk corruption.

For data integrity, one method is to duplicate the data. This can improve the
level of data integrity, although it cannot ensure full correctness.
Duplication is also used in systems to improve locality and throughput.
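
A small sketch of duplication combined with a checksum, under the assumption
that every copy is verified against the same stored checksum.
`read_with_copies` is a hypothetical helper, not an API from any real file
system, and CRC32 is again a stand-in. It also shows the limit mentioned
above: when every copy is damaged, duplication alone cannot recover the data.

```python
import zlib

def read_with_copies(copies, expected_checksum):
    """Return (index, data) of the first copy that matches the checksum."""
    for index, copy in enumerate(copies):
        if zlib.crc32(copy) == expected_checksum:
            return index, copy
    raise IOError("every copy failed verification")

good = b"metadata block"
checksum = zlib.crc32(good)

# One damaged copy: the intact duplicate is returned instead.
index, data = read_with_copies([b"metadata blocc", good], checksum)

# Both copies damaged: there is nothing valid left to return.
try:
    read_with_copies([b"metadata blocc", b"metadata blocj"], checksum)
    recovered = True
except IOError:
    recovered = False
```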

In this paper, the authors take ZFS as an example to analyze the effects of
disk and memory corruption on file system data integrity. ZFS is a
state-of-the-art file system that embeds a great many mechanisms for data
integrity. The study shows that ZFS is robust to a wide range of disk
corruption but less resilient to memory corruption.

The citations in this paper provide a lot of useful information. Some of them
can be cited as evidence in our later work.

1 The existence of hardware-based memory corruption
R. Baumann. Soft errors in advanced computer systems. IEEE Des. Test, 22
(3):258–266, 2005.
T. C. May and M. H. Woods. Alpha-particle-induced soft errors in dynamic
memories. IEEE Trans. on Electron Dev, 26(1), 1979.
J. F. Ziegler and W. A. Lanford. Effect of cosmic rays on computer memories.
Science, 206(4420):776–788, 1979.
X. Li, K. Shen, M. C. Huang, and L. Chu. A memory soft error measurement on
production systems. In USENIX, 2007.
T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J.
Montrose, H. W. Curtis, and J. L. Walsh. Field testing for cosmic ray soft
errors in semiconductor memories. IBM J. Res. Dev., 40(1):41–50, 1996.
B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: a
large-scale field study. In SIGMETRICS, 2009.

2 Bugs that lead to “wild writes” into random memory contents
J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta.
Hive: Fault Containment for Shared-Memory Multiprocessors. In SOSP, 1995.
CERT/CC Advisories. http://www.cert.org/advisories/.
Kernel Bug Tracker. http://bugzilla.kernel.org/.
US-CERT Vulnerabilities Notes Database. http://www.kb.cert.org/vuls/.
Y. Xie, A. Chou, and D. Engler. Archer: using symbolic, path-sensitive
analysis to detect memory access errors. In FSE, 2003.

3 Disk corruption caused by power spikes, erratic arm movements, and
scratches on the media
D. Anderson, J. Dykes, and E. Riedel. More Than an Interface: SCSI vs. ATA.
In FAST, 2003.
T. J. Schwarz, Q. Xin, E. L. Miller, D. D. Long, A. Hospodor, and S. Ng. Disk
Scrubbing in Large Archival Storage Systems. In MASCOTS, 2004.
The Data Clinic. Hard Disk Failure.
http://www.dataclinic.co.uk/hard-disk-failures.htm.

4 Complexity in modern disk firmware
V. Prabhakaran, L. N. Bairavasundaram, N. Agrawal, H. S. Gunawi, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau. IRON File Systems. In SOSP, 2005.

5 Disk corruption caused by firmware errors
a) write to the wrong location
G. Weinberg. The Solaris Dynamic File System.
http://members.visi.net/thedave/sun/DynFS.pdf.
b) the disk loses a write but reports it as complete
R. Sundaram. The Private Lives of Disk Drives.
http://partners.netapp.com/go/techontap/matl/sample/0206tot_resiliency.html.
c) errors caused by the bus
R. Green. EIDE Controller Flaws Version 24.
http://mindprod.com/jgloss/eideflaw.html.
J. Wehman and P. den Haan. The Enhanced IDE/Fast-ATA FAQ.
http://thef-nym.sci.kun.nl/cgi-pieterh/atazip/atafq.html.

6 Data corruption caused by buggy device drivers
A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of
Operating System Errors. In SOSP, 2001.
D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as Deviant
Behavior: A General Approach to Inferring Errors in Systems Code. In SOSP,
2001.
M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the Reliability of
Commodity Operating Systems. In SOSP, 2003.

I will add entries for these cited papers to baijia later.