testing the resiliency of zfs set copies=n
Published by Jim Salter // May 9th, 2016
I decided to see how well ZFS copies=n would stand up to on-disk corruption today. Spoiler alert: not great.
First step, I created a 1GB virtual disk, made a zpool out of it with 8K blocks, and set copies=2.
me@locutus:~$ sudo qemu-img create -f qcow2 /data/test/copies/0.qcow2 1G
me@locutus:~$ sudo qemu-nbd -c /dev/nbd0 /data/test/copies/0.qcow2 1G
me@locutus:~$ sudo zpool create -oashift=12 test /data/test/copies/0.qcow2
me@locutus:~$ sudo zfs set copies=2 test
Now, I wrote 400 1MB files to it – just enough to make the pool roughly 85% full, including the overhead due to copies=2.
me@locutus:~$ cat /tmp/makefiles.pl
#!/usr/bin/perl

for ($x=0; $x<400 ; $x++) {
	print "dd if=/dev/zero bs=1M count=1 of=$x\n";
	print `dd if=/dev/zero bs=1M count=1 of=$x`;
}
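(The script just runs dd in whatever the current directory is, so presumably it was invoked from the pool's mountpoint - something like this, assuming the default /test mountpoint:)

# assumed invocation from the pool's default mountpoint
me@locutus:~$ cd /test
me@locutus:/test$ sudo perl /tmp/makefiles.pl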
With the files written, I backed up my virtual filesystem, fully populated, so I can repeat the experiment later.
me@locutus:~$ sudo zpool export test
me@locutus:~$ sudo cp /data/test/copies/0.qcow2 /data/test/copies/0.qcow2.bak
me@locutus:~$ sudo zpool import test
Now I wrote corrupt blocks to 10% of the filesystem. (Roughly 10%: since target blocks are chosen at random, it's possible the same block was overwritten with garbage more than once.) With 131,072 total 8K blocks in the 1GB pool, that works out to about 13,000 garbage writes. Note that I used a specific seed, so I can recreate the scenario exactly for more runs later.
me@locutus:~$ cat /tmp/corruptor.pl
#!/usr/bin/perl

# total number of blocks in the test filesystem
$numblocks=131072;

# desired percentage corrupt blocks
$percentcorrupt=.1;

# so we write this many corrupt blocks
$corruptloop=$numblocks*$percentcorrupt;

# consistent results for testing
srand 32767;

# generate 8K of garbage data
for ($x=0; $x<8*1024; $x++) {
	$garbage .= chr(int(rand(256)));
}

open FH, "> /dev/nbd0";

for ($x=0; $x<$corruptloop; $x++) {
	$blocknum = int(rand($numblocks-100));
	print "Writing garbage data to block $blocknum\n";
	seek FH, ($blocknum*8*1024), 0;
	print FH $garbage;
}

close FH;
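(The script writes straight to the block device underneath ZFS, so it needs root; presumably it was run something like this:)

# assumed invocation; scribbles garbage directly onto /dev/nbd0, underneath the live pool
me@locutus:~$ sudo perl /tmp/corruptor.pl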
Okay. When I scrub the filesystem I just wrote those 13,000 or so corrupt blocks to, what happens?
me@locutus:~$ sudo zpool scrub test ; sudo zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 133M in 0h1m with 1989 errors on Mon May 9 15:56:11 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0 1.94K
	  nbd0      ONLINE       0     0 8.94K

errors: 1989 data errors, use '-v' for a list

me@locutus:~$ sudo zpool status -v test | grep /test/ | wc -l
385
OUCH. 385 of my 400 total files were still corrupt after the scrub! Copies=2 didn't do a great job for me here.
What if I try it again, this time just writing garbage to 1% of the blocks on disk, not 10%? First, let's restore that backup I cleverly made:
root@locutus:/data/test/copies# zpool export test
root@locutus:/data/test/copies# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
root@locutus:/data/test/copies# pv < 0.qcow2.bak > 0.qcow2
 999MB 0:00:00 [1.63GB/s] [==================================>] 100%
root@locutus:/data/test/copies# qemu-nbd -c /dev/nbd0 /data/test/copies/0.qcow2
root@locutus:/data/test/copies# zpool import test
root@locutus:/data/test/copies# zpool status test | tail -n 5
	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  nbd0      ONLINE       0     0     0

errors: No known data errors
Alright, now let's change $percentcorrupt from 0.1 to 0.01, and try again. How'd we do after only corrupting 1% of the blocks on disk?
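(The intermediate commands for this run aren't shown; presumably the edited script was simply run against the restored pool and followed by another scrub, roughly like so:)

# assumed re-run, with /tmp/corruptor.pl edited so that $percentcorrupt=.01
root@locutus:/data/test/copies# perl /tmp/corruptor.pl
root@locutus:/data/test/copies# zpool scrub test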
root@locutus:/data/test/copies# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 101M in 0h0m with 72 errors on Mon May 9 16:13:49 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0    72
	  nbd0      ONLINE       0     0 1.08K

errors: 64 data errors, use '-v' for a list

root@locutus:/data/test/copies# zpool status test -v | grep /test/ | wc -l
64
Still not great: we lost 64 of our 400 files. What about corrupting just a tenth of a percent of the blocks?
root@locutus:/data/test/copies# zpool status -v test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 12.1M in 0h0m with 2 errors on Mon May 9 16:22:30 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     2
	  nbd0      ONLINE       0     0   105

errors: Permanent errors have been detected in the following files:

        /test/300
        /test/371
Damn. We still lost two files, even though we only wrote 130 or so corrupt blocks. (The scrub presumably missed the other 26 or so corrupt blocks because they landed in the roughly 15% of unused space on the pool: a scrub doesn't check unused blocks.)

OK, what if we try a control: corrupt the same tenth of a percent of the filesystem (105 blocks or so), this time without copies=2 set. To make it fair, I wrote 800 1MB files to the same filesystem using the default copies=1 - more files, but the same percentage of the filesystem full. (Interestingly, this went a LOT faster - perceptibly more than twice as fast, I think, although I didn't actually time it.)
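(The exact commands for this re-setup aren't shown in the transcript; presumably it was something like the following, with the pool recreated directly on the backing file - which is why the vdev shows up by its path in the status output further down - and makefiles.pl edited to loop to 800 instead of 400.)

# assumed re-setup for the copies=1 control run (not shown in the original transcript)
root@locutus:/data/test/copies# zpool destroy test
root@locutus:/data/test/copies# qemu-nbd -d /dev/nbd0
root@locutus:/data/test/copies# zpool create -oashift=12 test /data/test/copies/0.qcow2
root@locutus:/data/test/copies# cd /test
root@locutus:/test# perl /tmp/makefiles.pl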
Now, with the /test zpool still 84% full but this time with copies=1, I corrupted the same 0.1% of the total block count.
root@locutus:/data/test/copies# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 8K in 0h0m with 98 errors on Mon May 9 16:28:26 2016
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0    98
	  /data/test/copies/0.qcow2  ONLINE       0     0   198

errors: 93 data errors, use '-v' for a list
Without copies=2 set, we lost 93 of our 800 files, instead of 2 of 400. So copies=n was definitely better than nothing for our test of writing 0.1% of the filesystem as bad blocks... but it wasn't fabulous either, and it fell flat on its face with 1% or 10% of the filesystem corrupted. By comparison, a truly redundant pool - one with two disks in a mirror vdev - would have survived the same test (corrupting ANY number of blocks on a single disk) with flying colors.
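(For reference, here's a minimal sketch of what that two-disk mirror would look like with the same qemu setup; the second image file and /dev/nbd1 are hypothetical, not part of the test above.)

# hypothetical second backing device, mirrored with the first at pool creation time
root@locutus:/data/test/copies# qemu-img create -f qcow2 /data/test/copies/1.qcow2 1G
root@locutus:/data/test/copies# qemu-nbd -c /dev/nbd1 /data/test/copies/1.qcow2
root@locutus:/data/test/copies# zpool create -oashift=12 test mirror /dev/nbd0 /dev/nbd1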
The TL;DR here is that copies=n is better than nothing... but not by a long shot, and you do give up a lot of performance for it. Conclusion: play with it if you like, but limit it only to extremely important data, and don't make the mistake of thinking it's any substitute for device redundancy, much less backups.
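(If you do want to play with it, note that copies is a per-dataset property rather than pool-wide, so you can confine it to the datasets holding that extremely important data. The pool and dataset names below are just examples, and the setting only affects data written after it's in place.)

# hypothetical pool/dataset names; copies only applies to newly written data
root@locutus:~# zfs create -o copies=2 tank/important
root@locutus:~# zfs set copies=2 tank/existing-dataset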