ZFS: practicing failures on virtual hardware
Published by Jim Salter // May 16th, 2016
I always used to sweat, and sweat bullets, when it came time to replace a failed disk in ZFS. It happened infrequently enough that I never did remember the syntax quite right in between issues, and the last thing you want to do with production hardware and data is fumble around in a cold sweat – you want to know what the correct syntax for everything is ahead of time, and be practiced and confident.
Wonderfully, there’s no reason for that anymore – particularly on Linux, which boasts a really excellent set of tools for simulating storage hardware quickly and easily. So today, we’re going to walk through setting up a pool, destroying one of the disks in it, and recovering from the failure. The important part here isn’t really the syntax for the replacement, though… it’s learning how to set up the simulation in the first place, which allows you to test lots of things properly when your butt isn’t on the line!
Prerequisites
Obviously, you need to have ZFS available. If you’re on Ubuntu Xenial, you can get it with apt update ; apt install zfsutils-linux. If you’re on an earlier version of Ubuntu, you’ll need to add the zfs-native PPA from the ZFS on Linux project, and install from there: apt-add-repository ppa:zfs-native/stable ; apt update ; apt install ubuntu-zfs.
You’ll also need the QEMU tools, to create and manage .qcow2 storage files, and qemu-nbd loopback type devices that access them like real hardware. Again on Ubuntu, that’s apt update ; apt install qemu-utils.
If you aren’t running Ubuntu, the tools are definitely still available, but you’ll need to consult your own distro’s documentation / forums / etc to figure out how to install ’em.
Creating the virtual disk back-end
First up, we create the back end files for the storage. In today’s example, that’ll be a pair of 1GB disks.
root@banshee:~# qemu-img create -f qcow2 /tmp/0.qcow2 1G ; qemu-img create -f qcow2 /tmp/1.qcow2 1G
Formatting '/tmp/0.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
Formatting '/tmp/1.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
That does exactly what it looks like: creates a pair of 1GB virtual disks, /tmp/0.qcow2 and /tmp/1.qcow2. Qcow2 files are sparsely allocated, so they don’t actually take up any room at all yet – they’ll expand as and if you put data in them. (If you omit the -f qcow2 argument, qemu-img defaults to the raw format instead.)
Creating the virtual disk devices
By themselves, our .qcow2 files don’t help us much. In order to use them as disks with ZFS, what we really need are block devices, in the /dev filesystem, which reference our qcow2 files. That’s where qemu-nbd comes in.
root@banshee:~# modprobe nbd max_part=63
You might or might not actually need that bit – but it never hurts. This makes sure the nbd kernel module is loaded, and, for safety’s sake, that it won’t try to load more than 63 partitions on a single virtual device. This might keep your system from crashing if you try to access a stupendously corrupt or maliciously crafted qcow2 file – I’m not sure what happens if a system thinks it has devices sda1 through sda65537, and I’d really rather not find out!
root@banshee:~# qemu-nbd -c /dev/nbd0 /tmp/0.qcow2 ; qemu-nbd -c /dev/nbd1 /tmp/1.qcow2
Easy peasy – we now have virtual disks /dev/nbd0 and /dev/nbd1, which can be accessed by ZFS (or any other filesystem or Linux system utility) just like any other disk would be.
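If you want more than a couple of disks, the create-and-attach steps above can be wrapped in a loop. This is only a rough sketch: the function name attach_test_disks and its arguments are made up for illustration, and it assumes the nbd module is already loaded, you're root, and /dev/nbd0 onward are free. It uses nothing beyond the qemu-img create and qemu-nbd -c calls shown above.

```shell
# Hypothetical helper: create COUNT sparse qcow2 files of SIZE under DIR
# (default /tmp) and attach file N to /dev/nbdN.
attach_test_disks() {
    local count="$1" size="$2" dir="${3:-/tmp}" i=0
    while [ "$i" -lt "$count" ]; do
        qemu-img create -f qcow2 "$dir/$i.qcow2" "$size" || return 1
        qemu-nbd -c "/dev/nbd$i" "$dir/$i.qcow2" || return 1
        i=$((i + 1))
    done
}
```

Calling attach_test_disks 2 1G should leave you in the same place as the two commands above.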
Setting up a zpool
This looks just like setting up any other pool. We’ll use the ashift argument to make the pool use 4K blocks, since that matches my underlying block device (which might or might not really matter, but it’s a good habit to get into anyway, and this is all about building good habits and muscle memory, right?)
root@banshee:~# zpool create -oashift=12 test mirror /dev/nbd0 /dev/nbd1
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     0
            nbd1    ONLINE       0     0     0

errors: No known data errors
Easy peasy! We now have a pool with a simple two-disk mirror vdev.
Playing with topology: add more devices
What if we wanted to make it a pool of mirrors, with a second mirror vdev?
root@banshee:~# qemu-img create -f qcow2 /tmp/2.qcow2 1G ; qemu-img create -f qcow2 /tmp/3.qcow2 1G
Formatting '/tmp/2.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
Formatting '/tmp/3.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
root@banshee:~# qemu-nbd -c /dev/nbd2 /tmp/2.qcow2 ; qemu-nbd -c /dev/nbd3 /tmp/3.qcow2
root@banshee:~# zpool add -oashift=12 test mirror /dev/nbd2 /dev/nbd3
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     0
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Couldn’t be easier.
Save some data on your new virtual hardware
Before we do anything else, let’s store some data on the new pool, so that we have something to detect corruption on later. (ZFS doesn’t checksum or scrub empty blocks, so won’t find corruption in them.)
root@banshee:~# pv < /dev/urandom > /test/urandom.bin
408MB 0:00:37 [ 15MB/s] [ <=> ] ^C
I used the pipe viewer utility here, pv, which is available on Ubuntu with apt update ; apt install pv.
The ^C is because I hit control-C to interrupt it once I felt that I’d written enough data. In this case, a bit over 400MB of random data, saved on our new pool in the file /test/urandom.bin.
root@banshee:~# zpool list test
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
test  1.97G   414M  1.56G         -    26%    20%  1.00x  ONLINE  -
root@banshee:~# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   413M  1.50G   413M  /test
root@banshee:~# ls -lh /test
total 413M
-rw-r--r-- 1 root root 413M May 16 12:12 urandom.bin
Gravy.
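Hitting control-C gives you a different amount of data every run. If you'd rather have something repeatable, here's a minimal sketch that writes a fixed number of bytes and records a SHA-256 checksum, so you can confirm later that the file itself came through unharmed. The function names write_test_data and verify_test_data are made up for this example, and it assumes the GNU coreutils versions of head and sha256sum.

```shell
# Hypothetical helper: write exactly SIZE bytes of random data to FILE and
# store its SHA-256 in FILE.sha256. SIZE accepts head -c suffixes (e.g. 400M).
write_test_data() {
    local file="$1" size="$2"
    head -c "$size" /dev/urandom > "$file" || return 1
    sha256sum "$file" > "$file.sha256"
}

# Hypothetical helper: re-read FILE and compare it with the stored checksum.
# Returns non-zero if the contents no longer match.
verify_test_data() {
    sha256sum -c --quiet "$1.sha256"
}
```

For example, write_test_data /test/urandom.bin 400M before breaking anything, then verify_test_data /test/urandom.bin once the pool has been repaired.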
I just want to kill something beautiful
What happens if a drive or a controller port goes berserk, and overwrites large swathes of a disk with zeroes? No time like the present to find out!
We don’t really want to write directly over a .qcow2 file, since our goal today is to test ZFS, not to test the QEMU infrastructure itself. We want to write to the actual device we created, and let the device put the data in the .qcow2 file. Let’s pick on /dev/nbd0, backed by /tmp/0.qcow2 – it looks uppity and in need of some comeuppance.
root@banshee:~# pv < /dev/zero > /dev/nbd0
777MB 0:00:06 [48.9MB/s] [ <=> ] ^C
Boom. Errors galore. Does ZFS know about them yet?
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     0
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Nope! We really, really did simulate on-disk corruption, which ZFS has no way of knowing about until it tries to actually access the data.
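If you want to zero out the same amount of the disk every time instead of eyeballing it and hitting control-C, dd can do the overwrite with an explicit size. This is just a sketch: the function name zero_fill is made up, and the conv= options assume GNU dd.

```shell
# Hypothetical helper: overwrite the first MIB mebibytes of TARGET with zeroes.
# notrunc only matters when TARGET is a regular file (it stops dd from
# truncating it first); fsync flushes the writes before dd exits.
zero_fill() {
    local target="$1" mib="$2"
    dd if=/dev/zero of="$target" bs=1M count="$mib" conv=notrunc,fsync
}
```

Something like zero_fill /dev/nbd0 768 would wipe the first three quarters of one of our 1GB test disks, which is roughly what the interrupted pv run did here.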
Easter egging for errors
So, let’s actually access the data by reading in the entire /test/urandom.bin file, and then check our status:
root@banshee:~# pv < /test/urandom.bin > /dev/null
412MB 0:00:00 [6.99GB/s] [=============================================>] 100%
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     0
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Hey – still nothing! Why not? Because /test/urandom.bin is still in the ARC, that’s why! ZFS didn’t actually need to hit the storage to hand us the file, so it didn’t – and we still haven’t detected the massive amount of corruption we injected into /dev/nbd0.
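If you'd like to watch the ARC doing this, ZFS on Linux exposes its cache counters in /proc/spl/kstat/zfs/arcstats. Here's a small sketch that prints the overall hit and miss counters. The function name arc_counters and its optional path argument are made up for this example.

```shell
# Hypothetical helper: print the ARC "hits" and "misses" counters.
# Data lines in arcstats have three fields: name, type, value.
arc_counters() {
    awk '$1 == "hits" || $1 == "misses" { print $1, $3 }' \
        "${1:-/proc/spl/kstat/zfs/arcstats}"
}
```

Run it before and after re-reading the file. If the read really was served from cache, you'd expect hits to climb while misses barely moves.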
We can get ZFS to actually look for the problem in a couple of ways. The cruder way is to export the pool and reimport it, which will have the side effect of dumping the ARC as well. The more professional way is to scrub the pool, which is a technique with the explicit design of reading and verifying every block. But first, let’s see what happens just reading from the storage normally, after the ARC isn’t holding the data anymore:
root@banshee:~# zpool export test
root@banshee:~# zpool import test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     4
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Ooh, we’ve already found a few errors – we injected so much corruption that we nailed a few of ZFS’ metadata blocks on /dev/nbd0, which were found immediately on importing the pool again. There were redundant copies of the metadata blocks available, though, so ZFS has already repaired the errors it found using those.
Now that we’ve exported and imported the pool, which also dumped the ARC, let’s re-read the file in its entirety again, then see what our status looks like:
root@banshee:~# pv < /test/urandom.bin > /dev/null
412MB 0:00:00 [ 563MB/s] [================================================>] 100%
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     9
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Five more blocks detected corrupt. Doesn’t seem like a lot, does it? Keep in mind, ZFS is only finding blocks that we specifically attempt to read from right now – so a lot of what would be corrupt blocks on /dev/nbd0, we actually read from /dev/nbd1 instead. And only half-ish of the data was saved to mirror-0 in the first place – the rest went to mirror-1. So, ZFS is finding and repairing corrupt blocks… but only a few of them so far. This was worth playing with to see what happens with undetected errors in normal use, but now that we’ve done this, let’s look for errors the right way.
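Squinting at the CKSUM column gets old once you've run a few of these tests. Here's a rough sketch of a filter that reads zpool status output and prints only the devices with a non-zero checksum error count. The function name cksum_errors is made up, and it assumes the NAME / STATE / READ / WRITE / CKSUM column layout shown in the output above.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print the
# name and CKSUM count (5th field) of every device where it is not zero.
cksum_errors() {
    awk '
        $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ &&
        NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

Used as zpool status test | cksum_errors, it should print a single line for nbd0 at this point, and nothing at all on a healthy pool.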
It isn’t really clean until it’s been scrubbed
When we scrub the pool, we explicitly read every single block that’s been written and is actively in use on every single device, all at once.
So far, we’ve found nine corrupt blocks just poking around the system like we normally would in normal use. Will a scrub find more?
root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 170M in 0h0m with 0 errors on Mon May 16 12:32:00 2016
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0 1.40K
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
There we go! We just went from 9 checksum errors, to 1.4 thousand checksum errors detected.
And that, folks, is why you want to regularly scrub your pool.
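On a tiny test pool the scrub is done before you can type the next command, but on a real pool it can run for hours, and zpool scrub returns right away. Here's a minimal sketch that starts a scrub and polls until it finishes. The function name scrub_and_wait and the polling interval argument are made up, and it assumes zpool status reports a running scrub with the phrase "scrub in progress".

```shell
# Hypothetical helper: start a scrub on POOL, poll every INTERVAL seconds
# (default 5) until zpool status stops reporting a scrub in progress,
# then print the final status.
scrub_and_wait() {
    local pool="$1" interval="${2:-5}"
    zpool scrub "$pool" || return 1
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep "$interval"
    done
    zpool status "$pool"
}
```

For the pool in this walkthrough that would be scrub_and_wait test.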
Can we do worse? What happens if we blow through every block on /dev/nbd0, instead of “just” 3/4 or so of them?
root@banshee:~# pv < /dev/zero > /dev/nbd0
pv: write failed: No space left on device<=> ]
root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 16 12:36:38 2016
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    UNAVAIL      0     0 1.40K  corrupted data
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
The CKSUM column hasn’t changed – but /dev/nbd0 now shows as UNAVAIL, with the corrupted data flag, because we blew through all the metadata including all copies of the disk label itself. We still have no data errors on the pool itself, but mirror-0, and the pool itself, are operating degraded now. Which we’ll want to fix.
Interestingly, we also just found a bug in ZFS – the pool itself as well as the mirror-0 vdev, should be showing DEGRADED in the STATE column, not ONLINE! I’m gonna go report that; be back in a minute…
Replacing a failed disk (Rise chicken. Chicken, rise.)
OK, now that we’ve thoroughly failed a disk, we can replace it. We happen to know – because we’re evil bastards who did it ourselves – that the actual disk itself is fine on the hardware level, we just basically degaussed it. So we could just replace it in the array in situ. That’s not usually going to be best practice with real hardware, though, so let’s more thoroughly simulate actually removing and replacing the disk with a new one.
First, the removal:
root@banshee:~# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
Simple enough. ZFS doesn’t really know the disk is gone yet, but we can force it to figure out by scrubbing again before checking the status:
root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 16 12:42:35 2016
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            nbd0    UNAVAIL      0     0 1.40K  corrupted data
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Interestingly, now that we’ve actually removed /dev/nbd0 in a hardware sense, our pool and vdev show – correctly – as DEGRADED, in the actual STATE column as well as in the status message.
Regardless, now that we’ve “pulled” the disk, let’s “replace” it.
root@banshee:~# rm /tmp/0.qcow2
root@banshee:~# qemu-img create -f qcow2 /tmp/0.qcow2 1G
Formatting '/tmp/0.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off
root@banshee:~# qemu-nbd -c /dev/nbd0 /tmp/0.qcow2
Easy enough – now, we’re in exactly the same boat we’d be in if this was a real machine and we’d physically removed and replaced the offending drive, but done nothing else. ZFS, of course, is still going to show degraded status – we need to give the new drive to the pool, as well as physically plugging it in. So let’s do that:
root@banshee:~# zpool replace test /dev/nbd0
Is it really that easy? Let’s check the status and find out:
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 206M in 0h0m with 0 errors on Mon May 16 12:47:20 2016
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nbd0    ONLINE       0     0     0
            nbd1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nbd2    ONLINE       0     0     0
            nbd3    ONLINE       0     0     0

errors: No known data errors
Yep, it really is that easy.
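Since the whole point is to practice until it's boring, it can help to have the pull-and-replace steps in one place. This is only a sketch of the sequence above: the function name replace_test_disk and its arguments are made up, and it assumes the N.qcow2 / nbdN naming used throughout this post.

```shell
# Hypothetical helper: swap the backing file behind /dev/nbdN for a blank
# one of SIZE and hand the "new disk" to POOL.
replace_test_disk() {
    local pool="$1" n="$2" size="$3" dir="${4:-/tmp}"
    qemu-nbd -d "/dev/nbd$n" || return 1                  # "pull" the old disk
    rm -f "$dir/$n.qcow2"
    qemu-img create -f qcow2 "$dir/$n.qcow2" "$size" || return 1
    qemu-nbd -c "/dev/nbd$n" "$dir/$n.qcow2" || return 1  # "plug in" the new one
    zpool replace "$pool" "/dev/nbd$n"                    # kick off the resilver
}
```

The equivalent of what we just did by hand would be replace_test_disk test 0 1G, followed by zpool status test to watch the resilver.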
Party’s over: clean up before you go home
Now that you’ve successfully tested what you set out to test, the last step is to clean up your virtual mess. In order, you need to destroy the test pool, disconnect the virtual devices, and delete the back-end storage files they referenced.
root@banshee:~# zpool destroy test
root@banshee:~# qemu-nbd -d /dev/nbd0 ; qemu-nbd -d /dev/nbd1 ; qemu-nbd -d /dev/nbd2 ; qemu-nbd -d /dev/nbd3
/dev/nbd0 disconnected
/dev/nbd1 disconnected
/dev/nbd2 disconnected
/dev/nbd3 disconnected
root@banshee:~# rm /tmp/0.qcow2 ; rm /tmp/1.qcow2 ; rm /tmp/2.qcow2 ; rm /tmp/3.qcow2
All done – clean as a whistle, and ready to set up the next simulation!
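If you find yourself tearing these down a lot, the same three steps fit in a small function. Again, just a sketch: the name cleanup_test_pool and its arguments are made up, and it assumes the N.qcow2 / nbdN naming used throughout this post. It bails out if the pool can't be destroyed, so it won't yank disks out from under a pool that's still live.

```shell
# Hypothetical helper: destroy POOL, then detach /dev/nbd0..nbd(COUNT-1)
# and delete their backing files under DIR (default /tmp).
cleanup_test_pool() {
    local pool="$1" count="$2" dir="${3:-/tmp}" i=0
    zpool destroy "$pool" || return 1
    while [ "$i" -lt "$count" ]; do
        qemu-nbd -d "/dev/nbd$i"
        rm -f "$dir/$i.qcow2"
        i=$((i + 1))
    done
}
```

For today's four-disk pool, that's cleanup_test_pool test 4.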
If you learned how to fail out a disk, awesome, but…
If you ended up here trying to figure out how to deal with and replace a failed disk, cool, and I hope you got what you were looking for. But please, remember the testing steps we did for the environment – they’re what this post is actually about. Learning how to set up your own virtual test environment will make you a much, much better admin down the line – and make you as cool as an action hero that one fateful day when it’s a real disk that’s faulted out of your real pool, and your data’s on the line. Nothing gives you confidence and takes away the stress like plenty of experience having done the exact same thing before.