Why Sanoid’s ZFS replication matters
Published by Jim Salter // November 19th, 2014
If you’re an old hand in the storage game and are familiar with rsync (which is an amazing tool, btw) you might not be quite sure why block-level replication matters.
So, let’s do a thought experiment
What if you want daily offsite DR of an entire VM image? Let’s say the image is about 2TB in size. If you just copy the whole thing, like with FTP or SCP or any other simple copy tool, you’re looking at squeezing 2TB, byte by byte, over your offsite internet connection. Even with a completely uncontended 100Mbps connection, with roughly 11.5MB/sec of throughput, that’s going to take a solid 50 hours. Ouch. OK, that’s a non-starter.
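If you want to check that number yourself, here is a quick back-of-the-envelope sketch. The function name and the choice of decimal units (1TB = 10^12 bytes) are my assumptions, not something the scenario pins down.

```python
def transfer_hours(size_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Hours needed to move size_bytes at a sustained throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600


# Assumption: decimal units (1TB = 10**12 bytes, 1MB = 10**6 bytes).
TB = 10**12
MB = 10**6

full_copy_hours = transfer_hours(2 * TB, 11.5 * MB)
print(f"Full 2TB copy at 11.5MB/sec: {full_copy_hours:.1f} hours")
```

With decimal units this comes out a bit over 48 hours. With binary units it lands closer to 51. Either way, you are in "solid 50 hours" territory.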
OK, so what about rsync? Rsync is a highly advanced userland tool that is specifically designed to only migrate CHANGED data across the network connection. But in order to do this, rsync first has to read the entire file, on both ends, then tokenize it in chunks, then compare those tokens, which will tell it what chunks have changed and therefore need to be sent across the pipe. This works great for moderately sized files. But if you stop and think about it, that means for this scenario, rsync needs to read an entire 2TB file – on both ends – before it can even start actually moving data. So, let’s say you’ve got really great storage on both ends and you can pull sustained 100MB/sec reads over the entire 2TB on each end… you’re still looking at six hours, minimum, during which your production system will be drowning in I/O requests and nearly unusable. Even if you only changed a single byte of data. Well, crap.
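To make the "read everything first" cost concrete, here is a deliberately simplified toy in Python. It is not rsync's actual algorithm (rsync uses a rolling checksum and can match blocks at arbitrary offsets). It assumes fixed-size blocks at fixed offsets and in-memory data, and the function names are made up for illustration. The point is only that finding one changed byte this way means hashing every block on both sides.

```python
import hashlib


def block_digests(data: bytes, block_size: int) -> list[str]:
    """Hash every fixed-size block. Note that this touches every byte of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(source: bytes, target: bytes, block_size: int = 4096) -> list[int]:
    """Return indexes of blocks that differ between source and target."""
    src = block_digests(source, block_size)  # full read of the source
    dst = block_digests(target, block_size)  # full read of the target
    changed = [i for i, (a, b) in enumerate(zip(src, dst)) if a != b]
    # Blocks that exist on only one side count as changed too.
    changed.extend(range(min(len(src), len(dst)), max(len(src), len(dst))))
    return changed


# A 64-block "image" with a single byte flipped inside block 5.
target_image = bytes(64 * 4096)
source_image = bytearray(target_image)
source_image[5 * 4096 + 17] = 0xFF

print(changed_blocks(bytes(source_image), target_image))  # [5]
```

Only one block needs to cross the wire, but both copies had to be read end to end to find it. Scale that up to 2TB at 100MB/sec and you get roughly five and a half hours of pure reading on each side.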
This brings us to syncoid, Sanoid’s replication tool. Syncoid makes using the underlying ZFS replication simple and easy – you give it a source, you give it a target, and it makes sure the target is up-to-date with all of the snapshots available on the source. Syncoid (and ZFS) don’t need to tokenize anything first – they already know what individual blocks have changed, because that information is implicit in the snapshots themselves (which ultimately are just a list of pointers to blocks in the filesystem). So Syncoid just has to see what the newest snapshot is on the target, then send an incremental stream from the source to the target updating it with any more recent snapshots – meaning it starts working immediately, and it doesn’t generate any unnecessary load on the source or target machines.
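For the curious, the logic described above can be sketched roughly like this. This is not syncoid's actual code. It is a minimal illustration that assumes you already have each side's snapshot names ordered oldest to newest, and it only builds the command string rather than running it. The helper names and the example snapshot names are hypothetical. `zfs send -I` and `zfs receive` are the standard ZFS commands for sending a range of snapshots and applying it.

```python
import shlex


def newest_common_snapshot(source_snaps, target_snaps):
    """Newest snapshot name present on both sides, or None if there is none."""
    on_target = set(target_snaps)
    for snap in reversed(source_snaps):  # walk from newest to oldest
        if snap in on_target:
            return snap
    return None


def incremental_pull_command(source_host, source_ds, target_ds,
                             source_snaps, target_snaps):
    """Build a shell pipeline that pulls every snapshot the target is missing.

    Returns None when the target already has the newest source snapshot.
    """
    if not source_snaps:
        raise ValueError("source has no snapshots")
    base = newest_common_snapshot(source_snaps, target_snaps)
    if base is None:
        raise ValueError("no common snapshot; a full send would be needed")
    newest = source_snaps[-1]
    if base == newest:
        return None  # nothing to do
    # -I sends all intermediate snapshots between base and newest.
    send = f"zfs send -I {source_ds}@{base} {source_ds}@{newest}"
    return (f"ssh {shlex.quote(source_host)} {shlex.quote(send)}"
            f" | zfs receive {shlex.quote(target_ds)}")


# Hypothetical snapshot names, oldest first.
cmd = incremental_pull_command(
    "root@10.10.10.1", "data/images", "backup/images",
    source_snaps=["mon", "tue", "wed"],
    target_snaps=["mon"],
)
print(cmd)
```

Notice there is no scanning step anywhere: the only thing compared is two short lists of snapshot names. The sketch leaves out everything else a real tool would want, such as compression, buffering, progress reporting, and error handling.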
Ooh, a case study! Everybody loves case studies!
How’s this work in practice? Well, luckily, I just so happen to have a >2TB dataset in production. Almost the full 2TB of it is a single VM (which is the fileserver for an engineering office), with a couple hundred gigabytes thrown in for other, smaller application server VMs in that office. The big fileserver is Linux, two of the application server VMs are as well, and finally there’s a Windows VM running a server copy of Quickbooks. So let’s see how this plays out in real life.
The laughably small internet connection
root@remotebackup:~# iperf -c 10.10.10.1
------------------------------------------------------------
Client connecting to 10.10.10.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.10.10.2 port 39796 connected with 10.10.10.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.0 sec  2.00 MBytes  1.52 Mbits/sec
First question – just how crappy is the bandwidth we’re working with? Well, it’s an OpenVPN tunnel running across a low-dollar residential cable connection – the offsite DR is actually to the home of one of the firm partners. Our iperf test here shows us we’re getting a whopping 1.5Mbps – enough for about 160KB/sec or so of throughput. Yuck.
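The jump from megabits to kilobytes is easy to fumble, so here is the conversion spelled out. The 0.85 efficiency factor is purely my assumption, a stand-in for tunnel and protocol overhead that happens to land near the 160KB/sec figure. The real overhead on this link was not measured.

```python
def usable_kbytes_per_sec(link_mbits_per_sec: float, efficiency: float = 1.0) -> float:
    """Convert a link speed in megabits/sec to kilobytes/sec of payload.

    efficiency is the assumed fraction of the link left after overhead.
    """
    bytes_per_sec = link_mbits_per_sec * 1_000_000 / 8  # 8 bits per byte
    return bytes_per_sec * efficiency / 1000


raw_kb = usable_kbytes_per_sec(1.52)            # no overhead at all
realistic_kb = usable_kbytes_per_sec(1.52, 0.85)  # assumed 15% overhead
print(f"raw: {raw_kb:.0f}KB/sec, with assumed overhead: {realistic_kb:.0f}KB/sec")
```

So the theoretical ceiling is about 190KB/sec, and something in the 160KB/sec neighborhood is a reasonable guess once overhead takes its cut.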
2TB of VM images? Well, 2.2TB really. 2.6TB including snapshots.
root@remotebackup:~# ssh 10.10.10.1 zfs list data/images
NAME          USED  AVAIL  REFER  MOUNTPOINT
data/images  2.64T  2.86T  2.19T  /data/images
We can see that on the source server, we have 2.64TB of storage used, out of which a little under 2.2TB is the images themselves (the rest is data contained in snapshots, but not contained in the current state of data/images). We do have a local copy, of course – like I said, this is a production example – but it’s close to a day out of date, during which time an office full of engineers has been busily working on drawings and documents. So, how do we replicate this monster over a $40/mo internet connection?
Getting the job done: syncoid!
root@remotebackup:~# /usr/local/bin/syncoid root@10.10.10.1:data/images backup/images
Sending incremental syncoid_remotebackup_2014-11-17:22:00:04 ... syncoid_cseremotebackup_2014-11-18:11:08:51 (~ 3.1 GB):
3.04GB 0:53:17 [ 998kB/s] [=============================> ] 99%
Easy peasy – a single command takes care of it. Over this ludicrously tiny internet connection, it did take a little under an hour to run… but that’s not so bad for a daily backup routine, and the 1MB/sec that our little internet pipe limited us to obviously isn’t putting a big strain on the underlying storage in the meantime. (If you’re wondering how we got 1MB/sec over a 1.5Mbps pipe that should only move 160KB/sec or so, the answer is built-in LZO compression. Snazzy, right?)
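You can back out roughly how hard the compression was working from the numbers in that progress line. This sketch assumes the progress meter reports in binary (1024-based) units, and it takes the 160KB/sec wire estimate at face value, so treat the resulting ratio as a ballpark rather than a measurement.

```python
def average_kib_per_sec(gib_sent: float, hours: int, minutes: int, seconds: int) -> float:
    """Average stream rate in KiB/sec, given GiB sent and elapsed time."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return gib_sent * 1024 * 1024 / elapsed


# From the progress line: 3.04GB sent in 0:53:17.
stream_rate = average_kib_per_sec(3.04, 0, 53, 17)

# Assumption: about 160KB/sec actually fits through the tunnel.
wire_rate = 160.0
implied_ratio = stream_rate / wire_rate

print(f"stream: {stream_rate:.0f}KB/sec, implied compression: about {implied_ratio:.1f}x")
```

The average works out to about 997KB/sec, which agrees nicely with the 998kB/s shown at the end of the run, and implies the data was shrinking by something like 6x on its way through the pipe. VM images full of engineering documents and empty space tend to compress well, so that is plausible, but your data may not be so cooperative.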
I love this stuff. I really, really do.