The devil is always in the details. Software systems can be quite complex, and assumptions about how they work can get you in trouble. This week, on my previously mentioned TV server, I noticed some nearly unwatchable shows. My initial thought was that I needed to realign my antenna, but then I noticed that the affected shows had all been recorded at the same time, while I was watching another one. Maybe they were bandwidth starved? The worst stutters were during scenes with lots of motion.
Now, in my previous setup, I know I was not able to watch anything while recording 4 HD shows, but at least I could record 4 HD shows simultaneously. The old setup was only one disk, though. Now I’m running a 4 disk ZFS RAIDZ. I absolutely know RAID5 type setups don’t perform like RAID0 setups (and that RAIDZ isn’t exactly RAID5), but early generalizations I’d read led me to expect RAID5 type performance. I didn’t investigate further and made assumptions based on my understanding of the technology. Boy, was I wrong. Take a look at the numbers from bonnie++:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dagobah.boonthe 16G    96  99 131609  34 104142  31   277  99 271779  41 116.4  14
Latency                 253ms    7136ms    7453ms   36211us     731ms     785ms
Version  1.96       ------Sequential Create------ --------Random Create--------
dagobah.boontheekul -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22615  96 +++++ +++ 17543  97 20137  91 +++++ +++ 18455  98
Latency               12940us    9652us     227us   18233us     153us     377us
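For scale, it helps to set the sequential block-write figure against what the recordings actually need. This is a rough sketch only. It assumes each HD stream tops out at the ATSC 1.0 broadcast payload rate of about 19.39 Mbit/s (an assumption, since the real bitrates of these recordings aren't stated) and that bonnie++ reports K/sec in units of 1024 bytes.

```python
# Assumed ceiling per HD stream: ATSC 1.0 broadcast payload rate, in Mbit/s.
# The recordings on this server may well use less.
ATSC_MBIT_PER_S = 19.39


def aggregate_mb_per_s(n_streams, mbit_per_stream=ATSC_MBIT_PER_S):
    """Combined rate of n_streams in megabytes per second (1 MB = 10**6 bytes)."""
    return n_streams * mbit_per_stream / 8


# 4 recordings plus 1 playback, the workload that stuttered.
needed = aggregate_mb_per_s(5)

# bonnie++ sequential block write from the table above, K/sec -> MB/s.
measured = 131609 * 1024 / 1e6

print(f"needed   ~{needed:.1f} MB/s")
print(f"measured ~{measured:.1f} MB/s sequential block write")
```

If those assumptions hold, five streams need only a small fraction of the measured sequential write rate, which hints that raw sequential bandwidth is not what runs out when several streams are read and written at once.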
Those are crazy numbers. So, I turned to Google and did some more reading. It turns out a RAIDZ’s write performance is much worse than I expected. This quote (well, quote of a quote) is very good:
"Now we come to the crucial decision ZFS has made for raidz and
raidz2: in raidz and raidz2, the data block is striped across all of
the disks. Instead of a model where a parity stripe is a bunch of data
blocks, each with an independent checksum, ZFS stripes a single data
block (and its parity), with a single checksum, across all the disks
(or as many of them as necessary).
This is a rational implementation decision, but when combined with the
need to verify checksums, it has an important consequence: in ZFS,
reads always involve all disks, because ZFS always must verify the
data block's checksum, which requires reading all of the data block,
which is spread across all of the drives. This is unlike normal RAID-5
or RAID-6, in which a small enough read will only touch one drive, and
means that adding more disks to a ZFS raidz pool does not increase how
many random reads you can do per second.
(A normal RAID-5 or RAID-6 array has a (theoretical) random read IO
capacity equal to the sum of the random IO operations rate of each of
the disks in the array, and so adding another disk adds its IOPs per
second to your read capacity. A ZFS raidz or raidz2 pool instead has a
capacity equal to the slowest disk's IOPs per second, and adding
another disk does nothing to help. Effectively a raidz ZFS gives you a
single disk's read IOPs per second rate.)"
This was on a blog of a SUN engineer (although a post from a few years
ago), unfortunately I don't have the link, I actually had to go
through my posting history on the Ars Technica forum to even find this
quote in the first place. If the situation has changed and the above
quote no longer holds true, it would be nice if someone more
knowledgeable on the performance implications could elaborate what
kind of performance is to be expected on a raidz system
- Sincerely,
Dan Naumov
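The model in that quote is simple enough to write down. A minimal sketch, using a made-up per-disk figure of 100 random reads per second (hypothetical, not measured on this server), and following the quote's description rather than anything verified against ZFS itself:

```python
def raid5_random_read_iops(disk_iops):
    """Conventional RAID-5/6 per the quote: a small read touches one disk,
    so the per-disk rates add up (a theoretical upper bound)."""
    return sum(disk_iops)


def raidz_random_read_iops(disk_iops):
    """raidz/raidz2 per the quote: every read touches all disks,
    so the slowest disk sets the rate."""
    return min(disk_iops)


# Hypothetical per-disk random read rates for a 4 disk array.
disks = [100, 100, 100, 100]

print("RAID-5 style:", raid5_random_read_iops(disks))
print("raidz:       ", raidz_random_read_iops(disks))
```

Under this model, adding a fifth disk raises the first number and leaves the second unchanged.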
Wow. In that same thread and in another I found, someone posted some benchmarking results they had done. They are very interesting. Follow these links:
http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-contig-write.png
http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-5MB-readwrite.png
http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-1MB-readwrite.png
I was quite surprised. Further reading has led me to rethink my setup. For this TV/media server I need the ability to read and write simultaneously at high rates. Write speed is more important, in that I’ll likely be recording more shows at any given time than I’m watching, but I’ll still need to be able to stream a couple of HD shows at the same time. Heck, my 4 tuners haven’t been enough on a couple of occasions. So, I’m going to have to sacrifice space for speed. I don’t know how I’m going to do the data shuffle, but I’m considering picking up another pair of 1TB drives and a PCI-Express SATA controller (only one free one left on the mainboard). That’ll help. Then I guess I’ll build the pool from mirrored pairs of 1TB drives. My read/write performance should improve, but adding 2TB of drives won’t give me any more space than I have now. It will be interesting to see what kind of numbers I get out of it.
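The space trade-off is simple arithmetic. A rough sketch, assuming four 1TB disks in the RAIDZ with single parity, and ignoring filesystem overhead, so the results are approximate:

```python
def raidz_usable_tb(n_disks, disk_tb, parity=1):
    """Approximate usable space of one raidz vdev: one disk's worth per parity level is lost."""
    return (n_disks - parity) * disk_tb


def mirrors_usable_tb(n_pairs, disk_tb):
    """Approximate usable space of a pool of 2-way mirrors: half the raw space."""
    return n_pairs * disk_tb


print("4 x 1TB raidz:       ", raidz_usable_tb(4, 1), "TB")
print("2 mirrored 1TB pairs:", mirrors_usable_tb(2, 1), "TB")
print("3 mirrored 1TB pairs:", mirrors_usable_tb(3, 1), "TB")
```

Six disks as three mirrored pairs land on the same usable space as the four disk RAIDZ, which is why two more drives buy speed rather than capacity.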
Update: I reconfigured the pool from a 4 disk RAIDZ to a pool of two 2 disk mirrors. Bonnie++ results:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dagobah.boonthe 16G    89  99 95149  25 87337  23   289  99 270826  31 214.8  28
Latency                 354ms   11106ms   10391ms   40391us    3882ms     464ms
Version  1.96       ------Sequential Create------ --------Random Create--------
dagobah.boontheekul -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20303  95 +++++ +++  8108  99 19027  97 11837  99  4149  99
Latency               13531us     265us     478us   39328us     308us    5922us
Update: I got in 2 more 1TB disks (and a HighPoint 2310, as I’ve used up all 6 on-board SATA ports). I added the 2 new disks (via the HighPoint, no other rearranging of the drives) as another mirror in the pool. Now the capacity is back up to what it was as a RAIDZ. Bonnie++ results:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dagobah.boonthe 16G   112  99 145807  39 124058  33   289  99 372925  46 274.9  22
Latency                 331ms    3891ms    7095ms   31678us    2229ms     457ms
Version  1.96       ------Sequential Create------ --------Random Create--------
dagobah.boontheekul -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22492  94 +++++ +++ 19562  97 16765  97 18918  99  6837  99
Latency               18093us    1113us     185us   37416us     203us     386us
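For reference, the pool layout described in these two updates corresponds to a `zpool create` with two `mirror` groups, followed by a `zpool add` of a third. The sketch below only builds and prints the command lines; it does not run them. The pool name `tank` and the device names are hypothetical placeholders, not the ones used on this server.

```python
def zpool_create_mirrors(pool, pairs):
    """Build the argv for creating a pool striped across 2-way mirrors."""
    argv = ["zpool", "create", pool]
    for first, second in pairs:
        argv += ["mirror", first, second]
    return argv


def zpool_add_mirror(pool, pair):
    """Build the argv for adding one more 2-way mirror to an existing pool."""
    first, second = pair
    return ["zpool", "add", pool, "mirror", first, second]


# Hypothetical pool and device names, for illustration only.
create = zpool_create_mirrors("tank", [("ada1", "ada2"), ("ada3", "ada4")])
add = zpool_add_mirror("tank", ("ada5", "ada6"))

# Printed rather than executed: creating a pool wipes the disks involved.
print(" ".join(create))
print(" ".join(add))
```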
It’s hard to draw any conclusions from this. The numbers for the pool of mirrors are completely different from what I expected. That said, interactive performance with workloads similar to the ones that caused me problems previously has much improved. Just from the reconfiguration of the existing drives, I was able to record 4 HD streams simultaneously while watching another, with no apparent stuttering or the like in any recording or in the playback. Adding the 2 additional disks as another mirror in the pool had an apparent impact on the bonnie++ numbers and brought my usable space back up to previous levels, but I’m still rather surprised at the bonnie++ numbers.
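One way to make the three runs easier to compare is to express each mirror configuration as a percentage change against the RAIDZ baseline. A small sketch using only the block write, rewrite, block read and seek figures copied from the tables in this post (the labels are shorthand chosen for this example):

```python
# Figures copied from the three bonnie++ runs in this post.
# Throughput values are in K/sec; seeks are per second.
RUNS = {
    "4 disk raidz": {"block write": 131609, "rewrite": 104142,
                     "block read": 271779, "seeks": 116.4},
    "2 mirrors": {"block write": 95149, "rewrite": 87337,
                  "block read": 270826, "seeks": 214.8},
    "3 mirrors": {"block write": 145807, "rewrite": 124058,
                  "block read": 372925, "seeks": 274.9},
}


def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(baseline_name, runs=RUNS):
    """Return {run: {metric: percent change against the baseline run}}."""
    base = runs[baseline_name]
    return {
        name: {metric: pct_change(value, base[metric])
               for metric, value in run.items()}
        for name, run in runs.items()
        if name != baseline_name
    }


for name, deltas in compare("4 disk raidz").items():
    line = ", ".join(f"{metric} {delta:+.1f}%" for metric, delta in deltas.items())
    print(f"{name}: {line}")
```

Seen this way, random seeks rise each time a mirror is added, which fits the quoted explanation, while sequential block writes dip with two mirrors and recover with three. A single bonnie++ run per configuration is thin evidence, so this is a hint rather than a conclusion.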