The Devil is in the Details

The devil is always in the details.  Software systems can be quite complex, and assumptions about how they work can get you in trouble.  This week I noticed some nearly unwatchable shows on my previously mentioned TV server.  My initial thought was that I needed to realign my antenna, but then I noticed that the bad shows had all been recorded at the same time, while I was watching another.  Maybe they were bandwidth starved?  The worst stutters were during scenes with lots of motion.  Now, in my previous setup, I know I was not able to watch anything while recording 4 HD shows.  At least I could record 4 HD shows simultaneously.  The old setup was only one disk, though.  Now I’m running a 4 disk ZFS RAIDZ.  I absolutely know RAID5 type setups don’t perform like RAID0 setups (and that RAIDZ isn’t exactly RAID5), but early generalizations I’d read led me to expect RAID5 type performance.  I didn’t investigate further and made assumptions based on my understanding of the technology.  Boy was I wrong.  Take a look at the numbers from bonnie++:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dagobah.boonthe 16G    96  99 131609  34 104142  31   277  99 271779  41 116.4  14
Latency               253ms    7136ms    7453ms   36211us     731ms     785ms
Version  1.96       ------Sequential Create------ --------Random Create--------
dagobah.boontheekul -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 22615  96 +++++ +++ 17543  97 20137  91 +++++ +++ 18455  98
Latency             12940us    9652us     227us   18233us     153us     377us

Those are crazy numbers.  So, I turned to Google and did some more reading.  It turns out a RAIDZ’s random I/O performance is much worse than I expected.  This quote (well, quote of a quote) is very good:

"Now we come to the crucial decision ZFS has made for raidz and
raidz2: in raidz and raidz2, the data block is striped across all of
the disks. Instead of a model where a parity stripe is a bunch of data
blocks, each with an independent checksum, ZFS stripes a single data
block (and its parity), with a single checksum, across all the disks
(or as many of them as necessary).

This is a rational implementation decision, but when combined with the
need to verify checksums, it has an important consequence: in ZFS,
reads always involve all disks, because ZFS always must verify the
data block's checksum, which requires reading all of the data block,
which is spread across all of the drives. This is unlike normal RAID-5
or RAID-6, in which a small enough read will only touch one drive, and
means that adding more disks to a ZFS raidz pool does not increase how
many random reads you can do per second.

(A normal RAID-5 or RAID-6 array has a (theoretical) random read IO
capacity equal to the sum of the random IO operations rate of each of
the disks in the array, and so adding another disk adds its IOPs per
second to your read capacity. A ZFS raidz or raidz2 pool instead has a
capacity equal to the slowest disk's IOPs per second, and adding
another disk does nothing to help. Effectively a raidz ZFS gives you a
single disk's read IOPs per second rate.)"

This was on a blog of a SUN engineer (although a post from a few years
ago), unfortunately I don't have the link, I actually had to go
through my posting history on the Ars Technica forum to even find this
quote in the first place. If the situation has changed and the above
quote no longer holds true, it would be nice if someone more
knowledgeable on the performance implications could elaborate what
kind of performance is to be expected on a raidz system :) 

- Sincerely,
Dan Naumov
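
To put the quoted claim into numbers, here is a minimal sketch of the rule of thumb it describes.  The per-disk figure of 100 random IOPS is a made-up placeholder, not a measurement, and the mirror formula is a common rule of thumb I am assuming (each disk in a mirror holds a full copy and can serve reads on its own); the quote itself only covers the raidz case.  These are theoretical upper bounds, not predictions.

```python
# Rough random-read IOPS model for a ZFS pool (rule of thumb, not a benchmark).
DISK_IOPS = 100  # ASSUMPTION: hypothetical random IOPS of a single disk


def raidz_read_iops(vdevs: int, disk_iops: int = DISK_IOPS) -> int:
    """Per the quote: a raidz vdev gives roughly one disk's random-read IOPS,
    because every block read has to touch all of the disks in the vdev."""
    return vdevs * disk_iops


def mirror_read_iops(vdevs: int, disks_per_mirror: int = 2,
                     disk_iops: int = DISK_IOPS) -> int:
    """ASSUMPTION (rule of thumb): every disk in a mirror holds a full copy,
    so random reads can be spread over all of the disks."""
    return vdevs * disks_per_mirror * disk_iops


layouts = {
    "one 4-disk raidz": raidz_read_iops(vdevs=1),
    "two 2-disk mirrors": mirror_read_iops(vdevs=2),
    "three 2-disk mirrors": mirror_read_iops(vdevs=3),
}

for name, iops in layouts.items():
    print(f"{name:22s} ~{iops} random read IOPS (upper bound)")
```

Under those assumptions the same four disks go from about one disk's worth of random reads to about four disks' worth when arranged as mirrors.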

Wow.  In that same thread and in another I found, someone posted some benchmarking results they had done.  They are very interesting.  Follow these links:

http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-contig-write.png

http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-5MB-readwrite.png

http://virtual.tehinterweb.net/livejournal/2009-06-22_zfs_diskperf/zfs-diskperf-1MB-readwrite.png

I was quite surprised.  Further reading has led me to rethink my setup.  For this TV/media server I need the ability to read and write simultaneously at high rates.  Write speed is more important in that I’ll likely be recording more shows at any given time than I’m watching, but I’ll still need to be able to stream a couple of HD shows at the same time.  Heck, my 4 tuners haven’t been enough on a couple of occasions.  So, I’m going to have to sacrifice space for speed.  I don’t know how I’m going to do the data shuffle, but I’m considering picking up another pair of 1TB drives and a PCI-Express SATA controller (only one free one left on the mainboard).  That’ll help.  Then I guess I’ll build the pool from mirrored pairs of 1TB drives.  My read/write performance should improve, but adding 2TB worth of drives won’t give me any more space than I have now.  It will be interesting to see what kind of numbers I get out of it.
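
The space trade-off is simple arithmetic.  A rough sketch, assuming 1TB disks throughout, a single-parity RAIDZ, and ignoring filesystem overhead (the function names are just for illustration):

```python
def raidz_usable_tb(disks: int, disk_tb: float, parity: int = 1) -> float:
    """Usable space of one raidz vdev: one disk's worth goes to parity."""
    return (disks - parity) * disk_tb


def mirrors_usable_tb(mirrors: int, disk_tb: float) -> float:
    """Usable space of a pool of mirrors: one disk's worth per mirror,
    no matter how many copies each mirror holds."""
    return mirrors * disk_tb


current = raidz_usable_tb(disks=4, disk_tb=1.0)        # today's 4 disk RAIDZ
two_pairs = mirrors_usable_tb(mirrors=2, disk_tb=1.0)   # same 4 disks as mirrors
three_pairs = mirrors_usable_tb(mirrors=3, disk_tb=1.0)  # after buying 2 more

print(f"4 disk RAIDZ:      {current:.0f} TB usable")
print(f"2 mirrored pairs:  {two_pairs:.0f} TB usable")
print(f"3 mirrored pairs:  {three_pairs:.0f} TB usable")
```

So the four existing disks drop from 3TB to 2TB usable as mirrors, and the extra pair only brings the pool back to the 3TB it started with.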

Update: I reconfigured the pool from a 4 disk RAIDZ to a pool of two 2 disk mirrors.  Bonnie++ results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dagobah.boonthe 16G    89  99 95149  25 87337  23   289  99 270826  31 214.8  28
Latency               354ms   11106ms   10391ms   40391us    3882ms     464ms
Version  1.96       ------Sequential Create------ --------Random Create--------
dagobah.boontheekul -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 20303  95 +++++ +++  8108  99 19027  97 11837  99  4149  99
Latency             13531us     265us     478us   39328us     308us    5922us

Update: I got in 2 more 1TB disks (and a HighPoint 2310, as I’ve used up all 6 on-board SATA ports).  I added the 2 new disks (via the HighPoint, no other rearranging of the drives) as another mirror in the pool.  Now the capacity is back up to what it was as a RAIDZ.  Bonnie++ results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dagobah.boonthe 16G   112  99 145807  39 124058  33   289  99 372925  46 274.9  22
Latency               331ms    3891ms    7095ms   31678us    2229ms     457ms
Version  1.96       ------Sequential Create------ --------Random Create--------
dagobah.boontheekul -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 22492  94 +++++ +++ 19562  97 16765  97 18918  99  6837  99
Latency             18093us    1113us     185us   37416us     203us     386us
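
To make the three runs easier to compare, here is a small sketch that computes the percentage change of a few headline figures against the original RAIDZ run.  The numbers are copied from the bonnie++ output in this post; the choice of which columns to compare is mine.

```python
# Selected bonnie++ figures from the three runs in this post.
# Throughput is in K/sec, random seeks are per second.
runs = {
    "4 disk RAIDZ": {"block_write": 131609, "rewrite": 104142,
                     "block_read": 271779, "seeks": 116.4},
    "2 mirrors":    {"block_write": 95149, "rewrite": 87337,
                     "block_read": 270826, "seeks": 214.8},
    "3 mirrors":    {"block_write": 145807, "rewrite": 124058,
                     "block_read": 372925, "seeks": 274.9},
}


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


baseline = runs["4 disk RAIDZ"]
changes = {
    name: {metric: pct_change(baseline[metric], value)
           for metric, value in figures.items()}
    for name, figures in runs.items()
    if name != "4 disk RAIDZ"
}

for name, deltas in changes.items():
    print(f"{name} vs 4 disk RAIDZ")
    for metric, delta in deltas.items():
        print(f"  {metric:12s} {delta:+7.1f}%")
```

Random seeks come out roughly 85% higher with two mirrors and roughly 136% higher with three, while sequential block writes are roughly 28% lower with two mirrors and roughly 11% higher with three.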

It’s hard to draw any conclusions from this.  The numbers for the pool of mirrors are completely different from what I expected.  In addition, I can say that interactive performance under workloads similar to the ones that caused me problems previously has much improved.  Just from the reconfiguration of the existing drives, I was able to record 4 HD streams simultaneously while watching another, with no apparent stuttering or the like in any recording or in the playback.  Adding the 2 additional disks as another mirror in the pool had an apparent impact on the bonnie++ numbers and brought my usable space back up to previous levels, but I’m still rather surprised at the bonnie++ numbers.
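
One possible reason the sequential bonnie++ figures say so little about the stuttering: the raw bandwidth of the streams is small.  A back-of-the-envelope sketch, assuming over-the-air ATSC broadcasts at the full 19.39 Mbit/s per channel (an assumption on my part; real streams are often lower) and treating bonnie++'s K/sec as KiB per second:

```python
ATSC_MBIT_PER_S = 19.39   # ASSUMPTION: full ATSC transport stream bitrate
RECORDINGS = 4            # HD shows being recorded at once
PLAYBACKS = 1             # HD shows being watched at once

# Sequential block write of the original 4 disk RAIDZ run, from bonnie++ above.
RAIDZ_BLOCK_WRITE_K_PER_S = 131609


def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to (decimal) megabytes per second."""
    return mbit_per_s / 8.0


write_mb_per_s = mbit_to_mbyte(RECORDINGS * ATSC_MBIT_PER_S)
read_mb_per_s = mbit_to_mbyte(PLAYBACKS * ATSC_MBIT_PER_S)
total_mb_per_s = write_mb_per_s + read_mb_per_s

raidz_write_mb_per_s = RAIDZ_BLOCK_WRITE_K_PER_S * 1024 / 1_000_000
share = total_mb_per_s / raidz_write_mb_per_s * 100.0

print(f"streams need about {total_mb_per_s:.1f} MB/s in total")
print(f"RAIDZ sequential write measured about {raidz_write_mb_per_s:.1f} MB/s")
print(f"that is about {share:.0f}% of the sequential figure")
```

If those assumptions are anywhere near right, the streams need on the order of 12 MB/s, a small fraction of what even the RAIDZ managed sequentially, which would point at the mix of concurrent reads and writes rather than raw throughput.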
