Solaris 11.2: a perfectly timed release that saved me hours of resilver time.

Solaris 11.2 was released on July 31st, 2014, and with it comes zpool version 35, among many other things that are mostly irrelevant to a home user. The new zpool version 35, however, brings a marked increase in resilver performance, which is what I delve into in this post. Below is the description Oracle gives zpool version 35 in the output of zpool upgrade -v:

gullibleadmin@epijunkie:~# zpool upgrade -v
This system is currently running ZFS pool version 35.

The following versions are supported:

VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Deduplication
22 Received properties
23 Slim ZIL
24 System attributes
25 Improved scrub stats
26 Improved snapshot deletion performance
27 Improved snapshot creation performance
28 Multiple vdev replacements
29 RAID-Z/mirror hybrid allocator
30 Encryption
31 Improved 'zfs list' performance
32 One MB blocksize
33 Improved share support
34 Sharing with inheritance
35 Sequential resilver
 
For more information on a particular version, including supported releases,
see the ZFS Administration Guide.

Oracle defines “Sequential resilver” as follows:

The previous resilvering algorithm repairs blocks from oldest to newest, which can degrade into a lot of small random I/O. The new resilvering algorithm uses a two-step process to sort and resilver blocks in LBA order.

The amount of improvement depends on how pool data is laid out. For example, sequentially written data on a mirrored pool shows no improvement, but randomly written data or sequentially written data on RAID-Z improves significantly – typically reducing time by 25 to 50 percent.

I can attest to the performance increase this provides: a single-drive resilver now takes only half the time it used to. Without the new algorithm in zpool version 35, I suspect I could not reasonably have replaced all the drives in the vdev simultaneously without massive IO problems, which would have translated into more than a week of dedicated resilver time.

This project began with buying 6x 5TB drives to add as a RAIDZ2 vdev to my current pool. As I already have a 6x RAIDZ2 vdev of 3TB drives, the plan is to swap the 5TB drives in place of the current 3TB drives, relying on the autoexpand=on setting, and then, once the single (6x 5TB) vdev has resilvered and autoexpanded, add the freed 3TB drives back as a second vdev; a sketch of the command sequence follows this paragraph. Done this way, the pool ends up with much closer to equal free space between the vdevs than if I were simply to add the 5TB drives as a new vdev. From what I’ve read and watched, ZFS uses dynamic striping [Page 37], meaning data is adaptively striped across the vdevs with free space as the primary metric, and as you can see below, swapping to the 5TB drives before adding the second vdev gives a distinct advantage. Also worth mentioning: by using zfs send and zfs recv to a new pool (consisting of the 6x 5TB drives) instead of swapping the drives in place, I would effectively defragment the pool, but to me that is not so important.
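Concretely, the swap boils down to a short command sequence. This is only a sketch: the pool name is the one used later in this post, and the device names are placeholders for whichever 3TB disk is coming out and whichever 5TB disk is going in.

# Let the vdev grow to the new drive size once every member has been replaced
# (I actually left this property off until later; see the update at the end of this post).
zpool set autoexpand=on jackscoldsweat

# One replace per disk, old device first, new device second; the pool stays
# online and keeps full parity while each replacement resilvers.
zpool replace jackscoldsweat c0t<OLD_3TB_WWN>d0 c0t<NEW_5TB_WWN>d0

# After all six replacements finish, the freed 3TB drives come back as a second RAIDZ2 vdev.
zpool add jackscoldsweat raidz2 c0t<WWN_1>d0 c0t<WWN_2>d0 c0t<WWN_3>d0 c0t<WWN_4>d0 c0t<WWN_5>d0 c0t<WWN_6>d0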

[Table: ZFS write balance]

This table shows how the pool’s free space would be divided after adding a vdev. The table is presented in raw space in terabytes, as drive manufacturers define a terabyte. According to George Wilson, a former developer at Sun and Oracle, the code that spreads writes across vdevs, at least up to zpool version 28, attempts to balance free space between vdevs but uses a weak algorithm with a maximum preferential write load of 25%. In my case the vdevs would never reach a balance of new data being written to the pool if I were simply to add a 6x 5TB vdev. I speculate that this would eventually cause a catastrophic performance loss as the full vdev approached its capacity and the 512-byte striping requirement struggled to find space amid the fragmentation while writing data across all drives in all vdevs. There is hope for users of the feature-flag branch of ZFS, as the OpenZFS developers have since improved the balancing of writes toward free space across vdevs of varying age.
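A rough back-of-the-envelope illustrates the skew, using the figures from the zpool iostat output later in this post (12.9T allocated and 3.41T free of raw space on the existing 6x 3TB RAIDZ2) and converting manufacturer terabytes the same way zpool reports them (6x 3TB is roughly 16.4T raw, 6x 5TB roughly 27.3T raw):

Option A, add the 5TB drives as a second vdev:
    3TB vdev free space: ~3.4T          5TB vdev free space: ~27.3T (empty)

Option B, swap to 5TB in place, then add the 3TB drives as a second vdev:
    5TB vdev free space: ~27.3T - 12.9T = ~14.4T          3TB vdev free space: ~16.4T (empty)

Option B starts out nearly balanced, while option A begins with roughly an 8:1 skew that a 25% maximum write preference would take a very long time to correct.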

I am replacing the drives using the zpool replace command; a sketch of the form appears just below. When replacing drives this way, as long as the drive being replaced is still attached it remains online and current, so the pool is at no greater risk of data loss in terms of parity: the old drive and its parity are still online. The output below is from the pool while still on zpool version 34, and as you can see, the resilvering process is dragging along due to the IO limit of the vdev as it is asked to reassemble blocks from oldest to newest. This is actually where my experience with zpool version 35 started: I happened to come across the news that Solaris 11.2 had been released from beta, noticed the new zpool version described as “Sequential resilver,” and it piqued my interest since my current resilver was dragging along. After reading as much as I could about a newly released product, I decided to test it out and, in my impatience, ended up detaching the drive before the resilver process completed on version 34.
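The replace itself takes the old device first and the new device second; for the pair shown under replacing-0 in the status output below, the command looks something like this:

zpool replace version34 c0t50014EE6ADAECACFd0 c0t5000C50073B0B76Ad0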

gullibleadmin@epijunkie:~# zpool status version34
  pool: version34
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Aug  2 16:33:38 2014
    3.12T scanned out of 12.9T at 45.4M/s, 62h28m to go
    1.04T resilvered, 24.27% done
config:

        NAME                         STATE     READ WRITE CKSUM
        version34                    DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            replacing-0              DEGRADED     0     0     0
              c0t50014EE6ADAECACFd0  ONLINE       0     0     0
              c0t5000C50073B0B76Ad0  DEGRADED     0     0     0  (resilvering)
            c0t50014EE6ADAF9785d0    ONLINE       0     0     0
            c0t50014EE0037F34DAd0    ONLINE       0     0     0
            c0t50014EE2B2B5AF19d0    ONLINE       0     0     0
            c0t50014EE6583B7CA1d0    ONLINE       0     0     0
            c0t50014EE6585382A7d0    ONLINE       0     0     0

errors: No known data errors

Here is the output from running the same replace command, now on zpool version 35. As you can see below, the process is different with the new algorithm (Stage 1: scanning):

gullibleadmin@epijunkie:~# zpool status -lx
  pool: version35
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Sun Aug  3 15:14:45 2014
    4.58T scanned out of 12.9T at 3.51G/s, 0h40m to go
    0 resilvered
config:

        NAME                         STATE     READ WRITE CKSUM
        version35                    DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            replacing-0              DEGRADED     0     0     0
              c0t50014EE6ADAECACFd0  ONLINE       0     0     0
              c0t5000C50073B0B76Ad0  DEGRADED     0     0     0  (resilvering)
            c0t50014EE6ADAF9785d0    ONLINE       0     0     0
            c0t50014EE0037F34DAd0    ONLINE       0     0     0
            c0t50014EE2B2B5AF19d0    ONLINE       0     0     0
            c0t50014EE6583B7CA1d0    ONLINE       0     0     0
            c0t50014EE6585382A7d0    ONLINE       0     0     0

errors: No known data errors

The first stage of the resilver process is where ZFS scans the entire pool, deferring the writes of resilvered data to the new drive until the second stage. I speculate that in stage one ZFS reads the entire vdev in LBA order, that is to say in the order the data currently sits on the disk, from address 0 on up. I further speculate that ZFS then compares that layout against block age, by which I mean the chronological order in which the pool wrote the blocks, an order that typically no longer correlates with LBA order once data is non-sequential and fragmented. From these two data points, I suspect ZFS builds a reference table that lets it resilver the drive sequentially and maximize drive bandwidth, rather than working by block age, where fragmentation plays a large role in the performance.

Once stage 2 starts, the data is actually written to the new drive(s) and the speed inevitably drops to a lower but consistent rate; the numbers I observed ranged from 109M/s to 145M/s with a median of 119M/s. That is a large difference from zpool version 34, where the bandwidth swung over a far wider range and the median was much lower: on the version 34 resilver I observed anywhere from 1M/s to 110M/s with a median of 25M/s. As you’ll notice from the stage 2 output below (zpool version 35), the speed observed is likely the write-bandwidth limit of the new drive constraining the resilver, not the IO limit of the vdev. That is still up for debate; see the end of this post, where I update with more experimenting.
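To see for yourself which side is the bottleneck, zpool iostat takes an interval argument, so a rolling per-device view of the pool shown below is just:

zpool iostat -v version35 10

During stage 2, the new drive’s write column should sit near its sequential write limit if the drive is the constraint, while the reads stay spread across the surviving members of the vdev.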

Stage 2:

gullibleadmin@epijunkie:~# zpool status version35
  pool: version35
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Sun Aug  3 15:14:45 2014
    12.9T scanned
    638G resilvered at 129M/s, 29.01% done, 20h42m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        version35                    DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            replacing-0              DEGRADED     0     0     0
              c0t50014EE6ADAECACFd0  ONLINE       0     0     0
              c0t5000C50073B0B76Ad0  DEGRADED     0     0     0  (resilvering)
            c0t50014EE6ADAF9785d0    ONLINE       0     0     0
            c0t50014EE0037F34DAd0    ONLINE       0     0     0
            c0t50014EE2B2B5AF19d0    ONLINE       0     0     0
            c0t50014EE6583B7CA1d0    ONLINE       0     0     0
            c0t50014EE6585382A7d0    ONLINE       0     0     0

errors: No known data errors

I imagine the number of home users running Solaris is small, but to those who do, I highly recommend upgrading to zpool version 35 (the command is below). It will be interesting to see how the new algorithm handles drives that are beginning to fail, especially in home environments where lower-performance, lower-quality desktop drives are typical. I say this in part because of something I observed when I issued the zpool replace command for the remaining five drives before the first drive’s resilver had completed: the resilvering process restarted for the entire pool/vdev, including the first drive, which had already progressed to 80%. I’m not sure whether this was because the replaced drive was still online, a byproduct of the new code in zpool version 35 / Solaris 11.2, or simply a bug in the code. I don’t believe it was merely a bug in the zpool status output either: well after the projected completion time for the first drive (as projected before I issued the replace for the other five), the drive IO still indicated resilvering and zpool status still showed the drive being replaced/resilvered. As I did not lose any data from this blip I am not terribly concerned, but it is a curious ‘feature’.
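Once on Solaris 11.2, the pool upgrade itself is a single command per pool, or all pools at once with -a. Keep in mind it is a one-way operation: a version 35 pool can no longer be imported by releases that do not support that version.

zpool upgrade jackscoldsweat
# or, for every pool on the system:
zpool upgrade -a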

I’ll end this post with some more command output, because who doesn’t love tables and graphs:

gullibleadmin@epijunkie:~# zpool iostat -v jackscoldsweat

                               capacity     operations    bandwidth
pool                         alloc   free   read  write   read  write
---------------------------  -----  -----  -----  -----  -----  -----
jackscoldsweat               12.9T  3.41T  1.18K      0   147M  1.59K
  raidz2                     12.9T  3.41T  1.18K      0   147M  1.59K
    replacing                    -      -  1.16K     20  36.8M  90.1K
      c0t50014EE6ADAECACFd0      -      -  1.03K      0  36.9M  3.71K
      c0t5000C50073B0B76Ad0      -      -      0    998      0  36.9M
    replacing                    -      -  1.16K     20  36.8M  98.1K
      c0t50014EE6ADAF9785d0      -      -   1020      0  36.9M  3.71K
      c0t5000C50073B0CDEDd0      -      -      0    982      0  36.9M
    replacing                    -      -  1.15K     21  36.7M   105K
      c0t50014EE0037F34DAd0      -      -   1010      0  36.7M  3.71K
      c0t5000C50073B0D5DFd0      -      -      0    987      0  36.8M
    replacing                    -      -  1.16K     21  36.8M   101K
      c0t50014EE2B2B5AF19d0      -      -   1008      0  36.9M  3.71K
      c0t5000C50073B4E90Cd0      -      -      0    970      0  36.9M
    replacing                    -      -  1.16K     20  36.8M  99.6K
      c0t50014EE6583B7CA1d0      -      -   1006      0  36.9M  3.71K
      c0t5000C50073B1031Dd0      -      -      0    993      0  36.9M
    replacing                    -      -  1.15K     22  36.8M   111K
      c0t50014EE6585382A7d0      -      -   1006      0  36.8M  3.71K
      c0t5000C50073B08181d0      -      -      0    972      0  36.9M
---------------------------  -----  -----  -----  -----  -----  -----

gullibleadmin@epijunkie:~# zpool status jackscoldsweat
  pool: jackscoldsweat
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Mon Aug  4 15:52:17 2014
    12.9T scanned
    7.49T resilvered at 124M/s, 58.11% done, 12h41m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        jackscoldsweat               DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            replacing-0              DEGRADED     0     0     0
              c0t50014EE6ADAECACFd0  ONLINE       0     0     0
              c0t5000C50073B0B76Ad0  DEGRADED     0     0     0  (resilvering)
            replacing-1              DEGRADED     0     0     0
              c0t50014EE6ADAF9785d0  ONLINE       0     0     0
              c0t5000C50073B0CDEDd0  DEGRADED     0     0     0  (resilvering)
            replacing-2              DEGRADED     0     0     0
              c0t50014EE0037F34DAd0  ONLINE       0     0     0
              c0t5000C50073B0D5DFd0  DEGRADED     0     0     0  (resilvering)
            replacing-3              DEGRADED     0     0     0
              c0t50014EE2B2B5AF19d0  ONLINE       0     0     0
              c0t5000C50073B4E90Cd0  DEGRADED     0     0     0  (resilvering)
            replacing-4              DEGRADED     0     0     0
              c0t50014EE6583B7CA1d0  ONLINE       0     0     0
              c0t5000C50073B1031Dd0  DEGRADED     0     0     0  (resilvering)
            replacing-5              DEGRADED     0     0     0
              c0t50014EE6585382A7d0  ONLINE       0     0     0
              c0t5000C50073B08181d0  DEGRADED     0     0     0  (resilvering)

errors: No known data errors

Another thanks goes out to Allan Jude for inspiring the idea of swapping the drives to create a larger vdev and better balance the free-space ratio.

 

UPDATE: 2014 August 6, Wednesday 1137

Keeping my options open, I left the autoexpand flag set to “off” until I was sure this was the course of action I wanted to take. My reservation was the benefit of defragmenting the pool by instead creating a new pool and zfs send-ing the data over rather than swapping the drives in place; that route would have required me to re-replace the 5TB drives with the 3TB drives, which I was prepared to do. Curiosity had me see what would happen if I swapped the drives back using the zpool replace command.
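For reference, autoexpand is an ordinary pool property, checked and set like any other; the status output that follows is the pool partway through the swap back to the 3TB drives:

zpool get autoexpand jackscoldsweat
# only once committed to keeping the 5TB drives in place:
zpool set autoexpand=on jackscoldsweat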

gullibleadmin@epijunkie:~# zpool status jackscoldsweat
  pool: jackscoldsweat
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Wed Aug  6 10:52:37 2014
    12.9T scanned
    488G resilvered at 331M/s, 3.70% done, 10h54m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        jackscoldsweat               DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            replacing-0              DEGRADED     0     0     0
              c0t5000C50073B0B76Ad0  ONLINE       0     0     0
              c0t50014EE6ADAECACFd0  DEGRADED     0     0     0  (resilvering)
            replacing-1              DEGRADED     0     0     0
              c0t5000C50073B0CDEDd0  ONLINE       0     0     0
              c0t50014EE6ADAF9785d0  DEGRADED     0     0     0  (resilvering)
            replacing-2              DEGRADED     0     0     0
              c0t5000C50073B0D5DFd0  ONLINE       0     0     0
              c0t50014EE0037F34DAd0  DEGRADED     0     0     0  (resilvering)
            replacing-3              DEGRADED     0     0     0
              c0t5000C50073B4E90Cd0  ONLINE       0     0     0
              c0t50014EE2B2B5AF19d0  DEGRADED     0     0     0  (resilvering)
            replacing-4              DEGRADED     0     0     0
              c0t5000C50073B1031Dd0  ONLINE       0     0     0
              c0t50014EE6583B7CA1d0  DEGRADED     0     0     0  (resilvering)
            replacing-5              DEGRADED     0     0     0
              c0t5000C50073B08181d0  ONLINE       0     0     0
              c0t50014EE6585382A7d0  DEGRADED     0     0     0  (resilvering)

errors: No known data errors

Hmm, this is curious. Did the resilver from the first round of replacements effectively defragment the data? It seems like it. The first replacement ended up taking 35 hours and 29 minutes, while this swap back is projecting roughly a third of that time, after what I suspect was a resilver AND a defragmentation.

gullibleadmin@epijunkie:~# zpool status jackscoldsweat
  pool: jackscoldsweat
 state: ONLINE
  scan: resilvered 12.9T in 35h29m with 0 errors on Wed Aug  6 03:21:22 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        jackscoldsweat             ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t5000C50073B0B76Ad0  ONLINE       0     0     0
            c0t5000C50073B0CDEDd0  ONLINE       0     0     0
            c0t5000C50073B0D5DFd0  ONLINE       0     0     0
            c0t5000C50073B4E90Cd0  ONLINE       0     0     0
            c0t5000C50073B1031Dd0  ONLINE       0     0     0
            c0t5000C50073B08181d0  ONLINE       0     0     0

errors: No known data errors