Just so I don't forget: in a mirrored setup with spares, replacing failed disks works a little differently than in a raidz setup.
This is what it looks like when it is "broken":
[root@myserver-2 (de-dc-2) ~]# zpool status
  pool: zones
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 38,6G in 0 days 00:21:01 with 0 errors on Sun Feb 19 06:43:45 2023
config:

        NAME          STATE     READ WRITE CKSUM
        zones         DEGRADED     0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
          mirror-1    DEGRADED     0     0     0
            spare-0   DEGRADED     0     0     0
              c1t2d0  REMOVED      0     0     0
              c1t6d0  ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
        spares
          c1t6d0      INUSE     currently in use
          c1t7d0      AVAIL

errors: No known data errors
"Broken" is not quite accurate, of course: one of the spares that we keep around for exactly this case has already stepped in.
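Checking pool health from a script is easy, by the way: `zpool status -x` prints exactly the single line "all pools are healthy" when there is nothing to report, so anything else is worth an alert. A minimal sketch, fed with a captured value instead of the live command so it also runs on machines without ZFS:

```shell
# 'zpool status -x' prints "all pools are healthy" when all is well;
# everything else deserves attention.
status="all pools are healthy"   # in real use: status=$(zpool status -x)
if [ "$status" = "all pools are healthy" ]; then
  result="OK"
else
  result="ATTENTION: $status"
fi
echo "$result"
```

Dropped into cron, the `ATTENTION` branch would send the mail.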
From the RAID controller's point of view, the situation looks like this at that moment:
[root@myserver-2 (de-dc-2) ~]# /opt/root/storcli/storcli /c0 /eall /sall show
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
-------------------------------------------------------------------------------
EID:Slt DID State  DG        Size Intf Med SED PI SeSz Model       Sp Type
-------------------------------------------------------------------------------
252:0    15 Onln    0 278.875 GB  SAS  HDD N   N  512B MK3001GRRB  U  -
252:1    16 Onln    1 278.875 GB  SAS  HDD N   N  512B ST9300653SS U  -
252:2    17 Failed  7 278.875 GB  SAS  HDD N   N  512B MK3001GRRB  U  -
252:3    10 Onln    2 278.875 GB  SAS  HDD N   N  512B ST9300653SS U  -
252:4     9 Onln    3 278.875 GB  SAS  HDD N   N  512B ST9300653SS U  -
252:5    11 Onln    4 278.875 GB  SAS  HDD N   N  512B MK3001GRRB  U  -
252:6    12 Onln    5 278.875 GB  SAS  HDD N   N  512B ST9300653SS U  -
252:7    13 Onln    6 278.875 GB  SAS  HDD N   N  512B ST9300653SS U  -
-------------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
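To pick the failed slot out of that table from a script, a bit of awk on the drive lines is enough. A sketch, using three of the rows captured above as sample input (in real use the input would come from `storcli /c0 /eall /sall show`):

```shell
# Print EID:Slt and State for every drive the controller does not
# report as online ("Onln"), using captured sample rows.
drive_table='252:0 15 Onln 0 278.875 GB SAS HDD N N 512B MK3001GRRB U -
252:2 17 Failed 7 278.875 GB SAS HDD N N 512B MK3001GRRB U -
252:3 10 Onln 2 278.875 GB SAS HDD N N 512B ST9300653SS U -'
not_online=$(printf '%s\n' "$drive_table" | awk '$3 != "Onln" { print $1, $3 }')
echo "$not_online"
```

With the sample rows above, this prints the one bad drive, `252:2 Failed`.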
After the disk has been swapped, the output looks like this:
252:2 18 UGood F 278.875 GB SAS HDD N N 512B ST9300653SS U -
The disk apparently still carries a "f"oreign configuration, which we need to delete first:
[root@myserver-2 (de-dc-2) ~]# /opt/root/storcli/storcli /c0 /fall delete
Controller = 0
Status = Success
Description = Successfully deleted foreign configuration
The RAID controller then sees it like this:
252:2 18 UGood - 278.875 GB SAS HDD N N 512B ST9300653SS U -
Next, quickly create a (single-disk) RAID-0 on the drive:
[root@myserver-2 (de-dc-2) ~]# /opt/root/storcli/storcli /c0 add vd type=r0 drives=252:2
Controller = 0
Status = Success
Description = Add VD Succeeded
After that, the disk can be repartitioned and relabeled with format -e and fdisk.
From ZFS's perspective, the situation now looks like this:
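For reference, the device nodes those tools operate on follow the usual illumos naming scheme: `p0` is the whole disk as fdisk sees it, `s2` the traditional whole-disk slice. A small sketch, assuming c1t2d0 as in the output above (the fdisk call is commented out because it is destructive):

```shell
# Map the ctd disk name to its raw device nodes (illumos naming).
disk=c1t2d0
raw_disk="/dev/rdsk/${disk}p0"     # whole disk, fdisk's view
whole_slice="/dev/rdsk/${disk}s2"  # traditional whole-disk slice
# fdisk -B "$raw_disk"   # would write a default Solaris partition table (destructive!)
echo "$raw_disk $whole_slice"
```

`fdisk -B` writes a default partition table spanning the whole disk, which saves a trip through the interactive menus.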
[root@myserver-2 (de-dc-2) ~]# zpool status -x
  pool: zones
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub in progress since Fri Feb 24 14:09:14 2023
        115G scanned at 373M/s, 75,3G issued at 243M/s, 115G total
        0 repaired, 65,17% done, 0 days 00:02:49 to go
config:

        NAME          STATE     READ WRITE CKSUM
        zones         DEGRADED     0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
          mirror-1    DEGRADED     0     0     0
            spare-0   DEGRADED     0     0     0
              c1t2d0  UNAVAIL      0     0     0  corrupted data
              c1t6d0  ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
        spares
          c1t6d0      INUSE     currently in use
          c1t7d0      AVAIL

errors: No known data errors
And now it really is enough to tell ZFS with zpool replace zones c1t2d0
that the failed disk has been swapped out, and ZFS starts resilvering:
[root@myserver-2 (de-dc-2) ~]# zpool status -x
  pool: zones
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Feb 24 14:16:05 2023
        115G scanned at 1,43G/s, 78,0G issued at 987M/s, 115G total
        1,40G resilvered, 67,58% done, 0 days 00:00:38 to go
config:

        NAME               STATE     READ WRITE CKSUM
        zones              DEGRADED     0     0     0
          mirror-0         ONLINE       0     0     0
            c1t0d0         ONLINE       0     0     0
            c1t1d0         ONLINE       0     0     0
          mirror-1         DEGRADED     0     0     0
            spare-0        DEGRADED     0     0     0
              replacing-0  UNAVAIL      0     0     0
                c1t2d0/old UNAVAIL      0     0     0  corrupted data
                c1t2d0     ONLINE       0     0 1,06K  (resilvering)
              c1t6d0       ONLINE       0     0     0
            c1t3d0         ONLINE       0     0     0
          mirror-2         ONLINE       0     0     0
            c1t4d0         ONLINE       0     0     0
            c1t5d0         ONLINE       0     0     0
        spares
          c1t6d0           INUSE     currently in use
          c1t7d0           AVAIL

errors: No known data errors
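If a follow-up step (a scrub, say) should only run once the resilver is done, polling `zpool status` and grepping its scan line is enough. A sketch of the predicate, fed here with the scan line captured above instead of the live command:

```shell
# True while 'zpool status' still reports a running resilver.
resilver_running() {
  printf '%s\n' "$1" | grep -q 'resilver in progress'
}
# In real use: status=$(zpool status zones)
status='scan: resilver in progress since Fri Feb 24 14:16:05 2023'
if resilver_running "$status"; then state="busy"; else state="idle"; fi
echo "$state"
```

In a loop that would be: `while resilver_running "$(zpool status zones)"; do sleep 60; done`.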
Once the process has finished, the spare disk that had stepped in is released again as well:
[root@myserver-2 (de-dc-2) ~]# zpool status
  pool: zones
 state: ONLINE
  scan: resilvered 38,8G in 0 days 00:20:49 with 0 errors on Fri Feb 24 14:36:54 2023
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
        spares
          c1t6d0    AVAIL
          c1t7d0    AVAIL

errors: No known data errors