Dumping SmartOS boot zpool when booting from harddisk

Read first

Before reading this article, you may find it useful to become familiar with these other two:

Problem to solve

The steps described in the previous articles create a SmartOS installation that boots from the harddisk instead of the regular boot from a CD, a DVD, a USB pendrive or PXE. This is useful when managing remote machines with no physical access. You must be careful, though, since a mistake in the manipulation of the boot zpool can leave the machine offline, with a slow and painful recovery procedure.

For my personal needs (servers hosted in a remote datacenter), my approach to boot recovery would be to restart the server with the rescue mode facilities of the datacenter, using an Operating System with ZFS support. That would be enough to mount the harddisk and correct any misconfiguration.

This approach is fine 99.9% of the time, but it doesn't work (easily) if there is an issue in the early boot stages, like GRUB corruption, a botched GRUB upgrade (trying, for instance, to support features available in new ZFS releases), or if you want to move to the new Illumos loader, enable ZFS encryption, etc.

We need a fast way to recover the machine if something goes wrong in the early boot stage.

Since in my case the boot ZFS zpool is only 1 GB in size, what about just dumping the harddisk data blocks to a file on another server? At gigabit speed, you can rewrite 1 GB in about ten seconds. Booting in rescue mode and typing the appropriate commands would be the slow part!
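
A quick back-of-the-envelope check, assuming the usual ~125 MB/s of usable gigabit throughput:

    1 Gbit/s / 8 ≈ 125 MB/s
    1 GB / 125 MB/s ≈ 8 seconds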

Identifying GRUB area and ZFS boot zpool

The following instructions can be done safely in the Solaris global zone while the machine is in production.

  1. Let's see the Solaris harddisk partitions:

    [root@xXx ~]# fdisk /dev/rdsk/c1t0d0
                 Total disk size is 60800 cylinders
                 Cylinder size is 96390 (512 byte) blocks
    
                                                   Cylinders
          Partition   Status    Type          Start   End   Length    %
          =========   ======    ============  =====   ===   ======   ===
              1                 EFI               0  60799    60800    100
    
    
    
    
    SELECT ONE OF THE FOLLOWING:
       1. Create a partition
       2. Specify the active partition
       3. Delete a partition
       4. Change between Solaris and Solaris2 Partition IDs
       5. Edit/View extended partitions
       6. Exit (update disk configuration and exit)
       7. Cancel (exit without updating disk configuration)
    Enter Selection:
    

    Here we see that the harddisk contains a single EFI partition spanning all of it.

    Note

    This machine has two harddisks with the same partitioning. I only show one for brevity.

  2. Under Solaris and derivatives, there is a second, nested partitioning scheme called slices:

    [root@xXx ~]# format /dev/rdsk/c1t0d0p0
    selecting /dev/rdsk/c1t0d0p0
    [disk formatted]
    /dev/dsk/c1t0d0s0 is part of active ZFS pool arranque. Please see zpool(1M).
    /dev/dsk/c1t0d0s1 is part of active ZFS pool zones. Please see zpool(1M).
    
    
    FORMAT MENU:
            disk       - select a disk
            type       - select (define) a disk type
            partition  - select (define) a partition table
            current    - describe the current disk
            format     - format and analyze the disk
            fdisk      - run the fdisk program
            repair     - repair a defective sector
            label      - write label to the disk
            analyze    - surface analysis
            defect     - defect list management
            backup     - search for backup labels
            verify     - read and display labels
            inquiry    - show vendor, product and revision
            volname    - set 8-character volume name
            !<cmd>     - execute <cmd>, then return
            quit
    format> p
    
    
    PARTITION MENU:
            0      - change `0' partition
            1      - change `1' partition
            2      - change `2' partition
            3      - change `3' partition
            4      - change `4' partition
            5      - change `5' partition
            6      - change `6' partition
            expand - expand label to use whole disk
            select - select a predefined table
            modify - modify a predefined partition table
            name   - name the current table
            print  - display the current table
            label  - write partition map and label to the disk
            !<cmd> - execute <cmd>, then return
            quit
    partition> p
    Current partition table (original):
    Total disk sectors available: 5860516717 + 16384 (reserved sectors)
    
    Part      Tag    Flag     First Sector          Size          Last Sector
      0        usr    wm                34         1.00GB           2097185
      1        usr    wm           2097186         2.73TB           5860516750
      2 unassigned    wm                 0            0                0
      3 unassigned    wm                 0            0                0
      4 unassigned    wm                 0            0                0
      5 unassigned    wm                 0            0                0
      6 unassigned    wm                 0            0                0
      8   reserved    wm        5860516751         8.00MB           5860533134
    
    partition>
    

    Note

    Notice that we are listing the slices contained in the (single) EFI partition of the harddisk. That is why we open /dev/rdsk/c1t0d0p0: the p0 device gives access to the disk c1t0d0 at the fdisk level (p0 addresses the whole disk; p1 would be the first fdisk partition).

    We can see this:

    • There are 34 (not really free) sectors at the beginning of the partition. This is space reserved for the boot system. ZFS doesn't need it because it provides its own boot area, but GRUB will be installed there by default. This is legacy, my friends.
    • After that, there is a 1 GB ZFS zpool. This is the boot zpool, as described in previous articles. The zpool is ONLY used when booting the system. After that, it is never touched again. SmartOS is a hypervisor and the Operating System runs from RAM. In a regular SmartOS deployment, the Operating System would be loaded from a CD, a DVD, a USB pendrive or PXE.
    • Then we have the main ZFS zpool, known as the zones zpool in SmartOS nomenclature. This is where the configuration and production data live.
    • There is a small slice at the end. Forget about it, it is legacy.
  3. In order to preserve the boot environment, we need to dump the EFI partition table, the free (not actually free) sectors at the beginning of the partition and the boot zpool slice.

    The sector numbers printed by the format command are offsets inside the partition, but in this case the EFI partition starts at cylinder zero, so the real offsets in the disk are:

    • Sector offset of the first sector of the partition:

      0 * 96390 = 0
      
    • Sector offset of the first sector after the ZFS boot zpool:

      0 * 96390 + 2097186 = 2097186
      
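    If you want to double-check these numbers non-interactively, prtvtoc prints the same slice table; look at the "First Sector" and "Sector Count" columns for slice 0, the boot zpool. A minimal sketch (the device name is this machine's; adjust it to yours):

    [root@xXx ~]# prtvtoc /dev/rdsk/c1t0d0s0
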
  4. After booting, the ZFS boot zpool will be idle. Fine. We can be cautious, though, and export the boot zpool before dumping it. The intent is to get a consistent dump:

    [root@xXx ~]# zpool list
    NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
    arranque  1008M   300M   708M        -         -    22%    29%  1.00x  ONLINE  -
    zones     2.73T  1.21T  1.52T        -         -    56%    44%  1.00x  ONLINE  -
    [root@xXx ~]# zpool export arranque
    [root@xXx ~]# zpool list
    NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
    zones  2.73T  1.21T  1.52T        -         -    56%    44%  1.00x  ONLINE  -
    
  5. We dump the data from the global zone:

    [root@xXx ~]# cd /zones/z-jcea/
    [root@xXx /zones/z-jcea]# dd if=/dev/rdsk/c1t0d0 \
                                 of=SmartOS_boot-20190829.dump \
                                 bs=512 count=2097186
    

    Note

    This dump will take a while because we are reading in 512-byte chunks. The dump can be made about 6 times faster by noticing that (see the sketch right below):

    512 * 2097186 = 3072 * 349531
    
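    In other words, the same byte range can be read with a bigger block size and a proportionally smaller count. A minimal sketch of the faster invocation, with the same device and output file as above:

    [root@xXx /zones/z-jcea]# dd if=/dev/rdsk/c1t0d0 \
                                 of=SmartOS_boot-20190829.dump \
                                 bs=3072 count=349531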

    Warning

    If the data you are dumping doesn't start at the very beginning of the disk, you would need to use the skip= parameter, as in the sketch below.
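
    For instance, if the boot area started at a hypothetical sector 2048 instead of sector 0 (a made-up offset, purely for illustration), the dump would have to skip that many 512-byte blocks of the input device:

    [root@xXx /zones/z-jcea]# dd if=/dev/rdsk/c1t0d0 \
                                 of=SmartOS_boot-20190829.dump \
                                 bs=512 skip=2048 count=2097186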

  6. Now you can transfer that file to another machine for disaster recovery.

    I would calculate its hash and rename the file to include the hash (in order to detect corruption in the future), the SmartOS release and the dump date, as sketched below. In this example, the filename would be:

    SmartOS_xXx_boot-20190829-770171d16de18c1d17a6570b9842dc600cc5477ebfff1e9da74958e6fbdc1d44-20190918.dump
    
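    A minimal sketch of how to do it from the global zone, using the digest(1) utility shipped with illumos (the SHA-256 choice and the naming convention are just my habit, nothing SmartOS requires):

    [root@xXx /zones/z-jcea]# HASH=$(digest -a sha256 SmartOS_boot-20190829.dump)
    [root@xXx /zones/z-jcea]# mv SmartOS_boot-20190829.dump \
                                 SmartOS_xXx_boot-20190829-${HASH}-20190918.dump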

This machine has two harddisks configured as a mirror. The two sides of a ZFS mirror are not bit-by-bit identical. After a catastrophic failure you could recover only one side, boot the machine, rebuild the other side of the ZFS mirror and reinstall GRUB on both disks (see the sketch below).
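
A rough sketch of that rebuild, assuming the second disk is c1t1d0, that each zpool mirrors the matching slice of both disks, and that the GRUB stage files are available under /boot/grub as on stock illumos (all of these are assumptions about this particular setup; depending on how the pools see the old device you may need zpool replace instead of zpool attach, so double-check before typing):

    [root@xXx ~]# zpool attach arranque c1t0d0s0 c1t1d0s0
    [root@xXx ~]# zpool attach zones c1t0d0s1 c1t1d0s1
    [root@xXx ~]# installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0
    [root@xXx ~]# installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0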

Disaster recovery

If everything fails and you need to overwrite the boot system, you would:

  1. Boot the machine in rescue mode. The rescue Operating System is not important, use whichever you are familiar with.

  2. Transfer the old dump to the machine, as in the sketch below. Usually, the rescue mode will run from RAM and you will need at least 1 GB of extra free RAM to hold the recovery dump file. This is not an issue with this machine because it has plenty of memory.
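
    A hypothetical transfer, assuming the rescue environment runs an SSH server and keeps /tmp in RAM (host names and paths are illustrative):

    user@backup:~$ scp DUMP_FILE root@RESCUE:/tmp/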

  3. Then overwrite the boot area of the disk with the old dump. For instance, if you use Linux, you could type:

    root@RESCUE:~# dd if=DUMP_FILE of=/dev/sda
    

    Warning

    If the data you want to overwrite is not located at the beginning of the harddisk, you will need to use the seek= parameter, as in the sketch below.

    Failing to do that WILL cause corruption and data loss.

    BE CAREFUL!
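
    For example, with the same made-up offset of sector 2048 used earlier, the restore would have to seek past that many 512-byte blocks of the output device before writing:

    root@RESCUE:~# dd if=DUMP_FILE of=/dev/sda bs=512 seek=2048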

  4. Disable rescue mode and be sure your system will boot from the harddisk you have corrected.

  5. Reboot and cross your fingers.