When we initially set up our infrastructure (Ubuntu VMs on Proxmox with ZFS storage), I used the default block size, which was 8kB at the time.
Space considerations when using zvols on RAIDz
Smaller block sizes are good because they limit write amplification (writing one byte in the guest requires writing a full block to the underlying storage, typically 4kB for hard disks, and in our case 8kB for ZFS). This is also the reason small files take up more storage space than their contents.
I already knew that RAIDz was bad for write performance, so our production servers have striped mirrors to minimize overhead. However our backup server uses RAIDz2 with 6 drives for increased storage efficiency and to survive the failure of any two drives (which is much harder with striped mirrors). I wrote a detailed article about the technical details of how we back up, but in essence we stream the ZFS volumes using zfs send and zfs recv directly to the RAIDz2 pool of our backup server.
First, note that the minimum write size for a RAIDz2 pool is 3 sectors (one for data + two for parity, since we must survive the failure of two disks).
While large blocks decrease write performance, small blocks can decrease storage efficiency when combined with the wrong RAIDz configuration. For me, the article that finally made it click was on Jason Rose’s blog:
When ZFS writes to an array, it makes sure the total number of sectors it writes is a multiple of this nmin value defined above [in our case 3]. ZFS does this to avoid situations where data gets deleted and it ends up with a space on the disk that’s too small to be used (for example, a 2-sector wide space can’t be used by RAID-Z2 because there’s not enough room for even a single data sector and the necessary two parity sectors). Any sectors not filled by user data or parity information are known as “padding”.

(from “Calculating & Minimizing ZFS Allocation Overhead”)
The logical sector size of the HDDs in the backup server is 4kB as is typical, and the guest image is split in 8kB chunks, so they fit on two sectors and require two parity sectors and two padding sectors.
| Disk 1 | Disk 2 | Disk 3 | Disk 4 | Disk 5  | Disk 6  |
|--------|--------|--------|--------|---------|---------|
| data   | data   | parity | parity | padding | padding |
In this configuration each 6*4kB = 24kB stripe can only hold 8kB of data, so our storage efficiency is only 33%, which is half of the expected ~67% (4 data sectors out of 6) for RAIDz2 in a typical situation.
Here we would much prefer having a volblocksize of 16kB, meaning our ZFS blocks would take up 4 sectors of disk space, leaving 2 sectors for parity and no padding.
| Disk 1 | Disk 2 | Disk 3 | Disk 4 | Disk 5 | Disk 6 |
|--------|--------|--------|--------|--------|--------|
| data   | data   | data   | data   | parity | parity |
Note that setting volblocksize = 4k looks ideal because there are no padding sectors; however, that means storing 4 parity sectors per stripe (each 4kB block needs its own two parity sectors), which ends up wasting exactly the same space. Conversely, if we increase volblocksize we still only need two parity sectors per stripe (not two parity sectors per block).
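To make the padding rule concrete, here is a small sketch (plain shell arithmetic, with our pool's numbers hard-coded: 6 disks, RAIDz2, 4kB sectors) that computes how many data, parity and padding sectors a given volblocksize ends up allocating:

```shell
#!/bin/sh
# RAIDz2 allocation sketch for a 6-wide pool with 4kB sectors.
# Parity: 2 sectors per stripe of up to 4 data sectors; the total
# allocation is then rounded up to a multiple of nmin = 3.
SECTOR=4096
WIDTH=6
P=2  # parity level (RAIDz2)
alloc() { # $1 = volblocksize in bytes; prints "data parity padding efficiency"
  data=$(( $1 / SECTOR ))
  parity=$(( (data + WIDTH - P - 1) / (WIDTH - P) * P ))
  # round (data + parity) up to a multiple of (P + 1) = 3
  total=$(( (data + parity + P) / (P + 1) * (P + 1) ))
  echo "$data $parity $(( total - data - parity )) $(( 100 * data / total ))%"
}
alloc 8192    # 2 data + 2 parity + 2 padding -> 33%
alloc 16384   # 4 data + 2 parity + 0 padding -> 66%
```

Running it reproduces the numbers above: 8k blocks waste a third of every stripe on padding, while 16k blocks fill the stripe exactly.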
By chance, 16k is the new default volblocksize in Proxmox! Unfortunately we still have several large old production VMs using 8kB blocks that we can’t just stop for a couple of days.
Let’s look at several methods and their caveats.
proxmox storage move – safe and easy, but requires fiddling with proxmox storage
This is a method that I did not know about beforehand, so I include it here for completeness but did not test it fully. It makes most of the techniques below obsolete, but still, you may find yourself in a situation where the qemu solution is not possible, or slower than the alternative methods presented below.
This solution uses the live block operations of qemu to migrate from one disk to the other. Combined with the proxmox GUI, it makes for a very easy migration.
- Create a new dataset to hold the new zvols (for instance zfs create rpool/data_16k)
- In the proxmox GUI, create a new ZFS storage (at the datacenter level), and connect it to this dataset. Set the volblocksize for this storage to 16k. This is only a default value for new zvols in this storage, but because of the way disk migration works in proxmox, migrating a disk to this storage is enough to change its volblocksize.
- In the proxmox GUI, find the disk in the hardware tab of the VM, and initiate a “Move storage” to the new storage. After it is complete, you can use zfs rename to move the new zvol to its original location (don’t forget to edit the qemu config file) or leave it where it is.
dd – safe and easy, but slow and requires downtime
man zfs create says:

“By default, a reservation of equal size is created.”

which is not the proxmox default, so you may want to add the -s flag to create a sparse volume instead.
```
# Shutdown the VM
# Make a snapshot of the original volume in case anything goes wrong
zfs snapshot rpool/data/vm-VMID-disk-0@before_migration
# Create a new sparse disk with the correct volblocksize and size
zfs create -s -o volblocksize=16k -V 10G rpool/data/vm-VMID-disk-0-16k
dd if=/dev/rpool/data/vm-VMID-disk-0 of=/dev/rpool/data/vm-VMID-disk-0-16k bs=1M status=progress conv=sparse
# Keep the old volume for a few days just in case
zfs rename rpool/data/vm-VMID-disk-0 rpool/data/vm-VMID-disk-0-8k
zfs rename rpool/data/vm-VMID-disk-0-16k rpool/data/vm-VMID-disk-0
# Don't forget to destroy/rename any replicated or backup volumes since you just lost all your snapshots for incremental replication
```
All commands in this example except dd are common to all solutions and are implied in the rest of this article.
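Before destroying the old volume, a cheap sanity check is to compare the two devices byte by byte (both zvols have the same 10G virtual size, and sparse regions read back as zeros, so a clean copy is byte-identical). verify_copy is our own little helper, not a ZFS tool:

```shell
# Compare two block devices (or files) byte by byte;
# cmp prints the first difference if the copy went wrong
verify_copy() {
  cmp -- "$1" "$2" && echo "copy verified"
}
# Usage, after the dd but before the renames:
# verify_copy /dev/rpool/data/vm-VMID-disk-0 /dev/rpool/data/vm-VMID-disk-0-16k
```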
e2image – a bit faster than dd, still requires downtime (and requires a bare ext4 partition)
dd is nice but it copies every byte of the partition even if it is almost empty. You can try to alleviate that by writing a bunch of zeros in the guest to overwrite free space, then using conv=sparse in dd, but that didn’t help much in our case. A better idea is to use a specialized tool such as e2image to copy the main ext4 partition.
Warning: we didn’t test this method as much as the others, so use at your own risk.
```
# Shutdown the VM, create the new disk, snapshot the old volume
# Start by taking a look at the partition table manually (with command 'p')
parted /dev/rpool/data/vm-VMID-disk-0
# Enough to copy the partition table and still be fast
dd if=/dev/rpool/data/vm-VMID-disk-0 of=/dev/rpool/data/vm-VMID-disk-0-16k bs=1M status=progress count=100
# You may need partprobe for the -part* device nodes of the new disk to appear
# The first partition (/boot) is very small, so we copy it with dd
dd if=/dev/rpool/data/vm-VMID-disk-0-part1 of=/dev/rpool/data/vm-VMID-disk-0-16k-part1 bs=1M status=progress
# The second partition is large, we'll copy it with e2image
e2image -rap /dev/rpool/data/vm-VMID-disk-0-part2 /dev/rpool/data/vm-VMID-disk-0-16k-part2
# Swap disks, start the VM
```
lvconvert – no downtime! But requires LVM
This solution was inspired by this article by TecMint.
If you chose to use LVM when installing the VM, then we can use LVM’s mirror RAID capabilities to copy all the data to the new disk with the correct volblocksize without shutting down the VM!
We copy the partition table, then the /boot partition with dd (we can do this live because this partition is not expected to change often while the system is running). We could also extend the first dd command to copy both the partition table and the /boot partition in one go, but we need to be careful: if we spill over and start copying the LVM volume, the system will be very confused when we add this new disk. To be safe we clobber the start of the new LVM volume with zeros.
```
# Create the new disk and snapshot the old volume as before (no need to shut down the VM)
# Start by taking a look at the partition table manually (with command 'p')
parted /dev/rpool/data/vm-VMID-disk-0
# Enough to copy the partition table and still be fast
dd if=/dev/rpool/data/vm-VMID-disk-0 of=/dev/rpool/data/vm-VMID-disk-0-16k bs=1M status=progress count=100
# The first partition (/boot) is very small, so we copy it with dd
dd if=/dev/rpool/data/vm-VMID-disk-0-part1 of=/dev/rpool/data/vm-VMID-disk-0-16k-part1 bs=1M status=progress
# In case we spilled over this partition, we clean it a bit. Double-check you are clobbering the new, empty partition and not the existing one
dd if=/dev/zero of=/dev/rpool/data/vm-VMID-disk-0-16k-part2 bs=1M status=progress count=100
```
The next step is to hot-add the new disk to the VM while it is running. The easiest way I found to do that is to first add it to the text configuration file:

```
echo "scsi9: data:vm-VMID-disk-0-16k,size=10G" >> /etc/pve/qemu-server/VMID.conf
```
(watch out for extra PVE snapshots or pending hardware changes which may add sections to the .conf file; we want to add this line to the main section)
Then from the web UI, detach and re-attach this disk (I suppose the web UI has a hook that signals to qemu that a new drive has been added).
At this point, I also suggest adding a write speed limit to this disk, otherwise your whole server may be IO-bound for the next few hours (or days).
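Proxmox exposes per-disk bandwidth caps as drive options, so one way to do this is with qm set. A sketch, assuming the mbps_wr drive option; VMID, the scsi9 slot, the size and the 100 MB/s value are all placeholders to adapt:

```shell
# Cap writes to the new disk at 100 MB/s so the mirror sync
# does not starve the other guests on this host
qm set VMID --scsi9 data:vm-VMID-disk-0-16k,size=10G,mbps_wr=100
```

The cap can be removed the same way once the sync is finished.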
The next step is to log in to the VM and copy data from the old to the new drive. Let’s say that inside the VM /dev/sda is the original drive and /dev/sdb is the new one, with partition /dev/sda1 mounted to /boot (we’ve already copied that) and /dev/sda2 containing an LVM physical volume (PV). We’re going to create a new PV on /dev/sdb2, and extend the volume group (VG) to contain both PVs. Then we set our logical volume (LV) to be mirrored on both PVs.
Note that the arguments of the lvconvert command have changed since the original TecMint article: the default for --mirrorlog changed, and since we can’t spare any space on our disk for a log, we explicitly ask for core.
```
pvcreate /dev/sdb2
vgextend ubuntu-vg /dev/sdb2
# This next line may take hours or days, run it in a detached screen, or directly on the VM serial console!
lvconvert --mirrorlog core --type mirror -m 1 /dev/ubuntu-vg/ubuntu-lv /dev/sdb2
# Once the mirror is in sync, drop the leg on the old drive
lvconvert -m 0 /dev/ubuntu-vg/ubuntu-lv /dev/sda2
# Cleanup
vgreduce ubuntu-vg /dev/sda2
pvremove /dev/sda2
```
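While the mirror leg syncs, you can follow progress from inside the guest; LVM reports it in the copy_percent (Cpy%Sync) column, which reaches 100.00 once both legs are identical (VG name as in the commands above):

```shell
# One-shot progress check; wrap it in `watch -n 60 '...'` to follow it live
lvs -a -o lv_name,copy_percent,devices ubuntu-vg
```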
You can now detach the original disk, but at this point I still recommend shutting down the VM (this will be a short downtime), swapping the old and new zvols, changing the qemu config file to include only the new disk (now /dev/rpool/data/vm-VMID-disk-0) and starting the VM back up.
What if I installed Ubuntu without LVM?
It sounds desperate, but it’s not! This legendary git repo, where the last change was 10 years ago, holds a program capable of transforming a bare ext4 partition into ext4-over-LVM in place.
You can clone this to the proxmox hypervisor and apply this little patch (ok, we’re a little desperate at this point):
```
diff --git a/blocks/__main__.py b/blocks/__main__.py
index 5248d77..ac3790f 100644
--- a/blocks/__main__.py
+++ b/blocks/__main__.py
@@ -136,7 +136,7 @@ class BCacheReq(Requirement):
 
 def mk_dm(devname, table, readonly, exit_stack):
     needs_udev_fallback = False
-    cmd = 'dmsetup create --noudevsync --'.split() + [devname]
+    cmd = 'dmsetup create --verifyudev --'.split() + [devname]
     if readonly:
         cmd[3:3] = ['--readonly']
     proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
@@ -148,7 +148,7 @@ def mk_dm(devname, table, readonly, exit_stack):
     proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
     proc.communicate(table.encode('ascii'))
     assert proc.returncode == 0, 'Report to https://github.com/g2p/blocks/issues/8 if you see this'
-    cmd = 'dmsetup remove --noudevsync --'.split() + [devname]
+    cmd = 'dmsetup remove --verifyudev --'.split() + [devname]
     if needs_udev_fallback:
         cmd[3:3] = ['--verifyudev']
     exit_stack.callback(lambda: quiet_call(cmd))
```
and now we’re ready to convert from ext4 to ext4-over-LVM and then fall back to the LVM case!
Take a snapshot of the volume. I needed to roll back in more than half the cases.

```
zfs snapshot rpool/data/vm-VMID-disk-0@before_migration
```
But first, here is a checklist to avoid bricking your system (these are done on the guest, so Ubuntu):
- Check whether /etc/fstab refers to your boot partition by UUID. If not, you must edit the fstab to use the partition UUID (use blkid to find it)
- Check whether lvm2 is installed (apt install -y lvm2)
If you had to do either of the steps above, then make sure lvm is installed to initramfs, and grub is aware of the new UUID:
```
update-initramfs -u -k all
grub-install /dev/sda
```
Reboot the system at least once after these commands to check everything is ok.
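For the fstab check in the list above, the UUID can be read with blkid inside the guest; the fstab line below is only an illustration of the expected shape:

```shell
# Print the filesystem UUID of the boot partition
blkid -s UUID -o value /dev/sda1
# /etc/fstab should then contain a line of the form:
#   UUID=<the value printed above>  /boot  ext4  defaults  0  2
```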
Now for the main part: stop the VM and run

```
cd ~/clones/blocks
python3 blocks to-lvm --vg-name ubuntuvg /dev/rpool/data/vm-VMID-disk-0-part2
```
Keep the output of this command handy: it contains a command you can use to roll back the conversion. But of course you also have a snapshot for that.
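If things go sideways, the snapshot taken earlier lets you return the zvol to its pre-conversion state (this discards everything written to the volume since the snapshot, so only do it with the VM stopped):

```shell
# Revert the whole zvol to the state captured before the conversion
zfs rollback rpool/data/vm-VMID-disk-0@before_migration
```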
At this point you should be able to restart the VM. If not, see the troubleshooting section below.
Now you have an Ubuntu system with LVM, but we still didn’t change the volblocksize! After checking the VM still boots (do a proper shutdown and power up rather than a reboot), go back to the lvconvert case above, using ubuntuvg instead of ubuntu-vg (blocks doesn’t like hyphens) and lv1 instead of ubuntu-lv.
If you get dropped to an initramfs prompt, here are a few troubleshooting tips:
- Call up the grub menu at boot (shift key), and edit the first entry. It should refer to the root filesystem by UUID
- At the initramfs prompt, the command vgchange should exist, otherwise the initramfs has no support for LVM.
At this point, you can roll back and try again to run grub-install as described in the previous steps. However, in at least one case this did not work for us and we had to use the following procedure:
- After converting the filesystem to lvm, boot into the VM using a live USB, and open a shell
- Mount the broken system and chroot into it. See this article for instructions, but mostly it should boil down to:

```
mount /dev/ubuntuvg/lv1 /mnt
mount /dev/sda1 /mnt/boot
mount -t proc proc /mnt/proc
mount -t sysfs sys /mnt/sys
mount -o bind /dev /mnt/dev
mount -t devpts pts /mnt/dev/pts
chroot /mnt
```

- Run update-grub, optionally inspect the contents of /mnt/boot, repair /etc/fstab…
- Exit the chroot with exit, unmount everything and reboot
It looks like dm-clone or dm-mirror should be able to accomplish a live migration with no downtime, but documentation and examples are very sparse, so I would only consider this option if migrating to LVM is not an option at all.