ZFS partitioning layout - recommendations

Hello all

I started to experiment with ZFS on root in a virtual machine, using a specific partitioning layout like this:

zfs create -o mountpoint=none zroot/ROOT
zfs create -o mountpoint=/ -o canmount=noauto zroot/ROOT/${ID}
zfs create -o mountpoint=/home zroot/home
zfs create -o mountpoint=/usr -o canmount=off zroot/usr
zfs create -o mountpoint=/usr/local zroot/usr/local
zfs create -o mountpoint=/var -o canmount=off zroot/var
zfs create -o mountpoint=/var/cache zroot/var/cache
zfs create -o mountpoint=/var/crash zroot/var/crash
zfs create -o mountpoint=/var/lib zroot/var/lib
zfs create -o mountpoint=/var/lib/containers zroot/var/lib/containers
zfs create -o mountpoint=/var/lib/libvirt zroot/var/lib/libvirt
zfs create -o mountpoint=/var/log zroot/var/log
zfs create -o mountpoint=/var/spool zroot/var/spool
zfs create -o mountpoint=/var/tmp zroot/var/tmp
zfs create -o mountpoint=/var/www zroot/var/www

The reason behind this idea is to maximize control over every dataset, for example:

  1. setting a specific block size for each mount point
  2. enabling or disabling snapshots for a specific mount point
  3. setting sync options for each mount point
    and many more.

What do you think about this layout?
Is it bad, good, overkill etc?

Thanks

and a small test:

# you can set a different recordsize for each dataset based on its purpose
liviu@bobdenaut:~$ sudo zfs set recordsize=128k zroot/home
liviu@bobdenaut:~$ sudo zfs get compress zroot/home
NAME        PROPERTY     VALUE           SOURCE
zroot/home  compression  lz4             inherited from zroot
liviu@bobdenaut:~$ ./test.sh 
sysctl: cannot stat /proc/sys/vfs/zfs/txg/timeout: No such file or directory
1 K size -> 1 K alloc
2 K size -> 1 K alloc
3 K size -> 5 K alloc
4 K size -> 1 K alloc
5 K size -> 1 K alloc
6 K size -> 1 K alloc
7 K size -> 1 K alloc
8 K size -> 9 K alloc
9 K size -> 1 K alloc
11 K size -> 1 K alloc
12 K size -> 1 K alloc
13 K size -> 1 K alloc
15 K size -> 17 K alloc
16 K size -> 1 K alloc
17 K size -> 1 K alloc
23 K size -> 1 K alloc
24 K size -> 1 K alloc
25 K size -> 29 K alloc
31 K size -> 1 K alloc
32 K size -> 1 K alloc
33 K size -> 1 K alloc
63 K size -> 1 K alloc
64 K size -> 65 K alloc
65 K size -> 1 K alloc
127 K size -> 1 K alloc
128 K size -> 1 K alloc
129 K size -> 1 K alloc
254 K size -> 261 K alloc
255 K size -> 1 K alloc
256 K size -> 1 K alloc
257 K size -> 1 K alloc
512 K size -> 1 K alloc
1024 K size -> 1029 K alloc

If the filesystems are in the same pool that is backed by the same device(s), I would expect you would want the same block size for them.

As for snapshots, sometimes you do want to separate things (typically you don’t want your user data reverted when you revert your OS, or vice versa). But for package-owned files like those under /etc and /usr (and some under /var), you probably do want those to revert “in sync” when you do a rollback, or else you risk breaking or destabilizing the software when the versions of its files end up mismatched.
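
To illustrate the “in sync” point: zfs snapshot accepts several dataset@name arguments in one call and takes them atomically, so datasets that must be rolled back together can share one snapshot tag. A minimal sketch, assuming hypothetical dataset names (zroot/ROOT/fedora, zroot/var/lib) and a hypothetical tag; the function only prints the command so it can be reviewed first:

```shell
#!/bin/sh
# Build (but do not run) one `zfs snapshot` command covering several datasets.
# Passing all names to a single invocation gives every dataset the same
# point-in-time snapshot.
# Usage: snapshot_cmd TAG DATASET...
snapshot_cmd() {
    tag=$1
    shift
    cmd="zfs snapshot"
    for ds in "$@"; do
        cmd="$cmd $ds@$tag"
    done
    printf '%s\n' "$cmd"
}

# Hypothetical dataset names and tag; review the output, then run it with sudo.
snapshot_cmd pre-upgrade zroot/ROOT/fedora zroot/var/lib
```

Note that zfs rollback works on one dataset at a time, so each dataset still has to be rolled back to the shared tag individually.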

/etc is a bit of a mixed bag. I typically keep it together with the rest of the OS because the changes I make are small and easily reproducible. I have experimented with making a local git repo to track changes to /etc. For example:

sudo zfs create -o mountpoint=/srv/etc.git root/0/etc.git
sudo git init --bare /srv/etc.git
sudo git init -b main /etc
sudo bash -ec "cd /etc; git add .; git commit -m '$(printf '%(%FT%T)T')'; git push -u -f /srv/etc.git main;"
$ zfs list root/0/etc.git
NAME             USED  AVAIL  REFER  MOUNTPOINT
root/0/etc.git  5.69M  90.9G  5.69M  /srv/etc.git

In theory (and if I keep up with pushing the changes I make to /etc to my “backup” repo), I should be able to find and restore my intentional changes after a rollback. (But I haven’t had a good occasion to put this into practice yet.)

Edit: The recordsize is a little different from the block size (which is set on the pool). You can get significant performance improvements by matching that with the record size used by a database, but for most programs it probably won’t matter much. It also affects how well the compression algorithms work, so there are disk space trade-offs to changing that.
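
As a rough sketch of the database case: PostgreSQL uses 8 kB blocks and InnoDB uses 16 kB pages by default, while the OpenZFS default recordsize is 128K. The helper name, the workload labels and the dataset zroot/var/lib/pgsql below are made up for illustration, and the script only prints the command:

```shell
#!/bin/sh
# Suggest a recordsize that matches the page size of a given workload.
# Usage: recordsize_for WORKLOAD
recordsize_for() {
    case $1 in
        postgresql) echo 8K ;;    # PostgreSQL default block size
        innodb)     echo 16K ;;   # InnoDB default page size
        *)          echo 128K ;;  # OpenZFS default recordsize
    esac
}

# Hypothetical dataset name; prints the command instead of running it.
printf 'zfs set recordsize=%s %s\n' "$(recordsize_for postgresql)" zroot/var/lib/pgsql
```

A changed recordsize only applies to blocks written afterwards; existing files keep their old record size until they are rewritten.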

Indeed, this is my concern too when it comes to snapshots. But since you get the chance to set different tuning parameters per dataset, I think this is the best way to go.

For example, for /var/log, /var/cache, /var/crash and maybe others, I would set “sync=disabled”, like this:

zfs set sync=standard zroot

zfs set sync=disabled zroot/var/cache
zfs set sync=disabled zroot/var/crash
zfs set sync=disabled zroot/var/log
zfs set sync=disabled zroot/var/spool
zfs set sync=disabled zroot/var/tmp
zfs set sync=disabled zroot/var/www

The system becomes faster, and I don’t care about data corruption on these partitions (well, maybe the logs), but we’ll see …

Also, you can exclude them from snapshotting:

zfs set com.sun:auto-snapshot=false zroot/var/cache
zfs set com.sun:auto-snapshot=false zroot/var/crash
zfs set com.sun:auto-snapshot=false zroot/var/lib/containers
zfs set com.sun:auto-snapshot=false zroot/var/lib/libvirt
zfs set com.sun:auto-snapshot=false zroot/var/log
zfs set com.sun:auto-snapshot=false zroot/var/spool
zfs set com.sun:auto-snapshot=false zroot/var/tmp
zfs set com.sun:auto-snapshot=false zroot/var/www
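
The repeated commands above can be generated from a dataset list. This is only a sketch: the helper name set_cmds is made up, and it prints the commands rather than running them. Keep in mind that com.sun:auto-snapshot is a user property, so it only has an effect if the snapshot tool in use (zfs-auto-snapshot, for example) checks it.

```shell
#!/bin/sh
# Print one `zfs set` command per dataset instead of typing each by hand.
# Usage: set_cmds PROPERTY=VALUE DATASET...
set_cmds() {
    prop=$1
    shift
    for ds in "$@"; do
        printf 'zfs set %s %s\n' "$prop" "$ds"
    done
}

# Dataset names taken from the layout in this thread.
scratch="zroot/var/cache zroot/var/crash zroot/var/log zroot/var/spool zroot/var/tmp"

# Word splitting of $scratch is intentional (dataset names contain no spaces).
set_cmds sync=disabled $scratch
set_cmds com.sun:auto-snapshot=false $scratch
```

Once the output looks right, it can be piped to sudo sh to apply it.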

Yea, will do some tests live on my hardware.

Nice idea, I like that :)

I’m still experimenting with the git repo idea myself. FWIW, I just added the following to my setup which I think might help keep my local repo up-to-date.

$ sudo dnf install libdnf5-plugin-actions
$ cat <<- 'END' | sudo tee /etc/dnf/libdnf5-plugins/actions.d/git.actions
	post_transaction:::enabled=host-only:/usr/bin/sh -ec cd\ /etc;\ git\ add\ .;\ git\ commit\ -m\ $(printf\ '%(%FT%T)T');\ git\ push;
END

I still have to remember to run cd /etc && git commit -a --amend --no-edit after rpmconf -a, though. I think I’ll try creating a /usr/local/bin/rpmconf “wrapper” script to take care of that automatically …

/usr/local/bin/rpmconf:

#!/usr/bin/bash

command -p rpmconf "$@"
set -e; cd /etc; git commit -a --amend --no-edit; git push;

(don’t forget chmod +x /usr/local/bin/rpmconf)

Any idea how to set vfs.zfs.txg.timeout in Fedora?

force commit Transaction Group (TXG)

Is that just a module parameter?

$ modinfo zfs | grep zfs_txg_timeout
parm:           zfs_txg_timeout:Max seconds worth of delta per txg

If so, create a conf file under /etc/modprobe.d and a corresponding dracut conf file under /etc/dracut.conf.d to add it to the initramfs. Then run dracut -f to update your initramfs.
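
A minimal sketch of those two files, assuming the hypothetical file name zfs_tuning.conf and a value of 120 for zfs_txg_timeout. To keep it safe to try, it writes into a scratch directory rather than the live /etc:

```shell
#!/bin/sh
# Write a modprobe.d options file and the dracut snippet that pulls it into
# the initramfs, both under the directory given as ROOT.
# Usage: write_zfs_conf ROOT PARAM VALUE
write_zfs_conf() {
    root=$1
    name=$2
    value=$3
    mkdir -p "$root/etc/modprobe.d" "$root/etc/dracut.conf.d"
    printf 'options zfs %s=%s\n' "$name" "$value" \
        > "$root/etc/modprobe.d/zfs_tuning.conf"
    printf 'install_items+=" /etc/modprobe.d/zfs_tuning.conf "\n' \
        > "$root/etc/dracut.conf.d/zfs_tuning.conf"
}

# Stage the files in a temporary directory and show the result.
staging=$(mktemp -d)
write_zfs_conf "$staging" zfs_txg_timeout 120
cat "$staging/etc/modprobe.d/zfs_tuning.conf" "$staging/etc/dracut.conf.d/zfs_tuning.conf"
```

After copying the reviewed files into /etc as root, sudo dracut -f rebuilds the initramfs and the setting takes effect on the next boot.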

Thanks!

Maybe it’s not the right place to set those values?
Because it doesn’t get set:

liviu@bobdenaut:~$ cat /etc/dracut.conf.d/
modprobe_files.conf zol.conf

liviu@bobdenaut:~$ cat /etc/dracut.conf.d/modprobe_files.conf 
install_items+=" /etc/modprobe.d/nvidia_params.conf "
install_items+=" /etc/modprobe.d/zfs_arc.conf "

liviu@bobdenaut:~$ cat /etc/modprobe.d/zfs_arc.conf
# Set Max ARC size => 16GB
options zfs zfs_arc_max=16106127360
# Set Min ARC size => 2GB
options zfs zfs_arc_min=2147483648
# force commit Transaction Group (TXG) at 120 secs, increase to aggregated more data (default 5 sec)
options zfs vfs.zfs.txg.timeout=120
# force commit Transaction Group (TXG) if dirty_data reaches 95% of dirty_data_max (default 20%, FreeBSD 12.1)
options zfs vfs.zfs.dirty_data_sync_pct=95
# max gap between any two aggregated writes, 0 to minimize frags (default 4096, 4KB)
options zfs vfs.zfs.vdev.write_gap_limit=0
liviu@bobdenaut:~$ sysctl -n vfs.zfs.txg.timeout
sysctl: cannot stat /proc/sys/vfs/zfs/txg/timeout: No such file or directory

The module option wouldn’t have the vfs. prefix. Also, it would use _ (or -) instead of . in the option name, and it would be a global setting that affects all ZFS pools and filesystems.
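
The renaming rule can be sketched as a small helper (the function name is made up). The result is only a candidate name: some parameters differ by more than punctuation between platforms, so confirm that it appears in the modinfo zfs output before putting it in a conf file.

```shell
#!/bin/sh
# Turn a FreeBSD-style sysctl name into a candidate Linux module parameter:
# drop the leading "vfs." and replace the remaining dots with underscores.
# Usage: sysctl_to_modparam NAME
sysctl_to_modparam() {
    printf '%s\n' "$1" | sed -e 's/^vfs\.//' -e 's/\./_/g'
}

sysctl_to_modparam vfs.zfs.txg.timeout   # prints zfs_txg_timeout
```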

If there is a per-filesystem setting that can be set with sysctl, I’m not familiar with it.

My bad, the option names were misspelled.

Found them by running grep on “arc_summary” :)

FYI, you can check the current/active value of any module parameter with, e.g., cat /sys/module/zfs/parameters/zfs_txg_timeout.

Some of those can also be changed at runtime (without rebooting) by “echoing” a new value to the file. You’ll have to check the documentation to find out what settings can be changed on the fly though.
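
A small sketch of reading and writing those files, with made-up helper names. The directory can be overridden through ZFS_PARAM_DIR, which makes it possible to try the helpers against a scratch directory first:

```shell
#!/bin/sh
# Directory where the kernel exposes the zfs module parameters.
ZFS_PARAM_DIR=${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}

# Print the active value of a module parameter.
# Usage: zfs_param_get NAME
zfs_param_get() {
    cat "$ZFS_PARAM_DIR/$1"
}

# Write a new value. This needs root and fails for parameters that are
# read-only at runtime.
# Usage: zfs_param_set NAME VALUE
zfs_param_set() {
    printf '%s\n' "$2" > "$ZFS_PARAM_DIR/$1"
}

# Example (on a system with the zfs module loaded):
#   zfs_param_get zfs_txg_timeout
```

A value written this way lasts only until the module is reloaded or the machine reboots; the modprobe.d file is what makes it permanent.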

You can use modinfo -p zfs to get a list of parameters with brief descriptions. ZFS has a lot of tunables.
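
Since the list is long, a filter on the parameter name helps. This sketch assumes the usual name:description layout of the modinfo -p output; the helper name is made up and it reads from standard input:

```shell
#!/bin/sh
# Keep only the lines whose parameter name (the text before the first colon)
# matches the given pattern.
# Usage: modinfo -p zfs | params_matching PATTERN
params_matching() {
    awk -F: -v pat="$1" '$1 ~ pat'
}

# Example (on a system with the zfs module installed):
#   modinfo -p zfs | params_matching txg
```

Matching on the name only avoids hits where the pattern merely shows up in another parameter’s description.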

Wow, modinfo -p zfs is really useful; I didn’t know about -p.
Thanks!
