XFS with external disk for journal/metadata

Hi. I’m building a home server with a small SSD as the root disk and two very big HDDs for storage.

I decided to run all of these as JBOD (no RAID) and use regular rsync for redundancy. I thought this would be more straightforward to maintain, and easier to recover from in case of a disk failure.

Anyway, I thought it would make sense for the XFS filesystems on the big HDDs to have their journal and/or metadata written to a portion of the SSD for better performance.
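The examples I’ve been looking at do something roughly like this (device names and the mount point here are just placeholders, not my actual layout):

mkfs.xfs -l logdev=/dev/sda2 /dev/sdb1            # external journal on an SSD partition, data on the HDD
mount -o logdev=/dev/sda2 /dev/sdb1 /srv/storage1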

But I have a few concerns.

  • In the examples I see, the device is specified by its /dev path, e.g. /dev/sda2 as above. But this can easily change, and I see no way to point the filesystem at a different device path afterwards.
  • Would data be recoverable from the volume if the external device fails? And does it make a difference whether the journal or the metadata is offloaded?
  • Which gives the bigger performance boost - offloading the metadata or the journal?

Any pointers appreciated!

P.S. Or would I be better off with ext4, where I can just remove a missing/failed journal and recreate it?
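(What I have in mind there is roughly the following - again just example device names, and I haven’t verified it is really that simple; an fsck would probably be needed first:)

umount /srv/storage1
tune2fs -f -O ^has_journal /dev/sdb1              # drop the reference to the dead external journal
mke2fs -O journal_dev /dev/sda2                   # create a fresh journal device on a new SSD partition
tune2fs -J device=/dev/sda2 /dev/sdb1             # attach it to the filesystem again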

  1. Device name changes are no problem at all: put the filesystem UUID generated by mkfs.xfs into fstab and then just use “mount /myhybridxfsmount” (see the command sketch after this list).
  2. When you split XFS metadata from file data and lose the metadata device, the data is lost - so use some kind of RAID protection on both kinds of devices! With an external journal device there is no measurable difference, because the journal is very little data and it is written sequentially anyway.
  3. Using a dedicated device for metadata gives you really fast metadata access (listing, find) for all files, both right after a reboot and in normal operation, even with many hundreds of millions of files and a small memory footprint.
    Howto: e.g. create a hw/sw RAID6 out of lots of HDDs, create a hw/sw RAID1 out of 2 or 3 SSDs/NVMes, create an mdadm linear array with the RAID1 as the first and the RAID6 as the second device, run mkfs.xfs on the md device, and mount the XFS with the inode32 option. Put in millions of files, drop_caches (or reboot), run “iostat -xm 1” continuously, and while doing so run any find/du/ncdu command, recursive file listing (local or remote via nfs/samba) or file transfer - have fun :slight_smile:
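A minimal command sketch of points 1 and 3 (all device names, md numbers and the mount point are placeholders):

mdadm --create /dev/md1 --level=raid1  --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1   # metadata mirror (SSD/NVMe)
mdadm --create /dev/md2 --level=raid6  --raid-devices=6 /dev/sd[b-g]                # data RAID6 (HDDs)
mdadm --create /dev/md3 --level=linear --raid-devices=2 /dev/md1 /dev/md2           # RAID1 first, RAID6 second
mkfs.xfs /dev/md3
# point 1: reference the filesystem by UUID in /etc/fstab instead of a /dev name, e.g.:
#   UUID=<uuid from blkid /dev/md3>  /myhybridxfsmount  xfs  inode32  0  0
mount /myhybridxfsmount    # inode32 keeps inode allocation in the low part of the filesystem, i.e. on the NVMe RAID1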

P.P.S.: ext4 cannot make use of a separate metadata device.

With external XFS metadata it takes only the blink of an eye to chmod, chown, chgrp or touch (new timestamp) endless files/dirs, or to handle endless reflink file revisions, and it helps a lot when removing them too - superior && cool :slight_smile:
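For example, a reflink revision of a big file is just this (file names are placeholders; needs a reflink-enabled XFS):

cp --reflink=always backup.img backup.img.rev1   # shares the data extents, so only metadata gets written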

For everybody who is concerned about the inode32 option: I tested with a triple RAID1 of 3x 3.2 TB (~3 TB net) NVMes and ended up at a little over 3,000,000,000 inodes/files (on a 55 TB (RAID1+)RAID6 hybrid XFS) before the RAID1 was full (the RAID6 part was not). That makes it really easy to know which NVMe size is needed for the desired number of inodes to be held on the external device. Ever seen 32 different find runs, each over endless millions of files, all at over 90% CPU on an HDD-based PERC RAID6 XFS tuned with an external metadata RAID1? That’s such fun :slight_smile:
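As a back-of-the-envelope rule derived from those numbers (my rounding):

# ~3,000,000,000 inodes filled ~3 TB of net RAID1 space  =>  roughly 1 kB of metadata per file
echo $(( 500000000 * 1000 / 1000000000 )) "GB of external metadata space for 500 million files"   # ≈ 500 GB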

Is there anybody with a ZFS pool with a special device, a “normal XFS” or ext4 on HDD RAID5/6, or btrfs, and >= 64 cores, who would be so kind as to run a metadata-only test of their own with elbencho? elbencho is a new kind of benchmark tool for testing parallel filesystems like BeeGFS, Lustre and so on, while being easier than mdtest, ior etc. for evaluating I/O performance. The tool’s designer is BeeGFS developer Sven Breuner (GitHub - breuner/elbencho: A distributed storage benchmark for file systems, object stores & block devices with support for GPUs); it’s a ~6 MB tar.gz with static libs included.
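For reference, the options used in the runs below mean roughly the following (my summary - double-check against elbencho -h):

elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /path/to/testdir
#  -d        create the directory tree first (MKDIRS phase)
#  -w / -r   write phase / read phase
#  -t 64     64 worker threads
#  -n 64     64 directories per thread
#  -N 3200   3200 files per directory  (64 x 64 x 3200 = 13,107,200 files in total)
#  -s 0      file size 0, i.e. a pure metadata test
#  --lat     record per-operation latency
#  -F / -D   delete the files / directories afterwards (RMFILES / RMDIRS phases)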
R730xd, 2x E5-2683v4, 9x 8TB HDD @ H730P RAID5 + special 3x 3TB NVMe RAID1 via mdadm; all inodes on the RAID1, all data in so-called extents on the RAID5. You can lose 2 NVMes; lose all 3 NVMes and all data on the RAID5 is gone (or use a RAID1 of 4 NVMes). Write and rm of files are not at their best, as all the hardware is quite old by now:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md125 65630846272 15373372600 50257473672 24% /hxfs
elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /hxfs/test/nix # elbencho -h for help on options
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
MKDIRS Elapsed time : 1ms 120ms
Dirs/s : 48152 34080
Dirs total : 86 4096
Dirs latency : [ min=12us avg=1.20ms max=3.88ms ]

WRITE Elapsed time : 4m31.561s 5m50.821s
Files/s : 36970 37361
Files total : 10039852 13107200
Files latency : [ min=13us avg=1.63ms max=15.8ms ]

READ Elapsed time : 12.576s 12.640s
Files/s : 1038515 1036903
Files total : 13061407 13107200
Files latency : [ min=2us avg=59us max=62.8ms ]

RMFILES Elapsed time : 7m3.876s 9m7.189s
Files/s : 23571 23953
Files total : 9991588 13107200
Files latency : [ min=12us avg=2.56ms max=125ms ]

RMDIRS Elapsed time : 3.433s 4.526s
Dirs/s : 901 904
Dirs total : 3094 4096
Dirs latency : [ min=1.20ms avg=65.7ms max=151ms ]

Or, since “READ” in the benchmark above is answered from the fs cache, do it in 2 steps and empty the cache in between, so that READ must come from the special RAID1:
elbencho -w -d -t 64 -n 64 -N 3200 -s 0 --lat /hxfs/test/nix
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
MKDIRS Elapsed time : 1ms 116ms
Dirs/s : 48448 35268
Dirs total : 64 4096
Dirs latency : [ min=11us avg=1.12ms max=3.55ms ]

WRITE Elapsed time : 5m19.774s 5m52.207s # as it’s a 3-NVMe RAID1, that’s 39.3M file writes for the kernel/mdadm
Files/s : 37136 37214
Files total : 11875349 13107200
Files latency : [ min=12us avg=1.68ms max=51.3ms ]

echo 3 > /proc/sys/vm/drop_caches # empty the fs cache (or reboot); for ZFS, do zpool export … && zpool import … !!

elbencho -r -t 64 -n 64 -N 3200 -s 0 --lat -F -D /hxfs/test/nix
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
READ Elapsed time : 14.154s 15.863s
Files/s : 845048 826274
Files total : 11961593 13107200
Files latency : [ min=5us avg=72us max=8.62ms ]

RMFILES Elapsed time : 7m2.491s 7m42.899s # as it’s a 3-NVMe RAID1, that’s 39.3M file removals for the kernel/mdadm
Files/s : 27819 28315
Files total : 11753581 13107200
Files latency : [ min=12us avg=2.21ms max=21.6ms ]

RMDIRS Elapsed time : 3.753s 4.428s
Dirs/s : 912 925
Dirs total : 3424 4096
Dirs latency : [ min=1.55ms avg=66.5ms max=162ms ]

Test your own HDD RAID 5 / 6 / 10, mirror / raidz / draid + special/logs/cache :slight_smile:
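For the ZFS side, the rough equivalent of the hybrid XFS above is a pool with a special vdev; a minimal sketch, with pool name and devices as placeholders:

zpool create dpool raidz2 /dev/sd[b-m] special mirror /dev/nvme0n1 /dev/nvme1n1   # HDD data + NVMe metadata mirror
zfs set special_small_blocks=64K dpool   # optional: also push small data blocks to the special vdev (default is metadata only)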

Much newer hardware with 2.5x the number of disks, and faster ones:
RX2540 M5, 2x Gold 6226R, 192 GB, 24x 16TB HDD as raidz2 in 4 vdevs, pool 39% used:
elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /dpool/test/tmp
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
MKDIRS Elapsed time : 130ms 855ms
Dirs/s : 498 4788
Dirs total : 65 4096
Dirs latency : [ min=17us avg=210us max=51.7ms ]

WRITE Elapsed time : 7m52.066s 8m11.264s
Files/s : 26608 26680
Files total : 12560785 13107200
Files latency : [ min=21us avg=2.38ms max=1.89s ]

No zpool export && import here, so READ benefits from the ARC:

READ Elapsed time : 15.706s 16.090s
Files/s : 813530 814610
Files total : 12778030 13107200
Files latency : [ min=1us avg=76us max=52.8ms ]

RMFILES Elapsed time : 21m52.278s 22m2.489s
Files/s : 9874 9911
Files total : 12958582 13107200
Files latency : [ min=32us avg=6.44ms max=4.75s ]

RMDIRS Elapsed time : 10.036s 10.131s
Dirs/s : 401 404
Dirs total : 4030 4096
Dirs latency : [ min=72.5ms avg=157ms max=260ms ]

R760, 2x Gold 6526Y, 256GB, 12x 20TB raidz2 + 2x 3TB special mirror, pool 5% used:
elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /dpool/test/tmp
OPERATION RESULT TYPE FIRST DONE LAST DONE
=========== ================ ========== =========
MKDIRS Elapsed time : 6ms 49ms
Dirs/s : 54061 82866
Dirs total : 370 4096
Dirs latency : [ min=45us avg=217us max=1.09ms ]

WRITE Elapsed time : 47.899s 48.651s
Files/s : 268881 269412
Files total : 12879409 13107200
Files latency : [ min=11us avg=234us max=88.6ms ]

READ Elapsed time : 13.374s 13.628s
Files/s : 961438 961721
Files total : 12859095 13107200
Files latency : [ min=1us avg=64us max=147ms ]

RMFILES Elapsed time : 7m34.519s 7m37.907s
Files/s : 28623 28624
Files total : 13010136 13107200
Files latency : [ min=17us avg=2.23ms max=32.5ms ]

RMDIRS Elapsed time : 4.585s 6.212s
Dirs/s : 666 659
Dirs total : 3056 4096
Dirs latency : [ min=1.98ms avg=90.3ms max=211ms ]

So a 3-month-old machine with a ZFS special device can now beat the good 9-year-old one with XFS special …; I’d need current hardware for a new XFS special test :slight_smile:

SM, 2x 6246R, 256 GB, 26x 18TB LSI RAID6 + XFS special, 100Gb OPA:
./elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /usr1
OPERATION RESULT TYPE FIRST DONE LAST DONE
========= ================ ========== =========
MKDIRS Elapsed time : 1ms 18ms
Dirs/s : 106312 225376
Dirs total : 128 4096
Dirs latency : [ min=3us avg=118us max=461us ]

WRITE Elapsed time : 42.755s 52.338s
Files/s : 250944 250433
Files total : 10729223 13107200
Files latency : [ min=5us avg=239us max=201ms ]

READ Elapsed time : 11.655s 11.997s
Files/s : 1092970 1092464
Files total : 12738601 13107200
Files latency : [ min=0us avg=56us max=107ms ]

RMFILES Elapsed time : 1m13.395s 1m36.599s
Files/s : 135893 135686
Files total : 9974013 13107200
Files latency : [ min=3us avg=436us max=981ms ]

RMDIRS Elapsed time : 5.966s 6.513s
Dirs/s : 623 628
Dirs total : 3717 4096
Dirs latency : [ min=1.66ms avg=100ms max=231ms ]

The CPUs here are not the latest and greatest, but still enough - see RMFILES, XFS vs. ZFS :slight_smile:
Local: 4.5 GB/s write, 4 GB/s read; NFS 4.2: 4.5 GB/s write, 4 GB/s read native, or 10.5 GB/s from the server cache.