Hi. I’m building a home server with a small SSD as the root disk and two very big HDDs for storage.
I decided to run all of these as JBOD (no RAID) and use regular rsync for redundancy. I thought this would be more straightforward to maintain and recover in case of a disk failure.
Anyway, it seems to make sense for the XFS filesystems on the big HDDs to have their journal and/or metadata written to a portion of the SSD for better performance.
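Something like this is what I have in mind for the journal case (a sketch with hypothetical device names; /dev/sdb1 is one of the HDDs, /dev/sda3 a spare partition on the SSD):

    # put the XFS journal on an SSD partition at mkfs time...
    mkfs.xfs -l logdev=/dev/sda3,size=64m /dev/sdb1
    # ...and pass the same logdev option on every mount
    mount -o logdev=/dev/sda3 /dev/sdb1 /srv/storage1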
But I have a few concerns.
In the examples I see, the device is specified by its /dev path, e.g. /dev/sda2. But this can easily change, and I see no way to point the filesystem at another device afterwards.
Would data be recoverable from the volume if the external SSD fails? Does it make a difference whether the journal or the metadata is offloaded?
What gives the bigger performance boost: metadata or journal offloading?
Any pointers appreciated!
P.S. Or would I be better off with ext4, where I can just remove a missing/failed journal and recreate it?
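For reference, the ext4 route I'm thinking of would be roughly this (hypothetical device names; I'm not sure the removal step works this smoothly when the journal device is already dead):

    # create a dedicated external journal on the SSD, then an ext4 using it
    mke2fs -O journal_dev /dev/sda3
    mkfs.ext4 -J device=/dev/sda3 /dev/sdb1
    # after losing the SSD: drop the (missing) journal, fsck, attach a new one
    tune2fs -f -O ^has_journal /dev/sdb1
    e2fsck -f /dev/sdb1
    tune2fs -J device=/dev/sdc1 /dev/sdb1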
Device name changes are no problem at all: reference the filesystem UUID generated by mkfs.xfs in fstab and then just use "mount /myhybridxfsmount".
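A minimal sketch of such an fstab line (UUID and by-id path are hypothetical placeholders; get your real UUID with "blkid"):

    # /etc/fstab
    UUID=0f1b2c3d-aaaa-bbbb-cccc-0123456789ab  /myhybridxfsmount  xfs  noatime  0 0

If you also use an external journal, add the logdev= mount option and point it at a stable path under /dev/disk/by-id/ instead of /dev/sdX:

    UUID=0f1b2c3d-aaaa-bbbb-cccc-0123456789ab  /myhybridxfsmount  xfs  noatime,logdev=/dev/disk/by-id/EXAMPLE-ssd-part2  0 0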
When you split XFS metadata from file data and lose the metadata device, the data is lost. That's why you should put some kind of RAID redundancy on both kinds of devices! Offloading just the journal, by contrast, makes no measurable difference, because the journal holds very little data and is written sequentially anyway.
Using a dedicated device for metadata gives you really fast metadata access (listing, find) for all files, both right after a reboot and during operation, even with several hundred million files and a small memory footprint.
Howto (see the command sketch below):
1. Create a hw/sw RAID6 out of lots of HDDs.
2. Create a hw/sw RAID1 out of 2 or 3 SSDs/NVMes.
3. Create an mdadm linear array with the RAID1 as first and the RAID6 as second device.
4. Run mkfs.xfs on the md device and mount the XFS with the inode32 option.
5. Put in millions of files, drop_caches (or reboot), run "iostat -xm 1" continuously while doing any find/du/ncdu command, a recursive file listing (local or remote via NFS/Samba), or file transfers. Have fun!
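A minimal command sketch of those steps (hypothetical device names and member counts; adjust to your hardware):

    # RAID1 of 2 NVMes, RAID6 of 6 HDDs
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    mdadm --create /dev/md2 --level=6 --raid-devices=6 /dev/sd[b-g]
    # linear concat with the RAID1 FIRST: with inode32, XFS keeps all inodes
    # in the low block range, i.e. on the NVMe mirror; file extents go to the RAID6
    mdadm --create /dev/md3 --level=linear --raid-devices=2 /dev/md1 /dev/md2
    mkfs.xfs /dev/md3
    mount -o inode32 /dev/md3 /hxfs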
With external XFS metadata it takes just a blink of an eye to chmod, chown, chgrp, or touch (new timestamp) endless files/dirs, and to handle endless reflink file revisions; it helps a lot with removing them too. Superior && cool.
So for everybody who is concerned about the inode32 option: I tested with a triple RAID1 of 3x 3.2TB (~3TB net) NVMes and ended up at a little over 3,000,000,000 inodes/files (on a 55TB (raid1+)raid6 hybrid XFS) before the RAID1 was full (not the RAID6 part). That makes it really easy to work out which NVMe size is needed for the number of inodes the external device should hold. Ever seen 32 different find runs going over endless millions of files, each at over 90% CPU, on an HDD-based PERC RAID6 XFS tuned with an external metadata RAID1? That's such fun.
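A back-of-envelope from those numbers (my arithmetic, not a separate measurement): ~3 TB net / ~3.0e9 inodes ≈ 1 KB of metadata-device capacity per inode, so plan roughly 1 GB of NVMe mirror per million files you expect to hold.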
Anybody out there who has a ZFS pool with a special device, "normal XFS" or ext4 on an HDD RAID5/6, or btrfs configured, and >=64 cores: could you be so kind as to run a metadata-only test of your own with elbencho? elbencho is a new kind of benchmark tool for testing parallel filesystems like BeeGFS, Lustre and so on, while being easier than mdtest, ior etc. for evaluating I/O performance. The tool's designer is BeeGFS developer Sven Breuner: GitHub - breuner/elbencho: A distributed storage benchmark for file systems, object stores & block devices with support for GPUs. It's a ~6MB tar.gz with static libs included.
My test box: R730xd, 2x E5-2683v4, 9x 8TB HDD on an H730P as RAID5, plus a "special" 3x 3TB NVMe RAID1 by mdadm; all inodes are on the RAID1, with all data (in so-called extents) on the RAID5. You can lose 2 NVMes; lose all 3 NVMes and all data on the RAID5 is gone too (or do a RAID1 of 4 NVMes). Write and rm of files is not at its best, as all the hardware is quite old by now:

    Filesystem      1K-blocks        Used   Available Use% Mounted on
    /dev/md125    65630846272 15373372600 50257473672  24% /hxfs

    elbencho -r -w -d -t 64 -n 64 -N 3200 -s 0 --lat -F -D /hxfs/test/nix   # elbencho -h for help on options

    OPERATION RESULT TYPE         FIRST DONE  LAST DONE
    ========= ================    ==========  =========
    MKDIRS    Elapsed time     :         1ms      120ms
              Dirs/s           :       48152      34080
              Dirs total       :          86       4096
              Dirs latency     : [ min=12us avg=1.20ms max=3.88ms ]
    ---
    WRITE     Elapsed time     :   4m31.561s  5m50.821s
              Files/s          :       36970      37361
              Files total      :    10039852   13107200
              Files latency    : [ min=13us avg=1.63ms max=15.8ms ]
    ---
    READ      Elapsed time     :     12.576s    12.640s
              Files/s          :     1038515    1036903
              Files total      :    13061407   13107200
              Files latency    : [ min=2us avg=59us max=62.8ms ]
    ---
    RMFILES   Elapsed time     :    7m3.876s   9m7.189s
              Files/s          :       23571      23953
              Files total      :     9991588   13107200
              Files latency    : [ min=12us avg=2.56ms max=125ms ]
    ---
    RMDIRS    Elapsed time     :      3.433s     4.526s
              Dirs/s           :         901        904
              Dirs total       :        3094       4096
              Dirs latency     : [ min=1.20ms avg=65.7ms max=151ms ]

In the above bench, READ is answered from the fs cache. So do it in 2 steps instead, emptying the cache in between, so that READ must come from the special RAID1:

    elbencho -w -d -t 64 -n 64 -N 3200 -s 0 --lat /hxfs/test/nix

    OPERATION RESULT TYPE         FIRST DONE  LAST DONE
    ========= ================    ==========  =========
    MKDIRS    Elapsed time     :         1ms      116ms
              Dirs/s           :       48448      35268
              Dirs total       :          64       4096
              Dirs latency     : [ min=11us avg=1.12ms max=3.55ms ]
    ---
    WRITE     Elapsed time     :   5m19.774s  5m52.207s   # as it's a 3-NVMe RAID1, it's 39.3M files for the kernel/mdadm
              Files/s          :       37136      37214
              Files total      :    11875349   13107200
              Files latency    : [ min=12us avg=1.68ms max=51.3ms ]

    echo 3 > /proc/sys/vm/drop_caches   # empty fs cache (or reboot); for zfs do zpool export … && zpool import … !!

    elbencho -r -t 64 -n 64 -N 3200 -s 0 --lat -F -D /hxfs/test/nix

    OPERATION RESULT TYPE         FIRST DONE  LAST DONE
    ========= ================    ==========  =========
    READ      Elapsed time     :     14.154s    15.863s
              Files/s          :      845048     826274
              Files total      :    11961593   13107200
              Files latency    : [ min=5us avg=72us max=8.62ms ]
    ---
    RMFILES   Elapsed time     :    7m2.491s  7m42.899s   # as it's a 3-NVMe RAID1, it's 39.3M files for the kernel/mdadm
              Files/s          :       27819      28315
              Files total      :    11753581   13107200
              Files latency    : [ min=12us avg=2.21ms max=21.6ms ]
    ---
    RMDIRS    Elapsed time     :      3.753s     4.428s
              Dirs/s           :         912        925
              Dirs total       :        3424       4096
              Dirs latency     : [ min=1.55ms avg=66.5ms max=162ms ]

Test your own HDD RAID 5 / 6 / 10, mirror / raidz / draid* + special/logs/cache!
The CPUs here are not the latest and greatest, but still enough. Compare "RMFILES" on xfs vs. zfs.
Local: 4.5 GB/s write, 4 GB/s read. NFS 4.2: 4.5 GB/s write, 4 GB/s read native, or 10.5 GB/s from server cache.