For my backups I use my own program. On the first run it duplicates
the given directories to an identical structure on the given backup
media. On following runs it compares the previously written backup
structure with the current source and again duplicates the source to
a new identical structure on the given backup media. This time, however,
it copies only files that were added or changed. Files that are
unchanged are hardlinked to the previous version, which saves both
time and space at the cost of redundancy. The program is written in Perl.
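The hardlink scheme described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Perl program: the function name `incremental_backup` and its parameters are hypothetical, "unchanged" is decided with `filecmp.cmp`, and symlinks, special files and SELinux contexts are ignored.

```python
import filecmp
import os
import shutil


def incremental_backup(source, previous, target):
    """Mirror `source` into `target`; hardlink files unchanged since `previous`.

    `previous` is the directory of the last backup, or None on the first run.
    All names here are illustrative assumptions, not the original program's.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            new = os.path.normpath(os.path.join(target, rel, name))
            old = None
            if previous is not None:
                old = os.path.normpath(os.path.join(previous, rel, name))
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=True):
                # Unchanged: add a second directory entry for the old copy.
                os.link(old, new)
            else:
                # New or changed: copy the data and the basic metadata.
                shutil.copy2(src, new)
```

Hardlinks only work within one filesystem, so `previous` and `target` have to live on the same backup medium.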
I have been using this program since 2013. The backups are nominally
50 GB, but the daily differences are vastly smaller, so that the daily
backup takes about 3 to 4 minutes. The backups use an ext2 filesystem.
After I upgraded from fc32 to fc33, the backup times
suddenly rose by a factor of 4 to 5. It seems that it is the
writing / linking that takes the additional time. Reading the directory
data doesn’t seem to matter. I did a direct dnf upgrade, so that AFAIK
my FS should still be ext4.
Any ideas? Is fc33 doing some strange transformations from ext4 to
Btrfs and back? Is Perl under fc33 slower? …?
First, I am still gms, but as I discovered, my original email account is gone, which
probably caused my Fedora gms account to be blocked, so here I am as gms2.
It turns out that during the backup I get into swapping mode. Either the OS now needs
more memory, or Perl doesn’t do its memory management as well as before, but
now I have at least a direction to move on. I would delete the above posting, but as
my gms account is unreachable (to me)…
Is fc33 doing some strange transformations from ext4 to Btrfs and back?
No.
Regression testing is a bit of a pain, because you need an A and B setup to run the tests. And then it’s a process of elimination. bcc-tools has quite a lot of tools for finding latency. Of course you need to have some idea or reference for expected latency. But what you’re experiencing sounds like the latency is so high that it should be obvious if only you can narrow down where it’s happening. So bcc-tools can show you biolatency for the physical drive (not likely, it hasn’t changed), various file system latency tools (also no change, not likely), and a VFS latency tool (maybe). There are dozens of these, and one might help narrow down what has such high latency.
Another possibility is to run something like perf top -p on the PID doing all the work and see what it’s doing and if it’s doing something unexpected. There’s a bunch of guides about perf tools.
I would prefer not to post it - it is 1300+ lines of ugly Perl code plus
some C-code and it needs some additional setup. Also in the meantime
I have discovered that a) the swap is probably just a red herring and
b) it sometimes looks like even a normal “cp -a” is slower. Of course,
since I didn’t do any measurements before the upgrade, the objective
value of that statement is around zero.
BTW inxi doesn’t come default, but looks interesting. Thank you for
the pointer.
Tell me about PITA… I don’t really want to start poking and prodding, I have
“better things to do”. What I really wanted to know was, if there was some
quantum jump between fc32 and fc33. “It just was normal”, then I did the
upgrade and “the performance went to hell”. I am using that code on 4 other
machines, but only one has the data size which makes it noticeable. I guess,
I will just figure some way around it (split into slow and fast changing data…).
Don’t delete it, even if you recover your gms account - the post may be a helpful lesson to others. Instead, select a specific post as the Solution to point readers in the right direction.
Of course I did that. As my mother said: if you can’t improve it, leave it as you found it.
I managed to hunt down some data. I found timing on the same backup disk,
once before and once after the upgrade to F-33. Of course the backups are
NOT identical, but near enough to make the following numbers strange:
before: real 3m9.906s user 0m43.719s sys 0m43.078s
after: real 19m53.261s user 0m42.792s sys 16m31.425s
It is obvious that the real difference is in the system time jump from 43 seconds
to 16+ minutes. How can 43 seconds user time invoke 16 minutes of system action
(apart from the system throwing a party and billing the user)?
I don’t know, but such a large difference surely does not even need an A/B comparison; this should show up as significant latency somewhere. I don’t really have further ideas. I suggest posting an email to the fedora-devel@ list.
I finally dug into the code and found that the culprit is the following statement:
system ("chcon --reference=$fname1 $fname2");
I replaced it by a C routine and got to the following timing:
real 3m53.952s user 0m34.040s sys 1m18.063s
That is comparable to the “before” situation.
As for “should show up as significant latency somewhere” - that call happens
almost 18 000 times during the backup, so it sums up.
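To put a rough number on "it sums up", the timings quoted earlier can be divided by the call count. This is only back-of-the-envelope arithmetic in Python: it assumes, for illustration, that all of the extra system time belongs to the roughly 18 000 chcon calls, and the two backups were not identical.

```python
def to_seconds(minutes, seconds):
    """Convert a `time`-style "XmY.ZZZs" reading to seconds."""
    return minutes * 60 + seconds


sys_before = to_seconds(0, 43.078)    # sys time before the upgrade
sys_after = to_seconds(16, 31.425)    # sys time after the upgrade
calls = 18_000                        # "almost 18 000" calls, an approximation

extra_sys = sys_after - sys_before
per_call_ms = extra_sys / calls * 1000

print(f"extra system time: {extra_sys:.1f} s")
print(f"per call:          {per_call_ms:.1f} ms")
```

Under those assumptions the extra cost comes to about 53 ms of system time per call, which is small for a single process launch but adds up to roughly 16 minutes over the whole run.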
I hesitate to assign the location of the problem. It could be either the Perl “system”
call, or the system “chcon” command. Also during the tests with a reduced version
of the backup program the lost time was ca. 8 minutes compared to ca. 15 minutes
for the full version - there is some non-linearity somewhere…
But for now: Problem Solved.
In any form of programming, a recurring event that has even a tiny latency gets multiplied by the number of times it repeats and can become significant. You seem to have found one thing but there may be more.
What would the difference be if you were to forgo the per-file chcon step and do it all in one fell swoop after the files were in the backup location, instead of one file at a time?
I finally went after what I called non-linearities in my previous
assessment. It turns out that the big issue actually IS the swapping.
It’s not a red herring, but a reality.
The program is using recursion to follow the storage directory structure.
It turns out that in Perl it is easy to do, but you pay for it: the
memory is easily acquired, but it’s never given back. I must have been
already moving at the brink of my 8 GB memory and after the change from
F-32 to F-33 something changed (OS memory handling, Perl interpreter size,
Perl memory handling, overall size of the OS - take your pick), I fell
over the edge of the available memory and the system started swapping.
What I thought was the solution (changing perl “system” calls to
C-routines) was just a cure for the symptom. C-calls need less space
than calling an OS command from inside the program and the reduced overall
size allows the whole Perl program to fit into the available memory.
Of course, if the structure of the data being backed up changes, it
might take me over the edge again. So might another change in the OS.
As I see it, there are various possibilities in front of me:
1. sit tight and pray
2. get more memory
3. split the data into several smaller parts, instead of one big chunk
4. work on shrinking the memory footprint of the Perl program
5. re-write the backup program in another language
I am not confident enough for 1, don’t want to do 2 (I am not even
sure I can do it on this machine) and 3 misses the whole point of
the system. So at the moment I am at 4, but there might be 5 in the
future.
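One common way to approach option 4 is to replace the recursive directory walk with an explicit work list, so that only the list of directories still to visit is held in memory. The sketch below shows the technique in Python rather than Perl; the name `iter_files` is hypothetical, and whether this would actually shrink the footprint of the program discussed here is not established anywhere in this thread.

```python
import os


def iter_files(root):
    """Yield file paths under `root` using an explicit stack, not recursion."""
    pending = [root]                     # directories still to be visited
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                else:
                    # Hand each path to the caller immediately instead of
                    # collecting the whole tree in memory first.
                    yield entry.path
```

Because it is a generator, the caller can process each file as it arrives, for example `for path in iter_files("/home"): ...`.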
Thanks to everybody who took part in this discussion, and good
luck in all your efforts.
You might move the chcon part into a parallel branch that does not delay the backup but instead runs concurrently from a list created by the backup process. It could compare the context of $file1 & $file2 and only run chcon if they do not match. It appears you are running chcon on every file, every time, even if not needed.
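The "compare first, then only fix the mismatches" idea could look roughly like this. It is a sketch in Python, not the poster's code: the helper names are hypothetical, and it assumes a Linux system where the SELinux label is stored in the `security.selinux` extended attribute.

```python
import os

SELINUX_XATTR = "security.selinux"  # xattr in which SELinux keeps the label


def read_context(path):
    """Return the raw SELinux label of `path` (Linux only)."""
    return os.getxattr(path, SELINUX_XATTR, follow_symlinks=False)


def pairs_needing_chcon(pairs, get_context=read_context):
    """Keep only the (reference, target) pairs whose contexts differ.

    `get_context` is injectable so the filter can be tried out without
    SELinux; by default it reads the real extended attribute.
    """
    return [(ref, dst) for ref, dst in pairs
            if get_context(ref) != get_context(dst)]
```

The backup process would collect `(reference, target)` pairs as it goes, and the filtered list could then be relabelled afterwards, outside the main copy loop.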
rsync for example will copy files and maintain all attributes when using archive mode (-a). “cp -a” does the same, although it copies every file, not just those that have changed. If your backup program does not copy the attributes but instead relies on chcon then the added overhead to run chcon is built in by that lack.
Well, now (since I got the size down) it is not swapping in any serious manner, so
there is no need to do anything like that, but if I ever get into the bad swap situation
again, I will try to do that. Thanks for the tip.