Rethinking deltarpms

My most recent DNF run actually reported significant savings due to the use of DeltaRPMs, thanks in large part to many texlive packages that compress well and include a lot of files that don’t change from build to build. But that experience is far from the norm, and it got me thinking about the utility of DeltaRPMs in general.

First I whipped up some stats, surveying all of the /var/log/dnf.log files I happen to have lying around on my machines. (On both machines, that comprised the last 2-4 months of logs, plus some older rotations from 2019 that never got reaped by logrotate for reasons.)

On my Fedora 32 desktop:

      Date  Before (MB)  After (MB)    Saved
--------------------------------------------
2019-07-01     1047.10     1002.70      4.1%
2019-07-05       30.40       29.60      2.1%
2019-07-05      106.20      106.10      0.1%
2019-07-06      132.60       86.30     34.1%
2019-07-09       93.60       92.70      0.1%
2019-07-09       46.80       39.80     14.1%
2019-07-10        1.30        0.40     68.1%
2019-07-13      125.10      121.40      2.1%
2019-07-14       32.60        5.40     83.1%
2019-07-15      101.80       68.80     32.1%
2019-07-26     1080.70     1064.00      1.1%
2020-10-09      238.80      221.80      7.1%
2020-10-11       20.00       10.60     47.1%
2020-10-21     1497.60     1496.60      0.1%
2020-10-27      446.90      434.80      2.1%
2020-11-02      246.80      203.40     17.1%
2020-11-05      522.00      517.40      0.1%
2020-11-08      302.00      300.20      0.1%
2020-11-19      203.50      197.50      2.1%
2020-11-21      570.80      570.30      0.1%
2020-11-23      119.50      104.60     12.1%
2020-11-24        2.20        1.90     13.1%
2020-11-26      112.30      117.00     -4.1%
2020-11-29       93.70       86.20      8.1%
2020-12-02      397.70      396.80      0.1%
2020-12-04      428.20      212.60     50.1%

And on my Fedora 33 fileserver:

      Date  Before (MB)  After (MB)    Saved
--------------------------------------------
2019-07-09      203.30      203.00      0.1%
2019-07-13      253.00      252.90      0.1%
2019-07-14       12.30        4.40     63.1%
2019-07-15       73.10       59.90     17.1%
2019-07-26      914.90      914.30      0.1%
2020-06-08      124.00      120.80      2.1%
2020-06-15      336.70      331.80      1.1%
2020-06-16       96.40       82.50     14.1%
2020-06-22      172.80      166.40      3.1%
2020-06-22       66.50       65.50      1.1%
2020-06-24      129.00      122.60      4.1%
2020-06-27        6.40        4.90     23.1%
2020-07-02      157.20      149.40      4.1%
2020-07-02      157.20      157.80      0.1%
2020-07-12      526.80      509.80      3.1%
2020-07-17      566.50      529.60      6.1%
2020-07-18       25.70       13.70     46.1%
2020-08-06      634.10      616.40      2.1%
2020-08-09      156.30      122.50     21.1%
2020-08-10      139.10      138.90      0.1%
2020-08-12      346.70      346.70      0.1%
2020-08-16      127.70      126.00      1.1%
2020-08-20      302.40      264.90     12.1%
2020-08-26      529.90      505.10      4.1%
2020-08-28       36.00       29.60     17.1%
2020-08-29        7.60        0.40     94.1%
2020-09-02      173.30      142.10     17.1%
2020-09-04      115.90      115.00      0.1%
2020-09-11      704.50      695.30      1.1%
2020-09-22      547.80      508.40      7.1%
2020-09-24      124.00      107.20     13.1%
2020-10-08      695.20      693.00      0.1%
2020-10-08       17.00        9.40     44.1%
2020-10-11      340.00      332.10      2.1%
2020-10-12        4.70        1.00     78.1%
2020-10-21      856.90      813.00      5.1%
2020-10-26      263.60      263.30      0.1%
2020-11-05      614.10      610.30      0.1%
2020-11-08       84.40       82.50      2.1%
2020-11-10      100.30       90.30      9.1%
2020-11-21       65.80       65.60      0.1%
2020-11-23      180.10      178.60      0.1%
2020-11-24        6.10        3.90     35.1%
2020-11-26       74.30       74.30      0.1%
2020-11-29      469.70      452.30      3.1%

The stats paint a bad enough picture all by themselves: Most of the time, the savings from using deltarpms are negligible, and occasionally they are even negative due to rebuild failures.

But what the stats don’t show is the amount of time and CPU cycles consumed by the use of deltas, which is significant.

For those who don’t know, here’s a quick sketch of the way deltarpms work:

  1. After a new RPM is generated, it gets compared with one or several previous RPMs for that same package.
  2. For each previous RPM, any files that were unchanged from that previous package get filtered out of the new package. The remaining (changed) files are packaged into a .drpm, which is specific to both the new package version and the previous version it was diffed with.
  3. When you run dnf upgrade, DNF looks at all of the current versions of your packages that are being upgraded from, and if there’s a deltaRPM between that version and the version being installed, that gets downloaded instead of the full RPM.
  4. (Here’s the really damning part) When DNF receives a deltaRPM, before it can install the package it has to recreate the full RPM to be installed. It does this by grabbing the installed files on your system (from the previous RPM) and combining them with the files in the downloaded .drpm. That gets it the .rpm for the updated version, which can then be installed.
  5. If any of the local files don’t match the contents of the previous RPM (say, they included a file that you modified locally), then the reconstruction of the full RPM fails and DNF is forced to abandon the DeltaRPM and download the full version.
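
To make those steps concrete, here’s a toy model in Python. It is only a sketch of the sequence as I described it, not the real .drpm format or tooling: packages are modeled as plain dicts of path to bytes, and make_delta, rebuild, and the sample paths are all made-up names for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used to check that an installed file is unmodified."""
    return hashlib.sha256(data).hexdigest()


def make_delta(old_pkg: dict, new_pkg: dict) -> dict:
    """Steps 1-2: keep only the files that differ from the old package.

    Unchanged files are recorded by path and hash so they can be
    verified against the installed copy at rebuild time.
    """
    changed, unchanged = {}, {}
    for path, data in new_pkg.items():
        if old_pkg.get(path) == data:
            unchanged[path] = digest(data)
        else:
            changed[path] = data
    return {"changed": changed, "unchanged": unchanged}


def rebuild(delta: dict, installed: dict) -> dict:
    """Step 4: recreate the full new package from the delta plus the
    files already on disk. Step 5: raise if a local file no longer
    matches what the old package shipped."""
    rebuilt = dict(delta["changed"])
    for path, expected in delta["unchanged"].items():
        local = installed.get(path)
        if local is None or digest(local) != expected:
            raise ValueError(f"{path} differs locally; full download needed")
        rebuilt[path] = local
    return rebuilt


# Hypothetical two-file package where only the binary changed.
old_pkg = {"/usr/bin/tool": b"v1 binary", "/usr/share/doc/README": b"docs"}
new_pkg = {"/usr/bin/tool": b"v2 binary", "/usr/share/doc/README": b"docs"}
delta = make_delta(old_pkg, new_pkg)

# Clean system: the rebuilt package is identical to the new one.
print(rebuild(delta, old_pkg) == new_pkg)

# Locally edited file: reconstruction fails, as in step 5.
tampered = {**old_pkg, "/usr/share/doc/README": b"my notes"}
try:
    rebuild(delta, tampered)
except ValueError as err:
    print("fallback:", err)
```

Even in this toy version, the rebuild has to read and hash every unchanged file on disk, which hints at where the extra local work comes from.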

Steps 4-5 are where things go really bad. Step 4 takes a LONG time, comparatively, even when there are no problems. (If you go into step 5, then you’re hosed and all chance of saving any bandwidth is pretty much out the window.) But even when all of the reassemblies succeed, they are a highly CPU- and disk-intensive process that drags the performance of DNF down considerably.

DeltaRPMs are basically a tradeoff between network bandwidth and local time/cycles: In order to download less data from the network, your system has to reconstruct the RPMs to be installed, which requires significantly more work (that takes significantly more time).

And the more I think about it, the more backwards that tradeoff seems. Network bandwidth is cheap: over my broadband service, downloading a few extra megabytes (or even a few hundred) takes at most a few seconds and costs nothing. So, what exactly is the advantage to piling all that extra complexity and delay into the dnf upgrade process, just for the possibility of slightly smaller downloads? That’s kind of a terrible deal any way you slice it.
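
That tradeoff can be put as a rough break-even check. The sketch below assumes the only costs that matter are download time and rebuild time; the function name, the link speeds, and the 20-second rebuild time are made-up values for illustration, not measurements. The 7.5 MB figure is the saving from my 2020-11-29 desktop run (93.70 MB down to 86.20 MB).

```python
def delta_worth_it(saved_mb: float, link_mbit_per_s: float,
                   rebuild_seconds: float) -> bool:
    """Return True if the download time saved exceeds the rebuild time."""
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    # Megabytes to megabits, then divide by the link speed in Mbit/s.
    download_seconds_saved = saved_mb * 8 / link_mbit_per_s
    return download_seconds_saved > rebuild_seconds


# Hypothetical 100 Mbit/s broadband link, assumed 20 s of rebuild work.
print(delta_worth_it(7.5, 100, 20))

# The same saving on a hypothetical 1 Mbit/s link.
print(delta_worth_it(7.5, 1, 20))
```

On the fast link the saved download time is well under a second, so any rebuild longer than that is a net loss in wall-clock time; only on the slow link does the delta come out ahead.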

I should have qualified that, “…to the best of my understanding”, and please do feel free to correct any details I got wrong.

Here’s the Python script I used to generate those stats tables, if anyone is curious to check on their own DeltaRPM performance. I strongly suspect this will only be useful if you use commandline dnf to update. GNOME Software may use deltaRPMs, but even so I doubt that would be logged in /var/log/dnf.log.

#!/usr/bin/python3

import os
import re

with os.scandir("/var/log") as scan:
    dnf_logs = [entry for entry in scan if entry.name.startswith("dnf.log")]
linestats = []
dnf_logs.sort(key=lambda x: x.stat().st_mtime)
for dnf_log in dnf_logs:
    with open(dnf_log, "r") as f:
        for l in f.readlines():
            if not "Delta RPMs" in l:
                continue
            match = re.search(
                r"^([0-9]{4}-[0-9]{2}-[0-9]{2}).*Delta RPMs (.*?) ([0-9.]*) MB of updates to ([0-9.]*) MB \((.*) (saved|wasted)",
                l)
            if not match:
                continue
            date = match.group(1)
            before, after, delta = (match.group(3), match.group(4), match.group(5))
            reduced = bool(match.group(2) == "reduced")
            linestats.append((date, float(before), float(after), delta))
print("      Date  Before (MB)  After (MB)    Saved")
print("--------------------------------------------")
for line in linestats:
    print("%10s  %10.2f  %10.2f  %8s" % line)
    allmb += line[1]
    deltamb += line[2]
savedmb = allmb - deltamb
print(
    "\nTotal MBytes: %8.2f Delta MBytes: %8.2f (Saved: %6.2f, %5.2f%%)"
    % (allmb, deltamb, savedmb, 100 * (savedmb / allmb)))

(There’s a bit of useless code in there, like I didn’t end up actually using the reduced string-match for anything.)

It certainly might be worth rethinking this. When the feature was created, cheap bandwidth was far from a given for a large number of our users. It still isn’t universally true globally.

PS: since this is more of a project conversation than a troubleshooting / help one, you might get a better response to this topic at the contributor-focused Fedora Discussion forum.

As a former user of satellite internet, I can state that the bandwidth there had a limit, and once the limit was reached, extra data cost money. Some other providers have limits as well, although most now limit speed rather than data.

It isn’t, that’s true. And I did think about that. The problem, though, is the abysmal track record of deltaRPMs for actually saving significant bandwidth. Even on a metered connection, I’m not sure it’s enough to be worth it.

I realized that one thing my script didn’t do was compute totals, so I added that in. (I’ll update the script in my previous comment as well.) For both of the systems I listed previously, the totals come out as:

F32 desktop:

Total MBytes:  8000.20 Delta MBytes:  7488.90 (Saved: 511.30,  6.39%)

F33 fileserver:

Total MBytes: 11539.30 Delta MBytes: 11077.40 (Saved: 461.90,  4.00%)

I mean… Meh. ¯\_(ツ)_/¯

DeltaRPMs were, I suspect, a pretty cool idea back in the old days before compilers started using application security features like address randomization by default. Back when, if a package only changed one or two lines of code, you could end up with a new RPM that was 95% identical to the previous package. Then, you’d have real savings.

But now, with modern compile toolchains, every build of a program results in a different set of binaries, and if you’re packaging compiled software the norm is going to be closer to 5% identical to the previous build, not 95%. Far too little to be worth it.

Maybe that’s a change that should be made, on the server end: Just throw away every deltaRPM that reduces the package size by less than, say, 20% — the ones that really aren’t worth the time it takes to reassemble the original (and possibly have to throw the delta away and download the full package anyway). Then, the (relatively few) .drpms that remain could actually represent some tangible benefit.
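
As a sketch of what that server-side filter might look like (nothing here reflects how the Fedora infrastructure actually generates or publishes deltas; the function name, the package names, and the sizes are all hypothetical):

```python
def worth_keeping(full_mb: float, delta_mb: float,
                  min_saving: float = 0.20) -> bool:
    """Keep a delta only if it is at least `min_saving` smaller than the full RPM."""
    if full_mb <= 0:
        return False
    return (full_mb - delta_mb) / full_mb >= min_saving


candidates = [
    # (hypothetical package, full RPM size in MB, delta size in MB)
    ("pkg-a", 10.0, 9.5),  # saves 5%
    ("pkg-b", 10.0, 4.0),  # saves 60%
    ("pkg-c", 2.0, 1.7),   # saves 15%
]

kept = [name for name, full, delta in candidates if worth_keeping(full, delta)]
print(kept)
```

With the 20% cutoff, only pkg-b would be published as a delta; the other two would simply be served as full RPMs.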

Seems I’m not the only one who’s been thinking about these issues!

The actual proposal is over my head; I’ll have to study it more to get an idea of what exactly they want to do. But I like that it:

  1. Acknowledges all of the same problems with Delta RPMs that we’ve discussed here.
  2. Is explicitly incompatible with Delta RPMs. (Which means that, if it’s adopted, it will implicitly spell the death of .drpms.)

Since it seems there’s a time limit on editing our own previous comments, and the “updated” version of my stats script above doesn’t actually work, here’s a fixed version. Apologies for the necromancy.

#!/usr/bin/python3
import os
import re

# Matches DNF's DeltaRPM summary line:
# "<date> ... Delta RPMs reduced X MB of updates to Y MB (Z saved)"
DELTA_RE = re.compile(
    r"^([0-9]{4}-[0-9]{2}-[0-9]{2}).*Delta RPMs (reduced|increased) "
    r"([0-9.]*) MB of updates to ([0-9.]*) MB \((.*) (saved|wasted)")

# Collect dnf.log plus any rotated copies, oldest first.
with os.scandir("/var/log") as scan:
    dnf_logs = [entry for entry in scan if entry.name.startswith("dnf.log")]
dnf_logs.sort(key=lambda entry: entry.stat().st_mtime)

# Keep only the lines that report DeltaRPM results.
lines = []
for dnf_log in dnf_logs:
    with open(dnf_log, "r") as f:
        lines.extend(line for line in f if "Delta RPMs" in line)

# Parse each line into (date, before MB, after MB, percentage string).
linestats = []
for line in lines:
    match = DELTA_RE.search(line)
    if not match:
        continue
    date = match.group(1)
    before, after, delta = match.groups()[2:5]
    linestats.append((date, float(before), float(after), delta))

print("      Date  Before (MB)  After (MB)    Saved")
print("--------------------------------------------")
for stat in linestats:
    print("%10s  %10.2f  %10.2f  %8s" % stat)

# Totals across every logged run.
allmb = sum(stat[1] for stat in linestats)
deltamb = sum(stat[2] for stat in linestats)
savedmb = allmb - deltamb
# Avoid dividing by zero when no DeltaRPM lines were found.
percent = 100 * savedmb / allmb if allmb else 0.0
print(
    "\nTotal MBytes: %8.2f Delta MBytes: %8.2f (Saved: %6.2f, %5.2f%%)"
    % (allmb, deltamb, savedmb, percent))