F44 Change Proposal: CommonLicenses [SystemWide]

The GPL variants are one of the easier scenarios. Consider the BSD-3-Clause license:

there is a wildcard placeholder there which is replaced by the name of the project or author, giving us effectively infinitely many valid but different BSD-3-Clause texts. It is not valid to collapse all of those into a single common text in the RPMs: to comply with the license we are required to ship the variant with the particular substitutions defined by the project itself.

To pick on xorg-x11-proto-devel we see several variants within the same package:

$ grep "^IN NO EVENT SHALL" xorg-x11-proto-devel/*
xorg-x11-proto-devel/COPYING-applewmproto:IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
xorg-x11-proto-devel/COPYING-pmproto:IN NO EVENT SHALL THE OPEN GROUP BE LIABLE FOR ANY CLAIM, DAMAGES OR
xorg-x11-proto-devel/COPYING-windowswmproto:IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
xorg-x11-proto-devel/COPYING-xextproto:IN NO EVENT SHALL HEWLETT-PACKARD COMPANY BE LIABLE FOR ANY CLAIM,
xorg-x11-proto-devel/COPYING-xf86driproto:IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
xorg-x11-proto-devel/COPYING-xf86vidmodeproto:IN NO EVENT SHALL Kaleb S. KEITHLEY BE LIABLE FOR ANY CLAIM, DAMAGES
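
To make the point concrete, here is a minimal Python sketch that groups the grep hits above by the party named in the disclaimer. The regular expression and the `holders` helper are only an illustration I made up for this output, not part of any Fedora tooling.

```python
import re
from collections import defaultdict

# The grep output quoted above, verbatim.
GREP_OUTPUT = """\
xorg-x11-proto-devel/COPYING-applewmproto:IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
xorg-x11-proto-devel/COPYING-pmproto:IN NO EVENT SHALL THE OPEN GROUP BE LIABLE FOR ANY CLAIM, DAMAGES OR
xorg-x11-proto-devel/COPYING-windowswmproto:IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
xorg-x11-proto-devel/COPYING-xextproto:IN NO EVENT SHALL HEWLETT-PACKARD COMPANY BE LIABLE FOR ANY CLAIM,
xorg-x11-proto-devel/COPYING-xf86driproto:IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
xorg-x11-proto-devel/COPYING-xf86vidmodeproto:IN NO EVENT SHALL Kaleb S. KEITHLEY BE LIABLE FOR ANY CLAIM, DAMAGES
"""

LINE = re.compile(r"^(?P<path>[^:]+):IN NO EVENT SHALL (?P<holder>.+?) BE LIABLE FOR")


def holders(grep_output):
    """Map each party named in the disclaimer to the files that name it."""
    groups = defaultdict(list)
    for line in grep_output.splitlines():
        match = LINE.match(line)
        if match:
            groups[match.group("holder")].append(match.group("path"))
    return dict(groups)


by_holder = holders(GREP_OUTPUT)
for holder, paths in sorted(by_holder.items(), key=lambda kv: -len(kv[1])):
    print(len(paths), holder)
```

Six files, four different named parties, so even inside one package the texts cannot all be swapped for a single shared file.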

Yeah… there’s a reason my fdupes-based spelunking example didn’t comment on the BSD licenses… It’s also why I keep hinting that we should focus only on strict deduping.

But given that… it makes me wonder how many variants of MIT and BSD there are here on my laptop… and, if someone were foolish enough to install the full package catalogue and run my fdupes analysis, would it still come out to about 2/3 duplicates by size and file count, scaling the same as my laptop?

Let’s assume my laptop analysis scales: 3.5k binary packages with 4.6k license files, 3k of which are duplicates of 400 ‘set’ unique files. Scaling up to, say, 35k binary packages means strict deduping of 30k duplicate files using 4k ‘set’ unique files, plus about 12k other unique files. The ‘set’ unique files are going to be a weird mix of variants of a lot of license files.
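
The back-of-the-envelope scaling can be written out as a few lines of Python. This is only a sketch of the arithmetic in this post: the factor of 10 and the assumption that every count grows linearly with the number of packages are mine, and the dictionary keys are names I picked for illustration.

```python
# Rounded laptop measurements quoted in this post.
laptop = {
    "packages": 3500,
    "license_files": 4600,
    "duplicates": 3000,   # redundant copies that strict deduping could drop
    "sets": 400,          # 'set' unique files the duplicates collapse into
}


def scale(stats, factor):
    """Scale every count linearly (assumption: duplication grows with package count)."""
    scaled = {name: value * factor for name, value in stats.items()}
    # Files that are not part of any duplicate set at all.
    scaled["other_unique"] = (
        scaled["license_files"] - scaled["duplicates"] - scaled["sets"]
    )
    return scaled


full = scale(laptop, 10)
print(full)
```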

Would it be desirable to have a common-license package that has 4k files in it in order to dedupe something like 30k files out of 46k? It’s like a 90% reduction in size just with strict deduping of files that show up at least twice. You’d have dozens of GPLv2 variants among those 4k ‘set’ unique files in the common package, but you could use checksum logic to figure out which of the variants is the verbatim match.
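
The checksum logic could look roughly like the sketch below. It is hedged throughout: the idea of a single directory holding the common files, the choice of SHA-256, and the function names are all assumptions for illustration, not an existing rpm or Fedora mechanism.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Checksum of the exact bytes on disk; any change in the text changes it."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_index(common_dir):
    """Map checksum -> path for every file shipped in the common license directory."""
    return {
        sha256_of(path): path
        for path in Path(common_dir).rglob("*")
        if path.is_file()
    }


def find_verbatim_match(license_file, index):
    """Return the byte-identical common file, or None if this text is a variant."""
    return index.get(sha256_of(license_file))
```

A package license file that comes back as `None` is simply left alone, which is what keeps this strict: nothing is ever replaced by a text that differs from it.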

Maybe 4k files is a little too much… maybe we can just take the most duplicated files,
using a little awk and sort magic with fdupes…

$ fdupes -r -1 ./ | awk '{ print NF, $0 }' | sort -n -r | head -n 50

I get the top 50 duplicated fdupes ‘set’ unique files on my laptop

The top 50 sets account for 2310 of the 3465 duplicated files, about 2/3 of the total, and therefore about 2/3 of the total possible space savings. So we go from a maximum of 90% savings using 400 files down to about 60% savings using 50 files. I wonder how those top 50 from my laptop scale across the entire set of packages?
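
For anyone who wants to repeat the "top N share" calculation on their own machine, here is a small Python sketch. The `top_n_share` helper is hypothetical, and the short list of per-set counts is just a handful of values mentioned in this post, not the full distribution; only the two totals at the end are the real laptop numbers.

```python
def top_n_share(copies_per_set, n):
    """Fraction of all duplicated files that fall into the n largest sets."""
    ranked = sorted(copies_per_set, reverse=True)
    return sum(ranked[:n]) / sum(ranked)


# A few per-set counts mentioned in this post (illustrative subset only).
sample_counts = [175, 66, 14, 12, 12]
print(round(top_n_share(sample_counts, 2), 3))

# The totals from my laptop.
total_duplicated = 3465
print(round(2310 / total_duplicated, 3))  # top 50 sets -> 0.667
print(round(1723 / total_duplicated, 3))  # top 20 sets -> 0.497
```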

Details from my laptop.
top has 175 copies and is a GPLv2 variant
50th has 12 copies and is an APACHEv2 variant
49th has 12 copies and is a BSD 2-clause variant using ‘THE COPYRIGHT HOLDER’ in the ‘NO EVENT SHALL’ clause and the placeholder year and owner at the top.
41st has 14 copies and is a BSD 3-clause variant with boilerplate placeholders

Going even further… the top 20 ‘set’ unique files alone cover 1723 duplicates, which is about a 50% reduction.

Number 10 on the list, with 66 dupes, is a BSD 3-clause variant with boilerplate placeholders.

So just having a relatively small number of common variants of license files may actually have some utility with minimal risk… if the tooling sticks to strict deduping.


Why is the base prefix GPL-3.0? Why not GPL? What would be common prefix for CC-BY-NC-SA-3.0-DE - will it be CC or CC-BY or CC-BY-NC-SA-3.0? So many questions.

Another thing to bear in mind is that SPDX license identifiers can have modifiers for exceptions, e.g. “LGPL-2.1-or-later WITH GCC-exception-2.0”. So when linking to a license, it isn’t always sufficient to link to just one file: the “LGPL-2.1-or-later” text needs to be combined with the “GCC-exception-2.0” text.

Sure, and those exceptions aren’t really a special case. It’s just another identifier bounded by operands. Since the exception text is standardized, it can be captured too.
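
As a rough illustration of "just another identifier", the sketch below pulls every license and exception identifier out of an expression so that each one can be linked to its own text. It is deliberately simplified: it only recognises the upper-case operators `AND`, `OR` and `WITH`, ignores grouping, and does not validate identifiers against the SPDX lists. The `texts_needed` name is made up for this example.

```python
import re

OPERATORS = {"AND", "OR", "WITH"}


def texts_needed(spdx_expression):
    """Return the license and exception identifiers an expression refers to.

    Simplified sketch: parentheses are skipped and operators must be upper case.
    """
    tokens = re.findall(r"[A-Za-z0-9.+-]+", spdx_expression)
    return [token for token in tokens if token not in OPERATORS]


print(texts_needed("LGPL-2.1-or-later WITH GCC-exception-2.0"))
# ['LGPL-2.1-or-later', 'GCC-exception-2.0']
print(texts_needed("(MIT OR Apache-2.0) AND BSD-3-Clause"))
# ['MIT', 'Apache-2.0', 'BSD-3-Clause']
```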

OK, I have installed a system with as many x86_64 packages as I could get installed without various non-declared conflicts stopping things. The following is the data I could get from it.

root@fedora-rawhide:/usr/share/licenses# rpm -qa | wc -l
65893
root@fedora-rawhide:/usr/share/licenses# fdupes -r -m /usr/share/licenses
27594 duplicate files (in 3707 sets), occupying 346.6 megabytes
root@fedora-rawhide:/usr/share/licenses# du -sch .
509M    .
509M    total
root@fedora-rawhide:/usr/share/licenses# find . -type f | wc -l
42615

Now some of the files in licenses don’t seem to be actual licenses, and there seem to be a TON of small variations of the Perl and GPLv1 licenses. However, this should give some upper limits if we ‘assume’ the 42615 files found are licenses: over half are duplicates, which can be brought down to 3707 files. That leaves around 15021 files which aren’t de-duplicated, for a total of around 18728.

I will try to do a more thorough audit of the files and see how many of these are tiny variations using licensecheck and md5sum to see where the variations lie.

Some more info. There are a lot of unknown license files in the /usr/share/licenses directory which cover things like PATENTS or other NOTICES, plus some AUTHORS files. Removing those and running licensecheck on the 39677 files left, I get the following for the top 30 types of licenses:

root@fedora-rawhide:~# awk -F: '{print $2}' /tmp/z | sort | uniq -c | sort -bnr | head -n 30
   6019  MIT License
   4005  BSD 3-Clause License
   3441  *No copyright* Apache License 2.0
   3317  GNU General Public License, Version 2
   2553  GNU General Public License, Version 3
   2536  LaTeX Project Public License 1.3c
   2001  LaTeX Project Public License 1
   1634  GNU Lesser General Public License, Version 2.1
   1077  Apache License 2.0
    981  UNKNOWN
    813  GNU Lesser General Public License, Version 3
    780  BSD 2-Clause License
    763  Artistic License 1.0 and/or GNU General Public License, Version 1 and/or The Perl 5 License
    740  *No copyright* UNKNOWN
    603  GNU Library General Public License, Version 2.0
    551  *No copyright* Creative Commons CC0 1.0
    457  SIL Open Font License 1.1
    453  GNU General Public License, Version 1
    415  Mozilla Public License 2.0
    407  *No copyright* BSD 3-Clause License
    269  ISC License
    239  *No copyright* MIT License
    206  Artistic License 1.0 (Perl) and/or GNU General Public License, Version 1 and/or The Perl 5 License
    197  GNU General Public License v2.0 or later
    179  *No copyright* BSD 2-Clause License
    165  *No copyright* GNU Lesser General Public License, Version 3
    155  LaTeX Project Public License 1.2
    148  *No copyright* Boost Software License 1.0
    142  *No copyright* GNU General Public License, Version 3
    134  Artistic License 2.0

licensecheck reports 756 different types of general licenses, but I can see that it does not differentiate between texts with one address or another in them, or other such things… so I would NOT use this to say we can have some set of N licenses and cover all software.

Oh, and for space savings… to install as many packages as I did I needed 372 GB of disk space, of which deduplicating the license files would save about 340 MB. Most of the storage used by the 65893 packages is in /usr/share/doc, it would seem.

yeah the docs…
But there’s an rpm mechanism to deliberately request that the docs not be placed on the filesystem, right? I’m rusty… but I could have sworn that was a thing. In fact it’s one of the reasons why the licenses can’t be in the docs directory: it creates a compliance issue if users use that mechanism.

Yes there is, and normally if you are creating containers or such you will end up with installing that way. If I had more time to put into this, I would probably try to set up a new system which had that turned off to figure out what space savings there were. One thing I did see was that there are a LOT of small variations of the same files.

licensecheck reads a file and sees that it is MIT or GPL or something, but the real difference is that there is an exception or some other item which changes the license checksum, so just saying it is ‘GPLv3’ and pointing to a common-licenses GPLv3 file may not be right.

I also learned that rebooting a system after you have installed everything is not a good idea.. a lot of services are turned on by default and that ends up with a non-working system :crying_cat:

First of all,
you just confirmed that I’m able to nerdsnipe you.. I pledge not to abuse that.. much.

Second of all,
your at-scale numbers with 65k packages appear to be in line with my scaled-up estimates from my laptop. Your 3707 fdupes ‘set’ unique files is a little lower than my estimate… which suggests that at scale the duplication factor increases slightly nonlinearly.
If I prep a little Python script to spit out a visual distribution of the fdupes output, would you run it against your maximum installed package set?

Also… that last pasted ‘… | head -n 30’ pipeline you showed:
I’m assuming the numbers at the start are duplication counts like the ones I showed, but from the licensecheck point of view instead of the fdupes point of view. Summing up your top 30… I get 35k files, meaning 30 licensecheck ‘set’ unique files account for 35k files on disk?
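
For reference, the sum is easy to check in Python. The counts are copied from the licensecheck listing above and the 39677 total is the number of files it was run against; the variable names are mine.

```python
# Counts from the 'head -n 30' licensecheck listing, in the order shown.
counts = [
    6019, 4005, 3441, 3317, 2553, 2536, 2001, 1634, 1077, 981,
    813, 780, 763, 740, 603, 551, 457, 453, 415, 407,
    269, 239, 206, 197, 179, 165, 155, 148, 142, 134,
]
files_checked = 39677

top30 = sum(counts)
print(top30)                             # 35380
print(round(top30 / files_checked, 3))   # 0.892
```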

licensecheck is clearly grouping together a lot of files that fdupes considers unique and not part of a set of size > 1. It would be interesting to rerun licensecheck against just the files fdupes reports as being in a duplicate set… hmmm…
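
One way to get that file list is to parse the plain `fdupes -r` output, which prints each duplicate set as a block of paths separated by blank lines. The sketch below assumes that default output format (no `-1`, no `-m`); the function names and the sample paths are hypothetical.

```python
def parse_fdupes(output):
    """Split default fdupes output into duplicate sets (blank-line separated)."""
    sets, current = [], []
    for line in output.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sets.append(current)
            current = []
    if current:
        sets.append(current)
    return sets


def files_in_duplicate_sets(output):
    """Flat list of every file that belongs to some duplicate set."""
    return [path for group in parse_fdupes(output) for path in group]


# Hypothetical sample, only to show the shape of the data.
sample = "./pkg-a/COPYING\n./pkg-b/COPYING\n\n./pkg-c/LICENSE\n./pkg-d/LICENSE\n./pkg-e/LICENSE\n\n"
print([len(group) for group in parse_fdupes(sample)])  # [2, 3]
```

The flat list can then be handed to licensecheck, so that both tools are looking at exactly the same files.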

yes. I think licensecheck is grouping all kinds of license files which don’t duplicate as the same. I need to do something a bit smarter when I get this box working again.

I can put a script together against my laptop… and then you can just run that? That way we can share output from the same methodology that compares and contrasts both fdupes and licensecheck.

Yes, you can share it here, on Matrix, or through other communication systems.

Top 50 duplicate license files

| File name | License found | Number of repeats | md5sum |
|---|---|---|---|
| ./texlive-abntex2-doc/lppl1.3.txt | LaTeX Project Public License 1.3c | 2536 | 9f4337828d782bdea41f03dd2ad1b808 |
| ./texlive-12many-doc/lppl1.txt | LaTeX Project Public License 1 | 2001 | 63d57a4e1c4ac88ac6f589a0fa18ce2f |
| ./aerc/vendor/github.com/arran4/golang-ical/LICENSE | *No copyright* Apache License 2.0 | 1547 | 3b83ef96387f14655fc854ddc3c6bd57 |
| ./abbayedesmorts-gpl/COPYING | GNU General Public License, Version 3 | 1231 | d32239bcb673463ab874e80d47fae504 |
| ./64tass/LICENSE-GPL-2.0 | GNU General Public License, Version 2 | 1044 | b234ee4d69f5fce4486a80fdaf4a4263 |
| ./aardvark-dns/LICENSE | *No copyright* Apache License 2.0 | 685 | 86d3f3a95c324c9479bd8986968f4327 |
| ./aerc/vendor/github.com/ProtonMail/go-crypto/PATENTS | *No copyright* UNKNOWN | 639 | 3a55d95595a6f9e37dee53826b4daff2 |
| ./aerc/vendor/golang.org/x/crypto/LICENSE | BSD 3-Clause License | 527 | 7998cb338f82d15c0eff93b7004d272a |
| ./perl-B-Generate/Copying | GNU General Public License, Version 1 | 429 | 5b122a36d0f6dc55279a0ebc69f3c60b |
| ./a2ps/COPYING | GNU General Public License, Version 3 | 426 | 1ebbd3e34237af26da5dc08a4e440464 |
| ./64tass/LICENSE-LGPL-2.1 | GNU Lesser General Public License, Version 2.1 | 370 | 4fbd65380cdd255951079008b364516c |
| ./accessibility-inspector/CC0-1.0.txt | *No copyright* Creative Commons CC0 1.0 | 368 | 65d3616852dbf7b1a6d4b53b00626032 |
| ./akonadi-calendar/LGPL-2.0-or-later.txt | GNU Library General Public License, Version 2.0 | 285 | 6d2d9952d88b50a51a5c73dc431d06c7 |
| ./adwaita-cursor-theme/COPYING_LGPL | GNU Lesser General Public License, Version 3 | 281 | e6a600fd5e1d9cbde2d983680233ad02 |
| ./texlive-a5comb-doc/pd.txt | *No copyright* UNKNOWN | 255 | a05a19671f483e8d4081b91cb173a8f2 |
| ./adb-enhanced/LICENSE | *No copyright* Apache License 2.0 | 245 | e3fc50a88d0a364313df4b21ef20c29e |
| ./0ad/license_gpl-2.0.txt | GNU General Public License, Version 2 | 231 | 751419260aa954499f7abaabaa882bbe |
| ./accessibility-inspector/BSD-3-Clause.txt | *No copyright* BSD 3-Clause License | 220 | 954f4d71a37096249f837652a7f586c0 |
| ./aerc/vendor/github.com/ProtonMail/go-crypto/LICENSE | BSD 3-Clause License | 194 | 5d4950ecb7b26d2c5e4e7b4e0dd74707 |
| ./adf-accanthis-2-fonts/COPYING | GNU General Public License, Version 2 | 187 | 94d55d512a9ba36caa9b7df079bae19f |
| ./accessibility-inspector/LicenseRef-KDE-Accepted-LGPL.txt | *No copyright* GNU Lesser General Public License, Version 3 | 161 | e4b79a181b6483b37d39a27f4d75e60a |
| ./angelfish/GPL-2.0-or-later.txt | GNU General Public License, Version 2 | 156 | 9e2385fe012386d34dcc5c9863070881 |
| ./texlive-abc-doc/lppl1.2.txt | LaTeX Project Public License 1.2 | 155 | 5af7e4b8e7dde262cf18bd7224f454b5 |
| ./texlive-aboensis/ofl.txt | SIL Open Font License 1.1 | 149 | 6ed93967ff0dc6dd9c8d31c17f817a06 |
| ./budgie-desktop-view/LICENSE.md | *No copyright* Apache License 2.0 | 141 | 2ee41112a44fe7014dce33e26468ba93 |
| ./akonadi-calendar/CC0-1.0.txt | *No copyright* Creative Commons CC0 1.0 | 138 | 6fd064768b8d61c31ddd0540570fbd33 |
| ./artikulate/LGPL-3.0-only.txt | GNU Lesser General Public License, Version 3 | 131 | c51d3eef3be114124d11349ca0d7e117 |
| ./akonadiconsole/LicenseRef-KDE-Accepted-GPL.txt | *No copyright* GNU General Public License, Version 3 | 128 | b4c280013bbbadfbe92219498dc5228c |
| ./aerc/vendor/gopkg.in/yaml.v3/NOTICE | Apache License 2.0 | 124 | 6964839e54f4fefcdae13f22b92d0fbb |
| ./alsa-plugins-a52/COPYING.GPL | GNU General Public License, Version 2 | 124 | 59530bdf33659b29e73d4adb9f9f6552 |
| ./vtk-mpich/vtkAcceleratorsVTKmFilters/LICENSE | *No copyright* UNKNOWN | 122 | 913e66ea8b4197e7182f755d09849982 |
| ./asio-devel/LICENSE_1_0.txt | *No copyright* Boost Software License 1.0 | 118 | e4224ccaecb14d942c71d31bef20d78c |
| ./akonadi-calendar/LGPL-2.1-or-later.txt | GNU Lesser General Public License, Version 2.1 | 114 | 2a4f4fd2128ea2f65047ee63fbca9f68 |
| ./akonadiconsole/GPL-3.0-only.txt | GNU General Public License, Version 3 | 113 | 1c76c4cc354acaac30ed4d5eefea7245 |
| ./blueprint-compiler/COPYING | GNU Lesser General Public License, Version 3 | 111 | 3000208d539ec061b899bce1d9ce9404 |
| ./arm-none-eabi-gcc-cs/COPYING.LIB | GNU Lesser General Public License, Version 2.1 | 111 | 2d5025d4aa3495befef8f17206a5b0a1 |
| ./akonadi-calendar-tools/GPL-2.0-or-later.txt | GNU General Public License, Version 2 | 107 | fed54355545ffd980b814dab4a3b312c |
| ./arm-none-eabi-binutils-cs/COPYING3.LIB | GNU Lesser General Public License, Version 3 | 101 | 6a6a8e020838b23406c81b19c1d46df6 |
| ./accel-ppp/COPYING | GNU General Public License, Version 2 | 99 | eb723b61539feef013de476e68b5c50a |
| ./ansible/ansible_collections/amazon/aws/COPYING | GNU General Public License, Version 3 | 99 | 8f0e2cd40e05189ec81232da84bd6e1a |
| ./akonadi-search/LGPL-2.1-only.txt | GNU Lesser General Public License, Version 2.1 | 97 | fabba2a3bfeb22a6483d44e9ae824d3f |
| ./texlive-arphic-doc/other-free.txt | *No copyright* UNKNOWN | 96 | 464c575cb3368dc3314d4054d0dd78cb |
| ./6tunnel/COPYING | GNU General Public License, Version 2 | 91 | 393a5ca445f6965873eca0259a17f833 |
| ./aerc/vendor/gopkg.in/yaml.v3/LICENSE | Apache License 2.0 and/or MIT License | 86 | 3c91c17266710e16afdbb2b6d15c761c |
| ./akonadi-calendar/MIT.txt | *No copyright* UNKNOWN | 86 | 38aa75cf4c4c87f018227d5ec9638d75 |
| ./aerc/vendor/github.com/containerd/console/LICENSE | Apache License 2.0 | 84 | 1269f40c0d099c21a871163984590d89 |
| ./cinnamon-settings-daemon/COPYING.LIB | GNU Lesser General Public License, Version 2.1 | 78 | fbc093901857fcd118f065f900982c24 |
| ./aw-awatcher/LICENSE | Mozilla Public License 2.0 | 78 | 815ca599c9df247a0c7f619bab123dad |
| ./anubis/vendor/go.opentelemetry.io/otel/LICENSE | Apache License 2.0 and/or BSD 3-Clause License | 77 | 24dad3abbe7a8f390afc8ab967bfefa7 |
| ./artikulate/GPL-2.0-only.txt | GNU General Public License, Version 2 | 76 | 93e64b4a83c7e441e48bbdeeea05c977 |
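
The list already shows the same licensecheck name mapping to several different checksums. A small Python sketch makes that visible; the rows are a subset copied from the list above, and `variants_per_license` is just a name I chose for the helper.

```python
from collections import defaultdict

# (license found, number of repeats, md5sum): a subset of the rows listed above.
ROWS = [
    ("GNU General Public License, Version 3", 1231, "d32239bcb673463ab874e80d47fae504"),
    ("GNU General Public License, Version 3", 426, "1ebbd3e34237af26da5dc08a4e440464"),
    ("GNU General Public License, Version 3", 113, "1c76c4cc354acaac30ed4d5eefea7245"),
    ("GNU General Public License, Version 3", 99, "8f0e2cd40e05189ec81232da84bd6e1a"),
    ("LaTeX Project Public License 1.3c", 2536, "9f4337828d782bdea41f03dd2ad1b808"),
]


def variants_per_license(rows):
    """Count how many distinct checksums share each licensecheck name."""
    checksums = defaultdict(set)
    for license_name, _repeats, md5 in rows:
        checksums[license_name].add(md5)
    return {name: len(sums) for name, sums in checksums.items()}


for name, variants in variants_per_license(ROWS).items():
    print(variants, name)
```

Even within the top 50, plain "GPLv3" is four byte-different files, so the name alone does not pick out one text.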

For the most part I found that the differences between various versions of a license were due to the following:

  • Added copyright data saying who copyrighted the file and when
  • Format changes where only the spacing changed
  • Address changes for web URLs and places to send mail to
  • A major change in the KDE license, where I am not sure which variant is valid
ssmoogen@fedora-rawhide:/tmp$ grep -i kde /tmp/z4
| ./accessibility-inspector/LicenseRef-KDE-Accepted-LGPL.txt| *No copyright* GNU Lesser General Public License, Version 3| 161| e4b79a181b6483b37d39a27f4d75e60a |
| ./akonadiconsole/LicenseRef-KDE-Accepted-GPL.txt| *No copyright* GNU General Public License, Version 3| 128| b4c280013bbbadfbe92219498dc5228c |

ssmoogen@fedora-rawhide:/usr/share/licenses$ diff ./accessibility-inspector/LicenseRef-KDE-Accepted-LGPL.txt ./akonadiconsole/LicenseRef-KDE-Accepted-GPL.txt
2,5c2,5
< modify it under the terms of the GNU Lesser General Public
< License as published by the Free Software Foundation; either
< version 3 of the license or (at your option) any later version
< that is accepted by the membership of KDE e.V. (or its successor
---
> modify it under the terms of the GNU General Public License as
> published by the Free Software Foundation; either version 3 of
> the license or (at your option) at any later version that is
> accepted by the membership of KDE e.V. (or its successor
7c7
< proxy as defined in Section 6 of version 3 of the license.
---
> proxy as defined in Section 14 of version 3 of the license.
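
Of the causes listed above, the spacing-only ones are the easiest to separate mechanically from real wording changes such as the KDE diff. The sketch below compares a strict digest with a whitespace-normalized one. It is only meant as an audit aid under my own assumptions (SHA-256, collapsing all whitespace runs, the `classify` labels); it does not imply that formatting variants could be swapped for one another.

```python
import hashlib


def strict_digest(text):
    """Digest of the text exactly as written."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalized_digest(text):
    """Digest after collapsing every run of whitespace into a single space."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()


def classify(a, b):
    """Label a pair of texts as identical, formatting only, or substantive."""
    if strict_digest(a) == strict_digest(b):
        return "identical"
    if normalized_digest(a) == normalized_digest(b):
        return "formatting only"
    return "substantive"


# First two changed lines of each side of the KDE diff above.
lgpl = ("modify it under the terms of the GNU Lesser General Public\n"
        "License as published by the Free Software Foundation; either\n")
gpl = ("modify it under the terms of the GNU General Public License as\n"
       "published by the Free Software Foundation; either version 3 of\n")
# The LGPL lines re-wrapped by hand, to imitate a spacing-only variant.
rewrapped = ("modify it under the terms of the GNU Lesser\n"
             "General Public License as published by the\n"
             "Free Software Foundation; either\n")

print(classify(lgpl, gpl))        # substantive
print(classify(lgpl, rewrapped))  # formatting only
```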