What data will be collected, exactly? — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

mpearson · July 8, 2023, 8:49pm

Full disclosure: I’m the lead on the Lenovo Linux team. This note is written without my Lenovo hat on - these are my personal views. I’m hesitant to post because the topic is so controversial…but I had thoughts.

The original proposal mentions how the data would help Red Hat “collect specific metrics to justify additional time spent on contributing to Fedora or additional investment in Fedora” - and I want to highlight that it has benefits outside just Red Hat.

One of the biggest problems I have with the Lenovo Linux program is convincing product teams and web teams that the Linux market is real and that we should be doing more.
They know nothing about Linux and from their point of view it often looks like a small market and doesn’t make sense to spend time and resources on it. We don’t do Linux support on Ideapads, Legions, etc because of this. If you’re wondering why many HW vendors only seem to do Linux support on the high end workstation platforms - it’s because of enterprise demand creating the business case. I personally believe the consumer market is there - but it’s extremely hard to prove and therefore to get teams to make that leap.

I don’t want to support anything that truly invades privacy in any way. From a Lenovo perspective I think if we encouraged that we’d (quite rightly) get flamed to oblivion - privacy is a Linux super power and very important. But as someone whose job it is to make Linux run better on Lenovo platforms and who has to convince teams to invest and grow the Linux team…I really wish I had some accurate, anonymised data to back up my arguments.

If this goes ahead then, data-wise, something that determines what HW is being used could be really useful for the whole ecosystem. I suspect it would help many vendors/manufacturers/etc build business cases to grow Linux support and offerings (of course…it could backfire horribly - I’ve avoided using opt-in data from previous projects thus far for that very reason - the numbers are too small). I hope this effect would generally be viewed as a positive thing for the community overall (though I suspect the topic is too emotional).

I do have to caveat all the above with the fact that Linux demand is slowly increasing anyway and programs like ours are a part of proving that. Whilst having the data may accelerate progress and indeed be useful, I don’t think it’s worth destroying trust in Fedora over.

fraetor · July 9, 2023, 1:34pm

This sort of crosses over into the Opt-in/opt-out debate, but is the first time setup information so useful that collecting metrics before the chance to opt out is presented is required?

There seems to be a lot of consternation around this point, despite that data not being sent anywhere. It is also rather confusing that it isn’t a simple on/off. It might help make this change more palatable if this wasn’t the case.

On the other hand, not doing this would preclude you from instrumenting the installer/setup process, which I guess would be a reason that you need it. There is also not any technical argument for this, as the data is never sent anywhere and deleted if telemetry isn’t enabled, but the emotional argument may be sufficient here. Also the telemetry collected by the time of the initial setup will not really have had a chance for any sensitive data to be collected, short of the username.

raxel-pepi · July 9, 2023, 4:15pm

I agree that knowing how the data is going to be used, handled, changed (if what is collected is added to or removed in the future) and distributed to the public as statistics is as important as, if not more important than, knowing what data is collected.
It’s the more delicate part of the process, as any mishap can lead to dire consequences; even if the data is “harmless”, it can lead to identification (especially if an expert reviews it).
This alone is a good argument to avoid telemetry altogether: it’s like nuclear energy, which can be either the savior of the energy crisis or a tool for destruction.

Now, if the telemetry does get approved, here are some things related to packages I think will be very useful to collect.

Package list:
This is somewhat obvious, but knowing what users install with a metric of total installations is a game changer for developers.
It might reveal that users use software that lacks the care for such an important package or vice-versa, that a very tested package is rarely used. Either way, it helps developers know what to prioritize in their work.

An advantage I see with this is it can help refine the software selection in Workstation.

For example, as soon as I install Workstation I swap Rhythmbox and Cheese for Audacious and Guvcview.
Let’s say the amount of users who do that change ends up being the majority. Isn’t it convenient to change the software selection in Workstation to adapt to the new majority?
Or if the telemetry reveals that a lot of users install Waydroid (software that might be perceived as bloatware in an ootb install), maybe make GNOME Software more prone to displaying it on the home page? Although it’s possible that approach can create a feedback loop where more users click on it because it’s displayed first.

What types of packages do we collect?
I see a possibility of leaking personal information if we collect information on Third Party DNF Repositories/Packages and third party Flatpak Repositories/Packages.
The telemetry could end up revealing information about software in the works: it could report a package the user is developing that needs to be kept secret, or leak data about a private repository.

What if the telemetry ends up revealing that some government or private organization uses Fedora to control their critical systems thanks to key names getting leaked? It could reveal to skilled hackers that a certain organization is very likely to use software vulnerable to an exploit.

I would restrict the telemetry to only collect Package Information from the Official Repositories, Source Code Repositories, Flathub & Fedora Flatpaks and Snaps.
Counting COPR here is key (it already reports every time software from there is installed), because it can reveal software that users are installing that needs to be a DNF package. An example would be gnome-patched from Calcastor, which has 4,396 total downloads (it’s software I need because GNOME’s default performance on my hardware is lackluster and triple buffering makes it smooth).

AppImages can also compromise private information. What if an organization uses them to distribute private software inside their systems?
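
To make the restriction above concrete, here is a minimal sketch of filtering an installed-package list by origin before anything is reported. It is not how the proposed telemetry works; the allowlist contents, the `(name, origin)` input shape and the function name are all assumptions chosen for illustration.

```python
# Hypothetical allowlist of package origins considered safe to report.
# The identifiers are illustrative; a real list would be agreed by the community.
ALLOWED_ORIGINS = {"fedora", "updates", "flathub", "fedora-flatpaks"}


def reportable_packages(installed):
    """Return the sorted names of packages whose origin is on the allowlist.

    `installed` is an iterable of (name, origin) pairs. Anything from an
    unlisted origin (private repos, local builds, AppImages) is dropped
    before it could ever be reported.
    """
    return sorted(name for name, origin in installed if origin in ALLOWED_ORIGINS)


installed = [
    ("rhythmbox", "fedora"),
    ("audacious", "updates"),
    ("org.example.InternalTool", "corp-private-repo"),  # private repository
    ("secret-prototype", "local-build"),                # locally built package
]

print(reportable_packages(installed))
```

The point of the design is that filtering happens on the client, so names from private repositories never leave the machine.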

raxel-pepi · July 9, 2023, 6:09pm

I found something interesting Arch Linux does to collect telemetry with an opt-in method: pkgstats.

It’s a package that collects the list of packages once per week, and it sends the information to https://pkgstats.archlinux.de/
Here’s its wiki page: pkgstats - ArchWiki

fraetor · July 9, 2023, 6:56pm

Debian has something similar with popcon https://popcon.debian.org/.

william8000 · July 10, 2023, 12:05pm

Aren’t issues with application incompatibility already handled better by ABRT? Telemetry might give a hint that an application isn’t working, but ABRT gives the details.

william8000 · July 10, 2023, 12:06pm

Doesn’t the Fedora repo already track package downloads? No method is perfect for tracking application usage. The logs on the Fedora repo already give a non-invasive method of tracking package installations.

tomh · July 10, 2023, 3:03pm

No, because package downloads are handled by the mirror network and don’t generally involve any Fedora operated servers so there is no way for anybody centrally to know what has been downloaded.

supakeen · July 10, 2023, 3:44pm

Would it be nice if there’s the possibility of also defining a duration for the collection of a certain metric (and disallowing ‘in-perpetuity’)?

boredsquirrel · July 11, 2023, 3:40pm

What about KDE? How is this tied to GNOME apps?

What is Fedora, and in what way is it tied to GNOME? What makes the desktop important, and what the distro?

I think having telemetry for GUI things is very important, so 100% desktop

used settings pages and subpages (move to the top, a level up?)
settings searched for that don’t exist
setup changes
used apps
used key combos or never used key combos
changed things in GNOME, extensions, …
uninstalled preinstalled Apps

Apart from that I think App user feedback is already very rich. There are issues everywhere.

And I actually don’t know what problems I have with Fedora KDE…

rpmfusion not installed through GUI (KDE setup dialog now?)
some systemd services like autoupdates premade, independent of GUI elements
udev rules for some things
podman USB access
user in groups libvirt and plugdev by default
some wheel polkit exceptions for LUKS, udisks2, kde-partitionmanager
Discover running in the background so I disable it always, not actually sure it worked.
fish shell with shortcuts
nice KDE backgrounds instead of these Fedora ones
a custom Grub theme
a better SDDM theme
an actually good color theme

These are all things I can report, but yes, to understand usage of complex GUI things, more KDE telemetry could help. And maybe also for the tiny Fedora part apart from the desktop.

jrredho · July 11, 2023, 5:00pm

This is fantastic! We’ve gone from no telemetry to key-stroke monitoring.

Apart from my view that all of this is fully inconsistent with the Fedora Project Mission Statement, most of what I see cited are user settings interfaces.

Yet I ask myself, would any of this proposed telemetry have been useful in motivating some of the relatively recent big changes that Fedora has been at the forefront of? The move to systemd? systemd-resolved? Defaulting to btrfs? Migrating to Pipewire? The changes now being considered to the efi boot system? Would any of them have happened if you’d been tracking user activities?

My guess is emphatically no. Because that tracking wouldn’t have pushed innovation.

This change will result in a significant hit to the reputation of the Fedora Project. I think that you will sorely regret it as time goes by.

td211 · July 11, 2023, 5:06pm

You can have that if you want: on KDE you can share usage data, which is disabled by default. You can set it to full data and share the following:

mattdm · July 12, 2023, 2:43pm

2 posts were merged into an existing topic: F40 Change Request: Privacy-preserving Telemetry for Fedora Workstation (System-Wide)

py0xc3 · July 12, 2023, 4:14pm

Poll related to this topic:

mattdm · July 12, 2023, 4:35pm

I guess there might be room for a poll about what kind of data people might be comfortable with, under what circumstances.

For example, I personally would be comfortable with hardware information, even as “buried opt-out”, as long as it is stored in a non-fingerprintable way.[1] I’m comfortable with reporting what packages and flatpaks I have installed and even non-specific information about the ones I’ve run with explicit opt-out. I’m comfortable with that for most GNOME settings, as well. I’m comfortable participating in some UX studies with opt-in. But something like “file types in my Documents directory” or whatever? Nope!

[1] This touches on Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation… Personally, for myself, I’m not worried if an encrypted form of this passes through a proxy together before it’s separated, although I can think of stronger designs for that too.

mattdm · July 12, 2023, 5:01pm

A post was merged into an existing topic: Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

catanzaro · July 12, 2023, 9:36pm

Your post is a good summary of the various options. I’m actually hoping to do your option 1, though: never ask the user (again). Yeah, your least-favorite choice. As you’ve pointed out, regularly prompting for consent is going to be disruptive. Just a simple yes/no toggle to control all data collection is the best way to expose this to users. My plan is to link to a wiki page with detailed information on each data point that would be collected and what they would look like. The OS UI itself would be simple and the complexity would be contained to the wiki page. I’m reasonably confident that if we ever start collecting anything too invasive, people who care will notice, rise up in arms, and most Fedora users will find out about it soon enough.

Moreover, to avoid huge lists of new telemetry items, I would also group them together in “groups” so that the user can assess more quickly what’s going on. For instance, interface telemetry, system info telemetry, installed software telemetry and so on.

And finally, should the user be able to select single telemetry items (or, single groups)? Or is it ‘all-in’ for telemetry?

You’re envisioning some complicated UI for fine-grained control of telemetry that would be exposed in the OS. I am not. It would probably not be accepted upstream, so it’s not likely something we can do. But because there have been several other requests for this, maybe we can do this as an extra app for power users. You would need to install it manually, though, since it wouldn’t be a suitable UI for typical users.

Also, if we’re going to limit the duration for which metrics can be collected, most likely we’d wind up only collecting most metrics for a few weeks. So even if we were to do this, I wouldn’t expect any huge lists of metrics to toggle and it wouldn’t be worth the effort to build them into the OS. I really have no clue how many we would have, but I’d expect relatively few.

Also remember that every setting we expose in the UI would have to be localized, which realistically will only happen upstream.

catanzaro · July 12, 2023, 9:45pm

I’d envision new metrics being added and old metrics removed on a regular basis, but it’s hard to guess how often this would happen. Depends on developer requests.

As for specific vs. broad high-level info: I would say both.

I think this would ultimately be decided on a case-by-case basis for each metric that we would collect, via the hypothetical community process for approving metrics, but let’s say that by default each metric would be collected for two months unless there is a good reason to use a different time span. I don’t think we need long-term data collection to answer most interesting questions.

Yes, a package update would be required on the client side, and a server update would be needed too. Unknown metrics will be treated as malformed by the server. If some people don’t apply updates they’ll just keep sending old metrics, which is fine.
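
As a rough illustration of the server-side behaviour described here, the sketch below checks incoming metrics against a registry of approved metrics, each with its own collection window. The registry, the metric names, the dates, the 60-day stand-in for "two months" and the separate "expired" status are all assumptions of mine, not part of the proposal.

```python
from datetime import date, timedelta

# "Two months" approximated as 60 days (an assumption for this sketch).
DEFAULT_WINDOW = timedelta(days=60)

# Hypothetical registry: metric name -> (collection start date, collection window).
APPROVED_METRICS = {
    "hardware.cpu_vendor": (date(2024, 1, 1), DEFAULT_WINDOW),
    "desktop.settings_panel_opened": (date(2024, 1, 1), timedelta(days=14)),
}


def validate(metric_name, received_on):
    """Classify an incoming metric as 'accepted', 'expired' or 'malformed'."""
    entry = APPROVED_METRICS.get(metric_name)
    if entry is None:
        # Metrics the server does not know about are treated as malformed.
        return "malformed"
    start, window = entry
    if not (start <= received_on < start + window):
        # Clients that have not updated keep sending old metrics; discard them.
        return "expired"
    return "accepted"


print(validate("hardware.cpu_vendor", date(2024, 2, 1)))
print(validate("hardware.cpu_vendor", date(2024, 3, 15)))
print(validate("unknown.metric", date(2024, 2, 1)))
```

Keeping the window next to each metric in one registry would also give the community approval process a single place to review what is collected and for how long.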

catanzaro · July 12, 2023, 9:52pm

James Frost:

This sort of crosses over into the Opt-in/opt-out debate, but is the first time setup information so useful that collecting metrics before the chance to opt out is presented is required?

There seems to be a lot of consternation around this point, despite that data not being sent anywhere. It is also rather confusing that it isn’t a simple on/off. It might help make this change more palatable if this wasn’t the case.

On the other hand, not doing this would preclude you from instrumenting the installer/setup process, which I guess would be a reason that you need it. There is also not any technical argument for this, as the data is never sent anywhere and deleted if telemetry isn’t enabled, but the emotional argument may be sufficient here. Also the telemetry collected by the time of the initial setup will not really have had a chance for any sensitive data to be collected, short of the username.

As you say, it’s only required if we need to collect data on the first boot itself. (I guess technically data could be collected by the installer session too, but it will always be deleted and not uploaded because the user will not be prompted to consent to data collection until first boot, after the installer session is gone.)

Theoretically, this early data could be useful to help track down user experience problems, but realistically I don’t think it will be that useful to us. Our first boot experience is simple and reliable, after all. I’m considering simplifying the change proposal by removing this detail, especially since there have been several complaints about collecting data before consent, even though it never gets uploaded.
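
For readers trying to picture the pre-consent behaviour under discussion, here is a minimal sketch of a local buffer that records events before the user has been asked, never releases them for upload without consent, and deletes them if the user declines. The class and method names are hypothetical and this is not the actual implementation.

```python
class LocalMetricsBuffer:
    """Holds events on the local machine until the user answers the consent prompt."""

    def __init__(self):
        self._events = []
        self.consent = None  # None means the user has not been asked yet

    def record(self, event):
        # After an explicit "no", nothing is recorded at all.
        if self.consent is False:
            return
        self._events.append(event)

    def set_consent(self, enabled):
        self.consent = enabled
        if not enabled:
            # Declining deletes everything collected before the prompt.
            self._events.clear()

    def drain_for_upload(self):
        # Events only leave the buffer once consent has been given.
        if self.consent is not True:
            return []
        events, self._events = self._events, []
        return events


buf = LocalMetricsBuffer()
buf.record("first-boot: language page shown")
print(buf.drain_for_upload())  # nothing is released before consent
buf.set_consent(False)
print(buf.drain_for_upload())  # still nothing, and the buffer is now empty
```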

catanzaro · July 12, 2023, 9:54pm

Chris:

Dalto made a major point. Michael, you just dismiss it if people have a different opinion. The best they can get is generalized phrases that do not tell much (“make Fedora the major distribution”, which was never set in contrast/context to your proposal, and at this point, this phrase can be questioned given the developments that imho already created damage to the community and its trust). Alternatively, others’ thoughts are ignored. You are also very selective to capture only scenarios and points that support your opinion (you ignore other scenarios/points, including when it comes to protect data at stages that do not serve your proposal). Your elaborations already brought personal data together but then you limited the scenarios so that you do not need to care.

What is underlying to your arguments is that you are right and if people disagree, then they have not yet understood. This can be seen already as problematic and questions if you should process such people’s data.

Data gathering is something that has to be taken seriously - it contains many critical and complex tasks, always. You should not get data that is not necessary. However, you want to get data, but so far without idea what data or for what explicitly.

Usually, issues like that shall start with a problem. Then, you check out what you need to solve a problem. Specific data can be a potential solution. Then, you check out if and possibly, how to get that data.

However, you first want to get data, then you want to find out which data and thus what problem to solve with it: sounds a little like “first of all, just get data, then let’s think for what we need it”. This already indicates minor respect of the responsibility when it comes to data processing.

I am not sure that at this point if the trust can be recovered tbh.

What I do not understand is how you expect me to respond to comments like this? I don’t see anything actionable to respond to here.
