What data will be collected, exactly? — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

Full disclosure: I’m the lead on the Lenovo Linux team. This note is without my Lenovo hat on - it’s my personal views. I’m hesitant to post - because it is so controversial…but I had thoughts.

The original proposal mentions that the data would help Red Hat “collect specific metrics to justify additional time spent on contributing to Fedora or additional investment in Fedora” - and I want to highlight that it has benefits beyond just Red Hat.

One of the biggest problems I have with the Lenovo Linux program is convincing product teams and web teams that the Linux market is real and that we should be doing more.
They know nothing about Linux and from their point of view it often looks like a small market and doesn’t make sense to spend time and resources on it. We don’t do Linux support on Ideapads, Legions, etc because of this. If you’re wondering why many HW vendors only seem to do Linux support on the high end workstation platforms - it’s because of enterprise demand creating the business case. I personally believe the consumer market is there - but it’s extremely hard to prove and therefore to get teams to make that leap.

I don’t want to support anything that truly invades privacy in any way. From a Lenovo perspective I think if we encouraged that we’d (quite rightly) get flamed to oblivion - privacy is a Linux super power and very important. But as someone whose job it is to make Linux run better on Lenovo platforms and who has to convince teams to invest and grow the Linux team…I really wish I had some accurate, anonymised data to back up my arguments :slight_smile:

If this goes ahead then, data-wise, something that determines what HW is being used could be really useful for the whole ecosystem. I suspect it would help many vendors/manufacturers/etc build business cases to grow Linux support and offerings (of course…it could backfire horribly - I’ve avoided using opt-in data thus far from previous projects for that very reason - the numbers are too small). I hope this effect would be generally viewed as a positive thing for the community overall (though I suspect the topic is too emotional).

I do have to caveat all the above with the fact that Linux demand is slowly increasing anyway and programs like ours are a part of proving that. Whilst having the data may accelerate progress and indeed be useful, I don’t think it’s worth destroying trust in Fedora over.


This sort of crosses over into the Opt-in/opt-out debate, but is the first time setup information so useful that collecting metrics before the chance to opt out is presented is required?

There seems to be a lot of consternation around this point, despite that data not being sent anywhere. It is also rather confusing that it isn’t a simple on/off. It might help make this change more palatable if this weren’t the case.

On the other hand, not doing this would preclude you from instrumenting the installer/setup process, which I guess would be a reason that you need it. There is also no technical argument for dropping it, as the data is never sent anywhere and is deleted if telemetry isn’t enabled, but the emotional argument may be sufficient here. Also, by the time of the initial setup there will not really have been a chance for any sensitive data to be collected, short of the username.

I agree that knowing how the data is going to be used, handled, changed (if what is collected will be removed or added to in the future) and distributed to the public as statistics is as important as, if not more important than, knowing what data is collected.
It’s the more delicate part of the process, as any mishap can lead to dire consequences, even if the data is “harmless” it can lead to identification (especially if an expert reviews it).
This alone is a good argument to avoid telemetry altogether, it’s like nuclear energy, it can either be the savior of the energy crisis or a tool for destruction.

Now, if the telemetry does get approved, here are some things related to packages I think will be very useful to collect.

Package list:
This is somewhat obvious, but knowing what users install with a metric of total installations is a game changer for developers.
It might reveal that users use software that lacks the care for such an important package or vice-versa, that a very tested package is rarely used. Either way, it helps developers know what to prioritize in their work.

An advantage I see with this is it can help refine the software selection in Workstation.

For example, as soon as I install Workstation I swap Rhythmbox and Cheese for Audacious and Guvcview.
Let’s say the number of users who make that change ends up being the majority. Wouldn’t it be convenient to change the software selection in Workstation to adapt to the new majority?
Or if the telemetry reveals that a lot of users install Waydroid (software that might be perceived as bloatware in an ootb install), maybe make GNOME Software more prone to displaying it on the home page? Although it’s possible that approach could create a loop where more users click on it because it’s displayed first.

What types of packages do we collect?
I see a possibility of leaking personal information if we collect information on Third Party DNF Repositories/Packages and third party Flatpak Repositories/Packages.
The telemetry could end up revealing information of software in the works, it can report a package the user is making that needs to be kept secret or leak data about a private repository.

What if the telemetry ends up revealing that some government or private organization uses Fedora to control their critical systems thanks to key names getting leaked? It could reveal to skilled hackers that a certain organization is very likely to use software vulnerable to an exploit.

I would restrict the telemetry to only collect Package Information from the Official Repositories, Source Code Repositories, Flathub & Fedora Flatpaks and Snaps.
Counting COPR here is key (it already reports every time software from there is installed), because it can reveal software that users are installing that needs to be a DNF package. An example would be gnome-patched from Calcastor, which has 4.396 total downloads (it’s software I need because GNOME’s default performance on my hardware is lackluster and triple buffering makes it smooth).

AppImages can also compromise private information. What if an organization uses them to distribute private software inside their systems?
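
To make the restriction suggested above a bit more concrete, here is a minimal sketch of a client-side filter that only reports packages whose origin is on an allowlist. Everything in it is an assumption for illustration: the origin labels, the `InstalledPackage` record, and the `reportable_names` helper are hypothetical and not part of any actual Fedora telemetry design.

```python
from dataclasses import dataclass

# Assumption: these origin labels are invented for illustration; real
# repository and remote identifiers may differ.
ALLOWED_ORIGINS = frozenset(
    {"fedora", "updates", "copr", "flathub", "fedora-flatpaks"}
)


@dataclass(frozen=True)
class InstalledPackage:
    name: str
    origin: str  # repository or remote the package was installed from


def reportable_names(packages):
    """Return the sorted names of packages whose origin is on the allowlist.

    Packages from any other origin (private repos, local builds, AppImages)
    are dropped before anything leaves the machine.
    """
    return sorted({p.name for p in packages if p.origin in ALLOWED_ORIGINS})


installed = [
    InstalledPackage("rhythmbox", "fedora"),
    InstalledPackage("guvcview", "updates"),
    InstalledPackage("audacious", "fedora"),
    InstalledPackage("secret-internal-tool", "corp-private-repo"),
]

print(reportable_names(installed))
```

The point of filtering on the client is that the name of a private package never has to be trusted to the server at all.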


I found something interesting that Arch Linux does to collect telemetry with an opt-in method: pkgstats.

It’s a package that collects the list of packages once per week, and it sends the information to https://pkgstats.archlinux.de/
Here’s its wiki page: pkgstats - ArchWiki

Debian has something similar with popcon https://popcon.debian.org/.


Aren’t issues with application incompatibility already handled better by ABRT? Telemetry might give a hint that an application isn’t working, but ABRT gives the details.

Doesn’t the Fedora repo already track package downloads? No method is perfect for tracking application usage. The logs on the Fedora repo already give a non-invasive method of tracking package installations.

No, because package downloads are handled by the mirror network and don’t generally involve any Fedora operated servers so there is no way for anybody centrally to know what has been downloaded.

Would it be nice if there were also the possibility of defining a duration for the collection of a certain metric (and disallowing ‘in-perpetuity’)?


What about KDE? How is this tied to GNOME apps?

What is Fedora, and in what way is it tied to GNOME? What makes the desktop important, and what makes the distro important?

I think having telemetry for GUI things is very important, so 100% desktop

  • used settings pages and subpages (move to the top, a level up?)
  • settings searched for that don’t exist
  • setup changes
  • used apps
  • used key combos or never used key combos
  • changed things in GNOME, extensions, …
  • uninstalled preinstalled Apps

Apart from that I think App user feedback is already very rich. There are issues everywhere.

And I actually don’t know what problems I have with Fedora KDE…

  • rpmfusion not installed through GUI (KDE setup dialog now?)
  • some systemd services like autoupdates premade, independent of GUI elements
  • udev rules for some things
  • podman USB access
  • user in groups libvirt and plugdev by default
  • some wheel polkit exceptions for LUKS, udisks2, kde-partitionmanager
  • Discover running in the background so I disable it always, not actually sure it worked.
  • fish shell with shortcuts
  • nice KDE backgrounds instead of these Fedora ones
  • a custom Grub theme
  • a better SDDM theme
  • an actually good color theme

These are all things I can report, but yes, to understand usage of complex GUI things, more KDE telemetry could help. And maybe also for the tiny Fedora part apart from the desktop.

This is fantastic! We’ve gone from no telemetry to keystroke monitoring.

Other than my problems with thinking all of this is fully inconsistent with the Fedora Project Mission Statement, most of what I see cited are user settings interfaces.

Yet I ask myself, would any of this proposed telemetry have been useful in motivating some of the relatively recent big changes that Fedora has been at the forefront of? The move to systemd? systemd-resolved? Defaulting to btrfs? Migrating to Pipewire? The changes now being considered to the efi boot system? Would any of them have happened if you had been tracking user activities?

My guess is emphatically no. Because that tracking wouldn’t have pushed innovation.

This change will result in a significant hit to the reputation of the Fedora Project. I think that you will sorely regret it as time goes by.


You can have that if you want, on KDE you can share usage data, it is disabled by default. You can set it to full data and share the following:



Poll related to this topic:

I guess there might be room for a poll about what kind of data people might be comfortable with, under what circumstances.

For example, I personally would be comfortable with hardware information, even as “buried opt-out”, as long as it is stored in a non-fingerprintable way.[1] I’m comfortable with reporting what packages and flatpaks I have installed and even non-specific information about the ones I’ve run with explicit opt-out. I’m comfortable with that for most GNOME settings, as well. I’m comfortable participating in some UX studies with opt-in. But something like “file types in my Documents directory” or whatever? Nope!


  1. This touches on Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation. Personally, I’m not worried if an encrypted form of this passes through a proxy together before it’s separated, although I can think of stronger designs for that too.



Your post is a good summary of the various options. I’m actually hoping to do your option 1, though: never ask the user (again). Yeah, your least-favorite choice. As you’ve pointed out, regularly prompting for consent is going to be disruptive. Just a simple yes/no toggle to control all data collection is the best way to expose this to users. My plan is to link to a wiki page with detailed information on each data point that would be collected and what they would look like. The OS UI itself would be simple and the complexity would be contained to the wiki page. I’m reasonably confident that if we ever start collecting anything too invasive, people who care will notice, rise up in arms, and most Fedora users will find out about it soon enough.

Moreover, to avoid huge lists of new telemetry items, I would also group them together in “groups” so that the user can assess more quickly what’s going on. For instance, interface telemetry, system info telemetry, installed software telemetry and so on.

And finally, should the user be able to select single telemetry items (or, single groups)? Or is it ‘all-in’ for telemetry?

You’re envisioning some complicated UI for fine-grained control of telemetry that would be exposed in the OS. I am not. :slight_smile: It would probably not be accepted upstream, so it’s not likely something we can do. But because there have been several other requests for this, maybe we can do this as an extra app for power users. You would need to install it manually, though, since it wouldn’t be a suitable UI for typical users.

Also, if we’re going to limit the duration for which metrics can be collected, most likely we’d wind up only collecting most metrics for a few weeks. So even if we were to do this, I wouldn’t expect any huge lists of metrics to toggle and it wouldn’t be worth the effort to build them into the OS. I really have no clue how many we would have, but I’d expect relatively few.

Also remember that every setting we expose in the UI would have to be localized, which realistically will only happen upstream.

I’d envision new metrics being added and old metrics removed on a regular basis, but it’s hard to guess how often this would happen. Depends on developer requests.

As for specific vs. broad high-level info: I would say both.

I think this would ultimately be decided on a case-by-case basis for each metric that we would collect, via the hypothetical community process for approving metrics, but let’s say that by default each metric would be collected for two months unless there is a good reason to use a different time span. I don’t think we need long-term data collection to answer most interesting questions.

Yes, a package update would be required on the client side, and a server update would be needed too. Unknown metrics will be treated as malformed by the server. If some people don’t apply updates they’ll just keep sending old metrics, which is fine.
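
As a rough illustration of the "unknown metrics are malformed" rule, here is a minimal sketch of what server-side validation against a fixed set of known metrics might look like. The metric names, the `KNOWN_METRICS` table, and the `MalformedReport` exception are all made up for the example; the real server may work quite differently, and how it treats metrics that have since been retired is not something this sketch tries to answer.

```python
# Assumption: metric names and expected types are hypothetical examples.
KNOWN_METRICS = {
    "hw_model": str,
    "installed_package_count": int,
}


class MalformedReport(ValueError):
    """Raised when a report contains anything the server does not recognise."""


def validate_report(report):
    """Accept a report only if every metric is known and has the expected type."""
    for name, value in report.items():
        expected_type = KNOWN_METRICS.get(name)
        if expected_type is None:
            raise MalformedReport(f"unknown metric: {name}")
        if not isinstance(value, expected_type):
            raise MalformedReport(f"wrong type for metric: {name}")
    return report


validate_report({"hw_model": "ExampleBook 14", "installed_package_count": 1800})

try:
    validate_report({"keystrokes": 5})
except MalformedReport as err:
    print("rejected:", err)
```

With a design like this, adding a metric means updating both the client package and the server's table, which matches the two-sided update described above.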

As you say, it’s only required if we need to collect data on the first boot itself. (I guess technically data could be collected by the installer session too, but it will always be deleted and not uploaded because the user will not be prompted to consent to data collection until first boot, after the installer session is gone.)
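
For anyone trying to picture the "collected locally, deleted unless the user consents" behaviour, here is a minimal sketch of the idea. The `PendingEvents` class and the `upload` callback are hypothetical names invented for this example; they do not describe the actual implementation.

```python
class PendingEvents:
    """Holds events in memory until the user answers the consent prompt."""

    def __init__(self):
        self._events = []

    def record(self, event):
        # Nothing leaves the machine here; the event is only buffered.
        self._events.append(event)

    def resolve(self, consented, upload):
        """Hand buffered events to `upload` if consented, otherwise discard them.

        Either way the buffer is emptied. Returns the number of events uploaded.
        """
        events, self._events = self._events, []
        if not consented:
            return 0
        upload(events)
        return len(events)


sent = []
buffer = PendingEvents()
buffer.record("first_boot_started")
buffer.record("first_boot_finished")

# The user declines: the buffered events are dropped and nothing is uploaded.
uploaded = buffer.resolve(consented=False, upload=sent.extend)
print(uploaded, sent)
```

The design choice worth noticing is that declining and consenting both clear the buffer, so there is no code path where pre-consent data lingers on disk or in memory.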

Theoretically, this early data could be useful to help track down user experience problems, but realistically I don’t think it will be that useful to us. Our first boot experience is simple and reliable, after all. I’m considering simplifying the change proposal by removing this detail, especially since there have been several complaints about collecting data before consent, even though it never gets uploaded.


What I do not understand is how you expect me to respond to comments like this? I don’t see anything actionable to respond to here.