The change proposal F40 Change Request: Privacy-preserving Telemetry for Fedora Workstation (System-Wide) is — as is appropriate for such a big, important topic! — getting a lot of discussion. In order to keep the conversation from becoming one long list, I’m making a number of break-out topics for various important sub-topics that are emerging in the discussion.
This topic is for discussion of exactly what data will be collected, and how that will be decided. Posts in the main thread that are primarily about this will be moved here.[1] If you have more to add on this particular topic, this is the best place for that.
[1] Some posts which cover this but also other points will remain in the main topic, to avoid breaking the flow.
If this proposal is accepted, we still will not have authority to actually collect any data. Everything to be collected would still need to be approved separately, somehow. This change proposal contains a placeholder suggestion that FESCo would need to approve each metric to be collected, but we’d need to figure out how exactly this would work. Suggestions welcome for alternate processes to follow. For example, we could have a process similar to the change proposal process for approving particular metrics. Or we could have some new subcommittee that decides whether particular metrics should be approved.
That said, if approved via this hypothetical future community process, I would indeed envision collecting application usage data. (That’s something I failed to consider in the “What data might we collect?” section of the change proposal.) But I don’t think it needs to be highly-intrusive. Remember that one of the principles here is we don’t want to associate the collected data with particular users: we only want to know aggregate data. So let’s say that on Thursday you use Epiphany, Geary, gfeeds, GNOME Text Editor, and GNOME Console. We might send five separate events indicating that some user launched each application on that day. (Endless actually collects the time that each app was used, but judging from the concerned opinions thus far in this thread, that’s probably a little too invasive for Fedora.) So the data on the server might look something like: <Epiphany, 5000 users on July 6>, <Firefox: 500,000 users on July 6>. But we wouldn’t need to collect, say, the set of applications launched on a given day, <User launched Epiphany, Geary, Firefox, and Thunderbird>, because why would we need to know that? And we certainly wouldn’t want to be able to associate that data with your IP address (we are explicitly prohibited from collecting IP addresses), location, or trace it back to you in any way, because that would be creepy.
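To make the "aggregate only" idea concrete, here is a minimal sketch of what it could look like. It is not the actual design from the change proposal: the function names, the in-memory counter, and the year in the example date are all assumptions made up for illustration. The point is only that each launch becomes its own event with no user or session key, so the server can count users per app per day but cannot rebuild the set of apps one person used.

```python
from collections import Counter
from datetime import date

# Hypothetical server-side store: one counter per (app, day).
# There is deliberately no user, session, or IP field anywhere.
daily_launch_counts: Counter = Counter()


def client_events_for_day(apps_used: set[str], day: date) -> list[tuple[str, date]]:
    """Split one user's day into separate events that share no identifier."""
    return [(app_id, day) for app_id in sorted(apps_used)]


def record_launch_event(app_id: str, day: date) -> None:
    """Record one anonymous 'some user launched app_id on day' event."""
    daily_launch_counts[(app_id, day)] += 1


# Example day (the year is an arbitrary assumption for the sketch).
thursday = date(2023, 7, 6)
for app_id, day in client_events_for_day({"Epiphany", "Geary", "gfeeds"}, thursday):
    record_launch_event(app_id, day)
```

After this runs, the store holds three independent counters of 1 each, and nothing in it says that the three events came from the same machine.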
I would also limit data collection to packaged apps only, plus maybe a few hardcoded others. Let’s hypothetically say we’d collect usage data on Fedora applications, Flathub applications, Google Chrome, and Steam. It’s not safe to collect data on arbitrary applications because the names of arbitrary applications might potentially contain personal data (e.g. “Michael’s Private Crime Stuff Application” or “Foo Corporation Top Secret Project App”) and we have to take due care not to collect anything that would likely contain personal data. (I would interpret this strictly, but not outrageously strictly. E.g. I don’t think a CPU model name is likely to contain personal data. But application names certainly might.)
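As a rough sketch of the "known apps only" rule, assuming the client ships with a fixed allowlist: the identifiers below are illustrative examples, not a real Fedora list, and `reportable_apps` is a hypothetical helper name. Anything not on the list is dropped on the client before an event is ever created, so an arbitrary application name never leaves the machine.

```python
# Hypothetical allowlist; these identifiers are only examples.
KNOWN_APP_IDS = frozenset({
    "org.gnome.Epiphany",
    "org.gnome.Geary",
    "com.google.Chrome",
    "com.valvesoftware.Steam",
})


def reportable_apps(launched_app_ids):
    """Keep only allowlisted apps; unknown names are dropped, never sent."""
    return sorted(app_id for app_id in launched_app_ids if app_id in KNOWN_APP_IDS)


launched = ["org.gnome.Geary", "Michael's Private Crime Stuff Application"]
print(reportable_apps(launched))  # ['org.gnome.Geary']
```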
This comment is a little handwavey and hypothetical because I’m discussing a metric that I haven’t proposed collecting yet, and I haven’t thought that much about it, but this should give you an idea of how I’m approaching this. If Fedora ever collects so much as to feel invasive, then we’ve messed up and violated user trust. But I don’t think counting application usage in such a general manner is really that invasive (especially in contrast to, say, a proprietary software platform, which is probably keeping track of the applications that you personally are using, and who knows for sure because we cannot see the source code).
You know, a lot of good points have been made in this proposal, and I feel like we got out of the “panicking” stage and are actually progressing now.
What I would suggest is that we concentrate on the “What” and the “How”. The “How” seems mostly figured out; it’s going to be a toggle on the gnome initial setup app. (Hoping there will be one under “Settings” too)
The “What” needs to be figured out now, as in, “What” do you want to capture? And I mean all of it. I know you said that you’d make a separate proposal about what you want to gather, but I don’t know that we will feel safe about approving this without knowing what will be tracked.
I know the list of “What” will be tracked will also change over time, but I’m asking for a full transparent list of what would be tracked. Then I think a sound decision can be taken.
But I am quite a bit more open to the idea than I was yesterday, before I read all the feedback here.
My intention is that we only ever collect a low level of data, so hopefully there should be no need to make the user control more complicated than an on/off switch.
What is a “low level of data”? Those are weasel words. What exactly is planned on being collected? Why is it so important that this be opt-out by default? Why do you believe people like me who do turn on opt-in settings would only be providing “garbage”?
Here is my full list of what metrics we might want to propose to collect in the future. But this list is just brainstorming. I am NOT yet proposing to actually collect any of these data points, and I would expect we’d never actually implement collection of most of them. It’s just to give you an idea of what sort of data we might want to collect. The list is a little raw because it’s just a brainstorming document from two years ago.
More likely, what we actually wind up collecting will be determined by developer requests for some data point.
Fedora version and edition - /etc/os-release
Original install version and date
Wayland or X - session type
XDG_CURRENT_DESKTOP (also includes GNOME Classic, other desktops, etc.)
Which extensions are being used (what are the most popular, should we put an effort into stabilizing)
UEFI or BIOS. Secure boot enabled? TPM? HSI level.
NVidia Binary Driver
Dual boot?
Software sources
3rd party repos enabled
Flathub installed? (Filtered or unfiltered)
Rpmfusion enabled?
System uptime ? How often rebooted ?
When last updated? When last updated through Software?
Applications installed
Flatpak or RPMS
App usage
Which settings panels do people use?
Default browser
See ~/.local/share/gnome-shell/application_state, limited to known apps, or known sources (Fedora, flathub)
Are they using Toolbx
How many toolbxes do they have
Which base image is being used (limited to known images)
Are they using GUI IDE, are they using Owen’s VS Code integration
System configuration
Dynamic/static workspaces
Workspaces only on primary or workspaces on all displays?
How many user accounts
Sharing settings - global switch, file, screen, media sharing; remote login
Online accounts - which enabled?
Locale, enabled input methods
Enterprise login setup?
System usage
Number of workspaces open over time
Number of windows open over time
Number of apps open over time
Hardware
How many displays?
Display resolution
Memory
Touchscreen?
Hardware make (do the Lenovo Fedora laptops make a market dent?)
What file system do people use
podman / docker usage (are there running containers)
Have there been certain problems on the system? oom kills, shell crashes, …
Again, remember these data points would be stored separately and not correlated together (i.e. it’s not a user profile). And we would have a separate approval process each time we want to start collecting a metric, so each one can be debated separately.
I suppose if we want to start with one particular metric to be initially approved, I would pick gnome-control-center panel usage as I understand our designers really want to add more preferences (believe it or not) but are having trouble doing so.
You are either deliberately ignoring or just missing the point.
Many people see the metrics you are proposing to collect as “creepy” or intrusive. You just don’t personally see it that way which is fine. Making it public is good but it in no way addresses the actual concern.
Well maybe the Fedora community would decide not to approve that particular metric, then.
I am expecting that a simple integer counter of whether an application gets used should not be too controversial. (Endless collects the amount of time applications are used. Who knows what Windows and macOS collect.) It’s not like we’d be able to know that any particular user is using a particular application.
But if the community disagrees, we can go without. E.g. maybe we could collect only particular applications that we want to know about, say Visual Studio Code vs. GNOME Builder.
There are two considerations/potential problems that come to my mind.
Correlation of stored data
Counting things is fine, but the danger comes if data is stored with any correlation.
To follow on from your idea of tracking if a set of predefined applications has been used, this can quickly become uniquely identifying information.
The number of possible combinations of installed applications scales as $2^n$, meaning that with as few as 25 applications included you could uniquely identify a single user out of over 33 million.
To a certain extent you can technically enforce the lack of correlation by only storing numbers. This prevents you from using more complex data structures, such as arrays, that can correlate multiple bits of data.
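The arithmetic behind that figure is easy to check. A quick sketch (the function name is just made up for the example): each tracked application is either present or absent, so a stored set of $n$ yes/no flags can take $2^n$ distinct values.

```python
def distinct_app_sets(n_apps: int) -> int:
    """Number of possible installed/not-installed combinations for n apps."""
    return 2 ** n_apps


print(distinct_app_sets(25))  # 33554432
```

By contrast, if each application only has its own independent counter, the server holds $n$ numbers and none of these combinations can be reconstructed.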
Stopping collection of uninteresting metrics
While you have suggested a process for collecting additional metrics if devs need them, you haven’t outlined any process for collection of a metric to stop. Ideally I’d like something here that “fails safe”.
One idea is that collection of a metric would automatically stop if the decision to collect it wasn’t reaffirmed after some reasonable time period, possibly defined when the metric is proposed. It would be fine to have another decision to keep collecting it, but with the default being not to collect something, forgotten metrics wouldn’t continue being collected.
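A minimal sketch of how such a "fails safe" rule could behave, assuming each approval carries its own review period. The class, field names, metric name, and dates are all hypothetical; nothing like this exists in the proposal yet. The key property is that doing nothing leads to collection stopping, and only an explicit reaffirmation extends it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MetricApproval:
    name: str
    approved_on: date
    valid_for: timedelta  # review period set when the metric is proposed

    def is_collectable(self, today: date) -> bool:
        """Collection is allowed only inside the approval window."""
        return self.approved_on <= today < self.approved_on + self.valid_for

    def reaffirm(self, today: date) -> None:
        """An explicit decision to keep collecting restarts the window."""
        self.approved_on = today


# Hypothetical metric approved for one year.
panel_usage = MetricApproval("settings-panel-usage", date(2024, 1, 1), timedelta(days=365))
```

With these example dates, `panel_usage.is_collectable(date(2024, 6, 1))` is true, and a check on `date(2025, 1, 1)` is false unless someone called `reaffirm` in the meantime.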
As I understand the system as proposed, “this application was used” and “that application was used” become values stored independently of each other, and of any other identifier. As long as that separation is done in a robust way, this kind of fingerprinting would be impossible while still giving us very useful information.
Right. In general, we don’t want to be able to correlate different data points, because that becomes a user profile, and user profiles are creepy.
We occasionally might need to do this only for some particular metrics, e.g. asking “how many NVIDIA users use gnome-online-accounts?” could help us figure out how bad is the incompatibility between WebKitGTK vs. NVIDIA graphics. But that would be an exception for those particular data points, not how we collect most data.
Makes sense that we should consider how long the data will be collected for when approving a new metric to be collected. I’ll add this to the feedback section of the change proposal.
Yeah, that should be fine. Explicitly defined queries like this only count as a single combination from a finger printing perspective, so unless you have millions of these questions being asked it isn’t an issue.
Because only certain people (like yourself) turn on opt-in metrics. It’s not representative of the whole population of users. Accurate statistics are really useful.
I’d prefer a tally of application usage to be collected, for example right now RStudio is a bit less powerful on the Fedora repository due to lack of Quarto.
If it turns out to be a popular application for Fedora users, then more resources could be spent on it.
Also the possibility of a mandatory toggle (you must select whether to enable or disable telemetry, initially neutral) is pretty good too imo.
I will start by apologizing if this is not the correct section for this post.
An important question that should be addressed is how users interact with the telemetry after they expressed the consent: are they supposed to approve each and every single telemetry point or is it all-in?
I believe there could be many approaches for this, and I also support a user-centric approach in which there is at least some kind of control. I have thought of some of them:
Never ask the user: if the user expressed their consent, they will never be prompted again and data collection will go on until stopped manually. This is the least annoying approach, but also the one that gives the least control to the user. I would avoid this if possible, and give users information over time.
Ask the user for each telemetry point, at the moment in which the new telemetry point is introduced: this approach gives the maximum control to the user and it gives them the opportunity to quit immediately in the case the telemetry becomes too much for their sensibility/ideas; at the same time, the devs can add telemetry whenever they need or want. However, this might be very intrusive, especially if the new telemetry points are added and removed very frequently and the user is prompted too often so that it becomes annoying.
Limit the introduction of telemetry points to new Fedora releases: release cycles are actually very balanced in my opinion, and they could offer a good spot to ask for many new data points at once. Then, the user can review the new telemetry points just once and quit if desired. This could be a good balance, as it prompts users at the moment they are willing to spend time upgrading their system; moreover, Fedora’s decision-making bodies can take their time to assess their decisions in a wider time window that is synchronized with the next release. However, developers would have to be fine with adding new telemetry at most once every 6 months. This is my favourite option, as the user is only prompted at a new release and is never bothered until the next release, but it might not be feasible for developers who may prefer faster introduction of new telemetry.
Limit the introduction of new telemetry to a fixed interval: the same as the above, but mid-release. Let’s say for example every two months new telemetry is added as a block and users are prompted. This can be much more confusing than the previous point, especially if the interval changes over time.
What do you think? Do you have better ideas to handle this?
Moreover, to avoid huge lists of new telemetry items, I would also group them together in “groups” so that the user can assess more quickly what’s going on. For instance, interface telemetry, system info telemetry, installed software telemetry and so on.
And finally, should the user be able to select single telemetry items (or, single groups)? Or is it ‘all-in’ for telemetry?
Dalto made a major point. Michael, you just dismiss it if people have a different opinion. The best they can get is generalized phrases that do not tell much (“make Fedora the major distribution”, which was never set in contrast/context to your proposal, and at this point, this phrase can be questioned given the developments that imho already created damage to the community and its trust). Alternatively, others’ thoughts are ignored. You are also very selective to capture only scenarios and points that support your opinion (you ignore other scenarios/points, including when it comes to protecting data at stages that do not serve your proposal). Your elaborations already brought personal data together but then you limited the scenarios so that you do not need to care.
Underlying your arguments is the assumption that you are right and that people who disagree have not yet understood. This can already be seen as problematic, and it calls into question whether you should process such people’s data.
Data gathering is something that has to be taken seriously - it contains many critical and complex tasks, always. You should not get data that is not necessary. However, you want to get data, but so far without any idea of what data or for what purpose exactly.
Usually, issues like that shall start with a problem. Then, you check out what you need to solve a problem. Specific data can be a potential solution. Then, you check out if and possibly, how to get that data.
However, you first want to get data, then you want to find out which data and thus what problem to solve with it: sounds a little like “first of all, just get data, then let’s think for what we need it”. This already indicates little respect for the responsibility that comes with data processing.
I am not sure at this point whether the trust can be recovered, tbh.
Thanks for writing those up. I had all those questions to ask too.
But also I will add on:
How often would items change? Is the idea that you would want to gather more specific information about something to help decide how to implement things? Or have broad high level info? Or both?
Would things ever get retired?
What needs to change to add new items? I assume a package update needs to go to users? What happens if some people update and others don’t?
What data will be collected isn’t the best starting point. Instead, ask what data will be used for, and work backwards from that to what data needs to be collected to get there. Privacy is less about what data is collected, and more about how the data is used.
If I’m in a public place, talking to a friend, and someone inadvertently overhears me, my privacy is not necessarily violated. But if I’m overheard, and someone passes gossip, or posts something to the Internet, or uses what they overheard to decide how to advertise to me better, then my privacy has been violated, even though I was overheard in a public place.
If my phone company collects data about my calls, they haven’t necessarily violated my privacy. That data is also theirs anyway. If they put that data together to learn about my habits, or share that data with another organization who might put it together without my consent, then they’ve definitely violated my privacy.
It’s clear that preserving privacy is one of your goals. To reach that goal, it’s important to start from what you will do and what you could do with the data. If any of the data could be used to do something worse, you can decide not to collect that data, even though it would be useful.