Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

Only way to be sure is to open up the data set and see if smart people complain that they’ve deanonymized the data. I’m willing to open it if that’s what the community wants.

We’ll have to clearly document what gets collected and when new data collection is added. However, of course it would be up to the user to follow along, notice something they don’t like is now being collected, and turn off the setting, which the vast majority of users will not do.

In practice, I expect a few motivated users will keep close tabs on the metrics being collected, and stir up a big ruckus if we were to ever try collecting something too invasive (which hopefully we would not do in the first place).

1 Like

This will be possible.

The most popular GNOME Shell extension is system tray icons support. The GNOME developers have removed the legacy tray support, but end users are suffering.

Good thing we collected data on extension downloads so that we’re able to know what is the most popular extension, eh? :wink:

Status notifier support is actually on my to-do list due to the high number of requests for this feature, although our initial designs do not involve a tray, but a less-prominent menu structure. Help welcome, because it’s not at the top of my to-do list. Ticket here.

1 Like

Wouldn’t it be nice if we had something showing explicitly what will be shared? That’d probably make it more palatable for many users.

3 Likes

Good thing this was done on the infrastructure end, and not via telemetry provided from the user’s system. :slight_smile:

5 Likes

Yes, the requirement that all metrics must be individually approved is part of this change proposal.

Currently the system stores records of the events that are sent, but separately for each event. For example, if we collect two events, “did user launch Firefox today” and “is user using Wayland or X11”, then it records each such event for every user who submits it, but those records are not associated with each other, so (in theory) we shouldn’t be able to know “user launched Firefox and also uses Wayland”. The exception is when the events are recorded at the same fixed time, or at a fixed time offset from each other, in which case you could guess the association from the timestamps. I consider that a flaw that we need to think about.
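To make the timestamp concern concrete, here is a minimal Python sketch. The event names, the record layout, and the `link_by_offset` helper are all hypothetical and are not the actual metrics schema; the sketch only shows why two unlinked event streams can be rejoined if every client records them a fixed offset apart.

```python
from collections import defaultdict

# Hypothetical, unlinked event records: (event_name, timestamp_seconds, value).
# No user ID is stored anywhere; this layout is an assumption for illustration.
events = [
    ("launched_firefox", 1000, True),
    ("display_server", 1005, "wayland"),
    ("launched_firefox", 2340, True),
    ("display_server", 2345, "x11"),
    ("display_server", 7777, "wayland"),
]

def link_by_offset(events, first, second, offset):
    """Pair records of `first` and `second` whose timestamps differ by exactly `offset`."""
    by_time = defaultdict(list)
    for name, ts, value in events:
        if name == second:
            by_time[ts].append(value)
    pairs = []
    for name, ts, value in events:
        if name == first:
            # Any `second` event recorded exactly `offset` seconds later
            # very likely came from the same machine.
            for other in by_time.get(ts + offset, []):
                pairs.append((value, other))
    return pairs

linked = link_by_offset(events, "launched_firefox", "display_server", offset=5)
print(linked)
```

Randomizing the recording time or coarsening timestamps before upload (for example, to the day) would be one way to weaken this kind of join, though whether that is sufficient would need proper review.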

Ultimately, when we go to view the data, what we’re going to see is a count:

X11: m users
Wayland: n users

Discarding the underlying event data would actually theoretically not hurt anything, but I don’t think it’s necessary either since it’s not like the events contain personally-identifiable information.
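As a rough sketch of that reduction from raw events to counts (Python, with made-up sample values, and assuming each user submits the event once), the aggregated view could be computed like this:

```python
from collections import Counter

# Hypothetical raw records for one metric; no user identifier is attached.
display_server_events = ["wayland", "x11", "wayland", "wayland", "x11"]

def aggregate(values):
    """Reduce raw event values to per-value counts, which is all a viewer sees."""
    return Counter(values)

counts = aggregate(display_server_events)
for name, n in sorted(counts.items()):
    print(f"{name}: {n} users")
```

Once the counts exist, nothing in this sketch needs the raw list any more, which is why discarding the underlying events would not change the result.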

1 Like

Yes, Steam has the best telemetry collection system:

  1. They explicitly ask if the user wants to send telemetry data to Valve before it is collected.
  2. They show all collected data before sending (user can press Cancel or even change their mind).
  3. After submitting the data, they will show a link with public statistics.
  4. If a user declines telemetry collection, they will not ask again for the next 365 days.

2 Likes

Couldn’t you just provide a button to see the data that has been collected (e.g. like in KDE or iOS) instead of having to use a metrics server to see it? Sure, there is no guarantee that that is what is being sent, but it is much easier to see, and there is already trust placed in the software to begin with.

Steam has something like 120 million monthly users. We have maybe 1% of that. That means they can get a much better general picture than we can without asking nearly as many people as often. I will run this past some data scientists I know to see what they think about sample size and frequency, but my intuition is that it would likely need to be so frequent it would be annoying (and then present its own problems). But, intuition isn’t always the right approach with data science. :classic_smiley:
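For a rough feel for the arithmetic, here is a back-of-the-envelope Python sketch using the textbook margin-of-error formula for a proportion. The 1% prompting rate and the 5% prevalence are hypothetical numbers picked only for illustration, and the formula assumes a simple random sample, which an opt-in prompt is not.

```python
import math

def margin_of_error(p, n, population, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from
    n respondents drawn at random from a finite population."""
    # Finite population correction; close to 1 when n is a small share.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

steam_users = 120_000_000          # figure quoted above
fedora_users = steam_users // 100  # "maybe 1% of that"

# Hypothetical: prompt 1% of each user base about a setting that 5% of users have.
for label, population in [("Steam", steam_users), ("Fedora", fedora_users)]:
    n = population // 100
    moe = margin_of_error(0.05, n, population)
    print(f"{label}: n={n}, margin of error +/-{moe:.4%}")
```

Under these assumptions the smaller user base ends up with a margin of error about ten times wider at the same prompting rate, and the gap matters most for rare configurations.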

2 Likes

I think as a bare minimum, the full data set collected has to be fully open to everyone. If it has the chance of being deanonymized, it probably shouldn’t be collected in the first place.

8 Likes

Yes, as long as it’s always readable somewhere, and the moment anyone can de-anonymize the data, that data should be cut. If it’s always anonymous, and the data is shared, readable, and actually useful for every Fedora user, then sure, I can allow it. But I’d rather my computer ask me first.

1 Like

Get users interested. Explain why it is important for the project and what can be achieved with this data. Before sending any data, show it to the user.

Any imposition will only cause the opposite effect and, as a result, the outflow of users to other distributions. Red Hat has already lost its popularity over the past month, and now it will lose even more.

6 Likes

We would need to link to such an explanation, because we don’t have space in gnome-initial-setup to explain everything there. But I agree, users deserve a way to see what data would be collected at the time they’re prompted to make a choice. I think the toggle from this screenshot qualifies as prominent. We’d need to adjust the text slightly to include a link to the data collection policy and the list of metrics that will be collected.

At minimum, the data would be available to Fedora community members on request. It’s not going to be private to Red Hat.

We might need a separate discussion on whether the raw data should be Fedora-private or whether it should be completely public. I’m really not sure whether it’s a good idea.

I agree that the aggregated data should be made public.

I’m not sure about “the various types of data that gets collected doesn’t change within a Fedora release” because that will really slow down our ability to collect data. That can even lead to collecting more than necessary. You could imagine that many design decisions would only require collecting a couple weeks’ worth of data, for example.

I’m also not sure if the full change proposal process is the most appropriate way to submit metrics for consideration, but I do agree they need to be individually proposed and approved via some community approval process.

1 Like

All collected data must be publicly available to all users. No requests or other dark patterns.

It doesn’t. It downloads databases and stores them offline in ~/.cache/mozilla/firefox/$PROFILE/safebrowsing.

3 Likes

It implicitly does, because the database is a partial hash table, and when a URL hits on a partial hash the browser downloads the full hashes for that bucket. Thereby the statistical sampling.

If Google were malicious, compromised, or cooperative with hostile governments, an adversary could send you to a page that calls out to external resources such that the totality of hash bucket hits provides enough bits to uniquely identify you or the page. The same could be done with a post full of links you will probably click.
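Here is a simplified Python sketch of the partial-hash lookup described above. It is not the real Safe Browsing client: the blocklist entries, the 4-byte prefix length, and the function names are assumptions for illustration, and the real protocol also canonicalizes URLs and checks several host/path combinations per page.

```python
import hashlib

PREFIX_LEN = 4  # bytes; assumed prefix length for this sketch

def full_hash(url: str) -> bytes:
    return hashlib.sha256(url.encode("utf-8")).digest()

# Hypothetical blocklist held by the server.
server_blocklist = {full_hash(u) for u in ["evil.example/login", "bad.example/"]}

# What the client downloads and stores offline: only the short prefixes.
local_prefixes = {h[:PREFIX_LEN] for h in server_blocklist}

# Everything the server gets to observe in this sketch.
requests_seen_by_server = []

def server_full_hashes(prefix: bytes) -> set:
    """Server side: log the requested prefix and return the matching full hashes."""
    requests_seen_by_server.append(prefix)
    return {h for h in server_blocklist if h[:PREFIX_LEN] == prefix}

def is_blocked(url: str) -> bool:
    h = full_hash(url)
    prefix = h[:PREFIX_LEN]
    if prefix not in local_prefixes:
        return False  # resolved offline; nothing leaves the machine
    # Prefix hit: the client must ask the server, which reveals the prefix.
    return h in server_full_hashes(prefix)

print(is_blocked("good.example/"))
print(is_blocked("evil.example/login"))
print(requests_seen_by_server)
```

The point is the last line: most lookups never leave the machine, but every prefix hit is visible to the server, and a chosen set of hits is what could add up to an identifying pattern.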

1 Like

I’m going to add this to the feedback section of the change proposal. I have no objection to this, although where to put it would be a design question. (Does it go directly in gnome-control-center, or would we offer it via a new app that users would have to install if they want to see a graphical view of what is being collected?)

Alas that this probably means more work for me. :stuck_out_tongue:

2 Likes

It will certainly be done upstream.

I’ve been persuaded that “run your own metrics server” is not sufficient transparency. Having a local view of what gets sent makes sense.

I’m not sure if we’re really going to need the ability to tweak the level of information provided, but that’s the sort of thing we can do if need be.

6 Likes

It can be disabled from outside GNOME by editing /etc/metrics/eos-metrics-permissions.conf, but realistically, if you don’t have GNOME installed you probably won’t have the metrics components installed either (why would you?). And even if they do somehow get installed by mistake, no data will be uploaded to Fedora anyway. You would have to manually enable uploading before it would send anything.
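As a sketch of how such a check might look in Python: the `[global]` section and the `enabled` / `uploading_enabled` keys below are my assumption about the file layout, not something confirmed in this thread, so verify against the real file before relying on them. The sample is parsed from a string rather than from `/etc/metrics/eos-metrics-permissions.conf` so that it runs anywhere.

```python
import configparser

# ASSUMED file contents; section and key names are illustrative, not confirmed.
SAMPLE = """\
[global]
enabled=true
uploading_enabled=false
"""

def read_permissions(text: str) -> dict:
    """Parse the permissions text; missing sections or keys count as disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "enabled": parser.getboolean("global", "enabled", fallback=False),
        "uploading_enabled": parser.getboolean(
            "global", "uploading_enabled", fallback=False
        ),
    }

def will_upload(text: str) -> bool:
    """Data leaves the machine only if recording and uploading are both on."""
    perms = read_permissions(text)
    return perms["enabled"] and perms["uploading_enabled"]

print(read_permissions(SAMPLE))
print(will_upload(SAMPLE))
```

In this sketch a missing or empty file is treated as "do not upload", which matches the behavior described above where nothing is sent unless uploading has been explicitly enabled.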

The only points at which we would enable uploading are (a) the gnome-initial-setup privacy page, and (b) possibly in the future gnome-tour, if we ever want to enable telemetry for users upgrading from older Fedora versions.

We already have separate telemetry for RHEL (Red Hat Insights). I don’t think we should make GNOME and Fedora design decisions on the basis of what’s best for RHEL. If we did that, then Fedora would look very different.

1 Like

I think it’s a mistake to limit the data to GNOME/Workstation. If you really want to create good telemetry that can help users, gather it from all users of Fedora.
Otherwise, you are only helping GNOME and making the experience worse for the rest of the Fedora users, as the developers of the spins lack hard data on how to make their work better.

1 Like

Yeah, I kind of know that they’re a lot to ask, but I think they’ll be really helpful, both for other open source software projects who are on the fence or unsure about adding telemetry, and also to convince users of their benefits. Thank you for considering my suggestion! :heart:

By the by, I heard somewhere that some software (I think Syncthing?) shows you the JSON that they’ll be sending to their servers. I wonder if that’s something that could be done here also to improve transparency. Though I’m way more unsure about this part.
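A very rough Python sketch of what that kind of preview could look like. The metric names and the `preview_payload` / `send_if_approved` helpers are invented for illustration and are not how Syncthing or the proposed Fedora client actually work.

```python
import json

# Hypothetical pending metrics; names and values are invented for illustration.
pending = {
    "display_server": "wayland",
    "launched_firefox": True,
}

def preview_payload(metrics: dict) -> str:
    """Return exactly the text that would be uploaded, formatted for reading."""
    return json.dumps(metrics, indent=2, sort_keys=True)

def send_if_approved(metrics: dict, approved: bool, upload) -> bool:
    """Upload the same string the user was shown, and only after approval."""
    if not approved:
        return False
    upload(preview_payload(metrics))
    return True

sent = []  # stands in for the network in this sketch
print(preview_payload(pending))
send_if_approved(pending, approved=False, upload=sent.append)
print(sent)
```

The design choice worth noting is that the preview and the upload go through the same serialization function, so what is shown cannot quietly drift from what is sent.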

I’d like to push on the hesitance to show all the raw data. May I ask what collected data is so at risk of being deanonymized or used maliciously that it can’t be shown to all?

In terms of the data itself, I would say that if what is being collected is at risk of disclosing anything personal it probably hasn’t been vetted well in the first place.

2 Likes