Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

It probably wouldn’t be JSON and we’d need to think about how exactly it should work, but I agree we should have some way to show this and I’ll be adding that to the feedback section of the change proposal.

This is from Hacker News (“I see people complaining about instinctive reactions about the usage of the word... | Hacker News”) and is a bit inflammatory, but I thought it makes a notable point about this proposal for @catanzaro:

by [TrueDuality]

I see people complaining about instinctive reactions about the usage of the word telemetry, but they’re rightly justified in those reactions. People have those instinctive reactions for a very good reason even with this specific proposal. If you read the discussion post, the following becomes clear:

  • The proposer has clearly not done any research on how to actually collect anonymous data (they’d never heard of differential privacy for example).

  • They want a plug and play solution (they specifically say they don’t want to do more work than that)

  • They are not open to discussing privacy regulations such as GDPR

  • They are not willing to bend on the most contentious points of their proposal

  • The system they want to use collects invasive metrics that can be de-anonymized and has only been used by a niche distribution

Because the de-anonymization bit might not be clear, let me summarize some of the things that the Endless OS metrics collect:

  • Country

  • Location based on IP address to within 1 degree lat/long

  • Your specific hardware profile

  • Daily report that includes your hardware profile, along with the number of times the check ins have occurred in the past

  • Detailed program usage (every start / stop)

  • An unspecified series of additional metrics that can be sent from anywhere else on the system via a dbus interface

Additionally, this proposal wants to explicitly collect:

  • What packages and versions of such are installed

  • Specific application usage metrics (the example they give is the gnome settings panel)

They discard the IP address, but how hard do you think it is to differentiate users based on the combination of hardware profile, +/- 1 degree of location accuracy, and their specific set of packages (while already knowing the history of package installs/uninstalls through their package manager)? The proposal doesn’t meet its stated intention of being anonymous, and the proposer actively understands that users don’t want this but believes their desire for the metrics overrides the end users’ desire not to be tracked.
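To make the combination argument in the quoted post concrete, here is a small sketch that counts how many records share each combination of coarse attributes. The records, field names, and helper functions are invented for illustration; they are not real Endless OS or Fedora data, and nothing here describes what the change proposal would actually collect.

```python
from collections import Counter


def group_sizes(records, keys):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(record[k] for k in keys) for record in records)


def unique_fraction(records, keys):
    """Fraction of records whose combination of fields is shared by no one else."""
    sizes = group_sizes(records, keys)
    unique = sum(
        1 for record in records
        if sizes[tuple(record[k] for k in keys)] == 1
    )
    return unique / len(records)


# Invented sample data: each field alone is shared by two records,
# but any two fields together single out every record.
records = [
    {"cpu": "A", "grid_cell": (48, 11), "packages": "base"},
    {"cpu": "A", "grid_cell": (52, 13), "packages": "base+dev"},
    {"cpu": "B", "grid_cell": (48, 11), "packages": "base+dev"},
    {"cpu": "B", "grid_cell": (52, 13), "packages": "base"},
]
```

In this toy data set no single field identifies anyone, yet combining just two of them makes every record unique. Whether that applies in practice depends on whether the fields are ever stored together.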

1 Like

If the user consented to telemetry in gnome-initial-setup or gnome-control-center, the telemetry would remain enabled until the user disables it.

Otherwise, it’s not enabled.

(OK, it could be locally enabled but never uploaded. I’m glossing over the distinction between enabled with uploading vs. enabled without uploading vs. disabled.)

If the user has chosen not to have telemetry via some method, the data shouldn’t even be collected locally. There is a risk of leakage that way.

5 Likes

Indeed, I don’t see what the hurry is in collecting telemetry before users consent. It also makes the system unnecessarily more complicated. Honestly, I think the proposers should consider being more critical and conservative in their approach to this.

1 Like

So let’s say you’re using a GNOME program that has telemetry support, but you do not have the eos-event-recorder-daemon package installed. The program will see that there is no name owner for the eos-metrics D-Bus interface and it will just not attempt to upload metrics because there is no service to connect to.

Now let’s say eos-event-recorder-daemon gets installed by accident for whatever reason. The initial state is to collect metrics locally, but do not upload them. We cannot upload them because we do not know that the user has consented until the state is changed to a non-default value. That is, we have a tri-state <off, initial value, on>. If you’re not using GNOME (or if you’ve upgraded from a previous version of Fedora) and you have not viewed the privacy setting in gnome-initial-setup or gnome-control-center and you have not set its value manually, then what will happen is the GNOME application will connect to eos-event-recorder-daemon and the daemon will record the metrics locally but never send them to Fedora. If you eventually do change the value of the setting later on, then they’ll either be uploaded or deleted.
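A minimal sketch of the tri-state described above, written in Python purely for illustration. The names `ConsentState`, `Action`, and `decide` are hypothetical and are not taken from eos-event-recorder-daemon; the sketch only mirrors the behaviour as described in this post.

```python
from dataclasses import dataclass
from enum import Enum


class ConsentState(Enum):
    """Hypothetical tri-state for the telemetry setting."""
    OFF = "off"          # user explicitly declined
    INITIAL = "initial"  # user has never seen or changed the setting
    ON = "on"            # user explicitly consented


@dataclass(frozen=True)
class Action:
    record_locally: bool
    upload: bool
    delete_stored: bool


def decide(daemon_installed: bool, state: ConsentState) -> Action:
    """Return what happens to metrics under the behaviour described above."""
    if not daemon_installed:
        # No D-Bus name owner: applications have nothing to connect to.
        return Action(record_locally=False, upload=False, delete_stored=False)
    if state is ConsentState.ON:
        return Action(record_locally=True, upload=True, delete_stored=False)
    if state is ConsentState.OFF:
        # Anything recorded while in the initial state is discarded.
        return Action(record_locally=False, upload=False, delete_stored=True)
    # INITIAL: consent is unknown, so record locally but never upload.
    return Action(record_locally=True, upload=False, delete_stored=False)
```

The objection raised earlier in the thread concerns the `INITIAL` branch: a stricter design would set `record_locally=False` there as well, so that nothing exists on disk before the user has made a choice.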

Perhaps I’m being overly-cautious. Several users have requested that the data be made public, and I’m OK with this.

Yes, we do generally work upstream first. Most work needs to be done upstream.

Some work needs to be done downstream (e.g. packaging components, configuring them to send metrics to Fedora, selecting which metrics to enable or disable).

GNOME doesn’t have a big change proposal process like Fedora does, so it’s not like we need to host a debate at the GNOME level about whether to do this or not. We’ll just build the functionality, and distros will decide for themselves whether to enable or disable by either installing eos-event-recorder-daemon or not.

This user clearly did some research on the Endless metrics system, but is inadequately familiar with the change proposal itself. I appreciate that it’s still early and it takes time for the community to understand what is being proposed. E.g. the change proposal is very clear that we won’t use the eos-phone-home component (which is what collects your latitude and longitude). I’ve also mentioned many, many times that the specific metrics to be collected would need to be separately approved (we won’t just enable everything that Endless collects) and that we won’t build user profiles (there would be no point, we don’t need user profiles), so we simply would not be able to know that the same user uploaded two different metrics. I’ve also never suggested collecting the specific set of installed packages (I think that’s just made up?).

E.g. Endless does collect detailed hardware profile and detailed program usage, but I haven’t proposed doing that in Fedora. And even in Endless, the data is collected separately nowadays (no user profiles anymore). Endless knows this many users use this CPU, but it does not know that a user with this CPU has this set of packages installed and used this app for this amount of time. (Well, except in the event of an error, in which case the data does get collected together for debug purposes. But errors are not expected, so hopefully that’s not too objectionable, and if so, well, we can change how it works.)

Anyway, I very much appreciate help with responding to misconceptions about the change proposal on third-party forums (Hacker News, reddit, wherever). I think most of the people in this Discourse discussion should have a decent general idea of what I’m planning by now and can help correct misunderstandings.

3 Likes

But isn’t it opt-in for upgrading users? And the power users will disable it anyway. So the data may be heavily skewed toward newer, possibly casual users.

1 Like

I’m only a user, and not a contributor, so hopefully it’s OK for me to chime in. I can see the value of telemetry, and I’m fine with opt-out as presented here, assuming that

  1. the data is deanonymization-proof enough that Fedora can confidently make it all public
  2. no one, not even Red Hat employees, can connect the collected data to IP addresses or other identifiers from server logs or what have you
  3. I can, on my local computer, see a full enumeration of all the data being collected or sent

Having investment of money and effort go toward the things where it gets the most bang for the buck would be great, and worth this kind of easily accessible opt-out data collection IMO, assuming these points are ensured. I think points 2 and 3 have been covered already, but point 1 is still in the air. This might not be possible in the context of this proposal, but assuming this proposal passes, if possible I would like to see Fedora make a policy to the effect that when whatever decision-making body is considering the addition of any new data point for collection, one of the criteria should be that the data can be made public.

4 Likes

But to be clear, we didn’t officially bring this up to them at all, even though any work based off the data we collect will end up affecting the upstream project, correct?

1 Like

Can I suggest editing the proposal to make that more clear?

I think it’s currently:

We do not plan to deploy the eos-phone-home component in Fedora.

Maybe to change it to:

Under this proposal, we will not include the eos-phone-home component, which collects the user’s latitude and longitude, in Fedora.

That’s a much stronger statement than “we don’t plan on” :slight_smile:

1 Like

Has anyone looked into the Transparent Telemetry used by the Go programming language?

There are a few blog posts on it here:
https://research.swtch.com/telemetry

Specifically, their ideas around statistically sampling the population are really interesting, and could help assuage some of the privacy fears by making telemetry collection much less frequent.

Essentially, by using some basic statistics you can sample a small subset of data to get results that apply to the whole population with a high (and known) confidence.
About 10,000 samples corresponds to roughly 1% uncertainty, and that holds no matter how large the population being sampled is, whether 10,000 or 10 billion.

This allows you to significantly reduce the number of telemetry data points needed, which the Go telemetry design uses to tell clients how often they need to upload.
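The arithmetic behind the 1% figure can be checked with the standard margin-of-error formula for a sampled proportion. This is a sketch under stated assumptions: a simple random sample, the worst case p = 0.5, and a 95% confidence level (z ≈ 1.96). The function names are illustrative and are not part of Go’s telemetry design.

```python
import math


def margin_of_error(sample_size: int, z: float = 1.96, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a sampled proportion."""
    return z * math.sqrt(p * (1.0 - p) / sample_size)


def required_sample_size(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Approximate sample size needed to reach the given margin of error."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


def upload_probability(population: int, target_samples: int) -> float:
    """Chance that any one client is asked to upload in a sampling round."""
    return min(1.0, target_samples / population)
```

Under these assumptions, `margin_of_error(10_000)` comes out just under 0.01, and it does not depend on the population size at all. Only `upload_probability` does: the larger the installed base, the less often any individual machine would need to report.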

In my comments in this debate, I am not much concerned about hacks either. They can happen anywhere, but I trust Fedora and its related organizational structures to provide the best possible mitigation more than I trust most other organizations, since I know the means, structures, and organization it has used so far.

But please consider the transport of data. This needs to be part of the proposal and the policy, and is an important part of transparency for the user. Be aware that the user’s ability to disable telemetry is not worth much if trust has already been lost because of too little information, or because of the perception that Fedora is not taking this seriously (a perception that has arisen in some of the recent discussion). This does not start at your database. It includes the cryptography used all the way to the final server, but also where the final destination server is located and which jurisdiction its corporate entity falls under.

Along the way, a lot of data is linked together and to its IP address. And there are countries where the organization receiving the data is obligated to retain the capability to decrypt it. Depending on the cryptography, this can also involve earlier hops on the data’s route. This has to be considered explicitly and made clear to users in the policy, but also presented to them at the point where they have to decide about, or are confronted with, the telemetry. This is not just to convince them not to disable it, but also to retain trust that whatever decision they make is respected and implemented that way, and that Fedora has considered everything on its side. (You can already see plenty of comments in the various topics about the perception telemetry can create: the term is already “damaged” today, and that easily gets linked to the organization using it. This is not always rational, but it still needs to be considered.)

We have had so many media reports and analyses about companies collecting more data than allowed or disclosed, or ignoring the rules. This is what you have to overcome in your proposal, in the policy, and in the points presented to the users.

As far as I understand, @mattdm, I misunderstood your points about GDPR in your proposal: they suggested to me that data might be stored abroad. Make clear that GDPR is considered and that it is explicitly ensured that data remains in GDPR areas (even if you do not store the IP yourself, because of the data on the route, and so on). Of course, such issues also need elaboration and an explanation of the technical means that ensure it. And they need to be open to discussion as well; this is part of transparency (excluding “trust us, we cared somehow” *).

It is important for the user to know that only the data in the database can be published globally, because it can no longer be linked to its users (such as “1 million users used Firefox daily” or so), but that this does not apply to the data on its way to the database, which can be linked to people. This is for transparency, and also so that users get the feeling that they can intervene and that they are heard (if it is in the policy, it is easy to understand that, and where, opening tickets about it is intended).

Further, the goals are important: which data is used for which goal. This helps users understand what the data can be used for and addresses uncertainty about what else it could be used for (and it also tells them what they contribute to if they allow you to collect it, because this is a contribution). It also mitigates mistrust about what people might have in mind with “telemetry”, such as the (simplified) thought “if it cannot be linked to anything, what do you need it for then?”. Telemetry in itself creates some suspicion, and you have to overcome that.


Also, there are users who need explicit assurance, so that they can tell their customers that nothing from that system leaves the GDPR area and that everything is handled in GDPR-compliant ways (making both explicitly clear to customers is also a matter of marketing, not just law, for them and for us). Any uncertainty can make them migrate to another system, because they might not feel secure if they have to evaluate it themselves (they might have no time, or not be sure they are qualified to evaluate it). I still have in mind a group of management consultants who made clear at FrOSCon that they all use Fedora because of its transparency, stability and reliability :wink: A lot of people also use Fedora for their work.


We have already seen some suspicion about Red Hat. I don’t understand most of it, but please avoid anything that implies “Red Hat takes care of it somehow” *. This should have no place in the community and only encourages misunderstanding of the relationship between Fedora and Red Hat (fostering perceptions that can do more harm than your telemetry can do good). This is also about transparency in keeping Red Hat and Fedora distinct and separate from each other.


* These comments are meant as illustrative negative examples; they are not meant to accuse you of having done that :wink: I am trying to be clear in that respect, since the discussions have already become problematic in some of the topics.

2 Likes

I seriously doubt you understand the point that those of us who strongly oppose data collection by default are trying to make. You are explaining implementation details of the proposal as if the question of principle were irrelevant.

Yes, I am quite emotional, because I am angry. I have already wasted more time and energy on this than I would like to.

This proposal tells the user base to trust something which does not define what data it wants to collect and strongly insists on not being built safe by design. (For instance, if I suggested starting recording before consent in my $dayjob, my co-workers would assume I were joking. I work in a business where the clients are regulated industries.) The users are told to trust the developers, trust the implementation, as opposed to a more healthy situation where it’s “You don’t need to trust us, we’ve made sure that even if we screw up, there won’t be a leak.”

As for implementation: I may have missed it, but have the complications of shared systems with many users even been discussed here?

I’ll repeat: I do not want to have to learn yet another set of implementation details in order to be able to make an informed decision to keep my privacy.

On the bottom of this page it says “It’s your OS.” But, obviously not my data when funding decisions are to be made.

3 Likes

Unfortunately, the topics currently seem to be developing somewhat redundantly. But merging or further splitting them is likely to only increase confusion.

However, without imposing anything, it might be noted that this topic is mostly about approaches in case this is implemented (for some, this may mean working out how to make the worst case less bad), while the major discussion(s) about IF it is implemented OR NOT (and related matters) are in the other topic(s).

The current state of the other topics is:
Opt-in / Opt-Out? A breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation - #177 by py0xc3
What data will be collected, exactly? — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation - #19 by kevin
F40 Change Request: Privacy-preserving Telemetry for Fedora Workstation (System-Wide) - #352 by kevin

What if the standard for data handling were that all the data is publicly accessible by anyone anywhere?

This would constrain what data could be collected in exactly the ways that would best preserve privacy. If there can be any concern that the collected data could be used for some nefarious purpose, or that individuals or corporations would be uncomfortable with that data being public, take that as an indication that you should not collect the data in the first place.

1 Like

What are the plans for retention of the collected data? Would it be indefinite, or only while it was needed to answer a specific question that someone has?

I’m of the opinion that retaining the data only while a particular question is being answered would significantly increase safety, both by reducing the risk of data breaches if the server is ever hacked and as a more general defence against reidentification, because it prevents datasets that could potentially be combined from being held at the same time.

2 Likes

You have argued along this line several times now, but I think this is flawed. In the original proposal thread, this is what is mentioned:

Many GNOME applications are used outside of GNOME, e.g. Evince, gnome-terminal, and many others. If any of these applications decides to gather telemetry, a non-GNOME user’s system would start collecting telemetry data without the user having seen a prompt. You will of course argue it’s not being sent. But the problem is that it is there now, waiting for a bug, malware, or a proprietary application to abuse it. In the original thread I suggested making this a “weak dependency” so that a user can ban telemetry packages for good, e.g. by setting excludepkgs.

That said, this only solves the issue for advanced users savvy enough to know their way around dnf configuration. This should be thought out really carefully. Handling private data isn’t trivial. It is a constant fight against information leakage. Approaching everything as an engineering problem isn’t going to cut it. E.g. even though the final result is aggregated, an event stream needs to be processed. The proposal includes nothing regarding the handling of in-flight data (the event stream).

2 Likes