Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

The change proposal F40 Change Request: Privacy-preserving Telemetry for Fedora Workstation (System-Wide) is — as is appropriate for such a big, important topic! — getting a lot of discussion. In order to keep the conversation from becoming one long list, I’m making a number of break-out topics for various important sub-topics that are emerging in the discussion.

This topic is for discussion of approaches to handling data so we get useful information while preserving individual user privacy — including avoiding collecting individually-identifiable information. This includes both theoretical and specific technical points.

Posts in the main thread that are primarily about this will be moved here.[1] If you have more to add on this particular topic, this is the best place for that.

Note that some of this crosses over into the opt-in/opt-out discussion — it’s hard to draw a hard line. Generally, that topic is best for debate about the principle, while this is about how any collected data might be handled in either situation.


  1. Some posts which cover this but also other points will remain in the main topic, to avoid breaking the flow. ↩︎

The design needs to be rethought, I think; it’s underwhelming for something described as privacy-preserving. There’s little attempt to ensure that IP addresses aren’t stored, given that the nginx server and the server running azafea-metrics-proxy can be compromised. Open source or not, this system has no way to ensure that the code running on the servers discards IP addresses. Nor can we reasonably trust that this project will be resourced well enough to keep the servers secure, or that there will never be a misconfiguration that causes IP addresses to be logged by some piece of software after all. CPE already doesn’t have enough resources to maintain some parts of Fedora infrastructure, and with the passage of time a server that is up-to-date, secure, and appropriately configured quickly becomes none of those things.

The IP address issue needs to be fixed, and that requires significantly more work than just using Azafea. Even setting that aside, much more effort could usefully be put into making this system less non-privacy-preserving. There is low-hanging fruit like encrypting the metrics so that nginx and the metrics proxy cannot read them. Another piece of low-hanging fruit is ensuring that the environments hosting the Redis and PostgreSQL databases and Azafea do not have Internet access, to make it harder for any data to be exfiltrated. That would still not be privacy-preserving at all, because of the IP addresses, which are the most glaring problem, but it would be an improvement.
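
To make the encryption idea concrete, here is a minimal sketch of what the client side could look like, assuming libsodium sealed boxes via PyNaCl purely as an example; the key handling and payload format are my own assumptions, not anything the proposal specifies:

```python
# Hypothetical sketch: seal the metrics payload on the client before upload,
# so that nginx and azafea-metrics-proxy only ever handle an opaque blob.
# Requires PyNaCl (libsodium bindings). Key distribution is out of scope here.
from nacl.public import PrivateKey, SealedBox

# For demonstration only: in a real deployment the private key would live far
# behind the web server, and clients would ship only the operator's public key.
operator_key = PrivateKey.generate()
operator_public_key = operator_key.public_key

def encrypt_metrics(payload: bytes) -> bytes:
    """Encrypt the serialized metrics so only the private-key holder can read them."""
    return SealedBox(operator_public_key).encrypt(payload)

ciphertext = encrypt_metrics(b"<serialized metrics payload>")
# This ciphertext is what would be POSTed to the metrics endpoint.
assert SealedBox(operator_key).decrypt(ciphertext) == b"<serialized metrics payload>"
```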

1 Like

Yeah, it seems to me that RH teams keep giving CPE more things to do while not providing it more staffing/infra accordingly.

2 Likes

I have not yet read your points in detail and unfortunately have to refocus on other things, but a final note from my side for today:

Be aware that this is not only about what you store, but also about what is transported and what information travels along with it:

This data from one machine already builds up towards a potential profile on its own; together with its route, it might even be able to identify a person/user/machine over time.

From a European perspective, this alone can already be a strong deterrent, because of the potential legal obligations that realistically come into play once US hops are involved.

I worry that this could cause a large decline in European Fedora users and undermine trust (the media can make much out of that). Whether the reasons are rational or not, and whether one agrees with them or not, becomes irrelevant once such a development starts. And it will be hard to revert. This is only one aspect.

With data of all kinds come implications (which start with determining and ensuring privacy, and extend to consideration of cultures and of social and technical possibilities/pitfalls, etc.) and thus responsibilities, and at the moment I’m not convinced that this is sufficiently recognized.

1 Like

I had been hoping that the “run your own telemetry server” idea would be sufficient to avoid concerns about collecting too much. But I am willing to consider releasing all the data. I would prefer not to, because I do not know whether people who are smarter than me could potentially figure out how to deanonymize it. But if the Fedora community really wants the data released, I’m not going to strongly object to that.

2 Likes

Well of course we won’t combine IP address data from the web server logs with the actual metrics database, which, to be clear, will not contain any IP addresses.

I’m not sure whether CPE would be willing to disable web server logs entirely because some logging is going to be needed in case the system comes under attack, but perhaps the web server logs could be kept for only a short period of time (e.g. two days)? Does anyone have concrete suggestions regarding this?

Is there anything specific you’d like to see here? Other than “apply software updates regularly,” which is a pretty basic expectation?

I’m certainly willing to consider system changes.

azafea-metrics-proxy does not see IP addresses (every connection to it would come from the web server), and it does need to be able to see what the metrics contain in order to store them in Redis, so it would need to be the point at which data is decrypted. Oops, Will says this is wrong: the data doesn’t have to be decrypted until after Azafea removes it from Redis.

I suppose we could encrypt the data with a public key encryption scheme, and split up the server such that nginx and azafea-metrics-proxy run on different servers, and the server running nginx does not have the private key to decrypt the data. This would make this more complicated to operate, of course. And it would make running your own metrics server slightly more complicated. It seems a little paranoid to me, but I guess we could do it if this is really considered important.

1 Like

So it’s true that data gets uploaded in batches, with a bunch of data points from one user sent all at once. But they will be stored separately, so those points are not associated with each other. (Exception: currently, if the request is malformed, it does all get stored together to facilitate debugging. That is, of course, not the normal expected behavior.)
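
To illustrate what “stored separately” means here, a hypothetical sketch (the table layout and field names are made up for illustration, not the real Azafea schema):

```python
# Hypothetical illustration: each event from an uploaded batch becomes its own
# row, with no batch ID or client identifier that would let the rows be
# re-associated with each other later.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT, occurred_at TEXT, payload TEXT)")

def store_batch(batch: list[dict]) -> None:
    """Insert each event individually, deliberately dropping any notion of
    which upload (and therefore which machine) it arrived in."""
    conn.executemany(
        "INSERT INTO events (event_type, occurred_at, payload) VALUES (?, ?, ?)",
        [(e["type"], e["occurred_at"], e["payload"]) for e in batch],
    )
    conn.commit()

store_batch([
    {"type": "app-opened", "occurred_at": "2023-07-01", "payload": "org.gnome.Evolution"},
    {"type": "hardware", "occurred_at": "2023-07-01", "payload": "nvidia"},
])
```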

Only if the server is malicious would it know that a user with a particular IP address uploaded this set of data. And presumably you trust Fedora to not be malicious if you’re a Fedora Linux user, so this scenario would correspond to the server being hacked by some third party. Even in this worst-case scenario, I’d like the data we collect to be sufficiently limited that it shouldn’t be a big deal.

1 Like

I’m not sure whether CPE would be willing to disable web server logs entirely

Oh, I’d understood the change proposal as involving no IP addresses being stored by anything anywhere at all, not just their absence from the metrics database. Looking at the text again, it is probably the “which notably means IP addresses must not be stored” that led me to believe the proposal involved no storing of IP addresses at all. Anyway, it does of course seem much better to disable the web server logs.

Is there anything specific you’d like to see here? Other than “apply software updates regularly,” which is a pretty basic expectation?

  • Having the machine that receives the initial requests be used solely for this proxying purpose, minimizing the software that runs on it.
  • Automatic updates for nginx and the system.
  • Generally considering it a sensitive system and restricting remote access methods and permissions accordingly.

And, sure, maybe some of these are pretty basic expectations. But basic things are easy to forget, and often there’s nobody who notices. It’s surprising what can simply go unnoticed, or the degree to which there can be nobody with the time or willingness to fix it. For an (otherwise unrelated) example, the GNOME 3.38 runtime on Flathub never actually got EOLed because the pipeline that was supposed to do that failed. That was probably noticed by many people, but one way or another it still isn’t flagged as EOL today, two years after the last update. It’s just the way software is: everything is frequently broken, the breakage doesn’t get noticed for years, and even when it is, nobody has the time to fix it anyway.

I suppose we could encrypt the data with a public key encryption scheme, and split up the server such that nginx and azafea-metrics-proxy run on different servers, and the server running nginx does not have the private key to decrypt the data.

Yes, exactly this, this would be great.

It seems a little paranoid to me, but I guess we could do it if this is really considered important.

Not paranoid at all: it hugely helps reduce the risk of IP addresses and app usage data (or any other information that will be collected, though app usage data is the most sensitive I’ve seen mentioned yet) ending up collected together, through any of the multiple ways that could otherwise happen. That’s one of the biggest risks, the encryption reduces it pretty significantly, and it’s far less complicated to do than the sort of thing that would be needed to hide the IP addresses themselves.

I think it’s the single improvement with the best cost-benefit ratio.

1 Like

But I am willing to consider releasing all the data. I would prefer not to, because I do not know whether people who are smarter than me could potentially figure out how to deanonymize it.

This is a bit concerning. I think the proposal should account for the scenario in which the telemetry server is compromised, and should make sure that the data a possible attacker could acquire is anonymized well enough that deanonymization isn’t possible.

Also, how are you going to deal with possible expansion of the collected telemetry data? I.e. if the user agrees to the telemetry with the current set of collected data, but further down the line you decide you want to collect more which the user might not be comfortable with. How would you relay this information to the user?

2 Likes

(Greetings from Endless OS. I’m the author of the blog post about this metrics system, linked in the change proposal. I am also a Fedora user in my spare time.)

Actually I don’t think it does. Azafea-metrics-proxy just prepends a received-at timestamp to the blob of binary data it receives and stores the result in redis. Only Azafea actually parses the received GVariant.

So it would be possible to encrypt to a public key in the client, and have the private key only live in Azafea, which is sufficiently far removed from the HTTP submission to have no idea what the IP address is and no way to get it by accident.
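
A rough sketch of how that could fit together, assuming PyNaCl sealed boxes and redis-py; the queue name, wire format, and function names here are illustrative, not the actual azafea-metrics-proxy or Azafea code:

```python
# Illustrative sketch only; the real proxy/Azafea implementation differs.
import struct
import time

import redis
from nacl.public import PrivateKey, SealedBox

r = redis.Redis(host="localhost", port=6379)

def proxy_handle_upload(encrypted_blob: bytes) -> None:
    """The proxy's job in this scheme: prepend a received-at timestamp and
    enqueue the still-encrypted blob. It never needs the private key."""
    record = struct.pack(">q", int(time.time())) + encrypted_blob
    r.lpush("telemetry-events", record)

def azafea_consume(private_key: PrivateKey) -> tuple[int, bytes]:
    """Azafea's side: pop a record, split off the timestamp, and decrypt.
    Only this process, far removed from the HTTP layer, ever holds the key."""
    _, record = r.brpop(["telemetry-events"])
    received_at = struct.unpack(">q", record[:8])[0]
    plaintext = SealedBox(private_key).decrypt(record[8:])
    return received_at, plaintext
```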

I have to say that I find the threat model here a little far-fetched (admittedly having only thought about it for about 90 seconds) but if it helps reassure people that the submitted events cannot be associated to an IP address, it can be done.

4 Likes

Michael addressed this above:

1 Like

It will be combined and linked on its route through many processing hops in many countries before it reaches your database. This is where your consideration has to start. By the way, this is also where the legal perspective has to start. What I have read so far breaks with the GDPR guidelines you yourself elaborated above (GDPR does not apply just to the actual database at the end but also to the preceding stages). Users with related obligations will have a problem… (and others will have justified worries, not just about breaking the GDPR but also about the reasons why it is broken).

I would need to trust many servers in between, several of which are likely to be within the US. This is already something many will not be allowed to trust (if they themselves process GDPR-relevant data), and many will not want to. As far as I know, you would be obligated to retain the capability to decrypt any cryptography on the path in between, right? (Feel free to correct me here, since I have not kept up to date in this respect for a while.) That’s a major reason why bringing personal data into the US is already a violation of the GDPR (be for or against that, but it causes issues for users, which creates negative incentives for them). You cannot limit the scenario to your database just so that it fits your preferences. You seem to want to produce data that can create personal profiles when put together, but you refuse to care about everything that happens before it reaches your database. And this is where my trust ends.

That said, whatever is done, users must be explicitly asked whether they want it (whether the button defaults to on or off). And they must be able to click through to an explanation that offers more clarity than the proposal. Anything else can create unforeseeable backlash, loss of trust, and bad perceptions, but also abuse (because only a small part of the community will actually be aware and thus able to rationally care), in many respects.

This proposal does not give many details but asks for approval. I am not sure that can be called transparent. At the same time, Ubuntu ensures GDPR compliance by using servers within the GDPR area, which already gives many people some basic assurance. I’m not arguing that their approach is better or worse, but claiming that ours is more careful and more transparent seems a little far-fetched.

2 Likes

Thanks!
In that case, let’s take some time to audit the data before we make it accessible. If it’s records such as “on July 1st there were 42000 installations of Evolution and 4200 installs with an NVidia graphics card”, then I don’t see how this could be de-anonymized, but maybe things are more complex (like maybe some apps being useful only on some archs or in some countries? Not sure).
Let’s take some extra steps to make sure we’re not storing deanonymizable (is that even a word?) data, but I’d love to make that data available to the community. I think that without transparency you’re asking people to just trust, and that may be a bit too much to ask for these days.

If this gets enabled, maybe we could make further collected metrics go through a Change Proposal or something like it?

1 Like

I believe the original intent was to only show aggregate data, so yes that would be how it would work.

The concern that many people expressed, though, is how the data would be recorded in the backend.

Couldn’t we just keep the aggregate data? I’m no expert in telemetry, so maybe that does not make sense. And if we end up needing some info that we dropped because we didn’t think it was useful before, then we just bite the bullet, start aggregating it, and work from that point in time onwards. I think it would help people trust that no personal data is collected if there weren’t any non-public dataset.

The aggregate data has to come from somewhere. I’m not an expert in databases or anything, but to make aggregate datasets you need to have a list of raw data to work from.

Not just that, but if you want granularity in your datasets, for example picking a date/time range, you need the non-aggregated data to be accessible live to whatever script does the aggregation. Or at least that’s how I understand it.
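
For example, something like this minimal sketch (hypothetical event fields, just to show on-demand aggregation over a chosen range):

```python
# Hypothetical sketch: answering "how many app-opened events per app in this
# date range?" means iterating over the stored raw events; a pre-aggregated
# summary can only answer the questions it was built for.
from collections import Counter
from datetime import date

raw_events = [
    {"event_type": "app-opened", "payload": "org.gnome.Evolution", "day": date(2023, 7, 1)},
    {"event_type": "app-opened", "payload": "org.gnome.Evolution", "day": date(2023, 7, 2)},
    {"event_type": "app-opened", "payload": "org.mozilla.firefox", "day": date(2023, 7, 2)},
]

def app_counts(start: date, end: date) -> Counter:
    """Aggregate on demand for whatever range the dashboard asks for."""
    return Counter(
        e["payload"]
        for e in raw_events
        if e["event_type"] == "app-opened" and start <= e["day"] <= end
    )

print(app_counts(date(2023, 7, 1), date(2023, 7, 31)))
# -> Counter({'org.gnome.Evolution': 2, 'org.mozilla.firefox': 1})
```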

I think something like what Valve is doing with the Steam Survey is a good idea: it shows stats on OSes used, graphics card usage, etc.

1 Like

Yeah! But as I said, I’d like way more information, especially about how the Fedora dev team wants to act on the information, mostly because I’m interested in what improvements the telemetry will bring. I mean, if the telemetry says that a certain feature is the most used and they want to optimize it, I’d be interested in hearing that. If the telemetry says that people go into a deep submenu of a submenu of a submenu to reach an action and they want to make that action easier to reach, I’d also be interested in hearing that. Really, I’m curious to know about any change they want to make because of the telemetry.

1 Like

I will make a note to add this and the encryption suggestion to the feedback section of the change proposal and will discuss it with CPE.

Sigh. I’ll ask about getting this fixed. :frowning: If anybody noticed, nobody reported it until now.

1 Like