Potential process and policies for approving particular metric collection — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

Continuing the discussion from F40 Change Request: Privacy-preserving Telemetry for Fedora Workstation (System-Wide):

And, here it is! Please use this topic to discuss potential policies and process for collecting specific data.

Quoting from the change proposal:

My suggestion that FESCo approve each metric to be collected is sort of a placeholder proposal. It’s a real proposal that we could really do, but there are many other ways we could approve metrics as well (e.g. a subcommittee, a community or developer vote, etc.). I was hoping for more community feedback on what the process should look like, but this has not been a focus of comments provided so far, so here’s a breakout topic to discuss just this.

I mentioned it in other breakout threads, but I think something important is that data is collected for a specific reason.

This means having a question you want to answer, deciding what metrics you need to do that, and only then collecting them. It is important to avoid collecting metrics “just in case”.

As part of that there needs to be a(n automated?) process for stopping the collection of a metric after the question has been answered.

I agree that whatever process we follow to approve metric collection, this should be a requirement.

An easy solution for this is to have some sort of default time limit on how long a metric can be collected. Say, two months.
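The default-time-limit idea above could be sketched as a metric registry entry that expires automatically. This is a purely hypothetical illustration; the names and fields are not part of any actual Fedora or Azafea tooling.

```python
from datetime import date, timedelta

# Hypothetical default collection window (~two months, per the suggestion above).
DEFAULT_COLLECTION_PERIOD = timedelta(days=60)

class MetricDefinition:
    """Illustrative registry entry for an approved metric."""

    def __init__(self, name, question, approved_on, expires_on=None):
        self.name = name
        self.question = question  # the specific question this metric answers
        self.approved_on = approved_on
        # If the approval sets no explicit expiry, apply the default limit.
        self.expires_on = expires_on or approved_on + DEFAULT_COLLECTION_PERIOD

    def is_active(self, today=None):
        """Collection stops automatically once the expiry date passes."""
        return (today or date.today()) < self.expires_on

metric = MetricDefinition(
    name="default-browser",
    question="Which browsers should we prioritize for integration testing?",
    approved_on=date(2023, 7, 1),
)
print(metric.is_active(today=date(2023, 8, 15)))  # True: within the window
print(metric.is_active(today=date(2023, 9, 15)))  # False: past the default limit
```

An extension of the window would then be a deliberate re-approval rather than the default, which keeps "collect just in case" from happening by inertia.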

But anyway, I’d prefer to keep this thread focused on what sort of governance process would be required to approve metrics. Should metrics be approved by FESCo? A new telemetry SIG? Something else?

Does FESCo have the expertise required to know whether the collected data allows re-identification (for example, when combined with other things, such as network data: accesses to the telemetry endpoint, Flatpak servers, or other third-party services), or whether it contains something that could be considered PII?

As noted in the other thread, Azafea does not store only aggregate numbers; there are also individual timestamped events. So the question of to what degree re-identification from a leaked or public dataset is possible is not so easy to answer.

Can Fedora back up the assertion that the data collection is “privacy-preserving” with something concrete? And also verify it again when new items are added?

Even if there is a local expert (and especially if there is not), it could be a good idea to try to recruit one or more independent outside experts to take a look and give comments, either signed or anonymous, and then publish those comments alongside the FESCo/telemetry-SIG/etc. decisions.

I don’t think we can expect FESCo to investigate the safety of particular metrics themselves, but we can count on FESCo to notice community feedback about the safety of metrics. So hopefully the community will keep eyes on the metrics collection proposals. I’m thinking having a category or tag here in Discourse would be useful for that. Since nobody else has suggested a concrete approval mechanism, my plan is: developer proposes a metric here in Discourse, community has two weeks to comment and flag concerns, then it goes to FESCo for a vote. If anybody manages to recruit a privacy expert or two to get involved and watch the proposals, it’d be even better.

Whether the system is actually privacy-preserving or not will depend entirely on what data gets collected. Since we’re not associating records together or building user profiles, our primary concern when approving metrics should be “how likely is it that this event could accidentally contain personal data?” Collecting hardware information is probably safe. Collecting filenames from your home directory is not. Events are timestamped, but we can reduce the granularity of those timestamps.
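The timestamp-granularity point above could work roughly like the following sketch, which simply truncates event timestamps to the day. This is an assumption about one possible approach, not a description of what Azafea actually does.

```python
from datetime import datetime

def coarsen_to_day(ts: datetime) -> datetime:
    """Drop the time-of-day so individual events are harder to correlate
    with each other or with external logs (e.g. server access logs)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

event_time = datetime(2023, 7, 14, 9, 37, 22)
print(coarsen_to_day(event_time))  # 2023-07-14 00:00:00
```

The trade-off is the usual one: coarser buckets mean less correlation risk but also less temporal detail for answering the original question, so the granularity could be chosen per metric when it is approved.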

I think re-identification from the dataset shouldn’t be possible if we ensure there are no IP addresses or arbitrary user-controlled text (“taylor swift upcoming hit.mp3”) anywhere in the dataset. A more likely attack would be to try to figure out whether different events are sent by the same client. They’re supposed to be stored separately. I can’t guarantee there is no way for smart people to guess that particular records are submitted by the same user. I can guarantee that I don’t know how. :slight_smile:

It would probably also help to give FESCo some guidance on what to take into account when making these decisions, e.g. when unsure or lacking sufficient input, reject.

I don’t know if you need Fedora Legal to sign off on additions, or if that’s too much work for them or not in scope.

Such a list of forbidden things could probably be extended with several more items: location information, MAC addresses, etc.

If there is going to be a checklist for FESCo (or whoever decides) to use, maybe "personal data" should be defined, and understood in a wide sense: anything that makes it possible to identify a natural person, either alone or in combination with the rest of the dataset or other information accessible in a reasonable attack scenario. (I know there's a list of "18 PII types" in use, but that's probably a US-ism.)

The hard part is probably making sure re-identification is hard even for smart people. There could be a clear explanation written somewhere of exactly what gets stored in the database and how, followed by a statement explaining why identifying persons from this data would be very hard.

Then it can be put somewhere visible, and you can wait for someone on the Internet to tell you why it is wrong. There are bits and pieces of this explanation in the comments here and in the proposal, which could probably be collected together.

I expect that we would get some guidance on general classes of things from legal that are okay within the parameters of whatever system we build. Anything that seems different from that, we’d bring back for review.

Should this be approved, I will see if I can get a functional definition like this from legal. (In addition to the GDPR, there are US privacy laws we also must comply with, so US-isms have relevance even if they’re not comprehensive.) If I can’t get lawyers to write that for us, we will have to go the other way around and get interested folks to write one, and I can take it back with “is this okay” and probably get a lot of “no this must be changed” and so on. But we’ll get it done. :classic_smiley:

Up until about a year ago I was a privacy officer for a corporation. Even though I only had to deal with US laws, it was almost impossible to give general guidance.

If you gave me a specific issue, I could give you an answer. However, if you asked for general guidance the answer was almost always “It depends”. Privacy law just isn’t that simple. Especially if you need to consider laws in all parts of the world.

I am closing these threads, as this change proposal has been withdrawn. We will open new related threads as needed when there is a new proposal on the table.