Potential process and policies for approving particular metric collection — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

Continuing the discussion from F40 Change Request: Privacy-preserving Telemetry for Fedora Workstation (System-Wide):

And, here it is! Please use this topic to discuss potential policies and process for collecting specific data.

Quoting from the change proposal:

My suggestion that FESCo approve each metric to be collected is sort of a placeholder proposal. It’s a real proposal that we could really do, but there are many other ways we could approve metrics as well (e.g. a subcommittee, a community or developer vote, etc.). I was hoping for more community feedback on what the process should look like, but this has not been a focus of comments provided so far, so here’s a breakout topic to discuss just this.

I mentioned it in other breakout threads, but I think something important is that data is collected for a specific reason.

This means having a question you want to answer, deciding what metrics you need to do that, and only then collecting them. It is important to avoid collecting metrics “just in case”.

As part of that there needs to be a(n automated?) process for stopping the collection of a metric after the question has been answered.

I agree that whatever process we follow to approve metric collection, this should be a requirement.

An easy solution for this is to have some sort of default time limit on how long a metric can be collected. Say, two months.
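The default-time-limit idea above could be sketched as a metric registry entry that expires automatically. This is a purely hypothetical illustration; the names and fields are not part of any actual Fedora or Azafea tooling.

```python
from datetime import date, timedelta

# Hypothetical default collection window (~two months, per the suggestion above).
DEFAULT_COLLECTION_PERIOD = timedelta(days=60)

class MetricDefinition:
    """Illustrative registry entry for an approved metric."""

    def __init__(self, name, question, approved_on, expires_on=None):
        self.name = name
        self.question = question  # the specific question this metric answers
        self.approved_on = approved_on
        # If the approval sets no explicit expiry, apply the default limit.
        self.expires_on = expires_on or approved_on + DEFAULT_COLLECTION_PERIOD

    def is_active(self, today=None):
        """Collection stops automatically once the expiry date passes."""
        return (today or date.today()) < self.expires_on

metric = MetricDefinition(
    name="default-browser",
    question="Which browsers should we prioritize for integration testing?",
    approved_on=date(2023, 7, 1),
)
print(metric.is_active(today=date(2023, 8, 15)))  # True: within the window
print(metric.is_active(today=date(2023, 9, 15)))  # False: past the default limit
```

An extension of the window would then be a deliberate re-approval rather than the default, which keeps "collect just in case" from happening by inertia.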

But anyway, I’d prefer to keep this thread focused on what sort of governance process would be required to approve metrics. Should metrics be approved by FESCo? A new telemetry SIG? Something else?

Does FESCo have the expertise required to know whether the collected data allows re-identification (for example, when combined with other things, such as network data: accesses to the telemetry endpoint, Flatpak servers, or other third-party services), or whether it contains something that could be considered PII?

As noted in the other thread, Azafea does not store only aggregate numbers; there are also individual timestamped events. So the question of to what degree re-identification from a leaked or public dataset is possible is not so easy to answer.

Can Fedora back up the assertion that the data collection is “privacy-preserving” with something concrete? And also verify it again when new items are added?

Even if there is a local expert (and especially if there is not), it could be a good idea to try to recruit one or more independent outside experts to take a look and give comments, either signed or anonymous, and then publish those comments alongside the FESCo/telemetry-SIG/etc. decisions.

I don’t think we can expect FESCo to investigate the safety of particular metrics themselves, but we can count on FESCo to notice community feedback about the safety of metrics. So hopefully the community will keep eyes on the metrics collection proposals. I’m thinking having a category or tag here in Discourse would be useful for that. Since nobody else has suggested a concrete approval mechanism, my plan is: developer proposes a metric here in Discourse, community has two weeks to comment and flag concerns, then it goes to FESCo for a vote. If anybody manages to recruit a privacy expert or two to get involved and watch the proposals, it’d be even better.

Whether the system is actually privacy-preserving or not will depend entirely on what data gets collected. Since we’re not associating records together or building user profiles, our primary concern when approving metrics should be “how likely is it that this event could accidentally contain personal data?” Collecting hardware information is probably safe. Collecting filenames from your home directory is not. Events are timestamped, but we can reduce the granularity of those timestamps.
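The timestamp-granularity point above could work roughly like the following sketch, which simply truncates event timestamps to the day. This is an assumption about one possible approach, not a description of what Azafea actually does.

```python
from datetime import datetime

def coarsen_to_day(ts: datetime) -> datetime:
    """Drop the time-of-day so individual events are harder to correlate
    with each other or with external logs (e.g. server access logs)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

event_time = datetime(2023, 7, 14, 9, 37, 22)
print(coarsen_to_day(event_time))  # 2023-07-14 00:00:00
```

The trade-off is the usual one: coarser buckets mean less correlation risk but also less temporal detail for answering the original question, so the granularity could be chosen per metric when it is approved.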

I think re-identification from the dataset shouldn’t be possible if we ensure there are no IP addresses or arbitrary user-controlled text (“taylor swift upcoming hit.mp3”) anywhere in the dataset. A more likely attack would be to try to figure out whether different events are sent by the same client. They’re supposed to be stored separately. I can’t guarantee there is no way for smart people to guess that particular records are submitted by the same user. I can guarantee that I don’t know how. :slight_smile:

It would probably also help to give FESCo some guidance on what to take into account when making these decisions, e.g. when unsure or lacking sufficient input, reject.

I don’t know if you need Fedora Legal to sign off on additions, or if that’s too much work for them or not in scope.

Such a list of forbidden things could probably be extended with several more items: location information, MAC addresses, etc.

If there is going to be a checklist for FESCo (or whoever decides) to use, maybe "personal data" should be defined, and understood in a wide sense: anything that makes it possible to identify a natural person, either alone or in combination with the rest of the dataset or other information accessible in a reasonable attack scenario. (I know there's a list of "18 PII types" in use, but that's probably a US-ism.)

The hard part is probably making sure re-identification is hard even for smart people. There could be a clear explanation written somewhere of exactly what gets stored in the database and how, followed by a statement explaining why identifying persons from this data would be very hard.

Then it can be put somewhere visible, and you can wait for someone on the Internet to tell you why it is wrong. There are bits and pieces of this explanation in the comments here and in the proposal, which could probably be collected together.

I expect that we would get some guidance on general classes of things from legal that are okay within the parameters of whatever system we build. Anything that seems different from that, we’d bring back for review.

Should this be approved, I will see if I can get a functional definition like this from legal. (In addition to the GDPR, there are US privacy laws we also must comply with, so US-isms have relevance even if they’re not comprehensive.) If I can’t get lawyers to write that for us, we will have to go the other way around and get interested folks to write one, and I can take it back with “is this okay” and probably get a lot of “no this must be changed” and so on. But we’ll get it done. :classic_smiley:

Up until about a year ago I was a privacy officer for a corporation. Even though I only had to deal with US laws, it was almost impossible to give general guidance.

If you gave me a specific issue, I could give you an answer. However, if you asked for general guidance the answer was almost always “It depends”. Privacy law just isn’t that simple. Especially if you need to consider laws in all parts of the world.

I am closing these threads, as this change proposal has been withdrawn. We will open new related threads as needed when there is a new proposal on the table.