Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

I think this is not paranoid at all.

So:

  • the telemetry client receives a public key from the telemetry server
  • the telemetry client encrypts traffic with TLS and additionally with that public key
  • the NGINX server decrypts one layer of the traffic and stores the data as archives. Once a batch of archives has been collected, the list is randomly renamed and shuffled, then renamed and shuffled again.
  • the NGINX server sends the archives to the telemetry server, where they are decrypted and stored.

That is how I understood the components, at least. A rough sketch of the client side is below.
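For concreteness, here is a minimal sketch of that client-side flow, assuming a libsodium-style sealed box (PyNaCl) for the inner layer and a plain HTTPS POST for the outer one; the URL, key path, and payload format are made up for illustration:

import json
import requests
from nacl.public import PublicKey, SealedBox

# Hypothetical key location: the server's public key shipped to the client beforehand.
with open("/etc/telemetry/server-public.key", "rb") as key_file:
    server_key = PublicKey(key_file.read())

payload = json.dumps({"event": "updater_failure", "os_version": "4.0.0"}).encode()

# Inner layer: encrypt to the telemetry server's public key, so the NGINX proxy
# that terminates TLS only ever sees an opaque blob.
ciphertext = SealedBox(server_key).encrypt(payload)

# Outer layer: ordinary TLS between the client and the proxy (hypothetical URL).
requests.post("https://telemetry.example.org/submit", data=ciphertext, timeout=30)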

3 Likes

What data will actually be stored in the end? Is it only aggregate metric numbers? Or an event stream from each user, with some unique ID for each? What is the retention time?

Azafea doesn’t seem to have documentation that clearly explains it, so one would need to dig into its sources.

Knowing what is recorded is important for judging whether one could do re-identification from the stored dataset, and just from reading the proposal text and the Azafea documentation I don’t think one gets much understanding of this. This should probably be explained better.

The proposal should probably explain, at least in the abstract, what sort of data would be recorded and for how long, and what impact this has on how easy re-identification is. Or at least what the aim is. If re-identification is easy, then one should consider whether the data is actually personally identifying.

Also, I think one should not aim here for the maximum leeway legally allowed (see how the “Fedora data collection policy” section is phrased), but make an objective assessment and set clear boundaries.

The data collection policy should not be something that is there just because you are legally required to have it. Instead, it should be something to guide the whole data collection effort.

2 Likes

You’re right, the schema isn’t obvious from the Azafea documentation. Each class on the Events page of the documentation corresponds to a database table. They all have a foreign key pointing to a “channel” table.

I’m sure it is possible to configure sphinx/readthedocs/sqlalchemy to render the database schema as part of the documentation. This would be good to add.

Each event is stored individually. Each recorded event has an integer ID that identifies the row in the database, and a reference to a “channel”, which in Endless OS identifies the originally-installed OS image; there is no ID for the individual user, nor a reference to the batch of events it was submitted in.
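In SQLAlchemy terms the shape is roughly the following (this is not Azafea’s actual model code, just a sketch with a made-up event name):

from sqlalchemy import Column, DateTime, ForeignKey, Integer, Unicode
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Channel(Base):
    __tablename__ = "channel_v3"
    id = Column(Integer, primary_key=True)
    image_id = Column(Unicode, nullable=False)  # the originally-installed OS image

class SomeEvent(Base):
    __tablename__ = "some_event_v3"           # hypothetical event table
    id = Column(Integer, primary_key=True)     # identifies the row, not a user
    occured_at = Column(DateTime(timezone=True), nullable=False)
    channel_id = Column(Integer, ForeignKey("channel_v3.id"))  # no per-user or per-batch key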

If you run the code on Fedora today, all events will be attributed to the same channel with a blank image ID, because Fedora systems don’t have the Endless OS-specific eos-image-version xattr on the root directory of the root filesystem.
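That check amounts to something like this (a sketch; I am assuming the attribute is exposed in the user xattr namespace):

import os

def image_id_from_root() -> str:
    """Return the Endless OS image ID from the root directory, or "unknown" if absent."""
    try:
        return os.getxattr("/", "user.eos-image-version").decode()
    except OSError:
        # Fedora systems do not carry this xattr, so everything lands in one channel.
        return "unknown"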

Currently there is no automatic rule to expire & delete old events. This would be a great addition, combined with a batch process to aggregate the event stream at different resolutions (e.g. day, week, month). In practice it is rarely useful to look at individual events, only aggregated summaries. (A complication for Endless OS is that our users are often very intermittently connected, so we don’t know when the events for July 11th 2023 will stop arriving, and it’s actually very hard to determine this latency because the received_at time is not stored on each event…)
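As a rough illustration of what such a batch job could look like, reusing the updater_failure_v3 table shown below (the daily summary table and the two-year retention window are invented for the example):

import psycopg2

RETENTION = "2 years"  # hypothetical retention policy

with psycopg2.connect("dbname=azafea") as conn, conn.cursor() as cur:
    # Roll individual events up into an assumed daily summary table
    # (unique on (day, component)).
    cur.execute("""
        INSERT INTO updater_failure_daily (day, component, count)
        SELECT date_trunc('day', occured_at), component, count(*)
        FROM updater_failure_v3
        GROUP BY 1, 2
        ON CONFLICT (day, component) DO UPDATE SET count = EXCLUDED.count
    """)
    # Expire raw events that have fallen outside the retention window.
    cur.execute(
        "DELETE FROM updater_failure_v3 WHERE occured_at < now() - %s::interval",
        (RETENTION,),
    )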

I’ll put some table schemas and sample data behind a cut to avoid posting an even more gigantic wall of text. I hope this is helpful!

Database schemas and sample data within

To make this concrete, while bearing in mind that the Fedora proposal does not presuppose that Fedora would record the same data points that Endless OS does, and that Fedora would define a “channel” differently, here’s the schema for the “updater failure” table:

azafea=> \d updater_failure_v3
                               Table "public.updater_failure_v3"
    Column     |           Type           | Collation | Nullable |           Default            
---------------+--------------------------+-----------+----------+------------------------------
 id            | integer                  |           | not null | generated always as identity
 os_version    | character varying        |           | not null | 
 occured_at    | timestamp with time zone |           | not null | 
 component     | character varying        |           | not null | 
 error_message | character varying        |           | not null | 
 channel_id    | integer                  |           |          | 

And the channel_v3 table it refers to:

                                       Table "public.channel_v3"
      Column       |            Type             | Collation | Nullable |           Default            
-------------------+-----------------------------+-----------+----------+------------------------------
 id                | integer                     |           | not null | generated always as identity
 image_id          | character varying           |           | not null | 
 site              | jsonb                       |           | not null | 
 dual_boot         | boolean                     |           | not null | 
 live              | boolean                     |           | not null | 
 image_product     | character varying           |           |          | 
 image_branch      | character varying           |           |          | 
 image_arch        | character varying           |           |          | 
 image_platform    | character varying           |           |          | 
 image_timestamp   | timestamp without time zone |           |          | 
 image_personality | character varying           |           |          | 
 site_id           | character varying           |           |          | 
 site_city         | character varying           |           |          | 
 site_state        | character varying           |           |          | 
 site_street       | character varying           |           |          | 
 site_country      | character varying           |           |          | 
 site_facility     | character varying           |           |          | 
  • The image_id column is the image ID
  • The site column is an optional string-to-string dictionary which is empty by default, which is intended for use in contexts where, for example, the same OS image is deployed in computer labs in several schools in a given region and our deployment partner wants to be able to distinguish between the different schools. (We have essentially never used this feature, which needs to be manually configured on each client.)
  • The dual_boot and live columns are booleans with an obvious meaning (I hope)
  • All the image_* and site_* fields are just views on the image_id and site fields.

Here are a couple of randomly-selected rows from this pair of tables:

azafea=> select updater_failure_v3.*, channel_v3.image_id, channel_v3.site, channel_v3.dual_boot, channel_v3.live from updater_failure_v3 tablesample bernoulli(0.1) join channel_v3 on updater_failure_v3.channel_id = channel_v3.id limit 2;
-[ RECORD 1 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id            | 540
os_version    | 4.0.0
occured_at    | 2021-12-21 20:38:50.478174+00
component     | eos-updater-flatpak-installer
error_message | Couldn’t apply some flatpak update actions for this boot: Failed to read commit dd5cf78a2f925ba3892cc9a168e8102e1a0e16e4b824f8cfdcca8f3d654e3aa5: No such metadata object 1655f4f3e085fa547a37ad2ae3dc1d93a5d0e2a2c51d3e4785a5e03f701cecd3.dirtree
channel_id    | 3262
image_id      | eos-eos3.5-amd64-amd64.190408-212651.en
site          | {}
dual_boot     | f
live          | f
-[ RECORD 2 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id            | 81961
os_version    | 4.0.5
occured_at    | 2022-05-17 08:31:18.238816+00
component     | eos-updater
error_message | Error fetching update: opcode close: min-free-space-percent '3%' would be exceeded, at least 13.3 kB requested
channel_id    | 354017
image_id      | eos-eos4.0-amd64-amd64.211213-144019.base
site          | {}
dual_boot     | f
live          | f

(By the way, these are two of the most common classes of updater errors on Endless OS: corruption in the ostree repo and/or filesystem, and insufficient free space to pull the update. And yes, it would be better to report something more easily-queried than “the message field of a GError” but perfect is the enemy of good.)

Some data is aggregated client-side before being sent to the server. Here are two randomly-selected rows from the daily app usage table:

azafea=> select daily_app_usage_v3.*, channel_v3.image_id, channel_v3.site, channel_v3.dual_boot, channel_v3.live from daily_app_usage_v3 tablesample bernoulli(0.1) join channel_v3 on daily_app_usage_v3.channel_id = channel_v3.id  limit 2;
-[ RECORD 1 ]+-----------------------------------------------
id           | 11797684
os_version   | 4.0.10
period_start | 2023-05-22
count        | 1955
app_id       | google-chrome.desktop
channel_id   | 138
image_id     | eosoem-eos3.6-amd64-nexthw.190923-084936.pt_BR
site         | {}
dual_boot    | f
live         | f
-[ RECORD 2 ]+-----------------------------------------------
id           | 11798176
os_version   | 5.0.2
period_start | 2023-05-22
count        | 8009
app_id       | org.chromium.Chromium.desktop
channel_id   | 376
image_id     | eos-eos4.0-amd64-amd64.211123-052013.base
site         | {}
dual_boot    | f
live         | f

The count field is a duration in seconds. (You may read this and think, shouldn’t it at least be rounded off to some lower precision? Yes, that would be great!)
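Rounding client-side, before the value ever leaves the machine, could be as simple as this (the five-minute granularity is just an example):

GRANULARITY = 300  # seconds, i.e. report usage in 5-minute buckets (illustrative value)

def coarsen_usage(seconds: int) -> int:
    """Round a usage duration down to the nearest bucket before reporting it."""
    return (seconds // GRANULARITY) * GRANULARITY

assert coarsen_usage(1955) == 1800   # the first sample row above
assert coarsen_usage(8009) == 7800   # the second sample row above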

We can see which other apps were used by users in those channels on that date, but we can’t tell which specific users used which combination of apps.

On that particular date, there were 1746 users in the eosoem-eos3.6-amd64-nexthw.190923-084936.pt_BR channel and 30 in the eos-eos4.0-amd64-amd64.211123-052013.base channel. In that month, there were 3247 and 41 respectively.

Again, on Fedora as the code stands, all users would be attributed to the same channel with an unknown image ID. This field could be used to distinguish (e.g.) Silverblue from Workstation, OEM preinstall from download, and perhaps the originally installed Fedora branch. But I would imagine that a channel would be defined much more broadly than on Endless OS.

2 Likes


One thought (apologies if it’s been mentioned above and I missed it, but I did not see it while semi-skimming / semi-reading this topic – after reading the 200+ posts on the original topic last night I ran out of bandwidth):

Can we ensure / require that each metric we collect is implemented by a separate RPM package? That way users can opt in to telemetry in general, but opt out of providing any info they consider sensitive.

2 Likes

Not a possibility on Silverblue/Kinoite without major changes. I would say:

  • if the name doesn’t match, block telemetry (don’t ever collect data on Fedora forks)
  • a config file where you can enable / disable services from even running

RPMs are interesting but not practical for OSTree. But something similarly clean is a good idea.

1 Like

Are those even in scope? I thought this was only for Workstation. Is that not the case?

My reading of the current tentative design is that Silverblue would actually get the system, because it’s ultimately built out of the same bits as Workstation. In the other thread, catanzaro said the package would be added to the workstation group in comps and also added as a weak dep of gnome-control-center; those changes would certainly cause it to be included in Silverblue.

As described, it wouldn’t be in KDE or Kinoite (or any other spin/edition besides Workstation/Silverblue and any other GNOME-based spin), I don’t think.

1 Like

The full telemetry implementation may not be present and hence data not sent, but I wouldn’t make the assumption that no component would be present. For example, it’s been discussed that Toolbox images being used is of interest.

As described, and as I understand it, the system wouldn’t work through existing components reporting stuff into it, but through the telemetry system actively going and getting the data. So if it’s not installed, nothing is different. Taking a census of toolbox images doesn’t require changes to toolbox; you can do it yourself just by running toolbox list, after all. All the data collection system would need to do is… do that, or a slightly more programmatic equivalent of that.
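In other words, a hypothetical collector could be as simple as the following; the flag and the naive output parsing are assumptions, not a description of an actual component:

import subprocess

# A census of Toolbx images needs no changes to toolbox itself: just ask it,
# the same way "toolbox list --images" does interactively.
result = subprocess.run(
    ["toolbox", "list", "--images"],
    capture_output=True, text=True, check=True,
)
# Naive parsing: skip the header line and take the image-name column.
images = [line.split()[1] for line in result.stdout.splitlines()[1:] if line.strip()]
print(images)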

The timestamps, if combined with something else such as network data, might allow re-identification, especially if someone is doing something rare. I guess the submit time is not stored anywhere?

Text fields in telemetry data also seem like something one needs to be careful about, as e.g. free-form error messages can contain IP addresses or other PII. This might not be the case with installers.

One might want to quantize the timestamps to a lower resolution here and have background jobs aggregate the events. I’m not sure what can be done with text fields; one would probably need client-side filters that censor obvious PII patterns. Probably the safest is to not collect such fields at all (nor location or anything in known PII categories).

Whoever ends up having to review what data will be collected will need some expertise in assessing questions like this (differential privacy, how to do re-identification, what could be PII, etc.). I don’t really have it. Does FESCo have such expertise? I guess not, since it is voted in. So it sounds like it would be best if the deciding body in Fedora recruited some independent expert help, and got comments whenever the collected dataset is modified.

The goal would not be to tick some legal checkboxes, but to make an honest best effort to minimize any risks in the data collection. Especially when it is advertised as “privacy-preserving”, that has to be backed up by some real analysis showing it really is privacy-preserving in some sense. Right now, when it’s not clear what data will be collected and no guidelines have yet been agreed on, I at least don’t really know that it is “privacy-preserving”.

2 Likes

Note from the proposal:

Each metric is stored in the database with a Unix timestamp indicating when it was generated on the client. If abused, this timestamp could allow correlation of data points that are collected at the same time as each other, or at a fixed time offset to other events. For example, if the system were designed to collect two metrics exactly 300 seconds after the system were booted, then just looking at the timestamps would be enough to determine that both metrics recorded at the same time were submitted by the same user. Accordingly, we should consider modifying the metrics server to reduce timestamp granularity at least somewhat.

I think the idea of expert oversight is a good one.

3 Likes

Right… I’m not familiar with the EOS telemetry system specifically, but the ones I am familiar with are normally designed to be pluggable, with collectors for each metric.

We can find a way to toggle each one on and off separately, I suppose, e.g. via /etc/telemetry.overrides.d/… so files can be dropped in easily.
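As a sketch of that drop-in idea (the directory, file layout, and key names here are hypothetical, just to show the mechanism):

import configparser
import glob

def enabled_metrics() -> dict:
    """Merge /etc/telemetry.overrides.d/*.conf; later files win, as in other *.d directories."""
    settings = {}
    for path in sorted(glob.glob("/etc/telemetry.overrides.d/*.conf")):
        cfg = configparser.ConfigParser()
        cfg.read(path)
        if not cfg.has_section("metrics"):
            continue
        for metric in cfg["metrics"]:
            settings[metric] = cfg.getboolean("metrics", metric)
    return settings

# A drop-in such as 50-no-app-usage.conf could then contain:
#   [metrics]
#   daily_app_usage = false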

Several users have requested granular control over what data is collected, but creating a new RPM package for each metric and then testing whether that package is installed is really overkill. That’s not a suitable way to do this.

It would be both: the central component (eos-metrics-instrumentation) will implement many metrics, but individual components that want to record metrics can do so too. Let’s hypothetically say the Calculator developers want to know which calculator modes are used the most. That’s not something that can easily be determined except inside Calculator itself, so that is the point where the metric should be collected.

Correct. The submit time does not get stored, but the time the metrics were recorded on the client side does get recorded.

That was the very first thing I thought when I saw those error messages.

The second thing I thought was “this is a really good example of how collecting data can be used to improve Fedora.” Now I want to go out and add some telemetry to GNOME Software to let us know about failed updates.

But collecting a free-form error message could easily go awry. For example, coincidentally last week my GNOME Software was actually failing to update with an error message that would have revealed an internal company URL, and maybe also an internal repository name (I don’t remember exactly). Hypothetically, the error message could have contained something much worse like “- nothing provides illegal-software-backend needed by illegal-software-1.23-4.fc38.x86_64.”

Of course that doesn’t mean we should not collect any data on update errors. I would submit error codes rather than error messages. It might be harder to investigate what went wrong, but we would make do.

Matthew already pointed out my change proposal says the timestamp granularity should be reduced by the server. I suppose that would actually be better performed by the client instead, since nobody here seems to trust the server very much.
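Client-side, that reduction could be as trivial as truncating each timestamp before submission (the one-hour granularity is only an example, not a decided value):

from datetime import datetime, timezone

def coarsen_timestamp(ts: datetime) -> datetime:
    """Truncate an event timestamp to the hour so co-generated events no longer share an exact instant."""
    return ts.replace(minute=0, second=0, microsecond=0)

print(coarsen_timestamp(datetime(2022, 5, 17, 8, 31, 18, tzinfo=timezone.utc)))
# 2022-05-17 08:00:00+00:00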

I’m not sure what can be done with text fields; one would probably need client-side filters that censor obvious PII patterns. Probably the safest is to not collect such fields at all (nor location or anything in known PII categories).

My rule of thumb is to not collect text fields unless we can be reasonably certain they will never contain personal data. For example, “AMD Radeon™ RX 570 Series” is text but your graphics card name is not very likely to contain personal data. The only way collecting your graphics card name is likely to reveal sensitive data is if you’re an AMD employee working on a secret unreleased graphics card, in which case hopefully you would turn off the telemetry. So that’s pretty safe to collect.

Now say we want to collect the image names used to create Toolbx containers. Usually the image name will be something harmless like “fedora-toolbox-38”, which can be collected safely, but it could also be named “foo-corp-top-secret-image.” So we can’t just blindly collect image names either. But we can collect some image names: it’s probably OK to collect a name if it shows up on a public container registry.
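One way to express that rule in code would be an allowlist of known-public names; everything else collapses into a generic bucket (the list contents are only examples):

# Only report image names we already know to be public; anything else is reported
# as "other" so private registry names never leave the machine.
PUBLIC_IMAGES = {
    "registry.fedoraproject.org/fedora-toolbox",
    "quay.io/example/some-public-toolbox",   # placeholder entry
}

def reportable_image(name: str) -> str:
    base = name.split(":")[0]  # drop the tag, e.g. ".../fedora-toolbox:38"
    return base if base in PUBLIC_IMAGES else "other"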

Filenames, usernames, hostnames, error messages, etc. are all obviously not safe to collect. I don’t think trying to detect and censor “obvious PII patterns” is likely to be successful. We should just be really, really careful when collecting text. And this is why I suggest that each metric to be collected be examined and approved individually.

I’ll respond to your final points in the process topic since you posted similar questions in both places.

1 Like

This event was retrofitted onto the top-level error-handling block of eos-updater. Unless you’ve been very, very careful to maintain a good taxonomy of error codes, in my experience they all tend to be G_IO_ERROR_FAILED or similar by the time they bubble up to the top level of the application. Reporting the error text has many downsides as you say, but the codes alone would be next to useless.

I’m not disagreeing with the fundamental point – it would be preferable to spend the time improving the error codes and then reporting those instead – just explaining why it’s like this.

(There is actually a much more mundane problem with sending error messages: they are localized. I have a gross saved SQL query that pattern-matches translations of “Input/output error” in a half-dozen languages as a result…)

1 Like

The only way collecting your graphics card name is likely to reveal sensitive data is if you’re an AMD employee working on a secret unreleased graphics card, in which case hopefully you would turn off the telemetry. So that’s pretty safe to collect.

I’d not thought of this aspect, but it has an impact on us (Lenovo) and might impact other vendors using Fedora as a preload. I personally use Fedora on prototype HW all the time… and I’d be in big trouble if that data were public. A switch to disable telemetry will be important for us during certain phases of our program.

I know it’s getting a bit niche, but would it be reasonable to suggest identifying prototype HW from the DMI information and disabling telemetry by default in those cases? I think all of ours have ‘SIT’ or ‘SVT’ in the product name, and I suspect other vendors would have similar mechanisms.
I’d be happy to contribute directly to implementing this (if it is feasible). I fully appreciate that this particular aspect is selfish, but I don’t want to risk getting the Linux team labelled as a source of information leaks, as the backlash internally would be brutal.
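Something along these lines might work; the ‘SIT’/‘SVT’ markers come from this post, and the rest is only a sketch:

from pathlib import Path

PROTOTYPE_MARKERS = ("SIT", "SVT")  # Lenovo-style prototype markers; other vendors would add their own

def looks_like_prototype_hw() -> bool:
    """Suggest disabling telemetry by default on hardware that identifies itself as a prototype."""
    try:
        product = Path("/sys/class/dmi/id/product_name").read_text().strip()
    except OSError:
        return False
    return any(marker in product for marker in PROTOTYPE_MARKERS)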

For me this post was a little bit of an ‘oh crud - hadn’t considered that’ moment. Another reason why these proposals and discussion are important :slight_smile: !

2 Likes

That’s true; it would only work if error codes were a lot more specific than G_IO_ERROR_FAILED.

Sure, why not? I think you would probably not consent to the telemetry in the first place if using prototype hardware, but having extra checks to avoid mistakes seems useful.