What data will be collected, exactly? — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

If this proposal is accepted, we still will not have authority to actually collect any data. Everything to be collected would still need to be approved separately, somehow. This change proposal contains a placeholder suggestion that FESCo would need to approve each metric to be collected, but we’d need to figure out how exactly this would work. Suggestions for alternative processes are welcome. For example, we could approve particular metrics through a process similar to the change proposal process. Or we could have some new subcommittee that decides whether particular metrics should be approved.

That said, if approved via this hypothetical future community process, I would indeed envision collecting application usage data. (That’s something I failed to consider in the “What data might we collect?” section of the change proposal.) But I don’t think it needs to be highly intrusive. Remember that one of the principles here is we don’t want to associate the collected data with particular users: we only want to know aggregate data. So let’s say that on Thursday you use Epiphany, Geary, gfeeds, GNOME Text Editor, and GNOME Console. We might send five separate events indicating that some user launched each application on that day. (Endless actually collects the time that each app was used, but judging from the concerned opinions thus far in this thread, that’s probably a little too invasive for Fedora.) So the data on the server might look something like: <Epiphany, 5,000 users on July 6>, <Firefox, 500,000 users on July 6>. But we wouldn’t need to collect, say, the set of applications launched on a given day, <User launched Epiphany, Geary, Firefox, and Thunderbird>, because why would we need to know that? And we certainly wouldn’t want to be able to associate that data with your IP address (we are explicitly prohibited from collecting IP addresses) or location, or to trace it back to you in any way, because that would be creepy.
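To make the "separate events, aggregate counts" idea concrete, here is a rough Python sketch. It is purely illustrative and not how any actual Fedora or Endless code works: the names `AppUsedEvent`, `events_for_day`, and `aggregate` are made up for this example, and it assumes each machine sends at most one event per application per day, so that a count of events can stand in for a count of users.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AppUsedEvent:
    """Hypothetical event: "some user launched this app on this day"."""
    app: str
    day: str  # date only, no time of day, no user or machine identifier


def events_for_day(launched_apps, day):
    """Client side: one independent event per distinct app launched that day."""
    return [AppUsedEvent(app, day) for app in sorted(set(launched_apps))]


def aggregate(events):
    """Server side: keep only a count per (app, day) pair."""
    return Counter((event.app, event.day) for event in events)


# Two machines report separately; nothing links one event to another.
first_machine = events_for_day(["Epiphany", "Geary", "Epiphany"], "July 6")
second_machine = events_for_day(["Epiphany", "Firefox"], "July 6")
totals = aggregate(first_machine + second_machine)
print(totals[("Epiphany", "July 6")])  # 2
```

Because each event carries only an application name and a date, the server has nothing it could use to reconstruct the set of applications that any one person launched.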

I would also limit data collection to packaged apps only, plus maybe a few hardcoded others. Let’s hypothetically say we’d collect usage data on Fedora applications, Flathub applications, Google Chrome, and Steam. It’s not safe to collect data on arbitrary applications because their names might contain personal data (e.g. “Michael’s Private Crime Stuff Application” or “Foo Corporation Top Secret Project App”) and we have to take due care not to collect anything that would likely contain personal data. (I would interpret this strictly, but not outrageously strictly. E.g. I don’t think a CPU model name is likely to contain personal data. But application names certainly might.)
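The allowlist idea could look something like the sketch below. Again, this is only an illustration under assumptions: the function name `reportable_app_id`, the `source` labels, and the two hardcoded identifiers are placeholders I made up, and how a client would actually determine where an application was installed from is left out entirely.

```python
# Hypothetical values for illustration only; real identifiers may differ.
REPORTABLE_SOURCES = frozenset({"fedora", "flathub"})
HARDCODED_APP_IDS = frozenset({"google-chrome", "steam"})


def reportable_app_id(app_id, source):
    """Return app_id if it may be reported, otherwise None.

    source is an assumed label for where the app was installed from,
    or None when that is unknown (e.g. something the user built locally).
    """
    if source in REPORTABLE_SOURCES or app_id in HARDCODED_APP_IDS:
        return app_id
    # Anything else is dropped before an event is ever created, so an
    # arbitrary application name never leaves the machine.
    return None


print(reportable_app_id("org.gnome.Epiphany", "fedora"))
print(reportable_app_id("Michael's Private Crime Stuff Application", None))
```

The important design choice is that the filter defaults to not reporting: an application is only counted if it is positively known to come from an approved source or is on the short hardcoded list.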

This comment is a little handwavey and hypothetical because I’m discussing a metric that I haven’t proposed collecting yet, and I haven’t thought that much about it, but this should give you an idea of how I’m approaching this. If Fedora ever collects so much as to feel invasive, then we’ve messed up and violated user trust. But I don’t think counting application usage in such a general manner is really that invasive (especially in contrast to, say, a proprietary software platform, which is probably keeping track of the applications that you personally are using, and who knows for sure because we cannot see the source code).
