Approaches to data handling, safety, and avoiding individual identification — a breakout topic for the F40 Change Request on Privacy-preserving telemetry for Fedora Workstation

paradoxguitarist · July 9, 2023, 11:06pm

That would be an excellent additional safety rail for those concerned.

grumpey · July 9, 2023, 11:25pm

I would rather this be explicitly installed vice as a dependency. Why not do something like eos-metrics-package-name?

paradoxguitarist · July 9, 2023, 11:32pm

I’ll be honest. I don’t know what you’re talking about. Can you explain?

I don’t think the package name (that or anything else) would affect it, and I’m not familiar with a vice setting for rpm/dnf.

IIRC, weak dependencies can be excluded or uninstalled, and dnf/yum will still install and update the package that pulls them in. If you have it listed as a normal/required dep, it starts throwing errors. So if you tried to dnf remove eos-metrics-package-name, it would uninstall GNOME Desktop as the proposal currently stands. Letting it stand as a weak one, you could keep using the GNOME DE.

grumpey · July 9, 2023, 11:37pm

Do not use a weak dependency; it would need to be installed by the user or be part of a group.

paradoxguitarist · July 9, 2023, 11:49pm

Wouldn’t it be part of the Fedora Workstation group?

fatka · July 9, 2023, 11:50pm

The rationale behind weak dependency is to enable users who disable telemetry to use an application that enables telemetry without having to install the telemetry packages.

1. telemetry on + telemetry enabled application + eos-*
2. telemetry off + telemetry enabled application + eos-* (collection happens, but isn’t sent)
3. telemetry off + telemetry enabled application, eos-* not installed (no telemetry collection, so not exposed to bugs, malicious or proprietary programs abusing the system)

Weak dependencies enable 3 without hampering 1 or 2.
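The three cases above can be sketched as a small decision function. This is only an illustration of the reasoning in this post, not how the eos-* packages are actually implemented; the names `SystemState` and `telemetry_behavior` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    telemetry_enabled: bool  # the user's telemetry toggle
    eos_installed: bool      # whether the eos-* packages are present


def telemetry_behavior(state: SystemState) -> tuple[bool, bool]:
    """Return (collected_locally, sent_upstream) for a telemetry-enabled app.

    Assumption taken from the post: with eos-* installed, collection happens
    locally even when the toggle is off; nothing is sent unless it is on.
    """
    collected = state.eos_installed
    sent = state.eos_installed and state.telemetry_enabled
    return collected, sent


case_1 = telemetry_behavior(SystemState(telemetry_enabled=True, eos_installed=True))
case_2 = telemetry_behavior(SystemState(telemetry_enabled=False, eos_installed=True))
case_3 = telemetry_behavior(SystemState(telemetry_enabled=False, eos_installed=False))
```

Under these assumptions, only the third case avoids local collection entirely, which is the case a weak dependency is meant to keep available.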

mattdm · July 10, 2023, 7:01pm

I think some of this goes to “What data will be collected, exactly?”, or maybe just to the term “telemetry” and what that implies to people.

This concern makes a lot of sense to me if the telemetry in question is, like, a recorded user session, or mouse hot-spot tracking (very useful in UX studies!). But, on the other hand, it makes no sense for things like “hardware in use” or “applications installed” — that data is just … there already, and no amount of checking or unchecking boxes changes that.

fatka · July 10, 2023, 7:15pm

Exactly! Maybe there could be a granular permission system controlling who gets to access the data. Otherwise anyone has access to it, and the user is unnecessarily giving a broader permission, violating one of the core tenets of security/privacy.

mattdm · July 10, 2023, 7:19pm

But some of this data — on the system, that is — is available to all software on that system, at least as things are now. (We could do a lot more sandboxing, but that’s a whole 'nuther layer of work.) Right now, if some third-party app you install wants to copy all of your hardware info, the installed package set, and /etc/passwd, there’s not really anything to stop it.

dalto · July 10, 2023, 7:25pm

That isn’t a reason to create more of it. Ultimately, local data collection is still collection and that is part of what we are trying to avoid.

aman9das · July 10, 2023, 7:30pm

As a statistics major, I would point out that the law of large numbers is poorly interpreted. A large sample size will give more accurate information about the population average. It would not violate the whole population’s privacy by doing so (no single person is tracked). We don’t want Fedora to start catering to just your feedback and needs, but to the needs of the population. So a large sample size is a positive.

Also, you didn’t back up your claim that privacy and telemetry cannot go hand in hand. That is blatantly false. See Prio (Stanford Applied Crypto Group), for example. It is absolutely possible to discard all identifiable data, e.g. MAC addresses. Useful data will be aggregated.
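For readers unfamiliar with the idea behind systems like Prio, here is a minimal sketch of additive secret sharing, the aggregation technique they build on. It is not the Prio protocol itself (which, among other things, adds proofs that submitted values are well formed); the function names, the modulus, and the sample counts are all assumptions made for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumed; any modulus larger than the largest possible total works


def split_into_shares(value: int, num_servers: int) -> list[int]:
    """Split value into random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_servers - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def aggregate(all_shares: list[list[int]]) -> int:
    """Each server sums the shares it received; the per-server sums are then combined."""
    num_servers = len(all_shares[0])
    per_server_totals = [
        sum(client[i] for client in all_shares) % MODULUS
        for i in range(num_servers)
    ]
    return sum(per_server_totals) % MODULUS


# Hypothetical per-user counts, e.g. how many times an app was launched.
user_counts = [3, 0, 7, 1]
shares = [split_into_shares(count, num_servers=2) for count in user_counts]
total = aggregate(shares)
```

Each server sees only random-looking numbers, yet the combined total equals the sum of the individual counts. The sketch says nothing about network-level identifiers such as IP addresses.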

As for your own privacy needs, the only possible issue is the data being collected on disk before the telemetry toggle is shown. You would be able to turn it off, as you do with Android. No worries there.

And that is okay, the improvements would be liked by the majority (because of collected metrics), and that is what matters. If someone preferred how Fedora 25 was and nothing else, we don’t expect Fedora to revert all the updates since for a single user. Tough luck for them. Change is a good thing over time.

PS: the post may come off as targeted, it’s because I feel the targeted post is self-centred. Apologies. I am not affiliated with Fedora or Red Hat or IBM.

dalto · July 10, 2023, 7:55pm

Not entirely. You can’t collect data from me over the internet without having my IP address; that alone is enough to identify me. If I am understanding it correctly, the Prio system you referenced could potentially protect the number of times I used an application from being exposed through its distributed data aggregation. However, it could not protect access to the information that I am using that application since all the servers need to know the metrics they are collecting even if they don’t need access to the specific numbers.

Second, all that information needs to be collected locally first and that also presents risks of exposure.

jrredho · July 10, 2023, 8:07pm

Surely you’re not using Google/Android as a paragon of self-enabled privacy.

ghoultek · July 10, 2023, 8:12pm

You are misinterpreting my point. I don’t want FRH (Fedora/Red Hat) to have my data and I don’t want them approximating me or my data. For example, there is no accurate way to quantify the size of the Linux community or the number of Linux installs. Website and browser stats, download stats, system update/upgrade stats, the Steam hardware survey, the count of people who are members of the official forum, the Fedora subreddit, the number of people who attend conventions, even when taken all together, still would not provide a number that is even close to reliable. However, many entities will attempt to use the above sources and others to estimate the size of the community and count of installs. With that estimation, opinions, decisions, and justifications will be made. I would like to keep the size of the Linux community and the number of Linux installs, across all distros, obscure. This keeps those opinions, decisions, and justifications in the false, invalid, and unreliable category. I don’t want corporate entities sizing up the community in order to develop a strategy to turn the community into a glorified cash cow. The Linux community is not and does not behave like a market. So any reference to “market share” will be false/unreliable. Let’s leave “market share” to those license and EULA driven products such as Windows 10. If even half of the community could be estimated and reliably treated as a market then the other 50% aren’t far behind. No thank you. This is why I don’t want the law of large numbers applied to data collection on Linux.

This isn’t about catering to just my needs/feedback. There is nothing good going to come from automated data collection schemes on Linux. Win 10 and Android are poster children examples of what happens when data collection is widespread. I don’t want to contribute to some investor or corporate entity getting rich off of me. I do not want to be a proverbial cow to be milked.

I’ve explained this in my post. The list that I put together, I’ve used in a development project for a prior employer. It was very reliable for identifying individual devices no matter where the device was moved to within the company.

I’ve already explained the falsity of this. You are making an assumption. I’ve already been down this path many times while working on a dev team. Structured data provides efficiency to devs but does not guarantee quality improvements. Survey data post release cycles confirmed this.

I have friends and family members who worked for IBM in the 80s, 90s, and early 2000s. IBM isn’t new to me. Neither is for-profit corporate greed. I am well versed in the game of misdirecting and misleading customers for the sake of getting greater access and thus more $$$$$. You can not snow a Snowman. I carry a dual M.S. in Wind and Blizzard-cology, and Precipitology.

Yes, there are concerns. I’ve already explained this. I have no desire to use an Android-like system on my desktop that collects my data. No, I don’t want the collection components on my disk; not even dormant. I don’t want to have to turn it off, only for it to become 10, 20, 30, 50 switches that need to be turned off at some point in the future. See Win10’s Settings app across successive Win10 versions. No thank you.

Lastly, if I were interested in my data being collected I would just:

install the Endless OS distro., or
use Win10

No thank you. Data collection like Android or some other data collection platform, with or without opt-in/out schemes, is not desired or helpful. I’m not interested in participating in such schemes.

I’m a Computer Science/Mathematics major with real-world experience dealing with code, other devs, middle managers, end users who were internal and external customers, and difficult/demanding executives of varying degrees of technical know-how.

I have an idea. How about they put all this data collection into RHEL and see how that works out with their corporate customers.

mattdm · July 10, 2023, 8:46pm

Right — there’s not a lot of unique bits for the number of people in the world^[1], so fingerprinting like this can be really powerful. It’s possible, however, to keep these bits of information apart, so they can’t be correlated (including, as I saw earlier in one of these posts, not submitting them at the same time).
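As a rough check on how few bits are needed to single out one person, the arithmetic can be sketched as follows. The population figure is an approximation assumed for illustration, and `bits_revealed` is a hypothetical helper that treats attributes as independent, which real fingerprinting data often is not.

```python
import math

# Assumption: roughly 8 billion people (approximate 2023 world population).
WORLD_POPULATION = 8_000_000_000

# Number of bits needed to give every person a unique identifier.
bits_needed = math.ceil(math.log2(WORLD_POPULATION))


def bits_revealed(fraction_sharing_value: float) -> float:
    """Bits of identifying information revealed by an attribute value
    shared by the given fraction of the population."""
    return -math.log2(fraction_sharing_value)


# Hypothetical example: a value shared by 1 in 1024 users reveals 10 bits.
example_bits = bits_revealed(1 / 1024)
```

With these numbers, a handful of moderately rare attributes submitted together would already approach the total needed, which is why keeping them apart matters.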

It’s also true (as @dalto says) that the design of the internet means any receiver gets an IP address, which is possibly identifying information and a likely key that could be used for correlation. I personally would be fine with a mechanism where the initial receiving system doesn’t forward any IP address information to the next step, severing everything. However, depending on the sensitivity of the particular data^[2], that might not be reassuring enough. We could, however, use Tor or I2P, or even some sort of peer-to-peer onion-routing scheme devised specifically for this.
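A minimal sketch of the "initial receiver drops the address" idea might look like the following. It is purely illustrative: `IncomingReport` and `strip_for_forwarding` are hypothetical names, and a real deployment would also have to keep the address out of logs and transport metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncomingReport:
    source_ip: str  # visible to the first receiver because of how the internet works
    payload: dict   # the metrics themselves


def strip_for_forwarding(report: IncomingReport) -> dict:
    """Return only what the next stage should see: a copy of the payload, no address."""
    return dict(report.payload)


# 192.0.2.10 is from a range reserved for documentation examples.
report = IncomingReport(source_ip="192.0.2.10", payload={"metric": "example", "value": 1})
forwarded = strip_for_forwarding(report)
```

The next step then has nothing to correlate on except the payload, though the first receiver still has to be trusted not to record the address.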

^[1] 33 bits
^[2] about which reasonable people can reasonably disagree!

ghoultek · July 10, 2023, 9:10pm

Here is a simpler solution. Don’t introduce any data collection schemes into the Fedora distro. Instead put all those data collection components into RHEL and let Red Hat find out how their corporate customers feel about being surveilled.

If I wanted to be surveilled, why would I choose Fedora when Endless OS and Windows 10 exist? They have much better telemetry/data collection than Fedora. There would be no point. If I don’t want to be surveilled on my desktop, then why would I install and use Win 10 or Endless OS? There would be no point. The answers to those two questions point in opposite directions, and neither direction would lead to me wanting Fedora with telemetry/data collection.

There is a reason why M$ does not include all that telemetry crap in their enterprise Win10 edition. M$ doesn’t want 10k lawsuits, a few dozen class action suits, and their corporate volume licensing customers demanding the FTC and EU put the brakes on them. M$ goes after the individual non-corporate, non-government, consumer.

I hear what you are proposing to sanitize and anonymize the data, but you are working yourself into a pretzel just to construct a more and more complex infrastructure for something users are saying they don’t want and don’t want to participate in. If Fedora/Red Hat (FRH) were an amusement park business, the forced introduction of data collection schemes and infrastructure would be like forcing every customer entering the park to get on the giant, scary roller coaster, regardless of height, weight, fear and phobias… and all the while saying “it will be fine… trust us… it’s free for everyone”.

mattdm · July 10, 2023, 9:51pm

I don’t hear a coherent voice. I know many people are deeply opposed in different ways, while others aren’t, and others don’t care. People who are upset are, naturally, the loudest voices. That doesn’t mean what they’re saying isn’t important, but it also isn’t everything.

There is clearly a potential benefit to the project (and therefore to users!) in getting useful data by which we can drive decisions. So, it’s useful to figure out how we could do that while also making something people are comfortable with. That’s a hard problem that doesn’t have easy answers — and it turns out sometimes the right answers are complicated.

ajorg · July 10, 2023, 10:11pm

I’m not especially persuaded that there is, relative to more organic sources of anecdotal data. It’s somewhat harder to use bug reports and social media and forum chatter to make those decisions. And you and your bosses won’t feel as confident that you’ve made the right decisions. But considering that the data will be a bit vague if it’s privacy-preserving, and incomplete if you let anyone opt-out, I’m unconvinced that the Fedora Project will actually make better decisions with this data than they would without it.

There are things about open source that are uncomfortable for corporations, but also fundamental. Some will argue that the expectation of privacy, and the consequent lack of telemetry, isn’t one of those things, but I believe it is.

ghoultek · July 10, 2023, 10:24pm

Let me share commentary from one of my comments on reddit:

“Designers need to understand the user in order to improve interfaces”

Then they should engage with users. The use of automated data collection schemes produces efficiencies for the devs through analysis of structured data (usually SQL queries/reports). However, nuance is lost, and the reasoning that drives the data output from the user is often not captured at all. I’ve explained this in my post.

“How can they delight the customers if they don’t know how they use their system?”

Engage with the users. Many users, newbies especially, are more familiar and comfortable with reddit. Are the devs coming to r/linux or r/fedora or r/linux4noobs or r/linux_gaming or r/linuxhardware on a daily basis engaging the users in their posts and comments directly while identifying themselves as members of the Fedora dev team? The answer looks like a resounding “NO”. Employing automated data collection schemes means the devs are engaging with stats and database reports not end users. One can not properly employ TQM principles by using dry, sterile reports. One has to engage customers directly.

If Fedora were a restaurant, the use of automated data collection schemes would have them making assumptions/decisions about why certain dishes were more popular than others. There would be assumptions like:
“Dish A is ordered the most in the evening so the customers must love it.”

You can’t answer the question of WHY customers order dish A the most with stats alone. The customers could be ordering dish A because dishes B, C, and D taste like crap, or are considered too expensive. Stat data only provides a piece of the picture, not the entire picture. The interpretation of the stat data will not enable the chefs and management to improve the quality of dishes B, C, and D because they are blind as to why they need improvement and don’t even know that they need improvement. Most likely attention and resources would be devoted to dish A while leaving B, C, and D unchanged.

I included a real problem/suggestion/solution in my post. Fedora has an official Youtube channel. Why is there no video on their official YT channel that addresses how to install a new kernel via GUI and command line? Because no one is attempting to look at how users, especially newbies, approach Fedora. No one is asking the question of how do users learn how to use Fedora. You can’t answer that with raw stats or cooked DB reports.

People are already talking themselves into pretzels around how the data will be dis-aggregated via the server side collection infrastructure in order to justify building a system that users are saying they don’t want and don’t want to participate in. To return to the restaurant example, it’s like forcing a beef and pork appetizer platter on customers, making it mandatory that they pay for the platter, but disregarding that some customers are vegan and others don’t eat pork by choice. When customers shun the platter, management turns around and says “We will serve the beef smothered in onions, peppers, and string beans, and deep fry the pork.” The customer doesn’t want the platter and doesn’t want to pay for it.

The stats will only partially benefit the devs and give management numerical means of declaring that they accomplished something all the while not serving the customer. This is a classic case of dev/IT efficiency going up, customer sentiment sinking to new historic lows, and customer resentment remaining high.

“Data collection does not guarantee improving a system, but at least enables it in ways that are not possible without such data.”

There is no substitute for engaging and interacting with customers directly. If done right, they will volunteer plenty of info without sacrificing their privacy. Some might even be willing to act as test subjects. These are commonly known as focus groups or early adopters.

ghoultek · July 10, 2023, 10:42pm

Matthew,

If the dev team and management really want info. I can provide a bunch more info. It won’t be in neat columns and rows in a spreadsheet or CSV file though.

Where are the rest of the council members? Why are they not in this convo. asking questions of the users instead of pushing justifications? You guys want feedback right? You are getting it right now.
