Scrapers and ideas for how to deal with them

I’ve seen others make a similar argument, but there’s opportunity for misunderstanding here so let me clarify for the thread:

The point of the Markov approach(es) is not to out-gun the enemy with superior bandwidth. The point is to spoil some percentage of their gains, such that the additional effort of sorting garbage from non-garbage exceeds the threshold of their interest in our specific content.

Depending on the scraper, this might only require a very small percentage of garbage. This is especially true since it will likely go unnoticed until after a large amount of money and time has been spent training LLMs on garbage.

I liken this approach to the bug sprays and poisons that don’t kill the bugs immediately, one at a time – they allow the bug to take the poison back to the nest where the damage is far greater. Bon Appetit!


yes. :wink:

They are like a poorly implemented web spider that ignores all the work
people have done to make nice web spiders.

I.e., they hit a package repo. They follow every link and every link off
there. They hit an API and follow every link of the content returned.

They seem to favor doing a tree… for example, someone's fork of the
kernel or the like.

Well, I don’t think things are entirely dire. I think we have done a lot
to mitigate things. I just wanted to have this thread in case there were
easy things we could implement before the holidays, as fewer people will
be around to mitigate things then.

But sure we could look into it…

Sorry, I’m afraid I was not clear enough on my point.
Let me put it another way: in order to authenticate, a user has to pass
the challenge first.
Ideally, scrapers should not pass this challenge, making the challenge a
trusted signal of whether or not a request is coming from a scraper.

But the ones that don’t pass a challenge are already blocked by anubis.

These are ones that do pass a challenge. Mostly because it’s someone
else’s browser/computer. They don’t care if that computer/browser has to
expend effort to pass a challenge.

Which means you could achieve your goal by tuning Anubis.

If scrapers are passing the challenge, is Meta Refresh enabled? Would it
make sense to disable Meta Refresh temporarily?
Would it make sense to (temporarily?) increase the difficulty of the challenges?

It’s not.

I have adjusted difficulty, but it’s hard to do so in a way that affects
only the scrapers. For example last week I cranked up the difficulty on
/forks/ and it did stop the scrapers, but then real users trying to use
their forks reported that they couldn’t get in. ;(

Yeah, that makes sense, but (and I am again speculating), they don’t
much care. The scrapers are doing this so they can turn around and sell
AI companies large datasets. In any kind of sane world, if they provided
bad / worthless data, the AI companies would stop buying from them.
Sadly, the AI hunger for data means they will consume whatever they can
find for training data. ;(

But I could be wrong… it could work.

They are already eating their own slop; if we can feed them even more horseshit… You already know this of course… I just enjoy the thought of being a little person in a little place costing some enormous vampire like <insert example “company” here> a staggering amount of money to clean up my utter drivel which their AI feasted upon despite being told to go away multiple times.

Poison the well. Eventually the poison gets noticed and even the desperate won’t drink from it… hopefully.

I also like this idea, but can it be done in a resource effective manner?

Honestly, I’ve never tried but I’ve read about bloggers who have after being pissed off by scrapers.

In theory, it should be “quite” light.

Maybe you suck in a (few) thousand posts/database entries/whatever you want to generate from your existing corpus each hour/day/week/whatever as the Markov chain sauce and let it rip.

Every time an “invisible” link is clicked, the Markov chain swings into action. It smacks out a Markov chain of some content, you pick a few words to be hyperlinks, and obviously they link to yet more Markov chain output which will be generated on the fly. Rinse and repeat. Anything that even starts following this stuff is probably bogus, and probably daft enough to keep following an endless chain of legitimate-looking crap.
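The on-the-fly generation described above can be sketched in a handful of lines of Python. Everything here is illustrative: the function names and the `/trap/` URL scheme are made up, and a real deployment would hook `render_trap_page` up behind the invisible links served to suspected scrapers.

```python
import random

def build_chain(corpus):
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain, length=40, rng=None):
    """Random-walk the chain to produce locally plausible nonsense."""
    rng = rng or random.Random()
    word = rng.choice(list(chain))
    out = [word]
    while len(out) < length:
        # fall back to a random restart when we hit a dead end
        word = rng.choice(chain.get(word) or list(chain))
        out.append(word)
    return out

def render_trap_page(chain, rng=None, n_links=3):
    """Emit an HTML fragment: Markov text with a few words turned into
    links that point deeper into the tarpit, generated on the fly."""
    rng = rng or random.Random()
    words = babble(chain, rng=rng)
    for i in rng.sample(range(len(words)), n_links):
        words[i] = '<a href="/trap/%d">%s</a>' % (rng.randrange(10**9), words[i])
    return "<p>%s</p>" % " ".join(words)
```

Building the chain is a one-time (or hourly/daily) cost; serving a trap page is just dictionary lookups and `random.choice`, which is why the per-request cost stays tiny.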

It looks real, it smells real, it’s utterly confident as it’s been written by thousands of humans, and it passes all the “real” checks, but it’s hallucinated tripe. Maybe sounds familiar - “You’re absolutely correct.”

That feels to me like it’s fairly cheap to produce on demand - I suspect a handful of lines of python would knock it out of the park so crack it out in C/rust/Go/whatever and the CPU load would be negligible. Bandwidth - they’re going to scrape the guts out of everything anyway and won’t stop… so you’re paying for it regardless.

Let the AI leopards feast. There’s an endless supply of meat…

You even get metrics on how effective it is just by watching how often Markov has to produce some output… that alone has to be worth even just giving it a punt. If it’s crap, turn the code off and the real users will never have even known it’s been active.

What about setting limits on queries/link hammering per period of time? Regardless of the user, when suspicious activity occurs those actions could be slowed down. Example: if the server sees 10 queries/links being requested in rapid succession by a ‘user’ – let’s say at a rate of 1 per second – would it not be possible to throttle those suspicious ‘users’? Place added time limits on accessing content every time such an event occurs, with each ‘violation’ increasing the time penalty. If nothing else it should at least place a speed limit on this sort of bad traffic.
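The escalating-penalty idea above could be sketched like this, assuming a simple in-memory table keyed by IP. All the names and thresholds here are invented for illustration; this is not what any real rate limiter in front of Fedora infra does.

```python
import time

class EscalatingThrottle:
    """Per-client rate limit whose time-out doubles on each violation.
    Illustrative sketch only; thresholds are made-up defaults."""

    def __init__(self, max_hits=10, window=10.0, base_penalty=30.0):
        self.max_hits = max_hits        # requests allowed per window
        self.window = window            # sliding window, in seconds
        self.base_penalty = base_penalty
        self.state = {}                 # ip -> (hit_times, violations, banned_until)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits, violations, banned_until = self.state.get(ip, ([], 0, 0.0))
        if now < banned_until:
            return False                # still serving a penalty
        hits = [t for t in hits if now - t < self.window] + [now]
        if len(hits) > self.max_hits:
            violations += 1
            # each violation doubles the time penalty
            banned_until = now + self.base_penalty * (2 ** (violations - 1))
            self.state[ip] = ([], violations, banned_until)
            return False
        self.state[ip] = (hits, violations, banned_until)
        return True
```

Calling `allow(ip)` on every request returns `False` once a client exceeds the window limit, and repeat offenders sit out exponentially longer each time.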

Sure, but it gets back to the scale. Keeping track of 1,000,000 IPs to ‘slow’ becomes difficult over time. We used to block them, but quickly you run into ipsets that are really, really large. ;(
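One way to keep the memory flat no matter how many IPs show up is to cap the table and evict the least-recently-seen entries. A sketch of that idea (not what Fedora infra actually runs; a real deployment might instead use an ipset with a timeout, or a probabilistic structure like a count-min sketch):

```python
from collections import OrderedDict

class BoundedOffenders:
    """Track hit counts for at most `cap` IPs; the least-recently-seen
    entry is evicted first, so memory use is bounded by `cap`."""

    def __init__(self, cap=100_000):
        self.cap = cap
        self.counts = OrderedDict()     # ip -> hit count, in LRU order

    def hit(self, ip):
        count = self.counts.pop(ip, 0) + 1
        self.counts[ip] = count                  # re-insert at most-recent end
        if len(self.counts) > self.cap:
            self.counts.popitem(last=False)      # evict least-recently-seen
        return count
```

The trade-off is that a quiet offender can age out of the table, but that is usually acceptable when the alternative is an unbounded ipset.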

So, today we were hit again and I was able to look deeper instead of trying to react while a bunch of other things were happening. :slight_smile:

I can say in this case it was not I/O, bandwidth, CPU, process limits or anything like that.

The problem ended up being that they were hitting /history/ and /blame/ endpoints on src.fedoraproject.org (and in particular the kernel package). These endpoints mean the backend has to go run a ‘git blame’ and wait for it to come back to reply back to the client. This increased the latency for everyone. I have now blocked /blame/ and /history/ and everything is back to normal (although they are still scraping commits, I don’t care much if it’s not causing a problem and it’s fast, which it is).
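For illustration, a block like that might look something like the following in Apache httpd 2.4 syntax. This is an assumption about the mechanism, not a copy of the actual configuration, and the path pattern is a guess at the pagure URL layout:

```apache
# Deny the expensive git-backed endpoints outright (assumed paths)
<LocationMatch "/(blame|history)/">
    Require all denied
</LocationMatch>
```

Clients hitting those paths get a 403 immediately, so the backend never has to fork a `git blame` for them.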

Hopefully this will make things more stable at least for the holidays.


I personally don’t like the idea of wasting money making life easier for scrapers, given their behaviour.

Do the pages have to be completely public?
You could hide all content behind a login, so that users who are not logged in only see a ‘blank page with a login form’. This isn’t ideal, but at least to me it’s perfectly acceptable given the situation.

I’m sure it wouldn’t be practical, but it might be neat if each FAS account automatically got a WireGuard public key that would allow limited VPN access to Fedora’s network. Then maybe you could crank up the Anubis challenges for those not using the Fedora WireGuard VPN. Presumably the scrapers would not be able to create a FAS account and set all that up to get around Anubis? Or if a few did, you could cancel that account’s WireGuard key?

Hmm.

We’re gonna have a similar problem after the forge migration yes?

Is there a middle ground here where blame and history endpoints are available to logged in FAS account holders instead of just turned off for everyone?

If that’s possible would you have enough access log telemetry to track down FAS holders with maliciously infected clients so we could you know.. have a chat..about impacts.

hmmm…

I’m not sure I’m keen on that for all the infra endpoints. It feels like a terrible idea to require VPN access to the web ui endpoints meant to facilitate user/contributor engagements.

I am open to being persuaded that something like this would be appropriate as a way to better protect a set of API endpoints from abuse as an alternative to implementing lots of different API keys across different services.

But it seems like this sort of thing is off the mark. If the large-impact scraper activity is on the public-facing web-UI endpoints, putting those behind a VPN probably doesn’t serve a good purpose. It’s probably a matter of needing to understand which web-UI endpoints are critical to remain public and which ones should be behind a FAS login in order to put some accountability in place when usage crosses into resource abuse.


It is probably possible, but I could not figure out any easy way to do
it with pagure. There may be a way to do it with forge.

Ideally there would be a cookie anubis could check, or perhaps something
on the forge side could just disable those ‘expensive’ buttons to non
logged in users.

I’m not sure how much legit users actually use them. I don’t know that I
have ever done so… if I am looking for blame I just run it locally on
the checked out repo.

So this https://pagure.io/$PROJECT/blame/… is the problem?

I think someone already blocked it?! :smiley:

At least for me, even if I’m logged in, I get the forbidden warning.

Maybe we should redirect the user to some page explaining what’s going on and why?

yes, that was me.

I guess we could redirect it to something. But really I don’t think very many people use it. I blocked it a week ago and this is the first time anyone has really noticed or commented on it. :wink:

Well, I only tried it because you mentioned it in your previous post. :laughing:

The blame UI I would agree.

The history UI.. I’m not so sure. Thinking of my own..history.. of using the history UI on other forges, it’s been a non-zero set…especially when I’m doing a drive-by on a codebase where I am a user looking to be a casual contributor.. before I fork/clone/branch something.

In fact I already used the history UI at github to weigh in on packager/upstream concerns in an upstream ticket about some license files in the upstream code base that were causing a hang-up in the package submission review in this gig.