Scrapers and ideas for how to deal with them

Hey folks.

So, as many of you know, this year has been full of scrapers hitting our infrastructure heavily and causing problems, wasting resources and taking up a lot of admin time trying to keep them away. :frowning:

A few months ago, we set up anubis ( https://anubis.techaro.lol/ ) in front of many of our sites and it’s really helped quite a lot. It “weighs” each connection and challenges some of them with proof-of-work challenges. This has pretty much stopped the mainstream scrapers that use known cloud resources. Unfortunately, it has broken some users’ workflows, but we are slowly working through that with clients that report problems to us (either allowing things they need, adjusting their client so it doesn’t get challenged, or the like).
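The proof-of-work idea can be sketched roughly like this (an illustrative model only, not anubis’s actual protocol): the client must find a nonce such that hashing it with the challenge yields a digest with a required number of leading zero bits, which is expensive to find but cheap for the server to verify.

```python
import hashlib

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zero bits in the first nonzero byte
        break
    return bits

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce; expected work is ~2**difficulty hashes."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, cheap to check."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

The asymmetry is the whole point: the client burns CPU proportional to 2^difficulty, while the server verifies with one hash.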

There is unfortunately another category of scrapers that’s still causing us woes. I am speculating on some of this, but I think they are set up like this:

  • Produce a ‘free’ browser plugin that claims to be a VPN, or something people think is nice.
  • Get people to download and install this thing, even noting in the terms of service that they agree to let you use their browser for whatever.
  • Gather hundreds of thousands of these.
  • Sell their services to someplace that wants to scrape content.

So, the effect is: many hundreds of thousands of IPs from all sorts of places, and they don’t much care about the anubis challenge since it runs on the users’ browsers/computers and they have a bunch of them.

Most anything I can think of to do to mitigate these also blocks legit users. It’s very hard to tell them apart. :frowning: These mostly hit src.fedoraproject.org, koji.fedoraproject.org and pagure.io.

So, it would be great if we could figure out how to handle this better, especially since the holidays are coming up and fewer people will be around to manually try and block them. ;(

The two outstanding ideas I have right now:

  1. Toss more resources at our end. I already made pagure.io much bigger when it migrated the other day, and its load has so far been much better. I can double the CPUs on pkgs01 (the backend for src.fedoraproject.org) and see if that helps, but it needs a reboot.
  2. If we can figure out some way for anubis to know when a connection is from a logged-in user, at least we could always allow those and slow down unauthenticated users. There is a pagure cookie, but as far as I can tell it’s always there, even if you aren’t logged in. Does anyone know a way to tell if a connection is from a logged-in user?

I’m not sure there’s a good solution here, but if anyone has ideas, I am all ears. :slight_smile:

Thanks for reading.


I don’t know if anubis supports this, but go-away supports https://git.gammaspectra.live/git/go-away/wiki/Challenges#http, which allows the backend to use the user’s session cookie to make a request to https://pagure.io/api/0/-/whoami (or similar) to ascertain whether the user is logged in during the challenge.
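A sketch of that backend check, in pure stdlib Python. The endpoint URL comes from the link above; the POST method and the fail-closed behavior are my assumptions, not anything pagure documents here:

```python
import urllib.request

WHOAMI_URL = "https://pagure.io/api/0/-/whoami"  # endpoint from the post above

def build_whoami_request(session_cookie: str) -> urllib.request.Request:
    """Build the backend request that forwards the user's session cookie."""
    return urllib.request.Request(
        WHOAMI_URL,
        headers={"Cookie": session_cookie},
        method="POST",  # assumption: the whoami endpoint takes a POST
    )

def is_logged_in(session_cookie: str) -> bool:
    """Ask pagure whether this session belongs to a logged-in user."""
    try:
        with urllib.request.urlopen(build_whoami_request(session_cookie),
                                    timeout=3) as resp:
            return resp.status == 200
    except OSError:
        # 401/403 (HTTPError) or network trouble: treat as anonymous
        return False
```

A real deployment would also want caching of the answer per session, so the challenge layer doesn’t hammer pagure itself.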

I like the idea of more infrastructure (better hardware to handle requests) and optimizing websites to be scrape-friendly (more plaintext/static).

Doing client checks feels like DRM (I have to be human-verified by AI to view human content), and any server-side resources for that could be better spent serving content to everyone/thing faster.

AI (scrapers/anything automated) runs on hardware; scale the hosting hardware to out-scale the AI, instead of something like using AI to block AI :stuck_out_tongue:


That sounds like an arms race you need infinite money to win, sadly in this case the scrapers have that infinite money.


For the record, the client checks do not use AI. They are JavaScript proof-of-work challenges.


Kevin Fenzi via Fedora Discussion
notifications@fedoraproject.discoursemail.com writes:

There is unfortunately another category of scrapers that’s still causing us woes. I am speculating on some of this, but I think they are set up like this:

  • Produce a ‘free’ browser plugin that claims to be a VPN, or something people think is nice.
  • Get people to download and install this thing, even noting in the terms of service that they agree to let you use their browser for whatever.
  • Gather hundreds of thousands of these.
  • Sell their services to someplace that wants to scrape content.

One of my ISPs has been frequently blocked. Would it help if I helped you to investigate this?

  1. If we can figure out some way for anubis to know when a connection is from a logged-in user, at least we could always allow those and slow down unauthenticated users. There is a pagure cookie, but as far as I can tell it’s always there, even if you aren’t logged in. Does anyone know a way to tell if a connection is from a logged-in user?

Isn’t this already in place? i.e. in my tests, a logged in user always has an anubis-cookie-verification.

Is it possible to limit interactions per second? And would that help at all? I imagine it can be done with browser-side code.
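For what it’s worth, per-client rate limiting is usually enforced at the server or proxy rather than in browser-side code (a scraper won’t run your JavaScript honestly). A minimal per-IP token-bucket sketch, purely illustrative:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow bursts of up to `burst` requests, refilling `rate` tokens/sec."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)  # per-client token counts
        self.stamp = defaultdict(time.monotonic)  # per-client last-seen time

    def allow(self, client_ip: str) -> bool:
        last = self.stamp[client_ip]  # first access seeds "now"
        now = time.monotonic()
        self.stamp[client_ip] = now
        # refill for the elapsed time, capped at the burst size
        self.tokens[client_ip] = min(
            self.burst, self.tokens[client_ip] + (now - last) * self.rate
        )
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

As noted elsewhere in the thread, though, each scraper IP here only makes a handful of requests, which blunts any per-IP limit.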

Invisible link for normal users, which when clicked takes scrapers into a never-ending maze of constantly generated Markov chain content, each page of which contains more links to endlessly generated Markov chain bollocks. The generator only needs to run when it’s being used, and it’s quite light: 2 or 3 paragraphs of tripe per page, with links splattered in there which trigger more Markov pages.

Doesn’t stop scrapers getting in, but when they do follow an “invisible link” they vanish down a hole, never to be seen again.

Alternatively, if one of these links gets clicked, then it’s probably a scraper so ban the IP for 6 hours.
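The maze idea can be sketched in a few lines. This is a toy illustration (the corpus, paths, and page shape are all made up); seeding the generator from the URL keeps each page stable across requests without storing any state server-side:

```python
import hashlib
import random

# tiny stand-in corpus; a real tarpit would use text resembling the site
CORPUS = ("the build system compiles the source package and uploads the "
          "resulting artifacts to the repository for testing and review").split()

# first-order Markov chain: each word -> list of observed successors
CHAIN = {}
for a, b in zip(CORPUS, CORPUS[1:]):
    CHAIN.setdefault(a, []).append(b)

def maze_page(path: str, words: int = 60, links: int = 3) -> str:
    """Generate a deterministic junk page with links deeper into the maze."""
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    word = rng.choice(CORPUS)
    out = [word]
    for _ in range(words - 1):
        word = rng.choice(CHAIN.get(word, CORPUS))
        out.append(word)
    hrefs = [f"/maze/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = " ".join(f'<a href="{h}">{h}</a>' for h in hrefs)
    return f"<p>{' '.join(out)}</p>\n<p>{anchors}</p>"
```

Each generated link leads to another deterministic page, so a crawler following them descends forever while the server does almost no work per page.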

It may not help substantially, because the requests are coming from all over. Each source is itself limited in what it requests; it is just the sheer number of sources (this is not a lot different from the IoT-device DDoS attacks, which come from lots of sources).

Some AI companies are reportedly using multiple methods to spread the request load across sources. Not only browser extensions, but (short-term) cloud instances, VPNs, Tor, whatever works.

While some of the scrapers respect the robots.txt values to not index, or to crawl slowly, not all do (as only the first few AI companies to scrape the world will win (whatever that means), no one wants to be in fourth place).

I recall that Cloudflare has added “block AI bots” (and also a “pay to crawl”) option. Could using that capability be an option? I believe Akamai (which I believe Red Hat uses (used?) for their own web site), also has something equivalent (probably the other large CDNs, too, as they, themselves, are trying to limit the herds).

One problem with that approach is that more and more users are behind CGNAT, so you end up blocking many innocent users for the actions of the few. The extent of that collateral damage is not easily predictable.

@anothermindbomb FYI I suggested exactly this previously and offered to implement it, but the general opinion was that the juice wasn’t worth the squeeze:

  • A successful implementation has to mimic our content, which means it needs to be heavily customized and routinely re-customized.
  • There will be significant lag between when we implement this and when the scraper notices that they are receiving bogus content.
  • Some scrapers will continue to blindly scrape anyway.
  • Some scrapers will adjust and continue the “spy vs spy” game.
  • This hurdle may even exacerbate the problem. Scrapers who are not forever deterred will know that when they find the next workaround, it’s a race between them and us. So they will hit us even harder when they do hit us.

In short, I still believe that this effort would have some success, and potentially significant success. But it will indeed require significant initial and ongoing effort, and it will not solve the problem entirely.

I’m reminded of WarGames, where the only successful outcome came from not playing Global Thermonuclear War in the first place.

@kevin Can we put most of the http stuff behind authentication? (Excluding our main “front door” websites and end-user docs.)

At least then we would have a single service to scale instead of all of them, and we could isolate / shape the unauthenticated traffic. For integrations with select partners / systems, we could issue “forever” auth tokens. (Though they should eventually implement rotation.)

The main argument against this (aside from implementation effort) seems to be SEO, but do we really care about SEO for these systems?

I mean, we probably want public-facing git repositories on Pagure/distgit/the new forge to be indexable by search engines. I suppose we could require authentication for expensive endpoints on the git forge (like blame) but not sure how feasible that’d be or if it’d help much.

@kevin Could you put some rough numbers on the problem? E.g. for each of our sites that get hit hard, what is your “finger in the wind” estimate for:

  • Legitimate traffic volume per month (GiB in / out)
  • Legitimate traffic peak volume (GiB in / out)
  • Legitimate traffic peak req/sec
  • The “with scrapers” version of all of the above

Also, it’s interesting to me that our bottlenecks seem to be CPU rather than bandwidth, even though these scrapers are presumably only inducing read operations, not writes. This smells like a target for optimization. Happy to deep dive on this if interested.

Finally, if bandwidth really is our main bottleneck and we can’t put up a gate that holds, then either a CDN or some move to distribute the load (like mirrors) may be the only solution. Happy to work on this too.

Can we actually outline the use cases / value? I’m skeptical that we gain contributors from Google searches sending people directly to a specific distgit repo.

I don’t want to do anything to harm the community or our recruiting, but I also think that frequent outages are at least as harmful to its long-term health. Many contributors are already at max capacity with all systems functional, and newbies need to have a good experience to stay involved. So if I was forced to choose between “keep things functional for contributors” and “keep things discoverable to anonymous parties”, I’d choose the former for many reasons.

(Hopefully it goes without saying that the goal is to avoid such a choice, though it increasingly seems to be one we may need to consider.)

Something to consider: the ops burden of fighting these scrapers is significant. It harms the entire project, as all other Infra work grinds to a halt and the people involved become more and more burned out. How long can we continue to deprioritize other work? How much longer can our ops people take this before they snap? There is no direct measure of these things, but the thresholds there can be catastrophic. At some point, it becomes prudent to take drastic measures before the next layer of crisis unfolds.

Hey folks. Thanks for all the replies. Will try and answer everything I
can here…

On Fri, Dec 05, 2025 at 02:47:35AM +0000, Maxwell G via Fedora Discussion wrote:

I don’t know if anubis supports this, but go-away supports https://git.gammaspectra.live/git/go-away/wiki/Challenges#http, which allows the backend to use the user’s session cookie to make a request to https://pagure.io/api/0/-/whoami (or similar) to ascertain whether the user is logged in during the challenge.

That’s cool. I don’t think anubis has this ability… or at least not
yet.

I am currently unclear on how pagure authenticates. It does add a
cookie, but it does that no matter if you are logged in or not, and it
doesn’t look like the cookie contents are different either.

It clearly must have some way to tell…
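One avenue (an assumption on my part, though pagure is a Flask application): Flask session cookies are signed but not encrypted, so a front-end could decode the payload, without being able to forge it, and look for a logged-in-user key. A sketch, assuming a standard Flask session cookie:

```python
import base64
import json
import zlib

def flask_session_payload(cookie_value: str) -> dict:
    """Decode (without verifying) the payload of a Flask session cookie.

    Flask cookies look like "payload.timestamp.signature"; a leading "."
    means the payload is zlib-compressed. The signature can't be forged
    without the secret key, but the payload itself is readable.
    """
    compressed = cookie_value.startswith(".")
    if compressed:
        cookie_value = cookie_value[1:]
    payload = cookie_value.split(".")[0]
    data = base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4))
    if compressed:
        data = zlib.decompress(data)
    return json.loads(data)
```

If pagure stores a user key in there only for authenticated sessions, a challenge layer could key off its presence. Whether it actually does is exactly the open question above.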

On Fri, Dec 05, 2025 at 03:11:00AM +0000, Espionage724 via Fedora Discussion wrote:

I like the idea of more infrastructure (better hardware to handle requests) and optimizing websites to be scrape-friendly (more plaintext/static).

I already did add CPUs to the src.fedoraproject.org backend.

On Fri, Dec 05, 2025 at 03:27:07AM +0000, Adam Kafei via Fedora Discussion wrote:

That sounds like an arms race you need infinite money to win, sadly in this case the scrapers have that infinite money.

Yep. They sure do, but on the other hand they are not all focused on us.
It’s likely only a small part of their army going after us.

On Fri, Dec 05, 2025 at 03:35:46PM +0000, Tulio Magno Quites Machado Filho via Fedora Discussion wrote:

One of my ISPs has been frequently blocked. Would it help if I helped you to investigate this?

Well, of course I am curious, but I am not sure how we can track it down
or what good that information will really do. I mean, it’s likely
customers of the ISP that have installed bot scraping software for
whatever reason. I suppose the ISP could try and get them to not do that,
but it would be a lot of work to contact each user and find out what’s
going on.

  1. If we can figure out some way for anubis to know when a connection is from a logged-in user, at least we could always allow those and slow down unauthenticated users. There is a pagure cookie, but as far as I can tell it’s always there, even if you aren’t logged in. Does anyone know a way to tell if a connection is from a logged-in user?

Isn’t this already in place? i.e. in my tests, a logged in user always has an anubis-cookie-verification.

anubis creates that cookie if you pass a challenge. It has nothing to do
with being logged in or not. :frowning:

It uses that to tell you already passed and it can just let you through
instead of challenging you on every connection.

On Fri, Dec 05, 2025 at 06:30:49PM +0000, Gary Buhrmaster via Fedora Discussion wrote:

It may not help substantially, because the requests are coming from all over. Each source is itself limited in what it requests; it is just the sheer number of sources (this is not a lot different from the IoT-device DDoS attacks, which come from lots of sources).

Some AI companies are reportedly using multiple methods to spread the request load source(s). Not only browser extensions, but (short term) cloud instances, vpns, tor, whatever works.

Yeah. So, for example, 100,000 requests come in, but they are hitting, say,
1000 URLs, so only ~1000 hits over several hours, from all different IPs.

While some of the scrapers respect the robots.txt values to not index, or to crawl slowly, not all do (as only the first few AI companies to scrape the world will win (whatever that means), no one wants to be in fourth place).

I recall that Cloudflare has added “block AI bots” (and also a “pay to crawl”) option. Could using that capability be an option? I believe Akamai (which I believe Red Hat uses (used?) for their own web site), also has something equivalent (probably the other large CDNs, too, as they, themselves, are trying to limit the herds).

We (or at least I) have a strong aversion to using a non-free service
like Cloudflare or Akamai. They could also be quite expensive, unless
they agreed to donate their services.

On Fri, Dec 05, 2025 at 06:20:02PM +0000, Steve Flynn via Fedora Discussion wrote:

Invisible link for normal users, which when clicked takes scrapers into a never-ending maze of constantly generated Markov chain content, each page of which contains more links to endlessly generated Markov chain bollocks. The generator only needs to run when it’s being used, and it’s quite light: 2 or 3 paragraphs of tripe per page, with links splattered in there which trigger more Markov pages.

Yeah, that could be light on CPU and such, but it would not be light on
bandwidth. You still need to return all that junk to them and process
more requests.

Doesn’t stop scrapers getting in, but when they do follow an “invisible link” they vanish down a hole, never to be seen again.

Also, while they would no doubt follow the Markov path, they likely
would also continue to scrape the real URLs; i.e., they have the capacity
to do both.

Alternatively, if one of these links gets clicked, then it’s probably a scraper so ban the IP for 6 hours.

Yeah, it might hit legit folks behind, say, a NAT or gateway tho. ;(

On Fri, Dec 05, 2025 at 06:49:26PM +0000, Michael Winters via Fedora Discussion wrote:

@kevin Can we put most of the http stuff behind authentication? (Excluding our main “front door” websites and end-user docs.)

Sure. But… that has a pile of problems:

  • breaks all the people who use our sites for automation/various
    things, like user mockbuilding against our builds, people testing
    things, etc.
  • goes against core values of the project. We want to be open and share
    everything with the world, not put it behind an auth layer.
  • spammers have already figured out they can make accounts for spamming
    lists; it likely wouldn’t be too long before they made accounts to
    scrape (and then it’s back to whack-a-mole closing them as they make
    them).
  • it may have legal problems, since people would be unable to download
    the source for things that we have distributed.
  • search engines

At least then we would have a single service to scale instead of all of them, and we could isolate / shape the unauthenticated traffic. For integrations with select partners / systems, we could issue “forever” auth tokens. (Though they should eventually implement rotation.)

The main argument against this seems to be SEO, but do we really care about SEO for these systems?

There are a number of other arguments, I think… but yes, we want search
engines to point users to our stuff, lest we disappear.

On Fri, Dec 05, 2025 at 07:21:50PM +0000, Michael Winters via Fedora Discussion wrote:

@kevin Could you put some rough numbers on the problem? E.g. for each of our sites that get hit hard, what is your “finger in the wind” estimate for:

  • Legitimate traffic volume per month (GiB in / out)
  • Legitimate traffic peak volume (GiB in / out)
  • Legitimate traffic peak req/sec
  • The “with scrapers” version of all of the above

I would have to try and gather this; I don’t have it off the top of my
head. It also varies, I think, depending on what they are hitting and
such.

Also, it’s interesting to me that our bottlenecks seem to be CPU rather than bandwidth, even though these scrapers are presumably only inducing read operations, not writes. This smells like a target for optimization. Happy to deep dive on this if interested.

Well, for both src.fedoraproject.org and koji.fedoraproject.org, what
they are doing is hitting things that need DB interaction and/or pulling
information from files (read git repo, display info), or things like
‘generate an xz archive of this tag’ or the like.

Yes, I am sure there’s optimization that could be done, but pagure is
heading toward replacement, so I don’t think there’s a lot of desire to work
on it. Perhaps we could put just the expensive koji stuff behind
something…

But this makes me think of another thing we might do… right now
there’s a 10s check in haproxy for pkgs01 (the src backend). So, this means
if it takes more than 10s for a check, haproxy marks the service down
and people get 503s. We could bump that up… it would mean things
would get really slow under high load, but at least it would return.
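For reference, a sketch of the relevant haproxy knobs (names and values are illustrative, not our actual config):

```
backend pkgs
    # give the backend more time to answer before haproxy gives up
    timeout server 60s
    # allow health checks themselves more time before counting a failure
    timeout check 30s
    # probe every 10s; require 3 consecutive failures before marking down
    server pkgs01 192.0.2.10:80 check inter 10s fall 3 rise 2
```

Raising `fall` is another way to ride out short overload spikes without flapping the backend down.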

Finally, if bandwidth really is our main bottleneck and we can’t put up a gate that holds, then either a CDN or some move to distribute the load (like mirrors) may be the only solution. Happy to work on this too.

I don’t think bandwidth is currently the bottleneck. We aren’t saturating
our links, at least.


Question:
Are we talking about scrapers for API endpoints or are we talking about scrapers for end-user UI endpoints?

I totally agree with the principle, but (as I think it was Ben who reminded us) the Fedora Council adopted the following statement in 2018:

“The Fedora Project wants to advance free and open source software and as a pragmatic matter we recognize that some infrastructure needs may be best served by using closed source or non-free tools today. Therefore the Council is willing to accept closed source or non-free tools in Fedora’s infrastructure where free and open source tools are not viable or not available.”

At this point, not at least investigating such solutions is like bringing a knife to a gun fight (and the AI scrapers are now using howitzers).

anubis creates that cookie if you pass a challenge. It has nothing to do
with being logged in or not. :frowning:

It uses that to tell you already passed and it can just let you through
instead of challenging you on every connection.

Sorry, I’m afraid I was not clear enough on my point.
Let me put it another way: in order to authenticate, a user has to pass
the challenge first.
Ideally, scrapers should not pass this challenge, making it a trusted
signal of whether a request is coming from a scraper or not.

Which means you could achieve your goal by tuning Anubis.

If scrapers are passing the challenge, is Meta Refresh enabled? Would it
make sense to disable Meta Refresh temporarily?
Would it make sense to (temporarily?) increase the difficulty of the challenges?
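On difficulty: each extra leading-zero bit doubles the expected client work, while server-side verification stays a single hash. A back-of-the-envelope sketch (assuming a SHA-256-style proof of work; real anubis numbers may differ):

```python
def expected_hashes(difficulty_bits: int) -> int:
    """Average attempts to find a nonce with that many leading zero bits."""
    return 2 ** difficulty_bits

def expected_seconds(difficulty_bits: int, hashes_per_second: float) -> float:
    """Rough average client solve time; verification stays a single hash."""
    return expected_hashes(difficulty_bits) / hashes_per_second

# e.g., at ~1e5 hashes/sec in browser JS (illustrative number):
#   16 bits -> ~0.7 s per challenge, 20 bits -> ~10 s per challenge
```

The catch in this thread’s scenario: the work is done on victims’ browsers, so raising difficulty mostly taxes the plugin-infected users and legitimate visitors alike.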