Hey folks. Thanks for all the replies. Will try and answer everything I
can here…
On Fri, Dec 05, 2025 at 02:47:35AM +0000, Maxwell G via Fedora Discussion wrote:
> I don’t know if anubis supports this but go-away supports https://git.gammaspectra.live/git/go-away/wiki/Challenges#http which allows the backend to use the user’s session cookie to make a request to https://pagure.io/api/0/-/whoami or similar to ascertain whether the user is logged in during the challenge.
That's cool. I don’t think anubis has this ability… or at least not
yet.
I am currently unclear on how pagure authenticates. It does set a
cookie, but it does that whether or not you are logged in, and the
cookie contents don't appear to differ either.
It clearly must have some way to tell…
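For what it's worth, the decision logic a go-away-style "http" challenge would need is pretty small: forward the visitor's cookie to pagure's whoami endpoint and treat a 200 with a username as "logged in". This is only a sketch; the response shape (`{"username": ...}`) is an assumption about the pagure API, not a verified contract.

```python
# Sketch of a go-away "http" challenge backend check: forward the user's
# session cookie to pagure's whoami endpoint and interpret the answer.
# The response shape assumed here ({"username": ...}) is a guess.

import json

WHOAMI_URL = "https://pagure.io/api/0/-/whoami"  # endpoint from the go-away wiki example

def is_logged_in(status_code: int, body: str) -> bool:
    """A 200 response carrying a username means the cookie maps to a real session."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return bool(payload.get("username"))

# A real challenge handler would do something like (network required):
#   req = urllib.request.Request(WHOAMI_URL, headers={"Cookie": user_cookie})
#   with urllib.request.urlopen(req) as resp:
#       allow = is_logged_in(resp.status, resp.read().decode())
```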
On Fri, Dec 05, 2025 at 03:11:00AM +0000, Espionage724 via Fedora Discussion wrote:
> I like the idea of more infrastructure (better hardware to handle requests) and optimizing websites to be scrape-friendly (more plaintext/static).
I already added CPUs to the src.fedoraproject.org backend.
On Fri, Dec 05, 2025 at 03:27:07AM +0000, Adam Kafei via Fedora Discussion wrote:
> That sounds like an arms race you need infinite money to win, sadly in this case the scrapers have that infinite money.
Yep. They sure do, but on the other hand they are not all focused on us.
It’s likely only a small part of their army going after us.
On Fri, Dec 05, 2025 at 03:35:46PM +0000, Tulio Magno Quites Machado Filho via Fedora Discussion wrote:
> One of my ISPs has been frequently blocked. Would it help if I helped you to investigate this?
Well, of course I am curious, but I am not sure how we can track it down
or what good that information will really do. I mean, it’s likely
customers of the ISP who have installed bot-scraping software for
whatever reason. I suppose the ISP could try to get them to stop,
but it would be a lot of work to contact each user and find out what's
going on.
> > - If we can figure some way for anubis to know when a connection is for a logged in user, at least we could always allow those and slow non authenticated users. There is a pagure cookie, but as far as I can tell it’s always there, even if you aren’t logged in. Does anyone know a way to tell if a connection is for a logged in user?
> Isn’t this already in place? i.e. in my tests, a logged in user always has an anubis-cookie-verification.
anubis creates that cookie when you pass a challenge. It has nothing to do
with being logged in or not.
It uses that cookie to tell you already passed, so it can just let you
through instead of challenging you on every connection.
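To illustrate the pattern (not anubis's actual implementation, which signs JWTs): after a solved challenge the server hands out a signed token, and later requests carrying a valid token skip the challenge. A minimal HMAC sketch, with the secret and client identifier as stand-ins:

```python
# Minimal sketch of a "challenge passed" cookie: the server signs a client
# identifier after a solved challenge, then verifies the signature on later
# requests so it can skip re-challenging. Anubis uses signed JWTs; this
# HMAC version just shows the shape of the idea.

import hmac
import hashlib

SECRET = b"per-deployment server secret"  # assumption: kept server-side

def issue_pass_cookie(client_id: str) -> str:
    """Hand out after a successful challenge."""
    sig = hmac.new(SECRET, client_id.encode(), hashlib.sha256).hexdigest()
    return f"{client_id}.{sig}"

def check_pass_cookie(cookie: str) -> bool:
    """True only if the cookie was signed by us and is unmodified."""
    client_id, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET, client_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```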
On Fri, Dec 05, 2025 at 06:30:49PM +0000, Gary Buhrmaster via Fedora Discussion wrote:
> It may not help substantially, because the requests are coming from all over (each source is itself limited in what it requests; it is just the sheer number of sources (this is not a lot different from the IoT device DDoS attacks, which come from lots of sources)).
> Some AI companies are reportedly using multiple methods to spread the request load source(s). Not only browser extensions, but (short term) cloud instances, vpns, tor, whatever works.
Yeah. So for example, 100,000 requests come in, but they are hitting say
1,000 URLs, from all different IPs, with any single IP only making a few
hits over several hours.
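That pattern suggests per-IP rate limits will never see it, but aggregating hits per URL across all sources does. A rough sketch of that idea, with the thresholds purely illustrative:

```python
# Per-IP rate limits miss distributed scrapers (each source stays quiet),
# but counting hits per URL across *all* sources makes the swarm visible.
# Thresholds here are made-up illustrations, not recommendations.

from collections import defaultdict

def urls_under_distributed_load(requests, min_hits=100, min_unique_ips=50):
    """requests: iterable of (ip, url) pairs.
    Return the URLs that are hit heavily AND from many distinct IPs."""
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, url in requests:
        hits[url] += 1
        ips[url].add(ip)
    return {
        url for url in hits
        if hits[url] >= min_hits and len(ips[url]) >= min_unique_ips
    }
```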
> While some of the scrapers respect the robots.txt values to not index, or crawl slowly, not all do (as only the first few AI companies to scrape the world will win (whatever that means), no one wants to be in fourth place).
> I recall that Cloudflare has added a “block AI bots” (and also a “pay to crawl”) option. Could using that capability be an option? I believe Akamai (which I believe Red Hat uses (used?) for their own web site) also has something equivalent (probably the other large CDNs, too, as they, themselves, are trying to limit the herds).
We (or at least I) have a strong aversion to using a non-free service
like Cloudflare or Akamai. They could also be quite expensive, unless
they agreed to donate their services.
On Fri, Dec 05, 2025 at 06:20:02PM +0000, Steve Flynn via Fedora Discussion wrote:
> Invisible link for normal users, which when clicked takes scrapers into a never-ending maze of constantly generated Markov chain content, each of which contains more links to endlessly generated Markov chain bollocks. Generator only needs to run when it’s being used and it’s quite light - 2 or 3 paragraphs of tripe per page, and scatter links in there which trigger more Markov.
Yeah, that could be light on CPU and such, but it would not be light on
bandwidth. You still need to return all that junk to them and process
more requests.
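For the curious, the generator being described really is tiny: train a word-level chain on some seed text and walk it. A toy sketch, which also shows why the CPU cost is negligible while the bandwidth cost (shipping the output) is not:

```python
# Toy word-level Markov babbler of the kind tarpit pages use: build a
# next-word table from seed text, then walk it to emit endless
# plausible-looking junk. Generation is cheap; returning the output
# to every scraper is the real (bandwidth) cost.

import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, start, n_words, seed=0):
    """Emit up to n_words of Markov output, deterministically for a given seed."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_words - 1):
        nexts = chain.get(out[-1])
        if not nexts:
            break  # dead end: word never had a successor in the corpus
        out.append(rng.choice(nexts))
    return " ".join(out)
```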
> Doesn’t stop scrapers getting in, but when they do follow an “invisible link” they vanish down a hole, never to be seen again.
Also, while they would no doubt follow the Markov path, they would
likely continue to scrape the real URLs as well; i.e., they have the
capacity to do both.
> Alternatively, if one of these links gets clicked, then it’s probably a scraper so ban the IP for 6 hours.
Yeah, but that might hit legit folks behind, say, a NAT or gateway. ;(
On Fri, Dec 05, 2025 at 06:49:26PM +0000, Michael Winters via Fedora Discussion wrote:
> @kevin Can we put most of the http stuff behind authentication? (Excluding our main “front door” websites and end-user docs.)
Sure. But… that has a pile of problems:
- It breaks everyone who uses our sites for automation and various
other things, like users mock-building against our builds, people
testing things, etc.
- It goes against core values of the project. We want to be open and
share everything with the world, not put it behind an auth layer.
- Spammers have already figured out they can make accounts for spamming
lists; it likely wouldn’t be too long before they made accounts to
scrape (and then it’s back to whack-a-mole, closing them as they make
them).
- It may have legal problems, since people would be unable to download
the source for things that we have distributed.
- Search engines could no longer index our content.
> At least then we would have a single service to scale instead of all of them, and we could isolate / shape the unauthenticated traffic. For integrations with select partners / systems, we could issue “forever” auth tokens. (Though they should eventually implement rotation.)
> The main argument against this seems to be SEO, but do we really care about SEO for these systems?
There are a number of other arguments, I think… but yes, we want search
engines to point users to our stuff lest we disappear.
On Fri, Dec 05, 2025 at 07:21:50PM +0000, Michael Winters via Fedora Discussion wrote:
> @kevin Could you put some rough numbers on the problem? E.g. for each of our sites that get hit hard, what is your “finger in the wind” estimate for:
> - Legitimate traffic volume per month (GiB in / out)
> - Legitimate traffic peak volume (GiB in / out)
> - Legitimate traffic peak req/sec
> - The “with scrapers” version of all of the above
I would have to try to gather this; I don’t have it off the top of my
head. It also varies, I think, depending on what they are hitting and
such.
> Also, it’s interesting to me that our bottlenecks seem to be CPU rather than bandwidth, even though these scrapers are presumably only inducing read operations, not writes. This smells like a target for optimization. Happy to deep dive on this if interested.
Well, for both src.fedoraproject.org and koji.fedoraproject.org, what
they are doing is hitting things that need db interaction and/or pull
information from files (read git repo, display info), or things like
‘generate an xz archive of this tag’ or the like.
Yes, I am sure there’s optimization that could be done, but pagure is
headed for replacement, so I don’t think there’s a lot of desire to work
on it. Perhaps we could put just the expensive koji stuff behind
something…
But this makes me think of another thing we might do… right now
there’s a 10s check timeout in haproxy for pkgs01 (the src backend).
This means that if a check takes more than 10s, haproxy marks the
service down and people get 503s. We could bump that up… things would
get really slow under high load, but at least they would return.
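The relevant haproxy knobs would look roughly like this; the backend name, address, and values are illustrative, not our actual config:

```haproxy
backend pkgs-backend
    option httpchk GET /
    # How long a single health check may run before it counts as failed.
    # Raising this (e.g. 10s -> 30s) keeps a slow-but-alive backend from
    # being marked down and answering 503; responses just get slower.
    timeout check 30s
    # inter = how often checks run; fall/rise = checks needed to flip state.
    server pkgs01 192.0.2.10:443 ssl check inter 10s fall 3 rise 2
```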
> Finally, if bandwidth really is our main bottleneck and we can’t put up a gate that holds, then either a CDN or some move to distribute the load (like mirrors) may be the only solution. Happy to work on this too.
I don’t think bandwidth is currently the bottleneck. We aren’t
saturating our links, at least.