I noticed a surge in “crawler” traffic to this site in the last month. Discourse hosting is priced by page-views, and so a huge surge there is… notable. Of course, we also want to be visible and indexed, so some of this traffic is important. But there are a few here that I think we should discuss. Here’s the top 5 for the last 30 days:
|User agent|Pageviews|
|---|---|
|Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)|138843|
|Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.132 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)|46630|
|Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36|44135|
|Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)|39272|
|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)|23561|
|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)|18435|
|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4951.64 Safari/537.36 DiscourseSiteMonitor/2.0|16148|
|Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)|15565|
So, first, what the absolute heck, Facebook? Over a million pageviews for your “external link” crawler? Something’s clearly wrong there. I’m immediately adding that to the list of bots that are restricted to one query every minute, and we’ll see if that helps. If not, I’ll see if the Discourse team can do something a little stronger without blocking them completely and, failing that, do something more drastic.
From 2 to 10, they all make reasonable sense — big companies with search engines. I’m a little dubious about the Amazon / Alexa one, but the number isn’t completely out of line.
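Several of those rows are the same crawler under different user-agent strings (Googlebot shows up three times), so a per-bot tally gives a fairer comparison. Here is a rough Python sketch that sums the rows listed above by crawler name. The list of name tokens in `BOT_PATTERN` is my own assumption for illustration, not something Discourse reports.

```python
import re
from collections import Counter

# (user agent, pageviews over the last 30 days), copied from the rows above.
ROWS = [
    ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)", 138843),
    ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.132 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", 46630),
    ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36", 44135),
    ("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", 39272),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)", 23561),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)", 18435),
    ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4951.64 Safari/537.36 DiscourseSiteMonitor/2.0", 16148),
    ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", 15565),
]

# Crawler name tokens to look for (an assumed list, chosen to match the rows above).
BOT_PATTERN = re.compile(
    r"(GPTBot|Googlebot|bingbot|Amazonbot|Applebot|DiscourseSiteMonitor)"
)


def pageviews_by_bot(rows):
    """Sum pageviews per crawler, keyed by the name token found in the user agent."""
    totals = Counter()
    for user_agent, views in rows:
        match = BOT_PATTERN.search(user_agent)
        name = match.group(1) if match else "unknown"
        totals[name] += views
    return totals


totals = pageviews_by_bot(ROWS)
for name, views in totals.most_common():
    print(f"{name:<22}{views:>8}")
```

Summed this way, Googlebot comes to 101467 pageviews across its three user agents, which is still below GPTBot's 138843 on its own.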
I want to talk about the second one, though. That’s gathering our conversations for training OpenAI’s ChatGPT and whatever other tools. If the amount of traffic goes up from where it is, I think we need to put it in the slow-down list. But, I think there is a more fundamental question:
Should we decline to have our posts here ingested by this bot?
- Allow: I think Large Language Model AI is generally good and will help people.
- Allow: Hey, we’re about being open, and shouldn’t limit access no matter the use.
- Allow: I’m skeptical but I think there’s an overall benefit to Fedora and Fedora users.
- Allow: Something else (I’ll explain)
- Deny: I am skeptical of LLM AI overall and don’t think we should feed it.
- Deny: I am interested in the potential but don’t want this going into a proprietary model.
- Deny: Something else (I’ll explain)
As always, this is a “straw poll”, gathering opinion, rather than a binding vote. Thanks, everyone!
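For anyone wondering what "deny" or "slow down" would look like mechanically: OpenAI's page for GPTBot (the URL in its user agent) says the bot honors `robots.txt`, so declining comes down to a `Disallow` rule for that agent. Below is a minimal sketch using Python's standard `urllib.robotparser` to check how such rules are read. The robots.txt content is illustrative only and is not what this site actually serves; `ExampleSlowBot` and the forum URL are made-up names, and the 60-second delay simply mirrors the "one query every minute" idea. Note that `Crawl-delay` is a non-standard directive and not every crawler respects it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not this site's real file).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ExampleSlowBot
Crawl-delay: 60

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url="https://forum.example.org/t/some-topic/123"):
    """Return True if the rules let `agent` fetch `url`.

    Pass the crawler's product token (e.g. "GPTBot"), not the full
    user-agent string: robotparser matches on the part before the first "/".
    """
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "ExampleSlowBot", "Googlebot"):
    print(agent, may_fetch(agent), parser.crawl_delay(agent))
```

In this sketch GPTBot is refused everywhere, the hypothetical slow bot is allowed but asked to wait 60 seconds between requests, and every other crawler is unaffected.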
(I mean, too late for what they’ve already done, I suppose, but still?)