What should we do about OpenAI's crawler bot?

I noticed a surge in “crawler” traffic to this site in the last month. Discourse hosting is priced by page-views, and so a huge surge there is… notable. Of course, we also want to be visible and indexed, so some of this traffic is important. But there are a few here that I think we should discuss. Here are the top user agents for the last 30 days:

| User Agent | Pageviews |
| --- | --- |
| facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | 1033216 |
| Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot) | 138843 |
| Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.132 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 46630 |
| Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36 | 44135 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 39272 |
| Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | 23561 |
| Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) | 18435 |
| Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4951.64 Safari/537.36 DiscourseSiteMonitor/2.0 | 16148 |
| Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 15565 |
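
Several rows in the table above are the same crawler under different user-agent strings, so a per-crawler total can be easier to read. Here is a minimal Python sketch of that roll-up, using the counts from the table. The `FAMILIES` markers and the `family_of` / `totals_by_family` helpers are illustrative names, not anything Discourse provides, and each user agent is shortened to its identifying token.

```python
from collections import Counter

# (identifying user-agent token, pageviews) pairs copied from the table above.
ROWS = [
    ("facebookexternalhit/1.1", 1033216),
    ("GPTBot/1.0", 138843),
    ("Googlebot/2.1", 46630),
    ("bingbot/2.0", 44135),
    ("Googlebot/2.1", 39272),
    ("Amazonbot/0.1", 23561),
    ("Applebot/0.1", 18435),
    ("DiscourseSiteMonitor/2.0", 16148),
    ("Googlebot/2.1", 15565),
]

# Hypothetical family markers, matched case-insensitively as substrings.
FAMILIES = (
    "facebookexternalhit", "gptbot", "googlebot", "bingbot",
    "amazonbot", "applebot", "discoursesitemonitor",
)


def family_of(user_agent: str) -> str:
    """Return the first family marker found in the user agent, else 'other'."""
    ua = user_agent.lower()
    for marker in FAMILIES:
        if marker in ua:
            return marker
    return "other"


def totals_by_family(rows):
    """Sum pageviews per crawler family."""
    totals = Counter()
    for user_agent, pageviews in rows:
        totals[family_of(user_agent)] += pageviews
    return totals


if __name__ == "__main__":
    for family, views in totals_by_family(ROWS).most_common():
        print(f"{family:22} {views:>9}")
```

Summed this way, the three Googlebot rows come to 101467 pageviews, which is still below GPTBot's 138843.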

So, first, what the absolute heck, Facebook? Over a million pageviews for your “external link” crawler? Something’s clearly wrong there. I’m immediately adding that to the list of bots that are restricted to one query every minute, and we’ll see if that helps. If not, I’ll see if the Discourse team can do something a little stronger without blocking them completely, and failing that, something more drastic.
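
For anyone curious what "one query every minute" means in practice, here is a rough Python sketch of a per-crawler throttle. This is only an illustration of the idea, not how Discourse implements its slow-down list; the `CrawlerThrottle` class, its parameters, and the substring matching are all assumptions made for the example.

```python
import time


class CrawlerThrottle:
    """Allow each listed crawler at most one request per `interval` seconds."""

    def __init__(self, slowed_agents, interval=60.0, clock=time.monotonic):
        # Markers are matched as case-insensitive substrings of the user agent.
        self.slowed_agents = [agent.lower() for agent in slowed_agents]
        self.interval = interval
        self.clock = clock  # injectable so the behaviour can be tested
        self._last_allowed = {}  # marker -> time of the last allowed request

    def allow(self, user_agent: str) -> bool:
        ua = user_agent.lower()
        marker = next((a for a in self.slowed_agents if a in ua), None)
        if marker is None:
            return True  # not on the slow-down list, never throttled here
        now = self.clock()
        last = self._last_allowed.get(marker)
        if last is not None and now - last < self.interval:
            return False  # too soon; a server would typically answer 429
        self._last_allowed[marker] = now
        return True


if __name__ == "__main__":
    throttle = CrawlerThrottle(["facebookexternalhit"])
    ua = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
    print(throttle.allow(ua))  # first request is allowed
    print(throttle.allow(ua))  # an immediate second request is refused
```

Note that a refused request does not reset the timer, so a crawler that keeps hammering still gets one page through per interval.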

From 2 to 10, they all make reasonable sense — big companies with search engines. I’m a little dubious about the Amazon / Alexa one, but the number isn’t completely out of line.

I want to talk about the second one, though. That’s gathering our conversations for training OpenAI’s ChatGPT and whatever other tools. If the amount of traffic goes up from where it is, I think we need to put it in the slow-down list. But, I think there is a more fundamental question:

Should we decline to have our posts here ingested by this bot?[1]

  • Allow: I think Large Language Model AI is generally good and will help people.
  • Allow: Hey, we’re about being open, and shouldn’t limit access no matter the use
  • Allow: I’m skeptical but I think there’s an overall benefit to Fedora and Fedora users
  • Allow: Something else (I’ll explain)
  • Deny: I am skeptical of LLM AI overall and don’t think we should feed it.
  • Deny: I am interested in the potential but don’t want this going into a proprietary model.
  • Deny: Something else (I’ll explain)
0 voters

As always, this is a “straw poll”, gathering opinion, rather than a binding vote. Thanks, everyone!
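
If we do decide to decline, the usual mechanism is a `robots.txt` rule for the `GPTBot` token, which only works to the extent that the crawler chooses to honor it. Here is a small Python sketch, using the standard library's `urllib.robotparser`, that checks what such a rule would permit. The `ROBOTS_TXT` content, the `can_crawl` helper, and the example URL are assumptions for illustration, not this site's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: deny GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL, not a real topic on this site.
EXAMPLE_URL = "https://example.org/t/some-topic/1"


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, EXAMPLE_URL))
```

The check is done with the short crawler token rather than the full user-agent string, since that is what `robots.txt` rules are written against.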


  1. I mean, too late for what they’ve already done, I suppose, but still? ↩︎

1 Like

To the extent that we generate “content” here, we may want to think about a license for that content. And that will have implications for whose access we can justify limiting or blocking.

2 Likes

I voted Deny.

I think I am for being open with the data. But I also think that FOSS projects should not be paying for a third party to use this data in a proprietary product.

If there is a way to make OpenAI pay for the traffic it generates - I’d reconsider.

4 Likes

I voted deny too, just on principle. The intellectual property implications of all this data hoovering are problematic - and while the hoovering is apparently legal right now even in the EU, I think it’s a case of the law being unprepared for LLMs.

(Obviously I am not speaking on behalf of my employer here)

3 Likes

I’d be okay with this were it an open source LLM (especially since, with Ask Fedora’s questions and answers, it could be trained to become a good tech support bot), but with OpenAI and their predatory behavior it’s an easy no for me.

It’s kind of in the fine print, but… Terms of Service - Fedora Discussion

1 Like

Which is: CC-BY-SA by default, which asks people using our writing to provide credit. Personally, I think there’s a good argument to be made that LLM AI is a form of washing off the requirements of the license.

3 Likes

I’d say: “AI is generally good and could help people”. (For AI to be useful, some wariness is also required on the user’s side.)

Edited/added: LLMs are collecting a lot of text from all over the net, and we all know there’s a lot of wrong or useless information there. If they collect something from here, the quality of the answers they generate will rise a bit.

P.S. I hope I replaced all of potentially offending words with acceptable ones.

1 Like

Facebook is constantly being hit by hack/bot attacks, especially on company sites, and they keep pushing stuff. I think Facebook could just go in the delete section permanently. Are there even people using it, like, really using it?

And by attacks I mean especially botting on the IT and hosting side, and lately on booking, shopping, and tourism sites too. Just close things down so that only registered users can see them.

With more than a little irony here: your original text was flagged by the experimental Discourse AI classifier as “potentially toxic”. That flag is only supposed to be advisory to mods, and it had just 34% confidence, which is way below the threshold, so I have a bug to report to Discourse about that.

I don’t think there was really a problem with your original language, at least not in my judgement. But, actually, your edit is clearer and really just as strong… although the humor takes a hit.

1 Like

I try hard to follow the CoC and not to disrupt the friendly atmosphere. So my original language is never disclosed. :rofl: I revise my posts at least once before pressing Reply.

1 Like

Update: now claudebot is hitting us really hard. Updating to add that to the blocklist.

1 Like