social.kernel.org

Conversation

Evan Prodromou

I was reading this article about LLMs making bad citations. I found it pretty interesting, so I decided to try to replicate it with ChatGPT.

https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @evan@cosocial.ca

I tried it with a document I wrote, FEP 5711. It's an enhancement proposal for ActivityPub, adding some inverse relationships for important properties.

https://codeberg.org/fediverse/fep/src/branch/main/fep/5711/fep-5711.md

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @evan@cosocial.ca

Anyway, I took a paragraph out of the document and asked ChatGPT to identify the URL, publisher, publication date, and title. It failed. You can see the transcript here:

https://chatgpt.com/share/68573fa9-b340-800f-b9b4-7b74fdf0bf46

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @evan@cosocial.ca

I was surprised to see that it really had no visibility into the FEPs. After a while, I realized that codeberg.org, the hosting service for FEPs, has ChatGPT blocked.

https://codeberg.org/robots.txt
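
For readers who want to check this kind of block themselves, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are a made-up sample, not Codeberg's actual robots.txt. `is_allowed` is a hypothetical helper, and the sketch assumes the crawler identifies itself as `GPTBot`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample rules for illustration only; NOT the real
# contents of https://codeberg.org/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/docs/page.md"  # placeholder URL
    print("GPTBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", page))
    print("OtherBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "OtherBot", page))
```

To test a live site instead, `RobotFileParser` also offers `set_url()` and `read()` to fetch the file. Keep in mind that robots.txt is only advisory: it tells you what a well-behaved crawler should skip, not what a crawler actually did.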

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @evan@cosocial.ca

I understand the goal; many people don't want their code to be used by LLM code generators. But it also means that this document repository isn't visible for people who use LLMs like a search engine. Numbers vary, but afaict somewhere around 10% of people use LLMs as their primary search engine, and about 50% of people use LLMs some of the time for search.

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @evan@cosocial.ca

I guess there's maybe some justification like, those people are bad, and they don't deserve nice things like Fediverse Enhancement Proposals? Or, maybe, we have to take a principled stand against LLMs by not providing any training data for them? Such that, perhaps, people disappointed by not having good results in LLMs will return to using traditional search engines like Google or Bing, which are more ethical because reasons.

Mirko Adam

elshid@librem.one

6 months ago

Reply to @evan@cosocial.ca

@evan No, the reason is simply the server load. AI crawlers have crawled @Codeberg so excessively that its main service, hosting a Git server, was often very slow.

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @elshid@librem.one

@elshid @Codeberg that's interesting! Also, really self-destructive.

K. Ryabitsev-Prime 🍁

monsieuricon

6 months ago

Reply to @evan@cosocial.ca

@evan @elshid @Codeberg AI crawlers descend on online resources like locusts and consume them until they are dead. When they recover, they do it again.

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to

@tobias @elshid @Codeberg well, that's hardcore, but the problem is that when people can't find your information on their chosen search engine, they don't go find another search engine. They don't even know they're missing your info.

Stefan Monnier

monnier@oldbytes.space

6 months ago

Reply to @evan@cosocial.ca

@evan AFAIK, LLMs are very poor (they give bad results inefficiently) for the kind of search you tried. Even if it had scraped Codeberg (which it may have, actually), the result might not be much better. We know how to generate good results for your query in a simpler and more efficient way. Knowing how public servers suffer constant DDoS attacks nowadays, apparently linked to the LLM madness, why would you encourage its misuse?

Evan Prodromou

evan@cosocial.ca

6 months ago

Reply to @monnier@oldbytes.space

@monnier because I'm more concerned with conveying information to the people using LLMs than I am with convincing them not to use LLMs.

Codeberg

Codeberg@social.anoxinon.de

5 months ago

Reply to @evan@cosocial.ca

@evan The decision to block these scrapers actually originates from a Codeberg e.V. internal discussion. The argument for "visibility" in the AI language models was considered. However, the cost to Codeberg is immense (also see https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html), and we are also not fans of the Big Tech AI companies.

Codeberg is a development platform. If your content needs to be known to all scrapers, you are free to publish valuable resources to a normal website additionally.

Evan Prodromou

evan@cosocial.ca

5 months ago

Reply to @Codeberg@social.anoxinon.de

@Codeberg Good idea! I set up a mirror at https://fep.swf.pub/. I probably need to automate the sync and make sure to point back to Codeberg for contributions.
