Conversation

I was reading this article about LLMs making bad citations. I found it pretty interesting, so I decided to try to replicate it with ChatGPT.

https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

I tried it with a document I wrote, FEP 5711. It's an enhancement proposal for ActivityPub, adding some inverse relationships for important properties.

https://codeberg.org/fediverse/fep/src/branch/main/fep/5711/fep-5711.md

Anyway, I took a paragraph out of the document and asked ChatGPT to identify the URL, publisher, publication date, and title. It failed. You can see the transcript here:

https://chatgpt.com/share/68573fa9-b340-800f-b9b4-7b74fdf0bf46

I was surprised to see that it really had no visibility into the FEPs. After a while, I realized that codeberg.org, the hosting service for FEPs, has ChatGPT blocked.

https://codeberg.org/robots.txt
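
For anyone who wants to check this kind of block themselves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text in it is a made-up sample, not a copy of Codeberg's file, and `GPTBot` is used as the crawler name on the assumption that the site lists OpenAI's crawler under that token; `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# This is NOT a copy of the real file at codeberg.org/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://codeberg.org/fediverse/fep/src/branch/main/fep/5711/fep-5711.md"

for agent in ("GPTBot", "SomeOtherBot"):
    print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, URL))
```

To test a live site instead of the sample, the same parser can load a real file with `set_url(...)` followed by `read()`. Note that robots.txt only says what a site asks crawlers to do; it does not show what a crawler actually did.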

I understand the goal; many people don't want their code to be used by LLM code generators. But it also means that this document repository isn't visible to people who use LLMs like a search engine. Numbers vary, but afaict somewhere around 10% of people use LLMs as their primary search engine, and about 50% of people use LLMs some of the time for search.

I guess there's maybe some justification like, those people are bad, and they don't deserve nice things like Fediverse Enhancement Proposals? Or, maybe, we have to take a principled stand against LLMs by not providing any training data for them? Such that, perhaps, people disappointed by not having good results in LLMs will return to using traditional search engines like Google or Bing, which are more ethical because reasons.

@evan No, the reason is simply the server load. AI crawlers have crawled @Codeberg so excessively that its main service, hosting Git repositories, was often very slow.

@elshid @Codeberg that's interesting! Also, really self-destructive.

@evan @elshid @Codeberg AI crawlers descend on online resources like locusts and consume them until they are dead. When the resources recover, the crawlers do it again.

@tobias @elshid @Codeberg well, that's hardcore, but the problem is that when people can't find your information on their chosen search engine, they don't go find another search engine. They don't even know they're missing your info.

@evan AFAIK, LLMs are very poor (they give bad results inefficiently) for the kind of search you tried. Even if it had scraped Codeberg (which it may have, actually), the result might not be much better. We know how to generate good results for your query in a simpler and more efficient way. Knowing how public servers suffer constant DDoS attacks nowadays, apparently linked to the LLM madness, why would you encourage its misuse?

@monnier because I'm more concerned with conveying information to the people using LLMs than I am with convincing them not to use LLMs.

@evan The decision to block these scrapers actually originates from an internal Codeberg e.V. discussion. The argument for "visibility" in AI language models was considered. However, the cost to Codeberg is immense (also see https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html), and we are also not fans of the Big Tech AI companies.

Codeberg is a development platform. If your content needs to be known to all scrapers, you are free to additionally publish valuable resources on a normal website.

@Codeberg Good idea! I set up a mirror at https://fep.swf.pub/. I probably need to automate the sync and make sure it points back to Codeberg for contributions.
