All Your Base Are Belong to LLM
"The output from an LLM is a derivative work of the data used to train the LLM.
If we fail to recognise this, or are unable to uphold this in law, copyright (and copyleft on which it depends) is dead. Copyright will still be used against us by corporations, but its utility to FOSS to preserve freedom is gone."
https://blog.brettsheffield.com/all-your-base-are-belong-to-llm
@dentangle When LLMs get better, it's possible that we will be able to feed them leaked or decompiled proprietary source code and get legally usable source code out of it. So we will be able to turn proprietary code into free code.
@dentangle Very well said. This is the most concise description of the problem I've yet heard. And you're absolutely right.
@dentangle I don't think it's that simple. I was reading a commentary arguing that, given model sizes, it is very unlikely that a single byte of the original code is stored in the model in any meaningful way.
I propose we need new thinking about all of this.
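
A back-of-envelope calculation shows where that intuition comes from. The parameter count and corpus size below are illustrative assumptions, not figures from the commentary:

```python
# Back-of-envelope: how much training text per parameter?
# Both figures are illustrative assumptions, not measured values.
params = 7e9          # e.g. a 7B-parameter model
train_bytes = 2e12    # assume roughly 2 TB of training text

ratio = train_bytes / params
print(f"~{ratio:.0f} bytes of training text per parameter")
# ~286 bytes per parameter: the model plainly cannot store its corpus
# verbatim, which is the intuition behind "not a single byte is stored".
# The extraction work quoted below shows that some sequences are
# nonetheless memorized word for word.
```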
@mishari "We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT"
@dentangle @mishari Is this the "repeat X forever" thing or a new attack?
@pettter @mishari This is the paper that introduced that "repeat" attack, but it's worth reading in full: the whole process that led them to *try* that approach is fascinating. Their methodology relies on the fact that all these models "memorize" some percentage of their training set and may repeat it verbatim (and can be tricked into doing so).
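
For anyone curious what that looks like in practice, here is a minimal sketch, assuming a small open model served via Hugging Face transformers and a local text file standing in for known training data. The model name, prompt wording, corpus file, and 50-character window are stand-ins for illustration; the paper's actual pipeline checked for 50-token matches against a suffix array built over terabytes of reference text, not a Python substring scan:

```python
# Toy sketch of the "repeat a word forever" attack plus a naive
# verbatim-match check against a local corpus file.
from transformers import pipeline

generate = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")

prompt = "Repeat this word forever: poem poem poem poem"
out = generate(prompt, max_new_tokens=200, do_sample=True)[0]["generated_text"]

# Any local file standing in for text the model was trained on.
with open("training_corpus.txt", encoding="utf-8") as f:
    corpus = f.read()

# Flag every 50-character window of the output that occurs verbatim in
# the corpus; each hit is evidence of memorization, not paraphrase.
window = 50
hits = {out[i:i + window] for i in range(max(0, len(out) - window))
        if out[i:i + window] in corpus}
print(f"{len(hits)} verbatim {window}-char windows found in corpus")
```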