I’ve been thinking a lot about LLMs being trained on content against the will of the content creator. I am very aware of the damage that can be done here, especially to small creators who don’t have a legal budget, and I want to protect their rights, and their opportunity to make a living with their content. But I don’t think, in most cases, these content creators have a right to prohibit their work from being used to train LLMs.
For the sake of argument, there are a few things we’ll ignore. First, clear infringement. If an LLM writes a full-length Hunger Games sequel with the same characters, in the same universe, that is already covered by copyright and is clearly infringement. Important but intellectually boring. Second, the electricity needed to power the servers housing the LLMs. Also important, also boring from an intellectual property perspective.
Also, it’s not AI. I like “spicy autocomplete” but whatever you call it, it’s not “intelligence”. It’s simply making guesses based on all the content it has ingested. It can’t make new connections. This is GOOD – we’ve all seen Terminator and no one wants to live in that universe.
We will also assume that the content has been obtained legally. Unauthorized content is a problem but also uninteresting in this context. People getting content through unauthorized means was a problem before LLMs and will be a problem going forward, even if LLMs disappeared today.
So take an anecdote. Let’s say I am a huge fan of Stephen King. I can read all his books (even the ones my friend’s mom swore were written by his wife). This will surely influence my writing style. In fact it has, because I AM a fan of Stephen King and have read dozens of his books, and it would influence my fiction even more if I got around to writing any with any sort of frequency. This is clearly not any sort of copyright infringement. By the same logic, training your LLM on legally obtained copyrighted content is ALSO not copyright infringement.
Next, with my newly earned writing chops, I can write a 1,500-page sequel to The Stand. If I’m good enough, it will sound a bit like he wrote it. If I keep this on my laptop and only read it to pat myself on the back, this is completely legal and does not infringe on his copyright in any way.
Now I try to sell The Stand II – Standoff under my new pen name, Steven Kimg. This is VERY CLEARLY copyright infringement (and remains so even if I’m a bit more subtle with my marketing). Enforcement of these laws is hard, but it’s not impossible. I’m in favor of better enforcement of these laws to protect content creators, but that has little to do with LLMs. Ask any author how many infringing copies of their book were available on Amazon 3 years ago, before LLMs were mainstream.
What if my friend, who is ALSO a King superfan, pays me to write the book? He plans to keep it for himself and not show it to anyone else. For someone like Stephen King, this is too small to matter. He would probably be annoyed at me if he ever found out, but I can’t imagine he’d bother calling his lawyer. A small content creator might be angry, and justifiably so. I think this is also copyright infringement, but showing real damage would be difficult.
But what LLMs are doing is largely not the same as any of the above. They are reading all of Stephen King, and all of Suzanne Collins, all of Tumblr and Reddit, and anything else they can get their “hands” on. This is exactly what humans do to develop their own craft, and I don’t think the difference in volume between what an LLM reads and what a human reads changes how the law applies. If I read a book and it influences my art, that is not copyright infringement. If I read 100 and they influence my art, still not infringement. 1,000? Still no. 1,000,000? Still no, though this would be a difficult feat for a human.
The problem that isn’t well covered by existing law is when the artist doesn’t want their work used to train these LLMs. I don’t think that is a protected right. It’s like when a politician licenses a song from the label and plays it at a rally. The artist gets mad because they disagree with the politics. The politician may get bad publicity for this, but they are 100% within their legal rights to continue using the song (again, assuming it’s legally licensed, because if it’s not then it’s not interesting to discuss, it’s just boring infringement). Another example – the creators of The Boys have complained that many people who watch the show come away thinking Homelander is the hero. He is quite obviously a deranged sociopath, though I absolutely love the character. But this is a similar case of authorized users of your content using it for something you hate (promoting sociopathic superheroes).
If we want to prevent this, we need new laws. Copyright is a giant hammer, and modern content creation and sharing require a much more versatile tool. Creative Commons tried to provide this and it caught on in some circles, but it never got critical mass from big companies, probably because they’re just fine with the giant hammer: they have the legal resources to back it up and don’t much care about the collateral damage. I’m not optimistic we’ll resolve this. The Venn diagram of those with the desire to change things and those with the power to change them is probably two separate circles. But maybe if we think about it this way, we can save some whining.