Some Thoughts on AI Training Data

I recently ran across an article on Bruce Schneier’s blog proposing an “AI Dividend”, similar to Alaska’s Permanent Fund, that would distribute profits from generative AI to the people whose data was used to train it. The article argues that this would be a fair way to compensate people for their data while helping to ensure that the benefits of AI are shared more widely. A primary concern is keeping training data from being locked away, both by AI companies in their proprietary models and by individuals who don’t want a corporate entity making money off their content without some kind of compensation.

Logistical Challenges

The proposal is interesting, but ultimately doomed, because the logistics behind a “pay everyone some money per word a given AI generates” solution are untenable:

  • The proposal only applies to “words”. AI has moved well past words, so it’s unclear how the proposal would apply to content that isn’t word-based, such as:
    • Source code
    • Images
    • Music
    • Video
    • Scientific output such as new chemical formulations or DNA processing
    • …and so on
  • Is eligibility intended to be “anyone >= 18 years old”? If so, plenty of minors create a lot of internet content; shouldn’t they get paid too?
  • Is it “anyone who makes online content of any kind”? If so, how does a government bureaucracy validate whether someone has actually made online content?
  • Is it just “anyone with an email address”? If so, how does someone tie an email address to themselves in a verifiable way? Would we need a national “online presence” database of some kind that ties an online persona to an individual and their address/bank account so they can get paid?
  • If it’s only fair to pay people who have put content online for that content, then the converse is also true: it’s NOT fair to pay those who haven’t put content online. So there has to be some way to determine which group any given individual falls into.

These are just some quick initial thoughts. I’m sure that the more one considers this proposal, the more difficult-to-answer questions will come up.

Philosophical Differences

I disagree that people who post content publicly have a right to compensation if someone uses that content for some other purpose. I fall into the camp of “when you made your data public, you lost the right to expect that it won’t be used by others”. In my mind, this is the “cyber” equivalent of a real-world situation: if you are walking down a city street, you no longer have a right to expect privacy.

In my opinion, if you want to be paid for your content, you are free to put it behind a paywall. (I should note that AI model issues with copyright claims are a completely different animal, and I’m unsure of a good solution for those. However, I would argue those issues aren’t really addressed by the “community chest” approach proposed in the article either.)

In discussing this concept with others, I was asked what other proposals I’ve seen to address this issue. Unfortunately, I haven’t seen any. I definitely haven’t seen a “good” solution yet, and I think that is simply because there isn’t one. From my point of view, the right answer to this problem is very simple: public data is public and free. This is captured in the long-circulated hacker ethos “Information wants to be free”. If someone doesn’t want their information to be free, that’s cool: they can always opt not to make it public. It doesn’t even have to be a “paywall”; they could simply put it behind a password that they provide (or deny) to others upon request.

The argument of “I made all this stuff free to anyone on the internet, but now that a company wants to use it, I want to get paid” seems petty and greedy (to me, anyway), so I have little sympathy for it. I do have considerably more sympathy for those who argue “I don’t want EntityX to be able to use my content because I have fundamental issues with their ethics”, but in that scenario the solution still isn’t “have EntityX pay up so people get Y-money per word”. The solution there is to give a person who doesn’t want their stuff used a way to opt out (or opt in). The simplest way to let someone do that is still “put their stuff behind a barrier and only let people they like have access to it”.

An Old Question

Ultimately, this whole thing is just a new iteration of the same dilemma that’s been present since networks first started connecting to each other, and it can be stated simply:
Should the internet be a free exchange of content/ideas, or should it be a bunch of walled gardens — some with bridges to other ones?

If the internet should be free (which I argue it should), then people have to realize and be OK with the fact that entities they may disagree with are free to use it (and any content anyone has added to its public spaces) however those entities wish. That may include making money from doing so.

If someone isn’t OK with that, again, that’s cool: they are free to wall off their garden and, optionally, charge admission.