Push for Spotify-like pay-for-copyrighted-content ChatGPT option


As a paying ChatGPT customer, I wanted to start a discussion in the community, triggered by Gary Marcus's Substack article:

https://garymarcus.substack.com/p/the-desperate-race-to-save-generative/comments?publication_id=888615&post_id=140480488&isFreemail=true&comments=true

I made the following posting there:

Myles Dear
https://substack.com/profile/63442026-myles-dear

Jan 8
https://garymarcus.substack.com/p/the-desperate-race-to-save-generative/comment/46891336
· edited Jan 8 · Liked by Gary Marcus

To be fair, as a software engineer, I mostly use ChatGPT to pull in public-domain knowledge that I lack to accomplish specific tasks. None of the links it comes up with for any of my prompts point to any paywall of any sort. It helps save me gobs of time and makes me more effective and efficient because it’s able to take large amounts of information via web plugins, munge it together, and spit it out in the form I need. I feel I’m getting my money’s worth.

With that said, I do empathize with copyright holders and feel sad that the same tool has crossed those lines. If all copyrighted content was pulled from ChatGPT I wouldn’t shed a tear. Also, if I wanted ChatGPT to access copyrighted data, I wouldn’t mind paying an optional fee for the privilege (think Spotify and how one monthly fee gets distributed to the copyright owners whose content you actually use).

In fact, GitHub Copilot has already been dinged for providing publicly accessible code in its responses, and it's now possible to set it to generate original content only and not regurgitate public code verbatim. Anything is possible, if there's the political will to do so. We should continue to push back on our AI providers.

Three months later, I am just now seeing this, and I'm amazed that no one has commented.

I agree that some mechanism needs to be worked out whereby content creators are compensated in some way for copyrighted works that are used in LLM responses. But how?

In a RAG scenario, this is one thought I have had recently:

I record every question, the documents returned (from the cosine-similarity search), and the LLM response.

Now, let's say that these returned documents are texts from copyrighted works. How do we know whom to compensate, and when? I'm thinking we start by determining which documents are actually used in the model's response. Just because a document is returned by the cosine-similarity search does not necessarily mean the model uses it to arrive at an answer to a question.

In my current implementation, I instruct the model to cite the returned documents it has used in its response. So, theoretically, I could go back into the log and retrieve the exact documents that were used in each response.

In the metadata of each document, I currently have the document title and author. If I include another property, publisher, I then have a way to know which documents were actually used in actual responses and who should be compensated.
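Given that metadata, tallying usage per publisher is a small aggregation over the log. A sketch, assuming each log record carries a `cited_ids` list of document ids the model actually cited (field names are illustrative):

```python
from collections import Counter


def usage_by_publisher(log_records, doc_metadata):
    """Count how often each publisher's documents were actually cited.

    `log_records`: iterable of dicts, each with a "cited_ids" list.
    `doc_metadata`: maps doc id -> dict with "title", "author", "publisher".
    """
    counts = Counter()
    for record in log_records:
        for doc_id in record["cited_ids"]:
            counts[doc_metadata[doc_id]["publisher"]] += 1
    return counts
```

Counting citations rather than retrievals is the point: only documents the model says it used enter the tally.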

But then, how to compensate? If, for example, my profit comes from a markup on tokens, how do I determine what percentage of that markup to earmark for royalties?
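The mechanics of a split are easy to sketch; picking the pool fraction is the genuinely open question. Purely as an illustration of a Spotify-style pro-rata division (the 15% figure below is an arbitrary placeholder, not a recommendation):

```python
def royalty_split(markup_revenue, citation_counts, pool_fraction=0.15):
    """Split a royalty pool pro rata by citation count.

    `pool_fraction` (what share of the token markup goes to royalties)
    is an arbitrary placeholder; choosing it is the unsolved part.
    """
    pool = markup_revenue * pool_fraction
    total = sum(citation_counts.values()) or 1  # avoid division by zero
    return {pub: pool * n / total for pub, n in citation_counts.items()}
```

With this scheme, a publisher whose documents account for two-thirds of citations receives two-thirds of whatever pool is set aside.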

That part I haven't figured out yet. But I think I've at least got a good starting point for a Spotify-like pay-for-copyrighted-content plan for AI-generated content.

I am stupefied that no one else has responded to this issue, as it is probably one of the most important to date for most GPT creators.

At any rate, this is how simple it is to track documents actually used in responses:

For RAG, you need only add something like this to your system prompt:

Please cite the document numbers of the texts you use in your response.
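Once the model cites documents by number, pulling the used documents out of a logged response is a small parsing step. A sketch, assuming citations appear as bracketed numbers like `[1]` or `[Doc 3]` (adjust the pattern to whatever format your prompt actually elicits):

```python
import re


def cited_doc_numbers(response: str) -> list:
    """Extract the document numbers cited in a model response.

    Matches "[1]" and "[Doc 3]" style citations; returns each number
    once, in ascending order.
    """
    return sorted({int(n) for n in re.findall(r"\[(?:Doc\s*)?(\d+)\]", response)})
```

Joining these numbers back to the logged retrieval metadata gives the exact title/author/publisher records to credit for each response.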