Solidarity Protest For the Stack Overflow community

Data-wise, there’s very little difference, but Stack Overflow took a hard stance against AI, only to turn around 180°, seemingly without asking their users. :confused:

And welcome to the community btw! :heart:

1 Like

Absolutely! There is a moral imperative to cite one’s sources. This imperative extends without a doubt to the creators of AI.

When you don’t cite, you are expelled from university. It’s the highest academic crime.

All GPT systems need to start citing where they get all this wondrous knowledge.

I think they could do it at a minimum via semantic search into their training data, but there should be serious effort to make it as accurate as possible.

If they can’t provide links, at least give us enough to google.
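As a rough illustration of the semantic-search idea, here is a minimal sketch that ranks candidate sources by similarity to a generated answer. Everything in it is an assumption for illustration: the tiny `CORPUS`, the `example.com` labels, and the bag-of-words vectors, which stand in for the learned embeddings a real system would use. It says nothing about how any actual GPT system works.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a training corpus: (source label, text) pairs.
CORPUS = [
    ("example.com/post-1", "Use a list comprehension to filter items in a Python list"),
    ("example.com/post-2", "A mutex protects shared state between threads"),
    ("example.com/post-3", "Reverse a string in Python with slicing"),
]


def vectorize(text):
    """Toy bag-of-words vector; a real system would use learned embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cite_sources(answer, corpus=CORPUS, top_k=2):
    """Return labels of the top_k corpus entries most similar to the answer."""
    query = vectorize(answer)
    scored = [(cosine(query, vectorize(text)), source) for source, text in corpus]
    # Drop entries with no overlap at all, then rank best-first.
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(reverse=True)
    return [source for _, source in scored[:top_k]]


print(cite_sources("You can reverse a string in Python using slicing"))
```

Even a crude ranking like this would give readers "enough to google"; the hard part is making it accurate at the scale of a real training set.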

1 Like

The CEO recognized AI as the future when it first came out, but the community was quite upset.

If you make it an iPhone app, it becomes hard to do, but E2E encryption is not a solution, just a minor hurdle for spying on public chat forums.

Public chat should be considered as public as anything else you post to the web. Private Discord servers are the only way around this, but be careful about who you invite.

2 Likes

I believe it is foolish to sabotage one’s own contributions after the data has already been transferred.
This will only harm regular users. Even if OpenAI has not yet used the contributions for AI training, we must assume they have backups from before. Considering that most of this data is used to train AI in coding, it is necessary and beneficial.

#openai #AItraining #coding #future

2 Likes

This comment is absolutely true. Not citing is the same as claiming to be the source of that knowledge, and in most cases the content cited is not the complete version of the original, meaning that knowledge is lost without citation.

We only got to where we are today by standing on the shoulders of giants. If we fail to acknowledge these giants, we undermine the foundation of progress, risking the integrity and evolution of knowledge itself.

2 Likes

Hello all, I can help you :smiling_face_with_three_hearts:

That’s great for those that have been told they need to seek help…

The discussion regarding the OpenAI and Stack Overflow partnership primarily addresses user rights, community moderation, platform centralization, AI’s social impact, and data ownership.

Key points include:

  • Solidarity with Stack Overflow members who oppose the partnership, focusing on fair use and acknowledgment of community contributions in AI development.
  • Advocacy for decentralized platforms like Mastodon to ensure equitable compensation for content creators and to protect personal data from exploitation by profit-driven entities.
  • Concerns about AI replacing human jobs and the necessity of human oversight in moderating AI outputs for accuracy.
  • Proposals for a federated network model where contributors are compensated through revenue generated by their content, highlighting shifts towards secure, encrypted communication platforms as hopeful developments.
  • Arguments for mandatory citation of sources used in AI training to prevent what could be likened to academic plagiarism, emphasizing the ethical need to recognize foundational work.
  • Perspectives on the ineffectiveness of sabotaging user contributions as a form of protest, with a recognition of the ongoing value of such data in enhancing AI training.
  • Questions about the distinctiveness of this scenario from typical AI training processes, with offers of assistance and further engagement from the community.

This encapsulates the broad community reaction, covering both concerns and proposed solutions regarding the evolving interaction between AI technologies and user-generated content platforms.

(Summary of AI Summary by AI)

I also want to demand high transparency from OpenAI regarding learning data and sources.

By the way, it seems like AI has summarized the key points, doesn’t it? Or maybe not…

Yes. That’s right. Every contribution matters. The problem is that we never got compensated for it. The only solution I see is for Stack Overflow to pay everyone 25 cents in proportion to their ratings on SO. $1 for each bronze medal, $5 for silver, $10 for gold. They are using the data of users who contributed freely. I mean, the answers of Jon Skeet alone are worthy of a book. And it’s not just programming; there’s mathematics, physics, and so on, all of them run on free contributions. SO made billions from advertising. Why was there no exclusive chatbot from SO that would rightly credit the author and pay them for it? Were they sleeping? Or did they think AI was just another wind that would settle? It’s a huge wind, man!
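To make the proposed payout concrete, here is a minimal sketch of the arithmetic. The badge rates come from the post; reading "25 cents in proportion to the ratings" as 25 cents per reputation point is an assumption, and the example user's numbers are hypothetical.

```python
from decimal import Decimal

# Assumption: "25 cents in proportion to the ratings" means 25 cents per reputation point.
RATE_PER_REP = Decimal("0.25")
# Badge rates as stated in the post.
BADGE_RATES = {"bronze": Decimal("1"), "silver": Decimal("5"), "gold": Decimal("10")}


def payout(reputation, bronze=0, silver=0, gold=0):
    """One-time payout in dollars for a single user under the proposed scheme."""
    counts = {"bronze": bronze, "silver": silver, "gold": gold}
    badge_total = sum(BADGE_RATES[name] * count for name, count in counts.items())
    return RATE_PER_REP * reputation + badge_total


# Hypothetical user: 1000 reputation, 20 bronze, 5 silver, 1 gold.
# 250 (reputation) + 20 + 25 + 10 (badges) = 305 dollars.
print(payout(1000, bronze=20, silver=5, gold=1))
```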

You know what defacing will do? It will stop RLHF. Future models will be less intelligent and less correct, because there will be no answers to check against.

1 Like

Perfect Ad for GPT technologies. Nothing more effective than crushing the human spirit into a compact utility.

TBH, this is the real anxiety that anti-AI people have, but they make up Terminator-style fantasies to cover it up. Which is unfortunate, because unlike those, this is real.

Once we acknowledge the problem, we can start dealing with it. IMHO, citation is the first and best step forward.

First, I support everyone’s right to peacefully protest anything in any way they feel is appropriate.

Even when I don’t really get the reasons for it.

No one got compensated before either, yet they were freely posting away on StackOverflow—ostensibly out of goodwill and an interest in helping others.

The design and policies of SO are such that it’s intended to create a permanent useful resource for anyone needing help with their issues.

Using SO data in LLM training just exponentially expands the usefulness of their existing contributions.

Beyond which, defacing your own solutions is pointless in a practical sense, as SO can simply roll back any changes; it won’t affect the end result.

In the short-term, it only hurts users who are actively seeking answers.

Now, it’s certainly caused some subset of Internet users to be aware of their complaints, but I would assume most of us were aware of this tension already.

Regardless, the Stack Overflow ToS clearly states,

Subscriber Content
You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you as reasonably necessary to, for example (without limitation):

This means that you cannot revoke permission for Stack Overflow to publish, distribute, store and use such content and to allow others to have derivative rights to publish, distribute, store and use such content.

So, these users are protesting against a website doing exactly what the website told them it was going to do.

Which, again, okay… Protest whatever you want to protest, but from the announcement on SO,

This integration will help OpenAI improve its AI models using enhanced content and feedback from the Stack Overflow community and provide attribution to the Stack Overflow community within ChatGPT to foster deeper engagement with content.

It sounds like the plan is to incorporate some Bing-style citation links back to the original Stack Overflow source where appropriate, so I’m really not sure why so many feathers are getting ruffled here.

Now, whether there is an argument that generative AI of this calibre is so new and unexpected that no reasonable person could have imagined these terms would enable their content to be used to train an AI model is another question, but not one that’s particularly interesting.

Perpetual and irrevocable licenses are pretty standard and pretty universally accepted…

1 Like

Not true. Folks were compensated with reputation. E.g., on Kaggle, where people post content and get credit for it, that credit is used on resumes to get jobs in the industry.

What’s happening here is people are not getting credit for content which informs the outputs of GPT.

The AI engines need to appropriately give credit. It’s technically feasible, but they are reluctant as they are afraid it will open them up to copyright infringement as people realize where they are getting their answers from.

1 Like

That’s a really tough sell. Like, I get it, I understand what you’re saying, but I am hard pressed to find issue with someone balking at their SO answer being used to train an AI model.

First, any information which is on SO is almost certainly somewhere else on the Internet. Probably 98% of all SO answers could be found simply by reading the documentation. So, simply because a model is trained—in part—on SO QA pairs doesn’t necessarily mean an LLM answering a similar question with a similar answer derived all or even most of its knowledge on the topic from any particular answer.

So, if we have a situation where 10, 100, or 1000 distinct sources contributed to the model’s ability to answer a question, how much attribution is each individual entitled to?

The number of things I have read and learned from SO is likely staggering. Many of them likely contribute in some small part to any number of answers I provide to any number of people on any number of topics, but because they are all jumbled up inside my squishy brain, no one really cares. I certainly couldn’t ascertain how much of any answer I provide is reliant on information I’ve read from SO or any place else.

All that’s different here is the scale.

Besides, the plan (according to the press release) is to provide links back to SO where appropriate, so people will continue to receive their reputation compensation.

When models are trained on trillions of tokens, how much credit do you think people should be getting for contributing any one datum to a model?

The community of SO (and most other communities) was built on upvotes and traffic. When AI takes the content and uses it to answer questions, it breaks the contract with that community. Pretty sure they’re not OK with it.

While Stack Overflow owns user posts, the site uses a Creative Commons 4.0 license that requires attribution. We’ll see if the ChatGPT integrations, which have not rolled out yet, will honor that license to the satisfaction of disgruntled Stack Overflow users. For now, the battle continues.

The CC license is a big deal. It’s the underpinning of massive amounts of the content on the web. Not respecting these licenses is a crime against those who put the content on the web in good faith.

What part of this,

Besides, the plan (according to the press release) is to provide links back to SO where appropriate, so people will continue to receive their reputation compensation.

did you deliberately miss?

Also, you seem to be missing the point that the CC license only carries weight if there is no fair use exception for the training of AI models as they are inherently transformative in nature.

If anything, the users of SO should be over-the-moon about the agreement because it means OpenAI—who is on record advocating that training an AI model is transformative fair use of copyrighted materials—is voluntarily agreeing to the CC license terms by entering into a contractual agreement with SO for the use of the data.

Yep, if they cite properly it’s fine. If they don’t, it’s not.

2 Likes

True. But again, if we snatch away people’s opportunity to get credited for their own content, they will riot. Imagine asking ChatGPT a question and stumbling upon your own exact answer, one you gave maybe 10 years ago. How bad would one feel? One of the core features of human beings is an inbuilt tendency to help others selflessly. That’s how forums and sites like SO have survived, and how business models have been established.

I really don’t see a solution to this problem other than a one-time settlement, paid by SO to its users, based on their reputation and medal tally on SO. How big a dent that will make is another story.

1 Like

I don’t know why that would make me, or anyone else, feel bad.

1 Like

If the search engine rumors are true, then we are going to lose what essentially powers the discovery of these pages, in exchange for a single entity that transforms them instead.

Everyone has their reasons for posting their answers and engaging with the discussion. However, it just leaves a bad taste knowing that everything they post is bought and churned without their seeing a single dime.

With ChatGPT sourcing SO answers, who’s really going to end up on there? “New” questions? Most “new/unasked” questions are, and increasingly will be, completely hallucinated by people who have no idea what they’re doing. This is already happening here.

Where will the discussions be afterwards? Do you really think people are going to follow the link, engage with the community there, and wait for an answer? Potentially immortalize themselves with a foolish question? Or will they just continue their discussion instantly with ChatGPT?

So, who’s going to populate SO? It’s been almost instantly turned into a relic.

Who is going to want to be an active community member of a website that basically sells your knowledge and gives you no slice of the pie?

“Oh, but they host it for you”
Give me a freaking break

2 Likes

If all this is not settled well, delicately, and not in a push-it-under-the-rug kind of way, it can cause a lot of problems in the future. People will stop participating. They will stop helping instantly, out of resentment. That alone is scary. Who are we kidding? SO holds such a large pool of knowledge, and just as with water, if no knowledge flows into the pool, we will all figuratively die.

1 Like