Training with forum data?

I have a semi-large forum aimed at a specific niche and I’ve been experimenting with GPT-3.5 (and GPT-4) providing the first automated answer whenever someone posts a new question. Now the answers it gives are hit & miss. Sometimes the answer is spot on, other times it’s utterly incorrect. My forum has 1.3GB of (text) data consisting of questions and answers. I don’t think GPT has been trained with data from my forum (although it does know something about it) so I was wondering if it is at all possible to train it (through embeddings?) with the data from the forum.

I just want to feed it cleaned-up text versions of all the forum threads. That means it will also be trained with incorrect data (since people do give wrong answers on my forum from time to time), but for the most part the data is accurate.

Is this possible?


Welcome to the developer forum!

You can certainly embed your entire forum, then use semantic retrieval to build a prompt that contains the most relevant context, and use that to answer the user's query.
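Here's a rough sketch of that retrieve-then-prompt flow, assuming you've already computed an embedding vector for each post (every name here is made up for illustration; plug in whatever embeddings provider you use):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, posts, k=3):
    """posts: list of (text, embedding) pairs; returns the k most similar texts."""
    ranked = sorted(posts, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_posts):
    """Stuff the retrieved posts into a prompt for the chat model."""
    context = "\n---\n".join(context_posts)
    return (
        "Answer the question using only the forum context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In practice a vector database does the `top_k` step for you at scale, but the logic is the same.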

Pick a vector database. Not sure of your budget, but options range from premium hosted solutions to open-source ones you host yourself.

Sounds like a fun project.


You can also experiment with the concept in a Jupyter Notebook, like this guy did for the same kind of thing:


I know it’s been a while, but the idea of training a model with forum data is really interesting.

Someone did some analysis of this forum this year, I believe.

Details escape me at the moment, but a search might find it?

Might’ve been on Github?

Yeah, that’s what I was thinking. Any public forum or social media site where individuals interact can be web scraped, including subreddits, YouTube communities, and so on. It comes down to which LLM you’ll be fine-tuning and, if it’s paid, how much you’re willing to pay to use that model. The same can also be done with conversations between individuals. I’m not sure how OAI’s moderation measures whether that kind of data has the consent and approval of its owners.

As in, would you need explicit consent from the users to do so?

I think Fireship has a video on this

What gets me thinking is: if someone were going to try to moderate this, would it be an NP-complete problem? So many edge cases… The training data could be from characters in a video game you’ve created, or from a family member who is no longer around. How would you prove you have their consent? What about prompt hacking? How would someone (or a moderation model) distinguish whether what you claim are conversations with your model isn’t actually content from a Twitch chat, or a recording of someone else in a conversation or a meeting? Such a hard problem… And then there’s also skipping fine-tuning entirely: just feeding a conversation to a model and telling it to role-play as that “fictitious character” you’ve “just created”…


Ah, it was the Lounge, I believe?

Yes, that mythical place that really exists once you reach “Regular” status here on the forum.

I missed a lot of the thread about the process the other day, but it’s on our radar. The Lounge is a bit slower paced, and we try to keep the signal-to-noise ratio a lot better there.

That said, we’re also striving to do better here with putting things in the right categories, adding tags, etc. Thoughtful posting helps! :wink:

But yeah, it’s been analyzed. I’m not sure if it’s been used for fine-tuning, though… I’m sure it has somewhere?

What forum engine are you using, @Zippy1970? My chatbot uses RAG with semantic and hybrid search to retrieve knowledge from the forum.

Forums are nice because posts naturally chunk the data for efficient embedding. In a well-structured and moderated forum, topics (threads) naturally fall into semantic buckets.
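To illustrate, here's a minimal sketch of per-post chunking for one thread (the function and field names are my own invention): each post becomes its own chunk, prefixed with the thread title so retrieval keeps the topical context, and unusually long posts are split.

```python
def chunk_thread(thread_id, title, posts, max_chars=2000):
    """Turn one forum thread into embedding-ready chunks.

    posts: list of post bodies (strings), in thread order.
    Returns dicts with a stable id (thread-post-offset) and the chunk text.
    """
    chunks = []
    for i, post in enumerate(posts):
        text = post.strip()
        # Split long posts into max_chars-sized pieces; short posts yield one chunk.
        for start in range(0, len(text), max_chars):
            chunks.append({
                "id": f"{thread_id}-{i}-{start}",
                "text": f"{title}\n\n{text[start:start + max_chars]}",
            })
    return chunks
```

The stable ids make it easy to re-embed or delete chunks later when a post is edited or removed.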

Another issue: how do you deal with new content? Are you going to continuously train new models, or just let embeddings and RAG take care of it?

I have not found the need for model training.
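New content is the easy case with embeddings: each new post just gets embedded as it arrives and added to the index, with no retraining involved. A minimal in-memory sketch (all names are hypothetical; `embed_fn` stands in for whatever embedding call your provider exposes):

```python
import math

class PostIndex:
    """Toy in-memory vector index over forum posts."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries = []  # (text, vector) pairs

    def add(self, text):
        # New posts are embedded on arrival; the model itself never changes.
        self.entries.append((text, self.embed_fn(text)))

    def query(self, question, k=3):
        """Return the k posts most similar to the question, by cosine similarity."""
        qv = self.embed_fn(question)

        def cos(v):
            dot = sum(a * b for a, b in zip(qv, v))
            norm = math.sqrt(sum(a * a for a in qv)) * math.sqrt(sum(b * b for b in v))
            return dot / norm

        ranked = sorted(self.entries, key=lambda e: cos(e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

A real vector database gives you the same `add`/`query` shape plus persistence and approximate-nearest-neighbor speed.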