I have a semi-large forum aimed at a specific niche and I’ve been experimenting with GPT-3.5 (and GPT-4) providing the first automated answer whenever someone posts a new question. Now the answers it gives are hit & miss. Sometimes the answer is spot on, other times it’s utterly incorrect. My forum has 1.3GB of (text) data consisting of questions and answers. I don’t think GPT has been trained with data from my forum (although it does know something about it) so I was wondering if it is at all possible to train it (through embeddings?) with the data from the forum.
I just want to feed it cleaned up text versions of all the forum threads. That means it will also be trained with incorrect data (since people do give wrong answers on my forum from time to time), but mostly the data is pretty much accurate.
You can certainly embed your entire forum, then, using semantic retrieval, produce a prompt that contains the most relevant context, and use that to answer the user's query.
Pick a vector database. Not sure of your budget, but options range from premium hosted solutions to open-source ones you build and host yourself.
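To make that concrete, here's a minimal sketch of the retrieve-then-prompt loop. The `embed` function here is a toy bag-of-words stand-in for a real embedding model (e.g. OpenAI's embedding endpoint), and a plain list stands in for the vector database, just to show the shape of the pipeline:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" -- swap in your real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank forum threads by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    # Stuff the most relevant threads into the prompt as context.
    context = "\n---\n".join(retrieve(query, docs))
    return (f"Answer using only this forum context:\n{context}\n\n"
            f"Question: {query}")
```

In a real setup you'd precompute and store the thread embeddings in the vector DB instead of re-embedding on every query; the retrieval step is otherwise the same.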
Yeah, that’s what I was thinking. Any public forum or social media where individuals interact can be web-scraped, including subreddits, YouTube communities, and so on. It comes down to which LLM you’ll be fine-tuning and, if it’s paid, how much you’re willing to pay to use such a model. The same can also be done with conversations between individuals. I’m not sure how OpenAI’s moderation measures what kind of data has the consent and approval of the people involved.
As in, would you need explicit consent from the users to do so?
What gets me thinking is: if someone were going to try to moderate this, would it be an NP-complete problem? So many edge cases… The training data could be from characters of a video game you’ve created, or from a family member who is no longer around; how would you prove you have their consent? What about prompt hacking? How would someone (or a moderation model) distinguish whether what you claim are conversations with your model isn’t actually content from a Twitch chat, or a recording of someone else in a conversation or a meeting? Such a hard problem… And then there’s skipping fine-tuning entirely and just putting a conversation into a model’s context and telling it to role-play as that “fictitious character” you’ve “just created”…
Yes, that mythical place that really exists once you reach “Regular” status here on the forum.
I missed a lot of the thread about the process the other day, but it’s on our radar. The Lounge is a bit slower paced, and we try to keep the signal-to-noise ratio a lot higher there.
That said, we’re also striving to do better here with putting things in the right categories, adding tags, etc. Thoughtful posting helps!
But yeah, it’s been analyzed. I’m not sure if it’s been used for fine-tuning, though… I’m sure it has somewhere?
What forum engine are you using @Zippy1970 ? My Chatbot uses RAG with Semantic and hybrid search to retrieve knowledge from the forum.
Forums are nice because posts naturally chunk data for efficient embedding. In a well-structured, well-moderated forum, topics (threads) naturally fall into semantic buckets.
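As an illustration of that natural chunking, here's a rough sketch of turning a thread into per-post chunks ready for embedding. The `thread`/`post` field names are invented for the example, not taken from any particular forum engine:

```python
def chunk_thread(thread):
    # One chunk per post, each carrying the thread title for context.
    # Field names ("id", "title", "posts", "author", "body") are
    # hypothetical -- map them to your forum engine's schema.
    chunks = []
    for i, post in enumerate(thread["posts"]):
        chunks.append({
            "id": f'{thread["id"]}#{i}',
            "text": f'{thread["title"]}\n{post["author"]}: {post["body"]}',
        })
    return chunks
```

Each chunk then gets embedded and stored with its `id`, so retrieved context can link back to the original post.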
Another issue: how do you deal with new content? Are you going to continuously train new models, or just let embeddings and RAG take care of it?
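If you go the embeddings/RAG route, new content becomes an incremental upsert rather than any retraining. A rough sketch, where a plain dict stands in for the vector store and `embed` is whatever embedding function you use (both are placeholders):

```python
def sync_new_posts(posts, index, seen_ids, embed):
    # Embed and upsert only posts not yet indexed, so fresh forum
    # content becomes searchable without touching the model itself.
    # `index` (dict) and `embed` (callable) are stand-ins for your
    # real vector store and embedding model.
    added = 0
    for post in posts:
        if post["id"] in seen_ids:
            continue
        index[post["id"]] = embed(post["text"])
        seen_ids.add(post["id"])
        added += 1
    return added
```

Run it on a schedule (or from a new-post webhook) and the index stays current; already-seen posts are skipped, so re-running it is cheap.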