Training with forum data?

I have a semi-large forum aimed at a specific niche and I’ve been experimenting with GPT-3.5 (and GPT-4) providing the first automated answer whenever someone posts a new question. Now the answers it gives are hit & miss. Sometimes the answer is spot on, other times it’s utterly incorrect. My forum has 1.3GB of (text) data consisting of questions and answers. I don’t think GPT has been trained with data from my forum (although it does know something about it) so I was wondering if it is at all possible to train it (through embeddings?) with the data from the forum.

I just want to feed it cleaned up text versions of all the forum threads. That means it will also be trained with incorrect data (since people do give wrong answers on my forum from time to time), but mostly the data is pretty much accurate.

Is this possible?


Welcome to the developer forum!

You can certainly embed your entire forum. Then, using semantic retrieval, build a prompt that contains the most relevant context and use that to answer the user's query.
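A minimal sketch of that flow, just to show the shape of it. Everything here is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the three threads are made-up sample data, and the prompt wording is only one way to phrase it. In practice you would swap `toy_embed` for calls to an actual embedding model and send the resulting prompt to the chat model.

```python
import hashlib
import math
import re

DIM = 256  # size of the toy vectors; real embedding models use many more dimensions


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def retrieve(question, threads, index, k=2):
    """Return the k threads whose vectors are closest to the question vector."""
    q = toy_embed(question)
    ranked = sorted(range(len(threads)), key=lambda i: cosine(q, index[i]), reverse=True)
    return [threads[i] for i in ranked[:k]]


def build_prompt(question, context):
    """Put the retrieved excerpts in front of the question."""
    joined = "\n---\n".join(context)
    return (
        "Answer the question using only the forum excerpts below. "
        "If they do not contain the answer, say so.\n\n"
        f"Excerpts:\n{joined}\n\nQuestion: {question}"
    )


# Hypothetical sample threads; replace with your cleaned-up forum text.
threads = [
    "Q: How do I reset the pump controller? A: Hold the mode button for ten seconds.",
    "Q: Which filter size fits the 200 series? A: The 200 series takes a 25 mm filter.",
    "Q: Is the firmware update free? A: Yes, download it from the support page.",
]
index = [toy_embed(t) for t in threads]  # embed once, reuse for every question

question = "How can I reset my pump controller?"
top = retrieve(question, threads, index, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Note that nothing is trained here: the model only sees whatever excerpts get retrieved, so a wrong answer in a retrieved thread can still end up in the reply.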

Next, pick a vector database. I'm not sure of your budget, but the options range from premium hosted services to open-source solutions that you build and run yourself.
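To give a feel for the build-it-yourself end of that range, here is a rough sketch of what a vector store does at its core: keep vectors next to their text and return the closest ones. `TinyVectorStore` and the three-dimensional vectors are hypothetical, made up purely for illustration. It is a brute-force scan, so search time grows linearly with the number of stored vectors; dedicated vector databases typically add indexing and persistence on top of this idea.

```python
import heapq
import math


class TinyVectorStore:
    """Hypothetical in-memory store: keeps unit-length vectors next to their text."""

    def __init__(self):
        self._rows = []  # list of (doc_id, unit_vector, text)

    def add(self, doc_id, vector, text):
        norm = math.sqrt(sum(v * v for v in vector))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        # Normalise at insert time so search only needs a dot product.
        self._rows.append((doc_id, [v / norm for v in vector], text))

    def search(self, query, k=3):
        """Return up to k (score, doc_id, text) tuples, best match first."""
        norm = math.sqrt(sum(v * v for v in query)) or 1.0
        q = [v / norm for v in query]
        scored = (
            (sum(a * b for a, b in zip(q, vec)), doc_id, text)
            for doc_id, vec, text in self._rows
        )
        # nlargest avoids sorting the whole collection when k is small.
        return heapq.nlargest(k, scored, key=lambda row: row[0])


# Made-up vectors standing in for real embeddings.
store = TinyVectorStore()
store.add("thread-1", [1.0, 0.0, 0.0], "first thread")
store.add("thread-2", [0.0, 1.0, 0.0], "second thread")
store.add("thread-3", [0.7, 0.7, 0.0], "third thread")

hits = store.search([0.9, 0.1, 0.0], k=2)
for score, doc_id, text in hits:
    print(doc_id, round(score, 3), text)
```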

Sounds like a fun project.


You can also experiment with the concept in a Jupyter Notebook, as this guy did for the same kind of thing: