Fine-tuning or embeddings for a small dataset?

Hi community,

I’m trying to build a chatbot using OpenAI’s API and our own data, a dataset of academic papers. We feed the bot 5–6 papers at a time and ask it questions about them.
I did some research, and it seems building an embeddings database as an external knowledge source for GPT is the best option, but is it possible for the model to memorize the data directly through fine-tuning? I’m new to NLP, so sorry for the newbie question. To my understanding, fine-tuning is used to adapt the model to specific tasks it wasn’t familiar with before, not to memorize or understand a specific dataset.
I discussed this with my team, and some members found that using embeddings as an index may not be accurate in some cases, which is why I’m asking.
Any suggestions are welcome, thanks in advance.


That’s correct: fine-tuning adapts the model to a task format, not to a body of knowledge.

Can the model memorize the data directly through fine-tuning? Theoretically, yes. Practically, no.

Which cases? For a chatbot, none that I’ve encountered, provided the logic of the retrieval strategy is sound.

If the goal is a conversational interface (chatbot), then you’ll want to use one of the latest chat-completion models. These are the best for conversation and cannot be fine-tuned. But they can be instructed to respond in the context of the prompt quite reliably (temperature 0).
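To make that concrete, here’s a minimal sketch of the usual pattern: embed the paper chunks, retrieve the most relevant ones for a question, and pass them as prompt context at temperature 0. This assumes the openai Python package (v1+) and numpy; the model names and the `chunks` input are illustrative, not recommendations.

```python
# Minimal sketch: answer a question using only retrieved paper chunks as context.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question, chunks, top_k=5):
    chunk_vecs = embed(chunks)    # in practice, precompute and store these
    q_vec = embed([question])[0]
    # OpenAI embeddings are unit-normalized, so a dot product is cosine similarity
    scores = chunk_vecs @ q_vec
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep answers grounded in the supplied context
        messages=[
            {"role": "system",
             "content": "Answer using only the context below.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```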

Hi Brian, thanks for your reply!
As an example of when embeddings didn’t work: we have chemical A, which is related to chemical B, but this relationship isn’t explained in paper 1. So when asking questions about running experiments with A, we have to provide the bot with other papers (2, 3, 4, …) as background knowledge, hoping it also offers some ideas involving B.
Since this is a new research field, we have to supply 4–5 papers as background in every prompt, and the embeddings index lost details on a couple of occasions. So I’m thinking of teaching the bot the domain knowledge directly (actually just a few papers, certainly not enough). I tried to figure out how pre-trained models memorize things but found nothing, and I wonder why people don’t talk about it. By the way, do you know of any research or papers discussing this?
Thanks again

I’d take a closer look at your embedding strategy before considering fine-tuning. It sounds like the input to your embeddings system (i.e., how you parse the papers) is insufficient.

If you were to fine-tune, you’d still have to solve this same issue.
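For instance, if chunks are too small or split mid-thought, facts that belong together (like the A–B relationship) can land in chunks that never get retrieved together. A common mitigation is overlapping chunks; here’s a rough sketch, where the sizes are illustrative assumptions rather than tuned values:

```python
# Sketch of overlapping chunking: the overlap makes it less likely that
# related sentences are split across retrieval units. Sizes are illustrative.
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```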

If you’re still looking into this, it might be worthwhile to see whether any open-source datasets already meet your needs.

There are several for the medical and chemical domains available on the internet.