Fine-tuning or embeddings for a small dataset?

Hi community,

I’m trying to build a chatbot using OpenAI’s API and our own data, a dataset of academic papers. We feed the bot 5–6 papers at a time and ask it questions about them.
I did some research, and it seems building an embeddings database as an external knowledge source for GPT is the best option, but is it possible for the model to memorize the data directly through fine-tuning? I’m new to NLP, so sorry for the newbie question. To my understanding, fine-tuning is used to adapt the model to specific tasks it wasn’t familiar with before, not to memorize or understand a specific dataset.
I discussed this with my team, and some members found that using embeddings as an index may not be accurate in some cases, which is why I’m asking.
Any suggestions are welcome, thanks in advance.


That’s correct: fine-tuning adapts the model to a task format, not to a body of knowledge.

Can the model memorize the data directly through fine-tuning? Theoretically, yes. Practically, no.

Which cases? For a chatbot, none that I’ve encountered, provided the logic of the retrieval strategy is sound.

If the goal is a conversational interface (chatbot), then you’ll want to use one of the latest chat-completion models. These are the best for conversation and cannot be fine-tuned. But they can be instructed to respond in the context of the prompt quite reliably (temperature 0).
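To make that concrete, here’s a minimal sketch of the usual pattern: embed the paper chunks, retrieve the most relevant ones for a question, and pass them as prompt context at temperature 0. This assumes the openai Python package (v1+) and numpy; the model names and the `chunks` input are illustrative, not recommendations.

```python
# Minimal sketch: answer a question using only retrieved paper chunks as context.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question, chunks, top_k=5):
    chunk_vecs = embed(chunks)    # in practice, precompute and store these
    q_vec = embed([question])[0]
    # OpenAI embeddings are unit-normalized, so a dot product is cosine similarity
    scores = chunk_vecs @ q_vec
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep answers grounded in the supplied context
        messages=[
            {"role": "system",
             "content": "Answer using only the context below.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```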

Hi Brian, thanks for your reply!
As an example of when embeddings didn’t work: we have chemical A, which is related to chemical B, but this relationship isn’t explained in paper 1. So when asking questions about running experiments with A, we have to provide the bot with other papers (2, 3, 4, …) as background knowledge, hoping it also offers some ideas involving B.
Since this is a new research field, we have to supply 4–5 papers as background in every prompt, and the embeddings index lost details on a couple of occasions. So I’m thinking of teaching the bot the domain knowledge directly (actually just a few papers, certainly not enough). I tried to figure out how pre-trained models memorize things but found nothing, and I wonder why people don’t talk about it. By the way, do you know of any research or papers discussing this?
Thanks again

I’d take a closer look at your embedding strategy before considering fine-tuning. It sounds like the input to your embeddings system (i.e., how you parse the papers) is insufficient.

If you were to fine-tune, you’d still have to solve this same issue.
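For instance, if chunks are too small or split mid-thought, facts that belong together (like the A–B relationship) can land in chunks that never get retrieved together. A common mitigation is overlapping chunks; here’s a rough sketch, where the sizes are illustrative assumptions rather than tuned values:

```python
# Sketch of overlapping chunking: the overlap makes it less likely that
# related sentences are split across retrieval units. Sizes are illustrative.
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```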

If you’re still looking into this, it might be worthwhile to see whether any open-source datasets already meet your needs.

There are several for the medical and chemical domains available on the internet.