Embeddings vs finetunes

The project is an “expert” bot. I’ve got a guideline document that the bot is supposed to answer questions about. I’ve created embeddings for the document’s sections; when a new question comes in, I embed it, compare it against the section embeddings, and send the best-matching section to davinci (or more precisely, to text-davinci-002) along with the question. This works pretty well. The concern is the number of tokens this uses, since I’m feeding entire sections of the document to answer a single question. Is it possible, via fine-tunes, to get the model to remember the document? I did some testing and my results, again, were not great. What I’d like is to put the sections into fine-tunes, and then say “based on section 2, answer this question,” where ‘section 2’ is arrived at by comparing the question embedding against my locally saved document embeddings, without having to send section 2 to the API every time a relevant question is asked. Will this work?
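For reference, the comparison step described above is a nearest-neighbor search over embeddings. A minimal sketch, assuming the section embeddings are already computed and saved locally (the section names, vectors, and function names here are illustrative, not the actual code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_section(question_emb: np.ndarray,
                 section_embs: dict[str, np.ndarray]) -> str:
    """Return the section whose saved embedding is closest to the question's."""
    return max(section_embs,
               key=lambda name: cosine_similarity(question_emb, section_embs[name]))

# Toy embeddings; real ones would come from an embeddings API or a local model.
sections = {
    "section 1": np.array([1.0, 0.0, 0.0]),
    "section 2": np.array([0.0, 1.0, 0.0]),
}
question = np.array([0.1, 0.9, 0.0])
print(best_section(question, sections))  # → section 2
```

The winning section's text is then pasted into the completion prompt, which is exactly where the token cost comes from.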


Sounds like you have created embeddings for each section of your doc, right?
That’s a vector search, and you might get comparable results at no API cost by using SBERT or something similar.

I’m not aware of an effective way to teach GPT-3 the content of a particular document (or a set of documents) through fine-tuning. At least my experiments to do so failed, and I ended up with a vector search using SBERT’s sentence embeddings.
Check out Semantic Search — Sentence-Transformers documentation


This is exactly what I’m doing. Thanks for the link!

I also came across the question recently.

@Multiman, are you sure that GPT-3 cannot be fine-tuned with the content of a particular document?
For example, the links below explain how to create a factual Q&A bot out of a custom dataset, which in this case would need to be created from the guideline document.

The doc you pointed to is not about fine-tuning. It is the standard two-step approach:

  1. Retrieve relevant passages through vector search (a.k.a. semantic search through embeddings). The retriever extracts passages that are semantically similar to the question.
  2. Build a GPT-3 prompt that contains the question and the extracted passages.

No fine-tuning at all. Fine-tuning refers to updating the weights of the LLM in favor of your particular domain.
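Step 2 is just string assembly. A minimal sketch (the template wording is my own, not from the doc):

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a completion prompt from retrieved passages and the question."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What does the guideline say about refunds?",
    ["Section 2: Refunds are issued within 30 days of purchase."],
)
print(prompt)
```

The assembled string is what gets sent to the completions endpoint, so every retrieved passage counts against the prompt's token budget.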

Have you tried increasing the number of epochs when fine-tuning the model on your documents? (Say, from 4 to 10 or more.) In theory this will “burn in” the details much more. It’s worth a shot, but may not be any more crisp than a vector-similarity lookup.
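With the legacy OpenAI fine-tunes API, the epoch count was exposed as the `n_epochs` hyperparameter. A sketch of the request parameters (the file ID is hypothetical, and defaults may vary by model):

```python
# Parameters for a fine-tune job with more training epochs than the
# usual default of 4, to "burn in" the documents harder.
fine_tune_params = {
    "training_file": "file-abc123",  # hypothetical uploaded JSONL file ID
    "model": "davinci",
    "n_epochs": 10,                  # raised from the default of 4
}
# In the old openai-python client this would be passed as, e.g.,
# openai.FineTune.create(**fine_tune_params)
print(fine_tune_params["n_epochs"])  # → 10
```

More epochs means more gradient passes over the same training examples, which strengthens memorization but also raises the risk of overfitting.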

Makes sense. Thanks. “Fine-tuning” through embeddings (as used in the title of the doc) is not real fine-tuning (updating the weights of the LLM).

You’ve fine tuned davinci-002?