New to OpenAI and GPT. I am building a POC for our company's internal knowledge base. So far I've used LlamaIndex to index our docs and used prompting to change the behavior of the query engine itself, sending the prompts first with each query invocation. This seems inefficient, but it works for a small number of prompts. I am also exploring storing our index in Pinecone and using the Pinecone reader integration in LlamaIndex.
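Roughly what I'm doing today, as a simplified sketch (imports vary by LlamaIndex version; the INSTRUCTIONS text and the docs path are made up):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex  # import path varies by LlamaIndex version

# Hypothetical behavioral instructions that I currently re-send with every query
INSTRUCTIONS = (
    "You are an assistant for our internal knowledge base. "
    "Answer only from the indexed documents and cite the source file."
)

# Index the internal docs (in the real POC this index would be persisted, or backed by Pinecone)
documents = SimpleDirectoryReader("./internal_docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# The inefficient part: the instructions ride along as plain text on every single query
response = query_engine.query(INSTRUCTIONS + "\n\nQuestion: How do I request VPN access?")
print(response)
```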
Still a bit confused by all the strategies and want to clear up my understanding; please correct me if anything is wrong:
- With fine-tuning, I give a set of my prompts, choose a base model such as gpt-3.5-turbo, and the fine-tuning job returns a new custom model that I can use. This new model has all my prompts baked into it, so I can use the custom model without supplying the prompts with each conversation line item. Fine-tuning can be very expensive. (See the fine-tuning sketch after this list.)
- Embedding is the process of taking a chunk of text and turning it into a vector. For instance, I can create an embedding of the conversation so far and provide it as context for the next response. My understanding is that this gets around the maximum-token limit of supplying context as plain text and is also more cost-efficient. (See the embedding sketch after this list.)
- Instead of fine-tuning, can I just create an embedding of my prompts and then supply it with each conversation, before the context embedding? i.e. query(embedding(prompts) + embedding(context) + latest conversation line item). (A rough sketch of what I mean is below.)
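To make the fine-tuning bullet concrete, this is my understanding of the workflow in code form. It's only a sketch I haven't run; it assumes the current OpenAI Python client and a made-up kb_examples.jsonl training file:

```python
from openai import OpenAI

client = OpenAI()

# Training data is a JSONL file of example conversations that demonstrate the desired behavior,
# e.g. {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
training_file = client.files.create(
    file=open("kb_examples.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# Start the fine-tuning job against the chosen base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)

# When the job completes it produces a custom model id (something like "ft:gpt-3.5-turbo:...")
# that can then be used in chat completions without re-sending the behavioral prompts each time.
```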
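And this is what I mean by embedding in the second bullet, i.e. turning text into a vector (sketch; assumes the OpenAI embeddings endpoint with text-embedding-ada-002):

```python
from openai import OpenAI

client = OpenAI()

# Turn a chunk of text (a doc section, or a summary of the conversation so far) into a vector
result = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Conversation so far: the user asked how to request VPN access and ...",
)
vector = result.data[0].embedding  # a list of floats (1536 dimensions for this model)
```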
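Finally, the third bullet would look roughly like this. It's only a sketch of my assumption about how it would work: the embeddings are used to pick which stored prompt text is relevant, and the selected text (not the raw vectors) is what gets sent with the request. The snippet names and similarity logic are made up:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical behavioral prompts, embedded once up front instead of fine-tuned into a model
PROMPT_SNIPPETS = [
    "Always answer from the internal knowledge base only.",
    "Cite the source document for every answer.",
    "If the answer is not in the docs, say so instead of guessing.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d.embedding) for d in resp.data]

snippet_vectors = embed(PROMPT_SNIPPETS)

def answer(conversation_context: str, latest_message: str) -> str:
    # Embed the latest user message and pick the most similar stored snippets
    query_vec = embed([latest_message])[0]
    scores = [
        float(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
        for v in snippet_vectors
    ]
    top = [PROMPT_SNIPPETS[i] for i in np.argsort(scores)[-2:]]

    # The selected snippet text is sent along with the conversation, not the raw vectors
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "\n".join(top)},
            {"role": "user", "content": conversation_context + "\n" + latest_message},
        ],
    )
    return completion.choices[0].message.content
```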
Question: why use expensive fine-tuning if I can just use embeddings for the prompts? I'm guessing this may be performance-related? I understand I still have to generate embeddings for the context, since that is dynamic and depends on what the user says.