Best approach for adding knowledge to base model

My goal is to teach an LLM (GPT-3 or GPT-4 turbo) to translate a natural language prompt into a short script which the model then passes to a “ProcessScript” tool/function to perform the action within my application – basically a natural language UI vs. clicking with the mouse.

My first attempt used the system prompt and in-context learning, and it worked really well. The problem is the token count, which I would like to minimize: describing all the functionality I need in the system prompt requires more tokens than will fit in the context window.

I thought, “Aha! I will use fine-tuning to bake the knowledge from my system prompt into the model,” and have been playing with that for the last few days with poor results. After reading the information provided by OpenAI and other topics in this forum, I think my understanding of fine-tuning was wrong… it’s not a good technique for adding lots of knowledge to a model; rather, it’s something to use when you want to adjust tone or output format. My fine-tuned models kind of work… sometimes, but nowhere near the performance of the in-context version.

I’m considering abandoning fine-tuning and switching to RAG with the new OpenAI embedding models, since that technique is designed for providing new knowledge to a model. I had a bad experience trying to get a RAG chatbot working against my application’s documentation, but I’m thinking it could work better for this task since I will be hand-crafting the chunks that get embedded.

I’d appreciate any advice on the best approach and any experiences using RAG for similar applications. Thanks.

Not two minutes passed between seeing this repo and then this post.

Have you checked out Rawdog? Do you think this is in the ballpark of what you’re wanting?

It is a GitHub repo, so you are free to fork it and adjust the src to your application’s needs. (Note: I did not build this, although I wish I had, for the name alone.)

Thanks @Macha. Rawdog is certainly along the lines of what I’m trying to do, but it uses a fairly large system prompt and ChatGPT’s extensive knowledge of Python to get the job done. In my case I need to teach ChatGPT to write scripts in my application’s proprietary scripting language, which ChatGPT has never seen. I’ve been able to do that successfully with a subset of the scripting language, but describing the whole language with examples is going to take tens of thousands of tokens, which is not ideal. I’m going to try using RAG to help me reduce the prompt size.


Ah, I see now.
Yeah, when GPT is faced with a programming language it has not seen before, it can become quite error-prone.

RAG would be useful, but may I ask how you fine-tuned your models? Did you pre-process your data at all?

Fine-tuning can actually produce quality results if you pre-process the data correctly. If you just dump docs into it, it won’t be effective, and will likely make things worse.

What you would need to do is set up a bunch of chats that describe how your scripting language works, in a mock exchange between a user and the AI. Imagine it like an intern: you show the intern what other people have asked about the language in the past, and what the right answers are.
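To make the "mock exchange" idea concrete, here is a minimal sketch of how such chats could be assembled into the JSONL chat format OpenAI's fine-tuning endpoint expects (one JSON object per line). The system prompt, user requests, and scripting-language snippets below are invented placeholders, not the poster's actual language:

```python
import json

# Hypothetical stand-in for the application's real system prompt.
SYSTEM = "You translate user requests into MyAppScript and call ProcessScript."

# Each pair is one mock exchange: what a user asked, and the right answer.
# These scripts are invented placeholders for a proprietary language.
examples = [
    ("Open the settings panel", 'panel.open("settings")'),
    ("Zoom in on the selected item", "selection.zoom(2.0)"),
]

lines = []
for user_prompt, script in examples:
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": script},
        ]
    }
    lines.append(json.dumps(record))

# Writing "\n".join(lines) to a .jsonl file gives you a training file
# in the shape the fine-tuning API accepts.
```

The point is that each training example is a complete conversation demonstrating the behavior you want, not a raw dump of documentation.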

Fine-tuning is quite good for getting the model to understand specific structures and formats. Code is a structure.

That being said, RAG can dramatically reduce prompt sizes without any of this. It is especially useful if you are repeatedly retrieving the same kinds of information. Just make sure the data is parsed and stored in smaller chunks. Both options can work well. Just be mindful that RAG won’t be perfect either: it directly dumps extra data into the prompt as context. I made attempts at a similar scheme for LangChain copiloting, and there were times it confused my language with another one. That may be non-trivial for you, considering your use case.
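For illustration, the retrieval step behind this could look like the sketch below. It assumes each hand-crafted documentation chunk has already been embedded (e.g. with OpenAI's `text-embedding-3-small`); the tiny 3-dimensional vectors and chunk texts here are invented stand-ins for real embeddings of a real scripting language:

```python
import math

# Invented documentation chunks mapped to stand-in embedding vectors.
# In practice these vectors would come from an embeddings API call.
chunks = {
    "panel.open(name) opens the named panel": [0.9, 0.1, 0.0],
    "selection.zoom(factor) zooms the selection": [0.1, 0.9, 0.1],
    "doc.save(path) saves the current document": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, k=2):
    """Return the k chunk texts most similar to the embedded query."""
    ranked = sorted(chunks, key=lambda t: cosine(query_vec, chunks[t]), reverse=True)
    return ranked[:k]

# Stand-in for the embedded user request, e.g. "open the settings panel".
query_vec = [0.85, 0.2, 0.05]
context = top_k(query_vec)
# Only these retrieved chunks go into the system prompt, keeping it small.
```

The design point is that the full language reference never enters the prompt; only the handful of chunks relevant to the current request does.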

Otherwise, I wish you the best of luck, and if you need further help, feel free to reach out!

Thanks @Macha. For my fine-tuning attempts I constructed conversations with my system prompt, a user prompt, and the expected response containing the function call and the LLM-generated script. These conversations were derived from actual conversations I had with ChatGPT-4 using in-context learning. I created a number of fine-tuned models using between 10 and 100 conversations on a simple task. The results were certainly better than with no tuning at all, but very poor compared to providing the information in the system prompt (in-context).
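For readers following along, a training record like the ones described here (system prompt, user prompt, and an assistant turn that invokes the ProcessScript function with the generated script) could be sketched as below. The `tool_calls` shape follows OpenAI's chat fine-tuning format; the system text and script are invented placeholders:

```python
import json

# One training conversation whose expected response is a ProcessScript
# function call carrying the generated script (script text is a placeholder).
conversation = {
    "messages": [
        {"role": "system",
         "content": "Translate requests into MyAppScript and call ProcessScript."},
        {"role": "user",
         "content": "Rotate the selected object 90 degrees"},
        {"role": "assistant",
         "tool_calls": [{
             "id": "call_1",
             "type": "function",
             "function": {
                 "name": "ProcessScript",
                 "arguments": json.dumps({"script": "selection.rotate(90)"}),
             },
         }]},
    ]
}

line = json.dumps(conversation)  # one line of the training .jsonl file
```

Each such conversation becomes one line of the JSONL training file, so the model learns to emit the function call rather than free-form text.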