Which is more accurate? Fine-tuning vs Prompt Stuffing

Hi everyone,

I have seen a lot of topics talking about how to upgrade GPT (or any generalist open source LLM for that matter) into an expert / niche LLM that has particularly sharp knowledge about a set of topics.

Most discussions compare fine-tuning GPT vs using semantic search + prompt stuffing.

Here is what I noticed / understood:
1/ Most recommendations highlight that semantic search + prompt stuffing is more than enough to satisfy most use cases
2/ Re-training an LLM appears to be expensive, while prompt stuffing is cheap
3/ People tend to have abnormally high expectations of LLM fine-tuning. They expect the re-trained model to suddenly become highly knowledgeable about a certain topic, which is not the case.
4/ There isn’t a big trade-off regarding the complexity & time to implement each solution; both are “relatively easy”.

Fair enough, but I have a few questions.

Regarding 1., is there a rule of thumb to differentiate cases that would better benefit from a fine-tuning approach rather than stuffing?

More generally, if prompt stuffing is that good, what’s the point of re-training a model (especially if it comes at a higher cost)?

When it comes to accuracy, could there be a significant difference between stuffing & fine-tuning (maybe under certain circumstances)?

Thanks a lot for your help on this


One good rule of thumb is that if you can’t multi-shot train in the prompt, usually due to space limitations, you need to fine-tune.

A good example of this is a classifier. This is my main use case for fine-tunes.

Yes, absolutely! If you need many examples for the AI to learn from, and so you go ahead with a good fine-tune, then you are going to get better results because the LLM is no longer “information starved”. But this is information about the pattern, not “knowledge” or facts.

Another benefit is faster inference times. You also send fewer input tokens, since you aren’t forming a large, elaborate prompt anymore.

The case for adding knowledge is usually best tackled by RAG, or feeding it context that the base LLM can draw from.

Another purpose of a fine-tune is to add tone. So you take your writing style and train a fine-tune with {neutral tone} —> {unique tone} pairs. You can even get these pairs by feeding your original tone into GPT and asking it to create a neutral version, which gives you {unique tone} —> {neutral tone}. The training file is just the inverse of these pairs (reverse the direction of the arrow, and call this your training file!)
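To make the "reversed arrow" concrete, here is a minimal sketch of building such a training file, assuming you already have the {unique} / {neutral} pairs (the sample sentences and system instruction below are made up for illustration):

```javascript
// Sketch: invert {unique tone} -> {neutral tone} pairs into a
// {neutral} -> {unique} fine-tune file, one JSON object per line (JSONL).
const pairs = [
  { unique: "Well butter my biscuit, sales are up!", neutral: "Sales increased." },
  { unique: "This gizmo is a total dud.", neutral: "The device does not work." },
];

// The neutral text becomes the user turn and your original tone becomes
// the assistant turn -- that is the "reversed arrow".
const trainingLines = pairs.map(({ unique, neutral }) =>
  JSON.stringify({
    messages: [
      { role: "system", content: "Rewrite the text in the house tone." },
      { role: "user", content: neutral },
      { role: "assistant", content: unique },
    ],
  })
);

const jsonl = trainingLines.join("\n");
```

You would then write `jsonl` to a file and upload it as fine-tuning training data.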

So overall, a fine-tune is required where you have a massive pattern problem to solve and can’t mimic this pattern realistically within the confines of the finite prompt window.

So categorization and tone are both “patterns” that, I would argue, have a hard time fitting into the prompt.

The only exceptions are things the LLM already knows. For example, sentiment analysis is something the LLM is already aware of, so you don’t need a fine-tune for it.

So, fine-tunes are needed for patterns that the LLM doesn’t already have extensive training on.


Great points. I’d just like to add:

In my case, one of my applications takes in a simple idea (someone put in “Boat Manufacturing” once, as an example), and then GPT-3.5-turbo generates a JSON object listing ~10 areas where an LLM can help build this idea.

It has a 0% error rate so far (it’s chained with an initial screening step to filter out bad requests and then re-shape them. I also use a database to cache/retrieve previous requests, so I guess it’s not really a 0% error rate, but I’d rather return null for edge cases anyway. I have never received a malformed object).

I get away with this by giving it some few-shot examples (there are also instructions at the start, but I’d rather not show them):

            messages: [
              { role: "user", content: "Sell properties" },
              { role: "assistant", content: `{ "query": "House Flipping" }` },
              { role: "user", content: `sell bodies` },
              { role: "assistant", content: `{ "query": null }` },
              { role: "user", content: `lol idek` },
              { role: "assistant", content: `{ "query": null }` },
              { role: "user", content: `Selling cheese` },
              { role: "assistant", content: `{ "query": "Cheese Vendor" }` },
              { role: "user", content: query },
            ],

Then the second GPT function simply takes the cleaned/shaped query if it doesn’t match anything in the database, and turns it into a simple object without any few-shot prompts:

                        // Describes how AI technologies can benefit the user
                        "description": "string",
                        // A succinct title (less than 5 words)
                        "name": "string"

                OBJECTIVE: Generate 10 objects fitting this schema.`

Considering that fine-tuned Davinci & GPT-3.5-Turbo cost 8x more than the vanilla 3.5 model, you really need to justify fine-tuning.

So if you think you want to do it, I highly, highly, highly suggest first considering “am I doing way too many things that could be broken down into separate functions?”, then running it through the OpenAI Evals framework and testing numerous prompts.

I see people try to make a single request do way too much far too often. I could, if I wanted, fine-tune a model to perform both these steps at the same time. But why bother? What if, for example, I decided to cache and retrieve the query as I’ve done? Or slightly change the schema? I’m screwed. Single responsibility.


This reminds me of another case I forgot to mention: the situation where your multi-shot prompt version works 80% of the time, your fine-tune works 99% of the time, and you have some way to detect when something is wrong (like JSON that doesn’t conform).

So in this case, to save cost, use the multi-shot prompt version first, detect if there was an error, and if so, run your fine-tune. Here you are saving a ton of money by letting the multi-shot give it a good try before resorting to the fine-tuned model.
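The cheap-first fallback can be sketched like this. `callModel` is a hypothetical wrapper around your API client, and the fine-tune model name is made up; only the validate-then-fall-back logic is the point:

```javascript
// Validation step: the "way to detect when something is wrong" --
// here, JSON that doesn't parse or doesn't have the expected shape.
function isValidResult(text) {
  try {
    const obj = JSON.parse(text);
    return typeof obj === "object" && obj !== null && "query" in obj;
  } catch {
    return false;
  }
}

// Try the cheap multi-shot prompt model first; only pay for the
// fine-tune when the cheap attempt fails validation.
async function answer(prompt, callModel) {
  const cheap = await callModel("gpt-3.5-turbo", prompt);
  if (isValidResult(cheap)) return cheap;
  return callModel("ft:gpt-3.5-turbo:my-org:classifier", prompt);
}
```

If the cheap model succeeds 80% of the time, you only pay fine-tune prices on the remaining 20% of requests.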

Another consideration is which model to fine-tune. For classification with tons of training data, I would shoot for the lowest model first, Babbage-002, because the model only outputs one token in the response.

For situations where you have lots of output tokens, or the input is super convoluted, go with a higher model for the fine-tune. Similar to above, if you can detect an issue, start with the lowest model and keep upgrading the request to the next higher model when an issue is detected. BTW, this is also a good strategy to avoid API downtime, because often only one model is down, not all models.


I agree with everything above,

I chuckled a bit when I read “prompt stuffing”, good one!

No matter what method or model you choose, you’ll have to deal with the fact that the models have a limited amount of attention. If you stuff your prompt with instructions, the attention will be divided among them, and the response to each of these will be shorter and simpler.


Thanks a lot for those insights. I think I am starting to get the general idea.

So I have a case that seems to be a multi-shot case but probably has some more subtle angles to consider. Basically, I want to build a system specialized in the data privacy compliance area. So we are talking about a niche with a large and diverse amount of pretty specific information & knowledge to retrieve.

Here is my goal: based on a single question, I want to retrieve the most relevant information that could help build the knowledge needed to answer it. It’s not about answering the question directly, and it’s not about retrieving information in bulk. It’s rather about highlighting the most relevant pieces of information and making them easy to understand, concise & actionable for future decision making.

And I think I will essentially have 3 types of questions:

[1] The simple one – The question goes straight to the point about a very specific topic

Question: Considering GDPR law, where am I allowed to store collected data?

Answer: GDPR states that data should essentially be stored in the EU. Stored data can be transferred to another region as long as the destination country enforces data privacy rules equivalent to GDPR. Note that transferring data to the US might be in breach of GDPR due to recent ePrivacy developments that exposed a risk regarding data anonymization & access.

[2] The open one – The question requires combining several topics & asks for an opinion / outlook

Question: How did GDPR impact digital marketing and how will it impact it in the near future?

Answer: GDPR has significant implications for online lead generation. The biggest impact was on ethical data acquisition, which requires businesses to have a clear basis for data collection such as consent or legitimate interest. This ensures that leads are genuinely interested in your business, thereby improving the effectiveness of your marketing efforts. Furthermore, other requirements have emerged: companies can no longer store data indefinitely unless they have a legitimate purpose, they also have to justify & keep track of all data processing in place, etc. In the near future, it is expected that laws such as GDPR will continue to apply constraints to companies operating digital marketing activities, while users’ protection should improve over time. Future regulations will most likely be influenced by upcoming judgments from the different local data authorities that are assessing the compliance of major tech companies such as Meta, Facebook & Amazon.

[3] The decision-making one – The question asks for a recommendation / what to do in a certain situation

Question: Is it more strategic to focus on GDPR-compliant user consent collection or on secure data storage for now?

Answer: When it comes to GDPR compliance, it is difficult to prioritize consent collection versus data storage, as both are mandatory elements of compliance. A valid consent is the first critical & mandatory step before collecting & processing personal data. Without it, anyone collecting data is automatically in breach. Once the data is collected, it should be stored in a secure manner, which involves data encryption, anonymization, etc. Even though it is not advised, we could assume that securing data could be done afterwards if the company needs time to build up its security capabilities.

How do I deal with this use case right now?

I approached those questions in a very “traditional” way:

  • I have a large text file (50k words → it will probably grow to 150k words in the future) that contains all the data privacy knowledge.
  • Based on the question asked, I run a semantic search to retrieve relevant information from my document (Ada-002 embedding + Chroma + Similarity x MMR search)
  • Then I build a few prompts in which I inject the semantic search results as context, and I refine the retrieved information using GPT-4 to exclude any irrelevant information / rephrase the text to make it easy to understand.
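The "inject the search results as context" step can be sketched as a simple prompt builder. `buildRefinePrompt` is a made-up helper name and the system instruction is illustrative; only the message shape matches the Chat Completions format:

```javascript
// Sketch: stuff retrieved chunks into a system message, then ask the
// question as the user turn. The refinement instruction tells the model
// to filter and rephrase rather than answer from its own knowledge.
function buildRefinePrompt(question, retrievedChunks) {
  const context = retrievedChunks.map((c, i) => `[${i + 1}] ${c}`).join("\n");
  return [
    {
      role: "system",
      content:
        "You are a data-privacy assistant. Using ONLY the context below, " +
        "keep what is relevant to the question, drop the rest, and rephrase " +
        "it so it is concise and actionable.\n\nContext:\n" + context,
    },
    { role: "user", content: question },
  ];
}
```

The returned array would be passed as the `messages` parameter of a chat completion call.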

What I like with this approach is the ability to retrieve a good diversity of information from my document. Diversity is important because some questions (such as [2] & [3]) involve dealing with cross-topic knowledge.

What I don’t like is that I can sometimes get “false positives” / unrelated information. I can also end up retrieving information that is a bit vague or not entirely applicable to the question → the more complex or broad the question, the bigger the issue.

What I also don’t like is that I need several prompts to reach a decent result (which can quickly become expensive).

Could multi-shot work as an enhancement?

As far as I understand, the approach above is not multi-shot per se but rather context stuffing. To do real multi-shot, I would have to come up with pairs of questions and answers.

But I am not sure about the benefits of shifting from context stuffing to multi-shot:

  • Can this really bring better accuracy & value compared to context stuffing?
  • If I want to cover the complexity of data privacy, I assume I’ll have to come up with a very large number of questions. In my current document, what used to be a single paragraph on a certain topic will suddenly become a list of 10-20 question-answer pairs to make sure I cover every aspect of that topic.
  • Can multi-shot maintain the right level of information diversity? Some questions are broad or complex; they require retrieving information from several topics. Intuitively, I would assume that question-answer pairs would narrow the prompt intent and make it less capable of assembling diverse information.

So my questions at this point are:

Can we consider this a case where the pattern hits the prompt’s limits, given the complexity described?

If I make the effort of building so many question-answer pairs, what’s the point in keeping a multi-shot approach rather than feeding them to a fine-tuned GPT-3.5?

Thanks a lot again for your help

You can train a classifier which classifies the question as “0” = Simple, “1” = Open, “2” = Decision. And then do something different in the back end depending on the classification.

Another approach is to create your own embedding-based classifier. So embed all of these questions with their 0, 1, 2 labels, and if a new question has high cosine similarity with the already-labeled questions, go down the appropriate path. This has the advantage of being tunable on the fly, since it’s data driven, unlike the lock-in you get from a fine-tune (see below).
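A nearest-neighbour sketch of that router, with toy 3-d vectors standing in for real ada-002 embeddings:

```javascript
// Cosine similarity between two vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Route a new question to the label (0 = Simple, 1 = Open, 2 = Decision)
// of its most similar already-labeled question.
function classify(queryVec, labeled) {
  let best = { label: null, score: -Infinity };
  for (const { vec, label } of labeled) {
    const score = cosine(queryVec, vec);
    if (score > best.score) best = { label, score };
  }
  return best.label;
}
```

In practice you would also want a similarity threshold below which no path is chosen, and you can extend the labeled set at any time without retraining anything.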

The fine-tune locks you in, and is hard to “untrain” if something changes. A dynamic data (database correlation → prompt) driven thing can be changed on the fly.

Your diversity in search is coming from MMR, or maximal marginal relevance. That could be a problem too if you only have 50k words total, since it will put you in the weeds quickly. If your prompts are NOT diverse enough (if the window size is hit), make multiple API calls with the various prompt variations and consolidate on the back end.
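For readers unfamiliar with MMR, here is a toy sketch of the greedy selection it performs: pick documents relevant to the query but dissimilar to what has already been selected, with `lambda` trading relevance (1.0) against diversity (0.0). Vectors are illustrative stand-ins for embeddings:

```javascript
// Cosine similarity between two vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedy maximal marginal relevance: returns k document indices.
function mmr(queryVec, docVecs, k, lambda = 0.5) {
  const selected = [];
  const remaining = docVecs.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = -1, bestScore = -Infinity;
    for (const i of remaining) {
      const relevance = cosine(queryVec, docVecs[i]);
      // Penalize similarity to anything already selected.
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => cosine(docVecs[i], docVecs[j])))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    selected.push(bestIdx);
    remaining.splice(remaining.indexOf(bestIdx), 1);
  }
  return selected;
}
```

Note how a near-duplicate of an already-selected document scores poorly even when it is highly relevant, which is exactly why MMR can wander "into the weeds" on a small corpus.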

Or go with a keyword search as well to get more diversity in search.

But more diversity can lead to unstable answers, so you need to average and consolidate (more work).

So less diversity on such a small corpus might be better? Maybe? You’d have to try.


I encountered this same thought with my database. In a single conversation pair it’s very easy to return information. But as the conversation continues, the context becomes more chaotic, and so does the potential for misunderstanding.

Questions can depend on previous questions or statements. There may be comparative questions. Some questions are actually a bunch of questions all jumbled together. Some questions are open-ended and actually require an opinion based on a grouping of documents. Some questions can be outright wrong and not what the user actually intends to say. I am in a very specific field and deal with customers who use terminology incorrectly, but I have adapted to understand what they “intend to say”.

There are solutions to the above. Standalone queries (asking GPT to gather the context and remove any contextual dependencies from the question).

Recommendations & aggregations (Gathering and distilling the “recommended” documents for GPT to form an opinion/recommendation)

So what you are looking for is a very complex database that can manage all these different types of queries. Fine-Tuning may be a part of the puzzle but definitely is not the solution.

That is a perfect starting point. Then you can build encapsulated methods to handle each type of query.

I recommend using a powerful database such as Weaviate.


All good answers by Curt & Ruckus

I believe you can accomplish this goal using RAG and fine-tuning combined; here’s an example from the OpenAI cookbook :laughing:


I think another thing this brings up is what I call “Context Management Orchestration” or CMO for short.

So, for example, you maintain the past N turns of the conversation between user and assistant. For the most recent 1 or 2 turns, you embed them, retrieve the content, and in the system message you “whisper” in the AI’s ear something like: “The following information may be helpful for you in your next response: --> Retrieved Information Here <--”.

So in your next API call, you send the entire past N turns, plus an updated system message (at the end of the array to give it prominence), which contains more of a “suggested RAG” collection of information for the AI to act on.
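A minimal sketch of assembling that call, assuming `history` is your stored turn array and `retrieved` is whatever your embedding search returned (`buildCmoMessages` is a made-up helper name):

```javascript
// Keep the last N conversation turns, then append a system message at the
// end of the array (for prominence) carrying the "suggested RAG" content.
function buildCmoMessages(history, retrieved, n = 6) {
  const recent = history.slice(-n); // past N user/assistant turns
  return [
    ...recent,
    {
      role: "system",
      content:
        "The following information may be helpful for you in your next " +
        "response: --> " + retrieved + " <--",
    },
  ];
}
```

Because the injected message only "suggests" the retrieved information, the model is free to ignore it when the conversation has moved on, which is what keeps the exchange flowing.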

This will maintain the conversational flow, plus the system adds new information, and you are letting the AI decide if it’s relevant, to make for a smoother transition.

So don’t think RAG is only answering the question without prior context. No, you NEED prior context to smooth out the exchange. You need CMO.

With CMO and “suggested RAG” (via system injection), the conversation will grow organically, similar to how human conversations evolve and shift over time.


Hi everyone

First, thanks a lot for all the insights. I took some time to dig into each suggestion, read a few other threads, went through the different cookbook recommendations, etc. I think it’s getting clearer for me. Now the real thing is probably about being smart in finding the right solution & balancing it against complexity.

Overall, I tried to imagine the most complex / advanced workflow, shown below. I’m not saying I would implement everything, but I will probably consider a few pieces of that workflow, try them & see how my program’s accuracy improves.

What do you think? Are there things you think might be irrelevant or just over-engineered compared to the need?


The diagram depicts very clearly what I had in mind.

I don’t think it’s over-engineered or too advanced, especially given what you are shooting for long term.

But you do seem more question/answer based, at least in appearance. I don’t see any CMO in there for step 7 … maybe it’s not a big deal, but the bot could appear to be, well, a robot! Maybe you don’t care. :rofl:

But a solid starting architecture.

Thanks for the feedback.

I haven’t integrated the CMO part because I felt it was more suited to a conversational approach, while my goal is rather to create a one-shot Q&A system for now. Also, I am not sure how I could actually create a chat system using the architecture depicted earlier. I think this architecture could help a lot with accuracy, but it would also come with pretty high latency IMO, making it impossible to build a decent chat system.

The search part for RAG should take less than a second.

The LLM inference latency can be mitigated by streaming the response to the user.