Seeking advice on Assistant API 2.0 Beta: RAG vs Fine-tuning for Structured Output

Hi everyone! First-time poster here, and fairly new to working with OpenAI’s APIs. I’d really appreciate any guidance or recommendations on my current project.

Current Setup
I’m working with the Assistants API 2.0 Beta in C# (using the official OpenAI C# library) to build an AI implementation that:

  • Needs to work with specific domain knowledge (~35 000 tokens)
  • Must output structured data (specifically a JSON schema array)
  • Performs classification tasks (more complex than basic sentiment analysis, with multiple categories and reasoning)
  • The input, output, and assistant instructions themselves aren’t that long (~125 tokens for one classification output, about the same for the input plus a picture, and ~1,200 tokens for the instructions)

Current Implementation
Since file search doesn’t work in combination with structured output, I’m currently using a function call to provide the necessary domain knowledge to the model. While this works okay, I’m seeing escalating token costs as conversations and the number of threads grow. The plan is to run roughly a thousand classifications a day.

The Challenge
I’m looking to optimize this setup and reduce redundancies / bring down cost in the long run. I’m considering two main approaches:

  1. Fine-tuning GPT-4 specifically for / with my JSON array output and domain knowledge
  2. Implementing a RAG (Retrieval-Augmented Generation) solution

Questions

  1. For my specific use case (classification with structured output), would RAG or fine-tuning be more suitable?
  2. Is it correct that GPT-4 fine-tuning isn’t currently supported in the Assistants API 2.0 Beta? (I’ve seen mentions that only GPT-3.5 Turbo is supported)
  3. How complex/time-consuming is implementing a RAG solution, typically? I’m worried it will be both.
  4. If fine-tuning is the better approach, how would you structure the examples to achieve both the JSON array output format and incorporate the domain knowledge?

I’m happy for any pointers, recommendations, or insights from those with more experience. Thank you so much in advance for your help and time!

Foreword

Often, people think they are asking specific questions that get them closer to a solution, when it actually takes broader expertise to even ask the right questions, and an understanding of the overall goal to answer them appropriately.

So I must take a step back, and imagine the application you want to build.

What is being classified, from what?

You have a task, re-described as “classify these 1200 tokens, but you need to know about 35000 more tokens”. Or similar?

Perhaps to be run in an automated fashion on lots of data?

That amount of input fits within the context window of a 128k gpt-4o model (requiring tier 2 or a higher payment tier to allow even a single request of that size per minute).

No need to “retrieve”. Nor necessarily to fine-tune.

If you want a 50% discount for running a batch of these overnight, that also rules out using Assistants.
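For reference, the Batch API takes a JSONL file where each line is one complete Chat Completions request. A hedged, illustrative line (model, custom_id, and contents are placeholders) might look like:

{"custom_id": "classify-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "<instructions + domain knowledge>"}, {"role": "user", "content": "<item to classify>"}]}}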

-=-

Getting your structured JSON

Structured output, even strict structured output following a schema, can be obtained by using response_format as an API parameter on Chat Completions. A single API request can provide all your input and get you the JSON output.

Leave the Assistants endpoint behind.
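
Since you’re working in C#, a minimal sketch of that single call might look like the following, assuming the official OpenAI .NET library (the OpenAI 2.x NuGet package); the schema file name and message contents are placeholders:

// Untested sketch: one Chat Completions request with strict structured output.
using System;
using System.Collections.Generic;
using System.IO;
using OpenAI.Chat;

ChatClient client = new("gpt-4o", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

ChatCompletionOptions options = new()
{
    // Strict structured output: the reply must conform to your schema.
    ResponseFormat = ChatResponseFormat.CreateJsonSchemaFormat(
        jsonSchemaFormatName: "classifications",
        jsonSchema: BinaryData.FromString(File.ReadAllText("classification_schema.json")), // hypothetical file
        jsonSchemaIsStrict: true)
};

List<ChatMessage> messages =
[
    new SystemChatMessage("<your ~1,200-token instructions plus the ~35k tokens of domain knowledge>"),
    new UserChatMessage("<the ~125-token item to classify>")
];

ChatCompletion completion = client.CompleteChat(messages, options);
string json = completion.Content[0].Text; // JSON conforming to the schema

No threads, no runs, no polling – one request in, one JSON document out.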

-=-

Fine-tuning into Assistants

Fine-tuning can be done, and the resulting model can then be used. Some models are inexplicably blocked on Assistants even though they would work fine (OpenAI doesn’t trust you not to have damaged function calling or Python, or knows you can’t train on their file search tool?)

Chat Completions has inherent compatibility with all fine-tune chat models.

-=-

Application of fine tuning

Extensive fine-tuning may reduce the need for knowledge input: instead of supplying the knowledge, you demonstrate the task with hundreds or thousands of classification examples to be learned. The AI might then understand the pattern without a whole bunch of knowledge or distracting text in context.

Big upfront cost for parity or reduction in cost, later.
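
As a rough, illustrative sketch of how such training examples could be structured (one JSONL line per example in the fine-tuning file; the bird content and field names are placeholders, not your actual data):

{"messages": [{"role": "system", "content": "Classify the sighting into a species category and explain your reasoning."}, {"role": "user", "content": "Colorful feathers, two legs, appears to mimic human speech."}, {"role": "assistant", "content": "{\"results\": [{\"category\": \"parrot\", \"reasoning\": \"Colorful feathers and speech mimicry match the parrot description.\"}]}"}]}

Hundreds or thousands of such pairs teach the output format and the classification behavior at once.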

json_schema response format issues

You would train on the same JSON output that a strict structured schema will produce for you. However, you can’t fully emulate the backend JSON response recipient that the AI produces before its JSON output, nor the system-message injection of a schema. Those are undocumented, and a full “API simulator” for fine-tune training is not provided. Functions were only recently added to fine-tuning, after OpenAI devised a complex obfuscation method.

Moving forward

Consider how you can construct all of the input messages needed to have the AI write your JSON. How you would automate that. Produce a json_schema for the output you desire. Qualify performance against existing models.
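
As one hedged example, the json_schema behind the hypothetical classification_schema.json above could look roughly like this (an array of classifications, each with a category and a reasoning string; field names are placeholders):

{
  "type": "object",
  "properties": {
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": { "type": "string" },
          "reasoning": { "type": "string" }
        },
        "required": ["category", "reasoning"],
        "additionalProperties": false
      }
    }
  },
  "required": ["results"],
  "additionalProperties": false
}

Strict mode requires every property to be listed in required and additionalProperties to be false, as here.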


First of all, thank you so much for such a swift and comprehensive response! I really appreciate the time you took to provide such detailed guidance.

I hope I understood your points correctly (please correct me if I misinterpreted anything):

Regarding the knowledge base:

  • Current size: ~35k tokens
  • Type: Think of descriptions of specific types – birds, for instance. So essentially: this bird has brown feathers and two legs, makes this and that sound, etc.
  • Usage: The GPT should use the description as the basis for its choice and for a reasoning statement explaining why it classified it that way. For example: “I classified it as a parrot because its feathers are colorful and it seems to talk.” For this classification it should use descriptive text as well as a picture.
    Like I said, potentially we are looking at around a thousand classifications a day, so I have token costs in the back of my mind. It should run automated in the background, but the results will be checked by a human.

Right now 10 classifications cost me around 4 cents, so that would bring the cost to around 4 bucks a day without batching – which isn’t really suitable, because the results should appear rather quickly.

Current Understanding from Your Response:

  • Chat Completions with response_format is preferable to Assistants API
  • If I want to use fine-tuning, as I suspected, I would have to ditch the Assistants API anyway for the Chat Completions API
  • Fine-tuning has limitations with JSON schema responses
  • The full context could fit in GPT-4’s 128k window, so I should just use that instead of a function call or RAG
  • Fine-tuning might reduce costs later but has significant upfront investment.

Questions:

  1. If I understand you correctly, you would just send the domain knowledge each time via the first prompt or a system prompt? I’m worried about the token cost, and about recall getting wonky once a thread gets too long and the context fills up.

  2. May I ask why you would prefer the chat completions API over the assistants API?

  3. Even though it fits in GPT-4’s context window, wouldn’t RAG be more efficient for selectively using this knowledge? Specifically:

    • Could reduce tokens per request
    • Might improve response relevance
    • Easier knowledge base updates
  4. If implementing RAG:

    • I assume it would be complicated to set up and maintain – or is that not the case at all? Until now I’ve worked with file search in the Assistant, which worked really well but doesn’t work with structured output.
  5. Regarding fine-tuning:
    • You mentioned training on JSON output with schema would be complex - could you elaborate on the challenges?

The Chat Completions approach with response_format sounds promising, but I’d love to understand how to best structure the knowledge integration, especially considering potential future scaling and optimization needs.

Again, thank you for taking the time to provide such detailed guidance. I apologize if I’ve misunderstood any aspects of your response - I’m still learning and trying to grasp these concepts fully. Your expertise is incredibly helpful in navigating these decisions.

Have you tried simply adding the 35k token knowledge as a file to the assistant and enabling search?

Hey jlvanhulst, thank you for your response.

I did try that and it works on its own – just not at the same time as structured outputs (it turns off as soon as you turn file search on in the playground).

Basically I am looking for an easy way to use both, without the token costs running unnecessarily high.

I searched the forum for how others did it, and it seems like a common approach is just to use two assistants “in a row” (one with file search, one with structured outputs). But I don’t see how that helps me bring down the cost. But maybe I am mistaken?

You can do both, depending on the model. Also – ‘in the past’, when structured outputs did not formally exist, I simply prompted the JSON output. Especially with the pretty straightforward structure you seem to have, that should be no problem?

Let me add to that: in my experience, only 3.5 is not super reliable with the ‘output prompting’; all the current models are really good as long as you explicitly prompt the details about your output.

That’s interesting to hear – can you tell me which model? Right now I am using 4o-08-16, which works fast and reliably for me – just not with both of those things turned on.

I did similar things in the past, but with the scale of this project I am afraid of the model failing to adhere to the output format. I had that in the past with GPT-4-Turbo – not excessively, but this project is on a different scale. I am also worried because even OpenAI themselves talk about that issue (see my graph).

I only have my anecdotal evidence – so by no means rely on it. I guess it all really depends on how complex your reply is and how precisely it needs to match. The two-step approach would obviously solve it quickly as well – get one answer and then use the JSON assistant to fix it.

Separate from that, you have the question of the reliability of the actual answers, RAG vs fine-tuning. Here too, it’s probably about how good your RAG is vs how good you need it to be, and how much better fine-tuning can be. Fine-tuning for this use case is pretty easy, I think (I assume you have plenty of answer pairs).

Assistants offers:

  • Server-side chat history of a thread - for conversational applications
  • Built-in tools: file_search over document extraction chunks, and a Python code interpreter

Assistants requires:

  • multiple calls, management of the state of remote objects, more programming to fit someone else’s pattern

Assistants drawbacks:

  • None of the tool instructions are yours; they are pre-written and often contain AI instructions counter to your use.
  • Lower performance

What is RAG for classification

RAG is retrieval-augmented generation. Typically search-powered.

Providing everything the AI needs to know about a fixed data set is sort of “retrieval”, even if it never changes and is not optimized by a search.

If you were using ChatGPT Enterprise, their (poor) strategy is to fill the context with equal amounts from all documents up to 5k tokens each, and then more from document search up to 110k. Auto-bloat. So your case of 35k documentation is not as bad.

Better knowledge technique for classification

However, you can run a multi-stage AI classification, drilling down a hierarchy. This can reduce total costs despite multiple AI calls, each with a specific prompted purpose, and it improves the AI’s attention to what is important in the end.

Example:

You are an OpenAI support assistant. Pick which of these categories (max 3) best fits the question - which best would have answers:

Pre-sales, ChatGPT
Pre-sales, API
Account and subscriptions, ChatGPT
Account and payments, API
Product features, ChatGPT
Bug report, ChatGPT
API development help

then another round:

You are an OpenAI support assistant. Pick which of these categories (max 3) best fits the question - which best would have answers:

From “API development help”

  • Chat Completions, interactive chatbot
  • Chat Completions, automation
  • Chat Completions, usage
  • Audio, transcriptions from
  • Audio, TTS
  • File storage
  • Assistants endpoint
  • Fine-tuning

We’re getting closer:

From “API development help”-> Chat Completions, automation
– batch processing
– scripting jobs
– parallelization and rates
– …

From “API development help”-> Fine-tuning
– suitability for applications
– API methods
– Constructing JSON API calls
– Supported/unsupported features

You can see that with a few simple AI categorizations, playing “20 questions”, we’ve narrowed impossible knowledge down to that which can drive an answer for this very forum topic.

This is still RAG, but instead of 35k, maybe you have 2k + 2k + 5k. It is also appropriate for categorization and classification, far more reliable than semantic search when you have prepared the data.
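
To make the drill-down concrete, here is a rough, untested C# sketch of two rounds, again assuming the official OpenAI .NET library; the hierarchy, category names, and knowledge chunks are hypothetical placeholders for your own prepared data:

using System;
using System.Collections.Generic;
using OpenAI.Chat;

// Hypothetical hierarchy: top-level category -> (sub-categories, knowledge chunk).
// In practice this comes from your own prepared 35k-token knowledge base.
var hierarchy = new Dictionary<string, (string[] Subs, string Knowledge)>
{
    ["Parrots"]   = (new[] { "Macaw", "Cockatoo", "Budgerigar" }, "<~5k tokens about parrots>"),
    ["Songbirds"] = (new[] { "Robin", "Finch", "Sparrow" },       "<~5k tokens about songbirds>"),
};

ChatClient client = new("gpt-4o-mini", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

string PickCategory(string context, string itemText, IEnumerable<string> categories)
{
    ChatCompletion completion = client.CompleteChat(new ChatMessage[]
    {
        new SystemChatMessage(
            "Pick the single category that best fits the item. Answer with the category name only.\n"
            + "Categories:\n" + string.Join("\n", categories) + "\n\n" + context),
        new UserChatMessage(itemText)
    });
    return completion.Content[0].Text.Trim();
}

string item = "<the ~125-token description to classify>";

// Round 1: coarse pick over the top-level categories only – small prompt, no knowledge yet.
string branch = PickCategory("", item, hierarchy.Keys);

// Round 2: only that branch's knowledge chunk goes into context, not the full 35k tokens.
// (Real code should validate that the model returned a known category name.)
var (subs, branchKnowledge) = hierarchy[branch];
string finalCategory = PickCategory(branchKnowledge, item, subs);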


Structured response_format fine-tuning?

  • You can train an AI on making JSON.

  • You can’t train it on making response_format=json_schema, schema=…

The output of response_format=json_schema that you don’t see may look like:

<special token>to=Responses.my_structured_name<special token>{"...

This can be confirmed by the excess output token consumption of a longer schema name, not part of a JSON you actually receive.

You cannot replicate this, so your entire fine-tune would be training against the pattern the AI already knows from OpenAI’s post-training. Even when enforced by a logit grammar, once you are inside strings and need your categories, the output would look different from all your fine-tune training – it gets close, but no cigar.

Then there is the fact that the input schema is placed into the system message in a particular format. If you reverse-engineer exactly how that # Responses section is constructed, you can copy it, but it is tedious.

Strict structured output basically gives you almost unbreakable JSON, which is highly desirable – but no fine-tune mechanism is offered for it.