Reducing token usage while hinting the LLM as it generates

I’m currently using gpt-3.5-turbo-instruct for my LLM application, which involves generating descriptions of DDL schemas for databases and building SQL queries. (I chose the instruct model because I’ve noticed that API calls to gpt-3.5-turbo take considerably longer for me, sometimes up to 10 times slower.)
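For context, each request I make is essentially a plain completions call along these lines (a rough sketch; the prompt wording, max_tokens value, and function name are just illustrative):

```typescript
// Minimal sketch: one instruct-model completions call for a schema description.
// The prompt wording, max_tokens value, and function name are just illustrative.
async function describeSchema(ddl: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo-instruct",
      prompt: `Describe the following DDL schema in plain English:\n\n${ddl}\n\nDescription:`,
      max_tokens: 256,
      temperature: 0,
    }),
  });
  const data = await response.json();
  return data.choices[0].text;
}
```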

For improved accuracy and to mitigate unpredictability, I’ve been sending multiple smaller requests to incrementally construct a larger final result step by step, providing hints to the model at each step. However, I’ve realized that this method can be significantly more expensive than simply generating the entire text in a single request, even if the latter produces more tokens. This is because OpenAI charges for the tokens in both the prompt and the completion, which makes my incremental approach 5 to 10 times costlier, since my prompts are much larger than the completions I generate at each step.
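To make the cost difference concrete, here is the rough shape of the arithmetic (the numbers are made up, only the structure matters):

```typescript
// Rough arithmetic behind the cost difference; all numbers are invented,
// only the shape of the calculation matters.

// Single large request: one prompt, one big completion.
const singleRequestTokens = 2000 /* prompt */ + 1000 /* completion */; // 3000 billed

// Incremental approach: the large prompt (plus accumulated hints) is re-sent every step,
// while each step only produces a small completion.
const steps = 5;
const incrementalTokens = steps * (2200 /* prompt + hints */ + 200 /* completion */); // 12000 billed

console.log(incrementalTokens / singleRequestTokens); // => 4, i.e. ~4x the billed tokens
```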

Is there maybe a way to use the ChatGPT API to fill gaps in a large prompt, rather than only generating completions appended to an existing prompt? That is fairly easy to implement with local LLMs without adding any extra overhead, but I am not sure whether it is possible with the ChatGPT API.
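Concretely, the kind of call I have in mind would look something like the sketch below. I believe the legacy completions endpoint documents a `suffix` parameter for insertion-style requests, though I haven’t verified which models actually honor it:

```typescript
// Hypothetical gap-filling request via the legacy completions endpoint's `suffix`
// parameter: the model generates text that fits between `prompt` and `suffix`.
// Whether a given model honors `suffix` is something to verify first; the DDL is a placeholder.
async function fillGap(before: string, after: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo-instruct",
      prompt: before, // text before the gap
      suffix: after,  // text after the gap
      max_tokens: 128,
    }),
  });
  const data = await response.json();
  return data.choices[0].text; // the model's proposal for the gap
}

// e.g. fillGap("CREATE TABLE orders (\n  id SERIAL PRIMARY KEY,\n", "\n);")
```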

While I’m aware that fine-tuning the model is a solution, allowing for more extensive outputs in one go, I’m hoping there might be another workaround like the one I described above. Any suggestions on how to reduce token consumption without compromising the quality of the output would be greatly appreciated.

Thank you in advance for your help!


I recommend using gpt-3.5-turbo-0613 instead of gpt-3.5-turbo-instruct for LLM applications. However, this might not reduce the number of tokens used.

Can you please explain why gpt-3.5-turbo-0613 is better than gpt-3.5-turbo-instruct? I thought the only difference between them is that gpt-3.5-turbo was additionally trained with RLHF to make it more likely to follow user requests and multi-turn dialog structure. The main problem for me is that I am getting very slow responses from API calls to chat models compared to the same models on the OpenAI website and playground, and compared to API calls to instruct models. I really need to know whether chat models are better in some way than instruct models.

The way a model responds in a chat completion task and how it reads the current context can be somewhat different and more accurate.
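For comparison, the same kind of request phrased for the chat completions endpoint looks roughly like this (the system message wording is just an example):

```typescript
// The same schema-description request, sent to the chat completions endpoint instead.
// Splitting the instructions into a system message is what changes how the model reads the context.
async function describeSchemaChat(ddl: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo-0613",
      messages: [
        { role: "system", content: "You describe database DDL schemas in plain English." },
        { role: "user", content: ddl },
      ],
      temperature: 0,
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}
```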


Additionally, there’s another method to reduce tokens for LLM applications: set a maximum token limit based on the user’s input (not via the request payload).

Example (when checking usage):
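A minimal sketch of the idea, assuming the budget is derived from the user’s input length (the formula, cap, and names are arbitrary):

```typescript
// Sketch: derive a completion-token budget from the user's input and check the
// `usage` object the API returns, rather than hard-coding max_tokens in the payload.
// The budget formula and the cap are arbitrary illustrations.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

function withinBudget(userInput: string, usage: Usage): boolean {
  // e.g. allow roughly 4 completion tokens per word of user input, capped at 1000
  const budget = Math.min(userInput.split(/\s+/).length * 4, 1000);
  return usage.completion_tokens <= budget;
}
```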

This may be challenging to explain, but for a developer it’s feasible to implement, e.g. using TypeScript.

I can only speak to your second point about context for sub-operations. What I’m doing personally is letting the AI generate (during the initial step, or in the previous step) a “task” string property that replaces the current context for the subtask (I try my best to inform it as such in the descriptions).

So, while the original prompt is large, these smaller requests could use an extract or a segment made specifically for that subtask, one that includes only the context needed to create that specific table and nothing else (possibly).
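Roughly the shape of what I mean (all the names here are just my own convention; `complete` stands in for whatever wrapper you use around the API call):

```typescript
// Rough shape of the "task" pattern: a planning call produces self-contained task
// strings, then each subtask call uses only its task string instead of the full prompt.
interface Subtask {
  table: string; // which table this subtask is about
  task: string;  // self-contained description carrying only the context this subtask needs
}

async function runSubtask(subtask: Subtask): Promise<string> {
  // The short task string replaces the large original prompt entirely.
  const prompt = `${subtask.task}\n\nWrite the SQL to create table "${subtask.table}":`;
  return complete(prompt);
}

declare function complete(prompt: string): Promise<string>; // hypothetical thin API wrapper
```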

It’s not. gpt-3.5-turbo-instruct is precisely the tool for executing instructions for generating and processing text, without an AI that says “Sure, I can do that for you”, writes extra summary, and puts things in code blocks like ChatGPT would.

The chat models will give somewhat more logical answers because of the amount of yes/no problem solving they were trained on. But they also suffer degradation from the tons of chat-user topic denials, anti-hallucination denials, warnings, and disclaimers built in, which show the AI NOT performing the desired task, or NOT writing to the desired length.

Fine-tuning is not a solution if your concern is at all related to cost.

You already have a good approach: the more instructions you provide at once, the more they will be intermingled, and you are also on the right model for step-by-step production. Giving the gpt-3.5-turbo “chat” model multiple chain-of-thought outputs to produce now just gets them ignored completely.