We have a workflow that works well with the Chat Completions API (sketched in code after the list below):
- Give the system instruction (1600 tokens, constant)
- Give the user input (~300 tokens, variable)
- Consume the GPT response (~5 tokens, variable)
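For reference, each request looks roughly like this. This is a minimal sketch using the `openai` Python SDK (v1.x); the model name and instruction text are placeholders, not our actual values:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for the constant ~1600-token instruction.
SYSTEM_INSTRUCTION = "...your 1600-token instruction..."

def ask(user_input: str) -> str:
    # Every call resends the full instruction alongside the variable input,
    # so all ~1900 prompt tokens are billed at the normal input rate.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},  # ~1600 tokens, constant
            {"role": "user", "content": user_input},            # ~300 tokens, variable
        ],
    )
    return response.choices[0].message.content
```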
Since the instruction is always the same, we'd rather not pay full rate for those 1600 tokens on every single request.
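To put a number on it, here is a quick back-of-the-envelope calculation using the figures above:

```python
# Rough per-request breakdown using the numbers above.
instruction_tokens = 1600  # constant system instruction
input_tokens = 300         # variable user input

prompt_tokens = instruction_tokens + input_tokens  # 1900 per request
share = instruction_tokens / prompt_tokens         # ~0.84

print(f"{share:.0%} of every prompt is the repeated instruction")
# -> 84% of every prompt is the repeated instruction
```

So roughly 84% of what we pay for on the input side is identical from request to request.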
With the Chat Completions API, we don't seem to have any way around this: every call has to resend the full system message.
With the Assistants API:
- We can pre-load the instruction onto the assistant itself, but it still seems to be billed at full rate on every run within the thread (see the sketch after this list)
- We can upload the instruction as a file to be retrieved. But, although the docs don't seem to mention this, our testing shows the retrieved content still counts as input tokens, and seemingly even more tokens than a plain instruction, presumably because retrieval injects file chunks back into the context.
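For context, here is roughly what our pre-loading attempt looks like. This is a sketch against the beta Assistants endpoints of the v1 Python SDK; the model name and message contents are placeholders:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_INSTRUCTION = "...the same ~1600-token instruction..."

# One-time setup: attach the instruction to the assistant itself.
assistant = client.beta.assistants.create(
    model="gpt-4o-mini",  # placeholder model name
    instructions=SYSTEM_INSTRUCTION,
)

# Per request: a fresh thread, one user message, one run.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="...the ~300-token user input...",
)
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
# (Polling the run until completion is omitted for brevity.)
# Once the run finishes, run.usage still reports the pre-loaded
# instruction as ordinary prompt tokens billed at the full input rate.
```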
So is there any way we can be more efficient than repeating our instruction for every single request?