We have an invoicing website that uses OpenAI’s SDK to scan images/PDFs of invoices and auto-fill fields in the application.
When we make a request through the SDK, we provide three messages (a rough sketch of the request is shown after the list):
1) Custom context about the user, so the AI can map data from the scanned document to the provided context data. This is different for every single user.
2) Static instructions on how to read the invoices. This is always the same for everyone.
3) The user-provided image/PDF.
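For reference, this is roughly what our current chat-completions request looks like with the Python SDK. The model name, prompt strings, and variable contents below are placeholders, not our real values:

```python
from openai import OpenAI

client = OpenAI()

# Placeholders standing in for our real data:
per_user_context = "User 'foo': customer ID 123, default currency EUR, ..."        # different per user
static_invoice_instructions = "You read invoices and return the fields as JSON."   # same for everyone
invoice_image_data_url = "data:image/png;base64,..."                               # the uploaded invoice

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # 1) custom context about the user
        {"role": "system", "content": per_user_context},
        # 2) static instructions on how to read invoices
        {"role": "system", "content": static_invoice_instructions},
        # 3) the user-provided image/PDF
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the invoice fields."},
                {"type": "image_url", "image_url": {"url": invoice_image_data_url}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```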
Since we want to minimise token usage and utilise the caching mechanisms as much as possible, we are wondering whether we should stick with plain chat completions from the SDK or switch to OpenAI’s threads (Assistants API).
Our idea is that every single user will get their own thread, so message 1) is cached per user: user “foo” will have their own cache for message 1), and user “bar” will have their own separate cache as well. This should theoretically save some token usage, but we are not sure exactly how OpenAI’s caching mechanism works. A rough sketch of what we have in mind is below.
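This is roughly what we imagine the per-user-thread setup would look like with the Assistants API beta endpoints (the assistant ID, file ID, and the in-memory dict below are placeholders; in practice the thread ID would be persisted per user in our database):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: in reality this mapping lives in our database.
user_threads = {}  # e.g. {"foo": "thread_abc", "bar": "thread_def"}


def get_or_create_thread(user_id: str, user_context: str) -> str:
    """One thread per user; the per-user context is added once as the first message."""
    if user_id not in user_threads:
        thread = client.beta.threads.create()
        client.beta.threads.messages.create(
            thread_id=thread.id,
            role="user",
            content=user_context,  # message 1): custom per-user context
        )
        user_threads[user_id] = thread.id
    return user_threads[user_id]


def scan_invoice(user_id: str, user_context: str, invoice_file_id: str):
    thread_id = get_or_create_thread(user_id, user_context)
    # Append the scanned document as a new message on the user's thread.
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=[{"type": "image_file", "image_file": {"file_id": invoice_file_id}}],
    )
    # Message 2), the static invoice-reading instructions, would live on the assistant itself.
    return client.beta.threads.runs.create_and_poll(
        thread_id=thread_id,
        assistant_id="asst_placeholder",  # placeholder assistant ID
    )
```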
Our concerns are:
- Do chat completions keep multiple caches, or does a new cache overwrite the old one?
- Do threads keep their own cache?
- When using threads, does every new thread created with new user context cost tokens?
- When using threads, does every additional message grow the context, meaning an increased price per message?