I’m building software that automates light proofreading using ChatGPT. The software parses text from a large manuscript, sends the plain text to ChatGPT in chunks, and hands the proofread responses back to the user. For the system prompt I’m giving ChatGPT a style guide.
I’m wondering which API I should use. The Assistants API is nice because I can store the style guide as a file, but since it’s such a large amount of text I’m not sure if I should use the Batch API instead… please advise :)
Here’s a bit more info about my needs:
— I’ll be making 25-30 calls to the API (chunking the manuscript to stay within the context window limits; rough chunking sketch after this list)
— Each manuscript is around 100,000 tokens and the style guide is around 7,000 tokens
— This software will only be fired up and used around 50 times a year
— I’d like to be able to fine-tune the model later on down the road
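For context, here’s roughly how I’m planning to do the chunking (just a sketch using tiktoken; the 3,500-token chunk size is a placeholder, and in practice I’d probably split on paragraph boundaries so sentences aren’t cut in half):

```python
import tiktoken

def chunk_manuscript(text: str, max_tokens: int = 3500) -> list[str]:
    """Split the manuscript into pieces of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 models
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```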
The context length of GPT-4 Turbo is 128K tokens, but you will get a maximum output of 4,096 tokens. I suggest using GPT-4 to make sure the model understands the 7K-token style guide and applies it consistently.
Using the Batch API means you will get your results at some point within the next 24 hours. If you don’t need results faster, that’s an option.
If, however, you were to take the first 4K tokens of output and merge them back into the manuscript, then send that to the API for the next 4K tokens, you would effectively be multi-shotting the model, since it can use the previous output as an additional example.
Based on this approach, you can start by sending the style guide and the first 4K tokens of the manuscript, and if that works well, decide whether you actually ever need to send more than the next 4K tokens at a time.
(I hope I’m not confusing the context length with the maximum output length here.)
If you used the Batch API, an incremental approach would not be possible.
If you use an assistant, you would have to consider whether you want to keep the previous messages in context, which effectively makes each call more expensive and likely reduces the quality of the results.
I suggest using the completions endpoint and sending the style guide as part of the system prompt.
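Something along these lines, as a minimal sketch with the current Python SDK (the model name and prompt wording are placeholders, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def proofread_chunk(style_guide: str, chunk: str) -> str:
    """Proofread one manuscript chunk, with the style guide in the system prompt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder; use whatever model you settle on
        temperature=0,        # keep the edits as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a proofreader. Apply this style guide:\n\n" + style_guide},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content
```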
P.S. It’s not clear if and when you will be able to fine-tune GPT-4.
Thanks @vb for the reply! I’m planning on using GPT-4 so that I have a large context window, and I’m aware of the output token limit.
Okay, so I’m trying to understand what you mean by merging the first 4K tokens with the manuscript. Here’s how I was planning on making the API calls originally: send the whole manuscript in chunks (4K tokens at a time) and parse the responses into a new, proofread manuscript one at a time. So I send ChatGPT a chunk, it sends back the proofread text (according to the style guide), and I store the response. This continues until the whole manuscript is proofread.
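In code terms, the loop I had in mind is roughly this (just a sketch, assuming manuscript_text and style_guide are already loaded and reusing the helpers sketched above):

```python
proofread_parts = []
for chunk in chunk_manuscript(manuscript_text):
    # Do I really need to pass the style guide on every call? (my question below)
    proofread_parts.append(proofread_chunk(style_guide, chunk))

proofread_manuscript = "".join(proofread_parts)
```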
So if I used the Chat Completions API, I would have to send the style guide in every single API call, right? Or would ChatGPT remember the style guide within the context window?
And what do you mean by an incremental approach not being possible if I used the Batch API? I wouldn’t need the results faster than 24 hours.
How would the assistants API be more expensive? Couldn’t I just open a new thread each time I start proofreading a new manuscript?
Sorry for all the questions, really trying to wrap my head around this! Your answers are a huge help—thanks!
Sorry for not being completely clear. What I mean by ‘merging’ is that if you have a manuscript that has not been proofread, and you get the first 4K tokens back from the model, you can replace this part in the manuscript before sending it to the model in the next turn. However, since you will be sending 4K tokens for proofreading separately, it doesn’t really matter, and it was likely a simple misunderstanding.
The GPT model will have to read the style guide as input every time you request proofreading for a part of the manuscript. From this perspective, it does not make a difference whether the file content is attached via the assistant or sent as part of the system prompt via the completions endpoint.
My comment regarding the batch approach is likely also not relevant following our clarification of the process.
The Assistants API handles things in the background that we have no control over, since it is a low-code tool designed to spin up AI apps quickly and easily. If you want to maintain full control, using Chat Completions is the better way. But to be completely honest, I don’t think the difference will be significant if you only plan to run the app 50 times per year. Using the Batch API will be the most cost-effective solution.
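If you do go the batch route, the workflow is roughly: write one chat completion request per chunk to a JSONL file, upload it, and create a batch job with a 24-hour completion window. A rough sketch (the file name, custom_id scheme and model are placeholders, and chunks / style_guide are assumed to exist already):

```python
import json
from openai import OpenAI

client = OpenAI()

# One chat completion request per manuscript chunk, one JSON object per line
with open("proofread_batch.jsonl", "w") as f:
    for i, chunk in enumerate(chunks):
        f.write(json.dumps({
            "custom_id": f"chunk-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4-turbo",  # placeholder
                "messages": [
                    {"role": "system", "content": "Apply this style guide:\n\n" + style_guide},
                    {"role": "user", "content": chunk},
                ],
            },
        }) + "\n")

batch_file = client.files.create(file=open("proofread_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll this later and download the output file once it completes
```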
Thanks for the clarification! Ok, most of that makes sense.
So are you saying that whether I choose the Assistants API or Chat Completions, the style guide will have to be sent with every call?
Is there not persistent memory within the context window for either of those APIs? Maybe I am misunderstanding what you’re saying in your second paragraph…
Yes, that’s what I mean. There is no persistent memory.
I have to apologize for not being completely clear about the actual recommendation from the start.
If it’s necessary for the model to follow the style guide as a whole for each chunk, it will need to read the style guide in full every time before applying the changes to the manuscript.
The Assistants API makes it possible to access knowledge like the style guide to a far greater extent, but it does so by looking up the most relevant bits and pieces and adding them to the conversation every time they are needed (RAG).
In fact, the V1 assistant will always add the whole document if it fits into the context window.
Coming back to your original question, there is nothing inherently wrong with using the Assistants API as long as each chunk is treated like a new conversation. But we know that the Assistants API is still in beta, and you get more reliability with the completions endpoint.
Thanks for the reply! Really big help. So just to clarify…
It is necessary for the model to follow the style guide as a whole for each chunk of text I give it; however, that doesn’t mean I have to send it with each chunk as long as I’m within the context window, right?
For example, if I open a new context window, send ChatGPT the style guide, and have 100K tokens remaining, will it remember the style guide down the road when I have 7K tokens remaining?
I’m just trying to figure out if I need to send the style guide with every chunk if I have a 128k context window.
Yes, you can do it that way, but since you will pay for each token (approximately 0.7 words), your aim should be to be as precise as possible.
Also, the more tokens you send to the model, the higher the likelihood of the model getting confused.
You can optimize by sending the style guide and then checking up to what length of manuscript the model still performs well, for example 10,000 words, 20,000 words, and so on.
It’s possible to speed up this process by making some mistakes on purpose in different parts of the manuscript. Then you don’t have to read the whole response but can just look up the test cases. When you notice that the model doesn’t apply the style guide as intended, you try again with a shorter text.
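A rough way to set that up (just a sketch; the planted errors are made up and would need to match whatever your style guide actually covers):

```python
# Deliberate style-guide violations planted at different depths of the manuscript,
# each paired with the correction the model should produce.
test_cases = [
    {"around_word": 10_000, "planted": "towards the the centre", "expected": "toward the center"},
    {"around_word": 20_000, "planted": "e.g. no comma here",     "expected": "e.g., no comma here"},
]

def check_test_cases(proofread_text: str) -> None:
    """Check only the planted errors instead of rereading the whole response."""
    for case in test_cases:
        fixed = case["expected"] in proofread_text and case["planted"] not in proofread_text
        print(f"around word {case['around_word']}: {'fixed' if fixed else 'MISSED'}")
```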
The question here is whether the job you’re doing benefits from any sort of context management, and I’d say it doesn’t. The completions endpoint would be my go-to here. Your main challenge is text splitting; I don’t think using chat or assistants is going to help with that. Assistants wouldn’t hurt you here, since you’re just creating a new thread and doing one run step, but it’s not necessary.
I think that’s what I’ll do: go with Chat Completions or the Batch API and optimize by testing how far into the manuscript the AI performs well before getting confused, then find the token count where that happens and shrink what I send per call to get the best results.
And yes, I will definitely add some test cases to the manuscript.
Thanks for these suggestions!
I actually have a couple more questions relating to the context window. I’m trying to cement my understanding of how tokens work with the context window…
If my maximum output/input tokens for a certain model are 4K and the context window is 128K, does that mean I could (theoretically, if each of the input and output calls were exactly 4K tokens) only make 32 calls (128K divided by 4K) before the window fills up?
This is a precursor to my second question:
Using the Completions API, is it possible to open/close a certain chat window once the max context tokens have been reached? In this case it would be 128k.
Assistants maintains messages in threads between calls.
Completions AND chat-completions are stateless.
If you are having an AI do proofreading, you can use an input window much larger than the output by having the AI report only the errors in need of correction.
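A sketch of that idea (the output format and prompt wording are just one way to do it):

```python
import json
from openai import OpenAI

client = OpenAI()

def report_errors(style_guide: str, chunk: str) -> list[dict]:
    """Return a list of corrections instead of a full rewrite, so the output stays small."""
    system_prompt = (
        "You are a proofreader. Apply the style guide below, but do not rewrite the text. "
        "Return only a JSON array of corrections, each {\"before\": \"...\", \"after\": \"...\"}, "
        "quoting just enough surrounding text to locate the error.\n\n" + style_guide
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": chunk},
        ],
    )
    # In practice you would want error handling here; models sometimes wrap JSON in prose.
    return json.loads(response.choices[0].message.content)
```

The corrections can then be applied locally with find-and-replace, and because the model no longer has to echo the whole chunk back, each input chunk can be much larger.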