I’m building software that automates light proofreading using ChatGPT. The software parses text from a large manuscript, sends the plain text to ChatGPT in chunks, and hands the proofread responses back to the user. For the system prompt I’m giving ChatGPT a style guide.
I’m wondering which API I should use. The Assistants API is nice because I can store the style guide as a file, but since it’s such a large amount of text I’m not sure if I should use the Batch API instead… please advise :)
Here’s a bit more info about my needs:
— I’ll be making 25-30 calls to the API (chunking the manuscript to stay within the context window limits; rough chunking sketch after this list)
— Each manuscript is around 100,000 tokens and the style guide is around 7,000 tokens
— This software will only be fired up and used around 50 times a year
— I’d like to be able to fine-tune the model later on down the road
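For context, here’s roughly how I’m planning to do the chunking (just a sketch using tiktoken; the 3,500-token chunk size is a placeholder, and in practice I’d probably split on paragraph boundaries so sentences aren’t cut in half):

```python
import tiktoken

def chunk_manuscript(text: str, max_tokens: int = 3500) -> list[str]:
    """Split the manuscript into pieces of at most max_tokens tokens each."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 models
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```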
The context length of GPT-4 Turbo is 128K tokens, but you will get a maximum output of 4,096 tokens. I suggest using GPT-4 to make sure the model understands the 7K-token style guide and applies it consistently.
Using the Batch API means you will get your results at some point within the next 24 hours. If you don’t need results faster, that’s an option.
If, however, you were to take the first 4K tokens of output and merge them back into the manuscript, then send that to the API for the next 4K tokens, you would effectively be multi-shotting the model, since it can use the previous output as an additional example.
Based on this approach, you can start by sending the style guide and the first 4K tokens of the manuscript, and if that works well, decide whether you actually ever need to send more than the next 4K tokens at a time.
(I hope I’m not confusing the context length with the maximum output length here.)
If you used the Batch API, an incremental approach would not be possible.
If you use an assistant, you would have to consider whether you want to keep the previous messages in context, which effectively makes each call more expensive and likely reduces the quality of the results.
I suggest using the completions endpoint and sending the style guide as part of the system prompt.
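Something along these lines, as a minimal sketch with the current Python SDK (the model name and prompt wording are placeholders, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def proofread_chunk(style_guide: str, chunk: str) -> str:
    """Proofread one manuscript chunk, with the style guide in the system prompt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder; use whatever model you settle on
        temperature=0,        # keep the edits as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a proofreader. Apply this style guide:\n\n" + style_guide},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content
```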
P.S. It’s not clear if and when you will be able to fine-tune GPT-4.
Thanks @vb for the reply! I’m planning on using GPT-4 so that I have a large context window, and I’m aware of the output token limit.
Okay, so I’m trying to understand what you mean by merging the first 4K tokens with the manuscript. Here’s how I was planning on making the API calls originally: send the whole manuscript in chunks (4K tokens at a time) and parse the responses into a new, proofread manuscript one at a time. So I send ChatGPT a chunk, it sends back the proofread text (according to the style guide), and I store the response. This continues until the whole manuscript is proofread.
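In code terms, the loop I had in mind is roughly this (just a sketch, assuming manuscript_text and style_guide are already loaded and reusing the helpers sketched above):

```python
proofread_parts = []
for chunk in chunk_manuscript(manuscript_text):
    # Do I really need to pass the style guide on every call? (my question below)
    proofread_parts.append(proofread_chunk(style_guide, chunk))

proofread_manuscript = "".join(proofread_parts)
```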
So if I used the Chat Completions API, I would have to send the style guide in every single API call, right? Or would ChatGPT remember the style guide within the context window?
And what do you mean by an incremental approach not being possible if I used the Batch API? I wouldn’t need the results faster than 24 hours.
How would the assistants API be more expensive? Couldn’t I just open a new thread each time I start proofreading a new manuscript?
Sorry for all the questions, really trying to wrap my head around this! Your answers are a huge help—thanks!
Sorry for not being completely clear. What I mean by ‘merging’ is that if you have a manuscript that has not been proofread, and you get the first 4K tokens back from the model, you can replace this part in the manuscript before sending it to the model in the next turn. However, since you will be sending 4K tokens for proofreading separately, it doesn’t really matter, and it was likely a simple misunderstanding.
The GPT model will have to read the style guide as input every time you request proofreading for a part of the manuscript. From this perspective, it does not make a difference whether the file content is attached via the assistant or sent as part of the system prompt via the completions endpoint.
My comment regarding the batch approach is likely also not relevant following our clarification of the process.
The Assistants API handles things in the background that we have no control over, since it is a low-code tool designed to spin up AI apps quickly and easily. If you want to maintain full control, using Chat Completions is the better way. But to be completely honest, I don’t think the difference will be significant if you only plan to run the app 50 times per year. Using the Batch API will be the most cost-effective solution.
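If you do go the batch route, the workflow is roughly: write one chat completion request per chunk to a JSONL file, upload it, and create a batch job with a 24-hour completion window. A rough sketch (the file name, custom_id scheme and model are placeholders, and chunks / style_guide are assumed to exist already):

```python
import json
from openai import OpenAI

client = OpenAI()

# One chat completion request per manuscript chunk, one JSON object per line
with open("proofread_batch.jsonl", "w") as f:
    for i, chunk in enumerate(chunks):
        f.write(json.dumps({
            "custom_id": f"chunk-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4-turbo",  # placeholder
                "messages": [
                    {"role": "system", "content": "Apply this style guide:\n\n" + style_guide},
                    {"role": "user", "content": chunk},
                ],
            },
        }) + "\n")

batch_file = client.files.create(file=open("proofread_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll this later and download the output file once it completes
```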
Thanks for the clarification! Ok, most of that makes sense.
So are you saying that whether I choose the Assistants API or Chat Completions, the style guide will have to be sent with every call?
Is there not persistent memory within the context window for either of those APIs? Maybe I am misunderstanding what you’re saying in your second paragraph…
Yes, that’s what I mean. There is no persistent memory.
I have to apologize for not being completely clear about the actual recommendation from the start.
If it’s necessary for the model to follow the style guide as a whole for each chunk, it will need to read the style guide in full every time before applying the changes to the manuscript.
The Assistants API makes it possible to access knowledge like the style guide to a far greater extent, but it does so by looking up the most relevant bits and pieces and adding them to the conversation every time they are needed (RAG).
In fact, the V1 assistant will always add the whole document if it fits into the context window.
Coming back to your original question, there is nothing inherently wrong with using the Assistants API as long as each chunk is treated like a new conversation. But we know that the Assistants API is still in beta, and you get more reliability with the completions endpoint.
Thanks for the reply! Really big help. So just to clarify…
It is necessary for the model to follow the style guide as a whole for each chunk of text I give it; however, that doesn’t mean I have to send it with each chunk as long as I’m within the context window, right?
For example, if I open a new context window, send ChatGPT the style guide, and have 100K tokens remaining, will it remember the style guide down the road when I have 7K tokens remaining?
I’m just trying to figure out if I need to send the style guide with every chunk if I have a 128k context window.
Yes, you can do it that way, but since you will pay for each token (approximately 0.7 words), your aim should be to be as precise as possible.
Also, the more tokens you send to the model, the higher the likelihood of the model getting confused.
You can optimize by sending the style guide and then checking up to what length of manuscript the model still performs well, for example 10,000 words, 20,000 words, and so on.
It’s possible to speed up this process by making some mistakes on purpose in different parts of the manuscript. Then you don’t have to read the whole response but can just look up the test cases. When you notice that the model doesn’t apply the style guide as intended, you try again with a shorter text.
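A rough way to set that up (just a sketch; the planted errors are made up and would need to match whatever your style guide actually covers):

```python
# Deliberate style-guide violations planted at different depths of the manuscript,
# each paired with the correction the model should produce.
test_cases = [
    {"around_word": 10_000, "planted": "towards the the centre", "expected": "toward the center"},
    {"around_word": 20_000, "planted": "e.g. no comma here",     "expected": "e.g., no comma here"},
]

def check_test_cases(proofread_text: str) -> None:
    """Check only the planted errors instead of rereading the whole response."""
    for case in test_cases:
        fixed = case["expected"] in proofread_text and case["planted"] not in proofread_text
        print(f"around word {case['around_word']}: {'fixed' if fixed else 'MISSED'}")
```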
The question here is whether the job you’re doing benefits from any sort of context management, and I’d say it doesn’t. The completions endpoint would be my go-to here. Your main challenge is text splitting; I don’t think using chat or assistants is going to help with that. Assistants wouldn’t hurt you here, since you’re just creating a new thread and doing one run step, but it’s not necessary.
I think that’s what I’ll do: go with Chat Completions or the Batch API and optimize by testing how far into the manuscript the AI performs well before getting confused, then find the token count where that happens and shrink what I send per call to get the best results.
And yes, I will definitely add some test cases to the manuscript.
Thanks for these suggestions!
I actually have a couple more questions relating to the context window. I’m trying to cement my understanding of how tokens work with the context window…
If my maximum output/input tokens for a certain model are 4K and the context window is 128K, does that mean I could (theoretically, if each of the input and output calls were exactly 4K tokens) only make 32 calls (128K divided by 4K) before the window fills up?
This is a precursor to my second question:
Using the Completions API, is it possible to open/close a certain chat window once the max context tokens have been reached? In this case it would be 128k.
Assistants maintains messages in threads between calls.
Completions AND chat-completions are stateless.
If you are having an AI do proofreading, you can use an input window much larger than the output by having the AI report only the errors in need of correction.
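A sketch of that idea (the output format and prompt wording are just one way to do it):

```python
import json
from openai import OpenAI

client = OpenAI()

def report_errors(style_guide: str, chunk: str) -> list[dict]:
    """Return a list of corrections instead of a full rewrite, so the output stays small."""
    system_prompt = (
        "You are a proofreader. Apply the style guide below, but do not rewrite the text. "
        "Return only a JSON array of corrections, each {\"before\": \"...\", \"after\": \"...\"}, "
        "quoting just enough surrounding text to locate the error.\n\n" + style_guide
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": chunk},
        ],
    )
    # In practice you would want error handling here; models sometimes wrap JSON in prose.
    return json.loads(response.choices[0].message.content)
```

The corrections can then be applied locally with find-and-replace, and because the model no longer has to echo the whole chunk back, each input chunk can be much larger.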