What's the best API method/model for this use-case scenario?

I have been using Custom GPTs, coupled with retrieval documents, to perform analysis on chunks of text that I append to a specific prompt. The chunks are around 700 words on average. Some of the tasks I ask it to perform are things like summarizing the text into short paragraphs and running various analyses against my knowledge base of documents. It works great.

I then moved all my Custom GPT instructions and knowledge into Assistants, hoping to automate this with the API. From what I understand from reading various posts here, there are cost issues due to context retention when using the API. I have hundreds of these text chunks to ask questions about, so I'm not sure how things might suddenly snowball out of control and rack up hefty charges.

What I'm trying to figure out is the best way to use the API. Is firing over 100 separate requests, each with 700 or so words in the prompt, more cost-effective than sending the text chunk once and asking the 10 or so questions I need against it? And I now see there's a gpt-3.5-turbo-1106 option. Is that available over the API now? Any drawbacks to using it? Is it cheaper? Last time I checked, I thought only gpt-4 turbo 1106 worked with document retrieval.

Any insight appreciated.

Assistants, for now, is really meant for combining functionalities, similar to how a Custom GPT can enable RAG or Code Interpreter by checking a box and then have the requests handled in the back-end.

Arguably, for a straightforward task like summarizing text it may make more sense to use the ChatCompletion / Completion endpoints. If you plan on adding new features in the future (such as Retrieval), I would jump straight into Assistants, knowing that they are going to improve.
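For example, a one-shot Chat Completions request per chunk could look like the sketch below (the model name, system prompt, and helper name are just illustrative placeholders, not anything from the thread):

```python
def build_summary_messages(chunk: str) -> list[dict]:
    """Compose a one-shot summarization request for a single text chunk.
    No conversation history is carried between chunks."""
    return [
        {"role": "system",
         "content": "Summarize the following text into a short paragraph."},
        {"role": "user", "content": chunk},
    ]

# To actually send it (requires the openai SDK and OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo-1106",
#     messages=build_summary_messages("...your ~700-word chunk..."),
# )
# print(response.choices[0].message.content)
```

Because nothing but the current chunk goes into `messages`, there is no context retention to pay for.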

The complaints only concern conversational Assistants. The new (and only) usable GPT-4 model has a 128k context window. Without any control, that means a conversation can potentially cost over $1 per message.

In your use case this is irrelevant. You can simply delete the thread and create a new one, because you don't need the message history.

If you want to ask 10 questions, it makes sense to include them all in one prompt. It would be much cheaper.
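Some back-of-the-envelope token math shows why (the tokens-per-word ratio and question size here are assumptions, so plug in your own numbers and current pricing):

```python
WORDS_PER_CHUNK = 700
TOKENS_PER_WORD = 1.33           # rough average for English text; an assumption
CHUNK_TOKENS = int(WORDS_PER_CHUNK * TOKENS_PER_WORD)  # ~931 tokens
QUESTION_TOKENS = 30             # assumed size of each question
NUM_QUESTIONS = 10

# Option A: one request per question, so the chunk is re-sent every time.
separate_input = NUM_QUESTIONS * (CHUNK_TOKENS + QUESTION_TOKENS)

# Option B: one request carrying the chunk plus all 10 questions.
batched_input = CHUNK_TOKENS + NUM_QUESTIONS * QUESTION_TOKENS

print(separate_input, batched_input)  # 9610 vs 1231 input tokens
```

Under these assumptions the batched request uses roughly an eighth of the input tokens, and the gap grows with chunk size.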


It really depends on what your task is. It would be worth running a sample set on both and comparing the results. GPT-3.5 is much cheaper.

Thanks for the advice!

I actually use about 7 prompts. Only one is a text summary, and gpt-4 turbo 1106 does a great job at that. Then I have one prompt that is just grammar checking; admittedly, I could use a different library for that (which would likely also cost me). The other 5 prompts compare the text to my documents, so I have to use an API that supports that. With 700 or so words and 5 prompts, you're saying it would still be cheaper to send the text once and ask the 5 prompts against it? And for the others, just delete the threads?

I actually found GPT 3.5 turbo to be as good as 4 in terms of summarizing.



Even if your questions rely on the previous answer, you can still get away with a single prompt.

You could probably reduce the token count in step #2 by simply pointing out where the errors are, or even just waiting to perform the corrections at the end, LOL. This is just a quick example. You may also want to try it split up and see if there is a difference in quality. It's always a balancing act :person_shrugging:. My main point is just not to re-send the conversation each time.


Consider it like a formula: once it has run, you want to completely destroy it and start again, or risk side effects (and higher costs).
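In code terms, that pattern looks something like the sketch below. Here `ask` is a stand-in for whatever completion call you use, and the prompt formatting is purely illustrative; the point is that each chunk is built, used, and discarded with nothing carried over:

```python
def process_chunk(chunk: str, prompts: list[str], ask) -> str:
    """Send ONE request per chunk, with every prompt batched together
    and no history carried over from any previous chunk."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(prompts, 1))
    messages = [{"role": "user", "content": f"{numbered}\n\nText:\n{chunk}"}]
    return ask(messages)

def process_all(chunks: list[str], prompts: list[str], ask) -> list[str]:
    # Each chunk is its own "formula": no thread or message list survives it.
    return [process_chunk(c, prompts, ask) for c in chunks]

# A stub in place of a real API call, just to show the request shape:
echo = lambda messages: messages[0]["content"]
results = process_all(["chunk A", "chunk B"],
                      ["Summarize.", "Check grammar."], echo)
```

Swapping `echo` for a real Chat Completions call (or a create-thread / run / delete-thread cycle on Assistants) keeps the same guarantee: costs scale with the chunk, not with an ever-growing conversation.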
