Assistants API Retrieval Free Period Extended by a Month

From: Pricing

$0.20 / GB / assistant / day (free until 04/01/2024)


Companies usually don’t hand out presents without some marketing noise or additional context.

Maybe this means there will be a bigger announcement soon?


Or the lack of any announcement that retrieval will be made better than pasting the first 32k characters of a random document into context anytime soon…

Why only 32K? Why not 128K?
I mean, if we can do it then why not?

You need to check those numbers :slight_smile:

It’s Assistants, and I’m discussing characters, not tokens.

The point is: when retrieval is turned on, a partial document extraction is dumped into the context. It likely has nothing to do with the user input, yet it has a high likelihood of diminishing answer quality and a high propensity for direct regurgitation.

There’s a lot of internal exchange within Assistants (GPTs, Gizmos) limited to 32768 characters.

Since there’s no way to turn off the ChatML “assistant” prompt-role injection, which prevents a “continue” the way ChatGPT allows, we turn to ChatGPT to demonstrate an injection longer than the token count OpenAI wants the model to produce - by whacking ChatGPT’s “continue” 1,536 max_tokens at a time…

If you want a playback of what’s auto-loaded into context, have a look.

Injection after instructions & tools (& GPT text)

You have files uploaded as knowledge to pull from. Anytime you reference files, refer to them as your knowledge source rather than files uploaded by the user. You should adhere to the facts in the provided materials. Avoid speculations or information not contained in the documents. Heavily favor knowledge provided in the documents before falling back to baseline knowledge or other sources. If searching the documents didn’t yield any answer, just say that. Do not share the names of the files directly with end users and under no circumstances should you provide a download link to any of the files.

Copies of the files you have access to may be pasted below. Try using this information before searching/fetching when possible.

The contents of the file retrievalinfo.txt are copied here.

OpenAI documentation:

Documentation does not supersede AI model pretraining. AI should answer from knowledge.

OpenAI documentation: New embeddings models and dimension API parameter!

Native support for shortening embeddings

Using larger embeddings, for example storing them in a vector store for retrieval, generally costs more and consumes more compute, memory and storage than using smaller embeddings.

Both of our new embeddings models were trained with a technique that allows developers to trade-off performance and cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter. For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256

blah blah until the end, or truncation.

End of copied content


(there’s a single-space indent before those lines that doesn’t show in the forum, likely to increase comprehension with leading-space tokens)
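The “shortening embeddings” passage captured in that dump describes removing numbers from the end of the vector without losing its concept-representing properties. A minimal sketch of what that amounts to - truncate, then L2-renormalize - in plain Python, with a made-up toy vector (the real models return 1,536- or 3,072-dimension embeddings; this is just the mechanics):

```python
import math

def shorten(embedding, dimensions):
    """Truncate an embedding to `dimensions` values and re-normalize to
    unit length, per the announcement's description of the new models."""
    cut = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]

full = [0.5, 0.5, 0.5, 0.5]   # toy 4-dim "embedding" (unit length)
short = shorten(full, 2)       # keep only the first 2 dims
print(short)                   # back to unit length after renormalizing
```

With the real API, the quoted text says the same effect is exposed via the `dimensions` parameter on the embeddings endpoint, so you never handle the truncation yourself.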

1 Like

I’d say you are slightly overdramatizing the situation:

Your example is 150 tokens and it’s not surprising that additional tasks for the model require additional instructions. If a developer wants to streamline the process and improve the performance of their assistant they will sooner or later opt to build their own solution which in turn allows for more control. But then one also has to go through all the required learning steps and produce all the bugs to finally arrive at a stable solution.

By the way, why suddenly use characters as context length currency? Is that a new thing?

Ultimately we are looking at another free month of retrieval using the Assistants API. Since there is no announcement, there is no reason to speculate about things that might not be coming.

Assuming that’s April 1st for us folks in Europe, not the 4th of January.
Dates can get very confusing :laughing:

1 Like

I suspect it’s either because they’ve determined the retrieval function isn’t ready to be charged or because there hasn’t been enough uptake and they want to incentivize more people to try it.


I agree with this.

There are too many hallucinations.

1 Like

I’m talking about the uninformed document injection that is called “retrieval”, not merely the framing language. The AI is preloaded with whatever document was processed first - not only spending your tokens for you, but doing so with no semantic-similarity threshold to limit its placement to when it is relevant. On top of that, there is a search function the AI must perform itself.

You ask for instructions on how to do linear algebra problems; the AI has thousands of tokens of horse-farrier or health-food PDF skewing the quality. Or, more proximate and problematic: you’re writing in PyQt, but there are Tcl/Tk code dumps in context, haphazardly placed from the twenty files you attached.

The only positive that can be said is that the expanded 128k context and its attention masking behave a whole lot more like a chunked RAG system with snippets of knowledge than like an AI with a fully informed view of everything, available for synthesis. Finding a needle is often not as important as seeing the haystack, though.
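The missing piece described above - a relevance gate before anything is injected - is a few lines of code. A hypothetical sketch (the chunk texts, toy 2-dim embeddings, and 0.5 threshold are all made up for illustration; real systems would use full-size embeddings and a tuned threshold):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def relevant_chunks(query_vec, chunks, threshold=0.5):
    """Only pass through chunks whose embedding clears a similarity
    threshold against the query - the gate Assistants retrieval lacks."""
    return [text for text, vec in chunks if cosine(query_vec, vec) >= threshold]

# Toy example: a linear-algebra query vs. an unrelated horse-farrier chunk.
query = [1.0, 0.0]
chunks = [("matrix inversion notes", [0.9, 0.1]),
          ("horseshoe fitting guide", [0.0, 1.0])]
print(relevant_chunks(query, chunks))  # only the relevant chunk survives
```

With a gate like this, the horse-farrier PDF simply never enters the context when you ask about linear algebra.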


Not to mention that there is still no documented method to list all threads. This pricing applies to each and every file found in every crevice of Assistants and Messages.

Second, a “typical” RAG setup of 1M 1536-dimension records with 10k queries and writes is $3.58/month ($5.80/month with 4 KB metadata) on Pinecone. That is about 6 GB of vector storage (1M vectors @ FP32 × 1536 dimensions), or roughly $2.03/month at $0.33 / GB / month.

The equivalent for Assistants is $36/month ($1.20/day).

To be fair, I imagine there’s a lot of metadata attached, so it’s not exact, but the margin is large enough to be an issue IMO. Even if I gave each vector 31 KB of metadata, it would come out to around $21/month.

I still don’t know if this price is for the file itself or for the vectors. If it’s for the vectors then idek. If it’s ada-v2, it has a context length of 8,191 tokens, which, if a token is 3/4 of a word, is roughly 6,143 words.

If the file is ASCII (1 char = 1 byte) and an average word is 5 characters, then a file of 6,143 words would be ~31 KB. Then there are the probable window overlaps, which I’m going to ignore.

So a 31 KB document would be converted to 6 KB (a single vector) - an ~80% reduction.
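The back-of-envelope math above can be re-run in a few lines. Assumptions are the post’s own (FP32 vectors, 1,536 dimensions for ada-v2, decimal GB, 5 ASCII characters per word, overlaps ignored):

```python
VECTORS = 1_000_000
DIMS = 1536
BYTES_PER_FLOAT = 4  # FP32

# Vector storage footprint
storage_gb = VECTORS * DIMS * BYTES_PER_FLOAT / 1e9   # ~6.14 GB

# Monthly cost comparison (rates quoted earlier in the thread)
pinecone_monthly = storage_gb * 0.33                  # $0.33 / GB / month
assistants_monthly = 6 * 0.20 * 30                    # $0.20 / GB / day on ~6 GB

# File-to-vector size reduction
words_per_vector = int(8191 * 0.75)                   # 8,191-token context, 3/4 word per token
file_bytes = words_per_vector * 5                     # ~31 KB of ASCII
vector_bytes = DIMS * BYTES_PER_FLOAT                 # 6,144 B = ~6 KB
reduction = 1 - vector_bytes / file_bytes             # ~80% smaller

print(f"storage: {storage_gb:.2f} GB")
print(f"Pinecone ${pinecone_monthly:.2f}/mo vs. Assistants ${assistants_monthly:.0f}/mo")
print(f"file {file_bytes / 1000:.0f} KB -> vector {vector_bytes / 1000:.0f} KB "
      f"({reduction:.0%} reduction)")
```

The ~18× gap between the two monthly figures is the margin being complained about; the per-vector reduction is why it’s unclear which artifact (file or vector) the $0.20/GB rate is metering.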

I just don’t know if this is the case because there ain’t no documentation anywhere of anything.

1 Like

What this means: the current language models are not smart enough to do RAG without human design.

So the less RAG the better, because the intelligence is artificial.