Embeddings Vector Store - Capacity Limitation

Dear community members,

I’m trying to build an app that attaches around 100k parsed web pages as a RAG knowledge base for an OpenAI Assistant. The experiment was successful with 600 pages, but as far as I understand the vector DB is stored in memory, meaning RAM capacity might become a pitfall. Does anybody have experience at a similar scale? Also, is there any tangible advantage to switching from LlamaIndex to Chroma?

My current setup is as follows:
RAG – LlamaIndex
Parser – BeautifulSoup
Embedding – OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1536)
OpenAI Assistant Model – gpt-3.5-turbo-0125
Language: Uzbek

It depends on what you are using.

I use PostgreSQL and pgvector, and that’s not held in memory except for the HNSW index.
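As a rough sketch of that setup (the table and column names here are my own illustrative assumptions, not from any particular schema), the pgvector side can look like this, with vectors living on disk and only the HNSW index pages cached in RAM as Postgres touches them:

```python
def pgvector_ddl(dimensions: int = 1536) -> str:
    """Return DDL for a pages table plus an HNSW cosine index.

    Table/column names are illustrative assumptions."""
    return f"""
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE IF NOT EXISTS pages (
        id        bigserial PRIMARY KEY,
        url       text,
        chunk     text,
        embedding vector({dimensions})
    );

    -- HNSW index using cosine distance (the <=> operator)
    CREATE INDEX IF NOT EXISTS pages_embedding_hnsw
        ON pages USING hnsw (embedding vector_cosine_ops);
    """

# Run with any Postgres driver, e.g. psycopg:
#   with psycopg.connect(DSN) as conn:
#       conn.execute(pgvector_ddl(1536))
```

The point of the design is that `SELECT ... ORDER BY embedding <=> query LIMIT k` is served from the on-disk index rather than an in-process vector store.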


I can give some insights about strategies for optimum performance with low RAM footprint.


The text-embedding-3-large model is likely larger, as suggested by its cost and dimensions. That implies it could have more parameters and a higher-quality training corpus, which matters especially for less-common languages.

MIRACL is the one to pay attention to here for multi-language:

| Eval benchmark | ada-002 | 3-small | 3-large |
|---|---|---|---|
| MIRACL average (multi-language) | 31.4 | 44.0 | 54.9 |
| MTEB average | 61.0 | 62.3 | 64.6 |


The text-embedding-3-large model is more performative on benchmarks, even when not all of its vector output is used. A dimensions API parameter is provided where you can specify an arbitrary smaller number of dimensions to return.

| Model | Embedding size | Average MTEB score |
|---|---|---|
| 3-small | 512 | 61.6 |
| 3-large | 256 | 62.0 |
| 3-small | 1536 | 62.3 |
| 3-large | 1024 | 64.1 |
| 3-large | 3072 | 64.6 |

The API normalizes the smaller-dimension vectors on request, or you can do the math yourself, which lets you evaluate multiple size scenarios without requesting the embeddings again.
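Doing that math yourself is a few lines of NumPy: slice the full-dimension vector and rescale it to unit length, which is effectively what the dimensions parameter returns. A sketch, with a random stand-in for a real embedding:

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and rescale to unit length,
    mirroring what the dimensions API parameter returns."""
    cut = embedding[:dims]
    return cut / np.linalg.norm(cut)

# Random stand-in for a 3072-dim text-embedding-3-large vector
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

fast = truncate_and_normalize(full, 256)       # low-RAM "fast" option
accurate = truncate_and_normalize(full, 1024)  # larger "full" option
```

Store the full vectors on disk once, then derive whichever dimension you want at load time.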

This enables very flexible usage. You can even load full disk embeddings to RAM at different dimensions dynamically, giving you a “fast” and a “full” option.

Quantized Tensors

The vector dimension values are 32-bit floats, but are likely generated by hardware running at lower resolution.

That gives us an opportunity to evaluate RAM optimization with lower bit depths. I did just that, and found negligible similarity loss at 16 bit as a native (NumPy) format.

In fact, I also evaluated the specialized 8-bit float format now used in ML hardware accelerators, and found the expected loss, but with dot products that still rank similarly, within the certainty the AI itself might have. This can be stored and manipulated in memory as 8 bits and cast to higher precision for calculations.
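A minimal sketch of the 16-bit trade-off, using NumPy's native float16 and random stand-in vectors (stock NumPy has no 8-bit float dtype; that case would need something like the ml_dtypes package, so only 16 bit is shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random unit vectors as stand-ins for API embeddings (float32)
a = rng.normal(size=1536).astype(np.float32)
b = rng.normal(size=1536).astype(np.float32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# Store at 16 bit: half the RAM per vector
a16, b16 = a.astype(np.float16), b.astype(np.float16)

# Cast back up to do the dot product at higher precision
sim_full = float(np.dot(a, b))
sim_16 = float(np.dot(a16.astype(np.float32), b16.astype(np.float32)))

print(abs(sim_full - sim_16))  # small; rankings are essentially unchanged
```

For 100k vectors at 1536 dimensions this is roughly 600 MB at float32 versus 300 MB at float16.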


  • The large model is worth considering: it gives better results than 3-small even when using fewer of its output dimensions, and for Uzbek it is a wise choice;
  • both new models accept arbitrary vector lengths via the dimensions parameter;
  • those vectors can be stored in 1/4 the RAM and still perform the task;
  • vector databases will need to keep up with these possibilities, which can be coded today.

Thank you! It seems I found a consistent guide on PostgreSQL and pgvector. Can you please advise how I can attach the output of similarity_search_with_score(query, k=2) as a RAG tool with the OpenAI Assistant? In my previous experiment the LlamaIndex vectors were added as a function to the OpenAI Assistant through ToolMetadata.

That’s quite a big question that would take a lot of time. You can hire me if you need more focussed help.

Instead I’ll refer you to my open source solution. I don’t use Assistant but my solution is not much different:

function definition here:

function added here:

query run here:

Since you can’t appropriately place additional messages in a thread when using the Assistants endpoint, and a function call to obtain information is wasteful and unpredictable, the place I would inject automatic knowledge is the additional_instructions parameter that can be passed with a run.

It will appear early in the context, so you should use wording such as: “this knowledge is specifically added here for improved answering about the most recent user question…” or similar.
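That injection can be sketched as follows. The helper name and the chunk format are my own assumptions; additional_instructions is the real parameter on the Assistants run-creation call:

```python
def build_additional_instructions(chunks: list[str]) -> str:
    """Wrap retrieved chunks in wording that ties them to the latest
    user question, since they will appear early in the context."""
    joined = "\n\n".join(chunks)
    return (
        "This knowledge is specifically added here for improved answering "
        "about the most recent user question:\n\n" + joined
    )

# Then pass it when starting the run (assumes an OpenAI client, thread,
# and assistant already exist):
#
# run = client.beta.threads.runs.create(
#     thread_id=thread.id,
#     assistant_id=assistant.id,
#     additional_instructions=build_additional_instructions(top_chunks),
# )
```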


Dear @merefield ,

I have followed your advice, created PostgreSQL with the pgvector plugin, and switched the embedding model to 3-large.
I’m trying to use pgvector’s db.similarity_search_with_score(query, k=3) on the indexes created with OpenAI embeddings (text-embedding-3-large) from source text in Uzbek.
In 50% of the queries I get 3 chunks of irrelevant text with a low score (less than 0.5). I suspect the poor results might be due to the source language.
Do you have any advice for improvement?
Thank you in advance!

Drop results below a certain cosine distance threshold and make that threshold a config setting in your app so you can experiment with it without having to edit code …
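A sketch of that filtering, assuming the score returned by similarity_search_with_score is a cosine distance where lower means more similar (so hits beyond the threshold get dropped); the names and default value are illustrative, and the threshold would live in app config rather than code:

```python
# Illustrative default; in practice read this from app config
# so it can be tuned without editing code.
MAX_DISTANCE = 0.5

def filter_results(results, max_distance=MAX_DISTANCE):
    """Keep only (doc, score) pairs close enough to be worth
    injecting into the prompt; score is a cosine distance."""
    return [(doc, score) for doc, score in results if score <= max_distance]

# Example with stand-in search results
hits = [("relevant chunk", 0.21), ("borderline", 0.49), ("noise", 0.83)]
kept = filter_results(hits)
print(kept)  # only the results within the distance threshold
```

If your store returns a similarity instead of a distance, flip the comparison accordingly.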