How to deal with different vector dimensions for embeddings and search with pgvector?

I use the pgvector extension to store OpenAI embeddings as the data source for my RAG pipeline.

Until now, the best practice was to use the embedding model text-embedding-ada-002, which produces vectors with a dimension of 1536.

Today, OpenAI announced two new models, text-embedding-3-small and text-embedding-3-large, supporting various dimensions: 512 and 1536 for the small model, and 256, 1024, and 3072 for the large one.

My question is: how can I deal with multiple vector dimensions in the same table while serving the same query?

Currently, my table looks like this:

create table if not exists public.embeddings
(
    id        serial primary key,
    embedding vector(1536) not null

    -- ... some more columns, but irrelevant for the given context
);

create index if not exists embeddings_embedding_idx
    on public.embeddings using ivfflat (embedding public.vector_cosine_ops);

For querying, I use a stored function:

create or replace function match_embeddings(
    query_embedding vector(1536),
    match_threshold float,
    match_count int
)
    returns table(j json)
as
$$
begin
    return query
        select row_to_json(r)
        from (select e.id,
                     -- <=> is pgvector's cosine distance, so 1 - distance = cosine similarity
                     1 - (e.embedding <=> query_embedding) as similarity
              from embeddings e
              where 1 - (e.embedding <=> query_embedding) > match_threshold
              order by similarity desc
              limit match_count) r;
end
$$ language plpgsql;
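For reference, a typical call looks like this (the threshold and count values are just examples):

-- $1 is the 1536-dim query embedding obtained from the OpenAI API
select j
from match_embeddings($1, 0.78, 5);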

At the moment, the full setup is based on a fixed vector dimension: 1536.
Unfortunately, pgvector does not provide a variable dimension size for a column declared with one: a vector(1536) column must be filled with an array of exactly 1536 values, and providing fewer (or more) results in an error.
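For example (the exact error wording may vary by pgvector version):

-- inserting a vector of the wrong length into a vector(1536) column fails
insert into public.embeddings (embedding)
values ('[0.1, 0.2, 0.3]'::vector);
-- ERROR:  expected 1536 dimensions, not 3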

I wonder whether I can simply increase the vector storage to the currently required maximum (3072) and, for instance, store the applied vector size in an extra column so that I can always tell which dimension a given row holds.
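For reference, the pgvector README describes a pattern for exactly this case (worth verifying against the installed version): declare the column as plain vector without a fixed dimension, then create an expression-based partial index per dimension actually used.

-- untyped vector column: stores any dimension, but cannot be indexed directly
alter table public.embeddings
    alter column embedding type vector;

-- one expression + partial index per dimension in use
create index if not exists embeddings_embedding_1536_idx
    on public.embeddings using ivfflat ((embedding::vector(1536)) public.vector_cosine_ops)
    where (vector_dims(embedding) = 1536);

-- queries must then filter on the dimension and apply the same cast:
-- select ... from public.embeddings
-- where vector_dims(embedding) = 1536
-- order by embedding::vector(1536) <=> query_embedding;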

The main questions are:

  1. Should I right-pad existing vectors, which have a dimension of 1536, with zeros to fit into 3072 dimensions? Would this break the query mechanism?
  2. Or should I add multiple columns, one per required dimension, like embedding_1536 vector(1536) and embedding_3072 vector(3072)?
  3. Or should I have multiple embedding tables, each providing a vector column of the corresponding dimension (see the sketch after this list)?
  4. Or something else that is not on my radar as of now?
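To make option 3 concrete, this is roughly what I have in mind (table and index names are illustrative):

create table if not exists public.embeddings_1024
(
    id        serial primary key,
    embedding vector(1024) not null
);

create index if not exists embeddings_1024_embedding_idx
    on public.embeddings_1024 using ivfflat (embedding public.vector_cosine_ops);

create table if not exists public.embeddings_3072
(
    id        serial primary key,
    embedding vector(3072) not null
);

-- caveat: ivfflat (and hnsw) indexes support at most 2,000 dimensions at
-- the time of writing, so the 3072-dim table cannot get an ANN index of this kind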

Any useful ideas, hints, remarks, or solutions are very much welcome. Thanks!


You will have to re-embed everything if you change to a different model or a different embedding dimension.

One of the more intriguing uses is text-embedding-3-large at dimensions: 1024. If you have an existing vector database of fixed dimension where you can segment the search spaces, you can fill the remaining 512 values with -1 or 1, which will put the dot products in a completely different embedding search space.
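To sketch that padding step (pad_vector is a hypothetical helper, not part of pgvector; it relies on pgvector's array-to-vector cast):

-- hypothetical helper: right-pad an embedding, given as a float array,
-- with a constant fill value so it fits an existing fixed-dimension column
create or replace function pad_vector(v real[], target_dim int, fill real)
    returns vector
as $$
select (v || array_fill(fill, array[target_dim - array_length(v, 1)]))::vector;
$$ language sql immutable;

-- e.g. pad a 1024-dim embedding to 1536 with 1s, moving its dot products
-- into a separate region of the search space:
-- select pad_vector(embedding_array, 1536, 1.0);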


If you had to fill, I’d go with zeros, so as not to affect the dot product.

Also, for sure I don’t think you could compare across models, like you said, so re-vector up everybody!

I’m curious about these different embedding sizes coming out of the new models.

Wondering if they are just created by truncating and re-scaling from the higher-dimensional model. I’d have to test to find out. If so, you could create the high-dimensional version and derive the other versions from it. :thinking:
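If that turns out to be true, deriving the smaller versions would just be truncate-and-renormalize, something like this (assumes pgvector >= 0.7, where subvector() and l2_normalize() exist; on older versions this has to happen client-side):

-- derive a 1024-dim embedding from a stored 3072-dim one
select l2_normalize(subvector(embedding, 1, 1024)) as embedding_1024
from public.embeddings;  -- assuming the column holds 3072-dim vectors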


As far as I know, the vectors are normalized as they come from OpenAI. If you set the extra values to 1 or -1, you will have to make sure that your metric normalizes the vectors before calculation (or at least compensates).
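For example (l2_normalize requires pgvector >= 0.7; pad_vector is the hypothetical helper sketched above):

-- re-normalize after padding so cosine / dot-product scores stay comparable
select l2_normalize(pad_vector(embedding_array, 1536, 1.0));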