Adding all embeddings into a single list causes a dimensionality mismatch

Hi,
I have created an application where the input file is dynamic. The input files are chunked, and the chunks are then passed to the Azure OpenAI embedding model text-embedding-ada-002.

The created embeddings are then appended into a single list to create a single field entry in the Azure index.
But when I do, I get the following error.

OperationNotAllowed
Message: The request is invalid. Details: actions : 0: The vector field 'content_vector' dimensionality must match the field definition's 'dimensions' property. Expected: '2048'. Actual: '104448'.

I know this is related to Azure, but I want to know whether I'm concatenating the embeddings incorrectly.

def get_chunks(text):
    Logging.log_info("Starting text chunking.")
    try:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=int(CHUNK_SIZE),
            chunk_overlap=int(OVERLAP),
            length_function=len,
            is_separator_regex=False,
        )
        texts = text_splitter.create_documents([text])
        Logging.log_info(f"when fetching text split {texts}, length of chunks {len(texts)}")
        return texts
    except Exception as error:
        Logging.log_errors(f'Could not chunk the text: {error}')

Azure OpenAI code

def get_vector_embeddings(text):
    try:
        Logging.log_info("Trying to get vector embeddings")
        client = AzureOpenAI(api_key=AZURE_OPENAI_KEY,
                             api_version=API_VERSION,
                             azure_endpoint=AZURE_ENDPOINT)
        embeddings = client.embeddings.create(input=[text], model=model_name).data[0].embedding
        Logging.log_info(f"Vector embedding success{embeddings}")
        return embeddings.tolist()
    except Exception as error:
        Logging.log_errors(f'Could not get vector embeddings: {error}')

Appending the embeddings to a list

        if get_token_count(represented_text) > 8000:
            chunks = get_chunks(represented_text)
            embeddings = []
            for document in chunks:
                embeddings.extend(get_vector_embeddings(document.page_content))

PS: I'm using the RecursiveCharacterTextSplitter from the LangChain framework to get the chunks.

Hi! Welcome to the forum!

Learned me some Python today!

Did you maybe mean append? With extend, the individual floats of each embedding get added to the list one by one, so every vector is flattened into one long list; append would add each embedding as its own element.

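A minimal sketch with made-up 3-dimensional "embeddings" to show the difference (real text-embedding-ada-002 vectors have 1536 floats each):

vec_a = [0.1, 0.2, 0.3]
vec_b = [0.4, 0.5, 0.6]

flattened = []
flattened.extend(vec_a)  # adds the individual floats, not the vector
flattened.extend(vec_b)
print(len(flattened))    # 6 -> one long flat list; this is how you end up at 104448

nested = []
nested.append(vec_a)     # adds each embedding as its own element
nested.append(vec_b)
print(len(nested))       # 2 -> a list of vectors, each still 3-dimensional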

FYI: you can send multiple text chunks (i.e. an array of strings) to the API at the same time; the input parameter of the embeddings endpoint accepts an array of strings.

Then you don't need to re-append everything chunk by chunk, and you might save some requests per minute.
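
Something along these lines, a sketch reusing the client setup, model_name, and get_chunks from your code (check your deployment's per-request input limits before batching everything at once):

def get_vector_embeddings_batch(texts):
    # One request for a whole batch of chunks instead of one request per chunk
    client = AzureOpenAI(api_key=AZURE_OPENAI_KEY,
                         api_version=API_VERSION,
                         azure_endpoint=AZURE_ENDPOINT)
    response = client.embeddings.create(input=texts, model=model_name)
    # Sort by index so the vectors line up with the order of the input texts
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]

chunks = get_chunks(represented_text)
embeddings = get_vector_embeddings_batch([doc.page_content for doc in chunks])
# embeddings is now a list with one 1536-float vector per chunk, not one flat list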

I'm sorry, but there are just so many dimensions of misunderstanding here that it's hard to chunk.

A vector database stores embedding vectors that match the model employed.
You haven't even started with the right number of dimensions to use any OpenAI model: the index field expects 2048, while text-embedding-ada-002 returns 1536 floats per input. Let alone that you can't just mash even more vectors together into one field.
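
For illustration, a rough sketch of the shape of what gets uploaded once the index field is defined with 1536 dimensions for text-embedding-ada-002: one search document per chunk, each carrying its own un-concatenated vector (the id and content field names here are made up; content_vector comes from your error message):

chunks = get_chunks(represented_text)
search_docs = []
for i, document in enumerate(chunks):
    vector = get_vector_embeddings(document.page_content)  # one 1536-float list per chunk
    search_docs.append({
        "id": f"chunk-{i}",                # made-up key field
        "content": document.page_content,  # made-up text field
        "content_vector": vector,          # one vector per chunk, never a concatenation
    })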
