Adding all embeddings into a single list causes a dimensionality mismatch

Hi,
I have created an application where the input file is dynamic. The input files are chunked, and the chunks are then passed to the Azure OpenAI embedding model text-embedding-ada-002.

The created embeddings are then appended into a single list to create a single field entry in the Azure index.
But when I do, I get the following error.

OperationNotAllowed
Message: The request is invalid. Details: actions : 0: The vector field 'content_vector' dimensionality must match the field definition's 'dimensions' property. Expected: '2048'. Actual: '104448'.

I know this is related to Azure, but I want to know whether I'm concatenating the embeddings incorrectly.

def get_chunks(text):
    Logging.log_info("Starting text chunking.")
    try:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=int(CHUNK_SIZE),
            chunk_overlap=int(OVERLAP),
            length_function=len,
            is_separator_regex=False,
        )
        texts = text_splitter.create_documents([text])
        Logging.log_info(f"when fetching text split {texts}, length of chunks {len(texts)}")
        return texts
    except Exception as error:
        Logging.log_errors(f'Could not chunk the text: {error}')

Azure OpenAI code

def get_vector_embeddings(text):
    try:
        Logging.log_info("Trying to get vector embeddings")
        client = AzureOpenAI(api_key=AZURE_OPENAI_KEY,
                             api_version=API_VERSION,
                             azure_endpoint=AZURE_ENDPOINT)
        embeddings = client.embeddings.create(input=[text], model=model_name).data[0].embedding
        Logging.log_info(f"Vector embedding success{embeddings}")
        return embeddings.tolist()
    except Exception as error:
        Logging.log_errors(f'Could not get vector embeddings: {error}')

Appending the embeddings to a list

        if get_token_count(represented_text) > 8000:
            chunks = get_chunks(represented_text)
            embeddings = []
            for document in chunks:
                embeddings.extend(get_vector_embeddings(document.page_content))

PS: I'm using the RecursiveCharacterTextSplitter from the LangChain framework to get the chunks.

Hi! Welcome to the forum!

Learned me some Python today!

Did you maybe mean append? With extend, the individual floats of each embedding get added to the list one by one, so every vector is flattened into one long list; append would add each embedding as its own element.

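A minimal sketch with made-up 3-dimensional "embeddings" to show the difference (real text-embedding-ada-002 vectors have 1536 floats each):

vec_a = [0.1, 0.2, 0.3]
vec_b = [0.4, 0.5, 0.6]

flattened = []
flattened.extend(vec_a)  # adds the individual floats, not the vector
flattened.extend(vec_b)
print(len(flattened))    # 6 -> one long flat list; this is how you end up at 104448

nested = []
nested.append(vec_a)     # adds each embedding as its own element
nested.append(vec_b)
print(len(nested))       # 2 -> a list of vectors, each still 3-dimensional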

FYI: you can send multiple text chunks (i.e. an array of strings) to the API at the same time; the input parameter of the embeddings endpoint accepts an array of strings.

Then you don't need to re-append everything chunk by chunk, and you might save some requests per minute.
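
Something along these lines, a sketch reusing the client setup, model_name, and get_chunks from your code (check your deployment's per-request input limits before batching everything at once):

def get_vector_embeddings_batch(texts):
    # One request for a whole batch of chunks instead of one request per chunk
    client = AzureOpenAI(api_key=AZURE_OPENAI_KEY,
                         api_version=API_VERSION,
                         azure_endpoint=AZURE_ENDPOINT)
    response = client.embeddings.create(input=texts, model=model_name)
    # Sort by index so the vectors line up with the order of the input texts
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]

chunks = get_chunks(represented_text)
embeddings = get_vector_embeddings_batch([doc.page_content for doc in chunks])
# embeddings is now a list with one 1536-float vector per chunk, not one flat list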

I'm sorry, but there are just so many dimensions of misunderstanding here that it's hard to chunk.

A vector database stores embedding vectors that match the model employed.
You haven't even started with the right number of dimensions to use any OpenAI model: the index field expects 2048, while text-embedding-ada-002 returns 1536 floats per input. Let alone that you can't just mash even more vectors together into one field.
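
For illustration, a rough sketch of the shape of what gets uploaded once the index field is defined with 1536 dimensions for text-embedding-ada-002: one search document per chunk, each carrying its own un-concatenated vector (the id and content field names here are made up; content_vector comes from your error message):

chunks = get_chunks(represented_text)
search_docs = []
for i, document in enumerate(chunks):
    vector = get_vector_embeddings(document.page_content)  # one 1536-float list per chunk
    search_docs.append({
        "id": f"chunk-{i}",                # made-up key field
        "content": document.page_content,  # made-up text field
        "content_vector": vector,          # one vector per chunk, never a concatenation
    })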
