I am trying to create embeddings for more than 15,000 sentences, but when I run the code with more than 2,048 sentences the embedding fails because of the token limit. My list, docs, contains the more than 15,000 sentences. Is there a way to create these embeddings by changing something in my code?
Yes, I know there is a limit on tokens for embedding.
However, when I used another embedding model, like “all-MiniLM-L6-v2”, which has a token limit of 128 tokens, it worked, because the limit only applied within each sentence, and no sentence exceeded it.
When I apply the same algorithm with “text-embedding-ada-002”, the limit seems to apply to the number of sentences, not to the text within them. My problem is that if that is true, I would have to combine the sentences, and that would not give me the result I need, since I need to know which cluster each of the more than 15,000 sentences belongs to.
Yes, you get a single embedding vector for each chunk of text you pass to the API. So if you want a vector for each sentence, you have to split the data into sentences and feed them to the API one at a time.
People embed at different levels (word, sentence, paragraph, page) to form different contexts, and the API gives you that freedom.
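For example, a minimal per-sentence loop might look like this (a sketch, assuming the openai Python client with an OPENAI_API_KEY in the environment, and that docs is your list of sentences):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_one_at_a_time(docs):
    """Request one embedding per sentence, in a simple loop."""
    vectors = []
    for sentence in docs:
        resp = client.embeddings.create(
            model="text-embedding-ada-002",
            input=sentence,  # a single string -> a single embedding
        )
        vectors.append(resp.data[0].embedding)
    return vectors
```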
I was able to run the code, but it is very slow because it sends one request at a time in a loop. Is there a way to send all the requests at once to OpenAI so that they can be parallelized?
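You don't need to parallelize it yourself: the embeddings endpoint also accepts a list of strings as input and returns one vector per string. The array is capped at 2048 items per request (likely the 2048 limit you ran into), so chunk the list. A sketch under the same assumptions as above:

```python
from openai import OpenAI

client = OpenAI()

def embed_in_batches(docs, batch_size=2048):
    """Send up to 2048 sentences per request instead of one at a time."""
    vectors = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        resp = client.embeddings.create(
            model="text-embedding-ada-002",
            input=batch,  # a list of strings -> one embedding per string
        )
        # Each item in resp.data carries an index into the input list;
        # sort by it so the output order matches the input order.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

vectors[i] then corresponds to docs[i], so you can cluster the vectors and map each cluster label back to its sentence.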
Wow, it worked really well, 100x faster. Thanks a lot.
Do you know how it works? I tried reading the docs, but it is not clear whether it passes all the sentences as one text and then splits them in the response, sends each sentence in parallel, or does something else.
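As far as I can tell from the API reference, the strings are not concatenated: each element of the input list is tokenized and embedded independently on the server, and each entry in the response carries an index that maps it back to its position in your list. A quick way to convince yourself (same assumptions as above):

```python
resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["first sentence", "second sentence"],
)
for item in resp.data:
    # Two independent vectors come back, one per input string,
    # each with ada-002's 1536 dimensions.
    print(item.index, len(item.embedding))
```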