The documentation for embeddings says: "Note that the maximum length of input text for our embedding models is 2048 tokens (approximately equivalent to around 2-3 pages of text). You should verify that your inputs do not exceed this limit before making a request."
What if I have 200 pages of document? Can I still use embedding?
Thanks for your help.
You need to break your text into chunks of 2048 tokens or fewer. Each chunk gets its own embedding. Cache the embeddings, and use them as you need.
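A rough sketch of that chunking step (this uses a crude words-per-token approximation I picked for illustration; a real tokenizer would give exact counts):

```python
def chunk_text(text, max_tokens=2048, words_per_token=0.75):
    # Approximate token budget: ~0.75 words per token is a common rule
    # of thumb for English text, so 2048 tokens is roughly 1536 words.
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    # Slice the word list into consecutive chunks under the budget.
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk can then be sent to the embeddings endpoint separately and the results cached.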
Thanks. So when I send a query, will it look through all the embeddings, find the best match, and answer based on that? Sorry, I am a noob, so I may be asking a dumb question.
So let's say I have 200 pages of text, and we know roughly 2048 tokens = 2 pages. Are you recommending that I split this into 100 documents, each two pages long (<2048 tokens), and then supply those 100 documents to OpenAI for creating embeddings?
More specifically, as I look at openai-python/Obtain_dataset.ipynb at main · openai/openai-python · GitHub, should I run these 100 files one at a time and save all of the embeddings (by appending) to the same file (output/embedded_1k_reviews.csv), or should I create a new file for each file I run?
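If you go with the single-file approach, appending rows to one CSV with a document ID is straightforward. A sketch (the helper name and column layout here are my own, not from the notebook):

```python
import csv
import json
import os

def append_embeddings(csv_path, doc_id, chunks_with_embeddings):
    # chunks_with_embeddings: list of (chunk_text, embedding_vector) pairs.
    # Write the header only when the file is first created, then append.
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["doc_id", "chunk", "embedding"])
        for chunk, emb in chunks_with_embeddings:
            # Store the vector as a JSON string so it round-trips cleanly.
            writer.writerow([doc_id, chunk, json.dumps(emb)])
```

Calling this once per document keeps everything in one file while the `doc_id` column tells you which document each embedding came from.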
I am facing a similar problem in my cognitive architecture, where I will have to search an arbitrarily large set of files for particular memories. You can see how I do this in my “let’s build an ACOG” video series.
The direct answer to your question is that I store all the embeddings as a local pickle file, or just in memory. I call it an indexed memory in this file: ACOG_Experiment01/inner_loop.py at main · daveshap/ACOG_Experiment01 · GitHub
Since numpy is reasonably well optimized (certainly good enough for our use), it can search through a few hundred log files very quickly.
I tried storing the embeddings as JSON but found that very space-inefficient. Storing them all as a list of matrices in a pickle was about 2x to 4x more space-efficient. But if you have plenty of RAM, you can also just keep everything in memory.
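A minimal sketch of the pickle approach (assuming the index is a dict mapping a doc ID to a numpy vector; the function names are my own):

```python
import pickle
import numpy as np

def save_index(path, embeddings):
    # embeddings: dict mapping doc_id -> np.ndarray embedding vector.
    # pickle serializes the numpy arrays in binary, which is much more
    # compact than writing the floats out as JSON text.
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)

def load_index(path):
    # Load the whole index back into memory for fast local search.
    with open(path, "rb") as f:
        return pickle.load(f)
```

For a few hundred documents the entire index fits comfortably in RAM, so loading it once at startup is usually fine.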
Essentially, I built up a master index. In the future, I will split this off into a separate, dedicated API service. There is also a cloud-based one called Pinecone you might find useful.
I just found Milvus (https://milvus.io/); it might suit both our needs.
You will get embeddings for both the data set and the query. These embeddings are saved locally. You then run a cosine similarity search on your local machine to determine the most related documents.
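A bare-bones version of that local search, using plain numpy (the helper names and the top-k wrapper are my own):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product of the vectors divided by the
    # product of their lengths; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_emb, doc_embs, k=3):
    # doc_embs: dict mapping doc_id -> embedding vector.
    # Score every document against the query and return the k best IDs.
    scores = {doc_id: cosine_similarity(query_emb, v)
              for doc_id, v in doc_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

You would embed the query with the same model as the documents, then pass that vector to `top_k` to find the most related chunks.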
Would appreciate your answer to the second post I made.
Store the embeddings in a single document or multiple; it doesn't matter. I would probably put them all in a single CSV with an ID for each document.
Whatever the case, you will need to figure out programmatically how to compare your query embedding against your document embeddings.
Also, when deciding how to break up your text to create the embeddings in the first place, I would break it into semantically meaningful chunks rather than static page lengths.
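One simple way to do that is to pack whole paragraphs into each chunk instead of cutting at a fixed length. A sketch (again using a rough words-per-token estimate of my own; a single paragraph longer than the budget would still need further splitting):

```python
def chunk_by_paragraphs(text, max_tokens=2048, words_per_token=0.75):
    # Split on blank lines and pack whole paragraphs into each chunk,
    # so chunks end at semantic boundaries rather than mid-sentence.
    max_words = int(max_tokens * words_per_token)
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        # Start a new chunk when adding this paragraph would bust the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on section headings or topic shifts would be even better, but paragraph boundaries are a cheap first approximation.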