The documentation for embeddings says: "Note that the maximum length of input text for our embedding models is 2048 tokens (approximately equivalent to around 2-3 pages of text). You should verify that your inputs do not exceed this limit before making a request."
What if I have 200 pages of document? Can I still use embedding?
Thanks for your help.
You need to break your text into chunks of 2048 tokens or fewer. Each chunk gets its own embedding. Cache the embeddings, and use them as you need.
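A rough sketch of that chunking step (this uses a crude words-per-token approximation I picked for illustration; a real tokenizer would give exact counts):

```python
def chunk_text(text, max_tokens=2048, words_per_token=0.75):
    # Approximate token budget: ~0.75 words per token is a common rule
    # of thumb for English text, so 2048 tokens is roughly 1536 words.
    max_words = int(max_tokens * words_per_token)
    words = text.split()
    # Slice the word list into consecutive chunks under the budget.
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk can then be sent to the embeddings endpoint separately and the results cached.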
Thanks. So when I send a query, will it look through all the embeddings, find the best match, and answer based on that? Sorry, I am a noob, so I may be asking a dumb question.
So let's say I have 200 pages of text, and we know roughly 2048 tokens = 2 pages. Are you recommending that I split this into 100 documents, each two pages long (<2048 tokens), and then supply those 100 documents to OpenAI for creating embeddings?
More specifically, as I look at openai-python/Obtain_dataset.ipynb at main · openai/openai-python · GitHub, should I run these 100 files one at a time and save all of the embeddings (by appending) to the same file (output/embedded_1k_reviews.csv), or should I create a new file for each file I run?
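If you go with the single-file approach, appending rows to one CSV with a document ID is straightforward. A sketch (the helper name and column layout here are my own, not from the notebook):

```python
import csv
import json
import os

def append_embeddings(csv_path, doc_id, chunks_with_embeddings):
    # chunks_with_embeddings: list of (chunk_text, embedding_vector) pairs.
    # Write the header only when the file is first created, then append.
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["doc_id", "chunk", "embedding"])
        for chunk, emb in chunks_with_embeddings:
            # Store the vector as a JSON string so it round-trips cleanly.
            writer.writerow([doc_id, chunk, json.dumps(emb)])
```

Calling this once per document keeps everything in one file while the `doc_id` column tells you which document each embedding came from.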
I am facing a similar problem in my cognitive architecture, where I will have to search an arbitrarily large set of files for particular memories. You can see how I do this in my “let’s build an ACOG” video series.
The direct answer to your question is that I store all the embeddings as a local pickle file, or just in memory. I call it an indexed memory in this file: ACOG_Experiment01/inner_loop.py at main · daveshap/ACOG_Experiment01 · GitHub
Since numpy is reasonably well optimized (certainly good enough for our use), it can search through a few hundred log files very quickly.
I tried storing the embeddings as JSON but found that very space-inefficient. Storing them all as a list of matrices in a pickle was about 2x to 4x more space-efficient. But if you have plenty of RAM, you can also just keep everything in memory.
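A minimal sketch of the pickle approach (assuming the index is a dict mapping a doc ID to a numpy vector; the function names are my own):

```python
import pickle
import numpy as np

def save_index(path, embeddings):
    # embeddings: dict mapping doc_id -> np.ndarray embedding vector.
    # pickle serializes the numpy arrays in binary, which is much more
    # compact than writing the floats out as JSON text.
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)

def load_index(path):
    # Load the whole index back into memory for fast local search.
    with open(path, "rb") as f:
        return pickle.load(f)
```

For a few hundred documents the entire index fits comfortably in RAM, so loading it once at startup is usually fine.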
Essentially, I built up a master index. In the future, I will split this off into a separate, dedicated API service. There is also a cloud-based one called Pinecone you might find useful.
I just found Milvus (https://milvus.io/); it might suit both our needs.
You will get embeddings for both the data set and the query. These embeddings are saved locally. You then run a cosine similarity search on your local machine to determine the most related documents.
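A bare-bones version of that local search, using plain numpy (the helper names and the top-k wrapper are my own):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product of the vectors divided by the
    # product of their lengths; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_emb, doc_embs, k=3):
    # doc_embs: dict mapping doc_id -> embedding vector.
    # Score every document against the query and return the k best IDs.
    scores = {doc_id: cosine_similarity(query_emb, v)
              for doc_id, v in doc_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

You would embed the query with the same model as the documents, then pass that vector to `top_k` to find the most related chunks.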
Would appreciate your answer to the second post I made.
Store the embeddings in a single document or multiple; it doesn't matter. I would probably put them all in a single CSV with an ID for each document.
Whatever the case, you will need to figure out programmatically how to compare your query embedding against your document embeddings.
Also, when deciding how to break up your text to create the embeddings in the first place, I would break it into semantically meaningful chunks rather than static page lengths.
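One simple way to do that is to pack whole paragraphs into each chunk instead of cutting at a fixed length. A sketch (again using a rough words-per-token estimate of my own; a single paragraph longer than the budget would still need further splitting):

```python
def chunk_by_paragraphs(text, max_tokens=2048, words_per_token=0.75):
    # Split on blank lines and pack whole paragraphs into each chunk,
    # so chunks end at semantic boundaries rather than mid-sentence.
    max_words = int(max_tokens * words_per_token)
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        # Start a new chunk when adding this paragraph would bust the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on section headings or topic shifts would be even better, but paragraph boundaries are a cheap first approximation.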