Please bear with me if you notice any grammar or spelling mistakes, as my English might not be perfect.
I’m currently using an embedding model to enhance our search and recommendation systems. However, I’ve run into the 8k-token input limit of ada-002 (text-embedding-ada-002), which poses a challenge when dealing with long texts. My current idea is to split the lengthy texts into smaller fragments and use the embedding model to generate multiple vectors, but I have some concerns about the segmentation process and how to work with the resulting vectors.
For instance, suppose the long text contains user information such as name, age, and gender. To avoid exceeding the token limit, I split it into two segments:
The first segment includes the name, age, and some descriptions.
The second segment includes the gender and additional descriptions.
These segments would generate two vectors. However, I want to search for users within a specific age range and of a specific gender. How can I achieve this effectively?
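For illustration, this is roughly what I have in mind so far (a minimal sketch; the character-based splitting and the chunk size are just placeholders for whatever splitter I end up using):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, max_chars: int = 20000) -> list[str]:
    """Naive split by character count; a real splitter would respect tokens/sentences."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed_long_text(text: str) -> list[list[float]]:
    """One vector per chunk, so a single long record maps to multiple vectors."""
    chunks = chunk_text(text)
    response = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
    return [item.embedding for item in response.data]
```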
You could try overlapping your chunked data, i.e. include 30% of the prior chunk and 30% of the next chunk in the current chunk, so each chunk ends up with 40% chunk-specific data and 60% data carried over from its neighbours. This is space-inefficient, but it improves retrieval quality by bringing more relevant context back with a single search.
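A minimal sketch of that overlap scheme (the 30% figure and character-based slicing are assumptions; adjust to your own chunking):

```python
def overlap_chunks(chunks: list[str], overlap: float = 0.3) -> list[str]:
    """Pad each chunk with a fraction of its neighbouring chunks, as described above."""
    padded = []
    for i, chunk in enumerate(chunks):
        prev_tail = ""
        if i > 0:
            k = int(len(chunks[i - 1]) * overlap)
            prev_tail = chunks[i - 1][-k:] if k else ""
        next_head = ""
        if i < len(chunks) - 1:
            k = int(len(chunks[i + 1]) * overlap)
            next_head = chunks[i + 1][:k]
        padded.append(prev_tail + chunk + next_head)
    return padded
```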
If the large text in each record is not used for the searches, you could strip it out in a pre-processing stage before embedding and replace it with a hashed index code. You could then look up that index later to retrieve the text if you wished to reconstruct the entire document; that way your embedded chunks are much smaller. Of course, if the search relies on the text in each record, this method will not be useful.
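A rough sketch of that pre-processing step (the in-memory dict is a stand-in for whatever key-value store you actually use, and the field names are just illustrative):

```python
import hashlib

text_store: dict[str, str] = {}  # stand-in for a real key-value store

def strip_and_index(record: dict, field: str = "large_text") -> dict:
    """Replace the bulky text field with a short hash key before embedding."""
    text = record.pop(field, "")
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    text_store[key] = text
    record[field + "_ref"] = key
    return record

def reconstruct(record: dict, field: str = "large_text") -> dict:
    """Look the original text back up from its hash key."""
    record[field] = text_store[record[field + "_ref"]]
    return record
```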
I’ve successfully gotten aichat to increase tokens to 16k. I don’t know how accurate the data produced at this token level is, since I’m using it for videogames, and the further from reality it gets, the better for my use case. I did it simply by modifying the total-tokens line, but you have to make sure you calculate it properly, e.g. input tokens = 8k, output = 8k. For mine I do it differently, since my input is short but I have GPT voice cinematics.
You would use a regular database with those key fields extracted and indexed, so that a deterministic query can be made against the data.
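For example, a minimal sketch of that idea using SQLite (the schema and field names are only illustrative): the structured fields live in ordinary indexed columns and are queried deterministically, while the free-text description can still be chunked and embedded separately.

```python
import sqlite3

conn = sqlite3.connect("users.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS users ("
    "  id INTEGER PRIMARY KEY, name TEXT, age INTEGER, gender TEXT, description TEXT)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_age_gender ON users (age, gender)")

# Deterministic filter on the key fields; no embeddings involved.
rows = conn.execute(
    "SELECT id, name FROM users WHERE age BETWEEN ? AND ? AND gender = ?",
    (25, 35, "female"),
).fetchall()
```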
Embedding is not a search algorithm; it is a semantic method of finding similar topics, subjects, forms of writing, attitudes, etc.
A news story may have a meaning like “basketball champions” (along with thousands of other language aspects). Another basketball story will score highly against it, while text that is significantly different in content will score much lower.
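The “similarity score” here is typically cosine similarity between the two embedding vectors, along these lines:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Higher values mean the two texts are semantically closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```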
You can handle large texts in retrieval by embedding a summary instead (although producing the summary may itself take several AI language turns to condense the text to something smaller). The summary is embedded to a vector, and the linked value returns not the summary but the original document from a lookup database.
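A rough sketch of that summary-then-lookup pattern (the summarize callable and the in-memory dicts are placeholders for your own summarizer and database):

```python
from openai import OpenAI

client = OpenAI()
originals: dict[str, str] = {}              # doc_id -> full original text
index: list[tuple[str, list[float]]] = []   # (doc_id, embedding of the summary)

def add_document(doc_id: str, full_text: str, summarize) -> None:
    """Embed only the summary; keep the full text in a separate lookup store."""
    summary = summarize(full_text)  # placeholder summarizer
    response = client.embeddings.create(model="text-embedding-ada-002", input=summary)
    originals[doc_id] = full_text
    index.append((doc_id, response.data[0].embedding))
```

At query time you would search the index by similarity, then return `originals[doc_id]` rather than the summary itself.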
If you are exceeding the context length of an embedding model, the data is also going to be too large to pre-load into the context of a language model for question-answering.