Please bear with me if you notice any grammar or spelling mistakes, as my English might not be perfect.
I’m currently using an embedding model to enhance our search and recommendation systems. However, I’ve run into the 8k-token input limit of ada-002 (text-embedding-ada-002), which poses a challenge when dealing with long texts. My current idea is to split the lengthy texts into smaller fragments and use the embedding model to generate multiple vectors, but I have some concerns about the segmentation process and how to work with the resulting vectors.
For instance, suppose the long text contains user information such as name, age, and gender. To avoid exceeding the token limit, I split it into two segments:
The first segment includes the name, age, and some descriptions.
The second segment includes the gender and additional descriptions.
These segments would generate two vectors. However, I want to search for users within a specific age range and of a specific gender. How can I achieve this effectively?
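For illustration, this is roughly what I have in mind so far (a minimal sketch; the character-based splitting and the chunk size are just placeholders for whatever splitter I end up using):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, max_chars: int = 20000) -> list[str]:
    """Naive split by character count; a real splitter would respect tokens/sentences."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed_long_text(text: str) -> list[list[float]]:
    """One vector per chunk, so a single long record maps to multiple vectors."""
    chunks = chunk_text(text)
    response = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
    return [item.embedding for item in response.data]
```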
You could try overlapping your chunked data, i.e. include 30% of the prior chunk and 30% of the next chunk in the current chunk, so each chunk ends up with 40% chunk-specific data and 60% data carried over from its neighbours. This is space-inefficient, but it improves retrieval quality by bringing more relevant context back with a single search.
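A minimal sketch of that overlap scheme (the 30% figure and character-based slicing are assumptions; adjust to your own chunking):

```python
def overlap_chunks(chunks: list[str], overlap: float = 0.3) -> list[str]:
    """Pad each chunk with a fraction of its neighbouring chunks, as described above."""
    padded = []
    for i, chunk in enumerate(chunks):
        prev_tail = ""
        if i > 0:
            k = int(len(chunks[i - 1]) * overlap)
            prev_tail = chunks[i - 1][-k:] if k else ""
        next_head = ""
        if i < len(chunks) - 1:
            k = int(len(chunks[i + 1]) * overlap)
            next_head = chunks[i + 1][:k]
        padded.append(prev_tail + chunk + next_head)
    return padded
```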
If the large text in each record is not used for the searches, you could strip it out in a pre-processing stage before embedding and replace it with a hashed index code. You could then look up that index later to retrieve the text if you wished to reconstruct the entire document; that way your embedded chunks are much smaller. Of course, if the search relies on the text in each record, this method will not be useful.
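A rough sketch of that pre-processing step (the in-memory dict is a stand-in for whatever key-value store you actually use, and the field names are just illustrative):

```python
import hashlib

text_store: dict[str, str] = {}  # stand-in for a real key-value store

def strip_and_index(record: dict, field: str = "large_text") -> dict:
    """Replace the bulky text field with a short hash key before embedding."""
    text = record.pop(field, "")
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    text_store[key] = text
    record[field + "_ref"] = key
    return record

def reconstruct(record: dict, field: str = "large_text") -> dict:
    """Look the original text back up from its hash key."""
    record[field] = text_store[record[field + "_ref"]]
    return record
```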
I’ve successfully gotten aichat to increase tokens to 16k. I don’t know how accurate the data produced at this token level is, since I’m using it for videogames, and the further from reality it gets, the better for my use case. I did it simply by modifying the total-tokens line, but you have to make sure you calculate it properly, e.g. input tokens = 8k, output = 8k. For mine I do it differently, since my input is short but I have GPT voice cinematics.
You would use a regular database with those key fields extracted and indexed, so that a deterministic query can be made against the data.
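For example, a minimal sketch of that idea using SQLite (the schema and field names are only illustrative): the structured fields live in ordinary indexed columns and are queried deterministically, while the free-text description can still be chunked and embedded separately.

```python
import sqlite3

conn = sqlite3.connect("users.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS users ("
    "  id INTEGER PRIMARY KEY, name TEXT, age INTEGER, gender TEXT, description TEXT)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_age_gender ON users (age, gender)")

# Deterministic filter on the key fields; no embeddings involved.
rows = conn.execute(
    "SELECT id, name FROM users WHERE age BETWEEN ? AND ? AND gender = ?",
    (25, 35, "female"),
).fetchall()
```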
Embedding is not a search algorithm; it is a semantic method of finding similar topics, subjects, forms of writing, attitudes, etc.
A news story may have a meaning like “basketball champions” (along with thousands of other language aspects). Another basketball story will score highly against it, while text that is significantly different in content will score much lower.
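The “similarity score” here is typically cosine similarity between the two embedding vectors, along these lines:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Higher values mean the two texts are semantically closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```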
You can handle large texts in retrieval by embedding a summary instead (although producing the summary may itself take several AI language turns to condense the text to something smaller). The summary is embedded to a vector, and the linked value returns not the summary but the original document from a lookup database.
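A rough sketch of that summary-then-lookup pattern (the summarize callable and the in-memory dicts are placeholders for your own summarizer and database):

```python
from openai import OpenAI

client = OpenAI()
originals: dict[str, str] = {}              # doc_id -> full original text
index: list[tuple[str, list[float]]] = []   # (doc_id, embedding of the summary)

def add_document(doc_id: str, full_text: str, summarize) -> None:
    """Embed only the summary; keep the full text in a separate lookup store."""
    summary = summarize(full_text)  # placeholder summarizer
    response = client.embeddings.create(model="text-embedding-ada-002", input=summary)
    originals[doc_id] = full_text
    index.append((doc_id, response.data[0].embedding))
```

At query time you would search the index by similarity, then return `originals[doc_id]` rather than the summary itself.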
If you are exceeding the context length of an embedding model, the data is also going to be too large to pre-load into the context of a language model for question-answering.