Hebrew text not being correctly tokenized

I’m using text-embedding-ada-002 as the embedding model: I split Hebrew text into chunks with the parameters below, embed them, and store the vectors in Pinecone.
Chunk overlap: 20
Chunk size: 512

However, when I query it, the matches returned by Pinecone and the output generated by the LLM are not what I expect. How can I improve this?

Welcome to the community!

Have you considered using the Text Embedding 3 Large model?


https://platform.openai.com/docs/guides/embeddings
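Switching models is mostly a matter of changing the `model` field in the embeddings request. A minimal sketch of the request body for a POST to `https://api.openai.com/v1/embeddings` (the Hebrew sample text is just an example):

```javascript
// Build the JSON body for the embeddings endpoint.
// Field names (`model`, `input`) follow the docs linked above.
function buildEmbeddingRequest(text, model = "text-embedding-3-large") {
  return {
    model,       // e.g. "text-embedding-3-large" or "text-embedding-ada-002"
    input: text, // a string, or an array of strings to embed in one call
  };
}

const body = buildEmbeddingRequest("שלום עולם"); // "Hello world" in Hebrew
console.log(JSON.stringify(body));
```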

Fixed chunk size and overlap might also not be the best strategy for every use case. I think most veterans would recommend some form of semantic chunking.
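The core idea of semantic chunking is: embed each sentence, then start a new chunk wherever similarity between adjacent sentences drops below a threshold. A minimal sketch — the threshold value and the `embed` function are placeholders (in practice you'd call an embedding model):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Group sentences into chunks, breaking wherever the embedding of the
// next sentence is dissimilar from the previous one.
function semanticChunks(sentences, embed, threshold = 0.7) {
  const chunks = [];
  let current = [];
  let prevVec = null;
  for (const s of sentences) {
    const vec = embed(s);
    if (prevVec && cosine(prevVec, vec) < threshold && current.length) {
      chunks.push(current.join(" "));
      current = [];
    }
    current.push(s);
    prevVec = vec;
  }
  if (current.length) chunks.push(current.join(" "));
  return chunks;
}
```

This keeps each chunk topically coherent instead of cutting at an arbitrary character count, which matters more for languages like Hebrew where token boundaries don't line up neatly with characters.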

1 Like

Thanks for the suggestions. I had tried text-embedding-3-small, but I had issues retrieving the correct data compared with text-embedding-ada-002. I’ll try changing the model and see whether it helps.
I’m using RecursiveCharacterTextSplitter from the langchain/text_splitter JS package, but I couldn’t find anything related to semantic chunking in the LangChain JS package. Are you aware of any alternatives?

Personally, I’m not a fan of these naive tools. Langchain is 49% wrapper, 50% marketing, and 1% value.

Just code it.
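If you do roll your own, the core of a recursive character splitter is small. A simplified sketch (the separator list is an assumption, and it omits chunk overlap for brevity):

```javascript
// Recursively split `text` so every chunk is at most `maxLen` characters,
// preferring to break on larger separators first (paragraph, line, sentence, word).
function splitRecursive(text, maxLen = 512, separators = ["\n\n", "\n", ". ", " "]) {
  if (text.length <= maxLen) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-cut at maxLen.
    const out = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  const parts = text.split(sep);
  const chunks = [];
  let current = "";
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= maxLen) {
      current = candidate; // still fits: keep accumulating
    } else {
      if (current) { chunks.push(current); current = ""; }
      if (part.length > maxLen) {
        // This piece alone is too long: recurse with finer separators.
        chunks.push(...splitRecursive(part, maxLen, rest));
      } else {
        current = part;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Owning this loop makes it easy to adapt the separators to the text at hand, e.g. adding Hebrew punctuation or paragraph markers.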

1 Like

Whether you use LangChain or not, logging the inputs you send to the LLM and the outputs it returns can help you pinpoint where the failures occur.
Tools like W&B might be useful for this.
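A minimal version of that logging needs no external tooling at all — wrap the call and record each prompt/response pair (the log shape and the `callModel` function here are just placeholders):

```javascript
// In-memory log of every prompt/response pair; swap for a file, W&B, or a DB.
const logs = [];

// `callModel` is a placeholder for whatever async function actually hits the API.
async function loggedCall(callModel, prompt) {
  const entry = { timestamp: new Date().toISOString(), prompt };
  const response = await callModel(prompt);
  entry.response = response;
  logs.push(entry);
  return response;
}
```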

1 Like

Welcome @neeraj-codebuddy

I think one of the most basic things to do is to inspect the context that’s being passed to the model for the chat completion API call.
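Concretely, that can be as simple as printing the assembled messages right before the call, so you can see exactly which retrieved chunks (and in what state the Hebrew text is) reach the model. A sketch — the retrieval step and the system-prompt wording are assumptions:

```javascript
// Assemble chat messages from retrieved chunks and inspect them before sending.
// `retrievedChunks` would come from your Pinecone query results.
function buildMessages(retrievedChunks, question) {
  const context = retrievedChunks.join("\n---\n");
  return [
    { role: "system", content: "Answer using only the context below.\n\n" + context },
    { role: "user", content: question },
  ];
}

const messages = buildMessages(["chunk A", "chunk B"], "What does the text say?");
// Inspect exactly what the model will see:
console.log(JSON.stringify(messages, null, 2));
```

If the chunks printed here are already garbled or irrelevant, the problem is in chunking/retrieval rather than in the completion call itself.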