Hebrew text not being correctly tokenized

I’m using text-embedding-ada-002 as the embedding model: I split Hebrew text into chunks with the parameters below, embed them, and store the vectors in Pinecone.
Chunk overlap: 20
Chunk size: 512

However, when I query it, the matches returned by Pinecone and the output generated by the LLM are not what I expect. How can I improve this?

Welcome to the community!

Have you considered using the Text Embedding 3 Large model?


https://platform.openai.com/docs/guides/embeddings
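Switching models is mostly a matter of changing the `model` field in the embeddings request. A minimal sketch of the request body for a POST to `https://api.openai.com/v1/embeddings` (the Hebrew sample text is just an example):

```javascript
// Build the JSON body for the embeddings endpoint.
// Field names (`model`, `input`) follow the docs linked above.
function buildEmbeddingRequest(text, model = "text-embedding-3-large") {
  return {
    model,       // e.g. "text-embedding-3-large" or "text-embedding-ada-002"
    input: text, // a string, or an array of strings to embed in one call
  };
}

const body = buildEmbeddingRequest("שלום עולם"); // "Hello world" in Hebrew
console.log(JSON.stringify(body));
```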

Fixed chunk size and overlap might also not be the best strategy for every use case. I think most veterans would recommend some form of semantic chunking.
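The core idea of semantic chunking is: embed each sentence, then start a new chunk wherever similarity between adjacent sentences drops below a threshold. A minimal sketch — the threshold value and the `embed` function are placeholders (in practice you'd call an embedding model):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Group sentences into chunks, breaking wherever the embedding of the
// next sentence is dissimilar from the previous one.
function semanticChunks(sentences, embed, threshold = 0.7) {
  const chunks = [];
  let current = [];
  let prevVec = null;
  for (const s of sentences) {
    const vec = embed(s);
    if (prevVec && cosine(prevVec, vec) < threshold && current.length) {
      chunks.push(current.join(" "));
      current = [];
    }
    current.push(s);
    prevVec = vec;
  }
  if (current.length) chunks.push(current.join(" "));
  return chunks;
}
```

This keeps each chunk topically coherent instead of cutting at an arbitrary character count, which matters more for languages like Hebrew where token boundaries don't line up neatly with characters.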

1 Like

Thanks for the suggestions. I had tried text-embedding-3-small, but I had issues retrieving the correct data compared with text-embedding-ada-002. I’ll try changing the model and see whether it helps.
I’m using RecursiveCharacterTextSplitter from the langchain/text_splitter JS package, but I couldn’t find anything related to semantic chunking in the LangChain JS package. Are you aware of any alternatives?

Personally, I’m not a fan of these naive tools. Langchain is 49% wrapper, 50% marketing, and 1% value.

Just code it.
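If you do roll your own, the core of a recursive character splitter is small. A simplified sketch (the separator list is an assumption, and it omits chunk overlap for brevity):

```javascript
// Recursively split `text` so every chunk is at most `maxLen` characters,
// preferring to break on larger separators first (paragraph, line, sentence, word).
function splitRecursive(text, maxLen = 512, separators = ["\n\n", "\n", ". ", " "]) {
  if (text.length <= maxLen) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-cut at maxLen.
    const out = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  const parts = text.split(sep);
  const chunks = [];
  let current = "";
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= maxLen) {
      current = candidate; // still fits: keep accumulating
    } else {
      if (current) { chunks.push(current); current = ""; }
      if (part.length > maxLen) {
        // This piece alone is too long: recurse with finer separators.
        chunks.push(...splitRecursive(part, maxLen, rest));
      } else {
        current = part;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Owning this loop makes it easy to adapt the separators to the text at hand, e.g. adding Hebrew punctuation or paragraph markers.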

1 Like

Whether you use LangChain or not, logging the inputs you send to the LLM and the outputs it returns can help you pinpoint where the failures occur.
Tools like W&B might be useful for this.
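A minimal version of that logging needs no external tooling at all — wrap the call and record each prompt/response pair (the log shape and the `callModel` function here are just placeholders):

```javascript
// In-memory log of every prompt/response pair; swap for a file, W&B, or a DB.
const logs = [];

// `callModel` is a placeholder for whatever async function actually hits the API.
async function loggedCall(callModel, prompt) {
  const entry = { timestamp: new Date().toISOString(), prompt };
  const response = await callModel(prompt);
  entry.response = response;
  logs.push(entry);
  return response;
}
```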

1 Like

Welcome @neeraj-codebuddy

I think one of the most basic things to do is to inspect the context that’s being passed to the model for the chat completion API call.
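Concretely, that can be as simple as printing the assembled messages right before the call, so you can see exactly which retrieved chunks (and in what state the Hebrew text is) reach the model. A sketch — the retrieval step and the system-prompt wording are assumptions:

```javascript
// Assemble chat messages from retrieved chunks and inspect them before sending.
// `retrievedChunks` would come from your Pinecone query results.
function buildMessages(retrievedChunks, question) {
  const context = retrievedChunks.join("\n---\n");
  return [
    { role: "system", content: "Answer using only the context below.\n\n" + context },
    { role: "user", content: question },
  ];
}

const messages = buildMessages(["chunk A", "chunk B"], "What does the text say?");
// Inspect exactly what the model will see:
console.log(JSON.stringify(messages, null, 2));
```

If the chunks printed here are already garbled or irrelevant, the problem is in chunking/retrieval rather than in the completion call itself.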