Which embedding tokenizer should I use?

I am beginning to test vector search on my embeddings (using Pinecone with cosine similarity right now). Currently, I am using cl100k_base as the tokenizer for my embedding calls.

In my use case, users will enter a one- or two-sentence query to search regulatory documents. The documents range in size from two paragraphs to two pages and consist primarily of state laws, regulations, and guidelines. It sounds like BERT would be better suited to tokenizing the documents and queries than cl100k_base.
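For context, here is a minimal sketch of the flow I am describing, in Python rather than my actual code. The index name, keys, and query text are placeholders, and it assumes a Pinecone index that already holds the document vectors:

```python
from openai import OpenAI
from pinecone import Pinecone

# Embed the user's query with the same model used for the documents.
# text-embedding-ada-002 tokenizes internally with cl100k_base.
openai_client = OpenAI(api_key="OPENAI_API_KEY")
query = "What are the licensing requirements for home inspectors?"
embedding = openai_client.embeddings.create(
    model="text-embedding-ada-002",
    input=query,
).data[0].embedding

# Query the index; cosine similarity is configured when the index is created.
pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("regulations")
results = index.query(vector=embedding, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.id, match.score)
```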

ChatGPT says:

The BERT tokenizer may be a better choice, as it was specifically designed for processing natural language text and was trained on a large corpus of text data. BERT is particularly well suited to understanding the meaning of a sentence in the context of the surrounding text, and can take into account the relationships between different words and phrases.

I would love to hear what users (humans) who have used them both have to say.


I've been doing something a bit similar: having encoded my documents using OpenAI embeddings, I've been using the same embeddings for the questions as well.

The way the API call is structured, you won't need to worry about how the tokenisation happens, but I would suggest using the same embedding model for both the documents and the questions.
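Roughly what I mean, as a sketch (the model name and texts are illustrative; the point is only that the document and the question go through the same embedding model, so their vectors live in the same space):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    # Same model for documents and questions, so the vectors are comparable.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

doc_vec = embed("A contractor must hold a state license before bidding on public work.")
query_vec = embed("Do contractors need a license to bid on government projects?")

# Cosine similarity: closer to 1 means more semantically similar.
similarity = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
print(similarity)
```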

If you use OpenAI’s tiktoken (GitHub - openai/tiktoken), the documentation shows that it not only lets you specify the tokenizer directly via the get_encoding function, but, even better, you can get the tokenizer by providing the name of the model you want to use, leaving the choice of the corresponding tokenizer to the library itself: tiktoken.encoding_for_model("text-davinci-003")
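For example, both styles side by side (the sample text is just a placeholder):

```python
import tiktoken

# Name the encoding explicitly...
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("state laws and regulations"))

# ...or let tiktoken pick the right encoding for a given model.
enc = tiktoken.encoding_for_model("text-davinci-003")
print(enc.decode(enc.encode("state laws and regulations")))
```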


Do you use BERT or cl100k_base?

I am actually using cURL in PHP to make the calls. Do you use BERT or cl100k_base?

Hi, please, I want to ask something. This might seem very fundamental, but I am not sure. ChatGPT was trained on embeddings from "text-davinci-003", so it makes sense to use it when calling OpenAI’s GPT model. But are these LLMs embedding-agnostic? Can I similarly use BERT embeddings, or Llama, Word2Vec, etc., with ChatGPT? Will I have good results?

I am looking at embedding some text, storing it on a vector database, and conducting a semantic search later.

Since writing this post, I have switched from Pinecone to the Weaviate vector store. In Weaviate, I use their text2vec-openai module, which uses the text-embedding-ada-002 embedding model, which in turn uses the cl100k_base tokenizer. I have had excellent results.
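For reference, a search then looks roughly like this (a sketch using the v3 Weaviate Python client; the endpoint, Document class name, and fields are placeholders; the text2vec-openai module embeds the query server-side with the same model used for the documents):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # your Weaviate endpoint

# nearText has the text2vec-openai module embed the query with
# text-embedding-ada-002, so query and documents share a vector space.
result = (
    client.query
    .get("Document", ["title", "content"])
    .with_near_text({"concepts": ["licensing requirements for contractors"]})
    .with_limit(5)
    .do()
)
print(result["data"]["Get"]["Document"])
```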

I cannot comment on anything else since I’ve not used anything else.