Which embedding tokenizer should I use?

I am beginning to test vector searches on my embeddings (using Pinecone and cosine similarity right now). Currently, I am using cl100k_base as the tokenizer for embedding calls.

In my use case, users will enter a one- or two-sentence query to search regulatory documents. The documents range in size from two paragraphs to two pages and consist primarily of state laws, regulations, and guidelines. It sounds like BERT would be better suited than cl100k_base for tokenizing the documents and queries.

ChatGPT says:

BERT tokenizer may be a better choice as it has been specifically designed for processing natural language text and has been trained on a large corpus of text data. BERT is particularly well-suited for understanding the meaning of a sentence in the context of the surrounding text, and can take into account the relationships between different words and phrases.

I would love to hear what users (humans) who have used them both have to say.

I've been doing something similar. Having encoded my documents with OpenAI embeddings, I use the same embeddings for the questions as well.

The way the API call is structured, you won't need to worry about how the tokenisation happens, but I would suggest using the same embedding model for both the documents and the questions, since vectors produced by different models are not comparable.
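To make the "same embeddings on both sides" point concrete, here is a rough Python sketch of how a cosine-similarity search ranks documents against a query. The vectors and document names are made up for illustration (real embeddings have far more dimensions), and a vector database like Pinecone does this comparison for you. The ranking only means something if the query vector and the document vectors came from the same model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional "embeddings", invented for this example.
doc_vecs = {
    "statute_a": [0.9, 0.1, 0.0],
    "guideline_b": [0.1, 0.8, 0.3],
}
query_vec = [0.8, 0.2, 0.1]

for doc_id, score in rank_documents(query_vec, doc_vecs):
    print(f"{doc_id}: {score:.3f}")
```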

If you use OpenAI’s tiktoken (openai/tiktoken on GitHub), the documentation says you can specify the tokenizer directly with the get_encoding function. Even better, you can get the tokenizer by giving the name of the model you want to use and let the library pick the corresponding encoding: tiktoken.encoding_for_model("text-davinci-003")
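For anyone who wants to try this, here is a minimal sketch of both lookups. It assumes tiktoken is installed (pip install tiktoken) and that the encoding files can be downloaded on first use. The helper name describe_tokens and the default model text-embedding-ada-002 are my own choices for illustration, not something from the posts above. Note that this only counts tokens locally, for example to check input length before calling the API. The embedding endpoint tokenizes the text itself.

```python
def describe_tokens(text, model="text-embedding-ada-002"):
    """Return (encoding_name, token_count) for `text` under `model`'s tokenizer."""
    # tiktoken is a third-party package; it is imported inside the function so
    # this file still loads on a machine where it is not installed.
    import tiktoken

    # Let the library choose the encoding that matches the model name.
    encoding = tiktoken.encoding_for_model(model)
    return encoding.name, len(encoding.encode(text))


def count_with_named_encoding(text, encoding_name="cl100k_base"):
    """Same idea, but naming the encoding explicitly via get_encoding."""
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


if __name__ == "__main__":
    query = "What are the state licensing requirements for home inspectors?"
    name, n_tokens = describe_tokens(query)
    print(f"{name}: {n_tokens} tokens")
    print(count_with_named_encoding(query), "tokens with cl100k_base")
```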

I am actually using cURL in PHP to make the calls. Do you use BERT or cl100k_base?