Which embedding tokenizer should I use?

SomebodySysop · March 3, 2023, 8:59am

I am beginning to test vector searches on my embeddings (using Pinecone and cosine similarity right now). Currently, I am using CL100K_base as the tokenizer for embedding calls.

In my use case, users will enter a one- or two-sentence query to search regulatory documents. The documents could range in size from two paragraphs to two pages. They will consist primarily of state laws, regulations, and guidelines. It sounds like BERT would be better suited to tokenizing the documents and queries than CL100K_base.

ChatGPT says:

BERT tokenizer may be a better choice as it has been specifically designed for processing natural language text and has been trained on a large corpus of text data. BERT is particularly well-suited for understanding the meaning of a sentence in the context of the surrounding text, and can take into account the relationships between different words and phrases.

I would love to hear what users (humans) who have used them both have to say.

udm17 · March 3, 2023, 11:26am

I've been doing something similar. I encoded my documents using OpenAI embeddings, and I've been using the same for the questions as well.

The way the API call is structured, you won't need to worry about how the tokenisation happens, but I would suggest using the same embedding model for both the documents and the questions.
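To make the "same embeddings for documents and questions" advice concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The function names and the toy two-dimensional vectors are made up for illustration; real vectors would come from whichever embedding model you use, and a vector store such as Pinecone would normally do this ranking for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        # Vectors of different sizes usually mean different embedding models,
        # and those are not comparable.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (illustrative only, not real embeddings).
docs = {"licensing_rule": [0.9, 0.1], "zoning_guideline": [0.1, 0.9]}
query = [0.8, 0.2]
print(rank_documents(query, docs))
```

The scores are only meaningful when the query vector and the document vectors come from the same model. Matching dimensions are necessary but not sufficient, since two different models can produce vectors of the same length.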

walaszczykm · March 3, 2023, 11:35am

If you use OpenAI’s tiktoken (GitHub - openai/tiktoken), then according to the documentation you can specify the tokenizer directly with the get_encoding function. Even better, you can get the tokenizer by providing the name of the model you would like to use, leaving the choice of the corresponding tokenizer to the library itself: tiktoken.encoding_for_model("text-davinci-003")

SomebodySysop · March 3, 2023, 8:12pm

Do you use BERT or CL100K_base?

SomebodySysop · March 3, 2023, 8:12pm

I am actually using cURL in PHP to make the calls. Do you use BERT or CL100K_base?
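For readers making raw HTTP calls as described above, here is a rough Python equivalent of such a request, as a sketch only. It builds a POST to OpenAI's /v1/embeddings endpoint without sending it. The function name build_embedding_request and the placeholder API key are assumptions for illustration. The point it shows is that the request body carries plain text, so tokenization happens on the server side and the caller does not pick a tokenizer.

```python
import json
import urllib.request

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"

def build_embedding_request(text, api_key, model="text-embedding-ada-002"):
    """Hypothetical helper: build (but do not send) an embeddings request."""
    # The payload contains raw text; no token IDs are needed.
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# "YOUR_API_KEY" is a placeholder, not a real key.
req = build_embedding_request("What permits does a food truck need?", "YOUR_API_KEY")
print(req.get_method(), req.full_url)

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     vector = json.load(resp)["data"][0]["embedding"]
```

The same helper can be used for both documents and queries, which keeps the model (and therefore the tokenizer) identical on both sides of the search.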

chukwudi · September 5, 2023, 1:45pm

Hi, I'd like to ask something. This might seem very fundamental, but I am not sure. ChatGPT was trained on embeddings from "text-davinci-003", so it makes sense to use it when calling OpenAI’s GPT model. But are these LLMs embedding-agnostic? Can I similarly use BERT, Llama, Word2Vec, etc. embeddings with ChatGPT? Will I get good results?

I am looking at embedding some text, storing it on a vector database, and conducting a semantic search later.

SomebodySysop · September 5, 2023, 4:53pm

Since writing this post, I switched from Pinecone to the Weaviate vector store. In Weaviate, I use their text2vec-openai vectorizer module, which uses the text-embedding-ada-002 embedding model, which in turn uses the CL100K_base tokenizer. I have had excellent results.

I cannot comment on anything else since I’ve not used anything else.
