I am beginning to test vector searches on my embeddings (using Pinecone and cosine similarity right now). Currently, I am using cl100k_base as the tokenizer for embedding calls.
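For reference, cosine similarity (the metric mentioned above) is just the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch, with made-up toy vectors rather than real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A vector compared with itself scores 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (up to float rounding)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

Pinecone computes this on the server side, so this is only to show what the index is ranking by.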
In my use case, users will enter a one- or two-sentence query to search regulatory documents. The documents range in size from two paragraphs to two pages and consist primarily of state law, regulations, and guidelines. It sounds like the BERT tokenizer would be better suited than cl100k_base for tokenizing the documents and queries.
The BERT tokenizer may be a better choice, as it was specifically designed for processing natural language text and was trained on a large corpus of text data. BERT itself is particularly well suited to understanding the meaning of a sentence in the context of the surrounding text, and it can take into account the relationships between different words and phrases.
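For intuition on how the BERT tokenizer differs from BPE-style encodings like cl100k_base: BERT uses WordPiece, a greedy longest-match-first subword split. The sketch below is a toy illustration with an invented five-piece vocabulary, not BERT's real ~30k-piece vocabulary or the `transformers` implementation:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece style.

    Continuation pieces carry a '##' prefix; if no piece matches,
    the whole word maps to '[UNK]'.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, invented for this example only.
vocab = {"regulat", "##ion", "##ory", "law", "##s"}
print(wordpiece_tokenize("regulatory", vocab))  # ['regulat', '##ory']
print(wordpiece_tokenize("laws", vocab))        # ['law', '##s']
```

The point is that unknown words decompose into meaningful sub-pieces rather than opaque byte chunks, which is part of why it handles natural-language vocabulary well.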
I would love to hear from users (humans) who have used both.
If you use OpenAI’s tiktoken (GitHub - openai/tiktoken), then according to the documentation it not only lets you specify the tokenizer directly via the get_encoding function, but, even better, lets you obtain a tokenizer by providing the name of the model you want to use, leaving the choice of the corresponding tokenizer to the library itself: tiktoken.encoding_for_model("text-davinci-003")