I am beginning to test vector searches on my embeddings (using Pinecone and cosine similarity right now). Currently, I am using cl100k_base as the tokenizer for embedding calls.
In my use case, users will enter a one- or two-sentence query to search regulatory documents. The documents range in size from two paragraphs to two pages and consist primarily of state laws, regulations, and guidelines. It sounds like the BERT tokenizer would be better suited than cl100k_base for tokenizing the documents and queries.
The BERT tokenizer may be a better choice, as it was specifically designed for processing natural language text and was trained on a large corpus of text data. BERT is particularly well suited to understanding the meaning of a sentence in the context of the surrounding text, and it can take into account the relationships between different words and phrases.
I would love to hear what users (humans) who have used them both have to say.
If you use OpenAI’s tiktoken (GitHub - openai/tiktoken), the documentation shows that it not only lets you specify the tokenizer directly via the get_encoding function, but, even better, you can get a tokenizer by providing the name of the model you intend to use, leaving the choice of the corresponding encoding to the library itself: tiktoken.encoding_for_model("text-davinci-003")
Hi, please, I want to ask something. This might seem very fundamental, but I am not sure. ChatGPT was trained on embeddings from "text-davinci-003", so it makes sense to use it when calling OpenAI’s GPT model. But are these LLMs embedding-agnostic? Can I similarly use BERT embeddings, or Llama, Word2Vec, etc., with ChatGPT? Will I get good results?
I am looking at embedding some text, storing it on a vector database, and conducting a semantic search later.
Since writing this post, I switched from Pinecone to the Weaviate vector store. In Weaviate, I use their text2vec-openai transformer, which uses the text-embedding-ada-002 embedding model, which in turn uses the cl100k_base tokenizer. I have had excellent results.
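The embed → store → search pipeline can be sketched independently of any particular vector store. The snippet below is a toy illustration only: the vectors are made up, and the actual embedding call (e.g. to text-embedding-ada-002) is assumed to happen elsewhere; cosine_similarity and search are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs):
    # Rank stored documents by similarity to the query vector
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy 3-dimensional "embeddings" standing in for real model output
docs = {
    "reg-101": [0.9, 0.1, 0.0],
    "reg-202": [0.1, 0.8, 0.2],
}
query = [0.85, 0.15, 0.05]
results = search(query, docs)
# → reg-101 ranks first, since its vector points nearly the same way as the query
```

A managed store like Pinecone or Weaviate does this ranking for you at scale (with approximate nearest-neighbor indexes); the sketch just shows what the similarity metric is computing.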
I cannot comment on anything else since I’ve not used anything else.