Using a Custom Tokenizer with GPT Embeddings

Hi,

I’m currently using the text-embedding-ada-002 model for embeddings, and I’m interested in adding custom special tokens to the model’s tokenizer. After exploring the tiktoken package, I found an example that demonstrates how to define a custom tokenizer:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # When defining a custom tokenizer, use a distinct name that reflects its behavior.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<your_special_token1>": 100264,
        "<your_special_token2>": 100265,
    },
)
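
For reference, something like this should encode the special tokens locally (a quick sketch using the placeholder token names above):

text = "Hello <your_special_token1> world"

# allowed_special is needed; otherwise tiktoken raises on special tokens found in the input
tokens = enc.encode(text, allowed_special="all")
print(tokens)              # includes 100264 for <your_special_token1>
print(enc.decode(tokens))  # round-trips back to the original text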

My question is: how can I integrate my custom tokenizer “cl100k_im” when using the text-embedding-ada-002 model for embeddings?

Thanks!

The tokenizer of OpenAI models is fixed by the model’s training and the API endpoint itself, and cannot be amended. There are special tokens proprietary to OpenAI that have been trained into models other than embeddings, but they are blocked from being encoded and sent to the AI.

Since you cannot fine-tune embeddings models, sending a token that has zero semantic value would also seem to make little sense.

2 Likes

Thanks for clearing that up; I see what you mean. But let’s say I’ve got a couple of words in my text that I want to map to the same tokens or embedding vectors. Is there a way to do this with text-embedding-ada-002?

1 Like

You can use the tiktoken library, locally, to encode strings and decode token number sequences. Doing so, you can see the token numbers, the delineation between tokens, and obtain counts.
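
For example, a minimal local sketch (assuming tiktoken is installed; cl100k_base is the encoding used by text-embedding-ada-002):

import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")

text = "Once upon a time"
tokens = enc.encode(text)                   # token number sequence
pieces = [enc.decode([t]) for t in tokens]  # how the text splits into tokens

print(tokens)       # token numbers
print(pieces)       # delineation between tokens
print(len(tokens))  # token count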

The embeddings endpoint still accepts just plain text, and the token operation behind the scenes really isn’t of much practical use, unless, for example, you wanted to run embeddings on each of the ~100k tokens to see how similar "psy" is to " Psycho" (the former being what you get if you input "psycho" lower-case on a new line).
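
If you did want to compare two strings that way, here’s a rough sketch using the openai Python library (v1+, assuming an API key in the environment; the strings are just illustrative):

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

a = embed("psy")
b = embed(" Psycho")

# cosine similarity between the two embedding vectors
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))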

Embeddings are much more about concepts than individual words. The model will have an overview of this entire post when it is sent, comparable against other text in countless ways.

2 Likes

My intention was to allocate some words so they’d have the same tokens and embeddings, no matter where they appear or what the context is. I understand I can’t fine-tune the embeddings, but I wanted to try different tricks.

1 Like

I don’t really see “allocating some words so they’d have the same tokens and embeddings” as arriving at a solution to anything.

You can’t piece together inputs to arrive at the same output.
“Once upon a time”…(join)…“happily ever after” will not have vectors that can be recombined to land close to the embedding of the whole text.
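
You can check this for yourself with a self-contained sketch (again assuming the openai v1+ Python library and an API key in the environment; the fragments are just examples):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    v = np.array(resp.data[0].embedding)
    return v / np.linalg.norm(v)  # unit-normalize so the dot product is cosine similarity

whole = embed("Once upon a time happily ever after")
parts = embed("Once upon a time") + embed("happily ever after")
parts /= np.linalg.norm(parts)

# Combining the fragment vectors does not reconstruct the whole-text embedding
print(float(whole @ parts))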

A paragraph with some words replaced by synonyms is still extremely close in semantic similarity. And if you get deeper into the text with whatever processing you envision, you might mainly be giving it the semantic quality of “hard-to-read text that isn’t capitalized correctly”, and enhancing similarity only on that basis.

An embedding is an internal hidden state of the model, built up by reading through the input in much the same way a language AI model does, accumulating the meaning and concepts behind an input based on pretraining on written language and fine-tuned learning layers. Think of it as “considering everything about what was written”, except with less focus on purely producing a single next token the way a language model would.

1 Like