You can use the tiktoken library, locally, to encode strings and decode sequences of token numbers. Doing so lets you see the token numbers, the delineation between tokens, and the token count.
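Here's a minimal sketch of that, assuming the `cl100k_base` encoding (pick whichever encoding matches your model):

```python
import tiktoken

# Load an encoding; cl100k_base is used by the ada-002 / GPT-3.5/4-era models
enc = tiktoken.get_encoding("cl100k_base")

text = "psycho\npsycho"
token_ids = enc.encode(text)  # list of token numbers
print(token_ids)

# Show the delineation between tokens by decoding each ID individually
print([enc.decode([t]) for t in token_ids])

# Token count
print(len(token_ids))
```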
The embeddings endpoint still accepts just plain text, and the tokenization that happens behind the scenes isn't of much practical use, unless, for example, you wanted to run embeddings on each of the ~100k tokens in the vocabulary to see how similar "psy" is to " Psycho" (the former is what you get if you input "psycho" lower-case on a new line).
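If you did want to run that comparison, here's a hypothetical sketch, assuming the OpenAI Python SDK (v1) and the `text-embedding-3-small` model (any embeddings model would do):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Embed both token strings in a single request
resp = client.embeddings.create(
    model="text-embedding-3-small",  # assumed model choice for this sketch
    input=["psy", " Psycho"],
)
a, b = (np.array(d.embedding) for d in resp.data)

# Cosine similarity: 1.0 = same direction, near 0 = unrelated
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.4f}")
```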
Embeddings are much more concept-based than word-based. An embedding will capture an overview of this entire post when it is sent, making it comparable against other text in ways no word-level comparison could measure.