How to Prepare Index References for Fine-Tuning: Tokenization and Context Considerations

What is the best way to prepare related information for fine-tuning, such as the ISBN of a publication or other index references like “SYS.1.5.A2”? These identifiers only make sense in their entirety.

Are such references broken down into individual tokens, and if so, is the information still interpreted correctly in the context of a query? Does it help to put the references in quotes in the training data, in both the “user” role and the “assistant” role?
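(For reference, a single training example in OpenAI's chat fine-tuning JSONL format might look like the sketch below; the question and answer text are hypothetical placeholders, with the reference kept as one contiguous quoted string in both roles.)

```python
import json

# Hypothetical training example: the identifier "SYS.1.5.A2" is kept as one
# contiguous, quoted string in both the user and the assistant message.
example = {
    "messages": [
        {"role": "user", "content": 'What does requirement "SYS.1.5.A2" cover?'},
        {"role": "assistant", "content": 'Requirement "SYS.1.5.A2" covers ... (answer text goes here).'},
    ]
}

# Fine-tuning files are JSONL: one JSON object per line.
print(json.dumps(example, ensure_ascii=False))
```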

Yes, the model operates on internal tokens for all the language and data it processes. If you are curious about tokenization, you can paste your own text into an online tokenizer such as Tiktokenizer to see how it is broken into units of understanding.
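You can also inspect this programmatically. A minimal sketch using the tiktoken library (the encoding name is an assumption; different models use different encodings):

```python
# pip install tiktoken
import tiktoken

# "cl100k_base" is one common encoding; your target model may use another.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["SYS.1.5.A2", "ISBN 978-3-16-148410-0"]:
    ids = enc.encode(text)
    # Show how the identifier is split into several sub-string tokens.
    print(text, "->", [enc.decode([i]) for i in ids])
```

You will typically see an identifier like this split across several tokens rather than kept as a single unit.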

However, it sounds like you might be planning to use fine-tuning beyond the scope where it can succeed, for example as a data retrieval system.

You can train the AI to output text that looks like such identifiers, and to follow specific output formats and response styles, but you will also encounter hallucinations, just as when you ask GPT to provide citations for an article and it makes up plausible but non-existent web links.

Ask the AI to conduct a multiple-choice quiz on your area of specialization, such as 1980s freestyle music or collectible 19th-century coins, and you will see it mash together data from various sources.

Thanks for the reply. Indeed, my first attempts with a mixture of customized prompt design and fine-tuning are quite promising.

The OpenAI fine-tuning guide might give you examples of applications, but it doesn’t outright say when an idea is destined for disappointment.

A much better solution for augmenting the AI with cold, hard facts is an embeddings-based vector database: it retrieves knowledge text that is semantically similar to the question being posed, which can then be fed to the AI alongside the user question before it generates an answer.
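As a minimal sketch of that approach, assuming the OpenAI Python SDK and an in-memory list in place of a real vector database (the model name and document texts are placeholders):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Placeholder knowledge snippets; in practice these come from your documents.
documents = [
    'Requirement "SYS.1.5.A2": ... full requirement text here ...',
    'Requirement "SYS.1.5.A3": ... full requirement text here ...',
]
doc_vectors = [embed(d) for d in documents]

def retrieve(question: str) -> str:
    """Return the document most similar to the question (cosine similarity)."""
    q = embed(question)
    sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    return documents[int(np.argmax(sims))]

# The retrieved text is then placed into the prompt alongside the user question.
context = retrieve('What does "SYS.1.5.A2" require?')
```

Because the exact identifier string travels with the retrieved text into the prompt, the model can quote it verbatim instead of having to reproduce it from fine-tuned weights.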
