How to Prepare Index References for Fine-Tuning: Tokenization and Context Considerations

What is the best way to prepare related information for fine-tuning, such as the ISBN of a publication or other index references like “SYS.1.5.A2”? These identifiers only make sense in their entirety.

Are such references broken down into individual tokens, and if so, is the information still interpreted correctly in the context of a query? Does it help to put the references in quotes in the training data, in both the “user” role and the “assistant” role?
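(For reference, a single training example in OpenAI's chat fine-tuning JSONL format might look like the sketch below; the question and answer text are hypothetical placeholders, with the reference kept as one contiguous quoted string in both roles.)

```python
import json

# Hypothetical training example: the identifier "SYS.1.5.A2" is kept as one
# contiguous, quoted string in both the user and the assistant message.
example = {
    "messages": [
        {"role": "user", "content": 'What does requirement "SYS.1.5.A2" cover?'},
        {"role": "assistant", "content": 'Requirement "SYS.1.5.A2" covers ... (answer text goes here).'},
    ]
}

# Fine-tuning files are JSONL: one JSON object per line.
print(json.dumps(example, ensure_ascii=False))
```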

Yes, the model operates on internal tokens for all the language and data it processes. If you are curious about tokenization, you can paste your own text into an online tokenizer such as Tiktokenizer to see how it is broken into units of understanding.
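You can also inspect this programmatically. A minimal sketch using the tiktoken library (the encoding name is an assumption; different models use different encodings):

```python
# pip install tiktoken
import tiktoken

# "cl100k_base" is one common encoding; your target model may use another.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["SYS.1.5.A2", "ISBN 978-3-16-148410-0"]:
    ids = enc.encode(text)
    # Show how the identifier is split into several sub-string tokens.
    print(text, "->", [enc.decode([i]) for i in ids])
```

You will typically see an identifier like this split across several tokens rather than kept as a single unit.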

However, it sounds like you might be planning to use fine-tuning beyond the scope where it can succeed, for example as a data retrieval system.

You can train the AI to output text that looks like such identifiers, and to follow specific output formats and response styles, but you will also encounter hallucinations, just as when you ask GPT to provide citations for an article and it makes up plausible but non-existent web links.

Ask the AI to conduct a multiple-choice quiz on your area of specialization, such as 1980s freestyle music or collectible 19th-century coins, and you will see it mash together data from various sources.

Thanks for the reply. Indeed, my first attempts with a mixture of customized prompt design and fine-tuning are quite promising.

The OpenAI fine-tuning guide might give you examples of applications, but it doesn’t outright say when an idea is destined for disappointment.

A much better solution for augmenting the AI with cold, hard facts is an embeddings-based vector database: it retrieves knowledge text that is semantically similar to the question being posed, which can then be fed to the AI alongside the user question before it generates an answer.
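As a minimal sketch of that approach, assuming the OpenAI Python SDK and an in-memory list in place of a real vector database (the model name and document texts are placeholders):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Placeholder knowledge snippets; in practice these come from your documents.
documents = [
    'Requirement "SYS.1.5.A2": ... full requirement text here ...',
    'Requirement "SYS.1.5.A3": ... full requirement text here ...',
]
doc_vectors = [embed(d) for d in documents]

def retrieve(question: str) -> str:
    """Return the document most similar to the question (cosine similarity)."""
    q = embed(question)
    sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    return documents[int(np.argmax(sims))]

# The retrieved text is then placed into the prompt alongside the user question.
context = retrieve('What does "SYS.1.5.A2" require?')
```

Because the exact identifier string travels with the retrieved text into the prompt, the model can quote it verbatim instead of having to reproduce it from fine-tuned weights.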
