Custom model response not aligning with training datasets

Hi everyone, I’ve been working on fine-tuning a GPT model (using gpt-4o-2024-08-06) with company-specific data stored in a blog-like format in our database. For the fine-tuning process, I used 5 blog posts from our database and passed their titles and bodies as input for training.
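For reference, this is roughly how I built the training file (simplified; the post contents and prompts here are just placeholders, the real ones come from our database):

```python
# Rough sketch of how I generated the fine-tuning JSONL (chat format).
import json

# Placeholder posts; in reality these are pulled from our database.
posts = [
    {"title": "Our refund policy", "body": "Refunds are processed within 14 days..."},
]

with open("train.jsonl", "w") as f:
    for p in posts:
        example = {
            "messages": [
                {"role": "user", "content": f"Tell me about: {p['title']}"},
                {"role": "assistant", "content": p["body"]},
            ]
        }
        f.write(json.dumps(example) + "\n")
```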

The fine-tuning process completed successfully, and I’ve started using the model for prompting. However, when I ask questions, the responses often don’t align with the data I provided during fine-tuning. It seems like the model is not accurately reflecting the content of my dataset.

Could the issue be related to the limited dataset size, or might there be other factors at play? What can I do to ensure the model gives responses based on the data I supplied during fine-tuning? Any tips or insights would be greatly appreciated!

Welcome to the community!

Did you expect that you could provide data during fine-tuning, and that the model would then be able to recall that data at inference time?

Unfortunately, fine-tuning mainly teaches the model style, tone, and output format rather than new facts. Injecting knowledge this way only works within a tiny, very specific scope, and even then I wouldn't recommend it.

If you want the data you provide to be used to answer questions, you need to include it in the model's context at inference time. One common way to do that is RAG (retrieval-augmented generation), and one simple implementation is to use the Assistants API with your documents attached for file search.
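To make that concrete, here's a rough sketch of a minimal RAG loop using the openai Python SDK (v1.x). The blog posts, prompts, and model names are placeholders; a real setup would add chunking, a persistent vector store, and error handling.

```python
# Minimal RAG sketch: embed posts, retrieve the most similar ones,
# and put them in the prompt so the model answers from your data.
from openai import OpenAI
import numpy as np

client = OpenAI()

# Hypothetical blog posts pulled from your database.
posts = [
    {"title": "Our refund policy", "body": "Refunds are processed within 14 days..."},
    {"title": "Product roadmap 2025", "body": "Next quarter we are shipping..."},
]

def embed(texts):
    """Embed a list of strings with an embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Embed every post once; in practice you'd store these vectors.
doc_vectors = embed([f"{p['title']}\n{p['body']}" for p in posts])

def answer(question, k=2):
    # Embed the question and rank posts by cosine similarity.
    q = embed([question])[0]
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = [posts[i] for i in np.argsort(sims)[::-1][:k]]

    # Ground the answer in the retrieved posts.
    context = "\n\n".join(f"{p['title']}\n{p['body']}" for p in top)
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```

The Assistants API route is even less code: upload your documents, attach them to an assistant with the file_search tool via a vector store, and ask your questions in a thread. Either way, the key point is that the model sees your data in context at inference time instead of having to "remember" it from fine-tuning.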