Fine-tuning GPT-4o with a large data source in the system prompt

Hi everyone,

I am currently trying to fine-tune gpt-4o/gpt-4o-mini. I have a large data source that I currently pass along with the system prompt, and the model returns a structured response in JSON format, with the values in the JSON taken from that data source.
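To make the setup concrete, here is a minimal sketch of how I build the request. The data source, prompts, and field names below are hypothetical stand-ins, not my real data:

```python
import json

# Hypothetical stand-in for the real ~5,000-token data source.
DATA_SOURCE = "product_id,name,price\n101,Widget,9.99\n102,Gadget,19.99"

def build_messages(user_prompt: str) -> list[dict]:
    """Embed the full data source in the system prompt for one request."""
    system_prompt = (
        "Answer with a JSON object. Take the values from this data source:\n"
        + DATA_SOURCE
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("What is the price of the Widget?")
# These messages are then sent with something like:
# client.chat.completions.create(model="gpt-4o-mini", messages=messages,
#                                response_format={"type": "json_object"})
```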

However, I want to improve the accuracy of the JSON responses by fine-tuning on a dataset of user prompts paired with the expected model responses. My question is whether I should include the huge data source in the system prompt when building the fine-tuning dataset. The data source takes up about 5,000 tokens, so including it in every training example would be very expensive.
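For reference, a training example in the chat fine-tuning JSONL format looks like the sketch below. The shortened system prompt and the example question/answer are my own hypothetical placeholders; the open question is whether the system prompt here should also carry the full data source:

```python
import json

# Hypothetical shortened system prompt that omits the 5,000-token data source.
SHORT_SYSTEM = "Answer with a JSON object whose values come from the provided data source."

# One fine-tuning record in the chat JSONL format: a messages list ending
# with the assistant response the model should learn to produce.
record = {
    "messages": [
        {"role": "system", "content": SHORT_SYSTEM},
        {"role": "user", "content": "What is the price of the Widget?"},
        {"role": "assistant", "content": json.dumps({"name": "Widget", "price": 9.99})},
    ]
}

# Each record becomes one line of the training .jsonl file.
line = json.dumps(record)
```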

If I omit the data source from the fine-tuning dataset and only include it in the API calls to the fine-tuned model, will it generate responses using a mix of the data source and the fine-tuned knowledge?

Thanks!