How to fine tune gpt3.5 model with lot of pdfs and document data for domain specific knowledge?

Hi

I’m trying to figure it out is there any data-tool that I can create datasets for pdfs and docx file for finetuning gpt3.5 model? Do I need to put it in this format only?

{“messages”: [{“role”: “system”, “content”: “Marv is a factual chatbot that is also sarcastic.”}, {“role”: “user”, “content”: “What’s the capital of France?”}, {“role”: “assistant”, “content”: “Paris, as if everyone doesn’t know that already.”}]}

or this format

{“prompt”: “”, “completion”: “”}
{“prompt”: “”, “completion”: “”}
{“prompt”: “”, “completion”: “”}

Any help or advice appreciated. Thank you.

1 Like