Say I wanted davinci to be an SME on a certain topic, and suppose I had a clean textbook on that topic (no images for now) that I fed into davinci.
How would one set up the training file that goes into davinci so that it understands all the content within that textbook and that it could actually become an SME on that topic?
It depends on the topic. If it’s history or another fact-based subject, you’re better off with a Q&A setup. If it requires more executive function (like solving problems), then your fine-tuning data will look very different.
Content writing is different from being an SME. I’m kinda confused as to what you want to do with this? You’re being a little vague, perhaps not deliberately. It would be helpful if you gave very specific examples of the outcomes you’re looking for.
Ahh sorry let me explain my process a bit:
I want to train davinci to write long-form content articles/blogs from a corpus of dog health articles I’ve found. I’ve spent the last few weeks collecting this training data for davinci. I’ve gathered fairly technical articles from animal blogs and scientific articles and then cleaned them (removed images, weird formatting, etc.).
I’m now at the step where I need to upload a training file to OpenAI in “Prompt… Completion” format. I am not sure how best to structure the training file so that davinci will perform well on the open-ended text generation portion.
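For what it’s worth, here’s a minimal sketch of how a prompt/completion training file can be assembled. The file name, the example articles, and the exact separator/stop tokens are my assumptions; OpenAI’s fine-tuning guidance for the legacy format recommends a fixed separator at the end of every prompt and a consistent stop sequence at the end of every completion, but the specific strings are up to you.

```python
# Sketch: build a fine-tuning file in OpenAI's legacy prompt/completion
# JSONL format. The separator ("\n\n###\n\n") and stop token (" END")
# are illustrative choices, not requirements.
import json

# Hypothetical cleaned articles: (question-style prompt, article text) pairs.
examples = [
    ("What are early signs of arthritis in dogs?",
     "Stiffness after rest and reluctance to climb stairs are common early signs..."),
    ("How is heartworm prevented in dogs?",
     "Monthly preventives prescribed by a veterinarian are the standard approach..."),
]

with open("dog_health_finetune.jsonl", "w") as f:
    for prompt, completion in examples:
        record = {
            # Fixed separator marks where the prompt ends.
            "prompt": prompt + "\n\n###\n\n",
            # Leading space and a stop token, per the legacy-format guidance.
            "completion": " " + completion + " END",
        }
        f.write(json.dumps(record) + "\n")
```

One record per line; the same separator and stop token then have to be used at inference time so the model recognizes the boundary it was trained on.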
At the end of the process, I want to type in a prompt like “What are some different ways to mitigate cancer in dogs at different stages in their life” and davinci will generate content from that corpus of documents.
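A rough sketch of what that final query could look like against a fine-tuned model, using the legacy (pre-1.0) openai-python interface. The model name is a placeholder; the key details are that the inference prompt ends with the same separator used in training and that the stop sequence matches the completion’s end token.

```python
# Sketch of a request against a hypothetical fine-tuned davinci model.
# Only the request payload is built here; the actual API call (commented
# out) would need the legacy openai<1.0 client and an API key.
request = {
    "model": "davinci:ft-your-org-2023-01-01",  # placeholder fine-tune name
    "prompt": ("What are some different ways to mitigate cancer in dogs "
               "at different stages in their life?" + "\n\n###\n\n"),
    "max_tokens": 500,
    "temperature": 0.7,
    "stop": [" END"],  # must match the stop token used in training data
}
# response = openai.Completion.create(**request)
```

Without the matching separator and stop sequence, the fine-tuned model tends to ramble past the intended answer or continue the prompt instead of answering it.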
One reason would be to save on tokens. If you’re supplying a bunch of source text along with your query, all of those tokens count toward your costs. Whereas if you fine-tune and run queries against that model, the source text you trained on doesn’t cost anything per query; only the (short) prompt and the generated response do.
Another reason could be token limits. I think there’s a 4k limit per query, so if you’re using 3500 tokens of source material inside the prompt, you’ll be limited in how much you can generate. Whereas with a fine-tuned model you can use nearly the full 4k for output.
I bet pretrained/finetuned models would be faster too
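The token-budget point above can be put as back-of-envelope arithmetic. The 4k figure is the limit mentioned in this thread, and the ~0.75 words-per-token ratio is a common rough rule of thumb for English text, not an exact count.

```python
# Back-of-envelope sketch of the prompt-stuffing vs fine-tuning budget.
# The context limit and token counts are illustrative assumptions.
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: 1 token is about 0.75 English words.
    return round(len(text.split()) / 0.75)

context_limit = 4000          # assumed per-request token limit
stuffed_source_tokens = 3500  # source articles pasted into the prompt
question_tokens = 50          # the actual question

# Prompt-stuffing: generation budget is whatever is left of the window.
budget_stuffed = context_limit - stuffed_source_tokens - question_tokens

# Fine-tuned model: only the short question occupies the window.
budget_finetuned = context_limit - question_tokens

print(budget_stuffed, budget_finetuned)  # 450 vs 3950 tokens for output
```

So under these assumptions, fine-tuning frees up almost 8x more of the window for generated text, on top of not paying for the source tokens on every call.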
Hi Dave - I’d love to learn how you did this. I tried clicking the link but the YT video is private. Would it be possible to gain access, please? Thanks!