Making a fine-tuned Davinci an SME on a certain topic

Hey everyone!

Say I wanted davinci to be an SME on a certain topic, and say for instance I had a clean textbook on that topic (suppose no images for now) that I fed into davinci.

How would one set up the training file that goes into davinci so that it understands all the content within that textbook and could actually become an SME on that topic?

Thanks!

It depends on the topic. If it's history or another fact-based subject, then you're better off with a Q&A system. If it requires more executive function (like solving problems), then your fine-tuning data will look very different.

So say for instance it was pet health content, like if I wanted GPT-3 to be an SME on a specific type of cancer in dogs.

How would I feed in the training data so that davinci will understand the content and then write articles from it?

Content writing is different from being an SME. I'm a little confused about what you want to do here; you're being a little vague, perhaps not deliberately. It would be helpful if you gave very specific examples of the outcomes you're looking for.

Ahh sorry, let me explain my process a bit:
I want to train davinci to write long-form articles/blogs from a corpus of dog health articles I've found. I've spent the last few weeks collecting this training data for davinci: fairly technical articles from animal blogs and scientific publications, which I've then cleaned (removed images, weird formatting, etc.).

I'm now at the step where I need to upload a training file to OpenAI in "Prompt… Completion" format. I'm not sure how best to structure the training file so that davinci will perform well on open-ended text generation.
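For illustration, here's a minimal Python sketch of the kind of prompt/completion JSONL file I mean. The example pairs are made up, and the `\n\n###\n\n` separator and ` END` stop sequence are just common conventions, not requirements:

```python
import json

# Hypothetical prompt/completion pairs drawn from cleaned articles.
# Each JSONL line is one training example.
examples = [
    {
        "prompt": "Write an article section about early signs of lymphoma in dogs.\n\n###\n\n",
        "completion": " Swollen lymph nodes are often the first sign owners notice. END",
    },
    {
        "prompt": "Write an article section about treatment options for canine lymphoma.\n\n###\n\n",
        "completion": " Chemotherapy remains the most common treatment. END",
    },
]

# Write one JSON object per line (the JSONL format the upload expects).
with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```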

At the end of the process, I want to type in a prompt like "What are some different ways to mitigate cancer in dogs at different stages of their life?" and have davinci generate content from that corpus of documents.

Hope this helps paint the picture a bit better!

Here's how I'd approach it: Answer complex questions from an arbitrarily large set of documents with vector search and GPT-3 - YouTube

Basically the only change is that you’re looking to produce longer answers from a variety of sources. But the underlying principle is the same.

Ah interesting, so you never actually fine-tune a model. When would you ever fine-tune models if you can just feed everything into davinci-002?

I haven't worked with fine-tuning, but…

One reason would be to save on tokens. If you're supplying a bunch of text along with your query, those tokens count toward your costs. Whereas if you fine-tune and run queries against that model, the "source text" you trained on doesn't cost anything per query; only the generated response does.

Another reason could be token limits; I think there's a 4k limit per query, shared between prompt and completion. So if you're using 3,500 tokens of source text inside the prompt, you'll be limited in how much you can generate. Whereas with a fine-tuned model you can get the full 4k for output.
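The budget math, assuming a shared 4,096-token context window (the exact limit is model-dependent):

```python
# Rough token budgeting for a completion call: prompt and completion
# share one context window, so every prompt token is one fewer
# completion token.
CONTEXT_WINDOW = 4096

def max_completion_tokens(prompt_tokens, context_window=CONTEXT_WINDOW):
    return max(context_window - prompt_tokens, 0)

# Stuffing 3,500 tokens of source text into the prompt leaves little room:
print(max_completion_tokens(3500))  # 596
# A fine-tuned model with a short prompt keeps almost the whole budget:
print(max_completion_tokens(50))    # 4046
```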

I bet fine-tuned models would be faster, too.

Dave's approach uses embeddings.