I am currently catching myself up on the basics of GPT / Python etc. So far it's very challenging, but I'm really giving it my best shot since I want to create a virtual version of myself.
The plan is to fine-tune GPT-3 to mimic me as closely as possible.
Therefore the most important thing is the training dataset. I will probably need tons of prompt/completion pairs in the training file. And I need lots of different categories, such as memories, experiences, values etc…
Now I have some noob questions, and I'd be really glad if you could help guide me through these baby steps I'm taking.
Can I only use one file for fine-tuning GPT-3, or can I use multiple files? (That way I could at least categorize the files, which would make things a little easier.)
How do I train GPT to NOT answer questions that are not covered by the training dataset? (E.g. if someone asks about the specific dangers of piloting a jet plane, Virtual Me should literally have no clue about it.)
To what extent can I control whether the AI also asks questions rather than just answering them? I'd love to give it a really nice personal touch.
Are there some sort of training repos with example files that can be worked with?
What could the approximate costs be? At the moment I have no clue how big the dataset needs to be to work properly. Is there any similar project for orientation?
I just don't want to end up with a $5,000 bill when I start the fine-tune. :P
I was able to find the answers to some of my questions already - please correct me if I'm wrong:
Only one file can be used per fine-tune of GPT-3, and it is actually JSONL rather than plain JSON: one JSON object per line. It has to contain all the training examples, and there should be at least ~200 of them for the fine-tune to work properly.
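For reference, a minimal sketch of what such a JSONL file can look like - the questions and answers here are invented, and the `->` separator and trailing `\n` follow a common pattern from OpenAI's data-preparation guidance for the legacy models:

```jsonl
{"prompt": "What do you value most in a friendship? ->", "completion": " Honesty, hands down. I'd rather hear an uncomfortable truth than a comfortable lie.\n"}
{"prompt": "What's a memory that shaped you? ->", "completion": " Moving abroad on my own. Feeling lost turned out to be the first step of learning something.\n"}
```

The legacy CLI then starts the job with something like `openai api fine_tunes.create -t virtual_me.jsonl -m davinci` (the file name is just a placeholder).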
I still have no answer to this one…
I think the answer to this one is to simply extend the dataset with question / answer / hint / advice prompts - everything in the same JSONL file (see the sketch below).
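A hedged guess at what such examples could look like (invented again): completions that answer first and then hand a question back, so the model picks up the conversational habit.

```jsonl
{"prompt": "I had a rough week at work. ->", "completion": " Sorry to hear that. When I have weeks like that, a long walk usually resets me. What helps you recharge?\n"}
```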
Found out that @daveshapautomator offers quite nice repos and also explanation videos! However, a specific repo for this purpose will be hard to find, as the experiences and memories are quite personal…
Found out the costs are really OK for datasets in the range of 200-500 examples, so that's a big relief!
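For anyone wondering, a back-of-the-envelope estimate, assuming the legacy Davinci training rate of roughly $0.03 per 1K tokens and the default of 4 training epochs (both assumptions - check the current pricing page):

```python
# Rough fine-tuning cost estimate; every number below is an assumption, not a quote.
n_examples = 500          # planned dataset size
tokens_per_example = 150  # prompt + completion, a guess for short Q&A pairs
epochs = 4                # legacy default number of training epochs
rate_per_1k = 0.03        # USD per 1K training tokens for davinci (at the time)

total_tokens = n_examples * tokens_per_example * epochs
print(f"~{total_tokens:,} tokens -> ~${total_tokens / 1000 * rate_per_1k:.2f}")
# ~300,000 tokens -> ~$9.00
```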
I would be grateful for some ideas on the best / most efficient way to create the dataset for Virtual-Me. I have a hard time imagining whether synthetic data can be used, as all the answers must be authentic ones based on real experience. I want the bot to be friendly, empathetic, able to answer questions but also able to pose them. What would you suggest, @daveshapautomator?
Also, my question about how to teach GPT-3 not to answer when Virtual-Me has no experience with a topic remains unsolved. Anyone got an idea?
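One idea worth testing (a sketch, not a proven recipe): instead of relying on fine-tuning alone, gate incoming questions with embeddings. Embed short descriptions of the topics Virtual-Me actually knows, compare each incoming question against them, and hard-code a refusal below a similarity threshold. A minimal sketch with the pre-1.0 openai Python client; the topic list, model choice, and threshold are all assumptions to tune:

```python
import numpy as np
import openai  # pre-1.0 client; set openai.api_key first

# Hypothetical list of topics Virtual-Me has real experience with.
KNOWN_TOPICS = ["my childhood memories", "my views on friendship", "my taste in music"]

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

topic_vecs = embed(KNOWN_TOPICS)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def gate(question, threshold=0.82):  # threshold needs tuning on real data
    q = embed([question])[0]
    if max(cosine(q, v) for v in topic_vecs) < threshold:
        return "Honestly, no clue - that's outside my experience."
    return None  # in scope: pass the question on to the fine-tuned model
```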
That sounds interesting.
Since you worked with embeddings, I am curious: how long were the strings you created embeddings for? In my experience, embeddings got a bit fuzzy and I got false positives with strings that were a couple of sentences long.
Did you experience that as well?
That sounds amazing!
Unfortunately I don't have any books written yet. But the funny thing is, I was just flirting with the thought of writing something like my own biography as a base, and then having a list of Q&As on various topics so the bot gets the hang of my views and virtues.
So basically you combine prompt engineering and embeddings.
I am having a hard time getting it all under one roof. Do you know any open-source project with a similar intent that has some public code I could build on? (Noob coder here.)
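In case it helps to see the two pieces under one roof, here is a minimal retrieval sketch, assuming the pre-1.0 openai client and a hypothetical biography.txt split into short chunks (keeping chunks to a couple of sentences may also reduce the fuzziness mentioned above):

```python
import numpy as np
import openai  # pre-1.0 client; set openai.api_key first

# Hypothetical source file: short, self-contained snippets from the biography / Q&A list.
chunks = open("biography.txt").read().split("\n\n")

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

chunk_vecs = embed(chunks)

def ask_virtual_me(question):
    q = embed([question])[0]
    sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in chunk_vecs]
    context = chunks[int(np.argmax(sims))]
    # Prompt engineering: persona instructions plus the retrieved snippet.
    prompt = (
        "You are Virtual-Me: friendly, empathetic, and curious. Answer in the first person,\n"
        "using only the context below, and end with a short follow-up question.\n\n"
        f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    )
    resp = openai.Completion.create(model="text-davinci-003",  # example model
                                    prompt=prompt, max_tokens=200, temperature=0.7)
    return resp["choices"][0]["text"].strip()
```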