lintsch
Hey everyone!
I am currently catching myself up on the basics of GPT, Python, etc. So far it's very challenging, but I'm really giving it my best shot, since I want to create a virtual version of myself.
The plan is to train GPT-3 to mimic me as closely as possible.
Therefore the most important thing is the training dataset. I will probably need tons of input/output prompts in that JSON file, and lots of different categories, such as memories, experiences, values, etc…
Now I have some kind of noob questions, but I'd be really glad if you could help guide me through these baby steps I'm taking.
1. Can I only use one JSON file for fine-tuning GPT, or can I use multiple files? That way I could at least categorize the files, which would make things a little easier.
2. How do I train GPT to NOT answer questions that are not covered in the training dataset? For example, if someone asks about the specific dangers of piloting a jet plane, Virtual Me should literally have no clue about it.
3. To what extent can I control whether the AI also asks questions rather than just answering them? I'd love to give it a really nice personal touch.
4. Are there some sort of training repos with JSON files that can be worked with?
5. What could the approximate costs be? At the moment I have no clue how big the dataset needs to be to work properly. Is there any similar project for orientation? I just don't want to end up with a $5,000 bill when I start the fine-tune :P
Sorry for all the noob questions!
Best regards
Patrick
lintsch
Hey everyone!
I was able to find out the answers to some of my questions already. Please correct me if I'm wrong:
1. Only one JSON file can be used for fine-tuning GPT-3. It has to contain all the examples, and there should be at least 200 of them to work properly.
2. I still have no answer to this one…
3. I think the answer to this one is to simply grow the dataset with question/answer/hint/advice prompts, everything in one JSON file.
4. I found out that @daveshapautomator offers quite nice repos and also explanation videos! However, a specific repo for this purpose will be hard to find, as the experiences and memories are quite personal…
5. I found out the costs are really OK for datasets in the range of 200–500 examples, so that's a big relief!
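On point 1: one way to keep categories (memories, values, etc.) editable as separate files and still end up with a single training file is to merge them at fine-tune time. A minimal sketch, assuming each category file is a JSON list of prompt/completion pairs in the classic GPT-3 fine-tuning format (the file names and layout here are hypothetical):

```python
import glob
import json

def merge_to_jsonl(pattern, out_path):
    """Merge several category JSON files (each a list of
    {"prompt": ..., "completion": ...} pairs, e.g. memories.json,
    values.json) into one JSONL training file, one example per line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                for pair in json.load(f):
                    out.write(json.dumps(pair) + "\n")
                    count += 1
    return count
```

The returned count also lets you check you have reached the ~200-example threshold mentioned above before submitting the fine-tune.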
I would be grateful for some ideas on the best / most efficient way to create the dataset for Virtual-Me. I have a hard time imagining whether synthetic data can be used, as all the answers to questions must be authentic ones based on real experience. I want the bot to be friendly, empathetic, able to answer questions, but also able to pose questions. What would you suggest, @daveshapautomator?
Also, my question about how to teach GPT-3 not to answer questions when Virtual-Me has no experience with the topic remains unsolved. Anyone got an idea?
Thx and best regards!
Hi
I have developed a quick version of what you are looking for. I am using:
- books as datasets
- embeddings to limit the cost (instead of fine-tuning, I run a semantic search over the dataset to extract a context, then use it in a davinci completion request)
- some prompt design on davinci to build a chatbot
It works fine and, because I am using my own books as the dataset, the bot answers like me.
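The retrieval step above can be sketched in a few lines. This is a minimal illustration, not Mikiane's actual code: the chunk texts and vectors are placeholders, and in practice the vectors would come from an embeddings API call (e.g. `openai.Embedding.create`) while the final prompt would go to a davinci completion request.

```python
import math

# Each knowledge chunk is a (text, embedding_vector) pair; the vectors
# here would normally be produced by an embeddings API, not by hand.

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_context(query_vec, chunks, k=3):
    """Return the k chunk texts most similar to the query, joined as one context."""
    ranked = sorted(chunks, key=lambda c: cosine_sim(query_vec, c[1]), reverse=True)
    return "\n".join(text for text, _ in ranked[:k])

def build_prompt(context, question):
    """Prompt design: answer in the author's voice, from the retrieved context only."""
    return (
        "You are the author of the context below. Answer in their voice, "
        "and say you don't know if the context does not cover the question.\n\n"
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )
```

Incidentally, instructing the model to decline when the retrieved context doesn't cover the question (as in `build_prompt` above) is one common way to approach the "Virtual-Me shouldn't answer what I don't know" problem from the first post.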
heiko
Hi Mikiane,
that sounds interesting.
Since you worked with embeddings, I am curious: how long were the strings you created embeddings for? In my experience, embeddings got a bit fuzzy and I got false positives with embeddings of strings that were a couple of sentences long.
Did you experience that as well?
Around 2000 tokens.
It works fine
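For what it's worth, one common mitigation when long strings produce fuzzy matches is to embed shorter, overlapping chunks instead of whole passages, so each vector covers a tighter topic. A minimal sketch (the word counts are illustrative defaults, not recommendations):

```python
def chunk_text(text, max_words=80, overlap=20):
    """Split text into overlapping word-window chunks for embedding.

    Overlap keeps sentences that straddle a boundary represented
    in both neighboring chunks."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then get its own embedding, and the semantic search ranks chunks rather than whole book sections.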
@Mikiane
That sounds amazing!
Unfortunately I don't have any books written yet. But the funny thing is, I was just flirting with the thought of writing something like my own biography as a base, and then having a list of Q&As on various topics so the bot gets the hang of my views and virtues.
So basically you combine prompt engineering and embeddings.
I am having a hard time getting it all under one roof. Do you know any open-source project with a similar intent that has some public code I could work from? (Noob coder here.)
Best regards!
Patrick
I should publish something soon about it.
lintsch
@Mikiane
You can't possibly imagine how much I'm looking forward to that xD
Best regards