I want to train a GPT model to increase its knowledge base, but fine-tuning is not helping. Are there any alternatives, or how can we train a model on a large amount of data so that it gains knowledge? Please let me know.
It depends on how big the knowledge base is. If it's a document of less than 55 pages, the GPT-4 API will be able to take it as input. How you are fine-tuning the model and what sort of response you expect will determine how you structure the fine-tuning document. Can you provide some more info?
Also look into Embeddings for accessing your knowledge
Let's say I want to train the model with the following prompts and completions:
{"prompt":"Write a summary of the statistics for Luka Doncic (PG, DAL).","completion":"Luka Doncic played 19 games, starting 19 of them, and averaging 37.4 minutes per game. They scored 33.5 points per game, making 218 out of 432 field goal attempts for a shooting percentage of 50.5. They also made 46 out of 146 three-point field goal attempts for a shooting percentage of 31.5. In addition, they made 154 out of 213 free throw attempts for a free throw percentage of 72.3."}
{"prompt":"Write a summary of the statistics for Jayson Tatum (PF, BOS).","completion":"Jayson Tatum played 20 games, starting 20 of them, and averaging 36.6 minutes per game. They scored 30.8 points per game, making 201 out of 418 field goal attempts for a shooting percentage of 48.1. They also made 66 out of 190 three-point field goal attempts for a shooting percentage of 34.7. In addition, they made 147 out of 169 free throw attempts for a free throw percentage of 87.0."}
{"prompt":"Write a summary of the statistics for Anthony Davis (C, LAL).","completion":"Anthony Davis played 17 games, starting 17 of them, and averaging 34.5 minutes per game. They scored 26.2 points per game, making 168 out of 294 field goal attempts for a shooting percentage of 57.1. They also made 4 out of 19 three-point field goal attempts for a shooting percentage of 21.1. In addition, they made 106 out of 130 free throw attempts for a free throw percentage of 81.5."}
and so on … similar data…
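For anyone who wants to reproduce this kind of training file, here is a minimal sketch of how one such prompt/completion line could be generated from raw numbers. The dict keys and the helper names `pct` and `make_record` are my own assumptions for illustration, not anything from the fine-tuning API.

```python
import json

def pct(made, attempted):
    """Percentage rounded to one decimal place, as in the completions above."""
    return round(100 * made / attempted, 1)

def make_record(p):
    """Build one prompt/completion pair from a (hypothetical) stats dict."""
    prompt = (
        f"Write a summary of the statistics for "
        f"{p['name']} ({p['pos']}, {p['team']})."
    )
    completion = (
        f"{p['name']} played {p['games']} games, starting {p['starts']} of them, "
        f"and averaging {p['mpg']} minutes per game. "
        f"They scored {p['ppg']} points per game, making {p['fg']} out of {p['fga']} "
        f"field goal attempts for a shooting percentage of {pct(p['fg'], p['fga'])}. "
        f"They also made {p['tp']} out of {p['tpa']} three-point field goal attempts "
        f"for a shooting percentage of {pct(p['tp'], p['tpa'])}. "
        f"In addition, they made {p['ft']} out of {p['fta']} free throw attempts "
        f"for a free throw percentage of {pct(p['ft'], p['fta'])}."
    )
    return {"prompt": prompt, "completion": completion}

# Numbers taken from the Luka Doncic example above; the key names are assumptions.
doncic = {
    "name": "Luka Doncic", "pos": "PG", "team": "DAL",
    "games": 19, "starts": 19, "mpg": 37.4, "ppg": 33.5,
    "fg": 218, "fga": 432, "tp": 46, "tpa": 146, "ft": 154, "fta": 213,
}

record = make_record(doncic)
line = json.dumps(record)  # one JSONL line; note the plain ASCII double quotes
print(line)
```

Each call produces one line of the JSONL file. Writing the file this way also avoids curly quotes sneaking into the JSON.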
Now if I ask: Write a summary of the statistics for Anthony Davis (C, LAL).
The response comes back as: Anthony Davis played 31 games, starting 27 of them, and averaging 33.4 minutes per game. They scored 33.5 points per game, making 218 out of 432 field goal attempts for a shooting percentage of 50.5. They also made 46 out of 146 three-point field goal attempts for a shooting percentage of 31.5. In addition, they made 154 out of 213 free throw attempts for a free throw percentage of 72.3.
So the model was fine-tuned, but its knowledge base didn't increase. This was just an example; my data will be internal and different from this, but I need the model to learn, not just tune and adjust the tone of its responses. How can this be achieved?
Yes, but embeddings will not train the model. Is there any way other than fine-tuning? Fine-tuning doesn't increase the knowledge base.
That sounds really expensive! Giving it a knowledge upgrade in every API call!
I had a similar question, and watching this video helped me a lot to put the issue in context.
Basically, it says that if you are using fine-tuning to add knowledge, you may not be using the right tool.
You literally can’t do what you are asking to do.
Fine-tuning mostly adjusts how the model responds (it is often described as training only the last layers of the network), so it can bake in a style but will not reliably add facts.
The solution is to build an embeddings backend and prompt GPT to use an API to access that backend.
I see fine-tuning as superficial response tuning: basically examples of how you expect the model to respond to a given input. It doesn't really change what the model knows, so relying on it for facts will tend to generate hallucinations or inconsistent responses.
One approach is to combine RAG (retrieval-augmented generation) with a fine-tuned model, to reduce over/underfitting issues as much as possible.
You need to convert your knowledge into embeddings, store those embeddings in a vector DB, then query the DB for the entries most similar to the user input.
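To make the embed, store, and query steps concrete, here is a rough self-contained sketch. Big assumption: `embed` below is a toy bag-of-words counter standing in for a real embedding model, and `VectorStore` is a hypothetical in-memory stand-in for a vector DB. In practice you would swap in an actual embeddings API and database; only the overall flow is the point.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class VectorStore:
    """Hypothetical in-memory 'vector DB': a list of (vector, text) pairs."""

    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append((embed(text), text))

    def query(self, question, k=1):
        """Return the k stored texts most similar to the question."""
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = VectorStore()
store.add("Luka Doncic played 19 games and scored 33.5 points per game.")
store.add("Anthony Davis played 17 games and scored 26.2 points per game.")

top = store.query("Write a summary of the statistics for Anthony Davis (C, LAL).")
print(top[0])
```

The retrieved text is what gets passed to the model as context, so the facts come from your data at query time rather than from the model's weights.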
From there you can now construct the prompt to the fine-tuned model:
"based on the given context: {context} answer the question {question}"
You may want to customize that prompt according to your goal.
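As a small illustration of filling in that prompt, something like the following would work. The function name `build_prompt` and the sample chunk are assumptions of mine; the template string is the one quoted above.

```python
PROMPT_TEMPLATE = "based on the given context: {context} answer the question {question}"

def build_prompt(chunks, question):
    """Join the retrieved chunks into one context string and fill the template."""
    context = "\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunk, e.g. the top result of a vector DB query.
chunks = ["Anthony Davis played 17 games, starting 17 of them."]
question = "Write a summary of the statistics for Anthony Davis (C, LAL)."

prompt = build_prompt(chunks, question)
print(prompt)
```

The resulting string is what you would send to the (fine-tuned) model as the user message.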