Just getting started with fine-tuning. I've been using RAG with embeddings, which already gives not-bad results, and wanted to try the fine-tuning side. With some basic Python I was able to convert my text into JSONL, but the first results after training were below expectations: wrong answers, hallucinations galore (though at least hallucinations within the right context). The training loss of 1.8587 was also not converging as smoothly as expected. Setup: 67K tokens, 3 epochs, LR multiplier 2, about 40 long lines, each line in the following chat format (JSONL):
{"messages": [{"role": "system", "content": "…"}, {"role": "user", "content": "…"}, {"role": "assistant", "content": "…"}]}
The values in the content areas are:
- system content = always the same: "You are an AI assistant for…
- user content = a chunk of text from the original text file
- assistant content = the next chunk of text from the original text file
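For context, the conversion looks roughly like this (simplified sketch; `chunk_text`, the 500-character chunk size, and the system prompt text are placeholders, not my exact script):

```python
import json

SYSTEM_PROMPT = "You are an AI assistant for…"  # same system content on every line

def chunk_text(text, chunk_size=500):
    # Naive fixed-size splitting; my real chunking is along line/section boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_jsonl(chunks):
    # Pair each chunk (user) with the next chunk (assistant), as described above.
    lines = []
    for user_chunk, assistant_chunk in zip(chunks, chunks[1:]):
        lines.append(json.dumps({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_chunk},
                {"role": "assistant", "content": assistant_chunk},
            ]
        }))
    return "\n".join(lines)
```

So the "question" in each training example is really just the preceding slice of the source document, which is part of what I suspect is wrong.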
My challenge is understanding how to better structure the JSONL content to match what the LLM expects, so it actually picks up the knowledge from my original file (a file of instructions).
Questions:
- Do I really need to come up with a question (user content) for each snippet of text I want the LLM to learn (assistant content)? That seems like a very tedious way to encode data, especially since some of the data in the text file is just CSV data. If questions are needed, any suggestions for writing good ones for what is, in fact, text that is part of a whole?
- Would setting my system content to an actual prompt (like the one I'm already relatively happy with in RAG) be of value?
- Is this fine-tuning path likely to increase the accuracy of answers to my chat questions (which are already not bad with RAG/embeddings)?
Looking for additional guidance / suggestions / examples.