Hello,
In the context of customer support, I am building a tool that takes as input (1) the conversation history and (2) a draft of the new response, then rewrites the draft into a complete email formatted to match the company's tone of voice.
So far I have tried prompt engineering, and I got decent but not production-ready results.
Then I prepared a fine-tuning dataset of 50 conversation+draft examples, each paired with the corresponding final email written the way I want it to come out in the model response. With this dataset, I created a fine-tuned model based on gpt-4-mini.
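For reference, a chat fine-tuning dataset is a JSONL file with one training example per line, each mirroring the exact inference-time input plus the ideal output. A minimal sketch (the system prompt, conversation, and email text below are placeholders, not the actual data):

```python
import json

# One training example: the same system prompt and conversation+draft input
# used at inference time, plus the ideal final email as the assistant turn.
example = {
    "messages": [
        {"role": "system", "content": "Rewrite the draft into a complete email in the company tone."},  # placeholder prompt
        {"role": "user", "content": "Conversation history:\n...\n\nDraft:\n..."},                        # placeholder input
        {"role": "assistant", "content": "Dear customer,\n..."},                                         # ideal final email
    ]
}

# Each line of the .jsonl training file is one such JSON object.
line = json.dumps(example)
print(line)
```

The key point is that the system and user messages in every training line should match what the production requests will send, so the model learns the mapping from draft to finished email rather than memorizing unrelated context.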
I was ready to be amazed by my super cool fine-tuned model, but it performs worse than the prompt-engineering version. It adds random words and weird stuff to the response, and it doesn't even get the tone of voice right all the time.
Note that the prompt used before the inputs is exactly the same in both versions.
At this point I would like to ask if you have any advice or suggestions on how best to deal with this use case.
Thanks in advance!
Hi there and welcome to the Community!
In general, this is a great use case for fine-tuning and should work in principle.
What temperature setting are you applying when using the fine-tuned model? Issues often arise when users apply too high a temperature to a fine-tune. I am wondering if this could be the root cause in your case?
You may also consider further adjusting your prompt to target specific undesired behaviours that are common across your outputs.
I am using a temperature of 1.2; do you think it's too high?
I am also wondering if I fine-tuned the correct way: given that the input of every request will be conversation+draft, I reproduced the exact same inputs in the fine-tuning dataset and added a perfectly formatted response as the output.
Yeah, the temperature is likely the cause of the issue. Try a value of no more than 0.5 instead.
The data set you used sounds fine.
Let us know how it goes.
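A minimal sketch of lowering the temperature on the completion call, assuming the OpenAI Python client (the fine-tuned model ID and message content below are placeholders):

```python
# Hedged sketch: assemble the request kwargs with a low temperature so the
# fine-tune's outputs stay close to the training distribution.
def build_request(messages, model="ft:gpt-4-mini:your-org::placeholder", temperature=0.2):
    """Build keyword arguments for client.chat.completions.create
    with the temperature pinned low (<= 0.5)."""
    return {"model": model, "messages": messages, "temperature": temperature}

kwargs = build_request(
    [{"role": "user", "content": "Conversation history + draft go here"}]
)
# With the real client:
#   client = openai.OpenAI()
#   resp = client.chat.completions.create(**kwargs)
```

The same system prompt used in the training data should be included in the messages so the request matches the fine-tuning examples.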
OK, I will try with 0.2 and let you know how it goes.
The prompt used both in the fine-tuning dataset and in the regular requests is pretty long: 742 tokens. Do you think it's too long?
I don’t think it’s too long, or at least I don’t think it’s what is causing the issue.