Is it helpful to add CoT data in fine-tuning?

Hi all. I’m building a chatbot by fine-tuning GPT-3.5-turbo. There are still many failing cases where the prompt already describes the instruction explicitly and the SFT data contains corresponding examples. What should I do to fix those bad cases? Would CoT data, consisting of a chain of thought followed by the final response text, be helpful?
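To make the question concrete, by CoT data I mean something like the sketch below (in the standard chat fine-tuning format; the content here is made up, our real data uses our own prompt and dialogues):

```python
import json

# Made-up illustration of one CoT-style SFT example: the assistant message
# contains a short reasoning section followed by the final reply that the
# customer would actually see.
example = {
    "messages": [
        {"role": "system", "content": "You are a sales agent. Follow the rules in the instructions."},
        {"role": "user", "content": "The apartment looks nice, but I'm not sure yet."},
        {
            "role": "assistant",
            "content": (
                "Reasoning: the customer shows some interest, so per the rules "
                "I should invite them to a viewing.\n"
                "Reply: Great! Would you like to schedule a visit this week to see it in person?"
            ),
        },
    ]
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

At inference time we would of course strip the “Reasoning:” part before showing the reply to the customer.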

1 Like

Welcome to the Dev Community!
As is always the case with these things, I think the correct answer here is “it depends”.

What is your use-case? Is the model “smart” enough to get it right with prompt engineering?

Are you trying to add in external knowledge? If so, you may be better off using RAG techniques (e.g. the Assistants API) instead.

1 Like

Thanks for your reply. My use case is a sales agent whose aim is to sell things like houses and insurance to customers. It has some rules to follow, e.g., invite the customer to come see the house when they show some interest, and ask for the customer’s IM account to keep in touch when they show no interest. Currently, even though I provided similar examples in the training data, the SFT model still can’t follow the instructions in the prompt. I’m a little confused and have no idea what to do to make it “smarter”…
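For reference, the rule section of our system prompt looks roughly like this (paraphrased with placeholder wording; the real prompt is longer and product-specific):

```python
# Paraphrased sketch of the rule section of our system prompt.
SALES_AGENT_RULES = """
You are a sales agent selling houses and insurance.
Rules:
1. If the customer shows interest, invite them to an in-person viewing.
2. If the customer shows no interest, politely ask for their IM account
   so you can keep in touch later.
""".strip()
```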

Use GPT-4?

2 Likes

No, my SFT is based on GPT-3.5-turbo-0125.

Well yeah, I’m saying maybe that’s the issue :thinking:

Yeah, but the GPT-4 fine-tuning API isn’t available to me right now. :joy:

You can generally get away without fine-tuning GPT-4, if your instructions are good enough.

I tried that but eventually gave up on it. The first reason is that GPT-4’s responses are too slow to meet our app’s requirements. The second is that there are still some cases where GPT-4 can’t follow our instructions… So we tried GPT-3.5 and, through SFT, now get accuracy close to GPT-4’s (about 90% good cases). But our app requires almost 95%, and we’re blocked by cases that can hardly be fixed by adding similar data or modifying our prompt (maybe the prompt could be better, but prompt tuning gives us no clear direction that will reliably lead to a better result).

I see.

The issue with gpt-3.5 CoT is that 3.5 generally struggles with reflection tasks. While fine-tuning might help the model initiate CoT more often, I don’t see how it can actually improve CoT outcomes.

Some of the issues you’ve mentioned can be mitigated to a degree:

  • GPT-4’s response is too slow: use streaming to improve the UX (rough sketch after this list). If you need complex CoT, consider some form of asynchronous communication.
  • some cases where GPT-4 can’t follow our instructions: while these cases exist, they can often be engineered around. If you want to share your prompts, some of us here might be able to take a crack at them.
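For the latency point, streaming is just a flag on the chat completions call; a minimal sketch with the Python SDK (model name and messages are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stream tokens as they arrive so the user sees the reply build up
# instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="gpt-4",  # or your fine-tuned 3.5 model
    messages=[
        {"role": "system", "content": "You are a sales agent."},
        {"role": "user", "content": "Tell me more about the apartment."},
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```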

However, if you’ve already got 90% accuracy with 3.5, that’s pretty good. I wouldn’t throw that away. Is it possible to identify, ahead of time, the remaining 10% of cases where it will fail?
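If you can spot those cases up front, one option is a simple router that sends only the hard conversations to GPT-4 and leaves the rest on your fine-tuned 3.5. Purely a sketch; the trigger markers and model IDs below are placeholders you’d replace based on your own failure analysis:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical markers that, in your failure analysis, correlate with the
# conversations your fine-tuned 3.5 model tends to get wrong.
HARD_CASE_MARKERS = ("discount", "contract", "cancel")

def pick_model(conversation: list[dict]) -> str:
    """Route known-hard conversations to GPT-4, everything else to the tuned 3.5."""
    last_user_turn = conversation[-1]["content"].lower()
    if any(marker in last_user_turn for marker in HARD_CASE_MARKERS):
        return "gpt-4"
    return "ft:gpt-3.5-turbo-0125:your-org::abc123"  # placeholder fine-tune ID

def reply(conversation: list[dict]) -> str:
    response = client.chat.completions.create(
        model=pick_model(conversation),
        messages=conversation,
    )
    return response.choices[0].message.content
```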

1 Like