Fine-Tuning Model Performance – Seeking Solutions

We built a persona-driven AI chatbot that delivers personalized Vedic astrology readings by processing users’ birth chart data via an API.

Why We Attempted Fine-Tuning

With fine-tuning, we aimed for:

  • Simpler, More Human Language – Making responses warm, engaging, and easy to understand.
  • Conversational Variability – Reducing repetition and ensuring a more dynamic, natural flow.
  • Concise Output – Keeping responses brief and impactful.

Our Approach

  • Collected real user chat data.
  • Manually refined responses to match our desired tone and style.
  • Fine-tuned the model on this improved dataset (a sketch of one training sample is shown below).
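
For reference, this is roughly what one record in our training file looked like. A minimal sketch assuming the OpenAI-style chat JSONL fine-tuning format; the file name and message contents are hypothetical placeholders, not our actual data:

```python
import json

# One training record in OpenAI-style chat JSONL: the system prompt sets the
# persona, and the assistant turn is a manually refined response. All message
# contents here are hypothetical placeholders.
record = {
    "messages": [
        {"role": "system", "content": "You are a warm, concise Vedic astrology guide."},
        {"role": "user", "content": "What does my Moon in Scorpio say about relationships?"},
        {"role": "assistant", "content": (
            "Your Scorpio Moon runs deep. You love with intensity and loyalty, "
            "and you need a partner who isn't scared of that depth."
        )},
    ]
}

# Each record is written as one line of JSONL, the input format most chat
# fine-tuning endpoints expect.
with open("finetune_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```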

Unexpected Fine-Tuning Issues

  • Worsened Performance – The fine-tuned model performed worse than the original system prompt version.
  • Language & Tone Issues – Responses became unnatural, erratic, and sometimes incoherent.
  • Overall Degradation – None of the three target improvements materialized; across the board, the fine-tuned model underperformed the prompt-only baseline.

Looking for Insights

  • Has anyone faced similar degradation when fine-tuning with user chat data?
  • What alternative strategies (e.g., refined prompt engineering, reinforcement learning, or hybrid approaches) could improve chatbot responses while maintaining the strengths of the system prompt model? One prompt-engineering direction we are weighing is sketched below.
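
To make the prompt-engineering option concrete: keep the base model and encode the three original goals as explicit style rules in the system prompt. A rough sketch assuming the OpenAI Python SDK; the prompt wording, model name, and temperature are illustrative guesses, not a tested recipe:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical refined system prompt encoding the three goals:
# human language, conversational variability, and concise output.
SYSTEM_PROMPT = (
    "You are a warm, friendly Vedic astrology guide.\n"
    "Style rules:\n"
    "- Use simple, everyday words; briefly explain any Sanskrit term you use.\n"
    "- Vary your openings and sentence structure; never reuse stock phrases.\n"
    "- Keep replies under roughly 120 words unless the user asks for more depth."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    temperature=0.9,      # nudged higher to encourage varied phrasing
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What does Saturn in my 7th house mean?"},
    ],
)
print(response.choices[0].message.content)
```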

Did you still include your system prompt in the fine-tuning samples? Or did you try to fine-tune on pure user inputs and model outputs?
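
To make that distinction concrete, the two dataset shapes being contrasted would look like this in chat-format training data (a sketch; the contents are placeholders):

```python
# Variant A: the production system prompt is baked into every training sample,
# so the fine-tuned behavior is learned in the same context it will be served in.
with_system_prompt = {
    "messages": [
        {"role": "system", "content": "You are a warm, concise Vedic astrology guide."},
        {"role": "user", "content": "<real user message>"},
        {"role": "assistant", "content": "<manually refined response>"},
    ]
}

# Variant B: bare user/assistant pairs. If the production system prompt is then
# re-attached at inference time, the model runs in a context it never saw during
# training, a mismatch that can plausibly contribute to erratic tone.
without_system_prompt = {
    "messages": [
        {"role": "user", "content": "<real user message>"},
        {"role": "assistant", "content": "<manually refined response>"},
    ]
}
```

If the training data looked like Variant B but the bot is served with the full production prompt, that train/serve mismatch alone could account for some of the degradation you saw.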