How would you go about fine tuning dataset for Medical DRG Coding?

Hi all, I have never done fine tuning/embeddings/langchain before and a problem I have is rather challenging so I’d be thankful if someone with expertise could guide me in the right direction. I’m trying to train LLM for medical patient coding, particularly adhering to Australian Coding Standards for diagnoses and procedures. The end goal is to develop a system capable of generating primary and secondary diagnoses, as well as procedures, based on a patient’s medical history. However, the current OpenAI GPT model lacks knowledge of Australian Coding Standards, ACHI Codes, making it challenging to accurately interpret diagnoses and procedures. The data about this is described in few books which you can see here:

This is just small sample from which you can see format that data is in, but there is 15k of these interventions and 7k diagnoses. Many entries say look up to other section for more information. This is raw knowledge about how to tackle coding, and what each code means. Additionally I have dataset of 22k real world examples that are already coded. However chatgpt currently has no knowledge what these codes mean and which codes should go with other and what not. So how would you go about building dataset? Would you use fine tuning or something with embeddings?

In the end it should work like this:

Given input: “Patient is admitted for drainage of ascites due to known underlying liver disease.” Desired output:

  • Principal diagnosis: Ascites
  • Additional diagnosis: Liver disease
  • Procedure: Drainage of ascites

If fine tuning is a way to go, can you give me few examples how dataset would look like for something like this.

RAG will be very expensive with OpenAI since you have to include lots of supportive codes so AI can make reasoning with good differentiation of what DRG code excludes another. You can try to fine tune Llama 3 to sumarize that part and then put it in GPT 4 with long context window. It’s definetelly doeble, the question is - how much it would cost and if it worth it to deploy own model for that. Med.Report supports free medical coding automation on their website. It’s more for US market but you can try to contact them via linkedin. They have an amazing team of ML scientists and healthcare expert and are very opened. Since you are in Australia it shouldn’t be a problem in terms of competition.