Hi all, I have never done fine-tuning, embeddings, or LangChain before, and the problem I have is rather challenging, so I'd be thankful if someone with expertise could point me in the right direction. I'm trying to train an LLM for medical patient coding, specifically adhering to the Australian Coding Standards for diagnoses and procedures. The end goal is to develop a system capable of generating primary and secondary diagnoses, as well as procedures, based on a patient's medical history. However, the current OpenAI GPT models lack knowledge of the Australian Coding Standards and ACHI codes, which makes it hard for them to accurately interpret diagnoses and procedures. The data is described in a few books, which you can see here:
This is just a small sample showing the format the data is in, but there are about 15k of these interventions and 7k diagnoses. Many entries cross-reference other sections for more information. This is the raw knowledge about how to tackle the coding and what each code means. Additionally, I have a dataset of 22k real-world examples that are already coded. However, ChatGPT currently has no knowledge of what these codes mean, or of which codes belong together and which don't. So how would you go about building the dataset? Would you use fine-tuning or something with embeddings?
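For the embeddings route, my rough mental model is the sketch below: embed every code description once up front, then at query time retrieve the descriptions closest to the clinical note and paste them into the GPT prompt as context. This is just my guess at how it would work, using the openai Python client; `achi_codes.jsonl` and its fields are placeholders for my own data, not a real file.

```python
# Sketch of the embeddings/retrieval idea as I understand it.
# Assumes the openai Python client (>=1.0); achi_codes.jsonl is a
# placeholder file with one {"code": ..., "description": ...} per line.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Embed a batch of strings (model choice is just an example)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Index every code description once, up front.
codes = [json.loads(line) for line in open("achi_codes.jsonl")]
code_vectors = embed([c["description"] for c in codes])

def top_codes(note, k=5):
    """Return the k code entries whose descriptions are closest to the note."""
    v = embed([note])[0]
    sims = code_vectors @ v / (np.linalg.norm(code_vectors, axis=1) * np.linalg.norm(v))
    return [codes[i] for i in np.argsort(-sims)[:k]]

candidates = top_codes("Patient is admitted for drainage of ascites "
                       "due to known underlying liver disease.")
# The candidate codes would then go into the GPT prompt as context.
```

Is something like this the right shape, or am I off track?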
In the end it should work like this:
Given input: “Patient is admitted for drainage of ascites due to known underlying liver disease.” Desired output:
- Principal diagnosis: Ascites
- Additional diagnosis: Liver disease
- Procedure: Drainage of ascites
If fine-tuning is the way to go, could you give me a few examples of what the dataset would look like for something like this?
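For what it's worth, here is my current guess at a single training record, assuming OpenAI's chat fine-tuning JSONL format (one JSON object per line). The system prompt wording and the structure of the assistant answer are placeholders I made up; the example itself is the one from above.

```python
# My guess at one fine-tuning record in OpenAI's chat JSONL format.
# Each of the 22k already-coded examples would become one such line.
import json

record = {
    "messages": [
        {"role": "system",
         "content": "You are a clinical coder following the Australian "
                    "Coding Standards (ICD-10-AM/ACHI)."},
        {"role": "user",
         "content": "Patient is admitted for drainage of ascites due to "
                    "known underlying liver disease."},
        {"role": "assistant",
         "content": "Principal diagnosis: Ascites\n"
                    "Additional diagnosis: Liver disease\n"
                    "Procedure: Drainage of ascites"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Would the assistant answer also need to include the actual ACHI/ICD-10-AM codes for the model to learn them, or is that better left to retrieval?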