I have a set of LinkedIn jobs and a set of occupations, and I'm trying to match each job with the most similar occupation(s).

I used the fine-tuning API on curie, and my JSONL file looks like this:

{"prompt":"similar occupation to software designer?\n\n###\n\n","completion":" {\"occupation_id\": 2535, \"occupation\": \"software architect\n\"}"}
{"prompt":"similar occupation to application architect?\n\n###\n\n","completion":" {\"occupation_id\": 2535, \"occupation\": \"software architect\n\"}"}
{"prompt":"similar occupation to software specialist?\n\n###\n\n","completion":" {\"occupation_id\": 2536, \"occupation\": \"software developer\n\"}"}
{"prompt":"similar occupation to software developers?\n\n###\n\n","completion":" {\"occupation_id\": 2536, \"occupation\": \"software developer\n\"}"}
{"prompt":"similar occupation to programmer?\n\n###\n\n","completion":" {\"occupation_id\": 2536, \"occupation\": \"software developer\n\"}"}
{"prompt":"similar occupation to application software developer?\n\n###\n\n","completion":" {\"occupation_id\": 2536, \"occupation\": \"software developer\n\"}"}
{"prompt":"similar occupation to software engineer?\n\n###\n\n","completion":" {\"occupation_id\": 2536, \"occupation\": \"software developer\n\"}"}
{"prompt":"similar occupation to Social Media and Marketing Coordinator?\n\n###\n\n","completion":" {\"occupation_id\": 1973, \"occupation\": \"online marketer\n\"}"}
{"prompt":"similar occupation to Social Media Marketing Intern?\n\n###\n\n","completion":" {\"occupation_id\": 1973, \"occupation\": \"online marketer\n\"}"}
{"prompt":"similar occupation to Social Media Marketing Intern?\n\n###\n\n","completion":" {\"occupation_id\": 1973, \"occupation\": \"online marketer\n\"}"}

and so on.
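A minimal sketch of how lines in this shape could be generated so that the nested quotes are escaped correctly. `make_record` is a hypothetical helper, not part of any API, and the prompt and completion templates simply mirror the examples above.

```python
import json

def make_record(job_title, occupation_id, occupation):
    """Build one JSONL line in the same shape as the examples above."""
    prompt = f"similar occupation to {job_title}?\n\n###\n\n"
    # The completion keeps the leading space and the trailing newline
    # inside the occupation string, exactly as in the examples.
    completion = f' {{"occupation_id": {occupation_id}, "occupation": "{occupation}\n"}}'
    # json.dumps escapes the inner quotes and newlines, so each record
    # stays on a single physical line.
    return json.dumps({"prompt": prompt, "completion": completion},
                      separators=(",", ":"))

if __name__ == "__main__":
    print(make_record("software designer", 2535, "software architect"))
```

Letting `json.dumps` produce the outer object avoids hand-escaping quotes, which is easy to get wrong when the completion itself looks like JSON.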

I have a fixed set of occupations in my database, each with an ID.

Now I'm using the completion API on my fine-tuned model, but it's not accurate: most of the time the returned ID is wrong and doesn't belong to the occupation returned. It sometimes returns an occupation that doesn't exist in my training data. I need it to return one of the occupations I provided in the training data, with the correct ID.

Any idea what I'm doing wrong or what can be improved?

  • Provide more training examples so that every occupation you want returned is covered, ideally several times. The model is far less likely to return an occupation it has never seen in the training data.

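  • Use the logit_bias parameter of the completion API to raise the probability of tokens that appear in known occupations and lower the probability of other tokens. The bias applies per token rather than per occupation, so it makes occupations from the training set much more likely but does not guarantee them.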

  • Restrict the model’s output to a fixed, known list of occupations. For example, only return one of the 5 most probable occupations, where those 5 are drawn from the training occupations.

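  • Remove the occupation_id from the completion and have the model return only the occupation name. The ID can then be looked up in your database, so it can no longer be mismatched.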
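
The last two suggestions can be combined in post-processing. Below is a minimal sketch, assuming the model returns only an occupation name and that the known occupations fit in memory. `KNOWN_OCCUPATIONS` and `resolve_occupation` are hypothetical names, and fuzzy string matching with `difflib` is just one simple way to snap the output to the known list.

```python
import difflib

# Hypothetical lookup table; in practice this would be loaded from the database.
KNOWN_OCCUPATIONS = {
    "software architect": 2535,
    "software developer": 2536,
    "online marketer": 1973,
}

def resolve_occupation(raw_completion, cutoff=0.6):
    """Map free-text model output to a known (id, name) pair, or None."""
    name = raw_completion.strip().lower()
    # An exact match needs no fuzzy search.
    if name in KNOWN_OCCUPATIONS:
        return KNOWN_OCCUPATIONS[name], name
    # Otherwise take the closest known name, if it is similar enough.
    matches = difflib.get_close_matches(
        name, list(KNOWN_OCCUPATIONS), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    best = matches[0]
    return KNOWN_OCCUPATIONS[best], best

if __name__ == "__main__":
    print(resolve_occupation(" software developers\n"))
```

Because the ID always comes from the lookup table, it cannot disagree with the occupation name, and an output that matches nothing is rejected instead of being passed on.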