Tagging a CSV dataset: fine-tune, embed or neither?

I have a CSV dataset of 40,000 health-related texts (as it happens, questions). I want to tag them with their subject, and output the questions plus tags to a new CSV file.

I have 50-100 questions already tagged with example tags. Ideally I’d like the model to generalise from the tags I have. E.g. for the question *How many people were diagnosed with kidney cancer in 2022?* I have added the tags `diagnosis, cancer, kidney`, and I’d like the model to look at *How many children were diagnosed with diabetes in each of the past five years?* and add tags like `diagnosis, diabetes, children`.

This works pretty well if I provide 50 tagged questions in the ChatGPT prompt UI and ask it individual questions. But I’m trying to work out the most efficient, and likely successful, way to do this for 40,000 questions.

Should I:

  1. Use the Completion API with my custom prompt, and tag the remaining questions in batches?
  2. Create embeddings of the sample dataset, then use that embeddings database to tag the remaining questions? (Unfortunately the Cookbook doesn’t have a tagging example; I’ve sketched what I have in mind below this list.)
  3. Fine-tune my own model? (Maybe overkill given that the ChatGPT prompt works pretty well.)
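
For context, here is roughly what I imagine option 2 looking like. This is a sketch only, assuming the `text-embedding-ada-002` model and a simple nearest-neighbour lookup; the function names and `k=3` are my own guesses and I haven’t run this:

```python
# Sketch of option 2: tag a new question with the tags of its nearest
# neighbours in embedding space. Assumes openai>=1.0 and numpy.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    """Embed a list of strings, returning an (n, d) array."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

# The 50-100 tagged samples, as (question, [tags]) pairs.
tagged_examples = [
    ("How many people were diagnosed with kidney cancer in 2022?",
     ["diagnosis", "cancer", "kidney"]),
    # ... the rest of the tagged sample ...
]

example_vecs = embed([q for q, _ in tagged_examples])
example_vecs /= np.linalg.norm(example_vecs, axis=1, keepdims=True)

def tag_by_neighbours(question, k=3):
    """Union the tags of the k most similar tagged examples."""
    v = embed([question])[0]
    v /= np.linalg.norm(v)
    sims = example_vecs @ v                 # cosine similarity
    nearest = np.argsort(sims)[::-1][:k]
    return sorted({t for i in nearest for t in tagged_examples[i][1]})
```

One caveat I can already see with this approach: it can only reuse tags that appear in my samples, so it would never produce `diabetes` unless some example already carries that tag.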

Grateful for any thoughts or examples.

Hi and welcome to the Developer Forum!

Fine-tuning is certainly an option, but you would want thousands to tens of thousands of Q/A example pairs to get decent reliability. I think your examples set is as good an idea as any, but… may I suggest trying just 3 or 4 examples in your prompt? Attention is at a premium, and typically you’ll find high-quality results with just a few examples to keep the task in focus.
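
Something along these lines, for instance. Just a sketch: the model name is a placeholder, and the second and third example pairs are invented for illustration.

```python
# Sketch of a few-shot tagging call: 3 worked examples, then the new question.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """You tag health questions with short subject tags.
Reply with a comma-separated list of tags and nothing else.

Q: How many people were diagnosed with kidney cancer in 2022?
Tags: diagnosis, cancer, kidney

Q: What is the average waiting time for a hip replacement?
Tags: waiting times, surgery, hip

Q: How many flu vaccinations were given to over-65s last winter?
Tags: vaccination, flu, elderly
"""

def tag_question(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder, use whatever model suits you
        messages=[
            {"role": "system", "content": FEW_SHOT},
            {"role": "user", "content": f"Q: {question}\nTags:"},
        ],
        temperature=0,           # keep the labelling as deterministic as possible
    )
    return resp.choices[0].message.content.strip()
```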

My advice here would be to first focus on creating a method to measure the accuracy of the labels against a ground truth, so you have a way to track your progress :laughing:
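
Even something very simple works as a start. A sketch, assuming tags come and go as comma-separated strings and you have a small hand-labelled hold-out set:

```python
# Sketch of a scorer: predicted tags vs. hand-labelled ground truth.
def tag_set(s: str) -> set:
    return {t.strip().lower() for t in s.split(",") if t.strip()}

def score(predicted: list, truth: list) -> None:
    exact, overlap = 0, 0.0
    for pred, gold in zip(predicted, truth):
        p, g = tag_set(pred), tag_set(gold)
        exact += p == g
        overlap += len(p & g) / len(p | g) if p | g else 1.0
    n = len(truth)
    print(f"exact match: {exact / n:.0%}, "
          f"mean tag overlap (Jaccard): {overlap / n:.0%}")
```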

Thanks @Foxalabs - I’ll try just a few examples.

Thanks also @N2U - I’ll score the results by hand.

Any thoughts on completion API vs embeddings?

And if the completion API is the best way to go, once I’ve got a prompt that works well, do I just write a script to call it repeatedly over the input CSV, a few thousand rows at a time? Sorry for the beginner question.

If I understand your use case correctly, API for sure. No need to embed the CSV file unless you want to run queries against it on an ongoing basis.
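
As for the script: yes, just loop over the file. A sketch, untested, with `tag_question` standing in for whatever prompt/call you settle on and `question` for your column name; writing incrementally means a crash at row 30,000 doesn’t cost you everything:

```python
# Sketch: tag every row of the input CSV, appending results as we go.
import csv
import time

def tag_all(in_path: str, out_path: str) -> None:
    with open(in_path, newline="", encoding="utf-8") as f_in, \
         open(out_path, "a", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.writer(f_out)
        for i, row in enumerate(reader):
            question = row["question"]          # placeholder column name
            try:
                tags = tag_question(question)   # your few-shot call
            except Exception:
                time.sleep(10)                  # crude back-off for rate limits
                tags = tag_question(question)
            writer.writerow([question, tags])
            if i % 100 == 0:
                f_out.flush()                   # don't lose progress on a crash
```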

If I understand correctly, you basically want to categorize the question list you currently have. I am doing something remarkably similar in my RAG application.

I store the question/response pairs in a regular DBMS table, and I have a function that goes through the table hourly to categorize these records. The model creates and maintains the list of categories: it reads each record and assigns the appropriate category to it, using an existing one or creating a new one.

This is just the LLM and the database. No embeddings necessary.
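
In sketch form it looks something like this. I’m using sqlite3 as a stand-in for the DBMS here, and the table, columns, model name and prompt wording are all invented for the example:

```python
# Sketch of the hourly categoriser: read uncategorised rows, ask the model
# to pick an existing category or coin a new one, write the result back.
import sqlite3
from openai import OpenAI

client = OpenAI()

def categorise_pending(db_path: str) -> None:
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    # Current category list, maintained by the model itself over time.
    categories = [r[0] for r in cur.execute(
        "SELECT DISTINCT category FROM qa_log WHERE category IS NOT NULL")]
    pending = cur.execute(
        "SELECT id, question FROM qa_log WHERE category IS NULL").fetchall()
    for row_id, question in pending:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",   # placeholder
            messages=[
                {"role": "system", "content":
                    "Assign one category to the question. Prefer one of: "
                    + ", ".join(categories)
                    + ". If none fits, invent a short new category. "
                    "Reply with the category only."},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )
        category = resp.choices[0].message.content.strip()
        cur.execute("UPDATE qa_log SET category = ? WHERE id = ?",
                    (category, row_id))
        if category not in categories:
            categories.append(category)   # new category joins the list
    con.commit()
    con.close()
```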