How to make classification and information extraction task cost-effective

Seeking advice from experienced API users for how to perform a large-ish classification and information extraction task in a cost-effective way.

I have a dataset with about 500,000 questions and answers. I want to identify which of these questions are about one topic (specifically, literature). I don’t know in advance what percentage of questions are about my topic, but it is a minority, maybe 20% at most.

Then, from the questions that are about my topic, I would like to extract additional data (e.g., authors and works mentioned or referenced in the question).

Listing my current plan below. Seeking suggestions to improve it!

Phase 1

Using GPT-3.5 Turbo (or a cheaper model if performance is good), perform binary classification on the 500k questions to determine whether each is about the topic. Limit the output to a single integer confidence score (0 = definitely not about my topic, 100 = definitely about my topic).
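A minimal sketch of what one Phase 1 call could look like, assuming the OpenAI Python SDK (v1) and `gpt-3.5-turbo`; the prompt wording and the `parse_score` helper are illustrative stand-ins, not your actual 120-token system prompt:

```python
# Sketch of Phase 1: score one question per API call.
# Model name and prompt wording are placeholder assumptions.

SYSTEM_PROMPT = (
    "You are a classifier. Reply with a single integer from 0 to 100: "
    "0 = the question is definitely not about literature, "
    "100 = it definitely is. Output the number only."
)

def parse_score(reply: str) -> int:
    """Extract the integer score from the model reply, clamped to 0-100.
    Falls back to 0 if the reply contains no digits."""
    digits = "".join(ch for ch in reply if ch.isdigit())
    if not digits:
        return 0
    return max(0, min(100, int(digits[:3])))

def classify(question: str, model: str = "gpt-3.5-turbo") -> int:
    """One request per question (assumes the `openai` v1 SDK is installed)."""
    from openai import OpenAI  # lazy import so parse_score is testable offline
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        max_tokens=3,   # the score is at most three digits
        temperature=0,  # make scoring as repeatable as possible
    )
    return parse_score(resp.choices[0].message.content)
```

Capping `max_tokens` keeps the output side of each call to a few tokens, so nearly all of the per-question cost is input tokens.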

Phase 2

With the subset of questions about my topic, use GPT-4 Turbo to extract additional information implicit in the question.

Example question: “Who says ‘but soft what light through yonder window breaks?’ Romeo.”

Example output (will be JSON): Author: William Shakespeare (1564-1616). Work: Romeo and Juliet (1597).
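A hedged sketch of a Phase 2 call, assuming the OpenAI Python SDK (v1) and GPT-4 Turbo's JSON mode; the field names (`author`, `work`, etc.) are an assumed schema based on the example above:

```python
import json

# Sketch of Phase 2: structured extraction into JSON.
# The schema below is an assumption inferred from the example output.

EXTRACTION_PROMPT = (
    "Extract the literary references from the question. Reply as JSON: "
    '{"author": "...", "author_years": "...", "work": "...", "work_year": "..."}. '
    "Use null for any field you cannot determine."
)

def parse_extraction(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating missing keys."""
    data = json.loads(reply)
    keys = ("author", "author_years", "work", "work_year")
    return {k: data.get(k) for k in keys}

def extract(question: str, model: str = "gpt-4-turbo") -> dict:
    from openai import OpenAI  # lazy import; assumes the v1 SDK
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # forces syntactically valid JSON
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return parse_extraction(resp.choices[0].message.content)
```

JSON mode guarantees parseable output but not that the keys match your schema, hence the tolerant `parse_extraction` wrapper.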

Questions

  • I have written a system prompt for this task (about 120 tokens) that is effective in the Playground. Using the API, do you have to pay for the tokens in the system prompt for every user input evaluated?

  • If yes, would it be better to batch many questions per prompt rather than repeatedly paying for the same system message?

  • Is the two-phase approach (i.e., classify the full set first, then extract from the subset) the best one? Would it be more efficient to perform both tasks (classification and information extraction) at once? Under the current plan, questions about my topic are processed twice. (On the other hand, paying to generate a lot of empty JSON for off-topic questions is also wasteful.)
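To make the trade-off in these questions concrete, here is a back-of-envelope Phase 1 cost estimate. The per-token prices are illustrative assumptions (check current pricing); the 120-token system prompt and the 30-300 token question range come from this thread, and yes, the system prompt's input tokens are billed on every request:

```python
# Rough cost estimate for Phase 1. Prices are assumed, not quoted.
N_QUESTIONS = 500_000
SYSTEM_TOKENS = 120          # the system prompt is billed on every call
AVG_QUESTION_TOKENS = 165    # midpoint of the 30-300 token range
OUTPUT_TOKENS = 3            # a 0-100 score

PRICE_IN = 0.50 / 1_000_000   # assumed $ per input token for a cheap model
PRICE_OUT = 1.50 / 1_000_000  # assumed $ per output token

per_call = (SYSTEM_TOKENS + AVG_QUESTION_TOKENS) * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT
total = per_call * N_QUESTIONS
system_share = SYSTEM_TOKENS * PRICE_IN * N_QUESTIONS
print(f"total ~${total:,.2f}, of which the system prompt costs ~${system_share:,.2f}")
```

Under these assumed prices the whole classification pass is tens of dollars, with the repeated system prompt a meaningful but not dominant fraction; batching several questions per prompt would amortize it further at the cost of more fragile output parsing.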

Thanks for any advice you can provide!


Hi - I think your proposed approach makes a lot of sense. I would honestly keep it as two separate steps to ensure optimal outcomes. I do think you should also consider a fine-tuned GPT-3.5 model for Phase 2; in my experience they are quite capable of these types of tasks.

I do not know the length of your system prompt, but unless it is very extensive, the cost implications are marginal. Therefore I'd recommend processing question by question for optimal results.

Thank you for your comment! I will read up on fine-tuning 3.5.

My system prompt is approximately 120 tokens (also edited my original post to include this information). I would guess that almost all questions will fall between 30 and 300 tokens.

Thanks for the additional information. I can only echo my recommendation in light of that. I have several fine-tuned GPT 3.5 turbo models in place that I use predominantly for classification tasks and I have found them to be very cost-effective and high-performing.
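If you go the fine-tuning route, the training data is a JSONL file of chat-formatted examples (per OpenAI's fine-tuning docs). A small sketch with made-up labels, assuming the classification task above:

```python
import json

# Sketch of the JSONL training format for fine-tuning gpt-3.5-turbo on the
# classification task. The two examples and their labels are invented.
PROMPT = "Score 0-100: is this question about literature? Output the number only."

examples = [
    {"messages": [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": "Who wrote Moby-Dick? Herman Melville."},
        {"role": "assistant", "content": "100"},
    ]},
    {"messages": [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": "What is the capital of France? Paris."},
        {"role": "assistant", "content": "0"},
    ]},
]

# One JSON object per line, as the fine-tuning endpoint expects.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

A nice side effect of fine-tuning is that a much shorter system prompt (or none at all) often suffices, which also trims the per-call input cost discussed earlier in the thread.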

Hey, your task is quite similar to my internship project. If possible, could we discuss it a bit more on Twitter or LinkedIn, whichever suits you?