How to make classification and information extraction task cost-effective

Seeking advice from experienced API users for how to perform a large-ish classification and information extraction task in a cost-effective way.

I have a dataset with about 500,000 questions and answers. I want to identify which of these questions are about one topic (specifically, literature). I don’t know in advance what percentage of questions are about my topic, but it is a minority, maybe 20% at most.

Then, from the questions that are about my topic, I would like to extract additional data (e.g., authors and works mentioned or referenced in the question).

Listing my current plan below. Seeking suggestions to improve it!

Phase 1

Using GPT-3.5 Turbo (or a cheaper model if performance is good), perform binary classification on the 500k questions to determine whether each is about the topic. Limit the output to a single integer confidence score (0 = definitely not about my topic, 100 = definitely about my topic).
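A minimal sketch of what one Phase 1 call could look like, assuming the OpenAI Python SDK (v1) and `gpt-3.5-turbo`; the prompt wording and the `parse_score` helper are illustrative stand-ins, not your actual 120-token system prompt:

```python
# Sketch of Phase 1: score one question per API call.
# Model name and prompt wording are placeholder assumptions.

SYSTEM_PROMPT = (
    "You are a classifier. Reply with a single integer from 0 to 100: "
    "0 = the question is definitely not about literature, "
    "100 = it definitely is. Output the number only."
)

def parse_score(reply: str) -> int:
    """Extract the integer score from the model reply, clamped to 0-100.
    Falls back to 0 if the reply contains no digits."""
    digits = "".join(ch for ch in reply if ch.isdigit())
    if not digits:
        return 0
    return max(0, min(100, int(digits[:3])))

def classify(question: str, model: str = "gpt-3.5-turbo") -> int:
    """One request per question (assumes the `openai` v1 SDK is installed)."""
    from openai import OpenAI  # lazy import so parse_score is testable offline
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        max_tokens=3,   # the score is at most three digits
        temperature=0,  # make scoring as repeatable as possible
    )
    return parse_score(resp.choices[0].message.content)
```

Capping `max_tokens` keeps the output side of each call to a few tokens, so nearly all of the per-question cost is input tokens.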

Phase 2

With the subset of questions about my topic, use GPT-4 Turbo to extract additional information implicit in the question.

Example question: “Who says ‘but soft what light through yonder window breaks?’ Romeo.”

Example output (will be JSON): Author: William Shakespeare (1564-1616). Work: Romeo and Juliet (1597).
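A hedged sketch of a Phase 2 call, assuming the OpenAI Python SDK (v1) and GPT-4 Turbo's JSON mode; the field names (`author`, `work`, etc.) are an assumed schema based on the example above:

```python
import json

# Sketch of Phase 2: structured extraction into JSON.
# The schema below is an assumption inferred from the example output.

EXTRACTION_PROMPT = (
    "Extract the literary references from the question. Reply as JSON: "
    '{"author": "...", "author_years": "...", "work": "...", "work_year": "..."}. '
    "Use null for any field you cannot determine."
)

def parse_extraction(reply: str) -> dict:
    """Parse the model's JSON reply, tolerating missing keys."""
    data = json.loads(reply)
    keys = ("author", "author_years", "work", "work_year")
    return {k: data.get(k) for k in keys}

def extract(question: str, model: str = "gpt-4-turbo") -> dict:
    from openai import OpenAI  # lazy import; assumes the v1 SDK
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # forces syntactically valid JSON
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return parse_extraction(resp.choices[0].message.content)
```

JSON mode guarantees parseable output but not that the keys match your schema, hence the tolerant `parse_extraction` wrapper.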

Questions

  • I have written a system prompt for this task (about 120 tokens) that is effective in the Playground. Using the API, do you have to pay for the tokens in the system prompt for every user input evaluated?

  • If yes, would it be better to batch many questions per prompt rather than repeatedly paying for the same system message?

  • Is the two-phase approach (i.e., classify the full set first, then extract from the subset) the best one? Would it be more efficient to perform both tasks (classification and information extraction) at once? Under the current plan, questions about my topic are processed twice. (On the other hand, paying to generate a lot of empty JSON for off-topic questions is also wasteful.)
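To make the trade-off in these questions concrete, here is a back-of-envelope Phase 1 cost estimate. The per-token prices are illustrative assumptions (check current pricing); the 120-token system prompt and the 30-300 token question range come from this thread, and yes, the system prompt's input tokens are billed on every request:

```python
# Rough cost estimate for Phase 1. Prices are assumed, not quoted.
N_QUESTIONS = 500_000
SYSTEM_TOKENS = 120          # the system prompt is billed on every call
AVG_QUESTION_TOKENS = 165    # midpoint of the 30-300 token range
OUTPUT_TOKENS = 3            # a 0-100 score

PRICE_IN = 0.50 / 1_000_000   # assumed $ per input token for a cheap model
PRICE_OUT = 1.50 / 1_000_000  # assumed $ per output token

per_call = (SYSTEM_TOKENS + AVG_QUESTION_TOKENS) * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT
total = per_call * N_QUESTIONS
system_share = SYSTEM_TOKENS * PRICE_IN * N_QUESTIONS
print(f"total ~${total:,.2f}, of which the system prompt costs ~${system_share:,.2f}")
```

Under these assumed prices the whole classification pass is tens of dollars, with the repeated system prompt a meaningful but not dominant fraction; batching several questions per prompt would amortize it further at the cost of more fragile output parsing.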

Thanks for any advice you can provide!


Hi - I think your proposed approach makes a lot of sense. I would honestly keep it as two separate steps to ensure optimal outcomes. I do think you should also consider a fine-tuned GPT-3.5 model for Phase 2; in my experience they are quite capable of these types of tasks.

I do not know the length of your system prompt, but unless it is very extensive, the cost implications are marginal. Therefore I'd recommend processing question by question for optimal results.

Thank you for your comment! I will read up on fine-tuning 3.5.

My system prompt is approximately 120 tokens (also edited my original post to include this information). I would guess that almost all questions will fall between 30 and 300 tokens.

Thanks for the additional information. I can only echo my recommendation in light of that. I have several fine-tuned GPT 3.5 turbo models in place that I use predominantly for classification tasks and I have found them to be very cost-effective and high-performing.
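If you go the fine-tuning route, the training data is a JSONL file of chat-formatted examples (per OpenAI's fine-tuning docs). A small sketch with made-up labels, assuming the classification task above:

```python
import json

# Sketch of the JSONL training format for fine-tuning gpt-3.5-turbo on the
# classification task. The two examples and their labels are invented.
PROMPT = "Score 0-100: is this question about literature? Output the number only."

examples = [
    {"messages": [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": "Who wrote Moby-Dick? Herman Melville."},
        {"role": "assistant", "content": "100"},
    ]},
    {"messages": [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": "What is the capital of France? Paris."},
        {"role": "assistant", "content": "0"},
    ]},
]

# One JSON object per line, as the fine-tuning endpoint expects.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

A nice side effect of fine-tuning is that a much shorter system prompt (or none at all) often suffices, which also trims the per-call input cost discussed earlier in the thread.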

Hey, your task is quite similar to my internship project. If possible, could we discuss it a bit more on Twitter or LinkedIn, whichever suits you?