Seeking advice from experienced API users on how to perform a large-ish classification and information extraction task in a cost-effective way.
I have a dataset with about 500,000 questions and answers. I want to identify which of these questions are about one topic (specifically, literature). I don’t know in advance what percentage of questions are about my topic, but it is a minority, maybe 20% at most.
Then, from the questions that are about my topic, I would like to extract additional data (e.g., authors and works mentioned or referenced in the question).
Listing my current plan below. Seeking suggestions to improve it!
Phase 1
Using gpt-3.5-turbo (or a cheaper model if performance is good), perform binary classification on the 500k questions to determine whether they are about the topic. Limit the output to a single integer that is essentially a confidence score (0 = definitely not about my topic, 100 = definitely about my topic).
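To make this concrete, here is a minimal sketch of the per-question call I have in mind, using the OpenAI Python SDK (the system prompt below is a short placeholder; my real one is ~120 tokens):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder system prompt; my actual prompt is about 120 tokens.
CLASSIFY_PROMPT = (
    "You are a classifier. Given a question, reply with a single integer "
    "from 0 to 100, where 0 = definitely not about literature and "
    "100 = definitely about literature. Reply with the number only."
)

def classify(question: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=3,  # the score is at most three digits
        messages=[
            {"role": "system", "content": CLASSIFY_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```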
Phase 2
With the subset of questions about my topic, use GPT-4 Turbo to extract additional information implicit in each question.
Example question: “Who says ‘but soft what light through yonder window breaks?’ Romeo.”
Example output (will be JSON): Author: William Shakespeare (1564-1616). Work: Romeo and Juliet (1597).
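Here is a rough sketch of the Phase 2 call using JSON mode; the field names are just placeholders I made up, not a settled schema:

```python
import json

from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the output schema is not final.
EXTRACT_PROMPT = (
    "Identify the authors and works mentioned or referenced in the question. "
    'Respond in JSON with the keys "author", "author_dates", "work", and '
    '"work_year".'
)

def extract(question: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode
        messages=[
            {"role": "system", "content": EXTRACT_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# For the example question above, I would expect something like:
# {"author": "William Shakespeare", "author_dates": "1564-1616",
#  "work": "Romeo and Juliet", "work_year": 1597}
```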
Questions
- I have written a system prompt for this task (about 120 tokens) that is effective in the Playground. Using the API, do you have to pay for the tokens in the system prompt for every user input evaluated?
- If yes, would it be better to batch many questions per prompt rather than repeatedly paying for the same system message? (A rough sketch of what I mean is below, after the last question.)
- Is the two-phase solution (i.e., classify the set first, then extract from the subset) the best one? Would it potentially be more efficient to perform both tasks (classification and information extraction) at once? In the current plan, questions about my topic will have to be processed twice. (On the other hand, paying to generate a lot of empty JSON for off-topic questions seems wasteful.)
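To make the batching question concrete, this is the shape of what I am imagining: number several questions in one user message and ask for one score per line, so the system prompt is paid for once per batch rather than once per question. Untested, and the parsing would need hardening:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical batch prompt; I have not verified that accuracy holds up.
BATCH_PROMPT = (
    "You will receive a numbered list of questions. For each one, output a "
    "line of the form '<number>. <score>', where <score> is an integer from "
    "0 (definitely not about literature) to 100 (definitely about "
    "literature). Output nothing else."
)

def classify_batch(questions: list[str]) -> list[int]:
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": BATCH_PROMPT},
            {"role": "user", "content": numbered},
        ],
    )
    scores = []
    for line in resp.choices[0].message.content.strip().splitlines():
        # Expect lines like "3. 85"; real code should validate the count
        # and the format before trusting the scores.
        scores.append(int(line.split(".", 1)[1].strip()))
    return scores
```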
Thanks for any advice you can provide!