Yes, exactly. That’s my issue. I have about 3000 phrases in my dataset, which I could scale down a bit for testing purposes, but it’s still problematic for my main use case. So I am thinking about how I can mitigate the implications of my underlying design, which is very token-hungry and inefficient. Since I am currently bound to that setup, mitigation is the only thing I can do for now.
I have users who need to dig through my dataset to define labels, and every time they find part of a (new) label, they have the opportunity to label a few instances through a keyword search. However, since these labels are not well defined yet, they can’t find all instances through obvious keywords, which is where AI comes into play.
Furthermore, I am aiming for a collaborative situation, where the model takes a similar role to a real assistant who digs through the data and presents what it considers to be appropriate. Like a very primitive version of the dynamic that is shown here: https://www.youtube.com/watch?v=BdHj210v9Yo
I have a labeled gold-standard dataset for testing purposes, but in my real-world scenario the labels would evolve on the fly which is why I have decided on a binary classification for each new label.
To examine whether the performance is better or worse for some labels I have to test them all (e.g. because some might have more structural markers).
I have ~40 labels × ~3000 classifications of ~1000 tokens each, i.e. roughly 120,000 API calls and 120 million tokens, which comes to more than $18.
Of course I can scale that down a bit, but I am still exploring other ways to improve efficiency. I probably can’t make that many API calls in a reasonable time anyway…
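For reference, here is a rough back-of-the-envelope sketch of the numbers above. The price per million tokens and the requests-per-minute figure are placeholder assumptions on my part (not taken from any pricing or rate-limit page), and the estimate counts input tokens only, so adjust them to whatever model and tier you actually use.

```python
def estimate_run(labels, phrases, tokens_per_call,
                 usd_per_million_tokens, requests_per_minute):
    """Estimate calls, tokens, cost and wall-clock time for one
    binary classification call per (label, phrase) pair."""
    calls = labels * phrases                      # one call per label/phrase pair
    total_tokens = calls * tokens_per_call        # input tokens only
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    minutes = calls / requests_per_minute         # assumes strictly sequential rate limiting
    return {
        "calls": calls,
        "total_tokens": total_tokens,
        "cost_usd": cost_usd,
        "minutes": minutes,
    }


# Assumed values: 0.15 USD per 1M input tokens and 500 requests per minute.
est = estimate_run(labels=40, phrases=3000, tokens_per_call=1000,
                   usd_per_million_tokens=0.15, requests_per_minute=500)

print(f"calls:  {est['calls']:,}")
print(f"tokens: {est['total_tokens']:,}")
print(f"cost:   ${est['cost_usd']:.2f}")
print(f"time:   {est['minutes'] / 60:.1f} h at the assumed rate limit")
```

Plugging in a smaller `phrases` or `labels` value shows quickly how much a scaled-down test run would save.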