Sorry if that is a dumb question, but I have data that is already cleaned and tokenized. Do you have insights on how it might change model performance if I join my tokens back into a string and submit it like normal text?
What are you trying to accomplish?
Whether going from tokenized vectors to tokenized vectors saves you compute time or resources depends on what you are doing.
Tokenization is not a lossless compression other than at 0 temp. Once you allow any probability into the responses, the compression/decompression becomes lossy or non-deterministic. That’s its power.
I’m building a text classifier and I have already cleaned my data (stopword removal, lemmatization, etc.) for some conventional models. I was just wondering if I would consume fewer model tokens if I submit this cleaned version to ada instead of the full text.
Since I don’t really understand how the models work, it might have an impact on prediction quality if they also take stopwords/POS into account that I remove through my own preprocessing. Or maybe it helps.
I am curious about which of my data I should use for classification and whether there is a tradeoff.
If the cleaned text is shorter (fewer tokens) than the original text, then it’ll consume fewer tokens. That said, ada is extremely cheap, so I’m not sure there would be any practical difference in pricing.
I’d encourage you to experiment with running predictions on both forms of data, since the cost is extremely low, so there’s no risk (ada is $0.0008 per 1,000 tokens).
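To get a first feel for how much the cleaning saves before spending anything, the two forms can be compared offline. The snippet below is only a rough sketch: `approx_token_count` is a hypothetical stand-in that counts words and punctuation marks, not the model’s actual BPE tokenizer, and the example sentence is made up. Real token counts will differ (lemmas and unusual word forms can split into several tokens), so a real tokenizer function should be passed as `count` for anything beyond a ballpark.

```python
import re

def approx_token_count(text: str) -> int:
    # Crude proxy, NOT the real tokenizer: each word and each
    # punctuation mark counts as one token.
    return len(re.findall(r"\w+|[^\w\s]", text))

def token_savings(raw_texts, cleaned_texts, count=approx_token_count):
    """Return (raw_total, cleaned_total, fraction_saved) for two parallel lists."""
    raw_total = sum(count(t) for t in raw_texts)
    cleaned_total = sum(count(t) for t in cleaned_texts)
    fraction = (raw_total - cleaned_total) / raw_total if raw_total else 0.0
    return raw_total, cleaned_total, fraction

# Made-up example: one original phrase and its cleaned counterpart.
raw = ["The delivery was late and the package was damaged."]
cleaned = ["delivery late package damage"]

raw_total, cleaned_total, fraction = token_savings(raw, cleaned)
print(f"raw={raw_total} cleaned={cleaned_total} saved={fraction:.0%}")
```

Token savings say nothing about prediction quality, so this only covers the cost side of the tradeoff; accuracy still has to be measured by classifying both forms.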
Well, if my preprocessing saves 10-20% of tokens, I would not call that insignificant. (I haven’t calculated it, might be even more.)
And even though it is cheap, that is countered by the sheer amount of data that I have. One classification API call costs me about 1000 tokens right now.
To systematically assess how it performs with/without previous cleaning based on my data, I would have to make several hundreds, if not thousands, of classifications.
I see, thanks for explaining.
Hundreds to thousands of classifications then comes out to hundreds-of-thousands to millions (100s-1000s * 1000) of ada tokens, which is still in the range of cents-to-dollars.
For example:
- 100 ada calls of 1,000 tokens each = $0.08
- 1,000 ada calls of 1,000 tokens each = $0.80
- 3,000 ada calls of 1,000 tokens each = $2.40
We provide new users $18 worth of free tokens to experiment with, which is equivalent to 22,500 ada calls of 1,000 tokens each.
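The arithmetic behind these figures can be reproduced with a few lines. This is just a sketch using the $0.0008 per 1,000 tokens quoted above; the function names are made up for illustration, and `Decimal` is used only to avoid floating-point rounding in the dollar amounts.

```python
from decimal import Decimal

ADA_PRICE_PER_1K = Decimal("0.0008")  # USD per 1,000 tokens, as quoted in this thread

def ada_cost(calls, tokens_per_call=1000, price_per_1k=ADA_PRICE_PER_1K):
    """Cost in USD of `calls` requests of `tokens_per_call` tokens each."""
    return Decimal(calls) * tokens_per_call / 1000 * price_per_1k

def calls_within_budget(budget_usd, tokens_per_call=1000, price_per_1k=ADA_PRICE_PER_1K):
    """Number of whole calls that fit into a budget given in USD."""
    cost_per_call = Decimal(tokens_per_call) / 1000 * price_per_1k
    return int(Decimal(str(budget_usd)) / cost_per_call)

for n in (100, 1000, 3000):
    print(f"{n} calls: ${ada_cost(n):.2f}")
print("calls covered by $18:", calls_within_budget(18))
```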
Yes, exactly. That’s my issue. I have about 3000 phrases in my dataset, which I could size down a bit for testing purposes, but it’s still problematic for my main use case. So I am thinking about how I can mitigate the implications of my underlying design, which is very token hungry/inefficient. But since I am currently bound to that setup, the only thing I can do for now is mitigation.
I have users that need to dig through my dataset to define labels, and every time they find a part of a (new) label, they have the opportunity to label a few instances through a keyword search. However, since these labels are not well defined yet, they can’t find all instances through obvious keywords, which is where AI comes into play.
Furthermore, I am aiming for a collaborative situation, where the model takes a similar role to a real assistant who digs through the data and presents what it considers to be appropriate. Like a very primitive version of the dynamic that is shown here: https://www.youtube.com/watch?v=BdHj210v9Yo
I have a labeled gold-standard dataset for testing purposes, but in my real-world scenario the labels would evolve on the fly which is why I have decided on a binary classification for each new label.
To examine whether the performance is better or worse for some labels I have to test them all (e.g. because some might have more structural markers).
I have ~40 labels * ~3000 classifications of 1000 tokens each, which is > $18 (roughly $96 at the ada price quoted above).
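For reference, here is a small sketch of how that total moves under the two mitigations mentioned in this thread: shorter inputs from preprocessing and testing on a smaller sample. The 20% reduction is the upper end of my own uncalculated guess, and the sample size of 500 per label is purely hypothetical.

```python
ADA_PRICE_PER_1K = 0.0008  # USD per 1,000 tokens, as quoted in this thread

def experiment_cost(labels, items_per_label, tokens_per_call=1000,
                    token_reduction=0.0, price_per_1k=ADA_PRICE_PER_1K):
    """Cost in USD of one binary classification per (label, item) pair.

    token_reduction is the assumed fraction of tokens saved by preprocessing.
    """
    effective_tokens = tokens_per_call * (1.0 - token_reduction)
    total_tokens = labels * items_per_label * effective_tokens
    return total_tokens / 1000 * price_per_1k

full = experiment_cost(40, 3000)                          # everything, uncleaned
trimmed = experiment_cost(40, 3000, token_reduction=0.2)  # assumed 20% fewer tokens
sampled = experiment_cost(40, 500)                        # hypothetical 500-item sample

print(f"full: ${full:.2f}  trimmed: ${trimmed:.2f}  sampled: ${sampled:.2f}")
```

Under these assumptions, cutting tokens per call helps proportionally, while sampling fewer items per label is what brings the test run under the free-credit amount.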
Of course I can scale that down a bit, but I am still exploring other ways to improve efficiency. I probably can’t make that many API calls in a reasonable time anyway…