# Compute the probability of input text for classification

To clarify, I’m not asking about finetuning vs zero shot. Let me know if this reply above was unclear. I’m asking about two different approaches to classification using a language model:

1. completion: sample from the distribution Pr( · | `inputted prompt`) and return that sampled output as the predicted class.
2. estimation: for each `inputted completion` (a new required input) in the label set, compute Pr(`inputted completion` | `inputted prompt`). Then return the `inputted completion` with the highest probability, perhaps after making an adjustment for the number of tokens in the `inputted completion` (as done in perplexity calculations).
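To make approach 2 concrete, here’s a sketch. It isn’t runnable against any current endpoint; `completion_logprob` is a hypothetical stand-in for a function that returns log Pr(`inputted completion` | `inputted prompt`), and the prompt, labels, and probabilities below are all made up:

```python
import math

def classify_by_estimation(prompt: str,
                           labels: list[str],
                           completion_logprob) -> str:
    """Return the label with the highest log Pr(label | prompt).

    `completion_logprob` is a hypothetical callable standing in for
    the language model's log Pr(completion | prompt).
    """
    scores = {label: completion_logprob(prompt, label) for label in labels}
    return max(scores, key=scores.get)

# toy stand-in for a real language model, for illustration only
fake_logprobs = {'positive': math.log(0.7), 'negative': math.log(0.2)}
classify_by_estimation('This movie was great. Sentiment:',
                       ['positive', 'negative'],
                       lambda prompt, label: fake_logprobs[label])
# → 'positive'
```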

Finetuning GPT-3 using the exact same dataset and loss helps both approaches. And both approaches can be used in a zero shot manner. The difference is only in how classification is performed. What do you think of the estimation approach? I discussed advantages in the linked reply above.

I guess I am a bit confused here. The core model is using a neural network. I don’t see any distributions in the model. From a high level, it is implementing this paper … am I wrong?

GPT-3 is an autoregressive language model. It models the probability of a token given previous tokens using a transformer neural network. See equation 1 in the first GPT paper [1]. In a completion endpoint response, `token_logprobs` is GPT-3’s estimated log Pr(`token n+1` | `tokens 1:n`). And completion works by sampling from this probability distribution. That’s why completion is non-deterministic.
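To illustrate the sampling step with a made-up three-token vocabulary and made-up probabilities (nothing like GPT-3’s actual distribution):

```python
import math
import random

# a made-up next-token distribution -- not GPT-3's actual vocabulary or probabilities
next_token_logprobs = {' are': math.log(0.6), ' is': math.log(0.3), ' was': math.log(0.1)}

tokens = list(next_token_logprobs)
probs = [math.exp(lp) for lp in next_token_logprobs.values()]

# completion samples from Pr( · | tokens 1:n), so repeated calls can differ
samples = random.choices(tokens, weights=probs, k=5)
```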

From a high level, it is implementing this paper … am I wrong?

Yes, the paper you’ve linked is about the generic transformer architecture for neural networks. This architecture can be applied to any type of task. It’s present in GPT, BERT, CLIP (images), and more.

[1] Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018).

OK. I see. Maybe one of the OpenAI engineers can chime in.

Ok sounds good. Thank you for sharing your experiences and general advice on classification w/ GPT-3


Quick follow-up: I asked a more precise version of the question in a different forum here.


It looks like you are digging deep into the algorithms and want validation on your ideas before implementing them.

One question I have is, why not just implement it both ways, and see which way performs better?

Yup would love to do that, but no endpoint currently lets one compute Pr(`input token` | `other inputted tokens`)

`logprobs` in the completion endpoint only gives Pr(`output token` | `inputted tokens`)

I was actually thinking of you doing your own controlled experiment.

You code it both ways, and so you are in control.

In my experience, this is really the only way to see what’s what, especially in your case where you have a hypothesis (theory) and therefore need to test it. Basically the scientific method!

Now if you can’t code out a small controlled experiment, then well, that’s another thing. Then try to isolate the problem even more and solve that.

Totally agree. But the GPT-3 model weights are not public. There’s just no way to compute what’s needed to run the experiment

Right, so create a small version of your own, with your own weights. And run it both ways to get insight. Don’t use GPT-3.


I ran zero-shot sampling/completion w/ GPT-3 curie (the second largest GPT-3) and got 3% accuracy on a very difficult classification task. I then ran the proposed method zero-shot w/ an open-source GPT-2 (technically, a GPT-2 which is half the size of the main GPT-2), and got 14% accuracy on the same task.

The experiment isn’t controlled b/c the models are different, but GPT-3 curie is purportedly much more capable than GPT-2. So this result makes it look like the proposed method is much better. But I’m certain that there’s no good reason to extrapolate the result all the way to GPT-3 text-davinci-003, which is additionally trained w/ humans in the loop. Sampling from text-davinci-003 is 60% accurate. So my only real question is how well the proposed method works on text-davinci-003 and davinci. Maybe it’s still 60% accurate, maybe it’s 65% accurate, who knows.

There are definitely a lot of variables floating around here.

When I hear “Zero Shot” and “Difficult Classification Task”, I immediately think of training one of the base GPT-3 models to immensely improve the classification before going too much further. Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow? I know the models are black boxes, but you can still evaluate the correctness in the limited output data.

Also I am a bit confused, it seems like you are wanting to alter the internals to get a better answer. How are you evaluating your new alternative on davinci without having access to the internals?

I only evaluated the proposed method on GPT-2 b/c it’s open-source. Next, I’d like to evaluate the alternative on davinci and text-davinci-003.

Do you think a good fine-tune on davinci would get it from 60% to 90%? Can you test this somehow?

I’ll finetune davinci eventually, and I think it will significantly help both methods. But I’d prioritize comparing the proposed method vs sampling/completion in the zero-shot regime b/c:

1. It’s way less work and money.
2. Zero-shot classification is a big and bold benefit of large language models. If the proposed method consistently outperforms sampling on zero-shot classification, that’s an important result.
3. While both methods should be assessed after finetuning (in addition to before), the impact of finetuning is not necessarily relevant to the question: what’s the best way to frame classification problems when we have a big, capable LM like GPT-3? Is it better to autoregressively sample from it, or to just do Bayes-optimal classification? Maybe the answer depends on whether training data is available, but right now I don’t see why that’d be.

For classification, what I do is have one output token and select a temperature of 0. For a highly fine-tuned base model, I assume this is close to Bayes-optimal classification, at least in terms of what the network understands. And you can get good results with the lower-end ada and babbage.

With the higher-end curie and davinci, you can do the same, but it is my belief that they can achieve the same performance as the lower models with less data.

As for autoregressive sampling (or extrapolation), I would be wary of it for classification, but I’m probably not seeing exactly why extrapolating is a useful classification technique, so feel free to enlighten me.


Yup, to reiterate your points: the closest thing to Bayes-optimal classification using the completion endpoint is to:

1. Transform or point to each class using a single token
2. Set `max_tokens=1`
3. Set `temperature=0`
4. Set `logit_bias` = {class token id: log Pr(class)}, where Pr(class) is estimated from training data (or guessed!)
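Step 4 might look like this, with made-up token ids and a guessed prior (as I understand it, the API accepts `logit_bias` values between -100 and 100, which log-priors comfortably fit within):

```python
import math

# hypothetical single-token ids for two classes (made-up ids), and class
# priors estimated from training data (or guessed)
class_token_ids = {'positive': 3967, 'negative': 4633}
class_priors    = {'positive': 0.8,  'negative': 0.2}

# bias each class token by its log-prior; this dict would be passed to the
# completion endpoint alongside max_tokens=1 and temperature=0
logit_bias = {str(class_token_ids[c]): math.log(p)
              for c, p in class_priors.items()}
```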

The problems with transforming a class to a single token are that:

1. The transformation is not always a trivial prompt engineering task when the classes are meaningful phrases, or when there are a lot of classes.
2. Even if it is trivial, the completion still is not guaranteed to be one of the single tokens used to represent classes. This forces the user to study degenerate completions and then implement ways to post-process them.
3. If the transformation doesn’t include the class’ original name, then useful semantics in the class name would be unexploited by GPT-3.

I just see sampling as an unnecessary workaround. There’s a potentially simpler approach which should be evaluated.

1. The transformation is not always a trivial prompt engineering task when the classes are meaningful phrases, or when there are a lot of classes.

I would avoid lots of classes coming out of one classifier, mainly because I want to maximize SNR. If you need lots of classes, create more classifiers and have each classifier handle a smaller set of classes.

Now when the classes are meaningful phrases? I would avoid that too, maybe I’m not seeing the benefit of this. You could always map the single token class back to meaningful phrases through a lookup, either straight lookup of some sort, or a correlated lookup like an embedding.
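The straight-lookup version is just a dict from the single-token class back to its meaningful phrase (these labels are made up for illustration):

```python
# single output token -> meaningful phrase, via a straight lookup
token_to_phrase = {
    '0': 'billing question',
    '1': 'technical support request',
    '2': 'feature request',
}

def resolve_class(token: str) -> str:
    # strip whitespace so completions like ' 1' still resolve
    return token_to_phrase[token.strip()]

resolve_class(' 1')
# → 'technical support request'
```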

1. Even if it is trivial, the completion still is not guaranteed to be one of the single tokens used to represent classes. This forces the user to study degenerate completions and then implement ways to post-process them.

You can use the `token_logprobs` to at least see how close the classification was to your token, and you can backoff on any action if it’s not close enough.

As for degenerate completions, you will always have to code the corner cases coming out of these. A simple example: if your classifier expects ‘0’ or ‘1’ in the output, the fine-tuned GPT-3 model can output ’ 0’, ’ zero’, etc., so you alias these back to ‘0’. You can even seed it with entity extraction values from the original input (see below for running multiple models in parallel).
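A sketch of that aliasing for a ‘0’/‘1’ classifier (the alias table is illustrative, not exhaustive):

```python
# map degenerate completions back to the expected class tokens
ALIASES = {
    '0': '0', ' 0': '0', 'zero': '0', ' zero': '0',
    '1': '1', ' 1': '1', 'one': '1', ' one': '1',
}

def normalize_completion(raw: str) -> str:
    """Alias a raw completion back to '0' or '1', or flag it as degenerate."""
    key = raw.lower()
    if key in ALIASES:
        return ALIASES[key]
    # fall back to a whitespace-stripped lookup before giving up
    return ALIASES.get(key.strip(), 'degenerate')

[normalize_completion(s) for s in ['0', ' 0', ' Zero', 'huh?']]
# → ['0', '0', '0', 'degenerate']
```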

In the case of bad classifications, then this is where multiple models come in. You run a variety of diverse models on the same input, and you make a decision based on the entirety of the output. These models can even be non-AI based, such as RegEx correlators. You just need an algorithm on the back end to fuse this information into a final result.
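A minimal sketch of that fusion, with a RegEx correlator standing in for one of the non-AI models (the labels, pattern, and the other models’ votes are all made up):

```python
import re
from collections import Counter

def regex_classifier(text: str) -> str:
    # a non-AI model: a crude RegEx correlator, for illustration only
    return 'refund' if re.search(r'\b(refund|money back)\b', text, re.I) else 'other'

def fuse(votes: list[str]) -> str:
    # back-end fusion algorithm: majority vote across the diverse models
    return Counter(votes).most_common(1)[0][0]

text = 'I want my money back!'
# regex model plus two hypothetical models' votes on the same input
votes = [regex_classifier(text), 'refund', 'other']
fuse(votes)
# → 'refund'
```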

1. If the transformation doesn’t include the class’ original name, then useful semantics in the class name would be unexploited by GPT-3.

I’d need an example of this one. But like I mentioned earlier, useful semantics from the classification could be restored by lookups (vector or direct) and seeded with entity extraction or other classifiers … all in the background, AI and non-AI running in parallel on the incoming data.

I just see sampling as an unnecessary workaround. There’s a potentially simpler approach which should be evaluated.

Yes, there are simpler approaches! And these are what I would use in the background in parallel. Then integrate the responses (via direct code, or AI, or both) into the final answer.


Hi @chicxulub. I’m also in need of an estimation capability from the GPT3 series. Have you figured out a means of reliably computing P(completion | prefix), for a user-specified completion and prefix?

Ah, I forgot to update this community! Yes, you now (I think as of at least a month ago) can set `max_tokens=0, logprobs=1, echo=True` and get the log-probabilities for each token in the input.

Here’s a minimal implementation in Python:

```python
import math
import os

import openai
import tiktoken

openai.api_key = os.getenv('OPENAI_API_KEY')
model = 'text-davinci-003'

prefix     = 'hey how'
completion = ' are ya'

response = openai.Completion.create(model=model,
                                    prompt=prefix + completion,
                                    max_tokens=0,
                                    logprobs=1,
                                    echo=True)

token_logprobs = response['choices'][0]['logprobs']['token_logprobs']

# post-process to get what we want
tokenizer = tiktoken.encoding_for_model(model)
num_completion_tokens = len(tokenizer.encode(completion))
# apply the probability chain rule:
# log Pr(are ya | hey how) = log Pr(are | hey how) + log Pr(ya | hey how are)
logprob_completion_given_prefix = sum(token_logprobs[-num_completion_tokens:])
prob_completion_given_prefix = math.exp(logprob_completion_given_prefix)
prob_completion_given_prefix
# avoid plugging this into other calculations, as it may underflow
```

(For fun) I’m working on a project which uses this functionality to do zero-shot text classification. Here’s the repo. An important difference is that I actually take a mean instead of a `sum`, since longer completions may trivially result in lower probabilities. I don’t want that for classification.
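A toy illustration of why the mean matters, with made-up per-token log-probabilities:

```python
# hypothetical per-token log-probs for two candidate completions
short_completion_logprobs = [-1.0, -1.2]              # 2 tokens
long_completion_logprobs  = [-0.9, -1.0, -0.8, -1.1]  # 4 tokens

# summing favors the shorter completion simply because it has fewer tokens
sum_short = sum(short_completion_logprobs)   # -2.2
sum_long  = sum(long_completion_logprobs)    # -3.8

# averaging removes that length penalty, and here flips the decision
mean_short = sum_short / len(short_completion_logprobs)   # -1.1
mean_long  = sum_long / len(long_completion_logprobs)     # -0.95
```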