Soft Token output limits and worsening performance

I’ve noticed a general trend of poor performance on the pretrained models for anything longer than 750-1k input tokens.
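For reference, here is roughly how I measure the input side (a minimal sketch using tiktoken; the model name and the ~750-1k threshold are just my working examples, not anything official):

```python
# Minimal sketch of how I measure input length before sending a request.
# Assumes the tiktoken package; "gpt-3.5-turbo" is only an example model name.
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

prompt = "...your long input here..."
print(count_tokens(prompt))  # quality seems to degrade once this passes ~750-1k
```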

This includes a:

Decrease in output quality
Decrease in performance on RAG tasks
Poor-quality outputs for anything over 500 tokens in length
Output instability/volatility

This has also been true of the ChatGPT web interface.

The issues seem consistent across GPT-3.5 and GPT-4.

To be blunt, this issue is extremely detrimental. At the same time that the application has become faster and more responsive, there has been a steep drop in output quality.

I have had API-based tasks mistakenly omit up to 75% of the provided information.
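To put a number on that, this is the kind of rough check I mean (a simplified sketch; the fact list and the exact-substring matching are hypothetical illustrations, not my actual test harness):

```python
# Simplified sketch of how I estimate how much of the provided information
# survives into the output. The facts below are hypothetical examples.
def coverage(facts: list[str], output: str) -> float:
    """Fraction of provided facts that appear verbatim in the model output."""
    if not facts:
        return 1.0
    hits = sum(1 for fact in facts if fact.lower() in output.lower())
    return hits / len(facts)

facts = ["order #1234", "ships 2023-09-01", "net 30 terms"]  # hypothetical inputs
model_output = "...response text from the API call..."
print(f"coverage: {coverage(facts, model_output):.0%}")  # I've seen this land near 25%
```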

It is not acceptable to have a 16k/4k model behave like a 2k/1k model.

I’ve tried dozens of prompts, and the only ones that work with any consistency are injection-type prompts, which are not a reliable option.

This may be a dealbreaker. There are ways around this, like chunking the data and writing tests to check for information loss across different prompts, but they cost time and money.
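For anyone weighing that workaround, this is roughly what the chunking side looks like (a sketch, again assuming tiktoken; the 750-token chunk size and model name are assumptions, not recommendations):

```python
# Rough sketch of the chunking workaround: split the input into token-bounded
# chunks and run each one through its own completion call.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 750,
                    model: str = "gpt-3.5-turbo") -> list[str]:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# The per-chunk outputs then get stitched back together and re-checked for
# information loss with something like the coverage() test above.
```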

Model retraining is expensive, and even my colleagues at large-cap companies have limited ability to fine-tune.