Text classification quality degrades after 500 iterations with gpt-3.5-turbo-16k

I'm running text classification by feeding 70k rows into the API, one prompt per row, via a for loop. If I feed in <=500 rows, the results (i.e. F1 score) are good, but if I feed in 1,000 rows, performance drops sharply (much lower F1).
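For context, the loop looks roughly like this. It's a minimal sketch, not my actual code: the prompt, the row data, and the `classify` function are placeholders (the real version makes the `chat.completions` call shown in the comment), stubbed here so it runs without an API key:

```python
import time

PROMPT = "Classify the sentiment of the text as positive or negative."  # placeholder

def classify(text):
    # Placeholder for the real API call, roughly:
    #   client.chat.completions.create(
    #       model="gpt-3.5-turbo-16k",
    #       messages=[{"role": "system", "content": PROMPT},
    #                 {"role": "user", "content": text}])
    # Stubbed so this sketch runs offline.
    return "positive" if "good" in text else "negative"

rows = ["good product", "awful service", "good value", "poor quality"]  # placeholder data
predictions = []
for i, row in enumerate(rows, start=1):
    # A fresh messages list is built on every iteration, so each call
    # is stateless by construction: no chat history is carried between rows.
    predictions.append(classify(row))
    if i % 500 == 0:
        time.sleep(60)  # the 1-minute pause I tried after every 500 rows
```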

I've run several tests: if I take any 500-row slice of the data (e.g. rows 500-1000, or rows 61500-62000) and classify it on its own, performance is fine.
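This is roughly how I'm scoring each slice, a sketch with toy labels just to show the call shape (`slice_f1` is a throwaway helper name, not anything from my pipeline):

```python
from sklearn.metrics import f1_score

def slice_f1(y_true, y_pred, start, end):
    # Score an arbitrary window of rows, e.g. slice_f1(y, p, 500, 1000)
    return f1_score(y_true[start:end], y_pred[start:end], average="macro")

# Toy labels only, to illustrate the comparison
y_true = ["pos", "neg", "pos", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "pos", "neg"]
print(slice_f1(y_true, y_pred, 0, len(y_true)))  # macro F1 over the whole run
```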

I thought the model might be getting lazy, so I paused the for loop for one minute after every 500 rows it classified. That didn't help. I also tried ego-boosting the model in the prompt to encourage it to keep going, with no luck so far.

Has anyone run into this? I'm very perplexed. The model shouldn't be able to get "lazy," because every iteration is a fresh API call (I believe?). It isn't remembering past outputs, since we're not storing or resending prior messages. Just super weird behavior.

Please help! I would greatly appreciate it.