New AI classifier for indicating AI-written text!

We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe good classifiers can inform mitigations for false claims that AI-generated text was written by a human: for example, running automated misinformation campaigns, using AI tools for academic dishonesty, and positioning an AI chatbot as a human.

Our classifier is not fully reliable. In our evaluations on a “challenge set” of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as “likely AI-written,” while incorrectly labeling human-written text as AI-written 9% of the time (false positives). Our classifier’s reliability typically improves as the length of the input text increases. Compared to our previously released classifier, this new classifier is significantly more reliable on text from more recent AI systems.
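To get a feel for what those two figures mean in practice, here is a minimal sketch that applies Bayes' rule to them. It assumes the challenge-set rates (26% true positives, 9% false positives) carry over to the texts you care about, which the post does not claim. The prior share of AI-written text is a made-up input, and the function name `posterior_ai_given_flag` is purely illustrative.

```python
def posterior_ai_given_flag(prior_ai: float, tpr: float = 0.26, fpr: float = 0.09) -> float:
    """Estimate P(text is AI-written | classifier says "likely AI-written").

    tpr and fpr default to the challenge-set figures quoted above.
    prior_ai is an assumed share of AI-written text in the pool being checked.
    """
    if not 0.0 <= prior_ai <= 1.0:
        raise ValueError("prior_ai must be between 0 and 1")
    # Total probability that any given text gets flagged.
    flagged = tpr * prior_ai + fpr * (1.0 - prior_ai)
    if flagged == 0.0:
        return 0.0
    # Bayes' rule: share of flagged texts that really are AI-written.
    return tpr * prior_ai / flagged


if __name__ == "__main__":
    # Hypothetical priors, chosen only to show how strongly the answer depends on them.
    for prior in (0.05, 0.25, 0.50):
        print(f"prior={prior:.2f} -> P(AI | flagged)={posterior_ai_given_flag(prior):.3f}")
```

The lower the assumed share of AI-written text, the larger the fraction of flagged texts that are actually human-written, which is one way to read the "not fully reliable" caveat.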



Haven’t read the paper yet, but possibly related


OpenAI also notes some important limitations in the blog post above:


Our classifier has a number of important limitations. It should not be used as a primary decision-making tool, but instead as a complement to other methods of determining the source of a piece of text.

  1. The classifier is very unreliable on short texts (below 1,000 characters). Even longer texts are sometimes incorrectly labeled by the classifier.
  2. Sometimes human-written text will be incorrectly but confidently labeled as AI-written by our classifier.
  3. We recommend using the classifier only for English text. It performs significantly worse in other languages and it is unreliable on code.
  4. Text that is very predictable cannot be reliably identified. For example, it is impossible to predict whether a list of the first 1,000 prime numbers was written by AI or humans, because the correct answer is always the same.
  5. AI-written text can be edited to evade the classifier. Classifiers like ours can be updated and retrained based on successful attacks, but it is unclear whether detection has an advantage in the long-term.
  6. Classifiers based on neural networks are known to be poorly calibrated outside of their training data. For inputs that are very different from text in our training set, the classifier is sometimes extremely confident in a wrong prediction.
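Limitation 1 is the easiest one to guard against yourself. Here is a minimal sketch, assuming you have copied a label from the web interface by hand. The helper name `usable_label` is hypothetical and not part of any OpenAI tooling. Only the 1,000-character threshold comes from the quoted text.

```python
from typing import Optional

# Threshold taken from limitation 1 above: below this the classifier is "very unreliable".
MIN_CHARS = 1000


def usable_label(text: str, label: str) -> Optional[str]:
    """Return the classifier's label only when the input is long enough to interpret.

    Returns None for short inputs, signalling that the label should be ignored.
    Passing this check does not make the label trustworthy; longer texts are
    sometimes mislabeled too, so treat it as one signal among several.
    """
    if len(text) < MIN_CHARS:
        return None
    return label


if __name__ == "__main__":
    print(usable_label("A two-sentence forum reply.", "likely AI-written"))
    print(usable_label("word " * 400, "unclear"))
```

This only covers the length caveat. The other limitations (non-English text, code, highly predictable text, edited output) need human judgement or separate checks.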

Still, it would be great to see this classifier made available to developers via an API. That does not seem to be forthcoming, though, according to the same blog post:


At this time, we are only making the classifier available via open access to a web interface that outputs classes rather than estimated probabilities. We do not plan to release the model weights, but recognize that the web interface still makes it possible for anyone to develop evasion techniques by optimizing against the classifier.