GPT 3.5 API - how to stop AI from admitting it's an AI?

All you are monitoring is the output, so there is no need to look at the input. On the output, your options in order of easiest to hardest:

  1. Regex matching: For example, search for the substring "AI " in the output.
  2. 1-token categorizer: Fine-tune a base model like babbage on Good/Bad outputs and map them to ' 0' or ' 1' ← note the leading space in each. Run at a temperature of 0. Say ' 1' means bad; then that is your signal to drop to davinci.
  3. Embeddings: Embed a bunch of Good/Bad outputs and store them in memory. Compare the new output's embedding against these with a dot product, which is equivalent to cosine similarity if you use text-embedding-ada-002 because it produces unit vectors. Determine whether it is closest to the aggregate of Good outputs or to the aggregate of Bad outputs.
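
As a rough sketch of options 1 and 3 in Python (not a definitive implementation): the phrases in `AI_PATTERN` and the function names are my own placeholders, and the embedding vectors are assumed to have been fetched already (for example from text-embedding-ada-002) and passed in as plain lists of floats. "Aggregate" is taken here to mean the mean vector of each set, which is one reasonable reading.

```python
import re

# Hypothetical phrases; tune these to what your model actually says.
AI_PATTERN = re.compile(
    r"\b(?:as an AI|AI language model|I am an AI|I'm an AI)\b",
    re.IGNORECASE,
)

def regex_flags(text):
    """Option 1: True if the output contains a telltale phrase."""
    return AI_PATTERN.search(text) is not None

def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embedding_flags(vec, good_vectors, bad_vectors):
    """Option 3: True if vec is closer to the Bad aggregate than the Good one.

    The dot product with a centroid equals the average dot product with
    every vector in that set, so for unit vectors this compares the mean
    cosine similarity to each group.
    """
    return dot(vec, centroid(bad_vectors)) > dot(vec, centroid(good_vectors))
```

In practice you would compute the two centroids once and keep them in memory rather than recomputing them per call.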

You can run all three in parallel and combine them with a weighted average, or an at-least-2-out-of-3 voting scheme, to determine the outcome.
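
A minimal sketch of the two ways of combining the signals, assuming each detector returns either a boolean flag or a score between 0 and 1 (the function names, weights, and threshold are illustrative, not prescribed):

```python
def majority_vote(flags):
    """True if at least 2 of the 3 detectors flagged the output."""
    return sum(bool(f) for f in flags) >= 2

def weighted_score(scores, weights):
    """Weighted average of per-detector scores (each in 0..1)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def is_bad(scores, weights, threshold=0.5):
    """Flag the output when the weighted score reaches the threshold."""
    return weighted_score(scores, weights) >= threshold
```

For example, `majority_vote([regex_hit, classifier_hit, embedding_hit])` gives the voting version, while `is_bad` lets you trust one detector more than the others by raising its weight.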

Each of the three has pros and cons, but the composite integrates up nicely to a good signal.

HTH
