GPT 3.5 API - how to stop AI from admitting it's an AI?

curt.kennedy · March 8, 2023, 11:40pm

All you are monitoring is the output, so there is no need to look at the input. On the output, the options in order of easiest to hardest are:

Regex matching: For example, search for the substring "AI " in the output.
1-token categorizer: Train a base model like babbage on Good/Bad outputs and map them to ' 0' or ' 1' (note the leading space in each). Run it at a temperature of 0. Say ' 1' means bad; then a ' 1' is your signal to drop to davinci.
Embeddings: Embed a bunch of Good/Bad outputs and store them in memory. Compare the embedding of the new output against these with a dot product, which is equivalent to cosine similarity if you use text-embedding-ada-002 because it produces unit vectors. Then determine whether it is closest to the aggregate of Good outputs or to the aggregate of Bad outputs.
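
As a rough sketch of the first and third checks (not a definitive implementation): the regex here uses a word-boundary match as a slight variation on the plain "AI " substring search, and "aggregate" is assumed to mean the re-normalized mean (centroid) of each group. The function names and the toy 3-d vectors are hypothetical; in practice the vectors would come from your embedding model.

```python
import math
import re

# Word-boundary match so "AI" is caught but "MAIL" is not.
AI_PATTERN = re.compile(r"\bAI\b")

def regex_flags(output: str) -> bool:
    """Return True when the output mentions "AI" as a standalone word."""
    return AI_PATTERN.search(output) is not None

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(vectors):
    """Mean of the vectors, re-normalized to unit length (assumed 'aggregate')."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embedding_flags(new_vec, good_vecs, bad_vecs) -> bool:
    """Return True when new_vec is closer to the Bad centroid than the Good one."""
    new_vec = normalize(new_vec)
    return dot(new_vec, centroid(bad_vecs)) > dot(new_vec, centroid(good_vecs))

# Toy vectors standing in for real embeddings (hypothetical data).
good = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
bad = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

print(regex_flags("As an AI language model, I cannot..."))
print(embedding_flags([0.2, 0.8, 0.0], good, bad))
```

With unit vectors the dot product is already the cosine similarity, so the explicit normalize() call on the new vector is only a safety net for inputs that are not unit length.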

You can run all three in parallel and combine them with a weighted average or a 2-out-of-3 voting scheme to determine the outcome.
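
Combining the three detectors could look something like this. It is a minimal sketch that assumes each detector yields either a boolean flag or a score in [0, 1], and that the voting rule is a simple 2-of-3 majority; the function names, weights, and the 0.5 threshold are illustrative choices, not tuned values.

```python
def majority_vote(flags):
    """Return True when at least 2 of the 3 detectors flag the output."""
    return sum(flags) >= 2

def weighted_score(scores, weights):
    """Weighted average of per-detector scores, each assumed to be in [0, 1]."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical results from regex, 1-token categorizer, and embeddings.
flags = [True, False, True]
print(majority_vote(flags))

# Same results as scores, with the categorizer weighted twice as heavily.
score = weighted_score([1.0, 0.0, 1.0], [1.0, 2.0, 1.0])
print(score >= 0.5)  # 0.5 is an assumed decision threshold
```

The weighted form lets you trust one detector more than the others, while the vote is easier to reason about when all three are roughly equally reliable.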

Each of the three has pros and cons, but the composite adds up to a good signal.

HTH
