Hello, I was hoping someone may be able to help me with this question. Thank you in advance!
This video describes a pipeline that Anthropic uses to moderate its model. Essentially, you take the user's query, generate a response, have a moderation model critique that response against a sort of “constitution”, revise the response according to that critique, and finally display the revised response.
I haven’t implemented it with GPT-3 yet, but it seems like it should be robust against the sort of conditioning that trolls (or testers) try to use to get inappropriate or off-topic responses.
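For what it's worth, here is a rough, untested-against-GPT-3 sketch of how those steps might fit together in Python. Everything named here is my own assumption: `complete` is a hypothetical stand-in for whatever text-completion call you use, the `CONSTITUTION` text is a made-up placeholder, and the prompt wording is only illustrative, not what Anthropic actually uses.

```python
from typing import Callable

# Hypothetical placeholder principles, for illustration only.
CONSTITUTION = "Be helpful and on-topic; decline harmful or inappropriate requests."


def moderated_reply(
    query: str,
    complete: Callable[[str], str],  # assumed: takes a prompt, returns generated text
    constitution: str = CONSTITUTION,
) -> str:
    """Draft a response, critique it against the principles, then revise it."""
    # Step 1: generate a first-draft response to the user's query.
    draft = complete(f"User: {query}\nAssistant:")

    # Step 2: ask the model to critique the draft against the principles.
    critique = complete(
        f"Principles: {constitution}\n"
        f"Response: {draft}\n"
        "Critique the response against the principles:"
    )

    # Step 3: ask for a rewrite that addresses the critique.
    revised = complete(
        f"Principles: {constitution}\n"
        f"Response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so it follows the principles:"
    )

    # Step 4: only the revised text is shown to the user.
    return revised.strip()
```

You would pass in your own function for `complete`, for example a thin wrapper around a GPT-3 completion request. Keeping it as a parameter also makes it easy to swap in a separate moderation model for the critique step or a fake function for testing. Note that this costs three model calls per user message.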
Thank you very much for the response and example! It is greatly appreciated.