Ok, an idea. Given Anthropic's paper on monosemanticity, where neuron activations get decomposed into interpretable features, I'd posit that detecting generated text becomes more feasible. That said, I suspect each model would need its own interpretability methodology, since every training run reshuffles the weights and activations. Who knows, but if we can get mech-interp systems working well, we might be able to figure this out. Rough sketch of what I mean below.
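
To make the idea concrete: a minimal, hypothetical sketch of probing a model's internal activations to classify text as generated vs. human. It uses raw hidden states from gpt2 as a stand-in for the monosemantic features a sparse autoencoder would give you, and the four labeled samples are made up; nothing here is from the Anthropic paper itself.

```python
# Sketch only: raw hidden states stand in for SAE features, and the
# tiny labeled dataset below is purely illustrative.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def pooled_activations(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pool one layer's hidden states as a crude per-text feature vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Hypothetical labels: 1 = model-generated, 0 = human-written.
samples = [
    ("She grabbed her coat and ran out before the rain started.", 0),
    ("As an AI language model, I can certainly help with that request.", 1),
    ("The committee met Tuesday to argue about the budget again.", 0),
    ("In conclusion, there are many factors to consider in this matter.", 1),
]

X = torch.stack([pooled_activations(t) for t, _ in samples]).numpy()
y = [label for _, label in samples]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))  # sanity check on the (tiny) training set itself
```

The point of the sketch is just the shape of the pipeline: activations in, simple classifier on top. If the monosemanticity work holds up, you'd swap the pooled hidden states for SAE feature activations, which is also why the probe would likely have to be retrained per model, since those features won't transfer across different weights.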