As API-based foundation models scale, I’ve been thinking about the boundary between normal high-volume usage (benchmarks, evaluation runs, synthetic data generation) and structured querying designed to approximate or distill capabilities.
At what point does usage meaningfully become “model extraction,” and is that even a technically enforceable distinction?
It seems like:
-
Call count alone isn’t meaningful
-
Token volume matters
-
Structured prompt variation might matter
-
Intent is almost impossible to prove
I’m curious how people here think about this from both a technical and governance perspective.