Where is the line between heavy API usage and systematic model extraction?

As API-based foundation models scale, I’ve been thinking about the boundary between normal high-volume usage (benchmarks, evaluation runs, synthetic data generation) and structured querying designed to approximate or distill capabilities.

At what point does heavy usage cross the line into "model extraction," and is that even a distinction that can be enforced technically?

It seems like:

  • Call count alone isn't a meaningful signal

  • Aggregate token volume matters more, but still has benign explanations

  • Structured prompt variation (e.g., systematic sweeps over a fixed template) might be the strongest signal

  • Intent is almost impossible to prove from logs alone

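To make the "structured prompt variation" signal concrete, here's a toy sketch of one way a provider might approximate it: collapse each prompt to a structural template (masking numbers and quoted spans) and measure how often templates repeat. The function names and masking rules here are hypothetical illustrations, not any provider's actual detection method.

```python
from collections import Counter
import re

def template_of(prompt: str) -> str:
    """Collapse numbers and quoted spans so structurally similar
    prompts map to the same template string."""
    t = re.sub(r'"[^"]*"', '"<STR>"', prompt)
    t = re.sub(r"\d+", "<NUM>", t)
    return t

def template_reuse_score(prompts: list[str]) -> float:
    """Toy signal: fraction of queries that reuse an already-seen
    template. High volume combined with low template diversity
    suggests systematic sweeps rather than organic usage."""
    if not prompts:
        return 0.0
    templates = Counter(template_of(p) for p in prompts)
    repeated = sum(count - 1 for count in templates.values())
    return repeated / len(prompts)

# Organic-looking traffic: mostly distinct requests
organic = ["Summarize this article", "Write a haiku about rain",
           "Explain TCP handshakes"]

# Extraction-looking traffic: one template, systematically varied
sweep = [f'Translate "{w}" to French, item {i}'
         for i, w in enumerate(["cat", "dog", "tree", "lake"])]

print(template_reuse_score(organic))  # low (0.0 here)
print(template_reuse_score(sweep))    # high (0.75 here)
```

Even this crude version shows why the boundary is fuzzy: a legitimate benchmark run over a templated eval set scores exactly like a distillation sweep, which is the enforceability problem in miniature.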
I’m curious how people here think about this from both a technical and governance perspective.