I’d be cautious interpreting too much from the tokens used. It’s the embedding you end up with that matters.
My tests were using full sentences. Maybe for single words it is better (which could make sense).
I guess the lesson here is always test, and with data pertinent to the application.