GPT-3 to tokenise multi-word expressions

I am currently exploring whether GPT-3 can effectively tokenise a text into words and multi-word expressions. For example, I would like it to tokenise the sentence “By the way, I am Harry Potter” as [“By the way”, “I”, “am”, “Harry Potter”]. That is, individual standalone words remain independent, but words that form a unit, like a name or a set phrase, stay together.

I am going to read through this to see if it provides any such methods: OpenAI GPT2 — transformers 4.11.3 documentation.

Does anyone expect (I will try this now) that simply giving GPT-3 a few examples of what I want it to do, as demonstrated in the documentation, will be enough for it to tokenise effectively in this way? If not, can anyone recommend a related strategy?
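In case it helps anyone reading later, here is a minimal sketch of the few-shot idea using the Completion endpoint of the 2021-era OpenAI Python client; the prompt wording, the example sentences, and the model choice are my own assumptions, not an official recipe:

```python
import openai  # the 2021-era client, where completions use openai.Completion

openai.api_key = "sk-..."  # your API key

# Show GPT-3 a worked example of the desired chunking, then ask for another.
prompt = """Segment each sentence into words and multi-word expressions.

Sentence: Nice to meet you, my name is Ada Lovelace
Chunks: ["Nice to meet you", "my", "name", "is", "Ada Lovelace"]

Sentence: By the way, I am Harry Potter
Chunks:"""

resp = openai.Completion.create(
    engine="davinci",   # base GPT-3 model of that period
    prompt=prompt,
    max_tokens=64,
    temperature=0,      # deterministic output suits a parsing task
    stop="\n",
)
print(resp.choices[0].text.strip())
# hoped-for output: ["By the way", "I", "am", "Harry Potter"]
```

More worked examples in the prompt generally make the pattern more reliable.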

Thank you very much.


Ohhh I like your solution!!

Lmao, I was just going to say that N-grams sound like what Julius wanted. Oh well, the GPT-3 prompt solution sounds much more elegant!!

Word2Vec, I think, also has bigrams / trigrams which were derived from the training data (the word2phrase step that merges frequently co-occurring pairs).
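If anyone wants to try that route, gensim’s Phrases implements this kind of data-driven phrase detection; this is just a toy sketch, and the thresholds are made-up numbers you would tune on a real corpus:

```python
from gensim.models.phrases import Phrases

# Toy corpus; real phrase detection needs many sentences for stable counts.
sentences = [
    ["by", "the", "way", "i", "am", "harry", "potter"],
    ["harry", "potter", "is", "a", "wizard"],
    ["by", "the", "way", "harry", "potter", "left"],
]

# Low min_count/threshold so the toy corpus produces any merges at all.
bigrams = Phrases(sentences, min_count=1, threshold=0.1)
print(bigrams[["by", "the", "way", "i", "am", "harry", "potter"]])
# e.g. ['by_the', 'way', 'i', 'am', 'harry_potter'], depending on the scores
```

Running Phrases a second time over the bigrammed corpus is the usual way to get trigrams.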

You can try brute-forcing ConceptNet, i.e. you check if “I” is in ConceptNet. If yes, try “I am”. If that works too, try “I am Harry”. Say that one fails; then “I am” becomes a token, and you start again from the next word. In other words, always keep the longest span ConceptNet still recognises. And so on.
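A minimal sketch of that greedy longest-match loop, assuming the public ConceptNet web API at api.conceptnet.io (which is rate-limited, so one HTTP call per check makes this slow and for illustration only):

```python
import requests

def in_conceptnet(phrase: str) -> bool:
    """Return True if ConceptNet has a node for this phrase (one HTTP call)."""
    uri = "/c/en/" + phrase.lower().replace(" ", "_")
    data = requests.get("http://api.conceptnet.io" + uri).json()
    return bool(data.get("edges"))  # known concepts come back with edges

def greedy_chunk(words):
    """Grow each span while ConceptNet still knows it; emit the longest match."""
    chunks, i = [], 0
    while i < len(words):
        j = i + 1
        # extend the span word by word until ConceptNet stops recognising it
        while j < len(words) and in_conceptnet(" ".join(words[i:j + 1])):
            j += 1
        chunks.append(" ".join(words[i:j]))  # single words fall through as-is
        i = j
    return chunks

print(greedy_chunk("By the way I am Harry Potter".split()))
# hoped-for output: ['By the way', 'I', 'am', 'Harry Potter']
```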

But yeah, the GPT-3 solution sounds much more elegant!!


I made videos on both of these topics in 2020.

Tokenization:

N-grams:

Really fun thing to explore


Thanks very much, I appreciate your help.

I think I can only send about 1,500 words at a time in a prompt (the token limit is shared between the prompt and the completion). This is a simple question, but do you just break larger amounts of text into chunks of that size and send each one as its own prompt?
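For what it’s worth, this is the naive version of that splitting strategy; the 1,500-word budget is taken from the question above, and a real splitter should break on sentence or paragraph boundaries so no multi-word expression straddles two prompts:

```python
def split_into_chunks(text: str, max_words: int = 1500):
    """Yield successive ~max_words pieces of text, splitting on whitespace."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

for chunk in split_into_chunks(open("big_file.txt").read()):
    ...  # send each chunk as its own prompt and merge the results
```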

Is the API rate-limited? I think I read that you need approval before your application goes live. Does calling the API in a burst of, say, 20 requests already count as a “live application”, or at what level of activity do they require approval?

Thanks very much.