I am currently exploring whether GPT-3 can effectively tokenise a text into words and multi-word expressions. For example, I would like it to tokenise the sentence “By the way, I am Harry Potter” as [“By the way”, “I”, “am”, “Harry Potter”]. That is, individual standalone words remain independent, while words that form a unit, such as a name or a set phrase, are kept together.
I am going to read through this to see if it provides any such methods: OpenAI GPT2 — transformers 4.11.3 documentation.
Does anyone expect (I will try this now) that simply giving GPT-3 a few examples of what I want, as demonstrated in the documentation, will be enough for it to tokenise effectively in this way? If not, can anyone recommend an alternative strategy?
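For concreteness, here is the kind of few-shot call I have in mind, sketched with the legacy `openai` Python client (the engine name, the example sentences, and the API key placeholder are my own choices, not anything from the docs):

```python
# Sketch of the few-shot idea with the legacy `openai` client (pre-1.0).
# Engine name and example sentences are my own guesses.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = """Split each sentence into words and multi-word expressions.

Sentence: By the way, I am Harry Potter
Tokens: ["By the way", "I", "am", "Harry Potter"]

Sentence: She lives in New York City
Tokens: ["She", "lives", "in", "New York City"]

Sentence: He gave up on the project
Tokens: ["He", "gave up", "on", "the", "project"]

Sentence: I met Sherlock Holmes at Baker Street
Tokens:"""

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=64,
    temperature=0.0,  # keep the output deterministic enough to parse
    stop="\n",        # stop after the completed token list
)
print(response.choices[0].text.strip())
```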
Thank you very much.
Ohhh I like your solution!!
Lmao I was just gonna say that N-grams sound like what Julius wanted. Oh well, the GPT-3 prompt solution sounds much more elegant!!
Word2Vec I think also has trigrams / bigrams which were derived from the training data.
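Gensim’s `Phrases` model does exactly this kind of statistical collocation detection; a rough sketch (toy corpus and thresholds made up here):

```python
# Minimal sketch of statistical phrase detection with gensim's Phrases.
# The toy corpus is made up; real collocation detection needs far more
# data, and a second Phrases pass over the bigrammed corpus is what
# gets you trigrams.
from gensim.models.phrases import Phrases

sentences = [
    ["i", "am", "harry", "potter"],
    ["harry", "potter", "is", "a", "wizard"],
    ["by", "the", "way", "harry", "potter", "left"],
]

# Thresholds lowered so the toy corpus actually yields phrases.
bigram = Phrases(sentences, min_count=1, threshold=0.1)
print(bigram[["i", "am", "harry", "potter"]])
# e.g. ['i', 'am', 'harry_potter'] once the pair scores above threshold
```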
You can also try brute-forcing ConceptNet, i.e. check whether “I” is in ConceptNet. If yes, try “I am”. If yes, try “I am Harry”. Suppose that one fails: then “I am” is the longest match, so emit it as a token and continue from “Harry”. And so on.
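Something like this in Python (the endpoint format is ConceptNet’s real REST API; treating “has any edges” as “exists in ConceptNet” is my own assumption):

```python
# Rough sketch of the greedy longest-match idea against ConceptNet's
# public REST API at api.conceptnet.io.
import requests

def in_conceptnet(phrase):
    term = phrase.lower().replace(" ", "_")
    data = requests.get(f"http://api.conceptnet.io/c/en/{term}").json()
    return len(data.get("edges", [])) > 0  # cache/throttle this in practice

def greedy_chunk(words):
    tokens, i = [], 0
    while i < len(words):
        j = i + 1  # a single word is always accepted as a fallback token
        # Keep extending the span while ConceptNet still recognises it.
        while j < len(words) and in_conceptnet(" ".join(words[i:j + 1])):
            j += 1
        tokens.append(" ".join(words[i:j]))
        i = j
    return tokens

print(greedy_chunk("By the way I am Harry Potter".split()))
```

One catch: the step-by-step extension stalls if an intermediate prefix like “By the” isn’t itself in ConceptNet even though “by the way” is, so trying spans longest-first at each position would be more robust.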
But yeah, the GPT-3 solution sounds much more elegant!!
I made videos on both of these topics in 2020.
Tokenization:
N-grams:
Really fun thing to explore
Thanks very much, I appreciate your help.
I think I can only send about 1,500 words at a time as a prompt (the actual limit is around 2,048 tokens, shared between prompt and completion, so roughly that many words). This is a simple question, but do you just break larger texts into chunks of that size and send each one as its own prompt?
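Concretely, I imagine something like this naive chunker (a sketch; the filename is made up, and the word budget is only a rough proxy for the token limit):

```python
# Naive chunker: split a long text into pieces of at most max_words
# words. GPT-3's real limit is ~2048 BPE tokens shared between prompt
# and completion, so a word count is only an approximation.
def chunk_text(text, max_words=1500):
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Hypothetical usage: send each chunk as its own prompt.
for chunk in chunk_text(open("big_document.txt").read()):
    ...
```

One caveat I can see: a hard split can cut a multi-word expression in half, so splitting on sentence boundaries is probably safer.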
Also, is the API rate-limited? I think I read that you need approval before your application goes live. Does calling the API in a burst of, say, 20 requests count as a “live application”, or where is the cut-off in activity before approval is required?
Thanks very much.