GPT-3 to tokenise multi-word expressions

I am currently exploring whether GPT-3 can effectively tokenise a text into words and multi-word expressions. For example, I would like it to tokenise the sentence “By the way, I am Harry Potter” as [“By the way”, “I”, “am”, “Harry Potter”]. That is, individual standalone words remain independent, but words that form a unit, like a name or a set phrase, stay together.

I am going to read through this to see if it provides any such methods: OpenAI GPT2 — transformers 4.11.3 documentation.

Does anyone expect (I will try this now) that simply giving GPT-3 a few examples of what I want it to do, as demonstrated in the documentation, will be enough for it to tokenise effectively in this way? If not, can anyone recommend a related strategy?
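In case it helps anyone reading later, here is a minimal sketch of the few-shot idea using the Completion endpoint of the 2021-era OpenAI Python client; the prompt wording, the example sentences, and the model choice are my own assumptions, not an official recipe:

```python
import openai  # the 2021-era client, where completions use openai.Completion

openai.api_key = "sk-..."  # your API key

# Show GPT-3 a worked example of the desired chunking, then ask for another.
prompt = """Segment each sentence into words and multi-word expressions.

Sentence: Nice to meet you, my name is Ada Lovelace
Chunks: ["Nice to meet you", "my", "name", "is", "Ada Lovelace"]

Sentence: By the way, I am Harry Potter
Chunks:"""

resp = openai.Completion.create(
    engine="davinci",   # base GPT-3 model of that period
    prompt=prompt,
    max_tokens=64,
    temperature=0,      # deterministic output suits a parsing task
    stop="\n",
)
print(resp.choices[0].text.strip())
# hoped-for output: ["By the way", "I", "am", "Harry Potter"]
```

More worked examples in the prompt generally make the pattern more reliable.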

Thank you very much.


Ohhh I like your solution!!

Lmao, I was just going to say that N-grams sound like what Julius wanted. Oh well, the GPT-3 prompt solution sounds much more elegant!!

Word2Vec, I think, also has bigrams / trigrams which were derived from the training data (the word2phrase step that merges frequently co-occurring pairs).
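If anyone wants to try that route, gensim’s Phrases implements this kind of data-driven phrase detection; this is just a toy sketch, and the thresholds are made-up numbers you would tune on a real corpus:

```python
from gensim.models.phrases import Phrases

# Toy corpus; real phrase detection needs many sentences for stable counts.
sentences = [
    ["by", "the", "way", "i", "am", "harry", "potter"],
    ["harry", "potter", "is", "a", "wizard"],
    ["by", "the", "way", "harry", "potter", "left"],
]

# Low min_count/threshold so the toy corpus produces any merges at all.
bigrams = Phrases(sentences, min_count=1, threshold=0.1)
print(bigrams[["by", "the", "way", "i", "am", "harry", "potter"]])
# e.g. ['by_the', 'way', 'i', 'am', 'harry_potter'], depending on the scores
```

Running Phrases a second time over the bigrammed corpus is the usual way to get trigrams.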

You can try brute-forcing ConceptNet, i.e. you check if “I” is in ConceptNet. If yes, try “I am”. If that works too, try “I am Harry”. Say that one fails; then “I am” becomes a token, and you start again from the next word. In other words, always keep the longest span ConceptNet still recognises. And so on.
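A minimal sketch of that greedy longest-match loop, assuming the public ConceptNet web API at api.conceptnet.io (which is rate-limited, so one HTTP call per check makes this slow and for illustration only):

```python
import requests

def in_conceptnet(phrase: str) -> bool:
    """Return True if ConceptNet has a node for this phrase (one HTTP call)."""
    uri = "/c/en/" + phrase.lower().replace(" ", "_")
    data = requests.get("http://api.conceptnet.io" + uri).json()
    return bool(data.get("edges"))  # known concepts come back with edges

def greedy_chunk(words):
    """Grow each span while ConceptNet still knows it; emit the longest match."""
    chunks, i = [], 0
    while i < len(words):
        j = i + 1
        # extend the span word by word until ConceptNet stops recognising it
        while j < len(words) and in_conceptnet(" ".join(words[i:j + 1])):
            j += 1
        chunks.append(" ".join(words[i:j]))  # single words fall through as-is
        i = j
    return chunks

print(greedy_chunk("By the way I am Harry Potter".split()))
# hoped-for output: ['By the way', 'I', 'am', 'Harry Potter']
```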

But yeah, the GPT-3 solution sounds much more elegant!!


I made videos on both of these topics in 2020.

Tokenization:

N-grams:

Really fun thing to explore


Thanks very much, I appreciate your help.

I think I can only send about 1,500 words at a time in a prompt (the token limit is shared between the prompt and the completion). This is a simple question, but do you just break larger amounts of text into chunks of that size and send each one as its own prompt?
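For what it’s worth, this is the naive version of that splitting strategy; the 1,500-word budget is taken from the question above, and a real splitter should break on sentence or paragraph boundaries so no multi-word expression straddles two prompts:

```python
def split_into_chunks(text: str, max_words: int = 1500):
    """Yield successive ~max_words pieces of text, splitting on whitespace."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

for chunk in split_into_chunks(open("big_file.txt").read()):
    ...  # send each chunk as its own prompt and merge the results
```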

Is the API rate-limited? I think I read that you need approval before your application goes live. Does calling the API in a burst of, say, 20 requests already count as a “live application”, or at what level of activity do they require approval?

Thanks very much.