[gpt-3.5-turbo-instruct] Python 'AssertionError: Length mismatch' for certain surprisal calls via API

Hi, for certain inputs, when I try to get surprisal values from gpt-3.5-turbo-instruct using Python's surprisal package (specifically surprisal.OpenAIModel()), I get the following error:

in surprise
    len(tokens) == len(tokenized[b]) + use_bos_token
AssertionError: Length mismatch in tokenization by GPT2 tokenizer `Encoding(num_tokens=

The input that I am trying to run is: “What it is not free to do is to covertly manipulate messages that are purportedly being created by the independent creative communities”

Am I missing something? Any input would be highly appreciated!

Hi and welcome to the Developer Forum!

The problem is that you are using the GPT-2 tokenizer with a GPT-3.5 model, which gives the wrong number of tokens for that message. Please try the cl100k_base tokenizer instead of GPT-2.
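If you want to confirm the mismatch for a given input, here is a minimal sketch that compares token counts under the two encodings. It assumes the tiktoken package is installed and can download the encodings on first use. The helper names `token_counts` and `compare_gpt2_and_cl100k` are made up for illustration and are not part of the surprisal package. It does not show how to switch the tokenizer inside surprisal itself.

```python
from typing import Callable, Dict, List

# An encoder is any callable that maps a string to a list of token ids.
Encoder = Callable[[str], List[int]]


def token_counts(text: str, encoders: Dict[str, Encoder]) -> Dict[str, int]:
    """Return the number of tokens each named encoder produces for `text`."""
    return {name: len(encode(text)) for name, encode in encoders.items()}


def compare_gpt2_and_cl100k(text: str) -> Dict[str, int]:
    """Compare GPT-2 and cl100k_base token counts for `text` using tiktoken.

    tiktoken is imported lazily so that `token_counts` stays dependency-free.
    """
    import tiktoken  # assumption: installed via `pip install tiktoken`

    encoders = {
        "gpt2": tiktoken.get_encoding("gpt2").encode,
        "cl100k_base": tiktoken.get_encoding("cl100k_base").encode,
    }
    return token_counts(text, encoders)
```

Calling `compare_gpt2_and_cl100k(...)` with the sentence from the question returns one count per encoding. If the two counts differ, that is consistent with the length assertion failing when GPT-2 tokens are aligned against what the API returns.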