NewConnectionError keeps coming up over a .tiktoken file

We’re using ChatGPT to offer code suggestions to users, but we often get this error when a request is sent:

HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff2cc01b90>: Failed to establish a new connection: [Errno -2] Name or service not known'))

If I’m reading this correctly, the OpenAI backend (or maybe the library itself?) is trying to fetch that encoding file from the URL, failing to do so, and the request returns with an error. Is this something on my end (I’m using the tiktoken library to pre-count the tokens, but I don’t see any async methods there), or is it an availability bug in your API?

If the latter, would raising the max_retries parameter of AsyncOpenAI(...) and adding some backoff be enough to solve the issue?


Before first use, the tiktoken library must access the internet to obtain its encoding dictionary. The link is:

https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

The encoding file is then hashed and saved into a temp directory available to the Python environment, under a data-gym-cache subdirectory. The cached encoding file's name is 9b5ad71b2ce5302211f9c61530b329a4922fc6a4.

You can also set an environment variable TIKTOKEN_CACHE_DIR to change the location.

If your environment does not allow tiktoken internet access, none of that will happen.
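For environments without internet access, one workaround is to download the .tiktoken file once on a connected machine and drop it into the cache directory under its expected name. A minimal sketch, assuming tiktoken's convention of naming cached files after the SHA-1 hash of the download URL (the /opt/tiktoken_cache path is just an example):

```python
import hashlib
import os

# tiktoken names cached files after the SHA-1 hash of the download URL.
url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(url.encode()).hexdigest()
print(cache_key)  # matches the cache filename quoted above

# Point tiktoken at a directory you control; set this BEFORE importing tiktoken.
os.environ["TIKTOKEN_CACHE_DIR"] = "/opt/tiktoken_cache"

# Offline setup: place the previously downloaded cl100k_base.tiktoken file at
#   /opt/tiktoken_cache/9b5ad71b2ce5302211f9c61530b329a4922fc6a4
# After that, tiktoken.get_encoding("cl100k_base") should work without
# fetching anything over the network.
```

If the file is missing from the cache directory, tiktoken falls back to downloading it, which is exactly the request that fails in a blocked environment.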


Thanks, this solved the issue.

Might I ask, if this file doesn’t change often, why is it not included with the library?

The general thinking is that a pip wheel distribution should be lightweight, which also matters when storing the source in forkable repos like GitHub. While PyPI will accept files up to a 60MB limit, the various vocab files are approaching 10MB and could conceivably grow.

There is also the possibility they thought the dictionaries could be dynamic, with hashes serving as file names, but we have no hints about the design decision. Even v0.1.0 served the encoder files from an Azure blob.

Same concern, unaddressed: