NewConnectionError keeps coming up over a .tiktoken file

We’re using ChatGPT to offer code suggestions to users, but we often get this error when a request is sent:

HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff2cc01b90>: Failed to establish a new connection: [Errno -2] Name or service not known'))

If I’m reading this correctly, the OpenAI backend (or maybe the library itself?) is trying to fetch that encoding file from the URL, failing to do so, and the request returns with an error. Is this something on my end (I’m using the tiktoken library to pre-count the tokens, but I don’t see any async methods there), or is it an availability bug in your API?

If the latter, would raising the max_retries parameter of AsyncOpenAI(...) and adding some backoff be enough to solve the issue?


Before first use, the tiktoken library must access the internet to obtain its encoding dictionary. The link is:

https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

The encoding file is then hashed and saved into a temp directory available to the Python environment, under a data-gym-cache subdirectory. The cached encoding file's name is 9b5ad71b2ce5302211f9c61530b329a4922fc6a4.

You can also set an environment variable TIKTOKEN_CACHE_DIR to change the location.

If your environment does not allow tiktoken internet access, none of that will happen.
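For environments without internet access, one workaround is to download the .tiktoken file once on a connected machine and drop it into the cache directory under its expected name. A minimal sketch, assuming tiktoken's convention of naming cached files after the SHA-1 hash of the download URL (the /opt/tiktoken_cache path is just an example):

```python
import hashlib
import os

# tiktoken names cached files after the SHA-1 hash of the download URL.
url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(url.encode()).hexdigest()
print(cache_key)  # matches the cache filename quoted above

# Point tiktoken at a directory you control; set this BEFORE importing tiktoken.
os.environ["TIKTOKEN_CACHE_DIR"] = "/opt/tiktoken_cache"

# Offline setup: place the previously downloaded cl100k_base.tiktoken file at
#   /opt/tiktoken_cache/9b5ad71b2ce5302211f9c61530b329a4922fc6a4
# After that, tiktoken.get_encoding("cl100k_base") should work without
# fetching anything over the network.
```

If the file is missing from the cache directory, tiktoken falls back to downloading it, which is exactly the request that fails in a blocked environment.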


Thanks, this solved the issue.

Might I ask, if this file doesn’t change often, why is it not included with the library?

The general thinking is that a pip wheel distribution should be lightweight, which also matters when storing the source in forkable repos like GitHub. While PyPI will accept files up to a 60MB limit, the various vocab files are approaching 10MB and could conceivably grow.

There is also the possibility they thought the dictionaries could be dynamic, with hashes serving as file names, but we have no hints about the design decision. Even v0.1.0 served the encoder files from an Azure blob.

Same concern, unaddressed: