Getting 400 response with already working code

We were using the same embedding API for a week and it was working, with no change in input. All of a sudden we are getting a 400 on the same API and input. We have also tried updating libraries.

This is the error:

/server/node_modules/openai/src/error.ts:66
0|server | return new BadRequestError(status, error, message, headers);
0|server | ^
0|server | Error: 400 '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.
0|server | at Function.generate (/server/node_modules/openai/src/error.ts:66:14)
0|server | at OpenAI.makeStatusError (/server/node_modules/openai/src/core.ts:358:21)
0|server | at OpenAI.makeRequest (/server/node_modules/openai/src/core.ts:416:24)
0|server | at processTicksAndRejections (node:internal/process/task_queues:95:5)
0|server | server/node_modules/langchain/dist/embeddings/openai.cjs:223:29
0|server | at RetryOperation._fn (/server/node_modules/p-retry/index.js:50:12)

I was getting the same error with the Python client. You need to strip new lines from the beginning and end of the text in the documents. Hope it helps.
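As a minimal sketch of that stripping step (the function and variable names are illustrative, not from the thread):

```python
def strip_edges(texts):
    # Remove leading/trailing newlines and whitespace from each document
    return [t.strip() for t in texts]

docs = ["\n\nfirst document\n", "second document"]
print(strip_edges(docs))  # → ['first document', 'second document']
```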

I solved it by running this code on all the texts I passed:

import json

def sanitize_for_json(text):
    # Escape the text as a JSON string literal (adds surrounding quotes)
    return json.dumps(text)

I am using Chroma; this solution was provided via Discord by the fantastic mod taz.


I realized that my error was caused by a None text, so you can also use this:

def sanitize(text):
    if not text:
        return " "
    return text

I was getting the same error previously:

BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

I have tried the approaches mentioned in the thread but they don't work for me. I have openai-1.12.0 installed.

Code:

import openai

client = openai.OpenAI()

input_list = [x.replace("\n", " ") for x in input_list]

txt_embeddings = client.embeddings.create(
    input=input_list,
    model="text-embedding-3-large",
    dimensions=512,
).data

embeddings_list = []
for x in txt_embeddings:
    embeddings_list.append(x.embedding)

Interestingly, the code works if I iterate through the list and pass one value at a time to client.embeddings.create.
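If single calls succeed but the batch fails, a local pre-flight check can narrow down which element the API is rejecting (my own sketch, not an official diagnostic; the function name is made up):

```python
def find_suspect_items(items):
    # Flag indices whose values commonly trigger the '$.input' 400:
    # non-strings (including None) and empty/whitespace-only strings
    return [i for i, x in enumerate(items)
            if not isinstance(x, str) or not x.strip()]

sample = ["fine", "", None, "  ", "also fine"]
print(find_suspect_items(sample))  # → [1, 2, 3]
```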

There is no “input_list” being defined in this snippet.

I gave you one, and improved it:

import openai
client = openai.Client()

with open("mytext.txt", "r", encoding="utf-8") as file:
    text_string = file.read()  # read file, split into paragraph chunks, index
stripped_chunks = [part.strip() for part in text_string.split("\n\n") if part.strip()]
input_list = [f"[{index + 1}] {part}" for index, part in enumerate(stripped_chunks)]

input_list = [x.replace("\n", " ").replace('  ', ' ') for x in input_list]

txt_embeddings = client.embeddings.create(
    model="text-embedding-3-large",
    #  input=["a good bot", "accepts lists"],
    input=input_list,
    dimensions=512,
    encoding_format="float",
)
embeddings_list = []
for x in txt_embeddings.data:
    embeddings_list.append(x.embedding)
    print(x.embedding[:4])
print(txt_embeddings.usage.model_dump())

Counter to the old advice above, this also works:


    input=["\na good bot\n", "\n\n accepts lists\n\n"],

Hi, thanks for your answer.
The input list I have is quite large, so I cannot supply it here. It's a list of strings that are inputs in a search box of an app from different users.

I'm not sure which fix you are suggesting above, but I tried everything below and none of it works for me; same error:

  1. input_list = [x.replace("\n", " ") for x in input_list]
  2. input_list = [x.replace("\n", " ").replace("  ", " ") for x in input_list]
  3. input_list = [x.replace("\n", " ").replace("  ", " ").strip() for x in input_list]
  4. input_list = [x.replace("\n", " ").replace("  ", " ").strip() for x in input_list if x]

As mentioned earlier, the code works if I iterate through the input list and call the API for one string at a time.

With the new embeddings model, there really isn’t a need to strip out linefeeds or separators. If an AI could understand the message, the embeddings AI can understand the message. The paragraphs and formatting might ensure even higher understanding.

Comment out the input_list modification: does it work?

Don’t output to the same list variable as your input, does it work?

I commented out:
input_list = [x.replace("\n", " ").replace("  ", " ").strip() for x in input_list if x]

and changed input_list to input_list_1

input_list_1 = [x.replace("\n", " ").replace("  ", " ").strip() for x in input_list if x]

Neither approach worked; same error.

There is some value in the list that’s erroring, but it should error even when I do it element by element :man_shrugging:

Quite peculiar. Send just two items automated: error?
Put your text into the commented list strings I included in the code and comment out the other input: error?

Let’s imagine that you are doing chunking in the worst way possible, like splitting Unicode by byte lengths. And then your text is something like OpenAI documentation that has special tokens in the text itself. Let’s try to fix that:

import re  # add to imports

# function: ensure valid characters in input_list
def ensure_utf8(strings):
    cleaned_strings = []
    for string in strings:
        # Decode using UTF-8, replacing invalid bytes
        clean_string = string.encode('utf-8', 'replace').decode('utf-8', 'replace')
        # Remove substrings enclosed in <| and |> - AI special tokens
        clean_string = re.sub(r'<\|.*?\|>', '', clean_string)
        # Optionally, take out ALL the extra space runs, like code indenting
        clean_string = ' '.join(clean_string.split())
        cleaned_strings.append(clean_string)
    return cleaned_strings

# add this where your input_list "enters" the code
input_list = ensure_utf8(input_list)

# The existing replace if you still want it
input_list = [x.replace("\n", " ").replace('  ', ' ') for x in input_list]

I also found that 2048 items is the max. 100000 tokens was accepted but I didn’t run it up higher, which might be a model limit of 128000. So you can also do some token-counting on the input.
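A rough way to do that token-counting and batching locally, without extra dependencies (a heuristic sketch — for exact counts you'd use a real tokenizer such as tiktoken; the ~4 characters per token ratio and the budget values are assumptions):

```python
def estimate_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text
    return max(1, len(text) // 4)

def batch_by_budget(items, max_items=2048, max_tokens=100_000):
    # Split items into batches that respect both the per-request item cap
    # (2048, per the observation above) and an approximate token budget
    batch, batch_tokens = [], 0
    for item in items:
        t = estimate_tokens(item)
        if batch and (len(batch) >= max_items or batch_tokens + t > max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(item)
        batch_tokens += t
    if batch:
        yield batch
```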

I tried your function, same error. Strings in my list are at most 3000 characters.

Is 2048 the number of strings, or characters per string?

Total tokens in my list are ~ 43,000

# function: ensure valid characters in input_list
def ensure_utf8(strings):
    cleaned_strings = []
    for string in strings:
        # Decode using UTF-8, replacing invalid bytes
        clean_string = string.encode('utf-8', 'replace').decode('utf-8', 'replace')
        # Remove substrings enclosed in <| and |> - AI special tokens
        clean_string = re.sub(r'<\|.*?\|>', '', clean_string)
        # Optionally, take out ALL the extra space runs, like code indenting
        clean_string = ' '.join(clean_string.split())
        cleaned_strings.append(clean_string)
    cleaned_strings = [x.replace("\n", " ").replace('  ', ' ') for x in cleaned_strings]
    return cleaned_strings

input_list = ensure_utf8(input_list)

Still same error

Max Tokens for a string in my input list: 812
Total tokens across all strings: 202595

Thanks, it's working, but I can't understand how json.dumps() converts the items to strings without a JSON file.

Hi @mittal.sameer ,
Were you able to rectify the error? I am also stuck in a similar situation, i.e. when I pass values one by one it works, but when I pass them in batches it gives the same error.

I suspect the issue is the total token count. As mentioned in the API documentation:

Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002), cannot be an empty string, and any array must be 2048 dimensions or less.

https://platform.openai.com/docs/api-reference/embeddings/create#:~:text=Input%20text%20to,dimensions%20or%20less.

From this, my understanding is that when we pass a list, tokens are counted as the sum of the tokens of each string in the list. If that is the case, you can try batching.

I solved it by batching as follows:

def divide_list_into_batches(lst, batch_size):
    for i in range(0, len(lst), batch_size):
        yield lst[i:i+batch_size]

# input_list is your list of texts

for input_list_tmp in divide_list_into_batches(input_list, 2048):
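For completeness, the loop body would presumably call the embeddings endpoint once per batch, along these lines (my own sketch based on the earlier posts, restating the batching helper so it is self-contained; model, dimensions, and the `embed_in_batches` name are illustrative):

```python
def divide_list_into_batches(lst, batch_size):
    # Yield successive slices of at most batch_size items
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

def embed_in_batches(input_list, model="text-embedding-3-large",
                     dimensions=512, batch_size=2048):
    # batch_size=2048 matches the per-request item cap discussed above.
    # The import is deferred so the batching helper is usable without the SDK.
    from openai import OpenAI
    client = OpenAI()
    embeddings_list = []
    for batch in divide_list_into_batches(input_list, batch_size):
        response = client.embeddings.create(
            model=model, input=batch, dimensions=dimensions
        )
        embeddings_list.extend(item.embedding for item in response.data)
    return embeddings_list
```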