We were using the same embedding API for a week and it was working, with no change in input. All of a sudden we are getting a 400 on the same API and input. We have also tried updating the libraries.
This is the error:
/server/node_modules/openai/src/error.ts:66
0|server | return new BadRequestError(status, error, message, headers);
0|server | ^
0|server | Error: 400 '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.
0|server | at Function.generate (/server/node_modules/openai/src/error.ts:66:14)
0|server | at OpenAI.makeStatusError (/server/node_modules/openai/src/core.ts:358:21)
0|server | at OpenAI.makeRequest (/server/node_modules/openai/src/core.ts:416:24)
0|server | at processTicksAndRejections (node:internal/process/task_queues:95:5)
0|server | server/node_modules/langchain/dist/embeddings/openai.cjs:223:29
0|server | at RetryOperation._fn (/server/node_modules/p-retry/index.js:50:12)
I was getting the same error with the Python client. You need to strip newlines from the beginning and end of the text in the documents. Hope it helps.
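A minimal sketch of that suggestion (the `documents` list here is a made-up example):

```python
# Hypothetical raw chunks -- note the stray leading/trailing whitespace
documents = ["\nfirst chunk\n", "  second chunk  \n\n", ""]

# Strip whitespace (including newlines) from both ends of each string,
# and drop any that end up empty: the API rejects empty strings.
cleaned = [doc.strip() for doc in documents if doc.strip()]
```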
There is no “input_list” being defined in this snippet.
I gave you one, and improved it:
import openai

client = openai.Client()

with open("mytext.txt", "r", encoding="utf-8") as file:
    text_string = file.read()  # read file, split into paragraph chunks, index

stripped_chunks = [part.strip() for part in text_string.split("\n\n") if part.strip()]
input_list = [f"[{index + 1}] {part}" for index, part in enumerate(stripped_chunks)]
input_list = [x.replace("\n", " ").replace("  ", " ") for x in input_list]

txt_embeddings = client.embeddings.create(
    model="text-embedding-3-large",
    # input=["a good bot", "accepts lists"],
    input=input_list,
    dimensions=512,
    encoding_format="float",
)

embeddings_list = []
for x in txt_embeddings.data:
    embeddings_list.append(x.embedding)
    print(x.embedding[:4])
print(txt_embeddings.usage.model_dump())
Counter to the old advice above, this is also working:
input=["\na good bot\n", "\n\n accepts lists\n\n"],
Hi, thanks for your answer.
The input list I have is quite large, so I cannot supply it here. It's a list of strings that different users have entered in the app's search box.
I'm not sure which fix you are suggesting above, but I tried all of the below and none works for me; same error:
input_list = [x.replace("\n", " ") for x in input_list]
input_list = [x.replace("\n", " ").replace("  ", " ") for x in input_list]
input_list = [x.replace("\n", " ").replace("  ", " ").strip() for x in input_list]
input_list = [x.replace("\n", " ").replace("  ", " ").strip() for x in input_list if x]
As mentioned earlier, the code works if I iterate through the input list and call the API for one string at a time.
With the new embeddings model, there really isn’t a need to strip out linefeeds or separators. If an AI could understand the message, the embeddings AI can understand the message. The paragraphs and formatting might ensure even higher understanding.
Comment out the input_list modification: does it work?
Don't output to the same list variable as your input: does it work?
Quite peculiar. Send just two items through your automated path: error?
Put your text into the commented-out list strings I put in the code and comment out the other input: error?
Let’s imagine that you are doing chunking in the worst way possible, like splitting Unicode by byte lengths. And then your text is something like OpenAI documentation that has special tokens in the text itself. Let’s try to fix that:
import re  # add to imports

# function: ensure valid characters in input_list
def ensure_utf8(strings):
    cleaned_strings = []
    for string in strings:
        # Decode using UTF-8, replacing invalid bytes
        clean_string = string.encode('utf-8', 'replace').decode('utf-8', 'replace')
        # Remove substrings enclosed in <| and |> - AI special tokens
        clean_string = re.sub(r'<\|.*?\|>', '', clean_string)
        # Optionally, take out ALL the extra space runs, like code indenting
        clean_string = ' '.join(clean_string.split())
        cleaned_strings.append(clean_string)
    return cleaned_strings

# add this where your input_list "enters" the code
input_list = ensure_utf8(input_list)

# The existing replace if you still want it
input_list = [x.replace("\n", " ").replace("  ", " ") for x in input_list]
I also found that 2048 items is the max. 100000 tokens was accepted but I didn’t run it up higher, which might be a model limit of 128000. So you can also do some token-counting on the input.
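A quick pre-flight check against those observed limits might look like this (the `len(s) // 4` token estimate is a crude stand-in; an accurate count would use a real tokenizer such as the `tiktoken` package):

```python
MAX_ITEMS = 2048  # max array length observed per request

def estimate_tokens(s):
    # Very rough heuristic: ~4 characters per token for English text.
    # Swap in a real tokenizer (e.g. tiktoken) for accurate counts.
    return max(1, len(s) // 4)

def within_limits(strings, max_total_tokens=100_000):
    # True only if the batch respects both the item cap and the token budget
    total = sum(estimate_tokens(s) for s in strings)
    return len(strings) <= MAX_ITEMS and total <= max_total_tokens
```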
def ensure_utf8(strings):
    cleaned_strings = []
    for string in strings:
        # Decode using UTF-8, replacing invalid bytes
        clean_string = string.encode('utf-8', 'replace').decode('utf-8', 'replace')
        # Remove substrings enclosed in <| and |> - AI special tokens
        clean_string = re.sub(r'<\|.*?\|>', '', clean_string)
        # Optionally, take out ALL the extra space runs, like code indenting
        clean_string = ' '.join(clean_string.split())
        cleaned_strings.append(clean_string)
    cleaned_strings = [x.replace("\n", " ").replace("  ", " ") for x in cleaned_strings]
    return cleaned_strings

input_list = ensure_utf8(input_list)
Still same error
Max Tokens for a string in my input list: 812
Total tokens across all strings: 202595
Hi @mittal.sameer ,
Were you able to rectify the error? I am also stuck in a similar situation: when I pass values one by one it works, but when I pass them in batches it gives the same error.
I suspect the issue is the total token count. As mentioned in the API documentation:
input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002 ), cannot be an empty string, and any array must be 2048 dimensions or less.
From this, my understanding is that when we pass a list, the tokens are counted as the sum of the tokens of each string in the list. If that is the case, you can try batching.
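A sketch of that batching idea (the 2048-item cap comes from the docs quoted above; the 100k-token budget and the length-based token counter are assumptions, so swap in a real tokenizer like `tiktoken` for accuracy):

```python
def batch_inputs(strings, max_items=2048, max_tokens=100_000,
                 count=lambda s: max(1, len(s) // 4)):
    # Yield chunks that respect both the per-request item cap and a
    # total-token budget, so each chunk can be embedded in one call.
    batch, tokens = [], 0
    for s in strings:
        t = count(s)
        if batch and (len(batch) >= max_items or tokens + t > max_tokens):
            yield batch
            batch, tokens = [], 0
        batch.append(s)
        tokens += t
    if batch:
        yield batch

# Usage sketch (client as in the snippets above):
# all_embeddings = []
# for batch in batch_inputs(input_list):
#     resp = client.embeddings.create(model="text-embedding-3-large", input=batch)
#     all_embeddings.extend(item.embedding for item in resp.data)
```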
Error initializing the system: Error code: 400 - {'error': "'input' field must be a string or an array of strings"}
It appears when I run this function:
def create_retriever_from_cassandra(session, name, description):
    """Create Cassandra Vector Store and transform it into a retriever.

    Return the retriever tool.
    """
    keyspace = KEYSPACE_ARTICLE
    table_name = "articles"
    embedding = OpenAIEmbeddings(
        api_key=OPENAI_API_KEY,
        base_url=BASE_URL,
        model="text-embedding-3-large"
    )
    CassVectorStore = Cassandra(
        session=session,
        keyspace=keyspace,
        table_name=table_name,
        embedding=embedding
    )
    retriever = CassVectorStore.as_retriever(
        search_type="similarity",
        search_kwargs={'k': 4}
    )
    retriever_tool = create_retriever_tool(
        retriever=retriever,
        name=name,
        description=description
    )
    return retriever_tool
I am new to working with LangChain and RAGs, so it's kinda hard for me to understand what is going on. I know for a fact that my OpenAIEmbeddings input needs to be a string or an array of strings, like the error message says. How can I fix this?