While generating embeddings with the OpenAI embedding model (text-embedding-ada-002), we are not getting the closing bracket “]” at the end of the array for each row of sample text we parse. We only see the opening bracket, and the closing bracket does not stay even when we add it manually.
We tried this on the first 3 rows containing text and still get the same issue. A sample EMBEDDING output for one row, without the “]” bracket, looks like this:
[-0.017111334949731827, -0.01736113429069519, -0.00905526801943779, -0.006987475324422121, -0.005818270146846771, 0.011449189856648445, … , -0.039857059717178345, -0.04032890498638153, 0.0054435692727565765, 0.00532907759770751, 0.014349651522934437, -0.02120528742671013, 0.016334177926182747, 0.011379800736904144
As you can see above, the closing bracket “]” is missing.
Please help, as we are unable to find the root cause (RCA) and are blocked: without a proper embedding array we cannot apply ML algorithms.
Thanks
Dhruv Shah
udm17
September 25, 2023, 8:52am
2
Hi Dhruv. Can you share the code snippet/API call that you are using to generate the embeddings?
That will be really helpful for seeing what the problem might be.
_j
September 25, 2023, 8:56am
3
Platform:
Python 3.8.16
openai 0.28.0
numpy 1.24.4
Script:
import openai
openai.api_key = "sk-1234"
print(openai.Embedding.create(
    model="text-embedding-ada-002",
    input="banana banana I eat bananas"
))
Output:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [
-0.021899977698922157,
-0.02625957690179348,
0.02568594552576542,
0.015513545833528042,
0.002731123473495245,
0.003432228695601225,
...
...
-0.03316865116357803,
-0.0028107943944633007,
-0.009783604182302952,
-0.005921151954680681
]
}
],
"model": "text-embedding-ada-002-v2",
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
So see whether you can replicate this, and then find where your application goes wrong.
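A quick way to rule out the API itself is to check the shape of the returned embedding before anything else touches it. The sketch below assumes the response has the structure shown in the output above; `check_embedding` and the sample dictionary are hypothetical names made up for illustration, and 1536 is the vector length of text-embedding-ada-002.

```python
EXPECTED_DIM = 1536  # vector length of text-embedding-ada-002


def check_embedding(response, expected_dim=EXPECTED_DIM):
    """Return the first embedding if it is a complete sequence, else raise."""
    embedding = response["data"][0]["embedding"]
    if not isinstance(embedding, (list, tuple)):
        raise TypeError(f"expected a list, got {type(embedding).__name__}")
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(embedding)}")
    return embedding


# Stand-in for a real API response, shaped like the output above (values are made up).
sample_response = {
    "object": "list",
    "data": [{"object": "embedding", "index": 0, "embedding": [0.01] * EXPECTED_DIM}],
    "model": "text-embedding-ada-002-v2",
}

embedding = check_embedding(sample_response)
print(len(embedding))
```

If a check like this passes on the live response but the saved file is still truncated, the problem is downstream of the API call.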
1 Like
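import os
import openai
import tiktoken  # used further down to count tokens
from openai.embeddings_utils import get_embedding  # helper shipped with openai 0.x

os.environ["OPENAI_API_KEY"] = "xyz"
openai.organization = "xyz"
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Model.list()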
1 Like
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"
max_tokens = 8000
1 Like
Hi,
Below is the code that calls OpenAI to fetch embeddings for the raw text passed in:
top_n = 10
df = df.sort_values("sr.no").tail(top_n * 2)  # first cut to first 2k entries, assuming less than half will be filtered out
df.drop("sr.no", axis=1, inplace=True)
encoding = tiktoken.get_encoding(embedding_encoding)
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))
The above code generates the embeddings. When they are saved to Excel, the closing “]” is missing from the embedding array in every row of text.
1 Like
Hi @udm17,
Do you have any idea what could cause this behavior?
Thanks
Dhruv Shah
1 Like
I think whatever issue you’re having is between your code and Excel.
The API is returning valid JSON.
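One plausible explanation, not confirmed anywhere in this thread, is Excel's limit of 32,767 characters per cell: a 1536-value embedding written out as text is longer than that, so the tail of the string, including the “]”, would be cut off. A minimal sketch of the arithmetic, using a synthetic vector in place of a real embedding (the values and variable names are made up):

```python
import random

EXCEL_CELL_LIMIT = 32_767  # maximum characters Excel stores in one cell
EMBEDDING_DIM = 1536       # vector length of text-embedding-ada-002

# Synthetic stand-in for a real embedding (values are made up).
rng = random.Random(0)
fake_embedding = [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]

cell_text = str(fake_embedding)           # the list rendered as text
truncated = cell_text[:EXCEL_CELL_LIMIT]  # what survives a 32,767-character cap

print(len(cell_text) > EXCEL_CELL_LIMIT)  # is the text too long for one cell?
print(cell_text.endswith("]"))            # does the full text keep its bracket?
print(truncated.endswith("]"))            # does the capped text keep it?
```

Whether the cut happens in the library that writes the file or in Excel itself is not established here; the sketch only shows that the text does not fit in one cell.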
_j
September 26, 2023, 6:26am
9
Plus, the pasted code is from the OpenAI cookbook, in the same basic form it had in early 2022.
1 Like
I have resolved this error with the code below.
df['embedding'] = df['embedding'].astype(str)
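Since this stores each embedding as text, it has to be parsed back into numbers before any ML step. A minimal sketch of that round trip, using only the standard library and an in-memory CSV instead of Excel (both are assumptions made for illustration); the sample rows and shortened vectors are made up:

```python
import csv
import io
import json

# Made-up rows with shortened vectors standing in for real embeddings.
rows = [
    {"combined": "banana banana I eat bananas", "embedding": [-0.0219, -0.0263, 0.0257]},
    {"combined": "another sample text", "embedding": [0.0155, 0.0027, 0.0034]},
]

# Write: serialise each list to a JSON string so the brackets are part of the text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["combined", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"combined": row["combined"], "embedding": json.dumps(row["embedding"])})

# Read: parse each string back into a list of floats.
buffer.seek(0)
restored = [
    {"combined": r["combined"], "embedding": json.loads(r["embedding"])}
    for r in csv.DictReader(buffer)
]

print(restored == rows)
```

With a DataFrame, the same idea would be to apply `json.dumps` to the column before saving and `json.loads` after loading.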