Getting embeddings of length 1

Hi, I’m trying to use LangChain to create a vectorstore from scraped HTML pages, but I’ve run into an issue where some embeddings come back with length 1 when they should have length 1536 according to the OpenAI Platform documentation.

Here’s how my code looks:

from langchain.document_loaders import BSHTMLLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS

...

all_raw_documents = []

# Load each scraped HTML page with BeautifulSoup.
for file in html_files:
    loader = BSHTMLLoader(file)
    raw_documents = loader.load()
    all_raw_documents.extend(raw_documents)

# Split the pages into overlapping chunks before embedding.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
documents = text_splitter.split_documents(all_raw_documents)

# Embed the chunks with OpenAI and build the FAISS index.
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)

On the last line, I get the following error:

Traceback (most recent call last):
  File "/ingest.py", line 128, in <module>
    ingest_docs(customers)
  File "/ingest.py", line 34, in ingest_docs
    vectorstore = FAISS.from_documents(documents, embeddings)
  File "/env/lib/python3.10/site-packages/langchain/vectorstores/base.py", line 272, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/env/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 385, in from_texts
    return cls.__from(
  File "/env/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 348, in __from
    index.add(np.array(embeddings, dtype=np.float32))
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (13238,) + inhomogeneous part.

After some investigation, I found that the problem is caused by the API returning erroneous embeddings like the following:

{
  "embedding": [
    NaN
  ],
  "index": 520,
  "object": "embedding"
} 

As you can see, the returned embedding is a single NaN (length 1) rather than a real 1536-dimensional vector.
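
Something along these lines is enough to surface the bad vectors directly from the API (a minimal debugging sketch, assuming the pre-1.0 openai Python client and text-embedding-ada-002, which is the default model behind OpenAIEmbeddings; the slice of 100 chunks is arbitrary):

import openai

EXPECTED_DIM = 1536  # documented vector length for text-embedding-ada-002

# Embed a slice of the chunks directly and flag any vector with the wrong length.
texts = [doc.page_content for doc in documents]
response = openai.Embedding.create(model="text-embedding-ada-002", input=texts[:100])
for item in response["data"]:
    if len(item["embedding"]) != EXPECTED_DIM:
        print(f"Bad embedding at index {item['index']}: {item['embedding']}")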

Does anyone know how to resolve this issue? Thanks!


Hi, I am also getting this issue with code that worked fine yesterday. Maybe it is related to some API upgrade, since the price for embeddings dropped by 75% since yesterday.


Yeah, we also started seeing this issue yesterday. For now we had to patch /env/lib/python3.10/site-packages/langchain/embeddings/openai.py directly so that embedding generation is retried for the erroneous chunk inside _get_len_safe_embeddings().
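
Roughly, the idea is the following (a sketch against LangChain's public OpenAIEmbeddings interface rather than the actual patch to _get_len_safe_embeddings(); the RetryingOpenAIEmbeddings wrapper and the EXPECTED_DIM/MAX_RETRIES constants here are illustrative):

from langchain.embeddings import OpenAIEmbeddings

EXPECTED_DIM = 1536  # expected vector length for text-embedding-ada-002
MAX_RETRIES = 3

class RetryingOpenAIEmbeddings(OpenAIEmbeddings):
    def embed_documents(self, texts):
        vectors = super().embed_documents(texts)
        for i, vector in enumerate(vectors):
            attempts = 0
            # A bad chunk shows up as [NaN], i.e. a vector of length 1.
            while len(vector) != EXPECTED_DIM and attempts < MAX_RETRIES:
                vector = super().embed_query(texts[i])
                attempts += 1
            vectors[i] = vector
        return vectors

With that in place, FAISS.from_documents(documents, RetryingOpenAIEmbeddings()) should go through without the ValueError.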

See the same issue:

"""
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jon/h2ogpt/utils.py", line 759, in _traced_func
    return func(*args, **kwargs)
  File "/home/jon/h2ogpt/tests/test_langchain_units.py", line 291, in test_qa_daidocs_db_chunk_openaiembedding_hfmodel
    check_ret(ret)
  File "/home/jon/h2ogpt/tests/test_langchain_units.py", line 75, in check_ret
    for ret1 in ret:
  File "/home/jon/h2ogpt/gpt_langchain.py", line 1512, in _run_qa_db
    docs, chain, scores, use_context = get_similarity_chain(**sim_kwargs)
  File "/home/jon/h2ogpt/gpt_langchain.py", line 1622, in get_similarity_chain
    db, num_new_sources, new_sources_metadata = make_db(use_openai_embedding=use_openai_embedding,
  File "/home/jon/h2ogpt/gpt_langchain.py", line 1187, in make_db
    return _make_db(**langchain_kwargs)
  File "/home/jon/h2ogpt/gpt_langchain.py", line 1320, in _make_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding, db_type=db_type,
  File "/home/jon/h2ogpt/gpt_langchain.py", line 69, in get_db
    db = FAISS.from_documents(sources, embedding)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/langchain/vectorstores/base.py", line 317, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 502, in from_texts
    return cls.__from(
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 454, in __from
    vector = np.array(embeddings, dtype=np.float32)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (1407,) + inhomogeneous part.
"""