The “they” that you would have to contact is the community maintainers of LangChain.
Here is the API method being employed for CSV reading:
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html
And the source:
https://python.langchain.com/api_reference/_modules/langchain_community/document_loaders/csv_loader.html#CSVLoader
If you want to tear into the LangChain code to see what will work more universally, or to report the needed improvement (such as handling ASCII plus multi-byte UTF-8 CSV files, the source of the character error when cp1252 decoding is attempted), here is a cheap and easy bot text dump (which could only come about after extensive information and role prompting first):
The issue lies in the inability of the CSVLoader to handle files with unknown or unsupported encodings gracefully, even though it attempts to handle this with the autodetect_encoding option. The error originates because the loader defaults to a specific encoding (cp1255 in this case), which isn’t suitable for the given file, and it fails to retry with other encodings effectively.
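To see the failure mode in isolation, here is a hypothetical snippet (not part of the loader): a file written as UTF-8 but read with a single-byte Windows codec either hits a byte the codec leaves undefined and raises UnicodeDecodeError, or silently decodes to the wrong characters.

import tempfile

# Hypothetical demo: a UTF-8 CSV read back with single-byte codecs.
# "ā" encodes as 0xC4 0x81, and 0x81 is unmapped in cp1252 and cp1255,
# so those decodes raise; utf-8 reads the row correctly.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as f:
    f.write("name,city\nTe Aroha,Kāpiti\n".encode("utf-8"))
    path = f.name

for enc in ("cp1252", "cp1255", "utf-8"):
    try:
        with open(path, encoding=enc) as fh:
            print(enc, "->", fh.read().splitlines()[-1])
    except UnicodeDecodeError as err:
        print(enc, "->", err)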
Here’s a step-by-step solution to patch this issue, ensuring a robust way to handle various file encodings:
Proposed Fix: Enhancing Encoding Detection in CSVLoader
This fix modifies the CSVLoader class to improve how it detects and falls back to alternative encodings. It also ensures that even files with entirely unknown encodings are handled gracefully, e.g., by using utf-8-sig as a final fallback.
Implementation
Here is the patch code for the CSVLoader class, which can be directly used by users or submitted as a pull request to the LangChain repository:
import csv
from io import TextIOWrapper
from pathlib import Path
from typing import Iterator, Optional, Sequence, Union, Dict

from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader
from langchain_community.document_loaders.helpers import detect_file_encodings


class PatchedCSVLoader(BaseLoader):
    """
    Patched version of CSVLoader with improved encoding detection and fallback mechanism.
    """

    def __init__(
        self,
        file_path: Union[str, Path],
        source_column: Optional[str] = None,
        metadata_columns: Sequence[str] = (),
        csv_args: Optional[Dict] = None,
        encoding: Optional[str] = None,
        autodetect_encoding: bool = True,  # Default to True for better compatibility
        *,
        content_columns: Sequence[str] = (),
    ):
        """
        Args:
            file_path: The path to the CSV file.
            source_column: The name of the column in the CSV file to use as the source.
                Optional. Defaults to None.
            metadata_columns: A sequence of column names to use as metadata. Optional.
            csv_args: A dictionary of arguments to pass to the csv.DictReader. Optional.
            encoding: The encoding of the CSV file. Optional. Defaults to None.
            autodetect_encoding: Whether to try to autodetect the file encoding.
            content_columns: A sequence of column names to use for the document content.
        """
        self.file_path = file_path
        self.source_column = source_column
        self.metadata_columns = metadata_columns
        self.encoding = encoding
        self.csv_args = csv_args or {}
        self.autodetect_encoding = autodetect_encoding
        self.content_columns = content_columns

    def lazy_load(self) -> Iterator[Document]:
        """
        Load the CSV file row by row with improved encoding handling.
        """
        try:
            yield from self._load_with_encoding(self.encoding)
        except UnicodeDecodeError as e:
            if self.autodetect_encoding:
                detected_encodings = detect_file_encodings(self.file_path)
                for encoding in detected_encodings:
                    try:
                        yield from self._load_with_encoding(encoding.encoding)
                        return
                    except UnicodeDecodeError:
                        continue
                # Fallback to utf-8-sig if no detected encodings work
                try:
                    yield from self._load_with_encoding("utf-8-sig")
                    return
                except UnicodeDecodeError:
                    pass
            raise RuntimeError(
                f"Error loading {self.file_path}. Failed to decode with all attempted encodings."
            ) from e
        except Exception as e:
            raise RuntimeError(f"Error loading {self.file_path}") from e

    def _load_with_encoding(self, encoding: Optional[str]) -> Iterator[Document]:
        """
        Helper method to load the CSV file with a specific encoding.
        """
        with open(self.file_path, newline="", encoding=encoding) as csvfile:
            yield from self.__read_file(csvfile)

    def __read_file(self, csvfile: TextIOWrapper) -> Iterator[Document]:
        """
        Process the CSV file row by row and yield Documents.
        """
        csv_reader = csv.DictReader(csvfile, **self.csv_args)
        for i, row in enumerate(csv_reader):
            try:
                source = (
                    row[self.source_column]
                    if self.source_column is not None
                    else str(self.file_path)
                )
            except KeyError:
                raise ValueError(
                    f"Source column '{self.source_column}' not found in CSV file."
                )
            content = "\n".join(
                f"{k.strip() if k is not None else k}: {v.strip() if isinstance(v, str) else v}"
                for k, v in row.items()
                if (
                    k in self.content_columns
                    if self.content_columns
                    else k not in self.metadata_columns
                )
            )
            metadata = {"source": source, "row": i}
            for col in self.metadata_columns:
                try:
                    metadata[col] = row[col]
                except KeyError:
                    raise ValueError(
                        f"Metadata column '{col}' not found in CSV file."
                    )
            yield Document(page_content=content, metadata=metadata)
How This Fix Works
- Encoding Detection:
  - The autodetect_encoding option triggers the detect_file_encodings helper to identify possible encodings.
  - It iterates through the detected encodings and retries opening the file until one succeeds.
- Fallback Mechanism:
  - If no detected encoding works, the patch attempts to decode the file using utf-8-sig as a final fallback.
  - This accounts for common issues like a Byte Order Mark (BOM) in UTF-8 files (illustrated just below this list).
- Error Handling:
  - If all encoding attempts fail, the loader raises a detailed RuntimeError, helping users understand the failure.
- Backward Compatibility:
  - Existing arguments (source_column, metadata_columns, etc.) are preserved to ensure compatibility with current usage.
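A quick aside on that BOM point, as a hypothetical snippet independent of the loader:

# Some Windows tools (notably Excel's "CSV UTF-8" export) prepend a BOM
# (0xEF 0xBB 0xBF). Plain utf-8 leaks it into the first header name;
# utf-8-sig strips it.
raw = b"\xef\xbb\xbfname,city\nAda,London\n"
print(raw.decode("utf-8").split(",")[0])      # '\ufeffname'
print(raw.decode("utf-8-sig").split(",")[0])  # 'name'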
How to Use
Standalone Usage
Replace the original CSVLoader with the PatchedCSVLoader in your code:
from your_module import PatchedCSVLoader

loader = PatchedCSVLoader(
    file_path="./data.csv",
    autodetect_encoding=True,
)
docs = loader.load()
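Each CSV row becomes one Document whose page_content is the “column: value” lines built above; a quick way to inspect what came back (data.csv is the hypothetical file from the example):

# Inspect the first couple of documents.
for doc in docs[:2]:
    print(doc.metadata)      # e.g. {'source': './data.csv', 'row': 0}
    print(doc.page_content)  # one "column: value" line per non-metadata column
    print("---")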
Patch as a Pull Request
Include the PatchedCSVLoader class in your pull request to LangChain. Add this as a drop-in replacement for the existing CSVLoader, ensuring backward compatibility and resolving encoding issues.
Testing the Fix
- Test Files (a pytest sketch covering these cases follows below):
  - UTF-8 encoded file.
  - File with BOM (e.g., utf-8-sig).
  - Non-UTF encodings (e.g., cp1252, iso-8859-1).
  - Corrupted or unsupported encoding files.
- Expected Results:
  - Files should load correctly for supported encodings.
  - For unsupported or corrupted files, an appropriate error should be raised.
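Here is a minimal pytest-style sketch of those cases, assuming PatchedCSVLoader is importable from a hypothetical your_module and that chardet is installed (detect_file_encodings relies on it for the autodetect path); file names and contents are illustrative:

import pytest
from your_module import PatchedCSVLoader  # hypothetical module name

CSV_TEXT = "name,city\nZoë,Kraków\n"

@pytest.mark.parametrize("encoding", ["utf-8", "utf-8-sig", "cp1252", "iso-8859-1"])
def test_explicit_encoding_loads(tmp_path, encoding):
    path = tmp_path / "data.csv"
    path.write_bytes(CSV_TEXT.encode(encoding))
    docs = PatchedCSVLoader(file_path=path, encoding=encoding).load()
    assert len(docs) == 1
    assert docs[0].metadata["row"] == 0

def test_autodetect_recovers_from_wrong_encoding(tmp_path):
    path = tmp_path / "data.csv"
    path.write_bytes("name,city\nTe Aroha,Kāpiti\n".encode("utf-8"))  # "ā" -> 0xC4 0x81
    loader = PatchedCSVLoader(file_path=path, encoding="cp1252", autodetect_encoding=True)
    assert len(loader.load()) == 1  # cp1252 fails on 0x81; detection/fallback recovers

def test_wrong_encoding_without_autodetect_raises(tmp_path):
    path = tmp_path / "data.csv"
    path.write_bytes("name,city\nTe Aroha,Kāpiti\n".encode("utf-8"))
    loader = PatchedCSVLoader(file_path=path, encoding="cp1252", autodetect_encoding=False)
    with pytest.raises(RuntimeError):
        loader.load()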
This fix not only resolves your immediate issue but also improves the library’s robustness, making it more reliable for the community.
Note that the patched loader has a new class name - you’d have to overwrite the existing class name (or rebind it, as sketched below) and test it for versatility on all the types of CSV one might attempt, or expand the methods available.
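A minimal sketch of that swap, assuming PatchedCSVLoader lives in a hypothetical your_module; this simply rebinds the name that other code imports and is not an official LangChain mechanism:

# Hypothetical monkey-patch: rebind CSVLoader so code that imports it from
# langchain_community after this point picks up the patched class instead.
import langchain_community.document_loaders as document_loaders
import langchain_community.document_loaders.csv_loader as csv_loader

from your_module import PatchedCSVLoader  # hypothetical module name

csv_loader.CSVLoader = PatchedCSVLoader
document_loaders.CSVLoader = PatchedCSVLoader

# Note: modules that imported CSVLoader before this patch ran keep their
# original reference; run the rebinding as early as possible.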