Failed to upload file to a vector store

Hello.

I have tried uploading a file to a Vector Store via the API, and it appears to upload successfully.
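
For reference, this is roughly the flow I'm using (a minimal sketch with the openai Python SDK; vector stores may live under client.beta or client.vector_stores depending on the SDK version, and the IDs and file name are placeholders):

from openai import OpenAI

client = OpenAI()

# Upload the file to file storage, then attach it to the vector store and
# poll until ingestion finishes. The upload call itself returns without error.
uploaded = client.files.create(file=open("ids.txt", "rb"), purpose="assistants")
vs_file = client.beta.vector_stores.files.create_and_poll(
    file_id=uploaded.id,
    vector_store_id="vs_...",  # placeholder vector store ID
)
print(vs_file.status)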

However, when I open that Vector Store in the web interface, the file I just uploaded shows as “failed”.

I have also tried uploading it to the Vector Store directly from the web interface, but the same problem occurs.

The funny thing is that when I try to delete this failed file, it stays in progress forever and is never deleted.

What’s more, if I select another vector store, the page crashes.

The file is a .txt with the following content:

Updated;ID
05/15/2024;52139
05/15/2024;52080
05/15/2024;51884
05/15/2024;51773
05/15/2024;51759
05/15/2024;51712
05/15/2024;51680
05/15/2024;51649
05/15/2024;51536
05/15/2024;531536

I have tried other accepted file formats, but the same problem occurs.

Can anyone tell me why this happens?

Any suggestion is welcome. Thank you.

I'm facing the same issue with the web interface: a 20 MB .txt file fails to upload, while a much smaller one succeeds.

The maximum file size is 512 MB, and each file may contain no more than 5,000,000 tokens (computed automatically when you attach the file).

20 MB of plain text can exceed the 5M-token limit, since a plain text file has no format overhead to discard; and data that isn't natural language tokenizes poorly (and would give very poor semantic search results anyway).
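
If you want a rough local check before uploading, counting tokens is a few lines (a sketch using the tiktoken package; cl100k_base is my assumption about the tokenizer, but it gives the right order of magnitude):

import tiktoken

# Estimate the token count locally before uploading. cl100k_base is an
# assumption; the service's actual tokenizer may differ somewhat.
enc = tiktoken.get_encoding("cl100k_base")
with open("data.txt", encoding="utf-8") as f:
    tokens = len(enc.encode(f.read()))
print(f"{tokens:,} tokens (the per-file limit is 5,000,000)")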

I reduced the file to 6 MB, below the 5M-token limit, and I'm still facing the same issue when attaching the file to a new vector store. Uploading the file succeeds, but attaching it fails, and then I can't delete the file; I have to delete the entire store and repeat the steps, with the same result. The file is financial data in CSV format, uploaded as a .txt file. This worked a few days ago, so it seems to be an issue today.

Hello, I got the file to load for me.

Transforming the file content into JSON format worked (I used an online CSV-to-JSON converter for testing).
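
If you'd rather not use an online tool, the same conversion is a few lines of stdlib Python (a sketch assuming the semicolon-delimited layout shown above; file names are placeholders):

import csv
import json

# Read the semicolon-delimited file and write it back out as a JSON array
# of {"Updated": ..., "ID": ...} objects.
with open("ids.txt", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter=";"))
with open("ids.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)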

I also tested DOCX and it worked (I used an online TXT-to-DOCX converter for testing).

Hello,

I am facing the same issue; is there any update?
For me, the file uploads to storage with status ‘Ready’, but attaching the same file to the vector store results in status ‘Failed’.
Any pointers will be helpful.

I'm facing the same issue. Is there any way to get more feedback on this error? The Threads console isn't telling me very much; it seems like enhanced monitoring would help here.
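
One way to get a bit more detail than the console shows is to read the file's last_error from the API (a sketch assuming a 2024-era openai Python SDK, where vector stores sit under client.beta; the IDs are placeholders):

from openai import OpenAI

client = OpenAI()

# The vector store file record carries a status and, on failure, a
# last_error with a code and message explaining why ingestion failed.
vs_file = client.beta.vector_stores.files.retrieve(
    file_id="file-abc123",        # placeholder
    vector_store_id="vs_abc123",  # placeholder
)
print(vs_file.status)      # in_progress | completed | failed | cancelled
print(vs_file.last_error)  # populated when status == "failed"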

Make sure the document is not a scanned (image-based) document. The vector store does not allow scanned documents to be attached, as it identifies them as empty files.
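
One quick local check before uploading (a sketch assuming the pypdf package; a PDF whose pages yield no extractable text is most likely image-only):

from pypdf import PdfReader

# If no page yields any extractable text, the PDF is likely a pure scan
# and the vector store will treat it as an empty file.
reader = PdfReader("document.pdf")
has_text = any((page.extract_text() or "").strip() for page in reader.pages)
print("text found" if has_text else "looks like a scanned, image-only PDF")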

Have you tried uploading a hardcoded file from your project? If that succeeds, it means the way you are storing the temporary file is the issue.

I made sure to handle the uploaded files as binary streams, just like the hardcoded files.

I also encountered the same problem (especially the infinite loop) with my .doc files. When I converted them to PDF and tried again, it worked!

I am having the same issue. Was someone able to solve this problem?

I had the same issue. The file was saved as UTF-8; I tried UTF-8+BOM and it still didn't work. Only when I opened the CSV file in Notepad, chose Save As…, and selected ANSI as the encoding did LangChain in Open WebUI accept the file.

The Open WebUI code should fix file loading to include an encoding option, auto-detect the encoding, or, on an exception, retry with another encoding (e.g. UTF-8). The code should be something like this:

from langchain_community.document_loaders.csv_loader import CSVLoader

# Pass the encoding explicitly so the loader doesn't fall back to the
# platform's locale default (cp1255 in the traceback below).
loader = CSVLoader(file_path="path_to_your_file.csv", encoding="utf-8")
docs = loader.load()
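
Alternatively, detect the encoding up front and pass it through (a sketch using the charset_normalizer package; LangChain's own autodetect helper does something similar):

from charset_normalizer import from_path
from langchain_community.document_loaders.csv_loader import CSVLoader

# Guess the file's encoding instead of relying on the locale default.
best = from_path("path_to_your_file.csv").best()
loader = CSVLoader(
    file_path="path_to_your_file.csv",
    encoding=best.encoding if best else "utf-8",
)
docs = loader.load()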

This is my original stack trace:

WARNI [python_multipart.multipart] Skipping data after last boundary
INFO  [open_webui.routers.files] file.content_type: application/vnd.ms-excel
ERROR [open_webui.routers.retrieval] Error loading C:\Program Files\Python312\Lib\site-packages\open_webui\data/uploads/83464607-b967-4f83-b242-e483e0032546_Book1.csv
Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 135, in lazy_load
    yield from self.__read_file(csvfile)
  File "C:\Program Files\Python312\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 155, in __read_file
    for i, row in enumerate(csv_reader):
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\csv.py", line 115, in __next__
    self.fieldnames
  File "C:\Program Files\Python312\Lib\csv.py", line 102, in fieldnames
    self._fieldnames = next(self.reader)
                       ^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\encodings\cp1255.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 30: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\open_webui\routers\retrieval.py", line 884, in process_file
    docs = loader.load(
           ^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\open_webui\retrieval\loaders\main.py", line 127, in load
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\langchain_core\document_loaders\base.py", line 31, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 149, in lazy_load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading C:\Program Files\Python312\Lib\site-packages\open_webui\data/uploads/83464607-b967-4f83-b242-e483e0032546_Book1.csv
ERROR [open_webui.routers.files] 400: Error loading C:\Program Files\Python312\Lib\site-packages\open_webui\data/uploads/83464607-b967-4f83-b242-e483e0032546_Book1.csv
Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 135, in lazy_load
    yield from self.__read_file(csvfile)
  File "C:\Program Files\Python312\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 155, in __read_file
    for i, row in enumerate(csv_reader):
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\csv.py", line 115, in __next__
    self.fieldnames
  File "C:\Program Files\Python312\Lib\csv.py", line 102, in fieldnames
    self._fieldnames = next(self.reader)
                       ^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\encodings\cp1255.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 30: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\open_webui\routers\retrieval.py", line 884, in process_file
    docs = loader.load(
           ^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\open_webui\retrieval\loaders\main.py", line 127, in load
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\langchain_core\document_loaders\base.py", line 31, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\langchain_community\document_loaders\csv_loader.py", line 149, in lazy_load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading C:\Program Files\Python312\Lib\site-packages\open_webui\data/uploads/83464607-b967-4f83-b242-e483e0032546_Book1.csv

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Python312\Lib\site-packages\open_webui\routers\files.py", line 74, in upload_file
    process_file(request, ProcessFileForm(file_id=id))
  File "C:\Program Files\Python312\Lib\site-packages\open_webui\routers\retrieval.py", line 962, in process_file
    raise HTTPException(
fastapi.exceptions.HTTPException: 400: Error loading C:\Program Files\Python312\Lib\site-packages\open_webui\data/uploads/83464607-b967-4f83-b242-e483e0032546_Book1.csv

I hope they fix it soon.

The “they” you would have to contact are the community maintainers of LangChain.

Here is the API method being employed for CSV reading:

https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html

And the source:

https://python.langchain.com/api_reference/_modules/langchain_community/document_loaders/csv_loader.html#CSVLoader

If you want to tear into the LangChain code to see what will work more universally, or report the needed improvement (such as handling ASCII + multi-byte UTF-8 CSVs, the source of the character error when the locale codepage, cp1255 in the trace above, is attempted), here is a cheap and easy bot text dump (which can only come about after extensive information and role prompting):



The issue lies in the CSVLoader's inability to handle files with unknown or unsupported encodings gracefully, even though it tries to with the autodetect_encoding option. The error arises because the loader falls back to the platform's default encoding (cp1255 in this case), which isn't suitable for the given file, and it fails to retry other encodings effectively.
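
You can reproduce the decode failure in isolation; byte 0x90 simply has no assigned character in cp1255, exactly as the traceback shows:

# Byte 0x90 is unmapped in cp1255 (the Windows Hebrew codepage picked up
# as the locale default), which is precisely the error in the trace above.
try:
    bytes([0x90]).decode("cp1255")
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x90 in position 0: ...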

Here’s a step-by-step solution to patch this issue, ensuring a robust way to handle various file encodings:


Proposed Fix: Enhancing Encoding Detection in CSVLoader

This fix modifies the CSVLoader class to improve how it detects and falls back to alternative encodings. It also ensures that files with entirely unknown encodings are handled gracefully, e.g. by using utf-8-sig as a final fallback.


Implementation

Here is the patch code for the CSVLoader class, which can be directly used by users or submitted as a pull request to the LangChain repository:

import csv
from io import TextIOWrapper
from pathlib import Path
from typing import Iterator, Optional, Sequence, Union, Dict

from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader
from langchain_community.document_loaders.helpers import detect_file_encodings


class PatchedCSVLoader(BaseLoader):
    """
    Patched version of CSVLoader with improved encoding detection and fallback mechanism.
    """

    def __init__(
        self,
        file_path: Union[str, Path],
        source_column: Optional[str] = None,
        metadata_columns: Sequence[str] = (),
        csv_args: Optional[Dict] = None,
        encoding: Optional[str] = None,
        autodetect_encoding: bool = True,  # Default to True for better compatibility
        *,
        content_columns: Sequence[str] = (),
    ):
        """
        Args:
            file_path: The path to the CSV file.
            source_column: The name of the column in the CSV file to use as the source.
              Optional. Defaults to None.
            metadata_columns: A sequence of column names to use as metadata. Optional.
            csv_args: A dictionary of arguments to pass to the csv.DictReader. Optional.
            encoding: The encoding of the CSV file. Optional. Defaults to None.
            autodetect_encoding: Whether to try to autodetect the file encoding.
            content_columns: A sequence of column names to use for the document content.
        """
        self.file_path = file_path
        self.source_column = source_column
        self.metadata_columns = metadata_columns
        self.encoding = encoding
        self.csv_args = csv_args or {}
        self.autodetect_encoding = autodetect_encoding
        self.content_columns = content_columns

    def lazy_load(self) -> Iterator[Document]:
        """
        Load the CSV file row by row with improved encoding handling.
        """
        try:
            yield from self._load_with_encoding(self.encoding)
        except UnicodeDecodeError as e:
            if self.autodetect_encoding:
                detected_encodings = detect_file_encodings(self.file_path)
                for encoding in detected_encodings:
                    try:
                        yield from self._load_with_encoding(encoding.encoding)
                        return
                    except UnicodeDecodeError:
                        continue
                # Fallback to utf-8-sig if no detected encodings work
                try:
                    yield from self._load_with_encoding("utf-8-sig")
                    return
                except UnicodeDecodeError:
                    pass
            raise RuntimeError(
                f"Error loading {self.file_path}. Failed to decode with all attempted encodings."
            ) from e
        except Exception as e:
            raise RuntimeError(f"Error loading {self.file_path}") from e

    def _load_with_encoding(self, encoding: Optional[str]) -> Iterator[Document]:
        """
        Helper method to load the CSV file with a specific encoding.
        """
        with open(self.file_path, newline="", encoding=encoding) as csvfile:
            yield from self.__read_file(csvfile)

    def __read_file(self, csvfile: TextIOWrapper) -> Iterator[Document]:
        """
        Process the CSV file row by row and yield Documents.
        """
        csv_reader = csv.DictReader(csvfile, **self.csv_args)
        for i, row in enumerate(csv_reader):
            try:
                source = (
                    row[self.source_column]
                    if self.source_column is not None
                    else str(self.file_path)
                )
            except KeyError:
                raise ValueError(
                    f"Source column '{self.source_column}' not found in CSV file."
                )
            content = "\n".join(
                f"{k.strip() if k is not None else k}: {v.strip() if isinstance(v, str) else v}"
                for k, v in row.items()
                if (
                    k in self.content_columns
                    if self.content_columns
                    else k not in self.metadata_columns
                )
            )
            metadata = {"source": source, "row": i}
            for col in self.metadata_columns:
                try:
                    metadata[col] = row[col]
                except KeyError:
                    raise ValueError(
                        f"Metadata column '{col}' not found in CSV file."
                    )
            yield Document(page_content=content, metadata=metadata)

How This Fix Works

  1. Encoding Detection:
    • The autodetect_encoding option triggers the detect_file_encodings helper to identify possible encodings.
    • It iterates through the detected encodings and retries opening the file until one succeeds.
  2. Fallback Mechanism:
    • If no detected encoding works, the patch attempts to decode the file using utf-8-sig as a final fallback (see the short demo after this list).
    • This accounts for common issues like a Byte Order Mark (BOM) in UTF-8 files.
  3. Error Handling:
    • If all encoding attempts fail, the loader raises a detailed RuntimeError, helping users understand the failure.
  4. Backward Compatibility:
    • Existing arguments (source_column, metadata_columns, etc.) are preserved to ensure compatibility with current usage.
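
A quick illustration of point 2, using the header row from earlier in this thread:

# utf-8-sig strips a leading BOM that plain utf-8 would pass through to
# csv.DictReader as part of the first header name.
data = "\ufeffUpdated;ID\n05/15/2024;52139\n".encode("utf-8")
print(repr(data.decode("utf-8")[:8]))      # '\ufeffUpdated': BOM leaks in
print(repr(data.decode("utf-8-sig")[:7]))  # 'Updated': BOM removed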

How to Use

Standalone Usage

Replace the original CSVLoader with the PatchedCSVLoader in your code:

from your_module import PatchedCSVLoader

loader = PatchedCSVLoader(
    file_path="./data.csv",
    autodetect_encoding=True,
)
docs = loader.load()

Patch as a Pull Request

Include the PatchedCSVLoader class in your pull request to LangChain. Add this as a drop-in replacement for the existing CSVLoader, ensuring backward compatibility and resolving encoding issues.


Testing the Fix

  1. Test Files (a fixture-generation sketch follows this list):
    • UTF-8 encoded file.
    • File with a BOM (e.g., utf-8-sig).
    • Non-UTF encodings (e.g., cp1252, iso-8859-1).
    • Files with corrupted or unsupported encodings.
  2. Expected Results:
    • Files should load correctly for supported encodings.
    • For unsupported or corrupted files, an appropriate error should be raised.
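
A quick way to generate such fixtures locally (a sketch; file names and contents are arbitrary):

# Write small CSV fixtures in the encodings listed above.
samples = {
    "utf8.csv": ("utf-8", "Updated;ID\n05/15/2024;52139\n"),
    "utf8_bom.csv": ("utf-8-sig", "Updated;ID\n05/15/2024;52139\n"),
    "cp1252.csv": ("cp1252", "City;Note\nMünchen;café\n"),
    "latin1.csv": ("iso-8859-1", "City;Note\nSevilla;año\n"),
}
for name, (enc, text) in samples.items():
    with open(name, "w", encoding=enc, newline="") as f:
        f.write(text)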

This fix not only resolves your immediate issue but also improves the library’s robustness, making it more reliable for the community.


Note that the patched class has a new name: you'd have to overwrite the existing class name and test it for versatility on all the kinds of CSV one might attempt, or expand the methods available.