How can I upload an in-memory buffer to a vector store?

Nixilb · June 30, 2024, 7:44am

I get a file from Streamlit like this:

uploaded_file = st.file_uploader("Choose a PDF file")  # https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
if uploaded_file is not None:
    st.write("Uploaded...")
    # To read file as bytes:
    bytes_data = uploaded_file.getvalue()

When I upload it with vector_stores.file_batches.upload_and_poll(), passing [io.BytesIO(blob)] as the list of files, I get this error:

Files with extensions [none] are not supported for retrieval.

How can I upload an in-memory buffer to a vector store?

_j · June 30, 2024, 8:23am

I don’t have experience with streamlit and its files, but I do with Python, the API documentation, and using buffers. Also a talent for sweet-talking a robot. Enough to send an AI on the right path and provide you your streamlit object file name.

The error you encountered with the API indicating that files without extensions are not supported is due to the requirement for the OpenAI file storage to recognize the file type via its extension. When you use an in-memory file like a BytesIO object, it doesn’t inherently carry the metadata about the file name or type, which is crucial for the API to process the file correctly.
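
A quick way to see the difference, as a minimal sketch using only the standard library (the temporary file and its .pdf suffix are just for illustration): a plain io.BytesIO has no name attribute, while a file opened from disk does, so in one case there is an extension to read and in the other there is nothing.

```python
import io
import os
import tempfile

# A bare in-memory buffer carries no file name at all.
buffer = io.BytesIO(b"%PDF-1.4 example bytes")
buffer_has_name = hasattr(buffer, "name")

# A file opened from disk exposes its path through .name.
with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
    disk_extension = os.path.splitext(tmp.name)[1]

print(buffer_has_name)  # False
print(disk_extension)   # .pdf
```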

To resolve this, you need to mimic a file-like object that includes the necessary metadata, such as the file name. In Python, you can achieve this by wrapping the BytesIO object into another class that adds the .name attribute, which will be used by the OpenAI API to infer the file type from the extension. Here’s how you can do it:
import io
import streamlit as st
from openai import OpenAI

# Function to wrap BytesIO with a name attribute
class NamedBytesIO(io.BytesIO):
    def __init__(self, buffer, name):
        super().__init__(buffer)
        self.name = name

# Streamlit file upload interface
uploaded_file = st.file_uploader("Choose a PDF file")
if uploaded_file is not None:
    st.write("Uploaded...")
    
    # To read file as bytes and create a NamedBytesIO object with the original file name
    bytes_data = uploaded_file.getvalue()
    file_stream = NamedBytesIO(bytes_data, uploaded_file.name)

    # Initialize OpenAI client
    client = OpenAI()
    
    # Create a vector store called "Financial Documents"
    vector_store = client.beta.vector_stores.create(name="Financial Documents")

    # Upload the file and poll for completion using the named file stream
    file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, 
        files=[file_stream]  # Pass the file_stream in a list
    )
    
    # Print the status and the file counts of the batch to see the result of this operation
    st.write("Batch Status:", file_batch.status)
    st.write("File Counts:", file_batch.file_counts)
Explanation:

NamedBytesIO Class: This is a subclass of io.BytesIO that adds a name attribute to the object. This name attribute is expected to contain the file name with the correct file extension, which the API uses to determine the file type.

File Upload Handling: When a file is uploaded using Streamlit’s file_uploader, the file’s original name, which includes the extension, is preserved and passed to the NamedBytesIO object. This way, when you upload the file to OpenAI’s API, it has the necessary metadata to process the file correctly.

OpenAI API Call: The files are then uploaded to the specified vector store, and the status of the operation is displayed using Streamlit’s write method.

This solution should handle the file appropriately and avoid issues with file type recognition by the API.
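
If you would rather catch a missing extension before calling the API, something like the following could work. Treat it as a sketch: ensure_named is a made-up helper, not part of the openai library, and it sets .name directly on a plain io.BytesIO instead of subclassing, which Python allows.

```python
import io
import os


def ensure_named(data: bytes, filename: str) -> io.BytesIO:
    """Hypothetical helper: wrap bytes in a BytesIO that carries a file name."""
    extension = os.path.splitext(filename)[1]
    if not extension:
        # Fail early instead of waiting for the API to reject the upload.
        raise ValueError(f"{filename!r} has no extension")
    stream = io.BytesIO(data)
    stream.name = filename  # plain attribute assignment works on BytesIO
    return stream


stream = ensure_named(b"%PDF-1.4 example bytes", "report.pdf")
print(stream.name)  # report.pdf
```

With Streamlit, the call would be ensure_named(uploaded_file.getvalue(), uploaded_file.name), and the result goes into the files list as before.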

You can try out this provisional answer and see.

Nixilb · June 30, 2024, 8:38am

Hi, thanks for the answer. I had already gotten this from GPT but didn't try it: I dismissed it as a hallucination… It seems incredible to rely on an extension to determine the file type instead of taking it as a parameter…

_j · June 30, 2024, 8:41am

The silliest part - OpenAI will still inspect the contents of the file anyway to find the type.

Nixilb · June 30, 2024, 9:10am

Thus no reason for this!

Nixilb · June 30, 2024, 9:10am

The code above works perfectly.
Thank you very much!

Nixilb · July 1, 2024, 12:15pm

file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, 
    files=[(uploaded_file.name, uploaded_file.getvalue())]
)

From:

github.com/openai/openai-python

upload_and_poll a in memory file

opened 08:42AM - 30 Jun 24 UTC

closed 10:47AM - 01 Jul 24 UTC

nixilb

question

### Confirm this is an issue with the Python library and not an underlying OpenAI API
- [X] This is an issue with the Python library

### Describe the bug
upload_and_poll a in memory file

I get a streamlit file that way:
```python
uploaded_file = st.file_uploader("Choisir un fichier pdf") #https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
if uploaded_file is not None:
    st.write("Uploaded...")
    # To read file as bytes:
    bytes_data = uploaded_file.getvalue()
```
When I upload it as a file with `vector_stores.file_batches.upload_and_poll()` using `[io.BytesIO(blob)]` as the stream array it makes an error: **"Files with extensions [none] are not supported for retrieval."**

It seems that OpenAI file storage recognize the file type via its extension. It should be an optional parameter.

### To Reproduce
see code above

### Code snippets
_No response_

### OS
linux

### Python version
python 3.12

### Library version
openai 1.33.0
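
For reference, the tuple form can also carry a content type as a third element. A small sketch, assuming the (filename, content, content_type) shape that openai-python's file type hints accept; to_file_tuple is a hypothetical helper name, and the MIME type is guessed from the extension with the standard library:

```python
import mimetypes


def to_file_tuple(filename: str, data: bytes):
    """Hypothetical helper: build a (filename, content, content_type) tuple."""
    content_type, _ = mimetypes.guess_type(filename)
    # Fall back to a generic type when the extension is unknown.
    return (filename, data, content_type or "application/octet-stream")


file_tuple = to_file_tuple("report.pdf", b"%PDF-1.4 example bytes")
print(file_tuple[0], file_tuple[2])
```

With Streamlit that would be files=[to_file_tuple(uploaded_file.name, uploaded_file.getvalue())].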
