How can I upload an in-memory buffer to a vector store?

I get a file from Streamlit like this:

uploaded_file = st.file_uploader("Choisir un fichier pdf") #https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
if uploaded_file is not None:
    st.write("Uploaded...")
    # To read file as bytes:
    bytes_data = uploaded_file.getvalue()

When I upload it with vector_stores.file_batches.upload_and_poll(), passing [io.BytesIO(bytes_data)] as the list of file streams, I get this error:

Files with extensions [none] are not supported for retrieval.

How can I upload an in-memory buffer to a vector store?

I don’t have experience with Streamlit and its file objects, but I do with Python, the API documentation, and buffers. I also have a talent for sweet-talking a robot: enough to send an AI down the right path and get you the file name from your Streamlit object.

The API error saying that files without extensions are not supported arises because OpenAI file storage recognizes the file type from the extension in the file name. An in-memory object such as BytesIO doesn’t inherently carry a file name or type, and the API needs that metadata to process the file correctly.

To resolve this, you need a file-like object that includes the necessary metadata, namely the file name. In Python, you can achieve this by subclassing BytesIO and adding a .name attribute, which is then used to infer the file type from the extension. Here’s how you can do it:

import io
import streamlit as st
from openai import OpenAI

# BytesIO subclass that carries a name attribute
class NamedBytesIO(io.BytesIO):
    def __init__(self, buffer, name):
        super().__init__(buffer)
        self.name = name

# Streamlit file upload interface
uploaded_file = st.file_uploader("Choose a PDF file")
if uploaded_file is not None:
    st.write("Uploaded...")
    
    # To read file as bytes and create a NamedBytesIO object with the original file name
    bytes_data = uploaded_file.getvalue()
    file_stream = NamedBytesIO(bytes_data, uploaded_file.name)

    # Initialize OpenAI client
    client = OpenAI()
    
    # Create a vector store called "Financial Documents"
    vector_store = client.beta.vector_stores.create(name="Financial Documents")

    # Upload the file and poll for completion using the named file stream
    file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, 
        files=[file_stream]  # Pass the file_stream in a list
    )
    
    # Print the status and the file counts of the batch to see the result of this operation
    st.write("Batch Status:", file_batch.status)
    st.write("File Counts:", file_batch.file_counts)

Explanation:

  1. NamedBytesIO Class: This is a subclass of io.BytesIO that adds a name attribute to the object. This name attribute is expected to contain the file name with the correct file extension, which the API uses to determine the file type.

  2. File Upload Handling: When a file is uploaded using Streamlit’s file_uploader, the file’s original name, which includes the extension, is preserved and passed to the NamedBytesIO object. This way, when you upload the file to OpenAI’s API, it has the necessary metadata to process the file correctly.

  3. OpenAI API Call: The files are then uploaded to the specified vector store, and the status of the operation is displayed using Streamlit’s write method.

This solution should handle the file appropriately and avoid issues with file type recognition by the API.
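
To see the effect of the wrapper without Streamlit or an API key, here is a minimal standalone sketch. The `extension_of` helper is hypothetical and only illustrates the idea of reading an extension from `.name`; it is not the code the OpenAI SDK or the server actually runs, and the byte content is a made-up placeholder.

```python
import io
import os


class NamedBytesIO(io.BytesIO):
    """BytesIO that also exposes a .name, like a file opened from disk."""

    def __init__(self, buffer: bytes, name: str):
        super().__init__(buffer)
        self.name = name


def extension_of(stream) -> str:
    """Hypothetical helper: return the extension on stream.name, or '' if absent."""
    name = getattr(stream, "name", "")
    return os.path.splitext(name)[1].lower() if isinstance(name, str) else ""


# Placeholder bytes standing in for a real PDF
data = b"%PDF-1.7 placeholder"

plain = io.BytesIO(data)                   # no .name attribute at all
named = NamedBytesIO(data, "report.pdf")   # carries the original file name

print(repr(extension_of(plain)))
print(repr(extension_of(named)))
```

A plain BytesIO gives an uploader nothing to derive an extension from, which is consistent with the "[none]" in the error message, while the named version exposes ".pdf".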

You can try out this provisional answer and see.


Hi, thanks for the answer. I had already gotten it from GPT but didn’t try it: I classified it as a hallucination… It seems incredible to rely on an extension to determine the file type rather than taking it as a parameter…

The silliest part - OpenAI will still inspect the contents of the file anyway to find the type.

Thus no reason for this!

The code above works perfectly.
Thank you very much!

file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, 
    files=[(uploaded_file.name, uploaded_file.getvalue())]
)
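
The tuple form above passes the file name alongside the raw bytes, so no wrapper class is needed. As a small sketch of how such a pair could be prepared, the helper below (`as_upload_tuple`, a hypothetical name, with an assumed default of ".pdf") adds an extension when the name lacks one. Whether a fallback extension is appropriate depends on what your app actually accepts.

```python
from pathlib import Path
from typing import Tuple


def as_upload_tuple(filename: str, data: bytes, default_ext: str = ".pdf") -> Tuple[str, bytes]:
    """Build a (filename, content) pair, appending default_ext if the name has no extension."""
    if not Path(filename).suffix:
        filename += default_ext
    return (filename, data)


# Example with placeholder bytes; in the Streamlit app this would be
# as_upload_tuple(uploaded_file.name, uploaded_file.getvalue())
pair = as_upload_tuple("report", b"%PDF-1.7 placeholder")
print(pair[0])
```

The resulting pair can then be placed in the `files=[...]` list exactly as in the snippet above.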
