My metadata isn't correctly processed when uploading a JSON file

Hello everyone,
I’m working on a RAG system and want to store my chunks with some additional metadata. To do so, I write them to a JSON file. Here is a snippet of my code:

for pdf in tqdm(list(Path(".").glob("*.pdf")), unit="pdf"):
    pages     = PyPDFLoader(str(pdf)).load()
    text      = "\n".join(p.page_content for p in pages)
    sentences = sent_tokenize(text)
    parents   = parent_chunks(sentences, CHUNK_SIZE_WORDS)

    json_path = tmp_dir / f"{pdf.stem}.json"      
    with open(json_path, "w", encoding="utf-8") as f:
        for p_idx, p_text in enumerate(parents):
            parent_id = f"{pdf.stem}-P{p_idx}"

            f.write(json.dumps({
                "text": p_text,
                "metadata": {
                    "source": pdf.name,
                    "parent_id": parent_id,
                    "level": 0,
                }}, ensure_ascii=False))

            for level, splitter in ((1, child1_splitter), (2, child2_splitter)):
                for chunk in splitter.split_text(p_text):
                    f.write(json.dumps({
                        'text': chunk,
                        'metadata': {
                            "source" : pdf.name,
                            "parent_id": parent_id,
                            "level": level,
                        }}, ensure_ascii=False))

    json_paths.append(json_path)

client.vector_stores.file_batches.upload_and_poll(
    vector_store_id = vs.id,
    files           = [open(p, "rb") for p in json_paths],
)

The problem appears when I search my vector store, for example with this code:

resp = client.vector_stores.search(
    vector_store_id = vs.id,
    query           = query,
    max_num_results = 1,
)

I get something like this:

VectorStoreSearchResponse(attributes={}, content=[Content(text='{"attributes": "xxxx_EN-P0", "content": "xxxxxxx"...)

You can see that ‘attributes’ is empty and my metadata ended up inside ‘content’ as a string. How can I adjust my code so that “attributes” isn’t empty and contains my metadata?

Here’s what doesn’t work correctly. I made RESTful calls to diagnose what happens behind the SDK, for which there is no Python example:

  • adding file attributes as metadata returns a vector store file object that lacks the attributes just sent.

This is what the vector search API returns when I filter with type “eq” on the attribute key/value that I set on a file in the vector store:

{
  "object": "vector_store.search_results.page",
  "search_query": [
    "a placeholder text"
  ],
  "data": [
    {
      "file_id": "file-1234",
      "filename": "placeholder.txt",
      "score": 0.5090088910904553,
      "attributes": {
        "filemeta1": "filevalue1"
      },
      "content": [
        {
          "type": "text",
          "text": "This is the start of file contents.\r\nplaceholder.txt is empty except for this message.\r\n(You're probably supposed to be doing something else other than looking here...)"
        }
      ]
    }
  ],
  "has_more": false,
  "next_page": null
}

The file’s metadata is returned even without a filter.
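To check quickly whether attributes made it onto your files, you can flatten a response like the one above into one row per hit. This is only a sketch that works on the plain JSON dict; `summarize_hits` and the `preview` field are names I made up for illustration, and `sample` is a trimmed copy of the response shown above.

```python
def summarize_hits(page: dict) -> list:
    """Flatten a search results page (plain dict) into one row per hit."""
    rows = []
    for hit in page.get("data", []):
        # Join only the text parts of the content list.
        text = "".join(
            part.get("text", "")
            for part in hit.get("content", [])
            if part.get("type") == "text"
        )
        rows.append({
            "filename": hit.get("filename"),
            "score": hit.get("score"),
            # An empty dict here means no attributes were set on the file.
            "attributes": hit.get("attributes") or {},
            "preview": text[:60],  # hypothetical field, first 60 characters
        })
    return rows


# Trimmed copy of the response shown above.
sample = {
    "object": "vector_store.search_results.page",
    "data": [
        {
            "file_id": "file-1234",
            "filename": "placeholder.txt",
            "score": 0.5090088910904553,
            "attributes": {"filemeta1": "filevalue1"},
            "content": [
                {"type": "text", "text": "This is the start of file contents."}
            ],
        }
    ],
    "has_more": False,
    "next_page": None,
}

rows = summarize_hits(sample)
print(rows)
```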

So the SDK, in whatever undocumented version you have, is shuffling the data object around on you to fit its own type.

Here’s some Python code, written off the cuff, that gets the JSON response from the endpoint.

import os
import httpx

def search_vs(id, query="a placeholder text", max=1) -> dict:
    """
    RESTful vector store search, manual filter parameters on attributes
    """
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError("OPENAI_API_KEY environment variable is not set")

    url = f"https://api.openai.com/v1/vector_stores/{id}/search"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    body = {
        "query": query,
        "max_num_results": max,
        #"filters": {
        #    "key": "filemeta1",
        #    "type": "eq",
        #    "value": "filevalue1",
        #}
    }
    with httpx.Client(timeout=20.0) as client:
        response = client.post(url, headers=headers, json=body)
        response.raise_for_status()

    return response.json()

id = "vs_1234"

response = search_vs(id)  # add query= and max=
print(response)
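If you do want the filter that is commented out in the body above, one way to keep it optional is to build the body in a small helper. This is a sketch only: `build_search_body` is a made-up name, and it covers just the single “eq” comparison used in this thread.

```python
from typing import Optional, Union


def build_search_body(
    query: str,
    max_num_results: int = 1,
    key: Optional[str] = None,
    value: Union[str, int, float, bool, None] = None,
) -> dict:
    """Build the JSON body for the search call, with an optional 'eq' filter."""
    body = {"query": query, "max_num_results": max_num_results}
    if key is not None:
        # Same shape as the commented-out filter in the function above.
        body["filters"] = {"key": key, "type": "eq", "value": value}
    return body


print(build_search_body("a placeholder text", key="filemeta1", value="filevalue1"))
```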

…and here is how I set the attributes on a file, with just one key/value pair.

def add_file_meta(id, file_id, key="filemeta1", value="filevalue1") -> dict:
    """
    set attributes on a vector store's file (not the file in storage)
    """
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError("OPENAI_API_KEY environment variable is not set")

    url = f"https://api.openai.com/v1/vector_stores/{id}/files/{file_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    body = {"attributes": {key: value}}
    with httpx.Client(timeout=10.0) as client:
        response = client.post(url, headers=headers, json=body)
        response.raise_for_status()

    return response.json()

You’ve got to upload the file to storage properly first, and that step doesn’t support any metadata or attributes.

Then attach the file ID to a vector store. That’s where you set the attributes that can be filtered on.

Thank you so much!
Instead of using client.vector_stores.file_batches.upload_and_poll(), I used client.vector_stores.files.create(), which has an “attributes” argument that I used to store my metadata. Now my vector search result no longer has an empty “attributes”!
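For anyone landing here later, a minimal sketch of that two-step flow follows. The `attach_with_attributes` helper name is made up, the `client` is passed in rather than created, and `purpose="assistants"` is my assumption for the upload step. As far as this thread shows, attributes are set per file rather than per chunk, so per-chunk values such as parent_id or level would not carry over unless each chunk is its own file.

```python
from pathlib import Path
from typing import Any


def attach_with_attributes(
    client: Any, vector_store_id: str, path: Path, attributes: dict
) -> Any:
    """Upload one file to storage, then attach it to a vector store with attributes."""
    # Step 1: plain upload to file storage. No metadata can be set here.
    # purpose="assistants" is an assumption, not something stated in this thread.
    with open(path, "rb") as fh:
        uploaded = client.files.create(file=fh, purpose="assistants")

    # Step 2: attach the file ID to the vector store; attributes are set here.
    return client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=uploaded.id,
        attributes=attributes,
    )


# Usage, assuming an OpenAI client plus the vs and json_paths names from the question:
# from openai import OpenAI
# client = OpenAI()
# for p in json_paths:
#     attach_with_attributes(client, vs.id, p, {"source": p.name})
```

Note that this call does not wait for the file to finish processing, so a search immediately afterwards may not see it yet.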