Hello everyone,
I’m working on a RAG system and want to store my chunks together with some additional metadata. To do so, I write each chunk and its metadata to a JSON file. Here is a snippet of my code:
# parent_chunks, child1_splitter, child2_splitter, tmp_dir, json_paths,
# client and vs are defined elsewhere in my script.
for pdf in tqdm(list(Path(".").glob("*.pdf")), unit="pdf"):
    pages = PyPDFLoader(str(pdf)).load()
    text = "\n".join(p.page_content for p in pages)
    sentences = sent_tokenize(text)
    parents = parent_chunks(sentences, CHUNK_SIZE_WORDS)
    json_path = tmp_dir / f"{pdf.stem}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        for p_idx, p_text in enumerate(parents):
            parent_id = f"{pdf.stem}-P{p_idx}"
            # Parent chunk (level 0), one JSON object per line.
            f.write(json.dumps({
                "text": p_text,
                "metadata": {
                    "source": pdf.name,
                    "parent_id": parent_id,
                    "level": 0,
                }}, ensure_ascii=False) + "\n")
            # Child chunks (levels 1 and 2) split from the same parent.
            for level, splitter in ((1, child1_splitter), (2, child2_splitter)):
                for chunk in splitter.split_text(p_text):
                    f.write(json.dumps({
                        "text": chunk,
                        "metadata": {
                            "source": pdf.name,
                            "parent_id": parent_id,
                            "level": level,
                        }}, ensure_ascii=False) + "\n")
    json_paths.append(json_path)

# Upload all generated JSON files to the vector store in one batch.
client.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vs.id,
    files=[open(p, "rb") for p in json_paths],
)
The problem appears when I search my vector store, for example with this code:
resp = client.vector_stores.search(
    vector_store_id=vs.id,
    query=query,
    max_num_results=1,
)
I get something like this:
VectorStoreSearchResponse(attributes={}, content=[Content(text='{"attributes": "xxxx_EN-P0", "content": "xxxxxxx"...)
You can see that ‘attributes’ is empty and that my metadata ended up inside ‘content’ as part of a plain string. How can I adjust my code so that ‘attributes’ is populated with my metadata?
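
One direction that might work, sketched here under assumptions and not tested against the real API: as far as I understand, attributes belong to a vector store file and are not parsed out of the file body, so each chunk could be uploaded as its own small file with its metadata passed separately. The helper names `chunk_uploads` and `attach_with_attributes` are hypothetical, and the sketch assumes a recent `openai` SDK in which `client.vector_stores.files.create` accepts an `attributes` argument.

```python
def chunk_uploads(records):
    """Turn chunk records into (filename, payload, attributes) triples.

    Each record is assumed to look like the dicts written in the snippet
    above: {"text": ..., "metadata": {"source", "parent_id", "level"}}.
    """
    uploads = []
    for i, rec in enumerate(records):
        meta = rec["metadata"]
        # Hypothetical naming scheme: one small text file per chunk.
        filename = f"{meta['parent_id']}-L{meta['level']}-{i}.txt"
        uploads.append((filename, rec["text"].encode("utf-8"), dict(meta)))
    return uploads


def attach_with_attributes(client, vector_store_id, uploads):
    """Upload every chunk as its own file and attach its metadata.

    Assumption: `client` is an openai.OpenAI instance whose
    vector_stores.files.create accepts `attributes`.
    """
    for filename, payload, attributes in uploads:
        # Only the chunk text goes into the file body.
        uploaded = client.files.create(
            file=(filename, payload), purpose="assistants"
        )
        # The metadata travels separately, as attributes of the store file.
        client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=uploaded.id,
            attributes=attributes,
        )
```

Usage would be along the lines of `attach_with_attributes(client, vs.id, chunk_uploads(records))`, where `records` is the list of chunk dicts that the loop above currently writes to disk. The trade-off is one file per chunk, which means many more uploads than one file per PDF, so this may only be practical for moderately sized corpora.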