I have a vector store containing a set of doc files. What is the best way to get a general table of contents across all the files: a list of JSON objects with each chapter's title, a short description of the chapter, and its order in the final result?
try:
    response = client.responses.create(
        model=AIModelChoices.GPT_4O_MINI,
        max_tool_calls=1,
        parallel_tool_calls=False,
        # --- 1. STRICT OUTPUT CONFIG ---
        text=ResponseTextConfigParam(
            format=ResponseFormatTextJSONSchemaConfigParam(
                type="json_schema",
                name="curriculum_schema",
                schema=schema_dict,
                strict=True,
            )
        ),
        max_output_tokens=1000000,
        # --- 2. IMPROVED PROMPT ---
        instructions=(
            "You are an expert curriculum designer.\n"
            f"CONTEXT: Existing chapters: {context_str}\n"
            f"CONSTRAINT: {density_instruction}\n"
            "TASK:\n"
            "1. Use file_search to scan the beginning of the document for a Table of Contents.\n"
            "2. If no TOC is found, infer chapters from the section headers you see.\n"  # <--- FALLBACK
            "3. Extract titles and short descriptions.\n"
            "4. Merge with existing chapters and re-order.\n"
            "5. YOU MUST OUTPUT JSON. Do not return empty text."
        ),
        input="Generate the chapter list.",
        tools=[
            FileSearchToolParam(
                type="file_search",
                vector_store_ids=vector_store_ids,
                max_num_results=5,
            )
        ],
    )
    logger.info(f"📦 Response Status: {getattr(response, 'status', 'unknown')}")

    # --- 3. EXTRACT AND VALIDATE ---
    # Use the helper property to get the aggregated text
    response_text = response.output_text
    logger.info("=============== RAW JSON START ===============")
    logger.info(f"output_text {response.output_text}")
    logger.info(f"text {response.text}")
    logger.info(f"output {response.output}")
    logger.info(f"queries {response.output[-1].queries}")
    logger.info(f"model_dump_json {response.model_dump_json()}")
    logger.info("================ RAW JSON END ================")
except Exception as e:
    logger.error(f"❌ OpenAI API Error: {e}", exc_info=True)
    return []
This code always returns the output text as empty.
Here’s the skinny:
A vector store extracts each document's text and then chunks it. All documents are combined into one vector store of chunks, and the embeddings allow semantic-search retrieval of the top-ranked chunks for a query string.
That text is not exposed to you, nor is any complete document exposed to the AI. The AI only gets the top-ranked chunks (800 tokens each, by default) placed into its context, from which it can ferret out snippets of knowledge.
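For reference, the chunk size and overlap are set when the vector store is created; a minimal sketch of the `chunking_strategy` payload accepted by `client.vector_stores.create()` (the helper function is my own, and the 800/400 values mirror the documented defaults for the static strategy):

```python
def static_chunking_strategy(max_chunk_size_tokens: int = 800,
                             chunk_overlap_tokens: int = 400) -> dict:
    """Build the static chunking_strategy dict for vector store creation.

    The defaults here match the documented defaults, so passing this
    unchanged reproduces the standard 800-token chunks.
    """
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }

# Usage (sketch):
# vector_store = client.vector_stores.create(
#     name="docs",
#     chunking_strategy=static_chunking_strategy(),
# )
```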
An AI using a vector store cannot "summarize this PDF", nor can it separate the ranked results it gets from a query back into their source documents.
If you want the per-document "table of contents" you describe, generated independently by AI, the best way is the "input_file" method: send a PDF (and only a PDF) as a content part in a user message. That places the whole document into the AI's context, along with images of each page. You can then ask the AI to produce a single-document output based on its observations.
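A minimal sketch of that input_file approach with the openai Python SDK; the helper name, file path, and prompt text are my own, and you would run one request per document:

```python
import base64


def build_pdf_toc_request(pdf_path: str, model: str = "gpt-4o-mini") -> dict:
    """Build kwargs for client.responses.create() that place a whole PDF
    into model context as a base64-encoded input_file content part."""
    with open(pdf_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": model,
        "input": [{
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "filename": pdf_path.rsplit("/", 1)[-1],
                    "file_data": f"data:application/pdf;base64,{encoded}",
                },
                {
                    "type": "input_text",
                    "text": ("List every chapter as a JSON array of objects "
                             "with title, description, and order."),
                },
            ],
        }],
    }

# Usage (sketch):
# response = client.responses.create(**build_pdf_toc_request("manual.pdf"))
# toc = response.output_text
```

Because the whole document is in context (not just retrieved chunks), the model can actually see the TOC pages and section headers instead of guessing from five 800-token snippets.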