Help with file retrieval not finding obvious code

Subject: GPT-4o Vector Search Not Finding Relevant Code Snippets in Large File

Question:

I have a question about file retrieval using a vector store with OpenAI’s GPT-4o model.

I provide the assistant with a 200,000-line code file containing Blender-related code examples to improve its responses when generating scripts. Within this file, I have two major Blender add-ons:

  1. A 111,000-line general-purpose Blender add-on (works fine for common requests).
  2. A 50,000-line Blender add-on focused specifically on Geometry Nodes (not being retrieved properly).

When I query the assistant for general Blender scripting (e.g., “How do I create a cube and add a material?”), the search works well, and it finds relevant examples. However, when I request code related to Geometry Nodes, it fails to find relevant examples—even though there are 50,000 lines dedicated to this exact topic in the file.

Key Details:

  • I’m using GPT-4o with vector search.
  • The vector store is properly set up, and the file is successfully indexed.
  • I’ve tested different prompts and approaches, and I’m experienced in prompt engineering.
  • The Geometry Nodes code is located toward the end of the file—could that impact retrieval?
  • The assistant follows detailed instructions (included below) that direct it to retrieve only existing code and never generate new code.
  • I'm not currently setting a custom temperature—should I be?
  • Would fine-tuning be needed at this point, or should vector search be powerful enough for this?
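One workaround I've been considering (a sketch only—the marker comment is a hypothetical divider, not something in my real file) is splitting the 200,000-line file into two separate files before upload, so the Geometry Nodes add-on is indexed on its own instead of sitting at the tail end of one huge file:

```python
# Hypothetical sketch: split the combined examples file into separate
# per-add-on files before uploading to the vector store. The MARKER
# string below is an assumption -- substitute whatever divider actually
# separates the two add-ons in the real file.

MARKER = "# === GEOMETRY NODES ADD-ON ==="  # hypothetical divider line


def split_examples(text: str, marker: str = MARKER) -> tuple[str, str]:
    """Return (general_addon, geometry_nodes_addon) halves of the file."""
    if marker not in text:
        # No divider found: treat everything as the general add-on.
        return text, ""
    general, _, geo_nodes = text.partition(marker)
    return general, marker + geo_nodes


if __name__ == "__main__":
    combined = "import bpy\n# general code\n" + MARKER + "\n# geo nodes code\n"
    general, geo = split_examples(combined)
    print(len(general.splitlines()), len(geo.splitlines()))
```

Each half could then be uploaded as its own file in the same vector store, which might keep the Geometry Nodes content from being diluted by the much larger general-purpose add-on.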

Questions:

  1. Does the position of the code (near the end of the file) affect retrieval?
  2. Should I be explicitly providing a list of operator names, types, or property names to help retrieval? (e.g., GeometryNode_, bpy.types.GeometryNode, etc.)
  3. Are there specific retrieval settings (e.g., temperature, search parameters) that could improve accuracy?
  4. Would fine-tuning help in this case, or is vector search supposed to be sufficient for this kind of structured code lookup?
  5. Is there a best practice for improving retrieval of domain-specific code when using vector search?
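On question 2, here's the kind of thing I have in mind (a sketch; the section text and keyword lists are made up for illustration): prepending an explicit keyword header to each code section so the embedding search has literal operator and type names to match against.

```python
# Sketch for question 2: tag each code section with a searchable keyword
# comment so vector search has literal terms (operator/type names) to
# match. The example section and keyword list below are assumptions.


def add_keyword_header(section: str, keywords: list[str]) -> str:
    """Prefix a code section with a single searchable keyword line."""
    header = "# KEYWORDS: " + ", ".join(keywords)
    return header + "\n" + section


geo_section = "node = group.nodes.new('GeometryNodeMeshCube')"
tagged = add_keyword_header(
    geo_section,
    ["Geometry Nodes", "bpy.types.GeometryNode", "GeometryNodeMeshCube"],
)
print(tagged.splitlines()[0])
```

Is this kind of explicit keyword tagging a recognized best practice, or does it just add noise?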

Here’s a portion of the assistant’s retrieval instructions for context:


# 1. **Search the User's Code File:** Identify the most relevant or interesting parts based on the user prompt. Prioritize function names, Blender API calls, and operator usage.

# 2. **Identify Relevant Examples:** Search for function definitions, function calls, Blender operator usage, and API patterns related to the user's request. 

# 3. **Provide Code Snippets Only:** Output only existing code retrieved from the file. Do not generate new code.

# 4. **Ensure Relevance:** If a direct match isn’t found, provide the closest matching example and explain its relevance.

# 5. **Access Issues:** If no relevant examples are found, output ‘No relevant examples found’ along with search terms used.

Summary:

  • Why isn’t the assistant finding relevant Geometry Nodes code, despite it being in a large indexed file?
  • What retrieval improvements (e.g., providing explicit search terms, using different settings, fine-tuning) should I consider?
  • How can I ensure that domain-specific code like Geometry Nodes gets retrieved properly?
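For context on the "different settings" part: one thing I'd experiment with is an explicit static chunking strategy when attaching the file. Field names here are from the OpenAI Assistants API vector store docs as I understand them; the values are guesses I'd tune, not recommendations.

```python
# Sketch: an explicit static chunking strategy to try when attaching the
# file to the vector store. Field names follow the OpenAI Assistants API
# vector store file docs (as I understand them); values are guesses.
chunking_strategy = {
    "type": "static",
    "static": {
        "max_chunk_size_tokens": 800,  # the documented default
        "chunk_overlap_tokens": 400,   # must not exceed half the max
    },
}

# This dict would be passed when creating the vector store file, e.g.:
# client.beta.vector_stores.files.create(
#     vector_store_id=vs_id,
#     file_id=file_id,
#     chunking_strategy=chunking_strategy,
# )
print(chunking_strategy["static"]["max_chunk_size_tokens"])
```

Would smaller chunks with more overlap make a 50,000-line section near the end of the file more retrievable, or is chunk size unrelated to the position problem?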

I’d appreciate any guidance on this! Thanks.

Quick idea for an upgrade: right when I clicked submit, I glanced over and noticed the "your topic is similar to…" suggestion, but I didn't have time to stop clicking, and my post went through anyway. Intuitively, I then scrolled down the page looking for a section underneath my topic with those similar-topic suggestions, but there isn't one. I'd put the suggestions there: maybe somebody misses them before posting, but afterwards they could still notice one, click through to the relevant topic, see if it matches, and cancel their own request if it does. It could be an expandable item with a triangle next to it that just says "related topics," so it doesn't take up any space either.

Another idea for an upgrade I just thought of (I'm getting all these good ideas today): since I just posted a question, why not make a specialized version of ChatGPT Plus that does a serious scouring for the answer? I'm talking about five or six requests where it searches the internet, verifies the date and time of responses to make sure they're still relevant, and looks up the API you mention in its documentation—really scouring, maybe six, seven, or eight passes—then puts it all together and makes a Hail Mary throw at answering the user's question immediately, right when they post it. It could be up to ten times more intensive than a standard ChatGPT request.