Subject: GPT-4o Vector Search Not Finding Relevant Code Snippets in Large File
Question:
I have a question about file retrieval using a vector store with OpenAI’s GPT-4o model.
I provide the assistant with a 200,000-line code file containing Blender-related code examples to improve its responses when generating scripts. Within this file, I have two major Blender add-ons:
- A 111,000-line general-purpose Blender add-on (works fine for common requests).
- A 50,000-line Blender add-on focused specifically on Geometry Nodes (not being retrieved properly).
When I query the assistant for general Blender scripting (e.g., “How do I create a cube and add a material?”), the search works well, and it finds relevant examples. However, when I request code related to Geometry Nodes, it fails to find relevant examples—even though there are 50,000 lines dedicated to this exact topic in the file.
Key Details:
- I’m using GPT-4o with vector search.
- The vector store is properly set up, and the file is successfully indexed.
- I’ve tested different prompts and approaches, and I’m experienced in prompt engineering.
- The Geometry Nodes code is located toward the end of the file—could that impact retrieval?
- The assistant follows detailed instructions (included below) that direct it to retrieve only existing code and never generate new code.
- I am not currently setting a temperature value. Should I be?
- Would fine-tuning be needed at this point, or should vector search be powerful enough for this?
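One way to test whether position in the file matters is to split the single large file into smaller parts and index them separately, so the Geometry Nodes code no longer sits at the end of one 200,000-line file. A minimal sketch of such a split, assuming the file is plain Python source; the function name `split_source` and the `max_lines` value are hypothetical and not part of any OpenAI or Blender API:

```python
def split_source(text: str, max_lines: int = 2000) -> list[str]:
    """Split source text into parts of roughly max_lines lines.

    A new part is started only at a top-level `def` or `class` line,
    so a part can run longer than max_lines rather than cut a function in half.
    """
    parts: list[str] = []
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        at_boundary = line.startswith(("def ", "class "))
        # Flush the current part once it is long enough and a new
        # top-level definition begins.
        if current and at_boundary and len(current) >= max_lines:
            parts.append("".join(current))
            current = []
        current.append(line)
    if current:
        parts.append("".join(current))
    return parts
```

Each returned part could then be written to its own file (for example, one set of files per add-on) and uploaded to the vector store in place of the single combined file.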
Questions:
- Does the position of the code (near the end of the file) affect retrieval?
- Should I be explicitly providing a list of operator names, types, or property names to help retrieval? (e.g., GeometryNode_, bpy.types.GeometryNode, etc.)
- Are there specific retrieval settings (e.g., temperature, search parameters) that could improve accuracy?
- Would fine-tuning help in this case, or is vector search supposed to be sufficient for this kind of structured code lookup?
- Is there a best practice for improving retrieval of domain-specific code when using vector search?
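To make the question about explicit search terms concrete, here is a minimal sketch of how such a list could be built from the code file itself. It assumes that Geometry Nodes identifiers in the file start with `GeometryNode` (as in `GeometryNodeTree`); the pattern and the function name `collect_search_terms` are hypothetical and only illustrate the idea:

```python
import re
from collections import Counter

# Assumed pattern: any identifier beginning with "GeometryNode".
GEO_PATTERN = re.compile(r"\bGeometryNode\w*")


def collect_search_terms(source: str, top_n: int = 20) -> list[str]:
    """Return the most frequent Geometry Nodes identifiers found in source."""
    counts = Counter(GEO_PATTERN.findall(source))
    return [name for name, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = (
        'tree = bpy.data.node_groups.new("GN", "GeometryNodeTree")\n'
        'a = tree.nodes.new("GeometryNodeMeshCube")\n'
        'b = tree.nodes.new("GeometryNodeMeshCube")\n'
    )
    print(collect_search_terms(sample))
```

The resulting names could be added to the assistant's instructions or appended to the user's query, so that the search text contains the same identifiers that appear in the indexed code.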
Here’s a portion of the assistant’s retrieval instructions for context:
# 1. **Search the User's Code File:** Identify the most relevant or interesting parts based on the user prompt. Prioritize function names, Blender API calls, and operator usage.
# 2. **Identify Relevant Examples:** Search for function definitions, function calls, Blender operator usage, and API patterns related to the user's request.
# 3. **Provide Code Snippets Only:** Output only existing code retrieved from the file. Do not generate new code.
# 4. **Ensure Relevance:** If a direct match isn’t found, provide the closest matching example and explain its relevance.
# 5. **Access Issues:** If no relevant examples are found, output ‘No relevant examples found’ along with search terms used.
Summary:
- Why isn’t the assistant finding relevant Geometry Nodes code, despite it being in a large indexed file?
- What retrieval improvements (e.g., providing explicit search terms, using different settings, fine-tuning) should I consider?
- How can I ensure that domain-specific code like Geometry Nodes gets retrieved properly?
I’d appreciate any guidance on this! Thanks.