I have an Assistant with file_search, and I asked it this question: “Give me 1 problem presented in the documents”
When I inspect the file search results with this, I see that there are 20 results, with rank scores varying from 0.48 to 0.07.
In the assistant’s answer, there is one annotation, and the document it references is a result that had a rank score of 0.09. Does anyone know how this works? Why did it not use the highest-ranking chunk in its answer?
I don’t know the internal workings of file_search, but in my own implementation of file search using embeddings, the whole retrieval process is mathematical: chunks are ranked with the cosine similarity equation. There is no reasoning and no AI at that stage. So when you submit the results to the API, the API still reserves the right to choose from what you gave it. Perhaps it’s the same here.
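To show what I mean by the retrieval step being purely mathematical, here is a minimal sketch of ranking chunks by cosine similarity. This is not how file_search works internally (I don’t know that); the function names `cosine_similarity` and `rank_chunks` and the tiny 3-dimensional vectors are made up for illustration, where real embeddings would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as not similar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunks):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy data: hand-written vectors standing in for embeddings.
query = [1.0, 0.0, 0.0]
chunks = [
    ("chunk A", [1.0, 0.0, 0.0]),
    ("chunk B", [1.0, 1.0, 0.0]),
    ("chunk C", [0.0, 1.0, 0.0]),
]

ranked = rank_chunks(query, chunks)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

The scores only order the chunks by vector similarity to the query. In my setup the whole ranked list is then passed to the model, and the model decides which chunk actually answers the question, so the chunk it cites does not have to be the top-scoring one.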
Using agentm-py (not fully released yet; I will do that tomorrow), you could do something like this:
import asyncio
import os

from src.core.classify_list_agent import ClassifyListAgent
from src.core.summarize_list_agent import SummarizeListAgent
from src.core.reduce_list_agent import ReduceListAgent


# Step 1: Gather the file list from the codebase directory
def list_files_in_codebase(directory: str):
    file_list = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"):  # You can filter by any file type
                file_list.append(os.path.join(root, file))
    return file_list


# Step 2: Classify files based on relevance to user authentication
async def classify_files_for_task(files):
    classification_criteria = "Classify each file as relevant or not for handling user authentication."
    agent = ClassifyListAgent(list_to_classify=files, classification_criteria=classification_criteria)
    classified_files = await agent.classify_list()
    return classified_files


# Step 3: Summarize the contents of classified files
async def summarize_files(files):
    agent = SummarizeListAgent(list_to_summarize=files)
    summaries = await agent.summarize_list()
    return summaries


# Step 4: Reduce to the most important files based on task goal
async def reduce_to_important_files(files):
    reduction_goal = "Reduce the list to files most essential for user authentication."
    agent = ReduceListAgent(list_to_reduce=files, reduction_goal=reduction_goal)
    reduced_files = await agent.reduce_list()
    return reduced_files


# Full Example Workflow
async def run_file_analysis_workflow():
    # Step 1: List all Python files in the codebase
    codebase_directory = "./your_codebase_directory"
    files = list_files_in_codebase(codebase_directory)
    print("Files in Codebase:", files)

    # Step 2: Classify files based on user authentication task
    classified_files = await classify_files_for_task(files)
    relevant_files = [file['item'] for file in classified_files if file['classification'] == 'relevant']
    print("\nRelevant Files for User Authentication:", relevant_files)

    # Step 3: Summarize the relevant files
    summaries = await summarize_files(relevant_files)
    print("\nSummaries of Relevant Files:", summaries)

    # Step 4: Reduce to the most important files
    reduced_files = await reduce_to_important_files(relevant_files)
    print("\nReduced List of Important Files:", reduced_files)


if __name__ == "__main__":
    asyncio.run(run_file_analysis_workflow())
Which would print something like this:
Files in Codebase: ['src/core/auth.py', 'src/core/database.py', 'src/core/user.py', 'src/core/openai_api.py']
Relevant Files for User Authentication: ['src/core/auth.py', 'src/core/user.py']
Summaries of Relevant Files:
['src/core/auth.py: Handles authentication and session management.',
'src/core/user.py: Manages user data, authentication, and authorization.']
Reduced List of Important Files: ['src/core/auth.py', 'src/core/user.py']