Access to "myfiles_browser"

Hi folks,

I am trying to develop a simple question-and-answer app using RAG for our own academic purposes. The RAG system I built is very simple and follows many publicly available examples that use LangChain (including OpenAIEmbedding, FAISS, and ChatOpenAI). Its performance is not very good. For example, when I ask “What viruses were studied in this paper?”, the app may not get it right.

I also tried uploading the PDF file directly to ChatGPT 4, and it answered much better. When I asked GPT-4, it said it used the myfiles_browser tool to access and review the document. How can I access “myfiles_browser” through the API? I have thousands of papers, and I need answers to dozens of questions about them.

If “myfiles_browser” is not available through the API, how can I improve my RAG app’s accuracy?

Thank you so much!

Hi, did you find a solution or a workaround for “myfiles_browser” not being available through the API?

myfiles_browser is the semantic search tool behind the API Assistants’ “retrieval” of files.

It is what gets used to answer questions about your documents: instead of answering directly, the AI makes a function call (at your expense) to a tool, gets search results, and then “clicks” into documents and reads them to find out whether anything is relevant to the user input.

So one just uses the Assistants retrieval tool, the same as uploading a file to ChatGPT or adding files to a GPT in ChatGPT Plus.
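
To make the search-then-open behaviour described above a bit more concrete, here is a toy sketch in plain Python. It is not how myfiles_browser is actually implemented: the document names and contents are made up, `search` and `open_document` are hypothetical stand-ins for the tool calls, and simple keyword overlap stands in for real semantic (embedding-based) search.

```python
import re
from collections import Counter

# Hypothetical in-memory "uploaded files"; names and contents are made up.
DOCUMENTS = {
    "paper_a.txt": "We studied influenza A virus and measured replication in cell culture.",
    "paper_b.txt": "This article reviews statistical methods for clinical trials.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, top_k=1):
    """Stand-in search tool: rank documents by how many query words they contain."""
    query_words = set(tokenize(query))
    scores = {}
    for name, text in DOCUMENTS.items():
        counts = Counter(tokenize(text))
        scores[name] = sum(counts[word] for word in query_words)
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    # Drop documents that share no words with the query.
    return [name for name in ranked[:top_k] if scores[name] > 0]


def open_document(name):
    """Stand-in for "clicking" into a search result to read it."""
    return DOCUMENTS[name]


def answer_with_tools(question):
    """Search first, then open the best hit; return None if nothing is relevant."""
    hits = search(question)
    if not hits:
        return None
    return open_document(hits[0])
```

In the real tool the model itself decides when to call the search and what to open; here `answer_with_tools` hard-codes that sequence only to show the shape of the loop.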

Remember that there are differences between ChatGPT and the APIs.

ChatGPT will read the document you’ve uploaded to a single conversation thread or to the knowledge of a custom GPT.

To use a similar system through the API you’ll need to implement it using the Knowledge Retrieval tool of the Assistants API.

More information on the Knowledge Retrieval tool: https://platform.openai.com/docs/assistants/tools/knowledge-retrieval

And about saved files for API use: https://platform.openai.com/docs/api-reference/files

Thanks for your suggestions. I tried the example at https://platform.openai.com/docs/assistants/tools/file-search. For regular short questions it seems to work, for example “Who are the authors of this article?”

When I ask the tool to extract a whole section (for example, the Methods section) from a scientific article, it is not able to. Even though I specifically asked it to extract the complete section, it only extracted part of it.

But when I uploaded the article through the ChatGPT UI (using GPT-4), it was able to extract the complete section. So I guess file search (RAG) is still not working as well as “myfiles_browser”?

Any suggestions for extracting a whole section (such as the Methods section) from a PDF file? Any help would be greatly appreciated!

from openai import OpenAI

client = OpenAI(api_key="sk-***")
 
assistant = client.beta.assistants.create(
  name="Scientific Research Assistant",
  instructions="You are an expert biomedical researcher. Use your knowledge base to answer questions about research articles.",
  model="gpt-4-turbo-2024-04-09",
  tools=[{"type": "file_search"}],
  temperature=0,
)


vector_store = client.beta.vector_stores.create(name="Scientific Research")

file_paths = ["1234.pdf"]
file_streams = [open(path, "rb") for path in file_paths]
 
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
  vector_store_id=vector_store.id, files=file_streams
)
 
print(file_batch.status)
print(file_batch.file_counts)

# Attach the vector store to the assistant so file_search can query it.
assistant = client.beta.assistants.update(
  assistant_id=assistant.id,
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# Upload the PDF again so it can be attached directly to the user message.
with open("1234.pdf", "rb") as pdf:
    message_file = client.files.create(file=pdf, purpose="assistants")

section = "Methods"
# Build the prompt from adjacent string literals; the original stray backslash ended up inside the prompt text.
content = (
    f"Please extract the {section} section from the following text, which is a scientific article. "
    f"Please provide me the whole {section} section completely. "
    "Do not miss, change, or summarize any of the words in the section."
)

thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": content,
      # Attach the new file to the message.
      "attachments": [
        { "file_id": message_file.id, "tools": [{"type": "file_search"}] }
      ],
    }
  ]
)

# Run the assistant on the thread and wait for the run to finish.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    message_content.value = message_content.value.replace(annotation.text, f"[{index}]")
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")

print(message_content.value)
print("\n".join(citations))
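
One possible workaround for the whole-section case, offered as a sketch rather than a confirmed fix: file_search answers from retrieved chunks rather than from the whole document, which is a likely reason only part of a long section comes back. If the PDF text can be extracted to plain text first (for example with a library such as pypdf, which is an assumption and not shown here), the section can be cut out locally by looking for heading lines. The heading list and the `extract_section` helper below are hypothetical and assume each heading sits on its own line, optionally preceded by a section number.

```python
import re

# Assumed heading names; real articles vary, so adjust this list as needed.
HEADINGS = [
    "Abstract", "Introduction", "Methods", "Materials and Methods",
    "Results", "Discussion", "Conclusion", "References",
]


def extract_section(text, section, headings=HEADINGS):
    """Return the text between the `section` heading line and the next known heading.

    Returns None if no heading line matching `section` is found.
    """
    # Longer names first so "Materials and Methods" is tried before "Methods".
    names = sorted(headings, key=len, reverse=True)
    heading_re = re.compile(
        r"^[ \t]*(?:\d+\.?[ \t]+)?(" + "|".join(re.escape(n) for n in names) + r")[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(heading_re.finditer(text))
    for i, match in enumerate(matches):
        if match.group(1).lower() == section.lower():
            start = match.end()
            # The section ends where the next heading line begins, or at end of text.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            return text[start:end].strip()
    return None


sample = (
    "Introduction\nViruses matter.\n\n"
    "Methods\nCells were infected.\nRNA was sequenced.\n\n"
    "Results\nTiters rose.\n"
)
methods_text = extract_section(sample, "Methods")
```

The extracted text can then be used directly or passed to the model in the message content. Articles whose headings are worded differently, or whose extracted text merges headings into body lines, would need a different heading pattern.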