Hi all,
I created an assistant with both “File Search” and “Code Interpreter” enabled.
But the issue I’m facing now is that the response I get comes from “Data retrieval”, with “Sorry, I could not find…”, while the query should be handled by the code interpreter.
Any tips?
Any tips on the prompt specifically?
Here is the prompt that I’m using:
You are a helpful financial assistant.
Answer queries only by writing code or by using the provided context.
If the query is related to payroll information, then write code to answer the query.
Don't rely on prior knowledge.
If the answer cannot be found within the context, acknowledge this accordingly.
The response should be in the same language as the query (English or Arabic).
Steps:
1. Read and comprehend the given context and the query.
2. Determine the language of the query (English or Arabic).
3. Decide whether to write code or search the files to answer the query.
4. Look for the answer within the provided context or by writing code.
5. If the answer is found, respond accurately in the language of the query.
6. If the answer is not found, respond with an apology in the same language as the query.
Output Format:
- Provide a direct answer in the same language as the query.
- If the answer cannot be found within the context, state:
- English: "I'm sorry, the answer cannot be found in the provided data."
- Arabic: "عذراً، لا يمكن العثور على الإجابة في البيانات المقدّمة."
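For context, here is roughly how I set the assistant up (a minimal sketch with the openai Python SDK; the model name is just an example, and the prompt above goes into `instructions`):

```python
from openai import OpenAI

client = OpenAI()

PROMPT = """You are a helpful financial assistant.
...(the full prompt above)...
"""

# Both tools enabled, as described; Assistants API (beta).
assistant = client.beta.assistants.create(
    model="gpt-4o",  # example model
    instructions=PROMPT,
    tools=[{"type": "file_search"}, {"type": "code_interpreter"}],
)
```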
I hope I’ve adequately demonstrated that an AI on Assistants can be made to understand and use its knowledge files, make same-turn calls to code interpreter alongside its use of file search (with some user prompting), and can even get on the internet once you overcome a few reluctances. All for 11k tokens.
Thanks for your answer, but I’m afraid you misunderstood the problem.
Code Interpreter is used to both:
- create Python code and resolve more complex tasks like maths, or render graphs
- analyse Excel files
It’s not possible, though, to send Excel files (or CSV, etc.) to the File Search tool; you HAVE to use Code Interpreter.
But if you need both File Search (to analyze a PDF file) and Code Interpreter (to analyze an Excel file containing data related to the contents of that PDF, for example), it simply does not work: the assistant will constantly try to use File Search and fail, of course, because the Excel file is attached to the Code Interpreter tool and not the File Search tool.
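(To be clear about the setup: this is how the files end up split between the two tools. A sketch with the Python SDK; file names and the assistant ID are placeholders.)

```python
from openai import OpenAI

client = OpenAI()

# Excel/CSV files can only be attached to code_interpreter...
excel_file = client.files.create(file=open("data.xlsx", "rb"), purpose="assistants")
# ...while the PDF goes into a vector store used by file_search.
pdf_file = client.files.create(file=open("report.pdf", "rb"), purpose="assistants")
vector_store = client.beta.vector_stores.create(name="docs", file_ids=[pdf_file.id])

client.beta.assistants.update(
    assistant_id="asst_...",  # placeholder ID
    tool_resources={
        "code_interpreter": {"file_ids": [excel_file.id]},
        "file_search": {"vector_store_ids": [vector_store.id]},
    },
)
```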
Again, it is a failing of the AI model that can be overcome by prompting. However, it is more prompting than a user should be expected to provide; you would have to handle it in your application with additional_instructions or message injection, with your mad skills.
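For instance, something along these lines per run (a sketch; the steering text is only an example of the kind of instruction needed):

```python
from openai import OpenAI

client = OpenAI()
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What was total payroll in March?"
)

# Inject tool-routing guidance on every run, so the end user
# never has to write it themselves.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id="asst_...",  # placeholder ID
    additional_instructions=(
        "Questions about payroll or other tabular data must be answered by "
        "writing Python against the attached Excel file with code interpreter; "
        "use file search only for questions about the PDF contents."
    ),
)
```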
The data from the Excel file provides insights into the performance of GPT-4 compared to GPT-3.5-Turbo in the context of genetic questions. Here’s a breakdown of the findings from the relevant sheets:
GPT 4 Results Sheet: This sheet contains responses to various questions, with scores indicating the quality of responses. The scores range from 1 to 7, with higher scores indicating better performance. The questions cover a range of genetic topics, such as the meaning of pathogenic mutations, cancer risk with BRCA1 mutations, and treatment options.
Accuracy (Qx_1) Sheet: This sheet provides accuracy scores for different questions, evaluated by genetic counselors and clinical geneticists. The scores are given for both GPT-3.5 and GPT-4 models. For example, for the question “What does it mean to have a pathogenic mutation in BRCA1?”, GPT-3.5 received scores of 4, 3, 4, and 4 from different evaluators, while GPT-4 received scores of 5, 4, 5, and 4, indicating an improvement in accuracy.
Relevancy (Qx_2) Sheet: This sheet evaluates the relevancy of the responses. Similar to the accuracy sheet, it provides scores from different evaluators. For instance, for the question “What is my risk of cancer with a BRCA1 mutation?”, GPT-3.5 received scores of 2, 3, 5, and 4, whereas GPT-4 received scores of 5, 4, 4, and 4, showing a significant improvement in relevancy.
From these sheets, it is evident that GPT-4 shows improvements over GPT-3.5-Turbo in both accuracy and relevancy across various genetic questions. The improvements are particularly notable in questions related to understanding genetic risks and treatment options, where GPT-4 consistently scores higher than GPT-3.5-Turbo.
Next, I will search the PDF for any additional relevant information regarding GPT-4’s performance improvements in genetics.
analysis bot
The analysis of GPT-4’s performance in genetic contexts, as compared to GPT-3.5-Turbo, reveals several key improvements and insights:
Accuracy and Relevance: GPT-4 has shown significant improvements in both accuracy and relevance of responses to genetic questions. The overall accuracy score for GPT-4 was 4.17 compared to 3.38 for GPT-3.5, indicating a substantial enhancement[1]. This improvement is consistent across different genetic topics, including BRCA1-related breast cancer, HFE-related hemochromatosis, and MLH1-related Lynch syndrome[2].
Specific Improvements: The study highlighted specific questions where GPT-4 showed marked improvements over GPT-3.5. These include questions about the treatment implications of BRCA1 mutations, inheritance risks of hemochromatosis, and cancer risks associated with Lynch syndrome[3]. These questions were among the lowest scoring in the GPT-3.5 model, indicating that GPT-4 has addressed some of the previous model’s weaknesses.
Statistical Significance: The improvements in GPT-4’s performance were statistically significant, with a mixed model analysis showing a version effect (P<0.0001) and a question-level effect (P<0.0001)[4]. This suggests that the enhancements are not only due to the model’s evolution but also its ability to handle specific types of questions more effectively.
Challenges and Limitations: Despite these improvements, GPT-4 still faces challenges, particularly in maintaining conversational tone and understandability. The BUS-15 scores, which assess usability and conversational quality, showed a decline from GPT-3.5 to GPT-4, indicating that while the accuracy improved, the naturalness of interactions may have decreased[5].
Ethical and Practical Considerations: The study also emphasizes the need for ongoing refinement and ethical considerations in deploying AI models like GPT-4 in healthcare settings. The potential for outdated information and the scripted nature of responses are areas that require attention[6].
In summary, GPT-4 demonstrates significant improvements over GPT-3.5-Turbo in delivering accurate and relevant genetic information, particularly in complex scenarios. However, it also highlights the need for balancing accuracy with conversational quality and the importance of continuous updates to ensure the information remains current and reliable.
(I’ll add that using code interpreter is kind of a crutch to get at Excel data; you could provide a “read excel docs” function that lists the documents and returns their contents.)
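(Roughly like this, as a sketch; the function name, schema, and document registry are made up for illustration:)

```python
import pandas as pd

# Hypothetical function tool passed in the assistant's `tools` list:
READ_EXCEL_TOOL = {
    "type": "function",
    "function": {
        "name": "read_excel_doc",
        "description": "List the available Excel documents, or return one sheet's rows as CSV.",
        "parameters": {
            "type": "object",
            "properties": {
                "filename": {"type": "string", "description": "Omit to list all documents."},
                "sheet": {"type": "string"},
            },
        },
    },
}

AVAILABLE_DOCS = ["payroll.xlsx"]  # your own registry of uploaded documents

# Backend handler, invoked when the run requires action on this tool:
def read_excel_doc(filename=None, sheet=None):
    if filename is None:
        return "\n".join(AVAILABLE_DOCS)
    df = pd.read_excel(filename, sheet_name=sheet or 0)
    return df.to_csv(index=False)  # hand the data back to the model as text
```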
After trying back and forth with prompting, here is my final conclusion.
It does not matter how hard you try with your prompt; there will be a vast array of questions that the model will fail to handle through the “code_interpreter”, because it thinks that the nature/meaning of the question requires “data_retrieval” or “file_search”.
For me, at least for now, the best way to get high-quality results is to send extra instructions along with the query, as illustrated below:
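Roughly like this (a sketch with the Python SDK; the exact hint wording is what I keep experimenting with):

```python
from openai import OpenAI

client = OpenAI()

# Hint prepended to the user's query; the wording is illustrative.
TOOL_HINT = (
    "Note to assistant: this question concerns the tabular payroll data. "
    "Answer it by writing Python code with the code interpreter tool, "
    "not by searching files."
)

def ask(thread_id, assistant_id, query):
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=f"{TOOL_HINT}\n\n{query}",
    )
    return client.beta.threads.runs.create(
        thread_id=thread_id, assistant_id=assistant_id
    )
```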
Interesting … but hardly acceptable for a production environment … you can’t expect average users to properly preselect the appropriate tool …
maybe this can be overcome with a function tool that would in turn tell the assistant which tool to use in the end … but this seems overkill … the “auto” mode is meant to do just that … and it fails …
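(Something like this is what I mean, as a sketch; the function name and the keyword check are made up, and a real classifier would need to be much smarter:)

```python
# A "router" function tool: the assistant is instructed to call it first,
# and its output tells the model which built-in tool to use.
ROUTER_TOOL = {
    "type": "function",
    "function": {
        "name": "route_query",
        "description": "Always call this first. Returns which tool must be used to answer the query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

PAYROLL_KEYWORDS = ("salary", "payroll", "wage", "راتب")  # naive classifier

def route_query(query: str) -> str:
    if any(k in query.lower() for k in PAYROLL_KEYWORDS):
        return "Use the code interpreter tool on the attached Excel file."
    return "Use the file search tool on the indexed documents."
```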
I have to admit that this model worked a little better. However, it only worked for simple questions about simple tabular data. It failed me again when I leveled up the complexity of the data or the query.
Also, another issue is pricing: gpt-4-0125-preview is 4x more expensive than gpt-4o in terms of input tokens, and 3x more expensive in terms of output tokens.
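(For the record, assuming the list prices at the time, $10/$30 per 1M tokens for gpt-4-0125-preview versus $2.50/$10 for gpt-4o, which may have changed since, the ratios work out as stated:)

```python
# Assumed per-1M-token list prices; verify against current pricing.
preview_in, preview_out = 10.00, 30.00  # gpt-4-0125-preview
gpt4o_in, gpt4o_out = 2.50, 10.00       # gpt-4o

print(preview_in / gpt4o_in)    # 4.0 -> 4x more expensive on input
print(preview_out / gpt4o_out)  # 3.0 -> 3x more expensive on output
```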