Hi all,
I created an assistant with both “File Search” and “Code Interpreter” enabled.
But the issue I’m facing now is that the response I get comes from “Data retrieval”, with “Sorry, I could not find…”, while the query should be handled by the code interpreter.
Any tips?
Any tips on the prompt specifically?
Here is the prompt that I’m using:
You are a helpful financial assistant.
Answer queries only by writing code or by using the provided context.
If the query is related to payroll information, then write code to answer the query.
Don't rely on prior knowledge.
If the answer cannot be found within the context, acknowledge this accordingly.
The response should be in the same language as the query (English or Arabic).
Steps:
1. Read and comprehend the given context and the query.
2. Determine the language of the query (English or Arabic).
3. Decide whether to write code or search the files to answer the query.
4. Look for the answer within the provided context or by writing code.
5. If the answer is found, respond accurately in the language of the query.
6. If the answer is not found, respond with an apology in the same language as the query.
Output Format:
- Provide a direct answer in the same language as the query.
- If the answer cannot be found within the context, state:
- English: "I'm sorry, the answer cannot be found in the provided data."
- Arabic: "عذراً، لا يمكن العثور على الإجابة في البيانات المقدّمة."
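For context, here is roughly how I set the assistant up (a minimal sketch with the openai Python SDK; the model name is just an example, and the prompt above goes into `instructions`):

```python
from openai import OpenAI

client = OpenAI()

PROMPT = """You are a helpful financial assistant.
...(the full prompt above)...
"""

# Both tools enabled, as described; Assistants API (beta).
assistant = client.beta.assistants.create(
    model="gpt-4o",  # example model
    instructions=PROMPT,
    tools=[{"type": "file_search"}, {"type": "code_interpreter"}],
)
```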
I hope I’ve adequately demonstrated that an AI on Assistants can be made to understand and use its knowledge files, make same-turn calls to code interpreter alongside its use of file search (with some user prompting), and can even get on the internet once you overcome a few reluctances. All for 11k tokens.
Thanks for your answer, but I’m afraid you misunderstood the problem.
Code Interpreter is used to both:
- create Python code and resolve more complex tasks like maths, or render graphs
- analyse Excel files
It’s not possible, though, to send Excel files (or CSV, etc.) to the File Search tool; you HAVE to use Code Interpreter.
But if you need both File Search (to analyze a PDF file) and Code Interpreter (to analyze an Excel file containing data related to the contents of that PDF, for example), it simply does not work: the assistant will constantly try to use File Search and fail, of course, because the Excel file is attached to the Code Interpreter tool and not the File Search tool.
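(To be clear about the setup: this is how the files end up split between the two tools. A sketch with the Python SDK; file names and the assistant ID are placeholders.)

```python
from openai import OpenAI

client = OpenAI()

# Excel/CSV files can only be attached to code_interpreter...
excel_file = client.files.create(file=open("data.xlsx", "rb"), purpose="assistants")
# ...while the PDF goes into a vector store used by file_search.
pdf_file = client.files.create(file=open("report.pdf", "rb"), purpose="assistants")
vector_store = client.beta.vector_stores.create(name="docs", file_ids=[pdf_file.id])

client.beta.assistants.update(
    assistant_id="asst_...",  # placeholder ID
    tool_resources={
        "code_interpreter": {"file_ids": [excel_file.id]},
        "file_search": {"vector_store_ids": [vector_store.id]},
    },
)
```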
Again, it is a failing of the AI model that can be overcome by prompting. However, it is more prompting than a user should be expected to provide; you would have to handle it in your application with additional_instructions or message injection, with your mad skills.
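For instance, something along these lines per run (a sketch; the steering text is only an example of the kind of instruction needed):

```python
from openai import OpenAI

client = OpenAI()
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What was total payroll in March?"
)

# Inject tool-routing guidance on every run, so the end user
# never has to write it themselves.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id="asst_...",  # placeholder ID
    additional_instructions=(
        "Questions about payroll or other tabular data must be answered by "
        "writing Python against the attached Excel file with code interpreter; "
        "use file search only for questions about the PDF contents."
    ),
)
```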
The data from the Excel file provides insights into the performance of GPT-4 compared to GPT-3.5-Turbo in the context of genetic questions. Here’s a breakdown of the findings from the relevant sheets:
GPT 4 Results Sheet: This sheet contains responses to various questions, with scores indicating the quality of responses. The scores range from 1 to 7, with higher scores indicating better performance. The questions cover a range of genetic topics, such as the meaning of pathogenic mutations, cancer risk with BRCA1 mutations, and treatment options.
Accuracy (Qx_1) Sheet: This sheet provides accuracy scores for different questions, evaluated by genetic counselors and clinical geneticists. The scores are given for both GPT-3.5 and GPT-4 models. For example, for the question “What does it mean to have a pathogenic mutation in BRCA1?”, GPT-3.5 received scores of 4, 3, 4, and 4 from different evaluators, while GPT-4 received scores of 5, 4, 5, and 4, indicating an improvement in accuracy.
Relevancy (Qx_2) Sheet: This sheet evaluates the relevancy of the responses. Similar to the accuracy sheet, it provides scores from different evaluators. For instance, for the question “What is my risk of cancer with a BRCA1 mutation?”, GPT-3.5 received scores of 2, 3, 5, and 4, whereas GPT-4 received scores of 5, 4, 4, and 4, showing a significant improvement in relevancy.
From these sheets, it is evident that GPT-4 shows improvements over GPT-3.5-Turbo in both accuracy and relevancy across various genetic questions. The improvements are particularly notable in questions related to understanding genetic risks and treatment options, where GPT-4 consistently scores higher than GPT-3.5-Turbo.
Next, I will search the PDF for any additional relevant information regarding GPT-4’s performance improvements in genetics.
analysis bot
The analysis of GPT-4’s performance in genetic contexts, as compared to GPT-3.5-Turbo, reveals several key improvements and insights:
Accuracy and Relevance: GPT-4 has shown significant improvements in both accuracy and relevance of responses to genetic questions. The overall accuracy score for GPT-4 was 4.17 compared to 3.38 for GPT-3.5, indicating a substantial enhancement[1]. This improvement is consistent across different genetic topics, including BRCA1-related breast cancer, HFE-related hemochromatosis, and MLH1-related Lynch syndrome[2].
Specific Improvements: The study highlighted specific questions where GPT-4 showed marked improvements over GPT-3.5. These include questions about the treatment implications of BRCA1 mutations, inheritance risks of hemochromatosis, and cancer risks associated with Lynch syndrome[3]. These questions were among the lowest scoring in the GPT-3.5 model, indicating that GPT-4 has addressed some of the previous model’s weaknesses.
Statistical Significance: The improvements in GPT-4’s performance were statistically significant, with a mixed model analysis showing a version effect (P<0.0001) and a question-level effect (P<0.0001)[4]. This suggests that the enhancements are not only due to the model’s evolution but also its ability to handle specific types of questions more effectively.
Challenges and Limitations: Despite these improvements, GPT-4 still faces challenges, particularly in maintaining conversational tone and understandability. The BUS-15 scores, which assess usability and conversational quality, showed a decline from GPT-3.5 to GPT-4, indicating that while the accuracy improved, the naturalness of interactions may have decreased[5].
Ethical and Practical Considerations: The study also emphasizes the need for ongoing refinement and ethical considerations in deploying AI models like GPT-4 in healthcare settings. The potential for outdated information and the scripted nature of responses are areas that require attention[6].
In summary, GPT-4 demonstrates significant improvements over GPT-3.5-Turbo in delivering accurate and relevant genetic information, particularly in complex scenarios. However, it also highlights the need for balancing accuracy with conversational quality and the importance of continuous updates to ensure the information remains current and reliable.
(I’ll add that using code interpreter is kind of a crutch to get at Excel data; you could provide a “read excel docs” function that lists the documents and returns their contents.)
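(Roughly like this, as a sketch; the function name, schema, and document registry are made up for illustration:)

```python
import pandas as pd

# Hypothetical function tool passed in the assistant's `tools` list:
READ_EXCEL_TOOL = {
    "type": "function",
    "function": {
        "name": "read_excel_doc",
        "description": "List the available Excel documents, or return one sheet's rows as CSV.",
        "parameters": {
            "type": "object",
            "properties": {
                "filename": {"type": "string", "description": "Omit to list all documents."},
                "sheet": {"type": "string"},
            },
        },
    },
}

AVAILABLE_DOCS = ["payroll.xlsx"]  # your own registry of uploaded documents

# Backend handler, invoked when the run requires action on this tool:
def read_excel_doc(filename=None, sheet=None):
    if filename is None:
        return "\n".join(AVAILABLE_DOCS)
    df = pd.read_excel(filename, sheet_name=sheet or 0)
    return df.to_csv(index=False)  # hand the data back to the model as text
```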
After trying back and forth with prompting, here is my final conclusion.
It does not matter how hard you try with your prompt; there will be a vast array of questions that the model will fail to handle through the “code_interpreter”, because it thinks that the nature/meaning of the question requires “data_retrieval” or “file_search”.
For me, at least for now, the best way to get high-quality results is to send extra instructions along with the query, as illustrated below:
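Roughly like this (a sketch with the Python SDK; the exact hint wording is what I keep experimenting with):

```python
from openai import OpenAI

client = OpenAI()

# Hint prepended to the user's query; the wording is illustrative.
TOOL_HINT = (
    "Note to assistant: this question concerns the tabular payroll data. "
    "Answer it by writing Python code with the code interpreter tool, "
    "not by searching files."
)

def ask(thread_id, assistant_id, query):
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=f"{TOOL_HINT}\n\n{query}",
    )
    return client.beta.threads.runs.create(
        thread_id=thread_id, assistant_id=assistant_id
    )
```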
Interesting … but hardly acceptable for a production environment … you can’t expect average users to properly preselect the appropriate tool …
maybe this can be overcome with a function tool that would in turn tell the assistant which tool to use in the end … but this seems overkill … the “auto” mode is meant to do just that … and it fails …
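(Something like this is what I mean, as a sketch; the function name and the keyword check are made up, and a real classifier would need to be much smarter:)

```python
# A "router" function tool: the assistant is instructed to call it first,
# and its output tells the model which built-in tool to use.
ROUTER_TOOL = {
    "type": "function",
    "function": {
        "name": "route_query",
        "description": "Always call this first. Returns which tool must be used to answer the query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

PAYROLL_KEYWORDS = ("salary", "payroll", "wage", "راتب")  # naive classifier

def route_query(query: str) -> str:
    if any(k in query.lower() for k in PAYROLL_KEYWORDS):
        return "Use the code interpreter tool on the attached Excel file."
    return "Use the file search tool on the indexed documents."
```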
I have to admit that this model worked a little better. However, it only worked for simple questions about simple tabular data. It failed me again when I leveled up the complexity of the data or the query.
Also, another issue is pricing: gpt-4-0125-preview is 4x more expensive than gpt-4o in terms of input tokens, and 3x more expensive in terms of output tokens.
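(For the record, assuming the list prices at the time, $10/$30 per 1M tokens for gpt-4-0125-preview versus $2.50/$10 for gpt-4o, which may have changed since, the ratios work out as stated:)

```python
# Assumed per-1M-token list prices; verify against current pricing.
preview_in, preview_out = 10.00, 30.00  # gpt-4-0125-preview
gpt4o_in, gpt4o_out = 2.50, 10.00       # gpt-4o

print(preview_in / gpt4o_in)    # 4.0 -> 4x more expensive on input
print(preview_out / gpt4o_out)  # 3.0 -> 3x more expensive on output
```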