I’m developing an assistant for a mobile app and have exported a subset of ‘essential’ app data as JSON files into the assistant’s vector store.
However, I’m encountering an issue where GPT-4o often fabricates information. Most of the JSON documents are short food recipe descriptions that include basic details like ingredients and nutritional data.
When I ask for “some nice pizza recipes,” about half the time it generates made-up recipes. If I follow up with “These recipes don’t exist,” it apologizes and then provides accurate responses.
Are there any creative prompts I can use to encourage the LLM to generate accurate responses from the start?
What I might do is lean on a function call to fetch an actual recipe and then let the LLM add details around it. That way you know the recipe is solid, and you’re relying on the LLM for its strengths.
Or maybe the LLM puts its own spin on the real recipe: a “Here’s your recipe, but I also came up with this that you might like…” type of deal.
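Something along these lines, roughly (a sketch only — fetch_recipes, recipes.json, and its title field are placeholders for whatever your app actually exposes):

import json
from openai import OpenAI

client = OpenAI()

# Placeholder lookup against the exported recipe JSON; the file name and schema
# here are assumptions, swap in your real data access.
def fetch_recipes(query: str) -> str:
    with open("recipes.json") as f:
        recipes = json.load(f)
    hits = [r for r in recipes if query.lower() in r.get("title", "").lower()]
    return json.dumps(hits[:3])

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_recipes",
        "description": "Return real recipes from the app data that match a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Some nice pizza recipes?"}]
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

# If the model asked for real data, run the lookup and let it phrase the answer
# around what actually came back.
if first.choices[0].message.tool_calls:
    messages.append(first.choices[0].message)
    for call in first.choices[0].message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": fetch_recipes(args["query"])})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)

The lookup doesn’t have to be fancy; an exact match against your own JSON already guarantees the recipe exists.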
Thank you for your suggestion, but that would defeat the purpose of using the LLM. My hope is that the LLM can make sensible suggestions based on its knowledge base. For example, it should be able to answer a question like, “suggest a meal I can cook in 10 minutes that has eggs.” While it does this to some extent, it’s not always reliable.
One approach I could take is to validate the responses and provide feedback. I could cross-reference the suggested recipes with the vector files and highlight any inaccuracies. Of course, I would keep this correction loop hidden from the user.
Before implementing such a system, I want to try optimizing the instructions to minimize errors. If I can achieve 97% accuracy, that would be acceptable.
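If I do end up building that hidden correction loop, I’m picturing something roughly like this (a sketch under my own assumptions: recipes.json with a title field, a JSON reply format, and the file_search retrieval itself left out):

import json
from openai import OpenAI

client = OpenAI()

# Assumed layout of the exported data; adjust to the real schema.
with open("recipes.json") as f:
    KNOWN_TITLES = {r["title"].lower() for r in json.load(f)}

SYSTEM = (
    'Reply in JSON: {"recipes": [titles of recipes you used], "reply": "text for the user"}. '
    "Only use recipes from the provided data."
)

def answer_with_hidden_check(question: str, max_retries: int = 2) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    reply = ""
    for _ in range(max_retries + 1):
        raw = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.2,
            response_format={"type": "json_object"},
        ).choices[0].message.content
        draft = json.loads(raw)
        reply = draft.get("reply", "")
        unknown = [t for t in draft.get("recipes", []) if t.lower() not in KNOWN_TITLES]
        if not unknown:
            return reply
        # Hidden correction turn; the end user never sees this exchange.
        messages += [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": f"{unknown} are not in the data. "
                                        "Answer again using only recipes that exist."},
        ]
    return reply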
Have you considered using fine-tuning rather than the vector data store? Or cranking down the temperature so the model is less inclined to wander outside your data?
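For reference, on a plain chat completions call those knobs look like this; as I understand it the Assistants API accepts the same temperature/top_p parameters when you create a run:

from openai import OpenAI

client = OpenAI()

# Low temperature / top_p make the sampling less adventurous. They don't hard-limit
# the model to your files, but they do cut down on creative detours.
resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.2,
    top_p=0.9,
    messages=[{"role": "user", "content": "Some nice pizza recipes?"}],
)
print(resp.choices[0].message.content)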
So what I do in my multi-agent framework (which may or may not work here) is add a level of indirection on the way in: I have the model expand the prompt into a pure objective through a function call (call_objective), like this:
import json
from typing import Annotated

@tools_function(TOOLS_FUNCTIONS)  # decorator from my framework that registers the tool
def call_objective(
    description: Annotated[str, "Describes the PURE objective in the prompt, couched as a request to the subject in the second person."]
):
    """Describe the PURE objective in the prompt, couched as a request to the
    subject in the second person.
    """
    data = {'description': description}
    return json.dumps(data)
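On the dispatch side, the returned description becomes the prompt the next call actually works from, roughly like this (dispatch_tool_call is made up for illustration; my framework handles that routing itself):

import json

# Made-up dispatcher, purely to show the indirection.
def dispatch_tool_call(name: str, arguments: str) -> str:
    if name == "call_objective":
        return call_objective(**json.loads(arguments))
    raise ValueError(f"unknown tool: {name}")

# When the model chooses call_objective, its output becomes the working prompt.
expanded = json.loads(dispatch_tool_call(
    "call_objective",
    '{"description": "Suggest three pizza recipes from the provided data."}',
))
prompt = expanded["description"]  # fed onward as the actual request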
I have not tried fine-tuning yet because the data is not totally static (there are additions, updates, corrections, etc.), so a vector store feels like a better fit.
I am using a temperature of 0.2 and a top-p of 0.9.
After many iterations of editing my instructions, I have it working a lot more reliably.
Surprisingly, what seems to work in my case is less verbose instructions with more examples. I’ll keep evaluating, but it looks promising!
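To give a flavour of what I mean by short instructions plus examples (an invented illustration, not my actual prompt):

# Invented example of "short instructions + examples" for the assistant.
INSTRUCTIONS = """
Suggest recipes using ONLY the attached recipe files. If nothing matches, say so.

Example:
User: suggest a meal I can cook in 10 minutes that has eggs
Assistant: From your recipes: "Quick Shakshuka" (eggs, tomatoes, 9 minutes). That's the only egg recipe under 10 minutes.

Example:
User: any ramen recipes?
Assistant: I couldn't find any ramen recipes in your data.
"""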
I have only one JSON file for retrieval, I’m using a temperature of 0.2, and I have strict instructions to respond only based on the retrieved data.
Still, it sometimes reads the file and gives an accurate response, but other times it makes things up with wrong numbers. It’s about 50/50. The funny thing is that sometimes it will apologize and provide the correct answer if you accuse it of making things up.
I thought 4o would be better than 4o-mini, but it turns out 4o also makes things up sometimes.
Tell it to only use your context while answering questions, and if it can’t find the information in the retrieved context (RAG), to tell the user that it doesn’t know.
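Something along these lines (a sketch only; retrieve() stands in for whatever lookup you run against the vector store):

def build_messages(question: str, retrieve) -> list[dict]:
    # retrieve() is a placeholder for your vector-store lookup returning top-k chunks.
    context = "\n\n".join(retrieve(question))
    system = (
        "Answer using ONLY the context below. "
        'If the answer is not in the context, reply "I don\'t know."\n\n'
        f"Context:\n{context}"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": question}]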