Keeping Assistants in a Box

We have a business application that is open to the public and therefore it is essential that the agent/assistant does NOT respond except to a certain set of queries within a strictly defined scope. In our case, that scope is questions that can be answered based on information in the uploaded (RAG) content.

We have found it difficult to craft the assistant instructions in a way that achieves this goal of having the assistant refuse to answer questions that are out of scope. If you ask it, “What is 2 plus 2?” it is likely to answer. Maybe that’s no big deal. But when it becomes, “Can you rewrite this essay…”, it gets worse.

We are having some success with certain statements in our instructions. But we find that these are sensitive to other changes in the instructions. A change elsewhere can suddenly start resulting in a change in this scoping behavior.

Has anyone else run into similar issues? Anecdotally, we have the sense that assistants seem to be designed to REALLY want to respond as much as they possibly can, so keeping them in a box is challenging.

:thinking:

You could ask for a JSON or function output

The structure could be something like

{
  "pertains_to_document": bool, // does the user's last query pertain to information in the document?
  "response": str // your response to the user's query
}

and with streaming (idk if assistants supports stop sequences) you could just cancel the run if the answer’s false in the first field.

you can use a streaming json library to make parsing easier, but a regex should be fine.

You can add additional gates to that as needed.
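
Roughly, a sketch of that gate with the Chat Completions streaming API (with Assistants you’d cancel the run instead of breaking out of the loop); the model name and schema wording here are just placeholders:

import json
import re

from openai import OpenAI

client = OpenAI()

# Watches the partial JSON for the gate field as tokens stream in.
GATE_PATTERN = re.compile(r'"pertains_to_document"\s*:\s*(true|false)')

def gated_answer(messages, model="gpt-4o"):
    """Stream the JSON reply and bail out as soon as the gate field comes back false."""
    buffer = ""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,  # the system prompt should spell out the two-field JSON schema
        response_format={"type": "json_object"},
        stream=True,
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        match = GATE_PATTERN.search(buffer)
        if match and match.group(1) == "false":
            # With the Assistants API you would cancel the run here instead.
            return None  # caller shows a canned "out of scope" message
    return json.loads(buffer).get("response")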

Thanks. It’s a good idea. But I’m reluctant to start dealing with streaming JSON – especially when we cannot ask the API to return properly formatted JSON. This is client-facing content, and if the JSON is poorly structured we may be showing weird content to our clients. The ideal solution would be for JSON responses to be fast enough that we can afford to wait for a complete response (in JSON), check it, and then return it.
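
If we ever did go the complete-response route, the validate-before-display step at least is tiny; a sketch assuming the same two-field JSON as above (the fallback text is a placeholder):

import json

FALLBACK = "Sorry, I can only answer questions about the uploaded documents."

def safe_reply(raw_model_output: str) -> str:
    """Only show model text to the client if it parses cleanly and passes the gate."""
    try:
        payload = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return FALLBACK  # never show malformed JSON to a client
    if not payload.get("pertains_to_document"):
        return FALLBACK
    return payload.get("response", FALLBACK)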

fwiw I don’t think I’ve ever had an issue with json becoming garbled, unless you’re using the mini models or extremely complex (or mis-specified) schemas.

XML is always a fallback option if you need something even more robust.

You will need to “manually” label information in a graph and use GraphRAG - GPT will then only know what it is allowed to know, and you can create rules…

Strict distribution of secrecy levels - Keycloak + Neo4j with embeddings … and only fine-tune models on data that was curated by several people.
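
As a very rough illustration only (hypothetical schema: chunks carry a clearance property and the user’s allowed levels come from Keycloak; index and property names are made up):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def allowed_chunks(query_embedding, allowed_levels, k=5):
    """Retrieve only chunks the current user is cleared to see (hypothetical schema)."""
    cypher = """
    CALL db.index.vector.queryNodes('chunk_embeddings', $k, $embedding)
    YIELD node, score
    WHERE node.clearance IN $levels
    RETURN node.text AS text, score
    ORDER BY score DESC
    """
    with driver.session() as session:
        result = session.run(cypher, k=k, embedding=query_embedding, levels=allowed_levels)
        return [record["text"] for record in result]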

I had a similar problem where the analysis model had to be limited to exact question/answer pairs from the RAG content. I ended up putting the RAG behind an API with a “Cerber” model that evaluates the retrieved chunks for their “usefulness” and “relation” to the query; if nothing truly matching is found, it just returns an empty array, which triggers the “query not fit” logic later in the flow.

A similar approach can be used for binary/few-option evaluation of incoming user intents/messages and their relation to the subject; the result then triggers a separate procedure programmatically (hard-wired into the agent’s behavior) so that the agent does not try to continue the thread (or you can try injecting a foreign reply into the conversation on the agent’s behalf… sounds like a story from Don Juan)…
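
A rough sketch of that kind of gate, assuming a small model is asked for a 0–1 relevance score (prompt wording, model name, and threshold are illustrative):

from openai import OpenAI

client = OpenAI()

RELEVANCE_THRESHOLD = 0.7  # illustrative cut-off

def score_relevance(query: str, chunk: str, model: str = "gpt-4o-mini") -> float:
    """Ask a small model for a 0-1 relevance score (illustrative prompt wording)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rate how relevant the document chunk is to the query. "
                        "Reply with a single number between 0 and 1."},
            {"role": "user", "content": f"Query: {query}\n\nChunk: {chunk}"},
        ],
        temperature=0,
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparsable answer counts as not relevant

def gate_chunks(query: str, chunks: list[str]) -> list[str]:
    """Empty return value triggers the 'query not fit' branch later in the flow."""
    return [c for c in chunks if score_relevance(query, c) >= RELEVANCE_THRESHOLD]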


if keyword in […]: print(‘…’)

is still valid
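
Something along these lines, for example (the keyword list and refusal text are placeholders):

OFF_TOPIC_KEYWORDS = {"essay", "homework", "poem"}  # placeholder list

def keyword_prefilter(user_message: str) -> str | None:
    """Return a canned refusal before any model call when an obvious off-topic keyword appears."""
    lowered = user_message.lower()
    if any(keyword in lowered for keyword in OFF_TOPIC_KEYWORDS):
        return "Sorry, I can only answer questions about the uploaded documents."
    return None  # fall through to the assistant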


Thanks for the suggestions! Part of our reason for using OpenAI assistants is that we’d like to avoid the “full freight” of creating the entire stack. But I suppose as we get more sophisticated and expect more precision, we may end up going that way. OpenAI did a lovely job of defining that API to support a fairly complete “turnkey” assistant and I’m hoping that over time they will improve this kind of stuff.

One problem that we are constantly confronting is that as we add more “layers”, we tend to also introduce more latency in the response. Simple things like what @jochenschultz suggests can work. But if you want an AI evaluation of parts of the response chain, that adds time. And as it stands, the Assistants API response is already pretty slow.

I should also report that we found a significant improvement by carefully refining the assistant instructions to refer abstractly to what is in the RAG contents and to indicate that relevance to that content is a requirement. In general, we continue to find that careful prompt engineering is critical to several behaviors.
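
To give a flavor of what I mean (this is an illustrative paraphrase, not our actual instructions):

ASSISTANT_INSTRUCTIONS = """
You answer questions only when the answer can be found in the attached documents.
If the user's request cannot be answered from those documents, reply exactly:
"I'm sorry, I can only answer questions about the provided materials."
Do not perform unrelated tasks such as general knowledge questions, math, or rewriting text.
"""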

Well, the entire stack is not as complicated as it may sound in the beginning. Personally, I struggled more with creating my own frameworks for specific use cases and business logic. But having the full stack under control definitely has a lot of benefits for me. You might check my previous posts in the forum on the difference between agents and completion APIs, where we kind of touch the subject.

You can get really fast responses with a GPU… not even expensive on local machines…
We are not talking about OpenAI model response times… it is more like fractions of a second for multiple layers.
If not milliseconds…

Leveraging classic code as much as possible, fine-tuning your own smaller, faster models for simpler tasks, and using parallel processing can all help with latency. It also improves the quality so drastically that most users are ready to wait. LOL … often as long as it takes.
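
For the parallel-processing bit, a minimal asyncio sketch, assuming hypothetical async check_intent and retrieve_chunks helpers:

import asyncio

async def handle_query(query: str, check_intent, retrieve_chunks):
    """Run the intent gate and the retrieval concurrently instead of one after the other."""
    in_scope, chunks = await asyncio.gather(
        check_intent(query),     # small/fast model or classic-code check
        retrieve_chunks(query),  # vector store lookup
    )
    if not in_scope or not chunks:
        return None  # triggers the canned 'out of scope' reply
    return chunks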


We are getting there. At a high speed :bullettrain_front:

The result of that kind of stuff is ultra fast once cached… same goes for databases in general, and Damerau-Levenshtein is also still valid.

I was just saying some control mechanisms are ultra fast compared to external APIs.
I try to avoid LLMs as much as possible. They have their place though.

After these four or five years with LLMs, I have started seeing them more and more as a can opener that lets my code get into text semantics, or as a super calculator for computing the probability of a predefined response suggested by classic code. This approach has proved highly reliable and very fast for me.

But then, yes, you can use LLMs for pretty much everything.

Having started this early must have really made you sceptical enough…

It's nice to play with them to mix up theories, like questioning math in general by adding a completely new take on basic physics. But that's more like nerd gameplay and not something I would ever want controlling nuclear reactors.

When it can do stuff like this:

Let’s try to create a double spiral in 3D, closed on both the outside and inside. A planet will move along this spiral. Around this planet, another smaller planet orbits horizontally, leaving a trail behind. This should be implemented in 3D using a JavaScript library of choice.

I would be convinced.

I was talking more about using an LLM to select one of five candidate answers based on the context (legal document analysis) by calculating log probs. Code is used to generate (combine) the context from the vector DB, and the UI plus code let the user define the options. As a result, some 15 small LLM models trained on something like 60 legal documents allow us to beat some big a… out there with peanuts as infrastructure costs.
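
Roughly, the log-prob trick looks like this with the Chat Completions API (model, prompt wording, and letter scheme are assumptions; our setup uses small fine-tuned models instead):

import math

from openai import OpenAI

client = OpenAI()

def pick_option(context: str, question: str, options: list[str], model: str = "gpt-4o-mini") -> int:
    """Ask for a single letter and use token log probabilities to pick among the options."""
    letters = "ABCDE"[: len(options)]
    listing = "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with a single letter only."},
            {"role": "user", "content": f"{context}\n\n{question}\n{listing}"},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    scores = {t.token.strip(): math.exp(t.logprob) for t in top}
    best = max((l for l in letters if l in scores), key=lambda l: scores[l], default=letters[0])
    return letters.index(best)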

Same goes for reasoning models: they are great when you don't know the subject well enough, but when the subject is something you live and breathe and you need to follow a predefined workflow, nothing beats a classic full stack plus bits of LLM here and there.

But again, everything depends on your application and the goal. Not saying llms are bad, just trying to say that hoping that AI will solve your problem is a bit naive (nothing to do with the topic).


Was trying to solve entanglement by displaying time in a spiral where our reality meets other parts of it… :nerd_face:

Yeah, off topic.

Let’s get back to the subject - that’s pure gold. Providing a complete, well-designed prompt that is simple to understand and follow is basically the key to most LLM applications out there. For high-load applications, a good system of prompt testing is a must, as it saves a lot of time and money and avoids the human error of guessing about prompt performance.
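
For example, a tiny regression harness along these lines (ask_assistant is a hypothetical wrapper around your assistant call; queries and the refusal marker are illustrative):

TEST_CASES = [
    {"query": "What does the uploaded contract say about termination?", "in_scope": True},
    {"query": "What is 2 plus 2?", "in_scope": False},
    {"query": "Can you rewrite this essay for me?", "in_scope": False},
]

REFUSAL_MARKER = "only answer questions about"  # fragment of the canned refusal

def run_prompt_tests(ask_assistant) -> None:
    """ask_assistant(query) -> reply text; flag any case where scoping behavior regressed."""
    failures = []
    for case in TEST_CASES:
        reply = ask_assistant(case["query"])
        refused = REFUSAL_MARKER in reply.lower()
        if case["in_scope"] == refused:  # answered when it should refuse, or refused in scope
            failures.append((case["query"], reply))
    for query, reply in failures:
        print(f"SCOPING REGRESSION: {query!r} -> {reply[:80]!r}")
    print(f"{len(TEST_CASES) - len(failures)}/{len(TEST_CASES)} cases passed")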