Biggest difficulty in developing LLM apps

Agreed. I am using Weaviate where my metadata is also embedded, and I’ve built in a keyword capability. There’s no getting around the noise – the contracts are what they are. The other hard part I forgot to mention is training users on how to ask the questions in the correct manner and use the tools available.

I’ve been a database developer for over 40 years, and I was a pioneer in the area of Electronic Publishing some 30 years ago. I knew a little something about data structure, and in particular text data structure, before I got into this AI game. I generally use Semantic Chunking https://www.youtube.com/watch?v=B5B4fF95J9s&ab_channel=SwingingInTheHood and various embedding strategies depending upon the type of text (legal, policy, sermons, regulatory code, religious texts, scientific, etc…) I’m working with. And I use metadata fairly extensively.

These are very detailed and extensive contracts which are chunked at the top level by their hierarchical/semantic structures, then at the second level by size… And, there are multiple agreements, over multiple years, so depending upon the scope of the search, similarity results could easily be found in 50+ document chunks. And, don’t get me started if the user expands the search out to multiple guilds (IATSE, SAG-AFTRA, DGA, WGA, AFM, etc…)
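Roughly, the two-level chunking looks something like this - just a sketch, not my actual code; the heading pattern and size limit are placeholders you’d tune to the agreements you’re working with:

```python
import re

MAX_CHARS = 2000  # assumed second-level size cap; tune against cost vs. answer quality

def chunk_contract(text: str) -> list[dict]:
    """First level: split on structural headings. Second level: split oversized sections by size."""
    # The heading pattern is a placeholder; real agreements need their own numbering scheme.
    sections = re.split(r"\n(?=ARTICLE\s+\d+|Section\s+\d+)", text)
    chunks = []
    for sec in sections:
        if not sec.strip():
            continue
        title = sec.strip().splitlines()[0]
        buf = ""
        for para in sec.split("\n\n"):
            if buf and len(buf) + len(para) > MAX_CHARS:
                chunks.append({"text": buf.strip(), "metadata": {"section": title}})
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append({"text": buf.strip(), "metadata": {"section": title}})
    return chunks
```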

But to your point, it could be that the document chunk size I’ve chosen is too small – this was done to try and reduce the average cost per query. We weren’t getting better answers at a higher chunk size, so I figured it was worth a try. Again, trying to find that fine line.

However, I’ve got a plan. The cosine similarity search is almost certainly going to bring back the most relevant documents. So, all I have to do is configure my system to analyze x documents at a time, keep the relevant ones used in each response, and then return either a summary or a concatenation of the individual responses. Sort of like the map-reduce or refine summarization strategies, but at the RAG level.
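Roughly, the loop would look something like this - a sketch only, assuming the chunks have already come back from the cosine similarity search; the batch size and model name are placeholders, not my production setup:

```python
# Sketch: batch the retrieved chunks, answer per batch ("map"), then merge the
# partial answers ("reduce").
from openai import OpenAI

client = OpenAI()
BATCH_SIZE = 5  # "x documents at a time"

def answer_over_chunks(question: str, chunks: list[str]) -> str:
    partials = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = "\n---\n".join(chunks[i:i + BATCH_SIZE])
        resp = client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system",
                 "content": "Answer only from the excerpts provided. If they are not relevant, reply 'not found'."},
                {"role": "user", "content": f"Excerpts:\n{batch}\n\nQuestion: {question}"},
            ],
        )
        partials.append(resp.choices[0].message.content)
    # Reduce step: either concatenate the partial answers or summarize them into one.
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user",
                   "content": "Combine these partial answers into one complete answer:\n\n" + "\n\n".join(partials)}],
    )
    return resp.choices[0].message.content
```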

What the end user wants is a complete answer, and I think this is how we get it to him.

Actually, if I could be guaranteed that the model could read 200K with 100% accuracy, it would solve this particular problem. But, unfortunately, we don’t live in that world – yet.

2 Likes

Oh, this is 100%!

Wow, impressive!

Another thing you may try is some kind of indexing - making various summaries of the chunks and linking them to the originals.
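Something along these lines - a sketch only, where summarize() and embed() stand in for whatever model calls you already use:

```python
# Minimal summary-index sketch: search over short summaries, but keep a pointer
# back to the original chunk so the full text can be pulled in for the answer.
def build_summary_index(chunks, summarize, embed):
    """chunks: list of {"id": ..., "text": ...} dicts."""
    index = []
    for chunk in chunks:
        summary = summarize(chunk["text"])   # e.g. one short LLM call per chunk
        index.append({
            "vector": embed(summary),        # the vector you actually search against
            "summary": summary,
            "source_id": chunk["id"],        # link back to the original chunk
        })
    return index
```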

Oops, you already figured this out :slight_smile:

Yes, basically when the attention windows of the models reach, or at least approach, their context windows.

1 Like

Here’s a draft of my essay where I try to explain my concept of anti-agents (a mix of imperative and declarative approaches) and compare it with AI agents (a more declarative approach): Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.

Commenting is open :slight_smile:

For simple RAG: Reliability, edge-cases, and as a consequence hallucinations for sure.

For private bots with well-known documents it’s not much of an issue as we already know the expected answer and are just using it for time efficiency. For learning/discovery, or public-facing chatbots, especially customer service it needs to either be very accurate, or at the least know when it can’t answer correctly.

One of my first tests with the OpenAI Retrieval was uploading a vehicle care guide and asking for torque values. If I had listened to its authoritative answer, my car would’ve come apart on the road.

For reality: Recommendations, aggregations, comparisons, stacked questions, ambiguous questions, incorrect questions, keyword-focused questions. Questions that require some additional logic to answer.

I haven’t seen anyone bother touching this yet. Not on this forum anyways.

:rofl: Totally is. As of now they are expensive & feel rushed. The reality for me is that OpenAI just does things and we have to deal with the consequences afterwards. They break, we adapt, they break again.

With so many new tools and gadgets that I want to take advantage of, it’s going to be a nightmare trying to stay on top of each of them and their inevitable unannounced breaking updates.

So, I’m hoping that in the future their Assistants will have already been tested on these updates. I want mine to use TTS, STT, GPT4V, & Function Calling. Maybe Dall-E. I also use my Assistants framework for a number of different applications. So it’s nice to just install the package, and plug-and-play.

Another benefit is being stateful. I have a WhatsApp chatbot client. WhatsApp doesn’t even hold the conversation, so when I get a new message it’s just that: a message. I don’t need a database for anything besides user profile information.

I think one of the biggest difficulties is understanding user intent.

The intent of the user will then determine which action to take.

To solve this problem, you have to create a control plane.

This control plane will first understand which resources are required to react to the user’s intent. Also, the control plane knows what it can and cannot answer, so another function of the control plane is to decide this, and possibly respond with “I cannot help you with that.”

For example, if a user says “What time is it?”, the control plane has to recognize that the current time is outside of the LLM’s training data, decide whether it has access to a time resource, and, if it does, retrieve it and send it back to the user, preferably in their time zone.

So there are a lot of moving parts in this simple question. The moving parts are “do I have access to time information”, “do I know the user’s time zone”, and “if I have access to the time, but not the user’s time zone, should I respond?”

This applies even to general RAG systems. For example, which RAG DB do I pull from given this question? Is this within scope?

So basically user intent and control plane logic is a difficult one.
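A toy sketch of the kind of logic I mean - the intents and resources here are made up for illustration, not a real implementation:

```python
from datetime import datetime, timezone

# Toy control plane: decide what resource an intent needs, check that it's actually
# available, and refuse otherwise. Intents and resources are illustrative only.
RESOURCES = {"clock": True, "user_timezone": None}  # we can read the time, but don't know the user's tz

def control_plane(intent: str) -> str:
    if intent == "current_time":
        if not RESOURCES["clock"]:
            return "I cannot help you with that."
        now = datetime.now(timezone.utc)
        if RESOURCES["user_timezone"] is None:
            # Time is available but the zone isn't: answer with an explicit caveat.
            return f"It is {now:%H:%M} UTC (I don't know your time zone)."
        return f"It is {now.astimezone(RESOURCES['user_timezone']):%H:%M} in your time zone."
    # Anything the control plane doesn't recognize is out of scope.
    return "I cannot help you with that."

print(control_plane("current_time"))
```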

2 Likes

Good point. I’m keeping my eye on smaller LLMs built for function calling for this exact purpose. To me it seems inevitable that we will need to route the query to a number of different agents.

My RAG as of now is very small and basically entails application design, purposes, and functionality. So I just use function calling with enums to accomplish this :man_shrugging:

To me it makes sense to train a much smaller model on user intention classification, and then pass ambiguous/difficult queries to GPT-4 to infer.

Separating logic from GPT seems to be the key here. GPT, and LLMs by extension, are unstructured query handlers. Function calling brings the best of both worlds by classifying the intent, and also transforming the query.
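Roughly what I mean by function calling with enums - just a sketch, with made-up intent labels; real ones would match the application’s routes:

```python
from openai import OpenAI

client = OpenAI()

# Intent classification via function calling with an enum of allowed intents.
tools = [{
    "type": "function",
    "function": {
        "name": "route_query",
        "description": "Classify the user's request so it can be routed to the right handler.",
        "parameters": {
            "type": "object",
            "properties": {
                "intent": {
                    "type": "string",
                    "enum": ["app_design", "app_purpose", "app_functionality", "out_of_scope"],
                },
                "rewritten_query": {
                    "type": "string",
                    "description": "The request restated as a standalone query.",
                },
            },
            "required": ["intent", "rewritten_query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # a smaller model can often handle this classification step
    messages=[{"role": "user", "content": "how do I change what the bot is allowed to do?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "route_query"}},
)
print(resp.choices[0].message.tool_calls[0].function.arguments)
```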

1 Like

In order to deal with “reality”, you’ve got to deal with “real” end-users. Not the big-brained theorists, or the hobbyists, or even the early adopters, but everyday folk who know nothing about tokens and prompt engineering and temperatures and topK (hell, I still don’t know what that means). The butcher, the baker, the candlestick maker. They just want to be able to ask a question and have the AI answer it as simply and completely as possible. The first time.

Yeah, “difficult” is putting it mildly. When you come upon this, and the shocking reality that these machines aren’t as smart as we thought they were, that’s when the rubber hits the road and most of the well-thought-out theories and research papers fall apart.

Another major difficulty in developing LLM apps: Meeting end-user expectations.

2 Likes

100% agreed. It’s the bane of all interface developers. Abstract the code to an intuitive level for the layman.

It’s why I’m all about transforming the initial query with a fine-tuned model - the professional that can filter and adapt. Before all this I used to own a construction company.

It’s all too common to deal with people who
A) think they know it all, but miss the nuances and ultimately say things that make a big difference to me, or
B) really don’t know anything, and that’s fine, but we need to know that and translate it all, assuming a large degree of what they actually want.

2 Likes

What do you see as the source of these difficulties? Is it in the realm of what we discussed with @SomebodySysop above?

1 Like

100%. An interesting case I encountered while working with one of my latest clients.

A user is given a case interview question and then he can either start responding or ask clarifying questions. The challenge is that the answer itself may be a list of questions, so I spent a lot of time engineering a prompt that will distinguish between the two, only to finally decide to use buttons at this stage (so when they are asked a question, I’m changing the chat input form to 2 buttons: ask / answer).
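For illustration, the kind of classification step I mean would look roughly like this - a sketch, not the actual prompt I used:

```python
from openai import OpenAI

client = OpenAI()

# Sketch of a message-type classifier: decide whether the candidate's message is an
# answer attempt or a clarifying question before routing it. Labels are illustrative.
def classify_message(interview_question: str, user_message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word: ANSWER if the candidate is attempting to answer "
                        "the case question (even if phrased as questions they plan to explore), or "
                        "CLARIFY if they are asking the interviewer for more information."},
            {"role": "user",
             "content": f"Case question: {interview_question}\n\nCandidate message: {user_message}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```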

2 Likes

Did you try Mistral tiny for that? After playing around with it for a while, it seems to me like a good candidate for what you are looking for.

what happened? what’d i miss?

Yes, 100%!

I’ve been keeping my eye on Gorilla for some time now. That, along with validators like Jsonformer, seems to be moving in the right direction.

The Axel Springer agreement, where their articles get higher priority than others. I’m almost certain this will eventually lead to tracking and advertising.

1 Like

Thanks for the link! Will check it out!

1 Like

Somebody sent me this:

Anybody know where this picture is from?

2 Likes

man, notion is a pain in the ***. at least they have trash security.

what you’re describing seems to be one attempt at dealing with the drawbacks of unmanaged conversations, or single-layer prompt engineering - but it seems to me like most, if not all, of those issues can be mitigated by installing a goal management layer.

in my opinion/experience you can have your cake and eat it too in this case.

your post is still useful, in that it discusses the limitations of basic “prompt engineering” and raises the need for more stratified approaches. perhaps the need for a sort of “prompt architecture”?

Here’s a synopsis of your post; I’ll take it down if you don’t want it up.

synopsis by gpt-4-1106-preview

The article “Introduction to Anti-Agents. Comparing declarative and imperative approaches in conversational AI interfaces” by Tony Simonovsky discusses the two predominant programming approaches for conversational artificial intelligence (AI) interfaces: declarative and imperative.

The declarative method is prevalent and is used in frameworks like Autogen and Assistants API. It allows developers to define the desired outcomes for AI agents, relying on the AI model to handle the execution details. This method is user-friendly and expedites the development process, excelling in scenarios that require flexible, context-aware interactions. However, it may struggle with complex, rule-heavy contexts, as AI’s effectiveness is bound by how much text it can process and use at a time.

The imperative approach, contrastingly, involves specifying detailed steps for the AI to follow. This provides developers with precise control and predictability, essential for managing intricate conversations. “Anti-agents” form a subset of this approach, combining structured dialogue flows with a limited amount of AI autonomy within each step. This method streamlines prompt engineering and allows for better modularity and testability, although it may lack the ability to adapt to unforeseen inputs and can feel less dynamic.

Simonovsky discusses examples in content creation, support/sales bots, and gaming to illustrate the use of AI agents (declarative) and anti-agents (imperative). In content creation, AI agents help with flexible, general tasks, while anti-agents handle complex, structured projects. In support/sales, an AI agent can adapt to diverse customer needs, whereas an anti-agent ensures adherence to specific protocols. In gaming, AI agents allow NPCs to dynamically interact with the player, while anti-agents ensure NPCs follow a precise narrative path.

The article concludes that while AI technology continues to improve, the structured, predictable nature of anti-agents is still crucial in scenarios requiring meticulous control. Developers should weigh the merits of both approaches to choose the one that suits their project’s needs and delivers the desired user experience. As LLMs evolve, the balance between AI agents’ autonomy and anti-agents’ guided structure will be key to advanced AI applications.

Cannot agree, but anyway, it is a draft version; I’ll publish it somewhere else later.

Could you elaborate?

This is an essay, so it obviously doesn’t pretend to be the truth - rather, it’s my exploration of some of the current challenges in LLM app development (especially with respect to the controllability and testability of AI agents).

I’m actually playing around with this concept in practice in AIConversationFlow: GitHub - TonySimonovsky/AIConversationFlow: AI Conversation Flow provides a framework to create anti-agents to build complex non-linear LLM conversation flows, that are composable, controllable and easily testable.

Thanks for posting the synopsis, man!

you know, like we discussed last time - the objective management in this case

1 Like

looks like a downgraded pre-version of RAG Fusion by Adrian R.

Sure, I can give you a video of it working:

Also, several sites by a computer science Ph.D. describing it:

And a post on LangChain / LlamaIndex.
If you follow these steps, you will get good results.

To the people asking if this will be useful: yes, and it is not possible for a single LLM to do this without an IDE, which is an excellent point. The context windows of LLMs are severely limited due to the quadratic cost of attention over the input through the transformer stages. We can work around this by having multiple AI agents handle separate sections of the queries, as shown in ChatDev toscl and AI Jason Agent 3.0 (with build structures), with IDE separation of agents and the creation of in-computer warehouses, and then by adding another type of LLM, an SSM such as Mamba, which can handle large sequence inputs, in combination with GPT-4 agents.
By breaking all tasks down into smaller blocks, we create a development environment where multiple LLMs can handle the tasks of RAG, leading to the ability to create fusion-type inferences:

1. Each LLM is able to create a RAG request.
2. The requests are then ranked by an oversight LLM.
3. The best of the RAG results are then combined by an LLM.
4. The combined result is presented to another LLM that is handling part of the input values.
5. The oversight process is monitored by another LLM, which gives out tasks.
Here are the links with programs using these features:
https://www.youtube.com/watch?v=Zlgkzjndpak&t=230s - ChatDev toscl with a video graphic interface so you can see the LLMs talking to each other.
https://www.youtube.com/watch?v=AVInhYBUnKs&t=1s - an LLM API that creates groups of research agents to prevent losing track of what is going on with the warehouses.
In combination with these technologies and software programming, handling larger tasks becomes very simple. A rough sketch of the flow is shown below.
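For illustration only - each "agent" here is just a callable, and the names are made up rather than tied to any specific framework:

```python
# Sketch of the five-step flow above: an oversight agent splits the task, worker LLMs
# each issue a RAG request, a ranking LLM picks the best results, and a combiner LLM
# fuses the partial answers.
def fused_rag_answer(task, workers, retrieve, rank, combine, oversee):
    """workers: per-section LLM callables; retrieve/rank/combine/oversee: LLM-backed callables."""
    subtasks = oversee(task)                                            # 5. oversight LLM breaks the task into subtasks
    partial_answers = []
    for subtask, worker in zip(subtasks, workers):
        request = worker(f"Write a retrieval query for: {subtask}")     # 1. each LLM creates a RAG request
        ranked = rank(retrieve(request))                                # 2. results ranked by an oversight LLM
        partial_answers.append(worker(f"Answer '{subtask}' using:\n{ranked}"))  # 4. each worker handles its part
    return combine(partial_answers)                                     # 3. the best results are combined
```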
1 Like