Biggest difficulty in developing LLM apps

Data Quality and Bias: Ensuring the quality and diversity of data used to train LLMs is a significant challenge. Biases in the training data can lead to biased outputs, which can be a serious issue, especially in sensitive applications. Striving for unbiased, fair, and representative data is a continuous and complex task.
That being said, I think this will drastically decrease in the near future, once AI is actively collecting data in real time from sensor arrays hooked to various agents in a hybrid, API-driven agent complex, or Hive-AI.

Is there a paper associated with this repo? I'd be interested to see some replicable performance metrics.

2 Likes

I think you need to specify precisely what you mean by bias here. Being "unbiased" may not be a universally good thing in the context of LLM training data.

1 Like

Biggest problem I'm seeing with RAG is keeping the LLM focused while maintaining context. I know, I know, preaching to the choir. An example: you're creating a complex document, maybe an RFP (request for proposal). You have the document you are creating and a document that describes how to respond to the RFP. You want the LLM to interact with you to complete and grade each section of the RFP, and you can't drift; RFP sections are very specific. Right now I create a parent/child doc definition. I use the parent doc to provide context, like AI: "Hey, you're missing sections 1A and 3C. Which one would you like to work on?" Human: "3C." Then I similarity-search for the 3C description and guidance, and use that section-specific guidance for the chat. Currently I've set up a "plan and execute" agent that uses the main doc description to drive the plan, then iterates over the sections using an "evaluate" tool to drill into each section.
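For anyone following along, here is a minimal Python sketch of that parent/child pattern as I read it. Everything here (the `embed` placeholder, section IDs, record layout) is illustrative, not taken from the actual implementation described above.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding -- swap in a real embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

# Parent record: tracks which RFP sections exist and which are complete.
parent = {
    "doc": "RFP-2024-001",
    "sections": {"1A": False, "2B": True, "3C": False},  # section -> completed?
}

# Child records: section-specific response guidance to retrieve for the chat.
children = [
    {"section": "1A", "guidance": "Describe corporate experience and past performance..."},
    {"section": "3C", "guidance": "Provide pricing assumptions, exclusions, and grading criteria..."},
]
for c in children:
    c["vec"] = embed(c["guidance"])

def missing_sections(p: dict) -> list[str]:
    return [s for s, done in p["sections"].items() if not done]

def guidance_for(query: str, section: str) -> dict:
    """Similarity search restricted to the chosen section's children."""
    qv = embed(query)
    candidates = [c for c in children if c["section"] == section]
    return max(candidates, key=lambda c: float(qv @ c["vec"]))

# One agent turn: report missing sections, then pull guidance for the one the human picks.
print("Missing sections:", missing_sections(parent))   # e.g. ['1A', '3C']
picked = "3C"
print(guidance_for("How should section 3C be answered?", picked)["guidance"])
```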

If someone has a better strategy I'm all ears, I'm... I mean eyes :eyes:

1 Like

Bias, as spoken about here in the context of data collection, extrapolation, and interpretation, means being tilted in one direction along a two-directional pathway. Anything that ignores some data and returns a response in one direction that is misguided, illogical, or that dangerously triggers a reaction based on biased information can lead to responses that are also misinformed. That is the simple gist of it. In fact, almost all data is biased, no matter how much we try to make it otherwise, and that is the hard part. Next, even if we have knowledge of this bias, it is hard for us to control the reactions that come from it. For instance, with Phi-2, the producers of that LLM were able to create a 2.7-billion-parameter model that performed better than Mamba and GPT-3.5, and on a scale equivalent to an earlier 1.7-trillion-parameter model, by reducing non-scientific data. This may also be questionable, depending on what metrics you consider 'better' and 'worse,' but generally we are looking for 'more accurate.'

This applies philosophically to any field, and humans are inherently biased due to their perception of reality, which makes it difficult for humans to determine what 'reality' is. Locke, Dante, Socrates, Thales (c. 600 BC), and many other philosophers have wrestled with this problem. It is just like looking at a cup and seeing a solid object while knowing it is made of moving atoms. This is visual bias rooted in the wavelengths and wave functions the eyes can receive, the signals they transmit to the brain, and the brain's interpretation of those signals. All data that humans perceive is biased. AI and AI sensors will bring more 'truth' to the world by utilizing sensors with abilities superior to humans'.

However, in the interim, we as humans have to deal with the fact that our own data is biased, which makes it the hardest issue to deal with. And that is just the beginning. Biases come in many forms, including coding biases, data-quality biases, regulatory-related biases, and dictatorial influence on data driven by media singularity. Media pluralism results in better data through the presentation of multiple vantage points of bias; however, it is still biased.

Sure, I wrote a paper for Plos S and T (PLOS One). It is still being reviewed, but I can send the link: Anchoring Global Security: Autonomous Shipping with Mind Reading AI, GPT-core and MAMBA-core Agents, RAG-Fusion, AI Communities, Hive-AI, and the Human Psyche.

You did not read the paper.


I doubt you can read a 44-page report in 10 minutes, and I know you did not because I wrote it. Here is a screenshot of page 16.

This discussion seems to have gone a bit off the rails. Let's maintain decorum on the forum; having an academic discussion doesn't necessitate personal attacks.

1 Like

The paper specifically states the advantages, and more importantly the psychological effects, of using AI with RAG-Fusion, and the difficulties people associate with AI. The paper also goes into the human perception of what is difficult and dangerous, and what those perceptions are, which exactly covers this post's topic. The topic of this discussion is "biggest difficulty in developing LLM apps." My response is that controlling, overseeing, and cooperating in the presence of bias is the most difficult part. So knowing the definitions of difficulty, bias, and developing LLM apps becomes critical. We can turn to the papers and scientific research that describe how people feel about this and what their perceptions are, and then discover new ways to approach these difficulties. It is a scientific discussion in the realm of physics, philosophy, biology, and computer science, which requires cross-field expertise. That is also difficult to find today, and in my experience I have faced a lot of problems explaining this to people. So I usually start by giving examples, then research proof, and then I ask questions.

I've seen the entire conversation, and I'll give the best possible advice you can get here.

When another community member, especially a community leader, tells you that:

I'm going to go ahead and give you an opportunity to revise your post and remove the snark and condescension, neither of which are necessary here.

It's a good idea to take that opportunity.

If you want to have a proper discussion about the paper, I'd advise you to make a separate topic about it instead of posting about it on other people's topics.

4 Likes

I got it. I revised it and deleted the old comment.
I provided a better example, with definitions, explanations, and examples of understanding complex topics, which allows us to understand the "Biggest difficulty in developing LLM apps," not just RAG. I hope the revisions and the added content, along with solutions and psychological surveys of 6,000 people globally, help to cement the idea of difficulty, fear, and understanding of AI, its creation, and its perception (pages 12-16).


Please let me know if you think this is appropriate.

Great question. Mine is knowledge "freshness". Indexing current knowledge at scale is achievable. But "pruning" the knowledge so that only the essential elements are indexed, and keeping it fresh (as the pace of knowledge generation is quickening), is a problem I am thinking deeply about.

3 Likes

Just add time dimensions to your vector? If you don't want to do that, you can also just rerank by timestamp :thinking:
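Something like this is what I had in mind (rough Python sketch; the half-life, weights, and field names are all made up for illustration):

```python
import time

def recency_score(ts: float, half_life_days: float = 180.0) -> float:
    """Exponential decay: 1.0 for brand-new content, 0.5 after one half-life."""
    age_days = (time.time() - ts) / 86400
    return 0.5 ** (age_days / half_life_days)

def rerank(hits: list[dict], alpha: float = 0.7) -> list[dict]:
    """hits: dicts with 'similarity' (0..1) and 'timestamp' (unix seconds)."""
    return sorted(
        hits,
        key=lambda h: alpha * h["similarity"] + (1 - alpha) * recency_score(h["timestamp"]),
        reverse=True,
    )

hits = [
    {"id": "old-but-relevant", "similarity": 0.92, "timestamp": time.time() - 400 * 86400},
    {"id": "fresh",            "similarity": 0.85, "timestamp": time.time() - 5 * 86400},
]
print([h["id"] for h in rerank(hits)])  # order depends on how you weight freshness
```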

2 Likes

What are your current ideas for resolving this? Do you think @Diet's recipe will work for you?

But age is only a negative if there is newer data that is inconsistent, right? There is so much junk being produced these days that newer isn't necessarily better.
In the context of scientific articles, SemanticScholar has 'citations' and 'influentialCitations' counts, which I currently balance against age (recognizing also that newer content has had less time to accumulate citations).
But recognizing inconsistencies across sources, which I'd love to do, is a very challenging task. Hmm.
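For what it's worth, here is one illustrative way (not necessarily the exact formula being used) to trade citation counts off against age, in Python:

```python
from datetime import datetime

def citation_score(citations: int, influential: int, year: int,
                   influential_weight: float = 4.0) -> float:
    """Citations per year, with influential citations weighted more heavily."""
    age_years = max(datetime.now().year - year, 1)  # avoid divide-by-zero for brand-new papers
    return (citations + influential_weight * influential) / age_years

# A heavily cited 2015 paper vs. a promising 2023 one:
print(citation_score(citations=900, influential=40, year=2015))
print(citation_score(citations=60, influential=8, year=2023))
```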

2 Likes

What you describe is actually a very important concept that is under active research and development by many researchers.

I imagine, though, that it may require including a timestamp of sorts alongside all of the training data used, which is itself non-trivial, and there may need to be some kind of new transformer architecture that is able to make use of that additional information in a meaningful way.

I'm imagining some kind of mechanism where, during training, if you have a document with the text "the English Monarch is Queen Elizabeth II" and a 2021 timestamp, and a document with the text "the English Monarch is King Charles III" and a 2023 timestamp, we would want the weights to be modified in such a way as to increase the probability of generating,

  • the English Monarch is King Charles III
  • the English Monarch was Queen Elizabeth II
  • the former English Monarch was Queen Elizabeth II
  • etc.

and decrease the probability of generating,

  • the English Monarch is Queen Elizabeth II
  • the crown prince of England is Prince Charles
  • etc

I imagine this is, at the moment, a non-trivial task because it amounts to, in effect, using a model's training data to update its training data after the fact, on the fly, so it would be as though all of the training data with earlier timestamps had been rewritten from a current temporal POV.
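To make the simplest version of that concrete, here is a toy Python sketch of timestamp-prefixed training examples. This only covers the data-formatting side; the architectural changes speculated about above are a separate, much harder question, and everything here is illustrative.

```python
docs = [
    {"text": "the English Monarch is Queen Elizabeth II", "date": "2021-04-01"},
    {"text": "the English Monarch is King Charles III",   "date": "2023-05-06"},
]

def to_training_example(doc: dict, as_of: str = "2024-01-01") -> str:
    # The example carries both the document's own date and a "current" date,
    # so a model could in principle learn to restate older claims from a
    # present-day point of view.
    return f"[written: {doc['date']}] [today: {as_of}] {doc['text']}"

for d in docs:
    print(to_training_example(d))
```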

Unless I'm drastically overthinking what you're suggesting, which is entirely likely.

¯\_(ツ)_/¯

1 Like

It's quite possible with recursive content analysis; unfortunately it's quite expensive, because it's hard to predict the concept net for any particular topic.

Would definitely be an interesting project if you could get funding for it, but it's likely gonna have some rather severe social consequences if publicized lol

1 Like

So my new company, Awarity, is focused on pioneering RAG 2.0, which means I have a lot of thoughts around this topic. I'm not ready to share all of my ideas, but here are a couple of things I'll share…

Context window size is an issue, but simply having a bigger context window isn't the only piece of the puzzle. You need to create a Chain-of-Thought to answer more complex questions. I'm working to create a virtualized context window which would let you logically query a model with a billion tokens of context. That in itself isn't enough. The whole problem of reasoning is a compression problem, so it's really about knowing how to efficiently compress that 1 billion tokens of content down into an 8k-16k context window that the model can use to generate an answer.
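The details of the virtualized context window aren't shared here, so purely as an illustration of the budget problem being described, here is a naive Python sketch that recursively condenses retrieved chunks until they fit an 8k-token window (model name and prompt are placeholders, not part of the approach above):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

def squeeze(chunk: str, question: str) -> str:
    """One LLM call that keeps only the material relevant to the question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Keep only the facts relevant to: {question}\n\n{chunk}"}],
    )
    return resp.choices[0].message.content

def compress_to_budget(chunks: list[str], question: str,
                       budget: int = 8000, max_rounds: int = 3) -> str:
    """Repeatedly condense the chunks until the combined context fits the budget."""
    context = "\n\n".join(chunks)
    for _ in range(max_rounds):
        if n_tokens(context) <= budget:
            break
        chunks = [squeeze(c, question) for c in chunks]
        context = "\n\n".join(chunks)
    return context
```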

Identifying what you show to the model is key. The model can only generate grounded answers for things it sees in its context window. If it sees the answer to the question in the information you show it, the model will reliably use that information for its answer. If it doesn't see the information, then all bets are off and it's 50/50 whether it will make something up. The model always wants to generate an answer. Guardrail instructions help, but they're not foolproof. You need a secondary fact-checker call.
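A secondary fact-checker call can be as simple as the following sketch using the OpenAI Python SDK (prompt wording and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a second model whether the answer is fully supported by the retrieved context."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Reply with only YES or NO. YES means every claim in the "
                        "answer is supported by the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nAnswer to check:\n{answer}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```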

3 Likes

In my experience, good old RAG 1.0 still solves the problem of answering questions from 1 billion tokens of content within a 16K context window (with gpt-4-turbo, now just a 4K output window) using old-fashioned cosine similarity. The problem isn't understanding the returned documents, but rendering a comprehensive answer to the question.
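For reference, the "old fashioned cosine similarity" step is essentially just this (illustrative numpy sketch; chunking and embedding happen upstream):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 20) -> np.ndarray:
    """query_vec: (dim,) embedding; doc_vecs: (n_chunks, dim) matrix of chunk embeddings."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]  # indices of the k most similar chunks
```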

Again, in my experience, that failure to render a comprehensive answer is the result of two limitations:

  1. not returning enough context documents
  2. the model not being able to efficiently read all of the context documents (i.e., losing details in the middle).

Neither of these problems is resolved through summarization or compression, particularly if you are dealing with legal documents in which every detail must be analyzed and rendered verbatim.

Now, I'm just speaking from my own limited experience. If I'm wrong, please educate me. If I am dealing with a 750-page labor agreement, and I need to know not only the rules around holiday pay but every possible rule that affects every possible classification of worker, I just don't see how summarization is going to get me the comprehensive and detailed answer I need, as an employer, to make a fully informed decision.

1 Like

I have a test corpus that I call Bowl Season, which currently contains 42 documents (43 after Monday), one for each college football game. I can ask that corpus, "Tell me the score of every game and the player from each game with the most receiving yards." Using my system, I can get GPT-3.5 or GPT-4 to accurately answer that question using an 8k context window. With traditional RAG and a 128k context window, even GPT-4 struggles.

That's my small corpus. My largest corpus has over 1M documents and 500 million tokens. For the smaller corpus I let the model reason over every doc in the corpus; it needs to do that to answer questions like "tell me the score of every game." For larger corpuses I only let the model "potentially" see around 25% of the corpus. My algorithm is generally capable of answering complex questions over any size corpus. It really comes down to how long you want to let it run (reasoning over 1M documents could take upwards of 100 LLM calls) and how much money you are willing to spend.
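This is not the poster's algorithm (which isn't shared), but a generic map-reduce baseline for "let the model reason over every doc" looks roughly like the sketch below, which also makes clear where the one-call-per-document cost comes from:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer_over_corpus(question: str, docs: list[str]) -> str:
    # Map: one extraction call per document (this is where the ~N-calls cost comes from).
    partials = [
        ask(f"Document:\n{doc}\n\nExtract anything relevant to: {question}\n"
            f"If nothing is relevant, reply NONE.")
        for doc in docs
    ]
    relevant = [p for p in partials if not p.strip().upper().startswith("NONE")]
    # Reduce: one final call to combine the partial answers.
    return ask(f"Question: {question}\n\nNotes from each document:\n"
               + "\n---\n".join(relevant)
               + "\n\nCombine these notes into a single complete answer.")
```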

1 Like