Biggest difficulty in developing LLM apps

Data Quality and Bias: Ensuring the quality and diversity of data used to train LLMs is a significant challenge. Biases in the training data can lead to biased outputs, which can be a serious issue, especially in sensitive applications. Striving for unbiased, fair, and representative data is a continuous and complex task.
That being said, I think this will drastically decrease in the near future, once AI is actively collecting data in real time from sensor arrays hooked to various agents in a hybrid, API-driven agent complex, or Hive-AI.

Is there a paper associated with this repo? I'd be interested to see some replicable performance metrics.

2 Likes

I think you need to specify precisely what you mean by bias here. Being "unbiased" may not be a universally good thing in the context of LLM training data.

1 Like

Biggest problem I'm seeing with RAG is keeping the LLM focused while maintaining context. I know, I know, preaching to the choir. An example: you're creating a complex document, maybe an RFP (request for proposal). You have the document you are creating and a document that describes how to respond to the RFP. You want the LLM to interact with you to complete and grade each section of the RFP, and you can't drift; RFP sections are very specific. Right now I create a parent/child doc definition. I use the parent doc to provide context, like AI: "Hey, you're missing sections 1A and 3C. Which one would you like to work on?" Human: "3C." Then I similarity-search for the 3C description and guidance, and use that section-specific guidance for the chat. Currently I've set up a "plan and execute" agent that uses the main doc description to drive the plan, then iterates over the sections using an "evaluate" tool to drill into each section.
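For anyone following along, here is a minimal Python sketch of that parent/child pattern as I read it. Everything here (the `embed` placeholder, section IDs, record layout) is illustrative, not taken from the actual implementation described above.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding -- swap in a real embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

# Parent record: tracks which RFP sections exist and which are complete.
parent = {
    "doc": "RFP-2024-001",
    "sections": {"1A": False, "2B": True, "3C": False},  # section -> completed?
}

# Child records: section-specific response guidance to retrieve for the chat.
children = [
    {"section": "1A", "guidance": "Describe corporate experience and past performance..."},
    {"section": "3C", "guidance": "Provide pricing assumptions, exclusions, and grading criteria..."},
]
for c in children:
    c["vec"] = embed(c["guidance"])

def missing_sections(p: dict) -> list[str]:
    return [s for s, done in p["sections"].items() if not done]

def guidance_for(query: str, section: str) -> dict:
    """Similarity search restricted to the chosen section's children."""
    qv = embed(query)
    candidates = [c for c in children if c["section"] == section]
    return max(candidates, key=lambda c: float(qv @ c["vec"]))

# One agent turn: report missing sections, then pull guidance for the one the human picks.
print("Missing sections:", missing_sections(parent))   # e.g. ['1A', '3C']
picked = "3C"
print(guidance_for("How should section 3C be answered?", picked)["guidance"])
```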

If someone has a better strategy I'm all ears, I'm... I mean eyes :eyes:

1 Like

Bias, as spoken about here in the context of data collection, extrapolation, and interpretation, means being tilted in one direction along a two-directional pathway. Anything that ignores some data and returns a response in one direction that is misguided, illogical, or that dangerously triggers a reaction based on biased information can lead to responses that are also misinformed. That is the simple gist of it. In fact, almost all data is biased, no matter how much we try to make it otherwise, and that is the hard part. Next, even if we have knowledge of this bias, it is hard for us to control the reactions that come from it. For instance, with Phi-2, the producers of that LLM were able to create a 2.7-billion-parameter model that performed better than Mamba and GPT-3.5, and on a scale equivalent to an earlier 1.7-trillion-parameter model, by reducing non-scientific data. This may also be questionable, depending on what metrics you consider 'better' and 'worse,' but generally we are looking for 'more accurate.'

This applies philosophically to any field, and humans are inherently biased due to their perception of reality, which makes it difficult for humans to determine what 'reality' is. Locke, Dante, Socrates, Thales (c. 600 BC), and many other philosophers have wrestled with this problem. It is just like looking at a cup and seeing a solid object while knowing it is made of moving atoms. This is visual bias rooted in the wavelengths and wave functions the eyes can receive, the signals they transmit to the brain, and the brain's interpretation of those signals. All data that humans perceive is biased. AI and AI sensors will bring more 'truth' to the world by utilizing sensors with abilities superior to humans'.

However, in the interim, we as humans have to deal with the fact that our own data is biased, which makes it the hardest issue to deal with. And that is just the beginning. Biases come in many forms, including coding biases, data-quality biases, regulatory-related biases, and dictatorial influence on data driven by media singularity. Media pluralism results in better data through the presentation of multiple vantage points of bias; however, it is still biased.

Sure, I wrote a paper for Plos S and T (PLOS One). It is still being reviewed, but I can send the link: Anchoring Global Security: Autonomous Shipping with Mind Reading AI, GPT-core and MAMBA-core Agents, RAG-Fusion, AI Communities, Hive-AI, and the Human Psyche.

You did not read the paper.


I doubt you can read a 44-page report in 10 minutes, and I know you did not because I wrote it. Here is a screenshot of page 16.

This discussion seems to have gone a bit off the rails. Let's maintain decorum on the forum; having an academic discussion doesn't necessitate personal attacks.

1 Like

The paper specifically states the advantages, and more importantly the psychological effects, of using AI with RAG-Fusion, and the difficulties people associate with AI. The paper also goes into the human perception of what is difficult and dangerous, and what those perceptions are, which exactly covers this post's topic. The topic of this discussion is "biggest difficulty in developing LLM apps." My response is that controlling, overseeing, and cooperating in the presence of bias is the most difficult part. So knowing the definitions of difficulty, bias, and developing LLM apps becomes critical. We can turn to the papers and scientific research that describe how people feel about this and what their perceptions are, and then discover new ways to approach these difficulties. It is a scientific discussion in the realm of physics, philosophy, biology, and computer science, which requires cross-field expertise. That is also difficult to find today, and in my experience I have faced a lot of problems explaining this to people. So I usually start by giving examples, then research proof, and then I ask questions.

I've seen the entire conversation, and I'll give the best possible advice you can get here.

When another community member, especially a community leader, tells you that:

I'm going to go ahead and give you an opportunity to revise your post and remove the snark and condescension, neither of which are necessary here.

It's a good idea to take that opportunity.

If you want to have a proper discussion about the paper, I'd advise you to make a separate topic about it instead of posting about it on other people's topics.

4 Likes

I got it. I revised it and deleted the old comment.
I provided a better example, with definitions, explanations, and examples of understanding complex topics, which allows us to understand the "Biggest difficulty in developing LLM apps," not just RAG. I hope the revisions and the added content, along with solutions and psychological surveys of 6,000 people globally, help to cement the idea of difficulty, fear, and understanding of AI, its creation, and its perception (pages 12-16).


Please let me know if you think this is appropriate.

Great question. Mine is knowledge "freshness". Indexing current knowledge at scale is achievable. But "pruning" the knowledge so that only the essential elements are indexed, and keeping it fresh (as the pace of knowledge generation is quickening), is a problem I am thinking deeply about.

3 Likes

Just add time dimensions to your vector? If you don't want to do that, you can also just rerank by timestamp :thinking:
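Something like this is what I had in mind (rough Python sketch; the half-life, weights, and field names are all made up for illustration):

```python
import time

def recency_score(ts: float, half_life_days: float = 180.0) -> float:
    """Exponential decay: 1.0 for brand-new content, 0.5 after one half-life."""
    age_days = (time.time() - ts) / 86400
    return 0.5 ** (age_days / half_life_days)

def rerank(hits: list[dict], alpha: float = 0.7) -> list[dict]:
    """hits: dicts with 'similarity' (0..1) and 'timestamp' (unix seconds)."""
    return sorted(
        hits,
        key=lambda h: alpha * h["similarity"] + (1 - alpha) * recency_score(h["timestamp"]),
        reverse=True,
    )

hits = [
    {"id": "old-but-relevant", "similarity": 0.92, "timestamp": time.time() - 400 * 86400},
    {"id": "fresh",            "similarity": 0.85, "timestamp": time.time() - 5 * 86400},
]
print([h["id"] for h in rerank(hits)])  # order depends on how you weight freshness
```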

2 Likes

What are your current ideas for resolving this? Do you think @Diet's recipe will work for you?

But age is only a negative if there is newer data that is inconsistent, right? There is so much junk being produced these days that newer isn't necessarily better.
In the context of scientific articles, SemanticScholar has 'citations' and 'influentialCitations' counts, which I currently balance against age (recognizing also that newer content has had less time to accumulate citations).
But recognizing inconsistencies across sources, which I'd love to do, is a very challenging task. Hmm.
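For what it's worth, here is one illustrative way (not necessarily the exact formula being used) to trade citation counts off against age, in Python:

```python
from datetime import datetime

def citation_score(citations: int, influential: int, year: int,
                   influential_weight: float = 4.0) -> float:
    """Citations per year, with influential citations weighted more heavily."""
    age_years = max(datetime.now().year - year, 1)  # avoid divide-by-zero for brand-new papers
    return (citations + influential_weight * influential) / age_years

# A heavily cited 2015 paper vs. a promising 2023 one:
print(citation_score(citations=900, influential=40, year=2015))
print(citation_score(citations=60, influential=8, year=2023))
```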

2 Likes

What you describe is actually a very important concept that is under active research and development by many researchers.

I imagine, though, that it may require including a timestamp of sorts alongside all of the training data used, which is itself non-trivial, and there may need to be some kind of new transformer architecture that is able to make use of that additional information in a meaningful way.

I'm imagining some kind of mechanism where, during training, if you have a document with the text "the English Monarch is Queen Elizabeth II" and a 2021 timestamp, and a document with the text "the English Monarch is King Charles III" and a 2023 timestamp, we would want the weights to be modified in such a way as to increase the probability of generating,

  • the English Monarch is King Charles III
  • the English Monarch was Queen Elizabeth II
  • the former English Monarch was Queen Elizabeth II
  • etc.

and decrease the probability of generating,

  • the English Monarch is Queen Elizabeth II
  • the crown prince of England is Prince Charles
  • etc

I imagine this is, at the moment, a non-trivial task because it amounts to, in effect, using a model's training data to update its training data after the fact, on the fly, so it would be as though all of the training data with earlier timestamps had been rewritten from a current temporal POV.
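To make the simplest version of that concrete, here is a toy Python sketch of timestamp-prefixed training examples. This only covers the data-formatting side; the architectural changes speculated about above are a separate, much harder question, and everything here is illustrative.

```python
docs = [
    {"text": "the English Monarch is Queen Elizabeth II", "date": "2021-04-01"},
    {"text": "the English Monarch is King Charles III",   "date": "2023-05-06"},
]

def to_training_example(doc: dict, as_of: str = "2024-01-01") -> str:
    # The example carries both the document's own date and a "current" date,
    # so a model could in principle learn to restate older claims from a
    # present-day point of view.
    return f"[written: {doc['date']}] [today: {as_of}] {doc['text']}"

for d in docs:
    print(to_training_example(d))
```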

Unless I'm drastically overthinking what you're suggesting, which is entirely likely.

¯\_(ツ)_/¯

1 Like

It's quite possible with recursive content analysis; unfortunately it's quite expensive, because it's hard to predict the concept net for any particular topic.

Would definitely be an interesting project if you could get funding for it, but it's likely gonna have some rather severe social consequences if publicized lol

1 Like

So my new company, Awarity, is focused on pioneering RAG 2.0, which means I have a lot of thoughts around this topic. I'm not ready to share all of my ideas, but here are a couple of things I'll share…

Context window size is an issue, but simply having a bigger context window isn't the only piece of the puzzle. You need to create a Chain-of-Thought to answer more complex questions. I'm working to create a virtualized context window which would let you logically query a model with a billion tokens of context. That in itself isn't enough. The whole problem of reasoning is a compression problem, so it's really about knowing how to efficiently compress that 1 billion tokens of content down into an 8k-16k context window that the model can use to generate an answer.
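The details of the virtualized context window aren't shared here, so purely as an illustration of the budget problem being described, here is a naive Python sketch that recursively condenses retrieved chunks until they fit an 8k-token window (model name and prompt are placeholders, not part of the approach above):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

def squeeze(chunk: str, question: str) -> str:
    """One LLM call that keeps only the material relevant to the question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Keep only the facts relevant to: {question}\n\n{chunk}"}],
    )
    return resp.choices[0].message.content

def compress_to_budget(chunks: list[str], question: str,
                       budget: int = 8000, max_rounds: int = 3) -> str:
    """Repeatedly condense the chunks until the combined context fits the budget."""
    context = "\n\n".join(chunks)
    for _ in range(max_rounds):
        if n_tokens(context) <= budget:
            break
        chunks = [squeeze(c, question) for c in chunks]
        context = "\n\n".join(chunks)
    return context
```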

Identifying what you show to the model is key. The model can only generate grounded answers for things it sees in its context window. If it sees the answer to the question in the information you show it, the model will reliably use that information for its answer. If it doesn't see the information, then all bets are off and it's 50/50 whether it will make something up. The model always wants to generate an answer. Guardrail instructions help, but they're not foolproof. You need a secondary fact-checker call.
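A secondary fact-checker call can be as simple as the following sketch using the OpenAI Python SDK (prompt wording and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a second model whether the answer is fully supported by the retrieved context."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Reply with only YES or NO. YES means every claim in the "
                        "answer is supported by the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nAnswer to check:\n{answer}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```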

3 Likes

In my experience, good old RAG 1.0 still solves the problem of answering questions from 1 billion tokens of content within a 16K context window (with gpt-4-turbo, now just a 4K output window) using old-fashioned cosine similarity. The problem isn't understanding the returned documents, but rendering a comprehensive answer to the question.
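For reference, the "old fashioned cosine similarity" step is essentially just this (illustrative numpy sketch; chunking and embedding happen upstream):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 20) -> np.ndarray:
    """query_vec: (dim,) embedding; doc_vecs: (n_chunks, dim) matrix of chunk embeddings."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]  # indices of the k most similar chunks
```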

Again, in my experience, that failure to render a comprehensive answer is the result of two limitations:

  1. not returning enough context documents
  2. the model not being able to efficiently read all of the context documents (i.e., losing details in the middle).

Neither of these problems is resolved through summarization or compression, particularly if you are dealing with legal documents in which every detail must be analyzed and rendered verbatim.

Now, I'm just speaking from my own limited experience. If I'm wrong, please educate me. If I am dealing with a 750-page labor agreement, and I need to know not only the rules around holiday pay but every possible rule that affects every possible classification of worker, I just don't see how summarization is going to get me the comprehensive and detailed answer I need, as an employer, to make a fully informed decision.

1 Like

I have a test corpus that I call Bowl Season, which currently contains 42 documents (43 after Monday), one for each college football game. I can ask that corpus, "Tell me the score of every game and the player from each game with the most receiving yards." Using my system, I can get GPT-3.5 or GPT-4 to accurately answer that question using an 8k context window. With traditional RAG and a 128k context window, even GPT-4 struggles.

That's my small corpus. My largest corpus has over 1M documents and 500 million tokens. For the smaller corpus I let the model reason over every doc in the corpus; it needs to do that to answer questions like "tell me the score of every game." For larger corpuses I only let the model "potentially" see around 25% of the corpus. My algorithm is generally capable of answering complex questions over any size corpus. It really comes down to how long you want to let it run (reasoning over 1M documents could take upwards of 100 LLM calls) and how much money you are willing to spend.
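This is not the poster's algorithm (which isn't shared), but a generic map-reduce baseline for "let the model reason over every doc" looks roughly like the sketch below, which also makes clear where the one-call-per-document cost comes from:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer_over_corpus(question: str, docs: list[str]) -> str:
    # Map: one extraction call per document (this is where the ~N-calls cost comes from).
    partials = [
        ask(f"Document:\n{doc}\n\nExtract anything relevant to: {question}\n"
            f"If nothing is relevant, reply NONE.")
        for doc in docs
    ]
    relevant = [p for p in partials if not p.strip().upper().startswith("NONE")]
    # Reduce: one final call to combine the partial answers.
    return ask(f"Question: {question}\n\nNotes from each document:\n"
               + "\n---\n".join(relevant)
               + "\n\nCombine these notes into a single complete answer.")
```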

1 Like