Data Quality and Bias: Ensuring the quality and diversity of data used to train LLMs is a significant challenge. Biases in the training data can lead to biased outputs, which can be a serious issue, especially in sensitive applications. Striving for unbiased, fair, and representative data is a continuous and complex task.
That said, I think this problem will drastically decrease in the near future, once AI is actively collecting data in real time from sensor arrays hooked to various agents in a hybrid, API-driven agent complex, or Hive-AI.
Is there a paper associated with this repo? I'd be interested to see some replicable performance metrics.
I think you need to specify precisely what you mean by bias here. Being "unbiased" may not be a universally good thing in the context of LLM training data.
Biggest problem I'm seeing with RAG is keeping the LLM focused while maintaining context. I know, I know, preaching to the choir. An example: you're creating a document that is complex, maybe an RFP (request for proposal). You have the document that you are creating and you have a document that describes how to respond to the RFP. You want the LLM to interact with you to complete and grade each section of the RFP. You can't drift; RFP sections are very specific. Right now I create a parent/child doc definition. I use the parent doc to provide context, like: AI: "Hey, you're missing sections 1A and 3C. Which one would you like to work on?" Human: "3C." Then I run a similarity search for the 3C description and guidance, and use that section-specific guidance for this chat. Currently I've set up a "plan and execute" agent using the main doc description to drive the plan, then iterate over the sections using an "evaluate" tool with the agent to drill into each section (rough sketch below).
If someone has a better strategy, I'm all ears, er, I mean eyes :)
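For reference, a minimal sketch of that parent/child flow, assuming a LangChain-style vector store and the OpenAI chat API (the section regex, helper names, and prompts are all illustrative, not the exact setup):

```python
# Rough sketch of the parent/child RFP flow described above (illustrative only).
import re
from openai import OpenAI

client = OpenAI()

def extract_section_ids(parent_doc: str) -> list[str]:
    # Pull section labels like "1A" or "3C" out of the parent outline.
    return re.findall(r"\b\d+[A-Z]\b", parent_doc)

def find_missing_sections(parent_doc: str, completed: set[str]) -> list[str]:
    # The parent doc provides the overall context: which sections still need work.
    return [s for s in extract_section_ids(parent_doc) if s not in completed]

def work_on_section(section_id: str, vector_store) -> str:
    # Similarity-search the child docs for section-specific guidance,
    # then keep the chat scoped to that guidance only.
    guidance = vector_store.similarity_search(
        f"How to respond to RFP section {section_id}", k=4
    )
    context = "\n\n".join(d.page_content for d in guidance)
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": f"You are drafting RFP section {section_id}. "
                        f"Use ONLY this guidance:\n{context}"},
            {"role": "user",
             "content": f"Draft section {section_id}, then grade it against the guidance."},
        ],
    )
    return response.choices[0].message.content
```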
Bias, as spoken about here in the context of data collection, extrapolation, and interpretation, means being tilted in one direction along a two-directional pathway. Anything that ignores some data and returns a response in one direction that is misguided, illogical, or that dangerously provokes a reaction based on biased information can lead to responses that are also misinformed. That is the simple gist of it. In fact, almost all data is biased, no matter how hard we try to make it otherwise, and that is the hard part. Worse, even when we know about a bias, it is hard to control the reactions that come from it. For instance, the producers of Phi-2 were able to create a 2.7-billion-parameter LLM that, by their account, performed better than MAMBA and GPT-3.5 and on a scale comparable to a much earlier model cited at 1.7 trillion parameters, largely by cutting non-scientific, low-quality data. Even this is questionable, depending on what metrics you consider "better" and "worse," but generally we are looking for "more accurate."
Philosophically, this applies to any field: humans are inherently biased by their perception of reality, which makes it difficult for humans to determine what "reality" is. Locke, Dante, Socrates, Thales (600 BC), and many other philosophers have wrestled with this problem. It is just like looking at a cup and seeing a solid object, even though you know it is made of moving atoms. That is visual bias arising from the wavelengths the eyes can receive, the transmission of those signals to the brain, and the brain's interpretation of them. All data that humans perceive is biased. AI and AI sensors will bring more "truth" to the world by utilizing sensors with abilities superior to our own.
However, in the interim, we as humans have to deal with the fact that our own data is biased, and that makes it the hardest issue to deal with. And that is just the beginning. Biases come in many forms, including coding bias, data-quality bias, regulatory-related bias, and dictatorially influenced data driven by media singularity. Media pluralism results in better data by presenting multiple vantage points of bias; however, it is still biased.
Sure, I wrote a paper for Plos S and T (Plos One). It is still being reviewed, but I can send the link: "Anchoring Global Security: Autonomous Shipping with Mind Reading AI, GPT-core and MAMBA-core Agents, RAG-Fusion, AI Communities, Hive-AI, and the Human Psyche"
- December 2023
- DOI: 10.13140/RG.2.2.34188.90247
There are actually many papers (about 200) that I read before writing that, and many of them contained theory about bias. Feel free to check the references.
You did not read the paper.
Page 16. I doubt you can read a 44-page report in 10 minutes.
And I know you did not, because I wrote it, and here is a screenshot of page 16.
This discussion seems to have gone a bit off the rails. Let's maintain decorum on the forum; having an academic discussion doesn't necessitate personal attacks.
The paper specifically states the advantages and, more importantly, the psychological effects of using AI with RAG-Fusion, and the difficulties people associate with AI. The paper also goes into the human perception of what is difficult and dangerous, and what perceptions about that are, which exactly covers this post's topic. The topic of this discussion is "biggest difficulty in developing LLM apps." My response is that controlling, overseeing, and cooperating in the presence of bias is the most difficult part. So knowing the definitions of difficulty, bias, and developing LLM apps becomes critical. We can turn to the papers and scientific research that describe how people feel about this and what their perceptions are, and then discover new ways to approach these difficulties. It is a scientific discussion in the realms of physics, philosophy, biology, and computer science, which requires cross-field expertise. That is also difficult to find today, and in my experience I have faced a lot of problems explaining this to people. So I usually start by giving examples, then research proof, and then ask questions.
I've seen the entire conversation, and I'll give the best possible advice you can get here.
When another community member, especially a community leader, tells you that:
I'm going to go ahead and give you an opportunity to revise your post and remove the snark and condescension, neither of which are necessary here.
It's a good idea to take that opportunity.
If you want to have a proper discussion about the paper, I'll advise you to make a separate topic about it, instead of posting about it on other people's topics.
I got it. I revised it, and deleted the old comment.
I provided a better example, with definitions, explanations, and examples of understanding complex topics, which allows us to understand the "biggest difficulty in developing LLM apps," not just RAG. I hope the revisions and the added content, along with solutions and psychological surveys of 6,000 people globally, help to cement the idea of difficulty, fear, and understanding of AI, its creation, and its perception (pages 12-16).
Please let me know if you think this is appropriate.
Great question. Mine is knowledge "freshness". Indexing current knowledge at scale is achievable. But "pruning" the knowledge so that only the essential elements are indexed, and keeping it fresh (as the pace of knowledge generation is quickening), is a problem I am thinking deeply about.
Just add time dimensions to your vector? If you don't want to do that, you can also just rerank by timestamp.
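A minimal sketch of the rerank-by-timestamp idea, in case it helps (the metadata fields and the weighting are placeholders, not a recommendation):

```python
# Minimal sketch: blend cosine similarity with a 0-1 recency score (illustrative).
# Assumes each hit carries a "similarity" score and a "timestamp" in its metadata.
from datetime import datetime, timezone

def rerank_by_recency(hits, recency_weight=0.3):
    """Sort hits by a blend of similarity and recency; weights are arbitrary."""
    now = datetime.now(timezone.utc)
    max_age_days = 365 * 5  # anything older than ~5 years gets recency ~ 0

    def score(hit):
        age_days = (now - hit["timestamp"]).days
        recency = max(0.0, 1.0 - age_days / max_age_days)
        return (1 - recency_weight) * hit["similarity"] + recency_weight * recency

    return sorted(hits, key=score, reverse=True)

# Example usage with toy data:
hits = [
    {"text": "older but highly similar", "similarity": 0.92,
     "timestamp": datetime(2019, 1, 1, tzinfo=timezone.utc)},
    {"text": "newer, slightly less similar", "similarity": 0.88,
     "timestamp": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
print([h["text"] for h in rerank_by_recency(hits)])
```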
What are your current ideas for resolving this? Do you think @Diet's recipe will work for you?
But age is only a negative if there is newer data that is inconsistent, right? There is so much junk being produced these days that newer isn't necessarily better.
In the context of scientific articles, SemanticScholar has "citations" and "influentialCitations" counts, which I currently balance against age (realizing also that newer content has had less time to accumulate citations); a rough sketch of that balance is below.
But recognizing inconsistencies across sources, which I'd love to do, is a very challenging task. Hmm.
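Here is the kind of citations-versus-age balance I mean, purely as a sketch (the weights, per-year normalization, and decay are invented for illustration, not the actual formula):

```python
# Illustrative scoring sketch: balance citation counts against article age.
# Field names follow the post above; the weights and decay are made up.
import math
from datetime import date

def freshness_score(citations: int, influential_citations: int,
                    published: date, today: date | None = None) -> float:
    today = today or date.today()
    age_years = max((today - published).days / 365.25, 0.25)
    # Citations per year, so newer papers aren't penalized for having had less time.
    cite_rate = (citations + 2 * influential_citations) / age_years
    # Mild recency decay on top of the rate.
    recency = math.exp(-age_years / 10)
    return cite_rate * (0.5 + 0.5 * recency)

print(freshness_score(citations=120, influential_citations=15,
                      published=date(2018, 3, 1)))
```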
What you describe is actually a very important concept which is under current research and development by many researchers.
I imagine though it may require including a timestamp of sorts alongside all of the training data used, which itself is non-trivial, and there may need to be some kind of new transformer architecture which is able to make use of that additional information in a meaningful way.
I'm imagining some kind of mechanism where, during training, if you have a document with the text "the English Monarch is Queen Elizabeth II" with a 2021 timestamp and a document with the text "the English Monarch is King Charles III" with a 2023 timestamp, we would want the weights to be modified in such a way as to increase the probability of generating,
- the English Monarch is King Charles III
- the English Monarch was Queen Elizabeth II
- the former English Monarch was Queen Elizabeth II
- etc.
and decrease the probability of generating,
- the English Monarch is Queen Elizabeth II
- the crown prince of England is Prince Charles
- etc
I imagine this is, for the moment, a non-trivial task because it amounts to, in effect, using a model's training data to update its training data after the fact, on the fly, so it would be as though all of the training data with earlier timestamps was rewritten from a current temporal point of view.
Unless I'm drastically overthinking what you're suggesting, which is entirely likely.
¯\_(ツ)_/¯
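One very rough way to picture just the bookkeeping side of that idea is attaching a timestamp to every training document so a hypothetical time-aware model could condition on when the text was written; this says nothing about the new architecture that would actually have to exploit the signal:

```python
# Illustrative only: prefix each training document with its timestamp as metadata.
# This is just the data bookkeeping, not a training recipe or architecture.
from datetime import date

def tag_with_timestamp(text: str, written: date) -> str:
    return f"<|timestamp: {written.isoformat()}|>\n{text}"

docs = [
    ("the English Monarch is Queen Elizabeth II", date(2021, 5, 1)),
    ("the English Monarch is King Charles III", date(2023, 5, 6)),
]
training_examples = [tag_with_timestamp(text, written) for text, written in docs]
print(training_examples[0])
```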
It's quite possible with recursive content analysis; unfortunately it's quite expensive, because it's hard to predict the concept net for any particular topic.
Would definitely be an interesting project if you could get funding for it, but it's likely gonna have some rather severe social consequences if publicized lol
So my new company, Awarity, is focused on pioneering RAG 2.0, which means I have a lot of thoughts around this topic. I'm not ready to share all of my ideas, but here are a couple of things I'll share...
Context window size is an issue, but simply having a bigger context window isn't the only piece of the puzzle. You need to create a Chain-of-Thought to answer more complex questions. I'm working to create a virtualized context window which would let you logically query a model with a billion tokens of context. That in itself isn't enough. The whole problem of reasoning is a compression problem, so it's really about knowing how to efficiently compress that 1 billion tokens of content down into an 8k-16k context window that the model can use to generate an answer.
Identifying what you show to the model is key. The model can only generate grounded answers for things it sees in its context window. If it sees the answer to the question in the information you show it, the model will reliably use that information for its answer. If it doesn't see the information, then all bets are off and it's 50/50 whether it will make something up. The model always wants to generate an answer. Guardrail instructions help, but they're not foolproof. You need a secondary fact-checker call.
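A bare-bones sketch of what such a secondary fact-checker call could look like (the prompt wording and model name are placeholders, not a specific recommendation):

```python
# Sketch of a secondary "grounding check" call: ask a second model whether the
# draft answer is actually supported by the retrieved context. Illustrative only.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str) -> bool:
    check = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Reply with exactly SUPPORTED or UNSUPPORTED. "
                        "Say SUPPORTED only if every claim in the answer "
                        "is backed by the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    return check.choices[0].message.content.strip().upper().startswith("SUPPORTED")
```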
In my experience, good old RAG 1.0 still solves the problem of answering questions from 1 billion tokens of content within a 16K context window (with gpt-4-turbo, now just a 4K output window) using old-fashioned cosine similarity. The problem isn't understanding the returned documents, but rendering a comprehensive answer to the question.
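For anyone following along, "old-fashioned cosine similarity" here just means something like the following sketch (the embedding model name is only an example):

```python
# Plain cosine-similarity retrieval, the "RAG 1.0" baseline referred to above.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k(query: str, chunks: list[str], k: int = 8) -> list[str]:
    q = embed([query])[0]
    m = embed(chunks)
    # Cosine similarity = dot product of L2-normalized vectors.
    sims = (m / np.linalg.norm(m, axis=1, keepdims=True)) @ (q / np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```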
Again, in my experience, the difficulty in rendering a comprehensive answer is the result of two limitations:
- not returning enough context documents
- the model not being able to efficiently read all of the context documents (i.e., losing details in the middle).
Neither of these problems is resolved through summarization or compression, particularly if you are dealing with legal documents in which every detail must be analyzed and rendered verbatim.
Now, I'm just speaking from my own limited experience. If I'm wrong, please educate me. If I am dealing with a 750-page labor agreement, and I need to know not only the rules around holiday pay but every possible rule that affects every possible classification of worker, I just don't see how summarization is going to get me the comprehensive and detailed answer I need, as an employer, to make a fully informed decision.
I have a test corpus that I call Bowl Season, which currently contains 42 documents (43 after Monday), one for each college football game. I can ask that corpus "tell me the score of every game and the player from each game with the most receiving yards." Using my system, I can get GPT-3.5 or GPT-4 to accurately answer that question using an 8k context window. With traditional RAG and a 128k context window, even GPT-4 struggles.
That's my small corpus. My largest corpus has over 1M documents and 500 million tokens. For the smaller corpus, I let the model reason over every doc in the corpus; it needs to do that to answer questions like "tell me the score of every game." For larger corpuses, I only let the model "potentially" see around 25% of the corpus. My algorithm is generally capable of answering complex questions over any size corpus. It's really just a question of how long you want to let it run (reasoning over 1M documents could take upwards of 100 LLM calls) and how much money you are willing to spend.
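For contrast, the brute-force baseline implied by questions like "tell me the score of every game" is a simple map-reduce loop over the corpus, something like the sketch below (this is the generic pattern, not the algorithm described above):

```python
# Generic map-reduce over a document corpus: one extraction call per document,
# then a final call to combine the per-document answers. Illustrative baseline only.
from openai import OpenAI

client = OpenAI()

def ask_corpus(question: str, documents: list[str]) -> str:
    partials = []
    for doc in documents:  # "map": one call per document
        r = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Extract only facts relevant to the question. "
                            "If the document has none, reply NONE."},
                {"role": "user",
                 "content": f"Question: {question}\n\nDocument:\n{doc}"},
            ],
        )
        text = r.choices[0].message.content
        if text.strip().upper() != "NONE":
            partials.append(text)
    # "reduce": combine the partial extractions into one answer
    final = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Combine these notes into one complete answer."},
            {"role": "user",
             "content": f"Question: {question}\n\nNotes:\n" + "\n---\n".join(partials)},
        ],
    )
    return final.choices[0].message.content
```

A naive pattern like this needs one call per document (43 calls for the 42-document corpus above, plus the reduce step), which is exactly why bounding how much of a large corpus the model can "potentially" see becomes a time-and-money trade-off at the 1M-document scale.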