This sounds like the wrong tool for this particular job.
This seems like it should be some fairly structured data that can be parsed and queried using more traditional means.
This is a test corpus and it's just one out of 100 eval queries that I'm working on. Other queries are "compare the performance of the quarterbacks in the rose bowl with the quarterbacks in the sugar bowl" (GPT-4 does the best with this, Mixtral 8x7B does ok, and GPT-3.5 fails), "tell me the team with the most consecutive scores in any game" (all models fail), etc.
What I like about this corpus is that it's small and easily fact checked, so coming up with 100 eval queries for this corpus that test different dimensions of reasoning is easy. For example, I can already tell you that GPT-3.5 is better at counting than GPT-4, although all models I've tested generally suck at counting.
As humans we're more than capable of reading through every page on espn.com and compiling scores and stats from every bowl game. My ultimate goal is to replicate that behavior with a general purpose reasoning engine that can answer complex questions using semi-structured data.
Surely log probs and a check of the response against the database could be enough to validate the answer.
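To make the log probs plus database check idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the 0.8 threshold, the dictionary standing in for the database, and the placeholder team names. It assumes you already have the per-token log probabilities for the answer from whatever API you use.

```python
import math

def mean_token_probability(token_logprobs):
    """Geometric-mean probability of the answer tokens (exp of the mean log prob)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def validate_answer(answer, token_logprobs, known_facts, key, min_confidence=0.8):
    """Accept an answer only if the model was confident AND it matches the stored fact."""
    confident = mean_token_probability(token_logprobs) >= min_confidence
    matches_db = known_facts.get(key) == answer
    return confident and matches_db

# Hypothetical stand-in for the database; placeholder values, not real results.
known_facts = {"example_bowl_winner": "Team A"}

print(validate_answer("Team A", [-0.05, -0.10], known_facts, "example_bowl_winner"))
print(validate_answer("Team B", [-0.05, -0.10], known_facts, "example_bowl_winner"))
```

The exact-match comparison only works for answers that map to a single stored value. Free-form answers would need some normalization step first.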
The other public dataset I can share is our SEC Filing corpus. We're ingesting around 350 SEC filings a day, which equates to 17M tokens a day. You can ask "tell me every company that was acquired today" and get an accurate answer back, or "tell me every company that had a leadership change today".
That's a great suggestion.
I don't know. There seems to me a huge difference between giving me all the scores or telling me all the companies acquired on a day, particularly if you can use these keywords and get responses, and telling me all the rules and exceptions to holiday pay rules in a legal agreement that may or may not use the term "holiday pay", may or may not even use the terms "pay" or "rule".
These two categories of search seem to underline the purpose of using LLMs for search in the first place: keyword search vs. semantic search. Searching for specific words, terms, phrases (such as "winning team" or "company acquired") as opposed to searching for an idea or concept ("holiday compensation for multiple categories of workers").
I believe you are correct. Phi-2 (Phi-2: The surprising power of small language models - Microsoft Research), a 2.7-billion-parameter model, produces (x1000) the same results as a 1.7-trillion-parameter model that was made just a year ago. This is due to the quality of the data. When you look at these metrics and what Phi-2 did, it becomes obvious that to produce a model that hallucinates less, you need to select the best data to feed it first. This will change soon with the introduction of inferring neural activity before plasticity as a foundation for learning beyond backpropagation (i.e., Inferring neural activity before plasticity as a foundation for learning beyond backpropagation | Nature Neuroscience), where models have the ability to change node connectors in real time with a switching mechanism. It may be a while, but it will come, as seen in Onen 2023.
Totally agree. Value sets in humans work both ways with multimodal neurons. A trigger like seeing the photo of "Spider-Man" or "Halle Berry" (Single-Cell Recognition: A Halle Berry Brain Cell | www.caltech.edu) leads to the electrochemical activation of very specific multimodal neurons associated with those images. While activated, those neurons interact with any other associated neurons. So, an increase in matrix recognition based on both ideas and specific words would yield "imaginative" or "better" results than that of just one or the other. However, 3-point vectorization or multi-point vectorization can be a result of this process. If we are looking for unimaginative results, we reduce the possible vectorization, but it is very common to think that this will lead to better, more efficient data streams.
So, the question comes... do you want a search that reveals a possible hallucination/imagination, or a result that is curated to the requested response so that very specific, non-wandering answers will be produced?
Automaton, or human?
Want to say anything more about what differentiates RAG from RAG 2.0?
I have my own ideas I'll be posting soon, but curious to hear yours?
IMHO, "RAG" is a tiny baby step towards serious attention to the role of memory in cognitive architecture. One: it's read-only, from the LLM's perspective. Two: it's too shallow, both in indexing and retrieval. Three: without re-ranking, it's too query-independent.
"Working memory"? 7+/-2 anyone?
When it comes to "freshness", have you tried RRF, or reciprocal rank fusion?
So one stream is from the embeddings. The other is time (or freshness). You fuse both time and relevancy with RRF.
There is no "pruning" here, just forcing a time dimension into your ranking. You could also upgrade or downgrade the importance of time with the constant divisor in the denominator.
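For anyone who hasn't used RRF, here is a minimal sketch of fusing a relevance ranking with a freshness ranking. Each document gets 1 / (k + rank) from every stream, and the contributions are summed. The function name, the document ids, and the per-stream `ks` argument are assumptions for illustration. Using a different constant per stream is my reading of the "constant divisor" remark above; plain RRF uses a single k, commonly 60.

```python
def rrf_fuse(rankings, ks=None, default_k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    if ks is None:
        ks = [default_k] * len(rankings)
    scores = {}
    for ranking, k in zip(rankings, ks):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

by_relevance = ["doc_a", "doc_b", "doc_c"]  # e.g. from embedding similarity
by_freshness = ["doc_b", "doc_c", "doc_a"]  # newest first

print(rrf_fuse([by_relevance, by_freshness]))
# A smaller k for the relevance stream makes its top ranks count for more.
print(rrf_fuse([by_relevance, by_freshness], ks=[1, 60]))
```

Nothing is dropped from either list here. A stale but highly relevant document just ends up with a lower fused score.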
One thing I would suggest if you're going to use this approach would be to use binned times based on your needs.
As a contrived example, say you have 86,400 documents all timestamped one second apart. If the most relevant document is the "oldest" at one day old, that might still meet your minimum threshold value for "freshness." There's no reason to downgrade its rank so far.
You could assign a rank of 1 in the time domain to anything within a day, week, month, or any arbitrary time period.
Further, when assigning a time-based rank, you can use any arbitrary scoring system you want, depending on how much you want to penalize "stale" documents.
Everything within a week could get a score of 1, everything else less than a month could get a 10, and anything older than that could get a 100.
Like I said, it's completely arbitrary.
¯\\\_(ツ)\_/¯
Only you and your data know how you should penalize older documents.
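A rough sketch of the binning idea, using the week / month / older example above. The 30-day month, the names, and k=60 are assumptions. The bin edges and ranks are meant to be swapped for whatever suits your data.

```python
DAY = 86_400  # seconds in a day

# (maximum age in seconds, time-domain rank). Both columns are arbitrary choices.
AGE_BINS = [(7 * DAY, 1), (30 * DAY, 10)]
STALE_RANK = 100  # anything older than the last bin

def time_rank(age_seconds):
    """Map a document's age to a coarse rank instead of a strict newest-first order."""
    for max_age, rank in AGE_BINS:
        if age_seconds <= max_age:
            return rank
    return STALE_RANK

def fused_score(relevance_rank, age_seconds, k=60):
    """RRF-style score from the relevance rank plus the binned time rank."""
    return 1.0 / (k + relevance_rank) + 1.0 / (k + time_rank(age_seconds))

# The most relevant doc is a day old; the runner-up is one second old.
print(fused_score(1, DAY) > fused_score(2, 1))
```

With strict per-second time ranks the day-old document would sit at position 86,400 in the time stream. With bins it shares rank 1 with everything else from that week.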
Good point. You should bin it according to the amount of information, per topic even, that you are ingesting.
So I would implement this as a timestamp quantized to whatever level. And you can change the quantization as the data ages.
For example, say you have a hot topic, or story, with minute by minute updates coming in. So quantize a UNIX timestamp to the nearest minute. I would use the floor operation (see below). But this just adds some zeros to the end, and quantizes the output.
As the topic "ages", go back and update this same timestamp, but say floor it to the day. Again adds more zeros.
Then after a while, quantize (floor) it to the week. More zeros.
You continually do this, essentially adding more zeros. Up to year, decade, whatever your quantization schedule is.
And all you are doing is modifying an integer representing time in the database next to the item. So low overhead, computationally cheap.
You would also want to automate these quantization updates, by forming a future timeline and having the system update the quantization based on these scheduled events.
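Here is a minimal sketch of age-based requantization with the floor operation. The schedule (minute resolution for the first day, day resolution for the first 30 days, week resolution after that) and all the names are assumptions. In a real system a scheduled job would run something like `requantize` over the stored integers.

```python
MINUTE, DAY, WEEK = 60, 86_400, 604_800  # step sizes in seconds

# (maximum age in seconds, quantization step). A hypothetical schedule.
SCHEDULE = [(DAY, MINUTE), (30 * DAY, DAY)]
OLDEST_STEP = WEEK

def floor_timestamp(ts, step_seconds):
    """Floor a UNIX timestamp to a multiple of step_seconds."""
    return ts - (ts % step_seconds)

def requantize(ts, now):
    """Pick the quantization step from the item's age, then floor the timestamp."""
    age = now - ts
    for max_age, step in SCHEDULE:
        if age <= max_age:
            return floor_timestamp(ts, step)
    return floor_timestamp(ts, OLDEST_STEP)

ts = 1704663698
print(requantize(ts, now=ts + 3600))      # fresh: minute resolution
print(requantize(ts, now=ts + 2 * DAY))   # a couple of days old: day resolution
print(requantize(ts, now=ts + 90 * DAY))  # months old: week resolution
```

One caveat: flooring to a minute, day, or week gives a multiple of that unit in UTC, so the result does not literally end in decimal zeros. Week floors also land on Thursdays, because the UNIX epoch started on one.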
You could also trivially zero fill the integer from the right on some schedule. This way you don't even have to think of days/weeks etc., you are just zero filling a string at different decimal places, from the right, over time on some schedule. You just need to stop at some level, so your integer doesn't become all zeros!
So a schedule of:
1704663698 ← to the second
1704663690
1704663600
1704663000
1704660000
1704600000
1704000000
1700000000
1700000000 ← I'd keep the first two digits at a minimum, but it gets funky and you should start thinking of actual year/decade here. But to the RRF, it may not matter so much since it is a linear 1, 2, 3, ... ranking without actual time.
So with this in mind, just keep driving back.
1700000000
1690000000
1680000000
1670000000 ... etc. at whatever quantization
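The zero-fill schedule can be generated with integer division, no string handling needed. A small sketch, with assumed function names and a default of keeping the first two digits, for non-negative integer timestamps:

```python
def zero_fill(ts, digits):
    """Zero out the lowest `digits` decimal digits of a non-negative integer."""
    factor = 10 ** digits
    return (ts // factor) * factor

def zero_fill_schedule(ts, keep_digits=2):
    """Every stage of the schedule, stopping once only keep_digits digits remain."""
    total_digits = len(str(ts))
    return [zero_fill(ts, d) for d in range(total_digits - keep_digits + 1)]

for stage in zero_fill_schedule(1704663698):
    print(stage)
```

For 1704663698 this prints the nine values from 1704663698 down to 1700000000. The last two stages are identical because the third digit is already a zero.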
The beauty of RRF is that it isn't so sensitive to the magnitude of change as the data gets older and older relative to the present. So you can be sloppy.
The RRF "kernel" is just 1/x, which has a long tail and relative consistency for large x.
My interest is data analysis since that has been my coolest experience with AI thus far. From that end, there are three.
(1) Context window: it's possible the bigger limitation is me not knowing how to connect GPT directly to the DB to query it and then load only the data restricted to the search parameters.
(2) Newly created Assistants that can't answer anything about their KB without incessant goading and prodding about their tools to finally get them to work.
(3) Inability for Assistants to retrieve a new file programmatically uploaded to them. The file is not indexed and loaded into memory the way it is when you drag and drop a new file into the prompt of a custom GPT.
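On point (1), one common pattern is to run the filter in the database yourself and put only the matching rows into the prompt. A minimal sketch using Python's built-in `sqlite3` module; the `scores` table, its columns, the placeholder rows, and `build_context` are all made up for illustration, and the actual model call is left out.

```python
import sqlite3

def build_context(conn, team, max_rows=20):
    """Fetch only the rows matching the search parameter and format them as text."""
    rows = conn.execute(
        "SELECT game, team, points FROM scores WHERE team = ? ORDER BY game LIMIT ?",
        (team, max_rows),
    ).fetchall()
    return "\n".join(f"{game}: {name} scored {points}" for game, name, points in rows)

# Hypothetical in-memory database with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (game TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Game 1", "Team A", 21), ("Game 1", "Team B", 14), ("Game 2", "Team A", 7)],
)

context = build_context(conn, "Team A")
prompt = f"Answer using only these rows:\n{context}\n\nQuestion: ..."
print(prompt)
```

The row limit keeps the context bounded no matter how large the table gets, which is the part that matters for the context window.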
Could you explain a bit what you do in data analysis using LLMs? I found LLMs useful for qualitative data analysis, but not for quantitative.
The whole purpose of RAG is to eliminate the need for shrinking large data sets by pinpointing the specific texts within the specific documents within the dataset that you need.
I believe you are talking about the same idea, just in different words. You can think of finding a very specific part of information in a big dataset as shrinking the information without losing anything substantial.
We currently have significant limitations as to how much information we can process (say, the context window in this case).
It is increasing over time, but there will probably always be some limit in this sense. And that's where "shrinking" (or whatever we call it) is kind of the last mile in this process.
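As a rough illustration of that last mile, here is a sketch that greedily packs the highest-scoring retrieved chunks into a fixed token budget. The names are assumptions, and whitespace word count is only a crude stand-in for a real tokenizer.

```python
def select_chunks(scored_chunks, token_budget, count_tokens=lambda text: len(text.split())):
    """Keep the highest-scoring chunks that still fit within the token budget."""
    selected, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected

# Hypothetical (score, text) pairs from a retriever.
chunks = [
    (0.9, "alpha beta gamma"),
    (0.2, "delta"),
    (0.7, "epsilon zeta eta theta"),
    (0.5, "iota kappa"),
]
print(select_chunks(chunks, token_budget=5))
```

A chunk that doesn't fit is skipped rather than ending the loop, so smaller lower-ranked chunks can still fill the remaining space.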