Biggest difficulty in developing LLM apps

anon22939549 · January 6, 2024, 3:02am

This sounds like the wrong tool for this particular job.

This seems like it should be some fairly structured data that can be parsed and queried using more traditional means.

stevenic · January 6, 2024, 3:20am

This is a test corpus and it’s just one out of 100 eval queries that I’m working on. Other queries are “compare the performance of the quarterbacks in the rose bowl with the quarterbacks in the sugar bowl” (GPT-4 does the best with this, Mixtral 8x7B does ok, and GPT-3.5 fails), “tell me the team with the most consecutive scores in any game” (all models fail), etc.

What I like about this corpus is that it’s small and easily fact-checked, so coming up with 100 eval queries that test different dimensions of reasoning is easy. For example, I can already tell you that GPT-3.5 is better at counting than GPT-4, although all models I’ve tested generally suck at counting.

As humans we’re more than capable of reading through every page on espn.com and compiling scores and stats from every bowl game. My ultimate goal is to replicate that behavior with a general-purpose reasoning engine that can answer complex questions using semi-structured data.

anon10827405 · January 6, 2024, 3:26am

Surely log probs and a check of the response to the database could be enough to validate the answer
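
A minimal sketch of that idea, assuming your API can return per-token log probabilities for the generated answer and that you have a set of known-good values from the database to check against. The function names and the 0.8 threshold are hypothetical, not from any particular library:

```python
import math

def mean_token_probability(token_logprobs):
    """Geometric-mean probability of the generated tokens, in the range 0..1."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def is_validated(answer, token_logprobs, known_values, min_confidence=0.8):
    """Accept an answer only if the model was confident AND the database agrees."""
    # min_confidence is an arbitrary illustrative threshold; tune it on your evals.
    confident = mean_token_probability(token_logprobs) >= min_confidence
    grounded = answer.strip().lower() in {v.lower() for v in known_values}
    return confident and grounded

# Example: logprobs close to 0 mean the model was confident in each token.
print(is_validated("Michigan", [-0.05, -0.10], ["Michigan", "Alabama"]))
```

The two checks fail independently: a low-confidence answer is rejected even if it happens to match, and a confident answer is rejected if the database has no such value.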

stevenic · January 6, 2024, 3:27am

The other public dataset I can share is our SEC filing corpus. We’re ingesting around 350 SEC filings a day, which equates to about 17M tokens a day. You can ask “tell me every company that was acquired today” or “tell me every company that had a leadership change today” and get an accurate answer back.

stevenic · January 6, 2024, 3:28am

that’s a great suggestion

SomebodySysop · January 6, 2024, 6:00am

I don’t know. There seems to me a huge difference between giving me all the scores or telling me all the companies acquired on a day, particularly if you can use these keywords and get responses, and telling me all the rules and exceptions to holiday pay rules in a legal agreement that may or may not use the term “holiday pay”, may or may not even use the terms “pay” or “rule”.

These two categories of search seem to underline the purpose of using LLMs for search in the first place: keyword search vs. semantic search. Searching for specific words, terms, phrases (such as “winning team” or “company acquired”) as opposed to searching for an idea or concept (“holiday compensation for multiple categories of workers”).

plasmatoid · January 6, 2024, 12:13pm

I believe you are correct. Phi-2 ( Phi-2: The surprising power of small language models - Microsoft Research ), a 2.7-billion-parameter model, produces about the same results as a 1.7-trillion-parameter model (hundreds of times larger) that was made just a year ago. This is due to the quality of the data. When you look at these metrics and at what Phi-2 did, it becomes obvious that to produce a model that hallucinates less, you need to select the best data to feed it first. This will change soon with the introduction of inferring neural activity before plasticity as a foundation for learning beyond backpropagation (i.e. Inferring neural activity before plasticity as a foundation for learning beyond backpropagation | Nature Neuroscience ), where models have the ability to change node connectors in real time with a switching mechanism. It may be a while, but it will come, as seen in Onen 2023.

plasmatoid · January 6, 2024, 12:21pm

Totally agree. Value sets in humans work both ways with multimodal neurons. A trigger like seeing a photo of ‘Spider-Man’ or ‘Halle Berry’ ( Single-Cell Recognition: A Halle Berry Brain Cell | www.caltech.edu ) leads to the electrochemical activation of very specific multimodal neurons associated with those images. While activated, those neurons interact with any other associated neurons. So an increase in matrix recognition based on both ideas and specific words would yield more ‘imaginative’ or ‘better’ results than just one or the other. However, 3-point vectorization or multi-point vectorization can be a result of this process. If we are looking for unimaginative results, we reduce the possible vectorization, but it is very common to think that this will lead to better, more efficient data streams.

So, the question comes… do you want a search that reveals a possible hallucination/imagination, or a result that is curated to the requested response so that very specific, non-wandering answers will be produced?
Automaton, or Human?

Creative or Imaginative Response: This approach involves synthesizing information in a novel or unexpected way. It can be likened to a form of “hallucination” or imagination, where the AI goes beyond the strict boundaries of the input data to generate new ideas or connections. This can be particularly useful for tasks that benefit from creativity, such as generating art, brainstorming ideas, or solving problems that require thinking outside the box.
Specific, Curated Response: Here, the AI focuses on delivering precise, accurate information directly related to the query. This approach is akin to a focused search, where the AI retrieves and synthesizes data strictly relevant to the request. It’s most suitable for tasks that require factual accuracy and specificity, such as answering direct questions, providing explanations, or summarizing known information.

bruce.dambrosio · January 6, 2024, 5:55pm

Want to say anything more about what differentiates RAG from RAG 2.0?
I have my own ideas I’ll be posting soon, but curious to hear yours?
IMHO, ‘RAG’ is a tiny baby step towards serious attention to the role of memory in cognitive architecture. One: it’s read-only, from the LLM’s perspective. Two: it’s too shallow, both in indexing and retrieval. Three: without re-ranking, it’s too query-independent.
‘Working memory’? 7+/-2 anyone?

curt.kennedy · January 7, 2024, 7:12pm

When it comes to “freshness”, have you tried RRF, or reciprocal rank fusion?

So one stream is from the embeddings. The other is time (or freshness). You fuse both time and relevancy with RRF.

There is no “pruning” here, just forcing a time dimension into your ranking. You could also upgrade or downgrade the importance of time with the constant divisor in the denominator.
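
A minimal sketch of that fusion, assuming each stream is simply a list of document ids ordered best-first. The function name and the example ids are hypothetical; 60 is a commonly used default for the constant, and giving the time stream its own constant is one way to read "upgrade or downgrade the importance of time":

```python
from collections import defaultdict

def reciprocal_rank_fusion(streams):
    """Fuse ranked lists. `streams` is a list of (ranking, k) pairs, where
    `ranking` lists doc ids best-first and `k` is that stream's constant."""
    scores = defaultdict(float)
    for ranking, k in streams:
        for rank, doc_id in enumerate(ranking, start=1):
            # Every stream contributes 1 / (k + rank) for each doc it ranks.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; nothing is pruned, only reordered.
    return sorted(scores, key=lambda d: scores[d], reverse=True)

by_relevance = ["a", "b", "c", "d"]   # from the embeddings
by_freshness = ["d", "b", "a", "c"]   # newest first

# Equal weighting of both streams.
balanced = reciprocal_rank_fusion([(by_relevance, 60), (by_freshness, 60)])

# A smaller constant on the time stream makes its top ranks count for more.
time_heavy = reciprocal_rank_fusion([(by_relevance, 60), (by_freshness, 10)])

print(balanced)
print(time_heavy)
```

With equal constants the most relevant document stays on top; shrinking the time stream's constant lets the freshest document overtake it.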

anon22939549 · January 7, 2024, 9:14pm

One thing I would suggest if you’re going to use this approach would be to use binned times based on your needs.

As a contrived example, say you have 86,400 documents all timestamped one second apart. If the most relevant document is the “oldest” at one day old, that might still meet your minimum threshold value for “freshness.” There’s no reason to downgrade its rank so far.

You could assign a rank of 1 in the time domain to anything within a day, week, month, or any arbitrary time period.

Further, when assigning a time-based rank, you can use any arbitrary scoring system you want, depending on how much you want to penalize “stale” documents.

Everything within a week could get a score of 1, everything else less than a month could get a 10, and anything older than that could get a 100.

Like I said, it’s completely arbitrary.

¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

Only you and your data know how you should penalize older documents.
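
A minimal sketch of the binned scoring described above, using the 1 / 10 / 100 example values. Treating a "month" as 30 days is an assumption, and the function names are hypothetical:

```python
DAY = 86_400  # seconds in a day

def time_bin_score(age_seconds):
    """Map a document's age to a coarse time-domain rank (lower is fresher)."""
    if age_seconds <= 7 * DAY:
        return 1      # everything within a week is treated as equally fresh
    if age_seconds <= 30 * DAY:   # "month" approximated as 30 days
        return 10
    return 100        # anything older is penalized heavily

def freshness_term(age_seconds, k=60):
    """The time stream's contribution to an RRF-style score, using the
    binned rank in place of a strict 1, 2, 3, ... ordering."""
    return 1.0 / (k + time_bin_score(age_seconds))

# An hour-old and a six-day-old document get the same time-domain rank.
print(time_bin_score(3_600), time_bin_score(6 * DAY))
```

Because every document in a bin shares one rank, the time stream stops reshuffling documents that are all fresh enough, and relevance decides the order within the bin.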

curt.kennedy · January 7, 2024, 9:31pm

Good point. You should bin it according to the amount of information, per topic even, that you are ingesting.

So I would implement this as a timestamp quantized to whatever level. And you can change the quantization as the data ages.

For example, say you have a hot topic, or story, with minute by minute updates coming in. So quantize a UNIX timestamp to the nearest minute. I would use the floor operation (see below). But this just adds some zeros to the end, and quantizes the output.

As the topic “ages”, go back, and update this same timestamp, but say floor it to the day. Again adds more zeros.

Then after a while, quantize (floor) it to the week. More zeros.

You continually do this, essentially adding more zeros. Up to year, decade, whatever your quantization schedule is.

And all you are doing is modifying an integer representing time in the database next to the item. So low overhead, computationally cheap.

You would also want to automate these quantization updates, by forming a future timeline and having the system update the quantization based on these scheduled events.
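
A minimal sketch of the floor-based quantization, assuming integer UNIX timestamps in seconds. The age thresholds in `step_for_age` are made up for illustration; the real schedule depends on how fast your topic moves:

```python
MINUTE, DAY, WEEK = 60, 86_400, 604_800  # step sizes in seconds

def quantize(ts, step):
    """Floor a UNIX timestamp to a multiple of `step` seconds
    (multiples are counted from the Unix epoch, in UTC)."""
    return ts - ts % step

def step_for_age(age_seconds):
    """Hypothetical schedule: coarser quantization as the item ages."""
    if age_seconds < DAY:
        return MINUTE
    if age_seconds < 4 * WEEK:
        return DAY
    return WEEK

def requantize(ts, now):
    """Value to store next to the item, given the current time."""
    return quantize(ts, step_for_age(now - ts))

ts = 1704663698
print(requantize(ts, ts + 3_600))      # floored to the minute
print(requantize(ts, ts + 2 * DAY))    # floored to the day
print(requantize(ts, ts + 10 * WEEK))  # floored to the week
```

Flooring is idempotent, so a scheduled job can recompute from the value already stored without keeping the original timestamp, as long as each new step is an exact multiple of the previous one (a day is a multiple of a minute, a week of a day).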

You could also trivially zero fill the integer from the right on some schedule. This way you don’t even have to think of days/weeks etc, you are just zero filling a string at different decimal places, from the right, over time on some schedule. You just need to stop at some level, so your integer doesn’t become all zeros!

So a schedule of:

1704663698 ← to the second
1704663690
1704663600
1704663000
1704660000
1704600000
1704000000
1700000000
1700000000 ← I’d keep the first two digits at a minimum, but it gets funky and you should start thinking of actual year/decade here. But to the RRF, it may not matter so much since it is a linear 1, 2, 3, … ranking without actual time.

So with this in mind, just keep driving back.

1700000000
1690000000
1680000000
1670000000 … etc. at whatever quantization
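
The zero-fill schedule above can be sketched as follows, assuming a positive integer timestamp. The function name and the `keep` parameter are hypothetical; `keep=2` mirrors the suggestion to retain at least the first two digits:

```python
def zero_fill(ts, digits, keep=2):
    """Zero the last `digits` decimal digits of a positive integer timestamp,
    but never touch the leading `keep` digits."""
    if digits < 0:
        raise ValueError("digits must be non-negative")
    max_digits = max(len(str(ts)) - keep, 0)  # stop before the integer becomes all zeros
    step = 10 ** min(digits, max_digits)
    return ts - ts % step

# Reproduce the schedule: one more zero from the right at each stage.
for digits in range(9):
    print(zero_fill(1704663698, digits))
```

The last two stages both print 1700000000 because the third digit of this timestamp is already zero. This is the same floor operation as before with a power-of-ten step, so no calendar logic is needed.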

The beauty of RRF is that it isn’t so sensitive to the magnitude of change as the data gets older and older relative to the present. So you can be sloppy.

The RRF “kernel” is just 1/x which has a long tail and relative consistency for large x.

brdemorin · January 12, 2024, 2:33am

My interest is data analysis since that has been my coolest experience with AI thus far. From that end, there are three.

(1) Context window → it’s possible the bigger limitation is me not knowing how to connect GPT directly to the DB to query then load only the data restricted to the search parameters

(2) Newly created Assistants that can’t answer anything about their KB without incessant goading and prodding about their tools before they finally work.

(3) Inability for assistants to retrieve a new file programmatically uploaded to it. It will not index and load into memory in the same way a custom GPT does when you drag-and-drop a new file into the prompt

TonyAIChamp · January 12, 2024, 3:46am

Could you explain a bit about what you do in data analysis using LLMs? I’ve found LLMs useful for qualitative data analysis, but not for quantitative.

SomebodySysop · January 12, 2024, 4:18am

The whole purpose of RAG is to eliminate the need for shrinking large data sets by pinpointing the specific texts within the specific documents within the dataset that you need.

TonyAIChamp · January 12, 2024, 5:53am

I believe you are talking about the same idea, just in different words. You can think of finding a very specific part of information in a big dataset as shrinking the information without losing anything substantial.

We now have significant limitations as to how much information we can process (say context window in this case).

It is increasing over time, but there will probably always be some limit in this sense. And that’s where “shrinking” (or however we call it) is kinda the last mile in this process.
