Medium Post: Grounding LLMs - Part 1

I’ve been heads down working on the tech for my new startup, so it’s been a while since I’ve shared anything here. I just published a new Medium post which talks about grounding LLMs to avoid hallucinations and provides a simple template that others might find useful.
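The short version, for anyone who doesn’t want to click through: the idea is to pin the model to supplied context and give it an explicit out when the answer isn’t there. Here’s a simplified sketch in that spirit (the wording below is a stand-in, not the exact template from the article):

```python
# Simplified sketch of a grounding-style prompt template. The exact wording
# in the article differs; the shape is what matters: restrict the model to
# the supplied context and give it an explicit "I don't know" escape hatch.
GROUNDED_TEMPLATE = """<CONTEXT>
{context}
</CONTEXT>

Answer the question using ONLY the text inside <CONTEXT>.
If the answer is not contained in the context, reply "I don't know."
Do not draw on outside knowledge.

Question: {question}"""

prompt = GROUNDED_TEMPLATE.format(
    context="...text retrieved for the question...",
    question="...the user's question...",
)
```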

15 Likes

I’ve been doing a lot of work around Long Context Reasoning and will hopefully have more to share on that front soon.

4 Likes

Hey, stranger. Good to see ya!

I’ll take a look. Definitely interested in what you’ve been up to recently…

BTW, the moderator team mentioned your concern to OpenAI employees: paying $1,000+/month and not being able to get any helpful support. Not sure they have a solution yet as they’re growing so fast, but we’re pushing for better options.

Again, good to see you!

1 Like

Good to know… That turned out to be just me running out of credits, but I don’t feel like support did much to help me diagnose the issue.

2 Likes

Hopefully I can start translating that $1,000+/month of spending into insights that give the broader community better prompting techniques. I’ve definitely figured out a lot over the last few months.

1 Like

Yeah, things are really starting to accelerate and get interesting, for sure.

What are your thoughts on 4o? I saw some rumors on YouTube about a drop of 4o-Large soon that might be interesting…

I hadn’t heard the 4o-Large rumor yet. I generally like 4o. Claude 3.5 Sonnet gives more detailed answers in a lot of cases but 4o is generally better at following instructions.

I currently use my own concoction, which I call gpt-4o-hybrid. I’ve worked out how to blend gpt-4o and gpt-4o-mini such that I can get gpt-4o-quality answers at significantly reduced cost.

I typically process around 20 million tokens a day in my work, which was running me around $140 a day. With my gpt-4o-hybrid concoction I’ve got that cost down to about $5 a day, but it’s currently tailored to my workload.
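For the curious, a quick back-of-the-envelope check makes those numbers plausible, assuming mid-2024 list prices of roughly $5 per 1M input tokens for gpt-4o and $0.15 per 1M for gpt-4o-mini (output tokens ignored for simplicity):

```python
# Back-of-the-envelope cost check, assuming mid-2024 list prices:
# gpt-4o ~$5 per 1M input tokens, gpt-4o-mini ~$0.15 per 1M input tokens.
# Output-token costs are ignored to keep the sketch simple.
tokens_per_day = 20_000_000

cost_4o_only = tokens_per_day / 1_000_000 * 5.00    # ~$100/day before output tokens
cost_mini_only = tokens_per_day / 1_000_000 * 0.15  # ~$3/day before output tokens

print(f"all gpt-4o:      ${cost_4o_only:.2f}/day")
print(f"all gpt-4o-mini: ${cost_mini_only:.2f}/day")
```

Routing the bulk of the tokens through gpt-4o-mini and saving gpt-4o for a final pass lands right around that $5/day figure.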

Yeah, who knows if it’s true or not…

Very cool.

Did you see that RouteLLM that came out?

That’s a great reduction!

Are you using 4o to prime 4o-mini?

I did, but what I do is radically different. It ties into a breakthrough I’ve made around long context reasoning. I’ve worked out how to distribute a prompt’s reasoning across multiple model calls. I’m able to essentially build an infinite chain of thought that I then collapse to an answer at the end of the prompt.

My gpt-4o-hybrid model uses gpt-4o-mini to build this chain of thought and then I collapse it using gpt-4o. The result is a gpt-4o-quality answer at near gpt-4o-mini prices.
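The shape of it looks something like the sketch below. To be clear, the prompts and function names here are placeholders of mine and the real pipeline is more involved, but the build-then-collapse pattern is the core idea:

```python
# Minimal sketch of the "build then collapse" pattern described above.
# Prompts and function names are placeholders; the actual pipeline is
# more involved than a single map-then-reduce pass.
from openai import OpenAI

client = OpenAI()

def build_chain(chunks: list[str], question: str) -> list[str]:
    """Use the cheap model to reason over each chunk of the context."""
    notes = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Note every fact and reasoning step relevant to the question."},
                {"role": "user", "content": f"Question: {question}\n\nText:\n{chunk}"},
            ],
        )
        notes.append(resp.choices[0].message.content)
    return notes

def collapse(notes: list[str], question: str) -> str:
    """Use the strong model once to collapse the accumulated notes into an answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Synthesize a final answer to the question from the notes."},
            {"role": "user", "content": f"Question: {question}\n\nNotes:\n" + "\n---\n".join(notes)},
        ],
    )
    return resp.choices[0].message.content
```

Because gpt-4o only ever sees the distilled notes, the expensive calls stay small no matter how big the original context gets.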

The largest context window I’ve constructed is 104 million tokens (it took 4.5 hours and 23,000 model calls to resolve), but I routinely build context windows spanning several million tokens.

3 Likes

We have a preview of our service coming out in a couple of weeks. We’ll support queries over up to 10 million tokens to start, but there’s theoretically no limit to how large a context window we can construct.

1 Like

Please tell me the output was simply “42”… :wink:

Look forward to it.

1 Like

lol close… we did a multi-needle-in-a-haystack test where we hid 20 unique passwords in 4 copies of the world’s longest novel, Marienbad My Love. We used a fine-tuned version of Llama 3 8B running on a server with 2 RTX 4090s and successfully retrieved all 20 passwords.
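The harness for that kind of eval is pretty simple if anyone wants to try it themselves. A rough sketch (the password format and the `query_system` hook are placeholders for whatever pipeline you’re testing):

```python
# Rough sketch of a multi-needle haystack eval: hide N password "needles"
# at random offsets in a long corpus, then check how many the system under
# test recovers. query_system() stands in for the actual retrieval pipeline.
import random
import secrets

def build_haystack(corpus: str, num_needles: int = 20) -> tuple[str, set[str]]:
    needles = {f"pw-{secrets.token_hex(4)}" for _ in range(num_needles)}
    text = corpus
    for needle in needles:
        pos = random.randrange(len(text))
        text = text[:pos] + f" The secret password is {needle}. " + text[pos:]
    return text, needles

def score(haystack: str, hidden: set[str], query_system) -> float:
    found = set(query_system(haystack, "List every secret password in the text."))
    return len(found & hidden) / len(hidden)  # 1.0 means all needles retrieved
```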

2 Likes

We’ve done other things like give it all of Shakespeare’s writings and have it return the last sentence spoken by every character before they died.

Currently we’re giving it really large collections of documents subpoenaed for litigation and having it scan those documents for potential evidence. It can extract evidence from over 3,000 documents in under 2 hours.

5 Likes

I’m still watching the videos you shared, but I definitely don’t buy that gpt-4o is a 70B-parameter model as proposed by @iamruletheworldmo, and while gpt-4o-mini is definitely smaller, it’s not 8B. It feels somewhere between 34B and 70B to me, and I spend a LOT of time with all of these models so I have a pretty decent sense of size.

There certainly could be a larger model coming from OpenAI, but at this point I personally feel like scaling up is like going from a 4K TV to an 8K TV. There’s a noticeable jump going from 480p to 1080p, but going from 1080p to 4K is less noticeable, and going from 4K to 8K isn’t noticeable at all unless they’re side by side.

1 Like

I suspect that most of the major advances with the models themselves will come partly from bigger models but mostly from better fine-tuning datasets.

The issue is we’ve taught these models how to follow instructions, and now we need to teach them how to plan in steps. There aren’t any good data sources for planning, so they all have to be built from scratch, and that takes time.

3 Likes

Yeah, we’re kinda topping out on data to feed them… though they’re looking at generating data to train on… which could get wonky…

I’m sure it’ll be a bunch of factors coming together that gets us to the next level…

Great writeup, and this is exactly what we’re doing with our own technology, except we’re using RAG and VSS to find relevant context based on the original prompt.
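For anyone unfamiliar with the pattern, prompt-driven VSS boils down to something like the sketch below (a bare-bones illustration; the embedding model and the brute-force scoring are stand-ins, not our actual stack):

```python
# Bare-bones sketch of vector similarity search driven by the original
# prompt: embed the documents and the prompt, score by cosine similarity,
# and keep the top-k documents as context for the final call.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_context(prompt: str, docs: list[str], k: int = 3) -> list[str]:
    doc_vecs = embed(docs)
    query_vec = embed([prompt])[0]
    # These embeddings come back unit-normalized, so a plain dot product
    # is equivalent to cosine similarity.
    scores = doc_vecs @ query_vec
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]
```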

2 Likes

Thanks… The goal of the article was to explore some of the nuances of grounding separate from the retrieval techniques used to populate the prompt. We don’t use RAG in our solutions (at least not traditional RAG), but dealing with hallucinations isn’t tied to any sort of retrieval strategy.

Have you been following the RouteLLM project out of LMSYS?