Medium Post: Grounding LLMs - Part 1

I’ve been heads down working on the tech for my new startup, so it’s been a while since I’ve shared anything here. I just published a new Medium post which talks about grounding LLMs to avoid hallucinations and provides a simple template that others might find useful.
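The short version, for anyone who doesn’t want to click through: the idea is to pin the model to supplied context and give it an explicit out when the answer isn’t there. Here’s a simplified sketch in that spirit (the wording below is a stand-in, not the exact template from the article):

```python
# Simplified sketch of a grounding-style prompt template. The exact wording
# in the article differs; the shape is what matters: restrict the model to
# the supplied context and give it an explicit "I don't know" escape hatch.
GROUNDED_TEMPLATE = """<CONTEXT>
{context}
</CONTEXT>

Answer the question using ONLY the text inside <CONTEXT>.
If the answer is not contained in the context, reply "I don't know."
Do not draw on outside knowledge.

Question: {question}"""

prompt = GROUNDED_TEMPLATE.format(
    context="...text retrieved for the question...",
    question="...the user's question...",
)
```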

15 Likes

I’ve been doing a lot of work around Long Context Reasoning and will hopefully have more to share on that front soon.

4 Likes

Hey, stranger. Good to see ya!

I’ll take a look. Definitely interested in what you’ve been up to recently…

BTW, the moderator team mentioned your concern to OpenAI employees: paying $1,000+/month and not being able to get any helpful support. Not sure they have a solution yet as they’re growing so fast, but we’re pushing for better options.

Again, good to see you!

1 Like

Good to know… That turned out to be just me running out of credits, but I don’t feel like support did much to help me diagnose the issue.

2 Likes

Hopefully I can start translating that $1,000+/month of spending into insights that give the broader community better prompting techniques. I’ve definitely figured out a lot over the last few months.

1 Like

Yeah, things are really starting to accelerate and get interesting, for sure.

What are your thoughts on 4o? I saw some rumors on YouTube about a drop of 4o-Large soon that might be interesting…

I hadn’t heard the 4o-Large rumor yet. I generally like 4o. Claude 3.5 Sonnet gives more detailed answers in a lot of cases but 4o is generally better at following instructions.

I currently use my own concoction, which I call gpt-4o-hybrid. I’ve worked out how to blend gpt-4o and gpt-4o-mini such that I can get gpt-4o-quality answers at significantly reduced cost.

I typically process around 20 million tokens a day in my work, which was running me around $140 a day. With my gpt-4o-hybrid concoction I’ve got that cost down to about $5 a day, but it’s currently tailored to my workload.
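For the curious, a quick back-of-the-envelope check makes those numbers plausible, assuming mid-2024 list prices of roughly $5 per 1M input tokens for gpt-4o and $0.15 per 1M for gpt-4o-mini (output tokens ignored for simplicity):

```python
# Back-of-the-envelope cost check, assuming mid-2024 list prices:
# gpt-4o ~$5 per 1M input tokens, gpt-4o-mini ~$0.15 per 1M input tokens.
# Output-token costs are ignored to keep the sketch simple.
tokens_per_day = 20_000_000

cost_4o_only = tokens_per_day / 1_000_000 * 5.00    # ~$100/day before output tokens
cost_mini_only = tokens_per_day / 1_000_000 * 0.15  # ~$3/day before output tokens

print(f"all gpt-4o:      ${cost_4o_only:.2f}/day")
print(f"all gpt-4o-mini: ${cost_mini_only:.2f}/day")
```

Routing the bulk of the tokens through gpt-4o-mini and saving gpt-4o for a final pass lands right around that $5/day figure.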

Yeah, who knows if it’s true or not…

Very cool.

Did you see that RouteLLM that came out?

That’s a great reduction!

Are you using 4o to prime 4o-mini?

I did, but what I do is radically different. It ties into a breakthrough I’ve made around long context reasoning. I’ve worked out how to distribute a prompt’s reasoning across multiple model calls. I’m able to essentially build an infinite chain of thought that I then collapse to an answer at the end of the prompt.

My gpt-4o-hybrid model uses gpt-4o-mini to build this chain of thought and then I collapse it using gpt-4o. The result is a gpt-4o-quality answer at near gpt-4o-mini prices.
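The shape of it looks something like the sketch below. To be clear, the prompts and function names here are placeholders of mine and the real pipeline is more involved, but the build-then-collapse pattern is the core idea:

```python
# Minimal sketch of the "build then collapse" pattern described above.
# Prompts and function names are placeholders; the actual pipeline is
# more involved than a single map-then-reduce pass.
from openai import OpenAI

client = OpenAI()

def build_chain(chunks: list[str], question: str) -> list[str]:
    """Use the cheap model to reason over each chunk of the context."""
    notes = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Note every fact and reasoning step relevant to the question."},
                {"role": "user", "content": f"Question: {question}\n\nText:\n{chunk}"},
            ],
        )
        notes.append(resp.choices[0].message.content)
    return notes

def collapse(notes: list[str], question: str) -> str:
    """Use the strong model once to collapse the accumulated notes into an answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Synthesize a final answer to the question from the notes."},
            {"role": "user", "content": f"Question: {question}\n\nNotes:\n" + "\n---\n".join(notes)},
        ],
    )
    return resp.choices[0].message.content
```

Because gpt-4o only ever sees the distilled notes, the expensive calls stay small no matter how big the original context gets.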

The largest context window I’ve constructed is 104 million tokens (it took 4.5 hours and 23,000 model calls to resolve), but I routinely build context windows spanning several million tokens.

3 Likes

We have a preview of our service coming out in a couple of weeks. We’ll support queries over up to 10 million tokens to start, but there’s theoretically no limit to how large a context window we can construct.

1 Like

Please tell me the output was simply “42”… :wink:

Look forward to it.

1 Like

lol close… we did a multi-needle-in-a-haystack test where we hid 20 unique passwords in 4 copies of the world’s longest novel, Marienbad My Love. We used a fine-tuned version of Llama 3 8B running on a server with 2 RTX 4090s and successfully retrieved all 20 passwords.
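The harness for that kind of eval is pretty simple if anyone wants to try it themselves. A rough sketch (the password format and the `query_system` hook are placeholders for whatever pipeline you’re testing):

```python
# Rough sketch of a multi-needle haystack eval: hide N password "needles"
# at random offsets in a long corpus, then check how many the system under
# test recovers. query_system() stands in for the actual retrieval pipeline.
import random
import secrets

def build_haystack(corpus: str, num_needles: int = 20) -> tuple[str, set[str]]:
    needles = {f"pw-{secrets.token_hex(4)}" for _ in range(num_needles)}
    text = corpus
    for needle in needles:
        pos = random.randrange(len(text))
        text = text[:pos] + f" The secret password is {needle}. " + text[pos:]
    return text, needles

def score(haystack: str, hidden: set[str], query_system) -> float:
    found = set(query_system(haystack, "List every secret password in the text."))
    return len(found & hidden) / len(hidden)  # 1.0 means all needles retrieved
```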

2 Likes

We’ve done other things like give it all of Shakespeare’s writings and have it return the last sentence spoken by every character before they died.

Currently we’re giving it really large collections of documents subpoenaed for litigation and having it scan those documents for potential evidence. It can extract evidence from over 3,000 documents in under 2 hours.

5 Likes

I’m still watching the videos you shared, but I definitely don’t buy that gpt-4o is a 70B-parameter model as proposed by @iamruletheworldmo, and while gpt-4o-mini is definitely smaller, it’s not 8B. It feels somewhere between 34B and 70B to me, and I spend a LOT of time with all of these models so I have a pretty decent sense of size.

There certainly could be a larger model coming from OpenAI, but at this point I personally feel like scaling up is like going from a 4K TV to an 8K TV. There’s a noticeable jump going from 480p to 1080p, but going from 1080p to 4K is less noticeable, and going from 4K to 8K isn’t noticeable at all unless they’re side by side.

1 Like

I suspect that most of the major advances with the models themselves will come partly from bigger models but mostly from better fine-tuning datasets.

The issue is we’ve taught these models how to follow instructions, and now we need to teach them how to plan in steps. There aren’t any good data sources for planning, so they all have to be built from scratch, and that takes time.

3 Likes

Yeah, we’re kinda topping out on data to feed them… though they’re looking at generating data to train on… which could get wonky…

I’m sure it’ll be a bunch of factors coming together that gets us to the next level…

Great writeup, and this is exactly what we’re doing with our own technology, except we’re using RAG and VSS to find relevant context based on the original prompt.
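For anyone unfamiliar with the pattern, prompt-driven VSS boils down to something like the sketch below (a bare-bones illustration; the embedding model and the brute-force scoring are stand-ins, not our actual stack):

```python
# Bare-bones sketch of vector similarity search driven by the original
# prompt: embed the documents and the prompt, score by cosine similarity,
# and keep the top-k documents as context for the final call.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_context(prompt: str, docs: list[str], k: int = 3) -> list[str]:
    doc_vecs = embed(docs)
    query_vec = embed([prompt])[0]
    # These embeddings come back unit-normalized, so a plain dot product
    # is equivalent to cosine similarity.
    scores = doc_vecs @ query_vec
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]
```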

2 Likes

Thanks… The goal of the article was to explore some of the nuances of grounding separate from the retrieval techniques used to populate the prompt. We don’t use RAG in our solutions (at least not traditional RAG), but dealing with hallucinations isn’t tied to any sort of retrieval strategy.

Have you been following the RouteLLM project out of LMSYS?