Model's output is not matching training data with 1,200+ specific names

Hi Kevin

No, I’ve not tried the new Structured Outputs feature. Do you think the model will be able to provide the right outputs with a list as long as my 1,200 exercises, roughly 8,500 tokens, if I use Structured Outputs? And by that I mean without editing any of the names? Maybe I should give it a go.

So I would have to edit my prompt to a completely different setup, right?

And I guess the fine-tuning work I’ve done with the training file of 100,000 tokens is of no use then, right? Since fine-tuning is apparently not the correct option for me. Can someone explain to me why fine-tuning is not the right solution for me and give an example of when to use it? Apparently I misunderstood the use case for both fine-tuning and RAG.

Thanks, but I don’t mean an example of how it’s changing the words. I mean literally a short example of the program you’re trying to generate. I’m trying to see the pattern you’re slot-filling the words into. These models are all about conforming to patterns.

What I’m really asking is whether you can increase the likelihood of getting the names back verbatim by splitting your generation across multiple calls. 1,200 names is too many to expect the model to reliably get right, and 800 generally isn’t going to do much better. You need to be at something like 50, but it really depends on the shape of your output.
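Not your exact setup, but a minimal sketch of that splitting idea, assuming the OpenAI Python SDK; the catalogue contents, model name, and prompts are placeholders:

```python
from openai import OpenAI

client = OpenAI()

all_exercise_names = ["Dumbbell Incline Fly", "Barbell Back Squat"]  # placeholder for your 1,200-name catalogue

def chunk(names, size=50):
    # Split the catalogue into batches small enough for the model to copy back verbatim.
    return [names[i:i + size] for i in range(0, len(names), size)]

def generate_for_chunk(names, user_request):
    # One call per batch: each call only ever sees ~50 candidate names.
    listing = "\n".join(names)
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: whichever model you're already using
        messages=[
            {"role": "system",
             "content": "Use ONLY exercise names from this list, copied exactly:\n" + listing},
            {"role": "user", "content": user_request},
        ],
    )
    return resp.choices[0].message.content

partial_results = [generate_for_chunk(c, "Pick two chest exercises.") for c in chunk(all_exercise_names)]
```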

As for Structured Outputs, that could help, but I doubt it… like I said, 1,200 names is just too many.


Your prompt may actually become a bit simpler, because a lot of your “business logic” will move into the JSON Schema you use to define the shape of the data the model should generate. In terms of whether or not your schema would be too large, I don’t know if I could predict that in general terms, but I would recommend trying it out.
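As a rough sketch of what that move looks like with the OpenAI Python SDK’s response_format parameter (the field names like sets/reps and the model string are assumptions, not your actual format):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema; strict mode requires "additionalProperties": False and
# every property listed under "required".
workout_schema = {
    "name": "workout_plan",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "exercises": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "sets": {"type": "integer"},
                        "reps": {"type": "integer"},
                    },
                    "required": ["name", "sets", "reps"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["exercises"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # assumption: any model that supports Structured Outputs
    messages=[
        {"role": "system", "content": "Build a workout using only the provided exercise names."},
        {"role": "user", "content": "Upper-body day, 45 minutes."},
    ],
    response_format={"type": "json_schema", "json_schema": workout_schema},
)
print(response.choices[0].message.content)
```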

This video is a nice primer on when fine-tuning or RAG may be appropriate.

There’s surely some sort of way to break these steps down? If you are looking for workout exercises, I would first try to categorize/tag each exercise as much as possible. For example:

Dumbbell Incline Fly
Bench, Free Weights, Chest (or even more specific chest muscles)

You can build an “internal processing” stage where the model takes the user’s input, infers the muscle groups required, and then queries your database using those inferred tags.
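Something along these lines, as a hedged sketch (the tag format, model name, and exercise record shape are all assumptions):

```python
from openai import OpenAI

client = OpenAI()

def infer_tags(user_request):
    # Stage 1: a small call whose only job is to name muscle-group / equipment tags.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Reply with comma-separated muscle-group and equipment tags only."},
            {"role": "user", "content": user_request},
        ],
    )
    return [t.strip().lower() for t in resp.choices[0].message.content.split(",")]

def filter_exercises(exercises, tags):
    # Stage 2: plain filtering against your own exercise table -- no model involved,
    # so the canonical names can never be altered at this step.
    wanted = set(tags)
    return [e for e in exercises if wanted & {t.lower() for t in e["tags"]}]

exercises = [{"name": "Dumbbell Incline Fly", "tags": ["Bench", "Free Weights", "Chest"]}]
shortlist = filter_exercises(exercises, infer_tags("I want to build my chest at home."))
```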

Then, you have distilled 1,200 potential exercises to let’s say 500. Much more suitable.

Distill, distill, distill.

Essentially, what you are trying to build is a recommendation system.

I would jump straight into a knowledge graph like Weaviate. Ref2Vec would be a great start. Basically you apply these tags to a user, and then grab the items that fall inside the space you’ve created. You can grab the top N results and pass them to the model to make a final decision.
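A rough sketch of that lookup with the (v3) Weaviate Python client, assuming an Exercise class for the catalogue and a User class vectorized with the ref2vec-centroid module; the class names, properties, endpoint, and user UUID are placeholders:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumption: local instance, v3 client

user_uuid = "00000000-0000-0000-0000-000000000000"  # placeholder UUID of the User object

# The User object's vector is the centroid of the exercises it references, so
# searching Exercise near that object returns items "in the user's space".
result = (
    client.query
    .get("Exercise", ["name", "equipment", "muscleGroup"])
    .with_near_object({"id": user_uuid})
    .with_limit(25)  # top-N candidates to hand to the LLM for the final decision
    .do()
)

candidates = result["data"]["Get"]["Exercise"]
```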

These types of models are much more malleable than fine-tuning an LLM.


While this will make the submitted prompt shorter, the prompt that’s actually reasoned over will be longer once all of the schema information is added.

The bigger issue is that that’s not what’s causing the truncation they’re seeing. It’s the pressure on the model to squeeze its response into the available output tokens. For long outputs the model starts feeling pressure to get everything to fit, so it looks for ways to make its output shorter. That’s when it starts doing things like dropping words and rephrasing things.

Given that JSON responses tend to consume more output tokens, I suspect that moving to Structured Outputs will just make this particular problem worse.


I wonder whether the correct semantics for the OP’s case is a sha256sum (or some such variant) of the correct names. GPT-4 would then be precluded from compressing the names… maybe…
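A minimal sketch of that idea (the names here are placeholders): map each canonical name to its digest, have the model return digests, and resolve them back, so anything the model “compresses” simply fails the lookup instead of becoming a silently reworded name.

```python
import hashlib

exercise_names = ["Dumbbell Incline Fly", "Barbell Back Squat"]  # placeholder for the 1,200 canonical names

# Build the digest -> canonical-name lookup table once.
digest_to_name = {
    hashlib.sha256(name.encode("utf-8")).hexdigest(): name
    for name in exercise_names
}

def resolve(digest):
    # Any altered or invented identifier raises instead of producing a near-miss name.
    try:
        return digest_to_name[digest]
    except KeyError:
        raise ValueError(f"Model altered or invented the identifier: {digest}")
```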