Yeah… Sometimes I have to laugh at the mistakes the model makes. It’s like, what??? GPT-4o is REALLY GOOD, but it’s stochastic, and even with a temperature of 0.0 it’s going to show some variance and slip up from time to time. I’d say I’m pretty good at prompting, and there’s no prompt you can write that’s going to fix that.
Just because the model makes mistakes doesn’t mean it’s not useful. There’s definitely an art to prompting these models but there’s just as much an art around how you shape the problems you give them. As programmers we’re not used to programming for a system that’s non-deterministic and you have to approach things differently with these models.
These models do so much amazing work. These days in AI development I spend more time building constraints, checks, and feedback loops, in what really feels like a programmatic dance with the model, to ensure some level of control, quality & consistency in the output.
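To make the "checks and feedback loops" idea concrete, here is a minimal sketch of a validate-and-retry loop. Everything in it is an assumption for illustration: `call_model` stands in for whatever client you actually use, and the required fields (`title`, `tags`) are made up.

```python
import json
from typing import Callable


def validate(raw: str) -> list:
    """Return a list of problems with the model's output (empty means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    # Hypothetical schema: a string "title" and a list "tags".
    if not isinstance(data.get("title"), str):
        problems.append('"title" must be a string')
    if not isinstance(data.get("tags"), list):
        problems.append('"tags" must be a list')
    return problems


def generate_with_checks(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model, check the output, and feed any problems back in."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        problems = validate(raw)
        if not problems:
            return json.loads(raw)
        # Tell the model what went wrong so the next attempt can correct it.
        feedback = "\n\nYour previous answer had these problems:\n" + "\n".join(
            f"- {p}" for p in problems
        )
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

The point is that the check lives in ordinary deterministic code, so a slip-up from the model becomes a retry instead of a bad result downstream.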
I also like markdown. It gives the model much more freedom and room for expression. It’s my perfect blend of structured & unstructured semantics.
Yeah… these days I try to make just a giant feedback loop of markdown. We have a whole document ingestion pipeline where we convert all of the documents we’re going to feed to the model to markdown. So where possible we pass markdown into the model and ask for markdown out.
We balance this out by passing all of our instructions in using XML tags. That way there’s a clearer separation between the instructions we’re passing in and the data we want to reason over.
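Roughly, the prompt assembly could look like the sketch below. The tag name `<instructions>` and the helper `build_prompt` are just placeholders I picked for the example, not anything special to the model.

```python
def build_prompt(instructions: str, document_md: str) -> str:
    """Wrap instructions in an XML tag and append the markdown data untouched."""
    return (
        "<instructions>\n"
        f"{instructions.strip()}\n"
        "</instructions>\n\n"
        f"{document_md.strip()}\n"
    )


prompt = build_prompt(
    "Summarize the document below. Reply in markdown.",
    "# Q3 report\n\n- Revenue grew\n- Costs were flat",
)
print(prompt)
```

Since the data is markdown and the instructions are tagged, it is hard for the model (or a human reading the logs) to confuse one for the other.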
Stumbled over this discussion while deciding whether to minify my structured output schema. It makes no difference to my input or output token counts. It looks like schemas get minified anyway for structured outputs, which makes sense since the hierarchical information is preserved either way.
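For anyone wanting to try the same comparison, here is a small sketch with a made-up schema. It only shows that the pretty-printed and minified forms carry the same structure and differ in character count; it does not measure tokens, so for that you would compare the usage numbers your API responses report.

```python
import json

# Hypothetical schema, used only for illustration.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

pretty = json.dumps(schema, indent=2)
minified = json.dumps(schema, separators=(",", ":"))

# Whitespace is the only difference: both parse back to the same structure.
same_structure = json.loads(pretty) == json.loads(minified)
print(same_structure, len(pretty), len(minified))
```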
I disagree. The tokenizer does not break strings apart into words. It breaks them apart into tokens, and what exactly counts as a token depends on the tokenizer itself! Generally speaking, tokenizers for LLMs split strings into subword pieces learned from how often character sequences occur in their training data, and those pieces often (but not always) line up with meaningful parts of words. I’ll give an example. This is probably not how ChatGPT actually tokenizes the sentence, though it might be.
“He jumped over the crosswalk.” may be tokenized such that there is one token for each of “he”, “jump”, “ed”, “over”, “the”, “cross”, “walk”, “.”
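Here is a toy sketch that reproduces that split. It is NOT ChatGPT's tokenizer: the vocabulary is hand-picked to match the example, and it lowercases the text and throws away spaces, which real tokenizers generally do not do. It just shows that the pieces come from the vocabulary, not from word boundaries.

```python
# Hand-picked toy vocabulary, chosen only to match the example above.
TOY_VOCAB = {"he", "jump", "ed", "over", "the", "cross", "walk", "."}


def toy_tokenize(text: str, vocab: set = TOY_VOCAB) -> list:
    """Greedy longest-match tokenizer over a fixed vocabulary."""
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # simplification: whitespace is skipped, not tokenized
            continue
        # Try the longest candidate first, then shrink until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens


print(toy_tokenize("He jumped over the crosswalk."))
# ['he', 'jump', 'ed', 'over', 'the', 'cross', 'walk', '.']
```

Swap in a different vocabulary (say, one that contains "jumped" as a single entry) and the same sentence splits differently, which is the whole point.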
I don’t know whether spaces help ChatGPT or not. I’m wondering about this myself, and also about which way of encoding something as JSON best helps ChatGPT understand the encoded data.