Tokens breaking up explicit keywords?

I created a prompt that specifically asks for exactly as many responses as there are keywords provided, and it's still giving me responses that sometimes break up a keyword or skip it entirely… My list is separated only by ; without any spaces. Is it possible that tokenization is causing the issue? Any suggestions for improving the prompt? I also provided an example of the JSON format I'm expecting for the response.

Yes, tokenization. And also just the model's ability to attend to previous tokens, both the instructions and what has already been generated.

Words with a leading space are more common in the token encoding. You could find a way to have both the input and the output be a plain string of space-separated words with no other delimiter. That way each word gets the same weight and the strongest semantics.
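A minimal sketch of that conversion, assuming a ;-delimited input like the one described above (the keyword names are made up):

```python
# Hypothetical keyword list as described: ";"-delimited, no spaces.
raw = "customerId;orderDate;shipAddr"
keywords = raw.split(";")

# Space-separated form: every word after the first is preceded by a space,
# which matches the more common " word" variants in BPE-style vocabularies.
prompt_list = " ".join(keywords)
print(prompt_list)  # customerId orderDate shipAddr
```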

That would also reduce the amount of pattern in your input and output. Consider that the AI is always calculating the next token to produce. With a strong pattern of semicolons after words, another semicolon becomes more likely than terminating the output.

I would just accept a list longer than you need, and then truncate it in code using the delimiter between words.
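The truncation step could look something like this (function name and delimiter are illustrative, not from any particular API):

```python
def take_first_n(response_text: str, n: int, delimiter: str = " ") -> list[str]:
    """Split an over-long model response and keep only the first n items."""
    items = [word for word in response_text.split(delimiter) if word]
    return items[:n]

# e.g. the model returned 7 words when only 5 were requested
print(take_first_n("alpha beta gamma delta epsilon zeta eta", 5))
```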


This is helpful and gives me some good ideas. I'm initially converting camelCase and abbreviated technical column names into a normalized list, then using those normalized names to ask for additional context (definition, etc.). The conversion to a normalized name works fine, but once I use the normalized names as part of the query, I start getting losses or broken-up keywords.
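For reference, a minimal sketch of that camelCase-to-normalized conversion (the function name and example column names are hypothetical):

```python
import re

def normalize(column_name: str) -> str:
    """Insert a space at each lower-to-upper camelCase boundary, then lowercase.

    e.g. "custAddrLine1" -> "cust addr line1"
    """
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", column_name)
    return spaced.lower()

print(normalize("orderDate"))  # order date
```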

I'll try using underscores, or, if I need to, just use the original technical names as the keywords and see if that improves things. Thanks!