Question about the tokenizer

Following the guide in Best practices for prompt engineering, which mentions the technique of using `###` or `"""` to separate the instruction and the context, I tried different formats in the tokenizer, and I found that the token ids of `###` change as its surrounding format changes.

So my question is: is there a best way to apply this technique, or does it not matter?

I don't know if it matters. But for fine-tuning, the documentation recommends `\n\n###\n\n`.

In tokens, that is `[198, 198, 21017, 628, 198]`.

That pattern sort of looks like an impulse response, which might be better? Not sure.


Thanks for the idea. I hadn't connected this with the tips in the fine-tuning docs. If so, it seems that a pattern that is not exactly repeated is preferred: what follows the double 198 is 628, 198 rather than another double 198, which would support `Text: ###\n###` in my case.

Then again, none of this may matter at all. I wonder why this is a technique that improves performance in the first place; I suspect it has to do with the model's instruction-tuning dataset. I'll check whether that's accessible.