Question about the tokenizer

Following the guide in Best practices for prompt engineering, which mentions the technique of using ### or """ to separate the instruction from the context, I tried different formats in the Tokenizer, but I found that the token IDs of "###" change as the surrounding format changes.

So my question is: is there a best choice when using this technique, or does it not matter?
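For reference, here is a sketch of the kind of check I mean, assuming tiktoken's r50k_base encoding (which should roughly match what the web Tokenizer shows for the GPT-3 models):

```python
# Minimal sketch: the same "###" separator can tokenize differently
# depending on the whitespace around it. Assumes the r50k_base vocabulary;
# other encodings will give different IDs.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in ["###", " ###", "###\n", "\n###\n", "\n\n###\n\n", '"""']:
    print(repr(text), "->", enc.encode(text))
```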

I don’t know if it matters, but for fine-tuning the documentation recommends ‘\n\n###\n\n’.

Which in tokens is [198, 198, 21017, 628, 198]

Which sorta looks like an impulse response, which could be better? Not sure.
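A rough sketch of how to decode those IDs back into text and see which piece of the separator each one maps to, again assuming tiktoken's r50k_base encoding:

```python
# Decode the quoted IDs one at a time to see what each token represents,
# then encode the documented separator for comparison. Assumes r50k_base.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for tok in [198, 198, 21017, 628, 198]:
    print(tok, "->", repr(enc.decode([tok])))

print(enc.encode("\n\n###\n\n"))
```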


Thanks for your idea. I hadn’t associated this with the tips in the fine-tuning docs. If so, it seems that a pattern which isn’t repeated exactly is preferred? What follows the double 198 is 628, 198 rather than another 198, 198 pair, which supports “Text: ###\n###” in my case.
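A quick way to compare the two separator shapes side by side, with the same assumption about the encoding:

```python
# Compare the token pattern of the fine-tuning separator against the
# "Text: ###\n###" form I was considering. Assumes r50k_base.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for sep in ["\n\n###\n\n", "Text: ###\n###"]:
    print(repr(sep), "->", enc.encode(sep))
```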

Though all of this may not matter at all, I still wonder why it works as a technique to improve performance. I think it has to do with the model’s instruction-tuning dataset; I’ll check whether that is accessible.