I’ve been fine-tuning for the past week now, through 5 iterations with an increasingly larger dataset each time. I’ve realized that continuing a fine-tune wasn’t really combining the existing fine-tuned dataset with the new datasets, so I am looking for other ways to improve performance and cost.
What I’ve decided to do is replace some of the text in the prompts with emojis that represent it (not in the completions; those remain in text format only).
Are there any issues with such a method?
When I check via the Tokenizer tool, I see this: ��� for the emoji, which converts to 3 tokens (funnily enough), and then a note appears: "Note: Your input contained one or more unicode characters that map to multiple tokens. The output visualization may display the bytes in each token in a non-standard way."
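To make the question concrete, here is a minimal sketch of what I think is going on, assuming the tokenizer works on UTF-8 bytes (which is what the note seems to suggest). It only uses the Python standard library: it does not call the real tokenizer, and the helper names `utf8_bytes` and `tokens_saved` and the token counts in the example are hypothetical, made up for illustration.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def tokens_saved(phrase_tokens: int, emoji_tokens: int, occurrences: int) -> int:
    """Net prompt tokens saved by swapping a phrase for an emoji.

    A negative result means the swap costs more tokens than it saves.
    """
    return (phrase_tokens - emoji_tokens) * occurrences


if __name__ == "__main__":
    emoji = "\U0001F600"  # a grinning-face emoji, written as an escape

    print(utf8_bytes("a"))    # a plain ASCII letter is 1 byte
    print(utf8_bytes(emoji))  # this emoji is 4 bytes

    # Hypothetical counts: plug in the numbers the Tokenizer tool reports.
    # Replacing a 5-token phrase with a 3-token emoji, 1000 times:
    print(tokens_saved(phrase_tokens=5, emoji_tokens=3, occurrences=1000))
    # Replacing a 2-token phrase with a 3-token emoji, 1000 times:
    print(tokens_saved(phrase_tokens=2, emoji_tokens=3, occurrences=1000))
```

If this reasoning is right, the emoji only helps on cost when the text it replaces is more than 3 tokens long, so short words would actually get more expensive.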
And am I losing the benefit, since the emoji is actually being counted as 3 tokens?