About fine tuning - Your opinion?

I can give a layman idea of attention layer consumption, from my similar position as a layman basically, with a linguistic analogy that is parallel to what comes before the open-ended generation of a decoder-only transformer.

Abstract: A whole bunch of unexhausted attention layers is all you need

“Here’s the file you said you needed. It has Joe’s calendar, and Becky’s tasks, and her reminders. Take her reminders and add them to his calendar. Then the remaining items also must be added in some way, so take them and look at his stuff and see if he also has similar entries, and if they are still unclear, put them in that error report instead adding those with dates possibly in error to that. Then they all can be added to the new file for the outsourced workers.”

I was going to bold the tokens that require attention to find the internal references and paths back to the meaning, but you can start at pronouns, and then other anaphoric references, and see the whole passage I wrote needs attention and would be more bold than not. Basically a large remapping of where token production should actually be looking for meaning.


One can see there’s hella code even early on, by looking at the ranks of the cl100k tokenizer.

What there also certainly is: millions of instructions that are the quality of upvoted GPT-4 answers fed into ongoing gpt-3.5 internal tunes. Quality hard to beat unless you need to deviate from that professional-setting disclaimed chat with refusals for 8x the price.