A fine-tuned turbo alongside GPT-4 is a great combination. I am using this approach for something like ReAct with a very specific syntax. Using GPT-4 alone and just pasting a big explainer into the system prompt does work, but it takes up a lot of context and is slow and expensive. Pasting that same explainer into un-tuned 3.5 does not work so well. Fine-tuning the syntax into 3.5 and doing some smart handoffs between the two is the best of both worlds for me.
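Roughly, the handoff looks something like the sketch below. It is not my exact pipeline; the fine-tuned model id is a placeholder and the "GPT-4 plans, the fine-tune formats" split is just one way to wire it:

```python
# Sketch of the handoff: GPT-4 does the open-ended reasoning, the fine-tuned
# 3.5 model emits the strict ReAct-style syntax it was tuned on.
from openai import OpenAI

client = OpenAI()

def plan_step(task: str) -> str:
    # Heavy reasoning goes to GPT-4; no giant syntax explainer in the prompt.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Decide the next action for: {task}"}],
    )
    return resp.choices[0].message.content

def format_step(plan: str) -> str:
    # The fine-tune already knows the syntax, so the prompt stays tiny and cheap.
    resp = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # placeholder fine-tune id
        messages=[{"role": "user", "content": plan}],
    )
    return resp.choices[0].message.content
```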
I can give a layman's idea of attention-layer consumption, from my own position as basically a layman, with a linguistic analogy parallel to what comes before the open-ended generation of a decoder-only transformer.
Abstract: A whole bunch of unexhausted attention layers is all you need
“Here’s the file you said you needed. It has Joe’s calendar, and Becky’s tasks, and her reminders. Take her reminders and add them to his calendar. Then the remaining items also must be added in some way, so take them and look at his stuff and see if he also has similar entries, and if they are still unclear, put them in that error report instead adding those with dates possibly in error to that. Then they all can be added to the new file for the outsourced workers.”
I was going to bold the tokens that require attention to find the internal references and paths back to the meaning, but you can start at the pronouns, then the other anaphoric references, and see that the whole passage I wrote needs attention; it would be more bold than not. It is basically a large remapping of where token production should actually be looking for meaning.
One can see there’s hella code in there even at the early ranks, just by looking through the cl100k tokenizer’s vocabulary.
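If you want to eyeball this yourself, a quick sketch with tiktoken (which band of ranks is most interesting is a guess; poke around):

```python
# Peek at the cl100k_base vocabulary: in tiktoken a token's id is its rank,
# so decoding a band of ids shows which strings got merged relatively early.
# Look for indentation runs, braces, and code keywords in the output.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for token_id in range(256, 3000, 40):  # early ids are mostly single bytes; scan past them
    print(token_id, repr(enc.decode([token_id])))
```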
What there also certainly is: millions of instructions at the quality of upvoted GPT-4 answers being fed into ongoing internal gpt-3.5 tunes. That quality is hard to beat, unless you need to deviate from that professional, disclaimer-laden chat style with refusals, which is otherwise what you pay 8x the price for.
This reminds me of the Microsoft Orca model released this past summer.
So in this context, take your Input/Output pairs generated by GPT-4, and feed them into GPT-3.5 as the fine-tuning data.
BAM, now you have your own personal GPT-3.75.
And if your Inputs are close to the training data, you might even have something close to GPT-4, just like what the Orca models do.
So in this narrow usage context, I can see the value in fine-tuning Turbo to get GPT-4 like answers.
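Mechanically, that “GPT-3.75” recipe is just the standard fine-tuning flow; a hedged sketch, with placeholder pairs and file name:

```python
# Wrap GPT-4 generated input/output pairs in the chat fine-tuning JSONL format
# and kick off a gpt-3.5-turbo fine-tune. The pairs here are placeholders.
import json
from openai import OpenAI

client = OpenAI()

pairs = [
    ("<some prompt you sent to GPT-4>", "<the GPT-4 answer you liked>"),
    # ... thousands more ...
]

with open("gpt4_pairs.jsonl", "w") as f:
    for user_msg, assistant_msg in pairs:
        f.write(json.dumps({"messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]}) + "\n")

upload = client.files.create(file=open("gpt4_pairs.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job until it finishes, then use the resulting model id
```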
LOL, this is so true. It’s basically why Chain of Thought is a big improvement. CoT basically breaks the problem into little bite size pieces, forcing more attention on each piece.
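For what it’s worth, the effect is visible even in a trivial rewrite of the earlier file-merging example (purely illustrative, not a tested prompt):

```python
# Illustrative only: the same request, asked directly and asked in CoT style.
# The CoT version spells out the bite-size pieces so each step gets attention.
direct_prompt = (
    "Take Becky's reminders and Joe's calendar from the attached file and "
    "produce the merged file for the outsourced workers."
)

cot_prompt = (
    "Work step by step:\n"
    "1. List Becky's reminders.\n"
    "2. List Joe's calendar entries.\n"
    "3. Add each reminder to Joe's calendar, flagging duplicates.\n"
    "4. Put entries with doubtful dates in the error report.\n"
    "5. Output the merged file for the outsourced workers."
)
```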
Oh god, that’s insane. I would just have the person re-write the prompt
But yeah, if you could automatically do this, and break out all the internals and send them attentively to the machine, then WOAH, that’s progress, but also a ridiculously hard problem to map out.
It would also take different software paths. You would almost need a classifier (a rough routing sketch follows the list):
- Is this a generic question: Answer without RAG
- Is this a specific question: Answer with RAG
- Is this in any way a series of instructions?: Run the complex breakout algorithm
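Here is that sketch; the label set, model choice, and downstream handlers are all assumptions, not a tested pipeline:

```python
# Router sketch: a cheap model classifies the message, then we branch.
from openai import OpenAI

client = OpenAI()

ROUTER_SYSTEM = (
    "Classify the user's message as exactly one of: GENERIC, SPECIFIC, INSTRUCTIONS. "
    "Reply with the label only."
)

def classify(message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # routing does not need the expensive model
        messages=[
            {"role": "system", "content": ROUTER_SYSTEM},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()

# Hypothetical downstream handlers; fill these in for your own pipeline.
def answer_without_rag(message: str) -> str: ...
def answer_with_rag(message: str) -> str: ...
def run_complex_breakout(message: str) -> str: ...

def handle(message: str) -> str:
    label = classify(message)
    if label == "GENERIC":
        return answer_without_rag(message)   # plain completion
    if label == "SPECIFIC":
        return answer_with_rag(message)      # retrieve context, then complete
    return run_complex_breakout(message)     # decompose the instructions first
```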
| Model | Training | Input usage | Output usage |
| --- | --- | --- | --- |
| GPT-3.5 Turbo (fine-tuned) | $0.0080 / 1K tokens | $0.0120 / 1K tokens | $0.0160 / 1K tokens |
| GPT-4 (8K context) | n/a | $0.03 / 1K tokens | $0.06 / 1K tokens |
3-4x more expensive, but 10x more parameters, better reasoning, more creativity, and higher-quality language. Personally I would rather not opt for fine-tuning and would use GPT-4-32K if available. But I think fine-tuning is necessary in certain situations, especially for particular formats, tones, styles, and domain-specific knowledge. Still, I think GPT-4-32K with DALL-E 3 would be best in most cases once available.
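Back-of-the-envelope, using the per-1K prices in the table above and an assumed 2,000-token prompt with a 500-token reply:

```python
# Rough cost comparison for one call; the token counts are assumptions.
prompt_toks, completion_toks = 2000, 500

ft_35 = prompt_toks / 1000 * 0.0120 + completion_toks / 1000 * 0.0160  # fine-tuned 3.5 usage
gpt4  = prompt_toks / 1000 * 0.03   + completion_toks / 1000 * 0.06    # GPT-4 8K

print(ft_35, gpt4, gpt4 / ft_35)  # ~$0.032 vs ~$0.09, roughly 2.8x; output-heavy calls push it toward 3.75x
```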
There’s quite a lot of citable evidence that larger contexts lead to wrong answers on complex questions. The devil is in the details.
But to mix my metaphors, don’t forget about the forest.
This prompt is OK, but it’s not ‘final’, imho; rather it’s just a (lightly) weighted member of the blend.
Small side point - a massive competitive weakness of GPT-4 (versus open-source models) here is the lack of access to the logits/hidden states. Simply switching out the classification head with something fine-tuned lets you do much more intelligent and interesting things. Fine-tuning GPT-3.5 is a great step forward.
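To make that concrete: with an open-source backbone you can attach a fresh classification head and read logits and hidden states directly. A sketch, with the backbone and label count as placeholders:

```python
# With a local model you get logits/hidden states for free and can swap heads.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

backbone = "distilbert-base-uncased"  # placeholder open-source backbone
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=3)

inputs = tokenizer("Is this a generic question or a set of instructions?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

print(out.logits.softmax(dim=-1))   # classifier head output
print(out.hidden_states[-1].shape)  # last hidden layer, also exposed
```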
In my opinion, it would be great if gpt-3.5-turbo-16k-0613 were available for fine-tuning, because of the token limits and the need for larger contexts.
Indeed. I have at times felt that the reason GPT-4 is better than 3.5 is mainly that it can pay attention to more bits within the answer, rather than the larger layers or the slightly longer context. Scaling 3.5 to 16k tokens didn’t give me a good model; I’d rather compress my data down to 8k tokens and use GPT-4 than try to use 3.5 with a 16k context.
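One crude version of “compress down to 8k” is just counting and trimming with tiktoken; summarisation would be smarter, and this is only a sketch:

```python
# Count tokens and keep retrieved chunks until an ~8K-context budget is hit.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_to_budget(chunks: list[str], budget: int = 7000) -> str:
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:  # leave headroom for the question and the answer
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```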