A fine-tuned turbo alongside GPT-4 is a great combination. I am using this approach for something like ReAct with a very specific syntax. Using GPT-4 alone and just pasting a big explainer into the system prompt does work, but it takes up a lot of context and is slow and expensive. Pasting that same explainer into un-tuned 3.5 does not work so well. Fine-tuning the syntax into 3.5 and doing some smart handoffs between the two is the best of both worlds for me.
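Roughly, the handoff looks something like the sketch below. It is not my exact pipeline; the fine-tuned model id is a placeholder and the "GPT-4 plans, the fine-tune formats" split is just one way to wire it:

```python
# Sketch of the handoff: GPT-4 does the open-ended reasoning, the fine-tuned
# 3.5 model emits the strict ReAct-style syntax it was tuned on.
from openai import OpenAI

client = OpenAI()

def plan_step(task: str) -> str:
    # Heavy reasoning goes to GPT-4; no giant syntax explainer in the prompt.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Decide the next action for: {task}"}],
    )
    return resp.choices[0].message.content

def format_step(plan: str) -> str:
    # The fine-tune already knows the syntax, so the prompt stays tiny and cheap.
    resp = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # placeholder fine-tune id
        messages=[{"role": "user", "content": plan}],
    )
    return resp.choices[0].message.content
```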
I can give a layman's idea of attention-layer consumption, from my own position as basically a layman, with a linguistic analogy parallel to what comes before the open-ended generation of a decoder-only transformer.
Abstract: A whole bunch of unexhausted attention layers is all you need
“Here’s the file you said you needed. It has Joe’s calendar, and Becky’s tasks, and her reminders. Take her reminders and add them to his calendar. Then the remaining items also must be added in some way, so take them and look at his stuff and see if he also has similar entries, and if they are still unclear, put them in that error report instead adding those with dates possibly in error to that. Then they all can be added to the new file for the outsourced workers.”
I was going to bold the tokens that require attention to find the internal references and paths back to the meaning, but you can start at the pronouns, then the other anaphoric references, and see that the whole passage I wrote needs attention; it would be more bold than not. It is basically a large remapping of where token production should actually be looking for meaning.
One can see there’s hella code in there even at the early ranks, just by looking through the cl100k tokenizer’s vocabulary.
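If you want to eyeball this yourself, a quick sketch with tiktoken (which band of ranks is most interesting is a guess; poke around):

```python
# Peek at the cl100k_base vocabulary: in tiktoken a token's id is its rank,
# so decoding a band of ids shows which strings got merged relatively early.
# Look for indentation runs, braces, and code keywords in the output.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for token_id in range(256, 3000, 40):  # early ids are mostly single bytes; scan past them
    print(token_id, repr(enc.decode([token_id])))
```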
What there also certainly is: millions of instructions at the quality of upvoted GPT-4 answers being fed into ongoing internal gpt-3.5 tunes. That quality is hard to beat, unless you need to deviate from that professional, disclaimer-laden chat style with refusals, which is otherwise what you pay 8x the price for.
This reminds me of the Microsoft Orca model released this past summer.
So in this context, take your Input/Output pairs generated by GPT-4, and feed them into GPT-3.5 as the fine-tuning data.
BAM, now you have your own personal GPT-3.75.
And if your Inputs are close to the training data, you might even have something close to GPT-4, just like what the Orca models do.
So in this narrow usage context, I can see the value in fine-tuning Turbo to get GPT-4 like answers.
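Mechanically, that “GPT-3.75” recipe is just the standard fine-tuning flow; a hedged sketch, with placeholder pairs and file name:

```python
# Wrap GPT-4 generated input/output pairs in the chat fine-tuning JSONL format
# and kick off a gpt-3.5-turbo fine-tune. The pairs here are placeholders.
import json
from openai import OpenAI

client = OpenAI()

pairs = [
    ("<some prompt you sent to GPT-4>", "<the GPT-4 answer you liked>"),
    # ... thousands more ...
]

with open("gpt4_pairs.jsonl", "w") as f:
    for user_msg, assistant_msg in pairs:
        f.write(json.dumps({"messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ]}) + "\n")

upload = client.files.create(file=open("gpt4_pairs.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job until it finishes, then use the resulting model id
```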
LOL, this is so true. It’s basically why Chain of Thought is a big improvement. CoT basically breaks the problem into little bite size pieces, forcing more attention on each piece.
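For what it’s worth, the effect is visible even in a trivial rewrite of the earlier file-merging example (purely illustrative, not a tested prompt):

```python
# Illustrative only: the same request, asked directly and asked in CoT style.
# The CoT version spells out the bite-size pieces so each step gets attention.
direct_prompt = (
    "Take Becky's reminders and Joe's calendar from the attached file and "
    "produce the merged file for the outsourced workers."
)

cot_prompt = (
    "Work step by step:\n"
    "1. List Becky's reminders.\n"
    "2. List Joe's calendar entries.\n"
    "3. Add each reminder to Joe's calendar, flagging duplicates.\n"
    "4. Put entries with doubtful dates in the error report.\n"
    "5. Output the merged file for the outsourced workers."
)
```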
Oh god, that’s insane. I would just have the person re-write the prompt
But yeah, if you could automatically do this, and break out all the internals and send them attentively to the machine, then WOAH, that’s progress, but also a ridiculously hard problem to map out.
It would also take different software paths. You would almost need a classifier (a rough routing sketch follows the list):
- Is this a generic question: Answer without RAG
- Is this a specific question: Answer with RAG
- Is this in any way a series of instructions?: Run the complex breakout algorithm
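Here is that sketch; the label set, model choice, and downstream handlers are all assumptions, not a tested pipeline:

```python
# Router sketch: a cheap model classifies the message, then we branch.
from openai import OpenAI

client = OpenAI()

ROUTER_SYSTEM = (
    "Classify the user's message as exactly one of: GENERIC, SPECIFIC, INSTRUCTIONS. "
    "Reply with the label only."
)

def classify(message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # routing does not need the expensive model
        messages=[
            {"role": "system", "content": ROUTER_SYSTEM},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()

# Hypothetical downstream handlers; fill these in for your own pipeline.
def answer_without_rag(message: str) -> str: ...
def answer_with_rag(message: str) -> str: ...
def run_complex_breakout(message: str) -> str: ...

def handle(message: str) -> str:
    label = classify(message)
    if label == "GENERIC":
        return answer_without_rag(message)   # plain completion
    if label == "SPECIFIC":
        return answer_with_rag(message)      # retrieve context, then complete
    return run_complex_breakout(message)     # decompose the instructions first
```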
| Model | Training | Input usage | Output usage |
| --- | --- | --- | --- |
| GPT-3.5 Turbo (fine-tuned) | $0.0080 / 1K tokens | $0.0120 / 1K tokens | $0.0160 / 1K tokens |
| GPT-4 (8K context) | n/a | $0.03 / 1K tokens | $0.06 / 1K tokens |
3-4x more expensive, but 10x more parameters, better reasoning, more creativity, and higher-quality language. Personally I would rather not opt for fine-tuning and would use GPT-4-32K if available. But I think fine-tuning is necessary in certain situations, especially for particular formats, tones, styles, and domain-specific knowledge. Still, I think GPT-4-32K with DALL-E 3 would be best in most cases once available.
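Back-of-the-envelope, using the per-1K prices in the table above and an assumed 2,000-token prompt with a 500-token reply:

```python
# Rough cost comparison for one call; the token counts are assumptions.
prompt_toks, completion_toks = 2000, 500

ft_35 = prompt_toks / 1000 * 0.0120 + completion_toks / 1000 * 0.0160  # fine-tuned 3.5 usage
gpt4  = prompt_toks / 1000 * 0.03   + completion_toks / 1000 * 0.06    # GPT-4 8K

print(ft_35, gpt4, gpt4 / ft_35)  # ~$0.032 vs ~$0.09, roughly 2.8x; output-heavy calls push it toward 3.75x
```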
There’s quite a lot of citable evidence that larger contexts lead to wrong answers on complex questions. The devil is in the details.
But to mix my metaphors, don’t forget about the forest.
This prompt is OK, but it’s not ‘final’, imho; rather it’s just a (lightly) weighted member of the blend.
Small side point - a massive competitive weakness of GPT-4 (versus open-source models) here is the lack of access to the logits/hidden states. Simply switching out the classification head with something fine-tuned lets you do much more intelligent and interesting things. Fine-tuning GPT-3.5 is a great step forward.
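To make that concrete: with an open-source backbone you can attach a fresh classification head and read logits and hidden states directly. A sketch, with the backbone and label count as placeholders:

```python
# With a local model you get logits/hidden states for free and can swap heads.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

backbone = "distilbert-base-uncased"  # placeholder open-source backbone
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=3)

inputs = tokenizer("Is this a generic question or a set of instructions?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

print(out.logits.softmax(dim=-1))   # classifier head output
print(out.hidden_states[-1].shape)  # last hidden layer, also exposed
```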
In my opinion, it would be great if gpt-3.5-turbo-16k-0613 were available for fine-tuning, because of the token limits and the need for larger contexts.
Indeed. I have at times felt that the reason GPT-4 is better than 3.5 is mainly that it can pay attention to more bits within the answer, rather than the larger layers or the slightly longer context. Scaling 3.5 to 16k tokens didn’t give me a good model; I’d rather compress my data down to 8k tokens and use GPT-4 than try to use 3.5 with a 16k context.
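One crude version of “compress down to 8k” is just counting and trimming with tiktoken; summarisation would be smarter, and this is only a sketch:

```python
# Count tokens and keep retrieved chunks until an ~8K-context budget is hit.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_to_budget(chunks: list[str], budget: int = 7000) -> str:
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:  # leave headroom for the question and the answer
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```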