Preparing CSV -> All prompts are identical

I am trying to convert a CSV to JSONL using the OpenAI CLI, and I keep getting this error: “ERROR in common_suffix validator: All prompts are identical:”

Welcome to the forum…

Sounds like you’re not creating the dataset correctly. The validator reads the prompt field of every row, and that error means every prompt came out as the same string, which typically happens when a constant value (or the wrong column) gets mapped into the prompt field during the conversion.

Data formatting

To fine-tune a model, you’ll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
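For example, here is a minimal Python sketch of that conversion, assuming a CSV with columns named prompt_text and completion_text (the column and file names are placeholders, not taken from your data):

import csv
import json

SEPARATOR = "\n\n###\n\n"  # fixed separator appended to every prompt
STOP = "\n"                # fixed stop sequence appended to every completion

with open("train.csv", newline="", encoding="utf-8") as src, \
     open("train_prepared.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            # every row must produce a *different* prompt; if this column is
            # constant, the common_suffix validator reports the error above
            "prompt": row["prompt_text"] + SEPARATOR,
            # completions start with a whitespace and end with the stop sequence
            "completion": " " + row["completion_text"] + STOP,
        }
        dst.write(json.dumps(record) + "\n")

Alternatively, the CLI can do most of this for you: openai tools fine_tunes.prepare_data -f train.csv analyzes the file, suggests these same fixes, and writes out a prepared JSONL.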

General best practices

Fine-tuning performs better with more high-quality examples. To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to linearly increase with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.

Classifiers are the easiest models to get started with. For classification problems we suggest using ada, which generally tends to perform only very slightly worse than more capable models once fine-tuned, whilst being significantly faster and cheaper.
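As a sketch of what starting such a job looks like with the legacy (pre-1.0) openai Python package, assuming the train_prepared.jsonl produced above and an OPENAI_API_KEY in the environment:

import openai

# upload the prepared JSONL, then start an ada fine-tune on it
# (legacy pre-1.0 SDK; the file name is an assumption)
upload = openai.File.create(
    file=open("train_prepared.jsonl", "rb"),
    purpose="fine-tune",
)
job = openai.FineTune.create(training_file=upload.id, model="ada")
print(job.id)

The equivalent CLI call is openai api fine_tunes.create -t train_prepared.jsonl -m ada.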

If you are fine-tuning on a pre-existing dataset rather than writing prompts from scratch, be sure to manually review your data for offensive or inaccurate content if possible, or review as many random samples of the dataset as possible if it is large.
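If the file is too large to read end to end, here is a quick sketch for pulling random samples to review manually (again assuming the JSONL file from above):

import json
import random

with open("train_prepared.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

# print 20 random examples for manual review (or all of them, if fewer exist)
for row in random.sample(rows, k=min(20, len(rows))):
    print(repr(row["prompt"]), "->", repr(row["completion"]))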

{"prompt":"Company: BHFF insurance\nProduct: allround insurance\nAd:One stop shop for all your insurance needs!\nSupported:", "completion":" yes"}
{"prompt":"Company: Loft conversion specialists\nProduct: -\nAd:Straight teeth in weeks!\nSupported:", "completion":" no"}

More here…

https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset