Fine-tuning 3.5, cryptic unhelpful message upon failure

proy · December 17, 2023, 11:43pm

When attempting to fine-tune my jsonl file on top of gpt-3.5-turbo-1106, I get this error message.

“The job failed due to an invalid training file. Invalid file format. Example 1 contains invalid tokens.”

After 6 hours of trying to figure out which character in my 20 MB file makes validation fail, I take the liberty of venting this…

Come on, OpenAI, you can do better than this!!! Give me a line number! Give me a character! Give me a hint! Give me my time back so I do not need to reverse-engineer your algorithm, please!

PaulBellow · December 17, 2023, 11:52pm

Try checking for smart quotes…

proy · December 17, 2023, 11:57pm

I did. I have my own validator on top of my jsonl generation paying attention to these… And still, it fails, and the only thing I can hold on to is that cryptic message. A needle in a haystack!

curt.kennedy · December 17, 2023, 11:58pm

Also, proper JSON is supposed to use the plain double quote " and not the plain single quote '

Go with UTF-8 too.
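Both points can be checked locally before uploading. What follows is only a minimal sketch, not OpenAI's validator: the function name `find_quote_and_json_problems` and the set of characters treated as smart quotes are assumptions made for illustration.

```python
import json

# Curly ("smart") quote characters that editors often substitute for plain quotes.
# Which characters to flag is an assumption for this sketch.
SMART_QUOTES = "\u201c\u201d\u2018\u2019"

def find_quote_and_json_problems(lines):
    """Return (line_number, problem) tuples for an iterable of JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        # Smart quotes inside a string value are legal JSON, so this is only a warning.
        if any(ch in text for ch in SMART_QUOTES):
            problems.append((number, "contains smart quotes"))
    return problems
```

Opening the file with `open(path, encoding="utf-8")` and passing the file object to this function also surfaces any bytes that are not valid UTF-8, since Python raises `UnicodeDecodeError` while reading.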

proy · December 18, 2023, 12:01am

Indeed. I paid attention to that also.

Since my JSONL all comes from a data generation routine I wrote in Java, everything should be OK. It is also the same Java JSONL generation I have used for about a year now, but there now seems to be some new sensitivity in all this.

I also removed some multi-lingual training content since it was using some non-standard utf-8 characters, just in case. But that did not help either.

curt.kennedy · December 18, 2023, 12:03am

OK then,

I would binary divide and conquer to isolate the error.
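The divide-and-conquer idea can be sketched as a binary search over prefixes of the file. This is only an illustration: `is_valid` is a hypothetical callback standing in for whatever validation run you use (it takes a list of lines and returns True when they are accepted), and the sketch assumes an empty file passes and that a prefix fails as soon as it contains a bad line.

```python
def find_first_bad_line(lines, is_valid):
    """Return the 0-based index of the first line that makes validation fail, or None.

    `is_valid` is a hypothetical callback: list of lines -> bool.
    """
    if is_valid(lines):
        return None
    lo, hi = 0, len(lines)  # invariant: lines[:lo] passes, lines[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_valid(lines[:mid]):
            lo = mid   # the bad line is at index mid or later
        else:
            hi = mid   # the bad line is before index mid
    return hi - 1
```

The number of validation runs grows with the base-2 logarithm of the file length, so even a large file needs only a couple of dozen runs. If several lines are bad, this finds the first one; fix it and run the search again.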

How many JSONL lines?

proy · December 18, 2023, 12:05am

20,000 lines.

But do you agree that, in the shadow of all the greatness happening at OpenAI, it is way below expectations to have to fall back on primitive means such as an iterative divide and conquer that will take me half a day?

curt.kennedy · December 18, 2023, 12:07am

Yes I agree. The error statement should say, “on JSONL line XX, we threw an exception.”

Agree agree agree!

The online fine-tuning interface is new. They used to have a JSONL validator internally that you could run at the command line.

proy · December 18, 2023, 12:09am

The command-line outputs the same useless cryptic message (it probably eats its own dog-food and the web interface calls that). That leaves us with zero to rely on…

curt.kennedy · December 18, 2023, 12:10am

Well the only good news is that you can D&C at the command line without incurring fine-tune costs.

Will relay to OAI team.

proy · December 18, 2023, 12:15am

I tried my first 11 lines, and got it to fail already in there. Does someone see something unusual?

{“messages”: [{“role”: “system”, “content”: “EL_Search_Classification_Rule is: Using a slot-filling approach, produce the final criteria (from last to first sentence communicated) of the property search in name-value fields ‘property_type’ (Apartments|Buildings|Country-houses|Houses|Land|Locals|Multi-Familial Houses|Offices|Other|Warehouses), ‘surface’, ‘bedroom_count’ (1-10, +), ‘bathroom_count’ (1-10, +), ‘min_price’ (numeric values only), ‘max_price’ (numeric values only), ‘currency_code’, ‘location’, ‘city’, ‘country’ (do not infer), ‘amenities’ with corresponding quantifier for each (N/A|none|some|few|average|lots|max), and ‘features’. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Apartments’ in ‘property_type’ can include: penthouse|lofts|condo|cooperative|luxury condo building|co-op|flat|coops|coop|suite|luxury condo buildings|condominium|penthouses|cooperatives|loft|suites|condos|condominiums|flats|co-ops|apartment. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Buildings’ in ‘property_type’ can include: building|apartment building|mixed-use building|apartment buildings|industrial property|green or sustainable building|mixed-use buildings|commercial building|green or sustainable buildings|industrial properties|commercial buildings. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Country-houses’ in ‘property_type’ can include: estates|chateau|mansions|hacienda|chateaux|chalet|ranches|country-home|farmhouses|chalets|manor|haciendas|farmhouse|estate|ranch|country-house|villas|manors|mansion|lodges|country-homes|lodge|villa. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Houses’ in ‘property_type’ can include: new construction house|luxury villas|sustainable home|pet-friendly house|townhouses|pet-friendly houses|fixer-upper houses|fixer-upper homes|pet-friendly home|new construction home|single-family homes|bungalow|eco-friendly home|luxury homes|luxury houses|new construction homes|luxury house|townhouse|new construction houses|homes|sustainable homes|fixer-upper home|eco-friendly house|single-family home|fixer-upper house|detached homes|house|detached home|pet-friendly homes|bungalows|eco-friendly houses|luxury home|home|luxury villa|eco-friendly homes. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Land’ in ‘property_type’ can include: industrial land|agricutural land|lot|lots|farm land|commercial land|plot|plots. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Locals’ in ‘property_type’ can include: local shops|local|retail space|local shop|retail spaces. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Multi-Familial Houses’ in ‘property_type’ can include: duplexes|new-construction multi-family houses|triplex|multi-family homes|multi-family real-estate listings|multi-family houses|multi-family properties|new-construction multi-family homes|new-construction multi-family home|multi-family real-estate listing|fourlexes|multi-family property|fourplex|multi-familial house|multi-family home|multi-familial homes|duplex|multi-family house|multi-familial home|new-construction multi-family house|triplexes. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Offices’ in ‘property_type’ can include: serviced office spaces|serviced office space|office space|luxury office spaces|small office space|commercial office space|shared office space|green or sustainable office spaces|office|office spaces|green or sustainable office space|luxury office space|commercial office spaces|shared office spaces|small office spaces. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Warehouses’ in ‘property_type’ can include: warehouse spaces|high ceiling warehouses|warehouse|industrial warehouse|distribution warehouse|cold storage warehouses|high ceiling warehouse|industrial warehouses|secured warehouse spaces|secured warehouse|fulfillment center warehouses|fulfillment center warehouse|secured warehouse space|secured warehouses|distribution warehouses|warehouse space|cold storage warehouse. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, allowed terminologies for ‘features’ are: Sports area|Community or event center|Rainwater harvesting|Solar panels|In-law suite|Separate guest house|Multi-generational living|Gazebo|Built-in barbecue|Fire pit|Outdoor kitchen|Wine cellar|Energy efficiency certifications|High-speed internet|Smart home|Home office|Fireplace|Garage or parking|School nearby|Public transportation nearby|Park nearby|Shopping centers nearby|Beach nearby|Mountain nearby|Ocean view|Panoramic view|Mountain view|City view|Ocean front|Acreage|Private lot|Modern|Colonial|Rustic|Backyard|Garden or green area|Patio|Balcony|Private pool|Shared pool|Gym|Ramps|Elevators|Doorways|Ground-floor|Gated community|24/7 security guard|Surveillance cameras|Pet friendly|New construction|Furnished|Financing available|Spa|Luxury|Crypto-transaction allowed|Micro-ownership|Business for sale|Rental income|Tax exemption|Penthouse|Alarm system|Cable ready|Deck|Dock|Golf course nearby|Intercom ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}

curt.kennedy · December 18, 2023, 12:17am

Maybe this? You are using special tokens.
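A scan like the one below can list every place such tokens appear. It is a rough sketch under assumptions: the name `find_special_tokens` is made up here, and treating anything shaped like `<|...|>` as suspect is a guess on my part, not a documented rule of the fine-tuning validator.

```python
import json
import re

# Matches strings shaped like reserved tokens, e.g. <|endoftext|>.
# Flagging every <|...|> pattern is an assumption, not a documented rule.
SPECIAL_TOKEN = re.compile(r"<\|[^|<>]+\|>")

def find_special_tokens(lines):
    """Yield (line_number, role, token) for each special-token-like string found."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        example = json.loads(line)
        for message in example.get("messages", []):
            content = message.get("content") or ""
            for token in SPECIAL_TOKEN.findall(content):
                yield number, message.get("role"), token
```

Each result names the JSONL line, the role of the message, and the matching token, which is the kind of detail the error message leaves out.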

proy · December 18, 2023, 12:17am

Note: the forum editor changed the quotes to curly quotes. Here is an image from Notepad++.

proy · December 18, 2023, 12:18am

Let me try with removing my custom end-of-text everywhere to see.
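For anyone wanting to do the same in bulk, here is a minimal sketch of that clean-up, assuming the chat-format lines shown above. The function name `strip_end_marker` is made up for this example.

```python
import json

END_MARKER = "<|endoftext|>"  # the custom end-of-text marker used in the examples above

def strip_end_marker(line):
    """Return the JSONL line with END_MARKER removed from every message content."""
    example = json.loads(line)
    for message in example.get("messages", []):
        content = message.get("content")
        if isinstance(content, str):
            # Remove the marker, then drop any trailing whitespace it leaves behind.
            message["content"] = content.replace(END_MARKER, "").rstrip()
    return json.dumps(example, ensure_ascii=False)
```

Note that any example whose assistant content was only the marker ends up with an empty assistant message, so those lines need a real response or should be dropped.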

proy · December 18, 2023, 12:20am

You have a fine eye. That worked!

So, custom tokens are not allowed anymore? Is that it?

N2U · December 18, 2023, 12:22am

Might be, or they just need some escaping.

There’s this data validation example in the OpenAI cookbook you can try if you run into trouble again.

curt.kennedy · December 18, 2023, 12:25am

The action to OAI still stands … better error reporting.

But in my experience, the special tokens tend to be “taboo”, so avoid them in production.

proy · December 18, 2023, 12:34am

On the other hand, I use chat-gpt as a semantic normalizer. I do not use chat-gpt to do everything for me, but just to do ‘some things’ for me.

Here is a random JSON line from the JSONL file.

{“messages”: [{“role”: “system”, “content”: “El is a semantic normalizer chatbot, applying EL_All_rules defined.”}, {“role”: “system”, “content”: “TEXT: property_type = Locals // TEXT: transaction_type = buy // TEXT: min_price = 670000 // TEXT: max_price = 750000 // TEXT: currency_code = GBP // TEXT: city = Paris // TEXT: bedroom_count = N/A // TEXT: bathroom_count = N/A // TEXT: surface = N/A // TEXT: sector = Montparnasse // TEXT: state = Ile-de-France // TEXT: country = France // TEXT: amenities = N/A // TEXT: features = N/A // TEXT: location = Montparnasse, Paris, Ile-de-France in France // TEXT: limit = N/A // TEXT: last_question = Where would you want your property to be located?”}, {“role”: “user”, “content”: “I would like to see a green or sustainable office space in Montparnasse.”}, {“role”: “assistant”, “content”: “property_type: Offices // transaction_type: buy // min_price: 670000 // max_price: 750000 // currency_code: GBP // city: Paris // bedroom_count: N/A // bathroom_count: N/A // surface: N/A // sector: Montparnasse // state: Ile-de-France // country: France // amenities: N/A // features: N/A // location: Montparnasse, Paris, Ile-de-France in France // <|endoftext|>”}]}

I give chat-gpt a pre-condition (that I maintain in my app), then feed it a sentence, in this case “I would like to see a green or sustainable office space in Montparnasse.” and expect a post-condition with the adaptation of what was given to it IN A NORMALIZED STRUCTURE I CAN COMPUTE AGAINST.

So, Chat-GPT does not do everything for me, it just does the semantic normalization to a structure I can consume.
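To make "a structure I can consume" concrete, here is a minimal sketch of parsing the `key: value // key: value` output shown above into a dictionary. It is written in Python for illustration only (the actual app is in Java), and `parse_normalized` is a hypothetical name. It assumes values never contain `//`.

```python
def parse_normalized(text, end_marker="<|endoftext|>"):
    """Parse 'key: value // key: value // ...' into a dict of strings.

    Assumes '//' only ever appears as the field separator.
    """
    fields = {}
    for part in text.replace(end_marker, "").split("//"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece left after the last separator
        key, sep, value = part.partition(":")
        if sep:  # ignore fragments that have no 'key: value' shape
            fields[key.strip()] = value.strip()
    return fields
```

Splitting on the first colon only keeps values such as `Montparnasse, Paris, Ile-de-France in France` intact.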

Does that seem compatible with expectations? I’ve been able to do that with DaVinci-3 and the 3.5-turbo and am just trying to do the same with the latest 3.5-turbo now.

curt.kennedy · December 18, 2023, 12:40am

Yes it is. The problem here is the “infinite action space”. It can be chipped away with other AI models (fine-tunes / function calling) but it is a larger problem, and a hard one, that we are trying to solve (hey, at least I think this is a serious problem to solve).

Your only out here is to pick off relevant data with keywords or embeddings. It seems like you are feeding non-semantic database “rows” to the model, which will confuse it.

The LLM is a narrowband input device, so treat it as such.

_j · December 18, 2023, 12:48am

When sharing, you’ll need to not use directional quotes, or use the forum’s preformatted text formatting option.

You should be able to json-ize each line.

I was initially going to suspect upper-utf8 characters with accents, like é.

But the problem is that you are training the AI to output nothing in response to an input, and only providing a system prompt.

Examples should be at a minimum:
system: new AI identity
user: type of input that will be provided
assistant: custom response to user input

You also wouldn’t use the endoftext token; the chat container has its own stop token, used by the endpoint, that gets trained.
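A rough pre-upload check for the shape described above might look like this. It is a sketch of my advice here, not an official rule set, and `check_example_shape` is a made-up name.

```python
import json

def check_example_shape(line):
    """Return a list of problems with one JSONL training example (empty if none found)."""
    messages = json.loads(line).get("messages", [])
    roles = [m.get("role") for m in messages]
    problems = []
    if "user" not in roles:
        problems.append("no user message")
    assistant_texts = [
        (m.get("content") or "").strip()
        for m in messages
        if m.get("role") == "assistant"
    ]
    if not assistant_texts:
        problems.append("no assistant message")
    elif any(text in ("", "<|endoftext|>") for text in assistant_texts):
        problems.append("assistant message is empty or only an end-of-text marker")
    return problems
```

Run over the eleven lines posted earlier, every one would be flagged twice: no user message, and an assistant message that is only the end-of-text marker.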
