Fine-tuning 3.5, cryptic unhelpful message upon failure

When attempting to fine-tune my jsonl file on top of gpt-3.5-turbo-1106, I get this error message.

“The job failed due to an invalid training file. Invalid file format. Example 1 contains invalid tokens.”

After 6 hour trying to figure-out which character in my 20 MB file that makes validation fail, I take the liberty of venting this…

Come-on OpenAI, you can do better than this!!! Give me a line number! Give me a character! Give me a hint! Give me my time back so I do not need to reverse-engineer your algorithm please!

1 Like

Try checking for smart quotes…

1 Like

I did. I have my own validator on top of my jsonl generation paying attention to these… And still, it fails, and the only thing I can hold on to is that cryptic message. A needle in a haystack!

Also, proper JSON is supposed to use the plain double quote and not the plain single quote

Also go with UTF-8 too.

Indeed. I paid attention to that also.

Since my jsonl is all coming from a data generation routine I wrote in java, everything should be ok. It is also the same jsonl generation java I used for about a year now, but there is now some new sensibility it seems in all this.

I also removed some multi-lingual training content since it was using some non-standard utf-8 characters, just in case. But that did not help either.

OK then,

I would binary divide and conquer to isolate the error.

How many JSONL lines?

20,000 lines.

But, do you agree that in the shadow of all that greatness happening at OpenAI, this is way below expectations to have to fall-back on primitive means such an iterative divide and conquer that will take me half a day?

Yes I agree. The error statement should say, “on JSONL line XX, we threw an exception.”

Agree agree agree!

The online fine-tuning interface is new. They used to have a JSONL validator internally that you could run at the command line.

1 Like

The command-line outputs the same useless cryptic message (it probably eats its own dog-food and the web interface calls that). That leaves us with zero to rely on…

Well the only good news is that you can D&C at the command line without incurring fine-tune costs.

Will relay to OAI team.

I tried my first 11 lines, and got it to fail already in there. Does someone see something unusual?

{“messages”: [{“role”: “system”, “content”: “EL_Search_Classification_Rule is: Using a slot-filling approach, produce the final criteria (from last to first sentence communicated) of the property search in name-value fields ‘property_type’ (Apartments|Buildings|Country-houses|Houses|Land|Locals|Multi-Familial Houses|Offices|Other|Warehouses), ‘surface’, ‘bedroom_count’ (1-10, +), ‘bathroom_count’ (1-10, +), ‘min_price’ (numeric values only), ‘max_price’ (numeric values only), ‘currency_code’, ‘location’, ‘city’, ‘country’ (do not infer), ‘amenities’ with corresponding quantifier for each (N/A|none|some|few|average|lots|max), and ‘features’. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Apartments’ in ‘property_type’ can include: penthouse|lofts|condo|cooperative|luxury condo building|co-op|flat|coops|coop|suite|luxury condo buildings|condominium|penthouses|cooperatives|loft|suites|condos|condominiums|flats|co-ops|apartment. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Buildings’ in ‘property_type’ can include: building|apartment building|mixed-use building|apartment buildings|industrial property|green or sustainable building|mixed-use buildings|commercial building|green or sustainable buildings|industrial properties|commercial buildings. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Country-houses’ in ‘property_type’ can include: estates|chateau|mansions|hacienda|chateaux|chalet|ranches|country-home|farmhouses|chalets|manor|haciendas|farmhouse|estate|ranch|country-house|villas|manors|mansion|lodges|country-homes|lodge|villa. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Houses’ in ‘property_type’ can include: new construction house|luxury villas|sustainable home|pet-friendly house|townhouses|pet-friendly houses|fixer-upper houses|fixer-upper homes|pet-friendly home|new construction home|single-family homes|bungalow|eco-friendly home|luxury homes|luxury houses|new construction homes|luxury house|townhouse|new construction houses|homes|sustainable homes|fixer-upper home|eco-friendly house|single-family home|fixer-upper house|detached homes|house|detached home|pet-friendly homes|bungalows|eco-friendly houses|luxury home|home|luxury villa|eco-friendly homes. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Land’ in ‘property_type’ can include: industrial land|agricutural land|lot|lots|farm land|commercial land|plot|plots. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Locals’ in ‘property_type’ can include: local shops|local|retail space|local shop|retail spaces. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Multi-Familial Houses’ in ‘property_type’ can include: duplexes|new-construction multi-family houses|triplex|multi-family homes|multi-family real-estate listings|multi-family houses|multi-family properties|new-construction multi-family homes|new-construction multi-family home|multi-family real-estate listing|fourlexes|multi-family property|fourplex|multi-familial house|multi-family home|multi-familial homes|duplex|multi-family house|multi-familial home|new-construction multi-family house|triplexes. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Offices’ in ‘property_type’ can include: serviced office spaces|serviced office space|office space|luxury office spaces|small office space|commercial office space|shared office space|green or sustainable office spaces|office|office spaces|green or sustainable office space|luxury office space|commercial office spaces|shared office spaces|small office spaces. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, synonym terminologies for ‘Warehouses’ in ‘property_type’ can include: warehouse spaces|high ceiling warehouses|warehouse|industrial warehouse|distribution warehouse|cold storage warehouses|high ceiling warehouse|industrial warehouses|secured warehouse spaces|secured warehouse|fulfillment center warehouses|fulfillment center warehouse|secured warehouse space|secured warehouses|distribution warehouses|warehouse space|cold storage warehouse. ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}
{“messages”: [{“role”: “system”, “content”: “As part of EL_Search_Classification_Rule, allowed terminologies for ‘features’ are: Sports area|Community or event center|Rainwater harvesting|Solar panels|In-law suite|Separate guest house|Multi-generational living|Gazebo|Built-in barbecue|Fire pit|Outdoor kitchen|Wine cellar|Energy efficiency certifications|High-speed internet|Smart home|Home office|Fireplace|Garage or parking|School nearby|Public transportation nearby|Park nearby|Shopping centers nearby|Beach nearby|Mountain nearby|Ocean view|Panoramic view|Mountain view|City view|Ocean front|Acreage|Private lot|Modern|Colonial|Rustic|Backyard|Garden or green area|Patio|Balcony|Private pool|Shared pool|Gym|Ramps|Elevators|Doorways|Ground-floor|Gated community|24/7 security guard|Surveillance cameras|Pet friendly|New construction|Furnished|Financing available|Spa|Luxury|Crypto-transaction allowed|Micro-ownership|Business for sale|Rental income|Tax exemption|Penthouse|Alarm system|Cable ready|Deck|Dock|Golf course nearby|Intercom ->”}, {“role”: “assistant”, “content”: “<|endoftext|>”}]}

Maybe this? You are using special tokens.

Note: the editor changed the quotes to curly quotes. Here’s is an image of notepad++.

Let me try with removing my custom end-of-text everywhere to see.

1 Like

You have a fine eye. That worked!

So, custom tokens are not allowed anymore? Is that it?

2 Likes

Might be, or they just needs some escaping :thinking:

There’s this data validation example in the OpenAI cookbook you can try if you run into trouble again:

2 Likes

The action to OAI still stands … better error reporting.

But in my experience, the special tokens tend to be “taboo”, so avoid them in production. :rofl:

1 Like

On the other hand, I use chat-gpt as a semantic-normalizer. I do not use chat-gpt to do everything for me, but just to do ‘some-things’ for me.

Here is a random json into the jsonl.

{“messages”: [{“role”: “system”, “content”: “El is a semantic normalizer chatbot, applying EL_All_rules defined.”}, {“role”: “system”, “content”: “TEXT: property_type = Locals // TEXT: transaction_type = buy // TEXT: min_price = 670000 // TEXT: max_price = 750000 // TEXT: currency_code = GBP // TEXT: city = Paris // TEXT: bedroom_count = N/A // TEXT: bathroom_count = N/A // TEXT: surface = N/A // TEXT: sector = Montparnasse // TEXT: state = Ile-de-France // TEXT: country = France // TEXT: amenities = N/A // TEXT: features = N/A // TEXT: location = Montparnasse, Paris, Ile-de-France in France // TEXT: limit = N/A // TEXT: last_question = Where would you want your property to be located?”}, {“role”: “user”, “content”: “I would like to see a green or sustainable office space in Montparnasse.”}, {“role”: “assistant”, “content”: “property_type: Offices // transaction_type: buy // min_price: 670000 // max_price: 750000 // currency_code: GBP // city: Paris // bedroom_count: N/A // bathroom_count: N/A // surface: N/A // sector: Montparnasse // state: Ile-de-France // country: France // amenities: N/A // features: N/A // location: Montparnasse, Paris, Ile-de-France in France // <|endoftext|>”}]}

I give chat-gpt a pre-condition (that I maintain in my app), then feed it a sentence, in this case “I would like to see a green or sustainable office space in Montparnasse.” and expect a post-condition with the adaptation of what was given to it IN A NORMALIZED STRUCTURE I CAN COMPUTE AGAINST.

So, Chat-GPT does not do everything for me, it just does the semantic normalization to a structure I can consume.

Does that seem compatible with expectations? I’ve been able to do that with DaVinci-3 and the 3.5-turbo and am just trying to do the same with the latest 3.5-turbo now.

Yes it is. The problem here is the “infinite action space”. It can be chipped away with other AI models (fine-tunes / function calling) but it is a larger problem, and a hard one, that we are trying to solve (hey, at least I think this is a serious problem to solve :rofl:).

Your only out here is to pick off relevant data with keywords or embeddings. It seems like you are feeding non-semantic database “rows” to the model, which will confuse it.

The LLM is a narrowband input device, so treat it as such.

When sharing, you’ll need to not use directional quotes, or use the forum’s preformatted text formatting option.

You should be able to json-ize each line.

I was initially going to suspect upper-utf8 characters with accents, like é.

But the problem is: that you are training the AI on outputting nothing in response to an input and only providing a system prompt.

Examples should be at a minimum:
system: new AI identity
user: type of input that will be provided
assistant: custom response to user input

You also wouldn’t use endoftext token, the chat container has its own stop token used by the endpoint that gets trained.