Fine-tuning with quotation marks

How can I format data for a fine-tuning task where the prompts/completions have quotation marks in them? This obviously conflicts with the JSONL data format because it will prematurely terminate quoted data blocks.

A natural thing to do is escape the quotation marks, but I’m not sure how the fine-tuning endpoint will handle this. Will the backend serialize \" as "? Will the fine-tuned model reliably predict \" where quotation marks should be (so I can just parse its outputs)? Any knowledge/experience would be greatly appreciated.

Example prompt:

How many users are named "Alice"?
SELECT

With completion

 COUNT(*) FROM users WHERE name = "Alice";

JSONL result:

{"prompt":"How many users are named "Alice"?\nSELECT", "completion":" COUNT(*) FROM users WHERE name = "Alice";"}

AFAIK any JSON parser will convert the \" escape back into a single " character when it reads the line. I believe I ran some experiments with this in the Python terminal. You can try it yourself too.
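
For anyone who wants to try it, here is a minimal sketch of that kind of check using only Python's standard json module. The record reuses the example from the question; the variable names are just for illustration, and this only shows what a JSON parser does locally, not what the fine-tuning backend does.

```python
import json

# The example from the question, with raw double quotes in both fields.
record = {
    "prompt": 'How many users are named "Alice"?\nSELECT',
    "completion": ' COUNT(*) FROM users WHERE name = "Alice";',
}

# json.dumps escapes the inner quotes (and the newline),
# so the whole record fits on a single JSONL line.
line = json.dumps(record)
print(line)

# Parsing the line restores the original, unescaped strings.
restored = json.loads(line)
assert restored == record
```

Writing one json.dumps(record) per line to a file produces valid JSONL without escaping anything by hand.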


Hey Kyle, the fine-tuning API handles JSON escape sequences. We parse the JSON lines in the backend, which unescapes them as expected, so it should just work. Note that the completions API returns its result as a JSON object, so any predicted double quotation marks will be escaped in the resulting JSON dict as well.
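
To illustrate the last point, here is a rough sketch of parsing such a result in Python. The response body is hand-written for illustration, not real API output, and the "choices"/"text" field names are an assumption about the response shape.

```python
import json

# Hypothetical raw response body, hand-written for illustration (not real API output).
# In the raw JSON text, the predicted double quotes appear escaped as \".
raw_body = r'{"choices": [{"text": " COUNT(*) FROM users WHERE name = \"Alice\";"}]}'

# Parsing the JSON unescapes the quotes, so no manual cleanup is needed.
parsed = json.loads(raw_body)
completion = parsed["choices"][0]["text"]
print(completion)
```

The escaping only exists in the serialized JSON text; once the response is parsed, the completion string contains plain double quotes.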

P.S. Standard SQL uses single quotation marks for string literals, not double quotation marks :sweat_smile:


Hi @rachel thanks for the response! This is helpful.

By the way: I’m aware it’s best practice to use single quotes in SQL, but double quotes are still legal. Some of the academic datasets I’m testing Codex on have SQL queries with double quotes, so I’d like to be able to handle them anyway.