Proper formatting for fine-tuning dataset

I tried to write a shell script that takes user and assistant messages as input and writes a .jsonl file for fine-tuning. I ran the format validation script on the result, but it is returning a traceback. I wonder if it’s because of how jq formats the JSON. If there’s an easy way to fix the formatting, that would be good. The script is below.

#!/bin/bash

# Define output file
OUTPUT_FILE="fine_tuning_data.jsonl"

# Ensure the output file is empty before starting
> "$OUTPUT_FILE"

# Prompt user for input
echo "Enter training data (type 'exit' to finish):"

while true; do
    read -p "User: " user_input
    if [[ "$user_input" == "exit" ]]; then
        break
    fi
    read -p "Assistant: " assistant_response

    # Create JSONL formatted entry
    json_entry=$(jq -n --arg u "$user_input" --arg a "$assistant_response" \
            '{messages: [{role: "system", content: "SYSTEM_MESSAGE"}, {role: "user", content: $u}, {role: "assistant", content: $a}]}')
    # Append to output file
    echo "$json_entry" >> "$OUTPUT_FILE"
done

echo "Data saved to $OUTPUT_FILE"

Sample entry

{
  "messages": [
    {
      "role": "system",
      "content": "SYSTEM_MESSAGE"
    },
    {
      "role": "user",
      "content": "USER_MESSAGE"
    },
    {
      "role": "assistant",
      "content": "ASSISTANT_MESSAGE"
    }
  ]
}

I understand that you want your bash script to operate like this:

Without barfing, that is.

Code (the key change is jq’s -c flag, which prints each entry on a single line, as JSONL requires):

#!/bin/bash

OUTPUT_FILE="fine_tuning_data.jsonl"
SYSTEM_MESSAGE="Clippy is a factual chatbot that is also sarcastic."

# Clear previous content
> "$OUTPUT_FILE"

echo "Creating fine-tuning JSONL data. Type 'exit' to stop."

while true; do
    # -r keeps backslashes in the typed text literal
    read -r -p "User (type 'exit' to finish): " user_input
    [[ "$user_input" == "exit" ]] && break

    read -r -p "Assistant: " assistant_response

    # Build a compact one-line JSON entry; jq handles the escaping
    jq -nc \
      --arg sys "$SYSTEM_MESSAGE" \
      --arg usr "$user_input" \
      --arg ast "$assistant_response" \
      '{
        messages: [
          {"role": "system", "content": $sys},
          {"role": "user", "content": $usr},
          {"role": "assistant", "content": $ast}
        ]
      }' >> "$OUTPUT_FILE"
done

echo "Data saved to $OUTPUT_FILE"

Put your real system message in SYSTEM_MESSAGE, if you already have one – the training file needs the actual text, not the placeholder.
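If you want a quick sanity check before running the real format validation script, something like this might help. It is only a rough sketch: check_jsonl is a made-up helper name, and it only confirms that every line parses on its own as a JSON object with a non-empty messages array. It does not replace the official validator.

```shell
#!/bin/bash
# check_jsonl FILE (hypothetical helper, not part of any official tooling)
# Succeeds only if every line of FILE is a standalone JSON object that has
# a non-empty "messages" array. Pretty-printed (multi-line) JSON fails here,
# because a line such as "{" is not valid JSON by itself.
check_jsonl() {
    local file="$1" line n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # jq -e exits non-zero on a parse error or when the filter yields false
        if ! printf '%s\n' "$line" | jq -e '
                (type == "object")
                and ((.messages | type) == "array")
                and ((.messages | length) > 0)' >/dev/null 2>&1; then
            echo "Line $n is not a valid training entry" >&2
            return 1
        fi
    done < "$file"
    echo "$n lines OK"
}

# Example usage:
# check_jsonl fine_tuning_data.jsonl
```

Run against the output of the first script in this thread, it should stop at line 1, since that file starts with a lone "{".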

The little script doesn’t let you do much in the way of creating a longer chat or editing though.
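For longer chats, one possible direction is a small function that takes the system message followed by alternating user and assistant turns and writes them all into a single entry. This is a sketch under assumptions: build_chat is a hypothetical name, and it relies on jq's --args option and $ARGS.positional, which need jq 1.6 or newer.

```shell
#!/bin/bash
# build_chat SYSTEM USER1 ASSISTANT1 [USER2 ASSISTANT2 ...]
# Hypothetical helper: prints one compact JSON line holding the whole chat.
# Turns are assumed to alternate, starting with the user.
build_chat() {
    local sys="$1"
    shift
    # Everything after --args lands in $ARGS.positional as an array of strings
    jq -nc --arg sys "$sys" '
        $ARGS.positional as $t
        | [{role: "system", content: $sys}]
          + [ range(0; $t | length) as $i
              | { role: (if $i % 2 == 0 then "user" else "assistant" end),
                  content: $t[$i] } ]
        | {messages: .}
    ' --args "$@"
}

# Example usage (the turns are made up):
# build_chat "Clippy is a factual chatbot that is also sarcastic." \
#     "What is 2+2?" "Four. Shocking." \
#     "And 3+3?" "Six." >> fine_tuning_data.jsonl
```

Editing earlier turns would still need something more interactive than a shell prompt.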

Maybe I’ll crank out a Python GUI for you…

Redid my training file. Used jsonformatter.org to minify the JSON into JSONL. It didn’t want to convert the whole file, so I just copy-pasted each prompt-response pair from Notepad into the website and back. There’s a web app for just about anything.
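For anyone who would rather skip the copy-pasting: jq can probably do the same minifying in one go, since -c compacts every JSON object it reads onto its own line. A minimal sketch, assuming the file holds the pretty-printed entries one after another (to_jsonl and the file names are made up for illustration):

```shell
#!/bin/bash
# to_jsonl FILE (hypothetical helper)
# Reads a file of pretty-printed JSON objects placed back to back and
# prints each object as a single compact line, which is the JSONL layout.
to_jsonl() {
    jq -c . "$1"
}

# Example usage (file names are assumptions):
# to_jsonl training_pretty.json > fine_tuning_data.jsonl
```

If the entries are instead wrapped in one big JSON array, the filter would need to be '.[]' rather than '.'.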