Fine-tune error when using the openai tool to parse a JSONL file

I’m trying to prepare data for fine-tuning using this command:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

And getting the following error:

The indices of the long examples has changed as a result of a previously applied recommendation.
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\pandas\core\frame.py", line 5266, in drop
    return super().drop(
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\pandas\core\generic.py", line 4549, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\pandas\core\generic.py", line 4591, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\pandas\core\indexes\base.py", line 6696, in drop
    raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: '[242] not found in axis'

The JSONL file I’m working on looks OK. Any idea what’s causing the error?
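A quick sanity check is to confirm that every line parses as JSON and has the fields prepare_data expects (a minimal sketch; `data.jsonl` is a placeholder for your own file):

```python
import json

# Placeholder file name; substitute your own JSONL file.
with open("data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {i} is not valid JSON: {e}")
            continue
        # prepare_data expects a "prompt" and a "completion" key per record
        missing = {"prompt", "completion"} - record.keys()
        if missing:
            print(f"Line {i} is missing keys: {missing}")
```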

Just as a sanity check, are you replacing <LOCAL_FILE> with the name of your local file?

Yes, of course :smile:
the answer needs to be longer, then…

Ahh, the short reply limit :smiley: Ok, well, the error you are getting says pandas can’t find the row it is trying to remove, which makes me wonder if you have missed a step in the procedure.

Can you go through the process again and try?

I noticed now that this happens only when I approve the following:
- [Recommended] Remove 30 long examples [Y/n]: Y

Otherwise, it continues and creates the file (with some strange notes, but there is still an output).

Ok, so it seems that maybe the entire thing is the >30 long examples? Or… rather, it thinks it is.

Sorry, your question isn’t clear to me.
Can you please explain again?

My apologies. From your reply I take it that if the Y flag is set, the procedure fails.

If this is the case, and the rule for Y is “Remove 30 long examples”, then it seems the system thinks “everything” is within the 30 long examples, so the file ends up empty (or lacking certain content), and it then fails with the 242 error when the rows are removed.

So you are suggesting that maybe there are mistaken or missing chars in the JSONL file that cause all rows to look like 30 long rows making up the entire doc?

And so, once they are removed, the document is empty?

Perhaps not that exactly, but something like that, yes. I can’t think of another reason why it would behave this way.
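For what it’s worth, the KeyError itself is easy to reproduce in pandas: if an earlier step has already dropped or re-indexed rows, a later drop that still remembers the old label fails in exactly this way (a minimal sketch, not the tool’s actual code):

```python
import pandas as pd

df = pd.DataFrame({"prompt": ["a", "b", "c"]}, index=[0, 1, 242])

# A first "recommendation" drops row 242...
df = df.drop(242)

# ...and a later step that still remembers label 242 now fails:
df.drop(242)  # KeyError: '[242] not found in axis'
```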

Thanks, I’ll try to find the problematic row and troubleshoot it.

I found the problematic row, but I cannot find out what’s wrong with it.

I tried looking for hidden chars that can break strings but couldn’t find anything.

It’s super long, but I really want to know what the issue is so I can make sure not to repeat it…
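One way to hunt for hidden characters is to flag any non-printable code points line by line (a sketch; the file name is a placeholder):

```python
import unicodedata

# Placeholder file name.
with open("data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        for j, ch in enumerate(line.rstrip("\n")):
            # Category "C*" covers control/format code points that can break strings
            if unicodedata.category(ch).startswith("C"):
                name = unicodedata.name(ch, "UNNAMED")
                print(f"Line {i}, col {j}: U+{ord(ch):04X} {name}")
```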

The fine-tuning models have a 2k token limit and you’re trying to send them a 5k prompt. That could be part of the issue.
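You can measure this per example with tiktoken (a sketch; `r50k_base` is assumed as the encoding for the GPT-3 base models, and the file name is a placeholder):

```python
import json
import tiktoken  # pip install tiktoken

# Assumption: r50k_base is the encoding used by the GPT-3 base models.
enc = tiktoken.get_encoding("r50k_base")

with open("data.jsonl", encoding="utf-8") as f:  # placeholder file name
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        n = len(enc.encode(record["prompt"])) + len(enc.encode(record["completion"]))
        if n > 2048:  # the 2k context limit mentioned above
            print(f"Line {i}: {n} tokens")
```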

The indices of the long examples has changed as a result of a previously applied recommendation.

It seems like it’s trying to split your prompts and then for whatever reason completely loses track of them. I wonder if it’s splitting them more than once?

Reduce your prompts to a suitable token length. The prompt you’re trying to send is incredibly noisy. If you are trying to tune the model to always return a consistent object, you are probably better off just using GPT-4.


I have added some code to limit the prompt to 4096 tokens, as per the GPT-3 instructions.
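Something along these lines (a sketch of the idea; the tiktoken encoding name is an assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # assumed GPT-3 encoding

def truncate_prompt(prompt: str, max_tokens: int = 4096) -> str:
    """Keep at most max_tokens tokens of the prompt."""
    tokens = enc.encode(prompt)
    return enc.decode(tokens[:max_tokens])
```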
I still get the same error from this row:

I can simply delete it and move on, but the fact that I cannot find any reason for the error is making me lose sleep at night :exploding_head: