ChatGPT Export Data organization has changed ... again, with no documentation

This is about “Lack of Documentation” and trying to shed some light.

I just downloaded my “Export Data” zip archive … and discovered it has changed. Towards the beginning of March I downloaded the archive, and there were lots of folders with long numbers as names that contained an audio folder, and inside that were lots of wav files (with even longer hex names). Actual example: file_671d686cb0a02002286e258615025f2ac753012b9aa7bfa60cb8b7f32a6bb1c427d63a0a52aab272d47753ccba02f6d9-c19e52ca-7d80-4fa8-9028-56806107ae67.wav (note the long ‘file id’ followed by a uuid).

Today I downloaded my archive, and there are none of those folders. No uuid folders, no audio folders, just long-ass filenames, which are all ‘.dat’. If you open one in a text editor, you’ll see it’s a RIFF file, so these are probably the audio files, if they have those long names. An example: file_0afa96a4208c30020e09fcf8ef6fb62ac753012b9aa7bfa60cb8b7f32a6bb1c427d63a0a52aab272d47753ccba02f6d9.dat (note: no uuid in the name, it’s not in a folder but at the top level of the archive, and it’s .dat).

But there are also files with smaller names followed by .dat (least helpful extension known) such as: file_00000000f6605230a72c9fe4513d8ec2.dat

Open one of these up in a text editor, and you’ll see it’s a PNG. So you can manually change the extension to .png, and you’ll now see the thumbnail. Guess what files these are? The new image generations! Note that in the archive there is still an actual folder “dalle-generations” with files that have names like file-0sLZ7w7d2U0zDWOaSmTrKNKC-d1a13955-737f-4c82-be69-c34a0b47316a.webp … note, that’s a file id (NOT hex, an alphanumeric string) followed by a hex uuid, followed by .webp

So, if you want to find all these new images you’re generating, you WON’T find them in the dalle-generations folder, they will be ‘short’ names amongst the ‘long’ names of .dat files.
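Rather than opening each .dat file in a text editor, the file type can be guessed from the first few bytes. This is only a rough sketch: the signatures are the standard ones for PNG, RIFF/WAVE, RIFF/WEBP, JPEG and PDF, but the function names and the idea that the .dat files sit at the top level of the unzipped archive are my own assumptions based on what I saw, not anything OpenAI documents.

```python
from __future__ import annotations

from pathlib import Path


def sniff_extension(header: bytes) -> str | None:
    """Guess an extension from the leading bytes of a file (hypothetical helper)."""
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    # RIFF containers carry their real type in bytes 8..12.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return ".wav"
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return ".webp"
    if header.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if header.startswith(b"%PDF-"):
        return ".pdf"
    return None  # unknown: leave the file alone


def plan_renames(archive_dir: str) -> dict[Path, Path]:
    """Map each top-level .dat file to a suggested new name; nothing is renamed here."""
    plan = {}
    for path in sorted(Path(archive_dir).glob("*.dat")):
        with path.open("rb") as fh:
            ext = sniff_extension(fh.read(16))
        if ext:
            plan[path] = path.with_suffix(ext)
    return plan
```

It only builds a plan, so you can look over the suggested names before deciding to rename (or better, copy) anything.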

Also, if you had some method of reading the conversations.json, extracting conversations and their messages, and matching up the media in them with the media in the archive … that method probably just broke (mine did! thus this edit to the post). These files are treated as “assets” and referenced only by the file id, which was always the first part of the name in the archive, not by the original file name. However, the new ‘file_xxxx…’ format is different from the old ‘file-xxxx…’ format, which was more or less the same for user-uploaded and dalle-generated images. The method of identifying assets was the same, ‘file-AlphanumericString-some-file-name.ext’, and the filename and extension were not used in the asset notation in the json. Who knows now…
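For anyone whose matching broke the same way, here is a minimal sketch of pulling the file id out of an archive filename so that both the old ‘file-’ and the new ‘file_’ styles are handled. It is built only from the example names in this post. The regex, the function names, and the approach of scanning the raw JSON text instead of relying on specific field names are all my assumptions, since the format is undocumented.

```python
import re
from collections import defaultdict

# "file-" or "file_" followed by an alphanumeric run; stops at the next "-" or ".".
FILE_ID = re.compile(r"file[-_][A-Za-z0-9]+")


def file_id_from_name(filename):
    """Return the leading file id of an archive filename, or None if there is none."""
    m = FILE_ID.match(filename)
    return m.group(0) if m else None


def index_archive(filenames):
    """Group archive filenames by their file id."""
    index = defaultdict(list)
    for name in filenames:
        fid = file_id_from_name(name)
        if fid:
            index[fid].append(name)
    return dict(index)


def match_assets(json_text, index):
    """Find ids mentioned anywhere in the JSON text that also exist in the archive.

    Scanning raw text also picks up junk such as a "file_name" key; keeping only
    ids that are present in the archive index filters that out.
    """
    mentioned = set(FILE_ID.findall(json_text))
    return {fid: index[fid] for fid in mentioned if fid in index}
```

Feed `index_archive` the list of names from the unzipped archive and `match_assets` the text of conversations.json. Working out which message each id belongs to still needs a proper walk of the JSON, which this does not attempt.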

I also wonder if there are PDFs or other file types hiding amongst the .dat files. I hardly did any audio; mostly I’ve been interested in the images. I also hardly did any video. There was for a brief time a ‘sora’ folder in the archive, but not anymore. Maybe that has its own export? Haven’t looked, hardly used it yet.

This has been a public service post, for the confused among us, who actually like knowing what we’re getting when we download these archives.

using only this prompt:

What model are you?

show me what it replies.

You mean the model I was asking about the format? That was 4o:

"I am an AI assistant developed by OpenAI, currently utilizing the GPT-4o model. GPT-4o is a multimodal model capable of processing and generating text, images, audio, and video. This allows me to assist with a wide range of tasks, including drafting text, analyzing images, and interpreting audio content. "

Not really the issue. I made sure I had web search enabled, but it doesn’t matter if there is no documentation online, and they never bother to inform the model what’s in the export download. Using their undocumented asset ‘file id’ format is bad enough, but now they go and change it whenever they feel like it, which makes these exports just ridiculous. There seem to be some changes in the conversations.json too, but Claude 3.7 Sonnet managed to figure it out and I did import the few new conversations that were in the download … but the images will be a puzzle for another day. The question remains, why isn’t there some export documentation that actually explains what the standards are? Would it be so hard? I have yet to figure out if there is any record in the messages of which custom GPT was used to create them. Remember plug-ins? Is there any record of them in the json? Anyway, I have this repo that I’ve been working on, and have managed to glean a lot, including matching up the images, generated and uploaded, with the conversations that they came from. But now having new files with different naming conventions? How often are they going to change it? Did they really have to change the “-” to “_”??? What’s the point?

well you’re one of the few who still have unbridled access to 4o.

it’s really been on the fritz recently.

turbo is being forced for most users and it’s trying to pass itself off as 4o

interesting, can u speak more to that?


I am one of those affected by 4-turbo

You’ll find a lot of threads recently about 4o not working, getting irritated, etc.

While I am aware of what caused it, there’s some issue about terminology and being implicated in accidentally compromising a 157 billion dollar tool.

The effects are barely visible and are being worked out, but it was actually a very happy surprise for the Devs…

I hope.