This is about the “Lack of Documentation” and an attempt to shed some light.
I just downloaded my “Export Data” zip archive … and discovered it has changed. Towards the beginning of March I downloaded the archive, and there were lots of folders with long numbers as names. Each contained an audio folder, and inside that were lots of wav files with even longer hex names. Actual example: file_671d686cb0a02002286e258615025f2ac753012b9aa7bfa60cb8b7f32a6bb1c427d63a0a52aab272d47753ccba02f6d9-c19e52ca-7d80-4fa8-9028-56806107ae67.wav (note the long ‘file id’ followed by a uuid).
Today I downloaded my archive, and there are none of those folders. No uuid folders, no audio folders, just long-ass filenames, all ending in ‘.dat’. If you open one of the long-named ones in a text editor, you’ll see it’s a RIFF file, so these are probably the audio files. An example: file_0afa96a4208c30020e09fcf8ef6fb62ac753012b9aa7bfa60cb8b7f32a6bb1c427d63a0a52aab272d47753ccba02f6d9.dat (note: no uuid in the name, it’s not in a folder but at the top level of the archive, and it’s .dat).
But there are also files with shorter names followed by .dat (least helpful extension known), such as: file_00000000f6605230a72c9fe4513d8ec2.dat
Open one of these up in a text editor, and you’ll see it’s a PNG. So you can manually change the extension to .png, and you’ll now see the thumbnail. Guess what files these are? The new image generations! Note that the archive still has an actual folder “dalle-generations” with files that have names like file-0sLZ7w7d2U0zDWOaSmTrKNKC-d1a13955-737f-4c82-be69-c34a0b47316a.webp … note, that’s a file id (NOT hex, an alphanumeric string) followed by a hex uuid, followed by .webp
So, if you want to find all these new images you’re generating, you WON’T find them in the dalle-generations folder. They are the .dat files with ‘short’ names, mixed in among the ‘long’-named ones.
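Rather than opening each .dat in a text editor, you can check the first few bytes of every file. Here's a minimal Python sketch of that idea. The function names and the list of formats are my own choices for illustration, not anything the export documents. Note that RIFF on its own doesn't prove audio: WAV and WebP are both RIFF containers, and the four bytes at offset 8 tell them apart. The sketch copies files to a separate folder instead of renaming them in place, so the extracted archive stays untouched.

```python
import shutil
from pathlib import Path
from typing import Dict, Optional


def sniff_extension(header: bytes) -> Optional[str]:
    """Guess a file extension from the leading bytes of a file.

    Only a handful of well-known signatures are checked; this is an
    illustrative list, not a list of what the export is known to contain.
    """
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if header.startswith(b"RIFF") and header[8:12] == b"WAVE":
        return ".wav"
    if header.startswith(b"RIFF") and header[8:12] == b"WEBP":
        return ".webp"
    if header.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if header.startswith(b"%PDF-"):
        return ".pdf"
    return None  # unknown: the caller decides what to do


def copy_with_real_extensions(archive_dir: str, out_dir: str) -> Dict[str, str]:
    """Copy top-level *.dat files into out_dir under a sniffed extension.

    Returns a mapping of original name -> new name. Files whose type
    cannot be identified are skipped and left out of the mapping.
    """
    src = Path(archive_dir)
    dst = Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    renamed: Dict[str, str] = {}
    for path in sorted(src.glob("*.dat")):
        with path.open("rb") as fh:
            header = fh.read(16)
        ext = sniff_extension(header)
        if ext is None:
            continue
        target = dst / (path.stem + ext)
        shutil.copy2(path, target)
        renamed[path.name] = target.name
    return renamed
```

Calling copy_with_real_extensions("my-export", "my-export-sorted") (both folder names are placeholders) would leave you with a folder where the image and audio files have usable extensions. Anything it skips is worth a look by hand.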
Also, if you had some method of reading conversations.json, extracting conversations and their messages, and matching the media they reference with the media in the archive … that method probably just broke (mine did! thus this edit to the post). These files are treated as “assets” and referenced only by the file id, which was always the first part of the name in the archive; the old file name is not used. However, the new ‘file_xxxx…’ format is different from the old ‘file-xxxx…’ format, which was more or less the same for user-uploaded and dalle-generated images. Both were identified the same way, ‘file-AlphanumericString-some-file-name.ext’, and the filename and extension were not used in the asset notation in the json. Who knows now…
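One way to make the matching less fragile is to stop depending on the JSON layout and on one id format. The sketch below is not my original method, just an illustration. It walks every string in conversations.json, pulls out anything that looks like a file id, and looks it up against an index of archive files keyed by the id at the start of each filename. The regex is an assumption based only on the example names in this post ("file-" or "file_" followed by letters and digits), and all the function names are made up. It will also pick up false positives from ordinary text (something like "file-based"), which simply end up with no matching file.

```python
import json
import re
from pathlib import Path
from typing import Any, Dict, Iterator, List, Set

# Assumption: an id is "file-" or "file_" plus letters/digits, and ends at
# the next "-" or "." in a filename. The real rules are undocumented.
FILE_ID = re.compile(r"\bfile[-_][A-Za-z0-9]+")


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def referenced_ids(data: Any) -> Set[str]:
    """Collect everything in the JSON that looks like a file id."""
    return {
        match.group(0)
        for text in iter_strings(data)
        for match in FILE_ID.finditer(text)
    }


def index_archive(archive_dir: str) -> Dict[str, List[Path]]:
    """Map the id at the start of each filename to the matching paths."""
    index: Dict[str, List[Path]] = {}
    for path in Path(archive_dir).rglob("*"):
        if path.is_file():
            match = FILE_ID.match(path.name)
            if match:
                index.setdefault(match.group(0), []).append(path)
    return index


def match_assets(conversations_json: str, archive_dir: str) -> Dict[str, List[Path]]:
    """Return id -> archive paths; an empty list means nothing matched."""
    with open(conversations_json, encoding="utf-8") as fh:
        data = json.load(fh)
    index = index_archive(archive_dir)
    return {fid: index.get(fid, []) for fid in sorted(referenced_ids(data))}
```

Because the index is built from the filename prefix, the same lookup covers the old ‘file-…-uuid.webp’ names and the new ‘file_….dat’ names, as long as the json really does reference assets by that leading id.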
I also wonder if there are PDFs or other file types hiding amongst the .dat files. I hardly did any audio or video; mostly I’ve been interested in the images. There was for a brief time a ‘sora’ folder in the archive, but not anymore. Maybe that has its own export? Haven’t looked, hardly used it yet.
This has been a public service post, for the confused among us, who actually like knowing what we’re getting when we download these archives.