Hello, excellent humans.
I am working with the OpenAI GPT-4o multimodal inference engine in the hope of generating enough training data to fine-tune the model iteratively, but I'm having difficulty coming up with enough quality data for fine-tuning.
The problem space is the analysis and summarization of vocational training videos in the wheelchair custom seating and mobility vertical. I capture keyframes sampled once per second (subsampled once every 1/3 second, performing blur analysis to pick the clearest image and skipping images in the middle of scene changes). I then present English/French/Spanish subtitles with timestamps (pulled from the audio track, transcribed to English using Whisper, then translated into the other languages), along with image metadata telling GPT-4o each image's timestamp, plus a temporary Dropbox link from which the image is picked up to vectorize/preprocess before being presented to the model.

The output is a series of image summaries describing specific details of the vocational skills demonstrated. I allow multiple images to be combined into a single entry with a list of timestamps or timestamp ranges if they describe something visually similar, or if they describe a motion seen over a series of frames, such as swinging a hammer or running a sewing machine back and forth on a piece of fabric.
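For concreteness, here's a minimal sketch of the keyframe-selection step (Laplacian variance as the sharpness score, with a histogram-correlation check standing in for the scene-change detection; the threshold values are illustrative, not tuned):

```python
import cv2

def sharpness(frame) -> float:
    """Variance of the Laplacian (higher = sharper)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def scene_change(prev, curr, threshold: float = 0.5) -> bool:
    """Rough scene-change check via grayscale histogram correlation."""
    hists = [cv2.calcHist([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)],
                          [0], None, [64], [0, 256]) for f in (prev, curr)]
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL) < threshold

def extract_keyframes(path: str):
    """Yield (second, frame): the sharpest of three 1/3-second subsamples
    per second, skipping subsamples caught mid scene change."""
    cap = cv2.VideoCapture(path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS)))
    step = max(1, fps // 3)                 # one subsample every ~1/3 second
    candidates, prev, second, idx = [], None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            if prev is None or not scene_change(prev, frame):
                candidates.append((sharpness(frame), frame))
            prev = frame
        idx += 1
        if idx % fps == 0:                  # one-second window closed
            if candidates:
                yield second, max(candidates, key=lambda c: c[0])[1]
            candidates, second = [], second + 1
    cap.release()
```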
I then ask the AI to combine the multilingual subtitle content and the previously produced image summaries into descriptions that weave together the audio and visual information (I call these commentaries).
I allow the AI to rewrite history: at every step I provide all the summary details from previous steps along with fresh new images to summarize, and the AI may revise older image summaries and commentaries if new visual or subtitle information lets it see past details in a new light.
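The shape of each pass looks roughly like this (a sketch against the OpenAI Python SDK; the function and field names are mine, and the prompt framing is illustrative):

```python
from openai import OpenAI

client = OpenAI()

def run_pass(system_prompt: str, prior_state: str, subtitles: str,
             image_links: list[tuple[str, str]]) -> str:
    """One pass: prior summaries/commentaries plus fresh images go in,
    revised summaries and commentaries come out. image_links is a list
    of (timestamp, temporary_dropbox_url) pairs."""
    content = [{
        "type": "text",
        "text": ("PRIOR IMAGE SUMMARIES AND COMMENTARIES (you may revise these):\n"
                 f"{prior_state}\n\nSUBTITLES (EN/FR/ES, timestamped):\n{subtitles}"),
    }]
    for ts, url in image_links:
        content.append({"type": "text", "text": f"Image timestamp: {ts}"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```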
So far, I am finding that although the AI is good at identifying large objects, it lacks sophistication not only in identifying small details but also in linking the objects and actions it sees to the overall purpose and skills being taught/demonstrated. It is weak at identifying motions occurring across a series of image frames, and it is weak at taking subtitle content and weaving it into its overall summaries (although I do see it doing this a little).
My challenge: all the additional content I've had to add to the system prompt to give it specific instructions (e.g., watch the presenter's hands and report what each hand does and for what purpose; recognize when a single object, such as a wheelchair lap belt with a central buckle, splits into a left and a right part when the buckle is undone, and from then on refer to left or right as long as the belt stays undone) is consuming most of GPT-4o's 128,000-token input context window, and I haven't even started distilling the e-learning content (multiple-choice Socratic question generation from the presented learning content, which I was planning to use as the sole input to the more limited text-only fine-tuning process OpenAI released at the end of August 2024).
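To see how much of the window the instructions alone are eating, the measurement is simple (a sketch using tiktoken; GPT-4o uses the o200k_base encoding, and the images consume tokens on top of any text budget):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # GPT-4o's tokenizer

def remaining_budget(system_prompt: str, context_limit: int = 128_000) -> int:
    """Input tokens left for subtitles, prior summaries, and image tokens
    once the system prompt is accounted for."""
    return context_limit - len(enc.encode(system_prompt))
```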
I wonder if I'll have to figure out a way to "pre-fine-tune" the model with specific input/output examples to clear those instructions out of my input context before trying to fine-tune it on the e-learning content for the series, since my input context won't be big enough to train it on the next video series (I have five series, each with around ten videos, to process for my client, but there are many, many more videos in his collection that he could ask me to process should this prototype go well).
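If I go that route, the examples would go in as OpenAI's chat fine-tuning JSONL format, something like this (the content here is invented for illustration; the real examples would be human-reviewed pairs that teach rules like the hand-tracking and belt-splitting ones above):

```python
import json

# One training example in OpenAI's chat fine-tuning JSONL format.
example = {
    "messages": [
        {"role": "system",
         "content": "You summarize vocational seating-and-mobility training frames."},
        {"role": "user",
         "content": "Frame 00:04:12: presenter at workbench, lap belt buckle undone."},
        {"role": "assistant",
         "content": ("00:04:12: With the central buckle undone, the belt is now a "
                     "left and a right half. The left hand steadies the left half "
                     "against the seat rail while the right hand threads the right "
                     "half through the cam buckle.")},
    ]
}

with open("pre_finetune.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```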
I'm also hoping that over time it won't take as much detailed, concerted human effort to get good output, as I'm currently putting in close to the amount of time it would take me to do the summaries manually, and I'd like to do so without the massive amount of repetition I'm seeing in the inference output (repeated entries that are technically different but semantically identical, while the key semantic details I need it to catch are being missed). If this experiment can't show the model reasonably quickly picking up the absolute basics of this vocational skill set, then I'll likely have to abandon the effort and wait for a better fine-tunable AI platform to roll out. Maybe I'm a year or two too early, but I'm giving it the best try I can.
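One filter I may try for those semantically identical entries is an embedding comparison against entries already kept (a sketch, assuming text-embedding-3-small; the threshold is a guess that would need tuning):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def is_near_duplicate(new_entry: str, kept_entries: list[str],
                      threshold: float = 0.92) -> bool:
    """True if new_entry's embedding is nearly identical to one already kept.
    (In practice the kept embeddings would be cached, not recomputed.)"""
    if not kept_entries:
        return False
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=[new_entry] + kept_entries)
    vecs = np.array([d.embedding for d in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    return bool((vecs[1:] @ vecs[0]).max() > threshold)  # max cosine similarity
```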
It's a tug of war here: if I spend too much time doing very in-depth, specific training, I may bias the model inappropriately; or I could just accept that the model is doing the best it can and move on to the other series, in the hope of coming back to regenerate content when the model becomes smarter, with a wider knowledge set.
I am keeping all the training details set aside so that if the base model changes I can retrain and apply fine-tuning on top of it; that way the investment in creating human-generated training material and human-reviewed, AI-generated fine-tuning material won't be wasted.
Discussion welcome/encouraged.