Having trouble with advanced multimodal reasoning beyond the surface

Hello, excellent humans.

I am jumping onto the OpenAI GPT-4o multimodal inference engine in the hope of generating enough training data to fine-tune the model iteratively, but I'm having trouble coming up with enough quality data for fine-tuning.

The problem space is the analysis and summarization of vocational training videos in the wheelchair custom seating and mobility vertical. I capture keyframes sampled once per second (subsampled once every 1/3 second, with blur analysis to pick the clearest image, and with frames in the middle of scene changes skipped). I then present English/French/Spanish subtitles with timestamps (pulled from the audio track, transcribed to English using Whisper, and then translated to the other languages), along with image metadata telling GPT-4o each image's timestamp, plus a temporary Dropbox link from which the image is picked up to be vectorized/preprocessed before being presented to the model. The output is a series of image summaries describing specific details of the vocational skills being demonstrated. I allow multiple images to be combined into a single entry with a list of timestamps or timestamp ranges if they describe something visually similar, or a motion seen over a series of frames, such as swinging a hammer or running a sewing machine back and forth over a piece of fabric.
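For concreteness, the keyframe-selection step could look roughly like this. This is only a minimal sketch using OpenCV's variance-of-Laplacian as the blur metric; the scene-change skipping is omitted and the exact numbers are assumptions, not my actual code:

```python
import cv2

def extract_keyframes(video_path: str):
    """Keep the sharpest ~1/3-second candidate within each 1-second window."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps / 3)), 1)          # sample roughly every 1/3 second
    best = {}                                   # second -> (sharpness, timestamp, frame)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()   # higher = less blur
            ts = idx / fps
            sec = int(ts)
            if sec not in best or sharpness > best[sec][0]:
                best[sec] = (sharpness, round(ts, 2), frame)
        idx += 1
    cap.release()
    return [(ts, frame) for _, ts, frame in (best[s] for s in sorted(best))]
```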

I then ask the AI to combine the multilingual subtitle content and the previously produced image summaries to come up with descriptions that combine audio and visual information (I call these commentaries).

I also allow the AI to rewrite history: at every step I provide all the summary details from previous steps along with fresh new images to summarize, and the AI may rewrite older image summaries and commentaries if new visual or subtitle information lets it see past details in a new light.
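One iteration of that loop could be assembled roughly like this with the OpenAI Python SDK. It's only a sketch: the prompt wording, the shape of the prior-summaries JSON, and the names are placeholders, and the image URLs stand in for the temporary Dropbox links:

```python
import json
from openai import OpenAI

client = OpenAI()

def summarize_batch(system_prompt: str, prior_summaries: list[dict],
                    subtitles: str, image_batch: list[dict]) -> str:
    """image_batch items look like {"timestamp": 12.0, "url": "https://..."}."""
    content = [{
        "type": "text",
        "text": (
            "Previous image summaries and commentaries (you may revise them "
            "if the new frames change their interpretation):\n"
            + json.dumps(prior_summaries, ensure_ascii=False)
            + "\n\nSubtitles with timestamps:\n" + subtitles
            + "\n\nNew frames follow, with their timestamps."
        ),
    }]
    for img in image_batch:
        content.append({"type": "text", "text": f"Frame at {img['timestamp']}s:"})
        content.append({"type": "image_url", "image_url": {"url": img["url"]}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```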

So far, I am finding that although the AI is good at identifying large objects, it lacks sophistication both in identifying small details and in linking the objects and actions it sees to the overall purpose and the skills being taught/demonstrated. It is weak at identifying motions occurring over a series of image frames. It is also weak at taking subtitle content and weaving it into its overall summaries (although I do see it doing this a little).

My challenge: all the additional content I've had to add to the system prompt to give it specific instructions (e.g. watch the presenter's hands and report what each hand does and for what purpose; when a single object such as a wheelchair lap belt with a central buckle splits into a left and a right part once the buckle is undone, keep referring to left and right for as long as the belt stays undone) is taking up most of GPT-4o's allowable input context window of 128,000 tokens, and I haven't even started distilling the e-learning content (multiple-choice Socratic question generation from the presented learning content, which I was planning to be the sole input to the more limited text-only fine-tuning process OpenAI released at the end of August 2024).
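Something like the following would at least quantify the squeeze before each call. This is a sketch assuming tiktoken's o200k_base encoding, which GPT-4o uses; image tokens are not counted here:

```python
import tiktoken

# gpt-4o's text tokenizer; image inputs consume additional tokens not measured here
enc = tiktoken.get_encoding("o200k_base")

def context_budget(system_prompt: str, prior_summaries: str, subtitles: str,
                   limit: int = 128_000) -> dict:
    """Rough count of how much of the context window the text inputs already use."""
    used = sum(len(enc.encode(t)) for t in (system_prompt, prior_summaries, subtitles))
    return {"tokens_used": used, "tokens_left": limit - used}
```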

I wonder if I'll have to figure out a way to "pre-fine-tune" the model with specific input/output examples to clear out my input context before trying to fine-tune it on the e-learning content for the series, since my input context won't be big enough to train it on the next video series (I have five series, each with around ten videos, to process for my client, but there are many, many more videos in his collection that he could ask me to process should this prototype go well).
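If I go that route, I imagine each "pre-fine-tune" example would look something like this in OpenAI's chat fine-tuning JSONL format, demonstrating a behaviour instead of spelling it out in the runtime system prompt. The example text below is invented purely for illustration:

```python
import json

# Hypothetical example: the "left/right halves once the buckle is undone"
# rule is demonstrated rather than stated in the system prompt.
example = {
    "messages": [
        {"role": "system",
         "content": "Summarize wheelchair custom seating and mobility training frames."},
        {"role": "user",
         "content": "Frame at 00:42 - the presenter undoes the central buckle of a lap belt."},
        {"role": "assistant",
         "content": ("At 00:42 the buckle is released, splitting the lap belt into a left "
                     "and a right half; from here on the halves are referred to as the left "
                     "belt and the right belt until the buckle is refastened.")},
    ]
}

# OpenAI's fine-tuning endpoint expects one JSON object per line (JSONL)
with open("pre_finetune_examples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```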

I'm also hoping that over time it won't take as much detailed, concerted human effort to get good output, since right now I'm putting in close to the amount of time it would take me to write the summaries manually, and that I can avoid the massive repetition I'm seeing in the inference output (repeated entries that are technically different but semantically identical, while the key semantic details I need it to catch aren't getting caught). If this experiment can't show the model reasonably quickly picking up the absolute basics of this vocational skill set, then I'll likely have to abandon the effort and wait for a better fine-tunable AI platform to roll out. Maybe I'm a year or two too early, but I'm giving it the best try I can.
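Something along these lines might help collapse the "technically different but semantically identical" entries: embed each summary and drop anything too similar to one already kept. Again a sketch only; the embedding model name and the threshold are assumptions:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def dedupe_summaries(summaries: list[str], threshold: float = 0.92) -> list[str]:
    """Keep a summary only if it is not too close (cosine) to an already-kept one."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=summaries)
    vecs = np.array([d.embedding for d in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit vectors -> dot = cosine
    kept, kept_vecs = [], []
    for text, v in zip(summaries, vecs):
        if all(float(v @ kv) < threshold for kv in kept_vecs):
            kept.append(text)
            kept_vecs.append(v)
    return kept
```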

It's a tug of war here: if I spend too much time doing very in-depth, specific training I may bias the model inappropriately, or I could just accept that the model is doing the best it can and move on to the other series, in the hope of coming back to regenerate content when the model becomes smarter, with a wider knowledge set.

I am setting all training details aside so I can retrain if the base model changes and then apply fine-tuning on top of it, so the investment in creating human-generated training material and human-reviewed, AI-generated fine-tuning material won't be wasted.

Discussion welcome/encouraged.

ref : https://ai.stackexchange.com/questions/40753/looking-for-direction-to-develop-an-ai-trained-on-proprietary-training-videos


I think you’ve found yourself in a tough spot.

You are hard-coding things to try to accommodate fundamental failures. You're also undercutting the usefulness of the model by already knowing, and instructing it on, what to expect. You might as well watch the video yourself at this point.

Indeed. Even if you found yourself a nice, comfortable position, I would expect it to last only until the next video. You need something generalized enough to accommodate the wide(ish) variety of videos you intend to analyze.

Admittedly, I'm also new to the true video-understanding-with-LLMs game. I think a lot of people are. I guess that makes us pioneers :saluting_face:

There are a lot of very cool technologies being released.

I can't say if this will actually be helpful, but I do understand, on a fundamental level, that what you're experiencing is an attention issue. There's just too much going on.

So I can't say "this is your solution", as I'd like to, but as a fellow pioneer I can show you what I've been looking at and hope to implement. I'd also like to hear your opinion of it, and see whether it helps directly, or maybe leads you onto a path toward something that solves your issue:

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the given context length. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism to reduce the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within limited context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.
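For what it's worth, the first piece of that (dropping near-duplicate frames via DINOv2 features) could look roughly like this. It's just a sketch of the idea, not the paper's implementation, and the similarity threshold is a guess:

```python
import torch
from torchvision import transforms

# DINOv2 ViT-S/14 via torch.hub; forward pass returns a 384-d global embedding
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),              # 224 is a multiple of the 14-px patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def prune_redundant_frames(frames, threshold: float = 0.90):
    """frames: list of PIL.Image; keep only frames that differ enough from the last kept one."""
    kept, prev = [], None
    for frame in frames:
        feat = model(preprocess(frame).unsqueeze(0))
        feat = torch.nn.functional.normalize(feat, dim=-1)
        if prev is None or float(feat @ prev.T) < threshold:   # cosine similarity
            kept.append(frame)
            prev = feat
    return kept
```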
