Thanks for reading through!
These are interesting points and questions, to which I have answers.
Levels might be the way? Check for silent clips and ignore them. It feels like Whisper needs a main voice to latch on to, or it starts transcribing digital noise.
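For reference, the level check I had in mind is just an RMS threshold over the raw samples, something like this (the -45 dBFS cutoff and 16-bit scale are made-up example values, not anything Whisper uses):

```python
import math

def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of 16-bit PCM samples in dBFS (0 dB = full scale)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / full_scale)

def looks_silent(samples, threshold_db=-45.0):
    """Skip clips whose level falls below the (arbitrary) threshold."""
    return rms_dbfs(samples) < threshold_db

# Digital silence vs. a quiet-but-present signal:
print(looks_silent([0] * 16000))           # True
print(looks_silent([2000, -2000] * 8000))  # False
```

The catch, as below, is that footsteps and room noise have plenty of level, so this only catches truly silent clips.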
I gave it a clip of me walking loudly while Whisper was configured for “pt”, and it transcribed:
Legendas pela comunidade do Amara.org
That translates to “Subtitles by the Amara.org community”, which is another hallucination.
EDIT: Looks like this is an ongoing discussion: Dataset bias ("❤️ Translated by Amara.org Community") · openai/whisper · Discussion #928 · GitHub
So yeah, levels flew out of the window too.
You would need some sort of LLM to tell you whether there is voice or not. Then it might work.
Or simply a successor to the Whisper model that correctly outputs “(Silence)” or something along those lines when the audio is quiet.
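Until a model does that itself, the workaround would be gating the transcription behind a voice-activity check. A rough sketch of the idea, where `vad` stands in for whatever speech detector you pick and `transcribe` stands in for your Whisper call (both are placeholders, not real APIs):

```python
def transcribe_gated(chunks, vad, transcribe):
    """Return per-chunk text, emitting "(Silence)" for non-speech chunks
    instead of letting the model hallucinate on them."""
    out = []
    for chunk in chunks:
        if vad(chunk):
            out.append(transcribe(chunk))
        else:
            out.append("(Silence)")
    return out

# Toy demo with stubs standing in for a real VAD and a real ASR model:
fake_vad = lambda chunk: chunk == "speech"
fake_asr = lambda chunk: "hello world"
print(transcribe_gated(["speech", "noise"], fake_vad, fake_asr))
# ['hello world', '(Silence)']
```

Whether the gate is an LLM or a small dedicated VAD model, the structure is the same: never hand the transcriber audio it has already judged to be voiceless.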
If you fed each frame individually and asked ‘is there an air conditioner in this image?’ it might shed some light on what’s happening.
This would pretty much work; however, the question is heavily dependent on the video. You would need a step that asks all sorts of questions, which gets inaccurate pretty fast.
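To make the per-frame idea concrete, the loop would look something like this, where `ask` is a hypothetical stand-in for a vision-language model call that answers a yes/no question about one frame:

```python
def probe_frames(frames, question, ask):
    """Ask the same yes/no question of every frame and return the
    fraction of frames where the model answers yes."""
    votes = [ask(frame, question) for frame in frames]
    return sum(votes) / len(votes) if votes else 0.0

# Toy demo with a stub model that "sees" the object in two of four frames:
frames = ["f1", "f2", "f3", "f4"]
stub_model = lambda frame, question: frame in ("f2", "f3")
print(probe_frames(frames, "is there an air conditioner?", stub_model))
# 0.5
```

The weak point is exactly the one above: you have to know which questions to ask, and a battery of generic questions per frame gets expensive and unreliable quickly.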