I couldn’t find an API endpoint that can accept a video stream. There’s either one for images or video files.
As such if I wanted to create an app that allows the user to interact with ChatGPT multimodally as they showed off how would I do it? Would I open a stream to gpt4o and pass in a frame every x frames ?
if i am going to implement that demo with the current limitations, the easiest and simplest way is to trigger image capture when the user either starts talking or after, then send both to the backend, audio being transcribed by whisper and then composed together for vision request format with the captured image. i might also prepare a tool/function that will trigger image capture since user might not all the time wants to talk about the image and only do so when referred to in the conversation.