Khan Academy GPT-4o math tutor demo - how to?

In today’s Khan Academy video of GPT-4o math tutoring, where Sal Khan tutors his son,
the interactivity between the annotations on the triangle and the assistant’s responses is impressively synchronized.

Does anyone know what APIs are used to create this interactive multi-modal experience? Is the whiteboard converted to video and sent to the model?


That’s part of the unprecedented publicity push for this new model. It was filmed in the OpenAI offices, like a number of other videos in the publicity package, and released with perfect timing. It’s a plausibly “nice” application of the technology, as opposed to, say, replacing corporate phone trees.

The app likely only “looks” when you say “look”: you can notice the speech slowing down while the initial image is being obtained. The model can process more than an image per second, but if you listen closely, much of the discussion wouldn’t need visual cues at all. The actual loading of captured images into context, the context-length management of a chat full of images, and the interruption of output by speech can all be imagined in terms of existing generation and API methods. Continuing generation on top of a dynamically updated context is the one thing we don’t get.
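As a rough illustration of how captured whiteboard frames *could* be loaded into context through the public Chat Completions API (this is speculation about a possible approach, not a description of the app’s internals), here is a minimal sketch that builds a single user message interleaving a text prompt with base64-encoded frames:

```python
import base64

def build_vision_message(frames_png: list[bytes], prompt: str) -> dict:
    """Build one Chat Completions user message containing a text prompt
    followed by one or more captured whiteboard frames as data URLs."""
    content = [{"type": "text", "text": prompt}]
    for png in frames_png:
        b64 = base64.b64encode(png).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{b64}",
                "detail": "low",  # "low" keeps the per-image token cost down
            },
        })
    return {"role": "user", "content": content}

# The resulting message would be appended to `messages` and sent with
# client.chat.completions.create(model="gpt-4o", messages=[...]).
msg = build_vision_message([b"\x89PNG..."], "What did I just draw?")
```

Managing context length would then just be a matter of dropping or downscaling older frames before each request.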

The ChatGPT app is its own product, with capabilities that are not available through the API even now.

Since GPT-4o is multimodal, expect more input modes to be released as time goes on. How you plug audio, video, etc. into the API will, I’d guess, come with an API update soon.
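Until native audio input lands in the API, one workaround is a two-step pipeline: transcribe speech with the existing `/v1/audio/transcriptions` (Whisper) endpoint, then send the transcript to `gpt-4o` as ordinary chat text. A minimal sketch of the chat half of that pipeline, building only the JSON request body rather than making a network call:

```python
import json

def chat_request_body(transcript: str, model: str = "gpt-4o") -> str:
    """JSON body for POST https://api.openai.com/v1/chat/completions,
    carrying a Whisper transcript as the user turn. The system prompt
    here is just an illustrative placeholder."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a patient math tutor."},
            {"role": "user", "content": transcript},
        ],
    })

body = chat_request_body("Can you help me find the angle alpha?")
```

The obvious loss versus the demo is latency and prosody: a transcribe-then-chat pipeline can’t react to tone of voice the way the native app appears to.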

One possible method might be that the canvas in the Khan Academy app detects when the pencil draws something and sends a few images as the pencil moves, to provide enough context. An example of this kind of frame-based analysis can be seen in the video processing cookbook from OpenAI:
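A debounced stroke-sampler along those lines might look like the sketch below; every name here is hypothetical, invented purely to illustrate the idea of capturing a frame at most once per interval while the pencil moves, plus one trailing frame once it settles:

```python
class StrokeSampler:
    """Capture whiteboard frames while the pencil is moving, rate-limited
    to one frame per `min_interval` seconds, with a final capture after
    `settle` seconds of inactivity."""

    def __init__(self, min_interval: float = 1.0, settle: float = 0.5):
        self.min_interval = min_interval
        self.settle = settle
        self.last_sample = float("-inf")
        self.last_stroke = None
        self.frames = []

    def on_stroke(self, now: float, capture) -> None:
        """Called on every pencil-move event; `capture` grabs the canvas."""
        self.last_stroke = now
        if now - self.last_sample >= self.min_interval:
            self.frames.append(capture())
            self.last_sample = now

    def on_tick(self, now: float, capture) -> None:
        """Called periodically; takes one trailing frame once the pencil settles."""
        if self.last_stroke is not None and now - self.last_stroke >= self.settle:
            self.frames.append(capture())
            self.last_stroke = None

sampler = StrokeSampler()
sampler.on_stroke(0.0, lambda: "frame-0")  # captured immediately
sampler.on_stroke(0.4, lambda: "frame-1")  # within min_interval, skipped
sampler.on_stroke(1.2, lambda: "frame-2")  # captured
sampler.on_tick(2.0, lambda: "frame-3")    # trailing capture after settling
```

The collected frames would then be batched into a single vision request, much like the cookbook’s video examples do with sampled video frames.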