ah thank you, I did miss it
Ooh my, video to avm <3
ok, I’ve updated the app but I still don’t see the video functionality. I did see the santa voice though hahah, pretty cool
ah, can’t wait to test out video on AVM, doesn’t seem to be out on either Android or iOS yet
Near the end of the video they said it’s coming out very soon for Teams, then a little later for Plus and Pro, and then later in 2025 for Edu and Enterprise (paraphrased a bit)
ah, that makes sense! can’t wait to see it on the pro plan
man, this is amazing! I feel like I can become a better cook with this, might need to invest in getting a tripod or something like that hahah
woah, screen sharing is going to be a thing too! amazing!! I really hope it also works on PC, I’m 100% going to use it way more on a PC rather than on the cellphone
I’m very interested in seeing if it’s actually retaining frames without instruction (from the introduction part it definitely does, I just wonder how explicit it needs to be). I already have some tests I’d like to try:
- Nonchalantly scan the room while communicating something irrelevant to the visuals and then ask what was recently shown
- Roll a ball across the screen in numerous directions and see if the model is able to synthesize snapshots
ok ok, i got to the end
so in the next week most pro and plus users will be getting this feature?!!
what a time to be alive!!! lets gooo!! I’m so happy for this, this demo was amazing
Like Daniel Simons’ Invisible Gorilla?
That would be a fun test.
I’m more interested in understanding how they implemented it. The closest thing I can think of is something I’ve been dabbling in lately:
https://arxiv.org/pdf/2410.17434
There’s an online demo where it can be tried out. It’s capable of understanding very long videos (hours long!). It’s not live, but it’s very fast
x-mas joke of the day:
what’s every elf’s favorite music?
wrap music
was waiting for this feature, super happy about it!
LongVU failed
When explicitly asked, it does take note of the gorilla though. So… ehhh… idk, it’s a tough one to call because the video does talk about the gorilla at the end. It does correctly guess the walking direction though.
Gorilla wearing a gorilla suit? Gorilla-ception?
Will definitely need to try with cGPT when we can.
@anon10827405 about the room scanning, I’d love to have it working with multiple cameras… can you imagine losing something and asking “where did I leave my missing item?” that would be cool.
It would also be nice to use it as a security camera: if someone breaks in, turn all the lights red, start playing the Doom theme at a -8.0 pitch, and convince the person to reevaluate their actions in a santa voice
also, I’m super surprised how fast it is… must be using the realtime endpoint… because technically you can break a video down into frames and send them to a vision model… but that would take so much longer
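For anyone curious, the slow approach I mean would look roughly like this. A minimal sketch, assuming OpenCV for frame grabbing and the standard chat completions vision input (the model name and the 1-second sampling interval are just placeholders I picked):

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI()

def sample_frames(path: str, every_n_seconds: float = 1.0) -> list[str]:
    """Grab one JPEG-encoded frame every N seconds from a video file."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf).decode())
        i += 1
    cap.release()
    return frames

frames = sample_frames("kitchen.mp4")  # hypothetical input file
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "What happened in this clip?"}]
            + [{"type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
               for f in frames],
    }],
)
print(response.choices[0].message.content)
```

Works, but you’re uploading every frame and waiting on one big request, which is exactly why the live version being this fast surprises me.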
Dude. Yes! I am always leaving my keys in the dumbest places.
Imagine this bad boy hooked up with smart-glasses? Then AR? Watch-dogs when?
oh my lord. Then, smoke comes out of a room and the Mortal Kombat theme song starts to play with santa claus noises becoming louder.
I’m really hoping OAI releases some technical details. LongVU’s strategy is super cool and works great for (non-live) processing of visual content by running it through numerous models (3, plus an aggregation strategy). But, yeah, it’s very fast, and I’m hoping it’ll be just as fast for us plebs with somewhat poor & intermittently janky connections
The sample rate is about 1 image a second. These frames are then utilized as context when a generation is triggered by post-speech silence. I expect there is a limit to how long a buffer of images is used as input, both within the current narrative and as part of past chat turns.
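If that’s right, the plumbing could be as simple as this toy sketch (the 30-frame cap and the callback names are my own made-up illustration, not anything confirmed):

```python
import time
from collections import deque

MAX_FRAMES = 30          # assumed buffer cap, not confirmed
SAMPLE_INTERVAL = 1.0    # ~1 image per second, per the estimate above

# Rolling buffer: old frames silently fall off the left end.
frame_buffer: deque = deque(maxlen=MAX_FRAMES)
last_sample = 0.0

def on_camera_frame(frame: bytes) -> None:
    """Called for every camera frame; keeps roughly one per second."""
    global last_sample
    now = time.monotonic()
    if now - last_sample >= SAMPLE_INTERVAL:
        frame_buffer.append(frame)
        last_sample = now

def on_speech_silence(transcript: str) -> None:
    """Post-speech silence triggers a generation using the buffered frames."""
    context = list(frame_buffer)  # the most recent ~30 seconds of vision
    # ...hand `transcript` + `context` to the model here...
```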
How about a refrigerator… LOL
Never happened to me! Honest!
@_j already on the case dropping knowledge!
I love this place…
I’d imagine it uses some sort of compression scheme, similar to how ChatGPT compresses a conversation after a certain point
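Pure speculation, but a crude version of that kind of compression could look like this: once the buffer grows past a threshold, collapse the oldest frames into a short text summary and drop the pixels (the summarize() helper and the threshold are hypothetical):

```python
from collections import deque

MAX_RAW_FRAMES = 60  # made-up threshold

raw_frames: deque = deque()
summaries: list[str] = []

def summarize(frames: list[bytes]) -> str:
    """Hypothetical helper: ask a vision model to describe old frames."""
    return f"<summary of {len(frames)} earlier frames>"

def add_frame(frame: bytes) -> None:
    raw_frames.append(frame)
    if len(raw_frames) > MAX_RAW_FRAMES:
        # Compress the oldest half into one line of text, drop the pixels.
        old = [raw_frames.popleft() for _ in range(MAX_RAW_FRAMES // 2)]
        summaries.append(summarize(old))
```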
I want it everywhere too, in my microscopes and telescopes as well, 2 years from now will be an epic time to be alive
Untiled base images consume 85 tokens or even less. At ~1 frame per second, 100 seconds → 8,500 tokens of a 30k ChatGPT input context. OpenAI has the ability, as did a particular input format of gpt-4-vision-preview, to use larger untiled image inputs in its proprietary products, and to reserve such features for its own products.
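The back-of-the-envelope math, for anyone following along (the 1 fps rate comes from the estimate above; the rest is from this post):

```python
TOKENS_PER_FRAME = 85      # untiled base image cost
FPS = 1                    # ~1 sampled frame per second
CONTEXT = 30_000           # assumed ChatGPT input context

seconds = 100
frame_tokens = seconds * FPS * TOKENS_PER_FRAME   # 8,500 tokens
print(f"{frame_tokens} tokens = {frame_tokens / CONTEXT:.0%} of context")
# -> 8500 tokens = 28% of context
```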
Release updates log of previous days:
https://help.openai.com/en/articles/10271060-12-days-of-openai-release-updates
Oooh… Makes sense. LongVU does the same.
My biggest issue with this, and a reason why I increased the FPS, is that sometimes a single frame is blurry (from being live). I’m guessing that’s why they had to hold things steady for several seconds. Hopefully the API version will give us controls over this.
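In the meantime, a client-side workaround for the blur problem I’ve been toying with: score each frame’s sharpness with the variance of the Laplacian (a common blur heuristic in OpenCV) and keep only the sharpest frame from each capture window:

```python
import cv2  # pip install opencv-python

def sharpness(frame) -> float:
    """Variance of the Laplacian: higher = sharper, lower = blurrier."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def best_frame(frames: list):
    """Pick the least blurry frame out of a short capture window."""
    return max(frames, key=sharpness)
```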
Where did you find this info?
It was discussed after the spring update demo by a more official source than just my imaginings, a factoid that I bookmarked in my wetware for just this moment.