👾 Livestream: o3 plays Pokémon

Watch o3 play Pokémon—live. See how it plans its next move, explains its reasoning, analyzes the map visually, and saves to memory.

19 Likes

This is kinda wild and amazing. Thanks for sharing!

Could become a great series… o3 playing different games every month.

2 Likes

A few screenshots…



4 Likes

Sweet UI, hilarious how slow o3 is, would be interested to see if a different model can handle it and do it a lot faster?

Also where is “cumulative token use” in the UI, eh? Seems like watching o3 move back and forth 6 steps left and right for the last 5 minutes was like, several hundred thousand tokens or something..?

If it is API o3, its token-generation speed is a far cry from whatever benchmark-running o3 is used internally to make announcement bar graphs. The consumption would mostly be in the input and vision context, depending on how the chat/turn history is managed. It looks like new input is about 10k–15k per turn, with about 45k cached that is not expiring or rotating out. We're talking roughly 50,000 tokens processed to receive an output of "press the A button".
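Back-of-the-envelope, using the figures above (these per-turn numbers are eyeballed from the stream, not measured):

```python
# Rough per-turn context size, using the eyeballed figures above.
new_input = 12_500      # midpoint of the ~10k-15k fresh input per turn
cached_input = 45_000   # stable cached prefix that isn't expiring or rotating out
print(new_input + cached_input)  # ~57,500 tokens processed per single button decision
```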

Looks like the turnaround on WeWork plays Pokemon might be faster and cheaper.

I've seen plenty of even text-adventure attempts get lost in loops while the context is still lossless. Sparse attention does the discarding, and the impression is that quality becomes similar to "vector RAG". Here, it seems a managed memory text is being attempted, along with some backend access to the game, and then a "where did it go wrong" agent prompt without cache.

Run the release date version of gpt-4.5, though…

1 Like

Other LLMs took more than twice as long to reach where o3 is right now!

| Model | Steps to Boulder Badge | Hours to Boulder Badge |
|---|---|---|
| OpenAI o3 | 710 | ~12 |
| Claude 3.7 | 5,000 | N/A |
| Gemini 2.5 Pro | N/A | ~35 |
1 Like

It’s still going at it!?

A little tweak could make this less boring and impenetrable: a super-summarizer that does TTS on the action taken and the reasoning. An announcer’s play-by-play.

Perhaps a structured output “my_action_reasoning” key – if asking about the reasoning doesn’t trigger recursive-prompt policy blocks and API organization bans…
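For what it's worth, a minimal sketch of what that could look like as a JSON Schema for structured outputs — the schema name and field descriptions here are invented for illustration, not taken from the stream:

```python
# Hypothetical schema for an announcer-style structured output.
# "turn_summary", "action", and "my_action_reasoning" are illustrative names only.
turn_summary_schema = {
    "name": "turn_summary",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "description": "Button(s) pressed this turn, e.g. 'A' or 'UP, UP, A'",
            },
            "my_action_reasoning": {
                "type": "string",
                "description": "One-sentence, play-by-play explanation of the move",
            },
        },
        "required": ["action", "my_action_reasoning"],
        "additionalProperties": False,
    },
}
# The reasoning string could then be handed straight to a TTS engine for live commentary.
```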


Thought experiment: can a “patches” vision model such as o4-mini, based on semantic units of 32px x 32px patches, play this game if it is a 32 pixel grid? An odd increment?

1 Like

That would be kinda cool… Feed it input every 3 to 5 minutes then while it’s talking process the next batch? Hrm…

1 Like

Ya just saw somebody else recommend that in the Twitch chat too!

1 Like

Needless to say this is true

this actually hurts to watch
but i can’t stop checking in every few hours

:woozy_face:

1 Like

Though you may ignore this or think I’m being pretentious: I did some thinking last night, and I believe you could increase the speed of model response by at least an order of magnitude, and/or use simpler models, by performing the following refactor:

(All in all, my opinion was that you have a good UI and interaction with the gameplay emulator, but poor usage of the LLM’s capacity to actually perform the task, because you are relying too much on sending a raw dump of data to a single model call, instead of properly allowing the LLM to use a world-state context window and direct middleware management to create the proper environment for sped-up gameplay.)

The models themselves have become so powerful that interacting with them in single-shot calls with linear context-window data is the least efficient way to use the LLM. A new approach is required that allows the LLM to actually manage and control its own world-state context and middleware environment, thus allowing real improvement and externalised learning to take place, instead of always treating the model as a black box (which it is, but we keep limiting its effectiveness through the systems we design to interact with it!).

1. Modify your world state inputs to separate the vision processing layer from the reasoning call

(thus the world state for the reasoning model includes only tabular text data representing the visual state, massively reducing token load and reasoning load for the reasoning call, and allowing it to focus on the gameplay elements and on producing automated scripting for actions)

2. Dramatically improve the model’s ability to modify the world-state/middleware environment

Instead of only a few simplistic “memory logs”, actually encourage, and provide in the initial state, frameworks for the model to use to maintain an actual learned state of gameplay and probabilistic results based on available data. The model then projects its moves/activity into the future of the gameplay, because it has a world state that represents the game more fully in a semantically relevant way, and can therefore take longer sequences of actions based on it.

i.e. sequential lists of time-based information are absolutely the worst way to get the model to show forward-thinking, learn-from-past-experience behavior. The context must become a three-dimensional world state the LLM manages itself, NOT a linear record of activity.

3. Allow the model to predict a sequence of valid moves and to program, on the fly, both a “walking” and a “fight scene” automation layer based on logical if-then rules over the gameplay response

The model is currently handicapped by being relegated to a single-turn sequence

  • With gameplay as simplistic as Pokémon’s, the model can easily provide automation (or write the code for such automation on the fly) for long sequences of walking moves, NPC interactions, and fight scenes, and then “get the total results after the fact”. It’s a low-risk game, and the risk of fight scenes can be factored in as necessary to prevent total character loss.
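A minimal sketch of the kind of if-then fight-scene policy the model could emit once and the middleware could then replay unattended — thresholds, field names, and actions are all invented for illustration:

```python
# Illustrative if-then battle policy; every threshold and field name is a placeholder.
def battle_policy(state: dict) -> str:
    """Pick the next high-level action for a routine encounter from a parsed battle state."""
    if state["my_hp_fraction"] < 0.25:
        # Low risk tolerance: heal or bail before the party takes real damage.
        return "use_potion" if state["potions_left"] > 0 else "run"
    if state["enemy_hp_fraction"] > 0:
        return "use_strongest_attack"
    return "advance_dialogue"  # battle over; mash through the result screens
```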

Inputs

  1. (result) Tabularized data of the current gameplay state (i.e. map, etc.)
    (process) The call to the vision model is separate, not part of the main model call, and generates tabular data of on-screen tile positions and types (a text map of the map). There is no need for the main reasoner to parse such a simplistic 8-bit visualization that could easily be represented by tabular data and much more easily parsed directly by an agent whose only responsibility is that parsing.

    • Even more interesting (to see how it affects timing) would be an optional addition to the vision call/vision processing (I still think vision processing of the 8-bit data should be able to occur in under a second, with or without an LLM vision model, just to produce a tabularized set of tile data). It would possibly slow things down (maybe not much), and would say: “Hey vision model, if the map contains mostly areas greater than 2x5, simply describe the map in natural language, like ‘there is a field of tiles (A) 10x5, bordered by a field of tiles (Y) 2x8, with an NPC located at (x, y)’, and fall back to tabularized data if it’s more of a maze zone, complex map area, etc.”
  • This call would be split into 3 parts for production of the tabularized data:
    • the mini-map, the main map, and any menus currently displayed

Thus you treat the gameplay output basically as an API endpoint. You have a translation layer, which could be a vision model call (or are there not better local parsers, rather than wasting tokens and time on a vision model for such simplistic processing of visual data?), that, for either a single-sequence or multi-sequence button press, takes the gameplay screens as input and outputs the tabularized data.
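As a sketch of that translation layer’s output side — tile codes and the summary format are assumptions, not what the stream actually uses:

```python
# Sketch of the tile-grid -> text step of the translation layer.
# Tile codes are invented: '.' walkable, '#' blocked, 'N' NPC, '@' player.
def grid_to_text(grid: list[list[str]]) -> str:
    """Render the parsed tile grid as a compact text map for the reasoning model."""
    return "\n".join("".join(row) for row in grid)

def summarize_grid(grid: list[list[str]]) -> str:
    """Natural-language fallback for simple, open maps."""
    flat = [tile for row in grid for tile in row]
    return (f"{flat.count('.')} walkable tiles, {flat.count('#')} blocked tiles, "
            f"{flat.count('N')} NPC(s) visible")

example = [list("####"), list("#.N#"), list("#.@#"), list("####")]
print(grid_to_text(example))
print(summarize_grid(example))
```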

  2. (result) Existing world-state context window

I think we all understand this one. The critical aspect is the instructions to the model to develop the world state itself from gameplay input and initial instructions. Obviously it would take time, and discussion of the gameplay/game map alongside the model, to come up with the best possible set of world-state instructions. Alternatively, you could enable a side-call: say, after every 10 turns, have the model (which holds the gameplay world state) step out of the world state, make an LLM-to-LLM call to discuss the success/failure of the current world-state strategy, consider modifications to the global mapping/memory strategy, make modifications as necessary, and return to gameplay.
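A sketch of that every-10-turns side-call, with the actual model call injected as a callback since I don’t know how the stream’s backend is wired:

```python
from typing import Callable

REVIEW_INTERVAL = 10  # step outside the gameplay loop every 10 turns, as suggested above

def maybe_review_strategy(
    turn: int,
    world_state: str,
    recent_log: str,
    ask_llm: Callable[[str], str],  # caller supplies the real LLM call
) -> str:
    """Periodically critique the mapping/memory strategy, then return the (possibly revised) state."""
    if turn % REVIEW_INTERVAL != 0:
        return world_state
    prompt = (
        "Given this world state and the last few turns, where is the current "
        "mapping/memory strategy failing, and what single change would help most? "
        "Return the full revised world state.\n\n"
        f"WORLD STATE:\n{world_state}\n\nRECENT TURNS:\n{recent_log}"
    )
    revised = ask_llm(prompt)  # the LLM-to-LLM discussion happens here
    return revised or world_state
```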

LLM Instructions to develop World State (part of initial state)

  1. Develop the world state by taking the map data and tracking the position of the character in the total map over time. Create a “master map”, which is a scaled-down version of the total map, or otherwise create a stable mnemonic device that allows for easy reference of known map locations. If necessary, chunk known data and store it outside the context window, but provide an index within the context window for recall when entering that map.

  2. Create a synthetic filter on the data received from the gameplay translator to maintain a stable set of information about character health, skills, levels, etc., without any duplication of information within the context window. (Obviously you already implemented this aspect.)

  3. Maintain a track record of critical results from gameplay: a list of interactions with NPCs, a list of critical locations such as health centers, entrances, etc. (I didn’t see this implemented.)

  4. Maintain a track record of probabilistic results when interacting with fight scenes, etc., in order to guide strategy for automated sequences in the future.
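One way the four items above could be laid out as a persistent structure the middleware hands back to the model each turn — all field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Illustrative container for items 1-4 above; field names are made up."""
    master_map: dict[str, str] = field(default_factory=dict)     # map name -> scaled-down text map (item 1)
    party_status: dict[str, dict] = field(default_factory=dict)  # HP/levels/skills, deduplicated (item 2)
    critical_places: list[str] = field(default_factory=list)     # health centers, entrances, NPC notes (item 3)
    battle_outcomes: list[dict] = field(default_factory=list)    # observed fight results (item 4)

    def record_battle(self, move: str, damage: int, won: bool) -> None:
        """Log outcomes so future automated fight sequences can pick moves by observed payoff."""
        self.battle_outcomes.append({"move": move, "damage": damage, "won": won})
```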

Outputs

  1. Single-turn or multi-turn (multi-turn preferred) move sequences (simplistic version)
  2. Automated move sequences (script)
  3. Direct modification to world state in terms of global map, strategy map, memory, etc.
  • Note that (2) requires a translation layer as well: allow the model to generate an automation script, and the middleware takes that script from the model’s response and applies it directly against the API layer for the game emulator.
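A sketch of that output-side translation layer — `press` stands in for whatever emulator input API actually exists:

```python
from typing import Callable

# The model emits a plain newline-separated move script; the middleware replays it
# against the emulator. `press` is a placeholder for the real emulator input API.
def run_move_script(script: str, press: Callable[[str], None]) -> None:
    """Execute a newline-separated script (e.g. UP, UP, RIGHT, A) emitted in one model response."""
    valid = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START", "SELECT"}
    for line in script.strip().splitlines():
        button = line.strip().upper()
        if button in valid:  # skip anything malformed instead of crashing mid-sequence
            press(button)
```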

Thus, the model is able to perform gameplay similar to how a “human mind” would

(i.e. by building its own cognitive/mnemonic state map, allowing it to increase its speed and effectiveness during gameplay, and by proceeding much more rapidly thanks to separating the simplistic visual translation layer from the reasoning layer and allowing the production of long-running automatic move sequencing).


TL;DR: Interested? Put me in touch with your team, give me two weeks and an NDA, and I’ll work for free just for the love of the (LLM) game :slight_smile:

1 Like

You should get in touch with the creator directly! You can see their contact info or chat with them in the Twitch chat. (This stream wasn’t put together by OpenAI.)

1 Like

geepers, i can’t imagine the cost to play this game all the way through, at this rate…

lol

That’s the amazing thing about the cheaper models, among which I include 4.1… Let’s say, with a proper system, each context window was 25k tokens, but you got an output of just 2k that included multi-turn automated moves…

So if each call automated 10 moves, that’s 10 moves for about 5 cents… So to play through, say, 3,000 moves (300 calls), that’s roughly $15.00.
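Roughly, assuming gpt-4.1 list prices of about $2/M input and $8/M output, and ignoring cache discounts:

```python
# Rough per-call cost; assumed gpt-4.1 list prices ($2/M input, $8/M output), caching ignored.
input_tokens, output_tokens = 25_000, 2_000
cost_per_call = input_tokens / 1_000_000 * 2.00 + output_tokens / 1_000_000 * 8.00
print(round(cost_per_call, 3))        # ~0.066 -> 5-7 cents per call of ~10 automated moves
print(round(cost_per_call * 300, 2))  # ~19.8 -> cached input brings this down toward $15
```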

But if you share the stream with OpenAI you get a lotttt of free tokens per day. So if you limited it somewhat, you could probably play through the whole game in a few days for more or less free.

If the middleware provides the automation and thinking, it’s free. It’s only expensive once you start using the expensive models and relying on them to do all the work that could otherwise more efficiently run as local code.

3000 turns so far.

An average looks like 20k new input, 40k cached input. 1.5k output.

Turn those k into M and multiply by 3 (thousand turns):

# Rates used below (per 1M tokens): $10.00 uncached input, $2.50 cached input, $40.00 output
uncached_tokens_per_job = 20000
cached_tokens_per_job = 40000
output_tokens_per_job = 1500
jobs = 3000

# Total tokens across all turns
uncached_tokens = uncached_tokens_per_job * jobs
cached_tokens = cached_tokens_per_job * jobs
output_tokens = output_tokens_per_job * jobs

# Dollar cost per category, then the grand total
cost_uncached = uncached_tokens / 1_000_000 * 10.00
cost_cached = cached_tokens / 1_000_000 * 2.50
cost_output = output_tokens / 1_000_000 * 40.00
total_cost = cost_uncached + cost_cached + cost_output

cost_uncached, cost_cached, cost_output, total_cost

(600.0, 300.0, 180.0, 1080.0)

A 10-year-old is cheaper and wouldn’t bump into the same wall for 10 minutes.

2 Likes

thx for that…

and it’s not even out of the newbie stuff!
i don’t really understand how to hook up AI to do this sort of thing yet,
but it seems like it keeps inverting the map somehow.

:upside_down_face: :melting_face:

1 Like

:joy: It’s like firing up a supercollider to pop bubble wrap. Technically impressive, but utterly bonkers just to hit “A.”
Save yourself the credits you’re burning and let a simple script handle the button.

Amazing idea. Excited to see the future of LLMs with video games.

Pokemon is a very ambitious start. Incredible to see the progress made though.
Kind of crazy to think that games like Portal (Valve) will technically become a reality for some LLMs.

This would be cool employed as an agentic system. o3 to asynchronously designate goals and map locations, and a faster model like gpt-4.1 to control the interface. As it seems right now, the model is constantly route-planning to its own detriment.
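A sketch of that planner/executor split — all function names and the re-planning cadence are assumptions:

```python
from typing import Callable

# Illustrative planner/executor loop: a slow reasoning model sets goals occasionally,
# a fast model turns the current goal plus the current screen into a button press.
def play(turns: int,
         plan: Callable[[str], str],      # e.g. an o3 call that writes a goal/route
         act: Callable[[str, str], str],  # e.g. a gpt-4.1 call that returns one button
         observe: Callable[[], str],      # current screen/world state as text
         press: Callable[[str], None]) -> None:
    goal = plan(observe())
    for turn in range(turns):
        state = observe()
        if turn and turn % 50 == 0:       # re-plan only occasionally, not every step
            goal = plan(state)
        press(act(goal, state))
```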

Hopefully the author open-sources their script and we can finally see some collaborations come to life.

1 Like

They’ve made some UI tweaks to make steps/progress a bit clearer:

2 Likes