Can LLMs Be Computers - Embedding a VM inside an LLM

This paper blew me away: Can LLMs Be Computers? | Percepta

The entire agentic ecosystem today, from Claude Code to Codex to OpenClaw, is based on the principle of surrounding LLMs - which are simply token distribution generators - with external tools (Python, git, curl, APIs, MCPs) and a state machine. An LLM cannot reliably “execute” a mathematical expression - it can talk/reason about it; it can “guess” the result (if it appeared in the training data); it can produce Python code to run (externally) in a Python sandbox and use the result in its answer.

What the folks in this paper did was turn the LLM into a virtual machine (VM), effectively using the transformer weights as a fast WebAssembly interpreter. The result: you give the LLM a program, it ACTUALLY executes the program by emitting a stack trace (internally), and it outputs the result!

They took a vanilla PyTorch Transformer architecture (a couple of dozen lines of code) and embedded WebAssembly in the model weights by training the transformer to predict the state transitions of a WASM virtual machine, so that the attention and MLP layers collectively implement the interpreter’s logic. Programs are fed in as tokens, and the network simulates execution step by step.
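For intuition, here is roughly what I picture the training setup to be - this is my own minimal sketch, not the authors’ code, and the vocabulary size, model dimensions, and state encoding are all placeholder assumptions:

```python
# Minimal sketch (my assumptions, not the paper's code): a vanilla transformer
# trained to predict the *next* VM state given the program tokens plus the
# current VM state, both serialized as tokens.
import torch
import torch.nn as nn

VOCAB = 512    # placeholder: opcodes, operands, stack/memory value tokens
D_MODEL = 256
MAX_LEN = 1024

class VMStateTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, VOCAB)  # logits over next-state tokens

    def forward(self, tokens):
        # tokens: (batch, seq) = [program ... | current VM state ...]
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(positions)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        h = self.encoder(h, mask=mask)
        return self.head(h)  # trained with cross-entropy against the next state

# Training pairs would come from a real WASM interpreter: run a program, record
# the (state_t -> state_t+1) transitions, and fit the model to that transition
# function. At inference you roll the model forward one VM step at a time.
```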

If it holds up, this could be an absolute game changer. Imagine you somehow integrated this into a GPT model, i.e. you effectively embedded a code interpreter “inside” the LLM weights. Then instead of burning costly/inefficient thinking/reasoning tokens, you would use an internal, logarithmic-complexity state emission to literally compute or “run” something before finally emitting tokens.

Anyway, it got me very excited, and I’m wondering if OpenAI peeps are looking into this?

6 Likes

I can’t wait for Microsoft to go all-in on this so Windows Calculator will say 2+2 = ‘A Journey’ :joy_cat:

1 Like

ChatGPT tells me that when LLM inference is 100 or 1,000 times faster, then this is possibly good… Though honestly I think this might be a reach.

Do you have particular use cases in mind that you can share?

(Maybe I am not so smart but I love ASM code :slight_smile: and would be interested in any use cases)

I struggle to see the game-changing aspect here; from my perspective, it just looks like using a computer to host an LLM that pretends to be a computer. Why not just use a computer as a… computer? :sweat_smile:

1 Like

The game-changing aspect is in the context of how things are done today. Today, you use an LLM to churn through tens of thousands of tokens (both hidden/thinking and actual output), sequentially, and then hand that output to an external tool (e.g. a shell, a Python interpreter, etc.) to execute, feed the results back, and repeat - roughly the loop sketched below.
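Schematically, something like this - the `llm` object and its attributes here are hypothetical stand-ins just to illustrate the loop, not any real SDK:

```python
# Rough sketch of today's agentic loop: the model emits tokens, an external
# tool executes them, and the result is fed back in for the next round.
import subprocess

def agentic_loop(llm, task, max_rounds=10):
    context = task
    for _ in range(max_rounds):
        reply = llm.generate(context)  # thousands of thinking/output tokens
        if reply.done:
            return reply.text
        # e.g. reply.tool_call == "python -c 'print(2**32)'"
        result = subprocess.run(reply.tool_call, shell=True,
                                capture_output=True, text=True)
        context = context + reply.text + result.stdout  # feed back, repeat
    return context
```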

This “game-changing approach” is where you bring the external tool (e.g. the interpreter) into the transformer and apply logarithmic-complexity thinking/execution internally, very rapidly, with no tokens emitted. You emit only the final result, once the LLM is satisfied.

It has nothing to do with “use a computer as a computer” (what does that even mean?) and everything to do with using LLMs more efficiently.

4 Likes

So is the gain actually in efficiency, or just in not serializing the reasoning into tokens?

From what I understand, this approach avoids emitting reasoning tokens, and instead performs the reasoning internally in the model’s state. I presume with similar underlying compute?

1 Like

Yes, the gain is in efficiency, and it’s twofold:

  1. Not serializing and emitting reasoning tokens, but instead performing that “internally” with state transitions in the attention and MLP layers. It is orders of magnitude faster. Let’s say I tell gpt-x to solve x^2 + 2x + 1 = 5 and x^2 + 3x = 2 - right now it would invoke “test-time compute” and perform CoT, which means going all the way through and emitting tokens sequentially (see the sketch after this list for what the external route looks like). Instead, it would solve this “internally” with no token emissions, and what these peeps found was that they can do it in logarithmic time (better than linear).

  2. For code execution it goes even further than that - the internal weights are actually a WASM interpreter, so instead of “thinking” and emitting code to execute, the internal states actually execute the code.
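For contrast, here’s what the external route for item 1 looks like today - the model emits code and hands it to a sandbox. The sympy usage below is my illustration, not something from the paper:

```python
# What an LLM typically emits today: code for an external sandbox to run.
from sympy import Eq, solve, symbols

x = symbols("x")
print(solve(Eq(x**2 + 2*x + 1, 5), x))  # roots: -1 +/- sqrt(5)
print(solve(Eq(x**2 + 3*x, 2), x))      # roots: (-3 +/- sqrt(17)) / 2
```

With the paper’s approach, that round-trip to the sandbox disappears: the same computation happens as internal state transitions.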

2 Likes

I don’t, thanks (I run a micro Claw setup myself), and it has nothing to do with the paper I shared, but thanks anyway!

I see the parallel, but your system is still executing outside the model weights, just very efficiently.

I’m more interested in whether moving execution into the weights actually reduces compute, or just removes token overhead.

And I guess the trade-off is you lose visibility into the reasoning as well?

From a reasoning-first perspective, it feels like we’ve drifted a bit from that core question.

naw, everything is recorded on the machine, just like the paper implies.

observations are just logs. the weights too. I do train them - on the same machine; I just don’t train them in the way the paper depicts, I use NVIDIA TAO - the paper has an ineffective manner of doing that.

screenshot one has the weights on the tail plus embeddings. again - I do train them, just not in a wasteful manner. I create the weights from the ground up.

I think that’s slightly different though: logs and training aren’t the same as execution inside the forward pass.

In your setup the code is still being executed outside the model weights, right?

in a way - I use a level of metaprogramming with “reflexes”: the code doesn’t trigger a reaction, the values of the weights do - i.e. it’s internal to the model trained/used, not external.

I don’t see any connection between their approach and what I posted originally @phyde1001 .

Back to your point @phyde1001 - with the approach in the paper you actually gain transparency - you don’t really see reasoning tokens now anyway, unless you use an open-weight model like Qwen3. But with the authors’ approach you can in principle see the internal state transitions, like stepping through a program in gdb. It’s one of the aspects they actually highlighted.

@phyde1001

see… before he stated it (Qwen 3 check, observability check, INTERNAL state check)

Sorry guys, school run and kids’ dentist, but keep at it and I’ll catch up :wink:

That’s interesting, I see what you mean about internal state transitions being inspectable.

I guess the distinction I’m still thinking about is that while you might gain machine-level transparency (like a debugger view), you lose human-readable reasoning unless it’s explicitly surfaced.

So it feels like a shift from interpretability in text to interpretability in state?

(OK gone for a bit, 1h to catch up)

1 Like

Yeah, that’s how I interpret it as well - back to the old-school way of debugging. At the same time, encoding something like WASM in the weights kind of brings determinism into the transformer - you are simply representing a code interpreter as a set of matmul operations under the hood. The current way (thinking tokens) is not very robust anyway, because those tokens are sampled from some distribution and you are relying on that to “tell you the truth”. You can see how weak it is when I use Qwen3, for example, and tell it “you are wrong - think again”: it will happily oblige, even if I am the one who is wrong. So another gain is “reasoning guarantees”. Maybe. :sweat_smile: I mean, I just find their approach interesting, and I don’t think we should be too dismissive of it.
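To make the “matmul determinism” point concrete, here’s a toy example of my own (not from the paper): a three-state machine whose transition function is literally a matrix, so stepping it is one matmul per step:

```python
# Toy finite-state machine as a matrix: one-hot state @ transition matrix
# deterministically yields the next one-hot state.
import numpy as np

# hypothetical 3-state machine: 0 -> 1, 1 -> 2, 2 -> 2 (halt)
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

state = np.array([1.0, 0.0, 0.0])  # start in state 0
for step in range(4):
    state = state @ T
    print(step, "->", int(state.argmax()))  # prints 1, 2, 2, 2
```

A transformer’s attention/MLP stack can in principle encode transition logic like this in its weights, which is where the deterministic flavor comes from.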

1 Like

I tried to inform you of a more advanced version of the paper - which was dismissed :smiley:

I look forward to the next post.