When Code Stops Being Text and Starts Being State
Letting AI systems execute code changed how I think about correctness. The hard problems are no longer linguistic. They are about state, determinism, isolation, and what it means to trust a result that actually ran.

The first time I let an AI system execute code, it stopped feeling like a chatbot and started feeling like software.
Not smarter software.
More dangerous software.
Text generation lives in the world of plausibility.
Code execution lives in the world of consequences.
That boundary is easy to underestimate until you cross it.
This post is about what actually changes when models stop describing answers and start producing state.
Text Is Forgiving. State Is Not.
For language models, being wrong is surprisingly cheap.
If an explanation is slightly off, users shrug.
If a paragraph hedges, nobody files a bug.
Execution does not get that luxury.
Once code runs, the system has to deal with memory, files, CPU, time, and failure. These are not abstract concepts. They persist across requests. They leak if you are careless. They compound if you ignore them.
What surprised me most was how quickly hidden assumptions surfaced. Identical prompts behaved differently because something subtle changed underneath.
A file existed in one run but not another.
A cached object lived longer than expected.
A dependency resolved differently because the environment drifted.
None of this was visible at the prompt layer.
All of it lived in execution.
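One thing that helped was making drift visible instead of mysterious. A minimal sketch: fingerprint the environment before every run, so "same prompt, different result" can be traced to a changed digest. The function name and file-walking scope here are illustrative, not a prescription.

```python
# Sketch: fingerprint the execution environment before each run so that
# "identical prompt, different result" points to drift, not the model.
import hashlib
import importlib.metadata
import os


def environment_fingerprint(workdir: str = ".") -> str:
    """Hash installed packages and visible files into one comparable digest."""
    h = hashlib.sha256()
    # Installed dependencies, sorted so ordering is stable across runs.
    for dist in sorted(importlib.metadata.distributions(),
                       key=lambda d: d.metadata["Name"] or ""):
        h.update(f"{dist.metadata['Name']}=={dist.version}".encode())
    # Filesystem layout the code can see.
    for root, dirs, files in sorted(os.walk(workdir)):
        for name in sorted(files):
            h.update(os.path.join(root, name).encode())
    return h.hexdigest()


# Log this digest alongside every execution. Two runs with different
# digests are not comparable, whatever the prompts looked like.
print(environment_fingerprint())
```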
The Environment Becomes an Interface
As soon as code execution is allowed, the execution environment becomes part of the API.
The model is no longer just responding to text. It is interacting with a world defined by filesystem layout, installed libraries, network access, and resource limits.
Models are very good at discovering affordances.
If a library exists, it will be imported.
If a directory exists, it will be explored.
If a limit is unclear, it will be tested.
This is where many systems break. Execution is treated as an internal detail instead of a first-class interface.
In practice, the sandbox is not a safety feature.
It is the product.
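Taking that seriously means the environment gets declared, not discovered. A rough sketch of what I mean, with illustrative field names rather than a real framework:

```python
# Sketch: treat the sandbox as a declared, versioned interface rather than
# an implicit detail. Field names are illustrative, not a real API.
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxSpec:
    image: str                       # pinned base image
    packages: tuple[str, ...]        # exact versions the model may import
    writable_paths: tuple[str, ...]  # everything else is read-only
    network: bool = False            # off unless a task explicitly needs it
    cpu_seconds: int = 30
    memory_mb: int = 512


SPEC = SandboxSpec(
    image="python:3.12.4-slim",
    packages=("pandas==2.2.2", "numpy==1.26.4"),
    writable_paths=("/tmp/job",),
)
# Because the spec is data, it can be diffed, reviewed, and version-controlled
# like any other interface change.
```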
Determinism Stops Being Optional
With language only systems, variation is expected.
With execution, variation feels like a bug.
If the same input produces different results, evaluation collapses. You cannot regression test. You cannot compare changes. You cannot trust automation built on top.
Execution forced me to care about determinism in a way prompts never did.
That meant versioned environments.
Pinned dependencies.
Controlled randomness.
Explicit resource limits.
Once code runs, reproducibility is no longer a nice-to-have. It is table stakes.
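A minimal sketch of what that looks like in practice, assuming a POSIX host (the resource module is POSIX-only); the specific seed and limits are placeholders:

```python
# Sketch: pin down the obvious sources of nondeterminism before running
# model-generated code. POSIX-only because of the resource module.
import os
import random
import resource


def make_run_reproducible(seed: int = 0) -> None:
    random.seed(seed)  # controlled randomness
    # Note: PYTHONHASHSEED only affects hashing if set before the
    # interpreter launches, so this belongs in the launcher, not here.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Explicit resource limits: fail loudly instead of drifting quietly.
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))           # CPU seconds
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20,) * 2)  # address space
    # Pinned dependencies live in a lockfile; the versioned environment
    # lives in the image tag. Code only enforces what code can enforce.
```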
Errors Become User Facing Design
Another uncomfortable shift is how failures present themselves.
In text systems, failures are conversational. The model misunderstood or lacked context. The fix is often another prompt iteration.
Execution failures are concrete.
A script crashes.
A dependency is missing.
A computation overflows.
These are not wording problems. They are software problems.
Raw stack traces are useful but hostile. Over-sanitized errors hide the signal developers need. Designing the error surface becomes part of the core experience, not an afterthought.
Execution turns error handling into product design.
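As a sketch, this is the shape I mean: run the code, then deliberately design what the caller sees. The dict fields and the three-line tail are illustrative choices, not a standard.

```python
# Sketch: shape raw execution failures into a surface a developer can use.
# Keep the actionable tail of the traceback, keep the full trace retrievable.
import subprocess
import sys


def run_and_shape(path: str) -> dict:
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "summary": "execution exceeded the 30s limit"}
    if proc.returncode == 0:
        return {"ok": True, "stdout": proc.stdout}
    tail = proc.stderr.strip().splitlines()[-3:]  # the tail carries the signal
    return {
        "ok": False,
        "exit_code": proc.returncode,
        "summary": tail[-1] if tail else "process failed with no stderr",
        "context": tail,  # enough to debug, small enough to read
    }
```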
Accuracy Is No Longer the Right Metric
It is tempting to think code execution simply improves accuracy.
In practice, it changes what accuracy even means.
A system can compute a result perfectly while solving the wrong problem. A model can execute flawed logic flawlessly. Without visibility into intermediate steps, these failures are harder to catch than hallucinations because they look authoritative.
This pushed me toward instrumenting execution itself.
Capturing traces.
Logging intermediate values.
Evaluating code paths, not just outputs.
With execution, correctness is about process, not just answers.
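A crude sketch of the idea using Python's built-in sys.settrace: record which lines actually ran, then judge the run by its path as well as its answer. The analysis function is a stand-in for whatever the model generated.

```python
# Sketch: instrument execution itself instead of trusting the final answer.
# A crude line tracer that records which code paths actually ran.
import sys

executed: list[tuple[str, int]] = []


def tracer(frame, event, arg):
    if event == "line":
        executed.append((frame.f_code.co_name, frame.f_lineno))
    return tracer  # keep tracing inside this frame


def analysis(xs):  # stand-in for model-generated code
    total = sum(xs)
    return total / len(xs)


sys.settrace(tracer)
result = analysis([1, 2, 3])
sys.settrace(None)

# The trace, not just `result`, is what gets evaluated: a perfect-looking
# answer that skipped expected paths is a process failure.
print(result, executed)
```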
Trust Moves Down the Stack
Before execution, trust mostly lived at the model layer.
Do I trust this model to answer well?
After execution, trust moves downward.
Do I trust this environment to be isolated?
Do I trust state to reset between runs?
Do I trust that resource limits actually hold?
These are infrastructure questions. And they are far harder to fix retroactively than prompts or models.
In systems that execute code, the model is often the least risky component.
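That trust can be tested instead of assumed. A small POSIX-only sketch that checks whether a memory limit on a child process actually holds:

```python
# Sketch: trust in infrastructure should be tested, not assumed.
# Verifies that a memory limit on a child process actually holds (POSIX).
import resource
import subprocess
import sys

HOG = "x = bytearray(1024 * 1024 * 1024)"  # tries to grab ~1 GiB


def limit_memory():
    # Runs in the child before exec: cap the address space at 256 MiB.
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))


proc = subprocess.run(
    [sys.executable, "-c", HOG], preexec_fn=limit_memory, capture_output=True
)
assert proc.returncode != 0, "limit did not hold: allocation succeeded"
print("memory limit enforced, child failed as expected")
```

Tests like this belong in CI next to everything else, because a limit that silently stopped holding is exactly the failure nobody notices until it matters.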
Where Execution Actually Belongs
Despite the risks, execution is essential for certain problems.
Data analysis.
Verification.
Simulation.
Transformation.
These tasks benefit enormously from being computed instead of narrated.
The key insight for me was this.
The model decides what to attempt.
The system decides what is allowed.
That boundary must be explicit, observable, and testable.
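A minimal sketch of that boundary as code. The action names and allowlists are illustrative, but the shape matters: every decision is explicit and returns a reason you can log and test.

```python
# Sketch: the model proposes, the system disposes. An explicit, testable
# policy gate between generated intent and actual execution.
ALLOWED_MODULES = {"math", "json", "statistics"}
ALLOWED_ACTIONS = {"run_python", "read_file"}


def authorize(action: str, payload: dict) -> tuple[bool, str]:
    """Every decision returns a reason, so the boundary is observable."""
    if action not in ALLOWED_ACTIONS:
        return False, f"action {action!r} is not allowed"
    denied = set(payload.get("imports", [])) - ALLOWED_MODULES
    if denied:
        return False, f"disallowed imports: {sorted(denied)}"
    return True, "ok"


# The model decided what to attempt; the system decides whether it runs.
ok, reason = authorize("run_python", {"imports": ["math", "socket"]})
assert not ok and "socket" in reason
```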
Closing Thought
Allowing models to execute code feels like giving them power.
In reality, it forces engineers to take responsibility.
Responsibility for state.
Responsibility for determinism.
Responsibility for failures that cannot be explained away with better wording.
Code execution is not an AI feature.
It is a systems commitment.
Once you make it, outputs stop being just text and start being accountable.