Building a Loop Engine

The coding agent still runs the familiar inner ReAct loop. Atomic adds the outer loop: explicit stages, artifacts, gates, observability, budgets, and a real stopping condition.

Loops and workflows have generated a lot of buzz over the past few days. Many people have chimed in to say they’ve been doing this all along, but without sharing the code. I’m going to share what a loop is, how I build one, and the actual code so you can build your own.

I’ll share how I built an engine to power these loops or workflows. I use these terms interchangeably because they are quite similar. I’ll also document my path to this architecture and why I built it the way I did.

As with any writing, it’s important to know the author’s background so you can decide how to weigh their observations and thoughts. I’m an AI researcher at Microsoft Research, and over the past couple of years I’ve built and worked on coding agents and developer tools for MSR and Windows, solving the challenges of using coding agents inside a codebase of 1 billion+ LoC. Prior to that, I worked on uncertainty estimation at an MIT startup, and I have a research background in 3D vision and world models. Outside of my job, I’m an open source builder on Atomic, where I’ve spent many hours building and refining coding agents. Although my profile may read as quite AGI-pilled, more often than not I’m the one exclaiming how incredibly stupid these models are. So I hope I come across as at least somewhat level-headed, but you can decide that. Okay, with that self-serving bit out of the way, let’s get into the world of loops and workflows.

There’s been good work on properly defining what a loop is, but how do you actually build one, and how do you build it so that it is reliable and doesn’t burn tokens? Matt Van Horn had a good writeup on the history of the concept, how developers are employing it, and why there is still such a major gap between the hype and real-world deployments, namely that you need a “production” version of a loop:

Which is why every serious 2026 write-up on loops converges on the same three hard stops: a maximum iteration count, no-progress detection, and a token or dollar budget ceiling. The romantic version of loops is that you write the loops and a thousand agents build your company overnight. The production version is that you write the loops, and most of your job is making sure they halt. Gartner puts agentic AI at the peak of inflated expectations, with only about seventeen percent of organizations actually deploying agents. The gap between the timeline and the receipts is the real state of play.

Additionally, Dex Horthy clearly pointed out that if you start looping through everything without a care, you end up with slop and a codebase you no longer understand:

Here’s what’s gonna happen:

you replace your code review with feedback loops (sentry, datadog, support tickets, etc)

you stop reading the code

software factory fixes everything

one day something breaks at 3am, agent can’t fix it

nobody’s read the code in 3 months

you have 3 weeks of downtime trying to re-onboard and fix it

you lose significant % of your contracts and users

your company is now dead.

I’ve been trying to figure this out for months because I noticed coding agents didn’t scale to work that was complex and ambiguous. I also disagree with the sentiment that all you need is the model, left to delegate its own orchestration. Real production code needs management across dependencies and teams, and I don’t have infinite tokens to burn.

I started with a harness that orchestrated multiple agents into workflows powered by the Opencode, Claude Code, and GitHub Copilot CLI SDKs. This approach didn’t scale well: the SDKs all had slightly different interfaces, each limited what I could control in its own way, and, because these teams ship fast, new releases frequently introduced bugs. I then learned about Pi and started exploring it as a minimal coding agent implementation. Pi is great because it is a simple harness that is still fully extensible. It is a well-maintained project, and a builder can leverage its large ecosystem of extensions, tools, and MCPs. I could use Pi as the runtime, with control over its functionality and security, and focus on building a powerful engine for the coding agent to execute autonomously rather than on supporting multiple interfaces, squashing bugs, and managing three separate backends. The migration to Pi was an entire rewrite of the system, but well worth it.

The next challenges I ran into were around traditional orchestration, reliability, and observability. The coding agent is an incredibly powerful primitive, but it was clear that it needed proper scaffolding with observability to work. From working with coding agents, I knew that you need to be able to pass the proper context, steer, and have an easy way to observe what is going on so you can interrupt if things go wrong.

I also knew that a lot of coding-agent work felt reactive instead of proactive. The agent would get just close enough to what I wanted, then fall apart in the last 30% of the work or, worse, go completely off track. I also knew that many developers still spend most of their effort manually prompting, writing markdown programs, or running the same skills over and over. Given all of this, it made sense to look for a way to automate the process.

I spent many hours thinking about problems around testing, intent alignment, and the software development lifecycle, and speaking to developers about what they saw, experienced, and, very importantly, felt. This turned into the features that Atomic includes to solve these challenges: review gates, support for parallel execution, human-in-the-loop gates, pause/resume capability for any stage in a loop, mid-run steering, verbatim compaction (not what you see in coding agents today), and the ability to define the workflow with any model you want so you can measure and manage your costs. Honestly, each part of Atomic’s system has been specifically tuned for long-running coding-agent loops and deserves a blog post of its own on its architecture and how it was conceived. Let me know if you’re interested in that, by the way.

This is not trivial to construct. It’s taken hundreds of hours of studying coding-agent implementations, feedback from developers using it in repos that are 10M+ LoC, and re-architecting the end-to-end system from scratch at least three times. There is a reason Claude Code’s dynamic workflows don’t have any of these features: frankly, they are hard to build, and you have to go back to software engineering principles. Sorry, but the model won’t save you here. There is also a reason (I’m guessing) that Peter, who shared the “loops heard round the world” tweet, didn’t share his setup. I imagine that he has heavily optimized the OpenClaw ecosystem, and it would require serious reworking to work generally on all codebase shapes. So we need a solution that has good DX and works across different codebases and teams.

So we built it ourselves. We called it a workflow engine, but you can call it a loop, a loop engine, or workflows; it doesn’t matter. The point is how it works and how easy it is to build your own.

The inner loop is the traditional ReAct loop, where a model executes until it finishes making tool calls. In contrast, the outer loop organizes more complex, long-horizon tasks like software engineering into atomic units (like GH issues) and delegates each to a separate instance of an agent. You can take it a step further and ensure your inner loop is consistent and verifiable too. This is where static and dynamic workflows come in. That happens not through simple prompts, but by giving the model a workflow meta-tool in the inner loop: the ability to define its own subroutines through subagent chaining and tool calling.

The technical shape

Atomic treats the coding agent as the inner loop and the workflow runtime as the outer loop.

The inner loop is still the normal agent loop: model reads context, calls tools, observes results, repeats until it produces an answer. Atomic does not replace that. The workflow engine wraps it with a typed, observable, resumable execution layer that decides which agent session runs, with what context, on what model, with which tools, in what order, and with what validation or human gate before continuing.
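To make the split concrete, here is a minimal sketch of the two loops in plain TypeScript. It is not Atomic’s or Pi’s implementation: ModelTurn, Model, Tools, Stage, innerLoop, and outerLoop are hypothetical names invented for illustration, and the model is a synchronous stub so the control flow stays visible.

```typescript
// Hypothetical types for illustration; none of these come from Atomic or Pi.
type ModelTurn =
  | { kind: "tool"; tool: string; input: string }
  | { kind: "answer"; text: string };
type Model = (transcript: string[]) => ModelTurn;
type Tools = Record<string, (input: string) => string>;

// Inner loop: call the model, run the requested tool, record the observation,
// and repeat until the model answers or the turn cap is hit.
function innerLoop(model: Model, tools: Tools, task: string, maxTurns = 20): string {
  const transcript: string[] = [`task: ${task}`];
  for (let turn = 0; turn < maxTurns; turn++) {
    const next = model(transcript);
    if (next.kind === "answer") return next.text;
    const tool = tools[next.tool];
    const observation = tool ? tool(next.input) : `unknown tool: ${next.tool}`;
    transcript.push(`call: ${next.tool}(${next.input})`, `observation: ${observation}`);
  }
  throw new Error(`inner loop exceeded ${maxTurns} turns`);
}

// Outer loop: the runtime, not the model, decides which stage runs next and
// what each stage receives from the one before it.
type Stage = { name: string; buildTask: (previous: string) => string };

function outerLoop(stages: Stage[], runStage: (task: string) => string): Map<string, string> {
  const outputs = new Map<string, string>();
  let previous = "";
  for (const stage of stages) {
    previous = runStage(stage.buildTask(previous));
    outputs.set(stage.name, previous);
  }
  return outputs;
}
```

In this sketch, runStage would typically wrap a call to innerLoop with a fresh session per stage, so each stage stays an ordinary agent run while the sequencing lives in code.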

A workflow in Atomic is a TypeScript module, not just a prompt. See the appendix for what the public API looks like.

The important part is the boundary it creates. Every ctx.task, ctx.chain, ctx.parallel, ctx.stage, ctx.workflow, and ctx.ui.* call creates explicit runtime structure. Atomic can see the graph, persist the state, attach to a stage, pause it, resume it, steer it, kill it, inspect transcripts, and preserve artifacts. That is very different from asking one model to “please do the following steps.”

Atomic supports both static and dynamic workflows.

A static workflow is the versioned TypeScript definition of a workflow. It has declared inputs, declared outputs, stage names, model choices, fallback chains, concurrency limits, human gates, worktree options, and artifact paths. You can commit it to .atomic/workflows, ship it through an Atomic package, or bundle it into the product.

A dynamic workflow is when the agent uses the workflow tool at runtime to create a tracked one-off task, chain, or parallel fan-out without a saved workflow file. That gives the model a workflow meta-tool inside its normal ReAct loop. Instead of merely saying “I will ask three agents,” it can actually spawn three tracked stage sessions, give each one clean context, collect their outputs, and synthesize them. If the pattern proves useful, you can promote it into a real TypeScript workflow.

The execution model is a DAG, but the developer does not need to manually draw the DAG. Atomic infers it from runtime control flow. Sequential awaits become dependent stages. ctx.parallel or Promise.all creates concurrent branches. Loops can create repeated stage groups. Child workflows called with ctx.workflow(...) are nested under the parent and shown in the same expanded graph. This matters because the graph is not just a visualization; it becomes the control plane.
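One way to picture this inference is a recorder that tracks a “frontier” of stages the next call must wait on. The sketch below is an assumption about the general technique, not Atomic’s code: GraphRecorder and its methods are hypothetical, and it models awaits synchronously for brevity.

```typescript
// Hypothetical recorder for illustration; this is not Atomic's implementation.
type StageNode = { name: string; dependsOn: string[] };

class GraphRecorder {
  private nodes: StageNode[] = [];
  // Stages that must finish before whatever is recorded next can start.
  private frontier: string[] = [];

  // A sequentially awaited stage depends on everything in the current frontier.
  stage(name: string): void {
    this.nodes.push({ name, dependsOn: [...this.frontier] });
    this.frontier = [name];
  }

  // Parallel branches share the same parents and jointly become the new frontier.
  parallel(names: string[]): void {
    if (names.length === 0) return; // nothing ran, so the frontier is unchanged
    const parents = [...this.frontier];
    for (const name of names) {
      this.nodes.push({ name, dependsOn: [...parents] });
    }
    this.frontier = [...names];
  }

  // Return a copy so callers cannot mutate the recorded graph.
  graph(): StageNode[] {
    return this.nodes.map((n) => ({ name: n.name, dependsOn: [...n.dependsOn] }));
  }
}

// Mirrors the shape of a scout -> two reviewers -> synthesis workflow.
const recorder = new GraphRecorder();
recorder.stage("scout");
recorder.parallel(["correctness-review", "maintainability-review"]);
recorder.stage("synthesis");
console.log(recorder.graph());
```

Here both reviewers depend only on scout, and synthesis depends on both reviewers, without the workflow author ever declaring an edge.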

The runtime tracks:

stage name and status
input/output contracts
session file and transcript path
model and reasoning effort
fallback model attempts
errors and warnings
artifacts and output files
live pause/resume/interrupt handles
pending human input
nested workflow boundaries

The core design principle is that large context moves through files and artifacts, not through the model prompt. Small handoffs can use previous or {previous}. Large handoffs should be written to files with outputMode: "file-only" and passed forward with reads. This avoids the common failure mode where every stage inherits the full transcript of every prior stage and token usage explodes.
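The handoff rule can be sketched as a small decision function. Everything here is hypothetical and only illustrates the idea: the Handoff type, makeHandoff, buildPrompt, the 2,000-character threshold, and the artifacts/ path are assumptions, and an in-memory Map stands in for the file system.

```typescript
// Hypothetical names throughout; the Map stands in for the run's artifact directory.
type Handoff =
  | { kind: "inline"; text: string }
  | { kind: "file"; path: string };

const INLINE_LIMIT_CHARS = 2000; // assumed threshold, tune per workflow

// Decide how a stage's output travels to the next stage.
function makeHandoff(stage: string, output: string, files: Map<string, string>): Handoff {
  if (output.length <= INLINE_LIMIT_CHARS) {
    return { kind: "inline", text: output };
  }
  const path = `artifacts/${stage}.md`;
  files.set(path, output); // a real runtime would write to disk here
  return { kind: "file", path };
}

// Build the next stage's prompt: inline text is embedded, files are referenced by path.
function buildPrompt(instruction: string, handoff: Handoff): string {
  if (handoff.kind === "inline") {
    return `${instruction}\n\nPrevious output:\n${handoff.text}`;
  }
  return `${instruction}\n\nRead ${handoff.path} before answering.`;
}
```

With this shape, the size of the next prompt is bounded by the instruction plus a path, no matter how large the upstream artifact grows.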

Atomic also separates context modes. Implementation stages can use forked context when continuity matters. Reviewer stages should usually use fresh context so they are not biased by the implementation agent’s reasoning. That distinction is one of the simplest ways to make review loops more reliable: the reviewer reads the diff, artifacts, tests, and criteria, not the implementer’s own reasoning.
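As a rough illustration of the difference, the sketch below builds the starting message list for a stage under each mode. The Message type and buildStageContext function are hypothetical and say nothing about how Atomic stores sessions; the point is only that a forked stage inherits the parent transcript while a fresh stage sees just its instruction and the files it is told to read.

```typescript
// Hypothetical types for illustration only.
type Message = { role: "user" | "assistant"; content: string };
type ContextMode = "fresh" | "forked";

function buildStageContext(
  mode: ContextMode,
  parentTranscript: Message[],
  instruction: string,
  reads: string[],
): Message[] {
  // Forked: copy the parent's history so the stage keeps continuity.
  // Fresh: start empty so the stage is not anchored on earlier reasoning.
  const history = mode === "forked" ? [...parentTranscript] : [];
  const fileList = reads.length > 0
    ? `\n\nRead these files first:\n${reads.map((p) => `- ${p}`).join("\n")}`
    : "";
  return [...history, { role: "user", content: instruction + fileList }];
}
```

A reviewer built with "fresh" and reads pointing at the diff and test output sees the evidence, while the implementer’s chain of reasoning never enters its context.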

Human input is also part of the runtime. A workflow can call ctx.ui.input, ctx.ui.confirm, ctx.ui.select, or ctx.ui.editor at the exact point where a decision is needed. The run enters an awaiting-input state, the prompt appears in the workflow UI, and the answer is routed back to the correct stage. This is how you build approval gates, review gates, release gates, and “stop before destructive action” behavior without relying on the model to remember a markdown instruction.
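A gate like this can be modeled as a pending promise that only the UI can resolve. The sketch below is a simplified assumption, not the ctx.ui implementation: HumanGate, releaseStage, and the stage name are hypothetical, and only a confirm-style prompt is shown.

```typescript
// Hypothetical gate for illustration; this is not how ctx.ui is implemented.
type RunStatus = "running" | "awaiting-input";
type Pending = { question: string; resolve: (approved: boolean) => void };

class HumanGate {
  private pending = new Map<string, Pending>();

  // Called from a stage: parks the stage until someone answers.
  confirm(stage: string, question: string): Promise<boolean> {
    return new Promise<boolean>((resolve) => {
      this.pending.set(stage, { question, resolve });
    });
  }

  status(): RunStatus {
    return this.pending.size > 0 ? "awaiting-input" : "running";
  }

  // What a workflow UI would render as open prompts.
  prompts(): { stage: string; question: string }[] {
    return Array.from(this.pending.entries()).map(([stage, p]) => ({
      stage,
      question: p.question,
    }));
  }

  // Called from the UI: routes the answer to the stage that asked.
  answer(stage: string, approved: boolean): void {
    const entry = this.pending.get(stage);
    if (!entry) throw new Error(`no pending prompt for stage "${stage}"`);
    this.pending.delete(stage);
    entry.resolve(approved);
  }
}

// The destructive action sits behind the gate in code, not in a prompt.
async function releaseStage(gate: HumanGate, deploy: () => void): Promise<string> {
  const approved = await gate.confirm("release", "Deploy to production?");
  if (!approved) return "skipped";
  deploy();
  return "deployed";
}
```

Because deploy() is only reachable after the awaited answer, the model cannot skip the approval by forgetting an instruction.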

Reliability comes from making the outer loop explicit. You can set a max iteration count. You can require structured reviewer outputs. You can run independent reviewers and reduce their decisions deterministically. You can fail closed when declared outputs do not validate. You can retry provider failures with fallbackModels. You can isolate work in git worktrees. You can cap output size. You can pause or interrupt a runaway stage and resume it with a steering message. These are boring software engineering controls, but they are exactly what makes long-running agent work survivable.
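Several of these controls fit in a few dozen lines. The sketch below combines a max iteration count, a token budget, no-progress detection, structured reviewer output that fails closed, and a deterministic unanimous-approval rule. All names, the JSON shape, and the unanimity rule are assumptions chosen for illustration, not Atomic’s API or defaults.

```typescript
// Hypothetical review loop; names and rules are illustrative assumptions.
type Verdict = "approve" | "request_changes";
type ReviewerOutput = { reviewer: string; verdict: Verdict; findings: string[] };

// Fail closed: anything that is not a well-formed reviewer output counts as a rejection.
function parseReviewerOutput(reviewer: string, raw: string): ReviewerOutput {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data === "object" && data !== null) {
      const { verdict, findings } = data as { verdict?: unknown; findings?: unknown };
      if (
        (verdict === "approve" || verdict === "request_changes") &&
        Array.isArray(findings) &&
        findings.every((f) => typeof f === "string")
      ) {
        return { reviewer, verdict, findings: findings as string[] };
      }
    }
  } catch {
    // Invalid JSON falls through to the rejection below.
  }
  return { reviewer, verdict: "request_changes", findings: ["unparseable reviewer output"] };
}

// Deterministic reduction: approval must be unanimous, and zero reviewers is not approval.
function reduceVerdicts(outputs: ReviewerOutput[]): Verdict {
  const unanimous = outputs.length > 0 && outputs.every((o) => o.verdict === "approve");
  return unanimous ? "approve" : "request_changes";
}

type LoopStatus = "approved" | "max_iterations" | "no_progress" | "over_budget";

function runReviewLoop(opts: {
  maxIterations: number;
  budgetTokens: number;
  implement: (iteration: number) => { diff: string; tokens: number };
  review: (diff: string) => string[]; // one raw output per reviewer
}): { status: LoopStatus; iterations: number } {
  let spent = 0;
  let lastDiff: string | undefined;
  for (let i = 1; i <= opts.maxIterations; i++) {
    const { diff, tokens } = opts.implement(i);
    spent += tokens;
    if (spent > opts.budgetTokens) return { status: "over_budget", iterations: i };
    if (diff === lastDiff) return { status: "no_progress", iterations: i };
    lastDiff = diff;
    const outputs = opts.review(diff).map((raw, n) => parseReviewerOutput(`reviewer-${n + 1}`, raw));
    if (reduceVerdicts(outputs) === "approve") return { status: "approved", iterations: i };
  }
  return { status: "max_iterations", iterations: opts.maxIterations };
}
```

Every exit path returns a named status, so a halted run explains why it stopped instead of silently burning tokens. Comparing diffs for equality is a deliberately crude progress check; a real loop might compare failing test counts instead.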

Model choice and cost are first-class concerns too. A workflow can use a cheap model for classification, a stronger model for implementation, a different model for review, and a fallback chain for critical stages. Model strings can include reasoning effort, e.g. openai/gpt-5.5:high or anthropic/claude-haiku-4-5:off, so cost and latency are controlled per stage instead of globally. Parallelism is explicit too, so a broad fan-out is a deliberate choice rather than the model deciding a path and burning tokens.
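The model strings above suggest a provider/model:effort shape. The parser below is a guess at that format inferred only from the two examples; ModelSpec, parseModelString, and modelChain are hypothetical names, and Atomic’s real parsing rules may differ.

```typescript
// Assumed format: "provider/model" with an optional ":effort" suffix.
type ModelSpec = { provider: string; model: string; effort?: string };

function parseModelString(spec: string): ModelSpec {
  const slash = spec.indexOf("/");
  if (slash <= 0) {
    throw new Error(`expected "provider/model[:effort]", got "${spec}"`);
  }
  const provider = spec.slice(0, slash);
  const rest = spec.slice(slash + 1);
  const colon = rest.lastIndexOf(":");
  const model = colon === -1 ? rest : rest.slice(0, colon);
  if (model === "") throw new Error(`missing model name in "${spec}"`);
  if (colon === -1) return { provider, model };
  const effort = rest.slice(colon + 1);
  if (effort === "") throw new Error(`empty reasoning effort in "${spec}"`);
  return { provider, model, effort };
}

// Primary model first, then fallbacks in the order they should be tried.
function modelChain(primary: string, fallbackModels: string[] = []): ModelSpec[] {
  return [primary, ...fallbackModels].map(parseModelString);
}
```

Parsing the whole chain up front means a typo in a fallback model fails when the workflow loads rather than in the middle of a long run.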

Observability ties everything together. Atomic gives you /workflow status, /workflow connect, /workflow attach, /workflow pause, /workflow interrupt, /workflow resume, and /workflow kill. You can inspect the graph, attach to a single stage, send a steering message, answer a pending prompt, or resume paused work. The system keeps run history and terminal state around for inspection. A failed workflow is not just a giant lost chat transcript. It is a run with named stages, artifacts, errors, and receipts that you can analyze, measure, and improve.

So the architecture is:

Use TypeScript to define the outer loop.
Use separate agent sessions as atomic stage units.
Use fresh or forked context intentionally.
Use artifacts for large handoffs.
Use typed inputs and outputs as contracts.
Use parallel branches where independence is real.
Use review/human gates where correctness matters.
Use model selection and fallback per stage.
Persist everything needed to inspect, resume, debug, and measure ROI.
Let the model operate inside the loop, but do not let it be the loop.

Why not just skills?

Skills are useful, but they are instruction bundles. A skill can teach the agent how to do something: how to review code, how to write tests, how to use Bun, how to follow a release process. But a skill does not give you durable execution state. It does not create a graph. It does not create independent sessions. It does not enforce typed inputs or outputs. It does not give you concurrency, pause/resume, stage-level model choices, human gates, artifacts, fallback models, or run inspection.

In Atomic, skills and workflows are complementary. A workflow decides when and where work happens. A skill improves how a specific stage performs its work. For example, a review stage inside a workflow can invoke a code-review skill. But the skill itself is not the orchestration engine.

Why not just markdown?

Markdown is good for instructions, specs, and documentation. It is bad as a runtime. A markdown checklist cannot validate input schemas. It cannot schedule parallel branches. It cannot persist stage state. It cannot enforce output contracts. It cannot pause a running model call and resume it later. It cannot route a human approval answer to the correct stage. It cannot automatically keep large artifacts out of model context. It cannot expose a graph of what is happening.

Most markdown “workflows” are really prompts asking the model to simulate a workflow. That works for short tasks, but it degrades as soon as the task becomes long, ambiguous, or failure-prone. The model forgets steps, over-compresses context, repeats itself, hides uncertainty, or claims completion too early. Markdown can describe the process, but something else needs to execute the process.

Why not any other harness?

I’ve seen plenty of lightweight hook examples. They are not sufficient.

You can build this on top of any harness, but you would have to own the provider adapters, tool calling, sessions, transcript persistence, UI, MCP, extension loading, model configuration, auth, permissions, package discovery, and runtime controls. I tried the route of orchestrating several external CLIs and SDKs. As I described above, it doesn’t work well because every backend has slightly different semantics, failure modes, streaming behavior, context handling, and bugs.

Atomic builds on Pi because Pi is already a small, extensible coding-agent harness. That means the workflow engine can focus on orchestration instead of the agent runtime. Atomic gets the extension ecosystem, MCP/tools, model/provider plumbing, TUI surfaces, sessions, and package system, then adds the missing outer loop: typed workflow definitions, tracked stages, graph execution, human gates, artifacts, live control, and resumability.

The reason this matters is that production agent work is not just “call a better model.” It is systems engineering. You need boundaries, contracts, state, observability, failure handling, and cost controls. The model is powerful, but the model should be inside a system that constrains and verifies it. Atomic is my attempt to make that system open, inspectable, and easy to modify.

This is not necessary for every feature or bug fix. It’s most useful for large, ambiguous, long-running tasks where the difficulty is in managing the process, validation, and context. I also recorded myself taking a GitHub issue through to a PR, so you can review a canonical example of how to use it.

It’s not perfect, but it’s a shift toward less prompting and more designing the process for the agent to coordinate, validate, and reduce cognitive load. This is especially important for messy, large codebases. Atomic (bastani-inc) has a public implementation of this style of architecture. You can ask your coding agent to inspect and explain it.

There is no right way to build a scaffold or a system, but there are principles that are key to making sure that it does scale inside of any codebase shape. I would suggest that each person try it out on a scenario where a coding agent has failed you in the past, whether that be because your context was too large or because the agent misunderstood your intent and generated slop. Don’t take my word for it. Try it and see if this method can make it better.

Appendix

import { workflow } from "@bastani/workflows";
import { Type } from "typebox";

export default workflow({
  name: "review-change",
  description: "Research, review, and synthesize a change",
  inputs: {
    target: Type.String({ description: "Diff, PR, issue, or task" }),
  },
  outputs: {
    result: Type.String(),
  },
  run: async (ctx) => {
    // Large handoff: the scout writes its findings to a file that later stages load via `reads`.
    const scoutPath = ".atomic/workflows/runs/review-change/scout.md";

    await ctx.task("scout", {
      prompt: `Map the relevant context for: ${ctx.inputs.target}`,
      context: "fresh",
      output: scoutPath,
      outputMode: "file-only",
    });

    const reviews = await ctx.parallel(
      [
        {
          name: "correctness-review",
          prompt: `Read ${scoutPath} and review correctness, regressions, and tests.`,
          reads: [scoutPath],
          context: "fresh",
          model: "openai/gpt-5.5:high",
        },
        {
          name: "maintainability-review",
          prompt: `Read ${scoutPath} and review maintainability and edge cases.`,
          reads: [scoutPath],
          context: "fresh",
          model: "anthropic/claude-sonnet-4:high",
        },
      ],
      { concurrency: 2 },
    );

    const final = await ctx.task("synthesis", {
      prompt: [
        "Synthesize the reviewer findings.",
        "Keep only evidence-backed issues.",
        "Separate blockers from optional suggestions.",
      ].join("\n"),
      // Small handoff: the reviewers' text is passed inline via `previous`.
      previous: reviews.map((r) => r.text).join("\n\n"),
    });

    return { result: final.text };
  },
});
