Thariq Shihipar at Claude Code has been making the case that HTML beats markdown for agent output. We agree. The DeepSeek-OCR screenshot trick is what makes the token cost honest enough to do this at every plan stage.
Geoffrey Huntley's Ralph primitive opened up a rich ecosystem of looped coding agents. Atomic's built-in Ralph builds on that lineage with RFC-driven planning, schema-enforced dual review, deterministic file-grouped finding clusters, and a captured branch changeset injected into both reviewers — designed for unattended long-running work where every step needs to be inspectable after the fact.
The benchmarks worth caring about measure the GDP a coding agent can generate, not how much of SQLite it can recall. ProgramBench gets one key thing right while other design choices deserve scrutiny. Here's how to combine its best idea with real outcome measurement to build the benchmark we actually need.
DeepSeek V4 dropped on April 24 with a 1.6T-parameter open-weights model that costs roughly 1/7 of Claude Opus 4.7 and posts coding numbers in the same neighborhood. Here's where it actually wins, what engineers using it in real work are reporting, and what's genuinely new under the hood.
Anthropic published a detailed postmortem on three Claude Code regressions between March and April. The document is admirable — it's also a mirror. Every team shipping AI-assisted software right now is operating under the same conditions, and software has never been more vulnerable because of it.
OpenAI shipped GPT-5.5 today, exactly one week after Claude Opus 4.7. It leads on Terminal-Bench 2.0 and hard math, trails Opus 4.7 on SWE-Bench Pro and tool use, and doubles the API price. Here's what the benchmarks actually say, how the model was built, and what early users are reporting.
Coding agents are brilliant inside a session and fragile outside it. Atomic is an open-source TypeScript SDK that wraps deterministic workflows around Claude Code, Copilot CLI, or opencode — without reimplementing their harness. Here's the gap it closes and why a general agent framework isn't the answer.
Anthropic shipped Claude Design on April 17. Three days later we shipped an open-source replica built as an Atomic workflow — 5 phases, the same pipeline ported across three coding agents (Claude, Copilot CLI, opencode). Here's what that reveals about building thin harnesses around coding agents.
Anthropic shipped Opus 4.7 today. The headline is SWE-Bench Verified at 87.6%, but the real story for software engineers is what changed in how the model behaves on long-running, autonomous work — loop resistance, self-verification, and finer-grained control over reasoning effort and token budget.
Compression breakthroughs, edge-optimized model releases, and local runtime optimization are converging fast. Edge AI is emerging as a distinct — and in many scenarios, preferred — layer in the coding agent stack.
Research breakthroughs in attention optimization, community-built memory tools, and production harness architectures are converging on the same problem. Understanding how these layers connect is essential for anyone building with or for coding agents.
AI coding agents produce statistically average UIs by default. After months of iteration, I found a workflow that actually produces distinctive, polished interfaces: a design sandbox for rapid prototyping, structured design skills like Impeccable, and machine-readable design language definitions.
Marc Andreessen argues that society's messiness — not model capabilities — is where the real building opportunity lies. We break down his Latent Space podcast thesis and connect it to Claude Mythos, the Unix mindset for agents, and what it all means for software engineers.
Stanford's Meta-Harness system uses coding agents to automatically optimize the scaffolding around LLMs — achieving state-of-the-art results across coding, math, and classification tasks. The secret: giving the optimizer raw execution traces instead of compressed feedback.
After extensive work with coding agents (primarily Copilot CLI and the SDK) alongside Codex 5.4 and Opus/Sonnet 4.6, three patterns emerged: the effective context window is far smaller than advertised, sub-agents are essential for long-horizon work, and Skills + CLIs beat MCP servers for context control.
Dex Horthy's Research-Plan-Implement was the first widely adopted structured workflow for AI coding agents. Three failure modes forced a complete redesign into QRSPI. Here's what changed, what practitioners are validating, and what it reveals about where AI-assisted development is headed.
OpenAI acquired Astral. Anthropic acquired Bun. Both companies are buying the execution layer underneath their coding agents — the runtimes, package managers, linters, and test runners that determine whether an agent can actually ship code.
An honest assessment after going deep with Factory's Droids, Devin, Claude Code, Cursor, Windsurf, Codex, and a dozen more. The features blend together, the cycle of hope and disappointment repeats, and the thing that's actually missing isn't another desktop app or a better model.
Jensen Huang projects $1 trillion in compute demand through 2027. But 84% of developers use AI coding tools and almost nobody is measuring whether they work. Two companies are building the observability infrastructure that separates teams that improve from teams that just spend.
Most teams overuse subagents when skills are the better primitive. The architectural case for progressive context disclosure, automatic project scoping, and portable expertise across 30+ AI coding tools.
AI coding agents burn through hundreds of thousands of tokens grepping files and hallucinating APIs. A new class of context infrastructure tools is emerging to fix both problems — for your codebase and for external libraries.
A developer reimplemented SQLite in Rust with LLMs — 576,000 lines that compiled, passed tests, and ran 20,171x slower than the real thing. The bugs weren't syntactic. They were semantic. Here's why architecture, specs, test-driven contracts, and targeted review are the fix.
GPT-5.4's coding benchmarks barely moved. But computer use jumped from 47% to 75%, tool search cuts MCP token usage by 47%, and knowledge work hit 83% across 44 professions. Here's what actually matters for developers.
Four industry leaders independently converged on the same conclusion: engineering discipline is the competitive moat when building with AI agents. Here's the day-one infrastructure that makes agent-generated code reliable.
A technical deep dive into harness engineering — the converging discipline across OpenAI, Anthropic, and independent practitioners that makes coding agents reliable on complex work.
A technical deep dive into the isolated VM infrastructure that lets AI coding agents operate for hours without human intervention — from Cursor's cloud agents and Firecracker microVMs to snapshot bootstrapping, computer use, and secrets management.
The biggest constraint in multi-agent development isn't model capability. It's that nobody's built the orchestration, window management, and resource isolation layers end to end. A technical deep dive into what each tool does architecturally, where it breaks, and what the missing product looks like.
The discourse assumes juniors need protection from AI tools. They don't. They need trust, a disciplined workflow, and room to build capability on their own terms.
Karpathy just named the layer most engineers are missing: Claws. Here's the data behind it, and how to start building it today.
Google just reclaimed #1 on SWE-Bench Verified with Gemini 3.1 Pro. But Codex still leads terminal work, and Claude still leads real-world preference. Here's what's technically different about each model—and what engineers are actually experiencing.
Coding is practically solved. The engineer's job is shifting from writing code to designing systems, writing specs, and orchestrating agents. Here's what the new software development lifecycle looks like and how to adopt it today.
Sonnet 4.6 scores within 1.2 points of Opus 4.6 on SWE-bench at roughly 60% of the cost. We break down the benchmarks, architecture changes, pricing math, developer reactions, and what it means for your agentic workflows.
Google DeepMind's new paper formalizes delegation as more than task decomposition — it's a transfer of authority, accountability, and trust. Here's what that means for how we build coding agents, with concrete patterns you can apply today.
OpenAI's Codex Spark trades intelligence for speed at 1,000+ tokens/sec on Cerebras hardware. The real story isn't the model—it's the infrastructure overhaul and the emerging split between speed mode and depth mode in coding agents.
GLM-5 hit 77.8% on SWE-bench Verified under an MIT license. The benchmark gap between open and closed models is closing fast. Here's what that means for how you architect your coding agent infrastructure—and what to do about it.
OpenAI shipped a million lines of code with zero human-written code. The engineering patterns they discovered—progressive disclosure, layered architecture, feedback loops—are patterns you can adopt today. Here's a practical breakdown.
Cursor ran thousands of agents to build a browser. Anthropic ran 16 to build a C compiler. Both independently converged on the same five design patterns. Here's the technical breakdown of why, and how you can apply them.
Factory's Signals system auto-resolves 73% of agent issues in under 4 hours using LLM judges, friction telemetry, and a closed-loop pipeline. Here's how it works and how you can adopt similar patterns in your own agent infrastructure.
Four major AI releases dropped within 24 hours. Here's a technical deep dive into Opus 4.6, GPT-5.3 Codex, Claude Code's agent teams, and Copilot CLI's Fleet Mode—and how to start using them effectively.
I spent a week exploring OpenAI's new Codex macOS app. Here's what I learned about its orchestration-first approach, how it differs from the Claude workflow I've grown attached to, and whether it's worth adding to your toolkit.
A practical guide to wiring AI coding agents into your CI/CD pipeline with GitHub Actions. Includes working configurations for Copilot Autofix, OpenAI Codex, and Claude Code with proper guardrails.
How hooks, skills, and tool orchestration are transforming developer infrastructure. A deep dive into Claude Code's layered stack and why the most important code you write this year won't be features.
OpenAI built a data agent serving 3.5k users across 600 petabytes. The architectural patterns that made it work are the same ones that power a 3,000-line coding agent CLI.
A technical guide to implementing procedural memory, specialized sub-agents, and autonomous Ralph loops for AI coding assistants across platforms.
Building on AI Coding Infrastructure, Atomic introduces a research-to-execution flywheel where specifications become lasting memory. Here's what we learned scaling multi-agent workflows.
Open sourcing my developer workflow with AI agents—skills, sub-agents, and autonomous execution. A 5-minute setup that provides the missing infrastructure layer for AI coding tools.
An overview of two frameworks for memory and context management to enable continuous self-learning systems.
An interactive cheat sheet covering context engineering techniques for LLMs including retrieval, processing, management, and dynamic assembly strategies.
How context engineering transforms AI-powered development tools from disappointing to transformative through smart prompting, MCP servers, and strategic tool integration.
A deep dive into the concepts of memorization, generalization, and reasoning in large language models.