Building AI Agents That Work at Any Scale

The Convergence

OpenAI recently published a deep dive on their internal data agent, a tool powered by GPT-5.2 that serves 3,500+ users querying 600 petabytes of data across 70,000 datasets. It is available through Slack, web interfaces, IDEs, and MCP connectors.

When I read through their architecture, I noticed something striking: the patterns they converged on after building at massive scale are nearly identical to the ones I arrived at building Atomic, a 3,000-line meta-framework CLI for AI coding agents.

This is not coincidence. It reveals something fundamental about how reliable AI agents need to be structured, regardless of scale.

In this post, I will break down the architectural parallels and extract the lessons that matter for anyone building AI coding tools.

OpenAI’s Six-Layer Context Architecture

OpenAI’s data agent is built around multiple layers of context that ground it in their data and institutional knowledge. Without this layered approach, even strong models produce wrong results.

Here is their architecture:

  • Layer 1: Table Usage - schema metadata and query history
  • Layer 2: Human Annotations - curated descriptions and semantics
  • Layer 3: Codex Enrichment - code-level definitions
  • Layer 4: Institutional Knowledge - Slack, Docs, and Notion content
  • Layer 5: Memory - corrections and self-improvement
  • Layer 6: Runtime Context - live queries to the data warehouse

Each layer serves a distinct purpose:

| Layer | What It Provides | Why It Matters |
| --- | --- | --- |
| Table Usage | Schema metadata, historical queries, table lineage | Tells the agent what exists and how it has been used |
| Human Annotations | Curated descriptions, business meaning, known caveats | Captures intent that cannot be inferred from schemas |
| Codex Enrichment | Code-level definitions derived from the codebase | Reveals how data is actually constructed |
| Institutional Knowledge | Slack, Docs, Notion content about launches, incidents, metric definitions | Provides organizational context beyond the data |
| Memory | Corrections, learned nuances, user feedback | Enables continuous improvement without retraining |
| Runtime Context | Live queries when existing context is stale | Handles edge cases and validates assumptions |

The key insight: each layer addresses a different failure mode. Table usage alone does not tell you business intent. Human annotations do not capture derivation logic. Memory captures corrections that would otherwise be lost.
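
To make the layering concrete, here is a minimal sketch of the idea (not OpenAI's implementation; the ContextLayer interface and token estimate are assumptions): each layer is a separately loadable source, and the agent composes only the layers a given question needs, within a token budget.

// Minimal sketch of the layering idea; names and shapes are illustrative.
interface ContextLayer {
  name: string;
  addresses: string;                          // the failure mode this layer covers
  load: (question: string) => Promise<string>;
}

async function buildContext(
  question: string,
  layers: ContextLayer[],
  tokenBudget: number
): Promise<string> {
  const parts: string[] = [];
  let spent = 0;
  for (const layer of layers) {
    const text = await layer.load(question);
    const cost = Math.ceil(text.length / 4);  // rough token estimate
    if (spent + cost > tokenBudget) break;    // finite windows force selectivity
    parts.push(`## ${layer.name}\n${text}`);
    spent += cost;
  }
  return parts.join("\n\n");
}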

Atomic’s Parallel Architecture

Building Atomic led to a remarkably similar layered structure for coding agents:

  • Layer 1: Codebase Structure - file locations and dependencies
  • Layer 2: Research Documents - curated analysis in research docs
  • Layer 3: Pattern Documentation - existing patterns from the pattern-finder agent
  • Layer 4: Specifications - architectural decisions and acceptance criteria
  • Layer 5: Progress Tracking - feature-list.json and progress.txt
  • Layer 6: Runtime Analysis - live codebase queries via sub-agents

Here is how Atomic’s layers map to OpenAI’s:

| OpenAI Layer | Atomic Equivalent | Implementation |
| --- | --- | --- |
| Table Usage | Codebase Structure | codebase-locator agent discovers file locations and dependencies |
| Human Annotations | Research Documents | research docs with curated analysis and context |
| Codex Enrichment | Pattern Documentation | codebase-pattern-finder extracts existing implementation patterns |
| Institutional Knowledge | Specifications | specs capture architectural decisions and rationale |
| Memory | Progress Tracking | feature-list.json and progress.txt maintain state across sessions |
| Runtime Context | Runtime Analysis | sub-agents query the live codebase when needed |

The parallel is not superficial. Both systems solve the same fundamental problem: grounding AI agents in context that persists beyond a single conversation.

Three Lessons from the Convergence

OpenAI’s writeup surfaces three lessons that directly mirror what I learned building Atomic at 1/200,000th the scale.

Lesson 1: Less Is More

OpenAI’s discovery: Early on, they exposed their full tool set to the agent and quickly ran into problems with overlapping functionality. While this redundancy can help in specific custom cases and is easy for a human to navigate when invoking tools manually, it is confusing to agents. To reduce ambiguity and improve reliability, they restricted and consolidated certain tool calls.

What this means in practice:

When you give an agent 50 tools with overlapping capabilities, it wastes tokens reasoning about which tool to use and often picks the wrong one. The solution is not more tools; it is fewer, more focused tools with clear boundaries.

How Atomic implements this:

Each sub-agent in Atomic has explicitly restricted tool access:

# codebase-locator agent definition
name: codebase-locator
description: Finds WHERE code lives - files, directories, modules
tools: Read, Grep, Glob
You are a specialist at locating code. Your job is to find WHERE
things exist in the codebase.
## Constraints
- DO NOT analyze code behavior
- DO NOT suggest improvements
- DO NOT evaluate code quality
- ONLY return locations with file paths
## Output Format
For each search:
- Full file path
- Line numbers where relevant
- Brief description of what is at that location

Compare this to a hypothetical "do everything" agent:

# BAD: Overlapping capabilities create confusion
name: general-purpose-agent
tools: Read, Write, Edit, Grep, Glob, Bash, WebSearch
You can analyze code, write code, search the web, run commands...

The restricted agent knows exactly what it should do. The general-purpose agent has to reason about which capability to use for every action.

The pattern for your own agents:

  • Pattern: Focused Tools - each agent owns a small, distinct tool set (roughly: Glob and Grep for a locator agent, Read for an analyzer agent, Execute, Edit, and Write for an implementer agent)
  • Anti-Pattern: Tool Sprawl - a single agent holding Tool A through Tool E, where B and C overlap A and E overlaps D

Actionable takeaway: Audit your agent’s tool access. For each tool, ask: Is there another tool that does something similar? If yes, either consolidate or create separate agents with non-overlapping tool sets.
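
A sketch of what that audit can look like, assuming agent definitions live one per file with a comma-separated tools: line as in the example above; the directory layout and parsing are assumptions, not Atomic's actual tooling.

import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Assumed layout: one agent definition per file, each containing a
// "tools: Read, Grep, Glob" style line as in the example above.
function toolsOf(definition: string): string[] {
  const match = definition.match(/^tools:\s*(.+)$/m);
  return match ? match[1].split(",").map((t) => t.trim()) : [];
}

// Flags every pair of agents that share tools so you can decide whether
// to consolidate them or tighten their tool sets.
function reportToolOverlap(agentsDir: string): void {
  const agents = readdirSync(agentsDir).map((file) => ({
    name: file,
    tools: toolsOf(readFileSync(join(agentsDir, file), "utf8")),
  }));
  for (let i = 0; i < agents.length; i++) {
    for (let j = i + 1; j < agents.length; j++) {
      const shared = agents[i].tools.filter((t) => agents[j].tools.includes(t));
      if (shared.length > 0) {
        console.log(`${agents[i].name} / ${agents[j].name} share: ${shared.join(", ")}`);
      }
    }
  }
}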

Lesson 2: Guide the Goal, Not the Path

OpenAI’s discovery: They also discovered that highly prescriptive prompting degraded results. While many questions share a general analytical shape, the details vary enough that rigid instructions often pushed the agent down incorrect paths. By shifting to higher-level guidance and relying on GPT-5’s reasoning to choose the appropriate execution path, the agent became more robust and produced better results.

What this means in practice:

Telling an agent exactly how to accomplish something assumes you know the best path. But the best path varies based on context. Over-specification leads to failure when the situation does not match your assumptions.

How Atomic implements this:

Atomic separates research from specs from implementation. The user guides the goal at each phase, but the agent owns the execution path.

  • Phase 1: Research - User: what should we build? Agent: investigates the codebase. Output: research docs
  • Phase 2: Specification - User: reviews and approves. Agent: synthesizes into a spec. Output: specs/feature-spec.md
  • Phase 3: Implementation - User: approves the spec. Agent: chooses HOW to implement. Output: working code and tests

The implement-feature command demonstrates this principle:

# implement-feature command (simplified)
Read the approved specification and feature list.
Implement the next incomplete feature.
## What You Decide (Agent Owns Execution)
- Which files to modify
- What patterns to follow (based on existing codebase)
- How to structure the implementation
- What tests to write
## What is Fixed (User Owns Goals)
- The feature requirements (from spec)
- Acceptance criteria (from feature-list.json)
- Quality gates (tests must pass)

Compare this to over-specified prompting:

# BAD: Over-specified implementation instruction
To implement authentication:
1. Create file src/auth/login.ts
2. Import bcrypt from "bcrypt"
3. Create function validatePassword(input, hash)
4. Use bcrypt.compare()
5. Return boolean result
...

This breaks when:

  • The project uses a different directory structure
  • The project already has auth utilities
  • The project uses argon2 instead of bcrypt
  • Any assumption does not match reality

The pattern for your own agents:

Define what success looks like, not how to achieve it:

{
  "features": [
    {
      "id": "auth-001",
      "title": "Password Validation",
      "acceptance_criteria": [
        "Valid passwords return true",
        "Invalid passwords return false",
        "Timing attacks are mitigated",
        "Follows existing auth patterns in codebase"
      ],
      "passes": false
    }
  ]
}

Actionable takeaway: Replace step-by-step instructions with acceptance criteria. Let the agent figure out the path; verify the outcome.
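
One way to make "verify the outcome" mechanical is to pair each criterion with an executable check and gate completion on the whole set passing. The shape below is a hypothetical sketch; mapping criteria to checks (a test file, an HTTP probe) is left to you.

// Hypothetical sketch: criteria as data, each with an executable check.
// The agent chooses how to implement; this gate only verifies outcomes.
interface Criterion {
  description: string;
  check: () => Promise<boolean>;   // e.g. run a test file, hit an endpoint
}

async function verifyFeature(criteria: Criterion[]): Promise<boolean> {
  let allPassed = true;
  for (const c of criteria) {
    const passed = await c.check();
    console.log(`${passed ? "PASS" : "FAIL"}  ${c.description}`);
    if (!passed) allPassed = false;
  }
  return allPassed;                // only a fully green run counts as done
}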

Lesson 3: Meaning Lives in Code

OpenAI’s discovery: Schemas and query history describe a table’s shape and usage, but its true meaning lives in the code that produces it. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata. By crawling the codebase with Codex, their agent understands how datasets are actually constructed.

What this means in practice:

Documentation lies. Comments go stale. But the code that produces data (or implements features) is the source of truth. An agent that can read and understand the actual implementation has context that no metadata layer can provide.

How Atomic implements this:

The codebase-pattern-finder agent does not just find files; it extracts actual implementation patterns:

# codebase-pattern-finder agent
You are a pattern librarian. Find similar implementations that
can serve as templates.
## What to Extract
For each pattern found, provide:
- Full file path with line numbers
- The ACTUAL CODE (not summaries)
- Other places this pattern is used
- No editorial commentary

Here is an example of what the pattern-finder outputs:

Authentication Pattern

  • Location: src/auth/session.ts:45-67
  • Also used in: src/auth/refresh.ts:23-41
export async function validateSession(token: string): Promise<Session | null> {
  const decoded = await verifyJWT(token, config.jwtSecret);
  if (!decoded) return null;
  const session = await db.sessions.findUnique({
    where: { id: decoded.sessionId },
    include: { user: true }
  });
  if (!session || session.expiresAt < new Date()) {
    return null;
  }
  return session;
}

Pattern notes: Uses early-return style, includes user relation, checks expiration explicitly.

When the implementation agent later needs to add authentication to a new endpoint, it has the actual code pattern, not a description of what authentication should look like.
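
For illustration, here is what following that pattern might look like in a new endpoint. This is a hypothetical handler, and the imports assume the validateSession helper and db client shown in the pattern above.

// Illustrative only: a new endpoint reusing the extracted pattern rather
// than reinventing auth. Imports are assumptions based on the pattern above.
import { validateSession } from "./auth/session";
import { db } from "./db";

export async function getProfile(req: { headers: Record<string, string | undefined> }) {
  const token = req.headers["authorization"]?.replace("Bearer ", "");
  if (!token) return { status: 401, body: null };      // same early-return style

  const session = await validateSession(token);        // reuse, do not reimplement
  if (!session) return { status: 401, body: null };

  const profile = await db.profiles.findUnique({ where: { userId: session.user.id } });
  return { status: 200, body: profile };
}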

The knowledge hierarchy:

  • High-value context: actual implementation code, test cases showing behavior, git history showing why
  • Medium-value context: file locations, dependency graphs, API signatures
  • Low-value context: generic documentation, outdated comments, abstract descriptions

How specs capture meaning:

In Atomic, specifications are not just requirements. They capture the why behind decisions:

specs/authentication-spec.md
## Decision: JWT with Refresh Tokens
### Why Not Session Cookies?
- Mobile clients cannot reliably handle cookies
- Existing API clients expect Bearer tokens
- See research/docs/2025-01-15-auth-analysis.md for the full comparison
### Why 15-Minute Access Token Expiry?
- Balances security (short window) with UX (not too many refreshes)
- Matches existing patterns in src/auth/config.ts
- Security review approved this duration (Slack thread: security-2025-01-10)
### Implementation Pattern to Follow
See src/auth/session.ts:45-67 for the established validation pattern.

Actionable takeaway: When building context for your agent, prioritize:

  1. Actual code over documentation
  2. Tests over comments (tests show real expected behavior)
  3. Git history for the "why" behind changes
  4. Specs that link back to code locations (see the sketch below)
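
As a sketch of item 4, here is one hypothetical way (not part of Atomic) to expand path:line references in a spec into the actual code they point at, so the agent reads code rather than prose.

import { readFileSync } from "node:fs";

// Hypothetical helper: expand "src/auth/session.ts:45-67" style references
// found in a spec into the code they point at.
const CODE_REF = /([\w.\/-]+\.\w+):(\d+)-(\d+)/g;

function inlineCodeReferences(specText: string): string {
  return specText.replace(CODE_REF, (ref, file, start, end) => {
    try {
      const lines = readFileSync(file, "utf8").split("\n");
      const snippet = lines.slice(Number(start) - 1, Number(end)).join("\n");
      return `${ref}\n${snippet}`;
    } catch {
      return ref;   // file moved or missing: keep the plain reference
    }
  });
}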

The Scale-Invariant Pattern

Here is what the convergence reveals: layered context with clear boundaries beats throwing everything into one massive prompt.

OpenAI discovered this managing petabytes. Atomic discovered it managing a few thousand lines of markdown specs. The infrastructure changes with scale, but the architecture does not.


Why this pattern emerges independently:

  1. Context windows are finite. You cannot dump everything into a prompt. Layers let you load relevant context on demand.

  2. Agents need grounding. Without structured context, agents hallucinate. Layers provide verifiable sources of truth.

  3. Work needs to persist. Sessions end. Context resets. File-based artifacts survive.

  4. Specialization reduces errors. General-purpose tools create ambiguity. Focused tools have clear success criteria.

  5. Goals are stable; paths are not. What you want to achieve changes slowly. How you achieve it varies with every situation.

Implementing These Patterns

If you are building AI coding agents or tools, here is how to apply these lessons:

1. Design Your Context Layers

Map out what context your agent needs and where it comes from:

// Example: Context layer configuration
interface ContextLayers {
  // Layer 1: What exists
  codebaseStructure: {
    source: 'glob + grep',
    updates: 'on-demand',
    agent: 'codebase-locator'
  };
  // Layer 2: Curated understanding
  researchDocs: {
    source: 'research/docs/*.md',
    updates: 'per-session',
    agent: 'codebase-analyzer'
  };
  // Layer 3: Implementation patterns
  patterns: {
    source: 'extracted from codebase',
    updates: 'per-feature',
    agent: 'pattern-finder'
  };
  // Layer 4: Decisions and rationale
  specs: {
    source: 'specs/*.md',
    updates: 'per-project',
    agent: 'human + spec-generator'
  };
  // Layer 5: Session state
  progress: {
    source: 'feature-list.json + progress.txt',
    updates: 'per-feature',
    agent: 'implementation-agent'
  };
  // Layer 6: Live verification
  runtime: {
    source: 'live codebase queries',
    updates: 'on-demand',
    agent: 'any agent with Read access'
  };
}

2. Restrict Tool Access Per Agent

Create agent definitions with explicit tool boundaries. Here is a template:

Agent Header (YAML):

name: agent-name
description: when to invoke this agent
tools: ONLY the tools this agent needs

Agent Body:

You are a specialist at specific capability.
## You MUST
- Primary responsibility
- Secondary responsibility
## You MUST NOT
- Explicitly forbidden action
- Another forbidden action
## Output Format
Exactly what this agent returns

3. Use Acceptance Criteria, Not Instructions

Replace procedural prompts with outcome-based contracts:

{
  "feature": "User Login",
  "acceptance_criteria": [
    "POST /api/auth/login accepts email and password",
    "Returns JWT on valid credentials",
    "Returns 401 on invalid credentials",
    "Follows existing auth patterns in src/auth/",
    "Includes test coverage for happy and error paths"
  ]
}

4. Make Memory Persistent

Choose a memory strategy that survives session boundaries:

project/
├── research/
│   └── docs/                  # Layer 2: Curated understanding
│       └── YYYY-MM-DD-topic.md
├── specs/                     # Layer 4: Decisions
│   └── feature-spec.md
├── feature-list.json          # Layer 5: Progress tracking
├── progress.txt               # Layer 5: Session state
└── .git/                      # Layer 6: History and verification
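
A minimal sketch of reading and updating that state between sessions, assuming the feature-list.json shape shown earlier; the helper names are illustrative, not Atomic's actual code.

import { appendFileSync, readFileSync, writeFileSync } from "node:fs";

// Illustrative helpers, assuming the feature-list.json shape shown earlier.
interface Feature {
  id: string;
  title: string;
  acceptance_criteria: string[];
  passes: boolean;
}

function nextIncompleteFeature(path = "feature-list.json"): Feature | undefined {
  const { features } = JSON.parse(readFileSync(path, "utf8")) as { features: Feature[] };
  return features.find((f) => !f.passes);
}

function recordCompletion(id: string, note: string, path = "feature-list.json"): void {
  const data = JSON.parse(readFileSync(path, "utf8")) as { features: Feature[] };
  const feature = data.features.find((f) => f.id === id);
  if (feature) feature.passes = true;
  writeFileSync(path, JSON.stringify(data, null, 2));
  // progress.txt is append-only, so each session's trail survives a context reset
  appendFileSync("progress.txt", `${new Date().toISOString()} ${id}: ${note}\n`);
}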

Conclusion

OpenAI built their data agent to serve thousands of users across petabytes of data. Atomic helps individual developers ship code with AI assistance. The scale differs by orders of magnitude, but both converged on the same patterns:

  1. Layered context that loads relevant information on demand
  2. Tool consolidation that reduces ambiguity
  3. Goal-oriented prompting that lets agents choose execution paths
  4. Persistent memory through specs and progress tracking
  5. Specialized agents with restricted capabilities

These patterns work because they address fundamental properties of how LLMs operate: finite context windows, tendency to hallucinate without grounding, and strength at reasoning when given clear goals.

Scale changes the infrastructure. It does not change the architecture.

If you are building AI coding tools, start with these patterns. They have been validated at both ends of the scale spectrum.


Resources