By 2026, we have moved past the novelty of “LLMs can chat.”
The questions hitting my desk in Tokyo are no longer about benchmark scores. Enterprises are asking something much more pragmatic:
“Can this system actually take over a specific business process?”
However, as soon as we move from slides to implementation, things get messy. I see teams trying to force a simple RAG bot to handle edge cases it wasn’t built for, while others over-engineer multi-agent “AI teams” for tasks that a simple decision tree could solve. Many end up oscillating between the two, never quite finding the “just right” configuration.
The post-mortems for these projects usually tell the same story:
- A basic FAQ system becomes a maintenance nightmare due to unnecessary orchestration.
- “Reflection” loops are added for the sake of “intelligence,” pushing latency from seconds to nearly a minute.
- A brittle toolchain collapses the moment a single API endpoint changes or becomes flaky.
Gartner’s data reflects this: at least 30% of GenAI projects are abandoned after the PoC stage, often due to runaway costs or a lack of clear business value. For Agent-specific projects, they predict over 40% will be cancelled by the end of 2027.
The mistake usually happens at the very first step:
We start stacking “capabilities” before we have defined the “boundaries.”
Why Boundaries Trump Capabilities
In AI system design, there is a seductive trap: the belief that a model’s reasoning power (its “IQ”) is the sole determinant of success. In my experience—spanning oceanography to medical AI—boundaries are what actually dictate whether a product survives production.
I define these boundaries across three dimensions (a code sketch follows the list):
- Task Boundary: Is there a clear “right or wrong”? Is this deterministic output or heuristic exploration?
- Latency Boundary: What is the tolerance for waiting? 1 second (interactive), 30 seconds (asynchronous), or several minutes?
- Authority Boundary: Is the system just “talking” (outputting information), or is it “doing” (calling APIs, modifying databases, triggering logistics)?
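It helps to write these boundaries down as an artifact the team can review before any architecture debate. Below is a minimal Python sketch of that artifact; every name in it (Boundary, TaskType, Authority) is my own illustration, not a standard API:

```python
# A sketch only: these names and fields are illustrative, not a framework.
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    DETERMINISTIC = "deterministic"  # a clear right-or-wrong answer exists
    HEURISTIC = "heuristic"          # judgment-based, exploratory

class Authority(Enum):
    TALK = "talk"  # outputs information only
    ACT = "act"    # calls APIs, writes to databases, triggers logistics

@dataclass(frozen=True)
class Boundary:
    task: TaskType
    latency_budget_s: float  # e.g. 1.0 interactive, 30.0 async, 300.0 batch
    authority: Authority

# Example: an order-status answer is deterministic, interactive, read-only.
order_status = Boundary(TaskType.DETERMINISTIC, latency_budget_s=1.0,
                        authority=Authority.TALK)
```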
Using a high-autonomy Agent for a task that a simple RAG setup could solve isn’t just a waste of tokens; it introduces unnecessary stochasticity into a process that should be certain. Anthropic’s engineering team puts it bluntly: find the simplest solution that works, and only increase complexity when absolutely necessary.
The Building Blocks: Four Pillars of Agentic Systems
Before we look at architectures, we need a shared vocabulary. Regardless of complexity, these systems are built from the same four “bricks” (sketched in code after the list):
- The Brain (LLM Core): Responsible for decision-making based on context.
- Planning: Breaking goals into steps. In lower-level systems, this is hard-coded; in higher-level ones, it’s dynamic.
- Memory: Short-term context (chat history) and long-term knowledge (RAG/vector databases).
- Action Space: The external tools the system can touch (APIs, code interpreters, ERP interfaces).
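To make the vocabulary concrete, here is a skeleton of the four bricks. This is a pedagogical sketch under my own naming, not any real framework:

```python
# The four bricks as a skeleton; real systems wire them together very
# differently depending on the level (L1-L5).
from typing import Callable, Protocol

class LLMCore(Protocol):
    """The Brain: turns context into a decision or a piece of text."""
    def complete(self, prompt: str) -> str: ...

Tool = Callable[[str], str]  # Action Space: anything the system may touch

class AgenticSystem:
    def __init__(self, brain: LLMCore, tools: dict[str, Tool]):
        self.brain = brain               # decision-making core
        self.tools = tools               # action space (APIs, interpreters, ERP)
        self.history: list[str] = []     # short-term memory: chat context
        self.knowledge: list[str] = []   # long-term memory: stand-in for RAG

    def plan(self, goal: str) -> list[str]:
        # Planning: hard-coded at L1/L2; delegated to the brain from L3 up.
        return [goal]
```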
Lately, the industry has shifted focus. We’ve moved from Prompt Engineering to Context Engineering, and now to Harness Engineering. We aren’t just trying to make the “Brain” smarter; we are trying to build the best harness to make the existing Brain useful. When a system gains the ability to “act,” the primary concern isn’t just power—it’s how you constrain it.
The Landscape: A Spectrum from LLM Apps to AI Agents
I use the following framework to categorise systems based on the degree of autonomy the LLM holds. This spectrum serves as the roadmap for this series.
| Level | Name | LLM Role | Typical Architecture |
|---|---|---|---|
| L1 | Basic Responder | Passive generation: “Talk when spoken to” | Prompt + RAG |
| L2 | Router | Classifier: Directs intent to fixed rules | Intent Classification / Router |
| L3 | Tool Executor | Active Planning: Decides which tool and when | Function Calling / Tool Use |
| L4 | Orchestrator | Active Coordination: Distributes tasks to sub-agents | Multi-agent systems |
| L5 | Explorer | Autonomous Iteration: Self-correcting, dynamic paths | Goal-oriented (Experimental) |
A few clarifications:
- L1 and L2 are workflows. The logic is controlled by external code; the LLM is just a component. L3 is where an “Agent” actually begins, as the LLM starts to dictate its own execution path.
- I include L1/L2 because they are often the “correct” answer for many business problems.
- L4 can contain L1–L3. An orchestrator’s sub-agents are often L3 executors themselves.
The Spectrum in Practice: An E-commerce Example
To make this concrete, imagine an e-commerce customer service evolution:
L1: The Basic Responder
- Role: Knowledge base librarian.
- Example: A user asks “What is your return policy?” The system finds the policy text and summarises it.
- Nature: Highly controlled. The LLM only handles “translation” of data to prose.
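In code, the L1 pattern is roughly “retrieve, then translate.” A minimal sketch, where retrieve stands in for a real vector-store lookup and llm_summarise for the model call:

```python
# L1: the model only "translates" retrieved data into prose; it never
# chooses a path. POLICY_DOCS stands in for an indexed knowledge base.
POLICY_DOCS = {
    "return policy": "Items may be returned within 30 days with a receipt.",
}

def retrieve(query: str) -> str:
    # Stand-in for a vector-store lookup (embeddings + similarity search).
    return next((text for key, text in POLICY_DOCS.items() if key in query.lower()),
                "No matching document found.")

def answer(query: str, llm_summarise) -> str:
    context = retrieve(query)
    prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
    return llm_summarise(prompt)  # injected model call; the only generative step
```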
L2: The Router
- Role: The Triage Desk.
- Example: The system identifies if a user is “Complaining” or “Asking for status.” If it’s a complaint, it triggers a hard-coded script.
- Nature: Uses semantic understanding for routing, but the path remains a “railway track.”
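A minimal sketch of the L2 shape: the LLM emits a label, and plain code owns every path. Here classify_intent is a stand-in for a single constrained model call:

```python
# L2: the LLM is a classifier; the "railway track" is ordinary code.
def classify_intent(message: str) -> str:
    # Stand-in for one constrained LLM call that must output a known label.
    return "complaint" if "refund" in message.lower() else "order_status"

HANDLERS = {
    "complaint": lambda m: "Running the hard-coded complaints script.",
    "order_status": lambda m: "Running the fixed status-lookup flow.",
}

def route(message: str) -> str:
    label = classify_intent(message)
    # Unknown labels fall back to a human: the track never forks on its own.
    return HANDLERS.get(label, lambda m: "Handing off to a human agent.")(message)

print(route("I want a refund, this arrived broken."))  # -> complaints script
```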
L3: The Tool Executor (The Agentic Threshold)
- Role: The Skilled Operator.
- Example: User: “Where is my package?” The system realises it needs the `get_shipping_status` tool, extracts the tracking ID, and executes.
- Nature: The LLM decides the path. It knows when to talk and when to act (see the sketch below).
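A sketch of the L3 loop, assuming a wrapper llm_decide that returns JSON shaped like typical function-calling output, either {"tool": ..., "args": ...} or {"answer": ...}; the names are illustrative:

```python
# L3: the model picks the tool and its arguments; code executes and loops.
import json

def get_shipping_status(tracking_id: str) -> str:
    return f"Package {tracking_id} is out for delivery."  # stand-in carrier API

TOOLS = {"get_shipping_status": get_shipping_status}

def handle(llm_decide, user_message: str) -> str:
    decision = json.loads(llm_decide(user_message, list(TOOLS)))
    while "tool" in decision:  # the model chose to act rather than talk
        result = TOOLS[decision["tool"]](**decision["args"])
        decision = json.loads(llm_decide(f"Tool result: {result}", list(TOOLS)))
    return decision["answer"]
```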
L4: The Orchestrator
- Role: The Project Manager.
- Example: User: “I want to return this, but I lost the invoice.” The system spins up a “Finance Agent” to verify the transaction and a “Logistics Agent” to book a pickup.
- Nature: Tasks are parallelised and decomposed across specialised units.
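A compressed sketch of the L4 shape. The decomposition is hard-coded here for readability; in a real orchestrator an LLM would choose the sub-tasks. The agent names are hypothetical:

```python
# L4: the orchestrator decomposes a request and fans it out to specialised
# sub-agents, each of which may itself be a full L3 executor.
from concurrent.futures import ThreadPoolExecutor

def finance_agent(order_id: str) -> str:
    return f"Transaction for {order_id} verified without invoice."  # stand-in

def logistics_agent(order_id: str) -> str:
    return f"Return pickup booked for {order_id}."  # stand-in

def orchestrate(order_id: str) -> str:
    # Sub-tasks run in parallel; results are joined into one reply.
    with ThreadPoolExecutor() as pool:
        finance = pool.submit(finance_agent, order_id)
        logistics = pool.submit(logistics_agent, order_id)
    return f"{finance.result()} {logistics.result()}"

print(orchestrate("ORD-1234"))
```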
L5: The Explorer
- Role: The Junior Specialist.
- Example: Faced with a unique, never-before-seen refund glitch, the system autonomously audits logs, writes a temporary data-cleaning script, and resolves it.
- Nature: Experimental. High risk, high autonomy.
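Even a toy L5 loop should show where the boundary lives. A sketch, with llm_propose and run_action as assumed callables:

```python
# L5: an open-ended propose-act-observe loop. The most important line is the
# iteration cap: without a termination boundary, high autonomy becomes an
# unbounded process.
MAX_ITERATIONS = 5

def explore(llm_propose, run_action, goal: str) -> str:
    observations: list[str] = []
    for _ in range(MAX_ITERATIONS):  # hard termination boundary
        action = llm_propose(goal, observations)
        if action == "DONE":
            return observations[-1] if observations else "Nothing to do."
        observations.append(run_action(action))  # sandboxed execution only
    return "Stopped: iteration budget exhausted; escalating to a human."
```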
Autonomy vs. Architecture: Two Critical Lenses
When deciding on a direction, I separate two distinct concepts:
- Autonomy Level (The “What”): How much do we trust the system to work without a leash? (L1–L5).
- System Architecture (The “How”): Is this a single agent or a swarm?
They are not hard-linked. You can have a complex L4 Multi-agent system that requires human approval at every step (Low Autonomy), or a simple L3 single agent running 100% autonomously in a sandbox (High Autonomy).
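One way to see the decoupling in code: the same executor runs at high or low autonomy purely by swapping the approval policy in front of its actions. A sketch, where require_approval is a hypothetical policy hook:

```python
# Autonomy as a policy, not an architecture: the same executor becomes
# low- or high-autonomy depending on the gate in front of its actions.
def require_approval(action: str) -> bool:
    return input(f"Allow '{action}'? [y/N] ").strip().lower() == "y"

def execute(action: str, tool, autonomous: bool) -> str:
    if not autonomous and not require_approval(action):
        return f"Action '{action}' blocked by the human reviewer."
    return tool(action)  # identical call path at either autonomy level
```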
Define the scenario (the Level) before you pick the architecture (the Bricks). Many projects fail because they use a “fashionable” Multi-agent architecture to solve what is essentially an L2 routing problem.
Three Questions to Locate Your Position
To determine your business boundary, ask these three questions:
1. Is there a clear “Right vs. Wrong”?
- If yes (Data queries, summaries): L1/L2 is usually enough.
- If it requires subjective judgment or real-time external info: L3 minimum.
- If the goal is fuzzy and requires exploration: L4 and above.
2. What is the latency tolerance?
- < 3 seconds (Real-time): L1/L2. L3 requires extreme optimisation.
- 10–30 seconds (Async): L3/L4.
- Minutes: L4/L5.
3. Does it actually need to “do” anything?
- Information output only: L1/L2.
- API calls or database writes: L3.
- Cross-system, multi-role collaboration: L4.
The Quick Selection Framework (translated into code after the list):
- Deterministic Outcomes + Information Only + Low Latency → L1
- Requires Judgment-based Routing Between Processes → L2
- Requires External System Integration + Defined Boundaries → L3
- Multi-role Collaboration + Parallelisable Task Decomposition → L4
- Open-ended Exploration + No Clear Termination Condition → L5
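The same framework as a first-pass function. The boolean parameters are the answers to the three questions above; the names and thresholds are mine, for illustration:

```python
# The selection list as a first-pass recommendation; thresholds illustrative.
def recommend_level(deterministic: bool, needs_actions: bool, multi_role: bool,
                    open_ended: bool, latency_budget_s: float) -> str:
    if open_ended:
        return "L5 (experimental: add strict termination boundaries)"
    if multi_role:
        return "L4"
    if needs_actions:
        return "L3" if latency_budget_s >= 10 else "L3 (needs aggressive latency work)"
    if deterministic and latency_budget_s < 3:
        return "L1/L2"
    return "L2"

print(recommend_level(deterministic=True, needs_actions=False, multi_role=False,
                      open_ended=False, latency_budget_s=1.0))  # -> L1/L2
```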
What Follows
In the subsequent entries of this series, I will dismantle each level:
- What a typical system at this level actually looks like.
- When it is the “just right” choice.
- Common design pitfalls and over-engineering traps.
- How to build the necessary boundaries to keep it stable.
The goal isn’t to build the “smartest” system possible. It’s to build the one that survives production. In my experience, the systems that stick are rarely the most brilliant—they are the ones that are controllable, maintainable, and respect their boundaries.