
L3: From Talking to Doing — Giving Agents the Ability to Act and Guide Dynamically

Moving beyond conversation to real-world action. How L3 Agents use controlled tool calling and dynamic guidance to evolve from 'verbose task encyclopedias' into 'effective digital workers.'

Author Lusan
Published 2026-04-24
Author's Note

This is the third post in the series, focused on the L3 level. After L1 (basic Q&A) and L2 (intent routing) addressed AI’s ability to “understand” and “retrieve accurately,” L3 confronts a qualitatively different challenge: turning AI from a speaker into an executor. I’d recommend reading the series intro first to get oriented on the L1–L5 framework as a whole.

I. What L3 Actually Is: A Critical Architectural Inflection Point

Across the full L1–L5 spectrum, L3 is the pivotal turning point. As I established in the intro post: starting at L3, the LLM begins to drive its own execution path — and that’s where Agent behavior truly begins.

The technical foundation behind this shift is called Function Calling (referred to as Tool Use in Anthropic’s API documentation).

Function Calling works in a fundamentally different way from asking an LLM a question and waiting for an answer. The flow looks like this:

  1. The developer gives the model a set of tool definitions (structured descriptions: tool name, what it does, parameter schema)
  2. After the user sends a request, the model autonomously decides: does this task require calling a tool? If so, which one?
  3. Instead of generating a text answer directly, the model outputs a structured call intent (specifying the tool name and parameters)
  4. Your application receives that intent and actually executes the corresponding function or API call
  5. The execution result is returned to the model, which then generates the final natural-language response
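
To make the five steps concrete, here is a minimal sketch of the loop using Anthropic’s Python SDK. The tool definition, model string, and stub executor are my own illustrative assumptions — the shape of the loop is the point:

# Minimal sketch of the five-step loop (Anthropic Python SDK).
# Tool, model string, and stub executor are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Step 1: tool definitions — name, what it does, parameter schema
tools = [{
    "name": "get_sales_data",
    "description": "Retrieves sales records for one drug in one region and "
                   "period. Use for factual sales questions, not forecasts.",
    "input_schema": {
        "type": "object",
        "properties": {
            "drug_id": {"type": "string"},
            "region": {"type": "string"},
            "date_range": {"type": "string", "description": "e.g. '2024-Q4'"},
        },
        "required": ["drug_id", "region", "date_range"],
    },
}]

def execute_tool(name: str, args: dict) -> str:
    """Step 4 lives in YOUR code — the model only emits the call intent."""
    if name == "get_sales_data":
        return '{"units": 12400, "yoy_growth": 0.18}'  # stub result
    raise ValueError(f"Unknown tool: {name}")

messages = [{"role": "user",
             "content": "How did Drug A sell in the Northeast in Q4 2024?"}]
response = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
)

# Steps 2–3: the model decided a tool was needed and emitted a structured
# call intent (tool name + parameters) instead of a text answer
while response.stop_reason == "tool_use":
    call = next(b for b in response.content if b.type == "tool_use")
    result = execute_tool(call.name, call.input)               # step 4
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": [{
        "type": "tool_result", "tool_use_id": call.id, "content": result,
    }]})
    response = client.messages.create(                         # step 5
        model="claude-sonnet-4-5", max_tokens=1024, tools=tools,
        messages=messages,
    )

print(response.content[0].text)  # final natural-language response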

The key is step 2: the model is deciding what to do — not external code. This is a fundamental departure from L2’s routing logic — in L2, the LLM identifies intent and triggers a preset script, with the execution path hardcoded externally. At L3, the LLM dynamically determines the call path at runtime. Anthropic’s official documentation describes this capability as: “Claude decides when to call a tool based on the user’s request and the tool’s description.”1

That autonomous scheduling capability is exactly why L3 is the starting point for true Agent behavior.


II. How It Works: The Minimal Execution Loop

To describe L3’s full operational logic, I’ll use an analytical framework I call the Minimal Execution Loop. To be clear, this is a descriptive framework I’ve developed from engineering practice — not an industry-standard term. Its academic counterpart is the ReAct paradigm (Reasoning + Acting), proposed by researchers at Princeton and Google in 2022 and published at ICLR 20232 — the paper that first rigorously demonstrated the effectiveness of interleaving “reasoning traces” with “action execution.”

The full L3 loop looks like this:

flowchart LR
  A[Understand intent] --> B[Identify information gaps]
  B --> C["[Clarify to fill gaps]"]
  C --> D[Select tool]
  D --> E[Fill parameters]
  E --> F[Execute call]
  F --> G[Validate result]
  G --> H[Generate deliverable]
  
  G -- "If result is anomalous, feed back" --> B


A few key properties of this loop worth understanding:

Where autonomy actually lives: Not in the clarification step, but in “select tool” and “fill parameters.” The LLM autonomously decides whether to call get_sales_data or get_competitor_report, and figures out how to extract the right parameter values from the conversation context — that’s what separates L3 from L2.

Single-step vs. multi-step: L3 can involve a single tool call (retrieve one data point, generate a response) or a chained sequence (fetch data → analyze → generate report). Single-step is the fundamental form; chained calls introduce “cascading error” risk, which I cover in Section VIII.

Clarification is a feature, not the definition: Clarification (slot filling) is one node in the loop — it activates when information is incomplete. L2 can also do clarification, via preset question scripts. Treating clarification as L3’s defining characteristic is an architectural misreading.


III. L3’s “Mouth”: Dynamic Guidance and Information Completion

Even though clarification isn’t L3’s defining feature, L3 does have an information-gathering capability that L2 lacks: dynamic clarification grounded in business context.

L2 Clarification vs. L3 Clarification

L2 clarification depends on preset logic: when the system detects a “sales query” intent, it fires a fixed list of follow-up questions. The questions are fixed, the order is fixed, there’s no adaptation based on what the user actually said.

L3 clarification is generated dynamically by the LLM. When a user says “analyze Drug A’s performance in the Northeast region,” the L3 Agent doesn’t execute immediately — instead, it identifies the structural gaps needed to run a complete analysis:

“Based on our standard analysis workflow, I need to confirm a few things: 1. Is the time period Q4 last year, or the full year? 2. Should this include a comparison against Drug B’s data for the same period? 3. Are we analyzing sales volume only, or should I include coverage rate and growth rate as well?”

What makes this different: it’s inference, not a template. If the user’s description already includes the time period, the LLM won’t ask again. If the user mentions a non-standard analysis dimension, it will ask for clarification rather than forcing it into the standard template.

Technical Implementation

Dynamic clarification can be implemented in two ways:

  • System prompt approach: Convert the SOP into structured guidance logic written into the System Prompt, describing “in what situations to ask what questions”
  • Tool-based approach: Define information gathering itself as a tool (e.g., clarify_analysis_scope), allowing the LLM to actively call it when it determines information is insufficient

The first approach is simpler to implement; the second is more flexible, and better suited for scenarios where clarification records need to be incorporated into audit trails (healthcare, finance, etc.).
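
A sketch of what the second approach could look like — the schema fields and enum values are my assumptions, not a standard:

# Sketch of the tool-based approach: clarification is itself a tool the
# LLM calls when it detects missing information. Schema is an assumption.
clarify_tool = {
    "name": "clarify_analysis_scope",
    "description": (
        "Ask the user targeted follow-up questions when an analysis request "
        "is missing required dimensions (time period, comparison set, "
        "metrics). Call this BEFORE any data-retrieval tool if information "
        "is incomplete."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "missing_dimensions": {
                "type": "array",
                "items": {"type": "string",
                          "enum": ["time_period", "comparison", "metrics"]},
                "description": "Required dimensions that are still unknown.",
            },
            "questions": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Follow-up questions in business language.",
            },
        },
        "required": ["missing_dimensions", "questions"],
    },
}

Because each clarification is now a structured tool call rather than free text, it can be written directly into the audit trail — the property that makes this approach attractive in healthcare and finance.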

Business Value for Non-Technical Users

For a newly onboarded field rep or junior analyst, every clarification question from the Agent is silent on-the-job training in business standards — it’s teaching the user which dimensions to consider for a complete analysis. This converts what used to exist only in senior employees’ heads into a systematic, codified guidance system.


IV. L3’s “Hands”: Controlled Tool Execution

This is the most technically complex part of L3 architecture — and the easiest place to get burned.

The Execution Boundary of Function Calling

Before discussing “what to let the LLM execute,” it’s worth clearing up a common misconception: Function Calling does not mean prohibiting the LLM from generating any code or query statements.

Text-to-SQL (having the LLM generate SQL and execute it) is a mature, widely adopted pattern within the Function Calling framework, already deployed at scale in BI tools and data analytics platforms3.

The real design question isn’t “can the LLM write SQL” — it’s what environment it runs in and whether there’s a review mechanism. The design principle breaks down like this:

| Scenario | Recommended approach |
| --- | --- |
| Read-only queries, low-privilege database | Allow LLM-generated SQL with parameter validation |
| High-privilege database, write operations | Route through predefined interfaces (controlled tools); generated SQL goes through a review layer before execution |
| Production write operations (sending emails, pushing data) | Enforce Draft & Review; execute only after human confirmation |

The underlying design intent is to insert a sandbox of appropriate granularity between LLM capability and production data — not to blanket-prohibit the LLM from generating query logic.

Toolbox Design: Two Principles That Get Overlooked

The real-world effectiveness of Function Calling depends heavily on toolbox design quality. Anthropic’s engineering team covers this in Writing Effective Tools for AI Agents4, and two points are particularly critical for business deployment:

Principle 1: Tool description quality determines selection accuracy. The LLM reads the description field to decide whether to call a tool. Vague descriptions (e.g., get_data: retrieves data) create ambiguity — especially pronounced when there are many tools. A good description explains: what this tool does, what scenarios it’s appropriate for, and what it’s not for. Anthropic’s internal testing showed that adding Tool Use Examples improved accuracy on complex parameter handling from 72% to 90%4.

Principle 2: Tool granularity involves real tradeoffs. Fine-grained tools (each tool does exactly one thing) are safer, have smaller blast radius when something goes wrong, and have simpler parameters — but complex tasks require multiple calls, consuming more context. Coarse-grained tools (one tool handles multiple situations) need fewer calls, but their complex parameters make LLM input errors more likely. The general recommendation: start fine-grained, then let evaluation data drive consolidation decisions — don’t design tool boundaries by intuition.
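
To make Principle 1 concrete, here’s an illustrative contrast between a vague and a well-scoped description — both tools are hypothetical:

# Illustrative contrast for Principle 1. The description field is the main
# signal the model uses to choose among tools; both tools are hypothetical.
vague_tool = {
    "name": "get_data",
    "description": "Retrieves data.",  # which data? when? when NOT to use it?
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
    },
}

sharp_tool = {
    "name": "get_formulary_status",
    "description": (
        "Returns a drug's current formulary tier for one payer. "
        "Use for access and coverage questions. "
        "NOT for sales figures (use get_sales_data) or pricing history."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "drug_id": {"type": "string"},
            "payer_id": {"type": "string"},
        },
        "required": ["drug_id", "payer_id"],
    },
}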

Workflow Orchestration Example

When tool calls need to be chained across multiple steps, a typical L3 chained workflow looks like this:

Execution Pipeline — user requests an analysis report:

  1. get_sales_data(drug_id, region, date_range) → returns raw data
  2. calculate_kpi(raw_data, metrics=['growth', 'share']) → returns KPI calculation results
  3. generate_report(kpi_data, template='manager_pdf', role='market_manager') → outputs the final report file

Each step’s input comes from the previous step’s output; each step is a predefined, controlled interface. The LLM’s role is that of a scheduler — not a programmer operating directly on the database.
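
In code, that “scheduler, not programmer” division of labor can look like the sketch below — the function bodies are stubs, and the allow-list mechanism is one possible design, not the only one:

# Sketch of the scheduler pattern: the LLM proposes calls; the application
# executes only predefined, controlled interfaces. Bodies are stubs.
def get_sales_data(drug_id: str, region: str, date_range: str) -> dict:
    return {"drug_id": drug_id, "units": 12400}  # real version: CRM read API

def calculate_kpi(raw_data: dict, metrics: list) -> dict:
    return {"growth": 0.18, "share": 0.32}  # real version: vetted formulas

def generate_report(kpi_data: dict, template: str, role: str) -> str:
    return "reports/drugA_q4_manager.pdf"  # real version: renders a template

# Allow-list: the model can only schedule names registered here
TOOL_REGISTRY = {
    "get_sales_data": get_sales_data,
    "calculate_kpi": calculate_kpi,
    "generate_report": generate_report,
}

def dispatch(name: str, args: dict):
    if name not in TOOL_REGISTRY:
        raise PermissionError(f"Tool not in allow-list: {name}")
    return TOOL_REGISTRY[name](**args)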


V. The Highest-Value L3 Use Case: Activating Existing IT Assets

This is the direction most underestimated in enterprise L3 deployments — and in practice, one of the most strategically valuable: teaching AI to use the organization’s existing IT systems, rather than replacing them.

Why This Matters Especially in Regulated Industries

Pharma, finance, and insurance companies share a common operational reality: existing IT systems (CRM, ERP, BI platforms, compliance tools) have been validated over years, are embedded in core business processes, carry prohibitive replacement costs, and would require re-certification with regulators. In these environments, “introducing an entirely new AI system” typically faces far greater resistance than “teaching AI to use the existing systems.”

L3’s Function Calling architecture fits this need naturally: wrap existing IT system APIs as tools, and the Agent simulates the behavior of a human operator by calling those tools.

What this means in practice:

  • The CRM doesn’t need to be rebuilt — the Agent reads customer data via existing APIs
  • The BI platform doesn’t need to be replaced — the Agent extracts reports via existing query interfaces
  • Compliance systems are unaffected — the Agent operates within the permission boundaries of the compliance interface
  • Data authenticity and accuracy are still guaranteed by the mature systems; AI handles only the scheduling logic

MCP: The Infrastructure Standard for This Direction

To address the engineering complexity of integrating AI with IT assets, Anthropic released Model Context Protocol (MCP)5 in November 2024 — an open standard designed to normalize how AI systems interact with external data sources and tools.

Before MCP, connecting each new system required writing custom connector code. Anthropic calls this the “N×M integration problem” — N data sources times M AI systems means N×M bespoke connectors to build and maintain. MCP solves this with a unified protocol, which has been described as “the USB standard of AI integration.”6

MCP has since been adopted by OpenAI and Google DeepMind, and in December 2025, Anthropic donated it to the Agentic AI Foundation (AAIF) under the Linux Foundation, making it a de facto industry standard7. Pre-built MCP servers already exist for major enterprise systems including Google Drive, Salesforce, Slack, GitHub, and Postgres.

For pharma companies, this means: Veeva (the industry’s leading CRM), internal BI systems, and payer policy databases can all be connected to the same L3 Agent through the standard MCP protocol — no need to develop a custom integration layer for each system.
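
For a sense of scale, wrapping an existing read-only endpoint as an MCP tool is a few dozen lines with the official MCP Python SDK. The CRM URL and field names below are assumptions; the FastMCP pattern comes from the SDK itself:

# Sketch: exposing an existing CRM read endpoint as an MCP server, using
# the official MCP Python SDK (pip install mcp). URL/fields are assumptions.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-readonly")

@mcp.tool()
def get_customer_profile(customer_id: str) -> dict:
    """Read one customer's profile via the CRM's existing REST API."""
    resp = httpx.get(
        f"https://crm.internal.example.com/api/v1/customers/{customer_id}",
        timeout=10.0,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    mcp.run()  # any MCP-capable client can now call this tool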

Once the “which systems to connect” problem is solved, organizations quickly hit the next question: once the Agent is connected to those systems, how should it operate them? Anthropic’s Agent Skills — launched in late 2025 and released as an open standard in December 2025 — are designed to answer exactly that. If MCP solves the “pipe” problem (letting the Agent access CRM, BI, and compliance systems), Skills solve the “methodology” problem (telling the Agent which specific business processes and operating norms to follow when accessing those systems).

The division of responsibility: MCP is the standard interface at the IT infrastructure layer; Skills are the standard packaging format at the business knowledge layer. A complete enterprise deployment in the L3 context might use both simultaneously — MCP connecting existing IT assets, Skills encoding the company’s SOPs and analytical methodologies.

The technical details, selection logic, and whether Skills can achieve the same industry-standard status as MCP (still an open question) are topics I’ll cover in a dedicated post.

Strategic Implications for Change Management

From an adoption standpoint, the “activate existing IT assets” direction has another advantage: it reduces resistance to change. From the IT department’s perspective, they’re not being replaced by a new system — their existing systems are getting a smarter user. That framing significantly reduces cross-functional friction.


VI. Role Adaptation: Making the Same Analysis “Speak to Different Audiences”

Role differentiation inside pharma organizations is significant — the same analytical data has very different value depending on who’s reading it. L3’s tool-calling architecture makes it possible to “generate multiple delivery formats from one analytical logic.”

| Target role | Core concerns | Ideal deliverable |
| --- | --- | --- |
| Field Rep (MR) | Specific customer actions, talk tracks, visit priority | One-tap mobile brief / push email |
| Marketing Manager | Growth curves, competitive comparison, formulary status | Structured PDF report / BI dashboard link |
| Executive | Macro insights, risk signals, ROI outlook | 3-page PPT summary in company template |

How does this work technically?

Role adaptation is typically implemented by introducing an “output target role” parameter in the Function Calling tool definition. The Agent selects and calls different formatting tools based on the user’s role tag (obtained from the permission system or conversation context):

# Tool definition example (simplified)
tools = [
    {
        "name": "generate_report",
        "description": "Generates a role-adapted report from analysis data. Use when the same KPI dataset needs to be rendered in different formats for different audiences.",
        "input_schema": {
            "type": "object",
            "properties": {
                "kpi_data": {
                    "type": "object",
                    "description": "KPI dataset produced by the analysis step."
                },
                "target_role": {
                    "type": "string",
                    "enum": ["rep", "manager", "executive"],
                    "description": "Target audience role. Determines output format and level of detail."
                }
            },
            "required": ["kpi_data", "target_role"]
        }
    }
]

generate_report(target_role="executive") calls the 3-page PPT template; generate_report(target_role="rep") calls the mobile brief template. Same data, different tool call, completely different output — no manual reformatting required.
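
On the application side, that dispatch can be as simple as a role-to-renderer table. The renderer functions below are hypothetical stand-ins for real template engines:

# Sketch of the application-side dispatch behind generate_report.
# The renderer functions are hypothetical stubs.
def render_mobile_brief(kpi_data: dict) -> str:
    return "briefs/rep_brief.html"  # real version fills the mobile template

def render_pdf_report(kpi_data: dict) -> str:
    return "reports/manager_report.pdf"  # real version renders the PDF

def render_ppt_summary(kpi_data: dict) -> str:
    return "decks/executive_summary.pptx"  # real version fills the deck

ROLE_RENDERERS = {
    "rep": render_mobile_brief,
    "manager": render_pdf_report,
    "executive": render_ppt_summary,
}

def generate_report(kpi_data: dict, target_role: str) -> str:
    if target_role not in ROLE_RENDERERS:  # mirrors the enum in the schema
        raise ValueError(f"Unsupported role: {target_role!r}")
    return ROLE_RENDERERS[target_role](kpi_data)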


VII. UI/UX: Designing the L3 Experience for Non-Technical Users

This is the most consistently underestimated part of L3 deployment — and the one that determines whether non-technical users can actually use the system productively.

The UI Challenge L3 Introduces That L1/L2 Didn’t

In L1/L2, user interaction is relatively simple: enter a question, see the answer. But L3 introduces a new complexity: the model is executing a series of actions behind the scenes — selecting tools, filling parameters, calling APIs, validating results — and the user has no visibility into any of it.

For technical users, that’s probably fine. For non-technical business users — marketing managers, field reps — that “black box” feeling creates two problems: distrust (“what is the AI doing? where did these numbers come from?”) and inability to course-correct (“the output is wrong but I don’t know which step failed”).

The core task of L3 UI/UX design is therefore: translate the backend tool-calling process into something users can see and understand in business terms.

A Four-Layer UX Framework

Organized around what users need at different stages of the process, L3’s UX breaks into four layers:

Layer 1: Process Transparency — The most important layer, and L3’s most distinctive design challenge.

The system should translate the tool-calling process into language users can understand, rather than exposing technical details:

Technical Perspective

// API Request
get_sales_data_api(
  drug_id='A123',
  region='northeast',
  period='2024-Q4'
)

Business Perspective

Pulling Drug A Q4 2024 Northeast region sales data from CRM…

── Demo UI ──

This “progress visibility” design lets users see what the Agent is doing and builds trust in the system. Users don’t need to understand APIs — they just need to feel that the system is genuinely working on their request.
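
One lightweight way to implement this translation is a static mapping from tool names to business-language status templates — the tool and parameter names here are assumptions carried over from earlier examples:

# Sketch of a progress-translation layer: each backend tool call is shown
# as a business-language status line. Names are assumptions.
STATUS_TEMPLATES = {
    "get_sales_data": "Pulling {drug_id} {date_range} {region} sales data from CRM…",
    "calculate_kpi": "Calculating growth and share metrics…",
    "generate_report": "Formatting the report for your role…",
}

class _SafeArgs(dict):
    def __missing__(self, key):  # tolerate params a template doesn't use
        return ""

def status_line(tool_name: str, args: dict) -> str:
    template = STATUS_TEMPLATES.get(tool_name, "Working on your request…")
    return template.format_map(_SafeArgs(args))

# status_line("get_sales_data",
#             {"drug_id": "Drug A", "region": "Northeast",
#              "date_range": "Q4 2024"})
# -> "Pulling Drug A Q4 2024 Northeast sales data from CRM…"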

Layer 2: Parameter Review — Let users verify the Agent’s “understanding” before anything executes.

After the Agent extracts parameters from the natural-language conversation, it should present a lightweight confirmation view:

Parameter Review

L3 Agent Confirmation Layer

Drug: Drug A (ID: A-2024)
Region: Northeast (NY, NJ, CT, MA)
Time period: Q4 2024 (October – December)
Comparison: Include Drug B same-period data
── Demo UI ──

This interface addresses the most common L3 failure mode — parameter hallucination. Catching a parameter error before compute is spent is dramatically more efficient than discovering it in the output.

Layer 3: Draft & Review — A confirmation mechanism specifically for “write operations” (actions with real-world consequences).

Write operations include: sending emails, pushing data to downstream systems, updating CRM records, generating official reports. Before any of these execute, a preview interface is non-negotiable:

Draft Review
Status: Pending
Subject: Q4 Sales Brief — Northeast Region
Body:

Team, please find attached the Drug A Q4 2024 Northeast region sales analysis. The analysis covers an 18% year-over-year increase and key competitive dynamics…

Attachment: DrugA_Q4Brief_Northeast_2025.pdf (PDF · 2.4 MB)
── Demo UI ──

This is the concrete UI expression of the “Human-in-the-loop” principle Anthropic explicitly recommends in Building Effective Agents8: before executing any irreversible action, always leave a human the last word.

Layer 4: Result Attribution — Tell users where the data came from.

When the Agent surfaces analytical conclusions, key data points should carry source annotations:

Northeast Q4 sales up 18% year-over-year
Source: CRM System · As of 2024-12-31
Competitor Drug B same-period growth: 11%
Source: Internal BI Platform · As of 2025-01-05
── Demo UI ──

This not only increases credibility — it also provides a traceable path for investigating data quality issues.

Additional Design Recommendations for Non-Technical Users

  • Use business language, hide technical details: Parameter labels should say “Time Period,” not date_range; region selection should be a clickable map, not a code input field
  • Progressive disclosure: Show a high-level “processing” status first; let users click to expand and see the detailed steps
  • Translate error messages into business language: An API timeout shouldn’t surface as “timeout error” — it should say “The CRM system is responding slowly. Retrying…”

VIII. What Can Go Wrong: Three L3 Failure Modes

Giving an Agent execution authority means errors are no longer just conversational missteps — they’re real operational mistakes.

Failure Mode 1: Parameter Hallucination

LLMs can and do make mistakes when extracting parameters. Common scenarios:

  • Interpreting “last Q4” as Q4 2023 instead of Q4 2024 (due to training data temporal bias)
  • Mapping a region name to the wrong region code
  • Selecting an incorrect default value when the conversation contains ambiguity

Mitigations:

  1. Introduce Layer 2 UI (Parameter Review) from the framework above
  2. Add a parameter validation layer before tool calls; automatically intercept values that fall outside reasonable bounds
  3. Build a “business terminology dictionary” for common query conditions, mapping natural-language expressions to standardized parameter values — a minimal sketch of mitigations 2 and 3 follows this list
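
What that validation layer and terminology dictionary might look like — all entries, bounds, and function names here are illustrative assumptions, not a production vocabulary:

# Sketch of a parameter validation layer; dictionary entries and bounds
# are illustrative assumptions.
from datetime import date

REGION_CODES = {"northeast": "NE", "southeast": "SE", "midwest": "MW"}
PERIOD_ALIASES = {"last q4": "2024-Q4", "full year 2024": "2024-FY"}

def normalize_region(text: str) -> str:
    """Map a natural-language region name to its standardized code."""
    code = REGION_CODES.get(text.strip().lower())
    if code is None:
        # Unknown value: trigger a clarification turn instead of guessing
        raise ValueError(f"Unknown region {text!r} — ask the user to clarify")
    return code

def validate_period(period: str, today: date | None = None) -> str:
    """Intercept period values that fall outside reasonable bounds."""
    today = today or date.today()
    period = PERIOD_ALIASES.get(period.strip().lower(), period)
    year = int(period[:4])  # expects 'YYYY-Qn' or 'YYYY-FY'
    if not (today.year - 3 <= year <= today.year):
        raise ValueError(f"Period {period!r} is out of bounds — likely hallucinated")
    return period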

Failure Mode 2: Error Propagation in Chained Calls

When L3 executes multi-step chained tool calls, a deviation in step one gets amplified by each subsequent step. The wrong date range in step one means step two’s growth rate calculation runs on a bad baseline — and by the time the final deck is generated, the conclusion may be completely wrong, even though each individual step looked correct in isolation.

It’s worth being precise: this problem arises specifically in chained call scenarios, not in all L3 deployments. A single tool call (retrieve one record, return an answer) carries no such risk.

Mitigations:

  1. Add intermediate result validation at critical nodes in the chain (e.g., show a data summary after retrieval, confirm it looks right before proceeding to calculation)
  2. Design deterministic, idempotent tools — the same input always produces the same result, and repeating a call causes no additional side effects — so issues stay reproducible and debuggable
  3. Log the full tool call chain for post-hoc auditing (a minimal logging sketch follows this list)
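
What mitigation 3 might look like in practice — a hedged sketch where the JSONL file and field names are assumptions; the point is one append-only record per step, tied together by a per-request chain ID:

# Sketch of an append-only audit log for chained tool calls;
# file format and field names are assumptions.
import json
import time
import uuid

def log_tool_call(chain_id: str, step: int, tool: str,
                  args: dict, result_summary: str) -> None:
    """Append one step of the call chain to a JSONL audit log."""
    entry = {
        "chain_id": chain_id,
        "step": step,
        "tool": tool,
        "args": args,
        "result_summary": result_summary,
        "timestamp": time.time(),
    }
    with open("tool_call_audit.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# One chain_id per user request ties the steps together for replay
chain_id = str(uuid.uuid4())
log_tool_call(chain_id, 1, "get_sales_data",
              {"drug_id": "A123", "region": "NE"}, "2,000 rows returned")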

Failure Mode 3: Permission Escalation

If a tool interface’s permissions are scoped too broadly, or if tool descriptions are imprecise, an Agent may be induced — whether by unintentional prompt ambiguity or deliberate prompt injection — to access data it shouldn’t or execute operations it shouldn’t.

Mitigations:

  1. Apply the principle of least privilege to every tool — each tool should have only the permissions necessary to perform its specific function, nothing more
  2. Strictly separate read-only tools from write-operation tools; write-operation tools must pass through an independent review layer
  3. Implement access control at the MCP layer, not only at the application layer
  4. Maintain audit logs of tool calls to satisfy compliance requirements

Human-in-the-Loop: Standard Architecture in High-Regulation Industries

In pharma, finance, and insurance, the cost of the failures above often extends beyond the technical domain — into regulatory compliance, patient safety, and financial loss. Human-in-the-loop isn’t a patch added after the fact; it’s part of the architecture from the start.

In practice, I’d recommend distinguishing between two classes of operations:

  • Read operations (data retrieval, analysis generation): Allow the Agent to execute autonomously, with post-hoc review available
  • Write operations (sending communications, updating records, pushing data): Always require a Draft & Review interface with human confirmation before execution — a minimal gating pattern is sketched after this list
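
One way to encode that split in the dispatch layer — the tool names, the in-memory queue, and the approval flow are all illustrative assumptions; a real system would persist drafts and enforce reviewer identity:

# Sketch of read/write separation with a Draft & Review gate;
# tool names and the in-memory queue are assumptions.
import uuid
from dataclasses import dataclass
from typing import Callable

READ_TOOLS = {"get_sales_data", "calculate_kpi"}    # execute autonomously
WRITE_TOOLS = {"send_email", "update_crm_record"}   # always gated

@dataclass
class PendingDraft:
    draft_id: str
    tool_name: str
    args: dict
    status: str = "pending"  # -> "approved" or "rejected"

DRAFT_QUEUE: dict[str, PendingDraft] = {}

def dispatch(name: str, args: dict, execute: Callable[[str, dict], object]):
    if name in READ_TOOLS:
        return execute(name, args)  # autonomous, with post-hoc review via logs
    if name in WRITE_TOOLS:
        draft = PendingDraft(str(uuid.uuid4()), name, args)
        DRAFT_QUEUE[draft.draft_id] = draft
        return f"Draft {draft.draft_id} is awaiting human approval"
    raise PermissionError(f"Unregistered tool: {name}")

def approve(draft_id: str, execute: Callable[[str, dict], object]):
    """Called from the Draft & Review UI — the human keeps the last word."""
    draft = DRAFT_QUEUE.pop(draft_id)
    draft.status = "approved"
    return execute(draft.tool_name, draft.args)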

IX. L3’s Ceiling: What It Can’t Do

Understanding an architecture’s limits matters as much as understanding its capabilities. The following are structural constraints of single-Agent L3 architecture — not engineering problems that can be optimized away.

Limit 1: Cannot Autonomously Coordinate Parallel Cross-System Tasks

When a business request requires simultaneously calling multiple independent systems and aggregating the results, single-Agent L3 hits an efficiency wall.

Take “resolve an anomalous refund case”: if the Agent needs to verify the financial transaction (finance system), check the shipping status (logistics system), and review the customer’s history (CRM) all at once, L3 can only do this serially — query finance, wait for the result, query logistics, wait for the result, then query CRM. Response time scales linearly with the number of tasks.

This kind of scenario is better suited to the parallel multi-Agent orchestration architecture of L4.

Limit 2: Cannot Handle Tasks Where the Execution Path Can’t Be Enumerated in Advance

L3 tool calling depends on a predefined toolbox. That means if an Agent encounters a genuinely novel business scenario with no corresponding tool, it can’t improvise a new execution path on the fly.

Put differently: L3 excels at flexible combination within known boundaries, but cannot autonomously explore outside those boundaries. When a business requires the Agent to face a problem type it’s never seen before and independently find a solution, L3 isn’t up to the job.

Limit 3: Limited Capacity for Complex Role and Task Coordination

When a complex task requires true division of labor across specialized roles — say, a medical information Agent handles data, a compliance Agent handles review, and a formatting Agent handles output — the single-Agent L3 architecture struggles to absorb the inherent tension of those different responsibilities. Cramming all that logic into one Agent leads to toolbox bloat, system prompt sprawl, and a debugging nightmare.


All three limits point to the same underlying problem: when task complexity exceeds “a single Agent autonomously scheduling within a predefined toolbox,” a new architecture is needed — one where multiple specialized Agents divide the work, coordinated by an orchestration layer. That’s exactly what the L4 collaborative orchestration architecture is designed to address, and it’s the subject of the next post.


X. Closing

L3 marks the evolution of Agents from “verbose task encyclopedias” to “digital assistants that can actually get things done within controlled boundaries.” But that evolution has to be built on a clear-eyed understanding of the risks and limits involved.

In the scenarios I’ve described in this post, the L3 systems that continue to run reliably over time are rarely the most “intelligent” designs — they’re the ones where the toolbox boundaries are clear, the process is transparent to users, human confirmation is preserved before high-stakes operations, and the system integrates seamlessly with existing IT assets.

As Anthropic’s engineering team put it in Building Effective Agents: the success of an LLM application isn’t about building the most complex system — it’s about building the right system for the current need.8

When your business requirements exceed what “a single Agent scheduling within predefined tools” can handle — when tasks start requiring multi-role parallel collaboration and dynamic cross-system coordination — you’ve reached L3’s ceiling. And that means you’re standing at the door of L4.


Footnotes

  1. Anthropic. Tool Use Overview. Claude API Documentation. https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview

  2. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR). arXiv:2210.03629

  3. Anthropic. Programmatic Tool Calling. Claude API Documentation. https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling

  4. Anthropic Engineering. (2025). Writing Effective Tools for AI Agents. https://www.anthropic.com/engineering/writing-tools-for-agents

  5. Anthropic. (2024, November). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol

  6. Wikipedia. Model Context Protocol. https://en.wikipedia.org/wiki/Model_Context_Protocol

  7. Agentic AI Foundation / Linux Foundation. (2025, December). MCP Donated to Linux Foundation. Derived from: Anthropic MCP Roadmap. https://modelcontextprotocol.io/development/roadmap

  8. Schluntz, E., & Zhang, B. (2024, December). Building Effective Agents. Anthropic. https://www.anthropic.com/research/building-effective-agents

Written by
Lusan

Thinking and creating at the intersection of data, decision-making, and design.

Series 03 · Agent Pragmatics: From Models to Engineered Systems