AI Product 22 min read Published 2026-04-22

L1 & L2: Teaching AI to Be Precise, Not Just Prolific

This post breaks down three retrieval failure modes in L1/L2 architectures, the engineering behind routing determinism, and one of the most underrated system capabilities out there — knowing when to refuse.

Author Lusan
Author's Note

In the introductory post of this series, I laid out a custom L1–L5 architecture spectrum for describing how LLM applications evolve from “passive generation” toward “autonomous exploration.” This isn’t an industry-standard taxonomy — it’s an analytical tool for making architecture decisions. Specifically, it’s designed to help answer the one question that matters most: how much autonomy does this use case actually require?

This post focuses on the first two levels of that spectrum:

  • L1 (Basic Responder): The LLM generates passively, with all flow control handled by external code. The canonical architecture is Prompt + RAG (Retrieval-Augmented Generation: instead of relying purely on what the model has memorized, the system first searches a knowledge base you control, then generates a response grounded in what it finds).
  • L2 (Router): The LLM identifies intent and routes requests to predefined execution paths. The canonical architecture is Intent Classification + Router.

As I argued in the intro post, when a system only needs to talk (output information), has clear task boundaries, and can operate within a 3-second latency window — L1/L2 is usually the right answer.

But “the right answer” and “will work automatically” are very different things. In high-stakes industries like pharma, finance, and legal compliance — where data accuracy requirements are unforgiving — a poorly designed L1/L2 system will fail in two predictable ways: retrieval precision breakdown and unpredictable routing behavior.

This post walks through the root causes of both, the engineering approaches to address them, and how UI design fits into the picture. It closes with an honest look at the structural ceiling of L1/L2 — understanding that ceiling is what lets you make a principled decision about when L3 is actually warranted.

Author's Note

Most examples in this post come from the pharmaceutical industry — regulatory submission analysis, compliance document retrieval, SOP lookup, and so on. That’s a reflection of where I currently work. But the L1/L2 architecture patterns themselves are general-purpose. As long as your use case satisfies “clear task boundaries, information output only, no dynamic multi-step planning,” the thinking here applies equally to an e-commerce FAQ system, a SaaS help center, or an internal knowledge base. High-regulation industries just turn up the dial on precision and predictability requirements, making design flaws harder to hide. But those requirements are ones that any serious AI product should be thinking about anyway.


Getting the Definitions Right: What L1 and L2 Do (and Don’t) Do

Before we get into optimization, it’s worth being precise about the capability boundaries of each level. Getting this wrong at the start means your architecture decisions are already on shaky ground.

L1: Context-based Generation

The core capability of L1 is generating answers within a given context. That context might come from the model’s training knowledge, but in enterprise settings it’s more commonly sourced through external retrieval (RAG).

The critical constraint: L1 does not modify external state in any business system. No database writes, no workflow triggers, no API calls with side effects. It’s a read-only information processing layer — its capability boundary is “read, compare, summarize, present.”

L2: Static Routing

L2 adds an intent recognition layer on top of L1. It can identify what the user is trying to do and route the request to a predefined execution path. Those paths are hardcoded — compliance queries go down Path A, report exports go down Path B, pricing lookups hit API C.

The critical constraint: L2’s planning capability is static. It handles if-else branching just fine, but the moment a task requires “plan the next step based on what the last step returned” or “dynamically adjust strategy mid-execution,” L2 breaks down. That’s not an implementation flaw — it’s a design-level boundary.


A Common Trap: Mismatching the Level to the Problem

Before getting into optimization, there’s a recurring engineering mistake worth calling out:

          
graph LR
  A[System starts at L1] --> B[Accuracy isn't good enough]
  B --> C[Keep piling on RAG improvements]
  C --> D[Still not working]
  D --> E[Jump straight to Multi-agent L4]

        
Process Flow

The problem with this pattern isn’t the level-skipping per se — it’s giving up on L1 before the foundational design is actually right. Prematurely introducing multi-agent orchestration just buries the original defects inside a more complex system where they’re harder to diagnose. Anthropic’s engineering team, after analyzing a large number of customer deployments, put it plainly: find the simplest solution that solves the problem, and only add complexity when you genuinely need it1.

In most high-regulation use cases, the simplest solution is a well-designed L1/L2 — not a more sophisticated agent system.


L1 Architecture: Three Ways Retrieval Breaks Down

RAG is the core mechanism of L1. It solves the model knowledge staleness problem, but in high-accuracy business scenarios, it breaks down in three distinct ways. Understanding each failure mode is what tells you where to intervene.

Failure Mode 1: Chunking Destroys Semantic Integrity

The standard RAG pipeline splits documents into chunks (when building a knowledge base, RAG systems first break long documents into smaller pieces before indexing them; each piece is a chunk, and at query time the system retrieves the most relevant chunks rather than the full document), vectorizes them, stores them in an index, and retrieves the highest-similarity chunks at query time. The problem is the act of splitting itself.

When text is cut at fixed token boundaries, a policy clause from an annual compliance report might end up split across three separate chunks. Each fragment is grammatically valid on its own, but only means something when read together. The RAG system, operating at the chunk level, has no way to reconstruct that cross-chunk logical chain.

A comparative study in the clinical decision support domain found that replacing fixed-token chunking with semantically-aware Adaptive Chunking — on the same RAG pipeline — improved accuracy from 50% to 87%, with significant gains in retrieval relevance as well2. A gap of that size makes chunking strategy a design decision that can make or break accuracy, not a configuration detail you can set-and-forget.

Practical improvement directions include: semantic chunking (splitting at meaning boundaries rather than character counts), proposition-based chunking (decomposing text into standalone factual statements, each self-contained), and hierarchical indexing (building a summary layer above the chunk level, so retrieval works like navigating a table of contents before drilling into content)3.
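To make the first of these directions concrete, below is a minimal sketch of semantic chunking: close a chunk wherever adjacent sentences stop being semantically similar, instead of cutting at a fixed character count. The embed() function is a placeholder for whatever embedding model you use, and the 0.75 threshold is purely illustrative.

import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Placeholder for your embedding model call (hosted API or local encoder)."""
    raise NotImplementedError

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, starting a new chunk when the next sentence
    is no longer semantically similar to the previous one."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        sim = float(np.dot(prev_vec, vec) / (np.linalg.norm(prev_vec) * np.linalg.norm(vec)))
        if sim < threshold:  # meaning boundary: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks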

Failure Mode 2: Domain Semantic Drift Distorts the Vector Space

The underlying logic of vector search is that semantically similar text ends up closer together in the vector space. But “semantically similar” is defined by the embedding model — and most general-purpose embedding models were trained on general-purpose corpora.

Vector search works by translating text into arrays of numbers (vectors), where semantically similar text lands closer together in that numerical space — like placing every word on a map organized by meaning, where “apple” and “fruit” are neighbors, but “apple” and “rocket” are far apart. When you run a query, the system finds the content on that map closest to your question.

In pharma, terms like “indication,” “contraindication,” and “approval pathway” are often poorly represented in a general-purpose vector space. A query like “What are the criteria for orphan drug designation?” might retrieve documents about orphanages rather than regulatory policy — not because the model is broken, but because its training data didn’t cover this semantic neighborhood.

This is a systemic blind spot in RAG: the semantic gap between the embedding model and your domain. The way to address it is fine-tuning the embedding model on domain-annotated data, so its vector representations more accurately reflect the semantic structure of your field. The Prompt Engineering Guide makes this point explicitly: when working in a specialized vertical, fine-tuning the embedding model is typically a required step, not an optional enhancement4.
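As a rough illustration of what that fine-tuning can look like in practice, here is a minimal sketch using the sentence-transformers library with in-batch negatives. The base model name, the example pairs, and the hyperparameters are all illustrative; the real requirement is a set of domain-annotated (query, relevant passage) pairs.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Domain-annotated (query, relevant passage) pairs -- a handful shown for illustration.
pairs = [
    ("criteria for orphan drug designation", "Orphan drug designation applies when the condition affects ..."),
    ("contraindications for compound X", "Compound X is contraindicated in patients with ..."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative base model; start from your own
train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: each query is pulled toward its own passage and away from the others'.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("domain-tuned-embedding-model")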

Failure Mode 3: Multi-hop Reasoning Exceeds Single-Pass Retrieval

The first two failure modes can be mitigated through better chunking and embedding quality. The third one runs into a structural limitation of RAG itself.

Some questions simply can’t be answered with a single retrieval pass — they require compositional reasoning across multiple documents and sections. For example: “Which policy clauses in our compliance reports underwent substantive changes between 2023 and 2024?”

To answer that, the system needs to: ① retrieve the relevant clauses from 2023, ② retrieve the relevant clauses from 2024, ③ understand the differences, and ④ determine whether those differences are “substantive.” That’s a multi-hop reasoning chain, and the standard RAG pattern of “one retrieval, one generation” structurally cannot support it.

We’ll come back to this failure mode at the end of the post, because it points directly to the structural ceiling of L1.


L1 Optimization: Hybrid Retrieval and Source Traceability

Once you understand the three failure modes, the optimization directions become clear. The two strategies below have proven effective in high-regulation scenarios — with one important caveat: they augment RAG, they don’t replace it. Vector search remains the core mechanism for fast candidate retrieval.

Strategy 1: Hybrid Retrieval

The basic idea is to combine two complementary retrieval methods:

  • Keyword search (sparse retrieval, e.g., BM25): Exact keyword matching — similar to a traditional search engine, it finds documents containing the same words rather than semantically similar ones. Highly reliable for precise terminology and exact numerical values.
  • Dense retrieval: Vector-based semantic search, which handles paraphrased queries and varied phrasing well.

The recommended approach for high-regulation scenarios: use dense retrieval to quickly surface a semantically relevant candidate set, then use BM25 or structured indexes to verify precise terminology and numerical values. Research shows that hybrid retrieval can improve retrieval accuracy and ranking precision by up to 13.1% in domain-specific RAG deployments5.
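One straightforward way to merge the two result lists is reciprocal rank fusion, sketched below. dense_search() and bm25_search() are hypothetical stand-ins for your two retrievers; each is assumed to return document IDs in ranked order.

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one combined ranking.
    Documents ranked highly by either retriever float toward the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# dense_search() and bm25_search() are stand-ins for the two retrieval calls.
candidates = reciprocal_rank_fusion([dense_search(query), bm25_search(query)])

The precise-terminology and numerical-value verification described above would then run over this fused candidate set rather than over the raw vector hits.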

For use cases involving structured long-form documents like tables and annual reports, building a hierarchical index on top of the chunk layer is also worth considering: store data summaries at each section node, then retrieve by first locating the section before drilling down. This more closely mirrors how a human would actually navigate a document, rather than blindly scanning fragments3.

Strategy 2: Source Traceability

There’s an easy-to-overlook user behavior in enterprise high-regulation systems: users don’t just need answers — they need to know where the answers came from.

In compliance review, regulatory submission evaluation, and similar workflows, the AI-generated summary is only the first step. Users then need to verify the source, confirm the wording is accurate, and decide whether to act on it. If the system can’t trace each output to a specific document, section, and page number, users have to do that manually. That’s not just an efficiency issue — in strictly regulated environments, it’s a compliance risk.

Source traceability isn’t a nice-to-have. In high-regulation industries, it’s a baseline requirement for L1 systems:

  • Every generated conclusion must be tied to a source document, section, and page number (and where possible, paragraph or table ID).
  • When a conclusion draws on multiple sources, the system should flag any contradictions between them.
  • When no supporting source can be found, the system should refuse to generate rather than interpolate.

This is what transforms an L1 system from “a model that gives you answers” into a verifiable information gateway — and that repositioning is the foundation of building user trust in regulated industries.
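One way to make those three requirements non-negotiable is to bake them into the response schema itself, so a conclusion without a source simply cannot be constructed. A minimal sketch, with illustrative field names:

from dataclasses import dataclass, field

@dataclass
class SourceRef:
    document_id: str
    section: str
    page: int
    paragraph_id: str | None = None  # where available: paragraph or table ID

@dataclass
class TracedAnswer:
    conclusion: str
    sources: list[SourceRef]                            # must be non-empty
    conflicts: list[str] = field(default_factory=list)  # flagged contradictions between sources

    def __post_init__(self) -> None:
        if not self.sources:
            # No supporting source found: refuse rather than interpolate.
            raise ValueError("Refusing to emit a conclusion without at least one source.")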


L2 Architecture: Engineering Routing Determinism

If L1 is about finding the right information, L2 is about processing the right request down the right path.

The central challenge in L2 architecture is “determinism” — but that word needs precise engineering definitions, or it becomes misleading.

Three Dimensions of Determinism

  • Reproducibility: The same input, queried at any time, consistently resolves to the same document or data source. This depends on index stability and version management — not on the LLM itself.
  • Routing consistency: The same user intent always travels the same logical path. This depends on classifier stability and explicitly defined routing rules.
  • Output structuring: Even at temperature=0 (temperature controls the randomness of LLM outputs: at 0 they are theoretically most deterministic; at 1, more creative but less stable; and even at 0, different model versions can produce different phrasing), outputs can drift across model versions. Real determinism comes from constraining outputs to a predefined structure — JSON schemas, enumerated options, format templates — with a downstream rule-validation layer. In practical terms: restrict what the AI is allowed to output (e.g., only “yes / no / uncertain”), then validate that output with hard rules downstream, as in the sketch after this list. Don’t rely on the model to produce perfectly consistent free-form text.
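As referenced above, a minimal sketch of the constrain-then-validate pattern: the model may only emit one of three enumerated values, and anything outside that set is rejected by a hard rule downstream. The enum mirrors the yes / no / uncertain example; everything else is illustrative.

from enum import Enum

class Verdict(str, Enum):
    YES = "yes"
    NO = "no"
    UNCERTAIN = "uncertain"

def validate_verdict(raw_output: str) -> Verdict:
    """Downstream rule-validation layer: accept only the enumerated options."""
    normalized = raw_output.strip().lower()
    try:
        return Verdict(normalized)
    except ValueError:
        # Anything outside the schema is treated as a failed generation,
        # never passed through silently to business logic.
        raise ValueError(f"Model output {raw_output!r} is outside the allowed schema")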

The Two-Layer Architecture of L2 Routing

L2 routing is a two-layer pipeline, not two parallel strategies.

Layer 1: Intent Classifier

Maps user input to predefined intent categories. The design principle here: use a lightweight, high-determinism classification model — don’t hand intent recognition back to the same generative LLM.

Typical classification granularity:

  • Open-ended analysis requests (e.g., “analyze competitor activity”) → route to L1 retrieval flow
  • Structured data queries (e.g., “export this quarter’s compliance report”) → call the corresponding API directly
  • Out-of-scope requests (e.g., queries about a competitor’s non-public internal data) → intercept and prompt user to clarify
  • Compliance-sensitive requests (e.g., questions containing regulatory restricted terminology) → force into human review path

Layer 2: Routing Strategy Based on Classification

Routing execution only begins after intent classification. The core principle here: use hardcoded rules for critical business paths — don’t let the LLM decide.

For instance, if a user queries the appropriate-use guidelines for a specific drug, L2 routing should direct them to the human-reviewed SOP document library — not to a general retrieval flow. The authority requirements for that type of information leave no room for imprecision.
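A minimal sketch of how the two layers fit together: the classifier produces an intent label, and a hardcoded dispatch table, not the LLM, decides what happens next. The intent names, handler functions, and classify_intent() are all hypothetical stand-ins.

# All handlers and classify_intent() are hypothetical stand-ins for your own components.
ROUTES = {
    "open_analysis": run_l1_retrieval,            # Path A: L1 retrieval flow
    "structured_query": call_reporting_api,       # Path B: direct API call
    "compliance_sensitive": enqueue_human_review, # forced human review path
}

def handle_request(user_input: str):
    intent = classify_intent(user_input)  # lightweight classifier, not the generative LLM
    handler = ROUTES.get(intent)
    if handler is None:
        # Out-of-scope or ambiguous: intercept and ask the user to clarify.
        return clarification_prompt(user_input)
    return handler(user_input)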

          
graph TD
  Input([User Input]) --> Classifier[Intent Classifier]
  Classifier -- "Intent score / category" --> Router[Routing Rule Layer]
  
  Router -- "A" --> L1[Invoke L1 retrieval flow]
  Router -- "B" --> API[Call API directly]
  Router -- "C" --> Audit[Human review queue]
  Router -- "Reject" --> Reject[Return guidance response, ask user to clarify]

        
Process Flow

Knowing When Not to Answer: The Most Underrated System Capability

At the L2 level, the most important control logic isn’t “how to answer” — it’s “when not to answer.”

Researchers call this capability “abstention.” A systematic review published in MIT Press defines it as the behavior of an LLM refusing to provide an answer, with the noted potential to reduce hallucinations and improve safety6. The research further shows that in high-risk domains — healthcare, law, compliance — the damage from an overconfident but wrong answer far exceeds the cost of clearly saying “I don’t know.”

Practically, an L2 system’s refusal capability can be implemented across three layers:

Layer 1: Rule-based Blocking

Hardcode the clearly known boundary conditions. Example: if a query involves a competitor’s non-public financial data, or a time period outside the database coverage, intercept immediately and return a standardized guidance message (“This system covers data from 2020–2024. Your query falls outside that range — please confirm the time period.”).

This is the most reliable and lowest-cost refusal mechanism, suited for boundary conditions that can be exhaustively enumerated.
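A minimal sketch of that kind of hardcoded boundary check, using the time-range example above. The coverage window and message wording are illustrative.

DATA_COVERAGE = range(2020, 2025)  # years the knowledge base actually covers

def check_time_boundary(requested_year: int) -> str | None:
    """Return a standardized guidance message if the request is out of bounds, else None."""
    if requested_year not in DATA_COVERAGE:
        return ("This system covers data from 2020-2024. Your query falls outside "
                "that range. Please confirm the time period.")
    return None  # within coverage: let the request proceed to retrieval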

Layer 2: Retrieval Confidence Threshold

Every RAG retrieval produces a similarity score. When the top result still falls below a set threshold, it means the system couldn’t find a reliable supporting source — at which point it should refuse to generate, rather than force an answer out of low-quality matches.

def answer(query: str) -> str:
    # retrieve() and generate() are the system's existing retrieval and generation calls.
    retrieval_results = retrieve(query)
    # Refuse when even the best match falls below the confidence threshold.
    if not retrieval_results or retrieval_results[0].score < CONFIDENCE_THRESHOLD:
        return ("No directly relevant materials were found in the knowledge base for this query. "
                "Please review the question or contact your data administrator.")
    return generate(query, retrieval_results)

Layer 3: Out-of-Scope Classifier

For boundary requests that can’t be enumerated by rules but can be identified semantically, train a dedicated classifier to determine “is this question within system scope?” Research indicates that when diverse out-of-scope query samples are available, a fine-tuned classifier performs best; when sample diversity is limited, rule-based blocking still yields solid results; combining both usually covers all scenarios7.
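A rough sketch of how the rules and the classifier might be combined in practice. scope_classifier() stands in for a fine-tuned binary classifier returning an in-scope probability, violates_hard_rules() for the Layer 1 checks, and the threshold is illustrative.

IN_SCOPE_THRESHOLD = 0.8  # illustrative; tune against a held-out set of boundary queries

def is_in_scope(query: str) -> bool:
    # Rule-based blocking first: cheap, reliable checks for enumerable boundaries
    # (competitor non-public data, out-of-range dates, restricted terminology, ...).
    if violates_hard_rules(query):
        return False
    # Then the fine-tuned classifier handles boundary cases rules cannot enumerate.
    return scope_classifier(query) >= IN_SCOPE_THRESHOLD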


UI/UX: Constraining Input to Reduce Uncertainty at the Source

Determinism at the architecture layer only works with the right frontend design to support it. For non-technical business users, good UI design is fundamentally about reducing input ambiguity — rather than having the system guess at user intent, have the interface guide users into expressing their intent clearly.

Three design patterns that work well in practice:

Pattern 1: Intent-Aware Form-Based Conversation

The idea: for predictable request types, replace the free-form prompt with a form whose fields map directly to the parameters the backend needs.

Trigger condition: The intent classifier identifies a highly structured intent type (e.g., “competitive analysis,” “compliance query”).

Interaction: The system proactively surfaces a parameter collection UI, where users select a drug name, time range, and analysis dimension from dropdowns — rather than writing a free-form natural language prompt.

Backend L2 integration: Each form field maps to a parameter slot in the routing rules. Submitting the form means intent classification and parameter extraction are already done; the backend goes directly to L1 retrieval or API invocation without an LLM inference step.

Best suited for: Tasks with highly predictable structure, business users without technical backgrounds, and situations where the system needs to operate within a specific parameter space.
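A minimal sketch of how a submitted form maps onto the routing layer’s parameter slots: every field is already structured, so nothing needs to be inferred. The field names and the downstream l1_retrieval_flow() call are illustrative.

from dataclasses import dataclass

@dataclass
class CompetitiveAnalysisForm:
    drug_name: str                # selected from a dropdown of standardized names
    time_range: tuple[int, int]   # (start_year, end_year)
    analysis_dimension: str       # e.g., "pricing", "approval status"

def handle_form(form: CompetitiveAnalysisForm):
    # Intent and parameters are fixed by the form itself; go straight to the
    # L1 retrieval flow with no LLM inference step in between.
    return l1_retrieval_flow(
        intent="competitive_analysis",
        drug=form.drug_name,
        years=form.time_range,
        dimension=form.analysis_dimension,
    )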

Pattern 2: Domain-Term Faceted Search Guidance

Trigger condition: The system detects possible term categories in real time as the user starts typing.

Interaction: A term suggestion list appears next to the input field, letting users select standardized terminology as they type — preventing non-standard variant spellings from ending up in the final query.

Backend L2 integration: Standardized input significantly reduces classifier ambiguity, while also improving vector retrieval precision at the L1 stage (standardized terms have more stable vector representations than free-form phrasing).

Best suited for: Terminology-dense domains (drug names, regulatory clause IDs, ICD-10 diagnosis codes) where users have varying familiarity with standard terminology.

Pattern 3: Refusal UX Design

How a system handles a refusal directly shapes user trust.

What not to do: Display “Unable to answer” and provide no path forward.

What to do instead: A refusal response should contain three things: ① explain the specific reason (“This query falls outside the current data coverage range,” not a vague “I can’t answer that”); ② guide the user toward a correction (“Try narrowing the time range to 2020–2024, or select from the following available data types”); ③ provide an escalation path (“If you need further assistance, contact [data support team]”).
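A minimal sketch of a refusal payload that carries all three parts, so the frontend never has to render a bare “unable to answer.” Field names are illustrative.

def build_refusal(reason: str, suggestion: str, escalation_contact: str) -> dict:
    """Refusal response with a specific reason, a corrective suggestion, and an escalation path."""
    return {
        "status": "refused",
        "reason": reason,                  # e.g., "This query falls outside the current data coverage range."
        "suggestion": suggestion,          # e.g., "Try narrowing the time range to 2020-2024."
        "escalation": escalation_contact,  # e.g., contact details for the data support team
    }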

The core logic here: refusal isn’t failure. A system that accurately knows its own limits is itself a demonstration of trustworthiness.


The Structural Ceiling of L1/L2: Three Problems They Can’t Solve

Every optimization above operates within the L1/L2 architectural framework. But there are three categories of business scenarios where L1/L2 will fail structurally — regardless of how well-designed the implementation is. This isn’t a quality-of-implementation problem; it’s a level-definition boundary.

Understanding these three failure patterns is the core basis for deciding whether to introduce L3.

Ceiling 1: Multi-hop Reasoning

Scenario: “Which compliance clauses underwent substantive changes between 2023 and 2024, and which of those directly affect our product line?”

Why L1 fails: This question requires comparative reasoning across multiple documents and time dimensions. The system needs to: identify relevant clauses from both years, compare them, assess whether the differences are “substantive,” and map those to specific products. That’s a reasoning chain of at least four steps.

The standard RAG “one retrieval, one generation” flow structurally supports exactly one retrieval and one generation pass. It can’t re-query based on intermediate outputs, and it can’t decompose a question into sub-problems to solve incrementally. RAGFlow’s engineering report makes this explicit: “multi-hop questions” are one of the scenarios where the semantic gap in RAG systems is most pronounced, and traditional retrieval methods show clear limitations here8.

What’s needed: The ability to decompose a problem into sub-tasks and execute them iteratively — which is the core capability of L3 (Tool Executor).

Ceiling 2: Dynamic Planning

Scenario: A user submits a market access consultation for a novel drug. Three different national regulatory frameworks are involved, and the requirements across those jurisdictions are mutually constraining.

Why L2 fails: There’s no way to write the processing path for this in advance. The system needs to first understand each country’s regulatory framework, identify the conflicts, and then decide what to query next — based on what it found. In other words, “what to do in step N” depends on “what step N-1 returned” — and that’s precisely what L2’s hardcoded routing can’t handle.

L2 can handle “if it’s a compliance query, take Path A.” It cannot handle “based on what the retrieval returned, decide in real time whether to take Path A or Path B.”

What’s needed: The ability to re-plan subsequent steps during execution, based on intermediate results — again, the core capability of L3.

Ceiling 3: Cross-system State Management

Scenario: A user wants to sync the key findings from an internal research report to three platforms simultaneously: CRM, the internal knowledge base, and the compliance review system.

Why L1/L2 fails: As established, L1 by definition doesn’t modify external system state, and L2’s execution paths are predefined. Neither can coordinate multiple external systems, maintain cross-system consistency, and dynamically determine next actions based on each system’s response.

This scenario doesn’t just require calling external tools (an L3 capability) — it also requires maintaining shared context state across multiple tools and handling partial failures (e.g., the CRM update succeeded but the compliance system timed out). That starts pushing into L4 (Multi-Agent Orchestrator) territory.


These three failure patterns draw the structural boundary for L1/L2: when a task requires dynamic reasoning across steps, cross-system state coordination, or path replanning based on intermediate results, L1/L2 architectures hit their limit. That’s the fundamental reason L3 exists — and we’ll go deep on it in the next post.


The Business Decision: Stay at L1/L2 or Upgrade?

In practice, the following decision framework helps evaluate whether a use case actually requires moving beyond L1/L2:

If all of the following are true, L1/L2 is the right call:

  • The task has clear correctness criteria (not dependent on subjective judgment)
  • The system only needs to output information — it doesn’t need to change external system state
  • Each query is independent — no need to accumulate intermediate results across steps
  • Latency requirements are under 3 seconds

If any of the following signals appear, seriously evaluate L3:

  • User questions require reasoning across multiple documents in combination
  • The system needs to decide what to retrieve in step 2 based on what step 1 returned
  • Valid responses require calling an external API and making judgments based on its return value
  • The scenario involves “think, then think again” — such as self-verification of answers

Accuracy problems should be addressed through the retrieval path first, not by scaling the model. If an L1 system’s answers aren’t accurate enough, the first thing to audit is chunking strategy, embedding quality, and retrieval logic — not swapping to a larger model or jumping to L3.


Closing: Determinism Is the Foundation of Trust — and Also Its Boundary

Let me return to a core argument from the introductory post: in high-regulation industries, an AI product’s success isn’t determined by how much it can do — it’s determined by how predictable it is.

Think of each traceable citation in an L1/L2 system as a “trust deposit”: the user sees the source, verifies the accuracy, and banks one unit of trust. By contrast, every time the system generates confidently without reliable supporting evidence, that’s a withdrawal. Enough withdrawals, and users start approaching every output with skepticism — at which point the business value of the system erodes from the ground up.

The design goal of L1/L2 is not to make the system look smarter. It’s to make system behavior stable, boundaries legible, and outputs trustworthy enough that users don’t have to constantly verify them. In regulated industries, that predictability is the entry ticket. In other contexts, it’s the foundation of user trust. The degree required varies — but the design direction is the same.

That said, predictability implies limits. When the complexity of the business requirement exceeds what static routing and single-pass retrieval can handle, forcing L1/L2 to stretch produces a system that looks stable but is actually brittle.

Acknowledging limits is a sign of architectural maturity. Recognizing where those limits lie is the starting point for L3.

Up next: L3 (Tool Executor). When a system starts deciding for itself which tools to invoke and in what order, the LLM shifts from “passive generator” to “active planner” — and that shift brings both genuine capability gains and engineering complexity you have to take seriously.

Footnotes

  1. Anthropic (2024). “Building Effective Agents.” https://www.anthropic.com/research/building-effective-agents

  2. Spasojevic et al. (2024). “Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support.” PMC. Compares fixed-token chunking vs. Adaptive Chunking accuracy in clinical decision support scenarios. https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/

  3. Gao et al. (2023). “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv:2312.10997. Appendix B contains a detailed description of hierarchical index structures. https://arxiv.org/pdf/2312.10997

  4. Prompt Engineering Guide: “RAG for LLMs.” Engineering recommendations on embedding model fine-tuning. https://www.promptingguide.ai/research/rag

  5. Hybrid retrieval experiment data cited from Moreira et al. (2024), referenced in arXiv:2506.00054 (RAG survey). Original source: Doan et al. (2024), “A lightweight hybrid retrieval strategy combining unstructured text embeddings with structured knowledge graph embeddings.” https://arxiv.org/html/2506.00054v1

  6. Wen et al. (2025). “Know Your Limits: A Survey of Abstention in Large Language Models.” Transactions of the Association for Computational Linguistics, vol. 13, pp. 529–556. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00754/131566/

  7. Ibid., Wen et al. (2025), “Scoping” section: “When diverse out-of-scope query samples are available, supervised fine-tuning (SFT) performs best; when sample diversity is limited, Circuit Breakers yield strong results; combining both typically captures the advantages of each.”

  8. RAGFlow Engineering Team (2024). “The Rise and Evolution of RAG in 2024 — A Year in Review.” https://ragflow.io/blog/the-rise-and-evolution-of-rag-in-2024-a-year-in-review

Written by
Lusan

Thinking and creating at the intersection of data, decision-making, and design.