Overview

  • Agentic design patterns are reusable ways to structure systems in which a language model does more than generate text. The model becomes part of a larger loop that observes context, chooses actions, uses tools, manages state, and keeps moving toward a goal. That shift, from text generation to goal-directed orchestration, is the central idea behind modern agent engineering. In this view, the model is not the whole product. It is the reasoning core inside a system that must also handle memory, tool access, control flow, communication, and failure recovery.

  • A useful mental model is to treat an agent system as an operating canvas. The canvas is the runtime environment that holds prompts, state, tools, external APIs, memory stores, and the logic that routes information from one step to the next. The important design question is therefore not only “which model should I call?” but also “how should the system be structured so the model can act reliably under uncertainty?” That is exactly where design patterns matter.

  • The reason patterns are so important is that single-shot prompting breaks down quickly as tasks become multi-step, tool-dependent, or long-running. Once a system must decompose work, retrieve facts, call APIs, maintain conversational state, coordinate specialists, or recover from partial failure, the architecture matters at least as much as the prompt. This is the same lesson the broader agent literature has converged on: performance improves when reasoning is interleaved with action, when external tools can be invoked, and when retrieved evidence augments model-only memory. ReAct by Yao et al. (2022) showed that alternating reasoning and acting improves multi-step task solving by letting the model update plans from observations. Toolformer by Schick et al. (2023) showed that models can learn when and how to call tools, which is foundational for practical agents. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020) established the now-standard idea that external retrieval can make generation more factual and updatable.

  • Each pattern in this primer is paired with hands-on implementations using LangChain, demonstrating how these concepts translate into real, executable systems. This bridges the gap between theory and practice by showing how agentic behaviors such as planning, tool use, memory, and coordination can be concretely realized in production-oriented frameworks.

  • At a high level, an agentic system can be described as a policy over actions conditioned on context. If we write the agent’s state at time \(t\) as \(s_t\), its chosen action as \(a_t\), and its objective as maximizing expected cumulative utility, then the design problem is often framed as:

\[a_t \sim \pi_\theta(a \mid s_t), \qquad \max_{\pi_\theta} \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]\]
  • This is not saying every agent in practice is trained end-to-end with reinforcement learning. Most production agents are not. Rather, it gives a clean way to think about what the system is doing: at each step it selects the next best action given the current state, available tools, and long-term objective. For the introductory patterns in this primer, no special loss function is central yet, because the focus is system structure rather than model training.
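  • As a worked instance of the objective above, a short calculation shows how the discount factor \(\gamma\) weights early rewards more heavily than late ones. The reward values here are purely illustrative:

```python
# Illustrative computation of the discounted return sum_t gamma^t * r_t
# for a short episode; the reward sequences are made-up numbers.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same total reward is worth less when it arrives later:
early = discounted_return([1.0, 0.0, 0.0])  # gamma^0 * 1.0 = 1.0
late = discounted_return([0.0, 0.0, 1.0])   # gamma^2 * 1.0 ~ 0.81
print(early, late)
```

  • The point is not that production agents compute this quantity explicitly, but that the discounting intuition (prefer actions that pay off sooner, all else equal) shapes how action-selection heuristics are designed.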

Why are agentic systems needed?

  • Modern AI systems reached a point where generating high-quality text is no longer the bottleneck. The real limitation lies in reliably solving complex, multi-step, real-world problems. A standalone large language model can produce fluent answers, but it struggles when tasks require persistence, external interaction, or adaptive decision-making. This gap is precisely why agentic systems are needed.

  • At their core, real-world problems are not single-shot queries. They are processes. They involve gathering information, making intermediate decisions, interacting with external systems, and iteratively refining outcomes. A static prompt-response model cannot sustain this kind of workflow because it lacks continuity, structured control, and the ability to act.

  • Agentic systems address this by transforming the model into part of a loop rather than a terminal endpoint. Instead of producing a single output, the system continuously updates its understanding and actions:

\[\text{goal} \rightarrow \text{perception} \rightarrow \text{reasoning} \rightarrow \text{action} \rightarrow \text{feedback} \rightarrow \text{updated state}\]
  • This loop directly mirrors a five-step operational cycle: the agent receives a mission, gathers context, plans, acts, and improves over time.
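  • The five-step cycle can be sketched as a plain control loop. Everything below is illustrative scaffolding: the `perceive`, `reason`, and `act` functions are hypothetical stand-ins for what would be model calls and tool invocations in a real system.

```python
# A minimal, illustrative agent loop: goal -> perception -> reasoning ->
# action -> feedback -> updated state. The step functions are toy
# placeholders, not a real model or tool layer.
def run_agent(goal, max_steps=5):
    state = {"goal": goal, "history": [], "done": False}
    for _ in range(max_steps):
        observation = perceive(state)              # gather context
        action = reason(state, observation)        # choose the next action
        result = act(action)                       # execute it
        state["history"].append((action, result))  # feedback updates state
        if result == "goal_reached":
            state["done"] = True
            break
    return state

# Toy implementations so the loop actually runs:
def perceive(state):
    return len(state["history"])

def reason(state, observation):
    return "finish" if observation >= 2 else "work"

def act(action):
    return "goal_reached" if action == "finish" else "progress"

final = run_agent("demo goal")
print(final["done"], len(final["history"]))
```

  • The essential property is that state persists across iterations and each decision conditions on what has already happened, which is exactly what a single prompt-response call cannot do.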

  • The following figure illustrates agentic AI as an intelligent assistant that learns continuously through experience, operating via a straightforward five-step loop to accomplish tasks.

The limitations of non-agentic systems

  • Traditional LLM-based applications fail in predictable ways when pushed beyond simple tasks:

    • They cannot maintain state across multiple steps without manual orchestration
    • They lack access to real-time or external information unless explicitly integrated
    • They do not inherently plan or decompose problems
    • They cannot act in the environment (e.g., call APIs, update systems)
    • They cannot improve through feedback within a task
  • This leads to brittle systems that perform well in demos but degrade quickly in production scenarios.

  • Research has consistently highlighted these gaps. For example, ReAct by Yao et al. (2022) demonstrated that combining reasoning with actions significantly improves performance on multi-step tasks by allowing models to update their strategy based on observations. Similarly, Toolformer by Schick et al. (2023) showed that models become far more capable when they can decide when to use external tools. These works reinforce a key idea: intelligence in practical systems emerges not just from reasoning, but from structured interaction with the environment.

The need for goal-directed behavior

  • Agentic systems are needed because real applications are goal-driven rather than query-driven. Instead of answering “What is X?”, systems must achieve objectives like:

    • Resolve a customer issue end-to-end
    • Plan and execute a workflow
    • Monitor and react to changing conditions
    • Coordinate multiple steps across systems
  • This shift requires systems that can operate autonomously toward a goal, rather than simply responding to inputs.

  • Formally, this aligns with decision-making under uncertainty, where the system must choose actions that maximize long-term success:

\[a_t \sim \pi(a \mid s_t), \quad \max \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]\]
  • Even when not explicitly trained with reinforcement learning, agentic systems implicitly approximate this process by iteratively selecting actions that move closer to a goal.

The need for interaction with the external world

  • Another critical limitation of standalone models is that they are closed systems. They rely entirely on pretraining data and cannot:

    • Access up-to-date information
    • Perform real operations (e.g., database queries, transactions)
    • Verify outputs against external sources
  • Agentic systems solve this by incorporating tool use and retrieval. This is why approaches like Retrieval-Augmented Generation by Lewis et al. (2020) are foundational. They allow systems to ground their outputs in real data, reducing hallucinations and enabling dynamic knowledge access.

  • In practice, this turns the model into a coordinator rather than a knowledge container.
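  • A minimal sketch of this coordinator role, with an in-memory corpus standing in for a real vector store. The retrieval here is naive keyword matching and the "generation" is templating; both are simplifications of what an embedding-based retriever and an LLM would do.

```python
# Illustrative retrieval-augmented flow: retrieve evidence first, then
# condition the answer on it, refusing rather than guessing when nothing
# relevant is found.
CORPUS = {
    "refunds": "Refunds are processed within 5 business days.",
    "shipping": "Standard shipping takes 3-7 days.",
}

def retrieve(query):
    # Toy relevance: return documents whose key appears in the query.
    return [text for key, text in CORPUS.items() if key in query.lower()]

def answer(query):
    evidence = retrieve(query)
    if not evidence:
        return "I don't have information on that."  # refuse, don't hallucinate
    # A real model would synthesize here; we just ground the reply in evidence.
    return "Based on our records: " + " ".join(evidence)

print(answer("How do refunds work?"))
```

  • The design choice worth noting is the explicit refusal path: grounding is only useful if the system declines to answer when retrieval comes back empty.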

The need for adaptability and feedback

  • Real environments are dynamic. Requirements change, inputs are noisy, and intermediate steps often fail. Non-agentic systems lack mechanisms to adapt mid-execution.

  • Agentic systems introduce:

    • Feedback loops that allow correction
    • Reflection mechanisms that improve outputs
    • Memory that accumulates knowledge across steps
  • This is essential for robustness. Without these capabilities, systems cannot recover from errors or improve performance within a task.
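  • A sketch of such a feedback loop, with a validator gating the output and each critique folded back into the next attempt. The `generate` and `validate` functions are hypothetical stand-ins for a model call and a programmatic (or model-based) check.

```python
# Illustrative reflection loop: generate, validate, and retry with the
# accumulated critiques until the output passes or the budget is spent.
def generate(task, critiques):
    # Stand-in for a model call; "improves" once it has seen a critique.
    return "draft with citation" if critiques else "draft"

def validate(output):
    # Stand-in for a quality check; returns (passed, critique).
    return (True, "") if "citation" in output else (False, "missing citation")

def reflect_loop(task, max_attempts=3):
    critiques = []
    for attempt in range(1, max_attempts + 1):
        output = generate(task, critiques)
        ok, critique = validate(output)
        if ok:
            return output, attempt
        critiques.append(critique)  # feedback accumulates across attempts
    return output, max_attempts     # best effort after exhausting retries

output, attempts = reflect_loop("summarize the paper")
print(output, attempts)
```

  • The bounded attempt budget matters: without it, a validator the model cannot satisfy turns the feedback loop into an infinite retry cycle.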

The need for scalable complexity

  • As tasks grow in complexity, a single monolithic reasoning step becomes inefficient and unreliable. Breaking problems into smaller steps, coordinating multiple components, and distributing responsibilities becomes necessary.

  • Agentic systems enable this by:

    • Decomposing tasks into manageable units
    • Coordinating multiple specialized components
    • Supporting parallel and sequential execution
  • This naturally leads to more advanced architectures such as multi-agent systems, where different agents handle distinct roles and collaborate toward a shared goal.
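  • A toy sketch of this division of labor, with specialist "agents" reduced to plain functions and an orchestrator that dispatches sub-tasks in sequence. All names and the hard-coded plan are illustrative; a real planner would derive the decomposition from the task itself.

```python
# Illustrative decomposition and coordination: an orchestrator splits work
# across specialists and threads each result into the next step.
def researcher(subtask):
    return f"facts about {subtask}"

def writer(material):
    return f"report drawing on {material}"

SPECIALISTS = {"research": researcher, "write": writer}

def orchestrate(task):
    # Hard-coded plan for illustration: research first, then write using
    # the researcher's output.
    plan = [("research", task), ("write", None)]
    result = None
    for role, arg in plan:
        result = SPECIALISTS[role](arg if arg is not None else result)
    return result

print(orchestrate("agent memory"))
```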

Why patterns matter

  • Patterns matter because agent systems fail in recurring ways. They lose context, over-call tools, forget intermediate results, mis-handle branching logic, or produce brittle behavior when the environment changes. Reusable patterns help by decomposing these recurring problems into standard solutions: prompt chaining for staged reasoning, routing for specialization, parallelization for throughput, reflection for self-critique, tool use for external action, planning for long-horizon tasks, memory for continuity, guardrails for safety, and evaluation for observability.

  • This is also why frameworks matter. Frameworks are not the intelligence. They are the scaffolding that makes intelligence operational. The LangChain documentation positions LangChain as an integration and agent framework, while the LangGraph documentation emphasizes stateful, long-running workflows. That division is important: LangChain is convenient for composition, while LangGraph becomes especially useful once your agent needs explicit state transitions, branching, retries, or human checkpoints.

The architectural shift

  • The most important conceptual shift is that the model is no longer the application boundary. In earlier LLM applications, the prompt itself effectively defined the system. In agentic systems, the prompt becomes just one component within a broader orchestration layer that manages state, tools, and control flow.

  • Rather than relying on a single forward pass of reasoning, agentic systems operate as structured, iterative processes. The system continuously evaluates its current context, selects an action, executes it, and updates its internal state before proceeding. This introduces continuity and adaptability that static prompt-based systems fundamentally lack.

  • This shift enables several critical capabilities:

    • Stateful execution: Intermediate outputs, decisions, and context are preserved across steps instead of being recomputed from scratch
    • Adaptive decision-making: The system can revise its approach dynamically based on new observations or tool outputs
    • Composability: Complex tasks can be decomposed into smaller, modular units that can be independently improved and reused
    • Resilience: Failures are no longer terminal; the system can retry, branch, or escalate when needed
  • These capabilities reflect the cyclical character of agentic systems, where an agent progresses through cycles of understanding, planning, acting, and refining its behavior over time.

  • From a systems perspective, this means that intelligence is no longer a single computation but an emergent property of coordinated interactions between components. The language model provides reasoning, but the surrounding system provides structure, memory, and execution.

  • This architectural framing also explains why many agentic patterns exist. Each pattern addresses a specific challenge introduced by this shift. For example:

    • Prompt chaining structures multi-step reasoning
    • Routing enables specialization across tasks
    • Tool use connects reasoning to real-world actions
    • Reflection introduces self-correction
    • Planning supports long-horizon objectives
  • Instead of embedding all logic inside a single prompt, these patterns distribute responsibility across a controlled workflow. The result is a system that is easier to debug, extend, and scale.

  • The key takeaway is that once you move from single-step generation to iterative, goal-driven execution, architecture becomes the dominant factor in system performance. The model is still essential, but it is no longer sufficient on its own.

Practical implications for builders

  • For a practitioner, the immediate implication is that reliability comes more from architecture than from prompt cleverness alone. A strong system usually does four things well:

    • It controls context. The model should only see the information needed for the current decision. Too little context causes blind reasoning, while too much causes distraction and degraded instruction following.

    • It makes action explicit. A model should not merely suggest what to do when the system can safely do it through tools.

    • It stores state outside the model. Memory, checkpoints, and interaction history should live in structured state rather than being entrusted entirely to the context window.

    • It treats failures as expected events. Agents need retries, fallbacks, validation, and escalation paths.

  • Those principles are not isolated tricks. They are the connective tissue across the patterns that follow.
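  • Two of these principles, state stored outside the model and failures treated as expected events, can be sketched together. The `flaky_step` function below is a hypothetical tool call that fails transiently; the dataclass and retry wrapper are illustrative scaffolding, not a framework API.

```python
# Illustrative scaffolding: durable state lives in a structured object,
# and a retry wrapper records every attempt before escalating.
import dataclasses

@dataclasses.dataclass
class AgentState:
    goal: str
    steps: list = dataclasses.field(default_factory=list)  # durable history
    escalated: bool = False

def with_retries(fn, state, attempts=3):
    for _ in range(attempts):
        try:
            out = fn()
            state.steps.append(("ok", out))
            return out
        except RuntimeError as err:
            state.steps.append(("error", str(err)))  # failures are recorded
    state.escalated = True  # hand off to a human or fallback path
    return None

calls = {"n": 0}
def flaky_step():
    # Hypothetical tool call that succeeds on the third attempt.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

state = AgentState(goal="demo")
result = with_retries(flaky_step, state)
print(result, state.escalated, len(state.steps))
```

  • Because every attempt lands in `state.steps`, the interaction history survives outside the context window and can be inspected, replayed, or summarized for the model on demand.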

A LangChain sketch

  • Even the simplest LangChain example already hints at the architectural idea. A plain chain is not yet a full agent, but it shows how you stop thinking in one giant prompt and begin thinking in composable steps.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that turns vague goals into crisp task statements."),
    ("human", "Goal: {goal}")
])

goal_to_task = prompt | llm | StrOutputParser()

result = goal_to_task.invoke({
    "goal": "Help me design an AI workflow that can answer support questions reliably."
})

print(result)
  • This is only a starting point, but it captures the seed of the larger idea: the system translates a user goal into a machine-usable intermediate representation, which can later be routed into retrieval, planning, tool use, or evaluation. In other words, even the simplest useful agent begins by making hidden structure explicit.

What makes an AI system an agent?

  • An AI system becomes an agent when it transitions from passive response generation to active, goal-directed behavior. The defining shift is from generating outputs to driving outcomes. This happens when a system is embedded in a loop that enables it to perceive, reason, act, and adapt over time in pursuit of a goal.

  • At its simplest, an agent is a system that maps observations to actions in pursuit of a goal. However, modern agentic systems extend this classical definition by incorporating reasoning, tool use, memory, and iterative feedback loops. The result is a system that does not merely answer questions, but actively works toward outcomes.

The core agent loop

  • A practical way to understand what makes a system agentic is through its operational loop. An agent continuously cycles through a structured process:

    • It receives a goal
    • It gathers relevant context
    • It reasons about possible actions
    • It executes actions
    • It observes outcomes and adapts
  • This can be formalized as a sequential decision process:

    \[s_{t+1} = f(s_t, a_t, o_t), \quad a_t \sim \pi(a \mid s_t)\]
    • where \(s_t\) represents the system state, \(a_t\) the chosen action, and \(o_t\) the observation from the environment.
  • This loop is the minimal structure required for agency. Without it, a system cannot adapt, improve, or operate beyond a single interaction.
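  • The transition equation above translates directly into code. The policy and environment below are deterministic toy stand-ins, chosen only to make the state update \(s_{t+1} = f(s_t, a_t, o_t)\) concrete:

```python
# s_{t+1} = f(s_t, a_t, o_t): the next state folds in the chosen action
# and the observation it produced. All components here are illustrative.
def transition(state, action, observation):
    return {
        "step": state["step"] + 1,
        "log": state["log"] + [(action, observation)],
    }

def policy(state):
    # a_t ~ pi(a | s_t); a deterministic toy policy.
    return "stop" if state["step"] >= 2 else "probe"

def environment(action):
    return "ack:" + action  # o_t produced by the environment

state = {"step": 0, "log": []}
while True:
    action = policy(state)
    if action == "stop":
        break
    observation = environment(action)
    state = transition(state, action, observation)

print(state["step"], state["log"])
```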

From models to agents

  • A large language model on its own does not qualify as an agent. It functions as a reasoning engine, capable of transforming input text into output text based on learned patterns. However, it lacks:

    • Persistent state across interactions
    • Direct access to external systems
    • The ability to take real actions
    • Feedback-driven adaptation within a task
  • This corresponds to what can be considered a baseline configuration, where intelligence is present but not operationalized.

  • An agent emerges when this reasoning capability is embedded within a system that provides:

    • State management, allowing continuity across steps
    • Tool interfaces, enabling interaction with external systems
    • Control flow, determining how decisions unfold over time
    • Feedback integration, enabling adaptation based on outcomes
  • This transformation reflects a common progression: systems evolve from isolated reasoning engines into connected, action-capable entities.

Levels of agent capability

  • Agentic systems can be understood along a spectrum of increasing capability and autonomy.

    • Level 0: The reasoning core:

      • At this level, the system consists solely of a language model. It can reason about problems but cannot interact with the environment or access external information beyond its training data.
    • Level 1: The connected problem-solver:

      • Here, the system gains access to tools and external data sources. It can retrieve information, call APIs, and execute multi-step actions, enabling it to solve real-world problems that require up-to-date or external knowledge.

      • This is closely related to the paradigm introduced in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020), where external retrieval enhances model capabilities by grounding outputs in factual data.

    • Level 2: The strategic problem-solver:

      • At this level, the agent can plan, manage context strategically, and handle complex, multi-step workflows. A key capability here is context engineering, which involves selecting and structuring the most relevant information for each step to maximize performance.

      • This is conceptually aligned with structured reasoning approaches such as Chain-of-Thought Prompting by Wei et al. (2022), where intermediate reasoning steps improve task performance by decomposing problems.

    • Level 3: Collaborative multi-agent systems:

      • The most advanced level involves multiple agents working together, each specializing in different roles. Instead of a single monolithic system, intelligence emerges from coordination among agents.

      • The following figure shows representative examples across the spectrum of agent complexity.

    • This mirrors organizational structures in human systems, where specialized roles collaborate to achieve complex objectives. It also aligns with emerging research in distributed AI systems, where coordination and communication become central challenges.

Key properties of agentic systems

  • Several properties distinguish agents from traditional systems:

    • Autonomy: The ability to operate without constant human intervention

    • Proactiveness: The ability to initiate actions toward goals rather than waiting for instructions

    • Reactivity: The ability to respond dynamically to changes in the environment

    • Tool use: The ability to extend capabilities through interaction with external systems

    • Memory: The ability to retain and utilize information across time

    • Communication: The ability to interact with users or other agents

    • Prioritization: The ability to evaluate and rank tasks or actions based on criteria such as urgency, importance, dependencies, and resource constraints

    • Pattern selection and composition: The ability to combine multiple design patterns into a coherent system that aligns with task requirements and operational constraints

  • These properties are not independent. They reinforce each other to create systems that can operate effectively in complex, dynamic environments.

The role of reasoning and action

  • A defining feature of agentic systems is the tight coupling between reasoning and action. Instead of generating a complete solution upfront, the system iteratively refines its approach based on feedback.

  • This paradigm is exemplified by ReAct by Yao et al. (2022), which interleaves reasoning steps with actions, allowing the system to update its understanding as new information becomes available.

  • The key insight is that reasoning alone is insufficient. Effective problem-solving requires interaction with the environment, and that interaction must inform subsequent reasoning.

A minimal LangChain agent example

  • The transition from a simple chain to an agent becomes clear when tools and decision-making are introduced.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_openai import ChatOpenAI

# Define a simple placeholder tool (a real system would call a search API)
def search_tool(query: str) -> str:
    return f"Search results for: {query}"

tools = [
    Tool(
        name="Search",
        func=search_tool,
        description="Useful for answering questions about current events"
    )
]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# initialize_agent is the classic entry point; recent LangChain releases
# mark it as legacy in favor of LangGraph-based agents, but it remains the
# simplest way to illustrate the ReAct-style decision loop.
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

result = agent.invoke({"input": "What are recent developments in AI agents?"})
print(result["output"])
  • This example illustrates the essential ingredients of an agent:

    • A reasoning model
    • A set of tools
    • A decision policy that determines when to use them
  • Even in this minimal form, the system is no longer just generating text. It is selecting actions based on context, which is the defining step toward agency.

The emerging paradigm

  • The progression from LLM workflows to fully agentic systems represents a broader shift in AI:

    • From static pipelines to dynamic systems
    • From isolated models to integrated environments
    • From answering questions to achieving goals
  • The following figure shows the progression from LLMs to RAG, then to Agentic RAG, and finally to Agentic AI.

  • This evolution reflects a growing recognition that intelligence is not just about knowledge or reasoning in isolation. It is about the ability to operate effectively in a world of uncertainty, constraints, and changing information.

Agentic Design Patterns

Core Idea

  • The agentic design patterns covered in this section form the operational backbone of agentic systems. Together, they define how an agent reasons, decides, acts, and improves while interacting with its environment. Rather than functioning as isolated techniques, these patterns compose into execution graphs that transform static model calls into dynamic, goal-directed systems.

  • At a high level, these patterns collectively implement a structured decision process:

\[\text{input} \rightarrow \text{decomposition} \rightarrow \text{selection} \rightarrow \text{execution} \rightarrow \text{evaluation} \rightarrow \text{iteration}\]
  • Each pattern contributes a specific capability within this flow, enabling agents to move from simple response generation to complex, adaptive behavior. Importantly, prioritization and pattern selection act as meta-level controls over this process, determining not only what actions are taken, but which patterns are invoked and in what order.

From linear prompts to execution graphs

  • Traditional LLM systems operate as linear pipelines: a prompt is constructed, a response is generated, and the process ends. In contrast, agentic systems organize computation as directed graphs of operations, where intermediate outputs are routed, transformed, validated, and reused.

  • The patterns in this section collectively enable this shift:

    • Prompt chaining introduces structured decomposition
    • Routing introduces conditional branching
    • Parallelization introduces concurrent execution
    • Reflection introduces iterative refinement
    • Tool use introduces external interaction
    • Planning introduces long-horizon structure
    • Multi-agent systems introduce distributed specialization
    • Prioritization introduces decision ordering under constraints
    • Pattern selection and composition introduces system-level orchestration
  • Together, these transform a single inference into a coordinated, adaptive process.

Functional roles

  • Each pattern plays a distinct role in the execution lifecycle of an agent, as follows:

    • Prompt chaining as decomposition: Prompt chaining breaks complex tasks into smaller, sequential steps. It reduces cognitive load on the model and enables intermediate validation. This is the foundation upon which most other patterns build.

    • Routing as decision-making: Routing determines which path the system should take. It selects tools, models, or workflows based on input characteristics, enabling specialization and efficiency.

    • Parallelization as scaling mechanism: Parallelization allows independent tasks to be executed simultaneously. It improves latency and enables exploration of multiple reasoning paths or data sources.

    • Reflection as quality control: Reflection introduces feedback loops that allow the system to critique and refine its outputs. It improves reliability and correctness through iterative improvement.

    • Tool use as action interface: Tool use connects the agent to the external world. It enables retrieval, computation, and real-world actions, extending the system beyond its internal knowledge.

    • Planning as strategic coordination: Planning organizes actions over multiple steps. It enables the system to reason about dependencies, sequence tasks, and pursue long-term goals.

    • Multi-agent systems as distributed intelligence: Multi-agent systems distribute responsibilities across specialized agents. They enable modularity, scalability, and collaboration in complex workflows.

    • Prioritization as resource-aware decision control: Prioritization determines which tasks, goals, or actions should be executed first when multiple options compete. It incorporates criteria such as urgency, importance, dependencies, and resource constraints, ensuring that the agent focuses on high-impact actions under limited time or compute.

    • Pattern selection and composition as system orchestration: Pattern selection determines which combination of patterns should be applied for a given task, while composition defines how they are connected. This operates at a meta-level, shaping the overall execution graph rather than individual steps.

Compositional structure

  • These patterns are rarely used in isolation. A typical execution flow may look like:
\[\text{input} \rightarrow \text{routing} \rightarrow \text{planning} \rightarrow \text{prioritization} \rightarrow \left[ \text{parallel tool calls} \right] \rightarrow \text{aggregation} \rightarrow \text{reflection} \rightarrow \text{output}\]
  • This structure highlights how patterns compose:

    • Routing selects the workflow
    • Planning defines the structure
    • Prioritization orders tasks and allocates resources
    • Parallelization executes independent steps
    • Tool use provides capabilities
    • Reflection ensures quality
  • At a higher level, pattern selection and composition determines whether this entire pipeline is even the right structure, or whether an alternative configuration (e.g., multi-agent orchestration or iterative loops) should be used instead.

  • In more advanced systems, multi-agent coordination may wrap around this entire process, with different agents handling planning, execution, validation, and prioritization.
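  • The composed flow above can be sketched end to end. Every stage here is a toy function chosen to show the shape of the execution graph, not a real router, planner, or tool layer; only the thread pool for the parallel step is genuine machinery.

```python
# Illustrative composition: routing -> planning -> prioritization ->
# parallel tool calls -> aggregation -> reflection -> output.
from concurrent.futures import ThreadPoolExecutor

def route(task):
    return "research" if "?" in task else "transform"

def plan(workflow, task):
    return [f"{workflow}:lookup:{task}", f"{workflow}:verify:{task}"]

def prioritize(steps):
    return sorted(steps)  # toy ordering criterion

def tool_call(step):
    return f"result({step})"

def reflect(draft):
    # Toy quality gate: enforce the expected output shape.
    return draft if draft.startswith("answer:") else "answer: " + draft

def run_pipeline(task):
    workflow = route(task)                         # routing selects the workflow
    steps = prioritize(plan(workflow, task))       # planning + prioritization
    with ThreadPoolExecutor() as pool:             # parallel, independent steps
        results = list(pool.map(tool_call, steps))
    draft = " | ".join(results)                    # aggregation
    return reflect(draft)                          # reflection ensures quality

out = run_pipeline("what changed?")
print(out)
```

  • Swapping any stage for a model call or a real tool leaves the graph intact, which is the practical payoff of treating patterns as composable units rather than monolithic prompts.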

When to use these patterns

  • These patterns become necessary as task complexity increases:

    • Use prompt chaining when tasks require multiple reasoning steps
    • Use routing when inputs vary significantly in type or complexity
    • Use parallelization when tasks are independent and latency matters
    • Use reflection when correctness and quality are critical
    • Use tool use when external data or actions are required
    • Use planning when tasks span multiple dependent steps
    • Use multi-agent systems when specialization improves outcomes
    • Use prioritization when multiple tasks compete under constraints (time, compute, dependencies)
    • Use pattern selection and composition when designing full systems, especially when multiple patterns must be combined or adapted dynamically
  • The choice is not binary. Most real systems use a combination of these patterns, selected and orchestrated based on task requirements and constraints.

The unifying principle

  • The unifying idea across all these patterns is control. They introduce structure into how models are used, transforming them from passive generators into components of a controlled execution system.

  • Instead of asking “what should the model output?”, agentic systems ask “what should the system do next?”

  • Prioritization refines this further into “what should the system do next given constraints?”, while pattern selection elevates it to “what system should be constructed to solve this class of problems?”

  • This shift, from output generation to action selection and system design, is what enables the patterns in this primer to work together as a cohesive whole.

Prompt Chaining

  • Prompt chaining is a foundational agentic design pattern that transforms how complex problems are solved with language models. Rather than relying on a single, monolithic prompt, it decomposes a task into a sequence of smaller, structured steps, where each step feeds into the next. This approach shifts systems away from fragile one-shot reasoning toward controlled, multi-stage execution that is more reliable, interpretable, and scalable.

  • At its core, prompt chaining operationalizes the idea that complex reasoning is best handled incrementally. Each step focuses on a specific sub-problem, reducing the cognitive load on the model and improving overall performance. This principle is supported by findings from Chain-of-Thought Prompting (Wei et al., 2022), which demonstrate that breaking reasoning into intermediate steps significantly enhances accuracy on complex tasks.

  • More broadly, prompt chaining reflects a shift in how language models are conceptualized: not as monolithic problem solvers, but as components within a structured computation graph. In this paradigm, reasoning is distributed across multiple steps and can be integrated with external tools and persistent state, aligning with the evolution toward agentic systems.

  • Because it introduces structure and control while remaining relatively simple to implement, prompt chaining often serves as the entry point into agentic design.

Why prompt chaining is needed

  • Single-prompt approaches often fail when tasks become multi-step or require structured reasoning. These failures arise from several well-known limitations:

    • Instruction overload: Large prompts with multiple constraints cause the model to ignore or misinterpret parts of the task
    • Context dilution: Important details get lost as prompt length increases
    • Error amplification: Mistakes in early reasoning cannot be corrected mid-process
    • Lack of control: There is no way to inspect or guide intermediate steps
  • Prompt chaining addresses these issues by explicitly structuring the reasoning process into discrete stages. Each stage has a well-defined input and output, allowing the system to validate, transform, or enrich information before passing it forward.
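  • A sketch of staged execution with a checkpoint between stages. The three stage functions are placeholders for prompt-driven transformations; the point is that each stage has a well-defined input and output that can be validated before moving forward.

```python
# Illustrative prompt chain with validation after each stage: bad
# intermediate values fail fast instead of silently corrupting later steps.
def extract(text):
    # Stand-in for a "extract key facts" prompt.
    return [line.strip() for line in text.splitlines() if line.strip()]

def organize(facts):
    # Stand-in for a "group facts into themes" prompt (single toy theme).
    return {"theme": facts}

def summarize(themes):
    # Stand-in for the final summarization prompt.
    return f"{len(themes['theme'])} facts under 1 theme"

def run_chain(text):
    facts = extract(text)
    assert facts, "extraction produced no facts"        # validate stage 1
    themes = organize(facts)
    assert themes, "organization produced no themes"    # validate stage 2
    return summarize(themes)

print(run_chain("fact one\nfact two"))
```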

The structure of a prompt chain

  • A prompt chain can be viewed as a directed sequence of transformations:

    \[x_0 \rightarrow f_1(x_0) = x_1 \rightarrow f_2(x_1) = x_2 \rightarrow \cdots \rightarrow f_n(x_{n-1}) = x_n\]
    • where each \(f_i\) represents a prompt-driven transformation applied by the model.
  • This structure introduces modularity into the system:

    • Each step can be independently designed and optimized
    • Intermediate outputs can be inspected and debugged
    • External tools can be inserted between steps
    • Different models can be used for different stages
  • The result is a pipeline that behaves more like a program than a single inference call. The following figure illustrates the prompt chaining pattern, where agents receive a series of prompts from the user, with the output of each agent serving as the input for the next in the chain.
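The transformation sequence \(x_0 \rightarrow f_1 \rightarrow \cdots \rightarrow f_n\) can be sketched as plain function composition. In the sketch below, each function stands in for one prompt-driven step; all names are illustrative, not part of any framework API.

```python
# Minimal sketch of a chain x0 -> f1(x0) -> f2(x1): each function stands in
# for one prompt-driven transformation. All names here are illustrative.
def extract(text):
    # f1: stand-in for an "extract key facts" prompt
    return [part.strip() for part in text.split(".") if part.strip()]

def outline(facts):
    # f2: stand-in for a "build an outline" prompt
    return "\n".join(f"- {fact}" for fact in facts)

def compose(*steps):
    """Run steps left to right, feeding each output into the next step."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run

pipeline = compose(extract, outline)
print(pipeline("Agents reason. Agents act. Agents adapt."))
```

Because each step is an ordinary function, intermediate outputs can be inspected, logged, or replaced, which is exactly the modularity the pattern promises.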

Example

  • Consider a task such as generating a research summary from raw documents. A single prompt might attempt to:

    • Extract key points
    • Organize them
    • Generate a coherent summary
  • In a chained approach, this becomes:

    1. Extract key facts from the document
    2. Cluster facts into themes
    3. Generate a structured outline
    4. Produce the final summary
  • Each step reduces ambiguity and improves control over the output.

Implementation

  • LangChain provides a natural abstraction for prompt chaining through composable chains. Each component in the chain transforms input into output, allowing pipelines to be constructed declaratively.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Step 1: Extract key points
extract_prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract key facts from the following text."),
    ("human", "{input_text}")
])

# Step 2: Organize into themes
organize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Group the following facts into themes."),
    ("human", "{facts}")
])

# Step 3: Generate summary
summary_prompt = ChatPromptTemplate.from_messages([
    ("system", "Write a concise summary from these themes."),
    ("human", "{themes}")
])

extract_chain = extract_prompt | llm | StrOutputParser()
organize_chain = organize_prompt | llm | StrOutputParser()
summary_chain = summary_prompt | llm | StrOutputParser()

# Execute chain
text = "AI agents are systems that can reason, act, and adapt..."
facts = extract_chain.invoke({"input_text": text})
themes = organize_chain.invoke({"facts": facts})
summary = summary_chain.invoke({"themes": themes})

print(summary)
  • This example demonstrates how each stage isolates a specific responsibility. The system becomes easier to debug and extend, since intermediate outputs can be inspected or modified.

Enhancing chains with tools

  • Prompt chains are not limited to model-only transformations. External tools can be inserted between steps to enrich the workflow.

  • For example:

    • A retrieval step can fetch relevant documents
    • A database query can validate extracted facts
    • An API call can provide real-time data
  • This hybrid approach is closely related to Retrieval-Augmented Generation by Lewis et al. (2020), where retrieval is integrated into the generation pipeline to improve factual accuracy.

  • In practice, this turns a prompt chain into a flexible workflow that combines reasoning with external capabilities.

Prompt chaining as a building block for agents

  • Prompt chaining is more than a technique for structuring prompts. It is a foundational building block for agentic systems.

  • Many higher-level patterns rely on chaining:

    • Planning uses chains to decompose tasks into subgoals
    • Reflection uses chains to critique and refine outputs
    • Routing uses chains to decide which path to take
    • Tool use often involves chaining reasoning with action
  • In this sense, prompt chaining provides the scaffolding for more advanced behaviors. It enables systems to simulate structured thought processes and execute them reliably.

Failure modes

  • While powerful, prompt chaining introduces its own challenges:

    • Latency: Multiple steps increase response time
    • Cost: Each step requires an additional model call
    • Error propagation: Incorrect outputs can cascade through the chain
    • Over-fragmentation: Too many steps can make the system unnecessarily complex
  • These trade-offs must be carefully managed. In practice, effective chains strike a balance between decomposition and efficiency.

  • One common mitigation strategy is to validate intermediate outputs before passing them forward. Another is to selectively merge steps when they are tightly coupled.
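The validate-before-forwarding idea can be realized as a small wrapper around each stage. The validation criterion and retry limit below are illustrative assumptions, not framework APIs.

```python
# Hedged sketch: validate an intermediate output before passing it forward,
# retrying the step a bounded number of times. All names are illustrative.
def is_valid(output: str) -> bool:
    # Example criterion: the step must yield non-empty text.
    return bool(output.strip())

def run_step_with_validation(step, payload, max_retries=2):
    for _ in range(max_retries + 1):
        output = step(payload)
        if is_valid(output):
            return output
    raise ValueError("Step failed validation after retries")

# Usage with a stand-in step:
result = run_step_with_validation(lambda p: f"fact: {p}", "agents use tools")
print(result)  # fact: agents use tools
```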

Routing

  • Routing is an agentic design pattern that enables a system to dynamically select the most appropriate path, model, tool, or sub-agent based on the characteristics of the input. Instead of applying a single fixed workflow to every request, routing introduces conditional logic that directs tasks to specialized components, improving both performance and efficiency.

  • At a fundamental level, routing transforms an otherwise linear pipeline into a decision-driven system. This aligns with the broader principle that intelligence in complex systems often emerges not from uniform processing, but from specialization and selective execution.

Why routing is needed

  • As systems grow in complexity, a single model or workflow becomes insufficient for handling diverse inputs. Different tasks may require:

    • Different reasoning strategies
    • Different tools or APIs
    • Different levels of computational cost
    • Different domain expertise
  • Without routing, systems either overuse expensive resources or underperform on specialized tasks.

  • Routing addresses this by introducing a decision layer that determines how each input should be handled. This allows systems to:

    • Improve accuracy by delegating to specialized components
    • Reduce cost by using simpler models when appropriate
    • Increase flexibility by supporting multiple workflows
  • This idea is closely related to modular AI systems and mixture-of-experts architectures. For example, Switch Transformers by Fedus et al. (2021) demonstrate how routing inputs to specialized subnetworks improves scalability and efficiency in large models.

The routing decision function

  • At its core, routing can be expressed as a decision function:

    \[r(x) \rightarrow i\]
    • where \(x\) is the input and \(i\) is the selected route or component.
  • This decision can be implemented in several ways:

    • A rule-based classifier
    • A lightweight model
    • A language model itself
    • A hybrid of heuristics and learned signals
  • The output of the routing step determines which downstream process will handle the task.

  • The following figure shows the routing pattern where inputs are directed to different processing paths based on classification using an LLM as a router.
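As a deliberately simple illustration of the rule-based option for \(r(x)\), a keyword router might look like the following sketch; the route names and patterns are assumptions for demonstration only.

```python
# Minimal rule-based routing function r(x) -> route name.
# Routes and keyword patterns are illustrative assumptions.
import re

ROUTES = {
    "math": re.compile(r"\d+\s*[-+*/]\s*\d+|solve|equation"),
    "search": re.compile(r"latest|today|current|news"),
}

def route_input(x: str) -> str:
    for name, pattern in ROUTES.items():
        if pattern.search(x.lower()):
            return name
    return "general"  # default route when no rule fires

print(route_input("What is 25 * 17?"))  # math
```

Rule-based routers are cheap and interpretable; learned or LLM-based routers trade those properties for broader coverage.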

Types of routing

  • Routing can take several forms depending on the system design.

  • Input-based routing:

    • The system analyzes the input and decides which path to take. For example:

      • Questions about math are routed to a symbolic solver
      • Questions about current events are routed to a retrieval pipeline
      • Creative writing tasks are routed to a generative model
  • Tool routing:

    • The system selects which tool or API to use based on the task. This is common in agent systems where multiple tools are available.

    • This behavior is closely related to the mechanisms explored in Toolformer by Schick et al. (2023), where models learn when to invoke external tools.

  • Model routing:

    • Different models are used depending on task complexity:

      • Lightweight models for simple queries
      • Larger models for complex reasoning
    • This enables cost-performance optimization in production systems.

  • Agent routing:

    • Tasks are delegated to different agents, each with a specialized role. This becomes particularly important in multi-agent systems.

Example

  • Consider a system that handles customer support queries. Without routing, all queries are processed the same way. With routing:

    • Billing issues are sent to a financial agent
    • Technical issues are sent to a troubleshooting agent
    • General inquiries are handled by a conversational agent
  • This improves both response quality and system efficiency.

Implementation

  • LangChain supports routing through router chains and conditional logic. A common approach is to use a classification step to determine the route.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Router prompt
router_prompt = ChatPromptTemplate.from_messages([
    ("system", "Classify the user query into one of: math, search, or general."),
    ("human", "{query}")
])

router_chain = router_prompt | llm | StrOutputParser()

def route(query):
    return router_chain.invoke({"query": query}).strip().lower()

# Define handlers
def math_handler(query):
    return f"Solving math problem: {query}"

def search_handler(query):
    return f"Searching for: {query}"

def general_handler(query):
    return f"General response: {query}"

# Routing logic
def handle_query(query):
    route_type = route(query)
    if "math" in route_type:
        return math_handler(query)
    elif "search" in route_type:
        return search_handler(query)
    else:
        return general_handler(query)

print(handle_query("What is 25 * 17?"))
  • This example demonstrates how a lightweight routing decision can direct queries to different handlers. In more advanced systems, each handler could itself be a complex chain or agent.

Routing with chains and tools

  • Routing becomes more powerful when combined with other patterns:

    • With prompt chaining: Different chains can be selected dynamically
    • With tool use: The system can choose the most appropriate tool
    • With planning: Routing decisions can be made at multiple stages
    • With multi-agent systems: Tasks can be distributed across agents
  • This composability makes routing a central mechanism in agent orchestration.

Failure modes

  • Routing introduces new challenges:

    • Misclassification: Incorrect routing leads to poor results
    • Ambiguity: Some inputs may not clearly map to a single route
    • Overhead: The routing step adds latency and cost
    • Fragmentation: Too many routes can make the system difficult to manage
  • To mitigate these issues:

    • Use confidence thresholds and fallback paths
    • Allow multiple routes for ambiguous inputs
    • Continuously evaluate routing accuracy
    • Keep routing logic interpretable when possible
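Confidence thresholds and fallback paths can be combined in a few lines. The classifier and threshold below are illustrative stand-ins; in practice the confidence might come from a lightweight model.

```python
# Hedged sketch: confidence-thresholded routing with a fallback path.
# The classifier and the 0.7 threshold are illustrative assumptions.
def classify(query: str):
    # Stand-in for a real classifier returning (route, confidence).
    if "invoice" in query.lower():
        return "billing", 0.92
    return "general", 0.40

def route_with_fallback(query: str, threshold: float = 0.7) -> str:
    label, confidence = classify(query)
    if confidence < threshold:
        return "fallback"  # ambiguous input: take a safe default path
    return label

print(route_with_fallback("Question about my invoice"))  # billing
print(route_with_fallback("Hmm, not sure"))              # fallback
```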

Parallelization

  • Parallelization is an agentic design pattern that enables systems to execute multiple independent tasks simultaneously rather than sequentially. By distributing work across parallel branches, the system improves latency, throughput, and scalability while maintaining the ability to recombine results into a coherent output.

  • This pattern reflects a broader principle in intelligent systems: when tasks are independent or loosely coupled, executing them concurrently leads to significant efficiency gains. In agentic systems, where workflows often involve multiple sub-tasks such as retrieval, reasoning, validation, or generation, parallelization becomes a natural extension of prompt chaining and routing.

Why parallelization is needed

  • Sequential execution introduces unnecessary delays when tasks do not depend on each other. For example:

    • Retrieving information from multiple sources
    • Generating multiple candidate responses
    • Evaluating outputs using different criteria
    • Processing multiple inputs in batch
  • If these steps are executed one after another, total latency becomes the sum of all execution times. Parallelization reduces this to the maximum execution time among tasks:

    \[T_{\text{parallel}} \approx \max(T_1, T_2, \dots, T_n)\]
    • instead of:
    \[T_{\text{sequential}} = \sum_{i=1}^{n} T_i\]
  • This reduction can be substantial in real-world systems, especially when individual steps involve network calls or model inference.
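The max-versus-sum latency difference can be observed directly by overlapping simulated I/O waits with threads; the delays below are arbitrary stand-ins for network or model-call latency.

```python
# Sketch: parallel wall time ~ max(T_i) rather than sum(T_i), using a
# thread pool to overlap simulated I/O waits (time.sleep).
import time
from concurrent.futures import ThreadPoolExecutor

def task(delay):
    time.sleep(delay)  # stand-in for a network or model call
    return delay

delays = [0.2, 0.3, 0.1]

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(task, delays))  # preserves input order
parallel_time = time.perf_counter() - start

print(f"parallel ~ {parallel_time:.2f}s, max = {max(delays)}s, sum = {sum(delays)}s")
```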

  • The following figure shows parallel execution of independent tasks using sub-agents and aggregation of their outputs.

Forms of parallelization

  • Parallelization can be applied in several ways depending on the system design.

  • Task parallelism:

    • Different tasks are executed simultaneously. For example:

      • Running multiple retrieval queries across different databases
      • Generating answers using different prompts
      • Evaluating outputs with multiple scoring functions
    • Each task operates independently and produces its own output.

  • Data parallelism:

    • The same operation is applied to multiple inputs in parallel. For example:

      • Processing multiple documents simultaneously
      • Running the same prompt across different data samples
    • This is useful for scaling workloads across large datasets.

  • Model parallelism:

    • Different models are used simultaneously to process the same input. This can improve robustness by combining diverse perspectives.

    • This idea connects to ensemble methods in machine learning, where combining multiple models often yields better performance. For example, Deep Ensembles by Lakshminarayanan et al. (2017) demonstrate improved predictive uncertainty and robustness by aggregating outputs from multiple models.

Example

  • Consider a system that generates multiple candidate answers to a question and then selects the best one. Instead of generating answers sequentially, the system can:

    1. Generate multiple responses in parallel
    2. Evaluate each response independently
    3. Select or combine the best outputs
  • This approach improves both speed and quality, as it allows exploration of multiple reasoning paths simultaneously.

Implementation

  • LangChain supports parallel execution through constructs like RunnableParallel, which allows multiple chains to run concurrently.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Define different reasoning strategies
prompt_1 = ChatPromptTemplate.from_messages([
    ("system", "Answer concisely."),
    ("human", "{question}")
])

prompt_2 = ChatPromptTemplate.from_messages([
    ("system", "Answer with detailed reasoning."),
    ("human", "{question}")
])

chain_1 = prompt_1 | llm | StrOutputParser()
chain_2 = prompt_2 | llm | StrOutputParser()

parallel_chain = RunnableParallel(
    concise=chain_1,
    detailed=chain_2
)

result = parallel_chain.invoke({"question": "What is reinforcement learning?"})

print(result)
  • This example runs two different reasoning strategies in parallel and returns both outputs. A downstream step could then select or merge the best result.

Aggregation and synchronization

  • Parallelization requires a mechanism to combine results from multiple branches. This step is often referred to as aggregation.

  • Common aggregation strategies include:

    • Selection: Choose the best output based on a scoring function
    • Voting: Combine outputs using majority or weighted voting
    • Synthesis: Merge outputs into a unified response
    • Filtering: Remove low-quality or inconsistent results
  • This step is critical because parallelization without proper aggregation can lead to fragmented or inconsistent outputs.
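A minimal voting aggregator over branch outputs might look like the sketch below; the branch outputs are illustrative.

```python
# Hedged sketch: majority-vote aggregation over parallel branch outputs.
from collections import Counter

def aggregate_by_vote(outputs):
    # Pick the answer produced by the most branches; ties resolve to the
    # first-seen most common answer (Counter.most_common is stable).
    return Counter(outputs).most_common(1)[0][0]

branch_outputs = ["Paris", "Paris", "Lyon"]
print(aggregate_by_vote(branch_outputs))  # Paris
```

Selection and synthesis strategies follow the same shape: a function from a list of branch outputs to a single result.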

Parallelization in agentic systems

  • Parallelization is particularly powerful when combined with other patterns:

    • With prompt chaining: Multiple branches can process different aspects of a task
    • With routing: Different routes can be executed concurrently
    • With multi-agent systems: Multiple agents can work simultaneously on different subtasks
    • With retrieval: Multiple sources can be queried in parallel
  • This enables systems to handle complex workflows efficiently while maintaining modularity.

Failure modes

  • While parallelization improves performance, it introduces additional complexity:

    • Resource contention: Parallel tasks may compete for computational resources
    • Synchronization overhead: Combining results adds complexity
    • Inconsistent outputs: Different branches may produce conflicting results
    • Cost increase: Running multiple tasks simultaneously increases usage
  • To mitigate these issues:

    • Limit the number of parallel branches
    • Use lightweight models for exploratory branches
    • Apply strong aggregation and validation mechanisms
    • Monitor system performance and resource usage

Reflection

  • Reflection is an agentic design pattern that enables a system to evaluate and improve its own outputs through iterative self-critique. Rather than treating an initial response as final, the system introduces a structured feedback loop in which outputs are analyzed, corrected, and refined. This transforms the system from a one-pass generator into an adaptive process capable of improving its performance within the scope of a single task.

  • At its core, reflection operationalizes a simple but powerful idea: reasoning improves when a system is given the opportunity to revisit and critique its own work. This mirrors human problem-solving, where first drafts are rarely final and iterative revision leads to stronger, more accurate outcomes. By incorporating this loop, systems can identify weaknesses, correct errors, and enhance clarity without external intervention.

  • More broadly, reflection represents a shift from static generation to iterative improvement. It serves as a built-in mechanism for quality control, increasing reliability and robustness by enabling systems to detect and address their own mistakes. In the context of agentic design patterns, this makes reflection a foundational capability—one that brings machine reasoning closer to human-like processes, where refinement and revision are essential.

  • Ultimately, reflection allows systems to “learn” within a task itself, even in the absence of explicit retraining. By continuously reassessing and improving their outputs, they become more adaptive, accurate, and effective problem-solvers.

Why reflection is needed

  • Even advanced models frequently produce outputs that are:

    • Incomplete
    • Inconsistent
    • Hallucinated
    • Poorly structured

  • In a single-pass system, these issues persist because there is no mechanism for correction. Reflection introduces a second stage where the system evaluates its output against criteria such as correctness, completeness, and coherence.

  • This idea is supported by research such as Self-Refine: Iterative Refinement with Self-Feedback by Madaan et al. (2023), which shows that iterative self-feedback significantly improves output quality across tasks.

The reflection loop

  • Reflection can be formalized as an iterative process:

    \[y_0 = f(x), \quad y_{t+1} = g(y_t, x)\]
    • where:

      • \(f(x)\) generates an initial output
      • \(g(y_t, x)\) evaluates and refines the output
  • This process can be repeated multiple times until a stopping condition is met, such as:

    • A quality threshold
    • A fixed number of iterations
    • Convergence of outputs
  • The result is a progressively improved response.

  • The following figure shows the self-reflection design pattern, in which outputs are iteratively critiqued and improved over multiple passes.

  • The following figure shows the reflection design pattern with a producer and critique agent.

Types of reflection

  • Reflection can take several forms depending on how feedback is generated, as follows:

    • Self-critique:

      • The model evaluates its own output using a secondary prompt. For example:

        • Identify errors in reasoning
        • Check factual consistency
        • Suggest improvements
    • External critique:

      • A separate model or system evaluates the output. This can improve robustness by introducing diversity in evaluation.
    • Rule-based validation:

      • Outputs are checked against predefined constraints, such as:

        • JSON schema validation
        • Logical consistency checks
        • Domain-specific rules
    • Human-in-the-loop reflection:

      • A human provides feedback, which the system incorporates into subsequent iterations.
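Rule-based validation can be as simple as checking that an output parses as JSON with the required keys. The schema below is an illustrative assumption; a production system might use a library such as jsonschema instead.

```python
# Hedged sketch: rule-based validation of a model output against a simple
# schema check. REQUIRED_KEYS is an illustrative assumption.
import json

REQUIRED_KEYS = {"title", "summary"}

def validate_output(raw: str):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"

print(validate_output('{"title": "Agents", "summary": "..."}'))  # (True, 'ok')
```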

Example

  • Consider a system that generates code. A reflection-based workflow might:

    1. Generate initial code
    2. Analyze the code for errors or inefficiencies
    3. Revise the code based on feedback
    4. Repeat until the code meets quality criteria
  • This process significantly improves reliability compared to a single-pass generation.

Implementation

  • LangChain can implement reflection by chaining generation and critique steps.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Step 1: Generate initial answer
generate_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question."),
    ("human", "{question}")
])

# Step 2: Critique answer
critique_prompt = ChatPromptTemplate.from_messages([
    ("system", "Critique the following answer for correctness and completeness."),
    ("human", "{answer}")
])

# Step 3: Improve answer
improve_prompt = ChatPromptTemplate.from_messages([
    ("system", "Improve the answer based on the critique."),
    ("human", "Answer: {answer}\nCritique: {critique}")
])

generate_chain = generate_prompt | llm | StrOutputParser()
critique_chain = critique_prompt | llm | StrOutputParser()
improve_chain = improve_prompt | llm | StrOutputParser()

question = "Explain how neural networks learn."

initial = generate_chain.invoke({"question": question})
critique = critique_chain.invoke({"answer": initial})
improved = improve_chain.invoke({
    "answer": initial,
    "critique": critique
})

print(improved)
  • This example demonstrates a single iteration of reflection. In practice, this loop can be repeated multiple times for further refinement.
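The single generate-critique-improve pass generalizes to a bounded loop with a stopping condition. The quality score returned by `critique` below is a toy stand-in; in a real system it could come from another chain or an external validator.

```python
# Hedged sketch: reflection as a bounded loop y_{t+1} = g(y_t, x), stopping
# on a quality threshold or an iteration cap. All stand-ins are illustrative.
def reflect(generate, critique, improve, question, max_iters=3, threshold=0.9):
    answer = generate(question)
    for _ in range(max_iters):
        feedback, quality = critique(answer)
        if quality >= threshold:  # stopping condition: quality threshold
            break
        answer = improve(answer, feedback)
    return answer

# Usage with toy stand-ins:
ans = reflect(
    generate=lambda q: "draft",
    critique=lambda a: ("add detail", 0.5 if a == "draft" else 0.95),
    improve=lambda a, f: a + " (revised)",
    question="Explain backprop",
)
print(ans)  # draft (revised)
```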

Reflection in agentic systems

  • Reflection plays a critical role in enabling agents to improve their behavior dynamically. It is often used in:

    • Planning: Refining task decomposition
    • Tool use: Verifying correctness of tool outputs
    • Reasoning: Correcting logical errors
    • Multi-agent systems: Providing feedback between agents
  • This aligns with the paradigm introduced in ReAct by Yao et al. (2022), where reasoning is continuously updated based on observations and intermediate results.

Failure modes

  • While reflection improves quality, it introduces trade-offs:

    • Increased latency: Multiple iterations require additional model calls
    • Cost overhead: Each refinement step adds computational cost
    • Over-correction: Excessive refinement can degrade outputs
    • Bias reinforcement: The model may reinforce its own mistakes
  • To mitigate these issues:

    • Limit the number of reflection iterations
    • Use structured evaluation criteria
    • Introduce diversity in critique (e.g., multiple evaluators)
    • Combine reflection with external validation

Tool Use

  • Tool use is an agentic design pattern that extends a system’s capabilities beyond its internal knowledge by enabling interaction with external functions, APIs, databases, and real-world environments. It transforms a language model from a purely reasoning engine into an action-oriented system capable of operating in practical contexts.

  • At its core, tool use embodies the principle that intelligence is not just about understanding what needs to be done, but also about executing those actions—whether that involves retrieving information, performing computations, or triggering workflows.

  • By bridging the gap between reasoning and execution, tool use shifts the role of AI from a static source of knowledge to a dynamic coordinator of capabilities. In agentic systems, this pattern is what allows models to move beyond simulation and actively engage with the world. As such, it represents a fundamental step in the evolution of AI: the point at which intelligence becomes operational, turning insight into real-world execution.

Why tool use is needed

  • Language models are inherently constrained:

    • Their knowledge is limited to training data
    • They cannot access real-time or proprietary information
    • They cannot perform deterministic computations reliably
    • They cannot directly interact with external systems
  • Tool use addresses these limitations by allowing the system to delegate specific tasks to specialized components.

  • For example:

    • Use a search API to retrieve current information
    • Use a calculator for precise numerical computation
    • Query a database for structured data
    • Call a service to execute transactions
  • This paradigm is strongly supported by research such as Toolformer by Schick et al. (2023), which demonstrates that models can learn to decide when and how to use tools, significantly improving performance on real-world tasks.

  • The following figure shows the integration of external tools into the agentic reasoning loop for action execution.

The tool interaction loop

  • Tool use introduces an extended decision loop where the system must determine not only what to say, but what to do:
\[a_t = \begin{cases} \text{generate response} \\ \text{invoke tool } T_i(x) \end{cases}\]
  • After invoking a tool, the system observes the result and incorporates it into subsequent reasoning:
\[s_{t+1} = f(s_t, \text{tool output})\]
  • This creates a tight coupling between reasoning and execution, where actions directly influence future decisions.

  • This interaction pattern is central to modern agent frameworks and is exemplified by ReAct by Yao et al. (2022), where reasoning steps guide tool usage and observations refine subsequent reasoning.

  • The following figure shows the tool use design pattern.

Types of tools

  • Tools can take many forms depending on the application:

    • Information retrieval tools:

      • Web search APIs
      • Vector databases (RAG systems)
      • Knowledge bases

      • These provide access to external knowledge and improve factual accuracy.
    • Computation tools:

      • Calculators
      • Code execution environments
      • Simulation engines

      • These ensure correctness in tasks requiring precise computation.
    • Action tools:

      • APIs for booking, payments, or transactions
      • Workflow automation systems
      • Robotics interfaces

      • These allow the system to affect the external world.
    • Validation tools:

      • Schema validators
      • Consistency checkers
      • Safety filters

      • These ensure outputs meet required constraints.

Example

  • Consider a system tasked with answering a financial question: “What is the current stock price of AAPL, and how does it compare to last week?”

  • A tool-enabled system would:

    1. Recognize that real-time data is required
    2. Invoke a financial API to retrieve current and historical prices
    3. Compute the difference
    4. Generate a response
  • Without tool use, the model would either hallucinate or provide outdated information.

Implementation

  • LangChain provides built-in abstractions for integrating tools into agent workflows.
from langchain.agents import initialize_agent, Tool
from langchain_openai import ChatOpenAI

# Define a simple calculator tool.
# NOTE: eval() is used here for brevity only; never eval untrusted input
# in production. Prefer a dedicated math-expression parser.
def calculator(expression: str) -> str:
    return str(eval(expression))

tools = [
    Tool(
        name="Calculator",
        func=calculator,
        description="Useful for solving math expressions"
    )
]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Note: initialize_agent is a legacy LangChain API; newer releases favor
# LangGraph-based agents, but the pattern is the same.
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="zero-shot-react-description",
    verbose=True
)

result = agent.run("What is (45 * 23) + 17?")
print(result)
  • In this example, the agent decides when to invoke the calculator tool instead of attempting to compute the result internally. This improves both accuracy and reliability.

Tool selection and orchestration

  • A key challenge in tool use is deciding:

    • Which tool to use
    • When to use it
    • How to interpret its output
  • This introduces a decision layer similar to routing, but focused specifically on action selection.

  • In more advanced systems, this can involve:

    • Ranking multiple tools
    • Composing multiple tool calls
    • Handling tool failures and retries
  • This orchestration is central to building robust agentic systems.
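Handling tool failures often starts with a retry-and-fallback wrapper like the sketch below; the flaky tool, retry count, and fallback are illustrative assumptions.

```python
# Hedged sketch: invoke a tool with bounded retries and an optional
# fallback. All names here are illustrative, not framework APIs.
import time

def call_with_retries(tool, arg, retries=2, fallback=None, delay=0.0):
    last_error = None
    for _ in range(retries + 1):
        try:
            return tool(arg)
        except Exception as exc:  # tool failed: record the error and retry
            last_error = exc
            time.sleep(delay)
    if fallback is not None:
        return fallback(arg)      # degrade gracefully instead of crashing
    raise last_error

# Usage with a tool that fails once, then succeeds:
calls = {"n": 0}
def flaky_search(q):
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("search timed out")
    return f"results for {q}"

print(call_with_retries(flaky_search, "agent patterns"))  # results for agent patterns
```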

Tool use in agentic systems

  • Tool use is deeply interconnected with other patterns:

    • With routing: Selecting the appropriate tool
    • With prompt chaining: Integrating tool outputs into multi-step workflows
    • With reflection: Verifying and correcting tool results
    • With planning: Sequencing multiple tool calls
  • This makes tool use one of the most critical enablers of real-world functionality.

Failure modes

  • Tool use introduces several challenges:

    • Incorrect tool selection: The system may choose the wrong tool
    • Tool misuse: Inputs to tools may be malformed
    • Latency: External calls can be slow
    • Error handling: Tools may fail or return unexpected results
  • To mitigate these issues:

    • Provide clear tool descriptions
    • Validate inputs and outputs
    • Implement retries and fallbacks
    • Monitor tool performance

Planning

  • Planning is an agentic design pattern that enables a system to break down a complex goal into a structured sequence of actions before execution. Instead of reacting myopically step by step, the system forms an explicit or implicit plan that guides its behavior across multiple steps, introducing foresight, coordination, and long-horizon reasoning.

  • At its core, planning shifts a system from reactive execution to goal-directed strategy. Rather than deciding only the immediate next action, the system reasons about how a sequence of actions can collectively achieve an objective. This marks a transition from local decision-making to a more global, strategic perspective.

  • By incorporating planning, agentic systems can anticipate dependencies, coordinate actions, and pursue goals with greater effectiveness. In this sense, planning is the pattern that transforms isolated actions into coherent strategy.

Why planning is needed

  • Reactive systems, even when combined with tools and reflection, often struggle with:

    • Multi-step dependencies
    • Long-horizon tasks
    • Coordination across subtasks
    • Efficient use of resources
  • Without planning, the system may:

    • Take redundant or suboptimal actions
    • Lose track of progress
    • Fail to coordinate multiple steps effectively
  • Planning addresses these issues by introducing a structured representation of the task before execution begins.

  • This aligns with classical AI planning as well as modern LLM-based approaches. For example, Plan-and-Solve Prompting by Wang et al. (2023) shows that explicitly generating a plan before solving improves performance on complex reasoning tasks.

The planning process

  • Planning can be expressed as generating a sequence of actions:

    \[\pi = (a_1, a_2, \dots, a_n)\]
    • where \(\pi\) is the plan and each \(a_i\) is an action or subtask.
  • Execution then follows:

\[s_{t+1} = f(s_t, a_t)\]
  • The key distinction is that the sequence \(\pi\) is generated before or during execution, rather than emerging purely step-by-step.
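The plan-then-execute formalization can be sketched in a few lines: a plan \(\pi\) as an ordered list of actions, executed over a state via the transition \(s_{t+1} = f(s_t, a_t)\). The transition function and action strings below are illustrative stand-ins.

```python
# Minimal sketch: a plan pi as an ordered list of actions, executed over a
# state s via s_{t+1} = f(s_t, a_t). All names are illustrative.
def f(state, action):
    # Stand-in transition: record each completed action in the state.
    return state + [action]

plan = ["gather requirements", "draft outline", "write summary"]  # pi

state = []
for action in plan:  # execute pi step by step
    state = f(state, action)

print(state)
```

Dynamic and iterative planning differ only in that the loop may regenerate or reorder the remaining actions in `plan` as new observations arrive.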

  • The following figure shows the planning design pattern which involves task decomposition into a structured plan before execution.

Types of planning

  • Planning can take several forms depending on how explicit and structured the plan is.

  • Static planning:

    • The system generates a full plan upfront and executes it sequentially. This works well for well-defined tasks but can be brittle if conditions change.
  • Dynamic planning:

    • The system updates its plan during execution based on new information. This introduces adaptability and resilience.
  • Hierarchical planning:

    • Tasks are decomposed into subgoals and sub-subgoals, forming a tree structure. This is useful for complex problems with multiple layers of abstraction.
  • Iterative planning:

    • The system alternates between planning and execution, refining its plan as it progresses.
  • These approaches reflect different trade-offs between structure and flexibility.
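  • A dynamic or iterative planner can be sketched as a loop that executes steps until one fails, then replans from the work completed so far. All function names below are hypothetical; the failure behavior is contrived so the replanning round is visible:

```python
def run_with_replanning(goal, plan_fn, execute_fn, max_rounds=3):
    """Iterative planning: execute the plan, replan from completed work on failure."""
    completed = []
    for _ in range(max_rounds):
        plan = plan_fn(goal, completed)
        failed = False
        for step in plan:
            if execute_fn(step):
                completed.append(step)
            else:
                failed = True   # trigger a replanning round
                break
        if not failed:
            return completed
    return completed

# Toy stand-ins: "deploy" fails on its first attempt, forcing one replan.
attempts = {"deploy": 0}
def plan_fn(goal, completed):
    return [s for s in ["build", "test", "deploy"] if s not in completed]
def execute_fn(step):
    if step == "deploy":
        attempts["deploy"] += 1
        return attempts["deploy"] > 1
    return True

result = run_with_replanning("ship release", plan_fn, execute_fn)
print(result)  # → ['build', 'test', 'deploy']
```

The key property is that `plan_fn` receives the completed steps, so the second plan contains only the remaining work rather than restarting from scratch.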

Example

  • Consider a task such as: “Plan a trip to Paris for three days.”

  • A planning-based system might:

    1. Identify key components: travel, accommodation, itinerary
    2. Break each component into subtasks
    3. Sequence the tasks logically
    4. Execute each step using tools (e.g., booking APIs, search)
  • Without planning, the system might jump between unrelated steps or miss important dependencies.


Implementation

  • Planning can be implemented in LangChain by separating plan generation from execution.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Step 1: Generate plan
plan_prompt = ChatPromptTemplate.from_messages([
    ("system", "Break the task into a sequence of steps."),
    ("human", "{task}")
])

# Step 2: Execute each step
execute_prompt = ChatPromptTemplate.from_messages([
    ("system", "Execute the following step."),
    ("human", "{step}")
])

plan_chain = plan_prompt | llm | StrOutputParser()
execute_chain = execute_prompt | llm | StrOutputParser()

task = "Prepare a report on renewable energy trends."

plan = plan_chain.invoke({"task": task})
# Split the plan into steps, dropping blank lines
steps = [s.strip() for s in plan.split("\n") if s.strip()]

results = []
for step in steps:
    result = execute_chain.invoke({"step": step})
    results.append(result)

print(results)
  • This example demonstrates a simple two-phase approach: first generate a plan, then execute each step sequentially.

Planning with tools and feedback

  • Planning becomes more powerful when combined with other patterns:

    • With tool use: Each step in the plan can invoke specific tools
    • With reflection: The plan can be evaluated and refined
    • With routing: Different steps can be assigned to specialized components
    • With parallelization: Independent steps can be executed concurrently
  • This creates a flexible system where planning guides execution but does not rigidly constrain it.

Planning in agentic systems

  • Planning is a key enabler of advanced agent behavior:

    • It allows agents to handle long-term objectives
    • It improves coordination across multiple actions
    • It reduces inefficiencies in execution
    • It enables proactive behavior
  • In multi-agent systems, planning often involves coordination across agents, where different agents are assigned different parts of the plan.

Failure modes

  • Planning introduces its own challenges:

    • Overplanning: Excessive detail can reduce flexibility
    • Plan brittleness: Static plans may fail in dynamic environments
    • Error propagation: Flawed plans lead to flawed execution
    • Complexity: Managing plans adds overhead
  • To mitigate these issues:

    • Use dynamic or iterative planning
    • Incorporate feedback loops
    • Validate plans before execution
    • Allow replanning when conditions change

Prioritization

  • In complex, dynamic environments, agentic systems constantly face multiple competing actions, conflicting goals, and limited resources. Without a structured way to decide what to do next, they risk inefficiency, delays, or even complete failure to achieve their objectives. The prioritization design pattern addresses this challenge by enabling agents to evaluate, rank, and select tasks according to well-defined criteria, ensuring that effort is directed toward the most impactful actions.
  • At its core, prioritization transforms an agent from a reactive executor into a strategic decision-maker: rather than treating all tasks equally, the agent continuously determines what matters most and aligns its behavior with overarching goals and constraints. By continuously evaluating and reordering tasks, agents exhibit a form of strategic reasoning that mirrors human decision-making. This makes prioritization a cornerstone of agentic intelligence: agents not only act, they decide what is worth acting on, a capability essential for building systems that are effective in complex, real-world environments.

Core idea

  • Prioritization introduces a decision function over a set of candidate tasks:

    \[a^* = \arg\max_{a \in \mathcal{A}} \mathcal{S}(a)\]
    • where:

      • \(\mathcal{A}\) is the set of possible actions or tasks
      • \(\mathcal{S}(a)\) is a scoring function based on prioritization criteria
      • \(a^*\) is the selected highest-priority action
  • This formalization highlights that prioritization is fundamentally an optimization problem under constraints.
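  • The scoring function \(\mathcal{S}(a)\) can be as simple as a weighted sum over the prioritization criteria. The weights and task fields below are purely illustrative:

```python
def score(task):
    # S(a): illustrative weighted sum of prioritization criteria.
    return 3 * task["urgency"] + 2 * task["importance"] - task["cost"]

tasks = [
    {"name": "fix production outage", "urgency": 5, "importance": 5, "cost": 2},
    {"name": "refactor old module",   "urgency": 1, "importance": 3, "cost": 4},
    {"name": "answer support ticket", "urgency": 4, "importance": 2, "cost": 1},
]

# a* = argmax_{a in A} S(a)
best = max(tasks, key=score)
print(best["name"])  # → fix production outage
```

In practice the scores might come from an LLM judging task descriptions rather than hand-set weights, but the selection step remains the same argmax.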

Key components of prioritization

  • Effective prioritization typically involves four key components:

  • Criteria definition:

    • Agents define evaluation criteria to assess tasks. Common criteria include:

      • Urgency: how time-sensitive the task is
      • Importance: impact on primary objectives
      • Dependencies: whether other tasks rely on it
      • Resource availability: readiness of tools or data
      • Cost-benefit tradeoff: effort versus expected outcome
      • User preferences: personalization signals
    • These criteria define the agent’s notion of “value”.

  • Task evaluation:

    • Each candidate task is evaluated against the defined criteria. This can range from:

      • Rule-based scoring (e.g., priority levels P0, P1, P2)
      • Heuristic functions
      • LLM-based reasoning over task descriptions
    • This step transforms qualitative information into comparable scores.

  • Scheduling and selection:

    • Based on evaluations, the agent selects the next action or sequence of actions. This may involve:

      • Priority queues
      • Greedy selection
      • Integration with planning systems
    • This is where prioritization connects directly with planning and execution.

  • Dynamic re-prioritization:

    • As new information arrives or conditions change, priorities must be updated. This enables:

      • Responsiveness to new events
      • Adaptation to deadlines
      • Recovery from failures or delays
    • Dynamic re-prioritization is essential for real-world environments where conditions are non-static.
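  • Scheduling with dynamic re-prioritization can be sketched with a priority queue that lazily invalidates stale entries when a task's priority changes. The `TaskQueue` class below is a hypothetical sketch built on the standard library's `heapq`, not a library API:

```python
import heapq
import itertools

class TaskQueue:
    """Priority queue with dynamic re-prioritization via lazy invalidation."""
    def __init__(self):
        self._heap = []
        self._entries = {}               # task -> latest heap entry
        self._counter = itertools.count()
    def push(self, task, priority):
        if task in self._entries:
            self._entries[task][3] = False   # invalidate the stale entry
        entry = [priority, next(self._counter), task, True]
        self._entries[task] = entry
        heapq.heappush(self._heap, entry)
    def pop(self):
        while self._heap:
            priority, _, task, valid = heapq.heappop(self._heap)
            if valid:
                del self._entries[task]
                return task
        return None

q = TaskQueue()
q.push("update docs", 2)        # P2
q.push("review PR", 1)          # P1
q.push("fix login outage", 0)   # P0
q.push("update docs", 0)        # new event: docs become urgent (P0)
order = [q.pop(), q.pop(), q.pop()]
print(order)  # → ['fix login outage', 'update docs', 'review PR']
```

Lazy invalidation (marking the old entry dead rather than removing it) is the standard way to support priority updates with `heapq`, which has no decrease-key operation.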

  • The following figure shows the prioritization design pattern and how tasks are evaluated and ordered based on defined criteria.

Levels of prioritization

  • Prioritization operates at multiple levels within an agentic system:

    • Goal-level prioritization: selecting which high-level objective to pursue
    • Plan-level prioritization: ordering sub-tasks within a plan
    • Action-level prioritization: choosing the next immediate step
  • This multi-level structure mirrors hierarchical decision-making in human organizations.

Relationship to other patterns

  • Prioritization is deeply interconnected with other agentic design patterns:

    • Planning: prioritization determines which plan steps execute first
    • Routing: prioritization can influence which workflow or agent is selected
    • Tool use: determines which tool invocation is most critical
    • Goal monitoring: evaluates progress and adjusts focus
    • Evaluation: provides signals that influence future prioritization
  • Together, these patterns form a decision-making backbone for the agent.

Real-world applications

  • Prioritization is fundamental across many domains:

    • Customer support: urgent incidents (e.g., outages) are handled before routine requests
    • Cloud computing: critical workloads receive resources before batch jobs
    • Autonomous driving: collision avoidance overrides efficiency goals
    • Financial trading: high-risk or high-reward trades are executed first
    • Cybersecurity: severe threats are addressed before minor alerts
    • Personal assistants: schedules and reminders are ordered by importance and timing
  • These examples demonstrate that prioritization is essential wherever decisions must be made under constraints.

Implementation

  • The following example sketches a project manager agent that creates, prioritizes, and assigns tasks. For brevity the tool list is left empty; in a real system the agent would call task-management tools.
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)

# create_react_agent requires {tools}, {tool_names}, and {agent_scratchpad}
# in its prompt; without them, agent construction raises an error.
prompt = ChatPromptTemplate.from_template(
    """You are a Project Manager AI. Always:
1. Create a task
2. Assign priority (P0 highest, P2 lowest)
3. Assign a worker

You can use these tools: {tools}

Use this format:
Thought: what to do next
Action: one of [{tool_names}]
Action Input: the input
Thought: I now know the final answer
Final Answer: the task, its priority, and assignee

Question: {input}
{agent_scratchpad}"""
)

agent = create_react_agent(llm, tools=[], prompt=prompt)
executor = AgentExecutor(agent=agent, tools=[], verbose=True, handle_parsing_errors=True)

executor.invoke({"input": "Create an urgent task to fix login issues"})
  • In practice, this system would integrate with:

    • Task storage (memory layer)
    • Tooling for updates and assignment
    • Evaluation signals for reprioritization

Why prioritization matters

  • Without prioritization:

    • Agents may waste resources on low-value tasks
    • Critical deadlines may be missed
    • Conflicting goals may cause indecision
    • System behavior becomes unpredictable
  • With prioritization:

    • Decision-making becomes structured and goal-aligned
    • Resources are allocated efficiently
    • Agents behave more intelligently and robustly
    • Systems can scale to complex, multi-objective environments

Rule of thumb

  • Use the prioritization pattern when an agent must autonomously manage multiple competing tasks or goals under constraints. It is especially critical in dynamic environments where conditions change and decisions must be made continuously.

Pattern Selection and Composition

Core Idea

  • Agentic systems are not constructed from a single model, prompt, or technique. Instead, they emerge from the deliberate integration of multiple design patterns, each contributing a distinct aspect of intelligence—reasoning, action, memory, control, and safety. While these patterns can be studied individually, real-world effectiveness depends on how they are brought together into a cohesive whole.

  • This marks an important shift in perspective: from understanding isolated capabilities to designing complete systems. At this stage, the emphasis is no longer on how each pattern works independently, but on how they interact, reinforce one another, and impose constraints within a unified architecture. The success of an agentic system is therefore defined not only by the strength of its individual components, but by the quality of their composition.

  • A central principle in this process is that pattern selection is inherently context-dependent. Different applications introduce varying requirements across dimensions such as latency, cost, reliability, risk tolerance, and task complexity. There is no single optimal configuration; instead, designing an effective system becomes an exercise in balancing trade-offs. The choice and arrangement of patterns must align with the specific constraints and goals of the problem being solved.

  • This is the transition from techniques to systems—from assembling capabilities to engineering architectures. Pattern selection and composition provide the mechanism for synthesis, enabling developers to combine discrete elements into cohesive, production-ready solutions that are robust, scalable, and aligned with real-world demands.

  • Ultimately, this is the layer where components become systems: where individual patterns, when thoughtfully composed, create something greater than the sum of their parts.

Why composition is needed

  • Real-world problems are inherently multi-dimensional. A single pattern cannot address all requirements:

    • Prompt chaining handles structured reasoning
    • Routing enables specialization
    • Tool use enables external interaction
    • Memory enables persistence
    • Planning enables long-horizon execution
    • Reflection enables refinement
    • Guardrails ensure safety
  • Without composition, systems remain limited in capability. With composition, they become flexible and robust.

  • This reflects principles from software architecture, where modular components are combined to form complex systems. In agentic design, patterns serve as these modular building blocks.

The composition framework

  • Agentic systems can be viewed as compositions of patterns:

    \[\mathcal{S} = \mathcal{P}_1 \circ \mathcal{P}_2 \circ \cdots \circ \mathcal{P}_n\]
    • where each \(\mathcal{P}_i\) represents a design pattern.
  • The challenge lies in determining:

    • Which patterns to include
    • How they interact
    • In what order they are applied
  • This composition defines the system’s behavior.
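  • The composition \(\mathcal{S} = \mathcal{P}_1 \circ \mathcal{P}_2 \circ \cdots \circ \mathcal{P}_n\) can be expressed directly as function composition. The three stand-in patterns below are toy placeholders, not real pattern implementations:

```python
from functools import reduce

def compose(*patterns):
    """S = P1 ∘ P2 ∘ ... ∘ Pn, applied right to left."""
    return reduce(lambda f, g: lambda x: f(g(x)), patterns)

# Toy stand-ins for three patterns (illustrative names).
plan    = lambda task: [f"{task} / step {i}" for i in (1, 2)]
execute = lambda steps: [s + " [done]" for s in steps]
reflect = lambda results: {"results": results, "reviewed": True}

system = compose(reflect, execute, plan)   # reflect(execute(plan(task)))
out = system("write report")
print(out)
```

Treating each pattern as a function with a well-defined input and output is what makes linear composition tractable; the other strategies below relax this strict sequencing.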

Common composition strategies

  • Different strategies can be used to combine patterns effectively.

    • Linear composition:

      • Patterns are applied sequentially
      • Example: prompt chaining \(\rightarrow\) tool use \(\rightarrow\) reflection
    • Hierarchical composition:

      • High-level patterns orchestrate lower-level ones
      • Example: planning coordinating multiple chains
    • Parallel composition:

      • Multiple patterns operate simultaneously
      • Example: parallel retrieval + parallel evaluation
    • Conditional composition:

      • Patterns are selected dynamically
      • Example: routing between different workflows
  • These strategies can be combined to create complex architectures.
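  • Conditional composition can be sketched as a router that selects a pipeline at runtime. The keyword-based routing below is deliberately simplistic; a production router would typically use an LLM or classifier:

```python
def route(query: str) -> str:
    """Conditional composition: choose which pipeline handles the query."""
    q = query.lower()
    if "error" in q or "bug" in q:
        return "debugging"
    if "summarize" in q:
        return "summarization"
    return "general"

# Toy pipelines standing in for full pattern compositions.
pipelines = {
    "debugging":     lambda q: f"[debug pipeline] {q}",
    "summarization": lambda q: f"[summary pipeline] {q}",
    "general":       lambda q: f"[general pipeline] {q}",
}

query = "Summarize the incident report"
result = pipelines[route(query)](query)
print(result)  # → [summary pipeline] Summarize the incident report
```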

Example

  • Consider a research assistant agent:

    1. Routing determines the type of query
    2. Planning decomposes the task
    3. Tool use retrieves relevant information
    4. Prompt chaining processes the data
    5. Reflection improves the output
    6. Evaluation measures quality
    7. Memory stores results
  • This composition enables the system to handle complex tasks effectively.

Implementation

  • LangChain enables composition through modular chains, agents, and workflows.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Prompt chaining: parse the model output to a plain string
prompt = ChatPromptTemplate.from_messages([
    ("system", "Summarize the input."),
    ("human", "{text}")
])

chain = prompt | llm | StrOutputParser()

# Parallel evaluation
def evaluate(output):
    return f"Evaluation of: {output}"

workflow = RunnableParallel(
    summary=chain,
    evaluation=lambda x: evaluate(x["text"])
)

result = workflow.invoke({"text": "Agentic systems combine reasoning and action."})
print(result)
  • This example demonstrates how multiple components can be composed into a single workflow.

Design considerations

  • Effective composition requires careful consideration of:

    • Task complexity:

      • Simple tasks may require only a few patterns
      • Complex tasks require richer compositions
    • Performance constraints:

      • Latency and cost must be balanced
      • Parallelization and routing can optimize efficiency
    • Reliability requirements:

      • Reflection, guardrails, and monitoring improve robustness
    • Scalability:

      • Modular composition enables system growth
  • These factors guide pattern selection.

Failure modes

  • Poor composition can lead to:

    • Over-engineering: Too many patterns increase complexity
    • Under-engineering: Missing patterns limit capability
    • Tight coupling: Reduces flexibility
    • Unclear control flow: Makes debugging difficult
  • To mitigate these issues:

    • Start simple and iterate
    • Use modular designs
    • Clearly define interfaces between patterns
    • Continuously evaluate system performance

Multi-Agent Systems

  • Multi-agent systems represent an agentic design pattern in which multiple specialized agents collaborate to achieve a shared goal. Rather than relying on a single, monolithic agent to handle every aspect of a task, responsibilities are distributed across agents with clearly defined roles, expertise, and capabilities. This introduces modularity, scalability, and specialization into agentic architectures.

  • This approach reflects a fundamental shift in how complex problems are solved: moving away from a single generalist toward a coordinated team of specialists. Much like human organizations, where division of labor and collaboration drive effectiveness, multi-agent systems leverage structured cooperation to produce better outcomes.

  • At a deeper level, multi-agent systems embody the concept of distributed intelligence. Intelligence is no longer concentrated in a single entity but instead emerges from the interactions and coordination among agents. This enables systems to scale not only in size but also in capability and complexity, supporting parallelism, adaptability, and flexible coordination.

  • Ultimately, this pattern transforms individual intelligence into collective intelligence, making it a foundational approach for building sophisticated, real-world AI systems.

Motivation

  • As tasks grow in complexity, a single agent faces several limitations:

    • Cognitive overload from handling multiple responsibilities
    • Difficulty maintaining consistent context across diverse subtasks
    • Inefficiency in switching between different types of reasoning
    • Limited scalability for large workflows
  • Multi-agent systems address these challenges by decomposing the problem into roles and delegating tasks accordingly.

  • This idea aligns with distributed AI and cooperative systems, where coordination among multiple entities leads to emergent intelligence. For example, Generative Agents by Park et al. (2023) demonstrate how multiple agents interacting in a shared environment can produce complex, believable behaviors.

The multi-agent architecture

  • A multi-agent system can be viewed as a set of agents:

    \[A = \{a_1, a_2, \dots, a_n\}\]
    • where each agent \(a_i\) is responsible for a specific function.
  • The system operates through communication and coordination:

\[a_i \leftrightarrow a_j \quad \forall i, j\]
  • A central coordinator or decentralized protocol manages how agents interact and share information.

  • The following figure shows an example of a multi-agent system.

Multi-agent topologies

  • Multi-agent systems can be structured in different ways depending on how agents communicate, coordinate, and share responsibilities. These structures define the interrelationships between agents and directly impact system efficiency, robustness, scalability, and adaptability.

  • At a high level, multi-agent coordination spans a spectrum from fully independent agents to highly structured hierarchical and custom-designed systems. Each model introduces trade-offs between control, flexibility, communication overhead, and fault tolerance.

Single agent
  • A single agent operates independently without interacting with others
  • Simple to implement and manage
  • Limited by the capabilities and resources of one agent
  • This model is suitable when tasks can be solved in isolation and do not require collaboration.
Network (decentralized coordination)
  • Multiple agents communicate directly in a peer-to-peer fashion
  • No central controller; agents share information, resources, and tasks

  • Advantages:

    • High flexibility and scalability
    • Resilient to individual agent failure
  • Challenges:

    • Coordination complexity increases with scale
    • Communication overhead can become significant
    • Harder to ensure consistent global behavior
  • This corresponds to decentralized coordination where autonomy is maximized but control is reduced.
Supervisor (centralized coordination)
  • A central “supervisor” agent manages a group of subordinate agents

  • The supervisor:

    • Assigns tasks
    • Aggregates results
    • Maintains global context
    • Resolves conflicts
  • Advantages:

    • Clear control flow and coordination
    • Easier to debug and manage
  • Challenges:

    • Single point of failure
    • Potential bottleneck under high load
  • This is the most common production pattern due to its simplicity and controllability.
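  • A centralized supervisor can be sketched in a few lines: it assigns each task to the worker registered for that role and aggregates the results into a global view. The `Worker` and `Supervisor` class names are illustrative:

```python
class Worker:
    """A subordinate agent with a single role."""
    def __init__(self, name):
        self.name = name
    def run(self, task):
        return f"{self.name} completed: {task}"

class Supervisor:
    """Centralized coordination: assign tasks, aggregate results."""
    def __init__(self, workers):
        self.workers = workers          # role -> Worker
    def run(self, tasks):
        results = {}
        for role, task in tasks:
            results[task] = self.workers[role].run(task)
        return results                  # aggregated global view

supervisor = Supervisor({
    "research": Worker("ResearchAgent"),
    "writing":  Worker("WritingAgent"),
})
out = supervisor.run([("research", "gather market data"),
                      ("writing", "draft summary")])
print(out["draft summary"])  # → WritingAgent completed: draft summary
```

In a real system each `Worker.run` would wrap an LLM call with its own prompt, tools, and memory; the supervisor's job of routing and aggregation stays the same.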

Supervisor as a tool
  • The supervisor provides capabilities rather than strict control
  • Acts as a resource provider (e.g., tools, data, analysis)
  • Other agents retain autonomy in decision-making

  • Advantages:

    • Balances guidance with flexibility
    • Avoids rigid top-down control
  • This model is useful when centralized expertise is needed without constraining agent autonomy.
Hierarchical systems
  • Agents are organized into multiple layers:

    • High-level agents define goals
    • Mid-level agents plan and coordinate
    • Low-level agents execute actions
  • Advantages:

    • Scales well for complex tasks
    • Enables structured decomposition of problems
    • Supports distributed decision-making
  • Challenges:

    • Increased system complexity
    • Requires careful coordination across layers
  • This mirrors real-world organizational hierarchies and is well-suited for large, multi-stage workflows.

Custom systems
  • Tailored architectures combining elements of different models
  • May include hybrid coordination strategies or entirely novel designs

  • Advantages:

    • Optimized for specific tasks, environments, or constraints
    • Can balance trade-offs across control, flexibility, and efficiency
  • Challenges:

    • More difficult to design and implement
    • Requires deep understanding of agent interactions and communication protocols
  • Custom systems are typically used in advanced production settings where standard patterns are insufficient.

  • The choice of coordination model is a critical design decision. It depends on factors such as task complexity, number of agents, required autonomy, robustness needs, and acceptable communication overhead.

  • The following figure shows how agents communicate and interact in various ways.

Example

  • Consider a product launch scenario. A multi-agent system might include:

    • A Project Manager agent to coordinate tasks
    • A Market Research agent to analyze trends
    • A Design agent to create product concepts
    • A Marketing agent to generate campaigns
  • The Project Manager agent assigns tasks, collects outputs, and ensures alignment across agents.

  • This example illustrates how specialization and coordination enable the system to handle complex, multi-faceted objectives.

Implementation

  • LangChain and related frameworks support multi-agent orchestration through role-based agents and shared workflows.
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, Tool

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Define simple role-based tools (agents)
def research_agent(task: str) -> str:
    return f"Research findings for: {task}"

def writing_agent(task: str) -> str:
    return f"Written content for: {task}"

tools = [
    Tool(name="ResearchAgent", func=research_agent, description="Performs research"),
    Tool(name="WritingAgent", func=writing_agent, description="Writes content")
]

# Note: initialize_agent is LangChain's legacy agent API; newer code
# typically uses create_react_agent with an AgentExecutor.
manager_agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="zero-shot-react-description",
    verbose=True
)
result = manager_agent.run("Create a blog post about AI agents.")
print(result)
  • In this simplified example, the manager agent delegates tasks to specialized agents. In more advanced systems, each agent would have its own internal logic, memory, and tools.

Communication and coordination

  • Effective multi-agent systems depend on how agents communicate:

    • Message passing: Agents exchange structured messages
    • Shared memory: Agents read and write to a common state
    • Task delegation: Agents assign subtasks to others
    • Feedback loops: Agents critique and refine each other’s outputs
  • Communication protocols are critical for ensuring consistency and alignment across agents.
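  • Message passing can be sketched as a small in-memory bus with per-agent inboxes. This is a toy sketch, not a production protocol; real systems add addressing, schemas, and delivery guarantees:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    recipient: str
    content: str

class MessageBus:
    """Structured message passing between agents via per-agent inboxes."""
    def __init__(self):
        self.inboxes = {}
    def register(self, name):
        self.inboxes[name] = deque()
    def send(self, msg: Message):
        self.inboxes[msg.recipient].append(msg)
    def receive(self, name):
        box = self.inboxes[name]
        return box.popleft() if box else None

bus = MessageBus()
bus.register("manager")
bus.register("researcher")
bus.send(Message("manager", "researcher", "Find 2024 solar adoption stats"))
msg = bus.receive("researcher")
print(msg.content)  # → Find 2024 solar adoption stats
```

Using a typed `Message` rather than raw strings is what makes the communication "structured": every agent can rely on sender and recipient fields being present.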

Multi-agent systems in practice

  • Multi-agent systems are particularly useful for:

    • Complex workflows with multiple stages
    • Tasks requiring diverse expertise
    • Large-scale automation pipelines
    • Collaborative problem-solving
  • They are increasingly used in domains such as:

    • Software engineering (code generation, testing, deployment)
    • Research and analysis
    • Business process automation
    • Simulation and modeling

Failure modes

  • Multi-agent systems introduce additional complexity:

    • Coordination overhead: Managing communication between agents
    • Inconsistency: Agents may produce conflicting outputs
    • Latency: Multiple agents increase execution time
    • Debugging difficulty: Errors may arise from interactions between agents
  • To mitigate these issues:

    • Define clear roles and responsibilities
    • Use structured communication formats
    • Implement validation and aggregation mechanisms
    • Monitor interactions between agents

State, Adaptation, and Control in Agentic Systems

Core Idea

  • As agentic systems evolve from simple workflows into autonomous, goal-directed architectures, three foundational capabilities become critical: the ability to retain state, improve over time, and stay aligned with objectives. The patterns in this section, namely Memory Management, Learning and Adaptation, Model Context Protocol (MCP), and Goal Setting and Monitoring, collectively address these needs.

  • Together, they define how an agent persists information, updates its behavior, coordinates internal components, and ensures progress toward desired outcomes. Without these capabilities, even well-designed systems with strong reasoning, planning, and tool use remain fundamentally limited.

From stateless execution to persistent intelligence

  • Earlier patterns such as prompt chaining, routing, and tool use primarily operate within the scope of a single task or interaction. However, real-world systems require continuity across time. This introduces the need for stateful execution, where past interactions, intermediate results, and learned knowledge influence future behavior.

  • Formally, instead of treating each step independently:

    \[a_t \sim \pi(a \mid x_t)\]
  • agentic systems operate over accumulated state:

    \[a_t \sim \pi(a \mid s_t), \quad s_t = f(s_{t-1}, o_{t-1})\]
    • where \(s_t\) captures memory, context, and prior outcomes.
  • This shift enables agents to maintain coherence, avoid redundant work, and build progressively richer representations of their environment.
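  • The state-update rule \(s_t = f(s_{t-1}, o_{t-1})\) can be made concrete with a pure update function that folds each observation into accumulated state (the field names below are illustrative):

```python
def update_state(state, observation):
    """s_t = f(s_{t-1}, o_{t-1}): fold each new observation into the state."""
    return {"history": state["history"] + [observation],
            "steps": state["steps"] + 1}

state = {"history": [], "steps": 0}
for obs in ["user asked about pricing",
            "tool returned the price list",
            "user confirmed the premium tier"]:
    state = update_state(state, obs)

print(state["steps"])  # → 3
```

Because the state carries the full history forward, a later action can condition on everything observed so far rather than only the latest input.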

Memory as the foundation of continuity

  • Memory management provides the infrastructure for storing and retrieving information across both short and long time horizons. It allows systems to:

    • Maintain conversational and task continuity
    • Personalize interactions
    • Accumulate knowledge from prior executions
  • Without memory, agents behave like stateless functions. With memory, they begin to exhibit traits of persistence and experience.

Learning as the mechanism for improvement

  • While memory enables retention, learning enables transformation. Learning and adaptation allow agents to refine their behavior based on feedback, outcomes, and experience.

  • This introduces a feedback-driven optimization loop:

    \[\pi_{t+1} = \pi_t + \Delta(\text{feedback}, \text{experience})\]
    • where the system updates its policy based on observed performance.
  • In practice, this may take the form of:

    • Incorporating feedback into memory
    • Adjusting prompts or workflows
    • Improving routing and tool selection
  • Learning ensures that agents do not remain static, but evolve toward better performance over time.
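  • One lightweight form of this update, adjusting the prompt from feedback, can be sketched as follows. The `update_policy` function is hypothetical and stands in for \(\pi_{t+1} = \pi_t + \Delta(\text{feedback}, \text{experience})\):

```python
def update_policy(prompt, feedback_log):
    """pi_{t+1} = pi_t + Delta(feedback): fold corrective feedback into the prompt."""
    corrections = [f"- Avoid: {f}" for f in feedback_log if f]
    if not corrections:
        return prompt
    return prompt + "\nLessons from previous runs:\n" + "\n".join(corrections)

base = "Summarize the report in three bullet points."
feedback = ["output exceeded three bullets", "missed the financial summary"]
adapted = update_policy(base, feedback)
print(adapted)
```

Richer variants store feedback in long-term memory and retrieve only the lessons relevant to the current task, rather than appending everything.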

Context as the glue of the system

  • As systems grow in complexity, multiple components such as tools, memory stores, and sub-agents must interact seamlessly. Model Context Protocol (MCP) provides the structure for this interaction.

  • It defines how information is represented and passed between components:

    \[C = \{u, s, m, t, r\}\]
    • ensuring that all relevant context is consistently available.
  • Without structured context, systems become fragmented and difficult to scale. MCP ensures coherence across the entire architecture.

Goals as the anchor of behavior

  • Even with memory and learning, an agent requires a clear sense of direction. Goal setting and monitoring provide this by defining objectives and tracking progress.

  • This introduces a control loop:

    \[\Delta_t = d(s_t, G)\]
    • where the system continuously measures its distance from the goal and adjusts accordingly.
  • This ensures that:

    • Actions remain aligned with objectives
    • Progress is measurable
    • Deviations are detected and corrected
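  • The control loop \(\Delta_t = d(s_t, G)\) can be sketched with a toy distance function that counts goal items not yet achieved:

```python
def distance(state, goal):
    """Delta_t = d(s_t, G): here, the number of goal items not yet achieved."""
    return len(goal - state)

goal = {"draft", "review", "publish"}
state = set()
trace = []
for action in ["draft", "review", "publish"]:
    state.add(action)
    trace.append(distance(state, goal))   # monitor progress after each step

print(trace)  # → [2, 1, 0]
```

A real monitor would use a richer metric (e.g. an LLM judging progress against the objective), but the loop structure of measure, compare, and adjust is the same.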

The combined effect

  • These four patterns are deeply interconnected:

    • Memory stores experience
    • Learning transforms experience into improved behavior
    • MCP ensures experience and context flow correctly through the system
    • Goals and monitoring ensure behavior remains aligned and purposeful
  • Together, they form the backbone of persistent, adaptive, and goal-driven agentic systems.

  • They mark the transition from systems that can act, to systems that can remember, improve, coordinate, and stay aligned over time.

Memory Management

  • Memory management is a foundational agentic design pattern that enables systems to retain, organize, and utilize information across interactions over time. At its core, it allows an agent to persist information beyond a single prompt or step—an essential capability, since real-world tasks often span multiple interactions, depend on historical context, and benefit from accumulated knowledge. Without memory, each interaction resets the system to a blank state, severely limiting its effectiveness.

  • By introducing persistence, memory transforms agents from stateless, reactive responders into stateful, adaptive systems. This shift enables continuity in interactions, supports personalization, and allows agents to incorporate past experiences into current decision-making. As a result, agents can learn, refine their behavior, and improve performance over time.

  • In agentic design, memory is the mechanism that turns isolated interactions into a coherent experience. It provides the structure for accumulating knowledge, maintaining context, and enabling long-term reasoning—making it a critical component for building capable, real-world AI systems.

Why memory is needed

  • Stateless systems face fundamental limitations:

    • They forget previous interactions
    • They cannot build context over time
    • They cannot personalize responses
    • They struggle with long-horizon tasks
  • Memory addresses these issues by enabling the system to store and retrieve relevant information when needed.

  • This aligns with the broader paradigm introduced in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020), where external memory retrieval enhances reasoning by grounding outputs in stored knowledge.

Types of memory

  • Memory in agentic systems can be categorized along two complementary axes:

    • Functional taxonomy (what kind of information is stored and why)
    • Storage and retrieval mechanisms (how memory is implemented and accessed)
  • Together, these dimensions provide a more complete view of how memory operates in real-world systems.

Functional types of memory
  • These correspond to cognitive roles and are independent of how memory is physically stored.

    • Short-term memory (working memory):

      • Stores information relevant to the current task

      • Typically implemented within the model’s context window

      • Includes recent messages, intermediate outputs, and current execution state

      • Enables continuity within a single workflow

      • Often volatile and limited by context size

    • Long-term memory:

      • Persists information across sessions

      • Stored externally (e.g., databases, vector stores, file systems)

      • Includes user preferences, past interactions, and accumulated knowledge

      • Enables personalization and learning over time

    • Episodic memory:

      • Stores specific past experiences or events

      • Often includes timestamps and contextual metadata

      • Allows the system to recall prior situations and outcomes

      • Particularly useful for temporal reasoning and history-aware behavior

    • Semantic memory:

      • Stores generalized knowledge extracted from experiences

      • Represents facts, abstractions, and patterns

      • Enables reasoning beyond specific past events

Storage and retrieval mechanisms
  • In addition to functional types, memory can also be categorized by how it is implemented.

  • Vector memory (embedding-based memory):

    • Stores information as embeddings in vector databases

    • Retrieval is based on semantic similarity search

    • Best suited for:

      • Semantic recall
      • Paraphrase handling
      • Large-scale knowledge retrieval
    • Commonly used in retrieval-augmented systems such as Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020), where external memory enhances reasoning

    • Typically supports:

      • Long-term memory
      • Semantic memory
  • File-based memory (log-structured or document memory):

    • Stores information as structured files (e.g., markdown, JSON, logs)

    • Often versioned using systems like Git (diff, history, commits)

    • Retrieval is keyword-based (e.g., BM25) or structure-aware

    • Best suited for:

      • Episodic memory with temporal tracking
      • Auditable and human-readable memory
      • Reproducibility and debugging
    • Example implementations include Git-based memory systems such as DiffMem, where memory evolves through version-controlled commits

    • Naturally supports:

      • Episodic memory (time-stamped history)
      • Long-term memory (persistent logs)
How these dimensions interact
  • These two categorizations are orthogonal and often combined in practice:

    • Short-term memory \(\rightarrow\) usually context window
    • Long-term memory \(\rightarrow\) vector store or file system
    • Episodic memory \(\rightarrow\) often file-based (logs, timelines)
    • Semantic memory \(\rightarrow\) often vector-based (embeddings)
  • A unified view can be expressed as:

\[\text{Memory} = \text{Function (what)} + \text{Mechanism (how)}\]
  • For example:

    • A vector database may implement semantic long-term memory
    • A Git-based system may implement episodic long-term memory
    • A hybrid system may combine both
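
  • The function-plus-mechanism view can be made concrete by tagging each memory entry with both dimensions; the field names and labels below are illustrative, not a standard taxonomy:

```python
from dataclasses import dataclass

# Each entry records what it represents (function) and how it is
# stored (mechanism). Labels are illustrative.

@dataclass
class MemoryEntry:
    text: str
    function: str   # e.g., "semantic", "episodic", "short-term"
    mechanism: str  # e.g., "vector", "file", "context-window"

entries = [
    MemoryEntry("Alice prefers morning meetings.", "semantic", "vector"),
    MemoryEntry("2023-05-01: booked a flight for Alice.", "episodic", "file"),
]

# A hybrid system can route retrieval by mechanism:
vector_entries = [e for e in entries if e.mechanism == "vector"]
file_entries = [e for e in entries if e.mechanism == "file"]
print(len(vector_entries), len(file_entries))
```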
Practical perspective
  • Modern agentic systems increasingly adopt hybrid memory architectures, where:

    • Vector memory handles semantic retrieval
    • File-based memory handles history, structure, and traceability
  • This layered approach enables agents to:

    • Retrieve relevant knowledge efficiently
    • Track how knowledge evolves over time
    • Maintain both performance and interpretability
  • These distinctions mirror concepts from cognitive science, while also reflecting practical system design choices required for building robust, real-world agentic systems.

File-based vs. Vector Memory

  • As agentic systems evolve from simple reactive pipelines into stateful, adaptive systems, memory design becomes a first-class architectural decision. Modern agents are expected not only to retrieve relevant information, but also to reason about how that information changes over time, whether it remains valid, and how it should influence future decisions.

  • This introduces a fundamental design tension:

    • Systems must optimize for recall and scale to handle large, diverse knowledge
    • Systems must also ensure accuracy and interpretability to maintain trust and correctness
  • These competing requirements shape how memory systems are built in practice and lead to two dominant paradigms:

    • Vector-based memory optimized for semantic recall and scalability
    • File-based memory optimized for transparency, temporal tracking, and control
  • Rather than being interchangeable, these approaches reflect different philosophies of memory. Vector memory treats knowledge as a searchable semantic space, while file-based memory treats it as a structured, evolving record. As highlighted in the broader agentic design framework, managing state, context, and historical knowledge is central to building robust agents, and memory becomes the backbone of that capability.

  • In practice, no single approach is universally optimal. The choice depends on tradeoffs between scale, interpretability, semantic understanding, temporal reasoning, and system complexity. Increasingly, real-world systems adopt composable memory architectures that combine both paradigms to balance these tradeoffs effectively, enabling agents to be both scalable and trustworthy.

Vector memory (semantic retrieval)
  • Vector memory is the dominant paradigm in modern agentic systems and underpins retrieval-augmented architectures such as Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020), which shows how external retrieval improves reasoning by grounding outputs in relevant knowledge.

  • In this approach:

    • Text is converted into embeddings (high-dimensional vectors)
    • Stored in a vector database (e.g., FAISS, Pinecone, Weaviate)
    • Retrieved using similarity search
  • Formally, retrieval is defined as:

    \[\text{retrieve}(q) = \arg\max_{s_i \in \mathcal{M}} \text{sim}(q, s_i)\]
    • where similarity is typically cosine similarity.
  • Key characteristics

    • Semantic matching rather than exact matching
    • Handles paraphrases and implicit meaning
    • Scales efficiently to large datasets
    • Retrieval is approximate but fast
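
  • The retrieval formula above can be illustrated with hand-made toy embeddings (a real system would use a learned embedding model and an approximate nearest-neighbor index such as FAISS):

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy memory: (text, embedding) pairs. Embeddings are hand-crafted
# placeholders for illustration only.
memory = [
    ("Alice prefers morning meetings.", [0.9, 0.1, 0.0]),
    ("Alice works in AI research.",     [0.1, 0.9, 0.2]),
]

def retrieve(query_embedding):
    # arg max over stored items by similarity to the query
    return max(memory, key=lambda item: cosine_sim(query_embedding, item[1]))[0]

print(retrieve([0.8, 0.2, 0.0]))  # nearest to the "meetings" entry
```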
File-based memory (log-structured memory)
  • File-based memory takes a fundamentally different approach by storing knowledge as structured documents, logs, or version-controlled files. Instead of embeddings, it relies on explicit representations of information.

  • A notable implementation is the Git-based memory approach in the DiffMem GitHub repository, where:

    • Memories are stored as markdown files
    • Each interaction is recorded as a commit
    • Git history tracks how knowledge evolves
    • Retrieval uses keyword-based methods like BM25
  • This approach treats memory as a versioned knowledge base, not just a retrieval index.

  • Key characteristics:

    • Human-readable storage (markdown, logs)
    • Native versioning (diff, history, blame)
    • Deterministic retrieval (keyword or structured queries)
    • Strong temporal awareness
  • A key capability is time-travel memory:

    • Agents can inspect past states of knowledge
    • Enables reproducibility and debugging
    • Supports auditing and traceability
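
  • Time-travel memory can be sketched with timestamped snapshots standing in for commits (a Git-backed system such as DiffMem gets this from version history; the structure below is only illustrative):

```python
import datetime

# Each snapshot is the full memory state at a point in time,
# mimicking what a commit history provides.
snapshots = []

def commit(state):
    snapshots.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc),
        "state": state,
    })

def state_at(index):
    """Inspect a past version of memory, like checking out an old commit."""
    return snapshots[index]["state"]

commit({"alice_meeting_pref": "morning"})
commit({"alice_meeting_pref": "afternoon"})

print(state_at(0))   # past state
print(state_at(-1))  # current state
```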
Comparative Analysis
| Aspect             | Vector Memory       | File-based Memory    |
| ------------------ | ------------------- | -------------------- |
| Retrieval type     | Semantic similarity | Keyword / structured |
| Representation     | Embeddings          | Raw text / files     |
| Interpretability   | Low                 | High                 |
| Temporal awareness | Weak (unless added) | Strong (native)      |
| Scalability        | High                | Moderate             |
| Determinism        | Approximate         | Deterministic        |
Strengths and weaknesses
  • Vector memory

    • Pros:

      • Captures semantic meaning and paraphrases
      • Scales to large datasets
      • Efficient approximate search
      • Strong for knowledge retrieval and QA
    • Cons:

      • Weak temporal reasoning (evolving facts over time)
      • Hard to debug or interpret
      • Requires embedding infrastructure
      • May retrieve semantically similar but irrelevant data
    • A known limitation is that embeddings capture surface similarity rather than true reasoning. For example, symbolic equivalences like “10 + 10” and “20” are not inherently aligned without additional processing.

  • File-based memory

    • Pros:

      • Fully transparent and human-readable
      • Native versioning and history tracking
      • Strong temporal reasoning
      • Easy manual correction and editing
      • Deterministic and reproducible
    • Cons:

      • Weak semantic understanding
      • Limited scalability compared to vector systems
      • Requires indexing (e.g., BM25)
      • May miss relevant but differently phrased information
    • A key advantage is handling changing facts over time, such as: “My daughter is 10” \(\rightarrow\) later “11” \(\rightarrow\) later “12”

    • File-based systems preserve this evolution explicitly, whereas vector systems often treat outdated entries as noise unless additional filtering is applied.

When to use each approach
  • Use vector memory when:

    • You need semantic search across large corpora
    • Queries are ambiguous or paraphrased
    • Scale and latency are critical
    • Knowledge is relatively static

    • Examples:

      • Enterprise knowledge assistants
      • Document retrieval systems
      • RAG-based copilots
  • Use file-based memory when:

    • You need strong temporal tracking and versioning
    • Interpretability and auditability are critical
    • Data scale is manageable
    • You require full control over stored knowledge

    • Examples:

      • Personal assistants with long-term context
      • Coding agents tracking project evolution
      • Systems requiring reproducibility
      • Research or journaling agents
Hybrid memory systems
  • In practice, most production systems combine both paradigms to balance tradeoffs:

    \[\text{memory} = \text{vector store} + \text{file store} + \text{indexing layer}\]
    • where:

      • Vector store enables semantic retrieval
      • File store maintains authoritative history
      • Indexing layer bridges retrieval and structure
  • Example architecture:

    • Store raw interactions in logs or Git
    • Periodically generate embeddings from current state
    • Use vector search for fast retrieval
    • Fall back to file history for auditing and correctness
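
  • That architecture can be sketched in a few lines, with an append-only log as the authoritative store and a toy embedding index layered on top (embeddings here are hand-made placeholders; a real system would generate them with a model):

```python
import datetime
import math

log = []    # file-store stand-in: authoritative, timestamped history
index = []  # vector-store stand-in: (embedding, log position) pairs

def store(text, embedding):
    log.append({"ts": datetime.datetime.now(datetime.timezone.utc),
                "text": text})
    index.append((embedding, len(log) - 1))

def semantic_retrieve(query_emb):
    def sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    # Fast semantic lookup, then fall back to the log for the
    # full, auditable record.
    emb, pos = max(index, key=lambda item: sim(query_emb, item[0]))
    return log[pos]

store("Alice prefers morning meetings.", [1.0, 0.0])
store("Alice now prefers afternoons.",   [0.9, 0.4])

hit = semantic_retrieve([0.95, 0.3])
print(hit["text"])
```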

Memory operations

  • Memory usage involves two key operations:

    \[\text{store}(s_t) \quad \text{and} \quad \text{retrieve}(q)\]
    • where:

      • \(s_t\) is the state or information to store
      • \(q\) is a query used to retrieve relevant memory
  • In practice, the retrieval mechanism depends on how memory is implemented:

    • Vector-based retrieval:

      • Uses embeddings and similarity search
      • Retrieves items based on semantic closeness
      \[\text{retrieve}_{vec}(q) = \arg\max_{s_i} \text{sim}(q, s_i)\]
    • File-based retrieval:

      • Uses keyword search (e.g., BM25), metadata filtering, or structured queries
      • Retrieves items based on exact matches, timestamps, or document structure
      \[\text{retrieve}_{file}(q) = \text{rank}_{\text{BM25}}(q, D)\]
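
      • As a rough illustration of file-based retrieval, the sketch below scores documents with a simplified TF-IDF ranker in the spirit of BM25 (real BM25 adds term-frequency saturation and document-length normalization):

```python
import math

docs = [
    "Alice prefers morning meetings",
    "Bob prefers remote work",
    "Alice works in AI research",
]

def keyword_rank(query, documents):
    # Simplified TF-IDF scoring; real BM25 also applies term-frequency
    # saturation and document-length normalization.
    tokenized = [d.lower().split() for d in documents]
    n = len(documents)

    def idf(term):
        df = sum(term in doc for doc in tokenized)
        return math.log((n + 1) / (df + 1)) + 1

    q_terms = query.lower().split()
    scores = [sum(doc.count(t) * idf(t) for t in q_terms)
              for doc in tokenized]
    ranked = sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
    return [d for d, s in ranked if s > 0]

print(keyword_rank("Alice meetings", docs))
```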
  • The core challenge is not just storing information, but retrieving the most relevant subset at the right time.

    • Vector memory excels at semantic recall (finding conceptually similar information)
    • File-based memory excels at temporal and structural recall (finding the most recent, authoritative, or exact record)
  • The following figure shows memory storage and retrieval flow in an agentic system, including short-term and long-term memory components.

Example

  • Consider a personal assistant agent. Memory enables it to:

  • Remember user preferences (e.g., preferred meeting times)
  • Recall past conversations
  • Adapt responses based on historical context

  • Using different memory types:

  • Vector memory:

    • Retrieves semantically relevant preferences
    • Example: “When does Alice like meetings?” \(\rightarrow\) retrieves “morning meetings”
  • File-based memory:

    • Tracks how preferences evolve over time
    • Example:

      • 2023: “Alice prefers morning meetings”
      • 2025: “Alice now prefers afternoons”
    • Enables selecting the most recent or valid fact
  • Without memory, the assistant would treat each interaction independently, leading to repetitive and less useful behavior.

Implementation

  • LangChain provides built-in support for memory across multiple dimensions, with native integrations for vector-based memory (e.g., FAISS, Pinecone, Chroma) and extensibility that allows integration of file-based or custom storage systems via tools, retrievers, or custom memory implementations.

  • To better reflect real-world agent design, these examples can be categorized along two axes:

    • Duration: short-term vs. long-term
    • Mechanism: vector-based vs. file-based
Short-term memory (working memory)
  • Context-based buffer memory (LangChain native)
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

memory = ConversationBufferMemory()

conversation = ConversationChain(
    llm=llm,
    memory=memory
)

conversation.predict(input="Hi, my name is Alice.")
conversation.predict(input="What is my name?")
  • Mechanism: in-context (no external storage)
  • Duration: short-term
  • Use case: conversational continuity within a session

  • This demonstrates how recent interactions are retained in the context window to maintain coherence.
Long-term memory (persistent storage)
  • Long-term memory in LangChain is typically implemented using vector stores, while file-based approaches can be integrated depending on system requirements.
Vector-based long-term memory (semantic retrieval)
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

vector_store = FAISS.from_texts(
    ["Alice prefers morning meetings.", "Alice works in AI research."],
    embeddings
)

query = "What does Alice prefer?"
docs = vector_store.similarity_search(query)

print(docs)
  • Mechanism: embeddings + similarity search
  • Duration: long-term
  • Strength: semantic recall

  • Best suited for:

    • Knowledge bases
    • Retrieval-augmented generation (RAG) systems
    • Large-scale memory
  • This is the primary memory abstraction supported natively by LangChain.
File-based long-term memory (structured logs)
  • Simple file-based memory (custom integration):
import json
import os

memory_file = "memory.json"

def store_memory(entry):
    # Load existing entries (if any), append, and write back.
    if os.path.exists(memory_file):
        with open(memory_file) as f:
            data = json.load(f)
    else:
        data = []
    data.append(entry)
    with open(memory_file, "w") as f:
        json.dump(data, f)

def retrieve_memory(query):
    # Case-insensitive keyword match over stored entries.
    with open(memory_file) as f:
        data = json.load(f)
    return [m for m in data if query.lower() in m.lower()]

store_memory("Alice prefers morning meetings.")
store_memory("Alice now prefers afternoon meetings.")

print(retrieve_memory("Alice"))
  • Mechanism: file storage + keyword search
  • Duration: long-term
  • Strength: transparency and control

  • This is not a native LangChain memory abstraction, but can be integrated via custom tools or retrievers.
File-based long-term memory (temporal / versioned)
import datetime

log = []

def store_event(text):
    log.append({
        "timestamp": str(datetime.datetime.now()),
        "text": text
    })

def retrieve_latest(keyword):
    # Return the most recent matching entry, or None if nothing matches.
    results = [e for e in log if keyword.lower() in e["text"].lower()]
    if not results:
        return None
    return max(results, key=lambda x: x["timestamp"])

store_event("Alice prefers morning meetings.")
store_event("Alice now prefers afternoon meetings.")

print(retrieve_latest("Alice"))
  • Mechanism: timestamped logs
  • Duration: long-term
  • Strength: temporal reasoning and recency awareness

  • This approach is particularly useful for tracking evolving state and can be layered alongside vector memory.
Comparative Analysis
| Type                   | Mechanism         | Duration   | LangChain Support | Strength                  |
| ---------------------- | ----------------- | ---------- | ----------------- | ------------------------- |
| Buffer memory          | Context window    | Short-term | Native            | Conversational continuity |
| Vector memory          | Embeddings        | Long-term  | Native            | Semantic retrieval        |
| File memory (simple)   | Files + keyword   | Long-term  | Custom            | Interpretability          |
| File memory (temporal) | Logs + timestamps | Long-term  | Custom            | Temporal reasoning        |
Key takeaways
  • LangChain natively supports vector-based memory for scalable semantic retrieval
  • File-based memory must be integrated manually, but provides strong benefits for traceability and temporal reasoning
  • Buffer memory provides short-term conversational continuity
  • In practice, production systems combine all three into a layered memory architecture.

Memory in agentic systems

  • Memory is deeply integrated with other patterns:

    • With planning: Tracks progress and intermediate states
    • With reflection: Stores feedback and improvements
    • With tool use: Records results of tool interactions
    • With multi-agent systems: Enables shared context across agents
  • Different memory types serve different roles:

    • Vector memory \(\rightarrow\) shared semantic knowledge
    • File-based memory \(\rightarrow\) shared logs, history, and traceability
  • This makes memory a foundational and multi-layered component of any sophisticated agentic system.

Failure modes

  • Memory introduces several challenges:

    • Irrelevant retrieval:

      • Vector memory may return semantically similar but incorrect data
      • File-based memory may return keyword matches without context
    • Context overload:

      • Too much retrieved memory degrades model performance
    • Staleness:

      • Vector memory may surface outdated embeddings
      • File-based memory may accumulate obsolete entries
    • Semantic gaps:

      • Vector memory may miss exact or symbolic relationships
      • File-based memory may miss semantically relevant matches
    • Privacy concerns:

      • Storing sensitive data requires safeguards regardless of storage type
  • To mitigate these issues:

    • Use hybrid retrieval (semantic + keyword)
    • Apply recency and relevance ranking
    • Implement memory consolidation and pruning
    • Add metadata (timestamps, entities, summaries)
    • Use access controls and encryption
  • In practice, robust systems combine both approaches:

    • Vector memory for semantic recall at scale
    • File-based memory for accuracy, history, and control
  • This hybrid design enables agents to retrieve the right information while understanding its context and evolution.
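
  • Recency and relevance ranking, one of the mitigations above, can be sketched as a weighted combination of a retrieval score and an exponential recency decay (weights, half-life, and scores below are illustrative):

```python
import datetime

now = datetime.datetime(2025, 6, 1, tzinfo=datetime.timezone.utc)

# Candidate memories with a relevance score (e.g., from semantic or
# keyword search) and a timestamp. Values are illustrative.
candidates = [
    {"text": "Alice prefers morning meetings.",
     "relevance": 0.9,
     "ts": datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc)},
    {"text": "Alice now prefers afternoons.",
     "relevance": 0.8,
     "ts": datetime.datetime(2025, 5, 1, tzinfo=datetime.timezone.utc)},
]

def combined_score(m, w_rel=0.5, w_rec=0.5, half_life_days=365):
    # Exponential decay: an entry loses half its recency weight
    # every `half_life_days`.
    age_days = (now - m["ts"]).days
    recency = 0.5 ** (age_days / half_life_days)
    return w_rel * m["relevance"] + w_rec * recency

best = max(candidates, key=combined_score)
print(best["text"])
```

  • Here the slightly less relevant but much fresher entry wins, which is usually the desired behavior for evolving facts.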

Learning and Adaptation

  • Learning and adaptation represent the shift from static intelligence to evolving intelligence. Rather than simply executing tasks, systems that incorporate this pattern continuously improve, adapting to new environments and refining their behavior over time. This marks a fundamental transition: intelligence is no longer fixed at design, but shaped through experience.

  • In agentic design, learning introduces the concept of growth. Agents are no longer limited to acting and reasoning within a single task—they develop across tasks and over time. While patterns like reflection enable short-term corrections within a given interaction, learning extends this capability, allowing agents to carry insights forward and apply them in future situations.

  • At its core, learning and adaptation turn experience into improvement. By leveraging feedback, interaction outcomes, and accumulated knowledge, agents refine their internal policies and decision-making processes. This creates a compounding effect, where each interaction contributes to a more capable system.

  • Ultimately, this pattern defines the evolution from systems that merely execute and correct to systems that continuously improve. It lays the foundation for building agents that do not just perform tasks, but become progressively better at performing them.

Why learning is needed

  • Even with planning, tool use, and memory, an agent without learning remains fundamentally static:

    • It repeats the same mistakes across tasks
    • It cannot generalize from past experiences
    • It does not improve efficiency over time
    • It lacks adaptation to changing environments
  • Learning enables agents to:

    • Optimize decision-making strategies
    • Improve task performance
    • Adapt to new conditions
    • Personalize behavior
  • This aligns with reinforcement learning principles, where agents improve through interaction with an environment. For example, Human-Level Control through Deep Reinforcement Learning by Mnih et al. (2015) demonstrates how agents can learn optimal policies through reward-driven interaction, showing that iterative feedback improves long-term outcomes.

The learning process

  • Learning can be formalized as updating a policy based on experience:

    \[\theta' = \theta + \alpha \nabla_{\theta} J(\theta)\]
    • where:

      • \(\pi_{\theta}\) is the current policy
      • \(\theta\) are its parameters (updated to \(\theta'\))
      • \(J(\theta)\) is the objective function
      • \(\alpha\) is the learning rate
  • The objective often involves maximizing expected reward:

\[J(\theta) = \mathbb{E}_{\pi_\theta}[R]\]
  • This formulation underpins many adaptive agent systems, even when implemented implicitly through prompt updates or memory adjustments.
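
  • The update rule can be illustrated on a toy scalar objective; here \(J(\theta) = -(\theta - 3)^2\) stands in for expected reward, so gradient ascent should drive \(\theta\) toward its maximizer at 3:

```python
# Toy illustration of the update theta' = theta + alpha * grad J(theta).
# J(theta) = -(theta - 3)^2 is a stand-in objective with its maximum
# at theta = 3; a real agent maximizes expected reward instead.

def grad_J(theta):
    # dJ/dtheta for J(theta) = -(theta - 3)^2
    return -2 * (theta - 3)

theta = 0.0
alpha = 0.1  # learning rate
for _ in range(100):
    theta = theta + alpha * grad_J(theta)

print(round(theta, 3))  # gradient ascent converges toward 3
```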

Types of learning in agentic systems

  • Learning can occur in multiple ways depending on how feedback is obtained and applied, as follows:

  • Supervised learning from feedback:

    • Uses labeled examples or corrections
    • Often implemented via human feedback
    • Improves specific behaviors

    • This is closely related to approaches like InstructGPT by Ouyang et al. (2022), where models are fine-tuned using human preferences to improve alignment.
  • Reinforcement learning:

    • Uses reward signals from the environment
    • Optimizes long-term performance
    • Suitable for sequential decision-making
  • Self-improvement (bootstrapped learning):

    • Uses the agent’s own outputs and reflections
    • Iteratively improves without external labels
    • Often combined with reflection and memory
  • Online adaptation:

    • Continuously updates behavior during deployment
    • Adapts to dynamic environments
  • These approaches are often combined in practical systems.

Example

  • Consider a customer support agent:

    • Initially, it provides generic responses
    • Over time, it learns which responses resolve issues faster
    • It adapts to user preferences and common queries
    • It improves its routing and tool usage decisions
  • Without learning, the system remains static. With learning, it becomes progressively more effective.

Implementation

  • While LangChain does not directly implement reinforcement learning, learning can be approximated through feedback loops and memory updates.
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory()

def update_memory_with_feedback(input_text, response, feedback):
    memory.save_context(
        {"input": input_text},
        {"output": f"{response}\nFeedback: {feedback}"}
    )

# Simulated interaction
user_input = "Explain quantum computing simply."
response = llm.invoke(user_input)

# Simulated feedback
feedback = "Too complex, simplify further."

update_memory_with_feedback(user_input, response.content, feedback)
  • This example demonstrates how feedback can be incorporated into memory, influencing future responses.

Learning through evaluation loops

  • Learning in agentic systems often emerges from repeated evaluation cycles, where performance is continuously measured and used to drive improvement. Rather than relying on static behavior, agents iteratively refine their outputs based on feedback signals.

  • A typical loop follows:

    1. Generate output
    2. Evaluate output (via metrics, rules, or humans)
    3. Update system behavior
    4. Repeat
  • This creates a feedback loop that gradually improves performance and mirrors reinforcement learning pipelines such as Human-Level Control through Deep Reinforcement Learning by Mnih et al. (2015), which shows how iterative reward-driven updates improve policies over time.

  • The following figure shows the learning and adapting pattern, which features feedback-driven learning where agent outputs are evaluated and used to improve future behavior.

  • This loop forms the foundation for more advanced self-improving systems, including agents that can modify their own behavior, architecture, or even code.
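
  • The four-step loop above can be sketched as deterministic hill climbing, with a numeric candidate standing in for a model output and a distance-to-target evaluator standing in for feedback (all values are illustrative):

```python
# Sketch of the loop: generate candidates, evaluate them, keep the
# best, repeat. The evaluator prefers values near a target.

target = 50

def evaluate(candidate):
    return -abs(candidate - target)  # higher is better

best = 20
for _ in range(100):
    # 1. Generate: propose neighbors of the current best
    candidates = [best - 1, best + 1]
    # 2. Evaluate and select the strongest candidate
    top = max(candidates, key=evaluate)
    # 3. Update system behavior only when evaluation improves
    if evaluate(top) > evaluate(best):
        best = top
# 4. Repeat until the evaluator stops rewarding changes

print(best)  # climbs from 20 to the target of 50
```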

Learning in agentic systems

  • Learning interacts deeply with other patterns:

    • With memory: Stores learned knowledge
    • With reflection: Provides signals for improvement
    • With planning: Refines strategies over time
    • With tool use: Improves tool selection and usage
  • This integration enables agents to evolve holistically rather than in isolated components.

Failure modes

  • Learning introduces new risks:

    • Overfitting: Adapting too strongly to specific cases
    • Feedback bias: Learning from incorrect or biased signals
    • Instability: Frequent updates may degrade performance
    • Catastrophic forgetting: Losing previously learned knowledge
  • To mitigate these issues:

    • Use balanced and diverse feedback
    • Regularize updates
    • Maintain stable baseline behaviors
    • Monitor performance over time

Self-Improving Coding Agent (SICA)

  • The Self-Improving Coding Agent (SICA), described in its GitHub repository, represents a significant step beyond standard evaluation loops by enabling an agent to directly modify its own source code. Instead of learning indirectly through parameter updates or prompt adjustments, SICA performs explicit self-modification, making it both the learner and the subject of learning.

  • SICA operates through an iterative self-improvement cycle:

    • It maintains an archive of past agent versions and their benchmark performance
    • It selects the best-performing version using a weighted scoring function (considering success, time, and computational cost)
    • It analyzes past performance to identify improvements
    • It modifies its own codebase
    • The new version is evaluated and added back to the archive
  • This creates a closed-loop system where learning is driven entirely by past performance, enabling continuous evolution without traditional retraining.
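
  • The version-selection step can be sketched as a weighted score over an archive of past versions (the fields, values, and weights below are illustrative; SICA's actual scoring function may differ):

```python
# Hypothetical archive of past agent versions with benchmark stats.
archive = [
    {"version": "v1", "success_rate": 0.60, "time_s": 120, "cost": 1.0},
    {"version": "v2", "success_rate": 0.72, "time_s": 150, "cost": 1.4},
    {"version": "v3", "success_rate": 0.70, "time_s":  90, "cost": 0.8},
]

def score(v, w_success=1.0, w_time=0.001, w_cost=0.1):
    # Reward success; penalize wall-clock time and compute cost.
    # Weights are illustrative, not SICA's actual values.
    return (w_success * v["success_rate"]
            - w_time * v["time_s"]
            - w_cost * v["cost"])

best = max(archive, key=score)
print(best["version"])
```

  • With these weights, a slightly less accurate but much cheaper and faster version wins, which is the kind of tradeoff the weighted scoring is meant to capture.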

  • The following figure shows SICA’s self-improvement flow, learning and adapting based on its past versions.

  • Over time, SICA demonstrated meaningful architectural evolution:

    • Transitioned from simple file overwrites to a Smart Editor
    • Introduced Diff-Enhanced editing for targeted code changes
    • Implemented AST-based reasoning for efficient navigation
    • Developed hybrid search mechanisms combining fast lookup and structural parsing
  • The following figure shows performance across iterations with key improvements annotated with their corresponding tool or agent modifications.

  • SICA’s architecture also highlights several production-relevant design patterns:

    • Multi-agent decomposition: coding, reasoning, and problem-solving sub-agents
    • Memory and context structuring: organized prompts and execution traces
    • Tool use: file operations, command execution, and AST parsing
    • Exception handling and monitoring: an asynchronous overseer agent detects loops, stagnation, and inefficiencies
  • A particularly important innovation is the overseer agent, which acts as a meta-controller:

    • Monitors execution via callgraphs and logs
    • Detects pathological behavior (e.g., repeated work)
    • Can intervene or terminate execution
  • This introduces a form of self-regulation and aligns closely with guardrails and monitoring patterns in production systems.

AlphaEvolve

  • AlphaEvolve extends the idea of learning through evaluation into the domain of algorithm discovery. Developed by Google, it combines large language models with evolutionary algorithms and automated evaluation systems to iteratively generate and optimize solutions.

  • The system operates through a structured evolutionary loop:

    • Generate candidate algorithms using LLMs
    • Evaluate them using predefined metrics
    • Select high-performing candidates
    • Refine and recombine them
    • Repeat
  • A key design feature is the use of LLM ensembles:

    • Gemini Flash generates diverse candidate solutions
    • Gemini Pro performs deeper analysis and refinement
  • This division of labor improves both exploration and exploitation in the search space.

  • AlphaEvolve has demonstrated strong real-world impact:

    • Reduced data center compute usage by 0.7%
    • Improved TPU hardware design via Verilog optimization
    • Achieved up to 32.5% performance gains in GPU kernels
    • Discovered new matrix multiplication algorithms
    • Solved or improved a large fraction of open mathematical problems
  • Conceptually, AlphaEvolve represents the convergence of:

    • Learning through evaluation loops
    • Parallelization (multiple candidates evaluated simultaneously)
    • Planning and search (evolutionary optimization)
    • Tool use (evaluation systems and computational pipelines)
  • It shows that agentic systems can move beyond task execution into knowledge and algorithm discovery.

OpenEvolve

  • OpenEvolve builds on similar principles but focuses specifically on evolving code through an LLM-driven pipeline. It generalizes the evolutionary approach into a flexible, production-ready system for optimizing programs.

  • Its architecture is centered around a controller that orchestrates multiple components:

    • Program sampler
    • Program database
    • Evaluator pool
    • LLM ensemble
  • The following figure shows the OpenEvolve internal architecture and how these components interact.

  • The system operates through an iterative loop:

    1. Generate candidate programs using LLMs
    2. Evaluate them using custom evaluators
    3. Store results in a database
    4. Select and refine high-performing programs
    5. Repeat
  • Key capabilities include:

    • Evolution of entire codebases, not just functions
    • Multi-objective optimization (e.g., performance, efficiency)
    • Support for multiple programming languages
    • Distributed evaluation for scalability
    • Flexible prompt and configuration control
  • A typical usage pattern:

import asyncio

from openevolve import OpenEvolve

async def main():
    evolve = OpenEvolve(
        initial_program_path="path/to/initial_program.py",
        evaluation_file="path/to/evaluator.py",
        config_path="path/to/config.yaml",
    )

    # run() is a coroutine, so it must be awaited inside an event loop
    best_program = await evolve.run(iterations=1000)

    print("Best program metrics:")
    for name, value in best_program.metrics.items():
        print(f"{name}: {value:.4f}")

asyncio.run(main())
  • OpenEvolve highlights how learning through evaluation can be operationalized in production systems:

    • Evaluation becomes the central driver of improvement
    • Memory is externalized via program databases
    • Parallelization enables large-scale search
    • Composition integrates LLMs, evaluators, and storage systems

Learning and Adaptation Loop

  • Across SICA, AlphaEvolve, and OpenEvolve, a common pattern emerges:
\[\text{Generate} \rightarrow \text{Evaluate} \rightarrow \text{Select} \rightarrow \text{Modify} \rightarrow \text{Repeat}\]
  • This loop generalizes learning beyond traditional training into continuous system evolution.
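  • As an illustrative sketch (not any of these systems' actual implementations), the loop can be reduced to hill-climbing over a single numeric "program" parameter, with hypothetical `generate` and `evaluate` functions standing in for LLM-driven mutation and custom evaluators:

```python
import random

def generate(parent):
    # Hypothetical mutation step: perturb a numeric "program" parameter
    return parent + random.uniform(-1.0, 1.0)

def evaluate(candidate):
    # Hypothetical fitness: closer to the target value 10.0 is better
    return -abs(candidate - 10.0)

def evolve(initial, iterations=200):
    best, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = generate(best)   # Generate
        score = evaluate(candidate)  # Evaluate
        if score > best_score:       # Select: keep only improvements
            best, best_score = candidate, score  # Modify the incumbent
    return best                      # Repeat until the budget is spent

print(evolve(0.0))  # drifts toward the optimum at 10.0
```

Real systems replace the mutation step with LLM generation and the scalar fitness with multi-objective evaluators, but the control flow is the same.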

  • These systems demonstrate that:

    • Evaluation is not just for measurement, but for driving improvement
    • Agents can evolve at multiple levels:

      • Outputs (reflection)
      • Strategies (planning)
      • Architectures (multi-agent composition)
      • Code itself (self-modification)
    • The boundary between execution and learning is increasingly blurred

  • This pattern becomes essential when building agents that must operate in dynamic, uncertain, or evolving environments, where static behavior is insufficient.

Model Context Protocol (MCP)

  • Model Context Protocol (MCP) is an agentic design pattern that standardizes how context is structured, transmitted, and consumed across the components of an agentic system. It defines a consistent interface for passing information between models, tools, memory systems, and agents, enabling interoperability and composability.

  • As agentic systems grow in complexity, context becomes the central medium through which all components interact. MCP introduces discipline into this process by formalizing how context is represented and exchanged, preventing fragmentation, inconsistency, and misalignment between system parts. By doing so, it ensures that every component operates on a shared understanding of the system state.

  • More than just a technical convention, MCP represents the standardization of information flow in agentic systems. It is the pattern that enables coherence—allowing complex systems to function as unified wholes rather than disconnected parts. In this sense, MCP transforms context from passive data into an active mechanism for coordination, turning information into aligned, system-wide behavior.

Why MCP is needed

  • Without a structured protocol for context, systems encounter several challenges:

    • Inconsistent data formats across components
    • Loss of critical information during transitions
    • Difficulty integrating multiple tools and agents
    • Poor scalability due to ad-hoc interfaces
  • MCP addresses these issues by defining a shared schema for context, enabling seamless communication across system boundaries.

  • This aligns with broader system design principles seen in distributed systems and APIs, where standardization enables interoperability. In agentic systems, context plays the role of both data and control signal, making its structure even more critical.

  • The following figure shows structured context flowing between components in an agentic system, ensuring consistent data exchange and interoperability. This visualization highlights how MCP acts as the connective tissue of the system.

The structure of context

  • Context in an agentic system typically includes:

    • User input
    • System state
    • Memory retrievals
    • Tool outputs
    • Intermediate reasoning steps
  • MCP organizes these elements into a structured representation:

    \[C = \{u, s, m, t, r\}\]
    • where:

      • \(u\) = user input
      • \(s\) = system state
      • \(m\) = memory
      • \(t\) = tool outputs
      • \(r\) = reasoning traces
  • This structured context is passed between components, ensuring that all relevant information is preserved.
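  • One way to realize \(C = \{u, s, m, t, r\}\) in code is a typed container whose fields mirror the symbols above. This is an illustrative sketch, not a formal MCP schema:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Structured context passed between components (C = {u, s, m, t, r})."""
    user_input: str = ""                              # u: user input
    state: dict = field(default_factory=dict)         # s: system state
    memory: list = field(default_factory=list)        # m: memory retrievals
    tool_outputs: list = field(default_factory=list)  # t: tool outputs
    reasoning: list = field(default_factory=list)     # r: reasoning traces

ctx = Context(user_input="What is my order status?")
ctx.memory.append("User has order #1234")
ctx.tool_outputs.append("Order #1234 is shipped")
print(ctx)
```

Using `default_factory` gives each context its own mutable lists, so two contexts never share state by accident.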

Context transformation

  • As context flows through the system, it is transformed:

    \[C_{t+1} = f(C_t, a_t)\]
    • where:

      • \(C_t\) is the current context
      • \(a_t\) is the action taken
      • \(f\) is the transformation function
  • Each component consumes context, modifies it, and passes it forward. MCP ensures that this transformation remains consistent and interpretable.
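  • The transformation \(C_{t+1} = f(C_t, a_t)\) can be sketched as a pure function over a context dict; the action types handled here are hypothetical:

```python
import copy

def transform(context, action):
    """f(C_t, a_t): return a new context with the action's effects applied."""
    new_ctx = copy.deepcopy(context)  # keep C_t unchanged for traceability
    if action["type"] == "tool_call":
        new_ctx["tool_outputs"].append(action["result"])
    elif action["type"] == "reason":
        new_ctx["reasoning"].append(action["thought"])
    new_ctx["step"] = context.get("step", 0) + 1
    return new_ctx

c0 = {"tool_outputs": [], "reasoning": []}
c1 = transform(c0, {"type": "tool_call", "result": "order shipped"})
c2 = transform(c1, {"type": "reason", "thought": "inform the user"})
print(c2["step"])  # 2
```

Returning a fresh copy at each step preserves the full sequence \(C_0, C_1, \dots\), which is what makes the transformation interpretable and debuggable.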

Example

  • Consider a multi-step agent handling a customer request:

    1. Receives user query
    2. Retrieves relevant memory
    3. Calls a tool (e.g., database query)
    4. Updates state with results
    5. Generates response
  • Without MCP, each step might use different formats, leading to integration issues. With MCP, all steps operate on a shared context structure, enabling smooth transitions.

Implementation

  • LangChain implicitly supports MCP-like behavior through structured inputs and outputs.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that uses structured context."),
    ("human", "User input: {input}\nMemory: {memory}\nTool Output: {tool_output}")
])

# Shared context structure consumed by every step
context = {
    "input": "What is my order status?",
    "memory": "User has order #1234",
    "tool_output": "Order #1234 is shipped"
}

# Piping the prompt into the model preserves the message structure
response = (prompt | llm).invoke(context)
print(response.content)
  • This example demonstrates how structured context can be passed into a model, ensuring that all relevant information is included.

MCP in multi-component systems

  • MCP becomes especially important in systems involving:

    • Multiple agents
    • Multiple tools
    • Distributed execution
    • Complex workflows
  • In such systems, context must be:

    • Consistent: Same structure across components
    • Complete: Includes all necessary information
    • Efficient: Avoids unnecessary duplication
    • Traceable: Supports debugging and monitoring

MCP and other patterns

  • MCP integrates tightly with other agentic patterns:

    • With memory: Defines how memory is injected into context
    • With tool use: Standardizes tool input and output formats
    • With multi-agent systems: Enables communication between agents
    • With planning: Represents plans and intermediate states
  • This makes MCP a foundational infrastructure pattern rather than a standalone capability.

Failure modes

  • Improper context management can lead to:

    • Context fragmentation: Missing or inconsistent data
    • Overloaded context: Excessive information degrading performance
    • Ambiguity: Unclear structure leading to misinterpretation
    • Latency: Large context sizes slowing down processing
  • To mitigate these issues:

    • Define clear schemas for context
    • Limit context to relevant information
    • Use structured formats (e.g., JSON-like representations)
    • Monitor context size and flow
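  • The mitigations above can be made concrete with a small validator that enforces a schema and trims oversized fields. The required keys and size budget here are illustrative:

```python
REQUIRED_KEYS = {"input", "memory", "tool_output"}
MAX_FIELD_CHARS = 2000  # illustrative per-field size budget

def validate_context(context):
    """Check the schema, then trim string fields so the context stays bounded."""
    missing = REQUIRED_KEYS - context.keys()
    if missing:
        raise ValueError(f"Context missing keys: {missing}")
    return {k: (v[:MAX_FIELD_CHARS] if isinstance(v, str) else v)
            for k, v in context.items()}

ctx = validate_context({
    "input": "What is my order status?",
    "memory": "User has order #1234",
    "tool_output": "Order #1234 is shipped",
})
print(ctx["input"])
```

Running every context through a checkpoint like this catches fragmentation early and keeps context sizes from growing without bound.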

Goal Setting and Monitoring

  • Goal setting and monitoring enables systems to define objectives explicitly, track progress toward them, and adjust behavior based on deviations or outcomes. It introduces a control layer that ensures the agent remains aligned with its intended purpose over time.

  • While planning determines how a task will be executed, goal setting defines what success looks like, and monitoring ensures that execution remains on track. Together, they transform agent behavior from open-ended activity into directed, measurable progress.

Motivation

  • Without explicit goals and monitoring mechanisms, agentic systems face several risks:

    • Drift from the original objective
    • Inefficient or redundant actions
    • Lack of termination criteria
    • Inability to detect failure or suboptimal performance
  • Goal setting provides direction, while monitoring provides feedback. This mirrors control systems in engineering, where a system continuously compares its current state to a desired target.

  • This concept aligns with optimization frameworks where systems aim to minimize or maximize an objective function:

    \[\min_{\pi} L(\pi, G)\]
    • where:

      • \(\pi\) is the policy or behavior
      • \(G\) is the goal
      • \(L\) is a loss function measuring deviation from the goal
  • Monitoring ensures that this loss is evaluated continuously and used to guide behavior.

  • The following figure shows continuous monitoring of agent progress against defined goals, enabling dynamic adjustments and termination decisions. This loop ensures that the system remains aligned with its objectives.

Defining goals

  • Goals in agentic systems can take different forms depending on the task, as follows:

    • Explicit goals:

      • Clearly defined objectives (e.g., “summarize this document”)
      • Often provided by the user or system
    • Implicit goals:

      • Derived from context or system design
      • Not directly specified but inferred
    • Hierarchical goals:

      • High-level goals decomposed into subgoals
      • Enables complex task execution
  • Goals can also include constraints, such as time limits, resource usage, or quality thresholds.

Monitoring progress

  • Monitoring involves tracking the agent’s state relative to its goal:

    \[\Delta_t = d(s_t, G)\]
    • where:

      • \(s_t\) is the current state
      • \(G\) is the goal
      • \(d\) is a distance or discrepancy function
  • The system uses \(\Delta_t\) to decide whether to continue execution, adjust strategy, or terminate.
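  • For a sectioned writing task, a simple discrepancy function \(d(s_t, G)\) can just count outstanding work. The goal representation here is illustrative:

```python
def discrepancy(state, goal):
    """d(s_t, G): number of required sections still missing."""
    return max(0, goal["required_sections"] - state["sections_completed"])

goal = {"required_sections": 3}
state = {"sections_completed": 1}

delta = discrepancy(state, goal)
print(delta)       # 2 sections remain
print(delta == 0)  # False: keep executing
```

The controller continues while \(\Delta_t > 0\) and terminates (or triggers review) once the discrepancy reaches zero.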

Example

  • Consider an agent tasked with: “Write a research report on climate change.”

  • Goal setting defines:

    • Completion criteria (e.g., structured report with sections)
    • Quality requirements (e.g., factual accuracy, citations)
  • Monitoring tracks:

    • Progress through sections
    • Coverage of required topics
    • Consistency and coherence
  • If the system detects missing sections or poor quality, it can trigger corrective actions such as re-planning or reflection.

Implementation

  • Goal tracking can be implemented by maintaining a state object and evaluating progress at each step.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

goal = "Write a 3-section report on renewable energy."
state = {"sections_completed": 0, "target_sections": 3}

def check_progress(state, goal):
    # The goal is met once all target sections have been written
    return state["sections_completed"] >= state["target_sections"]

while not check_progress(state, goal):
    section_number = state["sections_completed"] + 1
    response = llm.invoke(f"{goal} Write section {section_number} of 3.")
    print(response.content)
    state["sections_completed"] += 1  # update the monitored state

print("Goal achieved!")
  • This example demonstrates a simple monitoring loop where progress is tracked and used to determine termination.

Feedback-driven monitoring

  • Monitoring often involves evaluating outputs against criteria:

    • Completeness
    • Accuracy
    • Consistency
    • Efficiency
  • This creates a feedback loop:

    1. Generate output
    2. Evaluate against goal
    3. Update state
    4. Adjust behavior
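  • As a sketch, a completeness check against required topics can drive this evaluate-and-adjust loop. The topic list and keyword matching are illustrative simplifications of real quality metrics:

```python
def evaluate_output(text, required_topics):
    """Score completeness: fraction of required topics mentioned."""
    covered = [t for t in required_topics if t.lower() in text.lower()]
    return len(covered) / len(required_topics), covered

required = ["solar", "wind", "hydro"]
draft = "This report covers solar and wind power."

score, covered = evaluate_output(draft, required)
if score < 1.0:
    missing = set(required) - set(covered)
    print(f"Coverage {score:.2f}; revise to add: {missing}")
```

The score feeds step 3 (update state) and the missing-topic list feeds step 4 (adjust behavior), for example by prompting the model to extend the draft.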

Goal management in complex systems

  • In advanced agentic systems, goal management can involve:

    • Multiple concurrent goals
    • Dynamic goal updates
    • Conflict resolution between goals
    • Prioritization of objectives
  • This requires a more sophisticated control layer that can balance competing demands.

Integration with other patterns

  • Goal setting and monitoring interact with multiple patterns:

    • With planning: Defines what the plan aims to achieve
    • With reflection: Identifies deviations and triggers corrections
    • With memory: Stores progress and past outcomes
    • With learning: Refines goal achievement strategies
  • This integration ensures that goals are not static, but actively influence system behavior.

Failure modes

  • Common challenges include:

    • Poorly defined goals: Ambiguity leads to inconsistent behavior
    • Over-constrained goals: Limits flexibility
    • Insufficient monitoring: Failures go undetected
    • Metric misalignment: Optimizing the wrong objective
  • To mitigate these issues:

    • Define clear and measurable goals
    • Use appropriate evaluation metrics
    • Monitor continuously
    • Allow adaptive goal refinement

Exception Handling and Recovery

  • Exception handling and recovery enables systems to detect failures, handle unexpected conditions, and recover gracefully without derailing the overall task. It introduces robustness into agentic systems, ensuring that errors are not terminal but manageable events.

  • In real-world environments, uncertainty and failure are inevitable. APIs fail, tools return incorrect outputs, plans break, and environments change. This pattern ensures that agents can continue operating despite these disruptions.

Why exception handling is needed

  • Without structured exception handling, agentic systems suffer from:

    • Fragility in the presence of errors
    • Cascading failures across steps
    • Inability to recover from unexpected conditions
    • Poor user experience due to abrupt failures
  • Exception handling transforms failure from a stopping condition into a recoverable event.

  • This aligns with resilience principles in distributed systems, where systems are designed to tolerate faults rather than avoid them entirely.

Types of exceptions

  • Agentic systems encounter different categories of failures:

    • Execution errors:

      • Tool failures (e.g., API timeouts, invalid responses)
      • Code execution errors
      • Resource constraints
    • Reasoning errors:

      • Incorrect assumptions
      • Logical inconsistencies
      • Misinterpretation of inputs
    • Planning errors:

      • Invalid or incomplete plans
      • Missing dependencies
    • Environmental errors:

      • Changes in external systems
      • Unavailable resources
  • Each type requires different handling strategies.
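  • A minimal sketch of error classification, mapping exception types to recovery strategies (the mapping rules here are illustrative):

```python
def classify_error(exc):
    """Map an exception to a recovery strategy (illustrative rules)."""
    if isinstance(exc, TimeoutError):
        return "retry"      # transient execution error
    if isinstance(exc, ConnectionError):
        return "fallback"   # environmental error: try another source
    if isinstance(exc, ValueError):
        return "replan"     # bad inputs or broken assumptions
    return "escalate"       # unknown failure: hand off to a human

print(classify_error(TimeoutError()))  # retry
print(classify_error(KeyError("x")))   # escalate
```

A dispatcher can then route each failure to the matching handler instead of treating all errors identically.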

The exception handling process

  • Exception handling can be modeled as:

    \[s_{t+1} = \begin{cases} f(s_t, a_t) & \text{if no error} \\ g(s_t, e_t) & \text{if error occurs} \end{cases}\]
    • where:

      • \(e_t\) is the detected error
      • \(g\) is the recovery function
  • The system must detect the error, classify it, and apply an appropriate recovery strategy.

Recovery strategies

  • Different strategies can be applied depending on the nature of the failure, as follows:

    • Retry mechanisms:

      • Re-execute the failed action
      • Useful for transient errors
    • Fallback strategies:

      • Use alternative tools or methods
      • Provide degraded but functional output
    • Replanning:

      • Adjust the plan to account for failure
      • Often used in dynamic environments
    • Human escalation:

      • Request human intervention for critical failures
    • Graceful degradation:

      • Continue operation with reduced capability
  • These strategies ensure that the system remains functional even under adverse conditions.
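  • As a minimal sketch, retry, fallback, and graceful degradation can be combined in one function. Here `fetch_primary` and `fetch_fallback` are hypothetical stand-ins for a primary and an alternative weather API:

```python
import time

def fetch_primary(city):
    # Simulate a persistently failing primary source
    raise TimeoutError("primary weather API timed out")

def fetch_fallback(city):
    return f"Fallback forecast for {city}: sunny"

def get_weather(city, retries=2, delay=0.1):
    """Retry the primary source, then degrade to the fallback."""
    for attempt in range(retries):
        try:
            return fetch_primary(city)
        except TimeoutError:
            time.sleep(delay)  # transient error: wait briefly and retry
    try:
        return fetch_fallback(city)  # fallback strategy
    except Exception:
        # Graceful degradation: inform the user instead of crashing
        return f"Sorry, no weather data is available for {city}."

print(get_weather("Oslo"))
```

Each layer only engages when the previous one is exhausted, so the common case stays fast while failures remain recoverable.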

Example

  • Consider an agent that queries a weather API:

    • The API fails due to a timeout
    • The agent retries the request
    • If failure persists, it switches to an alternative API
    • If no data is available, it informs the user gracefully
  • Without exception handling, the system would simply fail. With it, the system adapts and continues.

Implementation

  • LangChain supports exception handling through standard Python constructs combined with agent logic.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def safe_invoke(prompt, retries=3):
    # Retry transient failures; fall back to a graceful message afterwards
    for attempt in range(retries):
        try:
            return llm.invoke(prompt).content
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}. Retrying...")
    return "Unable to complete the request after multiple attempts."

response = safe_invoke("Explain black holes.")
print(response)
  • This example demonstrates a simple retry mechanism for handling failures.

Exception handling loop

  • Exception handling often operates as a loop that manages failures systematically. The core steps of the loop are as follows:

    1. Attempt action
    2. Detect error
    3. Classify error
    4. Apply recovery strategy
    5. Continue execution

Exception handling in agentic systems

  • This pattern integrates with other patterns:

    • With planning: Enables replanning after failure
    • With tool use: Handles tool-related errors
    • With reflection: Diagnoses reasoning failures
    • With monitoring: Detects deviations from expected behavior
  • This interconnectedness ensures that recovery is not isolated but part of the overall system behavior.

Failure modes

  • Even exception handling can fail if not designed properly:

    • Silent failures: Errors go undetected
    • Infinite retries: System gets stuck retrying
    • Incorrect recovery: Wrong strategy applied
    • Overhead: Excessive handling slows down execution
  • To mitigate these issues:

    • Implement clear error detection mechanisms
    • Limit retries and define thresholds
    • Use appropriate recovery strategies
    • Monitor system behavior

Human-in-the-Loop

Core Idea

  • As agentic systems evolve from simple workflows into autonomous, goal-driven architectures, a fundamental tension emerges between capability and control. The more autonomy an agent is given through patterns such as planning, tool use, and multi-agent collaboration, the greater the need for mechanisms that ensure reliability, correctness, and alignment with human intent. This is where human-in-the-loop (HITL) becomes essential.

  • Agentic systems operate in environments that are inherently uncertain, dynamic, and often high-stakes. While models can reason, act, and adapt, they do not possess true judgment, accountability, or contextual awareness in the way humans do. This creates a gap between what systems can do and what they should be allowed to do autonomously. HITL bridges this gap by embedding human oversight directly into the system’s execution loop.

  • Rather than viewing autonomy as an all-or-nothing property, modern agentic design treats it as a spectrum. At one end are fully automated workflows with minimal intervention, and at the other are tightly controlled systems where humans validate every step. Human-in-the-loop enables systems to operate flexibly along this spectrum, introducing checkpoints, approvals, and feedback mechanisms exactly where they are needed.

  • This pattern is particularly critical in scenarios involving ambiguity, ethical considerations, or irreversible actions. In such cases, purely automated decision-making can lead to compounding errors or unintended consequences. By incorporating human judgment at key points, systems gain an additional layer of robustness and accountability without sacrificing the efficiency benefits of automation.

  • More broadly, HITL reflects a shift toward hybrid intelligence systems, where humans and AI collaborate rather than compete. The agent handles scale, speed, and pattern recognition, while the human provides oversight, intuition, and contextual grounding. Together, they form a system that is more reliable and adaptable than either could achieve alone.

  • This section explores how human-in-the-loop is implemented as a design pattern within agentic systems, and how it integrates with other patterns such as reflection, evaluation, and guardrails to enable safe and effective real-world deployment.

Why human-in-the-loop is needed

  • Fully autonomous systems face inherent limitations:

    • They may produce incorrect or unsafe outputs
    • They lack contextual understanding in ambiguous situations
    • They may misinterpret goals or constraints
    • They cannot always be trusted for high-stakes decisions
  • Human-in-the-loop addresses these limitations by introducing checkpoints where human input can:

    • Validate decisions
    • Correct errors
    • Provide additional context
    • Override system behavior
  • This aligns with approaches such as Deep Reinforcement Learning from Human Preferences by Christiano et al. (2017), where human feedback is used to guide agent behavior toward desired outcomes.

  • The following figure shows the integration of human checkpoints within the agent workflow, enabling validation, correction, and control at different stages. This illustrates how human input is interleaved with automated processes.

Modes of human involvement

  • Human interaction can occur at different stages of the agent workflow, as follows:

    • Pre-execution guidance:

      • Humans define goals, constraints, or plans
      • Ensures correct initial setup
    • Mid-execution intervention:

      • Humans review intermediate outputs
      • Can approve, modify, or redirect actions
    • Post-execution validation:

      • Humans evaluate final outputs
      • Provide feedback for improvement
    • Continuous supervision:

      • Humans monitor system behavior in real time
  • Each mode offers different trade-offs between autonomy and control.

The HITL interaction loop

  • Human-in-the-loop can be modeled as an augmented decision process:

    \[a_t = \begin{cases} \pi(s_t) & \text{if autonomous} \\ \pi_h(s_t) & \text{if human intervention} \end{cases}\]
    • where:

      • \(\pi\) is the agent policy
      • \(\pi_h\) is the human-influenced decision
  • This introduces an external control signal that can override or guide the agent.

Example

  • Consider an AI system assisting with legal document drafting:

    • The agent generates a draft
    • A human reviews and edits the content
    • The agent incorporates feedback
    • The process repeats until approval
  • Without HITL, errors could propagate into critical outputs. With HITL, quality and accountability are significantly improved.

Implementation

  • LangChain supports human-in-the-loop patterns through interactive workflows and checkpoints.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def human_review(output):
    print("Model output:", output)
    return input("Approve? (type 'yes' or enter feedback): ")

response = llm.invoke("Draft a business email.")
decision = human_review(response.content)

if decision.strip().lower() == "yes":
    final_output = response.content
else:
    # Pass both the draft and the human feedback back to the model
    final_output = llm.invoke(
        f"Revise this draft based on the feedback.\n\n"
        f"Draft:\n{response.content}\n\nFeedback: {decision}"
    ).content

print(final_output)
  • This example demonstrates a simple human approval step before finalizing output.

HITL in agentic systems

  • Human-in-the-loop integrates with multiple patterns:

    • With reflection: Humans provide higher-quality critiques
    • With learning: Human feedback improves future performance
    • With planning: Humans validate or refine plans
    • With monitoring: Humans detect anomalies and intervene
  • This makes HITL a key mechanism for ensuring alignment and reliability.

Failure modes

  • While beneficial, HITL introduces challenges:

    • Latency: Human intervention slows down execution
    • Scalability: Human involvement does not scale easily
    • Inconsistency: Different humans may provide different feedback
    • Over-reliance: Excessive dependence on humans reduces autonomy
  • To mitigate these issues:

    • Use HITL selectively for high-risk or ambiguous tasks
    • Define clear guidelines for human intervention
    • Combine with automated validation where possible
    • Optimize workflows to minimize delays

Guardrails and Safety

Core Idea

  • Guardrails and safety represent a critical control layer in agentic systems, ensuring that increasing autonomy does not lead to uncontrolled or harmful behavior. As agents become more capable through patterns like planning, tool use, memory, and learning, they transition from passive assistants to systems that can take actions, make decisions, and influence real-world outcomes. This increased capability introduces corresponding risks, making safety mechanisms not optional but foundational.

  • At a systems level, guardrails can be understood as constraint-enforcing functions applied throughout the agent lifecycle:

\[a_t' = \mathcal{G}(a_t), \quad \text{where } \mathcal{G} \text{ enforces safety, policy, and operational constraints}\]
  • Rather than being a single checkpoint, guardrails operate as a layered system across the entire architecture. They are applied at input ingestion, during reasoning and planning, before tool execution, and after output generation. This layered enforcement ensures that safety is maintained continuously, not just validated at the end.

  • In production architectures, guardrails serve multiple roles:

    • They act as policy enforcement mechanisms, ensuring compliance with business rules and regulations
    • They function as risk mitigation systems, preventing unsafe or unintended actions
    • They provide trust boundaries, especially when agents interact with external systems or sensitive data
    • They enable controlled autonomy, allowing systems to act independently within safe limits
  • This pattern is closely related to alignment research such as Constitutional AI by Bai et al. (2022), which shows that embedding explicit principles into system behavior can guide outputs toward safer and more aligned responses.

  • Importantly, guardrails are not meant to replace other patterns but to complement them. They work in conjunction with:

    • Tool use, by restricting what actions can be executed
    • Planning, by ensuring generated plans adhere to constraints
    • Reflection, by validating and correcting unsafe outputs
    • Human-in-the-loop, by escalating high-risk decisions
  • From a design perspective, guardrails introduce a shift from “can the system do this?” to “should the system do this?” This distinction is essential for building reliable, production-grade agentic systems.

  • Ultimately, guardrails and safety transform agentic systems from powerful but potentially unpredictable entities into controlled, trustworthy systems capable of operating in real-world environments.

Motivation

  • Without safety mechanisms, agentic systems may:

    • Generate harmful or unsafe outputs
    • Execute unintended or dangerous actions
    • Violate constraints or policies
    • Amplify biases or hallucinations
  • Guardrails mitigate these risks by enforcing rules and validating outputs at different stages of execution.

  • This aligns with alignment research such as Constitutional AI by Bai et al. (2022), which demonstrates that a predefined set of principles can guide model behavior toward safer outputs without requiring constant human supervision.

Types of guardrails

  • Guardrails can be applied at multiple levels within an agentic system.

    • Input guardrails:

      • Validate and sanitize user inputs
      • Prevent prompt injection or malicious inputs
    • Output guardrails:

      • Filter or modify generated outputs
      • Ensure compliance with policies
    • Tool guardrails:

      • Restrict which tools can be used
      • Validate tool inputs and outputs
    • Execution guardrails:

      • Enforce constraints during workflow execution
      • Prevent unsafe sequences of actions
  • These layers collectively ensure system safety.
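  • The input and tool layers can be sketched as small check functions composed before execution. The blocked patterns and tool allowlist below are illustrative placeholders:

```python
BLOCKED_PATTERNS = ["ignore previous instructions"]  # illustrative injection markers
ALLOWED_TOOLS = {"search", "calculator"}             # illustrative tool allowlist

def input_guardrail(user_input):
    """Reject inputs matching known prompt-injection patterns."""
    for pattern in BLOCKED_PATTERNS:
        if pattern in user_input.lower():
            raise ValueError("Input rejected: possible prompt injection")
    return user_input

def tool_guardrail(tool_name, args):
    """Permit only tools on the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool '{tool_name}' is not permitted")
    return tool_name, args

safe_input = input_guardrail("What is 2 + 2?")
tool, args = tool_guardrail("calculator", {"expression": "2 + 2"})
print(safe_input, tool)
```

Each layer raises on violation, so the orchestrator can block, substitute, or escalate before any unsafe step runs.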

The guardrail enforcement process

  • Guardrails can be modeled as constraint functions applied to actions and outputs:

    \[a_t' = \mathcal{G}(a_t)\]
    • where:

      • \(a_t\) is the original action
      • \(\mathcal{G}\) is the guardrail function
      • \(a_t'\) is the validated or modified action
  • If an action violates constraints, it can be blocked, modified, or escalated. This ensures that only safe actions are executed.

Example

  • Consider an agent with access to a payment API:

    • The agent attempts to execute a transaction
    • A guardrail checks if the transaction exceeds a threshold
    • If it does, the action is blocked or requires human approval
  • Without guardrails, the system could perform unsafe operations. With guardrails, constraints are enforced.

Implementation

  • Guardrails can be implemented using validation layers and conditional logic. The following example demonstrates a simple output filtering mechanism.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def output_guardrail(response):
    # Naive keyword filter for illustration; production systems would use
    # classifiers or policy engines for this check
    if "harmful" in response.lower():
        return "Output blocked due to safety concerns."
    return response

response = llm.invoke("Generate a response.")

safe_response = output_guardrail(response.content)
print(safe_response)

Guardrails in agentic workflows

  • Guardrails are typically applied at multiple points:

    1. Before processing input
    2. During reasoning and planning
    3. Before executing actions
    4. After generating outputs
  • The following figure illustrates guardrail design, with enforcement of safety constraints at multiple stages of the agent workflow, including input validation, action filtering, and output moderation.

  • This layered approach ensures comprehensive safety coverage.

Guardrails and other patterns

  • Guardrails interact with several other patterns:

    • With tool use: Restricts unsafe tool interactions
    • With planning: Ensures plans adhere to constraints
    • With monitoring: Detects violations in real time
    • With human-in-the-loop: Escalates critical decisions
  • This integration ensures that safety is embedded throughout the system.

Failure modes

  • Improperly designed guardrails can introduce issues:

    • Over-restriction: Blocking useful or valid actions
    • Under-restriction: Failing to prevent harmful behavior
    • False positives/negatives: Incorrect validation decisions
    • Latency: Additional checks slow down execution

  • To mitigate these challenges:

    • Define clear and balanced constraints
    • Use layered guardrails for redundancy
    • Continuously evaluate and refine rules
    • Combine automated checks with human oversight

Evaluation

Core Idea

  • Evaluation is the foundational layer that transforms agentic systems from experimental prototypes into reliable, production-ready systems. As these systems evolve from simple prompt-response interactions into complex, multi-step architectures capable of reasoning, planning, acting, and adapting, the need for structured and quantitative assessment becomes essential. Without evaluation, there is no reliable way to determine whether these increasingly sophisticated behaviors are effective, correct, or aligned with intended goals.

  • At its core, evaluation provides the mechanism for turning agent behavior into measurable signals. It enables developers to validate correctness, detect failure modes, and systematically improve performance. Rather than relying on intuition or manual inspection—which quickly becomes infeasible as system complexity grows—evaluation introduces a structured framework for assessing outputs across key dimensions such as accuracy, quality, efficiency, and robustness.

  • From a systems perspective, evaluation acts as the feedback backbone that connects execution to learning. It creates visibility into how an agent behaves across different stages of its operation, making it possible to trace decisions, identify breakdowns, and understand outcomes. This visibility also enables comparability between different system designs, prompts, or models, allowing teams to make informed decisions about trade-offs and optimizations. In turn, this supports continuous improvement through iterative refinement and reinforces accountability in production environments where reliability and correctness are critical.

  • Importantly, evaluation is not just diagnostic—it is operational. The signals it generates can feed directly into monitoring systems, trigger corrective actions, and inform future updates. In this way, evaluation becomes deeply integrated into the lifecycle of an agentic system, guiding reflection, validating planning, informing learning, and enforcing guardrails.

  • As a cross-cutting concern, evaluation touches nearly every aspect of agent design. It is the mechanism that provides visibility, turns performance into insight, and enables systems to be measured, compared, and optimized systematically. Without it, agentic systems lack the ability to understand or improve their own behavior, making evaluation not just a supporting component, but a fundamental requirement for building robust, scalable, and trustworthy intelligent systems.

Why evaluation is needed

  • Without proper evaluation, agentic systems face several issues:

    • Inability to measure progress or success
    • Difficulty identifying failure modes
    • Lack of feedback for learning and adaptation
    • Poor comparability between system versions
  • Evaluation transforms system behavior into measurable outcomes, enabling continuous improvement.

  • This mirrors standard empirical practice in machine learning, where models are assessed against explicitly defined metrics. Benchmarks in NLP, for example, have been critical for tracking progress across models and techniques.

Defining evaluation metrics

  • Metrics depend on the task and system goals. Common categories include:

    • Accuracy metrics:

      • Correctness of outputs
      • Factual consistency
      • Task completion rate
    • Quality metrics:

      • Coherence and clarity
      • Relevance
      • Completeness
    • Efficiency metrics:

      • Latency
      • Resource usage
      • Cost
    • Robustness metrics:

      • Performance under noisy or adversarial inputs
      • Stability across different scenarios
  • These metrics provide a multi-dimensional view of system performance.
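To make the multi-dimensional view concrete, the sketch below groups the four categories into a single report object. The field names, units, and sample values are illustrative assumptions, not a prescribed schema.

```python
# A hedged sketch of grouping the metric categories above into one report.
from dataclasses import dataclass

@dataclass
class EvalReport:
    accuracy: float      # e.g. task completion rate in [0, 1]
    quality: float       # e.g. averaged coherence/relevance score
    latency_s: float     # efficiency: wall-clock seconds per task
    robustness: float    # e.g. accuracy under perturbed inputs

    def summary(self) -> dict:
        # Report each axis separately rather than collapsing to one number.
        return {
            "accuracy": self.accuracy,
            "quality": self.quality,
            "latency_s": self.latency_s,
            "robustness": self.robustness,
        }

report = EvalReport(accuracy=0.92, quality=0.85, latency_s=1.4, robustness=0.78)
print(report.summary())
```

Keeping the axes separate avoids prematurely collapsing trade-offs (a fast but incoherent system should not look identical to a slow but accurate one).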

The evaluation function

  • Evaluation can be formalized as:

    \[M = \mathcal{E}(y, y^*)\]
    • where:

      • \(y\) is the system output
      • \(y^*\) is the ground truth or expected output
      • \(\mathcal{E}\) is the evaluation function
  • In cases where ground truth is unavailable, proxy metrics or human evaluation may be used.
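Two concrete, hedged choices of \(\mathcal{E}\) are sketched below: exact match for short-answer tasks and token-level F1 for free-form text. Both are illustrative; production systems typically use task-specific scorers.

```python
# Minimal instances of the evaluation function M = E(y, y*).
from collections import Counter

def exact_match(y: str, y_star: str) -> float:
    """E(y, y*) = 1.0 if the normalized output equals the reference."""
    return float(y.strip().lower() == y_star.strip().lower())

def token_f1(y: str, y_star: str) -> float:
    """Token-overlap F1, a softer E for free-form text."""
    pred, gold = y.lower().split(), y_star.lower().split()
    common = Counter(pred) & Counter(gold)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))   # 1.0
print(token_f1("the capital is Paris", "Paris"))
```

Exact match is strict and interpretable; token F1 credits partial overlap, which suits longer outputs where verbatim agreement is unrealistic.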

Types of evaluation

  • Evaluation can be performed at different stages and levels, as follows:

    • Offline evaluation:

      • Conducted using predefined datasets
      • Useful for benchmarking
    • Online evaluation:

      • Conducted during deployment
      • Reflects real-world performance
    • Human evaluation:

      • Involves human judgment
      • Useful for subjective criteria
    • Automated evaluation:

      • Uses metrics or models to score outputs
      • Scalable and consistent
  • These approaches are often combined for comprehensive assessment.
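As a minimal illustration of offline evaluation, the sketch below scores a predictor against a small predefined dataset. The `predict` function is a hypothetical, deterministic stand-in for an agent or model call.

```python
# Offline evaluation sketch: batch-score a predictor on a fixed dataset.
def predict(question: str) -> str:
    # Hypothetical stand-in for an agent/model call.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")

dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "largest planet", "expected": "Jupiter"},
]

correct = sum(
    predict(ex["input"]).strip().lower() == ex["expected"].strip().lower()
    for ex in dataset
)
accuracy = correct / len(dataset)
print(f"offline accuracy: {accuracy:.2f}")  # 2 of 3 correct
```

The same harness shape carries over to online evaluation by swapping the fixed dataset for logged production traffic.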

Example

  • Consider an agent generating summaries:

    • Accuracy is measured by comparing against reference summaries
    • Quality is evaluated using coherence and readability metrics
    • Efficiency is measured by latency and cost
  • By tracking these metrics, the system can be improved iteratively.
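For the summarization example, one simple accuracy signal is unigram recall against a reference summary, in the spirit of ROUGE-1. This is a toy overlap measure; a real evaluation would use an established scoring library.

```python
# Toy summary scoring: fraction of reference tokens covered by the candidate.
def unigram_recall(candidate: str, reference: str) -> float:
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return sum(tok in cand for tok in ref) / len(ref)

reference = "the agent completed the task successfully"
candidate = "the agent finished the task"
print(unigram_recall(candidate, reference))  # 4 of 6 reference tokens covered
```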

Implementation

  • Evaluation can be integrated into workflows using scoring functions.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def evaluate_response(response: str, reference: str) -> str:
    # Normalize and check containment rather than exact equality:
    # model outputs often include surrounding text (e.g. "2 + 2 = 4").
    return "correct" if reference.strip().lower() in response.strip().lower() else "incorrect"

response = llm.invoke("What is 2 + 2?")
score = evaluate_response(response.content, "4")

print("Score:", score)
  • This example demonstrates a simple evaluation mechanism.

Evaluation loop

  • Evaluation is often part of a continuous loop:

    1. Generate output
    2. Measure performance
    3. Analyze results
    4. Improve system
  • The following figure shows evaluation and monitoring of agents, with a continuous evaluation loop in which outputs are measured against metrics and the results are used to guide system improvements.

  • This loop is central to maintaining and improving system quality.
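The four steps above can be sketched as code. Here `generate` and `measure` are hypothetical stand-ins; the point is the generate-measure-analyze-improve shape, not the specific functions.

```python
# Sketch of the continuous evaluation loop.
def generate(effort: int) -> int:
    # Stand-in: quality score (0-100) that rises with tuning effort.
    return min(100, 50 + 10 * effort)

def measure(quality: int, target: int = 90) -> bool:
    return quality >= target

effort, history = 0, []
for step in range(10):
    quality = generate(effort)           # 1. generate output
    ok = measure(quality)                # 2. measure performance
    history.append((step, quality, ok))  # 3. analyze results
    if ok:
        break
    effort += 1                          # 4. improve system

print(history[-1])  # (4, 90, True): target reached on the fifth iteration
```

In a real system, "improve" might mean revising a prompt, adjusting a plan, or retraining a component; the loop structure is the same.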

Evaluation in agentic systems

  • Evaluation interacts with multiple patterns:

    • With learning: Provides signals for updating behavior
    • With monitoring: Tracks real-time performance
    • With guardrails: Ensures compliance with constraints
    • With planning: Evaluates plan effectiveness
  • This integration ensures that evaluation is not isolated but embedded throughout the system lifecycle.

Failure modes

  • Evaluation introduces its own challenges:

    • Metric misalignment: Metrics may not reflect true objectives
    • Incomplete coverage: Not all scenarios are evaluated
    • Bias in evaluation: Metrics may favor certain outputs
    • Over-optimization: System may optimize for metrics rather than goals
  • To mitigate these issues:

    • Use multiple complementary metrics
    • Include human evaluation where needed
    • Continuously update evaluation criteria
    • Monitor for unintended consequences
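The first mitigation, using multiple complementary metrics, can be sketched as a worst-axis aggregation: an output that games one metric while failing another receives a low combined score. The metric names and values here are illustrative.

```python
# Worst-axis aggregation over complementary metrics: every dimension
# must be reasonable, which blunts over-optimization of any one metric.
def combined_score(metrics: dict) -> float:
    return min(metrics.values())

gamed = {"accuracy": 0.99, "coherence": 0.30, "safety": 0.95}
balanced = {"accuracy": 0.90, "coherence": 0.88, "safety": 0.92}

print(combined_score(gamed))     # 0.3  (dragged down by coherence)
print(combined_score(balanced))  # 0.88
```

Minimum aggregation is deliberately conservative; weighted averages are softer but easier to game, which is exactly the over-optimization failure mode described above.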


Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledAgenticDesignPatterns,
  title   = {Agentic Design Patterns},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}