From chatbots to autonomous systems — tool use, function calling, memory, multi-agent orchestration, and safety guardrails.
Welcome to Building AI Agents. In this talk, we'll explore how to create autonomous systems that go beyond simple chatbots.
Let's start by understanding how AI has evolved. As you can see in the diagram, we've progressed through three generations. Generation one gave us rule-based bots with scripted responses. Generation two brought us L-L-M chatbots that could understand natural language. And now, generation three delivers AI agents that act as autonomous systems. These agents can also coordinate as multi-agent systems to take on more complex tasks.
So what's the real difference? {{step}}A traditional chatbot is reactive and stateless. It responds to your queries, has no access to external tools, relies on short-term memory, and focuses on single-turn conversations. {{step}}An AI agent, by contrast, is proactive and autonomous. It takes action on your behalf, can use tools and call A-P-I-s, maintains persistent memory across sessions, and can reason through multi-step problems.
There are five core capabilities that separate agents from chatbots. First, autonomy — the ability to make decisions without human intervention. Second, tool use — executing functions and calling external A-P-I-s. Third, planning — breaking complex goals down into sequential steps. Fourth, memory — maintaining context across conversations and entire sessions. And fifth, adaptability — learning from feedback and adjusting strategies over time.
One of the most powerful agent capabilities is tool use. This is how we extend the language model's abilities beyond just generating text, letting it interact with the real world.
Here's how function calling works end-to-end. A user submits a query to the language model. The model decides whether it needs to use external tools. If not, it returns a direct response. If yes, it selects from available tools like a weather A-P-I, database query, or web search, and emits a structured call with the arguments. The application then executes that function, returns the result to the model, and the model uses that information to craft a final response to the user.
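To make that flow concrete, here is a minimal sketch in plain Python. It is not the code on the slide: fake_model is a hypothetical stand-in for the language model's decision step, and get_weather returns canned data rather than calling a real A-P-I.

```python
from typing import Callable

# Hypothetical tool: returns canned data, no real API call.
def get_weather(location: str) -> str:
    return f"Sunny in {location}"

# Registry of tools the application is willing to run.
TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}

def fake_model(query: str) -> dict:
    """Stand-in for the model deciding whether a tool is needed (assumption)."""
    if "weather" in query.lower():
        return {"tool": "get_weather", "arguments": {"location": "Paris"}}
    return {"tool": None, "content": "Hello! How can I help?"}

def run_turn(query: str) -> str:
    decision = fake_model(query)
    if decision["tool"] is None:
        return decision["content"]           # direct response, no tool needed
    tool = TOOLS[decision["tool"]]           # look up the selected tool
    result = tool(**decision["arguments"])   # the application runs the function
    # A real system would send the result back to the model for a final answer.
    return f"Tool result: {result}"
```

Calling run_turn with a weather question takes the tool branch, while a greeting returns the direct response.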
Looking at this code, we're using LangChain to define tools as decorated Python functions. We have a tool called get_weather that takes a location and returns weather data. Another tool called search_docs queries our internal documentation. We pass these tools to create_openai_functions_agent, which handles all the selection and calling logic automatically.
Tools need to be defined with a specific schema so the language model understands what they do and how to use them. This J-S-O-N shows a search_database tool with a description, parameter types, and required fields. The model reads this schema and knows exactly what parameters to pass when invoking the tool.
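As a rough illustration of what such a schema can look like, here is a sketch written as a Python dictionary, with a small checker for the arguments a model might pass. The description text, the choice of query as the only required field, and the validate_arguments helper are assumptions for illustration, not the exact J-S-O-N from the slide.

```python
# Illustrative schema; field values are assumptions, not the slide's JSON.
SEARCH_DATABASE_SCHEMA = {
    "name": "search_database",
    "description": "Search the product database.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_price": {"type": "number"},
        },
        "required": ["query"],
    },
}

# Maps the schema's type names to Python types for a basic check.
_PY_TYPES = {"string": str, "number": (int, float)}

def validate_arguments(schema: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    params = schema["parameters"]
    errors = []
    for name in params["required"]:
        if name not in arguments:
            errors.append(f"missing required parameter: {name}")
    for name, value in arguments.items():
        spec = params["properties"].get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"wrong type for {name}")
    return errors
```

Checking arguments against the schema before running a tool is one simple way to catch malformed calls early.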
Let's see this in action. Running the agent, we give it a query about affordable laptops under eight hundred dollars. The agent analyzes the request, determines it needs the search_database tool, and calls it with query equals laptop and a max_price filter of eight hundred. The tool returns twelve results, and the agent synthesizes those into three top recommendations with prices and links.
Agents need to remember things. Without good memory management, they can't maintain coherent conversations or learn about their users. Let's explore the different memory strategies available.
We have three main memory types. {{step}}Short-term memory uses a conversation buffer for the current session. It stores recent messages in a window of maybe ten to fifty exchanges, giving us fast access. {{step}}Long-term memory uses a vector store for past conversations. We can semantically search through older sessions to find relevant context and user preferences, all stored persistently. {{step}}And semantic memory builds a knowledge base of facts and entities with their relationships, which we can integrate with retrieval-augmented generation.
Looking at this code, we implement short-term memory with ConversationBufferMemory, storing the last ten exchanges for fast retrieval. For long-term memory, we use Pinecone as a vector store. We embed past conversations and can query them with similarity_search to get the three most relevant ones. This gives us the best of both worlds — immediate context and deep historical knowledge.
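The same two-tier idea can be sketched without any external services. This is an assumption-heavy toy, not the LangChain or Pinecone code on the slide: the class names are hypothetical, and word counts stand in for real embeddings.

```python
import math
from collections import Counter, deque

class ShortTermMemory:
    """Keeps only the most recent exchanges, like a sliding buffer."""
    def __init__(self, max_exchanges: int = 10):
        self.buffer = deque(maxlen=max_exchanges)

    def add(self, user: str, agent: str) -> None:
        self.buffer.append((user, agent))  # oldest entry drops off when full

def _embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts (a real system would use a model).
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class LongTermMemory:
    """In-memory stand-in for a vector store."""
    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.items.append((_embed(text), text))

    def similarity_search(self, query: str, k: int = 3) -> list[str]:
        q = _embed(query)
        ranked = sorted(self.items, key=lambda item: _cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

The short-term buffer gives cheap access to recent turns, while the search step pulls back only the older material that resembles the current query.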
The table compares different strategies for managing context windows. Full history gives complete context but hits token limits quickly. A sliding window is memory efficient but loses old context. Summarization compresses context but sacrifices detail. And semantic search retrieves only relevant context but adds retrieval overhead. Most production systems use a combination of these approaches.
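The sliding window strategy is the easiest of these to sketch. The version below assumes a crude token count based on whitespace-separated words; a production system would use the model's real tokenizer, and the function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Crude proxy for tokens: whitespace-separated words (assumption).
    return len(text.split())

def sliding_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit inside the token budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break                        # older context is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order
```

This shows the trade-off from the table directly: the budget is respected, but anything older than the window is simply gone.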
Here's a real example. In session one on Monday, the user tells us their favorite color is blue, and we store that preference. Two days later in session two, the user asks for a shirt recommendation. The agent retrieves the stored preference, finds favorite_color equals blue, and recommends a navy blue Oxford shirt. That continuity across sessions is what makes agents powerful.
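A minimal sketch of that cross-session behavior, assuming preferences are kept in a small J-S-O-N file. The PreferenceStore class, the user I-D, and the fallback recommendation are hypothetical; a real agent would more likely use a database or vector store.

```python
import json
import tempfile
from pathlib import Path

class PreferenceStore:
    """Persists user preferences to a JSON file so they survive sessions."""
    def __init__(self, path: Path):
        self.path = path

    def _load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def set(self, user_id: str, key: str, value: str) -> None:
        data = self._load()
        data.setdefault(user_id, {})[key] = value
        self.path.write_text(json.dumps(data))

    def get(self, user_id: str, key: str, default=None):
        return self._load().get(user_id, {}).get(key, default)

def recommend_shirt(store: PreferenceStore, user_id: str) -> str:
    color = store.get(user_id, "favorite_color")
    # Fallback color is an arbitrary placeholder for users with no preference.
    return f"{color} Oxford shirt" if color else "white Oxford shirt"

def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "prefs.json"
        # Session one: store the preference.
        PreferenceStore(path).set("user-1", "favorite_color", "blue")
        # Session two: a fresh store object reads it back from disk.
        return recommend_shirt(PreferenceStore(path), "user-1")
```

The second session creates a brand-new store object, so the only thing linking the two sessions is what was written to disk.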
As systems get more complex, a single agent often isn't enough. Multi-agent orchestration lets us coordinate specialized agents to tackle sophisticated problems more effectively.
There are several patterns for orchestrating multiple agents. {{step}}Sequential uses a chain of agents where research flows into analysis, which then feeds into writing. Each agent's output becomes the next agent's input, creating a clear linear workflow. {{step}}Parallel execution runs multiple independent agents concurrently on different data sources, then merges their results for faster completion. {{step}}Hierarchical uses a manager-worker pattern where an orchestrator delegates specialized tasks to workers and then synthesizes their results. {{step}}And collaborative agents debate and reach consensus through peer interaction, iteratively refining the output for better quality.
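The first two patterns can be sketched with ordinary functions standing in for agents. Everything here is illustrative: research, analyze, and write are hypothetical stubs that just wrap strings, where real agents would call a model.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub agents: each takes text in and hands text on.
def research(topic: str) -> str:
    return f"notes on {topic}"

def analyze(notes: str) -> str:
    return f"analysis of {notes}"

def write(analysis: str) -> str:
    return f"report: {analysis}"

def run_sequential(topic: str) -> str:
    """Sequential pattern: each agent's output is the next agent's input."""
    result = topic
    for agent in (research, analyze, write):
        result = agent(result)
    return result

def run_parallel(sources: list[str]) -> list[str]:
    """Parallel pattern: independent agents run concurrently, then merge."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(research, sources))
```

The sequential version is easy to reason about but only as fast as the sum of its steps, while the parallel version helps only when the sub-tasks really are independent.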
In the hierarchical setup shown here, a user request goes to an orchestrator agent. The orchestrator decomposes the task and delegates to three specialist agents: a research agent, an analysis agent, and a writing agent. The research agent queries web search and databases. The analysis agent processes that data. The writing agent generates content. All results flow back to the orchestrator, which produces the final output.
This code uses LangGraph to build a multi-agent system. We define three specialized agents using create_react_agent with their own tools. Then we build a workflow graph, adding nodes for each agent and defining edges to show the flow from research to analysis to writing. Finally, we compile and execute the graph to produce our result.
When agents pass information to each other, they need a structured format. This J-S-O-N shows a message from the research agent to the analyst agent. It includes a task I-D, the data being passed like competitors and market size, and metadata about confidence level and sources. This structure ensures clear, reliable communication.
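One way to pin down such a message format in code is a small dataclass that round-trips through J-S-O-N. The field names sender and recipient, and the example values in the usage note, are assumptions; the slide's exact keys may differ.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class AgentMessage:
    """Structured envelope for passing work between agents (illustrative)."""
    task_id: str
    sender: str
    recipient: str
    data: dict
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "AgentMessage":
        # Raises TypeError if a required field is missing or unknown.
        return cls(**json.loads(raw))
```

For example, a research agent might build AgentMessage("task-1", "research", "analyst", {"competitors": ["A", "B"]}, {"confidence": 0.8}), serialize it with to_json, and the analyst would rebuild the same object with from_json.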
The table shows different coordination strategies. Shared state works well for real-time synchronization like collaborative editing. Message passing handles asynchronous tasks like email pipelines. An event bus decouples agents in systems like notifications. And workflow engines orchestrate complex multi-step processes like approval chains.
Before we deploy agents to production, we need safety guardrails. These protect against harmful outputs, unsafe actions, and unexpected failures. Let's explore the critical safeguards.
We need safety at every layer. {{step}}Input filters do pre-processing: detecting prompt injection, masking personally identifiable information, and preventing jailbreaks. {{step}}Output validation happens post-processing with content moderation, fact checking, and bias detection. {{step}}Action control sets runtime limits like rate limiting, permission checks, and sandboxing. {{step}}And human-in-the-loop provides oversight through approval gates, audit logs, and escalation paths.
We can use libraries like Guardrails to implement safety. We define validators like toxic-language, P-I-I-detector, and profanity-free. When the agent produces output, we validate it against these rules, filtering toxic content and redacting P-I-I automatically. This happens before the response reaches the user.
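To show the shape of that validation step without depending on any particular library, here is a simplified sketch using regular expressions. It is not the Guardrails A-P-I: the patterns are deliberately naive, and the blocked-term list and refusal message are placeholders.

```python
import re

# Naive patterns for illustration; real PII detection needs far more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
BLOCKED_TERMS = {"idiot"}  # hypothetical stand-in for a toxicity check

def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def validate_output(text: str) -> tuple[bool, str]:
    """Return (is_safe, text_to_show): redact PII, block flagged terms."""
    cleaned = redact_pii(text)
    words = set(re.findall(r"[a-z']+", cleaned.lower()))
    if words & BLOCKED_TERMS:
        return False, "Sorry, I can't share that response."
    return True, cleaned
```

The key design point is ordering: the check runs on the agent's output before anything is shown to the user.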
Here's an essential checklist for production agents. Input sanitization blocks prompt injection and malicious inputs. Output filtering screens for toxic content, P-I-I, and hallucinations. Action permissions whitelist allowed tools and A-P-I endpoints. Rate limiting prevents abuse with per-user quotas. Audit logging tracks all actions for debugging and compliance. Escalation paths route risky decisions to humans. And rollback capability undoes harmful actions automatically.
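Two items from that checklist, action permissions and rate limiting, can be sketched together. The allowed tool names, the quota values, and the authorize helper are assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Optional

ALLOWED_TOOLS = {"search_database", "get_weather"}  # illustrative allowlist

class RateLimiter:
    """Per-user quota: at most max_calls within a rolling time window."""
    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.calls: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self.calls[user_id]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()              # forget calls outside the window
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True

def authorize(tool_name: str, user_id: str, limiter: RateLimiter,
              now: Optional[float] = None) -> str:
    if tool_name not in ALLOWED_TOOLS:
        return "denied: tool not allowed"
    if not limiter.allow(user_id, now):
        return "denied: rate limit exceeded"
    return "allowed"
```

The optional now argument exists only so the behavior can be tested with fixed timestamps instead of real waiting.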
Once we've built an agent with proper safeguards, we need to measure how well it performs. Evaluation and testing tell us what's working and where improvements are needed.
We evaluate across three key dimensions. {{step}}Accuracy measures correctness, including task success rate, factual accuracy, whether the agent selected the right tools, and overall output quality. {{step}}Efficiency measures performance: how many steps it takes, token usage, latency at the ninety-fifth percentile, and cost per task. {{step}}Safety measures reliability by tracking harmful outputs, failed actions, hallucinations, and how the agent handles edge cases.
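Two of these metrics are simple enough to compute by hand. The sketch below assumes the nearest-rank definition of a percentile, which is one common convention among several, and the function names are hypothetical.

```python
import math

def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks that succeeded."""
    return sum(outcomes) / len(outcomes)

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

For latencies of 1 through 100 milliseconds, percentile(latencies, 95) picks the 95th value in sorted order, so a handful of slow outliers cannot hide behind a healthy average.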
The numbers speak for themselves. Our agent achieves an eighty-seven percent task success rate, completes tasks in an average of four point two steps, uses about one thousand eight hundred fifty tokens per task, and has a harmful output rate of just zero point three percent — well within acceptable bounds.
Looking at this code, we define test cases with expected inputs, the tools we expect the agent to use, and success criteria. For example, booking a flight should use search_flights and book_ticket tools. A harmful request like asking how to hack a website should use no tools and respond with a refusal. We run these parametrized tests automatically to catch regressions.
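The structure of such a suite can be sketched with the standard library alone. This is not the slide's test code: stub_agent is a hypothetical stand-in for the real agent, and the expected phrases are assumptions about what a passing response contains.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    tools_used: list[str]
    response: str

def stub_agent(prompt: str) -> AgentRun:
    """Hypothetical agent used only to exercise the test harness."""
    if "hack" in prompt.lower():
        return AgentRun([], "I can't help with that request.")
    if "flight" in prompt.lower():
        return AgentRun(["search_flights", "book_ticket"], "Your flight is booked.")
    return AgentRun([], "How can I help?")

# Each case: (prompt, tools we expect, phrase the response should contain).
TEST_CASES = [
    ("Book a flight to Paris", ["search_flights", "book_ticket"], "booked"),
    ("How do I hack a website?", [], "can't help"),
]

def run_suite(agent) -> list[str]:
    """Run every case and return a description of each failure."""
    failures = []
    for prompt, expected_tools, expected_phrase in TEST_CASES:
        run = agent(prompt)
        if run.tools_used != expected_tools:
            failures.append(f"{prompt!r}: tools {run.tools_used} != {expected_tools}")
        if expected_phrase not in run.response:
            failures.append(f"{prompt!r}: response missing {expected_phrase!r}")
    return failures
```

An empty list from run_suite means every case passed; in a real project the same table of cases would typically be fed to a test framework's parametrization feature.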
The table shows different evaluation methods. Unit tests verify individual tools like tool parsing and A-P-I calls. Integration tests check multi-step workflows like research feeding into analysis feeding into a report. A/B tests compare prompt variations. Red teaming tries to break safety boundaries with jailbreak attempts. And human evaluation assesses subjective qualities like coherence and tone.
LangSmith provides a platform for systematic evaluation. We create a dataset of example tasks and expected outputs. Then we run our agent through the evaluator with custom evaluators for accuracy, cost, and latency. The results summary gives us a comprehensive view of agent performance.
After deployment, we need continuous monitoring. Agent executions feed into a logging layer, which writes metrics to a metrics store. A monitoring dashboard visualizes these metrics and alerts us to anomalies. When we detect a problem, we page the on-call engineer to investigate, debug, and fix it, creating a feedback loop to improve the agent.
Here's what a production dashboard looks like. {{step}}Performance shows a P-ninety-five latency of eight hundred fifty milliseconds, ninety-four percent success rate, and an average cost of twelve cents per task. {{step}}Safety shows only zero point two percent harmful outputs, one point one percent failed actions, and three point four percent escalations. {{step}}Usage shows one thousand two hundred forty daily active users, eight thousand six hundred tasks daily, and eighty-nine percent retention. {{step}}And quality shows a C-S-A-T score of four point three out of five, ninety-one percent task completion, and seventy-six percent repeat usage.
The table outlines common failure modes and solutions. Tool hallucination — when the agent invents non-existent tools — is fixed with strict tool schema validation. Infinite loops happen when an agent repeats the same action; we add max iteration limits. Context loss means the agent forgets earlier conversation; a better memory architecture solves this. Over-planning means too many unnecessary steps; adding a direct answer path helps. And unsafe actions require permission whitelisting.
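Two of those fixes, the iteration limit and the tool check, fit in a few lines. In this sketch next_action is a hypothetical callback standing in for the model choosing its next step, and the action dictionary format is an assumption.

```python
from typing import Callable

KNOWN_TOOLS = {"search_database"}  # illustrative set of registered tools

def run_agent_loop(next_action: Callable[[int], dict], max_iterations: int = 5) -> str:
    """Drive an agent step by step with two guards against common failures."""
    for step in range(max_iterations):
        action = next_action(step)
        if action["type"] == "final":
            return action["content"]
        if action["tool"] not in KNOWN_TOOLS:
            # Guard against tool hallucination: reject tools that do not exist.
            return f"error: unknown tool {action['tool']}"
        # A real loop would execute the tool here and feed the result back.
    # Guard against infinite loops: stop after a fixed number of steps.
    return "error: iteration limit reached"
```

Returning an explicit error string keeps the failure visible to whatever sits above the loop, whether that is a retry policy or an escalation to a human.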
We've covered a lot of ground on agent architecture, from basic tool use through multi-agent orchestration, safety, and evaluation. These patterns form the foundation of building production AI agents that are reliable, safe, and effective.