Video

Context Engineering for AI Agents: Managing Intelligence at Scale

By SambaNova

April 15, 2026

Context engineering has quickly become the single biggest lever for making complex AI agents reliable in production. In this hands-on session, part three of the Building Agentic Applications series, Kwasi Ankomah, Director of AI Solutions at SambaNova, explains what context engineering is, why agents fail when context grows unmanaged, and how to fix it in SambaCloud with patterns used inside tools like Claude Code, OpenCode, and DeepAgents.

At its core, context engineering means giving an agent exactly the right context at the right time, and nothing more. As agents call tools, search the web, query databases, and hold conversations, context arrives from everywhere and the model's context window fills fast. This session shows how to keep that window clean so your agents stay accurate, fast, and affordable as they scale.

Why agents lose the plot: context collapse

Two distinct problems cause agent failures. The first is the hard token limit, roughly 128K–200K tokens on frontier open-source models and up to ~1 million on closed models like Claude Opus and Gemini. Once the window fills, the oldest messages get dropped and the agent loses early instructions. The second is attention degradation, the well-documented "lost in the middle" effect: models attend strongly to the beginning and end of context but far less to the middle, so critical details buried in long context become effectively invisible. Together these produce context collapse, such as falling accuracy and rising latency as context bloats.

The five context types

A practical framework for diagnosing agents is to ask which context type is broken?

Input context: system prompts, skills, and memory loaded before the user says anything. Largely static and repeated at every step.
Runtime context: user metadata, API keys, and connections that propagate automatically to subagents.
Compression: summarizing old messages and offloading large outputs to files.
Isolation: passing work to task-specific subagents so the main agent's window stays clear.
Long-term memory: data persisted across conversations in a store like Postgres, Redis, or MongoDB.

The core techniques demonstrated

The workshop walks through live notebook demos of the patterns that matter most in production:

Search-and-offload. A custom tool saves raw web-search results to a file and returns only a short summary, so the model never ingests the full dumps. In the demo, summaries of ~400–500 characters replaced source text of ~3,000–3,500 characters per result.
Middleware (pre- and post-model hooks). Middleware intercepts the agent loop to add observability, guardrails, summarization, PII redaction, and dynamic prompts without changing agent code. One demo ran nine searches yet grew the context by only 999 tokens.
Automatic compression / auto-compaction. Frameworks auto-compact when context reaches ~85–90% full — the same mechanism behind Claude Code's auto-compact — replacing a 129,000-character tool result with a pointer to a saved file.
Subagent delegation. A supervisor agent delegates noisy work to focused subagents, each with its own fresh context window, which return compact summaries. In the final demo this offloaded ~153,000 tokens while keeping the supervisor clean.

Why it matters: accuracy, speed, and cost

Context engineering isn't only about avoiding token limits and forgetting. It's also a major cost lever: offloading context to cheap file storage instead of resending it as input tokens on every turn dramatically reduces spend, which compounds across thousands of users in production. The teams that are most disciplined about context are the ones winning at large-scale agent deployments.

What you'll learn

What context engineering is and why it's the number-one lever for reliable agents
How context collapse, token limits, and attention degradation cause failures
The five context types and how to diagnose which one is broken
How to build search-and-offload tools, middleware hooks, and subagent delegation
How auto-compaction works under the hood in tools like Claude Code and DeepAgents
Production patterns: KV-cache management, codebase reindexing, and parallel subagents

FAQs

Context engineering is the practice of giving an AI agent exactly the right context at the right time and nothing more. It manages what enters a model's limited context window through compression, offloading, and isolation, so agents stay accurate, fast, and cost-efficient as tasks grow complex.

Context collapse is the drop in agent performance when its context window grows too large. Accuracy falls and latency rises because the model hits token limits (dropping older messages) and suffers attention degradation, where information in the middle of long context is effectively ignored.

The five context types are: input context (system prompts, skills, memory), runtime context (user metadata, keys, connections), compression (summarizing and offloading), isolation (delegating to subagents), and long-term memory (data persisted across conversations in a store like Redis or Postgres).

"Lost in the middle" is attention degradation where language models attend strongly to the beginning and end of their context but weakly to the middle. Critical details buried in long context become effectively invisible, so important information should be positioned or offloaded deliberately.

Middleware is a layer that intercepts the agent loop with pre- and post-model hooks. It runs logic before or after tool calls and model calls to compress context, add observability, enforce guardrails, redact PII, or inject dynamic prompts without changing the agent's core code.

Subagent delegation has a supervisor agent hand noisy tasks (web search, database queries) to focused subagents, each with its own fresh context window. The subagents return compact summaries, keeping the supervisor's context clean. It's one of the most effective context-isolation strategies for complex agents.

Auto-compaction automatically summarizes a conversation when context reaches roughly 85–90% of the model's window, preventing it from running out of room. Tools like Claude Code use this via a middleware hook that triggers a compression algorithm and replaces bulk content with pointers to saved files.

Yes. Offloading large context to file storage instead of resending it as input tokens on every agent turn is the main saving, because you stop paying for repeated input tokens. At scale for thousands of users, this dramatically lowers inference cost.

← Build Lightning-Fast AI Apps with Hugging Face Gradio

Do More with Less: Enterprise Agent Tech Workflows on Minimal Hardware →