Video

Context Engineering for AI Agents: Managing Intelligence at Scale

Written by SambaNova | April 15, 2026

Context engineering has quickly become the single biggest lever for making complex AI agents reliable in production. In this hands-on session — part three of the Building Agentic Applications series — Kwasi Ankomah, Director of AI Solutions at SambaNova, explains what context engineering is, why agents fail when context grows unmanaged, and how to fix it with patterns used inside tools like Claude Code, OpenCode, and DeepAgents.

At its core, context engineering means giving an agent exactly the right context at the right time, and nothing more. As agents call tools, search the web, query databases, and hold conversations, context arrives from everywhere and the model's context window fills fast. This session shows how to keep that window clean so your agents stay accurate, fast, and affordable as they scale.

Why agents lose the plot: context collapse

Two distinct problems cause agent failures. The first is the hard token limit — roughly 128K–200K tokens on frontier open-source models and up to ~1 million on closed models like Claude Opus and Gemini. Once the window fills, the oldest messages get dropped and the agent loses early instructions. The second is attention degradation, the well-documented "lost in the middle" effect: models attend strongly to the beginning and end of context but far less to the middle, so critical details buried in long context become effectively invisible. Together these produce context collapse — falling accuracy and rising latency as context bloats.

The five context types

A practical framework for diagnosing agents is to ask which context type is broken?

  1. Input context — system prompts, skills, and memory loaded before the user says anything. Largely static and repeated at every step.
  2. Runtime context — user metadata, API keys, and connections that propagate automatically to subagents.
  3. Compression — summarizing old messages and offloading large outputs to files.
  4. Isolation — passing work to task-specific subagents so the main agent's window stays clear.
  5. Long-term memory — data persisted across conversations in a store like Postgres, Redis, or MongoDB.

The core techniques demonstrated

The workshop walks through live notebook demos of the patterns that matter most in production:

  • Search-and-offload. A custom tool saves raw web-search results to a file and returns only a short summary, so the model never ingests the full dumps. In the demo, summaries of ~400–500 characters replaced source text of ~3,000–3,500 characters per result.
  • Middleware (pre- and post-model hooks). Middleware intercepts the agent loop to add observability, guardrails, summarization, PII redaction, and dynamic prompts without changing agent code. One demo ran nine searches yet grew the context by only 999 tokens.
  • Automatic compression / auto-compaction. Frameworks auto-compact when context reaches ~85–90% full — the same mechanism behind Claude Code's auto-compact — replacing a 129,000-character tool result with a pointer to a saved file.
  • Subagent delegation. A supervisor agent delegates noisy work to focused subagents, each with its own fresh context window, which return compact summaries. In the final demo this offloaded ~153,000 tokens while keeping the supervisor clean.

Why it matters: accuracy, speed, and cost

Context engineering isn't only about avoiding token limits and forgetting. It's also a major cost lever: offloading context to cheap file storage instead of resending it as input tokens on every turn dramatically reduces spend — which compounds across thousands of users in production. The teams that are most disciplined about context are the ones winning at large-scale agent deployments.

What you'll learn

  • What context engineering is and why it's the number-one lever for reliable agents
  • How context collapse, token limits, and attention degradation cause failures
  • The five context types and how to diagnose which one is broken
  • How to build search-and-offload tools, middleware hooks, and subagent delegation
  • How auto-compaction works under the hood in tools like Claude Code and DeepAgents
  • Production patterns: KV-cache management, codebase reindexing, and parallel subagents