
Many-Shot Prompting: A Practical Guide to In-Context Learning at Scale

by Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Urmish Thakker, Changran Hu, Qizheng Zhang
April 22, 2026

TL;DR: What We Found

We ran thousands of experiments on many-shot in-context learning (ICL) across multiple benchmarks, model sizes, and prompting strategies. Here are the headline findings:

  1. Many-shot ICL works, but only for certain tasks. Structured classification and information extraction see large, consistent gains. Open-ended generation tasks like machine translation barely move.
  2. More examples ≠ better results after a point. Performance typically plateaus around 50–70 examples per class, then degrades as the context window saturates.
  3. How you select examples matters more than how many you use. Cross-label similarity-based selection at low shot counts (n=1 per class) delivered our best result: 90.2% accuracy vs. a 43% zero-shot baseline.
  4. Smaller models scale more gracefully with many-shot prompts. Llama-3.1-8B continued improving where the 70B model started to over-condition and degrade.
  5. Reinforced ICL (chain-of-thought demos) peaks early. Just 4 reasoning traces matched or beat 32 on GPQA Diamond.

In recent years, large language models (LLMs) have undergone substantial advances in their ability to process and reason over extended context lengths. Architectural innovations and optimized attention mechanisms have expanded context windows from a few thousand tokens to hundreds of thousands, enabling models to condition on vast amounts of in-context information.

This progress has transformed the traditional few-shot in-context learning (ICL) [6] paradigm into what can now be characterized as many-shot ICL, where the number of exemplars or demonstrations provided within the prompt can scale to hundreds or even thousands.

In this post, we share empirical observations and practical insights from our exploration of many-shot ICL across a range of tasks, domains, and a few model backbones. We also explore the Reinforced ICL and Dynamic ICL regimes. We analyze when and why many-shot prompting leads to consistent performance gains, outline failure modes observed during large-context evaluations, and discuss best practices for constructing effective long-context prompts. In this blog post, we show:

  1. Variation in model accuracy under different n-shot configurations
  2. Best performing example selection strategy for prompt construction
  3. Success and failure cases for many-shot ICL
  4. Scaling law for Reinforced ICL with number of examples in the prompt

Preliminaries

In-context learning

In-context learning (ICL) [1] is a simple but powerful idea: Large language models (LLMs) can learn new tasks directly from examples placed in their input context, without updating any model parameters. By observing a few input–output pairs, the model infers the underlying pattern and applies it to new queries. This makes ICL a lightweight alternative to fine-tuning or pre-training when adapting models to new tasks.
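
To make this concrete, here is a minimal sketch of how an ICL prompt is assembled. The `build_icl_prompt` helper and the sentiment examples are illustrative, not drawn from our experiments:

```python
# Minimal sketch of in-context learning: the "training data" lives
# entirely in the prompt, and no model weights are updated.
examples = [
    ("I love this movie!", "positive"),
    ("Terrible plot and worse acting.", "negative"),
]

def build_icl_prompt(pairs, query):
    """Format input-output demonstrations followed by the new query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in pairs]
    lines.append(f"Input: {query}\nOutput:")  # the model completes this line
    return "\n\n".join(lines)

prompt = build_icl_prompt(examples, "An instant classic.")
```

Given such a prompt, the model is expected to continue the established pattern, here by emitting the sentiment label.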

As context lengths have grown and serving systems like LMCache [3] and MoonCake [4] have added to deployment efficiency, ICL has evolved into a practical way to teach models through demonstration. It also offers a unique lens for studying how LLMs generalize, compress patterns, and “learn” from context alone.


Many-shot ICL

Recent progress in long-context architectures has unlocked many-shot ICL, where we can fit hundreds or even thousands of demonstrations into a single prompt. Intuitively, the more examples we provide, the more signal the model has to learn from, and several studies [2] show that performance often improves as the number of shots increases.

However, scaling up is not always straightforward. The gains from many-shot prompting depend heavily on how examples are organized: their ordering, diversity, template format, and even the phrasing of instructions. In practice, many-shot ICL can be surprisingly brittle if prompts aren’t constructed carefully. In our work, we explore these sensitivities and share what worked (and what didn’t) when building effective many-shot prompts. We also show success and failure domains for many-shot ICL.


Reinforced ICL

In Reinforced ICL [2], the goal isn’t to show the model final answers but to demonstrate how to get there. Instead of giving input–output pairs, we include a few examples of chain-of-thought reasoning in the prompt. The model then imitates these reasoning patterns to arrive at its own answers.

This approach works especially well in domains like MATH or GSM8K, where the reasoning process matters more than the exact output format. Reinforced ICL can be seen as a lightweight, unsupervised way to teach reasoning strategies purely through examples in context.

Dynamic ICL

Traditional ICL uses a fixed set of examples. Every new query sees a static prompt. Dynamic ICL, in contrast, builds the prompt on-the-fly. For each query, the model (or a retrieval system) selects the most relevant examples based on embedding similarity, heuristics, or task metadata.

By tailoring the context to each input, dynamic ICL tends to produce better results than static prompts, especially in varied or open-ended domains. By swapping out irrelevant examples for more relevant ones, it often yields meaningful gains in consistency and accuracy. In our work, we explore different selection strategies to create prompts for Dynamic ICL.

Experiments and Results

To evaluate our hypotheses, we conducted experiments across a range of datasets drawn from two primary benchmarks: LongICLBench and EvalHarness.

LongICLBench [5] focuses on long-context learning in extreme label classification settings, covering six datasets with label spaces ranging from 28 to 174 classes and input lengths extending from roughly 2k to 50k tokens. This benchmark provides an ideal testbed for studying many-shot ICL, as it stresses both the model’s ability to leverage a large number of demonstrations and its capacity to maintain coherence over long input sequences. For experiments 1-4 described below, we used the Banking77 dataset from LongICLBench.

In addition, we included selected tasks from EvalHarness, emphasizing those that require complex contextual reasoning, such as machine translation, information extraction, summarization, and question answering. Together, these datasets span a diverse set of task structures, allowing us to probe the robustness and generality of many-shot ICL across domains. Experiments 5 and 6 are carried out on EvalHarness. We also explore two model backbones, Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct, for selected experiments.

Our experimental results and key observations are summarized next.

1. Many-shot ICL with extreme label classification

We chose the Banking77 dataset from LongICLBench which has 77 classes. For Banking77 with C = 77 classes, we use a per-class shot formulation: Each prompt contains n demonstrations per class, resulting in N = n × C total demonstrations (e.g., n = 1 ⇒ N = 77, n = 5 ⇒ N = 385). Unless otherwise stated, we refer to n as the per-class shot count and N as the total number of demonstrations. We experimented with the Llama-3.1-8B-Instruct model for this. We perform 10 seed experiments for each row and report average results observed across all seeds in Table 1.

1-shot/label + instruct means 1 example per class (total examples 77) for Banking77 followed by the task instruction to construct the input prompt.
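
A sketch of this prompt construction, assuming a `dataset` dict mapping each label to its example texts; the function name and the Query/Intent template are illustrative:

```python
import random

def build_per_class_prompt(dataset, n_per_class, instruction, seed=0):
    """Sample n demonstrations per class (N = n * C in total) and append
    the task instruction, mirroring the "n-shot/label + instruct" setup.
    `dataset` maps each label to a list of example texts."""
    rng = random.Random(seed)
    demos = []
    for label, texts in dataset.items():
        for text in rng.sample(texts, n_per_class):
            demos.append(f"Query: {text}\nIntent: {label}")
    rng.shuffle(demos)  # demonstration order matters; averaged over seeds
    return "\n\n".join(demos) + "\n\n" + instruction
```

For Banking77 (C = 77), n = 5 yields N = 385 demonstrations in a single prompt.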

| Num Shots Per Class (n) | Total Shots (N) | Avg. Acc (%) |
|---|---|---|
| Baseline (0-shot) | 0 | 43 |
| 1-shot/label + instruct | 77 | 68 ± 1.97 |
| 2-shot/label + instruct | 154 | 73.3 ± 1.96 |
| 3-shot/label + instruct | 231 | 74.8 ± 1.35 |
| 4-shot/label + instruct | 308 | 76.6 ± 1.87 |
| 5-shot/label + instruct | 385 | 78.2 ± 1.55 |
| 50-shot/label + instruct | 3850 | 80.8 ± 2.94 |
| 70-shot/label + instruct | 5390 | 82.3 ± 1.51 |

Table 1: Many-shot ICL results on Banking77 for Llama-3.1-8B-Instruct

As can be seen, accuracy rises consistently from 68% (n = 1, N = 77) to approximately 82% (n = 70, N = 5390). Beyond n = 70, we reach the full context length of 128k tokens. The model continues to benefit from additional examples up to a moderate scale but shows limited improvement once the context window saturates.

Banking77 clearly benefits from many-shot ICL with the average model performance increasing with an increase in the number of input-output pairs in the prompt. We also show a plot for how input tokens vary with the number of shots for Banking77 in Figure 1.


Figure 1: Context length scaling with num shots in input prompt
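
The saturation point can be anticipated with a back-of-the-envelope token estimate before a prompt is ever built. The helper name and the average-tokens-per-demonstration figure below are illustrative assumptions:

```python
def estimated_prompt_tokens(n_per_class, num_classes, avg_tokens_per_demo):
    """Rough prompt size: N = n * C demonstrations multiplied by the
    average demonstration length in tokens. Useful as a sanity check
    against the model's context window before constructing the prompt."""
    return n_per_class * num_classes * avg_tokens_per_demo

# e.g. a Banking77-style task (C = 77) at n = 5 with ~20-token demos:
budget = estimated_prompt_tokens(5, 77, 20)  # 7700 tokens
```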

2. Selection order for prompt construction with many-shot ICL

We carried out experiments using 10 different seeds for each n-shot setting shown in Table 1. Across 10 random shuffle trials, performance varied by 2-3%, confirming that many-shot ICL remains order-sensitive due to positional attention bias. Averaging over multiple random orders yields more stable and reliable results.
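
The averaging procedure can be sketched as follows; `eval_fn`, which scores one ordered list of demonstrations, is an assumed interface standing in for a full prompt-build-and-evaluate run:

```python
import random

def accuracy_over_orders(demos, eval_fn, num_seeds=10):
    """Evaluate a many-shot prompt under several random demonstration
    orders and report the mean score, smoothing out positional
    attention bias in any single ordering."""
    scores = []
    for seed in range(num_seeds):
        shuffled = demos[:]               # keep the original order intact
        random.Random(seed).shuffle(shuffled)
        scores.append(eval_fn(shuffled))
    return sum(scores) / len(scores)
```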

3. Selection strategy for Dynamic ICL

For Dynamic ICL, we select examples adaptively for each input query instead of relying on static prompts. We broadly focus on two selection strategies here:

  • Random Selection - random selection of examples to form an N-shot prompt for each input query.
  • Similarity-Based Selection - selecting the examples most relevant to the input query based on embedding similarity. We use the sentence-transformers/all-MiniLM-L6-v2 model (384-dimensional embeddings).

Each of the above two regimes is further divided into:

  • Label-Wise Selection - select n-shot examples for every label class. So, for a 5-shot prompt setting on Banking77, we would select five input-output pairs for each of the 77 classes, hence having 385 total demonstrations in the prompt.
  • Cross-Label Selection - select n × C examples across the entire dataset, where C is the number of classes. Each label class may be represented to a different extent in the final prompt, since we don’t do label-wise grouping.

Merging the above two regimes, we come up with four selection strategies:

  • Cross-Label Random Selection
  • Label-Wise Random Selection
  • Cross-Label Similarity Selection
  • Label-Wise Similarity Selection
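
Cross-label similarity-based selection reduces to a top-N nearest-neighbor lookup over example embeddings. A minimal sketch, assuming the embeddings are precomputed (in our experiments they come from all-MiniLM-L6-v2; the function name is illustrative):

```python
import numpy as np

def cross_label_similarity_select(query_emb, example_embs, n_total):
    """Return indices of the n_total examples closest to the query by
    cosine similarity, ignoring class labels entirely (cross-label)."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                       # cosine similarity per example
    return np.argsort(-sims)[:n_total]
```

Label-wise similarity selection would instead apply the same lookup within each class's examples separately.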

These strategies are compared below on Banking77 for Llama-3.1-8B-Instruct in Tables 2-4.

| Num Shots Per Class (n) | Label-Wise Random Selection (accuracy in %) | Cross-Label Random Selection (accuracy in %) |
|---|---|---|
| Baseline (0-shot) | 43 | 43 |
| 1-shot/label + instruct | 57.3 | 68 |
| 2-shot/label + instruct | 57.2 | 73.34 |
| 3-shot/label + instruct | 37.4 | 74.8 |
| 4-shot/label + instruct | 40.6 | 76.64 |
| 5-shot/label + instruct | 30.2 | 78.18 |
| 50-shot/label + instruct | 48.65 | 80.8 |
| 70-shot/label + instruct | 45.24 | 82.3 |

Table 2: Comparison for label-wise vs. cross-label random selection

| Num Shots Per Class (n) | Cross-Label Random Selection (accuracy in %) | Cross-Label Similarity-Based Selection (accuracy in %) |
|---|---|---|
| Baseline (0-shot) | 43 | 43 |
| 1-shot/label + instruct | 68 | 90.2 |
| 2-shot/label + instruct | 73.34 | 85.2 |
| 3-shot/label + instruct | 74.8 | 84.2 |
| 4-shot/label + instruct | 76.64 | 85.6 |
| 5-shot/label + instruct | 78.18 | 81 |
| 50-shot/label + instruct | 80.8 | 67.8 |
| 70-shot/label + instruct | 82.3 | 53.4 |

Table 3: Comparison for cross-label random selection vs. cross-label similarity-based selection

Comparing these policies in Tables 2-3, cross-label selection consistently outperforms label-wise random selection, suggesting that enforcing per-label balance can reduce useful diversity by over-representing redundant examples. Cross-label selection benefits from exposure to greater contextual diversity, which improves generalization, whereas enforcing label balance pulls in many low-information examples that contribute little.

Cross-label similarity-based selection is strong at small update magnitudes (high relevance), but degrades as N grows, whereas cross-label random selection scales more robustly with larger N (higher diversity). This reflects a bias-variance tradeoff: Relevance-focused (similarity-based) policies help early, but can over-concentrate the context as the update becomes aggressive; diversity-focused (random) policies scale better.

Our best setting is (n = 1, N = 77) for cross-label similarity selection. This maximizes the relevance of each demonstration per class label, yielding strong task adaptation before redundancy degrades performance at larger scale.

| Num Shots Per Class (n) | Label-Wise Random Selection (accuracy in %) | Label-Wise Similarity-Based Selection (accuracy in %) |
|---|---|---|
| Baseline (0-shot) | 43 | 43 |
| 1-shot/label + instruct | 57.3 | 83.4 |
| 2-shot/label + instruct | 57.2 | 81.2 |
| 3-shot/label + instruct | 37.4 | 74.2 |
| 4-shot/label + instruct | 40.6 | 76 |
| 5-shot/label + instruct | 30.2 | 77.4 |
| 50-shot/label + instruct | 48.65 | 73.2 |
| 70-shot/label + instruct | 45.24 | 67.4 |

Table 4: Comparison for label-wise random selection vs. label-wise similarity-based selection

In Table 4, we observe that label-wise similarity selection consistently outperforms label-wise random selection across all update magnitudes. This indicates that when label balance is enforced, selecting demonstrations that are semantically relevant to the query provides a stronger adaptation signal than random sampling within each class.

However, performance under label-wise similarity selection still degrades as the number of demonstrations increases (e.g., from 83.4% at n = 1 per-class shot to 67.4% at n = 70 per-class shot). This suggests that enforcing per-label balance limits the effective diversity of the prompt: As the update becomes more aggressive, additional demonstrations become increasingly redundant, even when selected by similarity.

Overall, these results show that while similarity-based selection mitigates some weaknesses of label-wise prompting, the label-wise constraint itself restricts scalability, motivating cross-label selection strategies that allow diversity to grow with update magnitude.

4. Scaling patterns with different model backbones

We experimented with the 8B and 70B Llama variants on Banking77 across various n-shot settings, using cross-label random selection.

| Num Shots Per Class (n) | Llama-3.1-8B-Instruct (accuracy in %) | Llama-3.3-70B-Instruct (accuracy in %) |
|---|---|---|
| Baseline (0-shot) | 43 | 57 |
| 1-shot/label + instruct | 68.9 | 69.8 |
| 2-shot/label + instruct | 73.3 | 79.5 |
| 3-shot/label + instruct | 74.8 | 80.3 |
| 4-shot/label + instruct | 76.6 | 82.4 |
| 5-shot/label + instruct | 78.2 | 85.2 |
| 50-shot/label + instruct | 80.8 | 88.6 |
| 70-shot/label + instruct | 82.3 | 80.3 |

Table 5: Comparison between different model backbones across n-shot setting

At small to moderate update magnitudes, Llama 3.3 70B consistently outperforms the smaller model, indicating that higher capacity models can more effectively exploit diverse in-context supervision. As the update magnitude increases further, the performance gap narrows and the 8B model catches up, suggesting that sufficiently large prompts can partially compensate for limited model capacity.

Notably, the 70B model exhibits a performance drop at the largest update magnitude, consistent with over-conditioning. In contrast, the smaller model remains in a signal-accumulation regime and is less sensitive to this effect.

5. Reinforced ICL scaling law

We next study Reinforced ICL, where chain-of-thought (CoT) based reasoning traces are provided as demonstrations instead of direct input-output pairs. GPQA (Graduate-level Google-Proof Q&A) is a challenging multiple-choice benchmark designed to test reasoning and factual knowledge beyond web-searchable facts. It contains graduate-level questions in STEM fields (biology, physics, chemistry, etc.), curated to avoid answers that can be trivially found on the web — hence "Google-Proof." It’s used to evaluate models on deep reasoning and domain knowledge, rather than memorized or shallowly pattern-matched facts.

Using the GPQA Diamond subset with Llama-3.3-70B-Instruct, we show in Table 6 that Reinforced ICL improves performance as the update magnitude increases up to 4 demonstrations. Beyond this point, accuracy plateaus and then degrades.

A plausible explanation is that early demonstrations provide a strong inductive bias, leading to rapid gains with only a few examples. As the number of reasoning traces increases, attention is increasingly divided across long CoT. This competition for attention reduces the effective influence of any single trace, causing performance to plateau or degrade despite additional context.

| Num Shots | Accuracy (%) |
|---|---|
| Baseline (0-shot) | 49.6 |
| 1 | 60.9 |
| 2 | 60.3 |
| 4 | 61.1 |
| 6 | 59.5 |
| 8 | 57.4 |
| 10 | 56.8 |
| 12 | 57.1 |
| 14 | 55.8 |
| 16 | 58.6 |
| 32 | 57.3 |

Table 6: Reinforced ICL on GPQA dataset

To generate synthetic reasoning traces for Reinforced ICL on GPQA Diamond, we follow the CoT data generation pipeline described in the Camel-AI framework [8]. We instantiate a generator–verifier setup, where both agents are powered by the gpt-oss-120b model.

The generator produces step-by-step reasoning traces under predefined formatting constraints, and the verifier checks the final answer against the ground-truth label. We retain only examples whose generated final answer exactly matches the correct label, filtering out inconsistent traces.

Applying this procedure to the 198 questions in GPQA Diamond yields 133 validated reasoning demonstrations, which are used for the Reinforced ICL experiments. We used greedy sampling with a maximum of 12,000 tokens to generate CoT traces.
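
The filtering step of this pipeline can be sketched as follows; `generate` (standing in for the gpt-oss-120b generator) and `answer_of` (final-answer extraction) are assumed interfaces, not Camel-AI's actual API:

```python
def filter_reasoning_traces(questions, labels, generate, answer_of):
    """Generator-verifier filtering: keep only chain-of-thought traces
    whose extracted final answer exactly matches the ground-truth label,
    discarding inconsistent traces."""
    kept = []
    for question, gold in zip(questions, labels):
        trace = generate(question)        # step-by-step reasoning trace
        if answer_of(trace) == gold:      # verifier: exact-match check
            kept.append((question, trace))
    return kept
```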

6. Success and failure cases for many-shot ICL

We also tried many-shot ICL on multiple datasets from EvalHarness. These experiments span structured reasoning and open-ended generation settings, allowing us to examine how the effectiveness of prompt-based updates depends on task structure and information content. We use the BLEU (Bilingual Evaluation Understudy) score to evaluate the machine translation (WMT16) tasks, F1 score to evaluate the reading comprehension (DROP) task, and accuracy (in %) based on exact match for the rest of the tasks. We show how input token length varies with the number of shots in the prompt in Figure 1.

| Task | Baseline (0-shot) | 4 Shot | 16 Shot | 32 Shot |
|---|---|---|---|---|
| DROP | 10.8 | 13.1 | 14.2 | 15.8 |
| FDA | 39.65 | 86.8 | 89.4 | 89.7 |
| SWDE | 74.17 | 92.9 | 95.1 | 96.4 |
| ARC-Challenge | 38.82 | 93.72 | 94.48 | 93.45 |
| GSM8K | 87.56 | 94.7 | 94.8 | 94.8 |
| GPQA (MC) | 47.99 | 51.9 | 50 | 48.8 |
| wmt16-de-en | 44.76 | 46.3 | 46.9 | 47 |
| wmt16-en-de | 34.51 | 37 | 37.4 | 37.5 |

Table 7: Many-shot ICL on EvalHarness datasets

We analyze when many-shot ICL is effective across tasks in Table 7 using Llama-3.3-70B-Instruct. We find that many-shot prompting consistently improves performance on structure-heavy tasks with constrained outputs, including structured reasoning (e.g., DROP) and information extraction benchmarks (e.g., FDA, SWDE).

In these settings, additional demonstrations provide high information gain by capturing relevant patterns. For tasks with constrained outputs (e.g., ARC-Challenge and GSM8K), performance improves sharply with a small number of demonstrations but quickly saturates, indicating that only limited contextual supervision is required to specify task behavior.

In contrast, GPQA (multiple choice) exhibits only modest gains at small update magnitudes. Finally, open-ended generation tasks such as machine translation (wmt16 de-en, wmt16 en-de) show consistent but small improvements with additional context, indicating many-shot ICL offers limited benefits when task structures are already well captured during pretraining.

Conclusion

Our exploration of many-shot in-context learning (ICL) highlights both its promise and its practical constraints. As context lengths continue to scale, many-shot prompting provides a powerful and flexible alternative to fine-tuning, allowing models to internalize richer task patterns purely from examples.


Through our experiments across LongICLBench and EvalHarness, we find that many-shot ICL consistently improves performance on structured classification and reasoning tasks, particularly in domains like extreme classification and information extraction.

At the same time, performance remains highly sensitive to prompt design, example ordering, and selection strategy. Dynamic ICL, when based on cross-label selection, produces more robust and generalizable behavior than static or label-clustered prompting. We also observe that scaling helps up to a point: Accuracy improvements taper beyond roughly 50–70 per class demonstrations, suggesting saturation within the available context window.

Finally, we note that many-shot ICL still struggles in open-ended generation tasks such as question answering and machine translation, where noise accumulation and context dilution can outweigh the benefits of additional examples. This points toward an important future direction: Combining many-shot prompting with retrieval-augmented and memory-based methods to dynamically focus on the most relevant context segments.

FAQs

What is many-shot prompting?

Many-shot prompting (or many-shot in-context learning) is a technique where hundreds or thousands of input–output examples are included in a single LLM prompt, rather than the typical 1–5 examples used in few-shot prompting. It’s enabled by the expansion of model context windows to 128k tokens and beyond.

How many examples should I include in a many-shot prompt?

It depends on the task and selection strategy. For classification with cross-label similarity, as few as 1 example per class can reach 90%+ accuracy. For cross-label random selection, gains continue up to ~50–70 examples per class before plateauing. For chain-of-thought (CoT) reasoning tasks, 4 demonstrations are often the sweet spot.

Does many-shot prompting work for all tasks?

No. Our experiments show strong gains for structured classification, information extraction, and constrained reasoning tasks. Open-ended generation tasks like machine translation and free-form QA see minimal improvement — the task patterns are already well captured during model pre-training.

Is it better to select similar examples or random examples?

Both have their place. Similarity-based selection delivers the best accuracy at low shot counts by maximizing relevance. But as you scale up, random selection outperforms because it maintains diversity and avoids the redundancy that similarity-based selection introduces.

Why does performance sometimes drop with more examples?

Two main reasons: context saturation (the model’s attention gets diluted across too many examples) and over-conditioning (larger models become overly sensitive to in-context signals at high shot counts). Both effects are more pronounced in larger models and with similarity-based selection.

What infrastructure do I need for many-shot ICL?

Many-shot ICL can consume the full 128k-token context window. Efficient KV cache management and high-bandwidth memory are essential for production deployment. Architectures like SambaNova’s RDU are designed specifically for sustained long-context throughput, making them well-suited for many-shot ICL workloads.


References 

[1] Dong, Qingxiu, et al. "A survey on in-context learning." arXiv preprint arXiv:2301.00234 (2022). https://arxiv.org/abs/2301.00234

[2] Agarwal, Rishabh, et al. "Many-shot in-context learning." Advances in Neural Information Processing Systems 37 (2024): 76930-76966. https://arxiv.org/abs/2404.11018

[3] Cheng, Yihua, et al. "LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference." arXiv preprint arXiv:2510.09665 (2025). https://arxiv.org/abs/2510.09665

[4] Qin, Ruoyu, et al. "Mooncake: A kvcache-centric disaggregated architecture for llm serving." arXiv preprint arXiv:2407.00079 (2024). https://arxiv.org/abs/2407.00079

[5] Li, Tianle, et al. "Long-context llms struggle with long in-context learning." arXiv preprint arXiv:2404.02060 (2024). https://arxiv.org/abs/2404.02060v3

[6] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

[7] Rein, David, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. "Gpqa: A graduate-level google-proof q&a benchmark." In First Conference on Language Modeling. 2024.

[8] Camel AI. Data generation module — camel ai documentation. https://docs.camel-ai.org/key_modules/datagen, 2025. Accessed: 2026-01-XX.