
Are LLMs Truly Solving Software Problems — or Are Agents Doing It?

Written by Mengmeng Ji, Ravi Raju | November 14, 2025
Large language models (LLMs) have evolved from text generators into powerful systems that can reason, code, and even act autonomously within complex workflows. A natural place to measure that progress is SWE-bench, a benchmark designed to evaluate LLMs on complex, real-world software engineering tasks drawn from GitHub [1]. Given a codebase and a GitHub issue, a model must produce a patch that resolves the problem described. Each instance in the dataset consists of a failing test (Fail-to-Pass) that is fixed by the corresponding pull request, along with additional tests that ensure no unrelated behavior is broken (Pass-to-Pass). Because the tasks come from actual GitHub issues and their associated fixes, the evaluation is grounded in how software evolves in practice.
 

Figure 1. SWE-bench evaluation pipeline

The SWE-bench evaluation pipeline has two main stages: generation and evaluation. In the generation stage, the model is prompted with an issue description and relevant repository context, and it produces a candidate patch. In the evaluation stage, this patch is tested inside a controlled Docker environment, where the repository is cloned, the model’s patch is applied to the targeted file(s), and the full test suite is run to check whether the issue is resolved without breaking other functionality.
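As a rough picture of the evaluation stage, the sketch below applies a candidate patch to a checked-out repository and runs the instance's tests. It is illustrative only: the official harness performs these steps inside per-instance Docker images with pinned dependencies, and the paths and test command shown here are hypothetical.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and run the instance's test suite."""
    # Apply the candidate patch to the checked-out repository.
    apply = subprocess.run(
        ["git", "apply", "--whitespace=fix", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False  # a patch that does not apply counts as unresolved

    # Run the Fail-to-Pass and Pass-to-Pass tests selected for this instance.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True, text=True)
    return tests.returncode == 0

# Hypothetical usage:
# resolved = evaluate_patch("repos/astropy", "patch.diff", ["pytest", "-x", "astropy/io/fits"])
```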

Most existing SWE-bench experiments use agentic workflows — combining LLMs with retrieval, tool use, and multi-step reasoning. These systems achieve strong results, but they also blur an important line: Are the models themselves solving these problems, or are the agents doing the heavy lifting?

To explore this question, we designed two experiments that isolate these factors and contrast the two perspectives, using SWE-bench as a controlled testbed:

  1. SWE-bench with an agentic workflow (mini-SWE-agent) — testing performance when reasoning is scaffolded by structured interaction.
  2. SWE-bench as a single-shot long-context benchmark — testing the model’s intrinsic capability.

Stage 1: SWE-bench in an Agentic Workflow (Mini-SWE-Agent)

We first evaluated SWE-bench in an agentic workflow setting using the mini-SWE-agent framework, the reference agent released by the SWE-bench authors [9]. Unlike SWE-agent, which leverages an ACI (Agent-Computer Interface), mini-SWE-agent is a bash-only command-line workflow that provides a lightweight sandbox for experimenting with different models as shown in Figure 2 [1][9]. It produces a linear history in which each agent step is appended directly to the message stream — making it especially useful for debugging, fine-tuning, and measuring token usage and conversation length. This design aligns well with our goal of assessing SWE-bench’s strengths as an agentic benchmark while also probing its potential as a long-context benchmark. As a baseline, we tested GPT-5-nano and obtained a 31% resolve rate on 100 samples from SWE-bench Verified, compared to the 34.8% reported on the open-source leaderboard for the full dataset. 

Figure 2. A block diagram of the mini-SWE-agent framework.
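The bash-only, linear-history design can be pictured as the loop sketched below. This is not mini-SWE-agent's actual implementation; the system prompt, the OpenAI-compatible `client`, and the command-extraction regex are simplifying assumptions.

```python
import re
import subprocess

def run_agent(client, model: str, issue: str, max_steps: int = 30) -> list[dict]:
    """A stripped-down bash-only agent loop that keeps a single linear message history."""
    messages = [
        {"role": "system", "content": "Fix the issue below. Reply with one fenced bash code "
                                      "block per turn; when done, print the patch with `git diff`."},
        {"role": "user", "content": issue},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model=model, messages=messages)
        content = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": content})

        # Pull out the single bash command the model proposed this turn.
        match = re.search(r"```bash\n(.*?)```", content, re.DOTALL)
        if match is None:
            break  # no command means the agent considers itself finished

        # Execute it in the sandbox and append the observation to the history,
        # so every step is visible in one linear transcript.
        result = subprocess.run(match.group(1), shell=True, capture_output=True,
                                text=True, timeout=120)
        messages.append({"role": "user", "content": result.stdout + result.stderr})
    return messages
```

Because every observation is appended to `messages`, counting the tokens of this list directly yields the conversation-length statistics we analyze next.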

We evaluated SambaNova’s DeepSeek-R1-0528 and Qwen3-32B using the mini-SWE-agent framework [10]. On 100 samples from SWE-bench Verified, DeepSeek-R1-0528 achieved a 30.3% resolve rate, while Qwen3-32B reached 15.2% [11][12]. These results show that our cloud API models achieve competitive performance and help fill the gap left by the SWE-bench Verified leaderboard.

While these experiments confirm that SWE-bench is well-suited for agentic workflows, they also raise the question: Can it still serve as a long-context benchmark? A key challenge lies in measuring and controlling context length in an agentic setting. The context length here depends on the number of dialogue rounds and accumulated tokens. Our statistics in Table 1 show that, across all tested models, full conversations generally stayed under 20k tokens. Within this range, successful solutions tended to use shorter contexts, while longer contexts often correlated with incorrect or unstable outputs — indicating diminishing returns from additional context.

Table 1. Token length statistics for passed / failed examples for DeepSeek-R1-0528 and Qwen3-32B
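A minimal sketch of how such statistics can be gathered, assuming each agent trajectory is saved as a JSON list of messages and using tiktoken's cl100k_base encoding as a stand-in tokenizer (the reported numbers use each model's own tokenizer); the file layout and function names are illustrative.

```python
import json
from pathlib import Path
from statistics import mean

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer for illustration

def conversation_tokens(trajectory_path: Path) -> int:
    """Total tokens accumulated over one linear agent conversation."""
    messages = json.loads(trajectory_path.read_text())
    return sum(len(ENC.encode(m["content"])) for m in messages)

def summarize(trajectory_dir: str, resolved_ids: set[str]) -> dict[str, float]:
    """Mean conversation length for resolved vs. unresolved instances."""
    passed, failed = [], []
    for path in Path(trajectory_dir).glob("*.json"):
        bucket = passed if path.stem in resolved_ids else failed
        bucket.append(conversation_tokens(path))
    return {
        "passed_mean": mean(passed) if passed else 0.0,
        "failed_mean": mean(failed) if failed else 0.0,
    }
```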

Taken together, these findings indicate that SWE-bench is suitable as an agentic workflow benchmark. 

Stage 2: SWE-bench as a Long-Context Benchmark

Our second stage of work tested a model's ability to solve a complex coding problem in a single shot, given a long context. For this stage, we needed to replicate the two core phases of SWE-bench, which first required selecting an appropriate dataset variant. The original SWE-bench contains ~19k training samples and ~2k test samples drawn from 12 Python libraries. SWE-bench Lite, a filtered subset of 300 samples, offers both BM25-retrieved context and an Oracle version with the relevant files provided directly. However, the Oracle version has a maximum context length of ~50k tokens, which is insufficient for our long-context benchmark requirements. In contrast, SWE-bench Verified, introduced by OpenAI, refines the original dataset by removing unsolvable or irrelevant issues and categorizing problems by difficulty, resulting in a more diverse and reliable evaluation set of 500 samples [2]. This makes it the most suitable benchmark for our goals. The tradeoff is that, unlike SWE-bench Lite, SWE-bench Verified does not include pre-retrieved or Oracle context, meaning we must implement retrieval ourselves.

We first explored retrieval-augmented generation (RAG) techniques to retrieve a long context for model generation. Based on our literature survey, most retrieval approaches rely either on sparse methods such as BM25 or on more complex agentic systems such as Meta-RAG and CodeRAG-Bench, with hybrid variants like CodeMonkeys [3][4][5]. Given the high cost of agent-based retrieval, we opted for a simpler yet effective strategy: use BM25 to rank code chunks and then inject the files touched by the “golden” patches to guarantee high recall. With this approach, we curated two sub-datasets from SWE-bench Verified, each containing 100 samples with 100% recall: one with a 64k-token context length and one with a 128k-token context length.
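A sketch of that retrieval step, using the rank_bm25 package with a naive whitespace tokenizer. The chunking scheme, the token estimate, and the way the gold files are identified from the reference patch are simplifying assumptions, not our production pipeline.

```python
from rank_bm25 import BM25Okapi

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough 4-characters-per-token heuristic

def build_context(issue: str, chunks: dict[str, str], gold_files: dict[str, str],
                  budget_tokens: int) -> str:
    """Rank repository chunks with BM25 and prepend gold files so recall is 100%."""
    names = list(chunks)
    bm25 = BM25Okapi([chunks[n].split() for n in names])
    scores = bm25.get_scores(issue.split())
    ranked = [name for _, name in sorted(zip(scores, names), reverse=True)]

    # Always include the files touched by the reference patch, then fill the
    # remaining budget (64k or 128k tokens) with the top-ranked BM25 chunks.
    context, used = [], 0
    for name, text in gold_files.items():
        context.append(f"### {name}\n{text}")
        used += approx_tokens(text)
    for name in ranked:
        if name in gold_files:
            continue
        cost = approx_tokens(chunks[name])
        if used + cost > budget_tokens:
            break
        context.append(f"### {name}\n{chunks[name]}")
        used += cost
    return "\n\n".join(context)
```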

Using these long-context datasets, we evaluated two models, GPT-5-nano and Qwen3-Coder-30B-A3B, with sb-cli, the official SWE-bench command-line evaluation tool [6][7][8]. The results were poor: Qwen3-Coder-30B-A3B achieved only a 7% resolve rate, while GPT-5-nano solved none of the tasks. Qualitative analysis of the models’ outputs revealed common failure modes. Many generated patches contained hallucinated information: some had incorrect line numbers in the diff header, as shown in Figure 3, while others targeted files that did not exist in the repository.

Figure 3. A misaligned patch generated by GPT-5-nano. One of the errors is visible in the hunk header, where the stated line numbers clearly exceed the length of the patch.
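These failure modes are mechanical enough to detect before evaluation. Below is a rough sketch of the kind of audit we mean, checking that each file path exists and that each hunk's starting line number falls inside the target file; the regex handles only simple unified-diff hunks and is not the check used by the SWE-bench harness.

```python
import re
from pathlib import Path

HUNK_RE = re.compile(r"^@@ -(\d+),?(\d*) \+(\d+),?(\d*) @@")

def audit_patch(repo_dir: str, patch_text: str) -> list[str]:
    """Flag hallucinated file paths and out-of-range hunk line numbers in a unified diff."""
    problems, current_file = [], None
    for line in patch_text.splitlines():
        if line.startswith("--- a/"):
            current_file = line[len("--- a/"):]
            if not (Path(repo_dir) / current_file).exists():
                problems.append(f"missing file: {current_file}")
        match = HUNK_RE.match(line)
        if match and current_file:
            start = int(match.group(1))
            target = Path(repo_dir) / current_file
            if target.exists() and start > len(target.read_text().splitlines()):
                problems.append(
                    f"{current_file}: hunk starts at line {start}, beyond the end of the file"
                )
    return problems
```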

After verifying that our post-processing pipeline was not the issue, we concluded that single-shot prompting with such long contexts likely exceeds the practical limits of current LLMs for this type of problem. Used as a bare, single-shot long-context benchmark, SWE-bench remains challenging for every model we tested.

Conclusion

SWE-bench remains a strong benchmark for evaluating agentic workflows, but a weak one for testing long-context reasoning. Despite involving large and complex repositories, it primarily rewards the models’ ability to reason through structured, iterative processes rather than their raw capacity to handle long contexts. Our findings challenge the common assumption that simply increasing a model’s context window allows it to understand and solve repository-scale software problems. Even top-performing models fail in single-shot long-context settings, underscoring that agentic scaffolding is essential for success. This contrast helps reveal where current models’ intrinsic “intelligence” ends and where orchestration through agents begins.

Recently, LongCodeBench introduced LongSWE-Bench, a benchmark similar to our RAG/Oracle setting, showing that open-source models achieved only single-digit solve rates and that the best closed-source model (Gemini 2.5 Pro) solved just 22% [13]. These findings highlight that long-context benchmarks remain a critical frontier for foundation models.

References

[1] Jimenez, Carlos E., et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? 2024. arXiv, https://arxiv.org/abs/2310.06770.

[2] OpenAI. “Introducing SWE-bench Verified.” OpenAI, 13 Aug. 2024, https://openai.com/index/introducing-swe-bench-verified/.

[3] Tawosi, Vali, et al. Meta-RAG on Large Codebases Using Code Summarization. 2025. arXiv, https://arxiv.org/abs/2508.02611.

[4] Wang, Zora Zhiruo, et al. CodeRAG-Bench: Can Retrieval Augment Code Generation? 2025. arXiv, https://arxiv.org/abs/2406.14497.

[5] Ehrlich, Ryan, et al. CodeMonkeys: Scaling Test-Time Compute for Software Engineering. 2025. arXiv, https://arxiv.org/abs/2501.14723.

[6] OpenAI. “Introducing GPT-5.” OpenAI, 7 Aug. 2025, https://openai.com/index/introducing-gpt-5/.

[7] Qwen Team. “Qwen3-Coder-30B-A3B-Instruct.” Hugging Face, 2025, https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct.

[8] SWE-bench. “Quick Start - sb-cli.” SWE-bench, 2025, https://www.swebench.com/sb-cli/quick-start/.

[9] SWE-agent. mini-SWE-agent. GitHub, 2025, https://github.com/SWE-agent/mini-swe-agent.

[10] SambaNova Systems. “SambaNova Cloud Dashboard.” SambaNova, 2025, https://cloud.sambanova.ai/.

[11] DeepSeek-AI. “DeepSeek-R1-0528.” Hugging Face, 2025, https://huggingface.co/deepseek-ai/DeepSeek-R1-0528.

[12] Qwen Team. “Qwen3-32B.” Hugging Face, 2025, https://huggingface.co/Qwen/Qwen3-32B.

[13] Rando, Stefano, et al. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. 2025. arXiv, https://arxiv.org/abs/2505.07897.