Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

University of Southern California  ·  Microsoft  ·  UW–Madison  ·  University of Toronto
*Equal contribution  ·  Work done before joining Microsoft
PDB Pipeline

PDB is an automatic pipeline that converts any coding dataset into a debugging benchmark with precision-aware evaluation. It generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs, then evaluates models using edit-level precision and bug-level recall.
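To make the composition step concrete, here is a minimal sketch of how verified atomic bugs might be combined into a multi-bug program. The `(line_no, buggy_line)` representation and the non-overlap rule are illustrative assumptions, not PDB's actual data format.

```python
def compose_bugs(program: str, atomic_bugs: list[tuple[int, str]]) -> str:
    """Apply several single-line atomic bugs to a correct program.

    Each atomic bug is assumed to be a (line index, replacement line)
    pair that was independently verified to break at least one unit
    test. Bugs touching an already-mutated line are skipped so every
    injected fault stays independently localizable.
    """
    lines = program.splitlines()
    used: set[int] = set()
    for line_no, buggy_line in atomic_bugs:
        if line_no in used:
            continue  # keep injected bugs non-overlapping
        lines[line_no] = buggy_line
        used.add(line_no)
    return "\n".join(lines)


correct = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b"
# Flip the operator in both functions: a two-bug program.
buggy = compose_bugs(correct, [(1, "    return a - b"), (4, "    return a + b")])
```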

Abstract

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate solutions during debugging, producing correct but heavily over-edited programs. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation.

PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure what fraction of a model's edits are necessary and what fraction of the injected bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard (5,751 single-line bug examples) and PDB-Multi (256 contiguous 2–4 line bug examples).
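The two metrics can be sketched as follows. This is an approximation of the definitions above, not PDB's evaluator: edits are matched to bugs by line diffing, and a bug counts as "resolved" when its lines were edited, whereas the real pipeline presumably also verifies the fix (e.g., via unit tests).

```python
import difflib


def edited_lines(buggy: str, fixed: str) -> set[int]:
    """Indices of buggy-program lines the model changed or deleted."""
    matcher = difflib.SequenceMatcher(None, buggy.splitlines(), fixed.splitlines())
    edited: set[int] = set()
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op != "equal":
            edited.update(range(i1, i2))
    return edited


def precision_recall(buggy: str, fixed: str, bug_lines: set[int]) -> tuple[float, float]:
    """Edit-level precision and (approximate) bug-level recall.

    Precision: fraction of edited lines that belonged to a known bug.
    Recall: fraction of bug lines the model actually edited.
    """
    edits = edited_lines(buggy, fixed)
    if not edits:
        return 0.0, 0.0
    necessary = edits & bug_lines
    return len(necessary) / len(edits), len(necessary) / len(bug_lines)
```

Under this sketch, a model that regenerates the whole function scores perfect recall but low precision, which is exactly the failure mode the benchmark targets.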

Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but precision at or below 45%, even when explicitly instructed to perform minimal debugging. Iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.

Datasets

We release two evaluation sets built with the PDB generation and evaluation pipeline, both drawing problems from two existing coding benchmarks: BigCodeBench (API usage) and LiveCodeBench (algorithmic reasoning).

  • PDB-Single-Hard — 5,751 single-line bug examples, filtered from an initial pool of 7,591 PDB-Single examples to retain only cases that are not easily solved by a quorum of reference models.
  • PDB-Multi — 256 examples where each bug is a contiguous 2–4 line block, drawn from longer programs (≥ 35 lines) with an atomicity filter applied to reject compound-dependent bugs.
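The hardness filter for PDB-Single-Hard can be sketched as below, under the assumption that "not easily solved by a quorum of reference models" means dropping any example that at least `quorum` reference models repair correctly. The function names and threshold are illustrative, not taken from PDB.

```python
def filter_hard(examples: list[str],
                model_solved: dict[str, dict[str, bool]],
                quorum: int = 2) -> list[str]:
    """Keep examples solved by fewer than `quorum` reference models.

    model_solved[model][example_id] is True when that model's patch
    passes the example's unit tests.
    """
    hard = []
    for ex in examples:
        solves = sum(model_solved[m][ex] for m in model_solved)
        if solves < quorum:
            hard.append(ex)  # too few models solve it: keep as "hard"
    return hard
```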
Data Distribution

PDB-Single-Hard data distribution: 5,751 examples across source benchmark, bug count (1–4), and bug category.

Results

Model rankings by debugging precision differ strikingly from rankings by unit-test pass rate. This discrepancy persists across both PDB-Single-Hard and the multi-line PDB-Multi benchmark, indicating that moving to coarser, multi-line bugs does not close the gap.

Scatter plots below: hover for exact numbers. Full tables are on the Leaderboard.

PDB-Single-Hard (single-line)

PDB-Multi (multi-line)

BigCodeBench Breakdown

Bug-count breakdown on BigCodeBench.

LiveCodeBench Breakdown

Bug-count breakdown on LiveCodeBench.

Ablation Studies

Ablation — Prompting: freeform vs. minimal debugging. Across all evaluated models, freeform prompting (no minimal-edit instruction) substantially reduces edit-level precision and bug-level recall relative to minimal-debug prompting. Even the strongest models, including Claude-Sonnet-4.5 and Qwen3-Coder-480B, achieve less than 60% precision without a minimal-edit constraint; Gemini-2.5-Pro drops by 40 absolute points, and GPT-5.1-Codex fails to reach 20%. Prompt-level constraints are necessary but insufficient: minimal-debug prompts reduce over-editing, but they do not change the underlying tendency to regenerate.

Model performance under minimal-debug vs. freeform prompting.

Ablation — Iteration & agents: iterative and agentic debugging. Iterative settings (up to three revision attempts) and agentic settings (with unit-test and execution feedback) consistently improve unit-test scores and recall, but precision stays flat or degrades. In fact, agentic debugging often underperforms plain iterative debugging on precision, suggesting that additional feedback reinforces regeneration-oriented strategies. Even Claude-Code, the strongest agentic baseline, reaches only roughly 50% precision.
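The iterative setting has roughly the following shape: up to three revision attempts, each conditioned on test feedback. `model_fix` and `run_tests` are hypothetical stand-ins for a model call and a test harness; neither is part of the PDB pipeline's actual API.

```python
def iterative_debug(buggy: str, model_fix, run_tests, max_attempts: int = 3) -> str:
    """Ask the model to revise its fix up to `max_attempts` times.

    model_fix(code, feedback) -> revised code
    run_tests(code) -> (all_passed: bool, feedback: str)
    """
    candidate = buggy
    feedback = ""
    for _ in range(max_attempts):
        candidate = model_fix(candidate, feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            break  # stop revising once the unit tests pass
    return candidate
```

Note that this loop only optimizes for passing tests; nothing in it penalizes unnecessary edits, which is consistent with the observation that iteration improves recall but not precision.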

Model performance under iterative and agentic setups.

Ablation — Bug categories: which defect categories are easiest to repair? With the exception of Gemini-2.5-Pro, which exhibits a relatively uniform recall of ~70% across all categories, most models show markedly higher recall on Build/Package/Merge defects. We hypothesize this advantage arises from the higher prevalence of such defects in pretraining data, making them easier to recognize and repair than algorithmic or boundary-condition faults.

Recall distribution over the five bug categories.

BibTeX

@inproceedings{zhu2026pdb,
  title     = {Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?},
  author    = {Zhu, Wang Bill and Chai, Miaosen and Wang, Shangshang and Liu, Yejia and
               Bian, Song and Dong, Honghua and Neiswanger, Willie and Jia, Robin},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
}