PACE proxy benchmark predicts agentic evaluation scores at under 1% of full cost
A new arXiv paper introduces PACE, a proxy benchmark framework that predicts performance on expensive agentic benchmarks using a compact subset of cheaper evaluation instances.
A new arXiv paper introduces PACE, a framework for predicting LLM-agent performance on costly benchmarks such as SWE-Bench and GAIA from a small set of cheaper atomic evaluation instances. Across 14 models, four agentic benchmarks, and 19 non-agentic benchmarks, the authors report mean absolute error under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%. The claimed evaluation cost is less than 1% of running full agentic benchmarks.
Key details: PACE predicts scores on four target agentic benchmarks; the paper reports under 4% mean absolute error and Spearman correlation above 0.80; the proxy cost is reported as less than 1% of full agentic evaluation.
Why it matters: Cheaper agent evaluation would make model routing and development decisions less dependent on slow, expensive benchmark runs.
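For readers unfamiliar with the three agreement metrics cited above (mean absolute error, Spearman correlation, and pairwise model-ranking accuracy), the following is a minimal Python sketch of how such metrics are commonly computed between predicted and actual benchmark scores. It is not the paper's code: the function names, the tie handling, and the example scores are all hypothetical, and the paper's exact definitions may differ.

```python
from itertools import combinations
from statistics import mean


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual scores."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(predicted, actual):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rp, ra = average_ranks(predicted), average_ranks(actual)
    mp, ma = mean(rp), mean(ra)
    cov = sum((x - mp) * (y - ma) for x, y in zip(rp, ra))
    var_p = sum((x - mp) ** 2 for x in rp)
    var_a = sum((y - ma) ** 2 for y in ra)
    return cov / (var_p * var_a) ** 0.5


def pairwise_ranking_accuracy(predicted, actual):
    """Share of model pairs whose order is the same in both score lists.

    Assumption: pairs with tied actual scores are skipped.
    """
    pairs = [(i, j) for i, j in combinations(range(len(actual)), 2)
             if actual[i] != actual[j]]
    correct = sum((predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)


# Hypothetical scores (in percentage points) for four models; not from the paper.
predicted = [30.0, 45.0, 50.0, 62.0]
actual = [28.0, 48.0, 46.0, 65.0]

print(mean_absolute_error(predicted, actual))
print(spearman(predicted, actual))
print(pairwise_ranking_accuracy(predicted, actual))
```

In this toy example the proxy misorders one of the six model pairs, which lowers both the rank correlation and the pairwise accuracy even though each individual prediction is within a few points of the actual score.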