AxDafny paper tests agentic code generation against formal verification
A new arXiv paper introduces AxDafny, a verifier-guided repair framework for generating Dafny code and proof artifacts, and reports strong gains over a GPT-5.5 baseline.
An arXiv paper introduces AxDafny, a framework for agentic code generation in Dafny in which the system must produce both executable code and proof artifacts that pass verification. The authors also introduce LiveCodeBench-Pro-Dafny, a 250-problem benchmark translated into Dafny with formal specifications and verifier-based evaluation. They report that AxDafny substantially improves verification success over a GPT-5.5 baseline and reaches 92.7% verification success on DafnyBench.
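To make "verifier-guided repair" concrete, here is a minimal Python sketch of the general pattern: generate a candidate, run a verifier, and feed the error message back into the next attempt. It is an illustration under stated assumptions, not the paper's implementation. The names `repair_loop`, `toy_generate`, and `toy_verify` are hypothetical, and the toy verifier only checks for the word "invariant" rather than calling the Dafny toolchain or a model.

```python
from typing import Callable, Optional, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Verifier = Callable[[str], Tuple[bool, str]]
Generator = Callable[[str, Optional[str]], str]


def repair_loop(spec: str, generate: Generator, verify: Verifier,
                max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Generate a candidate, verify it, and feed errors back until it passes.

    Returns (verified_candidate_or_None, rounds_used).
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(spec, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate, round_no
        feedback = message  # verifier output guides the next attempt
    return None, max_rounds


# Toy stand-ins (assumptions) so the sketch runs without Dafny or a model.
def toy_verify(candidate: str) -> Tuple[bool, str]:
    if "invariant" in candidate:
        return True, ""
    return False, "loop invariant missing"


def toy_generate(spec: str, feedback: Optional[str]) -> str:
    body = f"method M() ensures {spec} {{ while ... }}"
    if feedback and "invariant" in feedback:
        body += " // invariant added"
    return body


result, rounds = repair_loop("sorted(a)", toy_generate, toy_verify)
```

In this toy run the first candidate fails and the second passes after the feedback is applied. In a real setting the verifier callable would wrap the Dafny verifier and the generator would wrap a model call; both are left abstract here.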
Key details: submitted to arXiv on June 30, 2026; the paper studies agentic verified code generation in Dafny; it introduces AxDafny and LiveCodeBench-Pro-Dafny; the benchmark includes 250 programming problems with formal specifications; the authors report 92.7% verification success for AxDafny on DafnyBench.
Why it matters: Formal verification is one of the cleanest ways to measure whether coding agents actually produce correct programs, not just plausible patches.