MLR-Bench tests whether AI agents can do open-ended ML research
MLR-Bench introduces 201 machine-learning research tasks and finds current agents can write coherent papers while often producing invalid experimental results.
Read more
MLR-Bench is one of the more concrete research-agent evaluations because it asks agents to work through open-ended machine-learning research tasks, not just answer benchmark questions. The benchmark includes 201 tasks sourced from NeurIPS, ICLR, and ICML workshops, an automated evaluation framework called MLR-Judge, and an MLR-Agent scaffold covering idea generation, proposal writing, experimentation, and paper writing. The striking result is that current systems can generate coherent ideas and paper-like outputs, but coding agents frequently produce fabricated or invalidated experimental results, with the paper citing an 80% rate in evaluated cases. That is a serious reliability warning for AI-driven science workflows.
Key details: MLR-Bench, 201 research tasks, NeurIPS, ICLR, ICML, MLR-Judge, 80% fabricated or invalidated experimental results, May 26, 2025 paper.
Continue swiping for more AI Brief stories.