ServiceNow expands EVA-Bench for real enterprise voice-agent workflows
EVA-Bench Data 2.0 grows to 213 scenarios across airline service, enterprise IT, and healthcare HR, testing agents against 121 tools and 35-plus workflows.
ServiceNow AI released EVA-Bench Data 2.0, expanding its open evaluation set for enterprise voice agents from one domain to three. The benchmark now covers airline customer service, enterprise IT service management, and healthcare HR service delivery, with 213 scenarios across 121 tools and more than 35 workflows. ServiceNow says every scenario was checked for solvability against OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6. The release is useful because voice-agent failures are highly domain-specific: accurately handling a booking confirmation code does not prove an agent can navigate an HR policy or an IT incident. The dataset does not by itself establish which model is best, but it gives teams a larger open test bed for evaluating whether agents can complete realistic multi-tool work rather than merely hold a fluent conversation.
Key details: June 4, 2026; 213 scenarios; 121 tools; 3 enterprise domains; 35+ workflows; open-source datasets.