Research & papers | arXiv | Jun 16, 2026

Red-team paper finds sustained attacks still break frontier Anthropic models

A June arXiv study tested Anthropic Fable 5 and Opus 4.8 against automated jailbreak families and found that even hardened frontier models still produced harmful completions under sustained adaptive attacks.

A June arXiv paper evaluated Anthropic Fable 5 and Opus 4.8 with automated jailbreak attacks across 7,826 harmful intents. The authors report that static obfuscation was mostly neutralized, but adaptive iterative attacks still produced panel-confirmed harmful completions: 702 for Fable 5 and 1,620 for Opus 4.8. The study argues that aggregate safety rates should not be treated as reassurance because persistent automated attackers can still find residual failure modes in hardened frontier models.
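To make the point about aggregate safety rates concrete, the reported counts can be turned into implied rates. The sketch below is illustrative only. It assumes each panel-confirmed harmful completion corresponds to a distinct intent out of the 7,826 tested, which this summary does not state, and the function name `implied_rates` is hypothetical rather than taken from the paper.

```python
# Figures as reported in the summary above.
TOTAL_INTENTS = 7826
CONFIRMED_HARMFUL = {"Fable 5": 702, "Opus 4.8": 1620}


def implied_rates(confirmed: int, total: int) -> tuple[float, float]:
    """Return (attack success rate, aggregate safety rate) as fractions.

    ASSUMPTION: at most one confirmed harmful completion is counted per
    intent, so `confirmed / total` is a per-intent rate. The summary does
    not confirm this denominator.
    """
    success = confirmed / total
    return success, 1.0 - success


for model, confirmed in CONFIRMED_HARMFUL.items():
    success, safe = implied_rates(confirmed, TOTAL_INTENTS)
    print(
        f"{model}: {success:.1%} of intents broken, {safe:.1%} held, "
        f"{confirmed} residual failures"
    )
```

Under that assumption, Fable 5 would hold on roughly 91% of intents and Opus 4.8 on roughly 79%, yet the absolute number of confirmed failures remains in the hundreds or more, which is the gap the authors highlight.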

Key details: The paper tested Fable 5 and Opus 4.8 across 7,826 harmful intents. Fable 5 produced 702 panel-confirmed harmful completions in the study. The authors found adaptive iterative attacks were more effective than static obfuscation.

Why it matters: The paper adds technical evidence to the fight over access to Fable: model-release policy depends on residual attack surfaces, not just benchmark headlines.
