AI Brief

Red-team paper finds sustained attacks still break frontier Anthropic models

A June arXiv study tested Anthropic Fable 5 and Opus 4.8 against automated jailbreak families and found that even hardened frontier models still produced harmful completions under sustained adaptive attacks.

A June arXiv paper evaluated Anthropic Fable 5 and Opus 4.8 with automated jailbreak attacks across 7,826 harmful intents. The authors report that static obfuscation was mostly neutralized, but adaptive iterative attacks still produced panel-confirmed harmful completions: 702 for Fable 5 and 1,620 for Opus 4.8. The study argues that aggregate safety rates should not be treated as reassurance because persistent automated attackers can still find residual failure modes in hardened frontier models.
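To make the point about aggregate rates concrete, the reported counts can be turned into shares of the intent set. This is a rough sketch only: it assumes each intent contributes at most one panel-confirmed harmful completion per model, which the brief does not state, and the function name `harmful_share` is hypothetical.

```python
# Figures as reported in the brief.
TOTAL_INTENTS = 7826
CONFIRMED_HARMFUL = {"Fable 5": 702, "Opus 4.8": 1620}


def harmful_share(confirmed: int, total: int) -> float:
    """Return confirmed harmful completions as a fraction of intents tested.

    Assumption (not stated in the brief): at most one confirmed
    harmful completion is counted per intent, so the ratio is a rate.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return confirmed / total


for model, confirmed in CONFIRMED_HARMFUL.items():
    share = harmful_share(confirmed, TOTAL_INTENTS)
    print(f"{model}: {confirmed} of {TOTAL_INTENTS} intents "
          f"({share:.1%} harmful, {1 - share:.1%} not)")
```

Under that assumption, Fable 5 comes out at roughly 9% and Opus 4.8 at roughly 21%. An aggregate figure like "about 91% of intents resisted" can look reassuring while still corresponding to hundreds of confirmed harmful completions.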

Key details: The paper tested Fable 5 and Opus 4.8 across 7,826 harmful intents; Fable 5 produced 702 panel-confirmed harmful completions and Opus 4.8 produced 1,620; adaptive iterative attacks were more effective than static obfuscation.

Why it matters: The paper adds technical evidence to the fight over Fable access. Model-release policy depends on the residual attack surface, not just headline benchmark numbers.
