AI Brief

MARS paper uses text refusal directions to improve multimodal model safety

A new arXiv paper proposes Modality-Agnostic Refusal Steering, a training-free method for transferring textual refusal directions into image and video safety controls.

An arXiv paper examines whether refusal directions extracted from an LLM backbone can improve safety in multimodal models. The authors introduce Modality-Agnostic Refusal Steering, or MARS, a training-free approach that uses activation steering, re-centering, adaptive strength selection, and layer choice to improve safety without collecting unsafe multimodal training data. Evaluations across five multimodal models found consistent safety gains while preserving utility.
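To make the idea of activation steering concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a common difference-of-means recipe for the refusal direction, and it reads "re-centering" and "adaptive strength" in one plausible way: the projection is measured relative to a mean activation, and only the shortfall to a target strength is added. The function names, the `center` and `strength` parameters, and the toy vectors are all hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between harmful and harmless text activations."""
    mu_harm = mean_vector(harmful_acts)
    mu_safe = mean_vector(harmless_acts)
    diff = [h - s for h, s in zip(mu_harm, mu_safe)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def steer(hidden, direction, center, strength):
    """Push a hidden state along the refusal direction.

    Assumption: "re-centering" means the projection is measured after
    subtracting `center` (e.g. a mean multimodal activation), and
    "adaptive strength" means adding only the shortfall to `strength`.
    """
    proj = sum((h - c) * d for h, c, d in zip(hidden, center, direction))
    alpha = max(0.0, strength - proj)  # no push if already past the target
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations, purely illustrative.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]
direction = refusal_direction(harmful, harmless)
steered = steer([0.5, 1.0], direction, center=[0.0, 0.0], strength=2.0)
```

In a real model the same arithmetic would be applied to hidden states at a chosen layer during the forward pass; which layer and what strength to use are exactly the choices the brief says MARS addresses.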

Key details: The paper was submitted to arXiv on June 30, 2026. It studies safety steering for multimodal LLMs and introduces Modality-Agnostic Refusal Steering (MARS). The method is training-free and avoids collecting unsafe multimodal data for safety training. The authors report consistent gains across safety, utility, and video-jailbreak benchmarks.

Why it matters: Defenses against multimodal jailbreaks are harder to train directly than text-only ones, so reusable activation-level safety methods could matter as image and video agents spread.
