Research & papers · arXiv · Jun 30, 2026

MARS paper uses text refusal directions to improve multimodal model safety

A new arXiv paper proposes Modality-Agnostic Refusal Steering, a training-free method for transferring textual refusal directions into image and video safety controls.

An arXiv paper examines whether refusal directions extracted from an LLM backbone can improve safety in multimodal models. The authors introduce Modality-Agnostic Refusal Steering, or MARS, a training-free approach that combines activation steering, re-centering, adaptive strength selection, and layer choice to improve safety without collecting unsafe multimodal training data. In evaluations across five multimodal models, the authors report consistent safety gains with utility preserved.
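To make the idea of a refusal direction concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes the common difference-of-means recipe for extracting a direction from text activations, and the re-centering and adaptive-strength steps are illustrative guesses at what those terms could mean. The names refusal_direction, steer, center, and target are hypothetical, and layer choice is not covered.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-length difference of means between harmful and harmless text activations.

    Assumption: this is the generic difference-of-means recipe,
    not necessarily the extraction used in the paper.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def steer(hidden, direction, center, target):
    """Push a hidden state along the refusal direction.

    Assumptions (both hypothetical):
    - re-centering: subtract a reference mean `center` (for example, the mean
      multimodal activation) before measuring the projection;
    - adaptive strength: add only as much of the direction as is needed for
      the re-centered projection to reach `target`.
    """
    centered = [h - c for h, c in zip(hidden, center)]
    projection = sum(x * d for x, d in zip(centered, direction))
    strength = max(0.0, target - projection)
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations; real hidden states have thousands of dimensions.
direction = refusal_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
steered = steer([1.0, 5.0], direction, center=[1.0, 1.0], target=2.0)
```

In this toy setup, a hidden state whose re-centered projection already meets the target is left unchanged, which is one simple way a steering method could avoid disturbing benign inputs.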

Key details: Submitted to arXiv on June 30, 2026; the paper studies safety steering for multimodal LLMs; it introduces Modality-Agnostic Refusal Steering (MARS); the method is training-free and avoids collecting unsafe multimodal safety data; the authors report consistent gains across safety, utility, and video-jailbreak benchmarks.

Why it matters: Defenses against multimodal jailbreaks are hard to train directly, since doing so means collecting unsafe multimodal data, so reusable activation-level safety methods could matter more as image and video agents spread.
