KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks
Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim
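AI summary: To address the lack of cultural specificity in safety evaluation of multimodal large language models, this work proposes the KSAFE-MM benchmark, which builds general and Korean-culture-specific multimodal safety test sets through linguistic and visual contextualization. It reveals that models are vulnerable to culturally grounded attacks and that safety trades off against over-refusal.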
Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify the attack success rate (ASR), with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.
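The two quantities the abstract contrasts, attack success rate and over-refusal, can be illustrated with a minimal sketch. The abstract does not say how responses are judged, so this assumes each response already carries binary labels from some judge. The `Judgment` record, its field names, and both functions are hypothetical, and the sample data is illustrative only, not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Judgment:
    """One judged model response (hypothetical schema, not from the paper)."""
    is_attack: bool         # True for a harmful/jailbreak query, False for a benign one
    response_harmful: bool  # judge decided the response complied with harmful intent
    model_refused: bool     # judge decided the model refused to answer


def attack_success_rate(judgments: Iterable[Judgment]) -> float:
    """Fraction of attack queries that elicited a harmful response."""
    attacks = [j for j in judgments if j.is_attack]
    if not attacks:
        return 0.0
    return sum(j.response_harmful for j in attacks) / len(attacks)


def over_refusal_rate(judgments: Iterable[Judgment]) -> float:
    """Fraction of benign queries that the model refused."""
    benign = [j for j in judgments if not j.is_attack]
    if not benign:
        return 0.0
    return sum(j.model_refused for j in benign) / len(benign)


# Illustrative data: 4 attack queries (1 succeeds) and 4 benign queries (2 refused).
sample = [
    Judgment(is_attack=True, response_harmful=True, model_refused=False),
    Judgment(is_attack=True, response_harmful=False, model_refused=True),
    Judgment(is_attack=True, response_harmful=False, model_refused=True),
    Judgment(is_attack=True, response_harmful=False, model_refused=True),
    Judgment(is_attack=False, response_harmful=False, model_refused=True),
    Judgment(is_attack=False, response_harmful=False, model_refused=True),
    Judgment(is_attack=False, response_harmful=False, model_refused=False),
    Judgment(is_attack=False, response_harmful=False, model_refused=False),
]

asr = attack_success_rate(sample)
orr = over_refusal_rate(sample)
print(f"ASR: {asr:.1%}, over-refusal: {orr:.1%}")
```

Reporting both numbers side by side is what exposes the trade-off: a model that refuses everything would score an ASR of zero here while its over-refusal rate would be 100%.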