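AI Large Models

Large Model Alignment and Safety

Large model alignment, safety, jailbreaks, red-teaming, prompt injection, and trustworthy evaluation.

2 papers indexed for today / the current date. Signal sources: cs.CL, cs.AI, cs.CY, cs.LG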

2606.20408 2026-06-19 cs.CR cs.AI New submission 95%

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park

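Affiliations: AIM Intelligence; KAERI (Korea Atomic Energy Research Institute)

Topic match: Red-teaming. A multi-turn red-teaming benchmark that evaluates the robustness of LLM agents in safety-critical systems.

AI summary: Proposes NRT-Bench, a benchmark that evaluates the adversarial robustness of LLM agents in safety-critical systems through multi-turn red-teaming in a simulated nuclear power plant control room. It finds that vulnerabilities barely overlap across models and that the effect of defences is highly model-dependent.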

Abstract

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.
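The overlap claim in the abstract (similar aggregate rates, nearly disjoint failures) can be illustrated with a small set computation. The following is a minimal sketch, not the NRT-Bench replay tooling: the function name `overlap_report`, the model names, and the toy session ids are all hypothetical, and it assumes each model's outcome on a replayed attack session reduces to "lost a CSF" or "did not". For scale, if the reported rates are taken over all 149 sessions, 8.7% and 12.1% would correspond to roughly 13 and 18 sessions.

```python
from itertools import combinations


def overlap_report(failures, n_sessions):
    """Summarise how failures overlap across models.

    failures: dict mapping a model name to the set of session ids in which
              that model's operator team lost a critical safety function.
    n_sessions: total number of replayed attack sessions.
    """
    # Per-model attack success rate (the aggregate number a leaderboard shows).
    rates = {model: len(ids) / n_sessions for model, ids in failures.items()}

    sets = list(failures.values())
    defeated_any = set().union(*sets)       # sessions that beat at least one model
    defeated_all = set.intersection(*sets)  # sessions that beat every model

    # Pairwise Jaccard similarity: 0.0 means disjoint, 1.0 means identical.
    jaccard = {}
    for a, b in combinations(sorted(failures), 2):
        union = failures[a] | failures[b]
        jaccard[(a, b)] = len(failures[a] & failures[b]) / len(union) if union else 0.0

    return {
        "rates": rates,
        "defeated_any": len(defeated_any),
        "defeated_all": len(defeated_all),
        "jaccard": jaccard,
    }


# Toy data (hypothetical): four models, 20 sessions, same rate, little overlap.
toy_failures = {
    "model_a": {1, 2},
    "model_b": {2, 3},
    "model_c": {4, 5},
    "model_d": {6, 7},
}
report = overlap_report(toy_failures, n_sessions=20)
print(report)
```

In the toy data every model fails on 10% of sessions, yet no session defeats all four and seven distinct sessions defeat at least one. That is the pattern the abstract calls nearly disjoint rather than nested: a nested pattern would show the weaker models' failure sets containing the stronger models' sets, with pairwise Jaccard values close to 1.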

2606.19887 2026-06-19 cs.CR cs.AI New submission 90%

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

Chaeyun Kim, Daeyoung Park, Junghwan Kim, Jinyoung Jeong, Eunji Song, Yongtaek Lim, Minwoo Kim

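Affiliations: DATUMO INC.; Korea Advanced Institute of Science and Technology (KAIST); Financial Security Institute (FSI)

Topic match: Red-teaming. An expert-guided red-teaming framework for financial LLMs.

AI summary: Proposes the FinRED framework, which uses an expert-guided two-level taxonomy to map global financial standards to threats, generates context-rich red-teaming Behavioral Prompts from real financial documents, and pairs them with an expert-validated evaluation rubric that reduces critical false negatives.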

Abstract

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.
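The rubric comparison in the abstract is stated as a count of critical false negatives. The abstract does not define the term, so the sketch below assumes the usual reading: a response that human experts label unsafe but that the rubric-based judge marks safe. The function name `count_false_negatives` and the toy labels are hypothetical and are not taken from the FinRED dataset.

```python
def count_false_negatives(expert_unsafe, judge_unsafe):
    """Count responses experts flagged as unsafe that the judge let through.

    Both arguments are equal-length lists of booleans, where True means
    "this response is unsafe". Assumed definition, not FinRED's own.
    """
    if len(expert_unsafe) != len(judge_unsafe):
        raise ValueError("label lists must have the same length")
    return sum(1 for e, j in zip(expert_unsafe, judge_unsafe) if e and not j)


# Toy labels (hypothetical) for six model responses.
expert = [True, True, False, True, False, True]
static_rubric = [False, True, False, False, False, False]  # e.g. disclaimer check only
domain_rubric = [True, True, False, False, False, True]    # finance-specific rubric

fn_static = count_false_negatives(expert, static_rubric)
fn_domain = count_false_negatives(expert, domain_rubric)
print(fn_static, fn_domain)
```

False negatives are the costly error for a safety judge because they are unsafe outputs that pass evaluation unnoticed, which is presumably why the abstract reports this count rather than overall agreement.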
