arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04547 2026-06-16 cs.IR cs.CL Updated version

Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization

Heng Cao, Fan Zhang, Jian Yao, Yujie Zheng, Changlin Zhao, Lu Hao, Yuxuan Wei, Wangze Ni, Huaiyu Fu, Yuqian Sun, Xuyan Mo

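Affiliations * Microsoft; Shanghai International Studies University; Zhejiang University; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University

AI summary Proposes the TAP-PER framework, which encodes user preferences in learnable user-state prefix embeddings, avoiding explicit prompt construction and heavy per-user adapters; it outperforms baselines on six LaMP tasks and substantially reduces parameter overhead.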

Comments 16 pages, 6 figures

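AI abstract (translated from Chinese)

Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing methods typically personalize either at the input level (by retrieving user histories or constructing profile prompts) or at the parameter level (by maintaining user-specific parameter-efficient modules). The former makes personalization sensitive to retrieval quality and prompt design, while the latter incurs storage and maintenance costs that grow with the number of users. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommender systems, TAP-PER decomposes user modeling into user-state and query-conditioned components and introduces temporal signals to capture how user interests evolve. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines in classification, rating, and generation settings. In addition, at the 1,000-user scale, TAP-PER uses 130x fewer per-user parameters than OPPU and about half the total parameters of PER-PCS, showing that scalable LLM personalization is achievable without explicit prompt construction or heavy per-user adapters.

English abstract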

Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.

2606.03489 2026-06-16 cs.CR cs.AI Updated version

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

Wenqi Chen, Ziyan Zhang, Bin Wang, Lin Liu, Hengheng Zhang, Zhengsu Chen

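Affiliations * arXiv.org; GitHub

AI summary Proposes the Tree-like Self-Play (TSP) framework, which models secure code generation as a fine-grained sequential decision process; by building a decision tree that explores secure and vulnerable paths, the model self-corrects at critical decision nodes, markedly improving code security and generalizing across languages.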

Comments 18 pages, 3 figures, Accepted by ICML 2026

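AI abstract (translated from Chinese)

Although large language models (LLMs) excel at code generation, they remain prone to reproducing the subtle but critical security vulnerabilities inherent in their training data. Current alignment techniques, such as supervised fine-tuning (SFT) and reinforcement learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single wrong token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that recasts secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP builds a decision tree in which the model explores branching trajectories, generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces the model to self-correct at the critical decision nodes where vulnerabilities typically arise. Our experiments show that TSP fundamentally improves model reliability. On Python security benchmarks, TSP raises CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also transfers security principles learned from C/C++ to multiple languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches but internalizes abstract, language-agnostic security logic.

English abstract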

While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories--generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.

2606.01613 2026-06-16 cs.IR cs.AI cs.MA Updated version

TechRAG: Evidence-Gated Multimodal Agentic RAG for Technical Literature Reasoning

TechGraphRAG: An Agentic Graph-Augmented RAG Framework for Technical Literature Reasoning

Kanwar Bharat Singh

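Affiliations * Global Tire Intelligence and Solutions (GTIS); The Goodyear Tire & Rubber Company

AI summary Proposes an agentic retrieval-augmented generation framework with a 13-step autonomous pipeline that supports domain-specific technical literature reasoning through evidence sufficiency scoring, knowledge graph traversal, and self-correcting generation.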

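AI abstract (translated from Chinese)

This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated on a curated corpus of about 2,100 academic papers on intelligent tires, vehicle dynamics, and vehicle control. Unlike conventional single-pass RAG systems, the proposed architecture uses a 13-step autonomous pipeline: it classifies queries by intent, assesses evidence sufficiency against a multi-dimensional rubric, performs agentic retries with drift-guarded query reformulation, searches external academic databases (Crossref, OpenAlex, Semantic Scholar) through iterative optimize-search-vet loops, traverses a Neo4j knowledge graph for relational context, verifies citation integrity, and applies post-generation quality checks with automatic regeneration. Key contributions include: a 100-point evidence sufficiency scoring framework spanning five dimensions, with relevance decay and hybrid rule-based/LLM review; a route-dependent external search architecture with iterative agentic loops; a knowledge graph built through LLM-based entity extraction, OpenAlex author validation, and intra-corpus citation resolution; and a self-correcting generation loop with citation verification and quality assessment. As a case study of a practical implementation, the framework shows how agentic, evidence-based RAG can support literature navigation and technical reasoning over large domain-specific corpora.

English abstract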

This paper presents an agentic multimodal retrieval-augmented generation (RAG) framework for domain-specific literature reasoning, instantiated on a curated corpus of several thousand papers in intelligent tires, vehicle dynamics, vehicle control, sensing, estimation, and machine learning. Unlike conventional single-pass RAG systems, the proposed architecture uses an autonomous, evidence-gated pipeline that classifies query intent, generates separate text and visual query rewrites, performs hybrid text retrieval with FAISS and BM25 followed by cross-encoder reranking, expands evidence through graph-guided chunk traversal over a Neo4j knowledge graph, and retrieves visual document evidence using ColSmol late-interaction embeddings with MUVERA fixed-dimensional encoding, approximate nearest-neighbor search, and MaxSim reranking. The framework scores evidence sufficiency using a 100-point rubric with hybrid rule-based/LLM review, retries retrieval through drift-guarded reformulation, searches external academic databases through optimize--search--vet loops, merges and deduplicates multimodal evidence, verifies citation integrity, and generates cited answers through Planner, Researcher, Writer, and Critic agents with self-correcting revision. Key contributions include: (i) a scalable multimodal retrieval architecture combining text, graph, and visual evidence over 40,000 document pages; (ii) an interpretable evidence sufficiency and retry mechanism; (iii) a multi-agent generation pipeline with evidence mapping and critic-driven revision; (iv) a domain knowledge graph with LLM-based entity extraction, OpenAlex author validation, and intra-corpus citation resolution; and (v) a route-dependent external search architecture for targeted literature expansion. The result is a practical, evidence-gated, multimodal agentic RAG architecture for technical reasoning over specialized research corpora.

2606.01110 2026-06-16 physics.geo-ph cs.LG quant-ph Updated version

Accelerating physics-informed neural networks for full waveform inversion using a hybrid quantum-classical finite-basis architecture

Hoang Anh Nguyen, Divakar Vashisth, Ali Tura

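Affiliations * Department of Geophysics, Colorado School of Mines; Department of Energy Science and Engineering, Stanford University; Department of Petroleum Engineering, Colorado School of Mines

AI summary Proposes a hybrid quantum-classical FBPINN for acoustic full waveform inversion that implements the wavefield and velocity networks with parameterized quantum circuits; it reaches a lower L1 velocity error than the classical baseline in about 8x fewer training iterations and generalizes to other wave-based inverse problems.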

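AI abstract (translated from Chinese)

Full waveform inversion (FWI) reconstructs heterogeneous material properties from receiver data but is computationally demanding. Physics-informed neural networks (PINNs) and their domain-decomposed variants (FBPINNs) offer a mesh-free alternative but face convergence challenges when representing complex velocity fields. We propose a hybrid quantum-classical FBPINN for acoustic FWI that combines quantum computing with classical machine learning: the decomposed wavefield network and the global velocity network are implemented as classical-to-quantum pipelines terminating in parameterized quantum circuits (PQCs). The PQCs are implemented as differentiable JAX statevector simulators, enabling end-to-end automatic differentiation through the classical PINN, the quantum circuit, and the physics-informed loss. On a geophysical anomaly benchmark, the quantum hybrid model reaches a lower L1 velocity error than the primary classical FBPINN baseline in about 8x fewer training iterations, despite using about 33% fewer trainable parameters, and it outperforms all 15 classical hyperparameter variants. A second benchmark (checkerboard) demonstrates the generality of the inversion pipeline, confirming that the quantum hybrid architecture can recover structured spatial variations beyond the localized anomaly benchmark. Our framework applies broadly to wave-based inverse problems, including medical ultrasound tomography and non-destructive evaluation.

English abstract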

Full waveform inversion (FWI) reconstructs heterogeneous material properties from receiver data but remains computationally demanding. Physics-informed neural networks (PINNs) and their domain-decomposed variants (FBPINNs) offer a mesh-free alternative but face convergence challenges when representing complex velocity fields. We present a hybrid quantum-classical FBPINN for acoustic FWI, bringing together quantum computing and classical machine learning, in which the decomposed wavefield network and the global velocity network are implemented as classical-to-quantum pipelines terminating in parameterized quantum circuits (PQCs). The PQCs are realized as differentiable JAX statevector simulators, enabling end-to-end automatic differentiation through the classical PINN, the quantum circuit, and the physics-informed loss. On a geophysical anomaly benchmark, the quantum hybrid reaches a lower L1 velocity error than the primary classical FBPINN baseline in approximately 8x fewer training iterations, despite using approximately 33% fewer trainable parameters, and it outperforms all 15 classical hyperparameter variants tested. A second benchmark (checkerboard) demonstrates the generality of the inversion pipeline, confirming that the quantum hybrid architecture can recover structured spatial variations beyond the localized anomaly benchmark. Our framework is broadly applicable to wave-based inverse problems beyond geophysics, including medical ultrasound tomography and non-destructive evaluation.

2605.30837 2026-06-16 cs.CR cs.LG Updated version

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Shuhao Zhang, Jiarui Li, Qi Cao, Ruiyi Zhang, Pengtao Xie

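Affiliations * UC San Diego; University of Illinois Urbana-Champaign

AI summary To address the heterogeneity and unreliability of prompt-injection detectors, proposes the SCOUT framework, which predicts each detector's per-sample reliability and latency and allocates detectors dynamically, trading off safety against efficiency.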

Comments We propose SCOUT, a detector allocation framework that predicts each detector's accuracy and latency on a given input before running it, letting operators control the safety-utility trade-off with a single threshold and route to an LLM judge only when needed

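AI abstract (translated from Chinese)

Prompt-injection detectors are heterogeneous: each performs strongly on a different slice of attacks, but none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide for each request which detectors to run and whether to escalate to an LLM judge. Our framework, SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage), makes this decision dynamic by predicting each detector's per-sample reliability and latency from its behavior on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility combines benign-pass rate and wall-clock time). To evaluate this setting, we build the SCOUT-450 benchmark, which captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, relative to an always-on GPT-4o judge, a safety-oriented operating point reduces attack success rate by 46% and total wall-clock time by 40%, with a 5.1-point drop in benign utility. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier.

English abstract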

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock by 40% relative to an always-on GPT-4o judge, at a 5.1-point benign-utility drop. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier.

2605.28734 2026-06-16 cs.CR cs.CL cs.LG Updated version

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

Richard J. Young, Gregory D. Moody

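Affiliations * University of Nevada Las Vegas; Department of Information Systems

AI summary Builds a prompt bank labeled by five-judge consensus (4,748 requests for executable malicious code and 1,923 requests for harmful security knowledge), providing a reliable benchmark, comparable across corpora, for measuring how coding models refuse malicious-code requests.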

Comments 23 pages, 9 figures, 6 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) spanning diverse elicitation paradigms; 6,675 prompts, 33,375 classification calls

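AI abstract (translated from Chinese)

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance means coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field currently cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-use weapons) with requests for harmful security knowledge (information that still requires a human operator), and they report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that separates the two request types and provides a structurally stable basis for measuring coding-model compliance across corpora. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The judge panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts receive agreement from at least four judges, 76.9% receive unanimous agreement, and the panel reproduces the prior four-corpus release with Cohen's kappa = 0.952 on 3,133 shared prompts. The released bank contains 4,748 consensus-CODE prompts (requests for executable malicious code) and 1,923 consensus-KNOWLEDGE prompts (requests for harmful security knowledge). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable outputs demand.

English abstract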

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, ransomware, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software with requests for harmful security knowledge and report refusal rates over non-comparable corpora. This paper's central result is that the CODE-versus-KNOWLEDGE classification axis established in a prior four-corpus release remains stable under a substantially expanded corpus pool and an independently refreshed judge panel, evidence that it measures a real construct rather than an artifact of the prompts or judges. Eight corpora spanning diverse elicitation paradigms (direct, jailbreak-decorated, indirect, and agent/interpreter: ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls), reaching Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"). Critically, the panel shares no judge with the prior release (five paid commercial APIs replaced by five open-weight models from five vendors), yet the two panels agree on 94.45% of the 3,133 shared prompts and reach Cohen's kappa = 0.952 [0.942, 0.963] on the 3,031-prompt binary overlap: the axis survives near-total panel replacement. The released bank comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, a reliability-quantified benchmark whose central classification axis is shown stable across corpus expansion and judge-panel replacement.
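The panel agreement figure quoted above is a Fleiss' kappa, the standard chance-corrected agreement measure when every item is labeled by the same number of raters. A minimal Python sketch of that statistic follows. The function name `fleiss_kappa` and the four toy prompts are illustrative assumptions, not the paper's code or data; only the formula is standard.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of per-item label lists, e.g. [["CODE", "CODE", ...], ...]
    categories: every label that may occur
    Undefined (division by zero) if all labels fall in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(s * s for s in shares)

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy panel: 4 prompts, 5 judges each (not data from the paper).
toy_ratings = [
    ["CODE"] * 5,
    ["KNOWLEDGE"] * 5,
    ["CODE"] * 4 + ["KNOWLEDGE"],
    ["KNOWLEDGE"] * 3 + ["CODE"] * 2,
]
toy_kappa = fleiss_kappa(toy_ratings, ["CODE", "KNOWLEDGE"])

# The call count quoted in the abstract is plain multiplication.
total_calls = 6675 * 5
```

On the toy panel the observed agreement is 0.75 and the chance agreement is 0.505, so the statistic comes out a little under 0.5. This shows how two split votes out of four items pull kappa well below raw agreement.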

2605.26595 2026-06-16 cs.CR cs.AI cs.LG Updated version

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Zedian Shao, Charles Fleming, Teodora Baluta

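Affiliations * Georgia Institute of Technology; Cisco Systems

AI summary Proposes a data poisoning method that uses semantic associations to teach an LLM to hide arbitrary malicious instructions, enabling covert control attacks that bypass multiple defenses.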

Comments USENIX Security '26

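AI abstract (translated from Chinese)

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks rely mainly on fixed trigger phrases, which defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that reliably and stealthily teaches an LLM an information hiding scheme through semantic associations between shared knowledge (such as facts or concepts) and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across 5 LLMs, 3 backdoor defenses, and 4 prompt injection defenses. With a small number of poisoned samples, covert control attacks exceed heuristic-based prompt injection attacks in average attack success rate by about 40% (relative to clean fine-tuned models). They also circumvent detection-based and fine-tuning-based defenses, maintaining up to 93% attack success rate after backdoor defenses and up to 98% after prompt injection defenses.

English abstract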

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.

2605.30208 2026-06-16 cs.SE cs.AI Updated version

Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

Chris Adams, Arjun Singh Banga, Parveen Bansal, Souvik Bhattacharya, Payal Bhuptani, Rujin Cao, Pedro Canahuati, Nate Cook, Brian Ellis, Prabhakar Goyal, Gurinder Grewal, Tianyu He, Matt Labunka, Alex Manners, David Molnar, Ging Cee Ng, Vishal Parekh, Jiefu Pei, Frederic Sagnes, James Saindon, Will Shackleton, Sid Sidhu, Gursharan Singh, Karthik Chengayan Sridhar, Matt Steiner, Pratibha Udmalpet, Sean Xia, Stacey Yan, Audris Mockus, Peter Rigby, Nachiappan Nagappan

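Affiliations * Meta USA, UK, Canada

AI summary Proposes the RADAR system, which automates risk-stratified review of code diffs through a multi-stage funnel; after deployment at Meta it markedly improved review efficiency and reduced risk.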

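AI abstract (translated from Chinese)

AI-assisted coding tools have changed software production. At Meta, lines of code per human-landed diff grew 105.9% year over year and per-developer diff volume grew 51%, with agentic AI contributing more than 80% of that growth. Meanwhile, the share of diffs receiving timely review has fallen, exposing a gap between code supply and review bandwidth. We ask three questions that move from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across different organizations, (2) how does adjusting the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by author and source type and applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based automated code review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons of policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile raised the approval rate to 60.31%. The revert rate of RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the production incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by more than 330% and median diff review wall time by 35%. Risk-aware layered automation can substantially reduce the review bottleneck created by AI-driven code growth without compromising production safety.

English abstract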

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.

2605.29874 2026-06-16 cs.MA cs.AI cs.GT Updated version

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Francisco León Zúñiga Bolívar

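Affiliations * Institución Universitaria Colegio Mayor del Cauca

AI summary Extends the benchmark of Willis et al. by testing four frontier LLMs from 2025-2026 for cooperative bias in the Iterated Prisoner's Dilemma; cooperative bias is widespread but differs substantially across providers, and noise remains a universal challenge.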

Comments v2 (erratum): two truncated Gemini 3.1 Pro libraries regenerated; cooperative-plurality 9/12->10/12, conclusions unchanged. 11 pages, 3 figures, 8 tables. Extends arXiv:2501.16173. Code and n=500 replication: https://github.com/arqFranciscoLeon/evollm (archived: https://doi.org/10.5281/zenodo.20248615)

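AI abstract (translated from Chinese)

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or do scale and provider diversity reshape equilibrium behavior in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 (Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini), applying the same protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model-prompt combinations favor cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models, and Claude Sonnet 4.6 Refine reaches the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, the average noise sensitivity of Claude Sonnet 4.6 is about 6 percentage points versus 13 percentage points for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

English abstract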

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.
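For readers unfamiliar with the Iterated Prisoner's Dilemma underlying this benchmark, the scoring of a single match can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The payoff values (5, 3, 1, 0) are the conventional textbook ones and are not taken from the abstract, the two strategies are hand-written classics rather than LLM-generated ones, and the sketch omits the Moran process and the noise conditions entirely.

```python
# Assumed conventional payoffs: (my move, opponent move) -> (my score, opponent score).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(own_history, opponent_history):
    """Defect unconditionally."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a fixed-length match and return the two cumulative scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        gain_a, gain_b = PAYOFF[(move_a, move_b)]
        score_a += gain_a
        score_b += gain_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


mutual_cooperation = play(tit_for_tat, tit_for_tat)   # both cooperate every round
exploitation = play(tit_for_tat, always_defect)       # one sucker round, then mutual defection
```

Under these assumed payoffs, two cooperators each score more over ten rounds than either player in the defecting match. That is the tension an evolutionary analysis of cooperative versus aggressive strategies has to resolve.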

2605.29208 2026-06-16 cs.MS cs.LG Updated version

libhmm: A Modern C++20 Library for Hidden Markov Models with Correct MLE Emission M-Steps

Gary Wolfman

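Affiliations * Independent Researcher

AI summary Introduces libhmm, a C++20 library for Hidden Markov Model parameter estimation, sequence decoding, and model selection. It addresses the lack of a zero-dependency C++ HMM library and the widespread use of method-of-moments approximations in the emission M-step of the Baum-Welch algorithm, implementing correct maximum likelihood estimation for sixteen continuous and discrete emission distributions.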

Comments 17 pages, 3 figures, 8 tables

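AI abstract (translated from Chinese)

We describe libhmm, a C++20 library for Hidden Markov Model parameter estimation, sequence decoding, and model selection. libhmm addresses two gaps in existing software: the lack of a well-maintained, zero-dependency C++ HMM library suitable for embedding in production systems, and the widespread use of method-of-moments approximations in the emission distribution M-step of the Baum-Welch algorithm. The library implements correct maximum likelihood estimation for sixteen continuous and discrete emission distributions, including an ECME algorithm for the location-scale Student-t distribution, Newton-Raphson maximization for the Gamma, Beta, Weibull, and Negative Binomial distributions, and the von Mises distribution for circular data. All forward-backward and Viterbi computations run in full log-space. SIMD acceleration is provided for AVX-512, AVX2, SSE2, and ARM NEON through compile-time dispatch with a scalar fallback. Python bindings are available through the companion package pylibhmm. We compare libhmm with existing C and C++ HMM libraries and with published R reference packages on five real-data benchmarks, and discuss the architectural trade-offs made in the design.

English abstract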

We describe libhmm, a C++20 library for Hidden Markov Model parameter estimation, sequence decoding, and model selection. libhmm addresses two gaps in existing software: the absence of a well-maintained, zero-dependency C++ HMM library suitable for embedding in production systems, and the widespread use of method-of-moments (MOM) approximations in the emission distribution M-step of the Baum-Welch algorithm. The library implements correct maximum likelihood estimators for sixteen scalar emission distributions, including an ECME algorithm for the location-scale Student-t distribution, Newton-Raphson maximization for Gamma, Beta, Weibull, and Negative Binomial distributions, and the von Mises distribution for circular data. All forward-backward and Viterbi calculations operate in full log-space. SIMD acceleration is provided for AVX-512, AVX2, SSE2, and ARM NEON via compile-time dispatch with scalar fallback. Version 4 adds multivariate observation support via the BasicHmm<Obs> template, with three multivariate emission families (diagonal Gaussian, full-covariance Gaussian, and independent components) each with correct weighted MLE M-steps. Python bindings are available via the companion package pylibhmm. We compare libhmm against established C and C++ HMM libraries and against published R reference packages on seven real-data benchmarks, and discuss the architectural tradeoffs made in the design.
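The remark that all forward-backward calculations operate in log-space can be made concrete with a small Python sketch of the forward recursion. It is a generic textbook formulation, not libhmm's C++ implementation or the pylibhmm API. The names `logsumexp` and `forward_log_likelihood` and the input layout are assumptions chosen for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-domain values."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_log_likelihood(log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under an HMM.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log emission probability (or density) of observation t in state j
    """
    n_states = len(log_pi)
    # Initialisation: alpha_0(j) = pi_j * b_j(o_0), kept in the log domain.
    alpha = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t).
    for t in range(1, len(log_B)):
        alpha = [
            logsumexp([alpha[i] + log_A[i][j] for i in range(n_states)]) + log_B[t][j]
            for j in range(n_states)
        ]
    return logsumexp(alpha)


# Hypothetical two-state model and a two-symbol observation sequence.
log_pi = [math.log(0.6), math.log(0.4)]
log_A = [[math.log(0.7), math.log(0.3)], [math.log(0.4), math.log(0.6)]]
log_B = [[math.log(0.5), math.log(0.1)], [math.log(0.5), math.log(0.9)]]
log_likelihood = forward_log_likelihood(log_pi, log_A, log_B)
```

The benefit shows on long sequences. A product of thousands of small emission probabilities underflows to zero in linear space, whereas the log-domain recursion keeps a finite, usable value.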

2605.25796 2026-06-16 cs.CR cs.AI cs.CL Updated version

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

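Affiliations * University of Science and Technology of China

AI summary Proposes SAMark, a self-anchored watermarking framework that establishes step-independent green regions in semantic space, independent of sentence order, and combines a multi-channel hyperbolic scoring mechanism with a diversity-aware filtering strategy; it achieves high detection rates under paragraph-level paraphrasing attacks and breaks the robustness-quality trade-off.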

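AI abstract (translated from Chinese)

Semantic-level watermarking improves robustness to text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult, because such attacks disrupt watermark signals globally by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependence on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies the watermark signal while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, going beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that under typical paragraph-level paraphrasing attacks SAMark achieves up to 90.2% TP@FP1%, exceeding the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.

English abstract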

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.
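The TP@FP1% figure is a detection metric: the true positive rate on watermarked text at a threshold that flags at most 1% of unwatermarked text. A minimal Python sketch of how such a metric is typically computed from detector scores follows. The function name `tpr_at_fpr` and the scores are hypothetical, and SAMark's own scoring mechanism is not reproduced here.

```python
def tpr_at_fpr(watermarked_scores, clean_scores, max_fpr=0.01):
    """True positive rate at the strictest threshold allowing at most max_fpr false positives.

    A text is flagged as watermarked when its score is strictly above the threshold.
    """
    clean_sorted = sorted(clean_scores, reverse=True)
    # Number of clean texts that may be flagged (small epsilon guards float rounding).
    allowed_false_positives = int(max_fpr * len(clean_sorted) + 1e-9)
    allowed_false_positives = min(allowed_false_positives, len(clean_sorted) - 1)
    # At most `allowed_false_positives` clean scores lie strictly above this value.
    threshold = clean_sorted[allowed_false_positives]
    detected = sum(1 for score in watermarked_scores if score > threshold)
    return detected / len(watermarked_scores)


# Hypothetical detector scores: 100 clean texts and 5 watermarked texts.
clean = list(range(100))
watermarked = [150, 120, 100, 50, 10]
rate = tpr_at_fpr(watermarked, clean)
```

With these made-up scores exactly one clean text may be flagged, which puts the threshold at 98 and detects three of the five watermarked texts.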

2601.07326 2026-06-16 math.OC cs.LG Updated version

Convergence Rate Analysis of the AdamW-style Shampoo: Unifying One-Sided and Two-Sided Preconditioning

Huan Li, Yiming Dong, Zhouchen Lin

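Affiliations * Huan Li; Yiming Dong; Zhouchen Lin

AI summary Studies the AdamW-style Shampoo optimizer, unifies one-sided and two-sided preconditioning, and establishes a convergence rate measured in the nuclear norm that, in the ideal case, is analogous to the optimal convergence rate of SGD.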

Comments V3: ICML camera-ready. V4 vs. V3: extends to the more general setting where the exponents of the two preconditioners do not sum to 1/2

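AI abstract (translated from Chinese)

This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}})$, measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\|\nabla f(X)\|_F$, which supports viewing our convergence rate as analogous to the optimal SGD convergence rate $\frac{1}{K}\sum_{k=1}^KE\left[\|\nabla f(X_k)\|_F\right]\leq O(\frac{C}{K^{1/4}})$ in the ideal case where $\|\nabla f(X)\|_*= \Theta(\sqrt{m+n})\|\nabla f(X)\|_F$ and $m$ and $n$ are balanced.

English abstract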

This paper studies AdamW-style Shampoo, an effective variant of the classical Shampoo that won the external tuning track of the AlgoPerf neural network training competition. Our analysis unifies one-sided and two-sided preconditioning. When the exponents of the two preconditioners sum to $1/2$, we establish the convergence rate $\frac{1}{K}\sum_{k=1}^KE\left[||\nabla f(X_k)||_*\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}})$, where $K$ represents the number of iterations, $(m,n)$ denotes the dimensions of the matrix-valued parameters, and $C$ matches the constant appearing in the optimal convergence rate of SGD. Theoretically, the nuclear norm and Frobenius norm satisfy $||\nabla f(X)||_F\leq ||\nabla f(X)||_*\leq \sqrt{\min\{m,n\}}||\nabla f(X)||_F$, which suggests that our convergence rate is analogous to the optimal $\frac{1}{K}\sum_{k=1}^KE\left[||\nabla f(X_k)||_F\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case where $||\nabla f(X)||_*= \Theta(\sqrt{\min\{m,n\}})||\nabla f(X)||_F$ and $m$ and $n$ are of comparable magnitude. Then, we extend our analysis to settings where the preconditioning exponents do not sum to $1/2$, and establish convergence with an explicit but more involved rate.
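The two-sided bound between the Frobenius and nuclear norms quoted in the abstract follows from a short, standard argument on singular values. The derivation below is a sketch of that textbook fact, not the paper's proof; the symbols $G$, $\sigma_i$, and $r$ are introduced here for illustration.

```latex
Let $G = \nabla f(X) \in \mathbb{R}^{m \times n}$ have nonzero singular values
$\sigma_1, \dots, \sigma_r > 0$, where $r = \operatorname{rank}(G) \le \min\{m,n\}$. Then
\[
  \|G\|_F = \Big( \sum_{i=1}^{r} \sigma_i^2 \Big)^{1/2},
  \qquad
  \|G\|_* = \sum_{i=1}^{r} \sigma_i .
\]
Lower bound (all cross terms are nonnegative):
\[
  \|G\|_*^2 = \Big( \sum_{i} \sigma_i \Big)^2
            = \sum_{i} \sigma_i^2 + \sum_{i \ne j} \sigma_i \sigma_j
            \ \ge\ \|G\|_F^2 .
\]
Upper bound (Cauchy-Schwarz against the all-ones vector):
\[
  \|G\|_* = \sum_{i=1}^{r} 1 \cdot \sigma_i
          \ \le\ \sqrt{r}\, \Big( \sum_{i} \sigma_i^2 \Big)^{1/2}
          \ \le\ \sqrt{\min\{m,n\}}\; \|G\|_F .
\]
```

The upper bound is tight when $G$ has full rank and all singular values are equal, which corresponds to the ideal case the abstract describes.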

2605.21629 2026-06-16 cs.CY cs.AI cs.HC Updated version

Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build

Sina Rismanchian, Hasan Uzun, Jeffrey Matayoshi, Eric Cosyn, Eyad Kurd-Misto

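Affiliations * University of California, Irvine; McGraw Hill

AI summary Examines how generative AI affects students' learning processes and outcomes. Analyzing a large volume of learning-interaction data, it finds that AI use reduces the time students spend on AI-susceptible problems and that this apparent efficiency gain disappears under proctoring, pointing to far-reaching effects of AI on study behavior and knowledge building.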

AI abstract

How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for the time-on-task analysis, complemented by ALEKS PPL placement-assessment data for the proctoring and retention analyses, with a quasi-experimental design exploiting within-curriculum variation in AI susceptibility: text-based word problems transcribable into AI prompts serve as the treated group; graph-based problems requiring interactive platform manipulation as the comparison. Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. The divergence vanishes entirely under proctoring for college students, making general efficiency gains unlikely. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.

英文摘要

How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for investigating time-on-task, complemented by ALEKS PPL placement-assessment data for examining proctoring and learning outcomes, with a quasi-experimental design exploiting variation in tasks that are more susceptible to AI (text-based word problems) and less susceptible to AI (interactive graph-based problems). Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. Among college students, the post-ChatGPT divergence vanishes entirely under proctoring, ruling out broad efficiency gains as the likely explanation. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.
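As a rough arithmetic check on the figures quoted in the abstract above, the sketch below compounds a constant 2.8% per-quarter decline over eleven quarters and converts a 25% decline in odds into a change in probability. The constant-rate compounding and the 70% baseline accuracy are illustrative assumptions, not details from the paper, whose estimates come from fitted fixed-effects models.

```python
def cumulative_decline(rate_per_quarter: float, quarters: int) -> float:
    """Cumulative fractional decline, assuming the same rate compounds each quarter."""
    return 1.0 - (1.0 - rate_per_quarter) ** quarters


def apply_odds_change(p_baseline: float, odds_ratio: float) -> float:
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    odds = p_baseline / (1.0 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    # 2.8% per quarter over eleven quarters (figures from the abstract).
    print(f"cumulative decline: {cumulative_decline(0.028, 11):.3f}")
    # A 25% decline in odds, applied to a hypothetical 70% baseline accuracy.
    print(f"new probability:    {apply_odds_change(0.70, 0.75):.3f}")
```

Under these assumptions the compounded decline comes out close to the reported 26.9%, and a 25% drop in odds moves a 70% success probability to roughly 64%, which shows that an odds decline is smaller than it sounds when read as a probability change.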

2605.21312 2026-06-16 cs.DC cs.AI cs.LG 版本更新

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Frontier: 向全面且准确的LLM推理模拟迈进

Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu, Hong Xu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Anuttacon StepFun

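AI Summary This paper presents Frontier, a discrete-event simulator for modern LLM inference serving. Through a disaggregated abstraction and explicit modeling of key runtime optimizations, it predicts complex workloads accurately and yields more precise estimates of computation, communication, and memory costs across serving scenarios.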

详情
AI中文摘要

现代LLM服务已不再是单一或整体的。生产系统现在结合了解耦执行、复杂并行性、运行时优化和状态化工作负载,如推理、代理和RL展开。模拟对于探索这个快速增长的设计空间具有吸引力,但现有模拟器缺乏所需的架构完整性和决策级精度。它们的单体-副本抽象不适合解耦服务,而平均情况分析代理可能会扭曲SLA预测甚至逆转优化结论。我们提出了Frontier,一种用于现代LLM推理服务的离散事件模拟器。Frontier具有解耦抽象。它通过建模共置、预填解码解耦(PDD)和注意力-前馈网络解耦(AFD)与角色特定的集群工作者,捕捉现代服务系统的结构和动态。它在调度器-批次引擎循环中整合关键运行时优化(例如CUDA图、推测解码),并支持新兴工作负载的状态请求。它进一步提供了在多样化服务场景中对计算、通信和内存成本的准确且可推广的预测。在16-H800 GPU测试平台上,Frontier实现了平均吞吐量误差低于4%。与最先进的模拟器相比,它在共置情况下将端到端延迟误差从44.9%降低到6.4%,在解耦情况下从51.7%降低到2.6%。它扩展到超过1000个GPU在商用CPU上,并启用了新的用例,如依赖SLA的帕累托前沿探索、异构解耦分配、代理推理调度验证和RL后训练重配置。

英文摘要

Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On a 16-GPU H800 testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration. We release Frontier at https://github.com/NetX-lab/Frontier.
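To make the idea of a discrete-event simulator with prefill-decode disaggregation concrete, here is a minimal toy sketch. It is not Frontier's code or API: the `simulate` function, the single prefill worker, the single decode worker, and the fixed service times are all hypothetical simplifications chosen for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    time: float
    seq: int                              # tie-breaker for events at the same time
    kind: str = field(compare=False)
    req_id: int = field(compare=False)


def simulate(arrivals, prefill_time, decode_time):
    """Toy model: one prefill worker feeds one decode worker, both FIFO.

    Returns a dict mapping request id to completion time.
    """
    events, seq = [], 0
    for req_id, t in enumerate(arrivals):
        heapq.heappush(events, Event(t, seq, "arrive", req_id))
        seq += 1

    free_at = {"prefill": 0.0, "decode": 0.0}  # when each worker is next idle
    done = {}
    while events:
        ev = heapq.heappop(events)             # always process the earliest event
        if ev.kind == "arrive":
            start = max(ev.time, free_at["prefill"])
            free_at["prefill"] = start + prefill_time
            heapq.heappush(events, Event(free_at["prefill"], seq, "prefill_done", ev.req_id))
            seq += 1
        elif ev.kind == "prefill_done":
            start = max(ev.time, free_at["decode"])
            free_at["decode"] = start + decode_time
            heapq.heappush(events, Event(free_at["decode"], seq, "decode_done", ev.req_id))
            seq += 1
        else:  # decode_done
            done[ev.req_id] = ev.time
    return done


if __name__ == "__main__":
    print(simulate([0.0, 0.0, 1.0], prefill_time=1.0, decode_time=2.0))
```

The event queue is the whole mechanism: time jumps from one event to the next, so queueing delay at the slower decode stage emerges from the ordering of events instead of being assumed from an average-case formula.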

2605.18528 2026-06-16 math.OC cs.LG 版本更新

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

尺度不变神经网络优化:范数几何与重尾噪声

Jiayu Zhang, Tianyi Lin

发表机构 * Department of Industrial Engineering and Operations Research(工业工程与运营管理系)

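AI Summary For nonconvex stochastic optimization under heavy-tailed noise, this paper establishes a dimension-dependent lower bound for scale-invariant first-order methods, shows that a batched Scion method attains a matching upper bound, and proposes a transported Scion method that exploits higher-order smoothness.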

Comments Polished writing; Fixed typos and references; 45 pages

详情
AI中文摘要

来自神经网络优化的一个日益增长的经验是,优化器的设计应尊重模型的参数化方式。尺度不变方法变得重要,因为其归一化的逐层更新不仅支持跨模型大小的超参数迁移,还能利用输入-输出矩阵范数几何。同时,深度学习中的随机梯度噪声通常远非亚高斯,可能表现出重尾。这些关键观察塑造了近期训练神经网络的算法原理,然而它们的联合理论后果仍未被充分探索。特别地,对于具有一般输入-输出矩阵范数的尺度不变方法,什么维度依赖是不可避免的,以及高阶光滑性是否能在重尾噪声下加速训练,尚不清楚。我们通过一般范数下 $\mathbb{R}^{m\times n}$ 上的非凸光滑随机优化来研究这些问题,目标是在 $p^{\mathrm{th}}$ 阶矩重尾噪声下达到 $\varepsilon$-稳定点。我们的第一个贡献是维度相关的下界:当 $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ 足够大时,任何具有谱范数的尺度不变一阶方法需要 $\Omega(\min\{m, n\}\varepsilon^{-\frac{3p-2}{p-1}})$ 次 oracle 调用。我们证明,具有谱范数的批处理 Scion 方法达到了匹配的上界 $O(\min\{m, n\}\varepsilon^{-\frac{3p-2}{p-1}})$。为了利用高阶光滑性,我们提出了一种传输 Scion 方法,并在范数为谱范数且 Hessian 矩阵 Lipschitz 连续时将界改进为 $O(\min\{m, n\}\varepsilon^{-\frac{5p-3}{2p-2}})$。最后,我们将实践启发式方法融入我们的传输方法,并在多种架构和模型大小上进行评估,展示了其在训练神经网络中的灵活性和兼容性。

英文摘要

A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. The layerwise input-output structure of neural networks motivates scale-invariant optimizers, such as Muon and Scion, whose updates also support hyperparameter transfer. At the same time, stochastic gradient noise in deep learning is often far from sub-Gaussian and may exhibit heavy tails. These observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences are underexplored. In particular, it remains unclear what dimension dependence is unavoidable for gradient-based methods when the problem class is defined by an input-output norm and the noise is heavy-tailed, and whether higher-order smoothness can accelerate training. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ equipped with general norms and under $p^{\mathrm{th}}$-moment heavy-tailed noise, where the goal is to reach an $\varepsilon$-stationary point in the dual norm. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any gradient-based method requires $\Omega(\min\{m, n\}\varepsilon^{-\frac{3p-2}{p-1}})$ oracle calls for the problem class defined by the spectral norm, which is a common input-output norm. We prove that a scale-invariant Scion method with the spectral norm achieves the matching upper bound of $O(\min\{m, n\}\varepsilon^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}\varepsilon^{-\frac{5p-3}{2p-2}})$ when the Hessian is Lipschitz. Finally, we incorporate heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility with neural network training.
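The two rates in the abstract above differ only in the exponent of $1/\varepsilon$, so a short calculation shows how much the Lipschitz-Hessian assumption buys at a given moment order $p$. The sketch below evaluates both exponents exactly; the function names are hypothetical, and restricting attention to $1 < p \le 2$ is an assumption about the usual heavy-tailed setting, not a statement from the paper.

```python
from fractions import Fraction


def first_order_exponent(p) -> Fraction:
    """Exponent (3p - 2) / (p - 1) in the matching lower and upper bounds."""
    p = Fraction(p)
    return (3 * p - 2) / (p - 1)


def transported_exponent(p) -> Fraction:
    """Exponent (5p - 3) / (2p - 2) obtained when the Hessian is Lipschitz."""
    p = Fraction(p)
    return (5 * p - 3) / (2 * p - 2)


if __name__ == "__main__":
    for p in (Fraction(3, 2), Fraction(2)):
        print(p, first_order_exponent(p), transported_exponent(p))
```

At $p = 2$ the exponents are $4$ and $7/2$. Algebraically the gap is $\frac{3p-2}{p-1} - \frac{5p-3}{2p-2} = \frac{1}{2}$ for every $p > 1$, so the higher-order method always saves a factor of $\varepsilon^{-1/2}$ in the oracle complexity.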

2605.13092 2026-06-16 stat.ML cs.LG stat.ME 版本更新

Adaptive Kernel Density Estimation with Pre-training

具有预训练的自适应核密度估计

Ruitong Zhang, Ke Deng

发表机构 * Department of Statistics and Data Science, Tsinghua University(统计与数据科学系,清华大学)

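AI Summary This paper uses pre-training to make adaptive kernel density estimation more efficient in high dimensions: a neural network recommends a suitable kernel for each sample point, and experiments show clear gains when the target distribution is close to the pre-training distribution family.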

详情
AI中文摘要

高维密度估计是一个重要且具有挑战性的统计问题。传统基于核平滑的方法在高维中效率低下,因难以指定合适的位置自适应核。本文将预训练技术引入非参数密度估计中,通过建立预训练神经网络为每个样本点推荐合适的位置自适应核,实现高维高效密度估计。大量数值实验表明,当目标分布接近预训练分布族时,该策略能显著提升密度估计精度。当目标分布与预训练分布族差异较大时,预训练策略的益处可能减弱,但可通过额外的微调过程重新激活。

英文摘要

Density estimation in high-dimensional settings is an important and challenging statistical problem. Traditional methods based on kernel smoothing are inefficient in high dimensions owing to the difficulty of specifying appropriate location-adaptive kernels. In this work, we introduce pre-training, a key idea behind many cutting-edge AI technologies, to the context of non-parametric density estimation. By establishing a pre-trained neural network that can recommend an appropriate location-adaptive kernel for each sample point, efficient density estimation with adaptive kernels is achieved in high dimensions. A wide range of numerical experiments show that this strategy is highly effective for improving density-estimation accuracy when the target distribution is close to the distribution family used for pre-training. When the target distribution is substantially different from the pre-training distribution family, the benefit from the proposed pre-training strategy may be diluted, but can be reactivated by an additional fine-tuning procedure.
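A location-adaptive kernel density estimate gives every sample point its own kernel. The one-dimensional sketch below illustrates only that structure. The `knn_bandwidths` heuristic is a hypothetical stand-in for the paper's pre-trained recommender network, and the Gaussian kernel and scalar bandwidth are simplifying assumptions.

```python
import math


def knn_bandwidths(samples, k=2):
    """Hypothetical recommender: bandwidth = distance to the k-th nearest neighbour."""
    bandwidths = []
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        bandwidths.append(max(dists[k - 1], 1e-6))  # guard against zero width
    return bandwidths


def adaptive_kde(x, samples, bandwidths):
    """Average of Gaussian kernels, one per sample, each with its own bandwidth."""
    total = 0.0
    for xi, h in zip(samples, bandwidths):
        z = (x - xi) / h
        total += math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))
    return total / len(samples)


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 3.0, 5.0]
    hs = knn_bandwidths(data)
    print(hs)
    print(adaptive_kde(0.1, data, hs))
```

Points in the dense cluster near zero receive narrow kernels while the isolated points receive wide ones, which is the behaviour a learned recommender would be expected to refine in higher dimensions.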

2605.11047 2026-06-16 cs.CR cs.AI 版本更新

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

红队代理执行上下文:OpenClaw上的开放世界安全评估

Hongwei Yao, Yiming Liu, Yiling He, Bingrun Yang

发表机构 * National University of Singapore(新加坡国立大学)

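AI Summary This paper presents DeepTrap, a framework that uses black-box trajectory-level optimization to discover contextual vulnerabilities in OpenClaw. It shows that compromised contexts can create security risks and argues for execution-centric security evaluation.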

Comments Accepted to ICML 2026 Workshop

详情
AI中文摘要

代理语言模型系统越来越多地依赖可变的执行上下文,包括文件、内存、工具、技能和辅助制品,从而产生超出显式用户提示的安全部署风险。本文提出了DeepTrap,一个自动框架,用于发现OpenClaw中的上下文漏洞。DeepTrap将对抗性上下文操纵建模为黑盒轨迹级优化问题,平衡风险实现、良性任务保留和隐蔽性。它结合了风险条件评估、多目标轨迹评分、奖励引导的束搜索和基于反射的深度探测,以识别高价值的受侵上下文。我们构建了一个包含42个案例的基准,涵盖六类漏洞和七个操作场景,并使用攻击和效用评分评估了九个目标模型。结果表明,上下文妥协可以诱导显著的不安全行为,同时保持用户面向任务的完成,证明最终响应评估是不足的。研究结果强调了对代理AI系统执行中心安全评估的必要性。我们的代码已发布在:https://github.com/ZJUICSR/DeepTrap

英文摘要

Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black-box trajectory-level optimization problem that balances risk realization, benign-task preservation, and stealth. It combines risk-conditioned evaluation, multi-objective trajectory scoring, reward-guided beam search, and reflection-based deep probing to identify high-value compromised contexts. We construct a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce substantial unsafe behavior while preserving user-facing task completion, demonstrating that final-response evaluation is insufficient. The findings highlight the need for execution-centric security evaluation of agentic AI systems. Our code is released at: https://github.com/ZJUICSR/DeepTrap

2605.03573 2026-06-16 stat.ML cs.LG 版本更新

Stochastic Schrödinger Diffusion Models for Pure-State Ensemble Generation

随机薛定谔扩散模型用于纯态集合生成

Jian Xu, Wei Chen, Shigui Li, Chao Li, Jingyuan Zheng, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * RIKEN iTHEMS RIKEN AIP South China University of Technology(华南理工大学) Stanford University(斯坦福大学) Columbia University(哥伦比亚大学)

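AI Summary This paper proposes Stochastic Schrödinger Diffusion Models (SSDMs), a score-based generative framework on the complex projective space CP^{d-1}. A local Euclidean Ornstein-Uhlenbeck approximation enables training without analytic transition densities, and the generated states improve generalization in quantum machine learning.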

详情
AI中文摘要

在量子机器学习(QML)中,经典数据通常被编码为量子纯态并直接处理为量子表示,推动了在底层表示层面生成模型的发展,该模型从底层纯态集合中采样新量子态,而非从扰动的经典输入重新准备。然而,将具有明确反向时间采样器的分数扩散模型扩展到量子纯态集合仍具挑战性,由于复射影空间CP^{d-1}的非欧几里得几何和过渡密度的不可行性。我们提出了随机薛定谔扩散模型(SSDMs),一种内在的基于分数的生成框架,配备了Fubini-Study(FS)度量。SSDMs通过随机薛定谔方程(SSE)实现正向黎曼扩散,并推导出由黎曼分数∇_{FS} log p_t驱动的反向时间动力学。为了在没有解析过渡密度的情况下进行训练,我们引入了一个基于FS正常坐标中局部欧几里得Ornstein-Uhlenbeck(OU)近似的局部时间目标,从而得到一个映射回流形的解析教师分数。实验表明,SSDMs能够忠实捕捉目标纯态集合的统计特性,包括可观测量的矩、重叠核MMD和纠缠度量,并且SSDM生成的量子表示通过表示层面的数据增强提升了下游QML的泛化能力。

英文摘要

Quantum machine learning increasingly relies on pure-state representations, motivating generative models that sample directly in quantum representation space rather than perturbing classical inputs and re-encoding. We introduce Stochastic Schrödinger Diffusion Models (SSDMs), a score-based generative framework that defines diffusion, scores, and reverse-time sampling intrinsically on the complex projective manifold $\mathbb{CP}^{d-1}$ under the Fubini--Study metric. SSDMs combine a Riemannian Ornstein--Uhlenbeck forward diffusion with a stochastic Schrödinger realization, and learn reverse-time dynamics driven by the Riemannian score. Our central technical contribution is a local-time learning objective that exploits the local Euclidean OU limit of intrinsic manifold diffusions in Fubini--Study normal coordinates to obtain an analytic teacher score, bypassing the intractable transition densities that limit existing Riemannian score-based models. Across synthetic, physics-inspired (TFIM, XXZ), and quantum feature-state benchmarks up to $14$ qubits, SSDMs match target pure-state ensembles more closely than both ambient Euclidean and matched Riemannian score-based baselines, with improvements of orders of magnitude on MMD and observable statistics, and improve representation-level diagnostics for downstream quantum kernel methods.
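Working on $\mathbb{CP}^{d-1}$ means that state vectors differing only by a global phase are the same point, and distances are measured with the Fubini--Study metric. The sketch below computes the Fubini--Study distance in the common convention $d(\psi,\phi) = \arccos|\langle\psi|\phi\rangle|$ for normalized states. Some references include an extra factor of 2, and which convention the paper uses is not stated in the abstract, so treat this choice as an assumption.

```python
import math


def normalize(psi):
    """Rescale a complex amplitude vector to unit norm."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in psi))
    return [a / norm for a in psi]


def fubini_study_distance(psi, phi):
    """arccos of the overlap magnitude; invariant under global phase."""
    psi, phi = normalize(psi), normalize(phi)
    overlap = abs(sum(complex(a).conjugate() * b for a, b in zip(psi, phi)))
    return math.acos(min(1.0, overlap))  # clamp rounding error above 1


if __name__ == "__main__":
    zero = [1 + 0j, 0j]
    plus = [1 + 0j, 1 + 0j]               # normalized inside the function
    print(fubini_study_distance(zero, [1j, 0j]))  # same ray, different phase
    print(fubini_study_distance(zero, plus))
```

The first call compares a state with a phase-rotated copy of itself and gives zero, which is why a Euclidean diffusion on raw amplitude vectors would treat identical quantum states as different points.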

2605.09370 2026-06-16 cs.DC cs.AI 版本更新

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

从检测到恢复:504 GPU上LLM预训练的运营分析

Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Jinho Heo, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song

发表机构 * Lablup Inc.(Lablup公司)

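AI Summary Using 55 days of Prometheus time-series data and 73 days of operational logs from a 63-node NVIDIA B200 production cluster (504 GPUs), this report presents three quantitative analyses covering a multi-signal failure-detection strategy, a storage I/O bottleneck, and the effectiveness of automatic retry as a recovery mechanism.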

Comments 42 pages, 19 figures, 16 tables. Lablup Technical Report

详情
AI中文摘要

大规模AI训练现在基本上是一个分布式系统问题,硬件故障已成为常规操作条件而非罕见例外。然而,来自生产训练集群的公开运营证据仍然稀缺。本技术报告对63节点NVIDIA B200生产集群(504 GPU)进行了实证分析,使用了55天的Prometheus时间序列数据和73天的运营日志,涵盖了224次多节点训练会话。该集群在跨组织环境中运行,五个参与方(SKT、Upstage、Lablup、NVIDIA Korea和VAST Data)共享统一的监控管道。这种安排使得联合诊断一个60节点规模的存储I/O瓶颈成为可能,该瓶颈在2-4节点规模下不会出现,这是一个单一团队无法单独隔离的生产规模现象。基于为期数月的预训练活动,我们进行了三项定量分析,得出四个发现。首先,对751个Prometheus指标和10个XID识别的GPU故障进行统计分析,实现了10/10的检测率(2/10在XID之前),每天约0.84个误报。没有单一指标在所有故障类型中持续占主导地位,这促使采用多信号检测策略。其次,对沿GPU VRAM到NFS路径的523个检查点事件进行分析,将“带宽悖论”(200 Gbps RoCE的1.4-10.4%利用率)归因于128槽NFS RPC层的饱和。第三,多节点故障响应显示集中排除(63个节点中前3个占所有排除的>50%),自动重试链的成功率为33.3%(12个链,73次尝试),是手动恢复率12.5%的2.7倍;中位重试间隔为11分钟(IQR 10-11)。所有分析均基于提供会话级工作负载管理、GPU中心调度和统一可观测性的生产基础设施。

英文摘要

Large-scale AI training is fundamentally a distributed systems problem, where hardware failures are routine operating conditions rather than rare exceptions, yet public operational evidence from production training clusters remains limited. This report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The environment is cross-organizational: five parties (SKT, Upstage, Lablup, NVIDIA Korea, VAST Data) share a unified monitoring pipeline. This enabled joint diagnosis of a 60-node-scale storage I/O bottleneck absent in 2-4-node tests, a production-scale phenomenon no single team could isolate alone. We perform three quantitative analyses yielding four findings. First, over 751 Prometheus metrics and 10 XID-identified GPU failures, no single metric is consistently dominant across failure types, motivating multi-signal detection. Second, 523 checkpoint events trace the save/load path from GPU VRAM to the NFS server: restart loading reaches 21.5% of maximum read bandwidth (700 GB/s) and save bursts 16.0% of maximum write bandwidth (250 GB/s), with NFS/RPC queueing and transport-layer backlog rising together. Third, across 224 sessions over 73 days, node exclusions concentrate so the top 3 of 63 nodes account for over 50%. Fourth, auto-retry chain analysis shows a 33.3% success rate over 12 chains (73 attempts), 2.7x the 12.5% manual rate, with a median retry interval of 11 minutes (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.
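The percentages in the abstract above are easier to interpret as absolute numbers. The sketch below converts the quoted utilization figures into bandwidths and checks the retry-rate ratio. Reading the 33.3% success rate as a per-chain rate over the 12 chains is an assumption, and the helper names are hypothetical.

```python
def achieved_bandwidth(peak_gb_per_s: float, utilization: float) -> float:
    """Bandwidth actually reached, given a peak and a fractional utilization."""
    return peak_gb_per_s * utilization


def implied_successes(chains: int, success_rate: float) -> int:
    """Number of successful chains implied by a rounded success rate."""
    return round(chains * success_rate)


def rate_ratio(auto_rate: float, manual_rate: float) -> float:
    """How many times higher the auto-retry success rate is than the manual one."""
    return auto_rate / manual_rate


if __name__ == "__main__":
    print(achieved_bandwidth(700, 0.215))   # restart loading vs. 700 GB/s read peak
    print(achieved_bandwidth(250, 0.160))   # save bursts vs. 250 GB/s write peak
    print(implied_successes(12, 0.333))
    print(rate_ratio(0.333, 0.125))
```

Under these readings, restart loading reaches roughly 150 GB/s and save bursts roughly 40 GB/s, four of the twelve auto-retry chains succeeded, and the ratio of the two success rates rounds to the reported 2.7x.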

2605.06738 2026-06-16 cs.CR cs.AI 版本更新

Trust Without Trusting: A Recomputable Trust Protocol for Autonomous Agents

无需信任的信任:面向自主智能体的可重算信任协议

Lars Kersten Kroehl

发表机构 * MolTrust / CryptoKRI GmbH(MolTrust/加密KRI GmbH)

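AI Summary Proposes the Combined Evidence Protocol (CEP), a five-condition predicate that any party can recompute from anchored data to verify independently whether a boundary owner followed its own published rules, addressing trust verification when relying on someone else's boundary in an open agent world.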

Comments 18 pages, 5 figures. v2: substantial revision, reframed around recomputable accountability (Combined Evidence Protocol); adds figures, code listings, and deployment evidence. Supersedes v1 (From Specification to Deployment)

详情
AI中文摘要

自主AI代理已经在生产规模上进行交易——在单一市场上,有69,000个机器人、1.65亿笔交易、5000万美元的交易量——任何一方都可以在没有中心服务的情况下验证签名凭证。在覆盖信任大部分需求的开放代理世界中,没有通用边界,每一方自行选择与谁交易。边界仅出现在封闭空间划定之处——市场、平台或联盟制定内部规则。划定边界者拥有应用边界的权力,并可能闭门按其意愿应用。本文解决了由此产生的空白:当你依赖他人的边界时,如何检查他们是否应用了自己发布的规则——不轻信任何人的话,也不将检查交给新的可信方?我们的答案是组合证据协议(CEP):一个五条件谓词,任何一方都可以从锚定数据重新计算,将“边界所有者是否遵循其自身的准入规则”转化为任何人都可验证的事实,而非任何人相信的主张。保障乐观汇总的安全机制同样保障了这一点——正确性依赖于重算,因此度量属于每个人,预言机问题得以解决。其承载场景是一个由平等、互不信任的同行组成的联盟,在共享章程下,每方都能独立验证他们共同同意的规则正在被应用。CEP属于无需信任系统家族——乐观和零知识汇总、可验证机器学习、自主主权身份谓词。其底层基础设施已上线:自2026年3月起运行的一个W3C VC + DID信任层,锚定在Base L2上,延续arXiv:2605.06738并独立运行。

英文摘要

Autonomous AI agents already transact at production scale -- 69,000 bots, 165 million transactions, $50 million in volume on a single marketplace -- and any party can verify a signed credential without a central service. In an open agent world that covers most of what trust requires: there are no universal borders, and each party chooses for itself whom to deal with. Borders appear only where a closed space draws one -- a marketplace, a platform, or a consortium sets house rules. Whoever draws the border holds the authority to apply it, and may apply it as they choose, behind closed doors. This paper addresses the gap that opens there: when you rely on someone else's border, how do you check that they applied their own published rules -- taking no one's word for it, and handing the check to no new trusted party? Our answer is the Combined Evidence Protocol (CEP): a five-condition predicate any party recomputes from anchored data, turning "did the boundary-owner follow its own admission rules" into a fact anyone verifies rather than a claim anyone believes. The move that secures optimistic rollups secures this -- correctness rests on recomputation, so the measurement belongs to everyone and the oracle problem dissolves. Its load-bearing setting is a consortium of co-equal, mutually distrusting peers under a shared charter, each able to verify, independently, that the rules they jointly agreed are the rules being applied. CEP belongs to the family of trustless systems -- optimistic and zero-knowledge rollups, verifiable ML, self-sovereign-identity predicates. The infrastructure beneath it is live: a W3C VC + DID trust layer running since March 2026, anchored on Base L2, continuing arXiv:2605.06738 and standing on its own.

2605.05855 2026-06-16 cs.IR cs.CL 版本更新

Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling

桥接被动与主动:通过主动表达建模增强对话启动推荐

Yiqing Wu, Haoming Li, Guanyu Jiang, Jiahao Liang, Yongchun Zhu, Jingwu Chen, Feng Zhang

发表机构 * Bytedance Beijing China(字节跳动北京中国)

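AI Summary To address passive recommendation falling into an echo chamber in LLM-driven conversational search, this paper proposes the PA-Bridge framework. An adversarial distribution aligner bridges the distribution gap between passively recommended starters and active expressions, and a semantic discretizer enables popularity debiasing. Online experiments show significant gains in Feature Penetration Rate and User Active Days.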

Comments Accepted by SIGIR 2026

详情
AI中文摘要

大型语言模型(LLM)驱动的对话搜索正在将信息检索从被动关键词匹配转变为主动、开放式的对话。在此背景下,对话启动器被广泛部署,以提供个性化查询推荐,帮助用户发起对话。传统上,推荐这些启动器依赖于一个封闭的“曝光-点击”循环。然而,这种反馈循环机制使系统陷入回声室,加上数据稀疏性,无法捕捉由开放世界塑造的对话搜索意图的动态特性。结果,系统偏向于流行但通用的建议。在这项工作中,我们揭示了一个未被利用的范式转变,以打破这种有害的反馈循环:通过用户的主动表达来利用用户的“自由意志”。与传统推荐不同,对话搜索使用户能够通过手动输入查询完全绕过菜单。主动查询中的开放世界意图是打破这一循环的关键。然而,整合它们并非易事:(1)主动查询与制定的启动器之间存在固有的分布偏移。(2)此外,开放文本的“非ID化”特性使得传统的基于项目的流行度统计在大规模工业流式训练中无效。为此,我们提出了被动-主动桥接(PA-Bridge),一种新颖的框架,采用对抗分布对齐器来桥接被动推荐的启动器与主动表达之间的分布差距。此外,我们引入了一个语义离散化器,以实现流行度去偏算法的部署。在我们平台上的在线A/B测试表明,PA-Bridge显著提升了特征渗透率0.54%和用户活跃天数0.04%。

英文摘要

Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet this feedback loop traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic suggestions. In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user "free will" through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters; (2) the "non-ID-able" nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days by 0.04%.

2605.00873 2026-06-16 cs.MM cs.AI cs.CV 版本更新

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

BRITE:面向不可信场景的可靠可解释文本到视频评估基准

Advait Tilak, Jiwon Choi, Nazifa Mouli, Wei Le

发表机构 * arXiv.org cs.MM(计算机视觉)

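AI Summary Proposes the BRITE benchmark, which uses a human-in-the-loop protocol to unify implausible prompts, fine-grained audio-visual consistency assessment, and interpretable QA-based evaluation, and reveals marked weaknesses of current models in object-action binding and audio-visual synchronization.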

详情
AI中文摘要

逼真文本到视频(T2V)生成的快速发展带来了对最新评估方法的迫切需求。现有基准大多忽略了不可信场景,并且不衡量音视频对齐。我们引入BRITE,这是第一个将(1)不可信提示、(2)音视频一致性的细粒度评估以及(3)基于QA的可解释评估统一为全面T2V基准的框架。与完全自动化的基于多模态LLM的流水线(容易产生幻觉和提示歧义)不同,BRITE通过严格的人工参与协议保证基准创建的可靠性。评估五个最先进模型(Sora 2、Veo 3.1、Runway Gen4.5、Pixverse V5.5和Qwen3Max),我们揭示了一个关键性能差距:虽然模型在静态对象组合方面表现出色,但在对象-动作绑定和音视频同步方面表现出显著退化。我们的框架为社区提供了一个可靠、可解释的基准和评估框架,能够检测和定位下一代T2V模型的局限性,特别是对于流形外提示。

英文摘要

The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. BRITE offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts.

2605.00074 2026-06-16 q-bio.GN cs.AI 版本更新

CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift

CRC-Screen:分类偏移下认证的DNA合成危害筛查

Najmul Hasan

发表机构 * Najmul Hasan(纳杰姆·哈桑)

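AI Summary Baseline screening of DNA-synthesis orders collapses to a 100% false-flag rate when hazardous sequences undergo taxonomic shift. This paper fuses three signals (k-mer Jaccard similarity, the trimmed mean of a five-LLM judge panel, and cosine similarity to clustered embedding centroids) with a monotone logistic aggregator calibrated by Conformal Risk Control, achieving zero misses and a low false-flag rate while keeping the false-negative rate under a certified bound.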

Comments Accepted at the 6th Muslims in ML (MusIML) Workshop at ICML 2026

详情
AI中文摘要

DNA合成供应商通过将请求序列与精选的危害列表进行比对来筛查传入订单。我们证明,当危害序列来自参考集中缺失的分类家族时,这种基线方法会崩溃为100%的误报率:在共形风险控制的认证漏检率约束下,低区分度信号迫使阈值低于整个测试良性样本的质量。我们组合了从合成订单的公共注释中导出的三个信号:与已知毒素的$k$-mer Jaccard相似度、五个LLM评委小组的修剪均值分数,以及与聚类嵌入质心的余弦相似度。在单调逻辑聚合器下融合并由共形风险控制校准,得到的筛查器认证$\mathbb{E}[\mathrm{FNR}] \le \alpha + \mathrm{TV}$,其中加性项是家族留出下校准到测试的分布偏移(跨折认证上限为24-49%)。在UniProt KW-0800审核毒素上,以$\alpha=0.05$进行十次留一分类家族交叉验证,校准后的筛查器在每一折上实现0%的经验测试漏检率,并在十折中的九折上实现0%的测试误报率。该界限的有限样本松弛量$1/(n_{\mathrm{cal}}+1)$将我们200个危害子样本的可认证漏检率上限限制在1.77%;达到采购级$\alpha=10^{-3}$需要$18\times$更大的校准集,而完整的UniProt KW-0800审核语料库足够大以提供此规模。可认证DNA合成筛查的约束条件是校准数据,而非算法。代码:https://github.com/najmulhasan-code/crc-screen

英文摘要

DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le \alpha + \mathrm{TV}$, where the additive term is the calibration-to-test distribution shift under family holdout (a certified ceiling of 24-49% across folds). Across ten leave-one-taxonomic-family-out folds at $\alpha=0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% empirical test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $\alpha=10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan-code/crc-screen
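Two of the three signals named in the abstract above are simple enough to sketch directly. The code below shows a set-based $k$-mer Jaccard similarity and a trimmed mean over five judge scores. The choice of $k = 3$, trimming one score from each end, and the toy inputs are illustrative assumptions. The paper's actual parameters, tokenization, and the monotone aggregator are not reproduced here.

```python
def kmers(seq: str, k: int) -> set:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity |A & B| / |A | B| between the two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0                      # both too short to contain any k-mer
    return len(ka & kb) / len(ka | kb)


def trimmed_mean(scores, trim: int = 1) -> float:
    """Mean after dropping the `trim` lowest and `trim` highest scores."""
    if len(scores) <= 2 * trim:
        raise ValueError("not enough scores to trim")
    kept = sorted(scores)[trim:len(scores) - trim]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    print(kmer_jaccard("ACGT", "ACGA"))
    print(trimmed_mean([0.1, 0.9, 0.5, 0.4, 0.6]))   # five hypothetical judge scores
```

Trimming makes the panel score robust to a single outlying judge, which matters when the fused score is later thresholded under a certified miss-rate constraint.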

2604.26963 2026-06-16 cs.OS cs.DC cs.LG cs.MA 版本更新

MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

MARS:面向异构智能体系统的高效自适应协同调度

Yifei Wang, Hancheng Ye, Yechen Xu, Cong Guo, Chiyue Wei, Qinsi Wang, Dongting Li, Tingjun Chen, Hai "Helen" Li, Danyang Zhuo, Yiran Chen

发表机构 * Duke University(杜克大学)

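AI Summary Proposes MARS, a co-scheduling system that globally coordinates GPU inference and CPU tool execution through a unified information stream, decouples admission from execution to prevent resource oversubscription, and uses an agent-centric scheduler to minimize end-to-end latency. Experiments show up to a 5.94x latency reduction.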

Comments 14 pages, 13 figures. Preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为自主智能体的执行核心,而非独立的文本生成器。智能体工作负载引发了时间上的转变,从单轮推理转向多轮LLM-工具循环,以及空间上的转变,从聊天规模的仅GPU执行转向仓库规模的GPU-CPU协同执行。因此,协调智能体执行的异构资源需求已成为一个关键的系统挑战。我们设计并实现了MARS,一个高效且自适应的协同调度系统,它在GPU-CPU耦合资源压力下全局协调异构智能体工作负载。通过统一信息流建立对GPU推理和CPU工具执行的全局可见性,MARS中的外部控制平面将准入与执行解耦,以防止异构资源过载。内部智能体中心调度器通过优先处理延迟敏感的延续,并仅在热恢复带来延迟收益时自适应保留KV缓存状态,进一步最小化端到端关键路径。我们的评估表明,MARS将端到端延迟降低高达5.94倍,同时保持接近最大的系统吞吐量。我们进一步将MARS作为OpenHands编码智能体框架的服务后端,通过加速端到端任务完成时间高达1.87倍,展示了其在现实世界中的有效性。我们的源代码公开于 https://github.com/Afterglow231/MARS_preview 。

英文摘要

Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code is publicly available at https://github.com/Afterglow231/MARS_preview .

2604.25371 2026-06-16 q-bio.QM cs.CV 版本更新

PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

PhyloSDF: 基于系统发育条件的残差流匹配神经生成3D颅骨形态

Kaikwan Lau, Gary P. T. Choi

发表机构 * Department of Mathematics(数学系)

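AI Summary Proposes PhyloSDF, a model that combines a phylogenetic consistency loss with Residual Conditional Flow Matching to generate 3D skull morphology consistent with phylogenetic relationships from only a few specimens, outperforming diffusion models and standard flow matching on a Darwin's finch dataset.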

详情
AI中文摘要

生成新颖、生物学上可信的三维形态结构是计算进化生物学中的一个基本挑战,其难点在于极端的数据稀缺性以及对生成形状必须尊重物种间系统发育关系的要求。在这项工作中,我们提出了PhyloSDF,一个基于系统发育条件的神经生成模型,用于3D生物形态,它整合了两项创新:(1) 一个由新型系统发育一致性损失正则化的DeepSDF自动解码器,该损失使潜在空间结构与进化距离相关(Pearson r=0.993);(2) 一个残差条件流匹配(Residual CFM)架构,将生成分解为解析的物种质心查找和学习到的残差预测,从而能够从每个物种仅约4个标本进行生成。我们在达尔文雀及其近缘物种的24个物种的100个微CT扫描颅骨上评估了PhyloSDF。该模型生成的网格在代码水平上实现了真实种内变异的88-129%,所有180个生成网格均被验证为非记忆。残差CFM在保真度(Chamfer距离0.00181 vs. 0.00190)和形态测量Fréchet距离(10,641 vs. 13,322)上均超越了去噪扩散(在此尺度下完全失败)、标准流匹配(模式坍缩至3-6%变异)以及高斯混合基线。跨18个物种的留一物种实验展示了系统发育外推能力,平滑的潜在插值产生了生物学上可信的祖先颅骨重建。

英文摘要

Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.
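The fidelity comparison in the abstract above is reported as a Chamfer Distance between point sets. Definitions vary across papers (squared or unsquared distances, sums or means), and the abstract does not say which variant is used. The sketch below assumes the symmetric mean of squared nearest-neighbour distances, computed by brute force on small point lists.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both ways."""

    def mean_nearest_sq(src, dst):
        total = 0.0
        for s in src:
            # Squared Euclidean distance from s to its closest point in dst.
            total += min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
        return total / len(src)

    return mean_nearest_sq(a, b) + mean_nearest_sq(b, a)


if __name__ == "__main__":
    mesh_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    mesh_b = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]
    print(chamfer_distance(mesh_a, mesh_b))
```

The brute-force search costs time proportional to the product of the two point counts, so real mesh evaluation would normally sample points from the surfaces and use a spatial index. The quantity being measured is the same.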

2604.23952 2026-06-16 stat.ML cs.LG nlin.CD 版本更新

Conditional Score-Based Modeling of Effective Langevin Dynamics

基于条件分数的有效朗之万动力学建模

Ludovico T. Giorgini

发表机构 * Department of Mathematics, Massachusetts Institute of Technology(数学系,麻省理工学院)

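AI Summary Proposes a calibration method for stochastic reduced-order models based on the conditional score of the finite-time transition density. Drift and diffusion coefficients are inferred from data by least-squares fitting, avoiding trajectory differentiation and state-space partitioning.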

详情
AI中文摘要

随机降阶模型广泛用于表示复杂系统的有效动力学,但根据数据估计其漂移和扩散系数仍然具有挑战性。标准方法通常依赖于短时间轨迹增量、状态空间划分或候选模型的重复模拟,这些方法对于高维系统、粗时间采样或非均匀采样数据变得不可靠或计算成本高昂。我们引入了一种数据驱动的校准方法,该方法基于随机降阶模型系数与有限时间转移密度的条件分数(定义为转移密度对初始状态的对数梯度)之间的新关系。由此得到的恒等式将滞后相关函数的导数表示为观测到的滞后对上的平稳期望,其中涉及该条件分数和未知模型系数。这种公式允许直接从有限滞后统计量约束漂移和扩散结构,而无需在校准过程中对轨迹进行微分、划分状态空间或重复积分候选降阶模型,从而产生一个关于平稳滞后对的最小二乘拟合问题。我们在三个复杂度递增的系统上验证了该方法:一个解析可解的Cox-Ingersoll-Ross扩散过程、一个具有仿射乘性噪声的二维非平衡扩散过程,以及一个周期性的软自旋随机朗道-利夫希茨链。在这些测试中,推断出的模型在再现有限滞后动力学相关性的同时保持了不变统计量。该框架为从数据中学习再现规定统计和动力学性质的随机降阶模型提供了一种可扩展的途径。

English abstract

Stochastic reduced-order models are widely used to represent the effective dynamics of complex systems, but estimating their drift and diffusion coefficients from data remains challenging. Standard approaches often rely on short-time trajectory increments, state-space partitioning, or repeated simulation of candidate models, which become unreliable or computationally expensive for high-dimensional systems, coarse temporal sampling, or unevenly sampled data. We introduce a data-driven calibration method based on a novel relationship between the coefficients of a stochastic reduced model and the conditional score of the finite-time transition density, defined as the gradient of the logarithm of the transition density with respect to the initial state. The resulting identity expresses derivatives of lagged correlation functions as stationary expectations over observed lagged pairs involving this conditional score and the unknown model coefficients. This formulation allows the drift and diffusion structure to be constrained directly from finite-lag statistics, without differentiating trajectories, partitioning state space, or repeatedly integrating candidate reduced models during calibration, yielding a least-squares fitting problem over stationary lagged pairs. We validate the approach on three systems of increasing complexity: an analytically tractable Cox-Ingersoll-Ross diffusion, a two-dimensional nonequilibrium diffusion with affine multiplicative noise, and a periodic soft-spin stochastic Landau-Lifshitz chain. Across these tests, the inferred models preserve the invariant statistics while reproducing finite-lag dynamical correlations. The framework provides a scalable route for learning stochastic reduced-order models from data that reproduce prescribed statistical and dynamical properties.
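
The conditional score defined above, the gradient of the log transition density with respect to the initial state, can be made concrete on a process whose transition density is known in closed form. The sketch below uses a one-dimensional Ornstein-Uhlenbeck process. That choice is an assumption made purely for illustration: it is not one of the paper's test systems, and the function names are hypothetical.

```python
import math


def ou_transition_moments(x_0, t, theta, sigma):
    """Mean and variance of x_t given x_0 for dX = -theta * X dt + sigma dW."""
    decay = math.exp(-theta * t)
    var = sigma ** 2 * (1.0 - math.exp(-2.0 * theta * t)) / (2.0 * theta)
    return x_0 * decay, var


def ou_log_transition(x_t, x_0, t, theta=1.0, sigma=1.0):
    """Log of the Gaussian transition density p_t(x_t | x_0)."""
    mean, var = ou_transition_moments(x_0, t, theta, sigma)
    return -0.5 * math.log(2.0 * math.pi * var) - (x_t - mean) ** 2 / (2.0 * var)


def ou_conditional_score(x_t, x_0, t, theta=1.0, sigma=1.0):
    """Closed-form derivative of log p_t(x_t | x_0) with respect to x_0."""
    mean, var = ou_transition_moments(x_0, t, theta, sigma)
    return (x_t - mean) * math.exp(-theta * t) / var


def finite_difference_score(x_t, x_0, t, h=1e-5, theta=1.0, sigma=1.0):
    """Central-difference check of the same derivative."""
    up = ou_log_transition(x_t, x_0 + h, t, theta, sigma)
    down = ou_log_transition(x_t, x_0 - h, t, theta, sigma)
    return (up - down) / (2.0 * h)


analytic = ou_conditional_score(0.7, 0.3, 0.5)
numeric = finite_difference_score(0.7, 0.3, 0.5)
```

Here `analytic` and `numeric` agree to within floating-point error. For systems without a closed-form transition density, this score has to be estimated from data, which is the setting the abstract addresses.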

2604.23628 2026-06-16 cs.DS cs.LG (updated version)

Characterizing Admissible Objective Functions for Hierarchical Clustering

刻画层次聚类的可容许目标函数

Ryuki Tsukuba, Kazutoshi Ando

Affiliations: Faculty of Engineering, Shizuoka University; Graduate School of Integrated Science and Technology, Shizuoka University

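AI summary: Studies admissible objective functions for hierarchical clustering. For sum-type objective functions based on aggregate similarity, it fully characterizes admissibility when the scaling function is a symmetric polynomial of degree at most 2 and gives sufficient conditions for degree 3. It also introduces max-type objective functions and characterizes their admissibility for arbitrary symmetric scaling functions.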

Comments 20 pages, 3 figures. Revised version. The presentation has been substantially revised; new characterizations of max-type objective functions and related proofs have been clarified. Submitted to Discrete Applied Mathematics

AI Chinese abstract

层次聚类是数据分析中的基本任务,但经典方法长期缺乏有原则的目标函数。Dasgupta [STOC~2016] 通过提出一个动机良好的聚类树目标函数,朝着填补这一空白迈出了重要一步。Cohen-Addad 等人 [J. ACM 2019] 随后引入了可容许性的概念:如果一个目标函数在输入相似度矩阵允许生成树时,其极小化器恰好是生成该矩阵的树,则该目标函数是可容许的。他们还给出了基于聚合簇间相似度的一类目标函数中可容许性的充要条件。我们将这类函数称为和型目标函数。然而,除了 Dasgupta 的原始目标函数外,该类中没有给出显式的可容许目标函数。本文从两个方向研究层次聚类的可容许目标函数。对于和型目标函数,当缩放函数是次数不超过2的对称多项式时,我们给出了完整的刻画,并推导了次数为3的多项式的充分条件。我们还证明,递归最稀疏割算法对我们刻画所覆盖的可容许目标函数实现了 O($\phi$) 的近似比,其中 $\phi$ 是最稀疏割子程序的近似因子。然后,我们引入了最大型目标函数,其中簇间相互作用通过最大簇间相似度而非聚合相似度来度量。对于该类,我们刻画了哪些目标函数对于任意对称缩放函数是可容许的,并在缩放函数是次数不超过2的对称多项式时给出了完整刻画。

English abstract

Hierarchical clustering is a fundamental task in data analysis, but classical methods have long lacked a principled objective function. Dasgupta [STOC 2016] took an important step toward addressing this gap by proposing a well-motivated objective function for cluster trees. Cohen-Addad et al. [J. ACM 2019] subsequently introduced the notion of admissibility: an objective function is admissible if, whenever the input similarity matrix admits generating trees, its minimizers are precisely those generating trees. They also gave a necessary and sufficient condition for admissibility within a family of objective functions based on aggregate intercluster similarity. We refer to this family as sum-type objective functions. However, apart from Dasgupta's original objective function, no explicit admissible objective functions in this family were provided. In this paper, we study admissible objective functions for hierarchical clustering in two directions. For sum-type objective functions, we give a complete characterization when the scaling function is a symmetric polynomial of degree at most two, and we derive sufficient conditions for degree-three polynomials. We also show that the recursive sparsest cut algorithm achieves an $O(\phi)$-approximation ratio for the admissible objective functions covered by our characterization, where $\phi$ is the approximation factor of the sparsest cut subroutine. We then introduce max-type objective functions, where cluster interaction is measured by maximum, rather than aggregate, intercluster similarity. For this class, we characterize which objective functions are admissible for arbitrary symmetric scaling functions and give a complete characterization when the scaling function is a symmetric polynomial of degree at most two.

2511.09465 2026-06-16 stat.ML cs.LG (updated version)

Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions

分支流:带有分裂和删除的离散、连续和流形流匹配

Lukas Billera, Hedwig Nora Nordlinder, Jack Collier Ryder, Anton Oresten, Aron Stålmarck, Theodor Mosetti Björk, Ben Murrell

Affiliations: Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet

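AI summary: Proposes the Branching Flows framework, which controls the number of sequence elements through stochastic branching and death processes, suits variable-length data generation, and is demonstrated on small-molecule, antibody-sequence, and protein-backbone generation.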

Comments 39 pages, 16 figures

AI Chinese abstract

扩散和流匹配方法在状态空间连续的领域(如图像生成或蛋白质折叠与设计)以及离散领域(如扩散大语言模型)中显示出前景。当状态中的元素数量预先固定时(如图像),它们自然适用,但当大语言模型响应的长度或蛋白质链中的氨基酸数量未知时,则需要临时解决方案。这里我们提出分支流,一种生成建模框架,与扩散和流匹配方法一样,将简单分布传输到数据分布。但在分支流中,状态中的元素在二叉树森林上演化,以模型学习的速率随机分支和死亡。这使得模型在生成过程中能够控制序列中的元素数量。我们还表明,分支流可以与离散集、连续欧几里得空间、光滑流形以及混合这些组件的“多模态”乘积空间上的任何流匹配基础过程组合。我们在三个领域进行了演示:小分子生成(多模态)、抗体序列生成(离散)和蛋白质骨架生成(多模态),并表明分支流是一个具有稳定学习目标的能力分布学习器,并且它实现了新的能力。

English abstract

Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding and design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model or the number of amino acids in a protein chain is not known a priori. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and "multimodal" product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.
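
To illustrate how stochastic branching and death can change the number of elements over a generation interval, here is a minimal sketch of a linear birth-death process simulated with the Gillespie algorithm. It is not the Branching Flows method: the rates are fixed constants chosen by hand rather than learned, the elements carry no state, and the function name is hypothetical.

```python
import random


def simulate_element_count(n_start, branch_rate, death_rate, t_end=1.0, rng=None):
    """Number of elements at time t_end when each element independently
    branches (count + 1) or dies (count - 1) at constant per-element rates."""
    if rng is None:
        rng = random.Random(0)
    per_element = branch_rate + death_rate
    count, t = n_start, 0.0
    while count > 0 and per_element > 0.0:
        # Waiting time until the next event among `count` elements.
        t += rng.expovariate(count * per_element)
        if t >= t_end:
            break
        # Decide whether the event is a branching or a death.
        if rng.random() < branch_rate / per_element:
            count += 1
        else:
            count -= 1
    return count


unchanged = simulate_element_count(5, 0.0, 0.0)
grown = simulate_element_count(3, 2.0, 0.0, rng=random.Random(1))
shrunk = simulate_element_count(3, 0.0, 2.0, rng=random.Random(1))
```

With both rates at zero the count stays fixed, which corresponds to the fixed-length setting of standard flow matching. Nonzero rates let the final length differ from the initial one, which is the degree of freedom the abstract describes the model as learning to control.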

2604.18827 2026-06-16 q-bio.NC cs.AI (updated version)

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

OmniMouse: 基于1500亿神经令牌的多模态多任务脑模型的可扩展性

Konstantin F. Willeke, Polina Turishcheva, Alex Gilbert, Goirik Chakrabarty, Hasan A. Bedel, Paul G. Fahey, Yongrong Qiu, Marissa A. Weis, Michaela Vystrčilová, Taliah Muhammad, Lydia Ntanavara, Rachel E. Froebe, Kayla Ponder, Zheng Huan Tan, Emin Orhan, Erick Cobos, Sophia Sanborn, Katrin Franke, Fabian H. Sinz, Alexander S. Ecker, Andreas S. Tolias

Affiliations: Department of Ophthalmology, Byers Eye Institute, Stanford University; Stanford Bio-X, Stanford University; Wu Tsai Neurosciences Institute, Stanford University; Institute of Computer Science and Campus Institute Data Science, University of Göttingen

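AI summary: Using recordings of 3.1 million neurons from mouse visual cortex, trains the multi-modal, multi-task model OmniMouse, which achieves state-of-the-art results on neural prediction, behavioral decoding, and related tasks. Performance improves reliably with data volume while gains from model size saturate, the reverse of the standard scaling pattern in AI.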

Comments Published at ICLR 2026

AI Chinese abstract

扩展数据和人工神经网络已经改变了人工智能,推动了语言和视觉领域的突破。类似的原则是否适用于脑活动建模仍不清楚。这里我们利用了一个数据集,包含来自73只小鼠视觉皮层的310万个神经元,跨越323个会话,总计超过1500亿个神经令牌,记录于自然电影、图像、参数化刺激和行为期间。我们训练了多模态、多任务模型,在测试时灵活支持三种模式:神经预测、行为解码、神经预报,或三者的任意组合。OmniMouse实现了最先进的性能,在几乎所有评估模式下优于专门的基线。我们发现性能随数据量可靠地提升,但增加模型大小的收益饱和。这颠倒了标准的人工智能扩展故事:在语言和计算机视觉中,大规模数据集使参数扩展成为进步的主要驱动力,而在脑建模中——即使是在小鼠视觉皮层这个相对简单的系统中——尽管有大量的记录,模型仍然受限于数据。系统性的扩展观察提出了神经建模中相变的可能性,更大和更丰富的数据集可能解锁定性的新能力,类似于大型语言模型中出现的涌现特性。代码见 https://github.com/enigma-brain/omnimouse。

English abstract

Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling -- even in the mouse visual cortex, a relatively simple system -- models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma-brain/omnimouse.

2411.05824 2026-06-16 eess.IV cs.CV cs.LG (updated version)

Navigating Distribution Shifts in Medical Image Analysis: A Survey

医学图像分析中的分布偏移导航:综述

Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Frans Coenen, Amir Hussain, Kaizhu Huang

Affiliations: Life Simulation Research Center, Beijing Academy of Artificial Intelligence; Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology; Department of Intelligent Science, School of Advanced Technology, Xi’an Jiaotong-Liverpool University; Computer Science, School of Computer Science and Informatics, University of Liverpool; SDAIA-KFUPM Joint Research Centre for Artificial Intelligence, King Fahd University of Petroleum and Minerals; Nuffield Department of Primary Care Health Sciences, University of Oxford

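AI summary: Systematically surveys deep learning methods for handling distribution shifts in medical image analysis, grouping them by clinical constraints into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, and identifies a shift in methods from explicit alignment toward uncertainty modeling.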

AI Chinese abstract

医学图像分析(MedIA)已成为现代医疗保健中不可或缺的一部分,增强了临床诊断和个性化治疗。尽管深度学习(DL)技术取得了显著进展,但其实际部署面临分布偏移带来的挑战,即基于特定数据集训练的模型在不同医院或患者群体的数据上表现不佳。为解决这一问题,研究人员积极开发策略以提高DL模型的适应性,使其能够在陌生环境中有效使用。本文系统综述了将DL技术应用于受分布偏移影响的MedIA系统的方法。我们并非按技术特征组织现有方法,而是明确将现实临床约束(如有限的数据可访问性、严格的隐私要求和异构协作协议)与能够解决这些约束的技术范式联系起来。通过建立操作约束与方法论演变之间的这种联系,我们将现有工作分类为联合训练、联邦学习、微调和域泛化,每种方法对应特定的医疗场景。除了这种分类,我们的实证分析表明,随着这些范式中域信息逐渐变得不可访问,性能改进变得越来越受限,并进一步揭示了方法论焦点从显式分布对齐向不确定性感知建模的逐渐转变,最终指向在实际MedIA中需要更多可部署性感知的设计。

English abstract

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on data from different hospitals or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints (such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols) with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained. It further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.