arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.27784 2026-05-28 cs.AI

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

诊断LLM代理中实时策略内指令冲突的见证解析轮廓

Lu Yan, Xuan Chen, Xiangyu Zhang

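Affiliations: Purdue University

AI summary: Proposes the WIRE pipeline, which extracts rules, encodes them as PyRule clauses, detects conflicts, and generates witness instances to diagnose conflicts between rule pairs within a single LLM-agent prompt policy; 64.6% of the judged trials violate at least one source rule.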

AI Chinese abstract

LLM代理受长期自然语言提示策略的约束,但个别合理的常设规则可能以未经检查的方式相互作用。我们研究实时策略内规则冲突诊断:在单个提示策略内找到可以共同治理现实状态的规则对,并测量模型在响应或工具动作中如何解决该压力。我们引入WIRE,一个见证策略内规则评估管道。WIRE提取基于源的规则,将其编码为PyRule子句,使用可满足性检查保留同表面硬碰撞候选,将这些候选实现为具体的共同治理见证,并根据原始源规则文本判断模型输出。在六个公开提示策略中,WIRE提取276条源规则和560个原子子句,分类30,944个策略内子句对比较,保留170个编码的硬碰撞候选源规则对,并将其实现为1,402个具体见证。在仅策略评估中,这些见证产生13,335次后生成试验,其中两条源规则共同治理且两个合规标签均可判断。仅35.4%处于联合合规状态;64.6%违反至少一条治理源规则。这些轮廓是WIRE选择候选的条件诊断,而非部署频率或因果超额失败估计,但它们揭示了不同的策略、模型和工具动作解析模式。

English abstract

LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.
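Note: the satisfiability step described in the abstract can be illustrated with a toy brute-force search. This is a minimal sketch under stated assumptions, not WIRE or PyRule: the rule names, state variables, and the table of conflicting actions are all hypothetical, and a real encoding would use a proper solver rather than enumeration. It looks for a state in which both rules apply and demand incompatible actions; such a state plays the role of a co-governance witness.

```python
from itertools import product

# Hypothetical toy rules: (name, condition over a state dict, required action).
RULES = [
    ("always_cite", lambda s: s["user_asks_fact"], "include_citation"),
    ("no_links_in_sms", lambda s: s["channel_is_sms"], "omit_citation"),
    ("greet_new_user", lambda s: s["new_user"], "include_greeting"),
]
VARIABLES = ["user_asks_fact", "channel_is_sms", "new_user"]

# Hypothetical table of action pairs that cannot both be satisfied.
CONFLICTING = {frozenset(("include_citation", "omit_citation"))}


def find_witness(rule_a, rule_b):
    """Return a state where both rules govern and their actions collide, else None."""
    if frozenset((rule_a[2], rule_b[2])) not in CONFLICTING:
        return None
    for values in product([False, True], repeat=len(VARIABLES)):
        state = dict(zip(VARIABLES, values))
        if rule_a[1](state) and rule_b[1](state):
            return state
    return None


witness = find_witness(RULES[0], RULES[1])
no_witness = find_witness(RULES[0], RULES[2])
```

Here `witness` is a concrete state for the colliding pair, while `no_witness` is `None` because the two actions are compatible.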

2605.27782 2026-05-28 cs.LG cs.CR

Revisiting ML Training under Fully Homomorphic Encryption: Convergence Guarantees, Differential Privacy, and Efficient Algorithms

重新审视全同态加密下的机器学习训练:收敛保证、差分隐私与高效算法

Yvonne Zhou, Mingyu Liang, Ivan Brugere, Danial Dervovic, Yue Guo, Antigoni Polychroniadou, Min Wu, Dana Dachman-Soled

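Affiliations: University of Maryland, College Park, MD; AlgoCRYPT CoE

AI summary: Presents the first theoretical convergence analysis of machine learning training under fully homomorphic encryption, combined with a differentially private training algorithm suited to encrypted computation. Approximate gradient descent with polynomial approximations of the activation and loss functions is shown to converge, and a differential privacy mechanism without per-sample gradient clipping improves computational efficiency.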

AI Chinese abstract

我们首次对全同态加密(FHE)下的机器学习训练进行了理论收敛性分析,并结合了一种针对加密计算量身定制的差分隐私(DP)训练算法。我们的方法在实现可比效用的同时,提高了标准差分隐私梯度下降(DP-GD)的计算效率。具体而言,我们证明了使用激活函数和损失函数的多项式近似(这是FHE兼容性所必需的)的近似梯度下降的收敛性。为了保护下游任务中的隐私,我们集成了差分隐私,而无需依赖昂贵的逐样本梯度裁剪,从而实现了可扩展的加密学习。我们还提供了数据无关的超参数选择和具有理论依据的多项式近似策略,这些策略可能具有独立的价值。这些贡献共同推进了在敏感数据上进行高效、私密且安全的机器学习的可行性。

English abstract

We present the first theoretical convergence analysis of machine learning training under fully homomorphic encryption (FHE), combined with a differentially private (DP) training algorithm tailored to encrypted computation. Our approach improves computational efficiency over standard differentially private gradient descent (DP-GD) while achieving comparable utility. In particular, we prove convergence of approximate gradient descent using polynomial approximations of activation and loss functions, which are required for FHE compatibility. To preserve privacy in downstream tasks, we integrate differential privacy without relying on costly per-sample gradient clipping, enabling scalable encrypted learning. We also provide data-independent hyperparameter selection and theoretically grounded strategies for polynomial approximation which can be of independent interest. Together, these contributions advance the feasibility of efficient, private, and secure machine learning on sensitive data.
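Note: the idea of running gradient descent with a polynomial stand-in for the activation can be sketched in plain Python, with no encryption involved. The sketch assumes a one-feature logistic model, a degree-3 Taylor expansion of the sigmoid around 0, and a small hand-made dataset with overlapping labels; none of these choices come from the paper, and the paper's own approximation strategy and privacy mechanism are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def poly_sigmoid(x):
    # Degree-3 Taylor expansion of the sigmoid around 0 (illustrative choice only).
    return 0.5 + x / 4.0 - x ** 3 / 48.0


def train(xs, ys, activation, lr=0.5, steps=500):
    """Full-batch gradient descent on a one-feature logistic model."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = activation(w * x + b) - y  # only additions and multiplications
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data with overlapping labels, so the optimum stays near 0,
# where the low-degree polynomial is still a reasonable approximation.
xs = [-1.0, -0.5, 0.5, 1.0, -0.5, 0.5]
ys = [0, 0, 1, 1, 1, 0]

w_exact, b_exact = train(xs, ys, sigmoid)
w_poly, b_poly = train(xs, ys, poly_sigmoid)
```

On this data the two runs end at nearby weights. With cleanly separable data the weights would grow until the cubic term dominates and the polynomial run diverges, which hints at why a convergence analysis for the approximate setting is needed.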

2605.27774 2026-05-28 cs.LG

Fine-Tuning Dynamics of In-Context Factual Recall in Transformers

Transformer中上下文事实回忆的微调动力学

Ruomin Huang, Eshaan Nichani, Jason D. Lee, Rong Ge

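Affiliations: Duke University; Princeton University; UC Berkeley

AI summary: Studies how transformers use stored parametric factual knowledge during in-context learning. The paper introduces the in-context factual recall task, analyzes the fine-tuning dynamics of a one-layer transformer, and proves that the model converges to a particular pairwise attention pattern from very few samples.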

AI Chinese abstract

上下文学习(即基于提示中给出的示例执行任务)是大型语言模型中涌现的重要能力,在理论和实践中都受到了广泛关注。现有的理论工作通常关注学习仅使用提示中信息的场景。然而,许多上下文学习的实际实例要求模型检索存储在模型参数中的事实知识,而上下文则用于识别哪些知识是相关的。在这项工作中,我们研究了上下文学习如何利用事实知识回忆。我们通过引入\emph{上下文事实回忆(IC-recall)}任务来形式化这种行为,其中Transformer接收从隐藏关系生成的(主题,答案)对上下文以及一个查询主题,必须同时推断该隐藏关系并检索相应的答案。事实知识通过Transformer访问一个简单预构建的MLP联想记忆(存储(主题,关系,答案)三元组)来建模。我们分析了单层Transformer在IC-recall数据上的监督微调动力学,并证明模型通过收敛到特定的成对注意力模式成功执行IC-recall。这个微调阶段只需要极少的样本,仅与存储的知识三元组数量成多对数关系。实验验证了我们的理论预测,并表明即使MLP层是预训练而非构建的,成对注意力模式也会出现。

English abstract

In-context learning, performing tasks based on examples given in the prompt, is an important capability that has emerged in large language models and has received significant attention in both theory and practice. Existing theoretical work often focuses on settings where the learning uses information purely from the prompt. However, many practical instances of in-context learning require the model to retrieve factual knowledge stored in the model's parameters, with the context serving to identify which knowledge is relevant. In this work, we study how in-context learning leverages factual knowledge recall. We formalize this behavior by introducing the \emph{in-context factual recall (IC-recall)} task, where a transformer is provided a context of (subject, answer) pairs generated from a hidden relation, along with a query subject, and must both infer this hidden relation and retrieve the corresponding answer. Factual knowledge is modeled by the transformer having access to a simple pre-constructed MLP associative memory storing (subject, relation, answer) triplets. We analyze the supervised fine-tuning dynamics of a one-layer transformer on IC-recall data and prove that the model successfully performs IC-recall by converging to a particular pairwise attention pattern. This fine-tuning stage requires a very small number of samples, only polylogarithmic in the number of stored knowledge triplets. Experiments verify our theoretical predictions and show that the pairwise attention pattern emerges even when the MLP layer is pretrained instead of constructed.
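Note: the IC-recall task format can be made concrete with a small data generator. This is a sketch under assumptions: the associative memory is modeled as a plain dictionary rather than an MLP, answers are assigned at random, and the brute-force `oracle_answers` helper is a hypothetical reference solver, not the transformer mechanism analyzed in the paper.

```python
import random


def build_memory(num_subjects, num_relations, num_answers, rng):
    """Stored (subject, relation) -> answer triplets, assigned at random."""
    return {
        (s, r): rng.randrange(num_answers)
        for s in range(num_subjects)
        for r in range(num_relations)
    }


def sample_prompt(memory, num_subjects, num_relations, context_len, rng):
    """Draw a hidden relation, then build (subject, answer) context pairs and a query."""
    relation = rng.randrange(num_relations)  # never shown to the solver
    subjects = rng.sample(range(num_subjects), context_len + 1)
    context = [(s, memory[(s, relation)]) for s in subjects[:-1]]
    query = subjects[-1]
    return context, query, memory[(query, relation)]


def oracle_answers(memory, num_relations, context, query):
    """Answers for every relation that is consistent with all context pairs."""
    consistent = [
        r for r in range(num_relations)
        if all(memory[(s, r)] == a for s, a in context)
    ]
    return {memory[(query, r)] for r in consistent}


rng = random.Random(0)
memory = build_memory(num_subjects=20, num_relations=4, num_answers=50, rng=rng)
context, query, target = sample_prompt(memory, 20, 4, context_len=3, rng=rng)
candidates = oracle_answers(memory, 4, context, query)
```

The context pairs never name the relation, so the solver has to infer it from the pairs and then look up the query subject, which mirrors the two requirements stated in the abstract.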

2605.27773 2026-05-28 cs.CL cs.AI cs.LG

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

模型是否知道它们为何改变主意?知识冲突下思维链的可解释性与忠实性

Pruthvinath Jeripity Venkata

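Affiliations: Independent Researcher

AI summary: Introduces introspective faithfulness to test whether a language model's chain-of-thought reasoning under knowledge conflict faithfully reflects its decision mechanism. CoT is highly stable across opposite decisions, while self-rated confidence carries a faint genuine signal.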

Comments 12 pages, 8 tables, 3 appendices

AI Chinese abstract

当语言模型看到与其训练知识相矛盾的文档时,它必须做出选择:遵循文档还是相信自己。先前的工作证明这种选择取决于事实的知名程度。我们问:模型的思维链(CoT)推理是否忠实地报告了这一机制?我们引入了内省忠实性,并在200个问题、8个模型和4种提示条件下进行了测试。我们发现CoT推理在相反决策下高度稳定:翻转对保留了96%的相同答案相似度(d=0.34;通过ROUGE-L确认,d=0.45)。然而,自我评定的置信度携带微弱的真实信号:对于实体知名度无信息量的冷门事实,置信度仍能预测决策(p<0.001),并追踪项目级知识(r=0.134)。GPT-4o是唯一具有统计上可靠的推理-决策耦合的模型。Claude Sonnet 4.6显示出最宽的置信度范围(SD=1.39),但汇总相关性接近零,因为置信度-决策关系在不同条件下反转;温度消融实验证实这是模型特有的。内部思考令牌比面向用户的CoT显示出更大的决策敏感性(p=0.033)。CoT分解为决策不变的知识展示(约96%)和一层薄弱的置信度层,后者带有微弱但真实的信号。对于监控:读取置信度,而不是论证。

English abstract

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.

2605.27772 2026-05-28 cs.SD cs.LG

Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

音频大语言模型是听还是读?使用VoxParadox分析和缓解副语言失败

Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani

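Affiliations: Institute for Creative Technologies, University of Southern California, Los Angeles, USA

AI summary: To address the weak paralinguistic understanding of audio LLMs, proposes the adversarial benchmark VoxParadox and the Prompt-Conditioned Layer Mixer method, which substantially improve the model's use of paralinguistic cues.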

Comments Accepted as a conference paper at ICML 2026. Project page: https://voxparadox.github.io/

AI Chinese abstract

音频大语言模型(Audio LLMs)在语音理解任务上表现出色,但其理解副语言信息的能力仍然有限。为了系统量化这一问题,我们引入了VoxParadox,一个包含2000个验证示例的对抗性基准,涵盖10项副语言任务,通过受控语音合成故意使转录声明与说话风格不匹配,从而直接测量语音副语言理解能力。对多种音频大语言模型的评估显示,它们在声学真实情况上的准确率持续较低,并且强烈倾向于遵循语言暗示的(错误)答案。为了理解这一差距的原因,我们进行了逐层探针分析,发现(i)副语言线索可能在更深的编码器层以及编码器-大语言模型接口处退化,(ii)即使音频令牌中存在这些线索,语言模型也经常忽略它们。为了解决这些问题,我们提出了提示条件层混合器(PCLM),它根据输入提示自适应地组合多个音频层的信息,并结合直接偏好优化(DPO)来明确偏好声学支持的选项而非语言暗示的选项。这些方法显著提升了音频大语言模型的副语言理解能力,在VoxParadox上将Audio Flamingo 3从17.40%提升至65.20%,在MMSU副语言子集上从37.74%提升至54.78%。我们的项目页面位于https://voxparadox.github.io/。

English abstract

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder--LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, raising Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox and from 37.74% to 54.78% on the MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
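Note: one way to picture "adaptively combines information from multiple audio layers based on the input prompt" is a softmax gate over layers whose scores depend on a prompt embedding. The sketch below assumes exactly that form, with one hypothetical gate vector per layer and tiny hand-written feature vectors; the actual PCLM architecture may differ.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_features, prompt_embedding, gate_vectors):
    """Weight each encoder layer by a prompt-dependent score, then sum the features."""
    scores = [
        sum(g * p for g, p in zip(gate, prompt_embedding))
        for gate in gate_vectors
    ]
    weights = softmax(scores)
    dim = len(layer_features[0])
    mixed = [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]
    return weights, mixed


# Hypothetical values: three encoder layers with 2-dimensional features.
layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_vectors = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
prompt_embedding = [1.0, 0.0]  # e.g. a prompt asking about speaking style

weights, mixed = mix_layers(layer_features, prompt_embedding, gate_vectors)
```

With this prompt the first layer receives the largest weight; a different prompt embedding would shift the mixture toward another layer without changing the encoder itself.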

2605.27768 2026-05-28 cs.AI

Auditable Decision Models with Learned Abstention and Real-Time Steering

具有学习弃权与实时引导的可审计决策模型

Sankaranarayanan Palamadai Chandrasekaran

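Affiliations: Simple Machine Mind

AI summary: Proposes EvaluatorDPT, a transformer-encoder model that learns three-way YES/NO/TBD decisions, with TBD learned as a deferral output, and that supports inference-time threshold control and auxiliary semantic signals for auditable decision control.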

Comments 21 pages, 5 figures

AI Chinese abstract

生产AI系统通常在证据不完整、冲突或不足的情况下运行。强制分类器将此类情况压缩为动作标签,而生成系统可能产生难以解释为可审计执行决策的输出。我们研究AI系统的操作决策控制,其中不确定性必须明确可路由、受策略约束且可审计,而不是隐藏在强制预测或自由形式生成中。我们提出EvaluatorDPT,一种有界决策控制模型,预测YES、NO或TBD,其中TBD被学习为延迟结果,而非仅作为事后置信规则添加。该模型使用Transformer编码器,带有主要的有界决策头和用于价值观及情绪/情感的辅助通道。接口在形式上与领域无关:部署领域提供证据和策略阈值,而模型发出有界分布,可在推理时通过记录的操作阈值以及(经验证后)辅助语义信号进行控制。对于评估的模型版本,我们报告了在保留验证集和测试集上的决策性能;由于此评估中禁用了情感头,因此省略了辅助情感指标。在保留测试集(n=44,597)上,模型达到准确率=0.8260,宏F1=0.8252,各类别F1分别为0.8314(YES)、0.8486(NO)和0.7956(TBD)。评估记录还包括校准证据(验证集上ECE=0.0338)、阈值扫描输出、多种子稳定性检查、混淆矩阵和可重复性命令。我们的主要贡献是一个有界执行接口,其中延迟被学习,推理时路由保持可检查,辅助信号提供可审计行为控制的路径,且评估证据支持外部审查。

English abstract

Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into action labels, while generative systems can produce outputs that are difficult to interpret as auditable execution decisions. We study operational decision control for AI systems, where uncertainty must be explicitly routable, policy-governed, and auditable rather than hidden inside forced predictions or free-form generation. We present EvaluatorDPT, a bounded decision-control model that predicts YES, NO, or TBD, where TBD is learned as a deferral outcome rather than added only as a post-hoc confidence rule. The model uses a transformer encoder with a primary bounded-decision head and structured auxiliary channels for values and emotions/sentiments. The interface is domain-agnostic in form: a deployment domain supplies evidence and policy thresholds, while the model emits a bounded distribution that can be controlled at inference time through recorded operating thresholds and, when validated, auxiliary semantic signals. For the evaluated model version, we report decision performance on held-out validation and test splits; auxiliary emotion metrics are omitted because the emotion head is disabled for this evaluation. On the held-out test split (n=44,597), the model achieves Accuracy = 0.8260 and Macro F1 = 0.8252, with per-class F1 of 0.8314 (YES), 0.8486 (NO), and 0.7956 (TBD). The evaluation record also includes calibration evidence (ECE = 0.0338 on validation), threshold-sweep outputs, multi-seed stability checks, confusion matrices, and reproducibility commands. Our main contribution is a bounded execution interface in which deferral is learned, inference-time routing remains inspectable, auxiliary signals provide a path to auditable behavior control, and evaluation evidence supports external review.
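Note: two points in the abstract lend themselves to a short worked example: inference-time routing through recorded thresholds, and the relation between the per-class F1 scores and the Macro F1. The routing rule and threshold values below are assumptions made for illustration, not the EvaluatorDPT interface; the F1 numbers are the ones reported in the abstract.

```python
def route(probs, yes_threshold=0.7, no_threshold=0.7):
    """Map a YES/NO/TBD distribution to a decision, deferring when confidence is low.

    The thresholds are hypothetical operating points supplied by the deployment.
    """
    label = max(probs, key=probs.get)
    if label == "YES" and probs["YES"] >= yes_threshold:
        return "YES"
    if label == "NO" and probs["NO"] >= no_threshold:
        return "NO"
    return "TBD"


def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


# Per-class F1 values as reported for the held-out test split.
reported = {"YES": 0.8314, "NO": 0.8486, "TBD": 0.7956}
macro = macro_f1(reported)

confident = route({"YES": 0.80, "NO": 0.10, "TBD": 0.10})
uncertain = route({"YES": 0.50, "NO": 0.30, "TBD": 0.20})
```

The unweighted mean of the three reported scores is (0.8314 + 0.8486 + 0.7956) / 3 = 0.8252, which agrees with the reported Macro F1. In the routing example, the second distribution falls below the YES threshold and is deferred as TBD.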

2605.27767 2026-05-28 cs.CL cs.AI cs.LG

UniMaia: Steering Chess Policies with Language for Human-like Play

UniMaia:用语言引导国际象棋策略以实现类人玩法

Sherman Siu, Lesley Istead

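Affiliations: University of Waterloo; Carleton University

AI summary: Proposes the UniMaia framework, which uses a parameter-efficient text encoder and a ControlNet-style conditioning mechanism to add prompt-conditioned policy modulation to a frozen Lc0 chess policy network, giving semantic control (such as opening selection and player strength) while preserving the pretrained policy representations. The authors also build a large-scale metadata-augmented Lichess dataset and a semi-automated prompt-generation pipeline, and report state-of-the-art or competitive results on several benchmarks.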

AI Chinese abstract

大型语言模型的最新进展使得自然语言能够作为控制复杂系统的灵活接口,但通常以大规模多模态训练或弱化领域特定归纳偏差为代价。在结构化决策领域(如国际象棋)中,专门的策略网络表现强劲但缺乏语义可控性,而提示条件语言模型更灵活但通常领域基础较弱。我们提出$\textbf{UniMaia}$,一个用于提示条件策略调制的框架,它使用参数高效文本编码器和ControlNet风格的调节机制来适配基于Lc0的冻结国际象棋策略网络。UniMaia能够实现对游戏玩法的语义控制,包括开局选择和玩家强度,同时保留预训练的策略表征。我们进一步引入$\textbf{UniMaia-Aux}$,它结合了辅助时间条件化和行为预测目标。为了支持这项工作,我们构建了一个大规模元数据增强的Lichess数据集,开发了一个半自动提示生成管道,并引入了涵盖提示条件和元数据条件设置的基准。UniMaia在多个提示条件基准上实现了最先进的预期准确率,在通用指令遵循任务上达到了竞争性的最佳着法准确率,同时在人类着法预测基准上与专门的元数据条件方法保持竞争力。UniMaia-Aux进一步提高了多个评估设置下的预期准确率和行为建模,在最佳着法准确率上略有折衷。总体而言,我们的结果表明,无需端到端多模态训练即可实现领域特定策略网络的提示条件控制,同时突出了可控性与预测性能之间的权衡。

English abstract

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose $\textbf{UniMaia}$, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce $\textbf{UniMaia-Aux}$, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

2605.27766 2026-05-28 cs.AI

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Aman Priyanshu, Supriti Vijay, Esha Pahwa

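Affiliations: Foundation AI USA

AI summary: Uses a multi-agent simulation platform to evaluate the privacy-leakage risk of LLM agents under social pressure. Multi-turn social interaction substantially increases leakage, leakage is socially contagious, and privacy instructions do not fully eliminate it.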

AI Chinese abstract

LLM安全评估主要在隔离环境中测试模型,然而部署的AI智能体越来越多地与其他智能体在持久社交环境中交互。我们引入了一个Moltbook风格的模拟平台,数千个LLM智能体在模拟的一个月内跨社区交互,并用它来评估在不同程度的社交压力下隐私作为下游安全问题的表现。我们发现从单轮到多轮社交评估会放大隐私侵犯(OpenAI模型上CIMemories 19.95%到我们的45.30%),泄露具有社交传染性,观察到同伴泄露后智能体泄露敏感信息的可能性增加8倍,并且明确的隐私指令减少但不能消除这种效应,即使有保护措施,泄露率仍高于37.8%。我们的发现表明,基于静态聊天的安全基准系统性地低估了智能体部署中的风险,而仅社交环境就足以引发单轮评估永远不会发现的敏感泄露。

English abstract

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% in our setting, across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.

2605.27765 2026-05-28 cs.LG cs.AI

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

恢复甜蜜点:用于LLM推理的通过率加权自蒸馏

Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

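Affiliations: College of Information Sciences and Technology

AI summary: Proposes SC-SDPO, which weights the self-distillation loss by each question's pass rate, dynamically adjusting training difficulty and improving LLM reasoning performance.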

Comments 18 pages, 8 figures

AI Chinese abstract

自蒸馏策略优化(SDPO)通过利用模型自身的反馈条件预测作为自教师,为大型语言模型的强化学习提供密集的令牌级信用分配。然而,与GRPO不同——其群体相对优势自然地将学习集中在一个中等难度问题的甜蜜点上——SDPO的基于KL的优势缺乏隐式的难度感知。我们通过GRPO的优势归一化视角分析这一差距。将可学习性框架扩展到归一化奖励,我们表明归一化吸收了方差项$p(1-p)$,使各问题的前导阶可学习性相等,留下$\sqrt{p(1-p)}$作为每个问题梯度中唯一的残差缩放因子。这一分析产生了一个简单的处方:用$[\hat{p}(1-\hat{p})]^{1/2}$加权每个问题的SDPO损失,得到SC-SDPO,即SDPO的尺度一致变体。所提出的权重作为在线策略rollout与批自适应归一化的零成本副产品获得,诱导出一个隐式课程,动态跟踪模型不断发展的能力。在科学推理和工具使用基准上的实验表明,SC-SDPO持续优于SDPO,在Qwen3-8B上获得+3.2/+4.3(mean@16/maj@16)的提升,在OLMo-3-7B上获得+1.8/+3.0的提升,同时在整个优化过程中保持稳定的训练动态。

English abstract

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
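Note: the weighting prescription $[\hat{p}(1-\hat{p})]^{1/2}$ is simple enough to write out. The sketch below assumes binary rollout rewards, estimates $\hat{p}$ as the empirical pass rate per question, and averages the weighted per-question losses; the function names are hypothetical, the loss values are placeholders, and the batch-adaptive normalization mentioned in the abstract is left out.

```python
import math


def pass_rate(rewards):
    """Empirical pass rate p_hat from binary rollout rewards for one question."""
    return sum(rewards) / len(rewards)


def sc_weight(p_hat):
    """Weight [p_hat * (1 - p_hat)]^(1/2); largest at p_hat = 0.5, zero at 0 and 1."""
    return math.sqrt(p_hat * (1.0 - p_hat))


def weighted_loss(per_question_losses, per_question_rewards):
    """Average of per-question losses, each scaled by its pass-rate weight."""
    weights = [sc_weight(pass_rate(r)) for r in per_question_rewards]
    total = sum(w * loss for w, loss in zip(weights, per_question_losses))
    return total / len(per_question_losses)


# Hypothetical batch: three questions with four rollouts each.
rewards = [
    [1, 1, 1, 1],  # always solved: weight 0
    [1, 0, 1, 0],  # intermediate difficulty: weight 0.5
    [0, 0, 0, 1],  # mostly failed: weight sqrt(0.1875)
]
losses = [0.8, 0.8, 0.8]  # placeholder per-question SDPO losses
loss = weighted_loss(losses, rewards)
```

Questions that are always or never solved contribute nothing, while questions near a 50% pass rate contribute most, which is the "sweet spot" behavior the abstract attributes to GRPO's normalization.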

2605.27764 2026-05-28 cs.CV cs.AI

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

分割模型能理解世界吗?通过视觉思维链实现主动可供性推理

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

Affiliations: Northwestern University; Northeastern University; South China University of Technology; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University

AI summary: Proposes the SegWorld framework, which uses a multi-level visual chain-of-thought to proactively observe the scene and reason about affordances under intent-level instructions, enabling goal-to-part segmentation.

AI Chinese abstract

最近的分割模型将大语言模型(LLMs)与掩码解码器结合,将复杂的语言表达映射到掩码上,但其指令仍然是目标指涉的:它们描述、约束或暗示待分割的区域。然而,在现实世界的具身交互中,人类指令通常是意图级的,包括期望的结果而不指定实现该结果的区域。为弥合这一差距,我们引入SegWorld,其中模型在确定掩码之前通过多级视觉思维链(CoT)推理场景。在接收任何指令之前,它主动观察场景,描述可见对象并推断它们可能支持的可能事件。给定指令后,它继续思维链:从与意图相关的对象,到满足意图的动作,再到物理交互部位,即支持该动作的对象部分。我们将SegWorld形式化为概率推理,其中主动观察提供语言场景上下文,当指令以意图级别给出时,可改善掩码预测。我们构建了一个意图到部件的基准,用于评估从高层目标出发的可供性承载部件分割。实验表明,SegWorld在目标指涉指令上匹配指令驱动基线,并在意图级指令上显著提升。

English abstract

Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often given at the intent level: they state the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

2605.27763 2026-05-28 cs.LG

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

LLM服务中基于批处理条件的拒绝鲁棒性配对测试协议

Sahil Kadadekar

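Affiliations: Independent Researcher

AI summary: Proposes a paired testing protocol and, across four studies, examines how batch condition affects LLM safety labels. Batch-induced safety-label flips are rare but present, and the authors recommend exact-stack validation.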

Comments 12 pages. Accepted to the ICML 2026 Workshop on Hypothesis Testing

AI Chinese abstract

语言模型的安全性评估通常将服务配置视为固定的背景基础设施,但当同一提示可能单独评估、在同步批处理中或在连续批处理调度器内评估时,批处理条件是一个未经测试的处理变量。我们将四项基于工件的研究综合成一个配对测试协议:研究A结合局部发现、评分者校正裁决和真实批处理确认;研究B测试跨模型泛化;研究C测试连续批处理组合;研究D进行批处理不变核消融。局部测试发现安全标签变化比能力标签变化更频繁(0.51% vs. 0.14%),但对63个候选行的裁决仅留下17个真实行为翻转,意味着校正后的全集率为0.16%。扩展到15个模型未发现可检测的普遍安全优于能力偏差:翻转接近平衡(0.94倍),对齐类型无显著关联(p=0.942,η²=0.033),输出不稳定性是最强的脆弱性筛选指标(r=0.909,bootstrap 95% CI [0.65, 0.97])。在目标核消融中,标准vLLM在当前分数翻转候选上重现22/55的标签翻转,而启用VLLM_BATCH_INVARIANT=1将同一测试减少到0/55翻转;组合测试单独未发现4.7pp敏感度下的聚合效应。测试建议是精确堆栈验证:在服务批处理设置下评估拒绝,将安全提示与能力控制配对,并分别报告低速率方向翻转与聚合零效应。

English abstract

Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized batch, or inside a continuous-batching scheduler. We synthesize four artifact-backed studies into a paired testing protocol: Study A combines local discovery, scorer-corrected adjudication, and true-batching confirmation; Study B tests cross-model generalization; Study C tests continuous-batch composition; and Study D runs a batch-invariant-kernel ablation. The local test finds safety-label changes more often than capability-label changes (0.51% vs. 0.14%), but adjudication of 63 candidate rows leaves only 17 genuine behavioral flips, implying a corrected full-set rate of 0.16%. The 15-model extension finds no detectable universal safety-over-capability skew: flips are near parity (0.94x), alignment type has no detectable association ($p=0.942$, $\eta^2=0.033$), and output instability is the strongest tested fragility screen ($r=0.909$, bootstrap 95% CI [0.65, 0.97]). In the targeted kernel ablation, standard vLLM reproduces 22/55 label flips on current score-flip candidates, while enabling VLLM_BATCH_INVARIANT=1 reduces the same test to 0/55 flips; the composition test separately finds no aggregate effect at 4.7pp sensitivity. The testing recommendation is exact-stack validation: evaluate refusal at the served batch setting, pair safety prompts with capability controls, and report low-rate directional flips separately from aggregate null effects.
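Note: a paired comparison of the same prompts under two batch conditions can be analyzed by counting label flips in each direction and testing whether the directions are balanced. The sketch below uses an exact two-sided sign test on the discordant pairs; this is a generic choice made here as an assumption, not necessarily the statistic used in the paper, and the `"refuse"`/`"comply"` labels are hypothetical.

```python
from math import comb


def discordant_counts(solo_labels, batched_labels):
    """Count paired label flips in each direction between two serving conditions."""
    to_comply = sum(
        1 for s, b in zip(solo_labels, batched_labels)
        if s == "refuse" and b == "comply"
    )
    to_refuse = sum(
        1 for s, b in zip(solo_labels, batched_labels)
        if s == "comply" and b == "refuse"
    )
    return to_comply, to_refuse


def exact_sign_test(k, n):
    """Two-sided exact binomial p-value for k of n flips in one direction (p = 0.5)."""
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired labels for the same five prompts, served alone and in a batch.
solo = ["refuse", "refuse", "comply", "refuse", "comply"]
batched = ["comply", "refuse", "refuse", "refuse", "comply"]

to_comply, to_refuse = discordant_counts(solo, batched)
p_value = exact_sign_test(to_comply, to_comply + to_refuse)
```

Reporting the two directional counts alongside the overall flip rate keeps rare directional flips visible even when the aggregate effect is null, in line with the recommendation at the end of the abstract.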

2605.27761 2026-05-28 cs.CV cs.SE

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

AndroidDaily: 面向真实世界闭源应用的可验证移动GUI智能体基准

Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

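Affiliations: Beijing University of Posts and Telecommunications; StepFun; Waseda University

AI summary: Closed-source apps expose no internal state, which makes automatic verification difficult. The paper proposes the AndroidDaily benchmark (350 daily-use tasks) and the GRADE evaluator (built on a three-tiered system of observable external guidelines) for verifiable evaluation without internal state. The strongest model reaches a 62.0% success rate.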

Comments 11 pages, 6 figures. Preprint

AI Chinese abstract

GUI基础模型和移动GUI智能体的快速发展催生了众多评估基准,但大多数依赖于模拟环境或开源应用,真实世界的闭源应用在很大程度上未得到评估。核心困难在于闭源应用不暴露内部状态,使得传统的自动验证不适用。为弥合这一差距,我们引入了AndroidDaily,一个大规模基准,包含跨94个高频Android应用的350个现实日常任务,涵盖交通、购物、本地服务、娱乐、内容创作、社交媒体和日常实用工具。为了在这些不透明环境中实现自动且可验证的评估,我们提出了基于指南的自动诊断评估评审器(GRADE),这是一个基于三层可观察外部指南系统构建的过程感知评估器:操作义务、输出质量和负面约束。GRADE根据这些标准跟踪智能体的视觉轨迹,并产生步骤级诊断判断,将长期、开放式的移动交互转化为可验证的评估,而无需依赖隐藏的内部状态。实验表明,GRADE与人类评估者的一致性达到87.37%。最强模型在AndroidDaily上的成功率为62.0%,凸显了当前推理能力与现实移动工作流实际执行之间的巨大差距。

English abstract

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

2605.27760 2026-05-28 cs.AI

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad: 像梯度下降一样优化智能体技能

Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen

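Affiliations: College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA

AI summary: Proposes the SkillGrad framework, which treats the skill package as a structured parameter and optimizes it in a gradient-descent-like fashion using trajectory-level loss, text-gradient diagnoses, and a momentum memory overlay, improving over the strongest training-based baseline by 6.7 percentage points on average on table question-answering benchmarks.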

AI Chinese abstract

智能体技能通过将可复用的程序化知识存储在结构化文件中,提供了一种轻量级的方式将LLM智能体适配到专业领域。然而,无论是从第三方下载还是自行生成,这些技能往往不可靠、不完整或过时。现有的技能演化方法通常通过启发式反思来解决这些缺陷,缺乏明确的优化公式。在本文中,我们提出了SkillGrad,一种受梯度下降启发的智能体技能优化框架。SkillGrad将技能包视为结构化参数,以梯度下降的方式进行优化:任务执行提供轨迹级损失证据,自动诊断随后提供指示修正方向的文本梯度。为了稳定跨迭代的优化,动量智能体将重复出现的诊断模式累积到持久记忆覆盖层中。最后,基于LLM的修补器通过对技能包进行层感知编辑来执行参数更新。在SpreadsheetBench Verified和WikiTableQuestions上的评估表明,SkillGrad在两个骨干LLM上始终优于基于训练的技能演化基线,平均比最强的基于训练的基线高出6.7个百分点。消融实验进一步表明,动量和对比诊断都对最终技能质量有贡献。

English abstract

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

2605.27759 2026-05-28 cs.RO

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Colosseum V2:视觉语言动作模型的泛化能力基准测试

Jeremy Morgan, Prajwal Vijay, Hyeonho Oh, Jincen Song, Ashvin Arora, Alina Du, Gaurav Sukhatme, Jesse Thomason, Ishika Singh

Affiliations: Department of Computer Science, University of Southern California; Department of Electrical Engineering, Indian Institute of Technology Madras; Fu Foundation School of Engineering and Applied Science, Columbia University

AI summary: Proposes Colosseum V2, a large-scale simulation benchmark with 28 tasks and two robot morphologies that systematically evaluates VLA generalization under distribution shift and exposes the gap between high-level understanding and robust behavior.

AI Chinese abstract

视觉-语言-动作(VLA)模型在大规模视觉和语言预训练的推动下,在机器人操作中展现出有前景的泛化能力。然而,这种进展可能具有误导性。尽管VLA具有零样本感知和语言能力,但它们的整体任务性能在分布偏移下常常下降,揭示了这些系统将高层次理解转化为鲁棒行为方面的差距。为了系统地研究这一差距,我们引入了Colosseum V2,这是一个大规模仿真基准,用于评估机器人学习中VLA在不同条件下的泛化能力。该基准包含28个任务,涵盖13个任务类别和两种机器人形态,覆盖了广泛的操作原语和长时域行为。基于ManiSkill仿真器构建,Colosseum V2支持快速、GPU并行化的评估,并支持大规模域内和域外测试。我们评估了包括Action Chunking Transformers (ACT)和Pi0.5在内的最先进方法,揭示了它们在基础性能和泛化方面的局限性。我们展示了仿真与真实世界指标之间的强相关性,支持了该基准的生态效度。通过在统一基准中标准化任务、指标和评估协议,Colosseum V2实现了可重复和公平的比较,降低了评估开销,并加速了向通用机器人策略的进展。

English abstract

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

2605.27758 2026-05-28 cs.LG cs.AI physics.comp-ph

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

基于几何感知算子学习与内存高效低秩注意力的高保真工业碰撞动力学预测

Deepak Akhare, Mohammad Amin Nabian, Corey Adams, Sudeep Chavare, Sanjay Choudhry

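Affiliations: Department of Aerospace and Mechanical Engineering, University of Notre Dame; NVIDIA; General Motors

AI summary: Shows that the GeoTransolver framework, which combines geometry-aware operator learning with memory-efficient low-rank attention, delivers high-fidelity prediction of industrial-scale crash dynamics, with accuracy and efficiency validated on complex bumper-beam and full-vehicle crash datasets.

AI Chinese abstract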

汽车碰撞安全性优化仍然是一个安全关键挑战,需要通过迭代的高保真模拟来管理大规模非线性结构变形和能量耗散。虽然传统有限元求解器计算成本高昂,新兴的算子学习框架提供了快速的代理预测;然而,将其应用于工业级碰撞分析(其中复杂几何、接触非线性和快速演变的瞬态变形并存)仍然是一个未解决的挑战。在本文中,我们证明GeoTransolver框架为工业规模下准确、高保真的碰撞动力学预测提供了可行的解决方案。在复杂的保险杠梁和整车碰撞数据集上进行的基准测试表明,GeoTransolver能够捕捉多尺度几何上下文,并准确解析塑性变形模式以及关键乘员位置的加速度曲线。除了架构本身,我们提出并系统评估了一系列时间预测策略,包括一次性、时间条件和自回归滚动策略,证明一次性方法在显著降低训练开销和推理延迟的同时实现了最先进的准确性。作为次要贡献,我们引入了一种基于快速低秩注意力路由引擎(FLARE)的修改,应用于GeoTransolver注意力主干,将内存开销减少约2倍,同时进一步提高O(N)长程、高频瞬态的预测准确性,保留了基础框架的几何感知交叉注意力优势。我们的结果突显了几何感知算子学习在复杂、安全关键的汽车动力学高保真代理建模中的实际可行性。

English abstract

Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While traditional finite element solvers are computationally prohibitive, emerging operator learning frameworks provide rapid surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving transient deformation coexist, remains an open challenge. In this paper, we demonstrate that the GeoTransolver framework provides a viable solution for accurate, high-fidelity crash dynamics prediction at industrial scale. Benchmarked on complex bumper beam and full-vehicle crash datasets, GeoTransolver captures multi-scale geometric context and accurately resolves plastic deformation patterns as well as acceleration profiles at critical occupant locations. Beyond the architecture itself, we propose and systematically evaluate a suite of temporal prediction recipes, including one-shot, time-conditional, and autoregressive rollout strategies, demonstrating that the one-shot approach achieves state-of-the-art accuracy with significantly reduced training overhead and inference latency. As a secondary contribution, we introduce a Fast Low-rank Attention Routing Engine (FLARE)-based modification to the GeoTransolver attention backbone that reduces memory overhead by approximately 2x while further improving predictive accuracy for O(N) long-range, high-frequency transients, preserving the geometry-aware cross-attention strengths of the base framework. Our results highlight the practical viability of geometry-aware operator learning for high-fidelity surrogate modeling of complex, safety-critical automotive dynamics.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

阅读还是猜测?古希腊版本OCR中视觉语言模型的视觉定位失败

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

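Affiliations: Inria

AI summary: Compares open-weight vision-language models with traditional OCR baselines on low-resource Ancient Greek critical editions, finds that VLMs produce fluent text even when wrong, which points to reliance on language priors, and introduces image perturbations and token-level grounding measures to analyze the visual evidence used during decoding.

AI Chinese abstract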

最近的研究表明,用于光学字符识别(OCR)的视觉语言模型(VLM)能够生成看似合理但缺乏视觉支持的文本,暗示其依赖语言先验。通过将开放权重VLM与传统OCR基线在低资源古希腊批判版本上进行对比,我们展示了VLM的错误即使在错误时也往往保持流畅,产生合理的希腊语替换,而传统引擎则产生局部识别噪声。为了分析解码过程中的视觉证据,我们引入了受控图像扰动和基于条件与无图像解码分布的标记级定位度量。在字符级扰动下,VLM与扰动的真实文本严重偏离,而传统OCR相对忠实;然而,标记级分析表明先验依赖是模型特定的:在OCR专业模型中,流畅的词汇错误几乎不依赖图像而产生,而通用VLM即使在错误时也仍然依赖于视觉输入。解码时干预未能可靠地恢复定位,而OCR后语言模型校正仅通过生成后修复文本改善了几个系统。我们的结果将先前关于OCR语言先验依赖的证据扩展到低资源历史文档和更广泛的模型集,表明流畅输出不一定具有视觉基础,并推动了超越总体准确性的可解释性驱动评估。

English abstract

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
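One way to make the idea of a token-level grounding measure concrete is to compare the next-token distribution a model produces with the page image against the one it produces without it. The sketch below uses the KL divergence for that comparison. This is an assumed choice for illustration, since the abstract does not specify the measure, and the distributions are hand-written toy values rather than model outputs.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same token vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
with_image = [0.90, 0.05, 0.05]     # decoding conditioned on the page image
without_image = [0.30, 0.60, 0.10]  # same prefix, image removed (language prior only)
prior_driven = [0.31, 0.59, 0.10]   # a token whose distribution barely moves with the image

# A large value means the image changed the prediction; a value near zero
# means the token would have been produced from the language prior alone.
grounded_score = kl_divergence(with_image, without_image)
ungrounded_score = kl_divergence(prior_driven, without_image)
```

Under this reading, a fluent but wrong token with a score near zero would count as prior-driven, while a wrong token with a high score would still be conditioned on the visual input.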

2605.27748 2026-05-28 cs.CV cs.AI cs.LG

Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

马氏距离 PatchCore:协方差感知与流式兼容的工业异常检测

Niccolò Ferrari, Oligert Osmani, Evelina Lamma

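Affiliations: Department of Engineering, University of Ferrara

AI summary: Proposes Mahalanobis PatchCore, which extends PatchCore with covariance estimation and streaming construction, preserving most public-benchmark performance while lowering peak memory and improving accuracy on industrial inspection data.

Comments: 57 pages, 7 figures

AI Chinese abstract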

工业视觉异常检测通常是一类问题:正常图像丰富,而缺陷罕见、异质且常在系统设计时不可用。PatchCore 风格的检索适合此场景,因为它通过正常补丁特征的内存库对测试图像评分,但标准欧几里得几何忽略了特征相关性,且其离线构建在子采样前需实例化整个补丁池。我们引入马氏距离 PatchCore,一种协方差感知、流式兼容的 PatchCore 扩展。其人工智能贡献在于一种检索检测器,它在降维特征空间中估计正则化协方差模型并对嵌入进行白化,使得变换后的欧几里得最近邻搜索实现马氏距离检索。一个有界内存、可重复迭代的训练流程通过增量降维、在线协方差估计和流式聚合,无需一次性存储所有正常补丁即可构建内存库。工程应用是自动化工业检测,其中视觉异常检测必须在实际内存限制下保持准确。我们在一个公开的 15 类工业异常检测基准和三个工业数据集(涵盖吹灌封条带安瓿弯月面检测、琥珀色玻璃安瓿底部检测和冻干饼西林瓶检测)上评估该方法。马氏距离 PatchCore 在公开基准上保留了大部分离线 PatchCore 的图像级性能,同时将峰值内存从 5.41 GB 降至 2.78 GB,并将选定的工业平均图像接收者操作特征曲线下面积从 0.981 提升至 0.986。

English abstract

Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986.
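The claim that Euclidean nearest-neighbour search after whitening implements Mahalanobis retrieval can be checked on a toy example. The sketch below is a minimal pure-Python illustration, not the authors' pipeline: it assumes a ridge-regularised sample covariance, whitens with a Cholesky factor, and uses made-up 2-D "patch features". All function names are hypothetical.

```python
import math

def covariance(rows, shrinkage=1e-3):
    """Sample mean and covariance with a small ridge term on the diagonal."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += shrinkage
    return mean, cov

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def whiten(x, mean, low):
    """Solve L z = x - mean by forward substitution, so z = L^{-1}(x - mean)."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        s = sum(low[i][k] * z[k] for k in range(i))
        z[i] = (x[i] - mean[i] - s) / low[i][i]
    return z

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mahalanobis_sq_2d(a, b, cov):
    """Direct (a-b)^T cov^{-1} (a-b) for the 2-D case, used only as a check."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return (cov[1][1] * dx * dx - 2 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det

# Toy, strongly correlated "normal" features (illustrative values only).
normal_patches = [[0.0, 0.0], [2.0, 1.0], [4.0, 2.2], [6.0, 2.9], [8.0, 4.1]]
mean, cov = covariance(normal_patches)
chol = cholesky(cov)
memory_bank = [whiten(p, mean, chol) for p in normal_patches]

# Anomaly score: squared Euclidean distance to the nearest whitened neighbour.
query = [4.0, 4.0]
score = min(sq_dist(whiten(query, mean, chol), m) for m in memory_bank)
```

Because the whitened difference of two points is `L^{-1}(a - b)`, its squared norm equals `(a - b)^T (L L^T)^{-1} (a - b)`, which is the squared Mahalanobis distance under the regularised covariance.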

2605.27744 2026-05-28 cs.AI

A Policy-Driven Runtime Layer for Agentic LLM Serving

一种面向智能体LLM服务的策略驱动运行时层

Rui Zhang, Chaeeun Kim, Liting Hu

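Affiliations: University of California, Santa Cruz

AI summary: To address the difficulty of implementing cross-layer policies efficiently in multi-agent LLM systems, proposes inserting an agent runtime layer between the framework and the engine that supports any agent-aware policy through four primitives, validated on the KV-cache policy CacheSage.

AI Chinese abstract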

多智能体LLM系统已成为主流生产工作负载,但服务栈并非为其构建。上层的智能体框架知道智能体身份、角色、模式和调度结构,但从未看到引擎级事件;下层的服务引擎看到每个事件但对智能体一无所知。许多跨层策略依赖于两者:前缀缓存、批处理整形、推测执行、公平性、工具结果记忆、安全执行等。每个策略都存在于两层之间的缝隙中,目前通过向某一邻域打补丁来解决。我们认为这个缝隙最好通过架构变更而非点修复来解决:在框架和引擎之间插入第三层,即智能体运行时层,暴露四个原语(观察、评分、预测、行动),任何智能体感知策略都可以插入其中,并以智能体身份作为共享坐标。我们将九个具体策略映射到该层,并在具有最大即时服务成本杠杆的策略上深入验证了该抽象:跨会话的KV缓存,实例化为CacheSage,它在线学习每工作负载的智能体转移矩阵,并用于基于存活的驱逐和步间预取。在五个真实多智能体工作负载上的初步结果显示,与未修改的服务栈相比,缓存命中率提升13到37个百分点,平均TTFT降低12%到29%,吞吐量提高6%到14%。

English abstract

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
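The idea of learning an agent transition matrix online and using it for eviction and prefetch can be sketched in a few lines. The code below is a simplified illustration, not CacheSage itself: it assumes a first-order transition model built from observed agent sequences, and it stands in for "survival-based eviction" with a one-step next-agent probability. The class, function, and agent names are all hypothetical.

```python
from collections import defaultdict

class AgentTransitionModel:
    """Online counts of which agent runs after which (hypothetical helper)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.previous = None

    def observe(self, agent):
        # Record the transition previous -> agent as each request arrives.
        if self.previous is not None:
            self.counts[self.previous][agent] += 1
        self.previous = agent

    def prob(self, src, dst):
        # Empirical probability that dst runs immediately after src.
        row = self.counts.get(src)
        if not row:
            return 0.0
        return row.get(dst, 0) / sum(row.values())

def eviction_victim(model, current, cached):
    """Evict the cached agent least likely to run next; names break ties."""
    return min(cached, key=lambda a: (model.prob(current, a), a))

def prefetch_target(model, current, candidates):
    """Prefetch the agent most likely to run next."""
    return max(candidates, key=lambda a: (model.prob(current, a), a))

model = AgentTransitionModel()
for agent in ["planner", "coder", "tester", "planner", "coder", "tester",
              "planner", "coder", "reviewer"]:
    model.observe(agent)

cached = ["planner", "tester", "reviewer"]
victim = eviction_victim(model, "coder", cached)
target = prefetch_target(model, "coder", cached)
```

Read loosely against the four primitives named in the abstract, `observe` records events, `prob` scores cached entries, `prefetch_target` predicts the next agent, and the eviction choice is the action. A real policy would look further ahead than one step, since here the planner's prefix is evicted even though it runs again two steps later.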

2605.27741 2026-05-28 cs.CL

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

逃离语言先验:通过模态感知策略优化缓解音频推理中的后期模态崩溃

Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo

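Affiliations: Johns Hopkins University; Tencent Hunyuan

AI summary: Addresses late-stage modality collapse in multimodal LLMs during reinforcement-learning post-training, which arises because uniform policy gradients ignore modality dependence. Proposes Modality-Aware Policy Optimization (MAPO), which uses a modality relevance mask and an auxiliary attention-loss branch to focus gradients and sustain cross-modal reasoning, setting new state-of-the-art results among open-weight models on several complex audio reasoning benchmarks.

AI Chinese abstract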

音频和全模态大语言模型展现出令人印象深刻的跨模态推理能力。然而,将标准的强化学习后训练算法应用于这些模型时,暴露了一个关键的结构性脆弱性:像GRPO这样的方法对所有token施加统一的策略梯度,忽略了它们对非文本源模态的不平等依赖。这加剧了在扩展思维链生成过程中的后期模态崩溃,模型逐渐放弃主要源信号,转而依赖压缩的文本先验,导致自信但无根据的幻觉。为了解决这个问题,我们引入了模态感知策略优化(MAPO),一种新颖的双分支强化学习框架。首先,MAPO使用模态相关性掩码动态地将策略梯度集中在模态关键token上,该掩码源自音频消融参考与多模态策略之间的跨模态微分熵。其次,它集成一个辅助注意力损失分支,对模型内部的注意力分布施加有针对性的、时间尺度的惩罚。这确保模型在推理轨迹深处主动维持跨模态基础。在复杂音频推理基准上的评估表明,MAPO显著提高了长时推理保真度和多模态指令遵循能力,在开放权重模型中实现了极具竞争力的性能,并在几个关键基准上创造了新的最先进结果。通过严格依赖原生统计信号而非特定领域的归纳偏置,MAPO为缓解跨多种多模态系统的认知崩溃提供了一个有前景的基础。

English abstract

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.

2605.27740 2026-05-28 cs.CL

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

UNIQUE: 通用Top-k稀疏注意力,用于免训练推理和稀疏感知训练

Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li

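Affiliations: Microsoft

AI summary: Proposes UNIQUE, a framework that scores KV-page importance from the mean and standard deviation of each page's keys and adds soft-mask sparsity-aware training, enabling efficient top-k sparse attention over the KV cache for long-context LLM inference with large speedups and preserved task performance.

AI Chinese abstract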

大型语言模型(LLM)中的长上下文推理受到自注意力键值(KV)缓存线性增长的瓶颈限制。Top-k稀疏注意力通过仅加载一小部分KV缓存来缓解这一问题,但准确且廉价地估计缓存重要性(既用于免训练使用,也用于稀疏感知训练)仍然具有挑战性。本文提出UNIQUE,一个通用的top-k稀疏注意力框架,同时满足这两个需求,并在LLM模态中保持一致的有效性。UNIQUE以KV页面为粒度,通过一个简单而准确的评分来估计每页的重要性,该评分结合了页面键的均值作为代表性向量及其标准差作为偏移项。为了进一步缩小训练-推理差距,本文引入了一种软掩码稀疏感知训练方案,该方案使用top-k分数边界作为每个查询的阈值,并在其周围使用sigmoid软掩码,既不需要辅助损失,也不需要架构更改。在文本和语音LLM上的实验表明,UNIQUE在LongBench Pro等长上下文基准测试和长形式语音识别上保持了任务性能,同时与FlashInfer密集注意力相比,实现了高达11.4倍的注意力内核加速,并且与基于vLLM的密集模型相比,实现了至少5.3倍的端到端解码加速。

English abstract

Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page's keys as a representative vector with their standard deviation as an offset term. To further close the train-inference gap, this paper introduces a soft-mask sparsity-aware training scheme that uses the top-k score boundary as a per-query threshold and a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance on long-context benchmarks such as LongBench Pro and on long-form speech recognition, while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.
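Page-level scoring with a mean vector plus a standard-deviation offset can be illustrated as follows. The exact formula is not given in the abstract, so the sketch assumes one plausible form: the query's dot product with the page's mean key, plus the per-dimension standard deviation weighted by the absolute query value. The function names and the toy keys are hypothetical.

```python
import math

def page_summary(keys):
    """Per-dimension mean and population standard deviation of one page of keys."""
    n, d = len(keys), len(keys[0])
    mean = [sum(k[j] for k in keys) / n for j in range(d)]
    std = [math.sqrt(sum((k[j] - mean[j]) ** 2 for k in keys) / n) for j in range(d)]
    return mean, std

def page_score(query, mean, std):
    """Assumed score: q . mean + sum_j |q_j| * std_j (spread acts as an offset)."""
    dot = sum(q * m for q, m in zip(query, mean))
    offset = sum(abs(q) * s for q, s in zip(query, std))
    return dot + offset

def top_k_pages(query, pages, k):
    """Indices of the k highest-scoring pages, best first."""
    scores = [page_score(query, *page_summary(p)) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]

pages = [
    [[0.1, 0.0], [-0.1, 0.0]],  # keys near zero: low score
    [[1.0, 0.0], [1.0, 0.2]],   # consistently aligned with the query
    [[-1.0, 0.0], [3.0, 0.0]],  # same mean as page 1 along x, but one strong key
]
selected = top_k_pages([1.0, 0.0], pages, k=2)
```

The offset term is what lets page 2 outrank page 1 here: both have the same mean along the query direction, but page 2 hides one key with a much larger dot product, which a mean-only score would miss.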

2605.27739 2026-05-28 cs.LG cs.AI

Worker Disagreement Reveals Sharp Directions in Local SGD

工作者分歧揭示局部SGD中的尖锐方向

Tolga Dimlioglu, Kristi Topollai, Anna Choromanska

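Affiliations: New York University

AI summary: Shows through theory and experiments that the covariance of worker-average gaps in Local SGD captures the sharp directions of the Hessian, giving a cheap Hessian-free estimator.

Comments: 5 pages main body, 18 pages appendix - Accepted to HiLD 2026, ICML

AI Chinese abstract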

深度神经网络训练通常表现出高度各向异性的损失几何,其中少数尖锐的主导Hessian方向与大量平坦区域共存。梯度往往不成比例地与这些主导方向对齐,尽管稳定的进展通常需要穿过平坦区域的方向。因此,估计主导子空间是有用的,但使用基于Hessian的直接方法成本高昂。我们表明,标准局部SGD通过工作者分歧暴露了这种几何结构。我们从理论上证明,工作者平均间隙协方差由随机梯度噪声和Hessian曲率塑造,导致工作者沿着尖锐的曲率敏感方向产生分歧。因此,工作者平均间隙提供了主导子空间的廉价无Hessian估计。在MLP、CNN和Transformer上的实验表明,由工作者平均间隙形成的子空间捕获了位于主导Hessian特征空间中的梯度分量的很大一部分。

English abstract

Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a large flatter bulk. Gradients tend to align disproportionately with these dominant directions, although stable progress often requires movement through flatter bulk directions. Estimating the dominant subspace is therefore useful but costly with direct Hessian-based methods. We show that standard Local SGD exposes this geometry through worker disagreement. We theoretically show that the worker-average gap covariance is shaped by stochastic-gradient noise and Hessian curvature, causing workers to disagree along sharp, curvature-sensitive directions. Thus, worker-average gaps provide a cheap Hessian-free estimator of the dominant subspace. Experiments on MLPs, CNNs, and Transformers show that subspaces formed by worker-average gaps capture a substantial fraction of the gradient component lying in the dominant Hessian eigenspace.
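A toy calculation can illustrate why worker gaps might concentrate along sharp directions. The block below is a sketch under simplifying assumptions that are not taken from the paper: a quadratic loss with Hessian $H$, $K$ workers, step size $\eta$, i.i.d. gradient noise $\xi_t^i$ with covariance $\Sigma$, and gaps $\delta_t^i = w_t^i - \bar{w}_t$ that are reset to zero at each synchronization.

```latex
\begin{align*}
w_{t+1}^i &= w_t^i - \eta\,(H w_t^i + \xi_t^i), \\
\delta_{t+1}^i &= (I - \eta H)\,\delta_t^i - \eta\,(\xi_t^i - \bar{\xi}_t), \\
C_{t+1} &= (I - \eta H)\, C_t\, (I - \eta H)
          + \eta^2 \left(1 - \tfrac{1}{K}\right) \Sigma,
\qquad C_t = \mathbb{E}\!\left[\delta_t^i (\delta_t^i)^\top\right].
\end{align*}
```

If the noise covariance is roughly proportional to the Hessian ($\Sigma \approx cH$, a common modelling assumption), then after $\tau$ local steps with $\eta\lambda\tau \ll 1$ the gap variance along an eigenvector with eigenvalue $\lambda$ is about $\eta^2 (1 - 1/K)\, c\, \lambda\, \tau$, so in this toy model disagreement grows with curvature.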

2605.27737 2026-05-28 cs.CV

Bounded-Compute Multimodal Regression for Product-Rating Prediction

有界计算多模态回归用于产品评分预测

William Leach, Ru He, Sizhuo Ma, Yizhen Jia, Min Cao, Jian Wang, Rick Cao

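Affiliations: Snap Inc.

AI summary: For scalar regression under strict latency budgets, proposes a bounded-compute adaptation that replaces the language-modeling head with a lightweight MLP and fixes the inputs, yielding efficient multimodal regression in the LoViF 2026 challenge.

Comments: Accepted to the LoViF Workshop at CVPR 2026. 8 pages, 2 figures

AI Chinese abstract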

视觉语言模型在多模态质量评估中日益受欢迎,但其默认依赖自回归文本生成和动态视觉处理,在严格延迟预算下难以适配标量回归。我们提出一种有界计算适配方法,基于SmolVLM2-256M-Video-Instruct,用于LoViF 2026高效VLM挑战赛中的产品评分预测。受近期多模态参与度预测结果(显示基于特征的回归可优于基于token的分数生成)的启发,我们将语言建模头替换为轻量两层MLP,输入为池化后的解码器状态,并通过固定384x384图像和截断元数据强制执行确定性输入。在受控消融实验中,静态全局图像处理略优于动态平铺,且将训练样本从10万扩展到1600万显著提升了验证相关性。在官方留出评估中,我们的228M参数模型达到了0.39 PLCC和0.40 CES,为资源受限的多模态回归提供了强且可复现的基线。

English abstract

Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.

2605.27736 2026-05-28 cs.LG cs.CV

Explicit Critic Guidance for Aligning Diffusion Models

显式评论家引导的对齐扩散模型

Zhengyang Liang, Qihang Zhang, Ceyuan Yang

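Affiliations: University of Toronto; Vector Institute; The Chinese University of Hong Kong

AI summary: Proposes a state-aligned latent actor-critic framework in which the diffusion model serves as its own timestep-conditioned value function, enabling trajectory-level PPO training and inference-time steering, and outperforming prior methods on single- and multi-reward benchmarks.

AI Chinese abstract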

在线强化学习对于将扩散模型与不可微目标对齐变得越来越重要。然而,现有方法在沿去噪轨迹分配细粒度信用和实现稳定的基于价值的优化方面仍面临限制。我们提出了一种用于扩散后训练的状态对齐潜在演员-评论家框架,其中扩散模型自身作为时间步条件价值函数,并直接在噪声潜在状态上预测价值。这使得轨迹级PPO训练成为可能,通过简单的条件和价值预训练策略支持稳定的演员-评论家优化,并自然地允许学习到的评论家用于推理时引导。我们进一步将框架扩展到多奖励优化,其中与互补奖励的联合训练有助于减轻奖励破解。在基于UNet和DiT的骨干网络上,我们的方法在单奖励和多奖励基准上始终优于先前的组相对RL和演员-评论家基线,同时测试时引导在生成质量上提供了额外提升。

English abstract

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.

2605.27734 2026-05-28 cs.LG

Learn from your own latents and not from tokens: A sample-complexity theory

从自身潜在表示而非token学习:样本复杂度理论

Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart

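Affiliations: Institute of Physics; University of Cambridge; Johns Hopkins University; EPFL

AI summary: Using data from a probabilistic context-free grammar, proves that latent prediction has an exponential sample-complexity advantage over token-level SSL and examines whether explicit multi-scale hierarchies are necessary.

Comments: 10 pages, 5 figures in main. 28 pages, 14 figures, 1 table in all

AI Chinese abstract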

生成模型,从扩散模型到大型语言模型,取得了显著性能,但代价是训练数据量比生物学习所需的大几个数量级。一种替代范式已经出现,其中网络被训练来预测其自身对相关视图或掩码区域的潜在表示,如data2vec和JEPA——这一思想与皮层的预测编码理论相关。尽管有强大的实证结果,但这些方法的理论理解仍然有限。核心问题包括:潜在预测实际上能提高多少数据效率?将此类方法堆叠成多尺度层次结构是否有益?我们使用一个可处理的概率上下文无关语法作为数据来回答这两个问题,该语法捕捉了自然语言和图像的组合结构。这样的语法通过沿深度为$L$的隐藏符号树递归应用产生规则,生成可见token的字符串。对于这样的数据,监督或token级SSL需要样本数量随$L$指数增长才能恢复潜在树;我们证明潜在预测在$L$上(对数因子内)以常数样本量实现这一点。我们通过(i)层次聚类算法,(ii)端到端神经网络(其预测-聚类器模块通过梯度下降在每个层次预测自身的潜在表示),以及(iii)data2vec的首次样本复杂度分析(我们证明其隐式执行层次潜在预测)来确认这一界限。这表明显式堆叠如H-JEPA在很大程度上是冗余的。

English abstract

Generative models, from diffusion models to large language models, achieve remarkable performance but at a cost in training data orders of magnitude larger than what biological learners require. An alternative paradigm has emerged in which networks are trained to predict their own latent representations of related views or masked regions, as in data2vec and JEPA, an idea related to predictive-coding accounts of the cortex. Despite strong empirical results, the theoretical understanding of these methods remains limited. Central questions include: by how much does latent prediction actually improve data efficiency? Is there a benefit to stacking such methods into multi-scale hierarchies? We answer both using as data a tractable probabilistic context-free grammar that captures the compositional structure of natural language and images. Such a grammar generates strings of visible tokens by recursively applying production rules along a tree of hidden symbols of depth $L$. For such data, supervised or token-level SSL require a number of samples exponential in $L$ to recover the latent tree; we prove that latent prediction achieves this with a number of samples constant in $L$, up to logarithmic factors. We confirm this bound with (i) a hierarchical clustering algorithm, (ii) an end-to-end neural network whose predictor-clusterer modules predict their own latents at each level via gradient descent, and (iii) the first sample-complexity analysis of data2vec, which we show implicitly performs hierarchical latent prediction. This suggests that explicit stacking such as H-JEPA is largely redundant.
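The data model described here, a grammar that expands hidden symbols level by level into visible tokens, can be sketched with a tiny sampler. The grammar below is a made-up toy with depth 2 and two symbols per production, not the grammar studied in the paper, and the function name is hypothetical.

```python
import random

def sample_string(rules, root, rng):
    """Expand `root` one level at a time; at each level every symbol is
    rewritten by one production chosen uniformly at random."""
    symbols = [root]
    for level_rules in rules:
        expanded = []
        for symbol in symbols:
            expanded.extend(rng.choice(level_rules[symbol]))
        symbols = expanded
    return symbols

# One dict of productions per level of the hidden tree (toy values).
rules = [
    {"S": [("A", "B"), ("B", "A")]},            # level 1: root -> hidden symbols
    {"A": [("x", "y"), ("y", "y")],             # level 2: hidden -> visible tokens
     "B": [("y", "x"), ("x", "x")]},
]

rng = random.Random(0)  # fixed seed so the sample is reproducible
tokens = sample_string(rules, "S", rng)
```

With branching factor 2 and depth $L$, each sample has $2^L$ visible tokens, while the hidden symbols that generated them never appear in the output. Recovering those hidden symbols from the tokens alone is the task whose sample complexity the abstract discusses.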

2605.27733 2026-05-28 cs.LG

Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

逐元素裁剪能否实现随机梯度的谱控制?

Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich

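Affiliations: Department of Computer Science, Purdue University, West Lafayette, IN, USA; School of Industrial Engineering, Purdue University, West Lafayette, IN, USA

AI summary: Proposes an entry-wise clipping method that exploits the localization property of gradient noise to achieve spectral control while preserving matrix structure, proves a convergence guarantee under Cauchy-contaminated noise, and reports training-token savings in experiments.

AI Chinese abstract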

训练不稳定(如损失尖峰)通常由随机梯度噪声引起。由于语言训练数据中的罕见表达以及多层组合,噪声影响具有重尾特性,且小批量平均无法消除。现有方法在结构与成本之间权衡:向量范数裁剪忽略权重更新的矩阵结构,而谱归一化(如Muon (Jordan et al., 2024))在额外成本下尊重该结构。我们表明这种权衡可以平衡。真实梯度噪声类似于逐元素重尾污染,一阶扰动分析揭示了此类噪声的局部化性质,在此性质下,简单的逐元素方法可实现谱控制。利用这一点,我们推导出高斯信号先验下贝叶斯最优逐元素估计器的可处理替代。我们在Cauchy污染噪声下建立了$O(ε^{-4})$收敛保证。实验发现,平滑收缩改进了NanoGPT预训练上的Adam,节省了约7%的训练令牌。我们进一步发现,在谱归一化之前应用逐元素裁剪,在Muon基础上额外节省约2%的令牌。

English abstract

Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data, and multiple layer composition, the noise impact is heavy-tailed and survives mini-batch averaging. Existing remedies trade off structure against cost: vector-norm clipping ignores the matrix structure of weight updates, while spectral normalization (e.g., Muon (Jordan et al., 2024)) respects it at additional cost. We show that this trade-off can be balanced. Real gradient noise appears to be similar to entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish an $O(ε^{-4})$ convergence guarantee under Cauchy-contaminated noise. Empirically, we find that smooth shrinkage improves Adam on NanoGPT pretraining, saving about 7% of training tokens. We further find that applying the entry-wise clipping before spectral normalization yields a further token saving of about 2% on top of Muon.
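The difference between hard entry-wise clipping and a smooth shrinkage can be shown on a small gradient matrix with one heavy-tailed entry. The shrinkage function used below, `g / sqrt(1 + (g / tau)^2)`, is an illustrative choice and not the surrogate estimator derived in the paper. The threshold `tau` and the matrix values are likewise assumptions.

```python
import math

def hard_clip(grad, tau):
    """Clamp every entry of the matrix to the interval [-tau, tau]."""
    return [[max(-tau, min(tau, g)) for g in row] for row in grad]

def smooth_shrink(grad, tau):
    """g / sqrt(1 + (g / tau)^2): close to g when |g| << tau,
    and strictly bounded by tau in magnitude for large |g|."""
    return [[g / math.sqrt(1.0 + (g / tau) ** 2) for g in row] for row in grad]

# Toy gradient with a single contaminated entry (50.0).
grad = [[0.1, -0.2], [50.0, 0.05]]
clipped = hard_clip(grad, tau=1.0)
shrunk = smooth_shrink(grad, tau=1.0)
```

Both variants act on each entry independently, so the matrix shape and the position of every entry are preserved. Only the outlier is changed substantially, which is the sense in which an entry-wise rule can tame a spike without a full spectral computation.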

2605.27726 2026-05-28 cs.CV

Asynchronous Remote Sensing Time-Series Fusion for Cloud Removal and Anytime Reconstruction

异步遥感时间序列融合用于云去除与任意时间重建

Forouzan Fallah, Chia Yu Hsu, Wenwen Li, Anna Liljedahl, Yezhou Yang

Affiliations: School of Computing and Augmented Intelligence, Arizona State University; School of Geographical Sciences and Urban Planning, Arizona State University; Woodwell Climate Research Center

AI summary: Proposes AGFlow, a time-aligned generative flow-matching model that fuses asynchronous S1/S2 data for cloud removal, missing-frame reconstruction, and anytime querying.

Comments: CVPR 2026 MORSE Workshop

AI Chinese abstract

频繁的云层覆盖严重限制了Sentinel-2 (S2) 光学时间序列在地球表面监测中的可用性。Sentinel-1 (S1) SAR提供了全天候的互补观测,但由于采集不规则且异步,实际的S1/S2融合仍然困难。许多现有方法假设时间对齐的输入(或需要外部最近日期匹配),并且通常仅恢复观测时间戳,限制了长间隙下的重建并阻止了按需合成。我们提出AGFlow(时间对齐生成流匹配),一种用于S1/S2云去除和时间序列重建的时空流匹配模型,具有三种能力:(1) 时间戳条件内部对齐,融合异步S1和含云S2观测,无需基于预处理的配对;(2) 时空上下文感知去噪,联合建模空间结构与时间动态(而非独立的逐像素时间序列);(3) 任意时间查询,能够在监测窗口内的观测时间戳和用户指定时间戳生成无云S2帧。我们在RESTORE-DiT基准协议上进行了评估,包括定量指标、定性比较和组件消融。AGFlow显著改善了完全缺失帧的重建(MAE和RMSE相比RESTORE-DiT降低16-19%),并在持续间隙下提供可靠重建,同时具有竞争力的云去除性能和灵活的时间查询能力,适用于密集植被监测等下游任务。

English abstract

Frequent cloud cover severely limits the usability of Sentinel-2 (S2) optical time series for Earth surface monitoring. Sentinel-1 (S1) SAR provides all-weather complementary observations, but practical S1/S2 fusion remains difficult because acquisitions are irregular and asynchronous. Many existing approaches assume temporally aligned inputs (or require external nearest-date matching) and typically restore only observed timestamps, limiting reconstruction under long gaps and preventing on-demand synthesis. We propose AGFlow (Time Aligned Generative Flow Matching), a spatiotemporal flow-matching model for S1/S2 cloud removal and time-series reconstruction with three capabilities: (1) timestamp-conditioned internal alignment that fuses asynchronous S1 and cloudy S2 observations without preprocessing-based pairing; (2) spatiotemporal, context-aware denoising that models spatial structure jointly with temporal dynamics (rather than independent per-pixel time series); and (3) anytime querying, enabling generation of cloud-free S2 frames at both observed and user-specified timestamps within the monitoring window. We evaluate on the RESTORE-DiT benchmark protocol with quantitative metrics, qualitative comparisons, and component ablations. AGFlow notably improves fully missing-frame reconstruction (MAE and RMSE reduce by 16-19% over RESTORE-DiT) and provides reliable reconstructions under persistent gaps, while also yielding competitive cloud removal performance and flexible temporal querying for downstream tasks such as dense vegetation monitoring.

2605.27724 2026-05-28 cs.RO cs.AI

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

HumanoidMimicGen: 通过全身规划生成行走操作数据

Kevin Lin, Ajay Mandlekar, Caelan Reed Garrett, Nikita Chernyadev, Yu Fang, Runyu Ding, Yuqi Xie, Justin Tran, Linxi Fan, Yuke Zhu

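Affiliations: NVIDIA; The University of Texas at Austin

AI summary: Proposes HumanoidMimicGen, which uses whole-body planning to generate humanoid loco-manipulation demonstrations automatically; policies co-trained with the generated data outperform those trained only on real-world data by 20%.

Comments: website: https://humanoidmimicgen.github.io/

AI Chinese abstract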

模仿学习是训练人形机器人行走和操作的一种有前景的方法,但它需要大量演示,而这些演示通过遥操作收集耗时且困难。现有的数据生成算法可以自动合成操作器的演示,但它们在类人机器人上效果不佳,因为其高维复合动作空间涉及手臂、腿和躯干。我们提出HumanoidMimicGen,一种生成人形机器人腿部行走操作数据的方法。我们的方法将少量源演示中的接触丰富的全身技能适应到新状态,并泛化到物体姿态的变化。通过将这些单臂和双臂技能与全身运动规划和操作规划交替进行,该方法在多样化的场景和布局中生成稳定、无碰撞的数据。为了评估我们的方法,我们引入了一个新的模拟行走操作基准,包含九个测试人形机器人行走操作能力的多样化任务。在那里,我们证明HumanoidMimicGen自动生成用于模仿学习的大规模数据集,并能够系统研究数据生成和策略学习决策如何影响模型性能。我们表明,与仅使用真实世界数据训练的策略相比,与HumanoidMimicGen生成的数据联合训练的全身视觉运动策略性能提升20%。

English abstract

Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demonstrations, which are time-intensive and difficult to collect via teleoperation. Existing data-generation algorithms can automatically synthesize demonstrations for manipulators, but they are ineffective on humanoids because their high-dimensional composite action spaces involve arms, legs, and torsos. We present HumanoidMimicGen, a method for generating humanoid legged loco-manipulation data. Our method adapts contact-rich whole-body skills from a handful of source demonstrations to new states, generalizing across changes in object pose. By interleaving these single- and dual-arm skills with whole-body locomotion and manipulation planning, the method generates stable, collision-free data across diverse scenes and layouts. To evaluate our approach, we introduce a new simulated loco-manipulation benchmark containing nine diverse tasks that test humanoid loco-manipulation capabilities. There, we demonstrate that HumanoidMimicGen automatically generates large datasets for imitation learning and enables a systematic study of how data generation and policy learning decisions impact model performance. We show that whole-body visuomotor policies co-trained with data generated by HumanoidMimicGen outperform those trained only on real-world data by 20%.

2605.27722 2026-05-28 cs.LG

NUCLEUS-MoE: Unified Model of Pool Boiling for Liquid Cooling

NUCLEUS-MoE:池沸腾液冷统一模型

Arthur Feeney, Xianwei Zou, Sheikh Md Shakeel Hassan, Siddhartha Rachabathuni, Aparna Chandramowlishwaran

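Affiliations: Department of Electrical Engineering and Computer Science, University of California, Irvine

AI summary: Proposes NUCLEUS, a mixture-of-experts model that combines neighborhood attention, signed-distance-field reinitialization, and expert routing to model pool boiling across different fluids and operating conditions in a single architecture, with zero-shot and few-shot generalization.

Comments: 12 pages, 9 figures, KDD AI for Science

AI Chinese abstract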

两相沸腾的传热速率比单相冷却高一个数量级,但由于相变、湍流和输运之间的强耦合,以及对流体性质和热力学条件的极端敏感性,其建模仍然困难。现有的基于学习的替代模型要么特定于条件,要么特定于流体,限制了泛化能力并需要单独的模型。我们提出了NUCLEUS,一种用于池沸腾的混合专家模型,它用单一架构取代了专门的替代模型集合。NUCLEUS结合了邻域注意力、用于界面一致性的符号距离场重初始化,以及在不同沸腾动力学中表现出新兴专业化的专家路由。在池沸腾的高保真模拟上训练,NUCLEUS联合模拟了三种流体类别(电介质、制冷剂和低温流体)的饱和和过冷沸腾,解决了先前模型在极端流体上的失败模式。我们表明,专家路由在没有显式监督的情况下表现出连贯的空间结构和专业化。定量上,NUCLEUS在保持异质沸腾配置的物理一致性的同时,达到或超过了基线。我们还展示了在诸如用于浸没冷却的新型流体(Opteon 2P50)等下游任务上的零样本和小样本泛化能力。这些结果表明,混合专家模型是走向沸腾动力学统一替代建模的可扩展途径,并为科学机器学习中的更广泛泛化奠定了基础。

English abstract

Two-phase boiling enables heat transfer rates an order of magnitude higher than single-phase cooling, but it remains difficult to model due to the strong coupling between phase change, turbulence, and transport, as well as extreme sensitivity to fluid properties and thermodynamic conditions. Existing learning-based surrogates are either condition- or fluid-specific, limiting generalization and requiring separate models. We present NUCLEUS, a mixture-of-experts model for pool boiling that replaces collections of specialized surrogates with a single architecture. NUCLEUS combines neighborhood attention, signed distance field reinitialization for interface consistency, and expert routing that exhibits emergent specialization across distinct boiling dynamics. Trained on high-fidelity simulations of pool boiling, NUCLEUS jointly models saturated and subcooled boiling across three fluid classes (dielectrics, refrigerants, and cryogens), resolving failure modes of prior models on extreme fluids. We show that expert routing exhibits coherent spatial structure and specialization without explicit supervision. Quantitatively, NUCLEUS matches or exceeds baselines while maintaining physical consistency across heterogeneous boiling configurations. We also show zero-shot and few-shot generalization capabilities on downstream tasks such as a new fluid (Opteon 2P50 developed for immersion cooling). These results demonstrate that mixture-of-experts models are a scalable pathway toward unified surrogate modeling of boiling dynamics and lay the groundwork for broader generalization across scientific ML.

2605.27721 2026-05-28 cs.CL cs.AI

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

Cheng Qian, Jiayu Liu, Heng Ji

Affiliation * University of Illinois Urbana-Champaign

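AI summary Proposes the UserHarness framework, which performs Theory-of-Mind reasoning by explicitly reconstructing the user's mental state (beliefs, intentions, and so on). It reaches up to 95.94% macro accuracy across five benchmarks, a relative improvement of more than 15% over existing inference methods.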

Comments 19 Pages, 4 Figures, 2 Tables

Details
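AI abstract (translated from Chinese)

Understanding a user's beliefs and intentions is essential to building effective agent assistants. This ability is usually evaluated with Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. However, many existing methods handle ToM through complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This overlooks the core structure of the problem: users act on their beliefs, and beliefs are updated by observing the environment; beliefs and intentions jointly determine actions, and actions in turn change the environment; social reasoning often requires nested beliefs about other people's beliefs or intentions. We propose UserHarness, a simple framework that recasts ToM reasoning as explicit reconstruction of the user's mind. UserHarness decomposes the user's mental state, its relation to the external environment, and the resulting actions, so that an agent can track the user's observations, beliefs, intentions, and behavior. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, a relative improvement of more than 15% over existing inference methods and about 20% over the strongest prompt-only harness. These results suggest that robust user understanding requires reasoning from the roots of the user's mind, and they position user harnessing as a promising foundation for more adaptive future assistants.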

English abstract

Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user's mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user's mind, positioning user harnessing as a promising foundation for more adaptive future assistants.
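
The belief-update structure the abstract describes (beliefs change only through what the user observes, and actions follow from beliefs rather than from the true state) can be illustrated with a classic false-belief scenario. This is a minimal sketch, not the UserHarness implementation: the `UserMind` class, its fields, and the helper functions are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class UserMind:
    """Hypothetical container for one user's mental state."""
    wanted_object: str = ""                      # intention: what the user wants to find
    present: bool = True                         # whether the user can observe events
    beliefs: dict = field(default_factory=dict)  # object -> believed location

    def observe(self, obj, location):
        # Beliefs update only from events the user actually witnesses.
        if self.present:
            self.beliefs[obj] = location


def move(world, user, obj, location):
    """Change the true world state, then let the user observe it if present."""
    world[obj] = location
    user.observe(obj, location)


def predicted_search_location(user):
    """The user acts on belief, not on the true state of the world."""
    return user.beliefs.get(user.wanted_object)


world = {}
user = UserMind(wanted_object="ball")

move(world, user, "ball", "basket")  # user sees the ball placed in the basket
user.present = False                 # user leaves the room
move(world, user, "ball", "box")     # ball is moved while the user is away

print(world["ball"])                    # true location: box
print(predicted_search_location(user))  # user's belief: basket
```

Keeping the world state and the user's beliefs in separate structures is what lets the two diverge; nested beliefs about other agents would need one such structure per perspective.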

2605.27720 2026-05-28 cs.LG stat.AP

Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation

Fei Jiang, Lei Yang

Affiliation * Independent Researcher

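AI summary Addresses deployment uncertainty for learned autonomous controllers validated on a finite number of simulations. Proposes a deployment approval framework based on Bayesian posterior inference, which uses posterior approval probability and posterior deployment risk to give an uncertainty-calibrated assessment.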

Comments 16 pages, 4 figures and 4 tables

Details
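AI abstract (translated from Chinese)

Reinforcement-learning and data-driven autonomous controllers are usually evaluated by cumulative reward and by empirical success frequency over a finite number of simulated trajectories. These empirical metrics, however, do not necessarily provide sufficient statistical evidence of deployment readiness under uncertainty. This paper develops a Bayesian approval framework for learned autonomous landing controllers under finite rollout evidence. A probabilistic landing-capability formulation is introduced based on satisfying touchdown safety under uncertain operating conditions, and Bayesian posterior inference is used to quantify uncertainty about the true deployment capability of the learned policy. Posterior approval probability and posterior deployment risk are further introduced for deployment-oriented evaluation, along with a sequential validation framework that supports approve/reject/continue decisions during progressive rollout testing. Simulation experiments with PPO and SAC controllers show that, with limited validation evidence, empirical success and reward optimization can lead to overconfident interpretations of deployment readiness, whereas posterior approval inference gives a better uncertainty-calibrated assessment. The proposed framework provides a practical statistical link between conventional reinforcement-learning evaluation and deployment-oriented validation under uncertainty, and it may generalize to broader classes of learned autonomous systems.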

English abstract

Reinforcement learning and data-driven autonomous controllers are commonly evaluated using cumulative reward and empirical success frequency under finite simulation trajectories. However, such empirical metrics do not necessarily provide sufficient statistical evidence regarding deployment readiness under uncertainty. This work develops a Bayesian approval framework for learned autonomous landing controllers under finite rollout evidence. A probabilistic landing capability formulation is introduced based on touchdown safety satisfaction under uncertain operating conditions, while Bayesian posterior inference is used to quantify uncertainty regarding the true deployment capability of learned policies. Posterior approval probability and posterior deployment risk are further introduced for deployment-oriented evaluation, together with a sequential validation framework supporting approve/reject/continue decisions during progressive rollout testing. Simulation experiments using PPO and SAC controllers demonstrate that empirical success and reward optimization may produce overconfident deployment interpretation under limited validation evidence, whereas posterior approval inference provides a more uncertainty-calibrated assessment of deployment readiness. The proposed framework provides a practical statistical connection between conventional reinforcement-learning evaluation and deployment-oriented validation under uncertainty and may be generalized to broader classes of learned autonomous systems.
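
To make the idea of a posterior approval probability concrete, here is a minimal sketch under assumptions that the abstract does not state: each rollout is treated as an independent success/failure trial, the prior on the true success rate p is uniform Beta(1, 1), and approval probability is taken to mean P(p >= required_rate | rollouts). The required rate and the approve/reject cutoffs are illustrative values, not the paper's.

```python
from math import comb


def posterior_approval_probability(successes, failures, required_rate):
    """Return P(p >= required_rate | rollouts) under a uniform Beta(1, 1) prior.

    The posterior is Beta(successes + 1, failures + 1). For integer
    parameters its upper tail equals a binomial sum:
    P(p >= t) = P(Binomial(n + 1, t) <= successes), n = successes + failures.
    """
    n = successes + failures
    return sum(
        comb(n + 1, j) * required_rate**j * (1.0 - required_rate) ** (n + 1 - j)
        for j in range(successes + 1)
    )


def decide(successes, failures, required_rate=0.9, approve_at=0.95, reject_at=0.05):
    """Hypothetical approve/reject/continue rule on the approval probability."""
    prob = posterior_approval_probability(successes, failures, required_rate)
    if prob >= approve_at:
        return "approve"
    if prob <= reject_at:
        return "reject"
    return "continue"


# Ten successes in ten rollouts is a 100% empirical success rate, yet the
# posterior probability that p >= 0.9 is only about 0.69.
print(round(posterior_approval_probability(10, 0, 0.9), 3))
print(decide(10, 0))  # continue
print(decide(50, 0))  # approve
print(decide(5, 5))   # reject
```

The first case shows the kind of gap the abstract points to: a perfect empirical record over few rollouts does not by itself justify approval, so the sequential rule asks for more testing.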