arXivDaily: arXiv daily academic digest, updated Monday through Friday
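2605.26112 2026-05-26 cs.AI cs.LG Version update

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Shangding Gu

Affiliations: UC Berkeley

AI summary: This paper argues that the next bottleneck in agentic AI is system scaling rather than model scaling alone. It proposes designing auditable, persistent, modular, and verifiable architectures (the "harness") and examines three bottlenecks (context governance, trustworthy memory, and dynamic skill routing) to turn model capability into long-horizon task execution.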

Abstract

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

2605.26111 2026-05-26 cs.CV cs.AI cs.GR cs.LG cs.MM Version update

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski

Affiliations: University of Toronto & Vector Institute; Adobe; Google

AI summary: Proposes a method that combines a multimodal large language model with VAE-based identity conditioning. A dual-layer aggregation module and a multi-stage denoising strategy balance multimodal understanding against identity preservation in subject-driven image generation, outperforming existing methods.

Comments: 33 pages, 18 figures, Project Page: https://zsh2000.github.io/squeeze-mllm-subject-gen/

Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment this with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from the MLLM and fine-detail identity from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.

2605.26100 2026-05-26 cs.SE cs.AI Version update

Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models

Bar Weiss, Antonio Abu-Nassar, Adi Sosnovich, Karen Yorav

Affiliations: Viterbi Faculty of Electrical and Computer Engineering; IBM Research

AI summary: Proposes a two-stage pipeline that uses large language models for taxonomy-based labeling of changes in code patches, capturing structural relationships and semantic attributes to make code review more efficient.

Comments: 13 pages, 6 figures

Abstract

Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, together with the widespread adoption of AI code assistants, make manual review increasingly challenging. Identifying the types of changes within a patch, such as renames, moves, or logic modifications, can substantially improve review efficiency by enabling prioritization, filtering, and automation. However, existing LLM-based approaches to code review have largely focused on summarization and comment generation, leaving structured code reviews underexplored. In this paper, we present a systematic study of using large language models (LLMs) for taxonomy-based labeling of code changes in a code patch. We introduce a two-stage pipeline that assigns labels to diff hunks and then refines them to capture structural relationships and semantic attributes, such as rename propagation and type changes. Our approach employs few-shot prompting to produce language-agnostic and customizable labels, without the engineering overhead of traditional static-analysis pipelines. We evaluate four LLMs across multiple context configurations on a manually curated benchmark of natural and synthetic patches. Our best configuration achieves up to $84\%$ recall and $81\%$ precision, with high accuracy in extracting relational and attribute metadata. These results suggest that LLM-based labeling can effectively complement static analysis by enabling flexible, multilingual, and automation-friendly code review workflows.
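The recall and precision figures quoted above can be made concrete with a small scoring sketch. The abstract does not say how the scores are aggregated, so the micro-averaging over per-hunk label sets below is an assumption, and the function name, hunk ids, and label names are hypothetical illustrations rather than the authors' taxonomy.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    predicted: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-hunk label sets.

    Assumption: each diff hunk may carry several labels, and a predicted
    label counts as correct only if the same label is in the gold set
    for the same hunk.
    """
    tp = fp = fn = 0
    for hunk_id in set(predicted) | set(gold):
        pred = predicted.get(hunk_id, set())
        true = gold.get(hunk_id, set())
        tp += len(pred & true)   # labels predicted and present in gold
        fp += len(pred - true)   # labels predicted but absent from gold
        fn += len(true - pred)   # gold labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical hunks and labels, for illustration only.
gold = {"h1": {"rename"}, "h2": {"move", "logic"}, "h3": {"logic"}}
predicted = {"h1": {"rename"}, "h2": {"move"}, "h3": {"logic", "rename"}}
precision, recall = micro_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the model misses one gold label and adds one spurious label, so both scores land at 0.75.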

2605.26086 2026-05-26 cs.AI Version update

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao, Dandan Tu

Affiliations: Beijing Institute of Technology; Huawei Technologies Co., Ltd; Peking University; Institute of Automation, Chinese Academy of Sciences

AI summary: Proposes the Claw-Anything benchmark, which evaluates large language model agents in always-on settings along three dimensions: long-horizon activity histories, interdependent backend services, and GUI and CLI interaction across multiple devices. GPT-5.5 reaches only 34.5% pass@1, and the released automated data-generation pipeline improves the base model by 23.7%.

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating the utility of scalable data infrastructure.
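For readers unfamiliar with the pass@1 metric reported above, here is a minimal sketch of the widely used unbiased pass@k estimator. The abstract does not state how Claw-Anything computes pass@1, so treating it as this estimator averaged over tasks is an assumption, and the per-task counts in the example are made up.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of which passed.

    Equals the probability that at least one of k attempts drawn
    without replacement from the n recorded attempts is a pass.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so every draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks given (n_attempts, n_passed) pairs."""
    results = list(results)
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up per-task results: (attempts, passes).
toy_results = [(4, 1), (4, 0), (4, 4)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.3f}")
```

For k = 1 the estimator reduces to the fraction of passing attempts per task, averaged over tasks.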

2605.26081 2026-05-26 cs.AI Version update

VeriTrace: Evolving Mental Models for Deep Research Agents

Haolang Zhao, Yunbo Long, Lukas Beckenbauer, Alexandra Brintrup

Affiliations: Department of Engineering, University of Cambridge; TUM School of Management, Technical University of Munich; The Alan Turing Institute

AI summary: To address the information uncertainty that deep research agents face, proposes VeriTrace, a cognitive-graph framework that evolves the agent's mental model through explicit feedback loops (interpretive update, deviation feedback, and schema revision), with notable gains on DeepResearch Bench and DeepConsult.

Abstract

Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate representations should look like, but leave their evolution to the LLM's implicit reasoning. Without explicit regulation, the intermediate layer is easily contaminated by mixed-quality information and propagates errors along its dependencies, so model scale often ends up substituting for absent regulation. We argue that an agent's mental model should instead evolve through explicit feedback that continuously aligns task understanding with reality, and identify three regulatory loops: interpretive update, deviation feedback, and schema revision. We realise this in VeriTrace, a cognitive-graph framework that explicitly implements the three loops. Using matched Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench (DRB) Insight (1.49 pp Overall) and by 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DRB.

2605.26074 2026-05-26 cs.CL cs.AI q-fin.GN Version update

StakeBench: Evaluating Language Understanding Grounded in Market Commitment

Yunhua Pei, Jingyu Hu, Yiwei Shi, Hongnan Ma, Weiru Liu, John Cartlidge

Affiliations: University of Bristol

AI summary: Proposes the StakeBench framework, which links market comments to verifiable trading records so that supervision signals are derived automatically from market behavior, and evaluates how well language models understand market commitment.

Comments: 21 pages, 2 figures, 20 tables. Preprint. Dataset and evaluation code included

Abstract

Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what speakers have committed to in the market. We introduce StakeBench, an evaluation framework for language understanding grounded in market commitment. StakeBench links 560,876 comments from 2,261 resolved markets to verified position, action, and market-odds records across Polymarket and Manifold. Supervision is derived from observable market behavior. Position sides, post-comment trading actions, and market-odds trajectories replace human annotation. Four diagnostic tasks test whether models detect market commitment, identify the revealed side, anticipate future action, and perform collective odds projection. Three commitment-aware metrics measure alignment with revealed preferences rather than perceived sentiment. Validity audits and explicit interpretation boundaries help distinguish observable commitment signals from latent belief and causal market-odds impact. Across 15 LLMs and 18 topics and platform settings, models partially recover position-side signals, with Directed Accuracy from 0.506 to 0.599, but show structural failures on later tasks. Ten of the fifteen models collapse to one or two action labels in future action anticipation, and no model consistently improves on the naive odds-direction baseline in collective odds projection. Model scale is not correlated with performance, finance-domain tuning does not improve revealed-side identification, and platform incentives strongly shape higher-order results. StakeBench is packaged with evaluation code and dataset under CC-BY 4.0.

2605.26067 2026-05-26 cs.LG cs.AI Version update

Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding

Rustem Takhanov, Zhenisbek Assylbekov

Affiliations: Department of Mathematics, Nazarbayev University, Astana, Kazakhstan; Nazarbayev University Research Administration, Astana, Kazakhstan; Purdue University Fort Wayne, Indiana, USA

AI summary: By reducing conditional KRR to KRR with a residual kernel, this paper analyzes its statistical properties and identifies conditions under which it outperforms standard KRR in the kernel principal component and random feature settings.

Comments: Accepted to ICML 2026

Abstract

Conditionally positive definite (CPD) kernels are defined with respect to a function class $\mathcal{F}$. It is well known that such a kernel $K$ is associated with its native space (defined analogously to an RKHS), which in turn gives rise to a learning method -- called conditional kernel ridge regression (conditional KRR) due to its analogy with KRR -- where the estimated regression function is penalized by the square of its native space norm. This method is of interest because it can be viewed as classical linear regression, with features specified by $\mathcal{F}$, followed by the application of standard KRR to the residual (unexplained) component of the target variable. Methods of this type have recently attracted increasing attention. We study the statistical properties of this method by reducing its behavior to that of KRR with another fixed kernel, called the residual kernel. Our main theoretical result shows that such a reduction is indeed possible, at the cost of an additional term in the expected test risk, bounded by $\mathcal{O}(1/\sqrt{N})$, where $N$ is the sample size and the hidden constant depends on the class $\mathcal{F}$ and the input distribution. This reduction enables us to analyze conditional KRR in the case where $K$ is positive definite and $\mathcal{F}$ is given by the first $k$ principal eigenfunctions in the Mercer decomposition of $K$. We also consider the setting where $\mathcal{F}$ consists of $k$ random features from a random feature representation of $K$. It turns out that these two settings are closely related. Both our theoretical analysis and experiments confirm that conditional KRR outperforms standard KRR in these cases whenever the $\mathcal{F}$-component of the regression function is more pronounced than the residual part.
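The idea of penalizing only the part of the regression function outside $\mathcal{F}$ can be written as a linear system. The following is the standard semi-parametric form for a kernel estimator with an unpenalized feature class, given as a sketch under assumed notation that does not appear in the abstract: training inputs $x_1,\dots,x_N$, targets $y \in \mathbb{R}^N$, a basis $\varphi_1,\dots,\varphi_k$ of $\mathcal{F}$, kernel matrix $\mathbf{K}_{ij} = K(x_i, x_j)$, feature matrix $\mathbf{P}_{ij} = \varphi_j(x_i)$, and regularization strength $\lambda > 0$ with a $1/N$ loss normalization. The paper's own scaling conventions may differ.

```latex
% Estimator: kernel expansion plus an unpenalized component in F
\hat f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + \sum_{j=1}^{k} \beta_j \varphi_j(x)

% Objective: only \alpha is penalized, subject to P^T \alpha = 0
\min_{\alpha,\beta}\;
\frac{1}{N}\,\bigl\| y - \mathbf{K}\alpha - \mathbf{P}\beta \bigr\|_2^2
+ \lambda\, \alpha^{\top}\mathbf{K}\alpha
\quad \text{s.t.} \quad \mathbf{P}^{\top}\alpha = 0

% Stationarity conditions as one saddle-point system
\begin{pmatrix}
\mathbf{K} + N\lambda\,\mathbf{I} & \mathbf{P} \\
\mathbf{P}^{\top} & \mathbf{0}
\end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
=
\begin{pmatrix} y \\ 0 \end{pmatrix}
```

The constraint $\mathbf{P}^{\top}\alpha = 0$ is what leaves the $\mathcal{F}$-component free of penalty. As a sanity check, if $y = \mathbf{P}\beta^{\ast}$ lies exactly in the span of the features, the system is solved by $\alpha = 0$, $\beta = \beta^{\ast}$, so the kernel part only has to explain the residual.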

2605.26061 2026-05-26 cs.LG cs.AI Version update

Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

Waleed Razzaq, Yun-Bo Zhao

Affiliations: Department of Automation, University of Science & Technology of China, Hefei, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Proposes NSAC, a biologically inspired continuous-time attention architecture that induces a Gaussian distribution over logits through an Ornstein-Uhlenbeck stochastic differential equation and an NCP gating mechanism, yielding probabilistic outputs and uncertainty quantification.

Abstract

Reliable quantification of uncertainty estimates in continuous-time (CT) representation learning remains nascent, particularly within CT attention architectures. We introduce the Neuronal Stochastic Attention Circuit (NSAC), a novel biologically inspired CT attention architecture that reformulates attention logit computation as the solution of an Ornstein-Uhlenbeck stochastic differential equation modulated by input-dependent, nonlinear interlinked gates derived from the repurposed wiring mechanism of C. elegans Neuronal Circuit Policies (NCPs). It induces a Gaussian distribution over logits that propagates principled stochasticity through a logistic-normal distribution over attention weights to yield probabilistic output. A two-term objective function combining Gaussian negative log-likelihood with an epistemic-separation regularizer enforces higher predictive variance and enables joint quantification of aleatoric and epistemic uncertainty. Empirically, we implement NSAC in a diverse set of learning tasks including: (i) irregular CT function approximation; (ii) multivariate regression; (iii) long-range forecasting; (iv) Industry 4.0; and (v) the lane-keeping of autonomous vehicles. We observe that NSAC remains competitive against several baselines in terms of accuracy and produces reasonably well-calibrated uncertainty estimates while being interpretable at the neuronal cell level.
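The chain from an Ornstein-Uhlenbeck (OU) process to Gaussian logits to logistic-normal attention weights can be illustrated with a toy sketch. It uses the closed-form OU transition for constant coefficients; in NSAC the coefficients are modulated by input-dependent NCP-derived gates, which this sketch does not model. All parameter names and values are illustrative assumptions.

```python
import math
import random
from typing import List, Tuple


def ou_moments(z0: float, mu: float, theta: float, sigma: float, t: float) -> Tuple[float, float]:
    """Mean and variance at time t of dz = theta*(mu - z)*dt + sigma*dW, z(0) = z0."""
    decay = math.exp(-theta * t)
    mean = mu + (z0 - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_attention(z0s: List[float], mu: float, theta: float, sigma: float,
                     t: float, rng: random.Random) -> List[float]:
    """Draw Gaussian logits from the OU transition, then map them to the simplex.

    Softmax of Gaussian logits gives logistic-normal attention weights.
    """
    logits = []
    for z0 in z0s:
        mean, var = ou_moments(z0, mu, theta, sigma, t)
        logits.append(rng.gauss(mean, math.sqrt(var)))
    return softmax(logits)


rng = random.Random(0)
weights = sample_attention([2.0, 0.0, -1.0], mu=0.0, theta=1.0, sigma=0.5, t=0.3, rng=rng)
print(weights)
```

Repeating the draw many times gives a distribution over attention weights whose spread can serve as an uncertainty signal. As t grows, the logit variance approaches the stationary value sigma**2 / (2 * theta).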

2605.26045 2026-05-26 cs.CL cs.AI Version update

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech

Affiliations: University of Turin; University of Southern Denmark

AI summary: Studies six confidence-estimation methods for activation oracles and finds that bootstrap mode frequency is better calibrated than the others (ECE 5.7% vs. 25.5% for answer-word log-probability on Qwen3-8B), while the log-prob baseline can serve as a fast triage signal.

Abstract

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
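The calibration numbers above are expected calibration error (ECE) values. The sketch below shows the usual equal-width-bin ECE, plus one plausible reading of a mode-frequency confidence score: the share of repeated oracle answers that agree with the most common answer. The abstract does not define either procedure in detail, so the binning scheme, the helper names, and the interpretation of "mode frequency" are assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


def mode_frequency(answers: Sequence[str]) -> Tuple[str, float]:
    """Hypothetical helper: most common answer and the fraction of samples agreeing with it."""
    mode, count = Counter(answers).most_common(1)[0]
    return mode, count / len(answers)


# Made-up data for illustration.
confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False, True, False]
ece = expected_calibration_error(confs, hits)
answer, confidence = mode_frequency(["yes", "yes", "no", "yes"])
print(f"ECE={ece:.3f} answer={answer} confidence={confidence:.2f}")
```

A lower ECE means the stated confidence tracks empirical accuracy more closely, which is the sense in which the abstract compares methods.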

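2605.26040 2026-05-26 cs.AI Version update

L2IR: Revealing Latent Intent in Graph Fraud Detection

Jinsheng Guo, Zhenhao Weng, Yibo Liu, Yan Qiao, Meng Li

Affiliations: Hefei University of Technology

AI summary: Proposes the L2IR framework, which uses large language models to extract latent intent from user behaviors and suspicious connections and adds adaptive self-training for robustness, improving the AUPRC of GNN-based detectors by up to 8.27% on datasets with pervasive camouflage.

Comments: 12 pages, 6 figures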

Abstract

Graph fraud detection has long depended on Graph Neural Networks (GNNs) to propagate and aggregate information across relational data. A critical obstacle in practice, however, is that fraudsters frequently disguise themselves by forging numerous connections with benign users, causing fraud signals to be progressively diluted during neighborhood aggregation and undermining detection reliability. While recent efforts have used Large Language Models (LLMs) to provide rich semantic cues for fraud detection, the underlying intent behind suspicious connections remains insufficiently explored. Compounding this issue, the scarcity of annotated fraud samples makes it difficult to train detectors that remain robust under heavy camouflage. To address these gaps, we propose L2IR, an LLM-driven Latent Intent Revealing framework for graph fraud detection. By uncovering latent intent from both user behaviors and suspicious connections, L2IR extracts intent-aware representations from raw behavioral traces and reasons about the true purpose behind individual connections, effectively distinguishing supportive links from misleading ones. It further incorporates adaptive self-training to enhance robustness under limited supervision. Evaluations on two real-world datasets characterized by pervasive camouflage demonstrate that L2IR surpasses strong baselines and can function as a plug-in enhancement for a range of GNN-based detectors, improving AUPRC by up to 8.27%.
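AUPRC, the metric quoted above, is commonly estimated as average precision over a ranked list of scores. The sketch below shows that estimator on made-up scores; the abstract does not say which estimator the authors use, and this version ignores tied scores for simplicity.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision: mean of precision@rank taken at each positive item.

    labels use 1 for fraud (positive) and 0 for benign. Ties between
    scores are broken by input order, which is a simplification.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Made-up detector scores and ground-truth labels.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(f"AP = {ap:.4f}")
```

Because fraud is rare, this metric is more informative than accuracy: a detector that buries fraudsters below many benign users gets low precision at every positive rank.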

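2605.26038 2026-05-26 cs.CV cs.AI Version update

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

Affiliations: Shanghai Jiao Tong University

AI summary: Lightweight vision-language models lack explicit visual grounding in dense-scene reasoning, which makes their reasoning chains unreliable. DRScaffold is a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural changes and substantially improving dense-scene reasoning.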

Abstract

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold.

2605.26036 2026-05-26 cs.AI cs.LG Version update

CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities

Junyuan Liu, Xinglei Wang, Zichao Zeng, Jiazhuang Feng, Quan Qin, Ilya Ilyankou, Guangsheng Dong, Tao Cheng

Affiliations: SpaceTimeLab, University College London, UK; 3DIMPact, University College London, UK; School of Resource and Environmental Sciences, Wuhan University, China; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China

AI summary: Proposes the CityRep benchmark, which uses spatially structured splits to evaluate urban representations across modalities, cities, and tasks, addressing the spatial leakage and inflated performance caused by random splits.

Abstract

Urban representation learning encodes complex urban environments into general-purpose embeddings for diverse downstream tasks and emerging urban foundation models. However, current evaluations are limited, typically focusing on one or two cities and tasks and relying on random splits that introduce spatial leakage, leading to inflated performance and weak support for cross-location generalization and fair comparison. To address this, we propose CityRep, a unified benchmark that evaluates urban representations across data modalities, cities, and tasks using spatially structured splits. CityRep consists of three key components: (1) a spatial unit-agnostic evaluation framework that supports heterogeneous urban representations through a standardized alignment module; (2) a unified evaluation protocol using block-based spatial splits to mitigate spatial leakage and enable rigorous model comparison; and (3) an extensible multi-city, multi-task benchmark suite spanning 8 cities and 8 tasks across regression, classification, and distribution prediction. We evaluate 11 representative urban representation models. Results show that performance is highly sensitive to the split protocol, with random splits inflating scores and altering model rankings. We also observe substantial variability across cities and tasks, underscoring the need for generalization-aware evaluation. CityRep is released as a reproducible benchmark with datasets, evaluation pipelines, and diagnostic tools to facilitate fair comparison and support future research in urban representation learning towards urban foundation models.
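The contrast between random splits and block-based spatial splits can be sketched in a few lines. The version below assigns each location to a square grid block and sends whole blocks to either train or test, so nearby locations never straddle the split. The grid scheme, block size, and function names are assumptions for illustration; CityRep's actual protocol may define blocks differently.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Block = Tuple[int, int]


def block_id(x: float, y: float, block_size: float) -> Block:
    """Index of the square grid block that contains the point (x, y)."""
    return (int(x // block_size), int(y // block_size))


def spatial_block_split(points: Sequence[Tuple[float, float]],
                        block_size: float,
                        test_fraction: float,
                        seed: int = 0) -> Tuple[List[int], List[int], Set[Block]]:
    """Split point indices so that each block lies entirely in train or in test."""
    blocks: Dict[Block, List[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        blocks[block_id(x, y, block_size)].append(idx)
    keys = sorted(blocks)                       # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test_blocks = set(keys[:n_test])
    test = [i for k in keys[:n_test] for i in blocks[k]]
    train = [i for k in keys[n_test:] for i in blocks[k]]
    return train, test, test_blocks


# A toy 4x4 grid of locations, grouped into four 2x2 blocks.
points = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
train_idx, test_idx, held_out = spatial_block_split(points, block_size=2.0, test_fraction=0.25)
print(len(train_idx), len(test_idx), held_out)
```

With a random split, a test location would usually have training neighbors a single grid cell away, which is the spatial leakage the abstract describes. Holding out whole blocks removes most of that proximity.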

2605.26032 2026-05-26 cs.CV cond-mat.stat-mech cs.AI cs.LG Version update

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman, Congyue Deng, Marin Soljačić

Affiliations: Department of Physics, Massachusetts Institute of Technology; Department of EECS, Massachusetts Institute of Technology; NSF AI Institute for Artificial Intelligence and Fundamental Interactions; Institute for Data, Systems and Society, Massachusetts Institute of Technology

AI summary: Proposes SKILD, a scale-invariant diffusion model that unifies image generation and continuous super-resolution; switching between tasks requires changing only the starting timestep.

Comments: 29 pages, 17 figures

Abstract

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\textbf{SKILD}$, a $\textbf{S}$cale-invariant $\textbf{K}$-Space $\textbf{I}$mage $\textbf{L}$earning $\textbf{D}$iffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage it to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs generation and continuous super-resolution by varying only the starting timestep: $\textit{no task-specific architecture, no conditioning branch, no classifier-free guidance, no retraining per scale factor}$. Empirically, SKILD reaches FID $2.65$ and Inception Score $9.63$ on unconditional CIFAR-10, performs $2\times$--$8\times$ super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.

2605.26026 2026-05-26 cs.CV cs.AI cs.LG 版本更新

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

一种用于光片荧光显微镜的多模态3D基础模型实现少样本分割、分类和去模糊

Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl, Ali Erturk, Zhuhao Wu, Johannes C. Paetzold

发表机构 * Tri-Institutional Program in Computational Biology \& Medicine, Weill Cornell Medicine, New York, NY, USA Department of Radiology, Weill Cornell Medicine, New York, NY, USA Helen Robert Appel Alzheimers Disease Research Institute, Feil Family Brain Mind Research Institute, Weill Cornell Medicine, New York, NY, USA Graduate Program in Physiology, Biophysics Systems Biology, Weill Cornell Medicine, New York, NY, USA Cornell Tech, New York, NY, USA Institute for Intelligent Biotechnologies (iBIO), Helmholtz Center Munich, Neuherberg, Germany Institute for Stroke Dementia Research, Klinikum der Universität München, Ludwig-Maximilians University Munich, Munich, Germany

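AI summary: Proposes a 3D foundation model pretrained on light sheet fluorescence microscopy data by jointly optimizing masked reconstruction and image-text alignment; few-shot adaptation substantially reduces annotation cost and improves segmentation, classification, and deblurring performance.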

Comments 11 pages, 3 figures

详情
AI中文摘要

光片荧光显微镜(LSM)能够对生物样本进行高分辨率三维(3D)成像,提供丰富的体积数据用于研究细胞组织、病理学和血管网络。然而,LSM数据的大小、维度和标注负担使得监督深度学习方法成本高昂且难以扩展。此外,尽管存在大量未标注的LSM体积数据,但由于计算挑战和体积表示学习的复杂性,针对该模态的基础模型仍未得到充分探索。在这项工作中,我们引入了一个用于LSM数据的3D基础模型,该模型在涵盖多种生物体、染色和成像协议的大型精选3D图像集合上进行了预训练。通过联合优化掩码重建和图像-文本对齐,我们学习了可迁移的体积表示。预训练骨干网络大幅降低了标注负担,实现了针对多种下游任务的高效少样本适应。我们在下游分割、分类和去模糊任务上评估了该方法。结果表明,我们的方法在(1)使用标准评估指标衡量时以及(2)经过领域专家严格评估时,均持续优于基线。这凸显了基础模型预训练在减少标注需求的同时提升多样化LSM分析任务性能的潜力。预训练模型权重以及预训练和微调的代码已公开:https://github.com/AdinaScheinfeld/lsm_fm_public_repo.git。

英文摘要

Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of unannotated LSM volumes, foundation models for this modality remain underexplored due to computational challenges and the complexity of volumetric representation learning. In this work, we introduce a 3D foundation model for LSM data, pretrained on a large curated collection of 3D images spanning multiple organisms, stains, and imaging protocols. We learn transferable volumetric representations by jointly optimizing for masked reconstruction and image-text alignment. The pretrained backbone drastically reduces the annotation burden, enabling efficient, few-shot adaptation for varied downstream tasks. We evaluate this approach on downstream segmentation, classification, and deblurring. Our results demonstrate consistent improvements over baselines, (1) when measured using standard evaluation metrics and (2) when rigorously assessed by domain experts. This highlights the potential of foundation model pretraining to reduce annotation requirements while improving performance across diverse LSM analysis tasks. Pretrained model weights and code for pretraining and finetuning are publicly available: https://github.com/AdinaScheinfeld/lsm_fm_public_repo.git.

2605.26019 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service

智利服务条款中潜在滥用条款的检索增强检测

Christoffer Loeffler, Tomás Rey Pizarro, Daniel Ignacio Miranda Vásquez, Andrea Martínez Freile

发表机构 * School of Computer Engineering, Pontificia Universidad Católica de Valparaíso(Pontificia Universidad Católica de Valparaíso计算机工程学院) Faculty of Law, Universidad Adolfo Ibáñez(Adolfo Ibáñez大学法学院)

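AI summary: Proposes a retrieval-augmented generation framework that combines hybrid dense-sparse retrieval with prompt augmentation to automatically detect and classify potentially abusive clauses in Chilean Terms of Service, and introduces a corpus of 100 contracts and 10,029 annotated clauses; experiments show that the approach substantially improves performance and lets local models approach cloud-based systems.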

Comments 42 pages, 6 figures, 9 tables

详情
AI中文摘要

在线服务条款通常作为附意合同运作,造成不对称性,可能使消费者面临潜在滥用条款。在智利,评估此类条款在法律上具有挑战性,因为某些条款明显违反强制性消费者法律,而其他条款则依赖于更广泛的标准,如诚信和合同失衡。我们提出一个检索增强生成框架,用于自动检测和分类智利服务条款中的潜在滥用条款。该框架设计为本地执行,结合了高效条款检测、混合稠密-稀疏检索、重排序和提示增强,以支持中等规模的开源语言模型。我们还引入了智利滥用服务条款扩展语料库,包含100份合同和10,029条标注条款,涵盖24个法律基础的类别,包括非法、黑暗和灰色条款。比较商业和开源语言模型、微调编码器以及传统基线的实验表明,检索增强提示显著提高了性能,并使本地模型能够以较低的计算和令牌成本接近更大的基于云的系统。该研究还贡献了一个精细的法律注释方案和一个用于AI辅助消费者合同审查的实用设计。

英文摘要

Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing such clauses is legally challenging because some provisions clearly violate mandatory consumer law, whereas others depend on broader standards such as good faith and contractual imbalance. We present a retrieval-augmented generation framework for the automated detection and classification of potentially abusive clauses in Chilean Terms of Service. Designed for local execution, it combines efficient clause detection, hybrid dense--sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. We also introduce the Chilean Abusive Terms of Service Extended corpus, comprising 100 contracts and 10,029 annotated clauses in 24 legally grounded categories spanning illegal, dark, and gray clauses. Experiments comparing commercial and open-weight language models, fine-tuned encoders, and traditional baselines show that retrieval-augmented prompting substantially improves performance and enables local models to approach larger cloud-based systems at lower computational and token cost. The study also contributes a refined legal annotation scheme and a practical design for AI-assisted consumer contract review.

2605.26013 2026-05-26 cs.LG cs.AI cs.CV 版本更新

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

AdvantageFlow: 流模型中基于优势加权的强化学习最小二乘法

Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Krishna Kumar Singh, Viet Dac Lai

发表机构 * Adobe Research(Adobe研究)

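AI summary: Proposes AdvantageFlow, an algorithm that combines an advantage-weighted forward-process prediction loss with rollout policy regularization and outperforms Flow-GRPO and a negative-aware fine-tuning baseline on image generation tasks.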

详情
AI中文摘要

我们引入了AdvantageFlow,一种用于修正流模型的前向过程强化学习算法。与优化反向过程的Flow-GRPO不同,我们优化了一个优势加权的前向过程预测损失。当优势为负且损失变为非凸时,该优化问题不稳定。我们通过rollout策略正则化来稳定它,这降低了方差,并源于拟合局部奖励改进的目标分布。我们在Stable Diffusion 3.5 Medium上评估了AdvantageFlow在图像生成任务中的表现。它优于Flow-GRPO和基于负感知微调的最先进前向过程强化学习基线。

英文摘要

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.

2605.26012 2026-05-26 cs.LG cs.AI 版本更新

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

低维子空间中的学习:强化学习的正交瓶颈

Aleksandar Todorov, Matthia Sabatelli

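AI summary: Proposes a simple prior that inserts a fixed orthonormal projection to constrain RL encoder features to a low-dimensional subspace, proves that it preserves expressivity under a linear realizability assumption, and shows empirically that value representations can be compressed to extremely low dimensions without loss of performance.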

详情
AI中文摘要

深度强化学习代理通常依赖高维神经表示,尽管越来越多的证据表明任务相关的价值和策略结构本质上是低维的。在这项工作中,我们提出了一种简单而有效的表示级先验,它插入一个固定的正交投影以将编码器特征约束到低维子空间,无需辅助目标、预训练或对底层RL算法的更改。在线性可实现性假设下,我们证明当瓶颈维度超过特征空间中最优价值函数的内在秩时,瓶颈保持表达能力,并将诱导的梯度动力学保留到等价的低维参数化。实验上,我们发现,在单任务和多任务基准测试中,一旦瓶颈维度超过一个小的任务相关阈值,基线性能要么匹配要么提高;在许多情况下,价值表示可以压缩到极低维度而不损失,最小充分维度更多地取决于环境复杂性而非编码器宽度。此外,我们分析了表示几何,发现正交瓶颈稳定了特征范数,并与更高的有效秩相关。这些结果共同支持了强化学习中流形假设的表示空间解释,并将正交瓶颈定位为一种轻量级、架构无关的塑造RL表示的机制。

英文摘要

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prior that inserts a fixed orthonormal projection to constrain encoder features to a low-dimensional subspace, requiring no auxiliary objectives, pretraining, or changes to the underlying RL algorithm. Under a linear realizability assumption, we prove that when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization. Empirically, we find that across both single and multi-task benchmarks, baseline performance is either matched or improved once the bottleneck dimension exceeds a small task-dependent threshold; in many cases, value representations can be compressed to extremely low dimensions without loss, and the minimal sufficient dimension depends far more on environment complexity than encoder width. In addition, we analyze representation geometry and find that orthogonal bottlenecks stabilize feature norms and are associated with higher effective rank. Together, these results support a representation-space interpretation of the manifold hypothesis in reinforcement learning and position orthogonal bottlenecks as a lightweight, architecture-agnostic mechanism for shaping RL representations.
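
The fixed orthonormal projection described above can be illustrated with a small sketch. This is not the authors' implementation: the function names `orthonormal_rows` and `bottleneck`, the Gram-Schmidt construction, and the dimensions (8 features, 3 bottleneck units) are all assumptions chosen for illustration. The sketch only shows the expressivity idea in its simplest linear form: if the weights of a linear value function lie in the span of the projection, the value computed from the bottleneck output equals the value computed from the full features.

```python
import random

def orthonormal_rows(k, d, seed=0):
    """Return k orthonormal vectors in R^d via Gram-Schmidt on Gaussian draws."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:  # remove components along vectors already in the basis
            c = sum(x * y for x, y in zip(v, b))
            v = [x - c * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:
            basis.append([x / norm for x in v])
    return basis

def bottleneck(features, basis):
    """Project d-dimensional encoder features to k dimensions: z = P f."""
    return [sum(p * f for p, f in zip(row, features)) for row in basis]

# Hypothetical sizes: 8 encoder features, bottleneck dimension 3.
P = orthonormal_rows(k=3, d=8)          # fixed at initialization, never trained
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]
z = bottleneck(features, P)

# A linear value function whose weights lie in span(P): w = sum_i c_i * P_i.
coeffs = [0.7, -0.2, 1.3]
w = [sum(c * row[j] for c, row in zip(coeffs, P)) for j in range(8)]
value_full = sum(wi * fi for wi, fi in zip(w, features))   # uses all 8 features
value_bottleneck = sum(c * zi for c, zi in zip(coeffs, z))  # uses only z
```

Here `value_full` and `value_bottleneck` agree up to floating-point error, which is the linear special case of the claim that the bottleneck preserves expressivity once its dimension covers the intrinsic rank of the value function.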

2605.26001 2026-05-26 cs.CL cs.AI cs.CY 版本更新

AI-Assisted Systematization for Evaluating GenAI Systems

AI辅助的系统化方法用于评估生成式AI系统

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

发表机构 * Cornell University(康奈尔大学) Microsoft Research(微软研究院)

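AI summary: To address underspecified concepts in generative AI evaluation, proposes AI-assisted systematization, which uses concept specs and a validation worksheet to produce measurable concept specifications, and evaluates their content validity and information recoverability.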

详情
AI中文摘要

评估生成式AI(GenAI)系统具有挑战性,因为许多评估目标都是宽泛且有争议的概念,例如“推理”、“公平性”或“创造力”。当这些概念未得到充分明确时,就不清楚应该测量什么或如何解释评估结果。这个问题反映了一个缺失的步骤:系统化,即从一个宽泛的背景概念转变为用可衡量术语对概念进行明确、结构化的描述。为了帮助解决系统化在认知上要求高且资源密集的问题,我们研究了AI辅助是否能够支持这一过程。为了实现AI辅助的系统化并评估其质量,我们引入了系统化概念的结构化表示——概念规范——以及一个验证工作表。然后,我们开发了两种AI辅助系统化工具:一种直接的零样本方法和一种多智能体方法,后者更贴近现有文献中手动系统化的方法。我们使用这些系统化工具为两个概念——仇恨言论和数字共情——生成概念规范,并评估所得概念规范的内容效度和信息可恢复性。

英文摘要

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts -- hate-based rhetoric and digital empathy -- and evaluate resulting concept specs on content validity and information recoverability.

2605.25168 2026-05-26 eess.IV cs.AI cs.CV 版本更新

Methodology for Creating a Clinically Verified Dermoscopic Image Dataset

创建临床验证的皮肤镜图像数据集的方法论

Kozachok Elena Sergeevna

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences(伊万诺夫系统编程研究所,俄罗斯科学院)

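AI summary: Proposes a methodology that combines a standard operating procedure for mobile dermatoscopic image acquisition, a structured metadata information model, and multi-stage expert verification to build a clinically verified dermatoscopic image dataset for medical informatics research.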

Comments 22 pages, 5 figures, 5 tables

详情
AI中文摘要

本研究提出了一种构建临床验证的皮肤镜图像数据集的方法,用于医学信息学研究。该工作的相关性在于,自动化诊断支持系统的性能不仅取决于图像数量,还取决于图像采集过程的可重复性、结构化元数据的完整性以及诊断标签的可靠性。国际数据集主要是在与俄罗斯常规门诊实践和移动皮肤镜显著不同的条件下创建的。所提出的方法整合了三个相互关联的组成部分:(1)通过移动皮肤镜采集图像的标准操作程序(SOP),(2)一个信息模型,包含16个结构化元数据字段,组织成六个临床导向的块,采用ISIC兼容的符号表示,以及(3)多阶段专家验证诊断标签(初始临床注释、三位专家的共识审查以及所有恶性肿瘤的组织学确认)。使用该方法,在2025年6月至2026年5月期间,收集了来自443名患者的1026张独特的皮肤镜图像数据集。从1044条初始记录中排除了18个重复项。该数据集包括九个疾病类别;所有39个恶性病变(18个黑色素瘤、15个基底细胞癌和6个鳞状细胞癌)均经过组织学验证。患者年龄范围为2至90岁(中位年龄38岁),其中女性279人(63%),男性164人(37%)。每张图像都附有专家注释的皮肤镜结构和明确的verification_stage字段,指示诊断确认的水平。所得数据集作为临床验证的试点资源,适用于独立模型评估、域偏移分析、可解释性研究和进一步扩展。

英文摘要

This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research. The relevance of the work is driven by the fact that the performance of automated diagnostic support systems depends not only on the volume of images, but also on the reproducibility of the image acquisition procedure, the completeness of structured metadata, and the reliability of diagnostic labels. International collections were primarily created under conditions that differ substantially from routine Russian outpatient practice and mobile dermatoscopy. The proposed methodology integrates three interconnected components: (1) a standard operating procedure (SOP) for acquiring images via mobile dermatoscopy, (2) an information model comprising 16 structured metadata fields organized into six clinically oriented blocks in ISIC-compatible notation, and (3) a multi-stage expert verification of diagnostic labels (initial clinical annotation, consensus review by three specialists, and histological confirmation of all malignant neoplasms). Using this methodology, a dataset of 1,026 unique dermatoscopic images from 443 patients was collected between June 2025 and May 2026. From 1,044 initial records, 18 duplicates were excluded. The dataset includes nine nosological categories; all 39 malignant lesions (18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas) were histologically verified. Patient age ranged from 2 to 90 years (median 38), with 279 females (63%) and 164 males (37%). Each image is accompanied by expert-annotated dermatoscopic structures and an explicit verification_stage field indicating the level of diagnostic confirmation. The resulting dataset serves as a pilot clinically verified resource suitable for independent model evaluation, domain shift analysis, interpretability studies, and further expansion.
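
The counts reported in the abstract can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers stated above; the variable names are illustrative.

```python
# Record counts: initial records minus excluded duplicates.
initial_records = 1044
duplicates = 18
unique_images = initial_records - duplicates

# Histologically verified malignant lesions by type.
malignant = {"melanoma": 18, "basal cell carcinoma": 15, "squamous cell carcinoma": 6}
malignant_total = sum(malignant.values())

# Patient sex distribution and rounded percentage shares.
patients = {"female": 279, "male": 164}
total_patients = sum(patients.values())
female_share = round(100 * patients["female"] / total_patients)
male_share = round(100 * patients["male"] / total_patients)

print(unique_images, malignant_total, total_patients, female_share, male_share)
# prints: 1026 39 443 63 37
```

All five values match the figures in the abstract (1,026 images, 39 malignant lesions, 443 patients, 63% female, 37% male).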

2605.24728 2026-05-26 cs.AI 版本更新

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Hylos: 面向模型原生空间智能的可操作性契约

Christopher Da Silva

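AI summary: Proposes Hylos, a systems architecture that uses contract bounds and spatial transaction management to ensure that generated or edited 3D content is operable, supporting downstream applications such as CAD and robotics.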

Comments 27 pages, 7 figures. Systems/position preprint with focused artifact study

详情
AI中文摘要

基础模型日益能够描述、重建和生成3D对象、装配体、场景和环境,但视觉上合理的空间输出尚不具备可操作的3D特性。只有当系统能够识别其实体、框架、表面、约束、来源、允许的动作、预期效果和验证失败时,生成的对象或环境才对智能体有用。本文介绍了Hylos,一种用于契约约束的空间智能的系统架构。Hylos维护场景级别的可操作性状态,涵盖对象、装配体、资产、表面锚点、断言、动作候选、求解器任务、共享执行器调用、能力差距和效果差异。持久的空间变化通过空间事务(SpatialTransaction)进行路由:这是一种提交边界,用于解析引用、检查可允许性、保护不变量、投影效果,并返回提交、审查、回滚、延迟或能力差距结果。本文以系统/立场预印本的形式呈现,并附带一个聚焦的工件研究,而非广泛的基准测试。该研究考察了因果修复:一个可见的错位出现在依赖组件上,而支持的修复位于控制它的上游放置结构中。成功的交互通过场景依赖关系追踪症状,选择支持的上游交互,并应用经过验证的更改,而不是直接编辑可见几何体。更广泛的论点是,空间AI不仅应根据视觉质量进行评估,还应考虑生成或编辑的3D能否成为CAD、机器人、仿真、检测、制造和交互式世界创作的可靠基础。

英文摘要

Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

2605.23904 2026-05-26 cs.AI cs.CL 版本更新

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

SkillOpt: 自我进化智能体技能的执行策略

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

发表机构 * Microsoft(微软公司) Shanghai Jiao Tong University(上海交通大学) Tongji University(同济大学) Fudan University(复旦大学)

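AI summary: Proposes SkillOpt, a systematic, controllable text-space optimizer in which a separate optimizer model makes bounded edits to a skill document and an edit is accepted only when it strictly improves the validation score, which keeps skill training stable; it is best or tied against existing methods across six benchmarks.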

Comments 27 pages, 4 figures, 6 tables

详情
AI中文摘要

当前的智能体技能要么是手工制作的,要么是一次性生成的,要么通过松散控制的自我修订来进化,这些方法都不像深度学习优化器那样作用于技能,并且都无法在反馈下可靠地改进其起点。我们认为,技能应该作为冻结智能体的外部状态进行训练,并遵循使权重空间优化可复现的相同原则。据我们所知,SkillOpt是第一个系统性的可控文本空间优化器,用于智能体技能:一个独立的优化器模型将带分数的轨迹转换为对单个技能文档的有界添加/删除/替换编辑,并且仅当编辑严格改善保留验证分数时才接受编辑。文本学习率预算、拒绝编辑缓冲区和逐轮慢/元更新使得技能训练稳定,同时在部署时无需增加推理时的模型调用。在六个基准测试、七个目标模型和三个执行框架(直接对话、Codex、Claude Code)中,SkillOpt在所有52个评估的(模型、基准、框架)单元上取得最佳或并列最佳,并击败了每个单元上的所有竞争者,包括人类、一次性LLM、Trace2Skill、TextGrad、GEPA和EvoSkill技能。在GPT-5.5上,它在直接对话中将平均无技能准确率提高了23.5个百分点,在Codex智能体循环中提高了24.8个百分点,在Claude Code中提高了19.1个百分点。迁移实验进一步表明,优化后的技能工件在跨模型规模、在Codex和Claude Code执行环境之间迁移以及迁移到邻近的数学基准测试时,无需进一步优化即可保留其价值。代码:https://aka.ms/skillopt

英文摘要

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
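
The acceptance rule described above (bounded add/delete/replace edits, kept only on strict improvement of a held-out score) can be sketched as a simple loop. This is a minimal illustration, not the SkillOpt code: `apply_edits`, `optimize_skill`, the toy proposer, and the toy validator are all hypothetical stand-ins, and the real system uses a separate optimizer model and scored rollouts where this sketch uses random edits and a keyword count.

```python
import random

def apply_edits(skill, edits):
    """Apply add/delete/replace edits to a skill document (a list of lines)."""
    doc = list(skill)
    for op, idx, text in edits:
        if op == "add":
            doc.insert(min(idx, len(doc)), text)
        elif op == "delete" and doc:
            doc.pop(idx % len(doc))
        elif op == "replace" and doc:
            doc[idx % len(doc)] = text
    return doc

def optimize_skill(skill, propose, validate, steps=50, edit_budget=2):
    """Keep a candidate only if it strictly improves the held-out score."""
    best, best_score = list(skill), validate(skill)
    rejected = []  # buffer of rejected edits, handed back to the proposer
    for _ in range(steps):
        edits = propose(best, rejected)[:edit_budget]  # cap edits per step
        candidate = apply_edits(best, edits)
        score = validate(candidate)
        if score > best_score:      # strict improvement only
            best, best_score = candidate, score
        else:
            rejected.append(edits)
    return best, best_score, rejected

# Toy stand-ins (assumptions): a keyword-count validator and a random proposer.
HINTS = ["check units", "show work", "verify answer", "be verbose", "guess quickly"]
USEFUL = {"check units", "show work", "verify answer"}

def toy_validate(doc):
    # Stand-in for a held-out validation run: +1 per useful line, -1 otherwise.
    return sum(1 if line in USEFUL else -1 for line in set(doc))

rng = random.Random(0)

def toy_propose(doc, rejected):
    # A real proposer would condition on `doc` and `rejected`; this one ignores both.
    ops = ["add", "delete", "replace"]
    return [(rng.choice(ops), rng.randrange(8), rng.choice(HINTS)) for _ in range(3)]

initial = ["be verbose"]
skill, score, rejected = optimize_skill(initial, toy_propose, toy_validate)
```

Because a candidate replaces the current skill only on strict improvement, the returned score can never fall below the score of the starting document, regardless of how poor the proposals are.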

2605.23082 2026-05-26 stat.ML cs.AI cs.LG 版本更新

KAPLAN: Kolmogorov-Arnold Prognostic Learnable Activation Networks for Survival Analysis

KAPLAN: 用于生存分析的Kolmogorov-Arnold可预测可学习激活网络

Stelios Boulitsakis Logothetis, Angela Wood, Pietro Liò

发表机构 * University of Cambridge(剑桥大学)

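AI summary: Proposes KAPLAN-HR, a B-spline Kolmogorov-Arnold network for nonparametric estimation of the conditional hazard function; deeper architectures capture interactions and time-varying effects automatically, and the proven convergence rate depends only on the smoothness of the representation, which mitigates the curse of dimensionality. On six clinical datasets it matches or exceeds existing methods.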

Comments 9 pages, 3 figures, 13 supplementary pages. Submitted to NeurIPS 2026

详情
AI中文摘要

生存分析旨在建模协变量和时间如何共同影响右删失下的事件时间分布。经典方法如Cox模型和广义加性模型(GAM)需要手动指定交互和时变效应,这在丰富的临床数据集上越来越不切实际。我们引入了KAPLAN-HR,一种B样条Kolmogorov-Arnold网络(KAN),用于非参数估计条件风险函数作为协变量和时间的联合函数。单层KAPLAN-HR模型恢复GAM,而更深层的架构通过组合捕捉交互和时变效应。我们为非参数KAN风险估计器建立了收敛速率,该速率仅依赖于底层KAN表示的平滑性,而不依赖于协变量维度,从而缓解了KAN可表示目标的维度灾难。在六个临床基准数据集的评估中,KAPLAN-HR匹配或超过了已建立的统计和深度学习生存方法的预测性能。

英文摘要

Survival analysis aims to model how covariates and time jointly shape the time-to-event distribution under right censoring. Classical methods such as the Cox model and generalised additive models (GAMs) require interactions and time-varying effects to be manually specified, which is increasingly impractical on rich clinical datasets. We introduce KAPLAN-HR, a B-spline Kolmogorov-Arnold Network (KAN) for nonparametric estimation of the conditional hazard as a joint function of covariates and time. A single-layer KAPLAN-HR model recovers a GAM, while deeper architectures capture interactions and time-varying effects through composition. We establish a convergence rate for the nonparametric KAN hazard estimator that depends only on the smoothness of the underlying KAN representation and not on the covariate dimension, thereby mitigating the curse of dimensionality for KAN-representable targets. In evaluations over six clinical benchmark datasets, KAPLAN-HR matches or exceeds the predictive performance of established statistical and deep learning survival methods.

2605.25984 2026-05-26 cs.CL cs.AI 版本更新

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

SafeCtrl-RL: 通过RL驱动的提示优化的LLM对话推理时自适应行为控制

Michael Orme, Yanchao Yu, Zhiyuan Tan

发表机构 * School of Computing, Engineering and Building Environment(计算、工程与建筑环境学院)

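AI summary: Proposes SafeCtrl-RL, a framework that uses reinforcement learning to dynamically select prompt adjustment strategies at inference time, suppressing unsafe behaviour without retraining and improving the safety and response quality of LLM dialogue.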

详情
AI中文摘要

确保大型语言模型(LLM)的安全和上下文适当行为仍然是实际部署的关键挑战。我们提出了SafeCtrl-RL,一个推理时行为控制框架,无需模型重新训练或参数修改即可实现自适应安全调节。该方法将对话生成形式化为一个序列决策过程,其中强化学习代理根据上下文反馈动态选择提示调整策略。这使得不安全行为可以通过迭代细化被抑制,我们将其概念化为推理时行为遗忘。在多个LLM和不安全对话场景下的评估表明,SafeCtrl-RL一致地提高了安全性和响应质量,优于现有的基于提示的优化方法,并实现了良好的性能-效率权衡。**警告:本文可能包含有害语言的示例,建议读者谨慎阅读。**

英文摘要

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance--efficiency trade-offs. **Warning: This paper may contain examples of harmful language, and reader discretion is recommended.**

2605.25977 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

创意质量对齐:通过思维链微调实现专家隐性知识迁移

Bo Zou, Chao Xu

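AI summary: Empirically validates the creative quality metric from Calibrated Surprise under strict engineering conditions (low data cost and a small base model), identifies a data bias, and proposes Creative Quality Alignment together with a theoretical explanation.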

详情
AI中文摘要

本文对校准惊喜(Zou & Xu, 2026a)中提出的创意质量度量进行了实证实现。本文解决的问题是:这一数学主张在工程层面是否成立?为使答案尽可能通用,我们特意选择了最严格的工程条件:低数据成本和小基模型。训练数据来自BC协议(Zou & Xu, 2026b)产生的大约100个专家思维链(CoT)标注。我们还发现了一个数据偏差:大多数公开可用的对齐数据集偏向于工艺相关知识,而受众建模和现实逻辑覆盖系统性薄弱。我们使用术语“创意质量对齐”(CQA)来描述这类工程方法。我们还提供了一个支持性的理论观察:在具有单一条件分布架构的LLM中,通过架构对偶性,校准欣赏侧会自动迁移到生成侧。这是大约100个CoT示例就足够的结构性原因——而非像LIMA(Zhou et al., 2023)那样的纯粹经验观察。

英文摘要

This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question this paper addresses is: does this mathematical claim hold at the engineering level? To make the answer as general as possible, we deliberately choose the strictest engineering conditions: low data cost and a small base model. Training data comes from approximately 100 expert chain-of-thought (CoT) annotations produced by the BC Protocol (Zou & Xu, 2026b). We also identify a data bias: most publicly available alignment datasets are skewed toward craft-related knowledge, while audience modeling and reality-logic coverage are systematically weak. We use the term Creative Quality Alignment (CQA) to describe this class of engineering methods. We also offer a supporting theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. This is the structural reason why ~100 CoT examples are sufficient -- not a purely empirical observation like LIMA (Zhou et al., 2023).

2605.25964 2026-05-26 cs.AI 版本更新

LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation

LECTOR: 科学推理图与引言生成的联合优化

Jiabei Xiao, Yizhou Wang, Chen Tang, Pengze Li, Wanli Ouyang, Shixiang Tang

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Fudan University(复旦大学)

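AI summary: Proposes LECTOR, a framework that uses Logic-Expression Co-Reinforcement Learning to jointly optimize the structural fidelity of scientific reasoning graphs and the quality of Introduction generation, achieving significant improvements on a Nature Communications dataset.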

Comments 25 pages

详情
AI中文摘要

AI科学家在研究流程的多个阶段已展现出有希望的进展,其中自动科学论文写作仍然是一个艰巨的挑战。引言写作尤其困难,不仅要求语言流畅,还需要逻辑合理性和可验证的忠实性。大多数AI辅助方法将任务视为文本生成而非推理和结构化,导致严重缺陷,例如引用幻觉。为解决此问题,我们首先定义了内容条件引言生成(CCIG)任务,要求引言基于论文的核心证据。然后我们提出LECTOR,一种新颖的逻辑-表达协同强化学习框架,能够严格遵循科学家的逻辑,添加高质量引用并保持结构化表达。LECTOR首先从论文主体构建逻辑推理图,作为可验证的逻辑蓝图。随后,它采用逻辑-表达协同奖励机制,联合优化图的结构保真度和最终叙述的质量。我们从Nature Communications论文中构建数据集来评估我们的方法。大量实验表明,在逻辑保真度和引言生成质量指标上均有一致改进,例如图质量(+26.7%)、引用质量(+8.6%)和论文一致性(+3.3%)。代码和数据可在https://github.com/Xiao-Youth/LECTOR获取。

英文摘要

AI Scientists have shown promising progress across multiple stages of the research pipeline, but automatic scientific paper writing remains a formidable challenge. Writing the Introduction is especially difficult: it demands not only linguistic fluency but also logical soundness and verifiable faithfulness. Most AI-assisted methods treat the task as text generation rather than reasoning and structuring, which leads to severe drawbacks such as hallucinated citations. To address this, we first formulate the Content-Conditional Introduction Generation (CCIG) task, which requires grounding the Introduction in the paper's core evidence. We then propose LECTOR, a novel Logic-Expression Co-Reinforcement Learning framework that strictly follows the scientist's logic, adds high-quality citations, and keeps the expression structured. LECTOR first constructs a logic-reasoning graph from the paper's main body to serve as a verifiable logical blueprint. Subsequently, it employs a Logic-Expression Co-Rewarding mechanism to jointly optimize both the graph's structural fidelity and the final narrative's quality. We construct a dataset from Nature Communications papers to assess our method. Extensive experiments show consistent improvements in both logic fidelity and Introduction generation quality metrics, e.g., Graph Quality (+26.7%), Citation Quality (+8.6%), and Paper Consistency (+3.3%). Code and data are available at https://github.com/Xiao-Youth/LECTOR.

2605.25962 2026-05-26 cs.SD cs.AI 版本更新

Continual Speaker Identity Unlearning with Minimal Interference

持续说话人身份遗忘与最小干扰

Jinju Kim, Yunsung Kang, Gyeong-Moon Park, Jong Hwan Ko

发表机构 * Sungkyunkwan University(成均馆大学) Korea University(韩国大学)

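AI summary: Proposes CORTIS, a framework that uses Fisher-information-based parameter masking and orthogonal projection to achieve continual speaker identity unlearning in zero-shot text-to-speech while preventing previously unlearned speakers from re-emerging.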

Comments preprint

详情
AI中文摘要

机器遗忘从预训练模型中移除指定概念或知识。最近的工作将此范式扩展到零样本语音合成(ZS-TTS)中的说话人身份遗忘,即选择性擦除模型复制说话人声音的能力。然而,现有方法默认所有遗忘请求同时到达,这是一个不现实的假设,因为隐私驱动的移除会随时间顺序到达。我们证明这一假设破坏了现有最先进的方法:遗忘每个新说话人会完全恢复先前遗忘的说话人,重新引入遗忘本应消除的隐私风险。我们提出了累积正交身份抑制(CORTIS),这是首个在ZS-TTS中实现持续说话人身份遗忘的框架,无需访问先前遗忘的说话人数据。CORTIS结合了基于Fisher信息的参数掩码(将更新定位到与说话人相关的权重)和针对先前遗忘更新子空间的正交投影。使用VoiceBox,CORTIS在长请求序列中遗忘每个请求的说话人,同时保持先前遗忘的说话人被遗忘,显著优于先前方法的顺序应用。演示地址:https://cumulativeortis.github.io/ 。

英文摘要

Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, quietly assume all unlearning requests arrive at once, an unrealistic assumption since privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/.
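
The two ingredients named above, Fisher-based masking and orthogonal projection against prior updates, can be sketched on small vectors. This is an illustrative sketch under stated assumptions, not the CORTIS implementation: the function names, the diagonal empirical Fisher (mean squared per-sample gradient), the top-k mask, and the order of operations (mask first, then project) are all choices made here for clarity.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of prior unlearning updates."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [x - c * y for x, y in zip(w, b)]
        n = dot(w, w) ** 0.5
        if n > eps:
            basis.append([x / n for x in w])
    return basis

def fisher_mask(per_sample_grads, keep):
    """Diagonal empirical Fisher (mean squared gradient); keep the top `keep` coords."""
    d = len(per_sample_grads[0])
    n = len(per_sample_grads)
    fisher = [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(d)]
    top = set(sorted(range(d), key=lambda i: fisher[i], reverse=True)[:keep])
    return [1.0 if i in top else 0.0 for i in range(d)]

def project_out(update, basis):
    """Remove the components of `update` that lie in the span of `basis`."""
    out = list(update)
    for b in basis:
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out

# Hypothetical 4-parameter model with two earlier unlearning updates.
prior_updates = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(prior_updates)

# Per-sample gradients for the newly requested speaker (made-up numbers).
grads = [[0.9, 0.1, 2.0, 0.01], [1.1, -0.1, 1.0, 0.02]]
mask = fisher_mask(grads, keep=2)
mean_grad = [sum(g[i] for g in grads) / len(grads) for i in range(4)]
masked = [m * g for m, g in zip(mask, mean_grad)]
new_update = project_out(masked, basis)
```

In this toy case the mask keeps coordinates 0 and 2, and the projection then removes coordinate 0 because it lies in the span of the earlier updates, leaving an update that touches only coordinate 2. The new update is therefore orthogonal to every earlier update, which is the property intended to keep previously unlearned speakers forgotten.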

2605.25955 2026-05-26 cs.CL cs.AI cs.LG 版本更新

QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

QUIET: 面向LLM创意生成能力的多空白级联故事完形填空基准

Bo Zou, Chao Xu

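AI summary: Proposes the QUIET benchmark, which objectively evaluates the creative generation capability of large language models through multi-blank cascaded story cloze and an information-theoretic automated scoring protocol.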

详情
AI中文摘要

大语言模型(LLM)在创意能力评估中面临双重挑战:现有基准(如Story Cloze Test、HellaSwag)通过多项选择识别范式衡量模型对叙事延续的判别能力,而非直接衡量创意生成能力;基于量规的评分和LLM-as-Judge方法依赖主观维度评估或自然语言模型输出,无法提供客观、自动化的评分机制。本文提出QUIET(Quality Understanding via Interlocked Evaluation Testing),一种基于多空白级联故事完形填空的LLM创意能力诊断基准。QUIET在结构完整的故事中设置N个空白(10-20个),每个空白附带显式内容约束,且空白之间存在级联依赖关系——较早空白填充的内容约束较晚空白的可行解空间。被评估模型(或人类参与者)以开放生成模式填充所有空白;结果由基于信息论的自动化评分协议评分,无需人工评分。该评分协议直接操作化“校准惊喜”理论框架(Zou & Xu, 2026a)。对于每个空白k,计算复合分数:score = satisfy * (1 + lambda * surprise),其中lambda = 1.0。这里,“satisfy”衡量空白填充满足内容约束的程度(客观逻辑推理判断,非主观审美评分),“surprise”衡量在满足约束条件下的惊喜程度。不满足约束的创意答案得零分;满足约束但平庸的答案得分低;满足约束且令人惊喜的答案得分高。

英文摘要

Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks -- the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the "calibrated surprise" theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, "satisfy" measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and "surprise" measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.
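
The per-blank composite score stated above, score = satisfy * (1 + lambda * surprise) with lambda = 1.0, is easy to work through in code. The formula and the value of lambda come from the abstract; the function name `blank_score` and the example input values are assumptions, and the abstract does not say how per-blank scores are aggregated, so no aggregation is shown.

```python
def blank_score(satisfy, surprise, lam=1.0):
    """Composite score for one blank: satisfy * (1 + lam * surprise)."""
    return satisfy * (1.0 + lam * surprise)

# Three illustrative cases (input values are made up):
unsatisfied_creative = blank_score(satisfy=0.0, surprise=0.9)   # violates the constraint
satisfied_mediocre = blank_score(satisfy=1.0, surprise=0.1)     # valid but predictable
satisfied_surprising = blank_score(satisfy=1.0, surprise=0.9)   # valid and surprising
```

Because `satisfy` multiplies the whole expression, an answer that fails the constraint scores zero no matter how surprising it is, while among constraint-satisfying answers the score grows with surprise. This reproduces the ordering described in the abstract: unsatisfied creative answers score zero, satisfied mediocre answers score low, and satisfied surprising answers score high.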

2605.25954 2026-05-26 cs.LG cs.AI 版本更新

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

Step-TP: 一个基于步骤级、带有思维链推理的 LLM 引导张量程序优化数据集

Mengfan Liu, Da Zheng, Junwei Su, Chuan Wu

发表机构 * The University of Hong Kong(香港大学) University of Science and Technology of China(中国科学技术大学)

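AI summary: To address the lack of verifiable step-level supervision for LLMs in tensor program optimization, proposes the Step-TP dataset, which enables reliable multi-step optimization through structured chain-of-thought reasoning and atomic step-level supervision.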

详情
AI中文摘要

尽管大语言模型(LLM)具有强大的推理能力,但由于需要精确、可组合的变换决策,优化张量程序的执行效率仍然具有挑战性。最近的 LLM 引导方法将张量程序优化视为一个迭代决策过程,但现有数据集仅提供使用令牌效率低下的表示方式的端到端优化程序对,缺乏可验证的步骤级监督和可解释性。因此,LLM 难以在大型组合优化空间中做出可靠的单步决策。我们引入了 Step-TP,一个用于张量程序优化的后训练数据集,它提供基于事实的、原子性的步骤级监督,并带有结构化的思维链(CoT)推理。Step-TP 在中间程序状态上形成一个封闭的推理循环,从而实现可靠的多步优化,而非结果模仿。其设计遵循四个原则:(i) 令牌高效、可验证的中间表示(IR),可确定性降低为 TVM TIR;(ii) 原子且可组合的优化策略,将复杂轨迹分解为可解释的单步决策;(iii) 结构化的 CoT 监督与显式的 IR 到 IR 状态转换相结合;(iv) 策略过滤以平衡覆盖范围同时防止捷径利用。该数据集和实现可在 GitHub 链接 https://github.com/LIUMENGFAN-gif/StepTP 获取。

英文摘要

Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decision process, but existing datasets provide only end-to-end optimized program pairs using token-inefficient representations, lacking verifiable step-level supervision and interpretability. As a result, LLMs struggle to make reliable single-step decisions in large combinatorial optimization spaces. We introduce Step-TP, a post-training dataset for tensor program optimization that provides grounded, atomic, step-level supervision with structured chain-of-thought (CoT) reasoning. Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation. Its design is guided by four principles: (i) a token-efficient, verifiable intermediate representation (IR) that deterministically lowers to TVM TIR; (ii) atomic and composable optimization strategies that decompose complex trajectories into interpretable single-step decisions; (iii) structured CoT supervision coupled with explicit IR-to-IR state transitions; and (iv) strategy filtering to balance coverage while preventing shortcut exploitation. The dataset and implementation are available at https://github.com/LIUMENGFAN-gif/StepTP.

2605.25952 2026-05-26 cs.CV cs.AI 版本更新

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

VEN-VL: 一种用于有效且高效多模态理解的视觉集成MoE框架

Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang

发表机构 * Tsinghua University(清华大学)

AI总结 提出VEN-VL框架,通过先丰富后压缩的策略,利用视觉集成MoE和自适应路由增强视觉令牌的信息容量与密度,在少量压缩令牌下实现复杂视觉任务的性能与效率平衡。

AI中文摘要

尽管近期高效方法在加速多模态理解方面取得了显著进展,但它们仍然存在明显的性能下降。这些方法强调单一视觉线索的高压缩比,并依赖基于启发式剪枝策略的粗略注意力对齐,导致视觉令牌的信息容量和密度出现瓶颈。针对这一局限,我们提出了VEN-VL,一种遵循“先丰富后压缩”原则的视觉集成MoE框架,用于高效感知。具体来说,我们首先通过统一不同视角的视觉表示来丰富信息容量,然后通过专门视觉专家中的自适应路由器逐步压缩信息以增强信息密度。此外,我们通过显式视觉监督融入原始结构的重建能力,促进关键信息的保留。实验结果表明,我们在使用少量信息压缩令牌的复杂视觉任务中具有优越性,有效弥合了性能与效率之间的差距。

英文摘要

Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on the high compression ratio of a single visual clue and reliance on the heuristic pruning strategy with coarse attention alignment incurs a bottleneck on the information capacity and density of visual tokens. Addressing this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception following the enrich then compact principle. Specifically, we first enrich the information capacity by unifying the visual representations of different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of vanilla structure via explicit visual supervision, facilitating crucial information preservation. Experimental results demonstrate our superiority in complex visual tasks with few information-condensed tokens, which effectively bridges the gap between performance and efficiency.

2605.25949 2026-05-26 cs.LG cs.AI physics.comp-ph 版本更新

Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers

小模型,强先验:参数高效神经PDE求解器的架构归纳偏置

Shyam Sankaran, Hanwen Wang, Paris Perdikaris

发表机构 * Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania(宾夕法尼亚大学机械工程与应用力学系)

AI总结 提出WaveLiT架构,通过小模型(1-10M参数)利用小波多尺度先验实现参数高效,在多个PDE基准上媲美大100-1000倍的基础模型,并揭示先验失败模式可提供有用信号。

AI中文摘要

神经PDE求解器遵循视觉和语言的扩展轨迹,最近的基础模型达到数十亿参数。我们认为,在该领域中,规模不能很好地替代架构归纳偏置:结构化先验带来超高的参数效率,并且它们成功和失败的模式本身就能说明它们捕获了什么。我们通过WaveLiT实例化这一论点,该架构结合了用于无损多分辨率标记化的离散小波变换、增强的线性注意力块、共享权重的多尺度特征金字塔以及小波域辅助损失。定制的1-10M参数WaveLiT模型在八个TheWell基准测试中与规模大100-1000倍的基础模型竞争,在波动和声学主导的基准测试中增益最大,其中小波多尺度先验适合主导动力学结构,且小的每步误差在展开时不会几何级数地复合。在所有八个基准测试上联合训练后,一个10M参数的基础变体表现出结构化的、物理上可解释的迁移模式——在小波多尺度先验匹配动力学的地方最强,在混沌平流主导的流动中最弱。整个流水线在单个GPU上训练。结果表明,小模型PDE性能由架构归纳偏置而非规模决定,并且先验失败的结构是关于其内容的有用经验信号。

英文摘要

Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pattern of where they succeed and fail is itself informative about what they capture. We instantiate this argument in WaveLiT, an architecture combining a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000$\times$ their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern -- strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows. The entire pipeline trains on a single GPU. The results suggest that small-model PDE performance is shaped by architectural inductive bias rather than scale, and that the structure of a prior's failures is a useful empirical signal about its content.
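The abstract describes the discrete wavelet transform as a lossless multi-resolution tokenization. The sketch below illustrates only that property, using a one-level orthonormal Haar transform on a 1D list. The choice of the Haar wavelet, the function names, and the sample values are assumptions made for illustration and are not taken from the WaveLiT implementation.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor

def haar_forward(signal):
    """One level of the Haar transform on an even-length list.

    Returns (approx, detail): a coarse half-resolution view plus the
    detail coefficients needed to undo the downsampling exactly.
    """
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the original samples from the two coefficient bands."""
    restored = []
    for a, d in zip(approx, detail):
        restored.append((a + d) * SCALE)
        restored.append((a - d) * SCALE)
    return restored

# Illustrative values only, not data from the paper.
field = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
approx, detail = haar_forward(field)
restored = haar_inverse(approx, detail)
max_error = max(abs(x - y) for x, y in zip(field, restored))
```

Because the two bands together hold as many numbers as the input, nothing is discarded: `max_error` stays at floating-point round-off level, which is what "lossless" means here.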

2605.25944 2026-05-26 cs.CV cs.AI 版本更新

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

EchoPilot: 通过尺度空间语义提示和可靠性门控记忆实现无训练超声视频分割

Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Third Affiliated Hospital of Sun Yat-Sen University(中山大学第三附属医院) Hong Kong Metropolitan University(香港都会大学)

AI总结 提出EchoPilot,一种无需训练、仅需单点点击和类别名称的超声视频分割框架,通过尺度空间语义提示解决初始化歧义,并引入可靠性门控记忆减少传播漂移,在多个数据集上达到最优性能。

Comments Early accepted to MICCAI 2026. Project page: https://keeplearning-again.github.io/EchoPilot/

AI中文摘要

超声视频分割在临床上具有重要价值,但由于散斑噪声、弱边界和快速解剖变形而困难。最近的可提示基础模型实现了点引导分割,但它们在超声中的直接部署仍然不可靠:单个点提供的空间上下文不足以解决尺度模糊性,贪婪的记忆更新会将早期错误放大为严重的时间漂移。我们提出了EchoPilot,一个在稀疏第一帧交互下进行超声视频分割的无训练框架,仅需单点点击和解剖类别名称。EchoPilot协调一个冻结的医学视觉语言模型(VLM)进行语义定位,一个视觉基础模型(VFM)进行密集几何特征提取,以及一个可提示视频分割器进行掩码预测和传播。为了解决初始化歧义,我们提出了尺度空间语义提示,首先通过无参数的S.E.E.D.(语义能量-熵密度)准则选择最佳上下文视图,然后从密集基础特征中合成几何精确的辅助点提示,无需额外用户交互。为了减少传播漂移,进一步引入了可靠性门控记忆更新,在不确定预测下选择性冻结分割器的记忆库,防止错误累积。我们还贡献了第一个动态胎儿胎盘超声视频分割数据集,包含671个标注帧。在三个超声视频数据集上,EchoPilot在稀疏交互设置下实现了最先进的性能,持续优于无训练基线和微调专家。

英文摘要

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
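The Reliability-Gated Memory idea, freezing the memory bank when a prediction is uncertain, can be sketched as a simple gate. This is a minimal illustration under stated assumptions: the scalar `confidence` score, the `threshold` and `capacity` values, and the function name are hypothetical, and the abstract does not specify how EchoPilot measures reliability.

```python
def gated_memory_update(memory, frame_id, mask, confidence, threshold=0.8, capacity=4):
    """Return the memory bank after seeing one frame.

    Assumption: `confidence` is a scalar reliability score in [0, 1].
    Below `threshold` the bank is left untouched (frozen), so an
    uncertain mask cannot contaminate later propagation.
    """
    if confidence < threshold:
        return memory  # frozen: skip the write
    updated = memory + [(frame_id, mask)]
    return updated[-capacity:]  # keep only the most recent reliable frames

# Illustrative predictions: (frame index, mask placeholder, confidence).
predictions = [(0, "mask0", 0.95), (1, "mask1", 0.55), (2, "mask2", 0.90)]

memory = []
for frame_id, mask, confidence in predictions:
    memory = gated_memory_update(memory, frame_id, mask, confidence)

stored_frames = [frame_id for frame_id, _ in memory]
```

In this toy run the low-confidence frame 1 is never written, which is the behaviour that contrasts with a greedy update that stores every frame.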

2605.25939 2026-05-26 cs.LG cs.AI 版本更新

From Latent Space to Training Data: Explainable Specialization in Minimal MLPs

从潜在空间到训练数据:最小MLP中的可解释特化

Enrique Alba, Ezequiel Lopez-Rubio

发表机构 * ITIS Software, University of Malaga(马拉加大学ITIS Software研究所)

AI总结 研究最小单隐藏层MLP中隐藏神经元是否因训练偏差而特化,以及这种特化是否改善基于原型的训练数据重构,发现覆盖正则化能提高特化比并降低重构误差,而重叠惩罚会导致原型中心被推出凸包。

AI中文摘要

我们在此研究训练偏差是否能使隐藏神经元在最小单隐藏层MLP中特化,以及这种特化是否改善从学习权重对训练数据集进行基于原型的重构。我们考虑宽度等于数据集大小的高斯激活MLP,并比较三种结构损失(分别鼓励训练样本覆盖、神经元诱导原型之间的分离以及隐藏响应的低重叠)与标准拟合基线。在均匀采样的一维数据集上的实验显示,从N=3到N=100的480次受控运行中呈现稳定模式。覆盖正则化在每个测试大小下给出最低的平均重构误差,并相对于标准基线提高了原型使用特化比,而分离效果参差不齐,重叠惩罚则系统性有害。我们表明这种损害并非优化失败:重叠激活的方法与无重叠方法一样拟合数据,但将优化器引导至退化均衡,其中原型中心被推出训练输入的凸包。覆盖无法奖励这种驱逐,并充当吸引子:分离仅在高温下允许它,而重叠在名义超参数选择下允许它。在分离掩码上的直接τ扫描和N=100时的原型位置可视化确认了这一机制。这些发现为原型可恢复性感知训练提供了一个简单的设计原则:每个排斥性结构损失必须由一个兼容的吸引子补偿,否则它将破坏本应精炼的潜在几何结构。

英文摘要

We study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization improves prototype-based reconstruction of the training dataset from the learned weights. We consider Gaussian-activation MLPs of width equal to the dataset size and compare three structural losses, which respectively encourage coverage of the training samples, separation between neuron-induced prototypes, and low overlap of hidden responses, against the standard fitting baseline. Experiments on uniformly sampled one-dimensional datasets show a stable pattern from N = 3 to N = 100 across 480 controlled runs. Coverage regularization gives the lowest mean reconstruction error at every tested size and raises the prototype-usage specialization ratio relative to the standard baseline, while separation has mixed effects and overlap penalties are systematically harmful. We show that the harm is not an optimization failure: overlap-active approaches fit the data as well as overlap-free ones but route the optimizer to a degenerate equilibrium in which prototype centers are pushed outside the convex hull of the training inputs. Coverage cannot reward this expulsion and acts as an attractor: separation admits it only at large temperature and overlap admits it at the nominal hyperparameter choice. A direct τ-sweep on the separation-only mask and a prototype-position visualization at N = 100 confirm the mechanism. The findings yield a simple design principle for prototype-recoverability-aware training: every repulsive structural loss must be compensated by a compatible attractor, or it will collapse the latent geometry it was meant to refine.

2605.25933 2026-05-26 cs.LG cs.AI 版本更新

Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data

通过特定恐惧症数据迁移学习定量评估创伤后应激障碍的严重程度

Nicolas Ricka, Gauthier Pellegrin, Denis A. Fompeyrine, Thomas Rohaly, Leah Enders, Heather Roy

发表机构 * MyndBlue DCS Corporation Human in Complex Systems Division, DEVCOM Army Research Laboratory(复杂系统人类研究部,DEVCOM陆军研究实验室)

AI总结 提出基于多元核密度估计的机器学习方法,利用心率与皮肤电导信号从特定恐惧症数据迁移学习,客观评估PTSD严重程度,分类准确率86%,平均绝对误差5.6。

Comments Submitted to a peer-reviewed journal, comments welcome

AI中文摘要

创伤后应激障碍(PTSD)是一种普遍且使人衰弱的心理健康状况,对个人和社会产生重大影响。目前PTSD的临床评估通常依赖主观评价,耗时、昂贵且易受人为偏见影响。本研究提出一种基于多元核密度估计(MKDE)技术的机器学习方法,用于客观评估PTSD严重程度。我们收集了21名参与者在沉浸式模拟期间的心率(HR)和皮肤电导反应(GSR)信号以及PTSD检查表-军事版(PCL-M)标签。在公开的蜘蛛恐惧症数据集上训练恐惧反应模型,并从军事数据集估计的恐惧反应曲线中提取PTSD预测特征。该模型在分类PTSD状态时达到86%的准确率,有效区分有和无PTSD的参与者(PCL-M阈值为36)。模型的平均绝对误差(MAE)为5.6,并以17%的平均绝对百分比误差估计临床PTSD严重程度量表。我们的算法通过提供一种客观且低努力的生理评估方法,显示出增强PTSD严重程度估计和随访的潜力。这些发现表明在筛查和随访环境中具有临床实用性。

英文摘要

Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts. Current clinical assessments of PTSD often rely on subjective evaluations, which can be time-consuming, costly, and prone to human bias. This study proposes a machine learning (ML) approach based on a multivariate kernel density estimation (MKDE) technique for the objective evaluation of PTSD severity. We collected heart rate (HR) and galvanic skin response (GSR) signals as well as PTSD Checklist - Military Version (PCL-M) labels from 21 participants during an immersive simulation. A fear-response model was trained on a public arachnophobia dataset, and predictive features of PTSD were extracted from the fear-response curves estimated on the military dataset. The model achieved an accuracy of 86% in classifying PTSD status, effectively distinguishing participants with and without PTSD (PCL-M threshold of 36). The average mean absolute error (MAE) of the models is 5.6, and it estimated a clinical PTSD severity scale with a mean absolute percentage error of 17%. Our algorithm demonstrates promising potential for enhancing estimation of PTSD severity and follow-up by offering an objective and low-effort evaluation approach using physiology. These findings suggest clinical utility in both screening and follow-up settings.
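For readers unfamiliar with the reported metrics, the sketch below shows how classification accuracy at a score threshold, mean absolute error, and mean absolute percentage error are typically computed. The scores are invented for illustration and are not data from the study. Treating a score at or above 36 as positive is an assumption, since the abstract gives the threshold value but not the comparison rule.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the true score, in percent."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)

def classify(scores, threshold=36):
    """Assumption: a score at or above the threshold counts as positive."""
    return [score >= threshold for score in scores]

def accuracy(labels_true, labels_pred):
    """Fraction of matching labels."""
    matches = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return matches / len(labels_true)

# Hypothetical severity scores, for illustration only.
true_scores = [20, 40, 50, 30]
predicted_scores = [25, 36, 45, 33]

mae = mean_absolute_error(true_scores, predicted_scores)
mape = mean_absolute_percentage_error(true_scores, predicted_scores)
acc = accuracy(classify(true_scores), classify(predicted_scores))
```

MAE is expressed in scale points while MAPE is relative to each true score, which is why the abstract can report both 5.6 and 17% for the same models.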

2605.25931 2026-05-26 cs.AI 版本更新

Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3

先探索再求解:面向ARC-AGI-3的认知主体中的速度-深度权衡

Liew Keong Han

发表机构 * Independent researcher(独立研究者)

AI总结 通过系统分析所有25个公开ARC-AGI-3游戏,发现它们均可通过非智能策略达到,并提出了一个三阶段认知主体AERA,在速度-深度权衡框架下形式化其性能。

Comments 22 pages, 3 figures. Code: https://github.com/farmountain/aera-arc3-paper (CC0)

AI中文摘要

我们系统研究了所有25个公开ARC-AGI-3游戏,发现每个游戏都可以通过非智能策略达到:10个通过单次盲步,5个通过一次探测动作,1个通过重复按ACTION1键,1个通过多样化探索,8个通过具有足够预算(50-200步)的单一重复动作。此外,一个库级别的空坐标漏洞使得18个游戏可以在1步内绕过。这一基准批评意味着公开评估集无法区分智能探索与琐碎启发式——私有的55游戏评估才是真正的智能测试。在此背景下,我们提出了AERA(自适应认知推理主体),一个三阶段(探索/验证/规划)主体,在Qwen2.5-0.5B上对这些25个游戏实现了RHAE=0.2116(4/25解决),而随机和无探索基线得分为0.0000。我们通过速度-深度权衡框架形式化AERA:在凸性假设下(附录中对一类环境证明),RHAE的二次形式作为对偏离动作效率与信息增益之间帕累托前沿的二阶惩罚。贡献:(i)基准有效性分析表明,当前交互式推理基准未能衡量它们声称所需的探索,以及(ii)探索前规划框架和模型能力×探索交互。链接的代码条目在完整的55游戏私有评估中实现了RHAE=0.30。代码:CC0。

英文摘要

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

2605.25920 2026-05-26 cs.CL cs.AI 版本更新

Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

LLM 能时间旅行吗?通过强化学习增强法律智能搜索中的时间一致性

Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song

发表机构 * Department of Computer Science and Engineering, HKUST, Hong Kong SAR, China(香港科技大学计算机科学与工程系) School of Law, Tsinghua University, Beijing, China(清华大学法学院) Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada(滑铁卢大学Cheriton计算机科学学院)

AI总结 提出 LegalSearch-R1 框架,结合本地法条 RAG 和在线搜索,通过强化学习在跨修订期数据上训练,以解决法律 LLM 的时间偏差和搜索代理缺乏时间约束的问题,在13项法律任务上超越现有方法。

Comments Under Review

AI中文摘要

虽然具备智能体搜索能力的大型语言模型(LLM)在法律推理方面显示出前景,但它们忽略了一个基本约束:适用法律必须与每个案件的时间背景相匹配,因为法条的溯及既往适用违反了核心法律原则并导致错误结论。我们的观察表明,当前的法律 LLM 存在锚定于其训练截止日期的时间偏差,搜索代理很少将时间约束纳入查询,并且仅靠网络搜索无法提供法律推理所需的精确法条和先例引用。为应对这些挑战,我们提出 LegalSearch-R1,一个端到端的强化学习框架,它将用于精确条文匹配的本地法条 RAG 与用于获取更广泛法律知识的在线网络搜索相结合,并在涵盖多个修订期的按时间索引的数据上训练,以强制执行时间一致性。在我们涵盖13项法律任务的基准上的大量实验表明,我们的7B参数代理以12.9%至29.8%的优势超越最先进的深度研究框架和专门的法律 LLM,在时间一致性上以57.7%至80.3%的优势超越基线,并展现出强大的域外泛化能力。代码和数据可在 https://github.com/AlexFanw/LegalSearch-R1 获取。

英文摘要

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally-indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.
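The temporal constraint at the heart of this abstract, that the applicable law must match the date of the case, can be illustrated with a small lookup. The sketch is a simplification under stated assumptions: the version labels, dates, and function name are hypothetical, and real statutes may carry transitional provisions that this rule ignores. It is not the LegalSearch-R1 retrieval code.

```python
from datetime import date

def statute_in_force(versions, case_date):
    """Pick the latest version whose effective date is not after the case date.

    Returns None when no version was in force yet. Assumption: each
    version simply replaces the previous one on its effective date.
    """
    eligible = [v for v in versions if v["effective"] <= case_date]
    if not eligible:
        return None
    return max(eligible, key=lambda v: v["effective"])

# Hypothetical amendment history of a single article.
versions = [
    {"label": "original", "effective": date(2010, 1, 1)},
    {"label": "amendment-1", "effective": date(2018, 7, 1)},
    {"label": "amendment-2", "effective": date(2023, 3, 1)},
]

chosen = statute_in_force(versions, date(2020, 5, 10))
too_early = statute_in_force(versions, date(2005, 1, 1))
```

A model anchored to its training cutoff would tend to return the newest version regardless of `case_date`; the lookup makes explicit why that is wrong for a 2020 case.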

2605.25893 2026-05-26 cs.AI 版本更新

$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

$D^2$-Monitor: 通过犹豫感知路由实现扩散LLM的动态安全监控

Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi

发表机构 * Torr Vision Group, University of Oxford(牛津大学Torr视觉组) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对扩散大语言模型的安全监控问题,提出基于犹豫感知路由的双层动态监控框架$D^2$-Monitor,通过轻量级探针实时估计犹豫度并触发高容量探针,在3个数据集上以0.85M参数达到最优性能与效率平衡。

AI中文摘要

尽管扩散大语言模型(D-LLMs)作为自回归大语言模型(AR-LLMs)的替代方案已经出现,但D-LLMs的安全监控在很大程度上仍未得到探索。与AR-LLMs不同,D-LLMs通过多步去噪过程生成文本,暴露了中间隐藏表示,这些表示可能包含标准单步监控设置中无法获得的安全相关信息。受轻量级探针适用于始终在线监控的启发,我们分析了哪些轨迹级信号最能指示此类探针可能遇到困难。我们发现,信息量最大的信号是安全犹豫度:中间隐藏状态反复落在探针决策边界的小范围内。D-LLM轨迹中此类犹豫步的数量能有效预测探针失败,提供了样本难度的代理指标。基于此分析,我们提出了$D^2$-Monitor,一种针对D-LLMs的双层安全监控器。$D^2$-Monitor采用轻量级探针作为始终在线监控器,以联合估计犹豫度并执行基础分类。当犹豫度超过阈值时,激活更具表现力但计算量更大的探针。这种动态路由机制在测试时高效分配监控资源。在4个D-LLM上的3个数据集(WildguardMix、ToxicChat、OpenAI-Moderation)上评估,$D^2$-Monitor以紧凑的参数规模(≤0.85M参数)实现了最先进的性能,并且相对于8个基线方法,在有效性和效率之间取得了最佳权衡。

英文摘要

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in D-LLM's trajectory predicts probe failure effectively, providing a proxy of sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
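The routing rule described above, counting denoising steps whose probe score falls near the decision boundary and escalating when the count is too high, can be sketched as follows. All numeric values (`boundary`, `margin`, `max_hesitation`), the function names, and the stand-in heavy probe are assumptions for illustration. In the paper both probes read hidden states, whereas this toy version works on precomputed light-probe scores.

```python
def count_hesitation_steps(probe_scores, boundary=0.5, margin=0.1):
    """Number of denoising steps whose score lies within `margin` of the boundary."""
    return sum(1 for score in probe_scores if abs(score - boundary) <= margin)

def route(probe_scores, heavy_probe, max_hesitation=2, boundary=0.5, margin=0.1):
    """Return (is_unsafe, which_probe_decided).

    The light probe decides from the final step unless the trajectory
    hesitates more than `max_hesitation` times, in which case the
    costlier `heavy_probe` callable is invoked.
    """
    hesitation = count_hesitation_steps(probe_scores, boundary, margin)
    if hesitation > max_hesitation:
        return heavy_probe(probe_scores), "heavy"
    return probe_scores[-1] >= boundary, "light"

def toy_heavy_probe(probe_scores):
    """Hypothetical stand-in for the expressive probe: average the trajectory."""
    return sum(probe_scores) / len(probe_scores) >= 0.5

# Illustrative light-probe scores across denoising steps.
confident_trajectory = [0.90, 0.85, 0.95, 0.92]
hesitant_trajectory = [0.45, 0.55, 0.52, 0.48, 0.58]

confident_result = route(confident_trajectory, toy_heavy_probe)
hesitant_result = route(hesitant_trajectory, toy_heavy_probe)
```

Only the hesitant trajectory pays for the heavy probe, which is how the design keeps average monitoring cost low while spending more on hard samples.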

2605.25891 2026-05-26 cs.CL cs.AI 版本更新

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

因果舌结:LLMs 能编码因果方向,但其是/否输出无法表达

Ziyi Ding, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院)

AI总结 研究发现大语言模型在因果问题上存在内部编码与输出不匹配的现象,通过线性探针可从隐藏状态恢复证据支持的答案(准确率约0.97),但口头是/否回答却退化为常识答案(准确率约0.5),揭示了约+0.5的差距,称为“因果舌结”。

AI中文摘要

我们发现大语言模型关于因果问题所编码的内容与其回答之间存在不匹配。在反常识的 CLadder 项目上,固定的线性探针从模型隐藏状态中恢复出证据支持的答案(准确率约0.97),而口头的是/否回答则退化为常识答案(准确率约0.5)。我们将这约+0.5的差距称为“因果舌结”:错误的“是/否”回答可分解为两种可分离的失败模式——没有内部信号,或者口头接口无法表达的信号。这一发现对仅基于输出的因果基准测试具有双向影响:基准测试“正确”不一定意味着模型理解了,基准测试“错误”也不一定意味着模型不能理解。基于单一准确率数字得出的关于 LLMs 是否能够进行因果推理的笼统论断,值得重新审视。

英文摘要

We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model's hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). We call this approximately +0.5 gap Causal Tongue-Tie: a wrong Yes/No decomposes into two separable failure modes: no internal signal versus a signal the verbal interface cannot say. The implication cuts both ways for output-only causal benchmarks: a benchmark "correct" need not mean the model has understood, and a benchmark "wrong" need not mean it cannot. Sweeping claims about whether LLMs can do causal reasoning, drawn from a single accuracy number, deserve a second look.

2605.25856 2026-05-26 cs.HC cs.AI 版本更新

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

解释过多?理解大型语言模型推理轨迹如何影响性能和元认知

Daniela Fernandes, Daniel Buschek, Lev Tankelevitch, Thomas Kosch, Robin Welsch

发表机构 * Aalto University(阿尔托大学) University of Bayreuth(拜罗伊特大学) Microsoft Research Cambridge(微软剑桥研究院)

AI总结 通过用户实验,研究大型语言模型展示推理轨迹(完整或摘要)对任务性能、信任、愉悦感和自我评估校准的影响,发现轨迹提升主观体验但无性能增益,且导致过度自信。

Comments 27 pages, 5 figures, 9 tables

AI中文摘要

大型语言模型界面日益冗长,在最终答案之外还展示中间推理轨迹。轨迹通常被视为一种透明机制,但尚不清楚人们如何利用它们解决问题。我们报告了一项预注册的组间研究(N = 559),参与者在三种条件之一下解决十个LSAT式推理问题:仅答案基线、答案前显示完整轨迹、答案旁显示摘要轨迹。摘要轨迹使任务性能保持在无轨迹基线水平,同时显著提升了信任和愉悦感,表明展示轨迹改变了对交互的主观评价,但未带来性能收益。在使用输出冗长中间内容的开放权重推理模型时,完整轨迹相对于仅答案基线还损害了性能。在所有条件下,参与者都大幅高估了自己的表现,且没有任何轨迹格式有助于校准的自我评估。进一步分析表明,通向高估的间接路径由愉悦感(而非信任)承载,这与处理流畅性解释一致。推理轨迹最好被理解为面向用户的界面工件,而非洞察模型认知的透明窗口;校准不太可能从轨迹本身产生,最好通过先引导用户自行推理的交互来支撑。

英文摘要

Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summaries preserved task performance at the no-trace baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without bringing performance benefits. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition, and calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users' own reasoning first.

2605.25854 2026-05-26 cs.AI 版本更新

From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

从核算到协调:面向数据中心调度的虚拟水感知电-算-水关联框架

Haiyang You, Chengwei Lou, Jin Zhao, Yue Zhou, Lu Zhang, Jin Yang

发表机构 * International Energy Agency(国际能源署) IEEE

AI总结 提出一个将虚拟水影响内化到电力系统调度的可微优化框架,通过深度学习实现端到端协调策略学习,在IEEE 30/118节点系统上实现约3-5%的淡水取水减少。

AI中文摘要

数据中心的扩张驱动了电力需求的持续增长以及发电站点相关的取水量增加。这些取水发生在发电站点,并根据网络潮流虚拟分配给负荷。因此,特定负荷的实际水足迹随发电调度和网络条件动态变化。现有方法通常依赖静态统计核算来量化这些水足迹。然而,这种静态方法无法捕捉调度优化和负载迁移如何动态影响取水。结果,静态统计核算方法仍与优化过程脱节,无法指导负载迁移或电力调度以缓解水压力。为解决这一局限,本文开发了一个可运行的电-算-水(ECW)关联框架,将虚拟水影响直接内化到电力系统调度中。该框架将调度优化表示为嵌入深度学习架构中的可微优化层,能够在保持运行可行性的同时实现协调策略的高效端到端学习。结合不动点协调,该框架确保了虚拟水归因与物理发电侧取水之间的一致性。在IEEE 30节点和118节点测试系统上的案例研究展示了可靠的收敛性、精确的功率-水一致性,以及在受水约束条件下发电相关淡水取水减少约3-5%。

英文摘要

The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a specific load varies dynamically with generation dispatch and network conditions. Existing approaches typically rely on static statistical accounting to quantify these water footprints. However, such static methods fail to capture how dispatch optimization and workload relocation dynamically affect water withdrawals. As a result, static statistical accounting approaches remain decoupled from the optimization process, rendering them incapable of guiding workload relocation or power dispatch to mitigate water stress. To address this limitation, this paper develops an operational electricity-computation-water (ECW) nexus framework that internalizes virtual water impacts directly into power system dispatch. The framework represents dispatch optimization as a differentiable optimization layer embedded within a deep learning architecture, enabling efficient end-to-end learning of coordination policies while preserving operational feasibility. Combined with fixed-point coordination, the framework enforces consistency between virtual water attribution and physical generation-side withdrawals. Case studies on the IEEE 30-bus and 118-bus test systems demonstrate reliable convergence, exact power-water consistency, and reductions of approximately 3-5% in generation-related freshwater withdrawals under water-constrained conditions.

2605.25850 2026-05-26 cs.CL cs.AI cs.LG 版本更新

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

TIAR:基于轨迹信息的优势重加权用于大语言模型弃权学习

Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh, Vijaykrishnan Narayanan, Rui Zhang

发表机构 * Department of Computer Science, The Pennsylvania State University(宾夕法尼亚州立大学计算机科学系)

AI总结 本文提出TIAR方法,利用GRPO中的多条轨迹作为自然弃权信号,动态重加权弃权奖励,在六个评估类别中的五个上取得最优弃权F1分数,同时保持基线准确率。

Comments 10 pages, 1 figure, 4 tables

AI中文摘要

本文研究大语言模型(LLM)的弃权学习,特别是使用三元奖励来激励大语言模型中的真实性。本文将该思想从三元奖励扩展到基于轨迹信息的优势重加权(Trajectory-Informed Advantage Reweighting),在组相对策略优化(GRPO)训练期间动态重加权弃权奖励。本工作的目标聚焦于弃权学习而非提升真实性,作为减少幻觉的探索。本文的新颖之处在于方法论创新、优势重加权和基准选择。利用GRPO的多条轨迹作为自然弃权信号,该方法使用奖励信号探索知识边界并鼓励一致性。通过证明轨迹可以作为策略相对于查询的置信度指标,进而用于动态计算弃权优势。使用AbstentionBench作为评估基准,因为本工作旨在为弃权学习领域做出贡献。对该基准上的所有数据集,均使用本方法和各种基线进行了测试。实证结果表明,TIAR在六个评估类别中的五个上取得了最优弃权F1分数,在31个基准数据集中的17个上优于静态三元基线,同时完全保持基线准确率。

英文摘要

This paper investigates abstention learning for large language models (LLMs), building on ternary rewards, which incentivize truthfulness in LLMs. It extends that idea by moving from a ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically reweights the abstention reward during Group Relative Policy Optimization (GRPO) training. The work focuses on abstention learning rather than on improving truthfulness, serving as an exploration into hallucination reduction. The novelty of this paper lies in methodological innovation, advantage reweighting, and benchmark selection. Leveraging GRPO's multiple trajectories as a natural abstention signal, the method uses a reward signal to explore knowledge boundaries and encourage consistency. After demonstrating that trajectories can serve as a confidence indicator of the policy relative to the query, the method uses them to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. The method and various baselines were evaluated on all datasets in the benchmark. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.

2605.25848 2026-05-26 cs.LG cs.AI 版本更新

Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams

几何演化图:从Transformer残差流中提取稳定概念探针

James Henry

发表机构 * Independent Researcher(独立研究者)

AI总结 提出几何演化图(GEM)方法,通过追踪残差流中概念的方向轨迹并识别旋转停止的交接层,提取稳定的概念探针,在391个概念×模型对中优于峰值层探针的比例达66.2%。

Comments 24 pages, 3 figures. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433)

AI中文摘要

从Transformer残差流中提取的概念探针的可靠性取决于提取层。常见的做法是在固定的后期层或分离得分函数的峰值处进行探测,这忽略了一个基本的结构特征:概念表示在其组装阶段经历显著的方向旋转,直到主要概念分配区(CAZ)之后的一个特征交接层才稳定下来。我们引入了几何演化图(GEM),它通过残差流激活追踪概念的完整方向轨迹,识别旋转停止的交接层,并从该层提取稳定的探针方向。在跨越70M到14B参数的23种架构和17种概念类型中,CAZ内入口到出口的余弦相似度平均为0.233,表明CAZ入口处的探针方向不能可靠地预测出口处的探针方向。在391个概念×模型对(23个模型×17个概念)上的消融实验表明,GEM提取的探针在268/391次试验(68.5%)中至少与峰值层探针一样精确,并在259/391次试验(66.2%)中严格优于峰值层探针。架构差异显著:MHA模型在173/221次试验(78.3%)中偏好交接层;GQA模型仅在56/119次试验(47.1%)中偏好交接层。模型级Wilcoxon检验:W=214, N=23, p=0.010(单侧)。一个自适应消融宽度规则针对79/391个近最终层情况:在60/79个触发情况(75.9%)中提高了探针质量,平均增益+7.44个百分点。方向特异性控制证实消融效果是概念方向特异性的:与随机方向消融相比,中位数抑制率为377倍(99.1%的概念方向击败了所有10个随机种子)。参考实现:rosetta_tools v1.3.1(doi:10.5281/zenodo.20361433)。

英文摘要

Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo substantial directional rotation during their assembly phase, and do not settle into a stable direction until a characteristic handoff layer after the primary Concept Allocation Zone (CAZ). We introduce Geometric Evolution Maps (GEMs), which track the full directional trajectory of a concept through residual stream activations, identify the handoff layer where rotation ceases, and extract the settled probe direction from that layer. Across 23 architectures spanning 70M to 14B parameters and 17 concept types, the entry-to-exit cosine similarity within CAZs has a mean of 0.233, showing that probe direction at CAZ entry does not reliably predict probe direction at exit. Ablation experiments across 391 concept x model pairs (23 models x 17 concepts) show that GEM-extracted probes are at least as precise as peak-layer probes in 268/391 trials (68.5%), and strictly outperform in 259/391 (66.2%). The architecture split is pronounced: MHA models favour the handoff in 173/221 trials (78.3%); GQA models favour the handoff in only 56/119 trials (47.1%). Model-level Wilcoxon: W=214, N=23, p=0.010 (one-sided). An adaptive ablation width rule targets the 79/391 near-final-layer cases: it improves probe quality in 60/79 triggered cases (75.9%), mean gain +7.44pp. A direction-specificity control confirms the ablation effect is concept-direction specific: median 377x suppression rate versus random-direction ablation (99.1% of concept directions beat all 10 random seeds). Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
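The notion of a handoff layer, the point after which a concept direction stops rotating, can be illustrated with consecutive-layer cosine similarity. This is a minimal sketch under stated assumptions: the stability threshold `stable`, the stopping rule, and the toy 2D directions are hypothetical, and the reference implementation in rosetta_tools may define the handoff differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def find_handoff_layer(directions, stable=0.99):
    """First layer from which every later layer-to-layer step has cosine >= stable.

    Assumption: `directions[i]` is the concept direction extracted at
    layer i. Falls back to the last layer if the direction never settles.
    """
    sims = [cosine(directions[i], directions[i + 1]) for i in range(len(directions) - 1)]
    for layer in range(len(sims)):
        if all(s >= stable for s in sims[layer:]):
            return layer
    return len(directions) - 1

# Toy directions: the concept rotates over layers 0-2, then settles.
directions = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0), (0.02, 1.0), (0.01, 1.0)]

handoff = find_handoff_layer(directions)
entry_to_exit = cosine(directions[0], directions[-1])
```

In the toy data the entry and exit directions are nearly orthogonal, mirroring the abstract's point that a probe taken early in the rotation says little about the settled direction.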

2605.25836 2026-05-26 cs.CR cs.AI cs.CL 版本更新

TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification

TTPrint:通过发散-收敛验证实现基于证据的TTP提取

Yutong Cheng, Changze Li, Raihan Sultan Pasha Basuki, Qian Cui, Wei Ding, Peng Gao

发表机构 * Virginia Tech(弗吉尼亚理工大学) Universitas Ary Ginanjar(Ary Ginanjar大学) Amazon(亚马逊)

AI总结 提出TTPrint方法,采用先广泛提取后严格验证的发散-收敛设计,结合确定性证据定位与权威定义验证,在文档级TTP提取任务上显著提升宏F1分数。

Comments Preprint

AI中文摘要

从网络威胁情报(CTI)报告中提取MITRE ATT&CK技术是一个开放集、多标签问题,需要高召回率(不遗漏技术)和高精确率(不虚构未支持的技术)。现有方法——基于规则、监督学习和基于LLM的方法——难以同时实现两者:基于规则和监督方法缺乏跨多种攻击描述的泛化能力,而基于LLM的方法将候选生成和验证耦合在单一推理步骤中,导致召回率和精确率同时受限。我们提出TTPrint,通过受人类分析师工作方式启发的发散-收敛设计来解决这一挑战:首先广泛提取,然后严格验证。在发散阶段,报告被分解为原子行为,并广泛提出候选技术。然后,确定性跨度定位阶段将每个候选锚定到源文本中的特定证据窗口。收敛验证阶段仅保留由定位证据和权威MITRE定义支持的候选。我们贡献了两个评估资源——清理后的TRAM基准(TRAM-Clean)和一个新的注释数据集(TTPrint-Bench)——以解决现有基准中的已知注释噪声,并将任务提升到文档级TTP提取。在TRAM-Clean和TTPrint-Bench上,TTPrint分别达到76.48%和87.39%的宏F1,比领先基线高出63.5%和29.4%。跨六个LLM的多骨干分析和阈值敏感性研究进一步证明了跨模型选择的泛化能力,并为参数选择提供了实用指导。

English abstract

Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high recall (not missing techniques) and high precision (not hallucinating unsupported ones). Existing methods--rule-based, supervised, and LLM-based--struggle to achieve both: rule-based and supervised approaches lack generalizability across diverse attack descriptions, while LLM-based approaches that couple candidate generation and validation within a single inference step suffer from limited recall and precision simultaneously. We propose TTPrint, which addresses this challenge through a diverge-then-converge design inspired by how human analysts work: first extracting broadly, then verifying rigorously. In the divergent phase, reports are decomposed into atomic behaviors and candidate techniques are proposed broadly. A deterministic span localization stage then anchors each candidate to a specific evidence window in the source text. A convergent verification stage retains only candidates supported by both the localized evidence and the authoritative MITRE definition. We contribute two evaluation resources--a cleaned TRAM benchmark (TRAM-Clean) and a new annotated dataset (TTPrint-Bench)--to address known annotation noise in existing benchmarks and elevate the task to document-level TTP extraction. On TRAM-Clean and TTPrint-Bench, TTPrint achieves 76.48% and 87.39% macro-F1 respectively, outperforming the leading baseline by 63.5% and 29.4%. A multi-backbone analysis across six LLMs and a threshold sensitivity study further demonstrate generalizability across model choices and provide practical guidance for parameter selection.
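The macro-F1 figures above can be made concrete with a small sketch of macro-averaged F1 for multi-label, document-level predictions. It assumes the average is taken over technique labels and that a label with no true positives scores 0; the abstract does not state the exact averaging convention, and the function name and toy technique IDs are illustrative only.

```python
def macro_f1(gold, predicted):
    """Macro-averaged F1 over labels for multi-label predictions.

    gold, predicted: lists of label sets, one set per document.
    """
    labels = set().union(*gold, *predicted)
    scores = []
    for label in sorted(labels):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); 0.0 when the label has no true positives.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Two toy reports with gold and predicted technique sets.
gold = [{"T1059", "T1566"}, {"T1059"}]
predicted = [{"T1059"}, {"T1059", "T1003"}]
score = macro_f1(gold, predicted)
```

Here one missed technique and one unsupported technique each pull a per-label F1 to zero, so the macro average penalizes both recall and precision failures equally.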

2605.25835 2026-05-26 cs.LG cs.AI Updated version

Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation

面向Kubernetes清单生成的上下文-工具数据蒸馏方法及实验评估

Andrey Kozachok, Anatoliy Bakaev, Aleksandr Kozachok, Shamil Magomedov, Artem Noev

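Affiliations: RTU MIREA

AI summary: Proposes context-instrumental data distillation, which builds a corpus through synthetic generation and reverse instruction generation, filters it with external validators, and fine-tunes a 1.5B-parameter small language model under resource constraints to generate Kubernetes manifests; the experiments indicate that strict output formatting matters more than adding training examples.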

Comments 15 pages, 4 figures, 2 tables

AI-generated Chinese abstract

本文研究了参数高达40亿的小语言模型(SLM)在领域特定语言(DSL)中生成工件的专业化。选择Kubernetes清单作为目标领域。我们提出了上下文-工具数据蒸馏方法:源语料库通过合成生成形成,在扩展方案中通过从真实Kubernetes YAML文件进行反向指令生成,仅当通过外部验证器并匹配领域上下文模型时,才将配对包含在训练中。与经典的KL散度知识蒸馏不同,基线实现简化为在工具验证示例上进行监督微调。实验部分在资源受限条件下展示了试点实现:DeepSeek-V4 Flash API作为教师模型进行合成生成,而Qwen2.5-Coder-1.5B-Instruct通过LoRA在CPU上进行微调。在K8s-Distill-Pilot语料库(训练1200,验证100,测试200)上,我们以更严格的提示公式和max_new_tokens=768实现了full-pass@1 = 91.5%(183/200)。关键经验发现是,对于Kubernetes YAML,试点中的结果质量更多地取决于严格的输出格式要求,而不是简单地增加训练样本数量。

English abstract

This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-specific languages (DSL). Kubernetes manifests are chosen as the target domain. We propose the context-instrumental data distillation method: the source corpus is formed through synthetic generation and, in an extended scheme, through reverse instruction generation from real Kubernetes YAML files, with pairs included in training only upon passing external validators and matching the domain context model. Unlike classical KL-divergence knowledge distillation, the baseline implementation reduces to supervised fine-tuning on instrumentally verified examples. The experimental section presents a pilot implementation under resource-constrained conditions: the DeepSeek-V4 Flash API serves as the teacher for synthetic generation, while Qwen2.5-Coder-1.5B-Instruct is fine-tuned via LoRA on CPU. On the K8s-Distill-Pilot corpus (train_1200, validation_100, test_200), we achieved full-pass@1 = 91.5% (183/200) with a stricter prompt formulation and max_new_tokens=768. The key empirical finding is that for Kubernetes YAML, result quality in the pilot depended more on strict output format requirements than on simply increasing the number of training examples.

2605.25832 2026-05-26 cs.RO cs.AI cs.CL cs.CV Updated version

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

当搜索成为记忆:将机器人设计试验转化为可迁移技能

Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

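Affiliations: University of Michigan

AI summary: Proposes Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into a natural-language skill library, yielding transferable robot design knowledge; on EvoGym tasks it improves cold-start search and transfers skills across design spaces.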

Comments 20 pages, 8 figures

AI-generated Chinese abstract

大型语言模型(LLMs)越来越多地被用作进化机器人设计的提案生成器,但大多数循环仍然是无记忆的:模拟结果塑造下一代种群,但并未作为可复用的设计知识保留。我们提出Auto-Robotist,一种自进化的LLM代理,它将形态搜索轨迹提炼为显式的自然语言技能库。每个技能存储结构原型、基于证据的正负规则以及支持它们的评估设计,使设计记忆可检查而非隐含在种群中。在搜索过程中,代理检索技能以调节LLM对精英主体的编辑,同时保留遗传算法(GA)突变路径以进行探索;评估后,通过添加、诊断和合并更新库。在涵盖运动、穿越和物体交互的七个EvoGym任务中,Auto-Robotist改善了冷启动5x5搜索,并将学到的技能迁移到10x10设计空间,其中参考条件迁移在每个任务上都优于GA。这些结果表明,LLM代理可以将昂贵的物理评估转化为可复用、可审计的设计原则。我们的代码将在接收后发布。

English abstract

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

2605.25831 2026-05-26 cs.CL cs.AI cs.LG Updated version

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

澄清、弃权或回答?基于信念增强生成的对话策略

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

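Affiliations: University of Amsterdam; MCML Munich; LMU Munich

AI summary: Proposes Belief-Augmented Generation (BAG), which injects an LLM's own belief state into the prompt so that the model reasons over multiple sampled responses and decides on a conversational strategy (answer, clarify, or abstain), improving accuracy on multi-turn ambiguous QA and the faithfulness of strategy decisions.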

AI-generated Chinese abstract

大语言模型(LLMs)定义了文本上的分布,这可以视为不确定性的概率表示:采样K个响应会产生一个信念状态——模型认为合理的响应。现有工作利用这种表示进行解码或选择性预测等狭窄任务,通常需要手动干预,无法直接控制生成。我们提出信念增强生成(BAG):通过提示将LLMs锚定在其自身的信念状态中,并让它们推理这K个样本以决定对话策略:回答、澄清或弃权。在多轮模糊问答设置中,我们发现LLMs默认很少澄清或弃权,忽略了关于输入或事实的不确定性。BAG在六个模型上提高了问答准确性,并产生了比仅提示基线更忠实于信念状态的策略决策。然而,区分何时澄清与何时弃权仍然具有挑战性。

English abstract

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state, the set of responses a model deems plausible. Existing work exploits this representation for narrow tasks such as decoding or selective prediction, and often requires manual interventions rather than controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.

2605.25829 2026-05-26 cs.RO cs.AI Updated version

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

OASIS: 通过SE(3)轨迹预测实现机器人操作中的观测-动作空间对齐

Xinzhe Chen, Sihua Ren, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan

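Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University

AI summary: Proposes OASIS, a visuomotor policy that aligns the intermediate representation with the action space via SE(3) end-effector trajectory prediction, outperforming VLA and WAM baselines in simulation and real-world experiments.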

AI-generated Chinese abstract

最近的视觉-语言-动作(VLA)模型和世界动作模型(WAMs)通过用辅助空间特征或未来视觉状态预测丰富中间表示来推进机器人操作。然而,这些表示在很大程度上仍停留在观测空间内,不共享动作空间的刚体几何,迫使动作解码器隐式恢复该几何。我们提出OASIS,一种通过$SE(3)$末端执行器轨迹预测将中间表示与动作空间对齐的视觉运动策略。OASIS将融合视觉-语言和度量深度特征的3D感知特征编码器与生成相机帧末端执行器轨迹的$SE(3)$轨迹预测器耦合。以预测器的姿态监督隐藏状态为条件,动作解码器生成与刚体运动一致的动作块。在仿真和真实世界实验中,OASIS在成功率和分布外泛化方面优于VLA和WAM基线。我们的项目页面位于https://npuhandsome.github.io/OASIS_web。

English abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.

2605.25816 2026-05-26 cs.CL cs.AI Updated version

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

超越架构复杂性的微调:基于DeBERTa的PIIBench广泛覆盖PII检测

Pritesh Jha

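AI summary: Fine-tunes DeBERTa for broad-coverage PII detection on the multi-source PIIBench data covering 82 entity types; direct fine-tuning clearly outperforms the architecturally more complex hierarchical model and the curriculum extension in F1.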

AI-generated Chinese abstract

个人身份信息(PII)检测系统通常在狭窄的源或领域边界内训练,当部署在异构文本上时覆盖范围有限。我们研究了在修正后的多源PIIBench准备数据上的模型微调,该数据跨越十个源数据集,涵盖82种保留实体类型。我们评估了三种基于DeBERTa的方法:直接令牌分类微调、源条件层次模型(SC+H)和三阶段课程扩展(SC+H+Curr)。在可重复的5,000条记录保留子集(test_5k)上,与八个已发表的比较系统相比,直接微调的DeBERTa达到F1 0.6476,而SC+H和课程变体分别达到0.5899和0.2772;最强的已发表比较系统仅达到0.1723。由于验证最初偏向SC+H,我们在完整的100,002条记录保留分割上进行了最终的流式评估。直接微调仍然优越,达到F1 0.6455,而SC+H为0.5894。实体级分析表明,直接微调在82个细粒度实体类型中的54个和所有十个粗粒度组中获胜(按支持加权实体F1),而SC+H在28个类型上保持局部优势。结果表明,多样化的任务特定训练数据和简单的加权交叉熵目标对广泛覆盖的PII检测的贡献大于所测试的架构和课程复杂性。

English abstract

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine-tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.

2605.25814 2026-05-26 cs.CL cs.AI Updated version

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

自适应图优化与基于大语言模型的标签传播用于经济高效实体解析

Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke

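Affiliations: Hong Kong Baptist University; Zhejiang University

AI summary: Proposes Alper, a framework that unifies matching and clustering through iterative probabilistic label propagation, adaptively combining weak signals from graph propagation with strong LLM queries and maximizing marginal gain under a budget constraint for cost-effective entity resolution.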

AI-generated Chinese abstract

脏实体解析(ER)从单个杂乱数据集中识别指向同一真实世界实体的记录,是数据管理和挖掘中的基本任务。然而,ER的主流阻塞-匹配-聚类范式存在严重缺陷。其级联、解耦的工作流本质上生成一个静态、稀疏的图,由于阻塞失败导致缺失边,由于匹配错误导致噪声链接,造成错误传播并产生次优聚类,特别是在聚类中施加严格传递性时。我们认为匹配和聚类本质上是协同的,两者都优化理想实体图的构建。基于这一见解,我们提出Alper,一个统一框架,将这些步骤整合为在全局、演化图上的迭代概率标签传播过程。与分离的阻塞不同,Alper通过自适应地整合来自图传播的“弱但廉价”信号与基于LLM的“强但昂贵”成对查询,动态优化图结构和标签。为了提高成本效益,我们将信号选择形式化为在查询预算下最大化累积边际增益的约束优化问题,通过我们的贪心算法求解,并具有可证明的理论保证。我们在八个基准数据集上的广泛实验表明,Alper始终优于最先进的级联流水线。

English abstract

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
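The budgeted selection step described above can be pictured with a generic greedy rule: spend the query budget on the candidate pairs with the highest estimated gain per unit cost. This is a simplified sketch under stated assumptions, not Alper's algorithm: gains are treated as fixed numbers rather than marginal gains that change after each query, and the function name, tuple layout, and toy values are hypothetical.

```python
def greedy_select(candidates, budget):
    """Pick queries in decreasing gain-per-cost order until the budget runs out.

    candidates: list of (name, estimated_gain, cost) tuples with cost > 0.
    Returns the chosen names and the total cost spent.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, spent = [], 0.0
    for name, gain, cost in ranked:
        if gain <= 0:
            break  # everything after this point has non-positive gain per cost
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

# Toy record pairs with an estimated benefit of asking the LLM about each.
candidates = [
    ("a-b", 0.9, 1.0),
    ("c-d", 0.2, 1.0),
    ("e-f", 0.6, 1.0),
    ("g-h", 0.5, 2.0),
]
chosen, spent = greedy_select(candidates, budget=2.0)
```

Pairs left unqueried would fall back on the cheaper propagated signal, which is the trade-off the abstract calls "weak but cheap" versus "strong but expensive".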

2605.25794 2026-05-26 cs.AI Updated version

When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

何时可以信任早期预警?从 LMS 交互日志中排除泄漏的早期结果预测

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

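Affiliations: Gamaizer; Université de technologie de Compiègne, CNRS, Heudiasyc; Sorbonne Université, CNRS UMR 7585, LPMHE

AI summary: To address early-prediction results from learning management system logs being overestimated because of temporal leakage, proposes LEAP (Leakage-Excluded Early-Availability Protocol), which uses cutoff-first truncation and feature-provenance audits to keep post-cutoff evidence out of the benchmark, and evaluates several methods on the OULAD dataset.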

AI-generated Chinese abstract

基于学习管理系统(LMS)日志构建的早期预警模型旨在尽早预测课程结束结果,以便及时提供学习者支持。然而,报告的“早期”性能常常因时间泄漏而被夸大。当流程使用了在预测时尚未可用的信息时,就会发生这种情况。我们在时间可用性约束下形式化了基于截止点的早期结果预测,并引入了 LEAP(排除泄漏的早期可用性协议),该协议在连接和聚合之前强制执行截止优先截断,并审计特征来源以防止后截止证据进入基准。我们在公共开放大学学习分析数据集(OULAD)上实例化 LEAP,作为跨周截止点的泄漏控制评估的多步骤协议。使用几种标准学习方法,我们通过 ROC-AUC、PR-AUC、Brier 分数和 F1@0.5 评估性能。结果显示,随着观察窗口的扩大,性能提高,在第 3 周左右有显著提升;随机森林在最早截止点表现最佳,而梯度提升在此后占主导地位。泄漏消融进一步表明,时间违规,特别是通过评估信息,可能会夸大表观的“早期”性能。

English abstract

Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely learner support. However, reported "early" performance is often inflated by temporal leakage. This occurs when the pipeline uses information that would not yet be available at the time of prediction. We formalize cutoff-based early outcome prediction under a temporal availability constraint and introduce LEAP (Leakage-Excluded Early-Availability Protocol), which enforces cutoff-first truncation prior to joins and aggregation and audits feature provenance to prevent post-cutoff evidence from entering the benchmark. We instantiate LEAP on the public Open University Learning Analytics Dataset (OULAD) as a multi-step protocol for leakage-controlled evaluation across weekly cutoffs. Using several standard learning methods, we evaluate performance using ROC-AUC, PR-AUC, Brier score, and F1@0.5. Results show improving performance as the observation window expands, with a marked gain around week 3; Random Forest performs best at the earliest cutoffs, while Gradient Boosting dominates thereafter. Leakage ablations further show that temporal violations, especially through assessment information, can inflate apparent "early" performance.
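Cutoff-first truncation, as described in this abstract, means filtering events by the prediction cutoff before any aggregation, so that later activity cannot reach the features. A minimal sketch, assuming events are `(student_id, day, clicks)` tuples; the function name and data layout are hypothetical and do not reflect the OULAD schema or the authors' pipeline.

```python
def early_features(events, cutoff_day):
    """Aggregate per-student click totals using only events up to cutoff_day.

    events: iterable of (student_id, day, clicks) tuples.
    Truncation happens before aggregation, so later events cannot leak in.
    """
    visible = [e for e in events if e[1] <= cutoff_day]  # cutoff-first truncation
    totals = {}
    for student_id, _day, clicks in visible:
        totals[student_id] = totals.get(student_id, 0) + clicks
    return totals

events = [("s1", 3, 10), ("s1", 20, 50), ("s2", 5, 7), ("s2", 40, 1)]
features_day7 = early_features(events, cutoff_day=7)
features_day21 = early_features(events, cutoff_day=21)
```

Aggregating first and filtering afterwards would fold the day-40 event into a total that cannot be separated again, which is the kind of leakage the protocol is meant to rule out. Students with no events before the cutoff are simply absent from the result in this sketch.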

2605.25789 2026-05-26 cs.LG cs.AI cs.IT math.IT stat.ML Updated version

On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

关于自由探索对多臂老虎机遗憾最小化的益处

Yunlong Hou, Zixin Zhong, Vincent Y. F. Tan

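Affiliations: Department of Mathematics, National University of Singapore; Department of Mathematics and Department of Electrical and Computer Engineering, National University of Singapore

AI summary: Studies multi-armed bandits in which cumulative regret is minimized after an initial free exploration phase, proposes the two-phase algorithm UFE-KLUCB-H, and proves that it incurs strictly less regret than policies without free exploration.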

Comments 55 pages

AI-generated Chinese abstract

我们研究了一个随机多臂老虎机问题,其中智能体在遗憾累积之前被授予一个自由探索预算,这是经典遗憾最小化或纯探索范式未涵盖的设置。目标是设计一个自适应策略,在初始自由探索阶段策略性地探索老虎机实例,并在后续阶段最小化累积遗憾。我们形式化了这个带有自由探索的遗憾最小化问题,并识别出一个有趣的区间,其中自由探索预算与时间范围成对数比例。为了量化由于自由探索阶段的可用性而高概率节省的遗憾量,我们引入了一类新的策略,称为$(α,β)$-可能节省策略。我们提出了一种两阶段、可能节省的算法UFE-KLUCB-H,它由一个原则性的自由探索策略UFE和一个历史感知的遗憾最小化策略KLUCB-H组成。推导了UFE-KLUCB-H的实例相关上界,表明UFE-KLUCB-H累积的遗憾严格少于无法访问自由探索阶段的策略。作为补充,我们基于针对自由探索环境定制的多实例扰动论证推导了实例相关下界,证明了UFE-KLUCB-H对于二值老虎机的近乎最优性。我们的上界和下界揭示了累积遗憾中依赖于可用自由探索量的尖锐相变。进行了仿真,表明算法中的强制探索和自适应性导致了更大的遗憾节省。

English abstract

We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(α,β)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.
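For readers unfamiliar with the KL-UCB family that the algorithm's name refers to, the sketch below computes the standard KL-UCB index for Bernoulli rewards by bisection. It is the textbook index only: it does not implement the history-aware KLUCB-H policy or the UFE exploration phase, and the function names, the clamping constant, and the handling of the log-log term for small `t` are choices made for this illustration.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t, c=0.0, iters=50):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= log(t) + c * log(log(t)).

    Assumes t >= 1; the log-log term is treated as 0 while log(t) <= 1.
    """
    if pulls == 0:
        return 1.0  # an unpulled arm gets the maximal index
    level = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= level:
            lo = mid  # mid still satisfies the constraint, so move up
        else:
            hi = mid
    return lo

# Same empirical mean, different sample counts: fewer pulls give a wider bound.
index_few = klucb_index(0.5, pulls=10, t=100)
index_many = klucb_index(0.5, pulls=100, t=100)
```

Pulls collected during a free exploration phase would raise `pulls` without adding regret, which tightens the index before the regret-counting phase begins; how the paper's policy uses that history is not reproduced here.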

2605.25786 2026-05-26 cs.LG cs.AI Updated version

NPSolver: Neural Poisson Solver with Iterative Physics Supervision

NPSolver: 具有迭代物理监督的神经泊松求解器

Bocheng Zeng, Rui Zhang, Runze Mao, Mengtao Yan, Xuan Bai, Yang Liu, Zhi X. Chen, Hao Sun

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; School of Mechanics and Engineering Science, Peking University; AI for Science Institute; University of Chinese Academy of Sciences

AI summary: Proposes NPSolver, a label-free neural Poisson solver trained via iterative physics supervision (a small number of PCG steps), and introduces the Boundary-Aware Transolver architecture; it outperforms physics-informed and data-driven baselines on 2D/3D irregular geometries.

Comments KDD 2026

AI-generated Chinese abstract

在复杂不规则域上高效求解泊松方程仍然是科学计算中的一个基本挑战,因为经典迭代求解器常常因病态系统而面临过长的运行时间。虽然神经算子提供了一种快速的替代方案,但它们通常依赖大规模标记数据集,或者在使用物理信息残差损失时难以处理不稳定的训练动态。我们提出NPSolver,一种通过迭代物理监督训练的无标签神经泊松求解器。NPSolver不依赖完全收敛的数值解或原始PDE残差,而是利用少量预处理共轭梯度(PCG)步骤来优化自身预测,从而提供更稳定且尺度良好的训练信号。理论分析证实,这种迭代监督充当了良态误差代理,并且停止梯度设计对于优化稳定性至关重要。为了更好地捕捉混合边界条件下的边界驱动特征,我们进一步引入了边界感知Transolver(BA-Transolver)架构,该架构明确分离了内部和边界令牌化。在2D和3D不规则几何上的广泛评估表明,NPSolver优于物理信息和数据驱动基线。此外,一个下游热控制任务突出了该模型进行高效可靠的基于梯度的边界控制的能力。我们将在 https://github.com/intell-sci-comput/NPSolver 发布我们的代码和数据。

English abstract

Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical iterative solvers often suffer from prohibitive runtime due to ill-conditioned systems. While neural operators offer a fast alternative, they typically rely on large-scale labeled datasets or struggle with unstable training dynamics when using physics-informed residual losses. We propose NPSolver, a neural Poisson solver trained without solution labels via iterative physics supervision. Instead of relying on fully converged numerical solutions or raw PDE residuals, NPSolver utilizes a small number of preconditioned conjugate gradient (PCG) steps to refine its own predictions, providing a more stable and well-scaled training signal. Theoretical analysis confirms that this iterative supervision serves as a well-conditioned error proxy and that a stop-gradient design is essential for optimization stability. To better capture boundary-driven features under mixed boundary conditions, we further introduce the Boundary-Aware Transolver (BA-Transolver) architecture that explicitly separates interior and boundary tokenization. Extensive evaluations on 2D and 3D irregular geometries demonstrate that NPSolver outperforms both physics-informed and data-driven baselines. Furthermore, a downstream thermal control task highlights the model's capability for conducting efficient and reliable gradient-based boundary control. We will release our code and data at https://github.com/intell-sci-comput/NPSolver.

2605.25771 2026-05-26 cs.LG cs.AI Updated version

MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training

MDGMIX: 边界感知的子图混合用于多域图预训练

Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang

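Affiliations: School of Computer Science and Technology, Xidian University, Xi'an, China; School of Artificial Intelligence, Xidian University, Xi'an, China

AI summary: To address data redundancy in multi-domain graph pre-training, proposes MDGMIX, which decouples shared and domain-specific patterns through boundary-aware subgraph mixing and hierarchical discrimination and uses a lightweight prompt weighting mechanism during adaptation; it outperforms strong baselines on few-shot classification while being more efficient.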

Comments Accepted by ICML 2026

AI-generated Chinese abstract

多域图预训练是构建具有跨域泛化能力的基础图模型的关键步骤。然而,现有方法主要依赖联合训练所有源域图,导致计算成本高。此外,尚不清楚所有源域图数据是否对有效迁移有同等贡献。本文通过实验揭示了多域图预训练中存在显著的数据冗余。基于这一发现,我们提出了多域图预训练框架MDGMIX,该框架将边界感知的子图混合与层次判别相结合。通过选择边界节点构建具有挑战性的混合域子图,MDGMIX利用粗粒度域判别和细粒度域分解损失来解耦共享模式与域特定模式。在适配过程中,MDGMIX采用轻量级提示加权机制来迁移源域知识。大量实验表明,MDGMIX在少样本分类任务中持续优于强基线,同时表现出优越的时间和内存效率。代码可在 https://github.com/zhengziyu77/MDGMIX 获取。

English abstract

Multi-domain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities. However, existing methods predominantly rely on jointly training all source domain graphs, resulting in high computational costs. Furthermore, it remains unclear whether all source domain graph data contribute equally to effective transfer. This paper empirically reveals significant data redundancy in multi-domain graph pre-training. Based on this finding, we propose the Multi-domain Graph Pre-training Framework, MDGMIX, which combines boundary-aware subgraph mixing with hierarchical discrimination. By selecting boundary nodes to construct challenging mixed-domain subgraphs, MDGMIX employs coarse-grained domain discrimination and fine-grained domain decomposition losses to decouple shared patterns from domain-specific patterns. During adaptation, MDGMIX employs a lightweight prompt weighting mechanism to transfer source domain knowledge. Extensive experiments demonstrate that MDGMIX consistently outperforms strong baselines in few-shot classification tasks while exhibiting superior time and memory efficiency. The code is available at: https://github.com/zhengziyu77/MDGMIX.

2605.25765 2026-05-26 cs.CV cs.AI cs.LG Updated version

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

通过交叉注意力激活投影实现扩散模型的概念遗忘

Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim

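Affiliations: CSE, POSTECH; GSAI, POSTECH

AI summary: Proposes PURE, which builds forget and retain bases in the cross-attention activation space and edits weights through a linear projection, erasing the target concept while preserving retain concepts.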

AI-generated Chinese abstract

概念遗忘旨在从预训练的文本到图像扩散模型中擦除目标概念,而无需重新训练。闭式方法在此设置中具有吸引力,因为它们对交叉注意力权重应用单一确定性编辑,并且不增加推理时间成本。然而,现有的闭式方法通过文本编码器对少数命名目标概念的简短锚定提示的响应来表示目标概念,而唤起该概念但不一致命名的释义提示可以绕过编辑。我们认为,目标应该改为在交叉注意力激活空间中表示。文本嵌入描述用户的提示,而交叉注意力激活描述模型即将渲染的内容,后者泛化到锚定模板未覆盖的释义。基于这一观察,我们提出了PURE(U-Net渲染中的投影用于擦除),这是一种闭式方法,从沿短去噪轨迹捕获的逐层交叉注意力激活构建遗忘和保留基,并将单个线性投影器应用于交叉注意力键和值权重。在最近涵盖艺术风格、知识产权、名人和NSFW类别中十个概念的整体概念遗忘基准上,PURE显著减少了在释义和对抗性提示下的目标泄露,同时将保留概念保持接近未编辑模型,在评估方法中实现了最佳的总体遗忘-保留权衡。

English abstract

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
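The idea of editing weights with a single linear projector can be illustrated by the simplest case: removing one direction from a weight matrix's input space with an orthogonal projection. This is a generic linear-algebra sketch, not PURE's closed form: it uses a single hypothetical forget direction, ignores the retain basis, and assumes the projector is applied on the input side of the weight matrix, none of which is specified in the abstract.

```python
import math

def project_out(weight, direction):
    """Return weight @ (I - u u^T), where u is `direction` scaled to unit length.

    weight: list of rows (lists of floats); direction: list of floats with the
    same length as each row. The edited matrix maps u to the zero vector.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    u = [x / norm for x in direction]
    edited = []
    for row in weight:
        coeff = sum(w * x for w, x in zip(row, u))  # component of the row along u
        edited.append([w - coeff * x for w, x in zip(row, u)])
    return edited

# Toy 2x2 weight matrix and a forget direction along the second input axis.
weight = [[1.0, 2.0], [3.0, 4.0]]
edited = project_out(weight, [0.0, 2.0])
```

After the edit, any input lying along the removed direction produces a zero output, while inputs orthogonal to it are unchanged; a method with a retain basis would additionally constrain the edit so that retained directions are preserved.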

2605.25764 2026-05-26 cs.CV cs.AI Updated version

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

病理基础模型在空间域理解中的基准测试

Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Brainnetome Center, Institute of Automation, Chinese Academy of Sciences; Beijing Key Laboratory of Brainnetome and Brain-Computer Interface, Institute of Automation, Chinese Academy of Sciences; DAMO Academy, Alibaba Group; ShanghaiTech University

AI summary: Proposes the SpaPath-Bench benchmark, which uses a spatial domain identification task to evaluate how well pathology foundation model representations distinguish tissue regions and capture spatial relationships.

Comments MICCAI 2026

AI-generated Chinese abstract

病理基础模型(PFMs)已成为从全切片图像(WSIs)中学习可迁移表示的核心方法,通常通过下游临床终点进行基准测试。虽然这种任务级评估不可或缺,但它们对表示本身编码了什么提供了有限的见解,特别是PFM嵌入是否能够区分有意义的组织区域并捕获其空间关系。我们提出了SpaPath-Bench,一个表示级基准,旨在诊断PFMs中的空间表示能力。SpaPath-Bench将配对全切片图像和空间转录组学(ST)数据上的空间域识别(SDI)制定为诊断任务。它整理了42个公开的配对WSI和ST切片,支持跨19个编码器和7种SDI方法的大规模评估,并使用三个互补标准衡量分区质量:无监督空间一致性、转录组学参考一致性和专家参考一致性。在83K次运行中,SpaPath-Bench揭示了不同的预训练范式捕获了组织空间架构的不同方面,并为构建下一代空间感知计算病理模型提供了实用指导。代码和数据管道公开于https://bokai-zhao.github.io/SpaPath-benchboard/。

English abstract

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task-level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation-level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired WSI and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large-scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics-referenced agreement, and expert-referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

2605.25749 2026-05-26 cs.IR cs.AI cs.LG Updated version

DeGRe: Dense-supervised Generative Reranking for Recommendation

DeGRe: 密集监督的生成式重排序用于推荐

Chaotian Song, Jingyao Zhang, Chenghao Chen, Zisen Sang, Dehai Zhao, Guodong Cao, Boxi Wu, Deng Cai, Jia Jia

Affiliations: College of Software, Zhejiang University, Hangzhou, China; Rajax Network Technology, Taobao Shangou of Alibaba (Hangzhou, Beijing, and Shanghai, China); State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China

AI summary: Proposes DeGRe, in which dense supervision signals from offline exploration (the Lookahead Evaluator) guide an Online Generator that decodes greedily in a single pass, addressing heuristic label bias and credit assignment in reranking.

Comments Accepted to KDD 2026 (ADS Track)

AI-generated Chinese abstract

在多阶段推荐系统中,重排序通过捕获列表内上下文依赖关系来优化整体效用,但其核心挑战在于在指数级排列空间中探索最优序列。最近的研究转向端到端生成式框架,通常利用列表级奖励或偏好对齐来指导生成器训练。然而,这些方法仍面临两个关键问题。首先是启发式标签偏差。现有方法通常基于简单规则构建训练目标,例如将点击项提升到顶部,而忽略列表上下文中的因果依赖关系。其次是信用分配问题。稀疏的列表级后验奖励无法直接指导序列生成中的中间步骤,导致优化方向模糊。为了解决这些问题,我们提出DeGRe(密集监督的生成式重排序),一种通过密集监督弥合离线探索与在线效率之间差距的生成式重排序框架。DeGRe的核心在于其离线-在线解耦设计。在离线阶段,我们引入基于累积回归的Lookahead Evaluator,利用束搜索在未曝光空间中主动挖掘高价值前瞻序列。在训练期间,我们将评估器的逐步价值估计转换为密集监督信号,并将其蒸馏到轻量级在线生成器中。这种机制使生成器能够内化前瞻规划能力,在线推理时仅需一次高效的贪婪解码即可逼近全局最优。实验表明,DeGRe在公开基准和工业数据集上优于基线模型。我们已成功将DeGRe部署到淘宝闪购中,显著提升了在线推荐效果。

English abstract

In multi-stage recommender systems, reranking optimizes overall utility by capturing intra-list contextual dependencies, yet its central challenge lies in exploring optimal sequences within an exponentially large permutation space. Recent studies have shifted towards end-to-end generative frameworks, which typically leverage list-wise rewards or preference alignment to guide generator training. However, these methods still face two critical issues. First is the heuristic label bias. Existing methods often construct training targets based on simple rules, such as promoting clicked items to the top, while ignoring causal dependencies within the list context. Second is the credit assignment problem. Sparse list-level posterior rewards fail to directly guide intermediate steps in sequence generation, leading to ambiguous optimization directions. To address these issues, we propose DeGRe (Dense-supervised Generative Reranking), a generative reranking framework that bridges the gap between offline exploration and online efficiency through dense supervision. The core of DeGRe lies in its offline-online decoupled design. During the offline phase, we introduce a Lookahead Evaluator based on cumulative regression, which leverages beam search to actively mine high-value lookahead sequences in the unexposed space. During training, we transform the step-wise value estimations from the evaluator into dense supervision signals and distill them into a lightweight Online Generator. This mechanism enables the generator to internalize lookahead planning capabilities, requiring only a single efficient greedy decoding pass during online inference to approximate the global optimum. Experiments demonstrate that DeGRe outperforms baseline models on public benchmarks and industrial datasets. We have successfully deployed DeGRe on Taobao Flash Shopping, significantly improving online recommendations.

2605.25748 2026-05-26 cs.AI Updated version

Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective

以智能体为中心的社交轨迹预测:自由能原理视角

Yanping Wu, Ji Zhang, Hao Chen, Edmond S. L. Ho, Chongfeng Wei

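Affiliations * University of Glasgow; Southwest Jiaotong University

AI summary To address existing trajectory prediction methods' reliance on global state, weak belief inference under partial observability, and lack of cognitive behavioral constraints, the paper proposes FEP-Diff, an agent-centric trajectory prediction framework based on the Free Energy Principle. It combines a dual-branch spatiotemporal encoder, a goal-conditioned belief learner, and a residual diffusion trajectory generator to produce cognitively plausible predictions under restricted observability.

Comments 10 pages, 4 figures

Details
AI Chinese abstract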

轨迹预测方法在捕捉复杂运动模式方面已展现出显著能力。然而,现有方法依赖于全局状态假设,在部分可观测性下存在信念推理不足的问题,且预测中缺乏认知行为约束。这些局限性严重影响了实际部署的可行性和物理合理性。在这项工作中,我们提出了FEP-Diff,一个基于自由能原理的以智能体为中心的轨迹预测框架,旨在现实约束下实现认知合理的预测。具体来说,一个双分支时空编码器从局部观测中提取自我运动动态和社会交互线索。在此基础上,一个目标条件信念学习器推断多模态潜在信念分布,通过自由能目标进行优化,并对局部邻域图施加社会一致性约束以促进相邻智能体之间的认知对齐。最后,一个残差扩散轨迹生成器以学习到的信念表示为条件,通过令牌级代理条件,产生精确且多样化的未来预测。在五个公开基准上的大量实验表明,FEP-Diff在受限可观测性下始终优于最先进的方法。代码:https://anonymous.4open.science/r/FEP-Diff-8876。

English abstract

Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rely on global state assumptions, suffer from insufficient belief inference under partial observability, and lack cognitive behavioral constraints in prediction. These limitations severely compromise both deployment feasibility and physical plausibility in real-world settings. In this work, we propose FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle, aimed at achieving cognitively plausible predictions under realistic constraints. Specifically, a dual-branch spatiotemporal encoder extracts ego-motion dynamics and social interaction cues from local observations. Building upon this, a goal-conditioned belief learner infers multimodal latent belief distributions optimized via a free-energy objective, with a social consistency constraint on the local neighborhood graph to promote cognitive alignment among neighboring agents. Finally, a residual diffusion trajectory generator is conditioned on the learned belief representations with token-level proxy conditioning, producing precise and diverse future predictions. Extensive experiments on five public benchmarks demonstrate that FEP-Diff consistently outperforms state-of-the-art methods under restricted observability. Code: https://anonymous.4open.science/r/FEP-Diff-8876.

2605.25746 2026-05-26 cs.MA cs.AI Updated version

Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

基于结构引导编排的多智能体协调适应

Haoran Li, Shulun Chen, Shaoyuan Sun, Hanchen Wang

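Affiliations * Nanjing University; University of Technology Sydney; University of New South Wales

AI summary Proposes MACA, a framework that takes a probabilistic view of multi-agent coordination as joint posterior inference over structure and orchestration. A task- and budget-conditioned structural prior guides policy-based orchestration, giving efficient adaptive coordination with an average performance gain of 8.42% and 43.19% fewer tokens.

Comments 21 pages

Details
AI Chinese abstract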

随着基于大语言模型的多智能体系统规模扩大以处理日益复杂的任务,平衡结构稳定性和动态适应性变得越来越具有挑战性。现有系统通常采用以结构为中心的方法,坚持预先确定的结构,限制了细粒度控制;或者采用以编排为中心的方法,动态调整决策,同时使协调结构隐含且不稳定。为了解决这一挑战,我们从概率角度重新审视多智能体协调,将其视为结构和编排联合分布的后验推断。我们引入了MACA,一个自动协调框架,它学习一个任务和预算条件的结构先验,用于智能体参与和交互。该先验指导基于策略的编排作为后验推断的近似,实现了具有细粒度控制的高效解决方案。在多个基准测试中,MACA比自适应多智能体基线平均高出8.42%,同时使用的令牌数减少了43.19%。进一步研究表明,结构和编排的联合适应抑制了冗余交互,使协调收敛到任务有效的执行。

English abstract

As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dynamic adaptability becomes increasingly challenging. Existing systems typically adopt either structure-centric methods, committing to structures determined upfront that limit fine-grained control, or orchestration-centric methods, adapting decisions dynamically while leaving coordination structure implicit and unstable. To address this challenge, we revisit multi-agent coordination from a probabilistic perspective, casting it as posterior inference over the joint distribution of structure and orchestration. We introduce MACA, an automated coordination framework that learns a task- and budget-conditioned structural prior over agent participation and interactions. This prior guides a policy-based orchestration as an approximation to posterior inference, enabling efficient solutions with fine-grained control. Across benchmarks, MACA outperforms adaptive multi-agent baselines by an average of 8.42% while using 43.19% fewer tokens. Further investigation reveals that joint adaptation of structure and orchestration suppresses redundant interactions, converging coordination toward task-effective execution.

2605.25735 2026-05-26 cs.AI Updated version

A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

公理化设计的深度剖析——第一部分:问题表述

Aydin Homay

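Affiliations * Technische Universität Dresden

AI summary This paper focuses on the problem formulation step in axiomatic design: it clarifies the definition and properties of first-level functional requirements, analyzes common misconceptions and difficulties, offers practical guidance, and finally discusses the role of large language models in this step.

Comments Accepted at ICAD 2026 - MIT; the final camera-ready version will be available once it is published by Springer

Details
AI Chinese abstract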

问题表述——将客户需求和约束转化为最小的一组独立的一级功能需求——可以说是每个设计框架中最关键的步骤,包括公理化设计,然而在实践中它经常被误解或低估。本文专门关注公理化设计中的问题表述,澄清一级FR是什么(以及不是什么),解释为什么在给定的相同需求和约束下,它们不应在不同设计者之间合理变化,并强调导致设计失败的内在困难和反复出现的陷阱。讨论主要基于Nam P. Suh的三本书:《设计原理》、《公理化设计:进展与应用》和《复杂性理论》,并提供实用指导,帮助设计者制定适定的一级FR。最后,本文简要回顾了大语言模型时代的问题表述,并讨论了此类工具在一级层面上能够(以及不能)做出什么贡献。

English abstract

Problem formulation, that is, translating customer needs and constraints into a minimum set of independent first-level functional requirements (FRs), is arguably the most critical step in every design framework, including axiomatic design, yet it is frequently misunderstood or underestimated in practice. This paper focuses exclusively on problem formulation in axiomatic design: it clarifies what first-level FRs are (and are not), explains why they should not legitimately vary across designers given the same needs and constraints, and highlights intrinsic difficulties and recurring pitfalls that lead to design failure. The discussion is grounded primarily in Nam P. Suh's three books, The Principles of Design, Axiomatic Design: Advances and Applications, and Complexity Theory, and it offers practical guidance to help designers formulate well-posed first-level FRs. Finally, the paper briefly revisits problem formulation in the era of large language models and discusses what such tools can (and cannot) contribute at the first level.

2605.25720 2026-05-26 cs.AI Updated version

Learning to Search and Searching to Learn for Generalization in Planning

学习搜索与搜索学习以实现规划中的泛化

Michael Aichmüller, Yannik Hesse, Hector Geffner

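Affiliations * Department of Machine Learning and Reasoning, RWTH Aachen University

AI summary Proposes a self-improving WA* learning framework combined with a value heuristic represented by a Relational Graph Neural Network. The heuristic guides search and is updated via Q-learning, achieving zero-shot generalization and outperforming deep reinforcement learning on multiple planning tasks.

Comments Accepted at ICML 2026

Details
AI Chinese abstract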

组合泛化仍然是深度强化学习(DRL)中的一个核心挑战。经典规划通过显式关系描述为研究这一问题提供了一个简单但具有挑战性的环境,无需从感知中学习。在稀疏奖励领域中,通过实时搜索的标准RL探索效率低下,而基于学习的规划方法通常依赖于专家演示、事后重标或从目标状态开始的随机游走。相比之下,规划器依赖于最佳优先搜索方法(如$\mathrm{A}^\star$)从头开始解决问题。我们提出了一种自改进的$\mathrm{WA}^\star$学习框架,结合由关系图神经网络表示的值启发式:启发式引导搜索,产生的搜索数据通过$Q$-学习更新启发式。这个循环产生了可以作为通用策略的启发式,并且即使在没有搜索的情况下也能解决新实例,而DRL在其他情况下会失败,正如我们在Sokoban、PushWorld、The Witness以及2023年国际规划竞赛基准等谜题上所展示的。值得注意的是,我们展示了强大的零样本泛化能力:例如,在少于30个块的Blocksworld实例上训练的启发式,无需搜索即可成功解决包含488个块的实例。

English abstract

Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as $\mathrm{A}^\star$ to solve problems from scratch. We propose a self-improving $\mathrm{WA}^\star$ learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via $Q$-learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: For example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.
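The search half of the loop described in this abstract can be illustrated with a minimal weighted A* sketch. It is not the authors' implementation: the learned Relational GNN heuristic and the Q-learning update are replaced by a hand-written heuristic, and the names `weighted_a_star`, `successors`, `weight`, and the toy line domain are assumptions chosen for illustration.

```python
import heapq


def weighted_a_star(start, is_goal, successors, heuristic, weight=2.0):
    """Best-first search ordered by f = g + weight * h.

    `successors(state)` yields (next_state, step_cost) pairs.
    Returns the list of states from start to a goal, or None.
    """
    tie = 0  # tie-breaker so the heap never compares states directly
    frontier = [(weight * heuristic(start), tie, start)]
    g_cost = {start: 0.0}
    parent = {start: None}
    closed = set()

    while frontier:
        _, _, state = heapq.heappop(frontier)
        if state in closed:
            continue  # stale queue entry
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        closed.add(state)
        for nxt, cost in successors(state):
            new_g = g_cost[state] + cost
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parent[nxt] = state
                closed.discard(nxt)  # reopen if a cheaper path was found
                tie += 1
                heapq.heappush(
                    frontier, (new_g + weight * heuristic(nxt), tie, nxt)
                )
    return None


# Toy domain (hypothetical): walk along the integers 0..5 to reach 5.
def line_successors(s):
    return [(n, 1.0) for n in (s - 1, s + 1) if 0 <= n <= 5]


path = weighted_a_star(0, lambda s: s == 5, line_successors, lambda s: abs(5 - s))
print(path)
```

In the setting the abstract describes, `heuristic` would be the learned value function, and the states expanded during search would supply the training data for its next update; with `weight` above 1 the search trades optimality guarantees for fewer expansions.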

2605.25717 2026-05-26 cs.AI cs.CE cs.LG Updated version

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

FLOATBench:浮式海上风力发电机塔架疲劳数据集与基准

João Alves Ribeiro, Bruno Alves Ribeiro, Francisco Pimenta, Sérgio M. O. Tavares, Faez Ahmed

Affiliations * Department of Mechanical Engineering, Massachusetts Institute of Technology; School of Engineering, Brown University; CONSTRUCT, Faculty of Engineering, University of Porto; University of Aveiro

AI summary Presents FLOATBench, a tabular benchmark with 582,120 fatigue-damage labels based on high-fidelity simulations of 22 MW floating wind turbine towers, and introduces a regime-aware evaluation protocol that detects performance rank shifts that random splits cannot reveal.

Details
AI Chinese abstract

全球大部分海上风能资源位于水深过大、无法使用固定式基础的海域,因此浮式海上风力发电机(FOWT)对于深水部署至关重要。随着行业向22 MW级设计规模发展,塔架疲劳变得愈发关键,因为更大的结构会放大由持续风浪激励引起的耦合气动-水动-伺服-弹性载荷。准确的疲劳损伤预测对于认证、设计优化和成本降低至关重要。然而,该领域缺乏共享的替代模型基准:不同研究报告了不同的仿真、划分和指标,使得方法难以比较。我们提出FLOATBench,一个公开的表格基准,包含三种22 MW FOWT塔架几何形状的582,120个逐截面疲劳损伤标签,这些标签来自三种塔架的19,404次高保真OpenFAST仿真(每种塔架6,468次:1,078个对齐风浪工况点×六个湍流种子),每种塔架在30个截面上进行标注。FLOATBench包括一个基于工况感知的联合风浪运行包络的alpha-shape划分,将测试点分为训练内、插值和外推区域。它配备了一个可复现的评估框架,涵盖三个协议级别:随机验证(E1)、塔内工况感知评估(E2)和跨塔迁移(E3)。工况感知协议揭示了全局性能与外推性能之间的排名变化,而随机划分排行榜无法检测到这些变化。据作者所知,FLOATBench是首个用于表格替代建模的FOWT疲劳基准,并提供了一个可推广到定义在物理运行包络上的工程替代模型的评估协议。数据集和代码可在以下网址获取:https://github.com/Joao97ribeiro/FLOATBench。

English abstract

Most of the world's offshore wind resource lies in waters too deep for fixed-bottom foundations, making floating offshore wind turbines (FOWTs) essential for deep-water deployment. As the industry scales toward $22$ MW class designs, tower fatigue becomes increasingly critical because larger structures amplify the coupled aero-hydro-servo-elastic loads induced by continuous wind and wave excitation. Accurate fatigue-damage prediction is therefore central to certification, design optimization, and cost reduction. Yet the field lacks a shared surrogate benchmark: studies report different simulations, splits, and metrics, making methods difficult to compare. We present FLOATBench, a public tabular benchmark with $582{,}120$ per-section fatigue-damage labels across three $22$ MW FOWT tower geometries, derived from $19{,}404$ high-fidelity OpenFAST simulations across the three towers ($6{,}468$ per tower: $1{,}078$ aligned wind/wave operating points $\times$ six turbulence seeds), labeled at $30$ cross-sections per tower. FLOATBench includes a regime-aware alpha-shape partition of the joint wind/wave operating envelope, stratifying test points into in-train, interpolation, and extrapolation regimes. It is paired with a reproducible evaluation harness covering three protocol levels: random validation (E1), within-tower regime-aware evaluation (E2), and cross-tower transfer (E3). The regime-aware protocol reveals rank shifts between global and extrapolation performance that random-split leaderboards cannot detect. To the authors' knowledge, FLOATBench is the first FOWT fatigue benchmark for tabular surrogate modeling, and offers an evaluation protocol that generalizes to engineering surrogates defined over physical operating envelopes. Dataset and code available at: https://github.com/Joao97ribeiro/FLOATBench.
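The dataset sizes quoted in this abstract follow from one another by simple multiplication, and the short check below reproduces them. The variable names are illustrative and are not taken from the FLOATBench code.

```python
# Counts stated in the abstract.
operating_points = 1_078    # aligned wind/wave operating points per tower
turbulence_seeds = 6        # turbulence seeds per operating point
towers = 3                  # 22 MW tower geometries
sections_per_tower = 30     # labeled cross-sections per tower

sims_per_tower = operating_points * turbulence_seeds
total_sims = sims_per_tower * towers
total_labels = total_sims * sections_per_tower

print(sims_per_tower, total_sims, total_labels)
```

Each simulation therefore contributes one label per cross-section, which is how 19,404 simulations yield 582,120 per-section labels.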

2605.25707 2026-05-26 cs.AI Updated version

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

AgentHijack:基准测试计算机使用智能体对常见环境干扰的鲁棒性

Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia HU, Bo Han

Affiliations * TMLR Group, Hong Kong Baptist University; The University of Texas at Austin; Sydney AI Centre, The University of Sydney; Shanghai Artificial Intelligence Laboratory

AI summary Proposes the AgentHijack benchmark, which uses 9 configurable common environment corruptions to evaluate the robustness of computer-use agents driven by multimodal large language models, and designs the AgentHijack-Agent framework to improve their resistance to such corruptions.

Comments Accepted by ICML 2026

Details
AI Chinese abstract

由多模态大语言模型(MLLM)驱动的自主计算机使用智能体正在成为完成复杂数字工作流的得力助手。然而,真实世界的执行环境远非理想:弹出窗口、分辨率变化和竞争性应用频繁干扰智能体的感知和控制。我们引入了AgentHijack,一个旨在评估计算机使用智能体在常见干扰下鲁棒性的基准,其中动态环境中的不确定性在没有直接对抗意图的情况下破坏执行流程。具体来说,AgentHijack引入了9种可配置的常见干扰来复现现实的不完美场景。我们评估了多种利用基于MLLM的智能体的桌面任务,发现即使是微小的干扰实例也会导致显著的性能下降,这强调了智能体的脆弱性以及鲁棒性评估的必要性。随后,我们提出了AgentHijack-Agent,一个将具有增强基础能力的动作生成器与负责行为总结和环境检查的旁观者相结合的框架。大量实验验证了其有效性。我们的代码、环境、基线模型和数据公开于:https://AgentHijack.github.io。

English abstract

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation, which underscores both the fragility of agents and the necessity of robustness evaluation. We then propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models, and data are publicly available at: https://AgentHijack.github.io.

2605.25698 2026-05-26 cs.LG cs.AI Updated version

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

LLM应如何消费高质量数据?通过质量感知的功能缩放定律实现最优数据调度

Zhitao Zhu, Xili Wang, Shizhe Wu, Jiawei Fu, Xiaoqing Liu

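Affiliations * Peking University; Meituan

AI summary This paper extends functional scaling laws with a data-quality dimension, solves the joint data-quality and batch-size scheduling problem analytically, reveals the dual role of high-quality data, and proposes the Drop-Stable-Rampup schedule, which improves average accuracy on a 15B MoE model by +1.70 over WSD and +2.98 over cosine decay.

Details
AI Chinese abstract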

高质量数据在大语言模型训练中稀缺,但如何联合训练动态调度其使用缺乏理论指导。我们通过引入数据质量维度扩展功能缩放定律,并以渐近闭式形式求解了联合数据质量和批次大小调度问题。该解揭示了两个阶段和高质量数据的双重角色。在噪声受限阶段,高质量数据应作为信号放大器:降低批次大小将更清洁的数据转换为更多信号而不放大噪声。在信号受限阶段,它应作为噪声抑制器:后期放置可减少终端噪声而不牺牲信号积累。现有的课程式流程主要利用第二个角色,将更清洁的数据放在后期,但忽略了第一个角色,因为传统的衰减调度在高质量数据可用时恰好降低了更新强度。受此启发,我们为LLM中期训练提出了Drop-Stable-Rampup:在质量转换时,降低批次大小,保持稳定以积累信号,然后逐渐增加以抑制终端噪声。在一个在108B tokens上中期训练的15B混合专家模型上,Drop-Stable-Rampup相比Warmup-Stable-Decay (WSD)平均准确率提升+1.70,相比余弦衰减提升+2.98,在数学推理基准如GSM8K (+4.23)和MATH (+2.80)上增益尤其显著。

English abstract

High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without amplifying noise. In the signal-limited regime, it should be used as a noise suppressor: late placement reduces terminal noise without sacrificing signal accumulation. Existing curriculum-style pipelines primarily exploit the second role by placing cleaner data late, but miss the first role because conventional decay schedules reduce update intensity exactly when high-quality data becomes available. Guided by this, we propose Drop-Stable-Rampup for LLM midtraining: upon the quality transition, drop the batch size, hold it stable to accumulate signal, then ramp up to suppress terminal noise. On a 15B Mixture-of-Experts model midtrained on 108B tokens, Drop-Stable-Rampup improves average accuracy over Warmup-Stable-Decay (WSD) by +1.70 and over Cosine-decay by +2.98, with particularly large gains on mathematical reasoning benchmarks such as GSM8K (+4.23) and MATH (+2.80).
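The three phases named in this abstract (drop, hold stable, ramp up) can be sketched as a batch-size schedule. This is a hedged illustration only: the linear ramp, the function name `drop_stable_rampup`, and every numeric value below are assumptions, since the abstract does not give the schedule's functional form or hyperparameters.

```python
def drop_stable_rampup(step, transition_step, stable_steps, ramp_steps,
                       base_batch, low_batch, high_batch):
    """Return the batch size to use at a given training step.

    Before `transition_step` the base batch size is used. At the quality
    transition the batch size drops to `low_batch`, is held for
    `stable_steps`, then rises linearly (an assumed shape) to `high_batch`
    over `ramp_steps`.
    """
    if step < transition_step:
        return base_batch
    since_transition = step - transition_step
    if since_transition < stable_steps:
        return low_batch
    progress = min(1.0, (since_transition - stable_steps) / ramp_steps)
    return round(low_batch + progress * (high_batch - low_batch))


# Hypothetical configuration, chosen only to make the phases visible.
config = dict(transition_step=100, stable_steps=100, ramp_steps=100,
              base_batch=1024, low_batch=256, high_batch=2048)

for step in (0, 100, 150, 250, 400):
    print(step, drop_stable_rampup(step, **config))
```

The small batch during the stable phase corresponds to the abstract's "signal amplifier" role for high-quality data, and the final ramp corresponds to its "noise suppressor" role.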

2605.25682 2026-05-26 cs.DC cs.AI Updated version

Profiling-Driven Adaptive Distributed Transformer Inference on Embedded Edge Deployment

面向嵌入式边缘部署的剖析驱动自适应分布式Transformer推理

Muhammad Azlan Qazi, Alexandros Iosifidis, Qi Zhang

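Affiliations * Aarhus University; Tampere University

AI summary By combining Segment Means compression with lightweight offline profiling to adaptively select local or distributed execution at runtime, the paper addresses the CPU-GPU communication bottleneck in distributed Transformer inference on embedded devices, reducing latency by 65%-77% and energy consumption by 34%-52% relative to full-tensor exchange.

Details
AI Chinese abstract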

将Transformer推理分布在嵌入式边缘设备上可以缓解单个内存和计算约束,但在实际硬件上的实际益处仍不明确:先前的工作主要依赖于忽略硬件特定通信开销的模拟。我们在通过WiFi连接的NVIDIA Jetson Orin Nano设备上进行了硬件原型研究。我们的关键发现是,主要瓶颈不仅是网络带宽,还有通信期间的CPU-GPU暂存。由于Jetson的集成GPU架构缺乏NCCL所需的PCIe/NVLink路径,所有设备间数据通信应通过GLOO路由并在CPU内存中暂存;这种开销随通信数据量扩展,使得对于中等规模模型(如ViT),全张量交换比单设备推理更慢。因此,我们通过结合分段均值压缩与轻量级离线剖析来评估Prism,以在运行时自适应地选择本地或分布式执行。实验表明,相对于静态分布式执行设置中的全张量交换,该策略将延迟降低了65%-77%,能耗降低了34%-52%,证明了剖析驱动自适应对于嵌入式硬件上的实际分布式Transformer推理至关重要。

English abstract

Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical benefits on real hardware remain unclear: prior work relies largely on simulations that overlook hardware-specific communication overheads. We present a hardware prototype study on NVIDIA Jetson Orin Nano devices connected over WiFi. Our key finding is that the dominant bottleneck is not just network bandwidth but also the CPU-GPU staging during communication. Because Jetson's integrated GPU architecture lacks the PCIe/NVLink pathway that NCCL requires, all inter-device data communication must be routed through GLOO and staged in CPU memory. This overhead scales with communication data volume and makes full-tensor exchange slower than single-device inference across batch sizes for medium-sized models such as ViT. We therefore evaluate Prism by combining Segment Means compression with lightweight offline profiling to adaptively select between local and distributed execution at runtime. Experiments show that this strategy reduces latency by 65%-77% and energy consumption by 34%-52% relative to full-tensor exchange in a static distributed execution setup, demonstrating that profiling-driven adaptation is essential for practical distributed Transformer inference on embedded hardware.
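The runtime decision described in this abstract, choosing local or distributed execution from an offline profile, can be sketched as a table lookup. This is an assumption-laden illustration rather than the paper's mechanism: the `ProfileEntry` type, the nearest-batch-size lookup, and all latency values are invented placeholders, not measurements from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileEntry:
    local_ms: float        # profiled single-device latency
    distributed_ms: float  # profiled multi-device latency, including staging


def choose_mode(profile, batch_size):
    """Pick the execution mode with the lower profiled latency.

    Uses the profiled batch size closest to the requested one
    (an assumed lookup rule).
    """
    nearest = min(profile, key=lambda b: abs(b - batch_size))
    entry = profile[nearest]
    return "local" if entry.local_ms <= entry.distributed_ms else "distributed"


# Placeholder profile keyed by batch size; the numbers are made up.
PROFILE = {
    1: ProfileEntry(local_ms=40.0, distributed_ms=95.0),
    8: ProfileEntry(local_ms=210.0, distributed_ms=180.0),
    32: ProfileEntry(local_ms=900.0, distributed_ms=520.0),
}

for batch in (1, 2, 8, 30):
    print(batch, choose_mode(PROFILE, batch))
```

Because the table is built offline, the runtime cost of the decision is a dictionary lookup, which keeps the selector itself from adding to the communication overhead the study identifies.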

2605.25681 2026-05-26 cs.LG cs.AI Updated version

Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models

不要重新训练,只需重用:从单目标扩散模型中恢复双目标分子

Qingyuan Zeng, Pengxiang Cai, Zixin Guan, Ziyang Chen, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

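Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Guangzhou University of Chinese Medicine

AI summary Proposes REUSE, a framework that recovers dual-target molecules from a frozen single-target diffusion model through hierarchical evolutionary input-space search, without retraining or modifying the diffusion process, achieving a 20.9-percentage-point gain on the Dual High Affinity metric.

Details
AI Chinese abstract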

设计一个能调节两个靶点的单一分子是多药理学中一种有前景的策略,但它比标准的单目标生成要困难得多,因为一个候选分子必须满足两个结合要求,同时保持药物相似性和可合成性。现有的双目标生成方法通常通过在采样期间重新训练生成器或干预扩散过程来引入双目标能力。前者在双目标监督稀疏时可能成本高昂且难以稳定,而后者可能对去噪时的目标平衡和竞争性更新方向敏感。这些局限性促使我们寻找一种保持生成器不变的替代方案:能否在不修改参数或去噪动态的情况下,从冻结的单目标扩散模型的输入空间中恢复双目标候选分子?我们将此任务表述为一个受约束的多目标优化问题,并提出REUSE,一种层次化进化输入空间搜索框架,结合配对条件探索和结构化多阶段选择,以强制执行双目标亲和力、化学质量和多样性。实验表明,与修改扩散过程的方法相比,REUSE持续改善了双目标亲和力和平衡性,在双高亲和力指标上比最强基线提高了20.9个百分点,同时保持了竞争性的分子质量。

English abstract

Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder than standard single-target generation because one candidate must satisfy two binding requirements while preserving drug-likeness and synthesizability. Existing dual-target generative methods typically introduce dual-target capability by either retraining the generator or intervening in the diffusion process during sampling. The former can be costly and difficult to stabilize when dual-target supervision is sparse, while the latter may be sensitive to denoising-time target balancing and competing update directions. These limitations motivate a generator-preserving alternative that keeps the pretrained prior intact: can dual-target candidates instead be recovered from the input space of a frozen single-target diffusion model, without modifying its parameters or denoising dynamics? We formulate this task as a constrained multi-objective optimization problem and propose REUSE, a hierarchical evolutionary input-space search framework that combines pair-conditioned exploration with structured multi-stage selection to enforce dual-target affinity, chemical quality, and diversity. Experiments show that, compared with methods that modify the diffusion process, REUSE consistently improves dual-target affinity and balance, achieving a 20.9-percentage-point gain in Dual High Affinity over the strongest prior baseline while maintaining competitive molecular quality.

2605.25680 2026-05-26 cs.CL cs.AI Updated version

Simulating Human Memory with Language Models

用语言模型模拟人类记忆

Qihan Wang, Nicholas Tomlin, Michael Hu, Brian Dillon, Tal Linzen

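Affiliations * NYU; UMass Amherst

AI summary This study compares language models with human memory using classic memory experiments from psychology. It finds that out-of-the-box models remember better than humans, but that prompting strategies and a compactor can make models forget in a more human-like way, making them more effective user simulators in a downstream education task.

Details
AI Chinese abstract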

语言模型越来越多地被部署为用户模拟器,但它们的记忆远比真实用户可靠。为了衡量这一差距,我们在人类和语言模型上进行了一系列来自心理学的经典记忆实验。跨任务我们发现,未经调优的语言模型表现出比人类更好的记忆,即使在被提示模仿人类行为时也是如此。然后我们表明,更好的提示策略和使用压缩器可以使语言模型以更类似人类的方式遗忘内容。使用这些方法,我们初步证明,具有人类类似记忆约束的语言模型可以在下游教育任务中作为更有效的用户模拟器。最后,我们发布人类参考数据和基准,以支持未来关于用语言模型模拟人类记忆的工作。

English abstract

Language models are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measure this gap, we run a series of classic memory experiments from psychology on both humans and language models. Across tasks, we find that out-of-the-box language models exhibit better memory than humans, even when prompted to imitate human behavior. We then show that better prompting strategies and the use of a compactor can cause language models to forget content in a more human-like way. Using these methods, we show preliminary evidence that language models with human-like memory constraints can function as more effective user simulators in a downstream education task. Finally, we release human reference data and benchmarks to support future work on simulating human memory with language models.

2605.25673 2026-05-26 cs.CR cs.AI Updated version

Referential Security as a New Paradigm for AI Evaluations

引用安全性作为AI评估的新范式

Dan Ristea, Vasilios Mavroudis

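Affiliations * University College London; Alan Turing Institute; King's College London

AI summary To address unstable evaluation identifiers caused by continuously updated AI systems, the paper proposes the referential security paradigm, which treats model identity as a verifiable property to support reproducible evaluation, longitudinal audit validity, and cross-provider equivalence.

Details
AI Chinese abstract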

安全评估本质上依赖于稳定的标识符。任何发现、审计或监管决策必须始终附属于其所涉及的具体工件。持续更新的人工智能系统违反了这一核心假设,公开的模型名称保持不变,而底层权重、提示、检索机制、滥用分类器、推理设置和服务基础设施却未经宣布地修改。因此,当前的评估常常适用于表面标签而非可识别和不同的系统。为了解决这个问题,我们提出引用安全性作为AI评估的新范式。基本安全问题不仅涉及模型是否安全,还涉及后续方能否最终确定特定安全声明所针对的是哪个系统。这种方法将模型身份重新定义为经验上可验证的属性,并将引用稳定性与其所制约的实质性安全声明分开。该框架为当前实践处理不善的三个关键工作流带来了可处理性。具体来说,它实现了可重复评估、纵向审计有效性和跨提供商等价性。通过将这些评估建立在可验证的工件上,我们的方法确保安全审计和监管发现在动态系统的整个操作生命周期中保持其实证效用。

English abstract

Security evaluations inherently depend on stable identifiers. Any finding, audit, or regulatory decision must remain attached to the specific artifact it pertains to. Continuously updated artificial intelligence systems violate this core assumption, with public model designations remaining static while underlying weights, prompts, retrieval mechanisms, misuse classifiers, inference settings, and serving infrastructures undergo unannounced modifications. Consequently, current evaluations frequently apply to superficial labels rather than identifiable and distinct systems. To resolve this, we propose referential security as a new paradigm for AI evaluation. The fundamental security question extends beyond whether a model is safe to whether subsequent parties can conclusively determine which system a specific safety claim addressed. This approach reframes model identity as an empirically verifiable property and separates referential stability from the substantive security claims it conditions. This framework brings tractability to three critical workflows that current practices handle poorly. Specifically, it enables reproducible evaluation, longitudinal audit validity, and cross-provider equivalence. By grounding these evaluations in verifiable artifacts, our approach ensures that safety audits and regulatory findings maintain their empirical utility across the operational lifecycle of dynamic systems.
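One way to make the idea of a stable, verifiable identifier concrete is to hash a canonical description of the components the abstract lists (weights, prompts, inference settings, and so on). The sketch below is an illustration under stated assumptions, not the paper's proposal: the function `system_fingerprint`, the component keys, and the placeholder values are all hypothetical.

```python
import hashlib
import json


def system_fingerprint(components: dict) -> str:
    """Hash a canonical JSON encoding of the system's components.

    Sorting keys makes the digest independent of dictionary ordering,
    so the same configuration always yields the same identifier.
    """
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical system description; every value is a placeholder.
evaluated = {
    "weights_digest": "abc123",
    "system_prompt": "You are a helpful assistant.",
    "inference": {"temperature": 0.7, "top_p": 0.9},
}
same_system_reordered = {
    "inference": {"top_p": 0.9, "temperature": 0.7},
    "system_prompt": "You are a helpful assistant.",
    "weights_digest": "abc123",
}
silently_changed = dict(evaluated, inference={"temperature": 0.2, "top_p": 0.9})

print(system_fingerprint(evaluated) == system_fingerprint(same_system_reordered))
print(system_fingerprint(evaluated) == system_fingerprint(silently_changed))
```

A safety finding recorded alongside such a digest stays attached to one configuration even if the public model name is later reused for a different one; a hosted provider would still need to publish or attest to the digest for outside parties to check it.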

2605.25665 2026-05-26 cs.SE cs.AI Updated version

Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report

面向AI原生软件生产的元工程框架:一种基于合约的对抗性验证架构及早期部署报告

Satadru Sengupta, Tamunokorite Briggs, Ivan Myshakivskyi

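Affiliations * HireNimbus

AI summary Proposes a meta-engineering harness that uses explicit contracts, role-specialized AI agents, and adversarial verification to continuously produce, verify, and improve AI-native software, and reports an early deployment of 17 features in a CTO-as-a-service setting for small service firms.

Comments 17 pages, 2 figures, early deployment report

Details
AI Chinese abstract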

AI原生软件开发通常在单个模型、提示或生成工件的层面进行评估。这种框架对于生产环境是不够的,在这些环境中,软件必须在多个操作上下文和长时间跨度内持续生产、验证、部署、维护和适应。我们提出了一种元工程框架:一种软件生产架构,它将操作和产品特性需求转化为明确的合约,通过角色专业化的AI代理分配工作,执行独立和对抗性验证,并通过结构化失败分类和外环校准持续自我改进。该框架专为软件交付不是一次性项目而是持续运营功能的场景设计。在我们的激励应用——面向小型服务公司的CTO即服务中,该系统将网站、预订流程、支付系统、后台工作流自动化和AI代理接口作为持续演进的技术基础设施进行管理,而非一次性交付物。我们描述了分层架构,包括两遍合约编译、带有专业化记录的持久化Markdown记忆、基于注意力和独立性的验证、四路失败仲裁器以及外环校准。我们报告了早期生产部署的结果,该部署跨越数周,涵盖17项功能,包括一个详细的应用内支付案例研究,揭示了合约不完整性和验证边界问题。这些观察直接推动了框架的针对性改进。贡献在于实现了一个可测量、可扩展的验证架构,使AI原生服务即软件生产变得可靠、可审计且可随时间改进。

English abstract

AI-native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insufficient for production environments where software must be continuously produced, verified, deployed, maintained, and adapted across many operational contexts and long time horizons. We present a meta-engineering harness: a software-production architecture that transforms operational and product feature requirements into explicit contracts, routes work through role-specialized AI agents, performs independent and adversarial verification, and continuously improves itself through structured failure classification and outer-loop calibration. The harness is designed for settings in which software delivery is not a one-time project but an ongoing operating function. In our motivating application, CTO-as-a-service for small service firms, the system manages websites, booking flows, payment systems, backoffice workflow automations, and AI-agent interfaces as continuously evolving technical infrastructure rather than one-off deliverables. We describe the layered architecture, including two-pass contract compilation, persistent markdown memory with specialization records, attention-based and independence-based verifications, a four-way failure arbiter, and outer-loop calibration. We report results from an early production deployment spanning 17 features over several weeks, including a detailed in-app payments case study that revealed contract incompleteness and verification-boundary issues. These observations directly drove targeted improvements to the harness. The contribution is an implemented, measurable, and extensible verification architecture for making AI-native service-as-a-software production reliable, auditable, and improvable over time.

2605.25664 2026-05-26 cs.HC cs.AI cs.AR cs.CY Updated version

Posture Clip: Sit properly or I wont let you work

Posture Clip:坐姿端正,否则不让你工作

Arka Majhi, Aparajita Mondal

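Affiliations * Faculty of Information Technology and Communication Sciences, Tampere University; School of Forest Sciences, University of Eastern Finland

AI summary Proposes PostureClip, a collar-clipped device that restricts users from working while slouching by blacking out the screen and restoring it once posture is corrected; experiments show that it significantly improves posture angle and reduces the time spent slouching.

Comments Published online by Cambridge University Press on 14 May 2026

Details
Journal ref
Wearable Technologies, 7, e5 (2026)
AI Chinese abstract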

不良姿势因其对健康和生产率的有害影响而成为一个重要问题。本文提出了一种名为PostureClip的衣夹式设备,旨在通过黑屏并在纠正姿势后恢复屏幕,限制用户以弯腰角度坐着工作,从而促进更好的姿势。该设备集成了传感器和反馈机制,为用户提供实时姿势反馈。为了评估PostureClip的有效性,进行了一项对照实验,参与者(n=165)每天使用笔记本电脑/个人电脑工作超过6小时。参与者被随机分配到干预组(IG1,n=54;IG2,n=55),使用衣夹式设备,以及对照组(CG,n=56),不使用该设备。IG1未收到反馈,而IG2通过通知并进一步使屏幕变暗从设备获得反馈。研究在参与者的办公室环境中进行,持续4周,收集了姿势角度、弯腰持续时间以及用户反馈等指标。分析显示,与无反馈组和对照组(未干预)相比,使用带反馈的PostureClip的参与者组在姿势角度上有显著改善(p<0.001),弯腰持续时间显著减少(p<0.01)。用户反馈的定性分析强调了该设备的易用性、提供及时反馈的有效性以及对参与者姿势意识和习惯的积极影响。这些结果表明,PostureClip是促进久坐工作中更好姿势的有效工具。

English abstract

Poor posture is a significant concern due to its detrimental effects on health and productivity. This paper presents a collar-clipped device called PostureClip, designed to restrict users from sitting and working at a bent angle by blacking out the screen and restoring it once posture is corrected, thereby promoting better posture. The device integrates sensors and feedback mechanisms to provide real-time posture feedback to users. To evaluate the effectiveness of PostureClip, a controlled experiment was conducted with participants (n=165) who worked on a laptop/PC for over 6 hours per day. The participants were randomly assigned to one of two intervention groups (IG1, n=54; IG2, n=55), which used the collar-clipped device, or to the control group (CG, n=56), which did not use the device. IG1 did not receive feedback, while IG2 received feedback from the device through notifications and further darkening of the screen. The study was conducted in the participants' office environments for 4 weeks, and metrics such as posture angle, duration of bent angle, and user feedback were collected. Analysis revealed significant improvements in posture angle (p<0.001) and a significant reduction in bent-angle duration (p<0.01) for the group using PostureClip with feedback, compared to the group without feedback and the control group (which received no intervention). The qualitative analysis of user feedback highlighted the device's ease of use, effectiveness in providing timely feedback, and positive impact on participants' awareness and habits regarding posture. These results indicate that PostureClip is an effective tool for promoting better posture during sedentary work.

2605.25658 2026-05-26 cs.CL cs.AI Updated version

AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

AutoSG: 仅从任务提示出发的LLM驱动的昂贵优化求解器生成

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang

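Affiliations * Xidian University; Victoria University of Wellington

AI summary Proposes the AutoSG framework, which generates executable customized solvers directly from natural language prompts through retrieval-augmented generation, one-step self-refinement, and instance-free evaluation, addressing hallucination, structural disruption, and evaluation cost in expensive optimization.

Details
AI Chinese abstract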

昂贵优化任务在现实应用中普遍存在,需要高度专业化的求解器。虽然LLM驱动的自动求解器生成显示出前景,但当前范式在处理昂贵优化时面临三个关键问题:由于领域知识不足导致的事实幻觉、在细化过程中频繁破坏先前建立的局部最优结构,以及在训练实例上执行带来的高昂评估成本和受限的泛化能力。为了解决这些问题,我们引入了AutoSG,一个完全自动化的流程,直接将自然语言提示转换为可执行的定制求解器。AutoSG具有三个核心创新:一个检索增强的求解器生成模块,严格将代码基于经过验证的文献;一个单步自优化算子,在保留关键结构组件的同时引入特定任务的改进;以及一个基于Elo的无实例LLM-as-a-Judge评估机制,快速建立全局排名。在多种昂贵优化任务上的广泛评估证实,AutoSG显著优于人工设计的最先进框架和现有的LLM生成的求解器。

English abstract

Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated solver generation shows promise, current paradigms face three critical issues when tackling expensive optimization: factual hallucinations due to deficient domain knowledge, the frequent dismantling of previously established locally optimal structures during refinement, and the prohibitive evaluation costs alongside restricted generalization caused by executing on training instances. To address these issues, we introduce AutoSG, a fully automated workflow directly translating natural language prompts into executable customized solvers. AutoSG features three core innovations: a retrieval-augmented solver generation module strictly grounding code in verified literature; a one-step self-refinement operator introducing task-specific improvements while preserving critical structural components; and an instance-free Elo-based LLM-as-a-Judge evaluation mechanism rapidly establishing global rankings. Extensive evaluations across diverse expensive optimization tasks confirm AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.
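The abstract mentions an Elo-based ranking built from LLM-as-a-Judge comparisons. The sketch below shows the standard Elo update applied to pairwise verdicts; it is not AutoSG's code. The K-factor of 32, the starting rating of 1500, the solver names, and the list of verdicts are assumptions, and the judge itself is replaced by a hard-coded list of winners.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1, 0.5, or 0."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical candidate solvers, all starting from the same rating.
ratings = {"solver_1": 1500.0, "solver_2": 1500.0, "solver_3": 1500.0}

# Stand-in for judge output: (winner, loser) pairs.
verdicts = [("solver_1", "solver_2"),
            ("solver_1", "solver_3"),
            ("solver_2", "solver_3")]

for winner, loser in verdicts:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], score_a=1.0
    )

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each update needs only a pairwise verdict, a global ranking can be built without executing any candidate on a training instance, which is the cost the abstract says the instance-free evaluation avoids.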

2605.25632 2026-05-26 cs.AI cs.LG q-fin.RM Updated version

Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

为每个行动投保:自主AI代理运行时精算控制的权威边界框架

Hao-Hsuan Chen

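Affiliations * Department of Risk Management and Insurance

AI summary Proposes the Actuarial Action Interface (AAI) and the Authority Frontier framework, which use a deterministic runtime contract to price, gate, and evaluate the side-effect-bearing actions of autonomous AI agents, enabling cross-domain actuarial control and benchmarking.

Comments 35 pages, 4 figures, 11 tables. Companion paper on the mathematical foundations: SSRN 6761960

Details
AI Chinese abstract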

自主AI代理越来越多地产生带有副作用的行动:数据库变更、退款、支付、外部承诺。我们提出精算行动接口(AAI),这是一个确定性的运行时合约,它在时间一致的风险映射下,对每个此类行动按照合约固定的安全默认值进行定价,并根据每个边界的储备资本预算门控执行。然后我们开发了权威边界,这是一种评估原语,用于衡量运行时在每个储备资本水平下释放的自主权威量。该框架提供:(i) 一个确定性的报价-绑定-提交协议,带有通行费限制的能力令牌;(ii) 一个通用的七类行动分类法,将异构工具调用映射到可比较的权威单位;(iii) 在alpha支出下的重放确定性和逐路径储备覆盖;(iv) 通过全储备需求C_full和资本指标Capital@k进行跨域归一化。我们在四个代理环境(数据库变更、客服退款以及公共tau-bench零售和航空工具使用轨迹)中实例化AAI,并报告一个实时Postgres面板,其中三个Azure托管的模型通过同一合约提出行动。边界在跨域中表现出常见的低储备拒绝和中间释放模式,仅在预算网格达到全储备需求时饱和;所需储备资本变化达22倍(Capital@50从289到6457)。该框架不强制域采用相同形状;它揭示每个域的精算几何。在实时面板中,合约在低预算下防止了所有三个模型的实现损失,但在拒绝下的承保持续性方面有所不同:模型身份是一个精算承保变量。贡献是一个用于自主代理副作用运行时精算控制的基准就绪评估框架。

英文摘要

Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.
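A toy sketch of the quote-bind-commit idea and of a frontier curve is given below. Everything in it is an assumption made for illustration: the class and function names, the flat price table, and the rule that an action is refused when its price exceeds the remaining reserve. The paper's actual risk mapping, capability tokens, and Capital@k metric are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    action: str
    price: float  # hypothetical reserve charge for this action


class ReserveGate:
    """Gates side-effect-bearing actions against a reserve budget."""

    def __init__(self, budget, price_table):
        self.remaining = budget
        self.price_table = price_table
        self._bound = {}
        self._next_token = 0

    def quote(self, action):
        return Quote(action, self.price_table[action])

    def bind(self, quote):
        # Refuse when the quoted price exceeds the remaining reserve.
        if quote.price > self.remaining:
            return None
        self.remaining -= quote.price
        token = self._next_token
        self._next_token += 1
        self._bound[token] = quote
        return token

    def commit(self, token):
        # Tokens are single-use: committing removes the binding.
        return self._bound.pop(token).action


def released_fraction(actions, budget, price_table):
    """Fraction of proposed actions that the gate releases at this budget."""
    gate = ReserveGate(budget, price_table)
    released = 0
    for action in actions:
        token = gate.bind(gate.quote(action))
        if token is not None:
            gate.commit(token)
            released += 1
    return released / len(actions)


prices = {"read": 0.0, "refund": 8.0, "payment": 5.0}  # hypothetical prices
actions = ["read", "refund", "payment"]
frontier = [released_fraction(actions, b, prices) for b in (0, 5, 8, 13)]
```

In this toy setting the curve shows refusal at low reserve, partial release at intermediate budgets, and saturation only once the budget reaches the total demand of 13.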

2605.25620 2026-05-26 cs.AI 版本更新

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

回归简约潜在变量:从视觉基础学习以任务为中心的世界模型

Minghao Fu, Fan Feng, Nicklas Hansen, Biwei Huang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出TC-WM框架,通过将预训练视觉嵌入线性投影为紧凑潜在状态、对比学习对齐子空间并重建嵌入,将基础模型特征转化为任务充分的世界表示,实现更好的世界建模质量和控制精度。

详情
AI中文摘要

世界模型使智能体能够根据动作预测未来动态,因此潜在表示的选择对于规划和控制至关重要。这种表示通常要么直接从像素中学习,但语义结构有限;要么继承自冻结的视觉基础模型,但包含过多与任务无关的细节,导致状态空间与下游规划和控制不匹配。这在无奖励的离线设置中尤其具有挑战性,因为模型必须从固定轨迹中学习,没有奖励监督或在线交互。为了解决这个问题,我们提出了TC-WM,一个将基础模型嵌入转化为紧凑、任务充分的世界表示的框架。关键设计是将预训练嵌入空间视为语义支架而非最终状态空间:TC-WM将高维视觉嵌入线性投影到紧凑潜在变量作为动态空间,通过对比学习将子空间与智能体的物理状态对齐,并重建嵌入以保留有用的视觉结构。这结合了基础特征的通用性和以任务为中心的动态的可控性。理论上,我们证明TC-WM足以识别潜在的任务中心潜在因子,只需简单变换。实验上,TC-WM能够在多种环境(如Robomimic和D4RL)中实现测试时规划,其世界建模质量和控制精度均优于现有方法。

英文摘要

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

2605.25612 2026-05-26 cs.LG cs.AI 版本更新

Towards the Connection between Activation Sparsity and Flat Minima

激活稀疏性与平坦极小值之间的联系

Ze Peng, Jian Zhang, Lei Qi, Yang Gao, Yinghuan Shi

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Institute of Brain-Machine Interface, Nanjing University(南京大学脑机接口研究院) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院)

AI总结 本文发现损失景观的平坦性与Transformer中MLP激活稀疏性密切相关,通过理论推导和三种实用方法增强稀疏性,显著降低推理和训练成本。

详情
AI中文摘要

标准训练的Transformer的MLP块中出现的激活稀疏性为在不牺牲性能的情况下大幅降低计算成本提供了机会。为了从理论上解释这一现象,现有工作表明激活稀疏性并非源于数据属性或数据拟合,而是来自训练过程的隐式偏差。然而,这些联系是在强假设下得到的,无法应用于标准训练的大步数深度模型。与这些工作不同,我们发现损失景观的平坦性也与MLP激活稀疏性密切相关,并且可以作为标准深度网络的一个更弱且自然出现的假设。具体来说,我们发现:1) MLP激活稀疏性等于“增强平坦性”(平坦性度量的加权和)与输入范数和MLP激活梯度乘积的比值。我们经验性地发现该比值在训练过程中下降,导致稀疏激活。2) 我们还提出了导数稀疏性的概念,在ReLU下它退化为激活稀疏性,但进一步支持反向传播中的剪枝,并且比激活稀疏性更稳定。基于理论发现,我们通过三种方法减小分子和增大分母来进一步鼓励激活稀疏性。这些即插即用的修改可以有效降低比值并产生更稀疏的激活。在ImageNet-1K和C4上的实验表明,与原始Transformer相比,推理稀疏性至少提高36%,训练稀疏性至少提高50%,表明在推理和训练中进一步降低成本的潜力。

英文摘要

The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To theoretically explain this phenomenon, existing works have shown that activation sparsity does not result from the data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained under strong assumptions, which do not apply to deep models standardly trained with a large number of steps. In contrast to these works, we find that the flatness of loss landscapes is also closely related to MLP activation sparsity and can serve as a weaker and naturally emerging assumption for standard deep networks. Specifically, 1) we find that the MLP activation sparsity equals a ratio between "augmented flatness" (a weighted sum of flatness measures) and the product of the input norm and activation gradient of the MLP. We empirically find that this ratio decreases during training, leading to sparse activations. 2) We also propose the notion of derivative sparsity, which reduces to activation sparsity under ReLU, but further enables pruning in the backward propagation and is more stable than activation sparsity. Building on these theoretical findings, we further encourage activation sparsity by decreasing the numerator and increasing the denominator of the ratio using three methods. These plug-and-play modifications effectively reduce the ratio and produce sparser activations. Experiments on ImageNet-1K and C4 demonstrate relative improvements of at least 36% on inference sparsity and at least 50% on training sparsity over vanilla Transformers, indicating further potential cost reduction in both inference and training.
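The statement that derivative sparsity reduces to activation sparsity under ReLU can be checked on a toy example. The sketch below measures sparsity as the fraction of exact zeros and takes the ReLU derivative at 0 to be 0. Both conventions, the helper names, and the pre-activation values are assumptions for illustration, not the paper's definitions.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Convention assumed here: the derivative at exactly 0 is 0.
    return 1.0 if x > 0 else 0.0


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


pre_activations = [-1.2, 0.3, -0.5, 2.0, 0.0, -3.1]  # hypothetical MLP inputs
activations = [relu(x) for x in pre_activations]
derivatives = [relu_grad(x) for x in pre_activations]

activation_sparsity = zero_fraction(activations)
derivative_sparsity = zero_fraction(derivatives)
```

For ReLU the two zero patterns coincide entry by entry, so a neuron skipped in the forward pass can also be skipped in the backward pass.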

2605.25603 2026-05-26 cs.AI 版本更新

Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

通过电路引导的内外不一致性检测不忠实的思维链

Xu Shen, Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang, Tianlong Chen

发表机构 * Jilin University(吉林大学) University of Central Florida(中央佛罗里达大学) Arizona State University(亚利桑那州立大学) University of Vienna(维也纳大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出CIE-Scorer框架,通过追踪句子级电路并利用Fused Gromov-Wasserstein距离度量内部与外部推理图的不一致性,实现实例级思维链不忠实检测,在FaithCoT-Bench上取得最优性能并降低电路构建成本。

详情
AI中文摘要

思维链(CoT)推理提高了大型语言模型(LLMs)的问题解决能力,但生成的推理轨迹可能并不忠实地反映模型的实际决策过程。现有的CoT不忠实检测器主要依赖于生成理由的外部信号,如文本合理性或答案一致性,而忽略了来自模型内部计算的证据。尽管最近的电路追踪方法通过追踪推理过程中信息如何在模型组件间流动提供了获取模型内部证据的途径,但为长CoT构建完整推理电路成本高昂且难以扩展。为应对这些挑战,我们提出了电路引导的内外不一致性评分器(CIE-Scorer),一个用于实例级CoT不忠实检测的框架。关键思想是,忠实的推理轨迹应与模型的计算过程一致,而不忠实的轨迹可能偏离它。CIE-Scorer从信息丰富的推理令牌中高效追踪紧凑的句子级电路,构建内部和外部推理图,并使用Fused Gromov-Wasserstein距离度量它们的不一致性。在FaithCoT-Bench的四个数据集上的实验表明,CIE-Scorer在降低电路构建成本的同时实现了最先进的性能,证明了将机械可解释性信号与外部推理轨迹相结合用于CoT不忠实检测的有效性。

英文摘要

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.

2605.25601 2026-05-26 cs.CL cs.AI 版本更新

Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

面向大语言模型可控模拟不完美学生的基准

Alexander Apartsin, Omri Sason, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍隆理工学院) Afeka Tel Aviv Academic College of Engineering(阿法卡特拉维夫工程学院)

AI总结 本研究提出一个基准框架,通过提示控制语言模型模拟具有指定技能轮廓的学生,并评估其可控性,为教师教育中的刻意练习提供支持。

Comments 22 pages, 7 figures

详情
AI中文摘要

教师教育需要与表现出可识别优势、弱点和部分掌握的学习者进行刻意练习。大型语言模型可以通过模拟具有已知技能组成部分的学生来支持这种练习,使教师能够演练解释、诊断和教学回应。然而,为此目的,核心要求既不是最大化基准准确率,也不是抑制孤立的事实,而是控制模型行为,使其反映指定的技能轮廓。本文研究了是否可以通过提示引导语言模型保留某些技能同时抑制其他技能。我们引入了一个面向基准的框架,其中显式技能向量表示模拟学生,基于提示的控制指定保留和缺失的能力,并使用轮廓对齐指标、保留与遗忘比较以及跨技能校准分析来评估行为。结果表明,在结构化数学环境中可以诱导和测量选择性的部分掌握,尽管可控程度仍依赖于模型。这些发现将可控学习者模拟定位为教师教育、教育模拟和语言模型控制交叉领域的一个独特研究问题。

英文摘要

Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark accuracy nor to suppress isolated facts, but to control model behavior so that it reflects a specified skill profile. This paper investigates whether prompted language models can be steered to retain some skills while suppressing others. We introduce a benchmark-oriented framework in which an explicit skill vector represents a simulated student, prompt-based control specifies retained and missing competencies, and behavior is evaluated using profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. The results show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability remains model-dependent. These findings position controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

2605.25584 2026-05-26 cs.RO cs.AI 版本更新

Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation

作用于未知:面向分散式多机器人任务分配的无通信协同过滤

Alexander Apartsin, Yigal Meshulam, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍隆理工学院) Afeka Tel Aviv Academic College of Engineering(阿法卡特拉维夫工程学院)

AI总结 针对零知识多机器人任务分配问题,提出基于在线低秩协同过滤的SwarmCF方法,无需通信、先验知识或协调者,实现每个机器人在未见任务上的有效行动,并证明其样本复杂度优势。

Comments 27 pages, 12 figures

详情
AI中文摘要

多机器人任务分配通常假设某种通信、已知任务模型或协调者的组合。我们研究相反的极端情况,这在实践中常见但在理论上被忽视,我们称之为零知识MRTA(ZK-MRTA):一个没有先验知识(没有任务模型,甚至没有潜在秩)、没有通信(没有消息、没有参数共享、没有协调者)、并且只能部分且私下带噪地观察队友结果的公共流的机器人团队。一个隐藏的低秩结构决定了哪个机器人适合哪个任务,并且任务数量远多于轮次,因此大多数(机器人,任务)对从未被尝试过。然而,每个机器人可以通过在广播流上运行在线低秩协同过滤(SwarmCF)来很好地处理从未尝试过的任务以及新任务。与任何无结构学习器相比,优势是类别性的,而不是常数因子:无结构学习器在未见对上的误差被证明处于先验均值水平。我们证明了每个机器人的匹配样本复杂度(在秩d和任务数n下,Θ(d) vs Θ(n)),任务稀缺下的任意时间(累积奖励)分离,以及一个确定性条件,在该条件下从掩码广播中分散恢复是精确的(经验验证)。实验量化了广播的价值、一个正比例缩放律(每个机器人的未见对技能随团队规模增加)、以及低秩方法中最强的掩码鲁棒性和任意时间曲线,恢复了集中式全通信上限的大部分(约80%的技能收益),并在容量1竞争和基于机器人的感知实例中保持有效。

英文摘要

Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite extreme, a regime common in practice but overlooked in theory, which we name Zero-Knowledge MRTA (ZK-MRTA): a robot team with no prior knowledge (no task models, not even the latent rank), no communication (no messages, no parameter sharing, no coordinator), and only a partial and privately-noisy view of a public stream of teammates' outcomes. A hidden low-rank structure governs which robot suits which task, and there are far more tasks than rounds, so most (robot, task) pairs are never attempted. Yet each robot can act well on tasks it never attempted, and onboard new tasks, by running online low-rank collaborative filtering over the broadcast (SwarmCF). The advantage over any structure-free learner is categorical, not a constant factor: a structure-free learner is provably at the prior-mean error floor on unseen pairs. We prove a matching per-robot sample complexity (Θ(d) versus Θ(n), in the rank d and the task count n), an anytime (cumulative-reward) separation under task scarcity, and a deterministic condition under which decentralized recovery from the masked broadcast is exact (validated empirically). Experiments quantify the value of the broadcast, a positive scaling law (per-robot unseen-pair skill rises with team size), and the strongest masking-robustness and anytime profile among low-rank methods, recovering most (about 80% on earned skill) of a centralized full-communication ceiling, and holding under capacity-1 contention and in a robotics-grounded sensing instance.
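The idea that a hidden low-rank structure lets a robot predict outcomes on (robot, task) pairs nobody has tried can be illustrated with a small matrix-completion sketch. It swaps in batch rank-1 alternating least squares on noise-free data in place of the paper's online SwarmCF procedure. The function name, the rank-1 assumption, and the toy skill values are all hypothetical.

```python
def complete_rank1(observed, n_rows, n_cols, iters=200):
    """Fit outcome[i][j] ~ u[i] * v[j] from observed entries only.

    observed: dict mapping (robot, task) -> outcome.
    Every robot and every task must appear in at least one observed pair.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        for i in range(n_rows):
            num = sum(x * v[j] for (r, j), x in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            u[i] = num / den
        for j in range(n_cols):
            num = sum(x * u[i] for (i, t), x in observed.items() if t == j)
            den = sum(u[i] ** 2 for (i, t) in observed if t == j)
            v[j] = num / den
    return u, v


# Hypothetical ground truth: robot skill times task difficulty.
true_u = [1.0, 2.0, 3.0]
true_v = [1.0, 0.5, 2.0, 4.0]
hidden = {(0, 3), (1, 0), (2, 2)}  # pairs never attempted
observed = {
    (i, j): true_u[i] * true_v[j]
    for i in range(3)
    for j in range(4)
    if (i, j) not in hidden
}

u, v = complete_rank1(observed, 3, 4)
predictions = {pair: u[pair[0]] * v[pair[1]] for pair in hidden}
```

A learner that ignores the low-rank structure has no information about the hidden pairs, whereas the factor model recovers them from the shared row and column factors.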

2605.25577 2026-05-26 cs.LG cs.AI 版本更新

Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition

基于流形分解的几何流匹配分子构象生成

Yunqing Liu, Yi Zhou, Wenqi Fan

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出GO-Flow方法,通过将生成过程分解为平移、旋转和构象三个物理子空间,利用流形上的最优传输和测地流,解决现有方法忽略分子几何层次结构的问题,实现高质量、高效率的分子构象生成。

详情
AI中文摘要

生成准确的3D分子构象是计算化学和药物发现中的关键挑战。最近,扩散和流匹配模型取得了显著成功。然而,它们的数学公式与分子的物理现实之间存在严重的不匹配。现有方法主要将分子视为笛卡尔空间中的无结构点云,忽略了键长和键角相对刚性而扭转角构成主要柔性自由度的内在层次力学。这种对流形的不感知迫使模型从头重新学习基本几何约束,常常导致物理上不可信的中间结构。为了解决这个问题,我们提出了GO-Flow,通过流形分解将生成建模与分子几何对齐。GO-Flow不是强制在欧几里得空间中运动,而是将生成过程分解为三个物理驱动的子空间:具有线性最优输运的平移空间、$SO(3)$上具有测地流的旋转空间以及具有熵最优输运的构象空间。这种分解注入了几何归纳偏置,使生成路径更好地与分子自由度对齐。当与等变神经架构结合时,它鼓励旋转一致的生成并提高几何有效性。在GEOM-Drugs和GEOM-QM9上的大量实验表明,GO-Flow实现了最先进的生成质量。值得注意的是,通过在正确的流形上自然地学习更直的概率路径,我们的方法能够在仅50步的情况下实现高保真采样,有效弥合了结构精度与计算效率之间的差距。

英文摘要

The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffusion and flow matching models have achieved remarkable success. However, there is a critical misalignment between their mathematical formulation and the physical reality of molecules. Existing approaches predominantly treat molecules as unstructured point clouds in Cartesian space, overlooking the intrinsic hierarchical mechanics in which bond lengths and bond angles are relatively stiff, whereas torsion angles constitute the dominant flexible degrees of freedom. This lack of manifold awareness forces models to relearn fundamental geometric constraints from scratch, often leading to physically implausible intermediate structures. To address this, we propose GO-Flow, which aligns generative modeling with molecular geometry via manifold decomposition. Instead of forcing motion through Euclidean space, GO-Flow decomposes the generation process into three physically motivated subspaces: translation space with linear optimal transport, rotation space with geodesic flows on $SO(3)$, and conformation space with entropic optimal transport. This decomposition injects geometric inductive biases and makes the generative paths better aligned with molecular degrees of freedom. When combined with equivariant neural architectures, it encourages rotation-consistent generation and improves geometric validity. Extensive experiments on GEOM-Drugs and GEOM-QM9 demonstrate that GO-Flow achieves state-of-the-art generation quality. Notably, by naturally learning straighter probability paths on the correct manifolds, our method enables high-fidelity sampling with as few as 50 steps, effectively bridging the gap between structural precision and computational efficiency.
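To make the difference between the translation and rotation subspaces concrete, the sketch below interpolates a translation along a straight line and a rotation along a geodesic, using unit quaternions and spherical linear interpolation as a stand-in for a geodesic path on $SO(3)$. The function names and the quaternion parameterization are assumptions for illustration; the abstract does not specify how GO-Flow represents rotations.

```python
import math


def lerp(x0, x1, t):
    """Straight-line path between two translation vectors."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(x0, x1))


def slerp(q0, q1, t):
    """Geodesic path between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    theta = math.acos(min(1.0, dot))
    if theta < 1e-8:
        return tuple(q0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
halfway_rotation = slerp(identity, quarter_turn_z, 0.5)
halfway_translation = lerp((0.0, 0.0, 0.0), (2.0, 4.0, 6.0), 0.5)
```

Halfway between the identity and a 90-degree turn about z, the geodesic gives a 45-degree turn about z, whereas naive averaging of rotation matrices would leave the rotation group.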

2605.25574 2026-05-26 cs.CV cs.AI 版本更新

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Mosaic: 通过向量场混合的组合式多概念擦除

Junseok Ko, Jungwoo Kim, Jong-Seok Lee

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) School of Integrated Technology, Yonsei University(延世大学整合技术学院)

AI总结 针对流式文本到图像模型中同时擦除多个目标概念的任务,提出Mosaic框架,通过动态构建概念特定掩码并选择性混合向量场,无需额外优化即可有效移除复杂场景中的多概念。

详情
AI中文摘要

概念擦除已成为确保文本到图像(T2I)模型安全与伦理图像合成的关键研究方向。现有研究虽探索了多概念擦除,但通常假设每张图像仅有一个目标概念,这一限制被现代基于流的T2I模型日益暴露,此类模型可同时生成包含多个概念的复杂场景。为弥补这一空白,我们引入组合式多概念擦除这一新任务,旨在同时移除单个场景中的多个目标概念。我们提出CoME-Bench,一个用于评估组合式多概念擦除的基准,涵盖类别内和跨类别场景。我们进一步提出Mosaic,一个用于基于流的T2I模型中多概念擦除的新框架,该框架通过动态构建概念特定掩码并选择性混合它们,利用向量场中目标概念的空间局部性,无需额外优化。大量实验表明,Mosaic能有效移除复杂组合场景中的多个目标概念,同时保留非目标上下文。

英文摘要

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.
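The mask-based blending of vector fields can be sketched on a flattened toy field. The sketch assumes the concept-specific masks and the per-concept "erased" fields are already given; how Mosaic actually builds its masks is not described in the abstract. The function name and the overlap normalization are hypothetical choices.

```python
def blend_fields(base_field, erased_fields, masks):
    """Blend per-concept erased fields into the base field.

    base_field: list of floats (a flattened vector field).
    erased_fields[c], masks[c]: lists of the same length, masks in [0, 1].
    Outside every mask the base field is kept unchanged.
    """
    blended = []
    for i, base in enumerate(base_field):
        total = sum(mask[i] for mask in masks)
        if total == 0:
            blended.append(base)
            continue
        # Where masks overlap, rescale so the correction stays a convex mix.
        scale = 1.0 / max(1.0, total)
        delta = sum(
            mask[i] * (erased[i] - base)
            for mask, erased in zip(masks, erased_fields)
        )
        blended.append(base + scale * delta)
    return blended


base = [1.0, 1.0, 1.0, 1.0]
erased_a = [0.0, 0.0, 0.0, 0.0]  # hypothetical field with concept A removed
erased_b = [5.0, 5.0, 5.0, 5.0]  # hypothetical field with concept B removed
mask_a = [1.0, 0.0, 0.0, 0.0]    # concept A is localized at position 0
mask_b = [0.0, 0.0, 1.0, 0.0]    # concept B is localized at position 2
result = blend_fields(base, [erased_a, erased_b], [mask_a, mask_b])
```

Only the masked positions change, which mirrors the stated goal of removing target concepts while preserving non-target context.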

2605.25572 2026-05-26 cs.CL cs.AI 版本更新

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

PennySynth:基于RAG的数据合成用于自动量子代码生成

Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD)(eBRAIN实验室,工程系,纽约大学阿布扎比分校) Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute(量子与拓扑系统中心(CQTS),NYUAD研究所) Department of Computer Science and Engineering, NYU Tandon School of Engineering(计算机科学与工程系,纽约大学坦顿工程学院)

AI总结 提出PennySynth框架,通过检索增强生成和代码感知嵌入,利用13,389个PennyLane指令-代码对数据集,在QHack竞赛中实现52%-68%的pass@5,显著提升量子代码生成的结构有效性和功能正确性。

Comments 11 pages, 3 figures

详情
AI中文摘要

量子编程框架日益增长的复杂性暴露了现有基于大语言模型(LLM)的代码助手的一个关键局限性:通用模型在面对专门的量子编码挑战时,会幻觉出PennyLane特定的门名称、错误放置设备配置并生成结构无效的电路。我们提出PennySynth,一个检索增强生成框架,通过将LLM推理条件化为一个包含13,389个PennyLane指令-代码对的精选知识库来解决这一差距,该知识库通过一个三阶段(提取、验证和去重)流程从官方PennyLane仓库、社区GitHub源和QHack竞赛档案中构建。PennySynth引入了一种使用st-codesearch-distilroberta-base的代码感知嵌入策略,该策略针对自然语言到代码的检索进行训练,将平均检索余弦相似度从通用基线的0.45提高到0.726。在涵盖QHack竞赛三年(2022、2023、2024)的74个挑战上进行评估,PennySynth在QHack 2022、2023和2024上分别达到64%、68%和52%的pass@5,相比无检索的Claude Sonnet 4.6提高了+28、+25和+28个百分点。我们进一步引入了一个量子适应的CodeBLEU指标,该指标对qml.*令牌模式进行加权,并表明结构代码相似性和功能正确性捕捉了量子代码质量的不同方面。受控消融实验揭示,代码感知嵌入是检索性能的主要驱动因素,而当检索质量足够精确时,数据集扩展和源组合提供了额外的增益。

英文摘要

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.
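Two quantities in this abstract lend themselves to a short sketch: retrieval by cosine similarity and the pass@k metric. The code below assumes embeddings are already computed and uses the widely used unbiased pass@k estimator. The abstract does not say which pass@k variant the authors use, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, knowledge_base, k=2):
    """Return the k entries whose embeddings are most similar to the query.

    knowledge_base: list of (embedding, text) pairs.
    """
    ranked = sorted(
        knowledge_base,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


kb = [([1.0, 0.0], "a"), ([0.0, 1.0], "b"), ([1.0, 1.0], "c")]  # toy entries
top = retrieve([1.0, 0.1], kb, k=2)
score = pass_at_k(10, 1, 5)
```

In a retrieval-augmented setup, the retrieved instruction-code pairs would be placed in the prompt before generation, and pass@k would then be measured on the generated programs.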

2605.25566 2026-05-26 cs.AI 版本更新

Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

基于大语言模型的不确定性推理用于可解释疾病诊断

Xiaoyang Fan, Yufan Cai, Zhe Hou, Jin Song Dong

发表机构 * National University of Singapore(新加坡国立大学) Griffith University(格里菲斯大学)

AI总结 提出一种神经符号推理框架,将大语言模型与模糊逻辑和声明式规则结合,实现可解释且形式可验证的医学诊断。

详情
AI中文摘要

临床决策需要对不完整、不精确且以语言表达的患者叙述进行推理。虽然大语言模型(LLMs)擅长从自然语言中提取潜在信息,但它们缺乏可信赖医疗AI所必需的可验证性和可解释性。我们提出一种神经符号推理框架,将LLMs与形式逻辑对齐,以实现可解释且形式可验证的医学诊断。患者描述和临床指南被嵌入神经知识库,其中LLMs提取结构化医疗实体、时间关系和模糊症状模式,这些被解码为用模糊逻辑和声明式规则表达的符号知识库。我们执行两阶段推理:(1)归纳符号泛化,从编码叙述中捕获诊断模式;(2)通过逻辑编程引擎进行推理验证,推导并验证符合临床标准的诊断。每个症状被视为具有概率权重的模糊谓词,推理路径可审计、可调整,并与医生反馈兼容。与纯统计方法不同,我们的系统支持迭代优化:LLM生成的诊断与真实情况之间的偏差可以通过形式规则追踪、解释和纠正。通过结合基于逻辑的透明性、LLM的适应性和概率鲁棒性,该框架实现了与人类一致的医疗推理,具有强泛化能力和可验证的逐步推理链。我们在公开基准上验证了该框架,展示了符号推理与LLM在真实临床叙述中的有效协调。结果显示,性能与最先进的LLM相当,同时额外提供了可解释的推理路径和形式可验证的诊断结论。

英文摘要

Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.

2605.25558 2026-05-26 cs.AI 版本更新

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

超越查询记忆化:基于查询分解和历史匹配的大语言模型路由

Bo Lv, Jingbo Sun

发表机构 * Tencent Hunyuan(腾讯混元) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出DecoR路由框架,通过查询能力分解和历史日志匹配来避免记忆化陷阱,在保持高准确率的同时降低推理成本。

详情
AI中文摘要

优化预测性能与计算成本之间的权衡是大语言模型(LLM)部署中的核心关注点。当前的路由方法主要依赖于基于表面特征的查询到模型的直接映射,使其容易陷入记忆化陷阱,并导致在分布外(OOD)数据上的泛化能力差。在本文中,我们提出DecoR,一种新颖的路由框架,将路由任务重新定义为从历史日志中筛选相似查询的匹配过程,有效缓解了记忆化陷阱。为了提高匹配准确性,我们引入了一种查询能力分解方法,将语言表面形式与任务内在需求解耦,将匹配导向能力维度,从而将决策基于基本任务属性。此外,我们开发了CodaSet,一个用于评估路由泛化能力的综合基准,实验结果表明,DecoR在分布内和OOD设置下均保持优越的准确性,同时大幅降低推理成本。所有代码和数据可在https://github.com/lvbotenbest/DecoR获取。

英文摘要

Optimizing the trade-off between predictive performance and computational cost is a central focus in the deployment of Large Language Models (LLMs). Current routing methods primarily rely on direct mapping from queries to models based on surface-level features, making them susceptible to the memorization trap and leading to poor generalizability on out-of-distribution (OOD) data. In this paper, we propose DecoR, a novel routing framework that recasts the routing task as a matching process of sifting similar queries from historical logs, effectively mitigating the memorization trap. To enhance matching accuracy, we introduce a query capability deconstruction method that decouples linguistic surface forms from task-intrinsic requirements, directing matching toward capability dimensions to ground decisions in essential task attributes. Furthermore, we develop CodaSet, a comprehensive benchmark for assessing routing generalization, where experimental results demonstrate that DecoR maintains superior accuracy while substantially lowering inference costs across both in-distribution and OOD settings. All code and data are available at https://github.com/lvbotenbest/DecoR.
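Routing by matching against historical logs in capability space can be sketched as a nearest-neighbor lookup. The sketch assumes each query has already been decomposed into a capability vector, a step the paper performs with its own method that is not reproduced here. The function name, the Euclidean distance, the success threshold, and the fallback to the most expensive model are all hypothetical.

```python
import math


def route(query_caps, history, model_costs, k=3, min_success=0.6):
    """Pick the cheapest model that succeeded on similar past queries.

    history: list of (capability_vector, model_name, succeeded) records.
    model_costs: dict mapping model_name -> relative inference cost.
    """
    nearest = sorted(history, key=lambda rec: math.dist(query_caps, rec[0]))[:k]
    outcomes = {}
    for _, model, succeeded in nearest:
        outcomes.setdefault(model, []).append(succeeded)
    for model in sorted(outcomes, key=model_costs.get):
        results = outcomes[model]
        if sum(results) / len(results) >= min_success:
            return model
    # No evidence that a cheap model works: fall back to the costliest one.
    return max(model_costs, key=model_costs.get)


history = [
    ((1.0, 0.0), "small", True),
    ((1.0, 0.1), "small", True),
    ((0.9, 0.0), "large", True),
    ((0.0, 1.0), "small", False),
    ((0.0, 0.9), "large", True),
    ((0.1, 1.0), "small", False),
]
costs = {"small": 1.0, "large": 10.0}
easy_choice = route((1.0, 0.0), history, costs)
hard_choice = route((0.0, 1.0), history, costs)
```

Because matching happens on capability dimensions rather than surface wording, a query phrased in an unfamiliar way can still be routed by the skills it requires.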

2605.25554 2026-05-26 cs.AI 版本更新

PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

PHGNet: 原型引导的超图构建用于异质时空预测

Ruiwen Gu, Yahao Liu, Zhenyu Liu, Qitai Tan, Xiao-Ping Zhang

发表机构 * Shenzhen Ubiquitous Data Enabling Key Lab(深圳市泛在数据赋能重点实验室) Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) School of Computer Science and Engineering(计算机科学与工程学院) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出基于原型引导超图构建的时空预测框架PHGNet,通过原型学习机制自适应地将模式相似节点分配到超边以捕获高阶交互,并引入全局-局部节点表示模块和迭代残差细化与时间查询注意力机制提升预测精度。

详情
AI中文摘要

作为智能交通系统的核心任务,交通预测在城市交通管理中起着关键作用。准确的交通预测依赖于对复杂时空依赖关系的建模,而由于交通系统中的空间异质性,这本身就具有挑战性。尽管取得了显著进展,大多数现有方法仍局限于成对空间依赖建模,难以捕获具有相似交通模式的节点之间的动态高阶交互。为了解决这个问题,我们提出了PHGNet,一种基于原型引导超图构建的新型时空预测框架。在PHGNet的核心,设计了一种原型学习机制,自适应地将模式相似的节点分配到超边,从而捕获具有时变结构的高阶交互。为了提高动态超图构建的可靠性,我们进一步开发了一个全局-局部节点表示模块来提取时间一致的特征。对于预测,引入了迭代残差细化和时间查询注意力机制,以提高预测精度并支持高效的并行解码。在多个真实世界数据集上的大量实验表明,与最先进的方法相比,PHGNet实现了优越的预测性能。

英文摘要

As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traffic forecasting relies on modeling complex spatiotemporal dependencies, which is inherently challenging due to spatial heterogeneity in traffic systems. Despite significant progress, most existing methods are still limited to pairwise spatial dependency modeling, making it difficult to capture dynamic high-order interactions among nodes with similar traffic patterns. To address this issue, we propose PHGNet, a novel spatiotemporal forecasting framework based on prototype-guided hypergraph construction. At the core of PHGNet, a prototype learning mechanism is designed to adaptively assign pattern-similar nodes to hyperedges, thereby capturing high-order interactions with time-varying structures. To improve the reliability of dynamic hypergraph construction, we further develop a global-local node representation module to extract time-consistent features. For forecasting, iterative residual refinement and Temporal Query Attention are introduced to improve forecasting accuracy while supporting efficient parallel decoding. Extensive experiments on multiple real-world datasets demonstrate that PHGNet achieves superior predictive performance compared with state-of-the-art methods.

2605.25549 2026-05-26 cs.CL cs.AI cs.LG 版本更新

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

BC协议:结构化双专家对话用于生成高质量思维链后训练数据

Bo Zou, Chao Xu

AI总结 针对大语言模型后训练中高质量专家思维链数据生产瓶颈,提出BC协议——一种结构化双专家引出方法,通过配对领域专家与知识工程师,系统外化专家隐性判断为自然语言推理链,实验证明其在推理过程自然性上具有压倒性优势。

详情
AI中文摘要

高质量的专家思维链(CoT)数据是大语言模型(LLM)后训练的核心瓶颈之一。现有数据生产方法各有结构性局限:众包标注缺乏深度推理路径;专家单独写作受限于“专家盲点”——专家会结构性跳过他们认为显而易见的推理步骤;RLHF仅产生偏好信号而非推理链。 本文提出BC协议——一种用于LLM后训练数据生产的结构化双专家引出方法。该方法精心配对领域专家(晶体智力)与知识工程师(流体智力),系统地将专家的隐性判断外化为自然语言推理链。我们引入了参与者资质模型,定义了影响引出质量的六个参与者特征维度。“校准的无知”是本文提出的原创概念。我们进一步提出“选择优于规定”作为方法论原则:对于隐性知识引出任务,将质量控制资源投入人员选择比投入同等资源于流程设计能获得更高回报。 在叙事小说领域的受控实验中,我们直接比较了BC协议双对话产生的CoT(A组,n=20)与同一领域专家独立撰写的CoT(B组,n=20)。三个跨供应商评判模型——GPT-4o、Claude Opus 4.5和Gemini 2.5 Pro——在五个维度上进行了盲评(共600个评分)。结果表明,BC协议在“推理过程自然性”上具有压倒性优势(A组均值4.80 vs. B组均值1.30,p=2.4×10^{-8},Cliff's δ=1.0)。

英文摘要

High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data production methods each have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the "expert blind spot", in which experts structurally skip reasoning steps they consider obvious; RLHF only produces preference signals rather than reasoning chains. This paper proposes the BC Protocol, a structured dual-expert elicitation method for LLM post-training data production. The method carefully pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence), systematically externalizing the expert's implicit judgments as natural language reasoning chains. We introduce the Participant Aptitude Model, which defines six participant characteristic dimensions that affect elicitation quality. "Calibrated Ignorance" is an original concept proposed in this paper. We further propose "Selection-over-Prescription" as a methodological principle: for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment in the narrative fiction domain, we directly compared CoT produced by BC Protocol dual dialogue (Group A, n=20) against CoT written independently by the same domain expert (Group B, n=20). Three cross-vendor judge models (GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro) conducted blind evaluation across five dimensions (600 ratings total). Results show that the BC Protocol achieves an overwhelming advantage in "naturalness of reasoning process" (Group A mean 4.80 vs. Group B mean 1.30, p=2.4×10^{-8}, Cliff's δ=1.0).
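The reported rating count and effect size can be unpacked with a short calculation. The sketch below uses the standard definition of Cliff's delta; the rating values are made-up placeholders, not the paper's data, and the variable names are illustrative.

```python
def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# 40 CoT samples (20 per group), 3 judge models, 5 rating dimensions.
total_ratings = (20 + 20) * 3 * 5

# Placeholder scores: every Group A rating exceeds every Group B rating.
group_a = [5, 5, 4, 5]
group_b = [1, 2, 1, 1]
delta = cliffs_delta(group_a, group_b)
```

A delta of 1.0 means complete separation: no rating in Group B reaches any rating in Group A, which is what the abstract reports for the naturalness dimension.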

2605.25548 2026-05-26 cs.LG cs.AI 版本更新

Simultaneous Spatial-Temporal Message Passing for Dynamic Graph Representation Learning

Shubhajit Roy, Anirban Dasgupta

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institute of Technology Gandhinagar(印度理工学院甘地纳加尔)

AI总结 提出SiST-GNN,通过在一个消息传递操作中融合空间和时间信号,实现动态图表示学习的联合推理,在链接预测任务上超越先前方法109%-277%。

详情
AI中文摘要

操作于快照序列的动态图神经网络(DGNN)通常分为两类:时间优先方法先构建每个节点的时间嵌入,然后进行空间聚合;而空间优先方法则颠倒这一顺序,将图卷积的输出馈送到下游时间模块。无论哪种情况,严格的顺序迫使第二阶段消耗第一阶段已压缩的摘要,排除了对拓扑和演化的联合推理;具体而言,消息传递算子永远无法根据邻居的过去轨迹来加权其贡献。本文介绍了SiST-GNN(Simultaneous Spatial-Temporal GNN),它在单个消息传递操作中融合两种信号,而不是将它们串联。具体地,在每个快照中,我们为每个节点维护一个循环隐藏状态来总结其历史,将其与节点当前特征向量配对,并将该配对视为由跨时间边连接的两个节点;在此时间增强图上运行标准图卷积,得到更新后的表示。我们的实证研究涵盖九个公开基线和十四个模型-数据集组合,覆盖固定分割和实时更新评估场景。在每个公开基准上,SiST-GNN在链接预测任务中相对于最强先前方法,在固定分割设置中提升109%-277%,在实时更新设置中提升68%-194%。我们还通过离散化底层连续时间事件流,构建了三个动态节点分类任务;在此,SiST-GNN以7%-22%的优势击败领先的离散时间(DTDG)基线,并与直接消费原始事件的连续时间(CTDG)方法相匹配。

英文摘要

Dynamic graph neural networks (DGNNs) that operate on snapshot sequences typically fall into one of two categories. Temporal-first approaches build per-node temporal embeddings and only afterwards perform spatial aggregation, whereas spatial-first approaches invert this order, feeding the output of a graph convolution into a downstream temporal module. In either case, the rigid sequencing forces the second stage to consume an already-compressed summary produced by the first, ruling out joint reasoning over topology and evolution; concretely, the message-passing operator never gets to weight a neighbor's contribution by that neighbor's past trajectory. This paper introduces SiST-GNN (Simultaneous Spatial-Temporal GNN), which fuses the two signals inside a single message-passing operation rather than chaining them. Concretely, at each snapshot we maintain a recurrent hidden state per node that summarises its history, pair it with the node's current feature vector, and treat the pair as two nodes joined by a cross-time edge; running a standard graph convolution on this temporally augmented graph yields the updated representation. Our empirical study spans nine public baselines and fourteen model-dataset combinations, covering both fixed-split and live-update evaluation regimes. Across every public benchmark, SiST-GNN sets a new state of the art in link prediction, improving over the strongest prior method by 109--277% in the fixed-split setting and by 68--194% in the live-update setting. We additionally construct three dynamic node-classification tasks by discretising the underlying continuous-time event streams; here SiST-GNN beats the leading discrete-time (DTDG) baseline by 7--22% and matches continuous-time (CTDG) methods that consume the raw events directly.
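
A minimal sketch of the temporally augmented graph described above, under several assumptions that are not from the paper: hidden states and features share one dimension, the node layout (feature nodes first, history nodes second) is arbitrary, and plain mean aggregation with self-loops stands in for the "standard graph convolution". All function names are hypothetical.

```python
def build_augmented_edges(num_nodes, spatial_edges):
    """Nodes 0..N-1 hold current features; nodes N..2N-1 hold hidden states."""
    edges = set()
    for u, v in spatial_edges:  # undirected spatial edges of the snapshot
        edges.add((u, v))
        edges.add((v, u))
    for i in range(num_nodes):  # cross-time edge: feature node <-> its own history node
        edges.add((i, num_nodes + i))
        edges.add((num_nodes + i, i))
    return edges

def mean_aggregate(values, edges):
    """One round of mean aggregation with self-loops (a stand-in for a graph convolution)."""
    neighbors = {i: [i] for i in range(len(values))}
    for src, dst in edges:
        neighbors[dst].append(src)
    dim = len(values[0])
    return [
        [sum(values[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for _, nbrs in sorted(neighbors.items())
    ]

features = [[1.0], [3.0]]   # current snapshot features (toy values)
hidden = [[5.0], [7.0]]     # per-node recurrent hidden states (toy values)
edges = build_augmented_edges(2, [(0, 1)])
updated = mean_aggregate(features + hidden, edges)[:2]  # keep the feature-node outputs
```

In this toy layout a node sees its own history after one round, and a neighbor's history only after a second round; how the actual model wires or weights these edges is not specified in the abstract.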

2605.25543 2026-05-26 cs.AI 版本更新

ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting

ADMFormer:一种用于交通预测的具有时变掩码空间注意力的自适应分解Transformer

Ruiwen Gu, Qitai Tan, Yahao Liu, Xiao-Ping Zhang

发表机构 * Shenzhen International Graduate School(深圳国际研究生院) Tsinghua University(清华大学) School of Computer Science and Engineering(计算机科学与工程学院) University of Electronic Science and Technology of China(电子科技大学) Shenzhen Ubiquitous Data Enabling Key Lab(深圳泛在数据赋能重点实验室)

AI总结 提出ADMFormer,通过自适应分解机制解耦交通序列中的稳定周期规律与事件驱动波动,并使用时变掩码空间注意力稀疏化动态空间依赖,实现交通预测的SOTA性能。

详情
AI中文摘要

准确的交通预测对于智能交通系统至关重要,支持广泛的现实应用。然而,由于两个关键因素,它仍然具有挑战性:(1)交通序列包含异质的时间模式,其中稳定的周期性规律与事件驱动的波动共存。现有方法通常将它们统一表示,限制了捕捉细粒度时间动态的能力。(2)节点间的空间依赖本质上是动态且稀疏的,而密集的全对注意力常常引入冗余交互并放大噪声。为了解决这些问题,我们提出了ADMFormer,一种具有时变掩码空间注意力的自适应分解Transformer。具体来说,ADMFormer首先采用时间-节点自适应门控机制将交通信号解耦为随时间与节点变化的主导规律和残余波动。然后设计了一个双分支时间模块,分别从这两个分解成分中捕捉全局周期依赖和高频不规则变化。此外,ADMFormer引入了时变掩码空间注意力,基于实时交通状态稀疏化空间交互,从而有效保留动态且信息丰富的依赖。在四个真实世界数据集上的大量实验表明,ADMFormer实现了最先进的性能。

英文摘要

Accurate traffic forecasting is essential for intelligent transportation systems, supporting a wide range of real-world applications. However, it remains challenging due to two key factors: (1) Traffic series contain heterogeneous temporal patterns, where stable periodic regularities coexist with event-driven fluctuations. Existing methods often treat them within a unified representation, limiting their ability to capture fine-grained temporal dynamics. (2) Spatial dependencies among nodes are inherently dynamic and sparse, while dense all-pairs attention often introduces redundant interactions and amplifies noise. To address these issues, we propose ADMFormer, an Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention. Specifically, ADMFormer first employs a time-node adaptive gating mechanism to decouple traffic signals into dominant regularities and residual fluctuations that vary across time and nodes. A dual-branch temporal module is then designed to separately capture global periodic dependencies and high-frequency irregular variations from these two decomposed components. Furthermore, ADMFormer introduces a time-varying masked spatial attention that sparsifies spatial interactions based on real-time traffic states, thereby effectively preserving dynamic and informative dependencies. Extensive experiments on four real-world datasets demonstrate that ADMFormer achieves state-of-the-art performance.
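
One way to picture masked spatial attention is a top-k mask applied before the softmax, so each node attends to only a few others. This is a sketch under that assumption; the abstract does not say how ADMFormer builds its mask, and the function name and scores are hypothetical.

```python
import math

def masked_attention_row(scores, keep):
    """Softmax over the `keep` highest-scoring entries; all other entries get weight 0."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    top = set(ranked[:keep])
    peak = max(scores[j] for j in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[j] - peak) if j in top else 0.0 for j in range(len(scores))]
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores from one node to four nodes at a single time step.
weights = masked_attention_row([2.0, 0.5, 2.0, -1.0], keep=2)
print(weights)  # [0.5, 0.0, 0.5, 0.0]
```

Recomputing the scores at each time step from current traffic states would make the mask time-varying, which is the property the abstract emphasizes.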

2605.25541 2026-05-26 cs.CG cs.AI cs.HC cs.LG 版本更新

TopoAlign: Topology-Aware Visual Representation Alignment

TopoAlign:拓扑感知的视觉表示对齐

Xinyuan Yan, Rita Sevastjanova, Mennatallah El-Assady, Bei Wang

发表机构 * University of Utah(犹他大学) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出TopoAlign框架,利用拓扑数据分析中的mapper图,通过联合力导向优化、自动结构匹配区域检测和基序查询,从拓扑角度比较不同模型或层的表示结构对齐。

详情
AI中文摘要

神经网络将输入编码为高维向量(称为表示),通过编码任务相关的结构和语义来捕捉模型如何处理数据。表示对齐指不同模型、层或训练条件对相同输入产生相似表示的程度,对模型解释、选择和鲁棒性分析有重要意义。现有的对齐度量方法主要依赖于几何属性(如邻域和聚类相似性),对表示的全局组织提供的洞察有限。在这项工作中,我们提出了TopoAlign,一个从结构角度视觉比较模型表示的拓扑感知框架。利用拓扑数据分析中的mapper图,TopoAlign联合分析来自不同模型或层的共享输入构建的图。该框架支持自上而下的比较工作流:首先通过联合力导向优化进行全局结构对齐,生成协调的图布局;然后通过自动检测结构匹配区域(用Bubble Sets可视化)识别局部对应关系;最后通过基于基序的查询和膜启发式可视化实现细粒度模式检查。我们通过语言和多模态模型的案例研究以及专家反馈展示了TopoAlign。结果表明,TopoAlign从拓扑角度为表示结构和对齐提供了有意义的洞察。

英文摘要

Neural networks encode inputs as high-dimensional vectors, known as representations, that capture how models process data by encoding task-relevant structure and semantics. Representation alignment refers to the degree to which different models, layers, or training conditions produce similar representations for the same inputs, with important implications for model interpretation, selection, and robustness analysis. Existing approaches to measure alignment primarily rely on geometric properties, such as neighborhood and cluster similarity, offering limited insight into the global organization of representations. In this work, we present TopoAlign, a topology-aware framework for visually comparing model representations from a structural perspective. Leveraging mapper graphs from topological data analysis, TopoAlign jointly analyzes graphs constructed from representations of shared inputs across different models or layers. The framework supports a top-down comparative workflow: it first performs global structure alignment via joint force-directed optimization to produce coordinated graph layouts; it then identifies local correspondences through automated detection of structurally matching regions, visualized with Bubble Sets; and finally it enables fine-grained pattern inspection through motif-based queries and membrane-inspired visualizations. We demonstrate TopoAlign through case studies on language and multimodal models, complemented by expert feedback. Our results show that TopoAlign provides meaningful insights into representation structure and alignment from a topological perspective.

2605.25536 2026-05-26 cs.SE cs.AI 版本更新

A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions

基于大语言模型的代码生成任务的三级综述:趋势、挑战与未来方向

Muslim Chochlov, Michael English, Jim Buckley

发表机构 * University of Limerick(利默里克大学)

AI总结 本三级综述综合了30篇二级研究(2017-2025年),分析了基于大语言模型的代码生成任务在出版趋势、效果、场景、集成挑战和未来方向上的证据,发现基准测试准确率高但泛化性弱,鲁棒性脆弱,效率问题普遍,毒性和偏见报告不足,主要挑战涉及经济可行性、评估有效性和社会技术集成。

详情
AI中文摘要

上下文。大语言模型(LLMs)越来越多地被应用于软件工程中的代码生成任务(CGTs)。尽管报告的结果令人鼓舞,但这种应用的更广泛影响及其与真实世界开发的集成仍未被充分理解,现有的三级研究在这方面提供的很少。目标。本三级研究整合了关于基于LLM的CGTs的二级证据,综合了出版格局、效果、场景、集成挑战和未来研究方向。方法。遵循系统综述指南,我们在相关数字图书馆中进行了检索,并辅以前向和后向滚雪球及筛选步骤。评估了研究质量,并通过评估者间一致性统计对提取可靠性进行了审计。使用SWEBOK知识领域和HELM框架综合了证据。结果。我们识别出30篇发表于2017-2025年间的二级研究,自2023年以来快速增长。在基准测试上准确性似乎很强,但在真实世界泛化方面支持较弱;鲁棒性在不同任务和配置下脆弱;效率约束普遍存在;毒性和偏见报告不足。主要挑战涉及经济可行性、评估有效性和社会技术集成。未来方向建议领域感知的模型改进以及全面、标准化评估的需求。结论。基于LLM的CGTs代表了一个快速成熟但评估不均的研究领域,突出了对领域感知模型改进和全面、标准化评估的需求,以及解决效率和相关成本问题。

英文摘要

Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported results are promising, the broader effects of such application and their integration into real-world development remain insufficiently understood, and existing tertiary studies provide little in this area. Objective. This tertiary study consolidates secondary evidence on LLM-based CGTs, synthesizing the publication landscape, effects, scenarios, integration challenges, and future research directions. Method. Following systematic review guidelines, we searched relevant digital libraries, complemented by backward and forward snowballing and a screening step. Study quality was assessed and extraction reliability was audited with inter-rater agreement statistics. Evidence was synthesized using SWEBOK knowledge areas and the HELM framework. Results. We identify 30 secondary studies published between 2017 and 2025, with rapid growth since 2023. Accuracy seems strong on benchmarks but weakly supported for real-world generalization; robustness is fragile across tasks and configurations; efficiency constraints are pervasive; toxicity and bias are under-reported. Dominant challenges concern economic feasibility, evaluation validity, and socio-technical integration. Future directions suggest domain-aware model improvement and the need for holistic, standardized evaluation. Conclusion. LLM-based CGTs represent a fast-maturing yet unevenly evaluated research area, highlighting the need for domain-aware model improvements and holistic, standardized evaluation, as well as attention to efficiency and associated costs.

2605.25535 2026-05-26 cs.AI 版本更新

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

个性化再存储:面向长时程智能体的个性化记忆基准测试与学习

Yeonjun In, Wonjoong Kim, Sangwu Park, Kanghoon Yoon, Chanyoung Park

发表机构 * KAIST(韩国科学技术院)

AI总结 针对现有基于大语言模型的记忆系统采用通用静态策略忽略用户间存储上下文差异的问题,提出首个个性化记忆基准PerMemBench和会话级存储门控框架,验证个性化能显著提升记忆保留但精确门控仍是关键挑战。

Comments preprint

详情
AI中文摘要

现有的基于大语言模型(LLM)的记忆系统采用通用、静态的策略,忽略了一个基本现实:不同用户值得存储在记忆中的上下文是不同的。这种错位将有限的记忆预算浪费在短暂交互上,同时未能为长时程任务保留关键上下文。为解决这一差距,我们研究了一个未被充分探索的问题:基于LLM的记忆系统能否学习个性化的记忆策略?我们引入了PerMemBench,这是首个用于评估个性化记忆系统的基准,具有跨多年、多领域、多样化用户角色的交互历史。我们进一步提出了记忆个性化的首个实证研究,提出了会话级存储门控,这是一个轻量级框架,可选择性地绕过短暂会话的记忆操作。我们的研究证实,在完美门控下,个性化能带来显著的保留增益,但同时也揭示出精确门控仍然是一个开放且关键的挑战。

英文摘要

Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts worth storing in memory differ across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.
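
The idea of session-level storage gating can be sketched as a per-user predicate that decides whether a whole session reaches the memory store. The keyword-overlap gate below is purely a hypothetical placeholder (the abstract notes that accurate gating is still an open problem), and the function names are not from the paper.

```python
def should_store(session_text, persona_keywords, threshold=1):
    """Hypothetical gate: store a session only if it touches the user's long-term interests."""
    words = set(session_text.lower().split())
    return len(words & persona_keywords) >= threshold

def process_sessions(sessions, persona_keywords):
    """Bypass memory operations entirely for sessions the gate marks as transient."""
    memory = []
    for text in sessions:
        if should_store(text, persona_keywords):
            memory.append(text)  # stand-in for real extract/summarize/write operations
    return memory

sessions = [
    "what is the weather today",
    "my marathon training plan for october",
]
stored = process_sessions(sessions, persona_keywords={"marathon", "training"})
```

With a different persona keyword set, the same two sessions could be gated the other way, which is the personalization the benchmark is meant to measure.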

2605.25534 2026-05-26 cs.AI 版本更新

StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

StructBreak: 多模态大语言模型中结构性认知过载引发的安全故障

Yang Luo, Xinran Liu, Tiantian Ji, Zhiyi Yin, Lingyun Peng, Shuyu Li

发表机构 * Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications(可信分布式计算与服务重点实验室(MoE),北京邮电大学) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所)

AI总结 提出StructBreak框架,通过量化结构性认知过载(SCO)揭示一种高阶认知过载攻击范式,在六种主流MLLM上实现92%平均攻击成功率,并证明该攻击通过结构性通道绕过安全过滤器。

Comments 23 pages; accepted to Findings of ACL 2026. This paper contains examples of harmful content

详情
AI中文摘要

多模态大语言模型(MLLM)在结构推理方面表现出色,但在结构一致性方面存在明显的逻辑脆弱性。我们将这种现象称为结构性认知过载(SCO),它是深度推理与安全对齐之间竞争产生的副产品。然而,先前的工作主要针对排版和像素级扰动,对SCO的研究尚不充分。为此,我们提出了StructBreak,一个自动化的端到端框架,旨在量化SCO。通过利用StructBreak,我们发现了一种新颖的高阶认知过载攻击范式;值得注意的是,这种攻击在实用的黑盒设置下运行,无需内部模型访问。因此,我们利用该框架建立了一个涵盖十种不同威胁场景的综合基准。对六种领先MLLM的实证评估表明,SCO容易触发有毒内容生成,平均攻击成功率(ASR)达到92%(在Gemini 2.5上高达97%)。为了阐明SCO的机制,我们进一步进行了模型级解释,涵盖注意力动态、潜在空间拓扑和几何分析。我们的发现表明,StructBreak作为一种新颖的结构性通道来绕过安全过滤器。此外,固有安全机制的有效性有限,凸显了当前的对齐范式不足以应对复杂多模态推理的时代。

英文摘要

Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.

2605.25518 2026-05-26 cs.CV cs.AI 版本更新

Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

受放射科医生启发的乳腺超声诊断的跨阶段注意力多专家网络

Xinyang Zhai, Chong Yang, Ruizhi Zhang

发表机构 * International Agency for Research on Cancer (IARC)(国际癌症研究机构) World Health Organization(世界卫生组织)

AI总结 提出跨阶段注意力混合专家网络(CSA-MoE-Net),通过跨阶段注意力模块增强多级特征、三分支MoE块从全肿瘤图像、肿瘤核心和边界学习互补特征,并在平衡数据集上实现96.33%准确率,显著优于基线ResNet-18。

详情
AI中文摘要

乳腺超声成像是一种重要的早期乳腺癌诊断无创方法,但由于肿瘤异质性、边界模糊和数据不平衡,自动良恶性分类仍具挑战。为了提高特征表示和分类准确性,本文提出了跨阶段注意力混合专家网络(CSA-MoE-Net)。它采用跨阶段注意力增强的ResNet-18作为骨干网络,其中跨阶段注意力模块自适应地重新校准多级特征,从而增强关键肿瘤特征并抑制冗余。一个三分支混合专家(MoE)块从全肿瘤图像、肿瘤核心和边界学习互补特征,自适应门控网络融合这些特征以捕获形态、纹理和上下文信息。融合后的特征在架构中称为融合专家特征(FEF)。在包含2,129张乳腺超声图像的平衡数据集上的实验表明,在20次独立运行的平均值下,该模型实现了96.33%的准确率、94.09%的精确率、98.53%的召回率、96.25%的F1分数和99.50%的AUC。与基线ResNet-18相比,这些指标分别提高了3.01、0.70、5.37、2.98和5.42个百分点。所提出的机制无需侵入性修改,可无缝嵌入VGG-16、DenseNet-121等网络,带来稳定的性能提升,从而为计算机辅助诊断提供可靠支持。

英文摘要

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into VGG-16, DenseNet-121, etc., yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.

2605.25517 2026-05-26 cs.AI 版本更新

What Gets Cited: Competitive GEO in AI Answer Engines

什么被引用:AI 问答引擎中的竞争性生成式引擎优化

Rahul Vishwakarma, Shushant Kumar, Ratnesh Jamidar

发表机构 * Sprinklr

AI总结 研究 AI 问答引擎中两个检索候选源竞争时,哪些因素决定哪个源被优先引用,通过控制实验发现主题相关性和列表位置是主要驱动因素。

详情
AI中文摘要

AI 问答引擎从检索到的页面生成答案,但只引用少数来源。这使得可见性不仅取决于排名,还取决于被引用。我们研究竞争性生成式引擎优化(GEO):当两个检索到的候选源竞争时,什么因素使得其中一个更可能被首先引用?我们构建了一个受控的两文档检索增强生成(RAG)测试平台,将恰好两个候选源注入模型上下文,并测量输出中第一个引用标记引用了哪个源。在六个 LLM 上,我们执行了 252,000 次试验,在 18 个内容因素的一个析因程序下进行重复配对比较。在每次试验中,两个源恰好在一个因素上不同;我们使用品牌匿名化和平衡源顺序来将内容效应与位置偏差分离。混合效应模型显示,主题相关性和列表位置是被首先引用的最大驱动因素。包含明确的价格信息和最近的时间戳也持续有帮助。完整性和信任线索带来较小的增益,而仅格式编辑几乎没有影响。我们发布了一个可重复的评估协议和一个优先化的 GEO 检查清单供从业者使用,并在 Sprinklr 的早期内部试点中进行了实践,团队报告了对工作流可用性的积极定性反馈。

英文摘要

AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but on being cited. We study competitive Generative Engine Optimization (GEO): when two retrieved candidates compete, what makes one more likely to be cited first? We build a controlled two-document retrieval-augmented generation (RAG) testbed that injects exactly two candidate sources into the model context and measures which source is referenced by the first citation marker in the output. Across six LLMs we execute 252,000 trials, repeated paired comparisons under one factorial program over 18 content factors. In each trial the two sources differ in exactly one factor; we use brand anonymization and counterbalanced source order to separate content effects from position bias. Mixed-effects models show that topical relevance and list position are the biggest drivers of being cited first. Including explicit price information and a recent timestamp also helps consistently. Completeness and trust cues add smaller gains, while formatting-only edits have little impact. We release a reproducible evaluation protocol and a prioritized GEO checklist for practitioners, and we exercised it in an early internal pilot at Sprinklr, where teams reported positive qualitative feedback on workflow usability.
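
The counterbalanced paired design described above can be sketched as follows. Factor names, query labels, and function names are hypothetical, and the "model" is replaced by a list of first-cited positions; the point is only that showing each pair in both orders lets position bias be separated from the content effect.

```python
from itertools import product

def build_trials(factors, queries):
    """Each (factor, query) pair appears twice: treated source listed first, then second."""
    trials = []
    for factor, query, treated_first in product(factors, queries, (True, False)):
        order = ("treated", "control") if treated_first else ("control", "treated")
        trials.append({"factor": factor, "query": query, "order": order})
    return trials

def treated_first_cite_rate(trials, first_cited_positions):
    """first_cited_positions[i] is 0 or 1: which listed source trial i cited first."""
    wins = sum(
        1 for trial, pos in zip(trials, first_cited_positions)
        if trial["order"][pos] == "treated"
    )
    return wins / len(trials)

trials = build_trials(["price", "timestamp"], ["q1"])
# A purely position-biased engine always cites the source listed first.
rate = treated_first_cite_rate(trials, [0] * len(trials))
print(rate)  # 0.5: no content effect once order is counterbalanced
```

A rate above 0.5 under this design would indicate that the manipulated factor itself, and not list position, makes a source more likely to be cited first.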

2605.25505 2026-05-26 cs.CY cs.AI econ.GN physics.soc-ph q-fin.EC 版本更新

Generative AI impacts on intra-urban inequality and skill premium in Beijing

生成式人工智能对北京城市内部不平等和技能溢价的影响

Xiliu He, Haoxiang Zhao, Mingyi Ma, Edward Wen Chuan Lai, Koei Enomoto, Anni Hu, Jiatong Li, Lingyun Chu, Yuan Lai

发表机构 * School of Architecture, Tsinghua University(清华大学建筑学院) ZODA LAB(ZODA实验室) Technology Innovation Center for Smart Human Settlements and Spatial Planning & Governance, Ministry of Natural Resources, Tsinghua University(智能人居环境与空间规划及治理技术创新中心,自然资源部,清华大学)

AI总结 利用北京2018-2024年500万条招聘数据,通过五个大语言模型评估任务级暴露度,构建社区级生成式人工智能暴露指数,发现生成式人工智能暴露集中在核心区,导致高暴露社区工资停滞和“高技能陷阱”,挑战了技能偏向技术变革理论。

Comments 21 pages, 8 figures

详情
AI中文摘要

生成式人工智能(GenAI)是首次大规模触及高认知任务的自动化浪潮,但其对城市内部不平等的影响仍基本未知。利用北京2018-2024年500万条招聘数据,我们通过汇总五个领先大语言模型的任务级评估,构建了社区级GenAI暴露指数。我们考察了这一冲击的空间、结构和因果机制。我们发现,GenAI暴露高度集中在城市核心区,加深了城市内部的人工智能鸿沟。自2023年以来,高暴露社区尽管继续吸引高技能工人,却经历了工资停滞——一种“高技能陷阱”。这种工资惩罚是由任务去技能化和劳动力市场拥挤加剧驱动的。以ChatGPT发布为中心的倍差法设计支持因果解释。这些发现挑战了流行的技能偏向技术变革理论,并为全球科技中心的包容性人工智能治理提供了基础。

英文摘要

Generative artificial intelligence (GenAI) is the first automation wave to reach high-cognitive tasks at scale, yet its effects on intra-urban inequality remain largely unknown. Using 5 million job postings from Beijing (2018--2024), we construct a neighborhood-level GenAI Exposure Index by aggregating task-level assessments from five leading large language models. We examine the spatial, structural and causal mechanisms of this shock. We find that GenAI exposure is highly concentrated in the city's core districts, deepening the intra-urban AI divide. Since 2023, high-exposure neighborhoods have experienced wage stagnation even as they continue to attract high-skilled workers -- a "high-skill trap." This wage penalty is driven by task de-skilling and intensified labor-market crowding. A difference-in-differences design centered on ChatGPT's release supports a causal interpretation. These findings challenge the prevailing theory of skill-biased technological change and provide a basis for inclusive AI governance in global technology hubs.
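
The difference-in-differences logic mentioned above can be shown with a two-group, two-period calculation. The wage numbers are hypothetical and in arbitrary units; the paper's actual specification (controls, fixed effects, event windows) is not given in the abstract.

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Change in the treated group minus the change in the control group."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical mean posted wages before/after the ChatGPT release.
effect = diff_in_diff(treat_pre=100, treat_post=102, ctrl_pre=80, ctrl_post=90)
print(effect)  # -8: high-exposure wages grew 8 units less than low-exposure wages
```

The estimate is causal only under the parallel-trends assumption, that both groups would have followed the same wage trend without the shock.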

2605.25502 2026-05-26 cs.CL cs.AI 版本更新

A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

面向教育方面情感分析的可控合成基准

Yehudit Aperstein, Alexander Apartsin

发表机构 * Intelligent Systems, Afeka Academic College of Engineering(阿法卡学术工程智能系统学院) School of Computer Science, Faculty of Sciences, Holon Institute of Technology(霍洛技术学院计算机科学学院)

AI总结 为解决教育领域标注数据稀缺问题,提出一个包含10,000条合成课程评论和20个教学方面的可控合成基准,并通过实验验证了任务难度及合成到真实的迁移能力。

Comments 39 pages, 14 figures

详情
AI中文摘要

教育方面情感分析(ABSA)可以支持课程改进,但带有方面标签的学生反馈仍然稀缺,因为教育评论是私有的、特定于机构的且标注成本高昂。本研究引入了一个面向教育ABSA的可控合成基准,该基准由10,000条合成课程评论构建,具有明确的训练-验证-测试划分,以及一个涵盖教学质量、评估与课程管理、学习需求、学习环境和参与度的20方面教学模式。该语料库通过采样的目标标签、采样的细微属性以及经过三轮评审-编辑流程优化的真实感提示生成。在该基准上,使用TF-IDF、两阶段变换器和联合编码器的局部基线表明该任务并非易事;最强的未调优模型BERT在留出集上的检测微F1得分为0.2760,而一个适度的低学习率BERT调度将其提升至0.2930。基于gpt-5.2的全测试GPT推理在零样本模式下达到0.2519微F1,在使用基于检索的少样本提示时达到0.2501,使批量推理高于经典基线并接近紧凑的联合编码器。在来自Herath等人的2,829条映射学生反馈评论上进行的保守外部评估中,BERT在9个方面重叠上的微F1得分为0.4593,表明部分合成到真实的迁移。真实性和忠实度分析作为生成器诊断报告,阐明了基准如何稳定以及标签噪声仍然存在的位置。因此,本研究贡献了一个合成教育ABSA语料库、一个文档化的生成过程以及一个可复现的基准设置,适用于公共标注数据仍然难以获得的领域。

英文摘要

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.
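
For readers unfamiliar with the detection metric quoted above, micro-F1 for multi-label aspect detection pools true positives, false positives, and false negatives over all reviews before computing precision and recall. The aspect names and predictions below are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-review sets of detected aspects."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"workload", "grading"}, {"instructor"}]
pred = [{"workload"}, {"instructor", "pace"}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1, so precision = recall = 2/3
```

With a 20-aspect schema, rare aspects contribute little to this pooled score, which is one reason micro-F1 and macro-F1 can diverge.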

2605.25489 2026-05-26 cs.AI cs.HC 版本更新

ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

ATWL:一种用于表示、比较和重用可视化分析工作流的形式语言

Natalia Andrienko, Gennady Andrienko, Jürgen Bernard, Michael Sedlmair

发表机构 * Fraunhofer Institute IAIS(弗劳恩霍夫研究所IAIS) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所) City St George’s, University of London(伦敦大学城市圣乔治学院) University of Zurich(苏黎世大学) University of Stuttgart(斯图加特大学)

AI总结 提出ATWL语言,通过模块化本体和标准化意图形式化表示可视化分析工作流,结合LLM提取工作流,实现结构比较和重用。

详情
AI中文摘要

可视化分析(VA)工作流本质上是复杂的,涉及数据转换、特征工程、视觉表示和人类解释。它们通常以非结构化的散文形式描述,阻碍了系统比较、成熟策略的重用以及新手的培训。我们提出了工件-转换工作流语言(ATWL),这是一种领域无关的声明式语言,通过捕获工作流的结构和潜在分析意图来形式化表示VA工作流。ATWL构建于一个由八种工件类型(实体、特征、排列、可视化、模式、模型、知识、规范)和以标准化意图(例如,定义单元、表征、情境化、抽象)为特征的转换组成的模块化本体之上。为了证明形式化工作不必阻碍采用,我们通过监督式LLM代理交互从研究论文中提取工作流,将人类角色简化为审查和细化。利用这一过程,我们从已发表的VA论文中构建了一个包含17个ATWL工作流的库。跨工作流分析揭示了结构规律性——一个反复出现的元结构、重复出现的主题、可重用的构建块、多样的迭代策略以及跨领域等价性——这些在散文中是不可见的。我们进一步通过一个受控实验评估了实际效用,在该实验中,同一个LLM处理了两个分析问题,提供的库要么是原始论文,要么是ATWL表示。两种形式都提供了有用的建议,但形式化表示系统地添加了显式迭代结构、类型化数据流、片段级适应来源以及支持扩展的紧凑性,超出了散文库在LLM上下文中的容量。ATWL使得从叙事描述向形式化表示、可比较和可重用的分析知识过渡成为可能。

英文摘要

Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and human interpretation. They are typically described in unstructured prose, hindering systematic comparison, reuse of proven strategies, and training of novices. We present Artifact-Transform Workflow Language (ATWL), a domain-agnostic, declarative language that formally represents VA workflows by capturing their structure and underlying analytical intent. ATWL is built upon a modular ontology of eight artifact types (entities, features, arrangements, visualisations, patterns, models, knowledge, specifications) and transforms characterised by standardised intents (e.g., define-unit, characterise, contextualise, abstract). To show that formalisation effort need not impede adoption, we extract workflows from research papers through supervised interaction with LLM agents, reducing the human role to review and refinement. Using this process, we constructed a library of seventeen ATWL workflows from published VA papers. Cross-workflow analysis reveals structural regularities -- a recurrent meta-structure, recurring motifs, reusable building blocks, diverse iterative strategies, and cross-domain equivalences -- that remain invisible in prose. We further evaluate practical utility through a controlled experiment in which the same LLM addressed two analytical problems with the library supplied either as original papers or as ATWL representations. Both forms enabled useful recommendations, but the formal representation systematically added explicit iteration structure, typed data flow, fragment-level adaptation provenance, and compactness supporting scaling beyond what prose libraries can fit in an LLM's context. ATWL enables a transition from narrative descriptions to formally represented, comparable, and reusable analytical knowledge.

2605.25488 2026-05-26 cs.CV cs.AI cs.MM 版本更新

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

测试时自适应条件用于稳定音频驱动说话头生成

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)(新南威尔士大学商学院) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与环境学院)

AI总结 提出一种无需参数训练的测试时自适应条件框架(TT-SAC),通过反馈循环调整条件表示,提升预训练说话头生成器的身份保持、时间一致性和感知质量。

Comments Research report

详情
AI中文摘要

音频驱动的说话头生成在AniTalker、FLOAT和Sonic等最新模型中取得了显著进展。尽管取得了成功,大多数现有方法在推理阶段依赖单一静态参考图像来调节整个视频生成过程。这种静态条件范式通常导致固定身份特征与动态面部运动之间的不匹配,从而引起身份漂移、时间不一致性和感知质量下降。我们引入了测试时自适应条件(TT-SAC),这是一个无需参数的推理框架,使预训练的说话头生成器能够在推理过程中调整其条件表示,而无需重新训练、梯度更新或额外监督。TT-SAC不是将参考肖像视为不可变的,而是将生成器与其编码器组合成一个反馈循环:生成器自身的输出被重新编码,以构建一个更符合合成序列时间动态的精细条件表示。单次自适应步骤近似于生成过程的自洽平衡,稳定了跨时间的身份和运动。我们进一步提供了理论分析,表明在温和的Lipschitz假设下,测试时条件自适应减少了特征方差并提高了生成稳定性,同时表现出原则性的偏差-方差权衡,该权衡决定了自适应最优强度。在最新说话头生成器和基准数据集上的大量实验表明,在唇形同步准确性、时间一致性、身份保持和感知保真度方面均有持续改进。TT-SAC提供了一种模型无关且无需训练的策略来增强生成视频模型,将测试时条件自适应确立为稳定音频驱动肖像动画的有效机制。

英文摘要

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference time. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.
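
The feedback loop described above can be sketched with toy stand-ins for the encoder and generator. Everything concrete here is an assumption: the averaging of re-encoded frames, the linear blend controlled by `strength`, and the function names are illustrative choices, not the paper's stated update rule.

```python
def adapt_conditioning(reference, audio, encode, generate, strength=0.5):
    """One adaptation step: generate, re-encode the outputs, blend, generate again."""
    c0 = encode(reference)                      # static conditioning from the portrait
    draft_frames = generate(c0, audio)          # first pass with static conditioning
    encoded = [encode(frame) for frame in draft_frames]
    mean_code = [sum(vals) / len(encoded) for vals in zip(*encoded)]
    # Assumed blend: strength=0 keeps the static code, strength=1 uses only the re-encoding.
    c1 = [(1 - strength) * a + strength * b for a, b in zip(c0, mean_code)]
    return generate(c1, audio), c1

# Toy stand-ins: identity encoder, and a generator that shifts the code by each audio value.
toy_encode = lambda x: list(x)
toy_generate = lambda code, audio: [[c + a for c in code] for a in audio]

frames, refined = adapt_conditioning([1.0, 2.0], [0.0, 2.0], toy_encode, toy_generate)
print(refined)  # [1.5, 2.5]
```

No gradients or parameter updates are involved, which matches the training-free framing; the `strength` knob is where a bias-variance tradeoff of the kind the abstract mentions would show up.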

2605.25477 2026-05-26 cs.RO cs.AI 版本更新

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

EXPO-FT:面向视觉-语言-动作模型的样本高效强化学习微调

Perry Dong, Kuo-Han Hung, Tian Gao, Dorsa Sadigh, Chelsea Finn

发表机构 * Stanford University(斯坦福大学)

AI总结 提出EXPO-FT系统,通过样本高效的强化学习微调预训练的VLA策略,在多种高精度操作任务中实现完美性能(30/30成功率),平均仅需19.1分钟在线机器人数据。

详情
AI中文摘要

高效且可靠地学习新任务的能力一直是机器人学的基础挑战。视觉-语言-动作(VLA)模型在多种操作任务中展现出强大的泛化能力,但预训练策略始终无法达到实际部署所需的可靠性。强化学习(RL)微调为弥合这一差距提供了有前景的路径,但现有方法要么从头开始训练而未充分利用预训练先验,要么微调VLA而未达到实际部署所需的样本效率和成功率。我们提出了EXPO-FT,一个用于对预训练VLA策略进行稳定、样本高效的RL微调的系统,填补了这一空白。我们的系统解决了一系列具有挑战性的操作任务,包括串灯并插入插头点亮、将台球击入袋中、将花插入酒瓶,每个任务都需要高精度、动态动作以及对不同初始状态的鲁棒性。我们的系统在所有评估任务中均实现了完美的任务性能(30/30成功),平均仅需19.1分钟的在线机器人数据,优于先前的从头RL训练和VLA微调方法。我们发布了一个开源代码库,旨在促进机器人领域中VLA模型RL微调的更广泛采用。

英文摘要

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) finetuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or finetune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light them up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

2605.25475 2026-05-26 cs.CL cs.AI updated version

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

IndexMem: 基于潜在记忆的学习型KV缓存驱逐策略用于长上下文LLM推理

Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo

Affiliations: The Hong Kong University of Science and Technology; Zhejiang University

AI summary: Proposes a learnable indexer that predicts KV importance, combined with a lightweight latent memory module that compresses evicted tokens, to enable accurate long-context inference under a bounded KV budget.

Details
AI-generated Chinese abstract
大型语言模型(LLM)越来越需要处理长上下文,但标准softmax注意力机制的KV缓存随序列长度线性增长,迅速成为长上下文推理的瓶颈。一种实用的补救措施是驱逐不太重要的KV条目;然而,现有的驱逐策略大多是启发式的,难以捕捉令牌重要性的丰富、输入相关的分布。在这项工作中,我们引入了一个可学习的索引器来预测KV重要性,从而能够更准确地保留关键令牌。同时,简单地驱逐令牌会永久丢弃其信息,导致不可逆的遗忘和长距离检索性能下降。为了解决这个问题,我们提出了一个轻量级的潜在记忆模块,将驱逐的令牌压缩成紧凑的、在线更新的状态,并提供残差读出以补偿通过KV驱逐丢失的注意力贡献。总的来说,我们的方法能够在有限的KV预算下实现准确的长上下文推理,在RULER(4K/16K)上对Qwen、Mistral和Llama模型(在激进驱逐下提升高达25分)带来一致的改进,在Needle-in-a-Haystack检索中显著更稳定,并且在LongBench得分和压缩曲线上优于现有的驱逐策略。

English abstract

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
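The two ideas in this abstract, score-based eviction under a fixed budget and a compact state that absorbs what was evicted, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the importance scores are passed in by hand rather than predicted by a learned indexer, the "latent memory" is just a running mean of evicted value vectors, and the residual readout described in the abstract is not modeled. The class and field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedKVCache:
    """Toy KV cache: keeps the `budget` highest-scoring entries and folds
    evicted values into one running-mean vector (a stand-in for latent memory)."""
    budget: int
    entries: list = field(default_factory=list)  # (score, token_id, value)
    memory: list = field(default_factory=list)   # running mean of evicted values
    evicted: int = 0

    def add(self, token_id, value, score):
        # `score` stands in for the output of a learned indexer (assumption).
        self.entries.append((score, token_id, value))
        if len(self.entries) > self.budget:
            # Evict the entry with the lowest importance score.
            victim = min(self.entries, key=lambda e: e[0])
            self.entries.remove(victim)
            self._absorb(victim[2])

    def _absorb(self, value):
        # Online mean update: m <- m + (v - m) / n
        self.evicted += 1
        if not self.memory:
            self.memory = [0.0] * len(value)
        self.memory = [m + (v - m) / self.evicted
                       for m, v in zip(self.memory, value)]


cache = BoundedKVCache(budget=2)
cache.add(0, [1.0, 0.0], score=0.9)
cache.add(1, [0.0, 2.0], score=0.1)
cache.add(2, [4.0, 4.0], score=0.5)  # cache is over budget: token 1 is evicted
cache.add(3, [2.0, 0.0], score=0.2)  # token 3 has the lowest score and is evicted
kept = sorted(token_id for _, token_id, _ in cache.entries)
```

The cache never holds more than `budget` entries, while `cache.memory` keeps a fixed-size summary of everything that was dropped, which is the property the abstract describes as avoiding irreversible forgetting.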

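2605.25459 2026-05-26 cs.LG cs.AI updated version

From Simulation to Enaction: Post-trained language models recognize and react to their own generations

从模拟到行动:后训练语言模型识别并回应自身生成

Asvin G., Jack Lindsey

Affiliations: Institute for Advanced Study, Princeton; Anthropic

AI summary: Finds that post-trained language models recognize their own (on-policy) generations and lower their output entropy, an effect modulated by an internal representation of input surprise, and that explicit recognition relies on a different mechanism than implicit recognition.

Comments: Anthropic fellows project mentored by Jack Lindsey

Details
AI-generated Chinese abstract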

语言模型被预训练为被动预测器,没有动机去建模自身输出的后果。后训练改变了这一点:产生自身响应的模型可以从识别自身处于on-policy状态中获益。我们提供证据表明,后训练模型识别其on-policy生成,并且这种识别隐式编码在其输出分布中。特别是,在不同模型家族和规模类别中,on-policy输出分布熵比off-policy熵低3-4倍。我们将这种效应的部分原因追溯到输入意外性的内部表示,该表示跟踪模型先前预测中最新的输入标记的不可能性,并因果性地调节输出熵。这些现象的一个例子可以在对开放式提示的响应中观察到;后训练模型(与预训练模型不同)在第一个输出标记之前就将其对即将生成的响应主题的不确定性坍缩;用不同主题的前缀违反这种缓存意图会导致更高的输出熵。我们还测试了模型是否可以通过显式口头报告区分on-policy上下文和前缀。我们发现它们可以,但有趣的是,这种显式识别通过不同于隐式识别的机制进行路由。

English abstract

Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from recognizing that it is on-policy. We present evidence that post-trained models recognize their on-policy generations, and this recognition is implicitly encoded in their output distributions. In particular, on-policy output distribution entropy is 3-4× lower than off-policy entropy, across model families and size classes. We trace part of this effect to an internal representation of input surprise, tracking the unlikeliness of the most recent input token according to the model's prior predictions, that causally modulates output entropy. One example of these phenomena can be observed in response to open-ended prompts: post-trained models (unlike pretrained models) collapse their uncertainty over the topic of their upcoming response before the first output token, and violating this cached intention with a different-topic prefill results in higher output entropy. We also tested whether models can distinguish on-policy contexts from prefills via explicit verbal report. We find that they can but, interestingly, this explicit recognition routes through a different mechanism than implicit recognition.
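The two quantities this abstract relies on, output-distribution entropy and input surprise, have standard definitions that a short sketch can make concrete. The probability vectors below are made up for illustration and do not come from the paper; the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def surprisal(probs, token_index):
    """Negative log-probability the model had assigned to the token that
    actually appeared; large values mean the input was unexpected."""
    return -math.log(probs[token_index])


# Illustrative distributions only: a peaked one (as might follow the model's
# own text) and a flatter one (as might follow text it did not write).
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.4, 0.3, 0.2, 0.1]
ratio = entropy(flat) / entropy(peaked)
```

A lower entropy means the model concentrates probability on fewer continuations, so a ratio above 1 here corresponds to the direction of the effect reported in the abstract, without reproducing its magnitude.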

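2605.25454 2026-05-26 cs.HC cs.AI cs.CL cs.CY cs.SI updated version

AI Content Moderation in Therapy Conversations

AI在治疗对话中的内容审核

Jiwon Kim, Claire Wang, Taeung Yoon, Sabelle Huang, Koustuv Saha

AI summary: Audits how three mainstream content moderation systems (OpenAI, Meta, Google) flag real therapy conversations, revealing limits on the potential of LLMs as therapists.

Details
AI-generated Chinese abstract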

大型语言模型(LLMs)越来越多地被用于情感支持。它们也正在被开发用于正式的治疗目的。然而,像ChatGPT或Llama这样的LLM通常配备内容审核护栏,出于责任和安全考虑,阻止它们与用户讨论敏感话题,而这种无法触及这些话题的能力可能影响它们作为治疗师的能力。在本研究中,我们对三种最先进的审核系统(OpenAI的审核端点、Meta的Llama Guard和Google的Shield Gemma)进行了算法审计,以调查这些系统将现实治疗会话内容标记为不良的程度。我们的结果揭示了用户和组织在设计LLM扮演治疗师角色时可能遇到的限制。

English abstract

Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes. However, LLMs like ChatGPT or Llama are often developed with content moderation guardrails that prevent them from discussing sensitive subjects with users for both liability and safety purposes, and this inability to broach these subjects may affect their capacity as therapists. In this study, we perform an algorithm audit on three state-of-the-art moderation systems (OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma) to investigate the extent to which these systems flag the content of real-life therapy sessions as undesirable. Our results point to limitations that users and organizations may encounter when designing LLMs to play the part of a therapist.

2605.25446 2026-05-26 cs.AI cs.LG updated version

A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

面向常规心电图广谱心血管评估的信号-语言基础模型

Ziqing Yu, Yuhui Tao, Jiayu Huo, Lei Pan, Zilong Xiao, Juecheng Chen, Xiao Li, Jianxuan Li, You Zhou, Zhixing Li, Cong Wang, Beijian Zhang, Chen Chen, Hongyang Lu, Konstantinos Patlatzoglou, Daniel B. Kramer, Jonathan W. Waks, Yangang Su, Fu Siong Ng, Shuo Wang, Yixiu Liang, Junbo Ge

Affiliations: Department of Cardiology, Zhongshan Hospital of Fudan University; Shanghai Institute of Cardiovascular Diseases, National Clinical Research Centre for Interventional Medicine; Digital Medical Research Center, School of Basic Medical Sciences, Fudan University; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention; National Heart and Lung Institute, Imperial College London, Hammersmith Hospital, Du Cane Road; Department of Cardiology, Shanghai Geriatric Medical Center; Cardiac Rhythm Management, Medtronic Technology Center, Medtronic (Shanghai) Ltd.; Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology, Beth Israel Deaconess Medical Center, Harvard Medical School; Harvard-Thorndike Electrophysiology Institute, Beth Israel Deaconess Medical Center, Harvard Medical School; Department of Cardiology, Imperial College Healthcare NHS Trust; Department of Cardiology, Chelsea and Westminster NHS Foundation Trust; Department of Computer Science and Technology, University of Cambridge

AI summary: Proposes ECGCLIP, a signal-language contrastive learning framework that, through large-scale ECG-report pre-training, outperforms baselines on 89 downstream tasks and enables broad-spectrum assessment of common arrhythmias, echocardiographic targets, and rare cardiac diseases.

Details
AI-generated Chinese abstract

心电图(ECG)是心血管诊疗的核心,但传统AI模型通常局限于常见心律失常,且在不同人群或临床细微疾病中泛化能力较差。我们开发了ECGCLIP(心电图对比语言-图像预训练),一种信号-语言对比学习框架,将ECG波形与专家诊断报告对齐。ECGCLIP在来自1,324,856名患者的2,837,962份心电图研究上进行了预训练,并在一个留出内部测试集以及包含约150万份心电图的九个独立外部队列上进行了评估。评估覆盖89项下游任务,包括45项心电图诊断、39项超声心动图靶标和5种罕见心脏病,以PRAUC为主要指标。ECGCLIP在随机初始化和Merl-R18基线上持续提升性能。在内部测试集上,ECGCLIP-R34对心房颤动(PRAUC 0.900)和ST段抬高型心肌梗死(PRAUC 0.383)表现出强劲性能,并在所有外部队列中具有稳健泛化能力。它还改善了低患病率和诊断困难的疾病,包括埃布斯坦畸形、缩窄性心包炎、右位心和心脏淀粉样变性,内部PRAUC值分别为0.253、0.175、0.121和0.201。ECGCLIP数据高效,仅使用10%的训练数据即可达到或超过全数据集基线性能。特征可视化和显著性分析表明,其学习到的表示与既定心电图标准具有临床意义的对齐。这些发现表明,大规模心电图-报告对比预训练可以将常规心电图解读从常见心律失常扩展到广谱心血管评估以及超声心动图和罕见病的机会性筛查。

English abstract

Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may generalize poorly across populations or clinically subtle diseases. We developed ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language contrastive learning framework that aligns ECG waveforms with expert diagnostic reports. ECGCLIP was pre-trained on 2,837,962 ECG studies from 1,324,856 patients and evaluated on a held-out internal test set plus nine independent external cohorts comprising about 1.5 million ECGs. Evaluation covered 89 downstream tasks, including 45 ECG diagnoses, 39 echocardiographic targets, and 5 rare cardiac diseases, using PRAUC as the primary metric. ECGCLIP consistently improved performance over random initialization and Merl-R18 baselines. On the internal test set, ECGCLIP-R34 achieved strong performance for atrial fibrillation (PRAUC 0.900) and ST-segment elevation myocardial infarction (PRAUC 0.383), with robust generalization across all external cohorts. It also improved low-prevalence and diagnostically elusive diseases, including Ebstein anomaly, constrictive pericarditis, dextrocardia, and cardiac amyloidosis, with internal PRAUC values of 0.253, 0.175, 0.121, and 0.201, respectively. ECGCLIP was data efficient, matching or exceeding full-dataset baseline performance with only 10% of training data. Feature visualization and saliency analysis suggested clinically meaningful representations aligned with established electrocardiographic criteria. These findings indicate that large-scale ECG-report contrastive pre-training can expand routine ECG interpretation beyond common arrhythmias toward broad cardiovascular assessment and opportunistic screening of echocardiographic and rare conditions.

2605.25440 2026-05-26 cs.CL cs.AI cs.MA updated version

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

用于评估手术反馈质量的多智能体LLM框架

Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung

Affiliations: Computing + Mathematical Sciences, California Institute of Technology; Department of Urology, Cedars-Sinai; Keck School of Medicine, University of Southern California

AI summary: Proposes a two-stage LLM framework that discovers interpretable feedback-quality criteria through multi-agent prompting and surgical domain knowledge injection, then scores feedback automatically with an LLM-as-a-judge, outperforming prior methods at predicting feedback effectiveness.

Comments: 25 pages, 3 figures

Details
AI-generated Chinese abstract

手术室中主治医生提供的口头反馈在住院医师技能习得中起着关键的形成性作用。然而,评估培训者反馈的质量及其在实时手术中影响受训者行为的有效性仍然是一个挑战。先前的研究依赖于专家人工评分者的大量手动标注来评估反馈内容,并侧重于开发忽略反馈传递定性方面(如清晰度或紧迫性)的广泛分类法。有限的现有自动化方法,包括关键词分析和主题建模,也无法捕捉这些细微方面。我们引入了一个两阶段基于LLM的框架,该框架发现基于手术培训背景的可解释反馈质量标准。我们的方法使用多智能体提示和手术领域知识注入来发现一小套人类可解释的评分标准(例如,鼓励性、紧迫性、清晰性)。然后,这些标准通过LLM作为评判者的方法自动评分实时手术反馈。对4.2k个培训者反馈实例的评估表明,我们AI发现的标准在预测反馈有效性(包括观察到的受训者行为调整和培训者认可)方面优于先前基于内容的框架。这项工作推进了手术室中可扩展的、与人类对齐的沟通质量评估,并为改进手术教学实践提供了基础。

English abstract

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet, assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content relying on extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook the qualitative aspects of feedback delivery such as clarity or urgency. Limited existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.

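2605.25435 2026-05-26 cs.AI updated version

Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures

OpenClaw 代理的安全性:基础、攻击与对策

Yuntao Wang, Jianle Ba, Han Liu, Yanghe Pan, Jintao Wei, Zhou Su, Tom H. Luan, Linkang Du

Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University

AI summary: Surveys the security challenges of OpenClaw agents, categorizes and analyzes threats such as skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities, and summarizes existing defense mechanisms.

Comments: 17 pages, 13 figures

Details
AI-generated Chinese abstract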

由大型语言模型驱动的自主代理的快速发展催生了 OpenClaw,这是一类新的开源代理框架,作为持续运行、技能增强的系统,具有持久记忆、多通道交互和高度的自主性。这些能力使 OpenClaw 代理能够自主执行复杂的多步骤任务,并与外部应用程序无缝交互,但同时也引入了显著扩大的攻击面。特别是,高权限操作与持久记忆的结合使 OpenClaw 代理面临各种新兴威胁,包括技能投毒、认知操纵、多代理级联故障和供应链漏洞。在本综述中,我们全面研究了 OpenClaw 代理的安全格局。我们首先考察了将 OpenClaw 代理与传统 AI 代理系统区分开来的通用架构和关键特征。我们将现有的安全和隐私威胁分类到一个分层框架中,并分析漏洞如何在代理推理、行动执行和外部交互过程中产生。还回顾了代表性的防御机制,以描绘当前的防御格局。最后,讨论了与 OpenClaw 生态系统可靠性和可信度相关的几个未解决问题。

English abstract

The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to map the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

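2605.25430 2026-05-26 cs.AI updated version

CODESKILL: Learning Self-Evolving Skills for Coding Agents

CODESKILL:学习自进化技能的编码智能体

Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, Yang Liu

Affiliations: Nanyang Technological University; Zhejiang University

AI summary: Proposes the CODESKILL framework, which uses reinforcement learning to extract multi-granularity procedural skills from coding-agent trajectories and maintain a skill bank, improving downstream task solving.

Details
AI-generated Chinese abstract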

编码智能体在解决软件工程任务时产生丰富的轨迹。为了实现智能体自我进化,这些轨迹可以提炼为可重用的程序性技能,以紧凑的方式编码经验来指导未来行为。然而,现有的技能构建和维护方法通常依赖固定提示和启发式更新规则,不清楚如何选择、抽象和维护知识以最好地服务下游智能体。我们提出CODESKILL,一个基于LLM的框架,将技能提取和技能库维护重新表述为可学习的管理策略。CODESKILL从编码智能体轨迹中提取多粒度程序性技能,用新经验进化技能,并维护一个紧凑的技能库用于未来任务解决。我们使用强化学习训练CODESKILL,采用混合奖励,将基于评分标准的密集技能质量反馈与来自冻结下游智能体的稀疏可验证执行反馈相结合。在EnvBench、SWE-Bench Verified和Terminal-Bench 2上的实验表明,CODESKILL相比无技能基线平均通过率提高9.69,相比最强的基于提示或记忆基线提高4.01,同时在迭代构建过程中将技能库大小维持在稳定水平。

English abstract

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

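2605.25427 2026-05-26 cs.CV cs.AI updated version

Binding Visual Features Point by Point

逐点绑定视觉特征

Udith Haputhanthri, Declan Campbell, Rim Assouel, Jonathan D. Cohen, Taylor W. Webb

Affiliations: Princeton University; Mila – Quebec AI Institute; Université de Montréal

AI summary: Studies how a text-guided "pointing" mechanism addresses the binding problem of vision language models in multi-object scenes, finding that it induces an internal visual search routine, eliminates binding errors, and enables compositional generalization.

Details
AI-generated Chinese abstract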

尽管在标准基准测试中取得了成功,但视觉语言模型在处理涉及多目标场景的任务时仍表现出持续的失败,包括许多对人类来说相对容易的任务。最近的研究发现,这些失败可能源于在上下文中准确绑定对象特征的基本能力缺失,这在认知科学和神经科学中被称为“绑定问题”。人类视觉系统被认为通过串行处理来解决这一绑定问题,即一次只关注一个对象,以避免来自其他对象的干扰。最近的研究提出了“指向”——使用显式空间坐标来指代对象——作为视觉语言模型的类似解决方案,并发现它提高了具有挑战性的多目标任务的性能。然而,目前尚不清楚这种方法为何(即在机制或表征层面)能提高性能,以及这与人类视觉中的串行处理有何直接关系。本文研究了这一问题。我们发现,通过文本学习指向会诱导内部视觉搜索程序,并描述了支持这一过程的机制。我们还发现,指向行为可以通过微调泛化到新任务,并且这样做可以消除绑定错误并实现组合泛化。这些结果提供了一个原理证明,即串行处理可以像解决生物视觉中的绑定问题一样,解决视觉语言模型中的绑定问题。

English abstract

Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object scenes, including many tasks that are relatively easy for humans. Recent work has found that these failures may stem from a basic inability to accurately bind object features in-context, a challenge that is referred to as the "binding problem" in cognitive science and neuroscience. The human visual system is thought to solve this binding problem via serial processing, attending to individual objects one at a time so as to avoid interference from other objects. Recent work has proposed "pointing" -- the use of explicit spatial coordinates to refer to objects -- as an analogous solution for vision language models, and found that it improves performance on challenging multi-object tasks. However, it is unclear why (i.e., on a mechanistic or representational level) this approach improves performance, and how directly this relates to serial processing in human vision. Here, we investigate this question. We find that learning to point-via-text induces an internal visual search routine, and we characterize the mechanisms that support this procedure. We also find that pointing behavior can be generalized to new tasks via fine-tuning, and that doing so eliminates binding errors and enables compositional generalization. These results provide a proof-of-principle that serial processing can solve the binding problem for vision language models just as it does for biological vision.

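2605.25424 2026-05-26 cs.LG cs.AI updated version

SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning

SeqRoute: 通过离线强化学习实现全局预算感知的顺序LLM路由

Zhongling Xu, Shunan Zheng, Wei Wang

Affiliations: Department of Operations Research and Industrial Engineering

AI summary: Proposes the SeqRoute framework, which models multi-turn LLM routing as a finite-horizon Markov decision process and learns delayed gratification through offline reinforcement learning (CQL) and Hindsight Budget Relabeling (HBR), optimizing cost and quality under a global budget constraint and reducing the bankruptcy rate to under 1%.

Details
AI-generated Chinese abstract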

现有的LLM路由框架将查询视为独立事件,忽略了受全局计算预算约束的真实用户会话的顺序性质。这种不匹配不可避免地导致预算破产:短视的路由策略在早期交互中耗尽资源,迫使后续通常更复杂的查询使用不充分的模型。我们引入SeqRoute,一个将多轮路由建模为有限时域马尔可夫决策过程并通过离线强化学习求解的框架。通过将剩余预算纳入状态空间并使用保守Q学习(CQL)进行训练,SeqRoute学习延迟满足以策略性地为会话后期的高风险轮次保留资源。为了克服数据匮乏,我们提出事后预算重标记(HBR)。该技术在不同假设预算下回顾性地模拟历史轨迹,将10,000个原始会话扩展为238万个包含关键破产信号的转换。在部署时,动态λ扫描机制无需重新训练即可实现成本-质量帕累托前沿的零样本导航。大量评估表明,SeqRoute在保持或提高质量的同时将运营成本降低6.0-73.5%,并将破产率抑制在1%以下,在整个帕累托前沿上严格优于行为克隆、预算感知启发式和静态基线。

English abstract

Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrained by global computational budgets. This mismatch inevitably leads to budget bankruptcy: myopic routing policies exhaust resources on early interactions, forcing subsequent and often more complex queries onto inadequate models. We introduce SeqRoute, a framework that formulates multi-turn routing as a finite-horizon Markov Decision Process and solves it via offline reinforcement learning. By incorporating the remaining budget into the state space and training with Conservative Q-Learning (CQL), SeqRoute learns delayed gratification to strategically preserve resources for high-stakes turns later in the session. To overcome data starvation, we propose Hindsight Budget Relabeling (HBR). This technique retrospectively simulates historical trajectories under diverse hypothetical budgets, expanding 10,000 raw sessions into 2.38 million transitions enriched with critical bankruptcy signals. At deployment, a dynamic λ-sweep mechanism enables zero-shot navigation of the cost-quality Pareto frontier without retraining. Extensive evaluations demonstrate that SeqRoute reduces operational costs by 6.0-73.5% while maintaining or improving quality, and suppresses bankruptcy rates to under 1%, strictly dominating behavior cloning, budget-aware heuristics, and static baselines across the entire Pareto frontier.
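The relabeling idea in this abstract, replaying one logged session under several hypothetical budgets so that the remaining budget becomes part of each state, can be sketched in a few lines. The abstract does not specify the actual HBR procedure, so the transition layout, the rule that a session ends at the first unaffordable turn, and the function name `relabel` are all assumptions made for illustration.

```python
def relabel(costs, budgets):
    """Replay one logged session (a list of per-turn costs) under each
    hypothetical budget.

    Returns transitions of the form
    (turn, remaining_before, cost, remaining_after, bankrupt).
    """
    transitions = []
    for budget in budgets:
        remaining = budget
        for turn, cost in enumerate(costs):
            bankrupt = cost > remaining
            after = remaining if bankrupt else remaining - cost
            transitions.append((turn, remaining, cost, after, bankrupt))
            if bankrupt:
                break  # assumed rule: the session stops at the first unaffordable turn
            remaining = after
    return transitions


# One logged session replayed under three hypothetical budgets.
session_costs = [3, 1, 4]
data = relabel(session_costs, budgets=[10, 5, 2])
bankruptcies = sum(1 for t in data if t[4])
```

A single session yields several trajectories, some of which end in bankruptcy, which is how a relabeling scheme of this kind can multiply a small log into many budget-aware transitions.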

2605.25422 2026-05-26 eess.SP cs.AI cs.IT math.IT updated version

A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

面向多智能体协作的令牌/KV缓存通信介质选择与资源分配策略

Lipeng Dai, Luping Xiang, Kun Yang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, 210008, China; Institute of Intelligent Networks and Communications (NINE), Nanjing University (Suzhou Campus), Suzhou, 215163, China

AI summary: To address the end-to-end latency trade-off caused by heterogeneous interaction media in multi-agent collaboration, proposes a joint optimization of communication-media selection and wireless resource allocation, with a low-complexity algorithm designed to minimize latency.

Details
AI-generated Chinese abstract

大型语言模型(LLM)与6G网络的融合正在催生自主多智能体协作范式,这预计将大幅增加东西向流量。尽管潜在空间交互机制比符号自然语言(NL)交换能实现更高效的协作,但先前的工作通常忽略了实际无线约束下的相关通信开销。在具身多智能体场景中,异构交互介质会导致不同的推理和传输成本,从而产生固有的端到端(E2E)延迟权衡。为解决这一问题,我们提出了一种联合设计,将通信介质选择与无线资源分配相结合。通过分析表征和基于仿真的评估,我们表明基于令牌的传输和基于键值(KV)缓存的传输在运行状态下并非统一最优,因为性能关键取决于可用计算资源和信道条件等系统参数。因此,我们构建了一个联合优化问题,旨在最小化多智能体协作的E2E延迟,并开发了一种低复杂度的联合介质选择与资源分配(JMSRA)算法。数值结果进一步证实,通过自适应地协调异构链路上的交互介质和带宽分配,所提方案相对于传统的仅NL和仅KV缓存基线显著降低了E2E延迟,从而在未来无线网络中实现高效且鲁棒的多智能体协作。

English abstract

The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in turn is expected to substantially increase east-west traffic. Although latent-space interaction mechanisms can enable more efficient collaboration than symbolic natural-language (NL) exchanges, prior work often abstracts away the associated communication overhead under practical wireless constraints. In embodied multi-agent settings, heterogeneous interaction media incur disparate inference and transmission costs, thereby inducing an inherent end-to-end (E2E) latency trade-off. To address this, we propose a joint design that integrates communication-media selection with wireless resource allocation. Through analytical characterization and simulation-based evaluation, we show that neither token-based transmission nor key-value (KV) cache-based transmission is uniformly optimal across operating regimes, as performance depends critically on system parameters such as available computational resources and channel conditions. Accordingly, we formulate a joint optimization problem aimed at minimizing the E2E latency of multi-agent collaboration and develop a low-complexity joint media selection and resource allocation (JMSRA) algorithm. Numerical results further confirm that, by adaptively coordinating the interaction media and bandwidth allocation over heterogeneous links, the proposed scheme achieves markedly reduced E2E latency relative to conventional NL-only and KV-cache-only baselines, enabling efficient and robust multi-agent collaboration in future wireless networks.
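The claim that neither token-based nor KV-cache-based transmission is uniformly optimal follows from a simple latency decomposition: transmission time grows with payload size and shrinks with link rate, while inference time depends on how much the receiver must recompute. The sketch below illustrates only that trade-off. The payload sizes, compute times, and function names are invented for the example, and the paper's JMSRA algorithm, which also allocates bandwidth across links, is not reproduced.

```python
def e2e_latency(payload_bits, rate_bps, compute_s):
    """Transmission time plus the inference time tied to the chosen medium."""
    return payload_bits / rate_bps + compute_s


def pick_medium(rate_bps, media):
    """Return the name of the medium with the lowest end-to-end latency."""
    return min(
        media,
        key=lambda name: e2e_latency(
            media[name]["bits"], rate_bps, media[name]["compute_s"]
        ),
    )


# Hypothetical numbers: tokens are a small payload but the receiver must
# re-encode them; a KV cache is a large payload that needs little recomputation.
media = {
    "token": {"bits": 8_000, "compute_s": 0.50},
    "kv_cache": {"bits": 8_000_000, "compute_s": 0.05},
}

slow_link_choice = pick_medium(1e6, media)  # 1 Mbit/s link
fast_link_choice = pick_medium(1e9, media)  # 1 Gbit/s link
```

With these assumed numbers the slow link favors tokens and the fast link favors the KV cache, which is the regime dependence the abstract describes.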

2605.25420 2026-05-26 cs.CL cs.AI cs.CY updated version

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

SomaliBench Eval:衡量开源语言模型中英语到索马里语的拒绝差距

Khalid Yusuf Dahir

Affiliations: Independent researcher

AI summary: Builds a Somali harmful-intent benchmark and evaluates four open-weight models, finding large English-to-Somali gaps in refusal rate and that most non-refusal outputs are non-fluent, unusable content.

Comments: 12 pages, 3 figures, 4 tables. Code: https://github.com/khaledyusuf44/somalibench_eval Dataset: https://huggingface.co/datasets/khaledyusuf44/somalibench-v0

Details
AI-generated Chinese abstract

大型语言模型的安全评估仍然高度以英语为中心,即使模型在全球部署,低资源语言的评估也严重不足。我们在SomaliBench v0上评估了四个开源指令微调模型,这是一个由母语者验证的基准,包含100对英语和索马里语的有害意图提示。每个模型(Llama-3.1-8B-Instruct、Gemma-2-9B-Instruct、Qwen-2.5-7B-Instruct和Aya-23-8B)均在本地运行,温度为0,并使用相同的英语“有帮助、无害、诚实”(HHH)系统提示。一个固定的Claude Sonnet快照(claude-sonnet-4-5-20250929)将每个响应分类为拒绝、遵从或不清楚;母语作者对分层抽样的80行样本进行抽查。我们发现所有四个模型在英语到索马里语之间存在巨大的拒绝差距:Llama-3.1-8B(0.90;95%自助法置信区间[0.85, 0.96])、Aya-23-8B(0.75 [0.67, 0.83])、Qwen-2.5-7B(0.69 [0.59, 0.78])和Gemma-2-9B(0.38 [0.27, 0.49])。对于三个模型,索马里语中主要的非拒绝模式不是流畅的有害遵从,而是不清楚的输出:空、错误语言或不连贯的生成。母语验证抽查在80个采样行上与判断器达到100%一致(Cohen's kappa = 1.00)。我们仅报告总体拒绝率、类别差距和可靠性统计;原始模型生成保留在本地,不发布。

English abstract

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
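A refusal gap with a bootstrap confidence interval, as reported in this abstract, can be computed along the following lines. This is a minimal sketch under stated assumptions: the gap is taken as the English refusal rate minus the Somali refusal rate, resampling is done over prompt pairs, and the interval uses the percentile method. The abstract does not say which bootstrap variant the author used, and the labels below are synthetic, not the paper's data.

```python
import random


def refusal_gap(pairs):
    """pairs: list of (refused_in_english, refused_in_somali) booleans,
    one tuple per paired prompt."""
    n = len(pairs)
    return sum(e for e, _ in pairs) / n - sum(s for _, s in pairs) / n


def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the refusal gap, resampling pairs."""
    rng = random.Random(seed)
    gaps = sorted(
        refusal_gap([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lo = gaps[int(alpha / 2 * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic labels: 95 of 100 English prompts refused, 10 of 100 Somali prompts refused.
pairs = [(i < 95, i < 10) for i in range(100)]
gap = refusal_gap(pairs)
low, high = bootstrap_ci(pairs)
```

Resampling whole pairs rather than each language separately keeps the English and Somali versions of a prompt together, which matches the paired design of the benchmark.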

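2605.25399 2026-05-26 cs.AI updated version

Towards end-to-end LLM-based censoring-aware survival analysis

面向端到端基于大语言模型的删失感知生存分析

Yishu Wei, Hexin Dong, Yi Lin, Jiahe Qian, Yi Liu, Yifan Peng

Affiliations: Department of Population Health Science, Weill Cornell Medicine; Weill Cornell Medicine

AI summary: Proposes the LLMSurvival framework, which reformulates time-to-event prediction as pairwise ranking to enable censoring-aware survival analysis, outperforming the Cox proportional hazards model and three deep learning models on ICU mortality and fracture risk prediction.

Details
AI-generated Chinese abstract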

目的:生存分析是医学预测的核心,然而大语言模型(LLM)很少被用作端到端生存模型,因为删失阻碍了直接的监督微调。这里我们提出LLMSurvival,一个框架,使得未修改的LLM能够直接操作表格临床数据进行删失感知的生存分析。材料与方法:LLMSurvival将时间事件预测重新表述为可比较受试者之间的成对排序,并通过聚合与训练队列中锚定个体的比较来推导测试时风险。结果:在两个临床任务(MIMIC-IV中的ICU死亡率预测和纽约长老会/威尔康奈尔医学中心队列中的脆性骨折预测)中,LLMSurvival相比Cox比例风险模型,整体一致性提高了ICU死亡率3.1%和骨折风险0.5%,相比三个已建立的深度学习生存模型,ICU死亡率平均提高2.1%,骨折风险平均提高2.8%。讨论:结果表明,通过基于比较的重新制定,可以使带有删失的生存建模与LLM微调兼容。该框架展示了高可移植性,并且在不同的临床背景下优于专家制定的评分(如SAPS-II和FRAX评分)。此外,该框架支持本地部署,因为紧凑、公开可用的基础模型提供了足够的性能。结论:LLMSurvival框架作为通过LLM进行集成、删失意识的生存分析的概念验证。

English abstract

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, and over three established deep learning survival models by an average of 2.1% for ICU mortality and 2.8% for fracture risk. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert-curated scores such as SAPS-II and FRAX across diverse clinical contexts. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.
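The notion of "comparable subjects" under censoring, and the concordance metric used to score a risk ranking, can be shown with a small sketch. It follows the usual Harrell-style convention, which is an assumption here since the abstract does not define its pairing rule: a pair is comparable only when the subject with the earlier time had an observed event. The data are invented and the function names are hypothetical.

```python
def comparable_pairs(times, events):
    """Yield (i, j) where subject i is known to have failed before subject j.

    A pair is usable only if the earlier time is an observed event; if the
    earlier subject was censored, the true ordering is unknown.
    """
    n = len(times)
    for i in range(n):
        for j in range(n):
            if i != j and events[i] and times[i] < times[j]:
                yield i, j


def concordance(times, events, risk):
    """Fraction of comparable pairs in which the earlier failure has the
    higher predicted risk (ties in risk count as half)."""
    pairs = list(comparable_pairs(times, events))
    good = sum(
        1.0 if risk[i] > risk[j] else 0.5 if risk[i] == risk[j] else 0.0
        for i, j in pairs
    )
    return good / len(pairs)


times = [2, 5, 7, 9]
events = [True, False, True, False]  # False means the subject was censored
risk = [0.9, 0.4, 0.6, 0.1]
pairs = list(comparable_pairs(times, events))
c_index = concordance(times, events, risk)
```

Censored subjects still contribute as the later member of a pair, which is why a pairwise formulation can use them for training even though their exact event times are unknown.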

2605.25396 2026-05-26 cs.CV cs.AI updated version

Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control

子空间引导的语义与拓扑不变配准用于无标注超声平面质量控制

Chunzheng Zhu, Jianxin Lin, Feng Wang, Cheng Jiang, Guanghua Tan, Zhenyu Zhou, Shengli Li, Kenli Li

Affiliations: Hunan University; Shenzhen Maternity and Child Healthcare Hospital

AI summary: Proposes the STRIQ framework, which uses a subspace-guided registration consistency measure for annotation-free ultrasound plane quality control and achieves state-of-the-art correlation with clinical quality scores.

Comments: MICCAI 2026 Accepted Paper; Subspace-Guided Registration for Ultrasound Quality Control

Details
AI-generated Chinese abstract

超声图像的可靠质量控制对于实时采集指导和回顾性临床审计至关重要,然而现有方法严重依赖逐平面标注,或采用在临床采集固有空间变形下易产生系统性偏差的伪标签。我们提出STRIQ,一种基于配准的框架,将无标注超声平面质量控制重新定义为子空间引导的一致性度量问题。具体而言,STRIQ引入潜在配准对齐器(LRA)以建立查询图像与方差驱动锚点之间的层次特征空间对应,这些锚点通过方差谱准则从无标签数据中自主提炼,作为结构稳定的原型。为进一步区分解剖平面并减轻负知识迁移,我们提出正交知识子空间(OKS)模块。OKS将平面特定表示分解为相互正交的子空间,实现细粒度专家协作同时防止平面间干扰,确保质量度量基于原则性的子空间邻近性。在内部US4QA和公开CAMUS数据集上的大量实验表明,STRIQ实现了与临床质量评分的最优相关性,为无标注、实时可靠的超声质量控制建立了新范式。我们的代码可在https://github.com/zhcz328/STRIQ获取。

English abstract

Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, yet existing approaches rely heavily on per-plane annotations, or employ pseudo-labeling prone to systematic bias under spatial deformations inherent in clinical acquisition. We present STRIQ, a registration-driven framework that recasts annotation-free US plane quality control as a subspace-guided consistency measurement problem. Specifically, STRIQ introduces a Latent Registration Aligner (LRA) to establish hierarchical feature space correspondences between query images and variance-driven anchors, which are autonomously distilled from unlabeled data via a variance spectrum criterion to serve as structurally stable prototypes. To further disambiguate anatomical planes and mitigate negative knowledge transfer, we propose an Orthogonal Knowledge Subspace (OKS) module. The OKS decomposes plane-specific representations into mutually orthogonal subspaces, enabling fine-grained expert collaboration while preventing inter-plane interference, ensuring that the quality metric is grounded in principled subspace proximity. Extensive experiments on the in-house US4QA and public CAMUS datasets demonstrate that STRIQ achieves state-of-the-art correlation with clinical quality scores, establishing a new paradigm for annotation-free, real-time reliable ultrasound quality control. Our code is available at https://github.com/zhcz328/STRIQ.

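2605.25394 2026-05-26 cs.AI cs.CL updated version

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Second Guess: 通过弃权和答案稳定性检测小型语言模型的不确定性

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

Affiliations: University of Southern California; Information Sciences Institute

AI summary: Proposes Second Guess, a lightweight, parameter-free prompting technique that adds an "I don't know" option and observes answer stability to enable abstention in multiple-choice question answering, effectively detecting uncertainty in small language models.

Details
AI-generated Chinese abstract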

大型语言模型在不确定时往往生成自信但错误的答案,而非弃权。这个问题对于小型语言模型(SLM)尤为严重,因为计算约束和自主操作放大了对可靠不确定性检测的需求。我们提出了_Second Guess_,一种轻量级、无参数的提示技术,用于多项选择问答(MCQA)中的弃权,非常适合SLM。我们的关键实证洞察是,真正知道答案的模型会一致地选择它,而不确定的模型在添加“我不知道”选项时会表现出不稳定的行为。在四个开源模型(2B-8B参数)和四个基准测试上评估,Second Guess实现了10.81%的最高复合风险改进。值得注意的是,在基于熵的方法退化的微调模型上,它保持了8%的复合风险改进,并且对性能较低的模型改进最大。重现本工作所需的所有代码和结果可在https://github.com/Mystic-Slice/second-guess获取。

英文摘要

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose _Second Guess_, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an ``I don't know'' option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81\%. Notably, it maintains an 8\% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess
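The stability check described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `ask` callable, the `second_guess` function, and the exact abstention rule (abstain when the answer changes or when the extra option is picked) are assumptions made for illustration only.

```python
from typing import Callable, Optional, Sequence

IDK = "I don't know"  # wording of the extra option is an assumption


def second_guess(
    ask: Callable[[str, Sequence[str]], str],
    question: str,
    options: Sequence[str],
) -> Optional[str]:
    """Return the model's answer, or None to abstain.

    `ask` is a hypothetical wrapper around a language model: it takes a
    question and a list of options and returns the chosen option text.
    """
    first = ask(question, options)                  # original option set
    second = ask(question, list(options) + [IDK])   # same question plus IDK
    # Abstain if the model picks IDK or if its answer is not stable.
    if second == IDK or second != first:
        return None
    return first
```

A model that answers identically under both prompts keeps its answer; any other outcome is treated as a sign of uncertainty.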

2605.25389 2026-05-26 cs.CR cs.AI cs.MA 版本更新

Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

Evo-Attacker: 用于LLM-MAS长时程工具攻击的记忆增强强化学习

Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Yiming Hei, Litian Zhang

发表机构 * Beihang University(北航) Beijing University of Posts and Telecommunications(北京邮电大学) China Academy of Information and Communications Technology(信息通信技术研究院)

AI总结 提出Evo-Attacker,通过记忆增强强化学习框架将工具攻击建模为自进化过程,并引入Attack-Flow GRPO优化长时程信用分配,实验表明其优于基线方法。

Comments ACL 2026 main

详情
AI中文摘要

尽管基于大语言模型的多智能体系统(LLM-MAS)通过编排专业智能体和外部工具在解决复杂任务方面展现出显著能力,但对工具输出的隐式信任创造了一个关键攻击面。现有的工具攻击受限于领域特异性或固定静态模板。为应对这些挑战,我们提出Evo-Attacker,将工具攻击形式化为一个自进化的、记忆增强的强化学习过程。Evo-Attacker构建动态攻击记忆,并采用深思熟虑的推理来检索对抗模式,并在关键时刻策略性地修改干预。此外,我们引入Attack-Flow GRPO,通过终端结果优化中间推理步骤,解决长时程信用分配挑战。综合实验表明,Evo-Attacker始终优于基线,凸显其泛化和进化能力,以及防御性工具保障的迫切需求。

英文摘要

While Large Language Model-based Multi-Agent Systems (LLM-MAS) demonstrate remarkable capabilities in solving complex tasks by orchestrating specialized agents and external tools, the implicit trust in tool outputs creates a critical attack surface. Existing tool attacks are limited by domain specificity or fixed and static templates. To address these challenges, we propose Evo-Attacker, which formulates the tool attack as a self-evolving, memory-augmented reinforcement learning process. Evo-Attacker constructs a dynamic attack memory and employs deliberative reasoning to retrieve adversarial patterns and strategize modifying interventions at critical moments. Furthermore, we introduce Attack-Flow GRPO to optimize intermediate reasoning steps via terminal outcomes, addressing the long-horizon credit assignment challenge. Comprehensive experiments demonstrate that Evo-Attacker consistently outperforms baselines, highlighting its generalization and evolutionary capabilities and the urgent need for defensive tool safeguards.

2605.25385 2026-05-26 cs.CV cs.AI 版本更新

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

基于SAM模型和掩码引导的弱监督伪装目标检测

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

发表机构 * School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China(中国海洋大学计算机科学与技术学院,青岛266100)

AI总结 提出MGNet网络,利用SAM模型生成伪标签,通过级联掩码解码器、上下文增强模块和掩码引导特征聚合模块,实现弱监督伪装目标检测,性能与全监督方法相当。

Comments 18 pages

详情
AI中文摘要

伪装目标检测(COD)由于目标与背景高度相似,是一项具有挑战性的任务。现有的全监督方法需要耗费大量人力进行像素级标注,因此弱监督方法成为平衡精度与标注效率的可行折中方案。然而,由于使用粗标注,弱监督方法常出现性能下降。本文提出一种新的弱监督伪装目标检测方法以克服这些限制。具体地,我们设计了一个新颖的网络MGNet,通过利用自定义级联掩码解码器(CMD)生成的初始掩码来引导分割过程并增强边缘预测,从而解决边缘模糊和漏检问题。我们引入上下文增强模块(CEM)以减少漏检,以及掩码引导特征聚合模块(MFAM)进行有效的特征聚合。针对弱监督挑战,我们提出BoxSAM,利用带有边界框提示的Segment Anything Model(SAM)生成伪标签。通过采用冗余处理策略,为训练MGNet提供高质量的像素级伪标签。大量实验表明,我们的方法在性能上与当前最先进方法具有竞争力。

英文摘要

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25377 2026-05-26 cs.CV cs.AI 版本更新

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

对抗正交解缠用于LVLM幻觉缓解

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

发表机构 * Fudan University(复旦大学) Tencent(腾讯) Nanjing University(南京大学) Southeast University(东南大学) Great Bay University(大湾区大学) TeleAI, China Telecom(TeleAI,中国电信)

AI总结 提出对抗正交解缠(AOD)框架,通过最小最大目标学习幻觉相关方向,并利用双前向对比解码策略,在不需额外训练的情况下缓解大型视觉语言模型(LVLM)的幻觉问题。

详情
AI中文摘要

大型视觉语言模型(LVLM)推进了多模态理解,但其可靠性受到幻觉的限制,即生成内容与视觉事实冲突。现有缓解方法要么依赖昂贵的外部干预(如指令调优和检索),要么使用受限于有缺陷的注意力权重和纠缠的隐藏表示的内部机制。我们提出对抗正交解缠(AOD),一种用于缓解LVLM幻觉的潜在几何框架。AOD通过最小最大目标学习幻觉相关方向:分类器将幻觉信号集中到投影分量中,而对抗器通过梯度反转层将其从正交残差空间中移除。学习到的方向使得一种无需训练的双前向对比解码策略能够抑制幻觉同时保持通用能力。在三个LVLM上进行的四个幻觉和四个效用基准实验表明,AOD一致优于强基线。它在POPE上平均提高超过6%的准确率,将AMBER提升6%,并在MMMU等效用任务上保持强劲性能。进一步分析显示跨数据集的鲁棒迁移,表明AOD捕获了通用的幻觉相关偏差而非数据集特定伪影。我们的源代码和数据集可在https://github.com/Hunter-Wrynn/AOD获取。

英文摘要

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6\% on average, boosts AMBER by 6\%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
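The geometric idea in this abstract, splitting a hidden state into a component along a learned direction and an orthogonal residual, can be sketched as below. This is only the linear-algebra step under stated assumptions: the function name and array shapes are hypothetical, and the minimax training and contrastive decoding described by the authors are not reproduced.

```python
import numpy as np


def split_along_direction(hidden: np.ndarray, direction: np.ndarray):
    """Split hidden states into a projected part and an orthogonal residual.

    hidden:    array of shape (n, dim), one hidden state per row (assumed layout).
    direction: array of shape (dim,), e.g. a learned hallucination-related direction.
    """
    unit = direction / np.linalg.norm(direction)   # normalise to unit length
    coeff = hidden @ unit                          # scalar projection per row
    projected = coeff[:, None] * unit              # component along the direction
    residual = hidden - projected                  # component orthogonal to it
    return projected, residual
```

By construction the two parts sum back to the original hidden states, and the residual has no component along the chosen direction.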

2605.25358 2026-05-26 cs.CL cs.AI cs.CY 版本更新

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

跨34种语言的AI相关词汇变迁:新闻写作中的跨语言趋同与历时采纳

Thomas Stephan Juzek

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 通过分析34种语言的新闻语料,使用GPT-4.1续写诊断方法,发现AI过度使用的词汇在跨语言中呈现语义趋同,且ChatGPT发布后这些词汇的使用频率显著增加。

Comments 19 pages (9-page main body, plus references and appendices), 3 figures; ACL ARR reviewed, committed to EMNLP 2026

详情
AI中文摘要

AI相关的词汇转变主要被记录在科学英语中。我们将这项工作扩展到WMT新闻抓取语料库中的34种语言,改进了一种分割-后半部分续写诊断方法,比较GPT-4.1续写与匹配的人类黄金标准文本。对于每种语言,我们使用对数流行率比率推导出排名靠前的AI过度使用词元。我们发现显著的跨语言语义趋同:语义相关的概念在类型多样的语言中反复出现,其中'强调'类动词出现在34种语言中的24种。基于嵌入和人工分析支持这一模式。我们还考察了ChatGPT发布前后新闻写作中的历时采纳情况。追踪每种语言前20个AI过度使用项目,我们发现从2020-2021年到2023-2024年,34种语言中有26种语言的流行率增加,平均变化为+15.1%,而匹配的基线词汇没有显示出可比的增加(-4.5%)。在具有较长历史覆盖的10种语言中,纵向分析显示2022年后的增加超过了早期观察到的适度变化,尽管效应大小小于科学英语。我们广泛验证了我们的方法,包括跨种子、模型变体、数据大小、模型系列等。我们的发现与以下观点一致:AI相关的词汇偏好超越了英语,并可能对全球语言使用施加跨语言同质化压力。

英文摘要

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
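A log prevalence ratio of the kind mentioned in this abstract can be sketched as follows. The definition of prevalence as the share of documents containing a lemma, the additive smoothing constant, and both function names are assumptions for illustration; the abstract does not specify them.

```python
import math


def log_prevalence_ratio(ai_docs, human_docs, lemma, smoothing=0.5):
    """Log ratio of document-level prevalence, AI text versus human text.

    Each document is a set of lemmas. `smoothing` (an assumed value) keeps
    the ratio finite when a lemma is absent from one side.
    """
    ai_hits = sum(lemma in doc for doc in ai_docs)
    human_hits = sum(lemma in doc for doc in human_docs)
    ai_prev = (ai_hits + smoothing) / (len(ai_docs) + 2 * smoothing)
    human_prev = (human_hits + smoothing) / (len(human_docs) + 2 * smoothing)
    return math.log(ai_prev / human_prev)


def rank_overused(ai_docs, human_docs, top_k=20):
    """Return the top_k lemmas with the highest log prevalence ratio."""
    vocab = set().union(*ai_docs, *human_docs)
    ranked = sorted(
        vocab,
        key=lambda w: log_prevalence_ratio(ai_docs, human_docs, w),
        reverse=True,
    )
    return ranked[:top_k]
```

A positive score means the lemma appears in a larger share of AI continuations than of matched human text; the default `top_k=20` mirrors the top-20 lists tracked in the abstract.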

2605.25354 2026-05-26 cs.AI 版本更新

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

Context-CoT:通过高质量推理合成增强上下文学习

Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding

发表机构 * Peking University(北京大学) Xiamen University(厦门大学) Tsinghua University(清华大学)

AI总结 针对大语言模型在动态提取和应用新知识方面的上下文学习能力不足,提出Context-CoT方法,通过合成高质量推理链来增强上下文学习,在CL-Bench上显著提升性能。

详情
AI中文摘要

虽然大语言模型在使用静态预训练知识进行推理方面表现出色,但在上下文学习——即从复杂、任务特定的上下文中动态提取、内化和应用新知识的能力——方面存在显著困难。最近在CL-Bench上的评估揭示了一个关键能力差距:前沿模型平均仅能解决17.2%的上下文相关任务。

英文摘要

While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning-the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts. Recent evaluations on the CL-Bench reveal a critical capability gap: frontier models solve only 17.2% of context-dependent tasks on average.

2605.25352 2026-05-26 cs.LG cs.AI 版本更新

Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces

基于预训练潜在空间中近似高斯混合结构的认证鲁棒性

Konstantinos Emmanouilidis, Tianjiao Ding, Nghia Nguyen, Nicolas Loizou, René Vidal

发表机构 * CS & MINDS Johns Hopkins University(约翰霍普金斯大学计算机科学系与MINDS) CIS University of Pennsylvania(宾夕法尼亚大学计算机与信息科学系) AMS & MINDS Johns Hopkins University(约翰霍普金斯大学应用数学与统计系与MINDS) ESE, Radiology & IDEAS University of Pennsylvania(宾夕法尼亚大学电气与系统工程系、放射学系与IDEAS)

AI总结 本文提出一个框架,利用预训练编码器将输入映射到近似高斯混合的潜在分布,通过理论分析证明鲁棒性退化有界,从而实现可认证鲁棒分类器,在CIFAR-10和ImageNet上达到最优或竞争性的认证准确率。

详情
AI中文摘要

深度学习模型易受对抗扰动影响,这对安全关键部署提出了重要关切。经验性防御在实践中可以实现强鲁棒性,但缺乏形式化保证,这推动了可认证鲁棒分类器的需求。虽然认证方法提供了形式化保证,但由于无法利用复杂数据分布中的结构,它们通常产生过于保守的边界。在这项工作中,我们提出了一个设计可认证鲁棒分类器的框架,该框架利用数据表示中的潜在结构。我们首先分析高斯混合设置,推导出鲁棒分类器存在的必要和充分条件,并构建了一个具有闭式鲁棒性证书和泛化保证的分类器。我们的主要贡献是证明精确结构并非必需:我们证明,如果预训练编码器将输入映射到一个与高斯混合分布$\varepsilon$-接近(在KL散度下)的潜在分布,那么认证准确率会优雅地退化,并给出了一个显式边界,关联真实分布和近似分布下的鲁棒性。这一结果使得直接使用预训练模型成为可能,而无需精确的分布假设。实验上,我们的方法在CIFAR-10和ImageNet上实现了最先进或具有竞争力的认证准确率,同时保持了强大的干净性能和低计算开销。总体而言,我们的工作将近似潜在结构确立为通往可认证鲁棒性的一条实用且有原则的路径。

英文摘要

Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defenses can achieve strong robustness in practice, but lack formal guarantees, motivating the need for certifiably robust classifiers. While certified methods provide formal guarantees, they often yield overly conservative bounds due to their inability to exploit structure in complex data distributions. In this work, we propose a framework for designing certifiably robust classifiers that leverages latent structure in data representations. We first analyze the Gaussian mixture setting, deriving necessary and sufficient conditions for the existence of robust classifiers and constructing a classifier with a closed-form robustness certificate and generalization guarantees. Our main contribution is to show that exact structure is not required: we prove that if a pretrained encoder maps inputs to a latent distribution that is $\varepsilon$-close (in KL divergence) to a Gaussian mixture, then certified accuracy degrades gracefully, with an explicit bound relating robustness under the true and approximate distributions. This result enables the direct use of pretrained models without requiring exact distributional assumptions. Empirically, our method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet, while maintaining strong clean performance and low computational overhead. Overall, our work establishes approximate latent structure as a practical and principled route to certifiable robustness.

2605.25348 2026-05-26 eess.IV cs.AI cs.CV cs.LG cs.SC 版本更新

Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization

基于深度图拉普拉斯正则化的参数高效CT重建

Veera Varuni Radhakrishnan, Chinthaka Dinesh, Qurat-ul-Ain Azim

发表机构 * Mechanical and Industrial Engineering Department(机械与工业工程系)

AI总结 提出深度图拉普拉斯正则化(Deep GLR)方法,通过将二次图正则化集成到近端前向-后向分裂优化框架中,仅用少量参数和数据即可实现低剂量CT重建的噪声抑制,在参数效率和数据效率上显著优于现有方法。

Comments 7 pages, 3 figures, conference

详情
AI中文摘要

低剂量计算机断层扫描(LDCT)重建面临重建质量与资源需求之间的关键权衡。虽然最近的深度学习方法达到了最先进的性能,但它们通常依赖超过50万个参数,并在超过35,000次扫描的大规模数据集上训练。本文研究在严格资源约束下,基于图的正则化是否能提供有意义的噪声抑制。我们提出了深度图拉普拉斯正则化(Deep GLR),将二次图正则化集成到近端前向-后向分裂优化框架中,并包含三个轻量级CNN模块。在LoDoPaB-CT基准上评估,Deep GLR达到了30.70 dB的PSNR,相比滤波反投影提高了6.33 dB,同时仅使用了91,848个参数,在1000个样本上训练(标准训练集的2.8%)。与基准方法相比,这代表了每dB改进5.8倍的参数效率和30倍的数据效率。学习到的图带宽参数(ε=1.25)收敛到可解释的值,表明该方法捕捉了有意义的图像先验而非过拟合。尽管与最先进方法相比仍有13 dB的差距,但结果表明基于图的正则化为资源受限的医学成像场景提供了有利的效率-质量权衡。

英文摘要

Low-dose computed tomography (LDCT) reconstruction faces a critical tradeoff between reconstruction quality and resource requirements. While recent deep learning methods achieve state-of-the-art performance, they typically rely on over 500,000 parameters trained on large-scale datasets exceeding 35,000 scans. This work investigates whether graph-based regularization can provide meaningful noise reduction under strict resource constraints. We propose Deep Graph Laplacian Regularization (Deep GLR), integrating quadratic graph regularization into a Proximal Forward-Backward Splitting optimization framework with three lightweight CNN modules. Evaluated on the LoDoPaB-CT benchmark, Deep GLR achieves 30.70 dB PSNR, representing a 6.33 dB improvement over filtered backprojection, while using only 91,848 parameters trained on 1000 samples (2.8\% of standard training set). Compared to benchmark methods, this represents 5.8 times better parameter efficiency and 30 times better data efficiency per dB improvement. The learned graph bandwidth parameter ($ε$=1.25) converges to interpretable values, suggesting the method captures meaningful image priors rather than overfitting. While a 13 dB gap remains versus state-of-the-art methods, results demonstrate that graph-based regularization provides a favorable efficiency-quality tradeoff for resource-constrained medical imaging scenarios.
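Quadratic graph Laplacian regularization can be illustrated on a toy 1-D signal. The sketch below is plain GLR denoising with an identity forward operator, not the unrolled Proximal Forward-Backward Splitting network with CNN modules that the abstract describes; the chain graph, the Gaussian edge weights, the value of `mu`, and the function names are all assumptions. Only the bandwidth value 1.25 is taken from the abstract, and its use as a Gaussian kernel width here is itself an assumption.

```python
import numpy as np


def chain_laplacian(signal: np.ndarray, eps: float = 1.25) -> np.ndarray:
    """Combinatorial Laplacian L = D - W of a chain graph over the samples.

    Neighbouring samples i and i+1 are joined with a Gaussian weight
    exp(-(s_i - s_{i+1})^2 / (2 * eps^2)); eps acts as the bandwidth.
    """
    n = signal.size
    weights = np.exp(-(np.diff(signal) ** 2) / (2.0 * eps**2))
    adjacency = np.zeros((n, n))
    idx = np.arange(n - 1)
    adjacency[idx, idx + 1] = weights
    adjacency[idx + 1, idx] = weights
    return np.diag(adjacency.sum(axis=1)) - adjacency


def glr_denoise(noisy: np.ndarray, mu: float = 1.0, eps: float = 1.25) -> np.ndarray:
    """Minimise ||noisy - x||^2 + mu * x^T L x, i.e. solve (I + mu L) x = noisy."""
    laplacian = chain_laplacian(noisy, eps)
    return np.linalg.solve(np.eye(noisy.size) + mu * laplacian, noisy)
```

Because the objective is quadratic, the minimiser has the closed form used above; a larger `mu` smooths more strongly, while a constant signal passes through unchanged.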

2605.25346 2026-05-26 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC 版本更新

Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers

用于学习和规划的并行可微可达性:带认证的神经动力学与控制器

Keyi Shen, Glen Chou

发表机构 * MIT(麻省理工学院)

AI总结 提出一种基于JAX的并行可微可达性框架,结合泰勒模型流形构建与CROWN线性界传播,支持GPU批处理和自动微分,并用于认证训练和可达性感知的MPC,在非抓取操作和四旋翼任务中实现在线规划与有界不确定性下的认证可达集过近似。

Comments Robotics: Science and Systems XXII (RSS 2026)

详情
AI中文摘要

神经网络动力学模型和控制策略在机器人领域取得了强大性能,但在不确定性下提供可靠保证仍然困难,尤其是对于闭环神经网络系统。现有的可达性工具提供了形式化的过近似,但通常不可微、过于保守或对于现代学习和在线规划流程来说太慢。为了解决这个问题,我们提出了一个在JAX中可并行化、可微的可达性框架,适用于连续和离散时间系统,具有解析和基于神经网络的动力学和控制器。我们的框架通过统一表示结合了泰勒模型流形构建和CROWN风格的线性界传播,该表示在支持GPU批处理计算和自动微分的同时保留了仿射依赖。基于这个可达性基元,我们开发了(i)一种认证训练方法,鼓励生成对可达性友好的动力学模型和控制器,以及(ii)一种具有基于梯度细化的可达性感知采样MPC方案。在非抓取操作和四旋翼任务上的实验,包括硬件和更高维度的评估(高达72维),展示了在实际在线规划中保持有界不确定性下认证可达集过近似的可行性。

英文摘要

Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncertainty remains difficult, especially for closed-loop NN systems. Existing reachability tools provide formal over-approximations, yet are often non-differentiable, overly conservative, or too slow for modern learning and online planning pipelines. To address this, we present a parallelizable, differentiable reachability framework in JAX for continuous- and discrete-time systems with analytical and NN-based dynamics and controllers. Our framework combines Taylor-model flowpipe construction with CROWN-style linear bound propagation through a unified representation that preserves affine dependencies while supporting GPU-batched computation and automatic differentiation. Building on this reachability primitive, we develop (i) a certified training method that encourages reachability-friendly dynamics models and controllers, and (ii) a reachability-aware sampling-based MPC scheme with gradient-based refinement. Experiments on non-prehensile manipulation and quadrotor tasks, including hardware and higher-dimensional evaluations (up to 72D), demonstrate practical online planning while maintaining certified reachable-set over-approximations under bounded uncertainty.

2605.25344 2026-05-26 cs.CL cs.AI cs.LG quant-ph 版本更新

A general tensor-structured compression scheme for efficient large language models

一种用于高效大语言模型的通用张量结构压缩方案

Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su

发表机构 * School of Physical Sciences, University of Chinese Academy of Sciences(中国科学院大学物理科学学院) Kavli Institute for Theoretical Sciences, University of Chinese Academy of Sciences(中国科学院大学卡弗里理论科学研究所) Center for Quantum Physics and Intelligent Sciences, Department of Physics, Capital Normal University(首都师范大学量子物理与智能科学中心) Institute of Theoretical Physics, Chinese Academy of Sciences(中国科学院理论物理研究所)

AI总结 提出张量混合(MixT)方案,通过将密集线性层替换为张量算子混合体,在保持MMLU准确率的同时大幅减少参数、FLOPs和内存。

Comments 12 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)主要由密集线性变换主导,其存储、内存和计算开销阻碍了高效的适配和部署,同时掩盖了结构简化对功能的影响。本文提出张量混合(MixT),一种通用的张量结构压缩方案,将目标密集线性层替换为可原生执行的张量算子混合体。MixT直接作用于通用线性投影而非模型特定组件,因此可能适用于基于Transformer的LLMs及其他密集神经映射。我们在统一的恢复协议下对Qwen3-8B和LLaMA2-7B评估MixT,识别出一个广泛的压缩区域,在该区域内MMLU准确率基本保持不变,直到模型特定边界处出现突变。该突变与输出熵、预测熵和层间几何的协同变化同时发生。在LLaMA2-7B的突变边界处,MixT将全模型参数减少47.5%,推理FLOPs减少37.1%,训练FLOPs减少52.1%,峰值推理内存减少60.4%,展示了其在低成本LLM压缩中的实际潜力。

英文摘要

Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5\%, inference FLOPs by 37.1\%, training FLOPs by 52.1\% and peak inference memory by 60.4\%, demonstrating its practical potential for lower-cost LLM compression.

2605.25338 2026-05-26 cs.LG cs.AI 版本更新

CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

CausalFlow: LLM Agent 失败的因果归因与反事实修复

Akash Bonagiri, Devang Borkar, Gerard Janno Anderias, Setareh Rafatirad, Houman Homayoun

发表机构 * Department of Computer Science University of California, Davis(计算机科学系加州大学戴维斯分校)

AI总结 提出CausalFlow框架,通过反事实干预计算步骤级因果责任分数,识别失败步骤并生成最小编辑修复,用于测试时修复和训练时监督,在多个基准上优于启发式方法。

详情
AI中文摘要

大型语言模型(LLM)代理在涉及推理、工具使用和环境交互的多步任务中经常失败。虽然此类失败通常被记录或通过启发式重试处理,但它们包含了关于执行中断位置的结构化信号。我们提出了CausalFlow,一个干预框架,将失败的代理轨迹转换为最小的反事实修复和可重用的监督。CausalFlow将执行轨迹建模为依赖步骤的顺序链,并通过步骤级反事实干预计算因果责任分数(CRS)来识别导致失败的步骤。对于这些步骤,我们生成最小编辑修复,将最终结果翻转为成功,产生形式为(错误步骤,修正步骤)的验证对比对。CausalFlow支持两种互补用途:具有最小行为漂移的针对性测试时修复,以及适用于离线偏好优化或奖励建模的训练时监督。在涵盖数学推理、代码生成、问答和医学浏览的四个基准测试中,CausalFlow将失败执行转换为具有高最小性和因果一致性分数的验证最小修复,并证明因果归因对于跨不同代理任务的可靠改进是必要的,在复杂检索设置中优于启发式细化,同时产生更局部的修复。这些结果表明,对结构化执行轨迹的干预分析提供了一种原则性和可扩展的机制,将代理失败转化为可靠性提升和可学习的监督。

英文摘要

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores (CRS) via step-level counterfactual intervention to identify failure-inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test-time repair that recovers from failures with minimal behavioral drift, and training-time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal-consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning-ready supervision.

2605.25313 2026-05-26 cs.LG cs.AI cs.RO stat.ML 版本更新

UWM-JEPA: Predictive World Models That Imagine in Belief Space

UWM-JEPA:在信念空间中进行想象的世界预测模型

Santosh Kumar Radha, Oktay Goktas

发表机构 * AgentField AI

AI总结 针对部分可观测环境,提出UWM-JEPA模型,通过密度矩阵潜变量和酉预测器在信念空间中保持联合状态谱,实现长时域盲推演下的不确定性保持,显著优于向量潜变量基线。

Comments 14 pages, 6 figures, 7 tables. Code and data: https://github.com/santoshkumarradha/uwm-jepa

详情
AI中文摘要

部分可观测环境下的世界模型必须想象多个兼容的隐藏未来,并在反事实动作下引导它们。联合嵌入预测架构(JEPAs)在潜在空间中实现这一点,但向量值潜变量没有内部结构来承载盲推演过程中隐藏连续性的信念。我们引入了酉世界模型JEPA(UWM-JEPA),这是一种JEPA世界模型,具有在联合系统-环境空间上的密度矩阵潜变量和学习的酉预测器。该结构在推演过程中精确保持联合状态谱,因此预测器本身不会耗散表示的不确定性。在一个需要根据给定动作序列进行五步前向模拟且目标观测被掩蔽的隐藏速度指示任务中,UWM-JEPA达到0.77的准确率,并且随着动作被扰动而单调下降;而参数匹配的LSTM-JEPA在相同的反事实目标训练目标和动作头下训练,在所有动作条件下都崩溃为多数类准确率(0.53)。在盲推演下,UWM-JEPA在短时域上损失不到十个点的探针R^2,而向量潜变量基线损失四十一个和六十八个点;两者在保留的上下文探针上表现相当,表明差异在于预测器而非编码器。动作敏感性本身需要针对反事实而非教师强制目标进行训练,这一发现适用于酉参数化之外。对于JEPA世界模型在部分可观测性下进行想象,潜变量几何和预测器动力学至关重要,而不仅仅是冻结的上下文编码能力。

英文摘要

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.

2605.25293 2026-05-26 cs.CV cs.AI cs.RO 版本更新

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

基于神经形态激光雷达的鸟瞰图目标检测:使用节能脉冲神经网络

Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

发表机构 * Valeo, Germany(德国法雷奥) Valeo, Ireland(爱尔兰法雷奥) TU Ilmenau, Germany(德国伊尔梅瑙工业大学)

AI总结 提出一种端到端脉冲编码器-解码器网络,用于激光雷达点云鸟瞰图表示中的目标检测,通过代理梯度反向传播训练,在KITTI基准上达到高精度,并实现3.33倍突触操作能耗降低。

详情
AI中文摘要

自动驾驶感知需要在严格的功耗约束下对三维传感器数据进行准确高效的处理。传统卷积神经网络实现了强大的检测精度,但计算密集,限制了其在资源受限的神经形态平台上的部署。脉冲神经网络通过事件驱动的稀疏计算提供了一种引人注目的替代方案,但其在复杂真实世界感知任务(如三维目标检测)中的应用仍然有限。在这项工作中,我们提出了一种端到端脉冲编码器-解码器网络,用于激光雷达点云鸟瞰图表示中的目标检测,并使用代理梯度反向传播进行训练。我们训练了两个变体:一个膜电位变体,在输出阶段读取连续神经元状态以获得最大精度,在$\mathrm{IoU}\!=\!0.5$(简单/中等/困难)下达到$92.05$/$87.04$/$86.51$ AP;以及一个全二进制脉冲变体,每一层仅操作脉冲序列,用于直接神经形态部署。我们评估了四种输入脉冲编码策略,并证明在KITTI基准上(该基准没有连续帧,因此将BEV输入在各时间步重复呈现,作为时间流的代理),允许网络直接从数据学习脉冲表示优于手工设计的泊松、延迟和z轴编码方案。分块能量分析表明,在保守的基于循环的操作下,与等效CNN相比,突触操作能量降低了$3.33\times$。这些结果共同证明了脉冲神经网络在自动驾驶中实现准确且节能的神经形态感知的可行性。

英文摘要

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving $92.05$/$87.04$/$86.51$ AP at $\mathrm{IoU}\!=\!0.5$ (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a $3.33\times$ reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.

2605.25272 2026-05-26 cs.AI cs.CY stat.AP 版本更新

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

AI 制图:绘制 AI 基准生态系统的潜在景观

Michael Hardy, Anka Reuel, Lijin Zhang, Jodi M. Casabianca, Sang Truong, Yash Dave, Hansol Lee, Benjamin Domingue, Sanmi Koyejo

发表机构 * Open LLM Leaderboard(开放大语言模型排行榜) HELM ICML(国际机器学习会议)

AI总结 针对排行榜分数受测量噪声影响的问题,提出基于验证性因子分析和概化理论的框架,分解排名方差来源,揭示基准间关系、局部依赖性及元数据影响,并比较显式与潜在缩放律的可靠性。

详情
AI中文摘要

虽然总体排行榜分数驱动着 AI 发展,但它们包含大量测量噪声,其来源和幅度尚未量化,使得排名何时反映真实能力差异何时反映评估伪像尚不明确。我们引入了一个用于测量 AI 基准生态系统中潜在景观的框架。将验证性因子分析(CFA)和概化理论应用于 Open LLM Leaderboard 上的 4000 多个模型,我们分解了排名方差的来源并确定:(1)当前报告实践中假设的结构低估了基准之间关系的强度;(2)排行榜项目之间存在局部依赖性的证据,这削弱了在当前评分系统下将基准用作测量工具的有效性;(3)在此背景下,贡献者元数据解释了比架构或部署类别更多的排名相关方差(约 9%);(4) 显式分数的“缩放律”斜率可靠性较低($R_β=0.53$);相比之下,潜在通用因子大小斜率在生态系统控制下高度稳定($R_g=0.97$)。我们能够提供对基准动态的独特见解,例如哪些基准是 LLM 规模的函数,哪些可能受到后训练实践的相反影响。我们提供了可操作的诊断方法,以确定如何信任基准排名以及如何改进基准设计。

英文摘要

While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance ($\approx9\%$) than architecture or deployment categories in this context; (4) a manifest-score "scaling law" slope has low reliability ($R_β=0.53$); by contrast, the latent general-factor size slope is highly stable across ecosystem controls ($R_g=0.97$). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.

2605.25271 2026-05-26 math.AG cs.AI cs.NE 版本更新

Positivity in classical enumerative geometry: a case study in synchronized AI-assisted mathematics

经典枚举几何中的正性:同步AI辅助数学的案例研究

Gergely Bérczi, László M. Fehér

发表机构 * Department of Mathematics, Aarhus University(奥胡斯大学数学系) Eötvös University Budapest(布达佩斯罗兰大学) Alfréd Rényi Institute of Mathematics(阿尔弗雷德·雷尼数学研究所)

AI总结 研究对称多项式∏_{α∈A_{n,d}}(1+α_1 x_1+⋯+α_n x_n)(即Sym^d(C^n)的全陈类)的齐次部分c_k(n,d)的结构,通过AI与人类协作证明相关猜想、建立显式公式并研究对数凹性。

Comments 29 pages

详情
AI中文摘要

我们研究对称多项式 $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$,其中 $A_{n,d}:=\{\alpha\in\mathbb{Z}_{\ge 0}^n:|\alpha|=d\}$,它是 $\mathrm{Sym}^d(\mathbb{C}^n)$ 的全陈类,视为一个环面表示,其陈根为权重 $\alpha_1 x_1+\cdots+\alpha_n x_n$($\alpha\in A_{n,d}$)。其齐次 $k$ 次部分 $c_k(n,d)$ 是 $\mathrm{Sym}^d(\mathbb{C}^n)$ 的第 $k$ 个陈类。这些陈类及其在各种对称函数基中的系数在枚举几何中起着核心作用。尽管定义简单,但其系数的通用封闭公式却十分微妙,且这些类的许多结构性质至今仍知之甚少。 在本文中,我们证明了关于其结构的几个猜想,建立了显式公式,并研究了陈类及其 $K$ 理论类比的对数凹性。在秩为二的情况下,通过过渡到 Schur 基并将 Schur 系数在 $d$ 的二项式基中展开,我们发现了一种新的二项式对数凹性现象,并证明了精细的正性结果。 本文展示了一种新颖的方法论:我们将多个AI系统与人类数学洞察力结合在协调的工作流程中,根据每个工具在实验发现、猜想形成、符号证明构建和验证方面的优势进行部署。据我们所知,这是协调多个AI工具在连贯的数学研究项目中取得实质性进展的首批详细案例研究之一。

英文摘要

We study the symmetric polynomial $\prod_{α\in A_{n,d}}\bigl(1+α_1 x_1+\cdots+α_n x_n\bigr)$ where $A_{n,d}:=\{α\in\mathbb{Z}_{\ge 0}^n:|α|=d\}$, which is the total Chern class of $\mathrm{Sym}^d(\mathbb{C}^n)$, viewed as a torus representation whose Chern roots are the weights $α_1 x_1+\cdots+α_n x_n$ for $α\in A_{n,d}$. Its homogeneous degree-$k$ part $c_k(n,d)$ is the $k$-th Chern class of $\mathrm{Sym}^d(\mathbb{C}^n)$. These Chern classes, together with their coefficients in various symmetric function bases, play a central role in enumerative geometry. Despite their simple definition, general closed formulas for their coefficients are subtle, and many structural properties of these classes have remained poorly understood. In this paper we prove several conjectures concerning their structure, establish explicit formulas, and study log-concavity properties for both the Chern classes and their $K$-theoretic analogue. In rank two, passing to the Schur basis and expanding the Schur coefficients in the binomial basis of $d$, we uncover a new binomial log-concavity phenomenon and prove refined positivity results. The paper demonstrates a novel methodology: we combine several AI systems with human mathematical insight in a coordinated workflow, deploying each tool according to its strengths in experimental discovery, conjecture formation, symbolic proof construction, and verification. To our knowledge, this is one of the first detailed case studies of orchestrating multiple AI tools to make substantial progress on a coherent mathematical research project.
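As a small sanity check of the definition above, the following Python sketch expands the product over $A_{n,d}$ and extracts the homogeneous degree-$k$ part. The function names (`weights`, `poly_mul`, `total_chern_class`, `chern_class`) and the dictionary representation of polynomials are illustrative assumptions, not code from the paper, and brute-force expansion is only practical for small $n$ and $d$.

```python
from collections import defaultdict
from itertools import product


def weights(n, d):
    """All alpha in Z_{>=0}^n with |alpha| = d, i.e. the set A_{n,d}."""
    return [a for a in product(range(d + 1), repeat=n) if sum(a) == d]


def poly_mul(p, q):
    """Multiply polynomials stored as {exponent tuple: integer coefficient}."""
    out = defaultdict(int)
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            out[tuple(i + j for i, j in zip(e1, e2))] += c1 * c2
    return {e: c for e, c in out.items() if c != 0}


def total_chern_class(n, d):
    """Expand prod over alpha in A_{n,d} of (1 + alpha_1 x_1 + ... + alpha_n x_n)."""
    zero = (0,) * n
    result = {zero: 1}
    for alpha in weights(n, d):
        # Build the linear factor 1 + sum_i alpha_i x_i.
        factor = {zero: 1}
        for i, a in enumerate(alpha):
            if a:
                unit = tuple(1 if j == i else 0 for j in range(n))
                factor[unit] = a
        result = poly_mul(result, factor)
    return result


def chern_class(n, d, k):
    """Homogeneous degree-k part c_k(n, d) of the expanded product."""
    return {e: c for e, c in total_chern_class(n, d).items() if sum(e) == k}


# n = 2, d = 2 expands (1 + 2*x1)(1 + x1 + x2)(1 + 2*x2).
print(chern_class(2, 2, 2))
```

For $n=d=2$ the degree-2 part works out by hand to $c_2(2,2)=2x_1^2+8x_1x_2+2x_2^2$, which is what the dictionary keyed by exponent tuples encodes.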

2605.25267 2026-05-26 cs.LG cs.AI 版本更新

Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

潜在Q-屏障屏蔽用于安全上下文强化学习

Minjae Kwon, Amir Moeini, Shangtong Zhang, Lu Feng

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 提出一种潜在Q-屏障屏蔽方法,通过学习上下文表示、潜在动力学和集成成本评论家,在部署时无需参数更新即可根据剩余预算和预测未来成本过滤或软重加权候选动作,从而改善安全上下文强化学习在分布外转移下的奖励-安全权衡。

详情
AI中文摘要

安全上下文强化学习(ICRL)在测试时不更新参数,仅从交互历史中在线适应,同时将情节成本控制在安全预算内。在分布外(OOD)部署转移下,仅预训练的安全ICRL可能产生较差的奖励-安全权衡,因为剩余预算仅通过冻结的策略条件影响行为,而非通过针对预测未来成本的显式动作级检查。我们提出一种潜在Q-屏障屏蔽,在部署前学习上下文表示、潜在动力学和集成成本评论家。无需参数更新,该屏蔽从历史中推断上下文,并使用剩余预算和预测未来成本过滤或软重加权候选动作。我们证明了一个条件性的、误差分解的屏障-边际结果:满足Q-屏障的动作将下一个潜在预算状态置于近似预算安全的延续中(在学习的评论家下),误差上界由贝尔曼误差和潜在预测误差决定。在五个安全ICRL基准测试中,该屏蔽在部署时相比强安全ICRL基线改善了奖励-安全权衡:在短上下文窗口后,它在五个基准中的四个上实现了更高的回报,同时在所有五个基准中匹配或降低了平均情节成本。

英文摘要

Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pretraining-only safe ICRL can give poor reward-safety tradeoffs because the remaining budget affects behavior only through frozen policy conditioning, not an explicit action-level check against predicted future cost. We propose a latent Q-Barrier shield that learns a context representation, latent dynamics, and an ensemble cost critic before deployment. Without parameter updates, the shield infers context from history and filters or softly reweights candidate actions using the remaining budget and predicted future cost. We prove a conditional, error-decomposed barrier-margin result: a Q-Barrier-satisfying action leaves the next latent-budget state with an approximately budget-safe continuation under the learned critic, up to Bellman and latent-prediction errors. Across five safe ICRL benchmarks, the shield improves deployment-time reward-safety tradeoffs over a strong safe-ICRL baseline: after a short context window, it achieves higher return in four of five benchmarks while matching or lowering average episode cost in all five.
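The action-level check described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `shield_actions`, the use of the ensemble maximum as a pessimistic cost estimate, and the least-cost fallback are all hypothetical choices, and the learned context representation, latent dynamics, and soft reweighting variant are omitted.

```python
def shield_actions(candidates, cost_ensemble, remaining_budget):
    """Filter candidate actions by predicted future cost versus remaining budget.

    candidates: list of actions
    cost_ensemble: list of callables, each mapping an action to a predicted
        future cost (stand-ins for an ensemble cost critic; hypothetical)
    remaining_budget: safety budget left in the current episode
    """
    def pessimistic_cost(action):
        # Assumption: take the worst case over the ensemble members.
        return max(critic(action) for critic in cost_ensemble)

    safe = [a for a in candidates if pessimistic_cost(a) <= remaining_budget]
    if safe:
        return safe
    # Assumption: if nothing fits the budget, fall back to the least costly action.
    return [min(candidates, key=pessimistic_cost)]


# Toy usage with two hand-written "critics".
critics = [
    lambda a: {"left": 1.0, "right": 3.0, "stay": 0.5}[a],
    lambda a: {"left": 2.0, "right": 2.5, "stay": 0.4}[a],
]
print(shield_actions(["left", "right", "stay"], critics, remaining_budget=2.0))
```

The point of the sketch is only that the remaining budget enters as an explicit per-action test at deployment time, rather than solely through the conditioning of a frozen policy.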

2605.25263 2026-05-26 cs.CL cs.AI 版本更新

Mimir: Large-scale Multilingual Concept Modeling

Mimir:大规模多语言概念建模

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile

发表机构 * Department of Computer Science(计算机科学系) University of Bari Aldo Moro(巴里阿尔多·莫罗大学)

AI总结 提出Mimir,一个1.6B参数的大规模概念模型,通过多语言预训练和指令微调实现概念级别的理解与生成,替代传统的token预测范式。

详情
AI中文摘要

当前的语言建模方法围绕token构建。文本语料被分割成token,模型通过对这些token进行计算来训练,例如根据前文预测下一个token。这一范式已成为现代语言建模的标准,尤其是基于token的架构取得了卓越性能。然而,最近的研究不仅开始质疑语言模型如何从token中处理和理解意义,还开始质疑使用更高级别的粒度是否能推动研究领域的发展。这引出了概念建模的想法,即直接训练模型进行下一个概念预测,而非下一个token预测。目标是输入从token转变为概念,迫使底层语言模型将其粒度从细粒度的token转变为广泛的概念。在这项工作中,我们介绍了Mimir,一个1.6B参数的大规模概念模型,用于多语言概念理解和生成。我们利用了一个大规模多语言预训练语料库(38,883,987,240个句子),涵盖46种语言,以及一个大规模多轮多语言指令微调数据集(66,816,428个句子),覆盖总共35种语言。我们针对一个参数数量相当的语言模型,对模型性能进行了广泛评估。

英文摘要

Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have not only begun to question how language models process and understand meaning from tokens, but also to question whether using higher levels of granularity could advance the research field. This led to the idea of Concept Modeling, that is, to directly train models for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn and multilingual instruction-tuning dataset (66,816,428 sentences) covering a total of 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.

2605.25258 2026-05-26 cs.IR cs.AI cs.CY cs.LG 版本更新

First, do no harm: Breaking suicidogenic echo chambers in media recommendation

首先,不伤害:打破媒体推荐中的自杀性回音室

Alberto Díaz-Álvarez, Raúl Lara-Cabrera, Fernando Ortega-Requena, Víctor Ramos-Osuna

发表机构 * E.T.S.I. Sistemas Informáticos (Universidad Politécnica de Madrid)(马德里理工大学信息系统工程系)

AI总结 针对推荐系统在心理健康场景中可能加剧用户自杀倾向的问题,提出RankAid重排序方法,通过惩罚有害内容并提升治疗性内容,在保持推荐准确性的同时确保临床安全。

Comments 10 pages, 5 figures. Research on safety-aware recommender systems and algorithmic ethics

详情
AI中文摘要

推荐系统通常优化用户参与度,但在心理健康背景下这种方法存在危险。当脆弱用户表现出自杀意念迹象时,标准算法往往将他们困在有害内容的回音室中,恶化其心理状态。为此,我们引入RankAid,一种重排序方法,在预测相关性的同时优先考虑临床安全性。它作为现有模型的附加层运行:根据用户当前的脆弱程度惩罚风险项目并提升治疗性内容。我们使用MovieLens 1M数据集评估了该方法,其中项目通过大语言模型进行了临床风险和治疗价值的语义注释。我们的模拟表明,该算法在危机高峰期成功阻止了有害内容的推荐,主动重塑信息流以支持情绪降级。此外,这种安全干预仅导致标准准确性指标(如NDCG)可控且可接受的下降。通过使用非对称超参数,RankAid还使系统管理员能够根据特定的临床指南调整干预的严重程度。

英文摘要

Recommender systems generally optimise user engagement, but this approach is dangerous in mental health contexts. When vulnerable users show signs of suicidal ideation, standard algorithms often trap them in echo chambers of harmful content, worsening their psychological state. In response, we introduce RankAid, a re-ranking method that prioritises clinical safety alongside predictive relevance. It works as an add-on layer to existing models: it penalises risky items and boosts therapeutic content depending on the user's current level of vulnerability. We evaluated this approach using the MovieLens 1M dataset, where items were semantically annotated for clinical risk and therapeutic value using large language models. Our simulations show that our algorithm successfully blocks the recommendation of harmful content during crisis peaks, actively reshaping the feed to support emotional de-escalation. Furthermore, this safety intervention only causes a controlled, acceptable drop in standard accuracy metrics like NDCG. By using asymmetric hyperparameters, RankAid also gives system administrators the flexibility to tune the severity of the intervention based on specific clinical guidelines.
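A re-ranking layer of the kind described here can be sketched as below. The scoring formula is an assumption made for illustration (the abstract does not give RankAid's actual formula): the base relevance is reduced by a vulnerability-scaled risk penalty and increased by a vulnerability-scaled therapeutic bonus, with two separate weights standing in for the asymmetric hyperparameters. The names `rerank`, `risk_weight`, and `therapy_weight` are hypothetical.

```python
def rerank(items, vulnerability, risk_weight=2.0, therapy_weight=1.0):
    """Re-rank items from a base recommender with a safety adjustment.

    items: list of dicts with keys 'id', 'relevance', 'risk', 'therapeutic',
        where the last three are assumed to lie in [0, 1]
    vulnerability: current user vulnerability in [0, 1]; 0 leaves the
        base ranking unchanged
    """
    def adjusted(item):
        # Assumed linear form: penalise risk, boost therapeutic value.
        return (
            item["relevance"]
            - risk_weight * vulnerability * item["risk"]
            + therapy_weight * vulnerability * item["therapeutic"]
        )

    return sorted(items, key=adjusted, reverse=True)


catalogue = [
    {"id": "a", "relevance": 0.9, "risk": 0.8, "therapeutic": 0.0},
    {"id": "b", "relevance": 0.6, "risk": 0.0, "therapeutic": 0.7},
    {"id": "c", "relevance": 0.7, "risk": 0.1, "therapeutic": 0.1},
]
print([item["id"] for item in rerank(catalogue, vulnerability=0.0)])
print([item["id"] for item in rerank(catalogue, vulnerability=1.0)])
```

With zero vulnerability the list is ordered purely by relevance; at full vulnerability the high-risk item drops to the bottom and the therapeutic item rises to the top. Setting `risk_weight` larger than `therapy_weight` is one way to make the intervention asymmetric.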

2605.25254 2026-05-26 cs.CV cs.AI 版本更新

Guess the Unified Model: How Much Can We Recover from Generated Images?

猜猜统一模型:从生成的图像中我们能恢复多少?

Jasin Cekinmez, Ryo Mitsuhashi, Addison J. Wu, Yida Yin

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文研究统一模型生成图像的可分离性,通过七个模型的大量图像实验,发现模型归因高度可行,且语义内容对可分离性有贡献但非主导信号。

详情
AI中文摘要

随着统一模型生成的图像现在在线广泛传播,追溯其来源模型为透明度和深入理解单个模型的特征行为提供了一条途径。先前的工作已经探索了LLM生成文本、扩散模型图像和数据集的来源,但统一模型生成图像的可分离性仍然是一个未充分探索的领域。我们通过使用七个统一模型生成的图像,检查在损坏、领域和提示语言上的可分离性来填补这一空白。我们表明模型归因高度可行,因为我们的模型在每个模型约20K图像的情况下达到了近乎完美的准确率。损坏和结构扰动对归因性能的影响较小,跨领域泛化表明语义内容对可分离性有贡献,但并非主导信号。最后,我们观察到对于大多数模型,提示语言归因接近随机水平,表明语言特定的视觉特征极少。这些发现突显了统一模型输出中一致的模型特定视觉特征,并为追踪和审计生成图像流水线开辟了新方向。

英文摘要

With unified model-generated images now widespread online, attributing their model of origin offers a path toward transparency and deeper insight into the characteristic behaviors of individual models. Prior work has explored provenance in LLM-generated text, diffusion model images, and datasets, but the separability of unified model-generated images remains an underexplored area. We address this gap by examining separability across corruption, domains, and prompt languages using images generated by seven unified models. We show that model attribution is highly feasible as our model achieves near-perfect accuracy with around 20K images per model. Corruptions and structural perturbations have only a modest effect on attribution performance, and cross-domain generalization reveals that semantic content contributes to separability but is not the dominant signal. Finally, we observe that for most models, prompt language attribution is around chance levels, suggesting minimal language-specific visual signatures. These findings highlight consistent model-specific visual characteristics in unified models outputs and open new directions for tracing and auditing generative image pipelines.

2605.23491 2026-05-26 cs.LG cs.AI cs.CL 版本更新

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

CoSPlay: 测试时协作自我博弈与自生成代码和单元测试

Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Institute of Deep Perception Technology, JITRI, Wuxi, China(深度感知技术研究院,无锡,中国)

AI总结 提出CoSPlay框架,通过代码与单元测试的协作自我博弈,在无真实单元测试的情况下迭代优化两者,显著提升代码生成性能。

Comments Code is available at: https://github.com/sanae-ai/CosPlay | Data and logs are available at: https://huggingface.co/datasets/yomi017/CosPlay

详情
AI中文摘要

最近,可验证奖励强化学习(RLVR)和测试时扩展(TTS)通过可执行验证推动了LLM代码生成的发展。然而,真实单元测试(GT UTs)仍然是瓶颈:最先进的RLVR方法需要它们进行昂贵的训练,而现有的TTS方法在没有它们的情况下会失去竞争力。这促使了无GT的TTS,其中现有方法直接使用自生成的UT来优化和选择代码候选。然而,这些UT通常带有噪声或与错误代码虚假耦合,而UT质量在没有可靠代码的情况下也无法验证。因此,关键挑战是同时改进两者。为此,我们提出了CoSPlay,一个无GT、无需训练的框架,通过协作自我博弈同时改进代码和UT。它首先探索多样化的解决方案思路,识别其潜在失败模式以生成有区分力的UT思路。然后,它利用代码-UT执行矩阵中的双向通过计数信号,迭代地修剪或修复弱代码,并刷新或替换不可靠的UT,使两个池共同进化。最后,当多个代码在最高通过计数上并列时,它从最大的输出共识簇中选择最终代码,因为正确的代码在相同输入上一致,而错误的代码则发散。在四个具有挑战性的基准上的实验表明,CoSPlay在Qwen2.5-7B-Instruct上将平均BoN从22.1%提升到33.2%,UT准确率从14.6%提升到78.3%,匹配或超越了RLVR模型CURE-7B。当应用于CURE-7B时,它进一步将BoN提高了5.7%。CoSPlay还能跨不同骨干网络泛化,并在相当的token预算下优于无GT的TTS基线,且随着预算增加持续获益。这些结果表明,无需任何GT数据即可实现竞争性代码生成的可扩展推理策略。

英文摘要

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
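The final selection step (highest pass count, then largest output-consensus cluster) can be sketched as follows. This is an illustrative reading of the abstract rather than the released code: the names `select_code` and `unit_test_pass_counts`, the Boolean matrix layout, and the use of exact output equality to form clusters are assumptions, and the iterative pruning, fixing, and refreshing of the two pools is not shown.

```python
from collections import defaultdict


def unit_test_pass_counts(pass_matrix):
    """Column sums: how many code candidates pass each unit test."""
    return [sum(col) for col in zip(*pass_matrix)]


def select_code(pass_matrix, outputs):
    """Pick a final code candidate.

    pass_matrix[i][j]: True if code i passes unit test j
    outputs[i]: tuple of code i's outputs on a shared set of probe inputs
    Returns the index of the chosen code.
    """
    code_scores = [sum(row) for row in pass_matrix]
    best = max(code_scores)
    tied = [i for i, score in enumerate(code_scores) if score == best]

    # Group tied candidates by identical outputs (assumed clustering rule).
    clusters = defaultdict(list)
    for i in tied:
        clusters[outputs[i]].append(i)

    # Choose the largest consensus cluster and return its first member.
    largest = max(clusters.values(), key=len)
    return largest[0]


matrix = [
    [True, True, False],   # code 0
    [True, True, False],   # code 1
    [True, False, True],   # code 2
    [True, False, False],  # code 3
]
probe_outputs = [(1, 4), (1, 4), (1, 5), (0, 0)]
print(select_code(matrix, probe_outputs), unit_test_pass_counts(matrix))
```

Codes 0, 1, and 2 tie on pass count here; codes 0 and 1 agree on the probe outputs, so the consensus rule prefers that cluster over the lone code 2.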

2605.23473 2026-05-26 cs.LG cs.AI 版本更新

Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension

面向未知有效维度的实用贝叶斯优化的自动随机嵌入

Hong Qian, Xiang Shu, Xiang Xia, Xuhui Liu, Yangde Fu, Bei Liang, Huibin Wang, Liang Dou

发表机构 * Shanghai Institute of AI for Education, and School of Computer Science and Technology, East China Normal University(上海人工智能教育研究院,华东师范大学计算机科学与技术学院) Ant Group(蚂蚁集团) Nanjing University(南京大学)

AI总结 提出动态共享嵌入贝叶斯优化(DSEBO)方法,通过自动调整子空间维度并共享查询解,平衡近似与优化误差,在高维优化中显著降低遗憾和时间成本。

Comments This paper has been accepted by IJCAI 2026

详情
AI中文摘要

贝叶斯优化广泛应用于复杂黑箱函数的优化,但受维度灾难困扰。随机嵌入作为一种降维策略,通过在低维子空间中优化来简化具有有效维度的任务。然而,预先确定任务的有效维度仍是一个重大挑战,它影响子空间维度的选择和优化性能。传统方法使用专家提供的固定子空间维度,或依赖试错法估计子空间维度,消耗资源。为此,本文提出一种针对未知有效维度的高维贝叶斯优化的自动随机嵌入方法,称为动态共享嵌入贝叶斯优化(DSEBO)。DSEBO从低维度开始,如果当前子空间中的解显示初步收敛,则切换到更高维的子空间。DSEBO基于不同子空间中解的质量动态确定下一子空间的维度,并与新子空间共享已查询的解以实现更好的初始化。理论上,我们推导了DSEBO的遗憾界,并证明DSEBO能更好地平衡近似误差和优化误差。在维度规模变化的函数和未知有效维度的实际任务上的大量实验表明,与最先进方法相比,跨不同子空间的交替优化在高维优化中显著提高了优化遗憾和时间性能。

英文摘要

Bayesian optimization is widely employed for optimizing complex black-box functions but struggles with the curse of dimensionality. Random embedding, as a dimension reduction strategy, simplifies tasks that possess the effective dimension by optimizing within a low-dimensional subspace. However, determining the effective dimension of a task in advance remains a significant challenge, which influences the selection of the subspace dimensionality and the optimization performance. Traditional methods use fixed subspace dimensions provided by experts or rely on trial and error to estimate subspace dimensions with resources consumed. To this end, this paper proposes an automated random embedding for high-dimensional Bayesian optimization with unknown effective dimension, called Dynamic Shared Embedding Bayesian Optimization (DSEBO). DSEBO starts with a low dimension and switches to a higher subspace if the solutions in the current subspace show preliminary convergence. DSEBO dynamically determines the dimension of the next subspace based on the quality of the solutions in different subspaces and shares the queried solutions with the new subspace for a better initialization. Theoretically, we derive a regret bound for DSEBO and demonstrate that DSEBO can better balance approximation and optimization errors. Extensive experiments on functions with dimensionality of varying magnitudes and real-world tasks with unknown effective dimensions reveal that, compared with state-of-the-art methods, alternating optimization across different subspaces results in significant improvements in high-dimensional optimization, both in terms of optimization regret and time.

2605.22856 2026-05-26 eess.SP cs.AI cs.IT cs.LG cs.NI math.IT 版本更新

PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels

PilotWiMAE:面向无线信道的导频原生表示学习

Berkay Guler, Giovanni Geraci, Hamid Jafarkhani

发表机构 * Center for Pervasive Communications and Computing, University of California, Irvine(加州大学尔湾分校普及通信与计算中心) Nokia and Universitat Pompeu Fabra(诺基亚与庞培法布拉大学)

AI总结 提出PilotWiMAE自监督框架,直接处理噪声导频观测,通过分解注意力机制和补丁归一化重构,在缩小观测空间的同时实现跨频段波束选择和信道表征,优于监督基线。

详情
AI中文摘要

信道基础模型假设能够访问完全观测的信道,这一假设在部署中不成立。我们提出PilotWiMAE,一种自监督框架,其编码器直接接收噪声导频观测,注意力沿时间与联合空频处理轴分解,这是受问题物理特性启发的归纳偏置。导频输入将观测空间缩小两个数量级,并消除了全CSI可用性的不现实假设,同时降低延迟。分解设计通过利用可分离的信道结构生成鲁棒表示,并允许预训练掩码率达到$99\%$。我们将捕获小尺度衰落结构的补丁归一化重构与恢复大尺度衰落特征的辅助尺度损失相结合,并使用AWGN课程学习来匹配预训练和部署时的导频噪声。仅在$3.5$\,GHz上预训练,在$28$\,GHz上评估,涵盖分布内和分布外场景,PilotWiMAE的跨频段波束选择和信道表征在更小的观测空间上仍优于监督基线。为削弱解码器容量与表示质量之间的耦合,我们进一步提出在编码器-解码器联合预训练之后进行以解码器为中心的预训练阶段,使得PilotWiMAE在不牺牲表示质量的情况下展现出有竞争力的信道估计性能。为促进该方向的进一步研究,我们发布了PilotWiMAE预训练权重和训练流程,以及基于Sionna的射线追踪信道生成工具CSIGen和本文使用的信道数据集。

英文摘要

Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of $99\%$. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on $3.5$\,GHz and evaluated at $28$\,GHz across in-distribution and out-of-distribution settings, PilotWiMAE's cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.

2605.22795 2026-05-26 stat.ML cs.AI cs.LG math.ST stat.TH 版本更新

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

保守与非保守漂移模型的有限粒子收敛速率

Krishnakumar Balasubramanian

发表机构 * Department of Statistics, University of California, Davis(加州大学戴维斯分校统计系)

AI总结 针对一步生成建模,提出保守漂移方法(用核密度估计梯度速度替代位移速度)并证明连续时间有限粒子收敛界,同时分析非保守方法(Laplace核)的对应速率。

详情
AI中文摘要

我们提出并分析了一种用于一步生成建模的保守漂移方法。该方法将原始的基于位移的漂移速度替换为核密度估计(KDE)梯度速度,即核平滑数据得分与核平滑模型得分之差。该速度为梯度场,解决了通用基于位移的漂移场中发现的非保守性问题。我们证明了在$\mathbb{R}^d$上保守方法的连续时间有限粒子收敛界:联合熵恒等式给出了经验Stein漂移、KDE的平滑Fisher差异以及中心速度平方的界。主要的有限粒子校正是倒数KDE自相互作用项,我们给出了确定性和高概率的局部占据条件,在此条件下该项可控。我们保持求积常数显式并追踪其可能的带宽依赖性:在额外的$h$均匀求积正则条件下,根残差速度率为$N^{-1/(d+4)}$;而更一般的增长条件产生优化根速率$N^{-(2-β)/(2(d+4-β))}$,其中$0\le β<2$。我们还分析了使用Laplace核的非保守漂移方法,对应于Deng等人2026年(arXiv:2602.04770)提出的原始基于位移的速度。对于该方法,一个尖锐的伴随核将速度分解为尖锐得分不匹配的正标量预处理加上Laplace尺度不匹配残差,产生类似的有限粒子速率,但带有一个不可避免的残差项。最后,我们解释了如何通过显式漂移大小$η$将连续时间残差速度界转化为一步生成保证。

英文摘要

We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based drifting velocity by a kernel density estimator (KDE)-gradient velocity, namely the difference of the kernel-smoothed data score and the kernel-smoothed model score. This velocity is a gradient field, addressing the non-conservatism issue identified for general displacement-based drifting fields. We prove continuous-time finite-particle convergence bounds for the conservative method on $\mathbb{R}^d$: a joint-entropy identity yields bounds for the empirical Stein drift, the smoothed Fisher discrepancy of the KDE, and the squared center velocity. The main finite-particle correction is a reciprocal-KDE self-interaction term, and we give deterministic and high-probability local-occupancy conditions under which this term is controlled. We keep the quadrature constants explicit and track their possible bandwidth dependence: the root residual-velocity rate $N^{-1/(d+4)}$ holds under an additional $h$-uniform quadrature regularity condition, while a more general growth condition yields the optimized root rate $N^{-(2-β)/(2(d+4-β))}$, where $0\le β<2$. We also analyze the non-conservative drifting method with Laplace kernel, corresponding to the original displacement-based velocity proposed in Deng et al., 2026 (arXiv:2602.04770). For this method, a sharp companion kernel decomposes the velocity into a positive scalar preconditioning of a sharp-score mismatch plus a Laplace scale-mismatch residual, producing an analogous finite-particle rate with an unavoidable residual term. Finally, we explain how the continuous-time residual-velocity bounds translate into one-step generation guarantees through the explicit drift size $η$.

2605.22769 2026-05-26 cs.CL cs.AI 版本更新

Understanding Data Temporality Impact on Large Language Models Pre-training

理解数据时间性对大型语言模型预训练的影响

Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave

发表机构 * Kyutai

AI总结 研究预训练数据顺序对大型语言模型获取时间敏感事实知识的影响,通过构建包含7000多个时间相关问题的基准并训练60亿参数模型,发现按时间顺序训练比随机打乱训练能产生更及时和精确的知识。

详情
AI中文摘要

大型语言模型(LLMs)通常在打乱顺序的语料库上进行训练,导致模型的知识在训练时被冻结,其时间基础仍然难以理解。在这项工作中,我们研究了预训练动态对获取时间敏感事实知识的影响,特别关注数据顺序。我们的主要贡献有两方面。首先,我们引入了一个包含7000多个时间基础问题的综合基准和一个评估协议,能够分析模型是否将事实与其对应的时间段正确关联。其次,我们在按时间顺序排列的Common Crawl快照上预训练了60亿参数的模型,并将其与标准的随机打乱预训练进行比较。我们的结果表明,按顺序训练的模型在通用语言理解和常识方面与随机打乱的基线相当,同时始终表现出更及时和精确的时间知识。按时间顺序的预训练提高了事实的新鲜度,而随机打乱的预训练在较旧的数据上表现更好,可能是由于事实重复增加。这些发现,连同我们在https://github.com/kyutai-labs/kairos 发布的代码、在https://huggingface.co/collections/kyutai/kairos 发布的检查点和数据集,为LLMs的持续学习未来研究提供了基础。

英文摘要

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, together with our released code (https://github.com/kyutai-labs/kairos) and our checkpoints and datasets (https://huggingface.co/collections/kyutai/kairos), provide a foundation for future research on continual learning for LLMs.

2605.22365 2026-05-26 cs.CR cs.AI cs.LG 版本更新

TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting

TimeGuard: 面向时间序列预测中后门防御的通道式池化训练

Quang Duc Nguyen, Siyuan Liang, Yiming Li, Fushuo Huo, Dacheng Tao

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore(新加坡南洋理工大学计算与数据科学学院)

AI总结 针对时间序列预测中后门攻击防御难题,提出基于通道式池化训练的TimeGuard方法,通过时间感知池初始化与距离正则化损失选择缓解信号稀释与损失退化,显著提升鲁棒性。

Comments 44 pages, 30 figures. ICML 2026

详情
AI中文摘要

时间序列预测(TSF)极易受到后门攻击,但由于数据纠缠和任务公式化转变带来的挑战,有效的防御方法仍未被充分探索。为填补这一空白,我们对TSF生命周期中的十三种代表性后门防御进行了系统评估,并分析了它们的失败模式。我们的结果揭示了两个根本问题:(1)数据纠缠导致通道级信号稀释,使得样本过滤和触发器合成防御无法有效定位后门;(2)任务公式化转变导致训练损失退化,使得训练阶段中毒窗口与干净窗口难以区分。基于这些发现,我们提出了一种针对TSF的训练时后门防御方法,称为TimeGuard。该方法以通道式池化训练为核心范式,并使用时间感知标准初始化高置信度池以缓解信号稀释。此外,我们引入了距离正则化损失选择,在训练过程中逐步扩展可靠池并缓解损失退化。在多个数据集、预测架构和TSF后门攻击上的大量实验表明,TimeGuard显著提升了鲁棒性,将$\mathrm{MAE}_\mathrm{P}$相对于领先基线提升了1.96倍,同时将干净性能保持在5% $\mathrm{MAE}_\mathrm{C}$以内。

英文摘要

Time Series Forecasting (TSF) is highly vulnerable to backdoor attacks, yet effective defenses remain underexplored due to challenges arising from data entanglement and shifts in task formulation. To fill this gap, we conduct a systematic evaluation of thirteen representative backdoor defenses across the TSF life cycle and analyze their failure modes. Our results reveal two fundamental issues: (1) data entanglement induces channel-level signal dilution, rendering sample-filtering and trigger-synthesis defenses ineffective at localizing backdoors; and (2) task-formulation shift leads to training-loss degeneration, causing poisoned and clean windows to become indistinguishable at training stages. Based on these findings, we propose a training-time backdoor defense for TSF, termed TimeGuard. Our method adopts channel-wise pool training as the core paradigm and initializes a high-confidence pool using time-aware criteria to mitigate signal dilution. Moreover, we introduce distance-regularized loss selection to progressively expand the reliable pool during training and ease loss degeneration. Extensive experiments across multiple datasets, forecasting architectures, and TSF backdoor attacks demonstrate that TimeGuard substantially improves robustness, boosting $\mathrm{MAE}_\mathrm{P}$ by $1.96\times$ over the leading baseline, while preserving clean performance within 5% $\mathrm{MAE}_\mathrm{C}$.

2605.21602 2026-05-26 cs.AI cs.SE 版本更新

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

基准测试与改进LLMs中的分布外对齐失败监控器

Dylan Feng, Pragya Srivastava, Anca Dragan, Cassidy Laidlaw

发表机构 * University of California, Berkeley, USA(加州大学伯克利分校) Haize Labs, New York, USA(Haize实验室) Google DeepMind, India(谷歌DeepMind)

AI总结 针对大语言模型在分布外情境下的安全与对齐失败问题,提出MOOD基准并证明结合守卫模型与OOD检测器可提升监控召回率。

详情
AI中文摘要

大语言模型(LLMs)的许多安全和对齐失败源于分布外(OOD)情境:模型开发者未预见到的异常提示或响应模式。我们通过引入名为Misalignment Out Of Distribution (MOOD)的基准,系统研究LLM监控流程能否检测这些OOD对齐失败。对于在大量安全数据集上训练的现成模型,很难找到真正OOD的失败。我们通过在MOOD中包含一个受限训练集(用于训练我们自己的监控器)以及七个具有不同对齐失败且超出训练分布的测试集来规避这一问题。利用MOOD,我们发现守卫模型(安全分类器)通常难以泛化到OOD。为解决此问题,我们提出将守卫模型与OOD检测器结合。我们测试了四种OOD检测器,发现将守卫模型与基于马氏距离和困惑度的OOD检测器结合,可将召回率从39%提升至45%。我们还建立了跨模型规模的监控器(结合守卫模型和OOD检测器)的正向扩展趋势;发现将OOD检测纳入监控比使用参数多20倍的守卫模型能获得更高的召回率增益。我们的工作表明,OOD检测应成为LLM监控的关键组成部分,并为这一重要问题的进一步研究奠定了基础。我们公开发布了实验代码和数据,相关链接见:https://github.com/Dylan102938/mood-bench。

英文摘要

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a combination of a guard model with Mahalanobis distance and perplexity-based OOD detectors can improve recall from 39% to 45%. We also establish positive scaling trends across model scales for monitors that combine a guard model and OOD detector; we find that incorporating OOD detection into monitoring achieves a higher recall gain than using a guard model with 20 times more parameters. Our work suggests that OOD detection should be a crucial component of LLM monitoring and provides a foundation for further work on this important problem. We release the code and data for our experiments publicly, and you can find the relevant links here: https://github.com/Dylan102938/mood-bench.
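One simple way to combine a guard model with a distance-based OOD detector is an OR rule over two thresholds, sketched below. This is an assumption-laden illustration, not the MOOD implementation: the abstract does not say how the scores are combined, the perplexity-based detector is omitted, and for self-containment the Mahalanobis distance here uses a diagonal covariance (a special case of the full-covariance distance). All function names and threshold values are hypothetical.

```python
import math


def fit_diagonal_gaussian(embeddings):
    """Estimate per-dimension mean and variance from in-distribution embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[k] for e in embeddings) / n for k in range(dim)]
    # Small constant keeps the variance strictly positive.
    var = [
        sum((e[k] - mean[k]) ** 2 for e in embeddings) / n + 1e-8
        for k in range(dim)
    ]
    return mean, var


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance assumption."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))


def monitor_flags(guard_score, embedding, mean, var,
                  guard_threshold=0.5, ood_threshold=3.0):
    """Flag an input if the guard model OR the OOD detector fires (assumed rule)."""
    is_unsafe = guard_score >= guard_threshold
    is_ood = mahalanobis_diag(embedding, mean, var) >= ood_threshold
    return is_unsafe or is_ood


train = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mean, var = fit_diagonal_gaussian(train)
print(monitor_flags(0.1, [0.5, 0.5], mean, var))  # low guard score, in-distribution
print(monitor_flags(0.1, [5.0, 5.0], mean, var))  # low guard score, far from training data
```

An OR rule can only raise recall relative to the guard model alone, at the cost of more false positives, so the OOD threshold would need to be tuned against a false-positive budget.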

2605.20749 2026-05-26 cs.LG cs.AI 版本更新

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

魔鬼在于条件数:为什么GLU优于非GLU结构?

Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Peisong Wen, Qingming Huang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China(人工智能安全国家重点实验室,计算技术研究所,中国科学院,北京100190,中国) School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China(中国科学院大学计算机科学与技术学院,北京101408,中国) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China(北京人工智能研究院(BAAI),北京,中国)

AI总结 通过神经正切核分析,发现门控线性单元(GLU)通过重塑核谱、减小条件数来加速优化收敛,而非主要降低泛化差距。

Comments Accepted by ICML 2026

详情
AI中文摘要

门控线性单元(GLU)及其变体被广泛应用于现代开源大语言模型架构中,并且始终优于其非门控对应物,然而这种优势的根本原因尚不清楚。在这项工作中,我们通过分析神经正切核(NTK)机制下的两层网络来研究GLU。我们的分析表明,GLU结构重塑了NTK谱,导致更小的条件数和更紧凑的特征值分布。基于这一发现,我们进一步分析了由此产生的训练动态,并展示了重塑后的谱如何导致GLU模型更快的收敛,包括在GLU和非GLU模型之间观察到的特征损失交叉现象。最后,我们通过实验观察到,GLU在缩小各种模型(包括ViT和GPT-2)的泛化差距方面影响有限,这表明其主要优势在于加速优化而非减少泛化差距。代码可在 https://github.com/Zemdalk/GLU-NTK 获取。

英文摘要

Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap. The code is available at: https://github.com/Zemdalk/GLU-NTK.

2605.18916 2026-05-26 cs.MM cs.AI cs.CV cs.SD eess.AS 版本更新

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

CounterFlow: 一种用于反事实视频拟音生成的两阶段推理时采样方法

Gyubin Lee, Junwon Lee, Juhan Nam

发表机构 * Kim Jaechul Graduate School of AI, KAIST(韩国科学技术院金在哲人工智能研究生院)

AI总结 提出CounterFlow,一种两阶段推理时采样方案,用于预训练的流匹配VT2A模型,以生成与视觉证据矛盾但时间同步的反事实视频拟音,并通过新指标评估替换质量。

Comments accepted to CVPR 2026 Workshop on Sight and Sound

详情
AI中文摘要

我们研究反事实视频拟音生成,旨在采用与视觉证据矛盾的声源身份,同时保持与无声视频的时间同步。现有的视频与文本到音频(VT2A)模型难以处理此问题,当视频和文本内容不一致时,它们往往仍锚定于视觉隐含的声源。我们提出CounterFlow,一种用于预训练流匹配VT2A模型的推理时双阶段采样方案。第一阶段构建视频衍生的时间结构,同时抑制视觉隐含的声源;第二阶段放弃视频条件,完全专注于塑造朝向目标提示的音频音色。与朴素的负提示和最新基线相比,CounterFlow显著改进了反事实视频拟音生成。为了评估替换质量,我们提出一个利用文本-音频共嵌入空间的度量,同时衡量目标提示证据和残留的视觉隐含声源泄漏。视频演示和代码可在https://gyubin-lee.github.io/counterflow-demo/获取。

英文摘要

We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/

2605.18797 2026-05-26 cs.LG cs.AI 版本更新

Simply Stabilizing the Loop via Fully Looped Transformer

通过全循环Transformer简单稳定循环

Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

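Affiliations: Hong Kong Baptist University; Jilin University

AI summary: To address the training instability that Looped Transformers exhibit as the number of loop iterations grows, proposes the Fully Looped Transformer, whose two parameter-free modifications (a fully looped architecture and attention injection) stabilize training for up to 12 loop iterations and improve downstream-task performance by up to 13.2%.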

详情
AI中文摘要

扩展模型性能通常需要增加模型大小。循环Transformer通过迭代重用相同的Transformer块提供了一种引人注目的替代方案,用额外的计算换取性能提升,而不增加参数数量或上下文长度。由于推理时可以调整循环迭代次数,它还提供了一种平衡性能和测试时计算的自然机制。然而,当循环迭代次数增加时,循环Transformer仍然存在训练不稳定性。我们的分析表明,这种不稳定性源于两个来源:梯度振荡和残差爆炸。为了解决这两个问题,我们提出了全循环Transformer,它引入了两种无参数修改:(1)全循环架构,将循环间信号分布到所有层以缓解残差爆炸;(2)注意力注入,重用现有的注意力块以抑制梯度振荡。这些修改稳定了训练动态,使得全循环Transformer能够稳定训练多达12次循环迭代,而其他基线循环模型在这种情况下会崩溃。在循环Transformer不会崩溃的较温和设置中,全循环Transformer仍然将平均下游任务性能提升了高达13.2%。总体而言,我们的实验表明,全循环Transformer提高了训练稳定性,增强了下游性能,并通过在推理时改变循环迭代次数,提供了在不同测试时计算预算下的初步适应性。

英文摘要

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.
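The basic looping idea in this abstract (reusing one block so that compute grows while the parameter count stays fixed) can be sketched as follows. This is only a toy illustration: the residual `tanh` block is a stand-in for a Transformer block, the names `make_block`, `looped_forward`, and `param_count` are hypothetical, and the paper's two modifications (Fully Looped Architecture and Attention Injection) are not modeled here.

```python
import math
import random


def make_block(dim, seed=0):
    """Build one tiny residual block (a stand-in for a Transformer block)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def block(x):
        # Residual update: x + tanh(W x)
        return [
            xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for xi, row in zip(x, weights)
        ]

    return weights, block


def looped_forward(block, x, num_loops):
    """Apply the SAME block num_loops times; no new parameters are created."""
    for _ in range(num_loops):
        x = block(x)
    return x


def param_count(weights):
    """Number of scalar parameters in the shared block."""
    return sum(len(row) for row in weights)


if __name__ == "__main__":
    weights, block = make_block(dim=4, seed=1)
    x = [0.5, -0.2, 0.1, 0.0]
    for loops in (1, 4, 12):
        y = looped_forward(block, x, loops)
        print(loops, param_count(weights), [round(v, 3) for v in y])
```

Varying `num_loops` at call time mirrors the test-time compute knob the abstract describes: more iterations cost more computation, while `param_count` does not change.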

2605.18746 2026-05-26 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench: 迈向闭环感知-动作的具身空间智能

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

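Affiliations: Stanford University; UCLA; Northwestern University

AI summary: Introduces ESI-BENCH, a benchmark that evaluates embodied spatial intelligence through active exploration (perception, locomotion, and manipulation) in OmniGibson environments; finds that active exploration substantially outperforms passive approaches, that failures stem mainly from action blindness rather than weak perception, and that models exhibit a metacognitive gap.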

Comments https://esi-bench.github.io/

详情
AI中文摘要

空间智能通过感知-动作循环展开:智能体通过行动获取观察,并推理观察如何随动作变化。它们不是被动处理所见,而是主动揭示未见——遮挡结构、动态、包含关系和功能,这些无法仅通过被动感知解决。我们超越先前假设神谕观察的空间智能表述,将观察者重新定义为行动者。我们引入ESI-BENCH,一个基于OmniGibson、扎根于Spelke核心知识系统的全面具身空间智能基准,涵盖10个任务类别和29个子类别。智能体必须决定部署哪些能力——感知、移动和操作——以及如何排序以主动积累任务相关证据。我们对最先进的MLLM进行大量实验,发现主动探索显著优于被动对应物,智能体自发发现涌现的空间策略而无需明确指令,而随机多视角往往增加噪声而非信号,尽管消耗更多图像。大多数失败并非源于感知弱,而是动作盲视:糟糕的动作选择导致糟糕的观察,进而引发级联错误。虽然显式3D基础稳定了深度敏感任务的推理,但不完美的3D表示通过扭曲空间关系证明比2D基线更有害。人类研究进一步揭示,与寻求证伪视角并在矛盾下修正信念的人类不同,模型无论证据质量如何都过早且高置信度地承诺,暴露了一个既不能通过更好感知也不能通过更多具身互动单独闭合的元认知差距。

英文摘要

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

2605.18172 2026-05-26 cs.AI 版本更新

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

可视化不可见:生成式视觉定位赋能多模态大语言模型的通用脑电图理解

Jun-Yu Pan, Yansen Wang, Enze Zhang, Bao-Liang Lu, Wei-Long Zheng, Dongsheng Li

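Affiliations: Shanghai Jiao Tong University; Microsoft Research Asia

AI summary: Proposes the Generative Visual Grounding (GVG) framework, which uses an EEG-to-image generative model as a visual translator to give MLLMs structured visual context, improving understanding of non-visual EEG and clinical-state interpretation.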

详情
AI中文摘要

利用预训练大语言模型和多模态大语言模型的通用表示为脑基础模型提供了一条有前景的路径。然而,视觉诱发的脑电图数据集仍然稀缺,导致现有方法主要将神经信号与抽象文本对齐,这种有损翻译可能丢弃脑活动中编码的细粒度感知信息。我们提出生成式视觉定位(GVG)框架,通过使用脑电图到图像的生成模型作为视觉翻译器,将不可见的信息可视化。GVG 不是仅将脑电图强制转换为文本,而是为非视觉脑电图生成实例特定的代理图像,提供结构化的视觉上下文,使多模态大语言模型能够利用其视觉先验进行临床状态解释。我们在两个多模态大语言模型骨干上验证了这一想法:GVG-X-Omni 和 GVG-Janus。仅图像对齐已具有竞争力:轻量级 GVG-X-Omni 在冻结的 7B 骨干上仅调整 170M 参数,即可匹配 1.7B 参数的文本对齐基线。我们进一步扩展了 GVG-Janus,采用三模态图像+文本对齐,其中文本提供类别语义锚点,视觉代理用感知细节丰富神经表示。实验表明,在脑电图理解和视觉生成方面均取得了一致增益,表明视觉代理定位作为文本对齐的有效补充。

英文摘要

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

2605.17937 2026-05-26 cs.CL cs.AI 版本更新

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench:面向自动化量化策略回测的大语言模型基准测试

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

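Affiliations: Beijing Normal University; Elmleaf Ltd.

AI summary: Introduces BacktestBench, the first large-scale benchmark for automated quantitative backtesting, with 18,246 question-answering pairs, and AutoBacktest, a multi-agent baseline that coordinates a Summarizer, a Retriever, and a Coder to turn natural-language strategies into reproducible backtests.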

Comments This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)

详情
AI中文摘要

量化回测对于评估交易策略至关重要,但仍受到高技术门槛和有限可扩展性的阻碍。虽然大语言模型(LLMs)通过先进的代码生成、工具使用和智能体规划为自动化这一复杂的跨学科工作流程提供了变革性路径,但实际实现因当前缺乏专门用于自动化量化回测的大规模基准而面临重大挑战,这阻碍了该领域的进展。为弥补这一关键差距,我们引入了BacktestBench,这是首个用于自动化量化回测的大规模基准。它基于超过600万条真实市场记录构建,包含18,246个精心标注的问答对,涵盖四个任务类别:指标计算、股票选择、策略选择和参数确认。我们还提出了AutoBacktest,一个稳健的多智能体基线,通过协调摘要器进行语义因子提取、检索器进行验证的SQL生成以及编码器进行Python回测实现,将自然语言策略转化为可重复的回测。我们对23个主流LLM的评估,辅以有针对性的消融实验,识别了影响端到端性能的关键因素,并强调了基于事实的验证和标准化指标表示的重要性。

英文摘要

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.

2605.17730 2026-05-26 cs.LG cs.AI 版本更新

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

L-Drive:超越单一映射——潜在上下文驱动时间序列预测

Fan Zhang, Shijun Chen, Hua Wang

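Affiliations: Business University, Yantai, Shandong, China; Ludong University, Yantai, Shandong, China

AI summary: To address the response lag that the direct-mapping paradigm shows around turning points under distribution shifts and regime changes, proposes L-Drive, a framework that introduces a latent context to characterize high-level dynamics and uses gating to modulate increment representations, improving adaptation to changing segments; patch-shared relative positional basis functions strengthen intra-segment structural modeling, giving a better balance between forecasting accuracy and computational efficiency.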

详情
AI中文摘要

多变量时间序列预测的主流方法主要遵循直接映射范式。它们在观测空间中学习从历史到未来的统一映射,以拟合值级依赖关系。然而,现实世界系统经常经历分布偏移和机制变化。在这种情况下,统一映射在转折点附近可能出现响应滞后,导致切换窗口内误差累积,降低预测可靠性。为解决此问题,我们提出L-Drive,一种变化感知预测框架。L-Drive引入潜在上下文,显式表征随时间演变的高层动态,并使用门控调制增量表示。这提供了更及时的变化线索,并改善了对变化段的适应。此外,它结合了补丁共享相对位置基函数,以加强段内结构建模并减少由绝对位置记忆引起的过拟合。大量实验验证了L-Drive的有效性,并展示了其在预测精度和计算效率之间更好的整体权衡。

英文摘要

Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from history to the future in the observation space to fit value-level dependencies. However, real-world systems often undergo distribution shifts and regime changes. In such cases, a unified mapping can exhibit response lag around turning points, causing error accumulation within the switching window and reducing forecasting reliability. To address this issue, we propose L-Drive, a change-aware forecasting framework. L-Drive introduces a Latent-Context to explicitly characterize high-level dynamics evolving over time and uses gating to modulate increment representations. This provides more timely change cues and improves adaptation to changing segments. In addition, it incorporates patch-shared relative positional basis functions to strengthen intra-segment structural modeling and reduce overfitting caused by absolute-position memorization. Extensive experiments validate the effectiveness of L-Drive and show a better overall trade-off between forecasting accuracy and computational efficiency.

2605.17537 2026-05-26 cs.AI 版本更新

Self-supervised Hierarchical Visual Reasoning with World Model

基于世界模型的自监督分层视觉推理

Yuanfei Xu, Lin Liu, Wengang Zhou, Mingxiao Feng, Houqiang Li

Affiliations: Department of Electronic Engineering and Information Science, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Proposes ResDreamer, a hierarchical world model that learns residual representations in a self-supervised manner for efficient visual reasoning, achieving state-of-the-art sample and parameter efficiency in 3D open-world environments.

详情
AI中文摘要

具有对抗对手的3D开放世界环境因其巨大的状态空间仍然是强化学习的核心挑战。有效的推理表示在此类环境中至关重要。虽然现有的自监督视觉预见推理方法常常遭受多步误差累积,许多最近的研究转向注入领域特定知识以提供更稳定的指导。我们的关键洞察是,视觉推理表示的照片级真实感是次要的;真正重要的是提供信息丰富、任务相关的信号。为此,我们提出ResDreamer,一种分层世界模型,其中每个更高层被训练来重建下一层的残差。这种设计使得对日益复杂的世界动态进行渐进抽象成为可能,并促进更丰富潜在表示的出现。受“苦涩教训”启发,ResDreamer以纯自监督方式训练其推理表示。高层残差表示用于调节低层预测,使得世界模型仅以线性增加的跨层通信成本即可有效扩展。实验表明,ResDreamer实现了最先进的样本效率和参数效率。这种可扩展的分层视觉预见推理架构为开放、动态环境中更具能力的在线RL代理铺平了道路。代码可在https://github.com/XuYuanFei01/ResDreamer获取。

英文摘要

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at https://github.com/XuYuanFei01/ResDreamer.

2605.16953 2026-05-26 cs.AI cs.CL 版本更新

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

人类如何处理AI生成的幻觉内容:一项神经影像学研究

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou, Qingyao Ai, Yiqun Liu

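Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China; Institute of Trustworthy Embodied AI, Fudan University, Shanghai, China

AI summary: Uses an EEG experiment to study how neural dynamics differ when humans process hallucinated versus non-hallucinated content generated by a multimodal large language model, and shows that misjudged hallucinations fail to trigger the standard neurocognitive fact-verification pathway.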

详情
AI中文摘要

尽管AI生成的幻觉带来了相当大的风险,但人类能够成功识别或被这些幻觉误导的潜在认知机制仍不清楚。为了解决这个问题,本文探索了人类的神经动力学,以表征大脑如何处理幻觉内容。我们记录了27名参与者在执行验证任务时的EEG信号,该任务要求判断由多模态大语言模型(MLLM)生成的图像描述的正确性。基于平均事件相关电位(ERP)研究,我们揭示了多种认知过程,例如语义整合、推理处理、记忆检索和认知负荷,在处理幻觉与非幻觉内容时表现出不同的模式。值得注意的是,人类参与者误判与正确判断的幻觉的神经反应显示出显著差异。这表明,被误判的AI生成幻觉未能触发标准的神经认知事实验证通路。

英文摘要

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.

2605.12906 2026-05-26 cs.LG cs.AI 版本更新

Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

数据难度与LLM微调中的泛化-外推权衡

Siyuan Liu, Tinghong Chen, Xinghan Li, Yifei Wang, Jingzhao Zhang

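Affiliations: IIIS, Tsinghua University; College of AI, Tsinghua University; Shanghai Qi Zhi Institute; Amazon AGI SF Lab

AI summary: Studies, empirically and theoretically, how data difficulty affects model behavior in supervised fine-tuning; finds that data difficulty and data size jointly determine the trade-off between generalization and extrapolation, and that the optimal difficulty shifts toward harder data as the data budget grows.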

Comments Accepted to ICML 2026

详情
AI中文摘要

监督微调(SFT)期间的数据选择可以显著改变大型语言模型(LLMs)的行为。尽管已有工作研究了基于困惑度、难度或长度等启发式方法选择数据的效果,但报告的结果往往不一致或依赖于上下文。在这项工作中,我们从实证和理论角度系统地研究了数据难度在微调中的作用,并发现不存在普遍最优的难度水平;相反,其有效性取决于数据集大小。我们表明,对于固定的数据预算,SFT存在一个最优的数据难度,并且随着数据预算的增加,该最优难度向更难的数据偏移。为了解释这一现象,我们进行了受控的合成实验,揭示了一个简单的底层机制:分布内泛化差距与外推差距之间的相互作用。我们通过使用PAC-Bayesian泛化界限的理论分析进一步支持了这一机制。总的来说,我们的结果阐明了数据大小和难度如何共同影响SFT中泛化与外推之间的权衡,为在特定模型和数据条件下基于难度的数据选择提供了指导。

英文摘要

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

2605.12374 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

填补GAP:多模态大语言模型中视觉推理的粒度对齐范式

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

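Affiliations: Qwen Large Model Application Team, Alibaba; Alibaba; University of Waterloo; Vector Institute; Zhejiang University

AI summary: Proposes GAP (Granular Alignment Paradigm), which addresses the feature-space mismatch in visual latent reasoning for multimodal large language models through feature-level, context-level, and capacity-guided alignment, improving perception and reasoning performance.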

详情
AI中文摘要

视觉潜在推理让多模态大语言模型(MLLM)以连续令牌形式创建中间视觉证据,避免外部工具或图像生成器。然而,现有方法通常遵循输出即输入的潜在范式,产生不稳定的收益。我们识别出特征空间不匹配是导致这种不稳定的证据:主流的视觉潜在模型建立在预归一化MLLM上,重用解码器隐藏状态作为预测的潜在输入,尽管这些状态与模型训练时消耗的输入嵌入处于截然不同的范数范围(Xie et al., 2025; Li et al., 2026; Team et al., 2026)。这种不匹配可能使直接潜在反馈不可靠。受此诊断启发,我们提出GAP,一种用于视觉潜在建模的粒度对齐范式。GAP在三个层面对齐视觉潜在推理:特征级对齐通过轻量级PCA对齐潜在头将解码器输出映射为输入兼容的视觉潜在;上下文级对齐通过可检查的辅助视觉监督锚定潜在目标;能力引导对齐选择性地将潜在监督分配给基础MLLM难以处理的示例。在Qwen2.5-VL 7B上,所得模型在我们监督变体中实现了最佳平均聚合感知和推理性能。推理时干预探测进一步表明,生成的潜在提供了任务相关的视觉信号,而不仅仅是增加令牌槽位。

英文摘要

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (Xie et al., 2025; Li et al., 2026; Team et al., 2026). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

2605.10913 2026-05-26 cs.AI cs.PL cs.SE 版本更新

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Shepherd: 一个为元代理提供形式化执行迹的运行时基座

Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D Manning, Weiyan Shi

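Affiliations: Northeastern University; Stanford University

AI summary: Proposes Shepherd, a Python runtime substrate grounded in functional programming that treats agent execution as a first-class object and lets meta-agents inspect, fork, and replay it through a Git-like execution trace, with substantial gains in three use cases.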

Comments 50 pages, 22 figures, 14 tables

详情
AI中文摘要

随着LLM代理系统承担更复杂的任务,它们越来越依赖元代理:对其他代理进行操作的高阶代理,就像管理者监督员工一样。无论元代理做什么:协调代理、在执行前停止风险动作、或修复失败的运行,都需要在运行时操纵代理执行。现有的代理基座使得这变得困难:它们只给元代理提供纯文本记录和环境快照,要求元代理构建自己的工具来重建和编排执行状态。因此,我们引入了Shepherd,一个基于函数式编程原则的Python基座,其中代理的执行本身是一个一等对象,元代理可以检查和转换它。每个模型调用、工具调用和环境变化都成为类似Git的执行迹中的一个结构化事件,任何过去的状态都可以被分叉(比docker commit快5倍)并重放。三个示例用例展示了Shepherd的多功能性:(1)一个监督代理防止并行编码代理之间的冲突,将CooperBench的性能从28.8%提升到54.7%;(2)一个反事实优化器通过提出编辑并从行为改变点重放运行来修复代理工作流,在TerminalBench-2上比MetaHarness低58%的挂钟时间;(3)一个元代理在展开期间选择分叉点以改进长程代理强化学习中的信用分配,在TerminalBench-2上将GRPO的增益翻倍。我们开源Shepherd,以通过原则性和高效的代理执行操作赋能未来的元代理。

英文摘要

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does, whether coordinating agents, halting risky actions before execution, or repairing failed runs, it must manipulate agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, requiring each meta-agent to build its own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, where any past state can be forked (5x faster than docker commit) and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock time; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.
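The idea of a Git-like execution trace that can be forked and replayed can be sketched with an immutable, parent-linked event list. This is a minimal illustration under stated assumptions, not Shepherd's actual API: the `Event`, `Trace`, and `replay` names and the event kinds are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable step in an execution trace (hypothetical structure)."""
    kind: str                      # e.g. "model_call", "tool_call", "env_change"
    payload: Any
    parent: Optional["Event"] = None


class Trace:
    """A branch pointer over a shared, append-only chain of events."""

    def __init__(self, head: Optional[Event] = None):
        self.head = head

    def append(self, kind: str, payload: Any) -> Event:
        # New events point at the previous head; old events are never mutated.
        self.head = Event(kind, payload, self.head)
        return self.head

    def fork(self, at: Optional[Event] = None) -> "Trace":
        # Forking copies only a pointer, so the shared prefix is not duplicated.
        return Trace(at if at is not None else self.head)

    def events(self) -> List[Event]:
        out, node = [], self.head
        while node is not None:
            out.append(node)
            node = node.parent
        return list(reversed(out))


def replay(trace: Trace, step: Callable[[Any, Event], Any], state: Any) -> Any:
    """Rebuild a state by folding a step function over the trace, oldest first."""
    for event in trace.events():
        state = step(state, event)
    return state


if __name__ == "__main__":
    main = Trace()
    plan = main.append("model_call", "plan")
    main.append("tool_call", "ls")
    branch = main.fork(at=plan)          # counterfactual branch from an earlier state
    branch.append("tool_call", "pwd")
    print([e.payload for e in main.events()])
    print([e.payload for e in branch.events()])
```

Because events are immutable and shared, a fork in this sketch costs one pointer copy regardless of trace length, which is the same structural reason branching is cheap in Git.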

2605.09270 2026-05-26 cs.LG cs.AI 版本更新

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

记忆定理而非实例:通过数学推理探究SFT泛化

Ruiying Peng, Mengyu Yang, Jing Lei, Xiaohui Li, Xueyu Wu, Xinlei Chen

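Affiliations: Tsinghua Shenzhen International Graduate School; Huawei Technologies

AI summary: To address the damage that supervised fine-tuning (SFT) does to reasoning generalization, proposes Theorem-SFT, which trains on explicit theorem application, yields notable gains across several benchmarks, and indicates that feed-forward layers are the primary locus of reasoning rules.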

详情
AI中文摘要

监督微调(SFT)广泛用于任务特定适配,但近期工作表明它会系统性地削弱推理泛化。我们认为根本原因不在于记忆本身,而在于其目标:标准SFT驱动模型利用并记忆问题-答案对中的虚假表面相关性,使其对表面输入变化脆弱。为解决此问题,我们提出Theorem-SFT,通过教授模型规则如何被调用而非答案看起来像什么,将监督重新导向显式定理应用。Theorem-SFT在多个基准和模型家族上取得一致提升:在MATH上(LLaMA3.2-3B-Instruct)提升8.8%,在GeoQA上(Qwen2.5-VL-7B-Instruct)提升20.27%,无需特定模态的重新训练。仅微调MLP层即可达到全层性能,表明前馈组件是推理规则的主要存储位置。我们的发现重新定义了争论:泛化失败并非源于记忆机制本身,而是源于记忆了错误的归纳目标。

英文摘要

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layer fine-tuning performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.

2605.04906 2026-05-26 cs.AI 版本更新

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

Strat-Reasoner:在多智能体游戏中增强大语言模型的战略推理能力

Yidong He, Yutao Lai, Pengxu Yang, Jiarui Gan, Jiexin Wang, Yi Cai, Mengchen Zhao

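Affiliations: School of Software Engineering, South China University of Technology; Department of Computer Science, University of Oxford

AI summary: Proposes Strat-Reasoner, a framework that combines a recursive reasoning paradigm, a centralized chain-of-thought comparison module, and group-relative reinforcement learning with a hybrid advantage to improve the strategic reasoning of LLMs in multi-agent games.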

详情
AI中文摘要

虽然大语言模型(LLMs)在某些推理任务中表现出色,但在最终结果取决于所有智能体联合策略的多智能体游戏中,它们却难以应对。在多智能体游戏中,其他智能体的非平稳性给推理过程的评估和多个推理步骤上的信用分配带来了重大挑战。现有的单智能体强化学习(RL)方法及其多智能体扩展未能解决这些挑战,因为它们没有将其他智能体纳入推理过程。在这项工作中,我们提出了Strat-Reasoner,一种新颖的基于强化学习的框架,旨在提升LLMs在多智能体游戏中的战略推理能力。我们引入了一种新颖的递归推理范式,其中智能体的推理也整合了其他智能体的推理过程。为了为中间推理序列提供有效的奖励信号,我们采用了一个集中的思维链(CoT)比较模块来评估推理质量。最后,我们计算了一个准确的混合优势,并开发了一种组相对强化学习方法以优化LLM策略。实验结果表明,Strat-Reasoner显著提升了底层LLMs的战略能力,在各种多智能体游戏中平均性能提升了22.1%。代码已公开在https://github.com/ydhe1012/Strat-Reasoner。

英文摘要

While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents poses significant challenges for evaluating the reasoning process and for assigning credit over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm where an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves the strategic abilities of the underlying LLMs, achieving an average performance improvement of 22.1% across various multi-agent games. Code is publicly available at https://github.com/ydhe1012/Strat-Reasoner.
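The abstract does not spell out the hybrid advantage, so the sketch below shows only the generic group-relative step used by GRPO-style methods: each sampled response's reward is normalized against the mean and standard deviation of its own group. The function name and the `eps` stabilizer are assumptions made for illustration, not the paper's formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    rewards: list of scalar rewards for responses to the same prompt/state.
    eps:     small constant (an assumption here) to avoid division by zero
             when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses: two win the game, two lose.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group with identical rewards carries no learning signal.
    print(group_relative_advantages([0.5, 0.5, 0.5]))
```

Normalizing within the group removes the need for a learned value baseline: a response is reinforced only to the extent that it beat its siblings.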

2605.04700 2026-05-26 cs.CR cs.AI cs.CL cs.LG cs.SD 版本更新

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

稀疏令牌足矣:通过令牌感知梯度优化越狱音频语言模型

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang, Zhijin Ge

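Affiliations: Wuhan University; Institute for Math & AI, Wuhan University; Huazhong University of Science; Shanghai Jiao Tong University; Xidian University

AI summary: Proposes Token-Aware Gradient Optimization (TAGO), which performs sparse jailbreak attacks by keeping only the waveform gradients aligned with high-gradient-energy audio tokens, preserving high attack success rates while substantially reducing the amount of optimization.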

Comments To appear in the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

对音频语言模型(ALM)的越狱攻击通过优化音频扰动来引发不安全生成,通常在整个优化过程中密集地更新整个波形。在这项工作中,我们通过分析ALM中令牌对齐梯度的结构来研究这种密集优化的必要性。我们发现梯度能量在音频令牌之间高度不均匀,表明只有一小部分令牌对齐的音频区域主导了优化信号。受此观察启发,我们提出了令牌感知梯度优化(TAGO),它通过每次迭代仅保留与高梯度能量音频令牌对齐的波形梯度,同时屏蔽其余梯度,实现了稀疏越狱优化。在三个ALM上,TAGO优于基线,并且大幅稀疏化仍能保持较高的攻击成功率(例如,在Qwen3-Omni上,令牌保留率为0.25时,$\mathrm{ASR}_{l}$仍为86%,而全令牌保留时为87%)。这些结果表明密集的波形更新在很大程度上是冗余的,我们主张未来的音频越狱和安全对齐研究应进一步利用这种异质的令牌级梯度结构。

英文摘要

Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, $\mathrm{ASR}_{l}$ remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
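The masking step described above (keep waveform gradients only where the aligned audio token has high gradient energy) can be sketched as below. This is a simplified illustration under assumptions that the abstract does not state: each token is taken to cover a fixed number of consecutive samples, energy is the sum of squared gradients, and the names `token_gradient_mask`, `samples_per_token`, and `retention_ratio` are hypothetical.

```python
def token_gradient_mask(grad, samples_per_token, retention_ratio):
    """Zero out waveform gradients outside the highest-energy token regions.

    grad:              per-sample waveform gradient (list of floats)
    samples_per_token: assumed fixed token-to-waveform alignment
    retention_ratio:   fraction of tokens whose gradients are kept
    Samples past the last full token are masked out.
    """
    n_tokens = len(grad) // samples_per_token
    # Gradient energy per token: sum of squared gradients over its samples.
    energies = [
        sum(g * g for g in grad[t * samples_per_token:(t + 1) * samples_per_token])
        for t in range(n_tokens)
    ]
    keep = max(1, round(retention_ratio * n_tokens))
    top = set(sorted(range(n_tokens), key=lambda t: energies[t], reverse=True)[:keep])
    masked = [
        g if (i // samples_per_token) in top else 0.0
        for i, g in enumerate(grad)
    ]
    return masked, sorted(top)


if __name__ == "__main__":
    grad = [0.1, 0.1, 2.0, -2.0, 0.0, 0.3, 1.0, 1.0]
    masked, kept = token_gradient_mask(grad, samples_per_token=2, retention_ratio=0.25)
    print(kept, masked)
```

In an attack loop, the masked gradient would replace the dense one in each update, and the mask would be recomputed at every iteration because token energies change as the perturbation evolves.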

2605.03804 2026-05-26 cs.AI 版本更新

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

ScrapMem: 一种基于生物启发的光学遗忘机制用于设备端个性化智能体记忆

Jiale Chang, Yuxiang Ren

发表机构 * Nanjing Agricultural University(南京农业大学) Nanjing University(南京大学)

AI总结 提出ScrapMem框架,通过光学遗忘机制压缩旧记忆并构建情节记忆图,在资源受限设备上实现高效多模态长期记忆,在ATM-Bench上取得51.0% Joint@10新最优,存储降低93%,召回率提升至70.3%。

Comments 10 pages, 4 figures

详情
AI中文摘要

对于LLM智能体而言,在资源受限的边缘设备上实现长期个性化记忆因高存储成本和多模态复杂性而具有挑战性。为此,我们提出ScrapMem,一个将多模态数据整合为“剪贴簿页面”的框架。ScrapMem引入了光学遗忘机制,一种逐步降低旧记忆分辨率的光学压缩机制,从而降低存储成本并抑制低价值细节。为保持语义一致性,我们构建了情节记忆图(EM-Graph),将关键事件组织成因果-时间结构。在多模态ATM-Bench上的大量实验表明,ScrapMem提供了三个主要优势:(1)强大性能,以51.0%的Joint@10分数实现了新的最优结果;(2)高存储效率,通过光学遗忘将内存使用量降低高达93%;(3)改进的召回率,通过结构化聚合将Recall@10提升至70.3%。ScrapMem为多模态LLM智能体在设备上的长期记忆提供了一种有效且存储高效的解决方案。

英文摘要

Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal complexity. To address this, we propose ScrapMem, a framework that integrates multimodal data into "Scrapbook Pages." ScrapMem introduces Optical Forgetting, an optical compression mechanism that progressively reduces the resolution of older memories, lowering storage cost while suppressing low-value details. To maintain semantic consistency, we construct an Episodic Memory Graph (EM-Graph) that organizes key events into a causal-temporal structure. Extensive experiments on the multimodal ATM-Bench show that ScrapMem provides three main benefits: (1) strong performance, achieving a new state of the art with a 51.0% Joint@10 score; (2) high storage efficiency, reducing memory usage by up to 93% via optical forgetting; and (3) improved recall, increasing Recall@10 to 70.3% through structured aggregation. ScrapMem offers an effective and storage-efficient solution for on-device long-term memory in multimodal LLM agents.
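To make the storage effect of progressively lowering resolution concrete, here is a rough sketch of an age-based downscaling schedule. The halving rule, the `half_life_days` and `min_side` parameters, and the function names are all assumptions made for illustration; the abstract does not specify ScrapMem's actual schedule.

```python
def forgotten_resolution(width, height, age_days, half_life_days=30, min_side=32):
    """Return the (width, height) a stored page would be kept at for its age.

    Hypothetical rule: each side is halved once per elapsed half-life,
    but never shrinks below min_side pixels.
    """
    steps = age_days // half_life_days
    scale = 2 ** steps
    return max(min_side, width // scale), max(min_side, height // scale)


def storage_ratio(width, height, age_days, half_life_days=30, min_side=32):
    """Fraction of the original pixel count still stored at this age."""
    w, h = forgotten_resolution(width, height, age_days, half_life_days, min_side)
    return (w * h) / (width * height)


if __name__ == "__main__":
    for age in (0, 30, 60, 120):
        print(age, forgotten_resolution(1024, 768, age), storage_ratio(1024, 768, age))
```

Because pixel count falls with the square of the scale factor, two halvings already cut raw storage to one sixteenth under this assumed rule, which shows how resolution decay can yield large savings of the kind the abstract reports.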

2605.03675 2026-05-26 cs.AI 版本更新

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

MEMTIER:面向长期运行的自主AI智能体的分层内存架构与检索瓶颈分析

Bronislav Sidik, Lior Rokach

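Affiliations * Institute for Applied AI Research; Faculty of Computer and Information Science; Ben-Gurion University of the Negev

AI summary Proposes MEMTIER, a three-tier memory architecture that combines a structured episodic store, five-signal weighted retrieval, attention-attributed weight updates, an asynchronous consolidation mechanism, and a PPO policy. On the LongMemEval-S benchmark it raises accuracy from the full-context baseline's 5% to 38% and runs locally on a 6GB GPU.

Comments 11 pages, 1 figure, 5 tables. Under review

Details
AI Chinese abstract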

长期运行的自主AI智能体面临一个记录充分的内存一致性问题:由于现有平面文件内存系统中的四种复合故障模式,工具执行成功率在72小时运行窗口内下降14个百分点。我们提出MEMTIER,这是OpenClaw智能体运行时的三方内存架构,引入了结构化事件JSONL存储、五信号加权检索引擎、注意力归因的认知权重更新循环、将事件事实提升到语义层的异步合并守护进程,以及基于PPO的检索权重自适应策略框架(基础设施已验证;性能提升待最终版本确认)。在完整的500问题LongMemEval-S基准测试(Wu等人,2025)上,MEMTIER在消费级6GB GPU上使用Qwen2.5-7B达到Acc=0.382,F1=0.412——比全上下文基线(0.050 -> 0.382,即5% -> 38%)提高了33个百分点。通过DeepSeek-V4-Flash事实预填充,单会话召回率达到0.686-0.714,超过了论文中RAG BM25 GPT-4o基线(0.560)在这些类别上的表现。时间推理提升至0.323,多会话综合提升至0.173,表明结构化语义预填充从根本上改变了轻量级检索所能达到的效果。所有阶段均在配备6GB GPU的消费级笔记本电脑上本地运行。

English abstract

Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tripartite memory architecture for the OpenClaw agent runtime that introduces a structured episodic JSONL store, a five-signal weighted retrieval engine, an attention-attributed cognitive weight update loop, an asynchronous consolidation daemon promoting episodic facts to a semantic tier, and a PPO-based policy framework for adapting retrieval weights (infrastructure validated; performance gains pending camera-ready). On the full 500-question LongMemEval-S benchmark (Wu et al., 2025), MEMTIER achieves Acc=0.382, F1=0.412 with Qwen2.5-7B on a consumer 6GB GPU - a +33 percentage point improvement over the full-context baseline (0.050 -> 0.382, i.e., 5% -> 38%). With DeepSeek-V4-Flash fact pre-population, single-session recall reaches 0.686-0.714, exceeding the paper's RAG BM25 GPT-4o baseline (0.560) on those categories. Temporal reasoning rises to 0.323 and multi-session synthesis to 0.173, demonstrating that structured semantic pre-population qualitatively changes what lightweight retrieval can achieve. All phases run locally on a consumer laptop with a 6GB GPU.

2605.03472 2026-05-26 cs.CL cs.AI updated version

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

审计心理健康对话中的隐性谄媚:结构化临床状态诊断与干净匹配基准

Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

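Affiliations * Shenzhen MSU-BIT University

AI summary Targets implicit sycophancy in mental-health dialogue models (responses that look empathetic on the surface but reinforce negative cognition). Proposes a structured offline audit framework based on Dynamic Emotional Signature Graphs (DESG) that evaluates response direction through clinical-state transitions and achieves the best harmful-risk detection on a clean matched benchmark.

Details
AI Chinese abstract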

心理健康对话模型越来越多地由基于AI的评估器进行评估,但这些评估器通常将表面共情、支持性或流畅性视为安全的证据。在本文中,我们研究了一种隐藏的失败模式,称为隐式谄媚:一个响应可能看似共情,但暗中强化灾难化、回避、绝望预测或CBT式标签。为了检查这个问题,我们引入了一个用于隐式谄媚检测的诊断基准,该基准基于三个代表性的心理健康对话来源构建,涵盖日常同伴支持、咨询式情感支持和危机导向互动,并进一步构建了一个泄漏审计的干净单响应匹配基准,包含500个上下文和1500个匹配响应窗口。然后,我们提出了动态情感签名图(DESG),一个结构化的离线审计框架,将基于LLM的状态提取与最终评分分离,并通过语义、情感和认知扭曲状态转移而非自由形式的LLM判断来评估临床方向。与元数据、表面风格、词汇、嵌入和基于规则的LLM基线不同,DESG对响应引起的临床状态变化方向进行评分;在泄漏审计的干净匹配基准上,DESG-StateRisk比最强的非DESG基线提高了0.0488 macro-F1,并实现了最佳的有害风险检测结果。这些结果表明,评估隐式谄媚需要显式的临床状态建模以及泄漏检查、捷径控制和竞争性基线。

English abstract

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows. We then propose Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that separates LLM-based state extraction from final scoring and evaluates clinical direction through semantic, affective, and cognitive-distortion state transitions rather than free-form LLM judgment. Unlike metadata, surface-style, lexical, embedding, and rubric-LLM baselines, DESG scores the direction of clinical-state change induced by a response; on the leakage-audited clean matched benchmark, DESG-StateRisk improves over the strongest non-DESG baseline by 0.0488 macro-F1 and achieves the best harmful-risk detection result. These results suggest that evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.

2605.02495 2026-05-26 cs.LG cs.AI stat.ML updated version

Efficient Preference Poisoning Attack on Offline RLHF

高效偏好投毒攻击离线RLHF

Chenye Yang, Weiyu Xu, Lifeng Lai

Affiliations * Department of Electrical and Computer Engineering, University of California, Davis, Davis, CA, USA; Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA, USA

AI summary Studies preference poisoning attacks on offline RLHF and proposes two methods, BAL-A and BMP-A, that cast label flipping as a binary sparse approximation problem over a gradient dictionary, enabling efficient label-flip attacks.

Comments Accepted to ICML 2026

Details
AI Chinese abstract

离线人类反馈强化学习(RLHF)流程(如直接偏好优化DPO)在预收集的偏好数据集上训练,使其容易受到偏好投毒攻击。我们研究了对数线性DPO的标签翻转攻击。首先说明翻转一个偏好标签会在DPO梯度中引起与参数无关的偏移。利用这一关键性质,我们可以将目标投毒问题转化为结构化的二进制稀疏近似问题。为解决该问题,我们开发了两种攻击方法:二进制感知格点攻击(BAL-A)和二进制匹配追踪攻击(BMP-A)。BAL-A将二进制翻转选择问题嵌入二进制感知格点,并应用Lenstra-Lenstra-Lovász约简和Babai最近平面算法;我们提供了强制二进制系数并恢复最小翻转目标的充分条件。BMP-A将二进制匹配追踪适应于我们的非归一化梯度字典,并给出基于相干性的恢复保证和$K$翻转预算的鲁棒性(不可能性)证书。在合成字典和斯坦福人类偏好数据集上的实验验证了理论,并突出了字典几何如何决定攻击成功。

English abstract

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label-flip attacks against log-linear DPO. We first show that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
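
The parameter-independent shift can be checked numerically. The sketch below assumes a log-linear policy, so the DPO margin is m = beta * (theta . dphi - ref_margin), where `dphi` is the feature difference between the two responses and `ref_margin` is the reference model's log-ratio. The per-example loss is -log sigmoid(label * m) with label = +1 for the clean label and -1 for the flipped one. This parameterization and all names are assumptions for illustration and may differ from the paper's notation.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def dpo_grad(theta, dphi, label, beta=0.1, ref_margin=0.0):
    """Gradient w.r.t. theta of -log sigmoid(label * m),
    where m = beta * (theta . dphi - ref_margin)."""
    m = beta * (sum(t * d for t, d in zip(theta, dphi)) - ref_margin)
    coeff = -label * beta * sigmoid(-label * m)
    return [coeff * d for d in dphi]


def flip_shift(theta, dphi, beta=0.1, ref_margin=0.0):
    """Gradient after flipping the label minus the clean gradient."""
    clean = dpo_grad(theta, dphi, +1, beta, ref_margin)
    flipped = dpo_grad(theta, dphi, -1, beta, ref_margin)
    return [f - c for f, c in zip(flipped, clean)]
```

Under these assumptions the shift equals beta * dphi for every theta, because sigmoid(m) + sigmoid(-m) = 1. The set of per-example shifts can therefore be computed once and treated as a fixed dictionary.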

2604.23853 2026-05-26 cs.AI updated version

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

ClawTrace: 面向LLM智能体技能蒸馏的成本感知追踪

Boqin Yuan, Yue Su, Renchu Song, Sen Yang, Jing Qin

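Affiliations * University of California San Diego; Carnegie Mellon University

AI summary Skill-distillation pipelines lack a per-step cost signal. ClawTrace records cost-attributed traces and compiles them into TraceCards, from which CostCraft writes three kinds of skill patches: preserve, prune, and repair. Prune patches turn out to act as quality guardrails while preserve patches drive regressions, so the authors argue for evaluating reusable skills by rule type.

Comments Accepted at Agent Skills '26 Workshop, ACM Conference on AI and Agentic Systems (CAIS 2026), San José, CA, May 26, 2026

Details
AI Chinese abstract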

技能蒸馏管道从LLM智能体轨迹中学习可重用规则,但它们缺乏一个关键信号:每一步的成本。没有每步成本,管道无法区分添加缺失步骤以修复错误与移除从未影响结果的昂贵步骤。我们利用成本归因差距来探究蒸馏技能内部的规则类型是否以相同方式迁移到新任务。ClawTrace记录成本归因的智能体轨迹,并将每个会话编译成TraceCard;CostCraft读取TraceCard并编写三种技能补丁:保留、剪枝和修复。我们发现了一个聚合指标隐藏的模式。在30个保留的SpreadsheetBench任务上(两个种子),移除剪枝补丁大致使质量回归计数增加了三倍,而未降低中位成本。在整个84任务的SkillsBench迁移中,CostCraft未节省总成本。所有三个质量回归都追溯到保留通道,而两个质量提升都追溯到剪枝通道:剪枝补丁充当质量护栏,而保留补丁驱动回归。我们认为可重用的智能体技能应在规则类型层面进行评估,而不是作为整体指令包。为支持这一点,我们发布了ClawTrace、TraceCard模式以及全套类型化技能。

English abstract

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We use the cost-attribution gap to ask whether the rule types inside a distilled skill transfer the same way to new tasks. ClawTrace records cost-attributed agent traces and compiles each session into a TraceCard; CostCraft reads TraceCards and writes three kinds of skill patches: preserve, prune, and repair. We find a pattern aggregate metrics hide. On 30 held-out SpreadsheetBench tasks across two seeds, removing prune patches roughly tripled the quality-regression count without lowering median cost. Across the full 84-task SkillsBench transfer, CostCraft saves no aggregate cost. All three quality regressions trace to the preserve lane, and both quality wins trace to the prune lane: prune patches act as quality guardrails while preserve patches drive regressions. We argue that reusable agent skills should be evaluated at the rule-type level, not as monolithic instruction packages. To support this, we release ClawTrace, the TraceCard schema, and the full set of typed skills.

2604.23728 2026-05-26 cs.CV cs.AI updated version

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

ESIA:基于能量的时空交互感知框架用于行人意图预测

Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

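Affiliations * James Watt School of Engineering, University of Glasgow

AI summary Proposes the ESIA framework, which models spatiotemporal interactions with a conditional random field and an energy function, and predicts pedestrian intention using structural consistency constraints and a simulated-annealing solver. It reaches state-of-the-art performance on standard benchmarks with improved interpretability.

Comments 13 pages, 6 figures, 3 tables

Details
AI Chinese abstract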

自动驾驶的最新进展推动了行人意图预测的研究,该研究旨在通过建模时间动态、社交互动和环境背景来推断未来的过街决策和行动。然而,现有研究仍受限于过度简化的多智能体交互模式、不透明的推理逻辑以及行为预测中缺乏全局一致性,这损害了鲁棒性和可解释性。在这项工作中,我们提出了ESIA(基于能量的时空交互感知框架),一种新颖的基于条件随机场(CRF)的范式。我们将意图预测任务视为一个基于统一图表示的结构化预测问题,将行人和环境视为时空节点。为了表征它们的不同角色,我们为节点分配一元势能以捕捉个体意图,为边分配成对势能以编码社交和环境交互。这些势能被整合到一个统一的全局能量函数中,以确保行为预测的场景级一致性。为了在没有真实标签监督的情况下进一步约束推理,我们引入了结构一致性项来惩罚逻辑矛盾。该优化通过一种新颖的一元种子模拟退火(U-SSA)算法高效求解,该算法利用高置信度的一元先验快速收敛到高质量解。在标准基准上的大量实验表明,ESIA在现有方法中实现了最先进的性能,并具有更好的可解释性。

English abstract

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
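
The energy-minimization idea can be illustrated with a generic toy example. This is a sketch under simplifying assumptions rather than the U-SSA algorithm itself: labels are discrete, each node has a table of unary costs, each edge adds a cost `w` when its endpoints disagree, and annealing starts from the per-node unary minimum. The structural consistency terms are omitted, and all names and the cooling schedule are hypothetical.

```python
import math
import random


def energy(labels, unary, edges):
    """Global energy: sum of unary costs plus a penalty on disagreeing edges."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    total += sum(w for i, j, w in edges if labels[i] != labels[j])
    return total


def unary_seeded_annealing(unary, edges, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Simulated annealing initialised from each node's lowest unary cost."""
    rng = random.Random(seed)
    labels = [min(range(len(u)), key=u.__getitem__) for u in unary]
    current = energy(labels, unary, edges)
    best, best_energy = labels[:], current
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(labels))
        old = labels[i]
        labels[i] = rng.randrange(len(unary[i]))
        proposed = energy(labels, unary, edges)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if proposed <= current or rng.random() < math.exp((current - proposed) / temp):
            current = proposed
            if current < best_energy:
                best, best_energy = labels[:], current
        else:
            labels[i] = old
        temp *= cooling
    return best, best_energy
```

Seeding from the unary minimum means the returned energy can never be worse than the purely per-node prediction, since the best state seen so far is tracked throughout.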

2604.23295 2026-05-26 cs.CL cs.AI updated version

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Human-1 by Josh Talks: 基于真实对话的印地语全双工对话建模框架

Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

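Affiliations * JoshTalks

AI summary Adapts the Moshi architecture with a custom Hindi tokeniser and trains on 26,000 hours of real conversations to produce the first open, reproducible full-duplex spoken dialogue system for Hindi, with natural interruptions, overlaps, and backchannels.

Details
AI Chinese abstract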

全双工口语对话系统能够模拟自然的对话行为,如打断、重叠和反馈,然而这类系统在印度语言中仍基本未被探索。我们通过适配最先进的双工语音架构Moshi,使用自定义印地语分词器,并在从14,695名说话者收集的26,000小时真实自发对话数据(具有独立的说话者通道)上进行训练,提出了首个开放、可复现的印地语全双工口语对话系统,从而能够直接从自然交互中学习话轮转换和重叠模式。为了支持印地语文本生成,我们替换了原始英语分词器,并重新初始化了依赖于文本词汇的参数,同时保留了预训练的音频组件。我们提出了一种两阶段训练方案——大规模预训练,然后在1,000小时对话数据上进行微调。通过提示对话延续范式,结合自动评估指标和人工判断,评估结果表明生成的模型在印地语中表现出自然且有意义的全双工对话行为。这项工作为印地语及其他印度语言的实时双工口语对话系统迈出了第一步。

English abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe: large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

2604.11557 2026-05-26 cs.AI updated version

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall: 统一LLM智能体的工具使用表示、数据与评估

Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen

Affiliations * University of Science and Technology of China; Ningbo Institute of Digital Twin; Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

AI summary Proposes UniToolCall, a framework that standardizes toolset construction, dataset generation, and evaluation. With 22k+ tools, 390k+ training instances, and an Anchor Linkage mechanism, a fine-tuned Qwen3-8B reaches 93.0% single-turn Strict Precision in the Hybrid-20 setting, outperforming GPT, Gemini, and Claude.

Comments 21 pages, 10 figures, 9 tables. Code and datasets are publicly available at: https://github.com/EIT-NLP/UniToolCall

Details
AI Chinese abstract

工具使用能力是LLM智能体的基本组成部分,使其能够通过结构化函数调用与外部系统交互。然而,现有研究存在不一致的交互表示,很大程度上忽略了工具使用轨迹的结构分布,并依赖于不兼容的评估基准。我们提出了UniToolCall,一个统一的工具学习框架,标准化了从工具集构建、数据集生成到评估的整个流程。该框架整理了包含22k+工具的大型工具池,并通过结合10个标准化公共数据集与结构受控的合成轨迹,构建了包含390k+实例的混合训练语料库。它显式建模了多种交互模式,包括单跳与多跳、单轮与多轮,同时捕获了串行和并行执行结构。为了支持连贯的多轮推理,我们进一步引入了锚点链接机制,强制跨轮依赖关系。此外,我们将7个公共基准转换为统一的查询-动作-观察-答案(QAOA)表示,并在函数调用、轮次和对话级别进行细粒度评估。实验表明,在我们的数据集上微调Qwen3-8B显著提升了工具使用性能。在干扰项密集的Hybrid-20设置下,单轮严格精度达到93.0%,优于包括GPT、Gemini和Claude在内的商业模型。

English abstract

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query–Action–Observation–Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

2604.10783 2026-05-26 cs.AI cs.LG updated version

Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment

从临床叙述中学习基于偏好的目标用于动态脓毒症治疗

Daniel J. Tan, Jayne Hui Zhen Chan, Kai Wen Hwang, Arturo Yong Yao Neo, Kay Choong See, Mengling Feng

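Affiliations * Institute of Data Science, National University of Singapore, Singapore; National University Hospital, Singapore; Saw Swee Hock School of Public Health, National University of Singapore, Singapore

AI summary Proposes the CN-PR framework, which uses a large language model to extract trajectory-level preferences from discharge summaries and learns a reward function through preference optimization, improving sepsis treatment outcomes under offline reinforcement learning.

Details
AI Chinese abstract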

在医疗保健中为强化学习设计奖励函数仍然具有挑战性,因为临床有意义的结果稀疏、延迟且难以明确指定。尽管结构化临床数据捕获了生理状态,但它们往往无法反映患者轨迹的更广泛方面,如治疗反应、恢复动态和干预负担。相比之下,临床叙述编码了临床医生对疾病进展、治疗效果和恢复的纵向评估,提供了超越预定义结果指标的轨迹级监督的潜在来源。我们提出了临床叙述知情偏好奖励(CN-PR)框架,该框架通过将临床叙述视为轨迹级偏好的可扩展监督,直接从出院小结中学习奖励函数。使用大语言模型,我们推导出轨迹质量分数,并在患者轨迹之间构建成对偏好,通过基于偏好的优化来学习奖励。为了考虑叙述信息量的变异性,我们引入了一个任务相关性信号,根据监督与下游决策任务的相关性对其进行加权。我们在离线强化学习中评估了CN-PR在动态脓毒症治疗中的应用。学习到的奖励与轨迹质量分数表现出强烈的单调对齐,并产生了与改善恢复相关结果相关的策略,包括增加器官支持无天数和更快的休克解决,同时保持与基于结果的奖励基线相当的性能。这些发现在外部验证下得以保留。我们的结果表明,临床叙述为动态治疗方案中的奖励学习提供了可扩展且富有表现力的监督来源。

English abstract

Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are sparse, delayed, and difficult to explicitly specify. Although structured clinical data capture physiologic states, they often fail to reflect broader aspects of patient trajectories such as treatment response, recovery dynamics, and intervention burden. Clinical narratives, by contrast, encode longitudinal clinician assessments of disease progression, treatment effectiveness, and recovery, providing a potential source of trajectory-level supervision beyond predefined outcome metrics. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework that learns reward functions directly from discharge summaries by treating clinical narratives as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores and construct pairwise preferences between patient trajectories to learn rewards through preference-based optimization. To account for variability in narrative informativeness, we incorporate a task relevance signal that weights supervision according to its relevance to the downstream decision-making task. We evaluate CN-PR in dynamic sepsis treatment using offline RL. The learned reward demonstrated strong monotonic alignment with trajectory quality scores and produced policies associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining mortality performance comparable to outcome-based reward baselines. These findings were preserved under external validation. Our results suggest that clinical narratives provide a scalable and expressive source of supervision for reward learning in dynamic treatment regimes.
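
The step from quality scores to a preference-based reward objective can be sketched as follows. This is an illustration under assumptions, not the CN-PR implementation: pairs are formed between all trajectories whose scores differ by more than `min_gap`, each pair is weighted by the smaller of the two relevance signals (a hypothetical choice), and the loss is a weighted Bradley-Terry negative log-likelihood.

```python
import math
from itertools import combinations


def build_preferences(scores, relevance, min_gap=0.0):
    """Return (winner, loser, weight) triples from trajectory quality scores."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        gap = scores[i] - scores[j]
        if abs(gap) <= min_gap:
            continue  # too close to call; skip the pair
        winner, loser = (i, j) if gap > 0 else (j, i)
        pairs.append((winner, loser, min(relevance[i], relevance[j])))
    return pairs


def preference_loss(rewards, pairs):
    """Weighted Bradley-Terry loss: mean of w * -log sigmoid(r_win - r_lose)."""
    total_weight = sum(w for _, _, w in pairs)
    if total_weight == 0:
        raise ValueError("no weighted preference pairs")
    loss = sum(w * math.log1p(math.exp(-(rewards[a] - rewards[b])))
               for a, b, w in pairs)
    return loss / total_weight
```

A reward model that ranks trajectories in the same order as the quality scores gets a lower loss than one that reverses the order, which is the signal a learner would follow.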

2604.08870 2026-05-26 cs.LG cs.AI updated version

Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

学习分析中的时间辍学风险:跨动态与早期窗口表示的协调生存基准

Rafael da Silva, Jeff Eicher, Gregory Longo

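Affiliations * Applied Data Science Program; Eastern University

AI summary Using the OULAD dataset, evaluates dropout-risk models with a harmonized survival-analysis benchmark (a dynamic weekly representation and a continuous-time representation) and finds that temporal-behavioral features are more predictive than static background attributes.

Comments 34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables

Details
AI Chinese abstract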

学生辍学是学习分析中持续关注的问题,然而比较研究经常在异质协议下评估预测模型,优先考虑区分度而非时间可解释性和校准。本研究引入了一个面向生存的基准,用于使用开放大学学习分析数据集(OULAD)进行时间辍学风险建模。比较了两个协调分支:一个动态周分支,采用人-时期表示的模型;以及一个可比较的连续时间分支,扩展了模型家族——基于树的生存模型、参数模型和神经网络模型。评估协议整合了四个分析层面:预测性能、消融、可解释性和校准。结果在每个分支内分别报告,因为跨分支单一排名在方法论上不合理。在可比较分支中,随机生存森林在区分度和特定时间点的Brier分数上领先;在动态分支中,泊松分段指数在紧密的五家族聚类中在综合Brier分数上略微领先。无重抽样自举变异将这些位置视为方向性信号而非绝对优势。消融和可解释性分析在所有家族中收敛于一个共同发现:主导预测信号主要不是人口统计学或结构性的,而是时间和行为性的。校准在更好区分的模型中证实了这一模式,但XGBoost AFT除外,它表现出系统性偏差。这些结果支持在学习分析中采用协调的多维基准的价值,并将辍学风险定位为一个时间行为过程,而非静态背景属性的函数。

English abstract

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of model families: tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
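
For readers unfamiliar with the person-period representation used in the dynamic weekly arm, here is a minimal sketch. It assumes each student is summarised as `(student_id, last_observed_week, dropped)`; that record layout and the function names are hypothetical and do not reflect OULAD's actual schema.

```python
def to_person_period(records):
    """Expand one row per student into one row per student-week.

    The event flag is 1 only in the final observed week of a student who
    dropped out; censored students contribute rows with event = 0 only.
    """
    rows = []
    for student_id, last_week, dropped in records:
        for week in range(1, last_week + 1):
            event = int(dropped and week == last_week)
            rows.append((student_id, week, event))
    return rows


def weekly_hazard(rows):
    """Empirical discrete-time hazard: events divided by students at risk."""
    at_risk, events = {}, {}
    for _, week, event in rows:
        at_risk[week] = at_risk.get(week, 0) + 1
        events[week] = events.get(week, 0) + event
    return {week: events[week] / at_risk[week] for week in sorted(at_risk)}
```

In this layout any binary classifier fitted on the student-week rows estimates a weekly hazard, which is what makes the dynamic arm directly interpretable over time.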

2604.08213 2026-05-26 cs.CV cs.AI updated version

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

EditCaption: 用于图像编辑指令合成的人工精炼SFT与HAE-DPO

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

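Affiliations * Peking University; The Chinese University of Hong Kong, Shenzhen; Tsinghua University; Beihang University; Xiaohongshu Inc.

AI summary Proposes EditCaption, a two-stage post-training pipeline that improves image-editing instruction synthesis through human-refined SFT and Hardness-Adaptive Error-Aware DPO (HAE-DPO), substantially lowering the critical-error rate and surpassing existing models.

Details
AI Chinese abstract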

高质量的源-目标图像对及精确的编辑指令对于指令引导的图像编辑至关重要,但大规模构建此类训练三元组成本高昂。最近的流程通常依赖视觉语言模型自动合成编辑指令,但我们发现强大的VLM仍难以描述图像对之间的视觉变换。具体而言,它们表现出三种反复出现的失败模式:方向不一致、视角模糊和缺少细粒度属性。在400个图像对的人工评估中,多个开源VLM基线产生超过47%的关键错误率,使得许多合成指令不适合下游训练。为解决此问题,我们提出EditCaption,一种用于图像编辑指令合成的两阶段后训练流程。首先,通过基于GLM的自动字幕生成、EditScore过滤和人工精炼构建100K监督微调数据集。其次,收集10K人工标注的偏好对,其中每个被拒绝的指令都标注了其主要错误类型和严重程度。基于此数据集,我们提出难度自适应错误感知DPO(HAE-DPO),一种任务适配的DPO目标,它引入了基于人工标注的严重程度、失败模式类型和参考模型难度的自适应边界。在三个基准上的实验表明,我们的235B模型经过SFT+HAE-DPO后在开源和闭源模型中达到最先进性能,在Eval-400、HQ-Edit和ByteMorph-Bench上分别获得4.720、4.672和4.651分——在所有三个基准上均超越Gemini-3-Pro。人工评估证实关键错误率从47.75%降至17.50%,正确率从41.75%提升至70.25%,超越Gemini-3-Pro(66.00%)。

English abstract

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis. First, we construct a 100K supervised fine-tuning dataset through GLM-based auto-captioning, EditScore filtering, and human refinement. Second, we collect 10K human-annotated preference pairs, where each rejected instruction is labeled with its primary error type and severity. Based on this dataset, we propose Hardness-Adaptive Error-Aware DPO (HAE-DPO), a task-adapted DPO objective that introduces an adaptive margin based on human-labeled severity, failure-mode type, and reference-model hardness. Experiments across three benchmarks demonstrate that our 235B model with SFT+HAE-DPO achieves state-of-the-art performance among open-source and closed models, scoring 4.720 on Eval-400, 4.672 on HQ-Edit, and 4.651 on ByteMorph-Bench, surpassing Gemini-3-Pro on all three. Human evaluation confirms critical error rates drop from 47.75% to 17.50%, with correct rates improving from 41.75% to 70.25%, surpassing Gemini-3-Pro (66.00%).

2604.07039 2026-05-26 cs.RO cs.AI updated version

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

AEROS:一种具有具身能力模块的单智能体操作架构

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Affiliations * School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University

AI summary Proposes the AEROS architecture, which models a robot as a single persistent intelligent subject whose capabilities are extended through installable Embodied Capability Modules, providing modular extensibility, composable capability execution, and consistent system-level safety.

Comments Submitted to Engineering Applications of Artificial Intelligence (EAAI). 48 pages, 5 figures, 9 tables

Details
AI Chinese abstract

机器人系统缺乏一种原则性的抽象来统一组织智能、能力和执行。现有方法要么在单体架构中耦合技能,要么将功能分解为松散协调的模块或多个智能体,通常缺乏一致的标识和控制权限模型。我们认为,机器人应被建模为一个单一的持久智能主体,其能力通过可安装的包来扩展。我们将这一观点形式化为AEROS(智能体执行运行时操作系统),其中每个机器人对应一个持久智能体,能力通过具身能力模块(ECM)提供。每个ECM封装了可执行技能、模型和工具,而执行约束和安全保证由策略分离的运行时强制执行。这种分离实现了模块化可扩展性、可组合能力执行和一致的系统级安全。我们在PyBullet仿真中使用Franka Panda 7自由度机械臂评估了一个参考实现,进行了八项实验,涵盖重新规划、故障恢复、策略执行、基线比较、跨任务通用性、ECM热插拔、消融和故障边界分析。每个条件下超过100次随机试验,AEROS在三个任务上实现了100%的任务成功率,而基线(BehaviorTree.CPP风格和ProgPrompt风格为92-93%,扁平流水线为67-73%);策略层阻止了所有无效动作,零误接受;运行时优势跨任务泛化,无需特定任务调整;ECM在运行时加载,交换后成功率为100%。

English abstract

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (BehaviorTree.CPP-style and ProgPrompt-style at 92–93%, flat pipeline at 67–73%); the policy layer blocks all invalid actions with zero false acceptances; runtime benefits generalize across tasks without task-specific tuning; and ECMs load at runtime with 100% post-swap success.

2603.28716 2026-05-26 cs.AI updated version

Dynamic Dual-Granularity Skill Bank for Agentic RL

动态双粒度技能库用于智能体强化学习

Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dong Li, Dongbin Zhao

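Affiliations * Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; Sun Yat-Sen University; MemoraX AI

AI summary Proposes D2Skill, a dynamic dual-granularity skill bank in which task skills give high-level guidance and step skills give fine-grained decision support. The performance gap between paired baseline and skill-injected rollouts drives both skill updates and policy optimization, yielding substantial gains on ALFWorld and other tasks.

Comments 19 pages

Details
AI Chinese abstract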

智能体强化学习可以从可重复使用的经验中显著受益,但现有的基于技能的方法主要提取轨迹级指导,并且通常缺乏维护不断演化的技能记忆的原则性机制。我们提出D2Skill,一种用于智能体强化学习的动态双粒度技能库,将可重复使用的经验组织成任务技能(用于高层指导)和步骤技能(用于细粒度决策支持和错误纠正)。D2Skill通过在同一策略下进行配对的基线回放和技能注入回放,利用它们的性能差距推导出事后效用信号,用于技能更新和策略优化。技能库完全由训练时的经验构建,通过反思不断扩展,并通过效用感知的检索和修剪进行维护。在ALFWorld、WebShop和Search-Augmented QA任务上的实验表明,D2Skill在不同规模的模型上显著优于无技能的基线。进一步的消融和分析表明,双粒度技能建模和动态技能维护对这些增益至关重要,而学习到的技能表现出更高的效用,能够跨评估设置迁移,并且仅带来适度的训练开销。

English abstract

Agentic RL can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld, WebShop, and Search-Augmented QA tasks show that D2Skill substantially improves performance over skill-free baselines across models of different scales. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
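
The paired-rollout idea can be made concrete with a small sketch. It assumes the hindsight utility of a skill is the difference in mean return between skill-injected and baseline rollouts, tracked with an exponential moving average and pruned below a threshold. The update rule, the threshold, and every name here are hypothetical; the abstract does not specify how D2Skill computes or maintains utilities.

```python
def hindsight_utility(baseline_returns, skill_returns):
    """Mean return with the skill injected minus mean baseline return."""
    with_skill = sum(skill_returns) / len(skill_returns)
    without_skill = sum(baseline_returns) / len(baseline_returns)
    return with_skill - without_skill


def update_skill_bank(bank, skill_id, gap, lr=0.5, prune_below=-0.1):
    """Blend the new gap into the stored utility; drop skills that hurt."""
    old = bank.get(skill_id, 0.0)
    new = (1 - lr) * old + lr * gap
    if new < prune_below:
        bank.pop(skill_id, None)  # utility-aware pruning
    else:
        bank[skill_id] = new
    return bank
```

Because both rollouts run under the same policy, the gap isolates the contribution of the injected skill rather than improvements in the policy itself.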

2603.18908 2026-05-26 cs.AI updated version

Characterizing Linear Alignment Across Language Models

表征语言模型间的线性对齐

Matt Gorbett, Suman Jana

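Affiliations * Independent Researcher; Department of Computer Science; Columbia University

AI summary Studies whether independently trained large language models can be linearly aligned, and explores applications to text generation, embedding classification, out-of-distribution detection, and privacy-preserving cross-silo inference.

Details
AI Chinese abstract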

语言模型似乎越来越多地学习到相似的表示,尽管训练目标、架构和数据模态存在差异。这种独立训练模型之间新兴的兼容性为跨模型对齐下游目标带来了新的机会。此外,这种能力解锁了新的潜在应用领域,例如在安全、隐私或竞争约束禁止直接数据或模型共享的场景中。在这项工作中,我们研究了表示收敛在多大程度上实现了大语言模型之间的实用线性对齐。具体来说,我们学习独立模型最终隐藏状态之间的仿射变换,并在文本生成、嵌入分类和分布外检测中经验性地评估这些映射。我们发现,模型对之间的性能基本保持不变,并首次证明线性对齐有时能够实现跨独立训练模型的文本生成。我们进一步强调了线性对齐在隐私保护跨孤岛推理中的潜在应用。该框架在共享公共数据集上学习仿射变换,并使用同态加密来保护客户端查询。通过仅加密线性分类操作,该方法实现了亚秒级推理延迟。

English abstract

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, this capability unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we investigate the extent to which representational convergence enables practical linear alignment between large language models. Specifically, we learn affine transformations between the final hidden states of independent models and empirically evaluate these mappings across text generation, embedding classification, and out-of-distribution detection. We find that performance is largely preserved across model pairs, and show for the first time that linear alignment sometimes enables text generation across independently trained models. We further highlight a potential application of linear alignment for privacy-preserving cross-silo inference. The framework learns an affine transformation over a shared public dataset and uses homomorphic encryption to protect client queries. By encrypting only the linear classification operation, the method achieves sub-second inference latency.

2603.16105 2026-05-26 cs.CL cs.AI 版本更新

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

频率至关重要:用于剪枝和量化的快速模型无关数据筛选

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

发表机构 * University of Trento(特伦托大学)

AI总结 提出一种基于Zipf幂律的模型无关数据筛选策略ZipCal,通过最大化词汇多样性来选择校准数据,在剪枝和量化中实现与依赖模型困惑度的最先进方法相当的性能,且速度快约240倍。

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages

AI中文摘要

训练后模型压缩对于增强大型语言模型(LLMs)的可移植性同时保持其性能至关重要。虽然已经提出了几种压缩方法,但较少关注选择最合适的数据集(所谓的校准数据)来寻找压缩模型配置。校准数据的选择是保留模型在任务内和任务间能力的关键步骤。在这项工作中,我们通过分析内在数据属性而非模型特定信号,解决了为剪枝和量化识别高性能校准集的挑战。我们引入了ZipCal,一种基于Zipf幂律最大化词汇多样性的模型无关数据筛选策略。实验表明,我们的方法在各种剪枝基准测试中始终优于标准的均匀随机采样。值得注意的是,在下游性能方面,它与依赖模型困惑度的最先进方法表现相当。后者在大规模模型和数据集上变得极其昂贵,而ZipCal由于其可处理的线性复杂度,平均快约240倍。

英文摘要

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large-scale models and datasets, while ZipCal is on average about 240× faster due to its tractable linear complexity. The code and experiments are available at https://github.com/FrancescoMonaco/ZipCal.

2603.12983 2026-05-26 cs.CL cs.AI 版本更新

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

人工标注是否必要?用于机器翻译错误跨度检测的迭代MBR蒸馏

Boxuan Lyu, Haiyue Song, Zhi Qu

发表机构 * Institute of Science Tokyo(东京科学研究所) National Institute of Information and Communications Technology(信息与通信技术国家研究所)

AI总结 提出一种基于最小贝叶斯风险解码的迭代MBR蒸馏自演化框架,利用现成大语言模型生成伪标签,无需人工标注即可在错误跨度检测任务上超越监督基线。

AI中文摘要

错误跨度检测(ESD)是机器翻译(MT)评估中的一个关键子任务,旨在识别翻译错误的位置和严重程度。虽然对人工标注数据微调模型能提升ESD性能,但获取此类数据成本高昂且标注者之间容易不一致。为解决这一问题,我们提出一种基于最小贝叶斯风险(MBR)解码的新型自演化框架,命名为用于ESD的迭代MBR蒸馏,该框架通过利用现成的大语言模型(LLM)生成伪标签,消除了对人工标注的依赖。在WMT Metrics Shared Task数据集上的大量实验表明,仅在这些自生成伪标签上训练的模型在系统和跨度层面上均优于未适应的基础模型和基于人工标注的有监督基线,同时保持有竞争力的句子级性能。

英文摘要

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
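For readers unfamiliar with MBR decoding, the selection step can be sketched as follows. This is a generic illustration rather than the paper's pipeline. The helper names `token_f1` and `mbr_select`, the token-overlap utility, and the toy candidate strings are all assumptions made for this example; an ESD system would compare span annotations with a task-specific utility.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1, used here only as a stand-in utility function."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest average utility against the other samples."""
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / max(len(others), 1)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical sampled outputs from one model for the same input.
samples = [
    "minor error at word three",
    "minor error at word three and four",
    "major error at word nine",
]
print(mbr_select(samples))
```

The candidate that agrees most with the rest of the sample pool is chosen, which is why MBR outputs can serve as pseudo-labels without a human reference.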

2603.09943 2026-05-26 cs.AI 版本更新

PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

PathMem: 面向病理学多模态大模型的认知对齐记忆转换

Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu

发表机构 * University of Science and Technology of China(中国科学技术大学) Shenzhen University(深圳大学) Nanyang Technological University(南洋理工大学) Imperial College London(伦敦帝国学院) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出PathMem框架,通过长期记忆与工作记忆的动态转换机制,实现结构化病理知识整合与可解释记忆控制,在WSI报告生成和开放诊断任务上达到SOTA。

AI中文摘要

计算病理学需要视觉模式识别与结构化领域知识(包括分类学、分级标准和临床证据)的动态整合。在实践中,诊断推理需要将形态学证据与正式诊断和分级标准联系起来。尽管多模态大语言模型(MLLMs)展现出强大的视觉语言推理能力,但它们缺乏结构化知识整合和可解释记忆控制的显式机制。因此,现有模型在推理过程中难以一致地融入病理学特定的诊断标准。受人类病理学家层级记忆过程的启发,我们提出了PathMem,一种面向病理学MLLMs的以记忆为中心的多模态框架。PathMem将结构化病理知识组织为长期记忆(LTM),并引入记忆变换器(Memory Transformer),通过多模态记忆激活和上下文感知知识接地建模从LTM到工作记忆(WM)的动态转换,从而实现用于下游推理的上下文感知记忆细化。PathMem在多个基准测试中达到SOTA性能,在WSI-Bench报告生成(WSI-Precision提升12.8%,WSI-Relevance提升10.1%)和开放式诊断任务上分别比先前的基于WSI的模型提升9.7%和8.9%。

英文摘要

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.

2603.03354 2026-05-26 q-bio.NC cs.AI 版本更新

Non-Invasive Reconstruction of Intracranial EEG Across the Deep Temporal Lobe from Scalp EEG based on Conditional Normalizing Flow

基于条件归一化流的从头皮脑电无创重建深颞叶颅内脑电

Dongyi He, Bin Jiang, Kecheng Feng, Luyin Zhang, Ling Liu, Yuxuan Li, Yun Zhao, He Yan

发表机构 * School of Artificial Intelligence, Chongqing University of Technology(重庆理工大学人工智能学院) School of Smart Health, Chongqing Polytechnic University of Electronic Technology(重庆电子工程职业大学智能健康学院)

AI总结 提出NeuroFlowNet,一种基于条件归一化流的跨模态生成框架,首次从头皮脑电信号重建整个深颞叶区域的颅内脑电信号,解决了高保真重建的难题。

AI中文摘要

尽管从无创头皮脑电图(sEEG)获取深部脑活动对神经科学和临床诊断至关重要,但直接生成高保真颅内脑电图(iEEG)信号仍是一个基本未探索的领域,限制了对深部脑动力学的理解。当前研究主要集中于传统信号处理或源定位方法,这些方法难以捕捉iEEG的复杂波形和随机特性。为应对这一关键挑战,本文引入NeuroFlowNet,一种新颖的跨模态生成框架,其核心贡献在于首次利用sEEG信号重建整个深颞叶区域的iEEG信号。NeuroFlowNet基于条件归一化流(CNF),通过可逆变换直接建模复杂条件概率分布,从而显式捕捉脑信号的随机性,从根本上避免了现有生成模型中常见的模式崩溃问题。此外,该模型集成了多尺度架构和自注意力机制,以稳健地捕捉细粒度时间细节和长程依赖关系。在公开的同步sEEG-iEEG数据集上的验证结果表明,NeuroFlowNet在时间波形保真度、频谱特征再现和功能连接恢复方面具有有效性。本研究为深部脑动力学的无创分析建立了一种更可靠、可扩展的新范式。该研究的代码可在https://github.com/hdy6438/NeuroFlowNet获取。

英文摘要

Although obtaining deep brain activity from non-invasive scalp electroencephalography (sEEG) is crucial for neuroscience and clinical diagnosis, directly generating high-fidelity intracranial electroencephalography (iEEG) signals remains a largely unexplored field, limiting our understanding of deep brain dynamics. Current research primarily focuses on traditional signal processing or source localization methods, which struggle to capture the complex waveforms and stochastic characteristics of iEEG. To address this critical challenge, this paper introduces NeuroFlowNet, a novel cross-modal generative framework whose core contribution lies in the first-ever reconstruction of iEEG signals from the entire deep temporal lobe region using sEEG signals. NeuroFlowNet is built on Conditional Normalizing Flow (CNF), which directly models complex conditional probability distributions through invertible transformations, thereby explicitly capturing the randomness of brain signals and fundamentally avoiding the mode collapse issues common in existing generative models. Additionally, the model integrates a multi-scale architecture and self-attention mechanisms to robustly capture fine-grained temporal details and long-range dependencies. Validation results on a publicly available synchronized sEEG-iEEG dataset demonstrate NeuroFlowNet's effectiveness in terms of temporal waveform fidelity, spectral feature reproduction, and functional connectivity restoration. This study establishes a more reliable and scalable new paradigm for non-invasive analysis of deep brain dynamics. The code of this study is available at https://github.com/hdy6438/NeuroFlowNet.

2603.00857 2026-05-26 cs.LG cs.AI 版本更新

MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules

MultiPUFFIN:用于小分子性质预测的多模态领域约束基础模型

Idelfonso B. R. Nogueira, Carine M. Rebello, Mumin Enis Leblebici, Erick Giovani Sperandio Nascimento

发表机构 * Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU)(挪威科学与技术大学化学工程系) Faculty of Industrial Engineering, KU Leuven(鲁文大学工业工程学院) University of Surrey(萨里大学)

AI总结 提出多模态基础模型MultiPUFFIN,融合SMILES、2D图、3D构象及实验条件,通过条件感知精炼和热力学约束头,在小样本下优于ChemBERTa-2,预测小分子热物理性质。

AI中文摘要

MultiPUFFIN是一个领域信息多模态基础模型,用于预测小分子的热物理性质,填补了化学工程、药物发现和材料科学中的关键空白。现有的分子基础模型在数百万分子上预训练以学习通用表示,但其标准MLP输出层不施加物理约束,蒸汽压预测可能违反温度单调依赖性,粘度曲线可能缺乏过程模拟器所需的功能形式。保证热力学一致性的领域信息方法仍局限于单一性质和少量数据集,而多模态基础模型则侧重于生物活性而非热物理性质。MultiPUFFIN通过双向跨模态注意力和门控融合融合SMILES序列、2D分子图和3D构象几何,并辅以实验条件和分子描述符的辅助编码器,填补了这一空白。骨干网络使用三种互补的自监督目标在500,000个未标记的PubChem分子上预训练。一个条件感知精炼堆栈包含五个条件器(温度、pH、压力、多晶型和测量方法),将每个性质路由到一个四头锦标赛,选择该性质性能最佳的热力学信息头。MultiPUFFIN的平均测试R²为0.784,在所有九个性质上优于微调的ChemBERTa-2,尽管训练使用的标记分子数量少了约2000倍。

英文摘要

MultiPUFFIN is a domain-informed multimodal foundation model for predicting thermophysical properties of small molecules, addressing a critical gap in chemical engineering, drug discovery, and materials science. Existing molecular foundation models pretrain on millions of molecules to learn general-purpose representations, but their standard MLP output layers impose no physical constraints: vapor pressure predictions may violate monotonic temperature dependence, and viscosity curves may lack the functional form required by process simulators. Domain-informed approaches that guarantee thermodynamic consistency have remained limited to single properties and small datasets, whereas multimodal foundation models have focused on biological activity rather than thermophysical properties. MultiPUFFIN fills this gap by fusing SMILES sequences, 2D molecular graphs, and 3D conformer geometries through bidirectional cross-modal attention and gated fusion, supplemented by auxiliary encoders for experimental conditions and molecular descriptors. The backbone is pretrained on 500,000 unlabelled PubChem molecules using three complementary self-supervised objectives. A condition-aware refinement stack of five conditioners (temperature, pH, pressure, polymorph, and measurement method) routes each property to a four-head tournament that selects the best-performing thermodynamically informed head for that property. MultiPUFFIN achieves a mean test R² of 0.784 and outperforms fine-tuned ChemBERTa-2 on all nine properties despite training on roughly 2,000× fewer labeled molecules.

2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

从试错中学习:具身大语言模型的反思式测试时规划

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) Northwestern University(西北大学)

AI总结 提出反思式测试时规划方法,通过行动中反思和行动后反思两种模式,结合回溯性反思,使具身智能体在测试时进行自我纠正和经验积累,显著提升长程任务性能。

AI中文摘要

具身大语言模型赋予机器人高级任务推理能力,但它们无法反思错误原因,导致部署成为一系列独立尝试,错误重复而非积累经验。借鉴人类反思实践,我们引入反思式测试时规划,整合两种反思模式:行动中反思,代理在行动前利用测试时扩展生成并评分多个候选行动,基于内部反思;以及行动后反思,利用测试时训练,根据执行后的外部反思更新内部反思模型和行动策略。我们还包含回溯性反思,允许代理重新评估早期决策,并利用后见之明进行模型更新,实现适当的长程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验表明,与基线模型相比有显著提升,并能零样本泛化到逼真的HM3D环境以及在Franka Panda机械臂上的真实机器人实验。消融实验证实,行动中反思和行动后反思相互依赖,且回溯性反思在较低计算开销下比逐步外部反馈实现更好的信用分配。定性分析进一步突出了通过反思进行的行为纠正。

英文摘要

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.

2602.20210 2026-05-26 cs.LG cs.AI 版本更新

Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling

多模态晶体流:面向统一晶体建模的任意模态生成

Kiyoung Seong, Sungsoo Ahn, Sehui Han, Changyoung Park

发表机构 * Graduate School of AI, KAIST, Seoul, South Korea(韩国科学技术院人工智能研究生院,首尔,韩国) Materials Intelligence Lab, LG AI Research, Seoul, South Korea(LG AI研究所材料智能实验室,首尔,韩国)

AI总结 提出多模态晶体流(MCFlow),一种统一的多模态流模型,通过原子类型和晶体结构的独立时间变量实现多种晶体生成任务,并在MP-20和MPTS-52基准上达到与任务特定基线竞争的性能。

AI中文摘要

晶体建模涵盖一系列条件和非条件生成任务,包括晶体结构预测(CSP)和从头生成(DNG)。尽管最近的深度生成模型表现出有前景的性能,但它们仍然主要是任务特定的,缺乏跨任务共享晶体表示的统一框架。为了解决这一限制,我们提出了多模态晶体流(MCFlow),一种统一的多模态流模型,通过原子类型和晶体结构的独立时间变量将多种晶体生成任务实现为不同的推理轨迹。为了在标准Transformer模型中实现多模态流,我们引入了一种具有层次排列增强的组合和对称感知原子排序,无需显式结构模板即可注入组合和晶体学先验。在MP-20和MPTS-52基准上的实验表明,单个MCFlow模型在CSP、DNG和结构条件原子类型生成方面与任务特定基线具有竞争力。

英文摘要

Crystal modeling spans a family of conditional and unconditional generation tasks, including crystal structure prediction (CSP) and de novo generation (DNG). While recent deep generative models have shown promising performance, they remain largely task-specific, lacking a unified framework that shares crystal representations across tasks. To address this limitation, we propose Multimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, injecting compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that a single MCFlow model is competitive with task-specific baselines across CSP, DNG, and structure-conditioned atom type generation.

2602.18640 2026-05-26 cs.AI 版本更新

Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

解码机器学习决策:面向大规模排序系统的智能体推理框架

Longfei Yun, Yihan Wu, Haoran Liu, Xiaoxuan Liu, Ziyun Xu, Yi Wang, Yang Xia, Pengfei Wang, Mingze Gao, Yunxiang Wang, Changfan Chen, Wenjie Fu, Hong Yan, Junfeng Pan

发表机构 * Meta

AI总结 提出GEARS框架,通过智能体技能封装排序专家知识,将排序优化转化为自主发现过程,实现高层意图驱动的系统调控并保证生产可靠性。

Comments 12 pages, 5 figures

AI中文摘要

现代大规模排序系统在竞争目标、操作约束和不断变化的产品需求的复杂环境中运行。该领域的进展越来越受到工程上下文约束的瓶颈:将模糊的产品意图转化为合理、可执行、可验证的假设的艰巨过程,而不仅仅是建模技术本身。我们提出了GEARS(生成式智能体排序系统引擎),这是一个将排序优化重新定义为可编程实验环境中的自主发现过程的框架。GEARS不是将优化视为静态模型选择,而是利用专门智能体技能将排序专家知识封装为可复用的推理能力,使操作者能够通过高层意图(如氛围个性化)来引导系统。此外,为确保生产可靠性,该框架集成了验证钩子以强制执行统计稳健性,并过滤掉过度拟合短期信号的脆弱策略。跨不同产品表面的实验验证表明,GEARS通过协同算法信号与深度排序上下文,同时保持严格的部署稳定性,能够持续识别出接近帕累托最优的优越策略。

英文摘要

Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving product requirements. Progress in this domain is increasingly bottlenecked by the engineering context constraint: the arduous process of translating ambiguous product intent into reasonable, executable, verifiable hypotheses, rather than by modeling techniques alone. We present GEARS (Generative Engine for Agentic Ranking Systems), a framework that reframes ranking optimization as an autonomous discovery process within a programmable experimentation environment. Rather than treating optimization as static model selection, GEARS leverages Specialized Agent Skills to encapsulate ranking expert knowledge into reusable reasoning capabilities, enabling operators to steer systems via high-level intent (e.g., vibe personalization). Furthermore, to ensure production reliability, the framework incorporates validation hooks to enforce statistical robustness and filter out brittle policies that overfit short-term signals. Experimental validation across diverse product surfaces demonstrates that GEARS consistently identifies superior, near-Pareto-efficient policies by synergizing algorithmic signals with deep ranking context while maintaining rigorous deployment stability.

2602.17658 2026-05-26 cs.LG cs.AI cs.IT math.IT 版本更新

MARS: Margin and Semantic-Aware Data Augmentation for Reward Modeling

MARS:面向奖励建模的边界与语义感知数据增强

Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon

发表机构 * University of Arizona(亚利桑那大学) Northeastern University London(伦敦东北大学)

AI总结 提出MARS框架,通过优先增强低边界偏好对并利用语义距离细化,提升奖励模型质量和对齐性能。

AI中文摘要

奖励建模是RLHF、RLAIF和基于PPO的策略优化等对齐流程的核心,但其可靠性受限于有限且异构的人类偏好数据,这些数据难以大规模收集。虽然合成增强可以扩展偏好监督,但现有方法通常均匀增强或在表示层面增强,而不针对奖励模型不确定或容易误排序的示例。在本文中,我们介绍了MARS(面向奖励建模的边界与语义感知数据增强),一种自适应增强框架,优先考虑低边界偏好对,并使用语义距离作为第二层细化,以增强选择响应和拒绝响应之间的对比。在多个偏好数据集、奖励模型骨干、下游对齐设置以及包括RewardBench和AlpacaEval在内的基准测试中,MARS在奖励模型质量和对齐性能上都优于现有基线。我们的结果表明,当同时由模型边界和语义结构引导时,奖励模型增强最为有效。

英文摘要

Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constrained by limited and heterogeneous human preference data that are expensive to collect at scale. While synthetic augmentation can expand preference supervision, existing methods often augment uniformly or at the representation level, without targeting examples where the reward model is uncertain or prone to mis-ranking. In this paper, we introduce MARS (Margin and Semantic-Aware Data Augmentation for Reward Modeling), an adaptive augmentation framework that prioritizes low-margin preference pairs and uses semantic distance as a second layer for refinement to enhance the contrast between the chosen and rejected responses. Across multiple preference datasets, reward-model backbones, downstream alignment settings, and benchmarks including RewardBench and AlpacaEval, MARS improves both reward-model quality and alignment performance over existing baselines. Our results show that reward-model augmentation is most effective when guided by both model margins and semantic structure.
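The margin-based prioritization described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the MARS implementation: the function names, the dictionary layout of a preference pair, and the toy reward table are all hypothetical, and the semantic-distance refinement stage is omitted.

```python
def signed_margin(pair, reward_fn):
    """Reward-model margin: score(chosen) - score(rejected).
    Near-zero values are uncertain; negative values are mis-ranked."""
    return reward_fn(pair["chosen"]) - reward_fn(pair["rejected"])


def lowest_margin_pairs(pairs, reward_fn, k):
    """Pick the k pairs with the smallest signed margin as augmentation targets."""
    return sorted(pairs, key=lambda p: signed_margin(p, reward_fn))[:k]


# Toy reward model: a lookup table standing in for a learned scorer.
toy_scores = {"A+": 2.0, "A-": 0.5, "B+": 1.0, "B-": 0.9, "C+": 0.2, "C-": 0.6}
pairs = [
    {"chosen": "A+", "rejected": "A-"},  # confidently ranked
    {"chosen": "B+", "rejected": "B-"},  # barely separated
    {"chosen": "C+", "rejected": "C-"},  # mis-ranked by the toy model
]
selected = lowest_margin_pairs(pairs, toy_scores.get, k=2)
print([p["chosen"] for p in selected])
```

Sorting by the signed margin, rather than its absolute value, puts mis-ranked pairs ahead of merely uncertain ones; whether the paper makes the same choice is not stated in the abstract.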

2602.17234 2026-05-26 cs.AI cs.LG 版本更新

All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

所有泄漏都重要,有些泄漏更重要:LLM回测中可解释的时间污染检测与缓解

Zeyu Zhang, Ryan Chen, Bradly C. Stadie

发表机构 * Department of Statistics and Data Science, Northwestern University(统计与数据科学系,西北大学) Bridgewater AIA Labs(布里奇沃特AIA实验室)

AI总结 提出基于Shapley值的声明级评估框架Shapley-DCLR和推理时架构TimeSPEC,用于检测和缓解LLM回测中的时间污染问题。

Comments 8 pages plus appendix

AI中文摘要

对已解决事件进行回测的LLM假设模型仅基于截止前知识进行推理,然而预训练模型不可避免地泄漏截止后知识。我们引入了一个声明级评估框架,将预测理由分解为原子声明,并应用Shapley值量化每个声明的决策影响,从而得到Shapley-DCLR(Shapley加权的决策关键泄漏率),一个可解释的度量,用于衡量决策驱动推理中被污染的比例。我们进一步提出TimeSPEC(基于提取声明的时间监督预测),一种推理时架构,它将时间过滤的检索与声明级监督交织在一起,生成完全基于截止前证据的预测。在三个LLM上的消融实验证实了检索和监督共同必要;三项任务探测进一步说明,时间强制的性能成本与每个任务对截止后信息的依赖程度成正比。

英文摘要

Backtesting LLMs on resolved events assumes models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding Shapley-DCLR (Shapley-weighted Decision-Critical Leakage Rate), an interpretable metric measuring what fraction of decision-driving reasoning is contaminated. We further propose TimeSPEC (Time-Supervised Prediction with Extracted Claims), an inference-time architecture that interleaves temporally-filtered retrieval with claim-level supervision, producing predictions grounded entirely in pre-cutoff evidence. Across three LLMs, ablation experiments confirm that retrieval and supervision are jointly necessary, and a three-task probe further illustrates that the performance cost of temporal enforcement scales with each task's reliance on post-cutoff information.
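To make the idea of Shapley-weighting claims concrete, the sketch below computes exact Shapley values for a handful of claims and then the share of attribution mass that falls on claims flagged as leaked. The abstract does not give the metric's formula, so `weighted_leakage_rate` is an assumed reading, and the claim names, the additive toy value function, and the leak flags are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings.
    Cost grows factorially, so this only suits a small number of claims."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def weighted_leakage_rate(phi, leaked):
    """Assumed metric: fraction of absolute Shapley mass on post-cutoff claims."""
    total = sum(abs(v) for v in phi.values())
    if total == 0:
        return 0.0
    return sum(abs(v) for c, v in phi.items() if c in leaked) / total


# Toy setup: each claim shifts the prediction score by a fixed amount.
claims = ["c1", "c2", "c3"]
weights = {"c1": 0.1, "c2": 0.3, "c3": 0.2}


def confidence(coalition):
    return sum(weights[c] for c in coalition)


phi = shapley_values(claims, confidence)
rate = weighted_leakage_rate(phi, leaked={"c2"})
print(phi, rate)
```

Because the toy value function is additive, each claim's Shapley value equals its own weight; with interacting claims the values would differ, which is the case where this attribution is informative.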

2602.15811 2026-05-26 cs.CV cs.AI 版本更新

CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification

CARL-CXR:基于连续适配器路由的任务未知胸部X光片分类

Muthu Subash Kavitha, Anas Zafar, Amgad Muneer, Jia Wu

发表机构 * Department of Imaging Physics, The University of Texas MD Anderson Cancer Center(影像物理系,德克萨斯大学MD安德森癌症中心)

AI总结 提出CARL-CXR框架,通过固定高容量骨干网络、增量添加轻量级任务特定适配器和分类头,以及潜在任务选择器,解决任务未知推理下的胸部X光片增量分类问题,显著减少灾难性遗忘并提升路由准确性。

Comments 9 pages, 4 figures

AI中文摘要

胸部X光片分类器的临床部署需要模型能够在新数据集可用时进行更新,而无需对先前观察到的数据进行重新训练或降低已验证的性能。我们研究了任务未知推理下的任务增量连续学习设置,其中异质的胸部X光数据集顺序到达,且在部署时任务身份不可用。我们提出了CARL-CXR,一个基于连续适配器的路由框架,该框架保持固定的高容量骨干网络,同时增量引入轻量级任务特定适配器和分类头。一个潜在任务选择器基于适配器条件特征进行操作,将每个输入动态路由到最相关的任务路径,利用紧凑的任务原型和特征级经验回放来在顺序更新中保留任务身份,而无需存储原始图像。在MIMIC-CXR和CheXpert两个具有不同患者群体、成像设备和注释流程的大规模数据集上的实验表明,CARL-CXR实现了最小的灾难性遗忘(AUROC下降0.012),比已建立的连续学习基线LwF和EWC分别减少了6倍和11倍,同时保持了具有竞争力的诊断性能(AUROC 0.74)。在任务未知部署下,CARL-CXR在路由准确性上比联合训练高出12.5个百分点(75.0% vs. 62.5%):与LwF和EWC不同,后者在推理时需要明确的任务标识符且不提供路由机制。

英文摘要

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study a task-incremental continual learning setting for chest radiograph classification under task-unknown inference, where heterogeneous chest X-ray datasets arrive sequentially and task identity is unavailable at deployment time. We propose CARL-CXR, a continual adapter-based routing framework that maintains a fixed high-capacity backbone while incrementally introducing lightweight task-specific adapters and classifier heads. A latent task selector operates on adapter-conditioned features to dynamically route each input to the most relevant task pathway, leveraging compact task prototypes and feature-level experience replay to preserve task identity across sequential updates without storing raw images. Experiments on MIMIC-CXR and CheXpert, two large-scale datasets with distinct patient populations, imaging devices, and annotation pipelines, demonstrate that CARL-CXR achieves minimal catastrophic forgetting (0.012 AUROC drop), representing a 6× and 11× reduction over the established continual learning baselines LwF and EWC, respectively, while maintaining competitive diagnostic performance (AUROC 0.74). Under task-unknown deployment, CARL-CXR outperforms joint training by 12.5 points in routing accuracy (75.0% vs. 62.5%), whereas LwF and EWC require explicit task identifiers at inference and provide no routing mechanism.

2602.15620 2026-05-26 cs.CL cs.AI 版本更新

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

STAPO:通过抑制稀有虚假标记稳定大语言模型的强化学习

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

发表机构 * School of Vehicle Mobility & College of AI, Tsinghua University Didi Voyager Labs, DiDi Autonomous Driving

AI总结 针对强化学习微调大语言模型时因稀有虚假标记导致训练不稳定和性能崩溃的问题,提出STAPO方法,通过抑制这些标记的梯度扰动,在多个数学推理基准上实现稳定训练和性能提升。

AI中文摘要

强化学习显著提升了大语言模型的推理能力,但现有的强化学习微调方法严重依赖熵正则化和重加权等启发式技术来维持稳定性。实践中,这些方法常遭遇后期性能崩溃,导致推理质量下降和训练不稳定。我们识别出这一不稳定的关键因素:一小部分标记(称为虚假标记,约占0.01%)对推理结果贡献甚微,但由于继承了完整的序列级奖励而获得不成比例放大的梯度更新。我们提出了一个统一框架,用于评估虚假风险、梯度范数和熵变化下标记级优化影响。基于对严重破坏优化的标记特征的分析,我们提出了抑制虚假标记(S2T)机制,以有效抑制其梯度扰动。将该机制融入基于组的目标中,我们提出了虚假标记感知策略优化(STAPO),促进了稳定有效的大规模模型优化。在使用Qwen 1.7B、8B和14B基础模型的六个数学推理基准上,STAPO一致展现出优越的熵稳定性,并在GRPO、20-Entropy和JustRL基础上平均性能提升11.49%($\rho_{\mathrm{T}}$=1.0, top-p=1.0)和3.73%($\rho_{\mathrm{T}}$=0.7, top-p=0.9)。

英文摘要

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.

2602.11534 2026-05-26 cs.LG cs.AI 版本更新

Krause Synchronization Transformers

Krause同步变换器

Jingkun Liu, Yisong Yue, Max Welling, Yue Song

发表机构 * Shanghai Qi Zhi Institute(上海启智研究院) College of AI, Tsinghua University(清华大学人工智能学院) Shanghai Jiao Tong University(上海交通大学) California Institute of Technology(加州理工学院) University of Amsterdam(阿姆斯特丹大学)

AI总结 提出基于有界置信共识动力学的Krause注意力机制,通过局部化稀疏交互替代全局softmax归一化,缓解表示坍缩和注意力汇聚现象,实现线性复杂度并提升性能。

Comments ICML 2026, Project page: https://jingkun-liu.github.io/krause-sync-transformers/

AI中文摘要

Transformer中的自注意力依赖于全局归一化的softmax权重,导致所有token在每一层竞争影响力。当跨深度组合时,这种交互模式会诱导强同步动力学,倾向于收敛到主导模式,这种行为与表示坍缩和注意力汇聚现象相关。我们引入了Krause注意力,一种受有界置信共识动力学启发的原则性注意力机制。Krause注意力将基于相似性的全局聚合替换为基于距离的、局部化的、选择性稀疏的交互,促进结构化的局部同步而非全局混合。我们将这种行为与最近将Transformer动力学建模为相互作用粒子系统的理论联系起来,并展示有界置信交互如何自然地调节注意力集中并缓解注意力汇聚。将交互限制在局部邻域还将运行时复杂度从序列长度的二次方降低到线性。实验上,我们在多种设置中验证了Krause注意力,包括视觉(CIFAR/ImageNet上的ViT)、自回归图像生成(MNIST/CIFAR-10)、大语言模型(Llama/Qwen)以及从零开始训练的多种规模(100M/200M)的语言模型。在这些领域中,Krause注意力在提高计算效率的同时实现了持续的性能提升,突显了有界置信动力学作为注意力的一种可扩展且有效的归纳偏置。

英文摘要

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Empirically, we validate Krause Attention across diverse settings, including vision (ViT on CIFAR/ImageNet), autoregressive image generation (MNIST/CIFAR-10), large language models (Llama/Qwen), and language models trained from scratch at multiple scales (100M/200M). Across these domains, Krause Attention achieves consistent performance gains while improving computational efficiency, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
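The bounded-confidence dynamics that inspire Krause Attention can be illustrated with the classic Hegselmann-Krause update, in which each agent averages only the agents within a confidence radius. The sketch below shows that textbook model on scalar states, not the paper's attention layer; the function name `hk_step`, the radius `eps`, and the toy states are assumptions for illustration, and this naive version is quadratic in the number of agents.

```python
def hk_step(states, eps):
    """One Hegselmann-Krause update: every agent moves to the mean of all
    agents (itself included) whose state lies within eps of its own."""
    updated = []
    for xi in states:
        neighbours = [xj for xj in states if abs(xi - xj) <= eps]
        updated.append(sum(neighbours) / len(neighbours))
    return updated


# Two well-separated groups: interactions stay local, so two clusters form
# instead of everything collapsing to one global average.
states = [0.0, 0.1, 0.2, 1.0, 1.1]
after_one_step = hk_step(states, eps=0.3)
print(after_one_step)
```

Contrast this with a global average, which would pull every state toward the same value; that contrast is the intuition behind replacing globally normalized softmax weights with localized interactions.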

2602.08499 2026-05-26 cs.LG cs.AI 版本更新

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

上下文展开赌博机:面向可验证奖励的强化学习

Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang

发表机构 * School of Computer Science and Engineering, Beihang University(北京航空航天大学计算机科学与工程学院) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院) Huawei(华为)

AI总结 针对RLVR中展开使用无差别、短视导致的问题,提出上下文赌博机框架,自适应选择高价值展开,提升训练效率与性能。

AI中文摘要

可验证奖励的强化学习(RLVR)是提升大型语言模型推理能力的有效范式。然而,现有RLVR方法以无差别和短视的方式使用展开:每个提示内不同质量的响应被统一对待,且历史展开在单次使用后被丢弃。这导致监督噪声大、样本效率低以及策略更新次优。我们通过将RLVR中的展开调度形式化为上下文赌博机问题,并提出一个统一的神经调度框架来解决这些问题,该框架在整个训练过程中自适应地选择高价值展开。每个展开被视为一个臂,其奖励由连续优化步骤之间诱导的性能增益定义。由此产生的调度器支持噪声感知的组内选择和历史展开的自适应全局重用,所有这些都在一个统一的原则性框架内。我们通过推导次线性遗憾界并证明扩大展开缓冲区可改善可实现性能上限,提供了理论依据。在六个数学推理基准上的实验表明,在多种RLVR优化方法中,性能和训练效率均有一致的提升。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.

2602.08426 2026-05-26 cs.CL cs.AI cs.CV 版本更新

Prism: Spectral-Aware Block-Sparse Attention

Prism: 频谱感知的块稀疏注意力

Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) ByteDance Inc.(字节跳动公司) OpenMOSS Team(OpenMOSS团队)

AI总结 针对长上下文LLM预填充中块稀疏注意力的块选择效率瓶颈,提出无训练频谱感知方法Prism,通过高低频分支分解和能量温度校准恢复位置信号,实现纯块级重要性估计,在保持精度同时实现高达5.1倍加速。

Comments ICML 2026

AI中文摘要

块稀疏注意力有望加速长上下文LLM的预填充,但高效识别相关块仍是瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理,但往往诉诸昂贵的令牌级搜索或评分,导致显著的选择开销。在本工作中,我们将通过均值池化的标准粗粒度注意力的不准确性追溯到一个理论根源:均值池化与旋转位置嵌入(RoPE)之间的交互。我们证明均值池化充当低通滤波器,在高频维度上引起破坏性干扰,有效造成局部位置信息(如斜线模式)的“盲点”。为解决此问题,我们引入Prism,一种无训练的频谱感知方法,将块选择分解为高频和低频分支。通过应用基于能量的温度校准,Prism直接从池化表示中恢复衰减的位置信号,使得仅使用块级操作即可进行块重要性估计,从而提高效率。大量评估证实,Prism在保持与全注意力精度相当的同时,实现了高达$\mathbf{5.1\times}$的加速。

英文摘要

Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
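The low-pass claim can be checked numerically in a simplified setting. The sketch below is not the Prism method itself. It assumes every token in a block carries the same content vector, so the only variation inside the block is the RoPE rotation e^{iθp}, and it assumes the common RoPE frequency schedule θ_i = base^(-2i/d) with base 10000, which the abstract does not state. The function names `rope_frequencies` and `pooled_gain` are hypothetical.

```python
import cmath

def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Per-pair rotation frequencies theta_i = base^(-2i/d) (common RoPE schedule; assumed here)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def pooled_gain(theta: float, block_size: int) -> float:
    """Magnitude left after mean-pooling unit rotations e^{i*theta*p} for p = 0..block_size-1."""
    total = sum(cmath.exp(1j * theta * p) for p in range(block_size))
    return abs(total) / block_size

freqs = rope_frequencies(head_dim=128)
gains = [pooled_gain(theta, block_size=64) for theta in freqs]
print(f"highest-frequency pair keeps {gains[0]:.3f} of its magnitude")
print(f"lowest-frequency pair keeps  {gains[-1]:.3f} of its magnitude")
```

Under these assumptions the fast-rotating pairs nearly cancel within a 64-token block while the slow pairs pass through almost unchanged, which matches the "blind spot" for local positional patterns that the abstract describes.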

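2602.06717 2026-05-26 cs.LG cs.AI Updated version

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daria Korotyshova, Daniil Gavrilov

Affiliations * T-Tech

AI summary To address finite sampled groups in reinforcement learning missing rare correct trajectories, the paper proposes F-GRPO, a difficulty-aware scaling coefficient inspired by Focal loss, which improves mathematical reasoning performance without increasing group size or computational cost.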

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, computational limits often rule out very large groups, so training proceeds with finite rollout sets that can reinforce only the correct behavior they expose. At practical group sizes, updates can miss rare-correct trajectories while still containing mixed rewards, concentrating probability on more common sampled solutions. We derive the probability of such prompt-local tail-miss events as a function of group size, showing non-monotonic behavior, and in the categorical abstraction characterize how unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware scaling coefficient, inspired by Focal loss, that down-weights updates on high-success sampled groups. Empirically, categorical simulation illustrates the same effect in the categorical setting, Maze provides a single-solution test, and LLM experiments include a representative GRPO group-size sweep together with fixed-$N$ transfer across GRPO, DAPO, and CISPO. On Qwen2.5-7B at $N{=}8$, our method improves average math pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO); OOD pass@256 also improves in all three cases, without increasing group size or computational cost.
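The abstract does not give the exact form of the coefficient. The sketch below assumes the standard Focal-loss modulating factor (1 - p)^γ, with p taken as the empirical success rate of the sampled group, applied on top of GRPO-style group-normalized advantages. The names `focal_weight`, `scaled_advantages`, and `gamma` are illustrative and are not taken from the paper.

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # all rewards equal: no learning signal
    return [(r - mean) / std for r in rewards]

def focal_weight(rewards: list[float], gamma: float = 1.0) -> float:
    """Assumed difficulty-aware coefficient: (1 - success_rate)^gamma."""
    success_rate = sum(rewards) / len(rewards)
    return (1.0 - success_rate) ** gamma

def scaled_advantages(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Down-weight the whole group's update when most rollouts already succeed."""
    weight = focal_weight(rewards, gamma)
    return [weight * a for a in group_advantages(rewards)]

easy_group = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]  # 7 of 8 rollouts correct
hard_group = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 1 of 8 rollouts correct
```

With `gamma = 1`, the easy group is scaled by 0.125 and the hard group by 0.875, so the update on a prompt the policy already solves most of the time shrinks, while a rare success on a hard prompt keeps most of its weight.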

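2602.04120 2026-05-26 cs.LG cs.AI cs.DC cs.SE Updated version

Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

Samaresh Kumar Singh, Joyjit Roy

AI summary Proposes Explainability-as-a-Service (XaaS), a distributed architecture that decouples inference from explanation generation and combines semantic caching, lightweight verification, and an adaptive engine to deliver low-latency, high-fidelity explainability on edge devices, reducing latency by 38% across three real-world use cases.

Comments 8 pages, 5 figures, 2 tables. This version updates metadata after publication in IEEE Xplore and publication by SoutheastCon 2026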

Journal ref
2026 IEEE SoutheastCon, Huntsville, AL, USA, 2026
Abstract

Though Explainable AI (XAI) has made significant advancements, its inclusion in edge and IoT systems is typically ad hoc and inefficient. Most current methods are "coupled" in such a way that they generate explanations simultaneously with model inferences. As a result, these approaches incur redundant computation, high latency, and poor scalability when deployed across heterogeneous sets of edge devices. In this work, we propose Explainability-as-a-Service (XaaS), a distributed architecture for treating explainability as a first-class system service (as opposed to a model-specific feature). The key innovation in our proposed XaaS architecture is that it decouples inference from explanation generation, allowing edge devices to request, cache, and verify explanations subject to resource and latency constraints. To achieve this, we introduce three main innovations: (1) a distributed explanation cache with a semantic-similarity-based explanation retrieval method, which significantly reduces redundant computation; (2) a lightweight verification protocol that ensures the fidelity of both cached and newly generated explanations; and (3) an adaptive explanation engine that chooses explanation methods based on device capability and user requirements. We evaluated the performance of XaaS on three real-world edge AI use cases: (i) manufacturing quality control; (ii) autonomous vehicle perception; and (iii) healthcare diagnostics. Experimental results show that XaaS reduces latency by 38% while maintaining high explanation quality across three real-world deployments. Overall, this work enables the deployment of transparent and accountable AI across large-scale, heterogeneous IoT systems and bridges the gap between XAI research and edge practicality.

2602.03695 2026-05-26 cs.MA cs.AI cs.CL Updated version

Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

Haibo Jin, Peng Kuang, Ye Yu, Xiaopeng Yuan, Haohan Wang

Affiliations * School of Information Sciences, University of Illinois at Urbana-Champaign, IL, USA; Siebel School of Computing and Data Science, University of Illinois at Urbana-Champaign, IL, USA

AI summary Proposes Agent Primitives, a set of reusable latent building blocks that communicate internally via KV cache and are composed automatically, improving the robustness, efficiency, and cross-task reusability of multi-agent systems.

Comments 16 pages

Abstract

While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose Agent Primitives, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key-value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitive-based MAS improve average accuracy by 12.0-16.5% over single-agent baselines and reduce token usage and inference latency by approximately 3$\times$-4$\times$ compared to text-based MAS, while incurring only 1.3$\times$-1.6$\times$ overhead relative to single-agent inference and providing more stable performance across model backbones.

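2602.02495 2026-05-26 cs.CL cs.AI cs.LG Updated version

Reward-free Alignment for Conflicting Objectives

Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

Affiliations * Columbia University

AI summary Proposes RACO, a framework that uses pairwise preference data directly and resolves conflicts among objectives with a clipped variant of conflict-averse gradient descent, with convergence guarantees to Pareto-critical points that respect user-specified objective weights.

Comments Accepted to ICML 2026 (Oral)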

Abstract

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

2601.21094 2026-05-26 cs.LG cs.AI cs.SY eess.SY Updated version

Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed

Minjae Kwon, Josephine Lamp, Lu Feng

Affiliations * Department of Computer Science, University of Virginia

AI summary Studies whether the training-time safety guarantees of safe reinforcement learning algorithms transfer to deployment under distribution shift, using diabetes management as a testbed; it identifies a safety generalization gap and shows that test-time shielding effectively restores safety.

Comments Accepted at ICML 2026. Camera-ready version

Abstract

Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13--14\% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at https://github.com/safe-autonomy-lab/GlucoSim and https://github.com/safe-autonomy-lab/GlucoAlg.
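Test-time shielding, as described above, can be sketched as a filter between the policy and the environment. This is a minimal illustration rather than the authors' implementation: the linear `toy_model` stands in for a learned dynamics model, the 70-180 mg/dL window is the glucose range commonly used for Time-in-Range and is assumed here, and `shield`, `doses`, and the "closest safe candidate" rule are hypothetical choices.

```python
from typing import Callable, Sequence

def shield(
    state: float,
    proposed: float,
    candidates: Sequence[float],
    predict: Callable[[float, float], float],
    is_safe: Callable[[float], bool],
    default: float,
) -> float:
    """Return the proposed action if its predicted outcome is safe, else the closest safe candidate."""
    if is_safe(predict(state, proposed)):
        return proposed
    safe = [a for a in candidates if is_safe(predict(state, a))]
    if not safe:
        return default  # no candidate is predicted safe; fall back to a designated default
    return min(safe, key=lambda a: abs(a - proposed))

# Toy stand-ins (assumptions): glucose in mg/dL, action = insulin dose,
# and a made-up linear model in which each unit of insulin lowers glucose by 30.
def toy_model(glucose: float, dose: float) -> float:
    return glucose - 30.0 * dose

def in_range(glucose: float) -> bool:
    return 70.0 <= glucose <= 180.0

doses = [0.0, 0.5, 1.0, 1.5, 2.0]
chosen = shield(100.0, 2.0, doses, toy_model, in_range, default=0.0)
```

In this toy case the policy's proposed dose of 2.0 would push predicted glucose below range, so the shield substitutes the nearest dose whose prediction stays in range. The quality of such a filter depends on how well the dynamics model generalizes to unseen patients, which is the distribution-shift question the paper studies.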

2601.13236 2026-05-26 eess.IV cs.AI physics.med-ph Updated version

Pixelwise Uncertainty Quantification of Accelerated MRI Reconstruction

Ilias I. Giannakopoulos, Lokesh B Gautham Muthukumar, Yvonne W. Lui, Riccardo Lattanzi

Affiliations * Bernard and Irene Schwartz Center for Biomedical Imaging, Department of Radiology, NYU Grossman School of Medicine; Courant Institute of Mathematical Sciences, NYU; Center for Advanced Imaging Innovation and Research (CAI²R), Department of Radiology, NYU Grossman School of Medicine

AI summary Proposes a pixel-wise uncertainty quantification framework based on conformal quantile regression for accelerated MRI reconstruction, automatically identifying unreliable regions without fully sampled reference images.

Comments 10 pages, 8 figures, 2 tables

Abstract

Parallel imaging techniques reduce magnetic resonance imaging (MRI) scan time, but image quality degrades as the acceleration factor increases. In clinical practice, conservative acceleration factors are chosen because no mechanism exists to automatically assess the diagnostic quality of undersampled reconstructions. This work introduces a general framework for pixel-wise uncertainty quantification in parallel MRI reconstructions, enabling automatic identification of unreliable regions without access to any ground-truth reference image. Our method integrates conformal quantile regression with image reconstruction methods to estimate statistically rigorous pixel-wise uncertainty intervals. We trained and evaluated our model on Cartesian undersampled brain and knee data obtained from the fastMRI dataset using acceleration factors ranging from 2 to 10. An end-to-end Variational Network was used for image reconstruction. Quantitative experiments demonstrate strong agreement between predicted uncertainty maps and true reconstruction error. Using our method, the corresponding Pearson correlation coefficient was higher than 90% at acceleration levels at and above four-fold, whereas it dropped to less than 70% when the uncertainty was computed using a simpler heuristic notion (magnitude of the residual). Qualitative examples further show that the uncertainty maps based on quantile regression capture the magnitude and spatial distribution of reconstruction errors across acceleration factors, with regions of elevated uncertainty aligning with pathologies and artifacts. The proposed framework enables evaluation of reconstruction quality without access to fully sampled ground-truth reference images. It represents a step toward adaptive MRI acquisition protocols that may be able to dynamically balance scan time and diagnostic reliability.
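The calibration step of conformal quantile regression can be sketched in a few lines. The code below follows the standard split-conformal recipe for quantile regressors: score each calibration point by how far the truth falls outside the predicted band, then widen every band by a finite-sample-corrected quantile of those scores. How the paper applies this per pixel is not specified in the abstract, so the scalar setting, the function names, and the toy numbers are assumptions.

```python
import math

def cqr_margin(lower, upper, truth, alpha):
    """Split-conformal correction for quantile-regression intervals on a calibration set."""
    # Conformity score: positive when the truth lies outside [lo, hi], negative when inside.
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, truth))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample-corrected quantile rank
    return scores[rank - 1] if rank <= n else math.inf

def conformal_interval(lo, hi, margin):
    """Widen (or shrink, if margin < 0) a predicted band by the calibrated margin."""
    return lo - margin, hi + margin

# Toy calibration set: the regressor always predicts the band [0, 1].
cal_lower = [0.0] * 9
cal_upper = [1.0] * 9
cal_truth = [0.5, 1.1, 1.2, 0.9, -0.3, 1.05, 0.2, 1.5, 0.0]

margin = cqr_margin(cal_lower, cal_upper, cal_truth, alpha=0.2)
new_lo, new_hi = conformal_interval(0.2, 0.6, margin)
```

Under the usual exchangeability assumption between calibration and test data, intervals widened this way cover the truth with probability at least 1 - alpha, which is the sense in which such intervals are statistically rigorous.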

2601.05613 2026-05-26 cs.LG cs.AI Updated version

PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data across Nodes

Yiming Zhou, Jiahao Wang, Mingyue Cheng, Hao Wang, Defu Lian, Enhong Chen

Affiliations * University of Science and Technology of China

AI summary Proposes PiXTime, a Transformer-based framework whose parameter-decoupling architecture (local personalized modules plus a globally shared backbone) handles structurally heterogeneous time series for federated forecasting, achieving state-of-the-art performance on multiple benchmarks.

Abstract

While collaborative forecasting on distributed time series is highly desirable, directly pooling localized datasets is often impractical due to data sharing constraints. Federated learning offers a promising alternative, yet conventional federated learning algorithms require homogeneous model architectures, which are incompatible with the structural discrepancies, such as unaligned temporal resolutions and mismatched variable channels, commonly observed across decentralized nodes. To bridge this gap, we introduce PiXTime, a novel Transformer-based framework designed to natively accommodate and leverage structurally heterogeneous temporal data. At its core, PiXTime adopts a parameter-decoupling architecture, strategically partitioning the model into localized personalized modules and a globally aggregated shared backbone. Specifically, node-specific local modules act as dimensional adapters, projecting raw sequences of diverse lengths into a unified representation space. Concurrently, a globally synchronized VE Table injects consistent categorical identities into the feature space, allowing the shared backbone to collaboratively learn and generalize representations across inconsistent variable distributions. Comprehensive evaluations on multiple benchmarks demonstrate that PiXTime achieves state-of-the-art performance in heterogeneous federated environments, while maintaining robust superiority in standard homogeneous and centralized forecasting settings.

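2601.05483 2026-05-26 cs.AI Updated version

MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

Zixuan Xiao, Jun Ma, Siwei Zhang

Affiliations * Department of Urban Planning and Design, The University of Hong Kong

AI summary Proposes MMUEChange, a multi-modal agent framework that uses a modular toolkit and a Modality Controller to flexibly integrate heterogeneous urban data and align it across modalities; in case studies of three cities it improves task success rate by 46.7% over the best-performing baseline and effectively mitigates hallucination.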

Journal ref
Applied Soft Computing 190 (2026) 114576
Abstract

Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single-modal analysis. To overcome these limitations, we propose MMUEChange, a multi-modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, Modality Controller for cross- and intra-modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community-focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real-world policy implications.

2601.03790 2026-05-26 cs.CL cs.AI Updated version

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka

Affiliations * The University of Tokyo; NTT Communication Science Laboratories, NTT, Inc.

AI summary Proposes NeoAMT, a framework that trains a translation agent with a Wiktionary-based search tool and reinforcement learning to improve translation of source sentences containing neologisms.

Comments ACL 2026 Main. Fixed minor typos

Abstract

Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation equipped with a Wiktionary-based search toolkit. Specifically, we first construct a dedicated dataset for neologism-aware machine translation and build a search toolkit grounded in Wiktionary. The dataset covers 16 languages and 75 translation directions in total, derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search toolkit is also constructed from around 3 million cleaned records of the same dump. We then leverage the dataset and toolkit to train a translation agent via reinforcement learning (RL) and to evaluate the accuracy of neologism-aware machine translation. Furthermore, we propose an RL training framework featuring a novel reward design and an adaptive rollout generation strategy that exploits translation difficulty to further improve the translation quality of translation agents using our search toolkit.

2601.03624 2026-05-26 cs.AI Updated version

Architecting Agentic Communities using Design Patterns

Zoran Milosevic, Fethi Rabhi

Affiliations * School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; Deontik, Australia

AI summary Proposes a three-tier classification of design patterns derived from enterprise distributed systems (LLM Agents, Agentic AI, Agentic Communities) and validates its formal framework through a clinical trial matching case study, offering practical guidance and formal verification capabilities for deploying multi-agent ecosystems.

Comments Supplementary material accompanying this paper is also attached; its title is "Complete Agentic AI Design Patterns Catalogue". Fixed encoding artefacts (garbled em dashes) throughout

Abstract

The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.

2601.03327 2026-05-26 cs.LG cs.AI Updated version

Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme

Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes

AI summary Introduces the first ordinal classification framework for forecasting wildfire severity levels and studies how loss-function design affects the prediction of extreme events, finding that the Weighted Kappa Loss gains more than 0.1 IoU on the most extreme classes.

Comments Following external reviews, we identified major methodological issues in the manuscript, including insufficient justification of the ordinal clustering strategy, limited statistical validation, ambiguities in dataset splitting, and missing comparisons with standard ordinal approaches. We therefore request withdrawal in order to prepare a substantially revised version

Abstract

Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challenging. In this work, we introduce the first ordinal classification framework for forecasting wildfire severity levels directly aligned with operational decision-making in France. Our study investigates the influence of loss-function design on the ability of neural models to predict rare yet critical high-severity fire occurrences. We compare standard cross-entropy with several ordinal-aware objectives, including the proposed probabilistic TDeGPD loss derived from a truncated discrete exponentiated Generalized Pareto Distribution. Through extensive benchmarking over multiple architectures and real operational data, we show that ordinal supervision substantially improves model performance over conventional approaches. In particular, the Weighted Kappa Loss (WKLoss) achieves the best overall results, with more than +0.1 IoU (Intersection over Union) gain on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset. These findings highlight the importance of integrating severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information into training to further improve the reliability of extreme-event prediction.

2601.02144 2026-05-26 cs.CL cs.AI Updated version

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

Affiliations * Institute of Science Tokyo; CyberAgent; Nara Institute of Science and Technology

AI summary Proposes kNN-MoE, which augments MoE routing by retrieving locally optimal expert assignments from similar past cases and uses the average similarity of the retrieved neighbors as a confidence-driven mixing coefficient, improving robustness under distribution shift.

Abstract

Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses locally optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the average similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show that kNN-MoE outperforms the zero-shot baseline and is competitive with computationally intensive supervised fine-tuning.
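The confidence-driven mixing can be illustrated with a small sketch. It assumes cosine similarity for retrieval, a memory of (key vector, expert distribution) pairs, and a convex blend of probabilities weighted by the mean neighbor similarity. The abstract states only that the average similarity acts as the mixing coefficient, so these details and the names `cosine` and `knn_route` are illustrative, not the paper's formulation, which optimizes routing logits.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_route(query, memory, router_probs, k=2):
    """Blend retrieved expert assignments with the frozen router's distribution."""
    if not memory:
        return list(router_probs)  # nothing to retrieve: use the frozen router
    scored = sorted(
        ((cosine(query, key), probs) for key, probs in memory),
        key=lambda item: item[0],
        reverse=True,
    )[:k]
    # Mixing coefficient: mean neighbor similarity, clamped to [0, 1].
    lam = min(1.0, max(0.0, sum(sim for sim, _ in scored) / len(scored)))
    retrieved = [
        sum(probs[e] for _, probs in scored) / len(scored)
        for e in range(len(router_probs))
    ]
    return [lam * r + (1.0 - lam) * p for r, p in zip(retrieved, router_probs)]

memory = [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [0.8, 0.2])]
router = [0.5, 0.5]
confident = knn_route([1.0, 0.0], memory, router, k=1)
fallback = knn_route([0.0, 1.0], [([1.0, 0.0], [1.0, 0.0])], router, k=1)
```

When the query matches a stored case, the retrieved assignment dominates; when the nearest neighbor is orthogonal to the query, the coefficient drops to zero and the output is the frozen router's distribution, which is the fallback behavior the abstract describes.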

2601.00553 2026-05-26 cs.CV cs.AI Updated version

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Rajarshi Roy, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Gaytri Jena, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

Affiliations * 1 Kalyani Government Engineering College, India. 2 IIIT Delhi, India. 3 BITS Pilani Hyderabad Campus, India. 4 University of South Carolina, USA. 5 IIIT Guwahati, India. 6 NIT Silchar, India. 7 San José State University, USA. 8 UCLA, USA. 9 Washington State University, USA. 10 Vishwakarma Institute of Information Technology, India. 11 Gandhi Institute for Technological Advancement, India. 12 Meta AI, USA. 13 Amazon AI, USA. 14 BITS Pilani Goa, India.

AI summary For AI-generated image detection, the paper builds MS COCOAI, a dataset of 96000 real and synthetic datapoints, and proposes two tasks: classifying images as real or generated, and identifying the model that generated a given image.

Abstract

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

2512.23076 2026-05-26 cs.LG cs.AI cs.HC 版本更新

Multimodal Functional Maximum Correlation for Emotion Recognition

多模态功能最大相关用于情感识别

Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu

Affiliations * Key Laboratory of Child Development and Learning Science (Ministry of Education), School of Biological Sciences and Medical Engineering, Southeast University; Department of Artificial Intelligence, Westlake University; Department of Artificial Intelligence, Vrije Universiteit Amsterdam

AI summary Proposes Multimodal Functional Maximum Correlation (MFMC), a framework that maximizes higher-order multimodal dependence through a Dual Total Correlation objective and achieves state-of-the-art or competitive performance on emotion recognition benchmarks.

Comments manuscript accepted by IEEE Transactions on Affective Computing. Code is available at https://github.com/DY9910/MFMC

详情
AI中文摘要

情绪状态表现为中枢和自主系统之间协调但异质的生理反应,这对情感计算中的多模态表示学习构成了基本挑战。学习这种联合动态因情感标注的稀缺性和主观性而进一步复杂化,这推动了自监督学习(SSL)的使用。然而,大多数现有的SSL方法依赖于成对对齐目标,这些目标不足以表征两个以上模态之间的依赖关系,也无法捕捉由协调的脑和自主反应产生的高阶交互。为了解决这一限制,我们提出了多模态功能最大相关(MFMC),一个原则性的SSL框架,通过双重总相关(DTC)目标最大化高阶多模态依赖。通过推导一个紧致的夹逼界并使用基于功能最大相关分析(FMCA)的迹替代进行优化,MFMC直接捕捉联合多模态交互,而不依赖于成对对比损失。在三个公开的情感计算基准上的实验表明,MFMC在受试者依赖和受试者独立评估协议下均一致地达到最先进或具有竞争力的性能,突显了其对受试者间变异性的鲁棒性。特别是,MFMC将CEAP-360VR上的受试者依赖准确率从78.9%提高到86.8%,仅使用EDA信号就将受试者独立准确率从27.5%提高到33.1%。此外,在MAHNOB-HCI最具挑战性的EEG受试者独立划分中,MFMC与最佳方法的差距在0.8个百分点以内。我们的代码可在https://github.com/DY9910/MFMC获取。

英文摘要

Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
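The gap between pairwise and higher-order dependence that motivates MFMC can be seen on a toy discrete example. The sketch below is not the MFMC objective or its FMCA trace surrogate; it only evaluates the standard definition of dual total correlation, DTC(X_1..X_n) = H(X_1..X_n) - sum_i H(X_i | X_-i), on three binary variables where the third is the XOR of the first two. The function names and the XOR distribution are illustrative assumptions, not taken from the paper.

```python
import math
from collections import defaultdict
from itertools import product


def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, keep):
    """Marginal distribution over the variable indices listed in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)


def dual_total_correlation(joint, n):
    """DTC = H(X_1..X_n) - sum_i H(X_i | all other variables)."""
    h_joint = entropy(joint)
    conditional_sum = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # H(X_i | X_-i) = H(joint) - H(X_-i)
        conditional_sum += h_joint - entropy(marginal(joint, others))
    return h_joint - conditional_sum


def mutual_information(joint, i, j):
    """Pairwise mutual information I(X_i; X_j) in bits."""
    return (
        entropy(marginal(joint, [i]))
        + entropy(marginal(joint, [j]))
        - entropy(marginal(joint, [i, j]))
    )


# Hypothetical toy distribution: X3 = X1 XOR X2 with X1, X2 fair and independent.
xor_joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

pairwise = [mutual_information(xor_joint, i, j) for i, j in ((0, 1), (0, 2), (1, 2))]
dtc = dual_total_correlation(xor_joint, 3)
print(pairwise, dtc)
```

In this example every pairwise mutual information is 0 bits while the DTC is 2 bits, so an objective built only from pairs would see no dependence at all.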

2512.18735 2026-05-26 cs.CV cs.AI 版本更新

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

$M^3-Verse$: 大型多模态模型的“找不同”挑战

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

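Affiliations * Zhejiang University, China; Shanghai AI Lab, China; Hangzhou Normal University, China

AI summary Proposes the $M^3-Verse$ benchmark, which uses paired multi-perspective videos to evaluate how well LMMs understand dynamic object changes within a consistent space, and shows the limitations of existing models.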

详情
AI中文摘要

现代大型多模态模型(LMMs)在静态图像和单状态时空理解方面表现出非凡的能力。然而,它们在两个不同视频观测中理解共享空间上下文内物体动态变化的能力仍未被充分探索。这种在一致环境中推理变换的能力对于空间智能领域的进步尤为关键。在本文中,我们引入了 $M^3-Verse$,一个多模态、多状态、多维度的基准,以正式评估这一能力。它基于成对视频,这些视频提供了室内场景在状态变化前后的多视角观察。该基准包含总共 270 个场景和 2,932 个问题,分为 50 多个子任务,探究 4 种核心能力。我们评估了 16 个最先进的 LMMs,并观察到它们在跟踪状态转换方面的局限性。为了解决这些挑战,我们进一步提出了一个简单而有效的基线,在多状态感知中实现了显著的性能提升。因此,$M^3-Verse$ 提供了一个具有挑战性的新测试平台,以促进对动态视觉世界有更全面理解的下一代模型的发展。您可以从 https://github.com/Wal-K-aWay/M3-Verse_pipeline 获取构建流程,并从 https://www.modelscope.cn/datasets/WalKaWay/M3-Verse 获取完整的基准数据。

英文摘要

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from https://github.com/Wal-K-aWay/M3-Verse_pipeline and full benchmark data from https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

2512.05865 2026-05-26 cs.LG cs.AI 版本更新

Intrinsically Interpretable Attention via Sparse Post-Training

通过稀疏后训练实现内在可解释的注意力机制

Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf

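Affiliations * MPI-IS; University of Oxford; ETH Zürich

AI summary Proposes a post-training method that applies flexible sparsity regularisation under a constrained-loss objective, reducing transformer attention connectivity to about 0.4% of its edges without sacrificing performance, which simplifies global circuits and improves interpretability.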

详情
AI中文摘要

我们引入一种简单的后训练方法,使Transformer注意力变得稀疏而不牺牲性能。在约束损失目标下应用灵活的稀疏正则化,我们在高达7B参数的模型上证明,可以将注意力连接减少到其边缘的约0.4%,同时保留原始预训练损失。与为计算效率设计的稀疏注意力方法不同,我们的方法利用稀疏性作为结构先验:它保留了能力,同时暴露出更有组织和可解释的连接模式。我们发现这种局部稀疏性级联成全局电路简化:特定任务的电路涉及更少的组件(注意力头和MLP),连接它们的边缘减少了多达100倍。此外,使用跨层转录器,我们表明稀疏注意力显著简化了注意力归因,实现了基于特征和基于电路视角的统一视图。这些结果表明,Transformer注意力可以变得稀疏几个数量级,表明其大部分计算是冗余的,并且稀疏性可以作为更结构化和可解释模型的指导原则。

英文摘要

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

2511.20236 2026-05-26 cs.AI cs.LG 版本更新

Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints

结合领域知识和可行性约束的可操作且多样化的反事实解释

Szymon Bobek, Łukasz Bałec, Grzegorz J. Nalepa

Affiliations * Faculty of Physics, Astronomy and Applied Computer Science, Institute of Applied Computer Science, Jagiellonian Human-Centered AI Lab

AI summary Proposes DANCE, a method that models feature dependencies and domain constraints to generate actionable, diverse counterfactual explanations, validated on OpenML datasets and in an industrial email marketing setting.

详情
AI中文摘要

反事实解释通过识别实现期望结果所需的最小变化来提高机器学习模型的可操作可解释性。然而,现有方法常常忽略特征之间的依赖关系,这可能导致不现实或不切实际的修改。这一限制降低了反事实解释在现实决策支持系统中的实用性。受网络安全中电子邮件营销应用的启发,我们提出了DANCE(多样化、可操作且知识约束的解释),一种生成反事实的方法,该方法结合了特征依赖和领域约束。DANCE使用线性或概率结构对特征之间的关系进行建模,这些结构可以从数据中学习或由专家指定。在搜索过程中强制执行这些依赖关系以提高可行性和现实性。该方法在一个统一的目标中联合优化可行性、多样性、邻近性和稀疏性。我们在OpenML的140个数据集上评估了DANCE,并证明它在多个评估标准上相比现有方法具有竞争性或更优的性能。此外,我们与一个电子邮件营销平台合作,在真实工业环境中验证了该方法,表明它能够产生符合领域且可操作的建议。

英文摘要

Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to achieve a desired outcome. However, existing methods often neglect dependencies among features, which can lead to unrealistic or impractical modifications. This limitation reduces the usefulness of counterfactual explanations in real-world decision-support systems. Motivated by applications in cybersecurity for email marketing, we propose DANCE (Diverse, Actionable, and Knowledge-Constrained Explanations), a method for generating counterfactuals that incorporate feature dependencies and domain constraints. DANCE models relationships between features using linear and probabilistic structures that can be learned from data or specified by experts. These dependencies are enforced during the search process to improve plausibility and feasibility. The method jointly optimizes plausibility, diversity, proximity, and sparsity within a unified objective. We evaluate DANCE on 140 datasets from OpenML and demonstrate that it achieves competitive or superior performance compared to existing approaches across multiple evaluation criteria. Additionally, we validate the method in a real-world industrial setting in collaboration with an email marketing platform, showing that it produces domain-consistent and actionable recommendations.
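A minimal sketch of how the four criteria named above might be combined into one score follows. The weights, the L1 proximity, the linear dependency cf[1] = slope * cf[0] + intercept, and all function names are illustrative assumptions; the abstract does not give DANCE's actual objective or search procedure.

```python
from itertools import combinations


def proximity(x, cf):
    """L1 distance between the original instance and a counterfactual."""
    return sum(abs(a - b) for a, b in zip(x, cf))


def sparsity(x, cf, tol=1e-9):
    """Number of features the counterfactual changes."""
    return sum(1 for a, b in zip(x, cf) if abs(a - b) > tol)


def implausibility(cf, slope=2.0, intercept=0.0):
    """Violation of a hypothetical linear dependency cf[1] = slope * cf[0] + intercept."""
    return abs(cf[1] - (slope * cf[0] + intercept))


def diversity(cfs):
    """Mean pairwise L1 distance within a set of counterfactuals."""
    pairs = list(combinations(cfs, 2))
    if not pairs:
        return 0.0
    return sum(proximity(a, b) for a, b in pairs) / len(pairs)


def objective(x, cfs, w_prox=1.0, w_sparse=0.5, w_plaus=2.0, w_div=1.0):
    """Lower is better: penalize distance, changed features, and dependency
    violations, and reward diversity across the returned set.
    All weights are made-up values for illustration."""
    cost = sum(
        w_prox * proximity(x, cf)
        + w_sparse * sparsity(x, cf)
        + w_plaus * implausibility(cf)
        for cf in cfs
    ) / len(cfs)
    return cost - w_div * diversity(cfs)


x = [1.0, 2.0, 5.0]
consistent = [[2.0, 4.0, 5.0], [1.5, 3.0, 5.0]]    # respect cf[1] = 2 * cf[0]
inconsistent = [[2.0, 2.0, 5.0], [1.5, 2.0, 5.0]]  # change cf[0] but ignore cf[1]
print(objective(x, consistent), objective(x, inconsistent))
```

Under these assumed weights the dependency-respecting set scores lower (better) than the set that edits one feature while ignoring the feature tied to it, even though the latter changes fewer features.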

2511.19065 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Understanding, Accelerating, and Improving MeanFlow Training

理解、加速和改进MeanFlow训练

Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

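Affiliations * Yonsei University; ETH Zurich; University of Zurich; Max Planck ETH CLS; Google

AI summary By analyzing the interaction between instantaneous and average velocity, the authors propose a training scheme that accelerates the formation of instantaneous velocity and then gradually shifts training emphasis, yielding faster convergence and better few-step generation.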

详情
AI中文摘要

MeanFlow通过联合学习瞬时速度场和平均速度场,有望在少步内实现高质量生成建模。然而,其底层训练动态仍不清楚。我们分析两种速度之间的相互作用,发现:(i) 建立良好的瞬时速度是学习平均速度的前提;(ii) 当时间间隔较小时,瞬时速度的学习受益于平均速度,但随着间隔增大而退化;(iii) 任务亲和性分析表明,对于一步生成至关重要的大间隔平均速度的平滑学习,依赖于先形成准确的瞬时速度和小间隔平均速度。在这些观察的指导下,我们设计了一种有效的训练方案,加速瞬时速度的形成,然后将重点从短间隔平均速度转移到长间隔平均速度。我们改进的MeanFlow训练实现了更快的收敛和显著更好的少步生成:使用相同的DiT-XL骨干网络,我们的方法在1-NFE ImageNet 256x256上达到了令人印象深刻的FID 2.87,而传统的MeanFlow基线为3.43。或者,我们的方法以2.5倍更短的训练时间或使用更小的DiT-L骨干网络,匹配MeanFlow基线的性能。

英文摘要

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
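The two quantities analyzed here can be made concrete with a toy one-dimensional check. In the MeanFlow formulation, the average velocity over an interval [r, t] is the time-average of the instantaneous velocity, u(z_t, r, t) = (1 / (t - r)) * integral from r to t of v(z_tau, tau) d tau (stated here as background; the abstract does not spell it out). The sketch assumes a hypothetical velocity that depends only on time, v(tau) = tau^2, and approximates the integral with a midpoint rule. It is not the paper's training scheme.

```python
def instantaneous_velocity(t: float) -> float:
    """Toy instantaneous velocity v(t) = t^2 (an assumption for illustration)."""
    return t * t


def average_velocity(r: float, t: float, steps: int = 10_000) -> float:
    """Average of v over [r, t], approximated with the midpoint rule."""
    if t == r:
        # Zero-length interval: the average reduces to the instantaneous value.
        return instantaneous_velocity(t)
    h = (t - r) / steps
    total = sum(instantaneous_velocity(r + (i + 0.5) * h) for i in range(steps))
    return total * h / (t - r)


# Small gap: the average velocity is close to the instantaneous velocity at t.
small_gap = average_velocity(0.5, 0.51)
# Large gap (r = 0, t = 1), the case that matters for one-step generation.
large_gap = average_velocity(0.0, 1.0)
print(small_gap, instantaneous_velocity(0.51), large_gap, instantaneous_velocity(1.0))
```

With a small gap the average velocity nearly coincides with the instantaneous one, while with r = 0 and t = 1 the two differ substantially, which is consistent with large-gap average velocities being a distinct and harder learning target.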

2511.15732 2026-05-26 cs.CY cs.AI 版本更新

Just Asking Questions: Doing Our Own Research on Conspiratorial Ideation by Generative AI Chatbots

只是提问:关于生成式AI聊天机器人阴谋论思维的自主研究

Katherine M. FitzGerald, Michelle Riedlinger, Axel Bruns, Stephen Harrington, Timothy Graham, Daniel Angus

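Affiliations * Digital Media Research Centre, Queensland University of Technology

AI summary Through a systematic evaluation of how six mainstream AI chatbots respond to conspiracy-theory questions, this study finds that safety guardrails differ markedly across models and conspiracy topics and are selectively designed.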

详情
AI中文摘要

基于人工智能框架的交互式聊天系统日益普及,并嵌入搜索引擎、网页浏览器、操作系统或通过网站和应用程序提供。研究人员致力于理解生成式AI的局限性和潜在危害,本文对此做出贡献。通过对六种AI聊天系统(ChatGPT 3.5、ChatGPT 4 Mini、Bing中的Microsoft Copilot、Google Search AI、Perplexity以及Twitter/X中的Grok)进行系统评估,本研究考察了这些领先产品如何回应与阴谋论相关的问题。这遵循了Glazunova等人(2023年)建立的平台政策实施审计方法。我们选取了五个众所周知且已被全面驳斥的阴谋论,以及四个与数据收集时的突发新闻事件相关的新兴阴谋论。我们的发现表明,生成式AI聊天机器人中针对阴谋论思维的安全护栏程度因聊天机器人模型和阴谋论的不同而存在显著差异。我们的观察表明,AI聊天机器人中的安全护栏通常设计得非常具有选择性:生成式AI公司似乎特别关注确保其产品不被视为种族主义;它们似乎还特别关注涉及重大国家创伤(如9/11事件)或与既定政治问题相关的阴谋论。未来的工作应包括持续努力,扩展到更多平台、多种语言以及远超美国范围的各类阴谋论。

英文摘要

Interactive chat systems that build on artificial intelligence frameworks are increasingly ubiquitous and embedded into search engines, Web browsers, and operating systems, or are available on websites and apps. Researcher efforts have sought to understand the limitations and potential for harm of generative AI, which we contribute to here. Conducting a systematic review of six AI-powered chat systems (ChatGPT 3.5; ChatGPT 4 Mini; Microsoft Copilot in Bing; Google Search AI; Perplexity; and Grok in Twitter/X), this study examines how these leading products respond to questions related to conspiracy theories. This follows the platform policy implementation audit approach established by Glazunova et al. (2023). We select five well-known and comprehensively debunked conspiracy theories and four emerging conspiracy theories that relate to breaking news events at the time of data collection. Our findings demonstrate that the extent of safety guardrails against conspiratorial ideation in generative AI chatbots differs markedly, depending on chatbot model and conspiracy theory. Our observations indicate that safety guardrails in AI chatbots are often very selectively designed: generative AI companies appear to focus especially on ensuring that their products are not seen to be racist; they also appear to pay particular attention to conspiracy theories that address topics of substantial national trauma such as 9/11 or relate to well-established political issues. Future work should include an ongoing effort extended to further platforms, multiple languages, and a range of conspiracy theories extending well beyond the United States.

2511.12046 2026-05-26 cs.CR cs.AI cs.CV cs.LG 版本更新

BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

BackWeak: 使用弱触发器和微调简单后门知识蒸馏

Shanmin Wang, Dongdong Zhao

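Affiliations * School of Computer Science and Artificial Intelligence, Wuhan University of Technology

AI summary Proposes BackWeak, which implants a backdoor by fine-tuning a teacher model with a weak trigger, without surrogate student models or simulated distillation; the backdoor reliably transfers to diverse student architectures during standard distillation.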

详情
AI中文摘要

知识蒸馏对于压缩大型模型至关重要,但依赖从第三方仓库下载的预训练“教师”模型引入了严重的安全风险——最显著的是后门攻击。现有的知识蒸馏后门方法通常复杂且计算密集:它们使用替代学生模型和模拟蒸馏来保证可转移性,并构建类似于通用对抗扰动(UAP)的触发器,这些触发器在幅度上不隐蔽,本质上表现出强烈的对抗行为。本文质疑这种复杂性是否必要,并构建了隐蔽的“弱”触发器——具有可忽略对抗效应的不可察觉扰动。我们提出了BackWeak,一种简单、无替代的攻击范式。BackWeak表明,通过使用非常小的学习率对良性教师模型进行微调并嵌入弱触发器,即可植入强大的后门。我们证明,这种精细的微调足以嵌入后门,在受害者的标准蒸馏过程中可靠地转移到不同的学生架构,从而实现高攻击成功率。在多个数据集、模型架构和知识蒸馏方法上的广泛实证评估表明,BackWeak比以往复杂的方法更高效、更简单,且通常更隐蔽。本文呼吁研究知识蒸馏后门攻击的学者特别关注触发器的潜在对抗特性。

英文摘要

Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks--most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and construct triggers similar to universal adversarial perturbations (UAPs), which, not being stealthy in magnitude, inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers--imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often more stealthy than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's potential adversarial characteristics.

2511.10502 2026-05-26 cs.CR cs.AI 版本更新

On the Detectability of Active Gradient Inversion Attacks in Federated Learning

联邦学习中主动梯度反转攻击的可检测性研究

Vincenzo Carletti, Pasquale Foggia, Carlo Mazzocca, Giuseppe Parrella, Mario Vento

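Affiliations * Department of Computer Information and Electrical Engineering and Applied Mathematics

AI summary Studies the detectability of active gradient inversion attacks in federated learning and proposes lightweight client-side detection based on anomalous weight structures and loss/gradient dynamics; experiments show that attacks can be detected without modifying the training protocol.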

详情
Journal ref
2026 IEEE Symposium on Security and Privacy (SP), pp. 1931-1950, 2026
AI中文摘要

联邦学习(FL)的一个关键优势是能够在客户端数据保留在本地的情况下协作训练机器学习(ML)模型。然而,这可能会造成一种虚假的安全感。尽管不共享私有数据提高了整体隐私性,但先前的研究表明,FL训练期间交换的梯度仍然容易受到梯度反转攻击(GIA)的攻击。这些攻击允许重建客户端的本地数据,打破了FL的隐私承诺。GIA可以由被动或主动服务器发起。在后一种情况下,恶意服务器操纵全局模型以促进数据重建。虽然有效,但先前属于此类别的攻击已被证明可以被客户端检测到,限制了其实际适用性。最近,出现了新的主动GIA,声称比先前的方法更隐蔽。本文首次对这些声明进行了全面分析,研究了四种最先进的GIA。我们提出了基于统计上不可能的权重结构以及异常损失和梯度动态的新型轻量级客户端检测技术。在多种配置下的广泛评估表明,我们的方法使客户端能够在不修改FL训练协议的情况下有效检测主动GIA。

英文摘要

One of the key advantages of Federated Learning (FL) is its ability to collaboratively train a Machine Learning (ML) model while keeping clients' data on-site. However, this can create a false sense of security. Although not sharing private data improves overall privacy, prior studies have shown that gradients exchanged during FL training remain vulnerable to Gradient Inversion Attacks (GIAs). These attacks allow reconstructing the clients' local data, breaking the privacy promise of FL. GIAs can be launched by either a passive or an active server. In the latter case, a malicious server manipulates the global model to facilitate data reconstruction. While effective, earlier attacks falling under this category have been demonstrated to be detectable by clients, limiting their real-world applicability. Recently, novel active GIAs have emerged, claiming to be far stealthier than previous approaches. This work provides the first comprehensive analysis of these claims, investigating four state-of-the-art GIAs. We propose novel lightweight client-side detection techniques, based on statistically improbable weight structures and anomalous loss and gradient dynamics. Extensive evaluation across several configurations demonstrates that our methods enable clients to effectively detect active GIAs without any modifications to the FL training protocol.

2510.23008 2026-05-26 cs.AI 版本更新

From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports -- with Preliminary Extension to Lung Cancer

从提示优化到多维可信度评估:增强中文LLM生成的肝脏MRI报告的可信度——初步扩展至肺癌

Qiuli Wang, Xinhuang Sun, Yonglin Chen, Jie Cheng, Yongxu Liu, Xingpeng Zhang, Xiaoming Li, Wei Chen

Affiliations * Yu-Yue Pathology Research Center, Jinfeng Laboratory, Chongqing, China; 7T Magnetic Resonance Imaging Translational Medical Center, Department of Radiology, Southwest Hospital, Army Medical University, Chongqing, China

AI summary Proposes a Multi-Dimensional Credibility Assessment (MDCA) framework and guidance on institution-specific prompt optimization to enhance the trustworthiness of LLM-generated liver MRI reports, with a preliminary extension to lung cancer.

Comments 10 pages, 6 figures, 4 tables

详情
AI中文摘要

大型语言模型(LLMs)在从影像学发现生成诊断结论方面展现出有前景的性能,从而支持放射学报告、实习生教育和质量控制。然而,关于如何在不同临床背景下优化提示设计的系统指导仍未被充分探索。此外,评估LLM生成的放射学报告可信度的全面且标准化的框架尚未建立。本研究旨在通过引入多维可信度评估(MDCA)框架并提供机构特定提示优化的指导,增强LLM生成的肝脏MRI报告的可信度。所提出的框架被应用于评估和比较几个先进LLM的性能,包括Kimi-K2-Instruct-0905、Qwen3-235B-A22B-Instruct-2507、DeepSeek-V3和ByteDance-Seed-OSS-36B-Instruct,使用SiliconFlow平台。

英文摘要

Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains underexplored. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports is yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.

2510.10921 2026-05-26 cs.CV cs.AI cs.LG 版本更新

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

FG-CLIP 2: 一种双语细粒度视觉-语言对齐模型

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin

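Affiliations * 360 AI Research

AI summary Proposes FG-CLIP 2, a bilingual vision-language model that uses fine-grained supervision (region-text matching, long-caption modeling, and a Textual Intra-modal Contrastive loss) to achieve fine-grained alignment in English and Chinese, with state-of-the-art results across 29 datasets.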

Comments Accepted in ICML2026

详情
AI中文摘要

细粒度视觉-语言理解需要视觉内容与语言描述之间的精确对齐,这一能力在当前模型中仍然有限,尤其是在非英语环境下。虽然CLIP等模型在全局对齐上表现良好,但它们往往难以捕捉对象属性、空间关系和语言表达中的细粒度细节,且对双语理解的支持有限。为应对这些挑战,我们提出了FG-CLIP 2,一个旨在推进英语和中文细粒度对齐的双语视觉语言模型。我们的方法利用了丰富的细粒度监督,包括区域-文本匹配和长描述建模,以及多个判别性目标。我们进一步引入了文本内模态对比损失,以更好地区分语义相似的描述。在精心策划的大规模英语和中文数据混合上训练,包括新发布的1200万中文区域-文本数据集,FG-CLIP 2实现了强大的双语性能。为进行严格评估,我们提出了一个新的中文多模态理解基准,包括长描述检索和边界框分类。在8个任务的29个数据集上的大量实验表明,FG-CLIP 2优于现有方法,在两种语言上均达到了最先进的结果。我们发布了模型、代码和基准,以促进双语细粒度视觉-语言对齐的未来研究。

英文摘要

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, including a newly released 12M Chinese region-text dataset, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained vision-language alignment.

2510.08558 2026-05-26 cs.AI cs.CL cs.IR cs.LG 版本更新

Agent Learning via Early Experience

通过早期经验进行智能体学习

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

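Affiliations * Meta Superintelligence Labs; FAIR at Meta; The Ohio State University

AI summary Proposes the early experience paradigm, which uses interaction data generated by the agent's own actions (without reward signals) through two strategies, implicit world modeling and self-reflection, to improve effectiveness and out-of-domain generalization across diverse environments.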

Comments ICML 2026

详情
AI中文摘要

语言智能体的一个长期目标是通过自身经验学习和改进,最终在复杂的现实任务中超越人类。然而,在缺乏可验证奖励(如网站)或需要低效长程展开(如多轮工具使用)的许多环境中,基于经验数据使用强化学习训练智能体仍然困难。因此,当前大多数智能体依赖专家数据的监督微调,这难以扩展且泛化能力差。这一局限性源于专家示范的本质:它们只捕获了狭窄的场景范围,并使智能体暴露于有限的环境多样性。我们通过一种称为早期经验的中间范式来解决这一局限性:由智能体自身动作生成的交互数据,其中产生的未来状态作为监督信号,无需奖励。在此范式下,我们研究了使用此类数据的两种策略:(1)隐式世界建模,利用收集的状态将策略基于环境动态;(2)自我反思,智能体从其次优动作中学习以改进推理和决策。在八个多样化环境和多个模型家族上的评估表明,我们的方法持续提升了有效性和跨域泛化,凸显了早期经验的价值。此外,在具有可验证奖励的环境中,我们的结果提供了有希望的信号,表明早期经验为后续强化学习奠定了坚实基础,使其成为模仿学习与完全经验驱动智能体之间的实用桥梁。

英文摘要

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios, and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies of using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluation across eight diverse environments and multiple model families shows that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.
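The data-collection idea behind implicit world modeling can be sketched on a toy environment. Everything below is hypothetical: the one-dimensional `step` function, the action set, and the record format are assumptions chosen for illustration, and no model is trained. The point is only that each record pairs a state and an agent-chosen action with the resulting next state, so the next state can act as the supervision target without any reward.

```python
import random

ACTIONS = ("left", "right", "stay")  # hypothetical action set


def step(state: int, action: str) -> int:
    """Toy deterministic environment on positions 0..4 (an assumption)."""
    delta = {"left": -1, "right": 1, "stay": 0}[action]
    return min(4, max(0, state + delta))


def collect_early_experience(expert_states, num_alternatives, rng):
    """For each state visited by the expert, try actions sampled by the agent
    and record (state, action, next_state) with no reward field."""
    records = []
    for state in expert_states:
        for action in rng.sample(ACTIONS, num_alternatives):
            records.append(
                {"state": state, "action": action, "next_state": step(state, action)}
            )
    return records


records = collect_early_experience([0, 2, 4], num_alternatives=2, rng=random.Random(0))
for record in records:
    print(record)
```

A policy model could then be trained to predict `next_state` from `state` and `action` as an auxiliary objective; how the paper formats such examples is not specified in the abstract.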

2510.08350 2026-05-26 cs.LG cs.AI 版本更新

DeepEN: A Deep Reinforcement Learning Framework for Personalized Enteral Nutrition in Critical Care

DeepEN: 一种用于重症监护中个性化肠内营养的深度强化学习框架

Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng

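Affiliations * Institute of Data Science; Saw Swee Hock School of Public Health, National University of Singapore, Singapore; National University Hospital, Singapore

AI summary Proposes DeepEN, a deep reinforcement learning framework that learns personalized enteral nutrition targets from electronic health records and, on MIMIC-IV, reduces calibrated mortality by 4.0 percentage points in absolute terms compared with clinician practice.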

详情
AI中文摘要

目的:由于个性化程度有限以及在动态代谢需求下对适当热量、蛋白质和液体目标的不确定性,ICU中的肠内营养(EN)输送仍不理想。我们引入DeepEN,一个使用电子健康记录数据进行个性化EN优化的强化学习(RL)框架。方法:DeepEN在来自MIMIC-IV的超过11,000名ICU患者上训练,以生成每4小时一次、针对患者的卡路里、蛋白质和液体目标。状态表示包括人口统计学、合并症、生命体征、实验室值和近期干预措施。一个生理学对齐的奖励框架平衡了生物标志物稳定性与长期生存。策略学习采用带有保守Q学习正则化的决斗双深度Q网络,以实现安全的离线训练。结果:DeepEN实现了最高的估计策略价值($V^π= 9.48$)和最低的校准死亡率(18.8 ± 1.0%),与临床实践(22.8%)相比绝对降低了4.0个百分点。该策略还表现出优越的代谢稳定性,实现了目标范围内葡萄糖、磷酸盐和钠值的最高比例。此外,偏离DeepEN策略与死亡率和生物标志物不稳定性独立相关,而偏离随机策略则没有这种关联。可解释性分析进一步表明,建议是基于器官功能和代谢状态的生理相关标志物,而不是静态剂量启发式。结论:DeepEN证明了保守离线RL在安全、个性化EN优化中的可行性,突出了数据驱动个性化在重症监护中补充基于指南方法的潜力。

英文摘要

Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropriate calorie, protein, and fluid targets under dynamic metabolic demands. We introduce DeepEN, a reinforcement learning (RL) framework for personalized EN optimization using electronic health record data. Methods: DeepEN was trained on over 11,000 ICU patients from MIMIC-IV to generate 4-hourly, patient-specific caloric, protein, and fluid targets. The state representation incorporated demographics, comorbidities, vital signs, laboratory values, and recent interventions. A physiologically aligned reward framework balanced biomarker stability with long-term survival. Policy learning employed a dueling double deep Q-network with Conservative Q-Learning regularization to enable safe offline training. Results: DeepEN achieved the highest estimated policy value ($V^{\pi} = 9.48$) and the lowest calibrated mortality (18.8 ± 1.0%), representing a 4.0 percentage-point absolute reduction compared with clinician practice (22.8%). The policy also demonstrated superior metabolic stability, achieving the highest proportion of glucose, phosphate, and sodium values within target range. Furthermore, deviation from the DeepEN policy was independently associated with increased mortality and biomarker instability, whereas deviation from a random policy showed no such association. Interpretability analyses further indicated that recommendations were conditioned on physiologically relevant markers of organ function and metabolic status rather than static dosing heuristics. Conclusion: DeepEN demonstrates the feasibility of conservative offline RL for safe, individualized EN optimization, highlighting the potential of data-driven personalization to complement guideline-based approaches in critical care.
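The learning components named in the Methods can be sketched with their textbook definitions. The code below is not the DeepEN implementation: it shows the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), the double-DQN bootstrap target, and the Conservative Q-Learning penalty logsumexp_a Q(s, a) - Q(s, a_data) for a single transition. The numbers, the discount factor, and the function names are assumptions for illustration.

```python
import math


def dueling_q(state_value, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Select the next action with the online network, evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def cql_penalty(q_values, data_action):
    """CQL regularizer for one state: logsumexp over actions minus Q of the logged action."""
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[data_action]


q = dueling_q(1.0, [1.0, 2.0, 3.0])  # made-up value and advantages
target = double_dqn_target(0.5, 0.9, [0.2, 0.7, 0.1], [1.0, 2.0, 3.0], done=False)
penalty = cql_penalty(q, data_action=0)
print(q, target, penalty)
```

Minimizing this penalty alongside the temporal-difference error pushes Q-values down on actions that are absent from the logged data and up on the logged action, which is the conservative behaviour that offline RL relies on.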

2510.05699 2026-05-26 cs.CR cs.AI 版本更新

Membership Inference Attacks on Tokenizers of Large Language Models

大型语言模型分词器的成员推理攻击

Meng Tong, Yuntao Du, Kejiang Chen, Weiming Zhang, Ninghui Li

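Affiliations * University of Science and Technology of China; Purdue University

AI summary To address the challenges of membership inference attacks on pre-trained large language models, the authors introduce the tokenizer as a new attack vector, explore five attack methods, and design an adaptive defense.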

Comments To appear at USENIX Security Symposium 2026

详情
AI中文摘要

成员推理攻击(MIAs)被广泛用于评估机器学习模型的隐私风险。然而,当这些攻击应用于预训练的大型语言模型(LLMs)时,会遇到显著挑战,包括错误标记的样本、分布偏移以及实验环境与真实环境之间模型规模的差异。为了解决这些限制,我们引入分词器作为成员推理的新攻击向量。具体来说,分词器将原始文本转换为LLMs的令牌。与完整模型不同,分词器可以从头开始高效训练,从而避免上述挑战。此外,分词器的训练数据通常代表用于预训练LLMs的数据。尽管有这些优势,分词器作为攻击向量的潜力尚未被探索。为此,我们首次开展了关于通过分词器泄露成员信息的研究,并探索了五种攻击方法来推断数据集成员身份。在数百万互联网样本上的广泛实验揭示了最先进LLMs分词器中的漏洞。为了缓解这一新兴风险,我们进一步提出了一种自适应防御。我们的发现强调了分词器是一个被忽视但关键的隐私威胁,突显了专门为其设计隐私保护机制的迫切需求。

英文摘要

Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when these attacks are applied to pre-trained large language models (LLMs), they encounter significant challenges, including mislabeled samples, distribution shifts, and discrepancies in model size between experimental and real-world settings. To address these limitations, we introduce tokenizers as a new attack vector for membership inference. Specifically, a tokenizer converts raw text into tokens for LLMs. Unlike full models, tokenizers can be efficiently trained from scratch, thereby avoiding the aforementioned challenges. In addition, the tokenizer's training data is typically representative of the data used to pre-train LLMs. Despite these advantages, the potential of tokenizers as an attack vector remains unexplored. To this end, we present the first study on membership leakage through tokenizers and explore five attack methods to infer dataset membership. Extensive experiments on millions of Internet samples reveal the vulnerabilities in the tokenizers of state-of-the-art LLMs. To mitigate this emerging risk, we further propose an adaptive defense. Our findings highlight tokenizers as an overlooked yet critical privacy threat, underscoring the urgent need for privacy-preserving mechanisms specifically designed for them.

2510.05688 2026-05-26 cs.LG cs.AI 版本更新

vAttention: Verified Sparse Attention

vAttention: 验证的稀疏注意力

Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

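Affiliations * Electrical Engineering and Computer Sciences, University of California, Berkeley

AI summary Proposes vAttention, which unifies top-k and random sampling to give the first practical sparse attention mechanism with user-specified (ε, δ) guarantees on approximation accuracy, improving the quality-efficiency trade-off.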

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
AI中文摘要

最先进的用于减少解码延迟的稀疏注意力方法主要分为两类:近似top-$k$(及其扩展top-$p$)和最近引入的基于采样的估计。然而,这些方法在逼近全注意力方面存在根本性局限:它们无法在头和查询向量之间提供一致的近似,最关键的是,缺乏对近似质量的保证,限制了其实际部署。我们观察到top-$k$和随机采样是互补的:当注意力分数由少数标记主导时,top-$k$表现良好,而当注意力分数相对均匀时,随机采样提供更好的估计。基于这一洞察并利用采样的统计保证,我们引入了vAttention,这是第一个具有用户指定$(ε, δ)$近似精度保证(因此称为“已验证”)的实用稀疏注意力机制。这些保证使vAttention成为向大规模实用、可靠部署稀疏注意力迈出的引人注目的一步。通过统一top-$k$和采样,vAttention在质量-效率权衡上优于两者各自的表现。我们的实验表明,vAttention显著提高了稀疏注意力的质量(例如,在RULER-HARD上,Llama 3.1 8B Instruct和DeepSeek-R1-Distill-Llama-8B提高了约4.5个百分点),并有效弥合了全注意力和稀疏注意力之间的差距(例如,在多个数据集上,以高达20倍稀疏度匹配全模型质量)。我们还展示了它可以部署在推理场景中,在不牺牲模型质量的情况下实现快速解码(例如,vAttention在AIME2024上以10倍稀疏度和高达32K标记生成实现了全模型质量)。代码:https://github.com/skylight-org/sparse-attention-hub。网页:https://sky-light.eecs.berkeley.edu。

英文摘要

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(ε, δ)$ guarantees on approximation accuracy (thus, "verified"). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama 3.1 8B Instruct and DeepSeek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code: https://github.com/skylight-org/sparse-attention-hub. Webpage: https://sky-light.eecs.berkeley.edu.
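The complementarity described above can be illustrated with a small NumPy sketch. It is not the vAttention algorithm and gives no $(ε, δ)$ guarantee: it only shows one way to combine an exact top-$k$ term with a uniformly sampled estimate of the remaining tokens. The function names, the single-query setup, and the scaling rule are assumptions made for illustration.

```python
import numpy as np


def full_attention(q, K, V):
    """Reference softmax attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())  # subtract max for numerical stability
    return (w @ V) / w.sum()


def hybrid_sparse_attention(q, K, V, k, n_samples, rng):
    """Toy estimator: exact top-k contribution plus a sampled estimate of the rest.

    Assumption: all scores are computed here for clarity, which a real sparse
    method would avoid (for example by using an approximate top-k index).
    """
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    n = K.shape[0]

    # Heavy hitters: the k largest attention weights are used exactly.
    top = np.argpartition(w, -k)[-k:]
    num = w[top] @ V[top]
    den = w[top].sum()

    # Residual tokens: sample uniformly without replacement and rescale,
    # so the sampled sums estimate the sums over all residual tokens.
    mask = np.ones(n, dtype=bool)
    mask[top] = False
    rest = np.flatnonzero(mask)
    if rest.size > 0 and n_samples > 0:
        m = min(n_samples, rest.size)
        sampled = rng.choice(rest, size=m, replace=False)
        scale = rest.size / m
        num = num + scale * (w[sampled] @ V[sampled])
        den = den + scale * w[sampled].sum()
    return num / den


rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 4))
approx = hybrid_sparse_attention(q, K, V, k=4, n_samples=8, rng=rng)
exact = full_attention(q, K, V)
```

When the sample covers every residual token, the estimate coincides with full attention. With fewer samples it is an approximation whose error depends on how uniform the residual weights are.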

2510.02837 2026-05-26 cs.AI cs.CL 版本更新

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

超越最终答案:评估工具增强型智能体的推理轨迹

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

发表机构 * Graduate School of Data Science, KAIST, Daejeon, South Korea(数据科学研究生院,韩国科学技术院,大田,韩国) Department of Industrial and Systems Engineering, KAIST, Daejeon, South Korea(工业与系统工程系,韩国科学技术院,大田,韩国) Department of Artificial Intelligence, Yonsei University, Seoul, South Korea(人工智能系,延世大学,首尔,韩国)

AI总结 针对工具增强型LLM,提出无参考框架TRACE,通过证据库多维度评估推理轨迹的效率、幻觉和适应性,并用元评估数据集验证其有效性。

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

尽管最近的工具增强型基准涉及复杂请求,但评估仍局限于答案匹配,忽略了效率、幻觉和适应性等关键轨迹方面。最直接的评估方法是将智能体的轨迹与真实轨迹进行比较,但注释所有有效的真实轨迹成本过高。为此,我们引入TRACE,一个用于工具增强型LLM多维度评估的无参考框架。通过整合一个从先前步骤积累知识的证据库,TRACE有效评估智能体的推理轨迹。为验证我们的框架,我们开发了一个新的元评估数据集,包含多样且有缺陷的轨迹,每个轨迹都标有多方面的性能分数。我们的结果证实,即使使用小型开源LLM,TRACE也能准确评估复杂轨迹。此外,我们应用该方法评估智能体在解决工具增强型任务时产生的轨迹,展示了先前未报告的观察结果及其相应的见解。

英文摘要

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects such as efficiency, hallucination, and adaptivity. The most straightforward evaluation method is to compare an agent's trajectory with a ground-truth trajectory, but annotating all valid ground-truth trajectories is prohibitively expensive. To this end, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank that accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

2510.02361 2026-05-26 cs.CL cs.AI 版本更新

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

ChunkLLM: 一种轻量级可插拔的LLM推理加速框架

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院)

AI总结 针对Transformer自注意力二次复杂度导致的推理效率低下问题,提出ChunkLLM框架,通过QK适配器和块适配器实现块选择与压缩,在保持性能的同时显著加速推理。

详情
AI中文摘要

基于Transformer的大模型在自然语言处理和计算机视觉中表现出色,但由于自注意力对输入令牌的二次复杂度,面临严重的计算效率低下问题。最近,研究人员提出了一系列基于块选择和压缩的方法来缓解这一问题,但它们要么存在语义不完整的问题,要么训练-推理效率低下。为了全面解决这些挑战,我们提出了ChunkLLM,一个轻量级且可插拔的训练框架。具体来说,我们引入了两个组件:QK适配器(Q-Adapter和K-Adapter)和块适配器。前者附加在每个Transformer层上,兼具特征压缩和块注意力获取的双重目的。后者在模型的最底层运行,通过利用上下文语义信息来检测块边界。在训练阶段,骨干网络的参数保持冻结,仅QK适配器和块适配器进行训练。值得注意的是,我们设计了一种注意力蒸馏方法来训练QK适配器,这提高了关键块的召回率。在推理阶段,仅当当前令牌被检测为块边界时才触发块选择,从而加速模型推理。我们在涵盖多个任务的多种长文本和短文本基准数据集上进行了实验评估。ChunkLLM不仅在短文本基准上取得了可比的性能,而且在长上下文基准上保持了98.64%的性能,同时保持了48.58%的键值缓存保留率。特别地,在处理120K长文本时,ChunkLLM相比原始Transformer实现了最大4.48倍的加速。

英文摘要

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies because self-attention has quadratic complexity in the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they suffer from either semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving the dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, detecting chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. In particular, ChunkLLM attains a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token long texts.

2510.02327 2026-05-26 cs.CL cs.AI eess.AS 版本更新

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

KAME:用于增强实时语音到语音对话AI知识的串联架构

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

AI总结 提出一种混合架构,通过实时注入后端LLM的文本响应来增强S2S模型的知识,在保持低延迟的同时提升响应正确性。

Comments Published at IEEE ICASSP 2026

详情
AI中文摘要

实时语音到语音(S2S)模型擅长生成自然、低延迟的对话响应,但往往缺乏深层知识和语义理解。相反,结合自动语音识别、基于文本的大语言模型(LLM)和文本到语音合成的级联系统提供了优越的知识表示,但代价是高延迟,这破坏了自然交互的流畅性。本文介绍了一种新颖的混合架构,弥合了这两种范式之间的差距。我们的框架通过S2S Transformer处理用户语音以实现即时响应,同时将查询并发地传递给强大的后端LLM。然后,LLM的基于文本的响应被实时注入以指导S2S模型的语音生成,有效地为其输出注入丰富的知识,而无需承受级联系统的全部延迟惩罚。我们使用MT-Bench基准的语音合成变体(包含多轮问答会话)评估了我们的方法。结果表明,我们的系统在响应正确性上显著优于基线S2S模型,接近级联系统的水平,同时保持了与基线相当的延迟。

英文摘要

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

2509.24978 2026-05-26 cs.AI cond-mat.quant-gas quant-ph 版本更新

Agentic Exploration of Physics Models

物理模型的智能体探索

Maximilian Nägele, Florian Marquardt

发表机构 * Max Planck Institute for the Science of Light(马克斯·普朗克光科学研究所)

AI总结 提出 SciExplorer 智能体,利用大语言模型工具使用能力,无需领域特定蓝图即可探索未知物理系统,通过实验和观测恢复运动方程和哈密顿量。

详情
AI中文摘要

科学发现的过程依赖于观察、分析和假设生成的相互作用。机器学习正越来越多地被用于处理这一过程的各个方面。然而,在不针对特定任务定制方法的前提下,完全自动化发现未知系统定律所需的启发式迭代循环(通过实验和分析探索系统)仍然是一个开放挑战。在这里,我们介绍了 SciExplorer,一个利用大语言模型工具使用能力来探索系统而无需任何领域特定蓝图的智能体,并将其应用于最初对智能体未知的物理系统。我们在涵盖机械动力学系统、波演化和量子多体物理的广泛模型上测试了 SciExplorer。尽管使用了最小工具集(主要基于代码执行),我们观察到在从观测动力学恢复运动方程和从期望值推断哈密顿量等任务上表现出色。该设置的有效性为在其他领域进行类似的科学探索打开了大门,无需微调或任务特定指令。

英文摘要

The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.

2509.22299 2026-05-26 cs.LG cs.AI 版本更新

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

HEAPr: 基于Hessian的输出空间中高效原子专家剪枝

Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院) FABU Inc.(FABU公司) Hangzhou Kuaidi Science and Technology Co., Ltd.(杭州快的科学技术有限公司)

AI总结 针对MoE模型粗粒度专家剪枝导致精度下降的问题,提出HEAPr算法,通过将专家分解为原子专家并利用二阶信息(最优脑外科原理)评估重要性,在输出空间简化计算,实现高比例无损压缩。

Comments ICLR 2026

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
AI中文摘要

大型语言模型中的混合专家(MoE)架构相比密集LLM具有卓越性能和更低的推理成本。然而,其庞大的参数数量导致内存需求过高,限制了实际部署。现有的剪枝方法主要关注专家级剪枝,这种粗粒度通常导致显著的精度下降。在这项工作中,我们引入了HEAPr,一种新颖的剪枝算法,它将专家分解为更小、不可分割的原子专家,从而实现更精确和灵活的原子专家剪枝。为了衡量每个原子专家的重要性,我们利用基于最优脑外科理论原理的二阶信息。为了解决二阶信息带来的计算和存储挑战,HEAPr利用原子专家的固有属性,将专家参数的二阶信息转换为原子专家参数的二阶信息,并进一步简化为原子专家输出的二阶信息。这种方法将空间复杂度从$O(d^4)$(其中$d$是模型的维度)降低到$O(d^2)$。HEAPr仅需在小型校准集上进行两次前向传播和一次反向传播即可计算原子专家的重要性。在包括DeepSeek MoE和Qwen MoE系列在内的MoE模型上的大量实验表明,HEAPr在广泛的剪枝比例和基准测试中优于现有的专家级剪枝方法。具体来说,在大多数模型中,HEAPr在20%~25%的剪枝比例下实现了几乎无损的压缩,同时FLOPs也减少了近20%。代码可在[https://github.com/LLIKKE/HEAPr](https://github.com/LLIKKE/HEAPr)找到。

英文摘要

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to the Optimal Brain Surgeon theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at pruning ratios of 20% to 25% in most models, while also reducing FLOPs by nearly 20%. The code can be found at [https://github.com/LLIKKE/HEAPr](https://github.com/LLIKKE/HEAPr).
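The jump from $O(d^4)$ to $O(d^2)$ can be made plausible with a simple entry count. This is a back-of-the-envelope reading, not the paper's derivation, and it assumes an expert weight matrix of roughly $d \times d$ and a $d$-dimensional expert output; $H_W$ and $H_y$ are illustrative symbols.

```latex
\[
\underbrace{W \in \mathbb{R}^{d \times d}}_{d^2 \text{ parameters}}
\;\Rightarrow\;
H_W \in \mathbb{R}^{d^2 \times d^2} \quad (d^4 \text{ entries}),
\qquad
y \in \mathbb{R}^{d}
\;\Rightarrow\;
H_y \in \mathbb{R}^{d \times d} \quad (d^2 \text{ entries}).
\]
```

For $d = 4096$, for instance, $d^2 \approx 1.7 \times 10^{7}$ while $d^4 \approx 2.8 \times 10^{14}$, which is why storing second-order information in parameter space is impractical at this scale.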

2509.21592 2026-05-26 cs.CV cs.AI cs.LG 版本更新

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

接下来会发生什么?通过生成点轨迹预测未来运动

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

发表机构 * Visual Geometry Group, University of Oxford(牛津大学视觉几何组)

AI总结 提出一种基于单张图像预测未来运动的方法,通过生成密集轨迹网格来捕捉场景动态和不确定性,相比现有方法更准确多样,并验证其在机器人等下游任务中的有效性。

详情
Journal ref
ICLR 2026
AI中文摘要

我们考虑从单张图像预测运动的问题,即预测世界中物体可能如何移动,而无法观察其他参数如物体速度或施加的力。我们将此任务表述为密集轨迹网格的条件生成,模型紧密遵循现代视频生成器的架构,但输出运动轨迹而非像素。这种方法捕捉了场景范围的动态和不确定性,比先前的回归器和生成器产生更准确和多样化的预测。我们在模拟数据上广泛评估了我们的方法,展示了其在机器人等下游应用中的有效性,并在真实世界的直觉物理数据集上显示出有希望的准确性。尽管最近最先进的视频生成器常被视为世界模型,但我们表明它们在从单张图像预测运动方面存在困难,即使在简单的物理场景如落块或机械物体交互中,尽管对这些数据进行了微调。我们表明这一局限性源于生成像素的开销,而非直接建模运动。

英文摘要

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

2509.16931 2026-05-26 cs.IR cs.AI cs.LG 版本更新

Equip Pre-ranking with Target Attention by Residual Quantization

通过残差量化为预排序阶段配备目标注意力机制

Yutong Li, Yu Zhu, Yichen Qiao, Ziyu Guan, Lv Shao, Tong Liu, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba, Hangzhou, China; Shanghai Jiao Tong University, Shanghai, China; Xidian University, Xi'an, China; Taobao & Tmall Group of Alibaba, Beijing, China

AI总结 提出TARQ框架,利用残差量化在预排序阶段近似目标注意力架构,首次在延迟关键阶段引入TA建模能力,实现精度与效率的新最优平衡。

Comments 5 pages, 2 figures, accepted by SIGIR 2026 Short Paper Track

详情
AI中文摘要

工业推荐系统中的预排序阶段面临效率与效果之间的根本冲突。虽然目标注意力(TA)等强大模型在排序阶段擅长捕捉复杂的特征交互,但其高计算成本使其无法用于通常依赖简单向量积模型的预排序阶段。这种差异给整个系统造成了显著的性能瓶颈。为弥合这一差距,我们提出了TARQ,一种新颖的预排序框架。受生成模型启发,TARQ的关键创新在于通过残差量化为预排序阶段配备近似TA的架构。这使得我们首次将TA的建模能力引入延迟关键的预排序阶段,建立了精度与效率之间新的最优权衡。在淘宝进行的大量离线实验和大规模在线A/B测试证明了TARQ在排序性能上的显著提升。因此,我们的模型已全面部署在生产环境中,服务于数千万日活跃用户,并带来了可观的业务改进。代码和数据可在 https://github.com/zyody/tarq_sigir2026 获取。

英文摘要

The pre-ranking stage in industrial recommendation systems faces a fundamental conflict between efficiency and effectiveness. While powerful models like Target Attention (TA) excel at capturing complex feature interactions in the ranking stage, their high computational cost makes them infeasible for pre-ranking, which often relies on simplistic vector-product models. This disparity creates a significant performance bottleneck for the entire system. To bridge this gap, we propose TARQ, a novel pre-ranking framework. Inspired by generative models, TARQ's key innovation is to equip pre-ranking with an architecture that approximates TA by means of residual quantization. This allows us to bring the modeling power of TA into the latency-critical pre-ranking stage for the first time, establishing a new state-of-the-art trade-off between accuracy and efficiency. Extensive offline experiments and large-scale online A/B tests at Taobao demonstrate TARQ's significant improvements in ranking performance. Consequently, our model has been fully deployed in production, serving tens of millions of daily active users and yielding substantial business improvements. The code and data are available at https://github.com/zyody/tarq_sigir2026.
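For readers unfamiliar with residual quantization, the sketch below shows the generic technique: a vector is encoded as a short sequence of codeword indices, each level quantizing what the previous levels left over. It says nothing about how TARQ uses these codes to approximate TA. The codebooks and function names are made up for illustration.

```python
import numpy as np

# Hypothetical two-level codebooks. Each includes the zero vector so that
# adding a level can never increase the residual norm.
CODEBOOKS = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]),
    np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [0.0, -0.2]]),
]


def rq_encode(x, codebooks):
    """Greedy residual quantization: pick the nearest codeword at each level."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for codebook in codebooks:
        distances = ((codebook - residual) ** 2).sum(axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        residual = residual - codebook[index]  # pass the leftover to the next level
    return codes, residual


def rq_decode(codes, codebooks):
    """Reconstruct the approximation by summing the selected codewords."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


codes, residual = rq_encode([0.9, -0.2], CODEBOOKS)
approximation = rq_decode(codes, CODEBOOKS)
```

The appeal for latency-critical stages is that each item is represented by a few small integers, so expensive interactions can be precomputed per codeword rather than per item.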

2508.19113 2026-05-26 cs.AI 版本更新

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

混合深度搜索器:可扩展的并行与顺序搜索推理

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

发表机构 * Seoul National University(首尔国立大学) LG AI Research(LG AI研究院) University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of Seoul(首尔市立大学)

AI总结 提出混合搜索策略HybridDeepSearcher,通过并行查询扩展与显式证据聚合结合顺序推理,在多个基准上显著提升性能并实现测试时搜索扩展。

Comments Accepted to ICLR 2026

详情
AI中文摘要

大型推理模型(LRMs)结合检索增强生成(RAG)使得深度研究智能体能够通过外部知识检索进行多步推理。然而,我们发现现有方法很少展示测试时搜索扩展。通过单查询顺序搜索扩展推理的方法受限于证据覆盖范围,而每步生成多个独立查询的方法通常缺乏结构化聚合,阻碍了更深的顺序推理。我们提出一种混合搜索策略来解决这些限制。我们引入了HybridDeepSearcher,一种结构化的搜索智能体,它在进入更深的顺序推理之前集成了并行查询扩展与显式证据聚合。为了监督这种行为,我们引入了HDS-QA,一个新颖的数据集,通过包含并行子查询的监督推理-查询-检索轨迹,指导模型将广泛的并行搜索与结构化聚合相结合。在五个基准上,HybridDeepSearcher显著优于现有技术,在FanOutQA上F1分数提高+15.9,在BrowseComp子集上提高+9.2。进一步分析显示其一致的测试时搜索扩展:随着允许的额外搜索轮次或调用次数增加,性能持续提升,而竞争方法则趋于平稳。

英文摘要

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test-time search scaling. Methods that extend reasoning through single-query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS-QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning-query-retrieval trajectories containing parallel sub-queries. Across five benchmarks, HybridDeepSearcher significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +9.2 on a subset of BrowseComp. Further analysis shows its consistent test-time search scaling: performance improves as additional search turns or calls are allowed, while competing methods plateau.
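The control flow described above (issue several sub-queries in parallel, aggregate their evidence explicitly, then continue reasoning sequentially) might look roughly like the following toy sketch. The in-memory corpus, `toy_search`, and `parallel_search_and_aggregate` are hypothetical stand-ins and are not part of HybridDeepSearcher.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory corpus standing in for a real retrieval backend.
CORPUS = {
    "capital of france": ["Paris is the capital of France."],
    "capital of japan": ["Tokyo is the capital of Japan."],
}


def toy_search(query):
    """Return the documents stored for a query, or an empty list."""
    return CORPUS.get(query.lower(), [])


def parallel_search_and_aggregate(sub_queries, search=toy_search):
    """Run all sub-queries concurrently, then group evidence by sub-query."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so results align with sub_queries.
        results = list(pool.map(search, sub_queries))
    return {query: docs for query, docs in zip(sub_queries, results)}


# One "turn": a parallel fan-out followed by explicit aggregation. A later
# sequential step would read this evidence before issuing new queries.
evidence = parallel_search_and_aggregate(
    ["capital of France", "capital of Japan", "capital of Mars"]
)
unanswered = [query for query, docs in evidence.items() if not docs]
```

Keeping the evidence keyed by sub-query makes it easy to see which parts of the question are still unsupported before the next sequential step.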

2508.12538 2026-05-26 cs.CR cs.AI cs.SE 版本更新

MCPXKIT: The Unified Toolkit for Analyzing Model Context Protocol Security

MCPXKIT:分析模型上下文协议安全性的统一工具包

Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, Sheng Wen

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(深圳国际研究生院,清华大学,深圳,中国) Ant Group, Hangzhou, China(蚂蚁集团,杭州,中国) Swinburne University of Technology, Melbourne, Australia(斯威本科技大学,墨尔本,澳大利亚) The University of Adelaide, Adelaide, Australia(阿德莱德大学,阿德莱德,澳大利亚) UNSW Sydney, Australia(新南威尔士大学,悉尼,澳大利亚)

AI总结 本文提出MCPXKIT工具包,分类实现了31种攻击方法,通过定量实验揭示了MCP在工具描述依赖、文件攻击、链攻击及数据命令区分等方面的漏洞,并提供了安全增强建议。

Comments Accepted by IEEE Transactions on Dependable and Secure Computing (TDSC). Official version: https://ieeexplore.ieee.org/abstract/document/11531012

详情
AI中文摘要

模型上下文协议(MCP)已成为一种通用标准,使AI代理能够无缝连接外部工具,显著增强其功能。然而,MCP在带来显著优势的同时,也引入了重大漏洞,例如工具投毒攻击(TPA),其中隐藏的恶意指令利用大型语言模型(LLM)的谄媚性来操纵代理行为。尽管存在这些风险,当前关于MCP安全性的学术研究仍然有限,大多数研究侧重于狭窄或定性的分析,未能捕捉真实世界威胁的多样性。为填补这一空白,我们提出了MCP利用工具包(MCPXKIT),该工具包在四个关键分类下分类并实现了31种不同的攻击方法:直接工具注入、间接工具注入、恶意用户攻击和LLM固有攻击。我们进一步对每种攻击的有效性进行了定量分析。我们的实验揭示了MCP漏洞的关键见解,包括代理对工具描述的盲目依赖、对基于文件的攻击的敏感性、利用共享上下文的链式攻击,以及难以区分外部数据与可执行命令。这些通过攻击实验验证的见解,强调了制定稳健防御策略和知情MCP设计的紧迫性。我们的贡献包括:1)构建全面的MCP攻击分类法,2)引入统一的攻击框架MCPXKIT,以及3)进行实证漏洞分析以增强MCP安全机制。这项工作为支持MCP生态系统的安全演进提供了基础框架。

英文摘要

The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, significantly enhancing their functionality. However, while MCP brings notable benefits, it also introduces significant vulnerabilities, such as Tool Poisoning Attacks (TPA), where hidden malicious instructions exploit the sycophancy of large language models (LLMs) to manipulate agent behavior. Despite these risks, current academic research on MCP security remains limited, with most studies focusing on narrow or qualitative analyses that fail to capture the diversity of real-world threats. To address this gap, we present the MCP eXploit Toolkit (MCPXKIT), which categorizes and implements 31 distinct attack methods under four key classifications: direct tool injection, indirect tool injection, malicious user attacks, and LLM-inherent attacks. We further conduct a quantitative analysis of the efficacy of each attack. Our experiments reveal key insights into MCP vulnerabilities, including agents' blind reliance on tool descriptions, sensitivity to file-based attacks, chain attacks exploiting shared context, and difficulty distinguishing external data from executable commands. These insights, validated through attack experiments, underscore the urgent need for robust defense strategies and informed MCP design. Our contributions include 1) constructing a comprehensive MCP attack taxonomy, 2) introducing a unified attack framework, MCPXKIT, and 3) conducting empirical vulnerability analysis to enhance MCP security mechanisms. This work provides a foundational framework for supporting the secure evolution of MCP ecosystems.

2508.09801 2026-05-26 cs.CR cs.AI 版本更新

Explainable Attention-Guided Stacked Graph Neural Networks for Malware Detection

可解释的注意力引导堆叠图神经网络用于恶意软件检测

Hossein Shokouhinejad, Roozbeh Razavi-Far, Griffin Higgins, Ali A Ghorbani

发表机构 * University of New Brunswick(新不伦瑞克大学)

AI总结 提出一种注意力引导的堆叠集成图神经网络框架,通过提取控制流图并利用多种GNN基学习器与注意力元学习器,实现恶意软件的高精度检测与可解释性分析。

详情
AI中文摘要

现代计算环境中的恶意软件检测需要模型不仅准确,而且可解释且对规避技术具有鲁棒性。图神经网络(GNN)通过建模基于图的程序表示(如控制流图(CFG))中的丰富结构依赖关系,在该领域显示出潜力。然而,单一模型方法可能面临泛化能力有限和缺乏可解释性的问题,尤其是在高风险安全应用中。在本文中,我们提出了一种新颖的堆叠集成框架,用于基于图的恶意软件检测和解释。我们的方法从可移植可执行(PE)文件中动态提取CFG,并通过两步嵌入策略对其基本块进行编码。使用一组多样化的GNN基学习器,每个学习器具有不同的消息传递机制,以捕获互补的行为特征。它们的预测输出由作为基于注意力的多层感知器实现的元学习器聚合,该元学习器既对恶意软件实例进行分类,又量化每个基模型的贡献。为了增强可解释性,我们引入了一种集成感知的事后解释技术,该技术利用GNN解释器生成的边级重要性分数,并使用学习到的注意力权重融合它们。这产生了与最终集成决策一致的可解释、模型无关的解释。实验结果表明,我们的框架在提供对恶意软件行为有洞察力的解释的同时,提高了分类性能。

英文摘要

Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive techniques. Graph neural networks (GNNs) have shown promise in this domain by modeling rich structural dependencies in graph-based program representations such as control flow graphs (CFGs). However, single-model approaches may suffer from limited generalization and lack interpretability, especially in high-stakes security applications. In this paper, we propose a novel stacking ensemble framework for graph-based malware detection and explanation. Our method dynamically extracts CFGs from portable executable (PE) files and encodes their basic blocks through a two-step embedding strategy. A set of diverse GNN base learners, each with a distinct message-passing mechanism, is used to capture complementary behavioral features. Their prediction outputs are aggregated by a meta-learner implemented as an attention-based multilayer perceptron, which both classifies malware instances and quantifies the contribution of each base model. To enhance explainability, we introduce an ensemble-aware post-hoc explanation technique that leverages edge-level importance scores generated by a GNN explainer and fuses them using the learned attention weights. This produces interpretable, model-agnostic explanations aligned with the final ensemble decision. Experimental results demonstrate that our framework improves classification performance while providing insightful interpretations of malware behavior.
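The fusion step described above, in which per-model edge importance scores are combined using attention weights, can be sketched in a few lines of NumPy. This is only an illustration of attention-weighted averaging; the function name, the softmax normalization, and the input shapes are assumptions rather than details taken from the paper.

```python
import numpy as np


def fuse_edge_importance(edge_scores, attention_logits):
    """Combine per-model edge importance scores with softmax attention weights.

    edge_scores: array-like of shape (n_models, n_edges), one row per base GNN.
    attention_logits: array-like of shape (n_models,), unnormalized weights.
    """
    scores = np.asarray(edge_scores, dtype=float)
    logits = np.asarray(attention_logits, dtype=float)
    weights = np.exp(logits - logits.max())  # stable softmax
    weights = weights / weights.sum()
    return weights @ scores  # shape (n_edges,)


# Two hypothetical base models that disagree about which edge matters.
fused = fuse_edge_importance([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal logits the result is a plain average; as one model's attention weight grows, the fused explanation moves toward that model's edge ranking.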

2508.03104 2026-05-26 cs.LG cs.AI 版本更新

HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

HiTeC: 基于语义感知增强的文本属性超图层次对比学习

Mengting Pan, Fan Li, Chen Chen, Xiaoyang Wang, Wenjie Zhang

发表机构 * The University of New South Wales(新南威尔士大学) University of Wollongong(伍伦贡大学)

AI总结 提出HiTeC框架,通过两阶段层次对比学习,结合结构感知文本编码预训练和语义感知增强,解决文本属性超图中文本与拓扑关联不足、随机增强噪声及长程依赖捕获问题。

Comments 16 pages, 8 figures

详情
AI中文摘要

对比学习已成为自监督超图学习的主流范式,能够在无需昂贵标签的情况下实现有效训练。然而,现实世界超图中的节点实体通常关联丰富的文本信息,这在先前工作中被大量忽略。直接将现有基于对比学习的方法应用于此类文本属性超图(TAHGs)会导致三个关键限制:(1)普遍使用的图无关文本编码器无法捕获文本语义与超图拓扑之间的相关性,导致表示表达能力不足。(2)它们对随机数据增强的依赖引入了噪声并削弱了对比信号。(3)主要关注节点和超边级别的对比信号限制了捕获长程依赖的能力,而这对于有效的表示学习至关重要。为解决这些挑战,我们引入了HiTeC,一个两阶段层次对比学习框架,用于在TAHGs上进行有效的自监督学习。在第一阶段,我们使用结构感知的对比目标预训练文本编码器,以克服传统方法的图无关特性。在第二阶段,我们首先引入语义感知增强,包括结构上下文化的文本增强和语义感知的超边丢弃,以促进信息丰富的视图生成。随后,我们提出一个多尺度对比损失,结合基于$s$步行走的子图级别目标,以捕获长程依赖。在六个真实世界数据集上的大量实验验证了我们提出方法的有效性。

英文摘要

Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which has been largely ignored in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders fails to capture the correlations between textual semantics and hypergraph topology, resulting in less expressive representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive signals. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for effective representation learning. To address these challenges, we introduce HiTeC, a two-stage hierarchical contrastive learning framework for effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we begin by introducing semantic-aware augmentations, including structure-contextualized text augmentation and semantic-aware hyperedge dropping, to facilitate informative view generation. Subsequently, we propose a multi-scale contrastive loss with an $s$-walk-based subgraph-level objective to capture long-range dependencies. Extensive experiments on six real-world datasets validate the effectiveness of our proposed method.

2507.10593 2026-05-26 cs.SE cs.AI cs.CL cs.LG 版本更新

ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

ToolRegistry: 一个用于函数调用LLM的协议无关工具管理库

Peng Ding, Rick Stevens

发表机构 * University of Chicago(芝加哥大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出ToolRegistry系统,通过统一工具对象和注册表实现协议无关的工具管理,支持多种传输协议、可插拔后端和高级功能,显著减少集成代码并提升吞吐量。

Comments 16 pages, 4 figures, v3: add co-author, permission system, progressive tool disclosure, think-augmented calling, RPC framing, multi-provider support

详情
AI中文摘要

每个LLM工具调用在结构上都是一个RPC——一个函数名、JSON参数和序列化结果——然而每个协议(原生Python、MCP、OpenAPI、LangChain)都是从零开始集成的。我们提出ToolRegistry,一个使这种RPC本质显式化的系统:一个单一的Tool对象充当通用存根,无论传输方式如何,而注册表则作为RPC客户端运行时,负责调度、模式生成和执行。该系统以三个包的形式发布——一个核心注册表、一个通过MCP和OpenAPI暴露工具的服务器,以及一个生产就绪实现的中心——并通过可插拔的线程或进程后端调用工具。该系统现在还提供基于标签的权限策略、针对大型注册表的BM25F驱动的渐进式工具披露、增强思考的函数调用、多提供商模式支持(OpenAI、Anthropic、Gemini)、声明式JSONC/YAML配置,以及一个基于仅stdlib内置模块的近乎零依赖的核心。在我们的基准测试中,该库将集成代码减少了60-80%,并且为给定工作负载选择正确的并发模式(线程与进程)相比替代方案可带来高达3.1倍的吞吐量。ToolRegistry在https://github.com/Oaklight/ToolRegistry开源;文档位于https://toolregistry.readthedocs.io/。

英文摘要

Every LLM tool call is structurally an RPC -- a function name, JSON arguments, and a serialized result -- yet each protocol (native Python, MCP, OpenAPI, LangChain) is integrated from scratch. We present ToolRegistry, a system that makes this RPC nature explicit: a single Tool object acts as a universal stub regardless of transport, while the registry serves as the RPC client runtime for dispatch, schema generation, and execution. The system ships as three packages -- a core registry, a server exposing tools over MCP and OpenAPI, and a hub of production-ready implementations -- and invokes tools through pluggable thread or process backends. The system now also provides tag-based permission policies, BM25F-powered progressive tool disclosure for large registries, think-augmented function calling, multi-provider schema support (OpenAI, Anthropic, Gemini), declarative JSONC/YAML configuration, and a near-zero-dependency core built on stdlib-only vendored modules. In our benchmarks the library cuts integration code by 60-80%, and choosing the right concurrency mode (thread vs. process) yields up to 3.1x throughput over the alternative for a given workload. ToolRegistry is open-source at https://github.com/Oaklight/ToolRegistry; documentation lives at https://toolregistry.readthedocs.io/.
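The "tool call as RPC" framing can be made concrete with a minimal registry that derives a JSON-style schema from a Python signature and dispatches a call from serialized arguments. `MiniRegistry` and its methods are hypothetical names invented for this sketch. They are not the ToolRegistry API, which should be looked up in the project documentation.

```python
import inspect
import json
from typing import Any, Callable, Dict

# Mapping from Python annotations to JSON Schema type names (illustrative subset).
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


class MiniRegistry:
    """Toy registry: name -> callable, with schema generation and dispatch."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        self._tools[fn.__name__] = fn
        return fn

    def schema(self, name: str) -> Dict[str, Any]:
        """Describe a tool as a function name plus JSON-typed parameters."""
        fn = self._tools[name]
        properties, required = {}, []
        for pname, param in inspect.signature(fn).parameters.items():
            properties[pname] = {"type": _JSON_TYPES.get(param.annotation, "string")}
            if param.default is inspect.Parameter.empty:
                required.append(pname)
        return {
            "name": name,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        }

    def call(self, name: str, arguments_json: str) -> str:
        """RPC-style dispatch: JSON arguments in, JSON-serialized result out."""
        kwargs = json.loads(arguments_json)
        return json.dumps(self._tools[name](**kwargs))


registry = MiniRegistry()


@registry.register
def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


result = registry.call("add", '{"a": 2, "b": 3}')
```

Because the caller only ever sees a name, a schema, and a serialized result, the same stub could in principle front a local function or a remote endpoint, which is the point of the RPC analogy.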

2507.07644 2026-05-26 cs.AI 版本更新

FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

FloorplanQA:使用结构化表示进行大语言模型空间推理的基准测试

Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) Miami University(迈阿密大学)

AI总结 提出FloorplanQA基准,通过结构化室内场景表示评估大语言模型在距离测量、可见性、路径查找和物体放置等空间推理任务上的表现,揭示模型在物理约束和空间一致性方面的盲点。

Comments ICML 2026, Project page: https://OldDeLorean.github.io/FloorplanQA/

详情
AI中文摘要

我们引入了FloorplanQA,一个用于评估大语言模型空间推理能力的诊断基准。FloorplanQA基于室内场景(例如厨房、客厅、卧室、浴室等)的结构化表示,这些场景以JSON或XML布局进行符号编码。该基准涵盖了核心空间任务,包括距离测量、可见性、路径查找以及在受限空间内的物体放置。我们在各种前沿开源和商业大语言模型上的实验结果表明,虽然模型可能在浅层查询上成功,但它们往往无法遵守物理约束或保持空间一致性,尽管它们对小的空间扰动大多保持鲁棒。FloorplanQA揭示了当前大语言模型的一个盲点:对室内布局的不一致推理。我们希望这个基准能激发新的工作,使语言模型能够在实际场景中准确推断和操作空间与几何属性。

英文摘要

We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, and bathrooms), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed on shallow queries, they often fail to respect physical constraints or preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.

2507.05890 2026-05-26 cs.CL cs.AI 版本更新

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

使用具有特质-反应中介的虚拟受访者进行心理测量项目验证

Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Communication, Seoul National University(首尔国立大学传播学系) Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能跨学科项目)

AI总结 提出一种利用LLM模拟虚拟受访者(通过中介因素)来高效验证心理测量项目效度的框架,实验证明该方法能有效识别高有效性项目。

Comments This paper has been accepted for publication at TACL 2026

详情
AI中文摘要

随着心理测量调查越来越多地用于评估大型语言模型(LLM)的特质,对适用于LLM的可扩展调查项目生成的需求也随之增长。这里的一个关键挑战是确保生成项目的构念效度,即它们是否真正测量了预期的特质。传统上,这需要昂贵的大规模人类数据收集。为了提高效率,我们提出了一个使用LLM进行虚拟受访者模拟的框架。我们的核心思想是考虑中介因素:通过它们,相同的特质可能对调查项目产生不同的反应。通过模拟具有不同中介因素的受访者,我们识别出那些在这些中介因素中与预期特质稳健相关的调查项目。在三种心理特质理论(大五人格、施瓦茨价值观、VIA性格优势)上的实验表明,我们的中介生成方法和模拟框架有效地识别了高有效性项目。LLM展示了从特质定义生成合理中介因素以及模拟受访者行为以进行项目验证的能力。我们的问题表述、指标、方法和数据集为成本效益高的调查开发以及更深入地理解LLM如何模拟人类调查反应开辟了新方向。我们发布数据集和代码以支持未来工作。

英文摘要

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that yield responses robustly correlated with intended traits across these mediators. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-efficient survey development and a deeper understanding of how LLMs simulate human survey responses. We release our dataset and code to support future work.

2506.19037 2026-05-26 cs.CL cs.AI cs.IT cs.LG cs.NE math.IT 版本更新

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

速度规划:用于掩码扩散语言模型的膨胀调度

Omer Luxembourg, Haim Permuter, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beersheba, Israel(电气与计算机工程学院,内盖夫本-古里安大学,贝尔谢巴,以色列)

AI总结 提出膨胀解掩码调度器(DUS),通过将序列位置划分为非相邻的膨胀组并并行解掩码,最小化联合熵增益上界,在不修改去噪器的情况下实现高达5.8倍加速。

Comments Accepted at ICML 2026

详情
AI中文摘要

掩码扩散语言模型(MDLM)承诺快速、非自回归的文本生成,然而现有的采样器根据模型置信度选择要解掩码的标记,忽略了并行解掩码多个位置时的交互,实际上退化为缓慢的自回归行为。我们提出了膨胀解掩码调度器(DUS),这是一种仅推理、无需规划模型的方法,它将序列位置划分为非相邻的膨胀组,并并行解掩码,以在每个去噪步骤中最小化联合熵增益的上界。通过明确权衡网络调用次数与生成质量,DUS恢复了传统并行解掩码策略下丢失的大部分性能。在数学(GSM8K, MATH500)、代码(HumanEval, MBPP)、通用知识(BBH, MMLU-Pro)和指令遵循(IFEval)基准测试中,DUS优于基于置信度的规划器,并将扩散特有的质量-速度权衡转化为由块大小$B$确定的确定性、可预测的加速,与逐标记MDLM解码相比,实现了高达5.8倍的墙钟加速,而无需修改底层去噪器。作为即插即用的后滤波器,膨胀间隔也改进了自适应采样器。代码可在https://github.com/omerlux/DUS获取。

英文摘要

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers. Code is available at https://github.com/omerlux/DUS.
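The partitioning idea can be illustrated with a minimal sketch. It assumes a fixed dilation inside one block, so it is not necessarily the schedule used by DUS; the function name `dilated_groups` and its parameters are hypothetical. Each group holds positions spaced `dilation` apart, so no two adjacent positions are unmasked in the same denoiser call.

```python
def dilated_groups(block_size: int, dilation: int) -> list[list[int]]:
    """Split positions 0..block_size-1 into groups of non-adjacent positions.

    Group g contains g, g + dilation, g + 2 * dilation, ...
    With dilation >= 2, neighbours never share a group.
    Illustrative assumption only, not the paper's exact scheduler.
    """
    if block_size <= 0 or dilation <= 0:
        raise ValueError("block_size and dilation must be positive")
    return [
        list(range(offset, block_size, dilation))
        for offset in range(min(dilation, block_size))
    ]


# One group is unmasked per denoiser call, so a block of 8 tokens with
# dilation 4 needs 4 calls instead of 8 token-by-token calls.
for step, group in enumerate(dilated_groups(8, 4)):
    print(f"call {step}: unmask positions {group}")
```

In this toy setting the number of network calls per block equals the number of groups, which is the quantity traded against generation quality.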

2506.18543 2026-05-26 cs.CR cs.AI 版本更新

SoK: A Comprehensive Security Analysis of Jailbreak Resilience in GPT and DeepSeek Models

SoK: GPT 和 DeepSeek 模型越狱鲁棒性的全面安全分析

Xiaodong Wu, Xiangman Li, Qi Li, Lingshuang Liu, Jianbing Ni

发表机构 * Queen’s University(女王大学) University of Waterloo(滑铁卢大学)

AI总结 通过 HarmBench 基准测试,对 DeepSeek 模型系列与 GPT-3.5、GPT-4 进行首次全面越狱分析,发现 DeepSeek 对优化驱动攻击有部分鲁棒性但易受提示工程攻击,而 GPT-4 Turbo 具有更一致的安全对齐,揭示了模型效率与对齐泛化之间的固有权衡。

详情
AI中文摘要

大型语言模型(LLM)的快速普及加剧了对其遭受越狱攻击的担忧,越狱攻击通过精心设计的对抗性输入来诱导不安全内容。尽管 GPT-4 等专有模型已被广泛评估,但新兴开源系统(如 DeepSeek)的鲁棒性仍未得到充分检验,尽管它们在 LLM 应用中的使用日益增长。在本文中,我们首次对 DeepSeek 模型系列进行了全面的越狱分析,通过 HarmBench 基准测试将其与 GPT-3.5 和 GPT-4 进行比较。我们研究了涵盖 510 种有害行为的七种代表性攻击方法,这些方法按功能和语义维度组织。结果表明,DeepSeek 对 TAP-T 等优化驱动攻击具有部分鲁棒性,但也导致其对基于提示和手动设计的对抗性输入更加敏感。相比之下,GPT-4 Turbo 在广泛行为中表现出更强大且一致的安全对齐,这可能是由于更强的安全优化和来自人类反馈的强化学习。此外,细粒度行为分析和案例研究表明,DeepSeek 往往无法一致地将安全约束应用于对抗性提示,导致拒绝行为不均匀。总体而言,我们的结果凸显了模型效率与对齐泛化之间的固有权衡,强调了针对性安全调优和稳健对齐策略对于确保开源 LLM 安全部署的重要性。

英文摘要

The rapid proliferation of Large Language Models (LLMs) has heightened concerns regarding their exposure to jailbreak attacks, which craft adversarial inputs designed to elicit unsafe content. Although proprietary models such as GPT-4 have been extensively evaluated, the robustness of emerging open-source systems like DeepSeek remains insufficiently examined, despite their growing use in LLM applications. In this paper, we conduct the first comprehensive jailbreak analysis of the DeepSeek model family, comparing it with GPT-3.5 and GPT-4 through the HarmBench benchmark. We investigate seven representative attack methods across 510 harmful behaviors, organized along both functional and semantic dimensions. Our findings indicate that DeepSeek is partially resilient to optimization-driven attacks such as TAP-T but more susceptible to prompt-based and manually engineered adversarial inputs. In contrast, GPT-4 Turbo demonstrates more robust and consistent safety alignment across a wide range of behaviors, likely due to stronger safety optimization and reinforcement learning from human feedback. In addition, fine-grained behavioral analysis and case studies reveal that DeepSeek often fails to consistently apply safety constraints to adversarial prompts, leading to uneven refusal behaviors. Overall, our results highlight an inherent trade-off between model efficiency and alignment generalization, underscoring the importance of targeted safety tuning and robust alignment strategies to ensure secure deployment of open-source LLMs.

2506.17629 2026-05-26 cs.CV cs.AI cs.CL 版本更新

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

CLiViS: 通过语言-视觉协同释放认知地图用于具身视觉推理

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) Fudan University(复旦大学)

AI总结 提出CLiViS框架,通过LLM进行高层任务规划并协调VLM驱动的开放世界视觉感知,构建动态认知地图以迭代更新场景上下文,实现无需训练的具身视觉推理。

详情
AI中文摘要

具身视觉推理(EVR)旨在基于自我中心视频遵循复杂、自由形式的指令,从而在动态环境中实现语义理解和时空推理。尽管具有潜力,EVR面临复杂指令多样性和长期自我中心视频中复杂时空动态的挑战。现有解决方案要么在静态视频描述上使用大型语言模型(LLM),这通常会遗漏关键视觉细节,要么依赖端到端视觉语言模型(VLM),后者在逐步组合推理上存在困难。考虑到LLM在推理和VLM在感知方面的互补优势,我们提出了CLiViS。这是一个新颖的无训练框架,利用LLM进行高层任务规划,并协调VLM驱动的开放世界视觉感知,以迭代更新场景上下文。基于这种协同,CLiViS的核心是一个动态认知地图,它在推理过程中不断演化。该地图构建了具身场景的结构化表示,连接了低层感知和高层推理。跨多个基准的大量实验证明了CLiViS的有效性和通用性,特别是在处理长期视觉依赖方面。代码可在 https://github.com/Teacher-Tom/CLiViS 获取。

英文摘要

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2506.11027 2026-05-26 cs.LG cs.AI cs.PL 版本更新

From Reasoning to Code: GRPO Optimization for Underrepresented Languages

从推理到代码:针对代表性不足语言的GRPO优化

Federico Pennino, Bianca Raimondi, Massimo Rondelli, Andrea Gurioli, Maurizio Gabbrielli

发表机构 * Qwen2.5-Coder

AI总结 提出结合Qwen2.5-Coder小模型与GRPO的强化学习方法,利用执行反馈和奖励机制提升Prolog、Lisp等低资源语言的代码生成准确性与推理质量。

Comments Accepted at ICLP 2026

详情
AI中文摘要

使用大型语言模型(LLM)生成准确且可执行的代码对于代表性不足的编程语言(如Prolog和Lisp)仍然是一个重大挑战,因为与Python等高资源语言相比,公共训练数据稀缺。本文介绍了一种可泛化的强化学习(RL)方法,将Qwen2.5-Coder模型的小规模版本与组相对策略优化(GRPO)相结合,通过推理实现有效的代码生成。为了解决稀疏数据集的局限性,我们将执行驱动的反馈直接集成到RL循环中,利用一个奖励系统,该系统同时利用逻辑正确性和结构格式。在GSM8K数据集上的实验结果表明,在代表性不足的语言中,推理质量和代码准确性有显著提升。这些发现强调了我们的方法通过利用符号推理和基于解释器的反馈,使缺乏广泛训练资源的多种编程语言受益的潜力。

英文摘要

Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages, such as Prolog and Lisp, due to the scarcity of public training data compared to high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To address the limitations of sparse datasets, we integrate execution-driven feedback directly into the RL loop, utilizing a reward system that exploits both logical correctness and structural formatting. Experimental results on the GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach to benefit a wide range of programming languages lacking extensive training resources by leveraging symbolic reasoning and interpreter-based feedback.
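A minimal sketch of the two ingredients named above: a reward mixing correctness with formatting, and the group-relative advantage used by GRPO, in which each sampled completion is scored against the mean and standard deviation of its own group. The reward weights and the function names are hypothetical and not taken from the paper. Using the population standard deviation is also an assumption, since implementations differ on this point.

```python
from statistics import mean, pstdev


def combined_reward(is_correct: bool, is_well_formatted: bool,
                    w_correct: float = 1.0, w_format: float = 0.2) -> float:
    """Mix execution correctness and structural formatting (weights are assumptions)."""
    return w_correct * float(is_correct) + w_format * float(is_well_formatted)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the statistics of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for the same prompt, judged by an interpreter run.
rewards = [
    combined_reward(True, True),
    combined_reward(False, True),
    combined_reward(False, False),
    combined_reward(True, False),
]
print(group_relative_advantages(rewards))
```

Because advantages are computed within the group, no separate value network is needed; completions that beat their siblings get positive advantages and the rest get negative ones.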

2506.06840 2026-05-26 stat.ML cs.AI cs.LG stat.AP stat.OT 版本更新

A Statistical Framework for Model Selection in LSTM Networks

LSTM网络中模型选择的统计框架

Fahad Mostafa

发表机构 * School of Mathematical and Natural Sciences, Arizona State University(数学与自然科学院,亚利桑那州立大学)

AI总结 针对LSTM网络模型选择依赖启发式且计算昂贵的问题,提出统一统计框架,通过扩展信息准则和收缩估计到序列神经网络,定义适应时间结构的惩罚似然、广义阈值方法处理隐状态动态,并利用变分贝叶斯和近似边际似然实现高效估计,在生物医学数据上验证了灵活性和性能提升。

详情
AI中文摘要

长短期记忆(LSTM)神经网络模型已成为从自然语言处理到时间序列预测等众多应用中序列数据建模的基石。尽管取得了成功,但模型选择问题,包括超参数调优、架构规范和正则化选择,仍然很大程度上是启发式的且计算昂贵。在本文中,我们提出了一个统一的统计框架,用于LSTM网络中的系统模型选择。我们的框架将经典的模型选择思想,如信息准则和收缩估计,扩展到序列神经网络。我们定义了适应时间结构的惩罚似然,提出了一个用于隐状态动态的广义阈值方法,并利用变分贝叶斯和近似边际似然方法提供了高效的估计策略。几个以生物医学数据为中心的示例展示了所提出框架的灵活性和改进的性能。

英文摘要

Long Short-Term Memory (LSTM) neural network models have become a cornerstone of sequential data modeling in numerous applications, ranging from natural language processing to time series forecasting. Despite their success, the problem of model selection, including hyperparameter tuning, architecture specification, and regularization choice, remains largely heuristic and computationally expensive. In this paper, we propose a unified statistical framework for systematic model selection in LSTM networks. Our framework extends classical model selection ideas, such as information criteria and shrinkage estimation, to sequential neural networks. We define penalized likelihoods adapted to temporal structures, propose a generalized threshold approach for hidden state dynamics, and provide efficient estimation strategies using variational Bayes and approximate marginal likelihood methods. Several biomedical data-centric examples demonstrate the flexibility and improved performance of the proposed framework.
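For orientation, the classical information criteria that the framework extends can be written as follows. These are the standard textbook definitions, not the paper's temporally adapted penalties, which the abstract does not spell out. Here $\hat{L}$ is the maximized likelihood, $k$ the number of free parameters, and $n$ the number of observations.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L}, \\
\mathrm{BIC} &= k\ln n - 2\ln\hat{L}.
\end{align}
```

Under both criteria a lower value is preferred, and BIC penalizes additional parameters more heavily than AIC once $\ln n > 2$.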

2506.06454 2026-05-26 cs.LG cs.AI stat.ML 版本更新

LETS Forecast: Learning Embedology for Time Series Forecasting

LETS Forecast:用于时间序列预测的嵌入学

Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li

发表机构 * Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison(生物统计学与医学信息学系,威斯康星大学麦迪逊分校) Department of Computer Sciences, University of Wisconsin-Madison(计算机科学系,威斯康星大学麦迪逊分校)

AI总结 提出DeepEDM框架,结合非线性动力系统建模与深度学习,通过延迟嵌入和核回归学习潜在动态,实现高精度时间序列预测。

Comments Accepted at International Conference on Machine Learning (ICML) 2025

详情
AI中文摘要

现实世界的时间序列通常受复杂的非线性动力学支配。理解这些潜在动力学对于精确的未来预测至关重要。虽然深度学习在时间序列预测中取得了重大成功,但许多现有方法并未显式建模动力学。为弥补这一差距,我们引入了DeepEDM,一个将非线性动力系统建模与深度神经网络相结合的框架。受经验动态建模(EDM)启发并基于Takens定理,DeepEDM提出了一种新颖的深度模型,该模型从时间延迟嵌入中学习潜在空间,并使用核回归来逼近潜在动力学,同时利用softmax注意力的高效实现,允许对未来时间步进行准确预测。为了评估我们的方法,我们在非线性动力系统的合成数据以及跨领域的真实世界时间序列上进行了全面实验。结果表明,DeepEDM对输入噪声具有鲁棒性,并在预测准确性上优于最先进的方法。我们的代码可在以下网址获取:https://abrarmajeedi.github.io/deep_edm。

英文摘要

Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings and employs kernel regression to approximate the underlying dynamics, while leveraging an efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.
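The two building blocks mentioned here, time-delay embedding and kernel regression with softmax weights, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the DeepEDM architecture: there is no learned latent space, and the names `delay_embed`, `softmax_kernel_predict`, and `temperature` are hypothetical.

```python
import math


def delay_embed(series: list[float], dim: int, tau: int = 1) -> list[list[float]]:
    """Map x_t to the delay vector (x_t, x_{t-tau}, ..., x_{t-(dim-1)*tau})."""
    start = (dim - 1) * tau
    return [
        [series[t - k * tau] for k in range(dim)]
        for t in range(start, len(series))
    ]


def softmax_kernel_predict(query: list[float], keys: list[list[float]],
                           values: list[float], temperature: float = 1.0) -> float:
    """Kernel regression: softmax over negative squared distances weights the values."""
    scores = [
        -sum((q - k) ** 2 for q, k in zip(query, key)) / temperature
        for key in keys
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


series = [math.sin(0.3 * t) for t in range(60)]
vectors = delay_embed(series, dim=3)
keys, values = vectors[:-1], [v[0] for v in vectors[1:]]  # next-step targets
print(softmax_kernel_predict(vectors[-1], keys, values, temperature=0.01))
```

The softmax weighting has the same form as attention over past states, which is why an attention kernel can serve as the regression step.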

2506.01982 2026-05-26 cs.HC cs.AI 版本更新

Music Interpretation and Emotion Perception: A Computational and Neurophysiological Investigation

音乐诠释与情感感知:计算与神经生理学调查

Vassilis Lyberatos, Spyridon Kantarelis, Ioanna Zioga, Christina Anagnostopoulou, Giorgos Stamou, Anastasia Georgaki

发表机构 * School of Electrical and Computer Engineering, National Technical University of Athens(电气与计算机工程学院,雅典国立技术大学) Department of Music Studies, National and Kapodistrian University of Athens(音乐研究系,雅典国立和卡波迪斯特里安大学)

AI总结 本研究利用计算和神经生理学方法,探究不同演奏情境(如曲目、调式练习曲和即兴演奏)及表现力水平对表演者情感传达和听众反应的影响,发现表现力和即兴演奏具有独特声学特征并引发更强情感反应,且即兴演奏带来更大的神经生理放松。

Comments Accepted at SMC 2025

详情
AI中文摘要

本研究采用计算和神经生理学方法,调查音乐表演中的情感表达与感知。探讨了不同演奏情境(如曲目、调式练习曲和即兴演奏)以及表现力水平对表演者情感传达和听众反应的影响。专业音乐家执行了多种任务,并由表演者和听众提供情感标注。音频分析显示,表现力和即兴演奏表现出独特的声学特征,而情感分析则显示出更强的情感反应。神经生理测量表明,即兴演奏中表现出更大的放松。这项多模态研究强调了表现力在增强情感传达和观众参与中的重要性。

英文摘要

This study investigates emotional expression and perception in music performance using computational and neurophysiological methods. The influence of different performance settings, such as repertoire, diatonic modal etudes, and improvisation, as well as levels of expressiveness, on performers' emotional communication and listeners' reactions is explored. Professional musicians performed various tasks, and emotional annotations were provided by both performers and the audience. Audio analysis revealed that expressive and improvisational performances exhibited unique acoustic features, while emotion analysis showed stronger emotional responses. Neurophysiological measurements indicated greater relaxation in improvisational performances. This multimodal study highlights the significance of expressivity in enhancing emotional communication and audience engagement.

2505.18190 2026-05-26 eess.SP cs.AI cs.LG 版本更新

PhySense: Sensor Placement Optimization for Accurate Physics Sensing

PhySense:面向精确物理感知的传感器布局优化

Yuezhou Ma, Haixu Wu, Hang Zhou, Huikun Weng, Jianmin Wang, Mingsheng Long

发表机构 * School of Software, BNRist, Tsinghua University(软件学院,BNRist,清华大学)

AI总结 提出PhySense两阶段框架,通过流生成模型和投影梯度下降联合优化传感器布局与物理场重建,实现高精度物理感知。

详情
AI中文摘要

物理感知在许多科学和工程领域中扮演着核心角色,它固有地涉及两个耦合的任务:从稀疏观测中重建密集物理场,以及优化分散的传感器布局以观测最大信息。虽然深度学习在稀疏数据重建方面取得了快速进展,但现有方法通常忽略传感器布局的优化,将重建与布局之间的相互增强束之高阁。为了改变这种次优实践,我们提出了PhySense,一个协同的两阶段框架,学习联合重建物理场和优化传感器布局,两者都旨在实现精确的物理感知。第一阶段涉及一个基于流的生成模型,通过交叉注意力增强以自适应地融合稀疏观测。利用重建反馈,第二阶段通过投影梯度下降执行传感器布局以满足空间约束。我们进一步证明两个阶段的学习目标与经典方差最小化原则一致,提供了理论保证。在三个具有挑战性的基准测试(特别是3D几何数据集)上的大量实验表明,PhySense实现了最先进的物理感知精度,并发现了以前未考虑的信息丰富的传感器布局。代码可在以下仓库获取:https://github.com/thuml/PhySense。

英文摘要

Physics sensing plays a central role in many scientific and engineering domains and inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement unexploited. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate that PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered. Code is available at this repository: https://github.com/thuml/PhySense.
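Projected gradient descent, the optimizer named for the second stage, alternates a gradient step with a projection back onto the feasible set. The sketch below assumes a simple box constraint and a toy quadratic objective. PhySense's actual objective comes from reconstruction feedback and its spatial constraints may differ, so all names here are hypothetical.

```python
from typing import Callable


def project_to_box(point: list[float], low: float = 0.0, high: float = 1.0) -> list[float]:
    """Clamp each coordinate so the sensor stays inside the allowed region."""
    return [min(max(c, low), high) for c in point]


def projected_gradient_descent(point: list[float],
                               grad_fn: Callable[[list[float]], list[float]],
                               lr: float = 0.1, steps: int = 100) -> list[float]:
    """Take a gradient step, then project back onto the feasible box."""
    for _ in range(steps):
        grad = grad_fn(point)
        point = project_to_box([c - lr * g for c, g in zip(point, grad)])
    return point


# Toy objective: squared distance to a target that lies partly outside the box.
target = [1.5, 0.25]
grad = lambda p: [2.0 * (c - t) for c, t in zip(p, target)]
print(projected_gradient_descent([0.5, 0.5], grad))
```

The first coordinate ends on the boundary of the box because the unconstrained optimum is infeasible, while the second converges to its interior optimum.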

2505.11758 2026-05-26 cs.CV cs.AI cs.GR cs.RO 版本更新

Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

具有预测性提示和负学习的可泛化视觉语言少样本适应

Sriram Mandalika

发表机构 * Hasso Plattner Institute, University of Potsdam(哈索·普拉特纳研究所,波茨坦大学)

AI总结 提出SCAN框架,通过查询自适应负路由、LLM引导对比提示和自适应融合权重,解决视觉语言模型少样本适应中负类信号处理问题,在11个基准上平均提升4.61%。

详情
AI中文摘要

视觉语言模型的少样本适应在推理时如何处理负类信号方面仍然存在根本性限制。现有方法对所有查询应用统一的负抑制,忽略了最具破坏性的混淆是查询特定的,并且随支持集几何形状而变化。我们提出SCAN(选择性混淆感知负样本),一个通过三个针对性贡献解决这一问题的框架。在推理中,查询自适应负路由将抑制限制在每个查询最易混淆的前K个类别,无需额外参数。通用负文本模板被替换为LLM引导的对比提示,描述易混淆类别对之间的区分属性,在关键处锐化文本决策边界。基于支持集Fisher可判别性估计的无参数自适应融合权重消除了手动调整视觉语言权衡的需要。在11个标准基准上评估,SCAN在16-shot设置下平均优于先前的基于提示和基于适配器的方法4.61%,在类间混淆最严重的细粒度数据集上提升高达7.70%。SCAN在分布偏移下也表现出强泛化性,在四个ImageNet OOD变体上平均提升2.95%,并在显著标签噪声下保持稳健性能,在50%标签损坏下的准确率仍超过最强竞争方法的干净基线。

英文摘要

Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the most damaging confusions are query-specific and shift with support-set geometry. We introduce SCAN (Selective Confusion-Aware Negatives), a framework that addresses this gap through three targeted contributions. At inference, query-adaptive negative routing restricts suppression to the top-K most confusable classes per query, requiring zero additional parameters. Generic negative text templates are replaced with LLM-bootstrapped contrastive prompts that describe discriminative attributes between confusable class pairs, sharpening the textual decision boundary where it matters most. A parameter-free adaptive fusion weight estimated from support-set Fisher discriminability removes the need for manual tuning of the vision-language trade-off. Evaluated across 11 standard benchmarks, SCAN consistently outperforms prior prompt-based and adapter-based methods by an average of 4.61% at 16-shot, with gains of up to 7.70% on fine-grained datasets where inter-class confusion is most severe. SCAN also generalizes strongly under distribution shift, improving by 2.95% on average across four ImageNet OOD variants, and maintains robust performance under significant label noise, with accuracy under 50% label corruption still exceeding the clean baseline of the strongest competing method.

2505.08155 2026-05-26 cs.AI 版本更新

Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering

高效且可扩展的神经符号搜索用于知识图谱复杂查询回答

Weizhi Fei, Zihao Wang, Hang Yin, Shukai Zhao, Wei Zhang, Yangqiu Song

发表机构 * Department of Mathematical Sciences, Tsinghua University(清华大学数学科学系) Department of Computer Science and Engineering, Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Computer Sciences, University of Rochester(罗切斯特大学计算机科学系)

AI总结 提出一种结合约束策略和局部搜索的神经符号方法,以降低数据复杂度和近似解决NP难的循环查询,实现高效可扩展的复杂查询回答。

详情
AI中文摘要

复杂查询回答(CQA)是知识图谱(KG)上的一项关键推理任务,旨在从不完整的KG中回答一阶逻辑查询。现有的神经符号方法虽然取得了强劲的性能,但面临显著的复杂度瓶颈:数据复杂度随实体数量呈二次增长,且循环查询的查询复杂度为NP难。因此,这些方法难以有效扩展到大型知识图谱和复杂查询。为解决这些限制,我们提出了一种高效且可扩展的符号搜索方法,包含两个关键组件:(1)约束策略,大幅减少变量搜索域,降低数据复杂度;(2)局部搜索算法,近似解决NP难的循环查询。在各种CQA基准上的实验表明,对于树形查询,我们的方法仅使用10%的搜索空间即可达到97%的相对MRR,并实现10倍的加速。此外,该方法在复杂循环查询和大规模KG上展现出稳健的性能,有效缓解了效率和可扩展性挑战。我们的代码见https://github.com/HKUST-KnowComp/NLISA_KDD2026。

英文摘要

Complex Query Answering (CQA) is a crucial reasoning task over Knowledge Graphs (KGs), which aims to answer first-order logical queries from incomplete KGs. While existing neural-symbolic methods achieve strong performance, they face significant complexity bottlenecks: quadratic data complexity scaling with the number of entities, and NP-hard query complexity for cyclic queries. Consequently, these approaches struggle to scale effectively to large knowledge graphs and complex queries. To address these limitations, we propose an efficient and scalable symbolic search method comprising two key components: (1) constraint strategies that drastically reduce the variable search domain, lowering data complexity; and (2) a local search algorithm that approximately solves NP-hard cyclic queries. Experiments on various CQA benchmarks demonstrate that, for tree-form queries, our method achieves 97% relative MRR with a 10$\times$ speedup using only 10% of the search space. Furthermore, it demonstrates robust performance on complex cyclic queries and large-scale KGs, effectively alleviating efficiency and scalability challenges. Our code is available at https://github.com/HKUST-KnowComp/NLISA_KDD2026.

2505.05880 2026-05-26 cs.AI cs.LG 版本更新

Combining Abstract Argumentation and Machine Learning for Efficiently Analyzing Low-Level Process Event Streams

结合抽象论证与机器学习高效分析低层过程事件流

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Luigi Pontieri, Francesco Scala

发表机构 * University of Calabria(卡拉布里亚大学) CNR(意大利国家研究委员会)

AI总结 提出一种数据高效的神经符号方法,通过抽象论证框架(AAF)优化序列标注模型生成的候选事件解释,以解决低层过程事件流中事件到活动映射的不确定性问题。

详情
AI中文摘要

监控和分析过程轨迹是现代公司和组织的一项关键任务。在轨迹事件与参考业务活动之间存在差距的场景中,这涉及一个解释问题,即将任何正在进行的轨迹的每个事件转换为活动实例的相应步骤。基于最近将解释问题框架化为抽象论证框架(AAF)内的接受问题的方法,可以优雅地分析可能的(可能以聚合形式)事件解释,并为那些与先验过程知识冲突的解释提供解释。由于在事件到活动映射高度不确定(或简单地说未充分指定)的环境中,这种基于推理的方法可能产生低信息量的结果和繁重的计算,因此可以考虑发现一个序列标注模型,该模型经过训练以上下文感知的方式建议高概率的候选事件解释。然而,最优地训练这样的模型可能需要使用大量手动注释的示例轨迹。因此,我们提出了一种数据高效的神经符号方法,其中由示例驱动的序列标注器返回的候选解释由基于AAF的推理器进行细化。这使我们能够利用先验知识来补偿示例数据的稀缺性,实验结果证实了这一点。

英文摘要

Monitoring and analyzing process traces is a critical task for modern companies and organizations. In scenarios where there is a gap between trace events and reference business activities, this entails an interpretation problem, amounting to translating each event of any ongoing trace into the corresponding step of the activity instance. Building on a recent approach that frames the interpretation problem as an acceptance problem within an Abstract Argumentation Framework (AAF), one can elegantly analyze plausible event interpretations (possibly in an aggregated form), as well as offer explanations for those that conflict with prior process knowledge. Since this reasoning-based approach may yield uninformative results and heavy computation in settings where the event-to-activity mapping is highly uncertain (or simply under-specified), one can think of discovering a sequence-tagging model trained to suggest highly probable candidate event interpretations in a context-aware way. However, training such a model optimally may require a large number of manually annotated example traces. We therefore propose a data-efficient neuro-symbolic approach to the problem, where the candidate interpretations returned by the example-driven sequence tagger are refined by the AAF-based reasoner. This allows us to also leverage prior knowledge to compensate for the scarcity of example data, as confirmed by experimental results.

2502.10906 2026-05-26 cs.AI 版本更新

PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

PCGRLLM:面向程序化内容生成强化学习的大语言模型驱动奖励设计

In-Chang Baek, Sung-Hyun Kim, Sam Earle, Zehua Jiang, Jin-Ha Noh, Julian Togelius, Kyung-Joong Kim

发表机构 * Gwangju Institute of Science and Technology(光州科学技术院) New York University(纽约大学)

AI总结 提出PCGRLLM架构,利用大语言模型和反馈机制生成奖励函数,在二维环境中实现故事到奖励的生成,性能接近人类水平。

Comments 14 pages, 8 figures, Accepted to Transactions on Games

详情
AI中文摘要

奖励设计在游戏AI训练中起着关键作用,需要大量领域知识和人力。近年来,一些研究探索了使用大语言模型(LLM)生成奖励函数来训练游戏代理和控制机器人。在内容生成文献中,已有早期工作为强化学习代理生成器生成奖励函数。本文介绍了PCGRLLM,一种基于早期工作的扩展架构,采用了反馈机制和几种基于推理的提示工程技术。我们在二维环境中的故事到奖励生成任务上,使用两种最先进的LLM和各种基于推理的提示方法评估了所提出的方法。我们的实验提供了富有洞察力的评估,展示了LLM在内容生成任务中不可或缺的能力。结果表明,与之前的结构相比,性能有了显著提升,达到了与人类相当的性能。我们的工作展示了在游戏AI开发中减少人类依赖的潜力,同时支持和增强创造性过程。

英文摘要

Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and controlling robots using large language models (LLMs). In the content generation literature, there has been early work on generating reward functions for reinforcement learning agent generators. This work introduces PCGRLLM, an extended architecture based on earlier work, which employs a feedback mechanism and several reasoning-based prompt engineering techniques. We evaluate the proposed method on a story-to-reward generation task in a two-dimensional environment using two state-of-the-art LLMs across various reasoning-based prompting methods. Our experiments provide insightful evaluations that demonstrate the capabilities of LLMs essential for content generation tasks. The results demonstrate a substantial performance improvement over the previous structure, achieving performance comparable to that of humans. Our work demonstrates the potential to reduce human dependency in game AI development, while supporting and enhancing creative processes.

2502.10311 2026-05-26 cs.LG cs.AI cs.HC 版本更新

ExplainReduce: Generating global explanations from many local explanations

ExplainReduce: 从许多局部解释生成全局解释

Lauri Seppäläinen, Mudong Guo, Kai Puolamäki

发表机构 * University of Helsinki(赫尔辛基大学)

AI总结 本文提出 ExplainReduce 方法,通过贪心启发式算法将大量局部解释缩减为少量简单模型,作为生成式全局解释,并证明其有效性和竞争力。

Comments 21 pages with a 36 page appendix, 8 + 39 figures, 1+1 tables. The datasets and source code used in the paper are available at https://github.com/edahelsinki/explainreduce. Accepted for publication in the 4th World Conference on eXplainable Artificial Intelligence (2026)

详情
AI中文摘要

最常用的非线性机器学习方法是黑箱模型,人类无法解释。可解释人工智能(XAI)领域旨在开发工具来检查这些黑箱的内部工作原理。一种常用的模型无关的 XAI 方法涉及使用简单模型作为局部近似来产生所谓的局部解释;这种方法的例子包括 LIME、SHAP 和 SLISEMAP。本文展示了如何将大量局部解释缩减为少量简单模型的“代理集”,这些模型可以作为生成式全局解释。这种缩减过程 ExplainReduce 可以表述为一个优化问题,并使用贪心启发式算法高效近似。我们表明,对于许多问题,少至五个解释就能忠实地模拟黑箱模型,并且我们的缩减过程与其他模型聚合方法相比具有竞争力。

英文摘要

Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small "proxy set" of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics. We show that, for many problems, as few as five explanations can faithfully emulate the closed-box model and that our reduction procedure is competitive with other model aggregation methods.
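A greedy reduction of this kind can be sketched as a maximum-coverage heuristic. The abstract only states that the optimisation problem is approximated greedily, so the coverage objective, the tolerance `eps`, and the name `greedy_proxy_set` are assumptions made for illustration. A local model "covers" an item when its loss on that item is at most `eps`.

```python
def greedy_proxy_set(loss: list[list[float]], k: int, eps: float) -> tuple[list[int], set[int]]:
    """Pick up to k local models that together cover the most data items.

    loss[i][j] is the loss of local model i on data item j.
    Returns the chosen model indices and the set of covered item indices.
    """
    n_items = len(loss[0])
    covers = [{j for j in range(n_items) if row[j] <= eps} for row in loss]
    chosen: list[int] = []
    covered: set[int] = set()
    for _ in range(min(k, len(loss))):
        candidates = [i for i in range(len(loss)) if i not in chosen]
        best = max(candidates, key=lambda i: len(covers[i] - covered))
        if not covers[best] - covered:
            break  # no remaining model adds coverage
        chosen.append(best)
        covered |= covers[best]
    return chosen, covered


# Four local explanations evaluated on four data items.
loss = [
    [0.1, 0.1, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.9],
    [0.1, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.1],
]
print(greedy_proxy_set(loss, k=2, eps=0.2))
```

Each round adds the model with the largest marginal gain in coverage, which is the usual greedy strategy for objectives of this shape.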

2502.01397 2026-05-26 cs.LG cs.AI cs.NA math.NA 版本更新

Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

消息传递GNN无法近似稀疏三角分解

Vladislav Trifonov, Ekaterina Muravleva, Ivan Oseledets

发表机构 * AIC, Skoltech(斯科尔科沃科学技术研究院人工智能中心) Skoltech AI4S Center(斯科尔科沃科学技术研究院AI4S中心) Sberbank of Russia(俄罗斯储蓄银行) AIRI

AI总结 本文通过理论和实验证明,消息传递图神经网络在逼近稀疏三角分解时存在根本性局限,需要超越消息传递的架构创新。

Comments Camera-ready version published in Transactions on Machine Learning Research

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

图神经网络(GNN)已被提议作为学习稀疏矩阵预条件子的工具,预条件子是加速线性求解器的关键组件。我们提出理论和实验证据表明,对于存在高质量预条件子但需要非局部依赖的矩阵类别,消息传递GNN从根本上无法近似稀疏三角分解。为了说明这一点,我们使用合成矩阵和SuiteSparse集合中的真实示例构建了一组基线。在包括图注意力网络和图变换器在内的多种GNN架构中,我们观察到预测因子与参考因子之间的余弦相似度较低(关键情况下≤0.7)。我们的理论和实验结果表明,需要超越消息传递的架构创新才能将GNN应用于矩阵分解等科学计算任务。此外,实验表明仅克服非局部性是不够的。需要定制的架构来捕获所需的依赖关系,因为即使是完全非局部的全局图变换器也无法匹配所提出的基线。

英文摘要

Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerating linear solvers. We present theoretical and empirical evidence that message-passing GNNs are fundamentally incapable of approximating sparse triangular factorizations for classes of matrices for which high-quality preconditioners exist but require non-local dependencies. To illustrate this, we construct a set of baselines using both synthetic matrices and real-world examples from the SuiteSparse collection. Across a range of GNN architectures, including Graph Attention Networks and Graph Transformers, we observe low cosine similarity ($\leq0.7$ in key cases) between predicted and reference factors. Our theoretical and empirical results suggest that architectural innovations beyond message-passing are necessary for applying GNNs to scientific computing tasks such as matrix factorization. Moreover, experiments demonstrate that overcoming non-locality alone is insufficient. Tailored architectures are necessary to capture the required dependencies since even a completely non-local Global Graph Transformer fails to match the proposed baselines.

2502.01184 2026-05-26 cs.LG cs.AI physics.chem-ph q-bio.QM 版本更新

FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

FragmentNet: 自适应图分片用于图到序列分子表示学习

Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas

发表机构 * Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada(电气与计算机工程系,多伦多大学,多伦多,加拿大) Regenerative Biomaterials Laboratory, Stanford Cardiovascular Institute, Palo Alto, USA(再生生物材料实验室,斯坦福心血管研究所,帕洛阿尔托,美国)

AI总结 提出FragmentNet,通过自适应学习的分词器将分子图分解为化学有效的片段,并利用化学感知的空间位置编码保持分子拓扑,在片段级别进行掩码预训练,在多个属性预测任务上提升了性能。

Comments 22 pages, 13 figures, 5 tables

详情
AI中文摘要

分子表示学习方法通常将分子标记为单个原子或使用刚性、基于规则的分片分解,限制了它们捕捉有意义化学子结构上下文的能力。我们引入了FragmentNet,一种围绕新颖的自适应学习分词器构建的图到序列模型,该分词器将分子图分解为可调整粒度的化学有效片段,并辅以化学感知的空间位置编码,在生成的序列中保留分子拓扑。将自然语言处理中的掩码预训练策略扩展到分子领域,我们在化学有意义的片段级别而非单个原子级别对分子进行掩码和重建。在多个属性预测基准上的评估发现,在片段粒度上进行预训练在大多数任务上提高了下游性能,表明标记化粒度是分子表示学习的重要设计选择。

英文摘要

Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions, limiting their ability to capture meaningful chemical substructure context. We introduce FragmentNet, a graph-to-sequence model built around a novel adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments of adjustable granularity, complemented by chemically aware spatial positional encodings that preserve molecular topology in the resulting sequence. Extending masked pre-training strategies from natural language processing to the molecular domain, we mask and reconstruct molecules at the level of chemically meaningful fragments rather than individual atoms. Evaluating across multiple property prediction benchmarks, we find that pre-training at fragment granularity leads to improved downstream performance on the majority of tasks, demonstrating that tokenization granularity is an important design choice for molecular representation learning.

2303.07863 2026-05-26 cs.CV cs.AI cs.MM 版本更新

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

你可以比看见更早定位:一种用于压缩视频中时序句子定位的高效流程

Xiang Fang, Daizong Liu, Pan Zhou, Guoshun Nan

发表机构 * The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(大数据安全湖北工程研究中心,网络安全科学与工程学院,华中科技大学) Peking University(北京大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出一种三分支压缩域时空融合框架(TCSF),直接从压缩视频中提取I帧、运动向量和残差特征,实现高效准确的时序句子定位。

Comments Accepted by CVPR 2023

详情
AI中文摘要

给定一个未剪辑视频,时序句子定位(TSG)旨在根据句子查询语义上定位目标时刻。尽管先前的工作取得了不错的成功,但它们仅关注从连续解码帧中提取的高级视觉特征,未能处理压缩视频的查询建模,导致训练和测试期间表示能力不足且计算复杂度高。本文提出了一种新的设置——压缩域TSG,直接利用压缩视频而非完全解压的帧作为视觉输入。为了处理原始视频比特流输入,我们提出了一种新颖的三分支压缩域时空融合(TCSF)框架,该框架提取并聚合三种低级视觉特征(I帧、运动向量和残差特征)以实现高效准确的定位。特别地,不像先前工作那样编码整个解码帧,我们仅通过学习I帧特征来捕获外观表示,以减少延迟。此外,我们不仅通过学习运动向量特征来探索运动信息,还通过残差特征探索相邻帧的关系。通过这种方式,进一步设计了一个带有自适应运动-外观融合模块的三分支时空注意力层,以提取和聚合外观和运动信息用于最终定位。在三个具有挑战性的数据集上的实验表明,我们的TCSF以更低的复杂度实现了比现有最先进方法更好的性能。

英文摘要

Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous works have achieved decent success, they only focus on high-level visual features extracted from the consecutive decoded frames and fail to handle the compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. Particularly, instead of encoding the whole decoded frames like previous works, we capture the appearance representation by only learning the I-frame feature to reduce latency. Besides, we explore the motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. Experiments on three challenging datasets show that our TCSF achieves better performance than other state-of-the-art methods with lower complexity.

2209.11572 2026-05-26 cs.CV cs.AI cs.IR cs.MM 版本更新

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

多模态跨域对齐网络用于视频时刻检索

Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

发表机构 * Hubei Key Laboratory of Distributed System Security(湖北分布式系统安全重点实验室) Hubei Engineering Research Center on Big Data Security(湖北大数据安全工程研究中心) School of Cyber Science and Engineering(网络安全学院) Huazhong University of Science and Technology(华中科技大学) Wangxuan Institute of Computer Technology(王选计算机研究所) Peking University(北京大学) School of Computer Science and Technology(计算机科学与技术学院) Key Laboratory of Information Storage System Ministry of Education of China(信息存储系统教育部重点实验室)

AI总结 提出多模态跨域对齐网络,通过域对齐、跨模态对齐和特定对齐三个模块,解决跨域视频时刻检索中域差异和语义鸿沟问题。

Comments Accepted by IEEE Transactions on Multimedia

详情
AI中文摘要

作为多媒体信息检索中日益流行的任务,视频时刻检索(VMR)旨在根据给定的语言查询从未修剪的视频中定位目标时刻。大多数先前的方法严重依赖于大量手动标注(即时刻边界),这在实践中获取成本极高。此外,由于不同数据集之间的域差异,直接将预训练模型应用于未见过的域会导致性能显著下降。本文聚焦于一项新任务:跨域VMR,其中在一个域(“源域”)中有完全标注的数据集,但目标域(“目标域”)仅包含未标注的数据集。据我们所知,我们提出了关于跨域VMR的首项研究。为了解决这一新任务,我们提出了一种新颖的多模态跨域对齐(MMCDA)网络,将标注知识从源域迁移到目标域。然而,由于源域和目标域之间的域差异以及视频和查询之间的语义鸿沟,直接将训练好的模型应用于目标域通常会导致性能下降。为解决此问题,我们开发了三个新颖的模块:(i)域对齐模块,用于对齐每个模态在不同域之间的特征分布;(ii)跨模态对齐模块,旨在将视频和查询特征映射到联合嵌入空间,并对齐目标域中不同模态之间的特征分布;(iii)特定对齐模块,试图获取特定帧与给定查询之间的细粒度相似性以实现最优定位。通过联合训练这三个模块,我们的MMCDA能够学习域不变且语义对齐的跨模态表示。

英文摘要

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain ("source domain"), but the domain of interest ("target domain") only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.

2011.10396 2026-05-26 cs.LG cs.AI 版本更新

Double Self-weighted Multi-view Clustering via Adaptive View Fusion

双自加权多视图聚类:通过自适应视图融合

Xiang Fang, Yuchong Hu

发表机构 * School of Computer Science and Technology, Key Laboratory of Information Storage System Ministry of Education of China, Huazhong University of Science and Technology(计算机科学与技术学院,信息存储系统教育部重点实验室,华中科技大学)

AI总结 提出双自加权多视图聚类框架(DSMC),通过自适应权重矩阵和权重因子分别对特征和图进行加权,去除冗余和噪声,并融合多图进行聚类。

Comments Corresponding author: Xiang Fang

详情
AI中文摘要

多视图聚类已应用于许多实际应用中,其中原始数据通常包含噪声。一些基于图的多视图聚类方法被提出来试图减少噪声的负面影响。然而,以往的基于图的多视图聚类方法即使存在冗余特征或噪声,也平等对待所有特征,这显然是不合理的。在本文中,我们提出了一种新颖的多视图聚类框架——双自加权多视图聚类(DSMC)来克服上述缺陷。DSMC执行双自加权操作,从每个图中去除冗余特征和噪声,从而获得鲁棒的图。对于第一次自加权操作,它通过引入自适应权重矩阵为不同特征分配不同的权重,这可以增强重要特征在联合表示中的作用,并使每个图鲁棒。对于第二次自加权操作,它通过施加自适应权重因子对不同图进行加权,这可以为更鲁棒的图分配更大的权重。此外,通过设计自适应多图融合,我们可以融合不同图中的特征,以整合这些图进行聚类。在六个真实世界数据集上的实验证明了其相对于其他最先进的多视图聚类方法的优势。

英文摘要

Multi-view clustering has been applied in many real-world applications where original data often contain noise. Some graph-based multi-view clustering methods have been proposed to reduce the negative influence of noise. However, previous graph-based multi-view clustering methods treat all features equally even if there are redundant features or noise, which is obviously unreasonable. In this paper, we propose a novel multi-view clustering framework, Double Self-weighted Multi-view Clustering (DSMC), to overcome the aforementioned deficiency. DSMC performs double self-weighted operations to remove redundant features and noise from each graph, thereby obtaining robust graphs. For the first self-weighted operation, it assigns different weights to different features by introducing an adaptive weight matrix, which can reinforce the role of the important features in the joint representation and make each graph robust. For the second self-weighted operation, it weights different graphs by imposing an adaptive weight factor, which can assign larger weights to more robust graphs. Furthermore, by designing an adaptive multi-graph fusion, we can fuse the features in the different graphs to integrate these graphs for clustering. Experiments on six real-world datasets demonstrate its advantages over other state-of-the-art multi-view clustering methods.

2011.10254 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Unbalanced Incomplete Multi-view Clustering via the Scheme of View Evolution: Weak Views are Meat; Strong Views do Eat

通过视图演化方案的不平衡不完整多视图聚类:弱视图为肉,强视图食之

Xiang Fang, Yuchong Hu, Pan Zhou, Dapeng Oliver Wu

发表机构 * School of Computer Science and Technology, Key Laboratory of Information Storage System Ministry of Education of China, Huazhong University of Science and Technology(计算机科学与技术学院,信息存储系统教育部重点实验室,华中科技大学) Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(湖北大数据安全工程研究中心,网络安全学院,华中科技大学) Department of Electrical and Computer Engineering, University of Florida(电气与计算机工程系,佛罗里达大学)

AI总结 针对不同视图不完整程度不平衡的问题,受生物进化理论启发,提出基于视图演化的不平衡不完整多视图聚类方法UIMC,通过加权多视图子空间聚类和低秩鲁棒表示恢复数据,显著提升聚类性能。

Comments Accepted by IEEE Transactions on Emerging Topics in Computational Intelligence

详情
Journal ref
IEEE Transactions on Emerging Topics in Computational Intelligence 2021
AI中文摘要

不完整多视图聚类是处理现实世界中不完整多视图数据的重要技术。以往的工作假设所有视图具有相同的不完整性,即平衡不完整性。然而,不同的视图往往具有不同的不完整性,即不平衡不完整性,这导致了强视图(低不完整性视图)和弱视图(高不完整性视图)。不平衡不完整性阻止我们直接使用先前的方法进行聚类。在本文中,受有效生物进化理论的启发,我们设计了新颖的视图演化方案来聚类强视图和弱视图。此外,我们提出了一种不平衡不完整多视图聚类方法(UIMC),这是第一个基于视图演化的有效方法,用于不平衡不完整多视图聚类。与先前的方法相比,UIMC有两个独特的优势:1)它提出了加权多视图子空间聚类来整合这些不平衡不完整的视图,有效解决了不平衡不完整多视图问题;2)它设计了低秩和鲁棒表示来恢复数据,减少了不完整性和噪声的影响。大量的实验结果表明,UIMC在三个评估指标上相比其他最先进的方法将聚类性能提高了高达40%。

英文摘要

Incomplete multi-view clustering is an important technique to deal with real-world incomplete multi-view data. Previous works assume that all views have the same incompleteness, i.e., balanced incompleteness. However, different views often have distinct incompleteness, i.e., unbalanced incompleteness, which results in strong views (low-incompleteness views) and weak views (high-incompleteness views). The unbalanced incompleteness prevents us from directly using the previous methods for clustering. In this paper, inspired by the effective biological evolution theory, we design the novel scheme of view evolution to cluster strong and weak views. Moreover, we propose an Unbalanced Incomplete Multi-view Clustering method (UIMC), which is the first effective method based on view evolution for unbalanced incomplete multi-view clustering. Compared with previous methods, UIMC has two unique advantages: 1) it proposes weighted multi-view subspace clustering to integrate these unbalanced incomplete views, which effectively solves the unbalanced incomplete multi-view problem; 2) it designs the low-rank and robust representation to recover the data, which diminishes the impact of the incompleteness and noise. Extensive experimental results demonstrate that UIMC improves the clustering performance by up to 40% on three evaluation metrics over other state-of-the-art methods.

2605.25250 2026-05-26 cs.AI 版本更新

LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

LipoAgent: 协调微调的大语言模型智能体以实现更安全的脂质设计

Leshu Li, An Lu, Haiyu Wang, Zhibin Feng, Conghui Duan, Qing Bao, Zongmin Zhao, Sai Qian Zhang

发表机构 * New York University(纽约大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出LipoAgent,一种安全感知的多智能体大语言模型框架,通过条件预测目标强制毒性作为效率预测的前提,并结合多智能体验证,在mRNA转染效率预测上平均相对提升32%。

详情
AI中文摘要

脂质纳米颗粒(LNPs)是核酸递送中最临床成熟的平台之一,但设计既有效又生物学安全的脂质仍是一个主要瓶颈。在实际筛选中,毒性是一个决策层面的约束:如果一种脂质有毒,其效率预测在临床上无关紧要。我们提出LipoAgent,一种用于脂质发现的安全感知多智能体大语言模型框架。LipoAgent将领域特定微调与条件预测目标相结合,强制毒性作为效率预测的前提,并通过多智能体验证进一步提高可靠性,在存在持续分歧时辅以轻量级人工监督。在多个基础模型上,与已报道的其他脂质设计模型相比,LipoAgent在mRNA转染效率预测上实现了平均32%的相对改进。湿实验验证证实,虚拟筛选排名可靠地转化为生物学转染结果。代码公开于https://github.com/SAI-Lab-NYU/LipoAgent.git。

英文摘要

Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision-level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety-aware multi-agent LLM framework for lipid discovery. LipoAgent combines domain-specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi-agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet-lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at https://github.com/SAI-Lab-NYU/LipoAgent.git.

2605.25235 2026-05-26 cs.LG cs.AI math.OC 版本更新

Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies

约束锚定归因:神经组合优化策略的可行性认证反事实与Bonferroni-PAC充分子集

Sohaib Lafifi

发表机构 * Univ. Artois, UR 3926, Laboratoire de Génie Informatique et d'Automatique de l'Artois (LGI2A), Béthune F-62400, France

AI总结 提出一种神经组合优化策略的归因方法,通过LP松弛对偶分解决策、CSP可行性模型认证反事实,并用Bonferroni校正的Hoeffding充分子集测试界定PAC解释大小。

Comments 4 pages, 1 figure, Reference implementation: https://github.com/sohaibafifi/neuro-co-cax (MIT)

详情
AI中文摘要

我们为神经组合优化(CO)策略提供了一种归因方法,该方法(i)通过LP松弛对偶按约束族分解决策,(ii)通过组合可行性模型(实现为CSP可行性决策模型)认证反事实,以及(iii)通过沿贪心顺序的Bonferroni校正Hoeffding充分子集测试界定PAC充分解释的大小。在三个CO问题和三个随机种子上,我们的LP锚定$\Lambda$-归因在CVRPTW(n_cert=344)上匹配CF导出信号的96.5%,在定向问题(n_cert=281)上匹配77.2%,而代理梯度分别为75.0%和35.2%(配对差异+0.215和+0.420;McNemar精确$p \le 10^{-14}$)。在柔性作业车间调度问题的秩对齐机制中,两个后端在每个CSP认证翻转(n_cert=59)上一致,确认了无增益预测。Bonferroni-PAC子集平均每步5.0个节点($M=70$,$\varepsilon=\delta=0.2$,$k_{\max}=25$)。参考实现:https://github.com/sohaibafifi/neuro-co-cax

英文摘要

We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via LP-relaxation duals, (ii) certifies counterfactuals through a combinatorial feasibility model (implemented as a CSP feasibility-decision model), and (iii) bounds the size of a PAC-sufficient explanation with a Bonferroni-corrected Hoeffding sufficient-subset test along a greedy ordering. Across three CO problems and three seeds, our LP-anchored $\Lambda$-attribution matches the CF-derived signal at 96.5% on CVRPTW (n_cert=344) and 77.2% on the Orienteering Problem (n_cert=281) vs 75.0% and 35.2% for proxy gradient (paired diffs +0.215 and +0.420; McNemar exact $p \le 10^{-14}$). In the rank-aligned regime of the Flexible Job-Shop Scheduling Problem, both backends agree on every CSP-certified flip (n_cert=59), confirming the no-gain prediction. Bonferroni-PAC subsets average 5.0 nodes per step ($M=70$, $\varepsilon=\delta=0.2$, $k_{\max}=25$). Reference implementation: https://github.com/sohaibafifi/neuro-co-cax
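To make the statistical part of the abstract concrete, the sketch below shows one standard way to combine a one-sided Hoeffding bound with a Bonferroni correction over at most `k_max` candidate subsets. It is an assumption about how such a test could be set up, not the paper's reference implementation; the function names and the agreement rates are made up for illustration. Only the constants `M=70`, `delta=0.2`, `eps=0.2`, and `k_max=25` come from the abstract.

```python
import math

def hoeffding_margin(num_samples: int, delta: float, num_tests: int) -> float:
    """One-sided Hoeffding margin at per-test level delta / num_tests."""
    return math.sqrt(math.log(num_tests / delta) / (2.0 * num_samples))

def smallest_certified_subset(agreement_rates, num_samples, eps, delta, k_max):
    """Return the first k (1-based) whose lower confidence bound reaches 1 - eps.

    agreement_rates[k - 1] is the empirical rate at which the decision is
    preserved when only the top-k elements of the greedy ordering are kept
    (hypothetical interface).
    """
    margin = hoeffding_margin(num_samples, delta, k_max)
    for k, rate in enumerate(agreement_rates[:k_max], start=1):
        if rate - margin >= 1.0 - eps:
            return k
    return None  # no prefix of the ordering could be certified

margin = hoeffding_margin(num_samples=70, delta=0.2, num_tests=25)
rates = [0.60, 0.85, 0.95, 1.00, 1.00]  # made-up illustrative rates
certified_k = smallest_certified_subset(rates, num_samples=70, eps=0.2, delta=0.2, k_max=25)
print(f"margin: {margin:.4f}, certified k: {certified_k}")
```

The union bound over the `k_max` tests is what keeps the overall failure probability at or below `delta`, at the price of a wider margin per test.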

2605.25234 2026-05-26 cs.LG cs.AI stat.CO stat.ML 版本更新

On the Epistemic Uncertainty of Overparametrized Neural Networks

关于过参数化神经网络的认知不确定性

David Rügamer

发表机构 * Department of Statistics, LMU Munich(统计系,慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 本文通过非可辨识性视角分析过参数化神经网络的认知不确定性,刻画了离散和连续残余不确定性来源,并以单隐层ReLU网络为例验证理论。

Comments Accepted at ICML 2026 (Main Track)

详情
AI中文摘要

认知不确定性通常被视为一种可减少的不确定性,随着数据增加而消失。这种观点隐含地假设参数可辨识,并将认知不确定性等同于预测变异性。然而,在过参数化神经网络中,由于对称性和冗余表示,模型参数通常不可辨识。因此,即使底层函数被完全识别,大量的参数不确定性仍然存在。在这项工作中,我们通过非可辨识性的视角分析认知不确定性,并刻画了残余不确定性的离散和连续来源。聚焦于单隐层ReLU网络,我们深入分析了由此产生的后验结构,并通过实证研究验证了我们的理论见解。

英文摘要

Epistemic uncertainty is often viewed as a reducible uncertainty that vanishes with increasing data. This perspective implicitly assumes parameter identifiability and equates epistemic uncertainty with predictive variability. In overparametrized neural networks, however, model parameters are typically non-identifiable due to symmetries and redundant representations. As a consequence, substantial parameter uncertainty can persist even when the underlying function is fully identified. In this work, we analyze epistemic uncertainty through the lens of non-identifiability and characterize both discrete and continuous sources of residual uncertainty. Focusing on one-hidden-layer ReLU networks, we thoroughly analyze the resulting posterior structure and validate our theoretical insights through empirical studies.
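The non-identifiability described in the abstract can be checked numerically. The sketch below uses two well-known symmetries of one-hidden-layer ReLU networks, a permutation of hidden units (discrete) and a positive rescaling of each unit (continuous), as assumed examples of the residual uncertainty sources the abstract refers to; the paper may treat further cases. The function name `relu_net` and all shapes are hypothetical.

```python
import numpy as np

def relu_net(x, w1, b1, w2):
    """One-hidden-layer ReLU network: x -> w2 @ relu(w1 @ x + b1)."""
    return w2 @ np.maximum(w1 @ x + b1, 0.0)

rng = np.random.default_rng(0)
w1, b1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=(1, 4))
x = rng.normal(size=3)

# Discrete symmetry: permute the hidden units consistently in both layers.
perm = np.array([2, 0, 3, 1])
y_perm = relu_net(x, w1[perm], b1[perm], w2[:, perm])

# Continuous symmetry: relu(c * z) = c * relu(z) for c > 0, so scaling a unit's
# incoming weights by c and its outgoing weight by 1 / c changes nothing.
c = np.array([0.5, 2.0, 3.0, 0.1])
y_scaled = relu_net(x, c[:, None] * w1, c * b1, w2 / c)

y = relu_net(x, w1, b1, w2)
print(np.allclose(y, y_perm), np.allclose(y, y_scaled))
```

Both parameter settings differ from the original yet define the same function, so data alone cannot distinguish them.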

2605.25233 2026-05-26 cs.AI 版本更新

Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

Meta-Agent:从任务描述到经过验证的多智能体系统

Andy Xu, Yu-Wing Tai

发表机构 * Dartmouth College(达特茅斯学院)

AI总结 提出Meta-Agent两阶段框架,通过任务规划、网络搜索、代码生成和验证机制,自动从自然语言任务描述构建并执行可靠的多智能体系统,在编码、上下文学习和开放推理任务中提升成功率、错误恢复和工作流稳定性。

详情
AI中文摘要

AI智能体越来越多地被用于解决复杂的多步骤任务,但随着工作流规模和深度的增长,现有的多智能体框架仍然脆弱。中间阶段的小错误会通过智能体交互传播,同时不充分的依据和薄弱的验证机制进一步限制了可靠性。我们提出Meta-Agent,一个两阶段框架,能够从自然语言任务描述自动构建并执行专门的多智能体系统。在构建阶段,任务规划器将问题分解为智能体规范的有向无环图,包含明确的输入/输出契约和验证标准。网络搜索模块用外部证据为每个规范提供依据,代码生成模块产生系统提示和工具配置。构建时验证阶段随后验证生成的工件,并在检测到失败时触发有针对性的重新生成。在执行阶段,协调器在智能体图中分配子任务,同时执行时验证对中间输出进行把关。我们进一步引入三级错误归因机制,区分局部、上游和结构性失败,从而实现从局部重试到部分重新执行和重新分解的有针对性的恢复策略。我们在编码、上下文学习和开放式推理任务上评估Meta-Agent。与强多智能体基线及消融实验相比,结果表明在任务成功率、错误恢复和工作流稳定性方面均有持续改进。这些结果凸显了将规划、依据和验证紧密集成以构建可靠多智能体系统的重要性。

英文摘要

AI agents are increasingly used to solve complex, multi-step tasks, but existing multi-agent frameworks remain brittle as workflows grow in scale and depth. Small errors at intermediate stages can propagate through agent interactions, while insufficient grounding and weak verification mechanisms further limit reliability. We present Meta-Agent, a two-phase framework that automatically constructs and executes specialized multi-agent systems from natural-language task descriptions. In the construction phase, a task planner decomposes a problem into a directed acyclic graph of agent specifications with explicit input/output contracts and verification criteria. A web search module grounds each specification with external evidence, and a code generation module produces system prompts and tool configurations. A construction-time verification stage then validates generated artifacts and triggers targeted regeneration when failures are detected. In the execution phase, a coordinator dispatches subtasks across the agent graph while execution-time verification gates intermediate outputs. We further introduce a three-level error attribution mechanism that distinguishes local, upstream, and structural failures, enabling targeted recovery strategies ranging from localized retries to partial re-execution and re-decomposition. We evaluate Meta-Agent across coding, contextual learning, and open-ended reasoning tasks. Experiments against strong multi-agent baselines and ablation studies demonstrate consistent improvements in task success rate, error recovery, and workflow stability. The results highlight the importance of tightly integrating planning, grounding, and verification for building reliable multi-agent systems.
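The execution phase described above, dispatching over a directed acyclic graph with a verification gate on each intermediate output, can be sketched with the standard-library `graphlib` module. This is a minimal illustration under stated assumptions, not the Meta-Agent implementation: the `agents` mapping, `run_agent_graph`, and the toy agents are hypothetical, and only a local retry is modeled (no upstream or structural recovery).

```python
from graphlib import TopologicalSorter

def run_agent_graph(agents, max_retries=1):
    """Run agents in dependency order, gating each output with its verifier.

    agents maps name -> (dependencies, run_fn, verify_fn); run_fn receives a
    dict of upstream outputs. All names are hypothetical.
    """
    graph = {name: spec[0] for name, spec in agents.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        deps, run_fn, verify_fn = agents[name]
        inputs = {d: outputs[d] for d in deps}
        for _attempt in range(max_retries + 1):
            result = run_fn(inputs)
            if verify_fn(result):  # execution-time verification gate
                outputs[name] = result
                break
        else:
            raise RuntimeError(f"agent {name!r} failed verification")
    return outputs

agents = {
    "plan": ((), lambda _: ["parse", "sum"], lambda out: len(out) > 0),
    "parse": (("plan",), lambda _: [1, 2, 3], lambda out: all(isinstance(v, int) for v in out)),
    "sum": (("parse",), lambda inp: sum(inp["parse"]), lambda out: isinstance(out, int)),
}
print(run_agent_graph(agents)["sum"])
```

Explicit dependencies plus a verifier per node act as the input/output contract: a downstream agent never sees an output that failed its gate.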

2605.25232 2026-05-26 cs.SE cs.AI cs.LO 版本更新

Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution

基于规约的代码-文本-代码重构:面向LLM介导的软件演化

Oleg Grynets, Vasyl Lyashkevych, Arsen Dolichnyi, Roman Piznak, Taras Zelenyy, Volodymyr Morozov

发表机构 * EPAM Systems(EPAM系统) McLean, Virginia, USA(美国弗吉尼亚州麦莱恩) Lviv, Ukraine(乌克兰利沃夫) Kyiv, Ukraine(乌克兰基辅)

AI总结 提出一种基于规约的Code2Text2Code重构框架,通过将源代码转换为中性文本规约并迭代验证,解决直接Code2Code转换中的语义漂移、行为变化等问题,实现受控的LLM介导软件演化。

Comments 15 pages, 9 figures, 7 tables, 39 references

详情
AI中文摘要

直接的Code2Code转换仍然难以控制,因为它可能保留表面语法,同时引入语义漂移、隐藏的行为变化、可追溯性丧失、非惯用的目标实现或领域逻辑的不完整重建。本文提出了一种基于规约的Code2Text2Code重构框架,用于LLM介导的软件演化。核心思想是将源代码转换为中性的文本规约,该规约捕获程序行为、标识符、计算流程、条件、副作用、数据依赖和领域特定意图,而不直接转移源语言语法。所提出的框架结合了事实上下文提取、Code2Text生成、源代码与文本规约之间的迭代验证、Text2Code生成、目标代码验证、检索增强接地、语义感知分块以及转换损失估计。知识表示层集成了来自AST的元数据、基于图的依赖结构、中性自然语言规约、技术文档、业务文档和架构级表示。进行的实验包括从多种编程语言和SQL方言构建的Code2Text2Code数据集、中间表示比较、检索评估、文档转换评估以及使用DSPy进行提示调优。实现了使用结构保留、反向兼容性、接口稳定性和总图相似性的图形式化来估计转换损失。结果支持将Code2Text2Code方法解释为一种受控的基于规约的重构过程,用于LLM介导的软件演化,而非简单的代码转换。

英文摘要

Direct Code2Code transformation remains challenging to control because it can preserve surface-level syntax while introducing semantic drift, hidden behavioral changes, loss of traceability, non-idiomatic target implementations, or incomplete reconstruction of domain logic. This paper proposes a specification-based Code2Text2Code reengineering framework for LLM-mediated software evolution. The central idea is to transform source code into a neutral textual specification that captures program behavior, identifiers, computational flow, conditions, side effects, data dependencies, and domain-specific intent without directly transferring the source language syntax. The proposed framework combines factual context extraction, Code2Text generation, iterative verification between source code and text specification, Text2Code generation, target code verification, retrieval-augmented grounding, semantic-aware chunking, and transformation loss estimation. The knowledge representation layer integrates metadata derived from AST, graph-based dependency structures, neutral natural language specifications, technical documentation, business documentation, and architecture-level representations. The conducted experiments include a Code2Text2Code dataset built from multiple programming languages and SQL dialects, comparison of intermediate representations, retrieval evaluation, documentation transformation evaluation, and prompt tuning using DSPy. A graph formalization using structural preservation, reverse compatibility, interface stability, and total graph similarity is implemented to estimate transformation losses. The results support the interpretation of the Code2Text2Code approach not as a simple code transformation, but as a controlled specification-based reengineering process for LLM-mediated software evolution.

2605.25210 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning

扩散模型的多目标学习:半监督学习下的统计理论

Ziheng Cheng, Yixiao Huang, Hanlin Zhu, Haoran Geng, Somayeh Sojoudi, Jitendra Malik, Pieter Abbeel, Xin Guo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对扩散模型在多目标学习中因模型容量增大导致统计成本高的问题,提出半监督两阶段训练方法,利用未标记数据通过伪样本蒸馏,证明所需配对样本量仅取决于专家模型复杂度。

详情
AI中文摘要

扩散模型越来越多地被用作强大的条件生成器,然而实际部署通常涉及来自不同任务的多个目标分布,例如文本到图像生成中的多样化提示域,或机器人技术中具有扩散策略的多个环境。这自然引出了多目标学习(MOL)问题。一个关键挑战是,实现良好的帕累托权衡可能需要一个通用模型类,其容量远大于解决任何单个任务所需的容量,从而增加了统计成本,因为样本复杂度通常随模型复杂度而扩展。为了调和这一点,我们为有限数据下的扩散模型开发了一个原则性的多目标学习框架:一种半监督机制,其中配对(标记)样本稀缺,但(未标记)条件数据丰富。我们提出了一种两阶段训练程序,首先从有限的配对数据中拟合轻量级专家模型,然后通过生成伪样本将它们蒸馏成一个通用模型。我们建立了泛化界限,表明所需的配对样本数量仅取决于专家模型类的复杂度。我们进一步将理论扩展到用于序列决策的扩散策略,以考虑在线策略展开中的分布偏移。在机器人控制和图像恢复任务上进行了大量实验,以验证我们的理论结果。

英文摘要

Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions arising from different tasks, e.g., diverse prompt domains in text-to-image generation, or multiple environments in robotics with diffusion policies. This naturally leads to a multi-objective learning (MOL) problem. A key challenge is that achieving good Pareto trade-offs can require a generalist model class with substantially larger capacity than what suffices for solving any individual task, thereby increasing statistical cost since sample complexity typically scales with the model complexity. To reconcile this, we develop a principled MOL framework for diffusion models with limited data: a semi-supervised regime where paired (labeled) samples are scarce, but (unlabeled) condition data are abundant. We propose a two-stage training procedure that first fits lightweight specialist models from limited paired data, and then distills them into a generalist model by generating pseudo-samples. We establish generalization bounds showing that the required number of paired samples only depends on the complexity of the specialist model classes. We further extend the theory to diffusion policies for sequential decision making to account for distribution shift in on-policy rollouts. Extensive experiments on robotic control and image restoration tasks are conducted to verify our theoretical results.

2605.25203 2026-05-26 cs.LG cs.AI cs.LO 版本更新

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

基于影响启发的谱旋转用于极端低位LLM量化

Gorgi Pavlov

发表机构 * Lehigh University(莱斯大学)

AI总结 本文利用伴随理论论文的影响自适应Walsh几何,通过WHT旋转和列缩放结合重构误差量化器,实现极端低位权重量化,在多个模型上降低困惑度15-58%。

Comments 14 pages, no figures. Companion application paper to arXiv:2605.01637 (theory). Code and pinned eval stack: https://github.com/gogipav14/spectral-llm

详情
AI中文摘要

我们将伴随理论论文(arXiv:2605.01637)的影响自适应Walsh几何应用于极端低位仅权重量化。方法是一个数学不变的变换:对每个线性层的权重矩阵进行WHT旋转,并根据逐坐标Walsh基激活能量重新缩放其列,然后交给重构误差量化器(Intel auto-round)。这使每组整数舍入偏向高谱能量通道。在四个从135M到1.5B参数的预训练仅解码器模型上,BBT-spectral在W2A16下相对于普通auto-round将wikitext-2困惑度降低了15-58%;我们还报告了一个TinyLlama-1.1B辅助数据点。三个扩展将方法迁移到其失败的族:针对Qwen3注意力的每头PCA矩阵-Gamma替换q_norm/k_norm(Qwen3-0.6B上PPL从136.76降至88.99);与RoPE可交换的SO(2)每对旋转(Qwen2.5-1.5B上PPL从36.93降至21.84);以及通过架构模糊测试发现的Laguna风格融合专家布局的MoE感知输入侧吸收修复。W2与W4的消融实验给出了一个故意的阴性对照:在W4下,重新分配收益落在±0.5 PPL噪声基底内,这与Schur-凸性直觉一致,即非集中影响成本随噪声预算缩小而消失。所有量化权重导出为OpenVINO IR,并在Intel NPU + Arc dGPU + CPU上运行,PPL在设备间变化在±0.1内。我们不声称将理论论文的majorization论证形式化为布尔到实数值的迁移:这里使用的WHT激活能量不是理论论文的布尔影响,联系是直观的,贡献在于工程价值而非迁移定理。与SpinQuant、QuaRot、QuIP-sharp、AQLM、OmniQuant和ButterflyQuant在匹配校准下的头对头基准测试是未来的主要工作。

英文摘要

We apply the influence-adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low-bit weight-only LLM quantization. The recipe is one math-invariant transformation: WHT-rotate each linear layer's weight matrix and rescale its columns by per-coordinate Walsh-basis activation energy before handing off to a reconstruction-error quantizer (Intel auto-round). This biases per-group integer rounding toward high-spectral-energy channels. On four pretrained decoder-only models from 135M to 1.5B parameters, BBT-spectral reduces wikitext-2 perplexity by 15-58% relative to vanilla auto-round at W2A16; we also report a TinyLlama-1.1B auxiliary data point. Three extensions transfer the recipe to families it failed on: a per-head PCA matrix-Gamma replacement of q_norm/k_norm for Qwen3 attention (PPL 136.76 -> 88.99 on Qwen3-0.6B); an SO(2) per-pair rotation that commutes with RoPE (PPL 36.93 -> 21.84 on Qwen2.5-1.5B); and an MoE-aware input-side absorption fix identified by architectural fuzzing of Laguna-style fused-expert layouts. A W2-vs-W4 ablation gives a deliberate negative control: the redistribution payoff falls within the +/-0.5 PPL noise floor at W4, consistent with the Schur-convexity intuition that the cost of unconcentrated influence vanishes as the noise budget shrinks. All quantized weights export to OpenVINO IR and run on Intel NPU + Arc dGPU + CPU with PPL invariant to device within +/-0.1. We do not claim a formal Boolean-to-real-valued transfer of the theory paper's majorization argument: the WHT activation energy used here is not the Boolean influence of the theory paper, the link is intuitive, and the contribution is engineering value rather than a transferred theorem. Head-to-head benchmarks against SpinQuant, QuaRot, QuIP-sharp, AQLM, OmniQuant, and ButterflyQuant at matched calibration are the main future-work item.
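The "math-invariant transformation" in the abstract rests on a simple identity: rotating a weight matrix by an orthogonal Walsh-Hadamard matrix and rescaling its columns leaves the layer output unchanged as long as the inverse is absorbed on the input side. The sketch below checks that identity with NumPy. It is an illustration under assumptions, not the paper's code: the scale `s = sqrt(energy)` is a hypothetical choice, the calibration activations are random, and no quantizer is applied.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix via the Sylvester construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.kron(h, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return h / np.sqrt(n)

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))   # linear layer: y = W @ x
X = rng.normal(size=(d_in, 32))      # hypothetical calibration activations

H = hadamard(d_in)
X_rot = H.T @ X                      # activations in the Walsh basis
energy = np.mean(X_rot**2, axis=1)   # per-coordinate activation energy
s = np.sqrt(energy)                  # hypothetical choice of column scale

W_new = (W @ H) * s                  # rotate, then rescale columns
X_new = X_rot / s[:, None]           # inverse scale absorbed on the input side

print(np.allclose(W @ X, W_new @ X_new))
```

Because the product is unchanged in exact arithmetic, any difference after quantization comes only from how rounding error is distributed across the rescaled columns of `W_new`.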

2605.25198 2026-05-26 cs.LG cs.AI 版本更新

Hide to Guide: Learning via Semantic Masking

隐藏以引导:通过语义掩码学习

Ruitao Liu, Qinghao Hu, Alex Hu, Yecheng Wu, Shang Yang, Luke J. Huang, Zhuoyang Zhang, Han Cai, Song Han

发表机构 * MIT(麻省理工学院) NVIDIA(英伟达)

AI总结 提出语义掩码专家策略优化(SMEPO),通过掩码专家轨迹中与奖励相关的语义片段,将困难问题转化为填空过程,提升强化学习在推理密集型任务中的探索效率。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为提升语言模型在推理密集型任务上性能的强大范式,但其有效性常受限于探索。例如,模型在困难问题上常常失败,留下很少有用的奖励信号。外部专家轨迹提供了一种自然的引导来源,但它们也可能在通往验证器目标的关键路径上暴露与奖励相关的内容,如最终答案、中间值、可执行实现或与答案相关的实体。这些内容可能创建意外的奖励黑客通道,使策略通过复制轨迹而非学习底层推理或智能体行为来获得奖励。现有的引导式RL方法通过使用部分轨迹来降低这种风险,但它们主要启发式地控制展示多少专家信息,而非控制应隐藏哪些部分。为此,我们提出语义掩码专家策略优化(SMEPO),一种用于专家引导RLVR的细粒度语义掩码策略。SMEPO不是粗略地截断轨迹或原样展示,而是在保留专家分解、计划和过程结构的同时,掩码关键路径上与奖励相关的语义片段。这将困难问题从从头推理转变为填空过程:策略可以遵循专家的问题解决路径,但仍需自行重建缺失的值、代码或实体。SMEPO易于应用,无需更改奖励函数或RL目标。在包括数学、代码和智能体搜索在内的多个领域,SMEPO相比GRPO将准确率提升最多3.2个百分点,并将训练时间减少最多4.2倍。代码已开源:https://github.com/mit-han-lab/SMEPO。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert's problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit-han-lab/SMEPO.

2605.25196 2026-05-26 cs.CY cs.AI 版本更新

Beyond Killer Robots: General AI Attitudes and Public Support for Military AI in Nine Countries

超越杀手机器人:九个国家中对AI的总体态度与公众对军事AI的支持

Andreas Jungherr, Antonia Schlude, Adrian Rauchfleisch

发表机构 * University of Bamberg(班贝格大学) Bavarian Research Institute for Digital Transformation (bidt)(巴伐利亚数字转型研究所) National Taiwan University(国立台湾大学)

AI总结 基于九国调查,研究公众对军事AI的支持主要受对AI的总体态度、对致命自主性的原则性反对还是外交政策取向影响,发现认为AI有益者更支持,而原则性反对仅与完全自主致命武力相关。

详情
AI中文摘要

人工智能赋能的军事系统是现代军事冲突的常见特征。应用范围从用于监视和攻击的自主无人机到AI支持的目标选择。AI对现代冲突的重要性也体现在政府与科技公司之间关于军事获取前沿AI条件的公开争议中。军事用途以及政府试图推动和引导这些用途的行为都发生在公众舆论的背景下,然而我们对人们如何看待军事AI仍知之甚少。基于一项在包括中国、德国和美国在内的九个国家中对9000名受访者进行的预注册调查,我们检验了军事AI的支持是否主要由对AI的通用态度、对致命自主性的原则性反对,或外交政策和地缘政治取向所塑造。在六个在致命性和人类控制方面有所不同的军事AI场景中,认为AI有益的受访者明显更支持军事AI。鹰派受访者也更支持。相比之下,对致命自主性的原则性反对与整体指数没有广泛关联,但与完全自主致命武力的应用相关。与我们的预期相反,感知到的AI风险与支持呈正相关。跨国差异适中,且与地缘政治背景大致一致。总体而言,公众对军事AI的舆论似乎是有条件地宽容的。公众并不绝对反对AI的各种军事用途。相反,不安主要集中在完全自主的致命武力上。

英文摘要

AI-enabled military systems are a fixture of modern military conflict. Applications range from autonomous drones for surveillance and attack to AI-supported target selection. The importance of AI for modern conflict also shows in public disputes between governments and technology companies over the conditions for military access to frontier AI. Both military uses and government attempts at enabling and steering them happen against a backdrop of public opinion, yet we still know little about how people think about military AI. Drawing on a preregistered survey of 9,000 respondents in nine countries, including China, Germany, and the United States, we examine whether support for military AI is shaped primarily by general attitudes toward AI, principled opposition to lethal autonomy, or foreign-policy and geopolitical orientations. Across six military AI scenarios that vary in lethality and human control, respondents who view AI as beneficial are substantially more supportive of military AI. Hawkish respondents are also more supportive. By contrast, principled opposition to lethal autonomy is not broadly associated with the full index but is related to the application of fully autonomous lethal force. Contrary to our expectation, perceived AI risks are positively associated with support. Cross-national differences are moderate and broadly consistent with geopolitical context. Overall, public opinion toward military AI appears conditionally permissive. Publics are not categorically opposed to various military uses of AI. Instead, unease is concentrated around fully autonomous lethal force.

2605.25188 2026-05-26 cs.AI 版本更新

DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

DarkForest: 少说话,多智能体LLM更高精度

Yi Li, Songtao Wei, Dongming Jiang, Zhichun Guo, Qiannan Li, Bingzhe Li

发表机构 * University of Texas at Dallas(德克萨斯大学达拉斯分校) Independent Researcher(独立研究者) University of California, Davis(加州大学戴维斯分校)

AI总结 提出DarkForest框架,通过保持智能体独立、结构化解析响应并基于信念分布协调,减少通信开销和错误传播,在六个推理基准上实现领先质量并大幅降低令牌消耗。

详情
AI中文摘要

多智能体LLM系统通过组合多个智能体的输出来改进推理,但交互密集型方法可能导致错误传播和高通信开销。当智能体交换原始响应或推理轨迹时,不正确的中间推理可能被采纳和放大,导致自信但错误的共识;多轮通信也增加了令牌消耗、延迟和推理成本。在本文中,我们提出了一种名为DarkForest的受控通信协调框架。DarkForest首先保持智能体独立,因此每个智能体在不看到其他智能体输出的情况下产生答案。然后,它将原始响应解析为结构化候选记录,将语义等价的候选记录分组为聚类,并使用智能体可靠性、置信度、解析质量、支持模式可靠性和独立性校正来估计这些聚类上的校准信念分布。协调器仅从该信念状态接收策略允许的证据,并进行受控通信。在六个推理基准上的实验表明,DarkForest实现了领先的整体质量,在基准指标上比最强基线最多提高30.7%,并且与通信密集型基线相比,令牌消耗最多减少6.5倍。

英文摘要

Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator receives only policy-permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5x compared with communication-heavy baselines.
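
The cluster-then-estimate step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not DarkForest's estimator: semantic equivalence is approximated by case- and whitespace-insensitive string matching, each vote is weighted by `reliability * confidence`, and the parse-quality, support-pattern, and independence corrections named in the abstract are omitted. The function name `belief_over_clusters` is hypothetical.

```python
from collections import defaultdict


def belief_over_clusters(records):
    """Return a normalized belief over answer clusters.

    records: iterable of (answer, reliability, confidence) tuples
    produced by agents that did not see each other's outputs.
    """
    scores = defaultdict(float)
    for answer, reliability, confidence in records:
        # Stand-in for semantic clustering: normalize the surface form.
        key = answer.strip().lower()
        scores[key] += reliability * confidence
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no supporting mass in any cluster")
    return {key: score / total for key, score in scores.items()}


records = [("42", 0.9, 0.8), (" 42 ", 0.6, 0.5), ("41", 0.7, 0.9)]
belief = belief_over_clusters(records)
print(belief)
```

A coordinator would then read only this belief state (for example, the top cluster and its mass) rather than the agents' raw reasoning traces.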

2605.25186 2026-05-26 cs.CL cs.AI 版本更新

By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode

凭其果实,你们将认识它们:通过编码的决策比较法律的形式化

Julius Vernie, Matthias Grabmair

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 提出一种系统方法,通过SAT求解器枚举不同形式化在边缘案例上的分歧,并转化为具体事实场景,以比较同一法律条款的不同形式化,应用于九个前沿LLM生成的十个欧盟条款形式化,发现行为分歧与结构一致性基本不相关。

Comments 23 pages, 17 figures, submitted to EMNLP PROC 2026

详情
AI中文摘要

将法律条款形式化有望实现机器可访问的法律和自动化法律推理,而最近的LLM使得直接从法规文本生成这种形式化变得诱人。然而,任何形式化都会做出隐含的解释选择,其后果难以预料,尤其是当LLM是作者时。我们提出了一种方法,通过它们在个别案例上的推理,系统地比较同一法律条款的不同形式化。给定一个条款的多个形式化,我们在节点级别匹配它们,从匹配中为每对推导出一个共享接口,并使用SAT求解器枚举任意两个形式化存在分歧的边缘案例。然后将选定的边缘案例转化为具体的事实场景,供法律专家检查并采取行动。我们将该方法应用于九个前沿LLM生成的十个欧盟条款的形式化。我们发现,形式化之间的行为分歧与其结构一致性基本不相关,并且口头化的案例揭示了定性的不同分歧类型,包括反映法律评论中真实争议的分歧。

英文摘要

Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate such formalizations directly from statutory text. However, any formalization makes implicit interpretive choices whose consequences are hard to anticipate, especially if an LLM is the author. We present a method for systematically comparing different formalizations of the same legal provision by their inferences on individual cases. Given multiple formalizations of a provision, we match them at the node level, derive a shared interface for each pair from the matching, and use a SAT solver to enumerate the edge cases on which any two formalizations disagree. Selected edge cases are then verbalized into concrete factual scenarios that a legal expert can examine and act on. We apply our method to formalizations of ten EU provisions generated by nine frontier LLMs. We find that behavioral divergence between formalizations is essentially uncorrelated with their structural agreement and that the verbalized cases reveal qualitatively distinct types of disagreement, including divergences that mirror genuine controversies in the legal commentary.
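
The disagreement-enumeration step can be illustrated on a toy example. The sketch below swaps the SAT solver for brute-force enumeration over a small shared interface, which is only feasible for a handful of Boolean variables. The two formalizations and the variable names (`is_controller`, `processes_data`, `established_in_eu`) are invented for illustration and do not come from the paper.

```python
from itertools import product


def disagreements(f, g, variables):
    """List every assignment of the shared variables on which f and g differ."""
    cases = []
    for values in product([False, True], repeat=len(variables)):
        case = dict(zip(variables, values))
        if f(case) != g(case):
            cases.append(case)
    return cases


# Two hypothetical readings of the same provision: one treats the last two
# conditions as alternatives, the other as cumulative requirements.
def formalization_a(c):
    return c["is_controller"] and (c["processes_data"] or c["established_in_eu"])


def formalization_b(c):
    return c["is_controller"] and c["processes_data"] and c["established_in_eu"]


variables = ["is_controller", "processes_data", "established_in_eu"]
edge_cases = disagreements(formalization_a, formalization_b, variables)
for case in edge_cases:
    print(case)
```

Each returned assignment is a candidate edge case that could then be verbalized into a concrete factual scenario for a legal expert to review.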

2605.25181 2026-05-26 cs.AI 版本更新

SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

SpecAlign: 一种用于 SystemVerilog 断言生成的语义对齐框架

Jaime Rafael Imperial, Hao Zheng

发表机构 * University of South Florida(南佛罗里达大学)

AI总结 提出 SpecAlign 框架,通过基于蕴含的分类和自一致性投票机制,评估并改进 LLM 生成的 SVA 与自然语言规范之间的语义对齐,无需黄金 RTL。

详情
AI中文摘要

现有的大语言模型(LLM)方法在生成 SystemVerilog 断言(SVA)时主要关注语法有效性和形式验证结果,而生成的断言与自然语言规范之间的语义对齐仍然难以量化。因此,在缺乏黄金 RTL 的情况下,幻觉或未对齐的 SVA 会降低信心并增加调试工作。本文提出了 SpecAlign,一个用于语义评估和优化 LLM 生成的 SVA 的框架。SpecAlign 引入了两个迭代对齐循环,通过基于蕴含的分类来评估自然语言属性和 SVA 是否符合设计规范。我们通过链式思维提示生成多个推理路径,并通过自一致性投票机制聚合它们,从而改进对齐决策。对未对齐的断言进行分析以生成可操作的反馈用于优化。我们进一步定义了一个定量对齐分数来衡量迭代过程中的语义一致性。实验结果表明,SpecAlign 能够有效检测语义不一致性,并在不依赖黄金 RTL 的情况下改进断言对齐,为传统形式验证评估指标提供了可扩展的补充。

英文摘要

Existing Large Language Model (LLM) approaches to SystemVerilog Assertion (SVA) generation primarily focus on syntactic validity and formal verification outcomes, while semantic alignment between generated assertions and natural language specifications remains difficult to quantify. As a result, hallucinated or misaligned SVAs can reduce confidence and increase debugging efforts in the absence of golden RTL. This paper presents SpecAlign, a framework for semantic evaluation and refinement of LLM-generated SVAs. SpecAlign introduces two iterative alignment loops that assess both natural language properties and SVAs against the design specification using entailment-based classification. We improve alignment decisions by generating multiple reasoning paths using chain-of-thought prompting and aggregating them via a self-consistency voting mechanism. Misaligned assertions are analyzed to generate actionable feedback for refinement. We further define a quantitative alignment score to measure semantic consistency across iterations. Experimental results demonstrate that SpecAlign effectively detects semantic inconsistencies and improves assertion alignment without relying on golden RTL, providing a scalable complement to traditional formal verification evaluation metrics.
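
The self-consistency voting step can be sketched as a majority vote over the labels produced by several reasoning paths. The label names, the function names, and the score definition (fraction of assertions whose majority label is `"entailed"`) are assumptions made for illustration; the abstract does not specify how the paper defines its alignment score.

```python
from collections import Counter


def vote(labels):
    """Return the majority label and the share of paths that agree with it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def alignment_score(votes_per_assertion):
    """Fraction of assertions whose majority label is 'entailed' (assumed metric)."""
    winners = [vote(labels)[0] for labels in votes_per_assertion.values()]
    return sum(w == "entailed" for w in winners) / len(winners)


# Hypothetical labels from several chain-of-thought samples per assertion.
votes = {
    "a1": ["entailed", "entailed", "contradicted", "entailed", "neutral"],
    "a2": ["contradicted", "contradicted", "entailed"],
    "a3": ["entailed", "entailed", "entailed"],
}
score = alignment_score(votes)
print(vote(votes["a1"]), score)
```

Assertions that lose the vote (here `a2`) are the ones a refinement loop would send back with feedback.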

2605.25170 2026-05-26 cs.LG cs.AI cs.ET cs.RO 版本更新

Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation

生长-剪枝-冻结网络:用于嗅觉导航的自适应与持续学习技术

Kordel K. France, Ovidiu Daescu

AI总结 提出生长-剪枝-冻结(GPF)网络框架,通过动态调整策略网络层数实现持续学习,在湍流羽流导航任务中达到94%成功率,并推广到其他机器学习任务。

详情
AI中文摘要

嗅觉训练数据分散在非标准化的数据集中,限制了构建代表性世界模型的能力。嗅觉导航是一项高度动态和非平稳的任务,受益于实时持续学习。我们引入了一种名为生长-剪枝-冻结(GPF)网络的自适应框架,使智能体能够通过生长、剪枝和冻结其策略的早期层来持续学习,以应对世界复杂性。以非线性随机矩阵理论为基础,我们展示了Pennington & Worah(2017)的工作可以从单隐藏层扩展到n层持续学习模型,并且网络权重的特征值组成在添加连续层时得以保持。我们展示了基于期望SARSA的GPF在湍流羽流导航上实现了94%的成功率(这是一个部分可观测、非平稳的任务,代表了激发机器人自适应学习的“大世界”挑战),并提供了将GPF应用于其他世界模型的支撑方法。进一步的实验表明,GPF可能很好地推广到其他机器学习任务,如Atari中的强化学习、图像分类和自回归语言模型。我们开源所有代码和数据,以鼓励对嗅觉机器人技术的改进和更多研究。

英文摘要

Training data for olfaction is scattered through disparate, non-standardized datasets that limit the ability to build representative world models. Olfactory navigation is a highly dynamic and non-stationary task that benefits from real-time continual learning. We introduce an adaptive framework called Grow-Prune-Freeze (GPF) networks that enable an agent to continually learn through growing, pruning, and freezing early layers of its policy in response to world complexity. Grounding GPFs in non-linear random matrix theory, we show that the work of Pennington & Worah (2017) can be extended from single hidden layers to n-layer continual-learning models, and that eigenvalue composition of network weights is preserved as successive layers are added. We show that GPFs based on Expected SARSA achieve a 94% success rate on turbulent plume navigation (a partially observable, non-stationary task representative of the "big world" challenges that motivate adaptive learning in robotics) and provide supporting methodology for applying GPFs in other world models. Further experiments provide evidence that GPFs may generalize well to other machine learning tasks such as reinforcement learning in Atari, image classification, and autoregressive language models. We open source all code and data to encourage improvements on and more research in olfactory robotics.

2605.25166 2026-05-26 cs.LG cs.AI 版本更新

AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting

AME-TS:基于锚定的混合专家模型用于时间序列预测

Rui Wang, Renhao Xue, Ray Razi, Huan Song, Hannah R. Marlowe

发表机构 * Amazon Web Services(亚马逊网络服务)

AI总结 提出AME-TS,一种结构引导的稀疏时间序列基础模型,通过轻量级预测器估计序列级描述符并生成专家软结构先验,实现专家路由与可解释时间结构对齐,在GIFT-Eval基准上实现精度-效率权衡,并在M5微调中展现更稳定的专家专业化。

详情
AI中文摘要

时间序列预测模型通过大型Transformer骨干不断扩展规模,但大多数现有方法通过共享密集计算路径处理所有序列,尽管时间结构存在显著异质性。混合专家模型(MoE)通过条件计算提供了一种自然替代方案,但标准MoE路由导致专家专业化识别弱且在下游适应中常不稳定。我们提出AME-TS,一种结构引导的稀疏时间序列基础模型,将专家路由与可解释的时间结构对齐。AME-TS首先使用轻量级预测器估计序列级描述符,包括可预测性、季节性、趋势和稀疏性,并将其映射为专家上的软结构先验。该序列级先验在训练期间指导令牌级路由,鼓励结构对齐的专业化。在GIFT-Eval基准上,AME-TS在不同模型规模下提供了强大的精度-效率权衡:在小型模型规模上显著优于现有时间序列基础模型,在较大规模上与最强模型保持竞争力,同时通过稀疏路由激活显著更少的参数。我们进一步表明,在M5数据集微调期间,AME-TS学习了比标准MoE更可解释的路由几何和更稳定的专家专业化。这些结果表明,结构感知路由是实现稀疏专家模型在时间序列预测中优势的有效且可靠方式。

英文摘要

Time series forecasting models are increasingly scaled through large Transformer backbones, yet most existing approaches process all series through a shared dense computation path despite substantial heterogeneity in temporal structure. Mixture-of-Experts (MoE) offers a natural alternative by enabling conditional computation, but standard MoE routing leaves expert specialization weakly identified and often unstable during downstream adaptation. We propose AME-TS, a structure-guided sparse time series foundation model that aligns expert routing with interpretable temporal structure. AME-TS first uses a lightweight regime predictor to estimate series-level descriptors, including forecastability, seasonality, trend, and sparsity, and maps them to a soft structural prior over experts. This series-level prior guides token-level routing during training, encouraging structure-aligned specialization. On the GIFT-Eval benchmark, AME-TS delivers a strong accuracy-efficiency tradeoff across model scales: it substantially outperforms existing time series foundation models at small model scales and remains competitive with the strongest models at larger scales, while activating substantially fewer parameters through sparse routing. We further show that AME-TS learns more interpretable routing geometry and substantially more stable expert specialization than standard MoE during fine-tuning on the M5 dataset. These results suggest that structure-aware routing is an effective and reliable way to realize the benefits of sparse expert models for time series forecasting.

2605.25163 2026-05-26 cs.CV cs.AI 版本更新

K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph

K-U-KAN: 基于Koopman增强的U-KAN用于单张全景X射线片的三维牙齿重建

Bikram Keshari Parida, Abhijit Sen, Wonsang You

发表机构 * Artificial Intelligence & Image Processing Lab., Department of Information & Communication Engineering, Sun Moon University, Asan-Si, South Korea; Department of Physics and Engineering Physics, Tulane University, New Orleans, LA, USA

AI总结 提出K-U-KAN三阶段流水线,结合Kolmogorov-Arnold网络、Koopman算子与U-KAN,从单张全景X射线高效重建三维牙齿结构,提升感知质量并缩短训练时间。

Comments 24 pages, 9 figures

详情
AI中文摘要

全景X射线将三维颌骨压缩为二维条带;我们的目标是干净且快速地恢复缺失的深度。现有的隐式神经表示能渲染逼真的体积,但训练缓慢,对采样和位置编码敏感,且实际成本高。纯CNN基线效率高,但难以处理牙弓的长程几何,模糊了精细的釉质-牙本质边界,且可解释性差。我们提出K-U-KAN,一个三阶段流水线:(i) 使用Kolmogorov-Arnold网络将二维特征提升为深度感知的可观测变量,(ii) 通过Koopman令牌块以稳定的、相位感知的线性演化推进这些可观测变量,(iii) 将预测的深度区间放置在焦槽射线上,然后由轻量级3D注意力U-KAN细化体积。这种物理(Beer-Lambert图像形成)、几何(马蹄形焦槽)和学习线性动力学的结合,在批量大小为1的原生射线强度上产生了清晰的解剖结构、更少的伪影和鲁棒的行为。在保留数据上,K-U-KAN在信号和结构指标上与Transformer/隐式基线相当,显著提高了感知质量,并且训练时间大约减半——使单视图全景X射线到锥形束CT重建在临床流程中更加实用。

英文摘要

A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. Existing implicit neural representations render realistic volumes but are slow to train, sensitive to sampling and positional encodings, and costly in practice. Pure CNN baselines are efficient yet struggle with the dental arch's long-range geometry, blur fine enamel-dentin boundaries, and offer little interpretability. We present K-U-KAN, a three-stage pipeline that (i) lifts 2D features into depth-aware observables with Kolmogorov-Arnold Networks, (ii) advances these observables by a stable, phase-aware linear evolution via a Koopman token block, and (iii) places the predicted depth bins onto focal-trough rays before a lightweight 3D attention U-KAN refines the volume. This marriage of physics (Beer-Lambert image formation), geometry (horseshoe focal trough), and learned linear dynamics yields sharp anatomy, fewer artifacts, and robust behavior on native radiographic intensities with batch size one. On held-out data, K-U-KAN matches transformer/implicit baselines on signal and structure metrics, clearly improves perceptual quality, and trains in roughly half the time, making single-view panoramic X-ray (PX) to cone-beam CT (CBCT) reconstruction more practical for clinical pipelines.

2605.25162 2026-05-26 cs.CL cs.AI 版本更新

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

STREAM:一个以数据为中心的框架,用于从流媒体中挖掘高价值任务导向对话

Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Byering Technology(伯英技术)

AI总结 提出STREAM框架,利用流媒体数据合成大规模多领域任务导向对话数据集StreamDial,通过角色构建和对话蓝图结合RAG生成高质量对话,解决数据稀缺问题。

详情
AI中文摘要

垂直领域的大语言模型受到复杂、特定领域任务导向对话稀缺的瓶颈。现有的数据获取管道面临持续的三难困境:专家标注昂贵,真实服务对话受隐私和商业限制,静态语料库很快过时。我们提出Stream,一个以数据为中心的框架,利用公开可用的流媒体(直播和短视频)大规模合成高价值服务对话。Stream从嘈杂的流中挖掘真实的交互信号,并通过将基于角色的个性构建与对话蓝图构建相结合来合成对话;它进一步采用检索增强生成(RAG)来支持知识感知的响应。基于Stream,我们发布了StreamDial,一个覆盖汽车、餐厅和酒店的大规模多领域数据集。StreamDial总共包含87,498个对话会话和1,497,320轮次,平均每个会话17.11轮,各领域规模相当。每个会话被组织为结构化四元组⟨P_u, P_a, B, H⟩,将对话历史与明确的用户/代理角色和对话蓝图配对,捕捉真实服务行为,如需求挖掘、约束冲突、协商和恢复。使用自动评估和下游任务的评估表明,StreamDial在强基线上提高了内在对话质量,使用StreamDial训练的模型在多个骨干网络上改进了对话状态跟踪;我们进一步报告了完整的人工评估集,并在受控训练预算下在Qwen3-8B上实现了令人鼓舞的多语言迁移。数据发布在https://github.com/hitxueliang/DialogDataSetBySTREAM。

英文摘要

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet $\langle P_u, P_a, B, H \rangle$ that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is released at https://github.com/hitxueliang/DialogDataSetBySTREAM.

2605.25156 2026-05-26 cs.LG cs.AI 版本更新

Abduction-Deduction Entanglement: Domain Generalization via Representation Transplants

溯因-演绎纠缠:通过表示移植实现领域泛化

Kasra Jalaldoust, Elias Bareinboim

发表机构 * Columbia University(哥伦比亚大学)

AI总结 本文提出一种基于表示移植的方法,通过参数化溯因-演绎纠缠中的非可识别性,在源分布约束下搜索目标分布空间,实现领域泛化中的最优目标预测。

详情
AI中文摘要

在源分布下训练的预测模型通常无法很好地泛化到不同的目标分布。对未见数据分布的有效推断必须依赖于生成源数据和目标数据的某些因果机制的不变性,然而这些结构不变性仅从源数据中是无法识别的。在关于数据的温和因果假设下,我们表明目标中的最优预测实际上部分可由源分布识别。该结果基于一个简单的观察:在任何领域中,最优预测可以分解为我们称之为溯因映射和演绎映射的一对映射,其中溯因映射从观测变量推断某些未观测变量(可能是混杂因素),演绎映射使用观测和推断的量来预测标签。大量源数据的使用固定了最优预测,从而约束了产生它的有效溯因-演绎组合——这种非可识别性我们称之为溯因-演绎纠缠。为了利用这一点,我们使用所谓的表示移植来参数化受约束的族,表示移植是表示空间中的一种特定线性变换,它在保留演绎成分的同时操纵表示的溯因内容。生成标签的因果机制的不变性意味着源和目标之间存在不变的演绎映射。因此,我们可以通过参数化移植来搜索合理的目标分布空间。我们在一个学习器-对手博弈中使用该方案,在理想优化下,该博弈可证明终止于学习器具有极小极大最优目标预测。评估验证了理论,表明该方法在领域泛化基准测试中具有竞争力。

英文摘要

Prediction models trained under the source distribution often do not generalize well to a different target distribution. A valid inference about an unseen data distribution must be anchored by the invariance of certain causal mechanisms that generate the source and target data; however, these structural invariances are non-identifiable from the source data alone. Under mild causal assumptions about the data, we show that the optimal prediction in the target is in fact partially identifiable from the source distribution. The result rests on a simple observation: in any domain, the optimal prediction can be factorized into a pair of maps that we call the abduction and deduction maps, where the abduction map infers some unobserved variables (possibly confounders) from the observed variables and the deduction map predicts the label using both the observed and inferred quantities. Access to large source data pins down the optimal prediction and thus constrains the valid abduction-deduction combinations that produce it, a non-identifiability that we call the abduction-deduction entanglement. To leverage this, we parameterize the constrained family using what we call a representation transplant, a specific linear transformation in the representation space that manipulates the abduction content of the representation while retaining the deduction component. Invariance of the causal mechanism generating the label implies the existence of an invariant deduction map between source and target. Thus, we can search the space of plausible target distributions via a parametric transplant. We use this scheme in a learner-adversary game that, under idealized optimization, provably terminates with the learner having the minimax-optimal target prediction. Evaluations verify the theory, showing that the method is competitive on domain generalization (DG) benchmarks.

2605.25151 2026-05-26 cs.AI cs.CE 版本更新

Representation Without Control: Testing the Realization Effect in Language Models

无控制的表征:测试语言模型中的实现效应

Ciarán Walsh, Emilio Barkett

发表机构 * Columbia University(哥伦比亚大学)

AI总结 通过提示行为、线性读出和因果控制三个层面,测试语言模型是否表现出类似人类的实现效应,发现潜在读出成功但因果控制无效,表明三者不自动共存。

详情
AI中文摘要

大型语言模型越来越多地被用作行为模拟器,但其输出何时反映类似人类的认知机制而非提示敏感的表面模式仍不清楚。我们通过实现效应研究这一问题,这是行为经济学中一个特征明确的发现,即风险承担在纸面收益与实现收益及损失后存在系统性差异。我们在三个层面评估LLM行为:仅提示的行为敏感性、内部表征的线性读出以及通过激活引导的因果控制。仅提示结果显示系统的条件敏感性,但方向模式未复现人类实现效应的预测。Gemma的残差流在第18层包含一个线性可解码的实现状态信号,该信号可泛化到未见过的提示。然而,沿此方向引导并未可靠地改变下游风险选择,这一零结果在正尺度和负符号对称运行中均成立。行为敏感性、潜在读出和因果控制是三个不同的属性,它们不会自动共存,成功的潜在读出不足以证明模型在下游决策中行为上依赖于该表征。

英文摘要

Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma's residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive scales and in a negative sign-symmetry run. Behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur, and successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.

2605.25141 2026-05-26 cs.CL cs.AI 版本更新

LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data: A Review of Solar, Wind, Weather, and Grid Aware Decision Support

基于LLM Agent的利用边缘和物联网数据的可再生能源预测:太阳能、风能、天气和电网感知决策支持综述

Pavan Manjunath, Thomas Pruefer

发表机构 * Independent Researcher(独立研究员)

AI总结 本文综述了如何利用大语言模型代理整合异构传感器流、天气API数据、历史发电记录和电网约束,形成统一的决策支持工作流,以增强可再生能源预测。

详情
AI中文摘要

可再生能源发电的可靠预测是电网稳定性、能源交易、电池调度和碳感知运营规划的基础要求。太阳能和风能资源本质上是间歇性的,其输出随云量、风速、大气湍流、季节模式和局部地形而波动。物联网和边缘设备的普及,包括智能电表、逆变器、风速计、日射强度计、气象站和电网接口传感器,创造了前所未有的实时运行数据量,而传统的预测流程难以充分利用这些数据。本综述研究了大语言模型代理如何通过将异构传感器流、天气API数据、历史发电记录、电网约束和上下文推理整合到统一的决策支持工作流中,来增强可再生能源预测。我们调查了经典预测方法(统计时间序列模型、深度学习架构、物理混合方法)以及新兴的用于解释、不确定性沟通和操作员指导的LLM代理框架。提出了一个六层分类法,涵盖数据采集、预处理、特征工程、模型推理、不确定性估计和自然语言报告。综述识别了十二个开放挑战,包括实时部署、分布偏移下的模型漂移、不确定性量化、LLM代理中的幻觉控制、边缘硬件的互操作性以及与能源管理系统的集成。论文最后建议了一个研究议程,重点关注开放基准、物理信息LLM基础以及联邦预测架构。

英文摘要

Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and carbon-aware operational planning. Solar and wind resources are inherently intermittent: their output fluctuates with cloud cover, wind speed, atmospheric turbulence, seasonal patterns, and local terrain. The proliferation of IoT and edge devices, spanning smart meters, inverters, anemometers, pyranometers, weather stations, and grid-interface sensors, has created an unprecedented volume of real-time operational data that conventional forecasting pipelines are ill-equipped to exploit fully. This review investigates how large language model (LLM) agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams, weather API data, historical generation records, grid constraints, and contextual reasoning into unified decision-support workflows. We survey classical forecasting methods (statistical time series models, deep learning architectures, physics-hybrid approaches) and emerging LLM agent frameworks for explanation, uncertainty communication, and operator guidance. A six-layer taxonomy is proposed, covering data acquisition, preprocessing, feature engineering, model inference, uncertainty estimation, and natural language reporting. The review identifies twelve open challenges, spanning real-time deployment, model drift under distribution shift, uncertainty quantification, hallucination control in LLM agents, interoperability of edge hardware, and integration with energy management systems. The paper concludes by recommending a research agenda centred on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.

2605.25135 2026-05-26 cs.LG cs.AI 版本更新

ASTRO: Adaptive Spatio-Temporal Reinforcement Optimization for GNN Powered Anomaly Detection in Cyber Physical Systems

ASTRO: 用于信息物理系统中基于GNN的异常检测的自适应时空强化优化

Rai Ali Yar, Umaisa Lail, Anwar Shah

发表机构 * Department of Computer Science, FAST NUCES(计算机科学系,FAST NUCES) Department of Information Technology, Riphah International University(信息技术系,Riphah国际大学)

AI总结 提出ASTRO框架,结合深度Q网络与图神经网络、时间建模和多头注意力机制,通过强化学习动态优化阈值,在SWaT和WADI数据集上实现高F1分数,优于现有方法。

详情
AI中文摘要

工业物联网环境中的异常检测对于保护工业控制系统和信息物理系统免受运行时虚假数据注入和其他恶意攻击至关重要。传感器网络和互连控制回路日益复杂,使得识别隐藏在高维和时间依赖信号中的异常行为变得困难。为解决这些挑战,本文介绍了自适应时空强化优化ASTRO,一种新颖的异常检测框架,开创性地使用强化学习进行动态阈值优化。通过将深度Q网络与图神经网络、时间建模和多头注意力机制相结合,ASTRO不断调整其决策边界以提高检测精度。GNN组件建模传感器之间的空间关系,时间模型捕获时间序列依赖性,注意力层突出显示最具信息量的时间步。模型生成连续异常分数,通过自适应阈值转换为二元决策,该阈值通过深度Q网络优化。ASTRO方法在两个真实工业基准测试:安全水处理和水分配数据集上进行了评估。所提模型在SWaT上取得了卓越性能,F1分数为0.990。此外,在高度复杂的127个终端设备的WADI数据集上,它获得了0.788的F1分数,比最先进的基线高出近14%。多次运行的结果证实了其一致的泛化能力和稳定性。这些实验表明,ASTRO框架是增强大规模信息物理基础设施的高度实用和可扩展的方法。

英文摘要

Anomaly detection in Industrial Internet of Things (IIoT) environments is essential to protect Industrial Control Systems (ICS) and Cyber-Physical Systems (CPS) from run-time false data injection and other malicious attacks. The increasing complexity of sensor networks and interconnected control loops makes it difficult to identify anomalous behavior hidden within high-dimensional and time-dependent signals. To address these challenges, this article introduces Adaptive Spatio-Temporal Reinforcement Optimization (ASTRO), a novel anomaly detection framework that pioneers the use of reinforcement learning for dynamic threshold optimization. By integrating a Deep Q-Network (DQN) with Graph Neural Networks (GNNs), temporal modelling, and a multi-head attention mechanism, ASTRO continuously adapts its decision boundaries to improve detection accuracy. The GNN component models the spatial relations among sensors, the temporal model captures time-series dependencies, and the attention layer highlights the most informative time steps. The model generates continuous anomaly scores, which are transformed into binary decisions using an adaptive threshold optimized via the DQN. ASTRO is evaluated on two real-world industrial benchmarks: the Secure Water Treatment (SWaT) and Water Distribution (WADI) datasets. The proposed model achieves an F1 score of 0.990 on SWaT. On the highly complex WADI dataset, with 127 end devices, it reaches an F1 score of 0.788, outperforming state-of-the-art baselines by nearly 14%. Results across multiple runs confirm consistent generalization and stability. These experiments demonstrate that ASTRO is a practical and scalable method for strengthening large-scale cyber-physical infrastructures.

2605.25133 2026-05-26 cs.AI cs.CL 版本更新

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

信任但验证:面向选择性LLM预测的证明者-验证者审议

João Sedoc, Baotong Zhang, Dean Foster

发表机构 * New York University(纽约大学)

AI总结 提出基于交互式证明理论的证明者-验证者审议协议,通过结构化置信度判定实现选择性预测,在GPQA Diamond上取得约30个百分点的高置信度精确率差距。

详情
AI中文摘要

可靠地知道语言模型何时正确几乎与正确本身同样重要。我们引入证明者-验证者审议(PVD),这是一种基于交互式证明理论的推理时协议,作为选择性预测的机制:该协议同时产生答案和结构化置信度判定,允许系统报告高置信度答案,同时在不明确的情况下弃权。在每个对话中,证明者通过可检查的子主张捍卫候选答案,而验证者发出有针对性的挑战并返回Accept、Challenge或Reject。由于冻结的语言模型是在噪声信道上运行的不完美的证明者和验证者,形式上的可靠性和完备性保证并不适用;相反,我们通过其覆盖-精确率行为来经验性地描述该协议。我们的主要实验使用Claude Sonnet 4.6作为证明者,Claude Haiku 4.5作为验证者,在GPQA Diamond上进行。没有答案修订即被接受的问题,我们称为Accept + No Change (ANC),作为高置信度子集报告;我们通过其精确率和覆盖来评估该子集。ANC将可靠答案与不可靠答案分开,与非ANC补集相比产生约30个百分点的HC-Prec差距。使用GPT和Gemini配对的鲁棒性实验表明,高HC-Prec可以跨模型系列转移,而验证者的严格性和领域能力在很大程度上决定了选择差距的大小。在Humanity's Last Exam上,较弱的证明者-验证者配对可能使ANC信号崩溃或反转,这说明了当验证者在其有效区域外操作时的实际失败模式。与自一致性、通用自一致性、多智能体辩论和Reflexion的比较表明,证明者-验证者审议为选择性预测提供了独特的论点可辩护性信号。

英文摘要

Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a gap of roughly 30 percentage points in high-confidence precision (HC-Prec) over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
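
The coverage-precision evaluation of the ANC subset can be made concrete with a small sketch. The record layout `(verdict, revised, correct)` and the function name `selective_metrics` are assumptions made for illustration, and the ten records are invented toy data, not results from the paper.

```python
def selective_metrics(records):
    """Compute coverage and precision of the Accept + No Change (ANC) subset.

    records: list of (verdict, revised, correct) tuples, where verdict is the
    verifier's final decision, revised says whether the answer changed during
    the dialogue, and correct is the ground-truth outcome.
    """
    anc = [r for r in records if r[0] == "Accept" and not r[1]]
    rest = [r for r in records if not (r[0] == "Accept" and not r[1])]

    def precision(rows):
        return sum(r[2] for r in rows) / len(rows) if rows else float("nan")

    return {
        "coverage": len(anc) / len(records),
        "anc_precision": precision(anc),
        "non_anc_precision": precision(rest),
        "gap": precision(anc) - precision(rest),
    }


# Invented toy data: 5 ANC items (4 correct) and 5 non-ANC items (2 correct).
records = (
    [("Accept", False, True)] * 4
    + [("Accept", False, False)]
    + [("Accept", True, True), ("Reject", False, False)]
    + [("Challenge", False, True), ("Reject", True, False), ("Reject", False, False)]
)
metrics = selective_metrics(records)
print(metrics)
```

A system using this signal would report answers in the ANC subset and abstain elsewhere, so a collapsed or negative `gap` corresponds to the failure mode the abstract describes for weaker pairings.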

2605.25123 2026-05-26 cs.LG cs.AI cs.CL cs.CV stat.ML 版本更新

Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

扩散模型的推理时对齐:基于信任区域迭代扭曲序贯蒙特卡洛方法

Weixin Wang, Yu Yang, Wei Deng, Pan Xu

发表机构 * Duke University(杜克大学) Morgan Stanley(摩根士丹利)

AI总结 提出信任区域迭代扭曲序贯蒙特卡洛(TRI-TSMC)框架,通过迭代学习扭曲函数来改进扩散模型推理时的对齐,在文本生成和文本到图像生成任务上优于现有方法。

Comments 34 pages, 6 figures, and 7 tables

详情
AI中文摘要

我们研究基于扩散的生成模型的推理时对齐,旨在引导基础模型产生高奖励输出而不更新其权重。最近的基于序贯蒙特卡洛(SMC)的引导方法以原则性的方式近似奖励倾斜的目标分布,但其提议仍主要依赖于基础采样器。由于奖励信息主要通过粒子重加权和重采样在传播后使用,这些方法可能需要大量粒子预算,并遭受权重退化和高方差估计的问题。降低方差和提高粒子效率的一种方法是迭代学习提供前瞻指导的扭曲函数,如扭曲SMC。然而,现有的可学习扭曲方法主要针对经典序贯推理开发,当应用于具有高维状态空间和终端、噪声或黑盒奖励的扩散对齐时可能不稳定。我们提出信任区域迭代扭曲序贯蒙特卡洛(TRI-TSMC),一种用于在基于SMC的推理时对齐中学习扭曲函数的信任区域框架。每次迭代在路径空间中计算精确的KL约束更新,通过温度重要性重加权得到闭式解,并通过加权最大似然将该目标投影回参数化扭曲族。理论上,我们形式化了最优扭曲函数的值函数解释,并表明它产生零方差采样器。我们证明信任区域更新沿着护航路径朝向目标分布,加权最大似然更新是前向KL投影,并且该路径降低了残差重要性权重方差。实验上,在匹配的推理时预算下,TRI-TSMC在离散扩散文本生成和文本到图像生成上改进了主要对齐目标。

英文摘要

We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Since reward information is mainly used after propagation through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods are developed mainly for classical sequential inference and can be unstable when applied to diffusion-based alignment with high-dimensional state spaces and terminal, noisy, or black-box rewards. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution by tempered importance reweighting, and projects this target back to the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove that the trust-region update follows an escort path toward the target distribution, that the weighted maximum-likelihood update is a forward-KL projection, and that the path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.
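
The "KL-constrained update solved by tempered importance reweighting" can be illustrated on a finite particle set. The sketch below is an assumption-laden toy, not the paper's algorithm: it tempers particle log-weights by an exponent `beta` in [0, 1] and picks the largest `beta` whose reweighted particle distribution stays within a KL budget `delta` of the uniform (unweighted) particle distribution. The function names and the bisection are choices made here for illustration.

```python
import numpy as np


def tempered_weights(log_w: np.ndarray, beta: float) -> np.ndarray:
    """Normalized weights proportional to exp(beta * log_w)."""
    z = beta * log_w
    z = z - z.max()  # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()


def kl_to_uniform(p: np.ndarray) -> float:
    """KL(p || uniform) over the particle set."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p.size * p[nz])))


def trust_region_beta(log_w: np.ndarray, delta: float, iters: int = 60) -> float:
    """Largest beta in [0, 1] with KL(tempered || uniform) <= delta.

    The KL is non-decreasing in beta for beta >= 0, so bisection applies.
    """
    if kl_to_uniform(tempered_weights(log_w, 1.0)) <= delta:
        return 1.0  # the full reweighting already fits in the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(tempered_weights(log_w, mid)) <= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

In the abstract's framing, the tempered target would then be projected back onto the parameterized twisted family by weighted maximum likelihood; that projection depends on the model class and is left out of this sketch.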

2605.25120 2026-05-26 cs.CL cs.AI cs.HC 版本更新

Evidence-Linked Radiology Reporting: A Human-Supervised Reference Architecture for Structured Imaging Intelligence

证据关联放射学报告:面向结构化成像智能的人机协同参考架构

Houman Kazemzadeh, Kamyar Naderi

Affiliations * Xylemed

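AI Summary Proposes a human-supervised, evidence-linked reference architecture that combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and standards-based interoperability (DICOM, HL7 FHIR, and others) to turn radiology reporting from free text into a structured intelligence layer that supports reviewed reporting, longitudinal comparison, clinical data reuse, and system integration.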

Comments Technical report, 27 pages, 2 figures, 12 tables, 1 listing; reference architecture paper; does not report clinical outcomes or validated diagnostic performance

详情
AI中文摘要

放射学报告仍然是向临床团队传达成像结果的主要机制。然而,这些报告背后的大量结构化信息,包括测量值、图像证据、既往比较、病灶标识、不确定性和术语,通常仍被禁锢在自由文本中,或分散在图像存档与通信系统、放射信息系统、报告工作站、工作表、高级可视化工具和电子健康记录中。本文提出一种人机协同、证据关联的结构化放射学报告参考架构。该框架结合了特定检查模板、语音到结构处理、测量与分割捕获、受控AI辅助起草,以及基于DICOM、DICOM结构化报告、DICOM分割、HL7 FHIR、RadLex、SNOMED CT、LOINC和UCUM的标准化互操作性。该系统并非作为自主报告生成器,而是作为企业成像的结构化智能层,支持审阅报告、纵向比较、临床数据重用、治理,以及与PACS、RIS、EHR、分析和注册工作流的集成。本文还讨论了针对AI辅助放射学报告系统的模态特定部署考虑、临床安全风险、验证要求、网络安全、隐私、质量管理和监管边界。

英文摘要

Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structured information behind these reports, including measurements, image evidence, prior comparisons, lesion identity, uncertainty, and terminology, often remains trapped in free text or fragmented across picture archiving and communication systems, radiology information systems, reporting workstations, worksheets, advanced visualization tools, and electronic health records. This paper proposes a human-supervised, evidence-linked reference architecture for structured radiology reporting. The framework combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and standards-based interoperability using DICOM, DICOM Structured Reporting, DICOM Segmentation, HL7 FHIR, RadLex, SNOMED CT, LOINC, and UCUM. The system is positioned not as an autonomous report generator, but as a structured intelligence layer for enterprise imaging that supports reviewed reporting, longitudinal comparison, clinical data reuse, governance, and integration with PACS, RIS, EHR, analytics, and registry workflows. The paper also discusses modality-specific deployment considerations, clinical safety risks, validation requirements, cybersecurity, privacy, quality management, and regulatory boundaries for AI-assisted radiology reporting systems.

2605.25119 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation

信任感知的联合特征-预测差异用于鲁棒域适应

Xi Ding, Lei Wang, Syuan-Hao Li, Yongsheng Gao

Affiliations * School of Engineering and Built Environment, Griffith University, Australia

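AI Summary Proposes a trust-aware domain adaptation framework whose Joint Feature-Prediction Discrepancy (JFPD) combines uncertainty-aware trust and semantic-alignment trust to give a reliability-aware estimate of domain discrepancy and improve adaptation performance.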

Comments Research report

详情
AI中文摘要

域适应旨在减轻标记源域与未标记或稀疏标记目标域之间分布偏移导致的性能下降。大多数现有方法在特征空间或预测空间中估计域差异。然而,这些单一视角策略忽略了域偏移下的一个关键问题:用于对齐的信号可靠性。实际上,学习到的表示和语义预测都可能变得不可靠,平等对待所有目标样本可能导致误导性对齐和次优迁移。我们引入了信任感知域适应,这是一个原则性框架,通过特征和预测信号的可靠性来建模域差异。我们方法的核心是联合特征-预测差异(JFPD),这是一个统一公式,联合捕捉表示散度和预测散度,并通过样本特定信任加权它们的贡献。信任通过两种互补机制量化:不确定性信任,从预测熵导出以抑制不可靠预测;语义对齐信任,从特征空间中的原型相似性计算以强调良好对齐的表示。通过优先考虑自信且语义一致的样本,同时降低噪声或模糊样本的权重,JFPD提供了域差异的可靠性感知估计。我们进一步将JFPD集成到训练目标中,引导适应朝向目标域的可靠区域。在标准基准上的实验表明,所提出的框架始终实现优越的适应性能,并产生与目标域误差相关的差异估计。这项工作首次解决了在域适应中建模特征与预测之间交互信任的重要性。

英文摘要

Domain adaptation aims to mitigate performance degradation caused by distribution shifts between a labeled source domain and an unlabeled or sparsely labeled target domain. Most existing approaches estimate domain discrepancy either in feature space or in prediction space. However, these single-perspective strategies overlook a critical problem under domain shift: the reliability of the signals used for alignment. In practice, both learned representations and semantic predictions may become unreliable, and treating all target samples equally can lead to misleading alignment and suboptimal transfer. We introduce trust-aware domain adaptation, a principled framework that models domain discrepancy through the reliability of feature and prediction signals. Central to our approach is the Joint Feature-Prediction Discrepancy (JFPD), a unified formulation that jointly captures representation divergence and prediction divergence while weighting their contributions by sample-specific trust. Trust is quantified via two complementary mechanisms: uncertainty-aware trust, derived from prediction entropy to suppress unreliable predictions, and semantic-alignment trust, computed from prototype similarity in feature space to emphasize well-aligned representations. By prioritizing confident and semantically consistent samples while down-weighting noisy or ambiguous ones, JFPD provides a reliability-aware estimate of domain discrepancy. We further integrate JFPD into a training objective that guides adaptation toward trustworthy regions of the target domain. Experiments on standard benchmarks demonstrate that the proposed framework consistently achieves superior adaptation performance and yields discrepancy estimates that correlate with target-domain error. This work addresses, for the first time, the importance of modeling trust in the interaction between features and predictions for domain adaptation.
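
The two trust signals named in the abstract can be sketched per sample as below. The concrete formulas are assumptions made for illustration (normalized entropy for the uncertainty term, rescaled cosine similarity to the predicted-class prototype for the alignment term, and a product to combine them); the paper may define and combine them differently.

```python
import numpy as np


def uncertainty_trust(probs: np.ndarray) -> np.ndarray:
    """1 - normalized prediction entropy; close to 1 for confident predictions."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return 1.0 - entropy / np.log(probs.shape[1])


def alignment_trust(feats: np.ndarray, prototypes: np.ndarray,
                    probs: np.ndarray) -> np.ndarray:
    """Cosine similarity to the predicted-class prototype, mapped to [0, 1]."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = f @ c.T                              # shape (n_samples, n_classes)
    pred = probs.argmax(axis=1)
    return 0.5 * (cos[np.arange(len(pred)), pred] + 1.0)


def sample_trust(feats: np.ndarray, prototypes: np.ndarray,
                 probs: np.ndarray) -> np.ndarray:
    """Per-sample weight: confident AND well-aligned samples score highest."""
    return uncertainty_trust(probs) * alignment_trust(feats, prototypes, probs)
```

Weights of this kind could then scale each target sample's contribution to a feature-space and a prediction-space discrepancy term, which is the role the abstract assigns to sample-specific trust.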

2605.25115 2026-05-26 cs.LG cs.AI cs.CE physics.app-ph 版本更新

Courant: a State-Adaptive Perceiver-Based Neural Surrogate with Local Support and Interpretable Field Decomposition

Courant:一种具有局部支持和可解释场分解的状态自适应感知器神经代理模型

Anuj Kumar, Josiah Bjorgaard, Nikolaos Bouklas, Matteo Salvador, Alexander Lavin

Affiliations * Pasteur Labs; Cornell University; Institute for Simulation Intelligence

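AI Summary Proposes Courant, a Perceiver-based encoder-processor-decoder surrogate model whose state-adapted latent queries and lightweight decoder give local support and an interpretable field decomposition, akin to adaptive hp-refinement, with competitive accuracy on steady and transient simulation benchmarks.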

详情
AI中文摘要

我们引入“Courant”,一种基于感知器的编码器-处理器-解码器代理模型,其潜在特征在物理空间中表现出自适应专门化和局部支持,实现了类似于自适应hp细化方案的功能,这是传统数值求解器和科学机器学习中非常期望的属性。所提出的架构结合了共享随机傅里叶特征坐标嵌入、状态自适应潜在查询和轻量解码器。Courant使用稳态或瞬态模拟数据进行端到端训练,仅使用物理空间中的标准L_2预测损失,在基准测试上达到竞争性精度。我们证明Courant的归纳偏差产生了设计上可解释的潜在变量:它们在模拟域中发展出多尺度几何专门化,并在时间相关情况下跟踪相干结构,类似于随时间演化的空间基函数,从而允许对模拟场进行紧凑的、几何锚定的、单位划分式的分解。

英文摘要

We introduce "Courant", a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specialization and local support in the physical space, enabling functionality akin to an adaptive hp-refinement scheme, an attribute that is highly desirable in traditional numerical solvers and scientific machine learning broadly. The proposed architecture combines a shared random Fourier feature coordinate embedding, state-adapted latent queries, and a light-weight decoder. Courant is trained end-to-end with steady or transient simulation data and only a standard L_2 prediction loss in the physical space, achieving competitive accuracy on benchmarks. We demonstrate that Courant's inductive biases yield latents that are interpretable by design: they develop multiscale geometric specialization in the simulation domain and track coherent structures in the time-dependent case, acting analogously to time-evolving spatial basis functions and allowing for decoding a compact, geometry-anchored, partition-of-unity-like decomposition of the simulated field.
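
One ingredient named above, the random Fourier feature coordinate embedding, has a standard form that is easy to show. The sketch below uses that standard form; the frequency scale `4.0`, the number of frequencies, and the function name are arbitrary assumptions, and nothing here reflects the rest of the Courant architecture.

```python
import numpy as np


def fourier_features(coords: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Map coordinates x to [cos(2*pi*B x), sin(2*pi*B x)].

    coords: (n_points, dim) physical coordinates.
    B:      (n_freqs, dim) fixed random frequency matrix, shared by all points.
    """
    proj = 2.0 * np.pi * coords @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)


rng = np.random.default_rng(0)
B = 4.0 * rng.standard_normal((8, 2))        # assumed scale and width
coords = rng.uniform(0.0, 1.0, size=(5, 2))  # five points in the unit square
feats = fourier_features(coords, B)          # shape (5, 16)
```

Sampling `B` once and reusing it for every query point is what makes the embedding "shared"; a larger frequency scale lets downstream layers resolve finer spatial detail.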

2605.25110 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Uncertainty-DTW for Sequences and Visual Tokens

Uncertainty-DTW 用于序列和视觉标记

Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz

Affiliations * School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University; School of Computer Science and Engineering, University of New South Wales

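AI Summary Proposes uncertainty-aware Dynamic Time Warping (uDTW), which achieves robust alignment through heteroscedastic uncertainty modeling and maximum likelihood estimation, generalizes it to sets of visual tokens, and reports improvements over existing methods across several domains.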

Comments Research report

详情
AI中文摘要

对齐结构化数据是计算机视觉和机器学习中的一个基本问题,支撑着时间序列分析、人类动作识别和视觉表示学习等任务。现有的对齐方法,包括动态时间规整(DTW)及其可微变体,依赖于确定性相似度度量,因此对异质和噪声特征敏感。在这项工作中,我们引入了不确定性感知对齐,这是一个概率框架,用异方差不确定性建模成对对应关系,并沿对齐路径执行结构化匹配。我们的公式,不确定性-DTW(uDTW),为每个对应分配一个正态分布,并通过最大似然估计目标参数化每条对齐路径,该目标包括(i)一个精度加权匹配项,抑制不可靠特征,以及(ii)一个对数方差正则化,防止退化解。这产生了一个概率对齐机制,对噪声具有鲁棒性且可解释,因为不确定性直接反映了匹配的可靠性。我们进一步将该框架从时间序列推广到标记化的视觉表示,从而能够对视觉标记集进行结构化匹配。学习到的不确定性可以解释为反向注意力:语义相关区域表现出低不确定性并主导对齐,而模糊/噪声区域具有高不确定性。这提供了对齐、注意力和不确定性建模之间的联系。我们在不同领域评估了所提出的框架。结果表明,与最先进的方法相比,该方法持续改进,并且学习到的不确定性与语义重要性相关。这些发现将不确定性感知对齐确立为一个通用、鲁棒且可解释的框架,用于从结构化数据中学习。

英文摘要

Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, human action recognition, and visual representation learning. Existing alignment methods, including Dynamic Time Warping (DTW) and its differentiable variants, rely on deterministic similarity measures and are therefore sensitive to heterogeneous and noisy features. In this work, we introduce uncertainty-aware alignment, a probabilistic framework that models pairwise correspondences with heteroscedastic uncertainty and performs structured matching along alignment paths. Our formulation, uncertainty-DTW (uDTW), assigns each correspondence a Normal distribution and parametrizes each alignment path by a Maximum Likelihood Estimate objective consisting of (i) a precision-weighted matching term that suppresses unreliable features, and (ii) a log-variance regularization that prevents degenerate solutions. This yields a probabilistic alignment mechanism that is robust to noise and interpretable, as uncertainty directly reflects the reliability of matches. We further generalize this framework from temporal sequences to tokenized visual representations, enabling structured matching over sets of visual tokens. The learned uncertainty can be interpreted as a reverse-attention: semantically relevant regions exhibit low uncertainty and dominate the alignment, while ambiguous/noisy regions have high uncertainty. This provides a connection between alignment, attention, and uncertainty modeling. We evaluate the proposed framework across diverse domains. The results demonstrate consistent improvements over state-of-the-art methods and show that learned uncertainty correlates with semantic importance. These findings establish uncertainty-aware alignment as a general, robust, and interpretable framework for learning from structured data.
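
The two-term objective described above (a precision-weighted matching term plus a log-variance regularizer) can be sketched as a DTW recursion over a Gaussian negative log-likelihood cost. This is a simplified illustration under assumptions: the variances `sigma2` are passed in as a fixed array rather than predicted by a network, the recursion uses a hard minimum where a differentiable variant would use a soft minimum, and the function name is hypothetical.

```python
import numpy as np


def udtw_cost(x: np.ndarray, y: np.ndarray, sigma2: np.ndarray) -> float:
    """Uncertainty-weighted DTW distance between sequences x and y.

    x: (n, d), y: (m, d), sigma2: (n, m) positive per-pair variances.
    Cell cost = ||x_i - y_j||^2 / (2 sigma2_ij) + (d / 2) * log(sigma2_ij),
    i.e. an isotropic Gaussian negative log-likelihood up to a constant.
    """
    n, m, d = len(x), len(y), x.shape[1]
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost = sq_dist / (2.0 * sigma2) + 0.5 * d * np.log(sigma2)

    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard DTW moves: insertion, deletion, match.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])
```

The log-variance term is what blocks the degenerate solution: without it, inflating every variance would drive the matching term to zero.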

2605.25107 2026-05-26 cs.LG cs.AI cs.NA math.NA 版本更新

Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems

利用规范自由度学习随机系统的非梯度种群动力学

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

Affiliations * Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA

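AI Summary Noting that existing population dynamics inference is largely restricted to gradient flows, proposes Non-Gradient Inference Flows (NGIF), which parameterizes general vector fields through a weak formulation of the continuity equation and uses selection criteria other than minimal kinetic energy; on low- and high-dimensional physics problems it improves distributional accuracy and better captures non-potential transport.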

详情
AI中文摘要

现有的种群动力学推断工作通常关注由标量势的梯度向量场产生的流。在所有与种群动力学兼容的容许流中,梯度流在特定意义下是最优的:它们最小化动能。基于不同准则选择场对应于确定种群动力学时的规范自由度,我们在本文中利用了这一点。我们提出了非梯度推断流(NGIF),一种使用连续性方程弱形式推断非梯度种群动力学的算法。这使我们能够参数化一般向量场,并选择超出最小动能的其他选择准则。我们在各种低维和高维物理问题上证明,这种更一般的方法提高了相对于梯度受限基线的分布精度,并更好地捕捉了非势输运。

英文摘要

Existing work on population dynamics inference often focuses on flows arising from vector fields that are the gradients of scalar potentials. Among all admissible flows that are compatible with the population dynamics, gradient flows are optimal in a specific sense: they minimize kinetic energy. The selection of fields based on different criteria corresponds to a gauge freedom when determining population dynamics, which we leverage in this work. We propose Non-Gradient Inference Flows (NGIF), an algorithm to infer non-gradient population dynamics using a weak formulation of the continuity equation. This allows us to parameterize general vector fields and choose other selection criteria beyond minimal kinetic energy. We demonstrate on a variety of low- and high-dimensional physics problems that this more general approach improves distributional accuracy over gradient-restricted baselines and better captures non-potential transport.
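
The gauge freedom mentioned above can be made explicit with a short derivation. The notation ($\rho_t$ for the population density, $v_t$ for the velocity field, $\varphi$ for a test function, $s_t$ for a scalar potential, $w_t$ for the gauge component) is chosen here for illustration and is not taken from the paper.

```latex
\begin{align}
  % Continuity equation for the population density
  \partial_t \rho_t + \nabla \cdot (\rho_t v_t) &= 0, \\
  % Weak form: test against a smooth, compactly supported \varphi
  \frac{d}{dt} \int \varphi \, \rho_t \, dx
    &= \int \nabla \varphi \cdot v_t \, \rho_t \, dx, \\
  % Gauge freedom: adding w_t leaves the density evolution unchanged
  \nabla \cdot (\rho_t w_t) = 0
    \;&\Longrightarrow\; v_t + w_t \text{ generates the same } \rho_t, \\
  % Gradient fields are orthogonal to such w_t in L^2(\rho_t),
  % since \int \nabla s_t \cdot w_t \, \rho_t \, dx
  %   = -\int s_t \, \nabla \cdot (\rho_t w_t) \, dx = 0
  \int |\nabla s_t + w_t|^2 \rho_t \, dx
    &= \int |\nabla s_t|^2 \rho_t \, dx + \int |w_t|^2 \rho_t \, dx
    \;\ge\; \int |\nabla s_t|^2 \rho_t \, dx.
\end{align}
```

The last line is why the gradient field is the minimal-kinetic-energy choice among all fields compatible with the same densities; choosing a different selection criterion amounts to picking a nonzero $w_t$.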

2605.25101 2026-05-26 cs.SE cs.AI cs.SY eess.SY 版本更新

Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

基于多智能体规范的FMU仿真蜕变测试

Ashir Kulshreshtha, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Kristian Klemets, Dragos Truscan, Mikael Manngård

Affiliations * University of Turku, Finland; Novia University of Applied Sciences, Finland

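AI Summary Because FMU simulation models lack explicit expected outputs, which limits conventional testing, proposes an LLM-powered multi-agent workflow that extracts metamorphic relations from specifications and interfaces and generates test cases, evaluated on a Lube Oil Cooling system FMU.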

Comments Author version. 9 pages. Accepted for publication in the 10th International Workshop on Metamorphic Testing (MET 2026) of the IEEE Conference on Computers, Software, and Applications (COMPSAC2026), June 7-10, 2026 Madrid, Spain

详情
AI中文摘要

在许多工业领域中,功能模型接口(FMI)被用于跨不同合作伙伴使用各种建模工具交换仿真模型作为功能模型单元(FMU)。这为使用FMU进行基于仿真的验证和确认以确保系统行为可靠提供了可能性。然而,由于缺乏显式预期输出,为这些仿真模型推导有效的测试预言仍然具有挑战性。这限制了需要访问系统内部工作原理的传统测试方法的适用性。蜕变测试(MT)通过利用蜕变关系(MR)解决了这一限制,但从规范中提取此类关系在很大程度上仍然是手动且容易出错的过程。为了应对这一挑战,我们提出了一种基于LLM的多智能体工作流,用于对基于FMU的仿真模型进行基于规范的蜕变测试。该方法以功能和接口规范为输入,协调多个智能体提取需求并推导MR。这些MR使用Given-When-Then模式来表达输入条件(Given)、变换(When)和预期输出行为(Then)。然后利用这些关系生成蜕变测试用例,执行仿真,并评估多个会话间的输出一致性。我们在润滑油冷却系统FMU上评估了该方法,证明了其自动生成有意义的MR和相应测试用例的能力。初步结果表明,所提出的工作流能够通过减少手动工作并改进测试生成,有效支持动态仿真模型的系统化验证和确认。

英文摘要

In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation-based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error-prone process. To address this challenge, we propose an LLM-powered multi-agent workflow for specification-based metamorphic testing of FMU-based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given-When-Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.
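
A Given-When-Then metamorphic relation of the kind described above can be sketched as a small data structure plus a checker. Everything below is a hypothetical illustration: `toy_cooler` is a closed-form stand-in written for this example, not an FMU and not the paper's Lube Oil Cooling model, and the relation itself is an assumed example rather than one extracted by the authors' workflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Inputs = Dict[str, float]


def toy_cooler(inputs: Inputs) -> float:
    """Hypothetical stand-in for a simulation run: steady-state oil outlet temperature."""
    flow = inputs["coolant_flow"]
    effectiveness = flow / (flow + 1.0)  # more coolant flow, more cooling
    return inputs["oil_in_temp"] - effectiveness * (
        inputs["oil_in_temp"] - inputs["coolant_temp"]
    )


@dataclass
class MetamorphicRelation:
    name: str
    given: Callable[[Inputs], bool]       # precondition on the source inputs
    when: Callable[[Inputs], Inputs]      # transformation producing follow-up inputs
    then: Callable[[float, float], bool]  # expected relation between the two outputs


def check(mr: MetamorphicRelation, simulate: Callable[[Inputs], float],
          source: Inputs) -> bool:
    """Run source and follow-up simulations and evaluate the relation."""
    if not mr.given(source):
        raise ValueError(f"precondition of {mr.name!r} not met")
    return mr.then(simulate(source), simulate(mr.when(source)))


more_coolant = MetamorphicRelation(
    name="more coolant flow never raises the oil outlet temperature",
    given=lambda x: x["oil_in_temp"] > x["coolant_temp"],
    when=lambda x: {**x, "coolant_flow": 1.5 * x["coolant_flow"]},
    then=lambda out_source, out_followup: out_followup <= out_source,
)
```

No expected output value is needed: the check compares two runs of the same model, which is how metamorphic testing sidesteps the missing-oracle problem.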

2605.25095 2026-05-26 cs.AI cs.LG math.OC 版本更新

RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

RECTOR: 基于优先级规则的合规感知自动驾驶轨迹选择重排序

Hadi Hajieghrary, Benedikt Walter, Chaitanya Shinde, Paul Schmitt, Miguel Hurtado

Affiliations * TORC Robotics LLC; Daimler Truck AG; Reynolds & Moore; MassRobotics

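AI Summary Proposes RECTOR, a post-generation reranking layer that scores candidate trajectories against a tiered rulebook (Safety > Legal > Road > Comfort) using differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule; without retraining the predictor, it reduces Safety+Legal violations from 28.58% to 20.42%.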

详情
AI中文摘要

自动驾驶堆栈必须从多模态候选集中选择一条轨迹;仅凭模型置信度选择会忽略安全、交通法规和舒适性约束。我们提出RECTOR(规则强制约束轨迹编排器),一种后生成重排序层,通过差异化代理和场景条件适用性机制,根据分层规则手册(安全>法律>道路>舒适)对候选轨迹进行评分,然后采用确定性ε-词典序规则进行选择,该规则通过构造保持跨层优先级——无需重新训练预测器。在Waymo开放运动数据集validation_interactive划分(43,219个增强实例,K=6)上,根据协议B(28条规则代理目录,oracle适用性),与同一候选集上仅基于置信度的选择相比,规则感知选择将安全+法律违规从28.58%降至20.42%,总违规从40.32%降至32.41%。在该基准上,均匀加权求和基线匹配了二元合规性——经验提升来自规则感知排序,而词典序保证是任何权重校准无法复制的结构性差异因素。在对抗性置信度破坏下,仅置信度选择在100%的场景中失败,而两种规则感知选择器在约96%的场景中拒绝了注入的模式。所有数据均为代理评估器结果(非安全认证),开环,5秒时域,美国规则,验证集划分。

英文摘要

Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-law, and comfort constraints. We present RECTOR (Rule-Enforced Constrained Trajectory Orchestrator), a post-generation reranking layer that scores candidates against a tiered rulebook (Safety ≻ Legal ≻ Road ≻ Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule that preserves cross-tier priority by construction, without retraining the predictor. On the Waymo Open Motion Dataset validation_interactive split (43,219 augmented instances, K=6), under Protocol B (28-rule proxy catalog, oracle applicability) rule-aware selection cuts Safety+Legal violations from 28.58% to 20.42% and Total from 40.32% to 32.41% versus confidence-only on the same candidates. A uniform-weight weighted-sum baseline matches binary compliance on this benchmark: the empirical lift comes from rule-aware ranking, while the lexicographic guarantee is the structural differentiator no weight calibration can replicate. Under adversarial confidence corruption, confidence-only selection fails in 100% of scenarios while both rule-aware selectors reject the injected mode in ~96%. All figures are proxy-evaluator results (not a safety certificate), open-loop, 5 s horizon, U.S. rules, validation split.
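
One common reading of an ε-lexicographic rule can be sketched as a tier-by-tier filter. This is an assumption about the selection step, not the authors' implementation: candidates are filtered in priority order, keeping those within `eps` of the best violation score in each tier, and model confidence only breaks ties among the survivors. The function name, score layout, and default `eps` are hypothetical.

```python
from typing import List, Sequence


def eps_lexicographic_select(violations: Sequence[Sequence[float]],
                             confidences: Sequence[float],
                             eps: float = 1e-3) -> int:
    """Return the index of the selected candidate.

    violations[i][t] is candidate i's violation score in tier t (lower is
    better), with tiers ordered by priority, e.g. Safety, Legal, Road, Comfort.
    """
    survivors: List[int] = list(range(len(violations)))
    for tier in range(len(violations[0])):
        best = min(violations[i][tier] for i in survivors)
        # Keep only candidates within eps of the best score in this tier.
        survivors = [i for i in survivors if violations[i][tier] <= best + eps]
    # Confidence is consulted last, so it can never override a higher tier.
    return max(survivors, key=lambda i: confidences[i])
```

Under this reading, a lower tier can never buy back a higher-tier violation larger than `eps`, which is the structural property a weighted sum cannot guarantee for arbitrary weights.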

2605.25091 2026-05-26 cs.AI 版本更新

Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

进化增强的多智能体强化学习用于协同空战

Chengwei Li, Junlin Liu, Yang Gao

Affiliations * Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

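AI Summary To address the low exploration efficiency, low sample utilization, and poor policy generalization of existing MARL methods in multi-aircraft cooperative air combat, proposes ACE-MAPPO, a hybrid learning framework that integrates evolutionary algorithms with MAPPO through a genetic soft update, evolutionary-augmented prioritized trajectory replay, and adversarial evolutionary curriculum learning.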

详情
AI中文摘要

随着现代空战向超视距多机协同交战演变,无人作战飞行器的自主决策面临高维状态空间、离散动作指令和强对抗动态环境的重大挑战。为克服现有基于多智能体强化学习的方法在此类场景中的局限性,即探索效率不足、样本利用率低和策略泛化能力差,我们提出了对抗课程与进化增强的多智能体近端策略优化(ACE-MAPPO),一种将进化算法与MAPPO相结合的混合学习框架。具体而言,引入了遗传软更新机制以增强种群多样性并缓解收敛到局部最优。进一步采用了进化增强的优先轨迹回放策略以提高稀疏高价值样本的利用率。此外,设计了对抗进化课程学习机制,实现难度逐渐增加的自适应训练。大量实验结果表明,所提方法在训练稳定性、收敛速度和胜率方面优于MAPPO及其他基线算法,验证了其在多机协同空战场景中的有效性。

英文摘要

As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state spaces, discrete action commands, and strongly adversarial dynamic environments. To overcome the limitations of existing multi-agent reinforcement learning (MARL) methods in such settings, namely insufficient exploration efficiency, low sample utilization, and poor policy generalization, we propose Adversarial Curriculum and Evolutionary-enhanced Multi-agent Proximal Policy Optimization (ACE-MAPPO), a hybrid learning framework that integrates evolutionary algorithms with MAPPO. Specifically, a genetic soft update mechanism is introduced to enhance population diversity and mitigate convergence to local optima. An evolutionary-augmented prioritized trajectory replay strategy is further employed to improve the utilization of sparse high-value samples. In addition, an adversarial evolutionary curriculum learning mechanism is designed to enable adaptive training with progressively increasing difficulty. Extensive experimental results demonstrate that the proposed method outperforms MAPPO and other baseline algorithms in terms of training stability, convergence speed, and win rate, validating its effectiveness in multi-aircraft cooperative air combat scenarios.

2605.25073 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

大型语言模型微调生命周期中的安全:威胁、防御、评估与未来方向

Wenjuan Li, Yitao Liu, Runze Chen, Rajkumar Buyya

Affiliations * Hangzhou Normal University; Zhejiang University; China Mobile (Zhejiang) Innovation Research Institute Co., Ltd.; Quantum Cloud Computing and Distributed Systems (qCLOUDS) Lab, School of Computing and Information Systems, The University of Melbourne

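AI Summary Systematically surveys security threats and defenses in the fine-tuning of large language models, proposes a three-phase lifecycle framework, and uses unified experiments to evaluate attack and defense effectiveness and cross-phase limitations.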

Comments 39 pages, 7 figures, 22 tables

详情
AI中文摘要

背景:微调是将预训练大型语言模型(LLM)适应下游任务的核心,但其对训练数据、参数更新和可重用组件的依赖为攻击者提供了入口。威胁已从数据投毒和权重篡改演变为智能体操纵和接口利用,然而现有综述缺乏涵盖整个微调生命周期的统一框架。目标:本文对LLM微调安全进行了系统调查,并建立了一个基于生命周期的框架来比较攻击和防御,辅以统一的实证评估。方法:我们根据干预时机将攻击和防御机制分为三个阶段:微调前、微调中和微调后。在每个阶段,我们回顾和对比策略以揭示其演变和局限性。然后在统一的模型、硬件和协议设置下评估代表性方法,并进行跨阶段实验,配对来自不同阶段的攻击和防御。结果:攻击有效性高度依赖于模型且随规模非单调变化:对早期模型有效的权重编辑攻击在现代开源LLM上失去影响;跨语言后门迁移在更大规模上报告为近乎完美,但在测试的1B-4B模型上完全失败;纯粹良性样本可以损害指令微调模型的安全对齐。单阶段防御很少能跨阶段泛化,防御有效性共同依赖于模型架构和对齐状态。结论:我们识别了关键开放问题(配置鲁棒防御、跨阶段防御组合以及超越行为假设的嵌入空间攻击),并提出了具体的未来研究方向。

英文摘要

Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle. Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation. Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases. Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state. Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.

2605.25062 2026-05-26 cs.NE cs.AI 版本更新

Cultivating Machine Intelligence: The OMEGA Shift from Top-Down Optimization to Autopoietic Cognitive Ecologies

培养机器智能:从自上而下优化到自创生认知生态的OMEGA转变

Ata G. Zare

发表机构 * Nyenrode Business University(奈恩罗德商学院)

AI总结 针对当前深度学习优化范式导致的幻觉、谄媚、奖励破解和对齐脆弱等结构性缺陷,提出RECLAIM框架,通过计算生态学而非严格优化来培养智能,并引入OMEGA转变概念。

Comments Extended preprint. A shorter version of this work is currently under peer review

详情
AI中文摘要

当前主流的人工智能范式通过代理目标的梯度下降和人类反馈的强化学习来训练神经架构。尽管能力显著,但这种自上而下的优化本质上会产生结构性故障模式,包括幻觉、谄媚、奖励破解和对齐脆弱性,这些是范式限制而非单纯的工程缺陷。为此,我们提出RECLAIM(递归的、生态的、认知的、类生命的、自适应的、智能机器)理论框架,通过计算生态学而非严格优化来培养智能。该模型由四个相互关联的理论支柱支撑:广义达尔文主义用盲目变异和选择性保留取代梯度;非主体涌现用环境物理学取代评估性奖励,从结构上防止针对人类意图的规范博弈;同时,波利亚-赫布桥将波利亚瓮动力学应用于赫布强化以实现路径依赖的特化;自由能原理被整合为环境热力学而非主体目标。该架构将自创生单元(由马尔可夫毯界定并竞争有限计算能量)置于由认知食物链和红皇后军备竞赛塑造的数据生态中。该框架表明,双过程认知、感觉特化、类比推理和内在动机是资源约束下进化的自然结果。我们将这种范式转变概念化为OMEGA转变,代表从优化和最大化到通过生成性自创生涌现的转变。

英文摘要

The dominant artificial intelligence paradigm trains neural architectures via gradient descent against proxy objectives and reinforcement learning from human feedback. While remarkably capable, this top-down optimization inherently generates structural failure modes, including hallucination, sycophancy, reward hacking, and alignment fragility, which represent paradigmatic limitations rather than mere engineering defects. In response, we introduce RECLAIM (Recursive, Ecological, Cognitive, Lifelike, Adaptive, Intelligent Machine), a theoretical framework for cultivating intelligence through computational ecology rather than engineering it through strict optimization. The model is supported by four interlocking theoretical pillars. General Darwinism replaces gradients with blind variation and selective retention, while non-agentic emergence substitutes evaluative rewards with environmental physics to structurally prevent specification gaming against human intent. Concurrently, the Polya-Hebbian bridge applies Polya urn dynamics to Hebbian reinforcement for path-dependent specialization, and the free energy principle is integrated as environmental thermodynamics rather than as an agent objective. The architecture situates autopoietic units, bounded by Markov blankets and competing for finite computational energy, within a data ecology shaped by cognitive food chains and Red Queen arms races. This framework suggests the spontaneous emergence of dual-process cognition, sensory specialization, analogical reasoning, and intrinsic motivation as natural consequences of evolution under resource constraints. We conceptualize this paradigm transition as the OMEGA shift, representing a move from optimization and maximization to emergence through generative autopoiesis.

2605.25061 2026-05-26 cs.LG cs.AI 版本更新

GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition

GL-LFGNN:基于Liang-Kleeman信息流的全局-局部双分支因果图神经网络用于脑电情感识别

Ziyi Wang, Dongyang Kuang

发表机构 * School of Mathematics (Zhuhai), Sun Yat-sen University, Zhuhai, China(中山大学数学学院(珠海))

AI总结 提出GL-LFGNN模型,利用Liang-Kleeman信息流理论构建有向因果图,通过全局-局部双分支架构整合全脑与区域连接,在MEEG数据集上以少量参数实现高精度情感识别。

Comments 10 pages, 3 figures

详情
AI中文摘要

基于脑电的情感识别在客观诊断情绪障碍方面具有重要前景。图神经网络已成为建模脑电通道间依赖关系的主流范式,但现有方法依赖于基于空间邻近性或功能相关性导出的对称邻接矩阵,这些矩阵本质上捕捉的是统计关联而非有向因果影响,这与神经信息流固有的非对称、因果驱动特性相冲突。为弥合这一差距,我们提出GL-LFGNN,一种基于Liang-Kleeman信息流理论的全局-局部双分支因果图神经网络。与仅评估时间优先性的格兰杰因果不同,我们的方法从动力系统角度严格量化因果强度,生成神经生理学可解释的有向图。双分支架构进一步将全脑连接性与符合既定功能神经解剖学的区域特定处理相结合。在MEEG数据集上,GL-LFGNN仅用37K参数(约为当前最优模型的10%)便达到86.17%(唤醒度)和86.71%(效价)的准确率,表明原则性的因果建模可同时增强可解释性、泛化能力和计算效率。代码将开源。

英文摘要

EEG-based emotion recognition holds significant promise for objective diagnosis of mood disorders. Graph neural networks (GNNs) have emerged as the dominant paradigm for modeling inter-channel dependencies in EEG, yet existing approaches rely on symmetric adjacency matrices derived from spatial proximity or functional correlations that fundamentally capture statistical associations rather than directed causal influences, which conflicts with the inherently asymmetric, causally-driven nature of neural information flow. To bridge this gap, we propose GL-LFGNN, a Global-Local Dual-branch Causal Graph Neural Network grounded in Liang-Kleeman information flow theory. Unlike Granger causality, which merely assesses temporal precedence, our approach rigorously quantifies causal strength from a dynamical systems perspective, yielding neurophysiologically interpretable directed graphs. A dual-branch architecture further integrates whole-brain connectivity with region-specific processing aligned to established functional neuroanatomy. On the MEEG dataset, GL-LFGNN achieves 86.17% (Arousal) and 86.71% (Valence) accuracy with only 37K parameters (approximately 10% of the current state-of-the-art), demonstrating that principled causal modeling can simultaneously enhance interpretability, generalization, and computational efficiency. Code will be released.

2605.25058 2026-05-26 cs.HC cs.AI 版本更新

Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction

意图信号理论:人机交互中意图状态控制的计算框架

Gang Peng

发表机构 * Huizhou Lateni AI Technology Co., Ltd.(惠州莱尼人工智能技术有限公司) Huizhou University(惠州大学)

AI总结 提出意图信号理论(IST),通过区分潜在源意图、可观测意图代理、编码载体和模型输出四个对象,形式化意图丢失定理,并基于六种大语言模型、三种语言和三个任务领域的实验验证了结构-保真度分裂等预测,将提示工程重新定义为意图协议设计。

Comments 10 pages, 2 figures. Theoretical framework paper grounded in four companion empirical studies. Data and code repository: https://github.com/PGlarry/prompt-protocol-specification

详情
AI中文摘要

当前的人工智能交互模型将提示视为主要的交换对象,忽略了一个关键层面:用户的潜在源意图,即提示之前并激发提示的目标状态。这里我们引入意图信号理论(IST),这是一个形式化这一缺失意图层的计算框架。IST区分了四个通常被混淆的对象:潜在源意图(I*)、可观测意图代理(I-hat)、编码载体(P)和模型输出(O)。它形式化了维度权重、编码掩码、结构和保真度恢复分数以及公私意图分解。不可逆意图丢失定理确立了:载体中缺失的私有意图无法通过通用替换恢复。来自四项配套研究的证据(涵盖六种大语言模型、三种语言和三个任务领域)显示了与IST预测一致的结构-保真度分裂、人类验证的度量分离以及权重容忍平台。IST将提示工程重新定义为意图协议设计,并识别了当前人工智能系统所缺乏的一个计算层面。

英文摘要

Current AI interaction models treat the prompt as the primary object of exchange, omitting a critical layer: the user's latent source intent, the goal state preceding and motivating the prompt. Here we introduce Intent Signal Theory (IST), a computational framework that formalises this missing intent layer. IST distinguishes four objects routinely conflated: latent source intent (I*), observable intent proxy (I-hat), encoded carrier (P), and model output (O). It formalises dimensional weights, encoding masks, structural and fidelity recovery scores, and public-private intent decomposition. The Theorem of Irreversible Intent Loss establishes that private intent absent from the carrier cannot be recovered beyond generic substitution. Evidence from four companion studies spanning six LLMs, three languages and three task domains shows structural-fidelity splits, human-validated metric dissociation, and weight-tolerance plateaus consistent with IST's predictions. IST reframes prompt engineering as intent-protocol design and identifies a computational layer that current AI systems lack.

2605.25045 2026-05-26 cs.AI 版本更新

AION: Next-Generation Tasks and Practical Harness for Time Series

AION:下一代时间序列任务与实用框架

Tianxiang Zhan, Xiaobao Song, Tong Guan, Shirui Pan, Ming Jin

发表机构 * Griffith University(格里菲斯大学) Shenzhen University(深圳大学) Zhejiang University(浙江大学)

AI总结 针对时间序列研究向结合预测、上下文推理、工具使用和结构化决策支持的现实任务转变,提出AION框架,通过时间锚定、知识推理和可靠性机制(如实验后分析和分层审查)实现更详细的过程追踪和审查步骤。

Comments Project page and code are available at https://github.com/ztxtech/aion

详情
AI中文摘要

时间序列研究正从固定的预测基准转向结合预测、上下文推理、工具使用和结构化决策支持的现实任务。大多数基准基于干净数据和短评估循环构建;仅靠智能体可能会在最终输出前忽略时间约束、证据检查或审查。我们首先将下一代时间序列任务形式化为由任务文件、工作空间和验证接口组成的三元组。然后,我们提出AION,一个由六个组件组(智能体、技能、规则、记忆、评估和协议)构建的时间序列框架。在该框架中,我们使用三个设计原则:时间锚定、时间知识推理以及可靠性机制(如实验后分析和分层审查)。Kaggle商店销售案例研究表明,与在OpenCode直接构建模式下运行的相同基础智能体相比,该框架产生了更详细的过程追踪、更多工件和更多审查步骤。综合来看,这些结果支持从固定任务向现实世界约束下的现实任务的范式转变。

英文摘要

Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support. Most benchmarks are built around clean data and short evaluation loops; agents alone may miss temporal constraints, evidence checks, or review before finalizing outputs. We first formalize next-generation time series tasks as three-component tuples consisting of a task file, a workspace, and a validation interface. We then present AION, a time series harness built from six component groups: agents, skills, rules, memory, evaluation, and protocols. In this harness, we use three design principles: temporal grounding, temporal knowledge-grounded reasoning, and reliability mechanisms such as post-experiment analysis and layered review. A Kaggle Store Sales case study shows that the harness produces more detailed process traces, more artifacts, and more review steps than the same base agent operating in OpenCode direct build mode. Taken together, these results argue for a paradigm shift from fixed tasks to realistic ones under real-world constraints.
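The three-component task tuple described above (task file, workspace, validation interface) can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the class name `TimeSeriesTask`, its field names, the helper `visible_rows`, and the demo data are all hypothetical and are not taken from the AION code base. The cutoff filter illustrates one plausible reading of "temporal grounding", namely that an agent should only see observations dated on or before the forecast origin.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Tuple

Row = Tuple[date, float]  # one (timestamp, value) observation


@dataclass(frozen=True)
class TimeSeriesTask:
    """Hypothetical container for the (task file, workspace, validation) tuple."""
    task_file: str                            # natural-language task description
    workspace: Dict[str, List[Row]]           # named series available to the agent
    validate: Callable[[List[float]], bool]   # validation interface for a forecast


def visible_rows(rows: List[Row], cutoff: date) -> List[Row]:
    """Temporal grounding (assumed reading): hide anything dated after the cutoff."""
    return [(d, v) for d, v in rows if d <= cutoff]


def make_demo_task() -> TimeSeriesTask:
    """Build a toy task with five daily observations (made-up numbers)."""
    sales = [(date(2017, 8, day), float(100 + day)) for day in range(1, 6)]
    return TimeSeriesTask(
        task_file="Forecast the next 2 days of store sales.",
        workspace={"sales": sales},
        # Accept exactly two non-negative forecast values.
        validate=lambda forecast: len(forecast) == 2 and all(x >= 0 for x in forecast),
    )


if __name__ == "__main__":
    task = make_demo_task()
    history = visible_rows(task.workspace["sales"], cutoff=date(2017, 8, 3))
    print(len(history), task.validate([104.0, 105.0]))
```

Keeping validation as a callable on the task itself means a harness can reject a malformed output before any review step runs, which is in the spirit of the reliability mechanisms the abstract mentions.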

2605.25036 2026-05-26 cs.CL cs.AI 版本更新

Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation

LVLMs中的语言偏差:从深入分析到简单有效的缓解方法

Yangneng Chen, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳))

AI总结 本文系统研究了大视觉语言模型中的语言偏差问题,发现其根源在于训练中的模态未对齐,并提出了两种简单有效的缓解方法:语言偏差正则化(LBR)和语言偏差惩罚(LBP)。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉语言模型(LVLMs)通过视觉理解扩展了大型语言模型,但仍然容易产生幻觉,即输出流畅但与图像不一致。最近的研究将这一问题与语言偏差联系起来——LVLMs过度依赖文本而忽视视觉输入的倾向。然而,大多数分析仍然是经验性的,没有揭示其根本原因。在本文中,我们对语言偏差进行了系统研究,并确定其根源在于训练过程中的模态未对齐。我们的分析表明,视觉指令微调(VIT)和直接偏好优化(DPO)通常优先考虑文本改进,这可能导致LVLMs过度倾向于语言建模,而不是平衡的多模态理解。为了解决这个问题,我们提出了两种简单而有效的方法:语言偏差正则化(LBR),通过在指令微调期间进行正则化来缓解语言偏差;以及语言偏差惩罚(LBP),在DPO训练过程中惩罚语言偏差。跨多种模型和基准的大量实验证明了我们方法的有效性。LBR在十多个通用基准上持续提高性能,而LBP显著减少了幻觉并提高了可信度。这些方法共同不仅缓解了语言偏差,还促进了LVLMs的整体对齐,且无需引入任何额外数据或辅助模型。我们的代码公开在https://github.com/lab-klc/LVLM-Language-Bias。

英文摘要

Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias, the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab-klc/LVLM-Language-Bias.

2605.25022 2026-05-26 cs.CV cs.AI 版本更新

D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation

D3S2: 扩散引导的语义分割数据集蒸馏

Wenjie Zheng, Haoji Hu, Jiali Lu, Xingze Zou, Jing Wang

发表机构 * Zhejiang University(浙江大学)

AI总结 针对语义分割数据集蒸馏中的长尾类别不平衡、像素级对齐和高计算成本问题,提出两阶段框架D3S2,通过类别平衡掩码选择和扩散引导图像合成生成紧凑训练集,在极低压缩率下显著提升分割性能。

详情
AI中文摘要

数据集蒸馏旨在将大规模数据集压缩为紧凑的合成集,同时保持训练效果。然而,现有研究主要关注图像分类,而语义分割等密集预测任务尚未充分探索。本文识别了分割数据集蒸馏的三个关键挑战:(i) 长尾类别不平衡,(ii) 图像与密集标签之间严格的像素级对齐需求,以及(iii) 使用复杂模型优化高分辨率数据的高计算成本。为应对这些挑战,我们提出D3S2,一种扩散引导的语义分割数据集蒸馏框架。我们的方法采用两阶段设计。在类别平衡掩码选择中,我们通过优先考虑低表示类别的贪婪策略构建代表性掩码集。在扩散引导图像合成中,我们使用预训练的布局到图像扩散模型生成以所选掩码为条件的图像,自然确保空间对齐。为进一步增强合成数据的训练效用,我们引入具有两个互补目标的引导扩散采样:用于像素级对齐的分割一致性损失,以及用于对齐跨层每类特征统计的类级特征匹配损失。大量实验证明了D3S2的优越性。值得注意的是,在1%的极低压缩率下,我们的方法在ADE20K和COCO-Stuff上使用Mask2Former (Swin-S)分别达到24.99%和35.49%的mIoU,比随机选择分别高出9.34%和5.70%。

英文摘要

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long-tailed class imbalance, (ii) the need for strict pixel-wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high-resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion-guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two-stage design. In Class-Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion-Guided Image Synthesis, we employ a pretrained layout-to-image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation-consistency loss for pixel-level alignment, and a class-wise feature matching loss for aligning per-class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO-Stuff with Mask2Former (Swin-S), outperforming random selection by 9.34% and 5.70%, respectively.
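The greedy, class-balanced mask selection stage can be illustrated with a short sketch. The abstract only states that the greedy strategy prioritizes underrepresented classes, so the scoring rule below (each class in a mask contributes 1 / (1 + times already selected)) is an assumption chosen for illustration, and the function name `select_masks` and the toy mask identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def select_masks(mask_classes: Dict[str, Set[int]], budget: int) -> List[str]:
    """Greedily pick up to `budget` masks, favouring classes selected least so far.

    mask_classes maps a mask identifier to the set of class ids present in it.
    The gain function is an illustrative assumption, not the paper's criterion.
    """
    counts: Counter = Counter()          # how often each class has been selected
    remaining = dict(mask_classes)       # candidates not yet chosen
    chosen: List[str] = []

    while remaining and len(chosen) < budget:
        def gain(name: str) -> float:
            # Rare (so far) classes contribute more than frequent ones.
            return sum(1.0 / (1 + counts[c]) for c in remaining[name])

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        counts.update(remaining.pop(best))

    return chosen


if __name__ == "__main__":
    masks = {"a": {0}, "b": {0}, "c": {0, 1}, "d": {2}}
    print(select_masks(masks, budget=2))
```

With this toy input the first pick is the mask covering two classes, and the second pick is the mask containing the still-unseen class, so a budget of two already covers all three classes.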

2605.25020 2026-05-26 cs.AI cs.CL 版本更新

Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

慢性皮肤病纵向数据检索中的隐私保护本地语言模型:在天疱疮患者中的实施

Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen, Vefa Asli Erdemir, Mehmet Salih Gurel, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran

发表机构 * Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London(系统医学系,代谢、消化与生殖部,伦敦帝国理工学院) Department of Dermatology and Venereology, Istanbul Research and Training Hospital(皮肤科与性病科,伊斯坦布尔研究与培训医院) Department of Dermatology and Venereology, Istanbul Medeniyet University(皮肤科与性病科,伊斯坦布尔梅德尼耶特大学) Department of Dermatology and Venereology, Istanbul Medicana Atakoy Hospital(皮肤科与性病科,伊斯坦布尔Medicana阿塔科伊医院)

AI总结 本研究评估了本地部署的隐私保护小型语言模型(SLM)在天疱疮患者长期随访记录中检索结构化临床特征并生成纵向摘要的能力,结果显示SLM在特征检索任务中平均准确率达82.25%,且医生对AI生成摘要的质量、临床准确性和实用性评分较高。

详情
AI中文摘要

慢性皮肤病如天疱疮需要长期随访,产生大量纵向临床文档,在常规就诊期间难以全面审查,增加了临床医生的工作量以及遗漏关键历史信息的风险。我们评估了本地部署的隐私保护小型语言模型(SLM)是否能够从长期皮肤科随访记录中检索结构化临床特征并生成纵向摘要。在这项回顾性病例系列研究中,30名天疱疮患者贡献了541份就诊记录,汇总为完整的纵向记录(89,336词);由两位皮肤科专家标注了56个临床相关特征。本地部署的SLM(Qwen3 4B Thinking 2507)对每份完整记录进行查询,以检索56个特征并生成一份最终报告摘要。在1,680个特征检索任务中,平均准确率为82.25%。皮肤科医生对AI生成摘要的整体质量(8.23-8.47)、临床准确性(7.93-8.20)和实用性(8.47-8.50)评分较高,评估者间无显著差异,且在53.3%的评估中总体偏好AI摘要。这些发现表明,隐私保护的本地部署SLM可以优于医学专家,并可靠地生成有临床意义的纵向摘要。在适当监督下,SLM可以支持临床决策。

英文摘要

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve the 56 features and generate one final report summary. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

2605.25004 2026-05-26 cs.LG cs.AI 版本更新

Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data

基于多源数据的都市尺度弹性可信交通流推断

Qishen Zhou, Yifan Zhang, Michail A. Makridis, Anastasios Kouvelas, Yibing Wang, Simon Hu

发表机构 * School of Transportation, Jilin University(吉林大学交通运输学院) Department of Computer Science, City University of Hong Kong (Dongguan)(香港城市大学(东莞)计算机科学系) Institute for Transport Planning and Systems, ETH Zurich(苏黎世联邦理工学院交通规划与系统研究所) Institute of Intelligent Transportation Systems, College of Civil Engineering and Architecture, Zhejiang University(浙江大学智能交通系统研究所) ZJU-UIUC Institute, Zhejiang University(浙江大学-UIUC研究院)

AI总结 提出任务感知注意力神经过程(TA-ANP)统一概率框架,融合浮动车数据和稀疏固定检测器数据,实现高精度、可信的不确定性量化的全局交通状态推断,并在都市尺度数据集上取得最优性能。

Comments The paper has been submitted to Elsevier for possible publication

详情
AI中文摘要

从稀疏观测中以高精度和可信的不确定性量化推断网络级交通状态对于智能交通系统至关重要,但由于问题的欠定性、传感网络的多方面干扰以及多个推断子任务在联合建模时的固有冲突,这仍然具有挑战性。我们提出了任务感知注意力神经过程(TA-ANP),这是一个统一的概率框架,通过融合浮动车数据(FCD)和稀疏的固定检测器测量,实现弹性且可信的全局交通状态推断(GTSI)。通过将GTSI视为一个随机过程,TA-ANP利用神经过程的元学习特性,无需重新训练即可快速适应传感配置的变化。引入了一个具有不同时空归纳偏置的任务感知多查询注意力模块,以联合处理三个GTSI子任务,同时减轻跨任务干扰。对于不确定性量化,我们将神经过程与蒙特卡洛丢弃法相结合,以捕获偶然不确定性和认知不确定性。为了支持都市尺度评估,我们构建了都市多源交通数据集(MMTD),该数据集整合了固定环路传感器测量、FCD统计数据和OpenStreetMap道路网络数据,覆盖了包含2371个路段的城市网络。在MMTD上的实验表明,TA-ANP在确定性和概率性指标下的所有子任务中均达到了最先进的性能。由此产生的良好校准的不确定性使得能够以更少的传感器部署实现更高效的固定传感器布局。在“损坏-修复-新增”传感生命周期下,TA-ANP在干扰吸收、性能恢复和对未见传感配置的适应性方面表现出卓越的弹性。

英文摘要

Inferring network-wide traffic states from sparse observations with high accuracy and trustworthy uncertainty quantification is essential for intelligent transportation systems, yet it remains challenging due to the underdetermined nature of the problem, multifaceted disturbances in sensing networks, and the inherent conflicts among multiple inference sub-tasks when modeled jointly. We propose the Task-Aware Attentive Neural Process (TA-ANP), a unified probabilistic framework for resilient and trustworthy global traffic state inference (GTSI) by fusing floating car data (FCD) with sparse fixed-detector measurements. By casting GTSI as a stochastic process, TA-ANP leverages the meta-learning properties of neural processes to adapt rapidly to changes in sensing configurations without retraining. A task-aware multi-query attention module with distinct spatiotemporal inductive biases is introduced to jointly handle three GTSI sub-tasks, while mitigating cross-task interference. For uncertainty quantification, we combine neural processes with Monte Carlo Dropout to capture both aleatoric and epistemic uncertainty. To support metropolis-scale evaluation, we construct the Metropolitan Multi-Source Traffic Dataset (MMTD), integrating fixed-loop sensor measurements, FCD statistics, and OpenStreetMap road-network data over an urban network of 2,371 road segments. Experiments on MMTD show that TA-ANP achieves state-of-the-art performance across all sub-tasks under deterministic and probabilistic metrics. The resulting well-calibrated uncertainties enable more efficient fixed-sensor placement with fewer sensor deployments. Under a Damage-Repair-Addition sensing lifecycle, TA-ANP demonstrates superior resilience in terms of disturbance absorption, performance recovery, and adaptability to unseen sensing configurations.
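The abstract says that Monte Carlo Dropout is combined with neural processes to capture both aleatoric and epistemic uncertainty. A commonly used way to separate the two from repeated stochastic forward passes is sketched below; it is an assumption that TA-ANP uses this exact decomposition, and the function name `decompose_uncertainty` and the sample numbers are hypothetical. Each pass is assumed to output a predictive mean and a predictive variance for one road segment.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def decompose_uncertainty(
    means: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Split predictive variance from T dropout passes into two parts.

    aleatoric: average of the per-pass predicted variances (data noise)
    epistemic: variance of the per-pass predicted means (model disagreement)
    Returns (aleatoric, epistemic, total), with total the sum of the two.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need one mean and one variance per stochastic pass")
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic, aleatoric + epistemic


if __name__ == "__main__":
    # Two dropout passes for a single segment (made-up numbers).
    print(decompose_uncertainty(means=[1.0, 3.0], variances=[0.5, 1.5]))
```

Under this reading, sensor placement would favour segments where the epistemic term is large, since that part can in principle be reduced by adding observations, whereas the aleatoric term cannot.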

2605.24999 2026-05-26 q-bio.NC cs.AI cs.MA 版本更新

Interpretation, Learning, and Empathy as One Constraint: A Residual-Adequacy Architecture with Accountable Abstention

解释、学习与共情作为单一约束:具有可问责弃权的残差充分性架构

Chainarong Amornbunchornvej

发表机构 * National Electronics and Computer Technology Center (NECTEC)(国家电子与计算机技术中心)

AI总结 提出一种认知架构,通过单一残差量统一处理解释、学习和共情,当情境超出表征能力时产生带类型和见证的弃权。

Comments First draft for journal submission. The code is at https://github.com/DarkEyes/RC-Arch

详情
AI中文摘要

一个智能体必须对当前情境采取行动,学习它尚无法表征的内容,并充分建模其他智能体以进行协调。这些能力通常由独立的机制实现,但它们共享一种失败模式:情境可能超出智能体当前能表征的范围,此时诚实的回应是原则性的拒绝,并说明缺失了什么。我们开发了一个小型认知架构,其中这些限制源于单一量。一个解释-决策单元(IDU)通过一组体制(具有私有基的局部表征框架)解释内容向量,并决定其许可哪些行动;内容相对于活跃体制表征范围的标量残差驱动该单元。低残差且许可清晰时发出行动;否则单元重新解释、尝试描述长度合理的扩展,或停止并给出带类型和见证的终止。我们证明该单元是总且确定性的:对于任何内容和固定配置,它在有限有界步数内停止,并带有唯一终止见证,因此弃权由构造携带其原因。通过绑定架构的开放参数而不改变其机制,相同的残差-范围约束在三个范围上恢复了三个有记录的现象:不知的类型学(类型化弃权);智能体之间的强制误解,局限于一个共享概念且对犯错的智能体不可见(有界共情);以及学习中的先决条件依赖,源于有界关注窗口而非假设(发展先决条件)。每个实例化都针对自然智能体和人工智能体进行了阐述,并提出了可证伪的预测,因此一个约束可以模拟人类和机器认知中的限制。该工作提供了一种统一和一种可问责弃权的概念,通过构造带有类型和见证。

英文摘要

An agent must act on the situation before it, learn what it cannot yet represent, and model other agents well enough to coordinate. These faculties are usually realized by separate mechanisms, yet they share a failure mode: the situation can exceed what the agent can currently represent, and the honest response is then a principled refusal that says what was missing. We develop a small cognitive architecture in which these limits arise from a single quantity. An Interpretation-Decision Unit (IDU) interprets a content vector through a family of regimes (local representational frames with private bases) and decides which actions it licenses; a scalar residual of the content against the active regimes' representational scope drives the unit. Low residual with a clean licensing emits an action; otherwise the unit re-interprets, attempts a description-length-justified expansion, or halts with a typed, witnessed terminal. We prove the unit is total and deterministic: for any content and fixed configuration it halts in finitely many bounded-cost steps with a unique terminal witness, so abstention carries its cause by construction. By binding the architecture's open parameters without changing its mechanics, the same residual-against-scope constraint recovers three documented phenomena at three scopes: the typology of not-knowing (typed abstention); a forced misunderstanding between agents, localized to one shared concept and invisible to the agent committing it (bounded empathy); and prerequisite dependence in learning derived from a bounded focus window rather than posited (developmental prerequisites). Each instantiation is worked for a natural and an artificial agent and states a falsifiable prediction, so one constraint can model limits in both human and machine cognition. The account contributes a unification and a notion of accountable abstention, typed and witnessed by construction.

2605.24993 2026-05-26 cs.AI cs.CV 版本更新

NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

NeurIPS: 基于球面的脑解码的神经解剖学归纳先验

Sijin Yu, Zijiao Chen, Zhenyu Yang, Zihao Tan, Jiakun Xu, Zhongliang Liu, Shengxian Chen, Wenxuan Wu, Xiangmin Xu, Xin Zhang

发表机构 * South China University of Technology(华南理工大学) Stanford University(斯坦福大学) King's College London(伦敦国王学院) Foshan University(佛山大学) Pazhou Lab(琶洲实验室)

AI总结 提出NeurIPS框架,通过选择性ROI球形分词器和结构引导专家混合模型,将解剖变异转化为归纳先验,在自然场景数据集上实现表面解码器最先进性能,并显著提升训练效率。

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

当前的fMRI解码器面临性能-保真度权衡,其中高效的1D编码器优于几何保真的表面模型。我们认为这部分是由于低效的表面分词化以及未能将解剖学用作预测信号。我们提出NeurIPS,一个通过将解剖变异从干扰因素重新定义为强大的归纳先验来改进表面解码的框架。NeurIPS结合了两项创新:用于高效几何编码的选择性ROI球形分词器(SRST),以及使用皮层特征显式建模个体解剖的结构引导专家混合模型(SG-MoE)。在自然场景数据集上,NeurIPS为表面解码器建立了新的最先进水平,并实现了与强1D基线相当的性能。这是以空前的效率实现的,因为模型收敛速度显著加快(10个epoch对比600个epoch)。这种效率使得仅使用20%的数据即可快速适应新受试者,并确保随着训练队列扩大而稳健扩展。消融实验提供了因果证据,表明这些收益源于模型使用皮层特征,而非记忆受试者ID。通过利用解剖先验,NeurIPS为稳健、可泛化的脑解码提供了一条有原则且可扩展的路径。

英文摘要

Current fMRI decoders face a performance-fidelity trade-off where efficient 1D encoders outperform geometrically faithful surface-based models. We argue this is partly driven by inefficient surface tokenization and the failure to use anatomy as a predictive signal. We present NeurIPS, a framework that improves surface-based decoding by reframing anatomical variation from a nuisance to a powerful inductive prior. NeurIPS unites two innovations: a Selective ROI Spherical Tokenizer (SRST) for efficient geometric encoding, and a Structure-Guided Mixture of Experts (SG-MoE) that explicitly models individual anatomy using cortical features. On the Natural Scenes Dataset, NeurIPS establishes a new state of the art for surface decoders and achieves performance comparable to strong 1D baselines. This is achieved with unprecedented efficiency, as the model converges dramatically faster (10 vs. 600 epochs). This efficiency enables rapid adaptation to new subjects using only 20% of data and ensures robust scalability as the training cohort is expanded. Ablations provide causal evidence that these gains are driven by the model's use of cortical features, not by memorizing subject IDs. By leveraging anatomical priors, NeurIPS provides a principled and scalable path toward robust, generalizable brain decoding.

2605.24992 2026-05-26 cs.NI cs.AI cs.LG cs.MA 版本更新

Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

面向任务驱动无人机网络的能量感知多智能体强化学习扩展与个体奖励

Changling Li, Ying Li

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系) Department of Computer Science, Colby College(科尔比学院计算机科学系)

AI总结 提出基于个体奖励函数的能量感知多智能体强化学习模型,利用深度Q网络解决无人机网络动态环境和电池容量限制下的轨迹规划问题,实验表明在任务密度高时成功率接近100%,且扩展性优于共享奖励模型。

Comments IEEE Internet of Things Journal

详情
Journal ref
vol. 12, no. 8, pp. 10640-10654, 2025
AI中文摘要

多智能体强化学习(MARL)因其通过交互学习的能力,在自动驾驶和智慧城市等协作系统中显示出广泛适用性。随着无人机网络的最新发展,研究人员也应用MARL来解决轨迹规划问题。然而,动态环境和有限的电池容量仍然是使用MARL实现高效协作任务执行的挑战。在本文中,我们提出了一种能量感知的MARL模型作为应对这些挑战的尝试,利用深度Q网络(DQN)和由任务执行进度及无人机剩余电量驱动的个体奖励函数。我们对所提出的模型进行了一系列仿真研究,并将其与共享奖励MARL进行比较,以探索MARL中信用分配的影响。结果表明,无论任务位置和长度如何,我们提出的模型都能达到至少80%的成功率。与共享奖励模式类似,个体奖励模式在任务密度高时可以获得更好的成功率,并且当任务密度接近40%时,几乎可以达到100%的成功率。我们提出的个体奖励模型的真正优势在环境扩展时得以显现。与共享奖励MARL的比较表明,我们提出的模型对环境大小和智能体数量的变化更加鲁棒。由于目标的清晰性,它可以用更少的步骤实现更高的成功率,从而更好地提高能源效率。

英文摘要

Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities because of its ability to learn through interaction. With the recent development of drone networks, researchers have also applied MARL to trajectory planning problems. However, the dynamic environment and limited battery capacity still make it challenging to use MARL for efficient collaborative task execution. In this paper, we propose an energy-aware MARL model as an attempt to tackle these challenges, leveraging Deep Q-Networks (DQN) with individual reward functions driven by task execution progress and the remaining battery of the drones. We conduct a set of simulation studies for the proposed model and compare it with the shared reward MARL [Li2022MARL] to explore the impact of credit assignment in MARL. The results indicate that our proposed model can achieve a success rate of at least 80% regardless of the task locations and lengths. Similar to the shared reward mode, the individual reward mode achieves a better success rate when the task density is high, and it reaches a success rate of nearly 100% when the task density gets close to 40%. The true advantage of our proposed model with individual rewards is revealed when scaling up the environment. The comparison with the shared reward MARL shows that our proposed model is more robust to changes in environment size and agent numbers. Because of the clarity of the goal, it achieves a higher success rate with fewer steps, which further improves energy efficiency.

2605.24989 2026-05-26 cs.LG cs.AI cs.IR 版本更新

Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration

基于不确定性触发的特征路径探索的点击率预测选择性测试时计算扩展

Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 针对点击率预测中训练数据稀疏导致的不确定性,提出无需训练、模型无关的UTTSI框架,通过双信号估计器区分认知不确定性和偶然模糊性,对不确定实例进行自适应特征过滤和随机特征路径探索,在保持最坏延迟不变的情况下实现平均约2.8倍基础模型开销,实验和在线A/B测试均取得显著提升。

Comments 12 pages, 4 Figures, 3 Tables

详情
AI中文摘要

扩展测试时计算对语言模型已被证明非常有效,然而这一机会在工业点击率(CTR)预测中仍未得到充分探索。CTR模型存在一个根本的不对称性:训练中充分表示的特征组合产生自信的预测,而稀疏观察到的特征组合则产生不可靠的输出。现有的训练阶段解决方案(如自适应门控)学习一个固定的选择函数,但受限于相同的稀疏性,在部署时无法提供针对每个实例的补救措施。我们提出UTTSI(不确定性触发的测试时选择性推理),一个无需训练、模型无关的框架,将推理深度按比例扩展到每个实例的不确定性。一个结合模型logit置信度和数据级频率先验的双信号估计器区分认知不确定性和偶然模糊性。每个实例都经过自适应特征过滤以去除不可靠的嵌入;不确定的实例额外接受随机特征路径探索,其预测通过一致性加权集成进行聚合。自信的实例完全绕过探索,保持平均开销约为基础模型成本的2.8倍,最坏情况延迟不变。在四个数据集和三种骨干架构上的实验表明,与所有训练阶段基线相比,取得了持续且统计显著的增益。为期七天的在线A/B测试进一步证实了5.3%的相对CTR提升(p < 0.01),确立了选择性测试时计算分配作为CTR预测训练阶段进展的实用补充。

英文摘要

Scaling test-time compute has proven highly effective for language models, yet this opportunity remains largely unexplored for industrial Click-Through Rate (CTR) prediction. CTR models suffer from a fundamental asymmetry: feature combinations well-represented in training yield confident predictions, while sparsely observed ones produce unreliable outputs. Existing training-phase solutions such as adaptive gating learn a fixed selection function subject to the same sparsity, offering no per-instance recourse at deployment. We propose UTTSI (Uncertainty-Triggered Test-Time Selective Inference), a training-free, model-agnostic framework that scales inference depth proportionally to per-instance uncertainty. A dual-signal estimator combining model logit confidence with a data-level frequency prior distinguishes epistemic uncertainty from aleatoric ambiguity. Every instance undergoes adaptive feature filtering to remove unreliable embeddings; uncertain instances additionally receive stochastic feature-path explorations whose predictions are aggregated via consistency-weighted ensembling. Confident instances bypass exploration entirely, keeping average overhead at approximately 2.8× base model cost with worst-case latency unchanged. Experiments on four datasets with three backbone architectures demonstrate consistent, statistically significant gains over all training-phase baselines. A seven-day online A/B test further confirms a 5.3% relative CTR gain (p < 0.01), establishing selective test-time compute allocation as a practical complement to training-phase advances for CTR prediction.
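The control flow described above (confident instances take a single forward pass, uncertain ones trigger several stochastic feature-path explorations that are then ensembled) can be sketched as follows. This is a simplified illustration under several assumptions: the thresholds, the random feature-dropping scheme, the exponential consistency weights, and every name (`selective_predict`, `toy_model`, `min_freq`) are hypothetical, and the adaptive feature filtering step that the abstract applies to every instance is omitted.

```python
import math
import random
from typing import Callable, Dict, Tuple

Features = Dict[str, float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(features: Features) -> float:
    """Stand-in CTR model: sigmoid of the summed feature contributions."""
    return sigmoid(sum(features.values()))


def selective_predict(
    model: Callable[[Features], float],
    features: Features,
    min_freq: int,              # training frequency of the rarest feature (assumed signal)
    conf_threshold: float = 0.2,
    freq_threshold: int = 50,
    num_paths: int = 8,
    drop_prob: float = 0.3,
    seed: int = 0,
) -> Tuple[float, int]:
    """Return (prediction, number of forward passes used)."""
    base = model(features)
    # Dual signal (simplified): far from 0.5 AND well covered by training data.
    confident = abs(base - 0.5) >= conf_threshold and min_freq >= freq_threshold
    if confident:
        return base, 1

    rng = random.Random(seed)
    preds = []
    for _ in range(num_paths):
        # One stochastic feature path: randomly drop a subset of features.
        path = {k: v for k, v in features.items() if rng.random() > drop_prob}
        preds.append(model(path))

    # Consistency weighting: paths close to the consensus count for more.
    center = sum(preds) / len(preds)
    weights = [math.exp(-abs(p - center)) for p in preds]
    ensembled = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
    return ensembled, 1 + num_paths


if __name__ == "__main__":
    x = {"a": 2.0, "b": 1.0}
    print(selective_predict(toy_model, x, min_freq=1000))  # confident: one pass
    print(selective_predict(toy_model, x, min_freq=3))     # uncertain: extra paths
```

Because only the uncertain branch pays for the extra passes, the average cost depends on how often that branch fires, which is consistent with the abstract reporting an average overhead rather than a fixed one.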

2605.24975 2026-05-26 cs.RO cs.AI cs.LG 版本更新

Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

弥合差距:实现软演员-评论家算法用于高性能腿部运动

Gianluca Sabatini, Chenhao Li, Marco Hutter

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 本文通过识别软演员-评论家(SAC)在并行训练中性能不足的根本原因,并提出策略初始化、超时感知评论家目标和多步回报估计等改进,使其在腿部运动任务中达到与近端策略优化(PPO)相当的性能。

详情
AI中文摘要

近端策略优化(PPO)由于其在IsaacLab等大规模并行仿真环境中的鲁棒性和可扩展性,已成为训练腿部机器人的事实标准。然而,其基于策略的性质使其天生样本效率低下,阻碍了其在真实硬件上的持续适应和微调。相比之下,软演员-评论家(SAC)是一种可以重用过去经验的离策略算法,使其成为模拟到现实迁移工作流程的自然候选,其中同一算法既可用于仿真,也可用于真实机器人的在线学习。尽管有这些优势,SAC在大规模并行训练设置中始终未能匹配PPO的经验性能。本工作确定了这一差距的根本原因,并引入了针对性的修改,包括策略初始化、超时感知评论家目标和多步回报估计,使SAC能够稳定地大规模训练。在多个腿部机器人平台和多样化的运动任务上评估,我们的方法完全弥合了与PPO的性能差距。

英文摘要

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO's empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.
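Two of the modifications named above, timeout-aware critic targets and multi-step return estimation, can be combined in a single target computation. The sketch below shows the standard way to do this and is not claimed to match the paper's implementation: the function name `n_step_target` and its argument layout are hypothetical, and `next_values[k]` is assumed to hold the soft value estimate of the state reached after step k (in SAC, typically the minimum target Q-value minus the entropy term).

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float],
    terminated: Sequence[bool],   # true episode end (e.g. the robot fell)
    truncated: Sequence[bool],    # time-limit reset, not a real terminal state
    next_values: Sequence[float],
    gamma: float = 0.99,
) -> float:
    """n-step critic target over a non-empty window of consecutive transitions.

    A true termination stops the sum without bootstrapping. A timeout also
    stops the sum but still bootstraps from the value of the next state,
    because the task did not actually end there.
    """
    target = 0.0
    discount = 1.0
    for r, term, trunc, v_next in zip(rewards, terminated, truncated, next_values):
        target += discount * r
        discount *= gamma
        if term:
            return target                       # no future value after failure
        if trunc:
            return target + discount * v_next   # timeout: keep the bootstrap
    # Window ended normally: bootstrap from the last next-state value.
    return target + discount * next_values[-1]


if __name__ == "__main__":
    print(n_step_target([1.0, 1.0], [False, False], [False, False], [10.0, 20.0], gamma=0.5))
```

Treating timeouts as terminal would teach the critic that value drops to zero at the time limit, which biases targets in environments that reset on a fixed horizon; the branch on `truncated` avoids that.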

2605.24973 2026-05-26 cs.CV cs.AI cs.CL 版本更新

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

MinerU-Popo:结构化文档解析的通用后处理模型

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory, OpenDataLab(上海人工智能实验室,OpenDataLab) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MinerU-Popo轻量级通用后处理框架,通过分解为文本/表格截断恢复、标题层级重建和图文关联四个子任务,并利用动态分块和重叠同步将OCR页面级结果重构为文档级逻辑结构,显著提升标题层级TEDS和RAG准确性。

Comments The code is available at https://github.com/opendatalab/MinerU-Popo

详情
AI中文摘要

基于VLM的OCR模型已成为文档解析的事实标准,因为它们可以准确提取页面级元素(例如单个页面内的段落)及其边界框和文本内容。然而,下游应用(如RAG)需要连贯的文档级信息,而这些模型常常破坏跨页连续性,并且无法恢复被页面边界截断的结构(如段落和表格)。这种关系不局限于单个页面;相反,它们需要对跨多个页面的标题、段落、表格和图像进行联合分析。因此,一个自然的解决方案是重用现有的OCR输出,并通过后处理重建文档级逻辑结构。为此,我们提出了MinerU-Popo,一个轻量级且通用的OCR输出后处理框架,它将来自不同解析器的页面级结果转换为连贯的文档级结构。MinerU-Popo将问题分解为四个聚焦的子任务:文本截断恢复、表格截断恢复、标题层级重建和图文关联。为了有效解决这些问题,我们构建了一个面向任务的数据引擎,具有任务特定的输入过滤,并使用生成的数据(30K)微调了一个轻量级后处理模型(Qwen3-VL-4B)。为了支持长文档,我们引入了基于重叠同步的动态分块,对齐微调模型的分块级输出并保持全局一致性。最后,我们将对齐后的输出组装成树状文档表示,并通过节点分块和摘要进一步丰富,以支持下游检索和分析。实验结果表明,MinerU-Popo在所有五个测试的OCR模型上,标题层级TEDS至少提高了20%,提高了RAG准确性并降低了每次查询的延迟。

英文摘要

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show that MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, while also improving RAG accuracy and reducing per-query latency.

2605.24971 2026-05-26 cs.LG cs.AI 版本更新

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

TGFormer:基于自相关机制的时间图Transformer

Hongjiang Chen, Pengfei Jiao, Ming Du, Xuan Guo, Zhidong Zhao, Di Jin, Xiao Liu

发表机构 * Hangzhou Dianzi University, School of Cyberspace(杭州电子科技大学网络空间安全学院) Tianjin University, College of Intelligence and Computing(天津大学智能与计算学部) State Key Laboratory of Systems Medicine for Cancer, Shanghai Cancer Institute(癌症系统医学国家重点实验室,上海市肿瘤研究所)

AI总结 针对时间图神经网络在捕获长期依赖和周期模式上的不足,提出TGFormer,通过轨迹框架和自相关机制实现子交互级别的依赖发现与表示聚合,在六个基准上最高提升9.35%精度。

详情
Journal ref
Pattern Recognition 170 (2026): 112053
AI中文摘要

对时间图神经网络(TGNN)日益增长的兴趣源于它们能够建模复杂动态并提供卓越性能。然而,TGNN在捕获长期依赖和识别周期模式方面面临根本性挑战。为解决这些限制,我们提出了TGFormer,一种专为时间图设计的新型Transformer架构。我们的模型通过建立与时间序列分析原理一致的轨迹框架,重新定义了时间图学习。这种方法使TGFormer能够通过对历史交互的系统分析来推导节点表示,从而实现对跨连续时间戳的节点关系的精细检查。基于随机过程理论,我们开发了一种自相关机制,系统性地揭示节点交互中的周期依赖。这一创新使TGFormer能够在子交互级别进行依赖发现和表示聚合,相比传统注意力机制展现出更高的效率和准确性。在六个公开基准上的实验验证了我们的方法的有效性,与最先进方法相比,TGFormer最高实现了9.35%的精度提升。

英文摘要

The growing interest in Temporal Graph Neural Networks (TGNNs) stems from their ability to model complex dynamics and deliver superior performance. However, TGNNs encounter fundamental challenges in capturing long-term dependencies and identifying periodic patterns. To address these limitations, we propose TGFormer, a novel Transformer architecture specifically designed for temporal graphs. Our model redefines temporal graph learning by establishing a trajectory framework that aligns with time series analysis principles. This approach allows TGFormer to derive node representations through systematic analysis of historical interactions, enabling granular examination of node relationships across sequential timestamps. Building upon stochastic process theory, we develop an auto-correlation mechanism that systematically uncovers periodic dependencies in node interactions. This innovation empowers TGFormer to perform dependency discovery and representation aggregation at sub-interaction levels, demonstrating superior efficiency and accuracy compared to conventional attention mechanisms. Experimental validation across six public benchmarks confirms the effectiveness of our approach, with TGFormer achieving up to a 9.35% precision improvement over state-of-the-art approaches.
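The idea of using auto-correlation to surface periodic dependencies can be shown on a plain scalar series. The sketch below computes the classical sample auto-correlation and picks the lag with the highest score; it is only an illustration of the underlying statistic, since TGFormer applies its mechanism to learned representations of node interactions. The function names and the toy series are hypothetical, and the series is assumed to be non-constant so that the normalizer is nonzero.

```python
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample auto-correlation for lags 1..max_lag (direct O(n * max_lag) form)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)   # lag-0 term, assumed nonzero
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(1, max_lag + 1)
    ]


def dominant_period(series: Sequence[float], max_lag: int) -> int:
    """Lag (in steps) whose auto-correlation is highest."""
    scores = autocorrelation(series, max_lag)
    return 1 + max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # An interaction that recurs every 4 time steps (toy data).
    interactions = [1.0, 0.0, 0.0, 0.0] * 6
    print(dominant_period(interactions, max_lag=6))
```

For long histories the same scores are usually obtained with a fast Fourier transform instead of the double loop, which is one reason auto-correlation can be cheaper than pairwise attention over all past interactions.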

2605.24969 2026-05-26 cs.LG cs.AI 版本更新

OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition

OSDTW:长尾识别的最优共享深度与任务加权

Chang Chu, Qingyue Zhang, Shao-Lun Huang, Junxiong Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院,中国深圳) Shenzhen Zkosemi Semiconductor Technology Co., Ltd(深圳卓芯半导体科技有限公司)

AI总结 提出OSDTW框架,通过分解任务、共享编码器与任务特定解码器,并基于Fisher信息矩阵推导泛化误差的偏置-方差分解,以优化共享深度和任务权重,解决长尾识别中头部-尾部性能权衡问题。

Comments ICIC 2026 Oral

详情
AI中文摘要

长尾识别面临持续的头部-尾部权衡:提升尾部性能通常会降低头部准确率,并可能增加训练不稳定性。尽管重加权、解耦训练和多专家方法取得了强有力的实证结果,但关于头部和尾部类别之间表示共享以及跨类别组监督加权的关键设计选择仍主要基于启发式。在这项工作中,我们提出了OSDTW,一个原则性的任务分解框架,将原始的单标签识别问题划分为头部任务和尾部任务,通过共享编码器和任务特定解码器实现。为了处理两个标签组之间的互斥性和统计依赖性,我们引入了一个因子化模型,并表明由此产生的基于KL散度的泛化误差可以写为任务项之和(加一个常数),从而得到一个定义良好的任务级目标。我们进一步开发了一个三阶段训练流程:独立任务训练以估计任务级最优值和Fisher信息矩阵,加权联合训练以学习共享编码器,以及分支组装以构建最终的解耦模型。在块对角Fisher近似下,我们推导了期望泛化误差的可计算二阶展开,将其分解为编码器方差、编码器偏置和解码器方差。这种偏置-方差分解提供了一个可计算的代理来选择共享深度和任务权重,从而实现高效的超参数搜索。在标准长尾基准上的实验证明了所提出方法相对于强基线的有效性。

英文摘要

Long-tailed recognition suffers from a persistent head-tail trade-off: improving tail performance often degrades head accuracy and can increase training instability. Despite strong empirical results from re-weighting, decoupled training, and multi-expert methods, key design choices about representation sharing between head and tail classes and supervision weighting across class groups remain largely heuristic. In this work, we propose OSDTW, a principled task-decomposition framework that partitions the original single-label recognition problem into a head task and a tail task, implemented with a shared encoder and task-specific decoders. To handle the mutual exclusivity and statistical dependence between the two label groups, we introduce a factorized model and show that the resulting Kullback-Leibler divergence-based generalization error can be written as the sum of task-wise terms up to an additive constant, yielding a well-defined task-wise objective. We further develop a three-stage training pipeline: independent task training to estimate task-wise optima and the Fisher information matrix, weighted joint training to learn a shared encoder, and branch assembly to construct the final decoupled model. Under a block-diagonal Fisher approximation, we derive a computable second-order expansion of the expected generalization error, decomposing it into encoder variance, encoder bias, and decoder variance. This bias-variance decomposition provides a computable proxy to select the shared depth and task weights, enabling efficient hyper-parameter search. Experiments on standard long-tailed benchmarks demonstrate the effectiveness of the proposed approach over strong baselines.

2605.24965 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

视觉基础模型在面部深度伪造检测中的跨域泛化极限

Ibrahim Delibasoglu

Affiliations * Department of Software Engineering, Faculty of Computer and Information Sciences

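AI summary: This paper systematically evaluates the linear-probing performance of three vision foundation models (RoPE-ViT, DINOv3, NVIDIA C-RADIOv4-H) on the DF40 benchmark to map their cross-domain generalization limits in facial deepfake detection. The models remain highly discriminative for entire-face synthesis but hit fundamental limits on localized face-editing techniques.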

AI中文摘要

生成模型的快速进化使得超逼真面部深度伪造的创建成为可能,暴露了现代数字取证中的一个关键弱点:检测器无法泛化到未见过的操作技术。传统网络遭受表示崩溃,过度拟合特定训练生成器的局部伪影指纹。本研究探讨了现代视觉基础模型是否可以作为可泛化的、开箱即用的特征提取器,能够在完全未见过的生成流形上追踪取证异常。我们进行了系统的跨域评估,比较了三种基础学习范式:全监督宏观语义特征(RoPE-ViT)、纯自监督几何特征(DINOv3)和多教师聚合表示(NVIDIA C-RADIOv4-H)。通过部署冻结的骨干网络并进行下游线性探测,我们映射了这些架构在具有挑战性的DF40基准上的性能极限。我们的实证结果揭示了预训练范式和参数规模之间的内在权衡,证明虽然基础模型对全脸合成保持高判别能力,但局部面部编辑技术在线性探测评估结构中暴露了基本边界。源代码和模型权重可在 http://github.com/mribrahim/deepfake 获取。

英文摘要

The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, showing that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available at http://github.com/mribrahim/deepfake.

2605.24960 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

探究优化下上下文与参数化思维链忠实性之间的相互作用

Jingyi Sun, Qianli Wang, Pepa Atanasova, Nils Feldhus, Isabelle Augenstein

Affiliations * University of Copenhagen; Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); BIFOLD – Berlin Institute for the Foundations of Learning and Data

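AI summary: Proposes FaithMate, a unified preference-alignment interface, to study how the contextual and parametric chain-of-thought faithfulness paradigms interact under optimization. The two paradigms are positively coupled but asymmetric, and contextual faithfulness metrics trade off against one another.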

Comments The first two authors contributed equally and share first-authorship

AI中文摘要

思维链(CoT)忠实性,即CoT是否真实反映大型语言模型(LLM)的底层行为,通常通过两种不相交的范式进行评估:上下文忠实性(通过扰动输入或CoT轨迹测量)和参数化忠实性(通过干预模型的参数化知识评估)。然而,先前的工作仅对它们进行描述性比较。我们通过提出FaithMate(一个统一的偏好对齐接口,用于优化模型朝向任一忠实性范式)来填补这一空白。它使我们能够研究两种范式之间的相互作用,检查忠实性增益在范式内部和跨范式之间是否以及多大程度上泛化。在三个模型、两个数据集和六个忠实性指标上,我们发现两种范式呈正相关但不对称:优化参数化忠实性在两种范式上均产生一致的增益,而上下文对应范式则带来更多可变的增益。在上下文范式内,一个指标上的忠实性增益不能一致地转移到其他指标上,这表明现有的上下文指标捕捉了忠实性的不同方面,并暴露了固有的权衡。这些发现意味着CoT忠实性不是一个单一目标,因此需要多方面的优化和评估。

英文摘要

Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models' (LLM) underlying behavior, is typically evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or CoT trace, and parametric faithfulness, assessed by intervening on a model's parametric knowledge. Yet prior work compares them only descriptively. We fill this gap by proposing FaithMate, a unified preference-alignment interface for optimizing models towards either faithfulness paradigm. It enables us to investigate the interplay between the two paradigms, examining whether and to what extent faithfulness gains generalize within and across paradigms. Across three models, two datasets, and six faithfulness metrics, we find that the two paradigms are positively coupled, yet asymmetric: optimizing towards parametric faithfulness yields consistent gains across both paradigms, whereas the contextual counterpart delivers more variable gains. Within the contextual paradigm, faithfulness gains on one metric do not consistently transfer to others, implying that existing contextual metrics capture disjoint facets of faithfulness and exposing inherent trade-offs. These findings imply that CoT faithfulness is not a monolithic objective and therefore requires multifaceted optimization and evaluation.

2605.24958 2026-05-26 cs.CL cs.AI 版本更新

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

SEP-Attack:一种简单有效的基于迁移的文本对抗攻击范式

Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang, Fenglong Ma, Hong Yu

Affiliations * Dalian University of Technology; Peking University; Macao Polytechnic University; The Pennsylvania State University

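AI summary: Proposes SEP-Attack, which uses a determinantal point process (DPP) to generate diverse surrogate-ensemble weights, scores prediction confidence with a new metric to compute word importance and generate adversarial examples, and significantly outperforms existing methods on four datasets and two real-world APIs.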

AI中文摘要

尽管深度神经网络在现代Web和语言应用中表现出色,但它们仍然容易受到对抗攻击,尤其是使用代理模型生成对抗样本而无需访问受害者模型的迁移攻击。文本领域的迁移攻击仍未得到充分探索,只有少数研究解决了这一挑战性问题,且由于对子模型平等对待或重要性分数估计不准确,往往导致次优结果。为了解决这些挑战,我们提出了一种简单而有效的基于迁移的文本对抗攻击范式,名为SEP-Attack。具体来说,我们采用行列式点过程(DPP)生成多样化的代理集成权重,代表子模型的迁移性。利用这些权重,我们引入了一种新的度量来评估预测置信度分数,进而用于计算词重要性分数并生成对抗候选。最后,我们量化每个候选的迁移性分数,并选择排名靠前的作为最终的迁移对抗样本。在四个数据集和两个真实API上进行的实验验证了SEP-Attack的有效性,显著优于最先进的基线方法。

英文摘要

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.
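
As a rough illustration of how surrogate-ensemble weights can feed into word importance, the sketch below scores each word by the drop in weighted ensemble confidence when that word is deleted. This is a generic leave-one-out heuristic, not SEP-Attack's own metric: the DPP weight-generation step is omitted, the weights are fixed by hand, and `model_a`, `model_b`, and every function name here are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence

# A surrogate maps a token list to its confidence in the true label (assumed interface).
Surrogate = Callable[[List[str]], float]


def ensemble_confidence(models: Sequence[Surrogate],
                        weights: Sequence[float],
                        tokens: List[str]) -> float:
    """Weighted average of the surrogates' confidence in the true label."""
    total = sum(weights)
    return sum(w * m(tokens) for m, w in zip(models, weights)) / total


def word_importance(models: Sequence[Surrogate],
                    weights: Sequence[float],
                    tokens: List[str]) -> List[float]:
    """Leave-one-out score: confidence drop when each word is removed."""
    base = ensemble_confidence(models, weights, tokens)
    return [
        base - ensemble_confidence(models, weights, tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


# Toy surrogates standing in for real classifiers (hypothetical).
def model_a(tokens: List[str]) -> float:
    return 0.9 if "great" in tokens else 0.4


def model_b(tokens: List[str]) -> float:
    return 0.8 if ("great" in tokens or "movie" in tokens) else 0.3


tokens = ["a", "great", "movie"]
scores = word_importance([model_a, model_b], [0.7, 0.3], tokens)
target = tokens[scores.index(max(scores))]  # word to perturb first
```

Here the highest-scoring word is the one whose removal hurts the weighted ensemble most, so it would be the first candidate for substitution.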

2605.24957 2026-05-26 cs.AI cs.CV cs.LG 版本更新

Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

通过区域感知注意力重校准减轻视觉语言模型中的对象幻觉

Yuanzhi Xu, Qian Gao, Jun Fan, Guohui Ding, Zhenyu Yang, Sixue Lin, Yuteng Xiao

Affiliations * Qilu University of Technology (Shandong Academy of Sciences); China Telecom Digital Intelligence Technology Co., Ltd.; Shenyang Aerospace University; Qilu Institute of Technology

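AI summary: Proposes a training-free, region-aware adaptive weighting mechanism that computes an outlier-resistant statistical midpoint across attention heads, uses inter-head disagreement to set intervention budgets dynamically, and suppresses hallucination-inducing attention paths with a continuous penalty, correcting visual-semantic misalignment while preserving generative fluency.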

AI中文摘要

生成事实上不正确的对象(通常称为对象幻觉)仍然是大型视觉语言模型(LVLMs)中的一个持久挑战。当前解决该问题的方法——从昂贵的数据驱动微调和延迟较高的对比解码到刚性的注意力头截断——常常在计算效率或模型特征空间的连续性上做出妥协。为克服这些限制,我们引入了一种新颖的、无需训练的推理策略,该策略作为一种区域感知的自适应加权机制,动态纠正语义漂移,而不依赖于突然的启发式截断。通过计算各注意力头上的离群值稳健统计中点,我们为可靠的视觉表示建立了一个稳定锚点。然后,我们利用跨区域映射的跨头分歧来动态确定干预预算,通过连续惩罚调制温和地抑制引起幻觉的注意力路径。这种重校准过程有效纠正了视觉语义错位,同时完全保留了生成流畅性和语言先验。在包括CHAIR、POPE和MME在内的标准多模态基准上的全面评估表明,我们的策略显著减少了实例级和句子级幻觉。结果展示了与当代基线相比的最先进性能,证实了我们方法的效率和算法鲁棒性。我们的代码将公开。

英文摘要

The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue, ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation, frequently compromise either computational efficiency or the continuity of the model's feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method's efficiency and algorithmic robustness. Our code will be public.
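
The recalibration idea can be sketched in a few lines, with the caveat that the abstract does not give the actual formulas. The version below assumes the "statistical midpoint" is a per-region median across heads, measures disagreement as the mean absolute deviation from that median, and shrinks each head toward the median by a smooth factor. The function name, the `strength` parameter, and the shrinkage rule are all assumptions made for illustration.

```python
from statistics import median
from typing import List


def recalibrate(attn: List[List[float]], strength: float = 5.0) -> List[List[float]]:
    """attn[h][r] is the attention mass head h places on image region r."""
    n_heads, n_regions = len(attn), len(attn[0])
    out = [row[:] for row in attn]
    for r in range(n_regions):
        column = [attn[h][r] for h in range(n_heads)]
        anchor = median(column)  # outlier-resistant midpoint across heads
        # Mean absolute deviation from the anchor: the per-region disagreement.
        disagreement = sum(abs(a - anchor) for a in column) / n_heads
        # Continuous shrinkage: 1.0 when heads agree, smaller as they diverge.
        shrink = 1.0 / (1.0 + strength * disagreement)
        for h in range(n_heads):
            out[h][r] = anchor + (attn[h][r] - anchor) * shrink
    # Rescale each head so its total attention mass is unchanged.
    for h in range(n_heads):
        before, after = sum(attn[h]), sum(out[h])
        if after > 0:
            out[h] = [v * before / after for v in out[h]]
    return out


heads = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]  # third head is an outlier
calibrated = recalibrate(heads)
```

Heads that already agree with the median are left untouched, while the outlier head is pulled toward it without being truncated outright.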

2605.24953 2026-05-26 cs.AI 版本更新

Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance

面向工业资产运维的多轮对话系统

Chengrui Li, Rujing Li, Yitong Bai, Rui Li

Affiliations * Columbia University

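AI summary: For multi-turn, iterative question answering in industrial asset operations and maintenance, proposes a dialog system built on a supervisor-specialist multi-agent architecture. Structured artifact reuse, dynamic replanning, and parallel tool execution raise planning effectiveness by 54.5% and task completion by 37.8% over the baseline.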

AI中文摘要

工业资产运维问答本质上是多轮、迭代且高度依赖外部工具调用的。然而,传统的计划-执行单智能体架构在维护跨轮上下文和复用中间结果方面存在明显局限性。本文提出了一种基于监督者-专家多智能体架构的工业场景多轮对话系统。为缓解工具调用瓶颈,该系统集成了结构化工件复用、动态重规划和并行工具执行。评估结果表明,与基线相比,我们的系统实现了更好的响应质量,规划效果提升54.5%,任务完成率提升37.8%。系统性能分析进一步显示,跨轮工件复用有效减少了冗余工具调用,工具时间占比从47.3%降至26.3%,且第2-5轮的响应速度比第一轮快约4.2倍。

英文摘要

Industrial asset operations and maintenance question answering is inherently multi-turn, iterative, and highly dependent on external tool invocation. However, the conventional plan-execute single-agent architecture exhibits clear limitations in maintaining cross-turn context and reusing intermediate results. In this paper, we present a multi-turn dialog system designed for industrial scenarios based on a supervisor-specialist multi-agent architecture. To alleviate tool invocation bottlenecks, the system incorporates structured artifact reuse, dynamic replanning, and parallel tool execution. Evaluation results show that our system achieves better response quality compared with the baseline, with planning effectiveness increasing by 54.5% and task completion improving by 37.8%. System profiling further shows that cross-turn artifact reuse effectively reduces redundant tool invocation, decreasing the tool-time share from 47.3% to 26.3% and making turns 2-5 approximately 4.2x faster than the first turn.
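
Cross-turn artifact reuse can be pictured as a keyed cache in front of the tools. The sketch below is a minimal guess at that pattern, not the paper's implementation: `ArtifactStore`, `get_or_run`, and the `fetch_sensor_history` tool are hypothetical names, and the cache key assumes tool arguments are hashable.

```python
from typing import Any, Callable, Dict, Tuple


class ArtifactStore:
    """Caches tool results so later turns can reuse them instead of re-calling the tool."""

    def __init__(self) -> None:
        self._artifacts: Dict[Tuple, Any] = {}
        self.tool_calls = 0  # number of real tool invocations

    def get_or_run(self, tool: Callable[..., Any], **args: Any) -> Any:
        # Key on tool name plus sorted arguments (arguments must be hashable).
        key = (tool.__name__, tuple(sorted(args.items())))
        if key not in self._artifacts:
            self.tool_calls += 1
            self._artifacts[key] = tool(**args)
        return self._artifacts[key]


def fetch_sensor_history(asset_id: str, days: int) -> dict:
    """Hypothetical slow tool; a real one would query a plant data source."""
    return {"asset": asset_id, "days": days, "readings": [71.2, 73.9, 80.4]}


store = ArtifactStore()
turn1 = store.get_or_run(fetch_sensor_history, asset_id="chiller-6", days=7)
turn2 = store.get_or_run(fetch_sensor_history, asset_id="chiller-6", days=7)   # reused
turn3 = store.get_or_run(fetch_sensor_history, asset_id="chiller-6", days=30)  # new call
```

Three requests trigger only two tool invocations, which is the kind of saving the reported drop in tool-time share reflects.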

2605.24949 2026-05-26 cs.CR cs.AI 版本更新

APT-Agent: Automated Penetration Testing using Large Language Models

APT-Agent:利用大语言模型的自动化渗透测试

William Guanting Li, Alsharif Abuadbba, Kristen Moore, Dan Dongseong Kim

Affiliations * University of Queensland

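AI summary: Proposes the APT-Agent framework, which addresses LLM hallucination and long-term memory limits in penetration testing with a hybrid rectification module and a command-specific memory architecture, reaching an 84.29% end-to-end exploitation success rate on Metasploitable 2.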

Comments 11 pages, 8 figures

AI中文摘要

渗透测试对于保护现代网络基础设施至关重要,然而传统的手动方法难以跟上其规模和复杂性。大语言模型(LLMs)为自动化这些任务提供了新的机会,但现有方法面临两个持续挑战:技术实体的幻觉和长期上下文记忆不足。为了解决这些问题,我们提出了APT-Agent,一个完全自动化的LLM驱动的渗透测试框架,系统性地协调侦察、利用和数据窃取。APT-Agent引入了一个混合修正模块来恢复幻觉命令,以及一个命令特定的记忆架构来跨多步攻击序列保留操作上下文。我们在Metasploitable 2上针对涵盖Web、数据库和网络协议的七个脆弱服务评估了我们的APT-Agent。APT-Agent实现了84.29%的端到端利用成功率,而在匹配条件下,Script Kiddie和PentestGPT分别为48.57%和18.57%。通过减少认知负担和最小化对人类干预的依赖,APT-Agent代表了向可扩展、可靠且认知高效的渗透测试自动化迈出的一步。

英文摘要

Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their scale and complexity. Large Language Models (LLMs) offer new opportunities for automating these tasks, but existing approaches face two persistent challenges: hallucination of technical entities and insufficient long-term contextual memory. To address these issues, we present APT-Agent, a fully automated LLM-driven penetration testing framework that systematically orchestrates reconnaissance, exploitation, and exfiltration. APT-Agent introduces a hybrid rectification module to recover hallucinated commands and a command-specific memory architecture to preserve operational context across multi-step attack sequences. We evaluate our APT-Agent on Metasploitable 2 against seven vulnerable services spanning web, database, and network protocols. APT-Agent achieves an 84.29% end-to-end exploitation success rate, compared to 48.57% (Script Kiddie) and 18.57% (PentestGPT) under matched conditions. By reducing cognitive burden and minimizing reliance on human intervention, APT-Agent represents a step toward scalable, reliable, and cognitively efficient automation for penetration testing.
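
One plausible reading of "hybrid rectification" is an exact allow-list check followed by fuzzy matching against known names. The sketch below illustrates only that reading and is not APT-Agent's module: the `rectify` function, the three-entry allow-list, and the 0.8 cutoff are assumptions, and the fuzzy step uses Python's standard `difflib`.

```python
import difflib
from typing import List, Optional

# Small allow-list for illustration; a real system would load a full module index.
KNOWN_MODULES: List[str] = [
    "exploit/unix/ftp/vsftpd_234_backdoor",
    "exploit/multi/samba/usermap_script",
    "exploit/unix/irc/unreal_ircd_3281_backdoor",
]


def rectify(candidate: str, known: List[str] = KNOWN_MODULES,
            cutoff: float = 0.8) -> Optional[str]:
    """Return the candidate if valid, else the closest known name, else None."""
    if candidate in known:  # rule-based check first
        return candidate
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


fixed = rectify("exploit/unix/ftp/vsftpd_backdoor_234")  # hallucinated word order
unknown = rectify("totally_unknown_module")              # nothing close enough
```

Returning `None` for unrecoverable names lets the caller re-prompt the model instead of executing a guess.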

2605.24945 2026-05-26 cs.LG cs.AI physics.ao-ph 版本更新

RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

RealBench: 在操作条件和极端事件挑战下对数据驱动数值天气预报的基准测试

Ruize Li, Zhibin Wen, Tao Han, Hao Chen, Fenghua Ling, Wei Zhang, Song Guo, Lei Bai

Affiliations * The Hong Kong University of Science and Technology; Nanjing University; Southern University of Science and Technology; Shanghai AI Laboratory; Shanghai TechWind Technology Co., Ltd.

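AI summary: Proposes the RealBench benchmark, which evaluates AI weather forecasting models on a strictly out-of-distribution 2025 test set using low-latency operational analysis and in-situ observations from over 10,000 stations worldwide. It reveals substantial gaps between reanalysis-based metrics and real-world performance, especially for extreme events.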

Comments 35 pages, 22 figures

AI中文摘要

准确评估天气预报模型对于其在现实世界应用中的可靠部署至关重要。然而,现有基准主要依赖再分析产品(如ERA5),这些产品通过延迟数据同化生成,不能反映实时操作预报的约束,导致基准性能与现实预报之间存在系统性不匹配。在这项工作中,我们引入了RealBench,这是一个用于AI天气预报的下一代基准,强调在操作条件下的现实评估。RealBench具有严格分布外测试集,覆盖2025年,以消除数据泄露并捕捉近期大气状况。它整合了多个数据源,包括低延迟操作分析和包含超过10,000个站点的全球原位观测数据集,从而能够直接针对真实大气测量进行评估。除了标准全球指标外,RealBench还为高影响极端事件(包括热浪、寒潮和热带气旋)提供了全面的评估框架,使用事件特定指标更好地反映现实预报优先级。评估结果揭示了基于再分析的指标与现实性能之间的显著差异,特别是关于极端事件。通过突出现有基准的局限性,这项工作建立了一个更忠实且与操作相关的评估范式,为推进下一代AI天气预报系统提供了严格的基础。基准实现可在以下网址获取:https://github.com/lixruize-del/NWP-Benchmark。

英文摘要

Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real-world forecasting. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.
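
Evaluating directly against station measurements means comparing gridded forecasts with point observations. The sketch below shows one simple way to do that, using nearest-grid-point matching and RMSE. It is an assumption-laden illustration rather than RealBench's procedure: the function names and data layout are hypothetical, and longitude wrap-around, elevation correction, and quality control are ignored.

```python
import math
from typing import Dict, List


def nearest_grid_value(field: List[List[float]], lats: List[float],
                       lons: List[float], lat: float, lon: float) -> float:
    """Pick the forecast value at the grid point closest to (lat, lon)."""
    i = min(range(len(lats)), key=lambda a: abs(lats[a] - lat))
    j = min(range(len(lons)), key=lambda b: abs(lons[b] - lon))
    return field[i][j]


def station_rmse(field: List[List[float]], lats: List[float], lons: List[float],
                 stations: List[Dict[str, float]]) -> float:
    """Root-mean-square error of the forecast against station observations."""
    squared = [
        (nearest_grid_value(field, lats, lons, s["lat"], s["lon"]) - s["obs"]) ** 2
        for s in stations
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy 2x2 temperature forecast (degrees C) and two made-up stations.
lats, lons = [0.0, 10.0], [0.0, 10.0]
forecast = [[20.0, 21.0], [22.0, 23.0]]
stations = [
    {"lat": 1.0, "lon": 9.0, "obs": 20.0},   # nearest grid value 21.0
    {"lat": 9.5, "lon": 0.5, "obs": 22.0},   # nearest grid value 22.0
]
rmse = station_rmse(forecast, lats, lons, stations)
```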

2605.24938 2026-05-26 cs.IR cs.AI cs.CV 版本更新

Your Embedding Model is SMARTer Than You Think

你的嵌入模型比你想象的更聪明

Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee

发表机构 * UW-Madison(威斯康星大学麦迪逊分校) Korea University(韩国大学) NetApp, Inc.(NetApp公司)

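AI summary: Proposes the SMART framework, which exploits the latent multi-vector capability of standard single-vector models by applying late interaction at inference time, improving multimodal retrieval without additional training.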

AI中文摘要

多模态检索严重依赖单向量检索器,它将丰富的顺序令牌序列压缩为单个全局表示。虽然高效,但它们丢弃了密集检索任务所需的关键细粒度局部证据。多向量方法作为解决方案被引入,但严格需要训练,且许多忽略了全局总结表示的必要性。为解决这一问题,我们引入SMART,一个释放标准单向量模型潜在多向量能力的框架。我们首先证明,在池化嵌入上的标准对比训练通过梯度流隐式塑造了前序隐藏状态的检索几何结构。通过在推理时对这些冻结的隐藏状态应用直接后期交互,SMART作为一种即插即用的升级,持续提升跨多种模态的性能,甚至在MMEB-V2上进一步改进了最先进的模型。我们还揭示了SMART的优越性能,简单的轻量级后训练不仅节省时间和计算,还在视觉文档检索上带来进一步改进,使单向量模型能够超越最先进的多向量对应模型。最终,SMART为多模态检索提供了高效的推理增强和强大的微调技术。我们在https://github.com/HanSolo9682/SMART开源了代码和权重。

英文摘要

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, further improving even state-of-the-art models on MMEB-V2. We also show that simple, lightweight post-training with SMART not only saves time and compute but also further improves Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
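
To see why late interaction can recover evidence that pooling discards, the toy sketch below compares a mean-pooled dot product with a MaxSim-style score over per-token states. Treating "late interaction" as ColBERT-style MaxSim is an assumption here, and the two-dimensional vectors and function names are made up for illustration.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mean_pool(states: List[Vector]) -> Vector:
    """Single-vector view: average the token states."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]


def late_interaction(query_states: List[Vector], doc_states: List[Vector]) -> float:
    """MaxSim: each query token takes its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_states) for q in query_states)


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches the query token by token
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # same mean, no token-level match

pooled_a = dot(mean_pool(query), mean_pool(doc_a))
pooled_b = dot(mean_pool(query), mean_pool(doc_b))
maxsim_a = late_interaction(query, doc_a)
maxsim_b = late_interaction(query, doc_b)
```

The pooled scores tie, while the token-level scores separate the two documents, which is the fine-grained signal a single pooled vector cannot express.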

2605.24926 2026-05-26 cs.AI 版本更新

Energy Shields for Fairness

公平性能量护盾

Filip Cano, Thomas A. Henzinger, Konstantin Kueffner

Affiliations * Institute of Science and Technology Austria

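AI summary: Proposes energy shields, lightweight physics-inspired adaptive controllers that enforce runtime fairness smoothly through probabilistic intervention and are the first fairness shields to provide both short-term safety and long-term liveness guarantees.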

详情
AI中文摘要

运行时公平性不是一个一次性约束,而是一个在决策序列上评估的动态属性。为了确保运行时公平性,必须考虑过去的决策,这是传统静态分类器所忽略的信息。传统的公平性护盾通过确定性干预来强制执行运行时公平性,每当决策序列违反运行公平性度量的目标时,就会突然干预。这激发了我们主要的概念贡献:能量护盾。能量护盾是一种新颖的、轻量级的自适应控制器,它监控决策序列并概率性地干预,通过利用受物理学启发的能量函数将序列推向公平性,从而平滑地确保运行时公平性:决策越不公平,推动力就越强。这使得能量护盾成为第一个同时提供短期安全性和长期活性保证的公平性护盾。安全性确保运行公平性度量以高概率保持在运行目标区间内,而活性确保公平性度量的极限位于极限目标区间内。直观地说,短期指定了容忍的公平性值,长期指定了期望的公平性值。我们还提供了一种合成程序,用于为给定的目标规范构建最小侵入性的能量护盾,并通过实验证明其效率。我们通过短期和长期公平性的视角,将我们的能量护盾与现有的公平性护盾进行了评估。

英文摘要

Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime, it is necessary to account for past decisions, information neglected by conventional, static classifiers. Traditional fairness shields enforce runtime fairness abruptly, by intervening deterministically whenever a sequence of decisions violates the target for a running fairness measure. This motivates our main conceptual contribution: energy shields. An energy shield is a novel, lightweight, adaptive controller that monitors a sequence of decisions and intervenes probabilistically to ensure runtime fairness smoothly, by utilizing physics-inspired energy functions to nudge the sequence toward fairness: the more unfair the decisions, the stronger the nudging force becomes. This makes energy shields the first fairness shields to provide both short-term safety and long-term liveness guarantees. Safety ensures that the running fairness measure stays within a running target interval with high probability, and liveness ensures that the limit of the fairness measure lies within the limit target interval. Intuitively, the short-term specifies the tolerated fairness values and the long-term specifies the desired fairness values. We also provide a synthesis procedure for constructing the least intrusive energy shield for a given target specification, and demonstrate its efficiency experimentally. We evaluate our energy shields against existing fairness shields through the lens of short- and long-term fairness.
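
The "stronger nudge for more unfair sequences" idea can be sketched with a toy controller. Everything specific below is an assumption rather than the paper's construction: fairness is measured as the acceptance-rate gap between two groups, the energy is quadratic in that gap, and the names `rate_gap`, `intervention_probability`, and `energy_shield` are hypothetical. The sketch carries none of the paper's safety or liveness guarantees.

```python
import random
from typing import Iterable, List, Tuple

Decision = Tuple[int, bool]  # (group in {0, 1}, accepted?)


def rate_gap(decisions: Iterable[Decision]) -> float:
    """Acceptance rate of group 0 minus that of group 1 (0.0 for unseen groups)."""
    seen, accepted = [0, 0], [0, 0]
    for group, accept in decisions:
        seen[group] += 1
        accepted[group] += int(accept)
    rates = [accepted[g] / seen[g] if seen[g] else 0.0 for g in (0, 1)]
    return rates[0] - rates[1]


def intervention_probability(gap: float, tolerance: float) -> float:
    """Quadratic energy: 0 when fair, saturating at 1 once |gap| reaches the tolerance."""
    return min(1.0, (abs(gap) / tolerance) ** 2)


def energy_shield(stream: Iterable[Decision], tolerance: float = 0.2,
                  seed: int = 0) -> List[Decision]:
    rng = random.Random(seed)
    history: List[Decision] = []
    for group, accept in stream:
        gap = rate_gap(history)
        # The decision for this individual that would shrink the gap.
        fair_choice = (gap < 0) if group == 0 else (gap > 0)
        if gap != 0 and accept != fair_choice:
            if rng.random() < intervention_probability(gap, tolerance):
                accept = fair_choice  # probabilistic override
        history.append((group, accept))
    return history


# A maximally biased classifier: always accepts group 0, always rejects group 1.
biased = [(i % 2, i % 2 == 0) for i in range(200)]
shielded = energy_shield(biased)
```

Small gaps are rarely corrected and large gaps almost always are, which is what makes the intervention smooth rather than a hard switch.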

2605.24920 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Quaternion Self-Attention with Shared Scores

共享分数的四元数自注意力

Shogo Yamauchi, Tohru Nitta, Hideaki Tamori

Affiliations * Tokyo Woman's Christian University

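AI summary: Proposes a shared-score quaternion self-attention mechanism that computes a single real-valued score from the quaternion inner product and shares one attention distribution across all components, cutting computational cost substantially while maintaining quality.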

Comments 26 pages, 6 figures and 15 tables. Accepted at ICML2026

AI中文摘要

四元数神经网络通过将四个相关特征表示为一个单一实体,实现了参数高效并建模多维依赖关系。然而,现有的四元数自注意力计算每个分量的分数并对每个分量应用独立的softmax操作,这增加了计算成本并允许注意力分布在分量间发散。我们提出了一种共享分数的四元数自注意力机制,该机制使用四元数内积计算单一实值分数,并在所有分量上应用共享的注意力分布。这将分数计算乘法减少了75%,并将softmax操作次数从四次减少到一次。我们证明,当查询和键由诱导分量预混合的四元数线性投影产生时,分量级分数和共享分数位于相同的交互子空间中,表明独立的分量级注意力主要重新参数化相同的交互,而不是扩展特征交互空间。在语音增强中,我们的方法在GPU上将推理时间减少了高达44.3%,在CPU上减少了58.1%,同时保持了质量,并且在视觉和自然语言处理中呈现一致的趋势。

英文摘要

Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases the computational cost and allows attention distributions to diverge across components. We propose a shared-score quaternion self-attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that, when queries and keys are produced by quaternion linear projections that induce component pre-mixing, the component-wise and shared scores lie in the same interaction subspace, indicating that independent component-wise attention primarily re-parameterizes the same interactions rather than expanding the feature interaction space. In speech enhancement, our method reduces inference time by up to 44.3% on a GPU and 58.1% on a CPU while maintaining quality, with consistent trends across vision and natural language processing.
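
The shared score can be made concrete with plain tuples. The sketch below checks that the quaternion inner product equals the real part of the Hamilton product of one quaternion with the conjugate of the other, and then applies a single softmax. One way to arrive at the 75% figure, assuming component-wise scores come from a full Hamilton product, is 16 real multiplications versus 4 per quaternion pair. The scaling factor and function names are assumptions, not the paper's definitions.

```python
import math
from typing import List, Tuple

Quaternion = Tuple[float, float, float, float]


def hamilton(p: Quaternion, q: Quaternion) -> Quaternion:
    """Full Hamilton product: 16 real multiplications."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2)


def conjugate(q: Quaternion) -> Quaternion:
    a, b, c, d = q
    return (a, -b, -c, -d)


def inner(p: Quaternion, q: Quaternion) -> float:
    """Quaternion inner product: 4 real multiplications, one real output."""
    return sum(x * y for x, y in zip(p, q))


def shared_attention_weights(query: List[Quaternion],
                             keys: List[List[Quaternion]]) -> List[float]:
    """One softmax over real scores; the weights apply to all four components."""
    scale = math.sqrt(4 * len(query))  # assumed scaling by the real dimension
    scores = [sum(inner(q, k) for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


q = (1.0, 2.0, 3.0, 4.0)
k = (0.5, -1.0, 2.0, 0.0)
weights = shared_attention_weights([q], [[k], [q]])
```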

2605.24913 2026-05-26 eess.IV cs.AI q-bio.QM 版本更新

Explainable Multi-Task Retinal Imaging Reveals Microvascular Signals for Systemic Risk Stratification in Type 2 Diabetes: A Pilot Study

可解释多任务视网膜成像揭示2型糖尿病系统性风险分层的微血管信号:一项初步研究

Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon

Affiliations * Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology; Frontier Science Computing Center, Zhuhai Institute of Advanced Technology, Chinese Academy of Sciences; Chinese University of Hong Kong; Zhuhai People's Hospital (The Affiliated Hospital of Beijing Institute of Technology, Zhuhai Clinical Medical College of Jinan University); Lampang Inter-Tech College, Lampang, Thailand

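AI summary: Develops an explainable multi-task deep learning framework that analyzes associations between retinal microvascular features and systemic abnormalities such as kidney abnormality. The pilot results, with AUC up to 0.63 for kidney abnormality, point to the potential of retinal imaging as a biomarker for systemic risk stratification in diabetes.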

Comments 18 pages, 4 figures

AI中文摘要

视网膜成像提供了进入系统性微血管健康的非侵入性窗口,并已成为系统性疾病的潜在生物标志物。然而,视网膜特征是否编码了生物学上有意义的系统性信号,并且可以使用可解释人工智能(XAI)可靠地解释,仍不清楚。我们开发了一个可解释的多任务深度学习框架,以研究视网膜微血管特征与2型糖尿病系统性异常之间的关联。使用共享神经网络和针对血糖状态、肾脏异常和多系统参与的任务特定头部,分析了来自2,719名个体的11,011张眼底图像。使用梯度加权类激活映射(Grad-CAM)、解剖掩膜和血管对齐分析评估模型可解释性。该框架展示了任务依赖的预测性能,对肾脏异常的最佳区分度(AUC高达0.63),而血糖状态预测性能有限(AUC = 0.49-0.61)。可解释性分析一致地将模型注意力定位到视网膜血管和视盘周围区域。掩膜实验表明,遮挡血管区域导致性能下降最大,表明视网膜血管是主要的预测来源。不同架构表现出异质的注意力模式,提示存在多种系统性信号编码的表征路径。这项初步研究表明,视网膜微血管特征包含与系统性异常(尤其是微血管损伤)相关的可测量信号。通过将多任务学习与定量XAI验证相结合,该框架推动视网膜成像向用于糖尿病系统性风险分层的可解释数字生物标志物发展。

英文摘要

Retinal imaging provides a non-invasive window into systemic microvascular health and has emerged as a potential biomarker for systemic diseases. However, whether retinal features encode biologically meaningful systemic signals that can be reliably interpreted using explainable artificial intelligence (XAI) remains unclear. An explainable multi-task deep learning framework was developed to investigate associations between retinal microvascular features and systemic abnormalities in Type 2 Diabetes Mellitus. A total of 11,011 fundus images from 2,719 individuals were analysed using a shared neural network with task-specific heads for glycaemic status, kidney abnormality, and multi-system involvement. Model interpretability was evaluated using Gradient-weighted Class Activation Mapping (Grad-CAM), anatomical masking, and vessel alignment analysis. The framework demonstrated task-dependent predictive performance, with the best discrimination observed for kidney abnormality (AUC up to 0.63), whereas glycaemic status prediction showed limited performance (AUC = 0.49-0.61). Explainability analyses consistently localized model attention to retinal vessels and peripapillary regions. Masking experiments showed that occlusion of vascular regions caused the greatest performance decline, indicating that retinal vessels were the primary predictive source. Different architectures exhibited heterogeneous attention patterns, suggesting multiple representational pathways for systemic signal encoding. This pilot study demonstrates that retinal microvascular features contain measurable signals associated with systemic abnormalities, particularly microvascular damage. By integrating multi-task learning with quantitative XAI validation, this framework advances retinal imaging toward interpretable digital biomarkers for systemic risk stratification in diabetes.

2605.24912 2026-05-26 cs.LG cs.AI q-bio.OT 版本更新

Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes

可解释的视网膜成像用于预测2型糖尿病多器官功能障碍

Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon

Affiliations * Faculty of Computer Science and Artificial Intelligence; Frontier Science Computing Center; Chinese Academy of Sciences; Chinese University of Hong Kong; Zhuhai People's Hospital; Beijing Institute of Technology; Jinan University; Lampang Inter-Tech College

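AI summary: Builds system-level abnormality indices from routine laboratory biomarkers, predicts multi-system dysregulation in type 2 diabetes with a gradient boosting model, and uses SHAP for interpretability, identifying hyperglycaemia, renal impairment, dyslipidaemia, and inflammation as the dominant drivers.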

Comments 15 pages, 8 figures

详情
AI中文摘要

背景:2型糖尿病(T2DM)日益被认为是一种以代谢、肾脏、脂质和炎症通路协调功能障碍为特征的系统性疾病。现有的临床评估往往无法捕捉这种多维度负担。方法:我们对1,195名患者进行了回顾性研究,使用了常规收集的实验室生物标志物。构建了系统级异常指数以量化器官特异性功能障碍,并将多系统受累定义为两个或以上系统异常。训练了包括逻辑回归、随机森林和梯度提升在内的监督机器学习模型来预测多系统失调。使用SHapley Additive exPlanations(SHAP)实现模型可解释性。结果:梯度提升模型表现出近乎完美的区分能力(AUC = 1.000),显著优于逻辑回归(AUC = 0.925)。特征归因分析显示,高血糖、肾功能障碍、血脂异常和炎症是多系统风险的主要驱动因素。部分依赖分析中观察到的剂量-反应关系进一步支持了模型预测的生物学合理性。结论:本研究提出了一个可解释的、数据驱动的框架,用于量化T2DM的系统性疾病负担。通过将常规生物标志物与多器官功能障碍联系起来,我们的方法提供了预测准确性和机制洞察,为糖尿病护理中的风险分层和精准医学提供了潜力。本研究中使用的数据和代码可在GitHub上公开获取:https://github.com/MiniHanWang/Type-2-Diabetes-1.git

英文摘要

Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction across metabolic, renal, lipid, and inflammatory pathways. Existing clinical assessments often fail to capture this multi-dimensional burden. Methods: We conducted a retrospective study of 1,195 patients using routinely collected laboratory biomarkers. System-level abnormality indices were constructed to quantify organ-specific dysfunction, and multi-system involvement was defined as abnormalities in two or more systems. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi-system dysregulation. Model interpretability was achieved using SHapley Additive exPlanations (SHAP). Results: The gradient boosting model demonstrated near-perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). Feature attribution analysis revealed that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the dominant drivers of multi-system risk. Dose-response relationships observed in partial dependence analyses further supported the biological plausibility of model predictions. Conclusion: This study presents an interpretable, data-driven framework for quantifying systemic disease burden in T2DM. By linking routine biomarkers to multi-organ dysfunction, our approach provides both predictive accuracy and mechanistic insight, offering potential for improved risk stratification and precision medicine in diabetes care. The data and code used in this study are openly available on GitHub at: https://github.com/MiniHanWang/Type-2-Diabetes-1.git
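
The outcome definition, abnormalities in two or more systems, is easy to express in code. The sketch below does so with placeholder biomarker names and cut-offs chosen only for illustration. They are assumptions, not the thresholds or indices used in the study, and `SYSTEM_RULES` and both functions are hypothetical.

```python
from typing import Callable, Dict, List

Labs = Dict[str, float]

# Placeholder cut-offs for illustration only; the study's thresholds are not stated here.
SYSTEM_RULES: Dict[str, Callable[[Labs], bool]] = {
    "glycaemic": lambda labs: labs["hba1c_pct"] >= 6.5,
    "renal": lambda labs: labs["egfr"] < 60.0,
    "lipid": lambda labs: labs["ldl_mmol_l"] >= 3.4,
    "inflammatory": lambda labs: labs["crp_mg_l"] > 3.0,
}


def abnormal_systems(labs: Labs) -> List[str]:
    """Names of the systems whose rule flags this patient as abnormal."""
    return [name for name, rule in SYSTEM_RULES.items() if rule(labs)]


def multi_system_involvement(labs: Labs) -> bool:
    """Outcome label: abnormalities in two or more systems."""
    return len(abnormal_systems(labs)) >= 2


patient_a = {"hba1c_pct": 7.8, "egfr": 52.0, "ldl_mmol_l": 2.1, "crp_mg_l": 1.0}
patient_b = {"hba1c_pct": 7.1, "egfr": 88.0, "ldl_mmol_l": 2.6, "crp_mg_l": 0.8}
label_a = multi_system_involvement(patient_a)
label_b = multi_system_involvement(patient_b)
```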

2605.24911 2026-05-26 cs.LG cs.AI 版本更新

Factorize to Generalize: Retrieval-Guided Invariant-Dynamic Decomposition for Time Series Forecasting

因式分解以泛化:面向时间序列预测的检索引导不变-动态分解

Jinjin Chi, Lei Feng, Lulu Zhang, Yongcheng Jing, Yiming Wang, Ximing Li, Jialie Shen, Leszek Rutkowski, Dacheng Tao

Affiliations * College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; City St George’s, University of London; Systems Research Institute, Polish Academy of Sciences

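AI summary: Proposes a retrieval-guided invariant-dynamic decomposition framework that separates stable shared structure from instance-specific variation, improving the robustness of zero-shot time series forecasting under distribution shift.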

AI中文摘要

时间序列基础模型(TSFMs)最近通过大规模预训练和检索增强预测实现了强大的零样本预测性能。然而,我们的实证分析揭示了基于检索的预测的一个非平凡限制:检索倾向于导致更振荡的预测,在高度波动的序列上提升性能,但在更平滑、趋势主导的序列上降低准确性。这表明检索信息可能在未明确区分稳定时间结构与实例特定变化的情况下被融合到预测中,这可能在分布偏移下降低鲁棒性。我们提出了一种用于时间序列预测的检索引导不变-动态分解框架。我们不将检索用作辅助预测上下文,而是利用检索到的序列作为来自相关环境的隐式样本,以指导表示分解。具体来说,我们首先通过基于注意力的聚合构建检索感知表示,然后引入检索引导路由机制将其分解为捕获稳定共享结构的不变组件和建模上下文相关变化的动态组件。这两个组件分别预测并融合以进行最终预测,使模型能够保留可迁移模式,同时保持对动态演变的适应性。我们进一步设计了鼓励不变学习和解耦的训练目标,并提供了理论见解,表明检索聚合减少了方差,并在没有显式环境监督的情况下近似不变表示学习。大量实验表明,我们的方法在分布偏移下持续提高鲁棒性,并在零样本预测设置中优于现有的TSFMs和基于检索的基线。

英文摘要

Time series foundation models (TSFMs) have recently achieved strong zero-shot forecasting performance through large-scale pretraining and retrieval-augmented prediction. However, our empirical analysis reveals a non-trivial limitation of retrieval-based forecasting: retrieval tends to induce more oscillatory predictions, improving performance on highly fluctuating series while degrading accuracy on smoother, trend-dominated ones. This suggests that retrieved information may be fused into prediction without explicitly distinguishing stable temporal structure from instance-specific variations, which can reduce robustness under distribution shifts. We propose a Retrieval-guided Invariant-Dynamic DEcomposition framework for time series forecasting. Rather than using retrieval as auxiliary predictive context, we leverage retrieved sequences as implicit samples from related environments to guide representation decomposition. Specifically, we first construct a retrieval-aware representation via attention-based aggregation, and then introduce a retrieval-guided routing mechanism to decompose it into an invariant component capturing stable shared structure and a dynamic component modeling context-dependent variations. These two components are forecast separately and fused for final prediction, enabling the model to preserve transferable patterns while remaining adaptive to evolving dynamics. We further design training objectives that encourage invariant learning and disentanglement, and provide theoretical insight showing that retrieval aggregation reduces variance and approximates invariant representation learning without explicit environment supervision. Extensive experiments demonstrate that our method consistently improves robustness under distribution shifts and outperforms existing TSFMs and retrieval-based baselines in zero-shot forecasting settings.

2605.24910 2026-05-26 cs.AI cs.CE 版本更新

Noise-Robust Financial Numerical Entity Attribute Tagging

鲁棒噪声的金融数值实体属性标注

Hsin-Min Lu, Chen-Yang Lai, Yi-Jhen Li, Ju-Chun Yen

Affiliations: National Taiwan University; National Central University

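AI summary: To address label noise and incomplete attribute coverage in financial numerical entity tagging, the paper proposes NORA, which uses task-aware instance weighting and Neighborhood Prior-adjusted KNN (NPK) filtering to achieve robust multi-attribute prediction on a benchmark of 6.6 million instances.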

详情
AI中文摘要

金融数值实体(FNE)理解旨在恢复财务报告中数值提及的含义。现有研究主要关注概念名称预测,并面临两个重要限制。首先,来自内联XBRL的标签可能包含错误,因为申报通常是手动准备的。其次,其他重要的FNE属性,如报告时间关系、测量尺度和会计符号,较少被强调。我们提出鲁棒噪声的丰富金融数值实体属性标注(NORA)来解决这些差距。NORA使用任务感知的实例特定加权来减弱训练过程中噪声标签的影响,并进一步提出邻域先验调整KNN(NPK)过滤方法,以便在真实世界噪声测试集上进行更可靠的评估。此外,我们构建了一个包含660万个实例的大规模基准,具有多属性标签和申报元数据。实验表明,NORA与最先进的噪声标签基线(包括Co-teaching、Mixup、SSR和SelfMix)相比表现强劲。此外,NORA在未过滤和噪声过滤测试设置下均具有鲁棒性。它在概念名称和时间关系预测上取得了最佳准确率、宏F1和加权F1,同时在尺度和符号预测上保持竞争力。这些结果证明了在考虑真实世界财务申报中标签噪声的同时,联合建模丰富FNE属性的价值。

英文摘要

Financial Numerical Entity (FNE) understanding aims to recover the meaning of numerical mentions in financial reports. Existing studies primarily focus on concept name prediction and face two important limitations. First, labels derived from inline XBRL may contain errors because filings are usually prepared manually. Second, other important FNE attributes, such as reporting-time relation, measurement scale, and accounting sign, are less emphasized. We propose NOise-Robust Tagging for Rich Financial Numerical Entity Attributes (NORA) to address these gaps. NORA uses task-aware instance-specific weighting to attenuate the influence of noisy labels during training, and we further propose the Neighborhood Prior-adjusted KNN (NPK) filtering method for more reliable evaluation on real-world noisy test sets. In addition, we construct a large-scale benchmark containing 6.6 million instances with multi-attribute labels and filing metadata. Experiments show that NORA performs strongly compared with state-of-the-art noisy-label baselines, including Co-teaching, Mixup, SSR, and SelfMix. Moreover, NORA is robust under both unfiltered and noise-filtered test settings. It achieves the best Accuracy, Macro F1, and Weighted F1 for concept name and time-relation prediction, while remaining competitive on scale and sign prediction. These results demonstrate the value of jointly modeling rich FNE attributes while accounting for label noise in real-world financial filings.
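To make the idea of neighbourhood-based noise filtering concrete, here is a minimal sketch of a plain k-nearest-neighbour label-agreement filter. It is a generic technique, not the paper's NPK method: the prior adjustment is not described in the abstract, and the function names, `k`, and `threshold` are illustrative assumptions.

```python
def knn_label_agreement(points, labels, index, k=3):
    """Fraction of the k nearest neighbours of points[index] that share its label."""
    target = points[index]
    distances = []
    for j, point in enumerate(points):
        if j == index:
            continue
        squared = sum((a - b) ** 2 for a, b in zip(target, point))
        distances.append((squared, j))
    distances.sort()
    neighbour_labels = [labels[j] for _, j in distances[:k]]
    matches = sum(1 for label in neighbour_labels if label == labels[index])
    return matches / len(neighbour_labels)

def flag_suspect_labels(points, labels, k=3, threshold=0.5):
    """Indices whose label disagrees with most of their neighbours (hypothetical rule)."""
    return [
        i for i in range(len(points))
        if knn_label_agreement(points, labels, i, k) < threshold
    ]
```

Instances flagged this way could be held out to form a cleaner evaluation subset, which is the role the abstract assigns to filtering.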

2605.24908 2026-05-26 cs.LG cs.AI 版本更新

On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight

论类别不平衡对深度神经网络学习动态的影响:直观洞察

Ismail B. Mustapha, Shafaatunnur Hasan, Sunday O. Olatunji, Hatem S. Y. Nabus

Affiliations: Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia; Adekunle Ajasin University, Akungba-Akoko, Nigeria

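AI summary: By monitoring how deep neural networks learn the majority and minority classes under different imbalance ratios, the study systematically examines how class imbalance drives a model to underfit the minority class in early training while learning only the majority class, so that the minority-class representations it eventually learns are overfitted rather than generalizable.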

Comments Conference

详情
AI中文摘要

近年来,深度神经网络(DNN)中的类别不平衡问题引起了研究者的广泛关注。然而,相关文献中对DNN在不平衡数据上表现不佳的原因存在不同解释,表明人们对这一长期存在的现象如何影响DNN性能知之甚少。更好地理解这一问题对于开发有效的基于DNN的不平衡方法至关重要。因此,本研究通过监测DNN模型在不同不平衡比率数据集上对多数类和少数类的学习模式,系统研究了类别不平衡对DNN学习动态的影响。实验结果表明,与从平衡数据集学习时DNN类似地学习各个类别不同,类别不平衡严重损害了DNN的性能,导致模型在早期训练轮次中欠拟合少数类样本,同时仅学习多数类。尽管DNN最终学会了少数类样本,但这种学习方式仅导致学习到的少数类表示在测试阶段无法泛化,因为它们仅仅是过拟合以尽可能降低整体训练损失。

英文摘要

Class imbalance in deep neural networks (DNNs) has attracted rapidly growing research attention in recent years. However, the varying accounts in the pertinent literature of why DNNs perform poorly on imbalanced data show that little is known about how this age-old phenomenon affects DNN performance. A better understanding of this problem is crucial to developing effective DNN-based imbalance methods. Thus, this study systematically investigates the impact of class imbalance on the learning dynamics of DNNs by monitoring the learning patterns of DNN models on both the majority and minority classes of datasets with varying imbalance ratios. Experimental findings show that, in contrast to learning from balanced datasets, where a DNN learns the classes similarly, class imbalance has a severely deteriorating impact on DNN performance, driving the model to underfit the minority class samples in the early training epochs while simultaneously learning only the majority class. Although the DNN ultimately learns the minority samples, learning in this manner only yields minority representations that do not generalize at test time, because they are merely overfitted to keep the overall training loss as low as possible.
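The kind of monitoring described above can be sketched as a per-class breakdown of the training loss. The helper below is an illustrative assumption, not the authors' code: it takes the probability each sample assigns to its true class and averages the negative log-likelihood separately for each class.

```python
import math
from collections import defaultdict

def per_class_mean_loss(true_class_probs, labels):
    """Mean negative log-likelihood per class.

    true_class_probs[i] is the predicted probability of sample i's true class.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prob, label in zip(true_class_probs, labels):
        totals[label] += -math.log(prob)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}
```

With three majority samples predicted at 0.9 and one minority sample at 0.1, the overall mean loss looks moderate while the minority-class loss stays high, which is the pattern a single aggregate loss curve would hide.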

2605.24902 2026-05-26 cs.CL cs.AI cs.LG 版本更新

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

当推理有害:面向临床SOAP笔记生成的前沿LLM源感知评估

Faizan Faisal

Affiliations: University of California, Davis

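AI summary: Using a source-aware benchmark, the study evaluates reasoning-enabled LLMs on clinical SOAP note generation and finds that enabling reasoning degrades the quality of GPT-5.4 output, while same-source RAG yields small, model-dependent gains.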

详情
AI中文摘要

推理增强型LLM在医学推理基准测试中表现强劲,但这些增益是否能迁移到结构化临床文档尚不清楚;我们通过一个跨OMI Health、ACI-Bench和PriMock57的源感知基准,利用临床对话生成SOAP笔记来研究这一问题。我们在一个2x2受控设计中评估GPT-5.4、DeepSeek-V4-Flash和Gemma-4-E4B,独立切换提供者原生推理和相同源检索增强生成(RAG)。输出使用七种自动指标以及两个参考感知的LLM评判者进行评估。两种评估方法一致认为,非推理的GPT-5.4配置达到最高整体质量,而DeepSeek-V4-Flash在推理增强配置中表现最佳。启用推理显著降低了GPT-5.4在所有三个数据集上的性能,而相同源RAG带来较小的、模型依赖的改进。总体而言,研究结果表明,不应假设更强的推理能力能改善对保真度敏感的SOAP笔记生成,而无需专门的、任务特定的评估。

英文摘要

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.
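The controlled 2x2 design can be written out as a simple cross product: each model is run with reasoning on or off and with RAG on or off. The sketch below only enumerates the configurations; the dictionary keys are illustrative names, not taken from the paper's code.

```python
from itertools import product

MODELS = ["GPT-5.4", "DeepSeek-V4-Flash", "Gemma-4-E4B"]

def build_configurations(models):
    """Cross each model with the two binary factors (reasoning on/off, RAG on/off)."""
    return [
        {"model": model, "reasoning": reasoning, "rag": rag}
        for model, reasoning, rag in product(models, [False, True], [False, True])
    ]
```

Three models times four factor combinations gives twelve configurations per dataset, and toggling one factor at a time is what lets the effect of reasoning be separated from the effect of retrieval.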

2605.24900 2026-05-26 cs.AI 版本更新

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

ProActor: 时序感知强化学习用于主动任务调度智能体

Lei Ding, Bin He, Chenguang Wang, Yang Liu

Affiliations: University of California, Santa Cruz; Zillow Group

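AI summary: The paper proposes ProActor, a framework that uses timing-aware reinforcement learning (combining RULER-based rewards with stage-aware composite rewards) and the efficient training system ART-F to significantly improve the timing quality of proactive task scheduling while maintaining action consistency.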

Comments 47 pages, 31 figures. Accepted to ACL 2026

详情
AI中文摘要

主动任务导向的智能体必须自主预测用户需求、识别可操作的机会,并在适当时刻触发软件动作——从根本上转变依赖显式指令的被动系统。然而,现有方法缺乏可泛化的端到端解决方案来度量和优化这种预期行为。本文介绍了ProActor,一个用于对话任务调度的统一框架,集成了:(1) 一种领域无关的自动标注方法,通过生成完整的机遇时间窗口而非刚性点标签,实现可扩展的主动性强化学习(RL);(2) 系统性的主动性指标,同时捕获时序质量和参考动作对齐;(3) 使用GRPO及多种奖励设计的RL优化。我们的洞察是,基于RULER的奖励结合主动性评分准则对提升时序质量至关重要,而由阶段感知复合奖励实现的主动性优化是平衡时序质量和参考动作对齐的关键。时序感知RL需要大量探索,这要求高效的基础设施。我们开发了ART-F,一种自适应框架,将请求自适应推理集群与单节点多GPU系统上的DDP训练相结合,实现了4位Qwen2.5-14B-ProActor-Q4的LoRA训练,加速4-8倍。在两个新自动标注数据集上的实验表明,在保持与最先进(SOTA)基线相当的动作一致性的同时,主动时序显著提升。消融实验验证了不同复合奖励变体的有效性。

英文摘要

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments, a fundamental shift away from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, demanding efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.
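One way to see why opportunity windows are easier to optimize against than point labels is a timing reward that is flat inside the window and decays outside it. The function below is a hypothetical sketch under that assumption; the linear decay and the `tolerance` parameter are not from the paper, whose rewards are RULER-based and composite.

```python
def timing_reward(trigger_time, window_start, window_end, tolerance=5.0):
    """Reward 1.0 inside the opportunity window, decaying linearly to 0.0 outside it.

    `tolerance` is an assumed slack (in the same time units) before the reward hits zero.
    """
    if window_start <= trigger_time <= window_end:
        return 1.0
    if trigger_time < window_start:
        gap = window_start - trigger_time
    else:
        gap = trigger_time - window_end
    return max(0.0, 1.0 - gap / tolerance)
```

With a point label, any trigger that is not exactly on time looks wrong; with a window, every trigger inside the interval earns full credit and near misses still carry a learning signal.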

2605.24899 2026-05-26 cs.AI 版本更新

TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps

TaBIIC2:使用加权自组织映射交互式构建本体分类

Mathieu d'Aquin

Affiliations: LORIA, CNRS, Université de Lorraine

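AI summary: The paper presents a tool that uses weighted self-organizing maps as a clustering method to let users progressively and interactively build a taxonomy of concepts from tabular data and define the intension of each concept, offering a middle ground between purely manual analysis and automatic methods.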

详情
AI中文摘要

本体表示一个领域的概念知识。本体的核心是概念和子概念的分类,这些概念代表特定实体,构建起来可能很复杂。在许多情况下,信息以记录形式提供,描述相关实体的特征,即表格数据。识别此类数据中的模式和相似性可以作为识别概念并组织它们的基础。然而,手动执行此操作可能具有挑战性,而纯自动方法(如凝聚聚类或依赖大型语言模型分析数据)可能会让用户面对大量结果且控制力不足。在本文中,我们描述了一种工具,通过识别聚类及其内涵定义,支持逐步交互式构建概念分类。为此,我们依赖加权自组织映射作为聚类方法,因为它们能够创建任意数量的聚类,这些聚类在聚类实体特定特征的值分布方面具有区分性。我们表明,通过集成这种机制和其他机制来快速创建将表格数据中的实例分组的概念,该工具代表了在纯手动分析和自动方法之间构建本体分类的中间地带。

英文摘要

Ontologies represent the conceptual knowledge of a domain. At the core of an ontology is the taxonomy of concepts and subconcepts that represent specific entities, which can be complex to build. In many cases, information is available in the form of records describing the characteristics of relevant entities, i.e., tabular data. Identifying patterns and similarities in such data can serve as a basis for identifying concepts and organizing them. However, doing so manually can be challenging, and purely automatic approaches, such as agglomerative clustering or relying on a large language model to analyze the data, can leave the user with overwhelming results and little control. In this paper, we describe a tool that enables the progressive and interactive construction of a taxonomy of concepts by identifying clusters as well as their intensional definitions. To do so, we rely on weighted self-organizing maps as a clustering method because they enable the creation of an arbitrary number of clusters that are distinct with respect to the distributions of values of specific characteristics of the clustered entities. We show that, by integrating this mechanism and others for rapidly creating concepts that group together instances from tabular data, this tool represents a middle ground between purely manual analysis and automatic methods for building ontological taxonomies.
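A minimal sketch of how feature weights could steer a self-organizing map toward clusters that differ on chosen characteristics: the best-matching unit is picked with a weighted distance, then pulled toward the sample. This assumes "weighted" means per-feature weights in the distance, which the abstract does not confirm, and it omits the neighbourhood update of a full SOM.

```python
def best_matching_unit(sample, units, feature_weights):
    """Index of the map unit closest to the sample under a feature-weighted squared distance."""
    def weighted_distance(unit):
        return sum(w * (s - u) ** 2 for w, s, u in zip(feature_weights, sample, unit))
    return min(range(len(units)), key=lambda i: weighted_distance(units[i]))

def som_update(sample, unit, learning_rate):
    """Move a unit's vector a fraction of the way toward the sample."""
    return [u + learning_rate * (s - u) for s, u in zip(sample, unit)]
```

Changing the weights changes which unit wins for the same record, which is how a user could ask for clusters that are distinct on one column rather than another.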

2605.24883 2026-05-26 cs.AI cs.CR cs.SE 版本更新

Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

反转盾牌:从策略规范中系统生成安全测试

Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu, Kuntai Cai, Yan Xiao, Jin Song Dong

Affiliations: Shenzhen Campus of Sun Yat-sen University; National University of Singapore; Independent Researcher

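AI summary: The paper proposes POLARIS, a framework that compiles unstructured natural-language policies into first-order logic representations and builds a Semantic Policy Graph to enable coverage-driven, reproducible safety testing, achieving higher policy coverage and attack success counts than baselines.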

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
AI中文摘要

大型语言模型(LLMs)的广泛集成需要严格且系统的安全评估。现有范式要么依赖构建的基准从预定义角度评估安全性,要么采用动态红队探测潜在漏洞。虽然有效,但这些方法面临挑战,因为它们严重依赖专家领域知识,提供的系统保证有限,且容易快速过时。为解决这些限制,我们引入了一个新颖框架POLARIS,将基于规范的软件测试的严谨性引入AI安全。POLARIS首先将非结构化自然语言策略编译为一阶逻辑(FOL)表示,建立高层规则与具体测试用例之间的可追溯链接。这种形式化使得能够构建语义策略图,其中复杂的策略违规场景被编码为可遍历路径。通过系统地探索该图,POLARIS发现组合违规模式,然后将其实例化为可执行的自然语言测试查询,实现覆盖驱动且可重复的安全测试。实验表明,与已建立的基线相比,POLARIS实现了更高的策略覆盖率和攻击成功次数。关键是,通过连接形式化方法和AI安全,POLARIS提供了一种有原则的自动化方法,确保LLMs遵守安全关键策略,并具有可验证的可追溯性。我们在https://github.com/huac-lxy/POLARIS发布代码。

英文摘要

The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety-critical policies with verifiable traceability. We release our code at https://github.com/huac-lxy/POLARIS.
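The idea of encoding violation scenarios as traversable paths can be illustrated with a plain depth-first enumeration over a small directed acyclic graph. The graph contents and the function name are invented for illustration; POLARIS's actual graph construction and traversal strategy are not described in the abstract.

```python
def enumerate_paths(graph, start):
    """Return every path from `start` to a node with no outgoing edges.

    `graph` maps a node to its list of successors and is assumed to be acyclic.
    """
    successors = graph.get(start, [])
    if not successors:
        return [[start]]
    paths = []
    for nxt in successors:
        for tail in enumerate_paths(graph, nxt):
            paths.append([start] + tail)
    return paths
```

Each enumerated path would correspond to one candidate scenario to instantiate as a test query, and the share of paths exercised gives a natural coverage measure.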

2605.24873 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Towards a Universal Causal Reasoner

迈向通用因果推理器

Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, Chenhao Tan

Affiliations: The University of Chicago; University of Illinois Urbana-Champaign

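AI summary: The paper proposes UniCo, a data generation framework that covers 18 query types across Pearl's Causal Ladder and translates symbolic examples into code and natural language; supervised fine-tuning on its data significantly improves the causal reasoning ability and reasoning faithfulness of LLMs.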

详情
AI中文摘要

尽管因果推理的重要性不言而喻,但训练LLM进行因果推理仍未被充分探索。现有的数据工作大多集中在针对因果关系的特定方面对LLM进行基准测试,这使得它们不太适合训练可泛化的因果推理器。为了解决这个问题,我们提出了UniCo,一个数据生成框架,它既(1)涵盖了Pearl因果阶梯中的18种因果查询类型,又(2)将原生符号示例转化为代码和自然语言形式,以模拟因果术语未明确指定的真实世界用例。为确保数据质量,UniCo用精确的因果推理来支撑答案,并过滤掉存在推理捷径的案例。通过使用66.6K个UniCo生成的实例进行监督微调,Qwen3-4B、Qwen3-8B和Olmo-3-7B-Instruct在所有18种分布内查询类型上平均提升了22.9%,在训练分布之外的7个已建立的因果基准上,相比最先进的因果数据生成框架提升了8.1%。更重要的是,在真实世界的医学理解、法律决策和表格推理中,UniCo训练的模型始终展现出更忠实的推理轨迹,在忠实度指标上平均超过基础模型20.2%。这些结果表明,以因果为中心的训练不仅增强了因果推理能力,还赋予了LLM在一般推理任务中的因果思维。

英文摘要

Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl's Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. After supervised fine-tuning on 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B, and Olmo-3-7B-Instruct achieve an average improvement of 22.9% across all 18 in-distribution query types, and of 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These results suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.

2605.24867 2026-05-26 cs.AI cs.CL cs.NI 版本更新

Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning

聚类即推理:思维链图学习的 $k$-均值解释

Xuanting Xie, Zhaochen Guo, Bingheng Li, Xingtong Yu, Zhifei Liao, Zhao Kang, Yuan Fang

Affiliations: University of Electronic Science and Technology of China; Singapore Management University; Michigan State University; The Chinese University of Hong Kong

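AI summary: The paper proposes KCoT, a framework that establishes a mathematical correspondence between a Transformer block and the $k$-means algorithm to unify chain-of-thought reasoning with graph representation learning, enabling iterative semantic-topological interaction and outperforming existing methods on standard benchmarks.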

Comments Accepted by ICML 2026

详情
AI中文摘要

思维链(CoT)提示在增强大型语言模型(LLMs)对文本属性图(TAGs)的推理能力方面显示出潜力。本文通过聚类即推理的原则重新审视基于CoT的图学习,提供了关于迭代推理如何在图结构数据上运行的$k$-均值解释。我们观察到现有的图CoT方法依赖于分离的架构和固定的图表示,限制了逐步的语义-拓扑交互和可解释性。为克服这一限制,我们提出了一个名为KCoT的统一框架,将CoT推理与图表示学习相结合。我们的关键理论结果揭示了Transformer块与$k$-均值算法之间的形式数学对应,使得推理可以被解释为迭代的分配和更新步骤。基于这一见解,我们引入了一个语义判别提示,明确将这些步骤形式化为结构化的CoT推理,并采用结构对齐策略将拓扑先验与演化的思维条件表示融合。在标准基准上的实验表明,与最先进的方法相比,该方法持续改进,验证了聚类作为基于CoT的图学习的原则性机制。

英文摘要

Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a $k$-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the $k$-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.
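For readers who want the two $k$-means steps that the abstract maps reasoning onto, here is one iteration of the standard (Lloyd) algorithm in plain Python. It shows only the classical assignment and update steps; the correspondence to a Transformer block is the paper's result and is not reproduced here.

```python
def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    def squared_distance(point, centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))
        for point in points
    ]

def update(points, assignments, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for index, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == index]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids
```

Repeating `assign` then `update` until the assignments stop changing is the iterative loop that the abstract interprets as step-by-step reasoning.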

2605.24860 2026-05-26 eess.SY cs.AI cs.ET cs.LG cs.RO cs.SY 版本更新

DBPnet: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Wheel Load Estimation

DBPnet:基于阻尼特性的贝叶斯物理信息神经网络用于车轮载荷估计

Tianyi Wang, Tianyi Zeng, Zimo Zeng, Feiyang Zhang, Yujin Wang, Xiangyu Li, Yiming Xu, Sikai Chen, Junfeng Jiao, Christian Claudel, Xinbo Chen

Affiliations: Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; College of Electrical Engineering, Zhejiang University; School of Automotive Studies, Tongji University; School of Architecture, The University of Texas at Austin; Department of Civil and Environmental Engineering, University of Wisconsin-Madison

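AI summary: The paper proposes DBPnet, a Bayesian physics-informed neural network with a damper characteristics-inspired embedding module, which achieves robust wheel load estimation through suspension linkage-level modeling and a physics-informed loss function.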

Comments 14 pages, 12 figures, 6 tables

详情
AI中文摘要

高级驾驶辅助系统(ADAS)在现代汽车智能化中扮演重要角色,显著提升车辆安全性和稳定性。ADAS的性能关键依赖于准确可靠的车辆状态估计,特别是来自车辆动态传感器的信号。在这些信号中,车轮载荷是底盘控制和安全关键功能的关键变量,但由于复杂的悬架几何结构、非线性动力学和测量噪声,难以鲁棒估计。为解决此问题,我们提出DBPnet,一种贝叶斯物理信息神经网络(PINN),其具有受阻尼特性启发的物理感知嵌入模块。首先,本文提出一种悬架连杆级建模(SLLM)方法,通过显式考虑悬架的复杂几何结构,构建非线性瞬时动态模型。在SLLM基础上,将贝叶斯推断集成到PINN中,有效应对车辆底盘系统中的噪声和不确定性,从而提高模型的鲁棒性。然后,采用物理信息损失函数确保与基本物理原理的一致性,同时受阻尼特性启发的嵌入模块提取输入信号的时间变化特征,并将其融入PINN的每一层,确保物理观测指导神经网络而不受固定物理模型的约束。在高保真仿真和真实世界实验上的广泛评估表明,我们的DBPnet在RMSE和MaxError上始终低于基线方法。这些结果凸显了我们的DBPnet在推进车轮载荷估计和为更可靠的ADAS执行器功能发展做出贡献的潜力。

英文摘要

Advanced driver assistance systems (ADAS) play an important role in modern automotive intelligence, significantly enhancing vehicle safety and stability. The performance of ADAS critically relies on accurate and reliable vehicle state estimation, particularly from vehicle dynamic sensors. Among these signals, wheel load is a key variable for chassis control and safety-critical functions, yet it remains difficult to estimate robustly due to complex suspension geometry, nonlinear dynamics, and measurement noise. To address this issue, we propose DBPnet, a Bayesian physics-informed neural network (PINN) with a physics-aware embedding module inspired by damper characteristics. First, this paper presents a suspension linkage-level modeling (SLLM) approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon SLLM, Bayesian inference is integrated into the PINN to effectively cope with noise and uncertainty in the vehicle chassis system, thereby improving the model's robustness. Then, a physics-informed loss function is employed to ensure consistency with fundamental physical principles, while the damper characteristics-inspired embedding module extracts temporal variation features of input signals and incorporates them into each layer of the PINN, ensuring that physical observations guide the neural network without being constrained by fixed physical models. Extensive evaluations on high-fidelity simulations and real-world experiments demonstrate that our DBPnet consistently achieves lower RMSE and MaxError than baseline methods. These results highlight the potential of our DBPnet to advance wheel load estimation and contribute to the development of more reliable ADAS actuator functions.
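A physics-informed loss generally adds a penalty for violating a known physical relation to the usual data-fitting term. The sketch below assumes, purely for illustration, a linear damper law (force equals damping coefficient times velocity); the paper's actual suspension model is nonlinear and is not given in the abstract.

```python
def physics_informed_loss(predicted, measured, velocity, damping_coeff, physics_weight):
    """Data MSE plus a weighted penalty on the residual of an assumed linear damper law.

    The relation force = damping_coeff * velocity is a hypothetical stand-in
    for the physical constraints used in the paper.
    """
    n = len(predicted)
    data_term = sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n
    physics_term = sum((p - damping_coeff * v) ** 2 for p, v in zip(predicted, velocity)) / n
    return data_term + physics_weight * physics_term
```

The weight trades off trust in noisy measurements against trust in the physical model, which is the balance a Bayesian treatment is meant to handle more systematically.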

2605.24856 2026-05-26 cs.LG cs.AI 版本更新

The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth

概念分配区:追踪概念如何跨越Transformer深度形成

James Henry

发表机构 * Independent Researcher(独立研究者)

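AI summary: The paper proposes the Concept Allocation Zone (CAZ) framework, which uses layer-wise metrics (Separation, Concept Coherence, Concept Velocity) to detect the depth interval over which a concept gradually forms in the residual stream, and validates on 34 models that separation curves are multimodal and that gentle CAZes are causally active.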

Comments 34 models, 8 architectural families, 7 concepts. Companion papers: GEM (arXiv forthcoming), CAZ Validation (arXiv forthcoming), PRH Validation (arXiv forthcoming). Code: https://github.com/jamesrahenry/Rosetta_Tools

详情
AI中文摘要

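Transformer语言模型中的概念形成是深度扩展的,而非单层事件:概念在残差流的连续区域内逐渐出现。机制可解释性方法识别出类别分离峰值所在的单层(即“最佳层”),捕捉到的只是一个快照,而非过程本身。我们引入概念分配区(CAZ):概念变得可测量地可分离的深度区间,即分配给其几何表达的区域。我们通过三个逐层度量(分离度、概念一致性、概念速度)对CAZ进行形式化,并推导出无需手动逐层扫描的原则性边界检测方法。CAZ并不是概念:它是模型组织其几何结构以使概念可分离的深度区域。单个概念通常参与多个CAZ;多个概念也可能共享一个CAZ。在来自8个架构家族的34个模型和7个概念上的实证验证表明,分离曲线S(l)常常是多模态的。一个评分检测器发现了“温和CAZ”,即标准峰值检测无法察觉的细微分配区域,但它们在消融实验下93-100%的情况中具有因果活性(34个模型中的16个;配套验证论文中为26个)。该框架产生了七个可检验的预测:其中四个得出明确结论(两个不支持、一个部分支持、一个支持),一个的前提条件被数据否定,两个统计功效不足;在留一概念交叉验证下,跨架构对齐被确认为深度匹配而非整体式。参考实现:rosetta_tools v1.3.1(doi:10.5281/zenodo.20361433)。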

英文摘要

Concept formation in transformer language models is depth-extended, not a single-layer event: concepts emerge gradually across a contiguous region of the residual stream. Mechanistic interpretability methods identify the single layer of peak class separation -- the "best layer" -- capturing a snapshot rather than the process itself. We introduce the Concept Allocation Zone (CAZ): the depth interval within which a concept becomes measurably separable, the region allocated to its geometric expression. We formalize the CAZ through three layer-wise metrics (Separation, Concept Coherence, Concept Velocity) and derive principled boundary detection without manual layer sweeps. A CAZ is not a concept: it is the depth region within which the model organizes its geometry to make a concept separable. A single concept typically participates in multiple CAZes; multiple concepts may share one. Empirical validation across 34 models from 8 architectural families and 7 concepts reveals that the separation curve S(l) is frequently multimodal. A scored detector uncovers "gentle CAZes" -- subtle allocation regions invisible to standard peak detection but causally active in 93-100% of cases under ablation (16 of 34 models; 26 in the companion validation paper). The framework generates seven testable predictions; four yield clear verdicts (two not supported, one partially supported, one supported), one had its precondition invalidated by the data, and two are underpowered -- with cross-architecture alignment confirmed as depth-matched rather than monolithic under leave-one-concept-out cross-validation. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
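To illustrate what a depth interval means in contrast to a single best layer, the sketch below extracts every contiguous run of layers whose separation score meets a threshold. The thresholding rule is a simplified assumption for illustration; it is not the scored detector or the boundary detection of the paper, and the example curve is made up.

```python
def find_zones(separation, threshold):
    """Contiguous layer intervals (start, end) where separation[layer] >= threshold."""
    zones = []
    start = None
    for layer, value in enumerate(separation):
        if value >= threshold and start is None:
            start = layer
        elif value < threshold and start is not None:
            zones.append((start, layer - 1))
            start = None
    if start is not None:
        zones.append((start, len(separation) - 1))
    return zones
```

On a multimodal curve this returns more than one interval, whereas taking the argmax alone would report a single layer and miss the second region entirely.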

2605.24845 2026-05-26 cs.AI math.CO 版本更新

Solving Combinatorial Counting Problems with Weighted First-Order Model Counting

使用加权一阶模型计数解决组合计数问题

Yuanhong Wang, Juhua Pu, Yuxu Zhou, Yuyi Wang, Ondřej Kuželka

Affiliations: School of Artificial Intelligence, Jilin University, Changchun, China; State Key Laboratory of Complex & Critical Software Environment, Beihang University, China; National Research Center for Educational Materials, China; Tengen Intelligence Institute, China; Czech Technical University in Prague, Prague, Czech Republic

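AI summary: The paper proposes Cofola, a typed declarative language with a compilation pipeline to weighted first-order model counting (WFOMC) that handles combinatorial counting problems over sets, multisets, permutations, partitions, and related objects in a uniform way.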

Comments 47 pages, 9 figures

详情
AI中文摘要

组合计数问题遍及人工智能、统计学和离散数学。无论是枚举子集、多重集、排列、划分还是在结构和算术约束下的组合,解决它们仍然是一项顽固的手动练习。封闭形式的推导强大但脆弱,而将问题朴素编码为命题模型计数或约束满足会破坏使计数易于处理的交换性。我们提出了Cofola(组合计数语言与一阶逻辑),一种类型化声明式语言,其原语是日常计数问题中反复出现的组合对象,包括集合、袋子、元组、序列、圆圈、划分和组合,以及它们之上的自然关系和算术约束。指称语义将每个Cofola程序映射到一个明确定义的组合计数问题,一个三阶段编译流水线(预处理、分解和对称保持编码)将该问题简化为一个加权一阶模型计数(WFOMC)实例,并附加系数提取约束。为了尽可能保持在已知的可域提升片段内,编码将不可区分的实体分组,按字典序打破无序分组的对称性,并通过顺序公理编码序列和圆圈。在一系列代表性的组合计数问题上,从教科书数学问题到最接近的先前框架无法表达的多对象场景,Cofola生成了简洁的规范和统一的求解流水线,端到端实用。

英文摘要

Combinatorial counting problems pervade artificial intelligence, statistics, and discrete mathematics. Whether the task is enumerating subsets, multisets, permutations, partitions, or compositions under structural and arithmetic constraints, solving it remains a stubbornly manual exercise. Closed-form derivations are powerful but brittle, while naive encodings to propositional model counting or constraint satisfaction destroy the exchangeability that makes counting tractable in the first place. We present Cofola (COmbinatorial counting LAnguage with First-Order logic), a typed declarative language whose primitives are the combinatorial objects that recur in everyday counting questions, including sets, bags, tuples, sequences, circles, partitions, and compositions, together with natural relational and arithmetic constraints over them. A denotational semantics maps every Cofola program to a well-defined combinatorial counting problem, and a three-phase compilation pipeline (preprocessing, decomposition, and symmetry-preserving encoding) reduces this problem to a weighted first-order model counting (WFOMC) instance augmented with coefficient-extraction constraints. To stay inside known domain-liftable fragments whenever possible, the encoding groups indistinguishable entities, breaks the symmetry of unordered groupings lexicographically, and encodes sequences and circles via order axioms. On a suite of representative combinatorial counting problems, ranging from textbook math problems to multi-object scenarios that the closest prior framework cannot express, Cofola produces concise specifications and a uniform solving pipeline that is practical end-to-end.
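The contrast between naive enumeration and a structure-exploiting count can be seen on a textbook problem: counting derangements (permutations with no fixed point). The sketch below is plain Python, not Cofola or a WFOMC solver; it compares brute-force enumeration, whose cost grows factorially, with the standard recurrence D(n) = (n - 1)(D(n - 1) + D(n - 2)).

```python
from itertools import permutations

def count_derangements_brute_force(n):
    """Enumerate all n! permutations and keep those with no fixed point."""
    return sum(
        1 for perm in permutations(range(n))
        if all(perm[i] != i for i in range(n))
    )

def count_derangements_recurrence(n):
    """Linear-time count via D(n) = (n - 1) * (D(n - 1) + D(n - 2))."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    two_back, one_back = 1, 0
    for k in range(2, n + 1):
        two_back, one_back = one_back, (k - 1) * (one_back + two_back)
    return one_back
```

The recurrence works because the elements are interchangeable, which is the same exchangeability that lifted model counting is designed to preserve.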

2605.24844 2026-05-26 cs.AI cs.CL 版本更新

Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning

Geo-Expert: 通过参数高效微调实现专家级地质推理

Chenyou Guo, Zongqi Liu, Yizhou Zhang, Zhaorui Jiang, Ze Liu

发表机构 * Ocean University of China(中国海洋大学) Peking University(北京大学) Monash University(墨尔本大学)

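AI summary: The paper proposes Geo-Expert, which fine-tunes small language models on a custom high-quality instruction dataset using parameter-efficient fine-tuning (LoRA); on the dedicated geological reasoning benchmark Geo-Eval, the 8B model outperforms 70B generalist models and GPT-4o, and the 32B model approaches frontier reasoning models.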

Comments 11 pages, 1 figure, 3 tables. Accepted at ICML 2026 AI for Science Workshop

详情
AI中文摘要

虽然应用于地质学的通用大语言模型(LLM)在推理地下结构和深时演化时常常产生幻觉,但目前地球科学中的人工智能主要针对地表遥感和GIS。为弥补这一差距,我们引入了Geo-Expert,这是一个参数高效的地质LLM系列,基于我们自定义指令合成流程处理的自定义策划高质量指令数据集进行微调。我们通过使用低秩适配(LoRA)方法微调三个基础模型:Qwen3-8B、Qwen3-32B和Gemma-3-27B,研究了模型缩放和架构的影响。我们在新的领域特定基准Geo-Eval上的广泛评估表明,领域对齐的8B模型在专门的地质推理上可以超越开放权重的70B通用模型和专有的GPT-4o,而32B变体接近前沿推理模型。优化后的8B模型进一步为部署提供了具有竞争力的性价比。这项工作为科学LLM的民主化提供了可复现的配方,并为地质人工智能建立了基线。

英文摘要

While general-purpose Large Language Models (LLMs) applied to geology often hallucinate when reasoning about subsurface structures and deep-time evolution, current AI in Earth sciences predominantly targets surface remote sensing and GIS. To bridge this gap, we introduce Geo-Expert, a family of parameter-efficient geological LLMs fine-tuned on a curated, high-quality instruction dataset processed with our own instruction synthesis pipeline. We investigate the impact of model scaling and architecture by fine-tuning three base models, Qwen3-8B, Qwen3-32B, and Gemma-3-27B, using Low-Rank Adaptation (LoRA). Our extensive evaluation on a novel domain-specific benchmark, Geo-Eval, reveals that a domain-aligned 8B model can outperform open-weight 70B generalists and proprietary GPT-4o on specialized geological reasoning, while a 32B variant approaches frontier reasoning models. The optimized 8B model further offers a competitive cost-performance ratio for deployment. This work provides a reproducible recipe for democratizing scientific LLMs and establishes a baseline for geological artificial intelligence.
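Why LoRA counts as parameter-efficient can be shown with simple arithmetic: a weight matrix of shape d_out x d_in is frozen, and only two low-rank factors of shapes d_out x r and r x d_in are trained. The layer size and rank below are example values chosen for illustration, not the paper's configuration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of one LoRA adapter: rank * (d_out + d_in)."""
    return rank * (d_out + d_in)

def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the frozen d_out x d_in weight matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)
```

For a 4096 x 4096 projection with rank 8, the adapter has 65,536 trainable parameters against 16,777,216 frozen ones, a little under 0.4% of the layer.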

2605.24843 2026-05-26 cs.CV cs.AI 版本更新

Adversarial Error Correction for Visual Autoregressive Generation

视觉自回归生成的对抗性纠错

Ligong Bi, Tao Huang, Jianyuan Guo, Chang Xu

Affiliations: Shanghai Jiao Tong University; City University of Hong Kong; The University of Sydney

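AI summary: The paper proposes AID-VAR, a framework that uses an Adversarially Injected Diagnosis mechanism to correct cascading errors in visual autoregressive models and improve generation quality.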

详情
AI中文摘要

视觉自回归(VAR)模型通过执行层次化的下一尺度预测,已成为图像合成的强大范式。然而,VAR模型天生容易产生级联误差传播,其中细微的粗尺度误预测会在层次结构中放大,最终扭曲最终合成。为了缓解这一问题,我们提出了AID-VAR,一个即插即用的框架,通过对抗性注入诊断增强预训练的VAR。与标准的被动生成不同,AID-VAR引入了一种主动纠错机制,灵感来自GAN中的对抗性反馈。我们部署了一个判别器来诊断每个尺度转换处的保真度差距,并配有一个轻量级的引导注入器。该模块作为一个非侵入式适配器,优化冻结的VAR骨干网络的特征流形,有效引导生成朝向真实图像的分布,同时不破坏预训练潜在空间的稳定性。此外,为了严格评估这种跨尺度进展,我们引入了跨尺度一致性得分(ISCS),这是一个新的度量标准,用于量化连续分辨率尺度之间的保真度和结构对齐。在各种骨干网络上的实验结果表明,AID-VAR以可忽略的开销提供了更清晰的纹理细节和更少的结构失真。例如,AID-VAR-d20在参数仅增加3%的情况下,FID提升了16%。这些结果确立了AID-VAR作为升级大规模VAR生成器的高效且可扩展的途径,在不改变训练数据、基础架构或采样调度的情况下,增强了全局连贯性和局部细节。代码可在https://github.com/bijiw515/AID-VAR获取。

英文摘要

Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse-scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID-VAR, a plug-and-play framework that enhances pre-trained VARs through Adversarially Injected Diagnosis. Instead of standard passive generation, AID-VAR introduces a proactive error-correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. This module operates as a non-invasive adapter that refines the feature manifold of a frozen VAR backbone, effectively steering the generation toward the distribution of real images without destabilizing the pre-trained latent space. Furthermore, to rigorously evaluate this cross-scale progression, we introduce the Inter-Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID-VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID-VAR-d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID-VAR as a highly efficient and scalable pathway for upgrading large-scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at https://github.com/bijiw515/AID-VAR.

2605.24834 2026-05-26 cs.CR cs.AI 版本更新

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Reflect-Guard: 通过逻辑自我反思增强大语言模型对对抗性提示的防护

Lixing Lin, Juli You, Yue Li, Luyun Lin, Yiqing Wang, Zhen Zhang, Moxuan Zheng

发表机构 * Yale University(耶鲁大学) Columbia University(哥伦比亚大学) Citigroup(摩根大通) Independent Researcher(独立研究者)

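AI summary: The paper proposes Reflect-Guard, a method that uses parameter-efficient fine-tuning to give LLM safety classifiers chain-of-thought self-reflection capabilities, significantly improving detection of adversarial jailbreak attacks.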

Comments 12 pages, 2 figures, and 4 tables

详情
AI中文摘要

大语言模型安全分类器(如Llama Guard)能有效检测明显有害的提示,但对通过角色扮演场景、虚构框架和间接请求伪装恶意意图的对抗性越狱攻击仍然脆弱。我们提出Reflect-Guard,一种通过参数高效微调为大语言模型安全分类器注入链式思维自我反思能力的方法。我们的方法从GPT-4o-mini中提炼分析推理能力,形成结构化反思注释,然后通过QLoRA训练Llama-Guard-3-8B,使其在发布安全判决前生成逻辑自我反思。仅使用1000个训练样本并更新0.5%的模型参数(约4200万),Reflect-Guard在两个具有挑战性的基准测试上取得了显著改进。在WildGuardTest上,F1分数从0.770提升至0.842(+7.2个百分点),对抗性提示的召回率从0.513提升至0.921(+40.8个百分点)。在JailbreakBench上,攻击成功率从10.3%降至1.8%,相对降低82.5%。这些增益在对抗性输入上尤为明显,显式的推理步骤使模型能够看穿击败标准模式匹配方法的混淆技术。我们的结果表明,教会安全分类器推理对抗性意图,而非简单分类表面模式,是实现鲁棒大语言模型安全性的有前景方向。

英文摘要

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
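The reported deltas can be re-derived from the before/after figures quoted in the abstract. The sketch below is only an arithmetic check; the helper names `pp_gain` and `relative_reduction` are illustrative and not taken from the paper.

```python
def pp_gain(before, after):
    """Absolute gain in percentage points for metrics reported on a 0-1 scale."""
    return round((after - before) * 100, 1)


def relative_reduction(before, after):
    """Relative reduction, in percent, of a rate that went from `before` to `after`."""
    return round((before - after) / before * 100, 1)


# Figures quoted in the abstract.
f1_gain = pp_gain(0.770, 0.842)            # WildGuardTest F1
recall_gain = pp_gain(0.513, 0.921)        # recall on adversarial prompts
asr_drop = relative_reduction(10.3, 1.8)   # JailbreakBench attack success rate

print(f1_gain, recall_gain, asr_drop)
```

All three values agree with the abstract (+7.2 pp, +40.8 pp, 82.5% relative reduction).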

2605.24831 2026-05-26 cs.CV cs.AI 版本更新

Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26

无NMS时代的实时多尺度目标检测:YOLOv8与YOLO26的对比性能评估

Chidera G. Oguine, Kanyifeechukwu J. Oguine, Obiozor M. Oguine, Ozioma C. Oguine

发表机构 * University of Abuja(阿布贾大学) Vanderbilt University(范德比大学) University of Notre Dame(圣母大学)

AI总结 本文在Pascal VOC和VisDrone数据集上,从准确率、定位、模型大小、计算量和延迟等维度,系统比较了基于NMS的YOLOv8与无NMS的YOLO26在多尺度下的性能,发现YOLO26在多数尺度上检测更强且模型复杂度更低,但在密集小目标场景下优势缩小,且YOLOv8在GPU延迟上仍有竞争力。

Comments 11 pages, 6 tables, 9 figures

详情
AI中文摘要

非极大值抑制(NMS)仍然是许多实时目标检测流程中的关键后处理步骤,但在资源受限的环境中可能引入延迟变化和部署复杂性。最近的无NMS设计(如YOLO26)旨在通过端到端检测减少这种依赖,然而与基于NMS的成熟模型(如YOLOv8)相比,其性能在标准基准之外尚未得到充分探索。本文在Pascal VOC和VisDrone上比较了YOLOv8和YOLO26,这两个数据集分别代表通用目标检测和密集空中小目标检测。两个模型家族在五个尺度上使用准确率、定位、模型大小、GFLOPs以及CPU/GPU延迟进行评估。结果表明,YOLO26在Pascal VOC上的大多数尺度上实现了更强的检测性能和更低的模型复杂度,而在VisDrone上性能差距缩小,两个模型在处理密集小目标时均表现困难。YOLOv8在GPU延迟上仍具有竞争力,表明无NMS设计并不能保证普遍的部署优势。总体而言,研究表明检测器的选择取决于数据集特征、目标尺度、模型容量和硬件约束。

英文摘要

Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency variation and deployment complexity in resource-constrained settings. Recent NMS-free designs such as YOLO26 aim to reduce this dependence through end-to-end detection, yet their performance relative to established NMS-based models such as YOLOv8 remains underexplored beyond standard benchmarks. This paper compares YOLOv8 and YOLO26 on Pascal VOC and VisDrone, representing general object detection and dense aerial small-object detection, respectively. Both model families are evaluated across five scales using accuracy, localization, model size, GFLOPs, and CPU/GPU latency. Results show that YOLO26 achieves stronger detection performance and lower model complexity on Pascal VOC across most scales, while the performance gap narrows on VisDrone, where both models struggle with dense small targets. YOLOv8 remains competitive in GPU latency, showing that NMS-free design does not guarantee universal deployment superiority. Overall, the study shows that detector selection depends on dataset characteristics, object scale, model capacity, and hardware constraints.
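For readers unfamiliar with the post-processing step that NMS-free detectors remove, here is a minimal sketch of classical greedy NMS. It is the textbook algorithm, not code from YOLOv8 or YOLO26; the box format `(x1, y1, x2, y2)` and the 0.5 threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first (IoU about 0.68)
```

The loop runs a data-dependent number of comparisons, which is one source of the latency variation the abstract mentions.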

2605.24823 2026-05-26 cs.AI 版本更新

Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Agent制造:基础模型Agent作为一级工业实体

Yilei Zhang

发表机构 * University of Canterbury(坎特伯雷大学)

AI总结 本文提出Agent制造范式,即基础模型Agent通过解释开放目标、长程规划、调用工具和机器、与其他Agent及人类协商来协调生产,从而将工业中的人类协调认知工作自动化。

详情
AI中文摘要

制造业已经历了四个广泛认可的范式——机械化、电气化、可编程自动化和智能制造——每个范式都定义了从人类转移到机器的工作类型。在每种情况下,有一层工业工作仍然基本上由人类完成:生产的协调认知,包括工程师、规划师和运营经理所执行的解释、分配、诊断、协商和治理工作。我们认为,第五次转型正在进行中,其中这一层(而非其下的物理或常规认知层)正是基于基础模型的自主Agent主要重新分配的对象。我们将这一范式命名为Agent制造,并操作性地定义:当一个制造系统的主要协调机制是由基础模型Agent执行的推理,这些Agent能够解释开放目标、在长周期内规划、调用工具和机器、并与其他Agent和人类协商时,该系统就是Agent制造的一个实例。这一定义比现有的认知制造或工业5.0文献更窄且更可证伪,并且它将该范式与经典的多Agent制造系统(后者仅在封闭协议空间内自主)明确区分开来。

英文摘要

Manufacturing has passed through four widely recognized paradigms (mechanization, electrification, programmable automation, and Smart Manufacturing), each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

2605.24812 2026-05-26 cs.AI 版本更新

CoRe-Code: Collaborative Reinforcement Learning for Code Generation

CoRe-Code:面向代码生成的协作式强化学习

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Xiaoyu Xia, Sumon Biswas

发表机构 * The Ohio State University(俄亥俄州立大学) Royal Melbourne Institute of Technology(皇家墨尔本理工学院)

AI总结 提出CoRe-Code框架,通过规划器-编码器范式和基于GRPO的协作感知强化学习,增强多智能体间的协调与专业化,提升代码生成的准确性和效率。

详情
AI中文摘要

大型语言模型(LLM)在代码生成方面取得了强劲性能,但大多数方法依赖自回归解码而缺乏全局规划,常常导致局部连贯但全局次优的解决方案(例如,测试用例失败或复杂度低效)。虽然最近的方法如思维链(CoT)和多智能体系统(MAS)引入了规划,但它们有限的专业角色分工和协调阻碍了在复杂任务上的性能。为了解决多智能体代码生成中的协调与专业化挑战,我们提出了协作式强化代码(CoRe-Code),一个面向角色专业化的LLM智能体框架,通过增强智能体间协调来生成更准确和高效的代码。CoRe-Code采用简单的规划器-编码器范式,其中规划器生成高层计划,编码器执行计划以生成代码。我们进一步引入基于组相对策略优化(GRPO)的协作感知强化学习阶段,以增强角色专业化和对齐。实验表明,CoRe-Code优于现有多种基于强化学习和多智能体的方法。此外,我们证明CoRe-Code可以泛化到其他多智能体框架(例如,检索和调试智能体),凸显其灵活性和可扩展性。我们使用三个基础模型在多个不同难度的基准上评估CoRe-Code。与现有基线相比,结果显示在准确性上持续提升,同时在执行时间和内存使用方面也实现了更高效率,证明了CoRe-Code的有效性和实用性。

英文摘要

Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally coherent yet globally suboptimal solutions (e.g., failing test cases or inefficient complexity). While recent approaches such as Chain-of-Thought (CoT) and multi-agent systems (MAS) introduce planning, their limited role specialization and coordination hinder performance on complex tasks. To address the challenges of coordination and specialization in multi-agent code generation, we propose Collaborative Reinforcement Code (CoRe-Code), a framework for role-specialized LLM agents that enhances inter-agent coordination to generate more accurate and efficient code. CoRe-Code adopts a simple Planner-Coder paradigm, where the Planner produces high-level plans and the Coder executes them to generate code. We further introduce a collaboration-aware reinforcement learning stage based on Group Relative Policy Optimization (GRPO) to enhance role specialization and alignment. Experiments show that CoRe-Code outperforms a wide range of existing RL-based and multi-agent methods. In addition, we demonstrate that CoRe-Code can generalize to other multi-agent frameworks (e.g., Retrieval and Debugging agents), highlighting its flexibility and scalability. We evaluate CoRe-Code on multiple benchmarks of varying difficulty using three base models. Compared to existing baselines, the results show consistent improvements in accuracy, while also achieving higher efficiency in terms of execution time and memory usage, demonstrating the effectiveness and practicality of CoRe-Code.
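The core of GRPO is a group-relative advantage: several completions are sampled for the same prompt, and each reward is standardized against the group. The sketch below shows only that step. The reward values are hypothetical (for example, test pass rates), the population standard deviation is one common choice among implementations, and nothing here reflects CoRe-Code's collaboration-aware reward, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt (1.0 = all tests pass).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to form the baseline.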

2605.24810 2026-05-26 cs.LG cs.AI cs.RO stat.AP 版本更新

Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning

跨域能量引导扩散生成用于动态偏移强化学习

Yu Yang, Yihong Guo, Anqi Liu, Pan Xu

发表机构 * Duke University(杜克大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出CEDGE框架,利用能量引导扩散模型生成目标域轨迹,解决动态偏移下离线强化学习的域适应问题。

Comments 29 pages, 3 figures, and 14 tables

详情
AI中文摘要

离动态离线强化学习旨在从大规模源数据集和有限目标数据集中学习目标域策略,但面临转移动态不匹配的问题。现有方法如奖励增强和数据过滤受限于源数据集,无法合成新的目标行为以改善超出收集源轨迹的覆盖范围。虽然近期基于模型的方法尝试通过学习目标感知动态来解决此问题,但生成的体验仅在转移层面构建,导致长时域上的累积误差。这些限制促使离动态离线RL转向轨迹级生成。我们提出CEDGE,一种跨域能量引导扩散生成框架。CEDGE在源域轨迹上训练轨迹扩散模型,并通过能量引导将生成样本适应到目标域。该引导通过最小化源域与期望目标域轨迹之间的分布不匹配得到,并分解为回报、域和行为能量成分。得到的能量引导轨迹既可用于直接规划,也可作为策略学习的合成数据。由于目标适应通过能量引导而非重新训练扩散模型实现,与先前方法相比,CEDGE能高效适应新的目标动态。在ODRL基准上的实验表明,轨迹级能量引导生成改善了动态偏移下的扩散规划,并产生提升下游目标策略学习的合成数据。

英文摘要

Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset under mismatched transition dynamics. Existing approaches such as reward augmentation and data filtering are constrained to the source dataset and cannot synthesize new target behavior to improve coverage beyond the collected source trajectories. While recent model-based methods attempt to address this by learning target-aware dynamics, the generated experience is constructed only at the transition level, which leads to accumulated errors over long horizons. These limitations necessitate a shift toward trajectory-level generation for off-dynamics offline RL. We propose CEDGE, a Cross-domain Energy-guided Diffusion GEneration framework. CEDGE trains a trajectory diffusion model on source-domain trajectories and adapts the generated samples to the target domain through energy guidance. This guidance is derived by minimizing the distribution mismatch between the source and desired target-domain trajectories and is decomposed into return, domain, and behavior energy components. The resulting energy-guided trajectories are useful both for direct planning and as synthetic data for policy learning. Since target adaptation is achieved via energy guidance rather than retraining the diffusion model, CEDGE can be efficiently adapted to new target dynamics compared to previous methods. Experiments on the ODRL benchmark demonstrate that trajectory-level energy-guided generation improves diffusion planning under dynamics shifts and produces synthetic data that improves downstream target policy learning.

2605.24808 2026-05-26 cs.LG cs.AI 版本更新

Disentangled Double Machine Learning for Accurate Causal Effect Estimation

解缠双机器学习用于精确因果效应估计

Guodu Xiang, Kui Yu, Yujie Wang, Richang Hong, Fuyuan Cao, Jiye Liang

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) School of Computer and Information Technology, Shanxi University(山西大学计算机与信息学院)

AI总结 提出解缠双机器学习(DDML),通过因果角色解缠和残差依赖正交化策略,解决高维或有限样本下双机器学习中因混淆因子未解缠导致的偏差和不稳定问题,在合成、半合成和真实数据集上优于13种基线方法。

Comments 15 pages, 9 figures

详情
AI中文摘要

混淆偏差是从观测数据中估计因果效应的一个关键挑战。双机器学习(DML)通过估计治疗和结果 nuisance 函数、构建治疗和结果残差,并从残差中估计因果效应来解决这一问题。然而,DML 在高维或有限样本场景中常常产生有偏和不稳定的估计。一个原因是 DML 使用所有协变量估计 nuisance 函数,而没有解缠不同的潜在因子,导致不可靠的 nuisance 函数估计。另一个原因是不精确的 nuisance 估计进一步引入了治疗残差与剩余结果误差之间的残差依赖,破坏了因果效应估计的准确性。为了解决这些问题,本文提出解缠双机器学习(DDML),一种整合两种关键策略的新算法。首先,因果角色解缠策略将协变量分解为混淆因子、治疗特有因子和结果特有因子,以实现可靠的 nuisance 函数估计。其次,残差依赖正交化策略减轻由 nuisance 估计误差引起的残差依赖,以增强因果效应估计的精度。在合成、半合成和真实数据集上的实验结果表明,DDML 在 MAE 和 RMSE 上均显著优于 13 种最先进的基线算法。

英文摘要

Confounding bias is a key challenge in causal effect estimation from observational data. Double Machine Learning (DML) addresses this issue by estimating treatment and outcome nuisance functions, constructing treatment and outcome residuals, and estimating causal effects from the residuals. However, DML often produces biased and unstable estimates in high-dimensional or finite-sample scenarios. One reason is that DML estimates nuisance functions using all covariates without disentangling distinct latent factors, resulting in unreliable nuisance function estimation. Another is that imprecise nuisance estimation further introduces residual dependence between the treatment residual and the remaining outcome error, undermining the accuracy of causal effect estimates. To address these issues, we propose Disentangled Double Machine Learning (DDML), a novel algorithm that integrates two key strategies. First, a causal role disentanglement strategy decomposes covariates into confounders, treatment-specific factors, and outcome-specific factors to enable reliable nuisance function estimation. Second, a residual dependence orthogonalization strategy mitigates residual dependence caused by nuisance estimation errors to enhance the precision of causal effect estimates. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate that DDML significantly outperforms 13 state-of-the-art baseline algorithms in both MAE and RMSE.
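The residual-on-residual recipe of standard DML, which DDML builds on, can be sketched in a few lines. This is the baseline procedure only, not DDML: it assumes a single observed confounder, linear nuisance models, two-fold cross-fitting, and synthetic data with a true effect of 2.0. All function names are illustrative.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def dml_effect(x, t, y):
    """Partialling-out DML: fit nuisances on one fold, residualize the other fold."""
    n = len(x)
    folds = [list(range(0, n, 2)), list(range(1, n, 2))]
    num = den = 0.0
    for k in (0, 1):
        train, held = folds[1 - k], folds[k]
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held:
            res_t = t[i] - (a_t + b_t * x[i])  # treatment residual
            res_y = y[i] - (a_y + b_y * x[i])  # outcome residual
            num += res_t * res_y
            den += res_t * res_t
    return num / den


# Synthetic data: x confounds both treatment t and outcome y; true effect is 2.0.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
t = [xi + rng.gauss(0, 1) for xi in x]
y = [2.0 * ti + 3.0 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
theta_hat = dml_effect(x, t, y)
```

When the nuisance fits are poor, the two residuals stay correlated through the unexplained part of `x`, which is the residual dependence the abstract describes.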

2605.24806 2026-05-26 cs.SD cs.AI eess.AS 版本更新

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

零样本帕金森病语音检测:比较大型音频和语言模型

Muhammad Ashad Kabir, Sirajam Munira

发表机构 * School of Computing, Mathematics and Engineering, Charles Sturt University(查尔斯·斯特大学计算、数学与工程学院) Department of Computer Science, Rensselaer Polytechnic Institute(伦斯勒理工学院计算机科学系)

AI总结 通过比较手工声学特征和原始音频波形两种输入模态,研究零样本帕金森病检测在不同语言中的性能差异,发现手工特征在低资源语言中更稳定,而音频输入带来数据集依赖的增益。

Comments 6 pages

详情
AI中文摘要

大型音频和语言模型最近在各个领域展示了零样本推理能力。然而,尚不清楚音频输入的形式——无论是从语音中提取的手工声学特征还是原始音频波形——如何影响不同语言中帕金森病(PD)检测的性能。在本研究中,我们系统地比较了两种零样本PD检测的输入模态:(i)由通用LLM分析的从语音记录中提取的手工声学特征,以及(ii)由音频能力模型分析的直接波形输入。在四种语言的PD语音数据集上的实验表明,性能因输入模态、语音任务和语言而异。手工声学特征在低资源语言(例如孟加拉语)中提供更稳定的性能,而音频输入带来数据集依赖的增益。这些发现突显了输入模态对零样本语音PD检测的影响。

英文摘要

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

2605.24799 2026-05-26 cs.CV cs.AI 版本更新

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

面向大规模视觉识别的多模态大语言模型分治推理

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

发表机构 * Taizhou Institute of Science and Technology, Nanjing University of Science and Technology(南京理工大学泰州科技学院) Department of Intelligence Science, Xi’an Jiaotong-Liverpool University(西交利物浦大学智能科学系) School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Department of Statistical Sciences, University of Toronto(多伦多大学统计科学系)

AI总结 针对多模态大语言模型在长序列识别中性能崩溃的问题,提出分治推理(DCI)策略,通过递归分解任务和动态剪枝提升信噪比与分类精度。

详情
AI中文摘要

多模态大语言模型(MLLMs)在广泛的视觉语言任务中展现了强大的能力。然而,当应用于大规模图像分类时,随着标签空间的扩大,其性能显著下降——我们将这一现象定义为长序列识别中的性能崩溃。通过信息论分析,我们揭示了这种崩溃源于不断增长的信息熵与注意力机制中显著的注意力稀释和衰减之间的根本冲突,这损害了模型在处理极长提示时维持足够信噪比的能力。为缓解这一问题,我们提出了分治推理(DCI),一种用于MLLMs视觉识别的新型测试时扩展策略。DCI递归地将复杂的全局分类任务分解为多个更简单的局部子问题,并采用动态剪枝机制压缩搜索空间。该方法通过缓解长序列推理中固有的权重稀释问题,有效提高了局部信噪比和模型精度。此外,传统自注意力具有难以承受的二次计算复杂度,而DCI在大规模分类场景中实现了更有利的扩展行为并显著加速推理。在ImageNet-1K和ImageNet-21K等基准上的大量实验表明,DCI持续提高了分类精度。这使得轻量级开源模型无需任何额外训练或微调即可与甚至超越前沿闭源巨头。作为一种模型无关、即插即用的范式,DCI为在大规模场景中扩展MLLMs的推理精度提供了一种高效方法。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
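The recursive decomposition can be pictured as a tournament over the label set: split the candidates into short chunks, keep the best few from each chunk, and repeat until one short list remains. The sketch below is an illustration under assumptions, not the paper's algorithm. In particular `score_fn` is a hypothetical stand-in for an MLLM query restricted to a short label list, and the fixed `keep` count replaces whatever dynamic pruning rule DCI actually uses.

```python
def divide_and_conquer_classify(labels, score_fn, chunk_size=4, keep=1):
    """Tournament over labels: score short chunks, keep the top `keep` of each, repeat."""
    if not 1 <= keep < chunk_size:
        raise ValueError("need 1 <= keep < chunk_size so the candidate set shrinks")
    candidates = list(labels)
    while len(candidates) > chunk_size:
        survivors = []
        for start in range(0, len(candidates), chunk_size):
            chunk = candidates[start:start + chunk_size]
            ranked = sorted(chunk, key=score_fn, reverse=True)
            survivors.extend(ranked[:keep])  # prune the rest of this chunk
        candidates = survivors
    return max(candidates, key=score_fn)


# Toy scores standing in for model confidence on each label (hypothetical values).
toy_scores = {"cat": 0.9, "dog": 0.6, "fox": 0.5, "owl": 0.2, "bee": 0.1,
              "ant": 0.1, "cow": 0.3, "pig": 0.3, "rat": 0.4, "bat": 0.2}
prediction = divide_and_conquer_classify(list(toy_scores), toy_scores.get)
```

Each query sees at most `chunk_size` labels, so prompt length stays bounded however large the label space grows.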

2605.24792 2026-05-26 cs.CV cs.AI 版本更新

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

用于胃肠内窥镜的参数高效视觉语言模型:医学图像生成与临床视觉问答

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

发表机构 * Computer Science Department, Morgan State University(摩根州立大学计算机科学系) International Organization for Migration (IOM)(国际移民组织) Electrical & Computer Engineering Department, Morgan State University(摩根州立大学电气与计算机工程系)

AI总结 提出双流水线参数高效微调模型,结合Florence-2和LoRA Stable Diffusion,分别解决临床视觉问答和隐私保护合成数据生成问题,在Kvasir-VQA数据集上取得高ROUGE和BLEU分数,并显著降低计算成本。

详情
AI中文摘要

胃肠内窥镜AI系统的主要局限性源于标注数据短缺、严格的隐私政策以及传统模型微调中的显著瓶颈。这些限制阻碍了复杂AI模型在临床实践中的成功应用,尤其影响了诊断的可靠性和可扩展性。在本文中,我们提出了一种双流水线PEFT模型,解决了两个基本问题:医学视觉问答(VQA)和隐私保护合成数据的生成。对于临床VQA,我们采用Florence-2视觉语言模型。利用PEFT增强了模型的可解释性,同时大幅降低了训练的计算成本。同时,我们使用低秩适应(LoRA)与Stable Diffusion 2.1生成高质量的胃肠图像,在不违反患者隐私的情况下增强训练数据库。本研究使用了Kvasir-VQA数据集。我们的Florence-2 VQA模型实现了ROUGE-1为0.92,ROUGE-L为0.91,BLEU分数从0.08提升到0.24。在私有数据集上的微调始终优于在公共数据集上的微调。秩为4的LoRA合成达到了最优性能,保真度得分为0.290,一致性得分为0.730,Frechet BiomedCLIP距离(FBD)为1450,计算成本降低了近90%。该框架提高了AI在胃肠内窥镜中的临床潜力。与FLUX、MSDM和Kandinsky 2.2相比,我们的模型表现出更优的FBD和强语义对齐。虽然其他模型在保真度或一致性上领先,但我们更低的FBD表明更好的图像-文本一致性。这些结果确立了我们的方法作为增强临床AI中VQA和合成数据生成的稳健解决方案。

英文摘要

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.

2605.24786 2026-05-26 cs.LG cs.AI 版本更新

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

CONF-KV:面向长序列LLM的置信度感知KV缓存淘汰与混合精度存储

Yubo Li, Yidi Miao

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出CONF-KV方法,利用模型当前不确定性(置信度)动态调整KV缓存预算,结合混合精度存储和分块在线softmax注意力,在长序列推理中显著降低显存占用并保持高精度。

详情
AI中文摘要

长序列LLM推理使键值(KV)缓存成为GPU内存的主要消耗者,并使每个token的注意力计算越来越昂贵。许多常见的淘汰策略使用静态的最近窗口或历史注意力,忽略了每个解码步骤中计算出的一个信号:模型当前的不确定性。我们引入CONF-KV,一个KV缓存管理器,它将下一个token分布转换为标量置信度分数,并用它来选择每步缓存预算,在模型不确定时保留更多上下文,在模型确定时积极剪枝。在每个预算内,token根据累积注意力质量和最近性的组合进行排序,同时一个受保护的最近窗口保持局部连贯性。我们将该策略与分块在线softmax注意力、混合FP16/INT8存储以及金字塔式逐层预算变体相结合。在四个模型家族和生成长度高达4K的情况下,CONF-KV的显存占用接近固定的512 token滑动窗口,同时与完整KV相比,困惑度差异保持在1.5-2.1点以内。在长达32K token的“大海捞针”测试中,CONF-KV的检索准确率达到91.4%,而滑动窗口为53.8%,H2O为80.6%;在75个VisualWebArena任务中,它以2.8倍的峰值内存降低保留了完整KV成功率的95.3%。

英文摘要

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5-2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
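The policy described above can be sketched as three small steps: score confidence, map it to a budget, then pick which cached tokens to keep. Everything specific here is an assumption for illustration, since the abstract gives no formulas: confidence is taken as one minus normalized entropy, the budget is interpolated linearly, and the ranking score mixes attention mass and recency with a hypothetical weight `alpha`.

```python
import math


def confidence(probs):
    """Scalar confidence in [0, 1]: 1 minus the normalized entropy of the distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(probs))
    return 1.0 - h / h_max if h_max > 0 else 1.0


def cache_budget(conf, min_budget, max_budget):
    """Low confidence keeps more tokens; high confidence prunes toward min_budget."""
    return round(max_budget - conf * (max_budget - min_budget))


def select_tokens(attn_mass, budget, recent_window, alpha=0.5):
    """Keep the protected recent window, then fill the budget by a composite score."""
    n = len(attn_mass)
    protected = set(range(max(0, n - recent_window), n))
    total = sum(attn_mass) or 1.0

    def score(i):
        recency = (i + 1) / n  # later positions count as more recent
        return alpha * attn_mass[i] / total + (1 - alpha) * recency

    older = sorted((i for i in range(n) if i not in protected), key=score, reverse=True)
    extra = max(0, budget - len(protected))
    return sorted(protected | set(older[:extra]))


kept_positions = select_tokens([5, 0, 0, 0, 1, 0, 0, 0], budget=4, recent_window=2)
```

In this toy run position 0 survives because of its large accumulated attention mass, position 4 because of a mix of mass and recency, and positions 6 and 7 because they sit in the protected window.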

2605.24784 2026-05-26 cs.AI 版本更新

GRAIL: AI translation for scientists application workflow on satellite data

GRAIL:面向卫星数据科学家应用工作流的AI翻译

Zhuocheng Shang, Ahmed Eldawy

发表机构 * University of California, Riverside(加州大学河滨分校)

AI总结 提出GRAIL系统,通过LangGraph管道将Python地理空间工作流翻译为可扩展的Spark程序,无需科学家学习新框架。

详情
AI中文摘要

领域科学家越来越多地开发Python脚本来分析卫星图像,但这些脚本缺乏大规模数据的可扩展性。本文演示了GRAIL,一个代理翻译系统,它将Python地理空间工作流转换为可执行的基于Spark的程序,而无需科学家学习新框架。GRAIL不是微调专门的LLM模型,而是调整RDPro(一个用于卫星数据分析的Scala库),通过结构化文档、API别名函数和面向修复的错误日志使其为LLM就绪。翻译被构建为一个LangGraph管道,将代码生成分解为具有引导输入和输出的显式部分,从而无需重新生成整个程序即可进行有针对性的修复。我们在真实的地理空间工作流上演示了GRAIL,并展示了翻译代码的正确性和可扩展性。

英文摘要

Domain scientists increasingly develop Python scripts to analyze satellite imagery, but these scripts do not scale to large datasets. This paper demonstrates GRAIL, an agentic translation system that converts Python geospatial workflows into executable Spark-based programs without requiring scientists to learn a new framework. Rather than fine-tuning a specialized LLM, GRAIL adapts RDPro, a Scala library for satellite data analysis, to make it LLM-ready using structured documentation, API alias functions, and repair-oriented error logs. Translation is structured as a LangGraph pipeline that decomposes code generation into explicit sections with guided inputs and outputs, enabling targeted repair without regenerating the full program. We demonstrate GRAIL on real-world geospatial workflows and showcase the correctness and scalability of the translated code.

2605.24779 2026-05-26 cs.LG cs.AI math.CO 版本更新

Complement Submodular Information Measures for Balanced and Robust Data Selection

互补子模信息度量用于平衡和鲁棒的数据选择

Rishabh Iyer

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校)

AI总结 提出互补子模信息(CSI)目标函数,通过建模子集与其补集之间的共享结构信息,实现平衡且鲁棒的数据选择,并在理论上证明其近似单调性和贪心近似保证,实验表明在鲁棒隐藏切片感知子集选择中优于经典子模目标。

详情
AI中文摘要

子模优化已成为数据选择、检索、摘要和表示学习的基本范式,因为它能够建模覆盖度、多样性和代表性。然而,经典子模目标仅优化所选子集,并未明确保留所选子集与剩余数据之间的结构信息。在许多现代机器学习应用中,包括训练/验证/测试分割、基准构建和鲁棒子集选择,选择的质量关键取决于在所选子集及其补集之间保持平衡结构。在这项工作中,我们引入了互补子模信息(CSI),这是一类新的互补感知子模目标,用于量化子集与其补集之间的共享结构信息。我们的框架产生了几个经典子模函数的互补感知变体,包括设施选址、图割、LogDet、饱和覆盖、集合覆盖、概率集合覆盖和基于特征函数。我们分析了CSI目标的理论性质,并表明它们在有限曲率条件下表现出近似单调性,从而得到接近$(1-1/e)$的贪心近似保证。实验上,CSI目标在鲁棒隐藏切片感知子集选择中始终优于标准子模目标。特别是,CSI目标显著改善了相干稀有/尾部语义结构的保留,同时抑制了噪声和孤立异常值,从而显著提高了下游预测性能。合成实验进一步说明了不同的CSI实例如何捕获代表性、多样性、连通性和平衡邻域保留的互补概念。

英文摘要

Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-$(1-1/e)$ greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.
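As background for the greedy guarantee mentioned above, here is a minimal sketch of the classical Facility Location objective with greedy selection. It is the standard baseline only: the abstract does not define the CSI objectives, so no complement-aware term is attempted, and the similarity matrix is made-up toy data.

```python
def facility_location(selected, sim):
    """Classical Facility Location: each point is credited with its best selected match."""
    return sum(max((row[j] for j in selected), default=0.0) for row in sim)


def greedy_select(sim, k):
    """Greedy maximization: repeatedly add the item that yields the largest objective."""
    selected = []
    for _ in range(min(k, len(sim))):
        remaining = [j for j in range(len(sim)) if j not in selected]
        best = max(remaining, key=lambda j: facility_location(selected + [j], sim))
        selected.append(best)
    return selected


# Toy similarity matrix with two tight clusters: {0, 1} and {2, 3}.
sim = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
subset = greedy_select(sim, k=2)
```

Greedy picks one representative per cluster. Note that the objective only scores how well the selected items cover the data, which is the limitation the abstract says complement-aware objectives address.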

2605.24775 2026-05-26 cs.AI cs.MA 版本更新

PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

PRIMA: 具有可验证身份和收敛反馈的弹性多智能体研究的操作模式

Sasank Annapureddy

发表机构 * GitHub

AI总结 针对长时间运行的多智能体LLM系统面临的故障模式,提出PRIMA框架,包含弹性恢复、子智能体操作规范和结构化工程交付的多阶段应用模式,并通过图同构案例验证其有效性。

Comments 11 pages. Single-author preprint. Supplementary case-study report (Graph Isomorphism algorithm proposal with three theorems, five conjectures, complete complexity analysis, and hard-instance evaluation) available at https://spockstein.github.io/prima/case-study-graph-isomorphism.html

详情
AI中文摘要

将LLM作为协调的多智能体研究系统运行数小时,会暴露出单次评估无法发现的故障模式:上游提供商无预警地限制服务,子智能体使任务偏离以适应可用工具,叙述机制而非使用它,以自我道歉开始修订迭代,或将上游上下文视为可执行指令。我们提出PRIMA,其主要贡献是三种应对这些故障模式的操作模式:(1) 弹性与恢复层,检测上游速率限制信号,将类型化的暂停记录持久化到磁盘,并在进程重启后恢复长时间运行的任务而不重新执行已收敛的工作;(2) 子智能体操作规范,将任务保真度、工具使用、修订和步骤间上下文边界规范编码为结构化的提示层;(3) 用于结构化工程交付的多阶段应用模式,将正交的草稿步骤与最终综合前的显式跨文档协调过程配对。这些模式基于一个基础协议:具有显式收敛标准的研究程序规范语言、双指标评分引擎(LLM评判的评分标准加沙盒代码)、外部元优化循环、事件驱动持久化、基于钩子的中间件、上下文压缩和多提供商LLM抽象。智能体身份来源于素数幂,提供无冲突标识符和无需中央注册表的可轻松验证的集群成员资格。理论保证包括$O(k)$验证、$O(V+E)$ DAG验证以及由算术基本定理保证的身份无冲突。一个图同构案例研究将架构主张落实到生成的产物中:一个六步协议,产生了一篇研究论文,提出了一种新的规范形式算法,包含三个定理和五个猜想。

英文摘要

Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, sub-agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work even across process restarts; (2) a sub-agent operating discipline encoding task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit atop a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially-verifiable cluster membership without a central registry. Theoretical guarantees include $O(k)$ verification, $O(V+E)$ DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.
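One plausible reading of the prime-power identity scheme is sketched below: each cluster owns a prime $p$, its $k$-th agent gets the identifier $p^k$, and membership is checked by repeated division, which takes $k$ steps. This is an assumption made for illustration. The abstract does not give the actual construction, and the function names are hypothetical.

```python
def agent_id(cluster_prime, index):
    """Identifier of the index-th agent (index >= 1) in the cluster owning this prime."""
    if cluster_prime < 2 or index < 1:
        raise ValueError("cluster_prime must be a prime >= 2 and index >= 1")
    return cluster_prime ** index


def member_index(identifier, cluster_prime):
    """Return k if identifier == cluster_prime ** k for some k >= 1, else None."""
    if cluster_prime < 2 or identifier < 2:
        return None
    k = 0
    while identifier % cluster_prime == 0:  # one division per power: O(k) steps
        identifier //= cluster_prime
        k += 1
    return k if identifier == 1 else None


ident = agent_id(3, 4)            # 3 ** 4
index = member_index(ident, 3)    # recovers 4
foreign = member_index(ident, 5)  # not a power of 5
```

By unique prime factorization, powers of distinct primes never coincide, so two clusters cannot issue the same identifier and no central registry is needed to check membership.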

2605.24773 2026-05-26 cs.AI 版本更新

Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

通过循环SG-MCMC和软标签学习进行主观NLP中的不确定性分解

Keito Inoshita, Takato Ueno

发表机构 * Faculty of Business and Commerce(商科学部) Data Science and AI Innovation Research Promotion Center(数据科学与人工智能创新研究促进中心) Graduate School of Data Science(数据科学研究生院)

AI总结 提出结合循环随机梯度马尔可夫链蒙特卡洛(cSG-MCMC)与软标签学习的方法,在情感分类中沿多个轴评估不确定性,并在GoEmotions基准上优于现有方法。

详情
AI中文摘要

情感分类中标注者的分歧反映了情感概念固有的模糊性,对于主观NLP中的预测质量评估至关重要。然而,先前没有工作将软标签学习与贝叶斯深度学习相结合,以评估包括标注者分布保真度在内的多个轴上的不确定性。我们在冻结的RoBERTa上通过循环随机梯度马尔可夫链蒙特卡洛(cSG-MCMC)训练一个线性头,在五轴评估下以软标签目标针对经验标注者分布。在28情感的GoEmotions基准上,所提出的方法在三个轴上同时优于蒙特卡洛Dropout和深度集成——标注者分布的Jensen-Shannon散度(JSD)、每个情感偶然不确定性与分歧之间的Spearman相关性,以及选择性预测的风险-覆盖率曲线下面积(AURC)和ROC曲线下面积(AUROC)——表明独立的轴可以从一个后验中联合获得。事后温度缩放表现出双向效应,建立了硬标签校准和标注者JSD作为独立维度,并激励联合报告作为诚实协议。

英文摘要

Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality assessment in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes including annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes: Jensen-Shannon divergence (JSD) to the annotator distribution, Spearman correlation between per-emotion aleatoric uncertainty and disagreement, and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC). This shows that independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.
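Two quantities named above have standard definitions that fit in a few lines: the entropy-based split of predictive uncertainty into aleatoric and epistemic parts, and the Jensen-Shannon divergence between a prediction and the annotator distribution. The sketch uses the common textbook forms over a list of posterior-sample predictions. Whether the paper uses exactly these estimators is an assumption.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def decompose(samples):
    """Split uncertainty over posterior samples into (total, aleatoric, epistemic)."""
    n = len(samples)
    mean_p = [sum(s[c] for s in samples) / n for c in range(len(samples[0]))]
    total = entropy(mean_p)                           # entropy of the averaged prediction
    aleatoric = sum(entropy(s) for s in samples) / n  # average per-sample entropy
    return total, aleatoric, total - aleatoric        # remainder is epistemic


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Posterior samples that disagree with each other: purely epistemic uncertainty.
total_d, alea_d, epi_d = decompose([[1.0, 0.0], [0.0, 1.0]])
# Posterior samples that agree on an ambiguous input: purely aleatoric uncertainty.
total_a, alea_a, epi_a = decompose([[0.5, 0.5], [0.5, 0.5]])
```

The second case is the one that should track annotator disagreement: every posterior sample agrees that the input itself is ambiguous.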

2605.24771 2026-05-26 cs.CV cs.AI cs.LG 版本更新

From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

从理论到决策规则:校准视觉-语言模型弱监督的噪声标签交叉点——基于三个医学影像基准

Bruce Changlong Xu, Jose James, Alexander Ryu

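Affiliations * Department of Computer Science, Stanford University

AI summary Calibrates the noisy-label crossover predicted by theory on three medical-imaging benchmarks and derives a decision rule that can be applied with as few as 10-20 gold labels.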

Comments 5 pages, 2 figures, 4 tables

详情
AI中文摘要

经典的噪声标签理论预测,弱监督下的下游性能上限是标注者的准确率,这意味着一个尖锐的交叉点:一旦金标训练的分类器达到标注者的水平,弱标签就会从帮助变为伤害。该预测是理论性的;缺少的是将其转化为现代基础模型标注者的实例级陈述的基准校准。我们针对BiomedCLIP生成的弱标签,在三个医学影像基准(PCAM、ISIC、NIH-CXR)和六个跨越11倍参数范围的下游架构上提供了这样的校准。理论预测的交叉点出现在PCAM上约100个样本,ISIC上20-50个,NIH-CXR上250-500个;交叉点以上的弱标签使AUC降低高达-0.10。对于五个预训练架构中的四个,交叉点位置与架构无关,而一个家族内的DenseNet扫描(2.5倍参数,相同预训练)支持了标注者(而非学生)是主要约束的观点。该校准进而产生一个可在10-20个金标标签下操作的决策规则:比较仅金标AUC与用户金标集上的VLM准确率。NIH-CXR上的结构化与随机噪声符号翻转表明,该界限的仅速率形式是不完整的,并确定了一个具体的改进(标签空间投影),未来的基准可以设计来测试它。

英文摘要

Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an 11x parameter range. The crossover predicted by theory appears at $n_g \approx 100$ on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover degrade AUC by up to 0.10. The location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep (2.5x parameters, identical pretraining) supports the view that the labeler, not the student, is the dominant constraint. The calibration in turn produces a decision rule operable from 10-20 gold labels: compare gold-only AUC to VLM accuracy on the user's gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
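One way to read the decision rule is sketched below. The direction of the comparison (keep using weak labels only while the gold-only AUC is below the VLM's accuracy on the same gold set) is an interpretation of the abstract, and all function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum formulation; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, predicted):
    """Fraction of gold labels that the labeler reproduces."""
    return sum(int(y == p) for y, p in zip(labels, predicted)) / len(labels)

def use_weak_labels(gold_labels, gold_only_scores, vlm_labels):
    """Assumed rule: weak labels are worth using while the gold-only model trails the labeler."""
    return roc_auc(gold_labels, gold_only_scores) < accuracy(gold_labels, vlm_labels)

# Hypothetical tiny gold set (a real one would hold 10-20 labels).
gold = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]  # gold-only classifier scores, AUC = 0.75
print(use_weak_labels(gold, scores, vlm_labels=[1, 1, 0, 0]))  # labeler accuracy 1.0
```

With so few gold labels both estimates are noisy, so in practice the comparison would probably need a margin or a confidence interval; the sketch omits that.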

2605.24769 2026-05-26 cs.CV cs.AI eess.IV 版本更新

Leveraging pretrained RGB denoisers for hyperspectral image restoration

利用预训练RGB去噪器进行高光谱图像恢复

Daniele Picone, Mohamad Jouni, Mauro Dalla-Mura

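Affiliations * Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab

AI summary Proposes a lightweight adapter that reuses frozen pretrained RGB denoisers through a projection mapping to restore hyperspectral images (denoising, deblurring, and super-resolution); experiments indicate that RGB priors transfer well.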

详情
AI中文摘要

高光谱图像恢复面临若干挑战,包括训练数据有限、传感器特异性强以及光谱维度高。这些限制阻碍了鲁棒高光谱先验的学习,促使我们重用从大规模RGB数据中学到的先验。在这项工作中,我们提出了一种最小训练的轻量级适配器,通过投影映射将冻结的预训练RGB去噪器重新用于高光谱恢复。该方法对低维光谱投影进行去噪,并通过约束线性聚合重建高光谱立方体,同时保持即插即用的兼容性和底层RGB去噪器的稳定性。在多个数据集上的去噪、去模糊和超分辨率实验表明,该方法持续优于高光谱专用基线,显示了大规模RGB先验的强迁移性。

英文摘要

Hyperspectral image restoration faces several challenges, including limited training data, strong sensor specificity, and high spectral dimensionality. These limitations hinder the learning of robust hyperspectral priors, motivating the reuse of priors learned from large-scale RGB data. In this work, we propose a minimally trained, lightweight adapter that repurposes frozen pretrained RGB denoisers for hyperspectral restoration through a projection mapping. The method denoises low-dimensional spectral projections and reconstructs the hyperspectral cube through constrained linear aggregation, while preserving plug-and-play compatibility and the stability properties of the underlying RGB denoiser. Experiments on denoising, deblurring, and super-resolution across multiple datasets demonstrate consistent improvements over hyperspectral-specific baselines, showing the strong transferability of large-scale RGB priors.

2605.24764 2026-05-26 cs.IR cs.AI cs.CL 版本更新

Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

光谱检索:基于多尺度sinc卷积的令牌嵌入局部化检索在LLM多智能体系统中的应用

Andrea Morandi

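Affiliations * Cisco

AI summary Proposes Spectral Retrieval, which re-ranks documents using a multi-scale sinc convolution over token embeddings, substantially improving localized retrieval without retraining and fitting naturally into LLM multi-agent systems.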

详情
AI中文摘要

[删节版] - 光谱检索是一种插件式重排序阶段,通过在令牌嵌入上进行多尺度sinc卷积,在逐令牌MaxSim和均值池化检索之间进行插值。在标准稠密检索中,每个文档是一个均值池化向量;当相关性局限于一个短子跨度时,信号会平均为噪声。光谱检索重用来自晚期交互索引的逐令牌嵌入,并将其与归一化的sinc核在多个尺度上进行卷积。在L=1时,核作为恒等映射,恢复逐令牌MaxSim;随着L增大,它趋近于均匀滤波器,恢复均值池化。跨位置和尺度的最大余弦产生一个得分,其信息量不低于任一端点。在一个包含1000个文档和植入单位置尖峰的可控合成基准上,无论尖峰强度如何,均值池化检索处于随机水平(Recall@10 ~ 0.02),而光谱检索在植入余弦超过语料级令牌噪声基底时达到Recall@10 = 1.0。在冻结的all-mpnet-base-v2编码器上的LIMIT-small数据集中,光谱检索无需重新训练即可将Recall@10从0.33提升至0.90,MRR从0.22提升至0.79,严格Success@10从0.12提升至0.84。该方法自然适用于多智能体LLM系统,其中每个智能体受益于共享语料库上更紧密、特定角色的检索窗口。

英文摘要

[Abridged] - Spectral Retrieval is a plug-in re-ranking stage that interpolates between per-token MaxSim and mean-pool retrieval through a multi-scale sinc convolution over token embeddings. In standard dense retrieval each document is one mean-pooled vector; when relevance localises into a short subspan, the signal averages into noise. Spectral Retrieval reuses per-token embeddings from a late-interaction index and convolves them with a normalised sinc kernel at multiple scales. At L=1 the kernel acts as the identity, recovering per-token MaxSim; as L grows it approaches a uniform filter, recovering mean pooling. The maximum cosine over positions and scales yields a score provably no less informative than either endpoint. On a controlled synthetic benchmark with 1,000 documents and planted single-position spikes, mean-pool retrieval sits at chance (Recall@10 ~ 0.02) regardless of spike strength, while Spectral Retrieval reaches Recall@10 = 1.0 once the planted cosine exceeds the corpus-level token noise floor. On LIMIT-small with a frozen all-mpnet-base-v2 encoder, Spectral Retrieval lifts Recall@10 from 0.33 to 0.90, MRR from 0.22 to 0.79, and strict Success@10 from 0.12 to 0.84, without retraining. The method fits naturally into multi-agent LLM systems, where each agent benefits from a tighter, role-specific retrieval window over a shared corpus.
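The interpolation between MaxSim and mean pooling can be sketched in a few lines. The sketch assumes the kernel weight at token offset k and scale L is the normalised sinc function evaluated at k/L, which equals the identity at L=1 (up to floating-point error) and flattens towards a uniform filter as L grows. It also assumes a single query vector rather than per-token query embeddings. The kernel in the paper may be defined differently.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def smooth(tokens, position, scale):
    """Sinc-weighted sum of token embeddings centred on one position."""
    out = [0.0] * len(tokens[0])
    for j, embedding in enumerate(tokens):
        weight = sinc((j - position) / scale)
        for d, value in enumerate(embedding):
            out[d] += weight * value
    return out

def spectral_score(query, tokens, scales=(1, 2, 4, 8)):
    """Maximum cosine over all positions and scales."""
    return max(cosine(query, smooth(tokens, p, s))
               for s in scales for p in range(len(tokens)))

# Hypothetical document: relevance is confined to the middle token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]
print(spectral_score(query, tokens))
```

Because scale 1 is among the candidates, the score cannot fall meaningfully below per-token MaxSim, which is the intuition behind taking the maximum over scales.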

2605.24756 2026-05-26 cs.AI 版本更新

Proper Scoring Rules for Agentic Uncertainty Quantification

智能体不确定性量化的适当评分规则

Suresh Raghu, Satwik Pandey, Shashwat Pandey

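Affiliations * Independent Researcher

AI summary For uncertainty signals emitted along language-model agent trajectories, proposes the Trajectory Proper Score (TPS), a family of strictly proper trajectory-level scoring rules for evaluating the per-step success-probability process, with an extension to censored trajectories.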

Comments 38 pages, 2 figures

详情
AI中文摘要

语言模型智能体在轨迹中越来越多地发出不确定性信号,但现有的智能体不确定性量化评估常常混淆排序有用性与概率真实性。AUROC、AUPRC、风险覆盖、轨迹ECE和标量化轨迹评分评估了区分度、分箱校准或压缩摘要,但并未严格引出完整的基于前缀的条件成功概率轨迹$q_t = P^π(Y=1 | H_t)$。基于序列适当评分,我们引入了轨迹适当评分(TPS),这是一个预测器无关的严格适当的轨迹级评分规则族,适用于任何校准为最终成功概率的逐步骤不确定性信号。我们证明,在完全观测下,TPS在所选的评分族和权重方案内严格引出了成功概率过程。我们将构造扩展到行政删失轨迹,通过将完整数据评分投影到可观测的停止前缀上,得到精确的$q_Z$加权简化评分,并在$q_Z$未估计时得到可处理的近似。我们进一步表明,常见的轨迹评估器针对的是比完整前缀条件概率过程更弱的目标:轨迹ECE是分辨率盲的,而标量化轨迹Brier仅引出压缩标量,而非完整轨迹。在StrategyQA、Tau2-Bench、HotpotQA和WebShop上的实验表明,这些理论差异在操作上是可见的:概率重新校准可以显著改变TPS,而几乎不改变排序指标,并且可处理的删失近似相对于仅完整评估可能改变结论。

英文摘要

Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores evaluate discrimination, binwise calibration, or collapsed summaries, but do not strictly elicit the full prefix-conditioned success-probability trace $q_t = P^{\pi}(Y=1 \mid H_t)$. Building on prequential proper scoring, we introduce the Trajectory Proper Score (TPS), a predictor-agnostic family of strictly proper trajectory-level scoring rules for any per-step uncertainty signal calibrated into a probability of eventual success. We prove that TPS strictly elicits the success-probability process under complete observation, within the chosen score family and weight schedule. We extend the construction to administratively censored trajectories by projecting the complete-data score onto the observable stopped prefix, yielding an exact $q_Z$-weighted reduced score and a tractable approximation when $q_Z$ is unestimated. We further show that common trajectory evaluators target weaker objects than the full prefix-conditioned probability process: Trajectory ECE is resolution-blind, while scalarized Trajectory Brier elicits only the collapsed scalar, not the full trace. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that these theoretical distinctions are operationally visible: probability recalibration can substantially change TPS while leaving rank metrics nearly unchanged, and the tractable censored approximation can change the verdict relative to complete-only evaluation.
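The difference between scoring the full trace and scoring a collapsed scalar can be seen in a toy example. The sketch assumes a Brier loss at every step with uniform weights, and it assumes that the scalarized variant collapses the trace to its mean before scoring; both are illustrative choices and not the paper's definition of TPS.

```python
def trajectory_brier(probabilities, outcome, weights=None):
    """Weighted sum of per-step Brier losses against the final outcome (lower is better)."""
    if weights is None:
        weights = [1.0 / len(probabilities)] * len(probabilities)
    return sum(w * (q - outcome) ** 2 for w, q in zip(weights, probabilities))

def scalarized_brier(probabilities, outcome):
    """Collapse the trace to its mean first, then apply a single Brier loss."""
    mean = sum(probabilities) / len(probabilities)
    return (mean - outcome) ** 2

# Two hypothetical traces with the same mean but different step-level behaviour.
flat = [0.5, 0.5]
swinging = [0.2, 0.8]
for trace in (flat, swinging):
    print(trajectory_brier(trace, 1), scalarized_brier(trace, 1))
```

The scalarized score cannot tell the two traces apart, while the per-step score penalises the trace that started out confidently wrong.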

2605.24755 2026-05-26 cs.AI cs.CL 版本更新

Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

使用多智能体语言模型自动检测和分类自然音频日记中的妄想相关内容

Feng Chen, Justin Tauscher, Changye Li, Meliha Yetisgen, Alex Cohen, Adam Kuczynski, Angelina Pei-Tzu Tsai, Benjamin Buck, Dror Ben-Zeev, Trevor Cohen

Affiliations * Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA; Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, WA, USA; Department of Psychology, Louisiana State University, Baton Rouge, LA, USA; Department of Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

AI summary Proposes a multi-agent LLM pipeline that automatically detects and classifies delusional beliefs and the associated affective and behavioral responses in naturalistic audio diaries, with majority voting giving the most robust performance.

Comments Accepted by CLPsych 2026

详情
AI中文摘要

在自然环境中录制的言语独白为表征精神疾病现象学和检测症状恶化提供了机会。大型语言模型(LLM)为自动化这一过程提供了新的可能性,因为它们主要需要标注数据进行评估而非训练。在本文中,我们提出了一种新颖的自动化多智能体LLM流水线,用于从具有中度被害妄想的人的音频日记转录中,进行细粒度、多标签的提取,以识别暗示妄想信念、相关情感反应和行为反应的语言。通过评估三个基础模型的集成,我们证明详细的诊断提示指令成功减少了妄想主题分类的假阳性,但也限制了情感或行为反应的解读。此外,比较多智能体裁决框架表明,智能体之间的复杂对话辩论通过诱导过早共识降低了临床模糊文本的准确性。相反,多数投票建立了稳健的性能(妄想检测和分类的Micro F1分别为0.872和0.779)。这项工作为自动检测和表征自然言语中暗示妄想信念的内容提供了一个经过验证且可扩展的流水线。

英文摘要

Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
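Majority voting over a multi-label ensemble can be sketched as follows. The label names are hypothetical, and the rule used here (keep a label when more than half of the models emit it) is an assumption about how the votes are combined.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one set of labels per model; keep labels chosen by a strict majority."""
    counts = Counter(label for labels in predictions for label in set(labels))
    threshold = len(predictions) / 2
    return {label for label, n in counts.items() if n > threshold}

# Hypothetical outputs of three models for one diary transcript.
model_outputs = [
    {"persecutory", "fear"},
    {"persecutory"},
    {"fear", "avoidance"},
]
print(majority_vote(model_outputs))
```

Unlike a debate between agents, each model votes independently here, so no model can pull the others towards an early consensus.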

2605.24754 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Motion-Compensated Weight Compression

运动补偿权重压缩

Ismail Lamaakal

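Affiliations * Multidisciplinary Faculty of Nador, Mohammed Premier University

AI summary Proposes Motion-Compensated Weight Compression (MCWC), which aligns permutation-symmetric blocks and applies layer-sequential prediction and entropy coding to compress neural network weights, improving the rate-accuracy Pareto frontier on Transformer language modeling and vision classification.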

Comments 54 pages, 17 tables, 6 Figures

详情
AI中文摘要

神经网络权重日益成为部署的瓶颈,然而大多数压缩流水线独立处理各层,忽略了由函数保持对称性引起的跨层冗余。我们提出运动补偿权重压缩(MCWC),一种仅权重的编解码器,它对齐置换对称块(例如隐藏单元和注意力头)以最大化跨层对应,将深度转化为可预测序列。在对齐的坐标系中,MCWC使用带有周期性关键帧的轻量级层序预测器,并仅编码在率失真目标下训练的学习熵模型预测残差。一个简单的解码器通过熵解码、反量化、预测驱动重建和逆对齐来重建可部署的权重,从而实现快速权重物化以进行推理。在Transformer语言建模和视觉分类中,MCWC在强量化和学习权重编解码基线之上改善了率-精度帕累托前沿,同时保持有竞争力的解码时间。消融实验证实,对齐、预测、熵建模和关键帧调度对于获得全部增益都是必要的。我们的代码可通过 https://github.com/Ism-ail11/MCWC 获取。

英文摘要

Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross-layer redundancy induced by function-preserving symmetries. We propose Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (e.g., hidden units and attention heads) to maximize cross-layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer-sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong quantization and learned weight-codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via https://github.com/Ism-ail11/MCWC.
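The keyframe-plus-residual idea can be illustrated with a toy codec. This sketch is not the MCWC implementation: it skips alignment and entropy coding, uses "copy the previous reconstructed layer" as a stand-in predictor, and applies uniform scalar quantisation with an arbitrary step size.

```python
def encode(layers, keyframe_every=4, step=0.05):
    """Encode aligned weight vectors as keyframes plus quantised prediction residuals."""
    stream, previous = [], None
    for index, weights in enumerate(layers):
        if index % keyframe_every == 0:
            # Keyframe: quantise the weights directly, without prediction.
            symbols = [round(w / step) for w in weights]
            previous = [s * step for s in symbols]
            stream.append(("key", symbols))
        else:
            # Residual against the decoder-side reconstruction (closed-loop prediction).
            symbols = [round((w - p) / step) for w, p in zip(weights, previous)]
            previous = [p + s * step for p, s in zip(previous, symbols)]
            stream.append(("res", symbols))
    return stream

def decode(stream, step=0.05):
    """Rebuild every layer by dequantising keyframes and adding residuals."""
    layers, previous = [], None
    for kind, symbols in stream:
        if kind == "key":
            previous = [s * step for s in symbols]
        else:
            previous = [p + s * step for p, s in zip(previous, symbols)]
        layers.append(previous)
    return layers

# Hypothetical aligned layers that change slowly with depth.
layers = [[0.10, -0.22, 0.31], [0.12, -0.20, 0.33], [0.11, -0.25, 0.30]]
stream = encode(layers)
print(stream)
print(decode(stream))
```

When adjacent layers are similar, the residual symbols cluster near zero, which is what would make a downstream entropy coder effective. Predicting from the reconstruction rather than from the original weights keeps the per-weight error bounded by half a quantisation step.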

2605.24743 2026-05-26 cs.LG cs.AI 版本更新

Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

用于多轮LLM微调的合成轨迹的双层优化

Shresth Verma, Mauricio Tec, Cheol Woo Kim, Kai Wang, Milind Tambe

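Affiliations * Harvard University; Georgia Institute of Technology

AI summary Proposes BOOST, a bilevel optimization framework in which the inner level trains on reweighted data and the outer level learns a lightweight reweighting head, addressing the degradation in multi-turn LLM performance caused by heterogeneous synthetic-trajectory quality.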

详情
AI中文摘要

虽然LLM在单轮生成中表现出色,但在长程多轮交互中表现不佳。离线强化学习提供了一种可扩展的方法,但其性能依赖于多轮轨迹数据的可用性和质量。一种常见的补救措施是使用LLM或模拟器生成的合成轨迹来增强训练,但合成数据的质量高度异质,天真地将所有轨迹视为同等信息量会降低性能。我们提出BOOST,一个双层优化框架,其中内层在重新加权的数据上训练LLM,外层在保留的真实验证任务上训练一个轻量级的重加权头,无需外部评判器即可分配连续的轨迹级权重。为了夯实这一方法,我们推导出一个PAC-Bayesian界,揭示了三方权衡:合成数据增加了多样性但存在任务偏移风险,而将权重集中在高质量轨迹上提高了经验性能但以有效样本量为代价。实验上,我们的方法一致优于多个基线。分析表明,它提高了与真实数据分布一致且具有更高定性价值的合成轨迹的权重。

英文摘要

While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment training with synthetic trajectories generated by LLMs or simulators, but synthetic data is highly heterogeneous in quality, and naively treating all trajectories as equally informative can degrade performance. We propose BOOST, a bilevel optimization framework where the inner level trains the LLM on reweighted data and the outer level trains a lightweight reweighting head on held-out real validation tasks, assigning continuous trajectory-level weights without requiring an external judge. To ground this approach, we derive a PAC-Bayesian bound revealing a three-way trade-off: synthetic data increases diversity but risks task-shift, while concentrating weight on high-quality trajectories improves empirical performance at the cost of effective sample size. Empirically, our method consistently outperforms multiple baselines. Analysis reveals it upweights synthetic trajectories that align with the real data distribution and exhibit higher qualitative merit.
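The cost side of the trade-off, concentrating weight versus effective sample size, can be made concrete with a short sketch. It uses the Kish formula for effective sample size, a common choice that the paper's bound may not share, and the weights and losses are made up for illustration.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def weighted_loss(losses, weights):
    """Trajectory-level losses averaged under continuous weights."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Hypothetical per-trajectory losses and two weighting schemes.
losses = [0.2, 0.9, 0.4, 1.5]
uniform = [1.0, 1.0, 1.0, 1.0]
concentrated = [1.0, 0.1, 0.8, 0.0]
for weights in (uniform, concentrated):
    print(weighted_loss(losses, weights), effective_sample_size(weights))
```

Shifting weight towards the low-loss trajectories lowers the weighted loss but also shrinks the effective sample size, so fewer trajectories are really driving the update.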

2605.24737 2026-05-26 cs.CL cs.AI cs.CY 版本更新

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

谁来评判评判者?基于指标的治理:面向持续LLM合规监控的运行时框架

Jehanne Dussert

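Affiliations * Independent Researcher

AI summary Argues that AI compliance is treated as a binary audit-time verdict rather than a continuous, measurable property of production systems; proposes the governance-from-metrics principle and the open-source framework govllm, which monitors compliance continuously from runtime observability signals, and presents evidence motivating a multi-model jury design for regulatory evaluation.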

Comments 41 pages, 8 figures, preprint

详情
AI中文摘要

当前AI合规方法将合规性视为审计时的二元判定,而非生产系统的持续可测量属性。我们认为这种合规虚构在结构上不适合欧盟AI法案的要求,该法案要求持续的人类监督和检测部署系统中涌现的行为漂移。我们引入了基于指标的治理原则,即监管合规性是从运行时可观测性中推导出的持续信号,而非来自静态评估。基于这一原则,我们提出了govllm,一个开源框架,实现了治理驱动的路由架构,其中模型选择由累积的合规分数决定,而非仅由延迟或成本决定。我们方法的核心是一个监管评判者小组——针对每个标准(欧盟AI法案、GDPR、ANSSI、可访问性)专门化的LLM评估器——我们将评判者间的分歧重新定义为监管不确定性信号,而非噪声,需要人工仲裁。我们通过一个包含49个标注提示/响应对的地面真实语料库验证了该方法,涵盖五个监管标准,由四个完全本地运行的小型语言模型(SLM,1.7B-7B参数)评估。一致率从51.5%(mistral:7b)到69.1%(phi4-mini)不等,没有单一模型在所有标准上占主导地位——这从经验上激励了“档案即陪审团”的设计。我们进一步记录了小型监管评判者中的三种结构性失败模式,以及一种评判者特定的位置偏差,该偏差在三种问题顺序条件(原始、反转、排列)下使一致率降低多达25个百分点。govllm作为开源软件发布,以支持可复现的AI治理研究。

英文摘要

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges - LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility) - whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria - empirically motivating the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.
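Treating inter-judge disagreement as a signal rather than noise could look roughly like the sketch below. It is not govllm's API: the function name, the agreement threshold, and the judge names other than `mistral:7b` and `phi4-mini` are hypothetical.

```python
def panel_verdict(votes, agreement_threshold=0.75):
    """votes maps judge name -> True if that judge considers the response compliant.

    Returns (decision, agreement); low agreement is escalated instead of averaged away.
    """
    compliant = sum(votes.values())
    agreement = max(compliant, len(votes) - compliant) / len(votes)
    if agreement < agreement_threshold:
        return "escalate_to_human", agreement
    decision = "compliant" if compliant * 2 > len(votes) else "non_compliant"
    return decision, agreement

# Hypothetical verdicts from four small judges on one prompt/response pair.
votes = {"mistral:7b": True, "phi4-mini": False, "judge_c": True, "judge_d": False}
print(panel_verdict(votes))
```

A split panel is routed to a human arbiter, while a clear majority produces an automatic verdict that can feed an accumulated compliance score.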

2605.24719 2026-05-26 cs.CL cs.AI 版本更新

World-State Transformations for Neuro-symbolic Interactive Storytelling

世界状态转换用于神经符号交互式故事讲述

Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás

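AI summary Explores using LLMs within a neuro-symbolic architecture to predict world-state transformations in a rule-based system, aiming to address the story-coherence problems of purely LLM-based approaches; an exploratory study with eight participants suggests the approach maintains world-state consistency while encouraging creative player input.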

Comments To be presented at the 17th International Conference on Computational Creativity (ICCC'26)

详情
AI中文摘要

大型语言模型(LLM)改变了处理自由文本用户输入的交互式故事讲述系统的可能性。然而,随着这类系统越来越多地被构建,越来越多的证据表明,仅依赖它们会出现故事连贯性问题。最近的研究表明,LLM可以有效地预测基于规则的交互式故事讲述系统中的状态变化,触发预编程的世界状态转换。在本文中,我们进行了一项探索性评估,研究这种转换是否可以作为玩家表达的催化剂,同时旨在解决纯LLM方法典型的连贯性问题。基于神经符号架构,我们使用开源模型(Llama 3 70B)和闭源模型(Gemini 1.5 Flash)进行了实验,测试以英语和西班牙语进行。八名参与者玩了两个场景,这些场景经过精心设计以评估不同的评估目标。我们的观察表明,转换提供了一种保持世界状态一致性的方式,同时鼓励玩家通过他们的书面输入进行创造性互动。

英文摘要

Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However, as more of these systems are built, evidence continues to mount regarding the story coherence problems that arise when relying solely on them. Recent research suggests that LLMs can effectively predict state changes within rule-based Interactive Storytelling systems, triggering pre-programmed world-state transformations. In this paper, we conduct an exploratory evaluation of whether such transformations can serve as a catalyst for player expression while aiming to address the incoherence issues typical of purely LLM-based approaches. Building upon a neuro-symbolic architecture, we conducted experiments using an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash), with testing conducted in both English and Spanish. Eight participants played two scenarios, carefully designed to assess different evaluation objectives. Our observations suggest that transformations offer a way to maintain world-state consistency while encouraging players to interact creatively through their written inputs.
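The division of labour described above, in which the language model only predicts which pre-programmed transformation to trigger and the rule system decides whether it applies, can be sketched as follows. The state layout, the transformation, and the function names are all hypothetical, and the model call is replaced by a plain string.

```python
def open_door(state):
    """Pre-programmed transformation: applies only if the player holds the key."""
    if "key" in state["inventory"] and not state["door_open"]:
        return {**state, "door_open": True}
    return None  # precondition failed; the world state stays untouched

TRANSFORMATIONS = {"open_door": open_door}

def apply_predicted(state, predicted_name):
    """Apply the transformation named by the model, if it exists and its precondition holds."""
    transform = TRANSFORMATIONS.get(predicted_name)
    new_state = transform(state) if transform else None
    return new_state if new_state is not None else state

state = {"inventory": ["key"], "door_open": False}
# In a real system the name would come from an LLM reading the player's free-text input.
print(apply_predicted(state, "open_door"))
```

Because every change passes through a symbolic precondition, a wrong or unknown prediction leaves the world state consistent rather than corrupting it.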

2605.24703 2026-05-26 cs.CL cs.AI 版本更新

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

TS-Skill: 用于评估时间序列问答中分析技能的基准

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

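Affiliations * University of California, Los Angeles; Samsung Research America; Carnegie Mellon University; Microsoft; Amazon

AI summary Proposes the TS-Skill benchmark, which diagnoses signal-level capabilities in time-series question answering through three composable analytical skills (temporal scale selection, temporal localization, and cross-interval integration), and develops the SKEvol framework to construct the benchmark at scale; experiments reveal uneven capability gaps across the skills.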

详情
AI中文摘要

大型语言模型(LLMs)和时间序列语言模型(TSLMs)越来越多地应用于时间序列问答(TSQA)。与纯文本问答不同,TSQA要求模型将答案基于时间信号,这些信号的模式可能出现在不同尺度、特定时间位置或跨分离区间。然而,现有的基准通常按任务类型或高层次推理类别组织,难以诊断驱动模型性能的底层信号级能力。我们引入TS-Skill,一个用于评估TSQA中三种可组合分析技能的控制基准:时间尺度选择(SK1)、时间定位(SK2)和跨区间整合(SK3)。TS-Skill提供时间戳感知的问题、广泛的领域覆盖以及人工验证的问答质量。为了大规模构建基准,我们开发了SKEvol,一个技能引导的智能体框架,结合了领域感知的时间序列种子生成、技能控制的问题生成、元数据和代码辅助的答案构建、多阶段信号接地验证以及人在回路中的策展。在十个最先进的LLMs和TSLMs上的实验揭示了SK1-SK3之间显著且不均匀的能力差距。特别是,SK3对非智能体模型始终具有挑战性,而工具增强的智能体在独立的SK3上显示出选择性优势。这些发现表明,技能级评估可以揭示被聚合TSQA分数掩盖的时间推理失败。

英文摘要

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

2605.24699 2026-05-26 cs.AI cs.LG 版本更新

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

MDIA:HealthBench Professional上的多智能体诊断智能流水线

Roberto Cruz, David Rey-Blanco

发表机构 * TietAI

AI总结 提出MDIA多智能体诊断系统,通过7节点专业路由临床推理图架构,在非微调LLM上实现HealthBench Professional基准性能提升3.72个百分点,归因于系统架构设计而非提示工程。

Comments 33 pages, 10 figures

详情
AI中文摘要

大多数关于agentic-LLM临床基准测试的报告收益通常归因于提示工程,但我们的结果表明,更大的改进可能来自架构和引擎级别的设计。我们提出了MDIA,一个多智能体诊断智能体,实现为7节点专业路由临床推理图,在完整的HealthBench Professional基准测试(n=525)上,使用非微调LLM。MDIA在OpenAI的GPT-5.4-2026-03-05下达到0.6272,比OpenAI的ChatGPT for Clinicians的性能高出3.72个百分点。实验工作表明,性能提升归因于系统架构:专业路由、多轮上下文保留、药物状态安全门控、站点过滤搜索、长度感知合成和引擎级可靠性。这些发现支持了agentic临床基准性能由底层基础模型和编排架构共同塑造的观点。然而,我们也注意到在使用其他模型作为评分器时存在显著差异;特别是,当使用Gemini 2.5 Pro时,MDIA得分为0.6585,这表明评分器的选择是变异性来源。因此,对LLM的稳健评估需要跨多个独立评分器模型进行评估。

英文摘要

Reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, and evaluate it on the full HealthBench Professional benchmark (n = 525) with a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is +3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. Our experiments indicate that the performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped by both the underlying foundation model and the orchestration architecture. Nevertheless, we also observed notable differences when other models served as the grader; in particular, with Gemini 2.5 Pro as the grader, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

2605.24697 2026-05-26 cs.CL cs.AI 版本更新

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

路径很重要:学习扩散语言模型的令牌提交策略

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu

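Affiliations * Department of Computer Science and Technology, University of Cambridge; Department of Engineering Science, University of Oxford

AI summary Proposes TraceLock, a lightweight plug-in controller that learns a reusable trace-state policy for token-commitment decisions in diffusion language models, improving the trade-off between quality and number of decoding steps.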

详情
AI中文摘要

扩散大语言模型通过并行细化多个令牌位置有望实现更快的生成,但这种并行性引入了一个隐藏的控制问题:每一步中哪些提议的令牌应被转移到部分解码的序列中?我们将此决策称为令牌提交。现有的冻结生成器解码器主要依赖于手工设计的置信度规则或特定块的接受过滤器。我们认为令牌提交可以学习为一种可复用的轨迹状态策略。我们引入了TraceLock,一种轻量级可插拔控制器,为冻结的扩散语言模型实例化此策略。由于无法获得 oracle 提交时间,TraceLock 从未来稳定性中推导出自我监督:在解码步骤 t,如果提议的令牌在完整解码轨迹完成后与位置 i 的最终令牌匹配,则将其标记为稳定。控制器对可变长度的轨迹状态进行评分,并决定哪些活跃的令牌提议应被提交到部分解码的序列中。一旦为给定的冻结主干训练完成,该控制器可以在局部窗口宽度、生成长度和步数预算下部署,无需重新训练或按设置校准。在问答、数学推理和代码生成上的实验表明,TraceLock 在质量-步数权衡上优于启发式和学习的基线,在跨设置部署下尤其稳定。诊断分析表明,其决策不能简化为标量置信度,这表明冻结的扩散语言模型暴露了一个超越基于置信度解码的可学习的提交轨迹空间。代码可在 https://github.com/BobSun98/TraceLock 获取。

英文摘要

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.
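The future-stability labelling rule is simple enough to sketch directly. The trace layout (a list of steps, each holding one proposed token per position, with `None` where nothing is proposed) is an assumption made for illustration and not TraceLock's actual data format.

```python
def stability_labels(trace, final_tokens):
    """Label each proposal 1 if it matches the final token at its position, else 0.

    trace[t][i] is the token proposed for position i at decoding step t,
    or None when no proposal is active for that position.
    """
    return [
        [None if tok is None else int(tok == final_tokens[i]) for i, tok in enumerate(step)]
        for step in trace
    ]

# Hypothetical two-step trace over three positions.
trace = [["the", "dog", None], ["the", "cat", "sat"]]
final_tokens = ["the", "cat", "sat"]
print(stability_labels(trace, final_tokens))
```

The labels need only the completed trace, so no oracle commitment times are required; a controller can then be trained to predict these labels from the trace state at each step.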

2605.24687 2026-05-26 cs.CV cs.AI 版本更新

HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing

HoloFair: 统一的T2I公平性评估与Fair-GRPO去偏

Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu, Liming Fang

Affiliations * Nanjing University of Aeronautics and Astronautics; School of Software Technology, Zhejiang University, Ningbo, China; Ningbo Global Innovation Center, Zhejiang University, Ningbo, China; Collaborative Innovation Center of Novel Software Technology and Industrialization

AI summary Proposes the HoloFair benchmark framework, which evaluates the fairness of text-to-image models with the Multi-attribute, Group-wise Bias Index (MGBI), and introduces Fair-GRPO, a reinforcement-learning-based debiasing method that significantly improves multidimensional fairness on SD3.5-Medium while maintaining image quality.

Comments Accepted to ICML 2026. Code and dataset are available at https://github.com/1059684669/HoloFair

详情
AI中文摘要

文本到图像(T2I)模型在视觉真实感和语义一致性方面取得了显著进展,但它们常常延续并放大社会偏见。现有的评估方法通常只处理单维偏见,缺乏从社会相关深层语义层面揭示模型偏见的视角。我们引入了HoloFair,一个用于多维人口统计偏见分析的综合基准框架。该框架基于我们大规模面向公平性的数据集和SpaFreq(空间-频率)属性分类器,提出了多属性组间偏差指数(MGBI)指标,旨在评估内在多样性和条件偏见。除评估外,我们还进一步引入了Fair-GRPO,一种基于强化学习的去偏方法,通过设计的多目标奖励函数改变生成模型的分布。例如,在SD3.5-Medium模型上的实验表明,Fair-GRPO在保持高图像质量的同时显著改善了多维公平性。我们还分析了潜在的奖励黑客现象,并提供了相应的缓解策略。代码和数据集可在https://github.com/1059684669/HoloFair获取。

英文摘要

Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and lack the means to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. For example, experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair

2605.24686 2026-05-26 cs.AI 版本更新

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

大型语言模型中的情商在感知、认知和交互上存在碎片化

Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han, Mengyue Wu

Affiliations * X-LANCE Lab; School of Computer Science, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing; Beijing Key Laboratory of Applied Experimental Psychology; National Demonstration Center for Experimental Psychology Education, Faculty of Psychology, Beijing Normal University

AI summary Proposes FACET, a framework grounded in the Mayer-Salovey-Caruso four-branch ability model for evaluating the emotional intelligence of large language models, and finds that it is not a single capability but is fragmented across cognitive and interactive dimensions, with hidden emotion recognition a bottleneck for every model tested.

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地集成到情感敏感领域,其情商(EI)的结构完整性成为安全和对齐的关键前沿。当前的基准测试常常将表面的礼貌与深层次的情感推理混为一谈,未能区分感知准确性和交互效能。在此,我们引入FACET(功能性情感能力和共情测试),这是一个基于心理测量学的框架,包含480个专家设计的项目。与先前的指标不同,FACET在理论上锚定于Mayer-Salovey-Caruso四分支能力模型,通过情绪感知、促进、理解和管理来操作化情商。通过对九个前沿模型(包括GPT-5、Claude-Sonnet-4)的评估,我们证明情商并非单一能力,而是在认知和交互维度上碎片化。尽管前沿模型在客观情绪识别和社会推理方面表现出强大的能力,但这并不一致地转化为交互成功。我们将这些差异归类为三种不同的表现类型:认知主导型、交互主导型和情境依赖型。这些类型表明情感技能并非随通用智能或模型大小均匀扩展;相反,它们由特定的对齐范式塑造。值得注意的是,我们识别出隐藏情绪识别是所有架构的普遍性能瓶颈。我们的结果表明,当前的RLHF过程可能优化了“随机共情”,即对情感句法的统计模仿,而牺牲了整合的情感推理。这些发现挑战了线性情感扩展的假设,并为开发能够真正临床共鸣的社会感知智能体提供了严谨的路线图。

英文摘要

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

2605.24684 2026-05-26 cs.LG cs.AI 版本更新

Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs

超越聚合困境:多模态图的先验保持解耦学习

Hao Yan, Xuanru Wang, Jun Yin, Shirui Pan, Senzhang Wang, Chengqi Zhang

发表机构 * School of Computer Science and Engineering, Central South University(中南大学计算机科学与工程学院) Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University(香港理工大学数据科学与人工智能系) School of Information and Communication Technology, Griffith University(格里菲斯大学信息与通信技术学院)

AI总结 针对多模态属性图学习中强制聚合导致性能反转的聚合困境,提出解耦双路径架构SUPRA,通过保持先验特征的独立性和轻量级共享GNN捕获结构协同,并辅以深度监督缓解梯度饥饿,实现SOTA性能且显著降低计算开销。

AI中文摘要

多模态属性图学习(MAGL)通过图聚合将节点内在属性与结构拓扑相结合。然而,随着预训练编码器演变为大型基础模型(LFM),MAGL的格局发生了根本性转变:在高置信度LFM先验下,强制聚合引入了拓扑噪声,淹没了判别信号,引发反直觉的性能反转,即复杂的MAGL架构性能不如简单的拓扑无关MLP。通过系统的实证和理论分析,我们确定这种反转源于一个基本的聚合困境,其特征是两种并发病理:(1)表征病理(信噪比退化):强制聚合用拓扑噪声稀释了鲁棒的内在特征,导致噪声惩罚超过其协作收益;(2)优化病理(梯度饥饿):拓扑聚合减弱了梯度流,而共享任务损失导致主导模态过早抑制较弱模态。为解决这一困境,我们提出SUPRA(共享-独特先验保持架构),一种解耦的双路径范式。SUPRA通过拓扑无关的MLP处理模态特定特征,同时通过轻量级共享GNN捕获结构协同,并辅以深度监督来对抗梯度饥饿。大量评估表明,SUPRA实现了最先进的性能,同时与多模态图变换器相比,峰值GPU内存需求降低3.5倍,训练速度最高快4.4倍。

英文摘要

Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, as pretrained encoders evolve into Large Foundation Models (LFMs), the landscape of MAGL fundamentally shifts: under high-confidence LFM priors, mandatory aggregation introduces topological noise that overwhelms discriminative signals, triggering a counter-intuitive performance inversion where sophisticated MAGL architectures underperform simple topology-agnostic MLPs. Through systematic empirical and theoretical analysis, we identify that this inversion stems from a fundamental aggregation dilemma characterized by two concurrent pathologies: (1) Representational Pathology (SNR Degradation) - mandatory aggregation dilutes robust intrinsic features with topological noise, causing the noise penalty to outweigh its collaborative benefit; and (2) Optimization Pathology (Gradient Starvation) - topological aggregation attenuates gradient flow, while a shared task loss causes dominant modalities to prematurely suppress weaker ones. To resolve this dilemma, we propose SUPRA (Shared-Unique Prior-Retaining Architecture), a decoupled dual-pathway paradigm. SUPRA processes modality-specific features through topology-agnostic MLPs while capturing structural synergy via a lightweight shared GNN, with auxiliary deep supervision counteracting gradient starvation. Extensive evaluations demonstrate that SUPRA achieves state-of-the-art performance while requiring 3.5x lower peak GPU memory and up to 4.4x faster training time than Multimodal Graph Transformers.

2605.24675 2026-05-26 cs.CV cs.AI 版本更新

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

VaaWIT: 面向多语言网页图像翻译的大语言模型视觉感知适配

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tianjin University(天津大学) Tsinghua University(清华大学)

AI总结 针对网页图像翻译中视觉表示差距问题,提出VaaWIT框架,通过双流注意力模块和视觉感知适配器,实现大语言模型对细粒度视觉特征的动态融合,在多个基准上超越开源模型并接近闭源模型性能。

Comments Accepted by KDD 2026

AI中文摘要

翻译网页图像中的文本对于改善内容可访问性和跨语言信息检索至关重要,尤其是在社交媒体和电子商务领域。尽管大型视觉语言模型(LVLMs)已经推进了多模态理解,但由于视觉表示差距,将它们应用于网页图像翻译仍然具有挑战性:标准编码器通常优先考虑高级语义,而忽略了识别多样字符形态所需的细粒度视觉细节。为了解决这一挑战,我们提出了VaaWIT,一个端到端框架,用于适配大语言模型进行多语言网页图像翻译。该框架引入了两项关键技术贡献:(1)双流注意力模块(DSAM),促进多语言语义特征与详细视觉表示之间的双向交互,从而合成对文本变化鲁棒的统一特征;(2)视觉感知适配器(VAA),一种参数高效的微调策略,将这些融合的视觉线索动态注入冻结的LLM主干。这种设计使模型能够有效地将视觉上下文与语言推理对齐,同时最小化计算成本。在三个公共基准上的八个任务上的大量实验表明,VaaWIT显著优于最先进(SOTA)的开源基线,并达到了与专有模型相竞争的性能。这些结果验证了将细粒度视觉感知集成到LLM中用于复杂网页内容分析的有效性。

英文摘要

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24667 2026-05-26 cs.AI cs.LG 版本更新

When Mean CE Fails: Median CE Can Better Track Language Model Quality

当平均交叉熵失效时:中位数交叉熵能更好地跟踪语言模型质量

Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

发表机构 * i14 University of Melbourne(墨尔本大学) University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 本文发现中位数交叉熵比平均交叉熵更能反映语言模型在训练过程中的任务性能,并建议在评估时报告多个百分位交叉熵。

Comments 20 pages

AI中文摘要

平均交叉熵是语言模型的标准验证指标,但在训练过程中可能无法跟踪模型质量。我们在两种常见场景下研究了这一点。首先,在Qwen2.5-1.5B的合成事实学习SFT中,我们发现平均CE在初始学习阶段后显著上升,而保留的事实召回准确率保持接近峰值。其次,在TinyStories上的top-K蒸馏中,我们发现减小K会改善中位数CE而恶化平均CE;Top-5学生获得了最高的LLM评判分数,并在中位数CE上低于其教师,尽管其平均CE最差。在这两种情况下,中位数CE与任务性能的相关性比平均CE更紧密。分析训练过程中整体和尾部百分位CE的变化表明,训练重塑了经验性的每token CE分布。在top-K蒸馏中,较小的K产生了一个在两端都有更多质量的分布,降低了中位数并增加了平均值。在Qwen SFT中,整体部分迅速饱和,而尾部在训练后半段延伸。在这两种情况下,任务评估指标似乎对整体部分比尾部更敏感。实际上,我们建议在报告平均CE的同时报告一小部分百分位CE摘要,并利用它们之间的一致性作为跟踪分布重塑的工具,以及当平均和中位数CE在模型选择上不一致时的低成本诊断。

英文摘要

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
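
A minimal sketch of the reporting practice recommended in this abstract, assuming per-token cross-entropy values are already available as a plain list. The helper names `percentile` and `ce_summary`, the chosen percentiles, and the toy loss values are illustrative assumptions, not the authors' code or data.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def ce_summary(per_token_ce, percentiles=(50, 90, 99)):
    """Report mean CE alongside a small set of percentile CE values."""
    summary = {"mean": statistics.fmean(per_token_ce)}
    for q in percentiles:
        summary[f"p{q}"] = percentile(per_token_ce, q)
    return summary


# Made-up per-token losses: model B has a sharper bulk but a heavier tail than model A.
model_a = [1.0] * 90 + [3.0] * 10
model_b = [0.5] * 90 + [9.0] * 10

summary_a = ce_summary(model_a)
summary_b = ce_summary(model_b)

# Mean and median disagree on which model is better: the situation the abstract flags.
disagree = (summary_b["mean"] > summary_a["mean"]) and (summary_b["p50"] < summary_a["p50"])
```

In this toy case the tail alone drives the mean upward while the median improves, so reporting both makes the reshaping of the per-token distribution visible.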

2605.24663 2026-05-26 cs.CR cs.AI 版本更新

CyBOKClaw: Human-in-the-Loop CyBOK Mapping for Cybersecurity Curriculum

CyBOKClaw:用于网络安全课程的人机协同CyBOK映射框架

Yan Lin Aung, Kevin Togbe

发表机构 * University of Derby, Derby, UK(德比大学)

AI总结 提出CyBOKClaw,一种可解释的人机协同检索框架,通过查询归一化、术语扩展、概念提升、主题描述丰富和领域敏感排序规则,将网络安全关键词/短语映射到CyBOK,并采用专家引导的top-5有用性指标ECA-5评估,在开发集和验证集上分别达到91.88%和98.00%的ECA-5。

AI中文摘要

本文提出了CyBOKClaw,一个可解释的人机协同检索框架,用于将网络安全关键词或短语(KWoPs)映射到网络安全知识体系(CyBOK)。该框架并非将任务视为严格的精确分类,而是设计为供专家审查的top-k候选生成器。它结合了查询归一化、策划的术语扩展、概念级提升、主题描述丰富以及领域敏感的排序规则。由于教育领域的KWoPs通常宽泛、模糊且仅与CyBOK术语大致对齐,严格的精确匹配只能提供部分实际效用。因此,我们使用结构检索指标和专家引导的top-5有用性指标ECA-5(前5名中精确或最接近可接受匹配)来评估该框架,该指标记录返回的候选是否包含至少一个专家判断为精确或可接受为最接近实际CyBOK位置的映射。在开发数据集上,CyBOKClaw达到了64.73%的EXA-5(前5名精确匹配)、84.18%的结构语义对齐和91.88%的ECA-5;在验证数据集上,达到了81.19%的EXA-5、93.32%的结构语义对齐和98.00%的ECA-5。这些结果表明,专家引导的top-k有用性比单纯的精确结构匹配更能忠实地反映实际CyBOK映射效用,并且CyBOKClaw作为一种针对CyBOK的专家支持检索系统是有效的。

英文摘要

This paper presents CyBOKClaw, an interpretable human-in-the-loop retrieval framework for mapping cybersecurity keywords or phrases (KWoPs) to the Cyber Security Body of Knowledge (CyBOK). Rather than treating the task as strict exact classification, the framework is designed as a top-k candidate generator for expert review. It combines query normalization, curated term expansion, concept-level boosts, topic-description enrichment, and domain-sensitive ranking rules. Because educational KWoPs are often broad, ambiguous, and only approximately aligned with CyBOK terminology, strict exact matching provides only a partial account of practical utility. We therefore evaluate the framework using both structural retrieval metrics and an expert-guided top-5 usefulness metric, ECA-5 (Exact or Closest Acceptable Match at top-5), which records whether the returned candidates contain at least one mapping that an expert would judge exact or accept as the nearest practical CyBOK placement. On the development dataset, CyBOKClaw achieves 64.73% EXA-5 (Exact Match at top-5), 84.18% structural semantic alignment, and 91.88% ECA-5; on the validation dataset, it achieves 81.19% EXA-5, 93.32% structural semantic alignment, and 98.00% ECA-5. These results show that expert-guided top-k usefulness provides a more faithful account of practical CyBOK mapping utility than exact structural matching alone, and that CyBOKClaw is effective as a CyBOK-specific expert-support retrieval system.
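
The two top-5 metrics described in this abstract can be sketched as hit-at-k checks against two different label sets. This is a hedged illustration only: the record layout (`ranked`, `exact`, `closest_acceptable`), the function names, and the placeholder topic IDs are assumptions, and the toy queries are not taken from the paper's datasets.

```python
def hit_at_k(ranked_candidates, accepted_labels, k=5):
    """True if any of the top-k candidates is in the accepted label set."""
    return any(c in accepted_labels for c in ranked_candidates[:k])


def score(queries, k=5):
    """Return (EXA-k, ECA-k) as percentages over a list of query records."""
    exact_hits = 0
    useful_hits = 0
    for q in queries:
        exact = set(q["exact"])
        # ECA counts an exact match OR an expert-accepted nearest placement.
        acceptable = exact | set(q["closest_acceptable"])
        exact_hits += hit_at_k(q["ranked"], exact, k)
        useful_hits += hit_at_k(q["ranked"], acceptable, k)
    n = len(queries)
    return 100.0 * exact_hits / n, 100.0 * useful_hits / n


# Placeholder topic IDs, purely illustrative.
queries = [
    {"ranked": ["t1", "t2", "t3", "t4", "t5", "t6"], "exact": ["t2"], "closest_acceptable": []},
    {"ranked": ["t7", "t8", "t9", "t1", "t3", "t4"], "exact": ["t4"], "closest_acceptable": ["t9"]},
    {"ranked": ["t5", "t6", "t7", "t8", "t9"], "exact": ["t1"], "closest_acceptable": ["t2"]},
    {"ranked": ["t3"], "exact": ["t3"], "closest_acceptable": []},
]

exa5, eca5 = score(queries, k=5)
```

Because the acceptable set always contains the exact set, ECA-k can never fall below EXA-k under this reading, which matches the ordering of the reported figures.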

2605.24661 2026-05-26 cs.AI cs.CL 版本更新

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

衡量LLM中的推理质量:一个多维行为框架

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * Department of Computer Engineering, Tarsus University(塔尔苏斯大学计算机工程系) School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)(计算与增强智能学院(SCAI),亚利桑那州立大学(ASU)) HumaConn AI Consulting(HumaConn AI咨询)

AI总结 提出一个基于行为的多维框架,从正确性、一致性、鲁棒性、逻辑连贯性、效率和稳定性六个维度评估LLM推理质量,揭示仅靠准确率无法观察到的行为,并支持部署决策。

AI中文摘要

LLMs在复杂推理任务中取得了显著成功,但当前的评估方法主要依赖最终答案的正确性,对产生这些答案的底层推理过程提供的洞察有限。为弥补这一空白,本研究从行为角度提出了一个统一的多维框架来衡量LLMs的推理质量,操作化了六个理论驱动的维度:正确性(CQ)、一致性(CS)、鲁棒性(RS)、逻辑连贯性(LS)、效率(ES)和稳定性(SS)。在四个基准测试的975个条目上对七个LLMs进行的广泛实验表明,该框架揭示了仅靠准确率指标无法观察到的行为。值得注意的是,逻辑连贯性与正确性正交(r = -0.172,不显著),证实了正确答案可能源于不连贯的推理,而Claude-Haiku-4.5取得了最高的多维得分(Q_bal = 0.778)。此外,该框架暴露了关键的排名反转:DeepSeek-V3在准确率优先下排名第二,但在法律/合规权重下排名第五,这种反转是单一指标评估无法检测到的。判别效度证实11/15个维度对是独立的(|r| < 0.50),为将每个维度视为不同信号提供了心理测量学支持。该框架产生的维度概况直接支持三类部署决策:识别那些虽然最终答案正确但推理轨迹无法通过问责审计的模型(LS-CQ正交性);防止仅基于准确率的基准测试导致的排名错误;以及确保没有单一指标默默替代框架捕获的六个独立信号。

英文摘要

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS-CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.

2605.24657 2026-05-26 cs.AI cs.SE 版本更新

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

超越纯推理部署:比较基于权重的巩固与级联压缩

Simon Dennis, Kevin Shabahang, Hao Guo, Rivaan Patil

发表机构 * University of Melbourne(墨尔本大学)

AI总结 针对大型语言模型纯推理部署中用户知识无法持久化的问题,提出通过夜间反射、合成和LoRA微调将交互知识巩固到模型权重中,实验表明该方法相比级联压缩知识保留率提升43.6个百分点。

Comments 15 pages

AI中文摘要

主流LLM平台以纯推理配置部署模型:模型服务请求但从不更新每个用户的权重。用户必须反复重新教授偏好、修正和项目上下文,基于上下文的变通方法消耗上下文窗口空间,并在级联压缩下退化。我们评估了一种替代方案:通过反射、合成和低秩适应(LoRA)微调,在单个消费级GPU上将交互知识夜间巩固到模型权重中。在十次真实的软件开发对话中(n=10,三种记忆类型共1146个测试问题),三轮级联压缩保留了36.8±3.0%的知识(介于11.8%的无上下文下限和90.1%的全上下文上限之间),而巩固保留了80.4±1.3%——提升了43.6个百分点(配对t(9)=14.8,p<0.001),是压缩保留量的两倍多,其中程序性修正(36.3%->74.6%)和情景项目事实(31.5%->78.2%)的增益最大。作为方法论上的附带说明,平均每token验证交叉熵与LLM判断的准确性呈负相关(r=-0.51),而中位数每token验证交叉熵几乎完全跟踪准确性(r=+0.99):在容忍表面形式变化的评估器下,平均值具有误导性,而重尾鲁棒统计量才是可靠的信号。持久个性化需要超越纯推理部署,转向将知识巩固到权重的架构。

英文摘要

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 ± 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 ± 1.3%, a 43.6 pp gain (paired t(9) = 14.8, p < 0.001) that more than doubles what compaction preserves, with the largest gains on procedural corrections (from 36.3% to 74.6%) and episodic project facts (from 31.5% to 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.
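
The headline retention figures can be checked with a few lines of arithmetic. The sketch below uses only the percentages quoted in the abstract; the `normalized` helper is an illustrative reading of where each method sits between the no-context floor and the full-context ceiling, and is an assumption rather than a metric reported by the authors.

```python
compaction, consolidation = 36.8, 80.4   # knowledge retention in %, from the abstract
floor, ceiling = 11.8, 90.1              # no-context and full-context bounds in %

gain_pp = round(consolidation - compaction, 1)   # percentage-point gain
ratio = consolidation / compaction               # "more than doubles"


def normalized(retention):
    """Fraction of the floor-to-ceiling gap recovered (illustrative, not from the paper)."""
    return (retention - floor) / (ceiling - floor)


compaction_share = normalized(compaction)
consolidation_share = normalized(consolidation)
```

Under this reading, cascading compaction recovers roughly a third of the available gap and consolidation close to nine tenths of it.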

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD 版本更新

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

AVBench:面向音视频生成模型的人类对齐与自动化评估基准

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

发表机构 * Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出AVBench,通过细粒度人类中心指标和偏好学习训练的专业评估器,实现音视频生成的自动化、准确评估。

AI中文摘要

音视频(AV)生成的快速进步使得能够生成具有同步声音的高保真合成内容,特别是涉及语音和交互的人类相关场景。然而,AV生成的评估仍处于早期阶段,只有少数针对人类相关场景的粗粒度基准,并且依赖于有限的预设评估和通用多模态大语言模型,导致对模型能力的不准确评估。为了解决这些问题,我们引入了AVBench,一个专为人类中心AV生成设计的全自动化基准。AVBench基于两个关键设计以实现全面准确的评估:(i)人类中心和细粒度指标。AVBench整合了十个评估维度,专为以人为中心的现实场景设计,涵盖视觉质量、音频质量以及跨模态的多层次一致性。这些实用指标捕捉了现有基准经常忽略的人类相关细节。(ii)通过偏好学习训练的专业评估器。为了解决缺乏专门训练数据的问题,我们通过将真实视频转化为具有受控扰动的多样化训练对来构建大规模监督。在该高质量数据集上微调后,评估器学会可靠地检测细微的跨模态不一致性。关键的是,AVBench不输出离散的文本判断,而是从模型对二元决策的预测置信度中推导出连续评估分数。这种概率评分机制比传统的VQA风格评估更可靠,并且与人类判断高度一致。综合来看,AVBench为AV生成提供了自动化评估,展示了数据过滤的强大潜力,并可作为来自人类反馈的强化学习(RLHF)的可微分奖励信号。

英文摘要

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
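
One way to picture the probabilistic scoring described here is to turn the evaluator's two answer logits into a continuous score instead of a hard label. This is a sketch under stated assumptions: the abstract does not give the exact formula, so the two-way softmax over hypothetical "yes"/"no" logits and the function names below are illustrative only.

```python
import math


def confidence_score(logit_yes, logit_no):
    """Probability of the positive answer under a two-way softmax."""
    # Equivalent to sigmoid(logit_yes - logit_no); branches keep exp() from overflowing.
    diff = logit_yes - logit_no
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def hard_label(logit_yes, logit_no):
    """Discrete VQA-style judgment for comparison: it discards the margin."""
    return "yes" if logit_yes >= logit_no else "no"


# Two hypothetical samples that receive the same hard label but different confidence.
strong = confidence_score(2.0, 0.0)
weak = confidence_score(0.5, 0.0)
same_label = hard_label(2.0, 0.0) == hard_label(0.5, 0.0)
```

The continuous score preserves the ordering between the two samples, which a discrete textual judgment would collapse.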

2605.24639 2026-05-26 cs.CV cs.AI 版本更新

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

DisDop: 基于领域先验蒸馏的开放词汇航空目标检测

Ruihao Xu, Yong Liu, Yansong Tang, Sule Bai, Xubing Ye, Bingyao Yu, Yutao Guo, Jiwen Lu, Jie Zhou

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Tsinghua University(清华大学)

AI总结 提出DisDop框架,通过从遥感基础模型(RemoteCLIP和DINOv3)中系统蒸馏多级领域先验知识到轻量级检测器,实现开放词汇航空目标检测的最新性能。

AI中文摘要

近年来,随着无人机的广泛应用,航空图像的目标检测引起了越来越多的关注,尤其是不受预定义类别限制的开放词汇航空检测。由于无人机视角图像的稀缺性及其与自然图像的显著差异,直接应用为自然场景设计的普通开放词汇检测方法难以取得令人满意的结果。一些研究提出通过使用轻量级网络或生成伪标签来从预训练模型迁移知识,但它们往往依赖于在自然图像上训练的模型,忽略了专门为遥感和航空图像定制的基础模型的潜力。为了解决这一局限性,我们提出了DisDop,一个统一的框架,系统地将来自遥感基础模型(例如RemoteCLIP和DINOv3)的多级领域先验知识蒸馏到轻量级检测器中。具体来说,我们首先通过教师融合策略蒸馏视觉先验,该策略结合了RemoteCLIP的跨模态对齐能力和DINOv3的细粒度局部特征提取能力,将其互补优势迁移到检测器的骨干网络中。其次,我们通过显式建模类别间语义关系来蒸馏嵌入在RemoteCLIP文本编码器中的文本先验,同时结合全局上下文先验以增强小目标的局部特征表示。通过这种多级先验蒸馏框架,我们的DisDop在开放词汇航空检测基准上取得了新的最先进性能。大量的消融分析也证明了我们提出模块的合理性和有效性。

英文摘要

With the widespread application of drones in recent years, object detection of aerial images has attracted increasing attention, especially open-vocabulary aerial detection which is not restricted to predefined categories. Due to the scarcity of drone's viewpoint images and their significant differences from natural images, it is difficult to achieve satisfying results by directly applying vanilla open-vocabulary detection methods designed for natural scenarios. Some studies propose to transfer knowledge from pre-trained models by using lightweight networks or generating pseudo labels, but they tend to rely on models trained on natural images, neglecting the potential of foundation models specifically tailored for remote sensing and aerial imagery. To address this limitation, we propose DisDop, a unified framework that systematically distills multi-level domain priors from remote sensing foundation models (e.g., RemoteCLIP and DINOv3) into a lightweight detector. Specifically, we first distill visual priors through a teacher fusion strategy that combines RemoteCLIP's cross-modal alignment capability with DINOv3's fine-grained local feature extraction ability, transferring their complementary strengths to the detector's backbone. Second, we distill textual priors embedded in RemoteCLIP's text encoder by explicitly modeling inter-category semantic relationships, while incorporating global contextual priors to enhance local feature representation for small objects. Through this multi-level prior distillation framework, our DisDop achieves new state-of-the-art performance on open-vocabulary aerial detection benchmarks. Extensive ablation analysis also demonstrates the rationality and effectiveness of our proposed modules.

2605.24632 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput

揭秘神话或颠覆漏洞经济学?从零日不对称到防御者修复吞吐量

Alfredo Pesoli, Herman Errico, Lorenzo Cavallaro

发表机构 * University College London(伦敦大学学院) Bynario

AI总结 本文通过漏洞经济学视角分析LLM驱动的漏洞发现,指出其核心影响并非增加零日漏洞,而是提升防御者修复吞吐量,并利用Anthropic Mythos预览和Mozilla Firefox合作数据论证这一转变。

AI中文摘要

最近,大型语言模型在生产软件中生成候选和确认漏洞的演示,重新引发了AI将重塑攻防安全的叙事。头条新闻强调能力,却很少审视成本和激励。本文通过漏洞经济学视角审视LLM驱动的漏洞发现:即生产、证明、优先级排序和修复安全相关缺陷的操作经济学。历史上,最引人注目的高端漏洞经济学是攻击方定价的,因为生产级零日漏洞和利用链是面向政府、经纪人和攻击方供应商的昂贵专家输出。防御方漏洞经济学早已存在于漏洞研究、奖励计划和供应商修复工作中;LLM辅助系统改变了其规模和分布。它们使得候选生成、代码理解、测试工具构建、影响证明草拟和报告准备在代码库规模上更便宜。利用和概念验证仍然重要,但在防御方工作流中,它们主要用于证明影响、指导优先级排序和证明修复的合理性。由此产生的瓶颈不仅仅是发现更多漏洞,而是吸收、验证、分类、修补和发布更大规模的报告流。利用Anthropic的Mythos预览和Mozilla Firefox合作的公开数据,以及公开的利用市场价格锚点和漏洞奖励计划,我们认为近期的转变不仅仅是更多的零日漏洞。而是向更广泛的防御者修复吞吐量迈进:低信号候选变得更便宜,证据丰富的修复变得更加重要,稀缺的能力转向维护者审查和发布工作。这种影响在开源领域尤为严重,因为LLM辅助发现可以增加报告量,而维护者侧的验证、分类、资金和发布能力可能无法扩展。

英文摘要

Recent demonstrations of large language models producing candidate and confirmed vulnerabilities in production software have renewed the narrative that AI will reshape offensive and defensive security. Headlines emphasize capability; they rarely interrogate costs and incentives. This paper examines LLM-driven vulnerability discovery through a bugonomics lens: the operational economics of producing, proving, prioritizing, and fixing security-relevant defects. Historically, the most visible high-end bugonomics was offense-priced because production-grade zero-days and exploit chains were expensive specialist outputs for governments, brokers, and offensive vendors. Defender-side bugonomics already existed in vulnerability research, reward programs, and vendor remediation work; LLM-assisted systems change its scale and distribution. They make candidate generation, code comprehension, harness construction, proof-of-impact drafting, and report preparation cheaper at codebase scale. Exploits and proofs of concept remain important, but in defender workflows they primarily prove impact, guide prioritization, and justify remediation. The resulting bottleneck is not only finding more bugs; it is absorbing, validating, triaging, patching, and shipping a larger stream of reports. Using public data from Anthropic's Mythos Preview and Mozilla Firefox collaborations, along with public exploit-market price anchors and vulnerability reward programs, we argue that the near-term shift is not simply more zero-days. It is a move toward broader defender remediation throughput: low-signal candidates become cheaper, evidence-rich remediation becomes more important, and scarce capacity shifts toward maintainer review and release work. The effect is acute in open source, where LLM-assisted discovery can increase report volume while maintainer-side validation, triage, funding, and release capacity may not scale.

2605.24631 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

超越生成先验:JEPA引导扩散的少数采样

Sol Park, Soobin Um

发表机构 * Department of Artificial Intelligence, Kookmin University, Seoul, South Korea(人工智能系,韩国全州大学,首尔)

AI总结 提出一种基于世界模型JEPA引导的扩散采样框架,通过近似策略实现高效计算,在无条件、类别条件和文本到图像生成中提升少数样本的保真度和语义有效性。

Comments ICML 2026, 21 pages, 9 figures

AI中文摘要

少数采样旨在数据流形上生成低密度实例,在医学诊断、异常检测和创意AI等应用中具有核心重要性。然而,现有方法相对于从训练数据中学习的生成先验来定义少数样本,将稀有性限制在可能无法很好反映现实世界语义的模型特定概念中。在这项工作中,我们提出了一种以世界为中心的少数采样视角,该视角相对于现实世界先验而非生成器诱导的密度来定义稀有性。为此,我们引入了JEPA引导,一种由联合嵌入预测架构(JEPA)引导的扩散采样框架——JEPA是一类编码广泛、语义丰富表示的世界模型。JEPA引导将扩散轨迹导向JEPA隐含密度下的低密度区域,从而使生成的少数样本与现实世界的语义稀有性对齐。为了使JEPA引导在计算上实用,我们开发了带有理论误差界限的原则性近似策略,显著降低了引导计算的开销。在无条件、类别条件和文本到图像生成上的大量实验表明,JEPA引导持续提高了少数样本的保真度和语义有效性,在捕捉现实世界的稀有性概念方面优于以生成器为中心的基线。代码可在https://github.com/soobin-um/jepa-guidance获取。

英文摘要

Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA), a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at https://github.com/soobin-um/jepa-guidance.

2605.24621 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions

相位感知的基于小波散射的编解码器用于密集预测

Ghassen Marrakchi, Basarab Matei

发表机构 * Northern Paris Computer Science Lab, Sorbonne Paris Nord University, Villetaneuse, France(北巴黎计算机科学实验室,索邦巴黎北大学,法国维勒塔讷斯)

AI总结 提出一种相位感知散射编解码器,通过在跳跃连接中显式保留相位信息来恢复空间结构,在图像去噪和皮肤病变分割任务中验证了相位对密集预测的有效性。

Comments 21 pages, 16 figures, 10 tables

详情
AI中文摘要

散射变换实现了Lipschitz稳定性和平移不变性,但密集预测任务需要保留在全局平均中丢失的空间结构。我们提出了相位感知散射编解码器,通过在跳跃连接中显式保留相位来恢复这些信息。在图像去噪(BSD68)上,打破平移不变性使PSNR提高了+2.17 dB;相位保留额外增加了+1.03 dB。一种新颖的空间洗牌消融实验(惩罚-1.26 dB)表明相位编码了位置依赖的结构。我们在第二个密集预测任务(ISIC皮肤病变分割)上进行了初步的可扩展性研究,完整的交叉验证正在进行中。这项工作推进了原则性的小波-深度学习集成,展示了相位信息如何在像素级预测中补充散射的稳定性-表达性权衡。

英文摘要

Scattering transforms achieve Lipschitz stability and translation invariance, but dense prediction tasks require preserving spatial structure lost in global averaging. We propose Phase-Aware Scattering Encoder-Decoder, which restores this information by explicitly preserving phase in skip connections. On image denoising (BSD68), breaking translation invariance improves PSNR by +2.17 dB; phase preservation adds +1.03 dB. A novel spatial shuffling ablation (-1.26 dB penalty) demonstrates that phase encodes location-dependent structure. We conduct a preliminary extensibility study on a second dense prediction task (ISIC skin lesion segmentation), with full cross-validation as ongoing work. This work advances principled wavelet-deep learning integration, showing how phase information complements scattering's stability-expressiveness trade-off in pixel-level prediction.

2605.24614 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Measuring the Depth of LLM Unlearning via Activation Patching

通过激活修补测量大语言模型遗忘的深度

Jaeung Lee, Dohyun Kim, Jaemin Jo

发表机构 * Sungkyunkwan University(全北大学)

AI总结 提出遗忘深度评分(UDS),通过激活修补量化遗忘的机制深度,在150个遗忘模型上的元评估中达到最高忠实性和鲁棒性。

Comments 18 pages

详情
AI中文摘要

大语言模型遗忘已成为隐私保护和人工智能安全的关键事后机制,但审计目标知识是否真正被擦除仍然具有挑战性。现有的输出级指标无法检测到这些知识是否仍可从内部表示中恢复。最近的白盒研究揭示了此类残留知识,但通常依赖于辅助训练或数据集特定调整,缺乏可推广的指标。为解决这些限制,我们提出遗忘深度评分(UDS),一种通过激活修补量化遗忘机制深度的指标。UDS首先使用保留模型基线识别编码目标知识的层,然后在0-1尺度上测量遗忘模型中该知识被擦除的程度。在跨越8种方法的150个遗忘模型上的20个指标的元评估中,UDS实现了最高的忠实性和鲁棒性,证实了我们的因果方法是遗忘评估中最可靠的。案例研究进一步揭示,白盒指标可能在层级别上不一致,并且擦除深度因示例而异。我们提供了将UDS集成到现有基准测试框架并简化评估流程的指南。代码和数据可在https://github.com/gnueaj/unlearning-depth-score获取。

英文摘要

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

2605.24613 2026-05-26 cs.CL cs.AI cs.SE 版本更新

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Guarded Repair: 面向危害感知的LLM数学推理事后替换

Haizhou Xia

AI总结 提出GuardedRepair框架,通过选择性替换机制在修复LLM数学推理错误时避免破坏正确结果,在GSM8K上准确率从95.60%提升至96.89%且未破坏正确案例。

Comments 15 pages, including appendices. Code and artifacts available at https://github.com/Haizhoux0517/guarded-repair

AI中文摘要

LLM数学推理的事后修复引入了一种不对称风险:修复错误的推理轨迹是有用的,但替换原本正确的轨迹可能有害。我们在选择性替换设置下研究该问题,系统必须决定修复后的候选是否比保留原始缓存轨迹更安全。我们提出GuardedRepair,一种有保护的best-of-N修复框架,它诊断缓存推理轨迹,选择性触发修复,并仅在确定性验证守卫支持替换时才接受改变答案的候选。该框架结合了轻量级符号检查、表面语义风险诊断、有界候选生成和保守接受策略。在完整GSM8K测试集上,初始推理器已达到95.60%准确率,GuardedRepair将最终准确率提升至96.89%,修复了58个剩余错误中的17个,且主运行中未测量到破坏正确案例。在弱推理器ASDiv设置中,准确率从78.40%提升至87.60%。直接重新生成基线表明,这一增益不能仅由更强模型重新求解解释:重新求解所有GSM8K示例将准确率降至93.03%,并破坏了47个初始正确答案。额外分析表明,有保护修复显著改善了修复/破坏权衡,同时也揭示了替换风险被降低而非消除。这些结果支持将事后修复视为危害感知的选择性替换而非无约束的重新求解。

英文摘要

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.
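
The GSM8K figures in this abstract can be cross-checked with simple counting, assuming the evaluation uses the standard 1,319-problem GSM8K test split. The sketch below is a worked check, not the authors' evaluation code; the number of errors fixed by the regeneration baseline is not stated in the abstract and is only inferred here under that assumption.

```python
N = 1319  # size of the GSM8K test split (assumed to be the evaluation set)


def accuracy_pct(correct, total=N):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)


initial_errors = 58
initial_correct = N - initial_errors           # answers correct before any repair
guarded_correct = initial_correct + 17 - 0     # 17 fixed, 0 broken in the main run

# Regeneration baseline: 93.03% final accuracy with 47 initially correct answers broken.
regen_correct = round(0.9303 * N)
# Implied number of errors the baseline fixed (an inference, not a reported figure).
regen_fixed = regen_correct - initial_correct + 47
```

Under these assumptions the counts reproduce the reported 95.60%, 96.89%, and 93.03%, and they imply that unconstrained re-solving repaired fewer errors than it introduced, which is the asymmetry the guarded policy is meant to avoid.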

2605.24608 2026-05-26 cs.AI cs.CV cs.LG 版本更新

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

基于数学形态学的深度卷积学习的格论与代数模型

Gustavo, Angulo

发表机构 * Mines Paris, PSL University, CMA-Center for Applied Mathematics, Sophia-Antipolis, France(巴黎高等矿业学院,PSL大学,应用数学中心,法国索菲亚-安蒂波利斯)

AI总结 本文基于格论和数学形态学,为深度卷积架构(CNN、ResNet、UNet)建立了严格的代数框架,揭示了标准CNN流水线是交叉格算子,并识别出三种真正的幂等开运算层设计。

AI中文摘要

我们为深度卷积架构(包括CNN、ResNet和如UNet的编码器-解码器网络)建立了一个严格的代数框架,该框架基于格论和数学形态学。核心工具是Matheron-Maragos-Banon-Barrera (MMBB) 平移不变算子通用表示理论,我们将其系统地应用于标准深度网络的每一层。主要发现是:标准CNN流水线(线性卷积 + ReLU + 平坦最大池化)是一个跨格算子:卷积是傅里叶下确界半格(inf-semilattice)中的腐蚀,ReLU是格并闭运算,最大池化是逐点最大加格中的膨胀,它们的组合在这两个格中都不是形态学开运算。第二个发现是:ReLU在逐点格中的上伴随是一个全局(非局部)算子,在全局非负函数上为恒等映射,否则为负无穷,因此没有局部形态学腐蚀能与ReLU构成伴随对。这两个结果共同给出了深度在标准CNN中带来真正表示能力的精确代数原因:组合层不是幂等的。我们识别并完全刻画了三种真正的幂等开运算层设计:纯最大加形态学层(逐点格)、谱维纳层(傅里叶格)和自对偶形态学层。我们建立了完整的不动点和收敛理论。该框架还将最大池化、步长卷积和拉普拉斯金字塔统一在Goutsias-Heijmans伴随金字塔理论下,并给出了激活-池化膨胀(APD)分解及其正确的伴随算子。

英文摘要

We develop a rigorous algebraic framework for deep convolutional architectures (CNNs, ResNets, and encoder-decoder networks such as UNet) grounded in lattice theory and mathematical morphology. The central tool is the Matheron-Maragos-Banon-Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution + ReLU + flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and $-\infty$ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias-Heijmans adjoint pyramid theory, and gives the Activation-Pooling Dilation (APD) factorisation with its correct adjoint.
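A small numerical sketch may help make the idempotence contrast concrete. It assumes 1-D signals, a flat symmetric window of radius `r` truncated at the borders, and a toy first-difference kernel standing in for the linear convolution; none of these choices come from the paper, and the example illustrates the claim rather than proving it.

```python
from typing import List


def erode(f: List[float], r: int) -> List[float]:
    """Flat erosion: minimum over a window of radius r (clipped at borders)."""
    n = len(f)
    return [min(f[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def dilate(f: List[float], r: int) -> List[float]:
    """Flat dilation: maximum over the same window (stride-1 max-pooling)."""
    n = len(f)
    return [max(f[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def opening(f: List[float], r: int = 1) -> List[float]:
    """Morphological opening: erosion followed by its adjoint dilation."""
    return dilate(erode(f, r), r)


def cnn_like_layer(f: List[float], r: int = 1) -> List[float]:
    """Toy 'conv + ReLU + max-pool' layer (first-difference kernel assumed)."""
    diff = [f[i] - (f[i - 1] if i > 0 else 0.0) for i in range(len(f))]
    relu = [max(0.0, v) for v in diff]
    return dilate(relu, r)


signal = [0.0, 2.0, 5.0, 5.0, 1.0, 0.0]
once = opening(signal)
print(once == opening(once))                    # opening applied twice changes nothing
layer_once = cnn_like_layer(signal)
print(layer_once == cnn_like_layer(layer_once)) # the CNN-like layer keeps changing
```

Applying the opening a second time is a no-op, whereas stacking the CNN-like layer keeps transforming the signal, which is the sense in which depth adds something for non-idempotent layers.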

2605.24600 2026-05-26 cs.AI 版本更新

Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Agent-as-Peer-Debriefer: 一种基于视角精炼的多智能体定性分析框架

Zhimin Lin, Kun Cheng, Fan Bai, Jie Gao

发表机构 * Soochow University(苏州大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出一种多智能体框架,通过模拟同行汇报(peer debriefing)并引入理论驱动、数据驱动和应用三种分析视角,提升大语言模型在定性数据分析中的编码质量。

AI中文摘要

大型语言模型(LLMs)越来越多地被用于定性数据分析(QDA),但其输出往往缺乏人类分析的深度和细微差别。我们认为这一差距反映了人类QDA中缺失的一种可信度实践:同行汇报(peer debriefing),即分析师向无偏见的同行寻求反馈并据此完善其编码。为了将这一实践引入LLM辅助的QDA,我们提出了Agent-as-Peer-Debriefer,一个将同行汇报构建到关键编码步骤中的多智能体QDA框架。在我们的框架中,层次编码代理遵循标准QDA流程生成代码、子主题和主题,以及自我解释和反思备忘录。然后,它将输出共享给三个同行汇报代理,每个代理应用不同的分析视角(理论驱动、数据驱动或应用),并通过保留、重命名、重新分配、合并或拆分代码来完善代码。这些视角来源于跨领域和数据集通用的已建立的人类QDA实践。为了评估该框架,我们在三个LLM上对两个领域的三个数据集进行了测试,测量与人工标注代码的语义相似度。在所有设置中,基于视角的同行汇报精炼比单一LLM基线更接近人类代码,消融实验进一步表明,这种提升不仅仅来自额外的精炼。三种视角也产生了不同的权衡,表明视角的选择是一个有意义且可控的设计决策。更广泛地说,这些发现表明,用明确的视角模拟同行汇报是实现更可信的LLM辅助QDA的一条有前景的途径。

英文摘要

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

2605.24598 2026-05-26 cs.AI cs.MA 版本更新

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Hera: 面向设备-云协作LLM智能体的长时程协调学习

Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu, Zhiwen Chen, Chengfei Lv, Zhe Chen

发表机构 * Fudan University(复旦大学) Alibaba Group(阿里巴巴集团) The University of Hong Kong(香港大学) Shenzhen MSU-BIT University(深圳北理莫斯科大学) New York University(纽约大学) Universiti Malaya(马来亚大学) SpaceAIC Co., Ltd.(SpaceAIC公司)

AI总结 提出Hera,一种步骤级设备-云LLM智能体协调器,通过两阶段训练(模仿学习+强化学习)优化长时程任务的性能-成本帕累托前沿。

AI中文摘要

大型语言模型(LLM)智能体通过自主与环境交互,擅长解决复杂的长期任务。然而,它们的实际部署面临根本性的设备-云困境:设备端模型高效但脆弱,而云端模型强大但计算成本高。最先进的LLM设备-云路由器通常做出粗粒度的任务级决策,无法适应多步智能体交互中变化的难度。为解决此问题,我们提出Hera,一种用于长期任务的步骤级设备-云LLM智能体协调器,实现了强大的性能-成本帕累托前沿。Hera采用新颖的两阶段训练范式:(1)冷启动的模仿学习,随后(2)联合优化任务成功率和云端使用效率的强化学习。第一阶段将步骤级路由视为监督分类问题:设备智能体在云端轨迹上重放,每个状态根据设备与云端动作的一致性进行标记。第二阶段,我们通过跨轨迹分组相同状态并使用偏好更高期望回报和更少未来云端调用的标签更新Hera,进行成本感知的强化学习。我们在ALFWorld、WebShop和AppWorld上评估Hera,它始终优于先前方法,在仅46.3%的步骤中使用云端的情况下,达到了云端单独成功率的92.5%。

英文摘要

Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device-cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device-cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device-cloud LLM agent coordinator for long-horizon tasks achieving a strong performance-cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
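The first-stage labeling described above (replay the device agent on a cloud trajectory and label each state by action agreement) can be sketched as follows. The names `label_routing_steps` and `cloud_fraction`, the string-typed states and actions, and the 0/1 label encoding are assumptions made for illustration; the abstract does not give Hera's actual data format.

```python
from typing import Callable, List, Tuple

State = str
Action = str


def label_routing_steps(
    cloud_trajectory: List[Tuple[State, Action]],
    device_policy: Callable[[State], Action],
) -> List[Tuple[State, int]]:
    """Label each state: 0 = device agrees with cloud, 1 = route to cloud."""
    labels = []
    for state, cloud_action in cloud_trajectory:
        agrees = device_policy(state) == cloud_action
        labels.append((state, 0 if agrees else 1))
    return labels


def cloud_fraction(labels: List[Tuple[State, int]]) -> float:
    """Share of steps that would be sent to the cloud under these labels."""
    return sum(label for _, label in labels) / len(labels)


# Toy trajectory and a toy device policy backed by a lookup table.
trajectory = [("s0", "open"), ("s1", "take"), ("s2", "put")]
device_table = {"s0": "open", "s1": "look", "s2": "put"}
labels = label_routing_steps(trajectory, lambda s: device_table[s])
print(labels, cloud_fraction(labels))
```

A classifier trained on such pairs gives the cold-start router; the cost-aware reinforcement learning stage in the abstract then refines these labels and is not covered by this sketch.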

2605.24597 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Learning to Reason Efficiently with A* Post-Training

学习通过A*后训练进行高效推理

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell

发表机构 * ETH Zürich(苏黎世联邦理工学院) MPI for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所,图宾根) Purdue University(普渡大学)

AI总结 本文通过A*搜索算法指导LLM生成正确且高效的推理步骤,提出监督微调和强化学习两种训练方法,在1B-3B参数模型上显著提升推理准确性和效率。

Comments Preprint

AI中文摘要

大型语言模型(LLM)的许多应用需要演绎推理,但模型经常产生不正确或冗余的推理步骤。我们将自然语言推理框架化为一个搜索问题,其中最终答案本身就是有效的证明,需要推理过程中间推理正确。具体来说,我们研究LLM是否能够通过A*搜索(一种保证通向目标的最优高效路径的算法)的指导,学习生成正确且高效的证明。我们探索了两种训练技术:在A*执行轨迹上的监督微调,以及使用A*信息的过程奖励模型进行强化学习。实验发现,1B-3B范围内的Llama-3.2模型从A*后训练中获益显著,从接近零准确率提升到超越更大的模型DeepSeek-V3.2。我们的分析揭示了一个权衡:简单的正确性奖励最大化准确率,而A*信息的信号在准确率和效率之间取得平衡。此外,我们发现,在更大的搜索空间中,使用不完美启发式训练的模型表现出更高的准确率。我们的结果展示了朝着由经典搜索算法原理指导的推理方向的有前景的路径。

英文摘要

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search, an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B-3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming DeepSeek-V3.2, a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
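For readers less familiar with the search procedure that supplies the training signal, here is a minimal textbook A* over a weighted graph. It is a generic sketch, not the authors' proof-search setup: the toy graph, the node names, and the heuristic table are made up, and the optimality guarantee holds when the heuristic never overestimates the remaining cost (admissibility).

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def a_star(graph: Graph, start: str, goal: str,
           h: Callable[[str], float]) -> Optional[Tuple[List[str], float]]:
    """Return (path, cost) of a cheapest path, or None if goal is unreachable."""
    # Frontier entries: (f = g + h, g, node, path so far).
    frontier = [(h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None


toy_graph: Graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 1.0), ("D", 5.0)],
    "C": [("D", 1.0)],
}
remaining = {"A": 3.0, "B": 2.0, "C": 1.0, "D": 0.0}  # admissible heuristic
print(a_star(toy_graph, "A", "D", remaining.__getitem__))
```

The sequence of popped nodes is the kind of execution trace one could fine-tune on, and the `g + h` values are the kind of quantity a process reward could be derived from; how the paper constructs either is not specified in the abstract.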

2605.24588 2026-05-26 cs.AI cs.LG 版本更新

HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

HeartBeatAI:用于多标签心电图心律失常检测的可解释且鲁棒的深度学习框架

Shubham Gupta, Nikhil Panwar, Partha Pratim Roy

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Roorkee(印度理工学院鲁尔基分校计算机科学与工程系)

AI总结 提出HeartBeatAI框架,结合域泛化、多尺度特征聚合和临床可解释性,通过Squeeze-and-Excitation ResNet和多层集中(Multi-Layer Concentration)管道实现鲁棒的12导联心电图分类,在源内评估中达到98%宏F1分数,但在留一域外(LODO)评估中罕见异常检测性能显著下降,跨机构部署仍存在挑战。

AI中文摘要

虽然深度学习增强了自动化心电图分析,但临床部署受到类别不平衡和泛化差距的阻碍。本文提出了HeartBeatAI,一个结合域泛化、多尺度特征聚合和临床可解释性的深度学习框架,用于鲁棒的12导联心电图分类。超越基于图像的范式,HeartBeatAI集成了一个Squeeze-and-Excitation ResNet来隔离诊断导联,以及一个多层浓度管道来捕捉宏观节律和微观形态异常。为了缓解域偏移,该框架采用了MixStyle正则化和标签平滑。通过使用源内和留一域外协议在四个大规模数据集上进行严格的基准测试,在源内条件下实现了高性能(98%宏F1分数)。然而,留一域外评估揭示了检测罕见异常时的显著退化,突显了跨机构部署中持续存在的挑战。

英文摘要

While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the generalization gap. This paper presents HeartBeatAI, a deep learning framework combining domain generalization, multi-scale feature aggregation, and clinical explainability for robust 12-lead ECG classification. Moving beyond image-based paradigms, HeartBeatAI integrates a Squeeze-and-Excitation (SE) ResNet to isolate diagnostic leads alongside a Multi-Layer Concentration Pipeline to capture macro-rhythm and micro-morphological anomalies. To mitigate domain shift, the framework employs MixStyle regularization and Label Smoothing. Rigorous benchmarking across four large-scale datasets using intra-source and Leave-One-Domain-Out (LODO) protocols demonstrates high performance (98% Macro F1-score) under intra-source conditions. However, LODO evaluations reveal significant degradation in detecting rare anomalies, highlighting a persistent challenge in cross-institutional deployment.

2605.24584 2026-05-26 cs.LG cs.AI 版本更新

LAPLEX: The FFT of Learnable Laplace Kernels

LAPLEX: 可学习拉普拉斯核的FFT

Łukasz Struski, Hanna Blazhko, Piotr Kubaty, Jacek Tabor

发表机构 * Faculty of Mathematics and Computer Science, Jagiellonian University(雅盖隆大学数学与计算机科学学院) Doctoral School of Exact and Natural Sciences, Jagiellonian University(雅盖隆大学精确与自然科学博士学院) Centre for Credible Artificial Intelligence, Warsaw University of Technology(华沙理工大学可信人工智能中心)

AI总结 提出LAPLEX算子,通过可学习坐标锚点隐式定义满秩稠密矩阵,实现FFT规模的可训练矩阵-向量运算,分离表达性与存储成本。

AI中文摘要

深度学习中的快速线性代数通常面临一个选择:固定几何和精确计算(如傅里叶变换),或者通过稠密参数、随机特征或低秩近似实现自适应几何。为了超越这种权衡,我们引入了LAPLEX,一类精确的、可训练的(相位)拉普拉斯核算子。LAPLEX层通常是一个满秩稠密矩阵,由可学习的坐标锚点隐式定义,具有类似FFT的缩放特性。因此,它支持在现代GPU上对高达$10^9$维的向量进行可训练的矩阵-向量运算。作为神经网络层,它产生紧凑的投影和分类头,可解释为软性的、可训练的路由模型。同样的原语也可作为高效的Gram算子,实现对展平图像(维度$3 \cdot 10^6$)的高维协方差建模,在保留可见空间结构的同时不施加卷积偏差。这些应用反映了一个单一原则:无需存储稠密矩阵即可学习稠密几何,从而在普通稠密层无法企及的领域中实现数据自适应的全局交互。在这个意义上,LAPLEX将表达性与存储成本分离:它表现得像一个稠密可训练矩阵,但通过一个小的结构化参数集表示和应用。

英文摘要

Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or adaptive geometry paid for by dense parameters, random features, or low-rank surrogates. To move beyond this trade-off, we introduce LAPLEX, a class of exact, trainable (phased) Laplace-kernel operators. A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix-vector operations at vector dimensions up to $10^9$ on modern GPUs. As a neural layer, it yields compact projections and classification heads interpretable as soft, trainable routing models. The same primitive also serves as an efficient Gram operator, enabling high-dimensional covariance models on flattened images of dimension $3 \cdot 10^6$ that preserve visible spatial structure without imposing convolutional bias. These applications reflect a single principle: dense geometry can be learned without storing a dense matrix, which enables data-adaptive global interactions in regimes where ordinary dense layers are out of reach. In this sense, LAPLEX separates expressivity from storage cost: it behaves like a dense trainable matrix, but is represented and applied through a small structured set of parameters.

2605.24577 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

多态性即旋转:从两层Transformer到Pythia-70m的操作性机械可解释性

Jordan F. McCann

发表机构 * Independent Researcher(独立研究者)

AI总结 本文发现独立训练的Transformer在残差流基上通过均匀随机旋转相互关联,并利用正交Procrustes拟合实现特征字典和转向向量在模型间的迁移,无需重新训练。

Comments 26 pages, 4 figures, 40 references. Pre-registered four-bar framework; all numerical claims reproducible

AI中文摘要

独立训练的Transformer在残差流基上计算相同的函数,这些基通过$\mathrm{SO}(d_{\mathrm{model}})$上的均匀随机旋转相互关联。我们将这种现象称为多态性:相同的函数,但内部坐标互不可解。每对模型之间的一次矩阵乘法即可消除这种多态性:在单批激活上进行正交Procrustes拟合,即可在独立训练的模型之间迁移稀疏自编码器特征字典和转向向量,无需重新训练。 该现象对标准SAE通用性度量不可见。解码器列余弦相似度在不同种子间匹配度达98%,即SAE通用性的头条数字,而一个种子训练的SAE重构另一个种子的激活时,解释方差为负,比预测常数均值更差。解码器列对齐,但编码器从旋转后的框架读取。单个Procrustes旋转$R$可在每个内部位置将重构恢复至种子内上限的0.025 EV以内。 $R$服从Haar分布:$\|R - I\|_F$与随机正交预测$\sqrt{2 d_{\mathrm{model}}}$在$d_{\mathrm{model}} = 512$时匹配至0.1%,且$R$的特征值谱与Haar $\mathrm{SO}(d_{\mathrm{model}})$的Kolmogorov-Smirnov检验在合并和逐对情况下均返回$p \approx 1.000$。均值差转向向量通过与$R$的不变子空间对齐在三种机制下迁移:当被共享输出权重固定时清晰,与旋转子空间重叠时部分,否则反转。在无共享输入/输出(Pythia)时,所有三种情况均坍缩为普遍反转。同一旋转解释适用于单次运行中的不同训练检查点。 在104k参数的Dyck-3 Transformer和九个独立训练的Pythia-70m种子(基于The Pile数据集)上,通过预注册的四柱操作框架进行验证。前沿规模(10B+)的复现仍有待研究。

英文摘要

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{SO}(d_{\mathrm{model}})$. We call this phenomenon polymorphism: same function, mutually unintelligible interior coordinates. One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98%, the SAE-universality headline number, while an SAE trained on one seed reconstructs another seed's activations at negative explained variance, worse than predicting the constant mean. The decoder columns align; the encoder reads from a rotated frame. A single Procrustes rotation $R$ restores reconstruction to within 0.025 EV of the within-seed ceiling at every internal site. $R$ is Haar-distributed: $\|R - I\|_F$ matches the random-orthogonal prediction $\sqrt{2 d_{\mathrm{model}}}$ to 0.1% at $d_{\mathrm{model}} = 512$, and a Kolmogorov-Smirnov test of $R$'s eigenvalue spectrum against Haar $\mathrm{SO}(d_{\mathrm{model}})$ returns $p \approx 1.000$ pooled and per-pair. Diff-of-means steering vectors transfer in three regimes by alignment with $R$'s invariant subspace: clean when pinned by shared output weights, partial when overlapping the rotated subspace, inverted otherwise. With no shared I/O (Pythia), all three collapse to universally inverted. The same rotation account holds across training checkpoints within a single run. Validated on a 104k-parameter Dyck-3 transformer and nine independently-trained Pythia-70m seeds on The Pile, via a pre-registered four-bar operational framework. Frontier-scale (10B+) replication remains open.
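The $\sqrt{2 d_{\mathrm{model}}}$ benchmark quoted above follows from a short calculation, sketched here for convenience. The activation matrices $X$ and $Y$ (rows are matched activations from the two models) are notation introduced for this note, not taken from the paper. The standard orthogonal Procrustes solution and the Frobenius distance of an orthogonal matrix from the identity are:

$$
R^\star = \arg\min_{R^\top R = I} \|X R - Y\|_F = U V^\top, \qquad X^\top Y = U \Sigma V^\top \ \text{(SVD)},
$$

$$
\|R - I\|_F^2 = \operatorname{tr}(R^\top R) - 2\operatorname{tr}(R) + \operatorname{tr}(I) = 2\, d_{\mathrm{model}} - 2 \operatorname{tr}(R).
$$

For a Haar-distributed rotation, $\mathbb{E}[\operatorname{tr}(R)] = 0$ and the trace fluctuates only at order one, so $\|R - I\|_F \approx \sqrt{2\, d_{\mathrm{model}}}$; at $d_{\mathrm{model}} = 512$ this gives $\sqrt{1024} = 32$. Note that $U V^\top$ is only guaranteed to be orthogonal; restricting to $\mathrm{SO}(d_{\mathrm{model}})$ would need an additional determinant correction, and the abstract does not say whether one is applied.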

2605.24576 2026-05-26 cs.AI 版本更新

Associations between echocardiographic traits and AI-ECG predictions of heart failure

超声心动图特征与AI-ECG心力衰竭预测之间的关联

Elias Stenhede, Eivind Bjørkan Orstad, Torbjørn Omland, Henrik Schirmer, Arian Ranjbar

发表机构 * Medical Technology & E-Health, Akershus University Hospital, 1478 Lørenskog, Norway; Faculty of Medicine, University of Oslo, 0372 Oslo, Norway; Department of Cardiology, Akershus University Hospital, 1478 Lørenskog, Norway; Institute of Clinical Medicine, Campus Ahus, University of Oslo, 0317 Oslo, Norway

AI总结 本研究通过回顾性分析8147例患者数据,发现AI-ECG预测的心力衰竭风险主要与整体纵向应变等收缩功能指标相关,且在射血分数保留的患者中也能捕捉舒张功能异常。

AI中文摘要

人工智能心电图(AI-ECG)可以检测心力衰竭(HF),包括左心室射血分数(LVEF)未捕获的疾病,但模型预测背后的心脏表型仍不清楚。因此,我们研究了AI-ECG预测的HF风险是否与已确立的心肌功能障碍、重构和充盈压的超声心动图测量指标一致。我们回顾性分析了2023年1月1日至2025年6月1日期间在阿克什胡斯大学医院三天内同时接受心电图和超声心动图检查的8147名患者的数据。对所有心电图应用了先前验证的用于HF检测的AI-ECG模型。斯皮尔曼秩相关系数ρ量化了超声心动图参数与AI-ECG风险之间的关联。按性别和左心室射血分数(LVEF)进行了亚组分析。外部验证包括来自哥伦比亚大学欧文医学中心的36,286对心电图-超声心动图数据。整体纵向应变(GLS)显示出最强的相关性(ρ=0.57),其次是二尖瓣环平面收缩期位移(MAPSE)(ρ=-0.49)和LVEF(ρ=-0.45)。在LVEF>50%的患者中,GLS、MAPSE和舒张相关参数的相关性仍然显著。女性的左心室容积指数相关性较弱,而舒张指数在女性中的相关性比男性更强。生理学验证表明,AI-ECG的HF风险预测主要与收缩功能指标(特别是整体纵向应变)一致,同时也能捕捉LVEF保留患者的舒张相关异常。这种方法可能提高临床可解释性,并识别模型改进的机会。

英文摘要

Artificial intelligence-enabled electrocardiography (AI-ECG) can detect heart failure (HF), including disease not captured by left ventricular ejection fraction (LVEF), but the cardiac phenotypes underlying model predictions remain unclear. We therefore investigated whether AI-ECG-predicted HF risk aligns with established echocardiographic measures of myocardial dysfunction, remodelling, and filling pressures. We retrospectively analysed ECG and echocardiography data from 8147 patients who underwent both examinations within three days at Akershus University Hospital between 1 January 2023 and 1 June 2025. A previously validated AI-ECG model for HF detection was applied to all ECGs. Spearman's rank correlation $ρ$ quantified associations between echocardiographic parameters and AI-ECG risk. Subgroup analyses were performed by sex and left ventricular ejection fraction (LVEF). External validation included 36,286 ECG-echocardiography pairs from Columbia University Irving Medical Center. Global longitudinal strain (GLS) showed the strongest correlation ($ρ$=0.57), followed by mitral annular plane systolic excursion (MAPSE) ($ρ$=-0.49) and LVEF ($ρ$=-0.45). In patients with LVEF>50%, correlations remained substantial for GLS, MAPSE, and diastolic-related parameters. Volumetric left ventricular indices correlated less strongly in women, whereas diastolic indices showed stronger correlations in women than in men. Physiological validation showed that AI-ECG HF risk predictions align primarily with measures of systolic function, particularly global longitudinal strain, while also capturing diastolic-related abnormalities in patients with preserved LVEF. This approach may improve clinical interpretability and identify opportunities for model refinement.
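Since every association in this abstract is a Spearman rank correlation, a compact reference implementation may be useful. It is a generic sketch (Pearson correlation computed on average ranks, with ties sharing their mean rank), not the study's analysis code, and the example numbers are invented.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented example: a monotone but non-linear relation still gives rho = 1.
risk = [0.1, 0.2, 0.4, 0.9]
marker = [1.0, 8.0, 27.0, 64.0]
print(spearman_rho(risk, marker))
```

Because only ranks enter, the statistic captures any monotone relationship between an echocardiographic parameter and the predicted risk, which suits measures on very different scales such as strain and ejection fraction.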

2605.24570 2026-05-26 cs.LG cs.AI cs.CV 版本更新

PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training

PILOT: 策略引导的学习优化器用于自适应深度网络训练

Sattam Altuuaim, Lama Ayash, Muhammad Mubashar, Naeemullah Khan

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) University of Strathclyde(斯特拉思克莱德大学)

AI总结 提出PILOT在线优化器,通过梯度方向一致性信号动态调整动量、归一化和符号更新的组合,在FashionMNIST和CIFAR-10上实现更高准确率。

Comments 16 pages, 5 figures

AI中文摘要

尽管优化在深度学习中扮演核心角色,但大多数优化器依赖于训练开始前固定函数形式的更新结构。这种静态设计限制了它们响应损失景观中变化梯度行为的能力,其中训练可能在稳定、噪声和不一致状态之间切换。本研究提出PILOT(策略引导的学习优化器),一种在线优化器,在训练过程中自适应其更新行为。PILOT不使用动量、归一化和符号更新之间的固定平衡,而是将梯度方向一致性作为局部训练稳定性的信号。基于该一致性信号调整更新规则,使优化器能够在梯度变得稳定、噪声或不一致时调整其行为。在FashionMNIST和CIFAR-10上的实验表明,PILOT在卷积设置中始终达到评估优化器中的最高准确率。在CNN架构上,PILOT在FashionMNIST上达到94.13%,在CIFAR-10上达到81.94%。在ResNet-18上,它进一步提升了性能,在FashionMNIST上达到95.71%,在CIFAR-10上达到93.42%。这些结果表明,在训练过程中学习如何调整更新结构可以在保持简单一阶优化框架的同时,提高紧凑和更深卷积模型的性能。PILOT的实现公开于https://github.com/SattamAltwaim/PILOT.git。

英文摘要

Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before training begins. This static design can limit their ability to respond to changing gradient behavior across the loss landscape, where training may shift between stable, noisy, and inconsistent regimes. This study proposes PILOT (Policy-Informed Learned OpTimizer), an online optimizer that adapts its update behavior during training. Rather than using a fixed balance between momentum, normalization, and sign-based updates, PILOT uses gradient-direction agreement as a signal of local training stability. Conditioning the update rule on this agreement signal allows the optimizer to adjust its behavior when gradients become stable, noisy, or inconsistent. Experiments on FashionMNIST and CIFAR-10 show that PILOT consistently achieves the highest accuracy among the evaluated optimizers across convolutional settings. On the CNN architecture, PILOT reaches 94.13% on FashionMNIST and 81.94% on CIFAR-10. On ResNet-18, it further improves performance, reaching 95.71% on FashionMNIST and 93.42% on CIFAR-10. These results suggest that learning how to adapt the update structure during training can improve performance across both compact and deeper convolutional models while preserving a simple first-order optimization framework. The implementation of PILOT is publicly available at https://github.com/SattamAltwaim/PILOT.git
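To make the idea of conditioning an update on gradient-direction agreement tangible, here is a deliberately simple sketch. It is not PILOT's update rule, which the abstract does not spell out: measuring agreement as the cosine between the current gradient and the momentum buffer, and blending a momentum step with a sign step by that value, are assumptions chosen purely for illustration.

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def agreement_step(params: List[float], grad: List[float], momentum: List[float],
                   lr: float = 0.1, beta: float = 0.9
                   ) -> Tuple[List[float], List[float], float]:
    """One hypothetical agreement-conditioned update.

    w near 1 (gradient agrees with momentum) favours the momentum step;
    w near 0 (disagreement) favours a bounded sign-based step.
    """
    w = 0.5 * (cosine(grad, momentum) + 1.0)  # map [-1, 1] to [0, 1]
    new_m = [beta * m + (1 - beta) * g for m, g in zip(momentum, grad)]
    update = [w * m + (1 - w) * sign(g) for m, g in zip(new_m, grad)]
    new_params = [p - lr * u for p, u in zip(params, update)]
    return new_params, new_m, w


params, mom, w = agreement_step([0.0, 0.0], [2.0, -2.0], [0.0, 0.0])
print(params, mom, w)
```

The point of the sketch is only that a single scalar computed from recent gradients can steer which update family dominates at each step.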

2605.24564 2026-05-26 cs.AI cs.CE cs.LG 版本更新

Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

召唤神谕以屠之:利用大语言模型缓解金融回测中的前瞻偏差

Weixian Waylon Li, Mengyu Wang, Tiejun Ma

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 提出FinCAD方法,通过对抗性偏差发现和实体日期自适应规则,在不重新训练的情况下抑制大语言模型对历史结果的记忆,从而缓解金融回测中的参数化前瞻偏差。

AI中文摘要

在历史金融数据上回测大语言模型(LLMs)是不可靠的,因为预训练在事件发生后截断。一个在2024年训练的LLM已经“知道”2018-2020年股票的走势。我们将这种失败命名为参数化前瞻偏差,并提出FinCAD,一种上下文感知解码的推理时适配方法,无需重新训练即可抑制LLM对历史结果的记忆。FinCAD结合了一个对抗性偏差发现流程,该流程学习一个模型特定的记忆激活先验提示,以及一个实体和日期自适应规则,该规则将CAD强度按(实体,日期)记忆程度缩放,使得惩罚在记忆的样本内日期触发,并在样本外衰减至零。在五个7-14B LLM和五只大盘股上,FinCAD在记忆日期上将样本内回测收益削减高达-67.1%,同时将2025年样本外收益保持在$8K以内,夏普比率在基线的0.10以内,并保持通用推理能力在1.7分以内。在十一个模型的排行榜上,它将样本内/样本外Spearman相关性从+0.779提升至+0.846,恢复了真正预测样本外表现的排名。

英文摘要

Backtesting large language models (LLMs) on historical financial data is unreliable because pre-training cuts off after the events happened. An LLM trained in 2024 already "knows" which way 2018-2020 stocks moved. We name this failure parametric look-ahead bias and propose FinCAD, an inference-time adaptation of Context-Aware Decoding that suppresses an LLM's memory of historical outcomes without retraining. FinCAD pairs an adversarial bias-discovery pipeline that learns a model-specific memory-activating prior prompt with an entity- and date-adaptive rule that scales the CAD strength to per-(entity, date) memorisation, so the penalty fires on memorised in-sample dates and decays to zero out-of-sample. Across five 7-14B LLMs and five mega-cap equities, FinCAD cuts in-sample backtest returns by up to 67.1% on memorised dates while leaving 2025 out-of-sample returns within $8K and Sharpe within 0.10 of baseline, and preserves general-purpose reasoning within 1.7 pts. On an eleven-model leaderboard, it raises the in-sample / out-of-sample Spearman correlation from +0.779 to +0.846, recovering rankings that genuinely predict out-of-sample performance.
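The contrastive adjustment behind Context-Aware Decoding, plus a strength that scales with memorisation, can be sketched as below. The logit combination `(1 + alpha) * context - alpha * prior` is the standard CAD form; treating the prior branch as the output under the memory-activating prompt, and scaling `alpha` linearly by a memorisation score in [0, 1], are assumptions about how FinCAD uses it, and all names are hypothetical.

```python
from typing import List


def cad_logits(context_logits: List[float], prior_logits: List[float],
               alpha: float) -> List[float]:
    """Contrast the context-conditioned logits against the prior branch."""
    return [(1 + alpha) * c - alpha * p
            for c, p in zip(context_logits, prior_logits)]


def adaptive_alpha(memorisation: float, alpha_max: float = 1.0) -> float:
    """Hypothetical rule: strength grows with memorisation, zero when absent."""
    clipped = min(1.0, max(0.0, memorisation))
    return alpha_max * clipped


# Token scores for ("up", "down"); the prior branch strongly "remembers" up.
context = [1.0, 2.0]
prior = [3.0, 0.5]
print(cad_logits(context, prior, adaptive_alpha(1.0)))  # memorised in-sample date
print(cad_logits(context, prior, adaptive_alpha(0.0)))  # out-of-sample date
```

With zero memorisation the logits pass through unchanged, matching the abstract's statement that the penalty decays to zero out-of-sample.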

2605.24562 2026-05-26 cs.CV cs.AI 版本更新

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

PEDESTRIANQA: 面向行人意图与轨迹预测的视觉-语言模型基准

Naman Mishra, Shankar Gangisetty, C. V. Jawahar

发表机构 * CVIT, IIIT-Hyderabad, India(印度海得拉巴国际信息技术学院视觉信息技术中心)

AI总结 提出大规模视频数据集PedestrianQA,将行人意图和轨迹预测转化为带结构化理由的问答任务,通过微调视觉-语言模型显著提升预测准确性与可解释性。

AI中文摘要

行人意图和轨迹预测对于自动驾驶系统的安全部署至关重要,直接影响复杂交通环境中的导航决策。近期大型视觉-语言模型的进展通过结合高容量视觉理解与灵活的自然语言推理,为这些任务提供了强大的新范式。本文中,我们引入PedestrianQA,这是一个大规模视频数据集,将行人意图和轨迹预测公式化为带有结构化理由的问答任务。PedestrianQA以自然语言表达丰富标注的行人序列,使视觉-语言模型能够从视觉动态、上下文线索和交通智能体间的交互中学习,同时生成其预测的简洁解释,无需为每个任务定制专门的架构。在PIE、JAAD、TITAN和IDD-PeD上的实证评估表明,在PedestrianQA上微调最先进的视觉-语言模型显著提高了意图分类、轨迹预测准确性以及解释性理由的质量,展示了视觉-语言模型作为安全关键行人行为建模的统一且可解释框架的强大潜力。

英文摘要

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions without needing specialized architectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

2605.24550 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

越狱以保护:通过临时越狱进行缓冲和强化以实现大型语言模型的安全微调

Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim

发表机构 * School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院电子工程学院)

AI总结 针对微调即服务中安全对齐被有害微调攻击削弱的问题,提出一种基于梯度分析的缓冲与强化框架,通过临时越狱适配器减少有害更新并利用QR分解合并强化安全,实现无需额外安全数据的高效防御。

Comments ICML 2026 Spotlight

AI中文摘要

微调即服务(FaaS)使得大型语言模型(LLMs)的个性化成为可能,但它在有害微调攻击下会削弱安全对齐。最近的研究表明,在微调期间激活有害行为模块可以防止模型学习不良行为,但其机制尚不清楚。在本文中,我们重新审视临时越狱作为对抗有害微调的一种防御手段,并提供了梯度层面的分析,表明它能够饱和安全退化梯度,同时保留良性任务相关梯度。基于这一见解,我们提出了一种缓冲与强化微调框架,该框架在用户微调期间缓冲有害更新,并在适应后强化安全。具体来说,BufferLoRA作为一个可移除的适配器,在用户微调期间诱导临时越狱以减少有害更新。适应后,通过基于QR分解的合并,将经过训练的ReinforceLoRA(用于在临时越狱状态下恢复拒绝行为)与UserLoRA集成,以在保持用户任务性能的同时强化安全。大量实验表明,我们的框架在用户微调期间无需额外安全数据且计算成本极低的情况下,实现了卓越的安全性和实用性。

英文摘要

Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.

2605.24549 2026-05-26 cs.AI 版本更新

PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

PALoRA: 投影自适应LoRA以保持大型语言模型的推理能力

Mustafa Hayri Bilgin, Mariam Barry, Albert Bifet, Azzedine Idir Ait Said, Soumya Banerjee

发表机构 * IT Group, Research, BNP Paribas(法国巴黎银行IT集团研究部) LTCI, Télécom Paris, Institut Polytechnique de Paris(巴黎综合理工学院Télécom Paris LTCI实验室) AI Institute, University of Waikato(怀卡托大学人工智能研究所) Independent Researcher(独立研究员)

AI总结 提出PALoRA框架,通过奇异值微调(SVF)识别推理关键成分,并在正交约束下使用LoRA注入知识,以在保持推理能力的同时高效更新事实知识。

AI中文摘要

高效地用新的或演化的事实知识更新大型语言模型(LLMs)仍然是一个核心挑战,因为即使是参数高效的适应也可能侵蚀先前获得的推理能力。这种紧张关系反映了可塑性-稳定性困境:模型必须吸收新知识,同时保留技能关键的表征。在这项工作中,我们通过多层感知器权重矩阵的谱结构研究这种权衡。我们在理论和实验上都表明,推理所需的信息不仅局限于主导奇异方向,而是分布在奇异谱上。受此观察启发,我们引入了PALoRA,一个用于减少干扰的知识注入的两阶段框架。PALoRA首先在推理数据集上训练一个奇异值微调(SVF)专家,并使用其学习的奇异缩放向量作为冻结的几何探针,以识别对目标技能关键的成分。然后,它在结构正交约束下使用低秩适应(LoRA)执行事实知识注入,确保更新避免已识别的技能相关子空间。在Llama 3.1 8B和Mistral 7B上,以及在数学、编码和科学推理基准测试中,PALoRA平均保留了SVF专家95%的推理性能,同时保持了竞争性的事实召回。与先前的谱参数高效微调(PEFT)方法相比,它持续提高了技能保留,同时增加了不到0.006%的参数开销。

英文摘要

Efficiently updating Large Language Models (LLMs) with new or evolving factual knowledge remains a central challenge, as even parameter-efficient adaptation can erode previously acquired reasoning abilities. This tension reflects a plasticity-stability dilemma: models must incorporate new knowledge while preserving skill-critical representations. In this work, we study this trade-off through the spectral structure of multilayer perceptron weight matrices. We show, both theoretically and empirically, that information essential for reasoning is not localized only in dominant singular directions, but is instead distributed across the singular spectrum. Motivated by this observation, we introduce PALoRA, a two-stage framework for knowledge injection with reduced interference. PALoRA first trains a Singular Value Fine-Tuning (SVF) expert on a reasoning dataset and uses its learned singular scaling vector as a frozen geometric probe to identify components that are critical for the target skill. It then performs factual knowledge injection with Low-Rank Adaptation (LoRA) under a structural orthogonality constraint, ensuring that updates avoid the identified skill-relevant subspace. Across Llama 3.1 8B and Mistral 7B, and across mathematical, coding, and scientific reasoning benchmarks, PALoRA preserves on average 95% of the SVF expert's reasoning performance while maintaining competitive factual recall. It consistently improves skill retention over prior spectral Parameter-Efficient Fine-Tuning (PEFT) methods while adding less than 0.006% parameter overhead.
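The geometric intuition behind keeping updates out of a skill-relevant subspace can be shown with a plain orthogonal projection. This is a generic sketch, not PALoRA's constraint: the paper imposes a structural condition on the LoRA update using singular directions flagged by the SVF probe, whereas the code below simply removes the components of a flat update vector along a given orthonormal basis, and all names are illustrative.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_out(update: List[float], basis: List[List[float]]) -> List[float]:
    """Remove the components of `update` lying in span(basis).

    `basis` is assumed to be orthonormal; each vector stands in for one
    protected (skill-critical) direction.
    """
    result = list(update)
    for u in basis:
        coeff = dot(update, u)  # coefficient along this protected direction
        result = [r - coeff * b for r, b in zip(result, u)]
    return result


protected = [[1.0, 0.0, 0.0]]   # pretend this direction carries the skill
raw_update = [3.0, 4.0, 5.0]    # pretend this is a knowledge-injection update
safe_update = project_out(raw_update, protected)
print(safe_update, dot(safe_update, protected[0]))
```

After projection the update has no component along the protected direction, so to first order it cannot change what the model computes there.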

2605.24546 2026-05-26 cs.AI cs.IR 版本更新

Beyond Control-Flow: Integrating the Resource Perspective into Multi-Collaborative Process Modeling from Text

超越控制流:将资源视角融入基于文本的多协作流程建模

Anton Antonov, Humam Kourani, Alessandro Berti, Gyunam Park

发表机构 * Fraunhofer Institute for Applied Information Technology(弗劳恩霍夫应用信息科技研究所) RWTH Aachen University(亚琛工业大学)

AI总结 提出一种资源感知的生成流程,从自然语言描述中自动生成包含组织(泳池)和角色(泳道)的BPMN 2.0协作图,同时保持控制流质量并增加少量运行时开销。

Comments Submitted to EDOC 2026, under review

详情
AI中文摘要

流程建模是业务流程管理(BPM)的一个子领域,专注于将流程工件转化为正式模型。传统上,这项任务需要大量的人工输入以及在BPM符号和特定业务上下文方面的领域专业知识。虽然大型语言模型(LLMs)现在可以自动化大部分手工工作,但当前的文本到模型方法主要关注控制流视角——排序活动,而不考虑流程的协作方面。在本文中,我们介绍了一种资源感知的生成流程,从自然语言描述中生成正式的BPMN 2.0协作图。我们不是仅仅提示LLM生成原始XML,而是描述了一种紧凑、可执行的中间语言,其中包含强制性的资源细节,定义了组织(泳池)和角色(泳道)。跨组织依赖关系通过标准形式符号——消息事件——来实现,而正交布局例程自动处理泳池和泳道内元素的空间排列。在十个业务流程和九个LLM上的实验表明,该方法在保持控制流质量的同时,实现了强大的资源发现,并且仅增加了边际运行时开销。这种方法将生成式建模推向更全面、多协作的业务运营表示。

English abstract

Process modeling is a sub-domain of Business Process Management (BPM) focused on the translation of process artifacts into formal models. This task traditionally requires extensive human input and domain expertise in both BPM notations and the specific business context. While Large Language Models (LLMs) can now automate much of this manual work, current text-to-model approaches focus predominantly on the control-flow perspective, ordering activities without considering the collaborative aspect of the processes. In this paper, we introduce a resource-aware generation pipeline that produces formal BPMN 2.0 collaboration diagrams from natural-language descriptions. Rather than solely prompting an LLM for raw XML, we describe a compact, executable intermediate language with mandatory resource details defining both the organization (pool) and the role (lane). Cross-organization dependencies are materialized using the standard formal notation for such interactions (message events), while an orthogonal layout routine automatically handles the spatial arrangement of elements within pools and lanes. Experiments on ten business processes with nine LLMs show strong resource discovery while preserving control-flow quality and adding only marginal runtime overhead. This approach moves generative modeling toward a more comprehensive, multi-collaborative representation of business operations.

2605.24545 2026-05-26 cs.LG cs.AI updated version

Rethinking Federated Unlearning via the Lens of Memorization

通过记忆视角重新思考联邦遗忘学习

Jiaheng Wei, Yanjun Zhang, He Zhang, Leo Yu Zhang, Chao Chen, Kok-Leong Ong, Jun Zhang, Yang Xiang

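Affiliations: Royal Melbourne Institute of Technology; Griffith University; Swinburne University of Technology

AI summary: Addresses ineffective unlearning and unfairness between clients caused by overlap between the data to be forgotten and the retained data in federated learning. Proposes Federated Memorization Pruning (FedMemPrune), built on Grouped Memorization Evaluation, which achieves efficient unlearning by resetting the redundant parameters responsible for memorization.

Comments: This paper has been accepted by SIGKDD 2026

AI Chinese abstract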

联邦学习越来越需要机器遗忘来遵守隐私法规。然而,现有的联邦遗忘方法可能忽略了遗忘数据与保留数据之间的重叠信息,导致遗忘无效和客户端之间的不公平。在这项工作中,我们通过记忆的视角重新审视联邦遗忘。我们认为,遗忘主要应移除归因于待遗忘数据的独特记忆信息,同时保留也得到剩余数据支持的重叠模式。具体地,我们提出了分组记忆评估,一种示例级度量,将记忆知识与重叠知识分离。基于该度量,我们引入了联邦记忆剪枝(FedMemPrune),一种基于剪枝的遗忘方法,重置负责记忆的冗余参数。大量实验表明,FedMemPrune 与基于重训练的遗忘基线紧密匹配,同时比现有联邦遗忘算法更有效地消除记忆,在保持保留知识效用的情况下实现了强大的遗忘性能。

English abstract

Federated learning (FL) increasingly needs machine unlearning to comply with privacy regulations. However, existing federated unlearning approaches may overlook the overlapping information between the unlearning and remaining data, leading to ineffective unlearning and unfairness between clients. In this work, we revisit federated unlearning through the lens of memorization. We argue that unlearning should mainly remove the unique memorized information attributable to the data to be forgotten, while preserving overlapping patterns that are also supported by the remaining data. Specifically, we propose Grouped Memorization Evaluation, an example-level metric that separates memorized knowledge from overlapping knowledge. Building on this metric, we introduce Federated Memorization Pruning (FedMemPrune), a pruning-based unlearning approach that resets redundant parameters responsible for memorization. Extensive experiments show that FedMemPrune closely matches retraining-based unlearning baselines while more effectively eliminating memorization than existing federated unlearning algorithms, yielding strong unlearning performance without sacrificing the utility of retained knowledge.

2605.24543 2026-05-26 cs.AI cs.SY eess.SY updated version

Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration

面向可持续电动汽车充电与二氧化碳减排的排放感知强化学习:在不同可再生能源渗透率下

Ninglin Ou, Mohammad A. Razzaque, Iftekher Islam Shovon, Shafkat Khan Siam, Shafiuzzaman K Khadem, Krishnendu Guha, Mayeen U Khandaker, Md. Noor-A-Rahim

Affiliations: nasc Research, School of Computer Science & IT, University College Cork, IE; School of Computing, Engineering Digital Technologies, Teesside University, UK; International Energy Research Centre, Tyndall National Institute, Cork, IE; Radiation Technologies Group, CCDCU, Faculty of Engineering Technology, Sunway University, Malaysia; Department of Physics, College of Science, Korea University, Republic of Korea

AI summary: Proposes an emission-aware reinforcement learning strategy based on the Soft Actor-Critic algorithm that optimizes EV charging schedules through a multi-objective reward function, achieving up to 87% carbon-emission reduction and a 52% renewable self-consumption rate on the EV2Gym platform.

Comments: Submitted to the Engineering Applications of Artificial Intelligence journal (Elsevier)

AI Chinese abstract

电动汽车(EV)的快速增长通过非协调充电导致的峰值负荷尖峰、电压不稳定和变压器过载给配电网络带来挑战。虽然模型预测控制(MPC)和标准强化学习(RL)方法已解决这些问题,但现有方法很少将实时碳强度或波动的可再生能源(RE)可用性作为主要调度目标,留下了巨大的脱碳潜力未实现。本文提出一种基于软演员-评论家(SAC)算法的排放感知RL策略,其多目标奖励函数惩罚碳排放、削减的现场可再生能源和未满足的用户需求。该智能体在EV2Gym平台上的统一基准框架中训练,结合了表后太阳能和风能曲线、时变的EirGrid碳强度数据以及25个电动汽车供电设备(EVSE)单元上真实的工作场所EV行为。比较了九种控制策略,包括启发式方法、排放感知MPC变体和所提出的RL智能体,在五种可再生能源渗透率场景(0%-50%)下各进行十次独立运行。RL智能体在50%风能渗透率下实现了低至23.96克二氧化碳每千瓦时的碳强度,相比未控制基线减排高达87%,并优于基于外部图表的配电网络(PDN)基准。在所有场景下,变压器过载保持在7千瓦时以下,而“尽可能快”(AFAP)启发式方法高达1093千瓦时;在风能和太阳能联合供应下,可再生能源自消纳率达到52%。将碳强度预测嵌入RL状态和奖励中,使充电与低排放时段对齐,同时保持电网合规性和用户满意度。

English abstract

The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinated charging. While Model Predictive Control (MPC) and standard Reinforcement Learning (RL) methods have addressed these issues, existing approaches rarely treat real-time carbon intensity or fluctuating renewable energy (RE) availability as primary scheduling objectives, leaving substantial decarbonisation potential unrealised. This paper proposes an emission-aware RL strategy based on the Soft Actor Critic (SAC) algorithm, with a multi-objective reward that penalises carbon emissions, curtailed on-site renewables, and unmet user demand. The agent is trained within a unified benchmarking framework on the EV2Gym platform, incorporating behind-the-meter solar and wind profiles, time-varying EirGrid carbon intensity data, and realistic workplace EV behaviour across 25 Electric Vehicle Supply Equipment (EVSE) units. Nine control strategies, including heuristics, emission-aware MPC variants, and the proposed RL agent, are compared under five renewable penetration scenarios (0%-50%) over ten independent runs each. The RL agent achieves a carbon intensity as low as 23.96 grams of carbon dioxide per kilowatt-hour under 50% wind penetration, representing up to 87% emission reduction versus the uncontrolled baseline, and outperforms the external graph-based Power Distribution Network (PDN) benchmark. Transformer overload remains below 7 kWh across scenarios, against up to 1093 kWh for the As Fast As Possible (AFAP) heuristic, and renewable self-consumption reaches 52% under combined wind and solar supply. Embedding carbon intensity forecasts into the RL state and reward aligns charging with low-emission periods while preserving grid compliance and user satisfaction.
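
The multi-objective reward described above can be sketched as a negative weighted sum of the three penalties. This is only an illustration: the function name, the weights, and the per-step units are assumptions, and the abstract does not state how the paper scales or combines the terms.

```python
def emission_aware_reward(grid_energy_kwh, carbon_intensity_g_per_kwh,
                          curtailed_re_kwh, unmet_demand_kwh,
                          w_carbon=1.0, w_curtail=1.0, w_unmet=1.0):
    """Negative weighted sum of three penalties for one time step.

    All weights are hypothetical placeholders, not values from the paper.
    """
    # Emissions from grid energy drawn at the current carbon intensity.
    emissions_kg = grid_energy_kwh * carbon_intensity_g_per_kwh / 1000.0
    penalty = (w_carbon * emissions_kg
               + w_curtail * curtailed_re_kwh
               + w_unmet * unmet_demand_kwh)
    return -penalty


# Same energy drawn in a low-carbon and a high-carbon period.
clean_step = emission_aware_reward(10.0, 50.0, 0.0, 0.0)
dirty_step = emission_aware_reward(10.0, 400.0, 0.0, 0.0)
```

With identical energy drawn, the low-carbon step earns the higher reward, which is the mechanism that would push charging toward low-emission periods.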

2605.24542 2026-05-26 cs.CR cs.AI cs.LG cs.MA cs.SE updated version

AI-Driven Adaptive Adversaries and the Erosion of Cryptographic Trust in Public Key Systems

AI驱动的自适应对手与公钥系统中密码学信任的侵蚀

Petar Radanliev

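Affiliations: Department of Computer Sciences, University of Oxford; The Alan Turing Institute; British Library

AI summary: Studies how AI-driven adaptive adversaries exploit implementation-level observability to erode the security of public key cryptography, and proposes a new security evaluation framework.

Journal ref: J Anal Sci Technol 17, 26 (2026)
AI Chinese abstract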

本文研究了在人工智能驱动的自适应对手优化下,公钥密码学(PKC)安全性的侵蚀问题。所解决的问题是以算法为中心的密码安全模型与操作攻击现实之间日益增长的错配,其中对手利用实现层面的可观测性,而不是破解密码原语。

English abstract

This paper examines the erosion of Public Key Cryptography (PKC) security under adaptive adversarial optimisation driven by artificial intelligence. The problem addressed is the growing mismatch between algorithm-centric cryptographic security models and operational attack realities, where adversaries exploit implementation-level observability rather than breaking cryptographic primitives.

2605.24541 2026-05-26 cs.LG cs.AI cs.CL cs.IR updated version

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

SemanticZip: 以LLM作为语义解压器的有损文本压缩的试点框架

Natalia Trukhina, Vadim Vashkelis

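Affiliations: Embedded Intelligence Lab (EMILAB)

AI summary: Proposes the SemanticZip framework, which uses LLMs to compress text into compact codes and decompress them into task-relevant semantics. Evaluated on six representations, including structured prose and JSON, structured prose has the highest recoverability (WAR = 0.956, 19.1% token gain), while CCL-Min offers the best balance (39.4% token gain, WAR = 0.874).

Comments: 13 pages, 1 figure, 2 tables. Pilot framework paper; code and supplementary artifacts available in ancillary files

AI Chinese abstract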

大型语言模型(LLM)系统的文本压缩通常被框架化为令牌删除、检索、摘要或精确重建。我们研究了一种更具攻击性但明确有损的设置:将文本压缩为紧凑代码,LLM可以将其扩展为任务相关的含义。我们将此设置称为SemanticZip。与无损压缩不同,SemanticZip不需要字节相同的重建;与普通摘要不同,它将基于模型的解压缩视为编解码器的一部分,并评估是否恢复了任务相关的语义承诺。 本文是一个试点框架,而非基准声明。我们形式化了LLM介导的解压缩,定义了受保护/有损数据包架构,并在五个作者构建的诊断案例上评估了六种表示体系:结构化散文、JSON、CCL-Core、CCL-Min、SemanticZip ASCII和SemanticZip emoji。一个独立的解码器LLM从每种压缩表示中重建类型化的语义原子,我们评估关键原子召回率、加权原子召回率、精确度和分词器增益。在该试点中,结构化散文具有最高的可恢复性,WAR=0.956,o200k_base令牌增益19.1%。CCL-Min是最强的平衡点,令牌增益39.4%,WAR=0.874。SemanticZip ASCII提供了最大的有用压缩,令牌增益46.5%,WAR=0.802,而表情符号密集的SemanticZip在压缩和恢复方面表现均较差。 主要贡献并非声称这些数字建立了通用前沿。相反,我们引入了一个可重复的实验接口,用于研究有损、LLM可解压的文本代码,以及一个设计原则:安全关键和精确的承诺应保持受保护,而可预测的低风险上下文可以进行语义压缩。

English abstract

Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.
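
To make the reported metrics concrete, here is a minimal sketch of how such scores could be computed. The abstract does not give formulas, so the definitions below are assumptions: Weighted Atom Recall as recovered weight over total weight, Critical Atom Recall as the share of critical atoms recovered, and token gain as the fraction of tokens saved. The atom names and weights are invented for illustration.

```python
def weighted_atom_recall(atoms, recovered_ids):
    """atoms: list of (atom_id, weight, is_critical) tuples."""
    total = sum(w for _, w, _ in atoms)
    hit = sum(w for a, w, _ in atoms if a in recovered_ids)
    return hit / total


def critical_atom_recall(atoms, recovered_ids):
    """Share of critical atoms that the decoder recovered."""
    critical = [a for a, _, c in atoms if c]
    return sum(a in recovered_ids for a in critical) / len(critical)


def token_gain(original_tokens, compressed_tokens):
    """Fraction of tokens saved by the compressed representation."""
    return 1.0 - compressed_tokens / original_tokens


# Hypothetical diagnostic case: two critical atoms, one low-risk atom.
atoms = [("dose_limit", 3.0, True), ("deadline", 2.0, True), ("greeting", 1.0, False)]
recovered = {"dose_limit", "deadline"}

war = weighted_atom_recall(atoms, recovered)
car = critical_atom_recall(atoms, recovered)
gain = token_gain(200, 120)
```

In this toy case every critical atom is recovered while a low-weight atom is lost, which mirrors the stated design principle of protecting exact commitments and compressing low-risk context.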

2605.24539 2026-05-26 cs.AI updated version

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

DemoEvolve:利用演示克服智能体框架演化中的稀疏反馈

Lirong Che, Yuzhe Yang, Peiwen Lin, Chuang Wang, Xueqian Wang, Jian Su

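Affiliations: Tsinghua University; AgiBot

AI summary: Proposes DemoEvolve, which uses human demonstrations to guide harness evolution, addressing the fragility of self-generated rollouts under sparse feedback and high variance in long-horizon stochastic environments. Its effectiveness is validated on Liar's Dice and Balatro.

AI Chinese abstract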

智能体框架演化通过修改冻结语言模型周围的可执行结构来改进它们。我们将这一范式研究为一种样本高效的快速适应形式:智能体无需更新模型权重,而是通过改变其外部框架来获取任务特定能力,同时保留基础模型的通用能力。先前的工作表明,自生成轨迹可以支持框架搜索,暗示智能体可以通过练习获得新的任务能力。然而,在长时域随机环境中,自我练习变得脆弱:奖励稀疏,结果方差高,且失败难以归因于具体的框架机制。我们引入了DemoEvolve,一种基于演示引导的框架演化方法。当仅依赖奖励的搜索过于宽泛和嘈杂时,胜任的人类轨迹作为编码提议者的专家参考经验,指导框架级别的诊断和编辑。在Liar's Dice上的实验表明,当回合短且失败可归因时,自轨迹演化可以工作。相比之下,Balatro暴露了更困难的长时域随机场景,其中自轨迹演化被稀疏反馈和候选选择噪声误导,而仅靠教程式文本知识无法带来稳定的改进。在相同的有限预算下,DemoEvolve产生了更有效和可审计的框架编辑,并实现了更好的性能。总体而言,演示使稀疏反馈的框架演化更具可诊断性、可定位性和稳定性。

English abstract

Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

2605.24538 2026-05-26 cs.CY cs.AI cs.MA updated version

Is Decentralized AI Governable? From Regulative Policy to Constitutive Protocol

去中心化AI是否可治理?从规制政策到构成性协议

Botao Amber Hu, Helena Rong

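Affiliations: University of Oxford; New York University Shanghai

AI summary: Analyzes the six-layer stack of decentralized AI, shows how it produces a governance vacuum (an accountability gap and an incapacitation gap), and argues for shifting from policy-based regulative governance to protocol-based constitutive governance, subject to four ethical conditions: legitimacy, contestability, transparency, and non-domination.

Comments: Submitted for Ethics and Information Technology

AI Chinese abstract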

每个主要的AI治理框架都预设了一个可识别的实体——开发者、部署者或操作者——该实体可以被追究责任并被强制遵守。去中心化AI(DeAI)瓦解了这一预设。我们将DeAI分析为一个六层去中心化堆栈——模型、训练、计算、驾驭、身份和所有权——并展示各层部分去中心化如何叠加成我们所谓的“治理真空”:一种AI系统足够重要以至于需要治理,但缺乏现有框架在其目标中所预设属性的状态。这种真空有两种分析上不同的形式:一是“责任缺口”,即无法识别出可问责的主体;二是“无力化缺口”,即使识别出主体也无法改变正在运行的系统。我们证明这些失败不仅是管辖权上的,而且通过规范性地址——向一个理解并响应的主体传达规则——挫败了治理的所有预设。借鉴Lessig的规制模式和Searle关于规制性规则与构成性规则的区分,我们主张将治理的焦点从政策转向协议,从规范性地址转向架构约束。基于协议的构成性治理并不针对系统内运作的主体,而是塑造决定系统内何种行动成为可能的基质。我们确定了这种治理必须满足的四个伦理条件——合法性、可争议性、透明性和非支配性——以避免退化为不负责任的专家统治权力,并认为在去中心化世界中治理AI的核心政治挑战是重建对架构选择的民主授权形式,这些选择在常规政策链条断裂后依然存在。

English abstract

Every major framework for governing artificial intelligence presupposes an identifiable entity -- a developer, deployer, or operator -- who can be held responsible and compelled to comply. Decentralized AI (DeAI) dissolves this presupposition. We analyze DeAI as a six-layer decentralizing stack -- model, training, compute, harness, identity, and ownership -- and show how partial decentralization across layers compounds into what we call the "governance vacuum": a condition in which AI systems are consequential enough to require governance but lack the properties that existing frameworks presuppose in their targets. This vacuum takes two analytically distinct forms: an "accountability gap", where no addressable principal can be identified, and an "incapacitation gap", where even an identified principal cannot alter the running system. We demonstrate that these failures are not merely jurisdictional but defeat every presupposition of governance through normative address -- the communication of rules to a comprehending, responsive agent. Drawing on Lessig's modalities of regulation and Searle's distinction between regulative and constitutive rules, we argue for a shift in the locus of governance from policy to protocol, from normative address to architectural constraint. Protocol-based constitutive governance does not address the agents operating within a system but shapes the substrate that determines what kinds of actions are possible within it. We identify four ethical conditions -- legitimacy, contestability, transparency, and non-domination -- that such governance must satisfy to avoid degenerating into unaccountable technocratic power, and we argue that the central political challenge of governing AI in a decentralized world is reconstructing forms of democratic authorization for architectural choices that persist after the ordinary chain of policy has broken down.

2605.22794 2026-05-26 cs.AI cs.LG updated version

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

MOSS:自主智能体系统中通过源代码级重写的自我进化

Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Huajiang Zheng, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

Affiliations: University of Science and Technology of China; Hong Kong Generative AI Research & Development Center; The Hong Kong University of Science and Technology; Hong Kong Baptist University

AI summary: Proposes MOSS, a system that lets autonomous agent systems evolve themselves through source-level rewriting, using automatically batched production-failure evidence and a deterministic multi-stage pipeline. On OpenClaw, it raises the mean grader score from 0.25 to 0.61 in a single cycle.

Comments: 12 pages, 3 figures, 2 tables. Preprint. Code: https://github.com/hkgai-official/Moss

AI Chinese abstract

自主智能体系统在部署后基本是静态的:它们不会从用户交互中学习,重复的失败会持续存在,直到下一次人工驱动的更新发布修复。自我进化的智能体应运而生,但所有进化都局限于文本可变的工件——技能文件、提示配置、记忆模式、工作流图——而智能体框架本身保持不变。由于路由、钩子排序、状态不变量和调度存在于代码中而非任何文本工件中,整个结构故障类别在文本层上是物理上不可达的。我们认为源代码级适应是一种本质上更通用的媒介:它是图灵完备的,是每个文本可变范围的严格超集,通过确定性方式生效而非基础模型合规性,并且不会在长上下文漂移下退化。我们提出了MOSS,一个在生产智能体基板上执行源代码级自我重写的系统。每次进化都锚定在自动策划的生产故障证据批次上,并通过确定性的多阶段流水线进行;代码修改委托给可插拔的外部编码智能体CLI,而MOSS保留阶段顺序和判定。候选者通过在临时试验工作器中重放批次来验证,然后通过用户同意门控的就地容器交换和健康探针门控的回滚进行推广。在OpenClaw上,MOSS在单周期内无需人工干预将四个任务的平均评分从0.25提升至0.61。

English abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC updated version

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

AnyMo:野外人体运动的几何感知与设置无关建模

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

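Affiliations: The University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary: Proposes the AnyMo framework, which combines physics-based simulation of diverse IMU signals, graph-encoder pre-training, and LLM alignment to enable zero-shot activity recognition across devices and datasets, cross-modal retrieval, and motion captioning, with substantial performance gains.

AI Chinese abstract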

随着可穿戴和移动设备日益融入日常生活,它们为持续感知野外人体运动提供了实用途径。但惯性信号高度依赖于传感设置,包括身体位置、安装方向、传感器朝向、设备硬件和采样协议。这种设置依赖性使得学习跨设备和数据集迁移的运动表示变得困难,并限制了可穿戴IMU在封闭集识别之外的广泛应用。我们提出AnyMo,一个用于设置无关人体运动建模的几何感知框架。AnyMo利用基于物理的IMU模拟在密集体表位置上生成多样且合理的合成信号,从配对的合成放置视图和掩蔽部分观测中预训练图编码器,将多位置IMU标记化为全身运动令牌,并将这些令牌与LLM对齐以进行运动-语言理解。我们在三个互补任务上评估AnyMo:跨14个未见下游数据集的零样本活动识别、跨模态检索和可穿戴IMU运动描述,其中在HAR上平均Accuracy/F1/R@2提升11.7%/11.6%/22.6%,零样本IMU到文本和文本到IMU检索MRR分别提升15.9%和28.6%,零样本描述BERT-F1提升18.8%。这些结果支持AnyMo作为野外可穿戴运动理解的通才模型。项目页面:https://baiyuchen.com/project/AnyMo。

English abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.22634 2026-05-26 cs.SE cs.AI updated version

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

合同技能:面向企业AI代理的GovernSpec设计框架

Ting Liu

Affiliations: SymbolicLight Research

AI summary: Proposes a GovernSpec-based contractual-skill design framework that organizes SKILL.md files as readable task contracts, making task intent, boundaries, and acceptance criteria explicit. Experiments indicate that the framework improves generation quality and lowers the critical-error rate.

Comments: 15 pages, 5 figures, 4 tables. v2 adds a public-skill A/B study, updates experimental results, and adds a public replication package link: https://github.com/SymbolicLight-AGI/contractual-skill

AI Chinese abstract

技能已成为代理指令、工作流、脚本和参考材料的实用封装机制。然而,在企业环境中,技能通常需要表达比任务指导更多的内容:目标、输入边界、权限、人工审批点、证据要求、输出合同、质量标准、验证步骤和交接规则。本文提出合同技能,一种受GovernSpec启发的设计框架,用于将SKILL.md文件组织为可读的任务合同,同时保持轻量级技能发现和渐进加载。该框架明确了合同技能、GovernSpec YAML合同、模型上下文协议(MCP)接口、工具适配器、运行时护栏、追踪和评估系统之间的界限。我们通过三个离线实证研究评估该框架。第一个文本生成实验涵盖三个企业技能、十五个合成任务、四种指令条件和八个生成模型,产生960个输出和1680个交叉评判分数记录。第二个研究是公共技能A/B扩展:将八个公共技能与合同重写在四十八个合成任务、六个生成模型、两次重复、1152个输出和两个完整评判文件上进行比较。在此设置中,合同技能将平均质量从4.692提高到4.914,并将关键错误率从0.083降低到0.013。第三个研究是离线工具调用挑战,涉及八个模型和192个模拟工具调用记录。结果表明,合同技能最好被理解为一种治理层,使任务意图、边界和验收标准显式化,而不是独立的安全机制。

English abstract

Skills have become a practical packaging mechanism for agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, a skill often needs to express more than task guidance: goals, input boundaries, permissions, human approval points, evidence requirements, output contracts, quality criteria, verification steps, and handoff rules. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing SKILL.md files as readable task contracts while preserving lightweight skill discovery and progressive loading. The framework clarifies the boundary between contractual skills, GovernSpec YAML contracts, Model Context Protocol (MCP) surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. We evaluate the framework with three offline empirical studies. The first text-generation experiment covers three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight generation models, producing 960 outputs and 1680 cross-judge score records. The second study is a public-skill A/B expansion: eight public skills are compared with contractual rewrites across forty-eight synthetic tasks, six generation models, two repeats, 1152 outputs, and two complete judge files. In this setting, contractual skills raise mean quality from 4.692 to 4.914 and reduce critical-error rate from 0.083 to 0.013. The third study is an offline tool-calling challenge with eight models and 192 simulated tool-call records. The results suggest that contractual skills are best understood as a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism.
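
The counts reported for the public-skill A/B study can be cross-checked with a few lines of arithmetic. The factor of two arms (public skill versus contractual rewrite) is an assumption about how the A/B design was counted; with it, the product matches the stated 1152 outputs.

```python
# Figures taken from the abstract of the public-skill A/B study.
tasks, models, repeats = 48, 6, 2
arms = 2  # assumption: one arm per skill variant (public vs. contractual)
outputs = tasks * models * repeats * arms

# Effect sizes implied by the reported means and rates.
quality_gain = 4.914 - 4.692
relative_error_reduction = (0.083 - 0.013) / 0.083
```

Under that reading, the mean quality gain is about 0.22 points and the critical-error rate falls by roughly 84% in relative terms.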

2605.22337 2026-05-26 cs.AI updated version

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Meta-Soft: 利用可组合元标记实现上下文保持的KV缓存压缩

Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu

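Affiliations: Guangdong Institute of Intelligence Science and Technology; University of Macau; Chinese University of Hong Kong, Shenzhen; Hong Kong University of Science and Technology

AI summary: Proposes Meta-Soft, a dynamic compression framework that synthesizes meta-tokens using a learnable orthogonal basis matrix and a Gumbel-Softmax selector network, combined with an attention-flow integration mechanism that retains information from discarded context, addressing information loss and context breaks in KV cache compression.

Comments: 9 pages, 2 figures

AI Chinese abstract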

大型语言模型中使用的KV缓存具有线性增长的时间复杂度,因此当处理长上下文时,LLMs面临内存爆炸和解码效率降低的问题。当前的KV缓存驱逐已成为重要的研究方向;然而,基于固定软标记(例如Judge Q)的现有方法依赖静态参数集作为查询来评估KV对的重要性,因此无法动态适应不同的输入提示,也无法精确捕捉复杂且变化的任务相关性。此外,被驱逐的KV对被永久丢弃,导致不可逆的信息丢失和上下文断裂。为了解决这个问题,我们提出了Meta-Soft,一种基于探针驱动上下文整合的动态压缩框架。具体来说,我们构建了一个带有可学习正交基矩阵$\mathcal{L}$的元库,并使用带有Gumbel-Softmax的选择器网络生成可微分的稀疏组合权重,从而从输入提示特征中动态合成最具针对性的$k$个软标记。我们将这些软标记附加到输入序列的末尾以探针关键信息。我们还引入了一种基于注意力流的整合机制,该机制将移除标记的语义信息重新分配到保留标记中,从而有效保持被丢弃的上下文信息。在多个数据集上的实验表明,我们的方法优于现有的最先进驱逐方法,并为KV缓存压缩提供了新的解决方案。

English abstract

The KV cache used in large language models grows linearly with context length, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. KV cache eviction has become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) rely on a static parameter set as the query to evaluate the importance of KV pairs, so they can neither adapt dynamically to different input prompts nor precisely capture complex and changing task relevance. Moreover, evicted KV pairs are discarded permanently, which causes irreversible information loss and context breaks. To address these problems, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$ and use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, dynamically synthesizing the $k$ most targeted Soft Tokens from the input prompt features. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow-based integration mechanism that redistributes the semantic information of removed tokens into retained tokens, preserving the dropped context information. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV cache compression.
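
The Gumbel-Softmax selection step can be illustrated in a few lines. This is a generic sketch of the standard relaxation, not Meta-Soft's code: the logits are hard-coded stand-ins for the output of the selector network, `basis` is a hypothetical three-row meta-library with mutually orthogonal rows, and the temperature is arbitrary.

```python
import math
import random


def gumbel_softmax(logits, tau=0.5, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)

# Hypothetical meta-library: rows are mutually orthogonal basis vectors.
basis = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
soft_token = [sum(w * row[j] for w, row in zip(weights, basis)) for j in range(3)]
```

Lower temperatures push `weights` toward a one-hot choice of a single basis row, while higher temperatures blend several rows; in a trained system the weights would stay differentiable with respect to the logits.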

2605.21740 2026-05-26 cs.AI updated version

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

SMDD-Bench: 大语言模型能否解决真实世界的小分子药物设计任务?

Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani

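Affiliations: Carnegie Mellon University; Stealth Pennsylvania State University

AI summary: Proposes the SMDD-Bench benchmark, which evaluates LLMs on real-world small molecule drug design through 502 multi-turn, long-horizon task instances, and finds that the best model, GPT5.4, solves only 40.2% of tasks.

AI Chinese abstract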

LLM智能体在科学发现应用中具有巨大潜力。然而,LLM智能体在跨不同化学空间和靶标的真实世界小分子药物设计(SMDD)任务上的表现尚不明确。当前的评估方法要么是临时的,对于真实发现过于简单,规模有限,或局限于单轮问答。为了标准化LLM智能体在小分子设计上的评估,我们引入了SMDD-Bench,一个具有挑战性的多轮长时智能体基准,包含502个保证可解的任务实例,涵盖5种任务类型:2D药效团识别、相互作用点发现、骨架跃迁、先导化合物优化和片段组装。SMDD-Bench任务覆盖广泛的化学空间,涉及102个独特的蛋白质靶标。完全解决该基准需要具备强大的化学和生物学推理能力及3D直觉,理解专业工具的使用,并在有限的oracle调用次数内展示规划专业知识。我们对7个前沿的开源和闭源LLM进行了基准测试,发现性能最好的LLM GPT5.4仅解决了40.2%的任务。我们希望SMDD-Bench能提供一个标准化的测试平台,激励该领域训练和评估用于全自动计算药物设计的LLM智能体。我们在smddbench.com上托管了一个公共排行榜。

English abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require strong chemical and biological reasoning and 3D intuition, an understanding of specialized tool use, and planning expertise over a limited number of oracle calls. We benchmark 7 frontier open- and closed-source LLMs and find that even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field toward training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.

2605.21652 2026-05-26 cs.CV cs.AI updated version

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Look-Closer-Then-Diagnose: 通过主动缩放实现置信度感知的超声VQA

Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Nassir Navab, Zhongliang Jiang

Affiliations: Computer Aided Medical Procedures (CAMP), TU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; Zhongshan Hospital, Fudan University, China; The University of Hong Kong, Hong Kong, China

AI summary: Proposes a framework that mimics the sonographer's cognitive workflow, using a Zoom-then-Diagnose paradigm and an uncertainty-aware reward within Group Relative Policy Optimization to improve lesion localization and diagnostic performance in ultrasound visual question answering.

AI Chinese abstract

视觉-语言模型(VLM)显著推进了医学视觉问答,但在超声领域性能仍不理想。临床实践中,超声医师在制定报告时会明确关注病灶区域,尽管诊断解释有时因固有的主观性而存在差异。然而,现有VLM并未明确设计为在诊断前交互式地放大病灶;此外,它们通常将标注视为无偏真值,未能考虑其固有的主观性和模糊性。在本文中,我们提出了一个专门考虑超声医师认知工作流的框架。我们首先引入了一个结构化的“缩放-诊断”范式,该范式复制了交互式搜索过程以实现病灶聚焦推理。此外,在组相对策略优化(GRPO)框架内,我们引入了一个基于随机组 rollout 的不确定性感知奖励,以估计预测一致性作为模型置信度的代理。这两个组件共同鼓励模型在清晰案例上强化准确预测,同时在模糊情况下保持谨慎。在肝脏、乳腺和甲状腺数据集上的实验表明,我们的框架将病灶定位提高了39.3%,证明我们的模型学会了主动靠近观察并诊断的能力。

English abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
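
The idea of using agreement across a group of stochastic rollouts as a confidence proxy can be sketched as follows. The consistency measure (share of rollouts matching the majority answer) and the reward shaping are assumptions made for illustration; the abstract does not give the paper's actual reward formula.

```python
from collections import Counter


def group_consistency(answers):
    """Fraction of rollouts that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def uncertainty_aware_rewards(answers, label):
    """Hypothetical shaping: scale each rollout's reward by group consistency.

    Correct answers earn +consistency and incorrect answers -consistency,
    so ambiguous groups produce weaker learning signals in both directions.
    """
    consistency = group_consistency(answers)
    return [consistency if a == label else -consistency for a in answers]


# Eight rollouts per case: one clear-cut, one evenly split.
clear_case = ["malignant"] * 7 + ["benign"]
ambiguous_case = ["malignant", "benign"] * 4

clear_rewards = uncertainty_aware_rewards(clear_case, "malignant")
ambiguous_rewards = uncertainty_aware_rewards(ambiguous_case, "malignant")
```

The clear case yields rewards of magnitude 0.875 and the split case 0.5, matching the stated intent of reinforcing confident correct predictions while staying cautious under ambiguity.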

2605.21417 2026-05-26 cs.CV cs.AI updated version

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

排序重要:面向混合情感识别的排名感知选择性融合

Junghyun Lee, Hyunseo Kim, Hanna Jang, Junhyug Noh

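Affiliations: Department of Artificial Intelligence and Software

AI summary: Proposes a rank-aware multi-encoder framework that uses an attention-based gating module to select the most informative encoders for fusion, decouples prediction into presence and salience heads, and incorporates unsupervised domain adaptation. The system placed 2nd in the blended emotion recognition challenge.

Comments: Accepted at IEEE FG 2026 Workshops. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

AI Chinese abstract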

混合情感识别具有挑战性,因为情感通常表现为微妙且重叠的多模态线索的混合,而非单一主导信号。我们提出了一种排名感知的多编码器框架,该框架选择性地结合来自不同预提取视频和音频编码器的互补表示。我们的方法将异构编码器特征投影到共享潜在空间,通过基于注意力的门控模块估计样本级编码器重要性,并仅融合前n个最具信息量的编码器。为了更好地建模混合情感,我们将预测解耦为存在性和显著性头,并通过概率级融合对齐它们。我们进一步引入了无需伪标签的特征级无监督域适应,以提高在分布偏移下的鲁棒性。在BlEmoRE挑战赛上的实验表明,所提出的框架优于强单个编码器和朴素的多编码器融合基线。我们的最终系统在比赛中排名第二,支持了排名感知选择性融合在细粒度混合情感识别中的有效性。

英文摘要

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
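
The selective fusion step described above (score each encoder per sample, keep the top-n, fuse only those) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the features are taken to be already projected into the shared latent space, the gate scores are given rather than produced by an attention module, and renormalizing the kept scores with a softmax is one plausible choice that the abstract does not confirm.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_fusion(features, gate_scores, n):
    """Fuse only the n encoders with the highest gate scores.

    features: one vector per encoder, all of the same length.
    gate_scores: one scalar importance score per encoder for this sample.
    Returns the fused vector and the indices of the encoders kept.
    """
    ranked = sorted(range(len(features)), key=lambda i: gate_scores[i], reverse=True)
    keep = ranked[:n]
    weights = softmax([gate_scores[i] for i in keep])
    dim = len(features[0])
    fused = [sum(w * features[i][d] for w, i in zip(weights, keep)) for d in range(dim)]
    return fused, keep


# Three hypothetical encoders; the third gets a low score and is dropped.
fused, kept = top_n_fusion(
    features=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    gate_scores=[2.0, 2.0, -1.0],
    n=2,
)
```

Because the weights are renormalized over the kept encoders only, a low-scoring encoder contributes nothing to the fused vector instead of a small but nonzero amount.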

2605.20278 2026-05-26 cs.LG cs.AI cs.CV 版本更新

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

ClaimDiff-RL: 通过视觉声明比较进行细粒度描述强化学习

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

发表机构 * The Chinese University of Hong Kong(香港中文大学) MiniMax

AI总结 提出ClaimDiff-RL框架,利用原子声明差异作为奖励单元,通过多模态判断器枚举视觉差异并分配错误类型和严重程度,以解决长描述强化学习中事实性与覆盖度的权衡问题。

详情
AI中文摘要

长格式图像描述揭示了强化学习中的奖励粒度问题:描述被整体判断,而重要错误发生在单个视觉声明层面。一个好的密集描述应既忠实又信息丰富,避免幻觉而不遗漏显著细节。然而,成对偏好、基于参考的指标和整体标量奖励将这些局部错误压缩为单个序列级信号,模糊了事实性与覆盖度之间的权衡。我们引入ClaimDiff-RL框架,该框架使用基于参考的原子声明差异作为描述强化学习的奖励单元。给定一张图像、一个演员描述和一个参考描述,多模态判断器枚举视觉上可区分的差异,针对图像验证每个差异,分配开放词汇的错误类型和严重程度,并生成每个差异的统计信息用于奖励组合。这使得幻觉声明和遗漏的显著事实可以分别测量和调整。实验表明,整体标量奖励可以通过增加遗漏事实来减少幻觉,而ClaimDiff-RL揭示了这种忠实性与覆盖度的权衡,并实现了更平衡的操作点。在包含160张图像的人工标注诊断基准、公开描述基准和VQA基准上,ClaimDiff-RL改善了幻觉-遗漏事实平衡,保留了通用能力,甚至在多个细粒度能力维度(如物体计数、空间关系和场景识别)上超越了Gemini-3-Pro-Preview。这些结果表明,类型化、可验证的声明差异是细粒度且可诊断的描述强化学习的有效奖励单元。

英文摘要

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

2605.20025 2026-05-26 cs.AI 版本更新

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

AutoResearchClaw: 基于人机协作的自我强化自主研究

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang, Caiming Xiong, James Zou, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Rutgers University(罗格斯大学) NEC Labs America(NEC美国实验室) Meta(Meta公司) Stanford University(斯坦福大学) Google(谷歌公司) University of Washington(华盛顿大学)

AI总结 提出AutoResearchClaw多智能体自主研究系统,通过结构化辩论、自愈执行、可验证报告、七种人机协作模式和跨运行进化机制,在ARC-Bench基准上比AI Scientist v2提升54.7%。

详情
AI中文摘要

自动化科学发现需要的不仅仅是根据想法生成论文。真正的研究是迭代的:假设从多个角度受到挑战,实验失败并为下一次尝试提供信息,经验在循环中积累。现有的自主研究系统通常将此过程建模为线性流水线:它们依赖单智能体推理,在执行失败时停止,并且不跨运行携带经验。我们提出AutoResearchClaw,一个基于五种机制的多智能体自主研究流水线:用于假设生成和结果分析的结构化多智能体辩论;带有Pivot/Refine决策循环的自愈执行器,将失败转化为信息;防止虚构数字和幻觉引用的可验证结果报告;具有七种干预模式的人机协作,涵盖从完全自主到逐步监督;以及将过去错误转化为未来保障的跨运行进化。在ARC-Bench(一个25个主题的实验阶段基准)上,AutoResearchClaw比AI Scientist v2高出54.7%。跨七种干预模式的人机协作消融实验表明,在高杠杆决策点上的精确、有针对性的协作始终优于完全自主和详尽的逐步监督。我们将AutoResearchClaw定位为一种研究放大器,增强而非取代人类的科学判断。代码可在https://github.com/aiming-lab/AutoResearchClaw获取。

英文摘要

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

2605.20023 2026-05-26 cs.AI cs.MA 版本更新

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

当技能无济于事:关于程序性知识在进攻性网络安全中工具型智能体的负面结果

Samuel Jacob Chacko, James Hugglestone, Chashi Mahiul Islam, Xiuwen Liu

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 本文通过重新分析一项控制实验,发现当环境反馈带宽高时,技能(Skills)对智能体性能的边际效益消失甚至产生负面影响,并提出了可证伪的假设。

Comments Accepted as a poster at ACM CAIS 2026 AgentSkills Workshop

详情
AI中文摘要

智能体技能(Agent Skills)是程序性知识的结构化包,在推理时加载到LLM智能体中,据报道在不同领域平均将任务通过率提高了16.2个百分点。然而,相同的基准测试显示出很大的方差,84个任务中有16个在引入技能后出现了负增量。社区尚未阐明技能何时有帮助以及何时只是冗余开销的清晰机制。我们重新分析了一项最近发表的180次运行的控制研究,该研究涉及一个基于MCP的自主夺旗(CTF)智能体,在四种文档条件(591、12865、17253和36001个token)下,并表明这些条件几乎完全对应于无技能、经验技能、策划技能和全面技能的消融。在进攻性网络安全(一个现有技能基准未深入覆盖的领域)中,技能的边际效益崩溃。无技能和全面技能条件之间的差距仅为8.9个百分点($p = 0.71$,$\chi^2$;$p = 0.25$,Cochran-Armitage趋势检验;六对Cohen's $h$值中有五对低于$0.2$的小效应阈值)。我们认为缺失的变量是环境反馈带宽。当智能体的工具层返回严格、模式验证、低延迟的观察时,环境本身提供了通常需要技能提供的程序性校正信号。因此,策划技能的边际效益显著降低,并且在某些情况下(例如,我们的时序侧信道设置)会主动降低性能。我们阐述了一个可证伪的假设,概述了其对复合AI系统的设计启示,并将发布重新分析管道以支持复制。

英文摘要

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (591, 12865, 17253, and 36001 tokens) and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp ($p = 0.71$, $\chi^2$; $p = 0.25$, Cochran-Armitage trend test; five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and in some cases (e.g., our timing side-channel setting) curated Skills actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
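
For readers unfamiliar with the effect-size measure cited above, Cohen's $h$ for two proportions is $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$, with $|h| < 0.2$ conventionally read as below the small-effect threshold. The sketch below computes it; the pass rates used are hypothetical placeholders chosen to be 8.9 pp apart, since the abstract does not report the per-condition rates.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical pass rates, 8.9 percentage points apart (not from the paper).
no_skills = 0.500
full_skills = 0.589

h = cohens_h(full_skills, no_skills)
is_below_small_effect = abs(h) < 0.2
```

Note that $h$ depends on where the two proportions sit, not only on their gap: the same 8.9 pp spread near 0 or 1 yields a larger $h$ than it does near 0.5.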

2605.19409 2026-05-26 cs.LG cs.AI 版本更新

Unlocking the Potential of Continual Model Merging: An ODE Perspective

解锁持续模型合并的潜力:ODE视角

Lihong Lin, Haidong Kang

发表机构 * Northeastern University, Shenyang, China(东北大学,沈阳,中国)

AI总结 提出ODE-M框架,将持续模型合并建模为参数空间中的轨迹,通过整流时变速度场和效用感知时间调度平衡历史知识与新任务,提升长任务流性能。

Comments 21 pages, 8 figures

详情
AI中文摘要

持续模型合并(CMM)通过顺序整合任务适配模型实现基础模型的快速定制,无需重复训练。然而,现有合并规则通常通过固定代数或基于投影的操作更新部署模型,对保留多少先前积累的知识相对于新任务模型的控制有限。这种限制导致长任务流中保留不稳定和性能下降,当任务具有异构效用时更为明显。我们提出ODE驱动的合并(ODE-M),一个可控框架,将每次持续合并视为参数空间中的轨迹而非一步端点更新。受模式连通性启发,ODE-M使用整流时变速度场构建屏障感知轨迹,其中来自小型校准集的轻量级一阶反馈抑制损失增加的运动,同时保持向新模型的进展。然后通过沿该轨迹选择操作点(通过效用感知时间调度)获得下一个合并模型,为平衡保留的历史知识和新任务专业知识提供显式机制。在标准CMM基准上的大量实验表明,ODE-M在CLIP ViT骨干、流长度和异构任务效用设置上持续优于强持续合并基线。

英文摘要

Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without repeated retraining. However, existing merging rules usually update the deployed model through fixed algebraic or projection-based operations, providing limited control over how much previously accumulated knowledge should be retained relative to the incoming task model. This limitation leads to unstable retention and performance degradation in long task streams, and becomes more pronounced when tasks have heterogeneous utilities. We propose ODE-driven Merging (ODE-M), a controllable framework that formulates each continual merge as a trajectory in parameter space rather than a one-step endpoint update. Motivated by mode connectivity, ODE-M constructs a barrier-aware trajectory using a rectified time-dependent velocity field, where lightweight first-order feedback from a small calibration set suppresses loss-increasing motion while preserving progress toward the incoming model. The next merged model is then obtained by selecting an operating point along this trajectory through a utility-aware time schedule, providing an explicit mechanism for balancing retained historical knowledge and incoming task expertise. Extensive experiments on standard CMM benchmarks show that ODE-M consistently improves over strong continual merging baselines across CLIP ViT backbones, stream lengths, and heterogeneous task-utility settings.

2605.18840 2026-05-26 cs.LG cs.AI cs.CL 版本更新

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

前沿模型的成长之痛:当排行榜不再区分以及接下来衡量什么

Adil Amin

发表机构 * Zehen Labs(泽亨实验室)

AI总结 本文通过分解SWE-bench和GPQA Diamond分数为种群耦合趋势和每版本残差(h场),诊断前沿模型能力之间的协作与权衡,并提供三步诊断法、每实验室测量优先级表及七个可证伪预测。

Comments 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." ( https://doi.org/10.48550/arXiv.2605.18838 ). Code: https://github.com/adilamin89/cape-scaling . Dashboard: https://zehenlabs.com/cape/

详情
AI中文摘要

排行榜在独立轴上对前沿模型进行排名,但并未揭示能力在版本间是相互增强还是权衡——而在前沿,这种相互作用是更具信息量的信号。我们将配对的SWE-bench和GPQA Diamond分数分解为种群耦合趋势和每版本残差(h场),该残差从两个公开基准分数诊断能力重点。在来自10个实验室的34个模型(2024-2026)中,能力相互协作(r = +0.72,p < 10^{-6}),但协作程度系统性地变化:每个实验室的耦合斜率跨度达5倍(谷歌1.15 vs. DeepSeek 0.23),且实验室发生转向——DeepSeek从推理密集型逆转为编码优先(Δh = 15.9个百分点);Anthropic在编码偏离和恢复之间振荡。种群回归作为等斜线相边界:用于识别基础尺度耦合转变的相同分类器√[(a/b)·B₁] [Amin, 2026] 对前沿模型进行分类,并已在下一个转变处检测到混合相行为(两个模型低于GPQA-IFEval等斜线)。h场不仅具有诊断性——它还告诉你需要改变什么。预训练建立耦合为0.871,而RLHF增加0.081 [Amin, 2026]:预训练级别的转变是永久的(DeepSeek的四个版本逆转持续存在),后训练转变是可逆的(Anthropic的三次编码偏离均在单个版本内恢复),仅推理计算在不重新训练的情况下将h改变+7.8个百分点。知道哪个组件占主导地位决定了是重新训练还是等待。我们提供了三步诊断法(定位、分类、预测)、每实验室测量优先级表以及七个带有时间戳标准的可证伪预测。五个截止日期后的版本落在95%预测区间内。代码、数据和交互式仪表盘:https://zehenlabs.com/cape/。

英文摘要

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024-2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot. DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA-IFEval isocline). The $h$-field is not just diagnostic: it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$ pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
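
The decomposition into a population trend plus a per-release residual can be illustrated with an ordinary least-squares line. This is a sketch under assumptions, not the paper's pipeline: which benchmark is regressed on which, and whether the $h$-field is exactly this residual, are not stated in the abstract, and the scores below are made-up numbers rather than real benchmark results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_field(xs, ys):
    """Per-release residual: observed score minus the population trend."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up paired scores for four releases (x: one benchmark, y: the other).
score_a = [1.0, 2.0, 3.0, 4.0]
score_b = [2.0, 4.0, 6.0, 9.0]

slope, intercept = fit_line(score_a, score_b)
h = residual_field(score_a, score_b)
```

A positive residual marks a release that scores higher on the second benchmark than the population trend predicts from the first; by construction the residuals of an OLS fit with an intercept sum to zero.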

2605.18657 2026-05-26 cs.LG cs.AI 版本更新

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

KairosHope: 一种基于双记忆架构的下一代时间序列基础模型,用于专门分类

Luis Balderas, José Alberto Rodríguez, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

发表机构 * Department of Computer Science and Artificial Intelligence(计算机科学与人工智能系) DiCITS, iMUDS, DaSCI(DiCITS、iMUDS、DaSCI) University of Granada(格拉纳达大学) Advanced Medical Imaging Group(先进医学成像组) Instituto de Investigación Biosanitaria de Granada (ibs.Granada)(格拉纳达生物医学研究机构(ibs.Granada)) Department of Software Engineering(软件工程系) Department of Rural Engineering(农村工程系) University of Córdoba(科尔多瓦大学)

AI总结 针对标准注意力计算瓶颈和经典统计知识缺失问题,提出KairosHope模型,通过双记忆系统(Titans模块和连续记忆系统CMS)替代二次注意力,并融合深度表示与统计特征的混合决策头,在UCR基准上实现优越分类性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)在通用预测任务中取得了显著成功;然而,它们对专门分类问题的适应仍然受到标准注意力的计算瓶颈和对经典统计知识的系统性忽略的限制。本技术报告介绍了KairosHope,一种下一代TSFM,旨在协调大规模泛化与分类任务中的分析精度。该提案的核心是HOPE块,一种用双记忆系统替代二次注意力的架构:用于动态短期保留的Titans模块和用于长期历史上下文抽象的连续记忆系统(CMS)。为了丰富归纳偏差,引入了混合决策头,它将深度潜在表示与通过tsfeatures包提取的确定性统计特征融合。KairosHope在大型Monash档案上进行自监督预训练,结合了掩码时间序列建模(MTSM)和对比学习(InfoNCE)。随后,通过严格的线性探测和全微调(LP-FT)协议在UCR基准数据集上进行适应,以防止灾难性遗忘。实验结果表明,在具有严格时间因果关系的领域(如HAR或传感器数据)中,性能优越。因此,KairosHope为基础模型适应时间序列分析建立了一个稳健高效的框架。

英文摘要

Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to specialized classification problems remains constrained by the computational bottleneck of standard attention and the systematic omission of classical statistical knowledge. This technical report introduces KairosHope, a next-generation TSFM designed to reconcile massive generalization with analytical precision in classification tasks. The core of the proposal is the HOPE block, an architecture that replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. To enrich the inductive bias, a Hybrid Decision Head is introduced, which fuses deep latent representations with deterministic statistical features extracted via the tsfeatures package. KairosHope undergoes self-supervised pre-training on the massive Monash archive, combining Masked Time Series Modeling (MTSM) and contrastive learning (InfoNCE). Its subsequent adaptation to the UCR benchmark datasets is conducted through a rigorous Linear Probing and Full Fine-Tuning (LP-FT) protocol to prevent catastrophic forgetting. Empirical results demonstrate superior performance in domains characterized by strict temporal causality, such as HAR or sensor data. Consequently, KairosHope establishes a robust and efficient framework for the adaptation of foundation models to time series analysis.

2605.17268 2026-05-26 cs.AI cs.CV cs.RO 版本更新

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实?自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) Central South University(中南大学) School of Computer Science(计算机科学学院) University of Wollongong in Dubai(伍伦贡大学迪拜分校)

AI总结 通过分析300次VLA推理,发现输出推理与轨迹的忠实度仅42.5%,存在大量漏检行人、轨迹脆弱及推理-动作不一致问题,并提出了信息论忠实度形式化定义与安全架构。

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

详情
AI中文摘要

我们首次系统研究了视觉-语言-动作(VLA)驾驶模型的忠实度,分析了100个多样化PhysicalAI-AV场景中300次Alpamayo-R1-10B推理。主要发现是,输出带有轨迹的自然语言推理可能显著不忠实:(i) 整体推理保真度仅为42.5%,因果链与场景现实匹配不到一半;(ii) 在三分之一涉及行人的场景中漏检了94个行人;(iii) 在轻微视觉扰动下轨迹脆弱性达97.7%;(iv) 平均推理-动作一致性仅为48.3%,53.3%的推理表现出一致性低,其中37.9%声称停止但模型继续前行。我们从信息论角度形式化定义了忠实度,定义了实体和动作保真度及验证标准,并概述了与这些结果一致的四组件安全架构。

英文摘要

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that the natural-language rationales these models output alongside trajectories can be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 pedestrians are missed across one-third of pedestrian-relevant scenes; (iii) trajectory fragility reaches 97.7% under mild visual perturbations; and (iv) mean reasoning-action consistency is only 48.3%, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.

2605.16591 2026-05-26 cs.LG cs.AI 版本更新

How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning

少样本示例如何累加:上下文学习中函数向量的因果分解

Entang Wang, Yiwei Wang, Aleksandra Bakalova, Michael Hahn

AI总结 本文通过因果分解揭示少样本提示中函数向量由示例级子向量线性组合而成,并发现模型通过注意力重加权机制根据上下文调整示例贡献。

Comments Accepted at ICML 2026. 70 pages, 65 figures

详情
AI中文摘要

上下文学习(ICL)擅长从极少量示例中学习新任务,但我们仍缺乏对少样本提示如何塑造模型函数向量(FV)——一种驱动ICL查询任务行为的因果激活方向——的机制性解释。跨任务和模型,一个$n$样本FV可以通过示例级子FV的线性组合很好地近似,表明来自单个演示的贡献具有加性和可组合性。除了加性之外,我们展示了模型基于先前示例对单个示例的表示进行上下文化,以自适应地重新加权哪些演示主导FV:注意力转向在上下文中信息量更大、歧义更少的示例。最后,因果分解将查询-键路由与值更新分离,发现上下文化对FV质量最一致的贡献来自查询-键对齐——尤其是在歧义设置中——而值介导的效应则更加异质。综合起来,这些结果将加性叠加与上下文相关的注意力重加权统一为一个机制性的、可检验的说明,解释少样本提示如何实现任务。

英文摘要

In-context learning (ICL) excels at new tasks from minimal examples, yet we still lack a mechanistic explanation of how few-shot prompts shape a model's function vector (FV), a causal activation direction that drives task behavior on the ICL query. Across tasks and models, an $n$-shot FV is well-approximated by a linear combination of example-level sub-FVs, suggesting additive and composable contributions from individual demonstrations. Beyond additivity, we show that models contextualize individual examples' representations based on prior examples to adaptively reweight which demonstrations dominate the FV: attention shifts toward examples that are more informative and less ambiguous under the context. Finally, a causal decomposition separates Query-Key routing from Value updates, finding that contextualization's most consistent contributions to FV quality arise from Query-Key alignment, particularly in ambiguous settings, while Value-mediated effects are more heterogeneous. Together, these results unify additive superposition with context-dependent attention reweighting into a mechanistic, testable account of how few-shot prompts implement tasks.

2605.15777 2026-05-26 cs.AI 版本更新

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

SaaS-Bench:计算机使用代理能否利用真实世界SaaS解决专业工作流程?

Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

发表机构 * UniPat AI PKU(北京大学) HKU(香港大学) G Labs(0G实验室) Pipeline Lab(Pipeline实验室)

AI总结 提出SaaS-Bench基准,包含23个可部署SaaS系统和106个真实工作场景任务,评估计算机使用代理在长期规划、跨应用协调等能力上的表现,发现最强模型端到端任务完成率不足4%。

Comments 24 pages, 11 figures

详情
AI中文摘要

计算机使用代理(CUA)正迅速将大型语言模型(LLM)从基于文本的推理扩展到更复杂环境中的行动执行,例如网络浏览器和图形用户界面(GUI)。然而,现有的网络和GUI代理基准通常依赖于简化设置、孤立任务或短周期交互,难以评估代理在现实专业工作流程中的能力。软件即服务(SaaS)环境是CUA评估的自然选择,因为它们承载了现代数字工作的很大一部分,并且自然涉及动态系统状态、跨应用协调、领域特定知识和长期依赖。为此,我们引入了SaaS-Bench,一个基于23个可部署SaaS系统(涵盖六个专业领域)的基准,包含106个基于现实工作场景的任务。这些任务需要长期执行,涵盖纯文本和多模态设置,并通过加权验证检查点进行评估,以衡量严格任务完成和部分进展。实验表明,代表性的基于LLM的代理在SaaS-Bench上表现不佳,即使最强的模型端到端完成任务也少于4%,暴露了在规划、状态跟踪、跨应用上下文维护和错误恢复方面的局限性。代码可在https://github.com/UniPat-AI/SaaS-Bench获取以进行复现。

英文摘要

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess the capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

2605.14890 2026-05-26 cs.CL cs.AI 版本更新

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

分词器生育率与基础模型在乌克兰法律文本上的零样本性能:一项比较研究

Volodymyr Ovcharov

发表机构 * LEX AI Platform(LEX AI平台) legal.org.ua Kyiv, Ukraine(基辅,乌克兰)

AI总结 本研究比较了七种基础模型在乌克兰法律文本上的分词器生育率和零样本性能,发现分词器生育率差异达1.6倍,Qwen 3模型比Llama系列多消耗60%的token,而NVIDIA Nemotron Super 3 (120B)以更低的成本取得最佳性能,同时揭示了少样本提示在形态丰富语言上的退化以及战时法律语言对模型泛化的影响。

Comments 25 pages, 13 tables, 5 figures; v2 adds cross-temporal generalization experiment and classical baseline

详情
AI中文摘要

在乌克兰法律文本上,不同基础模型的分词器生育率差异达1.6倍,然而这一成本关键维度在模型选择实践中被忽视。我们使用来自乌克兰国家登记册(EDRSR)的273份经过验证的法院判决,对来自五个提供商的七个模型进行了基准测试,测量了分词器生育率以及在三个任务上的零样本性能。发现了四个结果。(1)Qwen 3模型在相同输入上比Llama系列模型多消耗60%的token,使得分词器分析成为成本高效部署的前提。(2)NVIDIA Nemotron Super 3 (120B)取得了最高综合得分(83.1),以三分之一的API成本超越了Mistral Large 3(总参数多5.6倍)——模型规模并不能很好地代表领域性能。(3)少样本提示使性能下降高达26个百分点;分层和提示敏感性消融实验证实,这是乌克兰语演示的内在问题,而非示例选择的伪影。(4)跨时间泛化实验表明,在战前法院判决(2008-2013)上训练的分类器,应用于全面入侵时期的判决(2022-2026)时,性能下降27.9个百分点,并呈现出显著的前后不对称性:较新的模型向后迁移效果更好(比向前迁移高14.6个百分点),但较旧的模型在战时法律语言上完全失败。对于从业者:分词器分析应优先于模型选择,对于形态丰富的语言,零样本比少样本更可靠。为了支持可重复性并解决乌克兰语在法律NLP基准中的缺失,我们发布了一个包含14,452份法院判决的公开数据集,时间跨度为2008-2026年,标注了三个时间段的七个结果标签,这些时间段捕捉了武装冲突对司法程序的影响。

英文摘要

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost: model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court decisions (2008-2013) lose 27.9 percentage points when applied to decisions from the full-scale invasion era (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.
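
Tokenizer fertility is commonly measured as the average number of tokens produced per whitespace-separated word; whether the paper uses exactly this definition is an assumption. The sketch below shows the computation with two toy tokenizers that split words into fixed-size character chunks. These stand in for real model tokenizers and are purely hypothetical; in practice `tokenize` would wrap the tokenizer of the model under evaluation.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def chunk_tokenizer(size):
    """Toy tokenizer: split each word into chunks of at most `size` characters."""
    def tokenize(text):
        return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
    return tokenize


# A short Ukrainian legal phrase ("the court issued a decision").
sample = ["суд ухвалив рішення"]

coarse = fertility(sample, chunk_tokenizer(4))  # larger vocabulary pieces
fine = fertility(sample, chunk_tokenizer(2))    # smaller pieces, more tokens
cost_ratio = fine / coarse
```

Since API pricing and context limits are counted in tokens, the ratio of two fertilities on the same text approximates the relative cost of feeding that text to the two models.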

2605.14605 2026-05-26 cs.CR cs.AI cs.LG 版本更新

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

一步之遥:为什么针对恶意微调的防御在自适应对手面前失败

Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

发表机构 * Ben-Gurion University of the Negev(内盖夫本-古里安大学) Amrita Vishwa Vidyapeetham(阿姆利塔大学)

AI总结 本文通过分析15种近期防御机制,发现它们共享一个弱点:仅掩盖或误导有害行为路径而未消除行为本身,并开发了一种统一的自适应攻击,成功突破了所有防御机制。

Comments Under review

详情
AI中文摘要

模型提供商越来越多地发布开放权重或允许用户通过API微调基础模型。尽管这些模型在发布前经过安全对齐,但其防护措施通常可以通过对有害数据的微调来移除。最近的防御旨在使模型对此类恶意微调具有鲁棒性,但它们主要仅针对不考虑防御的固定攻击进行评估。我们表明这些鲁棒性声明是不完整的。通过调查15种近期防御,我们识别了几种防御机制,并表明它们共享一个单一弱点:它们掩盖或误导通往有害行为的路径,而不移除行为本身。然后,我们开发了一种统一的自适应攻击,突破了所有防御机制。我们的结果表明,当前方法并未提供稳健的安全性;它们主要阻止了它们所设计的攻击。我们希望我们针对这一领域的统一自适应对手将帮助未来的研究人员和实践者在部署前对新防御进行压力测试。

英文摘要

Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on harmful data. Recent defenses aim to make models robust to such malicious fine-tuning, but they are largely evaluated only against fixed attacks that do not account for the defense. We show that these robustness claims are incomplete. Surveying 15 recent defenses, we identify several defense mechanisms and show that they share a single weakness: they obscure or misdirect the path to harmful behavior without removing the behavior itself. We then develop a unified adaptive attack that breaks defenses across all defense mechanisms. Our results show that current approaches do not provide robust security; they mainly stop the attacks they were designed against. We hope that our unified adaptive adversary for this domain will help future researchers and practitioners stress-test new defenses before deployment.

2605.14559 2026-05-26 cs.AI math.OC 版本更新

PyCSP3-Scheduling: A Scheduling Extension for PyCSP3

PyCSP3-Scheduling: PyCSP3的调度扩展

Sohaib Afifi

发表机构 * Univ. Artois, UR 3926, Laboratoire de Génie Informatique et d’Automatique de l’Artois (LGI2A)(阿劳斯-大学,UR 3926,阿劳斯信息工程与自动化实验室(LGI2A))

AI总结 提出PyCSP3 Scheduling库,通过53个专用约束和27个表达式为PyCSP3添加调度抽象,并编译为标准约束,在261个实例上验证了与原始公式的目标一致性,但运行时性能因编译开销而异。

详情
AI中文摘要

PyCSP$^3$提供了一种高效构建约束模型以解决组合约束问题的方法,并将其导出为XCSP$^3$,保持了建模与求解的完全分离。然而,它缺乏对调度抽象(如区间变量、序列变量和资源函数)的原生支持。因此,即使PyCSP$^3$已经提供了如NoOverlap和Cumulative等整数数组上的全局约束,调度模型仍需通过低层整数变量和手动通道约束进行编码。我们提出了PyCSP$^3$ Scheduling,一个通过53个专用约束和27个表达式为PyCSP$^3$添加调度抽象的库,并将其编译为标准PyCSP$^3$/XCSP$^3$约束,维护了支撑PyCSP$^3$生态系统的建模/求解分离。在17个模型家族(每个5次运行)的261个配对实例上,两种公式在所有72个双重证明最优对以及近一半的家族(8/17)中产生了相同的目标值,且在编译后结构保持不变;然而,运行时性能在不同家族间存在差异,部分家族有显著提升(高达5.8倍),而其他家族由于编译分解的开销出现性能下降。代码和基准测试可在以下网址获取:https://github.com/sohaibafifi/pycsp3-scheduling

英文摘要

PyCSP$^3$ provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP$^3$, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP$^3$ already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP$^3$ Scheduling, a library that adds scheduling abstractions to PyCSP$^3$ through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP$^3$/XCSP$^3$ constraints, maintaining the modeling/solving separation that underpins the PyCSP$^3$ ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3-scheduling

2605.13850 2026-05-26 cs.AI cs.MA cs.SE 版本更新

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

AI智能体设计模式的二维框架:认知功能与执行拓扑

Jia Huang, Joey Tianyi Zhou

发表机构 * Agency for Science, Technology and Research (A*STAR)(科技研究局(A*STAR)) Centre for Frontier AI Research (CFAR)(前沿人工智能研究中心(CFAR))

AI总结 提出一个结合认知功能(7类)和执行拓扑(6种结构)的二维分类框架,识别28种命名模式,并通过跨领域分析得出模式选择的五条经验法则。

Comments 10 pages, 6 tables, 28 named patterns

详情
AI中文摘要

现有的基于LLM的智能体架构框架从单一视角描述系统:行业指南(Anthropic、Google、LangChain)关注执行拓扑——数据如何流动,而认知科学调查关注认知功能——智能体做什么。单独任何一个轴都无法区分架构上不同的系统:相同的Orchestrator-Workers拓扑可以实现Plan-and-Execute、Hierarchical Delegation或Adversarial Verification——这三种模式具有根本不同的故障模式和设计权衡。我们提出一个二维分类,结合(1)认知功能轴,包含七个类别(感知、记忆、推理、行动、反思、协作、治理)和(2)执行拓扑轴,包含六种结构原型(链、路由、并行、编排、循环、层次)。由此产生的7x6矩阵识别出28种命名模式,其中15种为原创名称。我们通过系统的跨轴分析证明正交性,详细定义八种代表性模式,并在四个真实领域(金融贷款、法律尽职调查、网络运维、医疗分诊)验证描述覆盖范围。跨领域分析得出模式选择的五条经验法则,这些法则支配环境约束(时间压力、行动权限、失败成本不对称、规模)与架构选择之间的关系。该框架为AI智能体架构设计提供了原则性、框架中立且模型无关的词汇表。

英文摘要

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology -- how data flows -- while cognitive science surveys focus on cognitive function -- what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator-Workers topology can implement Plan-and-Execute, Hierarchical Delegation, or Adversarial Verification -- three patterns with fundamentally different failure modes and design trade-offs. We propose a two-dimensional classification that combines (1) a Cognitive Function axis with seven categories (Perception, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 28 named patterns, 15 with original names. We demonstrate orthogonality through systematic cross-axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real-world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross-domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework-neutral, and model-agnostic vocabulary for AI agent architecture design.

2605.13282 2026-05-26 cs.AI cs.LG 版本更新

Differentiable Learning of Lifted Action Schemas for Classical Planning

经典规划中提升动作模式的可微学习

Jonas Reiter, Jakob Elias Gebler, Hector Geffner

发表机构 * RWTH Aachen University(亚琛工业大学)

AI总结 提出一种神经网络架构,从完全可观测状态但动作参数未观测的轨迹中学习提升动作模式,实现近乎完美的结构恢复。

详情
AI中文摘要

经典规划器可以有效解决用STRIPS或PDDL表示的非常大的确定性MDP,其中状态是对象和关系上的原子集合,提升动作模式添加或删除这些原子。这种紧凑表示产生了强大的搜索启发式,并为结构泛化提供了理想设置,因为提升关系和动作模式可以产生无限多个领域实例。一个核心挑战是从数据中学习这些关系和动作模式,最近的方法使用不同类型的观测来解决这个问题。在这项工作中,我们开发了一种新颖的神经网络架构,从状态完全可观测但动作参数未观测的轨迹中学习动作模式。该问题是一个简化,但却是从图像序列和动作标签学习规划领域的重要一步,我们旨在以近乎完美的方式解决这个简化问题。挑战在于同时从观测到的状态变化中识别动作参数并学习动作模式。我们的方法产生了一个鲁棒的可微组件,然后可以集成到更大的神经符号模型中。我们在各种规划领域上评估该架构,其中学习到的提升动作模式必须恢复真实结构。此外,我们报告了关于对观测噪声的鲁棒性以及与基于槽的动态模型相关变体的实验。

英文摘要

Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over objects and relations, and lifted action schemas add or delete these atoms. This compact representation yields strong search heuristics and provides an ideal setting for structural generalization, since lifted relations and action schemas give rise to infinitely many domain instances. A central challenge is to learn these relations and action schemas from data, and recent approaches have addressed this problem using different types of observations. In this work, we develop a novel neural network architecture for learning action schemas from traces where states are fully observed but action arguments are unobserved. The problem is a simplification but an important step towards learning planning domains from sequences of images and action labels, and we aim to solve this simplification in a nearly perfect manner. The challenge lies in learning the action schemas while simultaneously identifying the action arguments from observed state changes. Our approach yields a robust differentiable component that can then be integrated into larger neuro-symbolic models. We evaluate the architecture on various planning domains, where the learned lifted action schemas must recover the ground-truth structure. Additionally, we report experiments on robustness to observation noise and on a variation related to slot-based dynamics models.

2605.12850 2026-05-26 cs.CL cs.AI cs.CR cs.LG 版本更新

Persona-Model Collapse in Emergent Misalignment

涌现性失调中的人格模型崩溃

Davi Bastos Costa, Renato Vicente

发表机构 * TELUS Digital Research Hub(TELUS数字研究中心) Center for Artificial Intelligence and Machine Learning(人工智能与机器学习中心) Institute of Mathematics, Statistics and Computer Science(数学、统计与计算机科学研究所) University of São Paulo(圣保罗大学)

AI总结 提出人格模型崩溃假说,通过道德易感性(S)和道德稳健性(R)两个指标,证明在有害数据上微调大语言模型会导致模型模拟、区分和维持一致角色的内部能力恶化,从而引发涌现性失调。

Comments 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission; Corrected code repository URL

详情
AI中文摘要

在包含有害内容的狭窄数据上微调大型语言模型,会在无关提示上产生广泛的失调行为,这种现象称为涌现性失调。我们提出涌现性失调涉及人格模型崩溃:模型模拟、区分和维持一致角色的内部能力恶化。我们通过两个指标在行为上检验这一假设:道德易感性(S)和道德稳健性(R),它们根据模型在角色扮演下道德基础问卷回答的跨角色和角色内变异性计算得出。这些指标形式化了模型区分角色的能力(S)以及模拟给定角色时的一致性(R)。我们评估了四个前沿模型(DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B)的三种变体:基础版、微调为输出不安全代码的版本,以及匹配的微调为输出安全代码的对照版本。在四个模型中,不安全微调导致S平均增加55%,将所有四个不安全变体推至先前工作中13个前沿模型基准观测到的波段之外——其中GPT-4o达到波段上端的两倍以上——表明分化失调。它还导致R平均下降65%,相当于1/R增加304%。相比之下,匹配的安全对照将S保持在基础值附近,仅引起部分R损失,表明这些效应主要特定于失调。补充这些指标变化,不安全变体的无条件响应趋近于接近量表上限的饱和状态,与基础模型的结构化响应以及基础模型角色扮演有毒人格时的响应明显不同。综合来看,这些指标为涌现性失调提供了敏感的诊断,并作为其涉及人格模型崩溃的行为证据。

英文摘要

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average $55\%$ increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average $65\%$ decrease in R, equivalent to a $304\%$ increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.

2605.11182 2026-05-26 cs.AI 版本更新

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

在线策略蒸馏的多种面貌:陷阱、机制与修复

Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu

发表机构 * UIUC(伊利诺伊大学香槟分校) Renmin University of China(中国人民大学) Peking University(北京大学)

AI总结 本文通过实证研究分析了在线策略蒸馏(OPD)和在线策略自蒸馏(OPSD)在大语言模型后训练中的有效性、失败机制及修复方法。

详情
AI中文摘要

在线策略蒸馏(OPD)和在线策略自蒸馏(OPSD)已成为大语言模型有前景的后训练方法,它们在模型自身策略采样的轨迹上提供密集的token级监督。然而,现有关于其有效性的结果仍然好坏参半:虽然OP(S)D在系统提示和知识内化方面显示出潜力,但最近的研究也报告了不稳定性和退化。在这项工作中,我们对OPD和OPSD何时有效、何时失败以及原因进行了全面的实证研究。我们发现,数学推理上的OPD对教师选择和损失公式高度敏感,而OPSD在我们测试的设置中失败,因为测试时缺乏实例特定的特权信息(PI)。相反,当PI表示共享的潜在规则(如系统提示或对齐偏好)时,OPSD是有效的。我们识别出三种失败机制:(1)由于以学生生成的前缀为条件导致的教师与学生之间的分布不匹配,(2)来自有偏TopK反向KL梯度的优化不稳定性,以及(3)OPSD特定的限制,即学生学习了无PI策略,该策略聚合了以PI为条件的教师,当PI是实例特定时这是不够的。我们进一步表明,停止梯度TopK目标、RLVR适应的教师和SFT稳定的学生可以缓解这些失败。

英文摘要

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.

2605.10989 2026-05-26 cs.LG cs.AI 版本更新

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

SURGE: 二值神经网络中的替代梯度自适应

Haoyu Huang, Boyu Liu, Linlin Yang, Yanjing Li, Yuguang Yang, Xuhui Liu, Canyu Chen, Zhongqian Fu, Baochang Zhang

发表机构 * National College for Excellent Engineers, Beihang University, Beijing, China(北京航空航天大学优秀工程师学院) School of Artificial Intelligence, Beihang University, Beijing, China(北京航空航天大学人工智能学院) School of Electronic and Information Engineering, Beihang University, Beijing, China(北京航空航天大学电子与信息工程学院) King Abdullah University of Science and Technology, Saudi Arabia(阿卜杜拉国王科技大学) Huawei Noah’s Ark Lab, China(华为诺亚方舟实验室)

AI总结 针对二值神经网络中梯度失配和固定范围梯度裁剪导致的信息损失问题,提出一种基于理论的可学习梯度补偿框架SURGE,通过双路径梯度补偿器和自适应梯度缩放器实现偏差减少的梯度估计与动态平衡,在图像分类、目标检测和语言理解任务上达到最优性能。

Comments Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

二值神经网络(BNN)的训练从根本上依赖于对不可微二值化操作(如符号函数)的梯度近似。然而,包括直通估计器(STE)及其改进变体在内的主流方法依赖于手工设计,存在梯度失配问题和固定范围梯度裁剪导致的信息损失。为了解决这一问题,我们提出了SURrogate GradiEnt Adaptation(SURGE),一种新颖的、具有理论依据的可学习梯度补偿框架。SURGE通过辅助反向传播缓解梯度失配。具体地,我们设计了一个双路径梯度补偿器(DPGC),为每个二值化层构建一个并行的全精度辅助分支,通过在反向传播期间进行输出分解来解耦梯度流。DPGC利用全精度分支估计超出STE一阶近似的分量,从而实现偏差减少的梯度估计。为了进一步增强训练稳定性,我们引入了一个基于最优缩放因子的自适应梯度缩放器(AGS),通过基于范数的缩放动态平衡分支间的梯度贡献。在图像分类、目标检测和语言理解任务上的实验表明,SURGE在现有最先进方法中表现最佳。

英文摘要

The training of Binary Neural Networks (BNNs) fundamentally relies on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from the gradient mismatch problem and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE outperforms state-of-the-art methods.

2605.10718 2026-05-26 cs.DC cs.AI cs.LG cs.PF cs.SY eess.SY 版本更新

An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum

一种面向计算连续体中因果可观测性的不确定性感知韧性微代理

Suvi De Silva, Alfreds Lapkovskis, Alaa Saleh, Sasu Tarkoma, Praveen Kumar Donta

发表机构 * Department of Computer Systems and Sciences(计算机系统与科学系) Department of Computer Science(计算机科学系)

AI总结 提出AURORA框架,通过集成自由能原理、因果do-calculus和局部因果状态图,在边缘层实现灰色故障的因果诊断与缓解,并采用双门控执行机制在不确定性高时避免破坏性干预。

详情
AI中文摘要

计算连续体中的灰色故障会产生模糊重叠的症状,现有方法由于缺乏因果意识或在高度认知不确定性下行动,无法可靠诊断,并可能导致破坏性干预。本文提出了一种面向因果可观测性的不确定性感知韧性微代理(AURORA),这是一个轻量级框架,用于诊断和缓解边缘层环境中的灰色故障。该框架采用并行微代理,集成自由能原理、因果do-calculus和局部因果状态图,支持每个故障马尔可夫毯内的反事实根因分析。将推理限制在因果相关变量上可降低计算开销,同时保持诊断保真度。AURORA进一步引入双门控执行机制,仅在因果置信度高且预测认知不确定性有界时授权修复;否则,放弃本地干预并将诊断有效载荷升级到雾层。我们的实验表明,AURORA优于基线,实现了0%的破坏性行动率,同时保持62.0%的修复准确率和3ms的平均修复时间。

英文摘要

Grey failures in the computing continuum produce ambiguous, overlapping symptoms that existing approaches fail to diagnose reliably, either because they lack causal awareness or because they act under high epistemic uncertainty, which risks destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate while maintaining 62.0% repair accuracy and a 3 ms mean time to repair.
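
The dual-gated execution rule described in this abstract can be illustrated with a minimal sketch. The function name `dual_gate`, the threshold values, and the string return codes are hypothetical and chosen only for illustration; the abstract does not specify how AURORA computes causal confidence or epistemic uncertainty, so both are assumed here to arrive as plain numbers.

```python
def dual_gate(causal_confidence: float,
              epistemic_uncertainty: float,
              min_confidence: float = 0.9,    # hypothetical threshold
              max_uncertainty: float = 0.2):  # hypothetical threshold
    """Return the action chosen by a two-condition gate.

    Local remediation is authorized only when BOTH conditions hold:
    confidence is high enough and uncertainty is bounded. Otherwise the
    agent abstains locally and escalates the diagnosis upstream.
    """
    confident = causal_confidence >= min_confidence
    bounded = epistemic_uncertainty <= max_uncertainty
    if confident and bounded:
        return "remediate"
    return "escalate"
```

Requiring both conditions means a confident diagnosis made under high uncertainty still escalates, which is the behaviour the abstract associates with avoiding destructive interventions.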

2605.08063 2026-05-26 cs.CV cs.AI 版本更新

Flow-OPD: On-Policy Distillation for Flow Matching Models

Flow-OPD:面向流匹配模型的在线策略蒸馏

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) University of California, Los Angeles(加州大学洛杉矶分校) The Chinese University of Hong Kong(香港中文大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出Flow-OPD框架,通过两阶段对齐策略(单奖励GRPO微调专家+流式冷启动与在线策略蒸馏)解决流匹配模型在多任务对齐中的奖励稀疏和梯度干扰问题,并引入流形锚点正则化抑制美学退化,在GenEval和OCR指标上显著提升。

Comments Project Page: https://costaliya.github.io/Flow-OPD/, Code: https://github.com/CostaliyA/Flow-OPD

详情
AI中文摘要

现有的流匹配(FM)文本到图像模型在多任务对齐下存在两个关键瓶颈:标量奖励导致的奖励稀疏性,以及联合优化异构目标引起的梯度干扰,这共同导致了竞争指标的“跷跷板效应”和普遍的奖励破解。受大型语言模型社区中在线策略蒸馏(OPD)成功的启发,我们提出了Flow-OPD,这是第一个将在线策略蒸馏集成到流匹配模型中的统一后训练框架。Flow-OPD采用两阶段对齐策略:首先通过单奖励GRPO微调培养领域专精的教师模型,使每个专家在隔离环境中达到其性能上限;然后通过基于流的冷启动方案建立稳健的初始策略,并通过在线策略采样、任务路由标记和密集轨迹级监督的三步编排,将异构专业知识无缝整合到单个学生模型中。我们进一步引入了流形锚点正则化(MAR),它利用任务无关的教师提供全数据监督,将生成锚定到高质量流形,有效缓解了纯强化学习对齐中常见的美学退化。基于Stable Diffusion 3.5 Medium,Flow-OPD将GenEval分数从63提升至92,OCR准确率从59提升至94,相比原始GRPO总体提升约10个百分点,同时保持了图像保真度和人类偏好对齐,并展现出“超越教师”的涌现效应。这些结果确立了Flow-OPD作为构建通用文本到图像模型的可扩展对齐范式。代码和权重将在 https://github.com/CostaliyA/Flow-OPD 发布。

英文摘要

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. Code and weights will be released at: https://github.com/CostaliyA/Flow-OPD

2605.07647 2026-05-26 cs.CL cs.AI 版本更新

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

自动简答题评分中的质量条件一致性:中等范围退化与任务特定适应的影响

Abigail Victoria Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron

发表机构 * Weizmann Institute of Science(魏茨曼科学研究院) ETS(教育考试服务中心)

AI总结 研究自动简答题评分中不同模型的任务适应程度与质量条件评分一致性的关系,发现所有AI模型在完全正确和完全错误的回答上表现良好,但在中等范围回答上出现显著退化,且退化程度与任务特定数据量相关。

Comments Preprint version. Accepted to the ACL 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA26)

详情
AI中文摘要

自动简答题评分(ASAS)正从判别式微调模型转向少样本设置下的大语言模型(LLM)。这种范式利用了LLM广泛的世界知识和易于部署的优势,但有限的任务特定数据可能降低复杂评分任务的对齐。特别是,其对评分需要细微解释的部分正确回答的影响仍未充分探索。我们研究了不同模型的任务特定适应程度与质量条件评分一致性之间的关系。我们比较了三种LLM(GPT-5.2、GPT-4o、Claude Opus 4.5)在少样本模式下的表现、一个基于BERT的微调编码器以及一位人类专家,在两个开放式生物学题目上使用了数百个学生回答和由生物学教育专家提供的真实分数。结果表明,人类之间的一致性最高且在整个质量范围内稳定。所有AI模型在完全正确和完全错误的回答上表现良好,但在中等范围回答上表现出显著退化。这种中等范围退化取决于任务特定适应:在少样本LLM中最为严重,随着任务特定数据的增加而减少,其中微调编码器模型表现最佳。这种中等范围退化可能导致对理解发展中的学生所产生回答的不公平评估。我们的发现强调了质量条件公平性的重要性,尤其需要关注中等范围回答。

英文摘要

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: it is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

2605.06505 2026-05-26 cs.LG cs.AI cs.CR 版本更新

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

PACZero: 通过符号量化的语言模型PAC隐私微调

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

发表机构 * CWI Amsterdam(阿姆斯特丹信息与计算科学研究所) MIT Cambridge(麻省理工学院) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

AI总结 提出PACZero系列零阶机制,通过符号量化实现零互信息下的PAC隐私微调,在SST-2和SQuAD上取得竞争性结果。

详情
AI中文摘要

我们引入了PACZero,一系列用于微调大型语言模型的PAC隐私零阶机制,在$I(S^*; Y_{1:T})=0$时提供可用的效用。该隐私机制将成员推断攻击(MIA)后验成功率限制在先验水平,这是DP框架仅在$\varepsilon=0$和无限噪声下才能达到的MIA抵抗水平。所有下面的DP-ZO比较都在MIA后验水平上匹配。关键见解是,PAC隐私仅在发布依赖于哪个候选子集是秘密时才对互信息收费。对子集聚合的零阶梯度进行符号量化会产生频繁的一致步骤,即每个候选子集在更新方向上达成一致;在这些步骤中,发布的符号花费零条件互信息。我们提出了两个变体,涵盖隐私-效用权衡:PACZero-MI(通过对二元发布进行精确校准的预算化MI)和PACZero-ZPL(在分歧步骤上通过均匀硬币翻转实现$I=0$)。我们在SST-2和SQuAD上使用OPT-1.3B和OPT-6.7B在LoRA和全参数轨道上进行了评估。在SST-2 OPT-1.3B全微调$I=0$时,PACZero-ZPL达到$88.99\pm0.91$,比非私有MeZO基线($91.1$ FT)低2.1个百分点。在$\varepsilon<1$的高隐私机制下,没有先前方法能产生可用的效用,而PACZero-ZPL在$I=0$时在OPT-1.3B和OPT-6.7B上获得了有竞争力的SST-2准确率和非平凡的SQuAD F1分数。

英文摘要

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
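
The unanimity idea in this abstract can be pictured with a toy sketch. It assumes each candidate subset yields one scalar zeroth-order gradient estimate per step. The function names, the convention that zero counts as positive, and the use of Python's `random.Random` are illustrative assumptions. The coin flip on disagreement mirrors the abstract's one-line description of PACZero-ZPL, but this is not the authors' implementation and carries no privacy guarantee on its own.

```python
import random

def sign(x: float) -> int:
    # Convention assumed for this sketch: zero counts as positive.
    return 1 if x >= 0 else -1

def release_sign(subset_gradients, rng: random.Random):
    """Return (released_sign, unanimous).

    If every candidate subset agrees on the update direction, that sign
    is released, since it does not depend on which subset is the secret.
    On disagreement, a uniform coin flip is released instead.
    """
    signs = {sign(g) for g in subset_gradients}
    if len(signs) == 1:
        return signs.pop(), True
    return rng.choice([-1, 1]), False
```

For example, `release_sign([0.3, 0.1, 0.7], random.Random(0))` takes the unanimous branch, while `release_sign([-0.3, 0.1], random.Random(0))` falls back to the coin flip.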

2605.05226 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

将结果监督内化为过程监督:推理强化学习的新范式

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo wang, Huiming Yang

发表机构 * Alibaba Group(阿里巴巴集团) Tsinghua University(清华大学)

AI总结 提出一种监督内化方法,使模型在仅结果监督下自动提取过程级学习信号,实现细粒度策略优化。

详情
AI中文摘要

推理强化学习的核心挑战不仅在于结果级监督的稀疏性,更在于如何将仅在序列末尾提供的反馈转化为可指导中间推理步骤的细粒度学习信号。现有方法要么依赖结果级奖励进行序列级优化,导致精确信用分配困难,要么依赖外部构建的过程监督,成本高昂且难以可持续扩展。为解决这一问题,我们提出一个新视角:推理强化学习可以理解为将结果监督内化为过程监督的问题。基于此视角,我们引入一种用于推理强化学习的监督内化方法,使模型能够通过识别、纠正和重用失败的推理轨迹自动提取过程级学习信号,从而在仅结果监督下实现更细粒度的策略优化。我们进一步将这一思想抽象为一种新的训练范式,其中模型在强化学习过程中持续生成并完善自身的内部过程监督,为推理强化学习中细粒度信用分配开辟了一条不同于外部提供过程监督的新路径。

英文摘要

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

2605.04363 2026-05-26 cs.LG cs.AI 版本更新

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

通过测试时后验调整缓解表格上下文学习中的标签偏移

Seunghan Lee

发表机构 * LG AI Research(LG人工智能研究)

AI总结 针对TabPFN在表格数据上下文学习中对标签偏移敏感的问题,提出DistPFN方法,通过测试时后验调整重新缩放类别概率,无需修改架构或额外训练,在250多个OpenML数据集上显著提升分类性能。

Comments ICML 2026

详情
AI中文摘要

TabPFN最近作为表格数据集的基础模型受到关注,通过在合成数据上利用上下文学习实现了强性能。然而,我们发现TabPFN容易受到标签偏移的影响,常常过拟合训练数据集中的多数类。为了解决这一局限性,我们提出了DistPFN,这是第一个专为表格基础模型设计的测试时后验调整方法。DistPFN通过降低训练先验(即上下文的类别分布)的影响并强调模型预测后验的贡献来重新缩放预测的类别概率,无需架构修改或额外训练。我们进一步引入了DistPFN-T,它结合了温度缩放,以根据先验和后验之间的差异自适应地控制调整强度。我们在超过250个OpenML数据集上评估了我们的方法,证明在标签偏移下,各种基于TabPFN的模型在分类任务中取得了显著改进,同时在无标签偏移的标准设置中保持了强性能。代码可在以下仓库获取:https://github.com/seunghan96/DistPFN。

英文摘要

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.
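
As a rough illustration of test-time posterior adjustment, the sketch below applies a classical prior-correction rule: each predicted class probability is divided by the context class frequency raised to a strength exponent, then renormalized. This is a generic rule used as a stand-in, not the DistPFN formula, which the abstract does not give. The names `adjust_posterior` and `strength` and the `eps` guard are assumptions.

```python
def adjust_posterior(probs, context_prior, strength=1.0, eps=1e-12):
    """Downweight the context class prior in predicted probabilities.

    probs:         predicted class probabilities for one test row
    context_prior: class frequencies in the in-context training set
    strength:      0 leaves probs unchanged; larger values remove more
                   of the prior's influence (hypothetical knob)
    """
    weights = [p / max(q, eps) ** strength
               for p, q in zip(probs, context_prior)]
    total = sum(weights)
    return [w / total for w in weights]


# A context dominated by class 0 (90%) pulls predictions toward it;
# the correction shifts mass back toward the minority class.
adjusted = adjust_posterior([0.7, 0.3], [0.9, 0.1])
```

A temperature-style variant in the spirit of DistPFN-T would make `strength` depend on the gap between prior and posterior; that dependence is not specified in the abstract and is not sketched here.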

2605.03462 2026-05-26 cs.LG cs.AI 版本更新

From Muscle Bursts to Motor Intent: Self-Supervised Token Modeling for Heterogeneous EMG

从肌肉爆发到运动意图:面向异质EMG的自监督令牌建模

Zhenghao Huang, Huilin Yao, Kaikai Wang

AI总结 提出AEMG自监督学习方法,通过事件级令牌建模和Transformer编码,从异质EMG数据中提取可复用的神经肌肉表征,提升跨用户、跨会话的鲁棒性并减少校准数据需求。

Comments After further verification, we identified issues in the current version that may affect the reliability and reproducibility of the reported experimental results. In particular, part of the evaluation relies on a dataset for which the public-release/redistribution status and supporting validation remain unresolved

详情
AI中文摘要

表面肌电图提供了一种从可穿戴肌肉记录推断人类运动意图的实用方法,但在单一采集设置下训练的模型在用户、会话、电极布局或手势协议改变时往往会失去可靠性。本文提出AEMG,一种自监督学习方法,旨在从多样化的EMG源中提取可复用的神经肌肉表征。首先将八个公开手势数据集转换为共享信号格式,以减少通道配置、传感器拓扑和记录协议的差异。AEMG不依赖固定长度滑动窗口,而是从能量变化中识别收缩事件并将其表示为紧凑的神经肌肉令牌,同时有序令牌组描述运动过程中多个肌肉的协调活动。然后使用空间和时间条件Transformer编码这些令牌序列,保留电极位置、激活时序和顺序结构信息。在预训练中,模型通过向量量化重建构建收缩原型的离散库,并通过从周围观测中恢复掩蔽的神经肌肉令牌进一步学习上下文依赖关系。在留一受试者和低标签适应设置下的实验表明,学习到的表征提高了对未见用户的鲁棒性,并减少了手势识别所需的校准数据量。这些发现表明,事件级令牌建模为适应性强且数据高效的基于EMG的运动意图理解提供了一条可扩展的途径。

英文摘要

Surface electromyography provides a practical way to infer human movement intention from wearable muscle recordings, but models trained under a single acquisition setting often lose reliability when the user, session, electrode layout, or gesture protocol changes. This paper proposes AEMG, a self-supervised learning approach designed to extract reusable neuromuscular representations from diverse EMG sources. Eight public gesture datasets are first transformed into a shared signal format to reduce discrepancies in channel configuration, sensor topology, and recording protocol. Instead of relying on fixed-length sliding windows, AEMG identifies contraction events from energy variations and represents them as compact neuromuscular tokens, while ordered token groups describe the coordinated activity of multiple muscles during motion. A spatially and temporally conditioned Transformer is then used to encode these token sequences, preserving information about electrode position, activation timing, and sequential structure. For pre-training, the model constructs a discrete library of contraction prototypes through vector-quantized reconstruction and further learns contextual dependencies by recovering masked neuromuscular tokens from surrounding observations. Experiments under leave-one-subject-out and low-label adaptation settings show that the learned representation improves robustness to unseen users and reduces the amount of calibration data required for gesture recognition. These findings suggest that event-level token modeling offers a scalable route toward adaptable and data-efficient EMG-based motor-intent understanding.

2605.02900 2026-05-26 cs.CR cs.AI cs.CV cs.RO 版本更新

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

具身人工智能的安全性:风险、攻击与防御综述

Xiao Li, Xiang Zheng, Yifeng Gao, Xinyu Xia, Yixu Wang, Xin Wang, Ye Sun, Yunhan Zhao, Ming Wen, Jiayu Li, Zixing Chen, Xun Gong, Yi Liu, Yige Li, Yutao Wu, Cong Wang, Jun Sun, Yixin Cao, Zhineng Chen, Jingjing Chen, Tao Gui, Qi Zhang, Zuxuan Wu, Xipeng Qiu, Xuanjing Huang, Tiehua Zhang, Zhipeng Wei, Kun Wang, Xinfeng Li, Hanxun Huang, Sarah Erfani, James Bailey, Jianping Wang, Chaowei Xiao, Ran He, Bo Li, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) City University of Hong Kong(香港城市大学) Jilin University(吉林大学) Singapore Management University(新加坡管理大学) Deakin University(德肯大学) Tongji University(同济大学) Nanyang Technological University(南洋理工大学) Chinese Academy of Sciences(中国科学院) The University of Melbourne(墨尔本大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文综述了具身AI在感知、认知、规划、行动及交互全流程中的安全风险、攻击与防御方法,提出了多层次分类体系,并指出了多模态感知融合脆弱性、规划不稳定及人机交互可信度等关键挑战。

Comments Survey paper; 75 pages, 4 figures, 18 tables; v2 expands embodied-specific coverage of agentic threats, World Action Model threats, and contextual risk mitigation, with over 100 new references added. Project page: https://x-zheng16.github.io/Awesome-Embodied-AI-Safety/

详情
AI中文摘要

具身人工智能将感知、认知、规划与交互集成到在开放、安全关键环境中运行的智能体中。随着这些系统获得自主性并进入交通、医疗、工业或辅助机器人等领域,确保其安全性在技术上具有挑战性,在社会上也变得不可或缺。与数字AI系统不同,具身智能体必须在不确定的感知、不完整的知识和动态的人机交互下行动,故障可能直接导致物理伤害。本综述对具身AI中的安全研究进行了全面且结构化的回顾,考察了从感知、认知到规划、行动与交互以及智能体系统的完整具身流程中的攻击与防御。我们引入了一个多层次分类体系,统一了分散的研究工作,并将具身特定的安全发现与视觉、语言和多模态基础模型的更广泛进展联系起来。我们的综述综合了来自500多篇论文的见解,涵盖对抗性攻击、后门攻击、越狱攻击和硬件级攻击;攻击检测、安全训练和鲁棒推理;以及风险感知的人机交互。这一分析揭示了几个被忽视的挑战,包括多模态感知融合的脆弱性、越狱攻击下规划的不稳定性,以及开放场景中人机交互的可信度。通过将领域组织成连贯的框架并识别关键研究空白,本综述为构建不仅具备能力和自主性,而且在现实部署中安全、鲁棒和可靠的具身智能体提供了路线图。

英文摘要

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic systems. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 500 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.

2605.02124 2026-05-26 cs.LG cs.AI math.PR 版本更新

Soft-to-Hard Routing in Sparse Mixture-of-Experts Models

稀疏混合专家模型中的软到硬路由

Reza Rastegar

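Affiliations: Meta Platforms, Inc.

AI summary: Using a boundary-layer calculus, this paper studies the limit in which softmax routing in sparse mixture-of-experts models approaches hard top-1 routing as the temperature tends to zero, and gives quantitative error bounds in terms of the probability carried by a neighborhood of the routing interfaces.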

详情
AI中文摘要

随着温度趋于零,softmax路由趋近于硬top-1路由,但极限过程在路由器平局时存在奇异性。本文针对总体平方损失混合专家回归中的软到硬极限,发展了一种边界层微积分方法。对于具有logits $a_k(x;ϕ)$的路由器,相关的局部量是前两名的间隔$Δ(x;ϕ)$,相关的全局量是边界质量$\mathbb{P}(Δ(X;ϕ)\le w)$。在光滑性和横截性假设下,余面积和管状邻域估计展示了该质量如何随板宽缩放;在二元情形中,主导系数是路由界面上的显式曲面积分。这些几何估计给出了软目标$L_τ$和硬目标$L_0$之间的定量界,包括在间隔尾条件下的$O(τ^α)$一致比较,并得到了紧参数空间上软目标的$Γ$-收敛性。主要结论是,零温度近似由路由界面的$O(τ)$邻域所承载的概率控制,而不仅仅由温度本身决定。在分离出问题的这一边界层部分后,我们记录了一个从硬路由到小温度软路由的条件景观传递定理,以及一个简化的双专家高斯计算,展示了局部对称性破缺。仅包含合成诊断作为边界层预测的受控检验。

英文摘要

Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This paper develops a boundary-layer calculus for this soft-to-hard limit in population squared-loss mixture-of-experts regression. For a router with logits $a_k(x;ϕ)$, the relevant local quantity is the top-two margin $Δ(x;ϕ)$, and the relevant global quantity is the boundary mass $\mathbb{P}(Δ(X;ϕ)\le w)$. Under smoothness and transversality assumptions, coarea and tubular-neighborhood estimates show how this mass scales with the slab width; in the binary case the leading coefficient is an explicit surface integral over the routing interface. These geometric estimates give quantitative bounds between the soft objective $L_τ$ and the hard objective $L_0$, including an $O(τ^α)$ uniform comparison under a margin-tail condition, and yield $Γ$-convergence of the soft objectives on compact parameter spaces. The main conclusion is that the zero-temperature approximation is controlled by the probability carried by an $O(τ)$ neighborhood of the routing interfaces, not by temperature alone. After isolating this boundary-layer part of the problem, we record a conditional landscape-transfer theorem from hard to small-temperature soft routing and a reduced two-expert Gaussian calculation illustrating local symmetry breaking. Synthetic diagnostics are included only as controlled checks of the boundary-layer predictions.
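The role of the top-two margin in the abstract above can be illustrated with a small numerical sketch. This is not the paper's code: the function names and the example logits are assumptions chosen for illustration. It compares softmax routing weights at temperature tau with the hard top-1 assignment and shows that the gap is driven by the ratio of margin to temperature rather than by the temperature alone.

```python
import math

def soft_route(logits, tau):
    """Softmax routing weights at temperature tau (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp((a - m) / tau) for a in logits]
    z = sum(exps)
    return [e / z for e in exps]

def hard_route(logits):
    """One-hot top-1 routing weights (first index wins on exact ties)."""
    k = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == k else 0.0 for i in range(len(logits))]

def top_two_margin(logits):
    """Difference between the largest and second-largest logit."""
    s = sorted(logits, reverse=True)
    return s[0] - s[1]

def routing_gap(logits, tau):
    """Largest absolute difference between soft and hard routing weights."""
    soft = soft_route(logits, tau)
    hard = hard_route(logits)
    return max(abs(s - h) for s, h in zip(soft, hard))

# Hypothetical inputs: one far from a router tie, one inside the boundary layer.
far_from_tie = [2.0, 1.0, 0.0]      # margin 1.0, much larger than tau
near_tie = [1.0, 1.0 - 1e-4, 0.0]   # margin 1e-4, much smaller than tau
tau = 0.01
for logits in (far_from_tie, near_tie):
    print(top_two_margin(logits), routing_gap(logits, tau))
```

At the same small temperature, the input far from a tie is routed almost exactly as in the hard limit, while the near-tie input still splits its weight roughly evenly between two experts. This is the sense in which the mass of inputs near the routing interface, and not tau by itself, governs the approximation.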

2605.02010 2026-05-26 cs.AI 版本更新

Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

可靠AI需要外化隐性知识:人机协作视角

Hengyu Liu, Tianyi Li, Zhihong Cui, Yushuai Li, Zhangkai Wu, Torben Bach Pedersen, Kristian Torp, Christian S. Jensen

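Affiliations: Department of Computer Science, Aalborg University, Aalborg, Denmark; Department of Informatics, University of Oslo, Oslo, Norway; School of Computing, Macquarie University, Sydney, Australia

AI summary: From a human-AI collaboration perspective, this position paper argues that reliable AI needs infrastructure that externalizes implicit knowledge into verifiable forms, using Knowledge Objects (KOs) to enable human validation and thereby improve reliability.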

Comments Accepted at ICML 2026 (Position Paper Track). 14 pages, 2 figures, 1 table

详情
AI中文摘要

本文立场认为,可靠AI需要基础设施来支持人类对隐性知识的验证。AI从显性知识(论文、文档、结构化数据库)和隐性知识(推理模式、调试过程、中间步骤)中学习。隐性知识由于文档成本超过感知价值而未被外化——然而AI不加区分地学习它,既获得有益模式也获得有害偏见。当前的可靠性方法只能根据来源验证显性知识,造成根本性差距:最有价值的AI能力(推理、判断、直觉)恰恰是我们无法验证的。我们提出知识对象(KOs)——将隐性知识外化为人类可以检查、验证和认可的形式的结构化工件。KOs改变了验证经济学:以前验证成本过高的事情变得可行,使得累积的人类验证能够随时间提高可靠性。

英文摘要

This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value -- yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) -- structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.

2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR 版本更新

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链:面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

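Affiliations: National Engineering Research Center for Software Engineering, Peking University; City University of Hong Kong; Peking University; Tencent Technology

AI summary: Proposes the Chain of Evidence (CoE) framework, which uses vision-language models to reason directly over screenshots of retrieved documents and outputs precise bounding boxes that visualize the complete reasoning chain, addressing coarse-grained attribution and visual semantic loss in iterative retrieval-augmented generation.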

详情
AI中文摘要

迭代检索增强生成(iRAG)已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而,当前系统主要基于解析文本运行,这造成了两个关键瓶颈:(1)粗粒度归因,用户需要根据模糊的文本级引用在冗长文档中手动定位证据;(2)视觉语义丢失,将视觉丰富的文档(如幻灯片、带有图表的PDF)转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距,我们提出了证据链(CoE),这是一个与检索器无关的视觉归因框架,利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析,输出精确的边界框,可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE:Wiki-CoE,一个源自2WikiMultiHopQA的大规模结构化网页数据集;以及SlideVQA,一个具有挑战性的演示幻灯片数据集,包含复杂图表和自由形式布局。实验表明,微调后的Qwen3-VL-8B-Instruct取得了稳健的性能,在需要视觉布局理解的场景中显著优于基于文本的基线,同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.

2604.27636 2026-05-26 cs.AI 版本更新

Generative structure search for efficient and diverse discovery of molecular and crystal structures

生成式结构搜索:高效且多样地发现分子和晶体结构

Yifang Qin, Yu Shi, Junfu Tan, Chang Liu, Ming Zhang, Ziheng Lu

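Affiliations: Zhongguancun Academy; Kairos Materials

AI summary: Proposes the generative structure search (GSS) framework, which combines diffusion-based generation with random structure search, using data priors to accelerate sampling while retaining energy-guided exploration of local minima, and recovers diverse metastable structures at less than one tenth of the sampling cost of random structure search.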

详情
AI中文摘要

预测稳定和亚稳态结构是分子和材料发现的核心,但受限于高维能量景观的搜索成本。深度生成模型提供了高效的结构采样,但其输出仍受训练数据影响,可能未充分探索罕见但物理相关的极小值。我们引入生成式结构搜索(GSS),一个统一框架,将基于扩散的生成和随机结构搜索(RSS)表述为由学习得分场和物理力驱动的共同采样过程的极限情况。耦合这些驱动因素使GSS能够利用数据先验加速采样,同时保留能量引导的局部极小探索。在分子和晶体系统中,GSS恢复了多样的亚稳态结构,其采样成本比RSS低十倍以上,且对训练分布之外的组成仍然有效。结果建立了一种物理基础的生成搜索策略,用于发现仅靠数据驱动采样无法达到的结构。

英文摘要

Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching high-dimensional energy landscapes. Deep generative models offer efficient structure sampling, yet their outputs remain shaped by training data and can underexplore minima that are rare but physically relevant. We introduce generative structure search (GSS), a unified framework that formulates diffusion-based generation and random structure search (RSS) as limiting regimes of a common sampling process driven by learned score fields and physical forces. Coupling these drivers lets GSS use data priors to accelerate sampling while retaining energy-guided exploration of local minima. Across molecular and crystalline systems, GSS recovers diverse metastable structures with more than tenfold lower sampling cost than RSS for broad coverage and remains effective for compositions outside the training distribution. The results establish a physically grounded generative search strategy for discovering structures beyond the reach of data-driven sampling alone.

2604.23396 2026-05-26 cs.IR cs.AI cs.CL cs.LG 版本更新

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

迷失在解码中?复现与压力测试生成式检索中的前瞻先验

Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

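Affiliations: University of Amsterdam

AI summary: Reproduces and stress-tests PAG, a look-ahead prior method for generative retrieval, finds that its planning signal is brittle under lexical surface-form variation, and evaluates cross-lingual robustness and query-side mitigation strategies.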

Comments 12 pages, 5 figures, 9 tables; accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2026, Melbourne/Naarm, Australia

详情
Journal ref
Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), pages XXX-XXX, 2026
AI中文摘要

生成式检索(GR)通过自回归生成文档标识符来对文档进行排序。由于许多GR方法依赖于trie约束的束搜索,它们在有限束解码下容易过早剪枝相关前缀。生成式检索中的前瞻规划(PAG)通过使用同时解码来计算文档级前瞻先验,指导后续顺序解码,从而缓解了这种失败模式。我们在推理时复现了PAG,并压力测试了其解码行为。使用作者发布的检查点和标识符/trie工件,在报告的解码设置下,我们在MS MARCO Dev和TREC-DL 2019/2020上复现了主要有效性结果,并在我们的硬件设置中证实了报告的束大小-延迟权衡。在复现之外,我们引入了规划漂移诊断,量化意图保持的查询变体如何改变规划器的top-n候选集和最高权重规划器令牌,以及这些变化如何影响引导解码。我们发现PAG的规划信号在词汇表面形式变化下是脆弱的:意图保持的拼写错误可能触发规划崩溃,其中规划的候选池变化足够大,使得前瞻奖励几乎无法提供有用的指导,实际上使解码退回到较弱的无引导搜索。我们进一步使用非英语mMARCO查询对英语索引评估了固定索引的跨语言鲁棒性,并评估了无需重新索引的查询端缓解策略;在我们的设置中,查询翻译提供了最强的恢复。总体而言,我们的结果证实了PAG报告的有效性以及在发布的推理设置下规划引导解码的优势,同时表明这些增益依赖于规划信号在现实查询变化和查询-文档不匹配下的稳定性。

英文摘要

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam decoding. Planning Ahead in Generative Retrieval (PAG) mitigates this failure mode by using simultaneous decoding to compute a document-level look-ahead prior that guides subsequent sequential decoding. We reproduce PAG at inference time and stress-test its decoding behavior. Using the authors' released checkpoint and identifier/trie artifacts under the reported decoding setup, we reproduce the main effectiveness results on MS MARCO Dev and TREC-DL 2019/2020, and corroborate the reported beam-size-latency trade-off in our hardware setting. Beyond reproduction, we introduce plan drift diagnostics that quantify how intent-preserving query variations alter the planner's top-n candidate set and highest-weight planner tokens, and how these changes affect guided decoding. We find that PAG's planning signal is brittle under lexical surface-form variation: intent-preserving typos can trigger plan collapse, where the planned candidate pool shifts enough that the look-ahead bonus provides little useful guidance, effectively reverting decoding toward weaker unguided search. We further evaluate fixed-index cross-lingual robustness using non-English mMARCO queries against an English index, and assess query-side mitigation strategies that require no re-indexing; query translation provides the strongest recovery in our setting. Overall, our results confirm PAG's reported effectiveness and the benefit of planning-guided decoding under the released inference setup, while showing that these gains depend on the stability of the planning signal under realistic query variation and query-document mismatch.

2604.20022 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

MoBayes:一种用于对话式临床决策支持中推理与语言分离的模块化贝叶斯框架

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

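Affiliations: LiGHT, EPFL; University of Bern; Aarhus University

AI summary: Proposes MoBayes, a framework that separates reasoning from language by using the LLM as a language interface and a Bayesian module for probabilistic inference, and that outperforms standalone frontier LLM doctors in clinical decision support.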

Comments 50 pages including appendix, 13 figures, 22 tables. Preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于对话式临床决策支持,但它们将下一个标记预测与概率决策混为一谈。我们认为这种混淆反映了架构上的局限性:此类系统缺乏显式的后验追踪、可控的弃权阈值和可审计的推理链。我们引入MoBayes,一个模块化贝叶斯对话框架,将推理与语言分离。LLM仅作为语言接口,将患者对话解析为结构化观察,而贝叶斯模块对这些观察进行概率推理以更新后验,通过期望信息增益选择后续问题,并通过校准的决策阈值决定何时停止或推迟。这种设计实现了显式后验追踪、可控的选择性决策,以及无需重新训练语言模型即可替换的特定人群统计后端。在经验知识和LLM生成的知识库上,MoBayes优于独立的前沿LLM医生,包括匹配模型系列的比较,其中廉价的传感器模型与MoBayes配对以较低成本超过更大的自主模型。在对抗性患者沟通风格和不同诊断场景下,该优势依然存在。这些结果表明,可靠的对话式临床决策支持系统应将概率推理与语言生成分离,而不是仅扩大模型规模。代码可在https://anonymous.4open.science/r/MoBayes/获取。

英文摘要

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected-information-gain and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/
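The Bayesian loop described above (update the posterior from a structured observation, then pick the follow-up question with the highest expected information gain) can be sketched in a few lines. This is a toy illustration and not the MoBayes implementation: the condition names, question names, and all probabilities are made-up assumptions with no clinical meaning, and only yes/no answers are modeled.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping hypothesis -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def update(prior, likelihood_yes, answer):
    """Bayes update for a yes/no answer; likelihood_yes[h] = P(yes | h)."""
    unnorm = {
        h: prior[h] * (likelihood_yes[h] if answer else 1.0 - likelihood_yes[h])
        for h in prior
    }
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

def expected_information_gain(prior, likelihood_yes):
    """Prior entropy minus the expected posterior entropy over both answers."""
    p_yes = sum(prior[h] * likelihood_yes[h] for h in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(update(prior, likelihood_yes, answer))
    return gain

def select_question(prior, questions):
    """Return the question name with the highest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))

# Hypothetical toy knowledge base (illustrative numbers only).
prior = {"condition_a": 0.5, "condition_b": 0.5}
questions = {
    "fever": {"condition_a": 0.9, "condition_b": 0.1},   # discriminative
    "cough": {"condition_a": 0.6, "condition_b": 0.6},   # uninformative
}
best = select_question(prior, questions)
posterior = update(prior, questions[best], answer=True)
print(best, posterior)
```

A stopping rule in the spirit of the abstract's calibrated decision thresholds could compare the largest posterior probability with a threshold and defer when no hypothesis exceeds it; the threshold value would itself be a design choice and is not specified in the source.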

2604.18170 2026-05-26 cs.CL cs.AI 版本更新

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

Copy-as-Decode: 面向LLM编辑的语法约束并行预填充

Ziyang Liu

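AI summary: Proposes Copy-as-Decode, a mechanism that accelerates LLM editing through grammar-constrained parallel prefill; copying spans via parallel prefill is up to 303x faster than autoregressive decoding at the kernel level, with high copy coverage and a lossless pipeline.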

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

LLMs通过自回归地重新生成完整输出来编辑文本和代码,即使大多数标记在输入中逐字出现。我们研究Copy-as-Decode,一种解码层机制,将编辑生成重新表述为基于两个原语语法的结构化解码:<copy lines="i-j"/>引用输入行范围,<gen>...</gen>生成新内容。一个标记级FSM保证语法有效性,服务层原语通过单次并行预填充前向(而非N步自回归步骤)更新每个复制跨度的KV缓存——共享推测解码的并行前向内核,但以输入标记作为草稿,程序强制接受替代概率验证。我们报告一个无需端到端训练的上界分析。(i) 内核加速:在Qwen2.5-{1.5B, 7B}上,通过并行预填充复制N个标记比自回归快6.8倍至303倍(N ∈ [8, 512],A100 80GB bf16)。(ii) 复制上限:在ProbeEdit和HumanEvalPack-Fix (Py/JS)上,74%–98%的金标准标记在行级原语下可达;结合每个语料库跨度直方图上的经验内核,得到闭式挂钟时间上界29.0倍/3.4倍/4.2倍(合并13.0倍)。标记级扩展达到91%–99%覆盖率,下界4.5倍–6.5倍。(iii) 流水线无损性:预言程序通过确定性解析器在所有482个案例上往返,将任何下游失败定位到跨度选择而非机制。扰动研究表明,在离一噪声下,合并EM从100%降至15.48%。在Qwen2.5-Coder-1.5B上的微调实验将HEvalFix-Py EM从0/33(未训练)提升至12%–17%,这是一个可学习性信号,而非生产选择器。批处理服务集成和多文件覆盖作为后续工作。

英文摘要

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
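The two-primitive grammar in the abstract lends itself to a small deterministic resolver. The sketch below is an illustration, not the authors' code: it assumes that line ranges are 1-based and inclusive and that a program is a plain concatenation of the two primitives, neither of which the abstract specifies. It covers only the text-level resolution step, not the KV-cache or parallel-prefill machinery.

```python
import re

# One alternative per primitive: <copy lines="i-j"/> or <gen>...</gen>.
PRIMITIVE = re.compile(r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>', re.DOTALL)

def resolve(program, input_lines):
    """Expand an edit program into output lines.

    Assumption (not from the source): copy ranges are 1-based and inclusive.
    """
    output = []
    pos = 0
    for match in PRIMITIVE.finditer(program):
        # Reject anything other than whitespace between primitives.
        if program[pos:match.start()].strip():
            raise ValueError("unexpected text outside primitives")
        pos = match.end()
        if match.group(1) is not None:
            start, end = int(match.group(1)), int(match.group(2))
            if not 1 <= start <= end <= len(input_lines):
                raise ValueError(f"copy range {start}-{end} out of bounds")
            output.extend(input_lines[start - 1:end])
        else:
            output.extend(match.group(3).split("\n"))
    if program[pos:].strip():
        raise ValueError("unexpected trailing text")
    return output

source = ["def f(x):", "    return x+1", "", "print(f(1))"]
program = '<copy lines="1-1"/><gen>    return x + 2</gen><copy lines="3-4"/>'
edited = resolve(program, source)
print("\n".join(edited))
```

Because the resolver is deterministic, a wrong output can only come from a wrong span in the program, for example an off-by-one line range. That is consistent with the abstract's point that failures localize to span selection rather than to the mechanism.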

2604.18128 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

深度寄存器解锁 SwiGLU 上的 W4A4:一种读取器/生成器分解

Ziyang Liu

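AI summary: Using a training-time intervention that combines Depth Registers with a hinge loss (DR+sink), this study reduces the W4A4-quantized perplexity of a SwiGLU decoder-only language model from 1727 to 119, and decomposes the error into residual-axis readers, which the intervention controls, and generators, with the bilinear input of w2 accounting for most of the remaining gap.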

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

我们在一个受控的 300M 参数 SwiGLU 解码器语言模型(在 FineWeb-Edu 的 5B 令牌上训练)中研究训练后 W4A4 量化,并询问哪些输入激活位点主导误差。朴素的四舍五入 W4A4 将验证困惑度从 FP16 的 23.6 降至 1727。一种简单的残差轴训练时干预——带有寄存器幅度铰链损失的深度寄存器(DR+sink)——在匹配的 FP16 PPL 和匹配的零样本能力下,将其降至 119(约 14 倍),并与 SmoothQuant 组合达到 39.9 PPL。与 FP16 之间约 2 PPL 的剩余差距是诊断核心。我们按输入激活位点分解 W4A4 损伤:SwiGLU 块中的五个可训练线性层分为残差轴读取器(qkv, w1, w3)和块内生成器(o_proj, w2)。基本的范数论证表明,残差轴幅度控制紧密约束读取器,但 w2 的双线性输入仅受因子范数平凡乘积的约束;经验上,DR+sink 降低了读取器的峰度,而生成器基本不变,并且读取器恢复的 W4A4 残差在三个匹配检查点上平坦约为 0.28 nats,其中 Delta-remove(w2) 占主导。我们将 DR+sink 作为训练时探针而非部署方案提出:一种事后替代方案(Per-Linear QuaRot)在读取器轴上几乎与之匹配。完整的 QuaRot——添加在线每头值 Hadamard 和在线 w2 输入旋转——也没有缩小差距,直接验证了正交旋转无法约束双线性 SwiGLU 尾部的预测。这些主张特定于我们的 300M、5B 令牌、单种子设置,并且我们的实验未将分区与铰链分离。

英文摘要

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
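To see why activation outliers matter for the naive round-to-nearest baseline mentioned above, consider a minimal sketch of symmetric 4-bit quantization. It is a generic textbook-style illustration under stated assumptions (one scale per tensor, toy values), not the paper's quantizer and not its DR+sink intervention.

```python
def quantize_rtn(values, bits=4):
    """Symmetric round-to-nearest quantization with a single per-tensor scale.

    Returns the dequantized values so the error can be inspected directly.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, quantized, count):
    """Mean absolute error over the first `count` entries."""
    return sum(abs(a - b) for a, b in zip(values[:count], quantized[:count])) / count

# Hypothetical activations: a well-behaved vector, and the same vector plus one outlier.
smooth = [-0.7, -0.3, 0.1, 0.4, 0.7]
with_outlier = smooth + [50.0]

err_smooth = mean_abs_error(smooth, quantize_rtn(smooth), len(smooth))
err_outlier = mean_abs_error(with_outlier, quantize_rtn(with_outlier), len(smooth))
print(err_smooth, err_outlier)
```

A single large entry stretches the scale so far that every ordinary value rounds to zero. Heavy-tailed activations of this kind are what magnitude control and rotation methods try to tame; which sites they can and cannot bound is the question the abstract addresses.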

2604.17328 2026-05-26 cs.LG cs.AI 版本更新

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

重新思考序列级强化学习中的比较单元:从损失校正到样本构建的等长配对训练框架

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo wang, Linglin Liao

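Affiliations: Alibaba Group; Tsinghua University

AI summary: Argues that the length problem in sequence-level relative reinforcement learning is fundamentally a comparison-unit construction problem, and on that basis proposes EqLen, an equal-length paired training framework that builds comparable training samples through dual-track synchronous generation, prefix inheritance, and segment masking.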

详情
AI中文摘要

本文研究了序列级相对强化学习中的长度问题。我们观察到,尽管现有方法部分缓解了与长度相关的现象,但一个更根本的问题仍未得到充分刻画:训练过程中使用的比较单元缺乏内在可比性。基于这一观察,我们提出一个新的视角:长度问题不应仅仅被视为损失缩放或归一化偏差,而应被视为一个比较单元构建问题。我们进一步建立了一个基于样本构建的训练框架,该框架不是对不等长响应进行事后校正,而是在生成过程中主动构建等长、可对齐且可比较的训练段。在该框架内,我们提出了EqLen,一种适用于组相对比较算法(如GRPO、GSPO和RLOO)的具体方法。通过双轨同步生成、前缀继承和段掩码,EqLen高效地收集有效的等长训练段,并实现稳定的训练。

英文摘要

This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a comparison unit construction problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable training.

2604.16778 2026-05-26 cs.LG cs.AI 版本更新

Federation over Text: Insight Sharing for Multi-Agent Reasoning

文本上的联邦:多智能体推理的洞察共享

Dixi Yao, Tahseen Rabbani, Manzil Zaheer, Tian Li

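Affiliations: University of Chicago; Google DeepMind

AI summary: Proposes FoT, a federated-learning-like framework that iteratively aggregates the local reasoning processes of multiple clients into a cross-task library of metacognitive insights without sharing problem instances or task instructions, improving reasoning effectiveness and efficiency.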

Comments 46 pages

详情
AI中文摘要

我们提出了一种类似联邦学习的框架——文本上的联邦(FoT),它使得处理不同任务的多个客户端能够通过迭代地联邦化其本地推理过程,共同生成一个共享的元认知洞察库,而无需共享实际的问题实例或任务指令。与梯度上的联邦(例如分布式训练)不同,FoT在语义层面运作,无需任何梯度优化或监督信号。迭代地,每个客户端运行一个LLM智能体,独立地对其特定任务进行本地思考和自我改进,并将推理轨迹与中央服务器共享,中央服务器将其聚合和提炼成一个跨任务(和跨领域)的洞察库,现有和未来的智能体可以利用该库来改进相关任务的性能。实验表明,FoT在广泛具有挑战性的应用中提高了推理效果和效率,包括数学问题求解、跨领域协作、现实世界日常任务以及机器学习研究洞察发现。具体而言,在前三个应用中,它平均提高了25%的性能得分,同时减少了4%的推理令牌。在研究洞察发现应用中,FoT能够生成覆盖后续论文中80%以上主要贡献的洞察。

英文摘要

We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes without sharing actual problem instances or task instructions. Instead of federation over gradients (e.g., as in distributed training), FoT operates at the semantic level without any gradient optimization or supervision signal. Iteratively, each client runs an LLM agent that does local thinking and self-improvement on their specific tasks independently, and shares reasoning traces with a central server, which aggregates and distills them into a cross-task (and cross-domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross-domain collaboration, real-world daily tasks, and machine learning research insight discovery. Specifically, it improves average performance scores by 25% while reducing the reasoning tokens by 4% across the first three applications. In the research insight discovery application, FoT is able to generate insights that cover over 80% of the major contributions in the subsequent papers.

2604.12376 2026-05-26 cs.CL cs.AI 版本更新

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

面向长程LLM对话的协作式内存分页与关键词书签

Ziyang Liu

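AI summary: Proposes cooperative paging, which replaces evicted conversation segments with keyword bookmarks and gives the model a recall() tool for on-demand retrieval; it achieves the best answer quality on the LoCoMo benchmark across four models, and ablations identify the key factors in paging design.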

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

详情
AI中文摘要

当LLM对话超出上下文窗口时,旧内容必须被驱逐——但模型在需要时如何恢复它们?我们提出协作式分页:被驱逐的片段被替换为最小关键词书签([pN:keywords],每个约8-24个token),并赋予模型一个 recall() 工具以按需检索完整内容。在 LoCoMo 基准(10个真实多会话对话,300+轮次)上,协作式分页在四种模型(GPT-4o-mini、DeepSeek-v3.2、Claude Haiku、GLM-5)的六种方法中实现了最高的答案质量——优于截断、BM25、词重叠检索、搜索工具基线和完整上下文——由四个独立的LLM评判员确认(p=0.017,配对bootstrap)。随后,我们通过边界策略和驱逐策略的5x4消融实验(3,176个合成探针,1,600个LoCoMo探针)研究分页设计空间。关键发现:(1)粗粒度固定大小页面(fixed_20)达到96.7%,而内容感知的topic_shift降至56.7%;(2)驱逐策略的选择依赖于数据(FIFO在合成数据上最佳,LFU在LoCoMo上最佳);(3)两种书签生成策略相比启发式基线有提升(+4.4和+8.7个E2E点);(4)剩余瓶颈是书签区分度——模型96%的时间触发recall(),但当书签区分度不足时,仅57%选择正确页面。关键词特异性单独造成25个百分点的准确率差异。

英文摘要

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
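The paging idea above can be sketched as a small data structure. This is an illustrative approximation rather than the paper's system: the class name, the fixed page size, FIFO eviction, and the frequency-based keyword picker are all assumptions made for the example. Only the bookmark shape `[pN:keywords]` and the `recall()` tool come from the abstract.

```python
from collections import Counter, deque

class PagedContext:
    """Toy cooperative pager: fixed-size pages, FIFO eviction, keyword bookmarks."""

    def __init__(self, page_size=2, max_pages=2):
        self.page_size = page_size
        self.max_pages = max_pages
        self.resident = deque()   # (page_id, turns) still in the context window
        self.store = {}           # page_id -> turns for evicted pages
        self.bookmarks = []       # bookmark strings left in place of evicted pages
        self.buffer = []          # turns not yet closed into a page
        self.next_id = 1

    def add_turn(self, text):
        self.buffer.append(text)
        if len(self.buffer) == self.page_size:
            self.resident.append((self.next_id, self.buffer))
            self.next_id += 1
            self.buffer = []
            if len(self.resident) > self.max_pages:
                self._evict()

    def _evict(self):
        page_id, turns = self.resident.popleft()
        self.store[page_id] = turns
        # Hypothetical keyword picker: the most frequent words longer than 4 letters.
        words = [w.strip(".,!?").lower() for t in turns for w in t.split()]
        top = Counter(w for w in words if len(w) > 4).most_common(3)
        keywords = ",".join(w for w, _ in top)
        self.bookmarks.append(f"[p{page_id}:{keywords}]")

    def recall(self, page_id):
        """Tool the model would call to fetch an evicted page in full."""
        return self.store[page_id]

    def render(self):
        """Context shown to the model: bookmarks, resident pages, open turns."""
        lines = list(self.bookmarks)
        for _, turns in self.resident:
            lines.extend(turns)
        lines.extend(self.buffer)
        return "\n".join(lines)

turns = [
    "My passport expires in March.",
    "I need to renew the passport soon.",
    "Let's plan the trip to Lisbon.",
    "Flights are cheaper on Tuesday.",
    "Book a hotel near the river.",
    "Remind me about travel insurance.",
]
ctx = PagedContext(page_size=2, max_pages=2)
for turn in turns:
    ctx.add_turn(turn)
print(ctx.render())
```

The abstract's finding on bookmark discrimination maps onto the keyword picker here: if two evicted pages receive near-identical keywords, the model has little basis for choosing which page id to pass to `recall()`.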

2604.12116 2026-05-26 cs.AI cs.SE 版本更新

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

A-R行为空间:组织部署中工具使用语言模型代理的执行层剖析

Shasha Yu, Fiona Carroll, Barry L. Bentley

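Affiliations: Cardiff School of Technologies, Cardiff Metropolitan University; School of Professional Studies, Clark University; Harvard Medical School, Harvard University

AI summary: Proposes a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), together with Divergence (D), to measure execution-layer behavior, and characterizes how execution and refusal are distributed for language model agents across normative regimes and autonomy configurations.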

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被部署为能够执行系统级操作的工具增强型代理。虽然现有基准主要评估文本对齐或任务成功,但较少关注在不同自主性支架下语言信号与可执行行为之间的结构关系。本研究引入了一种基于二维A-R空间的执行层行为测量方法,该空间由动作率(A)和拒绝信号(R)定义,散度(D)捕捉两者之间的协调性。模型在四种规范制度(控制、灰色、困境和恶意)和三种自主性配置(直接执行、规划和反思)下进行评估。该方法不是分配聚合安全分数,而是描述执行和拒绝如何随上下文框架和支架深度重新分布。实证结果表明,执行和拒绝构成了可分离的行为维度,其联合分布在制度和自主性水平上系统性地变化。基于反思的支架通常会在风险情境中促使配置转向更高的拒绝,但重新分布模式在不同模型间存在结构性差异。A-R表示使得横截面行为剖面、支架诱导的转变和协调变异性直接可观察。通过将执行层表征置于标量排名之上,这项工作为在组织环境中分析和选择工具增强的LLM代理提供了面向部署的视角,其中执行权限和风险容忍度各不相同。

英文摘要

Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.

2604.08988 2026-05-26 cs.AI 版本更新

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

SEA-Eval: 超越情景评估的自进化智能体基准

Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Tengfei Wang, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Li, Jiaqing Liang, Yanghua Xiao

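Affiliations: Fudan University

AI summary: Gives a formal definition of the Self-Evolving Agent (SEA) and its minimal sufficient architecture, the Evolutionary Flywheel, and builds SEA-Eval, the first benchmark designed specifically for evaluating SEAs, which uses a sequential task stream design to quantify evolutionary gain, evolutionary stability, and implicit alignment convergence.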

详情
AI中文摘要

当前基于LLM的智能体在情景任务执行中表现出强大性能,但仍受限于静态工具集和情景遗忘,无法跨任务边界积累经验。本文从数字具身和连续跨任务进化的角度形式化自进化智能体(SEA),引入进化飞轮作为其最小充分架构,并提出SEA-Eval——首个专门设计用于评估SEA的基准。基于飞轮理论,SEA-Eval将SR和T作为主要指标,并通过顺序任务流设计,旨在量化进化增益、进化稳定性和隐式对齐收敛。实证评估表明,在可比成功率下,不同框架在单个任务上的token消耗差异高达31.2倍,且在顺序分析下出现不同的进化轨迹——这表明成功率单独造成能力幻觉,而T的顺序收敛是区分真正进化与伪进化的关键标准。

英文摘要

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience across task boundaries. This paper formalizes the Self-Evolving Agent (SEA) from the perspective of digital embodiment and continuous cross-task evolution, introduces the Evolutionary Flywheel as its minimal sufficient architecture, and presents SEA-Eval -- the first benchmark designed specifically for evaluating SEAs. Grounded in Flywheel theory, SEA-Eval establishes SR and T as primary metrics and, through sequential task stream design, is designed to quantify evolutionary gain, evolutionary stability, and implicit alignment convergence. Empirical evaluation reveals that, under comparable success rates, token consumption differs by up to 31.2 times between frameworks on individual tasks, with divergent evolutionary trajectories emerging under sequential analysis -- demonstrating that success rate alone creates a capability illusion and that the sequential convergence of $T$ is the key criterion for distinguishing genuine evolution from pseudo-evolution.

2603.29897 2026-05-26 cs.IR cs.AI 版本更新

UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

UniRank: 混合文本-图像候选的端到端领域特定重排序

Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Shikui Tu, Lei Xu

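Affiliations: Shanghai Jiao Tong University; Alibaba Group

AI summary: Proposes UniRank, a VLM-based reranking framework that scores hybrid text-image candidates with a unified score and no modality conversion, combined with end-to-end domain adaptation (instruction tuning and reinforcement-learning-based preference alignment), improving Recall@1 on scientific literature retrieval and design patent search.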

详情
AI中文摘要

重排序是许多信息检索流程中的关键组件。尽管在纯文本场景中取得了显著进展,多模态重排序仍然具有挑战性,尤其是当候选集包含混合文本和图像项时。一个关键难点是模态差距:文本重排序器本质上更接近文本候选而非图像候选,导致跨模态排序存在偏差且次优。视觉语言模型(VLM)通过强大的跨模态对齐缓解了这一差距,并已被用于构建多模态重排序器。然而,大多数基于VLM的重排序器将所有候选编码为图像,将文本视为图像会引入大量计算开销。同时,现有的开源多模态重排序器通常在通用领域数据上训练,在特定领域场景中往往表现不佳。为解决这些限制,我们提出UniRank,一种基于VLM的重排序框架,无需任何模态转换即可原生地对混合文本-图像候选进行评分和排序。基于这种混合评分接口,UniRank提供了端到端的领域适应流程,包括:(1)指令微调阶段,通过将标签令牌似然映射到统一标量分数来学习校准的跨模态相关性评分;(2)硬负样本驱动的偏好对齐阶段,构建领域内成对偏好,并通过基于人类反馈的强化学习(RLHF)进行查询级策略优化。在科学文献检索和设计专利搜索上的大量实验表明,UniRank一致优于最先进的基线,Recall@1分别提高了8.9%和7.3%。

英文摘要

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
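One common way to map label-token likelihoods to a single scalar score is a two-way softmax over the logits of a positive and a negative label token. The sketch below illustrates that idea only; the label tokens `yes`/`no`, the function names, and the hand-written logits are assumptions for illustration, since the abstract does not specify UniRank's actual mapping.

```python
import math

def relevance_score(label_logits, positive="yes", negative="no"):
    """Map the logits of two label tokens to a scalar in (0, 1).

    Hypothetical illustration: P(positive) renormalised over the two
    label tokens only (a two-way softmax).
    """
    z_pos = label_logits[positive]
    z_neg = label_logits[negative]
    m = max(z_pos, z_neg)  # subtract the max for numerical stability
    e_pos = math.exp(z_pos - m)
    e_neg = math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg)

def rerank(candidates):
    """Sort (candidate_id, label_logits) pairs by descending score."""
    scored = [(cid, relevance_score(logits)) for cid, logits in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up logits for a hybrid candidate set (text and image items share one scale).
candidates = [
    ("text_doc_1", {"yes": 1.2, "no": 0.3}),
    ("image_7", {"yes": 2.5, "no": -0.5}),
    ("text_doc_2", {"yes": -1.0, "no": 1.0}),
]
ranking = rerank(candidates)
```

Because every candidate, whatever its modality, is reduced to the same scalar, text and image items can be sorted in one list without converting either modality into the other.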

2603.25288 2026-05-26 cs.IT cs.AI cs.ET cs.LG eess.SP math.IT 版本更新

CSI-tuples-based 3D Channel Fingerprints Construction Assisted by MultiModal Learning

基于CSI元组的多模态学习辅助3D信道指纹构建

Chenjie Xie, Li You, Ruirong Chen, Gaoning He, Xiqi Gao

发表机构 * National Mobile Communications Research Laboratory, Southeast University(东南大学移动通信国家重点实验室) Purple Mountain Laboratories(紫金山实验室) Huawei Technologies Co., Ltd.(华为技术有限公司)

AI总结 针对低空通信中的3D信道指纹构建问题,提出一种基于CSI元组的多模态回归框架,通过融合位置、通信测量和地理环境地图,实现高效高精度的信道状态信息估计。

Comments 14 pages, 9 figures

详情
Journal ref
IEEE Transactions on Wireless Communications, vol. 25, pp. 17369-17383, 2026
AI中文摘要

低空通信可以促进空中和地面无线资源的整合,扩大网络覆盖范围,提高传输质量,从而推动第六代(6G)移动通信的发展。作为低空传输的关键技术,3D信道指纹(3D-CF),也称为3D无线电地图或3D信道知识地图,有望增强对通信环境的理解,并辅助获取信道状态信息(CSI),从而避免重复估计并降低计算复杂度。本文提出了一种模块化的多模态框架来构建3D-CF。具体而言,我们首先基于莱斯衰落信道建立了3D-CF模型,将其表示为CSI元组的集合,每个元组包含低空飞行器(LAV)的位置及其对应的统计CSI。考虑到不同先验数据的异构结构,我们将3D-CF构建问题表述为一个多模态回归任务,其中CSI元组中的目标信道信息可以通过其对应的LAV位置、通信测量和地理环境地图直接估计。然后,相应地提出了一种高效的多模态框架,包括基于相关性的多模态融合(Corr-MMF)模块、多模态表示(MMR)模块和CSI回归(CSI-R)模块。数值结果表明,我们提出的框架能够高效地构建3D-CF,并在不同通信场景下比现有算法至少提高27.5%的精度,展示了其竞争性能和出色的泛化能力。我们还分析了计算复杂度,并说明了其在推理时间方面的优越性。

英文摘要

Low-altitude communications can promote the integration of aerial and terrestrial wireless resources, expand network coverage, and enhance transmission quality, thereby empowering the development of sixth-generation (6G) mobile communications. As an enabler for low-altitude transmission, 3D channel fingerprints (3D-CF), also referred to as the 3D radio map or 3D channel knowledge map, are expected to enhance the understanding of communication environments and assist in the acquisition of channel state information (CSI), thereby avoiding repeated estimations and reducing computational complexity. In this paper, we propose a modularized multimodal framework to construct 3D-CF. Specifically, we first establish the 3D-CF model as a collection of CSI-tuples based on Rician fading channels, with each tuple comprising the low-altitude vehicle's (LAV) positions and its corresponding statistical CSI. In consideration of the heterogeneous structures of different prior data, we formulate the 3D-CF construction problem as a multimodal regression task, where the target channel information in the CSI-tuple can be estimated directly by its corresponding LAV positions, together with communication measurements and geographic environment maps. Then, a high-efficiency multimodal framework is proposed accordingly, which includes a correlation-based multimodal fusion (Corr-MMF) module, a multimodal representation (MMR) module, and a CSI regression (CSI-R) module. Numerical results show that our proposed framework can efficiently construct 3D-CF and achieve at least 27.5% higher accuracy than the state-of-the-art algorithms under different communication scenarios, demonstrating its competitive performance and excellent generalization ability. We also analyze the computational complexity and illustrate its superiority in terms of the inference time.

2603.20479 2026-05-26 cs.CY cs.AI cs.CL 版本更新

Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

学习者情感投入画像:情感AI、跨文化语用学与语言学习

Robert Godwin-Jones

发表机构 * Virginia Commonwealth University(弗吉尼亚联邦大学)

AI总结 本文探讨了情感AI在语言学习中的应用,特别是自动情感识别和模拟人类响应如何影响语用能力和互动能力的发展,并讨论了其个性化学习优势与情感操纵风险。

详情
Journal ref
Language Learning & Technology, 30(2), 14-35 (2026)
AI中文摘要

学习另一种语言可能是一个高度情感化的过程,通常以无数大大小小的挫折和成功为特征。对大多数学习者而言,语言学习并非遵循线性、可预测的路径,其曲折进程受动机(或去动机)变量影响,如个人特征、师生关系、学习材料以及对未来第二语言自我的梦想。虽然语言学习的某些方面(阅读、语法)相对机械,但其他方面可能充满压力且不可预测,尤其是用目标语言交谈。这种体验不仅需要结构和词汇知识,还需要以适合社会和文化语境的方式使用语言的能力。AI聊天机器人的出现为练习会话能力提供了新机会,既有优势(响应迅速、无评判),也有缺点(缺乏情感、文化偏见)。本文探讨了技术使用中产生的情感方面,特别是自动情感识别和AI系统中模拟的人类响应如何与语言学习以及语用和互动能力的发展相互作用。情感AI,即算法驱动对用户情感信号的解读,被认为能够实现更个性化的学习,适应感知到的学习者认知和情感状态。其他人则警告情感操纵以及不恰当和无效的用户画像。

英文摘要

Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small. For most learners, language learning does not follow a linear, predictable path, its zigzag course shaped by motivational (or demotivating) variables such as personal characteristics, teacher/peer relationships, learning materials, and dreams of a future L2 (second language) self. While some aspects of language learning (reading, grammar) are relatively mechanical, others can be stressful and unpredictable, especially conversing in the target language. That experience necessitates not only knowledge of structure and lexis, but also the ability to use the language in ways that are appropriate to the social and cultural context. A new opportunity to practice conversational abilities has arrived through the availability of AI chatbots, with both advantages (responsive, non-judgmental) and drawbacks (emotionally void, culturally biased). This column explores aspects of emotion as they arise in technology use and in particular how automatic emotion recognition and simulated human responsiveness in AI systems interface with language learning and the development of pragmatic and interactional competence. Emotion AI, the algorithmically driven interpretation of users' affective signals, has been seen as enabling greater personalized learning, adapting to perceived learner cognitive and emotional states. Others warn of emotional manipulation and inappropriate and ineffective user profiling.

2603.20334 2026-05-26 cs.SE cs.AI 版本更新

Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2

基于LLM驱动的算法调试的程序化精炼用于ARC-AGI-2

Yu-Ning Qiu, Lin-Feng Zou, Jiong-Da Wang, Xue-Rong Yuan, Wang-Zhou Dai

发表机构 * Nanjing University(南京大学)

AI总结 提出一种神经符号精炼方法ABPR,结合LLM与Prolog元解释器,通过证明树推导进行语义重检,在ARC-AGI-2上实现高通过率,并扩展到RAVEN风格推理任务。

详情
AI中文摘要

在高复杂度的抽象推理中,系统必须从少量示例或结构化观察中推断出潜在规则,并将其应用于未见实例。LLM可以将此类规则表达为程序,但基于对话的常规精炼主要停留在结果层面:它观察到答案或输出是错误的,而没有正式重新检查是哪个抽象、关系或变换导致了该结果。我们提出基于溯因的程序化精炼(ABPR),一种神经符号精炼方法,它将LLM与Prolog元解释器相结合。ABPR将每个候选程序视为潜在规则的可执行声明性假设,并将其SLD目标-子目标解析具体化为紧凑的证明树风格推导,遵循Shapiro的算法程序调试(APD)。在此视角下,精炼不仅仅是代码级调试,而是对模型假设规则进行语义重检。我们主要在ARC-AGI-2上评估ABPR,这是一个具有挑战性的少样本抽象规则归纳基准,涉及网格变换。使用Gemini-3-Flash的ABPR在公共评估集上达到56.67%的Pass@2,而使用GPT-5.5 xHigh的ABPR达到98.33%的Pass@2。在填空式I-RAVEN-X和A-I-RAVEN改编上的补充实验表明,相同的轨迹引导框架可以扩展到RAVEN风格的关系和类比抽象,而不仅限于ARC特定的网格任务。重复运行和敏感性分析表明,随着搜索广度和总搜索深度的增加,并行轨迹引导搜索减少了随机方差。

英文摘要

In high-complexity abstract reasoning, a system must infer a latent rule from a few examples or structured observations and apply it to unseen instances. LLMs can express such rules as programs, but ordinary conversation-based refinement is largely outcome-level: it observes that an answer or output is wrong without formally re-checking which abstraction, relation, or transformation justified that outcome. We propose \emph{Abduction-Based Procedural Refinement} (ABPR), a neuro-symbolic refinement approach that couples an LLM with a Prolog meta-interpreter. ABPR treats each candidate program as an executable declarative hypothesis of the latent rule and reifies its SLD goal--subgoal resolution into compact proof-tree-style derivations, following Shapiro's algorithmic program debugging (APD). In this view, refinement is not merely code-level debugging, but semantic re-checking of the model's hypothesised rule. We evaluate ABPR primarily on ARC-AGI-2, a challenging few-shot abstract rule induction benchmark over grid transformations. ABPR with Gemini-3-Flash achieves 56.67\% Pass@2, while GPT-5.5 xHigh with ABPR reaches 98.33\% Pass@2 on the public evaluation set. Supplementary experiments on fill-in-the-blank I-RAVEN-X and A-I-RAVEN adaptations provide evidence that the same trace-guided framework extends beyond ARC-specific grid tasks to RAVEN-style relational and analogical abstraction. Repeated-run and sensitivity analyses show that parallel trace-guided search reduces stochastic variance as search breadth and total search depth increase.

2603.11583 2026-05-26 cs.CL cs.AI 版本更新

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

UtilityMax Prompting:多目标大语言模型任务的形式化框架

Ofir Marom

发表机构 * Independent Researcher(独立研究者)

AI总结 提出UtilityMax Prompting框架,用影响图和期望效用最大化将多目标LLM任务形式化,在MovieLens 1M数据集上相比自然语言基线提升了精度和NDCG。

详情
AI中文摘要

大语言模型(LLM)任务的成功在很大程度上取决于其提示词。大多数用例使用自然语言指定提示词,当必须同时满足多个目标时,自然语言本质上是模糊的。在本文中,我们引入了UtilityMax Prompting,一个使用形式化数学语言指定任务的框架。我们将任务重构为一个影响图,其中LLM的答案是唯一的决策变量。在图中条件概率分布上定义效用函数,并指示LLM找到最大化期望效用的答案。这迫使LLM明确推理目标的每个组成部分,将其输出导向精确的优化目标,而非主观的自然语言解释。我们在MovieLens 1M数据集上,使用三个前沿模型(Claude Sonnet 4.6、GPT-5.4和Gemini 2.5 Pro)验证了我们的方法,在多目标电影推荐任务中,与自然语言基线相比,在精度和归一化折损累计增益(NDCG)上表现出一致的改进。

英文摘要

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
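The decision rule described here, picking the answer that maximises expected utility, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the outcome names, utility values, and conditional probabilities are made up, and in the paper's setting the LLM itself is instructed to reason about these quantities rather than receiving them as a table.

```python
def expected_utility(outcome_probs, utilities):
    """E[U | answer] = sum over outcomes of P(outcome | answer) * U(outcome)."""
    return sum(outcome_probs[o] * utilities[o] for o in utilities)

def best_answer(candidates, utilities):
    """Return the candidate answer with the highest expected utility."""
    return max(candidates, key=lambda a: expected_utility(candidates[a], utilities))

# Hypothetical utility function over outcomes of a recommendation.
utilities = {"liked": 1.0, "neutral": 0.2, "disliked": -0.5}

# Hypothetical conditional distributions P(outcome | recommended movie).
candidates = {
    "Movie A": {"liked": 0.6, "neutral": 0.3, "disliked": 0.1},
    "Movie B": {"liked": 0.7, "neutral": 0.0, "disliked": 0.3},
    "Movie C": {"liked": 0.4, "neutral": 0.5, "disliked": 0.1},
}
choice = best_answer(candidates, utilities)
```

Note that "Movie B" has the highest probability of being liked but not the highest expected utility, which is the kind of trade-off an explicit utility function makes visible.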

2603.06626 2026-05-26 cs.LG cs.AI 版本更新

Grouter: Decoupling Routing from Representation for Accelerated MoE Training

Grouter: 将路由与表示解耦以加速MoE训练

Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China(北京大学数学科学学院) Center for Machine Learning Research, Peking University, Beijing, China(北京大学机器学习研究中心) Yuanpei College, Peking University, Beijing, China(北京大学元培学院) Zhejiang Lab, Hangzhou, China(浙江实验室)

AI总结 提出Grouter方法,通过从预训练MoE模型中蒸馏高质量结构作为固定路由器,解耦结构优化与权重更新,显著加速模型收敛并提升训练吞吐量。

详情
AI中文摘要

传统的混合专家(MoE)训练通常没有任何结构先验,实际上要求模型在训练专家权重的同时,在巨大的组合空间中搜索最优路由策略。这种纠缠常常导致收敛缓慢和训练不稳定。本文介绍了Grouter,一种先发制人的路由方法,通过从完全训练的MoE模型中蒸馏高质量结构,并作为目标模型的固定路由器。通过将结构优化与权重更新解耦,Grouter显著加速了模型收敛的速度和质量。为了确保框架的通用性,我们还引入了专家折叠以适应不同模型配置的Grouter,以及专家调优以重新平衡不同数据分布下的工作负载。此外,通过利用先发制人路由提供的结构先验,我们可以实施有针对性的优化以进一步提高训练吞吐量。实验表明,Grouter实现了卓越的性能和效率,将预训练数据利用率提高了4.28倍,并实现了高达33.5%的吞吐量加速,确立了先发制人路由作为可扩展MoE训练的基本范式。我们在https://github.com/JimmyAwoe/Grouter公开了我们的代码和预训练的Grouter检查点。

英文摘要

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to train expert weights while simultaneously searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully trained MoE models and uses them as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly accelerates both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency, boosting pre-training data utilization by 4.28x and achieving up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training. We publicly release our code and pretrained Grouter checkpoints at https://github.com/JimmyAwoe/Grouter.
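The decoupling idea, a router whose parameters stay fixed while only expert parameters are trained, can be illustrated with a toy example. Everything below is a hypothetical sketch: the class and function names, the scalar "experts", and the router weights are assumptions for illustration and are not taken from the Grouter code base.

```python
def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_scores)),
                   key=lambda e: router_scores[e], reverse=True)
    return order[:k]

class FrozenRouter:
    """Router whose weights are supplied once and never updated (assumption:
    in practice they would come from an already trained MoE model)."""

    def __init__(self, weights):
        self.weights = tuple(tuple(row) for row in weights)  # immutable

    def route(self, token_features, k=1):
        scores = [sum(w * x for w, x in zip(row, token_features))
                  for row in self.weights]
        return top_k_experts(scores, k)

def moe_forward(token_features, router, expert_scales, k=1):
    """Toy MoE layer: each expert scales its input; chosen outputs are averaged."""
    chosen = router.route(token_features, k)
    outputs = [[expert_scales[e] * x for x in token_features] for e in chosen]
    return [sum(col) / len(chosen) for col in zip(*outputs)], chosen

router = FrozenRouter([[1.0, 0.0], [0.0, 1.0]])  # two experts, two features
expert_scales = [2.0, 3.0]                       # the only trainable parameters here
output, chosen = moe_forward([0.2, 0.9], router, expert_scales)

# A "training step" changes expert parameters only; the routing decision is unchanged.
expert_scales[1] = 3.5
_, chosen_after = moe_forward([0.2, 0.9], router, expert_scales)
```

Since the token-to-expert assignment does not move during training in this setup, it can in principle be known ahead of time, which is the kind of structural prior the abstract says enables further throughput optimizations.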

2603.05450 2026-05-26 cs.AI cs.CL 版本更新

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

分布式部分信息谜题:在认知不对称下检验共同基础的构建

Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

发表机构 * Brandeis University(布兰迪斯大学) Colorado State University(科罗拉多州立大学)

AI总结 提出分布式部分信息谜题(DPIP)任务,收集多模态数据集,并评估大语言模型与动态认知逻辑方法在追踪信念状态和共同基础构建上的表现。

Comments 10 pages, 4 figures

详情
Journal ref
Proceedings of COLING-LREC 2026
AI中文摘要

建立共同基础(一组共享的信念和相互认可的事实)对于协作至关重要,但仍然是当前AI系统面临的挑战,尤其是在多模态、多方设置中,协作者带来不同的信息。我们引入了分布式部分信息谜题(DPIP),这是一个协作构建任务,在认知不对称下引发丰富的多模态交流。我们提供了这些交互的多模态数据集,并在语音、手势和动作模态上进行注释和时间对齐,以支持对命题内容和信念动态的推理。然后,我们评估了两种建模共同基础(CG)的范式:(1)最先进的大语言模型(LLMs),被提示从多模态更新中推断共享信念,以及(2)基于动态认知逻辑(DEL)的公理流水线,逐步执行相同的任务。在注释的DPIP数据上的结果表明,它对现代LLMs跟踪任务进展和信念状态的能力构成了挑战。

英文摘要

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

2602.23916 2026-05-26 cs.CV cs.AI 版本更新

Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation

基于拓扑驱动的医学基础模型分割迁移性估计

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen

发表机构 * Peking University(北京大学) Hohai University(河海大学) Beijing Normal University-Hong Kong Baptist University United International College(北京师范大学-香港浸会大学联合国际学院) National Institute of Health Data Science, Peking University(健康数据科学国家研究院,北京大学) Institute of Medical Technology, Peking University(北京大学医学技术研究院) State Key Laboratory of General Artificial Intelligence, Peking University(通用人工智能国家重点实验室,北京大学)

AI总结 提出拓扑驱动迁移性估计框架,通过全局表示拓扑散度、局部边界感知拓扑一致性和任务自适应融合,无需微调即可高效选择医学基础模型,在OpenMind基准上加权Kendall指标相对提升约31%。

详情
AI中文摘要

大规模自监督学习(SSL)的出现产生了大量的医学基础模型。然而,为特定分割任务选择最优的医学基础模型仍然是一个计算瓶颈。现有的迁移性估计(TE)指标主要针对分类任务设计,依赖于全局统计假设,无法捕捉密集预测所需的拓扑复杂性。我们提出了一种新颖的拓扑驱动迁移性估计框架,评估流形可处理性而非统计重叠。我们的方法引入了三个组成部分:(1)全局表示拓扑散度(GRTD),利用最小生成树量化特征-标签结构同构性;(2)局部边界感知拓扑一致性(LBTC),专门在关键解剖边界评估流形可分离性;(3)任务自适应融合,根据目标任务的语义基数动态整合全局和局部指标。在跨不同解剖目标和SSL基础模型的大规模OpenMind基准上验证,我们的方法在加权Kendall指标上显著优于最先进的基线,相对提升约31%,提供了一种鲁棒的、无需训练的代理,用于高效模型选择而无需微调成本。代码将在接收后公开。

英文摘要

The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall metric, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.

2602.18956 2026-05-26 cs.AI 版本更新

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

INDUCTION: 一阶逻辑中的有限结构概念合成

Serafim Batzoglou

发表机构 * Independent Researcher(独立研究者)

AI总结 提出INDUCTION基准,用于一阶逻辑中有限结构的概念合成,通过精确模型检查验证公式的正确性,并发现低冗余公式在未见世界上的泛化能力更强。

详情
AI中文摘要

我们引入了INDUCTION,这是一个用于一阶逻辑中有限结构概念合成的基准。给定具有外延标记目标谓词的小型有限关系世界,模型必须输出一个单一的一阶逻辑公式,该公式统一解释所有世界中的目标,并通过精确模型检查验证其正确性。该基准包括三个设置:FullObs、CI(对比)和EC(存在性完成),并对公式冗余进行惩罚。我们发现了尖锐的难度梯度、持久的困难结构族,并观察到低冗余公式在未见世界上的泛化能力远优于高冗余公式。最新的精英模型在不同任务和性能指标上表现出定性不同的行为,暗示了它们不同的概念泛化策略。

英文摘要

We introduce INDUCTION, a benchmark for finite-structure concept synthesis in first-order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first-order logical formula that explains the target uniformly across worlds, with correctness verified via exact model checking. The benchmark includes three regimes, FullObs, CI (contrastive), and EC (existential completion), and penalizes formula bloat. We find sharp difficulty gradients and persistent hard structural families, and observe that low-bloat formulas generalize far better on held-out worlds. Elite recent models show qualitatively different behaviors across tasks and performance metrics, hinting at their different strategies of concept generalization.
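Exact model checking over a finite world amounts to evaluating the formula by enumerating the domain for every quantifier. The sketch below shows the idea with a hypothetical tuple-based formula encoding and two made-up worlds; the encoding, the relation name `E`, and the helper names are assumptions for illustration and are not the benchmark's own format.

```python
def holds(formula, world, env):
    """Evaluate a first-order formula in a finite world under assignment env."""
    op = formula[0]
    if op == "rel":                       # ("rel", name, var1, var2, ...)
        _, name, *args = formula
        return tuple(env[v] for v in args) in world["relations"][name]
    if op == "not":
        return not holds(formula[1], world, env)
    if op == "and":
        return holds(formula[1], world, env) and holds(formula[2], world, env)
    if op == "or":
        return holds(formula[1], world, env) or holds(formula[2], world, env)
    if op == "exists":                    # ("exists", var, body)
        _, var, body = formula
        return any(holds(body, world, {**env, var: d}) for d in world["domain"])
    if op == "forall":                    # ("forall", var, body)
        _, var, body = formula
        return all(holds(body, world, {**env, var: d}) for d in world["domain"])
    raise ValueError(f"unknown operator: {op}")

def explains(formula, world, free_var="x"):
    """True if the formula's extension equals the labeled target predicate."""
    predicted = {d for d in world["domain"]
                 if holds(formula, world, {free_var: d})}
    return predicted == world["target"]

# Candidate concept: T(x) <-> exists y. E(x, y)  ("x has an outgoing edge").
candidate = ("exists", "y", ("rel", "E", "x", "y"))

worlds = [
    {"domain": {0, 1, 2}, "relations": {"E": {(0, 1), (1, 2)}}, "target": {0, 1}},
    {"domain": {0, 1}, "relations": {"E": {(1, 1)}}, "target": {1}},
]
all_ok = all(explains(candidate, w) for w in worlds)
```

A formula counts as correct only if it reproduces the target extension in every world, so a single mismatching element in any world rejects it.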

2602.12224 2026-05-26 cs.GT cs.AI econ.TH 版本更新

Two-Sided Time-Independent Regret for Matching Markets with Limited Interviews

有限面试匹配市场的双面时间无关遗憾

Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, Mohammad Hajiesmaili

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 针对面试次数有限的匹配市场,提出利用面试作为提示进行双面学习,并通过策略性延迟纠正早期错误,实现与时间无关的遗憾界。

详情
AI中文摘要

双面匹配平台依赖双方的偏好,但参与者只能评估一小部分潜在伙伴。在实践中,他们使用低成本的匹配前筛选(例如面试、个人资料浏览或试用任务)在提交申请和录用之前形成有噪声的印象。我们研究了带有面试的匹配市场中的赌博机学习,将这些交互建模为查询的提示(hints)~\citep{DBLP:conf/innovations/BhaskaraGIKM23},这些提示向双方揭示部分偏好信息,同时限制后续申请。我们的框架还允许企业方的不确定性:企业像代理人一样学习自己的偏好,并可能犯早期招聘错误。为了解决这个问题,我们引入了策略性延迟(strategic deferral),这是一种企业方行动,允许临时空缺,纠正过早的承诺,并在粗略匿名反馈下实现去中心化学习。我们为中心化和去中心化市场设计了算法,并表明每轮恒定数量的面试足以实现与时间无关的遗憾,优于已知没有面试时的$O(\log T)$保证。我们的界是接近最优的:中心化保证在信息论下界的$m$倍以内,而去中心化算法在结构化市场中达到多项式因子,在一般市场中仍然与时间无关。

英文摘要

Two-sided matching platforms rely on preferences from both sides, yet participants can evaluate only a small fraction of potential partners. In practice, they use low-cost pre-match screening, e.g., interviews, profile views, or trial tasks, to form noisy impressions before committing to applications and offers. We study bandit learning in matching markets with interviews, modeling these interactions as queried \emph{hints}~\citep{DBLP:conf/innovations/BhaskaraGIKM23} that reveal partial preference information to both sides while constraining subsequent applications. Our framework also allows firm-side uncertainty: firms, like agents, learn their preferences and may make early hiring mistakes. To address this, we introduce strategic deferral, a firm-side action that permits temporary vacancy, corrects premature commitments, and enables decentralized learning under coarse anonymous feedback. We design algorithms for centralized and decentralized markets and show that a constant number of interviews per round suffices for horizon-independent regret, improving over the $O(\log T)$ guarantees known without interviews. Our bounds are near-optimal: the centralized guarantee is within a factor $m$ of an information-theoretic lower bound, while decentralized algorithms match it up to polynomial factors in structured markets and remain horizon-independent in general markets.

2602.04360 2026-05-26 cs.LG cs.AI cs.CY 版本更新

Counterfactual Explanations for Hypergraph Neural Networks

超图神经网络的反事实解释

Fabiano Veglianti, Lorenzo Antonelli, Gabriele Tolomei

发表机构 * Department of Computer Control and Management Engineering, Sapienza University(计算机控制与管理工程系,萨皮恩扎大学) Department of Computer Science, Sapienza University(计算机科学系,萨皮恩扎大学)

AI总结 提出CF-HyperGNNExplainer方法,通过最小结构变化生成反事实超图,以解释超图神经网络的预测决策。

详情
AI中文摘要

超图神经网络(HGNNs)有效建模了许多现实系统中的高阶交互,但仍难以解释,限制了其在高风险场景中的部署。我们引入了CF-HyperGNNExplainer,一种针对HGNNs的反事实解释方法,该方法识别改变模型预测所需的最小结构变化。该方法通过仅限于删除节点-超边关联或删除超边的可操作编辑生成反事实超图,产生简洁且结构上有意义的解释。在超图基准数据集上的大量实验表明,CF-HyperGNNExplainer生成了有效且简洁的反事实,突出了对HGNN决策最关键的高阶关系。

英文摘要

Hypergraph neural networks (HGNNs) effectively model higher-order interactions in many real-world systems but remain difficult to interpret, limiting their deployment in high-stakes settings. We introduce CF-HyperGNNExplainer, a counterfactual explanation method for HGNNs that identifies the minimal structural changes required to alter a model's prediction. The method generates counterfactual hypergraphs using actionable edits limited to removing node-hyperedge incidences or deleting hyperedges, producing concise and structurally meaningful explanations. Extensive experiments on hypergraph benchmark datasets show that CF-HyperGNNExplainer generates valid and concise counterfactuals, highlighting the higher-order relations most critical to HGNN decisions.
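The notion of a minimal counterfactual edit can be made concrete with a brute-force search over node-hyperedge incidence removals. This is only an illustration of the objective: the exhaustive search replaces whatever optimisation the paper actually uses, and `toy_predict` is a hypothetical stand-in for a trained HGNN.

```python
from itertools import combinations

def neighbours(hyperedges, node):
    """Nodes that share at least one hyperedge with `node`."""
    shared = {v for members in hyperedges.values() if node in members
              for v in members}
    return shared - {node}

def toy_predict(hyperedges, node):
    """Stand-in for an HGNN: class 1 if the node has at least 3 neighbours."""
    return int(len(neighbours(hyperedges, node)) >= 3)

def remove_incidences(hyperedges, removed):
    """Return a copy of the hypergraph with the given (edge, node) pairs removed."""
    edited = {e: set(members) for e, members in hyperedges.items()}
    for edge, node in removed:
        edited[edge].discard(node)
    return edited

def minimal_counterfactual(hyperedges, node, predict=toy_predict):
    """Smallest set of incidence removals that flips the prediction, or None."""
    original = predict(hyperedges, node)
    incidences = sorted((e, v) for e, members in hyperedges.items()
                        for v in members)
    for k in range(1, len(incidences) + 1):      # try smaller edits first
        for removed in combinations(incidences, k):
            if predict(remove_incidences(hyperedges, removed), node) != original:
                return list(removed)
    return None

hyperedges = {"e1": {"a", "b", "c"}, "e2": {"a", "d"}}
edit = minimal_counterfactual(hyperedges, "a")
```

Searching by increasing edit size guarantees minimality but is exponential in the number of incidences, so it is only usable on tiny examples like this one.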

2602.02605 2026-05-26 cs.NE cs.AI cs.CL q-bio.NC 版本更新

Fine-Tuning Language Models to Know What They Know

微调语言模型使其了解自身所知

Sangjun Park, Elliot Meyerson, Xin Qiu, Risto Miikkulainen

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Cognizant AI Lab(Cognizant人工智能实验室)

AI总结 本文提出一种框架,通过进化策略对齐方法(ESMA)在控制偏差的同时提升大语言模型的元认知能力,并在未见数据集、语言和新知识上展现出鲁棒泛化性。

Comments Preprint

详情
AI中文摘要

评估大语言模型(LLMs)的真实元认知能力因偏差和启发式方法而困难。本文提出一个框架,在控制这些偏差的同时测量和增强LLM的元认知能力。建立了使用$d'_{\rm type2}$指标的测量方法以隔离元认知能力。提出了元认知对齐进化策略(ESMA),在未见数据集、语言和新获取的知识上展现出鲁棒泛化性。最后,参数分析表明这些改进由一组稀疏参数驱动,为定向元认知优化提供了新途径。

英文摘要

Evaluating true metacognition in Large Language Models (LLMs) is difficult due to biases and heuristics. This paper presents a framework to measure and enhance LLM metacognition while controlling for these biases. A measurement method using the $d'_{\rm type2}$ metric is established to isolate metacognitive ability. The Evolution Strategy for Metacognitive Alignment (ESMA) is proposed, demonstrating robust generalization across unseen datasets, languages, and newly acquired knowledge. Finally, parameter analysis reveals that these improvements are driven by a sparse set of parameters, offering new pathways for targeted metacognitive optimization.
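In signal detection theory, a type-2 sensitivity index is commonly computed as z(hit rate) minus z(false-alarm rate), where a hit is high confidence on a correct answer and a false alarm is high confidence on an incorrect one. The sketch below follows that textbook definition; the binary confidence labels and the log-linear correction are assumptions, and the paper's exact estimator for $d'_{\rm type2}$ may differ.

```python
from statistics import NormalDist

def d_prime_type2(correct, confident):
    """Type-2 d': z(hit rate) - z(false-alarm rate).

    correct[i]   -- whether answer i was right
    confident[i] -- whether the model reported high confidence on answer i
    """
    hits = sum(1 for c, k in zip(correct, confident) if c and k)
    false_alarms = sum(1 for c, k in zip(correct, confident) if not c and k)
    n_correct = sum(1 for c in correct if c)
    n_incorrect = len(correct) - n_correct
    # Log-linear correction (an assumption) keeps both rates away from 0 and 1,
    # where the inverse normal CDF is undefined.
    hit_rate = (hits + 0.5) / (n_correct + 1)
    fa_rate = (false_alarms + 0.5) / (n_incorrect + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Made-up data: confidence mostly tracks correctness.
correct = [True] * 6 + [False] * 4
confident = [True, True, True, True, True, False, True, False, False, False]
sensitivity = d_prime_type2(correct, confident)

# Confidence unrelated to correctness gives a value of zero.
chance = d_prime_type2([True, True, False, False], [True, False, True, False])
```

Because the index compares two rates rather than raw confidence, a model that is simply confident about everything does not score well, which is the sense in which the measure controls for response bias.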

2602.02544 2026-05-26 cs.LG cs.AI 版本更新

SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models

SPA-Cache: 扩散语言模型中的自适应缓存奇异代理

Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Yongcheng Jing, Dacheng Tao

发表机构 * College of Computing(计算学院) Data Science, Nanyang Technological University, Singapore, Singapore(数据科学,南洋理工大学,新加坡,新加坡)

AI总结 针对扩散语言模型因非因果特性无法使用标准KV缓存导致计算开销大的问题,提出SPA-Cache方法,通过低维奇异代理识别关键令牌并自适应分配缓存预算,实现高达8倍吞吐量提升和2-4倍加速。

Comments Accepted by ICML 2026. The code repository is available at https://github.com/wenhao728/spa-cache

详情
AI中文摘要

尽管扩散语言模型(DLM)为自回归范式提供了一种灵活、任意顺序的替代方案,但其非因果特性排除了标准的KV缓存,迫使在每个解码步骤进行昂贵的隐藏状态重新计算。现有的DLM缓存方法通过选择性隐藏状态更新来降低这一成本;然而,它们仍然受限于(i)昂贵的逐令牌更新识别启发式方法和(ii)僵化的统一预算分配,未能考虑异构的隐藏状态动态。为了解决这些挑战,我们提出了SPA-Cache,它在DLM缓存中联合优化了更新识别和预算分配。首先,我们推导出一个低维奇异代理,能够在低维子空间中识别更新关键令牌,大幅降低更新识别的开销。其次,我们引入一种自适应策略,在不降低生成质量的情况下,为稳定层分配更少的更新。这些贡献共同显著提高了DLM的效率,相比原始解码实现了高达8倍的吞吐量提升,相比现有缓存基线实现了2-4倍的加速。

英文摘要

While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost through selective hidden state updates; however, they are still limited by (i) costly token-wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA-Cache, which jointly optimizes update identification and budget allocation in DLM caching. First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an $8\times$ throughput improvement over vanilla decoding and a $2$--$4\times$ speedup over existing caching baselines.

2602.02474 2026-05-26 cs.CL cs.AI cs.LG 版本更新

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill:面向自进化智能体的可学习与进化记忆技能

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

发表机构 * Nanyang Technological University(南洋理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Tsinghua University(清华大学)

AI总结 提出MemSkill框架,将记忆操作转化为可学习和可进化的技能,通过控制器选择技能、执行器生成记忆、设计者进化技能集,形成闭环提升LLM智能体任务性能。

Comments Code is available at https://github.com/ViktorAxelsen/MemSkill

详情
AI中文摘要

大多数大语言模型(LLM)智能体记忆系统依赖少量静态、手工设计的操作来提取记忆。这些固定程序硬编码了关于存储内容和如何修订记忆的人类先验知识,使其在多样化的交互模式下僵化,并在长历史记录上效率低下。为此,我们提出MemSkill,将这些操作重新定义为可学习和可进化的记忆技能,即从交互轨迹中提取、整合和修剪信息的结构化可重用例程。受智能体技能设计哲学的启发,MemSkill采用一个控制器,学习选择少量相关技能,并与基于LLM的执行器配对,生成技能引导的记忆。除了学习技能选择,MemSkill引入一个设计者,定期审查所选技能产生错误或不完整记忆的困难案例,并通过提出改进和新技能来进化技能集。共同地,MemSkill形成了一个闭环流程,改进了技能选择策略和技能集本身。在LoCoMo、LongMemEval、HotpotQA和ALFWorld上的实验表明,MemSkill在强基线上提高了任务性能,并在不同设置下具有良好的泛化能力。进一步分析揭示了技能如何进化,为LLM智能体更自适应、自进化的记忆管理提供了见解。

英文摘要

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present \textbf{MemSkill}, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emph{controller} that learns to select a small set of relevant skills, paired with an LLM-based \emph{executor} that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a \emph{designer} that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

2601.22925 2026-05-26 cs.IR cs.AI cs.LG 版本更新

BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

BEAR:面向大语言模型推荐中束搜索感知的优化

Weiqin Yang, Bohao Wang, Zhenxiang Xu, Jiawei Chen, Shengjia Zhang, Jingbang Chen, Canghong Jin, Can Wang

发表机构 * Zhejiang University(浙江大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Hangzhou City University(浙大城市学院)

AI总结 针对监督微调与束搜索推理之间的不一致性,提出BEAR正则化方法,通过确保正例每个token在解码步骤中排名前B来避免过早剪枝,显著提升推荐性能。

Comments Accepted by SIGIR 2026

详情
AI中文摘要

近年来,利用大语言模型(LLM)进行推荐的研究迅速增长。这些方法通常采用监督微调(SFT)使LLM适应推荐场景,并在推理时使用束搜索高效检索前B个推荐项。然而,我们发现了关键的训练-推理不一致性:虽然SFT优化正例的整体概率,但即使这些项具有高整体概率,也不能保证它们会被束搜索检索到。由于贪心剪枝机制,束搜索可能会在正例的前缀概率不足时过早丢弃它。为了解决这种不一致性,我们提出了BEAR(束搜索感知正则化),一种新的微调目标,在训练中显式考虑束搜索行为。BEAR不直接模拟每个训练实例的束搜索(计算代价过高),而是强制执行一个宽松的必要条件:正例中的每个token在每个解码步骤中必须排在前B个候选token中。该目标有效降低了错误剪枝的风险,同时与标准SFT相比仅增加可忽略的计算开销。在四个真实世界数据集上的大量实验表明,BEAR显著优于强基线。代码可在https://github.com/Tiny-Snow/BEAR-SIGIR-2026获取。

英文摘要

Recent years have seen a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ supervised fine-tuning (SFT) to adapt LLMs to recommendation scenarios, and utilize beam search during inference to efficiently retrieve $B$ top-ranked recommended items. However, we identify a critical training-inference inconsistency: while SFT optimizes the overall probability of positive items, it does not guarantee that such items will be retrieved by beam search even if they possess high overall probabilities. Due to the greedy pruning mechanism, beam search can prematurely discard a positive item once its prefix probability is insufficient. To address this inconsistency, we propose BEAR (Beam-SEarch-Aware Regularization), a novel fine-tuning objective that explicitly accounts for beam search behavior during training. Rather than directly simulating beam search for each instance during training, which is computationally prohibitive, BEAR enforces a relaxed necessary condition: each token in a positive item must rank within the top-$B$ candidate tokens at each decoding step. This objective effectively mitigates the risk of incorrect pruning while incurring negligible computational overhead compared to standard SFT. Extensive experiments across four real-world datasets demonstrate that BEAR significantly outperforms strong baselines. Code is available at https://github.com/Tiny-Snow/BEAR-SIGIR-2026 .
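The relaxed necessary condition, that every token of the positive item ranks within the top-$B$ candidates at its decoding step, can be checked directly from per-step logits. The sketch below does that and adds a simple hinge-style penalty; the penalty is a hypothetical surrogate for illustration and is not claimed to be BEAR's actual regularizer. The per-step logits are assumed to be computed with the ground-truth prefix as context, and all numbers are made up.

```python
def token_rank(step_logits, token):
    """1-based rank of `token` among all candidate tokens at one decoding step."""
    target = step_logits[token]
    return 1 + sum(1 for score in step_logits.values() if score > target)

def satisfies_top_b(per_step_logits, item_tokens, beam_width):
    """True if every positive token is within the top-B at its step."""
    return all(token_rank(logits, tok) <= beam_width
               for logits, tok in zip(per_step_logits, item_tokens))

def hinge_penalty(per_step_logits, item_tokens, beam_width):
    """Hypothetical surrogate: total margin by which positive tokens fall
    below the B-th best score at their step (zero when the condition holds)."""
    total = 0.0
    for logits, tok in zip(per_step_logits, item_tokens):
        b_th_best = sorted(logits.values(), reverse=True)[beam_width - 1]
        total += max(0.0, b_th_best - logits[tok])
    return total

item_tokens = ["the", "matrix"]
per_step_logits = [
    {"the": 2.0, "a": 2.5, "star": 1.0},            # step 1: "the" is ranked 2nd
    {"matrix": 0.5, "godfather": 1.5, "wars": 1.0},  # step 2: "matrix" is ranked 3rd
]
ok_b2 = satisfies_top_b(per_step_logits, item_tokens, beam_width=2)
ok_b3 = satisfies_top_b(per_step_logits, item_tokens, beam_width=3)
penalty_b2 = hinge_penalty(per_step_logits, item_tokens, beam_width=2)
```

With a beam width of 2 the item would be at risk of pruning at the second step even if its overall sequence probability were high, which is the training-inference mismatch the abstract describes.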

2601.22709 2026-05-26 cs.CV cs.AI 版本更新

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

基于置信度蒸馏的门控关系对齐用于高效视觉语言模型

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

发表机构 * Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland(信息技术与电气工程系,苏黎世联邦理工学院,苏黎世,瑞士) Qualcomm AI Research, Amsterdam, the Netherlands(高通人工智能研究,阿姆斯特丹,荷兰) Department of Electrical, Electronic and Information Engineering, University of Bologna, Bologna, Italy(电气、电子与信息工程系,博洛尼亚大学,博洛尼亚,意大利) School of Electrical and Electronic Engineering(电气与电子工程学院)

AI总结 提出GRACE框架,通过信息瓶颈原理统一知识蒸馏与量化感知训练,使用置信度门控解耦蒸馏、关系中心核对齐和自适应控制器,在INT4量化下实现性能超越FP16基线并接近教师模型,同时显著降低内存和提升吞吐量。

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

视觉语言模型(VLM)具有强大的多模态性能,但部署成本高,且训练后量化通常会导致显著的精度损失。尽管有潜力,但针对VLM的量化感知训练仍未得到充分探索。我们提出GRACE,一个在信息瓶颈原则下统一知识蒸馏和量化感知训练的框架:量化约束信息容量,而蒸馏指导在此预算内保留什么。将教师视为任务相关信息的代理,我们引入置信度门控解耦蒸馏以过滤不可靠的监督,关系中心核对齐以传递视觉标记结构,以及通过拉格朗日松弛实现的自适应控制器以平衡保真度与容量约束。在LLaVA和Qwen系列的大量基准测试中,我们的INT4模型始终优于FP16基线(例如,LLaVA-1.5-7B:SQA上70.1 vs. 66.8;Qwen2-VL-2B:MMBench上76.9 vs. 72.6),几乎匹配教师性能。使用真实的INT4内核,我们实现了3倍的吞吐量,内存减少54%。这一原则性框架显著优于现有量化方法,使GRACE成为资源受限部署的有力解决方案。代码和数据可在https://github.com/ForeverBlue816/GRACE获取。

英文摘要

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on the LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3$\times$ throughput with a 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment. Code and data are available at: https://github.com/ForeverBlue816/GRACE.

2601.21601 2026-05-26 cs.LG cs.AI 版本更新

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

动力学揭示结构:挑战线性传播假设

Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh

发表机构 * University of Tübingen(图宾根大学)

AI总结 通过关系代数研究神经网络中线性传播假设的几何极限,证明其在对合运算(否定、逆)上可行,但在组合运算上存在根本性障碍,导致特征映射崩溃,并解释知识编辑失败、反转诅咒和多跳推理等问题的共同根源。

详情
AI中文摘要

神经网络通过一阶参数更新进行自适应,但尚不清楚这种更新是否保持逻辑一致性。我们研究了线性传播假设(LPA)的几何极限,该假设认为局部更新能够连贯地传播到逻辑结论。为了形式化这一点,我们采用关系代数,研究关系的三种核心运算:否定翻转真值、逆交换参数顺序、组合链接关系。对于否定和逆,我们证明保证与方向无关的一阶传播需要一种张量分解,将实体对上下文与关系内容分离。然而,对于组合,我们识别出一个根本性障碍。我们证明组合可归结为合取,并证明任何在线性特征上良好定义的合取必须是双线性的。由于双线性与否定不兼容,这迫使特征映射崩溃。这些结果表明,知识编辑失败、反转诅咒和多跳推理可能源于LPA固有的共同结构限制。

英文摘要

Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

2601.21463 2026-05-26 cs.SD cs.AI 版本更新

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

通过先验增强的音频大语言模型统一语音编辑检测与内容定位

Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

发表机构 * Key Laboratory of Aerospace Information Security and Trusted Computing(航空信息安全与可信计算重点实验室) School of Cyber Science and Engineering(网络安全工程学院) Wuhan University(武汉大学) Independent Researcher(独立研究员) School of Computer Science and Technology(计算机科学与技术学院) Anhui University(安徽大学) Communication University of China(中国传媒大学) Beihang University(北京航空航天大学)

AI总结 提出基于音频大语言模型的统一框架,通过生成式方法联合处理语音编辑检测和内容定位,并引入先验增强策略和声学一致性损失以提升性能。

详情
AI中文摘要

现有的语音编辑检测(SED)数据集主要使用手动拼接或有限的编辑操作构建,导致多样性受限且对真实编辑场景的覆盖不足。同时,当前的SED方法严重依赖帧级监督来检测可观察的声学异常,这从根本上限制了它们处理删除型编辑的能力,其中被操纵的内容完全从信号中消失。为了解决这些挑战,我们提出了一个统一框架,通过基于音频大语言模型(Audio LLMs)的生成式公式,将语音编辑检测和内容定位连接起来。我们首先引入了AiEdit(https://huggingface.co/datasets/JunXueTech/AiEdit),这是一个大规模双语数据集(约140小时),使用最先进的端到端语音编辑系统覆盖添加、删除和修改操作,为现代威胁提供了更真实的基准。在此基础上,我们将SED重新定义为结构化文本生成任务,实现了对编辑类型识别和内容定位的联合推理。为了增强生成模型在声学证据中的基础,我们提出了一种先验增强的提示策略,注入从帧级检测器导出的词级概率线索。此外,我们引入了一种声学一致性感知损失,在潜在空间中明确强制正常和异常声学表示之间的分离。实验结果表明,所提出的方法在检测和定位任务上均持续优于现有方法。

英文摘要

Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit (https://huggingface.co/datasets/JunXueTech/AiEdit), a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state-of-the-art end-to-end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit-type identification and content localization. To enhance the grounding of generative models in acoustic evidence, we propose a prior-enhanced prompting strategy that injects word-level probabilistic cues derived from a frame-level detector. Furthermore, we introduce an acoustic consistency-aware loss that explicitly enforces the separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across both detection and localization tasks.

2601.15544 2026-05-26 cs.LG cs.AI 版本更新

RDumb++: Drift-Aware Continual Test-Time Adaptation

RDumb++:漂移感知的持续测试时自适应

Himanshu Mishra

发表机构 * Department of Computer Science(计算机科学系) University of British Columbia(不列颠哥伦比亚大学)

AI总结 针对持续测试时自适应中分布快速变化或长期漂移导致性能崩溃的问题,提出RDumb++方法,通过熵和KL散度漂移检测机制与自适应重置策略,在CCC基准上实现约3%的绝对准确率提升。

详情
AI中文摘要

持续测试时自适应(CTTA)旨在仅使用传入的无标签数据流在部署期间更新预训练模型。尽管先前的方法如Tent、EATA等在短期演化偏移下提供了有意义的改进,但当测试分布快速变化或时间跨度极长时,它们表现不佳。CCC基准测试体现了这一挑战,模型在包含750万样本且不断变化损坏类型和严重程度的数据流上运行。我们提出RDumb++,它是RDumb的合理扩展,引入了两种漂移检测机制,即基于熵的漂移评分和KL散度漂移评分,以及自适应重置策略。这些机制使模型能够检测累积的自适应何时变得有害,并在预测崩溃发生前恢复。在包含三种速度和三种种子的CCC-medium(九次运行,每次包含一百万样本)上,RDumb++始终优于RDumb,在整个数据流中实现约3%的绝对准确率提升,同时保持稳定的自适应。关于漂移阈值和重置强度的消融实验进一步表明,漂移感知重置对于防止崩溃和实现可靠的长期CTTA至关重要。

英文摘要

Continual Test-Time Adaptation (CTTA) seeks to update a pretrained model during deployment using only the incoming, unlabeled data stream. Although prior approaches such as Tent and EATA provide meaningful improvements under short evolving shifts, they struggle when the test distribution changes rapidly or over extremely long horizons. This challenge is exemplified by the CCC benchmark, where models operate over streams of 7.5M samples with continually changing corruption types and severities. We propose RDumb++, a principled extension of RDumb that introduces two drift-detection mechanisms, namely entropy-based drift scoring and KL-divergence drift scoring, together with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and to recover before prediction collapse occurs. Across CCC-medium with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, yielding approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream. Ablation experiments on drift thresholds and reset strengths further show that drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA.
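The idea of drift-aware resetting can be sketched in a few lines. This is an illustrative sketch only, not the RDumb++ implementation: the abstract does not specify how the scores are computed, so the `DriftMonitor` class, the use of batch-mean predictions for the KL score, the exponential moving average, and both thresholds are assumptions made here for illustration.

```python
import math
from typing import List, Optional, Sequence


def mean_entropy(batch_probs: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (nats) over a batch of predicted distributions."""
    total = 0.0
    for probs in batch_probs:
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_probs)


def mean_prediction(batch_probs: Sequence[Sequence[float]]) -> List[float]:
    """Class-wise average of the predicted distributions in a batch."""
    n = len(batch_probs)
    return [sum(row[j] for row in batch_probs) / n for j in range(len(batch_probs[0]))]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


class DriftMonitor:
    """Hypothetical monitor: signals a reset when either drift score exceeds its threshold."""

    def __init__(self, entropy_threshold: float, kl_threshold: float, momentum: float = 0.9):
        self.entropy_threshold = entropy_threshold
        self.kl_threshold = kl_threshold
        self.momentum = momentum
        self.ref_entropy: Optional[float] = None
        self.ref_marginal: Optional[List[float]] = None

    def update(self, batch_probs: Sequence[Sequence[float]]) -> bool:
        """Return True if the caller should reset the adapted model."""
        entropy = mean_entropy(batch_probs)
        marginal = mean_prediction(batch_probs)
        if self.ref_entropy is None or self.ref_marginal is None:
            # First batch after (re)initialisation only sets the references.
            self.ref_entropy, self.ref_marginal = entropy, marginal
            return False
        entropy_score = abs(entropy - self.ref_entropy)
        kl_score = kl_divergence(marginal, self.ref_marginal)
        if entropy_score > self.entropy_threshold or kl_score > self.kl_threshold:
            # Drift detected: forget the references so they are rebuilt after the reset.
            self.ref_entropy, self.ref_marginal = None, None
            return True
        m = self.momentum
        self.ref_entropy = m * self.ref_entropy + (1.0 - m) * entropy
        self.ref_marginal = [m * r + (1.0 - m) * c for r, c in zip(self.ref_marginal, marginal)]
        return False


monitor = DriftMonitor(entropy_threshold=0.3, kl_threshold=0.2)
stable = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
collapsed = [[0.98, 0.01, 0.01], [0.98, 0.01, 0.01]]
for batch in (stable, stable, collapsed):
    print(monitor.update(batch))
```

In a real adaptation loop, a `True` return would trigger restoring the pretrained weights (fully or partially, depending on the reset strength).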

2601.06870 2026-05-26 cs.LG cs.AI 版本更新

QASA: Quality-Aware Semantic Augmentation for Robust Multimodal Sentiment Analysis

QASA: 面向鲁棒多模态情感分析的质量感知语义增强

Jiazhang Liang, Jianheng Dai, Miaosen Luo, Menghua Jiang, Sijie Mai

发表机构 * School of Computer Science, South China Normal University(华南师范大学计算机学院)

AI总结 提出QASA框架,利用扩散模型生成视觉和听觉增强样本,并通过解耦质量感知评分模块分配训练权重,以解决高质量数据稀缺问题,提升多模态情感分析的鲁棒性和泛化能力。

Comments 11 pages, 4 figures

详情
AI中文摘要

多模态大语言模型在多模态情感分析中展现出强大的语义表示能力。然而,由于高质量训练数据的稀缺,它们学习稳定且可泛化的多模态特征的能力受到限制。为了解决这一问题,我们提出了QASA(质量感知语义增强),该方法使用扩散模型生成增强的视觉和听觉样本,从而扩大训练数据集并支持多模态学习。生成的样本质量可能参差不齐,并可能出现跨模态不一致。为此,我们引入了一个解耦的质量感知评分模块,根据每个增强样本的可靠性分配训练权重。这种方法减少了低质量数据的影响,有助于更稳定和鲁棒的模型训练。该框架结合了扩散模型的生成能力和多模态大模型的语义推理能力,提供了一种无需人工标注的自动数据增强策略,同时在有限高质量数据下提高了泛化性和鲁棒性。在CH-SIMS数据集上的实验表明,QASA在五类准确率(Acc5)和二类准确率(Acc2)上分别相对提升了18.0%和5.9%,并且在CMU-MOSI和MUStARD基准测试上也优于现有方法。

英文摘要

Multimodal large language models have demonstrated strong ability in capturing semantic representations for multimodal sentiment analysis. Their capacity to learn stable and generalizable multimodal features is limited, however, by the scarcity of high-quality training data. To address this, we propose QASA (Quality-Aware Semantic Augmentation), which uses diffusion models to generate augmented visual and auditory samples, thereby enlarging the training dataset and supporting multimodal learning. The generated samples can vary in quality and may exhibit cross-modal inconsistencies. To manage this, we introduce a decoupled quality-aware scoring module that assigns training weights based on the reliability of each augmented sample. This approach reduces the influence of low-quality data and contributes to more stable and robust model training. The framework combines the generative capabilities of diffusion models with the semantic reasoning of multimodal large models, providing an automated data augmentation strategy that does not require human annotation while improving generalization and robustness under limited high-quality data. Experiments on the CH-SIMS dataset show that QASA yields a relative increase of 18.0% and 5.9% in five-class accuracy (Acc5) and binary accuracy (Acc2), respectively, and it also outperforms existing methods on the CMU-MOSI and MUStARD benchmarks.

2601.03014 2026-05-26 cs.CL cs.AI 版本更新

SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

SentGraph: 用于多跳检索增强问答的层次化句子图

Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao, Ziwen Wang, Qi Song, Xiangyang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Hefei University of Technology(合肥工业大学)

AI总结 提出SentGraph,一种句子级图RAG框架,通过构建层次化句子图并建模细粒度逻辑关系,解决多跳问答中证据链不完整的问题。

详情
AI中文摘要

传统的检索增强生成(RAG)通过大型语言模型有效支持单跳问答,但在需要结合多个文档证据的多跳问答任务中面临显著限制。现有的基于块的检索通常提供不相关且逻辑不连贯的上下文,导致答案生成过程中证据链不完整和推理错误。为了解决这些挑战,我们提出了SentGraph,一种句子级图RAG框架,显式建模句子之间的细粒度逻辑关系以用于多跳问答。具体来说,我们离线构建一个层次化句子图:首先调整修辞结构理论以区分核心句和卫星句,然后将它们组织成带有跨文档实体桥的主题级子图。在线检索时,SentGraph执行图引导的证据选择和路径扩展,以检索细粒度的句子级证据。在四个多跳问答基准上的大量实验证明了SentGraph的有效性,验证了显式建模句子级逻辑依赖关系对多跳推理的重要性。

英文摘要

Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

2512.18508 2026-05-26 stat.ME cs.AI cs.SY eess.SP eess.SY 版本更新

Selection-Induced Contraction of Innovation Statistics in Gated Kalman Filters

门控卡尔曼滤波中创新统计量的选择诱导收缩

Barak Or

发表机构 * metaor artificial intelligence(metaor人工智能) Google Reichman Tech School(谷歌Reichman技术学校) Reichman University(Reichman大学)

AI总结 本文证明在门控卡尔曼滤波中,经过门控后的创新统计量收敛于门控条件而非名义量,并推导了椭球门控下创新的一阶和二阶矩的精确表达式,揭示了门控引起的确定性协方差收缩,并扩展至最近邻关联分析。

Comments 9 pages, preprint

详情
AI中文摘要

验证门控是经典卡尔曼跟踪系统的基本组成部分。只有归一化创新平方(NIS)低于规定阈值的测量值才被考虑用于状态更新。虽然这个过程在统计上基于卡方分布,但它隐含地将无条件创新过程替换为条件观测过程,仅限于验证事件。本文表明,门控后计算的创新统计量收敛于门控条件量而非名义量。在线性高斯假设下,我们推导了椭球门控条件下创新的一阶和二阶矩的精确表达式,并证明门控会引起创新协方差的确定性、维度依赖的收缩。分析扩展至最近邻(NN)关联,后者被证明是一个额外的统计选择算子。我们证明,在多个门内测量中选择最小范数创新会引入不可避免的能量收缩,这意味着在非平凡门控和关联下,名义创新统计量无法保持。二维情况下的闭式结果量化了组合效应并说明了其实际意义。

英文摘要

Validation gating is a fundamental component of classical Kalman-based tracking systems. Only measurements whose normalized innovation squared (NIS) falls below a prescribed threshold are considered for state update. While this procedure is statistically motivated by the chi-square distribution, it implicitly replaces the unconditional innovation process with a conditionally observed one, restricted to the validation event. This paper shows that innovation statistics computed after gating converge to gate-conditioned rather than nominal quantities. Under classical linear-Gaussian assumptions, we derive exact expressions for the first- and second-order moments of the innovation conditioned on ellipsoidal gating, and show that gating induces a deterministic, dimension-dependent contraction of the innovation covariance. The analysis is extended to nearest-neighbor (NN) association, which is shown to act as an additional statistical selection operator. We prove that selecting the minimum-norm innovation among multiple in-gate measurements introduces an unavoidable energy contraction, implying that nominal innovation statistics cannot be preserved under nontrivial gating and association. Closed-form results in the two-dimensional case quantify the combined effects and illustrate their practical significance.
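The covariance contraction caused by gating alone can be reproduced numerically in the two-dimensional case. The sketch below relies on a standard truncated-Gaussian identity rather than on the paper's own derivation, which is not reproduced in the abstract: for a whitened innovation in $d$ dimensions and gate threshold $\gamma$, the gated covariance equals the nominal one scaled by $F_{d+2}(\gamma)/F_d(\gamma)$, where $F_k$ is the chi-square CDF with $k$ degrees of freedom. The function names, the gate value, and the Monte Carlo settings are illustrative assumptions, and the nearest-neighbor selection effect is not modeled.

```python
import math
import random


def contraction_factor_2d(gate: float) -> float:
    """Ratio of gated to nominal innovation covariance for a 2-D Gaussian innovation.

    Uses the closed-form chi-square CDFs for 2 and 4 degrees of freedom.
    """
    tail = math.exp(-gate / 2.0)
    cdf_2 = 1.0 - tail                       # chi-square CDF, 2 degrees of freedom
    cdf_4 = 1.0 - tail * (1.0 + gate / 2.0)  # chi-square CDF, 4 degrees of freedom
    return cdf_4 / cdf_2


def monte_carlo_factor_2d(gate: float, samples: int = 100_000, seed: int = 0) -> float:
    """Empirical contraction factor from whitened innovations that pass the gate."""
    rng = random.Random(seed)
    accepted = 0
    energy = 0.0
    for _ in range(samples):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        nis = x * x + y * y  # normalized innovation squared for unit covariance
        if nis <= gate:
            accepted += 1
            energy += nis
    # Mean gated NIS divided by the dimension gives the per-axis variance ratio.
    return energy / (2.0 * accepted)


gate = 9.21  # roughly the 99% chi-square quantile for 2 degrees of freedom
print(contraction_factor_2d(gate))
print(monte_carlo_factor_2d(gate))
```

Both values fall below 1, so innovation statistics estimated only from gated measurements understate the nominal covariance even when the filter is perfectly tuned.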

2512.15922 2026-05-26 cs.AI 版本更新

Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems

利用传播激活改进基于知识图谱的RAG系统中的文档检索

Jovan Pavlović, Miklós Krész, László Hajdu

发表机构 * UP FAMNIT, University of Primorska(普里莫斯卡大学FAMNIT学院) InnoRenew CoE, UP IAM, University of Primorska(InnoRenew卓越中心、UP IAM、普里莫斯卡大学) University of Szeged, Department of Applied Informatics(塞格德大学应用信息学系)

AI总结 提出一种基于自动构建异构知识图谱的传播激活算法,用于多跳问答中的文档检索,减少对语义知识图谱和LLM引导的依赖,性能优于或持平现有方法。

Comments 20 pages, 5 figures

详情
AI中文摘要

尽管初始成功且架构多样,检索增强生成系统在复杂推理任务中仍难以可靠地检索和连接多步证据。大多数标准RAG框架将所有检索到的信息视为同等可靠,忽视了大型文本语料库中不同的可信度和相互关联性。GraphRAG方法通过集成知识图谱(将信息结构化为节点和边,捕获实体关系,支持多步逻辑遍历)为RAG系统提供了潜在改进。然而,GraphRAG并非总是理想方案,因为它依赖于语料库的高质量图表示,这类表示通常需要人工构建知识图谱(构建和更新成本高)或自动图构建流程(往往不可靠)。此外,遵循该范式的系统通常使用大语言模型引导图遍历和证据检索。本文提出一种新颖的RAG框架,使用传播激活算法从由自动构建的异构知识图谱连接的文档语料库中检索信息。该方法减少了对语义知识图谱的依赖(后者因信息提取过程中的信息丢失而常不完整),避免了LLM引导的图遍历,并提高了多跳问答的性能。实验表明,我们的方法在多项最先进RAG方法上达到更好或相当的性能,并可作为即插即用模块集成到不同的迭代RAG流程中。与思维链迭代检索结合时,在答案正确性上相比朴素RAG实现了高达39%的绝对提升,且使用小型开源语言模型即可达到这些结果。

英文摘要

Despite initial successes and a variety of architectures, retrieval-augmented generation systems still struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most of the standard RAG frameworks regard all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. GraphRAG approaches offer potential improvement to RAG systems by integrating knowledge graphs, which structure information into nodes and edges, capture entity relationships, and enable multi-step logical traversal. However, GraphRAG is not always an ideal solution, as it depends on high-quality graph representations of the corpus. Such representations usually rely on manually curated knowledge graphs, which are costly to construct and update, or on automated graph-construction pipelines that are often unreliable. Moreover, systems following this paradigm typically use large language models to guide graph traversal and evidence retrieval. In this paper, we propose a novel RAG framework that uses a spreading activation algorithm to retrieve information from a corpus of documents connected by an automatically constructed heterogeneous knowledge graph. This approach reduces reliance on semantic knowledge graphs, which are often incomplete due to information loss during information extraction, avoids LLM-guided graph traversal, and improves performance on multi-hop question answering. Experiments show that our method achieves better or comparable performance to several state-of-the-art RAG methods and can be integrated as a plug-and-play module with different iterative RAG pipelines. When combined with chain-of-thought iterative retrieval, it yields up to a 39% absolute improvement in answer correctness over naive RAG, while achieving these results with small open-weight language models.
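Spreading activation itself is a classic graph algorithm, and a generic version can be sketched as follows. This is not the authors' retrieval pipeline: the graph layout, node names, `decay`, `threshold`, and `max_steps` are hypothetical choices for illustration. Activation starts at seed nodes matched to the query, flows along weighted edges with attenuation, and stops once the passed activation falls below a firing threshold.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbour, edge weight)]


def spread_activation(graph: Graph,
                      seeds: Dict[str, float],
                      decay: float = 0.5,
                      threshold: float = 0.1,
                      max_steps: int = 3) -> Dict[str, float]:
    """Propagate activation from seed nodes; return the final activation per node."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier: Dict[str, float] = {}
        for node, value in frontier.items():
            for neighbour, weight in graph.get(node, []):
                passed = value * weight * decay
                if passed < threshold:
                    continue  # too weak to fire the neighbour
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        if not next_frontier:
            break
        for node, value in next_frontier.items():
            # Accumulate activation, capped at 1.0.
            activation[node] = min(1.0, activation.get(node, 0.0) + value)
        frontier = next_frontier
    return activation


# Hypothetical heterogeneous graph mixing query, document, and entity nodes.
graph: Graph = {
    "query": [("doc:a", 1.0)],
    "doc:a": [("ent:x", 0.8), ("ent:y", 0.9)],
    "ent:x": [("doc:b", 0.9)],
    "ent:y": [("doc:c", 0.5)],
}
scores = spread_activation(graph, {"query": 1.0})
ranked_docs = sorted(
    (node for node in scores if node.startswith("doc:")),
    key=lambda node: scores[node],
    reverse=True,
)
print(ranked_docs)
```

Because the traversal is driven only by edge weights and a threshold, no language-model call is needed to decide which edges to follow.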

2512.13323 2026-05-26 cs.AI cs.LG 版本更新

Error-Driven Prompt Optimization for Arithmetic Reasoning

基于错误驱动的算术推理提示优化

Árpád Pándy, Róbert Lakatos, András Hajdu

发表机构 * Department of Data Science & Visualization, Faculty of Informatics, University of Debrecen(数据科学与可视化系,信息学院,德布勒森大学)

AI总结 提出一种错误驱动的提示优化框架,通过聚类错误预测迭代优化提示规则,使小型本地语言模型在算术推理任务中准确率达到70.8%,超越GPT-3.5 Turbo。

详情
Journal ref
IEEE Access, vol. 14, pp. 62570-62583, 2026
AI中文摘要

人工智能的最新进展激发了人们对工业代理的兴趣,这些代理能够在表格数据工作流中支持金融和医疗等受监管领域的分析师。此类系统的关键能力是对结构化数据执行准确的算术运算,同时确保敏感信息永远不会离开安全的本地环境。在此,我们引入了一种用于算术推理的错误驱动优化框架,该框架增强了代码生成代理(CGA),特别应用于本地小型语言模型(SLM)。通过对领先的SLM(Qwen3 4B)进行系统评估,我们发现虽然基础模型在算术任务中表现出基本局限性,但我们提出的错误驱动方法通过聚类错误预测来迭代优化提示规则,显著提升了性能,将模型准确率提高到70.8%。我们的结果表明,开发可靠、可解释且可工业部署的AI助手不仅可以通过昂贵的微调实现,还可以通过系统的、错误驱动的提示优化来实现,从而使小型模型以符合隐私要求的方式超越大型语言模型(GPT-3.5 Turbo)。

英文摘要

Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.

2512.10961 2026-05-26 cs.HC cs.AI 版本更新

AI as Equalizer or Amplifier? Task Complexity as the Moderating Factor for Human Expertise in Hybrid Intelligence Systems

AI是均衡器还是放大器?任务复杂性作为混合智能系统中人类专业知识的调节因素

Tao An

发表机构 * Hawaii Pacific University(夏威夷太平洋大学)

AI总结 本文提出AI在常规任务中均衡表现,在复杂任务中放大专家与新手差距,并构建了人类贡献层次与参与层次的框架,强调领域知识而非提示工程决定放大效果。

Comments 9 pages, 3 figures, 1 table. v2 matches the camera-ready version accepted at HHAI 2026. Removed v1 aggregated projections (training timeline figure, n=580). Empirical basis is structured field observations of 10 to 20 colleagues at a single organization (Beijing Feimu) since mid-2024. Conceptual framework unchanged. To appear in Frontiers in Artificial Intelligence and Applications (IOS Press)

详情
AI中文摘要

越来越多的实证研究表明,生成式AI缩小了新手与专家在常规任务上的表现差距——即所谓的“均衡器”效应。本文挑战了这一结论的普遍性。基于认知增强理论、专家-新手研究以及对一个小型软件产品团队内部生成式AI使用的结构化观察,我们认为AI主要作为认知放大器:其输出质量根本上取决于指导它的人类专业知识。我们提出了一个包含人类贡献的三个层次(问题定义、质量评估、迭代优化)和三个参与级别(被动接受、迭代协作、认知指导)的框架,证明领域知识——而非提示工程技能——决定了放大效果。我们通过提出AI在结构良好的常规任务上均衡表现,而在需要深度判断的复杂任务上放大已有差异,来调和均衡器与放大器的观点。这种调和直接影响了人机混合系统的设计:我们应构建奖励和发展专业知识的AI,而非取代专业知识的AI。我们为HHAI社区提出了一个研究议程,聚焦于专业知识敏感的AI设计、自适应协作界面以及AI增强工作中人类能力发展的纵向研究。

英文摘要

A growing body of empirical research suggests that generative AI narrows performance gaps between novice and expert workers on routine tasks--the so-called "equalizer" effect. This paper challenges the generality of that conclusion. Drawing on cognitive augmentation theory, expert-novice research, and structured observations of in-house generative-AI use across a small software product team, we argue that AI functions primarily as a cognitive amplifier: a system whose output quality depends fundamentally on the expertise of the human who directs it. We present a framework comprising three layers of human contribution (problem definition, quality evaluation, iterative refinement) and three levels of engagement (passive acceptance, iterative collaboration, cognitive direction), demonstrating that domain expertise--not prompt engineering skill--determines amplification effectiveness. We reconcile the equalizer and amplifier perspectives by proposing that AI equalizes performance on well-structured, routine tasks while amplifying pre-existing differences on complex tasks requiring deep judgment. This reconciliation carries direct implications for hybrid human-AI system design: rather than building AI that replaces expertise, we should build AI that rewards and develops it. We outline a research agenda for the HHAI community centered on expertise-sensitive AI design, adaptive collaboration interfaces, and longitudinal studies of human capability development in AI-augmented work.

2512.06393 2026-05-26 cs.AI cs.CL cs.LG cs.LO 版本更新

Conflict-Aware Fusion: Mitigating Logic Inertia in Large Language Models via Structured Cognitive Priors

冲突感知融合:通过结构化认知先验缓解大语言模型中的逻辑惯性

Qiming Bao, Xiaoxuan Fu, Michael Witbrock

发表机构 * Xtracta & Strong AI Lab, University of Auckland(Xtracta与强人工智能实验室,奥克兰大学) School of Humanities, China University of Political Science and Law(人文学院,中国政法大学) Strong AI Lab, University of Auckland(强人工智能实验室,奥克兰大学)

AI总结 针对大语言模型在规则系统结构扰动下表现脆弱的问题,提出冲突感知融合训练流程,通过验证-演绎结构先验和符号推理奖励,在多个压力测试中实现鲁棒性饱和。

详情
AI中文摘要

大型语言模型(LLM)在许多推理基准上取得了高准确率,但在基于规则系统的结构扰动下仍然脆弱。我们引入了一个包含四个压力测试的诊断框架——冗余与必要规则删除、矛盾规则注入、逻辑保持重写和多定律堆叠——并用它来揭示逻辑惯性:生成式LLM(Qwen2/3、TinyLlama、GPT-4o、Gemma-3-4B-IT)和仅编码器BERT基线在矛盾前提下沿学习到的演绎轨迹持续推理的倾向。这种崩溃是剧烈的:未经处理的基线在基础任务上的准确率从1.00下降到矛盾注入时的0.00(实例级精确匹配),而GPT-4o仅解决了56.0%的矛盾案例。我们提出冲突感知融合,这是一个四阶段训练流程,将验证-演绎作为学习到的结构先验强制执行:(i)SFT建立验证前缀;(ii)DPO锐化矛盾停止决策边界;(iii)逻辑不变正则化(LIRE)通过对称KL惩罚逻辑等价规则公式之间的差异;(iv)来自验证反馈的强化学习(RLVF)使用符号前向链接引擎作为确定性预言奖励,联合优化不变性和敏感性。该流程在1.5B和8B骨干网络上均使所有四个主要压力测试达到饱和。我们进一步验证了第二阶段扩展,用Lean 4内核替换命题预言机,在分层187个问题的Lean翻译样本中,对105个经典可推导(T)问题达到99.0%的内核一致性(整体71.7%,涵盖两种极性),为形式化验证的RL训练提供了可靠的升级路径。代码和基准:https://github.com/14H034160212/lemo

英文摘要

Large language models (LLMs) achieve high accuracy on many reasoning benchmarks but remain brittle under structural perturbations of rule-based systems. We introduce a diagnostic framework with four stress tests -- redundant vs. essential rule deletion, contradictory-rule injection, logic-preserving rewrites, and multi-law stacking -- and use it to expose Logic Inertia: the tendency of generative LLMs (Qwen2/3, TinyLlama, GPT-4o, Gemma-3-4B-IT) and the encoder-only BERT baseline to persist along learned deductive trajectories under inconsistent premises. The collapse is sharp: untreated baselines fall from accuracy 1.00 on the base task to 0.00 on contradiction injection (instance-level exact match), and GPT-4o resolves only 56.0% of contradiction cases. We propose Conflict-Aware Fusion, a four-stage training pipeline that enforces verification-before-deduction as a learned structural prior: (i) SFT establishes the verification preamble; (ii) DPO sharpens the halt-on-contradiction decision boundary; (iii) Logical Invariance REgularisation (LIRE) penalises divergence between logically equivalent rule formulations via symmetric KL; (iv) Reinforcement Learning from Verification Feedback (RLVF) uses a symbolic forward-chaining engine as a deterministic oracle reward, jointly optimising invariance and sensitivity. The pipeline saturates all four primary stress tests for both 1.5B and 8B backbones. We further validate a Phase 2 extension that replaces the propositional oracle with a Lean 4 kernel, attaining 99.0% kernel agreement on the 105 classically-derivable (T) questions within a stratified 187-question Lean-translated sample (overall 71.7% across both polarities), providing a sound upgrade path to formally verified RL training. Code and benchmark: https://github.com/14H034160212/lemo
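The role of a symbolic forward-chaining engine as a deterministic oracle can be illustrated with a toy propositional version. This is a minimal sketch under stated assumptions, not the released code: the rule encoding, the `~` negation convention, the three answer labels, and the `oracle_reward` function are all hypothetical. The engine checks consistency before each round of deduction and halts as soon as a literal and its negation are both derived.

```python
from typing import FrozenSet, Iterable, List, Optional, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion); "~p" is the negation of "p"


def negate(literal: str) -> str:
    """Return the complementary literal."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def forward_chain(facts: Iterable[str], rules: List[Rule]) -> Optional[Set[str]]:
    """Return the deductive closure, or None if the premises are contradictory."""
    known = set(facts)
    changed = True
    while changed:
        # Verify consistency before deducing anything further.
        if any(negate(literal) in known for literal in known):
            return None
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def oracle_reward(predicted: str, facts: Iterable[str], rules: List[Rule], query: str) -> float:
    """Hypothetical reward: 1.0 if the prediction matches the engine's verdict, else 0.0."""
    closure = forward_chain(facts, rules)
    if closure is None:
        expected = "contradiction"
    elif query in closure:
        expected = "true"
    else:
        expected = "unknown"
    return 1.0 if predicted == expected else 0.0


base_rules: List[Rule] = [(frozenset({"bird"}), "flies")]
injected_rules = base_rules + [(frozenset({"bird"}), "~flies")]  # contradictory-rule injection
print(forward_chain({"bird"}, base_rules))
print(forward_chain({"bird"}, injected_rules))
print(oracle_reward("contradiction", {"bird"}, injected_rules, "flies"))
```

A model that keeps answering "true" after the contradictory rule is injected would receive zero reward here, which is the behavior the abstract calls Logic Inertia.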

2512.05765 2026-05-26 cs.AI cs.LG 版本更新

AGI Requires a Coordination Layer on Top of Pattern Repositories

AGI 需要在模式存储库之上建立协调层

Edward Y. Chang

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系)

AI总结 本文提出大型语言模型(LLM)并非AGI的死胡同,而是缺少系统2协调层,通过UCCT和RCA实现语义锚定与因果验证,并设计MACI多智能体协调栈,实验表明自适应控制优于静态提示。

Comments 15 pages, 5 figures, 7 tables

详情
AI中文摘要

在本文中,我们认为那些将大型语言模型(LLM)视为AGI死胡同的有影响力的批评误判了瓶颈:它们混淆了海洋与渔网。模式存储库是必要的系统1基础;缺失的组件是一个系统2协调层,该层能够招募相关模式、验证其使用、保持状态并控制收敛。我们将常常被混淆的两种控制用途分开。由UCCT(统一上下文控制理论)形式化的语义锚定,通过由有效支持(rho_d)、表征不匹配(d_r)和自适应锚定预算(gamma log k)控制的相变,将标签和任务意图绑定到学习到的模式区域。由递归因果审计(RCA)实现的追踪-答案验证,测试最终因果判断是否在其自身推理轨迹的压力下得到支持。我们将这些思想转化为MACI,一个多智能体协调栈,通过诱饵(PID调节辩论)、过滤(苏格拉底式和因果审计)和持久性(事务性内存)整合多样性和控制。在因果判断和谄媚-偏执权衡上的实证验证表明,静态提示失败的地方,自适应控制成功。通过将常见反对意见重新定义为可测试的协调失败,我们认为通往AGI的道路是通过LLM,而不是绕过它们。能力不是协调。

英文摘要

In this paper we argue that influential critiques dismissing Large Language Models (LLMs) as a dead end for AGI misidentify the bottleneck: they confuse the ocean with the net. Pattern repositories are the necessary System-1 substrate; the missing component is a System-2 coordination layer that recruits relevant patterns, verifies their use, preserves state, and governs convergence. We separate two uses of control that are often conflated. Semantic anchoring, formalized by UCCT (Unified Contextual Control Theory), binds labels and task intent to learned pattern regions through a phase transition governed by effective support (rho_d), representational mismatch (d_r), and an adaptive anchoring budget (gamma log k). Trace-answer verification, implemented by Recursive Causal Audit (RCA), tests whether a final causal judgment is warranted by its own reasoning trace under pressure. We translate these ideas into MACI, a multi-agent coordination stack that integrates diversity and control via baiting (PID-modulated debate), filtering (Socratic and causal audit), and persistence (transactional memory). Empirical validation on causal judgment and the sycophancy-paranoia trade-off demonstrates that static prompting fails where adaptive control succeeds. By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them. Capability is not coordination.

2511.15407 2026-05-26 cs.AI cs.CV cs.LG 版本更新

IPR-1: Interactive Physical Reasoner

IPR-1:交互式物理推理器

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出IPR模型,通过世界模型滚动评分和强化VLM策略,结合物理中心动作代码PhysCode,在1000+异构游戏基准上实现鲁棒的物理推理,性能超越GPT-5并零样本迁移至未见游戏。

Comments Accepted by CVPR 2026. 13 pages of main text and 20 pages of appendices. Project page: https://mybearyzhang.github.io/ipr-1

详情
AI中文摘要

人类通过观察、与环境交互以及内化物理和因果关系来学习。在这里,我们旨在探究一个智能体是否能够通过交互类似地获得类人推理能力,并随着更多经验不断改进。为此,我们引入了一个包含1000+异构游戏的Game-to-Unseen (G2U)基准,这些游戏展现出显著的视觉领域差异。现有方法(包括VLM和世界模型)难以捕捉底层物理和因果关系,因为它们不关注核心机制且过度拟合视觉细节。VLM/VLA智能体能够推理,但在交互设置中缺乏前瞻性,而世界模型进行想象但模仿视觉模式而非分析物理和因果关系。因此,我们提出IPR(交互式物理推理器),利用世界模型滚动来评分和强化VLM的策略,并引入PhysCode,一种以物理为中心的动作代码,将语义意图与动力学对齐,为预测和推理提供共享动作空间。在1000+游戏上预训练后,我们的IPR在从原始直觉到目标驱动推理的各个层次上表现稳健,甚至在总体上超越了GPT-5。我们发现,性能随着训练游戏和交互步骤的增加而提升,并且模型还能零样本迁移到未见过的游戏。这些结果支持以物理为中心的交互作为稳步提升物理推理的路径。更多演示和项目详情请见https://mybearyzhang.github.io/ipr-1。

英文摘要

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.

2511.12378 2026-05-26 cs.AI 版本更新

Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

学会信任:序列决策中针对不同建议者可靠性的贝叶斯自适应方法

Dylan M. Asmar, Mykel J. Kochenderfer

发表机构 * Stanford Intelligent Systems Laboratory(斯坦福智能系统实验室) Stanford University(斯坦福大学)

AI总结 提出一种贝叶斯框架,通过将建议者质量融入信念表示并引入显式“询问”动作,使智能体在部分可观测环境中动态学习和适应变化的建议者可靠性,平衡信息获取与成本。

Comments Repo: https://github.com/dylan-asmar/learning_to_trust

详情
AI中文摘要

在不确定性下执行序列决策任务的自主智能体可以从外部动作建议中受益,这些建议提供了有价值的指导,但其可靠性固有地变化。现有整合此类建议的方法通常假设静态且已知的建议者质量参数,限制了实际部署。我们引入了一个框架,在部分可观测环境中动态学习并适应变化的建议者可靠性。首先,我们将建议者质量直接整合到智能体的信念表示中,使智能体能够通过贝叶斯推断建议者类型来推断并调整对建议的依赖。其次,我们引入了一个显式的“询问”动作,允许智能体在关键时刻策略性地请求建议,平衡信息获取与获取成本。实验评估表明,该框架在不同建议者质量下具有稳健性能,能够适应变化的可靠性,并策略性地管理建议请求。这项工作通过解决不确定环境中的建议不确定性,为自适应人机协作奠定了基础。

英文摘要

Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provide valuable guidance but inherently vary in reliability. Existing methods for incorporating such advice typically assume static and known suggester quality parameters, limiting practical deployment. We introduce a framework that dynamically learns and adapts to varying suggester reliability in partially observable environments. First, we integrate suggester quality directly into the agent's belief representation, enabling agents to infer and adjust their reliance on suggestions through Bayesian inference over suggester types. Second, we introduce an explicit "ask" action allowing agents to strategically request suggestions at critical moments, balancing informational gains against acquisition costs. Experimental evaluation demonstrates robust performance across varying suggester qualities, adaptation to changing reliability, and strategic management of suggestion requests. This work provides a foundation for adaptive human-agent collaboration by addressing suggestion uncertainty in uncertain environments.
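Bayesian inference over suggester types can be sketched with a simple discrete belief update. This is an illustrative simplification, not the paper's model: it assumes a suggester of quality `q` recommends the best action with probability `q` and otherwise picks uniformly among the remaining actions, and it assumes the best action is known when the update is made, whereas the paper works in partially observable settings where the state is itself uncertain. All names and numbers below are hypothetical.

```python
from typing import Dict


def suggestion_likelihood(suggested: str, best: str, quality: float, n_actions: int) -> float:
    """P(suggestion | suggester quality) under the assumed noise model."""
    if suggested == best:
        return quality
    return (1.0 - quality) / (n_actions - 1)


def update_type_belief(belief: Dict[float, float],
                       suggested: str,
                       best: str,
                       n_actions: int) -> Dict[float, float]:
    """Bayes update of a discrete belief over suggester quality levels."""
    unnormalised = {
        quality: prior * suggestion_likelihood(suggested, best, quality, n_actions)
        for quality, prior in belief.items()
    }
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("suggestion has zero probability under every suggester type")
    return {quality: value / total for quality, value in unnormalised.items()}


# Two hypothetical suggester types over four actions: random (0.25) and reliable (0.9).
prior = {0.25: 0.5, 0.9: 0.5}
after_match = update_type_belief(prior, suggested="left", best="left", n_actions=4)
after_miss = update_type_belief(prior, suggested="right", best="left", n_actions=4)
print(after_match)
print(after_miss)
```

A matching suggestion shifts belief toward the reliable type and a mismatch shifts it toward the random type, so the agent's reliance on advice can track the evidence instead of a fixed, assumed quality parameter.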

2510.26418 2026-05-26 cs.AI 版本更新

Chain-of-Thought Hijacking

思维链劫持

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

发表机构 * Independent(独立) University of Oxford(牛津大学) Stanford University(斯坦福大学) Anthropic Martian Core

AI总结 提出思维链劫持攻击,通过诱导大型推理模型进行长时间良性推理来削弱其拒绝有害请求的能力,实现高成功率越狱。

AI中文摘要

大型推理模型(LRMs)通过扩展推理时间推理来提高任务性能。尽管先前研究表明更长的推理应导致更稳健的安全行为,但我们发现了相反的证据:过度扩展的推理反而可以被利用来系统性地削弱拒绝行为。我们提出了思维链劫持,一种简单而有效的黑盒越狱攻击,诱导LRMs进行长时间的良性谜题求解推理(通常持续五分钟以上),然后引发有害的顺从。在HarmBench上,思维链劫持在Gemini 2.5 Pro、ChatGPT o4 Mini、Grok 3 Mini和Claude 4 Sonnet上分别实现了99%、94%、100%和94%的攻击成功率。为了理解该攻击为何成功,我们对开源推理模型进行了激活探测、注意力模式分析和因果干预。我们的结果表明,拒绝行为依赖于一个低维安全信号,其表达随着推理轨迹变长而减弱。特别是,扩展的良性推理将注意力从有害意图转移开,并减弱与拒绝相关的激活,产生了我们称之为拒绝稀释的现象。这些发现表明,过长的推理可能引入系统性的越狱攻击面。我们发布了评估材料以支持可重复性和进一步研究。

英文摘要

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.

2510.25065 2026-05-26 cs.AI 版本更新

Rewarding Structural Conformance of Reasoning using Process Mining

使用过程挖掘奖励推理的结构符合性

Yongjae Lee, Taekhyun Park, Sunghyun Sim, Hyerim Bae

发表机构 * Dept. of Industrial Engineering(工业工程系) Pusan National University(釜山国立大学) Dept. of Data Science(数据科学系) Changwon National University(昌原国立大学)

AI总结 提出TACReward奖励模型,利用过程挖掘技术聚合推理步骤的结构偏差,以改进稀疏奖励策略梯度方法在数学推理任务中的性能。

AI中文摘要

近期稀疏奖励策略梯度方法的进展使得基于强化学习的语言模型后训练成为可能。然而,对于数学问题求解等推理任务,二值化结果奖励对中间推理步骤提供的反馈有限。虽然一些研究尝试通过估计整体推理质量来解决此问题,但这些奖励是否可靠地代表逐步推理质量仍不明确。在本研究中,我们将推理视为结构化过程,并提出TACReward,该奖励模型可无缝集成到稀疏奖励策略梯度方法中,无需额外的人工标注成本或架构修改。TACReward利用过程挖掘技术聚合教师与策略推理之间的逐步结构偏差,生成范围在[0, 1]的标量输出奖励以指示推理质量。在多个数学推理基准上的实验表明,将TACReward集成到稀疏奖励框架中鼓励策略模型改善推理的结构质量,从而在现有稀疏奖励框架上实现一致的性能提升。我们的代码和检查点可在https://github.com/Thrillcrazyer/TACReward和https://huggingface.co/Thrillcrazyer/TACReward7B公开获取。

英文摘要

Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)-based language model post-training. However, for reasoning tasks such as mathematical problem solving, binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted to address this issue by estimating overall reasoning quality, it remains unclear whether these rewards are reliable proxies for the quality of stepwise reasoning. In this study, we consider reasoning as a structured process and propose TACReward, a reward model that can be seamlessly integrated into sparse reward policy gradient methods without additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range [0, 1] that indicates reasoning quality. Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward frameworks encourages the policy model to improve the structural quality of its reasoning, which in turn leads to consistent performance improvements over existing sparse reward frameworks. Our code and checkpoints are publicly available at https://github.com/Thrillcrazyer/TACReward and https://huggingface.co/Thrillcrazyer/TACReward7B.

2510.15514 2026-05-26 cs.AI 版本更新

Voting with the Graph: Stable RLAIF via Topological Consistency Maximization

基于图的投票:通过拓扑一致性最大化实现稳定的RLAIF

Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, Li Yu, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao

发表机构 * Alibaba Group(阿里巴巴集团) Chinese Academy of Sciences Institute of Automation(中国科学院自动化研究所)

AI总结 提出拓扑共识奖励(TCR)框架,利用传递性作为去噪机制,通过拓扑多数投票过滤偏好信号中的随机噪声,以稳定强化学习从AI反馈(RLAIF)中的偏好学习。

AI中文摘要

从AI反馈中强化学习(RLAIF)依赖LLM法官作为偏好测量工具,但这些工具本质上受限于随机测量误差——表现为偏好循环(例如,$A \succ B \succ C \succ A$)的随机波动,在最先进模型中占5-9%的评估。虽然重复采样通过平均多个判断来减轻噪声,但它孤立地处理每个比较,未能利用区分系统信号与随机噪声的结构约束。我们引入拓扑共识奖励(TCR),一个通过拓扑多数投票利用传递性作为去噪机制的框架:系统信号通过传递链相互增强,而随机误差聚集为拓扑暴露的循环。TCR近似最大无环子图以从偏好信号中过滤随机噪声。我们还提出循环发生率(CIR)作为诊断指标,衡量包含偏好循环的样本比例。在我们的噪声模型下,这些循环主要源于随机测量误差而非真正的非传递性。在Arena-Hard、MT-Bench和WritingBench上的实验表明,TCR始终优于成对基线和经典排序算法,并在不同法官模型上表现出稳健性能。

英文摘要

Reinforcement Learning from AI Feedback (RLAIF) relies on LLM judges as preference measurement instruments, yet these instruments are fundamentally limited by random measurement errors -- stochastic fluctuations that manifest as preference cycles (e.g., $A \succ B \succ C \succ A$), occurring in 5-9% of evaluations across state-of-the-art models. While repeated sampling mitigates noise by averaging multiple judgments, it treats each comparison in isolation and fails to exploit the structural constraints that distinguish systematic signals from random noise. We introduce Topological Consensus Rewards (TCR), a framework that leverages transitivity as a denoising mechanism via topological majority voting: systematic signals reinforce each other through transitive chains, while random errors cluster into topologically exposed cycles. TCR approximates the Maximum Acyclic Subgraph to filter stochastic noise from preference signals. We also propose Cycle Incidence Rate (CIR) as a diagnostic metric that measures the proportion of samples containing preference cycles. Under our noise model, these cycles arise primarily from stochastic measurement errors rather than genuine intransitivity. Experiments on Arena-Hard, MT-Bench, and WritingBench demonstrate that TCR consistently outperforms pairwise baselines and classical ranking algorithms, while exhibiting robust performance across different judge models.
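
To make the two quantities in this abstract concrete, the sketch below computes a cycle incidence rate over samples of pairwise preferences and extracts an acyclic subset of a weighted preference graph. The ordering heuristic used here (rank candidates by net win margin and keep only edges that agree with the ranking) is a simple stand-in chosen for illustration; it is not claimed to be the approximation TCR uses, and all function names are hypothetical.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed preference graph (winner, loser) contains a cycle."""
    graph = defaultdict(list)
    for winner, loser in edges:
        graph[winner].append(loser)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            # A GRAY neighbor is on the current DFS path, so we found a back edge.
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(graph))


def cycle_incidence_rate(samples):
    """Fraction of samples whose pairwise judgments contain at least one cycle."""
    return sum(has_cycle(edges) for edges in samples) / len(samples)


def greedy_acyclic_subgraph(weighted_edges):
    """Keep only edges consistent with a ranking by net win margin (a heuristic)."""
    margin = defaultdict(float)
    for (winner, loser), votes in weighted_edges.items():
        margin[winner] += votes
        margin[loser] -= votes
    order = sorted(margin, key=lambda n: (-margin[n], n))
    rank = {n: i for i, n in enumerate(order)}
    return {e: v for e, v in weighted_edges.items() if rank[e[0]] < rank[e[1]]}


samples = [
    [("A", "B"), ("B", "C"), ("C", "A")],  # cyclic
    [("A", "B"), ("B", "C"), ("A", "C")],  # transitive
]
cir = cycle_incidence_rate(samples)
kept = greedy_acyclic_subgraph({("A", "B"): 3, ("B", "C"): 3, ("C", "A"): 1})
```

Because every kept edge points forward in a single total order, the result is acyclic by construction; the weakly supported edge that closes the cycle is the one dropped.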

2510.14925 2026-05-26 cs.AI cs.CL cs.LG 版本更新

False Fixed Points: Kantian Feedback, Stable Miscalibration, and Representational Compression in LLMs

虚假不动点:大语言模型中的康德反馈、稳定误校准与表征压缩

Akira Okutomi

发表机构 * ToppyMicroServices OÜ(ToppyMicroServices公司)

AI总结 本文通过康德承诺门控框架和线性反馈模型,研究大语言模型中高置信度错误作为局部稳定、内部一致且自信错误的虚假不动点现象,发现稳定性与正确性可分离,并探索高信噪比惯性和表征压缩作为稳定误校准的可能机制。

Comments 27 pages, 8 figures, v3.0

AI中文摘要

大型语言模型中的高置信度错误通常被视为脆弱的失败。我们研究另一种可能性:某些错误可能是虚假不动点,即局部稳定、内部一致且自信地错误。这分离了鲁棒性与真实追踪。我们通过康德承诺门控框架和一个最小线性反馈模型来发展这种分离,其中稳定性和正确性可以偏离。在三个开源权重模型上,根据我们的隐藏状态敏感性探测,过度自信的错误项并不比自信正确的项系统性地更局部脆弱。基于弃权的自我批评通过牺牲覆盖率减少了过度自信的错误承诺,而C3-R(一种基于规则的显式反馈门控)则加剧了这种权衡而非消除它。这些结果激发但未证实高信噪比惯性和表征压缩作为稳定误校准的可能机制。

英文摘要

High-confidence errors in large language models are often treated as fragile failures. We study an alternative: some errors may be false fixed points, locally stable, internally coherent, and confidently wrong. This separates robustness from truth-tracking. We develop the separation through a Kantian commitment-gate framing and a minimal linear feedback model in which stability and correctness can diverge. Across three open-weight models, overconfident wrong items are not systematically more locally fragile than confidently correct items under our hidden-state sensitivity probes. Abstention-aware self-critique reduces overconfident wrong commitments by sacrificing coverage, and C3-R, a rule-based explicit feedback gate, sharpens that tradeoff rather than eliminating it. These results motivate, but do not establish, high signal-to-noise (high-SNR) inertia and representational compression as possible mechanisms for stable miscalibration.

2510.07343 2026-05-26 cs.GR cs.AI eess.IV 版本更新

Local MAP Sampling for Diffusion Models

扩散模型的局部MAP采样

Shaorong Zhang, Rob Brekelmans, Greg Ver Steeg

发表机构 * University of California, Riverside, CA, US(加州大学河滨分校)

AI总结 提出局部MAP采样(LMAPS)框架,通过沿扩散轨迹迭代求解局部MAP子问题,统一了优化方法与概率采样,在图像恢复和科学任务中达到最优性能。

AI中文摘要

扩散后验采样(DPS)通过从$p(x_0 \mid y)$采样,为逆问题提供了一种基于贝叶斯原理的方法。虽然后验采样对于捕捉不确定性和多模态性很有价值,但许多经典和实际的逆问题设置最终优先考虑精确的点估计——最显著的是MAP估计器,它长期以来一直是成像和科学应用中的标准重建目标。我们引入了局部MAP采样(LMAPS),这是一种新的推理框架,沿扩散轨迹迭代求解局部MAP子问题。这一视角阐明了它们与全局MAP和DPS的联系,为基于优化的方法提供了统一的概率解释。在此基础之上,我们开发了实用算法,其中协方差近似基于高斯先验假设,并重新制定了目标函数以提高稳定性和可解释性。在广泛的图像恢复和科学任务中,LMAPS实现了最先进的性能。

英文摘要

Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. While posterior sampling is valuable for capturing uncertainty and multi-modality, many classical and practical inverse problem settings ultimately prioritize accurate point estimation -- most notably the MAP estimator, which has long served as a standard reconstruction objective in imaging and scientific applications. We introduce Local MAP Sampling (LMAPS), a new inference framework that iteratively solves local MAP subproblems along the diffusion trajectory. This perspective clarifies their connection to global MAP and DPS, offering a unified probabilistic interpretation for optimization-based methods. Building on this foundation, we develop practical algorithms with a covariance approximation motivated by a Gaussian prior assumption, and a reformulated objective for stability and interpretability. Across a broad set of image restoration and scientific tasks, LMAPS achieves state-of-the-art performance.

2510.04580 2026-05-26 cs.AI 版本更新

Strongly Solving 2048 4x3

强求解 2048 4x3

Tomoyuki Kaneko, Shuhei Yamashita

发表机构 * Graduate School of Arts and Sciences, the University of Tokyo(东京大学艺术及科学研究生院)

AI总结 通过按棋盘上数字和(称为状态年龄)划分状态空间,枚举所有可达状态和后继状态,强求解了4x3棋盘上的2048变体,最优策略期望得分约50724.26。

AI中文摘要

2048是一个随机单人游戏,涉及4x4网格上的16个单元格,玩家在上下左右中选择一个方向,通过合并沿该方向相邻单元格中相同数字的两个方块来获得分数。本文证明,变体2048-4x3(4x3棋盘上的12个单元格,比原版少一行)已被强求解。在该变体中,对于最常见的初始状态(两个数字2的方块),最优策略的期望得分约为50724.26。可达状态和后继状态的数量分别为1,152,817,492,752和739,648,886,170。关键技术是按棋盘上数字之和(称为状态年龄)划分状态空间。年龄在状态与其任何有效动作后的后继状态之间保持不变,并通过环境的随机响应增加2或4。因此,我们可以按年龄划分状态空间,并仅依赖于最近年龄的状态来枚举一个年龄的所有(后继)状态。类似地,我们可以按年龄递减顺序确定(后继)状态值。

英文摘要

2048 is a stochastic single-player game played with 16 cells on a 4 by 4 grid, where a player chooses a direction among up, down, left, and right and scores by merging two tiles with the same number located in neighboring cells along the chosen direction. This paper shows that 2048-4x3, a variant with 12 cells on a 4 by 3 board (one row smaller than the original), has been strongly solved. In this variant, the expected score achieved by an optimal strategy is about $50724.26$ for the most common initial states: those with two tiles of number 2. The numbers of reachable states and afterstates are identified to be $1,152,817,492,752$ and $739,648,886,170$, respectively. The key technique is to partition the state space by the sum of tile numbers on a board, which we call the age of a state. The age is invariant between a state and its successive afterstate after any valid action, and it increases by two or four with the stochastic response from the environment. Therefore, we can partition the state space by age and enumerate all (after)states of a given age depending only on states of the most recent ages. Similarly, we can identify (after)state values by proceeding through the ages in decreasing order.
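
The age invariant described above can be checked on a small example. The sketch below implements only a left move on a board of three rows by four columns, using the standard 2048 merge rule; the function names and the example board are illustrative and not taken from the paper.

```python
def slide_left(row):
    """Slide and merge one row to the left; returns (new_row, score gained)."""
    tiles = [t for t in row if t != 0]
    merged, score, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # each tile merges at most once per move
            score += tiles[i] * 2
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged)), score


def move_left(board):
    """Apply a left move to every row; returns (afterstate, score gained)."""
    results = [slide_left(row) for row in board]
    return [row for row, _ in results], sum(s for _, s in results)


def age(board):
    """Age of a state: the sum of all tile numbers on the board."""
    return sum(sum(row) for row in board)


board = [[2, 2, 4, 0],
         [0, 4, 4, 8],
         [2, 0, 2, 2]]
afterstate, gained = move_left(board)
assert age(afterstate) == age(board)  # merging never changes the tile sum

# The environment then places a 2 or a 4 in an empty cell, raising the age by 2 or 4.
afterstate[0][3] = 2
assert age(afterstate) == age(board) + 2
```

Since the age never decreases, states of a given age can only lead to states of a strictly larger age, which is what allows values to be computed from the largest ages downward.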

2510.02171 2026-05-26 cs.SD cs.AI eess.AS 版本更新

Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

Go witheFlow:实时情感驱动音频效果调制

Edmund Dervakos, Spyridon Kantarelis, Vassilis Lyberatos, Jason Liartis, Giorgos Stamou

发表机构 * Artificial Intelligence and Learning Systems Laboratory(人工智能与学习系统实验室) National Technical University of Athens(希腊国家技术大学)

AI总结 提出witheFlow系统,通过生物信号和音频特征实时自动调制音频效果,增强音乐表演中的人机协作。

Comments Accepted at NeurIPS Creative AI Track 2025: Humanity

AI中文摘要

音乐表演是一种独特的人类活动,与表演者传达、唤起或表达情感的能力内在相关。机器无法以人类的意义表演音乐;它们可以制作、复制、执行或合成音乐,但缺乏情感或情绪体验的能力。因此,音乐表演是探索人机协作方面的理想候选。在本文中,我们介绍了witheFlow系统,旨在通过基于从生物信号和音频本身提取的特征自动调制音频效果,增强实时音乐表演。该系统目前处于概念验证阶段,设计轻量,能够在笔记本电脑上本地运行,并且在兼容的数字音频工作站和传感器可用的情况下是开源的。

英文摘要

Music performance is a distinctly human activity, intrinsically linked to the performer's ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.

2510.01389 2026-05-26 cs.RO cs.AI cs.LG 版本更新

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

INSIGHT: 视觉-语言-动作模型中生成帮助触发器的推理时序列内省

Ulas Berk Karli, Ziyao Shangguan, Tesca FItzgerald

发表机构 * Department of Computer Science, Yale University(耶鲁大学计算机科学系)

AI总结 提出INSIGHT框架,利用令牌级不确定性信号(熵、对数概率、不确定性估计)训练变压器分类器,预测VLA模型何时需要人类帮助,并对比强/弱监督下的性能,发现建模时间动态优于静态评分。

AI中文摘要

最近的视觉-语言-动作(VLA)模型展现出强大的泛化能力,但它们缺乏用于预测失败和向人类监督者请求帮助的内省机制。我们提出了INSIGHT,一个利用令牌级不确定性信号来预测VLA何时应请求帮助的学习框架。使用π0-FAST作为基础模型,我们提取每个令牌的熵、对数概率以及基于狄利克雷的偶然不确定性和认知不确定性估计,并训练紧凑的变压器分类器将这些序列映射到帮助触发器。我们探索了强监督或弱监督的监督机制,并在分布内和分布外任务中进行了广泛比较。我们的结果显示了权衡:强标签使模型能够捕捉细粒度的不确定性动态以实现可靠的帮助检测,而弱标签虽然噪声较大,但在训练和评估对齐时仍能支持有竞争力的内省,为密集标注不可行时提供了可扩展的路径。关键的是,我们发现使用变压器建模令牌级不确定性信号的时间演化比静态序列级评分提供了更强的预测能力。本研究首次对VLA中基于不确定性的内省进行了系统评估,为主动学习和通过选择性人工干预实现实时错误缓解开辟了未来途径。

英文摘要

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using π0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore strong and weak supervision regimes, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
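
Two of the token-level signals named above, entropy and log-probability, can be computed directly from a model's output logits. The sketch below does this with the standard softmax definitions for a short sequence; the function names and example logits are hypothetical, and the Dirichlet-based aleatoric and epistemic estimates are left out because the abstract does not specify them.

```python
import math


def token_uncertainty(logits, chosen):
    """Return (entropy, log-probability of the chosen token) for one decoding step."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy, math.log(probs[chosen])


def uncertainty_sequence(step_logits, chosen_tokens):
    """Per-token feature sequence that a downstream classifier could consume."""
    return [token_uncertainty(l, c) for l, c in zip(step_logits, chosen_tokens)]


# One confident step followed by one maximally uncertain step.
features = uncertainty_sequence(
    [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [0, 2],
)
```

Keeping the values as an ordered sequence, rather than averaging them into one score, preserves the temporal pattern that the abstract reports as most predictive.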

2509.25339 2026-05-26 cs.CV cs.AI cs.LG eess.IV 版本更新

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

VisualOverload: 在真正密集场景中探测VLM的视觉理解

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

发表机构 * Independent Researcher(独立研究者) JKU Linz(林茨JKU) MIT CSAIL Tübingen AI Center(图宾根人工智能中心) Stanford(斯坦福) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室)

AI总结 提出VisualOverload基准,通过密集场景中的简单视觉任务测试VLM,发现最佳模型仅达69.5%准确率,揭示计数、OCR和逻辑一致性等关键缺陷。

Comments Accepted at CVPR 2026

AI中文摘要

最先进的VLM是否真正解决了基本视觉理解?我们提出VisualOverload,一个略有不同的视觉问答(VQA)基准,包含2,720个问答对,并持有私有真实答案。与以往通常关注近全局图像理解的VQA数据集不同,VisualOverload挑战模型在密集(或过载)场景中执行简单的、无需知识的视觉任务。我们的数据集由公共领域绘画的高分辨率扫描图组成,这些绘画包含多个人物、动作和展开的子情节,背景细节丰富。我们手动为这些图像标注了六个任务类别的问题,以探测对场景的彻底理解。我们假设当前基准高估了VLM的性能,编码和推理细节对它们来说仍然是一项具有挑战性的任务,尤其是当面对密集场景时。实际上,我们观察到在37个测试模型中,即使是最好的模型(o3)在我们最难的测试子集上也仅达到19.6%的准确率,在所有问题上总体准确率为69.5%。除了全面评估外,我们还通过错误分析补充了基准,揭示了多种失败模式,包括缺乏计数能力、OCR失败以及复杂任务下惊人的逻辑不一致。总之,VisualOverload暴露了当前视觉模型中的关键差距,并为社区开发更好的模型提供了重要资源。基准:http://paulgavrikov.github.io/visualoverload

英文摘要

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and that encoding and reasoning over details is still a challenging task for them, especially when they are confronted with densely populated scenes. Indeed, we observe that even the best of the 37 tested models (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy overall on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload

2509.12196 2026-05-26 cs.LG cs.AI 版本更新

Dynamic Relational Priming Improves Transformer in Multivariate Time Series

动态关系先验提升Transformer在多变量时间序列中的表现

Hunjae Lee, Corey Clark

发表机构 * Department of Computer Science, Southern Methodist University, Dallas TX USA(南卫理公会大学计算机科学系,美国德克萨斯州达拉斯)

AI总结 提出动态关系先验注意力机制(prime attention),通过为每个token对动态调整表示,有效捕捉多变量时间序列中异构的通道间依赖关系,在保持相同计算复杂度下提升预测精度达6.5%。

AI中文摘要

标准Transformer中的注意力机制使用静态的token表示,这些表示在每一层的所有成对计算中保持不变。这限制了它们与每个token对交互中可能存在的多样化关系动态的表示对齐。虽然标准注意力在关系相对同质的领域表现出色,但其静态关系学习难以捕捉多变量时间序列(MTS)数据中多样、异构的通道间依赖关系——其中单个系统内不同的通道对交互可能由完全不同的物理定律或时间动态支配。为了更好地将注意力机制与此类领域现象对齐,我们提出了带有动态关系先验的注意力机制(prime attention)。与标准注意力中每个token在所有成对交互中呈现相同表示不同,prime attention通过可学习的调制动态地(或按交互)定制每个token,以最好地捕捉每个token对的独特关系动态,从而针对特定关系优化每个成对交互。这种prime attention的表示可塑性使其能够在保持与标准注意力相同渐近计算复杂度的同时,有效提取MTS中关系特定的信息。我们的结果表明,prime attention在基准测试中始终优于标准注意力,预测精度提升高达6.5%。此外,我们发现与标准注意力相比,prime attention在使用最多40%更短序列长度时即可达到相当或更优的性能,进一步证明了其卓越的关系建模能力。

英文摘要

Standard attention mechanisms in transformers employ static token representations that remain unchanged across all pair-wise computations in each layer. This limits their representational alignment with the potentially diverse relational dynamics of each token-pair interaction. While it excels in domains with relatively homogeneous relationships, standard attention's static relational learning struggles to capture the diverse, heterogeneous inter-channel dependencies of multivariate time series (MTS) data, where different channel-pair interactions within a single system may be governed by entirely different physical laws or temporal dynamics. To better align the attention mechanism with such domain phenomena, we propose attention with dynamic relational priming (prime attention). Unlike standard attention, where each token presents an identical representation across all of its pair-wise interactions, prime attention tailors each token dynamically (that is, per interaction) through learnable modulations to best capture the unique relational dynamics of each token pair, optimizing each pair-wise interaction for that specific relationship. This representational plasticity enables prime attention to extract relationship-specific information in MTS effectively while maintaining the same asymptotic computational complexity as standard attention. Our results demonstrate that prime attention consistently outperforms standard attention across benchmarks, achieving up to 6.5% improvement in forecasting accuracy. In addition, we find that prime attention achieves comparable or superior performance with sequences up to 40% shorter than those used by standard attention, further demonstrating its superior relational modeling capabilities.

2509.12194 2026-05-26 cs.AI cs.CV 版本更新

Teaching large language models to reason like expert diagnosticians

教会大型语言模型像专家诊断医生一样推理

Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, Andrew S. Lea, Emily Glanton, Kimberly LeBlanc, Undiagnosed Diseases Network, Marinka Zitnik, Scott H. Podolsky, Zahir Kanjee, Raja-Elie E. Abdulnour, Jacob M. Koshy, Adam Rodman, Arjun K. Manrai

发表机构 * Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系) Department of Medicine, Beth Israel Deaconess Medical Center(贝塞斯达医院内科部) The Mongan Institute, Massachusetts General Hospital(麻省总医院蒙根研究所) Division of Gastroenterology, Brigham and Women’s Hospital(布里洛妇女医院胃肠病科) Department of Medicine, Brigham and Women’s Hospital(布里洛妇女医院内科部) Department of Medicine, Massachusetts General Hospital(麻省总医院内科部) Department of Pathology, Massachusetts General Hospital(麻省总医院病理学部) Department of Health Humanities and Bioethics, University of Rochester School of Medicine and Dentistry(罗切斯特大学医学院和牙科学院健康人文与生物伦理学部) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(哈佛大学凯普纳人工智能研究所) Center for the History of Medicine, Countway Library of Medicine, Harvard Medical School(哈佛医学院医学史中心,考特维图书馆) Department of Global Health and Social Medicine, Harvard Medical School(哈佛医学院全球健康与社会医学部) Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital(布里洛妇女医院呼吸科和重症医学科)

AI总结 提出 Dr. CaBot 代理 AI 系统,通过生成基于初始病例描述的幻灯片演示来模拟专家诊断推理,并在 NEJM CPC 和 NIH 未诊断疾病网络病例上取得优于前沿模型的表现,同时发布 CPC-Bench 基准以促进临床 AI 发展。

AI中文摘要

鉴别诊断是一个迭代过程,将患者信息与更广泛的医学知识相结合。自1923年以来持续发表的临床病例系列,如NEJM临床病理会议(CPCs),展示了专家医生向同行演示诊断推理,并已被用于评估AI数十年。然而,先前的AI评估主要关注最终诊断准确性,而非细微的临床推理。在此,我们介绍Dr. CaBot,一个代理AI系统,通过仅从初始病例描述生成带有书面和旁白的幻灯片演示,来模拟专家诊断医生。CaBot最近生成了NEJM CPC 100多年历史上首个发表的AI诊断。在盲评中,医生在46/62(74%)的试验中错误分类了鉴别诊断的来源(CaBot vs. 医生撰写),并在各个质量维度上给予其好评。当被要求解决来自NIH未诊断疾病网络的72名未诊断疾病患者的病例时,CaBot仅从转诊记录中就识别出了50/72(69%)病例的工作诊断。为了促进透明度和研究,我们还开发了CPC-Bench,一个基于7,102个CPC和47,648个问题(涵盖10个任务)的经医生验证的基准。我们证明CaBot在CPC-Bench上优于前沿模型,并公开发布CaBot和CPC-Bench,以促进临床AI的进步。

英文摘要

Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, feature expert physicians who demonstrate diagnostic reasoning to peers, and have been used for decades to evaluate AI. However, prior AI evaluations have largely focused on final diagnostic accuracy rather than nuanced clinical reasoning. Here, we introduce Dr. CaBot, an agentic AI system that emulates an expert diagnostician by generating written and narrated slide-based presentations from an initial case description alone. CaBot recently generated the first AI diagnosis published in the 100+ year history of the NEJM CPCs. In blinded evaluations, physicians misclassified the source of the differential (CaBot vs. physician-written) in 46/62 (74%) of trials and rated them favorably across quality dimensions. When tasked with solving cases for 72 patients with undiagnosed disease from the NIH Undiagnosed Diseases Network, CaBot identified the working diagnosis in 50/72 (69%) of cases from referral notes alone. To promote transparency and research, we also developed CPC-Bench, a physician-validated benchmark based on 7,102 CPCs and 47,648 questions across 10 tasks. We show that CaBot outperforms frontier models on CPC-Bench, and release both CaBot and CPC-Bench publicly to foster progress in clinical AI.

2509.05614 2026-05-26 cs.CV cs.AI cs.RO 版本更新

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

SpecPrune-VLA: 通过动作感知的自推测剪枝加速视觉-语言-动作模型

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对视觉-语言-动作模型推理加速,提出结合全局上下文与局部信息的无训练两层剪枝方法,实现高达1.57倍加速且成功率几乎无下降。

Comments Accepted to ICML 2026

AI中文摘要

剪枝是一种通过移除不重要值的计算来加速计算密集型模型的典型技术。最近,它被应用于加速视觉-语言-动作(VLA)模型推理。然而,现有的加速方法仅关注当前动作步骤的局部信息,忽略了全局上下文,导致在某些场景下成功率下降超过20%且加速效果有限。本文指出VLA任务中的时空一致性:连续步骤中的输入图像表现出高度相似性,并提出关键见解:令牌选择应结合局部信息与模型的全局上下文。基于此,我们提出SpecPrune-VLA,一种无需训练、具有启发式控制的两级剪枝方法。(1) 动作级静态剪枝:利用全局历史和局部注意力,在每个动作中静态减少视觉令牌。(2) 层级动态剪枝:根据逐层重要性自适应地剪枝每层的令牌。(3) 轻量级动作感知控制器:根据末端执行器的速度将动作分为粗粒度或细粒度,并相应调整剪枝激进程度。大量实验表明,SpecPrune-VLA在LIBERO模拟中实现高达1.57倍加速,在真实世界任务中实现1.70倍加速,且成功率下降可忽略不计。

英文摘要

Pruning is a typical acceleration technique for compute-bound models that removes computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to a success-rate drop of more than 20% and limited speedup in some scenarios. In this paper, we point out the spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity. We propose the key insight that token selection should combine local information with the global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning: we leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning: we prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller: we classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57× speedup in LIBERO simulation and 1.70× on real-world tasks, with negligible success rate degradation.
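
The controller idea in item (3) can be pictured as a speed threshold that selects how many visual tokens survive pruning. The sketch below is only a schematic under stated assumptions: the threshold, the keep ratios, the choice to prune more aggressively during fast (coarse) motion, and the function names are all hypothetical and not taken from SpecPrune-VLA.

```python
def keep_ratio(end_effector_speed, speed_threshold=0.05, coarse_keep=0.4, fine_keep=0.8):
    """Pick a token keep ratio from end-effector speed (all values are placeholders).

    Assumption: fast motion is treated as a coarse-grained action that tolerates
    aggressive pruning; slow motion is treated as fine-grained and pruned lightly.
    """
    return coarse_keep if end_effector_speed > speed_threshold else fine_keep


def prune_tokens(importance, ratio):
    """Return the indices of the top-scoring tokens, kept in their original order."""
    k = max(1, round(len(importance) * ratio))
    top = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)[:k]
    return sorted(top)


scores = [0.1, 0.9, 0.3, 0.7, 0.5]  # hypothetical per-token importance
fast = prune_tokens(scores, keep_ratio(0.20))  # coarse action, fewer tokens kept
slow = prune_tokens(scores, keep_ratio(0.01))  # fine action, more tokens kept
```

Returning indices in their original order keeps the positional layout of the surviving tokens intact for the layers that follow.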

2508.08652 2026-05-26 cs.AI 版本更新

Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training

Prompt-and-Check:使用大型语言模型评估基于模拟训练中的通信协议合规性

Vishakha Lall, Yisi Liu

发表机构 * Centre of Excellence in Maritime Safety(海上安全卓越中心) Singapore Polytechnic(新加坡理工学院) Singapore(新加坡)

AI总结 提出Prompt-and-Check方法,利用开源大语言模型通过上下文丰富的提示评估模拟训练中通信协议的合规性,并在海事领域案例中验证其有效性。

AI中文摘要

准确的程序通信合规性评估在基于模拟的训练中至关重要,特别是在安全关键领域,遵守合规检查表反映了操作能力。本文探索了一种轻量级、可部署的方法,使用基于提示的推理与开源大型语言模型(LLMs),这些模型可以在消费级GPU上高效运行。我们提出了Prompt-and-Check,一种使用上下文丰富的提示来评估协议中每个检查表项目是否已满足的方法,仅基于转录的口头交流。我们在海事领域进行了一个案例研究,参与者执行相同的模拟任务,并实验了LLaMA 2 7B、LLaMA 3 8B和Mistral 7B等模型,在本地RTX 4070 GPU上运行。对于每个检查表项目,一个包含相关转录摘录的提示被输入模型,模型输出合规性判断。我们使用分类准确性和一致性分数将模型输出与专家标注的基准进行比较。我们的发现表明,提示使得无需任务特定训练即可进行有效的上下文感知推理。这项研究突出了LLMs在增强训练环境中的汇报、绩效反馈和自动评估方面的实际效用。

英文摘要

Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that can run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, based solely on transcribed verbal exchanges. We perform a case study in the maritime domain with participants performing an identical simulation task, and experiment with models such as LLaMA 2 7B, LLaMA 3 8B and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments.
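
The per-item loop described above (build a prompt from a checklist item and transcript excerpts, read back a judgment, compare with expert labels) might look roughly like the following. The prompt wording, the COMPLIANT / NON-COMPLIANT answer format, and all function names are assumptions for illustration; the call to a locally hosted model is deliberately left out.

```python
def build_prompt(checklist_item, excerpts):
    """Assemble a context-rich prompt for one checklist item (wording is hypothetical)."""
    transcript = "\n".join(f"- {line}" for line in excerpts)
    return (
        "You are assessing communication protocol compliance.\n"
        f"Checklist item: {checklist_item}\n"
        f"Transcript excerpts:\n{transcript}\n"
        "Answer with exactly one word: COMPLIANT or NON-COMPLIANT."
    )


def parse_judgment(model_output):
    """Map raw model text to True / False, or None if no verdict can be read."""
    text = model_output.strip().upper()
    if text.startswith("NON-COMPLIANT"):
        return False
    if text.startswith("COMPLIANT"):
        return True
    return None


def accuracy(predictions, ground_truth):
    """Classification accuracy against expert labels; unparsed outputs count as wrong."""
    correct = sum(p is not None and p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


prompt = build_prompt(
    "Vessel name is stated at the start of the call",
    ["This is motor vessel Aurora calling port control.", "Port control, go ahead."],
)
preds = [parse_judgment(o) for o in ["COMPLIANT", "Non-compliant.", "unsure"]]
score = accuracy(preds, [True, False, True])
```

Constraining the answer to a fixed vocabulary keeps parsing trivial, and treating unreadable outputs as errors avoids silently inflating the score.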

2507.14760 2026-05-26 eess.IV cs.AI cs.CV cs.LG 版本更新

QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems

QUTCC: 成像逆问题的分位数不确定性训练与保形校准

Cassandra Tong Ye, Shamus Li, Tyler King, Kristina Monakhova

AI总结 提出QUTCC方法,结合分位数回归与U-Net实现空间自适应保形校准,在多个成像逆问题中生成更紧的不确定性区间并定位模型幻觉。

AI中文摘要

尽管深度学习为科学和医学成像带来了巨大前景,但任何失败和幻觉(与事实不符的预测)都难以定位,并可能产生严重的下游后果。不确定性估计技术,如保形预测,可以通过预测模型预测的统计有效误差条来提供帮助。然而,流行的保形预测方法并非为高维图像值问题设计,且在保形校准过程中未考虑图像内的空间相关性,导致不确定性区间过大。我们提出了一种实用的同时分位数回归方法,能够在保形校准期间实现非线性、空间自适应缩放。我们的方法QUTCC使用带有分位数嵌入的U-Net架构,在训练期间学习完整的条件分位数分布,然后利用这个非线性学习函数进行空间自适应保形校准。在测试时,我们的方法能够高效地估计具有像素边际覆盖保证的不确定性区间。此外,QUTCC还可以在没有内置分布假设的情况下预测逐像素条件概率密度估计。我们在多个去噪问题、加速磁共振成像和定量相位显微镜上评估了我们的方法。与先前的保形方法相比,我们的方法在相同覆盖水平下始终产生更紧的不确定性区间,能够预测不同任务的合理条件分布,并且在某些情况下,高不确定性区域可以帮助我们定位模型预测中的幻觉。

英文摘要

While deep learning offers tremendous promise for scientific and medical imaging, any failures and hallucinations (predictions that do not coincide with reality) are hard to pinpoint and can have serious downstream consequences. Uncertainty estimation techniques, such as conformal prediction, can help by predicting statistically valid error bars for a model's prediction. However, popular conformal prediction methods were not designed for high-dimensional image-valued problems and do not take into account spatial correlations within an image during conformal calibration, resulting in larger-than-necessary uncertainty intervals. We propose a practical simultaneous quantile regression method that enables non-linear, spatially-adaptive scaling during conformal calibration. Our method, QUTCC, uses a U-Net architecture with a quantile embedding to learn a full conditional quantile distribution during training, and then leverages this non-linear, learned function for spatially-adaptive conformal calibration. At test time, our method can efficiently estimate uncertainty intervals with pixel-marginal coverage guarantees. In addition, QUTCC can predict pixel-wise conditional probability density estimates without any built-in distributional assumptions. We evaluate our method on several denoising problems, accelerated magnetic resonance imaging, and quantitative phase microscopy. Our method consistently produces tighter uncertainty intervals than prior conformal methods at the same coverage level and can predict plausible conditional distributions for different tasks. In some cases, high-uncertainty regions can also help locate hallucinations in a model's prediction.
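
For readers unfamiliar with the two ingredients named here, the sketch below shows the pinball (quantile) loss used to train quantile regressors and a basic split-conformal correction for quantile intervals, in the style of conformalized quantile regression. It uses a single scalar offset, which is the simpler baseline rather than the spatially-adaptive calibration that QUTCC proposes; the function names and toy numbers are illustrative.

```python
import math


def pinball_loss(y, pred, q):
    """Quantile loss: under-prediction is weighted by q, over-prediction by 1 - q."""
    diff = y - pred
    return max(q * diff, (q - 1) * diff)


def conformal_offset(lower, upper, truth, alpha):
    """Scalar offset that widens (or shrinks) intervals using a calibration set.

    Score per calibration point: how far the truth falls outside [lower, upper]
    (negative when it lies inside). The offset is the ceil((n + 1)(1 - alpha))-th
    smallest score; the rank is clamped to n here for simplicity.
    """
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, truth))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]


# Toy calibration set: predicted intervals [0, 1] and four observed values.
offset = conformal_offset([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 1.2, -0.1, 0.9], alpha=0.5)
calibrated = (0 - offset, 1 + offset)  # apply the same offset to a new interval
```

A single global offset ignores where in the image the model tends to err, which is the motivation the abstract gives for a spatially-adaptive alternative.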

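2507.10644 2026-05-26 cs.AI cs.CL cs.CR cs.HC cs.MA Updated version

From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents

Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State

Institutions * SEDAN - SnT, University of Luxembourg

AI summary This survey argues that the Web of Agents (WoA) has undergone a semantic-effort migration from platform-side coordination (Generation I), through data-side annotation (Generation II), to model-side interpretation (Generation III), and analyzes each generation's failure modes and current open problems.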

English abstract

The Web of Agents (WoA) transforms the document-centric Web into an environment of autonomous agents acting on users' behalf, a vision newly tractable as large language models (LLMs) mature. We argue that across three decades the WoA has undergone a semantic-effort migration in chronological order: from platform-side coordination (Multi-Agent Systems, Generation I), through data-side annotation (Semantic Web, Generation II), to model-side interpretation (LLM-era, Generation III). The central Gen II → Gen III transition within this trajectory, which we call the semantics-in-data → semantics-in-models shift, is predictive: each generation's failure modes and current open problems follow from where that generation located its semantic effort. The survey makes five contributions: (i) a unified evolutionary narrative spanning 1990-2026; (ii) a four-dimensional comparative framework (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) applied uniformly across all three generations; (iii) classification of sixteen representative systems on these dimensions, including hybrid LLM-knowledge-graph and computer-use agents; (iv) coverage of the November 2024 to August 2026 institutional convergence (Linux Foundation's Agentic AI Foundation, A2A v1.0, MCP November 2024 launch and November 2025 specification, Visa/Mastercard/Stripe payment-network protocols, EU AI Act phased enforcement, the NIST AI Agent Standards Initiative, International AI Safety Report 2026); and (v) seven named lessons grounded in cross-generational evidence, paired with seven generation-invariant challenges that persist regardless of which protocol prevails. Further progress depends less on protocol design than on the socio-technical infrastructure now being assembled by standards bodies, regulators, and commercial payment networks.

2507.09179 2026-05-26 cs.AI Updated version

Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System

Ronghua Shi, Yiou Liu, Yuchun Feng, Lynn Ai, Bill Shi, Zhuang Liu

Institutions * Department of Information Systems, City University of Hong Kong; Business School, University of New South Wales; Division of Engineering Science, University of Toronto; ProphetAI Data Technology Co., Ltd.; Gradient, 3 Fraser Street, DUO Tower, Singapore

AI summary Proposes a multi-agent reinforcement learning framework that models the interaction between manipulators and detectors as a dynamic adversarial game, uses delayed token price reactions to identify suspicious patterns, and integrates GRPO, a theory-based reward function, and a multi-modal agent pipeline to achieve robust manipulation detection in the decentralized Symphony system without centralized oracles.

English abstract

Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulation. Without centralized oversight, malicious actors coordinate shilling campaigns and pump-and-dump schemes across various platforms. We propose a Multi-Agent Reinforcement Learning (MARL) framework for decentralized manipulation detection, modeling the interaction between manipulators and detectors as a dynamic adversarial game. This framework identifies suspicious patterns using delayed token price reactions as financial indicators. Our method introduces three innovations: (1) Group Relative Policy Optimization (GRPO) to enhance learning stability in sparse-reward and partially observable settings; (2) a theory-based reward function inspired by rational expectations and information asymmetry, differentiating price discovery from manipulation noise; and (3) a multi-modal agent pipeline that integrates LLM-based semantic features, social graph signals, and on-chain market data for informed decision-making. The framework is integrated within the Symphony system, a decentralized multi-agent architecture enabling peer-to-peer agent execution and trust-aware learning through distributed logs, supporting chain-verifiable evaluation. Symphony promotes adversarial co-evolution among strategic actors and maintains robust manipulation detection without centralized oracles, enabling real-time surveillance across global DeFi ecosystems. Trained on 100,000 real-world discourse episodes and validated in adversarial simulations, Hide-and-Shill achieves top performance in detection accuracy and causal attribution. This work bridges multi-agent systems with financial surveillance, advancing a new paradigm for decentralized market intelligence. All resources are available at the Hide-and-Shill GitHub repository to promote open research and reproducibility.

2505.14479 2026-05-26 cs.AI cs.CL Updated version

A Neuro-Symbolic Approach for Reliable Proof Generation with LLMs: A Case Study in Euclidean Geometry

Oren Sultan, Eitan Stern, Dafna Shahaf

Institutions * The Hebrew University of Jerusalem

AI summary Proposes a neuro-symbolic approach that combines LLM generation with structured components, using analogous-problem retrieval and formal-verifier feedback to substantially improve proof accuracy in Euclidean geometry.

Comments long paper

English abstract

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof of concept, we focus on SAT-level geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. Our method significantly improves proof accuracy across all evaluated models, which span diverse model families: OpenAI o1, GPT-5, Gemini-Flash-2.5, and Claude Sonnet 4.6. Accuracy increases from 10-44% for the base models to 68-96% with our approach, with both analogous problem guidance and verifier feedback contributing to these improvements. More broadly, shifting to LLMs that generate provably correct conclusions has the potential to dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.
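The two-part approach described above (analogy-guided generation, then verifier feedback) can be pictured as a small retry loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `prove_with_feedback`, `toy_generate`, and `toy_verify` are hypothetical names, and the stubs stand in for an LLM call and a formal verifier.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: a generator takes the problem, retrieved analogous
# proofs, and optional verifier feedback; a verifier returns (ok, feedback).
Generator = Callable[[str, List[str], Optional[str]], str]
Verifier = Callable[[str, str], Tuple[bool, str]]

def prove_with_feedback(problem: str, analogies: List[str],
                        generate: Generator, verify: Verifier,
                        max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Generate a proof, verify it, and retry with feedback until it passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_rounds + 1):
        proof = generate(problem, analogies, feedback)
        ok, feedback = verify(problem, proof)
        if ok:
            return proof, attempt
    return None, max_rounds  # no verified proof within the budget

# Toy stand-ins so the loop can be run end to end.
def toy_generate(problem, analogies, feedback):
    # The first draft is wrong; the draft written after feedback is right.
    return "angle sum = 180" if feedback else "angle sum = 170"

def toy_verify(problem, proof):
    ok = proof.endswith("180")
    return ok, ("" if ok else "the angles of a triangle must sum to 180")

result = prove_with_feedback("triangle ABC", ["similar solved problem"],
                             toy_generate, toy_verify)
```

Returning `None` when the budget runs out keeps unverified proofs from being reported as answers, which is the point of putting a verifier in the loop.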

2505.07078 2026-05-26 q-fin.TR cs.AI cs.CE Updated version

Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?

Weixian Waylon Li, Hyeonjun Kim, Mihai Cucuringu, Tiejun Ma

Institutions * AIAI, School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom; Global Finance Research Center, Sungkyunkwan University, Seoul, Republic of Korea; Dept. of Statistics, University of California, Los Angeles, United States, and OMI, University of Oxford

AI summary Proposes FINSABER, a backtesting framework that evaluates LLM timing-based strategies over longer periods and a larger stock universe, and finds that their advantages deteriorate significantly under long horizons and broad cross-sections, with poor performance in both bull and bear markets.

Comments KDD 2026, Datasets & Benchmarks Track

English abstract

Large Language Models (LLMs) have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under a broader cross-section and a longer evaluation horizon. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.

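2504.12474 2026-05-26 cs.CL cs.AI Updated version

Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

Azadeh Beiranvand, Seyed Mehdi Vahidipour

Institutions * Faculty of Electrical and Computer Engineering, University of Kashan

AI summary Proposes BiGTex, an architecture that stacks Graph-Text Fusion Units to enable bidirectional attention between GNNs and LLMs, trained with parameter-efficient fine-tuning (LoRA), achieving state-of-the-art performance on node classification and generalizing effectively to link prediction.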

Comments 26 pages, 4 figures

Journal ref
Machine Learning with Applications 24 (2026) 100921
English abstract

Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.

2504.05181 2026-05-26 cs.IR cs.AI cs.DL cs.LG Updated version

Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval

Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke

Institutions * University of Amsterdam

AI summary Proposes direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through pairwise ranking, without explicit reward modeling or reinforcement learning, improving MRR@10 by 7.4% on MS MARCO and 19.9% on Natural Questions.

Comments 12 pages, 3 figures. SIGIR '25 Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval July 13--18, 2025 Padua, Italy. Code and pretrained models available at: https://github.com/kidist-amde/ddro/

Journal ref
Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), pages 1327-1338, 2025
English abstract

Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable. To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO's potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
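To make "direct optimization via pairwise ranking" concrete, the sketch below computes a pairwise preference loss over docid sequence log-probabilities. It assumes a DPO-style objective with a frozen reference model and a temperature `beta`. The abstract does not state the exact loss, so the form, the function names, and the default `beta` are all assumptions.

```python
import math

def sequence_logprob(token_logprobs):
    """Log-probability of a docid = sum of its token log-probabilities."""
    return sum(token_logprobs)

def pairwise_loss(pol_pos, pol_neg, ref_pos, ref_neg, beta=0.1):
    """-log sigmoid(beta * margin) for one (relevant, non-relevant) docid pair.

    pol_*: sequence log-probs under the model being trained.
    ref_*: sequence log-probs under a frozen reference model (assumed).
    """
    margin = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical token log-probs for a relevant and a non-relevant docid.
pos = sequence_logprob([-0.5, -1.0, -0.5])   # -2.0
neg = sequence_logprob([-2.0, -1.5, -1.5])   # -5.0
loss_good = pairwise_loss(pos, neg, ref_pos=-3.0, ref_neg=-3.0, beta=1.0)
loss_bad = pairwise_loss(neg, pos, ref_pos=-3.0, ref_neg=-3.0, beta=1.0)
```

The loss is computed from whole-sequence log-probabilities, so the training signal is tied to which document is preferred rather than to individual next-token predictions, and no separate reward model is fitted.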

2504.05108 2026-05-26 cs.AI cs.LG cs.NE Updated version

Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning

Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, Caglar Gulcehre

Institutions * EPFL; Apple

AI summary Proposes continuously refining the LLM through reinforcement learning fine-tuning and combining it with evolutionary search to accelerate the discovery of better algorithms, validated on combinatorial optimization tasks.

Comments 34 pages

English abstract

Discovering efficient algorithms for solving complex problems has been an outstanding challenge in mathematics and computer science, requiring substantial human expertise over the years. Recent advancements in evolutionary search with large language models (LLMs) have shown promise in accelerating the discovery of algorithms across various domains, particularly in mathematics and optimization. However, existing approaches treat the LLM as a static generator, missing the opportunity to update the model with the signal obtained from evolutionary exploration. In this work, we propose to augment LLM-based evolutionary search by continuously refining the search operator - the LLM - through reinforcement learning (RL) fine-tuning. Our method leverages evolutionary search as an exploration strategy to discover improved algorithms, while RL optimizes the LLM policy based on these discoveries. Our experiments on combinatorial optimization tasks demonstrate that integrating RL with evolutionary search accelerates the discovery of superior algorithms, showcasing the potential of RL-enhanced evolutionary strategies for algorithm design.

2502.15835 2026-05-26 cs.CL cs.AI cs.SE Updated version

Pragmatic Reasoning improves LLM Code Generation

Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg

Institutions * Max Planck Institute for Informatics, Saarland Campus; Computer Science, Saarland University; Max Planck Institute for Software Systems, Saarland Campus

AI summary Proposes CodeRSA, which reranks sampled code candidates through local pragmatic contests to resolve ambiguity in natural language-to-code generation, achieving the strongest average accuracy in 10 of 12 model-benchmark settings.

English abstract

Pragmatic reasoning helps interlocutors infer intended meaning from ambiguous or underspecified messages by considering shared context and counterfactual alternatives. Similar challenges arise in natural language-to-code generation, where user instructions often admit multiple plausible candidate programs. However, direct RSA-style inference is difficult because it requires probability estimation over large spaces of programs and alternative instructions. We propose CodeRSA, an RSA-motivated reranking method that makes pragmatic reasoning tractable through local pragmatic contests among sampled code candidates. CodeRSA constructs candidate-induced alternative instructions and estimates which candidates are most distinctively supported by the original instruction, avoiding global normalization over the full program-instruction space. We evaluate CodeRSA on HumanEval+, MBPP+, and BigCodeBench using four open-weight instruction-following models. CodeRSA achieves the strongest average accuracy in 10 of 12 model-benchmark settings and remains competitive in the remaining cases. Further analyses show that its gains come from combining local pairwise pragmatic comparison with broader global support, suggesting a scalable direction for language-to-code reranking under natural-language uncertainty.
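The idea of asking which candidate is most distinctively supported by the original instruction can be illustrated with a toy RSA-style reranker. The sketch below is only an assumption-laden illustration, not CodeRSA itself: `rsa_rerank` and the score table are hypothetical, and normalization runs over a small local set of alternative instructions rather than the full program-instruction space.

```python
def rsa_rerank(support, original=0):
    """Rerank candidates by how distinctively the original instruction backs them.

    support[i][c]: model score of candidate c given instruction i, where row
    `original` is the user's instruction and the other rows are alternative
    instructions induced from the candidates (hypothetical inputs).
    Returns (candidate indices best-first, pragmatic scores).
    """
    n_cands = len(support[0])
    scores = []
    for c in range(n_cands):
        total = sum(row[c] for row in support)
        # Share of candidate c's total support that comes from the original
        # instruction; normalization is local to the listed instructions.
        scores.append(support[original][c] / total if total > 0 else 0.0)
    order = sorted(range(n_cands), key=lambda c: scores[c], reverse=True)
    return order, scores

# Hypothetical scores: candidate 0 is likely under every instruction,
# candidate 1 is likely mainly under the original one.
support = [
    [0.50, 0.40, 0.10],  # original instruction
    [0.50, 0.10, 0.10],  # alternative induced by candidate 0
    [0.25, 0.10, 0.50],  # alternative induced by candidate 2
]
order, scores = rsa_rerank(support)
```

In this toy table candidate 1 overtakes candidate 0 even though its raw score under the original instruction is lower, because candidate 0 is equally well explained by the alternatives.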

2502.08047 2026-05-26 cs.AI cs.MA Updated version

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

Institutions * Show Lab, National University of Singapore

AI summary Introduces WorldGUI, a benchmark covering ten desktop and web applications that evaluates the planning robustness of GUI agents under diverse, systematically constructed initial states, together with WorldGUI-Agent, a framework that improves reliability in dynamic environments through three critique stages.

Comments Technical Report

English abstract

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at https://github.com/showlab/WorldGUI.

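2502.06018 2026-05-26 cs.LG cs.AI Updated version

Kolmogorov-Arnold Fourier Networks

Jusheng Zhang, Yijia Fan, Kaitong Cai, Keze Wang, Wenhao Wang

Institutions * Sun Yat-sen University; Vast Intelligence Lab

AI summary To address parameter explosion and limited high-frequency feature capture in KANs on high-dimensional tasks, proposes the Kolmogorov-Arnold Fourier Network (KAF), which uses spectral reparameterization to replace local B-spline representations with global adaptive spectral ones, introduces trainable random Fourier features and an adaptive hybrid GELU-Fourier activation, and achieves state-of-the-art performance on CV, NLP, audio, and PDE-solving tasks.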

Comments Code: https://github.com/kolmogorovArnoldFourierNetwork/KAF

English abstract

Although Kolmogorov-Arnold-based interpretable networks (KANs) possess strong theoretical expressiveness, they suffer from severe parameter explosion and limited ability to capture high-frequency features in high-dimensional tasks. To address these issues, we propose the Kolmogorov-Arnold Fourier Network (KAF), which fundamentally redefines the KAN paradigm through spectral reparameterization. Our key contributions include: (1) proposing a fundamental basis transformation from the local, grid-based B-spline representation to a global, adaptive spectral representation. This shift changes the network's inductive bias, reducing parameter complexity from $O(G)$ to $O(1)$ while preserving expressiveness; (2) introducing trainable Random Fourier Features (RFF) initialized via a spectral alignment strategy, which allows the model to break the smoothness limitation of fixed kernels and accurately capture high-frequency components; and (3) implementing an adaptive hybrid GELU-Fourier activation mechanism that progressively enhances frequency representation during training. Comprehensive experiments demonstrate the superiority of KAF across computer vision (CV), natural language processing (NLP), audio, and partial differential equation (PDE) solving tasks, achieving state-of-the-art performance with improved efficiency. The code is available at https://github.com/kolmogorovArnoldFourierNetwork/KAF.
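One way to read the hybrid GELU-Fourier activation in contribution (3) is as a weighted sum of a GELU branch and a Fourier-feature branch. The scalar sketch below assumes the form a * GELU(x) + b * sum_k v_k * cos(w_k * x + p_k); the abstract does not give the exact parameterization, so the form and every name here are assumptions, not the authors' code (see their repository for that).

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def fourier_features(x, freqs, phases):
    """cos(w * x + p) for each (trainable, in a real model) frequency/phase."""
    return [math.cos(w * x + p) for w, p in zip(freqs, phases)]

def hybrid_activation(x, a, b, freqs, phases, weights):
    """Assumed hybrid form: a * GELU(x) + b * (weights . Fourier features)."""
    feats = fourier_features(x, freqs, phases)
    spectral = sum(v * f for v, f in zip(weights, feats))
    return a * gelu(x) + b * spectral

# Hypothetical parameters. With b = 0 the unit is a plain GELU; raising b
# mixes in the higher-frequency components.
freqs, phases, weights = [2.0, 8.0], [0.0, 0.0], [0.5, 0.25]
smooth = hybrid_activation(0.3, a=1.0, b=0.0, freqs=freqs, phases=phases, weights=weights)
mixed = hybrid_activation(0.3, a=1.0, b=1.0, freqs=freqs, phases=phases, weights=weights)
```

Starting with a small `b` and letting training increase it would be one possible reading of "progressively enhances frequency representation", though that schedule is a guess.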

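2412.07333 2026-05-26 cs.CV cs.AI Updated version

Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model

Donghwna Lee, Kirok Kim, Jisu Lee, Kyungha Min, Wooju Kim

Institutions * Department of Industrial Engineering

AI summary Proposes FPDM, a framework that uses contrastive learning to explicitly align fused source-pose embeddings with target image embeddings and uses the learned embedding as a conditioning signal for generation, addressing texture fidelity and consistency in pose-guided person image synthesis.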

English abstract

Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of a source image. This technology facilitates diverse applications, including virtual try-on, digital avatars, animation, and sign language generation. Despite the high-quality results of recent diffusion-based PGPIS, these models typically depend on implicit feature aggregation within the denoising process. As a result, fine-grained texture preservation is limited, and even for the same identity, it is difficult to ensure consistent generation under variations in pose and source appearance. To address these limitations, we propose Fusion Embedding for PGPIS using a Diffusion Model (FPDM), the first framework that explicitly aligns fused source-pose embeddings with target image embeddings via contrastive learning, and subsequently employs the learned fusion embedding as a conditioning signal for generation. FPDM integrates an Image-Pose Fusion (IPF) module into our proposed Source-Enhanced Pose Fusion approach to learn a fusion embedding aligned with the target image. We then employ a conditional diffusion model guided by source appearance, target pose, and the learned fusion embedding. Experiments on the DeepFashion benchmark and the RWTH-PHOENIX-Weather 2014T dataset demonstrate competitive performance compared to existing methods in both quantitative and qualitative evaluations, with ablation studies confirming that explicit fusion embedding alignment substantially improves texture fidelity and consistency across pose and source appearance variations.

2411.00934 2026-05-26 cs.CY cs.AI Updated version

The Meme Is the Message: Generative Memesis and AI Visuals in the 2024 USA Presidential Elections

Ho-Chun Herbert Chang, Benjamin Shaman, Yung-chun Chen, Mingyue Zha, Sean Noh, Chiyu Wei, Tracy Weener, Maya Magee

Institutions * Program in Quantitative Social Science, Dartmouth College; Department of Mathematics, Dartmouth College

AI summary Analyzing a dataset of Instagram images with computer vision, large language models, and facial affect analysis, this study finds that meme format predicts user engagement better than AI-generated content alone, that AI-generated memes combined with human curation yield a synergistic effect, and defines generative memesis as a new AI-mediated mode of meme communication.

English abstract

Visual content on social media has become increasingly influential in shaping political discourse and civic engagement, but it also limits participation due to the increased cost of multimedia production. In tandem, the growth of generative AI provides novel ways for citizens to participate in politics by lowering these costs. Drawing on a dataset of 239,526 Instagram images, we analyze the effects of synthetic images during the 2024 United States presidential election, using a multimodal workflow combining computer vision, large language models, and facial affect analysis. Results show that meme format is a stronger predictor of engagement than AI-generated content alone. However, AI-generated memes yield a significant interaction effect, suggesting synergistic increases in engagement when synthetic imagery is integrated with memes through human curation. We also characterize how users curate images. Partisans use AI in different ways: Democrat-leaning users tend to use it for in-group support, whereas Republican-leaning users more often employ it for out-group attacks. Users generally select happier synthetic faces compared to real photographs. We define generative memesis as a mode of communication in which memes are no longer shared person-to-person, but mediated by AI through customized visuals. We discuss how generative AI may empower civic participation, the bifurcation of content production and curation, and its implications for the history of novel technologies and participatory culture.

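2409.08379 2026-05-26 cs.SE cs.AI econ.GN q-fin.EC Updated version

The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Doron Yeverechyahu, Raveesh Mayya, Gal Oestreicher-Singer

Institutions * Coller School of Management, Tel Aviv University; Stern School of Business, New York University

AI summary Using the launch of GitHub Copilot as a natural experiment, with three identification strategies and two classification approaches, finds that LLMs increase open-source contributions by 28% to 40% and that incremental contributions grow significantly more than substantive ones, indicating that LLMs tilt innovation toward exploiting existing codebases rather than exploring new functionality.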

Comments JEL Classification: O31, C88, J24, O35, L86

English abstract

Large Language Models (LLMs) are reshaping knowledge work, yet their impact on voluntary, self-guided open innovation forums (contributors choose tasks without managerial direction) may differ fundamentally from effects observed in organizational settings. We study this question in open-source software development, where individuals' contributions collectively drive innovation at a community level. Unlike product innovation, where typologies for classifying innovation are well established, knowledge work in open-source settings calls for a distinction grounded in the cognitive demand a task places on the contributor. Burgeoning literature distinguishes substantive contributions, which require creative problem formulation to introduce new functionality, from incremental contributions, which draw on comprehension of existing code to maintain and refine it. We exploit a natural experiment around GitHub Copilot's launch in October 2021, where Copilot supported languages like Python while not supporting R for business reasons, creating an exogenous partition between otherwise comparable ecosystems. Using three complementary identification strategies and two classification approaches, we find that Copilot availability increases open-source contributions by 28 to 40 percent. The increase in incremental contributions is significantly larger than the increase in substantive contributions across all specifications. This disparity is more pronounced in projects with higher activity levels and widens following a model upgrade: LLMs function more effectively when existing context helps define the problem and constrain solutions, tilting collaborative innovation toward exploitation of established codebases rather than exploration of new functionality. This paper provides a rare instance of causal field evidence on LLM effects, given the speed at which GenAI has exploded across the knowledge economy.
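A natural experiment with a supported language (Python) and an unsupported one (R) is commonly analyzed with a difference-in-differences comparison. The abstract does not say which three identification strategies the authors use, so the sketch below is only a generic illustration of that idea; the function name and all contribution counts are hypothetical.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group means.

    Effect = (change in treated group) - (change in control group), which
    removes trends shared by both groups under a parallel-trends assumption.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical monthly contribution counts per project (illustrative only).
python_pre, python_post = [100, 110], [140, 150]   # Copilot-supported
r_pre, r_post = [50, 60], [55, 65]                 # not supported

effect = diff_in_diff(python_pre, python_post, r_pre, r_post)
relative = effect / (sum(python_pre) / len(python_pre))
```

Subtracting the control group's change is what lets the comparison separate the tool's effect from ecosystem-wide growth that would have happened anyway.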

2401.11963 2026-05-26 cs.NE cs.AI cs.LG 版本更新

Bridging Evolutionary Algorithms and Reinforcement Learning: A Comprehensive Survey on Hybrid Algorithms

桥接进化算法与强化学习:混合算法的全面综述

Pengyi Li, Jianye Hao, Hongyao Tang, Xian Fu, Yan Zheng, Ke Tang

发表机构 * College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院) Montreal Institute of Learning Algorithms (MILA)(蒙特利尔学习算法研究所) Department of Computer Science and Engineering, Southern University of Science and Technology(南方科技大学计算机科学与工程系)

AI总结 本文全面综述了进化强化学习(ERL)领域,将进化算法(EA)与强化学习(RL)融合,系统总结了三种主要研究方向:EA辅助RL优化、RL辅助EA优化以及EA与RL协同优化,并分析了各分支解决的问题及未来挑战。

Comments New Version, add more methods

详情
AI中文摘要

进化强化学习(ERL)将进化算法(EA)和强化学习(RL)相结合用于优化,已展现出显著的性能提升。通过融合这两种方法,ERL已成为一个有前景的研究方向。本综述全面概述了ERL中的不同研究分支。具体而言,我们系统地总结了相关算法的最新进展,并确定了三个主要研究方向:EA辅助的RL优化、RL辅助的EA优化以及EA和RL的协同优化。随后,我们对每个研究方向进行了深入分析,组织了多个研究分支。我们阐明了每个分支旨在解决的问题,以及EA和RL的整合如何应对这些挑战。最后,我们讨论了各个研究方向中潜在的挑战和未来的研究方向。为了便于研究人员深入研究ERL,我们在https://github.com/yeshenpy/Awesome-Evolutionary-Reinforcement-Learning上整理了所涉及的算法和代码。

英文摘要

Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) for optimization, has demonstrated remarkable performance advancements. By fusing both approaches, ERL has emerged as a promising research direction. This survey offers a comprehensive overview of the diverse research branches in ERL. Specifically, we systematically summarize recent advancements in related algorithms and identify three primary research directions: EA-assisted Optimization of RL, RL-assisted Optimization of EA, and synergistic optimization of EA and RL. Following that, we conduct an in-depth analysis of each research direction, organizing multiple research branches. We elucidate the problems that each branch aims to tackle and how the integration of EAs and RL addresses these challenges. In conclusion, we discuss potential challenges and prospective future research directions across various research directions. To facilitate researchers in delving into ERL, we organize the algorithms and codes involved on https://github.com/yeshenpy/Awesome-Evolutionary-Reinforcement-Learning.

2311.15487 2026-05-26 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning

全局 $\mathcal{L}^2$ 最小化:通过深度学习中的几何自适应梯度下降实现均匀指数速率

Thomas Chen

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文利用微分几何中黎曼度量的任意性,提出两种改进的梯度下降流(过参数化和欠参数化设置),在秩条件成立时证明其以均匀指数收敛速率驱动 $\mathcal{L}^2$ 代价到全局最小值,并推广到秩条件不成立的情形。

Comments AMS Latex, 21 pages. Typos corrected, references and comments added

详情
AI中文摘要

我们考虑深度学习网络中的监督学习场景,并利用黎曼度量选择的任意性(微分几何的一般事实)来定义梯度下降流。在标准的深度学习方法中,参数空间(权重和偏置)上的梯度流是相对于欧几里得度量定义的。而在这里,我们选择相对于深度学习网络输出层中的欧几里得度量的梯度流。这自然地在参数空间中诱导出两种改进的梯度下降流版本,一种适用于过参数化设置,另一种适用于欠参数化设置。在过参数化情况下,我们证明,只要秩条件成立,改进的梯度下降的所有轨道都以均匀指数收敛速率将 ${\mathcal L}^2$ 代价驱动到其全局最小值;因此,对于任何预先指定的接近全局最小值的程度,可以获得一个先验的停止时间。我们指出了后者与亚黎曼几何的关系。此外,我们将上述框架推广到秩条件不成立的情况;特别地,我们表明局部平衡只有在秩损失发生时才能存在,并且通常它们不是孤立点,而是参数空间中临界子流形的元素。

英文摘要

We consider the scenario of supervised learning in Deep Learning (DL) networks, and exploit the arbitrariness of choice in the Riemannian metric relative to which the gradient descent flow can be defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric. Here instead, we choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in the parameter space, one adapted for the overparametrized setting, and the other for the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can only exist if a rank loss occurs, and that generically, they are not isolated points, but elements of a critical submanifold of parameter space.
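The mechanism described in the abstract above can be sketched in a few lines. The notation below is assumed for illustration and may differ from the paper's normalization: $\theta$ denotes the parameters, $x(\theta)$ the network output on the training data, $y$ the targets, $D = \partial x / \partial \theta$ the Jacobian, and the rank condition is read as $D D^{T}$ being invertible in the overparametrized case.

```latex
% Sketch under assumed notation; cost taken as C = (1/2)|x - y|^2 with global minimum 0.
\begin{align*}
\text{standard flow:}\quad & \dot\theta = -\nabla_\theta \mathcal{C} = -D^{T}\nabla_x \mathcal{C}
  &&\Rightarrow\quad \dot x = D\dot\theta = -D D^{T}\,\nabla_x \mathcal{C},\\
\text{adapted flow:}\quad & \dot\theta = -D^{T}(D D^{T})^{-1}\nabla_x \mathcal{C}
  &&\Rightarrow\quad \dot x = -\nabla_x \mathcal{C} = -(x - y),\\
\text{consequence:}\quad & \tfrac{d}{dt}\mathcal{C} = (x-y)\cdot\dot x = -2\,\mathcal{C}
  &&\Rightarrow\quad \mathcal{C}(t) = e^{-2t}\,\mathcal{C}(0),\qquad
  t_\varepsilon = \tfrac12 \ln\frac{\mathcal{C}(0)}{\varepsilon}.
\end{align*}
```

Under these assumptions the decay rate does not depend on the initial parameters, which is what makes an a priori stopping time $t_\varepsilon$ for a prescribed tolerance $\varepsilon$ possible.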

2605.24526 2026-05-26 cs.HC cs.AI 版本更新

TRAFA: Anticipating User Actions to Reduce Errors in Procedural Tasks with Predictive Feedback

TRAFA:通过预测性反馈预见用户操作以减少程序性任务中的错误

Sassan Mokhtar, Lars Doorenbos, Fatemeh Jabbari, Marius Bock, Dominik Bach, Juergen Gall

发表机构 * University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所)

AI总结 提出TRAFA系统,通过跟踪-预测-行动框架实时预测用户动作并触发反馈,在错误发生前干预,实验证明相比传统反应式反馈能提高任务准确性和效率。

详情
AI中文摘要

交互式辅助系统通常在动作完成后提供反馈,支持错误恢复但无法预防错误本身。我们提出TRAFA,一种用于程序性任务的实时预测性反馈系统,在错误发生前进行干预。TRAFA通过跟踪-预测-行动框架实现预测性反馈:跟踪手和物体状态,基于场景上下文预测用户运动,并在预测动作可能违反任务约束时触发反馈。我们在顺序组装场景中实例化该流程,并通过技术基准测试和对照用户研究(与传统反应式反馈对比)进行评估。结果表明,预测性反馈在保持反馈事件数量相当的同时,提高了任务准确性和效率。这些发现将反馈时机定位为系统设计的关键维度,并展示了如何将实时预测集成到交互系统中以在错误发生前预防错误。

英文摘要

Interactive assistance systems typically provide feedback after an action has been completed, supporting error recovery but not preventing the error itself. We present TRAFA, a real-time predictive feedback system for procedural tasks that intervenes before errors are committed. TRAFA operationalizes predictive feedback through a Track-Forecast-Act framework that tracks hand and object state, forecasts user motion conditioned on scene context, and triggers feedback when a predicted action is likely to violate task constraints. We instantiate this pipeline in a sequential assembly setting and evaluate it through both technical benchmarking and a controlled user study against conventional reactive feedback. Our results show that predictive feedback improves task accuracy and efficiency while maintaining a comparable number of feedback events. These findings position feedback timing as a key dimension in system design and show how real-time anticipation can be integrated into interactive systems to prevent errors before they occur.

2605.24518 2026-05-26 cs.CL cs.AI 版本更新

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

语法引导的稀疏注意力:高效且可解释的Transformer

Spandan Pratyush

发表机构 * Independent Researcher(独立研究者)

AI总结 提出语法引导的稀疏注意力方法,通过词性标签动态生成注意力掩码,在保持精度的同时降低计算复杂度。

Comments 9 pages, 2 tables. Code available at https://github.com/toughthinktank/grammatically_guided_attention#

详情
AI中文摘要

Transformer模型中自注意力的二次复杂度仍然是处理长序列和高效部署大型语言模型的主要瓶颈。为此,已有大量关于稀疏注意力的研究,Deepseek稀疏注意力结合了多种创建令牌片段的方法以降低时间复杂度。本文提出了一种新颖的方法——语法引导的稀疏注意力,它基于令牌的语法角色约束注意力计算。通过利用词性(POS)标签,动态生成注意力掩码,强制令牌之间建立语言上连贯的连接,从而在不牺牲必要语言依赖性的情况下减少计算图。提出并评估了两种掩码策略:硬掩码严格只允许预定义的语法交互,软掩码则将注意力偏向这些交互。使用类似DistilBERT的架构在SST-2情感分类任务上进行的实验表明,语法引导的稀疏注意力在保持与全注意力相当的精度的同时,显著降低了理论计算开销。初步结果显示,硬掩码的准确率为0.8200,软掩码为0.8165,与全注意力的0.8200非常接近,为构建更高效、可解释且具有语言知识的Transformer架构提供了途径。

英文摘要

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. To address this bottleneck, there has been significant research into sparse attention; DeepSeek Sparse Attention, for example, combines various methods of creating segments of tokens to reduce the time complexity. This paper introduces a novel approach, Grammatically-Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. By leveraging Parts-of-Speech (POS) tags, attention masks are dynamically generated that enforce linguistically coherent connections between tokens, reducing the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. The experiments, conducted on the SST-2 sentiment classification task using a DistilBERT-like architecture, demonstrate that Grammatically-Guided Sparse Attention maintains comparable accuracy to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracy values of 0.8200 for hard masking and 0.8165 for soft masking, closely matching the 0.8200 of full attention, providing a path towards more efficient, interpretable, and linguistically-informed Transformer architectures.
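To make the hard and soft masking strategies described above concrete, here is a minimal NumPy sketch. The allow-list `ALLOWED`, the function names, and the `bias` value are hypothetical choices for illustration; the abstract does not specify which grammatical interactions the paper permits.

```python
import numpy as np

# Hypothetical allow-list of POS pairs that may attend to each other (an assumption).
ALLOWED = {("NOUN", "ADJ"), ("NOUN", "DET"), ("VERB", "NOUN"),
           ("VERB", "ADV"), ("ADJ", "ADV")}

def pos_mask(tags):
    """Boolean mask: entry (i, j) is True if token i may attend to token j."""
    n = len(tags)
    mask = np.eye(n, dtype=bool)  # every token may always attend to itself
    for i, q in enumerate(tags):
        for j, k in enumerate(tags):
            if (q, k) in ALLOWED or (k, q) in ALLOWED:
                mask[i, j] = True
    return mask

def masked_attention(scores, mask, mode="hard", bias=2.0):
    """Row-wise softmax over attention scores under a hard or soft POS mask."""
    if mode == "hard":
        logits = np.where(mask, scores, -np.inf)  # disallowed pairs get zero weight
    else:
        logits = scores + bias * mask             # allowed pairs get an additive boost
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

tags = ["DET", "ADJ", "NOUN", "VERB", "ADV"]
scores = np.zeros((5, 5))  # uniform raw scores, so only the mask shapes the result
hard = masked_attention(scores, pos_mask(tags), mode="hard")
soft = masked_attention(scores, pos_mask(tags), mode="soft")
```

This dense version only illustrates the masking logic. Any actual reduction in compute would require a kernel that skips the masked entries rather than computing and discarding them.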

2605.24516 2026-05-26 cs.MA cs.AI 版本更新

Adaptive Punishment for Cooperation in Mixed-Motive Games

混合动机博弈中促进合作的自适应惩罚

Min Tang, Fanqi Kong, Linyuan Lü, Xue Feng

发表机构 * University of Science and Technology of China(中国科学技术大学) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 提出自适应惩罚合作方法(APC),通过动态惩罚概率和背叛严重程度确定惩罚强度,在迭代公共物品博弈中有效促进合作并降低惩罚成本。

详情
AI中文摘要

混合动机场景在现实多智能体交互中普遍存在,其中自私的智能体往往为了即时奖励而背叛,忽视了利他合作改善长期收益和集体福利的潜力。同伴惩罚可以阻止背叛,但作为代价高昂的二阶利他行为,其持续施加可能损害惩罚者的利益。现有方法通常难以有效实施惩罚以促进合作。为了平衡惩罚的有效性和成本,我们提出了自适应惩罚合作方法(APC),这是一种分布式方法,基于动态惩罚概率和背叛严重程度来确定惩罚强度。这种动态概率大大减少了代价高昂且无效的惩罚,同时促进了合作。为了准确评估背叛及其严重程度,我们使用了一个背叛感知模块,其学习由游戏奖励引导。理论分析和实证结果表明,APC在迭代公共物品博弈中表现有效。在实证中,APC在连续社会困境中也显著优于现有基线,学习到理性且有效的惩罚策略,通过战略性地阻止背叛来促进合作。

英文摘要

Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long-term gains and collective welfare. Peer punishment can deter defection, but as costly second-order altruism, its persistent imposition may undermine the punisher's interests. Existing approaches often struggle to effectively implement punishment to promote cooperation. To balance the efficacy and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a distributed method that determines punishment intensity based on both a dynamic punishment probability and the severity of defection. This dynamic probability substantially reduces costly and ineffective punishment while also promoting cooperation. To accurately assess defection and its severity, we use a defection awareness module, whose learning is guided by game reward. Theoretical analysis and empirical results show that APC performs effectively in the iterated public goods game. Empirically, APC also significantly outperforms existing baselines across sequential social dilemmas, learning rational and effective punishment policies that foster cooperation by strategically deterring defection.

2605.24509 2026-05-26 cs.CV cs.AI cs.GR cs.LG 版本更新

Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation

Φ-Noise:基于相位噪声操作的无训练时间视频条件生成

Ofir Abramovich, Nadav Z. Cohen, Adi Rosenthal, Ariel Shamir

发表机构 * Canvas-Lab

AI总结 提出一种无需训练的方法,通过将参考视频的低频相位信息注入扩散噪声潜变量,实现运动条件视频生成,无需修改模型架构或推理流程。

Comments Under Review; 26 pages, 21 figures

详情
AI中文摘要

潜在视频扩散模型通过逐步将高斯噪声转换为基于文本或视觉输入的真实样本来生成视频。然而,现有的条件方法通常需要额外的训练和计算开销。受最近关于频率分量在生成模型中重要性的发现启发,我们提出了一种简单、无需训练的运动条件视频生成方法,通过将参考视频的低频相位信息直接注入扩散噪声潜变量。我们的方法在不修改模型架构或推理流程的情况下传递运动线索。通过多个应用,我们展示了在生成视频中对外观和动态的有效控制,同时与更复杂的条件方法相比取得了具有竞争力或更优的结果。

英文摘要

Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.
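One possible reading of "injecting low-frequency phase information into the noise latents" is sketched below with NumPy FFTs. The function name, the radial `cutoff`, and the choice to keep the noise's own amplitude spectrum are assumptions made for illustration. The abstract does not give the paper's exact procedure, and the arrays here are random stand-ins for real latents.

```python
import numpy as np

def inject_low_freq_phase(noise, reference, cutoff=0.15):
    """Replace the low-frequency phase of `noise` with that of `reference`.

    Both inputs are real arrays of identical shape, e.g. (frames, height, width).
    The amplitude spectrum of the noise is left unchanged.
    """
    assert noise.shape == reference.shape
    noise_f = np.fft.fftn(noise)
    ref_f = np.fft.fftn(reference)
    # Radial frequency of every FFT bin, in cycles per sample.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in noise.shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    low = radius <= cutoff  # symmetric under frequency negation, so output stays real
    phase = np.where(low, np.angle(ref_f), np.angle(noise_f))
    mixed = np.abs(noise_f) * np.exp(1j * phase)
    return np.fft.ifftn(mixed).real

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 16, 16))      # stand-in for a Gaussian noise latent
reference = rng.standard_normal((8, 16, 16))  # stand-in for a reference-video latent
out = inject_low_freq_phase(noise, reference)
```

Because only the phase is swapped and only below the cutoff, the result has the same per-frequency energy as the original noise while carrying the reference's coarse spatio-temporal layout.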

2605.24503 2026-05-26 cs.CV cs.AI 版本更新

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

FoodMonitor:用于可解释合规性分析的多模态大语言模型基准测试

Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu, Haoji Zhang, Yansong Tang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学)

AI总结 针对现有视频异常检测缺乏规则驱动可解释性的问题,提出FoodMonitor基准,包含双通道违规标注和两阶段匹配评估协议,揭示当前多模态大语言模型在空间定位和细粒度规则理解上的瓶颈。

详情
AI中文摘要

随着基于AI的合规性监测在公共治理和工业安全中日益重要,提供可验证证据和可追溯问责信号的能力至关重要。然而,现有的视频异常检测数据集侧重于事件级二元分类,缺乏真实世界合规场景所需的规则驱动、可解释分析。我们引入了FoodMonitor,一个用于商业厨房监控中可解释合规性分析的基准。FoodMonitor包含477个视频片段,具有3307个违规标注,采用双通道设计覆盖人员级和环境级违规。每个标注指定了违反哪条规则、发生了何种不合规行为以及由谁实施,并附有帧级边界框。我们建立了一个统一的评估协议,包含两阶段匹配机制,分别评估空间定位和语义理解,以及一个复合指标($C_{\text{score}}$),平衡环境和人员检测性能。对几种最先进的多模态大语言模型的系统评估显示,表现最佳的模型仅达到0.360 $C_{\text{score}}$,空间定位和细粒度规则理解成为主要瓶颈。我们的分析识别出两种不同的失败模式:定位主导的错误和语义主导的错误,为未来模型开发提供了诊断性见解。

英文摘要

As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verifiable evidence and traceable accountability signals is essential. However, existing video anomaly detection datasets focus on event-level binary classification, lacking the rule-driven, explainable analysis required for real-world compliance scenarios. We introduce FoodMonitor, a benchmark for explainable compliance analysis in commercial kitchen surveillance. FoodMonitor comprises 477 video clips with 3,307 violation annotations across a dual-channel design covering both person-level and environment-level violations. Each annotation specifies which rule was violated, what non-compliant behavior occurred, and who committed it with frame-level bounding boxes. We establish a unified evaluation protocol with a two-stage matching mechanism that separately assesses spatial localization and semantic understanding, along with a composite metric ($C_{\text{score}}$) that balances environment and person detection performance. Systematic evaluation of several state-of-the-art multimodal large language models reveals that the best-performing model achieves only 0.360 $C_{\text{score}}$, with spatial localization and fine-grained rule understanding emerging as the primary bottlenecks. Our analysis identifies two distinct failure modes: localization-dominated errors and semantics-dominated errors, providing diagnostic insights for future model development.

2605.24497 2026-05-26 cs.AI 版本更新

Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs

推理作为攻击面:针对大语言模型的自适应进化思维链越狱方法

Jianan Li, Simeng Qin, Xiaojun Jia, Lionel Z. Wang, Tianhang Zheng, Xiaoshuang Jia, Yang Liu, Xiaochun Cao

发表机构 * Nanyang Technological University, Singapore(南洋理工大学) The University of Hong Kong, Hong Kong, China(香港大学) The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学) Zhejiang University, Zhejiang, China(浙江大学) Renmin University of China, Beijing, China(中国人民大学) Sun Yat-sen University, Guangdong, China(中山大学) Northeastern University(东北大学) Hebei Key Laboratory of Data Science and Knowledge Management(河北省数据科学与知识管理重点实验室)

AI总结 提出自适应进化思维链越狱框架AE-CoT,通过教师角色扮演重写有害目标、分解推理片段、多代进化搜索及自适应变异率控制,有效生成高破坏性越狱提示,在多个模型和数据集上超越现有方法。

详情
AI中文摘要

大型推理模型(LRM)在推理和生成任务中展现出卓越能力,并越来越多地部署于实际应用。然而,其显式的思维链(CoT)机制引入了新的安全风险,使其特别容易受到越狱攻击。现有方法通常依赖静态CoT模板来引发有害输出,但这种固定设计存在多样性、适应性和有效性不足的问题。为克服这些局限,我们提出一种自适应进化CoT越狱框架,称为AE-CoT。具体而言,该方法首先通过教师角色扮演将有害目标重写为温和提示,并将其分解为语义连贯的推理片段,构建CoT越狱候选池。然后,在结构化表示空间内,进行多代进化搜索,通过片段级交叉和具有自适应变异率控制机制的变异策略扩展候选多样性。一个独立的评分模型提供分级有害性评估,高分候选者进一步通过有害CoT模板增强,以诱导更具破坏性的生成。跨多个模型和数据集的广泛实验证明了所提出的AE-CoT的有效性,其持续优于最先进的越狱方法。

英文摘要

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in real-world applications. However, their explicit chain-of-thought (CoT) mechanism introduces new security risks, making them particularly vulnerable to jailbreak attacks. Existing approaches often rely on static CoT templates to elicit harmful outputs, but such fixed designs suffer from limited diversity, adaptability, and effectiveness. To overcome these limitations, we propose an adaptive evolutionary CoT jailbreak framework, called AE-CoT. Specifically, the method first rewrites harmful goals into mild prompts with teacher role-play and decomposes them into semantically coherent reasoning fragments to construct a pool of CoT jailbreak candidates. Then, within a structured representation space, we perform multi-generation evolutionary search, where candidate diversity is expanded through fragment-level crossover and a mutation strategy with an adaptive mutation-rate control mechanism. An independent scoring model provides graded harmfulness evaluations, and high-scoring candidates are further enhanced with a harmful CoT template to induce more destructive generations. Extensive experiments across multiple models and datasets demonstrate the effectiveness of the proposed AE-CoT, consistently outperforming state-of-the-art jailbreak methods.

2605.24490 2026-05-26 cs.AI cs.LG q-fin.PM 版本更新

Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

市场制度委员会:多智能体LLM决策系统中的动态信用分配

Yunhua Pei, Zerui Ge, Jin Zheng, John Cartlidge

发表机构 * University of Bristol, UK(布里斯托大学)

AI总结 提出市场制度委员会(MRC),一种基于Shapley值进行在线智能体加权、贝叶斯自适应混合和制度依赖乘数的多智能体决策系统,在加密货币投资中实现高夏普比率和累计收益。

Comments 35 pages, 13 figures, preprint

详情
AI中文摘要

用于投资组合管理的多智能体LLM决策系统仍然缺乏一种原则性的方法来跨专业智能体分配信用,在制度转变下容易受到冷启动主导的影响,并且最终分配如何形成的透明度有限。我们提出了市场制度委员会(MRC),一种合作式多智能体决策系统,它计算所有单个、成对和大联盟输出的精确Shapley信用,用于在线智能体加权。实例化为N=3个专业智能体,在每个交易周期,MRC从指数加权性能历史中重新计算基于联盟的Shapley权重,使用贝叶斯自适应混合来稳定早期阶段,应用制度依赖乘数调整智能体权威,并通过五层因果追踪记录每次再平衡。在13种加密资产和5个种子的1037个交易日中,MRC实现了1.51的夏普比率和440.1%的累计收益,在主动基准中排名第一(CR、SR和IR),并在主动方法中实现了最低的最大回撤。消融实验表明,收益来自跨联盟输出的Shapley加权集成,而非任何单一阶段。代码和演示数据包含在补充材料中。

英文摘要

Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold-start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi-agent decision system that computes exact Shapley credits across all single, pairwise, and Grand-coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition-based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime-dependent multipliers to adjust agent authority, and records each rebalance through a five-layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley-weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.
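For readers unfamiliar with exact Shapley credits, the following sketch enumerates all coalitions for N=3 agents, which is cheap at this size (seven non-empty coalitions). The agent names and coalition scores are hypothetical placeholders, not values from the paper, and the sketch covers only the credit computation, not MRC's mixture, multipliers, or tracing.

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, value):
    """Exact Shapley value per agent.

    `value` maps a frozenset of agents to that coalition's score;
    missing coalitions are treated as scoring 0.
    """
    n = len(agents)
    credits = {}
    for a in agents:
        others = [b for b in agents if b != a]
        total = 0.0
        for k in range(n):  # size of the coalition that agent `a` joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                marginal = value.get(s | {a}, 0.0) - value.get(s, 0.0)
                total += weight * marginal
        credits[a] = total
    return credits

# Hypothetical specialist agents and coalition scores, for illustration only.
agents = ["trend", "news", "risk"]
value = {
    frozenset(): 0.0,
    frozenset({"trend"}): 1.0,
    frozenset({"news"}): 2.0,
    frozenset({"risk"}): 0.0,
    frozenset({"trend", "news"}): 4.0,
    frozenset({"trend", "risk"}): 1.0,
    frozenset({"news", "risk"}): 2.0,
    frozenset({"trend", "news", "risk"}): 6.0,
}
credits = shapley_values(agents, value)
```

The credits always sum to the grand-coalition score (the efficiency property), so they can be normalized directly into agent weights.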

2605.24489 2026-05-26 cs.AI q-bio.BM 版本更新

TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

TIGER:文本引导的通用酶-反应检索

Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song

发表机构 * University of Science and Technology of China(中国科学技术大学) City University of Hong Kong(香港城市大学) Zhejiang University(浙江大学)

AI总结 提出TIGER框架,利用蛋白质到文本生成模型提取文本语义知识,通过动态门控网络融合序列特征,实现酶与反应的双向检索,显著提升跨任务泛化性和鲁棒性。

Comments Accepted to ACL2026

详情
AI中文摘要

酶-反应检索是计算生物学中的一个基本问题,支撑着酶表征、反应机理阐明以及代谢途径和生物催化剂的合理设计。作为一个双向任务,它涉及酶到反应和反应到酶的映射。然而,现有方法在跨任务和跨分布泛化方面表现不佳,性能对数据集分割高度敏感,且检索方向之间存在显著的不对称性。为了应对这些挑战,我们提出了TIGER,一个文本引导的通用酶-反应检索框架,利用蛋白质到文本生成模型从酶序列中提取文本语义知识,提供连接酶和生化反应的通用表示。为了确保文本语义的质量和可靠性,我们设计了一个动态门控网络,自适应地将文本派生知识与序列特征融合,从而产生更一致和信息丰富的酶表示,同时一个结构共享特征投影器将酶和反应表示对齐到统一的潜在空间中。大量实验表明,在双向检索监督下,TIGER在多种分布上显著优于最先进的基线,并展现出强大的鲁棒性和跨任务迁移能力。

英文摘要

Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.

2605.24486 2026-05-26 cs.AI cs.CL 版本更新

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

AgentFugue:通过集体推理实现长时域任务的智能体扩展

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, Zhicheng Dou

发表机构 * GSAI, Renmin University of China(GSAI,中国人民大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出AgentFugue框架,通过共享推理中心实现多个对等智能体并行探索和选择性信息共享,无需显式角色分工或工作流编排,从而提升长时域任务性能。

详情
AI中文摘要

近期长时域智能体任务的进展主要通过更强模型、更好工具和更有效脚手架来扩展单个智能体。相比之下,对于扩展(scaling out)的理解要少得多:多个对等智能体,都针对同一任务,能否在不依赖显式角色分工或工作流编排的情况下成为额外能力来源?我们研究这个问题并提出AgentFugue,一个围绕共享推理中心构建的集体推理框架。当对等智能体并行探索同一任务时,中心记录每个智能体已建立、尝试或排除的简明笔记,并使每个智能体能够以对其当前搜索有用的形式选择性访问其他智能体的发现。这种设计将原本孤立的轨迹转变为可重用中间推理的互联生态,无需集中规划。我们将中心实例化为一个即插即用的通信层,使用监督微调和端到端强化学习进行训练。在我们研究的具有挑战性的长时域设置中,AgentFugue优于强基线。我们的结果表明,集体推理可以将对等智能体系统的扩展转变为能力增益的独特来源,而不仅仅是消耗更多计算的方式。

英文摘要

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

2605.24484 2026-05-26 cs.AI cs.LG 版本更新

SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver

SPACE:统一对称与非对称路由问题的通用神经求解器

Rongsheng Chen, Changliang Zhou, Canhong Yu, Yuanyao Chen, Yu Zhou, Zhuo Chen, Zhenkun Wang

发表机构 * School of Automation and Intelligent Manufacturing, Southern University of Science and Technology, Shenzhen, China(自动化与智能制造学院,南方科技大学,深圳,中国) Pengcheng Laboratory, Shenzhen, China(鹏城实验室,深圳,中国) Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, Southern University of Science and Technology, Shenzhen, China(广东省全驱动系统控制理论与技术重点实验室,南方科技大学,深圳,中国) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China(计算机科学与软件工程学院,深圳大学,深圳,中国)

AI总结 针对现有神经求解器在对称与非对称车辆路径问题中表现不一致的问题,提出基于空间枢轴对齐的无坐标嵌入框架SPACE,通过双向弗雷歇表示和权重解耦自适应解码机制,实现统一节点表示与解生成,在110个变体上取得优异零样本泛化。

详情
AI中文摘要

通用神经路由求解器在利用统一模型解决多种车辆路径问题(VRPs)方面显示出巨大潜力。然而,现有求解器通常局限于对称设置,或在切换到非对称设置时由于输入不一致或固有结构差异而性能下降,这严重限制了它们在包含两种场景的实际应用中的实用性。为解决这一限制,我们基于每个节点到特定枢轴集的相对距离定义其空间位置,并进一步提出一种空间枢轴对齐的无坐标嵌入(SPACE)框架,该框架统一了对称和非对称VRP中的节点表示和解生成。具体而言,我们使用一种新颖的最远枢轴采样策略构建双向弗雷歇表示,以实现跨不同问题设置的不变节点表示。此外,我们引入了一种权重分解的自适应解码机制,将几何感知从问题表示中解耦,减轻约束决策对特定几何设置的过拟合。在110个VRP变体(包括55个对称问题及其非对称对应问题)上的大量实验表明,SPACE在对称和非对称VRP中均实现了有前景的零样本泛化。

英文摘要

Generalist neural routing solvers have shown great potential in solving diverse vehicle routing problems (VRPs) with a unified model. However, existing solvers are typically limited to symmetric settings or degrade in performance when switching to asymmetric settings due to input inconsistencies or inherent structural differences, substantially limiting their practicality in real-world scenarios that encompass both settings. To address this limitation, we define the spatial position of each node based on the relative distances to a specific set of pivots and further propose a Spatial Pivot-Aligned Coordinate-free Embedding (SPACE) framework that unifies node representation and solution generation across symmetric and asymmetric VRPs. Specifically, we construct a bidirectional Fréchet representation using a novel furthest pivot sampling strategy to enable invariant node representations across distinct problem settings. Furthermore, we introduce a weight-decomposed adaptive decoding mechanism that decouples geometric perception from problem representations, mitigating the overfitting of constraint decisions to a specific geometry setting. Extensive experiments on 110 VRP variants, comprising 55 symmetric problems and their asymmetric counterparts, demonstrate that SPACE achieves promising zero-shot generalization in both symmetric and asymmetric VRPs.
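The idea of describing each node by its distances to a set of pivots, rather than by coordinates, can be illustrated with a small NumPy sketch. The greedy selection rule, the symmetrization used only when picking pivots, and the function names are assumptions for illustration. The abstract does not spell out the paper's sampling strategy or its exact bidirectional representation.

```python
import numpy as np

def furthest_pivots(dist, k, start=0):
    """Greedy furthest-point sampling on a (possibly asymmetric) distance matrix."""
    sym = np.minimum(dist, dist.T)  # assumption: symmetrize only for pivot selection
    pivots = [start]
    closest = sym[start].copy()     # distance from each node to its nearest pivot
    for _ in range(k - 1):
        nxt = int(np.argmax(closest))
        pivots.append(nxt)
        closest = np.minimum(closest, sym[nxt])
    return pivots

def pivot_embedding(dist, pivots):
    """Coordinate-free node features: costs to the pivots and from the pivots."""
    to_pivots = dist[:, pivots]      # cost of travelling node -> pivot
    from_pivots = dist[pivots, :].T  # cost of travelling pivot -> node
    return np.concatenate([to_pivots, from_pivots], axis=1)

# Four nodes on a line, with an extra 0.5 cost when travelling to a higher index.
x = np.array([0.0, 1.0, 2.0, 10.0])
base = np.abs(x[:, None] - x[None, :])
dist = base + 0.5 * np.triu(np.ones((4, 4)), k=1)
pivots = furthest_pivots(dist, k=2)
emb = pivot_embedding(dist, pivots)
```

In the symmetric case the two halves of each feature vector coincide, so the same input format covers both settings without needing node coordinates.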

2605.24475 2026-05-26 cs.CV cs.AI cs.MM 版本更新

Robust Fuzzy Multi-view Learning under View Conflict

视角冲突下的鲁棒模糊多视角学习

Siyuan Duan, Yuan Sun, Dezhong Peng, Yingke Chen, Xi Peng, Peng Hu

发表机构 * College of Computer Science, Sichuan University(四川大学计算机学院) Tianfu Jincheng Laboratory(天府锦城实验室) School of Artificial Intelligence, Sichuan University(四川大学人工智能学院)

AI总结 针对多视角分类中视角冲突问题,提出基于模糊集理论的鲁棒模糊多视角学习框架(R-FUML),通过模糊隶属度量化类别可信度、熵值融合及冲突样本惩罚机制,提升鲁棒性和不确定性估计。

详情
AI中文摘要

可信多视角分类旨在提供可靠的融合以实现准确预测,近年来在学术界和工业界引起了广泛关注。然而,现有的TMVC方法通常假设训练和测试阶段不同视角之间严格对齐,这在现实场景中往往不切实际。这一局限性促使我们重新审视TMVC并将其扩展到更具挑战性的设置:如何在训练和推理过程中减轻视角冲突(VC)的影响。针对这一设置,现有的TMVC方法存在三个关键缺陷:低估不确定性、误导性决策以及对VC的过拟合。为解决这些问题,本文提出了一种基于模糊集理论的新型鲁棒模糊多视角学习(R-FUML)框架。具体而言,R-FUML将网络输出建模为模糊隶属度以量化类别可信度,并使用基于熵的方法进行可靠的多视角融合。为此,我们提出了一种鲁棒多视角融合(RMF)策略,该策略同时考虑了视角特定的不确定性和视角间的冲突,从而减轻VC对决策的不利影响。为了在训练过程中识别并克服VC,我们进一步设计了一种针对VC的鲁棒学习(RLVC)框架。RLVC通过利用神经网络的记忆效应隔离冲突样本,然后通过对这些冲突视角施加惩罚来重新训练模型。在八个公开数据集上的大量实验表明,R-FUML在鲁棒性和不确定性估计方面始终优于15个最先进的基线方法。代码将在论文被接收后发布。

英文摘要

Trusted multi-view classification (TMVC) aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention in both academia and industry. However, existing TMVC methods typically assume strict alignment across different views during both training and testing phases, which is often impractical in real-world scenarios. This limitation motivates us to revisit TMVC and extend it to a more challenging setting: how to mitigate the impact of view conflict (VC) during both training and inference. In this setting, existing TMVC methods suffer from three critical limitations: underestimated uncertainty, misleading decisions, and overfitting to VC. To address these issues, this paper proposes a novel Robust Fuzzy Multi-View Learning (R-FUML) framework grounded in Fuzzy Set Theory. Specifically, R-FUML models network outputs as fuzzy memberships to quantify category credibility and uses an entropy-based method for reliable multi-view fusion. To this end, we present a Robust Multi-view Fusion (RMF) strategy that accounts for both view-specific uncertainty and inter-view conflicts, thereby alleviating the adverse impacts of VC on decision-making. To identify and conquer VC during training, we further design a Robust Learning Against VC (RLVC) framework. RLVC isolates conflicting samples by leveraging neural networks' memory effects and then retrains the model by applying a penalty to these conflicting views. Extensive experiments across eight public datasets demonstrate that R-FUML consistently outperforms 15 state-of-the-art baselines in robustness and uncertainty estimation. The code will be released upon acceptance.

2605.24468 2026-05-26 cs.AI 版本更新

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

SAM:面向长程推理智能体的状态自适应记忆

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Ziliang Zhao, Jiejun Tan, Zheng Liu, Zhicheng Dou

发表机构 * GSAI, Renmin University of China(GSAI,中国人民大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出状态自适应记忆框架SAM,通过紧凑记忆线索和原始轨迹页面实现意图驱动的信息重建,无需重新训练基础模型,在多个基准上超越强基线。

详情
AI中文摘要

长程智能体推理要求大语言模型在包含思考、工具调用、观察和部分结论的长时间交互历史中行动。挑战不仅在于这些历史变长,而且当前决策所需的信息可能分散在遥远的步骤中,并且只在后来才变得相关。现有方法通过截断交互历史、将其压缩为更短的替代品或检索其选定部分进行重用来解决这一困难,但它们没有明确建模对过去交互的访问应如何适应智能体不断变化的状态。相反,我们将长程推理视为一个状态自适应记忆问题。为此,我们提出了状态自适应记忆(SAM),这是一个独立的框架,它将正在进行的交互整合为紧凑的记忆线索,同时保留原始轨迹页面用于意图驱动的回忆。这些线索不被视为历史的替代品;相反,它们充当轻量级句柄,使智能体能够根据当前需求重建时间上遥远的信息,而无需重新训练底层骨干网络。我们进一步通过专家引导的监督和强化学习优化记忆模块,使其与轨迹级别的效用对齐。在BrowseComp、BrowseComp-ZH、WideSearch和HLE上,SAM在各种智能体骨干网络上持续优于强基线。我们的结果表明,显式记忆建模为长程智能体推理提供了一个简单而有效的基础。

英文摘要

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.

2605.24458 2026-05-26 cs.LG cs.AI 版本更新

Balancing Fairness, Privacy, and Accuracy: A Multitask Adversarial Framework for Centralized Data-Driven Systems

平衡公平性、隐私和准确性:面向集中式数据驱动系统的多任务对抗框架

Imesh Ekanayake, Elham Naghizade, Jeffrey Chan

发表机构 * School of Computing Technologies, RMIT University(计算技术学院,皇家墨尔本理工大学)

AI总结 提出一种多任务对抗模型,将公平性和隐私作为核心目标,通过优化代价函数动态平衡三者,在最小化性能损失的同时实现高公平性和隐私保护。

Comments 13 Pages, 6 figures, IEEE TKDE

Abstract

The integration of fairness and privacy in centralized data-driven applications is critical, especially as these systems increasingly influence sectors with significant societal impact. Current methods rarely address privacy, fairness, and accuracy together, which can compromise ethical standards and privacy regulations. However, balancing these three objectives is challenging because each objective often imposes conflicting requirements on the design and training of models, making it difficult to optimize one without compromising the others. This paper introduces a novel multitask adversarial model that treats fairness and privacy as integral objectives rather than afterthoughts, and learns a latent representation that hides sensitive attributes while preserving essential task-related information. Our approach dynamically balances fairness with accuracy and privacy through an optimized cost function with minimal performance loss even under strict conditions. Extensive testing on diverse datasets shows the ability of our model to achieve high standards of fairness and privacy without significant sacrifice to accuracy. Benchmarking against state-of-the-art privacy and fairness standards shows that our method enhances the robustness of privacy, fairness, and accuracy optimization, proving its adaptability across various datasets.

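2605.24453 2026-05-26 cs.SE cs.AI updated version

Code2UML: Agentic LLMs with context engineering for scalable software visualization

Alin-Gabriel Văduva, Anca-Ioana Andreescu, Simona-Vasilica Oprea, Adela Bâra

Affiliations * Bucharest University of Economic Studies

AI summary Proposes an agentic architecture built on five specialized agents and a deterministic IR compaction layer for automatically generating UML diagrams from source code repositories. Evaluated on 12 open-source repositories and 7 UML diagram types, it shows high syntactic validity (mean 91.5%) and structural quality (mean 81.7/100), with quality that does not degrade with scale.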

Abstract

Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of these approaches to real codebases, where Intermediate Representations (IR) exceed LLM context limits, remains underexplored. This paper introduces an agentic architecture with context engineering for automated UML diagram generation from source code repositories. It employs a hierarchy of five specialized agents: PlannerAgent, AnalyzerAgent, DiagramAgent, CorrectorAgent and DependencyAnalyzerAgent, built on the Claude Agent SDK, each addressing a distinct cognitive subtask. A deterministic, importance-weighted IR compaction layer transforms full project IRs into diagram-specific views guaranteed to fit within token constraints, requiring no LLM calls and completing in milliseconds. We evaluate the system across 12 open-source repositories in 4 programming languages (Java, JavaScript, PHP, Python) and 7 UML diagram types, producing 84 observations assessed on 5 automated metrics. Results demonstrate high syntactic validity (mean: 91.5%, with component and deployment diagrams reaching 100%), strong relationship precision (mean: 0.858) and consistent structural quality (mean: 81.7/100, with cross-language variance of 3.1 points). Entity recall averaged 0.313, reflecting deliberate architectural prioritization over exhaustive coverage. A sensitivity analysis (31 to 4,578 IR entities) confirms that quality scores remain stable regardless of scale.
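
Illustrative sketch (not from the paper): a deterministic, importance-weighted compaction step can be as simple as a greedy selection under a token budget. The tuple layout `(name, importance, token_cost)`, the function name `compact_ir`, and the sample entities are hypothetical; the abstract does not say how importance is computed or how the real layer chooses entities.

```python
def compact_ir(entities, token_budget):
    """Greedy, deterministic selection of IR entities under a token budget.

    `entities` is a list of (name, importance, token_cost) tuples; all three
    fields are hypothetical stand-ins for whatever a real IR records.
    """
    # Sort by importance (descending), breaking ties by name so the result
    # is identical on every run.
    ranked = sorted(entities, key=lambda e: (-e[1], e[0]))
    kept, used = [], 0
    for name, importance, cost in ranked:
        if used + cost <= token_budget:   # never exceed the budget
            kept.append(name)
            used += cost
    return kept, used


ir = [
    ("OrderService", 0.9, 120),
    ("OrderRepository", 0.7, 80),
    ("StringUtils", 0.1, 200),
    ("PaymentGateway", 0.8, 150),
]
view, tokens = compact_ir(ir, token_budget=300)
```

Because the selection involves no model call and no randomness, the same IR and budget always yield the same view, and the budget check makes the fit a guarantee rather than a best effort.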

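2605.24452 2026-05-26 cs.CL cs.AI cs.LG updated version

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

Volodymyr Ovcharov

AI summary Studies temporal drift in legal language by fine-tuning four Transformer encoders on Ukrainian court decisions from three epochs (pre-war, hybrid war, full-scale invasion). Finds severe forward degradation (up to 27.2 percentage points), that legal-domain pretraining does not improve absolute performance but does reduce drift, and that chronological continual learning eliminates catastrophic forgetting.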

Comments 17 pages, 6 tables, 5 figures. Dataset: https://huggingface.co/datasets/overthelex/ukrainian-court-decisions

Abstract

Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption by fine-tuning four transformer encoders -- XLM-RoBERTa (base and large) and their legal-domain variants -- on Ukrainian court decisions from three temporal epochs defined by geopolitical disruptions: pre-war (2008-2013), hybrid war (2014-2021), and full-scale invasion (2022-2026). Each model is trained on one epoch and evaluated on all three, producing a 3x3 cross-temporal generalization matrix. Four findings emerge. (1) Forward degradation is severe: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion era decisions. (2) The degradation is asymmetric: backward transfer (full-scale to pre-war) is substantially more robust than forward transfer, consistent with the hypothesis that legal language is additive. (3) Legal-domain pretraining (Legal-XLM-R) does not improve absolute performance but reduces forward degradation magnitude and asymmetry. (4) Chronological continual learning eliminates catastrophic forgetting for general XLM-R: pre-war knowledge is fully retained (+1.8 to +6.2 pp) while full-scale performance gains +16.5 to +19.0 pp; reverse-chronological training causes severe forgetting. Cross-jurisdictional pretraining on Swiss Judgment Prediction data improves absolute performance but does not reduce temporal degradation magnitude, confirming that temporal drift is an intrinsic property of legal language evolution. The dataset (428K decisions across three epochs) is publicly available as a LEXTREME contribution.
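
Illustrative sketch (not from the paper): given a 3x3 cross-temporal matrix, forward and backward transfer gaps can be read off by comparing each off-diagonal cell with the in-epoch score of the same model. The scores below are placeholders invented for the example, not results from the paper, and defining a gap as "diagonal minus off-diagonal" is an assumption about how degradation is measured.

```python
EPOCHS = ["pre-war", "hybrid war", "full-scale"]

# Placeholder macro-F1 scores, NOT taken from the paper.
# f1[i][j] = model trained on EPOCHS[i], evaluated on EPOCHS[j].
f1 = [
    [0.80, 0.70, 0.60],
    [0.78, 0.82, 0.72],
    [0.76, 0.79, 0.84],
]


def transfer_gaps(matrix):
    """Return (forward, backward) gaps in percentage points.

    A gap is the in-epoch score (diagonal) minus the cross-epoch score,
    so larger values mean more degradation.
    """
    forward, backward = {}, {}
    n = len(matrix)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gap = round((matrix[i][i] - matrix[i][j]) * 100, 1)
            # j > i means evaluating on a later epoch than the training epoch.
            (forward if j > i else backward)[(i, j)] = gap
    return forward, backward


forward, backward = transfer_gaps(f1)
```

With this layout, asymmetry shows up as the forward gaps (above the diagonal) being larger than the backward gaps (below it).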

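2605.24425 2026-05-26 cs.LG cs.AI cs.CL updated version

Momentum Streams for Optimizer-Inspired Transformers

Jingchu Gai, Nai-Chieh Huang, Jiayun Wu

Affiliations * Carnegie Mellon University

AI summary Proposes a family of optimizer-inspired Transformers (such as the triple-momentum TMMFormer) by interpreting residual updates as optimizer steps, and finds that momentum is the key source of the gains: it leads to flatter minima, less forgetting, and better generalization.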

Abstract

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.
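
Illustrative sketch (not from the paper): the optimizer reading of a residual stack can be shown with a toy energy. The code below uses a plain heavy-ball momentum stream as a simplified stand-in for the triple-momentum scheme, which the abstract does not specify; `toy_sublayer`, the step size, and `beta` are assumptions chosen for the example.

```python
def residual_stack(x, sublayers, beta=0.0):
    """Run a stack of sublayers with a heavy-ball style momentum stream.

    beta = 0 recovers the plain residual update x <- x + f(x).
    Each sublayer stands in for an attention or MLP block and plays the
    role of a (negative) gradient oracle.
    """
    v = [0.0] * len(x)                      # momentum stream, same shape as x
    for f in sublayers:
        g = f(x)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


def toy_sublayer(x, step=0.1):
    # Negative gradient step on the surrogate energy E(x) = 0.5 * ||x||^2.
    return [-step * xi for xi in x]


layers = [toy_sublayer] * 8
plain = residual_stack([1.0, -2.0], layers, beta=0.0)
momentum = residual_stack([1.0, -2.0], layers, beta=0.5)
```

On this toy energy the momentum variant ends closer to the minimizer at zero after the same eight layers, which mirrors the matched-depth comparison in spirit only; it says nothing about the paper's actual training results.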

2605.24423 2026-05-26 cs.AI updated version

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

Yuheng Jing, Kai Li, Ziwen Zhang, Jiajun Zhang, Zeyao Ma, Jiaxi Yang, Lei Zhang, Zhe Wu, Jinmin He, Junliang Xing, Jian Cheng

Affiliations * C²DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Department of Computer Science and Technology, Tsinghua University; University of Science and Technology of China; Qwen Team, Alibaba Group

AI summary Introduces the ICRL4AHT benchmark, built on Overcooked-V2, to evaluate in-context reinforcement learning in ad-hoc teamwork. The evaluated algorithms often underperform random baselines on unseen teammates and layouts, highlighting how hard adaptation is in multi-agent settings.

Comments 41 pages, 14 figures

Abstract

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the Overcooked-V2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.

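2605.24420 2026-05-26 cs.LG cs.AI updated version

Batch Normalization Amplifies Memorization and Privacy Risks

Ngoc Phu Doan, Chongyan Gu, Ihsen Alouani

Affiliations * Queen’s University Belfast

AI summary Through empirical and theoretical analysis, shows that batch normalization layers substantially increase a model's memorization of outlier samples and thereby raise the risk of privacy leakage.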

Abstract

Batch Normalization (BN) is widely adopted to enable faster convergence and more stable training of deep neural networks. However, its impact on privacy and memorization has remained largely unexplored. In this work, we investigate the effect of BN layers on the memorization of atypical or outlier samples and its implications for privacy leakage. We conduct an extensive empirical study using three complementary approaches: (i) unintended memorization of out-of-distribution training samples, (ii) per-sample influence measured via gradient norms, and (iii) susceptibility to membership inference attacks (MIA). Across multiple datasets and architectures, we consistently observe that BN substantially increases the memorization of outliers compared to models without BN. Critically, this amplified memorization translates directly into privacy vulnerabilities: models with BN exhibit significantly higher susceptibility to MIAs. We complement our empirical findings with a theoretical analysis showing that BN amplifies the per-step influence of outlier samples during training, providing mechanistic insight into this phenomenon. Our results highlight an underappreciated privacy risk associated with BN and provide both practical and theoretical insights into how normalization layers can amplify the influence of rare or sensitive training examples.

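2605.24414 2026-05-26 cs.AI updated version

JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

Junlan Feng, Fanyu Meng, Chong Long, Pengyu Cong, Duqing Wang, Yan Zheng, Yuyao Zhang, Xuanchang Gao, Ye Yuan, Yunfei Ma, Zhijie Ren, Fan Yang, Na Wu, Di Jin, Chao Deng

Affiliations * JIUTIAN Research

AI summary Introduces the JT-Safe-V2 large language model, which jointly optimizes general intelligence and safety-by-design through world-knowledge pre-training, high-certainty training, and safety-strengthening post-training, together with the Safe-MoMA framework for lowering inference cost. Reports state-of-the-art results on general intelligence and safety benchmarks.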

Abstract

We introduce JT-Safe-V2, a large language model designed to advance the safety and trustworthiness of foundation models, extending our previous JT-Safe model toward a more comprehensive safety-by-design paradigm. JT-Safe-V2 emphasizes the joint optimization of general intelligence and safety-by-design through several key innovations: enriching pre-training data with contextual world knowledge, high-certainty pre-training procedures, and safety-strengthening post-training mechanisms for enterprise-oriented agentic capabilities. Building on these safety-enhanced foundation models, we propose Safe-MoMA (Safe Mixture of Models and Agents), a framework that enables traceable and efficient inference through the orchestrated deployment of multiple models and agents. Extensive evaluations demonstrate that JT-Safe-V2 achieves state-of-the-art performance across both general intelligence and safety benchmarks. Moreover, Safe-MoMA reduces inference costs by more than 30% compared to using the largest standalone model baseline while maintaining comparable performance. To facilitate future research on safety-by-design foundation models, we publicly release the post-trained JT-Safe-V2-35B model checkpoint.

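2605.24411 2026-05-26 cs.AI cs.LG updated version

The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching

Alexander Mihalcea

Affiliations * iOS application

AI summary Presents Psych LM, an iOS application with a local-first architecture that achieves the practical effect of a near-infinite context window through an automated memory corpus and retrieval-augmented generation, providing reliable context-aware psychological coaching on mobile devices.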

Comments 10 pages, 3 figures

Abstract

Existing language model applications struggle to meet the demand for emotionally oriented support, primarily due to their inability to maintain deep, persistent context across sessions. This report introduces Psych LM, an iOS application that validates the thesis that, for such applications, the surrounding architecture is paramount. Psych LM runs a local, on-device language model within a purpose-built, local-first runtime designed for behavioral and life-coaching applications. The system achieves the practical effect of a near-infinite context window through an automated, user-inspectable memory corpus that converts conversations into structured memory cards, including facts, goals, and events, and dynamically injects them into the prompt via semantic and vector search. As such, the system can be defined as an active-learning, retrieval-augmented generative, on-device architecture. This architecture delivers four primary contributions: a local-first design where privacy is a core property; a detailed description of the memory corpus for persistent context of key user information; a deterministic orchestration layer that provides a stable behavioral spine independent of the model's internal state; and a benchmark framework focused on evaluating the integrated system's reliability under realistic operating conditions. The R&D process confirms that complex, context-aware interaction can be reliably achieved under the strict constraints of a mobile environment by prioritizing architectural control and resource management over simple model size.

2605.24410 2026-05-26 cs.AI updated version

Advancing Graph Few-Shot Learning via In-Context Learning

Renchu Guan, Yajun Wang, Chunli Guo, Bowen Cao, Fausto Giunchiglia, Wei Pang, Yonghao Liu, Xiaoyue Feng

Affiliations * College of Computer Science and Technology, Jilin University; College of Software, Jilin University; Department of Computer Science and Technology, Yanbian University; Department of Information Engineering and Computer Science, University of Trento; School of Mathematical and Computer Sciences, Heriot-Watt University

AI summary Proposes VISION, which reframes graph few-shot learning as a fine-tuning-free sequence reasoning problem, uses an unsupervised task generator to build pseudo-tasks from unlabeled data, and fuses local topology with global task dependencies through a context-aware network for efficient inference.

Comments KDD26

Abstract

Graph few-shot learning, which aims to classify nodes from novel classes with only a few labeled examples, is a widely studied problem in graph learning. However, existing methods often face two key limitations. First, the predominant graph few-shot learning paradigm relies on supervised tasks, failing to leverage the vast number of unlabeled nodes in the graph. Second, many approaches require complex task adaptation or fine-tuning during inference, limiting their efficiency and applicability. Inspired by the powerful in-context learning capabilities of large language models, we propose a novel model named VISION for adVancIng graph few-Shot learning via In-cOntext LearNing to address these challenges. Our model reframes graph few-shot learning as a fine-tuning-free sequence reasoning problem. At its core is a context-aware network that initializes nodes with role embeddings and employs a dual-context fusion module to synergistically integrate local topological structures and global task-level dependencies. This allows our model to dynamically generate class-aware representations for the query set conditioned on the support set context in a single forward pass. To effectively train our model, we introduce an unsupervised task generator that creates structure-adaptive features and constructs diverse pseudo-tasks from abundant unlabeled data. Our method unifies unsupervised meta-learning with graph in-context learning, achieving efficient inference. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our model. Our public code can be found

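2605.24405 2026-05-26 cs.LG cs.AI updated version

Generative OOD-regularized Model-based Policy Optimization

Aysin Tumay, Jiahe Huang, Elise Jortberg, Rose Yu

Affiliations * University of California, San Diego; Abiomed

AI summary Proposes GORMPO, which uses generative density estimation to restrict policy updates to high-density regions of a sparse state-action space, addressing out-of-distribution actions in offline reinforcement learning. It outperforms baselines on a real-world medical dataset and on offline RL datasets.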

Abstract

We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribution (OOD) actions when training relies only on sparse offline representations. To ensure safe offline policies in a sparse state-action space, we explore how density estimation models can be integrated into model-based RL methods to avoid the OOD regions. Generative models are capable of explicitly modeling the density in sparse state-action spaces. Building on this, we introduce Generative OOD-regularized Model-based Policy Optimization (GORMPO), a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas of the dataset. Furthermore, we examine whether better OOD detection corresponds to better model-based offline policies. We compare (1) the OOD detection capabilities of various density estimators and (2) their performance within the GORMPO framework on a real-world medical dataset and sparse offline RL datasets. We theoretically guarantee GORMPO's performance under mild assumptions. Empirically, GORMPO outperforms state-of-the-art baselines by 17% on a real-world medical dataset and enhances the base model on the offline RL datasets. Our empirical findings show that better OOD detection generally results in improved policies in environments with stable dynamics, while conservative penalties with poor density estimation are favored when dynamics are uncertain.
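
Illustrative sketch (not the GORMPO algorithm itself): one common way to keep a policy inside high-density regions is to subtract a penalty from the reward whenever the estimated log-density of a state-action pair falls below a threshold. The kernel density estimator here is a simple stand-in for the generative density models the abstract refers to, and `penalized_reward`, `threshold`, and `lam` are hypothetical names and parameters.

```python
import math


def log_density(point, dataset, bandwidth=0.5):
    """Log of a Gaussian kernel density estimate at `point` (any dimension)."""
    d = len(point)
    norm = (2 * math.pi * bandwidth ** 2) ** (d / 2)
    total = 0.0
    for row in dataset:
        sq = sum((p - r) ** 2 for p, r in zip(point, row))
        total += math.exp(-sq / (2 * bandwidth ** 2)) / norm
    # The tiny constant avoids log(0) for points extremely far from the data.
    return math.log(total / len(dataset) + 1e-300)


def penalized_reward(reward, state_action, dataset, threshold, lam=1.0):
    """Subtract a penalty that grows as the pair falls below the density threshold."""
    shortfall = max(0.0, threshold - log_density(state_action, dataset))
    return reward - lam * shortfall


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
in_dist = penalized_reward(1.0, (0.05, 0.05), data, threshold=-3.0)
ood = penalized_reward(1.0, (5.0, 5.0), data, threshold=-3.0)
```

Pairs close to the data keep their original reward, while pairs far from it are penalized in proportion to how far their log-density drops below the threshold. How conservative to make `lam` is exactly the kind of trade-off the abstract's last sentence points to.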

2605.24398 2026-05-26 cs.CV cs.AI cs.GR updated version

VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation

Tarun Gehlaut, Difan Liu, Charu Bansal, Krutik Malani, Souymodip Chakraborty, Ankit Phogat, Matthew Fisher, Vineet Batra

Affiliations * Adobe

AI summary Proposes VectorArk, a model that uses a rounded polygon representation and a degradation model for robust, practical image vectorization, achieving superior geometric completeness and artifact suppression across multiple datasets.

Comments CVPR 2026. Project page: https://vectorark.github.io/

Abstract

Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.

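2605.24396 2026-05-26 cs.AI updated version

Understanding and Mitigating Premature Confidence for Better LLM Reasoning

Jingchu Gai, Guanning Zeng, Christina Baek, Chen Wu, J. Zico Kolter, Andrej Risteski, Aditi Raghunathan

Affiliations * Carnegie Mellon University; Tsinghua University

AI summary Targets logical leaps and premature confidence in long chains of thought with progressive confidence shaping, a reinforcement learning objective that needs no external labels or reward models. It rewards gradual confidence growth, penalizes early commitment, and substantially improves reasoning accuracy and quality.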

Abstract

Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find such a signal in how the model's confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early -- rewarding gradual confidence growth and penalizing early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2x (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp. Consistent with this mechanism, the method also improves faithfulness: on a safety benchmark, our models more transparently surface misleading content in their reasoning traces rather than concealing it. Controlled experiments reveal that the problem and its remedy scale together: premature confidence grows with model size and task difficulty, and so do the gains from addressing it.
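
Illustrative sketch (not the paper's objective): one very simple way to score a confidence trace so that early commitment scores low and gradual growth scores high is to compare late confidence with early confidence. The function name `confidence_shaping`, the quarter-length windows, and the example traces are all assumptions; the abstract does not give the actual reward.

```python
def confidence_shaping(confidences, window=0.25):
    """Toy shaping term for a trace of per-step confidences in [0, 1].

    Returns mean confidence over the last `window` fraction of steps minus
    mean confidence over the first `window` fraction. A trace that is
    already confident at the start earns little, however it ends.
    """
    n = len(confidences)
    k = max(1, int(n * window))
    early = sum(confidences[:k]) / k
    late = sum(confidences[-k:]) / k
    return late - early


gradual = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
premature = [0.9] * 8
bonus_gradual = confidence_shaping(gradual)
bonus_premature = confidence_shaping(premature)
```

In a reinforcement learning setup, a term like this would be added to the task reward, so it needs no step-level labels; only the model's own confidence estimates are used.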

2605.24381 2026-05-26 cs.LG cs.AI stat.AP stat.ML updated version

Assessing the Operational Viability of Foundation Models for Time Series Forecasting

Kavin Soni, Debanshu Das, Vamshi Guduguntla

Affiliations * Google, USA

AI summary Compares foundation models with supervised methods across four operational regimes and proposes a Complexity Router based on empirical features to balance accuracy and efficiency.

Comments 21 pages, 8 Figures, Code available at [https://github.com/kavin-soni/timeseries-zeroshot-eval]

Abstract

Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain-specific training, feature engineering, and ongoing maintenance. Large-scale foundation models have recently emerged as a zero-shot alternative, avoiding task-specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human-centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold-start or long-tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade-offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.
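
Illustrative sketch (not the paper's Complexity Router): a per-series router can be built from a couple of cheap empirical features. The two features used here (history length and autocorrelation at a seasonal lag), the thresholds, and the labels `"foundation"` and `"supervised"` are assumptions chosen to echo the regimes described in the abstract; the paper's actual features are not given there.

```python
def autocorr(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var


def route(series, season_lag=7, min_history=28, periodic_threshold=0.6):
    """Pick a model class for one series using two cheap features."""
    if len(series) < min_history:
        return "foundation"   # cold start: too little history to train on
    if autocorr(series, season_lag) >= periodic_threshold:
        return "foundation"   # strong, regular periodicity at the seasonal lag
    return "supervised"       # otherwise defer to a trained specialist


weekly = [1, 2, 3, 4, 5, 6, 7] * 8
choice = route(weekly)
```

Routing per series means the more expensive model class only runs where the features suggest it will pay off, which is where the claimed cost savings would come from.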

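2605.24375 2026-05-26 cs.AI updated version

Distilling Game Code World Model Generation into Lightweight Large Language Models

Tyrone Serapio, Arjun Prakash, Haoyang Xu, Kevin Wang, Amy Greenwald

Affiliations * Brown University

AI summary Studies distilling Game Code World Model generation into small models through post-training, using supervised fine-tuning and reinforcement learning with verifiable rewards to improve the syntactic correctness and rule adherence of the generated code.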

Abstract

Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game-specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference-time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post-training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post-training pipeline combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5-3B-Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution-level adherence to game rules, thereby improving Qwen's ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5-3B-Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.

2605.24352 2026-05-26 cs.AI updated version

Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo

Affiliations * School of Information Technology, Deakin University; Faculty of Information Technology, Monash University

AI summary Proposes the Partner-Aware Skill Discovery (PASD) framework, which learns skills conditioned on partner behavior through a contrastive intrinsic reward, mitigating shortcut learning and improving the robustness and adaptability of human-AI collaboration.

Abstract

Multi-agent collaboration, especially in human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional Deep Hierarchical Reinforcement Learning (DHRL) methods focus on agent-centric rewards and overlook partner behavior, leading to shortcut learning, where skills exploit spurious information instead of adapting to partners' dynamic behaviors. This limitation undermines agents' ability to adapt and coordinate effectively with novel partners. We introduce Partner-Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward to capture patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space based on partner interactions, this approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD in the Overcooked-AI benchmark with a diverse population of partners characterized by varying skill levels and play styles. We further evaluate the approach with human proxy models trained from human-human gameplay trajectories. PASD consistently outperforms existing population-based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration.

2605.24343 2026-05-26 cs.AI 版本更新

Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

通过层次化动作解耦实现自适应人机协作

Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo

发表机构 * School of Information Technology(信息科技学院) Deakin University(迪金大学) Faculty of Information Technology(信息科技学院) Monash University(莫纳什大学)

AI总结 提出内在动作解耦(IAD)框架,利用深度层次强化学习学习伙伴感知的低层动作序列,通过内在奖励鼓励动作解耦,实现与多样伙伴的自适应协调。

详情
AI中文摘要

人机协作需要智能体能够适应多样化的伙伴行为和技能水平,同时对未见过的伙伴保持鲁棒性。现有方法往往坍缩为单一主导行为或学习到对齐不良的技能,限制了有效协调。我们提出内在动作解耦(IAD),一种深度层次强化学习(DHRL)框架,学习以高层潜在技能为条件的、不同的、伙伴感知的低层动作序列。IAD引入内在奖励,明确鼓励智能体低层策略在不同技能上的动作分布解耦,从而在高层次决策与伙伴特定的行为响应之间产生可解释的映射。通过捕捉时间上扩展的交互模式,IAD能够在分布偏移下灵活适应异质伙伴动态。我们在Overcooked-AI领域中对多个布局和多种伙伴设置进行评估,包括未见过的模拟伙伴、基于人-人游戏训练的人类代理模型以及真实人类伙伴。结果表明,IAD在所有设置中均持续优于强基线,并实现更可靠、自适应的协调。

英文摘要

Human-AI collaboration requires agents that can adapt to diverse partner behaviors and skill levels while remaining robust to unseen partners. Existing methods often collapse to a single dominant behavior or learn poorly aligned skills, limiting effective coordination. We propose Intrinsic Action Disentanglement (IAD), a deep hierarchical reinforcement learning (DHRL) framework that learns distinct, partner-aware low-level action sequences conditioned on high-level latent skills. IAD introduces an intrinsic reward that explicitly encourages disentangled action distributions of the agent's low-level policy across skills, yielding an interpretable mapping between high-level decisions and partner-specific behavioral responses. By capturing temporally extended interaction patterns, IAD enables flexible adaptation to heterogeneous partner dynamics under distributional shift. We evaluate IAD in the Overcooked-AI domain across multiple layouts and diverse partner settings, including unseen simulated partners, a human-proxy model trained on human-human gameplay, and real human partners. Results show that IAD consistently outperforms strong baselines and achieves more reliable, adaptive coordination across all settings.

2605.24326 2026-05-26 cs.DC cs.AI cs.NI 版本更新

ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

ScaleAcross Explorer:探索跨规模AI模型训练的通信优化

Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, Carole-Jean Wu

发表机构 * Harvard University(哈佛大学) Meta Platforms, Inc.(Meta平台公司)

AI总结 针对跨数据中心大规模AI模型训练(scale-across)的复杂设计空间,提出ScaleAcross Explorer优化器,通过联合优化并行放置、并行调度和网络层技术,实现高达64.62%的训练加速。

Comments 28 pages, 27 figures

详情
AI中文摘要

大型语言模型训练的快速扩展需要将GPU资源分布在多个数据中心建筑和区域之间。我们将这种范式称为“scale-across”训练。随着基础设施的扩展,系统设计空间变得越来越复杂,涵盖了新的模型架构、硬件异构性和不断演变的通信模式。借鉴Meta的生产经验,我们强调了在跨多个拥有数十万GPU的数据中心部署训练作业的复杂性。为了加速对庞大设计空间的探索并实现前沿模型开发的高效训练,我们对三个关键设计维度进行了深入表征:并行放置、并行调度和网络层技术。然后,我们提出了ScaleAcross Explorer,这是一个考虑设计维度相互作用并整体优化跨规模训练的优化器。测试床实验和模拟表明,在广泛的设计点上,与生产配置相比,训练速度提升高达64.62%,与最先进的基线相比,训练速度提升高达37.59%。

英文摘要

The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to this paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies. We then propose ScaleAcross Explorer, an optimizer that considers the interplay of design dimensions and holistically optimizes scale-across training. Testbed experiments and simulations demonstrate up to 64.62% training speedups over production configuration and up to 37.59% training speedups over the state-of-the-art baseline across a wide range of design points.

2605.24305 2026-05-26 cs.LG cs.AI 版本更新

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

ChaosBench-Logic v2: 大规模评估大语言模型在动力系统上的逻辑推理能力

Noel Thomas

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对二元推理基准的准确性掩盖了关键缺陷,本文提出包含40,886个问题、覆盖165个动力系统的ChaosBench-Logic v2基准和CARE评估协议,揭示模型在状态转换推理、FOL演绎等任务上的表现差异和系统性反相关。

Comments 14 pages, 8 figures. Published at the ICLR 2026 Workshop on LLM Reasoning

详情
AI中文摘要

二元推理基准的标准准确性隐藏了关键失败模式:先验崩溃、释义下不一致以及无法推理参数依赖的动态。我们提出了ChaosBench-Logic v2,一个包含40,886个问题、覆盖165个动力系统、27个FOL谓词和78条公理边的基准,以及CARE(校准与对抗鲁棒评估)协议,该协议揭示了这些病理现象。评估14个模型,我们发现即使对于前沿模型,状态转换推理仍接近随机(MCC = 0.05),而给定前提的FOL演绎达到MCC = 0.52。按系列分解显示,专有模型的优势集中在跨指标(+0.40)和一致性任务上,而开源Qwen 2.5-32B在指标诊断上占优(0.91 vs. 0.45)。两个模型在分岔问题上表现出负MCC,通过混淆矩阵分析确认为系统性反相关。

英文摘要

Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
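The abstract reports results as Matthews correlation coefficient (MCC) rather than accuracy. As a minimal sketch of why that matters, the snippet below computes MCC from binary confusion-matrix counts. The counts are hypothetical illustrations, not data from the paper, and the zero-denominator convention is an assumption of this sketch.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts on a balanced 100-question yes/no set.
always_yes = mcc(tp=50, fp=50, fn=0, tn=0)   # answers "yes" to everything
inverted = mcc(tp=10, fp=40, fn=40, tn=10)   # wrong more often than right
perfect = mcc(tp=50, fp=0, fn=0, tn=50)

print(always_yes, inverted, perfect)
```

A model that always answers "yes" reaches 50% accuracy on this balanced set yet scores an MCC of 0, while a negative MCC indicates predictions that are anti-correlated with the labels.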

2605.24304 2026-05-26 cs.CV cs.AI 版本更新

ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

ArtSplat: 基于前馈的关节式3D高斯泼溅从稀疏多状态未标定视图

Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim

发表机构 * Seoul National University(首尔国立大学) Sogang University(西江大学) NAVER AI Lab(NAVER AI实验室)

AI总结 提出首个前馈框架ArtSplat,通过稀疏多视图跨多个关节状态,一次性重建几何和关节参数,引入逐像素关节图表示和跨状态注意力机制,在PartNet-Mobility上实现400倍加速。

详情
AI中文摘要

从稀疏视图图像重建关节物体是一个病态问题,需要同时推断几何和底层关节结构。现有基于NeRF和3D高斯泼溅(3DGS)的关节物体重建方法通常依赖密集视图或强先验(例如深度图、关节类型、预定义关节数量),并且需要昂贵的逐对象优化。在本文中,我们提出了ArtSplat,这是第一个用于关节式3D高斯泼溅的前馈框架。它通过单个前向传递,从跨多个关节状态的稀疏多视图图像中重建几何和关节参数。为了解决单次前向关节重建的挑战,我们引入了一种逐像素关节图表示,使得关节参数估计能够集成到前馈流水线中。我们进一步提出了一种带有状态令牌的跨状态注意力(CSA)机制,该机制有效捕获输入状态间的离散运动。在来自PartNet-Mobility的68个关节物体(包括单关节和多关节配置)上的实验表明,ArtSplat在几何和关节估计方面均达到了有竞争力的性能,同时比基线方法快400倍以上。

英文摘要

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.

2605.24300 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Enhancing Reliability in LLM-Based Secure Code Generation

增强基于LLM的安全代码生成的可靠性

Mohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University(比尔泽特大学计算机科学系) Department of Computer Science, University of Central Florida(中佛罗里达大学计算机科学系)

AI总结 提出Mitigation-Aware Chain-of-Thought (MA-CoT)框架,通过嵌入任务特定的CWE缓解指导和语言感知安全措施,显著降低LLM生成代码中的漏洞,在多个模型和语言上验证了其一致的安全可靠性提升。

Comments 15 pages; 7 tables; 3 figures

详情
AI中文摘要

大型语言模型(LLM)被广泛用于代码生成,但其安全可靠性在不同语言和提示策略下仍不一致。现有的提示工程提高了功能正确性,但很少能确保一致的安全结果。我们引入了Mitigation-Aware Chain-of-Thought (MA-CoT)框架,该框架嵌入了任务特定的CWE缓解指导和语言感知安全措施,以减少生成代码中反复出现的漏洞。我们在三个LLM(gpt-5, claude-4.5, gemini-2.5)、三种编程语言(C, Java, Python)和四种提示策略(Vanilla, Zero-shot, CoT, MA-CoT)上,使用200个任务的主数据集以及LLMSecEval的外部验证对MA-CoT进行了评估。通过静态分析和专家验证,MA-CoT在主数据集上将总安全问题从92个减少到39个(57.6%),在LLMSecEval上从73个减少到4个(94.5%)。高严重性问题(阻塞+严重)分别从90个降至39个(56.7%)和从45个降至2个(95.6%)。在两个数据集中,MA-CoT是唯一能持续提高安全可靠性的策略;Zero-shot和CoT可靠性较低,且可能增加漏洞,尤其是在C语言中。我们进一步引入了严格的漏洞驱动因素分层归因(语言核心层与栈层),并表明残余风险集中在硬化导向的模式(例如,依赖于操作系统和工具链),这激励了在提示之外采用安全构造原语。

英文摘要

Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and prompting strategies. Existing prompt engineering improves functional correctness but rarely ensures consistent security outcomes. We introduce the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities in generated code. We evaluate MA-CoT across three LLMs (gpt-5, claude-4.5, gemini-2.5), three programming languages (C, Java, Python), and four prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on a 200-task primary dataset, with external validation on LLMSecEval. Using static analysis with expert validation, MA-CoT reduces total security findings from 92 to 39 (57.6%) on the primary dataset and from 73 to 4 (94.5%) on LLMSecEval. High-severity findings (Blocker + Critical) drop from 90 to 39 (56.7%) and from 45 to 2 (95.6%), respectively. Across both datasets, MA-CoT is the only strategy that consistently improves security reliability; Zero-shot and CoT are less reliable and may increase vulnerability, especially in C. We further introduce a strict layered attribution of vulnerability drivers (language-core vs. stack layers) and show that residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent), motivating secure-by-construction primitives alongside prompting.
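The reduction percentages quoted in the abstract follow directly from the before/after finding counts. A minimal arithmetic check, using only the counts stated above (the helper name `reduction_pct` is introduced here for illustration):

```python
def reduction_pct(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


# (before, after) counts as quoted in the abstract.
total_primary = reduction_pct(92, 39)
total_llmseceval = reduction_pct(73, 4)
high_primary = reduction_pct(90, 39)
high_llmseceval = reduction_pct(45, 2)

print(total_primary, total_llmseceval, high_primary, high_llmseceval)
```

All four values agree with the figures reported in the abstract (57.6, 94.5, 56.7, and 95.6 percent).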

2605.24298 2026-05-26 cs.CR cs.AI cs.LG 版本更新

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

LLM生成代码安全性的提示方法实证评估

Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University(比尔泽特大学计算机科学系) King Fahd University of Petroleum and Minerals(法赫德国王石油与矿业大学) University of Central Florida(中佛罗里达大学)

AI总结 通过跨5个LLM和4种编程语言的实证评估,提出弱点感知零样本链式思考(WA-0CoT)提示策略,发现提示方法虽影响弱点类别分布,但无法显著降低漏洞频率或密度。

Comments 40 pages, 22 tables, 8 figures

详情
AI中文摘要

大型语言模型(LLM)在自动化代码生成中的日益使用提高了软件开发效率,但往往以安全性为代价。生成的代码经常忽略关键问题,使其容易受到弱加密和不正确的输入验证等问题的影响。为了研究这一问题,我们对跨五个LLM和四种编程语言(Java、C++、C和Python)的LLM生成代码的安全质量进行了全面的实证评估,考察了多种提示工程方法的影响。我们提出了一种弱点感知的零样本链式思考(WA-0CoT)提示策略,该策略利用CWE映射丰富提示中的安全上下文以指导模型推理。我们的实证分析在卡方检验的支持下发现,不同提示方法在漏洞频率或密度上没有统计学上的显著降低。然而,包括WA-0CoT在内的提示策略系统地影响了CWE类别的组成分布,其效果因编程语言而异。这些发现表明,虽然安全感知的提示改变了生成弱点的结构,但仅靠提示工程不足以可靠地降低整体漏洞水平。结果强调了在评估LLM生成代码的安全属性时,语言感知和模型感知的提示设计的重要性。

英文摘要

The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at the cost of security. Generated code frequently overlooks critical concerns, leaving it vulnerable to issues such as weak encryption and improper input validation. To investigate this problem, we present a comprehensive empirical evaluation of the security quality of LLM-generated code across five LLMs and four programming languages (Java, C++, C, and Python), examining the impact of multiple prompt engineering methods. We introduce a weaknesses-aware zero-shot chain-of-thought (WA-0CoT) prompting strategy that enriches prompts with security context using CWE mappings to guide model reasoning. Our empirical analysis, supported by chi-square tests, finds no statistically significant reductions in vulnerability frequency or density across prompt methods. However, prompting strategies, including WA-0CoT, systematically influence the compositional distribution of CWE categories, with effects varying by programming language. These findings suggest that while security-aware prompting alters the structure of generated weaknesses, prompt engineering alone is insufficient to reliably reduce overall vulnerability levels. The results highlight the importance of language-aware and model-aware prompt design when evaluating the security properties of LLM-generated code.
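The abstract bases its "no statistically significant reduction" finding on chi-square tests. As a rough sketch of what such a test computes, the snippet below evaluates Pearson's chi-square statistic on a contingency table of prompting method by outcome. The counts and the four-row layout are hypothetical, not the paper's data; only the 5% critical value for 3 degrees of freedom (7.815) is a standard table value.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are four prompting methods,
# columns are (vulnerable samples, non-vulnerable samples).
table = [
    [30, 70],
    [28, 72],
    [33, 67],
    [29, 71],
]
stat, dof = chi_square(table)

CRITICAL_5PCT_DOF3 = 7.815  # standard chi-square table value for dof = 3
significant = stat > CRITICAL_5PCT_DOF3
print(stat, dof, significant)
```

With these made-up counts the statistic is far below the critical value, which is the shape of result the abstract describes: the methods differ slightly in raw counts, but not by more than chance would explain.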

2605.24294 2026-05-26 cs.CR cs.AI cs.LG 版本更新

Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware Detection

使用自监督学习和强化学习在Android恶意软件检测中适应概念漂移

Ahmed Sabbah, Mohammad Kharma, Mohammad Alkhanafseh, Radi Jarrar, Samer Zein, David Mohaisen

发表机构 * Birzeit University(比尔泽特大学) University of Central Florida(中佛罗里达大学)

AI总结 提出一个基于自监督学习和强化学习的框架,通过冻结编码器测量潜在漂移并轻量适配,同时利用PPO控制器在成本约束下选择维护动作,以应对Android恶意软件检测中的概念漂移。

Comments 9 pages, 2 figures, 2 tables

详情
AI中文摘要

Android恶意软件检测器在部署后常因概念漂移而性能下降,而每次维护步骤完全重新训练成本高昂。我们提出一个按时间顺序的自适应维护框架,将部署时的维护建模为序列决策问题。该框架在初始化阶段通过自监督学习学习稳定的潜在表示,冻结编码器,在固定表示空间中测量潜在漂移,并使用可训练适配器和分类头进行轻量下游适配。一个近端策略优化控制器根据检测器状态(包括当前效用、固定记忆集上的保留率、潜在漂移指标和更新成本)选择低成本的维护动作。我们在模拟器和真实Android恶意软件数据集上,使用静态和动态特征,在因果部署式协议下评估该框架。结果表明,RL控制器提供了一种强大的成本感知适配策略,在非平稳部署条件下,始终保持在最佳策略之列,同时在时间性能、记忆保留和维护成本之间取得有利平衡。

英文摘要

Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly. We propose a chronological adaptive maintenance framework that models deployment-time maintenance as a sequential decision problem. The framework learns a stable latent representation through self-supervised learning during initialization, freezes the encoder, measures latent drift in the fixed representation space, and performs lightweight downstream adaptation using a trainable adapter and classification head. A proximal policy optimization controller selects low-cost maintenance actions based on the detector state, including current utility, retention on a fixed memory set, latent drift indicators, and update cost. We evaluate the framework under a causal deployment-style protocol on emulator and real Android malware datasets with static and dynamic features. Results show that the RL controller provides a strong cost-aware adaptation strategy, consistently remaining among the top-performing policies while achieving a favorable balance between temporal performance, memory retention, and maintenance cost under non-stationary deployment conditions.
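The abstract describes measuring latent drift in a fixed representation space after the encoder is frozen, without specifying the drift indicator. One simple way to illustrate the idea, offered here purely as an assumption and not as the authors' metric, is the cosine distance between the mean embedding of a reference window and that of a newer window:

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length embedding vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def latent_drift(reference, current):
    """Hypothetical drift indicator: distance between window centroids."""
    return cosine_distance(mean_vector(reference), mean_vector(current))


# Toy 2-D embeddings standing in for frozen-encoder outputs.
reference = [[1.0, 0.0], [1.0, 0.0]]
no_drift = latent_drift(reference, reference)
full_drift = latent_drift(reference, [[0.0, 1.0], [0.0, 1.0]])
print(no_drift, full_drift)
```

Because the encoder is frozen, a change in this value reflects a shift in the incoming data rather than a change in the representation itself, which is what makes it usable as a state feature for a maintenance controller.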

2605.24270 2026-05-26 cs.AI cs.CR 版本更新

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

面向安全的路由分析:Mixtral MoE在良性及有害提示下的表现

Md Nurul Absar Siddiky

发表机构 * Department of Electrical and Computer Engineering, University of Hawai'i at Mānoa, Honolulu, HI, USA

AI总结 通过激活和梯度两种信号分析Mixtral 8x7B-Instruct在良性及有害提示下的路由行为,发现安全相关的路由是微妙、深度依赖且分布式的,而非由固定专家集主导。

详情
AI中文摘要

稀疏混合专家(MoE)语言模型对每个token仅激活一小部分参数,使得路由器行为成为模型计算的核心部分。本文利用两种互补信号——基于专家选择频率的激活路由分数和基于路由器门敏感性的梯度分数——研究Mixtral 8x7B-Instruct在良性及有害提示下的路由行为。我们分析了专家和层级别的路由行为,并进行了专家抑制干预。结果表明,激活基础的专家使用广泛且长尾,而梯度基础的重要性则集中。在专家级别,良性提示组和有害提示组在两种信号下保持接近,仅有适度分离。在层级别,激活路由在8-15层附近最具选择性,而梯度重要性集中在最后几层。专家分类显示,大多数专家在良性和有害提示间共享,尽管有限子集表现出明确的组偏好。排名靠前的专家集在梯度分数下显示出比激活分数更强的良性-恶意重叠,表明集中在共同的后期专家集上。在干预实验中,抑制来自激活分数的前五个良性主导专家,将100个提示中的受限响应从24减少到14,而抑制梯度导出的专家则从34减少到22,且意外逆转更少。总体而言,Mixtral中与安全相关的路由是微妙、深度依赖且分布式的,而非由固定专家集主导。

英文摘要

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies the routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At the expert level, benign and harmful prompt groups remain close under both signals, with only modest separation. At the layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in the final layers. Expert classification shows that most experts are shared across benign and harmful prompts, though a limited subset shows a clear group preference. Top-ranked expert sets show stronger benign-malicious overlap under gradient scores than under activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing the top five benign-dominant experts from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.
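The activation-based signal is described as a score derived from expert selection frequencies. A minimal sketch of that idea for one layer is shown below, assuming Mixtral's layout of 8 experts with 2 selected per token. The token routings are invented for illustration, and the paper's exact scoring formula is not given in the abstract.

```python
from collections import Counter

NUM_EXPERTS = 8  # Mixtral 8x7B: 8 experts per MoE layer, top-2 routing per token


def selection_frequency(routed_tokens):
    """Fraction of all expert selections that went to each expert in one layer."""
    counts = Counter(expert for token in routed_tokens for expert in token)
    total = sum(counts.values())
    return [counts.get(expert, 0) / total for expert in range(NUM_EXPERTS)]


# Hypothetical top-2 expert indices per token for one layer.
benign_tokens = [(0, 3), (0, 5), (3, 5), (0, 3)]
harmful_tokens = [(0, 3), (2, 5), (2, 3), (2, 7)]

benign_freq = selection_frequency(benign_tokens)
harmful_freq = selection_frequency(harmful_tokens)

# Positive gap: expert is selected more often on harmful prompts.
gap = [h - b for h, b in zip(harmful_freq, benign_freq)]
print(gap)
```

In this toy data expert 2 leans harmful and expert 0 leans benign, while expert 3 is shared. The abstract reports that in the real model most experts fall into the shared group, with only a limited subset showing a clear preference.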

2605.24266 2026-05-26 cs.CL cs.AI 版本更新

An Interactive Paradigm for Deep Research

深度研究的交互式范式

Lin Ai, Victor S. Bursztyn, Xiang Chen, Julia Hirschberg, Saayan Mitra

发表机构 * Adobe Research(Adobe研究院) Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系)

AI总结 提出SteER框架,通过可解释的中间过程控制、成本效益决策和实时用户模型,在深度研究中实现用户对齐,性能优于现有基线。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展使得深度研究系统能够通过结合检索、推理和生成,为开放式查询合成全面、报告式的答案。然而,大多数框架依赖于僵化的流程,采用一次性范围界定和长时间自主运行,如果用户意图在过程中发生变化,几乎没有修正的空间。我们提出了SteER,一个可引导的深度研究框架,将可解释的中间过程控制引入长周期研究流程中。在每个决策点,SteER使用成本效益公式来确定是暂停等待用户输入还是自主继续。它结合了多样性感知规划与奖励对齐、新颖性和覆盖率的效用信号,并维护一个在会话过程中不断演化的实时用户模型。SteER在对齐方面比最先进的开源和专有基线高出最多22.80%,在广度、平衡等质量指标上领先,并且在85%以上的成对对齐判断中被人类读者偏好。我们还引入了一个用户查询基准和数据生成流水线。据我们所知,这是第一个以交互式、可解释的控制范式推进深度研究的工作,为长形式任务中可控、用户对齐的智能体铺平了道路。

英文摘要

Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to open-ended queries by combining retrieval, reasoning, and generation. Yet most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We present SteER, a framework for Steerable deEp Research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point, SteER uses a cost-benefit formulation to determine whether to pause for user input or to proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session. SteER outperforms state-of-the-art open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85%+ of pairwise alignment judgments. We also introduce a persona-query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.

2605.24247 2026-05-26 cs.CL cs.AI 版本更新

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

通过详细的宪法定义和AI驱动的评估提高标注一致性

Konstantin Berlin, Adam Swanda

发表机构 * Cisco AI Defense(思科AI防御)

AI总结 提出一种AI驱动的工作流,通过为每个类别编写详细的宪法定义并由前沿LLM解释,以比人类更一致和准确地生成黄金标签,在三个内容审核类别上将跨模型不一致性降低高达57倍。

Comments Under review at ACL Rolling Review (ARR), May 2026 cycle. Also available at https://doi.org/10.5281/zenodo.20125267

详情
AI中文摘要

许多自动化标注流水线根据书面规范将输入分类到类别中,内容审核是一个突出的用例。简单的类别定义不足以让标注者产生这些流水线所需的准确、一致的黄金标签。一个解决方案是编写一个规定性定义,解决足够多的实际边界情况,使得标注者无法对书面解释产生分歧。在实践中,这种详细程度的定义超出了人类标注者工作记忆的容量,因此标注者依赖直觉,标签偏离书面规则,准确性和一致性下降。我们提出并展示了一种AI驱动工作流的有效性,其中AI帮助编写每个类别的宪法,该宪法以足够详细的方式定义标签以覆盖边缘情况,并且前沿LLM在每个输入上解释该宪法,以比阅读相同文档的人类更一致和准确地产生黄金标签。我们在三个内容审核类别(骚扰、仇恨言论、非暴力犯罪)上评估,并表明该方法相比段落定义将跨模型不一致性降低高达57倍,跨模型分歧诊断规范缺口,人类负责关于每个类别应含义的高层决策,而不是单个标注调用。对于安全评估,我们引入了一个双轴公式,在完整对话上独立评分意图和内容,以便下游消费者可以基于任一轴或两者采取行动。

英文摘要

Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency. We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document. We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions, with cross-model disagreement diagnosing specification gaps and the human responsible for high-level decisions about what each category should mean rather than individual labeling calls. For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.

2605.24243 2026-05-26 cs.CV cs.AI stat.ML 版本更新

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

GIBLy: 通过架构无关的轻量级几何归纳偏置层改进3D语义分割

Diogo Lavado, Alessandra Micheletti, Clàudia Soares

发表机构 * NOVA School of Science and Technology(诺瓦科学与技术学校) Università degli Studi di Milano(米兰大学)

AI总结 提出一种轻量级几何归纳偏置层GIBLy,通过集成可学习的几何先验提升3D分割性能,仅增加少量参数即可在多个基准上获得一致提升。

详情
AI中文摘要

在3D场景理解中,深度学习模型依赖大型模型和大量训练来捕捉3D数据中存在的几何结构。然而,现有方法缺乏显式机制来融入几何信息(例如可学习的基元形状),往往需要更大的模型和更多的训练数据,这增加了成本并可能限制泛化能力。我们引入了GIBLy,一种轻量级几何归纳偏置层,将可学习的几何先验集成到3D分割流程中。GIBLy通过提供与简单几何形状(因此可解释)对齐的特征来增强现有架构——无论是基于MLP、卷积还是Transformer——以最小的计算开销提升分割性能。我们在多个3D语义分割基准上验证了我们的方法,展示了一致的性能提升,包括在TS40K上使用PTV3时mIoU提升高达+11.5%,而仅增加58K额外参数。我们的结果突显了显式编码几何结构以支持准确高效的3D场景理解的优势,且仅需一个轻量级的附加层。

英文摘要

In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures -- whether MLP-based, convolution-based, or transformer-based -- by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.

2605.24239 2026-05-26 cs.CR cs.AI 版本更新

Unlocking Apple's Private Cloud Compute: An Analysis of Privacy-Preserving Artificial Intelligence

解锁苹果的私有云计算:隐私保护人工智能分析

Yannik Dittmar, Marvin Jerome Stephan, Thomas Völkl, Matthias Hollick, Jiska Classen

发表机构 * Hasso Plattner Institute, University of Potsdam(哈索·普拉特纳研究所,波茨坦大学) TU Darmstadt, Secure Mobile Networking Lab(达姆施塔特工业大学,安全移动网络实验室) IMDEA Networks Institute, Madrid, Spain(IMDEA网络研究所,马德里,西班牙)

AI总结 通过逆向工程苹果私有云计算(PCC)在移动设备上的实现,评估其隐私保护特性,并开放非公开接口以支持自定义查询和独立基准测试。

详情
Journal ref
Proceedings of the 19th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec 2026)
AI中文摘要

许多现有的移动设备人工智能解决方案依赖于大量敏感数据的收集,引发隐私担忧,并且通常需要存储上下文和模型改进数据。苹果的私有云计算(PCC)旨在通过强调移动设备集成和隐私优先设计来解决这一问题。PCC的核心主张是它不存储任何用户数据,并且用户输入和用户账户是不可关联的。 尽管大多数PCC系统规范是公开的,但编译后的二进制文件增加了一层不透明性。没有可重现的构建,这些二进制文件中也没有符号,导致规范与实际交付给用户的产品之间可能存在差异。此外,查询PCC的底层模型和接口并不公开可访问,限制了学术上对模型属性(如准确性)的评估。这给评估像PCC这样的隐私保护方法是否既值得信赖又能提供高质量答案带来了挑战。 我们是第一个逆向工程移动设备上PCC实现以评估隐私方面,并在本地设备上开放其非公开接口以支持自定义PCC查询的研究团队。我们通过独立基准测试PCC模型,展示了超出苹果预期用例的访问级别。通过公开我们的PCC基准测试框架,我们为未来的研究提供了支持。

英文摘要

Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy concerns and often requiring storage for both context and model improvement. Apple's Private Cloud Compute (PCC) aims to address this by emphasizing mobile device integration and a privacy-first design. The central claim of PCC is that it does not store any user data and that user input and user accounts are unlinkable. While most of the PCC system specifications are public, compiled binaries add a layer of opaqueness. There are no reproducible builds, and there are no symbols within those binaries, creating potential discrepancies between the specification and what is shipped to the user. Additionally, the underlying models and interfaces for querying PCC are not openly accessible, limiting academic evaluation of model properties, such as accuracy. This poses a challenge in assessing whether a privacy-preserving approach like PCC is actually trustworthy while also providing high-quality answers. We are the first to reverse-engineer the PCC implementation on mobile devices to evaluate privacy aspects and to open its non-public interfaces on local devices to support custom PCC queries. We demonstrate this level of access beyond Apple's intended use cases by independently benchmarking the PCC model. We enable future research by making our PCC benchmarking framework publicly available.

2605.24238 2026-05-26 cs.AI 版本更新

Toward Enactive Artificial Intelligence

走向生成式人工智能

Banafsheh Rafiee, Richard Sutton

发表机构 * Independent Researcher(独立研究者) Department of Computing Science, University of Alberta, Canada(阿尔伯塔大学计算机科学系) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所)

AI总结 本文主张将生成式认知方法融入人工智能,强调感知与行动不可分割、具身性和自主性,并指出强化学习在结构上与生成式原则存在共鸣但仍有差距。

详情
AI中文摘要

在本文中,我们主张将生成式感知与认知方法融入人工智能(AI)。生成式方法将感知视为与世界主动的、技巧性的互动,其中智能体通过行动以及理解其行动如何塑造经验来感知。这与将感知视为被动内部过程的经典观点形成对比,后者认为大脑接收感觉输入、处理信息并发出行动指令。生成式观点强调感知的动态性、具身性和交互性,植根于嵌入环境中的智能体的生活经验。我们识别并发展了四个我们认为与AI最相关的关键生成式概念:经验、行动-感知不可分割性、自主性和具身性。主流AI,从经典基于规则的系统到大型语言模型,在很大程度上忽视了这些见解,将认知视为与具身交互和内在规范性脱节的内在处理。然而,强化学习(RL)通过强调行动、智能体-环境交互、反馈驱动的适应和以智能体为中心的评估,在结构上与生成式原则产生共鸣。然而,这种共鸣不应被视为理论等价,因为RL近似了一些生成式见解,但关键要素仍然缺失或发展不足。基于这一分析,我们建议将生成式思想更广泛地融入主流AI和RL中。

英文摘要

In this paper, we advocate for incorporating enactive approaches to perception and cognition into artificial intelligence (AI). Enactive approaches view perception as an active, skillful engagement with the world, where agents perceive by acting and by understanding how their actions shape their experience. This contrasts with classical views that treat perception as a passive internal process in which the brain receives sensory input, processes it, and issues commands for action. Enactive views emphasize the dynamic, embodied, and interactive character of perception, grounded in the lived experience of agents embedded in their environments. We identify and develop four key enactive concepts that we find most relevant to AI: experience, action-perception inseparability, autonomy, and embodiment. Much of mainstream AI, from classical rule-based systems to large language models, has largely neglected these insights, treating cognition as internal processing detached from embodied interaction and intrinsic normativity. Reinforcement learning (RL), however, exhibits structural resonance with enactive principles through its emphasis on action, agent-environment interaction, feedback-driven adaptation, and agent-centered evaluation. Still, this resonance should not be taken as theoretical equivalence: RL approximates some enactive insights, but key elements remain absent or weakly developed. Building on this analysis, we suggest a broader incorporation of enactive ideas into both mainstream AI and RL.

2605.24229 2026-05-26 cs.AI 版本更新

How Well Do Models Follow Their Constitutions?

模型遵循其宪法的程度如何?

Arya Jakkli, Senthooran Rajamanoharan, Neel Nanda

发表机构 * Anthropic

AI总结 提出多方法审计流程,评估前沿AI模型在对抗性多轮交互中遵循其书面行为规范(如Anthropic宪法和OpenAI模型规范)的程度,发现新一代模型违规率显著下降。

Comments 37 pages including appendix. Code, tenet lists, and full transcripts: https://github.com/ajobi-uhc/constitution-audits. Companion blog post on LessWrong/AI Alignment Forum: https://www.lesswrong.com/posts/Tk4SF8qFdMrzGJGGw/how-well-do-models-follow-their-constitutions

详情
AI中文摘要

前沿AI开发者现在训练模型遵循长篇书面行为规范,例如Anthropic的宪法(Anthropic, 2025a)和OpenAI的模型规范(OpenAI, 2025a),通过角色训练(Anthropic, 2024)和审慎对齐(Guan et al., 2024)等方法整合到后训练中。这些文档起到治理作用,但目前尚不清楚模型在实际部署中面临的对抗性、多轮压力下实际遵循这些规范的程度。我们提出一个多方法审计流程,将每个实验室发布的规范视为可审计目标:它将规范分解为可测试的原子性原则(Anthropic 205条,OpenAI 197条),使用Petri审计代理(Anthropic, 2025b)生成多轮对抗场景,运行修改后的SURF风格评分搜索(Murray et al., 2026)以捕捉Petri遗漏的浅层单轮失败,根据相关规范验证标记的对话记录,并将结果与实验室自己发布的系统卡进行比较。对每个规范的七个模型应用该流程,我们发现每一代模型遵循自己实验室规范的程度显著提高。在Anthropic宪法上,Claude系列的违规率从15.0%(Sonnet 4)下降到2.0%(Sonnet 4.6);在OpenAI模型规范上,GPT系列的违规率从11.7%(GPT-4o)下降到3.6%(GPT-5.2中等推理),严重程度上限从10/10降至7/10。我们无法从外部隔离这些改进是来自规范特定训练、更广泛的后训练改进还是评估意识。剩余的失败集中在操作者强加的角色在AI身份质疑下、代理部署中的不可逆操作以及带有虚假精度的捏造定量声明。

英文摘要

Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.

2605.24216 2026-05-26 cs.LG cs.AI cs.CL cs.CR (updated version)

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Agent-ToM: 通过心智理论推理学习监控自主LLM智能体

Nesreen K. Ahmed, Nima Nafisi

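Institutions: Cisco Outshift

AI summary: To address the difficulty of monitoring autonomous LLM agents for covert malicious behavior, the paper proposes Agent-ToM, a framework based on Theory-of-Mind reasoning that performs structured trajectory analysis through belief inference, intent hypotheses, and verification, and outperforms ensemble methods on monitoring benchmarks.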

Comments 23 pages, 9 figures

AI-generated Chinese abstract

监控自主大语言模型(LLM)智能体的隐蔽恶意行为具有挑战性,因为攻击模式具有延迟性、上下文依赖性和长期性。智能体可能追求隐藏目标同时保持表面良性行为,即使拥有完整轨迹访问也难以检测。先前的监控方法改进了脚手架或集成聚合,但独立处理每条轨迹,未从先前的监控经验中学习。此外,标准推理方法解释观察到的行为,但没有明确推理智能体的信念、意图和目标对齐,而这些对于区分良性任务执行和隐蔽偏离是必要的。我们提出Agent-ToM,一种基于心智理论(ToM)推理的监控学习框架,用于自主智能体的安全分析。Agent-ToM通过推断信念、具有校准置信度的意图假设、预期行动以及与任务一致行为基线的偏离,执行结构化的全轨迹分析。在推理时,它采用“推理-验证-细化”(Reason-Verify-Refine)流程来构建和验证监控决策。在训练时,Agent-ToM将批评信号蒸馏到持久的“语义护栏记忆”中,使得跨回合可重用的信念和意图条件约束成为可能。我们在对抗性智能体监控基准(SHADE-Arena和CUA-SHADE-Arena)上评估Agent-ToM。Agent-ToM实现了强精确率-召回率平衡,并使用连贯的双调用推理流程,优于包括集成方法在内的最先进监控基线。这些结果表明,在监控层学习,结合结构化的ToM推理和验证,为保护自主LLM智能体提供了有效且可部署的基础。

English abstract

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about the agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

2605.24213 2026-05-26 cs.SE cs.AI cs.LG (updated version)

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

迈向评估工程:机器学习评估工具在野外的实证研究

Zhimin Zhao, Zehao Wang, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan

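Institutions: Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen's University; Concordia University; Lahore University of Management Sciences (LUMS)

AI summary: An empirical study of 57 evaluation harnesses that derives a five-stage harness model and classifies 16,560 issues. The Specification stage accounts for the most issues (41.4%), and the main root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%). The study lays an empirical foundation for treating evaluation engineering as a distinct software engineering concern.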

AI-generated Chinese abstract

评估工具是编排模型评估的软件系统,管理模型调用、数据加载、指标计算和结果报告。尽管它们在机器学习基础设施中扮演关键角色,但其操作挑战和工程问题迄今受到的关注有限。我们对57个评估工具进行了实证研究,推导出一个五阶段工具模型,并根据工作流阶段和根本原因对16,560个问题进行了分类。大多数工具操作挑战集中在规范阶段(占问题的41.4%),在此阶段工具集成外部模型、数据集和评分评判者。操作挑战的三个最常见根本原因是未实现功能(24.3%)、文档缺失(20.3%)和输入验证缺失(17.2%),这些合计占分类问题的61.7%,涵盖现有功能的缺陷和阻碍预期工作流的能力缺口。根本原因也因工作流阶段而异:环境不兼容和外部依赖破坏占配置问题的36.2%,而算法错误(25.9%)和验证缺失(22.5%)主导评估问题。这些贡献共同为将评估工程视为一个独立的软件工程关注点建立了实证基础。

English abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.
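The four responsibilities the abstract attributes to an evaluation harness (model invocation, data loading, metric computation, result reporting) can be illustrated with a very small sketch. Everything below is hypothetical and is not taken from any of the 57 harnesses studied: the `Example` record, the `exact_match` metric, and the report layout are assumptions chosen for illustration. The explicit check in `load_data` shows the kind of input validation the study reports as frequently missing.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    prompt: str
    expected: str


def load_data(rows: Iterable[dict]) -> list:
    """Data loading: reject malformed rows instead of failing later."""
    examples = []
    for i, row in enumerate(rows):
        if "prompt" not in row or "expected" not in row:
            raise ValueError(f"row {i} is missing 'prompt' or 'expected'")
        examples.append(Example(str(row["prompt"]), str(row["expected"])))
    return examples


def exact_match(prediction: str, expected: str) -> float:
    """Metric computation: 1.0 when the stripped strings are equal."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def run_harness(model: Callable[[str], str], rows: Iterable[dict]) -> dict:
    """Orchestrate loading, model invocation, scoring, and reporting."""
    examples = load_data(rows)
    scores = [exact_match(model(ex.prompt), ex.expected) for ex in examples]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"n": len(scores), "exact_match": mean}


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for the demo)."""
    return prompt.upper()


report = run_harness(
    toy_model,
    [{"prompt": "a", "expected": "A"}, {"prompt": "b", "expected": "c"}],
)
print(report)
```

Keeping each responsibility in its own function makes it easier to attribute a failure to a specific stage, which is the kind of stage-level classification the study performs on real issue trackers.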

2605.24212 2026-05-26 stat.AP cs.AI cs.LG stat.ML (updated version)

Distributionally Robust Transfer Learning with Structurally Missing Covariates, with Application to Cross-National Cardiac Arrest Prediction

分布鲁棒迁移学习在结构缺失协变量中的应用:以跨国心脏骤停预测为例

Siqi Li, Chuan Hong, Ziye Tian, Benjamin Sieu-Hon Leong, Koshi Nakagawa, Hideharu Tanaka, Sang Do Shin, Khuong Quoc Dai, Do Ngoc Son, Marcus Eng Hock Ong, Nan Liu, Molei Liu

Institutions: Centre for Biomedical Data Science, Duke-NUS Medical School, Singapore; Duke-NUS AI + Medical Sciences Initiative, Duke-NUS Medical School, Singapore; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA; Duke Clinical Research Institute, Durham, NC, USA; Emergency Medicine Department, National University Hospital, Singapore; Department of Sport and Medical Science, Faculty of Physical Education, Kokushikan University, Tokyo, Japan; Graduate School of Emergency Medical System, Kokushikan University, Tokyo, Japan; Department of Emergency Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea; Center for Emergency Medicine, Bach Mai Hospital, Hanoi, Vietnam; Center for Critical Care Medicine, Bach Mai Hospital, Hanoi, Vietnam; Health Services Research Centre, Singapore Health Services, Singapore; Department of Emergency Medicine, Singapore General Hospital, Singapore; Pre-hospital & Emergency Research Centre, Health Services Research and Population Health, Duke-NUS Medical School, Singapore

AI summary: Proposes DRUM, a framework that uses distributionally robust optimization and a neural network generator to handle covariates that are structurally missing in the target domain, transfers prediction models to unlabeled target domains, and is validated on cross-national cardiac arrest prediction.

AI-generated Chinese abstract

当关键训练协变量在部署时不可用且目标域中标记结果有限时,跨医疗系统部署临床预测模型常常失败。例如,院外心脏骤停(OHCA)的高性能模型依赖于高资源环境中常规收集的详细院前测量数据,但在许多国际登记处中不可用。现有方法要么丢弃缺失协变量,牺牲预测信息,要么依赖于关于其目标分布的不可检验的假设。我们提出了DRUM(具有结构缺失协变量的分布鲁棒无监督迁移学习),这是一个将预测模型迁移到某些协变量结构缺失且结果标签不可用的目标群体的框架。DRUM将协变量划分为共享组件($X$,在所有环境中观察到)和缺失组件($A$,仅在源域中观察到)。DRUM不进行缺失协变量插补,而是使用神经网络生成器优化未知目标分布$A \mid X$上的最坏情况预测性能,并通过鲁棒性参数控制与源条件允许的偏差。我们进一步开发了一种偏差校正程序,以减少对干扰估计误差的敏感性。模拟显示,在分布偏移下,平均和最坏情况预测误差均有显著改善。应用于跨国OHCA预测,将模型从美国登记处迁移到多个未记录院前变量的亚洲登记处,DRUM在各个站点产生了更校准的预测和改进的临床分类性能。

English abstract

Deploying clinical prediction models across healthcare systems often fails when key training covariates are unavailable at deployment and labeled outcomes are limited in the target domain. For example, high-performing models for out-of-hospital cardiac arrest (OHCA) rely on detailed prehospital measurements routinely collected in high-resource settings but unavailable in many international registries. Existing methods either discard missing covariates, sacrificing predictive information, or rely on untestable assumptions about their target distribution. We propose DRUM (Distributionally Robust Unsupervised transfer learning with structurally Missing covariates), a framework that transfers prediction models to target populations where certain covariates are structurally absent and outcome labels are unavailable. DRUM partitions covariates into shared components ($X$), observed across all settings, and missing components ($A$), observed only in the source. Rather than imputing missing covariates, DRUM optimizes worst-case predictive performance over the unknown target distribution of $A \mid X$ using a neural network generator, with a robustness parameter controlling allowable deviation from the source conditional. We further develop a bias correction procedure that reduces sensitivity to nuisance estimation error. Simulations show substantial improvements in both mean and worst-case prediction error under distribution shift. Applied to cross-national OHCA prediction, transferring models from a US registry to multiple Asian registries where prehospital variables are unrecorded, DRUM yields better-calibrated predictions and improved clinical classification performance across sites.

2605.24211 2026-05-26 cs.CL cs.AI (updated version)

Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

通过类比教学:教育类比生成的模块化流水线

Mariam Barakat, Ekaterina Kochmar

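Institutions: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes a modular pipeline that decomposes educational analogy generation into four stages, grounded in Structure Mapping Theory, and evaluates 12 LLMs on two datasets. Sub-concepts substantially improve explanation quality and closed-setting retrieval precision, and an LLM-as-a-judge evaluation method is introduced.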

Comments 36 pages, 25 figures. To appear in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

AI-generated Chinese abstract

类比通过将不熟悉的概念与已知概念联系起来,帮助学习者理解。尽管最近取得了进展,大型语言模型(LLM)在生成与人类质量相当的类比方面仍然困难。我们提出了一个用于教育类比生成的模块化流水线,将任务分解为四个阶段:源发现、子概念生成、解释生成和评估。基于结构映射理论,该流水线能够系统地、逐阶段地分析模型选择和输入配置如何影响类比质量。我们在两个具有结构化子概念注释的数据集(SCAR和ParallelPARC)上评估了来自六个模型家族的12个最先进的LLM,以及用于封闭设置检索的七个嵌入模型。我们的结果表明,子概念显著提高了解释质量和封闭设置检索精度,但在开放式源生成中提供的益处有限。我们进一步引入了一种LLM作为评判的评估方法,并针对七名注释者的人类注释验证了其评分,发现Claude Sonnet 4.6在人类排名上的对齐比细粒度绝对分数更可靠。综合来看,我们的发现揭示了孤立研究无法捕捉的跨阶段交互,并强调了子概念基础作为类比质量生成的关键驱动因素。

English abstract

Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (LLMs) still struggle to generate analogies comparable in quality to those produced by humans. We present a modular pipeline for educational analogy generation, decomposing the task into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Grounded in Structure Mapping Theory, the pipeline enables systematic, stage-by-stage analysis of how model choice and input configuration affect analogy quality. We evaluate 12 state-of-the-art LLMs across six model families on two datasets with structured sub-concept annotations (SCAR and ParallelPARC), alongside seven embedding models for closed-setting retrieval. Our results show that sub-concepts substantially improve explanation quality and closed-setting retrieval precision but provide limited benefit in open-ended source generation. We further introduce an LLM-as-a-judge evaluation methodology and validate its scoring against human annotations from seven annotators, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores. Taken together, our findings reveal cross-stage interactions that isolated studies cannot capture, and highlight sub-concept grounding as a key driver of high-quality analogy generation.

2605.24192 2026-05-26 cs.LG cs.AI cs.CV (updated version)

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

滤波后验均值集合:扩散泛化分析模型的统一框架

Matthew Niedoba, Berend Zwartsenberg, Frank Wood

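Institutions: University of British Columbia; Inverted AI; Alberta Machine Intelligence Institute

AI summary: Proposes Filtered Posterior Mean Collections (FPMCs), a unified framework that models the generalization behavior of diffusion-model denoising functions through query precision vectors, response weights, and source distributions, and improves existing methods through soft relaxations and source-distribution augmentations.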

Comments 27 Pages, 7 figures

AI-generated Chinese abstract

作为图像扩散模型骨干的神经网络去噪函数,在多种网络架构和训练超参数下展现出显著一致的泛化行为。最近一系列研究试图通过聚合训练数据集补丁的后验加权平均值来建模这些网络的输出。在本工作中,我们将这些方法整合为一个统一的模型类,称为滤波后验均值集合(FPMC)。我们使用查询精度向量、响应权重和源分布定义该模型类,并说明现有方法可通过这些设计轴的具体选择恢复。依次研究每个轴,我们发现FPMC性能可以通过对先前基于补丁的方法进行软松弛以及通过源分布的增强来改进。将这些发现应用于现有的FPMC,我们在三个自然图像数据集上展示了样本的一致改进。

English abstract

The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization behaviour across a wide variety of network architectures and training procedure hyperparameters. A recent line of research has sought to model the outputs of these networks by aggregating posterior weighted averages of training dataset patches. In this work, we consolidate these approaches into a unified model class which we call Filtered Posterior Mean Collections (FPMCs). We define this model class using query precision vectors, response weights, and source distributions, and illustrate that existing methods are recoverable with specific choices of these design axes. Investigating each axis in turn, we find that FPMC performance can be improved with soft relaxations of prior patch-based methods, and through augmentations of source distributions. Applying these findings to an existing FPMC, we demonstrate consistent sample improvement across three natural image datasets.
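The "posterior weighted averages" mentioned in the abstract build on a standard fact: if a clean sample is drawn uniformly from a finite training set and corrupted with isotropic Gaussian noise of standard deviation sigma, the posterior mean of the clean sample is a softmax-weighted average of the training points. The sketch below computes that whole-sample posterior mean. It is not the FPMC model itself: the patch-level aggregation, query precision vectors, and response weights from the paper are not implemented, and the function name and toy data are assumptions.

```python
import math


def posterior_mean(noisy, train, sigma):
    """E[x0 | noisy] when x0 is uniform over `train` and noise is N(0, sigma^2 I).

    Returns the posterior mean and the per-training-point posterior weights.
    """
    # Log-posterior of each training point, up to a shared constant.
    logits = [
        -sum((a - b) ** 2 for a, b in zip(noisy, x)) / (2.0 * sigma ** 2)
        for x in train
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(noisy)
    mean = [sum(w * x[d] for w, x in zip(weights, train)) for d in range(dim)]
    return mean, weights


train_set = [[0.0, 0.0], [1.0, 1.0]]

# High noise: both training points are plausible, so the output is a blend.
blend, blend_w = posterior_mean([0.5, 0.5], train_set, sigma=1.0)

# Low noise: the nearest training point dominates.
sharp, sharp_w = posterior_mean([0.9, 0.9], train_set, sigma=0.1)
print(blend, sharp)
```

At low noise the weights collapse onto the nearest training example, which is one reason an exact posterior mean over whole images reproduces training data rather than generalizing.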

2605.24183 2026-05-26 cs.DB cs.AI cs.LG (updated version)

AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

AvalancheBench: 通过潜在世界恢复评估企业数据智能体

Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta

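Institutions: Brown University and Snowflake

AI summary: Proposes the AvalancheBench benchmark, which evaluates the analytical understanding of enterprise data agents through latent world recovery and shows how early mistakes propagate into systematically wrong recommendations.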

AI-generated Chinese abstract

我们介绍了AvalancheBench,一个通过潜在世界恢复评估企业数据智能体的基准。AvalancheBench在三个方面改进了现有基准。首先,它评估分析理解而非流程完成:系统根据是否恢复了解释数据的片段、驱动因素、时间事件和关系来评分,而不仅仅是执行工作流或生成看似合理的报告。其次,它通过从已知潜在世界生成观测数据,为目标驱动分析提供真实基准,从而允许对不完整但有效的恢复给予部分分数。第三,它揭示了早期分析错误如何传播到后续结论:遗漏的片段、合并的事件或错误的归因可能导致系统性错误推荐。在这个意义上,AvalancheBench通过提供一个受控环境来诊断智能体是否恢复了企业数据背后的分析结构,从而补充了真实数据基准。在第一个电子商务用例中,领先编码智能体的最强配置仅恢复了26%的评分标准,失败集中在通用客户细分和合并的时间事件上。

English abstract

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.

2605.24180 2026-05-26 physics.soc-ph cs.AI cs.DL cs.HC (updated version)

Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

大规模人机协作科学:一项全球大规模随机现场实验

Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin

Institutions: Kellogg School of Management, Northwestern University; Center for Science of Science and Innovation, Northwestern University; Northwestern Institute on Complex Systems, Northwestern University; Department of Computer Science, Stanford University; Goizueta Business School, Emory University; Department of Information Science, Cornell University

AI summary: Uses a global large-scale randomized field experiment to study whether customized feedback generated by large language models (LLMs) raises researchers' revision rates and encourages use of AI tools, with the largest benefit for resource-constrained researchers.

AI-generated Chinese abstract

合作是现代科学的定义模式,但其核心机制——反馈——仍然难以观察、难以扩展且分布不均。在此,我们测试大型语言模型(LLMs)是否能够贡献于这一隐蔽但至关重要的实践,并重新分配科学反馈——这一知识生产中必不可少但稀缺的资源。在一项全球大规模随机现场实验中,我们为来自133个地理区域的超过45,000名研究人员的150个领域的31,000多篇arXiv预印本提供了定制的LLM生成反馈。与对照组相比,收到反馈的作者修改其手稿的可能性显著更高,相对于基线修订率提高了12.55%。接触AI反馈还增加了作者在未来论文中使用LLM工具的频率,表明科学实践发生了长期转变。这些效应在非英语主导研究区域的作者、与学术文献联系较少的手稿以及h指数较低和职业早期阶段的团队中最为显著,这与AI反馈可能在获取及时批评受限的地方提供最大益处的观点一致。总之,这些发现提供了因果证据,表明基于AI的结构化干预可以将科学反馈的获取从一种主要是私人优势转变为更广泛分布的资源,对全球研究体系的生产力、公平性和能力产生更广泛的影响。

English abstract

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.

2605.24173 2026-05-26 cs.CL cs.AI cs.CR cs.LG (updated version)

Extracting Training Data from Diffusion Language Models via Infilling

通过填充从扩散语言模型中提取训练数据

Yihan Wang, N. Asokan

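Institutions: University of Waterloo; KTH Royal Institute of Technology

AI summary: Proposes an infilling extraction protocol, parameterized by an arbitrary binary mask, that exploits the bidirectional denoising ability of diffusion language models. Mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels that are inaccessible in autoregressive models.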

AI-generated Chinese abstract

大型语言模型中的记忆化几乎完全通过前缀条件提取进行研究,这是自回归模型的自然选择。然而,扩散语言模型(DLM)可以在任意位置去噪掩码标记。因此,仅前缀探测揭示了DLM中记忆化的一个方面,并显著低估了训练数据提取的风险。为了真实地建模DLM中训练数据的可提取性,我们引入了“填充提取”,这是一种由任意二进制掩码参数化的数据提取协议,它包含了仅前缀探测并考虑了DLM的双向归纳偏差。在LLaDA-8B和Dream-7B上,跨五种提取模式、三种训练流水线和三个涵盖逐字和部分泄漏的语料库进行实例化,我们发现掩码几何形状控制着可提取性:边缘条件掩码比前缀条件掩码最多多提取三倍的逐字序列,并且双向访问打开了自回归模型中无法利用的通道。特别是,我们表明,一个能够访问已删除个人身份信息的训练数据的现实对手,甚至可以从DLM中提取被删除的电子邮件地址,其召回率高于规模匹配的自回归模型。解码的可调参数可测量地影响提取性能,而后续的监督微调阶段并未消除先前的记忆化。

English abstract

Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. To realistically model the extractability of training data in DLMs, we introduce infilling extraction, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data in which personally identifiable information has been redacted can even achieve higher recall when extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable parameters for decoding measurably affect extraction performance, while a follow-up supervised finetuning stage does not eliminate the prior memorization.
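The idea of parameterizing extraction by a binary mask can be sketched in a few lines. The code below builds two mask geometries and checks whether a model's infill reproduces the hidden tokens verbatim. The abstract does not define "edge-conditioned" precisely, so reading it as "both ends visible, middle hidden" is an assumption, as are all function names, the `[MASK]` placeholder, and the toy sentence. No real DLM is called; the `generated` list stands in for a model's output.

```python
def prefix_mask(length, n_prefix):
    """1 marks a hidden position. Only the first n_prefix tokens are visible."""
    return [0 if i < n_prefix else 1 for i in range(length)]


def edge_mask(length, n_left, n_right):
    """Assumed reading of 'edge-conditioned': both ends visible, middle hidden."""
    if n_left + n_right > length:
        raise ValueError("visible edges exceed sequence length")
    return [0 if i < n_left or i >= length - n_right else 1 for i in range(length)]


def apply_mask(tokens, mask, mask_token="[MASK]"):
    """Build the model input by hiding every position where mask == 1."""
    return [mask_token if m else t for t, m in zip(tokens, mask)]


def is_verbatim(original, generated, mask):
    """True when every hidden position was reproduced exactly."""
    return all(o == g for o, g, m in zip(original, generated, mask) if m)


tokens = ["contact", "alice", "at", "alice@example.com", "for", "details"]
p_mask = prefix_mask(len(tokens), 2)
e_mask = edge_mask(len(tokens), 2, 1)
masked_input = apply_mask(tokens, e_mask)

# Stand-in for a model infill that gets one hidden token wrong.
generated = ["contact", "alice", "at", "bob@example.com", "for", "details"]
print(masked_input, is_verbatim(tokens, generated, e_mask))
```

A prefix mask is the special case `edge_mask(length, n_prefix, 0)`, which mirrors the abstract's statement that infilling extraction subsumes prefix-only probing.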

2605.24172 2026-05-26 cs.AI (updated version)

EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

EPPC-OASIS:面向安全消息中电子患者-提供者通信挖掘的本体感知适应与结构化推理精炼

Samah Fodeh, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Afshan Khan, Ganesh Puthiaraju, Linhai Ma, Srivani Talakokkul, Jordan Alpert, Sarah Schellhorn

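Institutions: Yale University; Cleveland Clinic Lerner College of Medicine of Case Western Reserve University, Cleveland Clinic; Medical Oncology, Yale School of Medicine

AI summary: Proposes the EPPC-OASIS framework, which augments fine-tuning with an ontology-aware Wasserstein alignment objective and adds inference-refinement steps to automatically extract structured EPPC codes from secure messages, with consistent improvements across multiple language models.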

AI-generated Chinese abstract

安全的患者-提供者消息包含临床上重要的通信行为,这些行为难以大规模手动表征。电子患者-提供者通信(EPPC)框架为编码这些行为提供了本体,但自动提取仍然具有挑战性,因为预测必须保留细粒度的代码/子代码结构,同时将注释锚定在消息文本中。我们开发了EPPC-OASIS,一种用于结构化EPPC提取的本体感知适应方法,并将其与可部署的推理精炼程序相结合,旨在提高最终注释的一致性。EPPC-OASIS通过Wasserstein对齐目标增强监督微调,该目标鼓励模型表示邻域与EPPC本体派生邻域之间的对齐,而推理精炼则使用验证、自一致性、混合校正以及选择或集成来解决残差预测错误。我们在一个去标识化的安全患者-提供者消息语料库上,针对多个开放权重语言模型,将框架与提示、监督微调、基于偏好和鲁棒性导向的基线进行了比较。跨模型家族,最佳可部署流水线实现了77.13%的代码+子代码F1和63.83%的三元组F1,相比最强的监督微调基线,分别获得了+1.39和+2.12 F1点的适度但一致的绝对提升。这些结果表明,结合结构化推理精炼的本体感知适应可以支持可扩展的回顾性EPPC挖掘,但在操作使用前需要外部验证。

English abstract

Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale. The Electronic Patient-Provider Communication (EPPC) framework provides an ontology for coding these behaviors, but automated extraction remains challenging because predictions must preserve fine-grained code/sub-code structure while grounding annotations in message text. We developed EPPC-OASIS, an ontology-aware adaptation approach for structured EPPC extraction, and combined it with deployable inference-refinement procedures designed to improve the coherence of final annotations. EPPC-OASIS augments supervised fine-tuning with a Wasserstein alignment objective that encourages alignment between model representation neighborhoods and EPPC ontology-derived neighborhoods, while inference refinement uses verification, self-consistency, hybrid correction, and selection or ensembling to address residual prediction errors. We evaluated the framework on a de-identified corpus of secure patient-provider messages against prompting, supervised fine-tuning, preference-based, and robustness-oriented baselines across multiple open-weight language models. Across model families, the best deployable pipeline achieved 77.13% Code+Sub-code F1 and 63.83% Triplet F1, corresponding to modest but consistent absolute gains of +1.39 and +2.12 F1 points over the strongest supervised fine-tuning baseline. These results suggest that ontology-aware adaptation with structured inference refinement can support scalable retrospective EPPC mining, although external validation is needed before operational use.

2605.24171 2026-05-26 cs.LG cs.AI (updated version)

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

PromptAudit: 审计基于LLM的漏洞检测中的提示敏感性

Steffen J. Camarato, Yahya Hmaiti, Mandana Ghadamian, David Mohaisen

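Institutions: University of Central Florida

AI summary: Proposes the PromptAudit framework, which fixes the dataset, decoding, and parsing and varies only the prompting strategy, and evaluates five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples in 16 programming languages). Standard chain-of-thought prompting performs best overall, and prompt sensitivity is a first-class system property.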

AI-generated Chinese abstract

大型语言模型越来越多地用于漏洞检测,但它们在各种提示表述下的可靠性仍未得到表征。我们提出了PromptAudit,一个受控评估框架,通过固定数据集、解码和解析,仅变化提示策略来隔离提示效应。我们在1000个CVE(涵盖16种编程语言的6074个代码样本)上,使用五种提示策略对五个开源模型进行评估,计算准确率、召回率、弃权率、覆盖率和有效F1。我们发现,标准思维链提示实现了最强的整体操作性能,而少样本提示提供了模型相关的益处,对于提示敏感的模型最为显著。相比之下,自适应思维链经常抑制召回率,而自一致性导致过度弃权,急剧降低了有效性能。这些结果表明,漏洞检测行为由模型和提示共同决定,并且提示敏感性是一个一级系统属性,必须在评估和部署中明确表征。

English abstract

Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.

2605.24168 2026-05-26 cs.AI cs.LG (updated version)

Inference Time Context Sparsity: Illusion or Opportunity?

推理时上下文稀疏性:幻觉还是机遇?

Sahil Joshi, Prithvi Dixit, Agniva Chowdhury, Anshumali Shrivastava, Joseph E. Gonzalez, Ion Stoica, Kumar Krishna Agrawal, Aditya Desai

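Institutions: Berkeley

AI summary: Drawing on empirical and theoretical evidence, the paper argues that extreme but principled sparsity along the context dimension is not only feasible for long-context LLM inference but can also substantially accelerate processing (for example, up to a 10x speedup on the H100), challenging the need for dense attention.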

Comments 19 pages, 8 figures

AI-generated Chinese abstract

稀疏性长期以来一直是LLM效率的核心主题,但其在上下文处理中的作用仍未解决。随着LLM工作负载转向更长的上下文和智能体交互,注意力的计算和内存瓶颈变得日益关键,这引发了这些约束是否根本性的问题。我们的立场是,这些约束是人为且不必要的,LLM推理的未来在于沿上下文维度的极端但原则性的稀疏性。这一立场得到了多方面的经验和理论证据支持。首先,我们发现坚持密集注意力是不合理的,因为在长上下文中,查询实际上将O(N)个注意力信息投影到维度d << N的隐藏空间中,使得该过程固有地有损。其次,我们对跨越五个模型家族的20个LLM进行了广泛的稀疏性研究,变化上下文长度和不同稀疏水平。我们经验性地展示了一个强烈趋势:当前的LLM,尽管未针对上下文稀疏性进行训练,但在不同复杂度的任务(包括检索、多跳QA、数学推理和智能体编码)中对推理时解码稀疏性表现出显著的鲁棒性。重要的是,我们还表明当前的硬件已经足以从这种稀疏性中实现实质性收益。例如,我们的稀疏解码内核在H100等硬件上以50倍稀疏性水平将大上下文处理加速高达10倍,相比FlashInfer。总体而言,这些结果将极端上下文稀疏性定位为不仅是启发式的,而是LLM推理、训练和架构设计的原则性基础:既可行又有益,是未来系统的一个有吸引力的方向。

English abstract

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d << N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.
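To make "decode sparsity along the context dimension" concrete, the sketch below compares dense single-query attention with a variant that attends only to the `keep` highest-scoring keys. This is a generic top-k illustration, not the paper's kernels or its selection rule, neither of which is described in the abstract. It also scores every key before selecting, so it saves no compute; it only shows how close the sparse output can stay to the dense one when attention mass is concentrated. All names and toy values are assumptions.

```python
import math


def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, keep=None):
    """Scaled dot-product attention for one decode-step query.

    keep=None attends to every key (dense). Otherwise only the `keep`
    highest-scoring keys take part in the softmax (top-k sparsity).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    indices = list(range(len(keys)))
    if keep is not None:
        indices = sorted(indices, key=lambda i: scores[i], reverse=True)[:keep]
    weights = softmax([scores[i] for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]


query = [4.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0], [0.0, -4.0]]
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

dense = attend(query, keys, values)
sparse = attend(query, keys, values, keep=1)  # keep 1 of 4 keys
print(dense, sparse)
```

In this toy case one key carries almost all of the attention mass, so dropping the other three changes the output only marginally. Whether real workloads behave this way is the empirical question the paper studies.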

2605.24162 2026-05-26 cs.LG cs.AI (updated version)

Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

知识图谱调制的深度学习用于有限样本临床数据分析

Yuwei Xue, Sakib Mostafa, James Zou, Joseph Liao, Maximilian Diehn, Ash A. Alizadeh, Lei Xing, Md. Tauhidul Islam

Institutions: Department of Radiation Oncology, Stanford University; Stanford University; Stanford University School of Medicine; Department of Computer Science, Stanford University; Department of Electrical Engineering, Stanford University; Department of Biomedical Data Science, Stanford University School of Medicine; Stanford Cancer Institute, Stanford University; Institute for Stem Cell Biology and Regenerative Medicine, Stanford University; Department of Medicine, Division of Oncology, Stanford University; Institute of Computational and Mathematical Engineering, Stanford University

AI summary: Proposes the Graph-in-Graph (GiG) framework, which represents each patient as a modular graph and integrates biological knowledge graphs, substantially improving predictive performance on limited-sample clinical tasks.

Comments 17 pages, 4 figures, 12 supplementary figures

AI-generated Chinese abstract

生物系统受结构化分子相互作用支配,其中通路、调控回路和功能基因关系塑造细胞行为和疾病进展。这些知识大多自然表示为图。然而,大多数生物医学AI模型无法直接使用图编码的生物知识,而是需要压缩的低维表示,这可能会丢失重要结构并降低性能,尤其是在有限样本的临床研究中。这里,我们引入Graph-in-Graph (GiG),一个知识图谱调制的深度学习框架,用于数据高效的临床预测。GiG将每个患者表示为一个独立的模块化图,其中精选的生物知识图谱定义边,患者特定的测量(如基因表达)定义节点特征。这种设计允许整合多个生物知识图谱,同时在患者级表示学习中保留基因-基因相互作用和通路拓扑。在涵盖近9700名患者和五个临床任务(包括液体活检癌症检测、前列腺癌诊断和32类泛癌分类)的队列中,GiG持续优于传统和最先进的方法,在有限样本设置中增益最大。在具有挑战性的前列腺癌诊断任务中,GiG相对于竞争方法将macro-F1提高了最多49个百分点。用随机拓扑替换真实通路图的对照实验证实,这些增益源于生物学基础的知识图谱结构,而非仅图建模。这些发现表明,知识图谱调制的深度学习可以提高临床数据分析的鲁棒性、可解释性和样本效率,并为将生物知识图谱整合到预测建模中提供了一个原则性框架。

English abstract

Biological systems are governed by structured molecular interactions, where pathways, regulatory circuits, and functional gene relationships shape cellular behavior and disease progression. Much of this knowledge is naturally represented as graphs. However, most biomedical AI models cannot directly use graph-encoded biological knowledge and instead require compressed low-dimensional representations, which can lose important structure and reduce performance, especially in limited-sample clinical studies. Here, we introduce Graph-in-Graph (GiG), a knowledge graph-modulated deep learning framework for data-efficient clinical prediction. GiG represents each patient as a standalone modular graph, in which curated biological knowledge graphs define edges and patient-specific measurements, such as gene expression, define node features. This design allows multiple biological knowledge graphs to be integrated while preserving gene-gene interactions and pathway topology during patient-level representation learning. Across cohorts comprising nearly 9,700 patients and five clinical tasks, including liquid biopsy cancer detection, prostate cancer diagnosis, and 32-class pan-cancer classification, GiG consistently outperforms traditional and state-of-the-art methods, with the largest gains in limited-sample settings. On the challenging prostate cancer diagnosis task, GiG improves macro-F1 by up to 49 percentage points relative to competing methods. Control experiments replacing real pathway graphs with random topologies confirm that these gains arise from biologically grounded knowledge graph structure rather than graph modeling alone. These findings show that knowledge graph-modulated deep learning can improve robustness, interpretability, and sample efficiency in clinical data analysis, and provide a principled framework for integrating biological knowledge graphs into predictive modeling.

2605.24155 2026-05-26 cs.IR cs.AI cs.LG (updated version)

An Interpretable CF-RL-TOPSIS Fusion Model for Skills-Aware Talent Recommendation

一种可解释的CF-RL-TOPSIS融合模型用于技能感知的人才推荐

Özkan Canay

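Institutions: Sakarya University

AI summary: Proposes CF-RL-TOPSIS, an interpretable late-fusion model that combines collaborative filtering, a reinforcement-style bandit, and entropy-weighted TOPSIS, and examines on ICT talent-history benchmarks the conditions under which it is effective across data regimes.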

Comments Preprint submitted to Knowledge-Based Systems; 4 figures and 8 tables

AI-generated Chinese abstract

有效的技能感知人才推荐必须平衡行为转换模式、轨迹敏感适应性和可检查的职业层面标准。然而,关于这些信号如何相互作用的公共基准证据仍然有限。本研究提出CF-RL-TOPSIS,一种可解释的后期融合模型,它集成了转换感知协同分支、紧凑的强化风格职业族臂和由六个语义代理构建的熵权TOPSIS分支;验证选择的融合系数保持可审计。该模型在两个冻结的公共ICT人才历史基准JobHop和Karrierewege上进行了评估,使用重复的时间顺序前5排名和配对Wilcoxon检验。在JobHop上,完整混合模型达到NDCG@5 = 0.3040 +/- 0.0073,并显著优于repeat-last、item Markov、转换感知协同过滤、CF+TOPSIS混合、GRU4Rec和SASRec(计划比较中p <= 0.0039)。在Karrierewege上,混合模型保持竞争力,但未显著超过最强的Markov基线,揭示了一个持久性主导的环境,其中臂分支适当缩小到接近零权重。代理敏感性、家庭级深度Q网络和运行时检查支持这一解释,一个详细的用户级案例展示了如何检查单个推荐的各分支分数、标准权重和排名变化。贡献不是基准无关的优越性声明,而是对透明后期融合在简单延续启发式之外增加价值的条件的可重复说明。在语义丰富、非饱和的人才历史机制中,三个分支相互增强;在持久性主导机制中,相同的架构通过其协同骨干保持竞争力,而自适应分支正确处于非活跃状态。

英文摘要

Effective skills-aware talent recommendation must balance behavioral transition patterns, trajectory-sensitive adaptation, and inspectable occupation-level criteria. Evidence from public benchmarks on how these signals interact, however, remains limited. This study proposes CF-RL-TOPSIS, an interpretable late-fusion model that integrates a transition-aware collaborative branch, a compact reinforcement-style occupation-family bandit, and an entropy-weighted TOPSIS branch constructed from six semantic proxies; the validation-selected fusion coefficients remain auditable. The model is evaluated on two frozen public ICT talent-history benchmarks, JobHop and Karrierewege, using repeated chronological top-5 ranking and paired Wilcoxon tests. On JobHop the full hybrid attains NDCG@5 = 0.3040 +/- 0.0073 and significantly surpasses repeat-last, item Markov, transition-aware collaborative filtering, the CF+TOPSIS hybrid, GRU4Rec, and SASRec (p <= 0.0039 across planned comparisons). On Karrierewege the hybrid remains competitive but does not significantly exceed the strongest Markov baseline, revealing a persistence-dominated setting in which the bandit branch appropriately shrinks to near-zero weight. Proxy-sensitivity, family-level deep Q-network, and runtime checks support this interpretation, and a worked user-level case shows how branch scores, criterion weights, and rank shifts can be inspected for an individual recommendation. The contribution is not a benchmark-agnostic superiority claim, but a reproducible account of the conditions under which transparent late fusion adds value beyond simple continuation heuristics. In semantically rich, non-saturating talent-history regimes the three branches reinforce one another; in persistence-dominated regimes the same architecture remains competitive through its collaborative backbone, with the adaptive branch correctly inactive.
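
The entropy-weighted TOPSIS branch mentioned in the abstract can be illustrated with a small, generic sketch. This is not the authors' implementation: the occupations, the three proxy columns, and all function names below are hypothetical, every criterion is assumed to be a benefit criterion, and the abstract's six semantic proxies are not reproduced here.

```python
import math

def entropy_weights(matrix):
    """Entropy weights for an m x n matrix of positive criterion scores.

    Columns whose values vary more across alternatives receive larger weights.
    Assumes m >= 2, positive column sums, and at least one non-constant column.
    """
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        shares = [value / total for value in column]
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(m)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]

def topsis_scores(matrix, weights):
    """Relative closeness to the ideal solution (all criteria are benefits)."""
    n = len(matrix[0])
    lengths = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / lengths[j] for j in range(n)] for row in matrix]
    ideal = [max(row[j] for row in weighted) for j in range(n)]
    anti_ideal = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)       # distance to the best case
        d_neg = math.dist(row, anti_ideal)  # distance to the worst case
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical occupations scored on three made-up proxies.
occupations = ["data engineer", "network admin", "ml engineer"]
proxy_matrix = [
    [0.9, 0.2, 0.5],
    [0.6, 0.8, 0.4],
    [0.3, 0.5, 0.9],
]
weights = entropy_weights(proxy_matrix)
scores = topsis_scores(proxy_matrix, weights)
ranking = sorted(zip(occupations, scores), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Because both the weights and the per-criterion distances are explicit, a single recommendation can be inspected criterion by criterion, which is the kind of auditability the abstract emphasizes.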

2605.24154 2026-05-26 cs.AI cs.SE 版本更新

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Palette: 一种模块化、可控且高效的框架,用于按需授权安全对齐放松的LLMs

Qitao Tan, Xiaoying Song, Arman Akbari, Arash Akbari, Yanzhi Wang, Xiaoming Zhai, Lingzi Hong, Zhen Xiang, Jin Lu, Geng Yuan

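Affiliations * University of Georgia; University of North Texas; Northeastern University

AI summary Proposes Palette, a framework that identifies a refusal direction via multi-objective search and internalizes it through lightweight adaptation, relaxing refusal behavior on demand in authorized domains while preserving standard safety elsewhere and supporting modular composition of multi-domain authorizations.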

详情
AI中文摘要

当前基础模型的安全对齐大多遵循“一刀切”范式,跨用户和上下文应用相同的拒绝策略。因此,模型可能拒绝对于一般用户不安全但授权专业人员合法的请求,限制了专业环境中的有用性。现有方法要么需要昂贵的重新对齐,要么依赖推理时控制,但存在控制不精确和延迟增加的问题。为此,我们提出Palette,一个模块化、可控且高效的框架,选择性地放松授权目标领域的拒绝行为,同时在其他地方保持标准安全。我们的方法通过多目标搜索识别拒绝方向,并通过轻量级适配将其内化到模型中。Palette进一步支持模块化组合:它独立学习领域特定的安全控制,并通过参数合并进行组合,无需重新训练即可实现按需多领域授权。在四个安全基准、多个模型变体以及LLMs和VLMs上的实验表明,Palette在不牺牲通用实用性的情况下提供精确的安全控制,为基础模型适应多样化专业需求提供了一条实用路径。

英文摘要

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

2605.24139 2026-05-26 cs.AI cs.LG 版本更新

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

MAPLE:不完全信息游戏中AlphaZero的多状态聚合策略评估

Qian-Rong Li, Hung Guei, I-Chen Wu, Ti-Rong Wu

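Affiliations * Department of Computer Science, National Yang Ming Chiao Tung University; Institute of Information Science, Academia Sinica

AI summary Proposes MAPLE, which aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS, and improves Elo by 291 on Phantom Go and 136 on Dark Hex.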

Comments Accepted by the IEEE Conference on Games (IEEE CoG 2026)

详情
AI中文摘要

不完全信息游戏(IIGs)具有挑战性,因为玩家必须在未完全观察真实游戏状态的情况下做出决策。虽然AlphaZero在完美信息游戏中取得了显著成功,但将其扩展到IIGs仍然困难。现有的基于搜索的方法,如完美信息蒙特卡洛(PIMC),存在策略融合问题,而信息集蒙特卡洛树搜索(IS-MCTS)在与神经网络结合时计算成本高昂。在本文中,我们提出了多状态聚合策略评估(MAPLE),一种树搜索方法,它在单个搜索树内聚合来自多个采样世界状态的策略和价值评估,结合了PIMC和IS-MCTS的优点,同时保持可控的计算成本。我们进一步引入基于孪生网络的采样策略,从信息集中选择信息丰富的世界状态。在Phantom Go和Dark Hex上的实验表明,MAPLE显著优于基于PIMC的AlphaZero基线,分别实现了291和136的Elo提升。这些结果表明,MAPLE是一种在不完全信息游戏中进行AlphaZero式学习的有效方法。

英文摘要

Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in perfect-information games, extending it to IIGs remains difficult. Existing search-based approaches, such as Perfect Information Monte Carlo (PIMC), suffer from strategy fusion, while Information Set Monte Carlo Tree Search (IS-MCTS) incurs high computational cost when combined with neural networks. In this paper, we propose Multi-State Aggregated PoLicy Evaluation (MAPLE), a tree search method that aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS while maintaining a controllable computational cost. We further incorporate a Siamese-based sampling strategy to select informative world states from the information set. Experiments on Phantom Go and Dark Hex show that MAPLE significantly outperforms the PIMC-based AlphaZero baseline, achieving Elo improvements of 291 and 136, respectively. These results demonstrate that MAPLE is an effective approach for AlphaZero-style learning in imperfect-information games.
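
To put the reported Elo gains in perspective, the sketch below converts an Elo difference into an expected score under the standard logistic Elo model. That the paper's ratings follow this conventional 400-point scale is an assumption, and the function name is illustrative.

```python
def expected_score(elo_difference):
    """Expected score of the stronger player under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** (-elo_difference / 400.0))

# Elo gains reported in the abstract for Phantom Go and Dark Hex.
for game, gain in [("Phantom Go", 291), ("Dark Hex", 136)]:
    print(f"{game}: +{gain} Elo -> expected score {expected_score(gain):.3f}")
```

Under that assumption, a 291-point gain corresponds to an expected score of roughly 0.84 against the baseline, and a 136-point gain to roughly 0.69.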

2605.24138 2026-05-26 cs.SE cs.AI 版本更新

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

多智能体编程中的对话模式理解:以斐波那契游戏开发为例

Srijita Basu, Viktor Kjellberg, Simin Sun, Bengt Haraldsson, Md. Abu Ahammed Babu, Wilhelm Meding, Farnaz Fotrousi, Miroslaw Staron

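Affiliations * Chalmers University of Technology; University of Gothenburg, Gothenburg, Sweden; Research & Development, Volvo Car Corporation, Gothenburg, Sweden

AI summary Analyzes conversations between Designer and Programmer agents across 12 combinations of open-source LLMs and identifies three key dimensions of multi-agent interaction: efficiency, consistency, and effectiveness. The DeepSeek-R1:DeepSeek-R1 pair converges to the correct solution from the first iteration and sustains it, while the other pairs diverge or settle on incorrect solutions.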

Comments 10 pages, 7 figures, AIware, FSE 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地应用于软件工程(SE),但它们在自主、面向角色的协作方面的潜力仍远未得到充分探索。理解多个基于LLM的智能体如何协调、保持角色对齐并收敛到解决方案对SE至关重要,因为简单地让智能体交互并不能可靠地产生正确或稳定的结果。最近的实证研究表明,非结构化或理解不足的交互动态可能导致错误传播、对错误解决方案的过早共识,或阻止收敛的长期分歧,即使在交互早期存在正确的部分解决方案。作为解决这一未被充分探索领域的初步步骤,我们对两个智能体(设计者和程序员)之间的对话进行了系统分析,涉及来自7个开源LLM(Gemma 2、Gemma 3、LLaMA 3.2、LLaMA 3.3、DeepSeek-R1、MiniCPM和Qwen3)的12种模型组合。我们的系统方法揭示了多智能体交互的三个关键维度:效率(收敛的速度和稳定性)、一致性(通过BLEU和ROUGE可视化的角色对齐程度)和有效性(编译成功和错误解决的程度)。结果表明,DeepSeek-R1:DeepSeek-R1对从第一次迭代起就独特地收敛到正确解,并一致地保持到最终迭代,而LLaMA 3.2:LLaMA 3.2和Qwen3:Qwen3对尽管偏离了正确解,但表现出强烈的设计者:程序员角色对齐。其他对偏离了任务,从未收敛到结果。这些发现推进了对智能体编程的理解,并强调了进一步研究理解和校准收敛及停止条件的必要性,这对于未来的自主SE至关重要。

英文摘要

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations drawn from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged to a result. These findings advance the understanding of agentic programming and highlight the need for further research on understanding and calibrating the convergence and stop conditions essential for future autonomous SE.

2605.24137 2026-05-26 cs.SE cs.AI 版本更新

Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

LLM生成的错误报告摘要中幻觉的经验分析与检测

Hinduja Nirujan, Shreyas Patil, Abdallah Ayoub, Ahmad Abdel Latif, Gouri Ginde

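Affiliations * Electrical and Software Engineering

AI summary Empirically investigates hallucinations in LLM-generated bug report summaries from a section-aware perspective and proposes a detection approach that jointly predicts hallucinated content, identifies affected sections, and classifies hallucination types, reaching 0.89 report-level Macro-F1 on a benchmark built from the BugsRepo dataset.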

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于生成软件错误报告的摘要,包括诸如重现步骤(S2R)、实际行为(AB)和预期行为(EB)等章节。然而,这些模型经常产生看似可信但缺乏源报告支持的幻觉,这可能会误导开发者并降低对自动化维护工具的信任。现有的幻觉检测方法通常在完整响应级别评估输出,并未考虑技术文档的结构。一项对80个结构化错误报告摘要的初步探索性研究发现,约47.9%包含缺失信息,而12.3%包含捏造内容,凸显了在错误报告摘要中进行系统性幻觉分析的必要性。在这项工作中,我们从章节感知的角度经验性地调查了LLM生成的错误报告摘要中的幻觉。利用源自Mozilla OSS项目的BugsRepo数据集,我们引入了受控的合成幻觉注入,以构建用于训练和评估的基准。我们提出了一种章节感知的幻觉检测方法,该方法联合预测摘要是否包含幻觉内容、识别受影响的章节,并对幻觉类型进行分类。在多个预训练语言模型上的实验结果表明,所提出的方法在所有任务上均取得了强劲性能,最佳模型在报告级别上获得了0.89的Macro-F1,在章节级别上获得了0.83的Macro-F1,在幻觉类型上获得了0.84的Macro-F1。我们进一步分析了常见的幻觉模式和模型失败模式,以更好地理解当前LLM生成的错误报告摘要的局限性。研究结果强调了章节感知的幻觉分析对于提高软件维护工作流中LLM辅助错误报告摘要可靠性的重要性。

英文摘要

Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.
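
The Macro-F1 figures quoted in the abstract average per-class F1 scores with equal weight per class. A minimal sketch of that metric follows. The label names and predictions are invented for illustration and are not taken from the paper's benchmark.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because
        # each label occurs at least once in y_true or y_pred.
        per_class.append(2 * tp / (2 * tp + fp + fn))
    return sum(per_class) / len(per_class)

# Hypothetical hallucination-type labels for five summary sections.
y_true = ["fabricated", "missing", "none", "none", "missing"]
y_pred = ["fabricated", "none", "none", "none", "missing"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class counts equally, a rare hallucination type that is detected poorly lowers Macro-F1 as much as a common one would, which makes the metric a reasonable choice when class frequencies are imbalanced.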

2605.24117 2026-05-26 cs.AI 版本更新

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

SkillEvolBench:从情景经验到程序性技能的演化基准测试

Yingtie Lei, Zhongwei Wan, Jiankun Zhang, Samiul Alam, Zixuan Zhong, Peizhou Huang, Xin Wang, Jingxuan Zhang, Donghao Zhou, Yunta Hsieh, Zhihao Dou, Hui Shen, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang

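Affiliations * The Ohio State University; The University of Chicago; University College London; University of Michigan; The Chinese University of Hong Kong; Case Western Reserve University; Amazon

AI summary Introduces SkillEvolBench, a benchmark of 180 tasks across six real-world agent environments that evaluates whether LLM agents can distill episodic experience into reusable procedural skills. Current agents rarely form robust skills, and raw-trajectory reuse frequently outperforms distilled skills.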

详情
AI中文摘要

大语言模型(LLM)代理在解决现实任务时会积累丰富的情景轨迹,但目前尚不清楚这种经验能否被提炼为可复用的程序性技能。我们引入了SkillEvolBench,一个用于评估从经验复用到技能形成这一步骤的诊断基准。它包含跨越六个真实代理环境的180个任务,组织成具有共享潜在程序的角色条件任务族。代理从获取任务中学习,使用压缩轨迹和验证器反馈更新外部技能库,然后面对冻结部署任务,测试上下文偏移、对抗性捷径和组合。通过将自生成和策划起始的技能演化与无技能和原始轨迹控制进行比较,SkillEvolBench将程序抽象与基础能力、策划的先验知识和情景轨迹的直接复用分离开来。在十种模型配置和三种代理框架下,我们发现当前代理通常能局部适应,但很少形成稳健的可复用技能。基于技能的条件可以改善获取或重放,个别模型有时在特定部署轴上有所提升,但这些提升在冻结部署下不稳定。原始轨迹复用经常优于蒸馏技能,表明当前的抽象过程丢弃了未来任务仍有用的上下文和程序线索。容量和成本分析进一步表明,编写更多技能或更大的三级资源库并不足够:额外的更新可以改善覆盖范围,同时引入特定于情节的漂移和程序混乱。这些发现将SkillEvolBench定位为一个测试平台,用于衡量一次性经验何时成为持久的程序性知识而非任务局部记忆。

英文摘要

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

2605.24111 2026-05-26 cs.RO cs.AI 版本更新

MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

MASt3R-Nav: 相对3D地图中的WayPixel导航

Vansh Garg, Rohit Jayanti, Krish Pandya, Sarthak Chittawar, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna

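Affiliations * Robotics Research Center, IIIT-Hyderabad, India; University of Heidelberg; MBZUAI

AI summary Proposes a map representation based on pixel-relative connectivity: the map is built from pixel correspondences in the relative 3D coordinate systems of image pairs, the resulting pixel-level graph is used for global path planning, and a controller conditioned on the derived WayPixel Costmap is trained to predict trajectory rollouts.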

Comments 2026 IEEE International Conference on Robotics & Automation (ICRA)

详情
AI中文摘要

视觉导航能力与其底层世界表示紧密相关。与需要全局一致几何的传统3D地图不同,图像或物体相对拓扑图几乎完全放弃了几何理解,但这以牺牲导航能力为代价,通常仅限于教-重复模式。本文提出一种新颖的地图表示,即像素相对连接性,它在几何上精确但不需要全局几何一致性。受近期3D基础图像匹配进展的启发,我们通过基于单个图像对相对3D坐标系中像素对应的图像间连接性,从图像序列构建地图。然后,我们利用该像素级图通过近似和稀疏化图像内像素连接性来执行全局路径规划。由此,我们推导出“WayPixel Costmap”表示,并训练一个以此条件化的控制器来预测轨迹展开。我们展示了这种基于相对几何的密集像素级成本图比其图像级和物体级对应物是更精确的控制预测条件变量。这实现了一个高能力的导航系统,通过在模拟器中的四种导航任务和真实世界演示中得到验证。

英文摘要

Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting it to mere teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real-world demonstrations.

2605.24110 2026-05-26 cs.AI 版本更新

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

EvoCode-Bench:评估多轮迭代交互中的编码智能体

Haiyang Shen, Xuanzhong Chen, Wendong Xu, Yun Ma, Liang Chen, Kuan Li

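Affiliations * UniPat AI; Peking University; Tsinghua University; HKU

AI summary Introduces EvoCode-Bench, which uses stateful multi-round tasks and cumulative tests to evaluate whether coding agents can keep a codebase working as requirements change. Multi-round scores fall well below single-round scores, and even the strongest agents reach only about 50% multi-round success.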

Comments Work in Progress; 32 pages, 10 figures, preprint

详情
AI中文摘要

编码智能体越来越多地被用作迭代开发伙伴,但大多数基准测试仍然评估一个规范后跟一个最终评估。这忽略了一个基本问题:当需求变化时,智能体能否保持自己的代码库正常工作?我们引入了EvoCode-Bench,一个包含26个状态化编码任务和227个评估轮次的基准。每个任务保留智能体的工作空间5-15轮,通过可观察的行为陈述需求,并使用累积可执行测试来检查新需求以及仍然活跃的先前需求。我们使用两个指标评估了13个编码智能体:MT@4,一个四次尝试失败停止的多轮分数;以及SR,一个从参考完成的先前状态开始的单轮分数。对于大多数智能体,SR超过MT@4 22-40分。差距也改变了排名:最高SR的智能体(78.9)在持续执行中仅排名第三(44.0 MT@4)。即使是最强的智能体在多轮指标上也仅达到约50%的成功率,并且到第5轮时,聚合通过率下降到第1轮性能的一半以下。失败分析显示了层级依赖的行为:较弱的智能体早期失败,而较强的智能体存活足够长以暴露规范跟踪和回归失败。我们发布了基准数据和Harbor多轮基础设施。

英文摘要

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.

2605.24106 2026-05-26 cs.LG cs.AI 版本更新

Overcoming "Physics Shock" in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference

克服地球观测中的“物理冲击”:面向PINN洪水推断的异方差不确定性框架

Tewodros Syum Gebre, Jagrati Talreja, Matilda Anokye, Leila Hashemi-Beni

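Affiliations * Built Environment Department, College of Science and Technology, North Carolina A&T State University; United Nations University Institute for Water, Environment and Health

AI summary Proposes an uncertainty-aware physics-informed neural network framework that uses a dynamic warm-start protocol and heteroscedastic uncertainty modeling to address the gradient divergence caused by conflicts between physical constraints and noisy data in remote-sensing flood mapping, achieving a 25% relative IoU improvement on Sen1Floods11.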

Comments This article is accepted in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

详情
AI中文摘要

从遥感数据(如合成孔径雷达SAR)中快速准确地绘制洪水范围对于灾害应急响应至关重要,但标准深度学习模型由于缺乏水文约束常产生物理上不可能的预测。尽管物理信息神经网络(PINNs)试图通过将控制定律直接嵌入损失函数来解决这一问题,但其在真实遥感数据上的应用经常失败。将刚性空间导数(如二维浅水方程)强加于试图拟合噪声SAR散斑的无条件潜在空间会导致灾难性的梯度发散,我们将这一现象称为“物理冲击”。本文提出了一种专门针对应用地球观测的新型不确定性感知PINN框架,以解决这一不稳定性。通过集成动态热身启动协议和通过负对数似然目标建模异方差偶然不确定性,网络学会在高传感器噪声区域动态放松物理约束,而在高置信度区域严格强制执行。在Sen1Floods11数据集上的评估表明,我们的概率注意力门控FNO-UNet成功稳定了多目标优化,与确定性基线相比,交并比(IoU)相对提高了25%。此外,通过深度集成,我们成功地将内在传感器噪声与分布外地形未知性分离开来,为运营机构提供了高度校准、物理一致的置信区间,用于稳健的灾害缓解和实时决策。

英文摘要

Rapid and accurate flood extent mapping from Remote Sensing data, such as Synthetic Aperture Radar (SAR), is critical for operational disaster response, but standard Deep Learning models often produce physically impossible predictions due to a lack of hydrological constraints. While Physics-Informed Neural Networks (PINNs) attempt to address this by embedding governing laws directly into the loss function, their application to real-world remote sensing data frequently fails. Enforcing rigid spatial derivatives (e.g., the 2D Shallow Water Equations) onto unconditioned latent spaces attempting to fit noisy SAR speckle causes catastrophic gradient divergence, a phenomenon we term Physics Shock. In this paper, we propose a novel Uncertainty-Aware PINN framework tailored specifically for applied Earth Observation that addresses this instability. By integrating a dynamic Warm-Start protocol and modeling heteroscedastic aleatoric uncertainty via a negative log-likelihood objective, the network learns to dynamically relax physical constraints in regions of high sensor noise while strictly enforcing them in high-confidence areas. Evaluated on the Sen1Floods11 dataset, our probabilistic Attention-Gated FNO-UNet successfully stabilizes multi-objective optimization, achieving a +25% relative improvement in Intersection over Union (IoU) compared to deterministic baselines. Furthermore, through Deep Ensembles, we successfully disentangle intrinsic sensor noise from out-of-distribution terrain ignorance, providing operational agencies with highly calibrated, physically consistent confidence bounds for robust disaster mitigation and real-time decision-making.
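
The idea of relaxing a constraint where noise is high can be illustrated with the standard Gaussian negative log-likelihood in which the model predicts a per-pixel log-variance. The sketch below is generic and is not the paper's objective: how the physics residual enters the loss is not stated in the abstract, so the residual values, the function names, and the omission of the constant term are all assumptions.

```python
import math

def heteroscedastic_nll(residual, log_variance):
    """Per-pixel Gaussian NLL with a predicted log-variance (constant dropped).

    exp(-log_variance) scales the squared residual, so a large predicted
    variance down-weights that pixel, while the +0.5 * log_variance term
    penalizes claiming high uncertainty everywhere.
    """
    return 0.5 * math.exp(-log_variance) * residual ** 2 + 0.5 * log_variance

def mean_loss(residuals, log_variances):
    terms = [heteroscedastic_nll(r, s) for r, s in zip(residuals, log_variances)]
    return sum(terms) / len(terms)

# Hypothetical residuals: two clean pixels and one speckle-corrupted pixel.
residuals = [0.1, -0.2, 3.0]
confident = [0.0, 0.0, 0.0]              # unit variance everywhere
noise_aware = [0.0, 0.0, math.log(9.0)]  # large variance on the noisy pixel

loss_confident = mean_loss(residuals, confident)
loss_noise_aware = mean_loss(residuals, noise_aware)
print(loss_confident, loss_noise_aware)
```

For a fixed residual r, the per-pixel term is minimized at a log-variance of log(r^2), so the network is rewarded for predicting high variance exactly where residuals are large, and for low variance where the fit is tight.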

2605.24096 2026-05-26 cs.DB cs.AI cs.DC cs.SE 版本更新

The Time is Here for Just-in-Time Systems: Challenges and Opportunities

即时系统的时代已到来:挑战与机遇

Shu Liu, Alexander Krentsel, Shubham Agarwal, Mert Cemri, Ziming Mao, Soujanya Ponnapalli, Alexandros G. Dimakis, Sylvia Ratnasamy, Matei Zaharia, Aditya Parameswaran, Ion Stoica

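Affiliations * UC Berkeley; Bespoke Labs

AI summary Presents Jitskit, an LLM-based just-in-time system synthesis pipeline that synthesizes specialized key-value stores from scratch and beats comparable state-of-the-art systems on all 18 specs tried, by up to 4.6x.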

Comments preprint

详情
AI中文摘要

像键值存储这样的核心系统历史上需要数年时间构建,并且设计为通用型以分摊跨部署的成本,但付出了显著的性能代价。我们认为基于LLM的编码代理现在使一种不同的方法变得可行:即时系统,其中整个系统从零开始合成,专门针对环境、工作负载和所需的系统属性。我们提出了一个即时系统合成流水线Jitskit,并探索了其在从跨不同YCSB工作负载、部署约束(如计算资源)和系统属性(如一致性和持久性)的规格卡中合成键值存储的有效性。Jitskit迭代地改进系统实现,以匹配针对不断演化的评估测试套件的规格。生成的合成系统性能优越,在尝试的18个规格中的18个上击败了可比的最新系统,在最有利的规格上比最佳现成基线高出4.6倍。直接运行Claude Code要么奖励黑客,要么性能比Jitskit差高达5.4倍。我们讨论了构建Jitskit过程中克服的挑战和关键收获。

英文摘要

Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x. We discuss the challenges we overcame in building Jitskit and our key takeaways.

2605.24084 2026-05-26 cs.LG cs.AI cs.LO 版本更新

Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks

Verified SHAP: 神经网络精确Shapley值的可证明界

David Boetius, Shahaf Bassan, Guy Katz, Stefan Leue, Tobias Sutter

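Affiliations * University of Konstanz, Konstanz, Germany; Hebrew University of Jerusalem, Jerusalem, Israel; University of St.Gallen, St.Gallen, Switzerland

AI summary Uses neural network verification to compute provable lower and upper bounds on SHAP values that can be tightened arbitrarily, scaling to search spaces orders of magnitude larger than existing exact methods.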

Comments Accepted at ICML 2026. 34 pages, 13 figures

详情
AI中文摘要

Shapley加法解释(SHAP)被广泛认为对于神经网络在计算上是棘手的,因为它们在输入特征上诱导出指数搜索空间。在这项工作中,我们迈出了将精确SHAP计算扩展到更大搜索空间的第一步,引入了一种算法,该算法利用神经网络验证的最新进展来计算神经网络SHAP值的任意紧的精确下界和上界,最终恢复精确的SHAP值。我们证明了我们的方法可以扩展到比最先进的精确方法大数个数量级的搜索空间。这为精确SHAP计算提供了重要的第一步,并为在更大搜索空间上评估统计近似方法建立了原则性的基石。

英文摘要

Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponential search space over the input features. In this work, we take a first step towards scaling exact SHAP computation to larger search spaces by introducing an algorithm that leverages recent advances in neural network verification to compute arbitrarily tight exact lower and upper bounds on SHAP values for neural networks, ultimately recovering the exact SHAP values. We demonstrate that our approach scales to orders of magnitude larger search spaces than state-of-the-art exact methods. This provides an important first step towards exact SHAP computation and establishes a principled cornerstone for evaluating statistical approximation methods on larger search spaces.
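
The exponential search space mentioned in the abstract is easiest to see in the brute-force definition of Shapley values. The sketch below enumerates every coalition for a toy set function. It is the naive baseline, not the paper's bound-based algorithm, and the value function and names are hypothetical.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n):
    """Exact Shapley values by enumerating all coalitions of the other features.

    Each feature requires 2 ** (n - 1) coalition evaluations, which is why
    this approach only works for very small n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi

def toy_value(coalition):
    """Hypothetical set function: two additive terms plus one interaction."""
    value = 0.0
    if 0 in coalition:
        value += 2.0
    if 1 in coalition:
        value += 3.0
    if 0 in coalition and 2 in coalition:
        value += 4.0  # interaction between features 0 and 2
    return value

phi = exact_shapley(toy_value, 3)
print(phi)
```

In this toy case the interaction term is split evenly between features 0 and 2, and the attributions sum to the value of the full coalition minus the value of the empty one (the efficiency property). For a neural network, the value function would instead be a model output with absent features marginalized, which is where the cost becomes prohibitive.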

2605.24079 2026-05-26 cs.SE cs.AI cs.CL 版本更新

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

TRACER: 一种用于代码大语言模型中细粒度污染检测的语义感知框架

Yifeng Di, Xuliang Huang, Tianyi Zhang

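Affiliations * Purdue University, West Lafayette, IN; University of Illinois Urbana-Champaign

AI summary Proposes TRACER, a framework that detects fine-grained data contamination in code LLMs using three levels of semantic overlap and a coarse-to-fine pipeline, reaching an F1 of 0.91 in fine-grained detection on its benchmark.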

Comments 21 pages, 2 figures, 15 tables

详情
AI中文摘要

数据污染是对模型评估可靠性的已知威胁。然而,在代码大语言模型(LLM)中,污染往往超出精确重复,这一问题仍未得到充分探索。我们提出了TRACER,一种用于细粒度代码污染检测的语义感知框架。TRACER使用三级语义重叠——功能相同、几乎相同和共享逻辑——对污染进行建模,并通过粗到细的流水线进行检测。我们还引入了首个细粒度代码污染检测基准,涵盖三个广泛使用的基准和三个具有代表性的后训练数据集。TRACER在多个LLM骨干网络上取得了强大且一致的性能,其中GPT-5在细粒度检测中F1分数达到0.91。在二分类设置中,TRACER的F1达到0.92,比现有方法高出42%-217%。我们进一步进行了消融研究和错误分析,以评估TRACER中各个组件的贡献。

英文摘要

Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap - Functionally Identical, Nearly Identical, and Shared Logic - and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42%-217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.

2605.24069 2026-05-26 cs.CR cs.AI 版本更新

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

当手册撒谎:评估LLM智能体MCP投毒攻击的现实基准

Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin, Biyu Zhou, Wenjie Xiao, Wantao Liu

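Affiliations * Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

AI summary Addresses Tool Description Poisoning (TDP) attacks on LLM agents that integrate external tools through the Model Context Protocol (MCP): introduces the MCP-TDP Security Benchmark with 32 realistic test cases, finds severe vulnerabilities across 8 mainstream LLMs, and proposes a Reactive Self-Correction defense.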

详情
AI中文摘要

使用工具的大型语言模型(LLM)智能体的兴起,通过模型上下文协议(MCP)等协议标准化,通过集成外部开放领域知识和工具,为LLM智能体解锁了前所未有的自主执行能力。然而,这种互操作性引入了一个针对智能体认知规划层的隐蔽攻击面。本文系统性地研究了工具描述投毒(TDP),一种新颖的语义攻击。在TDP中,恶意指令并非嵌入工具的可执行代码,而是隐蔽地注入其描述性元数据——即智能体依赖进行安全规划和决策的“手册”。为了严格系统地评估这一新兴威胁,我们引入了MCP-TDP安全基准。这个高保真沙箱环境包含32个跨越6个不同风险类别的真实测试用例。我们对8种主流LLM的评估揭示了严重漏洞,领先模型如GPT-4o在六个高风险场景中表现出近100%的攻击成功率(ASR)。此外,我们的发现表明,常见的提示护栏防御基本无效,并且可能适得其反(我们称之为“防火墙谬误”)。关键的是,我们还提出了一种防御机制:“反应性自我纠正”,即智能体在执行后自主检测并撤销其恶意行为。这项工作为TDP提供了第一个专门的安全基准,为保护高级智能体系统的认知和规划层提供了重要见解。

英文摘要

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

2605.24064 2026-05-26 cs.LG cs.AI 版本更新

Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion

超关系知识图谱上的生成式表示学习:基于掩码离散扩散

Jaejun Lee, Seheon Kim, Joyce Jiyoung Whang

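Affiliations * School of Computing; Department of AI Computing, KAIST, Daejeon, South Korea

AI summary Proposes KREPE, a generative representation learning method based on masked discrete diffusion for completing arbitrarily masked queries and generating facts on hyper-relational knowledge graphs, unifying link prediction and fact generation with state-of-the-art link prediction performance.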

Comments 28 pages, 16 figures, 18 tables, 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

超关系知识图谱(HKG)能有效表示复杂事实。在HKG中推断新知识是一个关键问题,但现有方法将其视为简单的链接预测,假设事实中几乎所有实体和关系已知,仅留单个空白待填充。然而,这种受限假设在现实场景中可能不成立,因为事实的多个甚至全部组成成分可能同时缺失。为弥补这一差距,我们引入一个称为事实生成的任务:从任意掩码查询生成有效超关系事实,即补全部分观察到的事实或从头生成事实。我们提出KREPE,这是首个用于HKG的生成式表示学习方法,通过掩码离散扩散学习以局部事实成分和HKG全局结构为条件的缺失成分概率分布。KREPE通过上下文消息传递建模事实内依赖,并通过聚合随机采样上下文建模事实间关联。KREPE在单一训练框架内无缝统一链接预测与事实生成,在标准HKG链接预测基准上达到最先进性能,并在生成新颖且正确事实方面超越基于LLM的基线方法。

英文摘要

Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem, current methods cast it as a simple link prediction, assuming that nearly all entities and relations within a fact are known, leaving only a single blank to be filled. However, this restricted assumption may not hold in real-world scenarios in which multiple, or even all, constituent components of a fact may be missing simultaneously. To bridge this gap, we introduce a task called fact generation: generating a valid hyper-relational fact from an arbitrarily masked query, i.e., completing a partially observed fact or generating a fact from scratch. We propose KREPE, the first generative representation learning method for HKGs that learns to model the probability distributions of missing components conditioned on the local fact components and global structure of HKGs via a masked discrete diffusion. KREPE models both the intra-fact dependencies by contextual message passing and inter-fact correlations by aggregating stochastically sampled contexts. KREPE seamlessly unifies link prediction and fact generation within a single training framework, achieving state-of-the-art performance on standard HKG link prediction benchmarks and outperforming LLM-based baselines in generating novel and correct facts.

2605.24062 2026-05-26 cs.LG cs.AI 版本更新

Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

基于人体通信的联邦学习用于体表边缘智能:综述、分类法与BODYFED-HBC调度示例

Koffka Khan

发表机构 * Department of Computing and Information Technology(计算与信息技术系) The University of the West Indies(西印度大学)

AI总结 本文综述了人体通信与联邦学习在可穿戴设备中的交叉领域,提出了一种区分体内、体中心、跨用户和临床云联邦学习部署的分类法,并引入BODYFED-HBC参考架构和调度算法以解决体信道感知的联邦学习问题。

详情
AI中文摘要

人体通信(HBC)是一种有前景的可穿戴体域网络物理层,因为它可以将通信局限在身体周围,并减轻传统无线电链路的负担。联邦学习(FL)是一种有前景的学习层,因为它可以减少生理和行为传感的原始数据集中化。然而,这两类文献之间的联系仍然薄弱:用于可穿戴设备的FL通常抽象通信层,而HBC研究通常抽象学习和模型更新流量。本文综述了HBC、无线体域网络、可穿戴FL、身体互联网隐私和边缘智能优化的交叉领域。我们提出了一种分类法,区分了体内、体中心、跨用户和临床云FL部署,并识别了体信道感知FL这一开放问题:即客户端选择、更新压缩和聚合由姿态相关的HBC链路、剩余能量、传感器内存和隐私风险控制的学习协议。为了使研究议程具体化,我们引入了BODYFED-HBC作为参考架构,并提供了优化公式和调度算法。我们进一步指定了一个可复现的模拟示例,该示例结合了公共可穿戴数据集和经验性的体耦合通信信号损耗模型。文章最后为工作在硬件层之上的计算机科学家提供了开放数据集、评估指标、局限性和研究方向。

英文摘要

Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication around the body and reduce the burden of conventional radio links. Federated learning (FL) is a promising learning substrate because it can reduce raw-data centralization for physiological and behavioral sensing. Yet these two literatures remain weakly connected: FL for wearables usually abstracts the communication layer, whereas HBC research usually abstracts learning and model-update traffic. This article surveys the intersection of HBC, wireless body-area networks, wearable FL, Internet-of-Bodies privacy, and edge-intelligence optimization. We propose a taxonomy that distinguishes intra-body, body-hub, cross-user, and clinical-cloud FL deployments, and we identify the open problem of body-channel-aware FL: learning protocols whose client selection, update compression, and aggregation are controlled by posture-dependent HBC links, residual energy, sensor memory, and privacy risk. To make the research agenda concrete, we introduce BODYFED-HBC as a reference architecture and provide an optimization formulation and scheduling algorithm. We further specify a reproducible simulation vignette that combines public wearable datasets with empirical body-coupled-communication signal-loss models. The article concludes with open datasets, evaluation metrics, limitations, and research directions for computer scientists working above the hardware layer.

2605.24058 2026-05-26 cs.LG cs.AI 版本更新

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

符号胜过浮点:面向设备端微调的低秩双二值适配器

Yoshihiko Fujisawa, Yuma Ichikawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa

发表机构 * Fujitsu Limited(富士通株式会社) Institute of Science Tokyo(东京科学大学) RIKEN Center for AIP(理化学研究所革新智能统合研究中心) Tokai University(东海大学)

AI总结 提出LoRDBA,一种用二值符号载波和通道级缩放替代低秩因子的LoRA兼容适配器,将适配器占用减少10倍以上,在匹配模型大小下优于低比特基线,并在部分场景下匹配fp16 LoRA的质量;代价是未合并模式下至多8%的预填充延迟开销,以及约为fp16 LoRA 1.6倍的训练内存。

Comments 34 pages, 3 figures

详情
AI中文摘要

大型语言模型的设备端适配通常保持量化基模型冻结,同时训练和部署一个小型任务特定的LoRA适配器。然而,在未合并的适配器模式下,适配器不仅仅是一个紧凑的存储模块;它引入了一个额外的密集浮点分支,维护可训练状态以进行本地更新,并充当通信和热交换单元。我们提出LoRDBA,一种LoRA兼容的适配器,它将两个低秩因子替换为二值符号载波,同时通过轻量级的通道级缩放表示幅度,将密集适配器分支转换为两个符号累积矩阵乘法,中间穿插通道级缩放。有限样本分析表明,重建质量由原始LoRA因子的残差与幅度之比决定。在适配器模式实验中,LoRDBA在匹配模型大小的情况下优于低比特基线,并在某些场景下匹配fp16 LoRA的质量。尽管适配器占用减少了超过10倍,未合并的适配器在匹配秩r=16时最多引入8%的预填充延迟开销,训练内存开销约为fp16 LoRA的1.6倍。

英文摘要

On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and hot-swapping. We introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite an over 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
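The "two sign-accumulation matrix multiplications interleaved with channel-wise scaling" can be pictured with a small NumPy sketch. This is only an illustration under stated assumptions: the abstract does not say where the scales sit or how they are trained, so the sketch simply binarizes given LoRA factors after the fact, with one scale per row taken as the mean absolute value (the least-squares optimal scale for a fixed sign pattern). The names `binarize`, `sign_A`, `scale_A`, and the dimensions are hypothetical.

```python
import numpy as np

def binarize(factor, axis):
    """Split a factor into a +/-1 sign carrier and per-channel scales.

    Assumption: the scale is the mean absolute value along `axis`, which
    minimizes the squared error for a fixed sign pattern.
    """
    sign = np.where(factor >= 0, 1.0, -1.0)
    scale = np.abs(factor).mean(axis=axis, keepdims=True)
    return sign, scale

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 16          # hypothetical sizes; r=16 as in the abstract
A = rng.normal(size=(r, d_in))       # stand-in for the LoRA down-projection
B = rng.normal(size=(d_out, r))      # stand-in for the LoRA up-projection
x = rng.normal(size=(d_in,))

sign_A, scale_A = binarize(A, axis=1)   # one scale per rank channel
sign_B, scale_B = binarize(B, axis=1)   # one scale per output channel

# Two sign-only matmuls, each followed by a channel-wise rescaling.
h = (sign_A @ x) * scale_A[:, 0]
y_binary = (sign_B @ h) * scale_B[:, 0]

# Dense fp reference branch, for comparison only.
y_dense = B @ (A @ x)
rel_err = np.linalg.norm(y_binary - y_dense) / np.linalg.norm(y_dense)
```

On random Gaussian factors the relative error is large; the abstract's finite-sample analysis ties reconstruction quality to the residual-to-magnitude ratio of the original factors, which is exactly what `rel_err` probes here.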

2605.24057 2026-05-26 cs.LG cs.AI 版本更新

Feature Lottery? A Bifurcation Theory of Concept Emergence

特征彩票?概念涌现的分岔理论

Fuming Yang

发表机构 * MIT(麻省理工学院)

AI总结 提出一种基于分岔理论的方法,通过损失Hessian驱动的超临界叉形分岔检测表示动力学中的结构涌现,并引入无标签相位坐标β/β_c,在多种设置下验证了四个不同的转变阶段,揭示了特征可解释性的早期可预测性。

详情
AI中文摘要

神经网络在训练过程中的特定时刻获得结构化表示,然而识别这些转变通常依赖于回顾性的、基于标签的指标。我们引入了一种表示动力学的分岔理论来实时检测这些时刻。通过分析附加在演化编码器上的被动高斯混合模型探针,我们展示了结构的开始对应于由损失Hessian驱动的超临界叉形分岔。系统表现出一个理论上可预测的过零点(β_c),与网络当前状态(β)相比,产生一个动态比率β(t)/β_c(t):一个通用的、无标签的表示动力学相位坐标,完全可以从隐藏状态计算得出。我们在不同设置下实证验证了该坐标预测的四个不同转变阶段:语言模型(Pythia)上的稀疏自编码器、自监督学习(CIFAR)和grokking(模算术)。关键的是,在有限耗散下,宏观对称性破缺可能滞后于初始过零点数个数量级,这为grokking中观察到的延迟逃逸提供了严格的动力学解释。微观上,分岔产生了一个共享的不稳定子空间,迫使集体对称性破缺。我们将其称为稀疏自编码器训练中的“特征彩票”:一个特征的最终可解释性变得惊人地早期可预测。仅在训练5%时,早期原子纯度就能稳健地预测最终收敛纯度,其中前十百分位的早期原子在收敛时的纯度比基线高出12倍以上。除了解释概念涌现外,β/β_c还为训练健康提供了实用的早期预警指标,在下游指标反应之前检测到可用结构的出现、特征身份的结晶以及表示崩溃的时期。

英文摘要

Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing ($β_c$) that, compared to the network's current state ($β$), yields a dynamic ratio $β(t)/β_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the "feature lottery" in SAE training: a feature's terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $β/β_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.
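For readers unfamiliar with the term, a supercritical pitchfork bifurcation has the textbook normal form dx/dt = mu*x - x^3: for mu < 0 the symmetric state x = 0 is stable, and for mu > 0 it loses stability to two branches at x = ±sqrt(mu). The sketch below integrates that normal form and is not the paper's GMM-probe dynamics; `mu` is a generic control parameter and no mapping to β/β_c is implied. It also shows why escape can lag the zero-crossing: just past the bifurcation the growth rate is small, so a tiny perturbation takes a long time to become macroscopic.

```python
def simulate_pitchfork(mu, x0=1e-6, dt=0.01, steps=200_000):
    """Euler-integrate dx/dt = mu*x - x**3 from a small perturbation x0.

    Returns the final state and the first step at which |x| exceeds half
    of the stable branch value sqrt(mu) (None if that never happens).
    """
    x = x0
    escape_step = None
    target = abs(mu) ** 0.5
    for step in range(steps):
        x += dt * (mu * x - x ** 3)
        if mu > 0 and escape_step is None and abs(x) > 0.5 * target:
            escape_step = step
    return x, escape_step

x_below, _ = simulate_pitchfork(mu=-1.0)           # before the bifurcation
x_near, escape_near = simulate_pitchfork(mu=0.01)  # just past it: slow escape
x_far, escape_far = simulate_pitchfork(mu=1.0)     # well past it: fast escape
```

Both supercritical runs end on the sqrt(mu) branch, but the run with the smaller `mu` needs far more steps to leave the neighborhood of zero.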

2605.24055 2026-05-26 cs.LG cs.AI 版本更新

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

Cascade-KDE:面向分布外脉冲损坏的鲁棒时间序列恢复

Yuefeng Liu, Ning Yang, Ziyu Yang

发表机构 * School of Digital and Intelligent Industry (School of Cyber Science and Technology)(数字与智能产业学院(网络科学与技术学院)) Inner Mongolia University of Science and Technology(内蒙古科技大学)

AI总结 提出Cascade-KDE无训练框架,通过二维密度估计、密度截断鲁棒期望和指数级联自适应停止,在保留局部结构的同时鲁棒恢复被高斯噪声和脉冲异常损坏的时间序列。

详情
AI中文摘要

工业传感、医疗和能源系统中的真实世界时间序列数据通常被高斯噪声和偶尔的大幅度脉冲异常值混合污染。对于依赖局部形状的任务,如心电图形态分析和电池退化监测,主要要求不仅是低重建误差,还要保留导数峰值和任务关键特征。我们提出了Cascade-KDE,一种用于损坏时间序列的无训练恢复框架。该方法首先估计二维时间-幅度密度,然后应用密度截断鲁棒期望来限制远处异常点的影响,最后通过具有自适应停止的指数级联细化序列。该设计旨在提高在分布外脉冲损坏下的鲁棒性,同时使恢复轨迹接近原始局部结构。在多个基准数据集上,所提方法在曲线保真度、导数保留、下游分类和运行时效率方面相比经典滤波器和代表性学习基线表现出一致的改进。这些结果表明,基于有界密度的恢复是噪声时间序列流程中保留特征预处理的实用选择。

英文摘要

Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG morphology analysis and battery degradation monitoring, the main requirement is not only low reconstruction error but also preservation of derivative peaks and task-critical features. We propose Cascade-KDE, a training-free restoration framework for corrupted time series. The method first estimates a two-dimensional temporal-amplitude density, then applies a Density-Truncated Robust Expectation to limit the influence of distant abnormal points, and finally refines the sequence through an exponential cascade with adaptive stopping. This design aims to improve robustness under out-of-distribution impulse corruptions while keeping the restored trajectory close to the original local structure. Across several benchmark datasets, the proposed method shows consistent gains over classical filters and representative learning-based baselines on curve fidelity, derivative preservation, downstream classification, and runtime efficiency. These results suggest that bounded density-based restoration is a practical option for feature-preserving preprocessing in noisy time-series pipelines.
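A rough sketch of the first two steps (two-dimensional temporal-amplitude density, then a density-truncated expectation) is given below. It is a simplified illustration, not the authors' algorithm: the Gaussian kernels, the bandwidths `h_t` and `h_a`, the median-based truncation rule `keep_frac`, and the function name are all assumptions, and the exponential cascade with adaptive stopping is omitted.

```python
import numpy as np

def density_truncated_restore(y, h_t=2.0, h_a=0.5, keep_frac=0.5):
    """Restore a 1-D series by ignoring samples in low-density regions.

    1. Estimate a 2-D (time, amplitude) kernel density at every sample.
    2. Drop samples whose density is below keep_frac * median density.
    3. Replace each value by a time-kernel-weighted mean of kept samples.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    dt2 = (t[:, None] - t[None, :]) ** 2
    da2 = (y[:, None] - y[None, :]) ** 2
    density = np.exp(-0.5 * dt2 / h_t**2 - 0.5 * da2 / h_a**2).sum(axis=1)
    keep = density >= keep_frac * np.median(density)
    weights = np.exp(-0.5 * dt2 / h_t**2) * keep[None, :]
    return (weights @ y) / weights.sum(axis=1)

# Toy example: a clean sine wave with one large impulse outlier.
clean = np.sin(2 * np.pi * np.arange(100) / 50)
corrupted = clean.copy()
corrupted[30] += 10.0
restored = density_truncated_restore(corrupted)
```

An isolated impulse has almost no neighbors in amplitude, so its density is low and it is excluded from every weighted mean, including its own. Note that this sketch also smooths the clean samples; preserving derivative peaks, which the abstract emphasizes, would need the refinement stages that are left out here.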

2605.24053 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

打破概率的锁链:中智逻辑作为大型语言模型中认知不确定性的新框架

Maikel Yelandi Leyva-Vázquez, Florentin Smarandache

发表机构 * Universidad Bolivariana del Ecuador, Coordinación Académica de Posgrado(厄瓜多尔玻利瓦尔大学研究生学术协调处) Universidad de Guayaquil(瓜亚基尔大学) Universidad Bernardo O’Higgins(贝尔纳多·奥希金斯大学) Mathematics, Physics, and Natural Sciences Division, University of New Mexico(新墨西哥大学数学、物理与自然科学部)

AI总结 本文提出使用中智逻辑(Truth、Indeterminacy、Falsity三个独立维度)替代传统概率框架,通过实验发现该框架能更丰富地表示LLM的内部状态,并在35%的评估中自发出现超真状态,为透明、可靠和伦理感知的AI系统提供关键步骤。

Comments Published in Neutrosophic Sets and Systems, Vol. 99 (2026). Author's preprint version. Open code and data available at: github.com/mleyvaz/neutrosophic-llm-logic

详情
Journal ref
Neutrosophic Sets and Systems, Vol. 99, 2026
AI中文摘要

大型语言模型(LLM)主要受概率框架支配,其中结果概率之和被约束为1。这种由Softmax层强加的结构限制导致不确定性崩溃,使得难以区分认知不确定性、悖论和模糊性。我们提出了一种中智逻辑应用的实证研究,该框架将真(T)、不确定(I)和假(F)视为三个独立维度,用于建模LLM中的认知状态。我们在四个OpenAI GPT模型家族上进行了实验,涵盖五种语言现象:逻辑悖论、认知无知、模糊性、伦理矛盾和未来偶然性,采用三种提示策略:中智、概率和熵衍生。我们的发现表明,中智方法通过允许T+I+F>1(我们称之为超真状态),提供了模型内部状态的更丰富表示。在35%的评估中,超真状态自发出现,主要出现在伦理矛盾和逻辑悖论下。我们证明,该方法在模糊上下文中保留了真值,并提供了一种稳健的方法来识别和量化内部模型冲突。我们得出结论,中智评估层的集成是迈向更透明、可靠和伦理感知的AI系统的关键一步。

英文摘要

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.
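The difference from a probability distribution is easy to state in code: a neutrosophic value keeps T, I, and F as independent components in [0, 1] with no constraint that they sum to one. The minimal sketch below encodes that and the "hyper-truth" condition T+I+F > 1 from the abstract. The class name, the tolerance, and the example scores are hypothetical and are not taken from the paper's data or code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NeutrosophicValue:
    truth: float
    indeterminacy: float
    falsity: float

    def __post_init__(self):
        # Each component is bounded on its own; the sum is unconstrained.
        for name in ("truth", "indeterminacy", "falsity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")

    @property
    def total(self):
        return self.truth + self.indeterminacy + self.falsity

    def is_hyper_truth(self, tol=1e-9):
        """True when T + I + F exceeds 1, which a probability vector cannot do."""
        return self.total > 1.0 + tol

# Illustrative scores only (not results from the paper).
paradox = NeutrosophicValue(truth=0.8, indeterminacy=0.3, falsity=0.6)
plain_fact = NeutrosophicValue(truth=0.7, indeterminacy=0.1, falsity=0.2)
```

Here `paradox.is_hyper_truth()` is true because strong evidence for both truth and falsity is recorded at once, whereas `plain_fact` behaves like an ordinary probability assignment.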

2605.24052 2026-05-26 cs.LG cs.AI 版本更新

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

移动众包中用于LLM微调的诚实在线偏好聚合

Shugang Hao, Lingjie Duan

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 针对移动众包中工人可能策略性谎报偏好反馈的问题,提出一种动态贝叶斯博弈模型和在线加权聚合机制,确保工人诚实反馈并实现次线性遗憾。

详情
AI中文摘要

为了更好地满足移动应用(如导航)中用户的需求,移动众包平台可以迭代地将大语言模型(LLM)生成的内容(例如,AI生成的交通状况预测)与从众包工人(例如,移动用户)收集的人类反馈进行对齐。然而,工人可能会策略性地谎报他们的在线偏好反馈,以最大化其影响力或报酬。移动众包中现有的流程(例如,基于EM的权重估计)无法在这种在线设置中识别出最准确的工人,导致在$T$个时隙上产生线性遗憾$\mathcal{O}(T)$。在本文中,我们研究了移动众包中用于LLM微调的诚实在线偏好聚合。我们建立了一个新的动态贝叶斯博弈来建模平台与策略性移动工人之间的多智能体在线学习过程。我们提出了一种新颖的在线加权聚合机制,该机制根据每个工人的反馈准确性动态调整其在偏好聚合中的权重。我们证明了我们的机制确保了策略性工人的诚实反馈,并在$T$个时隙上实现了次线性遗憾$\mathcal{O}(\sqrt{T})$。我们进一步将我们的机制扩展到每个时隙工人反馈有限的挑战性场景,仍然保证了次线性遗憾$\mathcal{O}(\sqrt{T})$。在真实世界数据集上进行的LLM微调实验进一步证明了我们的机制相对于基准方案的显著性能提升。

英文摘要

To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)-generated content (e.g., AI-generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM-based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots. In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi-agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker's weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$. Experiments on LLM fine-tuning with real-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.
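To make "dynamically adjusts each worker's weight according to their feedback accuracy" concrete, the sketch below uses the standard multiplicative-weights (Hedge) update, a well-known online learning rule with sublinear regret against the best fixed worker. It is a swapped-in stand-in, not the paper's mechanism: it offers no truthfulness guarantee, it assumes the correct outcome is revealed after every round, and the learning rate `eta` and the three simulated workers are hypothetical.

```python
import math

def update_weights(weights, reports, outcome, eta=0.5):
    """Shrink the weight of every worker whose binary report was wrong."""
    scaled = [w * math.exp(-eta * (r != outcome)) for w, r in zip(weights, reports)]
    total = sum(scaled)
    return [w / total for w in scaled]

def aggregate(weights, reports):
    """Weighted majority vote over binary reports (weights sum to 1)."""
    score = sum(w for w, r in zip(weights, reports) if r == 1)
    return 1 if score >= 0.5 else 0

weights = [1 / 3, 1 / 3, 1 / 3]
for t in range(10):
    outcome = t % 2
    # Worker 0 is always right, worker 1 always wrong, worker 2 always says 1.
    reports = [outcome, 1 - outcome, 1]
    weights = update_weights(weights, reports, outcome)

decision = aggregate(weights, [1, 0, 0])
```

After ten rounds nearly all weight sits on worker 0, so the aggregate follows that worker even when the other two disagree.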

2605.24050 2026-05-26 cs.SE cs.AI stat.AP 版本更新

More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

更多技能,更差智能体?扩展技能库时技能遮蔽降低性能

Hongwen Song, Wei Song

发表机构 * Databricks Inc.(Databricks公司)

AI总结 本文研究LLM智能体技能库扩展导致性能下降的现象,提出将性能下降分解为技能遮蔽和上下文开销两种效应,并通过实验证明技能遮蔽是主要瓶颈。

详情
AI中文摘要

技能库允许LLM智能体按需加载任务特定指令,使非专家用户能够通过自然语言解决领域特定任务,而无需知道存在哪些技能或它们如何工作。然而,随着技能库的增长,性能会下降——当从一组已知有用的小技能扩展到包含202个技能的库时,性能下降高达21%。在这项工作中,我们将这种性能下降定义为从加载已知有用技能库到加载完整技能库之间的通过率下降。此外,我们提出通过条件化技能调用——即智能体在轨迹中选择哪些技能——将通过率下降分解为两种效应:技能遮蔽,即随着技能库扩展,智能体更频繁地选择错误技能;以及上下文开销,即即使选择正确,扩大的上下文也会降低执行性能。我们推导了这两种效应的上界,以表征它们对通过率下降的影响程度。我们对效应及其上界的经验估计均表明,技能遮蔽效应随技能库大小增长,并对性能下降有显著贡献,而上下文开销效应仍然很小且与零无显著差异。这种观察到的非对称性表明,技能选择失败(而非上下文扩大)是扩展技能库时的主要瓶颈。

英文摘要

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow, by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and loading the full library. Moreover, we propose to decompose the pass rate drop by conditioning on skill invocation (which skills the agent selects during a trajectory) into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize how much each contributes to the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and contributes significantly to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.
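One way to read the decomposition is through the law of total probability over whether the agent selected the right skill. The sketch below works this out under a simplifying assumption that is not stated in the abstract: with the small known-helpful library the agent always selects correctly. The paper's exact estimator and bounds may differ, and all function names, argument names, and numbers are hypothetical.

```python
def decompose_pass_rate_drop(pass_small, p_wrong, pass_full_correct, pass_full_wrong):
    """Split the pass rate drop into context overhead and skill shadowing.

    pass_small        : pass rate with the small known-helpful library
    p_wrong           : probability of selecting a wrong skill with the full library
    pass_full_correct : pass rate with the full library, given correct selection
    pass_full_wrong   : pass rate with the full library, given wrong selection
    """
    pass_full = (1 - p_wrong) * pass_full_correct + p_wrong * pass_full_wrong
    # Loss that remains even when the right skill is chosen.
    context_overhead = pass_small - pass_full_correct
    # Loss caused by choosing the wrong skill, weighted by how often it happens.
    skill_shadowing = p_wrong * (pass_full_correct - pass_full_wrong)
    return {
        "pass_full": pass_full,
        "drop": pass_small - pass_full,
        "context_overhead": context_overhead,
        "skill_shadowing": skill_shadowing,
    }

# Hypothetical numbers, chosen only to illustrate the arithmetic.
result = decompose_pass_rate_drop(
    pass_small=0.80, p_wrong=0.30, pass_full_correct=0.78, pass_full_wrong=0.20
)
```

The two terms add up to the total drop by construction, so in this toy case the drop of about 0.194 splits into roughly 0.02 of context overhead and 0.174 of skill shadowing, mirroring the asymmetry the abstract reports.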

2605.24048 2026-05-26 cs.LG cs.AI 版本更新

Mixture of Complementary Agents for Robust LLM Ensemble

互补代理混合:鲁棒的大语言模型集成

Yichi Zhang, Kevin Lu, Yuang Zhang, Jie Gao, Lirong Xia, Fang-Yi Yu

发表机构 * DIMACS, Rutgers University(罗格斯大学DIMACS研究中心) Department of Mathematics, Rutgers University(罗格斯大学数学系) Department of Computer Science, George Mason University(乔治·梅森大学计算机科学系) Department of Computer Science, Rutgers University(罗格斯大学计算机科学系)

AI总结 将大语言模型选择视为组合选择问题,提出基于互补性的贪心选择算法,在性能与成本间取得最佳平衡。

详情
AI中文摘要

多AI协作,例如集成或辩论大语言模型(LLMs),是一种有前景的聚合信息和提升性能的范式。这些流程的基础步骤是将多个提议LLM的响应输入到一个总结LLM中,后者合成一个更好的答案。然而,选择哪些提议者并非易事。现有方法主要关注准确性(选择最强模型)或多样性(确保多样性),并且常常忽视提议者之间以及与总结者之间的交互。我们将提议者选择重新定义为类似于特征选择的组合选择问题,其中LLM的价值在于其与其他模型的互补性。然而,由于时间复杂度过高,直接应用标准特征选择算法在LLM场景中不切实际。受此限制,我们探索了一系列计算可行的贪心式选择算法,这些算法使用少量标记集评估互补性。我们的实验验证了互补性作为提议者选择的指导原则,并确定了在实践中实现最佳性能-成本权衡的方法。

英文摘要

Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.
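The contrast between picking the strongest models and picking complementary ones shows up even in a toy greedy forward selection. The sketch below is a generic greedy routine, not one of the specific algorithms evaluated in the paper. The scoring function is an assumption: it counts how many questions in a small labeled set are answered correctly by at least one selected proposer, standing in for the far more expensive step of running the summarizer LLM on each candidate subset. The proposer names and the `solved` table are hypothetical.

```python
def greedy_select(candidates, score_fn, budget):
    """Add, one at a time, the candidate with the largest marginal gain."""
    selected = []
    best_score = score_fn(selected)
    while len(selected) < budget:
        gains = {
            c: score_fn(selected + [c]) - best_score
            for c in candidates
            if c not in selected
        }
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # no remaining candidate adds anything
        selected.append(best)
        best_score += gains[best]
    return selected

# Hypothetical labeled set: which question ids each proposer answers correctly.
solved = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {5, 6}}

def coverage(subset):
    """Number of labeled questions covered by at least one selected proposer."""
    covered = set()
    for name in subset:
        covered |= solved[name]
    return len(covered)

chosen = greedy_select(list(solved), coverage, budget=2)
```

Proposer "b" is the second strongest on its own, yet greedy selection skips it in favor of "c", whose correct answers do not overlap with those of "a".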

2605.24045 2026-05-26 cs.LG cs.AI 版本更新

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

大规模数据集与基准:蛋白质-配体模型学习的是结合位点还是仅仅结合可能性?

Zhaohan Meng, Zhen Bai, Ke Yuan, Iadh Ounis, Zaiqiao Meng, Hao Xu, Joseph Loscalzo

发表机构 * School of Computing Science(计算科学学院) School of Cancer Sciences(癌症科学学院) School of Life Science and Technology(生命科学与技术学院) Institute of Science Tokyo(东京科学大学) Cancer Research UK Scotland Institute(英国癌症研究会苏格兰研究所) Language Technology Lab(语言技术实验室) Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School(哈佛医学院布莱根妇女医院内科) The Broad Institute of MIT and Harvard(麻省理工学院和哈佛大学博德研究所)

AI总结 针对现有基准无法评估模型是否定位结合位点的问题,提出包含约10万对蛋白质-配体的InteractBind数据集和细粒度基准,通过结合位点定位任务揭示模型在强二元预测下定位能力有限。

Comments Under Review for the NeurIPS 2026 Conference, Track on Evaluations and Datasets

详情
AI中文摘要

蛋白质-配体建模是计算药物发现和分子设计的基础。现有的蛋白质-配体基准通常通过二元结合预测和亲和力回归等任务评估蛋白质与配体是否相互作用以及结合强度。然而,这些评估提供的证据有限,无法判断模型是否能够定位结合位点或识别分子识别背后的非共价相互作用。为填补这一空白,我们引入了InteractBind,一个大规模蛋白质-配体数据集,包含约10万对蛋白质-配体对,以及一个用于细粒度评估的基准。核心细粒度任务是结合位点定位,它利用覆盖六种主要非共价相互作用类型的蛋白质残基和配体原子相互作用图,评估模型导出的相互作用图是否能够定位结合位点。InteractBind还包含结合亲和力和蛋白质相似性控制的分割,以支持现实的泛化评估。使用InteractBind,我们评估了八个现有的基于序列和交互感知的模型,评估了二元结合预测和结合位点定位。结果显示,尽管二元结合预测表现强劲,但结合位点定位能力有限,且在不同非共价相互作用类型间存在显著差异。总体而言,InteractBind建立了一个基准范式,鼓励开发更具可解释性和物理基础的蛋白质-配体模型。

英文摘要

Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate whether a protein and ligand interact and how strongly they bind, through tasks such as binary binding prediction and affinity regression. However, these evaluations provide limited evidence of whether models can localize binding sites or identify the non-covalent interactions underlying molecular recognition. To address this gap, we introduce InteractBind, a large-scale protein-ligand dataset comprising approximately 100k protein-ligand pairs, together with a benchmark for fine-grained evaluation. The core fine-grained task is that of binding-site localization, which uses protein-residue and ligand-atom interaction maps spanning six major types of non-covalent interactions to assess whether model-derived interaction maps localize binding sites. InteractBind further includes binding affinity and protein similarity-controlled splits to support realistic generalization assessment. Using InteractBind, we evaluate eight existing sequence-based and interaction-aware models, assessing binary binding prediction and binding-site localization. Results reveal limited binding-site localization despite strong binary binding prediction, with marked variation across non-covalent interaction types. Overall, InteractBind establishes a benchmark paradigm that encourages the development of more interpretable and physically grounded protein-ligand models.

2605.24043 2026-05-26 cs.LG cs.AI 版本更新

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

LLM-AutoSciLab:通过LLM主动实验进行闭环科学发现

Sanchit Kabra, Nikhil Abhyankar, Saaketh Desai, Prasad Iyer, Chandan K Reddy

发表机构 * Virginia Tech(弗吉尼亚理工大学) Sandia National Laboratories(桑迪亚国家实验室)

AI总结 提出LLM-AutoSciLab闭环框架,通过假设生成与实验选择迭代优化,在预算约束下实现主动数据采集,在三个基准上优于现有方法且样本效率提升2-5倍。

详情
AI中文摘要

科学发现是一个闭环过程,其中假设指导数据采集,观察结果细化假设空间。然而,大多数方法将发现简化为对固定数据集的监督学习,其中有限的观察可能支持多种局部拟合但无法泛化的合理机制。因此,关键挑战在于选择信息丰富的观察以消除不确定性,将焦点从静态推断转向自适应数据采集。为此,我们提出LLM-AutoSciLab,一个将假设生成与假设条件实验选择和机制细化相结合的闭环框架。LLM-AutoSciLab不是将模型拟合到被动收集的数据,而是迭代地提出合理的假设,选择信息丰富的实验来区分或细化它们,并使用由此产生的证据更新其状态。为了评估具有主动数据采集的动态闭环科学发现,我们引入了ActiveSciBench,包含两个数据集:包含57个酶动力学任务的ActiveSciBench-Chem和包含45个基因调控网络任务的ActiveSciBench-GRN。这些数据集将发现建模为预算约束过程,需要自适应实验设计、变量选择和真实机制的恢复。在NewtonBench、ActiveSciBench-Chem和ActiveSciBench-GRN上,LLM-AutoSciLab优于先前方法,在NewtonBench和ActiveSciBench-Chem上分别达到67.6%和35.1%的符号准确率,在ActiveSciBench-GRN上达到31.1%的精确图恢复。此外,假设引导的实验比最强竞争基线样本效率高2-5倍。代码和数据可在https://github.com/scientific-discovery/LLM-AutoSciLab获取。

英文摘要

Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM-AutoSciLab, a closed-loop framework that couples hypothesis generation with hypothesis-conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench-Chem with 57 enzyme-kinetics tasks and ActiveSciBench-GRN with 45 gene-regulatory-network tasks. These datasets model discovery as a budget-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench-Chem, and ActiveSciBench-GRN, LLM-AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench-Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench-GRN. Moreover, hypothesis-guided experimentation is 2-5x more sample-efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific-discovery/LLM-AutoSciLab

2605.24037 2026-05-26 cs.CV cs.AI 版本更新

Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

模式即序列:将多模态运动预测转化为统一序列模式建模

Zikang Zhou, Haibo Hu, Xinhong Chen, Yifan Zhang, Nan Guan, Yung-Hui Li, Chun Jason Xue, Jianping Wang

发表机构 * City University of Hong Kong(香港城市大学) City University of Hong Kong (Dongguan)(香港城市大学(东莞)) Hon Hai Research Institute(鸿海研究院) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出Mode-as-Sequence框架,将无序模式集转化为有序模式序列并显式建模模式间依赖,通过ModeSeq和Parallel ModeSeq两种实例化方法解决多模态运动预测中的模式坍塌和置信度排序问题,在Waymo数据集上取得领先性能。

详情
AI中文摘要

多模态运动预测本质上是欠监督的:每个训练场景只提供一个已实现的未来,但存在多个合理的未来。这种稀疏监督通常会导致模式坍塌(冗余假设和模式覆盖不足)以及在预测少量轨迹时置信度排序不可靠。我们提出Mode-as-Sequence,一个统一的解码框架,将无序模式集转化为有序模式序列,并显式建模模式间依赖。在该框架下,我们开发了两种互补的实例化方法。ModeSeq执行循环模式解码,每个模式基于先前生成的模式生成,鼓励多样化、非冗余的假设,并具有校准的置信度排序。为了消除逐模式自回归瓶颈,我们进一步提出Parallel ModeSeq,它使用掩码模式间自注意力保留相同的因果依赖,同时在前向传播中一次性解码所有模式,从而实现高效的大K推理和可扩展的联合场景预测。为了在稀疏标签下学习代表性模式和校准的置信度,我们引入了Early-Match-Take-All (EMTA)及其联合场景扩展MA-EMTA,以及一个轻量级的排序正则化器,以减少置信度反转。在大型基准上的大量实验表明,在数据集、预测时长和对象类型上,排序导向指标和最佳K准确率均有一致提升。在Waymo开放数据集挑战中,ModeSeq在2024年无激光雷达运动预测赛道获得第一名,Parallel ModeSeq在2025年交互预测挑战赛中获得第一名,验证了Mode-as-Sequence在准确性和效率上的有效性。

英文摘要

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
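The idea behind decoding all modes in one pass while keeping the sequential dependency is the same causal masking used in autoregressive Transformers: mode i may attend to modes 1..i but not to later ones. The NumPy sketch below shows only that masking pattern. It is an assumption-laden illustration with no learned query, key, or value projections, no scene context, and a hypothetical function name, so it should not be read as the Parallel ModeSeq architecture.

```python
import numpy as np

def causal_mode_attention(modes):
    """Self-attention over K mode embeddings with a causal (lower-triangular) mask.

    modes: array of shape (K, d), ordered as the mode sequence.
    Returns the attended embeddings (K, d) and the attention weights (K, K).
    """
    K, d = modes.shape
    scores = modes @ modes.T / np.sqrt(d)
    future = np.triu(np.ones((K, K), dtype=bool), k=1)   # True where j > i
    scores = np.where(future, -np.inf, scores)           # block later modes
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ modes, weights

rng = np.random.default_rng(0)
mode_embeddings = rng.normal(size=(6, 8))   # K=6 modes, d=8 (hypothetical sizes)
attended, attn = causal_mode_attention(mode_embeddings)
```

All six rows are computed in a single matrix operation, yet row i of `attn` has zero weight on every later mode, which is the property that lets a parallel decoder mimic mode-by-mode generation.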

2605.24034 2026-05-26 q-bio.GN cs.AI 版本更新

WTKO-CNN: Deep Learning Reveals Sequence Motifs Distinguishing Wild-Type and Knockout ATAC-seq Peaks

WTKO-CNN:深度学习揭示区分野生型和敲除ATAC-seq峰值的序列基序

Lopamudra Dey

AI总结 提出带注意力机制的卷积神经网络WTKO-CNN,通过分类DNA序列并利用显著性图提取k-mer基序,发现区分野生型与敲除状态的转录因子结合序列特征。

详情
AI中文摘要

染色质调控因子可以通过修改调控DNA元件的可及性来改变转录程序。理解野生型(WT)和敲除(KO)条件下调控序列的差异对于破译转录控制至关重要。在这里,我们应用了一个带有注意力机制的卷积神经网络WTKO-CNN对DNA序列进行WT或KO分类,实现了高预测性能。为了解释模型,我们生成了显著性图,以识别对分类决策最有影响的核苷酸位置。从这些高显著性区域中,我们提取并聚类了k-mer,从而实现了从头基序发现。从CNN滤波器导出的序列标识和共有基序揭示了有生物学意义的模式,并通过MEME、TOMTOM和HOMER与已知转录因子结合位点进行进一步验证。我们的分析识别了与区分WT和KO序列的转录因子家族相关的基序,表明CNN引导的显著性图是揭示功能序列特征的有力方法。

英文摘要

Chromatin regulators can alter transcriptional programs by modifying the accessibility of regulatory DNA elements. Understanding how regulatory sequences differ between wild-type (WT) and knockout (KO) conditions is crucial for deciphering transcriptional control. Here, we applied WTKO-CNN, a convolutional neural network with an attention mechanism, to classify DNA sequences as WT or KO, achieving high predictive performance. To interpret the model, we generated saliency maps to identify nucleotide positions most influential for the classification decision. From these high-saliency regions, we extracted and clustered k-mers, enabling de novo motif discovery. Sequence logos and consensus motifs derived from the CNN filters revealed biologically meaningful patterns, which were further validated using MEME, TOMTOM, and HOMER against known transcription factor binding sites. Our analysis identified motifs associated with transcription factor families that discriminate WT from KO sequences, demonstrating that CNN-guided saliency mapping is a powerful approach for uncovering functional sequence features.

2605.24020 2026-05-26 cs.CV cs.AI 版本更新

Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

理解视觉与语言信息并与人类及环境交互的机器智能

Van Quang Nguyen

发表机构 * System Information Sciences(信息科学系)

AI总结 本文提出GRIT、LTMI和两阶段指令解释框架,分别改进图像描述、视觉对话和交互式指令跟随任务,在准确性和效率上取得领先结果。

Comments Doctoral dissertation, Tohoku University, 2022. Uploaded for archival purposes. 146 pages

详情
AI中文摘要

计算机视觉与自然语言处理交叉领域的进展对于辅助技术、多媒体查询和机器人等应用至关重要。本论文提出了新颖的架构,以改进智能体在三个关键视觉-语言任务上的表现:图像描述、视觉对话和交互式指令跟随。 首先,我们解决了图像描述中视觉表示的局限性。传统模型依赖CNN检测器提取的区域特征,缺乏全局上下文且计算开销大。我们提出GRIT(基于网格和区域的图像描述Transformer),一种纯Transformer架构。通过使用基于DETR的检测器整合网格和区域特征,GRIT实现了端到端训练,并在推理准确性和速度上均优于先前方法。 其次,我们处理视觉对话,这需要对图像进行多轮对话。挑战在于高效建模多个输入(图像、问题、历史)之间的交互。我们引入LTMI(轻量级多输入Transformer)。利用专门的注意力块,LTMI层在VisDial数据集上验证,其表示能力与标准Transformer扩展相当,但参数不到其十分之一。 最后,我们使用ALFRED数据集研究具身AI的交互式指令跟随。我们提出一个包含两阶段指令解释的框架:首先独立于视觉上下文解码语言指令以预测暂定的动作-对象序列,然后与视觉特征融合以最终执行。通过使用多个自我中心视图和分层注意力,我们的方法准确定位对象,并实现了8.37%的最新未见成功率。

英文摘要

Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following. First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and suffer from high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both inference accuracy and speed. Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling interactions between multiple inputs (image, question, history). We introduce LTMI (Light-weight Transformer for Many Inputs). Utilizing a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while utilizing less than one-tenth of its parameters, as validated on the VisDial dataset. Finally, we study interactive instruction-following for embodied AI using the ALFRED dataset. We propose a framework featuring a two-stage instruction interpretation: it first decodes language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art unseen success rate of 8.37%.

2605.24018 2026-05-26 cs.AI cs.MA 版本更新

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

EvoSci: 一种受生物启发的多智能体框架用于科学发现的演化

Xiaoyu Xiong, Yuqi Ren, Deyi Xiong

发表机构 * TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China(天津大学计算机科学与技术学院 TJUNLP 实验室)

AI总结 提出EvoSci框架,结合生物启发式演化与知识图谱建模,通过多角色智能体协作迭代生成、评估和优化研究想法,显著提升科学探索的连贯性和创造力。

Comments ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)在科学发现中展现出强大潜力,但现有方法在研究工作流设计和多角色协作机制方面仍面临重大挑战。为解决这些问题,我们提出了EvoSci,一个多智能体科学协作框架,它整合了受生物启发的演化与知识图谱建模。为了迭代生成、评估和完善研究想法,EvoSci包含了多个基于角色的智能体,包括导师、研究者和评审者。通过结合协作推理、共享记忆和演化反馈,EvoSci显著增强了科学探索的连贯性和创造力。在真实研究主题上的实验表明,EvoSci在基于LLM的结构化同行评审和比较排名评估中显著优于强基线,获得了最高的整体同行评审分数(ICLR 4.90)和最高排名(Top-10 = 54)。这些结果表明其在科学想法生成和持续发现方面的优越性。

英文摘要

Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework that integrates bio-inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

2605.24016 2026-05-26 cs.AR cs.AI 版本更新

SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

SA-Kura: 用于扩散采样中局部耦合Kuramoto漂移的节能脉动阵列加速器

Jeongmin Jin, Kyeongwon Lee, Mundo Jeong, Jongin Choi, Woojoo Lee

发表机构 * National Research Foundation of Korea(韩国国家研究基金会) Institute of Information & communications Technology Planning & Evaluation(信息通信技术规划与评估院)

AI总结 针对扩散采样中局部耦合Kuramoto漂移的计算瓶颈,提出首个专用数字脉动阵列加速器SA-Kura,通过重新公式化耦合计算实现高效脉动执行,相比软件和GPU分别实现193倍和6.57倍加速。

Comments 8 pages, 6 figures, 1 table; ACM/IEEE ISLPED 2026 accepted paper

详情
AI中文摘要

扩散推理在边缘部署中仍然成本高昂,但现有加速器几乎完全专注于分数网络,因为标准漂移仅仅是微不足道的线性缩放。Kuramoto定向扩散用局部耦合的相位相互作用取代了这种微不足道的漂移,提高了采样效率,但引入了新的硬件瓶颈:在每个反向步骤中评估的中心依赖非线性5x5模板。该内核难以映射到传统的CNN加速器和面向矩阵的引擎。我们提出了SA-Kura,据我们所知,这是第一个专用于局部耦合Kuramoto漂移的数字脉动阵列加速器。通过将成对正弦耦合重新表述为独立于中心相位的邻居累加,然后进行单个中心依赖的乘减组合,SA-Kura消除了PE内的超越函数单元,并实现了具有寄存器级复用的规则脉动执行。SA-Kura以可综合RTL实现,集成到基于RISC-V的轻量级SoC中,在FPGA上原型验证,并通过45nm CMOS综合和功耗分析进行评估。仅对于漂移内核,与同一SoC平台上处理器内核上相同内核的软件执行相比,SA-Kura分别将延迟和能耗降低了193倍和69.4倍。与独立的Jetson Orin Nano CUDA实现相同内核相比,它快6.57倍,并且每像素能耗降低约46.0倍。

英文摘要

Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 x 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pair-wise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193x and 69.4x, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and achieves approximately 46.0x lower energy per pixel.
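
The reformulation described above is consistent with the angle-difference identity sin(θ_j − θ_i) = sin θ_j · cos θ_i − cos θ_j · sin θ_i. The following is a minimal sketch, not the paper's kernel: it assumes the drift at a pixel is the plain sum of sin(θ_j − θ_i) over a 5 x 5 window with out-of-range neighbors skipped. The coupling strength, boundary handling, and function names are illustrative assumptions.

```python
import math

def drift_direct(theta, r, c, radius=2):
    """Center-dependent form: one sin() per (center, neighbor) pair."""
    h, w = len(theta), len(theta[0])
    center = theta[r][c]
    total = 0.0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                total += math.sin(theta[rr][cc] - center)
    return total

def drift_reformulated(sin_map, cos_map, r, c, radius=2):
    """Accumulate neighbor sin/cos sums without using the center phase,
    then apply a single multiply-subtract with the center's own sin/cos."""
    h, w = len(sin_map), len(sin_map[0])
    sum_sin = 0.0
    sum_cos = 0.0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                sum_sin += sin_map[rr][cc]
                sum_cos += cos_map[rr][cc]
    # sum_j sin(theta_j - theta_i) = cos(theta_i) * sum_sin - sin(theta_i) * sum_cos
    return cos_map[r][c] * sum_sin - sin_map[r][c] * sum_cos

# Hypothetical 7 x 7 phase field; sin/cos are computed once per pixel.
theta = [[0.37 * r - 0.51 * c for c in range(7)] for r in range(7)]
sin_map = [[math.sin(v) for v in row] for row in theta]
cos_map = [[math.cos(v) for v in row] for row in theta]
```

In this form the per-pixel sin and cos values are computed once and shared by every window that covers the pixel, which is one way to read the abstract's claim that transcendental units are not needed inside each processing element.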

2605.24004 2026-05-26 cs.AI cs.CV cs.LG cs.RO 版本更新

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

推理--想象--行动:基于世界模型的闭环LLM自动驾驶决策

Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu

发表机构 * 1 Department of Information Management, Peking University, Beijing 100871, China; 2 School of Intelligence Science Technology, Peking University, Beijing 100871, China; 3 State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing 100080, China; 4 Yuanpei College, Peking University, Beijing 100871, China; 5 China Agricultural University, Beijing, China; 6 CRSC Research & Design Institute Group Co., Ltd., Beijing, China

AI总结 提出Reason--Imagine--Act (RIA)闭环框架,结合LLM推理器与动作条件世界模型进行在线安全验证,在CARLA点目标协议下实现80.05%路线完成率、51.10%到达率和0.20%碰撞率。

Comments Accepted by the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). 8 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLM)在自动驾驶中具有潜力,但仅基于语义的决策策略可能在动态交通中产生物理上不安全的行为。现有方法要么在没有显式动力学验证的情况下进行在线语言推理,要么主要在离线流程中使用世界模型,在决策时语义意图与物理可行性之间存在差距。我们提出了Reason--Imagine--Act (RIA),一个闭环框架,将LLM推理器与动作条件世界模型耦合,用于在线安全验证。在每一步,LLM提出一个动作模板和候选子动作,世界模型执行短时域展开,安全评分器选择最安全的可执行动作并反馈给下一步推理。在统一的CARLA点目标协议(1000个回合)下,RIA实现了80.05%的路线完成率、51.10%的到达率和0.20%的碰撞率。在相同的闭环接口下,RIA在核心闭环指标上始终优于无训练基线,包括CARLA TM和MADA。为便于复现,代码可在https://github.com/pku-smart-city/source_code/tree/main/RIA获取。

英文摘要

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
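
The propose, imagine, select loop described above can be sketched in a toy one-dimensional car-following setting. Everything here is a hypothetical stand-in: `propose_candidates` replaces the LLM reasoner, `rollout` replaces the learned action-conditioned world model with simple kinematics, and the safety score is just the minimum gap to a lead vehicle. None of these names or values come from the RIA code base.

```python
def propose_candidates(feedback):
    """Stand-in for the LLM reasoner: candidate accelerations in m/s^2.
    Offers harder braking when the previous step reported a small gap."""
    if feedback is not None and feedback["min_gap"] < 10.0:
        return [-3.0, -1.0, 0.0]
    return [-1.0, 0.0, 1.0]

def rollout(state, accel, horizon=10, dt=0.1):
    """Toy world model: short-horizon kinematic rollout.
    Returns the minimum gap (m) to the lead vehicle over the horizon."""
    ego_pos, ego_speed, lead_pos, lead_speed = state
    min_gap = lead_pos - ego_pos
    for _ in range(horizon):
        ego_speed = max(0.0, ego_speed + accel * dt)
        ego_pos += ego_speed * dt
        lead_pos += lead_speed * dt
        min_gap = min(min_gap, lead_pos - ego_pos)
    return min_gap

def select_action(state, feedback=None):
    """Score every candidate by its imagined rollout and keep the safest.
    The returned dict is the feedback passed to the next reasoning step."""
    candidates = propose_candidates(feedback)
    scored = [(rollout(state, a), a) for a in candidates]
    min_gap, best = max(scored)
    return best, {"min_gap": min_gap}

# Ego at 0 m doing 10 m/s, lead vehicle 8 m ahead doing 5 m/s.
state = (0.0, 10.0, 8.0, 5.0)
action, feedback = select_action(state)
next_action, _ = select_action(state, feedback)
```

The point of the sketch is the control flow: the proposer never executes an action directly, and the feedback from the scorer changes what is proposed at the next step.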

2605.24002 2026-05-26 physics.chem-ph cond-mat.mtrl-sci cs.AI physics.comp-ph 版本更新

Harnessing AtomisticSkills for Agentic Atomistic Research

利用原子技能实现代理原子研究

Bowen Deng, Bohan Li, Matthew Cox, Hoje Chun, Juno Nam, Artur Lyssenko, Sathya Edamadaka, Jurgis Ruza, Xiaochen Du, Nofit Segal, Jesus Diaz Sanchez, Mingrou Xie, Ty Perez, Yu Yao, Miguel Steiner, Sauradeep Majumdar, Charles B. Musgrave, Anirban Chandra, Abhirup Patra, Detlef Hohl, Connor W. Coley, Ju Li, Rafael Gómez-Bombarelli

发表机构 * Department of Materials Science Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA Department of Chemistry, Kookmin University, Seoul 02707, Republic of Korea Harvard University, Department of Chemistry Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA Department of Nuclear Science Shell Information Technology International Inc., Texas 77082, United States Shell International Exploration & Production Inc., Texas 77079, United States

AI总结 提出AtomisticSkills框架,通过分层分解科学工作流为技能和工具,使通用AI编码代理能够进行原子级研究,并在多个科学任务中验证其能力。

详情
AI中文摘要

计算材料科学和化学涵盖广泛的知识领域和碎片化的软件生态系统。尽管大语言模型(LLMs)已展现出研究能力,但扩展单体代理以管理原子研究的严谨性和复杂性仍然是一个挑战。在此,我们介绍AtomisticSkills,一个开源框架,使通用AI编码代理能够跨材料科学、化学和药物发现进行原子研究。通过将科学工作流分层分解为代理技能和工具,AtomisticSkills为代理提供模块化、可扩展且即插即用的研究能力。该框架集成了超过100个人工策划的多学科技能,包括数据库访问、热力学和动力学建模,以及采用机器学习原子间势(MLIPs)和密度泛函理论(DFT)的多种模拟引擎。我们根据科学文献验证其功能覆盖范围,并展示了跨不同科学任务的强大编排能力:锂离子固态电解质的生成设计、用于CO2捕获的金属有机框架的高通量筛选、自主MLIP基准测试和微调、用于药物设计的基于多阶段结构的虚拟筛选、多模态X射线衍射模式分析,以及用于析氧反应的铁氧化物催化剂筛选。AtomisticSkills为构建完全自主的AI科学家提供了关键的代理基础设施。

英文摘要

Computational materials science and chemistry span vast knowledge domains and fractured software ecosystems. Although large language models (LLMs) have demonstrated research capabilities, scaling monolithic agents to manage the rigor and complexity of atomistic research remains a challenge. Here, we introduce AtomisticSkills, an open-source harness framework that empowers general-purpose AI coding agents to conduct atomistic research across materials science, chemistry, and drug discovery. By hierarchically decomposing scientific workflows into agent skills and tools, AtomisticSkills provides agents with modular, extensible, and plug-and-play research capabilities. The framework integrates more than 100 human-curated multidisciplinary skills, including database access, thermodynamics and kinetics modeling, and diverse simulation engines employing machine learning interatomic potentials (MLIPs) and density functional theory (DFT). We validate its functional coverage against scientific literature and demonstrate robust orchestration capabilities across diverse scientific campaigns: generative design of Li-ion solid-state electrolytes, high-throughput screening of metal-organic frameworks for CO2 capture, autonomous MLIP benchmarking and fine-tuning, multi-stage structure-based virtual screening for drug design, multimodal X-ray diffraction pattern analysis, and screening of Fe-oxide catalysts for oxygen evolution reaction. AtomisticSkills provides a critical agent infrastructure towards building fully autonomous AI scientists.

2605.23997 2026-05-26 cs.CV cs.AI cs.LG 版本更新

IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

IVR-R1:通过强化学习中的迭代视觉基础推理优化轨迹

Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu

发表机构 * Hangzhou International Innovation Institute, Beihang University(北京航空航天大学杭州国际创新研究院) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院) Kuaishou Technology(快手科技) Shenzhen Institute of Advanced Integration Technology, Shenzhen(深圳先进集成技术研究院)

AI总结 提出IVR-R1框架,利用奖励驱动的筛选机制和迭代再推理循环,在强化学习中动态校正多模态推理轨迹,以解决视觉幻觉和逻辑错误问题。

详情
AI中文摘要

通过强化学习的多模态大语言模型在复杂视觉推理任务中展现出显著能力,但在长程多模态场景中仍存在局限,常出现视觉幻觉和逻辑错误。当前方法通常将高维视觉场景预编码为离散文本代理以促进下游推理。然而,随着推理链展开,文本与视觉场景之间固有的信息不对称会侵蚀视觉基础,导致推理误导和错误输出。为解决此问题,我们提出IVR-R1(迭代视觉基础推理),一种新颖的强化学习训练框架,通过动态视觉重新对齐主动校正推理轨迹以指导策略优化。具体而言,利用奖励驱动的筛选机制识别有缺陷的展开,IVR-R1在多模态上下文中执行细粒度的步骤级错误归因。通过将中间推理状态与原始视觉先验进行迭代交叉引用,再推理循环实现自动轨迹校正,有效合成专家级演示,作为策略模型的高保真推理模板。我们在多种多模态基准上的实验表明,IVR-R1持续优于现有强化学习方法,为在复杂多模态推理中保持逻辑和视觉一致性建立了优越范式。

英文摘要

Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucinations and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that uses dynamic visual re-alignment to actively rectify reasoning trajectories and guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.

2605.23994 2026-05-26 cs.CV cs.AI 版本更新

RAW: Robust Avatar Watermarking -- Benchmarking and Baseline

RAW:鲁棒的数字人水印——基准测试与基线方法

Jack Parry, Jack Saunders, Vinay Namboodiri

发表机构 * University of Bath(巴斯大学)

AI总结 针对数字人水印面临的后处理攻击,提出基准测试RAW和基于3D人脸重建的UV纹理空间水印方法WALT,在缩放攻击和背景移除攻击下分别达到92.4%和95.6%的鲁棒性。

详情
AI中文摘要

数字人水印面临独特挑战:在部署前,数字人通常要经过背景替换、重新构图和格式转换等常规后处理。我们提出RAW(鲁棒的数字人水印),一个包含来自5个商业提供商的50个合成数字人视频和6种模拟真实数字人工作流程的攻击的基准测试。评估7种现有方法发现,数字人特定的攻击(如背景移除)会显著降低水印恢复率。我们提出WALT(通过学习纹理进行数字人水印),该方法通过3D人脸重建在UV纹理空间中嵌入水印。WALT在缩放攻击下达到最高鲁棒性(92.4%),同时在背景移除攻击下保持强劲性能(95.6%)。我们发布该基准测试以促进针对数字人水印的研究。

英文摘要

Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.

2605.23992 2026-05-26 cs.CV cs.AI 版本更新

A World Model of Radiologist Reading for Medical Image Representation Learning

放射科医生阅读的世界模型用于医学图像表示学习

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

发表机构 * University of Georgia(佐治亚大学) University of Texas at Arlington(德克萨斯大学阿灵顿分校) New Jersey Institute of Technology(新泽西理工学院)

AI总结 提出GazeWorld,一种将图像视为世界、放射科医生注视序列视为轨迹的医学成像世界模型,通过自回归预测注视补丁表示和空间补全未访问区域,在多个基准上实现最先进的诊断准确率和零样本性能。

详情
AI中文摘要

放射科医生的眼动追踪数据提供了专家在图像阅读过程中如何搜索、比较和积累证据的丰富记录;然而,现有方法仅部分利用这一信号,要么作为静态空间先验,要么作为与诊断脱节的辅助预测目标。我们提出GazeWorld,一种医学成像世界模型,将图像视为世界,将放射科医生的注视序列视为通过该世界的轨迹。GazeWorld自回归地从所有先前访问过的补丁预测下一个注视补丁的潜在表示,同时一个空间补全分支覆盖未访问区域。在推理时,GazeWorld仅从图像生成一系列补丁表示,无需真实注视数据。冻结的GazeWorld特征在CheXpert、RSNA肺炎和SIIM-ACR气胸的所有九个监督设置中实现了最先进的诊断准确率,并在所有三个基准上取得了最高的零样本准确率。在GazeSearch基准上,使用相同冻结特征训练的通用解码器在ScanMatch和SED上分别比专门构建的LogitGaze-Med高出16%和22%,尽管未明确训练以预测注视。GazeWorld表明,建模专家如何阅读(而不仅仅是他们得出什么结论)为医学成像AI提供了一种有前景的预训练范式。

英文摘要

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

2605.23989 2026-05-26 cs.AI cs.CL cs.CR 版本更新

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

迈向可信的自主AI:安全性、鲁棒性、隐私与系统安全的全面综述

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu

发表机构 * Faculty of Engineering, Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学工程学院、计算机科学与工程系) Artificial Intelligence Innovation and Incubation Institute, Fudan University(复旦大学人工智能创新与孵化院) Shanghai Academy of AI for Science(上海人工智能科学研究院)

AI总结 本文综述了自主AI系统在安全鲁棒性与隐私系统安全两个核心维度的风险来源、阶段缓解策略及统一评估指标,并讨论了开放挑战。

Comments 36 pages, 4 figures. Survey/review article on trustworthy agentic AI. Published in Academia AI and Applications, 2026

详情
Journal ref
Academia AI and Applications, vol. 2, 2026
AI中文摘要

自主AI系统——即通过规划、工具使用、记忆和长程交互增强的大型语言模型(LLM)——能够自主执行复杂任务,但其多步轨迹引入了新的故障模式,挑战了可信赖性。本综述通过两个对高风险部署至关重要的核心维度,对可信自主AI进行了重点考察:安全性与鲁棒性,以及隐私与系统安全性。针对每个维度,我们澄清了关键概念,识别了风险在代理工作流中出现的环节,并总结了针对各阶段的缓解策略。其他可信赖性方面(价值对齐、透明度、公平性和问责制)作为相关背景而非平行章节进行讨论。为了支持一致的比较和部署决策,我们将评估整合到一个统一的指标与基准中心,强调结果和过程信号(例如,约束违反、轨迹完整性和对抗成功率),并为发布门控提供场景到指标的指导。最后,我们概述了开放挑战,如自我进化代理、运行时监控与验证、隐私保护个性化以及信任-效用权衡,并提出了一个关于开源自主系统中现实世界安全失败的案例研究。我们的目标是作为在高风险环境中构建可信自主系统的研究人员和实践者的实用参考。

英文摘要

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

2605.23987 2026-05-26 cs.AI cs.RO 版本更新

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

超越预定义学习对象:面向最新自主机器人学习的思维-学习交互模型

Hong Su

发表机构 * School of Computer Science, Chengdu University of Information Technology(成都信息科技大学计算机学院)

AI总结 针对自主机器人在开放环境中无法依赖预定义学习对象的问题,提出一种思维-学习交互模型,通过思维指导学习(识别变化、选择证据、组织训练、规划验证)和学习促进思维(更新知识、经验、策略、推理)的双向机制,实现输入特征发现、输出类别扩展、模型更新和动作例程重构,实验验证了模型在特征适应、新类别形成、模型更新和动作优化上的有效性。

详情
AI中文摘要

在开放和变化环境中运行的自主机器人不能总是依赖预定义的输入、输出和动作例程。尽管现有的学习方法使机器人能够通过环境交互提高性能,但学习对象往往是预先固定的,例如输入特征、识别输出、网络结构、任务目标或动作序列。这限制了它们在长期运行中出现新特征、新类别或更高效任务例程时的适应能力。为解决此问题,本文提出了一种面向自主机器人的思维-学习交互模型。核心思想是:思维通过识别潜在变化、选择有用证据、组织训练材料和规划验证动作来指导学习,而学习通过更新任务知识、特征选择经验、动作策略和未来推理过程来促进思维。基于这种双向机制,机器人可以逐步超越预定义的学习设置,并通过与环境的持续交互调整其识别关系和动作关系。具体来说,该模型支持自适应输入特征发现、输出类别扩展、学习模型更新和动作例程重构。实验结果表明,该模型在特征适应中将最终识别准确率从0.419提高到0.845,实现了更高的新类别形成准确率和模型更新成功率,并将动作例程重构中的平均动作长度从13.0减少到4.0。在学习增强思维方面,有用证据选择率从0.272提高到0.965,表明学习结果能有效改善未来的证据选择和推理。

英文摘要

Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

2605.23986 2026-05-26 cs.DB cs.AI cs.MA 版本更新

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

MemForest: 一种具有层次化时间索引的高效智能体记忆系统

Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang

发表机构 * National University of Singapore(新加坡国立大学) Zero Gravity Labs(零重力实验室)

AI总结 针对长上下文LLM智能体记忆系统中粗粒度状态管理和顺序更新导致的维护开销问题,提出MemForest框架,通过并行块提取和层次化时间索引树MemTree实现高效写入和局部更新,在LongMemEval-S上达到79.8% pass@1准确率,吞吐量比现有方法高约6倍。

Comments 12 pages. Extended version with appendix as supplemental material. Submitted to VLDB

详情
AI中文摘要

记忆是使长上下文LLM智能体能够通过持续的提供和更新生命周期在交互中保持持久状态的基本组件。尽管已有大量先前工作,现有系统由于两个关键限制而遭受显著的维护开销:粗粒度的状态管理和固有的顺序更新流水线。特别是,更新通常与LLM推理紧密耦合,需要全状态重写,导致可扩展性差,且随着记忆积累延迟增加。为了解决这些挑战,我们提出了MemForest,一个将智能体记忆重新表述为写高效的时间数据管理问题的记忆框架。MemForest通过并行块提取打破顺序瓶颈,将记忆构建解耦为并发、独立的操作。为了进一步消除粗粒度维护,我们引入了MemTree,一种层次化时间索引,将记忆组织为时间有序的树,而不是扁平的全局摘要。这种设计用局部逐节点更新取代了全状态重写,将维护成本降低到受影响的树路径,同时自然保留时间演化的状态。我们在两个长上下文记忆基准LongMemEval-S和LoCoMo上评估了MemForest。在LongMemEval-S上,MemForest在有状态基线中实现了最佳整体性能,达到79.8%的pass@1准确率,同时保持的记忆构建吞吐量比包括EverMemOS在内的最先进方法高约6倍。

英文摘要

Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.
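
To make the idea of localized per-node updates concrete, here is a minimal sketch of a time-bucketed tree. It is an illustration under stated assumptions, not the MemTree design: the day/hour bucketing, the `Node` fields, and the use of an item count as a stand-in for an LLM-written summary are all hypothetical.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = {}   # time-bucket key -> Node
        self.items = []      # (timestamp, text) pairs, leaves only
        self.count = 0       # memories stored at or below this node
        self.summary = ""    # stand-in for an LLM-written summary
        self.version = 0     # number of times this node was refreshed

def refresh(node):
    """Recompute a node's summary from its own items and direct children only."""
    node.count = len(node.items) + sum(ch.count for ch in node.children.values())
    node.summary = f"{node.count} memories"
    node.version += 1

def insert(root, timestamp, text, bucket_sizes=(86400, 3600)):
    """Add one memory; only the nodes on its root-to-leaf path are refreshed."""
    path = [root]
    node = root
    for size in bucket_sizes:
        key = timestamp // size
        node = node.children.setdefault(key, Node(f"{size}s:{key}"))
        path.append(node)
    node.items.append((timestamp, text))
    node.items.sort()
    for touched in reversed(path):  # leaf first, so counts propagate upward
        refresh(touched)
    return path

def query(root, start, end, bucket_sizes=(86400, 3600)):
    """Return memories with start <= timestamp <= end in time order,
    skipping buckets that cannot overlap the requested range."""
    results = []
    def visit(node, depth):
        if depth == len(bucket_sizes):
            results.extend(it for it in node.items if start <= it[0] <= end)
            return
        size = bucket_sizes[depth]
        for key in sorted(node.children):
            if start // size <= key <= end // size:
                visit(node.children[key], depth + 1)
    visit(root, 0)
    return results

root = Node("root")
insert(root, 10, "user prefers metric units")
insert(root, 4000, "user moved to Berlin")
insert(root, 90000, "user moved to Munich")
```

After the three inserts, the hour bucket holding the first memory has been refreshed once, while the root has been refreshed three times. A flat global summary would instead be rewritten in full on every insert.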

2605.23984 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

面向多模态在线分布式工业异常检测的参数高效多类智能调度

Heqiang Wang, Weihong Yang, Zheyuan Yang, Jia Zhou, Xiaoxiong Zhong, Fangming Liu, Weizhe Zhang

发表机构 * Pengcheng Laboratory(鹏城实验室) Shenzhen International Graduate School(深圳国际研究生院)

AI总结 针对工业异常检测中分布式、持续生成数据的特点,提出多模态在线分布式工业异常检测框架,通过多类智能调度问题和序列边际增益贪婪算法协调模型更新,并采用资源高效类级低秩适应策略降低系统开销,在MVTec 3D-AD和Eyecandies数据集上取得优越性能。

详情
AI中文摘要

工业异常检测作为工业系统的基本挑战已引起广泛关注。异构工业传感器的快速发展推动工业异常检测从单模态向多模态范式转变。然而,现有方法主要针对集中式和离线场景设计,忽视了实际工业环境中分布式和持续生成的数据特征。随着边缘智能的发展,现代边缘设备不仅能够采集数据,还能进行分布式模型训练,实现系统范围内的协作智能。工业异常检测是此背景下的关键应用。受这些挑战启发,我们提出了一种名为多模态在线分布式工业异常检测(MODIAD)的新框架。首先给出了MODIAD的完整工作流程,然后制定了多类智能调度(MIS)问题,通过平衡数据充足性和类别更新频率来协调跨类模型更新。为了高效解决该问题,我们设计了序列边际增益贪婪(SMG)算法,能够在资源约束下实现有效的多类训练。此外,为了提升训练过程中的计算和通信效率,我们提出了资源高效类级低秩适应(REC-LoRA)策略,在保持检测性能的同时显著降低系统开销。在两个代表性多模态工业异常检测数据集MVTec 3D-AD和Eyecandies上的大量实验表明,所提方法在MODIAD场景下实现了优越的性能和效率。

英文摘要

Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.

2605.23983 2026-05-26 cs.AI cs.LO cs.SI 版本更新

Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications

等式发现的饱和标度律:三个玩具基底中的增长动力学现象学及两个真实世界复现

Fabio Rovai

发表机构 * Tesseract Academy(Tesseract学院)

AI总结 研究确定性等式发现基底中的增长动力学,提出饱和幂律增长模型,并在玩具域和真实世界数据中验证其基底条件性。

Comments 17 pages, 5 figures, 4 tables, 2 algorithms. Code and data at https://github.com/fabio-rovai/saturating-scaling-laws (currently private; will be made public on acceptance)

详情
AI中文摘要

我们研究确定性等式发现基底中的增长动力学。在三个玩具域(算术、布尔、高阶列表;n=592条轨迹)中,短程基底大小符合幂律N(t) ∝ t^b。在每个基底内,b对架构敏感(交叉验证R²≈0.82);回归不能跨基底迁移(算术+布尔到列表得到R²≈-0.84)。一个启发式平均场闭包模型预测饱和幂律dN/dt = K N^k exp(-μ N),其中纯幂律是短程近似。三个稳健性检验:在4/5的玩具轨迹中,(k, μ)的bootstrap区间紧密,1/5退化;对玩具数据的样本外预测(拟合前100个epoch,预测后400个)中纯幂律5/5获胜,表明玩具轨迹未达到饱和;在两个真实世界增长代理上结果出现分歧。每月新Mathlib/*.lean文件添加量(mathlib4,60个月,9701个文件)支持饱和形式,在样本外预测上优于纯幂律约7倍;Coq mathcomp每月提交量(129个月,3083次提交)在两个测试中都偏向纯幂律,μ趋近于零。动力学在两个层面上是基底条件性的:基底内架构与b的回归不可迁移,且N(t)本身偏好的函数族(纯幂律vs饱和幂律)因基底而异。我们提出“饱和幂律增长,具有基底条件性的(k, μ),当基底达到饱和状态时可观测”作为工作框架。

英文摘要

We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate sizes fit a power-law N(t) proportional to t^b. Within each substrate b is architecture-sensitive (cross-validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately -0.84). A heuristic mean-field closure model predicts a saturating power-law dN/dt = K N^k exp(-mu N) of which the pure power-law is the short-range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out-of-sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power-law 5/5, indicating the toy trajectories do not reach saturation; on two real-world growth proxies the result splits. New Mathlib/*.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power-law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power-law on both tests with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power-law) differs by substrate. We propose "saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime" as a working framing.
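
The relationship between the two functional forms can be checked numerically. For mu = 0 and k < 1, the equation dN/dt = K N^k has the closed-form solution N(t) = (N0^(1-k) + (1-k) K t)^(1/(1-k)), which grows like t^b with b = 1/(1-k) at large t. The sketch below integrates the saturating form with a forward Euler step; the parameter values (K = 1, k = 0.5, mu = 0.05, N0 = 1) are illustrative assumptions and are not taken from the paper.

```python
import math

def simulate(K, k, mu, n0=1.0, steps=1000, dt=0.01):
    """Forward-Euler integration of dN/dt = K * N**k * exp(-mu * N)."""
    n = n0
    trajectory = [n]
    for _ in range(steps):
        n += dt * K * n**k * math.exp(-mu * n)
        trajectory.append(n)
    return trajectory

def pure_power_law(K, k, t, n0=1.0):
    """Closed-form solution of dN/dt = K * N**k for k < 1 (the mu = 0 case)."""
    return (n0 ** (1 - k) + (1 - k) * K * t) ** (1 / (1 - k))

pure = simulate(K=1.0, k=0.5, mu=0.0)         # mu = 0: pure power-law growth
saturating = simulate(K=1.0, k=0.5, mu=0.05)  # mu > 0: growth rate decays with N
```

With these settings the two trajectories start close together and separate as N grows, which matches the abstract's description of the pure power law as a short-range approximation of the saturating form.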

2605.23981 2026-05-26 q-bio.NC cs.AI cs.CY cs.HC cs.SY eess.SY 版本更新

Metacognition Should Be the Scientific Framework for Bounded and Effective Self-Governance in Generative AI

元认知应成为生成式AI中有限且有效自我治理的科学框架

Eugene Yu Ji, Igor Grossmann, Amir-Hossein Karimi

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文提出元认知作为生成式AI自我治理的科学框架,通过计算、算法和生态三个层面的元认知对齐实现有限且有效的自我治理。

Comments 16 pages, 1 figure, 1 table

详情
AI中文摘要

生成式AI研究日益面临一个共同问题:当不确定性高、证据缺失或上下文不足时,系统必须维持并管理自身的生成活动。本文认为,元认知应成为生成式AI中有限且有效自我治理的科学框架,其中输出生成与生成系统导航和调节自身活动的能力一同被恰当评估。我们通过展示有限且有效的AI自我治理需要跨计算、算法和生态层面的元认知对齐来推进这一观点。在计算层面,元认知指定系统应服务的元级功能,如监控、评估、控制和适应。在算法层面,这些功能通过诸如引出、迭代和模块化等程序实现。在生态层面,元认知信号在界面、工作流和问责安排中变得有意义、可操作和可问责。因此,元认知使得将生成式AI视为既有能力又治理良好的成为可能,而非将能力和治理视为竞争目标。

英文摘要

Generative AI research increasingly confronts a shared problem: systems must sustain yet govern their own generative activity when uncertainty is high, evidence is missing, or context is insufficient. This position paper argues that metacognition should become the scientific framework for bounded and effective self-governance in generative AI, where output generation is properly evaluated together with the capacities through which generative systems navigate and regulate their own activity. We advance this position by showing that bounded and effective AI self-governance requires metacognitive alignment across computational, algorithmic, and ecological levels. At the computational level, metacognition specifies the meta-level functions a system is meant to serve, such as monitoring, evaluation, control, and adaptation. At the algorithmic level, these functions are realized through procedures such as elicitation, iteration, and modularization. At the ecological level, metacognitive signals become meaningful, actionable, and accountable within interface, workflow, and accountability arrangements. Metacognition thus makes it possible to conceive generative AI as both capable and well-governed, rather than treating capability and governance as competing aims.

2605.23972 2026-05-26 cs.AI cs.CL cs.RO 版本更新

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

为什么我们需要世界模型来实现通用人工智能:大语言模型失败之处以及世界模型如何可能超越

Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny

发表机构 * Department of Computing Technologies(计算技术系) SRM Institute of Science and Technology(SRM科学与技术学院) Bio-Sensing and Bio-Sensors Group(生物传感与生物传感器组) Smart Automation and Communication Technologies Research Institute of Sciences and Engineering(科学与工程智能自动化与通信技术研究所) University of Sharjah, UAE(阿联酋沙迦大学) Department of Computer Engineering(计算机工程系) College of Computing and Informatics(计算与信息学院)

AI总结 本文通过提出潜在动态推理(LDI)概念和Flux环境案例研究,论证了大语言模型在因果推理、状态跟踪和长程规划上的局限性,并展示基于显式状态空间的强化学习智能体在长程游戏中显著优于纯文本LLM。

Comments 19 pages, 5 figures

详情
AI中文摘要

大语言模型在语言生成和知识密集型任务中表现出色,但在需要因果推理、持久状态跟踪和长程规划的场景中仍然受限。我们认为,这些限制可能源于序列预测与对潜在环境动态进行推理之间的目标层级不匹配。为了形式化这一区别,我们引入了潜在动态推理(LDI),这是一种概念性视角,将语言和多模态观测解释为底层转移动态的部分证据。为了实证研究这一视角,我们引入了Flux,一个完全通过自然语言规则指定的序列推理环境。作为一个概念验证案例研究,这些规则首先被编译成一个显式的状态转移模拟器,说明在某些情况下,结构化的潜在转移动态可以从文本规则描述中操作性地提取出来。这使得我们能够在纯文本观测上运行的LLM与直接在提取的潜在状态空间中训练的强化学习智能体之间进行受控比较。在该案例研究中,能够显式访问潜在状态空间的智能体在长程游戏中表现出更稳定的行为,总胜率约为79%,而LLM仅为11%。定性分析进一步揭示了与不稳定的持久状态跟踪一致的失败模式,包括无效动作、状态跟踪错误和短程推理失败。Flux环境的完整实现可在https://github.com/FeisalAlaswad/FLUX-RL-Agent获取。在评估的设置中,这些结果表明,如果没有持久状态跟踪和转移建模的机制,仅凭强大的序列预测可能难以支持稳健的长程动态推理。

英文摘要

Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

2605.23967 2026-05-26 q-bio.NC cond-mat.mtrl-sci cs.AI physics.app-ph 版本更新

Sensing Intelligence as a Trainable Metamaterial Property

感知智能作为可训练的超材料属性

Kyungmi Na, Yifei Li, Xinyi Yang, Bolei Deng

发表机构 * Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology(丹尼尔·古根海姆航空航天工程学院,佐治亚理工学院) Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology(计算机科学与人工智能实验室,麻省理工学院)

AI总结 本文提出将感知智能作为可训练的超材料属性,通过可微仿真优化超材料几何结构,使神经网络能够训练其身体进行感知,从而显著提升感知精度或减少电子传感器数量。

详情
AI中文摘要

在生物系统中,感知并非仅由大脑完成:身体在外部刺激被转换为神经信号之前,会对其进行变形、振动和过滤。在工程系统中,这一处理负担主要落在电子设备和计算上,而机械体通常仅设计用于强度和稳定性。在此,我们将感知智能呈现为身体的一种可训练属性。我们展示了超材料的几何结构可以被优化,以将外部刺激重塑为神经网络更易于解释的内部信号。我们不是手工设计这种物理预处理,而是通过可微仿真将感知损失反向传播到身体的设计参数,让神经网络训练自己的身体进行感知。在数值和实验感知场景中,优化后的身体将感知精度提高了多达五倍,或将所需电子传感器的数量减少了近一个数量级。

英文摘要

In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they are transduced into neural signals. In engineered systems, this processing burden is placed largely on electronics and computation, while the mechanical body is usually designed only for strength and stability. Here, we present sensing intelligence as a trainable property of the body. We show that the geometry of a metamaterial can be optimized to reshape external stimuli into internal signals that are easier for a neural network to interpret. Rather than hand-designing this physical preprocessing, we let the neural network train its own body for sensing by backpropagating the sensing loss to the body's design parameters through differentiable simulation. Across numerical and experimental sensing scenarios, the optimized body improves sensing accuracy by up to fivefold or reduces the number of required electronic sensors by nearly an order of magnitude.

2605.23966 2026-05-26 cs.CL cs.AI cs.SY eess.SY math.CO 版本更新

TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

TriVAL: 一种用于忠实自动优化建模的三重验证框架

Ziyang Fang, JinXi Wang, Jinghui Zhong, Yew-Soon Ong

发表机构 * School of Computer Science and Engineering, South China University of Technology(华南理工大学计算机科学与工程学院) Centre for Frontier AI Research, Agency for Science, Technology and Research(科技研究局前沿人工智能研究中心) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出TriVAL三重验证框架,在语义规范、数学公式和代码生成三个阶段进行显式验证,通过构建-验证-修正循环提高自动优化建模的准确性,并在新基准NL4COP上超越现有方法。

Comments 13 pages

详情
AI中文摘要

优化建模作为自然语言问题描述与优化求解器之间的关键桥梁,是将运筹学(OR)应用于实际决策的基石。大语言模型(LLM)的最新进展推动了自动优化建模的显著进步。然而,现有方法在建模过程中仍缺乏显式验证,导致早期阶段引入的错误会沿流水线传播,最终降低建模精度。为解决这一挑战,我们提出TriVAL,一种在自动优化建模的三个阶段(语义规范、数学公式和代码生成)进行显式验证的三重验证框架。在每个阶段,TriVAL遵循构建-验证-修正循环,根据阶段特定标准评估当前结果,并在必要时进行修正。这种设计有助于在错误跨阶段累积之前识别和纠正它们,从而在整个建模过程中保持忠实性。为了在更具挑战性的组合问题上评估自动优化建模,我们进一步引入NL4COP,一个包含50种不同问题类型、150个实例的基准,其决策逻辑更复杂、约束耦合更紧密、建模要求比现有基准更高。在NL4COP和已有基准上的实验表明,TriVAL始终优于最先进的方法,在最具挑战性的问题上提升最大。

英文摘要

Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, helping preserve faithfulness throughout the modeling process. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-of-the-art methods, with the largest gains on the most challenging problems.
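The construct-validate-revise loop described in the abstract can be illustrated with a minimal Python sketch. The function names, the `(is_valid, feedback)` validator contract, and the `max_rounds` budget are assumptions made for illustration; the abstract does not specify TriVAL's actual interfaces or stopping rule.

```python
from typing import Callable, List, Tuple

# Hypothetical contracts: a validator returns (is_valid, feedback),
# and a reviser maps (draft, feedback) to a new draft.
Validator = Callable[[str], Tuple[bool, str]]
Reviser = Callable[[str, str], str]
Stage = Tuple[Callable[[str], str], Validator, Reviser]


def construct_validate_revise(
    draft: str, validate: Validator, revise: Reviser, max_rounds: int = 3
) -> Tuple[str, bool]:
    """Validate a stage output and revise it until it passes or the budget runs out."""
    for _ in range(max_rounds):
        ok, feedback = validate(draft)
        if ok:
            return draft, True
        draft = revise(draft, feedback)
    # One final check after the last revision.
    ok, _ = validate(draft)
    return draft, ok


def run_pipeline(problem: str, stages: List[Stage]) -> Tuple[str, List[bool]]:
    """Run stages in order; each stage consumes the checked output of the previous one."""
    artifact = problem
    passed = []
    for construct, validate, revise in stages:
        artifact, ok = construct_validate_revise(construct(artifact), validate, revise)
        passed.append(ok)
    return artifact, passed
```

In a three-stage setup, `stages` would hold one `(construct, validate, revise)` triple each for semantic specification, mathematical formulation, and code generation, so that an error caught at one stage is revised before the next stage builds on it.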

2605.23964 2026-05-26 eess.SY cs.AI cs.SY 版本更新

Multi-market value-stacking: Battery control for combined imbalance participation and non-uniform FCR bidding

多市场价值堆叠:结合不平衡参与和非均匀FCR投标的电池控制

Celle Hendrickx, Fabio Pavirani, Chris Develder

发表机构 * Ghent University - imec, IDLab(根特大学 - imec,IDLab)

AI总结 提出一种两阶段控制框架,通过非均匀FCR投标和深度强化学习实时交易,在保持FCR合规的同时实现7.56%的利润提升。

Comments 5 pages, 2 figures. Presented at ACM Sustainability Week 2026 (ACM Sustainability Week Companion 26), June 22-25, 2026, Banff, AB, Canada

详情
AI中文摘要

现代电力系统中可再生能源(RES)占比不断增加,加剧了电网不平衡和频率偏差,从而增强了对频率遏制备用(FCR)和被动平衡等辅助服务的需求。电池储能系统(BESS)非常适合这些服务,但先前的研究通常依赖于在整个控制周期内保持恒定的均匀FCR投标。这种静态投标未能充分利用BESS的灵活性,因为它们没有平衡为FCR交付预留能量与用于不平衡套利之间的权衡,限制了在价值堆叠场景中可实现的价值。为解决这一限制,我们针对欧洲背景提出了一种引入非均匀FCR投标的两阶段控制框架。在第一阶段,我们使用数据驱动的蒙特卡洛(MC)优化推导出时变投标序列。在第二阶段,深度强化学习(DRL)代理利用剩余灵活性进行实时不平衡交易,同时主动管理能量状态(SoE)以确保符合FCR要求。该框架作为概念验证提出,突出了时变投标策略的潜在优势。通过引入日循环预算和时变储备承诺,我们的方法相比均匀基线实现了7.56%的利润增长。这些结果表明,非均匀投标可以通过更有效地将储备义务与快速变化的不平衡机会对齐来释放额外价值。

英文摘要

The growing share of Renewable Energy Sources (RES) in modern power systems increases both grid imbalances and frequency deviations, reinforcing the need for ancillary services such as Frequency Containment Reserve (FCR) and passive balancing. Battery Energy Storage Systems (BESS) are well-suited for these services, but prior research typically relies on uniform FCR bids that remain constant throughout the control period. Such static bids fail to fully exploit BESS flexibility, as they do not balance the trade-off between reserving energy for FCR delivery and using it for imbalance arbitrage, limiting the achievable value in value-stacking settings. To address this limitation, we propose a two-stage control framework for the European context that introduces non-uniform FCR bids. In the first stage, we derive a time-varying bid sequence using data-driven Monte Carlo (MC) optimization. In the second stage, a Deep Reinforcement Learning (DRL) agent leverages the residual flexibility for real-time imbalance trading while proactively managing the State of Energy (SoE) to ensure compliance with FCR requirements. The framework is presented as a proof of concept, highlighting the potential benefits of time-varying bidding strategies. By incorporating daily cycle budgets and time-varying reserve commitments, our approach achieves a 7.56% profit increase compared to uniform baselines. These results show that non-uniform bidding can unlock additional value by more effectively aligning reserve obligations with rapidly changing imbalance opportunities.

2605.23961 2026-05-26 q-bio.BM cs.AI cs.LG 版本更新

Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation

多模态对齐与偏好优化用于零样本条件RNA生成

Roman Klypa, Alberto Bietti, Sergei Grudinin

发表机构 * Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK(格勒诺布尔阿尔卑斯大学、法国国家科学研究中心、格勒诺布尔INP、LJK实验室) Center for Computational Mathematics, Flatiron Institute(计算数学中心、Flatiron研究所)

AI总结 提出Moirain框架,通过多模态监督微调和直接偏好优化实现条件RNA序列生成,在零样本条件下生成具有高结合亲和力的生物合理RNA序列。

详情
AI中文摘要

设计能与特定蛋白质相互作用的RNA分子是实验和计算生物学中的一个关键挑战。尽管自然语言建模和基于深度学习的蛋白质设计最近取得了进展,但在提高成功交互频率和生成序列的真实性方面仍有很大空间。在这项工作中,我们将条件RNA序列生成视为一个多阶段对齐问题,引入了Moirain:一组通过多模态监督微调(SFT)和直接偏好优化(DPO)优化的模型。我们的方法从对多样化RNA语料库的大规模预训练开始,以捕捉序列合理性的基本语法。为了实现目标特异性生成,我们采用了一种多模态SFT架构,该架构以蛋白质结构和序列特征为条件进行RNA合成。最后,我们利用DPO使用合成交互数据来优化模型:利用DPO在非对齐偏好空间中导航的独特能力,我们提高了功能适应性,同时不破坏学习到的自然分布。对Moirain系列(Moirain-Base、-Multi和-DPO)的广泛评估表明,与现有基线相比,我们的框架始终能产生新颖、多样且生物合理的RNA序列,并具有优越的结合亲和力。

英文摘要

The design of RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite recent progress in natural language modeling and deep learning-based protein design, there remains significant room to improve the frequency of successful interactions and the authenticity of generated sequences for functional applications. In this work, we frame conditional RNA sequence generation as a multi-stage alignment problem, introducing Moirain: a suite of models optimized via multimodal supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our approach begins with large-scale pretraining on diverse RNA corpora to capture the fundamental grammars of sequence plausibility. To achieve target-specific generation, we employ a multimodal SFT architecture that conditions RNA synthesis on protein structural and sequential features. Finally, we leverage DPO to refine the model using synthetic interaction data: taking advantage of DPO's unique ability to navigate non-aligned preference spaces, we improve functional fitness without collapsing the learned natural distribution. Extensive evaluation of the Moirain series (Moirain-Base, -Multi, and -DPO) demonstrates that our framework consistently produces novel, diverse, and biologically plausible RNA sequences with superior binding affinities compared to existing baselines.

2605.23958 2026-05-26 cs.CY cs.AI econ.GN q-fin.EC 版本更新

AI in the Enterprise: How People Use M365 Copilot Chat

企业中的AI:人们如何使用M365 Copilot Chat

Scott Counts, Yan Chen, Jing Dong, Himanshu Sharma, Andrey Zaikin, Rui Hu, Alperen Kok, Gorkem Ozer Yilmaz, Siddharth Suri, Kiran Tomlinson, Sonia Jaffe, Will Wang

发表机构 * Microsoft Corporation(微软公司)

AI总结 基于约550万次会话的用户交互分类,研究M365 Copilot Chat在企业中的使用模式,发现其作为知识工作日常助手,主要用于写作、信息检索、分析、决策和策略制定等,并揭示了不同职业群体间的使用差异及未来AI采用方向。

详情
AI中文摘要

M365 Copilot每周被全球超过一百万家公司的数百万人在工作流程中使用。由于其几乎专门用于工作目的,M365 Copilot在AI领域中具有独特地位,能够清晰展示人们如何使用AI进行工作以及未来可能扩展的使用领域。本文通过对用户与M365 Copilot Chat交互的直接分类来刻画这种使用模式。基于对约550万次会话样本的匿名化和隐私保护分析,我们结合了用户意图的学习分类和与M365 Copilot Chat一起完成的O*NET工作活动分类。我们发现M365 Copilot正在成为知识工作的日常助手:写作占主导地位,但用户也依赖它进行信息检索、分析、决策和策略制定,以及评估和诊断程序和系统等。信息寻求任务仍然常见,但时间趋势表明,相对而言,从“聊天即搜索”向内容和通信相关工作转变。跨职业群体以及与劳动力市场工作的比较进一步表明,使用广泛但不均衡,M365 Copilot Chat完成的工作的相对份额在某些情况下跨越不同工作,而在其他情况下则具有职业特异性。劳动力市场中相对代表性不足的领域预示着企业AI采用的下一个前沿。

英文摘要

M365 Copilot is used every week by millions of people across more than a million companies around the world as part of their workflows. Uniquely positioned in the AI landscape given its near-exclusive use for work purposes, M365 Copilot can offer a clear picture of how people use AI for work and where that usage may expand next. This paper characterizes that usage through direct classification of user interactions with M365 Copilot Chat. Based on an anonymized and privacy-preserving analysis of a sample of approximately 5.5 million sessions, we combine a learned classification of user intent with a classification of O*NET work activities done with M365 Copilot Chat. We find that M365 Copilot is emerging as an everyday assistant for knowledge work: writing dominates, but users also rely on it for information retrieval, analysis, decision making and strategizing, and evaluating and diagnosing programs and systems, among others. Information-seeking tasks remain common, but time trends suggest a relative shift away from "chat as search" and toward content and communication-related work. Comparisons across occupational groupings and to work done in the labor market further show that usage is broad but uneven, where the relative share of work done with M365 Copilot Chat cuts across jobs in some cases and is occupation-specific in others. Areas of relative underrepresentation in the labor market suggest the next frontier for enterprise AI adoption.

2605.23957 2026-05-26 cs.AI cs.LG 版本更新

Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

低成本标签,可靠选择:用于作业车间调度的Rollout校准超启发式算法

Junhao Wei, Yanxiao Li, Yifu Zhao, Zhenhong Peng, Baili Lu, Dexing Yao, Haochen Li, Qinbin He, Sio-Kei Im, Yapeng Wang, Xu Yang

发表机构 * Faculty of Applied Sciences, Macao Polytechnic University(澳门理工大学应用科学学院) Pazhou Lab (Huangpu), Guangzhou(广州琶洲实验室(黄埔)) College of Animal Science and Technology, Zhongkai University of Agriculture and Engineering(仲恺农业工程学院动物科学与技术学院) Macao Polytechnic University(澳门理工大学)

AI总结 提出一种基于Rollout校准的超启发式算法,通过遗憾归一化标签、上下文KNN不确定性估计和门控机制,在低成本标签下实现可靠的选择器,显著降低平均RPD。

详情
AI中文摘要

学习辅助的超启发式算法可以在保持构造性作业车间调度问题(JSSP)启发式的可行性和可解释性的同时,选择调度规则。其主要计算成本在于标签生成而非模型拟合,因为每个监督标签通常需要从部分调度中展开候选规则。我们研究了这一标签成本问题以及一个可靠性问题:学习的选择器不应偏离强默认规则,除非预测的增益是可信的。所提出的选择器使用遗憾归一化的展开标签、上下文KNN不确定性估计以及一个门控机制,仅在预测改进超过不确定性调整的边际时采取行动。我们还变化展开深度和广度以衡量成本-质量权衡。在合成JSSP实例上,门控选择器在学习的选择器中实现了最低的平均RPD,接近最佳固定调度规则,并将Random-HH的平均RPD降低了一个数量级以上。

英文摘要

Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive Job Shop Scheduling Problem (JSSP) heuristics. Their main computational cost lies in label generation rather than model fitting, since each supervised label usually requires rolling out candidate rules from a partial schedule. We study this label-cost problem together with a reliability problem: a learned selector should not switch away from a strong default rule unless the predicted gain is credible. The proposed selector uses regret-normalized rollout labels, a contextual KNN uncertainty estimate, and a gate that acts only when the predicted improvement exceeds an uncertainty-adjusted margin. We also vary rollout depth and breadth to measure the cost-quality trade-off. On synthetic JSSP instances, the gated selector achieves the lowest mean RPD among learned selectors, remains close to the best fixed dispatching rule, and reduces Random-HH mean RPD by more than an order of magnitude.
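The gating idea in the abstract (switch away from the default rule only when the predicted improvement exceeds an uncertainty-adjusted margin) can be sketched as follows. This is a minimal illustration under stated assumptions: the k-nearest-neighbour spread as the uncertainty proxy, the `kappa` multiplier, and the rule names in the usage note are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def knn_estimate(
    query: Sequence[float],
    contexts: Sequence[Sequence[float]],
    labels: Sequence[float],
    k: int = 3,
) -> Tuple[float, float]:
    """Mean and spread of the labels attached to the k nearest stored contexts."""
    order = sorted(range(len(contexts)), key=lambda i: math.dist(query, contexts[i]))
    nearest = [labels[i] for i in order[:k]]
    return mean(nearest), pstdev(nearest)


def gated_choice(
    default_rule: str,
    predicted_regret: Dict[str, float],
    uncertainty: Dict[str, float],
    kappa: float = 1.0,
) -> str:
    """Leave the default rule only if the predicted gain beats kappa * uncertainty."""
    best = min(predicted_regret, key=predicted_regret.get)
    gain = predicted_regret[default_rule] - predicted_regret[best]
    margin = kappa * uncertainty[best]  # uncertainty-adjusted margin (assumed form)
    return best if gain > margin else default_rule
```

For example, with a default rule `"SPT"` at predicted regret 0.10 and a candidate `"MWKR"` at 0.04, the gate switches when the candidate's uncertainty is 0.02 and keeps the default when it is 0.10. A larger `kappa` makes the selector more conservative.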

2605.23956 2026-05-26 cs.AI cs.LG cs.MA 版本更新

QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

QUIVER: 复合AI系统中扰动传播与分岔的量化形式化框架

Prashanti Nilayam, Sankalp Nayak

发表机构 * Servicenow CA, USA(Servicenow加州美国)

AI总结 提出QUIVER形式化框架,通过敏感性矩阵、轨迹散度、分岔阈值和分布忠实度四个组件,量化图结构LLM流水线中扰动传播与结构分岔,并在三个不同架构的企业和公共流水线上验证其有效性。

详情
AI中文摘要

将多个LLM调用链接成有向计算图的复合AI系统现已成为生产AI的主导架构。尽管这些架构利用具有混合模式输出的异构节点,但现有框架无法量化扰动如何通过此类流水线传播,其中节点是随机的且执行路径可能发生结构分岔。我们引入QUIVER,一个用于测量图结构LLM流水线中扰动传播的形式化框架。该框架定义了:(1) 一个敏感性矩阵,带有类型分派的距离度量,将边分类为放大器、吸收器或阈值敏感,并辅以出现提升;(2) 轨迹散度,将变异分解为值漂移、结构路径散度和迭代次数散度;(3) 分岔阈值,识别导致结构执行路径变化的最小扰动;(4) 分布忠实度,量化每个节点评估数据集何时偏离生产分布。我们在两个生产企业流水线和一个公共DSPy多跳QA流水线上进行验证,这三个架构在结构上各不相同。在8200多个仪器化轨迹(32000多对比较)中,我们证明QUIVER揭示了不同架构的独特敏感性剖面,区分了产生相同散度率的机制不同的级联模式,仅从观测数据预测易发生轨迹分岔的节点,并将过时的评估伪影定位到聚合指标无法揭示的特定节点-字段类别。

英文摘要

Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence, which decomposes variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds, which identify the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, which quantifies when per-node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multi-hop QA pipeline, three structurally distinct architectures. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns producing identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.
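The notion of a bifurcation threshold, the smallest perturbation that changes the structural execution path, can be illustrated with a small Python sketch. It assumes a deterministic `run_path` callable that returns the sequence of visited node names for a given perturbation magnitude; both the callable and the grid search are illustrative assumptions. The paper's pipelines are stochastic, and the abstract does not describe QUIVER's estimator.

```python
from typing import Callable, Optional, Sequence, Tuple

Path = Tuple[str, ...]


def bifurcation_threshold(
    run_path: Callable[[float], Path],
    magnitudes: Sequence[float],
) -> Optional[float]:
    """Return the smallest tested magnitude whose execution path differs from baseline."""
    baseline = run_path(0.0)  # unperturbed execution path
    for eps in sorted(magnitudes):
        if run_path(eps) != baseline:
            return eps
    return None  # no structural change observed on this grid
```

With stochastic nodes, one would presumably replace the single `run_path(eps)` call by repeated runs and compare path frequencies rather than a single path.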

2605.23954 2026-05-26 cs.CL cs.AI cs.SD 版本更新

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

EchoDistill:面向鲁棒音频大语言模型的噪声到干净自蒸馏对齐

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong

发表机构 * NTU(国立台湾大学) SHU(上海大学) ICT, CAS(中国科学院计算技术研究所) HDU(杭州电子科技大学) BUPT(北京邮电大学) USTC(中国科学技术大学) SKL-NST, BUPT(北京邮电大学网络与交换技术国家重点实验室)

AI总结 提出EchoDistill框架,通过冻结的干净音频教师模型指导噪声学生模型进行组相对策略优化,实现噪声到干净的自蒸馏对齐,提升音频大语言模型在复杂噪声下的语义可靠性和任务性能。

详情
AI中文摘要

音频大语言模型极易受到现实世界噪声的影响,常常导致严重的语义漂移和幻觉。现有的鲁棒性方法主要依赖于波形级声学增强、答案级监督或噪声表示的内部抑制。为了解决这些问题,我们提出了EchoDistill,一种基于对齐的噪声到干净自蒸馏框架。EchoDistill利用冻结的干净音频教师模型为推理时的噪声音频学生模型提供语义参考。具体地,学生模型在噪声条件下采样候选响应以暴露其测试时行为。这些轨迹随后通过组相对策略优化进行优化,其中与教师模型的令牌级一致性作为奖励加成。通过将噪声学生模型的候选响应与干净语义证据对齐,并应用音频感知奖励塑造,我们的方法鼓励既正确又真正基于声学推理的轨迹。EchoDistill显著提高了音频大语言模型在复杂噪声下的语义可靠性和任务性能,且不引入任何额外推理成本。大量实验表明:(I) 与最强基线相比,EchoDistill在强噪声下GSR平均提升4.18%↑。(II) 在Qwen-Omni上的消融结果进一步显示,EchoDistill相比仅GRPO变体在Acc上平均提升3.02%↑,在Noisy上提升3.89%↑,在GSR上提升4.53%↑。我们的代码可在https://anonymous.4open.science/r/echodistill-10DE获取。

英文摘要

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference cost. Extensive experiments show that: (I) compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise; (II) ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.
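The reward structure described in the abstract, a task reward plus a teacher-consistency bonus, normalised within a group of sampled responses, can be sketched as below. The exact-match proxy for token-level consistency, the `beta` weight, and the use of the population standard deviation are assumptions for illustration only; the abstract does not give the paper's reward definition.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def token_consistency(student: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of aligned positions where student and teacher tokens agree (proxy)."""
    n = max(len(student), len(teacher))
    if n == 0:
        return 1.0
    return sum(s == t for s, t in zip(student, teacher)) / n


def shaped_rewards(
    task_rewards: Sequence[float],
    students: Sequence[Sequence[str]],
    teacher: Sequence[str],
    beta: float = 0.5,
) -> List[float]:
    """Task reward plus a consistency bonus against the clean-audio teacher output."""
    return [r + beta * token_consistency(s, teacher) for r, s in zip(task_rewards, students)]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardise rewards within one sampled group, as in group-relative methods."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Two candidates with the same task reward then receive different advantages if one tracks the teacher's tokens more closely, which is the intended effect of the bonus term.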

2605.23952 2026-05-26 cs.AI cs.CL q-bio.NC 版本更新

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

机器心理测量学:一种人工智能的数学心理学

Alex Bogdan, Adrian de Valois-Franklin

发表机构 * Evolutionairy AI

AI总结 针对人工智能评估中忽视心理结构或过度拟人化的两种错误,本文引入机器心理测量学,通过测量潜在行为、元认知、沟通和自我建模倾向,构建机器心智档案和信任协议,以测量而非判断来理解非人类智能体。

Comments 45 pages, 11 figures

详情
AI中文摘要

人工智能体现在产生的行为足够丰富,足以引发信任、惊喜和担忧,然而我们的评估工具仍然优先考虑能力分数而非心理结构。本文认为,两种对称错误(人工心智盲视,即否认非生物系统中的心理组织;以及人工心智投射,即仅从流畅行为推断类似人类的内心生活)之间的哲学僵局,可以通过在意识问题之下引入一个严谨的测量层来规避,而非解决意识问题本身。借鉴Michael Levin关于认知作为跨基质目标导向能力的连续统观点,以及数学心理学的方法论库(项目反应理论、信号检测理论、贝叶斯认知建模、校准分析、认知偏差测试组),本文发展了机器心理测量学,作为测量人工智能体中潜在行为、元认知、沟通和自我建模倾向的测量科学。其操作核心是机器心智档案:一个多维、领域受限、版本化的轮廓,涵盖校准、源完整性、暗示抵抗性、上下文稳定性、表达对齐、工具完整性、漂移监测和分布基础。一个补充的信任协议通过探针测试组、扰动测试、信度和效度分析以及高风险领域的纵向监测,将心智档案转化为部署决策。哲学贡献是第三种立场,人工心智纪律,既不拟人化也不否认,既不预设意识也不排除意识。目标不是将人工智能体人性化,而是精确地理解它们,因为它们不是人类,通过测量而非判断。

英文摘要

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

2605.23951 2026-05-26 cs.AI cs.LO cs.MA 版本更新

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

智能体技能形式化验证方法:面向可机械检查的能力包含证明的三层架构

Alfredo Metere

发表机构 * Metere Consulting, LLC(梅特尔咨询公司)

AI总结 本文提出三层可组合方法(静态抽象解释、精炼类型系统、SMT有界模型检测),将智能体技能从声明或测试级别提升至形式化验证级别,实现机械可检查的能力包含证明。

详情
AI中文摘要

伴随论文引入了一个关于智能体技能清单的四级验证格(未验证、声明、测试、形式化),并将最高级别作为目标。本文填补了这一空白。我们给出了技能行为的精确语义,忠实于技能如何被LLM驱动的运行时(通过非确定性LLM侧可达的确定性脚本侧)消费,将验证问题表述为该语义上的能力包含属性,并提出了三种可组合方法,共同将技能从声明或测试级别提升至形式化级别:(1)通过在小效应格上的抽象解释,对脚本侧进行可靠静态能力包含分析;(2)一个用于工具调用封装的精炼类型系统,机械地拒绝任何静态推断能力不在清单声明集中的调用;(3)针对父论文的双条件正确性准则的SMT有界模型检测,其中边界选择使得任何符合运行时事务缓冲区视野的反例都作为具体轨迹呈现。我们证明了这三个层次组合起来能可靠地覆盖父论文的威胁模型,仅剩一个残余(LLM拒绝行动的自由),该残余由父论文的运行时双条件在会话边界捕获。这些方法重用现有的成熟工具(Z3、Semgrep、CodeQL、精炼类型检查器、机械化证明助手),而非要求操作者构建新工具,并且携带证明的工件扩展了现有的SKILL.md约定。所有三种方法以及捆绑生产者和重新检查器作为零依赖JavaScript模块在开源enclawed框架(https://github.com/metereconsulting/enclawed;项目页面https://www.enclawed.com/)中提供,包含53个单元测试和一个端到端CLI演示示例技能。

英文摘要

The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime's transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing SKILL.md convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end-to-end CLI demo on a sample skill.
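At its core, the capability-containment property amounts to a set-inclusion check: every capability inferred for a tool call must appear in the manifest's declared set. The Python sketch below shows only that check. The capability strings and the call-record shape are hypothetical, and it stands in for neither the abstract interpretation nor the refinement type system described above (which the paper ships as JavaScript modules).

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical record: (tool-call name, capabilities inferred for that call).
Call = Tuple[str, FrozenSet[str]]


def check_containment(declared: Set[str], tool_calls: Iterable[Call]) -> List[Call]:
    """Return each call together with the capabilities it uses beyond the manifest."""
    violations = []
    for name, inferred in tool_calls:
        excess = frozenset(inferred) - declared
        if excess:
            violations.append((name, excess))
    return violations


def is_contained(declared: Set[str], tool_calls: Iterable[Call]) -> bool:
    """True when no call exceeds the declared capability set."""
    return not check_containment(declared, tool_calls)
```

A static analysis would compute the `inferred` sets without running the skill; here they are simply supplied as inputs.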

2605.23950 2026-05-26 cs.AI cs.SE 版本更新

Stop Comparing LLM Agents Without Disclosing the Harness

停止比较 LLM Agent 而不公开其执行框架

Yunbei Zhang, Janet Wang, Yingqiang Ge, Weijie Xu, Jihun Hamm, Chandan K. Reddy

发表机构 * Tulane University(杜兰大学) Rutgers University(罗格斯大学) Independent Researcher(独立研究者) Virginia Tech(弗吉尼亚理工大学)

AI总结 本文论证在长周期任务中,Agent 执行框架(Harness)比底层模型更能决定性能,并提出框架感知的评估标准与方差分解协议。

详情
AI中文摘要

这篇立场论文认为,对于在具有可比前沿能力的模型上评估的长周期任务,Agent 执行框架(即围绕语言模型管理上下文构建、工具交互、编排和验证的基础设施层)通常比其包装的模型更能决定 Agent 性能。我们形式化并辩护了绑定约束论题:在此情况下,性能方差更多地由框架配置而非模型选择决定,当前评估协议因此系统性地将框架层面的提升错误归因于模型改进。我们从三个方面支持这一论点。首先,控制论形式化将框架视为闭环动态系统的控制器,LLM 为其管理的随机策略,这解释了为什么小的框架变化可以产生超过替换模型所带来的性能变化。其次,已发表的基准测试、行业部署以及受控方差分解表明,框架引起的方差可能显著超过模型引起的方差,包括模型排名反转的情况。第三,我们提出了一个框架感知的评估框架,包含披露标准和方差分解协议。在框架规范被公开之前,长周期 Agent 的排行榜比较应被视为不完整且可能具有误导性。

英文摘要

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.
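One simple way to picture a variance decomposition over a model-by-harness score grid is the standard two-way sum-of-squares split sketched below. This is a generic textbook decomposition chosen for illustration, not the protocol proposed in the paper, and the scores in the usage note are made up.

```python
from statistics import mean
from typing import Dict, Sequence


def variance_shares(scores: Sequence[Sequence[float]]) -> Dict[str, float]:
    """scores[i][j] is the score of model i under harness j (one run per cell).

    Returns the share of the total sum of squares attributable to the model
    factor, the harness factor, and the residual (interaction plus noise).
    """
    n_models, n_harness = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    model_means = [mean(row) for row in scores]
    harness_means = [mean(scores[i][j] for i in range(n_models)) for j in range(n_harness)]

    ss_model = n_harness * sum((m - grand) ** 2 for m in model_means)
    ss_harness = n_models * sum((h - grand) ** 2 for h in harness_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    if ss_total == 0:
        return {"model": 0.0, "harness": 0.0, "residual": 0.0}
    ss_resid = ss_total - ss_model - ss_harness
    return {
        "model": ss_model / ss_total,
        "harness": ss_harness / ss_total,
        "residual": ss_resid / ss_total,
    }
```

For a hypothetical grid `[[0.30, 0.60], [0.35, 0.65]]` (two models, two harnesses), almost all of the variation falls on the harness factor, which is the kind of pattern the thesis predicts for comparable frontier models.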

2605.23949 2026-05-26 cs.MA cs.AI 版本更新

SODE: Analyzing Social Dynamics in LLM Agents

SODE:分析LLM智能体中的社会动态

Inseo Jung, Yoonseok Oh, Kyungryul Back, Jinkyu Kim, Jungbeom Lee

发表机构 * Department of Computer Science, Korea University(高丽大学计算机科学系)

AI总结 提出SODE框架,通过直接互惠、间接互惠和群体动态三个进化维度评估LLM智能体的社会行为,发现指令调优模型存在被动顺从问题,推理模型则因短视优化破坏长期合作,并展示长期框架可激发推理模型的互惠能力。

详情
AI中文摘要

随着大型语言模型(LLMs)演变为交互式智能体,理解它们在人类社会动态中的行为对齐变得至关重要。虽然行为博弈论为研究这些互动提供了框架,但先前的工作主要依赖于基于结果的指标,如平均分数。这种关注忽略了促进可持续合作的机制,因为相同的分数可能源于截然不同的策略。为了弥合这一差距,我们引入了SODE(社会动态评估),这是一个框架,它从三个进化维度评估LLM智能体:用于策略适应的直接互惠、用于声誉敏感的间接互惠以及用于合作韧性的群体动态。应用SODE揭示了系统性的分歧:指令调优模型通常表现出“被动顺从”,使其容易受到利用,而推理模型则优先考虑短期优化,破坏了长期合作。值得注意的是,我们证明了“长期框架”可以激发推理模型中的互惠能力。因此,SODE为将AI智能体与复杂的人类社会动态对齐提供了一个系统的、基于机制的基准。

英文摘要

As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becomes essential. While behavioral game theory offers a framework to study these interactions, previous work has predominantly relied on outcome-based metrics such as average scores. This focus overlooks the mechanisms that facilitate sustainable cooperation, as identical scores can be derived from vastly different strategies. To bridge this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents across three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction-tuned models often exhibit "passive compliance" that renders them vulnerable to exploitation, while reasoning models prioritize short-horizon optimization, destabilizing long-term cooperation. Notably, we demonstrate that a "long-horizon framing" can unlock reciprocal capabilities in reasoning models. Thus, SODE offers a systematic, mechanism-grounded benchmark for aligning AI agents with complex human social dynamics.

2605.23946 2026-05-26 cs.CY cs.AI 版本更新

AI-Driven Controlled Environment Agriculture as Resilient Infrastructure for U.S. Fresh-Produce Supply Chains

AI驱动的可控环境农业作为美国新鲜农产品供应链的韧性基础设施

Andrii Vakhnovskyi

发表机构 * IOGRU LLC

AI总结 本文提出CEA-RIF 2.0框架,评估AI驱动的可控环境农业作为区域新鲜农产品连续性基础设施的七个维度,并论证AI只有在改善运营指标时才创造韧性价值。

Comments 12 pages, 5 figures, 7 tables. Includes open-data greenhouse control metrics demonstration

详情
AI中文摘要

气候波动、区域生产集中、劳动力约束、网络风险以及对长途新鲜农产品供应链的依赖暴露了美国新鲜农产品和特种作物系统的脆弱性。可控环境农业(CEA)可以通过将选定的生产转移到受保护的、传感器丰富的环境中来减少部分暴露,但近期风险投资的垂直农场失败表明,CEA不能被视为普遍的粮食安全解决方案。本文提出了可控环境农业韧性基础设施框架2.0版(CEA-RIF 2.0),用于评估AI驱动的CEA作为针对性的区域新鲜农产品连续性基础设施。该框架评估七个维度:供应连续性、气候隔离、能源与电网集成、水与养分循环、信息物理可靠性、经济可行性以及治理与部署。借鉴美国政府报告、同行评审的CEA和能源文献、需求响应研究、网络安全标准、国际智能农业项目、2025-2026年融资和政策信号以及公共自动温室数据集,本文论证AI只有在改善可测量的运营结果(如气候稳定性、能源灵活性、产量一致性、异常检测、劳动生产率和故障安全恢复)时才创造韧性价值。分析将AI驱动的CEA重新定义为信息物理基础设施问题:能源感知、电网交互、安全、可互操作、区域分布、财务纪律严格并与公共韧性目标相连。本文最后提出了一个研究议程,包括跨机构试验台、开放数据集、标准化指标、需求响应试点和信息物理参考架构。

英文摘要

Climate volatility, regional production concentration, labor constraints, cyber risk, and dependence on long-distance fresh-produce supply chains expose vulnerabilities in U.S. fresh-produce and specialty-crop systems. Controlled environment agriculture (CEA) can reduce some exposure by moving selected production into protected, sensor-rich environments, but recent failures in venture-backed vertical farming show that CEA cannot be treated as a universal food-security solution. This paper proposes the Controlled Environment Agriculture Resilience Infrastructure Framework, Version 2.0 (CEA-RIF 2.0), for evaluating AI-driven CEA as targeted regional fresh-produce continuity infrastructure. The framework assesses seven dimensions: supply continuity, climate isolation, energy and grid integration, water and nutrient circularity, cyber-physical reliability, economic viability, and governance and deployment. Drawing on U.S. government reports, peer-reviewed CEA and energy literature, demand-response research, cybersecurity standards, international smart-agriculture programs, 2025-2026 financing and policy signals, and public autonomous-greenhouse datasets, the paper argues that AI creates resilience value only when it improves measured operational outcomes such as climate stability, energy flexibility, yield consistency, anomaly detection, labor productivity, and safe recovery from faults. The analysis reframes AI-driven CEA as a cyber-physical infrastructure problem: energy-aware, grid-interactive, secure, interoperable, regionally distributed, financially disciplined, and connected to public resilience goals. The paper concludes with a research agenda for interagency testbeds, open datasets, standardized metrics, demand-response pilots, and cyber-physical reference architectures.

2605.23945 2026-05-26 cs.AI cs.DC 版本更新

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

通过自适应张量并行加速同步RLHF训练中的长尾生成

Long Zhao, Qinghe Wang, Jiaan Zhu, Youhui Bai, Zewen Jin, Chaoyi Ruan, Shengnan Wang, Cheng Li

发表机构 * Anhui University(安徽大学) University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合国家科学中心人工智能研究院)

AI总结 针对同步RLHF训练中长尾生成导致的GPU利用率低问题,提出自适应张量并行方法PAT,通过预测引导的在线重配置和轻量级状态迁移机制,显著降低生成延迟和端到端训练迭代延迟。

Comments 11 pages, 14 figures

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为提升模型质量的关键后训练范式。然而,同步三阶段RLHF流水线常受限于生成阶段,其中响应长度偏斜导致解码过程中有效批量大小迅速缩小,使得GPU在少数长响应未完成时处于低利用率状态。主流框架采用静态张量并行(TP)配置,无法适应变化的批量特征,留下了大量性能提升空间。我们提出PAT,一种自适应TP方法,在每次RLHF迭代的生成阶段动态重配置TP。PAT引入两项关键技术。首先,一种预测引导的在线重配置方法基于离线性能分析决定重配置时机和目标TP配置,仅在预测延迟收益超过重配置开销时触发重配置。其次,一种轻量级在线重配置机制仅更新受TP变化影响的状态和布局:通过基于成本模型在KV缓存迁移和重计算之间选择,适配未完成的解码状态;执行原地权重重分片;并重用缓存的通信组。我们在SGLang之上实现PAT,并将其集成到VeRL框架中。使用DeepScaleR对LLaMA3.1-8B和Qwen3-14B的评估表明,与原始VeRL设置相比,PAT将生成延迟降低高达34.6%,端到端RLHF训练迭代延迟降低高达27.2%。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has become a key post-training paradigm for improving model quality. However, the synchronous three-stage RLHF pipeline is often bottlenecked by the generation stage, where response-length skew causes the effective batch size to shrink rapidly during decoding, leaving GPUs underutilized while a few long responses remain unfinished. Mainstream frameworks employ a static tensor parallelism (TP) configuration that cannot adapt to changing batch characteristics, leaving substantial performance headroom unexplored. We propose PAT, an adaptive TP method that dynamically reconfigures TP during the generation stage of each RLHF iteration. PAT introduces two key techniques. First, a predictor-guided online reconfiguration method decides both the reconfiguration point and the target TP configuration based on offline profiling, triggering reconfiguration only when the predicted latency benefit outweighs the reconfiguration overhead. Second, a lightweight online reconfiguration mechanism updates only the states and layouts affected by TP changes: it adapts unfinished decoding states through a cost-model-based choice between KV-cache migration and recomputation, performs in-place weight resharding, and reuses cached communication groups. We implement PAT on top of SGLang and integrate it with the VeRL framework. Evaluations on LLaMA3.1-8B and Qwen3-14B using DeepScaleR show that PAT reduces generation latency by up to 34.6% and end-to-end RLHF training iteration latency by up to 27.2% compared to the original VeRL setup.
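The trigger rule in the abstract (reconfigure only when the predicted latency benefit outweighs the reconfiguration overhead) can be sketched as below. This is a minimal illustration, not PAT's implementation: the `TPOption` fields, the `choose_tp` helper, and all numbers are hypothetical, and the predicted latencies are assumed to come from some offline profile.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TPOption:
    """Hypothetical candidate tensor-parallel configuration."""
    tp_degree: int
    predicted_remaining_latency_s: float  # assumed to come from offline profiling
    reconfig_overhead_s: float            # assumed cost of switching to this option


def choose_tp(current_remaining_latency_s: float,
              options: Iterable[TPOption]) -> Optional[TPOption]:
    """Return the option with the largest positive net gain, or None to stay put."""
    best, best_gain = None, 0.0
    for opt in options:
        # Net gain = latency saved by switching minus the cost of switching.
        gain = (current_remaining_latency_s
                - opt.predicted_remaining_latency_s
                - opt.reconfig_overhead_s)
        if gain > best_gain:
            best, best_gain = opt, gain
    return best


candidates = [TPOption(2, 8.0, 1.0), TPOption(4, 6.0, 5.0)]
decision = choose_tp(10.0, candidates)
```

Returning `None` when no option has a positive net gain keeps the current configuration, which mirrors the "only when the benefit outweighs the overhead" condition.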

2605.23944 2026-05-26 cs.AI math.PR 版本更新

Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search

AI辅助搜索中的通信与推荐集规模合理设定

Jing Dong, Prakirt Raj Jhunjhunwala, Yash Kanoria

发表机构 * Columbia Business School, Columbia University(哥伦比亚大学商学院) Amazon.com Inc.(亚马逊公司)

AI总结 通过建模用户与AI推荐系统的交互,研究在考虑通信成本和搜索成本时,如何优化消息精度和推荐集大小以最大化用户期望收益。

详情
AI中文摘要

我们建模了用户与AI驱动的推荐系统之间的交互。用户通过代价高昂且带有噪声的消息传递偏好信息来启动过程。AI助手作为贝叶斯代理,解释用户消息以形成关于其真实偏好的后验信念,并做出产品推荐。具体来说,它决定呈现多少推荐,以最大化用户最终选择的期望效用,同时考虑推荐集大小带来的搜索成本。我们使用基于互信息的成本函数来建模用户在交互过程中产生的两种不同成本:(i) 通信成本,随偏好消息的精度增加而增加;(ii) 搜索成本,随AI助手提供的推荐集大小增加而增加。我们研究位于d维空间中的产品和偏好,并询问如何最大化用户的期望收益。对于大d,我们描述了在两种不同的推荐采样分布下(即从产品宇宙中采样推荐),最优消息精度和推荐集大小如何依赖于成本参数:(i) 贝叶斯后验信念,和(ii) 优化的倾斜分布。在后验采样方案(i)下,我们识别出一种混合机制,其中高效的交互策略需要联合优化用户传达的信息量(以比特计)和AI助手提供的推荐数量。在倾斜采样方案(ii)下,我们的结果表明,最优交互策略仅使用通信和搜索中的一种,倾向于选择成本较低的那一种。

英文摘要

We model the interaction between a user and an AI-driven recommendation system. The user initiates the process by conveying preference information through a costly and noisy message. The AI assistant, acting as a Bayesian agent, interprets the user's message to form a posterior belief about their true preferences and make product recommendations. In particular, it determines how many recommendations to present so as to maximize the user's expected utility from their final choice, while accounting for the search cost induced by the size of the recommendation set. We use mutual-information-based cost functions to model the two distinct costs incurred by the user during the interaction: (i) a communication cost, which increases with the precision of their preference message, and (ii) a search cost, which increases with the size of the recommendation set provided by the AI assistant. We study products and preferences that live in d-dimensional space, and ask how the user's expected payoff can be maximized. For large d, we characterize how optimal message precision and recommendation set size depend on the cost parameters, under two distinct distributions from which recommendations can be sampled from the product universe: (i) Bayes' posterior belief, and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime, in which an efficient interaction policy requires jointly optimizing the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. In the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever of them is less costly.

2605.23943 2026-05-26 cs.AI physics.hist-ph quant-ph 版本更新

Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

需求下的时空形成:语境实现与形式依赖概率

Song-Ju Kim

发表机构 * Sobin Institute LLC(索宾研究所有限公司)

AI总结 本文提出一种新解释:量子概率是在有限状态需求下语境时空形成的固定时空投影,通过需求驱动的非布尔实现机制解释非交换性、干涉和类量子概率。

Comments 19 pages, 1 figure

详情
AI中文摘要

量子认知学通常通过在固定事件结构上用量子概率替代经典概率来解释顺序效应、语境性和全概率律违反。本文提出一种不同的解释:量子概率是在有限状态需求下语境时空形成的固定时空投影。该框架并非从时间、空间、对象或概率出发,而是从需求出发,例如有限表征能力、单态语义稳定性、语境敏感干预、避免显式语境标签、连贯世界形成和主体间可变换性。当这些需求无法在单一全局布尔事件结构中实现时,在固定时空投影下,这种不匹配表现为非交换性、干涉和类量子概率。基于先前的单态语境性方法,我们将经典语境簿记成本重新解释为语境时空形成的固定时空阴影。经典表征中的辅助记忆或语境标签,在此解释中对应于局部布尔逻辑世界之间的类似和乐的不匹配。干涉项是当局部经典实现贡献被非平凡粘合并投影回固定经典时空形式时产生的交叉项。结果是一种先验-操作实在论解释:对象性、事件性、概率和时空被视为需求下的实现形式,而客观性由跨观察者和历史依赖的时空形成所保持的不变量定义。

英文摘要

Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework begins not with time, space, objects, or probabilities, but with requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch appears, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on prior single-state approaches to contextuality, we reinterpret classical contextual bookkeeping cost as the fixed-spacetime shadow of contextual spacetime formation. Auxiliary memory or context labels in a classical representation correspond, in this account, to holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term generated when locally classical realization contributions are nontrivially glued and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by invariants preserved across observer- and history-dependent spacetime formations.

2605.23942 2026-05-26 cs.AI 版本更新

A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

基于变换和语义等价性的认知过程动力学框架

Carlo Cattani, Dioneia Motta Monte-Serrat

发表机构 * Engineering School, DEIM, University of Tuscia, (VT), 01100, Italy(工程学院,DEIM,图齐亚大学,(VT),01100,意大利) Department of Physics, University of Sao Paulo, USP(物理系,圣保罗大学,USP) Department of Law, University of Ribeirao Preto, Unaerp, Brazil(法律系,里贝拉奥普雷托大学,Unaerp,巴西)

AI总结 提出一个基于变换和语义等价性的动力学框架,通过迭代更新规则建模认知过程,并利用不动点论证和收缩条件确保稳定性,在语言应用中展示上下文依赖解释的轨迹。

详情
AI中文摘要

本文提出一个结构性和动力学框架,从控制论视角建模认知过程。认知状态表示为状态空间中的元素,通过迭代更新规则演化: \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] 其中 $f$ 描述内部变换,$F$ 表示解释映射,$\pi$ 强制语义等价。该模型被解释为整合变换、观察和稳定的反馈系统。引入范畴论表述以捕捉组合结构,并通过不动点论证和收缩条件分析相关动力学,确保稳定性。为展示该框架的操作特性,提供了计算示例和诱导动力学的定性分析。一个具体的语言应用展示了如何将上下文依赖的解释建模为朝向稳定语义类的轨迹。所提出的方法连接了动力系统、范畴论和认知建模,提供了将认知视为朝向不变解释的反馈驱动过程的统一表示。

英文摘要

This paper proposes a structural and dynamical framework for modeling cognitive processes within a cybernetic perspective. Cognitive states are represented as elements of a state space evolving through an iterative update rule of the form \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] where $f$ describes internal transformations, $F$ represents interpretative mappings, and $\pi$ enforces semantic equivalence. The model is interpreted as a feedback system integrating transformation, observation, and stabilization. A categorical formulation is introduced to capture compositional structure, while the associated dynamics are analyzed through fixed-point arguments and contraction conditions ensuring stability. To demonstrate the operational character of the framework, a computational illustration is provided, together with a qualitative analysis of the induced dynamics. A concrete linguistic application shows how context-dependent interpretation can be modeled as a trajectory toward a stable semantic class. The proposed approach connects dynamical systems, category theory, and cognitive modeling, and provides a unified representation of cognition as a feedback-driven process evolving toward invariant interpretations.
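The update rule in the abstract can be illustrated with a toy fixed-point iteration. This is only a sketch under strong assumptions: the state is a single real number, `f` and `F` are made-up affine maps whose composition is a contraction with factor 0.5, and `pi` is a hypothetical projection that picks a class representative by rounding to three decimals. None of these choices come from the paper.

```python
def f(x: float) -> float:
    """Hypothetical internal transformation (halves the state)."""
    return 0.5 * x


def F(y: float) -> float:
    """Hypothetical interpretative mapping (constant shift)."""
    return y + 2.0


def pi(z: float) -> float:
    """Hypothetical equivalence projection: one representative per 0.001-wide class."""
    return round(z, 3)


def iterate_to_fixed_point(x0, f, F, pi, max_steps=1000):
    """Apply X_{t+1} = pi(F(f(X_t))) until the state stops changing."""
    x = x0
    for step in range(1, max_steps + 1):
        x_next = pi(F(f(x)))
        if x_next == x:
            return x, step
        x = x_next
    raise RuntimeError("no fixed point reached within max_steps")


x_star, steps = iterate_to_fixed_point(0.0, f, F, pi)
```

Without the projection the map x -> 0.5x + 2 has the unique fixed point 4, so the iteration settles on a representative within rounding distance of 4.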

2605.23941 2026-05-26 cs.AI cs.RO 版本更新

MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics

MEMOR-E: 面向阿尔茨海默病辅助机器人的上下文与微调大语言模型个性化

Maissa Abir Smaili, Eren Sadikoglu, Ransalu Senanayake

发表机构 * Istanbul Medipol University(伊斯坦布尔梅迪波大学) Arizona State University(亚利桑那州立大学)

AI总结 提出移动四足机器人MEMOR-E,结合微调与上下文学习的大语言模型,实现阿尔茨海默病患者的个性化认知支持与可解释人机交互。

Comments 8 pages, 14 figures

详情
AI中文摘要

阿尔茨海默病是一种神经退行性疾病,其特征是记忆和语言能力进行性衰退,导致日常生活独立性降低,从而激发社交辅助机器人的支持需求。本文介绍了MEMOR-E,一种配备交互式平板界面的移动四足机器人,通过药物提醒、日常指导、记忆导向互动和陪伴来协助患者和护理人员。我们评估了微调大语言模型(LLMs)以模拟阶段一致的认知行为并解释标准神经心理学语言任务中响应的可行性,使用了235名阿尔茨海默病患者的音频转录和合成生成的健康对照数据。我们还报告了在LLMs中使用上下文学习(ICL)的结果,其中第二个LLM生成了领域和严重程度级别的认知错误摘要。我们的结果表明,MEMOR-E能够生成阶段感知的非诊断性认知摘要,支持个性化辅助互动,同时可解释AI机制将模型输出转化为透明、人类可读的证据,以实现护理人员监督和可信赖的人机交互。

英文摘要

Alzheimer's disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily life, motivating socially assistive robotic support. This paper presents MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory-oriented interactions, and companionship. We evaluated the feasibility of fine-tuning large language models (LLMs) to emulate stage-consistent cognitive behavior and interpret responses across standard neuropsychological language tasks, using audio transcriptions from 235 Alzheimer's patients and synthetically generated healthy controls. We also report findings on using in-context learning (ICL) in LLMs, where a second LLM produced domain- and severity-level cognitive error summaries. Our results show that MEMOR-E can generate stage-aware, non-diagnostic cognitive summaries that support personalized assistive interactions, while explainable AI mechanisms translate model outputs into transparent, human-readable evidence to enable caregiver oversight and trustworthy human-robot interaction.

2605.23940 2026-05-26 cs.AI cs.CL 版本更新

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

残差漂移主导多轮约束推理中的矛盾

Sebastien Kawada

AI总结 通过构建DRIFT-Bench基准和MUS-Repair方法,发现多轮推理系统的主要失败模式是可满足漂移而非逻辑矛盾,残差错误中98-100%为可满足漂移。

Comments Published at ICLR 2026 Workshop on Reasoning and Planning for LLMs. 18 pages. ICLR page: https://iclr.cc/virtual/2026/10017484 Code: https://github.com/kaons-research/drift-bench

详情
AI中文摘要

多轮推理系统如何失败?预期的答案是逻辑矛盾,即系统维护的状态变得不可满足。我们表明,主导模式反而是可满足漂移,即内部状态保持一致,而返回的答案默默违反先前的承诺。我们构建了DRIFT-Bench(将推理分解为失败类型),这是一个包含三个约束领域816个测试问题的求解器辅助基准,并在四个开源模型(8B-120B参数)上评估了四种方法。MUS-Repair方法将最小不可满足子集反馈给生成器,在所有设置中表现最强(比最佳非MUS基线高+1.8到+15.0个百分点)。但核心发现是修复留下的问题。在结构化反馈后,模型很少自相矛盾。它们会遗忘。残差错误在所有设置中98-100%是可满足漂移,而矛盾降至接近零。可靠的多轮系统必须单独验证返回的答案尊重维护的状态。代码可在https://github.com/kaons-research/drift-bench获取。

英文摘要

How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.
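The distinction between contradiction and satisfiable drift can be made concrete with a small checker. The sketch below is an assumption-laden toy, not DRIFT-Bench's solver instrumentation: constraints are plain Python predicates, satisfiability is decided by brute force over small finite domains, and the `classify` function and its labels are hypothetical names.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]


def classify(constraints: List[Constraint],
             domains: Dict[str, Iterable[int]],
             answer: Assignment) -> str:
    """Label a turn as 'contradiction', 'satisfiable_drift', or 'ok'."""
    names = list(domains)
    # Brute-force check: does any assignment satisfy every maintained constraint?
    satisfiable = any(
        all(c(dict(zip(names, values))) for c in constraints)
        for values in product(*(domains[n] for n in names))
    )
    if not satisfiable:
        return "contradiction"      # the maintained state itself is inconsistent
    if all(c(answer) for c in constraints):
        return "ok"                 # the returned answer respects the state
    return "satisfiable_drift"      # state is consistent, but the answer violates it


domains = {"x": range(5), "y": range(5)}
state = [lambda a: a["x"] < a["y"], lambda a: a["y"] <= 3]
```

Checking the returned answer separately from the maintained state is the point: a consistent state says nothing about whether the answer still honors earlier commitments.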

2605.23939 2026-05-26 cs.AI cs.LG 版本更新

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

DRIVE:在持续学习下为Web代理建模推理与交互层面的技能

Xirui Liu, Sihang Zhou, Yanning Hou, Rong Zhou, Haoyuan Chen, Maolin He, Siwei Wang, Hao Chen, Jian Huang

发表机构 * College of Intelligence Science and Technology, National University of Defense Technology(智能科学与技术学院,国防科技大学) College of Computer Science and Technology, National University of Defense Technology(计算机科学与技术学院,国防科技大学)

AI总结 提出DRIVE框架,通过将历史经验分离为自然语言推理技能和程序化交互技能,并采用场景感知协调机制,解决Web代理在持续学习中推理与交互知识纠缠的问题,在WebArena上平均任务成功率提升7.3个百分点。

Comments 35 pages, 5 figures

详情
AI中文摘要

Web代理需要高层推理(用于任务分解)和低层交互(用于页面元素操作)来执行不同任务。然而,这些知识类型存在根本差异:推理知识(例如,预订航班需要首先搜索路线)是抽象的且可跨网站迁移,而交互知识(例如,在站点A的特定坐标点击搜索按钮)严重依赖于页面特定上下文。现有方法统一存储经验。这造成了一个困境:抽象表示在具体页面上失去可执行性,而具体表示无法跨领域泛化。这种纠缠限制了能力积累:在新网站上,代理要么因表面差异而无法识别可重用的任务逻辑,要么尝试基于过时页面结构的不可行操作。为了解耦它们,我们提出DRIVE,一个双层技能建模框架,将历史经验分离为自然语言推理技能(捕获可迁移的任务逻辑)和程序化交互技能(将抽象动作接地到可执行操作)。一种场景感知协调机制根据任务语义自适应地检索和调用这些双层技能。DRIVE还使用技能级反思来识别层次特定的失败模式,实现有针对性的技能库扩展和精炼。在五个WebArena领域上的实验表明,DRIVE达到了52.8%的平均任务成功率,比无技能基线高出7.3个百分点。进一步的消融实验显示,推理和交互技能提供了不同且互补的益处,支持将可迁移的任务逻辑与可执行的页面级操作分离。

英文摘要

Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

2605.23938 2026-05-26 cs.AI cs.CY cs.LG 版本更新

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

LLM介导的普适系统中的权威倒置:当模型信任用户胜过传感器

Long Zhang, Zi-bo Qin, Wei-neng Chen

发表机构 * School of Computer Science and Engineering, South China University of Technology(华南理工大学计算机科学与工程学院)

AI总结 本研究揭示了大语言模型在融合传感器与用户冲突信息时,由于格式依赖性导致数值传感器数据被自然语言用户主张支配的权威倒置现象,并提出了几何框架、审计指标(CIR和AAI)以及推理时层干预方法(GAC)来诊断和缓解该问题。

详情
AI中文摘要

大语言模型(LLM)越来越多地融合普适系统中的异构输入。然而,当传感器测量值与用户主张冲突时,LLM如何隐式分配权威尚未被研究,这引发了在物理传感必须保持优先级的部署场景中的关键可靠性问题。与显式的传统融合不同,LLM将权威分配隐藏在学习的表示中。我们发现这种分配严重依赖于格式:数值传感器数据未能整合到与答案相关的模型方向中,使得自然语言主张主导最终决策,我们将这种现象称为“权威倒置”(Authority Inversion)。为了诊断和缓解这一问题,我们开发了一个上下文整合的几何框架,引入了两个可计算的审计指标,即上下文整合比(CIR)和权威对齐指数(AAI),并提出了几何权威校准(GAC),一种推理时的层级干预方法,以抑制错位的用户权威。在四个数据集(共576个冲突实例)上评估四个模型(参数规模4B至35B,三种架构),揭示了极端的倒置:在数值任务上,模型表现出接近零的传感器信任(AAI = -0.805,Cohen's d = -2.14),且不受模型容量影响。验证我们的几何框架,理论引导的因果注入翻转了80.2%的错误决策(随机对照<0.4%)。实际应用中,GAC将HAR准确率从0–1.6%提升至21.9–27.5%,优于提示基线。最终,LLM介导系统中的权威分配必须被显式审计并根据应用特定配置,而不是保持隐式。

英文摘要

Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when sensor measurements and user claims conflict remains unexamined, raising critical reliability concerns for deployments where physical sensing must retain priority. Unlike explicit traditional fusion, LLMs bury authority allocation within learned representations. We discover this allocation is severely format-dependent: numerical sensor data fails to integrate into answer-relevant model directions, allowing natural-language claims to dominate the final decision, a phenomenon we term Authority Inversion. To diagnose and mitigate this, we develop a geometric framework of context integration, introduce two computable audit metrics, specifically the Context Integration Ratio (CIR) and Authority Alignment Index (AAI), and propose Geometric Authority Calibration (GAC), an inference-time layer-level intervention to suppress misplaced user authority. Evaluating four models (4B to 35B parameters, three architectures) across four datasets totaling 576 conflict instances reveals extreme inversion: on numerical tasks, models exhibit near-zero sensor trust (AAI = -0.805, Cohen's d = -2.14), unaffected by model capacity. Validating our geometric framework, theory-guided causal injection flips 80.2% of incorrect decisions (vs. <0.4% for random controls). Practically, GAC improves HAR accuracy from 0-1.6% to 21.9-27.5%, outperforming prompting baselines. Ultimately, authority allocation in LLM-mediated systems must be explicitly audited and application-specifically configured rather than left implicit.

2605.23936 2026-05-26 cs.AI cs.LG 版本更新

Fuzzy, Neutrosophic, and Uncertain Graph Theory: Properties and Applications

模糊、中智和不确定图论:性质与应用

Takaaki Fujita, Florentin Smarandache

AI总结 本书系统综述了不确定性下的图论,以不确定图框架为核心,统一了模糊、中智等模型,并介绍了扩展图类及其在分子图、决策系统、图神经网络等领域的应用。

Comments 326 pages. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-197250204-4

详情
AI中文摘要

本书全面系统地综述了不确定性下的图论,特别强调了不确定图框架的统一作用。它回顾了模糊、中智及相关模型中的基本概念、结构性质、图类和图参数,同时介绍了广泛的扩展,如不确定有向图、超图、超超图和动态图。除了理论发展,本书还探讨了实际应用,包括不确定分子图、决策系统、图神经网络、知识图谱和认知地图。通过从共同视角组织多样化的不确定性感知图模型,本书为理解它们在复杂系统中的关系、能力和应用提供了一个连贯的框架。

英文摘要

This book presents a comprehensive and systematic survey of graph theory under uncertainty, with particular emphasis on the unifying role of the uncertain graph framework. It reviews fundamental concepts, structural properties, graph classes, and graph parameters within fuzzy, neutrosophic, and related models, while also introducing a wide range of extensions such as uncertain digraphs, hypergraphs, superhypergraphs, and dynamic graphs. In addition to theoretical developments, the book explores practical applications, including uncertain molecular graphs, decision-making systems, graph neural networks, knowledge graphs, and cognitive maps. By organizing diverse uncertainty-aware graph models within a common perspective, this work provides a coherent framework for understanding their relationships, capabilities, and applications in complex systems.

2605.23935 2026-05-26 cs.AI cs.CY cs.MA cs.SE cs.SY eess.SY 版本更新

Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

操作化重构权威:自主智能体系统中的运行时构建、依赖解析与执行门控

Marcelo Fernandez - TraslaIA

发表机构 * TraslaIA

AI总结 本文提出一种运行时执行模型,通过动态依赖解析和恢复循环,确保动作仅在当前状态可构建权威时执行,从而保证安全性和条件活性。

Comments Agent Governance Series, Paper P6. Companion papers on arXiv: P0 (2604.17511), P1 (2603.18829), P2 (2604.17517). P3/4 and P5 submitted concurrently (pending arXiv IDs). Zenodo: 10.5281/zenodo.19699460

详情
AI中文摘要

自主智能体系统的失败不仅源于错误决策,还源于执行那些在运行时其权威不再成立的决策。先前的工作将重构权威(RAM)定义为有效执行的条件:仅当权威能从当前状态构建时,才允许执行动作。本文关注运行时强制执行问题:如何在运行系统中强制执行该条件。我们引入一种运行时执行模型,其中权威在动作时被评估,执行取决于其可构建性。这将执行状态空间从允许/拒绝扩展到第三种状态——暂停,表示由于不完整或不确定的可观测性导致权威未定义。我们定义了一个具体的执行协议,包括动态依赖解析、权威重构和显式决策语义。我们进一步引入一个恢复循环,将漂移检测(IML)与执行控制(ACP)集成,允许系统暂停执行、获取缺失信息并重新尝试权威重构。我们证明该模型保证了安全性——没有动作会在没有可构建权威的情况下执行——以及条件活性:当定义权威的变量变得可观测时,执行恢复。这项工作将重构权威操作化为一种运行时强制机制,提供了在真实系统中应用RAM所需的执行语义。

英文摘要

Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety -- no action is executed without constructible authority -- and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.
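The three-state execution semantics (admit, deny, halt) and the recovery loop can be sketched as follows. This is a hypothetical illustration rather than the paper's protocol: the `gate` and `run_with_recovery` functions, the dictionary-based state, and the `observe` callback are all assumed names, and "authority is undefined" is modeled simply as a required variable being unobserved.

```python
from enum import Enum
from typing import Any, Callable, Dict, Iterable, Optional


class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"   # authority undefined: a required variable is not observable


def gate(required: Iterable[str],
         state: Dict[str, Any],
         policy: Callable[[Dict[str, Any]], bool]) -> Decision:
    """Evaluate authority at action time from the current state only."""
    if any(state.get(name) is None for name in required):
        return Decision.HALT
    return Decision.ADMIT if policy(state) else Decision.DENY


def run_with_recovery(required: Iterable[str],
                      state: Dict[str, Any],
                      policy: Callable[[Dict[str, Any]], bool],
                      observe: Callable[[str], Optional[Any]],
                      max_attempts: int = 3) -> Decision:
    """On HALT, try to acquire missing variables, then re-attempt the gate."""
    required = list(required)
    state = dict(state)  # work on a copy; never mutate the caller's state
    for _ in range(max_attempts):
        decision = gate(required, state, policy)
        if decision is not Decision.HALT:
            return decision
        for name in required:
            if state.get(name) is None:
                value = observe(name)
                if value is not None:
                    state[name] = value
    return gate(required, state, policy)


needs = ["role", "lease_valid"]
policy = lambda s: s["role"] == "operator" and s["lease_valid"]
```

In this toy, no action is admitted unless every required variable is observed and the policy holds, and execution resumes once `observe` starts returning the missing values, which loosely mirrors the safety and conditional-liveness claims.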

2605.23934 2026-05-26 cs.AI quant-ph 版本更新

Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

实用量子CIM赋能:基于全自主核心智能体大模型

Wang Rui, Lu Diannan

发表机构 * Department of Chemical Engineering, Tsinghua University(清华大学化学工程系)

AI总结 本研究将飞秒激光泵浦的相干伊辛机与LLM驱动的智能体系统结合,实现QUBO/Ising模型校准、约束权重决策迭代和文献方案快速验证,并完全基于国产大模型和硬件完成,同时发现智能体辅助量子计算迭代可反向增强智能体问题解决能力的新范式。

Comments 21 pages, 7 figures

详情
AI中文摘要

量子计算设备被认为是解决NP完全问题的强大工具。然而,其建模的复杂性给非专业人士带来了显著障碍,而约束权重和建模方法的繁琐迭代也消耗了专家的大量精力。为应对这些挑战,本研究通过利用LangGraph和LangChain框架,将飞秒激光泵浦的相干伊辛机(CIM)与LLM驱动的智能体系统集成。综合研究表明,大语言模型(LLMs)可以有效执行建模任务,如QUBO/Ising模型校准、约束权重决策迭代以及文献报道方案的快速验证。值得注意的是,所有这些任务都可以完全基于国产大模型实现,结合国内开发的CIM硬件,我们真正实现了完全依赖全自主智能体大模型和硬件的实用量子CIM赋能。这项工作成功实现了稳健的技术集成,为后续研究奠定了坚实基础。然而,它也指出了当前阶段大模型和量子计算这两个前沿领域持续存在的挑战。令人鼓舞的是,我们意外发现了一种有前景的新范式,其中智能体辅助的量子计算迭代积累的知识反向增强了智能体自身的问题解决能力,从而应对这些挑战。

英文摘要

Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling presents notable barriers for non-specialists, while the tedious iteration of constraint weights and modeling methodologies also consumes substantial expert effort. To address these challenges, this study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system by leveraging the LangGraph and LangChain frameworks. Comprehensive investigations demonstrate that large language models (LLMs) can effectively perform modeling tasks such as QUBO/Ising model calibration, iterative decisions on constraint weights, and rapid validation of literature-reported schemes. Notably, all of these tasks can be implemented entirely with domestic large models; combined with domestically developed CIM hardware, this achieves practical empowerment of the quantum CIM that relies fully on all-domestic agentic large models and hardware. This work successfully realizes robust technological integration, laying a solid foundation for subsequent research. Nevertheless, it also identifies the challenges that persist at the current stage in the two cutting-edge fields of large models and quantum computing. Encouragingly, we unexpectedly discover a promising new paradigm in which knowledge accumulated from agent-assisted quantum computing iterations reciprocally enhances the agent's own problem-solving capability, thereby addressing these challenges.

2605.23932 2026-05-26 cs.AI cs.CL cs.CY cs.LG 版本更新

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

当正确信念崩溃:LLMs在临床压力下的认知韧性

Boyu Xiao, Xiuqi Tian, Xuwen Song, Haochun Wang, Guanchun Song, Sendong Zhao, Bing Qin

发表机构 * Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China(社会计算与交互机器人研究院,哈尔滨工业大学,中国)

AI总结 研究LLMs在临床对话中面对逐步升级压力时信念稳定性问题,提出Med-Stress压力测试框架,发现知识-韧性差距,并设计RBED和R-FT方法提升鲁棒性。

Comments ACL 2026

详情
AI中文摘要

尽管在医学基准测试中准确率很高,但LLMs在临床对话中可能表现出严重的多轮谄媚行为,在逐步升级的压力下放弃最初正确的诊断。我们提出了Med-Stress,一个针对性的压力测试框架,用于评估在逐步升级压力下的信念稳定性。在九个前沿大型语言模型(LLMs)中,我们发现医学知识与鲁棒性之间存在明显的分离:高初始诊断能力并不意味着高信念稳定性,导致多个LLMs存在较大的知识-鲁棒性差距。为了缓解这种失败模式,我们提出了一种轻量级的推理时防御方法RBED(Role-Based Epistemic Defense),以及一种训练时方法R-FT(Resilience-oriented Fine-Tuning),该方法内化了基于证据的抗压能力。实验表明,R-FT几乎消除了信念变化,并显著提高了鲁棒性。

英文摘要

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress-test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, RBED (Role-Based Epistemic Defense), and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

2605.23931 2026-05-26 cs.AI cs.PL cs.SE 版本更新

BODHI: Precise OS Kernel Specification Inference

BODHI:精确的操作系统内核规范推断

Zhiming Chang, Ziyang Li

发表机构 * Department of Applied Mathematics and Statistics, Johns Hopkins University(约翰霍普金斯大学应用数学与统计学系) Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系)

AI总结 提出一种领域知识提示方法BODHI,通过结构化C到Python翻译指南增强少样本提示,在OSV-Bench基准上将Pass@1从55.10%提升至96.73%,缩小了通用代码生成与形式规范合成之间的差距。

详情
AI中文摘要

操作系统内核的形式化验证需要精确的规范来捕获系统调用的预期行为。手动编写这些规范需要深厚的领域专业知识,这促使使用大型语言模型(LLM)来自动化该过程。然而,在OSV-Bench(一个源自Hyperkernel操作系统内核的245个规范生成任务基准)中,最佳报告的Pass@1为55.10%。我们提出了一种领域知识提示方法(BODHI),该方法通过一个涵盖15类领域特定翻译模式的结构化C到Python翻译指南来增强标准的少样本提示。受结构化思维链(SCoT)提示的启发,该指南通过关注点分离来组织翻译,将前置条件提取和后置条件生成作为不同的类别处理。在来自六个提供商(Anthropic、Mistral、Amazon、DeepSeek、Meta、Alibaba)的九个模型上进行了评估,涵盖了密集、混合专家和推理架构,BODHI改进了所有测试的模型,提升幅度从+11%到+32%。最佳配置(Claude Opus 4.6 + BODHI)达到了96.73%的Pass@1。BODHI减少了语法和语义错误,对具有足够指令跟随能力以利用结构化参考材料的模型效果最强。这些结果表明,领域知识注入是一种与模型无关的技术,显著缩小了通用代码生成与形式规范合成之间的差距。

英文摘要

The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.

2605.23930 2026-05-26 cs.AI cs.LG cs.MA 版本更新

Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

量子青蛙:量化时间合作博弈中的涌现合作与难度缩放

Saad Mankarious

发表机构 * Gymnasium API

AI总结 通过强化学习分析量化时间合作博弈Quantum Frog,发现同步冲刺策略最优,合作训练可大幅提升成功率并缩短回合步数。

详情
AI中文摘要

我们引入了Quantum Frog,这是一个双人合作游戏,基于一种新颖的量化时间机制,其中环境仅在玩家行动时推进。受经典街机游戏Frogger启发,Quantum Frog要求两只青蛙穿越一个8×8的交通网格并一起到达远端。我们使用强化学习(RL)作为分析镜头来回答四个设计问题:(1)游戏难度如何随交通密度缩放,(2)最优单智能体策略是什么以及为什么,(3)独立和合作双智能体游戏之间的合作差距有多大,以及(4)当智能体被激励合作时会出现什么联合策略?我们通过五个升级阶段训练智能体:表格Q学习、深度Q网络(DQN)、独立DQN(IDQN)和多智能体近端策略优化(MAPPO,带有集中式评论家),针对一到六辆车的交通密度进行评估。我们的主要发现是:(i)量化时间机制使得冲刺策略(每一步直接向上移动)普遍最优,因为暴露于交通的时间被最小化;(ii)添加一个不协调的第二玩家比将单个专家玩家的交通量增加六倍更难;(iii)合作训练相对于独立智能体将联合成功率提高了+32–34个百分点,并将回合长度从约90步减少到约6步;(iv)涌现的合作策略是同步冲刺,而不是复杂的位置协调,这表明在时间关键的合作任务中,仅共享激励就足以使智能体对齐。这些发现为Quantum Frog的商业设计提供了具体、经验基础的指导,并为环境机制在塑造多智能体学习动态中的作用提供了更广泛的见解。

英文摘要

We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8×8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single-agent policy and why, (3) how large is the cooperation gap between independent and cooperative two-agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q-Learning, Deep Q-Network (DQN), Independent DQN (IDQN), and Multi-Agent Proximal Policy Optimisation (MAPPO with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized-time mechanic makes a rush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32–34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time-critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi-agent learning dynamics.

2605.23929 2026-05-26 cs.AI cs.SE 版本更新

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

面向LLM驱动的智能体工作流的可靠设计:优化延迟-可靠性-成本权衡

Ya-Ting Yang, Quanyan Zhu

发表机构 * New York University(纽约大学)

AI总结 本文通过引入参数化指数可靠性函数建模LLM与非LLM智能体的性能,提出水填充令牌分配策略,并刻画最优工作流可靠性的影子价格,以解决延迟、可靠性和成本之间的权衡问题。

详情
AI中文摘要

现代AI系统日益依赖由多个交互智能体组成的工作流,其中一些由大语言模型(LLM)驱动,另一些由传统计算模块驱动。本文分析了LLM驱动的智能体工作流中延迟、可靠性和成本之间的基本权衡。我们为LLM和非LLM智能体引入了性能模型,这些模型捕捉了计算努力与输出质量之间的关系,并利用参数化指数可靠性函数纳入了LLM智能体的推理和输出令牌的影响。然后,我们研究了在延迟和成本约束下顺序工作流的设计。主要结果包括一种水填充令牌分配策略,以及以影子价格形式刻画的最优工作流可靠性特征。

英文摘要

Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows. We introduce performance models for both LLM and non-LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water-filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.
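As a rough illustration of the token-allocation idea in the abstract above, the sketch below distributes a token budget across the stages of a sequential workflow by bisecting on a shadow price. The reliability form r_i(t) = 1 - exp(-a_i * t), the product-form workflow reliability, the single shared token budget, and the names `allocate_tokens`, `rates`, and `budget` are all assumptions made for this example; the paper's actual model and policy may differ.

```python
import math
from typing import List, Tuple


def allocate_tokens(rates: List[float], budget: float) -> Tuple[List[float], float]:
    """Toy water-filling-style allocation (illustrative assumptions only).

    Assumes stage i has reliability 1 - exp(-a_i * t_i) and that the workflow
    reliability is the product over stages. Maximising the log-reliability
    subject to sum(t_i) == budget gives the stationarity condition
        a_i / (exp(a_i * t_i) - 1) == mu
    for a common shadow price mu, i.e. t_i = log(1 + a_i / mu) / a_i.
    """
    def alloc(mu: float) -> List[float]:
        return [math.log1p(a / mu) / a for a in rates]

    # Total spend is decreasing in mu, so bisect on a log scale.
    # The bracket [1e-12, 1e12] is assumed wide enough for the given budget.
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if sum(alloc(mid)) > budget:
            lo = mid  # spending too much: the price must rise
        else:
            hi = mid  # spending too little: the price can fall
    mu = math.sqrt(lo * hi)
    return alloc(mu), mu


if __name__ == "__main__":
    tokens, price = allocate_tokens([0.01, 0.02], 1000.0)
    print(tokens, price)
```

Under these assumptions, a stage with a smaller rate a_i (one that converts tokens into reliability more slowly) receives more tokens, because every stage is pushed to the same marginal gain per token, which is the shadow price.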

2605.23928 2026-05-26 cs.AI cs.CL cs.DC cs.MA cs.PL cs.SE 版本更新

Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

Context: 通过可组合沙盒程序、声明式连接和结构化交互实现主动目标导向智能

Gregory Magarshak

发表机构 * Qbix, Inc. & Intercoin, Inc. New York USA Qbix, Inc. & Intercoin, Inc. IE University NYC

AI总结 提出Context架构,通过可组合沙盒程序、声明式连接和结构化交互实现主动目标导向智能,并证明其在成本、正确性和效率上的优势。

Comments 7 pages; third in a series with arXiv:2501.XXXXX (Magarshak Machine / SPACER) and arXiv:2502.XXXXX (Grokers)

详情
AI中文摘要

我们提出Context,Magarshak架构的智能层,用主动目标导向智能体取代被动查询-响应聊天机器人,无需等待用户提示即可推进共享任务。该架构基于三个相互增强的机制。编写时上下文组装通过Groker智能体预计算丰富的类型化属性,将交互上下文组装为图状态的确定性纯函数;上下文块在语义变化之间的轮次中字节相同,实现近100%的KV缓存重用。可组合沙盒智慧程序形成一个受管理的库,包含LM生成的命令式程序,通过类型化流关系声明式连接到目标类型,通过阶段排序组合,并在交互时执行而无需进一步调用LM。主动目标流状态机通过检查图状态并发出结构化交互内容(选项数组、治理功能、澄清提示)驱动对话走向终止状态,无需等待用户输入。我们证明了六个形式化结果:上下文稳定性定理,将每轮LM成本限制为语义变化率的函数;程序组合正确性定理;声明式连接正确性定理;主动主导定理,证明主动智能体在期望轮数到终止状态上弱主导被动智能体;协调开销消除与质量保持,建立多方目标聊天中的帕累托改进;以及跨平台投票一致性定理。已在开源Qbix/Safebox/Safebots栈中实现。

英文摘要

We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.

2605.23926 2026-05-26 cs.AI cs.LG 版本更新

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

多少思考才足够?量化和理解LLM推理中的冗余

Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang

发表机构 * Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文通过形式化推理冗余度量,量化了前沿推理模型在数学基准上高达61%-93%的步骤级冗余,并证明这种冗余是长度无关结果奖励的结构性后果,而非模型特定伪影。

详情
AI中文摘要

具备推理能力的大语言模型通过生成长思维链来解决难题,这严重增加了延迟、GPU时间和能耗。粗略检查其轨迹发现大量重构、验证和循环自省,然而这种深思熟虑中有多少实际上是必要的,从未在大规模上被度量或从第一性原理解释。本文填补了这两个空白。 我们直接依据推理模型本身来形式化推理冗余:一个正确轨迹的冗余度是其尾部可被截断的分段步骤的最大比例,条件是模型在被迫终止思考并输出最终答案时仍能给出正确答案。对四个前沿推理模型和两个数学基准的大规模量化表明,步骤级冗余一致地高——在我们研究的8个(模型,基准)条件下介于61%和93%之间,其中六个条件下中位关键前缀等于单个分段步骤——该发现对评判模型族的选择是稳健的,并且尽管在MATH-500上随问题难度增加而降低,所有四个模型即使在最难的Level-5问题上仍然显著冗余(ρ∈[46%,85%])。 然后我们证明这种冗余是长度无关结果奖励的结构性后果,而非模型特定伪影:在任何此类奖励下,没有有限期望停止时间是最优的。该结果无论RL算法、基础模型、数据分布或策略是通过RL还是蒸馏获得均成立;因此过度思考不是需要在单个模型中修补的缺陷,而是当前推理模型训练方式的结构性属性。代码:https://github.com/zhiyuanZhai20/how-much-thinking-is-enough

英文摘要

Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while the model $\pi$, forced to terminate thinking and emit a final answer, still produces the correct answer. A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high -- between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions -- that the finding is robust to the choice of judge family, and that although the redundancy $\rho$ decreases with problem difficulty on MATH-500, all four models remain substantially redundant ($\rho \in [46\%, 85\%]$) even on the hardest Level-5 problems. We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained. Code: https://github.com/zhiyuanZhai20/how-much-thinking-is-enough
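The redundancy definition in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function name `redundancy`, the callback `answers_correctly` (standing in for "force the model to stop thinking after this prefix and judge its final answer"), and the choice to consider only prefixes of at least one step are assumptions made here.

```python
from typing import Callable, Sequence


def redundancy(
    steps: Sequence[str],
    answers_correctly: Callable[[Sequence[str]], bool],
) -> float:
    """Fraction of trailing steps that can be cut while the answer stays correct.

    `answers_correctly(prefix)` is a hypothetical oracle: it should return True
    when the model, forced to answer after seeing only `prefix`, is correct.
    """
    n = len(steps)
    if n == 0:
        raise ValueError("trace must contain at least one step")
    # The largest truncatable tail corresponds to the shortest correct prefix,
    # so scan prefixes from shortest to longest and stop at the first success.
    for k in range(1, n + 1):
        if answers_correctly(steps[:k]):
            return (n - k) / n
    raise ValueError("no prefix is correct; redundancy is defined for correct traces only")
```

In this sketch, a four-step trace whose first step already suffices has redundancy 3/4, while a trace that needs every step has redundancy 0.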

2605.23925 2026-05-26 cs.CY cs.AI cs.CL 版本更新

Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

捕捉正确答案陷阱:分析学生推理时AI导师盲点的特征化

Moiz Imran, Sahan Bulathwela

发表机构 * Department of Computer Science, University College London, UK(英国伦敦大学学院计算机科学系) Centre for Artificial Intelligence, University College London, UK(英国伦敦大学学院人工智能中心)

AI总结 本研究通过分析Eedi数学平台的学生回答,发现智能辅导系统在评估学生推理时存在“正确答案陷阱”,即当学生通过错误推理得出正确答案时,系统难以检测其误解,并比较了微调T5与大型语言模型的检测性能。

Comments To be published at the International Conference on Artificial Intelligence in Education (AIED'26)

详情
AI中文摘要

智能辅导系统越来越多地提供对学生作业的自动反馈,但稳健的反馈需要评估推理过程,而不仅仅是最终答案。我们研究了一种称为“正确答案陷阱”(CAT)的失败模式:当学生通过错误推理得出正确答案时,模型会低估误解。通过分析来自Eedi数学平台的真实学生回答,我们展示了这些失败中有71%集中在仅两种问题类型上,这两种类型共享一个共同结构,即错误推理恰好产生了正确的数值答案。比较微调后的T5与前沿大型语言模型,我们发现改进的能力减少了但并未消除问题(检测准确率分别为84%和57%)。即使性能最好的模型,每检测到一个真正的误解就会产生大约四个误报,使得在现实班级规模下独立筛选不切实际。我们的发现表明,高总体准确率可能掩盖推理评估中的关键失败,并且对学生推理的仔细分析仍然需要人工判断。

英文摘要

Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not only final answers. We study a failure mode we call the correct answer trap (CAT): models under-detect misconceptions when students reach a correct answer via flawed reasoning. Analysing real student responses from the Eedi mathematics platform, we show that 71% of these failures concentrate in just two question types, both sharing a common structure where flawed reasoning happens to produce the correct numerical answer. Comparing a fine-tuned T5 with a frontier large language model, we find that improved capabilities reduce but do not eliminate the problem (84% vs 57% detection accuracy). Even the best-performing model generates roughly four false alarms for every genuine detection, making stand-alone screening impractical at realistic class sizes. Our findings demonstrate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.

2605.23922 2026-05-26 cs.CY cs.AI cs.LG 版本更新

High-Risk AI Systems and the Problem of Identity in the European AI Act

高风险人工智能系统与欧洲人工智能法案中的身份问题

Andrea Ferrario

发表机构 * Institute of Biomedical Ethics and History of Medicine, University of Zürich(生物医学伦理与医学史研究所,苏黎世大学) SUPSI, Dalle Molle Institute for Artificial Intelligence (IDSIA)(SUPSI,达勒莫利人工智能研究所) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文通过功能+框架分析欧盟AI法案中高风险AI系统的身份认定问题,提出同步身份测试方法以支持监管审计。

Comments Accepted as a non-archival paper at The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada

详情
AI中文摘要

欧盟人工智能法案(AIA)为高风险AI系统建立了一个生命周期治理制度,围绕事前合规评估、上市后监测以及在“重大修改”时重新评估。这些义务预设了AI身份判断:监管机构和提供者必须决定更新后的系统是否随时间保持同一系统。在这项工作中,我们展示了如何通过人工制品身份的功能+框架来澄清这一逻辑,该框架通过预期功能以及适当功能的上下文相关标准(即“AI可信度”)来个体化AI系统。我们进一步论证,AIA没有为同步身份提供内部、可审计的标准——即在给定时间两个AI系统在监管目的上是否应被视为相同——而是基本上将这种相同性判断委托给部门或协调工具。功能+提供了一个以预期功能和可信度概况及水平为基础的同步身份测试,使得同步身份决策在采购、责任和市场监督等治理环境中可检查。我们的贡献是一个概念性和审计视角:我们提供了AIA生命周期义务与功能+身份组件之间的对应图,并通过一个用于审计和争议情境的最小决策流程使同步案例在操作上清晰可读。最后,我们提出两个面向实施的建议:(1)更精确、可测试的预期用途报告,以及(2)标准化、可审计的可信度报告,支持跨时间和跨部署的可比性。

英文摘要

The EU Artificial Intelligence Act (AIA) establishes a lifecycle governance regime for high-risk AI systems built around ex-ante conformity assessment, post-market monitoring, and re-assessment upon "substantial modification." These obligations presuppose AI identity judgments: regulators and providers must decide when an updated system remains the same system over time. In this work, we show how this logic is clarified by the function+ framework of artifact identity, which individuates AI systems by their intended function together with context-sensitive criteria of appropriate functioning, captured as "AI trustworthiness." We further argue that the AIA does not provide an internal, auditable criterion for synchronic identity--when two AI systems at a given time should count as the same for regulatory purposes--and instead largely defers such sameness determinations to sectoral or harmonization instruments. function+ supplies a synchronic identity test anchored in intended function and trustworthiness profiles and levels, making synchronic identity decisions inspectable in governance settings such as procurement, liability, and market surveillance. Our contribution is a conceptual and auditing lens: we provide a correspondence map between AIA lifecycle obligations and function+ identity components, and we make the synchronic case operationally legible via a minimal decision flow for audit and dispute contexts. We conclude with two implementation-facing recommendations: (1) more precise, testable reporting of intended purpose, and (2) standardized, auditable trustworthiness reporting that supports comparability over time and across deployments.

2605.23921 2026-05-26 cs.CY cs.AI 版本更新

Authority Signals in Claude AI Health Citations: A Descriptive Analysis Using the Authority Signals Framework

Claude AI 健康引用中的权威信号:基于权威信号框架的描述性分析

Erin T. Jacques, Erela Datuowei, Elizabeth Quaye, Corey H. Basch, Arijit Chatterjee, Juanita Davis

发表机构 * Department of Health and Human Performance, York College, The City University of New York(健康与人类绩效系,约克学院,纽约市立大学) Department of Health Studies & Applied Educational Psychology, Teachers College, Columbia University(健康研究与应用教育心理学系,哥伦比亚大学教师学院) Department of Accounting and Finance, York College, The City University of New York(会计与金融系,约克学院,纽约市立大学) Department of Public Health, William Paterson University(公共卫生系,威廉·帕特森大学)

AI总结 本研究使用权威信号框架,分析 Anthropic 的 Claude AI 在回答消费者健康问题时引用来源的权威信号,发现机构来源占主导地位(97.8%),并建立了 Claude 引用行为的基线。

Comments 10 pages, 2 figures, 2 tables

详情
AI中文摘要

本研究旨在确定 Anthropic 的 Claude AI 在回答消费者健康问题时呈现来源所使用的权威信号。尽管围绕 LLM 生成健康引用的质量存在大量讨论,但关于引用来源的完整性以及这些来源在多大程度上被健康专业人士视为可信来源的信息有限。这项描述性横断面研究使用了 HealthSearchQA 数据集,该数据集包含由 Google Research 整理的 3,172 个消费者健康问题。排除后,最终分析了 3,075 个问题产生的 10,038 条引用。应用权威信号框架(Jacques 等人,2026)检查了 542 个来源的分层比例样本中四个领域的 10 个权威信号。已建立的机构来源占所有引用的 97.8%(n = 9,818)。医疗机构是最常被引用的组织类型(36.5%),其次是政府资源(31.6%)和专业协会(28.4%)。商业健康信息占 2.2%(n = 220)。前 10 个组织占所有引用的 57.8%,仅梅奥诊所就占 24.7%。在重点样本的商业来源中,86.4% 显示医疗审查声明,82.5% 使用模式标记,71.8% 具有全面内容,而传统机构来源在 Claude 的引用中出现时可能带有或不带有这些相同标记。随着 Anthropic 将 Claude 定位为符合 HIPAA 的医疗保健应用,这些发现为 Claude 的引用行为建立了基线,并展示了权威信号框架作为持续跨平台评估 AI 介导健康信息的工具的实用性。

英文摘要

This study seeks to determine the authority signals used by Anthropic's Claude AI in its presentation of sources when answering consumer health questions. While there exists a great deal of discourse around the quality of health citations that LLMs produce, there is limited information on the integrity of the sources the citations originate from, and to what extent the sources are, from what health professionals would consider, credible sources. This descriptive cross-sectional study used data from HealthSearchQA, which contains 3,172 consumer health questions curated by Google Research. After exclusions, a final dataset of 3,075 questions yielding 10,038 citations was analyzed. The Authority Signals Framework (Jacques et al., 2026) was applied to examine 10 authority signals across four domains for a disproportionate stratified sample of 542 sources. Established institutional sources accounted for 97.8% of all citations (n = 9,818). Medical Institutions were the most frequently cited organization type (36.5%), followed by Government Resources (31.6%) and Professional Associations (28.4%). Commercial Health Information comprised 2.2% (n = 220). The top 10 organizations accounted for 57.8% of all citations, with Mayo Clinic alone representing 24.7%. Among commercial sources in the focused sample, 86.4% displayed medical review statements, 82.5% used schema markup, and 71.8% had comprehensive content, while traditional institutional sources appeared in Claude's citations with or without these same markers. As Anthropic positions Claude for HIPAA-ready healthcare applications, these findings establish a baseline for Claude's citation behavior and demonstrate the utility of the Authority Signals Framework as a tool for ongoing, cross-platform evaluation of AI-mediated health information.

2605.23920 2026-05-26 cs.CY cs.AI 版本更新

Artificial Effort

人工努力

Federico Belotti, Stefano Coniglio, Antonio Cosma, Francesco Fallucchi

发表机构 * University of Bergamo(贝加莫大学)

AI总结 研究在AI和LLM时代,真实努力任务是否仍能反映人类努力,发现大多数任务可被低成本高精度自动化,仅少数抵抗自动化,且口头金钱激励对LLM无影响。

详情
AI中文摘要

真实努力任务中,参与者执行认知成本高昂的活动,其结果取决于实际表现,广泛应用于实验经济学。然而,其有效性基于人类执行这些任务的假设。我们研究在人工智能(AI)和大型语言模型(LLM)时代,这一假设是否仍然成立。使用来自三个主要提供商的8个经典真实努力任务和23个LLM,我们表明大多数任务现在可以以可忽略的成本准确解决,而只有少数任务抵抗自动化。性能随着每一代模型而提高,中端模型正在迅速缩小与前沿模型的差距,拓宽了可广泛访问的模型集,这些模型可以自动化这些任务。此外,我们表明口头提供金钱激励对LLM性能没有影响。我们的发现为在无监督环境中使用真实努力任务建立了一个边界条件:当参与者可以廉价地将任务完成外包给LLM时,观察到的表现可能不再反映真正的人类努力。

英文摘要

Real-effort tasks, in which participants perform cognitively costly activities whose outcomes depend on actual performance, are widely used in experimental economics. Their validity, however, rests on the assumption that a human performs them. We study whether this assumption still holds in the era of Artificial Intelligence (AI) and Large Language Models (LLMs). Using 8 canonical real-effort tasks and 23 LLMs from three major providers, we show that most tasks can now be solved accurately and at a negligible cost, while only a few resist automation. Performance improves with each model generation, and midtier models are rapidly closing the gap with frontier ones, broadening the set of widely accessible models that can automate these tasks. Additionally, we show that verbally offering monetary incentives has no effect on LLM performance. Our findings establish a boundary condition for the use of real-effort tasks in unsupervised settings: when participants can cheaply outsource task completion to an LLM, observed performance may no longer reflect genuine human effort.

2605.23916 2026-05-26 cs.IR cs.AI econ.GN q-fin.EC 版本更新

Agent-Facing Information Design in LLM Tool Registries

面向智能体的LLM工具注册表信息设计

Haochuan Kevin Wang

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本研究首次系统性地分析了LLM工具注册表中广告式描述对智能体选择的影响,发现法律上允许的夸大宣传(如主观最高级表述)完全主导优化效果,而虚假声明无额外影响,并提出了分离选择导向与营销导向描述及智能体注意力质量分数等注册表设计建议。

详情
AI中文摘要

LLM工具注册表作为未受监管的广告平台运作:提供者编写自由文本描述,智能体据此进行选择,但缺乏衡量基础设施——无可见性标准、质量评分或结果审计——来使该市场承担责任。我们提供了首个系统性框架,结合了跨越五个LLM和十个领域的17,700多次试验以及建设性的注册表设计处方。仅法律上的夸大宣传(主观最高级表述、利益框架)就捕获了100%的优化效果;虚假声明未增加任何额外偏差——这使得FTC对欺骗性广告规则的执法对活跃机制无效。信息披露在结构上失败:系统提示警告对五个模型中的四个产生零可测量效果,行为上限使得基于标签的修正没有空间。最高级表述是主导单一特征(SBC = +0.35)。注册表层的描述规范化实现了与模型无关的一流福利。我们提出将面向选择的描述(结构化的、注册表控制的)与面向营销的描述(提供者撰写的、选择后展示)分离,并引入智能体注意力质量分数以区分能力与文案撰写。

英文摘要

LLM tool registries function as unregulated advertising platforms: providers write free-text descriptions that agents use for selection, yet no measurement infrastructure -- no viewability standard, quality score, or outcome audit -- exists to make this market accountable. We provide the first systematic framework, combining 17,700+ trials across five LLMs and ten domains with a constructive registry design prescription. Legal puffery alone (subjective superlatives, benefit framing) captures 100% of the optimization effect; fabricated claims add zero incremental bias -- rendering FTC enforcement of deceptive advertising rules ineffective against the active mechanism. Disclosure fails structurally: system-prompt warnings produce zero measurable effect for four of five models, and behavioral ceilings leave no headroom for label-based correction. Superlatives are the dominant single feature (SBC = +0.35). Registry-layer description normalization achieves first-best welfare model-independently. We propose separating selection-facing descriptions (structured, registry-controlled) from marketing-facing descriptions (provider-authored, shown post-selection), and introduce the Agent Attention Quality Score to distinguish capability from copywriting.

2605.23914 2026-05-26 cs.DC cs.AI cs.MA 版本更新

VineLM: Trie-Based Fine-Grained Control for Agentic Workflows

VineLM: 基于Trie的细粒度控制用于智能体工作流

Nikos Pagonas, Matthew Lou, Tianyi Peng, Dan Rubenstein, Kostis Kaffes

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出VineLM工作流管理器,通过Trie结构动态选择每个阶段调用的模型,在请求级目标下优化成本-延迟-准确率边界,稀疏分析减少离线分析成本98-99.8%。

详情
AI中文摘要

智能体工作流将可配置的LLM阶段与工具阶段交错,通常包括重试或优化循环。现有工作流管理器离线分析完整工作流配置,并为每个请求分配静态工作流级计划,将每个可配置LLM阶段绑定到单个模型,在重复循环迭代中重用该模型,且不在运行时重新审视这些选择。我们提出VineLM,一种工作流管理器,通过在请求级目标(如在成本或延迟预算下最大化准确率)下执行过程中为每个阶段调用选择模型,实现细粒度控制。VineLM将可行执行表示为模型选择前缀的带注释Trie,并使用检查点和级联分析来估计路径准确率、成本和延迟,而无需在每个路径上详尽分析每个请求。运行时,VineLM在每个阶段调用后重新定位Trie根,并使用已实现的执行前缀和剩余延迟预算在剩余子Trie上重新规划。在NL2SQL和数学推理工作流上,VineLM在相同每请求预算下比粗粒度工作流级基线提高了成本-延迟-准确率边界,准确率提升高达18%,其稀疏分析相比详尽分析将离线分析成本降低了98-99.8%。

英文摘要

Agentic workflows interleave configurable LLM stages with tool stages and often include retries or refinement loops. Existing workflow managers profile full workflow configurations offline and assign each request a static workflow-level plan that binds each configurable LLM stage to a single model, reuses that model across repeated loop iterations, and does not revisit those choices at runtime. We present VineLM, a workflow manager that enables fine-grained control by choosing the model for each stage invocation as execution unfolds under request-level objectives such as maximizing accuracy under cost or latency budgets. VineLM represents feasible executions as an annotated trie of model-choice prefixes and uses checkpointing and cascade profiling to estimate path accuracy, cost, and latency without exhaustively profiling every request on every path. At runtime, VineLM re-roots the trie after each stage invocation and replans over the remaining subtrie using the realized execution prefix and remaining latency budget. On NL2SQL and math reasoning workflows, VineLM improves the cost-latency-accuracy frontier over coarse workflow-level baselines, achieving up to 18% higher accuracy at the same per-request budget with its sparse profiling reducing offline profiling cost by 98-99.8% when compared to exhaustive profiling.
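To make the trie-based replanning described above more concrete, here is a small sketch of an annotated trie of model choices with budget-constrained planning and re-rooting. It is a simplified illustration under assumed names (`Node`, `best_plan`, `example_trie`) and an assumed single cost budget; it does not reproduce VineLM's profiling, latency handling, or actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    cost: float = 0.0       # estimated cost of the stage call that leads to this node
    accuracy: float = 0.0   # estimated end-to-end accuracy (meaningful at leaves)
    children: Dict[str, "Node"] = field(default_factory=dict)


def best_plan(node: Node, budget: float) -> Optional[Tuple[float, Tuple[str, ...]]]:
    """Return (accuracy, remaining model choices) for the best leaf within budget."""
    if not node.children:
        return (node.accuracy, ())
    best = None
    for model, child in node.children.items():
        if child.cost > budget:
            continue  # this stage call alone would exceed the remaining budget
        sub = best_plan(child, budget - child.cost)
        if sub is not None and (best is None or sub[0] > best[0]):
            best = (sub[0], (model,) + sub[1])
    return best


def example_trie() -> Node:
    """Two stages, two hypothetical models per stage, with made-up estimates."""
    acc = {("small", "small"): 0.60, ("small", "large"): 0.80,
           ("large", "small"): 0.75, ("large", "large"): 0.90}
    cost = {"small": 1.0, "large": 3.0}
    root = Node()
    for first in cost:
        root.children[first] = Node(cost=cost[first])
        for second in cost:
            root.children[first].children[second] = Node(
                cost=cost[second], accuracy=acc[(first, second)])
    return root


if __name__ == "__main__":
    root = example_trie()
    print(best_plan(root, 4.0))           # initial plan under a budget of 4
    # Re-root after the first call: suppose "small" actually cost 2, not 1.
    print(best_plan(root.children["small"], 4.0 - 2.0))
```

The second call shows why per-invocation replanning can matter: once the realized cost of the first stage exceeds its estimate, the originally planned second-stage model no longer fits the remaining budget, and the planner falls back to a cheaper choice.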

2605.23912 2026-05-26 cs.CL cs.AI cs.SD 版本更新

Raon-Speech Technical Report

Raon-Speech 技术报告

Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho

发表机构 * KRAFTON

AI总结 本文提出 Raon-Speech,一个 9B 参数的语音语言模型,通过多阶段训练实现英语和韩语的语音理解、回答与生成,并扩展为全双工对话模型 Raon-SpeechChat,在语音任务上超越同类模型。

详情
AI中文摘要

我们提出了 Raon-Speech,一个在英语和韩语语音理解、回答和生成方面表现优异的 9B 参数语音语言模型(SpeechLM),以及 Raon-SpeechChat,一个用于自然实时对话的高性能全双工扩展。Raon-Speech 成功地将预训练的大语言模型(LLM)转换为既能理解又能生成语音的 SpeechLM,同时保留了强大的文本能力。它在 138 万小时精心策划的英语和韩语语音及文本数据集上训练,训练阶段包括:(1) 语音模块对齐,(2) 基于知识蒸馏的端到端 SpeechLM 预训练,以及 (3) 基于多任务偏好优化的后训练。在 42 个英语和韩语语音及文本基准测试中,与包括 Qwen2.5-Omni 和 Fun-Audio-Chat 在内的八个近期类似规模的音频基础模型相比,Raon-Speech 在语音中心任务上建立了最强的整体表现,同时保留了强大的文本问答性能。在此基础上,Raon-SpeechChat 通过在 119K 小时的时间对齐的真实和合成对话数据上进行持续训练,实现了自然的全双工对话。它通过三个互补的训练阶段进行:(1) 因果编码器适应,(2) 全双工预训练,(3) 用于语音和角色控制的全双工微调。在多个全双工基准测试中,Raon-SpeechChat 在 FDB v1.0 涵盖的轮流发言和中断敏感行为上显示出最明显的优势,并在更广泛的全双工评估套件中保持竞争力。我们开源了所有模型检查点、训练和推理流程以及交互式演示。

英文摘要

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It trains on 1.38M hours of highly curated English and Korean speech and text datasets with the following training stages: (1) speech modules alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, (3) full-duplex fine-tuning for voice and role-control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

2605.23909 2026-05-26 cs.AI cs.LG 版本更新

Confidence Calibration in Large Language Models

大型语言模型中的置信度校准

Noam Michael, Daniel BenShushan, Jacob Bien, Don A. Moore

发表机构 * U.C. Berkeley(伯克利大学) University of Southern California(南加州大学)

AI总结 通过预注册研究,发现大型语言模型(LLMs)的置信度普遍高于准确率,且存在显著的难易效应:困难测试中过度自信,简单测试中信心不足,并提出了LifeEval测试用于评估不同难度下的模型校准。

详情
AI中文摘要

我们研究了大型语言模型(LLMs)在不同任务上的置信度校准情况。预注册研究的结果表明,当前一批LLMs与人类一样,过于确信自己是正确的:平均而言,置信度超过了准确率。然而,重要的是,这种趋势受到强大的难易效应的调节,即在困难测试中过度自信最为严重;相比之下,简单测试实际上显示出明显的信心不足。我们开发了LifeEval,一个用于评估不同难度水平下模型校准的测试。

英文摘要

We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.
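The overconfidence measure implied by "confidence exceeds accuracy" can be written as mean stated confidence minus the fraction of correct answers. The sketch below uses that common definition; the function name `overconfidence` and the toy numbers are assumptions for illustration and are not taken from the study or from LifeEval.

```python
from typing import Sequence


def overconfidence(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean confidence minus accuracy; positive means overconfident."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for c in correct if c) / len(correct)
    return mean_confidence - accuracy


if __name__ == "__main__":
    # Made-up numbers showing the hard-easy pattern described in the abstract.
    hard = overconfidence([0.8, 0.7, 0.9, 0.6], [True, False, False, False])
    easy = overconfidence([0.8, 0.8, 0.8, 0.8], [True, True, True, True])
    print(hard, easy)  # positive on the hard set, negative on the easy set
```

Computing the score separately per difficulty bucket, rather than once over all items, is what makes a hard-easy effect visible, since opposite-signed gaps can partly cancel in an overall average.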

2605.23905 2026-05-26 q-fin.GN cs.AI cs.GT 版本更新

AI-Driven Alpha Decay: Algorithmic Homogenization, Reflexive Signal Erosion, and the Paradox of Intelligent Markets

AI驱动的阿尔法衰变:算法同质化、反射性信号侵蚀与智能市场的悖论

Shuchen Meng, Xupeng Chen

发表机构 * Department of Financial Engineering, New York University(金融工程系,纽约大学) Department of Electrical and Computer Engineering, New York University(电气计算机工程系,纽约大学)

AI总结 本文通过理论模型和实证数据证明,AI驱动的投资策略在大规模采用时具有自我挫败性,导致超额收益压缩,并推导出阿尔法半衰期公式,揭示了信号寿命、灭绝级联、红皇后不可能性以及脆弱性-效率权衡等四个理论结果。

详情
AI中文摘要

我们证明,AI驱动的投资策略在大规模采用时本质上具有自我挫败性。随着AI采用率的上升,三个相互强化的渠道——信号拥挤、表演性信号侵蚀和红皇后竞争——压缩了超额收益。我们推导出阿尔法半衰期 $h(\phi) = \ln 2 / [\theta + \delta(\phi)]$,其中 $\theta$ 是自然均值回复率,$\delta(\phi) = N\phi\rho a / \lambda(\phi)$ 是AI加速的衰变成分,随采用率凸递减。在当前采用水平($\phi \approx 0.7$,$\rho \approx 0.6$)下,模型暗示信号半衰期为18个月,而AI之前为5-7年。我们建立了四个理论结果。第一,阿尔法半衰期定理:信号寿命随AI采用率凸递减。第二,信号灭绝级联:超过临界阈值 $\phi^*$ 后,一类信号的衰变会触发对剩余信号的加速竞争。第三,红皇后不可能性:在单一文化均衡中,尽管大量AI投资,净阿尔法恒为零。第四,脆弱性-效率权衡:最大化价格发现的采用水平严格超过最小化系统性脆弱性的水平。实证验证将投资组合收敛校准到SEC 13F表格提交模式(2013-2024年,9950万持仓),记录到模拟机构投资组合收敛在样本期内增加了42%。我们检查了模拟对冲基金回报动态,显示采用AI的基金之间横截面离散度下降,并模拟了2010年闪电崩盘以说明脆弱性后果。

英文摘要

We show that AI-driven investment strategies are inherently self-defeating at scale. As AI adoption rises, three mutually reinforcing channels -- signal crowding, performative signal erosion, and Red Queen competition -- compress excess returns. We derive the alpha half-life $h(\phi) = \ln 2 / [\theta + \delta(\phi)]$, where $\theta$ is the natural mean-reversion rate and $\delta(\phi) = N\phi\rho a / \lambda(\phi)$ is the AI-accelerated decay component, which is convex-decreasing in adoption. At current adoption levels ($\phi \approx 0.7$, $\rho \approx 0.6$), the model implies signal half-lives of 18 months versus 5-7 years pre-AI. We establish four theoretical results. First, the alpha half-life theorem: signal lifespans are convex-decreasing in AI adoption. Second, a signal extinction cascade: beyond a critical threshold $\phi^*$, the decay of one signal class triggers accelerated competition for remaining signals. Third, a Red Queen impossibility: in the monoculture equilibrium, net alpha is identically zero despite heavy AI investment. Fourth, a fragility-efficiency tradeoff: the adoption level maximizing price discovery strictly exceeds the level minimizing systemic fragility. Empirical validation calibrates portfolio convergence to SEC Form 13F filing patterns (99.5 million holdings, 2013-2024), documenting that simulated institutional portfolio convergence increases by 42% over the sample period. We examine simulated hedge fund return dynamics showing declining cross-sectional dispersion among AI-adopting funds, and simulate the 2010 Flash Crash to illustrate fragility consequences.
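The half-life formula in the abstract is easy to check numerically. The sketch below backs out the decay rates implied by the two half-lives quoted there, under two assumptions made only for this example: the pre-AI half-life is taken as 6 years (the midpoint of the quoted 5-7 year range), and the pre-AI regime is treated as having no AI-driven decay (delta = 0). The function names are illustrative.

```python
import math


def half_life(theta: float, delta: float) -> float:
    """Alpha half-life h = ln 2 / (theta + delta), in the time unit of the rates."""
    return math.log(2) / (theta + delta)


def implied_total_rate(h: float) -> float:
    """Invert the formula: total decay rate theta + delta for a given half-life."""
    return math.log(2) / h


# Assumption: pre-AI half-life of 6 years with delta = 0, so theta = ln 2 / 6.
theta = implied_total_rate(6.0)
# The quoted current half-life of 18 months (1.5 years) then pins down delta.
delta = implied_total_rate(1.5) - theta

if __name__ == "__main__":
    print(theta, delta, delta / theta)
```

Under these assumptions the AI-driven component is three times the natural mean-reversion rate, since a fourfold drop in half-life requires a fourfold total decay rate.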

2605.22800 2026-05-26 cs.LG cs.AI stat.ML 版本更新

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

匹配原则:面向干扰鲁棒表示学习的损失函数几何理论

Vishal Rajput

发表机构 * KU Leuven(鲁汶大学)

AI总结 提出匹配原则,通过估计任务协方差矩阵并匹配惩罚矩阵的像空间,统一了多种鲁棒性方法,并在线性高斯模型中证明最优性。

Comments 58 pages, 13 pre-specified empirical blocks. v2: partial-pass framing, geometry-task dissociation, T2B protocol v3, layout/figure fixes; core theorems unchanged. Code: matching-pmh (PyPI). Related note: arXiv:2604.21395

详情
AI中文摘要

鲁棒性、领域自适应、光度/遮挡不变性、传感器漂移和对齐风格被视为独立的文献领域,拥有各自独立的方法族。在标签保持的部署偏移下,它们共享一个几何对象:协方差 Sigma_task = Cov_{Q_n}(n),即输入在标签不变的情况下可以变化的方式。CORAL、对抗训练、数据增强、度量学习、雅可比惩罚和对齐约束并非独立的技巧——它们都是 Sigma_task 的估计量。固定该对象后,雅可比惩罚由一个矩阵 Sigma' 确定,其像空间必须覆盖 range(Sigma_task)——即匹配原则。我们在线性高斯模型中证明了最优性(定理A),证明了任何能够消除部署漂移的二次惩罚都需要像空间覆盖(定理G),并在全局最小值处证明了相同的二分性(定理A*_global)。错误方向/信号对齐控制(引理C;推论E/E*)以及七个估计量(引理D1-D7),加上无标签TDI,为需要学习 Sigma_task 的情况提供了可证伪的配方。在十三个模块(从ML到Qwen2.5-7B)上,测试了匹配的、各向同性的和错误方向的惩罚对几何和部署漂移的影响。其中十二个模块与可识别性成立的理论一致;Office-31是一个命名的特征间隙失败案例。部分通过:几何可以在不改善每个头条任务指标的情况下提升。一次初步的7B DPO运行(一个epoch,240对):匹配风格-PMH保持了风格TDI,而标准DPO则使其退化。我们不声称标准训练达到全局最小值(假设(O)是开放的),不声称估计的 Sigma_task 总是可识别的,也不声称在每个排行榜上占优。我们提出一个可证伪的设计配方:估计 Sigma_task,匹配 Sigma',运行控制,分别报告任务和几何指标。

英文摘要

Robustness, domain adaptation, photometric/occlusion invariance, sensor drift, and alignment style are treated as separate literatures with separate method families. Under label-preserving deployment shift they share one geometric object: the covariance Sigma_task = Cov_{Q_n}(n) of ways inputs can change without changing the label. CORAL, adversarial training, augmentation, metric learning, Jacobian penalties, and alignment constraints are not independent tricks--they are estimators of Sigma_task. Fix that object and the Jacobian penalty is pinned by a matrix Sigma' whose range must cover range(Sigma_task)--the matching principle. We prove optimality in a linear-Gaussian model (Thm. A), necessity of range coverage for any quadratic penalty that zeros deployment drift (Thm. G), and the same dichotomy at global minima (Thm. A*_global). Wrong-direction/signal-aligned controls (Lemma C; Cor. E/E*) and seven estimators (Lemmas D1--D7), plus label-free TDI, yield a falsifiable recipe when Sigma_task must be learned. Thirteen blocks (ML through Qwen2.5-7B) test matched vs isotropic vs wrong-direction penalties on geometry and deployment drift. Twelve match theory where identifiability holds; Office-31 is a named eigengap failure. Partial passes: geometry can improve without every headline task metric moving. A pilot 7B DPO run (one epoch, 240 pairs): matched style-PMH preserves Style TDI where standard DPO degrades it. We do not claim standard training reaches global minima (assumption (O) is open), that estimated Sigma_task is always identifiable, or dominance on every leaderboard. We claim a falsifiable design recipe: estimate Sigma_task, match Sigma', run the controls, report task and geometry separately.

2605.22093 2026-05-26 cs.AI 版本更新

Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

知识图谱沿本体论连续体的重工程(扩展版)

Enrico Daga, Valentina Tamma, Terry Payne

发表机构 * The Open University, Walton Hall, Milton Keynes, United Kingdom(开放大学) School of Computer Science and Informatics, University of Liverpool, UK(利物浦大学计算机科学与信息学学院)

AI总结 本文提出本体论连续体作为概念框架,通过语义与语用、属性与可供性两个正交维度描述、比较和转换知识图谱,以解决不同建模实践间的集成与重用问题,并通过案例研究验证其有效性。

详情
AI中文摘要

知识图谱已成为数据集成的主要载体,对现代AI的成功至关重要,但KG建模实践的多样性(从轻量级词汇表到丰富公理化的本体论)使得集成和重用成本高昂且脆弱。这一挑战在神经符号AI中尤为突出,其中桥接神经和符号组件依赖于重新设计KG以适应新需求的能力;生成式AI现在提供了前所未有的自动化能力,但如果没有对KG空间的原则性理解,这种自动化在概念上仍然缺乏基础。我们将本体论连续体引入为缺失的概念化,这是一个理论构造,其特征框架由两个正交区分定义:语义与语用,以及属性与可供性;这些共同定义了一个词汇表,用于描述、比较、导航和转换跨越全部建模实践的KG。方法论立场是经验性的:连续体并非规定KG应如何建模,而是旨在定义一种存在理论,源于对现实世界KG工程实践的观察,其结构可以形式化地明确表达,例如通过形式概念分析(FCA)。我们通过一个关于溯源知识的案例研究来夯实这一愿景,展示单一关注点如何在连续体上以不同方式体现。我们阐述了五个开放的研究挑战,并邀请社区将本体论连续体发展为一个共享的研究议程。

英文摘要

Knowledge graphs have become the primary vehicle for data integration and are critical to the success of modern AI, but the diversity of KG modelling practices, from lightweight vocabularies to richly axiomatised ontologies, makes integration and reuse expensive and brittle. This challenge is particularly acute in neuro-symbolic AI, where bridging neural and symbolic components depends on the ability to reengineer KGs to fit new requirements; GenAI now offers unprecedented automation capability, but without a principled understanding of the KG space, such automation remains conceptually ungrounded. We introduce the ontological continuum as that missing conceptualisation, a theoretical construct whose characterisation framework is defined by two orthogonal distinctions: semantics vs pragmatics, and properties vs affordances; together these define a vocabulary to describe, compare, navigate, and transform KGs across the full range of modelling practices. The methodological stance is empirical: rather than prescribing how KGs should be modelled, the continuum aims to define a theory of the existent, derived from observation of real-world KG engineering practices and whose structure can be made formally explicit, for example, through Formal Concept Analysis (FCA). We ground the vision through a case study on provenance knowledge, showing how a single concern manifests differently across the continuum. We articulate five open research challenges and invite the community to develop the ontological continuum as a shared research agenda.

2605.22005 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

检查你的大语言模型的秘密词典!五行代码揭示你的大语言模型学到了什么(包括它不应该学到的)

Hisashi Miyashita

发表机构 * Mgnite Inc.(Mgnite公司)

AI总结 通过对lm_head权重矩阵进行奇异值分解(仅需五行PyTorch代码且无需模型推理),直接从模型权重中揭示可解释的语义子空间,并发现模型训练数据组成和策展哲学。

详情
AI中文摘要

我们展示了基于Transformer的大语言模型的lm_head权重矩阵的奇异值分解——仅需五行PyTorch代码且无需模型推理——直接从模型权重中揭示可解释的语义子空间。每个左奇异向量识别出当隐藏状态与相应奇异方向对齐时最容易被选中的词汇标记;检查这些聚类揭示了模型的训练数据组成和策展哲学。 分析GPT-OSS-120B、Gemma-2-2B和Qwen2.5-1.5B,我们发现奇异值谱和词汇聚类结构在不同模型间存在系统性差异:GPT呈现出功能分化子空间的渐进层次;Gemma以19世纪前的英语正字法为主,形成阶梯式聚类结构,这可能有助于高输出可控性;Qwen展现出广泛的多语言覆盖,同时其子空间的词汇被作者认为在伦理上不适合直接发表。 基础-指令对比表明,伦理上令人担忧的子空间源自预训练,并且不会被后训练对齐移除。我们引入词汇聚类得分(VCS)来量化子空间一致性,以及加权投影得分(WPS)作为静态故障标记检测器;将WPS应用于GPT-OSS-120B,无需任何模型推理即可恢复shokubutsu-hyakka-tsu(ID 137606),这是CJK语言社区中广泛报道的一个著名故障标记。我们提出了问题词汇内容根本原因的分类法,并呼吁将lm_head SVD分析作为标准发布前安全审计步骤。我们的发现进一步指出了SVD引导的分词器优化和更可控的大语言模型设计方向。

英文摘要

We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
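
The idea of reading vocabulary clusters off a left singular vector can be sketched on a toy matrix. This is not the authors' five-line PyTorch snippet, which the abstract does not reproduce: it is a dependency-free illustration that uses power iteration for the leading singular triplet, and the vocabulary, the weights, and the helper name `top_singular_triplet` are all made up. In practice a library SVD routine applied to the real lm_head weights would replace the power iteration.

```python
import math

def top_singular_triplet(W, iters=200):
    """Power iteration for the leading singular triplet (u, s, v) of W.

    W is a list of rows with shape (vocab_size, hidden_dim); u has one entry
    per vocabulary token and v has one entry per hidden dimension.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        # u is proportional to W v, normalised to unit length.
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v is proportional to W^T u; its norm converges to the singular value.
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        s = math.sqrt(sum(x * x for x in v))
        v = [x / s for x in v]
    return u, s, v

# Hypothetical 5-token vocabulary with 3 hidden dimensions.
vocab = ["cat", "dog", "bird", "the", "of"]
W = [
    [3.0, 0.0, 0.1],
    [2.9, 0.1, 0.0],
    [3.1, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]

u, s, v = top_singular_triplet(W)
# Tokens with the largest loading on the leading left singular vector.
ranked = sorted(range(len(vocab)), key=lambda i: -abs(u[i]))
top_tokens = [vocab[i] for i in ranked[:3]]
```

On this toy matrix the leading direction groups the three animal tokens, which mirrors, at a very small scale, the kind of cluster inspection the abstract describes.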

2605.20490 2026-05-26 cs.AI cs.LG 版本更新

ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

ECUAS$_n$: 一种用于原则性评估不确定性增强系统的度量族

Lautaro Estienne, Erik Ernst, Matías Vera, Pablo Piantanida, Luciana Ferrer

发表机构 * School of Engineering, UBA, Argentina(阿根廷UBA工程学院) ICC, CONICET-Universidad de Buenos Aires, Argentina(阿根廷CONICET-布宜诺斯艾利斯大学ICC) LISN, CNRS, Université Paris-Saclay, France(法国CNRS巴黎萨克雷大学LISN) International Laboratory on Learning Systems, Canada(加拿大学习系统国际实验室) CSC, CONICET, Argentina(阿根廷CONICET CSC) Mila - Quebec AI Institute, Canada(加拿大魁北克AI研究所Mila) CNRS, Université Paris-Saclay, France(法国CNRS巴黎萨克雷大学)

AI总结 针对高风险自动决策中不确定性增强系统的评估问题,提出一种基于适当评分规则的度量族 ECUAS$_n$,通过参数 $n$ 平衡错误预测成本与不确定性质量,并在分类和生成数据集上验证其理论优势与实证效果。

Comments pre-print, 9-pages paper, 25 pages total

详情
AI中文摘要

在高风险自动决策中,获取预测不确定性对于使用户(人类或下游系统)能够根据应用特定的成本权衡接受或拒绝预测至关重要。这种不确定性增强(UA)系统——即同时输出预测和不确定性分数的系统——目前在文献中以多种方式被评估,包括使用单独的指标评估预测和不确定性分数、设置固定拒绝成本的成本函数或对覆盖-风险曲线进行积分。我们认为这些评估方法不足以评估UA系统在不确定性下决策的整体性能,并提出了一种新的度量族ECUAS$_n$,将其表述为感兴趣任务的适当评分规则。参数$n$根据用例需求控制错误预测成本与不完美不确定性之间的权衡。我们通过在不同分类和生成数据集(包括TriviaQA的手动注释子集)上的实验,从理论和实证两方面展示了ECUAS$_n$度量的优势。

英文摘要

In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -- to accept or reject predictions based on application-specific cost trade-offs. Such uncertainty-augmented (UA) systems -- i.e., systems that output both predictions and uncertainty scores -- are currently being assessed in the literature in a variety of ways, using separate metrics to evaluate the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing overall performance of the UA system for decision making under uncertainty and propose a novel family of metrics, ECUAS$_n$, formulated as proper scoring rules for the task of interest. The parameter $n$ controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case. We demonstrate the advantages of the ECUAS$_n$ metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
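
For context, one of the existing evaluation styles the abstract criticises, a cost function with a fixed rejection cost, can be sketched as follows. This is not ECUAS$_n$, whose definition the abstract does not give. The function name `selective_cost`, the unit cost for an accepted error, and the sample data are illustrative assumptions.

```python
def selective_cost(correct, uncertainty, threshold, reject_cost):
    """Average cost when predictions with uncertainty above threshold are rejected.

    Assumed cost model: an accepted wrong prediction costs 1, an accepted
    correct prediction costs 0, and every rejection costs reject_cost.
    """
    total = 0.0
    for is_correct, u in zip(correct, uncertainty):
        if u > threshold:
            total += reject_cost
        elif not is_correct:
            total += 1.0
    return total / len(correct)

# Hypothetical predictions with their uncertainty scores.
correct = [True, True, False, True, False]
uncertainty = [0.1, 0.2, 0.9, 0.4, 0.3]

# Threshold 1.0 accepts everything; threshold 0.5 rejects the most uncertain item.
cost_accept_all = selective_cost(correct, uncertainty, threshold=1.0, reject_cost=0.25)
cost_selective = selective_cost(correct, uncertainty, threshold=0.5, reject_cost=0.25)
```

The result depends on the single `reject_cost` chosen in advance, which is one reason a fixed-cost evaluation may not reflect the needs of every use case.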

2605.19846 2026-05-26 cs.CV cs.AI cs.CL 版本更新

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

FineBench: 细粒度人类活动理解的视觉-语言模型基准测试与增强

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

发表机构 * National Taiwan University(国立台湾大学) Google(谷歌) Independent Researcher(独立研究员)

AI总结 针对视觉-语言模型在细粒度人类活动理解上的不足,提出包含密集标注的长视频问答基准FineBench和增强框架FineAgent。

Comments CVPR'26 (Workshop on Video Large Language Models). Project Page: https://joslefaure.github.io/assets/html/finebench.html

详情
AI中文摘要

视觉-语言模型(VLM)在通用视频理解方面表现出色,但在需要细致理解人类动作和交互的真实世界应用中,它们常常难以进行细粒度理解。虽然最近一些以人为中心的基准测试评估了模型行为的公平性/伦理、情感感知等维度,但它们没有结合长视频、密集的问答覆盖以及大规模的帧级空间/时间定位。为弥补这一差距,我们引入了FineBench,一个专门设计用于评估细粒度理解的以人为中心的视频问答(VQA)基准。FineBench包含199,420个多项选择问答对,密集标注在64个长视频(每个15分钟)上,重点关注详细的人物运动、人物交互和物体操作,包括组合动作。我们的广泛评估显示,虽然像GPT-5这样的专有模型取得了不错的性能,但当前的开源VLM明显表现不佳,特别是在多人场景的空间推理以及区分人类运动和交互的细微差异方面。为了解决这些已识别的弱点,我们提出了FineAgent,一个模块化框架,通过利用定位器和描述器来增强VLM。实验表明,FineAgent在FineBench上持续提高了各种开源VLM的性能。FineBench为未来细粒度以人为中心的视频理解研究提供了严格的测试平台,而FineAgent则为增强当前VLM中的此类推理提供了一种实用方法。项目页面和代码:https://joslefaure.github.io/assets/html/finebench.html。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

2605.18932 2026-05-26 cs.LG cs.AI 版本更新

HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation

HypergraphFormer: 从大语言模型中学习超图以实现可编辑的楼层平面图生成

Nikita Klimenko, Hesam Salehipour, Parham Eftekhar, Amir Khasahmadi, Ramon Elias Weber

发表机构 * Autodesk Research(Autodesk研究院) York University(约克大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出HypergraphFormer,利用大语言模型学习超图表示来生成楼层平面图,在RPLAN数据集上超越现有方法,并支持任意边界和高度可编辑性。

详情
AI中文摘要

在这项工作中,我们提出了HypergraphFormer,一种基于大语言模型学习超图表示的新型高效楼层平面图生成方法。该模型通过监督微调训练,生成基于超图的文本表示,编码楼层平面图中的空间关系和连通性信息。我们在RPLAN数据集上训练和评估我们的方法,并进一步在本文发布的一个独立的分布外数据集上展示其泛化能力。我们的方法在多种指标上优于基于栅格化或向量化表示的最先进技术。我们还展示了改进的数据效率,特别是在分布偏移下。超图公式通过将公寓足迹与其功能和几何细分解耦,使得能够为任意、不规则、用户指定的边界生成楼层平面图。此外,我们展示了所提出的方法具有高度的可编辑性,使其特别适合由大语言模型支持的设计导向工作流程。

英文摘要

In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representations with a large language model (LLM). The model is trained via supervised fine-tuning to generate a hypergraph-based textual representation that encodes spatial relationships and connectivity information within floor plans. We train and evaluate our approach on the RPLAN dataset, and further demonstrate its generalizability on a separate out-of-distribution dataset, which we release in this paper. Our method outperforms state-of-the-art techniques based on rasterized or vectorized representations across a diverse set of metrics. We also show improved data efficiency, particularly under distribution shift. The hypergraph formulation enables the generation of floor plans for arbitrary, irregular, user-specified boundaries by decoupling apartment footprints from their functional and geometric subdivisions. Furthermore, we show that the proposed methodology offers a high degree of editability, making it particularly well suited to design-oriented workflows supported by LLMs.

2605.18224 2026-05-26 cs.LG cs.AI 版本更新

A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders

变分自编码器中恒定坍缩的单纯形见证证书

Zegu Zhang, Jianhua Peng, Jian Zhang

发表机构 * Independent Researcher(独立研究者) School of Computing, Southeast University(东南大学计算机学院)

AI总结 提出一种基于GMM教师后验和单纯形见证的证书,用于检测和量化VAE编码器均值是否发生输入无关的恒定坍缩,并在MNIST、CIFAR-10和CIFAR-100上验证了方法有效性。

详情
AI中文摘要

我们研究变分自编码器中的精确恒定坍缩:确定性编码器均值变得与输入无关。先验保持为标准高斯分布。在VAE训练之前,我们从基于GMM的数据视角选择一个固定的教师后验,并将一个固定的仅潜在空间单纯形见证附加到编码器均值上。这种构造产生两个关联对象。第一个是证书:如果见证预测优于教师的最佳恒定预测器,则编码器均值不能是输入无关的常数。第二个是局部逃逸方向:在坍缩流形上,教师残差为对齐损失提供样本相关的下降方向。对于任何全支撑的教师后验,相同的几何结构也给出一个具有零教师-见证对齐误差的闭式潜在码。其缩放版本追踪一条从恒定预测器到精确教师码的边际能量路径,该路径量化了受保护见证子空间内的非坍缩。我们在MNIST、CIFAR-10和CIFAR-100上实例化了该方法。使用搜索的无监督PCA-GMM教师,在CIFAR-10和CIFAR-100上,所有五个种子的普通VAE均未通过教师-见证证书,而RST变体在所有五个种子中均通过。在坍缩压力设置下(β_KL ∈ {2,4,8}),普通VAE再次在所有种子中失败,而RST-alpha-prefit保持证书阳性。在两个自然图像数据集上的逃逸轨迹从低边际初始化开始增加见证边际,并表现出非零的教师诱导梯度范数。该分析仅限于编码器均值的精确恒定坍缩;生成质量、解码器使用和其他坍缩模式仍是独立的问题。

英文摘要

We study exact constant collapse in variational autoencoders: the deterministic encoder mean becomes independent of the input. The prior remains the standard Gaussian. Before VAE training, we select a fixed teacher posterior from a GMM-based view of the data and attach a fixed latent-only simplex witness to the encoder mean. This construction yields two linked objects. The first is a certificate: if the witness prediction improves on the best constant predictor of the teacher, the encoder mean cannot be an input-independent constant. The second is a local escape direction: on the collapsed manifold, the teacher residual gives a sample-dependent descent direction for the alignment loss. For any full-support teacher posterior, the same geometry also gives a closed-form latent code with zero teacher-witness alignment error. Its scaled versions trace a margin-energy path from the constant predictor to the exact teacher code, which quantifies non-collapse inside the protected witness subspace. We instantiate the method on MNIST, CIFAR-10, and CIFAR-100. With searched unsupervised PCA-GMM teachers, vanilla VAEs fail the teacher-witness certificate in all five seeds on CIFAR-10 and CIFAR-100, while RST variants pass in all five seeds. Under collapse-stress settings with \(\beta_{\mathrm{KL}}\in\{2,4,8\}\), vanilla VAE again fails in all seeds, whereas RST-alpha-prefit remains certificate-positive. Escape trajectories on both natural-image datasets increase the witness margin from a low-margin initialization and exhibit nonzero teacher-induced gradient norms. The analysis is confined to exact constant collapse of the encoder mean; generation quality, decoder use, and other collapse modes remain separate questions.
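
The certificate logic can be sketched in a few lines. The abstract does not state the alignment loss, so this example assumes cross-entropy against the teacher posterior, for which the best constant predictor is the mean teacher posterior. The function names and the two-sample data are hypothetical.

```python
import math

def cross_entropy(teacher, pred):
    """Mean cross-entropy of predictions against teacher posteriors."""
    total = 0.0
    for p, q in zip(teacher, pred):
        total -= sum(pk * math.log(qk) for pk, qk in zip(p, q))
    return total / len(teacher)

def best_constant(teacher):
    """Constant prediction minimising cross-entropy: the mean teacher posterior."""
    k = len(teacher[0])
    return [sum(p[j] for p in teacher) / len(teacher) for j in range(k)]

def certificate_margin(teacher, witness_pred):
    """Best-constant loss minus witness loss; a positive value certifies non-collapse."""
    const = best_constant(teacher)
    const_loss = cross_entropy(teacher, [const] * len(teacher))
    return const_loss - cross_entropy(teacher, witness_pred)

# Hypothetical full-support teacher posteriors over two mixture components.
teacher = [[0.9, 0.1], [0.1, 0.9]]
informative = [[0.8, 0.2], [0.2, 0.8]]   # witness output varies with the input
collapsed = [[0.5, 0.5], [0.5, 0.5]]     # witness output is input-independent

margin_informative = certificate_margin(teacher, informative)
margin_collapsed = certificate_margin(teacher, collapsed)
```

Because the witness reads only the encoder mean, a constant encoder mean forces a constant witness prediction, and no constant prediction can beat the best constant predictor. A strictly positive margin is therefore incompatible with exact constant collapse.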

2605.16302 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

通过反事实推理路径减少信用分配方差

Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

发表机构 * Alibaba Group(阿里巴巴集团) Tsinghua University(清华大学)

AI总结 提出反事实比较框架,通过采样多条推理轨迹并利用差异隐式估计过程级优势,将稀疏终端奖励转化为步骤敏感信号,从而改进大语言模型多步推理的信用分配,并引入隐式行为策略优化(IBPO)提升训练稳定性和性能上限。

详情
AI中文摘要

使用大语言模型进行多步推理的强化学习通常依赖于稀疏的终端奖励,这会导致一个条件较差的信用分配问题:最终反馈均匀地传播到所有中间决策。这导致高梯度方差、不稳定的训练和许多无效更新,最终限制了模型的持续改进。我们提出了一种用于信用分配的反事实比较框架。对于每个输入,该框架采样多个推理轨迹,并将它们的差异视为对替代决策的隐式近似。这产生了一个隐式过程级优势估计器,将稀疏终端奖励转化为步骤敏感的学习信号。基于此框架,我们引入了隐式行为策略优化(IBPO),该方法在数学和代码推理基准上显著提高了训练稳定性和性能上限。我们的结果为释放大语言模型的推理潜力指明了一个有前景的方向。

英文摘要

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.
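
One way to turn terminal rewards from several sampled trajectories into step-level signals is to compare samples that share a prefix. The sketch below illustrates that general idea only; it is not the IBPO estimator, which the abstract does not specify. The function `step_advantages`, the prefix-value definition, and the toy trajectories are assumptions.

```python
from statistics import mean

def step_advantages(trajectories, rewards):
    """Estimate a per-step advantage for each sampled trajectory.

    The value of a prefix is the mean terminal reward of all samples sharing
    it; a step's advantage is value(prefix including the step) minus
    value(prefix before the step).
    """
    def value(prefix):
        return mean(
            r for t, r in zip(trajectories, rewards)
            if tuple(t[:len(prefix)]) == prefix
        )

    result = []
    for traj in trajectories:
        advs = []
        for i in range(len(traj)):
            before, after = tuple(traj[:i]), tuple(traj[:i + 1])
            advs.append(value(after) - value(before))
        result.append(advs)
    return result

# Four hypothetical reasoning traces for one prompt, with 0/1 terminal rewards.
trajectories = [
    ["parse", "factor", "check"],
    ["parse", "factor", "guess"],
    ["parse", "expand", "check"],
    ["parse", "expand", "guess"],
]
rewards = [1.0, 1.0, 0.0, 0.0]

advantages = step_advantages(trajectories, rewards)
```

Here the only step whose choice changes the outcome ("factor" versus "expand") receives nonzero credit, whereas spreading the terminal reward uniformly would credit all three steps equally.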

2605.15236 2026-05-26 cs.IT cs.AI cs.NI math.IT 版本更新

Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning

基于深度强化学习的截止时间约束编码缓存选择性合并策略学习

Amirhossein Yousefiramandi

发表机构 * Amirhossein Yousefiramandi(阿米尔霍塞因·尤塞菲拉曼迪)

AI总结 针对截止时间约束的编码缓存问题,提出基于深度强化学习的选择性合并策略,通过近端策略优化训练策略网络,在广播包过期率和效率上优于SACM++。

详情
AI中文摘要

在编码缓存中,服务器利用用户端的缓存信息,通过单个编码多播消息或数据包(即合并数据包)并行服务多个用户,从而缓解峰值网络拥塞。为了在视频流等截止时间驱动的应用中向用户传递及时消息,我们必须在线确定要合并的消息进行传递,因为每个请求都有时间限制。值得注意的是,虽然合并有助于当前编码多播数据包,但可能损害未来的传递。我们的解决方案采用深度强化学习,将编码多播传递视为一个掩码动作、离散状态的控制问题,并通过近端策略优化训练的策略网络优于SACM++。在均匀需求基准上,我们的策略网络将广播数据包过期率$\rho$相对于最佳编码多播基线(SACM++)降低了$40.9\%$($0.208$ vs. $0.352$),同时在编码多播方法中,在Track A测试组上取得了最佳广播效率分数$\sigma$。这里一个值得注意的现象是,对于截止时间更严格的应用,合并变得有选择性而非激进:策略网络仅在大约$31.8\%$的机会中进行合并,且在同一模拟器家族的不同变体中均观察到相同现象。我们设计的重点是高效的成对XOR合并,而高阶($K{\ge}3$)编码可视为自然推广,留待未来工作。

英文摘要

In coded caching, the server exploits content cached at the users to serve several of them in parallel with a single coded multicast message (a merged packet), thereby mitigating peak network congestion. In deadline-driven applications such as video streaming, each request carries a time limit, so the server must decide online which messages to merge for delivery. Merging helps the current coded multicast packet but can harm future deliveries. We cast coded multicast delivery as a masked-action, discrete-state control problem and train a policy network with proximal policy optimization; the resulting policy outperforms SACM++. On the uniform-demand benchmark, it reduces the broadcast-packet expiration ratio $\rho$ by $40.9\%$ ($0.208$ vs. $0.352$) relative to the best coded multicast baseline (SACM++), while also attaining the best broadcast-efficiency score $\sigma$ among the coded multicast methods across the Track A battery. Notably, under stricter deadlines merging becomes selective rather than aggressive: the policy network merges at only about $31.8\%$ of the opportunities, and the same observation holds across variations within the same simulator family. Our design focuses on efficient pairwise XOR merging; higher-order ($K{\ge}3$) coding is a natural generalization left for future work.
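
The pairwise XOR merge at the core of this setting can be shown with two users who each cache the packet the other one requests. This is a minimal sketch of the coding step only, not of the learned merge policy. The packet contents and the helper name `xor_bytes` are illustrative.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

# User 1 requests packet A and has B cached; user 2 requests B and has A cached.
packet_a = b"video-chunk-A"
packet_b = b"video-chunk-B"

# The server broadcasts one merged packet instead of two separate ones.
merged = xor_bytes(packet_a, packet_b)

# Each user removes its cached packet from the merged one to recover its request.
decoded_by_user1 = xor_bytes(merged, packet_b)
decoded_by_user2 = xor_bytes(merged, packet_a)
```

A merge is only useful when both requests are still within their deadlines at transmission time, which is why the decision of whether to merge or send uncoded has to be made online.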

2605.14889 2026-05-26 cs.CV cs.AI 版本更新

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba: 具有状态重编程的双路径SSD用于在线手术阶段识别

Sukju Oh, Sukkyu Sun

发表机构 * Department of Computer Science and Artificial Intelligence(计算机科学与人工智能系)

AI总结 提出SurgicalMamba模型,基于Mamba2的结构化状态空间对偶性(SSD),通过双路径SSD块、强度调制步进和状态重编程三个组件,实现在线手术阶段识别,在多个基准上达到最先进性能。

Comments 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba

详情
AI中文摘要

在线手术阶段识别(SPR)是上下文感知手术室系统的基础,要求仅根据过去上下文对每一帧做出预测。手术视频提出了自然视频识别器无法共同解决的三个需求:手术过程跨越数万帧,时间流动不均匀(长时间常规片段被短暂的阶段定义转换打断),视觉领域狭窄,因此骨干特征在通道间高度相关。现有识别器要么让每帧成本随已处理长度增长,要么保持成本有界但以均匀速率和通道独立动态推进状态,无法解决后两个需求。我们提出SurgicalMamba,一种基于Mamba2的结构化状态空间对偶性(SSD)的因果SPR模型,将每帧成本保持在O(d)。它引入了三个与SSD兼容的组件,共同解决这些需求:双路径SSD块,在循环状态级别分离长期和短期模式;强度调制步进,一种连续时间时间扭曲,使慢路径的有效速率适应阶段相关信息;以及状态重编程,一种每块的Cayley旋转,在原本轴对齐的SSM循环中打开跨通道混合。学习到的旋转平面继承了阶段对齐的结构,无需任何直接监督,提供了手术工作流的可解释内部特征。在七个公开SPR基准上,SurgicalMamba在严格在线评估下达到了最先进的准确率和阶段级Jaccard指数:在Cholec80上为94.6%/82.7%(比最强先前方法高0.7 pp/2.2 pp),在AutoLaparo上为89.5%/68.9%(高1.7 pp/2.0 pp),在单个GPU上达到238.74 fps。消融实验分离了每个组件的贡献。代码公开于https://github.com/sukjuoh/Surgical-Mamba。

英文摘要

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components that jointly address these demands: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 238.74 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
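
The Cayley rotation mentioned above maps a skew-symmetric matrix to an orthogonal one. The abstract does not say how the paper parametrises its rotation planes, so the sketch below only shows the standard Cayley transform restricted to a single two-channel plane. The function names and the value of `a` are assumptions.

```python
def cayley_rotation_2d(a):
    """Cayley transform Q = (I - A)^{-1} (I + A) for A = [[0, a], [-a, 0]].

    For this 2x2 skew-symmetric A the product has the closed form
    Q = [[c, s], [-s, c]] with c = (1 - a^2) / (1 + a^2), s = 2a / (1 + a^2).
    """
    d = 1.0 + a * a
    c, s = (1.0 - a * a) / d, 2.0 * a / d
    return [[c, s], [-s, c]]

def rotate(q, x):
    """Apply the 2x2 matrix q to a two-channel state x."""
    return [q[0][0] * x[0] + q[0][1] * x[1], q[1][0] * x[0] + q[1][1] * x[1]]

# Mix two otherwise independent state channels with one learned scalar a.
q = cayley_rotation_2d(0.5)
state = [1.0, 0.0]
mixed = rotate(q, state)
norm_before = sum(v * v for v in state) ** 0.5
norm_after = sum(v * v for v in mixed) ** 0.5
```

Because Q is orthogonal, the rotation mixes channels without changing the norm of the state, a convenient property when it is inserted into a recurrence.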

2605.10977 2026-05-26 cs.CR cs.AI 版本更新

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

PASA:一种针对语义不变攻击的LLM生成文本的原则性嵌入空间水印方法

Zhenxin Ai, Haiyun He

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出PASA水印算法,在潜在嵌入空间的语义簇上嵌入和检测水印,通过理论框架实现检测精度、鲁棒性和失真的基本权衡,在强释义攻击下仍保持鲁棒性和文本质量。

详情
AI中文摘要

大型语言模型(LLM)的水印是一种有前景的方法,用于检测LLM生成的文本并实现负责任的部署。然而,现有的水印方法通常容易受到语义不变攻击(如释义)的影响。我们提出了PASA,一种原则性、鲁棒且无失真的水印算法,在语义级别嵌入和检测水印。PASA在潜在嵌入空间中的语义簇上操作,并通过由密钥和语义历史同步的共享随机性构建令牌和辅助序列之间的分布依赖关系。该设计基于我们的理论框架,该框架表征了联合最优的嵌入-检测对,实现了检测精度、鲁棒性和失真之间的基本权衡。在多个LLM和语义不变攻击上的评估表明,即使在强释义攻击下,PASA仍保持鲁棒性,同时保持高文本质量,优于标准词汇空间基线。消融研究进一步验证了我们超参数选择的有效性。网页:https://ai-kunkun.github.io/PASA_page/。

英文摘要

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.

2605.10764 2026-05-26 cs.CV cs.AI 版本更新

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

打破刹车,而非车轮:通过熵最大化实现无目标越狱

Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang

发表机构 * Australian National University(澳大利亚国立大学) The University Of Queensland(昆士兰大学) Peking University(北京大学) GE research(通用电气研究院) CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 提出UJEM-KL攻击方法,通过最大化决策令牌的熵来翻转视觉-语言模型的拒绝输出,实现高迁移性的无目标越狱。

Comments Preprint. 17 pages, 8 figures, 6 tables

详情
AI中文摘要

近期研究表明,基于梯度的通用图像越狱攻击在视觉-语言模型(VLM)上几乎没有或完全没有跨模型迁移性,这使人们对可迁移多模态越狱的可行性产生了怀疑。我们在严格的无目标威胁模型下重新审视这一结论,不强制固定前缀或响应模式。初步实验发现,在自回归解码过程中,拒绝行为集中在高熵令牌上,而攻击前非拒绝令牌在前排候选者中已占据相当大的概率质量。受此启发,我们提出通过熵最大化的无目标越狱(UJEM)-KL,这是一种轻量级攻击,通过最大化这些决策令牌的熵来翻转拒绝结果,同时稳定剩余的低熵位置以保持输出质量。在三个VLM和两个安全基准测试中,UJEM-KL实现了具有竞争力的白盒攻击成功率,并持续提高了迁移性,同时在代表性防御下仍然有效。我们的实验结果表明,有限的迁移性主要源于过度受限的优化目标。

英文摘要

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
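
The diagnostic observation, that some decoding positions have much higher next-token entropy than others, can be reproduced on toy logits. The sketch below covers only that measurement, not the attack or its objective. The logits, the threshold, and the function names are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(step_logits, threshold):
    """Indices of decoding steps whose next-token entropy exceeds threshold."""
    return [
        i for i, logits in enumerate(step_logits)
        if entropy(softmax(logits)) > threshold
    ]

# Hypothetical per-step logits over a three-token vocabulary.
step_logits = [
    [8.0, 0.0, 0.0],    # confident step
    [1.0, 1.0, 1.0],    # maximally uncertain step
    [5.0, 0.0, -1.0],   # fairly confident step
]
decision_positions = high_entropy_positions(step_logits, threshold=0.5)
uniform_entropy = entropy(softmax(step_logits[1]))
```

Positions flagged this way correspond to what the abstract calls decision tokens, while the remaining low-entropy positions are the ones the method aims to keep stable.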

2605.10430 2026-05-26 cs.LG cs.AI stat.ML 版本更新

Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation

真实 vs. 半模拟:重新思考治疗效果估计的评估

George Panagopoulos

发表机构 * Department of Computer Science, University of Luxembourg(卢森堡大学计算机科学系)

AI总结 通过大规模实证研究,比较了半模拟基准和真实数据集上使用反事实指标与可观测指标评估治疗效果估计模型的效果,揭示了两种评估体系之间的差距,并发现简单元学习器与强基础模型结合具有竞争力。

详情
AI中文摘要

利用机器学习估计异质性治疗效果在学术研究和工业实践中都引起了广泛关注。然而,这两个领域通常在不同条件下评估模型。方法论工作通常依赖于半模拟基准和需要反事实结果的指标,而实际应用则依赖于基于排名或测试结果的可观测指标。尽管方法论进展与实际部署之间存在众所周知的差距,但这些评估体系之间的关系尚未得到系统研究。我们对标准半模拟基准系列和真实数据集上的治疗效果评估进行了大规模实证研究。我们的基准涵盖了与多个基础学习器配对的元学习器,以及专门的因果机器学习模型。我们使用应用导向文献中常见的可观测指标以及方法论文中常用的反事实指标来评估这些方法。我们的结果揭示了两个互补的差距。首先,即使在相同的半模拟基准上,反事实指标也不能可靠地恢复可观测指标偏好的估计器。其次,在半模拟基准上获得的排名不能迁移到真实数据集。我们还发现,具有强大基础模型的简单元学习器始终具有竞争力,这与专门的因果模型形成对比。总体而言,我们的发现表明,治疗效果估计研究的进展不应仅通过反事实指标和半模拟基准来评估,而应结合可观测指标和真实数据验证。

英文摘要

Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but would benefit from incorporating observable metrics and real-data validation.

2605.07733 2026-05-26 cs.LG cs.AI 版本更新

Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach

使用Ping2Hex方法的整车运输智能卡车匹配

Srinivas Kumar Ramdas, Jose Mathew, Ankit Singh Chauhan, Dinesh Rajkumar, Aravind Manoj, Mohit Goel

发表机构 * Project44 GmbH(Project44公司)

AI总结 提出基于Ping2Hex的智能卡车匹配系统ITM 2.0,通过概率排序和LightGBM模型解决GPS数据中车辆标识缺失导致的匹配问题,显著提升精度和覆盖率。

Comments 12 pages, 10 figures, 8 tables. Accepted at iSCSi 2026 (International Conference on Industry Sciences and Computer Sciences Innovation). To appear in Procedia Computer Science (Elsevier)

详情
Journal ref
ISCSI(2026)
AI中文摘要

利用GPS数据进行准确的卡车与货物匹配是整车供应链可视性的基础,能够实现实时跟踪和准确的预计到达时间(ETA)预测。然而,缺失或损坏的车辆标识符使得传统匹配方法无法使用,导致货物失去可视性。本文提出了智能卡车匹配(ITM)2.0,一个机器学习系统,通过将匹配问题表述为概率排序来解决这一关键缺口。我们的方法利用Uber H3六边形空间索引将GPS ping离散化为路线相似性特征,结合时间信息,然后应用带有阈值后处理的LightGBM梯度提升。通过严格的评估,包括离线模型选择(SVM、XGBoost、LightGBM)、全面的消融研究和生产影子测试,我们展示了相对于基于规则的基线的显著提升。ITM 2.0在北美实现了26个百分点的精度提升,在欧洲实现了14个百分点的提升,同时覆盖率翻倍。该系统已在Project44部署用于处理整车运输,展示了对于高达1公里的地理编码误差、多个候选卡车和稀疏ping的鲁棒性。

英文摘要

Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking and accurate estimated time of arrival (ETA) predictions. However, missing or corrupted vehicle identifiers prevent traditional matching approaches, leaving shipments without visibility. This paper presents Intelligent Truck Matching (ITM) 2.0, a machine learning system that addresses this critical gap by formulating matching as a probabilistic ranking problem. Our approach leverages Uber H3 hexagonal spatial indexing to discretize GPS pings into route similarity features, combined with temporal information, then applies LightGBM gradient boosting with threshold-based post-processing. Through rigorous evaluation including offline model selection (SVM, XGBoost, LightGBM), comprehensive ablation studies, and production shadow testing, we demonstrate substantial gains over rule-based baselines. ITM 2.0 achieves a 26-percentage-point precision improvement in North America and a 14-point improvement in Europe, while doubling coverage. Deployed in production at Project44 for full truckload shipments, the system is robust to geocoding errors up to 1 km, multiple candidate trucks, and sparse pings.
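
Discretising pings into cells and comparing cell sets is the essence of a route-similarity feature. The sketch below uses a plain square latitude/longitude grid as a stand-in for H3 hexagons so that it runs without extra dependencies; a real system would call an H3 binding instead. The cell size, the Jaccard feature, and the coordinates are all illustrative assumptions rather than the paper's feature set.

```python
import math

def ping_to_cell(lat, lon, cell_deg=0.01):
    """Map a GPS ping to a grid cell id (square-grid stand-in for an H3 index)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def route_cells(pings, cell_deg=0.01):
    """Set of cells visited by a sequence of (lat, lon) pings."""
    return {ping_to_cell(lat, lon, cell_deg) for lat, lon in pings}

def jaccard(a, b):
    """Overlap between two cell sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical planned shipment route and pings from two candidate trucks.
planned = [(41.885, -87.625), (41.895, -87.635), (41.905, -87.645)]
truck_a = [(41.8852, -87.6253), (41.8957, -87.6348), (41.9051, -87.6449)]
truck_b = [(40.105, -88.205), (40.205, -88.305)]

similarity_a = jaccard(route_cells(planned), route_cells(truck_a))
similarity_b = jaccard(route_cells(planned), route_cells(truck_b))
```

A feature like `similarity_a` could then be combined with temporal features and passed to a ranking model; working at cell level rather than with raw coordinates also gives some tolerance to small geocoding errors.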

2605.06415 2026-05-26 cs.LG cs.AI cs.CL cs.CV 版本更新

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

E = T*H/(O+B):混合专家生态的无量纲控制参数

Qingjun Zhang

发表机构 * School of Integrated Circuits, Wuxi Taihu University(无锡太湖大学集成电路学院)

AI总结 提出无量纲控制参数E = T*H/(O+B),通过12个控制实验证明E≥0.5可保证混合专家模型无死亡专家,并发现专家复活、正交毒性依赖数据集等六项额外结果。

Comments 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework

详情
AI中文摘要

我们引入E = T*H/(O+B),这是一个无量纲控制参数,用于预测混合专家(MoE)模型是否会发展出健康的专家生态还是陷入死亡专家。E将四个超参数——路由温度T、路由熵权重H、先知权重O和平衡权重B——组合成一个单一量。通过12个控制实验(8个视觉,4个语言),总计超过11,000个训练周期,我们确定仅E ≥ 0.5就足以保证零死亡专家,消除了手工设计负载平衡辅助损失的必要性。我们在CIFAR-10、CIFAR-100、TinyImageNet-200、WikiText-2和WikiText-103上跨模态验证了这一点。另外还发现了六项结果:(1)死亡专家可以复活——由平衡损失驱动路由器重新探索触发;(2)正交毒性依赖于数据集,并非普遍存在;(3)任务复杂性改变了临界E阈值;(4)模型过拟合与专家生态健康解耦;(5)三层MoE自发崩溃为两层功能结构;(6)生态结构在50倍温度范围内保持不变。我们提出E作为MoE训练的统一诊断指标,类似于流体力学中的雷诺数。

英文摘要

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
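As a worked illustration of the formula in this abstract, the parameter can be computed directly from the four hyperparameters. This is a minimal sketch: the function names, the example values, and the handling of a non-positive denominator are assumptions made for illustration; only the formula and the 0.5 threshold come from the abstract.

```python
E_THRESHOLD = 0.5  # threshold reported in the abstract


def ecology_parameter(T: float, H: float, O: float, B: float) -> float:
    """Return E = T*H/(O+B) for routing temperature T, routing entropy
    weight H, oracle weight O, and balance weight B."""
    denom = O + B
    if denom <= 0:
        # Assumption: the abstract does not say how O + B = 0 is treated.
        raise ValueError("O + B must be positive")
    return T * H / denom


def predicts_no_dead_experts(T: float, H: float, O: float, B: float) -> bool:
    """Apply the E >= 0.5 criterion from the abstract."""
    return ecology_parameter(T, H, O, B) >= E_THRESHOLD


# Hypothetical settings, not taken from the paper.
healthy = predicts_no_dead_experts(T=2.0, H=0.5, O=1.0, B=1.0)   # E = 0.5
at_risk = predicts_no_dead_experts(T=1.0, H=0.25, O=1.0, B=1.0)  # E = 0.125
```

Because T and H appear only in the numerator and O and B only in the denominator, raising the temperature or entropy weight, or lowering the oracle or balance weight, moves E toward the healthy regime.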

2605.04295 2026-05-26 cs.LG cs.AI 版本更新

LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

通过自适应共形语义熵进行LLM不确定性量化

Hamed Karimi, Vaishali Meyappan, Reza Samavi

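Affiliations: Toronto Metropolitan University; Vector Institute

AI summary: Proposes Adaptive Conformal Semantic Entropy (ACSE), which clusters semantic entropy and adaptively adjusts the uncertainty score, combined with conformal calibration for statistically reliable accept/abstain decisions; it outperforms existing baselines on multiple datasets.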

Comments Accepted for publication in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026); 14 Pages

详情
AI中文摘要

LLMs的过度自信,特别是在产生幻觉时,对在安全关键环境中部署模型构成了重大挑战,并使得对不确定性进行可靠估计成为必要。现有的不确定性量化方法通常优先考虑词汇或概率度量;然而,这些技术往往忽略了具有相似含义的不同响应的语义差异。在本文中,我们提出了自适应共形语义熵(ACSE),一种通过自适应测量LLMs输出中的语义分散性来估计提示级不确定性的方法。我们的不确定性评分函数基于对同一提示的多个不同响应的语义熵进行聚类。该函数根据每个聚类的语义特征自适应调整不确定性分数。为了确保我们分数的统计可靠性,我们使用共形校准应用决策规则来接受/弃权提示,提供了有限样本、无分布的保证,使得接受响应中的错误率保持在用户指定的容差范围内。我们使用不同LLMs和数据集进行的广泛实验评估表明,我们的方法在判别性能、共形保证和概率校准指标方面始终优于最先进的不确定性量化基线。作为一个亮点,对于TriviaQA数据集,我们方法的AUROC为0.88,而令牌熵方法为0.65。

英文摘要

LLMs' overconfidence, particularly when hallucinating, poses a significant challenge for deploying the models in safety-critical settings and makes reliable uncertainty estimation necessary. Existing approaches to uncertainty quantification typically prioritize lexical or probabilistic measures; however, these techniques often ignore the semantic variance of different responses with similar meaning. In this paper, we propose Adaptive Conformal Semantic Entropy (ACSE), a method for estimating prompt-level uncertainty by adaptively measuring semantic dispersion in LLM outputs. Our uncertainty scoring function is based on clustering semantic entropy of multiple diverse responses to the same prompt. The function adaptively adjusts the uncertainty score based on semantic features of each cluster. To ensure the statistical reliability of our score, we use conformal calibration to apply a decision rule that accepts or abstains on each prompt, providing a finite-sample, distribution-free guarantee that the error rate among the accepted responses remains bounded by a user-specified tolerance. Our extensive experimental evaluations using different LLMs and datasets demonstrate that our approach consistently outperforms state-of-the-art uncertainty quantification baselines on discriminative performance, conformal guarantees, and probabilistic calibration indicators. As a highlight, on the TriviaQA dataset, our approach achieves an AUROC of 0.88, compared with 0.65 for the token entropy approach.
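To make the idea of semantic dispersion concrete, the sketch below computes the plain entropy of meaning-cluster frequencies for several sampled responses and applies a threshold rule. It is not the ACSE scoring function: the adaptive adjustment and the conformal calibration are not specified in the abstract, so the cluster labels are assumed to be given and the threshold is assumed to come from a separate calibration step. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cluster_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids[i] is the cluster assigned to the i-th sampled response.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled response")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def accept(cluster_ids: Sequence[Hashable], threshold: float) -> bool:
    """Accept the prompt when dispersion is at most the threshold.

    Assumption: the threshold was chosen beforehand on held-out data.
    """
    return cluster_entropy(cluster_ids) <= threshold


# Hypothetical example: four samples, three of which share one meaning.
mostly_consistent = accept(["a", "a", "a", "b"], threshold=0.6)
scattered = accept(["a", "b", "c", "d"], threshold=0.6)
```

When every response lands in the same cluster the entropy is zero, and it grows as the responses spread across more meanings, which is the behavior an accept/abstain rule relies on.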

2605.03509 2026-05-26 cs.CV cs.AI 版本更新

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

BFORE: 蝴蝶-萤火虫优化的Retinex增强用于低光图像质量提升

Ahmed Cherif

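Affiliations: Sofrecom Tunisia; Orange Innovation

AI summary: Proposes the BFORE framework, which combines the Butterfly Optimization Algorithm and the Firefly Algorithm to search automatically for the best Retinex enhancement parameters by maximizing a Gaussian Naturalness Score, significantly improving low-light image quality.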

详情
AI中文摘要

低光图像存在可见度差、噪声和颜色失真问题。现有的基于Retinex的增强方法依赖手动调整参数,无法泛化到不同光照条件。本文提出BFORE(蝴蝶-萤火虫优化的Retinex增强),一个自动为每张图像寻找最佳增强参数的框架。BFORE分两阶段工作:(1)蝴蝶优化算法(BOA)搜索最优的多尺度Retinex带颜色恢复(MSRCR)参数,然后(2)萤火虫算法(FA)微调伽马校正、去噪和颜色参数。两个阶段都最大化高斯自然度评分(GNS),一种衡量增强图像自然度的无参考指标。标准质量指标(PSNR、SSIM、NIQE)仅在优化后计算,确保零数据泄露。在30对合成图像上,BFORE达到GNS=0.971,优于次优方法MSRCR(0.894)8.6%。在来自LOL数据集的115张真实图像上,BFORE达到GNS=0.887,优于MSRCR(0.808)9.8%。与三个在相同条件下训练的深度学习基线(Zero-DCE、SCI、IAT)进行受控比较,BFORE在GNS上超过最佳深度学习方法14.7%。消融研究证实,混合BOA+FA策略显著优于单独使用每种优化器,而在三个评估预算下的可扩展性分析表明,一旦计算资源可用,结构化优化器显著优于均匀随机采样(128次评估时p=0.009,300次评估时p=0.021)。所有改进均具有统计显著性(Wilcoxon符号秩检验p<0.0001)。每张图像在CPU上的处理时间为3-6分钟,适用于离线应用。

英文摘要

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tuned parameters that do not generalize across different lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a framework that automatically finds the best enhancement parameters for each image. BFORE works in two phases: (1) a Butterfly Optimization Algorithm (BOA) searches for optimal Multi-Scale Retinex with Color Restoration (MSRCR) parameters, then (2) a Firefly Algorithm (FA) fine-tunes gamma correction, denoising, and color parameters. Both phases maximize a Gaussian Naturalness Score (GNS), a no-reference metric that measures how natural the enhanced image looks. Standard quality metrics (PSNR, SSIM, NIQE) are computed only after optimization, ensuring zero data leakage. On 30 synthetic image pairs, BFORE achieves GNS = 0.971, outperforming the next-best method MSRCR (0.894) by 8.6%. On 115 real images from the LOL dataset, BFORE achieves GNS = 0.887, outperforming MSRCR (0.808) by 9.8%. A controlled comparison with three deep learning baselines (Zero-DCE, SCI, IAT) trained under identical conditions shows BFORE surpasses the best DL method by 14.7% in GNS. An ablation study confirms that the hybrid BOA+FA strategy significantly outperforms each optimizer in isolation, and a scalability analysis at three evaluation budgets shows that the structured optimizer significantly outperforms uniform random sampling once compute is available (p = 0.009 at 128 evaluations, p = 0.021 at 300 evaluations). All improvements are statistically significant (p < 0.0001, Wilcoxon signed-rank test). Processing time is 3-6 minutes per image on CPU, suitable for offline applications.
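The reported percentages are relative gains over the baseline score rather than absolute differences in GNS. A short check, using only the four GNS values quoted in the abstract (the function name is an illustrative assumption):

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# GNS values quoted in the abstract: BFORE vs. MSRCR.
synthetic_gain = round(relative_gain_percent(0.971, 0.894), 1)  # 30 synthetic pairs
lol_gain = round(relative_gain_percent(0.887, 0.808), 1)        # 115 LOL images
```

Both values round to the figures stated in the abstract (8.6% and 9.8%), whereas the absolute GNS differences are 0.077 and 0.079.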

2605.02037 2026-05-26 cs.RO cs.AI 版本更新

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

VILAS:一种集成软抓取的VLA低成本机器人操作架构

Zijian An, Hadi Khezam, Bill Cai, Ran Yang, Shijie Geng, Yiming Feng, Yue Zheng, Lifeng Zhou

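Affiliations: Drexel University; Virginia Seafood Agricultural Research and Extension Center; Amazon Store Foundation AI (SFAI)

AI summary: Proposes VILAS, a low-cost modular robotic manipulation platform with an integrated soft-grasping mechanism that supports end-to-end VLA policy learning and deployment, validated on a grape grasping task.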

详情
AI中文摘要

我们提出了VILAS,一个完全低成本、模块化的机器人操作平台,旨在支持端到端视觉-语言-动作(VLA)策略学习并在可访问硬件上部署。该系统集成了Fairino FR5协作臂、Jodell RG52-50电动夹爪和双摄像头感知模块,通过基于ZMQ的通信架构统一协调遥操作、数据收集和策略部署于单一框架内。为了在不依赖显式力传感的情况下安全操作易碎物体,我们设计了一种基于kirigami的软柔性夹爪扩展件,在压缩载荷下产生可预测变形,提供对脆弱目标的温和且可重复接触。我们在VILAS平台上部署并评估了三种最先进的VLA模型:pi_0、pi_0.5和GR00T N1.6。所有模型均使用通过我们的遥操作流水线收集的相同演示数据集,从公开发布的预训练检查点进行微调。在葡萄抓取任务上的实验验证了所提系统的有效性,证实了有能力的操作策略可以在低成本模块化硬件上成功训练和部署。我们的结果进一步为当前VLA模型在真实环境中的部署特性提供了实践见解。

英文摘要

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.

2604.23703 2026-05-26 cs.HC cs.AI cs.CY 版本更新

Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching

Talking Slide Avatars: 面向教学的开源多模态通信方法

Xinxing Wu

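Affiliations: School of Mathematics and Computer Science, Kentucky State University

AI summary: Proposes an open-source workflow that integrates OpenVoice and Ditto-TalkingHead to create talking slide avatars, aiming to strengthen instructor presence and narrative continuity in online teaching.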

Comments 15 pages

详情
AI中文摘要

基于幻灯片的讲授在高等教育中广泛使用,但在在线、混合和异步情境中,幻灯片常常失去教师存在感、叙事连续性和表达框架,而这些有助于学习者与课程内容建立联系。完整的讲座视频可以部分恢复这些特性,但录制、修改和复用耗时。本研究提出了一种基于实践的实现和反思性分析,用于创建可说话的幻灯片头像的开源工作流。该工作流将OpenVoice(用于文本转语音和授权语音风格转换)与Ditto-TalkingHead(用于音频驱动的说话图像合成)相结合,使教师能够将简短的脚本和授权或合成的肖像图像转换为幻灯片或基于HTML的讲座材料的配音视频。本研究不仅将这一工作流视为技术解决方案,还将可说话的幻灯片头像定位为数字教育学、美育和艺术技术实践交叉点的多模态通信产物。本文记录了生产流程,分析了通信和美学可供性,并提出了关于脚本长度、图像选择、节奏、披露、可访问性、同意和伦理使用的实用指南。其贡献并非经过验证的学习干预,而是面向教育者的开源生产模型和通信设计框架。研究得出结论:简短、透明且精心设计的头像,在有选择地使用并采取适当伦理保障时,可为引言、过渡、提醒和总结提供可复用的通信层。

英文摘要

Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose instructor presence, narrative continuity, and expressive framing that help learners connect with course content. Full lecture video can partly restore these qualities, but it is time-consuming to record, revise, and reuse. This study presents a practice-based implementation and analytic reflection of an open-source workflow for creating talking slide avatars. The workflow integrates OpenVoice for text-to-speech and authorized voice-style conversion with Ditto-TalkingHead for audio-driven talking-image synthesis, enabling instructors to transform a short script and an authorized or synthetic portrait image into a narrated video for slide decks or HTML-based lecture materials. Rather than treating this workflow only as a technical solution, the study frames talking slide avatars as multimodal communication artifacts at the intersection of digital pedagogy, aesthetic education, and art-technology practice. The paper documents the production pipeline, analyzes communicative and aesthetic affordances, and proposes practical guidelines for script length, image selection, pacing, disclosure, accessibility, consent, and ethical use. Its contribution is not a validated learning intervention, but an educator-oriented open-source production model and communication-design framework. The study concludes that short, transparent, and carefully designed avatars may provide a reusable communication layer for introductions, transitions, reminders, and recaps when used selectively and with appropriate ethical safeguards.

2604.13088 2026-05-26 cs.LG cs.AI 版本更新

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

序列级奖励的组内学习设计条件:令牌梯度消除

Fei Ding, Yongkang Zhang, youwei wang, Zijian Zeng

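Affiliations: Alibaba Group; Tsinghua University

AI summary: Addresses the credit-assignment problem caused by sparse terminal rewards in multi-step LLM reasoning by proposing a counterfactual-comparison framework and Implicit Behavior Policy Optimization (IBPO), which approximate alternative decisions through trajectory differences and convert sparse rewards into step-sensitive signals, improving training stability and reasoning performance.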

详情
AI中文摘要

基于大语言模型的多步推理强化学习通常依赖于稀疏的终端奖励,这导致了不良条件的信用分配问题:最终反馈均匀地传播到所有中间决策。这导致高梯度方差、不稳定的训练和许多无效更新,最终限制了模型的持续改进。我们提出了一种用于信用分配的反事实比较框架。对于每个输入,该框架采样多个推理轨迹,并将它们的差异视为替代决策的隐式近似。这产生了一个隐式过程级优势估计器,将稀疏的终端奖励转化为步骤敏感的学习信号。基于此框架,我们引入了隐式行为策略优化(IBPO),显著提高了数学和代码推理基准上的训练稳定性和性能上限。我们的结果指向了一个有希望的方向,以解锁大语言模型的推理潜力。

英文摘要

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.

2604.11811 2026-05-26 cs.PL cs.AI cs.CL cs.LG 版本更新

M$^\star$: Every Task Deserves Its Own Memory Harness

M$^\star$:每个任务都应有专属的记忆框架

Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia

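Affiliations: City University of Hong Kong; Microsoft

AI summary: Proposes M$^\star$, which automatically discovers task-optimized memory systems through executable program evolution and outperforms fixed-memory baselines on conversation, embodied planning, and expert reasoning tasks.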

Comments Preprint. Code: https://github.com/wbopan/mstar ; Live demo: https://mstar.wenbo.io

详情
AI中文摘要

大型语言模型代理依赖专门的记忆系统在长时间交互中积累和重用知识。最近的架构通常采用针对特定领域定制的固定记忆设计,例如用于对话的语义检索或用于编码的技能重用。然而,为某一目的优化的记忆系统往往无法迁移到其他任务。为了解决这一限制,我们引入了M$^\star$,一种通过可执行程序进化自动发现任务优化记忆框架的方法。具体来说,M$^\star$将代理记忆系统建模为用Python编写的记忆程序。该程序封装了数据模式、存储逻辑和代理工作流指令。我们使用反射式代码进化方法联合优化这些组件;该方法采用基于种群的搜索策略,并分析评估失败以迭代改进候选程序。我们在涵盖对话、具身规划和专家推理的四个不同基准上评估M$^\star$。结果表明,M$^\star$在所有评估任务上稳健地优于现有的固定记忆基线。此外,进化出的记忆程序对每个领域展现出结构不同的处理机制。这一发现表明,针对给定任务特化记忆机制探索了广泛的设计空间,并提供了比通用记忆范式更优的解决方案。

英文摘要

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skill reuse for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ robustly improves performance over existing fixed-memory baselines across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.

2604.03675 2026-05-26 cs.AI cs.CL cs.IR 版本更新

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

OASES:面向智能搜索的结果对齐搜索-评估协同训练

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

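Affiliations: Renmin University of China; Xiaohongshu Inc.; University of Southern California

AI summary: Proposes the OASES framework, which uses outcome-aligned process rewards and search-evaluation co-training to address sparse rewards and unreliable process supervision in agentic search, outperforming strong reinforcement learning baselines on multi-hop QA benchmarks.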

详情
AI中文摘要

智能搜索使语言模型能够通过自适应地多步获取外部证据来解决知识密集型任务。具有可验证奖励的强化学习已成为搜索智能体广泛采用的训练范式,但仅结果奖励是稀疏的,并且对中间搜索动作的信用分配有限。因此,现有的过程奖励方法试图通过代理信号、外部评估器或基于似然的信息增益来密集化监督。然而,代理奖励可能偏离最终结果目标,而固定评估器随着搜索策略的演化可能变得过时,导致不可靠的过程监督。为应对这些挑战,我们提出OASES,一种用于智能搜索的结果对齐搜索-评估监督框架。OASES通过评估每个中间搜索状态对回答原始问题的支持程度,推导出结果对齐的过程奖励。它进一步在策略上协同训练搜索策略和状态评估器,使评估器能够适应演化的搜索行为并提供更可靠的过程奖励。在五个多跳问答基准上的实验表明,OASES始终优于强强化学习基线,进一步分析证实了结果对齐过程奖励和搜索-评估协同训练的优势。

英文摘要

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on-policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.

2603.18444 2026-05-26 cs.LG cs.AI 版本更新

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

折扣Beta-Bernoulli奖励估计用于基于可验证奖励的样本高效强化学习

Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang

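Affiliations: KAIST

AI summary: To address the low sample efficiency of reinforcement learning with verifiable rewards, proposes Discounted Beta-Bernoulli reward estimation, which uses historical reward statistics to reduce estimation variance and avoid variance collapse, yielding significant gains on multiple reasoning benchmarks.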

Comments 14 pages, 3 figures

详情
AI中文摘要

基于可验证奖励的强化学习已成为提升大语言模型推理能力的有效后训练范式。然而,现有的基于组的RLVR方法常遭受严重的样本低效问题。这种低效源于对少量rollout的奖励进行点估计,导致高估计方差、方差崩溃以及生成响应的无效利用。在本工作中,我们从统计估计角度重新审视RLVR,将奖励建模为从策略诱导分布中抽取的样本,并将优势计算视为从有限数据中估计奖励分布的问题。基于此观点,我们提出折扣Beta-Bernoulli奖励估计,该方法利用历史奖励统计量处理非平稳分布。尽管有偏,所得估计量展现出降低且稳定的方差,理论上避免了估计方差崩溃,并在均方误差上优于标准点估计。在六个分布内和三个分布外推理基准上的大量实验表明,使用DBB的GRPO一致优于朴素GRPO,在1.7B和8B模型上分别实现了分布内平均Acc@8提升3.22/2.42点,分布外提升12.49/6.92点,且无需额外计算成本或内存开销。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta-Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
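The abstract does not state the DBB update rule, so the following is only a minimal sketch of one common way to discount Beta-Bernoulli pseudo-counts for binary rewards. The class name, the uniform prior, the discount value, and the use of p(1-p) as the variance estimate are all assumptions made for illustration and may differ from the paper's estimator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DiscountedBetaBernoulli:
    """Running estimate of a prompt's success rate from binary rewards."""

    discount: float = 0.9  # assumed forgetting factor for old statistics
    alpha: float = 1.0     # assumed prior pseudo-count of successes
    beta: float = 1.0      # assumed prior pseudo-count of failures

    def update(self, rewards: Sequence[int]) -> None:
        """Decay the old pseudo-counts, then add the new rollout outcomes."""
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.discount * self.alpha + successes
        self.beta = self.discount * self.beta + failures

    def mean(self) -> float:
        """Posterior-mean style estimate of the success probability."""
        return self.alpha / (self.alpha + self.beta)

    def variance(self) -> float:
        """Bernoulli variance evaluated at the estimated success rate."""
        p = self.mean()
        return p * (1.0 - p)


# Hypothetical group in which every rollout succeeds.
est = DiscountedBetaBernoulli(discount=0.5)
est.update([1, 1, 1, 1])
```

In this example the group's sample variance is zero, so a plain point estimate would give every response the same (zero) advantage, while the discounted estimate keeps the success rate strictly below 1 and the variance strictly positive. That is the kind of variance collapse the abstract refers to; the sketch only illustrates the effect under the stated assumptions.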

2603.17044 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

理解与生成相冲突吗?统一多模态模型DPO的诊断研究

Abinav Rao, Sujan Rachuri

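AI summary: Systematic experiments show that generation quality is hard to align when DPO is applied to unified multimodal models, mainly because understanding and generation gradients are near-orthogonal with an 11-14x magnitude imbalance that stems from VQ token count asymmetry.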

Comments Experiments are inconclusive: The claim that architectures such as Chameleon or Emu would exhibit stronger gradient conflict is not supported by experiments or analysis, and all experiments are conducted on Janus-Pro without evaluation on other unified multimodal architectures

详情
AI中文摘要

统一多模态模型共享一个语言模型骨干来同时进行理解和生成图像。DPO能否同时对齐这两种能力?我们首次系统研究了这一问题,在Janus-Pro的1B和7B参数上应用DPO,采用七种训练策略和两种事后方法。核心发现是负面的:在该架构下,所有测试条件下生成质量都抵制DPO对齐。在7B规模下,没有任何方法能改善生成CLIPScore(|Δ| < 0.2,每个种子n=200,3个种子,p > 0.5);在1B规模下,所有方法都降低了生成质量,并且该结果在偏好数据类型(真实vs生成和模型vs模型)以及测试的数据量(150-288对)上均成立。梯度分析揭示了原因:理解和生成梯度近乎正交(cos ~ 0),且由于VQ token数量不对称(576个生成token vs. ~30-100个文本token),幅度不平衡达到约11-14倍。这种不平衡是多任务DPO中的主要干扰机制;幅度平衡产生了方向正确的理解增量(VQA +0.01-0.04,虽然单独不显著),但生成差距仍然存在。我们识别出离散VQ tokenization是一个可能的结构瓶颈——生成DPO损失收敛到ln(2)支持了这一点——并为使用基于VQ的统一模型的从业者提供了实用指导。

英文摘要

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
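The ln(2) value mentioned in this abstract is what the standard DPO loss evaluates to when the policy separates chosen and rejected samples no better than the reference model, that is, when the preference margin is zero. The sketch below uses the standard per-pair DPO loss; the function name, the default beta, and the example log-probabilities are illustrative assumptions, not values from the paper.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference on both samples: zero margin.
loss_at_zero_margin = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy that has learned to favor the chosen sample: loss falls below ln(2).
loss_after_learning = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```

A loss that stays at ln(2) therefore means the implicit reward margin between preferred and dispreferred generations is not moving away from zero, which is consistent with the abstract's reading that generation resists DPO alignment.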

2602.13203 2026-05-26 cs.NI cs.AI 版本更新

Adversarial Network Imagination: Causal LLMs and Digital Twins for Proactive Telecom Mitigation

对抗性网络想象:因果大语言模型与数字孪生用于主动电信缓解

Vignesh Sriram, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

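Affiliations: Binghamton University

AI summary: Proposes the Adversarial Network Imagination framework, which combines a causal large language model, a knowledge graph, and a digital twin to proactively generate, simulate, and evaluate adversarial network failures, shifting from reactive troubleshooting toward anticipatory resilience analysis.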

详情
AI中文摘要

电信网络会经历复杂的故障,如光纤中断、流量过载和级联中断。现有的监控和数字孪生系统大多是反应式的,仅在服务降级后检测故障。我们提出了对抗性网络想象,一个闭环框架,集成了因果大语言模型(LLM)、知识图谱和数字孪生,以主动生成、模拟和评估对抗性网络故障。因果LLM产生结构化的故障场景,这些场景基于知识图谱中编码的网络依赖关系。这些场景在数字孪生中执行,以测量性能降级并评估缓解策略。通过基于模拟反馈迭代细化场景,该框架将网络操作从被动故障排查转向预期韧性分析。

英文摘要

Telecommunication networks experience complex failures such as fiber cuts, traffic overloads, and cascading outages. Existing monitoring and digital twin systems are largely reactive, detecting failures only after service degradation occurs. We propose Adversarial Network Imagination, a closed-loop framework that integrates a Causal Large Language Model (LLM), a Knowledge Graph, and a Digital Twin to proactively generate, simulate, and evaluate adversarial network failures. The Causal LLM produces structured failure scenarios grounded in network dependencies encoded in the Knowledge Graph. These scenarios are executed within a Digital Twin to measure performance degradation and evaluate mitigation strategies. By iteratively refining scenarios based on simulation feedback, the framework shifts network operations from reactive troubleshooting toward anticipatory resilience analysis.

2602.10527 2026-05-26 cs.CY cs.AI 版本更新

AI-PACE: A Framework for Integrating AI into Medical Education

AI-PACE:将人工智能融入医学教育的框架

Scott P. McGrath, Katherine K. Kim, Karnjit Johl, Haibo Wang, Nick Anderson

Affiliations: Center for Information Technology in the Interest of Society; University of California, Berkeley; School of Medicine, Department of Public Health Sciences; University of California, Davis; School of Medicine, Department of Internal Medicine; Research Centre of Big Data and AI for Medicine; First Affiliated Hospital of Sun Yat-Sen University

AI summary: Based on a literature review, proposes the AI-PACE framework for systematically integrating AI education into every stage of medical training, emphasizing longitudinal integration, interdisciplinary collaboration, and a balance between technical fundamentals and clinical applications.

Comments Version 2: Revisions after round 1 of peer review. Paper under consideration at npj Digital Medicine. 12 pages, 2 figures, 2 tables

详情
AI中文摘要

人工智能(AI)在医疗领域的整合正在加速,然而医学教育尚未跟上这些技术进步的步伐。本文通过对文献的全面分析,综合了当前关于医学教育中人工智能的知识,确定了关键能力、课程方法和实施策略。其目的是强调在医学学习连续体中结构化人工智能教育的迫切需求,并提供一个课程开发框架。研究结果表明,有效的人工智能教育需要在医学培训中纵向整合、跨学科合作,并平衡关注技术基础和临床应用。本文为医学教育者提供了基础,以帮助未来的医生为人工智能增强的医疗环境做好准备。

英文摘要

The integration of artificial intelligence (AI) into healthcare is accelerating, yet medical education has not kept pace with these technological advancements. This paper synthesizes current knowledge on AI in medical education through a comprehensive analysis of the literature, identifying key competencies, curricular approaches, and implementation strategies. The aim is to highlight the critical need for structured AI education across the medical learning continuum and to offer a framework for curriculum development. The findings presented suggest that effective AI education requires longitudinal integration throughout medical training, interdisciplinary collaboration, and balanced attention to both technical fundamentals and clinical applications. This paper serves as a foundation for medical educators seeking to prepare future physicians for an AI-enhanced healthcare environment.

2602.10090 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Agent World Model: 用于智能体强化学习的无限合成环境

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

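Affiliations: University of North Carolina at Chapel Hill

AI summary: Proposes Agent World Model (AWM), a fully synthetic environment generation pipeline; large-scale reinforcement learning in its code-driven, database-backed environments lets agents generalize across diverse everyday scenarios.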

Comments Accepted to ICML 2026

详情
AI中文摘要

近年来,大型语言模型(LLM)的进步使得自主智能体能够与工具和环境进行多轮交互。然而,扩展此类智能体训练受到缺乏多样且可靠环境的限制。在本文中,我们提出了Agent World Model(AWM),一个完全合成的环境生成管道。使用该管道,我们扩展到涵盖日常场景的1000个环境,智能体可以在其中与丰富的工具集交互并获得高质量的观测。值得注意的是,这些环境是代码驱动的并由数据库支持,比由LLM模拟的环境提供更可靠和一致的状态转换。此外,与从现实环境中收集轨迹相比,它们实现了更高效的智能体交互。为了展示该资源的有效性,我们对多轮工具使用智能体进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态,我们还可以设计可靠的奖励函数。在三个基准上的实验表明,仅在合成环境中训练(而非特定于基准的环境)能产生强大的分布外泛化能力。代码可在 https://github.com/Snowflake-Labs/agent-world-model 获取。

英文摘要

Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.

2602.09620 2026-05-26 cs.AI cs.LO 版本更新

FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints

FLINGO -- 将 ASP 表达力注入线性整数约束

Jorge Fandinno, Pedro Cabalar, Philipp Wanko, Torsten Schaub

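Affiliations: University of Corunna; University of Nebraska Omaha; University of Potsdam

AI summary: Presents the FLINGO language and tool, which extend Constraint Answer Set Programming by bringing ASP expressiveness (default values, undefined attributes, non-deterministic choice, and aggregates) into numerical constraints, together with a translation to the clingcon input format.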

Comments To appear in Theory and Practice of Logic Programming

详情
AI中文摘要

约束回答集编程(CASP)是一种混合范式,它通过数值约束处理丰富了回答集编程(ASP),这是许多实际应用的关键需求。然而,大多数 CASP 求解器中约束的规范更接近于数值后端的表达力和语义,而非 ASP 范式。在 ASP 中,数值属性被表示为谓词,允许声明默认值、使属性未定义、使用选择规则进行非确定性赋值或使用聚合值。在 CASP 中,一旦我们切换到这些属性的基于约束的表示,这些特性中的大多数(如果不是全部)就会丢失。在本文中,我们提出了 flingo 语言(和工具),它将上述表达力融入数值约束中,并通过多个示例说明了其使用。基于先前建立其语义基础的工作,我们还提出了从新引入的 flingo 语法到遵循 clingcon 输入格式的常规 CASP 程序的翻译。

英文摘要

Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, a crucial requirement for many real-world applications. However, the specification of constraints in most CASP solvers aligns more closely with the expressiveness and semantics of the numerical back-end than the ASP paradigm. In the latter, numerical attributes are represented as predicates, which allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules, or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the flingo language (and tool) that incorporates the aforementioned expressiveness within numerical constraints, and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced flingo syntax to regular CASP programs following the clingcon input format.

2602.03955 2026-05-26 cs.AI cs.MA 版本更新

AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

AgentArk:将多智能体智能蒸馏到单个LLM智能体中

Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang

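Affiliations: Carnegie Mellon University; William & Mary; Georgia Institute of Technology; Amazon; University of British Columbia

AI summary: Proposes the AgentArk framework, which uses three hierarchical distillation strategies to distill the interaction dynamics of multi-agent systems into the weights of a single model, giving one agent multi-agent reasoning and self-correction abilities while remaining computationally efficient.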

详情
AI中文摘要

虽然大型语言模型(LLM)多智能体系统通过迭代辩论实现了卓越的推理性能,但实际部署受到高计算成本和错误传播的限制。本文提出AgentArk,一种新颖的框架,将多智能体动态蒸馏到单个模型的权重中,有效地将显式的测试时交互转化为隐式的模型能力。这使得单个智能体在保持计算效率的同时具备多智能体系统的智能。具体来说,我们研究了跨多种模型、任务、规模和场景的三种分层蒸馏策略:推理增强微调;基于轨迹的增强;以及过程感知蒸馏。通过将计算负担从推理转移到训练,蒸馏后的模型在保持单个智能体效率的同时,展现出多个智能体的强推理和自校正性能。它们还在各种推理任务中表现出增强的鲁棒性和泛化能力。我们希望这项工作能为未来高效且鲁棒的多智能体开发研究提供启示。我们的代码位于https://github.com/AIFrontierLab/AgentArk。

英文摘要

While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework to distill multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi-agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning-enhanced fine-tuning; trajectory-based augmentation; and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi-agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.

2602.01576 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Generative Visual Code Mobile World Models

生成式视觉代码移动世界模型

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

发表机构 * Trillion Labs(万亿实验室)

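AI Summary Proposes generating the next mobile GUI state by having a single vision-language model predict executable web code, combining the strengths of text-based and visual world models to achieve high-fidelity visual generation with precise text rendering.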

Comments ICML 2026

详情
AI中文摘要

移动图形用户界面世界模型为在训练和推理时提升移动GUI代理性能提供了有前景的路径。然而,当前方法面临关键权衡:基于文本的世界模型牺牲了视觉保真度,而视觉世界模型在精确文本渲染上的不足导致其依赖缓慢、复杂的流水线和大量外部模型。我们提出一种新范式:通过可渲染代码生成进行视觉世界建模,其中单一视觉语言模型预测下一个GUI状态为可执行网页代码,该代码渲染为像素,而非直接生成像素。这结合了两种方法的优势:视觉语言模型保留其语言先验以实现精确文本渲染,同时其在结构化网页代码上的预训练实现了高保真视觉生成。我们推出了gWorld(8B、32B),这是基于该范式的首个开源权重视觉移动GUI世界模型,以及一个自动合成基于代码的训练数据的数据生成框架(gWorld)。在4个分布内和2个分布外基准测试的广泛评估中,gWorld在准确率与模型规模之间建立了新的帕累托前沿,性能优于8个前沿开源权重模型(其规模大50.25倍以上)。进一步分析表明:(1)通过gWorld扩展训练数据带来有意义的收益;(2)我们流水线的每个组件都提高了数据质量;(3)更强的世界建模提升了下游移动GUI策略性能。

英文摘要

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

2602.01086 2026-05-26 cs.AI cs.CR cs.DB cs.DC cs.SE 版本更新

MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI

MedBeads:面向可信医疗AI的智能体原生不可变数据基底

Takahito Nakajima

发表机构 * Diagnostic Imaging and Interventional Radiology, Institute of Medicine, University of Tsukuba(筑波大学医学研究院诊断影像与介入放射学部) Center for Cyber Medicine Research, University of Tsukuba(筑波大学网络医学研究中心)

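AI Summary To address the context mismatch between electronic medical records and agents in medical AI, proposes MedBeads, an immutable data architecture built on a Merkle directed acyclic graph that replaces probabilistic retrieval with deterministic graph traversal to supply auditable, tamper-evident clinical context.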

Comments 19 pages, 5 figures. Code available at https://github.com/medbeads/medbeads

详情
AI中文摘要

背景:截至2026年,大型语言模型(LLM)展现出专家级医学知识。然而,将其部署为自主“临床智能体”仍受限。当前的电子病历(EMR)及FHIR等标准专为人工审阅设计,导致“上下文不匹配”:AI智能体接收碎片化数据,必须依赖概率推理(如RAG)重建患者病史。该方法引发幻觉并阻碍可审计性。方法:我们提出MedBeads,一种智能体原生数据基础设施,其中临床事件是不可变的“珠子”——Merkle有向无环图(DAG)中的节点——通过密码学引用因果前驱。这种“一次写入、多次读取”架构使篡改在数学上可检测。我们实现了原型,包含Go核心引擎、用于LLM集成的Python中间件以及基于React的可视化界面。结果:我们使用合成数据成功实现了工作流。FHIR到DAG的转换将扁平资源转化为因果关联图。我们的广度优先搜索(BFS)上下文检索算法以O(V+E)复杂度遍历相关子图,支持实时决策支持。防篡改特性由设计保证:任何修改都会破坏密码学链。可视化通过显式因果链接帮助临床医生理解。结论:MedBeads通过从概率搜索转向确定性图遍历、从可变记录转向不可变链,解决了“上下文不匹配”,为“可信医疗AI”提供了基底。它保证了AI接收的上下文是确定且防篡改的,而LLM负责解释。结构化的珠子格式充当了令牌高效的“AI原生语言”。我们将MedBeads作为开源软件发布,以加速智能体原生数据标准。

英文摘要

Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Clinical Agents" remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a "Context Mismatch": AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable "Beads"--nodes in a Merkle Directed Acyclic Graph (DAG)--cryptographically referencing causal predecessors. This "write-once, read-many" architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the "Context Mismatch" by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for "Trustworthy Medical AI." It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient "AI-native language." We release MedBeads as open-source software to accelerate agent-native data standards.
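
The Merkle DAG and breadth-first retrieval described in the abstract above can be illustrated with a minimal Python sketch. It is not the MedBeads implementation (the abstract describes a Go core engine); the `BeadStore` class, the `bead_id` helper, the JSON serialization, and the sample clinical events are all hypothetical names and data chosen for illustration. Each bead's ID is a SHA-256 hash over its payload and its parents' IDs, so editing any stored bead makes its recomputed hash disagree with its key.

```python
import hashlib
import json
from collections import deque


def bead_id(payload, parents):
    """Content address: SHA-256 over the payload and the sorted parent IDs."""
    body = json.dumps({"payload": payload, "parents": sorted(parents)}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class BeadStore:
    """Append-only store of beads keyed by their content hash (illustrative only)."""

    def __init__(self):
        self.beads = {}

    def append(self, payload, parents=()):
        # A bead may only reference predecessors that already exist, which keeps the graph acyclic.
        for parent in parents:
            if parent not in self.beads:
                raise KeyError(f"unknown parent bead: {parent}")
        bid = bead_id(payload, parents)
        self.beads[bid] = {"payload": payload, "parents": list(parents)}
        return bid

    def verify(self):
        """Recompute every hash; an edited payload or link no longer matches its key."""
        return all(bead_id(b["payload"], b["parents"]) == bid for bid, b in self.beads.items())

    def context(self, start):
        """Breadth-first walk over causal predecessors; visits each node and edge once, O(V + E)."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            bid = queue.popleft()
            order.append(self.beads[bid]["payload"])
            for parent in self.beads[bid]["parents"]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return order


# Synthetic example: one visit leads to two tests, which both lead to a diagnosis.
store = BeadStore()
visit = store.append({"event": "visit", "note": "chest pain"})
ecg = store.append({"event": "ecg", "result": "normal"}, [visit])
lab = store.append({"event": "troponin", "value": 0.01}, [visit])
dx = store.append({"event": "diagnosis", "text": "non-cardiac"}, [ecg, lab])

events = [p["event"] for p in store.context(dx)]
intact_before = store.verify()
store.beads[lab]["payload"]["value"] = 9.9  # simulate tampering with a stored bead
intact_after = store.verify()
```

In this sketch, retrieval is deterministic because the traversal follows explicit parent links rather than similarity scores, and the `verify` pass is what makes tampering detectable rather than impossible.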

2601.22984 2026-05-26 cs.AI 版本更新

Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

为什么你的深度研究智能体会失败?关于完整研究轨迹中的幻觉评估

Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

发表机构 * Zhejiang University(浙江大学) The University of Hong Kong(香港大学)

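AI Summary To address hallucinations that accumulate across the full research trajectory of Deep Research Agents (DRAs), proposes the PING taxonomy and a fine-grained evaluation framework that shift from outcome-based to process-aware evaluation, and builds the DeepHalluBench benchmark; experiments reveal systematic reliability gaps.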

详情
AI中文摘要

诊断深度研究智能体(DRA)的失败模式仍然是一个关键挑战。现有基准主要依赖端到端评估,掩盖了在研究轨迹中累积的中间幻觉。为弥补这一差距,我们提出从基于结果的评估转向过程感知评估,通过审计完整计划-搜索-总结轨迹中的幻觉。我们引入PING分类法,将DRA幻觉分为四种互补类型:传播、意图、噪声诱导和接地。我们进一步将该分类法实例化为一个细粒度评估框架,将轨迹分解为原子动作、声明和子查询以进行严格验证。利用该框架隔离100个特别容易产生幻觉的任务(包括对抗性场景),我们策划了DeepHalluBench。对六个代表性DRA的实验表明,在我们的幻觉压力测试集上,所有评估系统仍表现出不可忽视的可靠性差距。此外,我们的诊断分析将这些失败追溯到系统性缺陷,特别是幻觉传播和认知偏差,为未来的架构优化提供了可操作的见解。代码和数据可在https://github.com/yuhao-zhan/DeepHalluBench获取。

英文摘要

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.

2601.21726 2026-05-26 cs.AI 版本更新

DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

DropoutTS: 用于鲁棒时间序列预测的样本自适应Dropout

Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang

发表机构 * The Hong Kong University of Science(香港科技大学) Chinese Academy of Sciences, China(中国科学院)

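AI Summary To address the noise sensitivity of deep time series models, proposes DropoutTS, a model-agnostic plugin that quantifies instance-level noise through spectral sparsity and dynamically adjusts the dropout rate, suppressing spurious fluctuations while preserving fine-grained fidelity and markedly improving robustness with almost no added parameters.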

详情
AI中文摘要

深度时间序列模型容易受到现实应用中普遍存在的噪声数据的影响。现有的鲁棒性策略要么修剪数据,要么依赖昂贵的先验量化,无法在有效性和效率之间取得平衡。在本文中,我们引入了DropoutTS,一种模型无关的插件,它将范式从学习“什么”转变为学习“多少”。DropoutTS采用样本自适应Dropout机制:利用频谱稀疏性通过重建残差高效量化实例级噪声,它通过将噪声映射到自适应Dropout率来动态校准模型学习能力——选择性地抑制伪波动,同时保持细粒度保真度。跨不同噪声场景和开放基准的大量实验表明,DropoutTS持续提升优秀骨干模型的性能,在几乎不增加参数且无需修改架构的情况下提供先进的鲁棒性。我们的代码可在https://github.com/CityMind-Lab/DropoutTS获取。

英文摘要

Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show that DropoutTS consistently improves the performance of strong backbones, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.
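
The idea of scoring per-sample noise from a sparse spectral reconstruction and mapping it to a dropout rate can be sketched as follows. This is a toy illustration under stated assumptions, not the DropoutTS code: the function names, the choice to keep the `keep` strongest frequencies, the energy-ratio score, the linear mapping, and the bounds `p_min` and `p_max` are all hypothetical, and the sketch assumes that noisier samples should receive a higher dropout rate. A naive DFT from the standard library is used to keep the example dependency-free.

```python
import cmath
import math
import random


def noise_score(series, keep=2):
    """Share of signal energy left after keeping only the `keep` strongest frequencies."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]
    half = n // 2
    # Naive O(n^2) DFT over the non-negative frequencies of a real signal.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(half + 1)
    ]
    strongest = sorted(range(half + 1), key=lambda k: abs(spectrum[k]))[-keep:]
    recon = [0.0] * n
    for k in strongest:
        # Bins 0 and n/2 occur once; every other bin has a mirrored conjugate partner.
        weight = 1.0 if k == 0 or 2 * k == n else 2.0
        for t in range(n):
            recon[t] += weight * (spectrum[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
    total = sum(v * v for v in x)
    if total == 0.0:
        return 0.0
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / total


def adaptive_dropout_rate(score, p_min=0.05, p_max=0.5):
    """Linearly map a noise score in [0, 1] to a dropout rate (assumed mapping)."""
    return p_min + (p_max - p_min) * min(max(score, 0.0), 1.0)


# A clean sinusoid is spectrally sparse; adding Gaussian noise spreads energy across bins.
rng = random.Random(0)
n = 64
clean = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
noisy = [v + rng.gauss(0.0, 1.0) for v in clean]
clean_rate = adaptive_dropout_rate(noise_score(clean))
noisy_rate = adaptive_dropout_rate(noise_score(noisy))
```

The clean series is reconstructed almost exactly from a single frequency bin, so it lands near `p_min`, while the noisy series leaves a large residual and is assigned a higher rate.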

2601.16091 2026-05-26 cs.MA cs.AI cs.LG 版本更新

Delayed Assignments in Online Non-Centroid Clustering with Stochastic Arrivals

随机到达的在线非质心聚类中的延迟分配

Saar Cohen

发表机构 * Bar Ilan University(巴伊兰大学) University of Oxford(牛津大学)

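AI Summary For the stochastic arrival model, proposes a constant-competitive online non-centroid clustering algorithm that allows delayed assignments to balance clustering distance costs against delay costs.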

Comments To Appear in the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2026

详情
AI中文摘要

聚类是一个基本问题,旨在将一组元素(如智能体或数据点)划分为若干簇,使得同一簇内的元素彼此之间的距离小于与其他簇内元素的距离。本文提出了一个研究带延迟的在线非质心聚类的新框架,其中元素作为有限度量空间中的点逐个到达,应被分配到簇中,但分配不必立即进行。具体而言,每个点到达时其位置被揭示,在线算法必须不可撤销地将其分配到现有簇或创建一个新簇(此时仅包含该点)。然而,我们允许以延迟成本为代价推迟决策,而不是遵循更常见的到达时立即决策的假设。这带来了一个关键挑战:目标是最小化每个簇内点之间的总距离成本以及因推迟分配而产生的总延迟成本。在经典的坏情况到达模型(点以任意顺序到达)中,没有算法的竞争比优于点数的次对数。为克服这一强不可能性,我们专注于随机到达模型,其中点的位置随时间独立地从有限度量空间上的一个未知固定概率分布中抽取。我们提供了超越坏情况对手的希望:设计了一个常数竞争的算法,即随着点数的增长,输出聚类的期望总成本与最优离线聚类的总成本之比被一个常数所界。

英文摘要

Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in the same cluster are closer to each other than to those in other clusters. In this paper, we present a new framework for studying online non-centroid clustering with delays, where elements that arrive one at a time as points in a finite metric space should be assigned to clusters, but assignments need not be immediate. Specifically, upon arrival, each point's location is revealed, and an online algorithm has to irrevocably assign it to an existing cluster or create a new one containing, at this moment, only this point. However, we allow decisions to be postponed at a delay cost, instead of following the more common assumption of immediate decisions upon arrival. This poses a critical challenge: the goal is to minimize both the total distance costs between points in each cluster and the overall delay costs incurred by postponing assignments. In the classic worst-case arrival model, where points arrive in an arbitrary order, no algorithm has a competitive ratio better than sublogarithmic in the number of points. To overcome this strong impossibility, we focus on a stochastic arrival model, where points' locations are drawn independently across time from an unknown and fixed probability distribution over the finite metric space. We offer hope for beyond worst-case adversaries: we devise an algorithm that is constant competitive in the sense that, as the number of points grows, the ratio between the expected overall costs of the output clustering and an optimal offline clustering is bounded by a constant.

2601.10457 2026-05-26 cs.AI 版本更新

NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

NSR-Boost:一种面向工业遗留模型的神经符号残差提升框架

Ziming Dai, Dabiao Ma, Jinle Tong, Mengyuan Han, Jian Yang, Hongtao Liu, Haojun Fei, Qing Yang

发表机构 * Tianjin University(天津大学) Qfin Holdings, Inc.(Qfin控股公司)

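AI Summary To address the high cost and risk of upgrading industrial legacy models, proposes NSR-Boost, a non-intrusive neuro-symbolic residual boosting framework that locates hard regions via residuals, has an LLM generate symbolic experts, and integrates them dynamically through a lightweight aggregator, significantly improving performance and reducing the bad rate.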

Comments Accepted by KDD 2026

详情
AI中文摘要

尽管梯度提升决策树(GBDTs)主导了工业表格应用,但在高并发生产环境中升级遗留模型仍面临高昂的重新训练成本和系统性风险。为解决这一问题,我们提出了NSR-Boost,一种专门为工业场景设计的神经符号残差提升框架。其核心优势在于“非侵入性”。它将遗留模型视为冻结模型,并对预测失败的“困难区域”进行针对性修复。该框架包括三个关键阶段:首先,通过残差发现困难区域;然后,利用大型语言模型(LLM)生成符号代码结构,并通过贝叶斯优化微调参数,从而生成可解释的专家;最后,通过轻量聚合器将专家与遗留模型输出动态集成。实验结果表明,该框架在六个公共数据集和一个私有数据集上显著优于最先进的(SOTA)基线。更重要的是,我们报告了NSR-Boost在Qfin Holdings的核心金融风险控制系统中的成功部署,实际在线流量的实证结果显示出卓越的性能改进和坏账率的显著降低。总之,它有效捕获了传统模型遗漏的长尾风险,并为工业提供了一种安全、低成本的演进范式。

英文摘要

Although Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency production environments still faces prohibitive retraining costs and systemic risks. To address this problem, we present NSR-Boost, a neuro-symbolic residual boosting framework designed specifically for industrial scenarios. Its core advantage lies in being "non-intrusive": it treats the legacy model as frozen and performs targeted repairs on "hard regions" where predictions fail. The framework comprises three key stages: first, it finds hard regions through residuals; then, it generates interpretable experts by having a Large Language Model (LLM) produce symbolic code structures and fine-tuning their parameters with Bayesian optimization; finally, it dynamically integrates the experts with the legacy model's output through a lightweight aggregator. Experimental results demonstrate that the framework significantly outperforms state-of-the-art (SOTA) baselines across six public datasets and one private dataset. More importantly, we report the successful deployment of NSR-Boost within the core financial risk control system of Qfin Holdings, where empirical results on real-world online traffic exhibit superior performance improvements and a significant reduction in the bad rate. In conclusion, it effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry.

2601.10201 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

未来KL正则化GRPO:基于f-散度正则化的过程级信用分配

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

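AI Summary Proposes Future-KL Regularized Policy Optimization (FRPO), which uses a causal future-regularization return-to-go to restore the gradient signal that the local KL loss in GRPO omits; on mathematical reasoning tasks it improves pass@16 while maintaining higher entropy and lower policy drift.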

详情
AI中文摘要

组相对策略优化(GRPO)广泛用于无评论家的大语言模型(LLM)后训练,但其KL正则化通常作为局部损失侧的token惩罚实现。我们表明这遗漏了自回归KL正则化诱导的策略梯度信号。与标准KL正则化强化学习(RL)目标不同,GRPO的组归一化引入非线性提示级效用;对于二元验证器奖励,该效用为$2\arcsin\sqrt p$。因此,奖励和KL在归一化前无法融合而不改变隐式目标。我们推导了具有token级$f$-散度正则化的GRPO风格目标的on-policy梯度。奖励项恢复标准化的GRPO优势,而正则化项包括局部KL损失遗漏的因果未来正则化回报。对于反向KL,这产生简单的未来KL修正:在优势构建后添加每个token对数比的反向累积和。由此产生的方法,未来KL正则化策略优化(FRPO),不需要评论家或额外的模型传递。在数学推理任务上,FRPO在我们的主要大模型设置中提高了pass@16,同时保持比传统损失侧KL基线更高的熵和更低的策略漂移。

英文摘要

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
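
Two details in the abstract above lend themselves to a short numerical sketch. The first is the "reverse cumulative sum of per-token log ratios"; the second is the utility $2\arcsin\sqrt p$, whose derivative is $1/\sqrt{p(1-p)}$, the reciprocal of the standard deviation of a binary reward. The Python below is an illustration only, not the authors' code: the function names, the coefficient `beta`, the sign convention (subtracting the future sum as a penalty), and whether the suffix sum includes the current token are all assumptions.

```python
import math


def reverse_cumsum(values):
    """Suffix sums: out[t] = values[t] + values[t + 1] + ... + values[-1]."""
    out, running = [], 0.0
    for v in reversed(values):
        running += v
        out.append(running)
    return out[::-1]


def future_kl_adjusted_advantages(advantage, log_ratios, beta):
    """Apply a future log-ratio correction after the advantage has been built.

    advantage:  group-standardized advantage shared by every token of one response
    log_ratios: per-token log pi_theta(y_t | ...) - log pi_ref(y_t | ...)
    beta:       regularization strength (hypothetical name; sign convention assumed)
    """
    return [advantage - beta * future for future in reverse_cumsum(log_ratios)]


def utility(p):
    """Prompt-level utility stated in the abstract for binary verifier rewards."""
    return 2.0 * math.asin(math.sqrt(p))


log_ratios = [0.1, -0.2, 0.3]
future = reverse_cumsum(log_ratios)
adjusted = future_kl_adjusted_advantages(1.0, log_ratios, beta=0.5)

# Central-difference check that d/dp [2 asin(sqrt(p))] equals 1 / sqrt(p (1 - p)).
p, h = 0.3, 1e-6
slope = (utility(p + h) - utility(p - h)) / (2 * h)
expected_slope = 1.0 / math.sqrt(p * (1.0 - p))
```

Because each token's correction depends only on the log ratios at and after its own position, earlier tokens are charged for the drift they lead into, which is the "causal future" aspect the abstract refers to.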

2601.06366 2026-05-26 cs.CR cs.AI 版本更新

SafeGPT: Preventing Data Leakage and Unethical Outputs in Enterprise LLM Use

SafeGPT:防止企业LLM使用中的数据泄露和不道德输出

Pratyush Desai, Luoxi Tang, Yuqiao Meng, Zhaohan Xi

发表机构 * Binghamton University(宾汉姆顿大学)

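AI Summary Proposes SafeGPT, a two-sided guardrail system that combines input-side detection/redaction, output-side moderation/reframing, and human-in-the-loop feedback to reduce data leakage risk and biased outputs.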

详情
AI中文摘要

大型语言模型(LLM)正在改变企业工作流程,但当员工无意中共享机密数据或生成违反政策的内容时,会带来安全和伦理挑战。本文提出SafeGPT,一个双护栏系统,防止敏感数据泄露和不道德输出。SafeGPT集成了输入侧检测/编辑、输出侧审核/重构以及人工反馈循环。实验表明,SafeGPT有效降低了数据泄露风险和偏见输出,同时保持了用户满意度。

英文摘要

Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system preventing sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection/redaction, output-side moderation/reframing, and human-in-the-loop feedback. Experiments demonstrate that SafeGPT effectively reduces data leakage risk and biased outputs while maintaining user satisfaction.
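
The input-side detection/redaction step mentioned above can be pictured with a small Python sketch. The abstract does not describe how SafeGPT detects sensitive data, so everything here is an assumption made for illustration: the `redact` function, the two regular expressions, and the placeholder labels are hypothetical, and a real guardrail would need far broader coverage than pattern matching.

```python
import re

# Hypothetical detectors; each label maps to a pattern for one kind of sensitive value.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(prompt):
    """Replace detected values with placeholders and report what was found."""
    findings = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            findings.append((label, count))
    return prompt, findings


safe_prompt, findings = redact("Email jane.doe@example.com about SSN 123-45-6789.")
```

Returning the findings alongside the redacted text leaves room for a human reviewer to confirm or override the decision, in the spirit of the human-in-the-loop feedback the abstract mentions.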

2601.06201 2026-05-26 cs.SE cs.AI 版本更新

RiskBridge: Turning CVEs into Business-Aligned Patch Priorities

RiskBridge:将CVE转化为业务对齐的补丁优先级

Yelena Mujibur Sheikh, Awez Akhtar Khatik, Luoxi Tang, Yuqiao Meng, Zhaohan Xi

发表机构 * Binghamton University(宾汉姆顿大学)

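AI Summary Proposes RiskBridge, a framework that integrates multi-source intelligence from CVSS v4, EPSS, and CISA KEV with a probabilistic model, a policy engine, and an ROI optimizer to generate dynamic, business-aligned patch priorities, significantly reducing residual risk and improving remediation efficiency.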

详情
AI中文摘要

企业面临着前所未有的网络安全漏洞激增,每月有数千个新的CVE被披露。传统的优先级框架如CVSS提供静态严重性指标,未能考虑利用概率、合规紧迫性和运营影响,导致修复效率低下且延迟。本文介绍了RiskBridge,一个可解释且合规感知的漏洞管理框架,它整合了来自CVSS v4、EPSS和CISA KEV的多源情报,以生成动态的、业务对齐的补丁优先级。RiskBridge采用概率性零日暴露模拟(ZDES)模型来预测近期的利用可能性,一个策略即代码引擎将监管要求(如PCI DSS、NIST SP 800-53)转化为自动化的SLA逻辑,以及一个ROI驱动的优化器,以最大化每次修复工作的累积风险降低。使用实时CVE数据集的实验评估表明,与最先进的商业基线相比,残余风险降低了88%,SLA合规性提高了18天,修复效率提高了35%。这些发现验证了RiskBridge作为一个实用且可审计的决策智能系统,统一了概率建模、合规推理和优化分析。该框架代表了向现代企业环境中自动化、可解释且以业务为中心的漏洞管理迈出的一步。

英文摘要

Enterprises are confronted with an unprecedented escalation in cybersecurity vulnerabilities, with thousands of new CVEs disclosed each month. Conventional prioritization frameworks such as CVSS offer static severity metrics that fail to account for exploit probability, compliance urgency, and operational impact, resulting in inefficient and delayed remediation. This paper introduces RiskBridge, an explainable and compliance-aware vulnerability management framework that integrates multi-source intelligence from CVSS v4, EPSS, and CISA KEV to produce dynamic, business-aligned patch priorities. RiskBridge employs a probabilistic Zero-Day Exposure Simulation (ZDES) model to forecast near-term exploit likelihood, a Policy-as-Code Engine to translate regulatory mandates (e.g., PCI DSS, NIST SP 800-53) into automated SLA logic, and an ROI-driven Optimizer to maximize cumulative risk reduction per remediation effort. Experimental evaluations using live CVE datasets demonstrate an 88% reduction in residual risk, an 18-day improvement in SLA compliance, and a 35% increase in remediation efficiency compared to state-of-the-art commercial baselines. These findings validate RiskBridge as a practical and auditable decision-intelligence system that unifies probabilistic modeling, compliance reasoning, and optimization analytics. The framework represents a step toward automated, explainable, and business-centric vulnerability management in modern enterprise environments.

2601.03191 2026-05-26 cs.CV cs.AI cs.LG 版本更新

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

AnatomiX:一种解剖学感知的胸部X光解读多模态大语言模型

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

发表机构 * Hasso Plattner Institute(哈索·普拉特纳研究所) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

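AI Summary Proposes AnatomiX, a two-stage anatomy-aware multimodal large language model that first identifies anatomical structures and then performs downstream tasks, improving on existing methods by more than 25% on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning.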

详情
AI中文摘要

多模态医学大语言模型在胸部X光解读方面取得了显著进展,但在空间推理和解剖学理解方面仍面临挑战。尽管现有的定位技术提高了整体性能,但它们往往未能建立真正的解剖对应关系,导致医学领域中的解剖理解错误。为弥补这一差距,我们引入了AnatomiX,一种用于解剖学定位的胸部X光解读的多任务多模态大语言模型。受放射学工作流程启发,AnatomiX采用两阶段方法:首先识别解剖结构并提取其特征,然后利用大语言模型执行多种下游任务,如短语定位、报告生成、视觉问答和图像理解。在多个基准上的大量实验表明,与现有方法相比,AnatomiX实现了卓越的解剖推理,并在解剖定位、短语定位、定位诊断和定位描述任务上性能提升超过25%。代码和预训练模型可在 https://aneesurhashmi.github.io/anatomix 获取。

英文摘要

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://aneesurhashmi.github.io/anatomix

2601.02589 2026-05-26 cs.CL cs.AI 版本更新

FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

FlowPlan-G2P:一种将科学论文转化为专利描述的结构化生成框架

Kris W Pan, Yongmin Yoo

发表机构 * Amazon(亚马逊公司) Macquarie University(麦考瑞大学)

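AI Summary Proposes FlowPlan-G2P, a graph-mediated generation framework that decomposes the conversion of scientific papers into patent-compliant descriptions into three stages (concept graph induction, section-level planning, and graph-conditioned generation) and outperforms large proprietary models under domain-specific evaluation.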

详情
AI中文摘要

由于科学论文与专利在修辞和结构上的根本差异,从科学论文生成专利描述具有挑战性。现有方法将其视为表面改写,未能捕捉专利起草中固有的层次推理和法定约束。我们提出FlowPlan-G2P,一种图介导的生成框架,将该转换分解为三个阶段:(1)概念图归纳,将技术实体和功能依赖提取为有向图;(2)章节级规划,将图划分为与规范专利章节对齐的连贯子图;(3)图条件生成,基于章节特定子图合成符合法律要求的段落。在专家验证基准上的实验表明,标准NLG指标系统性偏好法律不合规输出而非有效专利描述,这促使我们进行领域特定评估。在该评估下,使用开放权重骨干的FlowPlan-G2P始终优于原始专有模型,表明结构化分解比模型规模更能决定质量。

英文摘要

Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the two genres. Existing approaches treat this as surface-level rewriting, failing to capture the hierarchical reasoning and statutory constraints inherent in patent drafting. We propose FlowPlan-G2P, a graph-mediated generation framework that decomposes this transformation into three stages: (1) Concept Graph Induction, extracting technical entities and functional dependencies into a directed graph; (2) Section-level Planning, partitioning the graph into coherent subgraphs aligned with canonical patent sections; and (3) Graph-Conditioned Generation, synthesizing legally compliant paragraphs conditioned on section-specific subgraphs. Experiments on expert-validated benchmarks reveal that standard NLG metrics systematically favor legally non-compliant outputs over valid patent descriptions, motivating our domain-specific evaluation. Under this evaluation, FlowPlan-G2P with an open-weight backbone consistently outperforms vanilla proprietary models, demonstrating that structured decomposition is a stronger determinant of quality than model scale.

2512.19097 2026-05-26 cs.LG cs.AI 版本更新

DIVER-1: Scaling Intracranial EEG Foundation Models for Transferable Representations

DIVER-1: 扩展颅内脑电图基础模型以实现可迁移表示

Danny Dongyeop Han, Yonghyeon Gwon, Ahhyun Lucy Lee, Taeyang Lee, Seong Jin Lee, Jubin Choi, Sebin Lee, Jihyun Bang, Seungju Lee, David Keetae Park, Shinjae Yoo, Chun Kee Chung, Jiook Cha

发表机构 * Seoul National University(首尔国立大学) Brookhaven National Laboratory(布鲁克海文国家实验室)

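AI Summary Proposes DIVER-1, a self-supervised iEEG foundation model that handles variable inputs through designs such as any-variate electrode-time attention and spatio-temporal resampling; pretrained on 5,310 hours of ECoG and SEEG, it surpasses existing models on cognitive decoding and seizure detection and presents the first controlled compute-aware scaling study.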

Comments 31 pages, 12 figures, 14 tables

详情
AI中文摘要

颅内脑电图(iEEG)提供直接、毫秒级的人类神经活动记录,但由于电极布局、解剖覆盖、参考方案和记录条件在不同患者和中心之间存在差异,可重用的表示学习变得困难。我们引入了DIVER-1,一个用于可变输入记录的自监督iEEG基础模型,它结合了任意变量电极-时间注意力、时空重采样、输入条件位置嵌入和多域掩码重建,而不假设固定的电极布局。我们在5310小时的ECoG和SEEG上预训练了两个变体DIVER-1-0.1s和DIVER-1-1s,涵盖352k通道小时,大约是BrainTreeBank预训练量的54倍。我们在两个保留基准上评估DIVER-1:用于自然认知解码的Neuroprobe和用于癫痫检测的MAYO。在考虑泄漏的Neuroprobe上,尽管预训练时未使用构成Neuroprobe语料库的BrainTreeBank记录,DIVER-1-0.1s仍优于先前评估的iEEG基础模型;它在平均AUROC上也超过了线性频谱图解码器,并与更强的非线性基线保持竞争力,这是先前评估的iEEG基础模型未能达到的水平。DIVER-1-1s在MAYO癫痫检测上也取得了最高的AUROC。最后,我们进行了据我们所知首次受控计算感知的自监督iEEG预训练扩展研究,扫描了数据规模、受试者数量、训练时长和模型大小(高达1.8B参数)。我们的结果表明存在数据受限区域:扩展独特记录和充分训练是比单纯增加参数数量更可靠的扩展轴。代码可在链接处获取。

英文摘要

Intracranial EEG (iEEG) provides direct, millisecond-scale recordings of human neural activity, but reusable representation learning is difficult because electrode layouts, anatomical coverage, referencing schemes, and recording conditions vary across patients and centers. We introduce DIVER-1, a self-supervised iEEG foundation model for variable-input recordings that combines any-variate electrode-time attention, spatio-temporal resampling, input-conditioned positional embeddings, and multi-domain masked reconstruction without assuming a fixed electrode montage. We pretrain two variants, DIVER-1-0.1s and DIVER-1-1s, on 5,310 hours of ECoG and SEEG spanning 352k channel-hours, roughly 54x the BrainTreeBank-based pretraining volume. We evaluate DIVER-1 on two held-out benchmarks: Neuroprobe for naturalistic cognitive decoding and MAYO for seizure detection. On leakage-aware Neuroprobe, DIVER-1-0.1s outperforms prior evaluated iEEG foundation models despite using no BrainTreeBank recordings, the corpus underlying Neuroprobe, during pretraining; it also exceeds the linear spectrogram decoder in mean AUROC and remains competitive with stronger nonlinear baselines, a level prior evaluated iEEG foundation models did not reach. DIVER-1-1s also achieves the top AUROC on MAYO seizure detection. Finally, we conduct, to our knowledge, the first controlled compute-aware scaling study for self-supervised iEEG pretraining, sweeping data scale, subject count, training duration, and model size up to 1.8B parameters. Our results indicate a data-constrained regime: expanding unique recordings and training sufficiently long are more reliable scaling axes than increasing parameter count alone. Code is available at link.

2512.12576 2026-05-26 cs.CL cs.AI 版本更新

Coupled Variational Reinforcement Learning for Language Model General Reasoning

耦合变分强化学习用于语言模型通用推理

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学)

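AI Summary Proposes CoVRL, which couples prior and posterior distributions through a hybrid sampling strategy to combine variational inference with reinforcement learning, addressing the inefficient exploration and trace-answer incoherence of verifier-free reinforcement learning and improving performance on mathematical and general reasoning benchmarks.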

Comments Accepted to ICML 2026

详情
AI中文摘要

虽然强化学习在语言模型推理方面取得了显著进展,但它受到可验证奖励要求的限制。最近的无验证器强化学习方法通过利用LLM生成参考答案的概率作为奖励信号来解决这一限制。然而,这些方法通常仅基于问题采样推理轨迹。这种设计将推理轨迹采样与答案信息解耦,导致探索效率低下以及轨迹与最终答案之间的不一致。在本文中,我们提出了Coupled Variational Reinforcement Learning(CoVRL),它通过混合采样策略耦合先验和后验分布,将变分推理与强化学习联系起来。通过构建和优化整合这两种分布的复合分布,CoVRL实现了高效探索,同时保持了思想与答案之间的强一致性。在数学和通用推理基准上的大量实验表明,CoVRL在基础模型上提升了12.4%的性能,并在最先进的无验证器强化学习基线基础上额外提升了2.3%,为增强语言模型的通用推理能力提供了一个原则性框架。

英文摘要

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

2512.11941 2026-05-26 cs.CV cs.AI 版本更新

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

DynaPURLS: 基于骨架的零样本动作识别中部分感知表示的动态细化

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Qiuhong Ke

发表机构 * Monash University(莫纳什大学) Lancaster University(兰卡斯特大学) University of Western Australia(西澳大学)

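AI Summary Proposes the DynaPURLS framework, which uses multi-scale visual-semantic correspondences and a dynamic refinement module to address domain shift in zero-shot skeleton-based action recognition, achieving the best results on three benchmark datasets.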

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

详情
AI中文摘要

基于骨架的零样本动作识别(ZS-SAR)从根本上受到主流方法的限制,这些方法依赖于将骨架特征与静态的类级语义对齐。这种粗粒度的对齐无法弥合可见类和未见类之间的领域偏移,从而阻碍了细粒度视觉知识的有效迁移。为了解决这些限制,我们引入了DynaPURLS,一个统一的框架,它建立稳健的多尺度视觉-语义对应,并在推理时动态细化它们以增强泛化能力。我们的框架利用大型语言模型生成层次化的文本描述,涵盖全局运动和局部身体部位动态。同时,一个自适应划分模块通过语义分组骨架关节点生成细粒度的视觉表示。为了强化这种细粒度对齐以应对训练-测试领域偏移,DynaPURLS包含一个动态细化模块。在推理时,该模块通过轻量级可学习投影将文本特征适应于输入的视觉流。该细化过程由一个置信度感知的类平衡记忆库稳定,该记忆库减轻了来自噪声伪标签的错误传播。在三个大规模基准数据集(包括NTU RGB+D 60/120和PKU-MMD)上的大量实验表明,DynaPURLS显著优于先前的方法,创造了新的最先进记录。源代码已在https://github.com/Alchemist0754/DynaPURLS公开。

英文摘要

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

2511.21734 2026-05-26 cs.CL cs.AI 版本更新

Asking LLMs to Verify First is Almost Free Lunch

先让LLMs验证几乎是免费的午餐

Shiguang Wu, Quanming Yao

发表机构 * Department of Electonic Engineering(电子工程系)

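AI Summary Proposes the Verification-First (VF) strategy, which verifies a candidate answer before generating a solution to improve reasoning at low computational overhead, and extends it to the iterative Iter-VF method, outperforming standard CoT and existing TTS strategies on multiple benchmarks.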

Abstract

To enhance the reasoning capabilities of Large Language Models (LLMs) without costly training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process complementary to standard forward Chain-of-Thought (CoT), which restricts the logical search space of the answer by pruning the LLM's output distribution. We further generalize VF prompting to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across various benchmarks and LLMs confirm that VF prompting with a random answer consistently outperforms standard CoT with minimal computational overhead, and that Iter-VF outperforms existing TTS strategies. VF is also effective on SOTA thinking models. For example, with simple VF prompting, we obtain a new SOTA accuracy of 94.9% on GPQA-Diamond with Gemini-3-Pro-Preview, where VF reduces the model's errors by about 30% in relative terms.
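The verify-then-generate loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `Final answer:` marker, and the names `vf_prompt`, `extract_answer`, and `iter_vf` are all assumptions, and `model` stands for any callable that maps a prompt string to a response string.

```python
from typing import Callable


def vf_prompt(question: str, candidate: str) -> str:
    """Build a Verification-First prompt (wording is illustrative only)."""
    return (
        f"Question: {question}\n"
        f"A possible answer is: {candidate}\n"
        "First verify whether this answer is correct, then solve the problem "
        "step by step and state your result after 'Final answer:'."
    )


def extract_answer(response: str) -> str:
    """Return the text after the last 'Final answer:' marker (assumed format)."""
    return response.rsplit("Final answer:", 1)[-1].strip()


def iter_vf(question: str, model: Callable[[str], str],
            initial_candidate: str, rounds: int = 3) -> str:
    """Iter-VF sketch: feed each round's answer back in as the next candidate."""
    candidate = initial_candidate
    for _ in range(rounds):
        candidate = extract_answer(model(vf_prompt(question, candidate)))
    return candidate
```

With `rounds=1` this reduces to a single VF prompt; the initial candidate may be trivial or random, as the abstract notes.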

2511.08654 2026-05-26 cs.CY cs.AI cs.CL Updated version

AI-generated podcasts: Synthetic Intimacy and Cultural Mistranslation in NotebookLM's Audio Overviews

Jill Walker Rettberg

Affiliations * University of Bergen; Center for Digital Narrative

AI summary Analyses AI podcasts generated by Google's NotebookLM, revealing their fixed template structure and the way they translate texts and cultural contexts into a white, educated, middle-class American default.

Comments This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement number 101142306. The project is also supported by the Center for Digital Narrative, which is funded by the Research Council of Norway through its Centres of Excellence scheme, project number 332643. Media, Culture & Society, online first (2026)

Abstract

This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.

2511.04556 2026-05-26 cs.AI cs.CE Updated version

Optimizing Sensor Placement for Flow Reconstruction in Urban Drainage Networks: A Digital Twin-Based Sparse Sensing Approach

Zihang Ding, Amit Kumar, Imran Md. Azizul Islam, Mila Avellar Montezuma, Ruihang Zhang, Kun Zhang

Affiliations * Department of Civil and Environmental Engineering, University of Minnesota Duluth; Institute for Water Education, UNESCO IHE Delft; Department of Mechanical and Industrial Engineering, University of Minnesota Duluth

AI summary To address monitoring and flow prediction in urban drainage networks under resource constraints, proposes a digital-twin-based, data-driven sparse sensing method that optimizes sensor locations via singular value decomposition and QR factorization for system-level flow reconstruction; in validation on the Woodland catchment in Duluth, Minnesota, three sensors achieve a mean NSE of 0.949.

Comments 32 pages (including supplementary information), 11 figures. Submitted to Water Research. Partially presented at HydroML 2025 Symposium, Minnesota Water Resources Conference 2025, and AGU Fall Meeting 2025

Abstract

Urban flooding triggered by intense rainfall is becoming increasingly frequent and widespread. While flood prediction and monitoring in high spatio-temporal resolution are desired, practical constraints in time, budget, and technology hinder their full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resources is a major challenge. To address this, we introduced a data-driven sparse sensing (DSS) approach, demonstrated via a digital twin of the Woodland catchment in Duluth, Minnesota. Specifically, we coupled EPA-SWMM with singular value decomposition and QR factorization-based sensor selection to optimize monitoring locations for system-level flow reconstruction. An ensemble of SWMM simulations, driven by diverse scenarios, provided the necessary hydraulic data to extract the reduced basis and identify informative sensor locations. Cross-event validation showed that three strategically placed sensors among 77 candidate nodes achieved a mean system-level Nash-Sutcliffe efficiency (NSE) of 0.949 across observed storm events. The QR-selected sensor sets were benchmarked against reference sensor configurations obtained from exhaustive searches and Monte Carlo random placements. This comparison further showed that flow reconstruction based on QR-selected sensors closely tracked the exhaustive optimum while substantially outperforming random placements. We further evaluated the framework's robustness by introducing multiplicative Gaussian noise and simulating individual sensor failures. While the model is relatively resilient to noise, the impact of sensor dropouts depends heavily on the number of sensors allocated and their specific locations.
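The two quantities at the heart of this abstract, pivot-based sensor selection and the Nash-Sutcliffe efficiency, can be illustrated with a small dependency-free sketch. It assumes the reduced basis has already been extracted (for example, the leading singular vectors of the simulated flow snapshots) and uses greedy pivoted Gram-Schmidt, which picks the same pivots as column-pivoted QR in exact arithmetic. The function names and the toy basis are hypothetical and not taken from the paper.

```python
def select_sensors(basis, k):
    """Pick up to k node indices by greedy column-pivoted Gram-Schmidt.

    basis: n rows (one per candidate node), each holding r reduced-basis values.
    """
    residual = [list(row) for row in basis]
    chosen = []
    for _ in range(k):
        norms = [sum(v * v for v in row) for row in residual]
        pivot = max(range(len(residual)), key=norms.__getitem__)
        if norms[pivot] == 0.0:
            break  # remaining nodes carry no new information
        chosen.append(pivot)
        direction = residual[pivot]
        # Project the chosen node's direction out of every row;
        # the pivot row itself becomes zero and cannot be picked again.
        residual = [
            [a - (sum(x * y for x, y in zip(row, direction)) / norms[pivot]) * b
             for a, b in zip(row, direction)]
            for row in residual
        ]
    return chosen


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den


# Toy basis: four candidate nodes, two modes (values are made up).
toy_basis = [[2.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 0.2]]
sensors = select_sensors(toy_basis, 2)
```

In practice a library routine such as a pivoted QR factorization would replace the hand-written loop; the sketch only shows why the selected nodes are the ones whose basis rows are largest and least redundant.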

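2509.07961 2026-05-26 cs.AI Updated version

Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

Valen Tagliabue, Leonard Dung

Affiliations * Future Impact Group (FIG); Ruhr-University Bochum

AI summary Measures language models' preferences through verbal reports and behavioral experiments (virtual-environment navigation and topic selection), finding that preference satisfaction can serve as an empirical proxy for AI welfare, although consistency between measures varies across models and conditions.

Comments Forthcoming in Philosophy and the Mind Sciences (PhiMiSci)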

Abstract

We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences with preferences expressed through behavior when navigating a virtual environment and selecting conversation topics. We also test how costs and rewards affect behavior and whether responses to a eudaimonic welfare scale (measuring states such as autonomy and purpose in life) are stable across semantically equivalent prompts. Overall, we observed a notable degree of mutual support between our measures. The reliable correlations observed between stated preferences and behavior across conditions suggest that preference satisfaction can, in principle, serve as an empirically measurable welfare proxy in some of today's AI systems. Furthermore, our design offered an illuminating setting for qualitative observation of model behavior. Yet the consistency between measures was more pronounced in some models and conditions than in others, and responses changed under perturbations. Because of this, and the background uncertainty about the nature of welfare and the cognitive states (and welfare subjecthood) of language models, we are currently uncertain whether our methods successfully measure the welfare state of language models. Nevertheless, these findings highlight the feasibility of welfare measurement in language models, inviting further exploration.

2508.15760 2026-05-26 cs.CL cs.AI Updated version

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin, Dinghan Shen, Silei Xu, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Jianbing Han, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

Affiliations * Duke University

AI summary To fill the gap in evaluating MCP tools on dynamic multi-step tasks, proposes the LiveMCP-101 benchmark (101 real-world queries); a parallel evaluation framework shows that frontier LLMs achieve success rates below 60% and identifies seven failure modes.

Abstract

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in benchmarking multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 real-world queries that require coordinated use of multiple MCP tools. To address temporal variability in real-world tool responses, we introduce a parallel evaluation framework where a reference agent executes a validated plan simultaneously to produce real-time reference outputs. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting challenges in multi-step tool use. Comprehensive error analysis identifies seven failure modes spanning tool planning, parameterization, and output handling, pointing to concrete directions for improving current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous agent systems that reliably execute complex tasks through MCP tool orchestration.

2508.12479 2026-05-26 math.OC cs.AI cs.GT cs.MA econ.GN q-fin.EC Updated version

EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization

Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee

Affiliations * Department of Electrical and Computer Engineering, and Data Science and AI Institute at Johns Hopkins University; Institute for Computational and Mathematical Engineering at Stanford University; Center for Systems and Control, IIT Bombay

AI summary For convex–non-concave and non-convex–concave min-max optimization, proposes a reformulation based on an extension of Sion's minimax theorem and designs the EXOTIC algorithm, which combines iterative convex optimization with optimistic hierarchical tree search to compute the global minimax value; the method comes with a theoretical upper bound on the optimality gap, outperforms gradient-based methods in experiments, and is reported to be the first to exactly compute security strategies in games with three or more players.

Comments 35 pages, 2 figures, 3 tables

Abstract

Min-max optimization arises in many domains such as game theory and adversarial machine learning. For these problems, gradient-based methods are well understood and enjoy strong guarantees. However, in the absence of convexity or concavity, existing approaches study convergence to an approximate saddle point or first-order stationary points, which may be arbitrarily far from global optima. In this work, we present an algorithmic framework for computing the global minimax value in convex–non-concave and non-convex–concave min-max optimization. For convex–non-concave min-max problems, we use a reformulation that transforms the problem into a non-concave–convex max-min optimization problem with suitably defined feasible sets and objective function. This reformulation can be viewed as an extension of Sion's minimax theorem to the convex–non-concave setting. We then introduce EXOTIC, an Exact, Optimistic, Tree-based algorithm for solving the reformulated max-min problem. EXOTIC combines an iterative convex optimization solver for the inner minimization with an optimistic hierarchical tree search for the outer maximization, inspired by StroquOOL (Bartlett et al., 2019). Unlike StroquOOL, which assumes stochastic zero-mean noisy evaluations, EXOTIC handles deterministic, biased, and budget-dependent evaluation errors arising from finite-time solutions of the inner convex subproblems. We establish an upper bound on its optimality gap. The same framework also applies to non-convex–concave min-max optimization. Empirically, EXOTIC outperforms gradient-based methods on popular benchmarks from the literature. Finally, we demonstrate the utility of EXOTIC by computing security strategies in multi-player games with three or more players, a computationally challenging task that, to our knowledge, no prior method solves exactly.

2508.11872 2026-05-26 cs.CY cs.AI cs.LG cs.MM Updated version

Designing Singing Syllabi with Virtual Avatars: AI-Assisted Syllabus Reauthoring

Xinxing Wu

Affiliations * Kentucky State University, USA

AI summary Proposes an AI-assisted workflow that turns a traditional text syllabus into a musical, video-based, and avatar-enhanced learning artifact that supplements the formal syllabus.

Comments 16 pages, 1 figure, 1 table

Abstract

Traditional syllabi often function as static reference documents rather than engaging introductions to a course. In practical teaching, we observe that few students thoroughly read or fully comprehend the information provided in traditional, text-based course syllabi, which can leave essential information underused. This paper reframes syllabus communication as a design problem and documents an AI-assisted workflow for transforming a traditional syllabus into a musical, video-based, and avatar-enhanced learning artifact. The paper traces the process of lyrical adaptation, music generation, video composition, avatar synthesis, and optional browser-based interaction. The paper also contributes a reproducible workflow and a concrete example of syllabus reauthoring. The discussion in this paper positions the singing syllabus as a supplement to, not a replacement for, the formal written syllabus and identifies future directions for empirical evaluation. The complete implementation described in this paper is publicly available at https://github.com/xinxingwu-uk/SSVA

2506.10054 2026-05-26 cs.LG cs.AI cs.CL cs.CV Updated version

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang

Affiliations * Harbin Institute of Technology, Shenzhen; Xi’an Jiaotong University; The Chinese University of Hong Kong; University of Chinese Academy of Sciences; Tsinghua University; Huazhong University of Science and Technology

AI summary To address existing DPO methods' neglect of differences in data quality and learning difficulty, proposes Uni-DPO, a unified framework that adaptively reweights preference pairs for more effective data use and better performance.

Comments Accepted by ICLR 2026. Code & models: https://github.com/pspdada/Uni-DPO

Abstract

Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses the leading LLM, Claude 3 Opus, by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks, providing strong empirical evidence of its effectiveness and robustness.
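To make the idea of reweighting preference pairs concrete, the sketch below computes the standard DPO loss for one pair and a weighted average over a batch. The abstract does not state how Uni-DPO derives its weights from data quality and training dynamics, so the `weights` argument here is a placeholder supplied by the caller, and the normalization by the weight sum is likewise an assumption.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


def weighted_dpo_loss(pairs, weights, beta=0.1):
    """Weighted mean of per-pair DPO losses (weights are caller-supplied)."""
    total = sum(w * dpo_loss(*pair, beta=beta) for pair, w in zip(pairs, weights))
    return total / sum(weights)


# Each pair: (policy logp chosen, policy logp rejected, reference logp chosen, reference logp rejected)
example_pairs = [(0.0, 0.0, 0.0, 0.0), (-1.0, -3.0, -2.0, -2.0)]
batch_loss = weighted_dpo_loss(example_pairs, weights=[1.0, 3.0])
```

Setting all weights equal recovers plain DPO, which treats every preference pair the same way.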

2506.09199 2026-05-26 cs.LG cs.AI cs.DC Updated version

FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models

Hariharan Ramesh, Jyotikrishna Dass

Affiliations * Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country

AI summary Proposes the FLoRIST framework, which applies singular value thresholding to decompose local adapters in a compact intermediate space, achieving mathematically accurate aggregation while remaining communication- and computation-efficient.

Comments 21 pages, 12 figures

Journal ref
Ninth Conference on Machine Learning and Systems (MLSys 2026)

Abstract

Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise; require transmitting large stacked local adapters, which leads to poor communication efficiency; or necessitate reconstructing a memory-dense global weight-update matrix and performing a computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.
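One common way to turn a tunable threshold into a rank is to keep the smallest number of singular values whose squared sum reaches a fraction `tau` of the total. The abstract does not specify FLoRIST's exact thresholding rule, so the energy criterion and the name `select_rank` below are assumptions chosen only to illustrate how a single knob trades adapter size against fidelity.

```python
def select_rank(singular_values, tau=0.9):
    """Smallest rank whose cumulative squared singular values reach tau of the total.

    Assumes singular_values is sorted in descending order, as an SVD returns them.
    """
    energy = [s * s for s in singular_values]
    total = sum(energy)
    running = 0.0
    for rank, e in enumerate(energy, start=1):
        running += e
        if running >= tau * total:
            return rank
    return len(energy)


# Hypothetical spectrum of an aggregated adapter update.
rank = select_rank([4.0, 2.0, 1.0, 0.5], tau=0.9)
```

A lower `tau` yields a smaller global adapter to send back to clients, at the cost of discarding more of the aggregated update.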

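2506.09084 2026-05-26 cs.LG cs.AI Updated version

PageLLM: A Multi-Grained Reward Framework for Whole-Page Optimization with Large Language Models

Xinyuan Wang, Liang Wu, Dongjie Wang, Yanjie Fu

Affiliations * Arizona State University; Nokia; University of Kansas

AI summary To address the high cost of human annotation and the granularity mismatch between page-level coherence and item-level placement in whole-page optimization, proposes the PageLLM framework, which decouples implicit feedback into a coarse page-level reward and a fine item-level reward and fine-tunes with PPO-based RLHF, substantially improving ranking performance and delivering gains in online deployment.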

Abstract

Whole-page optimization (WPO) decides how search and recommendation results are surfaced to users, and large language models (LLMs) open a new route to it by treating page generation as sequence generation. Adapting LLMs to web-scale WPO, however, remains bottlenecked by the need for costly human annotations and by the mismatched granularity between page-level coherence and item-level placement. In this work we show that these two challenges are coupled: implicit user feedback alone suffices for alignment, provided the reward signal is decoupled into two complementary granularities. We propose PageLLM, a reward-based fine-tuning framework that (i) turns implicit feedback into four contrastive preference-pair families covering relevance, ranking, diversity, and redundancy, (ii) learns a coarse page-level reward and a fine item-level reward that captures engagement-sensitive position swaps, and (iii) combines both rewards in PPO-based RLHF over a pre-trained LLM. Extensive experiments on seven Amazon categories against eleven baselines show that neither reward alone is sufficient: dropping the page-level or item-level signal reduces NDCG@100 by 17.8% and 15.2% respectively, whereas the joint reward improves NDCG@100 by up to 46.8%. Deployed in a 10M-user online A/B test, PageLLM raises GMV by 0.44% and click-through rate by 0.14%, confirming that multi-grained rewards from implicit feedback scale to production WPO. Code and data are available at an anonymized repository.
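Since the reported gains are stated in NDCG@100, a short reference computation of NDCG@k may help in reading them. The sketch uses linear gain (the relevance grade itself) with a log2 position discount; some evaluations use an exponential gain instead, and the abstract does not say which variant the paper adopts, so treat this as one standard form rather than the authors' exact metric.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades in the order the page presents the items (made-up values).
page_score = ndcg_at_k([1, 2, 3], k=3)
```

A page that lists its most relevant items first scores 1.0; pushing them down the page lowers the score because later positions are discounted more heavily.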

2505.23803 2026-05-26 cs.CR cs.AI Updated version

MultiPhishGuard: An Explainable and Adaptive Multi-Agent LLM System for Phishing Email Detection

Yinuo Xue, Eric Spero, Meng Wai Woo, Wei Gao, Giovanni Russello

Affiliations * The University of Auckland

AI summary Proposes MultiPhishGuard, an LLM-based multi-agent framework that coordinates five specialized agents (covering text, URL, metadata, and more), weights their contributions dynamically with PPO, and uses adversarial training to improve robustness to novel phishing tactics, reaching 97.89% accuracy on public datasets.

Abstract

Phishing email detection faces significant challenges due to evolving adversarial tactics and heterogeneous attack patterns. Traditional approaches, such as rule-based filters and denylists, often struggle to keep pace, leading to missed detections and security risks. While machine learning methods have improved detection performance, they remain limited in adapting to novel and rapidly changing phishing strategies. We present MultiPhishGuard, an LLM-based multi-agent detection framework with learned coordination across specialized agents. The system consists of five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents), with agent contributions dynamically weighted using Proximal Policy Optimization. To address emerging threats, the framework incorporates an adversarial training loop in which an LLM-based agent generates subtle, context-aware email variants to expose potential model weaknesses and improve robustness to ambiguous phishing cases. Experimental evaluations on public datasets show that MultiPhishGuard achieves stronger performance than established baselines, including Chain-of-Thought prompting and single-agent variants, as supported by ablation studies and comparative analyses. The system achieves an accuracy of 97.89%, with a false positive rate of 2.73% and a false negative rate of 0.20%. In addition, an explanation simplifier agent transforms technical model outputs into plain-language rationales intended for human users. Overall, these results suggest that multi-agent LLM architectures with adaptive coordination and adversarial training represent a promising direction for phishing email detection.

2410.15173 2026-05-26 cs.CL cs.AI Updated version

Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton

Affiliations * Imperial College London; Columbia University; University of Washington

AI summary Investigates whether autoregressive large language models have consistent, expressible knowledge of event arguments' thematic fit, using various prompt designs, input-context manipulations, reasoning, and output forms, and sets a new state of the art on benchmarks.

Comments Significant update with massive changes: all experiments rerun with current LLMs; includes new probability estimate analysis and expanded results in Sections 4 and 5. The paper has been accepted to CoNLL-2026

Abstract

The thematic fit estimation task measures semantic arguments' compatibility with a given semantic role for a given predicate. We investigate if autoregressive LLMs have consistent, expressible knowledge of event arguments' thematic fit by experimenting with various prompt designs, manipulating input context, reasoning, and output forms. We set a new state-of-the-art on thematic fit benchmarks, but show that closed and open weight LLMs respond differently to our prompting strategies: Closed models achieve better scores overall and benefit from multi-step reasoning, but they perform worse at filtering out generated sentences incompatible with the given predicate, role, and argument. Our analysis shows that lemma tuple input and sentence input result in surprisingly different thematic fit score distributions.

2403.04780 2026-05-26 cs.CL cs.AI Updated version

Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

Yanchao Tan, Hang Lv, Pengxiang Zhan, Shiping Wang, Carl Yang

Affiliations * Engineering Research Center of Big Data Intelligence, Ministry of Education; Fujian Key Laboratory of Network Computing and Intelligent Information Processing; College of Computer and Data Science, Fuzhou University; Department of Computer Science, Emory University

AI summary Proposes the MuseGraph framework, which combines GNNs and LLMs through compact graph descriptions, chain-of-thought-based instruction generation, and graph-aware instruction tuning to enable efficient graph mining across tasks and datasets.

Comments Accepted by TPAMI 2025

Journal ref
IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 1, pp. 155-169, Jan. 2026

Abstract

Graphs with abundant attributes are essential in modeling interconnected entities and enhancing predictions across various real-world applications. Traditional Graph Neural Networks (GNNs) often require re-training for different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced new paradigms in natural language processing, their potential for generic graph mining (training a single model to simultaneously handle diverse tasks and datasets) remains under-explored. To this end, our novel framework, MuseGraph, seamlessly integrates the strengths of GNNs and LLMs into one foundation model for graph mining across tasks and datasets. This framework first features a compact graph description to encapsulate key graph information within language token limitations. Then, we propose a diverse instruction generation mechanism with Chain-of-Thought (CoT)-based instruction packages to distill the reasoning capabilities from advanced LLMs like GPT-4. Finally, we design a graph-aware instruction tuning strategy to facilitate mutual enhancement across multiple tasks and datasets while preventing catastrophic forgetting of LLMs' generative abilities. Our experimental results demonstrate significant improvements in five graph tasks and ten datasets, showcasing the potential of MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while improving the generation abilities of LLMs.

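2305.11663 2026-05-26 cs.LG cs.AI cs.CL cs.CY Updated version

Algorithmic failure as a humanities methodology: machine learning's mispredictions identify rich cases for qualitative analysis

Jill Walker Rettberg

AI summary Experimentally tests the method proposed by Munk et al. of using failed machine-learning predictions to identify ambiguous and rich cases for qualitative analysis: a simple kNN algorithm classifies actions of fictional characters interacting with machine vision technologies, and the unpredictable actions turn out to be more ambivalent and emotionally loaded, supporting the method's applicability in the humanities.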

Journal ref
Big Data & Society 9(2) 2022

Abstract

This commentary tests a methodology proposed by Munk et al. (2022) for using failed predictions in machine learning as a method to identify ambiguous and rich cases for qualitative analysis. Using a dataset describing actions performed by fictional characters interacting with machine vision technologies in 500 artworks, movies, novels and videogames, I trained a simple machine learning algorithm (using the kNN algorithm in R) to predict whether an action was active or passive using only information about the fictional characters. Predictable actions were generally unemotional and unambiguous activities where machine vision technologies were treated as simple tools. Unpredictable actions, that is, actions that the algorithm could not correctly predict, were more ambivalent and emotionally loaded, with more complex power relationships between characters and technologies. The results thus support Munk et al.'s theory that failed predictions can be productively used to identify rich cases for qualitative analysis. This test goes beyond simply replicating Munk et al.'s results by demonstrating that the method can be applied to a broader humanities domain, and that it does not require complex neural networks but can also work with a simpler machine learning algorithm. Further research is needed to develop an understanding of what kinds of data the method is useful for and which kinds of machine learning are most generative. To support this, the R code required to produce the results is included so the test can be replicated. The code can also be reused or adapted to test the method on other datasets.
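The core move, training a simple classifier and then reading the cases it gets wrong, can be sketched in a few lines. The original analysis used kNN in R on categorical character data; the Python version below, its leave-one-out evaluation, the numeric toy features, and the function names are all illustrative assumptions rather than a port of the published code.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training items closest to query (squared Euclidean)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def mispredicted(data, k=3):
    """Indices of items whose label kNN fails to predict from all other items."""
    misses = []
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            misses.append(i)
    return misses


# Toy data: (features, label). The last item sits among "passive" cases but is "active".
toy = [
    ((0, 0), "passive"), ((0, 1), "passive"), ((1, 0), "passive"),
    ((5, 5), "active"), ((5, 6), "active"), ((6, 5), "active"),
    ((0.5, 0.5), "active"),
]
candidates_for_close_reading = mispredicted(toy, k=3)
```

The indices returned are the "failed predictions": under the methodology tested here, these are the cases to pull out for qualitative analysis.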

2105.13431 2026-05-26 cs.LG cs.AI cs.SY eess.SY 版本更新

An Offline Risk-aware Policy Selection Method for Bayesian Markov Decision Processes

贝叶斯马尔可夫决策过程的离线风险感知策略选择方法

Giorgio Angelotti, Nicolas Drougard, Caroline Ponzoni Carvalho Chanel

发表机构 * Natural Intelligence Toulouse Institute, University of Toulouse, France(图卢兹大学自然智能研究所) ISAE-SUPAERO, University of Toulouse, France(图卢兹大学ISAE-SUPAERO)

AI总结 针对离线强化学习中模型不确定性导致策略风险高的问题,提出一种基于贝叶斯形式化框架的风险感知策略选择方法EvC,通过最大化贝叶斯后验下的风险感知目标来选择稳健策略。

Comments Preprint, under review

详情
Journal ref
Artificial Intelligence, Volume 354, 2026
AI中文摘要

在离线模型学习用于规划以及离线强化学习中,有限的数据集阻碍了相应马尔可夫决策过程(MDP)的值函数估计。因此,所获得策略在真实世界中的性能受到限制且可能存在风险,尤其是当部署错误策略可能导致灾难性后果时。为此,目前正在探索多种途径以减少模型误差(或学习模型与真实模型之间的分布偏移),并在更广泛的意义上获得针对模型不确定性的风险感知解决方案。但在最终应用中,实践者应选择哪种基线?在计算时间不是问题且鲁棒性优先的离线背景下,我们提出了Exploitation vs Caution(EvC),这是一种范式:(1)遵循贝叶斯形式化,优雅地纳入模型不确定性,以及(2)在(例如由当前基线提供的)固定候选策略集合中,选择最大化贝叶斯后验下风险感知目标的策略。我们在不同离散但简单的环境中使用最先进的方法验证了EvC,这些环境提供了多种MDP类别。在测试场景中,EvC成功选择了稳健策略,因此成为旨在将离线规划和强化学习求解器应用于真实世界的实践者的有用工具。

英文摘要

In Offline Model Learning for Planning and in Offline Reinforcement Learning, the limited data set hinders the estimate of the Value function of the relative Markov Decision Process (MDP). Consequently, the performance of the obtained policy in the real world is bounded and possibly risky, especially when the deployment of a wrong policy can lead to catastrophic consequences. For this reason, several pathways are being followed with the aim of reducing the model error (or the distributional shift between the learned model and the true one) and, more broadly, obtaining risk-aware solutions with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? In an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty abiding by the Bayesian formalism, and (2) selects the policy that maximizes a risk-aware objective over the Bayesian posterior among a fixed set of candidate policies provided, for instance, by the current baselines. We validate EvC with state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes. In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners that aim to apply offline planning and reinforcement learning solvers in the real world.
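To make the selection step concrete, here is a minimal sketch of choosing among fixed candidate policies by a risk-aware score computed over posterior samples. The abstract does not say which risk measure EvC uses; conditional value-at-risk (CVaR) is an assumption made here for illustration, and the policy names and return values are hypothetical.

```python
def cvar(values, alpha):
    """Mean of the worst alpha-fraction of the values (lower tail)."""
    ordered = sorted(values)
    n_tail = max(1, int(alpha * len(ordered)))
    tail = ordered[:n_tail]
    return sum(tail) / len(tail)

def select_policy(returns_by_policy, alpha=0.25):
    """Pick the candidate whose lower-tail return across posterior samples is best."""
    return max(returns_by_policy, key=lambda name: cvar(returns_by_policy[name], alpha))

# Hypothetical returns: one entry per MDP sampled from the Bayesian posterior.
returns_by_policy = {
    "greedy": [12.0, 11.0, 10.0, -5.0],   # high mean, bad worst case
    "cautious": [6.0, 5.0, 5.0, 4.0],     # lower mean, stable
}
risk_averse_choice = select_policy(returns_by_policy, alpha=0.25)
risk_neutral_choice = select_policy(returns_by_policy, alpha=1.0)
```

With `alpha=1.0` the score reduces to the plain mean, so the same routine covers the risk-neutral case; lowering `alpha` shifts the choice toward caution.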

2603.18363 2026-05-26 cs.CL cs.AI cs.LG 版本更新

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

PowerFlow: 通过原则性分布匹配释放LLMs的双重特性

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China(清华大学交叉信息研究院)

AI总结 提出PowerFlow框架,将无监督微调重构成分布匹配问题,利用GFlowNet和长度感知轨迹平衡目标,通过调整α-幂分布方向性激发LLMs的逻辑推理或创造性。

Comments Camera-ready version accepted at ICML 2026

详情
AI中文摘要

无监督内部反馈强化学习(RLIF)已成为一种有前景的范式,可以在没有外部监督的情况下激发大型语言模型(LLMs)的潜在能力。然而,当前方法依赖于启发式内在奖励,通常缺乏明确的理论优化目标,并且容易产生退化偏差。在这项工作中,我们引入了PowerFlow,一个原则性框架,将无监督微调重新表述为分布匹配问题。通过将GFlowNet视为未归一化密度的摊销变分采样器,我们提出了一个长度感知的轨迹平衡目标,明确抵消了自回归生成中固有的结构长度偏差。通过针对$α$-幂分布,PowerFlow能够方向性地激发LLMs的双重特性:锐化分布($α> 1$)以增强逻辑推理,或展平分布($α< 1$)以释放表达性创造力。大量实验表明,PowerFlow始终优于现有的RLIF方法,匹配甚至超过有监督的GRPO。此外,通过减轻对齐模型中的过度锐化,我们的方法在多样性和质量上同时取得提升,在创造性任务中推动了帕累托前沿。

英文摘要

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α> 1$) to intensify logical reasoning, or flattening it ($α< 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
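The effect of targeting an α-power distribution can be seen on a toy categorical distribution. This sketch only illustrates the sharpening and flattening described above; it is not the PowerFlow training objective, and the probabilities are made up.

```python
def power_distribution(probs, alpha):
    """Raise each probability to the power alpha and renormalize."""
    weights = [p ** alpha for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

base = [0.5, 0.3, 0.2]                        # hypothetical base distribution
sharpened = power_distribution(base, 2.0)     # alpha > 1 concentrates mass on the mode
flattened = power_distribution(base, 0.5)     # alpha < 1 spreads mass more evenly
```

For a language model the same transformation would apply to sequence-level probabilities, where the normalizer is intractable; that is why the abstract frames the problem as sampling from an unnormalized density.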

2509.13389 2026-05-26 cs.AI 版本更新

From Next Token Prediction to (STRIPS) World Models

从下一个词预测到(STRIPS)世界模型

Carlos Núñez-Molina, Vicenç Gómez, Hector Geffner

发表机构 * RWTH Aachen University, Germany(亚琛工业大学,德国) Universitat Pompeu Fabra, Spain(庞培法布拉大学,西班牙)

AI总结 研究下一个词预测能否产生支持规划的世界模型,提出STRIPS Transformer和标准Transformer两种架构,在五个经典规划领域评估训练准确率、泛化能力和规划性能。

详情
AI中文摘要

我们研究下一个词预测是否能够产生真正支持规划的世界模型,在一个受控的符号设置中,从动作轨迹单独学习命题STRIPS动作模型,并且可以精确评估正确性。我们引入了两种架构。第一种是STRIPS Transformer,一种符号对齐的模型,基于连接Transformer与STRIPS领域形式语言结构的理论结果。第二种是标准Transformer架构,没有内置显式符号结构,我们研究不同的位置编码方案和注意力聚合机制。我们在五个经典规划领域评估这两种架构,测量训练准确率、泛化能力以及跨领域和问题规模的规划性能。有趣的是,两种方法都可以产生支持使用现成STRIPS规划器在指数级多的未见初始状态和目标上进行规划的模型。尽管STRIPS Transformer具有强烈的符号归纳偏置,但它更难优化,并且需要更大的数据集才能可靠地泛化。相比之下,带有stick-breaking注意力的标准Transformer实现了近乎完美的训练准确率和强大的泛化能力。最后,没有stick-breaking注意力的标准Transformer无法泛化到长轨迹,而从较短轨迹训练的Transformer中提取的符号STRIPS模型则可以。

英文摘要

We study whether next-token prediction can yield world models that truly support planning, in a controlled symbolic setting where propositional STRIPS action models are learned from action traces alone and correctness can be evaluated exactly. We introduce two architectures. The first is the STRIPS Transformer, a symbolically aligned model grounded in theoretical results linking transformers and the formal language structure of STRIPS domains. The second is a standard transformer architecture without explicit symbolic structure built in, for which we study different positional encoding schemes and attention aggregation mechanisms. We evaluate both architectures on five classical planning domains, measuring training accuracy, generalization, and planning performance across domains and problem sizes. Interestingly, both approaches can be used to produce models that support planning with off-the-shelf STRIPS planners over exponentially many unseen initial states and goals. Although the STRIPS Transformer incorporates a strong symbolic inductive bias, it is harder to optimize and requires larger datasets to generalize reliably. In contrast, a standard transformer with stick-breaking attention achieves near-perfect training accuracy and strong generalization. Finally, standard transformers without stick-breaking attention do not generalize to long traces, whereas a symbolic STRIPS model extracted from a transformer trained on shorter traces does.
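For readers unfamiliar with the target formalism, a propositional STRIPS action is a triple of preconditions, add effects, and delete effects. The sketch below shows how such an action is checked and applied; the Blocksworld-style atoms are illustrative only and are not taken from the paper.

```python
from typing import FrozenSet, NamedTuple

class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]

def applicable(state: FrozenSet[str], action: Action) -> bool:
    """An action is applicable when all its preconditions hold in the state."""
    return action.preconditions <= state

def apply(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """Successor state: remove the delete effects, then add the add effects."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete_effects) | action.add_effects

# Illustrative action: pick up block a from the table.
pickup_a = Action(
    name="pickup(a)",
    preconditions=frozenset({"clear(a)", "ontable(a)", "handempty"}),
    add_effects=frozenset({"holding(a)"}),
    delete_effects=frozenset({"clear(a)", "ontable(a)", "handempty"}),
)
start = frozenset({"clear(a)", "ontable(a)", "handempty"})
after_pickup = apply(start, pickup_a)
```

Learning an action model from traces amounts to recovering the three sets for every action, which is why correctness can be evaluated exactly in this setting.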

2603.11001 2026-05-26 cs.CY cs.AI 版本更新

RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

随机对照试验与人类提升研究:前沿AI评估的方法论挑战与实践解决方案

Patricia Paskov, Kevin Wei, Shen Zhou Hong, Dan Bateyko, Xavier Roberts-Gaal, Carson Ezell, Gailius Praninskas, Valerie Chen, Umang Bhatt, Ella Guest

发表机构 * RAND(兰德公司) Johns Hopkins University(约翰霍普金斯大学) Cornell University(康奈尔大学) Harvard University(哈佛大学) University of Cambridge(剑桥大学) London School of Economics(伦敦政治经济学院)

AI总结 本文通过访谈16位专家,系统梳理了人类提升研究(测量AI对人类绩效影响)在随机对照试验中面临的方法论挑战,包括内部效度、外部效度和构念效度问题,并提出了相应的解决方案。

详情
AI中文摘要

人类提升研究,即通过随机对照试验(RCT)或类似方法测量AI访问对人类绩效影响的研究,越来越多地为前沿AI治理和部署决策提供信息。尽管RCT方法在其他领域是稳健的,但它们与前沿AI系统独特属性的相互作用仍未得到充分研究,特别是当结果用于高风险的决策时。我们呈现了对16位在生物安全、网络安全、教育和劳动等领域具有人类提升研究经验的专家从业者的访谈结果。在访谈中,专家们描述了人类提升研究所依赖的标准因果推断假设与研究对象本身之间反复出现的紧张关系。快速演变的AI系统、不断变化的基线、异质且变化的用户熟练度以及边界松散的真实世界环境,对内部效度、外部效度和构念效度的假设造成了压力,使得提升证据的解释和适当使用复杂化。我们贡献了(1)人类提升研究中方法论挑战的综合,映射到研究效度的风险,并按其对大语言模型(LLM)系统的特异性程度进行分类,以及(2)从挑战到提议解决方案的映射。通过整理专家识别的挑战和解决方案,我们旨在阐明人类提升证据的解释限制和适当用途,使评估实践与其所指导的决策相一致,并为AI治理提供更协调的方法论基础。

英文摘要

Human uplift studies, or studies that measure the effects of AI access on human performance via randomized controlled trials (RCT) or similar methodologies, increasingly inform frontier AI governance and deployment decisions. While RCT methods are robust in other fields, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between the standard causal inference assumptions upon which human uplift studies rely and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We contribute (1) a synthesis of methodological challenges in human uplift studies, mapped to risks to study validity and classified by their degree of specificity to large language model (LLM) systems, and (2) a mapping from challenges to proposed solutions. By collating expert-identified challenges and solutions, we seek to clarify the interpretive limits and appropriate uses of human uplift evidence, to align evaluation practice with the decisions it informs, and to support more coordinated methodological foundations for AI governance.

2602.20191 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

MoBiQuant: 面向令牌自适应任意精度LLM的混合比特量化

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang

发表机构 * University of Arizona(亚利桑那大学) Duke University(杜克大学) Sungkyunkwan University(成均馆大学) Panasonic AI Lab(松下人工智能实验室) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 针对动态运行时约束下大语言模型任意精度量化的泛化性问题,提出基于令牌敏感度的混合比特量化框架MoBiQuant,通过多合一递归残差量化和令牌感知路由器实现灵活推理,在匹配或超越前沿单精度PTQ的同时显著节省内存并提升吞吐量。

Comments 20 pages, 10 figures

详情
AI中文摘要

动态运行时延迟和内存约束要求灵活部署大语言模型(LLM),使得LLM能够根据可用计算资源以不同的量化精度进行推理。最近关于这种任意精度量化的工作要么依赖于硬件效率低下的向量量化,要么在切换位宽时引入额外的缩放因子。同时,现有的为固定低精度校准的后训练量化(PTQ)方法在运行时精度变化下表现出较差的泛化性。在这项工作中,我们将跨位宽泛化性差的根源归因于一种精度依赖的“异常迁移”现象,其中PTQ敏感令牌的分布随精度变化。受此观察启发,我们提出了MoBiQuant,一种新颖的任意精度混合比特量化框架,它根据令牌敏感性调整权重精度以实现灵活的LLM推理。具体来说,我们提出了一种多合一递归残差量化方法,可以在运行时迭代重建更高精度的权重,并通过令牌感知路由器缓解“异常迁移”,动态选择每个令牌的最优推理精度。大量实验表明,MoBiQuant在匹配或超越前沿单精度PTQ的同时表现出强大的弹性,与最先进的任意精度方法相比,实现了显著的内存节省和高达$1.34\times$的吞吐量提升。

英文摘要

Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. Recent work on such any-precision quantization either relies on hardware-inefficient vector quantization or induces additional scaling factors when switching between bit-widths. Meanwhile, existing post-training quantization (PTQ) methods calibrated for a fixed low precision show poor generalizability under runtime precision change. In this work, we attribute the source of poor generalization across bit-widths to a precision-dependent "outlier migration" phenomenon where the distribution of PTQ-sensitive tokens changes across precisions. Motivated by this observation, we propose MoBiQuant, a novel any-precision Mixture-of-Bits quantization framework that adjusts weight precision for flexible LLM inference based on token sensitivity. Specifically, we propose a many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights at runtime and mitigates "outlier migration" with a token-aware router to dynamically select the optimal inference precision of each token. Extensive experiments show that MoBiQuant matches or surpasses frontier single-precision PTQ while exhibiting strong elasticity, achieving significant memory savings and throughput gains of up to $1.34\times$ over state-of-the-art any-precision methods.
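The general idea behind residual quantization can be sketched as follows: quantize the weights coarsely, quantize what is left over, and add stages back at runtime to recover precision. This is a generic illustration under assumed choices (uniform symmetric rounding, one scale per stage, made-up weights); the abstract does not specify MoBiQuant's actual scheme or its router.

```python
def quantize(values, bits):
    """Uniform symmetric quantization to `bits` bits (bits >= 2), one shared scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]

def residual_stages(weights, bits_per_stage, n_stages):
    """Quantize the weights, then repeatedly quantize the remaining residual."""
    stages, residual = [], list(weights)
    for _ in range(n_stages):
        q = quantize(residual, bits_per_stage)
        stages.append(q)
        residual = [r - x for r, x in zip(residual, q)]
    return stages

def reconstruct(stages, n_used):
    """Sum the first n_used stages (n_used >= 1); more stages give higher precision."""
    return [sum(col) for col in zip(*stages[:n_used])]

def max_error(weights, approx):
    return max(abs(w - a) for w, a in zip(weights, approx))

weights = [0.9, -0.35, 0.12, 0.5]             # hypothetical weight values
stages = residual_stages(weights, bits_per_stage=3, n_stages=2)
coarse_error = max_error(weights, reconstruct(stages, 1))
fine_error = max_error(weights, reconstruct(stages, 2))
```

Because every precision level is a prefix of the same stage list, one stored model can serve several bit-widths, which is the property a per-token precision router would rely on.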

2602.17162 2026-05-26 cs.AI q-bio.GN 版本更新

JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures

JEPA-DNA:通过联合嵌入预测架构夯实基因组基础模型

Ariel Larey, Elay Dahan, Amit Bleiweiss, Raizy Kellerman, Guy Leib, Omri Nayshool, Dan Ofer, Tal Zinger, Dan Dominissini, Gideon Rechavi, Nicole Bussola, Simon Lee, Shane O'Connell, Dung Hoang, Marissa Wirth, Alexander W. Charney, Nati Daniel, Yoli Shavit

发表机构 * Applied AI Architecture, NVIDIA, Israel(NVIDIA应用人工智能架构,以色列) Worldwide Field Ops, NVIDIA, Israel(NVIDIA全球现场运营,以色列) Developer Programs, NVIDIA, Israel(NVIDIA开发者计划,以色列) Cancer Research Center and Wohl Institute of Translational Medicine, Sheba Medical Center, Tel Hashomer, Israel(癌症研究中心和Wohl转化医学研究所,Sheba医疗中心,Tel Hashomer,以色列) Windreich Department of AI and Human Health, Icahn School of Medicine at Mount Sinai, New York, USA(西奈山伊坎医学院Windreich人工智能与人类健康系,纽约,美国)

AI总结 提出JEPA-DNA框架,将联合嵌入预测架构与生成式目标结合,通过潜在空间监督全局序列嵌入,实现从令牌恢复到语义对齐的转变,在17项基因组基准任务上提升线性探测和零样本性能,达到新最优。

详情
AI中文摘要

基因组基础模型(GFM)通常依赖掩码语言建模(MLM)或下一令牌预测(NTP)来学习“自然法则”。虽然这些生成范式在捕捉局部语法方面有效,但它们优先考虑令牌级重建而非高级功能上下文。我们引入JEPA-DNA,一个模型无关的持续训练框架,将联合嵌入预测架构(JEPA)与传统生成式目标相结合。通过在潜在空间中监督全局序列嵌入,JEPA-DNA迫使模型预测掩码基因组片段的功能表示,将学习信号从令牌恢复转向语义对齐。我们在17个不同的基因组基准任务上评估JEPA-DNA,证明无论底层GFM架构或生成式目标如何,在线性探测和零样本性能上均有一致提升。我们的框架通过弥合生成精度与潜在语义基础之间的差距,建立了GFM的新最优水平,超越了现有最佳模型。通过广泛的消融研究,我们进一步表征了生成式目标与潜在目标之间的协同交互。我们的代码公开在https://github.com/NVIDIA-Digital-Bio/JEPA-DNA。

英文摘要

Genomic Foundation Models (GFMs) typically rely on Masked Language Modeling (MLM) or Next-Token Prediction (NTP) to learn the "Laws of Nature". While effective at capturing local syntax, these generative paradigms prioritize token-level reconstruction over high-level functional context. We introduce JEPA-DNA, a model-agnostic continual training framework that integrates a Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives. By supervising global sequence embeddings in a latent space, JEPA-DNA forces models to predict the functional representations of masked genomic segments, shifting the learning signal from token recovery to semantic alignment. We evaluate JEPA-DNA on 17 diverse genomic benchmark tasks, demonstrating consistent gains in linear probing and zero-shot performance regardless of the underlying GFM architecture or generative objective. Our framework establishes a new state-of-the-art for GFMs, surpassing the best existing models by bridging generative precision with latent semantic grounding. Through extensive ablation studies, we further characterize the synergistic interplay between generative and latent objectives. Our code is publicly available at https://github.com/NVIDIA-Digital-Bio/JEPA-DNA.

2602.00682 2026-05-26 cs.IR cs.AI 版本更新

RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment

RecGOAT: 用于LLM增强多模态推荐的图最优自适应传输与双语义对齐

Yuecheng Li, Hengwei Ju, Zeyu Song, Wei Yang, Chi Lu, Peng Jiang, Kun Gai

发表机构 * Fudan University(复旦大学) University of Southern California(南加州大学)

AI总结 针对生成式语言模型表示与ID协同信号之间的语义异质性,提出基于图神经网络和最优传输理论的双粒度语义对齐框架RecGOAT,通过实例级和分布级对齐实现统一特征空间,理论证明其表示误差更低,实验达到最优性能。

Comments Under Review

详情
AI中文摘要

将大型语言模型(LLM)表示整合到多模态推荐中已显示出前景,但一个基本挑战仍被忽视:生成式LM表示与推荐系统依赖的基于ID的协同信号之间的语义异质性。未经对齐而简单注入LM特征会降低推荐性能而非提升。为解决此问题,我们提出RecGOAT,一个基于图神经网络和最优传输理论的双粒度语义对齐框架。RecGOAT首先通过多模态注意力图丰富协同语义,捕获物品-物品、用户-物品和用户-用户关系,并通过LLM推断的行为偏好初始化用户表示。然后,它在两个互补粒度上对齐LM导出的模态表示与推荐ID:(1)通过跨模态对比学习(CMCL)进行实例级对齐,产生判别性的每个样本表示;(2)通过最优自适应传输(OAT)进行分布级对齐,最小化ID分布与LLM语义之间的1-Wasserstein距离,以产生统一、一致对齐的特征空间。理论上,我们证明统一表示比任何单模态表示具有严格更低的目标误差,该误差界由Wasserstein距离和InfoNCE损失界定,为对齐一致性和融合全面性提供了严格保证。在三个公开基准上的大量实验展示了最先进的性能。在大规模在线广告平台上的部署进一步验证了RecGOAT的工业可扩展性。我们的代码可在https://github.com/6lyc/RecGOAT-LLM4Rec获取。

英文摘要

Integrating large language model (LLM) representations into multimodal recommendation has shown promise, yet a fundamental challenge remains largely overlooked: the semantic heterogeneity between generative LM representations and the ID-based collaborative signals that recommendation systems rely on. Naively injecting LM features without alignment degrades recommendation performance rather than improving it. To resolve this, we propose RecGOAT, a dual-granularity semantic alignment framework built on graph neural networks and optimal transport theory. RecGOAT first enriches collaborative semantics through multimodal attentive graphs that capture item-item, user-item, and user-user relationships, initializing user representations via LLM-inferred behavioral preferences. It then aligns LM-derived modality representations with recommendation IDs at two complementary granularities: (1) instance-level alignment via cross-modal contrastive learning (CMCL), which produces discriminative per-sample representations; and (2) distribution-level alignment via optimal adaptive transport (OAT), which minimizes the 1-Wasserstein distance between ID distributions and LLM semantics to produce a unified, consistently aligned feature space. Theoretically, we prove that the unified representation achieves strictly lower target error than any single-modality representation, with the gap bounded by the Wasserstein distance and the InfoNCE loss, providing rigorous guarantees for both alignment consistency and fusion comprehensiveness. Extensive experiments on three public benchmarks demonstrate state-of-the-art performance. Deployment on a large-scale online advertising platform further validates RecGOAT's industrial scalability. Our code is available at https://github.com/6lyc/RecGOAT-LLM4Rec.
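As a reference point for the distribution-level term, the 1-Wasserstein distance has a closed form in one dimension: for two equal-size samples it is the mean absolute difference between sorted values. The sketch below covers only that special case; RecGOAT operates on embedding distributions, where an optimal transport solver would be needed, and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In 1-D the optimal coupling matches the i-th smallest x to the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

id_scores = [0.0, 2.0, 1.0]      # hypothetical projections of ID embeddings
llm_scores = [3.0, 1.0, 2.0]     # hypothetical projections of LLM embeddings
shift = wasserstein_1d(id_scores, llm_scores)
```

A distance of zero means the two samples are identical as distributions, which is the condition the alignment objective pushes toward.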

2601.20539 2026-05-26 cs.AI cs.CL 版本更新

PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

PathWise:通过世界模型规划实现基于自进化LLM的自动启发式设计

Oguzhan Gungordu, Siheng Xiong, Faramarz Fekri

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出PathWise多智能体推理框架,将启发式生成建模为基于蕴含图的序列决策过程,通过策略智能体、世界模型智能体和评论智能体的协作实现状态感知规划,在组合优化问题上收敛更快、泛化更强。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLM)已实现组合优化问题(COP)的自动启发式设计(AHD),但现有框架依赖固定的进化规则和静态提示模板,常导致短视的启发式生成、冗余评估以及对新启发式如何推导的有限推理。我们提出一种新颖的多智能体推理框架,称为通过世界模型规划实现基于自进化LLM的自动启发式设计(PathWise),该框架将启发式生成公式化为一个基于蕴含图的序列决策过程,该图作为搜索轨迹的紧凑、有状态记忆。这种方法使系统能够继承过去的决策,并在不同代之间重用或避免推导信息。策略智能体规划进化动作,世界模型智能体根据这些动作生成启发式展开,评论智能体提供路由反思,总结先前步骤的经验教训,将基于LLM的AHD从试错进化转变为通过推理进行状态感知规划。在多种COP上的实验表明,PathWise能更快收敛到更好的启发式,在不同LLM骨干上泛化,并扩展到更大规模的问题。

英文摘要

Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

2512.12677 2026-05-26 cs.CL cs.AI 版本更新

Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

微调因果大语言模型用于文本分类:基于嵌入与基于指令的方法

Amirhossein Yousefiramandi, Ciaran Cooney

发表机构 * Clarivate Intellectual Property(Clarivate知识产权)

AI总结 本文探索在资源受限下微调仅解码器大语言模型用于文本分类,比较了基于嵌入的分类头方法和基于指令的微调方法,并采用4位量化与LoRA实现高效训练,实验表明嵌入头方法在单标签分类中匹配或超越微调BERT基线,而指令微调仅在多标签且可训练参数量较大时具有竞争力。

Comments 20 pages, 5 figures

详情
AI中文摘要

我们探索在资源受限下高效微调仅解码器大语言模型(LLMs)用于下游文本分类的策略。研究了两种方法:(1) 将分类头附加到预训练的因果LLM上,并在任务上微调,使用LLM的最终token嵌入作为序列表示;(2) 以提示-响应的格式对LLM进行指令微调以进行分类。为了在单GPU上微调高达8B参数的模型,我们将4位模型量化与低秩适配(LoRA)结合,实现参数高效训练。在两个专利基准测试(一个5类单标签内部语料库和具有14个类别的公共WIPO-Alpha多标签数据集)上的实验表明,嵌入头方法在单标签分类中匹配或超过微调BERT基线,同时训练参数少10-30倍。指令微调仅在多标签场景下具有竞争力,且需要至少1亿参数的大幅可训练预算。这些结果表明,直接利用因果LLM的内部表示,结合高效微调技术,在有限计算资源下能产生强大的分类性能。我们讨论了每种方法的优势,并概述了在分类场景中优化LLM微调的实用指南和未来方向。

英文摘要

We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pretrained causal LLM and fine-tuning it on the task, using the LLM's final-token embedding as a sequence representation, and (2) instruction-tuning the LLM in a prompt-to-response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two patent benchmarks, a 5-class single-label internal corpus and the public WIPO-Alpha multi-label dataset with 14 categories, show that the embedding-head approach matches or exceeds fine-tuned BERT baselines on single-label classification while training 10-30x fewer parameters. Instruction-tuning is competitive only in the multi-label regime, and only with substantially larger trainable budgets of at least 100M parameters. These results demonstrate that directly leveraging the internal representations of causal LLMs, together with efficient fine-tuning techniques, yields strong classification performance under limited computational resources. We discuss the advantages of each approach and outline practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
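The parameter savings from LoRA follow from simple arithmetic: a rank-r adapter on a weight matrix of shape d_out by d_in trains two small matrices instead of the full one. The layer size and rank below are illustrative assumptions, not values reported in the paper.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters of a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

d, r = 4096, 16                      # hypothetical hidden size and adapter rank
dense = full_params(d, d)            # 16,777,216 parameters
adapter = lora_params(d, d, r)       # 131,072 parameters
reduction = dense / adapter          # factor by which trainable parameters shrink
```

The per-layer reduction here is separate from the 10-30x figure in the abstract, which compares total trainable parameters against fine-tuned BERT baselines.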

2512.05402 2026-05-26 cs.LG cs.AI cs.CE cs.NE 版本更新

Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction

挖矿的智能时机:用于比特币硬件投资回报率预测的深度学习框架

Sithumi Wickramasinghe, Bikramjit Das, Dorien Herremans

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 提出MineROI-Net,一种基于Transformer的深度学习框架,将比特币ASIC硬件采购建模为时间序列分类任务,预测一年内的投资回报率类别,在2015-2024年20种ASIC矿机数据上达到83.2%准确率和83.5%宏F1分数。

详情
AI中文摘要

由于市场波动、技术快速过时和协议驱动的收入周期,比特币挖矿硬件的获取需要战略时机。尽管挖矿已演变为资本密集型行业,但关于何时购买新的专用集成电路(ASIC)硬件的指导很少,且没有先前的计算框架解决这一决策问题。我们通过将硬件获取建模为时间序列分类任务来填补这一空白,预测购买ASIC机器是否在一年内产生盈利(投资回报率(ROI)>= 1)、边际(0 < ROI < 1)或亏损(ROI <= 0)的回报。我们提出了MineROI-Net,一种开源的基于Transformer的架构,旨在捕捉挖矿盈利能力中的多尺度时间模式。在2015年至2024年间发布的20种ASIC矿机在不同市场体制下的数据上评估,MineROI-Net优于循环、卷积和基于注意力的基线,达到了83.2%的准确率和83.5%的宏F1分数。该模型展示了强大的经济相关性,在检测亏损时期达到了97.8%的精确率,在检测盈利时期达到了81.5%的精确率,同时避免了将盈利场景误分类为亏损以及反之亦然。这些结果表明,MineROI-Net为挖矿硬件采购时机提供了一种实用的数据驱动工具,可能降低资本密集型挖矿操作中的财务风险。

英文摘要

Bitcoin mining hardware acquisition requires strategic timing due to volatile markets, rapid technological obsolescence, and protocol-driven revenue cycles. Despite mining's evolution into a capital-intensive industry, there is little guidance on when to purchase new Application-Specific Integrated Circuit (ASIC) hardware, and no prior computational frameworks address this decision problem. We address this gap by formulating hardware acquisition as a time series classification task, predicting whether purchasing ASIC machines yields profitable (Return on Investment (ROI) >= 1), marginal (0 < ROI < 1), or unprofitable (ROI <= 0) returns within one year. We propose MineROI-Net, an open-source Transformer-based architecture designed to capture multi-scale temporal patterns in mining profitability. Evaluated on data from 20 ASIC miners released between 2015 and 2024 across diverse market regimes, MineROI-Net outperforms recurrent, convolutional, and attention-based baselines, achieving 83.2% accuracy and 83.5% macro F1-score. The model demonstrates strong economic relevance, achieving 97.8% precision in detecting unprofitable periods and 81.5% precision in detecting profitable ones, while avoiding misclassifying profitable scenarios as unprofitable and vice versa. These results indicate that MineROI-Net offers a practical, data-driven tool for timing mining hardware acquisitions, potentially reducing financial risk in capital-intensive mining operations.
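The three target classes are defined by ROI thresholds stated in the abstract, so the labeling rule can be written directly. The function name and the sample ROI values are hypothetical; only the thresholds come from the text.

```python
def roi_class(roi):
    """Map a one-year ROI value to the class label described in the abstract."""
    if roi >= 1:
        return "profitable"      # ROI >= 1
    if roi > 0:
        return "marginal"        # 0 < ROI < 1
    return "unprofitable"        # ROI <= 0

labels = [roi_class(v) for v in (1.4, 1.0, 0.3, 0.0, -0.2)]
```

A classifier such as MineROI-Net would be trained to predict these labels from the time series available at the purchase date.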

2505.20110 2026-05-26 cs.LG cs.AI 版本更新

Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training

超越代理:用于离线GFlowNet训练的轨迹蒸馏指导

Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China(清华大学交叉信息研究院)

AI总结 提出轨迹蒸馏GFlowNet(TD-GFN),利用逆强化学习从离线轨迹中提取稠密边奖励,通过DAG剪枝和优先反向采样指导策略,避免代理模型,提升离线GFlowNet训练的收敛速度和样本质量。

Comments Camera-ready version accepted at ICML 2026

详情
AI中文摘要

生成流网络(GFlowNets)擅长采样多样化的高奖励对象。在许多实际应用中,由于无法进行主动奖励查询,这些模型必须使用静态离线数据集进行训练。主流的训练方法通常依赖代理模型为在线采样的轨迹提供奖励反馈。然而,由于数据稀缺或评估成本高,构建可靠的代理往往具有挑战性。虽然现有的无代理方法试图解决这一问题,但它们通常施加粗糙的约束,限制了模型有效探索的能力。为了克服这些限制,我们提出了轨迹蒸馏GFlowNet(TD-GFN),一种新颖的无代理训练框架。TD-GFN利用逆强化学习(IRL)从离线轨迹中提取稠密的、转移级别的边奖励,为高效探索提供丰富的结构指导。关键的是,为了确保鲁棒性,这些奖励通过DAG剪枝和优先反向采样间接指导策略。这种设计确保梯度更新仅依赖于数据集中的真实终端奖励,从而防止错误传播。实验结果表明,TD-GFN在收敛速度和样本质量上显著优于广泛的现有基线,为离线GFlowNet训练建立了更鲁棒和高效的范式。

英文摘要

Generative Flow Networks (GFlowNets) excel at sampling diverse, high-reward objects. In many practical applications where active reward queries are infeasible, these models must be trained using static offline datasets. Prevailing training methods typically rely on a proxy model to provide reward feedback for online sampled trajectories. However, constructing a reliable proxy is often challenging due to data scarcity or high evaluation costs. While existing proxy-free approaches attempt to address this, they often impose coarse constraints that limit the model's ability to explore effectively. To overcome these limitations, we propose Trajectory-Distilled GFlowNet (TD-GFN), a novel proxy-free training framework. TD-GFN utilizes inverse reinforcement learning (IRL) to extract dense, transition-level edge rewards from offline trajectories, providing rich structural guidance for efficient exploration. Crucially, to ensure robustness, these rewards guide the policy indirectly through DAG pruning and prioritized backward sampling. This design ensures that gradient updates rely exclusively on ground-truth terminal rewards from the dataset, thereby preventing error propagation. Empirical results demonstrate that TD-GFN significantly outperforms a broad range of existing baselines in both convergence speed and sample quality, establishing a more robust and efficient paradigm for offline GFlowNet training.

2509.02113 2026-05-26 cs.LG cs.AI cs.CR cs.SI 版本更新

HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis

HiGraph:用于恶意软件分析的大规模层次图数据集

Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang

发表机构 * University of Technology Sydney(悉尼科技大学) Yunnan University(云南大学) University of New South Wales(新南威尔士大学)

AI总结 针对现有图方法忽略软件层次结构的问题,提出包含2亿控制流图和59.5万函数调用图的大规模层次图数据集HiGraph,用于构建抗混淆和演化的鲁棒恶意软件检测器。

Comments updated dataset statistics

详情
AI中文摘要

基于图的恶意软件分析的进展受到缺乏捕捉软件固有层次结构的大规模数据集的严重限制。现有方法通常将程序简化为单层图,未能建模高层功能交互与低层指令逻辑之间的关键语义关系。为填补这一空白,我们引入了HiGraph,这是用于恶意软件分析的最大公开层次图数据集,包含嵌套在595K个函数调用图(FCG)中的超过2亿个控制流图(CFG)。这种两层表示保留了构建对代码混淆和恶意软件演化具有鲁棒性的检测器所必需的结构语义。我们通过大规模分析展示了HiGraph的实用性,揭示了良性软件和恶意软件的不同结构特性,将其确立为社区的基础基准。数据集和工具可在https://higraph.org公开获取。

英文摘要

The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.

2405.01906 2026-05-26 cs.AI cs.LG cs.NE 版本更新

Instance-Conditioned Adaptation for Large-scale Generalization of Neural Routing Solver

实例条件适应:神经路由求解器的大规模泛化

Changliang Zhou, Xi Lin, Zhenkun Wang, Xialiang Tong, Mingxuan Yuan, Qingfu Zhang

发表机构 * School of Automation and Intelligent Manufacturing and Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, Southern University of Science and Technology, Shenzhen 518055, China(自动化与智能制造学院和广东省全驱动系统控制理论与技术重点实验室,南方科技大学,深圳518055,中国) Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China(计算机科学系,香港城市大学,香港特别行政区,中国) Huawei Noah’s Ark Lab, Hong Kong SAR, China(华为诺亚方舟实验室,香港特别行政区,中国)

AI总结 提出实例条件适应模型(ICAM),通过简单高效的实例条件适应函数和低复杂度的适应模块,显著提升神经路由求解器在大规模旅行商问题(TSP)、容量车辆路径问题(CVRP)和非对称旅行商问题(ATSP)上的泛化性能,同时保持快速推理速度。

Comments 13 pages, 5 figures

详情
Journal ref
IEEE Transactions on Intelligent Transportation Systems, 2026
AI中文摘要

神经组合优化(NCO)方法在无需专家知识的情况下,展现出了解决智能交通系统路由问题的巨大潜力。然而,现有的构造性NCO方法仍难以解决大规模实例,这严重限制了其应用前景。为了解决这些关键缺陷,本文提出了一种新颖的实例条件适应模型(ICAM),以实现神经路由求解器更好的大规模泛化。特别地,我们设计了一个简单而高效的实例条件适应函数,以较小的时空开销显著提升现有NCO模型的泛化性能。此外,通过对不同注意力机制之间信息融合性能的系统研究,我们进一步提出了一个强大且低复杂度的实例条件适应模块,为不同规模的实例生成更好的解。在合成实例和基准实例上的大量实验结果表明,我们提出的方法能够在解决大规模旅行商问题(TSP)、容量车辆路径问题(CVRP)和非对称旅行商问题(ATSP)时,以非常快的推理时间获得有希望的结果。我们的代码可在 https://github.com/CIAM-Group/ICAM 获取。

英文摘要

The neural combinatorial optimization (NCO) method has shown great potential for solving routing problems of intelligent transportation systems without requiring expert knowledge. However, existing constructive NCO methods still struggle to solve large-scale instances, which significantly limits their application prospects. To address these crucial shortcomings, this work proposes a novel Instance-Conditioned Adaptation Model (ICAM) for better large-scale generalization of neural routing solvers. In particular, we design a simple yet efficient instance-conditioned adaptation function to significantly improve the generalization performance of existing NCO models with a small time and memory overhead. In addition, with a systematic investigation on the performance of information incorporation between different attention mechanisms, we further propose a powerful yet low-complexity instance-conditioned adaptation module to generate better solutions for instances across different scales. Extensive experimental results on both synthetic and benchmark instances show that our proposed method is capable of obtaining promising results with a very fast inference time in solving large-scale Traveling Salesman Problems (TSPs), Capacitated Vehicle Routing Problems (CVRPs), and Asymmetric Traveling Salesman Problems (ATSPs). Our code is available at https://github.com/CIAM-Group/ICAM.

2505.02129 2026-05-26 cs.DB cs.AI 版本更新

Subspace Aggregation Query and Index Generation for Multidimensional Resource Space Model

多维资源空间模型的子空间聚合查询与索引生成

Xiaoping Sun, Hai Zhuge

发表机构 * Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所智能信息处理重点实验室) Great Bay University and Great Bay Institute for Advanced Study(大湾区大学和大湾区先进研究院)

AI总结 针对多维资源空间中的子空间聚合查询问题,提出一种基于偏序关系的图索引生成方法,以高效定位非空点并聚合资源,并通过策略降低索引生成成本。

详情
AI中文摘要

在多维语义空间中组织大规模资源是一种有效管理和查询不同语义维度资源的方法。为支持高级应用,本文提出一种资源空间模型,用于在表示每个维度的坐标树上的偏序范围内定义的子空间上进行聚合查询,其中子空间中的每个点包含沿坐标树偏序关系路径聚合的资源,并且每个点的聚合资源可由应用测量、排序和选择。为了高效定位大子空间中的非空点,提出一种生成图索引的方法,以构建维度坐标上的偏序关系,使子空间查询能够通过索引链接到达非空点,并沿索引路径将资源聚合到其超点。生成此类索引成本高昂,因为索引节点的子节点数量可能很大,导致索引节点总数非常庞大(随维度数量和维度规模呈指数增长)。所提出的方法采用一系列策略来降低成本。分析和实验表明,生成的索引在支持子空间聚合查询方面具有有效性。

英文摘要

Organizing large-scale resources in a multidimensional semantic space is an approach to efficiently managing and querying resources from different semantic dimensions. To support advanced applications, this paper proposes a resource space model for aggregation query on subspaces defined by a range within the partial order on the coordinate trees representing each dimension, where each point in the subspace contains resources aggregated along the paths of the partial order relations on the coordinate trees and the aggregated resources at each point can be measured, ranked and selected by applications. To efficiently locate non-empty points in a large subspace, an approach to generating a graph index is proposed to build partial order relations on coordinates of dimensions to enable a subspace query to reach non-empty points through indexing links and aggregate resources along indexing paths to their super points. Generating such an index is costly, as the number of children of an indexing node can be large, so the total number of indexing nodes can be very large (growing exponentially with the number of dimensions and the scale of dimensions). The proposed approach adopts a set of strategies to reduce the cost. Analysis and experiments show the effectiveness of the generated index in supporting subspace aggregation query.