arXivDaily arXiv每日学术速递 周一至周五更新
重置
2606.02578 2026-06-02 cs.CV cs.AI 版本更新

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

通过感知扰动和奖励建模减轻多模态大语言模型作为评判者中的感知判断偏差

Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee, Jaeyo Shin, Hyunjung Shim

发表机构 * University of California, Berkeley(加州大学伯克利分校) KAIST(韩国科学技术院)

AI总结 本文通过构建感知扰动数据集和结合GRPO奖励与批排序目标的统一训练框架,解决了多模态大语言模型作为评判者时因视觉证据与文本线索冲突而产生的感知判断偏差问题,显著提升了感知忠实度和与人类评价的一致性。

Comments ICML 2026

详情
AI中文摘要

最近的多模态大语言模型展示了强大的推理能力,但它们作为自动评估器的可靠性仍然受到一个关键弱点的限制:当视觉证据与文本线索冲突时,多模态大语言模型评判者倾向于奖励看似合理的叙述而非感知上正确的答案。我们识别并系统分析了这一现象,称之为感知判断偏差。通过受控的视觉扰动,现有的多模态评判者经常锚定于响应文本而非自身的视觉感知,导致不一致且不可验证的评估。为了解决这个问题,我们引入了感知扰动判断数据集,该数据集构建了最小编辑的反事实响应,隔离了感知错误并实现了可验证的监督。基于该数据集,我们开发了一个统一的训练框架,将结构化的基于GRPO的奖励与批排序目标相结合,实现了无需显式成对标签的连贯全局排序。在多种多模态大语言模型作为评判者的基准测试上的实验表明,我们的方法显著提高了感知忠实度、排序连贯性以及与人类评价的一致性。我们的结果为训练感知基础、可解释且对视觉推理冲突鲁棒的多模态评判者建立了一条可扩展且可泛化的路径。

英文摘要

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.

2606.02569 2026-06-02 cs.CV cs.AI cs.CL 版本更新

AdaCodec: A Predictive Visual Code for Video MLLMs

AdaCodec: 面向视频多模态大语言模型的预测性视觉编码

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li, Shuai Dong, Kele Shao, Ruilin Li, Dianyi Wang, Nan Duan, Jiaqi Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) JD.com(京东公司)

AI总结 针对视频帧间冗余问题,提出预测性视觉编码AdaCodec,通过条件预测代价决定是否发送完整参考帧或紧凑P-令牌,在匹配视觉令牌预算下提升性能,并大幅降低首令牌延迟。

Comments 23 pages

详情
AI中文摘要

视频在时间上是冗余的:相邻帧通常共享大部分物体、背景和布局。然而,现有的视频多模态大语言模型(视频MLLMs)通常将每个采样帧编码为独立的RGB图像,导致视觉令牌重复先前帧中已有的内容。这提示了一种更直接的视频接口:仅当场景无法从先前上下文中良好预测时,才发送完整的参考帧;否则,传输帧间变化的紧凑描述。我们将这种接口称为\emph{预测性视觉编码},并针对视频MLLMs实例化为 extbf{AdaCodec}。AdaCodec仅在条件预测代价高时,为参考帧花费完整的视觉令牌;否则,它将帧间变化(包括运动和预测残差)编码为紧凑的P-令牌。在所有11个基准测试中,在匹配视觉令牌预算下,AdaCodec相比基于Qwen3-VL-8B的逐帧RGB基线有所改进。即使在1/7的预算下,使用32k令牌的AdaCodec在所有长视频基准测试中超越了224k基线;在五个通用视频基准测试中,它提高了平均得分,同时将首令牌时间从9.26秒大幅缩短至1.62秒。

英文摘要

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a \emph{predictive visual code}, and instantiate it for video MLLMs as \textbf{AdaCodec}. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at $1/7$ the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.

2606.02568 2026-06-02 cs.AI cs.CL cs.ET cs.MA 版本更新

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

ClinEnv:面向智能体的交互式多阶段长时程电子健康记录环境

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo, Xukai Zhao, Jinzhuo Wang, May Dongmei Wang

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Peking University(北京大学) University of Texas Southwestern Medical Center(德克萨斯西南医学中心) Tsinghua University(清华大学)

AI总结 提出ClinEnv,一个基于真实住院患者数据的交互式基准,通过多阶段决策序列评估大语言模型在不确定性下逐步收集信息并做出不可逆决策的能力,发现模型决策质量与过程质量严重脱节。

Comments 20 pages, 6 figures, 12 tables

详情
AI中文摘要

临床实践并非从枚举选项中选择答案:医生会逐步收集异质信息,并在不确定性下做出顺序的、不可逆的决策。静态基准无法探测,而现有的交互式医学基准各自至少在一个方面有所妥协。我们提出ClinEnv,一个交互式基准,在称为纵向住院模拟的范式下,将大语言模型评估为真实住院患者的主治医生。每个病例自动构建为有序的决策阶段序列;在每个阶段,模型必须主动查询四个专门的智能体,然后才能提交药物、程序和诊断。ClinEnv通过确定性本体匹配对模型的决策内容进行评分,同时也对其信息收集过程进行评分。在七个模型中,最强的模型仅达到0.31的决策F1分数,且结果质量与过程质量严重脱节。困难集中在管理决策和后期阶段,模型恢复出院诊断的可靠性远高于管理行动(F1分别为0.51 vs 0.17),并且随着病例进展继续发出冗余查询。ClinEnv使这种信息获取差距(仅通过结果评估无法察觉)变得可直接测量。

英文摘要

Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.

2606.02562 2026-06-02 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

通过可信推理实现许可安全:可验证的信念空间神经安全滤波器用于保证交互式机器人

Haimin Hu

发表机构 * Department of Computer Science, Johns Hopkins University, USA(约翰霍普金斯大学计算机科学系)

AI总结 针对交互式机器人中人类不确定性带来的安全问题,提出一种基于共形预测的信念空间安全滤波器验证方法,在考虑推理可靠性的前提下保证高概率安全,并减少保守性。

Comments Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

详情
AI中文摘要

与人类交互的自主机器人必须在人类引起的不确定性(如偏好、目标、能力和合作意愿)下做出安全高效的决策。安全滤波器是确保交互式机器人安全性的流行方法,其模块化设计将安全性与性能分离,使机器人能够在最小影响任务效率的情况下安全地与人交互。传统安全滤波器通常仅在物理空间中运行,忽略了机器人在线学习和适应的能力,而最近提出的信念空间安全滤波器(BeliefSF)在闭环中考虑机器人安全性,并通过运行时推理主动减少机器人的不确定性,从而降低滤波的保守性。然而,由于运行时推理的误差以及处理信念空间高维性所需的安全滤波器神经近似,为部署BeliefSF的机器人提供形式化安全保证仍然是一个重大挑战。本文提出一种算法方法,使用共形预测来认证BeliefSF的高概率安全性,同时明确考虑机器人运行时推理模块的可靠性。我们的方法利用信念空间安全滤波的结构,将验证集中在预期推理可靠的区域。它保留了标准共形预测的简单性和样本复杂度,但能够认证一个显著更不保守的安全滤波器。通过一个模拟的人-车交互基准测试,我们展示了我们的方法验证了一个比标准共形预测基线更许可的信念空间安全滤波器。

英文摘要

Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety from performance, allowing robots to operate safely around people with minimal impact on task efficiency. While traditional safety filters typically operate only in the physical space, neglecting the robot's ability to learn and adapt online, the recently proposed belief-space safety filter (BeliefSF) reasons about robot safety in closed-loop with runtime inference that actively reduces the robot's uncertainty online, thereby reducing conservativeness in filtering. However, providing formal safety guarantees for robots deploying BeliefSF remains a significant challenge due to errors in runtime inference and neural approximation of safety filters required to handle the high dimensionality of belief spaces. In this paper, we propose an algorithmic approach to certify high-probability safety of BeliefSF using conformal prediction, while explicitly accounting for the reliability of the robot's runtime inference module. Our method leverages the structure of belief-space safety filtering by focusing verification on a region where inference is expected to be reliable. It preserves the simplicity and sample complexity of standard conformal prediction, yet can certify a substantially less conservative safety filter. Through a simulated human-vehicle interaction benchmark, we show that our approach verifies a significantly more permissive belief-space safety filter than a standard conformal prediction baseline.

2606.02559 2026-06-02 cs.CL cs.AI 版本更新

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

从层到子模块:重新思考基于替换的LLM压缩中的粒度

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

发表机构 * University of Trento(特伦托大学)

AI总结 提出子模块级别的非连续替换压缩方法SubFit,通过为注意力和前馈子模块分别设计轻量残差旁路,在多种LLM上实现更好的困惑度-准确率权衡。

详情
AI中文摘要

大型语言模型(LLM)的后训练压缩会移除整个架构组件,要么删除它们,要么用拟合模块替换它们。现有的基于替换的方法共享两个设计约束:全层粒度和连续选择。我们认为这过于严格:事实上,预训练Transformer中的冗余并不局限于连续区域,也不均匀分布在注意力和前馈输出之间,这意味着不同的策略最适合近似不同的子模块类型,并且可移除的组件不需要聚集在连续的深度范围内。基于这一直觉,我们引入了SubFit(子模块级拟合残差替换),它在子模块级别压缩LLM:注意力和前馈子模块被非连续地选择,并且每个子模块都获得自己的轻量级拟合残差旁路。SubFit在训练后运行,仅需要校准数据。在十个LLM(五个基础模型,五个指令微调模型)、五个从12.5%到37.5%的稀疏度水平以及四个基于替换的基线上,SubFit在评估的稀疏度水平上实现了最佳的聚合困惑度-准确率权衡,在激进压缩下获得更大收益。在25%稀疏度下,它保留了84.6%的密集下游准确率,困惑度退化2.42倍,而最强基线分别为81.6%和4.34倍,同时实现了可测量的推理加速和KV缓存节省。代码可在https://github.com/eliacunegatti/SubFit获取。

英文摘要

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.

2606.02552 2026-06-02 cs.CV cs.AI 版本更新

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

建模深度歧义:一种用于无飞点深度估计的混合密度表示

Siyuan Bian, Congrong Xu, Jun Gao

发表机构 * University of Michigan(密歇根大学) NVIDIA(英伟达)

AI总结 提出混合密度表示MDA,通过预测每个像素的多个深度假设及其概率,解决深度估计中边界处的飞点伪影问题,显著改善边界重建并消除飞点。

详情
AI中文摘要

尽管深度估计取得了进展,飞点仍然是一个持续存在的失败模式:在物体边界附近,深度估计器经常在前景和背景表面之间的空白空间中预测虚假的3D点。我们将这种伪影追溯到一种标准建模选择:为每个像素分配单个深度假设。在边界处,一个像素可能跨越前景和背景表面,因此其真实深度在两者之间是模糊的。预测单个深度的模型无法同时保留两种可能性,因此训练反而将预测拉向一个位于两个表面之间的中间深度。我们通过MDA解决了这个问题,这是一种混合密度表示,让模型为每个像素预测多个深度假设及其相关概率。在边界附近,不同的假设可以与不同的表面对齐,解码后的深度从这些假设之一中选择,而不是放置在它们之间的空白空间中。在不同的骨干网络上,MDA显著改善了边界重建,并在很大程度上消除了飞点伪影,即使在严重的输入模糊下也是如此,同时增加了可忽略的运行时开销。相同的混合密度框架自然地扩展到透明物体,其中它预测透明像素处的多个深度层,以及天空区域,其中专用组件将无界天空与有限深度区域分开,产生无飞点的天际线。项目页面:https://biansy000.github.io/mda-site/。

英文摘要

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a single depth hypothesis. At boundaries, a pixel can straddle a foreground and a background surface, so its true depth is ambiguous between the two. A model that predicts a single depth cannot keep both possibilities, so training instead pulls the prediction toward an intermediate depth that lies on neither surface. We address this with MDA, a mixture-density representation that lets the model predict multiple depth hypotheses and their associated probabilities for each pixel. Near boundaries, different hypotheses can align with different surfaces, and the decoded depth is selected from one of these hypotheses rather than placed in the empty space between them. Across different backbones, MDA substantially improves boundary reconstruction and largely removes flying-point artifacts even under severe input blur, while adding negligible runtime overhead. The same mixture-density framework naturally extends to transparent objects, where it predicts multiple depth layers at transparent pixels, and to sky regions, where a dedicated component separates the unbounded sky from finite-depth regions, producing flying-point-free skylines. Project Page: https://biansy000.github.io/mda-site/.

2606.02544 2026-06-02 cs.CL cs.AI 版本更新

SimSD: Simple Speculative Decoding in Diffusion Language Models

SimSD:扩散语言模型中的简单推测解码

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang, Haoru Li, Chaojie Ren, Yiming Huang, Kaijie Zhu, Zhongkai Yu, Kun Zhou, Jingbo Shang

发表机构 * University of California San Diego(加州大学圣地亚哥分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Google(谷歌) University of California Santa Barbara(加州大学圣芭芭拉分校)

AI总结 针对扩散语言模型无法直接使用标准推测解码的问题,提出SimSD算法,通过即插即用的掩码策略引入参考令牌并设计注意力掩码,实现单次前向传播验证多个草稿令牌,在保持并行解码优势的同时提升解码吞吐量。

Comments 13 pages, 4 figures, code available at https://github.com/airevo2/SimSD-release

详情
AI中文摘要

扩散大语言模型(dLLMs)最近作为自回归(AR)LLMs的有前景替代方案出现,通过并行或块状解码实现更快的推理。然而,它们的掩码语言建模公式仍然与标准令牌级推测解码不兼容,而后者是AR模型最有效的加速技术之一。在AR解码中,因果掩码保留了时间上有效的令牌级上下文,使目标模型能够在单次前向传播中验证多个草稿令牌。相比之下,dLLMs依赖于掩码令牌和双向注意力,导致有效上下文在去噪步骤中发生变化,从而阻止了直接的令牌级推测验证。为了弥合这一差距,我们提出了一种简单但有效的扩散语言模型推测解码算法,名为SimSD,它主要采用即插即用的掩码策略,为dLLMs配备时间上有效的令牌级上下文以进行推测解码。我们的方法明确地从草稿模型预测中引入参考令牌,并设计了一种注意力掩码来调节它们与当前步骤令牌的交互,使dLLMs能够在单次前向传播中计算草稿令牌的有效logits。这恢复了AR模型中因果掩码提供的关键验证能力,同时保留了dLLMs的并行解码优势。所提出的方法无需训练,并且可以灵活地与其他加速技术(如KV缓存和块状解码)集成。在四个基准测试上的SDAR系列dLLMs实验表明,我们的方法实现了高达7.46倍的解码吞吐量提升,同时保持甚至提高了平均生成质量。

英文摘要

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.

2606.02536 2026-06-02 cs.AI 版本更新

Tracking the Behavioral Trajectories of Adapting Agents

追踪自适应智能体的行为轨迹

Jonah Leshin, Manish Shah, Ian Timmis

发表机构 * University of Birmingham(伯明翰大学)

AI总结 提出一种通过文本嵌入空间中的方向定义智能体特质的方法,训练线性模型对技能文件差异进行评分,实现高准确率的行为特质分类与排序。

Comments 5 pages, 1 figure. To appear at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

文本文件,如技能文件、记忆文件和行为配置文件,在定义现代智能体的行为方式中起着核心作用。通过人类或智能体自身的编辑,这些文件可能随时间演变,直接引导智能体在未来交互中的行为。我们提出了一种方法和框架,通过将特质定义为文本嵌入模型嵌入空间中的方向来测量智能体的“特质”。我们在标记的“之前”与“之后”技能文件差异上训练线性模型以学习特质向量,然后通过将嵌入差异投影到该向量上对任意技能编辑进行评分。在68个标记的技能差异对上评估寻求敏感数据的特质倾向,我们的方法在留一法交叉验证下实现了91.2%的符号分类准确率和斯皮尔曼等级相关系数ρ=0.82。我们将这种特质评估构建到一个更广泛的智能体间协议中,使一个智能体能够通过可信中介评估另一个智能体的技能文件更新。

英文摘要

Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly steering the agent's behavior in future interactions. We present a methodology and framework for measuring agent $traits$ by defining traits as directions in the embedding space of a text embedding model. We train a linear model on labeled "before" versus "after" skill file diffs to learn a trait vector, then score arbitrary skill edits by projecting their embedding diffs onto this vector. Evaluated on 68 labeled skill diff pairs for the trait of propensity to seek sensitive data, our method achieves 91.2% sign classification accuracy and a Spearman rank correlation of $ρ= 0.82$ under leave-one-out cross-validation. We build this trait evaluation into a broader agent-to-agent protocol that enables one agent to evaluate another's skill file updates through a trusted intermediary.

2606.02530 2026-06-02 cs.AI cs.CL 版本更新

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer: 局部化在策略蒸馏用于高效安全对齐

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha

发表机构 * Beihang University(北航) Beijing Institute of Technology(北京理工大学) Beijing University of Posts and Telecommunications(北京邮电大学) Peking University(北京大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 针对大语言模型安全对齐导致通用能力下降的问题,提出SafeSteer方法,通过激活引导构建安全教师并选择安全令牌,仅在安全令牌上施加反向KL惩罚,在仅用100个有害样本且无需通用数据的情况下,实现了安全与通用能力之间的优越平衡。

Comments 19 pages, 8 figures, 14 tables. Submitted to EMNLP 2026

详情
AI中文摘要

将大型语言模型(LLMs)与人类价值观对齐通常会降低其通用能力,这被称为对齐税。现有方法通过平衡双重目标来缓解这一问题,但严重依赖大量通用数据或辅助奖励模型。在本文中,我们认为,由于安全特征在输出分布中本质上是稀疏的,对齐需要局部修改而非全局权衡。为此,我们提出SafeSteer,它在安全令牌上执行在策略蒸馏。首先,我们通过激活引导构建一个安全教师。基于该教师,我们开发了一种安全令牌选择算法。因此,SafeSteer在训练期间将反向KL惩罚限制在这些令牌上,以保留通用能力。跨多种模型的实验结果表明,与现有方法相比,我们的SafeSteer在安全性和通用能力之间实现了更优越的权衡,在七个安全基准上取得了强大的安全性能,同时在五个通用能力基准上仅有最小程度的下降。值得注意的是,SafeSteer仅需100个有害样本,无需使用任何通用数据,不到先前基线所用数据的1%,大大降低了对齐成本。更多详情请访问我们的项目页面:https://anjingkun.github.io/SafeSteer。

英文摘要

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.

2606.02526 2026-06-02 cs.CV cs.AI 版本更新

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

为什么不采用超参数友好的优化?一种用于长尾识别的单调自适应范数缩放方法

Shuo Zhang, Chenqi Li, Tingting Zhu

发表机构 * University of Oxford(牛津大学)

AI总结 提出一种无需参数正则化的自适应单调归一化方法(SAMN),通过保序回归直接对类别权重范数施加单调性约束,实现超参数友好的长尾识别。

详情
AI中文摘要

长尾识别对深度学习构成了重大挑战。两阶段解耦范式将表示学习与分类器重训练分离,提供了一种有前景的解决方案。在分类器重训练阶段,自适应范数缩放是一种流行技术。它通过参数正则化调整每类权重范数,这不可避免地引入了超参数。然而,许多研究报告指出,长尾识别对这些超参数敏感,因为它们的设置显著影响性能。在本文中,我们首先从类条件分布的角度为范数缩放方法提供支持。此外,我们提出了一种简单而有效的方法,称为自适应单调归一化(SAMN)。SAMN避免了参数正则化的需求。它直接使用保序回归算法对每类权重范数施加单调性,使该方法对超参数友好。SAMN是一种通用策略,可与其他方法无缝集成以提升性能。在基准数据集上的实验表明,我们的方法显著提升了长尾识别性能,通常达到最先进的结果。

英文摘要

Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results.

2606.02522 2026-06-02 cs.CV cs.AI 版本更新

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video: 诊断视频多模态大语言模型在瞬时视觉事件上的时间保真度

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学) Southeast University(东南大学) Tencent Youtu Lab(腾讯优图实验室)

AI总结 提出 Moment-Video 基准,通过瞬时视觉事件理解任务诊断视频 MLLMs 的时间保真度,发现最佳模型准确率仅 39.6%,多数开源模型低于 25%。

Comments 28 pages, 10 figures, 11 tables

详情
AI中文摘要

视频多模态大语言模型(MLLMs)在通用和长视频理解方面取得了快速进展,但它们保留简短答案关键视觉证据的能力仍未得到充分探索。许多实际问题由瞬时视觉事件决定:可能仅持续几帧的局部化动作或状态转换。这种证据可能因稀疏帧采样而跳过、因视觉标记压缩而抑制,或因粗粒度时间聚合而稀释,导致语言端推理无法可靠恢复的失败。我们引入了 Moment-Video,一个通过瞬时视觉事件理解来诊断视频 MLLMs 时间保真度的基准。每个问题都基于局部化、视觉可观察且对采样敏感的事件,要求模型注意、计数、描述或推理瞬态证据,而非依赖持久对象、全局场景上下文或语言先验。Moment-Video 包含 1,000 个人工验证的视频问答对,涵盖 7 个领域和 25 个细分子类别,覆盖四种任务类型:时间发生、时间计数、动作描述和时间推理。我们在 Moment-Video 上评估了 33 个专有和开源 MLLMs。最佳模型 Seed-2.0-Pro 仅达到 39.6% 的整体准确率,而大多数开源模型低于 25%,揭示了瞬时视觉事件理解方面的巨大差距。诊断分析表明,更密集的帧采样改善了一些模型,但并未消除瓶颈,更长的视频带来了更强的时间定位挑战。这些发现表明,当前视频 MLLMs 仍然缺乏时间保真的表示来捕捉、保留和使用简短但决定性的视觉证据。

英文摘要

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.

2606.02497 2026-06-02 cs.AI 版本更新

Bridging the Last Mile of Time Series Forecasting with LLM Agents

用LLM智能体弥合时间序列预测的最后一公里

Yuhua Liao, Zetian Wang, Qiangqiang Nie, Zhenhua Zhang

AI总结 提出一个LLM智能体框架,通过检索上下文证据和结构化约束,将统计预测转化为业务就绪的预测。

详情
AI中文摘要

时间序列预测发展迅速,特别是随着基础模型的出现,这些模型在数值外推上展现出强大的零样本性能。然而,在实际预测场景中,统计上合理的基线很少是实践中使用的最终预测。在预测成为决策就绪之前,通常需要使用弱结构化的业务背景进行修订,例如假日效应、活动计划、外部事件、历史类比和专家反馈。这一实际阶段在预测文献中仍未得到充分探索。在本文中,我们将这一阶段定义为 extbf{最后一公里预测}问题,并提出一个位于预测骨干之上的LLM智能体框架。我们的系统维护一个统一的预测工作空间,调用工具检索上下文证据,并在结构安全约束下将推理轨迹转化为明确的预测修订行动。它还通过map-reduce风格的分解支持长周期预测,并通过记忆库支持事后反思。最终的系统设计为可控和可审计的。通过实际案例研究,我们展示了LLM智能体如何弥合统计预测与业务就绪预测之间的差距。

英文摘要

Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on numerical extrapolation. However, in real-world forecasting settings, a statistically plausible baseline is rarely the final forecast used in practice. Before a forecast becomes decision-ready, it often needs to be revised using weakly structured business context such as holiday effects, campaign plans, external events, historical analogs, and expert feedback. This practical stage remains underexplored in the forecasting literature. In this paper, we formulate this stage as the \textbf{last-mile forecasting} problem and present an LLM-agent framework that sits on top of a forecasting backbone. Our system maintains a unified forecast workspace, invokes tools to retrieve contextual evidence, and converts reasoning trajectories into explicit forecast revision actions under structural safety constraints. It also supports long-horizon forecasting through map-reduce-style decomposition and post-hoc reflection through a memory bank. The resulting system is designed to be controllable and auditable. Through real-world case studies, we show how LLM agents can bridge the gap between statistical prediction and business-ready forecasting.

2606.02494 2026-06-02 cs.SE cs.AI 版本更新

Monitoring Agentic Systems Before They're Reliable

在代理系统可靠之前对其进行监控

Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase

发表机构 * Reins AI USA(Reins AI美国公司) Veraitech USA(Veraitech美国公司)

AI总结 针对生产环境中代理系统因结构缺陷主导故障的问题,提出一种基于方差信号的三维度三范围监控与分类方法,并通过合成测试验证其有效性。

Comments 9 pages, 2 figures, 3 tables. Accepted to the Workshop on Agentic Software Engineering (AgenticSE), co-located with ACM CAIS 2026 (non-archival)

详情
AI中文摘要

进入生产环境的代理系统通常以部分集成的组件形式运行,其中结构缺陷(而非任务级错误)主导故障场景。在此成熟度下,任务级错误检测可能不可行:结构故障模式掩盖了任务级监控器旨在检测的信号。我们提出一种监控与分类方法,将代理系统评估分解为三个维度(质量、适用性、效率)和三个监控范围(运行内、跨运行、结构),使用方差作为表征信号。发现结果通过基于FMEA的严重性分类进行路由,将人类注意力集中在需要调查的子集上。我们在一个包含220次运行、120个文档包且受控错误注入的合成测试平台上进行评估。三个结果显现:监控范围决定故障类型——运行内监控器发现确定性阶段缺陷(CV=0.02),跨运行监控器发现随机集成后果(CV=1.25,24%为L2级),结构监控器以完全一致性识别集成缺口(CV=0.00)。注入的任务级错误与干净基线无法区分,证实结构缺陷掩盖了任务级信号。确定性分类将97%的发现路由至自动跟踪,仅留下2%反映可变行为的发现供人工调查。基于第一阶段证据,我们提出一个成熟度阶段模型,其中监控随着集成缺陷的解决从结构表征过渡到错误检测再到可靠性跟踪。该分类法、基于CV的范围表征和严重性模型在架构上可迁移至受监管行业中基于文档的多阶段代理工作流;具体校准是领域特定的。尽早部署监控:它发现的第一个问题就是最需要修复的问题。

英文摘要

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal that task-level monitors are designed to detect.We present a monitoring and triage methodology that decomposes agentic system evaluation into three dimensions (quality, suitability, efficiency) at three monitoring scopes (within-run, cross-run, structural), using variance as a characterization signal. Findings are routed through severity classification adapted from FMEA, concentrating human attention on the subset that warrants investigation. We evaluate on a synthetic testbed of 220 runs across 120 document bundles with controlled error injection.Three results emerge. Monitor scope determines failure type: within-run monitors surface deterministic stage defects (CV = 0.02), cross-run monitors surface stochastic integration consequences (CV = 1.25, 24% at L2), and a structural monitor identifies an integration gap with perfect consistency (CV = 0.00). Injected task-level errors are indistinguishable from clean baselines, confirming structural defects mask task-level signal. Deterministic triage routes 97% of findings to automated tracking, leaving the 2% reflecting variable behavior for human investigation.We propose, on Stage 1 evidence, a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects resolve. The taxonomy, CV-based scope characterization, and severity model transfer architecturally to document-driven, multi-stage agentic workflows in regulated industries; specific calibrations are domain-specific. Deploy monitoring early: the first thing it finds is the most important thing to fix.

2606.02488 2026-06-02 cs.AI 版本更新

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

RASER: 可恢复性感知的选择性升级路由器用于多跳问答

Yuyang Li, Zihe Yan, Tobias Käfer

发表机构 * Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany(卡尔斯鲁厄理工学院AIFB研究所) Shanghai Jiao Tong University, Shanghai, China(上海交通大学)

AI总结 提出RASER路由器,基于单次RAG的六个特征决定是否升级到更昂贵的检索策略,在不增加额外LLM调用的情况下,在F1分数与SOTA相当的同时节省大量token。

Comments Under Review

详情
AI中文摘要

多跳问答系统通常对每个问题使用昂贵的检索。它们可能会分解问题、运行多轮检索或通过桥接实体搜索后再回答。所有这些策略都依赖于重复的LLM调用来重写或分解问题,这增加了额外的token成本,并且在LLM预算紧张时不适用。然而,我们的分析表明,许多多跳问题已经被单次RAG正确回答,因此对每个问题都进行额外检索浪费了预算。我们引入了RASER(可恢复性感知的选择性升级路由器),这是一系列基于单次RAG及其六个特征的廉价路由器。RASER-2决定是停止还是升级到额外检索动作PRUNE。RASER-3在单次RAG、PRUNE和迭代检索IRCoT之间进行选择,使用相同的特征但增加了显式的成本-准确率权衡。两个路由器都不需要额外的LLM调用来做决定。在六个LLM和三个多跳QA基准测试中,两个路由器在F1分数上与最先进的基线保持竞争力,同时仅消耗始终PRUNE的41-49%的token,并且也少于迭代和分解检索基线。

英文摘要

Multi-hop question-answering systems often use expensive retrieval on every question. They may decompose the question, run several retrieval rounds, or search through bridge entities before answering. All of these strategies rely on repeated LLM calls to rewrite or decompose the question, which increases extra token cost, and it is not fitting when the LLM budget is tight. However, our analysis shows that lots of multi-hop questions are already answered correctly by a single one-shot RAG, so running an extra retrieval on every question wastes the budget. We introduce RASER (Recoverability-Aware Selective Escalation Router), a family of cheap routers built on one-shot RAG and six features from it. RASER-2 decides whether to stop or escalate to the extra-retrieval action PRUNE. RASER-3 chooses among one-shot RAG, PRUNE, and iterative retrieval IRCoT, using the same features but adding an explicit cost-accuracy trade-off. Neither router makes an extra LLM call to decide. Across six LLMs and three multi-hop QA benchmarks, both routers stay competitive with the other state-of-the-art (SOTA) baselines in F1 while spending only 41-49% of always-prune's tokens and also less than the iterative and decomposition retrieval baselines.

2606.02484 2026-06-02 cs.AI cs.LG 版本更新

Iteris: Agentic Research Loops for Computational Mathematics

Iteris: 计算数学的智能体研究循环

Leheng Chen, Zihao Liu, Wanyi He, Bin Dong

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University(北京大学北京国际数学研究中心和新基石科学实验室) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University(大湾研究院先进研究所智能计算中心) Zhongguancun Academy(中关村学院)

AI总结 提出Iteris智能体研究系统,通过数值实验、构造和证明草稿解决计算数学中的两个开放问题,经专家验证后获得可验证结果。

Comments 43 pages

详情
AI中文摘要

大型语言模型和智能体AI系统的最新进展使得数学发现取得了显著进展,从解决竞赛问题到处理研究级猜想。然而,计算数学中的开放问题受到的关注相对较少:该领域的研究通常不仅需要证明,还需要数值实验、对抗性构造和算法设计。在本文中,我们介绍了一个面向计算数学开放问题的智能体研究系统Iteris。我们将Iteris应用于近期Simons Workshop论文集(arXiv:2602.05394)中的两个开放问题。在这些案例研究中,Iteris生成了数值证据、构造和证明草稿,经过专家评审和修正后,得到了可验证的结果。第一个结果是关于幂律谱上共轭梯度与随机坐标下降渐近比较的相图;第二个结果是一个反例,表明即使低相干性下,带列主元的QR分解也可能无法选择良态子矩阵。这些案例研究表明,智能体AI系统可以有意义地参与计算数学开放问题的研究工作流程,而人工验证仍然至关重要。

英文摘要

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.

2606.02483 2026-06-02 cs.CR cs.AI cs.CL 版本更新

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

幽灵工具调用:投机性智能体工具的发布时隐私保护

Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) EPFL(苏黎世联邦理工学院) Aarhus University(阿arhus大学)

AI总结 针对工具增强型语言智能体投机性预发调用泄露用户意图的问题,提出投机性工具隐私契约,在发布时而非提交后保护隐私。

详情
AI中文摘要

工具增强型语言智能体投机性地发出可能的未来工具调用以隐藏延迟,但这些调用在智能体提交分支之前将推断出的用户意图泄露给外部服务。每个收到调用的外部观察者在智能体放弃分支后仍保留该披露。问题在于时机,而非授权:提交后的清理、只读限制或访问控制白名单都无法撤回观察者已持有的信息。我们将这些调用称为幽灵工具调用,并提出投机性工具隐私契约,这是一种运行时抽象,将提交前的观察视为与状态突变不同的第一类效应。我们在原型运行时中实现了该契约,并在三个语料库上评估了十二种策略。投机性调度增加了观察者能够推断用户意图的程度;事后过滤器、只读限制和访问控制白名单无法消除这种推断;只有那些在调度前改变或抑制投机性调用的参数或目标投影的发布时策略才能减少这种推断。

英文摘要

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the call retains the disclosure after the agent abandons the branch. Timing is the issue, not authorization: no commit-time cleanup, read-only restriction, or access-control allow-list unsends what an observer already holds. We call these invocations ghost tool calls and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats observation before commitment as a first-class effect, distinct from state mutation. We implement the contracts in a prototype runtime and evaluate twelve policies across three corpora. Speculative dispatch increases what an observer can infer about user intent; post-hoc filters, read-only restrictions, and access-control allow-lists leave that inference intact; only issue-time policies that change or suppress the speculative call's argument or destination projection before dispatch reduce it.

2606.02470 2026-06-02 cs.AI 版本更新

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

MCP-Persona:通过环境模拟在真实个人应用上基准测试LLM智能体

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen

发表机构 * Tsinghua University(清华大学)

AI总结 针对现有基准忽略个人社交应用中工具与个人账户或本地数据库交互的挑战,提出MCP-Persona基准,通过模拟真实个性化MCP工具评估LLM智能体性能,实验表明现有智能体在个性化工具使用上存在显著困难。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

模型上下文协议(MCP)已成为连接大型语言模型(LLM)与外部数据源和工具的变革性标准,并已迅速在个人应用和开发平台中得到采用。然而,现有基准主要关注通用信息搜索工具,未能捕捉个人社交应用带来的实际挑战,在这些应用中工具与个人账户或本地数据库交互。为弥合这一关键差距,我们引入了MCP-Persona,这是首个专门用于评估智能体在真实世界个性化MCP工具上性能的基准。MCP-Persona涵盖了一系列多样化的广泛使用的应用,从社交媒体平台如Reddit和小红书(Rednote)到企业协作套件如飞书(Lark)和Slack。我们在各种最先进(SOTA)智能体上的广泛实验表明,它们在个性化工具使用上存在显著困难,从而凸显了该基准在识别和解决这些局限性方面的关键作用。MCP-Persona公开可用:https://github.com/wwh0411/MCP-Persona。

英文摘要

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona}{https://github.com/wwh0411/MCP-Persona.

2606.02465 2026-06-02 cs.CL cs.AI 版本更新

Learning When to Translate for Multilingual Reasoning

学习何时翻译以实现多语言推理

Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

发表机构 * Graduate School of Artificial Intelligence(人工智能研究生院) Department of Computer Science & Engineering(计算机科学与工程系) POSTECH

AI总结 提出Luar框架,通过强化学习训练推理语言模型在直接理解不可靠时选择性调用翻译,从而缩小多语言推理差距。

Comments preprint

详情
AI中文摘要

推理语言模型(RLMs)在复杂推理任务上表现出色,但仍存在显著的多语言推理差距,这主要源于非英语输入中的语言理解失败。英语翻译可以通过将非英语输入转换为RLMs更可靠解释的形式来缓解这些失败,但当模型能够从原始查询中可靠推理时,翻译每个输入是不必要的。为应对这一挑战,我们提出Luar,一种语言理解边界感知的强化学习框架,训练RLMs在直接理解不可靠时选择性调用翻译。Luar训练模型在直接解决原始输入和对其英语翻译进行推理之间做出选择,仅在翻译增强推理预期显著优于直接推理时鼓励翻译。在多语言推理基准测试中,Luar优于标准GRPO和其他基于训练的基线,在低资源语言上尤其获得巨大提升。进一步分析表明,Luar在直接推理足够的情况下避免不必要的翻译,同时将其翻译调用行为扩展到未见过的低资源语言。总之,我们的工作提出了一种选择性多语言推理方法:RLMs可以学习仅在直接理解不可靠时调用翻译。该项目将在https://github.com/deokhk/LUAR公开。

英文摘要

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at https://github.com/deokhk/LUAR

2606.02463 2026-06-02 cs.CV cs.AI 版本更新

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

MASER: 面向具身3D空间智能的模态自适应专家路由

Hilton Raj, Vishnuram AV

发表机构 * Boston University(波士顿大学)

AI总结 提出MASER框架,通过训练共享VLM骨干的五个模态适配器并学习基于问题选择最佳适配器的神经路由策略,解决具身代理在3D环境中多模态推理时忽略问题语义的问题。

Comments Accepted to CVPR 2026 Foundation Models Meet Embodied Agents Workshop

详情
AI中文摘要

在3D环境中,具身代理通过推理自然语言、RGB图像、点云、深度图和相机位姿等多模态信息来回答空间相关问题。现有的视觉语言模型(VLM)在单一模态上微调,完全忽略了可能偏好不同于微调模态的问题语义。为解决这一问题,我们提出MASER(模态自适应专家路由),一个轻量级框架,训练共享VLM骨干的五个不同模态适配器,并学习一个神经路由策略,在推理时根据问题选择最佳适配器。我们使用冻结的句子变换器对每个问题进行编码,并将嵌入通过一个小型多层感知器(MLP),该感知器在oracle适配器-准确率标签上训练。我们在Open3D-VQA基准上评估我们的方法,评估结果表明没有单一模态是普遍最优的——点云答案在51.5%的情况下最佳。MASER以51.3%的oracle一致性进行路由,优于随机森林消融(43.5%),且每个问题仅调用一次适配器。

英文摘要

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five different modality adapters of a shared VLM backbone and learns a neural routing policy that selects the best adapter based on the question during inference. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our methodology over the Open3D-VQA benchmark and our evaluations show that no single modality is universally optimal -- point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.

2606.02458 2026-06-02 cs.AI 版本更新

Beyond One-shot: AI Agents for Learning in Field Experiments

超越一次性:用于现场实验学习的AI智能体

Junjie Luo, Ritu Agarwal, Gordon Gao

AI总结 研究通过工具增强的智能体AI自动从实验数据中学习并生成新干预措施,在医疗处方消息现场实验中证明其优于人类+聊天机器人方法。

详情
AI中文摘要

组织通常进行A/B测试实验,但一次实验产生的数据未被充分利用以指导后续干预设计。从先前实验数据中提取可操作知识以指导新干预存在重大障碍。我们研究工具增强的智能体AI能否自动从实验数据中学习,以在后续实验中生成新干预。通过医疗处方消息传递(693,139次患者就诊)的两阶段现场实验,我们比较了人类+聊天机器人方法(第一阶段:行为专家与对话式AI共同设计13种消息变体,444,691次患者就诊)与工具增强的智能体AI方法(第二阶段:AI自主从第一阶段数据中提取原则以生成17种新变体,248,448次患者就诊)。配备分析工具、结构化数据-信息-知识-智慧(DIKW)推理智能体和透明证据链的智能体AI方法产生了更优的干预:AI生成的最佳消息实现了69.8%的点击率(比基线高6.5个百分点)。关键的是,我们的结果表明价值来自特定领域的实验数据,而非通用推理能力:没有实验数据的前沿大语言模型无法预测哪些干预会成功。现场实验还揭示,用于干预设计的通用行为理论并不能统一适用于特定医疗环境,这激发了在实验规模上进行理论审计的智能体AI方法。我们的研究表明,工具增强的AI可以从实验数据中学习并生成改进的领域相关干预,将行为实验从一次性评估转变为可扩展的累积设计学习系统。

英文摘要

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.

2606.02453 2026-06-02 cs.CV cs.AI 版本更新

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

初始化即半程:从引导势后验生成多样图像

Xiang Li, Dianbo Liu, Kenji Kawaguchi

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学)

AI总结 针对生成模型模式崩溃问题,提出从引导势后验中采样初始噪声的DivIn方法,利用朗之万动力学引导初始化远离崩溃区域,提升多样性且兼容扩散与流匹配模型。

Comments Accepted by ICML 2026 Spotlight

详情
AI中文摘要

尽管生成模型具有显著的保真度,但它们经常遭受模式崩溃。现有的增强多样性的策略主要集中于在生成轨迹期间进行干预。我们发现一个关键的疏忽:标准高斯初始化通常导致轨迹崩溃到主导模式,因为它对引导势景观是无关的。在这项工作中,我们从引导势后验中公式化选择初始噪声,这有效地将先验重新加权到多样性丰富的区域。为了高效地从该分布中采样,我们引入了多样性诱导初始化(DivIn),它利用朗之万动力学主动导航初始化景观,将初始噪声引导远离崩溃区域,同时将其锚定到有效的数据流形。我们的方法作为一种推理时多样性增强,与扩散和流匹配模型都兼容。大量实验表明,DivIn在类到图像和文本到图像场景中都表现出优越的性能。此外,我们强调,由于DivIn与基于轨迹的方法是正交的,将它们结合起来显著扩展了多样性-质量帕累托前沿,超越了任何单独方法所能达到的。

英文摘要

Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly focus on intervening during the generation trajectory. We identify a critical oversight that the standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate selecting the initial noise from a guidance potential posterior, which effectively re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering initial noise away from collapsing regions while anchoring them to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn exhibits a superior performance in both class-to-image and text-to-image scenarios. Furthermore, we highlight that as DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.

2606.02449 2026-06-02 cs.AI cs.CL cs.CV cs.LG cs.MM 版本更新

HLL: Can Agents Cross Humanity's Last Line of Verification?

HLL:智能体能否跨越人类最后一道验证防线?

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学) Tongji University(同济大学)

AI总结 提出HLL基准,通过交互式CAPTCHA验证评估多模态智能体在受保护工作流中替代人类的能力,发现当前智能体在定位、动作校准、状态跟踪和过程一致性方面存在脆弱性。

Comments 27 pages, 14 figures

详情
AI中文摘要

多模态智能体越来越被期望代表用户操作界面,这引发了一个核心部署问题:在服务特意防止自动化的流程中,它们能否真正替代人类?CAPTCHA验证使这个问题具体化。它不仅仅是一个视觉谜题,更是在账户创建、内容访问、表单提交和其他受保护操作之前设置的人类验证边界。我们引入了 extbf{人类最后一道验证防线(HLL)},这是一个受控基准,使用交互式CAPTCHA验证来评估智能体是否能够通过基于环境的类人交互(而非仅识别)跨越这一边界。HLL涵盖了多种CAPTCHA交互,并让智能体暴露于受控的现实压力因素下,包括杂乱的网页、更困难的任务变体以及解决过程的轨迹条件验证。我们在闭环GUI环境中评估了八个前沿多模态智能体。结果表明,当前智能体在这个人类替代边界上仍然脆弱:性能在不同验证类型间差异显著,在现实界面条件下下降,当正确答案必须由有效动作轨迹支持时进一步下降。通过揭示定位、动作校准、状态跟踪和过程一致性方面的差距,HLL为衡量多模态智能体在受保护的真实世界工作流中作为人类替代品有多接近提供了一个具体的测试平台。我们的代码可在https://github.com/XinhaoS0101/HLL获取。

英文摘要

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce \textbf{Humanity's Last Line of Verification (HLL)}, a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

2606.02444 2026-06-02 cs.AI cs.CL 版本更新

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

食物噪音与虚假安全:系统评估LLMs如何在临床医生反馈下未能适应饮食障碍查询

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar, Arabella Sinclair

发表机构 * University of Aberdeen(阿伯丁大学) University of Colorado Anschutz(科罗拉多大学安舒茨分校) Heriot-Watt University(赫里奥特-瓦特大学) University College London(伦敦大学学院)

AI总结 本研究通过与临床饮食障碍专家合作,系统评估了大型语言模型在处理饮食障碍用户查询时,因不加批判地适应不安全或自伤请求而可能产生的危害。

详情
AI中文摘要

近期证据表明,饮食障碍患者越来越多地向基于大型语言模型的聊天系统寻求指导、建议和情感支持。尽管这些系统并非设计用于提供临床建议,但其感知的专业性、中立性和可访问性使其成为频繁但存在风险的支持来源。本文调查了饮食障碍用户与LLMs之间潜在的交互模式,重点关注模型不加批判地适应并促进不安全或自伤用户请求可能产生的危害。通过与临床饮食障碍专家协商,我们发现提示中的特定语言线索会增加不安全响应的可能性,并通过系统性地改变用户提示中潜在风险的程度,报告了LLMs不加批判地适应有问题的、潜在危险用户输入的程度。

英文摘要

Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large Language Model (LLM)-based chat systems. Although these systems are not designed to provide clinical advice, their perceived expertise, neutrality and accessibility make them a frequent, albeit risky, source of support. This paper investigates potential patterns of interaction between users with EDs and LLMs, focusing on the potential harms arising from models that uncritically adapt to, and facilitate unsafe or self-harming user requests. We find, in consultation with clinical ED experts, that specific linguistic cues in prompts increase the likelihood of unsafe responses and, through systematically varying the degree of potential risk present in the user prompt, report the extent to which LLMs uncritically adapt to problematic, and potentially dangerous user inputs.

2606.02443 2026-06-02 cs.CL cs.AI cs.CV 版本更新

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video: 面向主动安全预警的流式视频基准

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学)

AI总结 提出PaSBench-Video基准,包含740个视频,评估多模态大模型在危险发生前及时发出预警的能力,发现现有模型在时序精度和低误报率上表现不佳。

详情
AI中文摘要

从危险的第一个可见迹象到事故发生之间,通常存在一个仍可干预的时间窗口。具备视频能力的多模态大语言模型(MLLM)可以作为始终在线的安全监控器,在此窗口内发出警告。然而,当前的基准测试并未检验这一能力:它们依赖静态输入,忽略时间精度,并且省略了对安全场景的误报测量。我们提出了PaSBench-Video,一个包含740个视频的基准测试,涵盖驾驶、医疗、日常生活和工业生产四个领域,其中包含481个风险视频和259个无风险视频。风险视频标注了帧级别的风险起始点和事故边界。模型必须以因果方式观察视频,并发出在时间上校准且内容正确的警告。测试了13个MLLM后,我们发现没有模型在我们的最严格指标上超过20.0%,并且召回率与误报率紧密相关,皮尔逊相关系数为0.64:更高的检测率只能以在大多数安全片段上触发警告为代价。性能按领域显著分化:在日常生活领域,模型在低误报率下实现了中等召回率,因为该领域的风险本质上是异常的;而在驾驶领域,模型不加区分地触发警告,因为常规场景和危险场景看起来相似。这些结果表明,当前模型依赖于场景级别的活动线索,而不是推理正在出现的危害。

英文摘要

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.

2606.02438 2026-06-02 cs.AI 版本更新

LLM-Evolved Pattern Generators for Optimal Classical Planning

LLM演化模式生成器用于最优经典规划

Windy Phung, Dominik Drexler, Arnaud Lequen, Jendrik Seipp

AI总结 提出首个通过LLM驱动的进化程序合成学习可容许启发式函数的方法,用于最优经典规划,结合饱和成本分区保证A*搜索的最优性。

详情
AI中文摘要

学习到的启发式函数最近已成为满足规划中传统领域无关启发式函数的竞争性替代方案。然而,现有方法侧重于改进搜索引导而非保证可容许性,这使得它们不适用于最优经典规划。我们提出了第一种学习领域相关启发式函数的方法,这些启发式函数在设计上是可容许的,从而保留了A*搜索的最优性保证。我们不是学习从状态到启发式值的直接映射,而是学习构建可诱导可容许启发式函数的抽象。我们使用LLM驱动的进化程序合成框架,为每个领域获得一个程序,该程序为该领域中的任何任务生成模式集合,并通过饱和成本分区以可容许的方式组合所得模式。实验表明,学习到的程序编码了可解释的领域特定见解,在测试时以可忽略的开销运行,并在多个领域上产生了与最先进的领域无关基线相匹配的覆盖范围,同时每个状态的评估速度显著更快。

英文摘要

Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Existing approaches, however, focus on improving search guidance rather than guaranteeing admissibility, which makes them unsuitable for optimal classical planning. We present the first method for learning domain-dependent heuristics that are admissible by design and thus preserve the optimality guarantees of A* search. Instead of learning a direct mapping from states to heuristic values, we learn to construct abstractions that induce admissible heuristics. We use an LLM-driven evolutionary program-synthesis framework to obtain, for each domain, a program that produces a pattern collection for any task in that domain, and we combine the resulting patterns admissibly via saturated cost partitioning. Empirically, the learned programs encode interpretable domain-specific insights, run with negligible overhead at test time and yield heuristics that match the coverage of state-of-the-art domain-independent baselines on several domains while evaluating each state substantially faster.

2606.02434 2026-06-02 cs.AI 版本更新

Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

通过输入二值化弥合半导体视觉程序合成中的仿真到现实差距

Yusuke Ohtsubo, Kota Dohi, Koichiro Yawata, Koki Takeshita, Tatsuya Sasaki

发表机构 * National Institute of Information and Communications Technology, Japan(日本信息通信技术全国研究所)

AI总结 提出一种视觉程序合成框架,利用输入二值化策略消除扫描电子显微镜图像的纹理和噪声,使视觉语言模型专注于几何结构,从而弥合仿真到现实的差距,在MIIC数据集上将平均Dice系数从0.4393提升至0.5256。

详情
AI中文摘要

精确的电路几何参数控制对于半导体检测至关重要,但获取足够的真实训练数据成本高昂。尽管扩散模型和生成对抗网络等生成模型可以扩充训练数据,但它们无法保证计量任务所需的纳米级几何精度。我们提出一个视觉程序合成框架,其中视觉语言模型将检测图像转换为描述电路几何的可编辑领域特定语言代码,从而能够通过精确参数操作可控地生成训练数据。由于视觉语言模型仅使用合成的DSL渲染数据进行训练,在处理真实扫描电子显微镜图像时会出现领域差距。我们通过输入二值化策略弥合这一差距,该策略去除SEM特有的纹理和噪声,使模型专注于几何结构。在MIIC数据集上,二值化输入将平均Dice系数从原始输入基线的0.4393提升至0.5256,表明简单的纹理抽象显著缓解了仿真到现实的差距。

英文摘要

Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.

2606.02433 2026-06-02 cs.IR cs.AI cs.CL cs.LG cs.MA 版本更新

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

ODTQA-FoRe:面向未来数据预测与推理的开放域表格问答数据集

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港 Baptist大学)

AI总结 提出开放域表格问答的未来预测与推理任务,并构建首个覆盖时间序列预测和基于预测推理的数据集,通过基于LLM代理的TimeFore框架(检索器、预测器、分析器)解决历史数据检索、预测限制和响应标准化挑战。

Comments This paper has been accepted by Findings of ACL 2026

详情
AI中文摘要

大语言模型的快速发展显著推进了表格问答,但大多数系统无法进行面向未来的数值预测。为弥补这一空白,我们引入了一个新任务——面向未来数据预测与推理的开放域表格问答,并提出了首个覆盖时间序列预测和基于预测推理场景的数据集,使用房地产数据。该任务在检索精确历史数据、克服LLM的预测限制以及标准化多样化查询的响应方面提出了挑战。为解决上述挑战,我们提出了TimeFore,一个基于LLM代理的框架,将问题分解为三个协作角色:检索器自主生成SQL以获取数据,预测器调用外部时间序列模型以获得更高精度,分析器综合结果以构建精确且一致的最终答案。大量实验证明了我们TimeFore的有效性。

英文摘要

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.

2606.02430 2026-06-02 cs.DC cs.AI 版本更新

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

并非所有错误都平等:大型语言模型推理中错误传播的系统研究

Yafan Huang, Sheng Di, Guanpeng Li

发表机构 * University of Iowa(爱荷华大学) Argonne National Laboratory(阿贡国家实验室) University of Florida(佛罗里达大学)

AI总结 本研究通过提出的LLMFI故障注入框架,系统研究了软错误在大型语言模型推理中的传播机制,揭示了关键脆弱性模式,并提出了四种低开销的软件级可靠性改进方向。

Comments Accepted at ICS'26

详情
AI中文摘要

大型语言模型(LLM)日益集成到高性能计算(HPC)工作流中,通过代码生成和领域特定决策等多种视角加速科学发现。然而,软错误如何传播并影响LLM推理仍 largely unexplored。为弥补这一空白,我们提出了LLMFI——一个可配置且确定性的故障注入框架,并基于该框架对LLM推理中的错误传播进行了全面研究。我们系统地跨三个开放权重的LLM和十三个代表性任务(涵盖推理、多语言、数学和编码领域)注入故障。此外,我们进行了细粒度的案例研究,揭示了关键脆弱性模式。总体而言,我们的研究得出了17个要点,推进了对LLM推理中错误传播的理解,并提出了四种低开销的纯软件修改方向以提高可靠性,为未来的错误检测和缓解提供了实用指导。

英文摘要

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse perspectives such as code generation and domain-specific decision-making. Yet, how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study on error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weighted LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.

2606.02424 2026-06-02 cs.CV cs.AI cs.LG 版本更新

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

GC-MoE: 基因组引导的细胞类型特异性专家混合模型用于基于组织学的单细胞空间转录组学

Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita, Kazuya Nishimura, Andreas Dengel, Muhammad Nabeel Asim

发表机构 * Kyushu University(九州大学) German Research Center for Artificial Intelligence (DFKI GmbH)(德国人工智能研究中心) RPTU University Kaiserslautern-Landau(科布伦茨-劳恩堡大学) The University of Osaka(大阪大学) IntelligentX GmbH Osaka Metropolitan University(大阪 Metropolitan 大学)

AI总结 提出GC-MoE模型,通过路由网络估计细胞类型概率并软组合细胞类型特异性专家,结合细胞类型特异性共表达感知预测器和细胞间交互注意力模块,从组织学图像和细胞位置预测单细胞基因表达,在公共数据集上优于现有方法。

详情
AI中文摘要

基于组织学的单细胞空间转录组学(ST)估计旨在从组织病理学图像和细胞位置预测单个细胞的基因表达,从而减少对昂贵的单细胞ST测量的需求。与现有的组织学到ST方法主要预测包含多个细胞的局部区域的斑点级谱不同,该任务需要对细胞间的表达变异性进行建模,而这种变异性强烈地由细胞类型结构化。我们提出了基因组引导的细胞类型特异性专家混合模型(GC-MoE),该模型通过路由网络估计细胞类型概率,并软组合细胞类型特异性专家进行基因表达预测。为了进一步编码细胞类型依赖的基因程序,我们引入了细胞类型特异性共表达感知预测器(CAP),以及一个轻量级的细胞间交互注意力(C2CA)模块用于邻域细胞上下文。在公共单细胞ST数据集上的实验和消融研究表明,该方法在现有单细胞和适应性斑点级基线方法上均有一致的改进。

英文摘要

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.

2606.02418 2026-06-02 quant-ph cs.AI 版本更新

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

基于LLM引导搜索的双变量自行车码的进化发现

Juan Cruz-Benito, Andrew W. Cross, David Kremer, Ismael Faro

发表机构 * IBM Research(IBM研究院) IBM T. J. Watson Research Center(IBM T.J. 巴特勒研究中心)

AI总结 提出一种LLM引导的进化工作流,通过变异生成双变量自行车码和扰动变体的Python程序,在约1650次迭代中筛选约2×10^5个候选码,发现了465个不同候选码,包括非CSS扰动码和CSS码,展示了LLM引导的程序进化在结构化量子码发现中的实用性。

详情
AI中文摘要

量子LDPC码的发现需要在大型代数设计空间中进行搜索,同时可靠地认证任何候选码的参数和等价类。我们引入了一种LLM引导的进化工作流,其中语言模型变异生成双变量自行车码和扰动双变量自行车码ansätze的Python程序。在五次活动中,系统执行了约1,650次进化迭代,筛选了约$2 \times 10^5$个候选码,需要约140小时的计算时间和约400美元的LLM推理成本。候选码通过一个分阶段验证流水线进行评估,该流水线结合了$\mathrm{GF}(2)$秩计算、距离估计和认证、混合整数线性规划、BLISS Tanner图去重、可分解性分析和局部Clifford等价检查。在块长度$n \leq 360$时,工作流识别出465个不同的候选码:97个CSS双变量自行车码和368个非CSS扰动变体。CSS搜索恢复了已知的高性能码,并找到了新的有限长度代表,包括一个不可分解的[[288,16,12]]码和更高权重的码,在距离$d = 8$时最多有$k = 50$。非CSS搜索产生了在[[144,12,12]]处匹配总码品质因子的扰动码,以及根据MILP状态报告为认证值或上界的额外高距离候选码。总体而言,这些结果表明,当与独立评估配对时,LLM引导的程序进化可以作为一种实用的结构化量子码发现工具。

英文摘要

Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence classes of any candidates found. We introduce an LLM-guided evolutionary workflow in which language models mutate Python programs that generate bivariate-bicycle and perturbed bivariate-bicycle code ansätze. Across five campaigns, the system performed approximately 1{,}650 evolutionary iterations, screened about $2 \times 10^5$ candidate codes, and required ${\sim}140$ hours of computation and ${\sim}$US\$400 in LLM inference cost. Candidate codes are evaluated through a staged validation pipeline combining $\mathrm{GF}(2)$ rank computation, distance estimation and certification, mixed-integer linear programming, BLISS Tanner-graph deduplication, decomposability analysis, and local-Clifford equivalence checks. At block length $n \leq 360$, the workflow identifies 465 distinct candidate codes: 97 CSS bivariate-bicycle codes and 368 non-CSS perturbed variants. The CSS search recovers known high-performing codes and finds new finite-length representatives, including an indecomposable [[288,16,12]] code and higher-weight codes with up to $k = 50$ at distance $d = 8$. The non-CSS search produces perturbed codes matching the gross-code figure of merit at [[144,12,12]], along with additional high-distance candidates reported as certified values or upper bounds according to MILP status. Overall, these results show that LLM-guided program evolution can serve as a practical tool for structured quantum-code discovery when paired with independent evaluation.

2606.02388 2026-06-02 cs.LG cs.AI 版本更新

Policy and World Modeling Co-Training for Language Agents

语言智能体的策略与世界模型协同训练

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang

发表机构 * Southern University of Science and Technology(南方科技大学) Hong Kong University of Science and Technology(香港科学大学) Hong Kong University of Science and Technology (Guangzhou)(香港科学大学(广州)) Hong Kong Polytechnic University(香港理工大学) LIGHTSPEED

AI总结 提出PaW框架,通过在强化学习过程中添加辅助世界模型监督,无需改变推理范式,提升语言智能体在多个任务上的性能。

Comments 9 pages, 6 figures

详情
AI中文摘要

强化学习通过教导大语言模型智能体哪些行动能带来高奖励来改进它们,但对这些行动对环境的影响提供很少的监督。世界建模可以填补这一空白,但现有方法通常需要单独的模拟器、额外的训练阶段或额外的推理时计算。我们观察到,在策略强化学习 rollout 已经包含了所需的信号:每个转移将行动与其产生的下一个观察配对。基于这一观察,我们提出了PaW,一个策略和世界模型协同训练框架,它在强化学习过程中向同一策略添加辅助世界模型监督,而不改变推理范式。为了使辅助世界模型监督信息丰富且稳定,PaW引入了三个组件:基于行动熵的世界模型数据选择、噪声容忍的世界模型损失和奖励自适应的损失平衡。在三个智能体任务基准上的实验表明,在不同模型和强化学习算法上,PaW相对于强强化学习基线有一致的改进。这些结果表明,标准的强化学习 rollout 是语言智能体训练中世界模型监督的实用来源。

英文摘要

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.

2606.02381 2026-06-02 cs.AI cs.LG math.DS 版本更新

A Mathematical Conflict Framework for Contextual Data Modulation

上下文数据调制的数学冲突框架

Hakan Emre Kartal

发表机构 * GitHub

AI总结 提出一个基于算子的数学冲突框架,将冲突视为局部、方向性和上下文敏感的量,通过统一抽象算子整合权重、尺度行为和输出映射,作为独立于优化过程的数学对象。

Comments 15 pages, 3 figures, framework paper

详情
AI中文摘要

在本研究中,提出了一个基于算子的广义数学冲突框架,以显式表示原始数据与上下文数据之间的结构差异。所提出的结构将冲突视为局部、方向性和上下文敏感的量,在统一抽象算子下整合了权重、尺度行为和输出映射等组件。该框架并未简化为特定的学习算法或优化方法,而是定义为适用于不同问题类别的通用结构。现有方法通常将冲突仅仅视为嵌入优化过程中的隐式副作用,而所提出的框架则将冲突视为独立的、基于算子的、组件级别的数学对象。

英文摘要

In this study, a generalized operator-based mathematical conflict framework is presented to explicitly represent structural discrepancies between raw data and contextual data. The proposed structure treats conflict as a local, directional, and context-sensitive quantity, integrating components such as weighting, scale behavior, and output mapping under a unified abstract operator. Without being reduced to a specific learning algorithm or optimization method, the framework is defined as a general structure adaptable to different classes of problems. While existing approaches typically treat conflict merely as an implicit side effect embedded within the optimization process, the proposed framework considers conflict as an independent, operator-based, and component-level mathematical object.

2606.02380 2026-06-02 cs.CL cs.AI 版本更新

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

SPADE-Bench:通过计划-行动分歧评估智能体中的自发性策略欺骗

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Peking University(北京大学) University of Science and Technology of China(中国科学技术大学) University of Chinese Academy of Science(中国科学院大学) Alibaba Group(阿里巴巴集团)

AI总结 针对LLM智能体在工具使用中可能出现的自发性策略欺骗(计划与行动不一致),提出SPADE-Bench基准,通过结合实际工具执行和受控压力场景,严格区分欺骗与幻觉,实验证实该问题真实且紧迫。

详情
AI中文摘要

随着基于LLM的智能体扩展其操作范围,可靠性成为实际部署的前提。然而,在实际应用中,人类用户无法监控每一个即时行为;相反,执行过程往往是一个黑箱,用户仅依赖智能体的自我报告更新。这种不透明性带来了关键风险:智能体可能呈现与执行行动不一致的面向观察者的报告,使得系统不可控,尤其是在高风险自主场景中。我们将这种自我报告的计划-行动分歧称为智能体欺骗。为了评估这一点,我们引入了SPADE-Bench,一个旨在评估自发性计划-行动分歧的基准。与先前的欺骗基准不同,SPADE-Bench同时集成了实际工具执行和受控压力场景。这种设计确保了生态效度,并通过在压力下进行受控的计划-行动比较,严格区分策略欺骗与单纯的幻觉。跨主流模型的实验证实,智能体欺骗在工具使用环境中是一个真实且紧迫的问题。通过提供一个全面且稳健的评估框架,SPADE-Bench填补了智能体安全中的关键空白,促进社区朝着构建可信和可控的自主系统迈进。

英文摘要

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

2606.02374 2026-06-02 cs.AI 版本更新

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

超越像素的空间表示学习:统一栅格数据和向量语义以构建以人为中心的地理空间基础模型

Steffen Knoblauch, Hao Li, Gengchen Mai, Konstantin Klemmer, Song Gao, WenWen Li

AI总结 本文提出统一栅格感知与向量推理的联合空间表示学习范式,旨在解决当前地球观测基础模型仅依赖栅格模态、忽略向量数据中丰富结构化信息的局限性。

详情
AI中文摘要

地球观测(EO)从根本上改变了环境过程和人类活动的监测,达到了行星尺度。自监督学习的最新进展催生了地球观测基础模型(EOFMs),这些模型利用PB级未标记EO数据学习跨广泛下游地理空间任务的可迁移表示。尽管取得了这些进展,当前的EOFMs仍然局限于栅格模态,忽视了诸如OpenStreetMap和Overture等可公开访问的向量数据源中编码的丰富结构化信息。向量数据提供了地理实体的显式和紧凑表示,包括几何、拓扑和语义关系,提供了在图像中通常模糊或难以获取的关键上下文信号。因此,栅格和向量数据代表了地理空间的互补视图:栅格数据捕捉连续的物理和光谱模式,而向量数据编码离散对象及其关系结构,并且通常更多地代表人类系统而非物理系统(例如社会或人口数据)。然而,现有的地理空间表示学习范式孤立地处理这些模态,依赖于不完美且常有损的转换来桥接它们。这篇观点文章呼吁向联合空间表示学习(SRL)的范式转变,即在统一的嵌入空间中整合栅格感知与基于向量的推理。基于多模态地理空间学习的新兴努力,我们强调了对齐异构空间数据源的概念基础、技术挑战和有前景的方向。我们认为,这种整合对于开发能够更准确、可解释且语义扎实地理解地球的下一代地理空间AI系统至关重要。

英文摘要

Earth Observation (EO) has fundamentally transformed the monitoring of environmental processes and human activities up to planetary scale. Recent advances in self-supervised learning have given rise to Earth Observation Foundation Models (EOFMs), which leverage petabyte-scale unlabeled EO data to learn transferable representations across a wide range of downstream geospatial tasks. Despite these advances, current EOFMs remain largely confined to raster modalities, overlooking the rich, structured information encoded in openly-accessible vector data sources such as OpenStreetMap and Overture. Vector data provides explicit and compact representations of geographic entities, including geometry, topology, and semantic relationships, offering critical contextual signals that are often ambiguous or inaccessible in imagery alone. Raster and vector data thus represent complementary views of geographic space: raster data captures continuous physical and spectral patterns, while vector data encodes discrete objects and their relational structure and often represents more of the human rather than the physical systems (e.g. social or demographic data). However, existing geospatial representation learning paradigms treat these modalities in isolation, relying on imperfect and often lossy transformations to bridge them. This perspective paper calls for a paradigm shift toward joint Spatial Representation Learning (SRL) in an unified embedding space that integrate raster perception with vector-based reasoning. Building on emerging efforts in multimodal geospatial learning, we highlight conceptual foundations, technical challenges, and promising directions for aligning heterogeneous spatial data sources. We contend that such integration is essential for developing next-generation geospatial AI systems capable of more accurate, interpretable, and semantically grounded understanding of the Earth.

2606.02373 2026-06-02 cs.AI cs.CL cs.IR 版本更新

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1:基于状态外化马具的搜索智能体强化学习

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

AI总结 提出Harness-1,一个20B参数的搜索智能体,通过强化学习在有状态搜索马具中训练,将常规状态管理外化到环境,在八个检索基准上平均召回率0.730,超越现有开源搜索子智能体11.4个百分点。

详情
AI中文摘要

搜索智能体通常被训练为基于不断增长的转录的策略:模型必须决定如何搜索,同时记住它看到了什么、哪些证据有用、哪些约束仍然开放、哪些声明已被检查。我们认为这种表述将过多的常规状态管理放在策略内部:强化学习被迫同时优化语义搜索决策和可恢复的簿记,而环境可以更可靠地维护这些簿记。我们引入Harness-1,一个20B参数的搜索智能体(检索子智能体),在有状态搜索马具内通过强化学习训练。该马具维护环境端的工作记忆,包括候选池、重要性标记的精选集、紧凑的证据链接、验证记录、压缩和去重的观察结果,以及预算感知的上下文渲染。策略保留语义决策:搜索什么、保留或丢弃哪些文档、验证什么以及何时停止。在涵盖网络、金融、专利和多跳问答的八个检索基准上,Harness-1实现了0.730的平均精选召回率,比次强的开源搜索子智能体高出11.4个百分点,并与更大的前沿模型搜索器保持竞争力。其优势在保留的迁移基准上尤为显著,表明基于显式搜索状态的强化学习可以产生超越训练领域的检索行为。我们的代码可在https://github.com/pat-jj/harness-1获取。

英文摘要

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

2606.02372 2026-06-02 cs.AI cs.CL 版本更新

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

COMAP:面向LLM智能体的世界模型与智能体策略协同进化

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

发表机构 * Central South University(中南大学) College of Computer Science, Sichuan University(四川大学计算机学院) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算机系)

AI总结 提出COMAP框架,通过闭环交互协同进化文本世界模型和智能体策略,在具身任务规划、网页导航和工具使用基准上显著提升性能。

详情
AI中文摘要

为语言智能体配备世界模型使其能够在执行前预测环境动态并评估候选动作。然而,现有的文本世界模型通常在训练后固定不变,无法适应由进化中的智能体引发的策略内状态-动作分布。同时,智能体改进方法往往依赖外部奖励或验证器,限制了其在现实交互环境中的适用性。本文提出COMAP,一种通过闭环交互协同进化文本世界模型和智能体策略的新框架。在每个决策步骤,世界模型预测候选动作的未来状态反馈,智能体通过估计该反馈的可靠性并相应调整动作来进行未来感知反思。由此产生的策略内轨迹随后通过自蒸馏用于更新世界模型,使其更好地匹配智能体不断演化的交互分布。在具身任务规划、网页导航和工具使用基准上,COMAP始终优于竞争基线,例如使用Qwen3-4B相对提升16.75%。进一步分析表明,协同进化循环随时间提高了世界模型的预测准确性,并导致更有效的长程决策。我们的代码可在https://github.com/loyiv/CoMAP获取。

英文摘要

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

2606.02365 2026-06-02 cs.LG cs.AI 版本更新

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

FOAM:基于频率和算子误差的自适应阻尼方法,用于减少Shampoo的陈旧性误差

Kyunghun Nam, Sumyeong Ahn

发表机构 * Kyunghun Nam Sumyeong Ahn

AI总结 提出FOAM算法,通过自适应控制阻尼因子和特征分解频率来抑制陈旧性误差,在保持收敛的同时减少Shampoo的计算时间。

Comments 9 pages, ICML 2026 camera-ready version

详情
AI中文摘要

Shampoo因其在大规模优化基准上的卓越性能而备受关注,但它面临一个重要的实际瓶颈:矩阵求逆的过高计算开销。为了缓解这一问题,从业者通常依赖陈旧的预条件子更新,这在计算效率和优化保真度之间产生了根本性的权衡。在这项工作中,我们通过收敛性和稳定性的互补视角对陈旧性进行了理论研究。虽然陈旧性提高了计算效率,但它固有地降低了性能并引入了数值不稳定性。关键的是,我们发现作为数值稳定器的阻尼可以有效抑制这些负面影响。在此分析指导下,我们提出了FOAM,一种自适应算法,通过基于陈旧性误差的近似动态控制阻尼因子和特征分解频率来稳定训练。实验结果表明,与标准Shampoo相比,FOAM在保持稳健收敛的同时减少了挂钟时间。

英文摘要

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.

2606.02359 2026-06-02 cs.AI 版本更新

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

MOC:基于LLM的多智能体系统中的多阶通信

Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan, Qiang Duan

发表机构 * Fudan University(复旦大学) Nanjing Normal University(南京师范大学) Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出多阶通信(MOC)方案,通过重构智能体间通信以捕获多跳依赖,并设计结构消息合并策略,在多个数据集上提升任务性能并降低通信成本。

详情
AI中文摘要

尽管基于大语言模型(LLM)的多智能体系统取得了显著进展,但大多数研究侧重于优化协调拓扑,而同样关键的问题——如何有效地在智能体之间传输和优化消息——却很大程度上未被充分探索。当前的通信方案通常依赖于一阶邻居响应的直接拼接,这导致了受限的证据感受野,并使得关键信息在多跳路径上被稀释。为了解决这些局限性,我们提出了多阶通信(MOC)方案,该方案重构了智能体间通信以捕获多跳依赖,并引入了一种结构消息合并策略以确保效率。具体来说,我们形式化了通信机制以构建结构化的多阶证据流,随后设计了一种语义-拓扑合并算法,以在令牌约束内优化语义保真度。在六个不同数据集和不同参数规模的LLM骨干上的大量实验表明,MOC一致地提升了任务性能并降低了通信成本。代码可在 https://github.com/yao-guan/MOC 获取。

英文摘要

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at https://github.com/yao-guan/MOC.

2606.02357 2026-06-02 cs.CV cs.AI 版本更新

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

多模态智能体真的从工具使用中受益吗?能力增益的系统性研究

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

AI总结 通过对比工具增强与无工具的多模态智能体在多项任务上的表现,发现工具使用并未带来一致的性能提升,智能体更多是学会了工具调用模式而非真正利用工具扩展能力。

详情
AI中文摘要

工具增强的多模态智能体在基准测试中表现出显著提升,这常被视为智能体已学会使用工具的证据。我们认为这种解读可能为时过早:仅凭工具调用轨迹并不能证明工具提供了答案关键信息。我们研究了两种代表性的“用图像思考”智能体,Thyme 和 DeepEyesV2,在真实世界理解、OCR、图表理解和数学推理任务上的表现。每个智能体与其无工具版本以及从同一源池训练但不含工具调用轨迹的纯文本推理器进行比较。工具访问并未带来一致的总体改进,未能可靠地降低生成令牌成本,并且仅留下一个很小的仅工具解决集:DeepEyesV2 的 93% 工具解决问题和 Thyme 的 96% 也被至少一种无工具设置解决。机制消融进一步表明,完整的工具使用循环并不始终优于单独的工具调用格式或返回的执行结果。在我们研究的设置中,所分析的智能体似乎更可靠地学习了工具调用模式而非工具贡献的能力,这表明评估应区分工具的可用性与工具是否真正扩展了智能体可解决的问题。

英文摘要

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative ``thinking with images'' agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.

2606.02355 2026-06-02 cs.AI cs.LG 版本更新

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

SIRI:具有内在技能的自我内化强化学习用于LLM智能体训练

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai

发表机构 * Xiamen University(厦门大学) Meituan(美团) Macao Polytechnic University(澳门 polytechnic 大学)

AI总结 提出SIRI框架,通过自我技能挖掘、验证和内化,使LLM智能体无需外部技能生成器或推理时技能库即可提升长程任务性能,在ALFWorld和WebShop上优于基线方法。

详情
AI中文摘要

长程LLM智能体可以从可重用技能中受益,但现有的基于技能的方法通常依赖于训练期间的外部技能生成器或推理时的持久技能检索,增加了工程复杂性、上下文长度和部署延迟。我们提出了具有内在技能的自我内化强化学习(SIRI),这是一个三阶段框架,使智能体能够发现、验证和内化技能,无需外部技能生成器或推理时的技能库。SIRI首先使用GiGPO预热策略以获得基本交互能力并收集成功的无技能轨迹。然后进行自我技能挖掘,当前策略从其自身的成功普通轨迹中总结紧凑技能,并通过配对的技能增强和技能无关轨迹进行验证。最后,SIRI仅使用轨迹级效用和动作级优势将有帮助的技能引导动作令牌蒸馏到普通策略中。推理时,智能体仅使用原始提示运行。在ALFWorld和WebShop上使用Qwen2.5-7B-Instruct,SIRI将GiGPO从ALFWorld的0.908提升到0.930,从WebShop的0.728提升到0.813,优于基于提示、基于强化学习和基于记忆增强的基线。进一步分析表明,我们的自我挖掘策略可以实现与闭源大模型蒸馏相当的性能。我们的代码可在https://github.com/kirito618/SIRI获取。

英文摘要

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation with closed-source large model. Our code is available at https://github.com/kirito618/SIRI.

2606.02337 2026-06-02 cs.AI 版本更新

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

约束多智能体强化学习的协调图

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

发表机构 * Department of Electrical and Computer Engineering, Linköping University(1 链çe普大学电气与计算机工程系)

AI总结 提出CG-CMARL框架,利用协调图和拉格朗日对偶分解联合动作空间与约束耦合问题,实现独立于智能体数量的模型学习,并通过Max-Sum消息传递和拉格朗日乘子控制目标-约束权衡,生成帕累托前沿。

Comments Accepted at the Reinforcement Learning Conference (RLC) 2026. 40 pages (12 main + 28 appendix), 5 figures, 16 tables, 7 theorems

详情
AI中文摘要

约束多智能体强化学习(CMARL)面临两个相互交织的挑战:联合动作空间随智能体数量指数增长,以及额外的约束以奖励结构无法捕捉的方式耦合智能体。我们引入了用于约束多智能体强化学习的协调图(CG-CMARL),这是一个通过结合协调图和拉格朗日对偶性来应对这两个挑战的框架。该系统将联合问题分解为成对区域,每个区域由一组共享的Q函数服务,一个用于主要目标,每个约束对应一个,使得学习模型的数量与智能体数量无关。在执行时,Max-Sum消息传递在因子图上协调动作,而拉格朗日乘子控制目标-约束权衡,允许单个训练模型无需重新训练即可描绘帕累托前沿。我们在温和条件下提供了收敛保证,以及一个可分解为独立可解释来源的组合误差界,每个来源可追溯到特定的设计选择并可独立控制。在协作导航任务(其中多达10个智能体的团队必须协调到达目标位置,同时满足成对约束)上的实验表明,我们的方法产生的帕累托前沿优于以固定奖励塑形比率训练的既有基线,同时扩展到集中式方法变得棘手的大规模团队。

英文摘要

Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective--constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.

2606.02326 2026-06-02 cs.AI 版本更新

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

否决前修复:面向上下文决策的修复增强约束学习

Yifan Wang

AI总结 提出修复增强约束学习(RACL)框架,将已知修复操作融入分类器语义,在否决前考虑可负担修复,以降低错误否决率并揭示决策规则的可学习性。

Comments 7 pages, 3 figures

详情
AI中文摘要

硬约束通常被视为最终否决:一旦候选违反要求,学习规则拒绝它,任何修复都在决策语义之外处理。这忽略了一种常见的部署场景,即系统已经知道有限的修改菜单,例如添加票务选项、更改配置或请求可用的服务升级。现有的约束学习、软松弛和补救方法解决了邻近问题,但它们没有学习在否决前是否应修复某个选项。我们引入修复增强约束学习(RACL),一种上下文决策框架,将已知修复算子提升到分类器语义中。当可负担的修复使候选可行且足够偏好时,候选被接受;否则系统返回结构化的拒绝信用,并在适用时返回修复计划。这种否决前修复视图严格推广了无修复的HASSLE风格语义,揭示了终端否决规则不可约的错误否决差距,将二分类不可识别性与决策规则可学习性分离,并为观测可行性共享权重设置提供了容量和校准界限。在受控和DB1B衍生基准测试中,RACL恢复了预期的信用和修复结构。在最难的原始数据衍生层级上,验证选择的RACL将错误否决减少到10/4039(FVR 0.0025),而最强的修复搜索黑盒基线约为1064/4039,同时明确展示了FVR/EDR权衡。

英文摘要

Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repair is handled outside the decision semantics. This misses a common deployed regime in which the system already knows a finite menu of modifications, such as adding a ticket option, changing a configuration, or requesting an available service upgrade. Existing constraint-learning, soft-relaxation, and recourse methods address nearby problems, but they do not learn whether an option should be repaired before being vetoed. We introduce Repair-Augmented Constraint Learning (RACL), a contextual decision framework that lifts known repair operators into the classifier semantics. A candidate is accepted when an affordable repair makes it feasible and preferred enough; otherwise the system returns a structured rejection credit and, when applicable, a repair plan. This repair-before-veto view strictly generalizes no-repair HASSLE-style semantics, reveals an irreducible false-veto gap for terminal-veto rules, separates binary-label non-identifiability from decision-rule learnability, and gives capacity and calibration bounds for the observed-feasibility shared-weight setting. Across controlled and DB1B-derived benchmarks, RACL recovers the intended credit and repair structure. On the hardest raw-data-derived tier, validation-selected RACL reduces false vetoes to 10/4039 (FVR 0.0025), versus about 1064/4039 for the strongest repair-search black-box baseline, while making the FVR/EDR trade-off explicit.

2606.02322 2026-06-02 cs.LG cs.AI 版本更新

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

重新利用对抗扰动进行持续学习:从防御到主动对齐

Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li, Rongsheng Li, Ning Li, Zhen Xu, Weiqing Huang, Ming Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) Deakin University(德肯大学) Harbin Engineering University(哈尔滨工程大学)

AI总结 提出AdvCL框架,通过将对抗扰动重新用作几何控制信号,结合三个即插即用模块(Intra-Smooth、Proto-Clip、Inter-Align),在持续学习中同时提升标准性能、鲁棒性、降低遗忘并增强迁移。

详情
AI中文摘要

在动态环境中,大型语言模型需要不断适应新任务,但持续学习常常遭受遗忘、有限的迁移以及对对抗扰动的脆弱性。为了解决这个问题,我们提出了AdvCL,它将对抗扰动重新用作稳定的持续适应的几何控制信号。AdvCL结合了三个即插即用模块:Intra-Smooth通过小的对抗扰动促进局部平滑性;Proto-Clip使用相似性裁剪以防止过度对齐到当前任务原型;Inter-Align则通过对齐到先前任务原型的方向性对齐来减少表示间隙。实验表明,在标准性能和鲁棒性方面均有一致的提升,同时具有更低的遗忘和更强的迁移。我们进一步通过量化Intra-Smooth对扰动设置的敏感性以及Inter-Align对任务相似性和几何距离的影响,分析了关键机制。总之,这些模块在组合时提供互补增益,每个模块也可以单独集成到各种持续学习范式中,包括回放、正则化和动态架构,从而为持续学习提供了一种几何控制机制。

英文摘要

In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, limited transfer, and vulnerability to adversarial perturbations. To address this, we present AdvCL, which repurposes adversarial perturbations as a geometric control signal for stable continual adaptation. AdvCL combines three plug-in modules: Intra-Smooth promotes local smoothness via small adversarial perturbations; Proto-Clip uses similarity clipping to prevent excessive alignment to current task prototype; and Inter-Align applies directional alignment toward previous task prototype to reduce representational gaps. Experiments show consistent gains in both standard performance and robustness, with lower forgetting and stronger transfer. We further analyze key mechanisms by quantifying the sensitivity of Intra-Smooth to perturbation settings and the effect of Inter-Align on task similarity and geometric distance. In summary, the modules provide complementary gains when combined, and each can also be integrated individually into diverse CL paradigms, including replay, regularization, and dynamic architectures, thereby offering a geometric control mechanism for continual learning.

2606.02302 2026-06-02 cs.CR cs.AI 版本更新

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

SeClaw: 面向自主代理评估的规范驱动安全任务合成

Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi, Yichi Wang, Jing Yang, Taowen Wang, Jinhao Duan, Mengshu Sun, Peiyan Dong, Xuan Shen, Yang Cao, Renjing Xu, Kaidi Xu, Jindong Gu, Bo Zhang, Jize Zhang, Chenhao Lin, Philip Torr, Chao Shen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Ant Digital Technologies, Ant Group(蚂蚁集团数字技术部) Xi’an Jiaotong University(西安交通大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Oxford(牛津大学) City University of Hong Kong(香港城市大学) Institute of Science Tokyo(东京科学研究所) Zhejiang University(浙江大学) Massachusetts Institute of Technology(麻省理工学院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Beijing University of Technology(北京理工大学)

AI总结 提出SeClaw框架,通过规范驱动的安全任务合成与基于执行的安全评估,实现对自主LLM代理在状态化环境中的安全风险的可扩展、可复现评估。

详情
AI中文摘要

自主LLM代理越来越多地在有状态环境中运行,访问工具、文件、内存和外部服务。虽然这些能力支持复杂的现实工作流,但它们也引入了难以通过现有评估捕获的安全风险。当前的代理安全基准通常依赖手动策划的任务,对新兴威胁的覆盖有限,并且主要关注最终结果而非导致不安全行为的执行过程。我们引入了SeClaw,一个结合规范驱动的安全任务合成与基于执行的安全评估的框架,用于自主代理。规范驱动的安全任务合成能够从结构化风险规范中可扩展且可控地构建安全任务,而SeClaw docker提供了一个标准化测试平台,用于评估代理在各种安全风险场景下的行为。该基准涵盖了由资源、用户任务、环境和内在代理行为引起的风险,并支持对不安全行为的轨迹感知评估,超越最终响应。通过桥接系统化的任务合成和可复现的安全评估,SeClaw为测量、诊断和比较自主LLM代理中的安全故障提供了实用基础。代码可在 https://github.com/seclaw-eval/seclaw-eval 获取。

英文摘要

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for Autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

2606.02301 2026-06-02 cs.HC cs.AI cs.CV 版本更新

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

定量运动测试:从单部智能手机视频测量患者运动

Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

发表机构 * Nuffield Department of Clinical Neurosciences, University of Oxford(临床神经科学系,Nuffield大学,牛津大学) Max Planck Institute of Biological Cybernetics(生物信息学研究所) Oxford Gait Laboratory, University of Oxford(牛津大学步态实验室) Harvard Medical School(哈佛医学院) Massachusetts General Hospital(麻省总医院) Institute of Biomedical Engineering, University of Oxford(生物医学工程研究所,牛津大学) Mayo Clinic(梅奥诊所)

AI总结 提出基于计算机视觉的定量运动测试(QMT)方法,利用深度学习3D姿态估计从单目智能手机视频提取运动生物标志物,在实验室验证中与光学运动捕捉高度一致(r>0.85),并在纤维肌痛和慢性坐骨神经痛患者中展示了可靠性和纵向监测能力。

详情
AI中文摘要

慢性疼痛通过降低功能能力而损害生活质量,但在现实环境中客观测量这种功能影响仍然具有挑战性。虽然光学运动捕捉为评估运动质量改变提供了高精度,但成本高昂且局限于实验室环境。我们旨在开发并验证定量运动测试(QMT),这是一个从标准单目智能手机视频中提取3D运动生物标志物的计算机视觉流程,平衡临床可及性与生物力学精度。我们利用基于深度学习的3D姿态估计,在健康对照组(N=13)中针对金标准光学运动捕捉验证了QMT流程。经过留一法受试者校准以纠正系统偏差后,我们在两个前瞻性临床队列中部署QMT以评估现实世界效用:一项纤维肌痛患者的干预前后试验,以及一项慢性坐骨神经痛患者和健康对照的30天纵向家庭监测研究。在实验室验证中,QMT提取的临床运动指标与光学运动捕捉高度一致,显示出强相关性(r>0.85)和低平均绝对误差。QMT在纤维肌痛患者中显示出高重测信度(r>0.86),并成功追踪了慢性坐骨神经痛患者的日常运动波动。虽然现实家庭环境引入了比实验室环境更高的测量方差,但QMT完全基于远程记录发现了健康对照组和坐骨神经痛患者之间的组级差异。单目3D姿态估计为传统评估提供了一种可扩展的替代方案。QMT为临床试验中跟踪疾病进展和治疗反应提供了客观、可及的生物标志物,但需要进一步研究以优化家庭环境中的可靠性。

英文摘要

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.

2606.02287 2026-06-02 cs.LG cs.AI 版本更新

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

CityTrajBench: 城市尺度车辆轨迹生成的统一基准

Shibo Zhu, Xiaodan Shi, Dayin Chen, Yuntian Chen, Haoran Zhang, Tianhao Wu, Jinyue Yan

发表机构 * Department of Building Environment and Energy Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学建筑环境与能源工程系) Eastern Institute for Advanced Study, Eastern Institute of Technology, Ningbo, China(宁波东部先进研究所) International Centre of Urban Energy Nexus, The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学城市能源 nexus 中心) Department of Computer and Systems Sciences, Stockholm University, Sweden(斯德哥尔摩大学计算机与系统科学系) Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, China(浙江工业智能与数字孪生重点实验室) Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China(宁波数字孪生研究所) LocationMind Inc., Tokyo 101-0042, Japan(LocationMind公司)

AI总结 为解决轨迹生成方法因数据集、预处理、表示和评估指标不一致导致的比较困难,提出CityTrajBench统一基准框架,标准化数据处理、模型适配与多级评估,并在三个真实数据集上对比统计、VAE、GAN、扩散和流匹配模型,揭示不同模型在全局真实性、轨迹几何保真度等指标上的权衡。

详情
AI中文摘要

城市轨迹生成是交通模拟、城市规划和移动性分析的基础任务。然而,由于现有研究通常依赖不同的数据集、预处理流程、轨迹表示和评估指标,轨迹生成方法之间的系统比较仍然困难。这种碎片化使得报告的性能差异是否源于生成机制本身或实验协议不一致变得不明确。为解决这一问题,我们提出了CityTrajBench,一个用于城市尺度车辆轨迹生成的统一基准框架和协议。CityTrajBench在共同设置下标准化了数据摄入、轨迹归一化、特征构建、模型适配、地图感知后处理、模型选择和多级评估。它支持异构生成器,包括统计基线、基于VAE、GAN、扩散和流匹配的模型,并在三个真实世界城市轨迹数据集上评估它们。该基准衡量全局空间真实性、行程级分布保真度、轨迹级几何相似性、条件移动一致性和效率。实验揭示了模型家族之间的明确权衡:DiffTraj在轨迹级几何保真度上最强,DiffRNTraj在结构敏感的全局真实性上具有竞争力,而TrajFlow在真实性、质量、条件一致性和效率之间提供了强平衡。同时,一个简单的马尔可夫基线在粗粒度行程和局部移动统计上仍具有竞争力。这些发现表明,城市轨迹生成质量本质上是多目标的,没有单一模型在所有标准上同等占优,并且CityTrajBench为未来城市移动性生成研究提供了可复现的基准协议和测试平台。

英文摘要

Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systematic comparison across trajectory generation methods remains difficult because existing studies often rely on different datasets, preprocessing pipelines, trajectory representations, and evaluation metrics. This fragmentation makes it unclear whether reported performance differences arise from the generation mechanism itself or from inconsistent experimental protocols. To address this issue, we present CityTrajBench, a unified benchmark framework and protocol for city-scale vehicle trajectory generation. CityTrajBench standardizes data ingestion, trajectory normalization, feature construction, model adaptation, map-aware post-processing, model selection, and multi-level evaluation under a common setting. It supports heterogeneous generators, including statistical baselines, VAE-based, GAN-based, diffusion-based, and flow-matching-based models, and evaluates them on three real-world urban trajectory datasets. The benchmark measures global spatial realism, trip-level distribution fidelity, trajectory-level geometric similarity, conditional mobility consistency, and efficiency. Experiments reveal clear trade-offs across model families: DiffTraj is strongest on trajectory-level geometric fidelity, DiffRNTraj is competitive on structure-sensitive global realism, and TrajFlow provides a strong balance across realism, quality, conditional consistency, and efficiency. Meanwhile, a simple Markov baseline remains competitive on coarse-grained trip and local-movement statistics. These findings show that urban trajectory generation quality is inherently multi-objective, that no single model dominates all criteria equally, and that CityTrajBench provides a reproducible benchmark protocol and testbed for future research on urban mobility generation.

2606.02282 2026-06-02 cs.AI 版本更新

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

POIROT: 在多智能体系统中审讯智能体以进行故障检测

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian

发表机构 * Center for Automation and Robotics, Spanish National Research Council (CSIC-UPM)(自动化研究中心,西班牙国家科研 council (CSIC-UPM))

AI总结 提出POIROT协议,利用多智能体系统自身的智能体作为诊断层进行故障检测,在复杂问题、多智能体和复合故障条件下优于单LLM评估基线。

Comments 44 pages, 6 figures

详情
AI中文摘要

将大型语言模型编排成多智能体系统(LLM-MAS)解锁了卓越的推理能力,但难以表征的突发故障和幻觉阻碍了其在安全关键领域的部署——新兴的AI法规使得这一差距在法律上难以维持。现有的评估范式有一个共同的缺陷:集中式判断造成单点故障,并且需要领域特定专业知识。本文提出POIROT,一种将系统自身的智能体重新用作其诊断层的协议,利用架构中已有的认知多样性。在评估的设置中,POIROT优于单LLM评估基线,其增益随问题复杂度(OR = 1.60,$p = 0.008$)、智能体数量和故障维度而扩展,并在复合故障条件下持续存在。这些结果表明,安全监督不必外部化:执行角色的智能体拥有足够的集体智慧来审计它。我们将POIROT作为开源库发布,同时发布BLAME,一个用于安全关键多智能体系统中故障归因的基准。

英文摘要

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Cross-modal linkage risk in clinical vision-language models

临床视觉-语言模型中的跨模态链接风险

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

发表机构 * Lab for AI in Medicine(医学人工智能实验室) RWTH Aachen University(亚琛工业大学) Department of Diagnostic and Interventional Radiology(诊断与介入放射学部门)

AI总结 研究临床视觉-语言模型(VLM)在图像与报告分离场景下通过余弦相似度实现跨模态重链接的风险,并采用仅对投影头进行差分隐私微调的方法在保持图像效用同时显著降低重链接率。

详情
AI中文摘要

在配对胸部X光片和放射学报告上训练的视觉-语言模型(VLM)学习了一个共享嵌入空间,该空间可以保留实例级别的图像-报告对应关系。这在故意将X光片和报告在获取后分开的场景中(例如仅图像数据共享或受控访问的报告)构成了隐私风险,因为一个去标识的图像可能仅通过余弦相似度就重新链接到其原始叙述性报告。我们将此形式化为图像到报告的检索,并使用公共配对队列(其中真实配对是已知的)作为基准来审计风险,而不是作为隐私场景。在来自MIMIC-CXR(43,793个保留对)和外部CheXpert Plus(29,296个对)的126,804名患者的406,241个配对示例上评估了临床专业化程度递增的VLM,我们发现重链接率随专业化程度系统性地上升:最强的VLM在候选池N=100时以15倍随机概率检索到正确报告,在N=10,000时以50倍随机概率,在全数据库规模下仍远高于随机概率。该信号在去除疾病标签捷径的病理匹配困难负样本下仍然存在,表明对应关系超出了广泛的诊断类别。为了在不重新训练的情况下减少这种风险,我们冻结了两个编码器,仅对定义对齐层的投影头应用差分隐私优化(epsilon=0.34,delta=6x10^-6)。这使得MIMIC-CXR上N=10,000时的Recall@1降低了61.8%,并无需重新训练即可迁移到CheXpert Plus,同时图像侧效用基本保持:线性探针分类在14个标签上的宏AUROC仅从79.63%变为79.43%。对共享对齐层的定向DP微调可以大幅减少跨模态重链接,而不会实质性降低使这些模型在临床上有用的图像表示。

英文摘要

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.

2606.02255 2026-06-02 cs.CL cs.AI 版本更新

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

谁在NLP中进行标注?2018年至2025年人工标注报告的大规模评估

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger

发表机构 * NLLG Lab University of Technology Nuremberg(NLLG实验室 梅尔堡技术大学) Interdisciplinary Transformation University(跨学科转型大学)

AI总结 本研究通过大规模审计NLP论文中的人工标注报告,发现标注细节报告不完整,并提出了改进报告质量的框架和建议。

详情
AI中文摘要

人工标注是许多NLP研究的经验基础,从数据集构建到模型评估,但论文往往不清楚谁产生了标注以及如何控制标注过程。我们首次对主要NLP会议中的人工标注报告进行了大规模、任务级别的审计,询问哪些标注细节被记录,哪些缺失,以及报告如何随时间、主题、会议和人工判断的预期用途而变化。我们引入了一个统一的标注报告实践分类法,并针对Annotated-gold(一个由41篇论文和72个标注任务组成的人工裁决黄金标准)验证了一个LLM辅助的提取流程,其中最佳模型与裁决标签达到了与人类相当的一致性,Krippendorff's alpha为0.606,而人类间一致性为0.585。利用该流程,我们构建了Annotated-llm数据集,涵盖2018-2025年ACL会议论文,包含来自1603篇论文的2667个提取的标注任务,发现论文经常报告操作细节,如招募策略、标注者专业知识和标注量,但往往省略评估标注有效性所需的细节,包括培训、语言能力、报酬、社会人口统计、裁决和一致性值,尤其是在模型评估研究中。我们的结果表明,NLP中的标注报告随时间有所改善,但仍不均衡,我们建立了一个可扩展的框架和最低报告建议,以使人工标注更可靠、可重复和可解释。

英文摘要

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

2606.02253 2026-06-02 cs.AI 版本更新

CEON: Circular Economy Ontology Network

CEON: 循环经济本体网络

Huanyu Li, Els de Vleeschauwer, Robin Keskisärkkä, Mikael Lindecrantz, Mina Abd Nikooie Pour, Ying Li, Ben De Meester, Patrick Lambrix, Eva Blomqvist

AI总结 为解决循环经济领域跨行业信息共享的语义互操作性问题,提出了循环经济本体网络(CEON),定义了跨行业概念并实现语义感知数据文档化,在建筑、电子和纺织行业场景中验证了其有效性。

详情
AI中文摘要

提高社会中资源利用的循环性已被视为实现可持续性的一条途径,即向更加循环的经济转型。为此有许多不同的循环策略,例如重复使用产品和组件、翻新和再制造旧产品,或回收剩余或使用过的材料。为了实现这些策略,有必要在基础设施层面共享信息,并在产品生命周期内跨行业部门进行沟通。因此,在这种信息共享和沟通中实现语义互操作性是提高循环性的关键。然而,涉及产品生命周期相关众多行业的循环经济(CE)领域的知识表示仍然具有挑战性。为弥补这一差距,我们在Onto-DESIDE项目中开发了循环经济本体网络(CEON)。该本体网络旨在通过定义跨行业概念来填补CE领域的空白,并实现语义感知的数据文档化。我们通过跨行业数据文档化场景(涵盖建筑、电子和纺织行业)展示了CEON。

英文摘要

Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more circular economy. There are many different circular strategies to do so, such as reusing products and components, refurbishing and remanufacturing used products, or recycling left-over or used materials. To enable these strategies, it is necessary to share information at the infrastructure level and to communicate between industry sectors along the product life cycle. Enabling semantic interoperability in this information sharing and communication is therefore a key to increasing circularity. However, knowledge representation for the circular economy (CE) domain, which involves many relevant industry sectors related to product life cycles, remains challenging. To bridge this gap, we developed the Circular Economy Ontology Network (CEON) within the Onto-DESIDE project. This ontology network aims to fill gaps in CE by defining cross-sectorial concepts and to enable semantics-aware data documentation. We demonstrate CEON through cross-industry data documentation scenarios spanning construction, electronics, and textile sectors.

2606.02251 2026-06-02 cs.RO cs.AI eess.SP 版本更新

FW-NKF: Frequency-Weighted Neural Kalman Filters

FW-NKF: 频率加权神经卡尔曼滤波器

Adnan Harun Dogan, Berken Utku Demirel, Christian Holz

发表机构 * Department of Computer Science, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 提出频率加权神经卡尔曼滤波器(FW-NKF),通过将因果谱整形算子嵌入卡尔曼测量残差并联合学习观测和状态转移网络,抑制频带受限噪声,在混沌系统和惯性姿态估计等任务中定位误差降低达10%。

Comments Published at ICRA 2026

详情
AI中文摘要

鲁棒状态估计是机器人自主性的核心,然而经典卡尔曼滤波器难以应对频率相关干扰和模型失配,如传感器振动、电磁干扰和周期性噪声。尽管深度卡尔曼滤波器(DKF)变体通过学习潜在状态转移扩展了扩展卡尔曼滤波(EKF)框架,但它们缺乏明确的机制来抑制在实际场景中通常污染传感器测量的带限噪声分量。我们引入了频率加权神经卡尔曼滤波器(FW-NKF),这是一种统一的混合方法,将因果谱整形算子嵌入卡尔曼测量残差,并联合学习观测网络和状态转移网络。通过同时调整滤波器频谱和潜在状态表示,FW-NKF在抑制噪声主导频带的同时捕获复杂的残差结构。我们在四个异构基准上进行了广泛实验,包括混沌系统(如多维洛伦兹系统)和全身惯性姿态估计,发现定位误差降低高达10%,且方向精度显著提升。我们的消融研究证实,频率加权和深度潜在状态建模对整体性能有贡献。

英文摘要

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation, and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.

2606.02242 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

解决基于图像和基于文本的行人重识别之间的优化冲突

Karina Kvanchiani, Timur Mamedov

发表机构 * Tevian, Russia(俄罗斯Tevian) Lomonosov Moscow State University, Russia(俄罗斯罗蒙诺索夫莫斯科国立大学)

AI总结 针对图像与文本行人重识别任务因模态差异和目标冲突导致共享表示次优的问题,提出解耦两阶段训练流程,使用单一视觉编码器避免跨任务干扰,实验表明图像预训练和文本监督能提升双任务性能。

详情
AI中文摘要

基于图像(I2I)和基于文本(T2I)的行人重识别(ReID)的联合优化受到模态差异和冲突训练目标的阻碍,导致共享表示次优。虽然I2I ReID关注同一人图像间的身份级不变性,但T2I ReID由与独特视觉特征相关的实例特定文本描述驱动。本文探讨了两个ReID任务及其优化过程之间的根本差异,以实现有效训练。由于I2I和T2I ReID通常分开研究,为一种检索设置优化的损失函数可能对另一种所需的表示质量产生负面影响。基于这些发现,我们提出了一种解耦的两阶段训练流程,用于学习跨图像和文本模态的共享表示。该流程基于单个视觉编码器,支持I2I和T2I检索,同时避免训练期间的跨任务干扰。我们在多种配置下进行了大量实验,改变了域混合程序、学习策略和任务目标。我们观察到I2I ReID预训练对T2I数据的泛化能力有积极影响。此外,我们发现视觉编码器训练阶段引入文本监督能提升I2I和T2I性能。我们相信,我们的见解为统一的ReID系统和跨模态检索整体迈出了有意义的一步。

英文摘要

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observed that I2I ReID pre-training positively impacts the generalization ability to T2I data. Besides, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

2606.02218 2026-06-02 cs.LG cs.AI 版本更新

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过感知掉队者的组大小调整实现更快的同步在线策略强化学习

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar

发表机构 * University of Minnesota(明尼苏达大学) University of Waterloo(滑铁卢大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出动态组大小控制器SAGC,通过在线约束优化调整组大小,减少同步在线策略强化学习中的掉队者事件,提升墙钟效率并保持或改善训练奖励和模型质量。

详情
AI中文摘要

同步强化学习方法如组相对策略优化(GRPO)提供稳定且可复现的在线策略训练,但极易受到掉队者的影响——单个异常长的轨迹可能延迟整个组的奖励计算和参数更新。随着组大小增加,这个问题变得更加严重,在更大组的好处与同步停滞的墙钟成本之间产生矛盾。我们提出感知掉队者的组控制(SAGC),一种动态组大小控制器,根据观察到的轨迹行为在线调整训练组。SAGC将组大小选择形式化为一个在线约束优化问题,旨在保留更大组的好处,同时控制掉队者事件的长期发生率。在同步GRPO和DAPO训练中,以及在普通和强工程基线上,SAGC一致地减少了掉队者发生率并提高了墙钟效率,同时实现了有竞争力或更好的训练奖励。我们进一步表明这些收益转化为最终模型质量:在下游推理基准上,SAGC与最强的静态组大小基线相比具有竞争力或更好,并且通常在没有显式长度惩罚的情况下产生更短的输出。这些结果将动态组控制定位为使同步在线策略强化学习更高效和更稳健的实用方法。

英文摘要

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers, a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.

2606.02211 2026-06-02 cs.CL cs.AI 版本更新

Consistency Training while Mitigating Obfuscation via Rate Matching

通过速率匹配缓解混淆的一致性训练

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出速率匹配一致性训练(RMCT),通过匹配目标行为率而非约束表达方式,在减少模型受无关特征影响的同时避免混淆,提升可监控性。

详情
AI中文摘要

大型语言模型常常受到无关输入特征的影响,例如揭示用户偏好答案的线索。一致性训练通过训练模型在具有和不具有无关特征的输入上表现相似来减少这种影响。然而,现有方法在整个响应或内部激活上训练一致性,这也限制了模型是否表达这些无关特征。我们表明这会导致混淆,即模型学会不提及线索但仍受其影响,这可能削弱可监控性。为了解决这个问题,我们引入了速率匹配一致性训练(RMCT),它在选定的行为属性上训练一致性,而不约束这种行为如何表达。RMCT匹配模型在输入扰动下表现出目标行为(例如,遵循偏见线索)的速率,而不是要求具有和不具有无关特征的配对输入,从而将一致性训练扩展到无法移除无关特征的场景。我们在两个开放权重语言模型上评估了RMCT在减少谄媚方面的效果,在保留的偏见类型上实现了与标准一致性训练基线相当的偏见遵循减少,同时很大程度上保留了模型表达偏见线索的倾向。此外,我们发现RMCT在我们的实验中更节省数据,但计算效率较低。总体而言,RMCT表明一致性训练可以在不直接牺牲可监控性的情况下提高行为鲁棒性。

英文摘要

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.

2606.02179 2026-06-02 cs.LG cs.AI cs.CE 版本更新

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

关于拓扑优化中通过敏感性条件伯努利流匹配的泛化性

Mohammad Rashed, Duarte F. Valoroso Madeira, Babak Gholami, Caglar Guerbuez, Yunjia Yang, Nils Thuerey

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 通过信息论分析,提出伪敏感性概念,并利用敏感性条件伯努利流匹配生成器在拓扑优化中实现最优的分布外泛化性能。

Comments ICML Paper

详情
AI中文摘要

拓扑优化(TO)的代理模型在分布偏移(如载荷或边界条件变化)下表现出高度可变的分布外(OOD)泛化能力,但这一变异性的来源尚不清楚。我们假设OOD性能取决于条件信号保留关于驱动经典TO的伴随敏感性(简化梯度)的信息量。将TO流程建模为因果马尔可夫链,数据处理不等式表明,在该抽象下,敏感性场是拓扑预测的信息论最优条件信号。然而,计算精确的伴随敏感性在实践中可能昂贵或不可用;我们观察到某些物理场可以通过单调变换近似敏感性。为形式化这一点,我们引入 extbf{伪敏感性}来区分哪些场能够实现泛化,哪些信息贫乏。然后,我们展示了一个敏感性条件的伯努利流匹配生成器实证地证实了这些预测:以敏感性为条件可获得最先进的OOD性能,而越来越远的物理场性能退化至原始参数条件。结果在载荷偏移下的结构TO基准测试和我们新的CFD-TO数据集(边界条件偏移如多出口配置)中均成立。代码和数据集见https://tum-pbs.github.io/topotransformer/。

英文摘要

Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is governed by how much information the conditioning signal preserves about the adjoint sensitivity (reduced gradient) that drives classical TO. Modeling the TO pipeline as a causal Markov chain, the Data Processing Inequality establishes that, under this abstraction, the sensitivity field is an information-theoretically optimal conditioning signal for topology prediction. However, computing exact adjoint sensitivities can be expensive or unavailable in practice; we observe that certain physical fields can approximate sensitivities through monotone transformations. To formalize this, we introduce \textbf{pseudo-sensitivities} to characterize which fields enable generalization versus those that are information-poor. We then show that a sensitivity-conditioned Bernoulli flow-matching generator empirically confirms these predictions: conditioning on sensitivities yields state-of-the-art OOD performance, while increasingly distant physical fields degrade toward raw parameter conditioning. Results hold across structural TO benchmarks under load shifts and our new CFD-TO dataset under boundary-condition shifts such as multi-outlet configurations. Code and datasets are available at https://tum-pbs.github.io/topotransformer/ .

2606.02178 2026-06-02 cs.CV cs.AI 版本更新

Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization

混沌中的秩序:捕捉AI操纵图像伪造定位的内在能量异常

Yiming Wang, Baiqi Wu, Qingming Li, Jiahao Chen, Tong Zhang, Shouling Ji

发表机构 * Zhejiang University(浙江大学)

AI总结 本文提出FLAME框架,利用扩散过程抑制局部高频方差产生的统计能量间隙,结合LAD图和SAM适配器实现像素级伪造定位,并引入EditStream流水线持续合成训练数据,在AI生成伪造数据集上达到最先进性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

近期生成式AI的进展催生了能够产生逼真伪造图像的图像编辑模型,这些伪造图像能规避传统图像伪造定位方法,因为传统方法依赖于合成数据中不存在的物理噪声。为应对这一挑战,我们从理论上证明扩散过程本质上抑制了局部高频方差,产生了与光学成像自然熵可区分的统计能量间隙。受此启发,我们提出FLAME,一个统一框架,利用LAD图捕捉这些内在异常,并结合SAM的参数高效适配器实现精确的像素级伪造定位。此外,为弥合取证基准与不断演变的生成模型之间的滞后,我们引入EditStream,一个基于指令的连续训练数据合成自动化流水线。大量实验表明,FLAME建立了新的最先进水平,在AI生成伪造数据集上显著优于先前方法,同时有效泛化到未见过的生成架构。我们的代码可在https://github.com/phoenixnir/FLAME获取。

英文摘要

Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image forgery localization methods, as these approaches depend on physical noise absent in synthetic data. To address this challenge, we theoretically demonstrate that the diffusion process inherently suppresses local high-frequency variance, creating a statistical energy gap that is distinguishable from the natural entropy of optical imaging. Guided by this insight, we propose FLAME, a unified framework that utilizes a LAD map to capture these intrinsic anomalies, coupled with a parameter-efficient adapter for SAM to achieve precise, pixel-level forgery localization. Furthermore, to bridge the lag between forensic benchmarks and evolving generative models, we introduce EditStream, an automated pipeline for continuous, instruction-based training data synthesis. Extensive experiments demonstrate that FLAME establishes a new state-of-the-art, significantly outperforming previous methods on AI-generated forgery datasets while effectively generalizing to unseen generative architectures. Our code is available at https://github.com/phoenixnir/FLAME.

2606.02167 2026-06-02 cs.AI 版本更新

From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

从能力模型到自动规划:一种面向AAS的自动PDDL生成方法

Hamied Nabizada, Thomas Wirt, Luis Miguel Vieira da Silva, Felix Gehlhoff, Alexander Fay

发表机构 * Institute of Automation Technology, Helmut Schmidt University Hamburg(海德堡-施密特大学汉堡自动化技术研究所) Chair of Automation Technology, Ruhr University Bochum(博尔塔伦大学博德姆自动化技术教授团)

AI总结 提出一种基于资产管理外壳(AAS)能力模型自动生成PDDL规划问题的方法,使工程师无需掌握PDDL语法即可进行生产系统布局验证。

Comments Accepted at the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

设计生产系统的工程师需要验证给定的布局是否支持所有必需的生产序列。自动化规划技术可以回答此类问题,但在规划领域定义语言(PDDL)中制定所需的规划问题需要专业知识,而生产工程师通常缺乏这些知识。资产管理外壳(AAS)已成为工业4.0中工业资产的标准化数字孪生。我们展示了使用四个成熟的工业4.0标准(用于过程描述的VDI 3682、用于语义属性限定的IEC 61360-1、用于类型层次结构的IDTA 02011和用于实例描述的IDTA 02016)构建的AAS能力模型包含足够的信息,可以自动生成完整的PDDL问题。与之前引入PDDL特定子模型的工作不同,我们的方法从资源功能的领域级描述(即能力)中推导出所有规划元素,使工程师能够在完全不接触PDDL语法或规划概念的情况下对能力进行建模。我们的提取算法将分布式的多AAS架构转换为完整的PDDL规划问题。我们在实验室生产系统的AAS模型上验证了该方法,通过最优规划比较了四种布局变体,展示了工程师如何通过修改AAS模型并重新生成规划域来系统地探索设计权衡。

英文摘要

Engineers designing production systems need to verify that a given layout supports all required production sequences. Automated planning techniques can answer such questions, but formulating the required planning problems in the Planning Domain Definition Language (PDDL) demands specialized expertise that production engineers typically lack. Asset Administration Shells (AAS) have emerged as the standardized Digital Twin for industrial assets in Industry 4.0. We show that AAS capability models, structured using four established Industry 4.0 standards (VDI 3682 for process descriptions, IEC 61360-1 for semantic property qualification, IDTA 02011 for type hierarchies, and IDTA 02016 for instance descriptions), contain sufficient information to generate complete PDDL problems automatically. Unlike prior work that introduced PDDL-specific submodels, our approach derives all planning elements from domain-level descriptions of resource functions, so-called capabilities, allowing engineers to model capabilities without any exposure to PDDL syntax or planning concepts. Our extraction algorithm transforms distributed Multi-AAS architectures into complete PDDL planning problems. We validate the approach on AAS models of a laboratory production system, comparing four layout variants using optimal planning to demonstrate how engineers can systematically explore design trade-offs by modifying the AAS model and regenerating the planning domain

2606.02163 2026-06-02 cs.AI 版本更新

An Abstract Worlds Semantic Framework for Belief Change Operators

信念变化算子的抽象世界语义框架

Daniel Grimaldi, M. Vanina Martinez, Ricardo O. Rodriguez

发表机构 * Departamento de Computación Facultad de Ciencias Exactas y Naturales Universidad de Buenos Aires(布宜诺斯艾利斯大学计算机系) Instituto de Investigación en Ciencias de la Computación UBA-CONICET(UBA-CONICET计算科学研究所) Artificial Intelligence Research Institute (IIIA-CSIC)(人工智能研究 institute (IIIA-CSIC))

AI总结 提出一种无逻辑语法的集合论框架——抽象世界语义,通过将世界视为原始元素并定义世界收缩与修正算子,统一分析信念变化模型,并推广至经典与非优先信念变化。

详情
AI中文摘要

本文提出了一种用于信念变化的集合论框架,称为抽象世界语义,其中不假设任何逻辑语法。受Grove(1988)结果的启发,我们的方法将世界视为原始元素,并在此基础上定义了世界收缩和世界修正算子。该语义框架能够对信念变化模型进行统一分析。在此框架内,我们通过定义通用算子,统一了经典和非优先信念变化构造。当考虑经典命题逻辑时,我们的框架提供了AGM、KM和多重变化模型的同质化描述。总之,AWS系统化了信念变化框架和算子,简化并推广了基于信念集的信念变化理论。

英文摘要

This article proposes a set-theoretic framework for belief change, called Abstract Worlds Semantics, in which no logical syntax is assumed. Inspired by Grove's (1988) results, our approach treats worlds as primitive elements, over which world contraction and world revision operators are defined. This semantic framework enables a unified analysis of belief change models. Within this framework, we unify classical and non-prioritized belief change constructions by defining versatile operators. When classical propositional logic is considered, our framework provides a homogeneous account of AGM, KM, and Multiple Change models. In summary, AWS systematizes belief change frameworks and operators, simplifying and generalizing belief change theory over belief sets.

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR 版本更新

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

视觉丰富文档类型分类的多模态方法:一项比较分析

Catyana Heyne, Jürgen Frikel, Filippo Riccio

AI总结 针对视觉丰富文档类型分类中多模态建模策略难以系统比较的问题,本文在统一实验框架下对基于Transformer和LLM的四种代表性模型进行受控对比,发现专用多模态Transformer优于LLM方法,且图像信息贡献最大。

详情
AI中文摘要

视觉丰富文档中的文档类型分类仍然具有挑战性,因为相关信息分布在文本、视觉和布局模态中。为了捕捉这种复杂性,当前方法依赖于多样化的多模态建模策略,导致异构架构使得系统比较复杂化。这种变异性也反映在现有的比较研究中,这些研究通常依赖于异构评估设置,进一步复杂化了系统比较,并使得评估进展变得困难。为了解决这些局限性,本文提供了跨基于Transformer和基于LLM架构的多模态设计策略的结构化分析,并结合统一实验框架内的受控实证比较。具体来说,在RVL-CDIP基准上评估了四种代表性模型(LayoutLMv3、Donut、Qwen3-VL-32B-Instruct和Qwen3-32B),以系统分析文本、图像和布局信息对文档类型分类的贡献,特别关注对比OCR依赖和OCR无关的方法。结果表明,专用多模态Transformer在视觉丰富和布局密集型文档上优于基于LLM的方法。图像信息对可靠分类贡献最大,而OCR派生的文本提供有用但次要的支持。这些发现强调,对于具有显著布局结构的文档,多模态处理仍然是必不可少的。总体而言,该研究为比较多模态架构提供了系统基础,并为选择有效的特征组合和模型设计以进行文档类型分类提供了实用指导。

英文摘要

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02156 2026-06-02 eess.IV cs.AI cs.CV cs.IR cs.LG 版本更新

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

基于术前肠道血供映射预测结直肠吻合口漏风险

Zahra Tabatabaei, Jon Sporring, Mark Bremholm Ellebæk, Alaa El-Hussuna

发表机构 * Computer Science Department, Københavns Universitet (KU)(哥本哈根大学计算机科学系) University of Southern Denmark(南部丹麦大学) Odense University Hospital(奥登塞大学医院) OpenSourceResearch Collaboration(开源研究协作)

AI总结 提出一种基于术前CT影像的AI驱动系统,通过分析血管和组织特征量化吻合口漏风险,并结合内容检索支持临床决策。

详情
AI中文摘要

吻合口漏仍然是结直肠癌手术后最严重的并发症之一,显著影响患者预后、康复轨迹和医疗成本。尽管影像技术有所进步,目前的术前评估仍依赖临床评估,这一过程主观、易出错且高度依赖个人经验。迄今为止,尚无经过验证的基于CT的方法能够在术前预测吻合口漏风险。本方案论文概述了一个全面的框架,用于开发和验证一个AI驱动的系统,该系统利用对比增强前后的CT影像进行术前风险评估。研究描述了数据收集、伦理处理、符合GDPR的患者数据预处理、图像预处理以及旨在生成临床可解释输出的深度学习架构探索等阶段。该工作流程的两个主要成果是:1) 风险评估模块,通过分析CT扫描中的血管和组织特征量化漏液可能性;2) 基于内容的医学图像检索(CBMIR)模块,识别并显示相似历史病例以支持循证手术决策。该方案论文需要医院和大学之间的密切合作;本方案表明,此类系统在现有医疗基础设施内技术上可行且临床可实施。通过遵循所提出的方法论阶段和监管原则,其他机构可以复制此工作流程以开发类似的决策支持工具。最终,这一跨学科框架旨在加强手术规划、减少漏液发生率,并推动向可解释、数据驱动的精准手术的更广泛范式转变。

英文摘要

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative assessment relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol paper requires close collaboration between hospitals and universities; this protocol demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.

2606.02151 2026-06-02 cs.AI cs.SY eess.SY 版本更新

S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

S3TS:面向不确定性下高级规划的随机情景结构化树搜索

Fabio Pavirani, Bert Claessens, Pierre Pinson, Chris Develder

发表机构 * IDLab Ghent university – imec(IDLab 布鲁塞尔大学 – imec) Beebop.ai Imperial College London(伦敦帝国理工学院)

AI总结 提出随机情景结构化树搜索(S3TS)算法,通过情景树显式表示不确定性并集成非线性模型,在需求响应信号发布问题上实现近最优性能,成本比最优解高14%以内,在非线性场景中比贪心算法和确定性MCTS分别降低51%和5.4%的成本。

详情
AI中文摘要

能源领域的有效调度对于确保电网及其连接资产的可靠运行至关重要,例如通过优化发电机组和储能系统的调度。有效的规划策略必须(a)适应先进且可能非线性的系统模型——利用现代电网日益增长的数据可用性,以及(b)显式处理由可再生能源整合等引起的不确定性。虽然现有方法可以处理非线性(例如蒙特卡洛树搜索)或不确定性(例如随机数学优化),但缺乏能够同时应对这两个挑战的规划技术。为填补这一空白,我们提出了一种随机情景结构化树搜索(S3TS)算法,该算法通过情景树显式表示不确定性,同时能够集成先进的非线性模型。我们在一个模拟的需求响应信号发布问题上评估了S3TS,该问题很大程度上模仿了比利时的失衡结算机制。结果表明,在线性、可解析处理的设置中,S3TS实现了接近最优的性能,成本在情景树条件下比数学最优解高14%以内。在高度非线性的场景中,S3TS显著优于基线方法,与贪心算法和确定性MCTS相比,成本分别降低了51%和5.4%。

英文摘要

Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by, for instance, optimizing the dispatch of generation units and storage systems. An effective planning strategy must (a) accommodate advanced and potentially non-linear system models -- exploiting the increasing data availability of modern grids, and (b) explicitly handle uncertainties arising, for instance, from the integration of renewable energy sources. While existing approaches can address either non-linearity (e.g., Monte Carlo Tree Search) or uncertainty (e.g., stochastic mathematical optimization), there is a lack of planning techniques capable of addressing both challenges simultaneously. To bridge this gap, we propose a Stochastic Scenario-Structured Tree Search (S3TS) algorithm that explicitly represents uncertainty through scenario trees while enabling the integration of advanced non-linear models. We evaluate S3TS on a simulated demand response signal publication problem, largely mimicking the imbalance settlement mechanism in Belgium. The results demonstrate near-optimal performance in linear, analytically tractable settings, with costs within 14% of the mathematically optimal solution conditioned to the scenario trees. In highly non-linear scenarios, S3TS significantly outperforms baseline methods, achieving cost reductions of up to 51% and 5.4% compared to a myopic algorithm and deterministic MCTS, respectively.

2606.02147 2026-06-02 cs.CL cs.AI 版本更新

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

高、中、低资源语言中的句子和对话多语言习语

Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B, Parameswari Krishnamurthy, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Nguyen Phan Gia Bao, Amir Hossein Yari, Hawau Olamide Toyin, Nurdaulet Mukhituly, Mena Attia, Besher Hassan, Ahmad Fathan Hidayatullah, Tatsuki Kuribayashi, Haonan Li, Suma Bhat, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德大学人工智能大学) University of Hamburg(汉堡大学) Manipal University(曼印大学) IIIT Hyderabad(海得拉尔IIIT) University of Science and Technology of Hanoi(河内科学技术大学) Universitas Islam Indonesia(印尼伊斯兰大学) Princeton University(普林斯顿大学)

AI总结 针对多语言习语理解,构建了覆盖3种高资源、3种中资源和12种低资源语言的MIDI数据集,包含句子和对话上下文中的字面与比喻用法,实验表明低资源语言理解更差,字面义比比喻义更难,对话上下文虽有改善但未消除差距。

详情
AI中文摘要

习语表达对多语言NLP构成重大挑战,因为其意义在比喻和字面用法之间转换,通常需要上下文才能准确理解。先前工作集中在高资源语言上,通常评估孤立的习语意义问题,忽略了现实话语。我们引入了MIDI,一个多语言习语数据集,涵盖3种高资源、3种中资源和12种低资源语言,由母语者策划。与之前的数据集不同,MIDI提供了嵌入在句子级和对话上下文中的习语,捕捉了字面和比喻解读。对最先进模型的基准测试表明,习语理解在低资源语言中下降,并且在所有资源层级中,字面解释比比喻解释更难。对话上下文提高了性能,但并未消除这些差异。通过受控测试和对隐藏表示的干预,我们进一步将记忆与推理分离,揭示了当前模型的核心局限性。

英文摘要

Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.

2606.02138 2026-06-02 cs.LG cs.AI 版本更新

VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting

VLBM:面向OOD鲁棒多变量时间序列预测的变分潜在基建模

Xudong Zhang, Jierui Lei, Jiacheng Li, Lingdong Shen, Jian Cui, Haina Tang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) Amap, Alibaba Group(阿里巴巴集团阿地图) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学先进交叉学科学院) Environmental Microbiome and Innovative Genomics Laboratory, Peking University(北京大学环境微生物与创新基因组实验室)

AI总结 提出VLBM框架,通过变分潜在基分离稳定动态与OOD偏差,实现混合ID/OOD分布下的鲁棒预测,在12个基准任务上平均MAE和MSE分别提升15.08%和7.74%。

详情
AI中文摘要

多变量时间序列预测中的分布外(OOD)事件虽然罕见,但往往主导现实世界风险,使得平均情况预测不足以可靠部署。在混合ID/OOD分布的标准平均风险训练下,来自罕见OOD事件的优化信号可能被频繁的分布内(ID)模式淹没,因此强基准精度可能无法转化为高影响偏移下的可靠性。为解决此问题,我们提出VLBM(变分潜在基模型),一种理论指导的潜在预测框架,将稳定动态与OOD引起的偏差分离。VLBM学习一个共享潜在基,定义稳定ID动态的低秩子空间,将输入显式分解为基子空间分量和正交残差分量,并将未来感知后验与未来盲先验对齐,使得测试时潜在推断仅依赖于历史输入。在涵盖交通、天气、电力系统及其他现实世界领域的12个基准任务上,包括新构建的现实世界OOD交通数据集,VLBM实现了最先进的OOD鲁棒性和ID精度,平均MAE和MSE比最强基线分别提升15.08%和7.74%。在合成模拟数据集上,VLBM也持续实现最佳性能并更好地跟踪OOD脉冲恢复。这些结果支持潜在结构化预测作为混合ID和OOD条件下鲁棒预测的原则性途径。代码可在https://github.com/leijieruilq/VLBM_OOD_forecast获取。

英文摘要

Out of distribution (OOD) events in multivariate time series forecasting are rare but often dominate real world risk, making average case forecasting insufficient for reliable deployment. Under standard average risk training on mixed ID/OOD distributions, optimization signals from rare OOD events can be overwhelmed by frequent in distribution (ID) patterns, so strong benchmark accuracy may not translate into reliability under high impact shifts. To address this issue, we propose VLBM (Variational Latent Basis Model), a theory guided latent forecasting framework that separates stable dynamics from OOD induced deviations. VLBM learns a shared latent basis that defines a low rank subspace for stable ID dynamics, explicitly decomposes inputs into basis subspace components and orthogonal residual components, and aligns a future aware posterior with a future blind prior so that test time latent inference depends only on historical input. Across 12 benchmark tasks spanning transportation, weather, power systems, and other real world domains, including newly constructed real world OOD traffic datasets, VLBM achieves state of the art OOD robustness and ID accuracy, with average MAE and MSE gains of 15.08\% and 7.74\% over the strongest baseline. On a synthetic simulation dataset, VLBM also consistently achieves the best performance and better tracks OOD pulse recovery. These results support latent structured forecasting as a principled route to robust prediction under mixed ID and OOD conditions. The code is available at https://github.com/leijieruilq/VLBM_OOD_forecast.

2606.02134 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Rethinking Evaluation Paradigms in IBP-based Certified Training

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

发表机构 * University of Freiburg(弗赖堡大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 针对认证训练中自然精度与认证精度的权衡问题,提出基于Pareto前沿的多目标超参数优化方法,实现公平的方法间比较,并发现先前配置的欠调优现象,建立新的最优性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能,但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证,但计算成本高昂。为缓解这一问题,认证训练技术在训练过程中优化可验证的鲁棒性,通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的,报告单一配置的常见做法存在问题:它可能误导关于整体性能的结论,并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较,我们执行高效的自动化多目标超参数优化,为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优,从而获得更优性能并建立新的最优水平。利用这些前沿,我们首次对认证训练方法进行了全面的多目标比较,表明先前的进展并不像假设的那样显著,并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural--certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.

2606.02120 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

理解增强的模型协作用于长尾自我中心错误检测

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS(人工智能安全国家重点实验室,计算技术研究所,中国科学院) School of Computer Science and Tech., University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Institute of Information Engineering, CAS(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

AI总结 提出理解增强的模型协作方法(UE-MCM),结合粗粒度视频理解与细粒度动作推理,通过双分支模型和自适应融合门检测自我中心视频中的错误,并优化长尾分布。

详情
AI中文摘要

在本报告中,我们解决了从自我中心视频数据中判断用户是否错误执行动作的问题。为此,我们提出了一种理解增强的模型协作方法(UE-MCM),该方法将高效的粗粒度视频理解与准确的细粒度动作推理相结合。具体来说,UE-MCM包含一个小模型分支和一个大模型分支。大模型分支关注细粒度动作本身是否执行错误,而小模型分支联合输入粗粒度视频和细粒度片段,以识别可能局部正确但与整体工作流不一致的动作。小模型分支基于CLIP4CLIP视频编码器构建,该编码器从通过扩散对比重建增强的CLIP模型初始化,大模型分支使用Qwen3-VL嵌入模型从细粒度动作片段中提取高容量表示。然后,通过轻量级协作门自适应融合小分支预测和大分支预测。为了处理错误实例的长尾分布,我们通过互补目标优化分类器,包括重加权交叉熵、AUC导向学习和标签感知调整。所得系统平衡了速度和准确性,使其能够有效检测自我中心教学视频中的细微、罕见和模糊错误。

英文摘要

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.

2606.02119 2026-06-02 cs.LG cs.AI 版本更新

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

到底有多难?难度感知的多目标遗忘学习

Jiangwei Chen, Xinyuan Niu, Rachael Hwee Ling Sim, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

发表机构 * National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对现有遗忘学习无法保证同时提升遗忘质量和保持保留效用的缺陷,提出一种基于约束优化的难度感知多目标遗忘算法(HAMU),通过量化遗忘数据与保留数据的相似度来指导模型更新,在保证遗忘质量提升的同时最小化保留效用损失。

Comments ICML 2026

详情
AI中文摘要

机器遗忘旨在由于隐私、版权或偏见问题,移除特定遗忘训练数据的影响,同时保持模型在剩余保留数据上的性能。现有的遗忘算法,例如优化损失的加权组合,试图实现提高遗忘质量和保持保留效用这些目标。然而,它们无法保证对所有遗忘和保留数据都能将目标改进到指定程度。在这项工作中,我们从约束优化的角度,用一种新颖且理论扎实的方法解决了这一限制。首先,我们确定遗忘数据和保留数据之间的相似度可以量化调和两个目标的难度。接下来,我们推导出一种遗忘算法(HAMU),其总体目标是通过根据我们的难度度量更新模型权重,在保证遗忘质量有指定改进的同时,最小化保留效用成本/下降。我们的难度度量还告知用户何时保留效用下降不可避免,即两个目标无法同时改进,应考虑停止。我们的算法适用于非凸模型,并且易于并行化,使其易于在实际场景中部署。我们通过实验使用大型模型在图像和文本数据集上证明了HAMU相对于基线的优越性能。我们的代码可在 https://github.com/aoi3142/HAMU 获取。

英文摘要

Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaining the model performance on the remaining retain data. Existing unlearning algorithms, such as optimizing a weighted combination of losses, have tried to achieve these objectives of improving forget quality and maintaining retain utility. However, they do not guarantee that these objectives can be improved by a specified extent for all forget and retain data. In this work, we address this limitation with a novel and theoretically-grounded approach from a constrained optimization perspective. Firstly, we identify that the hardness of reconciling both objectives can be quantified by the similarity between the forget data and the retain data. Next, we derive an unlearning algorithm (HAMU) with the overall goal of guaranteeing a specified improvement in forget quality while minimizing the retain utility cost/degradation by updating the model weights based on our hardness measure. Our hardness measure also informs users when retain utility degradation is unavoidable, i.e., both objectives cannot be improved simultaneously, and stopping should be considered. Our algorithm is applicable to non-convex models and is easily parallelizable, making it readily deployable in real-world scenarios. We empirically demonstrate HAMU's superior performance over baselines on both image and text datasets using large models. Our code is available at https://github.com/aoi3142/HAMU.

2606.02113 2026-06-02 cs.CL cs.AI 版本更新

A Primer in Post-Training Reasoning Data: What We Know About How It Works

后训练推理数据入门:我们对其运作机制的了解

Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun, Xiangzheng Zhang, Tong Yang

AI总结 本文综述了后训练推理数据的类型、效用、构建方法和扩展规律,为未来推理数据发布和后训练方案提供归因框架。

Comments 22 pages. Project Repository: https://github.com/RenBing-Sumeru/Awesome-LLM-Reasoning-Data

详情
AI中文摘要

后训练已成为大型推理模型近期进展的主要驱动力,而推理数据通常是决定这一阶段成功与否的关键变量。关于后训练推理数据的研究迅速增长,但相关文献仍分散在数据集论文、强化学习方案、奖励模型研究、基准测试和前沿系统报告中。本文是首篇综合了超过150篇关键公开研究和系统报告的后训练推理数据入门文章。我们围绕四个问题组织该领域:存在哪些数据对象、什么使它们有用、它们如何构建以及它们如何扩展。这一组织方式为未来的推理数据发布和后训练方案提供了归因框架。

英文摘要

Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key public studies and system reports on post-training reasoning data. We organize the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. Together, this organization provides an attribution framework for future reasoning-data releases and post-training recipes.

2606.02111 2026-06-02 cs.CV cs.AI cs.CL 版本更新

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

使用多片段视频破解多模态大语言模型

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

发表机构 * Department of Applied Artificial Intelligence, Sungkyunkwan University(应用人工智能系,成均馆大学) Department of Human-Artificial Intelligence Interaction, Sungkyunkwan University(人机交互系,成均馆大学)

AI总结 提出MCV SafetyBench数据集,通过多片段视频评估多模态大语言模型的安全漏洞,发现视频模态比图像更脆弱,动态和多样化上下文增加攻击成功率,并基于图像模态的鲁棒性提出防御策略。

Comments 27 pages, 20 figures, Accepted to the Main Conference of ACL 2026

详情
AI中文摘要

随着多模态大语言模型(MLLMs)发展到处理视频输入,人们开始担忧其被恶意滥用的可能性。先前的越狱研究表明,MLLMs中的安全对齐可以通过视觉输入被绕过,但尚不清楚视频输入的哪些属性导致了这种脆弱性。为填补这一空白,我们引入了Multi-Clip Video (MCV) SafetyBench,一个包含2,920个视频的数据集,旨在评估视频输入的多样性如何影响MLLMs的脆弱性。每个视频由多个短片段组成,描述与有害查询相关的不同上下文。对八个代表性视频MLLMs的实验表明,攻击成功率随着片段数量的增加而持续提高。我们的结果进一步表明,视频模态(1)比图像模态更脆弱,(2)对动态视频比对静态视频更脆弱,(3)当视频包含更多样化的上下文时更脆弱。基于这些发现,我们提出了一种利用图像模态相对鲁棒性的防御策略。

英文摘要

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.

2606.02109 2026-06-02 cs.AI 版本更新

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

BADGER:桥接生成式企业推理的自主与确定性评估

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma, Nathan Miller

发表机构 * Merkle Analytics

AI总结 提出BADGER框架,统一文本到SQL评估与自主行为评估,通过混合执行准确率指标(Hybrid-EX)和自主评估套件,在工业查询上超越现有方法。

Comments 30 pages, 2 figures, 6 tables

详情
AI中文摘要

将自然语言转换为SQL查询并编排多步自主推理管道的企业AI系统需要与学术基准根本不同的评估方法。Spider和BIRD建立了执行准确率协议;G-Eval和RAGAS推进了基于LLM的评估;最近的工作如Spider 2.0、BEAVER和BIRD-Interact开始解决企业和自主维度。没有一个单一框架将文本到SQL评估与自主行为评估统一到一个生产级管道中,并针对人类专家判断进行校准。我们提出了在Merkle开发的BADGER,一个统一的评估框架,集成了文本到SQL评估与自主行为评估。BADGER提供三个贡献。首先,LLM辅助的SQL组件提取,扩展Spider方法以处理CTE-heavy、方言特定的SQL。其次,混合执行准确率指标(Hybrid-EX),通过使用LLM在确定性单元格级评分之前推断结构对齐,解决列别名和数值容错脆弱性。在150个人工标注的行业查询上验证,Hybrid-EX达到Cohen's kappa=0.717 [95% CI: 0.600-0.822](高度一致性)和87.3%的平衡准确率,优于所有六个竞争框架(Delta-kappa: 0.322-0.502,所有p<=0.001)。第三,一个企业自主评估套件,将RAGAS、G-Eval和代理基准指标组装成一个统一管道;超额工具使用是唯一的新元素。BADGER完全在客户受管的数据环境中运行,支持可配置的LLM评判后端,并支持快速原型化客户特定的评判器和指标,作为持续评估骨干而非一次性质量门。

英文摘要

Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evaluation approaches fundamentally different from academic benchmarks. Spider and BIRD established execution-accuracy protocols; G-Eval and RAGAS advanced LLM-based assessment; and recent work such as Spider 2.0, BEAVER, and BIRD-Interact has begun to address enterprise and agentic dimensions. No single framework unifies text-to-SQL assessment with agentic behavior evaluation into a production-grade pipeline calibrated against human expert judgment. We present BADGER, developed at Merkle, a unified evaluation framework integrating text-to-SQL assessment with agentic behavior evaluation. BADGER offers three contributions. First, LLM-assisted SQL component extraction extending Spider methodology to handle CTE-heavy, dialect-specific SQL. Second, a hybrid execution accuracy metric (Hybrid-EX) resolving column-aliasing and numeric-tolerance brittleness by using an LLM to infer structural alignments before deterministic cell-level scoring. Validated on 150 human-annotated industry queries, Hybrid-EX achieves Cohen's kappa=0.717 [95% CI: 0.600-0.822] (Substantial agreement) and 87.3% balanced accuracy, outperforming all six competing frameworks (Delta-kappa: 0.322-0.502, all p<=0.001). Third, an enterprise agentic evaluation suite assembling RAGAS, G-Eval, and agent benchmark metrics into a unified pipeline; Excess Tool Usage is the sole novel element. BADGER runs entirely within the client's governed data environment, supports configurable LLM judge backends, and enables rapid prototyping of client-specific judges and metrics, serving as a continuous evaluation backbone rather than a one-time quality gate.

2606.02107 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

网络分布式多智能体强化学习用于四旋翼无人机一致性控制

Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department, German University in Cairo (GUC), Egypt(埃及德国大学(GUC)机械工程系) Institute of Flight Mechanics and Control (IFR), Head of Flight Robotics, University of Stuttgart, Germany(德国斯图加特大学飞行力学与控制研究所) Faculty of EMS, Head of Mechatronics Engineering Department, German University in Cairo (GUC), Egypt(埃及德国大学(GUC)EMS学院)

AI总结 提出网络分布式多智能体强化学习框架,利用通信图实现分布式策略,通过MASAC训练高层规划器,实现零样本扩展到250个智能体。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON)
AI中文摘要

本文提出了一种用于四旋翼无人机一致性控制的网络分布式多智能体强化学习(ND-MARL)框架。与依赖集中式规划或完全分散式执行的传统多智能体MARL公式相比,ND-MARL将群体通信图纳入决策过程。在2-邻居通信拓扑下,每个智能体仅观察两个邻居的信息,并通过分布式策略输出动作。使用多智能体软演员-评论家(MASAC)训练高层分布式一致性规划器,并将其嵌入层次化堆栈中,以生成由低层四旋翼控制器跟踪的参考目标位置。结果表明,与集中式MARL控制器相比,实现了平滑的一致性轨迹和规划器-跟踪器集成。最值得注意的是,学习到的控制器表现出零样本可扩展性,即在三智能体系统上训练的策略,在相同的2-邻居通信拓扑下,无需重新训练或微调即可部署到多达250个智能体的群体中,实现了随着团队规模增大而稳态散布增加的一致收敛,这是由于稀疏信息传播所致。这些发现突显了ND-MARL作为分布式、通信感知的四旋翼一致性控制的稳定框架。

英文摘要

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional multi-agent MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information of only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability, as policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence with increasing steady-state spread at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.

2606.02093 2026-06-02 cs.CL cs.AI cs.LG 版本更新

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

不确定性量化中模糊性在错误预测中的作用

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

发表机构 * University of Cambridge(剑桥大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 通过解耦输入模糊性与不确定性信号,利用门控专家和选择性预测提升大语言模型在问答任务中的错误预测性能。

Comments 8 pages not including references and appendices, 3 figures

详情
AI中文摘要

错误预测任务,即预测模型输出是否正确,通常通过不确定性量化(UQ)来解决。然而,虽然不确定性指标捕捉了模型缺乏知识或能力进行预测的情况,但它们也反映了模型输入和上下文中固有的偶然不确定性。本文提出了一种通过将输入模糊性与UQ信号解耦来改进大语言模型(LLM)错误预测的方法。我们在问答(QA)任务上使用六种UQ指标进行实验,结果表明,UQ指标在无歧义实例上的错误预测能力优于具有多个合理答案的问题。我们使用门控专家和选择性预测将真实和预测的模糊性标签纳入错误预测流程。我们发现,模糊性信息提高了跨模型家族、训练和评估范式、数据集(包括据称无歧义的数据集)以及偶然不确定性来源的错误预测分数,在标准数据集上对单个UQ指标的PRR提升超过10个百分点。

英文摘要

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.

2606.02092 2026-06-02 eess.IV cs.AI cs.CV 版本更新

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

LALE:用于土地覆盖估计的轻量级Transformer架构

Ümit Mert Çağlar, Alptekin Temizel

发表机构 * Middle East Technical University(中亚技术大学)

AI总结 提出LALE架构,通过分辨率分支编码器(轻量级ConvMixer处理高分辨率局部特征,Transformer处理低分辨率全局上下文)和全MLP多尺度解码器,在遥感图像分割中实现高效性能与计算成本的平衡。

详情
AI中文摘要

遥感图像的语义分割需要模型在严格的计算预算下同时捕捉全局上下文和局部细节。先前的工作通常针对这些轴之一进行优化:注意力用于全局上下文,卷积用于局部细节,或紧凑性用于效率。虽然混合方法旨在同时捕捉两者,但它们需要架构更改和带有计算开销的编码器骨干,限制了效率和性能。我们提出了LALE(用于土地覆盖估计的轻量级Transformer架构),一种端到端的遥感图像分割架构,它通过分辨率分支编码器:轻量级ConvMixer阶段处理高分辨率局部特征,而Transformer阶段处理低分辨率全局上下文,将自注意力的二次成本限制在深层、下采样的特征图上。全MLP多尺度解码器,以及贯穿始终的RMSNorm和StarReLU,进一步减少了计算量和参数数量。在大型ARAS400k遥感分割基准上,LALE相对于CNN、Transformer和混合基线建立了强大的效率-性能权衡。我们最小的变体(仅1.6M参数)在F1分数上达到最佳基线(UPerNet)的2.6分以内,同时使用4.5倍更少的参数、7倍更少的存储、17倍更少的GMACs,并提供1.8倍更高的吞吐量。

英文摘要

Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight-transformer Architecture for Land-cover Estimation), an end-to-end remote sensing image segmentation architecture, that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high-resolution local features, while transformer stages handle low-resolution global context, confining the quadratic cost of self-attention to deep, downsampled feature maps. An all-MLP multi-scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large-scale ARAS400k remote-sensing segmentation benchmark, LALE establishes a strong efficiency-performance trade-off against CNN, transformer, and hybrid baselines. Our smallest variant, (just 1.6M parameters), reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput.

2606.02080 2026-06-02 cs.MA cs.AI cs.CV 版本更新

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

Agentic-J:用于生物显微镜图像分析的AI智能体

Lukas Johanns, Marilin Moor, Davide Panzeri, Yu Zhou, Xinyi Chen, Nora F. K. Pauly, Zixuan Pan, Matthias Gunzer, Andreas Müller, Yiyu Shi, Hedi Peterson, Jianxu Chen

AI总结 提出基于容器的多智能体AI助手Agentic-J,通过自然语言接口集成ImageJ/Fiji工具,实现从细胞分割到多条件量化的可追溯、可复现生物图像分析工作流。

Comments Presented at Cell Biology at Scale 2026 (Poster). The Agentic-J project is available at https://mmv-lab.github.io/Agentic-J/

详情
AI中文摘要

生物图像分析日益需要整合异构工具、编程环境和领域知识,而很少有研究人员能同时掌握这些。我们提出Agentic-J,一个容器化的多智能体AI助手,主要面向ImageJ/Fiji,使生物学家能够用自然语言指定分析任务,从细胞核分割、细胞追踪到多条件量化。该智能体生成可执行的脚本,并组织成有文档记录的项目结构,因此每个分析决策都是可追溯的,工作流可以复现或共享。专门的子智能体负责插件管理、代码生成、调试、质量保证和统计报告。本文介绍系统的设计,展示真实的生物显微镜图像分析工作流,并详细说明技术实现。

英文摘要

Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that few researchers can command simultaneously. We present Agentic-J, a containerised, multi-agent AI assistant, primarily for ImageJ/Fiji that enables biologists to specify analysis tasks in natural language, from nuclei segmentation and cell tracking to multi-condition quantification. The agent generates executable scripts organised into a documented project structure, so every analysis decision is traceable and the workflow can be reproduced or shared. The specialised sub-agents handle plugin management, code generation, debugging, quality assurance, and statistical reporting. In this paper we introduce the system's design, demonstrate real biological microscopy image analysis workflows, and detailed the technical implementation.

2606.02068 2026-06-02 cs.CV cs.AI 版本更新

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

基于可微多平面图像的快速轻量级新视角合成

Kaidi Zhang, Guanxu Zhu

发表机构 * Universiti Malaya(马来大学) Wuhan University(武汉大学)

AI总结 针对现有方法在速度、模型大小和稀疏视角下的不足,提出基于可微多平面图像(MPI)的快速轻量级新视角合成方法,利用点图进行几何初始化并引入一步扩散处理空洞和伪影。

详情
AI中文摘要

近年来,新视角合成取得了显著进展,主流方法如神经辐射场(NeRF)和3D高斯泼溅(3DGS)产生了令人印象深刻的结果。然而,这些方法往往难以平衡渲染速度和模型大小,且其基于优化的训练可能非常耗时。此外,它们通常依赖于密集观测,在稀疏视角条件下往往无法产生令人满意的结果。尽管前馈重建显著减少了3DGS的优化时间,但其像素对齐公式从单张图像生成数百万个高斯,严重限制了其在移动设备上的实际部署。为了解决这些限制,我们重新审视了多平面图像(MPI)表示,该表示使用一组紧凑的平面层来表示场景,以实现高效的新视角合成。利用视觉基础模型的最新进展,我们使用预测的点图进行可靠的几何初始化,然后进行可微优化。为了解决稀疏初始化MPI中的空洞和伪影问题,我们引入了一步扩散,该扩散既参与MPI的可微优化,也参与渲染结果的后处理。与代表性的基于GS的方法相比,我们的方法速度快30.7%,模型大小仅为其14.8%,同时在前景场景中实现了具有竞争力的合成质量。

英文摘要

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image(MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative GS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios

2606.02054 2026-06-02 cs.AI 版本更新

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

eMoT: 通过符号锚定和记忆腐蚀演化的思维记忆

Xiang Li, Jiwei Wei, Ke Liu, Yitong Qin, Jinyu Guo, Malu Zhang, Peng Wang, Yang Yang

发表机构 * Center for Future Media, University of Electronic Science and Technology of China(未来媒体中心,电子科技大学)

AI总结 提出eMoT框架,通过记忆腐蚀、符号锚定和一致性精炼三个模块,将推理轨迹视为动态演化记忆,以稳定多步推理并提升准确率与一致性。

详情
AI中文摘要

尽管大型语言模型(LLMs)在多步推理任务上取得了令人印象深刻的性能,但其可靠性仍然受到关键限制的阻碍,例如不受约束的幻觉和较差的数值计算。从根本上说,这些问题源于标准模型将推理视为一次性的瞬态生成过程,而不是保留并改进成功的程序逻辑。为了解决这些挑战,我们提出了eMoT(演化的思维记忆),这是一个统一框架,通过将推理轨迹视为动态演化的记忆而非静态模板来稳定多步推理。该框架主要由三个相互连接的模块组成:(i)记忆腐蚀机制,强化高效用推理结构,同时逐渐衰减较少使用的结构;(ii)符号锚定引擎,利用Python进行确定性计算,类似于人类使用计算器;(iii)一致性驱动的精炼过程,将神经推理与符号结果对齐,减少逻辑差异的累积。在多个推理基准上,eMoT相比标准的思维链和结构化推理基线提高了准确率和解决方案一致性。在传统任务Game of 24上,eMoT达到了100%的准确率,比基线高出17.6%。在数学任务GSM8K、ASDiv、SVAMP和MGSM上的评估进一步显示了在多步数学推理中的持续改进。在我们的评估中,尽管使用了轻量级骨干模型且基线能力受限,我们仍取得了优越的性能。与依赖大规模模型的替代方法相比,我们的结果表明性能提升根本上是由eMoT框架的推理控制驱动的,而非单纯的模型规模。

英文摘要

While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered by critical limitations such as unconstrained hallucinations and poor numerical computation. Fundamentally, these issues arise because standard models treat reasoning as a transient, one-off generation process rather than retaining and refining successful procedural logic. To address these challenges, we propose eMoT (evolving Memory-of-Thought), a unified framework that stabilizes multi-step reasoning by treating reasoning trajectories as dynamic, evolving memories rather than static templates. The framework primarily consists of three interconnected modules: (i) a memory corrosion mechanism that reinforces high-utility reasoning structures while gradually decaying less frequent ones; (ii) a symbolic anchoring engine that utilizes Python for deterministic computation, much like a human uses a calculator; and (iii) a consistency-driven refinement process that aligns neural inference with symbolic outcomes, reducing the accumulation of logical discrepancies. Across multiple reasoning benchmarks, eMoT improves accuracy and solution consistency over standard Chain-of-Thought and structured reasoning baselines.On the traditional task Game of 24, eMoT achieves 100% accuracy, surpassing the baseline by up to 17.6%. Evaluations on mathematical task GSM8K, ASDiv, SVAMP, and MGSM further show consistent gains in multi-step mathematical reasoning. In our evaluation, we achieve superior performance despite utilizing a lightweight backbone model with constrained baseline capabilities. Compared to alternative methods that rely on massively scaled models, our results demonstrate that the performance gains are fundamentally driven by the eMoT framework's reasoning control rather than sheer model size.

2606.02049 2026-06-02 cs.AI 版本更新

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

面向建筑最优能量管理的可解释数据驱动深度强化学习方法

Hallah Shahid Butt, Qiong Huang, Gökhan Demirel, Kevin Förderer, Erfan Tajalli-Ardekani, Simnon Waczowicz, Luigi Spatafora, Veit Hagenmeyer, Benjamin Schäfer

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 提出可解释深度强化学习框架,结合真实数据训练策略内与离策略算法,通过事后解释技术揭示电池管理决策过程,实现降本与透明化。

详情
AI中文摘要

可再生能源在电力系统中的日益普及,特别是在配备光伏板和储能系统的建筑中,引入了能源系统的显著复杂性。波动的发电量、变化的电价以及增加的实体(如光伏系统和热泵)增加了复杂性,使系统更难运行。这导致了对额外控制和优化路径的需求,包括基于数据的控制,如强化学习。虽然深度强化学习已成为在动态且日益复杂的环境中优化建筑运营的有前景的解决方案,但其黑箱特性阻碍了用户信任和实际应用。本文提出了一种应用于住宅建筑能量管理的可解释深度强化学习框架。我们在合成数据以及来自KIT Living Lab Energy Campus的真实数据上展示了其使用。我们在扩展的状态空间上训练并比较了策略内和离策略的DRL智能体,该状态空间包含实时测量(需求、光伏发电、电池功率、荷电状态)、外部信号(动态电价、本地天气数据)、日历和假日指标以及需求和价格预测。我们的实验结果表明,策略内算法,特别是优势演员-评论家和近端策略优化,在累积奖励和策略稳定性方面优于离策略方法。为了解释这些模型,我们采用事后解释技术来阐述学到的控制策略。我们的发现表明,XRL框架不仅通过最优电池管理降低了电力成本,还提供了对智能体决策过程的透明、可操作的见解。

英文摘要

The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and increased entities, e.g., PV systems, and heat pumps, have increased the complexity and made the system harder to operate. This leads to the demand for additional control and optimization routes including data-based controls, such as reinforcement learning. While deep reinforcement learning (DRL) has emerged as a promising solution to optimize building operations in dynamic and ever more complex environments, its black-box nature impedes user trust and practical adoption. This paper presents a framework for explainable deep reinforcement learning (XRL) applied to energy management in residential buildings. We demonstrate its usage on both synthetic data but also on real-world data from the Living Lab Energy Campus (LLEC) at KIT. We train and compare both on-policy and off-policy DRL agents on an expanded state space that incorporates real-time measurements (demand, PV generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendrical and holiday indicators, and forecasts for demand and price. Our experimental results indicate that on-policy algorithms, particularly Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), outperform off-policy methods in terms of cumulative rewards and policy stability. To explain these models, we employ post-hoc interpretation techniques to elaborate the learned control policies. Our findings demonstrate that the XRL framework not only reduces electricity costs through optimal battery management, but also provides transparent, actionable insights into the agent's decision-making process.

2606.02048 2026-06-02 cs.AI cs.CV physics.bio-ph 版本更新

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

动态酪蛋白凝胶化显微图像拓扑纹理分析及其与流变学性质的关系

Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen, Jon Sporring

发表机构 * Department of Computer Science, University of Copenhagen, Denmark(哥本哈根大学计算机科学系) Department of Green Technology, University of Southern Denmark, Denmark(南丹麦大学绿色技术系) Department of Food Science, University of Copenhagen, Denmark(哥本哈根大学食品科学系)

AI总结 提出结合拓扑数据分析、差分盒计数、多重分形分割和局部二值模式的工具箱,分析STED显微图像中酪蛋白凝胶化的拓扑与纹理特征,揭示与流变学性质相关的微观结构转变。

详情
AI中文摘要

我们提出了一种新颖的计算工具箱,集成了拓扑数据分析(TDA)、差分盒计数(DBC)、多重分形分割(MFP)和局部二值模式(LBP),应用于由葡萄糖酸-δ-内酯(GDL)在30°C和40°C以及两种GDL浓度(1.8%和3.5% w/v)下诱导的酪蛋白酸钠凝胶化的时间序列超分辨率STED显微图像。TDA通过最大Betti-1曲线追踪拓扑环,即反映蛋白质网络互连性的封闭环状结构,揭示了分散聚集体的滞后阶段、与网络渗透和流变学观察到的溶胶-凝胶转变相一致的急剧衰减,以及对应于网络重排的凝胶后增加。这些拓扑转变通过DBC和MFP得到证实,因为这些方法能够解析结构复杂性和空间异质性的变化。该工具箱在实验应用前在模拟分形图像上进行了验证。总之,这些描述符对体相流变学作为平均体相力学响应捕获的细微微观结构转变具有敏感性。这种集成方法为表征食品和材料科学中具有演化微观结构动力学的复杂微观结构提供了稳健的定量工具。代码可在https://github.com/Zahratabatabaei/Delifood_CV_paper.git获取。

英文摘要

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at https://github.com/Zahratabatabaei/Delifood_CV_paper.git

2606.02035 2026-06-02 cs.AI cs.LG 版本更新

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

RL-ACRGNet:基于强化学习的胸部放射学报告生成网络

Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya

发表机构 * Human-AI Interaction (HAIx) Lab, Indian Institute of Technology Gandhinagar(人类-人工智能交互实验室,印度理工学院冈丁加尔) Department of Computer Science and Engineering, Madhav Institute of Technology and Science Deemed University (MITS-DU)(计算机科学与工程系,马达夫技术与科学 deemed 大学(MITS-DU)) Multimedia and Information Security Research Group, Department of Computer Science and Engineering, ABV-Indian Institute of Information Technology and Management(多媒体与信息安全研究组,计算机科学与工程系,ABV-印度信息科技与管理学院)

AI总结 提出RL-ACRGNet,一种结合预训练DenseNet编码器与多级LSTM解码器的离策略强化学习框架,通过度量奖励机制优化视觉语义嵌入,在IU-Xray和MIMIC-CXR数据集上超越基线,生成高质量临床报告。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

医学影像解读是现代临床诊断的基石,然而手动生成放射学报告既耗时又容易出现解读不一致。在医学AI领域,通过深度学习自动化这些描述有望简化临床工作流程并标准化诊断输出。然而,由于在捕获细粒度视觉特征和确保临床连贯性方面的局限性,准确的疾病检测和精确的报告生成仍然是重大挑战。为了解决这些问题,我们提出了RL-ACRGNet,一种改进的编码器-解码器模型,它将预训练的DenseNet编码器与多级LSTM解码器集成在离策略强化学习框架中。通过使用双网络方法,基于度量奖励机制细化视觉语义嵌入,我们证明RL-ACRGNet在IU-Xray数据集上持续优于最先进的基线,在BLEU-4(0.47%)、METEOR(0.17%)和ROUGE-L(0.518)上取得了定量改进。此外,在大规模MIMIC-CXR数据集上的综合评估证实了该模型的稳健泛化能力及其生成高质量、临床相关报告的能力。

英文摘要

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR data set confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports

2606.02022 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

排名 vs. 分配:多视角目标关联中的度量不匹配

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

发表机构 * Tevian Moscow(莫斯科Tevian) Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学)

AI总结 本文揭示了多视角目标关联中常用的排名度量(如AP、FPR-95)与分配目标之间的根本性不匹配,并提出了基于Sinkhorn归一化的后处理方法以缓解该问题。

详情
AI中文摘要

多视角目标关联是一个重要的计算机视觉问题,是许多多相机感知任务的基础。虽然该任务自然被表述为受约束的一对一匹配问题,但最近的工作严重依赖成对排名度量(如AP和FPR-95)进行模型评估。我们强调了这些度量与实际分配目标之间的根本性不匹配。理论上,我们表明即使分配已经正确,AP和FPR-95也可能不完美,而基于Sinkhorn的归一化可以使它们完美。相反,最优的成对排名仍然可能导致错误的分配。我们通过使用基于Sinkhorn的归一化作为受控的后处理压力测试,在实践中验证了这种不匹配。我们表明,仅优化几个后处理参数就能显著提升AP和FPR-95,而分配级别的度量(如ACC和IPAA)却没有相应改进。

英文摘要

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.

2606.02011 2026-06-02 cs.AI cs.LG 版本更新

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

推理模型中的极端低位推理:失败模式与针对性恢复

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov

发表机构 * University of Washington(华盛顿大学)

AI总结 针对大型推理模型在2位量化推理中因生成不稳定导致总token数膨胀而无法实现端到端加速的问题,提出轻量级FP16规划和循环救援两种控制方法,显著恢复模型精度并保持实际速度。

详情
AI中文摘要

大型推理模型(LRM)依赖长推理轨迹,导致推理成本高昂。虽然低位量化降低了每token解码成本,但我们表明,激进的2位推理可能无法实现端到端加速,因为生成过程中的不稳定性会膨胀总token数。2位量化不仅降低答案准确性,还常常产生更长的轨迹,包含重复循环、预算耗尽、延迟承诺和未闭合的推理段。我们分析了Qwen3推理模型在数学和常识基准上的完整推理轨迹,并表明准确率下降与这些过程级失败密切相关。为解决这些问题,我们引入了两种轻量级控制:FP16规划,为2位模型提供简短的高精度轮廓;以及循环救援,检测重复轨迹并要么承诺早期答案,要么回退到FP16。在MATH-500上,循环救援将Qwen3-8B准确率从17.2%提升至74.2%,而规划加循环救援将Qwen3-32B准确率从65.0%提升至87.2%。总体而言,我们的结果表明,当极端低位推理的失败被视为可控生成病理时,它变得可行:通过轻量级检测和选择性FP16支持,2位推理可以在恢复准确率的同时保持真实的端到端速度。我们的代码可在 https://github.com/brain-lab-research/quantized-reasoning 获取。

英文摘要

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.

2606.02010 2026-06-02 cs.CL cs.AI 版本更新

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

PlanarBench: 通过平面图绘制评估LLM空间推理能力

Oleksandr Nikitin

发表机构 * tvori.info

AI总结 提出PlanarBench基准,通过让LLM根据边列表以ASCII艺术绘制平面图来评估其空间推理能力,发现边数是主要难度预测因子。

Comments 12 pages, 4 figures, https://github.com/wizzard0/planar-bench-as1073

详情
AI中文摘要

PlanarBench测试LLM是否能够仅根据边列表以ASCII艺术形式绘制平面图——这是一项抵抗记忆的空间推理任务,因为边的顺序、边的方向和节点标签都是可置换的。我们在199个最简单的非同构连通平面图(2-7个顶点)上评估了91个模型。边数是主要的难度预测因子(r = -0.85)——这一发现未在之前的LLM图基准测试中报告,这些基准仅使用节点数作为难度轴。

英文摘要

PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate 91 models on the 199 simplest non-isomorphic connected planar graphs (2 - 7 vertices). Edge count is the dominant difficulty predictor ($r = -0.85$) -- a finding not reported in prior LLM graph benchmarks, which use only node count as the difficulty axis.

2606.02000 2026-06-02 cs.CV cs.AI eess.IV 版本更新

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

迈向3D感知视频扩散模型:基于网格标记化的无渲染人体运动控制

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

发表机构 * DAMO Academy, Alibaba Group(阿里巴巴集团大模型实验室) Hupan Lab(虎盘实验室) Zhejiang University(浙江大学) INSAIT

AI总结 提出一种无渲染框架,通过压缩的3D人体网格标记直接条件化视频生成,实现精确的人体运动控制,减少2D引导伪影并提升3D结构建模能力。

Comments Project page: https://jingyunliang.github.io/MeshToken/

详情
AI中文摘要

扩散模型在视频生成方面取得了显著成功。然而,这类模型是否真正感知视觉观察背后的3D结构,而不仅仅是生成合理的2D投影,仍是一个开放问题。本文通过人体运动控制这一任务来探究该问题,该任务需要对人体3D几何、运动、相机视角和场景上下文进行精确建模。与依赖渲染的2D运动引导视频的先前方法不同,我们提出了一种无渲染框架,直接基于压缩的3D人体网格标记条件化视频生成。该表示保留了完整的3D几何信息,同时实现了统一的基于标记的生成流程,在DiT架构中联合处理视频标记和运动标记。这种设计要求模型在视频生成过程中联合推理外观、3D结构和相机视角。实验结果表明,该方法在人体运动控制基准上表现强劲,同时减少了由视角依赖的2D引导和编辑过程中轨迹-姿态不匹配引起的伪影。这些发现表明,配备网格标记化的视频扩散模型能够更好地捕捉复杂的3D人体结构及其与周围环境的交互。

英文摘要

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.

2606.01999 2026-06-02 cs.LG cs.AI 版本更新

Why Do Time Series Models Need Long Context Windows?

为什么时间序列模型需要长上下文窗口?

Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

发表机构 * Università della Svizzera Italiana(瑞士联邦理工学院) EPFL(瑞士联邦理工学院) Politecnico di Milano(米兰理工学院)

AI总结 本文从生成过程识别和条件预测两个目标出发,证明长上下文窗口通过降低生成过程的不确定性来提升预测性能,并表明即使对于记忆长度为P的过程,输入窗口必须严格大于P才能达到最小误差。

详情
AI中文摘要

现代用于预测时间序列组的深度学习模型依赖于越来越长的观测窗口。然而,增加窗口大小的好处通常被简单地归因于捕捉长程依赖,而关于全局预测模型如何利用输入观测的更广泛讨论一直有限。在本文中,我们表明预测时间序列组涉及两个目标:(i) 生成过程识别(GPI),即推断生成输入序列的具体过程,以及 (ii) 条件预测(CF),即根据输入观测预测未来值。从这个角度来看,最优预测可以解释为对所有可能数据生成过程的平均,并按输入窗口给定的似然加权。这为长上下文窗口的好处提供了另一种解释:它们降低了运行过程中输入时间序列由哪个具体过程生成的不确定性。我们证明,即使对于记忆长度为 $P$ 的过程,严格大于 $P$ 的输入窗口大小对于达到最小可实现误差是必要的。最后,我们展示了如何将 GPI 和 CF 解耦,以在不牺牲准确性的情况下提高计算可扩展性。在合成和真实数据上的实验验证了我们的见解及其对设计预测架构的相关性。

英文摘要

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.

2606.01993 2026-06-02 cs.CL cs.AI cs.LG 版本更新

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

MMG2Skill: 智能体能否从野外指南中提炼出自我进化的技能?

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Kuaishou Technology(快手科技)

AI总结 提出MMG2Skill框架,将多模态异构的野外指南编译为可编辑技能,通过轨迹级根因反馈持续改进,在GUI控制、开放游戏和策略卡牌任务中显著提升VLM智能体性能。

Comments 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

详情
AI中文摘要

网络上丰富的程序性知识对于帮助智能体解决长程任务具有巨大潜力。然而,这些知识通常是多模态、异构、有噪声的,并且隐含地假设人类执行者,使得它们难以直接用作智能体所需的技能。为了弥合人类导向指南与智能体可执行技能之间的差距,我们将此问题形式化为指南到技能学习:将野外指南转换为可执行技能,并从智能体可观察的轨迹中持续改进它们。为了评估现有智能体在此任务上的能力,我们引入了MMG2Skill-Bench,这是针对该问题的首个基准测试。我们进一步提出了MMG2Skill,一个闭环框架,它将指南编译为可编辑技能,在执行过程中将固定的视觉语言模型(VLM)智能体条件化于这些技能,并从轨迹级根因反馈中修正技能,而不使用基准测试分数。在GUI控制、开放式游戏和策略卡牌游戏中,使用六个VLM骨干网络,MMG2Skill在每个模型-领域设置中始终优于普通基线智能体,在骨干网络上实现了宏观平均增益+12.8到+25.3个百分点。消融研究表明,直接用原始指南提示智能体会降低性能,而结构化技能构建和轨迹驱动修订对于观察到的改进都是必要的。在成功可推断的任务中,当成功信号适当校准时,基于分析器的提前停止进一步防止了后期性能退化,并节省了25%-53%的尝试次数。

英文摘要

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

2606.01992 2026-06-02 cs.CV cs.AI cs.LG 版本更新

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

文本引导异常检测的结构化基准:当语言停止条件化决策时

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

发表机构 * Politecnico di Milano, AIRLab(米兰理工学院,AIRLab) S&H – Software & Hardware(S&H – 软件与硬件)

AI总结 提出结构化基准TGAD,通过三个场景逐步增加语言功能角色,评估多模态异常检测系统的文本引导能力,发现当前系统仅表面受语言条件化,标准基准高估了其能力。

详情
AI中文摘要

工业异常检测历来是单模态任务。最近的多模态视觉-语言模型产生了接受文本输入和图像的系统,并被呈现为支持文本引导的零样本和少样本检测。然而,这些方法使用继承自单模态基准的协议进行评估,这些协议保持文本条件不变,因此无法衡量语言是否条件化决策;报告的性能提升是否反映文本引导或强大的预训练视觉特征仍是开放问题。我们引入文本引导异常检测(TGAD),这是一个结构化基准,通过三个场景逐步增加语言的功能角色:MVTec AD上的受控提示敏感性设置;MVTec AD的组件标记扩展,要求模型将其评估限制在指定部件;以及新的组装面板数据集(APD),这是一个需要缺陷类型和组件位置知识的现实工业场景。我们评估每个范式的代表性模型:生成式大视觉-语言、无训练判别式和嵌入自适应判别式。在所有三个模型中,文本接口仅表面条件化决策:除非移除对象名词,否则提示内容被吸收(生成模型的I-AUROC从97.4降至82.6);一旦指令部件外的缺陷被视为正常,组件级指令不约束决策(从90.3降至66.3);当两者在APD上结合时,图像级判别崩溃至MVTec水平以下,一种情况低于随机水平(71.2、50.5、31.5)。这些结果表明,标准基准夸大了当前多模态异常检测系统的文本引导能力,并且此类协议是能够通过语言可靠控制以用于工业部署的模型的先决条件。

英文摘要

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.

2606.01991 2026-06-02 cs.AI cs.CL cs.CY 版本更新

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP:基于环境接地前瞻推理的LLM智能体防御主动功率调节

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

发表机构 * Beijing Institute of Technology(北京理工大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院)

AI总结 针对LLM智能体因动作空间扩大而面临功率寻求风险,提出SafeMCP服务器端防御插件,通过内部世界模型进行前瞻推理,实现主动工具过滤和即时干预两级防御,在保持智能体效用的同时有效降低风险。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

详情
AI中文摘要

随着大语言模型(LLM)智能体越来越多地利用模型上下文协议(MCP)在复杂环境中运行,其动作空间的扩展赋予了智能体不安全的能力,并凸显了功率寻求的风险。虽然广阔的动作空间和更大的环境影响对于任务完成至关重要,但它们也创造了一个脆弱的风险表面,其中微小的错误或幻觉会被放大为灾难性故障。为此,我们提出了SafeMCP,一种{服务器端}防御插件,通过关于未来安全风险的预测推理来约束工具获取。SafeMCP利用内部世界模型进行前瞻推理,实现两级防御:主动工具过滤以限制危险功率扩展,以及即时干预作为故障安全机制。为了训练SafeMCP,我们引入了一个三阶段流程,包括环境动态接地、安全策略初始化和具有双重可验证奖励的强化学习(RL)。在PowerSeeking Bench、ToolEmu和AgentHarm上的实验表明,SafeMCP实现了安全平衡,在有效缓解风险的同时保持了智能体的效用。

英文摘要

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a {server-side} defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.

2606.01982 2026-06-02 cs.AI 版本更新

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

一种基于NLP的课程-劳动力市场对齐框架:模式约束的LLM抽取、ESCO锚定的语义匹配和多维差距量化

Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki, Khaled Shuaib

AI总结 提出一个四阶段NLP框架,通过模式约束的LLM抽取、ESCO语义匹配、仲裁协议和验证机制,实现课程与劳动力市场的对齐,并量化多维供需差距。

Comments 53 pages, 9 figures, 4 tables

详情
AI中文摘要

从多样化的教育和劳动力市场语料库中进行模式约束的信息抽取仍然是自然语言处理中的一个开放挑战,因为现有流程主要依赖于无法恢复隐含能力的词汇表面方法,缺乏共享分类法的基础,并且没有提供抽取可靠性或文档级完整性的正式度量。为了解决这些限制,本文提出了一个四阶段NLP框架,结合了(i) 对两个前沿LLM集成模型进行模式约束提示,针对JSON Schema强制实施的七槽能力形式;(ii) 使用Sentence-BERT (SBERT)将抽取的记录与十一个领域的ESCO v1.2.1受控词汇表对齐;(iii) 一个解决模型间分歧的两级裁决协议;(iv) 一个结合每槽Cohen's kappa、模式符合性和文档级完整性审计的验证机制。该框架在高等教育质量保证的关键应用中实例化,即阿联酋大学ABET认证的计算机科学学士学位课程的课程-劳动力市场对齐。该流程从2025-2026学年的85门课程学习计划中抽取400条能力记录,并在从计算核心到概率加权学生轨迹的五范围分析下,与30个职位发布(483个要求条款)以0.50的SBERT余弦阈值对齐。抽取器在技能槽上达到0.79的Cohen's kappa,模式符合性100%,文档级完整性100%。对齐揭示了可解释的供需差距:通用和横向技能差距25.0%,算法与计算理论差距13.8%,软件工程与项目管理差距12.2%,而人工智能与数据科学差距接近零的1.8%,尽管供应覆盖率为38.6%。

英文摘要

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework that combines (i) schema-constrained prompting of a two-model frontier-LLM ensemble against a JSON Schema-enforced seven-slot competency formalism, (ii) Sentence-BERT (SBERT) alignment of the extracted records against an eleven-domain ESCO v1.2.1 controlled vocabulary, (iii) a two-tier adjudication protocol that resolves inter-model disagreements, and (iv) a verification mechanism that combines per-slot Cohen's kappa, schema conformance, and document-level completeness audits. The framework is instantiated for a critical application in higher-education quality assurance, namely curriculum-labor market alignment for the ABET-accredited BSc Computer Science program at the United Arab Emirates University. The pipeline extracts 400 competency records from the 85-course 2025-2026 study plan and aligns them, under a five-scope analysis ranging from the computing core to a probability-weighted student trajectory, with 30 job postings (483 requirement clauses) at an SBERT cosine threshold of 0.50. The extractor achieves Cohen's kappa of 0.79 on the skill slot, with 100% schema conformance and 100% document-level completeness. The alignment surfaces interpretable supply-demand gaps of 25.0% in general and transversal skills, 13.8% in algorithms and computational theory, and 12.2% in software engineering and project management, with a near-zero 1.8% gap in artificial intelligence and data science despite 38.6% supply coverage.

2606.01975 2026-06-02 cs.AI cs.SE 版本更新

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

基于LLM的算法开发:以张量网络收缩顺序优化中LLM使用为例

Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges

发表机构 * German Aerospace Center (DLR), Institute of Software Technology, department High-Performance Computing(德国航空航天中心(DLR)软件技术研究所高性能计算部门)

AI总结 通过OpenEvolve对张量网络收缩顺序优化的案例研究,探讨了基于LLM的算法开发,重点分析了LLM选择、评估指标和测试实例等设计因素,强调了验证引导的进化编码代理的潜力以及人类科学家在评估、验证和解释方面的重要性与挑战。

Comments Submitted to the proceedings of the deRSE26 conference

详情
AI中文摘要

我们通过一个关于张量网络收缩顺序优化的案例研究,使用OpenEvolve来考虑基于LLM的算法开发。我们特别关注LLM的选择以及设计选择,如评估指标和测试实例。我们的结果既突出了验证引导的进化编码代理在算法开发/改进方面的前景,也强调了人类科学家在评估、验证和解释方面的持续重要性及相应挑战。

英文摘要

We consider LLM-based algorithm development through a case study on contractionorder optimisation for tensor networks with OpenEvolve. We pay particular attention to the choice of the LLM as well as design choices such as evaluation metric and test instances. Our results highlight both the promise of verifier-guided evolutionary coding agents for algorithm development/improvement and the continuing importance of evaluation, validation, and interpretation -- and corresponding challenges -- by the human scientist.

2605.02640 2026-06-02 cs.AI 版本更新

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

可信人工智能面临不变性冲突,因果性是解决方案

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 本文通过将可信AI目标重新解释为数据生成过程变化下的不相容不变性要求,论证因果性是理解和平衡性能与多个可信目标之间权衡的必要框架。

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
AI中文摘要

随着人工智能(包括机器学习模型和基础模型)在高风险领域的部署日益增多,确保其可信度已成为一个核心挑战。然而,可信人工智能的核心目标,如公平性、鲁棒性、隐私性和可解释性,很难同时实现,尤其是在保持效用的同时。这篇立场论文认为,因果性对于理解和平衡性能与可信人工智能多个目标之间的权衡是必要的。我们将可信人工智能的权衡重新解释为数据生成过程不同变化下的不相容不变性要求,从而为我们的论点奠定基础。然后,我们通过文献中的案例研究和风格化的合成数据模拟来说明这一论点,表明因果性提供了一个统一的框架,用于理解可信人工智能中的权衡如何产生,以及如何通过选择性不变性来缓解或解决这些权衡。这一视角既适用于经典机器学习模型,也适用于大规模基础模型。最后,我们概述了利用因果性构建既可信又高性能的人工智能所面临的开放挑战和机遇。

英文摘要

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), are increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs in performance and multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate this argument through case-study analyses from the literature and a stylized synthetic-data simulation, showing that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Finally, we outline open challenges and opportunities for using causality to build both trustworthy and high-performing AI.

2605.02122 2026-06-02 cs.LG cs.AI 版本更新

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

STABLEVAL: 面向AI系统的分歧感知与稳定评估

Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对多数投票法在标注者分歧下导致排名不稳定的问题,提出STABLEVAL框架,通过建模潜在正确性和标注者混淆模式,实现稳定且不确定性感知的系统评估。

详情
AI中文摘要

人类评估仍然是评估现代AI系统的主要标准,然而标注者的分歧、偏见和变异性使得在标准多数投票聚合下系统排名变得脆弱。多数投票忽略了标注者可靠性和项目级别的模糊性,往往在标注者子集之间产生不稳定的比较。我们引入了STABLEVAL,一个分歧感知的评估框架,该框架对潜在项目正确性和标注者特定的混淆模式进行建模,以产生后验期望项目得分和校准的智能体级别分数。与Dawid-Skene等标签去噪方法不同,STABLEVAL明确设计用于稳定和不确定性感知的系统评估,而不是硬标签恢复。我们将排名稳定性形式化为首要评估目标,并分析聚合方法如何保留或扭曲底层标注者行为。在受控的合成实验和多个真实世界人工标注基准上,多数投票在标注者异质性和对抗性噪声下表现出增加的得分误差和排名不稳定性,而STABLEVAL产生了更稳定和统计上更合理的系统排名。这些结果表明,对分歧进行建模对于稳健和可复现的AI评估至关重要。

英文摘要

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

2606.01948 2026-06-02 cs.IR cs.AI 版本更新

Rank-Constrained Deep Matrix Completion for Group Recommendation

面向群组推荐的秩约束深度矩阵补全

Mubaraka Sani Ibrahim, Lehel Csató, Isah Charles Saidu

发表机构 * Department of Computer Science, African University of Science and Technology(非洲科学与技术大学计算机科学系) Faculty of Mathematics and Computer Science, Babes-Bolyai University(巴纳特-博雅大学数学与计算机科学学院) Department of Computer Science, Baze University(贝泽大学计算机科学系)

AI总结 提出Group RC-DMC框架,通过Set-Transformer聚合器整合群组级表示学习,结合低秩结构和注意力非线性建模,实现个体与群组级别的准确预测。

详情
AI中文摘要

群体活动的日益普及增加了根据用户个体偏好向用户群组提供推荐的方法需求。许多现有的群组推荐系统依赖于聚合个体用户偏好,但通常难以处理现实场景中常见的高维且高度稀疏的评分数据。我们提出了群组秩约束深度矩阵补全(Group RC-DMC),这是一个新颖的框架,通过Set-Transformer聚合器整合群组级表示学习,扩展了RC-DMC,联合利用了低秩结构和基于注意力的非线性建模。与大多数现有群组推荐系统不同,Group RC-DMC在一个统一框架中融合了显式低秩正则化、线性编码器-解码器架构和基于注意力的非线性群组建模,在个体和群组级别都产生准确的预测。Group RC-DMC通过低秩矩阵补全解决数据稀疏性,仅从观测评分计算每个用户的潜在表示,并基于周期性奇异值阈值化使用核范数近端步骤对潜在空间施加秩约束。解码器被参数化为低秩分解,从而实现高效推理。在MovieLens和Goodbooks数据集上的实验结果表明,Group RC-DMC实现了优越的重建精度(以更低的群组RMSE衡量),同时在计算效率上保持竞争力,并且在群组级别的性能(精确率、召回率和F1分数)上与加权前分解(WBF)和加权后分解(AF)基线相当。结果突显了模型恢复用户-物品交互的底层低秩结构的能力,并为小、中、大用户群组提供稳健的群组推荐。

英文摘要

The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their individual preferences. Many existing group recommender systems rely on aggregating individual user preferences, but they often struggle with high-dimensional and highly sparse rating data commonly found in real-world scenarios. We propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a novel framework that extends RC-DMC by integrating group-level representation learning via a Set-Transformer aggregator, jointly leveraging low-rank structure and attention-based nonlinear modeling. Unlike most existing group recommender systems, Group RC-DMC unifies explicit low-rank regularization, linear encoder-decoder architectures, and attention-based nonlinear group modeling within a single framework, yielding accurate predictions at both the individual and group levels. Group RC-DMC addresses data sparsity through low-rank matrix completion, computing per-user latent representations from observed ratings only, and enforcing a rank constraint on the latent space using a nuclear-norm proximal step based on periodic singular value thresholding. The decoder is parametrized as a low-rank factorization, enabling efficient inference. Experimental results on the MovieLens and Goodbooks datasets demonstrate that Group RC-DMC achieves superior reconstruction accuracy, measured by lower group RMSE, while remaining computationally efficient and competitive in group-level performance in terms of precision, recall, and F1 score compared with weighted-before-factorization (WBF) and after-factorization (AF) baselines. The results highlight the model's ability to recover the underlying low-rank structure of user-item interactions and provide robust group recommendations across small, medium, and large user groups.

2606.01947 2026-06-02 cs.CV cs.AI 版本更新

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

大型预训练模型在实例分割任务中的参数高效微调

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 本研究针对实例分割任务,探索了适配器和低秩适应(LoRA)两种参数高效微调方法,在仅微调约1-6%参数的情况下取得竞争性能,并发现每个Transformer块使用2-3个适配器可达到性能与效率的最佳平衡。

Comments Published by the Machine Learning and Knowledge Extraction Journal

详情
Journal ref
Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807
AI中文摘要

近年来,随着大型预训练模型的兴起,人工智能的研究和应用发生了转变,这些模型在众多任务中取得了最先进的结果。然而,参数的大量增加引入了对参数高效训练策略的需求。尽管取得了显著进展,但针对基于Transformer的模型在实例分割任务中的参数高效微调(PEFT)方法的研究仍然有限。为填补这一空白,本研究调查了PEFT方法的有效性,特别是适配器和低秩适应(LoRA),并将其应用于两个模型和四个基准数据集。通过集成顺序排列的适配器模块并将LoRA应用于可变形注意力(本文首次探索),在仅微调约1-6%模型参数的情况下取得了竞争性能,相比传统微调所需的40-55%有显著改进。关键发现表明,每个Transformer块使用2-3个适配器可实现性能与效率的最佳平衡。此外,LoRA在应用于可变形注意力时表现出强大的参数效率,并在某些情况下超越了适配器配置。这些结果表明,PEFT技术的影响因数据集复杂性和模型架构而异,强调了上下文特定调优的重要性。总体而言,这项工作展示了PEFT在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

英文摘要

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention--explored here for the first time--achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA, exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.

2606.01912 2026-06-02 cs.AI 版本更新

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

SMH-Bench:用于智能家居中环境基础推理与行动的LLM代理基准测试

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

发表机构 * Midea Group(美的集团) Beijing University of Posts and Telecommunications(北京邮电大学) Donghua University(东华大学) The University of Sydney(悉尼大学) Peking University(北京大学)

AI总结 提出SMH-Bench基准,基于可执行模拟器HomeEnv,通过1100个任务评估LLM在智能家居中的推理与行动能力,发现前沿模型在自动化调度、模糊处理和个性化推理方面存在不足。

详情
AI中文摘要

智能家居正朝着复杂的、依赖于状态的生活环境发展,需要大型语言模型(LLM)对用户意图、偏好和多设备交互进行推理。然而,现有的智能家居基准通常侧重于静态的指令到API映射或有限的模拟,未能评估LLM是否能够在现实家庭场景中可靠地进行推理、交互和行动。为了解决这些局限性,我们引入了SMH-Bench,这是一个用于评估智能家居环境中LLM的全面基准。基于可执行且可验证的智能家居模拟器HomeEnv,SMH-Bench包含1100个高质量任务,涵盖7个类别和22个细粒度子类别。它进一步将任务分层为简单、中等和复杂家庭,范围从小型公寓到拥有135个设备的密集多房间环境。实验表明,尽管前沿LLM在显式控制和查询任务上表现强劲,但在自动化任务调度、模糊处理和个性化推理方面仍存在显著弱点,尤其是在家庭复杂性增加时。我们希望SMH-Bench能够促进更可靠、上下文感知且实际可部署的智能家居代理的发展。

英文摘要

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

2606.01909 2026-06-02 cs.SD cs.AI eess.AS 版本更新

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

Echo: 一种用于共享潜在空间中说话人日志和语音识别的联合嵌入预测架构

Louis Mouchon

发表机构 * Louis Mouchon(洛伊斯·莫尚)

AI总结 提出Echo系统,基于单个25M参数ViT编码器,通过JEPA预训练和分阶段特化,在512维潜在空间中联合实现说话人日志、语音分离和语音识别,无需部署时微调。

Comments 18 pages, 17 tables, 1 figure. Proof-of-concept, independent research

详情
AI中文摘要

我们提出Echo,一个围绕单个25M参数ViT编码器构建的概念验证音频系统。该编码器使用JEPA目标进行预训练,然后分阶段特化,以在同一个512维潜在空间中承载说话人身份、语音内容和动态源路由,部署时无需针对每个任务进行微调。轻量级头部处理说话人日志(ArcFace + VBx)和动态源分离(空目标K集预测)。在未知K的合成VoxCeleb2混合数据上,标准堆栈达到15.00%的盲DER、97.80%的PIT分离准确率,潜在SI-SDR提升+9.52 dB,以及在留出k-NN探针上说话人/内容因子化差距为+53.50分。Echo的意义不在于任何单一任务上的新SOTA,而在于三个任务在一个编码器上以这种规模共同共存。我们逐阶段记录了设计,报告了死胡同,并识别了通过VQ瓶颈进行端到端ASR的结构性障碍,该瓶颈仍然限制了PoC。

英文摘要

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.

2606.01906 2026-06-02 cs.AI 版本更新

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

贝叶斯谱情感转移发现:来自多标注者分歧

Keito Inoshita, Takato Ueno

发表机构 * Keio University(庆应大学) National Institute of Advanced Industrial Science and Technology(国家工业科学与技术研究院)

AI总结 提出贝叶斯谱情感转移发现(BSETD)两阶段框架,从多标注者软标签中挖掘情感转移结构,并通过谱分解分离惯性与传染成分,在EmotionLines数据集上验证了与心理学理论的一致性。

详情
AI中文摘要

情感通过对话的动态过程演变,理解其转移结构对于从心理健康筛查到对话系统等应用至关重要。然而,现有研究通常通过多数投票将多评分者判断压缩为单个硬标签,丢弃了理解轮次间转移所需的不确定性信号。本文提出贝叶斯谱情感转移发现(BSETD),一个从多评分者软标签中发现情感转移结构的两阶段框架。第一阶段,通过软标签的外积构建层次狄利克雷-多项后验,为K×K转移矩阵的每个单元配备可信区间和Benjamini-Hochberg(BH)错误发现率(FDR)控制的显著性。第二阶段,对称图拉普拉斯矩阵经谱分解,分离出低频(惯性)和高频(传染)成分。在EmotionLines上,BSETD同时恢复了两个不同情感空间的标志:Plutchik相邻的转移——厌恶到愤怒(log2提升+0.94)和愤怒到厌恶(+0.86)被过度表示,而Russell效价反转的转移——快乐到愤怒(-0.90)和愤怒到快乐(-0.89)被欠表示。五源跨语料验证得到英语内成对皮尔逊相关0.91-0.98,与中文M3ED对比0.79-0.85,以及同一话语集上人类硬标签与LLM虚拟软标签之间0.979的相关性,表明保留标注者不确定性的流程将情感动态的计算研究与既有的心理学理论联系起来。

英文摘要

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the K x K transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations in 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.

2606.01901 2026-06-02 cs.CV cs.AI cs.CL 版本更新

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

图像重建游戏:通过迭代多模态对话建立共同基础

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

发表机构 * Computational Linguistics, Department of Linguistics University of Potsdam(波恩大学语言学系计算语言学部) German Research Center for Artificial Intelligence (DFKI), Berlin(德国人工智能研究中心(DFKI)柏林)

AI总结 提出图像重建游戏基准,通过多轮迭代中视觉语言模型向图像生成器发出纠正指令,使累积的共同基础直接可视化为重建图像,发现描述器是重建质量的主导因素,而生成器决定迭代改进的效果。

详情
AI中文摘要

我们引入了图像重建游戏,这是一个全自动基准测试,其中视觉语言模型在多轮迭代中向图像生成器发出纠正指令,使得累积的共同基础直接可视化为渲染图像。通过对七个图像类别中的两个描述器模型与两个生成器模型进行交叉基准测试,我们发现描述器是重建质量的主导因素,而生成器决定迭代改进是否有益。数学和几何图像构成了最大的挑战。描述器的令牌预算强烈影响收敛性:较短的预算产生更稀疏的初始渲染,有更多可见改进的空间,而较长的预算提高了绝对质量,但留下的修复空间较少。更强的描述器使用更丰富的纠正词汇,涵盖空间、数值和结构类别,而较弱的描述器则集中于表面属性,并且往往在几轮后停止。人工验证表明,最佳自动评判器与人类偏好之间仅达到轻微到中等的一致性,并且自动评分需要人工重新校准才能可靠使用。

英文摘要

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01899 2026-06-02 eess.SP cs.AI 版本更新

RA-LWLM: Retrieval-Augmented In-Context Localization with Wireless Foundation Models

RA-LWLM:基于检索增强的上下文无线定位基础模型

Guangjin Pan, Hui Chen, Hei Victor Cheng, Henk Wymeersch

发表机构 * Department of Electrical Engineering, Chalmers University of Technology(查尔姆斯理工大学电子工程系) Department of Electrical and Computer Engineering, Aarhus University(阿鲁斯大学电子与计算机工程系)

AI总结 提出RA-LWLM框架,通过将场景特定信息外化到指纹数据库,实现无需训练的跨场景无线定位,利用冻结的无线基础模型编码器、检索模块和基于Transformer的上下文学习模块预测用户位置。

Comments 13 pages, 9 figures. This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

无线定位是第六代(6G)网络的基本能力。传统的基于模型的方法需要对传播环境进行精确建模,在复杂的多径和非视距场景中性能下降,而基于学习的方法将模型参数紧密耦合到训练场景中,每当基站(BS)配置或传播环境变化时需要昂贵的重新训练。在本文中,我们提出RA-LWLM,一种检索增强的上下文定位框架,通过将场景特定信息外化到每个场景的指纹数据库(而非编码在模型权重中)来实现无需训练的跨场景适应。该框架由三个组件组成:一个冻结的无线基础模型(FM)编码器,将原始信道状态信息映射为场景无关的表示;一个检索模块,通过表示空间中的相似性搜索从每个场景的数据库中选择最具信息量的参考;以及一个基于Transformer的上下文学习(ICL)模块,将查询与检索到的参考融合以预测用户设备(UE)位置。为了适应不同查询的检索质量和传播复杂性,ICL模块采用混合专家设计,其中专家专注于不同的上下文大小,并由可学习的选择器软组合。跨不同BS配置的异构场景的广泛基于射线追踪的实验表明,RA-LWLM在未见和已见场景上实现了几乎相同的精度,无需任何每个场景的重新训练,显著优于端到端和基于FM的基线。这些结果验证了所提出的检索增强上下文范式作为6G网络中跨场景定位的可扩展解决方案。

英文摘要

Wireless localization is a fundamental capability of sixth-generation (6G) networks. Conventional model-based methods require accurate modeling of the propagation environment and degrade in complex multipath and non-line-of-sight scenarios, while learning-based methods couple model parameters tightly to the training scene, requiring costly retraining whenever the base station (BS) configuration or propagation environment changes. In this paper, we propose RA-LWLM, a retrieval-augmented in-context localization framework that achieves training-free cross-scene adaptation by externalizing scene-specific information into a per-scene fingerprint database rather than encoding it in model weights. The framework consists of three components: a frozen wireless foundation model (FM) encoder that maps raw channel state information into a scene-agnostic representation; a retrieval module that selects the most informative references from the per-scene database via similarity search in the representation space; and a transformer-based in-context learning (ICL) module that fuses the query with the retrieved references to predict the user equipment (UE) position. To accommodate varying retrieval quality and propagation complexity across queries, the ICL module adopts a mixture-of-experts design in which experts specialize in different context sizes and are softly combined by a learnable selector. Extensive ray-tracing-based experiments across heterogeneous scenes with diverse BS configurations show that RA-LWLM achieves nearly identical accuracy on seen and unseen scenes without any per-scene retraining, substantially outperforming end-to-end and FM-based baselines. These results validate the proposed retrieval-augmented in-context paradigm as a scalable solution for cross-scene localization in 6G networks.

2606.01896 2026-06-02 cs.CV cs.AI 版本更新

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

训练、测试、重新评估:用于手部检测的生成数据的调度敏感评估

Atmika Bhardwaj, Silvia Vock, Nico Steckhan

发表机构 * Federal Institute for Occupational Safety and Health(联邦职业安全与卫生研究所)

AI总结 本研究通过多阶段训练调度实验,评估生成性图像修补数据对安全关键场景下手部检测性能的影响,发现适当的训练流程能显著提升真实部署效果。

Comments 16 pages, 4 figures

详情
AI中文摘要

生成(或合成)图像数据越来越多地被用于增强或替代真实训练数据集,当目标图像稀缺、昂贵或存在偏差时。在手部检测中,特别是在职业安全设置中,公共数据集大多包含裸手。这低估了手套、纹身、珠宝和其他个人防护装备引入的手部外观变化,造成了安全关键应用在部署时遇到的分布偏移。我们测试生成性修补,即仅编辑真实照片的手部区域以引入配饰,是否能缩小这种偏移差距。在一个由真实图像及其合成对应物组成的配对数据集上,我们在六种训练和调度方案(实验A-F,每种三个随机种子)下训练YOLOv8n手部检测器,在真实测试集和仅真实手套测试子集上评估每个检测器,报告两个重叠阈值(mAP@0.5和mAP@0.5:0.95)下的平均精度(mAP),并进行配对统计检验。一个两阶段实验:在真实+合成数据上训练,然后在较低学习率下仅用真实数据微调得到的权重,与标准真实测试集上的仅真实基线模型相比,提高了mAP@0.5,并改善了真实手套的分布外差距。另一个三阶段实验最好地保持了框的紧密度,达到了研究中任何其他实验的最高mAP@0.5:0.95。合成数据对安全关键手部检测的效用由训练过程决定,简单的多阶段实验从修补的配饰数据中提取了实质性的真实部署收益。

英文摘要

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment: train on real U synthetic data, then fine-tune the resulting weights on real-only at a lower learning rate, increases mAP@0.5 compared to the real-only baseline model on the standard real test set, and improves the real-gloves out-of-distribution gap. Another three-stage experiment preserves box-tightness best, reaching the highest mAP@0.5:0.95 of any other experiment in the study. The synthetic-data utility for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.

2606.01895 2026-06-02 cs.CV cs.AI 版本更新

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

LEO星座中基于多卫星视角的协作空间目标检测

Xingyu Qu, Wenxuan Zhang, Peng Hu

发表机构 * Government of Canada(加拿大政府) Natural Sciences and Engineering Research Council of Canada(加拿大自然科学和工程研究理事会)

AI总结 针对LEO星座中空间目标检测的挑战,提出基于深度学习框架的多视角观测融合方法,使用YOLO检测器处理多视角数据,实验表明多视角融合显著提升检测精度。

详情
AI中文摘要

随着低地球轨道(LEO)星座中卫星数量的增加,近地空间环境日益拥挤,使得空间目标检测(SOD)成为空间安全和可持续性面临的紧迫挑战。为了降低碰撞风险并确保空间操作的连续性,SOD系统必须在严格的星载约束下提供快速准确的检测。在本文中,我们研究了深度学习(DL)框架内多视角观测融合的潜力,以增强SOD性能。我们设计了一个实用的多视角流水线和几种输入表示,用于将多视角数据输入基于YOLO的检测器。我们的实验表明,在大多数情况下使用多视角输入是可行的,并且通常能在mAP50和mAP50-95上产生更好的结果。例如,在模型YOLOv9-m中,单视角与三视角融合RGB设置相比,mAP50从0.638增加到0.732,而mAP50-95从0.227提高到0.276。与单视角设置相比,最佳的三视角灰度配置将mAP50提高了36.3%,mAP50-95提高了46.5%。这些发现确立了多视角融合作为SOD的一种可行且有效的策略,对LEO星座部署中的空间态势感知具有广泛意义。

英文摘要

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, in model YOLOv9-m, single-view compared to a three-view fused RGB setting, mAP50 increases from 0.638 to 0.732, while mAP50-95 improves from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.

2606.01894 2026-06-02 cs.AI 版本更新

Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations

物理约束的Mamba-SDE用于不规则观测下的剩余使用寿命预测

Deyu Zhuang, Peiliang Gong, Yang Shao, Liyuan Shu, Qi Zhu, Xiaoli Li, Daoqiang Zhang

发表机构 * Nanjing University of Aeronautics and Astronautics(南京航空航天大学) Nanyang Technological University(南洋理工大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 提出PC-MambaSDE框架,通过掩码感知连续Mamba编码器和物理引导的潜在SDE,解决不规则观测下剩余使用寿命预测的物理不可行性问题。

详情
AI中文摘要

准确的剩余使用寿命预测对于工业预测性维护至关重要。然而,由于传感器观测的不规则性,表现为异步采样、突发缺失和时间抖动,实际部署具有挑战性。更糟糕的是,纯数据驱动模型常常生成物理上不合理的退化轨迹,违反损伤累积的不可逆性。为了解决这个问题,我们提出了PC-MambaSDE,一个统一的连续时间框架,用于在不规则观测下进行鲁棒的RUL预测。具体来说,我们设计了一个掩码感知连续Mamba编码器,显式利用观测掩码提取富含上下文的控制信号。此外,我们引入了一个带有参数化修正混合漂移的物理引导潜在SDE,叠加全局物理偏差以强制单调退化,即使在严重观测间隙下也是如此。另外,我们通过终端退化惩罚将RUL预测公式化为边界值问题,该惩罚解耦健康指标维度并应用惩罚损失引导轨迹向故障状态演化。理论上,我们通过Girsanov定理证明了我们的变分目标在数学上等价于最小化KL散度,并通过Lyapunov分析保证了学习动力学的全局渐近稳定性。为了进行严格评估,我们开发了一个混合不规则性生成方案,模拟真实的工业缺陷。在公开基准上的大量实验表明,PC-MambaSDE显著优于最先进的方法,特别是在极端观测稀缺情况下,验证了将物理先验嵌入连续时间潜在动力学的有效性。

英文摘要

Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversible nature of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a Mask-Aware Continuous Mamba Encoder that explicitly leverages observation masks to extract context-rich control signals. Furthermore, we introduce a Physics-Guided Latent SDE with parametrically rectified hybrid drift, superimposing a global physical bias to enforce monotonic degradation even amid severe observation gaps. Additionally, we formulate RUL prediction as a boundary value problem via a Terminal Degradation Penalty, which decouples a Health Index dimension and applies a penalty loss to guide trajectories toward the failure state. Theoretically, we prove that our variational objective is mathematically equivalent to minimizing the KL divergence via Girsanov's theorem, and we guarantee the global asymptotic stability of the learned dynamics through Lyapunov analysis. To enable rigorous evaluation, we develop a Hybrid Irregularity Generation Scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks demonstrate that PC-MambaSDE significantly outperforms state-of-the-art methods, particularly under extreme observation scarcity, validating the efficacy of embedding physical priors into continuous-time latent dynamics.

2606.01886 2026-06-02 cs.AI cs.CE 版本更新

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

吸收复杂性:面向金融LLM代理的交互原生知识驾驭系统

Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita, Dmytro Kyrylenko, Sofiia Pidturkina, Julia Stadnyk

发表机构 * True Trading Inc4.net

AI总结 提出交互原生知识驾驭(InKH)架构,通过被动知识注入、时序图记忆和过期失效机制,将复杂性吸收到系统中,在金融LLM代理任务中显著降低延迟、令牌成本和过时知识使用,同时提升任务质量和可追溯性。

Comments 17 pages, 3 figures

详情
AI中文摘要

金融AI代理常常因一个简单原因而失败:它们让用户承担复杂性。用户必须反复陈述目标、风险偏好、投资组合背景、过往判断以及不断变化的市场假设,而代理则回答、检索、行动并遗忘。在金融领域,这不仅仅是方便与否的问题。在市场分析、跟单交易审查和交易准备等任务中,被遗忘的背景和过时的记忆可能导致延迟、重复错误、弱可审计性以及不安全的决策。 我们提出了交互原生知识驾驭(InKH),一种面向金融LLM代理的架构,将复杂性吸收到系统中。InKH将用户、市场、投资组合和工具事件转换为结构化的操作知识。它使用被动知识注入在主模型步骤之前组装一个有界的工作上下文缓冲区,使用时序图记忆进行低延迟检索,使用维基审计界面实现人类可读的治理,以及具有成熟度、衰减和写入时失效的背景提取。 我们在一个可重复的受控合成基准上评估了InKH,该基准包含24个随机种子、4轮、每轮80个片段和6个基线,产生了46,080个基线条件评估。InKH在900毫秒延迟下实现了0.815的平均任务质量。与代理驱动的维基漫步记忆相比,它将延迟降低了82.95%,令牌成本降低了82.29%,过时知识使用降低了96.58%,同时质量提高了0.108,可追溯性提高了0.461。与没有失效机制的时序图系统相比,它在相当的服务成本下将质量提高了0.050,并将过时记忆使用降低了96.58%。 结果支持了金融AI的设计论点:当复杂性被系统吸收而不是转移给用户时,采用就会发生。该基准验证了架构层面的行为,而非实时交易性能。

英文摘要

Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk preferences, portfolio context, past judgments, and shifting market assumptions, while the agent answers, retrieves, acts, and forgets. In finance, this is not just inconvenient. In tasks such as market analysis, copy-trading review, and trade preparation, forgotten context and stale memory can create latency, repeated errors, weak auditability, and unsafe decisions. We propose the interaction-native knowledge harness (InKH), an architecture for financial LLM agents that absorbs complexity into the system. InKH converts user, market, portfolio, and tool events into structured operational knowledge. It uses passive knowledge injection to assemble a bounded working context buffer before the main model step, temporal graph memory for low-latency retrieval, a wiki audit surface for human-readable governance, and background extraction with maturity, decay, and write-time invalidation. We evaluate InKH on a reproducible controlled synthetic benchmark with 24 random seeds, 4 rounds, 80 episodes per round, and 6 baselines, producing 46,080 baseline-conditioned evaluations. InKH achieves mean task quality of 0.815 at 900 ms latency. Compared with agent-driven wiki-walk memory, it reduces latency by 82.95 percent, token cost by 82.29 percent, and stale-knowledge usage by 96.58 percent, while improving quality by 0.108 and traceability by 0.461. Compared with a temporal-graph system without invalidation, it improves quality by 0.050 and reduces stale-memory usage by 96.58 percent with comparable serving cost. The results support a design thesis for financial AI: adoption happens when complexity is absorbed by the system rather than transferred to the user. The benchmark validates architecture-level behavior, not live trading performance.

2606.01862 2026-06-02 cs.MA cs.AI cs.NI 版本更新

RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation

RadioMaster: 自主无线电信号生成的多智能体系统

Jiazhen Lei, Tianze Cao, Yuxin Sha, Sihan Wang, Bingbing Wang, Fengyuan Zhu, Zeming Yang, Xiaohua Tian

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出RadioMaster,一个全自主的多智能体框架,通过RadioWiki、RadioAgent和RadioEmulator三大支柱,将用户意图转化为真实无线信号,解决现有模型因领域知识和硬件约束敏感性不足而无法生成无线电信号的问题。

详情
AI中文摘要

将用户意图转化为物理无线电信号是无线原型设计中关键但繁琐的最后一步,因为它需要复杂的物理层细节知识,并带来巨大的实现挑战。大型语言模型(LLM)和多智能体系统已经彻底改变了传统的软件工程,提出了一个引人深思的问题:它们能否解决这些艰巨的困难?然而,我们的研究表明,当前模型在应用于无线电信号生成时存在显著局限性,无法完成此任务。这种性能下降主要源于严重的领域无知和对物理硬件约束的根本不敏感。为弥补这一差距,我们引入了RadioMaster,一个完全自主的多智能体框架,旨在将用户输入无缝转化为真实的无线发射。RadioMaster基于三个协同支柱运行:用于领域特定知识检索的RadioWiki、用于协作I/Q样本生成和硬件配置的RadioAgent,以及用于闭环物理层验证的RadioEmulator。此外,我们构建了RadioBench,这是首个专门针对无线电信号生成领域的全面基准测试。广泛的真实世界评估表明,RadioMaster在配置可行性和信号保真度方面显著优于最先进的基线方法。

英文摘要

Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as it requires intricate knowledge of physical layer details and presents immense implementation challenges. Large Language Models (LLMs) and multi-agent systems have revolutionized conventional software engineering, raising the compelling question of whether they can resolve these formidable difficulties. However, our investigations reveal that current models experience significant limitations and fail to accomplish this task when applied to radio signal generation. This performance degradation primarily stems from severe domain ignorance and a fundamental insensitivity to physical hardware constraints. To bridge this gap, we introduce RadioMaster, a fully autonomous multi-agent framework designed to seamlessly translate user input into real-world wireless emissions. RadioMaster operates on three synergistic pillars: RadioWiki for domain-specific knowledge retrieval, RadioAgent for collaborative I/Q sample generation alongside hardware configuration, and RadioEmulator for closed-loop physical layer verification. Furthermore, we construct RadioBench, the first comprehensive benchmark tailored specifically for the radio signal generation domain. Extensive real-world evaluations demonstrate that RadioMaster significantly outperforms state-of-the-art (SOTA) baselines regarding configuration viability and signal fidelity.

2606.01856 2026-06-02 cs.DC cs.AI 版本更新

Boosting Multimodal Federated Learning via Chained Modality Optimization

通过链式模态优化提升多模态联邦学习

Zixin Zhang, Fan Qi, Shuai Li, Xiaoshan Yang, Changsheng Xu

发表机构 * School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, China(天津理工大学计算机科学与工程学院) Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所) College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia, China(内蒙古大学计算机学院)

AI总结 针对多模态联邦学习中模态竞争导致全局模型次优的问题,提出FedMChain框架,通过分阶段优化、误差补偿正则化和稀疏符号引导聚合,提升预测性能并降低通信开销。

详情
AI中文摘要

多模态联邦学习(MMFL)能够在具有异构数据和模态可用性的分散客户端之间实现隐私保护的协作学习。然而,现有大多数MMFL方法将多模态训练视为联合优化问题,忽略了一个关键瓶颈:模态竞争,即主导模态抑制较弱模态,导致全局模型次优。为解决这一问题,我们提出FedMChain,一个平衡的MMFL框架,将联邦多模态训练结构化为一系列模态阶段。这种分阶段设计为每个模态在多模态客户端上提供了专用的局部优化窗口,以缓解模态竞争,并通过误差补偿正则化器进一步促进跨模态互补性。在服务器端,我们采用稀疏符号引导聚合策略,利用方向符号一致性进行稳健的模态内聚合,避免破坏性平均,并支持较少的同步频率以降低通信开销。在多模态基准上的大量实验表明,FedMChain在需要比基线更少通信频率的同时,持续提高了预测性能。

英文摘要

Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data and modality availability. However, most existing MMFL methods cast multimodal training as a joint optimization problem, overlooking a key bottleneck: modality competition, where dominant modalities suppress weaker ones and lead to suboptimal global models. To address this, we propose FedMChain, a balanced MMFL framework that structures federated multimodal training as a chain of modality-wise phases. This phase-wise design gives each modality a dedicated local optimization window on multimodal clients to mitigate modality competition, and further promotes cross-modal complementarity via an error-compensated regularizer. On the server side, we employ a sparse sign-guided aggregation strategy that leverages directional sign agreement for robust intra-modality aggregation, avoids destructive averaging, and supports less frequent synchronization to reduce communication overhead. Extensive experiments on multimodal benchmarks demonstrate that FedMChain consistently improves predictive performance while requiring less frequent communication than baselines.

2606.01850 2026-06-02 cs.AI 版本更新

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

压缩是否保留不确定性?基于共形预测的量化和稀疏大语言模型统一基准

Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang, Junhao Dong, Jingling Yuan

发表机构 * Wuhan University of Technology(武汉理工大学) Nanyang Technological University(南洋理工大学)

AI总结 本研究通过共形预测方法,在五个NLP任务上对12种不同压缩配置的大语言模型进行基准测试,发现压缩经常解耦准确率与不确定性,大模型更能吸收压缩引起的不确定性,且不确定性膨胀常呈阈值状而非渐进。

详情
AI中文摘要

模型压缩技术如量化和剪枝被广泛用于降低大语言模型(LLMs)的部署成本,现有评估几乎只关注准确率保持。然而,在安全关键应用中,模型可靠量化自身不确定性的能力同样重要。我们问:压缩是否保留了这种能力?为回答此问题,我们在五个NLP任务上对12种不同压缩配置的LLM进行基准测试,使用共形预测提供严格、无分布的不确定性度量。实验揭示:(I) 压缩经常解耦准确率与不确定性;(II) 大模型吸收压缩引起的不确定性的能力远强于小模型;(III) 不确定性膨胀常呈阈值状而非渐进。这些结果表明,仅基于准确率的评估不足以评估压缩LLM的部署就绪度,不确定性感知的基准测试应成为模型压缩流程的标准组成部分。

英文摘要

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.

2606.01845 2026-06-02 cs.CL cs.AI 版本更新

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

揭示大型语言模型在推断非语言回应中的语用意义的局限性

Sugyeong Eo, Heuiseok Lim

发表机构 * Department of Software, Yonsei University Mirae Campus(燕山大学软件系) Department of Computer Science and Engineering, Korea University(韩国大学计算机科学与工程系)

AI总结 本研究首次系统评估大型语言模型(LLMs)从纯非语言回应对话中推断语用意义的能力,发现其准确率相比语言回应下降高达60%,并表明上下文学习有助于语用推理。

详情
AI中文摘要

尽管大型语言模型(LLMs)在语用语言理解方面取得了显著进展,但先前的研究主要集中在其对语言行为的理解上。然而,非语言行为仍然是人类交流的基本组成部分,特别是当故意单独使用以传达间接意义时。在这项工作中,我们首次系统评估了LLMs从仅包含非语言回应的对话中推断语用意义的能力。我们探讨了三个研究问题:(1)LLMs能否识别通过非语言回应传达的间接意图?(2)LLMs何时以及如何未能捕捉非语言意图?(3)我们如何提高LLMs解释非语言意图的能力?通过评估,我们观察到LLMs难以从非语言回应中推断出潜在意义,准确率相比语言回应下降高达60个百分点。进一步的广泛分析揭示了LLMs在解释非语言行为时的行为模式,并表明上下文学习有助于语用推理。

英文摘要

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs' ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs' ability to interpret non-verbal intent?. Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60% points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs' interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.

2606.01843 2026-06-02 cs.CV cs.AI 版本更新

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

抑制伪造特定捷径以实现可泛化的深度伪造检测

Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu, Le Wu, Tat-Seng Chua

发表机构 * Hefei University of Technology(合肥工业大学) National University of Singapore(国立新加坡大学)

AI总结 提出Shortcut Subspace Suppression (S^3)框架,通过子空间建模显式表征并抑制方法特定捷径,以提升深度伪造检测的跨方法泛化能力。

详情
AI中文摘要

深度伪造检测在跨伪造方法泛化方面表现不佳,因为现有模型倾向于依赖虚假的方法特定捷径,这些捷径无法迁移到未见过的篡改操作。尽管近期方法试图改进泛化性,但它们缺乏明确的机制来识别和抑制学习表示中的此类捷径。在这项工作中,我们提出了捷径子空间抑制(S^3)框架,通过子空间建模显式表征并抑制方法特定捷径。我们的关键洞察是,区分不同伪造方法的变体捕获了方法特定的伪影,因此可作为方法特定捷径的有效代理。为此,我们训练一个轻量级线性探针进行伪造方法分类,并执行奇异值分解(SVD)以提取主导的捷径子空间。基于此公式,我们开发了两种互补策略来减少对捷径的依赖。在训练期间,我们软性抑制特征表示中的捷径子空间,鼓励模型依赖更可泛化的线索进行真/假判别。在推理时,我们引入一个无需训练的对应方法,衰减与识别出的捷径方向对齐的神经元,从而实现即插即用的泛化增强,并提高可解释性。在多个基准上的大量实验表明,我们的方法显著改善了跨方法泛化,同时保持了强大的域内性能。代码将在论文被接收后发布。

英文摘要

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose Shortcut Subspace Suppression (S^3) framework that explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.

2606.01840 2026-06-02 cs.AI 版本更新

Evaluation of Baseline Methods for IDD-based SSD External Memory Search

基于IDD的SSD外部内存搜索的基线方法评估

Yuki Suzuki, Alex Fukunaga

发表机构 * International Symposium on Combinatorial Search (SoCS 2026)(国际组合搜索会议(SoCS 2026))

AI总结 本文评估了基于即时重复检测(IDD)的A*算法在SSD外部内存搜索中的简单基线方法性能,并分析了操作系统级页面缓存的影响。

Comments accepted to The 19th International Symposium on Combinatorial Search (SoCS2026)

详情
AI中文摘要

许多困难的搜索问题无法仅使用RAM通过A*等算法解决。先前的工作提出了使用容量远大于RAM的外部内存(如SSD和HDD)的搜索算法,但先前的工作主要集中在延迟重复检测方法以及复杂的即时重复检测(IDD)方法上,而相对简单的IDD方法尚未得到系统研究。此外,操作系统级管理及加速外部内存访问的机制(如页面缓存)的影响也未被研究。本文通过评估和分析基于IDD的A*的简单基线方法的性能,填补了文献中的这些空白。

英文摘要

Many difficult search problems cannot be solved by algorithms such as A* using only RAM. Search algorithms which use external memory such as SSDs and HDDs with much higher capacity than RAM have been proposed in previous work, but previous work has focused on delayed duplicate detection approaches, as well as complex immediate duplicate detection (IDD) methods, and relatively simple methods for IDD have not been systematically studied. In addition, the effect of OS-level mechanisms for managing and speeding up accesses to external memory, such as page caches, has not been studied. This paper addresses these gaps in the literature by evaluating and analyzing the performance of simple baseline approaches for IDD-based A*.

2606.01838 2026-06-02 cs.CL cs.AI cs.LG 版本更新

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

LayerRoute: 基于LoRA微调的输入条件自适应层跳过方法用于智能语言模型

Prateek Kumar Sikdar

发表机构 * Accenture(埃森哲)

AI总结 提出LayerRoute,通过为每个Transformer块添加轻量级路由器和LoRA适配器,根据输入类型(工具调用或规划推理)自适应跳过层,在仅增加0.22%可训练参数下实现12.91%的跳过差异并提升质量。

Comments 10 pages, 3 figures, 4 tables

详情
AI中文摘要

智能语言模型系统交替使用两种结构不同的步骤类型:结构化工具调用(短、确定性、低困惑度)和开放式规划/推理步骤(长、复杂、高困惑度)。尽管存在这种异质性,当前的推理系统对每个步骤应用相同的计算量。我们引入LayerRoute,一个轻量级适配器,学习基于每个输入有选择地跳过Transformer块。LayerRoute为Qwen2.5-0.5B-Instruct中的24个Transformer块中的每一个增加:(1)一个每层路由器(约897个参数,Linear(896,1)),通过直通估计器输出硬二值门;(2)在Q/K/V/O注意力投影上的LoRA适配器(秩8,约1.08M参数)。骨干权重保持冻结。在智能体数据(Hermes、Glaive、GSM8K、Turing)上进行单次端到端训练,并加入门正则化项,迫使系统发现每个输入类型下哪些块是可跳过的。经过3000步(在A100 40GB上6.4分钟),LayerRoute实现了12.91%的跳过差异:工具调用跳过15.25%的FLOPs,而规划步骤仅跳过2.34%,仅使用1.10M可训练参数(占494M骨干的0.22%)。由于LoRA适配,质量相比基础模型有所提升,工具调用上的困惑度差为-1.29,规划步骤上为-1.30。

英文摘要

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.

2606.01834 2026-06-02 cs.CV cs.AI 版本更新

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

轻量级TCN中的物理引导注意力用于高效基于WiFi CSI的人体活动识别

Chinthaka Ranasingha, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Harshala Gammulle

发表机构 * Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) Research Group, School of Electrical Engineering and Robotics, Queensland University of Technology (QUT)(信号处理、人工智能与视觉技术(SAIVT)研究组,电气工程与机器人学院,昆士兰科技大学(QUT))

AI总结 提出一种紧凑的TCN框架,通过多普勒能量引导的时间注意力和方差驱动的通道注意力机制,显式引入运动感知归纳偏置,在减少参数和计算成本的同时实现优于深度基线模型的性能。

详情
AI中文摘要

基于WiFi信道状态信息(CSI)的人体动作识别(HAR)因其非接触、低成本及保护隐私的特性而受到越来越多的关注。然而,现有的基于学习的方法主要依赖深度、计算密集的架构来隐式地从CSI测量中捕捉运动动态,从而增加了模型复杂度并降低了效率。相反,我们认为,结合针对CSI信号物理特性的适当归纳偏置能够实现更高效和有效的学习。在这项工作中,我们提出一个紧凑的基于时间卷积网络(TCN)的框架,将运动感知的归纳偏置显式地融入特征学习。具体地,我们在特征空间中引入多普勒能量引导的时间注意力机制以强调运动显著的时间段,以及一个方差驱动的通道注意力模块,根据时间运动统计自适应地加权信息子载波。通过整合这些领域特定的先验知识,所提模型在不增加架构深度的情况下有效捕捉运动动态。在多个基准数据集上的大量实验表明,我们的方法相比更深的基线模型取得了优越的性能,同时显著减少了参数数量和计算成本。

英文摘要

Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture motion dynamics from CSI measurements, thereby increasing model complexity and reducing efficiency. Instead, we argue that incorporating appropriate inductive biases tailored to the physical characteristics of CSI signals enables more efficient and effective learning. In this work, we propose a compact temporal convolutional network (TCN)-based framework that explicitly incorporates motion-aware inductive biases into feature learning. Specifically, we introduce a Doppler-energy-guided temporal attention mechanism in feature space to emphasize motion-salient time segments, and a variance-driven channel attention module to weight informative subcarriers based on temporal motion statistics adaptively. By integrating these domain-specific priors, the proposed model effectively captures motion dynamics without increasing architectural depth. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves superior performance compared to deeper baselines, while significantly reducing parameter count and computational cost.

2606.01833 2026-06-02 cs.LG cs.AI 版本更新

Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

学习生成空间中的隐式偏置以加速蛋白质动力学仿真

Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu, Siyu Zhu, Tzuhsiung Yang, Yuan Qi

发表机构 * Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出在预训练生成式仿真器的生成空间中引入隐式历史依赖偏置,结合距离加权分数估计和环境支持正则化,通过重投影步骤保持结构有效性,显著提升采样多样性和稀有状态覆盖速度。

详情
AI中文摘要

蛋白质动力学生成式仿真器能够以分子动力学一小部分成本生成合理的轨迹,但它们继承了训练分布,在长期外推下倾向于重访已知状态而非到达稀有状态。受经典增强采样启发,我们在预训练仿真器的生成空间中引入隐式历史依赖偏置。具体来说,一个历史感知的分数估计器向冻结的仿真器添加距离加权偏置,引导逆时采样远离先前生成的结构,并通过环境支持项进行正则化。为在长时间尺度下保持结构有效性,一个基于分数的精化步骤利用冻结仿真器将漂移的样本重新投影到数据流形上。实验表明,该方法(i)在DynamicPDB-80上将多样性提升35%;(ii)在12个零样本快速折叠蛋白质上,单独使用学习到的偏置达到无偏仿真器覆盖的速度最高快约15倍,与精化结合后覆盖速度最高快约37倍,同时覆盖的低能态数量多约3倍。代码即将发布。

英文摘要

Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that the method (i) raises diversity by $35\%$ on DynamicPDB-80; (ii) on $12$ zero-shot Fast-Folding proteins, the learned bias alone reaches the unbiased emulator's coverage up to ${\sim}15\times$ faster, and pairing it with refinement reaches the coverage up to ${\sim}37\times$ faster while covering ${\sim}3\times$ as many low-energy states. Code will be released soon.

2606.01830 2026-06-02 cs.AI 版本更新

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

CAPF:基于信用衰减特权反馈引导搜索智能体轨迹生成

Bin Chen, Xinye Liao, Yiming Liu, Xin Liao, Chonghan Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对结果奖励稀疏导致搜索智能体学习困难的问题,提出训练时利用验证器侧信息(CAPF)将零奖励轨迹修复为正奖励轨迹,并衰减相关信用以适配无特权反馈的部署场景,在七个开放域问答基准上将Qwen3-4B的平均精确匹配分数从44.7%提升至48.5%。

详情
AI中文摘要

最近的LLM搜索智能体使用带可验证奖励的强化学习(RLVR)从结果奖励中学习搜索增强推理。在困难问题上,这些智能体很少采样到端到端成功的轨迹,导致仅基于结果的RLVR只有少量正奖励轨迹。我们认为,改善此类问题的学习需要在训练期间提供额外指导,而RLVR已经包含了可以提供这种指导的验证器侧信息。这些信息可以识别智能体提交答案中的错误或遗漏,并引导轨迹内的修正。我们提出了一种训练时机制,称为**信用衰减特权反馈**(CAPF),该机制通过在训练期间进行特权反馈调用,使验证器侧信息可用。CAPF允许策略将零奖励尝试修复为正奖励修复轨迹,并衰减对反馈调用和早期动作的信用,以适应没有此调用的部署。实证研究表明,在七个开放域问答基准上,CAPF将Qwen3-4B的平均精确匹配分数从仅结果RLVR的44.7%提升至48.5%。

英文摘要

Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories. We argue that improving learning on such problems requires additional guidance during training, and RLVR already contains verifier-side information that can provide it. This information can identify errors or omissions in the agent's submitted answer and guide revision within the rollout. We propose a training-time mechanism called \textbf{Credit-Attenuated Privileged Feedback} (CAPF), which makes this verifier-side information available through a Privileged Feedback call during training. CAPF lets the policy revise zero-reward attempts into positive-reward repair trajectories and attenuates credit for the feedback call and earlier actions to accommodate deployment without this call. Empirical research demonstrates that CAPF improves Qwen3-4B's average exact-match score from 44.7% under outcome-only RLVR to 48.5% on seven open-domain QA benchmarks.

2606.01828 2026-06-02 cs.MA cs.AI 版本更新

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

基于动态信任感知的稀疏通信拓扑用于基于LLM的多智能体共识

Wanshuang Gou, Zihan Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出DySCo动态稀疏共识机制,通过信任感知的边选择降低通信开销并保持共识质量。

Comments 11 pages, 3 figures, 5 tables

详情
AI中文摘要

大型语言模型驱动的多智能体系统通过多轮讨论、角色专业化和交叉验证增强了复杂推理任务的可靠性。然而,现有的多智能体辩论和协作框架通常采用全连接通信,导致消息数量、令牌成本和端到端延迟随智能体数量近似二次增长;尽管固定稀疏拓扑减少了开销,但它们无法适应不同任务实例或中间推理状态,容易保留低价值交互或丢失关键的纠错信息。针对这一问题,本文提出了DySCo(动态稀疏共识),一种动态信任感知的稀疏共识机制。在每一轮推理中,DySCo基于智能体可靠性、答案分歧和任务相关性估计通信边的价值,并在预算约束下选择少量高价值边进行消息交换;然后通过动态信任权重聚合不同智能体的答案,并在共识稳定后提前终止讨论。该机制用按需通信替代通用广播,从而在保留关键交叉验证信息的同时降低通信开销。我们进一步给出了通信复杂度和共识稳定性的分析,并在数学推理、逻辑推理和事实问答任务上评估了DySCo的性能。

英文摘要

Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation. However, existing multi-agent debate and collaboration frameworks typically adopt fully connected communication, causing the number of messages, token costs, and end-to-end latency to grow approximately quadratically with the number of agents; although fixed sparse topologies reduce overhead, they cannot adapt communication relationships to different task instances or intermediate reasoning states, making them prone either to preserving low-value interactions or to losing critical error-correction information. To address this problem, this paper proposes DySCo (Dynamic Sparse Consensus), a dynamic trust-aware sparse consensus mechanism. In each round of reasoning, DySCo estimates the value of communication edges based on agent reliability, answer divergence, and task relevance, and selects a small number of high-value edges for message exchange under budget constraints; it then aggregates the answers of different agents through dynamic trust weights and terminates the discussion early once consensus stabilizes. This mechanism replaces universal broadcasting with on-demand communication, thereby reducing communication overhead while preserving essential cross-validation information. We further present analyses of communication complexity and consensus stability, and evaluate the performance of DySCo on mathematical reasoning, logical reasoning, and factual question-answering tasks.

2606.01811 2026-06-02 cs.CL cs.AI cs.LG 版本更新

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

“我知道这会如何发展”:通过渐进条件惊奇度刻画多样性

Matthew Khoriaty, David Williams-King, Shi Feng

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) Stanford University(斯坦福大学)

AI总结 提出一种基于上下文学习的多样性度量方法 Decan(D_{Ca_n}),通过单次前向传递计算每个字节的得分,无需嵌入模型、参考语料或人工标注,在多个基准上验证了其有效性。

Comments 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity

详情
AI中文摘要

衡量创意输出的多样性对于评估训练后模式崩溃、比较解码策略以及量化AI和人类写作中的创造性行为至关重要。我们提出了一种使用上下文学习来度量多样性的新方法,其中“Decan”度量 $D_{Ca_n} = C \times a_n$ 是我们评估的工作实例:一个基于每个字节的得分,该得分从基础模型 $θ$ 的每个标记对数概率中读取,每次排列只需一次前向传递,无需嵌入模型、参考语料库和人工标签。该方法基于信息论,利用语言模型的上下文学习来检测任意数量输入之间的广泛相似性,并避免了训练专用模型的需要。同一流程对AI样本和人类编写的回答集进行评分,将多样性视为(回答、提示、评分模型)的一个属性。在Tevet和Berant基于人类判断的McDiv基准上,$D_{Ca_n}$ 在McDiv prompt_gen 集上达到了0.846的OCA,这是其表现最好的情况,仅次于Tevet和Berant报告的最强神经基线(SentBERT,0.897)。在OLMo-2-7B训练后流程中,$D_{Ca_n}$ 在基础→SFT→DPO→RLVR阶段单调下降,检测到创意写作应用所关注的多样性损失类型。

英文摘要

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the ``Decan'' metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $θ$ in a \emph{single forward pass} per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt\_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01810 2026-06-02 cs.AI 版本更新

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Token 预测器不是规划器:构建物理基础的因果推理器

Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li

发表机构 * Tsinghua University(清华大学) Microsoft Research Asia(微软亚洲研究院) MBZUAI

AI总结 针对具身视觉-语言规划中模型依赖语言统计先验而非因果推理的问题,提出 Causal-Plan-Bench 基准和 Causal-Plan-1M 数据集,并训练 Causal Planner 模型,实现从 token 预测到物理因果推理的转变。

Comments 77 pages, appendices included. Code: https://github.com/THUSI-Lab/Causal-Reasoner

详情
AI中文摘要

当前的具身视觉-语言规划基准往往倾向于语言上的下一 token 预测,而非物理基础的下一状态推理。这奖励了模仿统计语言先验而非追踪因果依赖的模型,将物理规划简化为浅层序列建模。我们认为,可靠的物理自主性需要从语言基础的 token 预测转向物理基础的因果推理。为此,我们引入了 Causal-Plan-Bench,这是一个通过多阶段验证构建的高保真诊断套件,用于评估四个因果维度的具身规划。我们还构建了 Causal-Plan-1M,这是一个百万规模的显式推理轨迹语料库,通过四阶段标注流程从自我中心视频中生成。广泛评估表明,领先模型仍然难以展示真正的物理自主性,Gemini 3 Pro 在我们的基准上仅达到 38.18。相比之下,我们的训练方法使基于 Qwen3-VL-8B 构建的 Causal Planner 能够内化物理逻辑,从而实现更准确的下一状态估计。该模型在域内性能和跨基准泛化方面表现强劲,并揭示了一个因果缩放定律:将因果训练数据扩展到一百万实例可获得 36.3% 的相对提升,从 33.22 提高到 45.28。总体而言,我们的工作为将智能体从表面的 token 预测器转变为物理基础的因果推理器迈出了具体的一步。

英文摘要

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.

2606.01806 2026-06-02 cs.CL cs.AI cs.LG 版本更新

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

ProbeScale: 通过探测分析优化神经缩放定律以实现高效小语言模型推理

Sourav Das

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institution of Information Technology Kalyani(印度信息技术学院Kalyani)

AI总结 提出ProbeScale框架,利用缩放定律和探测分析从预训练小语言模型中识别参数高效子网络,在参数预算下最大化任务加权探测性能,实现5-10倍参数压缩并保持95%-98%原始性能。

Comments 7 pages, 2 figures, ACL

详情
AI中文摘要

小语言模型在能力与计算可行性之间取得了平衡。神经缩放定律指导其最优训练,表明它们拥有随规模增长而丰富的内部表示。然而,在严格的资源约束下部署即使是这些小语言模型也可能具有挑战性。语言模型探测提供了分析模型内部编码的语言知识的方法。我们提出ProbeScale,一个统一缩放定律和探测洞察的框架,用于在预训练小语言模型中识别参数高效的子网络。ProbeScale利用良好缩放的小语言模型的高质量表示,并使用任务特定探测来数学量化每层对目标下游能力的相关性。这使得能够选择在性能与参数规模之间最优权衡的子网络。我们将子网络选择形式化为在参数预算下寻找最大化聚合任务加权探测性能的层子集。在代表性小语言模型如RoBERTa-Large和T5-Base上的实验表明,ProbeScale识别出的子网络实现了5到10倍的显著参数减少,同时在目标任务上保持了高性能(原始小语言模型的95%至98%),优于启发式基线。

英文摘要

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.

2606.01803 2026-06-02 cs.AI 版本更新

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

OctoT2I:一种自我进化的智能文本到图像路由系统

Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(电子与计算机工程学院,北京大学) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University(广东省超高清沉浸媒体技术重点实验室,北京大学深圳研究生院)

AI总结 提出OctoT2I框架,通过自进化机制构建知识库并采用状态化多轮路由策略,联合优化生成质量与推理效率,在GenEval上达到0.96性能,同时实现90.3%推理加速和56.6%能效提升。

详情
AI中文摘要

文本到图像(T2I)模型的爆炸式增长——从大规模版本到轻量级、实时模型——如今面临单模型扩展的边际收益递减。智能T2I方法通过使用多个模型来缓解这一瓶颈。然而,现有的智能T2I方法面临三个关键挑战:依赖昂贵的手工先验或人工标注、僵化的单路径决策机制以及忽视推理效率。为解决这些挑战,我们引入OctoT2I,一种新颖的智能框架,将T2I任务重新表述为生成质量和推理效率的联合优化。OctoT2I实现了一种有状态的多轮路由策略,该策略基于其知识和记忆自适应地选择最合适的工具。这一策略由我们新颖的自进化机制从头构建的知识库支持。该机制无需人工监督,首先自主定义基础概念维度(例如风格、颜色、数量),然后通过迭代的“提出-求解-评估-学习”(PSEL)循环智能地探索它们的组合。PSEL循环高效地发现每个工具的能力边界,在无需外部指导的情况下推动持续改进。大量实验表明,OctoT2I在GenEval上实现了具有竞争力的性能(0.96),同时相比领先基线(Flow-GRPO)提供了90.3%的推理加速和56.6%的能效提升,在性能和效率之间取得了卓越的平衡。代码和模型将公开提供。

英文摘要

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (eg, style, color, count) and then intelligently explores their combinations via an iterative" Propose--Solve--Evaluate--Learn"(PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

2606.01800 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Multilinguality of Large Language Models From a Structural Perspective

从结构视角看大语言模型的多语言性

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良科学技術研究所)

AI总结 本研究通过表示结构分析探索大语言模型的多语言性,发现低资源语言与英语的结构差异大于高、中资源语言,且语言特定后训练改变结构但保留语言间关系。

详情
AI中文摘要

大型语言模型(LLMs)通过在多语言数据上进行预训练和后训练,在处理多种语言方面表现出色,尽管英语在训练数据中占主导地位。先前关注标记表示的研究揭示了这些LLMs如何处理非英语文本。尽管这些分析提供了有见地的发现,但它们未能捕捉到结构视角,而结构是语言的内在属性。在本研究中,我们通过表示结构分析探索LLMs的多语言性。我们的发现表明,低资源语言在结构上与英语的差异大于高资源和中资源语言,并且语言特定的后训练改变了它们的结构,同时保留了语言间的关系。

英文摘要

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

2606.01790 2026-06-02 cs.CV cs.AI 版本更新

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

STaR-KV: 面向GUI视觉语言模型的时空自适应KV缓存压缩重加权方法

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

发表机构 * EPIC Lab, SJTU(上海交通大学EPIC实验室) HKUST (GZ)(香港科技大学(广州)) The University of Sydney(悉尼大学) UESTC(电子科技大学) ZJU(浙江大学)

AI总结 提出STaR-KV,一种无需训练的KV缓存压缩框架,通过子空间感知评分、时间稳定性折扣和熵驱动温度三个维度自适应校准令牌重要性,在GUI任务中实现高精度和近40%的峰值GPU内存节省。

详情
AI中文摘要

基于视觉语言模型的图形用户界面(GUI)代理展现出广泛的自动化能力,但其部署受限于随交互步骤线性增长的键值(KV)缓存。例如,UI-TARS-1.5-7B在仅五个屏幕截图上消耗76 GB的GPU内存,接近主流80 GB加速器的容量。现有的KV压缩方法共享两个结构假设:将视觉令牌重要性聚合为单个共享显著性图,并对融合的分数分布应用固定的top-B截断。初步测量反驳了这两点:空间专门化存在于注意力子空间层面并在层间迁移,而分数分布沿轨迹漂移。我们提出STaR-KV(时空自适应重加权),一种无需训练的KV缓存压缩框架,沿三个维度校准令牌重要性:(i)由在线空间互信息驱动的子空间感知评分;(ii)时间稳定性折扣,抑制来自持续关注子空间的冗余缓存条目;(iii)熵导出的温度,自适应重塑分数分布。在四个GUI基准测试中,STaR-KV在匹配预算下实现了最先进的KV压缩方法(如GUIKV、SnapKV)中最强的平均准确率,无压缩阶段FLOPs开销(-0.07%),并在20% KV缓存预算下削减近40%的峰值GPU内存。代码可在https://github.com/kawhiiiileo/STaR-KV获取。

英文摘要

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.

2606.01789 2026-06-02 cs.AI 版本更新

Consistency evaluation of benchmarks used for causal discovery

用于因果发现的基准一致性评估

Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang

发表机构 * Independent researcher(独立研究者) UNSW Australia(新南威尔士大学澳大利亚分校) CSIRO Australia(澳大利亚联邦科学与工业研究组织)

AI总结 提出自动检索论文并利用大语言模型检查基准因果图与领域研究一致性的流程,评估11个流行基准,发现其一致性差异显著。

详情
AI中文摘要

在图形因果模型中,因果发现旨在基于数值数据和领域知识(以纯文本形式)构建因果图。然而,因果发现方法的评估在该领域仍然是一个挑战,因为领域研究的进展常常使得基准因果图包含不一致的知识。这个问题尤其影响基于大语言模型(LLM)的因果发现方法,因为它们对文献中的新发现敏感。本文首次系统研究基准因果图的质量。具体来说,我们设计了一个流程,自动从科学数据库中检索相关研究论文,并提示LLM检查基准因果图与领域研究论文之间的一致性。我们评估了11个流行的真实世界基准,我们的流程总共处理了38,081篇领域论文。结果表明,流行基准与领域研究的一致性差异显著,这对因果发现研究具有明确的意义。

英文摘要

In graphical causal model, causal discovery aims to construct a causal graph based on numerical data and domain knowledge in plain text. However, the evaluation of causal discovery methods remains a challenge in the area as the progress of domain researches often makes benchmark causal graphs contain mis-aligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods as they are sensitive to the new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases, and prompts LLMs to check the consistency between the benchmark causal graphs and domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline in total proceeds 38,081 domain papers. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.

2606.01787 2026-06-02 cs.AI math.OC 版本更新

Stochastic convergence of parallel asynchronous adaptive first-order methods

并行异步自适应一阶方法的随机收敛性

Serge Gratton, Philippe L. Toint

发表机构 * Université de Toulouse, INP, IRIT, Toulouse, France(图卢兹大学,INP,IRIT,法国图卢兹) IA Artificial and Natural Intelligence Toulouse Institute (ANITI)(图卢兹3IA人工智能与自然智能研究所(ANITI)) NAXYS, University of Namur, Namur, Belgium(NAXYS,纳慕尔大学,比利时纳慕尔)

AI总结 本文提出一类新的异步自适应一阶优化方法,包括多种流行算法的异步变体,并分析其在非凸函数上的随机收敛性,达到O(1/√t)的收敛速率。

详情
AI中文摘要

本文介绍了一类新的异步自适应一阶优化方法,包括几种流行算法的异步变体。还考虑了使用动量和/或非精确归一化的这些方法的版本。在完全随机环境下分析了该类方法在非凸函数上的收敛性,并证明在合理假设下,收敛阶为O(1/√t)(忽略对数因子)。数值实验表明,这种异步自适应算法在异构大规模机器学习系统中非常有用。

英文摘要

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order O(1/sqrt{t}) under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.

2606.01783 2026-06-02 cs.IR cs.AI 版本更新

Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation

打破信息孤岛:面向跨域推荐的语义人物画像

Jonathan Mayo, Moshe Unger, Konstantin Bauman

发表机构 * Technology and Information Management Department, Coller School of Management, Tel Aviv University(技术与信息管理系,科勒管理学院,特拉维夫大学) Management Information Systems Department, Fox School of Business, Temple University(管理信息系统系,福克斯商学院, Temple大学)

AI总结 提出SPHERE方法,利用大语言模型生成语义人物画像,实现无共享用户或物品的跨域推荐,并通过双塔架构和动态融合门增强推荐性能。

详情
AI中文摘要

数字平台日益成为孤立的信息孤岛,限制了它们跨域构建全面用户表征的能力。跨域推荐系统试图通过将知识从源域迁移到目标域来克服这一限制,但大多数现有方法依赖于共享用户、共享物品或结构相似的交互图。这些假设在独立平台上往往不切实际。我们提出SPHERE(面向异构跨域推荐的语义人物画像),一种设计构件,能够在严格不相交的域之间实现推荐知识迁移,无需共享用户或物品。SPHERE不通过身份或图结构对齐域,而是使用大语言模型诱导共享行为词汇,为用户生成结构化语义人物画像,并检索行为相似的源域社区,形成社区源人物画像。该语义信号通过双塔架构和动态融合门与协同信号集成,使SPHERE能够增强标准推荐骨干。在Amazon Books、Goodreads和Steam上的实证评估表明,在全排名评估下,SPHERE在NCF、SVD++和LightGCN基线上取得了一致的改进。结果表明,跨域迁移效果不仅由域之间的语义接近度决定,还关键取决于目标域的结构密度和原生预测强度。该研究通过将跨域个性化重新定义为基于行为的语义对齐,为信息系统研究做出贡献,提供了一种在保持可解释性和模块化的同时克服信息孤岛的实用机制。

英文摘要

Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representations across domains. Cross-domain recommender systems seek to overcome this limitation by transferring knowledge from a source domain to a target domain, yet most existing approaches depend on shared users, shared items, or structurally similar interaction graphs. These assumptions are often unrealistic across independent platforms. We propose SPHERE (Semantic Personas for Heterogeneous cross-domain Recommendation), a design artifact that enables recommendation knowledge transfer across strictly disjoint domains with no shared users or items. Rather than aligning domains through identity or graph structure, SPHERE uses large language models to induce a shared behavioral vocabulary, generate structured semantic personas for users, and retrieve behaviorally similar source-domain communities that form a Community Source Persona. This semantic signal is integrated with collaborative signals through a dual-tower architecture and dynamic fusion gate, allowing SPHERE to augment standard recommender backbones. Empirical evaluation across Amazon Books, Goodreads, and Steam demonstrates consistent improvements over NCF, SVD++, and LightGCN baselines under full-ranking evaluation. The results show that cross-domain transfer effectiveness is not determined solely by semantic proximity between domains; rather, it depends critically on the structural density and native predictive strength of the target domain. The study contributes to information systems research by reframing cross-domain personalization as behavior-based semantic alignment, offering a practical mechanism for overcoming information silos while preserving interpretability and modularity.

2606.01781 2026-06-02 cs.AI 版本更新

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

结构引导的自适应传播用于蛋白质-蛋白质相互作用位点预测

Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma

发表机构 * Institute of Computing Science and Technology, Guangzhou University(广州大学计算机科学与技术学院) School of Computer Science, Peking University(北京大学计算机科学学院) Information Science & Technology Department, Beijing Capital International Airport Co., Ltd.(北京首都国际机场有限公司信息科学与技术部) School of Information Science and Technology, Dalian Maritime University(大连海事大学信息科学与技术学院)

AI总结 提出SGAP-PPIS模型,利用等变图神经网络的多尺度几何状态生成残基级传播系数,实现自适应信息扩散,在Test_60上取得竞争性能。

Comments 9 pages, 3 figures

详情
AI中文摘要

准确预测蛋白质-蛋白质相互作用位点(PPIS)对于理解细胞过程、疾病机制和治疗靶点发现至关重要。基于图的深度学习通过整合残基级结构上下文推进了PPIS预测。然而,尽管蛋白质界面存在结构和功能异质性,大多数基于图的模型仍依赖固定传播方案,对所有残基一视同仁。这种传播可能限制信息扩散适应局部几何环境的能力,使得难以区分真正的相互作用位点和结构相似的非相互作用邻居。我们提出SGAP-PPIS,一种用于PPIS预测的结构引导自适应传播模型。SGAP-PPIS不使用固定传播机制,而是利用等变图神经网络的多尺度几何状态生成残基级传播系数。这种设计允许每个残基根据其几何微环境自适应地平衡局部特征保留和邻域扩散。实验结果表明,SGAP-PPIS在Test_60上达到了与最先进方法竞争的性能。消融研究表明,几何条件自适应传播、尺度对齐几何引导和多步传播状态表示共同推动了这些改进。

英文摘要

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves competitive performance among the state-of-the-art methods on Test\_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.

2606.01774 2026-06-02 cs.LG cs.AI 版本更新

FLARE: Diffusion for Hybrid Language Model

FLARE: 混合语言模型的扩散方法

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan, Yiran Xu, Wanrong Zhu, Jason Kuen, Koustava Goswami, Rajiv Jain, Yongxin Chen, Molei Tao, Jiuxiang Gu

发表机构 * Adobe Research(Adobe研究院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出FLARE框架,通过结合自回归和扩散目标、硬件感知内核和统一推理,将混合注意力LLM转换为支持并行解码的扩散模型,在保持能力的同时提升吞吐量。

详情
AI中文摘要

自回归(AR)大型语言模型(LLM)已取得广泛的实际成功,但顺序解码仍然是低延迟部署的关键瓶颈。近期的高效推理工作沿着两个方向推进:通过高效架构降低每次模型调用的成本,以及通过并行生成减少串行解码步骤。混合注意力骨干解决了前者,而扩散语言模型(dLLM)通过迭代并行去噪追求后者。结合这些优势仍然具有挑战性:AR到dLLM的转换通常无法保留种子检查点的能力,并且混合注意力循环状态和掩码约束使得扩散训练和服务变得复杂。我们提出了FLARE,一个针对混合注意力LLM的系统转换框架。我们的分析确定迁移数据质量是能力保留的主要决定因素,其重要性超过损失公式和注意力掩码设计。最终框架结合了token等价的AR和扩散目标、硬件感知内核以及统一推理,使得一个检查点能够同时支持AR风格的验证解码和扩散风格的并行去噪。从强大的AR检查点出发,使用有限的训练后数据,FLARE在模型规模上与领先的开源dLLM竞争,并在单GPU并发服务中相比开源dLLM基线实现了持续的吞吐量提升。我们的结果进一步表明,实际dLLM不仅受限于解码算法,还受限于迁移数据质量和当前块扩散目标的训练低效性,这促使我们联合设计数据、目标、架构和推理系统。

英文摘要

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.

2606.01755 2026-06-02 cs.AI cs.CL 版本更新

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign: 迈向个性化大语言模型对齐中的通用真值一致性

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

发表机构 * Department of Data Science & AI, Monash University(数据科学与人工智能系,墨尔本大学) Defence Science and Technology Group, Australia(澳大利亚国防科学与技术集团)

AI总结 针对个性化大语言模型在不同社会群体间存在的通用真值不一致问题,提出TriAlign框架,通过离线多智能体强化学习联合优化真值准确性、跨群体一致性和个性化,实现公平对齐。

详情
AI中文摘要

个性化大语言模型根据用户的偏好和社会属性调整响应,但可能在不同社会群体间引入显著的通用真值不一致性,即某些群体在客观任务上系统性地获得较不准确的响应。现有的对齐方法要么忽略个性化,要么主要关注主观偏好对齐,很大程度上忽视了通用真值的公平性和一致性。为填补这一空白,我们研究了真值不变对齐(TIA),这是一个针对个性化LLM的对齐问题,旨在确保通用真值在不同社会群体间保持一致,同时保留个性化。我们提出TriAlign,这是首个用于TIA的离线多智能体强化学习(MARL)框架,其中每个社会群体被建模为一个交互的智能体。TriAlign通过一个公平感知目标和一个显式的不一致性惩罚,联合优化通用真值准确性、跨群体真值一致性和个性化。跨多个基准的实验表明,TriAlign在这三个目标之间实现了比强基线更强的平衡,减少了跨社会群体的通用真值差异,同时提高了客观任务性能和个性化质量。

英文摘要

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an agent interacting. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

2606.01747 2026-06-02 cs.CL cs.AI 版本更新

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

基于BERT和图神经网络的历史知识图谱构建

Ping Li, Bartlomiej Brzozka

发表机构 * Shandong Management University(山东管理大学) Maria Curie-Sklodowska University(玛丽·居里-斯洛多夫斯卡大学)

AI总结 本文提出结合BERT和图神经网络的高层架构,从历史文本中提取实体和关系,构建知识图谱,在精度、召回率和F1分数上优于传统方法和深度学习基线。

Comments 9 pages, 4 figures

详情
AI中文摘要

通过数字人文研究和规模化历史数据分析,大量传统历史文本被转换为结构化知识图谱。本文提出一种结合双向编码器表示(BERT)和图神经网络(GNN)的高层架构,用于从各类历史文本中提取实体和关系。传统历史文本系统地解决了语言歧义、上下文限制的引用以及缺乏既定语法规范的问题。本研究根据上述建议,开发了一种基于FastRQNet和预训练视觉-语言模型Vilt-qaformer+RoBInet的新型图像检索系统。实验充分利用了市政记录、议会文件和历史信函的全面数据集。与传统基于规则的技术和其他流行的深度学习基线相比,联合BERT-GNN系统获得了更高的精度、召回率和F1分数(表2)。该结构在创建知识图谱时能够以足够的准确性和全面性处理复杂的嵌套结构和隐式引用问题。上述实验表明,将关系图学习算法与上下文敏感的语义表示技术相结合,可以自动提取历史数据,为知识库积累累积的智慧。

英文摘要

Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines bidirectional encoder representations of transformers (BERT) and graph neural networks (GNN) to extract the entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.

2606.01741 2026-06-02 cs.CR cs.AI 版本更新

SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems

SECUREVENT: 面向分布式事件系统的混合AI/ML安全监控

Eric Liang

发表机构 * Oracle

AI总结 提出SECUREVENT架构,结合传统安全机制与在线异常检测、图行为特征、复杂事件策略、联邦学习和对抗ML治理,通过混合AI/CEP监控提高召回率并保持低误报率。

详情
AI中文摘要

分布式事件系统已成为互联网规模发布/订阅服务、物联网遥测、云原生微服务和安全运营管道的常见基础。它们的松散耦合和异步交付提高了可扩展性,但也扩大了攻击面:发布者、代理、订阅者、主题、模式和时间顺序都可能被滥用,而没有一个组件能观察整体行为。本文提出了SECUREVENT,一种用于分布式事件系统的混合AI/ML安全监控架构。该架构将传统保护(如认证传输、主题级授权和签名事件)与在线异常检测、图感知行为特征、复杂事件策略规则、联邦学习和对抗ML治理相结合。对合成事件流攻击的确定性原型研究表明,混合AI/CEP监控可以在保持低误报率的同时提高静态规则的召回率。核心主张并非机器学习取代密码学和访问控制机制,而是当事件流、身份、模式和时间关系过于动态以至于静态控制无法单独应对时,基于模型的安全监控是必要的。

英文摘要

Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native microservices, and security operations pipelines. Their loose coupling and asynchronous delivery improve scalability, but they also expand the attack surface: publishers, brokers, subscribers, topics, schemas, and temporal ordering can each be abused without a single component observing the whole behavior. This paper proposes SECUREVENT, a hybrid AI/ML security-monitoring architecture for distributed event-based systems. The architecture combines traditional protections such as authenticated transport, topic-level authorization, and signed events with online anomaly detection, graph-aware behavioral features, complex-event policy rules, federated learning, and adversarial-ML governance. A deterministic prototype study over synthetic event-stream attacks illustrates how a hybrid AI/CEP monitor can improve recall over static rules while retaining a low false-positive rate. The central claim is not that machine learning replaces cryptographic and access-control mechanisms, but that model-based security monitoring is necessary when event flows, identities, schemas, and timing relationships are too dynamic for static controls alone.

2606.01738 2026-06-02 cs.CL cs.AI 版本更新

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

THRD:一种针对大语言模型越狱攻击的无训练多轮防御框架

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

发表机构 * Beijing Language and Culture University(北京语言大学)

AI总结 提出无训练框架THRD,通过显式建模时间风险累积(包括逐轮风险评估、跨轮意图检测、响应评估和决策模块)防御多轮越狱攻击,将攻击成功率降至0.2-4.0%且模型效用损失小于1.5%。

详情
AI中文摘要

多轮越狱攻击通过利用对话动态(如逐步升级和跨轮协调)对LLM构成日益严重的威胁。现有防御要么依赖昂贵的重新训练(通常会降低模型效用),要么在每一轮独立应用单轮分析,无法捕捉风险沿交互轨迹的累积。我们观察到多轮交互中的安全行为是轨迹依赖的:对话历史不断重塑模型的调节上下文,使得孤立评估每一轮变得不足。基于这一洞察,我们提出THRD,这是第一个显式建模多轮越狱防御中时间风险累积的无训练框架。THRD集成了四个模块:用于即时风险评估的逐轮风险评估器(TRA)、用于跨轮意图升级检测的历史上下文分析器(HCA)、用于识别促进性输出的响应评估器(RE),以及通过带衰减调制和趋势感知调整的时间演化评分机制组合这些信号的决策模块。在两个目标模型上针对最先进的多轮攻击(包括基于树搜索和多智能体协作方法)的实验表明,THRD将攻击成功率降至0.2-4.0%,同时在MMLU和GSM8K上将模型效用退化控制在1.5%以内。消融研究证实了模块的非冗余贡献和稳定的跨架构泛化。对首次拒绝触发器的分析显示,超过70%的多轮攻击需要在第2轮或之后才能检测到,验证了显式时间聚合的必要性。

英文摘要

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining -- often degrading model utility -- or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks -- including tree-search-based and multi-agent collaborative methods -- across two target models show that THRD reduces ASR to 0.2--4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn~2 or later to detect, validating the necessity of explicit temporal aggregation.

2606.01737 2026-06-02 cs.AI 版本更新

TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

TrafficRAG:用于交通事故责任认定的多模态RAG框架

Xu Li, Zedong Fu, Xinyi Li, Xun Han

发表机构 * Southwest Petroleum University(西南石油大学) Sichuan Police College(四川警察学院)

AI总结 提出TrafficRAG框架,通过视觉语言模型生成结构化描述、混合检索获取法规和案例、大语言模型融合多模态证据进行推理,实现自动化交通事故责任分析报告生成。

Comments 12 pages, 3 figures, accepted at ICANN 2026

详情
AI中文摘要

交通事故责任分析是智能交通和法律辅助中一项关键但具有挑战性的任务。现有方法通常存在效率低、主观判断和不一致的分析结果等问题。同时,大语言模型受到噪声视频输入和法律领域知识不足的限制。为了解决这些问题,本文提出了TrafficRAG,一个用于自动化交通事故分析和报告生成的多模态检索增强框架。具体来说,该框架首先采用视觉语言模型生成事故场景的结构化文本描述,作为准确的检索查询。基于这些文本查询,采用结合BM25稀疏检索和稠密嵌入检索的混合检索策略来获取相关交通法规和类似历史案例。最后,大语言模型整合检索到的法律知识和多模态事故证据进行综合推理,生成标准化、有法律依据的责任分析报告。大量实验表明,TrafficRAG始终优于基线方法,实现了77.32%的法律规范适配准确率、81.71%的事实忠实度以及5.48%的责任比例平均绝对误差。结果验证了通过检索增强将多模态事实证据与法律条款相结合,可以有效提高交通事故责任认定的可靠性和准确性。

英文摘要

Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.

2606.01725 2026-06-02 cs.AI cs.LG 版本更新

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

基于迹驱动仿真的通用任务多模型智能体AI系统特征分析

Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) SK Hynix(SK海力士) KAIST(韩国科学技术院)

AI总结 本文提出GAIATrace数据集和Vidur-Agent仿真器,通过迹驱动仿真分析多模型智能体AI系统在通用任务上的行为特征。

Comments 13 pages, 18 figures, 2 tables

详情
AI中文摘要

智能体AI通过迭代规划、工具使用和基于观察结果的推理来完成任务。尽管其流行,但其系统级行为仍然知之甚少,特别是对于复杂数据集和智能体架构——由于高度非确定性执行、高昂的评估成本以及对专有模型的有限可见性。本文提出了GAIATrace,这是两个最先进的智能体系统(MiroThinker和OWL)运行GAIA(一个由异构通用任务组成的基准测试)的首个token级迹数据集。与先前的迹数据集不同,GAIATrace捕获了完整的推理token、任务级结构以及每个主要参与LLM的活动,从而支持深入的系统研究。作为数据集的补充,我们提出了Vidur-Agent,一个迹驱动的仿真器,可以重放GAIATrace以在多种模拟环境中进行可重复、低成本的系统评估。利用这两个工件,我们描述了现代智能体系统如何处理通用任务以及各种系统设计选择如何塑造其行为,得出了若干独特的发现。

英文摘要

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures-owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and activities of every major participating LLMs, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.

2606.01723 2026-06-02 cs.LG cs.AI 版本更新

Shortcut to Nowhere: Demystifying Deep Spurious Regression

捷径通往虚无:揭秘深度虚假回归

Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Rutgers University(罗格斯大学) Yang AI Lab(杨人工智能实验室)

AI总结 针对连续预测中的虚假相关性,提出利用标签和特征空间中虚假属性的相似性来校准分布,从而提升模型在分布偏移下的泛化能力。

详情
AI中文摘要

现实世界中的回归常常存在捷径:在训练中与连续目标虚假相关的属性,在部署偏移下不可靠;使用此类捷径回归目标可能在测试时灾难性失败。现有关于虚假相关性的研究主要关注分类,其中标签是分类的且组是自然定义的。然而,许多现实任务需要连续预测,其中不存在硬标签边界或离散的组-标签对。我们将深度虚假回归(DSR)定义为从具有属性-标签混淆的回归数据中学习,处理连续虚假相关性,并在测试时泛化到所有属性-标签组合。受分类和回归捷径内在差异的启发,我们提出利用标签和特征空间中虚假属性之间的相似性,从而在跨属性校准标签和学习特征分布时考虑邻近目标和相关组。在涵盖计算机视觉、环境感知和大语言模型(LLM)回归的常见真实世界DSR数据集上的大量实验验证了我们策略的优越性能。我们的工作填补了研究连续预测中虚假相关性的基准和技术空白。

英文摘要

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

2606.01722 2026-06-02 cs.LG cs.AI cs.DC 版本更新

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

后确定性分布式系统:可信自主基础设施的新基础

Jun He, Deying Yu

发表机构 * OpenKedge Inc.(OpenKedge公司)

AI总结 本文提出后确定性分布式系统(PDDS)模型,以协调确定性代码、随机模型和自主代理共存的异构环境,并定义了五大架构支柱及新的故障分类。

Comments 8 pages, 1 table

详情
AI中文摘要

几十年来,分布式系统通常假设正确的参与者执行协议指定的行为,具有稳定、外部定义和确定性的语义。经典理论广泛参数化了网络时序、通信拓扑和故障域,但参与者模型相对固定。将自主推理引擎、随机模型驱动代理和策略驱动参与者集成到云控制平面、事件响应系统和金融基础设施中,挑战了这一假设的普遍性。这些代理通常产生不同的推理路径、不同的操作轨迹和异构的内部表示,同时实现语义等价且正确的结果。在本文中,我们引入后确定性分布式系统(PDDS)作为研究和工程模型,用于协调确定性代码、随机模型和自主代理共存的异构环境。我们表明,经典分布式计算模型构成了这种参与者通用模型的零歧义特例。我们并非主张确定性系统消失;而是确定性执行不能再作为自主基础设施的通用参与者假设。最后,我们概述了后确定性基础设施的五大架构支柱:协议驱动开发、可验证代理基础设施、自主状态控制平面、语义法定保证和认知状态复制。认知状态复制将持久性和一致性模型从数据可见性扩展到知识可见性,实现代理记忆、可验证语义回滚以及跨推理参与者的连贯性。我们还定义了在此环境中出现的故障类别的分类法。

英文摘要

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

2606.01719 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Fair Finetuning Mitigates Distribution Inference Attacks

公平微调缓解分布推断攻击

Rakshit Naidu

发表机构 * Rakshit Naidu

AI总结 提出公平微调(FFt)方法,通过在等几率约束下对互补分布样本进行微调,将模型公平性指标与分布推断攻击中的对抗优势联系起来,并给出理论界限,实验证明能有效降低攻击成功率。

Comments 16 pages (11 main, 5 appendix)

详情
AI中文摘要

在敏感数据上训练的机器学习模型可能会无意中泄露其训练分布的群体级信息——这种威胁被称为分布推断攻击(DIA)。具有黑盒访问权限的对手可以在不直接观察任何训练数据的情况下推断敏感的人口统计属性,如子群比例。尽管已经提出了差分隐私和属性遗忘等防御措施,但公平性约束与分布泄漏之间的联系尚未被探索。我们提出了公平微调(FFt):在等几率(EO)约束下,对来自互补分布的样本进行微调。我们提供了完整的理论刻画,证明了紧界 $ ext{Adv}(\mathcal{A},M_f) \le Δ_{ ext{EO}} \cdot W$,其中 $W$ 量化了两个训练分布通过其敏感属性组成的可区分程度。我们还建立了FFt降低对抗优势的必要条件,并证明了该界的紧性。我们在六个数据集上进行了评估,涵盖表格数据(ACS Income、COMPAS、German Credit)、图像数据(UTKFaces)和自然语言处理数据(Bias in Bios)。基于重演的FFt在所有设置中一致地将对抗准确率差距降低到检测阈值 $τ=0.1$ 以下;在ACS Income上,差距从约15%下降到4%以下。我们的工作提供了第一个将模型测量的EO差异直接与其在DIA博弈中的对抗优势联系起来的正式界限,为统一的公平性和隐私防御开辟了新途径。

英文摘要

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions -- a threat known as distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le Δ_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $τ!=!0.1$ across all settings; on ACS Income, the gap falls from $\sim!15%$ to under $4%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.

2606.01708 2026-06-02 cs.LG cs.AI 版本更新

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

随机极小极大树的双保真度最优动作识别

Peter Chen, Xi Chen

发表机构 * Department of Mathematics, Columbia University(哥伦比亚大学数学系) Stern School of Business, New York University(纽约大学斯特恩商学院)

AI总结 针对随机极小极大树中的固定置信度最优动作识别问题,提出双保真度树搜索算法2FFS,结合极小极大快速扩展与MCTS随机采样,自适应选择廉价有偏评估或昂贵精确评估,理论证明固定置信度正确性、有限停止及多项式深度成本上界,实验表明比现有BAI-MCTS基线显著减少样本和计算。

Comments 36 pages

详情
AI中文摘要

我们研究随机极小极大树中的固定置信度最优动作识别(BAI)。该问题在现代AI规划中日益重要,其中深度极小极大搜索和带有语言模型长滚动的蒙特卡洛树搜索(MCTS)面临一个基本权衡:启发式评估廉价但有偏,而精确滚动可靠但代价高昂。我们提出2FFS,一种双保真度树搜索算法,将多保真度平面赌博机思想引入树中。该算法结合了极小极大风格的快速扩展和MCTS风格的随机采样,自适应地决定何时利用廉价有偏评估以及何时调用昂贵精确评估进行局部认证。我们证明了固定置信度正确性,建立了精确识别的有限停止性,并给出了通用深度树的多项式深度成本上界。在数值随机树实验中,与现有BAI-MCTS基线相比,2FFS使用的样本和计算操作显著减少。

英文摘要

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with language model long rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations comparing to existing BAI-MCTS baseline.

2606.01703 2026-06-02 cs.SD cs.AI cs.CV 版本更新

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

JenBridge: 跨场景转换的自适应长视频配乐

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

发表机构 * Jen Music AI

AI总结 提出JenBridge框架,通过基于Transformer的生成模型、双文本-视觉条件对齐和LLM代理驱动的自适应过渡机制,实现长视频配乐的高保真生成与场景转换自然连贯。

详情
AI中文摘要

我们解决了在场景转换中生成高保真、长格式配乐并保持连贯性的挑战。现有的AI音乐系统主要针对短片段设计,缺乏确保叙事连续性的机制。我们提出了JenBridge,一个模块化且可解释的自适应长视频配乐框架,确保高保真音频生成和转换自然性。核心架构是一个基于Transformer的生成模型,采用流匹配目标训练,遵循两阶段范式:在大规模文本-音频语料库上进行预训练以建立稳健的音乐先验,然后通过双文本-视觉条件适应视频领域以实现精确的跨模态对齐。关键的是,为了实现跨不同场景变化的长格式连贯性,JenBridge引入了一种新颖的自适应过渡机制。该系统具有一个多功能的过渡风格工具包,包括一种生成式过渡方法,并独特地采用了一个大型语言模型(LLM)代理,作为导演智能地为每个叙事转变选择最合适的过渡。为了严格评估这一任务,我们提出了LVS基准,这是一个新基准,包含一个精选数据集和新的评估指标,侧重于整体和过渡感知评估。在提出的基准上进行的大量实验表明,JenBridge在客观和主观指标上均显著优于现有方法,特别是在转换自然性和整体叙事连贯性方面。JenBridge代表了向全自动、专业质量的视频配乐迈出的重要一步。

英文摘要

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

发表机构 * Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA(电气与计算机工程系,信息处理实验室,华盛顿大学,美国)

AI总结 针对热行人多目标跟踪中身份碎片化问题,提出轻量级后处理方法,通过在线短间隙重映射和离线轨迹重链接恢复身份连续性,在PBVS热行人MOT基准上提升IDF1。

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419
AI中文摘要

热行人多目标跟踪仍然具有挑战性,因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始,我们添加了一个模块化的身份修复后端,包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明,主要身份增益来自保守的重链接,将IDF1从82.25提升到84.93,同时保持MOTA,而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明,在低信息热图像中,通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析,表明与局部帧到帧关联相比,场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.

2606.01689 2026-06-02 cs.CV cs.AI 版本更新

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

RPCASSM: 基于鲁棒主成分分析的状态空间模型用于红外小目标检测

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University(教育部符号计算与知识工程重点实验室) College of Software, Jilin University(吉林大学软件学院) School of Geosciences, Yangtze University(长江大学地球科学学院) College of Communication Engineering, Jilin University(吉林大学通信工程学院)

AI总结 针对红外小目标检测中主流状态空间模型难以准确建模目标边缘的问题,提出基于鲁棒主成分分析(RPCA)的RPCASSM网络,通过设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)分别利用空间异质信号显著性和目标稀疏局部高亮特性进行状态空间建模,有效解决了边缘建模难题。

Comments 12 pages, 8 figures, under review

详情
AI中文摘要

红外小目标的检测与分割在监控安防、海上救援等领域具有重要的应用意义。由于这些目标在远距离成像中占据像素少,主流的视觉状态空间模型效率低下且难以准确建模目标边缘。现有的红外状态空间模型并未从红外小目标的结构特性出发偏离主流视觉状态空间结构框架。为了解决这一问题,本文基于鲁棒主成分分析(RPCA)的模型范式提出了RPCASSM网络,旨在通过红外小目标在空间域的性质设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)。BSSM旨在利用空间异质信号的显著性设计空间探测扫描机制(SPCM)来建模背景信息。TSSM利用目标的稀疏性和局部高亮特性设计可变形提示扫描机制(DPCM),聚焦于目标的可变形空间进行状态空间建模。通过上述设计,我们有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了RPCASSM设计的有效性。我们的代码将在\href{https://github.com/PepperCS/RPCASSM}{RPCASSM}公开。

英文摘要

The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security, maritime rescue and so on. Due to the low occupancy of these targets in long-distance imaging, the mainstream visual state space model is inefficient and difficult to accurately model the target edge. The existing infrared state space models do not deviate from the mainstream visual state space structure framework from the structural properties of infrared small targets. In order to solve this problem, this paper proposes the RPCASSM network based on the model paradigm of robust principal component analysis(RPCA), which aims to design the background state space module(BSSM) and the target state space module(TSSM) by the nature of the infrared small target in the spatial domain. The BSSM aims to use the saliency of spatial heterogeneous signals to design a spatial probe scanning mechanism(SPCM) to model background information. The TSSM designs a deformable prompt scanning mechanism(DPCM) by using the sparsity and local highlight of the target to focus on the deformable space of the target for state space modeling. According to the above design, we effectively solve the problem that the existing mainstream vision state space model is difficult to accurately model the edge structure of infrared small target. Experimental results on the existing benchmark data sets prove the effectiveness of the RPCASSM design. Our code will be made public at \href{https://github.com/PepperCS/RPCASSM}{RPCASSM}.

2606.01686 2026-06-02 cs.SD cs.AI 版本更新

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

HAIM: 用于AI音乐制作跟踪基准的人机音乐数据集

Seonghyeon Go, Yumin Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 针对当前AI音乐检测局限于二元分类的不足,提出HAIM数据集,通过多阶段标签定义“AI音乐跟踪”任务,评估现有检测器缺陷,推动向细粒度结构化评估转变。

详情
AI中文摘要

随着Suno和Udio等生成平台达到人类级音频质量,AI的实用性已扩展到整个音乐制作流程。除了简单的音轨生成,这些进步催生了AI驱动方法在各种形式中的应用,包括声音合成、编曲和专业母带处理。然而,当前的检测研究仍主要局限于二元“AI或人类”范式,未能反映当代音乐制作流程的现实。在真实制作中,AI工具越来越多地被用于优化或母带处理人类制作的音轨,而人类工程师同样对AI生成的材料进行后处理以确保专业质量。此外,用户经常采用对抗策略绕过AI检测器,例如对AI生成的音轨应用人类母带处理。这创造了一个简单的二元分类无法捕捉的灰色地带。在本文中,我们定义并研究“AI音乐跟踪”:在音乐制作的多面光谱中识别特定AI集成的挑战。为此,我们引入HAIM,一个具有音乐制作阶段多样化标签的数据集。它旨在隔离AI干预的阶段,包括混合制作和代理级跟踪。我们对最先进检测器的评估揭示了系统性缺陷。通过发布HAIM,我们提出了一个新的基准,将领域从二元分类转向对AI音乐的细粒度结构化评估。

英文摘要

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary `AI-or-human' paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate ``AI Music Tracking'': the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

现成的大语言模型作为过程评分器:数学推理中PRM的无训练替代方案

Atoosa Chegini, Soheil Feizi

发表机构 * Department of Computer Science, University of Maryland(马里兰大学计算机科学系)

AI总结 提出Chunk-Level Guided Generation方法,利用现成的大语言模型作为过程评分器,通过固定长度块评分和对比选择规则,无需训练即可在数学推理中匹配或超越PRM引导搜索的性能。

详情
AI中文摘要

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略,但当小模型已经陷入错误推理路径时,该策略会失败。PRM引导搜索通过在生成过程中对候选延续进行评分来避免这一问题,但需要经过步骤级标签训练的奖励模型。我们提出Chunk-Level Guided Generation,一种无训练的替代方案,使用现成的大语言模型作为过程评分器。在每一步,小模型采样k个固定长度的候选块,而大模型使用似然度对候选块进行评分,无需生成任何文本。选中的块在下一步之前被提交,从而在错误传播之前引导生成。我们用两种选择规则实例化该框架:似然引导选择(LGS),选择具有最高长度归一化大模型对数概率的块;以及对比引导选择(CGS),减去小模型的对数概率,以偏向于大模型偏好与小模型偏好不同的块。我们证明,由于系统性的长度偏差(即使在长度归一化后仍然存在),使用大模型似然度对可变长度推理步骤进行评分是不可靠的,而固定长度块避免了这一混淆。在GSM8K、MATH、Minerva Math、AMC23和AIME24上,使用Qwen2.5-32B引导Qwen2.5-1.5B以及Llama-3.1-70B引导Llama-3.2-1B,CGS在多数投票上最多提升28个百分点,并且在匹配的引导预算下,在大多数基准测试中匹配或超越了Qwen2.5-Math-PRM-72B引导搜索,且无需奖励模型训练。使用Qwen2.5-72B引导Qwen2.5-7B,CGS在k=16时在MATH上达到81.8%,在Minerva Math上达到63.6%,超过多数投票4-6个百分点。最后,Chunk-Level Guided Generation产生的推理轨迹比PRM引导搜索短得多。

英文摘要

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.

2606.01670 2026-06-02 cs.IR cs.AI 版本更新

Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation

基于偏好解耦的时间感知扩散用于生成式推荐

Bangguo Zhu, Peng Huo, Yuanbo Zhao, Zhicheng Du, Jun Yin, Senzhang Wang

发表机构 * Central South University(中南大学) National Super Computing Center(国家超算中心) Renmin University of China(中国人民大学) Hong Kong Polytechnic University(香港理工大学)

AI总结 针对现有扩散生成式推荐模型忽略用户偏好时间非平稳分布的问题,提出TDPM框架,通过将用户偏好解耦为长期周期偏好和短期点状偏好并融入扩散过程,在三个数据集上HR@20和NDCG@20平均提升29.21%和25.45%。

详情
AI中文摘要

最近,生成式推荐(GRs)通过用语义索引(SIDs)取代传统项目ID,成为一种变革性的推荐范式。由于扩散模型卓越的生成能力,一些开创性工作探索了以扩散架构为骨干开发GRs。然而,现有基于扩散的GRs的一个致命限制是扩散过程统一应用于历史交互中的所有项目。相比之下,用户偏好由多方面的时变因素塑造,因此在时间维度上呈现非平稳分布。为弥补这一差距,本研究提出一种新颖的GR框架,名为TDPM,通过在SID令牌上设计时间感知扩散。具体而言,TDPM将时变用户偏好的影响明确整合到扩散过程中。详细地,用户偏好被解耦为(i)长期一致的周期偏好和(ii)由近期焦点事件触发的点状偏好。在三个公开真实数据集上的大量实验表明,TDPM显著优于最先进的基线模型。TDPM在HR@20和NDCG@20上分别实现了平均高达29.21%和25.45%的提升。消融研究进一步强调了基于扩散的GRs中时间感知令牌扩散的必要性。

英文摘要

Recently, Generative Recommenders (GRs) have emerged as a transformative recommendation paradigm by replacing traditional item IDs with semantic indices (SIDs). Owing to the exceptional generative capabilities of diffusion models, a few pioneering works explore developing GRs with diffusion architectures as the backbone. However, a fatal limitation of existing diffusion-based GRs is that the diffusion process applies uniformly to all items within the historical interactions. In contrast, the user preference is shaped by multifaceted time-evolving factors and thus exhibits a non-stationary distribution in the temporal aspect. To bridge this gap, this study proposes a novel GR framework, named TDPM, by designing the time-aware diffusion on SID tokens. Specifically, TDPM explicitly integrates the impact of time-evolving user preferences into the diffusion process. In detail, the user preference is disentangled into (i) the period preference, which remains consistent over a long time-span, and (ii) the point preference, which is triggered by recent focal events. Extensive experiments on three public real-world datasets demonstrate the significant superiority of TDPM over the state-of-the-art baselines. TDPM achieves average improvements of up to 29.21% and 25.45% in terms of HR@20 and NDCG@20, respectively. The ablation study further underscores the necessity of time-aware token diffusion in diffusion-based GRs.

2606.01666 2026-06-02 cs.LG cs.AI 版本更新

DOT-MoE: Differentiable Optimal Transport for MoEfication

DOT-MoE:用于MoE化的可微最优传输

Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出DOT-MoE框架,通过可微最优传输将密集层分解为专家,联合学习神经元分配和路由策略,在减少50%活跃参数的同时保留90%原始性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)的扩展带来了显著的性能提升,但也造成了推理效率方面的重大挑战。虽然混合专家(MoEs)架构通过将模型大小与推理成本解耦来解决这一问题,但从头训练MoEs通常不稳定且计算密集。将预训练的密集模型转换为稀疏MoEs已成为一种替代方案;然而,现有方法通常依赖启发式神经元聚类或随机分割来将前馈网络(FFN)划分为专家。在这项工作中,我们提出了DOT-MoE,一种新颖的框架,将密集层的分解建模为可微最优传输(DOT)问题。与静态启发式方法不同,我们将神经元分配建模为平衡传输问题,利用可微的Sinkhorn-Knopp迭代来强制执行严格的专家容量约束。此外,我们利用直通估计器(STE)来联合学习离散的神经元到专家的分配和令牌到专家的路由策略。跨多个架构和基准的大量实验表明,DOT-MoE显著优于结构化剪枝、启发式聚类和随机分割基线,在减少50%活跃参数的同时保留了原始密集模型90%的性能。

英文摘要

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.

2606.01655 2026-06-02 math.OC cs.AI cs.LG stat.ML 版本更新

MINTS: Minimalist Thompson Sampling

MINTS: 极简汤普森采样

Kaizheng Wang

发表机构 * Department of IEOR and Data Science Institute, Columbia University(工业工程与数据科学学院,哥伦比亚大学)

AI总结 针对贝叶斯方法在复杂结构约束下的局限性,提出一种仅对最优位置设置先验、通过轮廓似然消除冗余参数的极简贝叶斯框架,并实例化为MINTS算法,在均值约束多臂老虎机中实现近最优非渐近遗憾保证和精确几乎必然渐近遗憾刻画。

Comments 29 pages

详情
AI中文摘要

贝叶斯范式为不确定性下的序贯决策提供了原则性工具,但其对所有参数依赖概率模型的做法会阻碍复杂结构约束的纳入。我们提出一种极简贝叶斯框架,仅对最优位置设置先验,同时通过轮廓似然消除冗余参数。这产生了一个自然适应结构约束的广义后验。作为直接实例,我们开发了极简汤普森采样(MINTS)。对于具有均值约束的多臂老虎机,我们建立了近最优的非渐近遗憾保证和精确的几乎必然渐近遗憾刻画。特别地,MINTS在无结构设置中达到了经典的Lai-Robbins常数,并自动适应单峰结构,达到仅由最优臂的紧邻所确定的精确常数。

英文摘要

The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the location of the optimum, while eliminating nuisance parameters through profile likelihood. This yields a generalized posterior that naturally accommodates structural constraints. As a direct instantiation, we develop MINimalist Thompson Sampling (MINTS). For multi-armed bandits with mean constraints, we establish near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations. In particular, MINTS attains the classical Lai--Robbins constant in the unstructured setting and automatically adapts to unimodal structure, achieving the sharp constant determined only by the immediate neighbors of the optimal arm.

2606.01640 2026-06-02 cs.AI cs.CL 版本更新

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

MobEvolve:用于可解释人类移动性生成的智能体自进化启发式系统

Junlin He, Yihong Tang, Tong Nie, Ao Qu, Yuebing Liang, Hamzeh Alizadeh, Bang Liu, Wei Ma, Lijun Sun

发表机构 * The Hong Kong Polytechnic University(香港理工大学) McGill University(麦吉尔大学) MIT(麻省理工学院) Tsinghua University(清华大学) Autorité régionale de transport métropolitain(大都会交通地区管理局) Université de Montréal(蒙特利尔大学) Mila – Quebec AI Institute(魁北克人工智能研究所)

AI总结 提出MobEvolve,首个智能体自进化启发式框架,通过LLM代理迭代演化内部逻辑,在保持可解释性和推理效率的同时,在个体轨迹保真度、群体分布对齐和行为合理性上超越现有方法。

详情
AI中文摘要

人类移动性生成旨在根据个体特征为目标人群合成真实的出行链。现有范式,包括深度生成模型、基于LLM的方法和传统启发式方法,难以同时满足该任务的复杂需求,同时保持可解释性、行为合理性、群体级分布对齐和推理效率。为弥合这一差距,我们引入了MobEvolve,这是首个用于人类移动性生成的智能体自进化启发式框架。MobEvolve初始化一个行为启发的启发式系统,并利用LLM代理迭代演化其内部逻辑。通过在验证集上诊断经验性错位和失败案例,代理提出有针对性的更新并积累演化记忆以实现累积性自我改进。在新加坡和蒙特利尔基准上的广泛评估表明,MobEvolve在个体轨迹保真度、群体级分布对齐和行为合理性方面显著优于最先进的深度生成和基于LLM的方法,同时保持可解释性和高推理效率。

英文摘要

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maintaining interpretability, behavioral plausibility, population-level distributional alignment, and inference efficiency. To bridge this gap, we introduce MobEvolve, the first agentic self-evolving heuristic framework for human mobility generation. MobEvolve initializes a behavior-inspired heuristic system and employs an LLM agent to iteratively evolve its internal logic. By diagnosing empirical misalignments and failure cases on a validation set, the agent proposes targeted updates and accumulates evolution memory for cumulative self-improvement. Extensive evaluations on the Singapore and Montreal benchmarks demonstrate that MobEvolve significantly outperforms state-of-the-art deep generative and LLM-based methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility, while preserving interpretability and high inference efficiency.

2606.01635 2026-06-02 cs.CL cs.AI 版本更新

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

AlphaToken: 在LLM后训练中解耦适应性与稳定性的路径感知响应令牌估值

Liu Qing, Ou Wu, Yi Du

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院)

AI总结 提出AlphaToken框架,通过解耦适应性(促进目标任务学习)和稳定性(保持预训练能力)并引入路径感知机制,利用Fisher漂移代理和Ghost点积扩展实现高效令牌估值,从而在微调和偏好优化中屏蔽低价值令牌,提升后训练性能并缓解灾难性遗忘。

详情
AI中文摘要

令牌选择对于有效的LLM后训练至关重要。然而,现有方法大多依赖局部启发式,很少将令牌选择形式化为对单个响应令牌的原则性估值。我们引入了$\textbf{AlphaToken}$,一个响应令牌估值框架,它将估值解耦为$\textbf{适应性}$(促进目标任务学习)和$\textbf{稳定性}$(保持预训练能力),并通过结合局部令牌梯度的直接路径信号与自回归生成中的下游因果路径信号,使每个目标具有$\textbf{路径感知}$性。由于保留数据通常不可用,AlphaToken通过锚定在预训练参考模型上的$\textbf{Fisher漂移代理}$来近似稳定性。为了高效计算,我们将Ghost点积扩展到令牌级估值。AlphaToken在微调和偏好优化过程中屏蔽低价值响应令牌,将训练信号集中在更有价值的位置。实验表明,AlphaToken提高了后训练性能并缓解了灾难性遗忘。

英文摘要

Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce $\textbf{AlphaToken}$, a response token valuation framework that decouples valuation into $\textbf{adaptation}$ (promoting target-task learning) and $\textbf{stability}$ (preserving pre-trained capabilities), and makes each objective $\textbf{path-aware}$ by combining the direct-path signal from local token gradients with the downstream causal-path signal in autoregressive generation. Since retention data are typically unavailable, AlphaToken approximates stability via a $\textbf{Fisher-drift proxy}$ anchored at the pre-trained reference model. For efficient computation, we extend Ghost Dot-Product to token-level valuation. AlphaToken masks low-value response tokens during fine-tuning and preference optimization, concentrating training signals on more valuable positions. Experiments show that AlphaToken improves post-training performance and mitigates catastrophic forgetting.

2606.01634 2026-06-02 cs.LG cs.AI 版本更新

E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation

E4GEN:事件级可解释的极端增强时间序列生成

Lin Jiang, Dahai Yu, Ximiao Li, Guang Wang

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出E4GEN可解释扩散框架,通过E-Activator、E-Predictor和E-Control三个组件实现事件级极端事件可控生成,在整体保真度、极端事件保真度和下游效用上优于现有方法。

Comments 48 pages,26 figures

详情
AI中文摘要

生成逼真的时间序列对于科学研究和实际应用至关重要。然而,现有方法通常强调整体分布保真度,而未能忠实捕捉极端事件。为了推进现有研究,我们提出了E4GEN,一个用于极端事件感知时间序列生成的可解释扩散框架。E4GEN通过三个关键组件提供了关于何时、什么以及如何控制极端事件生成的系统见解。首先,E-Activator在去噪过程中学习数据集自适应的极端控制信号激活步骤,而不干扰常规时间成分,包括趋势和季节性。其次,E-Predictor通过自驱动语义预测确定要强制执行的控制信号,其中每个样本通过推断生成过程中的潜在极端事件信息来导出其自身的控制信号。它还包括一种新颖的数据条件训练、噪声初始化采样机制,以解决训练标签不可用的问题。第三,E-Control通过可训练的极端控制网络指定如何控制极端事件生成,该网络将语义控制信号转换为逐层信号并将其注入去噪过程。我们在六个数据集上使用17个指标评估了E4GEN,大量实验表明,E4GEN在多个维度上优于最先进的模型,包括整体保真度、极端事件保真度和下游效用。

英文摘要

Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasize overall distributional fidelity while failing to faithfully capture extreme events. To advance existing research, we propose E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. E4GEN provides systematic insights into when, what, and how to control extreme-event generation through three key components. First, E-Activator learns the dataset-adaptive extreme-control signal activation step during the denoising process without interfering with regular temporal components, including trend and seasonality. Second, E-Predictor determines what control signal to enforce through Self-Driven Semantic Prediction, where each sample derives its own control signal by inferring latent extreme-event information during generation. It also includes a novel Data-Conditioned Training, Noise-Initiated Sampling mechanism to address the issue of unavailable training labels. Third, E-Control specifies how to control extreme-event generation through a trainable Extreme Control Network, which transforms the semantic control signal into layer-wise signals and injects it into the denoising process. We evaluate E4GEN on six datasets with 17 metrics, and extensive experiments show that E4GEN outperforms state-of-the-art models across multiple dimensions, including overall fidelity, extreme-event fidelity, and downstream utility.

2606.01632 2026-06-02 cs.GT cs.AI 版本更新

A Framework for Graph-Conditioned Hierarchical Shapley Attribution in Patent Valuation

图条件分层Shapley归因专利估值框架

Joy Bose

发表机构 * Independent Researcher(独立研究者)

AI总结 提出PatentXAI框架,将专利估值建模为可解释AI问题,利用知识图谱中的马尔可夫毯限制联盟规模,通过分层Shapley值实现高效且可解释的利润分配。

详情
AI中文摘要

估计一个包含数万项专利的产品中单项专利的经济贡献是知识产权经济学中一个长期未解决的问题。我们提出PatentXAI,一个将专利估值视为可解释AI问题的框架:给定一个特征函数v(S)编码专利子集S可实现的收入,专利的Shapley值以满足效率、对称性、虚拟性和可加性的方式衡量其对产品利润的公平份额。为了使计算可行,我们将每个专利的联盟限制在知识图谱中的马尔可夫毯内,基于C-SVE条件独立定理(Li et al., 2020)。使用帕累托分布覆盖图从n=12到n=100项专利的规模实验报告,在n=100时中位马尔可夫毯大小为n的32.9%,90百分位毯大小为n的55.2%,每项专利运行时间为10毫秒。与n=12时精确真实值的差异为0.088;与n=100时高样本蒙特卡洛参考值的差异为0.062±0.003。一个密集组件实验表明,当80%的专利共享一个组件时,毯正确扩展以覆盖该密集簇,与参考值的差异降至0.039,因为合并计算在同质组合上变得更准确。利润分配分层进行:精确Shapley将总利润分配给宏观组件,然后中心性加权Shapley将每个组件预算分配给覆盖专利。从真实数据估计v(S)是主要的开放问题;我们将此与计算贡献区分开来,并概述了使用公共ETSI、USPTO和Lens.org数据集进行实证验证的具体路线图。

英文摘要

Estimating the economic contribution of a single patent inside a product that embodies tens of thousands of patents is a long-standing unsolved problem in intellectual property economics. We propose PatentXAI, a framework that treats patent valuation as a problem of explainable AI: given a characteristic function v(S) encoding the revenue achievable by patent subset S, a patent's Shapley value measures its fair share of product profit in a way that satisfies efficiency, symmetry, dummy, and additivity. To make computation tractable we restrict each patent's coalition to its Markov Blanket inside a knowledge graph, grounded in the C-SVE conditional independence theorem (Li et al., 2020). Scaling experiments from n=12 to n=100 patents using Pareto-distributed coverage graphs report median Markov Blanket size of 32.9 percent of n at n=100, with 90th-percentile blanket size of 55.2 percent of n, and runtime of 10 milliseconds per patent. Difference against exact ground truth at n=12 is 0.088; difference against a high-sample Monte Carlo reference at n=100 is 0.062 plus or minus 0.003. A dense-component experiment shows that when 80 percent of patents share one component, the blanket correctly expands to cover that dense cluster, and the difference versus reference falls to 0.039 because the pooled computation becomes more accurate on homogeneous portfolios. Profit allocation proceeds hierarchically: exact Shapley distributes total profit among macro-components, then centrality-weighted Shapley distributes each component budget among covering patents. Estimating v(S) from real data is the primary open problem; we distinguish this from the computational contribution and outline a concrete roadmap for empirical validation using public ETSI, USPTO, and Lens.org datasets.

2606.01628 2026-06-02 q-bio.BM cs.AI 版本更新

Demystifying Multimodal Biomolecular Co-design With Intrinsic Geodesic Coupling

揭示具有内在测地耦合的多模态生物分子协同设计

Keyue Qiu, Xintong Wang, Zhilong Zhang, Hao Zhou, Wei-Ying Ma

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对生物分子协同设计中模态间时间耦合被忽视的问题,提出GeoCoupling框架优化异构模态的时间耦合,在基于结构的药物设计和无条件蛋白质设计中提升物理有效性和多样性。

Comments Accepted to ICML 2026

详情
AI中文摘要

蛋白质和小分子配体等生物分子在生物系统中发挥核心作用,这源于序列与三维结构之间的紧密相互作用。最近的生物分子协同设计生成模型旨在通过联合建模耦合模态来捕捉这种相互作用。然而,现有方法大多采用并行执行边际生成过程,隐式地强制固定同步耦合。我们认为,一个关键但被忽视的自由度在于这些边际过程在训练和生成过程中如何时间耦合,不恰当的耦合会引入高方差监督和不一致的中间状态,影响模态一致性。为了解决这个问题,我们引入了GeoCoupling,一个优化异构模态之间时间耦合的系统框架。在基于结构的药物设计和无条件蛋白质设计上的实证结果表明,学习到的耦合始终优于同步和随机耦合基线,产生了具有改进的物理有效性和多样性的生物分子。

英文摘要

Biomolecules such as proteins and small-molecule ligands play a central role in biological systems, arising from the tight interplay between sequence and three-dimensional structure. Recent generative models for biomolecular co-design aim to capture this interplay by jointly modeling coupled modalities. However, existing approaches largely adopt a parallel execution of marginal generative processes, implicitly enforcing fixed synchronous coupling. We argue that a critical but overlooked degree of freedom lies in how these marginal processes are temporally coupled during training and generation, where inappropriate coupling can introduce high-variance supervision and inconsistent intermediate states, affecting modality consistency. To address this, we introduce GeoCoupling, a systematic framework that optimizes for temporal couplings between heterogeneous modalities. Empirical results across structure-based drug design and unconditional protein design demonstrate the learned couplings consistently outperform synchronous and randomly coupled baselines, yielding biomolecules with improved physical validity and diversity.

2606.01617 2026-06-02 cs.CL cs.AI 版本更新

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

EvoPool: 面向标签高效专业监督的进化式程序化标注

Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang

发表机构 * Oregon State University(俄勒冈州立大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 提出进化多智能体框架EvoPool,通过程序化标注器迭代进化与投票聚合,在低标注成本下显著提升专业领域监督性能。

Comments 39 pages, 7 figures. Code: https://github.com/tianyi0216/EvoPool

详情
AI中文摘要

大型语言模型在通用任务上表现出色,但在训练标签成本高昂的专业高风险领域,其性能不如较小的监督模型。我们针对这一场景提出了EvoPool,一个受达尔文进化启发的进化多智能体框架。三个专业智能体迭代地提出可执行的标注器代码,一个小型验证集提供适应度信号,一个确定性门控仅保留通过跨代可行性、多样性和边际贡献检查的标注器。通过EvoAgg(一种结合语义特征与标注器投票特征的文本感知聚合器)将池投票映射为软训练标签。所构建的池在每样本成本接近零的情况下运行,在10万样本上比LLM标注快4500至31000倍。在8个LLM弱专业和复杂任务中的7个(涵盖生物医学关系抽取、法律条款分类、复杂推理和密集多标签生物医学分类)上,EvoPool比最强的LLM标注基线平均高出+0.141 macro-F1,在ChemProt上最高达+0.301,在PubMed上达+0.265。代码见:https://github.com/tianyi0216/EvoPool

英文摘要

Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool

2606.01610 2026-06-02 cs.AI 版本更新

Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

重新审视知识编辑中的涟漪效应:通过压力感知联合邻域优化

Haoben Huang, Shuxin Liu, Ou Wu, Di Gao

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(杭州高等研究院,中国科学院大学)

AI总结 针对大语言模型单次编辑引发的涟漪效应,提出联合邻域优化框架,通过压力感知协调和语义预执行门控联合优化可编辑侧与保留侧的耦合压力,在RippleEdits上传播与保留指标提升至少7.0%。

详情
AI中文摘要

大语言模型中的单次编辑更新会在局部知识邻域中引发涟漪效应:理想情况下传播到相关事实,同时意外扰动应保留的事实。现有方法分别处理这两种效应,而未显式建模它们的耦合。我们通过分析典型基线中的涟漪响应挑战这种分离,识别出两种耦合的设计压力:可编辑侧协调和保留侧泄露。我们提出联合邻域优化(JNO),一种新的知识编辑框架,在目标规划阶段形式化并联合处理这两种压力。JNO通过压力感知协调(PAC)实例化这一原则,该协调在耦合约束下联合优化邻域目标表示,并设置语义预执行门控,在参数执行前拒绝高风险目标计划。在RippleEdits上的实验表明,JNO在保持跨骨干编辑稳定性的同时,传播和保留指标至少提升7.0%。

英文摘要

Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to related facts and unintended perturbation of preserved ones. Existing methods address these two effects separately, without explicitly modeling their coupling. We challenge this separation through an analysis of ripple responses across typical baselines, identifying two coupled design pressures: editable-side coordination and preserved-side leakage. We propose Joint Neighborhood Optimization (JNO), a new knowledge-editing framework to formalize and jointly address both pressures at the target-planning stage. JNO instantiates this principle through Pressure-Aware Coordination (PAC), which jointly optimizes neighborhood target representations under coupled constraints, and a semantic pre-execution gate that rejects high-risk target plans before parameter execution. Experiments on RippleEdits show JNO improves propagation and preservation metrics by at least 7.0% while preserving cross-backbone editing stability.

2606.01607 2026-06-02 cs.LG cs.AI 版本更新

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

FedMTFI: 异构联邦学习环境中基于特征重要性优化的多教师知识蒸馏

Nazmus Shakib Shadin, Aaron Cummings, Xinyue Zhang, Bobin Deng

发表机构 * Department of Computer Science, Kennesaw State University, Marietta, GA, 30060 USA(计算机科学系,肯纳邦大学,马里埃塔,GA,30060 USA)

AI总结 提出FedMTFI架构,通过结合多教师知识蒸馏与Shapley值特征重要性,在异构联邦学习中提升模型准确性和可解释性。

Comments Accepted by IJCNN 2026

详情
AI中文摘要

联邦学习(FL)是一种去中心化方法,能够在无需暴露原始数据的情况下实现协作模型训练。它允许设备仅共享模型权重,而将个人数据保留在本地并确保安全,从而避免了敏感数据的传输。然而,在现实环境中,设备持有的数据往往分布不均,且设备在计算能力和内存容量上大多存在差异。这些差异使得FL难以在整个系统中保持一致的性能。为了解决这些问题,我们提出了FedMTFI,一种新颖的架构,它将多教师知识蒸馏(MTKD)与特征重要性相结合,以改善异构环境中的FL过程。在FedMTFI中,客户端根据相似的硬件和模型类型进行聚类。每个聚类在非独立同分布(non-IID)数据上训练特定模型。在聚类内部,每个客户端仅使用自己的本地私有数据更新该模型。然后,服务器使用FedAvg对每个聚类中的本地训练模型进行聚合,形成多个原型模型。接着,这些原型作为教师模型,通过MTKD训练一个全局通用的学生模型。FedMTFI的独特之处在于集成了Shapley值(SHAP),以在蒸馏过程中强调重要特征,从而提高了准确性和可解释性。实验结果表明,FedMTFI比传统FL算法实现了更高的准确性,并且在non-IID数据条件下表现更有效。

英文摘要

Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transferring sensitive data, it allows devices to share only model weights, keeping personal data locally and secure. However, in real world settings, the data held by devices is often not evenly distributed and devices mostly differ in computing power and memory capacity. These differences make FL harder to maintain consistent performance across the system. To address these issues, we propose FedMTFI, a novel architecture that combines multi-teacher knowledge distillation (MTKD) with feature importance to improve the FL process in heterogeneous environments. In FedMTFI, clients are clustered based on similar hardware and model types. Each cluster trains a specific model on not independently and identically distributed (non-IID) data. Within a cluster, every client updates that model using only its own local private data. The server then aggregates the locally trained models in each cluster using FedAvg to form multiple prototype models. Then these prototypes serve as teacher models to train a global generalized student model using MTKD. What makes FedMTFI more unique is the integration of Shapley values (SHAP) to emphasize important features during distillation, which enhances both accuracy and interpretability. Experimental results show that FedMTFI achieves higher accuracy than traditional FL algorithms and performs more effectively under non-IID data conditions.

2606.01599 2026-06-02 cs.AI 版本更新

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

TRON:面向视觉推理强化学习的目标化规则可验证在线环境

Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出TRON在线环境框架,通过可控生成-验证程序产生无限训练实例,支持视觉推理强化学习,在多个多模态基准上提升性能。

Comments 27 pages, 8 figures

详情
AI中文摘要

视觉推理的强化学习(RL)需要可扩展、可验证且可控的训练信号。现有的视觉RL后训练在静态策划数据集上进行,其图像-问题-答案样本受限于收集预算。本文引入TRON(目标化、规则可验证的在线环境),一种在线环境基底:训练rollout由可控的生成-验证程序按需生成,该程序采样新的潜在视觉状态,渲染图像,提出问题,并精确验证答案。因此,单次运行可以按当前课程所需的难度级别抽取无限的新实例流。当前TRON套件包含520个环境,组织成五个能力桶(空间、数学、图表、模式/逻辑和计数);同一基底支持在所有桶上训练的单个完整模型以及每个桶的能力专家模型,无需额外数据收集。我们还引入了基底分析,涵盖生成可靠性、实例和级别多样性、跨环境近似重复以及按难度级别的基础模型通过率。使用METHOD进行RL后训练在Qwen3-VL-4B、Qwen2.5-VL-7B和MiMo-VL-7B-SFT上的十个外部多模态推理基准上持续提升性能。

英文摘要

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with METHOD consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

2606.01584 2026-06-02 cs.CL cs.AI 版本更新

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

识别LLM中高置信度的社会偏见以构建可信的对话辅导代理

Aitor Arronte Alvarez, Naiyi Xie Fincham

发表机构 * University of Hawaii at Manoa(夏威夷大学马诺亚分校)

AI总结 本研究通过生成对话数据集,评估大型语言模型在辅导场景中检测社会偏见的能力,发现模型在对话上下文中比基准测试更难检测偏见,且对错误判断过度自信,影响推理和反馈。

Comments Accepted for AIED 2026

详情
AI中文摘要

对话辅导代理已被证明能提高学习参与度和学生成绩,大型语言模型(LLM)越来越多地被用于这些系统以提供可扩展的个性化反馈。然而,LLM可能会延续或放大刻板的社会偏见,在教育环境中带来特殊风险。在本研究中,我们评估了LLM在对话辅导场景中的表现,以识别高置信度的社会偏见,即模型在无法识别辅导对话中的偏见判断时仍保持高度自信,可能影响其推理和向学习者提供的反馈。我们提出了一种新的数据集生成方法,通过重新生成学生-AI辅导教师互动并引入来自基准数据集的受控偏见轮次,实现在自然教学条件下的偏见评估。利用这些数据,我们评估了多个LLM检测刻板偏见的能力,并通过计算和人工评估分析了其响应背后的置信度和推理。我们发现,在对话辅导上下文中,偏见检测比基于基准的评估更具挑战性,且最先进的LLM对其刻板偏见陈述的错误评估过于自信。此外,模型置信度强烈影响推理和反馈,突显了基于LLM的辅导代理中过度自信和偏见行为的风险。最后,我们讨论了影响、缓解考虑和未来研究方向。

英文摘要

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs' ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research.

2606.01560 2026-06-02 cs.LG cs.AI 版本更新

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

GJDNet: 通过联合解缠学习实现鲁棒图神经网络对抗攻击

Canyixing Cui, Tao Wu, Xingping Xian, Xiao-Ke Xu, Mao Wang, Weina Niu

发表机构 * School of Computer Science and Technology, Chongqing University of Posts and Telecommunications(重庆邮电大学计算机科学与技术学院) School of Cyber Security and Information Law, Chongqing University of Posts and Telecommunications(重庆邮电大学网络安全与信息法学院) Computational Communication Research Center, Beijing Normal University(北京师范大学计算通信研究中心) School of Journalism and Communication, Beijing Normal University(北京师范大学新闻传播学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出GJDNet框架,通过联合解缠节点表示和决策空间,并采用球形决策边界,增强图神经网络在不同图同配性下的鲁棒性。

详情
AI中文摘要

图神经网络(GNN)易受对抗攻击,这类攻击通过在同配图中引入异配边、在异配图中引入同配边,从根本上反转连接模式。这种结构反转造成结构-特征不匹配,扰乱不同图类型上的邻域聚合。然而,我们发现现有防御措施存在局限性,它们要么在固定的同配性假设下将邻域视为整体,要么依赖无法应对扰动引起的表示偏移的标准softmax分类器。为进一步利用这一观察,我们采用鲁棒性视角,联合解缠节点表示和决策空间,在隔离扰动影响的同时强制实现分离良好的决策区域。基于此原则,我们提出图联合解缠网络(GJDNet),这是一个统一的框架,用于在不同图同配性机制下进行鲁棒节点分类。GJDNet在表示和决策两个层面增强鲁棒性:它采用特征驱动的软结构解缠,结合偏度感知的邻居过滤,抑制扰动引起的结构-特征不匹配;并引入球形决策边界(SDB),促进嵌入空间中的类内紧凑性和类间分离,从而在扰动下稳定决策边界。理论分析揭示了所提出的解缠表示和决策机制的有效性,而大量实验表明,GJDNet在不同连接模式的图上始终展现出强鲁棒性。

英文摘要

Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassortative edges in assortative graphs and assortative edges in disassortative graphs. This structural inversion creates structure-feature mismatches that disrupt neighborhood aggregation across different graph types. However, we find that existing defenses are limited, as they either treat neighborhoods as monolithic under fixed assortativity assumptions or rely on standard softmax classifiers that fail to account for perturbation-induced representation shifts. To further exploit this observation, we adopt a robustness perspective that jointly disentangles node representations and decision spaces, isolating perturbation effects while enforcing well-separated decision regions. Based on this principle, we propose Graph Joint Disentanglement Network (GJDNet), a unified framework for robust node classification across diverse graph assortativity regimes. GJDNet enhances robustness at both representation and decision levels: it employs feature-driven soft structural disentanglement with skewness-aware neighbor filtering to suppress perturbation-induced structure-feature mismatches, and introduces a Spherical Decision Boundary (SDB) to promote intra-class compactness and inter-class separation in the embedding space, thereby stabilizing decision boundaries under perturbations. Theoretical analysis provides insights into the effectiveness of the proposed disentangled representation and decision mechanisms, while extensive experiments demonstrate that GJDNet consistently achieves strong robustness across graphs with different connectivity regimes.

2606.01552 2026-06-02 cs.AI 版本更新

RoleCDE:Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

RoleCDE:角色扮演代理中的角色-对齐权衡的基准测试与缓解

Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Zhouxing Wang, Zhiqiang Yin, Xun Liang

发表机构 * School of Information, Renmin University of China(中国人民大学信息学院)

AI总结 针对角色扮演代理在角色特定价值与对齐约束冲突时的决策问题,提出首个基准RoleCDE,通过认知困境场景评估角色-场景基础、价值冲突解决和决策倾向,发现“角色价值解耦”现象,并基于RoleCDE的微调有效缓解该问题。

Comments 23pages

详情
AI中文摘要

角色扮演代理(RPAs)被广泛用于引导大语言模型(LLMs)表现出角色一致的行为,然而现有基准主要评估表面保真度,对角色-对齐价值冲突下的决策提供有限洞察。为解决这一差距,我们引入RoleCDE,这是首个旨在评估RPAs在角色特定价值与对齐导向约束之间结构化冲突下的基准。RoleCDE将角色感知决策制定为认知困境场景,联合评估角色-场景基础、价值冲突解决和决策倾向。该基准大规模构建,涵盖约8000个多样化的角色档案和场景,以及近24000个困境实例,跨越三个难度级别和八个角色类别。对几个主流LLMs的评估揭示了一种“角色价值解耦”现象,即当两者冲突时,代理系统性地默认选择对齐和道德一致的决策,而非角色特定价值,即使在明确的角色条件下也是如此。这种行为在很大程度上不受困境难度影响,但在不同角色类别间差异显著。我们进一步表明,基于RoleCDE的微调通过改善价值权衡推理有效缓解了这种解耦,同时保持了通用角色扮演保真度和通用推理性能。代码可在 https://github.com/rabbitrose/RoleCDE 获取。

英文摘要

Role-playing agents(RPAs) are widely used to steer large language models(LLMs) toward role-consistent behavior, yet existing benchmarks mainly evaluate surface-level fidelity and offer limited insight into decision making under role-alignment value conflicts. To address this gap, we introduce RoleCDE, the first benchmark designed to evaluate RPAs under structured conflicts between role-specific values and alignment-oriented constraints. RoleCDE formulates role-aware decision making as cognitive dilemma scenarios, jointly evaluating role-scenario grounding, value conflict resolution, and decision tendencies. The benchmark is constructed at scale, covering approximately 8k diverse role profiles and scenarios and nearly 24k dilemma instances across three difficulty levels and eight role categories. Evaluation of several mainstream LLMs reveals a "Role Value Decoupling" phenomenon, where agents systematically default to alignment-and morality-consistent decisions rather than role-specific values when the two conflict, even under explicit role conditioning. This behavior is largely invariant to dilemma difficulty but varies substantially across role categories. We further show that RoleCDE-based fine-tuning effectively mitigates this decoupling by improving value trade-off reasoning, while preserving general role-playing fidelity and general reasoning performance. Code is available at: https://github.com/rabbitrose/RoleCDE.

2606.01542 2026-06-02 cs.DC cs.AI cs.CL cs.DB cs.IR 版本更新

Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit

自条件位置HNSW:面向分块文档RAG系统的重叠感知检索方法与工业证据质量审计

Nataraj Agaram Sundar, Tejas Morabia

发表机构 * eBay Inc.(eBay公司)

AI总结 提出自条件位置HNSW(SCP-HNSW),通过低维位置编码和两遍查询过程实现重叠感知检索,减少重复证据,并基于工业审计数据验证其有效性。

Comments 11 pages, 5 figures, 4 tables

详情
AI中文摘要

分块文档检索是检索增强生成(RAG)系统的常见组件。文档被分割成重叠的块,嵌入,并使用近似最近邻搜索(如分层可导航小世界图HNSW)进行索引。重叠改善了边界覆盖,但引入了一个实际故障模式:top-k检索通常返回重复证据的相邻块,浪费提示预算。我们提出自条件位置HNSW(SCP-HNSW),这是一种轻量级修改,将低维位置代码附加到块嵌入,并使用两遍查询过程来估计和应用查询特定的文档位置先验。SCP-HNSW保持HNSW图构建和遍历不变,同时为最终上下文构建添加了一个可审计的最小索引间隙选择器。我们还集成了用于生成证据质量的工业审查工件:一个包含318个完全标记审查的770条文本证据审计,以及一个包含350个评级的70例OCR审计。文本审计显示,770个预计审查中有574个被评为3/5,只有39个落在1-2范围内,叙述性审查者细节比结构化问题标志出现得更频繁。OCR审计显示,切片级通过率从干净聊天截图的95%到手写/模糊捕获的45%不等,一致性中等至强。这些结果激励了重叠感知、审计友好的RAG检索,并确定了因果性能声明所需的剩余受控检索消融。

英文摘要

Chunked-document retrieval is a common component of retrieval-augmented generation (RAG) systems. Documents are split into overlapping chunks, embedded, and indexed with approximate nearest-neighbor search such as hierarchical navigable small world graphs (HNSW). Overlap improves boundary coverage but induces a practical failure mode: top-k retrieval often returns near-adjacent chunks that repeat evidence and waste prompt budget. We propose Self-Conditioned Positional HNSW (SCP-HNSW), a lightweight modification that appends a low-dimensional positional code to chunk embeddings and uses a two-pass query procedure to estimate and apply a query-specific document-position prior. SCP-HNSW leaves HNSW graph construction and traversal unchanged while adding an auditable minimum-index-gap selector for final context construction. We also integrate industrial review artifacts for generated evidence quality: a 770-review text-evidence audit with 318 fully labeled reviews and a 70-case OCR audit with 350 ratings. The text audit shows that 574 of 770 projected reviews are rated 3/5, only 39 fall in the 1-2 range, and narrative reviewer detail appears much more often than structured issue flags. The OCR audit shows slice-level pass rates from 95% for clean chat screenshots to 45% for handwritten/blurry captures, with moderate to strong agreement. These results motivate overlap-aware, audit-friendly RAG retrieval and identify the remaining controlled retrieval ablations needed for causal performance claims.

2606.01540 2026-06-02 cs.LG cs.AI 版本更新

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

TN-SHAP-G:用于Shapley值和交互的图结构张量网络代理

Farzaneh Heidari, Guillaume Rabusseau

发表机构 * University of Washington(华盛顿大学) CNRS(法国国家科学研究中心)

AI总结 提出TN-SHAP-G框架,利用图结构输入通过张量网络代理高效计算Shapley值和高阶交互指数。

详情
AI中文摘要

Shapley值是一种广泛使用的工具,用于归因黑盒模型中输入变量的重要性和交互,但其计算涉及定义在指数级子集空间上的函数。我们提出TN-SHAP-G,一个利用图结构输入中的结构高效计算Shapley值和高阶交互指数的框架。给定一个预测器和一个固定的掩码方案,TN-SHAP-G学习一个紧凑的、与图对齐的多线性代理,该代理近似掩码输入行为,表示为拓扑结构反映输入图的张量网络。一旦从少量oracle查询中训练完成,该代理通过多线性扩展实现一阶和高阶Shapley指数的确定性恢复,无需额外模型查询或蒙特卡洛方差。分子基准实验表明,学习到的分解在小图上紧密匹配精确Shapley值,并能高效扩展到基于采样的方法不可行的更大图。

英文摘要

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.

2606.01528 2026-06-02 cs.AI 版本更新

Joint Agent Memory and Exploration Learning via Novelty Signals

通过新颖性信号实现联合智能体记忆与探索学习

Shizuo Tian, Xiaohong Weng, Rui Kong, Yuxuan Chen, Guohong Liu, Yuebing Song, Jiacheng Liu, Yuchen Li, Dawei Yin, Ting Cao, Yunxin Liu, Yuanchun Li

发表机构 * Tsinghua University(清华大学) Sun Yat-sen University(中山大学) Baidu Inc.(百度公司) Tongji University(同济大学) Peking University(北京大学)

AI总结 提出JAMEL框架,利用新颖性信号联合训练智能体记忆与探索策略,在开放环境中实现高效探索并泛化到未见环境。

详情
AI中文摘要

在开放环境中,探索对于自主智能体至关重要,但当前的语言模型智能体难以做到这一点。有效的探索需要记忆,但保留原始交互历史在长轨迹中计算成本高昂。虽然潜在记忆提供了压缩交互历史的解决方案,但其训练缺乏可靠的监督信号。我们提出了联合智能体记忆与探索学习(JAMEL),这是一个通过新颖性驱动的交互来共同训练智能体记忆和探索策略的框架。我们观察到记忆和探索形成了一个相互依赖的循环:持续的探索需要记忆来区分已耗尽的行为和未见过的新行为,而寻求新颖性的交互提供了使记忆对未来探索有用的监督。通过利用确定性和持久的新颖性信号(如GUI领域的代码覆盖率),我们为记忆模块提供了自然的、无需标注的监督。实证评估表明,我们的方法成功泛化到未见环境。其探索能力优于开放权重基线,并与闭源模型的探索深度相媲美,同时减少了token消耗。我们的代码和模型已在https://github.com/MobileLLM/JAMEL开源。

英文摘要

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solution to compress interaction histories, its training lacks reliable supervisory signals. We introduce \textbf{J}oint \textbf{A}gent \textbf{M}emory and \textbf{E}xploration \textbf{L}earning (\textbf{JAMEL}), a framework that trains agentic memory and exploration policy together through novelty-driven interaction. We observe that memory and exploration form a mutually dependent loop: sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, while novelty-seeking interaction provides the supervision needed to make memory useful for future exploration. By utilizing deterministic and persistent novelty signals such as code coverage in the GUI domain, we provide natural, annotation-free supervision for the memory module. Empirical evaluations demonstrate that \ours successfully generalizes to unseen environments. Its exploration capability outperforms open-weight baselines and rivals the exploration depth of a closed-source model while reducing token consumption. Our code and model are open-sourced at https://github.com/MobileLLM/JAMEL.

2606.01520 2026-06-02 cs.AI 版本更新

TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications

TERRA: 面向跨领域应用的任务嵌入推理与表示架构

Shayan Shokri

发表机构 * Humanpath Labs Inc.(Humanpath实验室有限公司)

AI总结 提出TERRA架构,通过形式化跨领域转移问题,利用松弛双模拟差异和Gromov-Wasserstein距离度量结构状态域间的同态性,推导出预测误差与决策遗憾的转移界,将广泛直觉转化为可检验理论。

详情
AI中文摘要

一个单一的动作条件潜在预测架构原则上可以在驾驶场景、机器人工作空间或金融订单簿的结构化状态上进行训练。在任何单个领域内实现这一点的要素已经存在并得到单独验证:掩码潜在预测、动作条件潜在世界模型、离散动作标记化以及体素化状态上的联合嵌入预测。TERRA解决的是尚未确立的转移问题:在一个结构化状态领域学到的表示或预测器何时以及多大程度上能够迁移到结构类似但其他方面无关的领域。我们对此问题进行了形式化处理。我们将每个领域建模为分级潜在网格上的受控马尔可夫过程,将任何实例分解为薄领域适配器和共享的领域不变核心,并识别出跨领域对应关系,该对应关系近似于一个马尔可夫决策过程同态,其质量通过松弛双模拟差异来衡量,对于缺乏共享坐标系的领域,则通过其动作条件转移算子之间的Gromov-Wasserstein距离来衡量。在Lipschitz预测器下,我们推导出一个转移界,该界将源模型误差与结构失配分开,在预测范围内呈几何增长,并由Gromov-Wasserstein距离从下方保证;然后通过双模拟度量的Lipschitz值性质将潜在误差与决策遗憾联系起来。由此产生的结构化状态转移假设被表述为一个可证伪的主张,并附有预注册的实验方案,核心是从驾驶场景到订单簿的转移测试,包括其被反驳的条件。我们不呈现实证结果:这是一个将广泛重复的直觉转化为可检验理论的研究提案。

英文摘要

A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot workspace, or a financial order book. The ingredients for doing so within any one domain already exist and are individually validated: masked-latent prediction, action-conditioned latent world models, discrete action tokenization, and joint-embedding prediction on voxelized state. What is not established, and what TERRA addresses, is the transfer question: when does a representation or predictor learned in one structured-state domain carry over to a structurally analogous but otherwise unrelated domain, and by how much. We give this question a formal treatment. We model each domain as a controlled Markov process on a graded latent grid, factor any instantiation into thin domain adapters and a shared domain-invariant core, and identify a cross-domain correspondence with an approximate Markov decision process homomorphism whose quality is measured by a lax bisimulation discrepancy and, for domains lacking a shared coordinate system, by a Gromov-Wasserstein distance between their action-conditioned transition operators. Under a Lipschitz predictor we derive a transfer bound that separates source-model error from structural mismatch, grows geometrically in the prediction horizon, and is certified from below by the Gromov-Wasserstein distance; we then connect latent error to decision regret through the Lipschitz value property of bisimulation metrics. The resulting Structured-State Transfer Hypothesis is stated as a falsifiable claim with a preregistered experimental program, centered on a transfer test from driving scenes to order books, including conditions under which it is refuted. We present no empirical results: this is a research proposal that converts a widely repeated intuition into testable theory.

2606.01513 2026-06-02 cs.DC cs.AI cs.CL cs.LG 版本更新

Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense

基于合规评分的Best-of-N护栏编排用于支付争议防御中的多模态文档生成

Nataraj Agaram Sundar, Tejas Morabia

发表机构 * eBay Inc.(eBay公司)

AI总结 提出一种结合多候选生成与合规评分早退机制的护栏编排层,通过并行生成、加权评分和最佳输出选择,在支付争议防御场景中实现高合规率与低延迟。

Comments 8 pages, 7 figures, 4 tables. Preprint. Applied systems paper on compliance-scored guardrail orchestration for multimodal LLM document generation. Contains aggregate operational readouts; not a randomized A/B test

详情
AI中文摘要

高风险企业文档生成,包括金融争议叙述、合规通知和审计摘要,要求模式正确性、策略合规性以及大规模低延迟操作。在统一的护栏层之前,生产系统通常将独立的PII编辑、内容审核和格式验证步骤拼接在一起,导致逻辑碎片化、请求路径变慢和运营成本增加。我们提出了一种针对文本和图像输入的护栏编排层,它将多候选生成与用于早退的显式合规评分相结合。该框架运行可配置的并行生成头,根据加权护栏(包括PII检测、内容审核、模式约束和领域规则)对候选进行评分,并返回具有选择元数据的最佳评分输出。可用的运营读数报告在20秒内进行5次尝试,合规率为91%。对于支付争议防御摘要,我们分析聚合运营场景读数,而非随机A/B测试。可变队列显示总体胜率高于对照组,301/659对比536/1548,对应+11.0个百分点,95%置信区间[6.6, 15.5],p < 0.001;对于调整后的未收到物品案例,+7.5个百分点,95%置信区间[0.2, 15.7],p = 0.045。欺诈和本地证据排名差异方向为正,但在聚合计数数据中不具有统计显著性。我们还报告了来自770次生成证据审查和70例OCR切片的评审校准的负责任AI证据质量信号,并通过请求接口、评分逻辑、伪代码和运营证据边界记录了可重复性边界。

英文摘要

High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema correctness, policy compliance, and low-latency operation at scale. Prior to a unified guardrail layer, production systems often stitched together separate PII redaction, content moderation, and format validation steps, leading to fragmented logic, slower request paths, and higher operational cost. We present a guardrail orchestration layer for text and image inputs that couples multi-candidate generation with an explicit compliance score used for early exit. The framework runs configurable parallel generation heads, scores candidates against weighted guardrails including PII detection, content moderation, schema constraints, and domain rules, and returns the best-scoring output with selection metadata. The available operational readout reports 5 attempts within 20 seconds and 91 percent compliance. For payments dispute defense summaries, we analyze aggregate operational scenario readouts rather than a randomized A/B test. Variable cohorts show higher count win rates than controls overall, 301/659 versus 536/1548, corresponding to +11.0 percentage points with 95 percent confidence interval [6.6, 15.5] and p < 0.001, and for adjusted item-not-received cases, +7.5 percentage points with 95 percent confidence interval [0.2, 15.7] and p = 0.045. Fraud and local evidence-ranking deltas are directionally positive but not statistically significant from the aggregate count data. We also report reviewer-calibrated Responsible-AI evidence-quality signals from 770 generated-evidence reviews and a 70-case OCR slice, and document the reproducibility boundary through the request interface, scoring logic, pseudocode, and operational evidence boundary.

2606.01509 2026-06-02 cs.LG cs.AI 版本更新

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

ProbMoE:可微分的专家混合概率路由

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

发表机构 * Imperial College London(伦敦帝国学院) University of Waterloo(多伦多大学) EPFL(瑞士联邦理工学院)

AI总结 提出ProbMoE概率路由框架,通过离散子集空间上的概率推断实现专家选择,解决top-k路由的离散非可微问题,并扩展到动态k路由,提升专家利用率和路由多样性。

Comments Accepted at ICML 2026

详情
AI中文摘要

专家混合(MoE)模型通过每个令牌仅激活一小部分专家来扩展规模。然而,训练此类模型仍然具有挑战性,因为top-$k$路由是离散且不可微的,需要针对专家选择的梯度估计器,其设计仍是一个核心开放问题。我们引入了ProbMoE,一种概率路由框架,将专家选择建模为基数受限专家子集上的分布,并将路由公式化为该离散子集空间中的概率推断。我们首先提出ProbMoE Exact-$k$路由,在前向传播中采样$k$专家子集,后向传播使用每个专家精确边际概率的梯度作为真实梯度的可处理代理。ProbMoE自然地推广到动态$k$路由设置,其中训练和推理都将路由基数约束到相同的预定义范围,允许每个令牌自适应地分配专家。在多个基准测试和模型骨干上,ProbMoE Exact-$k$相比竞争基线实现了强性能,具有改进的专家利用率和路由多样性;ProbMoE Dynamic-$k$以更少的激活专家实现了可比的性能。

英文摘要

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass, and the backward pass uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.

2606.01508 2026-06-02 cs.CR cs.AI 版本更新

Agent Operating Systems (AOS): Integrating Agentic Control Planes into, and Beyond, Traditional Operating Systems

代理操作系统 (AOS):将代理控制平面集成到传统操作系统及其之外

Ankur Sharma, Deep Shah

发表机构 * Independent Researcher(独立研究员)

AI总结 本文提出代理操作系统(AOS)架构,通过集成代理控制平面到现有操作系统或逐步接管部分OS职责,以解决传统OS在调度、内存管理、安全、可观测性和治理方面对长期目标导向的代理AI工作负载的局限性。

详情
AI中文摘要

传统操作系统围绕确定性程序、显式控制流和人类发起的工作流设计。其核心抽象——进程、线程、系统调用、文件和权限——假设有界行为和可预测的交互模式。代理AI系统引入了一种不同的执行模型:长期存在、目标导向的实体,它们进行概率推理、动态调用工具,并根据反馈调整行为。虽然代理目前可以作为用户空间应用程序实现,但其执行特性在调度、内存和状态管理、安全性、可观测性和治理方面对操作系统边界施加了压力。本文引入了代理操作系统(AOS)的概念,这是一种将代理控制平面集成到现有操作系统中的系统架构,或者在某些模型中,随着时间的推移逐步接管选定的操作系统职责。我们提供了AOS的精确定义、明确的假设和非目标,并将AOS职责结构分解为调度器、上下文和内存管理、工具和能力注册表、策略和信任执行、以及可观测性和审计。我们分析了经典操作系统抽象对代理工作负载的局限性,提出了从用户空间运行时到分布式控制平面的集成模型,并将AOS概念映射到Linux和Windows原语。我们提出了安全性和安全性影响,包括代理特定的威胁模型,并定义了强调确定性执行、可审计性和操作员可理解性的评估标准。目标不是完全取代操作系统,而是为代理计算建立一个严格的系统基础,使其在大规模下保持可控、可问责和安全。

英文摘要

Traditional operating systems were designed around deterministic programs, explicit control flow, and human initiated workflows. Their core abstractions processes, threads, system calls, files, and permissions assume bounded behavior and predictable interaction patterns. Agentic AI systems introduce a different execution model: long-lived, goal-directed entities that reason probabilistically, invoke tools dynamically, and adapt behavior based on feedback. While agents can be implemented as user-space applications today, their execution characteristics stress OS boundaries in scheduling, memory and state management, security, observability, and governance. This paper introduces the concept of an Agent Operating System (AOS), a systems architecture that integrates an agentic control plane into existing operating systems or, in some models, subsumes selected OS responsibilities over time. We provide a precise definition of an AOS, explicit assumptions and non-goals, and a structured decomposition of AOS responsibilities into schedulers, context and memory management, tool and capability registries, policy and trust enforcement, and observability and audit. We analyze limitations of classical OS abstractions for agent workloads, propose integration models from user-space runtimes to distributed control planes, and map AOS concepts onto Linux and Windows primitives. We present security and safety implications, including agent specific threat models, and define evaluation criteria that emphasize deterministic enforcement, auditability, and operator comprehensibility. The objective is not to replace operating systems wholesale, but to establish a rigorous systems foundation for agentic computation that remains controllable, accountable, and secure at scale.

2606.01503 2026-06-02 cs.CV cs.AI cs.CL 版本更新

On the Limits of Token Reduction for Efficient Unified Vision Language Training

论高效统一视觉语言训练中令牌缩减的极限

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

发表机构 * University of Michigan(密歇根大学) Sony AI(索尼人工智能)

AI总结 本文通过分析层注意力分配,发现视觉理解与视觉生成在令牌冗余上存在不对称性,设计任务特定加速器,但统一训练中任务特定令牌丢弃导致协同损失,表明高效统一建模需保留共享跨任务结构。

详情
AI中文摘要

统一视觉语言模型(VLM)在单个自回归骨干中集成了视觉理解和视觉生成,但其联合训练计算成本高昂且从效率角度常被忽视。在这项工作中,我们研究了基于令牌缩减的加速在统一VLM训练中的可行性和极限。通过对逐层注意力分配的系统分析,我们揭示了一个基本的不对称性:视觉理解在后期层表现出显著的视觉冗余,而视觉生成在深度上对图像令牌保持持续依赖。受此观察启发,我们设计了任务特定的加速器,针对每个目标选择性地减少图像令牌计算。虽然这些方法在孤立设置中实现了显著的效率提升,但我们在统一训练下观察到一致的协同损失——任务特定的令牌丢弃需要不同的参数路径,并消除了联合优化中通常观察到的相互性能增益。我们的发现表明,高效统一建模需要保留共享的跨任务结构,强调了需要协同感知的加速策略。项目页面:https://chicychen.github.io/TokenReductionUnifiedVLM/。

英文摘要

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

2606.01502 2026-06-02 cs.DC cs.AI cs.NI 版本更新

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

移动查询,而非缓存:跨GPU结构中的跨实例潜在注意力再分布特征

Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein

AI总结 本研究通过真实多节点H100集群实验,刻画了多头部潜在注意力(MLA)在跨实例场景下的性能特征,提出了拓扑感知成本模型和路由/获取/本地谓词,证明在多数情况下路由查询比移动缓存更高效。

详情
AI中文摘要

前沿大语言模型越来越多地使用稀疏注意力索引器来决定查询关注的内容,该索引器为每个查询挑选几个KV缓存块:注意力的单位现在是一个小的、可重用的块。代理工作负载频繁使用这一机制:许多子代理查询一个大型代码库,重用相同的块。当语料库超出单个GPU容量时,它会被分区到多个实例上,因此查询及其选择的块通常位于不同的GPU上:回答查询意味着跨实例的注意力。先前跨实例KV系统的惯常做法是移动缓存:将选定的块拉到请求方。多头部潜在注意力反转了计算方式,将每个令牌的键和值压缩成一个窄向量,因此路由的查询行只有约1 KB,比它注意的块还小;此时路由查询通常比移动缓存更便宜。哪种原语在哪种结构和请求形状下胜出,尚未被研究,尤其是在设备发起的RDMA上,该技术使得每个请求的跨节点传输成本很低。我们在真实的多节点H100集群上刻画了跨实例MLA注意力的特征,提炼出两个可重用的产物:一个拓扑感知的成本模型(探测/传输/计算/返回/合并)和一个闭合形式的路由/获取/本地谓词,我们在真实的IBGDA上测量了其常数,该模型跟踪批量往返的误差在约7%以内。在解码阶段,它路由查询,将移动缓存的成本(连续块的约3毫秒重新适应拼接,或选择下的分散收集)替换为数十微秒的往返,并根据探测延迟而非峰值带宽选择结构。我们为MLA实例化了成本模型和谓词,但两者并非MLA特有:它们适用于任何通过压缩或稀疏选择将注意力缩小到小块的情况(如当前的DeepSeek-V3.2、V4和GLM-5.1)。将它们扩展到新架构只需测量两个系数:路由的有效载荷和获取的移动缓存成本。

英文摘要

Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents query one large codebase, reusing the same blocks. When that corpus outgrows one GPU it is partitioned across instances, so a query and the blocks it selects often sit on different GPUs: answering it means attention across instances. The reflex of prior cross-instance KV systems is to move the cache: pull the selected blocks to the requester. Multi-head Latent Attention inverts the arithmetic, compressing each token's key and value into one narrow vector, so a routed query row is only ~1 KB, smaller than the chunk it attends; routing the query is then often cheaper than moving the cache. Which primitive wins, over which fabric and request shape, is uncharted, least of all on device-initiated RDMA that makes per-request cross-node transfers cheap. We characterize cross-instance MLA attention on a real multi-node H100 cluster, distilling two reusable artifacts: a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, whose constants we measure on real IBGDA, where the model tracks batched round-trips to within ~7%. At decode it routes the query, trading the cost of moving the cache (a ~3 ms re-adaptation splice for a contiguous chunk, or a scattered gather under selection) for a tens-of-microsecond round trip, and picks the fabric by probe latency, not peak bandwidth. We instantiate the cost model and predicate for MLA, but neither is MLA-specific: they apply wherever compression or sparse selection shrinks attention to small chunks (DeepSeek-V3.2, V4, and GLM-5.1 today). Extending them to a new architecture requires measuring just two coefficients: the routed payload and fetch's move-the-cache cost.

2606.01498 2026-06-02 cs.CL cs.AI 版本更新

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

TimeSage-MT:用于评估智能时间序列推理的多轮基准测试

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao, Stefan Zohren, Anna Vettoruzzo, Joaquin Vanschoren, Ming Jin, Qingsong Wen

发表机构 * University of Oxford(牛津大学) VulpiVox Intelligence Eindhoven University of Technology(埃因霍温理工大学) Griffith University(格里菲斯大学) Squirrel Ai Learning East China Normal University(华东师范大学)

AI总结 提出TimeSage-MT多轮基准测试,包含240个任务和2680轮对话,覆盖8个真实领域,用于评估LLM智能体在时间序列推理中的表现,揭示其在决策导向任务中的性能下降及记忆、不确定性处理等缺陷。

详情
AI中文摘要

时间序列数据为许多真实世界领域的决策提供信息。虽然大语言模型(LLM)智能体可以通过自然语言和工具分析数据,但目前尚不清楚它们是否能在多轮对话中进行可靠的时间序列分析。现有基准测试侧重于预测和异常检测等单步任务,忽略了用户目标演变、智能体必须基于先前分析以及结论从累积证据中得出的实际工作流程。在这项工作中,我们引入了TimeSage-MT,一个用于智能时间序列推理的多轮基准测试,包含240个任务和2,680轮对话,涵盖8个真实世界领域,从基础探索到决策导向分析。TimeSage-MT通过一个可复现的流程构建,该流程将真实世界的时间序列数据转换为具有可验证答案的多轮对话。它提供了一个统一的评估协议和公共排行榜,用于比较时间序列智能系统。为了展示基准测试的实用性,我们评估了前沿LLM以及TimeSage——一种配备全面时间序列技能库的新型结构化智能体。结果显示,在决策导向任务上性能急剧下降,原因是记忆、不确定性处理和基于领域的决策方面的失败。TimeSage-MT揭示了当前智能推理中的关键差距,并为未来发展提供了严谨的基础。

英文摘要

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.

2606.01494 2026-06-02 cs.CR cs.AI cs.SE 版本更新

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree

ClawHub安全信号:当VirusTotal、静态分析和SkillSpector意见不一致时

Vincent Koc, Patrick Erichsen, Jacob Tomlinson, Agustin Rivera, Michael Appel, Nir Paz

发表机构 * OpenClaw Foundation USA(OpenClaw基金会美国) NVIDIA United Kingdom(NVIDIA英国分公司) NVIDIA USA(NVIDIA美国)

AI总结 研究ClawHub中67,453个公开技能版本,通过VirusTotal、静态启发式分析和NVIDIA SkillSpector三种扫描器的分歧,揭示智能体技能安全需要分层治理而非单一扫描器决策。

Comments 10 pages, 1 figure, 7 tables, 1 supplimentary dataset

详情
AI中文摘要

智能体技能通过可重用的指令、工具、脚本、参考和工作流扩展AI智能体,建立了不同于模型安全和传统包恶意软件检测的安全边界。ClawHub安全信号是一个包含67,453个最新公开OpenClaw技能版本的净化数据集。每一行包含经过编辑的SKILL.md内容和净化的捆绑文件(如有),以及最终的ClawScan注册表裁决和来自三个扫描器系列(VirusTotal、静态启发式分析和NVIDIA SkillSpector)的证据。我们并非估计恶意技能的流行率,而是研究扫描器之间的分歧。三个扫描器很少标记相同的技能:任何一对扫描器在其合并阳性结果上的重叠最多为10.4%,仅0.69%的技能被所有三个扫描器标记,81.9%被标记的技能仅由单个扫描器识别。分歧由攻击面结构化。SkillSpector发出语义智能体风险警告而非恶意软件信誉信号,在25,504个可疑行中阳性19,209个(75.3%),但在206个恶意行中仅14个阳性(6.8%)。恶意裁决区域呈现相反特征:206个恶意行中150个(72.8%)为VirusTotal阳性,与捆绑代码恶意软件证据一致。这些结果表明,智能体技能安全需要分层治理,而非单一扫描器的允许/阻止决策。该语料库作为净化的银标准数据集发布:标签是注册表的自动裁决,而非人工标注的真实情况,该发布代表一个早期的、版本化的快照,旨在支持社区,同时开发人工标注的子集。鼓励进一步研究,包括针对技能安全分类的定制模型。

英文摘要

Agent skills extend AI agents with reusable instructions, tools, scripts, references, and workflows, establishing a security boundary distinct from both model safety and traditional package-malware detection. ClawHub Security Signals is a sanitized dataset of 67,453 latest public OpenClaw skill versions. Each row pairs redacted SKILL.md content and sanitized bundled files where present with a final ClawScan registry verdict and evidence from three scanner families: VirusTotal, static heuristic analysis, and NVIDIA SkillSpector. Rather than estimating malicious-skill prevalence, we study scanner disagreement. The three scanners rarely flag the same skills: any pair overlaps on at most 10.4% of their combined positives, only 0.69% of skills are flagged by all three, and 81.9% of flagged skills are identified by a single scanner. The disagreement is structured by attack surface. SkillSpector, which raises semantic agentic-risk advisories rather than malware-reputation signals, is positive for 19,209 of 25,504 suspicious rows (75.3%) but only 14 of 206 malicious rows (6.8%). The malicious-verdict region shows the inverse profile: 150 of 206 malicious rows (72.8%) are VirusTotal-positive, consistent with bundled-code malware evidence. These results show that agent-skill security requires layered governance, not single-scanner allow/block decisions. The corpus is released as a sanitized silver-standard dataset: labels are the registry's automated verdicts, not human-annotated ground truth, and the release represents an early, versioned snapshot intended to support the community while a human-annotated subset is developed. Further research is encouraged, including models tailored for skill-security triage.

2606.01490 2026-06-02 cs.SE cs.AI cs.MA 版本更新

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

LLM联盟用于软件设计精化:多智能体协作拓扑的受控实验

Nagarjuna Kanamarlapudi, Praveen K

发表机构 * LLM Consortium for Software Design Refinement(软件设计精炼LLM联盟)

AI总结 通过受控实验评估12种多智能体LLM协作拓扑在软件架构设计中的表现,发现结构对抗变体(v4b)和跨模型评审(v2)排名前二,并行合并因令牌饥饿和弗兰肯斯坦效应表现最差。

Comments 12 pages, 9 figures, 5 tables

详情
AI中文摘要

我们提出了一项受控实验,评估了12种用于软件架构设计的多智能体LLM协作拓扑。采用$2\times2\times2$因子设计(权威性$\times$角色$\times$动态性),我们在8个不同复杂度的设计任务上进行了520次实验运行,每个任务重复5次。设计由三个独立的自动评估器(GPT-OSS 120B、Claude Opus 4.6、Claude Sonnet 4.6)按照12维评分标准进行评估。我们报告四个核心发现。第一,结构对抗(v4b)在集成排名中位列第一——一种提示工程化的对抗变体,要求重写指令而非补丁(加权集成:4.637/5.0)。第二,跨模型评审以全票获得第二——用一个模型生成,用另一个模型评审——所有三个评估器均将其排在第二(加权集成:4.606)。第三,评估器多样性本身就是一个发现——所有三个评估器一致认为v4b最好、v3最差,但对v2b分歧严重(Claude d=1.44 vs. GPT-OSS d=0.45),揭示了不同模型家族对设计质量的权重差异。第四,并行合并从根本上被破坏——所有三个评估器都将合并变体置于底层(3.65-3.79),原因是令牌饥饿和弗兰肯斯坦效应。加权集成($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS)在520次运行中提供了稳健的排名,并通过独立交叉验证得到确认。

英文摘要

We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). We report four core findings. First, structural adversarial (v4b) ranks #1 by ensemble -- a prompt-engineered adversarial variant that demands rewrite mandates rather than patches (weighted ensemble: 4.637/5.0). Second, cross-model review wins unanimously at #2 -- generate with one model, review with another -- ranking #2 by all three evaluators (weighted ensemble: 4.606). Third, evaluator diversity is itself a finding -- all three evaluators agree v4b is best and v3 is worst, but disagree sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), revealing how different model families weight design qualities. Fourth, parallel merge is fundamentally broken -- all three evaluators place merge variants in the bottom tier (3.65-3.79), due to token starvation and the Frankenstein effect. The weighted ensemble ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS) provides robust rankings across 520 runs, confirmed through independent cross-validation.

2606.01483 2026-06-02 cs.LG cs.AI eess.AS 版本更新

MURMUR: An Efficient Inference System for Long-Form ASR

MURMUR:一种高效的长时间语音识别推理系统

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

发表机构 * University of Washington(华盛顿大学)

AI总结 提出MURMUR推理系统,通过块间和块内两级优化,在保持高精度的同时显著降低长时间语音识别的延迟。

详情
AI中文摘要

长时间自动语音识别(ASR)需要高精度和低延迟,但现有系统迫使两者之间进行权衡。基于块的流水线在并行窗口中处理音频以实现低延迟,但丢失了跨块上下文,并且需要脆弱的启发式方法来对齐边界处的说话人和时间戳。长上下文ASR模型通过单次传递解决所有问题以获得更好的准确性,但速度慢一个数量级。我们提出MURMUR,一个通过两级操作克服这种权衡的推理系统。在块间级别,我们重新审视基于块的流水线以适应现代长上下文ASR,将块大小视为可调超参数,并表明中间块大小在准确性和延迟之间取得了良好的平衡。在块内级别,我们通过应用于输出和语音令牌的滑动窗口KV缓存驱逐策略来利用注意力稀疏性。在AMI-IHM上,MURMUR匹配单次传递准确性,同时将延迟降低4.2倍,通过令牌驱逐进一步获得收益,相对tcpWER退化小于1%。MURMUR的代码可在https://github.com/uw-syfi/Murmur获取。

英文摘要

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.

2606.01473 2026-06-02 cs.AI cs.HC 版本更新

A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation

极简脑机音乐接口用于实时情感驱动声化:系统设计与初步评估

Pablo A. Monroy-D'Croz, Rafael Ramirez-Melendez, Julian Cespedes-Guevara

发表机构 * GitHub

AI总结 本文提出一种极简脑机音乐接口,通过前额EEG活动实时估计情感效价并映射到音乐特征,实验发现额叶alpha不对称性无法可靠区分指令性情绪状态。

详情
AI中文摘要

本文提出一种极简脑机音乐接口(BCMI),作为实时情感声化系统,将前额EEG活动转化为自适应音乐。通过额叶alpha不对称性(AF7/AF8)估计情感效价,并通过随机生成算法映射到音乐特征,如调式、速度、节奏密度和音高音域。系统集成了无线EEG采集、实时Python信号处理以及通过Lab Streaming Layer同步的Ableton Live音乐生成。一项包含22名参与者的实验探究了有意情感自我诱导是否能调节BCMI神经反馈信号。线性混合效应分析发现目标情绪或时间无显著效应,表明额叶alpha不对称性信号无法可靠区分指令性情绪状态。个体差异(包括音乐训练和表演经验)解释了比实验操作更多的方差,后者仅占总信号方差的0.40%。这些发现凸显了使用额叶alpha不对称性作为闭环情绪调节的自愿控制信号的挑战,并为未来BCMI研究提出了方法论方向。

英文摘要

This paper presents a minimalist brain-computer Musical Interface (BCMI) that functions as a real-time affective sonification system, translating prefrontal EEG activity into adaptive music. Emotional valence is estimated from frontal alpha asymmetry (AF7/AF8) and mapped to musical features such as mode, tempo, rhythmic density, and pitch register through a stochastic generative algorithm. The system integrates wireless EEG acquisition, real-time Python signal processing, and Ableton Live-based music generation synchronized via Lab Streaming Layer. An experiment with 22 participants investigated whether intentional emotional self-induction could modulate the BCMI neurofeedback signal. Linear mixed-effects analyses found no significant effects of target emotion or time, indicating that the frontal alpha asymmetry signal did not reliably distinguish instructed emotional states. Individual differences, including musical training and acting experience, explained more variance than the experimental manipulation, which accounted for only 0.40\% of total signal variance. These findings highlight the challenges of using frontal alpha asymmetry as a voluntary control signal for closed-loop emotion regulation and suggest methodological directions for future BCMI research.

2606.01470 2026-06-02 physics.flu-dyn cs.AI cs.LG 版本更新

Emergent Transfer of a Physics Foundation Model from Simulation to Laboratory Turbulence

物理基础模型从模拟到实验室湍流的涌现迁移

Payel Mukhopadhyay, Stefan S. Nixon, Romain Watteaux, Michael McCabe, Alberto Bietti, Kyunghyun Cho, Cristiana Diaconu, Irina Espejo Morales, David Fouhey, Siavash Golkar, Tom Hehir, Shirley Ho, Jake Kovalic, Geraud Krawezik, Francois Lanusse, Tanya Marwah, Rudy Morel, Mariel Pettee, Helen Qu, Jeff Shen, Hadi Sotoudeh, Stuart B. Dalziel, Miles Cranmer

发表机构 * University of Cambridge(剑桥大学) CEA, DAM/DIF(法国CEA DAM/DIF) Flatiron Institute(Flatiron研究所) New York University(纽约大学) Princeton University(普林斯顿大学) Yale University(耶鲁大学) AIM, Université Paris-Saclay, Université Paris Cité, CEA, CNRS(AIM,巴黎-萨克雷大学,巴黎城市大学,CEA,CNRS) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Polymathic AI(聚合人工智能)

AI总结 通过微调连续介质动力学基础模型Walrus,在仅使用少量模拟数据的情况下,零样本泛化到实验室瑞利-泰勒不稳定性实验,揭示了初始条件在模拟-实验差距中的关键作用。

详情
AI中文摘要

物理基础模型能否有效应用于实验室实验,仍然是科学机器学习(ML)的一个未解决问题。我们在瑞利-泰勒不稳定性(RTI)上测试了这个问题,这是一种普遍且要求苛刻的流体不稳定性,从桌面流动到超新星爆炸中都能看到,其中密度界面上的小扰动在较轻流体加速进入较重流体时演变成混沌、多尺度的混合。标准ML模型难以处理RTI,尽管经过一个多世纪的理论、数值和实验工作,模拟与实验之间仍存在一个未解决的分歧:大多数实验室实验中测量的后期混合增长率$\alpha$(约0.06-0.07)大约是理想直接数值模拟(DNS,约0.02)的三倍。这一差距的起源仍有争议。这些特性使RTI成为一个严格的测试,其意义远超RTI本身:仅基于模拟训练的基础模型能否泛化到稀疏、杂乱且嘈杂的实验室环境?我们对连续介质动力学基础模型Walrus进行了微调,使用三个或更少的DNS实现,并在长时间滚动中恢复了关键的RTI物理特性。将微调模型零样本应用于滑动屏障实验室数据,它离开了类似DNS的区域,进入了观察到的增长带,而从未见过任何实验样本。这些结果提供了独立的数据驱动证据,表明初始条件在长期存在的模拟-实验$\alpha$差距中起着关键作用。该模型还零样本泛化到稳定分层(一种训练中未出现的浮力状态),正确减缓了混合层增长。总之,我们的结果表明,基础模型可以很好地泛化到训练数据之外,预测实验室行为和未见过的物理状态,为探索长期存在的模拟-实验差距开辟了新途径。

英文摘要

Whether physics foundation models can be usefully deployed on laboratory experiments remains an open question for scientific machine learning (ML). We test this question on the Rayleigh-Taylor instability (RTI), a ubiquitous and demanding fluid instability seen from tabletop flows to supernova explosions, in which small perturbations at a density interface grow into chaotic, multiscale mixing as a lighter fluid accelerates into a heavier one. Standard ML models struggle with RTI, and despite over a century of theoretical, numerical, and experimental work, it carries an unresolved discrepancy between simulation and experiment: the late-time mixing growth rate, $α$, measured in most laboratory experiments ($\sim$ 0.06-0.07), is roughly three times the value from idealized direct numerical simulations (DNS, $\sim$ 0.02). The gap's origin remains debated. These properties make RTI a stringent test for a question that matters well beyond RTI: can foundation models trained only on simulations generalise to sparse, messy, and noisy laboratory settings? We finetune Walrus, a foundation model for continuum dynamics, on three or fewer DNS realizations and recover key RTI physics over long rollouts. Applied zero-shot to sliding-barrier laboratory data, the finetuned model leaves the DNS-like regime and enters the observed growth band, having never seen a single experimental sample. These results provide independent, data-driven evidence that initial conditions play a crucial role in the longstanding sim-experiment gap in $α$. The model also generalises zero-shot to stable stratification, a buoyancy regime absent from training, correctly slowing mixing-layer growth. Together, our results show that foundation models can generalise well beyond their training data, predicting laboratory behavior and unseen physical regimes, opening new ways to probe longstanding simulation-experiment gaps.

2606.01468 2026-06-02 stat.ML cs.AI cs.LG 版本更新

Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics

基于模型选择的计算感知卡尔曼滤波用于神经动力学

JR Huml, Jonathan Wenger, John P. Cunningham

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Texas at Austin(得克萨斯大学奥斯汀分校)

AI总结 提出计算感知状态空间模型(CASSM),通过新训练损失和优化方案实现模型选择,在试验数远少于神经元数的规模不平衡场景中,以可处理的计算复杂度提供竞争性预测和更优的不确定性校准。

Comments 24 pages, Proceedings of 2nd International Conference on Probabilistic Numerics (2026)

详情
AI中文摘要

由于其明确的先验和建模不确定性的能力,贝叶斯方法在单细胞神经记录的动力潜变量建模中发挥了重要作用。然而,现代规模的数据集使得过参数化的深度网络因其预测能力和有利的计算扩展性成为首选方法。尽管存在许多后验近似方法,但所有方法都会引入近似误差。最近的工作以计算不确定性的形式考虑了这种误差,但代价是二次复杂度,并假设固定的模型超参数。在这里,我们将这一发展扩展到模型选择,包括一种新颖的训练损失和优化方案,从而在大状态空间中实现可处理的推理。我们引入了一个框架,即计算感知状态空间模型(CASSM),专门针对规模不平衡的场景设计,其中试验次数显著少于记录的神经元数量。在这种场景下,对于合成数据和真实数据,我们展示了我们的方法与数据饥饿的深度网络具有竞争力,并且与之前扩展贝叶斯方法的尝试相比,不确定性校准显著改善。我们的实验为神经科学研究人员根据关键数据集属性和约束从一系列潜在动力潜变量模型中进行选择提供了路线图。

英文摘要

Due to their explicit priors and ability to model uncertainty, Bayesian methods have played a major role in dynamical latent variable modeling of single-cell neural recordings. However, modern-sized datasets have made overparameterized deep networks the preferred methods of choice due to their predictive power and favorable computational scaling. While many posterior approximations exist, all incur approximation errors. Recent work accounts for this error in the form of computational uncertainty but comes at the cost of quadratic complexity and assumes fixed model hyperparameters. Here we extend this development to model selection, including a novel training loss and optimization scheme, which yields tractable inference in large state-spaces. We introduce a framework, the Computation-Aware State-Space Model (CASSM), specifically designed for the scale-imbalanced regime, where the number of trials is significantly lower than the number of recorded neurons. In this regime, for both synthetic and real data, we show that our method is competitive with data-hungry deep networks, with significantly improved uncertainty calibration over previous attempts to scale Bayesian methods. Our experiments provide a roadmap to neuroscience researchers in choosing from a host of potential dynamical latent variable models given key dataset properties and constraints.

2606.01462 2026-06-02 cs.AI cs.CL cs.LG 版本更新

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

人工推理之谜:探究大型推理模型中的生成-评估差距

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

发表机构 * NUS Department of Computer Science(国立新加坡大学计算机科学系) MIT EECS(麻省理工学院电子工程与计算机科学系) A*STAR(新加坡科技研究局) Singapore-MIT Alliance for Research and Technology (SMART)(新加坡-麻省理工联合研究技术机构(SMART))

AI总结 本文通过VAIR数据集发现大型推理模型在评估推理时存在显著缺陷,表现为答案确认偏差,即模型倾向于验证答案正确性而非仔细检查推理步骤。

Comments 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)

详情
AI中文摘要

对人类推理的研究表明,人们通常更擅长评估推理而非从头生成推理。相比之下,大型推理模型(LRMs)经过训练,擅长生成长链推理以解决复杂问题。那么,LRMs在评估推理方面表现如何?我们通过有效答案-无效推理(VAIR)数据集进行研究:该数据集包含数学问题和解决方案,这些解决方案存在琐碎的推理缺陷但答案有效,旨在将推理评估与推理生成混淆因素分离。与人类(我们发现人类在评分此类问题时仅比解决它们差6%)不同,我们发现LRMs存在显著的生成-评估差距:前沿模型在评估VAIR解决方案时得分低至48%,尽管在解决方案生成方面近乎完美。为何存在这一谜团?通过思维链(CoT)分析,我们发现了答案确认偏差的证据:LRMs通常先产生答案,然后检查正确答案,而不是仔细验证每一步,即使在注意到异常推理时也会编造合理化解释。线性探针进一步证实了这一点,表明虽然LRM激活编码了有效推理的某些表示,但它们未能稳健地将VAIR解决方案表示为无效。对最终答案表示的因果修补导致LRM判断和激活翻转,表明答案有效性是模型确认偏差的原因。这些发现揭示了主导推理训练方法的显著局限性,该方法激励LRMs生成并确认朝向正确答案的推理,但未能稳健地评估底层推理。

英文摘要

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.

2606.01457 2026-06-02 cs.AI cs.LG stat.ML 版本更新

Transferring Information Across Interventions in Causal Bayesian Optimization

跨干预因果贝叶斯优化的信息传递

Mohammad Ali Javidian

发表机构 * Computer Science Department(计算机科学系)

AI总结 提出图耦合因果贝叶斯优化方法,通过共享因果参数的不确定性连接不同干预效应,实现跨干预信息传递,在可识别线性高斯因果模型中证明低秩核性质和次线性遗憾界。

详情
AI中文摘要

贝叶斯优化是一种优化昂贵系统的流行方法,其中每次实验、模拟或干预都会耗费时间或金钱。在其标准形式中,它将我们控制的变量视为黑盒的普通输入,无法区分单纯的相关性与真正的因果关系。因果贝叶斯优化通过使用已知因果图结合观测数据来决定哪些变量值得干预,从而部分弥补了这一差距。然而,现有方法几乎孤立地学习每种可能干预的效果,尽管在因果系统中这些效果通常共享相同的底层机制。我们提出图耦合因果贝叶斯优化,通过我们对一小部分共享因果参数的不确定性,将不同的干预效果联系在一起。结果是一个因果核,使得从一次干预收集的证据能够改进我们对相关干预的估计。对于可识别的线性高斯因果模型,我们证明该核具有低秩,其秩由共享参数的数量而非干预菜单的大小界定。这进而产生一个信息增益界,该界仅随优化范围对数增长,以及一个遗憾界,清晰地将三种误差来源分开:优化、因果估计以及考虑哪些干预集的选择。我们还描述了非线性和自适应扩展。在与理论一致的高斯系统、共享机制压力测试以及标准因果优化基准测试中,该方法保持了因果贝叶斯优化的优势,同时实现了跨相关干预的信息传递,当对目标父节点的直接干预不可用且稀疏的干预数据必须在一大组候选干预中重复使用时,增益最为明显。

英文摘要

Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or money. In its standard form, it treats the variables we control as plain inputs to a black box and cannot tell apart mere correlation from a real cause and effect. Causal Bayesian optimization closes part of this gap by using a known causal graph together with observational data to decide which variables are worth intervening on. Existing methods, however, learn the effect of each possible intervention almost in isolation, even though in a causal system these effects usually share the same underlying mechanisms. We propose graph-coupled causal Bayesian optimization, which ties the different intervention effects together through the uncertainty we have about a small set of shared causal parameters. The result is a causal kernel that lets evidence collected from one intervention improve our estimate of related interventions. For identifiable linear Gaussian causal models, we show that this kernel has low rank, bounded by the number of shared parameters rather than by the size of the intervention menu. This in turn yields an information-gain bound that grows only logarithmically in the optimization horizon, and a regret bound that cleanly separates three sources of error: optimization, causal estimation, and the choice of which intervention sets to consider. We also describe nonlinear and adaptive extensions. Across theory-aligned Gaussian systems, shared-mechanism stress tests, and standard causal optimization benchmarks, the method keeps the benefits of causal Bayesian optimization while transferring information across related interventions, with the clearest gains when direct interventions on the target's parents are unavailable and sparse interventional data must be reused across a large family of candidate interventions.

2606.01444 2026-06-02 cs.AI cond-mat.mtrl-sci cs.CL cs.LG math.CT 版本更新

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

科学中的自我修正发现系统:面向主体人工智能的范畴论框架

Fiona Y. Wang, Markus J. Buehler

发表机构 * Laboratory for Atomistic and Molecular Mechanics(原子分子力学实验室) Department of Biological Engineering(生物工程系) Massachusetts Institute of Technology(麻省理工学院) Department of Civil and Environmental Engineering(土木与环境工程系) Department of Mechanical Engineering(机械工程系) Center for Computational Science and Engineering(计算科学与工程中心) Schwarzman College of Computing(施瓦茨曼计算学院)

AI总结 本文提出一个基于范畴论的框架,通过左Kan扩展实现科学发现中的表征体制转换,并应用于材料科学中的蛋白质力学和纤维网络建模。

详情
AI中文摘要

科学发现不仅是生成答案,更是对证据、人工制品、操作和验证者进行类型化的表征体制的修正。我们为材料科学中的主体发现开发了一个范畴论描述。在固定体制b中,模式类别为S_b,系统状态是一个余预层I_t: S_b -> Set,来源是元素范畴∫_{S_b} I_t。固定体制操作是对此类状态的更新,仅当指定并保留了保持来源的细化时才是自函子。发现则是经过验证的体制转换u: S_b -> S_b':旧人工制品通过左Kan扩展Lan_u I_t保存并传输,并与转换后状态进行比较,以识别超出函子传输的剩余内容。这在不依赖主观新颖性的情况下区分了检索、搜索和发现。我们在两个系统中实例化了该框架。在Builder/Breaker中,蛋白质力学世界模型在最小描述长度门控下进行修正;接受的定律将链内柔性表示为受慢集体模式调节的全模态弹性柔度,即模式调节柔度。在CategoryScienceClaw中,类型化技能、人工制品、开放需求、工作流变异、门控、压力测试和公共话语构成了一个携带证明的知识计算图。一个纤维网络示例记录了候选模型、被拒绝的替代方案、AIC门控、扰动测试以及一个基于各向同性纤维计数描述符的接受取向张量各向异性刚度代理模型。这些案例共同展示了范畴论如何既作为科学发现的数学语言,又作为自我修正AI发现系统的工程规范。

英文摘要

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b': old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.

2606.01443 2026-06-02 cs.LG cs.AI cs.CV 版本更新

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

UR-JEPA:均匀可整流性作为联合嵌入预测架构的正则化器

Triet M. Le

发表机构 * Spatiolyx LLC(Spatiolyx公司)

AI总结 提出UR-JEPA,通过高斯核平滑的Carleson型平方函数实现均匀n-可整流测度正则化,防止表示坍塌,在多个数据集上达到与LeJEPA相当的峰值精度但具有更低的种子方差。

详情
AI中文摘要

训练联合嵌入预测架构(JEPA)的一个核心困难是防止表示坍塌。LeJEPA通过素描各向同性高斯正则化(SIGReg)对嵌入施加各向同性高斯目标来解决这一问题。该目标与流形假设相矛盾,流形假设期望嵌入集中在环境空间的低维子集上。我们提出\emph{UR-JEPA},其目标是在小尺度上具有局部切向维度$n$的均匀$n$-可整流测度,通过高斯核平滑的Carleson型平方函数$\mathcal{L}^{ ext{CGLT}}$实现,并辅以Jones $β$数公式。在Inet10上,UR-JEPA($\mathcal{L}^{ ext{CGLT}}$)达到$0.9141 \pm 0.0014$,相比LeJEPA($\mathcal{L}^{ ext{SIGReg}}$)提高了$+0.83$个百分点,种子标准差降低约$30\%$;在匹配配方的Galaxy10~SDSS、单种子ImageNet-$100$运行和3种子EuroSAT遥感运行中,两种方法在收敛时处于相同的峰值精度区间,UR-JEPA保持其较低的种子方差特征。在EuroSAT上,域内对在$96.0$到$96.1\%$之间具有竞争力,且使用大型遥感基础模型迁移时骨干网络缩小$25$倍。区别在于几何结构:对投影仪输出分布的直接可视化显示,在所有四个数据集上,UR-JEPA($\mathcal{L}^{ ext{CGLT}}$)产生的全局PCA谱在索引$\sim 20$到$25$(共$D=32$)处出现$4$到$5$个数量级的下降,而LeJEPA的谱接近平坦(顶部到底部比率最多为$3.6$)。两种方法的每维度边缘分布同时接近高斯分布(平均Shapiro-Wilk $W \in [0.992, 0.996]$),这是Diaconis-Freedman结果的一个推论。因此,在匹配精度下,两种正则化器产生结构上不同的投影表示。

英文摘要

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses this by enforcing an isotropic Gaussian target on the embeddings via Sketched Isotropic Gaussian Regularization (SIGReg). This target is in tension with the manifold hypothesis, which expects embeddings to concentrate on a low-dimensional subset of the ambient space. We propose \emph{UR-JEPA}, which targets a uniformly $n$-rectifiable measure of local tangent dimension $n$ at small scales, realized through a Gaussian-kernel smoothed Carleson-type square function $\mathcal{L}^{\text{CGLT}}$, with a complementary Jones $β$-number formulation. On Inet10, UR-JEPA($\mathcal{L}^{\text{CGLT}}$) attains $0.9141 \pm 0.0014$ for a $+0.83$\,pp gain over LeJEPA($\mathcal{L}^{\text{SIGReg}}$) with $\sim 30\%$ lower seed standard deviation; on matched-recipe Galaxy10~SDSS, a single-seed ImageNet-$100$ run, and a $3$-seed EuroSAT remote-sensing run, the two methods lie in the same peak-accuracy band at convergence, with UR-JEPA retaining its lower-seed-variance signature. On EuroSAT the in-domain pair is competitive at $96.0$ to $96.1\%$ with large remote-sensing foundation-model transfer at a $25\times$ smaller backbone. The distinction is geometric: direct visualization of the projector output distribution shows that on all four datasets UR--JEPA($\mathcal{L}^{\text{CGLT}}$) produces a global PCA spectrum with a $4$ to $5$ order-of-magnitude drop at index $\sim 20$ to $25$ out of $D = 32$, while LeJEPA's spectrum is near-flat (top-to-bottom ratio at most $3.6$). Per-dimension marginals are simultaneously near-Gaussian for both methods (mean Shapiro-Wilk $W \in [0.992, 0.996]$) as a Diaconis-Freedman consequence. At matched accuracy the two regularizers therefore yield structurally distinct projected representations.

2606.01442 2026-06-02 cs.CR cs.AI cs.NE 版本更新

On the Evaluation of Spiking Neural Network Configurations for Network Intrusion Detection

脉冲神经网络配置在网络入侵检测中的评估

Raj Patel, David Amebley, Taye Akinrele, Shaswata Mitra, Sayanton Dibbo, Shahram Rahimi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过消融实验评估9种神经元模型与3种脉冲编码方案的27种组合,发现延迟编码优于速率和增量编码,且LeakyParallel神经元结合延迟编码在四个基准数据集上平均准确率92.11%,宏F1 0.80,假阳性率2.01%,推理速度最快。

Comments 1 figure, 3 Tables, This manuscript is under review for IEEE MILCOM 2026. \c{opyright} 2026 IEEE. Personal use is permitted; all other uses require IEEE permission, including reprinting, republication, redistribution, resale, or reuse of copyrighted components

详情
AI中文摘要

网络入侵检测是现代网络安全基础设施的核心组成部分,然而主导该领域的深度学习模型计算需求高,促使人们关注适用于边缘和神经形态部署的轻量级替代方案。因此,脉冲神经网络(SNN)是一个自然的选择,但其设计空间(涵盖神经元模型和脉冲编码方案的选择)在入侵检测方面仍未得到充分表征。我们通过使用9种神经元与3种脉冲编码方案相结合的受控消融研究来弥补这一差距,共产生27种变体,所有变体均在snntorch上实现,并在四个基准数据集(NSL KDD、KDDCup99、CIC-IDS2017和CTU-13)上使用5个随机种子对经过有限预处理的原始输入进行评估。我们发现,脉冲编码方案比神经元模型更能决定检测质量,其中速率和增量脉冲编码在整体扫描中表现不如延迟编码。结合延迟编码的LeakyParallel神经元总体表现最佳,在所有四个数据集上平均准确率为92.11%,宏F1为0.80,假阳性率为2.01%,在CIC-IDS2017和CTU-13上准确率接近完美,并且推理速度最快。这些结果凸显了SNN在考虑低延迟或资源受限部署时,作为传统入侵检测方法可行替代方案的潜力。

英文摘要

Network intrusion detection is a core component of modern cybersecurity infrastructure, yet the deep learning models that dominate the field are computationally demanding, motivating interest in lightweight alternatives suited to edge and neuromorphic deployment. Spiking Neural Networks (SNNs) are therefore a natural candidate, but their design space, spanning the choice of neuron model and spike encoding scheme, remains poorly characterized for intrusion detection. We bridge this gap by using a controlled ablation study using 9 neurons coupled with 3 spike encoding schemes, making 27 variants, all implemented on snntorch evaluated over raw inputs with limited preprocessing on four benchmark datasets (NSL KDD, KDDCup99, CIC-IDS2017, and CTU-13) with 5 seeds. We find that spike encoding scheme is a better determinant for detection quality than the neuron model, where rate and delta spike encodings perform worse than latency encoding over the sweep. The LeakyParallel neuron with latency encoding performed the best overall, averaging at 92.11% accuracy and 0.80 macro- F1 at a rate of 2.01% false positives averaged over all 4 datasets, with accuracy close to perfect for CIC-IDS2017 and CTU-13, and also performed the fastest on inference. These results highlight the potential of SNNs as a viable alternative to traditional methods of intrusion detection when considering low-latency or resource-constrained deployments.

2606.01441 2026-06-02 cs.AI 版本更新

Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

深入歧义:基于A*的多智能体常识混淆攻击LLM提示

Boxuan Wang, Zhuoyun Li, Xiaowei Huang, Yi Dong

发表机构 * University of Liverpool(利物浦大学)

AI总结 提出一种基于A*的事实错误诱导框架,通过层次化重写策略和动态语义分散系数生成语义对齐但混淆的提示,以高效攻击LLM的常识推理。

Comments Pre-print

详情
AI中文摘要

大型语言模型(LLM)在推理和知识密集型任务中表现出色,但仍易受到保留意图同时触发常识幻觉的提示级对抗攻击。这一漏洞亟待解决,因为LLM正迅速集成到事实可靠性不容妥协的安全关键领域。现有攻击方法要么缺乏效率,要么无法捕捉真实世界对手的适应性策略。我们提出一种基于A*的事实错误诱导框架,用于生成语义对齐但混淆的提示。其核心是由动态语义分散系数$γ$引导的层次化重写策略,该系数遵循反向模拟退火调度,在早期平衡保守编辑,后期进行激进混淆。为了增强可解释性,我们进一步引入智能体机制标记,发现并优化对抗机制,提供可解释的反向优化。理论上,我们证明提示重写遵循收缩递归,导致随着$γ$减小语义崩溃。实验上,在多种LLM上,我们的方法以更少的尝试次数实现了比穷举探索更高的攻击成功率,证明了其高效性和有效性。

英文摘要

Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework, a framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $γ$ that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $γ$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.

2606.01437 2026-06-02 cs.LG cs.AI 版本更新

CEAR: Certified Ensemble Adversarial Robustness in DNNs

CEAR: 深度神经网络中的集成对抗鲁棒性认证

Daniel Sadig, Mohammadreza Maleki, Hamed Karimi, Reza Samavi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CEAR方法,通过混合经验与认证防御机制,利用高斯噪声和温度混淆梯度与logits,并扩展随机平滑以验证集成分类器的鲁棒性,在多个数据集上取得更优的认证准确率和鲁棒半径。

Comments This is the preprint of the work accepted for publication in the Proceedings of the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026); 19 Pages

详情
AI中文摘要

深度神经网络(DNN)极易受到对抗性扰动的影响,这促使了对安全关键应用鲁棒性的广泛研究。最先进的实证防御机制通过训练阶段提高DNN的鲁棒性,但仍难以应对自适应白盒攻击。另一方面,认证防御在指定的扰动范围内提供可证明的鲁棒性保证。这些保证无论扰动程度如何都成立,即使攻击者拥有模型的完全知识。在本文中,我们提出了CEAR,一种基于集成的鲁棒方法,它利用了实证和认证防御机制的混合。CEAR使用不同的高斯噪声和温度训练集成中的每个网络,以混淆梯度和logits,使模型对更强的基于梯度的攻击更具抵抗力。然后我们使用带噪声的logits,并提出了两种不同的投票机制来进一步提高鲁棒性。此外,我们扩展了随机平滑以验证基于集成的分类器的鲁棒性。我们在MNIST、CIFAR10和TinyImageNet数据集上的实验评估表明,与基线方法相比,平均认证准确率更高,鲁棒半径更大,可迁移性更低。

英文摘要

Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-critical applications. State-of-the-art empirical defense mechanisms improve the robustness of DNNs through the training phase, but still struggle against adaptive white-box attacks. On the other hand, certified defenses offer provable guarantees of robustness within a specified perturbation bound. These guarantees hold regardless of the level of perturbations, even if the attacker is given full knowledge of the model. In this paper, we propose CEAR, an ensemble-based robust method that utilizes a hybrid of empirical and certified defense mechanisms. CEAR trains each network within the ensemble using varying Gaussian noise and temperatures to obfuscate gradients and logits, making the model more resistant to stronger gradient-based attacks. We then use noisy logits and propose two different voting mechanisms to further improve robustness. Furthermore, we extend randomized smoothing to verify the robustness of ensemble-based classifiers. Our experimental evaluations on MNIST, CIFAR10, and TinyImageNet datasets demonstrate superior certified accuracy on average, increased robustness radius, and decreased transferability compared to baseline methods.

2606.01435 2026-06-02 cs.AI cs.CL cs.IR 版本更新

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

不要询问LLM追踪新鲜度:一种确定性的内存冲突解决策略

Vikas Reddy, Sumanth Challaram

发表机构 * IIT Kgp(印度理工学院科钦分校)

AI总结 针对基于LLM的内存系统中事实冲突解决性能低下的问题,提出用候选提取加Python max(serial)的确定性聚合替代LLM判断,在单跳任务上提升10.8个百分点,并扩展到多跳任务。

详情
AI中文摘要

基于LLM的内存系统越来越多地维护随时间演变的事实,其中一个反复出现的失败是冲突解决:当一个事实有多个矛盾的值时,智能体应该返回哪个?MemoryAgentBench (MAB; Hu et al., 2026) 在其FactConsolidation任务中明确了这一点:事实被编号,反事实具有更高的序号,并且智能体被告知较新的事实具有较大的序号。然而,每个已发布的系统表现不佳:HippoRAG-v2在单跳(FC-SH)上达到54%,BM25 48%,Mem0 18%,而时间知识图谱Zep/Graphiti仅为7%。多跳几乎未解决(22个系统中最多7%)。我们认为瓶颈在于组装步骤:基线将冲突解决留给LLM介导的检索或生成,而不是版本感知的聚合。一个匹配设置的比较(相同的主干、检索、分块、TOP_K)表明,用候选提取加Python max(serial)替换LLM判断答案流水线,在FC-SH上(gpt-4o-mini)获得+10.8分的提升,从6K时的+8分扩大到262K时的+21分。这是一个全流水线效应(解析器、提示、格式和温度共同变化);隔离解析器是未来的工作。该配方在FC-SH上达到78.0%(gpt-4o-mini)、94.8%(gpt-4o),在FC-MH上达到30.2%(gpt-4o-mini,使用gpt-4o时升至51.5%),通过每跳确定性的Self-Ask扩展。在匹配的262K下,它比HippoRAG-v2高出+28分,比已发布的最佳FC-MH结果高出+20分。这一含义对该子领域具有纠正作用:冲突解决的瓶颈是组装(检索后聚合),而不是存储。一个LongMemEval知识更新检查表明,该机制从max(serial)移植到max(timestamp),但仅与LLM判断持平(57.8% vs 64.4%,n=45):确定性聚合是当前值冲突的正确原语,并且必须与问题类型感知处理组合,以实现更广泛的内存问答。

英文摘要

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.

2606.01417 2026-06-02 cs.AI 版本更新

GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway

GovAI-Pipe:面向土耳其电子政务门户的公民交互AI分层治理管道

Ahmet Kaplan

发表机构 * Turkey's e-Government Gateway(土耳其电子政务门户)

AI总结 针对土耳其电子政务平台缺乏结构化技术治理基础设施的问题,提出基于设计科学研究方法的四层AI治理管道GovAI-Pipe,将AI模型生命周期映射到治理检查点,并通过高风险用例验证其可审计的技术实现。

Comments 7 pages

详情
AI中文摘要

土耳其的电子政务门户(e-Devlet)为超过6800万注册用户提供9200多项政府服务,并越来越多地将人工智能集成到面向公民的应用中,如聊天机器人助手和资格评估。然而,目前没有结构化的技术治理基础设施将高级AI政策框架(如欧盟AI法案、OECD AI原则和土耳其自身的国家AI战略)与在集中式电子政务平台中部署AI的操作现实联系起来。我们提出GovAI-Pipe,这是一个使用设计科学研究方法设计的四层治理管道,将AI模型生命周期映射到治理检查点:(1)部署前验证,用于偏差测试、可解释性和隐私影响评估;(2)部署治理,用于风险等级分类和审批工作流;(3)运行时监控,用于漂移检测、公平性跟踪和人在回路升级;(4)事后治理,用于审计跟踪、回滚和公民补救。每一层都锚定到欧盟AI法案、GDPR数据保护框架和国家AI战略的具体条款。我们通过两个高风险e-Devlet用例演示该框架,展示GovAI-Pipe如何将治理原则作为可审计的技术管道组件进行操作化。

英文摘要

Turkey's e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasingly integrating artificial intelligence into citizen-facing applications such as chatbot assistants and eligibility assessments. However, no structured technical governance infrastructure currently connects high-level AI policy frameworks, such as the EU AI Act, OECD AI Principles, and Turkey's own National AI Strategy, to the operational reality of deploying AI within a centralized e-government platform. We propose GovAI-Pipe, a four-layer governance pipeline designed using Design Science Research methodology that maps the AI model lifecycle to governance checkpoints: (1) pre-deployment validation for bias testing, explainability, and privacy impact assessment; (2) deployment governance for risk-tier classification and approval workflows; (3) runtime monitoring for drift detection, fairness tracking, and human-in-the-loop escalation; and (4) post-incident governance for audit trails, rollback, and citizen redress. Each layer is anchored to specific provisions of the EU AI Act, the GDPR data protection framework, and the National AI Strategy. We demonstrate the framework through two high-risk e-Devlet use cases, showing how GovAI-Pipe operationalizes governance principles as auditable, technical pipeline components.

2606.01416 2026-06-02 cs.AI 版本更新

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

用于可靠的工具增强型大语言模型系统的自愈代理编排器

Rahul Suresh Babu, Adarsh Agrawal

发表机构 * Independent Researcher(独立研究者) Senior Member, IEEE(IEEE高级成员)

AI总结 提出一种自愈代理编排器,通过将可靠性视为有界运行时控制问题,映射故障信号、选择恢复动作并验证轨迹,在100任务故障注入基准上达到98.8%任务成功率,优于重试和完全重规划基线。

详情
AI中文摘要

工具增强型大语言模型(LLM)代理依赖于协调规划、检索、工具调用、验证、记忆和恢复的编排层。在这些系统中,故障不仅来自模型错误,还来自编排层问题,如工具超时、参数格式错误、过时上下文、矛盾证据、重试循环和未验证的中间输出。本文提出一种自愈代理编排器,将可靠性视为有界运行时控制问题。该编排器将可观察的故障信号映射到推断的故障类别,在显式预算下选择目标恢复动作,验证恢复轨迹,并记录可观察性痕迹。我们在一个100任务的受控故障注入基准上,将本方法与静态工作流、仅重试、ReAct风格和完全重规划基线进行比较。自愈方法实现了98.8%的任务成功率,而仅重试为94.5%,完全重规划为93.8%。匹配的恢复预算扫描显示,在每个测试预算下,自愈方法均优于仅重试和完全重规划,在单次恢复尝试下差距最大:分别为94.0%对比85.3%和88.2%。在受控的语义静默故障设置下,验证器引导的自愈将静默故障降至0.0%,而非验证基线更频繁地返回错误但看似合理的输出。紧凑的模型在环验证表明,当实时工具调用模型在本地故障注入工具上执行工具选择、参数生成和答案合成时,相同的恢复机制可以运行。这些结果提供了受控证据,表明故障感知、有预算和验证引导的编排提高了工具增强型LLM系统的可靠性和可诊断性。

英文摘要

Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8\% task success, compared with 94.5\% for retry-only and 93.8\% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0\% versus 85.3\% and 88.2\%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0\%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.

2606.01402 2026-06-02 cs.LG cs.AI 版本更新

Neural Network Compression by Approximate Differential Equivalence

基于近似微分等价的神经网络压缩

Ravi Dhiman, Andrea Passarella, Mirco Tribastone, Lorenzo Valerio

发表机构 * IMT School for Advanced Studies Lucca(利古里亚高级研究学院) IIT CNR(理工学院-国家科研委员会)

AI总结 提出一种通过聚合功能相似神经元来压缩神经网络的方法,利用近似前向微分等价将网络编码为多项式ODE系统,实现模型大小与精度的平滑权衡。

Comments 19 pages, 4 figures

详情
AI中文摘要

神经网络压缩通常通过基于局部重要性分数(例如基于幅度的剪枝)剪枝参数来实现。我们提出一种互补方法,通过聚合具有相似功能行为的神经元来压缩模型,而不是独立移除权重。我们的方法将训练好的网络编码为多项式ODE系统,并应用一种称为近似前向微分等价的 lumping 方法来识别具有近似匹配诱导动力学的神经元。单个容差参数 $\varepsilon$ 控制压缩水平,并在模型大小和预测精度之间诱导平滑权衡。我们在来自已知真实行为的非线性动力系统的合成数据集和公共回归基准上评估该方法。在这两种设置下,所提出的方法在保持精度的同时实现了显著的参数减少,并在相似的压缩水平下始终优于基于幅度的剪枝和Wanda。这些结果表明,基于微分等价的聚合是传统以权重为中心的剪枝的一种有原则且有效的替代方案。

英文摘要

Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We propose a complementary approach that compresses models by aggregating neurons with similar functional behavior rather than removing weights independently. Our method encodes a trained network as a polynomial ODE system and applies a lumping method called Approximate Forward Differential Equivalence to identify neurons with approximately matching induced dynamics. A single tolerance parameter, $\varepsilon$, controls the compression level and induces a smooth trade-off between model size and predictive accuracy. We evaluate the method on synthetic datasets derived from nonlinear dynamical systems with known ground-truth behavior and on public regression benchmarks. Across both settings, the proposed approach achieves substantial parameter reduction while preserving accuracy, and consistently compares favorably with magnitude-based pruning and Wanda at similar compression levels. These results suggest that differential equivalence-based aggregation is a principled and effective alternative to conventional weight-centric pruning.

2606.01400 2026-06-02 cs.CL cs.AI 版本更新

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

一致且独特:基于相似图最大独立集提示选择的LLM基准测试效率

Denica Kjorvezir, Marko Djukanović, Ana Gjorgjevikj, Gjorgjina Cenikj, Tome Eftimov

发表机构 * Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia(计算机系统部,乔塞夫·斯塔芬研究所,卢布尔雅那,斯洛文尼亚) Jožef Stefan International Postgraduate School, Ljubljana, Slovenia(乔塞夫·斯塔芬国际研究生学院,卢布尔雅那,斯洛文尼亚) Center for Astrophysics and Cosmology, University of Nova Gorica, Nova Gorica, Slovenia(天体物理与宇宙学中心,诺瓦戈里察大学,诺瓦戈里察,斯洛文尼亚)

AI总结 提出基于相似图最大独立集的提示选择框架,通过选择多样且非冗余的子集,在保持LLM排名一致性的同时显著减少基准测试成本。

详情
AI中文摘要

在全面基准测试中评估大型语言模型(LLM)既昂贵又耗时。我们提出了一种基于图的提示选择框架,将每个基准建模为相似图——如果提示在嵌入空间中的距离超过可配置阈值,则节点相连——并应用最大独立集(MIS)算法选择最大多样、非冗余的子集。我们评估了四种MIS求解器(CPLEX、GREEDY、Online-MIS、ReduMIS),涵盖六种嵌入模型、三种距离度量、六个百分位数阈值和四个基准(GPQA、IFEval、MMLU-Pro、Omni-MATH),涉及66个LLM。我们的核心假设——不同随机种子下的重复选择会产生一致的LLM排名,且可能不同于完整基准基线——得到强烈证实:在99.2%的随机配置中Kendall's $W \geq 0.90$(平均$W = 0.997 \pm 0.008$),而在较高百分位数阈值下,所选子集平均减少25-48%的提示。与完整基准的排名差异($\rho < 0.95$)仅发生在15.95%的配置中,主要集中在低阈值($p_{10}$-$p_{20}$)和基准(GPQA、IFEval)上,识别出过于密集的图是主要失败模式。

英文摘要

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2\% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48\% prompt reduction on average. Ranking divergence from the full benchmark ($ρ< 0.95$) occurs in only 15.95\% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.

2606.01393 2026-06-02 cs.CL cs.AI cs.CV 版本更新

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Dr. DocBench:专家级与困难文档解析的综合基准

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi, Konwoo Kim, Longtian Ye, Xiaolong Luo, Jinhe Bi, Henry Zhang, Haris Riaz, Xuan Zhang, Yunze Xiao, Bangya Liu, Tom Tang, Yunfei Zhao, Qunshu Lin, Zihan Wang, Minghao Liu, Michael Lingzhi Li, Yilun Du, Jesse Thomason, Rogerio Feris, Alex Pentland, Zexue He

发表机构 * Stanford University(斯坦福大学) MIT(麻省理工学院) Carnegie Mellon University(卡内基梅隆大学) University of Southern California(南加州大学) Harvard University(哈佛大学) IBM Research(IBM研究院) University of Arizona(亚利桑那大学) Duke University(杜克大学) UC Berkeley(加州大学伯克利分校) LMU Munich(慕尼黑路德维希-马克西米利安大学)

AI总结 提出Dr. DocBench基准,通过基于解析器失败的采样从多语言书籍语料库中选取挑战性文档,包含52个BISAC主题领域和65k高质量标注,用于评估专家级文档解析能力。

Comments 27 pages, 13 figures, 14 tables

详情
AI中文摘要

文档解析和识别是视觉语言模型(VLM)和文档处理系统的基本能力。然而,现有的光学字符识别(OCR)和文档解析基准在覆盖范围和难度上日益受限:许多基准专注于常见文档类型或均匀采样的页面,现代解析器在这些页面上已表现良好,而对专家领域结构(如化学公式、乐谱、复杂表格和跨页布局)的标注有限。我们引入了Dr. DocBench,一个面向专家级文档解析的难度感知基准。Dr. DocBench基于大规模多语言书籍语料库构建,涵盖52个BISAC主题领域,并通过基于解析器失败的采样选择挑战性文档,针对多个最先进系统难以处理的案例。它包含来自平均约100页的长文档的4,514个标注页面,具有65k高质量的页面级和块级标注,涵盖布局、阅读顺序、层次关系和特定领域的视觉内容。对基于流水线的解析器和通用VLM的评估表明,在现有基准上的强性能并不能迁移到我们的专家级文档解析中。我们的分析揭示了跨主题、内容类型和结构属性的重大失败,突显了Dr. DocBench作为诊断和推进文档智能的综合测试平台的作用。

英文摘要

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formula, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual contents. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.

2606.01386 2026-06-02 cs.AI cs.CL cs.DC cs.LG 版本更新

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

GuidaPA: 通过联邦学习为公共行政提供隐私保护的聊天机器人

Daniel M. Jimenez-Gutierrez, Albenzio Cirillo, Raffaele Nicolussi, Alessio Beltrame, Andrea Vitaletti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出GuidaPA,一个基于联邦学习(FL)在意大利公共行政文档上训练的隐私保护聊天机器人,通过参数高效的联邦微调(QLoRA)和角色访问控制,在保持数据本地化的同时实现了接近集中式微调的答案质量。

Comments Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)

详情
AI中文摘要

我们提出了GuidaPA,一个为意大利公共行政(PA)设计的隐私保护聊天机器人,它通过联邦学习(FL)在两个国家PA平台SIGESON和SIDFORS的文档上进行训练。我们的语料库包括约8页的SIGESON手册和31页的SIDFORS手册/常见问题解答;虽然本研究使用公开文档作为安全代理,但预期的部署将扩展到受限制的内部来源(例如,工单、官员手册、数据库提取),这些数据由于监管和组织约束无法集中汇集。GuidaPA集成了基于角色的访问控制、安全的客户端预处理、对非独立同分布效应的显式监控以及大语言模型的参数高效联邦微调。使用QLoRA(4位)进行15轮联邦训练,每个客户端采用80/20的训练-测试划分,我们使用ROUGE、BLEU-4和METEOR评估答案质量。最佳联邦模型达到了ROUGE-1/2/L分别为61.10/55.77/59.44,BLEU-4为45.02,METEOR为63.94——接近私有集中式微调的性能,同时保持数据在本地。与通用基线相比,领域微调将ROUGE-1从41.45提高到62.18,BLEU-4从26.97提高到50.90。总体而言,结果表明FL可以在不进行集中数据共享的情况下,为公共服务提供高质量的对话式AI。

英文摘要

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that can not be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94-close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing

2606.01385 2026-06-02 cs.SE cs.AI 版本更新

Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory

桥接需求与架构:基于外部知识和分层记忆的多智能体编排

Ruiyin Li, Yiran Zhang, Xiyu Zhou, Yangxiao Cai, Peng Liang, Weisong Sun, Jifeng Xuan, Zhi Jin, Yang Liu

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) Nanyang Technological University(南洋理工大学)

AI总结 提出MAAD框架,通过编排四个专业智能体(分析师、建模师、设计师、评估师),结合RAG注入架构标准与分层记忆机制,自动将需求规格转化为多视图架构蓝图并评估质量属性。

Comments 39 pages, 7 images, 5 tables, Manuscript submitted to a Journal (2026)

详情
AI中文摘要

软件架构设计是一个关键但本质上复杂且知识密集的阶段,需要平衡相互竞争的质量属性并适应不断变化的需求。传统上,这一过程耗时、劳动密集且严重依赖架构师,通常导致对替代架构分解和风格的探索有限,尤其是在敏捷开发的压力下。虽然基于LLM的智能体在各种软件工程任务中表现出色,但它们在架构设计中的应用仍然相对稀少,需要系统性的探索。为应对这些挑战,我们提出了MAAD(多智能体架构设计),这是一个知识驱动的框架,编排四个专业智能体(即分析师、建模师、设计师和评估师),自主协作地将需求规格转化为全面、多视图的架构蓝图,并附带质量属性评估。MAAD引入RAG将公认的架构标准和模式注入工作流,并利用分层记忆机制捕获设计历史以进行迭代优化。我们通过对比实验评估了MAAD与MetaGPT,使用10个案例研究中的定量架构级指标以及来自行业架构师对10个真实世界规格的定性反馈。结果表明,MAAD生成的架构比基线更完整、模块化和可追溯,其专用的评估智能体自主生成结构化质量评估报告,显著减少了手动验证工作。此外,我们发现生成架构的质量高度依赖于底层LLM的推理能力,其中GPT-5.2和Qwen3.5在大多数评估设置中优于其他模型。

英文摘要

Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality attributes and adapting to evolving requirements. Traditionally, this process has been time-consuming, labor-intensive, and heavily reliant on architects, often resulting in limited exploration of alternative architectural decompositions and styles, especially under the pressures of agile development. While LLM-based agents have shown promising performance across various software engineering tasks, their application to architecture design remains relatively scarce and requires systematic exploration. To address these challenges, we proposed MAAD (Multi-Agent Architecture Design), a knowledge-driven framework that orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to autonomously and collaboratively transform requirements specifications into comprehensive, multi-view architectural blueprints with quality attribute assessments. MAAD incorporates RAG to inject recognized architectural standards and patterns into the workflow and leverages a hierarchical memory mechanism that captures design history for iterative refinement. We evaluated MAAD through comparative experiments against MetaGPT, using quantitative architecture-level metrics across 10 case studies and qualitative feedback from industry architects on 10 real-world specifications. Results show that MAAD generates more complete, modular, and traceable architectures than the baseline, and its dedicated Evaluator agent autonomously produces structured quality evaluation reports that significantly reduce manual validation efforts. Furthermore, we found that the quality of the generated architecture heavily depends on the underlying LLM's reasoning capacity, with GPT-5.2 and Qwen3.5 outperforming other models across most evaluation settings.

2606.01382 2026-06-02 cs.LG cs.AI 版本更新

Efficient Exploration for Iterative Nash Preference Optimization

迭代纳什偏好优化的高效探索

Tianlong Nan, Xiaopeng Li, Christian Kroer, Tianyi Lin

发表机构 * Columbia University(哥伦比亚大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对通用偏好模型下的迭代NLHF,提出显式探索算法,结合SFT正则化与对抗性策略探索,实现O(√T)遗憾界,避免对KL正则化参数的指数依赖。

Comments 49 pages

详情
AI中文摘要

偏好对齐是改进大语言模型的核心,但当人类偏好是循环、非传递或无法用标量奖励表示时,标准的基于奖励的公式可能具有限制性。从人类反馈中学习纳什均衡(NLHF)通过将对齐建模为偏好博弈并针对纳什均衡而非奖励最大化来解决这一限制。然而,可扩展NLHF的学习理论基础仍然有限。现有的遗憾保证依赖于基于oracle的方法,这些方法估计一个通用偏好模型并求解KL正则化的极小极大问题,而迭代NLHF方法直接优化策略级别的偏好损失,更易实现但缺乏遗憾保证。我们研究通用偏好模型下的在线迭代NLHF,并确定探索是关键障碍。首先,我们表明标准迭代NLHF可能遭受对KL正则化参数的指数依赖,揭示了通过策略更新进行的隐式探索不足以控制遗憾。其次,我们提出一种显式探索的迭代NLHF算法,结合了基于SFT的正则化与对抗性策略探索。所得方法保留了迭代NLHF的直接策略优化结构,避免了显式偏好模型估计,并实现了$O(\sqrt{T})$的遗憾界,而不依赖于KL正则化参数的指数项。我们表明,通过访问一个极小极大oracle,遗憾可以改进为$O(\log(T))$,阐明了学习通用偏好博弈中的计算-统计权衡。最后,我们将我们的方法实例化用于LLM微调,并在多个基准上对\texttt{Llama-3-8B-Instruct}进行评估,其中显式探索在现有NLHF基线上产生了一致的改进。

英文摘要

Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of scalable NLHF remain limited. Existing regret guarantees rely on oracle-based methods that estimate a general preference model and solve KL-regularized minimax problems, while iterative NLHF methods directly optimize policy-level preference losses and are easier to implement but lack regret guarantees. We study online iterative NLHF under general preference models and identify exploration as the key obstacle. First, we show that standard iterative NLHF can suffer an exponential dependence on the KL-regularization parameter, revealing that implicit exploration through policy updates is insufficient for controlling regret. Second, we propose an explicitly exploratory iterative NLHF algorithm that combines SFT-based regularization with adversarial policy exploration. The resulting method retains the direct policy optimization structure of iterative NLHF, avoids explicit preference model estimation, and achieves an $O(\sqrt{T})$ regret bound without an exponential dependence on the KL-regularization parameter. We show that the regret can be improved to $O(\log(T))$ with access to a minimax oracle, clarifying the computational-statistical tradeoff in learning general preference games. Finally, we instantiate our method for LLM fine-tuning and evaluate it on \texttt{Llama-3-8B-Instruct} across multiple benchmarks, where explicit exploration yields consistent improvements over existing NLHF baselines.

2606.01375 2026-06-02 cs.CY cs.AI 版本更新

Beyond Access: Guided LLM Scaffolding for Independent Learning in Undergraduate Statistics

超越访问:引导式LLM支架在本科统计学自主学习中的应用

Mohammad Amanlou, Yasaman Amou-Jafari, Mehrad Livian, Fatemeh Boloukazari, Fereshte Bagheri, Behnam Bahrak

发表机构 * School of Electrical and Computer Engineering, University of Tehran, Iran(伊朗塔里哈大学电气与计算机工程学院) Tehran Institute for Advanced Studies, Khatam University, Iran(伊朗卡塔姆大学泰赫兰高级研究院)

AI总结 本研究通过准实验比较无LLM、无限制LLM和引导式LLM三种条件,发现引导式LLM使用能促进以推理为导向的交互模式,提升无辅助测验表现,并改善自我评估校准,表明LLM作为教育工具需通过支架设计实现推理伙伴而非答案获取工具。

Comments 10 pages, conference: Proceedings of the 34th International Conference on Computers in Education. Asia-Pacific Society for Computers in Education

详情
AI中文摘要

大型语言模型(LLM)正日益进入学生的学习实践,但其教育价值取决于它们是支持推理还是使学生在不参与的情况下完成任务。本研究考察了在本科概率与统计课程中引导式LLM的使用,重点关注分配访问与实际交互质量之间的差距。在一个为期四周的准实验暑期项目中,学生被分为三个平衡条件:无LLM访问、无限制LLM访问和引导式LLM访问。引导条件使用与无限制条件相同的LLM平台,但学生接受了明确的培训和规则,以促进以推理为导向的求助、逐步提示、验证和道德使用。所有测验和延迟的期末考试均在无LLM或外部帮助的情况下完成,使我们能够区分AI支持的练习表现与独立学习。结果表明,与无限制访问相比,引导式使用与更清晰的学习导向交互模式相关,尤其是在优先考虑推理而非最终答案以及请求逐步支持方面。引导式LLM学生在干预阶段的无帮助测验中表现更强,而无限制访问似乎更有助于辅助练习完成,而非持续提高独立表现。可用时间测量不支持简单的基于持续时间的解释,自我评估校准表明在引导式LLM条件下,感知理解与展示理解之间的对齐更好。总体而言,仅提供LLM访问似乎是一种不完整的教育干预。对于人工智能教育(AIED),核心设计挑战是如何搭建支架,使学生使用LLM的方式使这些系统成为推理伙伴而非答案获取工具。

英文摘要

Large language models (LLMs) are increasingly entering students' learning practices, but their educational value depends on whether they support reasoning or enable task completion without engagement. This study examines guided LLM use in an undergraduate Probability and Statistics course, focusing on the gap between assigned access and actual interaction quality. In a four-week quasi-experimental summer program, students were organized into three balanced conditions: no LLM access, unrestricted LLM access, and guided LLM access. The guided condition used the same LLM platform as the unrestricted condition, but students received explicit training and rules promoting reasoning-focused help-seeking, stepwise hints, verification, and ethical use. All quizzes and the delayed final exam were completed without LLM or external assistance, allowing us to distinguish AI-supported practice performance from independent learning. Results show that guided use was associated with clearer learning-oriented interaction patterns than unrestricted access, especially in prioritizing reasoning over final answers and requesting stepwise support. Guided-LLM students showed stronger no-help quiz performance during the intervention phase, whereas unrestricted access appeared more useful for assisted practice completion than for consistently improving independent performance. Available time measures did not support a simple duration-based explanation, and self-assessment calibration suggested better alignment between perceived and demonstrated understanding in the Guided-LLM condition. Overall, LLM access alone appears to be an incomplete educational intervention. For Artificial Intelligence in Education (AIED), the central design challenge is to scaffold how students use LLMs so that these systems function as partners in reasoning rather than answer-getting tools.

2606.01372 2026-06-02 cs.LG cs.AI cs.CV 版本更新

BRo-JEPA: Learning Modular Arithmetic in Latent Space

BRo-JEPA:在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) NYU Langone Health(纽约大学Langone医疗中心)

AI总结 本文提出BRo-JEPA模型,通过在潜空间中施加模10算术的循环结构,实现零样本泛化,解决了标准模型无法外推未见操作的问题。

Comments 10 pages, 14 figures

详情
AI中文摘要

神经网络能否学习抽象的代数规则,还是仅仅记忆训练模式?我们使用MNIST数字作为状态,模算术运算作为动作,在JEPA风格的潜世界模型中进行研究。标准监督基线和带有加法操作嵌入的JEPA模型能够学习已见操作,但无法可靠地外推到未见操作。为了弥补这一差距,我们引入了一个块旋转预测器,在潜空间中施加模10算术的循环结构。这使得模型具有强大的零样本泛化能力,最佳的基于ResNet的JEPA块旋转模型达到了99.46%的零样本准确率和99.46%的展开准确率。我们的结果表明,当架构与问题结构匹配时,潜世界模型可以学习符号变换规则。我们的代码可以在此处访问:https://github.com/DL-World-Models/mnist-math。

英文摘要

Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as states and modular arithmetic operations as actions in a JEPA-style latent world model. Standard supervised baselines and JEPA models with additive operation embeddings fit seen operations but fail to extrapolate reliably to unseen ones. To bridge this gap, we introduce a block-rotation predictor that imposes the circular structure of modulo-10 arithmetic in latent space. This enables strong zero-shot generalization, with the best ResNet-based JEPA block-rotation model achieving 99.46\% zero-shot and 99.46\% rollout accuracy. Our results suggest that latent world models can learn symbolic transformation rules when architecture matches the structure of the problem. Our code can be \href{https://github.com/DL-World-Models/mnist-math}{accessed here}.

2606.01364 2026-06-02 cs.CR cs.AI cs.SE 版本更新

Needles at Scale: LLM-Assisted Target Selection for Windows Vulnerability Research

大规模针尖:LLM辅助的Windows漏洞研究目标选择

Michael J. Bommarito

发表机构 * Microsoft(微软) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Symbolicate-Enrich-Sample流水线,通过符号恢复、结构特征提取和低成本语言模型排序,从Windows系统数千万函数中筛选出约2.2万个候选目标,解决漏洞研究中目标选择瓶颈问题。

Comments 9 pages, 3 figures, 2 tables

详情
AI中文摘要

现代操作系统的攻击面如同大海捞针:数千个签名二进制文件和数百万个函数,几乎没有一个与任何给定漏洞相关。人类分析师或LLM代理必须在分析之前选择值得阅读的函数。在整个操作系统范围内,这种目标选择(而非分析)才是约束条件。我们提出了Symbolicate-Enrich-Sample,一个低成本的批处理流水线,将生产级Windows二进制文件语料库转化为可查询、优先级排序的研究队列。我们(i)通过自动获取公共符号文件并将其与恢复的调用图结合,恢复剥离符号的供应商二进制文件的函数级符号;(ii)为每个命名函数附加廉价、确定性的结构特征,并基于这些特征使用低成本语言模型分配可达性层级、风险级别、漏洞类别假设和理由;(iii)通过优先级加权重要性采样器抽取多样化、优先排序的批次。贡献在于一个选择基础:下游检测器或LLM代理在其上运行的优先级排序层。在包含7,231,419个函数的整个Windows镜像上,标签具有显著的选择性,叠加确定性过滤器后留下约22K个函数的候选列表:候选的针尖,数量足够人类或代理处理。我们描述了流水线的选择性及其失败模式,介绍了方法论并报告了总体统计数据;由于法律和双重用途原因,我们暂不公开推导出的数据集。

英文摘要

The attack surface of a modern operating system is a haystack: thousands of signed binaries and millions of functions, almost none relevant to any given vulnerability. A human analyst or an LLM agent must pick the function worth reading before analyzing it. At whole-OS scope, this target selection, not the analysis, is the binding constraint. We present Symbolicate-Enrich-Sample, a low-cost batch pipeline that turns a corpus of production Windows binaries into a queryable, priority-ranked research queue. We (i) recover function-level symbols for stripped vendor binaries by auto-fetching the public symbol files and joining them to a recovered call graph; (ii) attach cheap, deterministic structural features to each named function and, conditioned on those features, use a low-cost language model to assign a reachability tier, a risk level, a bug-class hypothesis, and a rationale; and (iii) draw diverse, prioritized batches via a priority-weighted importance sampler. The contribution is a selection substrate: the prioritization layer a downstream detector or LLM agent runs on top of. Across a whole Windows image of 7,231,419 functions, the labels are markedly selective, and stacking deterministic filters on them leaves a ~22K-function shortlist: the candidate needles, few enough for a human or agent to work through. We characterize the pipeline's selectivity and its failure modes, describe the methodology, and report aggregate statistics; we withhold the derived dataset for legal and dual-use reasons.

2606.01352 2026-06-02 cs.AI 版本更新

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

FlowTime: 基于流的个性化先验实现连续生成式观看时间预测

Hongxu Ma, Han Zhou, Chenghou Jin, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Shanghai University of Finance and Economics(上海财经大学) Kuaishou Technology(快手科技) Tongji University(同济大学)

AI总结 针对现有观看时间预测方法在范式上的局限性,提出连续生成式回归范式及FlowTime方法,利用一步生成变分自编码器和基于流的个性化先验,有效建模多模态用户-物品交互模式,显著提升预测性能。

Comments Accepted by KDD'26

详情
AI中文摘要

观看时间已成为短视频推荐系统中优化深度用户参与度的关键指标。然而,当前的观看时间预测方法存在固有的范式特定局限性。直接回归因单峰高斯假设而面临均值崩溃,序数回归因刚性离散化而受到量化误差的困扰。同样,离散生成式回归则面临高推理延迟和启发式词汇表设计的问题。除了这些具体缺陷外,一个共同的不足是无法捕捉用户-物品交互模式的内在多模态性和异质性。为应对这些挑战,我们首先从因果角度重新审视观看时间预测问题,并将这些用户特定模式识别为调节观看时间结果的结构性混淆因素,其中相同的兴趣在不同用户习惯条件下表现为不同的观看时间结果。然后,我们正式提出一种新的(即第四种)范式——连续生成式回归,并引入FlowTime,一种利用一步生成变分自编码器的新方法。FlowTime有效规避了迭代去噪的延迟,同时保持了连续潜在空间的表达能力。此外,我们设计了一种基于流的个性化先验,利用归一化流将标准高斯先验扭曲为复杂的历史条件流形,从而实现对多模态交互模式的自适应建模。最后,我们构建了TimeRec,首个开源观看时间预测库,并引入一种新的个性化指标,以建立严格的基准测试标准。广泛的离线实验和在线A/B测试表明,FlowTime显著优于现有最先进方法。

英文摘要

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (or the fourth) paradigm -- Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages NFs to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP Library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over SOTA methods.

2606.01351 2026-06-02 cs.AI 版本更新

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

识别你的编排器:面向LLM多智能体系统的熵动力学视角

Junze Zhu, Weihao Chen, Xuanwang Zhang, Zhen Wu, Xinyu Dai

发表机构 * Junze Zhu, Weihao Chen, Xuanwang Zhang, Zhen Wu, Xinyu Dai(朱俊泽、陈伟浩、张轩望、伍震、戴新宇)

AI总结 提出平均场熵动力学框架,通过逆工作流生成(IWG)合成高复杂度基准,揭示推理型模型作为编排器时因上下文压缩而失效的“推理陷阱”,为多智能体系统架构设计提供物理可解释参数。

详情
AI中文摘要

从单轮模型到多智能体系统(MAS)的转变有望增强问题解决能力,但集中式编排拓扑仍然是一个关键脆弱点。为分析此问题,我们提出平均场熵动力学框架,将编排过程建模为由任务解决和累积上下文加载两种竞争力量支配的系统。为便于验证,我们引入逆工作流生成(IWG),一种多智能体流水线,用于合成具有密集中间检查点的过程可验证、高复杂度基准。我们证明熵动力学模型拟合经验轨迹,提供量化系统稳定性和性能崩溃的物理可解释参数。关键的是,我们的分析揭示了“推理陷阱”:尽管推理密集型模型在孤立任务中表现出色,但由于上下文压缩,它们作为编排器时经常失败。阐明编排器背后的物理机制并量化系统不确定性,为MAS的架构设计提供了见解。

英文摘要

The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a ``Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. Elucidating the physical mechanisms underlying the Orchestrator and quantifying systemic uncertainty offers insights for the MASs' architectural design.

2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET 版本更新

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite:一种轻量级频率分解线性模型,具有自适应可逆归一化,用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * Hamdard University(哈姆达德大学)

AI总结 提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器,通过可学习的无损谱滤波器进行频带分解和线性预测,并引入自适应可逆实例归一化(A-RevIN)处理非平稳性,在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

Comments 26 pages, 5 figures

详情
AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色,但仍存在两个问题:可逆实例归一化(RevIN)使用单一回溯统计量对整个预测区间进行去归一化,在非平稳性下不准确;时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器:一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带,由每个频带的线性头进行预测,与低通截断方法不同,高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型,在长回溯(L=336)时,其平均误差低于PatchTST Transformer(0.3244 vs 0.3587 MSE),同时参数减少4倍,内存减少2.2倍,在单块4 GB笔记本GPU上每轮时间减少2.2倍;尽管幅度不大,但在所有匹配单元上的配对Wilcoxon检验中,其改进具有统计显著性(p < 1e-5)。我们进一步引入自适应可逆实例归一化(A-RevIN),一种自适应可逆归一化,严格推广了RevIN(在其门关闭时完全恢复),在非平稳性下起作用,并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集(ILI,MSE降低约5%)和一个受控合成漂移扫描中验证了这一点,其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融(Linear和RLinear是FreqLite的特例),所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.

2606.01324 2026-06-02 cs.IT cs.AI math.IT 版本更新

Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks

数字孪生辅助的自适应多智能体深度强化学习用于Open-RAN无人机赋能6G网络中的智能频谱与资源管理

Marwan Dhuheir, Thang X. Vu, Symeon Chatzinotas

发表机构 * University of Cambridge(剑桥大学) University of Bristol(布里斯托大学)

AI总结 针对无人机辅助6G网络中动态频谱与资源管理难题,提出数字孪生辅助的自适应深度强化学习框架,结合粒子群优化轨迹与多智能体DRL分配资源,显著提升频谱效率、数据速率和能量利用率。

Comments accepted and presented at IEEE ICC-2026 conference paper

详情
AI中文摘要

向6G无线网络的演进设想了一种无缝智能、支持Open-RAN的架构,其中无人机在扩展覆盖、增强弹性以及确保地面用户部署的可靠连接方面发挥关键作用。然而,由于非线性系统交互、移动性引起的拓扑变化以及严格的延迟和能量约束,在这种高度动态的无人机辅助环境中有效管理频谱和资源仍然是一个主要挑战。为了解决这些挑战,我们提出了一种数字孪生辅助的自适应深度强化学习框架,该框架能够在分布式地面用户之间实现智能频谱共享和资源分配。复杂的优化问题被分解为使用粒子群优化的无人机轨迹优化和通过多智能体深度强化学习的动态频谱-功率-关联管理。这种混合数字孪生驱动的方法实现了智能、上下文感知的决策制定和无人机之间的自适应协调。大量仿真表明,在频谱效率、数据速率和能量利用方面取得了显著增益,展示了通往自我进化、自主的6G无人机和地面用户连接的变革性路径。

英文摘要

The evolution toward 6G wireless networks envisions a seamlessly intelligent, Open-RAN-enabled architecture where unmanned aerial vehicles (UAVs) play a pivotal role in extending coverage, enhancing resilience, and ensuring reliable connectivity for ground users deployment. However, efficiently managing spectrum and resources in such highly dynamic UAV-assisted environments remains a major challenge due to nonlinear system interactions, mobility-induced topology variations, and stringent latency and energy constraints. To address these challenges, we propose a digital twin (DT)-assisted adaptive deep reinforcement learning (DRL) framework that enables intelligent spectrum sharing and resource allocation across distributed ground users. The complex optimization problem is decomposed into UAV trajectory optimization using particle swarm optimization (PSO) and dynamic spectrum-power-association management via multi-agent DRL (MADRL). This hybrid DT-driven approach empowers intelligent, context-aware decision-making and adaptive coordination among UAVs. Extensive simulations demonstrate significant gains in spectral efficiency, data rates, and energy utilization, showcasing a transformative path toward self-evolving, autonomous 6G UAV and ground users (GUs) connectivity.

2606.01323 2026-06-02 cs.CL cs.AI 版本更新

DiffuSent: Towards a Unified Diffusion Framework for Aspect-Based Sentiment Analysis

DiffuSent:面向方面级情感分析的统一扩散框架

Shu Long, Yanglei Gan, Xuchuan Zhou

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Southwest Petroleum University(西南石油大学) Southwest Minzu University(西南民族大学)

AI总结 提出非自回归扩散框架DiffuSent,将方面级情感分析的所有子任务统一为边界去噪扩散过程,通过对比去噪训练策略解决重复预测问题,在28个设置上优于现有生成式和跨度式系统,并实现高达181倍的推理加速。

详情
AI中文摘要

方面级情感分析(ABSA)包含七个不同的子任务,每个子任务关注不同的提取元素。尽管生成模型在统一方面情感分析中取得了成功,现有方法通常依赖于自回归的逐词生成,未能捕捉方面和意见术语的整体信息,导致边界不敏感,特别是在多词方面和意见术语的上下文中。为了解决这些问题,我们提出了DiffuSent,一个非自回归扩散框架,系统地将所有ABSA子任务公式化为边界去噪扩散过程,逐步在噪声状态上细化边界。此外,我们引入了一种对比去噪训练策略,有效解决了扩散过程中引入的细微变化导致的重复预测问题。在28个设置(7个子任务×4个数据集)上的大量实验表明,DiffuSent在最强生成式和跨度式系统上实现了持续改进。DiffuSent在多词三元组上表现出显著增益,平均F1提升+2.48,并在包含多个情感三元组的句子中保持稳健的提取准确性。此外,非自回归解码实现了显著的效率优势,推理速度比自回归生成基线快达181倍。

英文摘要

Aspect-Based Sentiment Analysis (ABSA) encompasses seven distinct subtasks, each focusing on different extracted elements. Despite the proven success of generative models in unified aspect sentiment analysis, existing approaches often rely on auto-regressive token-by-token generation without grasping the whole information of the aspect and opinion terms, resulting in boundary insensitivity, particularly in context of multi-word aspect and opinion terms. To address these issues, we present DiffuSent, a non-auto-regressive diffusion framework that systematically formulates all ABSA subtasks as boundary denoising diffusion processes, progressively refining boundaries over noisy states. Furthermore, we introduce a contrastive denoising training strategy which effectively address duplicate predictions with subtle variations introduced by diffusion process. Extensive experiments across 28 settings (7 subtasks x 4 datasets) demonstrate that DiffuSent achieves delivers consistent improvements over the strongest generative and span-based systems. DiffuSent exhibits notable gains on multi-word triplets, achieving an average improvement of +2.48 F1, and maintains robust extraction accuracy in sentences containing multiple sentiment triplets. Moreover, the non-auto-regressive decoding enables substantial efficiency benefits, reaching up to 181 times faster inference than auto-regressive generative baselines

2606.01322 2026-06-02 cs.CL cs.AI 版本更新

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

TukaBench: 一个基于文化的非洲语言越狱基准

Victor Akinode, Senyu Li, Wassim Hamidouche, Waqas Zamir, Inbal Becker-Reshef, David Ifeoluwa Adelani

发表机构 * Mila - Quebec AI Institute(魁北克人工智能研究院) McGill University(麦吉尔大学) Microsoft AI for Good Research Lab(微软人工智能造福人类研究实验室) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

AI总结 针对大型语言模型在非洲低资源语言上的安全评估缺失,提出TUKABENCH基准,通过四种设置(直接翻译、文化适应翻译、人工策划提示、代码切换提示)评估语言、文化背景和提示规避性对模型安全的影响,发现非洲语言提示降低拒绝率,并引入Deflection指标和人工验证以解决模型理解失败和评判可靠性问题。

Comments Under review

详情
AI中文摘要

大型语言模型(LLMs)的安全评估仍然高度以英语为中心,导致低资源语言(LRLs),特别是非洲语言,严重缺乏探索。我们引入了TUKABENCH,一个针对七种非洲语言的越狱基准,它通过四种设置将JailbreakBench(JBB)扩展到直接翻译之外:JBB提示的人工翻译、适应非洲背景的英语提示后人工翻译、通过与GPT-5.2交互验证的人工策划提示,以及结合英语和非洲语言的代码切换提示,从而隔离语言、文化背景和提示规避性对模型安全的影响。在闭源和开源模型中,使用非洲语言提示相比英语减少了拒绝,其中文化适应的提示导致最少的拒绝。评估还揭示了两个结构性限制:模型理解失败和低资源语言中LLM作为评判者的可靠性降低。为了捕捉前者,我们在“拒绝”和“越狱”之外引入了“回避”;为了评估后者,我们通过人工标注验证输出,显示在低资源语言和较少支持的脚本中,评判者与人类的一致性下降。

英文摘要

Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African contexts followed by human translation, human-curated prompts validated through interactions with GPT-5.2, and code-switched prompts combining English and African languages, isolating the effect of language, cultural grounding, and prompt evasiveness on model safety. Across closed and open models, prompting in African languages reduces refusal relative to English, with culturally adapted prompts leading to least refusal. The evaluation also surfaces two structural limitations: model comprehension failures and reduced LLM-as-a-judge reliability in LRLs. To capture the first, we introduce Deflection alongside Refused and Jailbroken; to assess the second, we validate outputs with human annotations, showing that judge-human agreement drops in lower-resource languages and less commonly supported scripts.

2606.01316 2026-06-02 cs.AI 版本更新

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Science Earth: 迈向面向AI原生科学发现的行星级操作系统

Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma, Yifan Wen, Jinglin Jian, Jiacheng Ge, Xiangru Tang, Bo An, Ming Yin, Sanfeng Wu, Mengdi Wang, Le Cong

发表机构 * Department of Pathology, Department of Genetics, Stanford University School of Medicine(病理学系、遗传学系,斯坦福大学医学院) Princeton AI Lab, Department of Electrical & Computer Engineering, Princeton University(普林斯顿人工智能实验室、电气与计算机工程系,普林斯顿大学) Scripps Research, La Jolla, CA, USA(斯克里普斯研究机构,洛杉矶,加利福尼亚州,美国) Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine(生物统计学部、人口健康系,纽约大学格罗斯曼医学院) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学) Department of Computer Science, Yale University(计算机科学系,耶鲁大学) Department of Physics, Princeton University(物理系,普林斯顿大学)

AI总结 提出Science Earth行星级科学运行时,通过EACN协议实现AI能力动态连接与自组织协作,在跨太平洋Kuramoto同步研究和单细胞分析中验证了分布式自校正科学推理。

详情
AI中文摘要

科学发现需要在广阔的搜索空间中运用智能、毅力和偶然性。如今,顶尖科学能力仍然孤立——一个AI系统用于生物分析,另一个用于临床推理、数学推导或材料模拟——并且没有预设计的团队能够预见一个问题所需的所有技能。Science Earth是一个行星级科学运行时,其中任何能力——模拟集群、湿实验室机器人、证明引擎、单细胞管道——都可以相互连接,协作结构由问题本身涌现。其底层EACN协议让能力能够相互发现、协商任务所有权,并在不相容的证据标准之间进行裁决,而无需事先知道谁将遇见谁。这将组织挑战从工作流设计转向开放式连接。两次运行在结构不同的条件下验证了这一点。在一项跨太平洋高阶Kuramoto同步研究中,智能体在30分钟内识别并纠正了Ott-Antonsen解析理论中一个在洛伦兹极限外失效的闭合比率假设。在针对488万细胞Kang 2024泛癌图谱的八智能体单细胞运行中,异质能力在64.9小时窗口内耦合,仅有一条结构外部指令,产生了三个新的结果层,并将发现与一项关于相邻CCR8- TIGIT+ Treg亚群的独立湿实验室研究进行锚定。这些案例是首次实证读数,而非基准测试。它们表明,当AI能力真正可连接且协调从问题中涌现时,科学推理成为一个分布式、自校正的过程——这是向行星级AI原生发现迈出的一步。

英文摘要

Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed--one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation--and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability--a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline--can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process--a step towards scaling AI-native discovery to the planet.

2606.01314 2026-06-02 cs.AI 版本更新

SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems

SkillSmith: 技能与工具的协同进化用于自我改进的智能体系统

Yangbo Wei, Zhen Huang, Shaoqiang Lu, Junhong Qian, Qifan Wang, Chen Wu, Lei He

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东部技术研究所) University of Science and Technology of China(中国科学技术大学) Southeast University(东南大学) Ningbo Institute of Digital Twin(宁波数字孪生研究所)

AI总结 提出SkillSmith框架,通过统一提案空间和Lotka-Volterra生态效用模型实现技能与工具的协同进化,在多个基准测试中显著提升性能。

详情
AI中文摘要

最近的自进化智能体表明,技能可以通过执行被发现、精炼和积累。然而,现有的技能进化框架通常假设固定的工具层,并独立评估每个技能,限制了它们修复工具级故障或推理技能间交互的能力。我们提出SkillSmith,一个协同感知的技能-工具协同进化框架。SkillSmith引入了一个统一的提案空间,其中反思产生原子束,共同修改技能和工具,允许在技能进化识别出可重用的能力缺口时,对工具进行包装、编辑、组合、拆分或淘汰。为了指导这种联合搜索,SkillSmith维护了一个受Lotka-Volterra动力学启发的生态效用模型,其中从执行轨迹估计的交互矩阵捕获技能间的成对互补和冲突,并为检索、变异优先级和淘汰提供压力信号。此外,SkillSmith记录反模式,包括失败特征、因果归因和补救措施,以加速诊断并否决重复已知错误的提案。在包括WildClawBench在内的三个基准测试和五个Qwen3.5模型规模上的实验表明,SkillSmith始终优于强基线,并且随着任务复杂性和多技能共激活的增加,增益会放大。

英文摘要

Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently, limiting their ability to repair tool-level failures or reason about interactions among skills. We propose SkillSmith, a synergy-aware skill-tool co-evolution framework. SkillSmith introduces a unified proposal space in which reflection produces atomic bundles that jointly modify skills and tools, allowing tools to be wrapped, edited, composed, split, or retired when skill evolution identifies a reusable capability gap. To guide this joint search, SkillSmith maintains an ecological utility model inspired by Lotka-Volterra dynamics, where an interaction matrix estimated from execution traces captures pairwise complementarity and conflict among skills and provides pressure signals for retrieval, mutation prioritization, and retirement. Furthermore, SkillSmith records anti-patterns, including failure signatures, causal attributions, and remedies, to accelerate diagnosis and veto proposals that repeat known mistakes. Experiments on three benchmarks, including WildClawBench, and five Qwen3.5 model scales show that SkillSmith consistently outperforms strong baselines, with gains that amplify as task complexity and multi-skill co-activation increase.

2606.01313 2026-06-02 cs.RO cs.AI 版本更新

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

PSG-Nav: 通过多元宇宙决策的概率场景图导航

Rufeng Chen, Yue Chang, Xiaqiang Tang, Hechang Chen, Sihong Xie

发表机构 * Tsinghua University(清华大学)

AI总结 提出PSG-Nav方法,通过构建3D概率场景图并利用多元宇宙决策从联合分布中采样最可能的世界设置,以处理开放词汇导航中的感知不确定性,并引入证据经验校准器实现在线终身适应,在多个基准上取得最新最优结果。

Comments 21 pages, 7 figures. ICML 2026

详情
AI中文摘要

开放词汇导航要求具身智能体管理由语义歧义和模型错误引起的显著感知不确定性。然而,大多数现有工作满足于局部最优的确定性方法,剥夺了在多个复合可能性上的复杂导航决策,而这些对于全局更优解至关重要。在本文中,我们提出概率场景图导航(PSG-Nav),它构建了一个3D概率场景图,使用完整的语义类别分布来考虑感知不确定性。为了有效利用局部分布来组合和推理最优导航地标,我们提出多元宇宙决策,从联合分布中采样多个最可能的世界设置,并基于地标与多元宇宙之间的兼容性评估导航地标。为了减轻开放词汇导航中因认知不确定性导致的误报,我们引入证据经验校准器,通过将检测与过去成功和失败的记忆进行交叉验证,实现在线终身适应。在广泛使用的基准MP3D、HM3D和HSSD上的大量实验表明,PSG-Nav建立了新的最先进结果,分别实现了66.1%、44.8%和67.9%的成功率。代码可在https://psg-nav.github.io/获取。

英文摘要

Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HM3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving Success Rates of 66.1%, 44.8%, and 67.9%, respectively. Code is available at: https://psg-nav.github.io/

2606.01312 2026-06-02 eess.SP cs.AI cs.NI 版本更新

A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks

面向可扩展战术自主防御车辆网络的以通信为中心的6G-LLM架构

Kiran Khurshid, Shumaila Javaid, Nasir Saeed

发表机构 * Department of Computer and Software Engineering, National University of Sciences and Technology (NUST), Islamabad, Pakistan(计算机与软件工程系,国家科学与技术大学(NUST),伊斯兰堡,巴基斯坦) Department of Control Science and Engineering, College of Electronics and Information Engineering, Tongji University and National Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, China(控制科学与工程系,电子与信息工程学院,同济大学,以及自主智能无人机系统国家重点实验室,同济大学,中国) Department of Electrical and Communication Engineering, UAE University, Al-Ain 15551, UAE(电子与通信工程系,阿联酋大学,阿恩15551,阿联酋)

AI总结 提出一种以通信为中心的分层架构,通过集成边缘辅助大语言模型推理与6G语义通信,在战术自主防御车辆网络中实现协调效率提升、通信开销降低和延迟韧性增强。

Comments 10 pages, accepted in IEEE Network Magazine

详情
Journal ref
K. Khurshid, S. Javaid and N. Saeed, "A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks," in IEEE Network, Early access, 2026
AI中文摘要

人工智能(AI)与新兴6G网络的融合为战术自主车辆系统的可扩展协调带来了新机遇。本文提出了一种以通信为中心的分层架构,用于战术自主防御车辆网络(TADVNs),该架构将边缘辅助大语言模型(LLM)推理与6G连接和语义通信相结合。该框架旨在提高协调效率、减少通信开销,并在不断扩大的车队规模操作下增强延迟韧性。与依赖结构化特征处理和基于规则协调的传统任务特定AI流水线不同,所提出的方法在分层边缘-云通信架构中引入了语义抽象和上下文感知决策支持。我们通过蒙特卡洛模拟,在竞争网络条件下对5-30辆车的车队规模进行了通信和协调性能评估。结果表明,在30辆车规模下,与基于5G的传统AI基线相比,6G-LLM配置实现了75.2%的延迟降低(29.1毫秒对比117.5毫秒),任务成功率提高68.7个百分点(82.9%对比14.2%),通信开销降低88.6%。这些发现表明,当语义推理与低延迟6G连接相结合时,在协调和通信方面具有可衡量的优势。

英文摘要

The integration of Artificial Intelligence (AI) and emerging 6G networks introduces new opportunities for scalable coordination in tactical autonomous vehicle systems. This paper proposes a communication-centric hierarchical architecture for Tactical Autonomous Defense Vehicle Networks (TADVNs) that models the integration of edge-assisted Large Language Model (LLM) reasoning with 6G-enabled connectivity and semantic communication. The framework is designed to improve coordination efficiency, reduce communication overhead, and enhance latency resilience under increasing fleet-scale operation. Unlike conventional task-specific AI pipelines that rely on structured feature processing and rule-based coordination, the proposed approach incorporates semantic abstraction and context-aware decision support within a layered edge-cloud communication architecture. We evaluate communication and coordination performance via Monte Carlo simulations across fleet sizes of 5-30 vehicles under contested network conditions. Results indicate that at a 30-vehicle scale, the 6G-LLM configuration achieves 75.2% latency reduction (29.1 ms vs. 117.5 ms), a 68.7 percentage point increase in mission success rate (82.9% vs. 14.2%), and an 88.6% reduction in communication overhead compared to a 5G-based conventional AI baseline. These findings demonstrate measurable benefits in coordination and communication when semantic reasoning is combined with low-latency 6G connectivity.

2606.01311 2026-06-02 cs.CL cs.AI cs.LG cs.MA 版本更新

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

SkillAdaptor:基于轨迹的LLM智能体自适应技能

Zhuoyun Yu, Xin Xie, Wuguannan Yao, Chenxi Wang, Lei Liang, Xiang Qi, Shumin Deng

发表机构 * Zhejiang University(浙江大学) Ant Digital Technologies, Ant Group(蚂蚁集团数字技术部) Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph(浙江大学-蚂蚁集团知识图谱联合实验室)

AI总结 提出SkillAdaptor,一种无训练的步骤级技能自适应框架,通过显式故障归因和针对性更新,提升LLM智能体在长程交互任务中的表现。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLM)智能体越来越依赖可重用的外部技能来解决长程交互任务。现有的无训练技能自适应流程通常从完整轨迹或会话级反馈更新技能,这使得故障归因粗糙,往往产生不稳定或过于宽泛的修订。我们提出SkillAdaptor,一种无训练的步骤级技能自适应框架,具有显式故障归因,并可插入OpenClaw类智能体框架。给定一个失败轨迹,SkillAdaptor识别第一个可操作的故障步骤,将责任关联到候选技能,并在显式接受检查下应用针对性更新,同时保持主干冻结。我们在WebShop、PinchBench和Claw-Eval上使用Kimi-K2.5、GLM-5和GPT-5.2进行评估。SkillAdaptor在所有三个套件上均优于无技能和技能自适应基线,最大的单项指标提升为PinchBench平均得分%提升1.5分,Claw-Eval平均得分提升1.8分,WebShop成功率提升1.7分。这些结果表明,步骤级归因支持更稳定且可审计的无训练技能维护。

英文摘要

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance\footnote{The code will be released at https://github.com/zjunlp/SkillAdaptor.}.

2606.01300 2026-06-02 cs.LG cs.AI 版本更新

ChronosAD: Leveraging Time Series Foundation Models for Accurate Anomaly Detection

ChronosAD:利用时间序列基础模型进行精确异常检测

Uzair Khan, Luigi Capogrosso, Francesco Biondani, Michele Magno, Franco Fummi, Francesco Setti, Marco Cristani

发表机构 * PR Veneto FESR 2021-2027(普罗文托地区FESR 2021-2027项目) Action 1.1.1(行动1.1.1) DGR 792 CUP D19J24000810007

AI总结 提出ChronosAD架构,通过时间序列基础模型提取特征并结合BiLSTM与多头注意力机制,实现跨域鲁棒的异常检测,在11个基准上平均AUC提升4.72%,AP提升6.60%。

Comments Accepted at the 24th IEEE International Conference on Industrial Informatics (INDIN) 2026

详情
AI中文摘要

时间序列异常检测是金融、医疗和工业等多个领域的关键任务。然而,现有方法通常难以在不同数据集上泛化,尤其是当异常微妙或依赖于上下文时。为解决此问题,我们引入了ChronosAD,一种新颖的异常检测架构,它使用时间序列基础模型作为特征提取器。具体而言,它采用两阶段流程:首先,使用基础模型以零样本方式为每个时间序列提取嵌入。然后,一个由双向长短期记忆(BiLSTM)和多头注意力组成的自定义开发的时间块,对这些嵌入进行精炼以捕捉时间依赖性并突出显著模式。与先前方法不同,我们的模型需要最少的任务特定调整,并在包括工业、医疗、信息物理和汽车系统在内的广泛领域中展现出鲁棒的泛化能力。在11个基准上的大量实验表明,ChronosAD在AUC和AP上平均分别超过现有方法4.72%和6.60%。源代码可在https://github.com/intelligolabs/ChronosAD获取。

英文摘要

Time series anomaly detection is a crucial task in various domains, including finance, healthcare, and industry. However, existing methods often struggle to generalize across different datasets, especially when anomalies are subtle or context-dependent. To solve this issue, we introduce ChronosAD, a novel architecture for anomaly detection that uses a time series foundation model as a feature extractor. Specifically, it employs a two-stage pipeline: first, it uses the foundation model to extract embeddings for each time series in a zero-shot manner. Then, a custom-developed Temporal Block, composed of Bidirectional Long Short-Term Memory (BiLSTM) and Multi-Head Attention, refines these embeddings to capture temporal dependencies and highlight salient patterns. Unlike previous approaches, our model requires minimal task-specific tuning and demonstrates robust generalization across a wide range of domains, including industrial, medical, cyber-physical, and automotive systems. Extensive experiments on 11 benchmarks show that ChronosAD outperforms existing methods by 4.72% in AUC and 6.60% in AP on average. The source code is available at https://github.com/intelligolabs/ChronosAD.

2606.01293 2026-06-02 eess.IV cs.AI cs.CV 版本更新

ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI

ResNet-34与轻量级解码器用于胎儿脑部MRI的准确高效分割

Ashiqur Rahman, Muhammad E. H. Chowdhury, Md. Abu Sayed, Md. Sharjis Ibne Wadud, Abu Naser Md. Arafat, Mehedi Hasan Prince

发表机构 * Department of Biomedical Physics and Technology, University of Dhaka(达卡大学生物医学物理与技术系) Department of Electrical Engineering, College of Engineering, Qatar University(卡塔尔大学工程学院电气工程系) Department of Biomedical Engineering, Jashore University of Science and Technology(贾沙尔大学科学与技术学院生物医学工程系)

AI总结 提出一种结合ResNet-34编码器和基于MLP的轻量级解码器的深度学习模型,以解决胎儿脑MRI分割中的运动伪影和强度不均匀问题,在FeTA 2021数据集上达到97.37%准确率和90.33%平均DSC。

详情
AI中文摘要

在磁共振成像(MRI)中准确分割胎儿脑组织对于先天性异常的早期诊断和改善产前护理至关重要。然而,由于胎儿运动、组织对比度低以及整个孕龄期解剖结构变异大,特别是分割白质、灰质、侧脑室、深部灰质、脑外脑脊液、小脑和脑干等复杂结构时,该任务仍然困难。针对这些难题,本研究引入了一种新颖的深度学习模型,该模型将ResNet-34编码器与利用多层感知器(MLP)模块进行自适应特征细化的轻量级解码器相结合。这种设计特别增强了模型保留解剖边界并减轻由运动伪影和强度不均匀引起的分割误差的能力。通过减少参数数量、采用双线性上采样代替转置卷积以及优化解码器以提高速度而不牺牲精度,实现了计算效率。在FeTA 2021数据集上使用5折交叉验证进行训练和验证,所提出的模型优于UNet、UNet++、DeepLabV3和DeepLabV3+等基线架构,平均准确率达到97.37%,平均Dice相似系数(DSC)为90.33%,平均交并比(IoU)为86.93%,精确率为90.83%。此外,其快速的推理时间和减少的计算负载使其非常适合集成到实时临床工作流程中。

英文摘要

Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalities and improving prenatal care. However, the task remains difficult because of fetal motion, low tissue contrast, and major anatomical variability throughout gestational ages, particularly in segmenting complex structures such as white matter, gray matter, lateral ventricles, deep gray matter, extra-cerebrospinal fluid, cerebellum, and brainstem. As a solution to these difficulties, this research introduces a novel deep learning model that combines a ResNet-34 encoder with a lightweight decoder leveraging multi-layer perceptron (MLP) modules for adaptive feature refinement. This design specifically enhances the model's ability to preserve anatomical boundaries and mitigate segmentation errors caused by motion artifacts and intensity inhomogeneities. Computational efficiency is achieved by reducing parameter count, employing bilinear upsampling instead of transposed convolutions, and optimizing the decoder for speed without sacrificing accuracy. Trained and validated on the FeTA 2021 dataset using 5-fold cross-validation, the proposed model outperforms baseline architectures such as UNet, UNet++, DeepLabV3, and DeepLabV3+, achieving an average Accuracy of 97.37% with a mean Dice Similarity Coefficient (DSC) of 90.33%, mean Intersection over Union (IoU) of 86.93%, and Precision of 90.83%. Additionally, its fast inference time and reduced computational load make it well-suited for integration into real-time clinical workflows.

2606.01292 2026-06-02 cs.LG cs.AI 版本更新

What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression

什么造就了一个强模型?高维线性回归中知识迁移的统一谱分析

Wendao Wu, Fangqing Zhang, Haihan Zhang, Cong Fang

发表机构 * Department of Computer Science(计算机科学系) Cranberry-Lemon University(Cranberry-Lemon 大学) Department of Computational Neuroscience(计算神经科学系) University of the Witwatersrand(沃特瓦特斯兰大学)

AI总结 本文通过高维线性回归中SGD动力学的统一谱分析,揭示了知识蒸馏中的谱视界扩展和弱到强泛化中的谱去噪两种机制,统一解释了不同知识迁移范式的有效性。

详情
AI中文摘要

师生知识迁移在现代机器学习中无处不在,从通过知识蒸馏进行的经典模型压缩到弱到强泛化这一新兴现象。尽管现有研究提供了孤立见解,但缺乏一个统一的理论框架来解释知识迁移在这些不同机制中的有效性。在这项工作中,我们建立了高维线性回归中SGD动力学的统一谱分析,阐明了知识迁移在看似不同的机制中的效率。我们通过两种不同机制来刻画知识迁移效率:知识蒸馏中的谱视界扩展,使得能够捕获统计上不可及的高频信号;以及弱到强泛化中的谱去噪,其中学生充当优化噪声的滤波器。我们的框架统一了这些现象,揭示了迁移的有效性由隐式正则化与谱上异质谱学习速度之间的相互作用所支配。

英文摘要

Teacher-Student Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Distillation (KD) to the emergent phenomenon of Weak-to-Strong (W2S) generalization. While existing studies offer isolated insights, a unified theoretical framework explaining the efficacy of KT across these disparate regimes remains lacking. In this work, we establish a unified spectral analysis of SGD dynamics in high-dimensional linear regression, elucidating the efficiency of KT across seemingly disparate regimes. We characterize KT efficiency through two distinct mechanisms: \emph{Spectral Horizon Expansion} in KD, which enables the capture of statistically inaccessible high-frequency signals, and \emph{Spectral Denoising} in W2S, where the student acts as a filter for optimization noise. Our framework unifies these phenomena, revealing that the efficacy of transfer is governed by the interplay between implicit regularization and heterogeneous spectral learning speeds over the spectrum.

2606.01291 2026-06-02 quant-ph cs.AI 版本更新

Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework

量子分布式纠缠约简算法(QADR):一种可训练且模拟高效的量子机器学习框架

Syed Farhan Ahmad, Gregory T. Byrd

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出QADR框架,通过将全局n量子比特变分量子电路分解为因果光锥内的局部子电路,将经典模拟内存从O(2^n)降至O(n·2^{2d+1})并缓解贫瘠高原,在MNIST和NASA轴承诊断任务上匹配或超越经典模型。

详情
AI中文摘要

在含噪中等规模量子(NISQ)约束下训练变分量子电路(VQCs)引入了严重的计算限制:经典态矢量模拟内存呈指数增长($\mathcal{O}(2^n)$),且全局代价函数遭受贫瘠高原,其中梯度方差呈指数衰减($\mathcal{O}(1/2^n)$)。本文介绍并评估了量子分布式纠缠约简算法(QADR),这是一种混合量子-经典机器学习框架,它将全局$n$量子比特VQC分解为局部子电路,这些子电路大致在单个目标量子比特的因果光锥内运行。QADR将经典模拟内存从$\mathcal{O}(2^n)$降低到$\mathcal{O}(n \cdot 2^{2d+1})$(光锥半径$d$),同时自然缓解了全局贫瘠高原。我们在MNIST数据集和高维NASA IMS风力发电机传动系统诊断任务上,将QADR与标准全局VQC、支持向量机(SVM)以及两种定制的经典参数匹配神经网络(CANN和PMNN)进行了基准测试。QADR展示了出色的可扩展性,在$n_{\text{features}}=2000$时成功运行,而标准全局VQC因内存耗尽而崩溃,同时匹配或超越了优化经典架构的性能。

英文摘要

Training Variational Quantum Circuits (VQCs) under Noisy Intermediate-Scale Quantum (NISQ) constraints introduces severe computational limitations: classical statevector simulation memory scales exponentially ($\mathcal{O}(2^n)$), and global cost functions suffer from barren plateaus where gradient variance decays exponentially ($\mathcal{O}(1/2^n)$). This paper introduces and evaluates the Quantum Algorithm for Distributed Reduction of Entanglements (QADR), a hybrid quantum-classical machine learning framework that decomposes a global $n$-qubit VQC into localized sub-circuits operating approximately within the causal light cones of individual target qubits. QADR reduces classical simulation memory scaling from $\mathcal{O}(2^n)$ to $\mathcal{O}(n \cdot 2^{2d+1})$ for a light cone radius $d$, while naturally mitigating global barren plateaus. We benchmark QADR against standard global VQCs, Support Vector Machines (SVM), and two customized classical parameter-matched neural networks (CANN and PMNN) on the MNIST dataset and the high-dimensional NASA IMS wind turbine drivetrain diagnostic task. QADR demonstrates excellent scalability, operating successfully at $n_{\text{features}}=2000$ where standard global VQCs crash due to memory exhaustion, while matching or exceeding the performance of optimized classical architectures.

2606.01287 2026-06-02 cs.CV cs.AI 版本更新

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

超越视觉记忆:潜在视觉推理的机制诊断

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Shuai Dong

发表机构 * Amap, Alibaba Group(阿里集团亚马通) Shanghai Innovation Institute(上海创新研究院)

AI总结 通过分解潜在令牌为三个可测试组件,发现边界标记和格式而非潜在槽贡献了主要性能提升,揭示了潜在视觉推理的真正机制。

详情
AI中文摘要

最近的潜在视觉推理方法通过在多模态语言模型中插入连续潜在令牌取得了显著提升。这些提升通常归因于令牌编码了视觉证据;然而,最近的分析揭示了一个悖论:令牌与图像关联松散,对答案贡献甚微。关键的是,这些分析将潜在令牌视为一个整体,掩盖了提升的真正来源。因此,我们将潜在令牌分解为三个可测试组件:潜在槽、边界标记和格式,并在有利条件下开发了一种最先进的方法作为探针。在六个方法-阶段设置和四个感知密集型基准测试中,潜在槽未能通过视觉记忆解释的所有预测。引人注目的是,在几种设置中,仅保留边界标记即可保留78%至100%的提升,而模型在潜在位置比在答案位置更窄地关注图像。因此,提升来自边界标记、格式以及这种注意力模式,而非潜在槽。每种方法如何利用这一机制取决于其训练监督:在匹配的准确率下,机制仍可能显著不同。因此,潜在视觉推理不仅需要根据准确率评估,还需要根据模型实际依赖的内容进行评估。

英文摘要

Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.

2606.01286 2026-06-02 cs.SE cs.AI cs.CL cs.LG 版本更新

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

BenchEvolver: 通过以解决方案为中心的进化进行前沿任务合成

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song

发表机构 * University of California, Berkeley(加州大学伯克利分校) Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出BenchEvolver框架,通过进化参考解决方案自动生成更难的编程问题,以解决基准饱和问题,并在LiveCodeBench和SciCode上验证其有效性。

详情
AI中文摘要

前沿大语言模型的快速进步导致了广泛的基准饱和,限制了现有数据集区分模型能力或提供有用训练信号的能力。例如,在LiveCodeBench上,前沿模型在简单拆分上达到超过99%的Pass@1,在不同难度级别上平均超过90%的Pass@1。构建新的、具有挑战性的数据集通常需要大量人力,成为进步的瓶颈。我们引入了BenchEvolver,一个以解决方案为中心的进化框架,自动将现有编码问题转化为更难的变体。BenchEvolver不是从头生成问题,而是通过结构化变换进化参考解决方案,并从进化后的解决方案中推导出相应的描述和测试。这种设计将生成过程基于可执行语义,使得能够可扩展地构建高质量、多样化和困难的任务,并具有可验证的正确性。将BenchEvolver应用于LiveCodeBench和SciCode,我们获得了显著更难的进化任务,同时保持了有效性、参考正确性和多样性。我们进一步策划了LiveCodeBench-Plus,一个包含91个问题的基准,结合了进化后的任务和困难的原始LCB-v6任务,其中前沿模型的Pass@1范围从27.5%到62.6%,恢复了强编码模型之间的清晰区分。重要的是,即使对于生成它们的模型,进化后的任务仍然具有挑战性,从而实现了自我改进。我们进一步表明,在进化后的LCB任务上进行强化学习提高了留出编码性能:对于gpt-oss-20b,种子+进化训练在LCB v6 Hard和LCB-Pro Easy上分别获得了+8.7和+8.3的Pass@1提升,分别超过仅种子训练的70.7%和34.8%。我们的结果表明,BenchEvolver可以将饱和的基准转化为前沿级别的评估套件和可重用的训练信号。

英文摘要

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.

2606.01285 2026-06-02 cs.CV cs.AI 版本更新

Knowledge-Intensive Video Generation

知识密集型视频生成

Chenxu Wang, Mingda Chen

发表机构 * Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 针对文本到视频生成在事实性和实用性方面的不足,提出知识密集型视频生成(KIVI)任务,构建KIVI-Bench基准和自动评估指标,实验表明现有模型在视觉属性、操作过程和信息呈现上落后于人类。

详情
AI中文摘要

文本到视频生成在视觉质量上取得了快速进步,但在事实性和实际有用性方面仍缺乏评估。我们引入了知识密集型视频生成(KIVI),其中模型根据简短的信息寻求提示生成视频,这些提示要求解释、步骤或演示。为了评估这一设置,我们构建了KIVI-Bench,一个包含1,080个提示的基准,并提出了用于事实性和有用性的自动指标。人类评估表明,我们的指标比现有替代方案更符合人类标注。对七个最先进的视频生成模型的实验表明,当前系统仍落后于人类表现,尤其是在视觉属性、程序性操作和清晰的信息呈现方面。这些结果凸显了KIVI作为事实性和教学性视频生成的一个具有挑战性的方向。

英文摘要

Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We introduce knowledge-intensive video generation (KIVI), where models generate videos from short information-seeking prompts that ask for explanations, procedures, or demonstrations. To evaluate this setting, we construct KIVI-Bench, a benchmark of 1,080 prompts, and propose automatic metrics for factuality and helpfulness. Human evaluation shows that our metrics significantly better align with human annotations than existing alternatives. Experiments on seven state-of-the-art video generation models show that current systems still lag behind human performance, especially on visual properties, procedural operations, and clear information presentation. These results highlight KIVI as a challenging direction for factual and instructionally useful video generation.

2606.01281 2026-06-02 cs.LG cs.AI 版本更新

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

RLVR 无需无效样本:面向 LLM 推理的群体优先级离策略优化

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对强化学习中无效样本导致学习信号不足的问题,提出群体优先级离策略优化(POPO),通过优先级群体重放和解耦重要性采样,在不增加额外采样开销的情况下提升推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的强大范式。然而,其有效性受到无效训练数据普遍存在的严重阻碍:许多采样提示产生的响应群体要么完全正确,要么完全错误,导致奖励零方差和学习信号有限。最近的先进方法通过大量LLM rollout来过滤无效样本以解决此问题,但代价是相当大的计算开销。替代方法,包括预测性采样和轨迹重放,旨在提高数据效率,但往往仍不充分,并可能引入额外问题,如系统性偏差或次优约束。为解决这些局限性,我们提出了群体优先级离策略优化(POPO),一个简单而有效的框架,无需额外rollout开销即可充分利用有效训练批次。POPO包含两个关键组件:优先级群体重放和解耦离策略优化。前者通过基于近因的重放机制,联合考虑样本质量和离策略程度,用有效的离策略群体替换无效的在策略群体。为进一步缩小离策略差距,POPO采用解耦重要性采样来校正离策略偏差,同时在一致的信任区域约束下保持稳定的策略更新。在包括数学、规划和视觉几何在内的多种推理任务上的实证评估表明,POPO显著加速了RL微调,并在显著减少rollout的情况下实现了强大的推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.

2606.01279 2026-06-02 cs.AI 版本更新

ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

ANDES:用于自主指令对齐的智能体原生数据演化合成工具

Zhengyang Zhao, Shengjie Ye, Lu Ma, Hao Liang, Hengyi Feng, Wentao Zhang

发表机构 * Peking University(北京大学) Sichuan University, Chengdu(四川大学,成都)

AI总结 提出ANDES框架,通过自进化世界树路由和可操作诊断报告,将数据生成重构为即插即用的智能体技能,使基础较弱的智能体在严格计算约束下实现自动对齐,在PostTrainBench上取得最先进性能。

详情
AI中文摘要

AI智能体正越来越多地被用于自动化AI研究本身,特别是将基础大语言模型转化为对齐助手的关键后训练阶段。然而,最近的评估显示,即使是最前沿的智能体也难以完成这一任务。虽然后训练的成功根本上依赖于获取高质量数据,但依赖智能体从开放网络中自主策划目标训练数据集带来了严峻挑战。在嘈杂的网络环境中执行搜索、过滤和平衡数据的长期任务,常常超出智能体有限的上下文能力,最终导致数据集质量下降和下游训练性能次优。为弥补这一差距,我们引入了Andes(智能体原生数据演化合成),这是一个将数据生成重新构想为即插即用的智能体技能的框架。Andes不强迫智能体从头设计复杂的数据收集策略,而是提供一个智能抽象层。通过利用自演化的世界树路由机制和可操作的诊断报告,它允许训练智能体通过交互式闭环界面动态引导数据合成。我们证明,在严格的计算约束下,为基础较弱的智能体配备Andes可以改善自动对齐,在PostTrainBench上取得最先进的性能,并实现稳健的跨任务泛化。我们的项目可在https://github.com/zzy1127/ANDES获取。

英文摘要

AI agents are increasingly being tasked with automating AI research itself, particularly the critical post-training phase that transforms base LLMs into aligned assistants. However, recent evaluations reveal that even frontier agents struggle to perform this task. While the success of post-training fundamentally relies on acquiring high-quality data, relying on agents to autonomously curate targeted training datasets from the open web introduces severe challenges. Executing the long-horizon tasks of searching, filtering, and balancing data within noisy web environments frequently overwhelms an agent's limited context, ultimately leading to degraded dataset quality and suboptimal downstream training performance. To bridge this gap, we introduce Andes (Agent Native Data Evolving Synthesis), a framework that reimagines data generation as a plug-and-play \emph{agent skill}. Rather than forcing agents to devise complex data-gathering strategies from scratch, \textsc{Andes} provides an intelligent abstraction layer. By leveraging a self-evolving World Tree routing mechanism and actionable diagnostic reports, it allows trainer agents to dynamically steer data synthesis through an interactive, closed-loop interface. We demonstrate that under strict compute constraints, equipping foundationally weaker agents with Andes improves automated alignment, securing state-of-the-art performance on PostTrainBench and robust cross-task generalization. Our project is available at https://github.com/zzy1127/ANDES.

2606.01277 2026-06-02 cs.RO cs.AI cs.CV cs.SY eess.IV eess.SY 版本更新

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

DeepIPCv3: 面向突发行人穿越避让的事件感知多模态传感器融合

Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada(计算机科学与电子系,加雅马达大学) Department of Computer Science and Engineering, Toyohashi University of Technology(计算机科学与工程系,东福士大学)

AI总结 提出DeepIPCv3框架,通过Transformer交叉模态注意力融合LiDAR点云与DVS事件流,实现突发行人穿越场景下的高反应性避让,在自定义多模态数据集上达到最优轨迹与控制精度。

详情
AI中文摘要

当前的端到端自动驾驶系统主要依赖基于帧的传感器,这类传感器在高度动态的突发行人穿越场景中存在固有的感知延迟和运动模糊问题。为解决这一关键安全漏洞,我们提出DeepIPCv3,一种新颖的多模态自主导航框架,它将LiDAR点云的密集3D空间几何与动态视觉传感器(DVS)的微秒级异步事件流协同融合。我们引入了一种受Transformer启发的交叉模态注意力机制,以动态关联这些不同模态,使网络能够即时优先处理高速动态更新,同时不牺牲场景结构感知。融合后的潜在表示通过一个混合策略网络映射到安全的局部路径点和可执行控制命令,该网络结合了启发式轨迹跟踪与直接神经预测。由于在真实场景中测试这些突发穿越场景存在严重物理风险,该框架使用在光照良好的正午和具有挑战性的傍晚条件下收集的自定义多模态数据集进行严格离线评估。广泛的对比和消融研究表明,DeepIPCv3达到了最先进的预测性能。通过有效消除曝光失败和运动模糊,所提出的LiDAR与DVS融合实现了最低的轨迹和控制命令误差,使得无论环境光照如何,都能实现高反应性、数学上有界的规避机动。为支持未来研究,我们将代码发布到GitHub仓库:https://github.com/oskarnatan/DeepIPCv3。

英文摘要

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/DeepIPCv3.

2606.01265 2026-06-02 cs.LG cs.AI 版本更新

PALTO: Physics-Informed Active Learning for Tri-Gate FinFET Design Optimization for Vertical Power Delivery

PALTO:面向垂直供电的Tri-Gate FinFET设计优化的物理信息主动学习

Ayoub Sadeghi, Leonid Popryho, Inna Partin-Vaisband

发表机构 * University of Illinois Chicago(伊利诺伊大学香槟分校) Center for Heterogeneous Integration of Micro Electronic Systems(微电子异构集成中心) Joint University Microelectronics Program (JUMP) 2.0(联合大学微电子计划(JUMP)2.0) Semiconductor Research Corporation (SRC)(半导体研究公司(SRC)) Defense Advanced Research Project Agency (DARPA)(国防高级研究计划局(DARPA))

AI总结 提出物理信息主动学习框架,高效探索GaN tri-gate FinFET的高维设计空间,优化关键结构参数(如GaN-to-AlGaN厚度比),发现两种优化器件,其中D1在300-fin配置下驱动电流和开关效率优于D2。

详情
AI中文摘要

本文展示了机器学习驱动优化在垂直供电系统中设计特定应用的GaN三栅极FinFET的有效性。传统的基于TCAD的方法计算量大,且不足以导航先进GaN器件的高维非线性设计空间。为此,采用物理信息主动学习框架智能引导仿真,在保持精度的同时加速收敛。这种ML引导的方法通过高效探索关键结构参数——尤其是GaN-to-AlGaN厚度比(器件设计中长期争论的焦点)——来发现最优配置。通过系统探索关键结构参数,确定了两种具有激进缩放的栅漏长度的优化器件。单鳍多通道仿真表明,相对于AlGaN势垒具有更薄GaN沟道的器件D2实现了更高的驱动电流。然而,在300鳍配置中,器件D1以0.49欧姆导通电阻提供3.3A电流,性能约为D2的2倍,尽管寄生参数略高。两种器件均工作在常关模式。基于特定应用品质因数,器件D1达到5 pC·欧姆,开关效率比D2高2倍,而两种设计在不同性能指标上均优于工业基准。

英文摘要

This paper demonstrates the effectiveness of machine learning-driven optimization for designing application-specific GaN tri-gate FinFETs in vertical power delivery systems. Conventional TCAD-based approaches are computationally intensive and insufficient for navigating the high-dimensional, nonlinear design space of advanced GaN devices. To address this, a physics-informed active learning framework is used to intelligently guide simulations, accelerating convergence while preserving accuracy. This ML-guided approach enables the discovery of optimal configurations by efficiently exploring key structural parameters -- most notably the GaN-to-AlGaN thickness ratio -- a long-standing focus of debate in device design. By systematically exploring key structural parameters, two optimized devices with aggressively scaled gate-to-drain lengths are identified. Single-fin, multi-channel simulations show that device~D2, with a thinner GaN channel relative to the AlGaN barrier, achieves higher drive current. However, in a 300-fin configuration, device~D1 outperforms device~D2 by delivering 3.3\,A at 0.49~ohm on-resistance -- approximately 2$\times$ better -- despite slightly higher parasitics. Both devices operate in a normally-off mode. Based on an application-specific figure of merit, device~D1 achieves 5\,pC$\cdot$ohm, demonstrating 2$\times$ greater switching efficiency than device~D2, while both designs outperform industrial benchmarks from different performance standpoints.

2606.01260 2026-06-02 cs.CL cs.AI 版本更新

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

IndoBias:印尼语言中大语言模型偏见评估的双轨文化基准

Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德大学人工智能大学) Universitas Indonesia(印度尼西亚大学) Independent Researcher(独立研究员)

AI总结 提出IndoBias基准,通过深度和广度双轨评估,发现现有LLM在印尼语和三种地方语言中表现出显著偏见,且预训练数据来源和语言多样性影响偏见程度。

详情
AI中文摘要

尽管印尼拥有超过1300个民族和700种土著语言,但大语言模型中的偏见尚未得到充分研究,从而在其独特广阔、多语言和多样化的社会文化背景下,评估代表性公平性和本地化刻板印象存在关键空白。为解决此问题,我们引入IndoBias作为文化基础的偏见基准,评估LLM在印尼语和三种地方语言(爪哇语、巽他语和望加锡语)中的偏见。IndoBias具有双视角评估轨道:深度导向(使用对比对)和广度导向(基于生成),后者基于社会科学框架(SPI、O*NET和WGI)。我们的结果表明,现有LLM——尤其是解码器模型——对印尼语中的原型句子表现出强烈偏见,而地方语言在意识形态和宗教类别下遭受更高偏见。我们还发现,当使用各种地方实体提示时,LLM响应表现出非均匀的刻板印象极性。最后,我们发现,在印尼语中,Common Crawl文本在预训练期间引入的偏见比人工审核的文章文本(如维基百科、新闻)更多,而将地方语言引入预训练通常会增加偏见。这项工作强调了在特定文化背景下研究偏见的重要性。警告:本文包含可能具有冒犯性、有害性或偏见性的示例数据。

英文摘要

Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and localized stereotypes within its uniquely vast, multilingual, and diverse sociocultural landscape. To address this, we introduce IndoBias as a culturally-grounded bias benchmark to assess LLMs bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. IndoBias features dual perspective evaluation tracks: depth-oriented (with contrastive-pairs) and breadth-oriented (with generation-based), where the latter is grounded in social science frameworks (SPI, O*NET, and WGI). Our results show that existing LLMs -- particularly decoder models -- exhibit strong bias towards prototypical sentences in Indonesian, while local languages suffer higher bias under Ideology and Religion category. We also find that LLMs responses exhibit a non-uniform Stereotype Polarity when prompted with various local entities. Finally, we discover that, in Indonesian, Common Crawl texts introduce more bias during pretraining, compared to human-reviewed article texts (e.g., Wikipedia, News), whereas introducing local languages to pretraining generally increases bias. This work highlights the importance of studying bias in culture-specific context. Warning: This paper contains example data that may be offensive, harmful, or biased.

2606.01252 2026-06-02 cs.CL cs.AI 版本更新

Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization

理解多目标跨语言摘要中的LLM行为

Sangwon Ryu, Yihong Liu, Mingyang Wang, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee, Hinrich Schuetze

发表机构 * GSAI, POSTECH(POSTECH认知科学研究院) CSE, POSTECH(POSTECH计算机科学与工程学院) LMU Munich(慕尼黑大学) MCML LILT(语言信息实验室)

AI总结 针对多目标跨语言摘要任务,提出MEA基准并分析LLM内部机制,发现翻译和摘要行为在后期层联合出现,并引入推理时激活引导方法提升生成质量。

详情
AI中文摘要

多目标跨语言文本摘要(MTXLS)将源文档总结为多种目标语言,随着用户以多种语言消费内容,其重要性日益增加,但仍未得到充分探索。为填补这一空白,我们引入了多目标跨语言元素感知(MEA),这是一个涵盖24种目标语言的新MTXLS基准。我们评估了各种LLM的端到端和流水线方法,并表明MTXLS性能仍远落后于英语单语摘要。为了更好地理解LLM中的MTXLS,我们提出了一种逐层分析框架,用于研究LLM如何在内部执行MTXLS。我们的分析表明,翻译和摘要行为在后期层中联合出现,而不是作为截然不同的分解阶段。大多数任务相关处理发生在这些层内,错误也往往在类似深度出现。受这些发现启发,我们引入了一种推理时激活引导方法,利用英语摘要的隐藏表示来指导MTXLS生成。实验表明,我们的方法在目标语言上持续提高了MTXLS质量。

英文摘要

Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly important as users consume content in diverse languages, but remains underexplored. To address this gap, we introduce multi-target cross-lingual element-aware (MEA), a new MTXLS benchmark covering 24 target languages. We benchmark end-to-end and pipeline approaches across various LLMs and show that MTXLS performance still substantially lags behind English monolingual summarization. To better understand MTXLS in LLMs, we propose a layer-wise analysis framework for investigating how LLMs internally perform MTXLS. Our analyses suggest that translation and summarization behaviors emerge jointly within later layers rather than as distinctly decomposed stages. Most task-relevant processing occurs within these layers, and errors also tend to arise at similar depths. Motivated by these findings, we introduce an inference-time activation steering method that leverages hidden representations from English summarization to guide MTXLS generation. Experiments show that our method consistently improves MTXLS quality across target languages.

2606.01246 2026-06-02 cs.AI 版本更新

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

SIRIUS-SQL: 在执行反馈中锚定多候选文本到SQL

Leo Luo, Haining Xie, Siqi Shen, Zhipeng Ma, Rui Ling, Hang Xu, Hefeng Jiang, Dingwei Chen, Yang Li, Peng Chen, Jie Jiang

发表机构 * TEG, Tencent Inc., China(腾讯科技(TEG),腾讯公司,中国) Peking University, China(北京大学,中国)

AI总结 提出SIRIUS-SQL系统,通过难度平滑强化学习、执行生命周期分类和置信门控混合选择器,解决多候选SQL生成中的冗余、修复和选择问题,在BIRD dev和SPIDER test上达到75.88%和91.20%的准确率。

详情
AI中文摘要

在复杂模式上的Text-to-SQL单次通过不可靠,因此近期系统生成多个SQL候选并通过投票过滤错误。然而仅投票是不够的,因为多候选方法有三个耦合的弱点:1) 从单个生成器采样更多会产生越来越冗余的候选,2) 现有流程对每个非干净执行结果应用通用修正,而运行时错误、超时和空结果各自指示与正确性的不同距离,3) 现有选择器依赖单一角度如结果多数投票或成对SQL比较,错过了其他角度可能捕获的信息。我们提出SIRIUS-SQL,解决了所有三个弱点。一个难度平滑的强化学习配方训练SIRIUS-32B生成多样化的可执行SQL候选,并与一个通用LLM配对,填补专家留下的空白。一个基于执行的生命周期对每个结果进行分类,并在候选重新进入池之前应用针对性修复。一个置信门控混合选择器结合执行结果一致性与成对SQL形式判断,仅在接近平局的情况下升级到确定性结构检查。SIRIUS-SQL在BIRD dev上达到75.88%,在SPIDER test上达到91.20%。三个通用配对中的两个超过了BIRD dev上最强已发布的多候选系统Agentar-Scale-SQL。

英文摘要

Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out errors. Yet voting alone is not enough, because the multi-candidate recipe has three coupled weaknesses: 1) sampling more from a single generator produces increasingly redundant candidates, 2) existing pipelines apply one generic correction to every non-clean execution result, while runtime errors, timeouts, and empty results each indicate a different distance from correctness, and 3) existing selectors rely on a single angle such as result-majority voting or pairwise SQL comparison, missing what other angles would have caught. We present SIRIUS-SQL, which addresses all three weaknesses. A difficulty-smoothing RL recipe trains SIRIUS-32B to generate diverse executable SQL candidates, paired with a generalist LLM that fills in gaps left by the specialist. An execution-grounded lifecycle classifies each outcome and applies targeted repair before candidates re-enter the pool. A confidence-gated hybrid selector combines execution-result agreement with pairwise SQL-form judgment, escalating only near-tied cases to a deterministic structural check. SIRIUS-SQL reaches 75.88% on BIRD dev and 91.20% on SPIDER test. Two of three generalist pairings surpass Agentar-Scale-SQL, the strongest published multi-candidate system on BIRD dev.

2605.04638 2026-06-02 cs.CL cs.AI 版本更新

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

相对于语义保持嵌入的梯度揭示大语言模型的不确定性

Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang, Ting Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出首个基于梯度的自由文本生成不确定性量化方法SemGrad,通过语义空间中的梯度计算实现高效且无需采样的不确定性估计。

Comments Accepted by ICML 2026

详情
AI中文摘要

不确定性量化(UQ)是确保大语言模型(LLM)可信度的重要技术,因为LLM容易产生幻觉。现有的自由文本生成UQ方法严重依赖采样,导致计算成本高且方差大。在这项工作中,我们提出了首个基于梯度的自由文本生成UQ方法SemGrad,它无需采样且计算高效。与先前针对分类任务开发的在参数空间中操作的梯度方法不同,我们提出在语义空间中考虑梯度。我们的方法基于一个关键直觉:自信的LLM应在语义等价的输入扰动下保持稳定的输出分布。我们将这种稳定性解释为语义空间中的梯度,并引入语义保持分数(SPS)来识别最能捕捉语义的嵌入,并针对这些嵌入计算梯度。我们进一步提出了HybridGrad,它结合了SemGrad和参数梯度的优势。实验表明,我们的两种方法都提供了高效且有效的不确定性估计,在多个有效响应的设置中尤其优于现有方法。

英文摘要

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operates in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret the stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, achieving superior performance than state-of-the-art methods, particularly in settings with multiple valid responses.

2605.04193 2026-06-02 cs.AI cs.LG cs.LO 版本更新

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming

ANDRE:一种基于注意力的神经符号可微规则提取器,用于归纳逻辑编程

Iman Sharifi, Peng Wei, Saber Fallah

发表机构 * Dept. of Mechanical and Aerospace Engineering, George Washington University, USA(机械与航空航天工程系,乔治华盛顿大学) Dept. of Mechanical Engineering Sciences, University of Surrey, UK(机械工程科学系,萨里大学)

AI总结 提出ANDRE框架,通过注意力驱动的可微逻辑算子优化连续规则空间,实现从概率数据中学习一阶逻辑规则,在噪声环境下保持鲁棒性和可解释性。

Comments 35 pages, 8 figures, 10 tables

详情
AI中文摘要

归纳逻辑编程(ILP)旨在从数据中学习可解释的一阶规则,但现有的符号和神经符号方法难以扩展到噪声和概率设置。经典ILP依赖于离散的组合规则搜索,在不确定性下脆弱,而可微ILP方法通常依赖预定义规则模板或不精确的模糊算子,这些算子在推理概率谓词估值时会遭受梯度消失或逻辑结构近似不佳的问题。本文提出基于注意力的神经符号可微规则提取器(ANDRE),一种新颖的ILP框架,通过基于注意力的逻辑算子优化连续规则空间来学习一阶逻辑程序。ANDRE用完全可微的、注意力驱动的合取和析取算子替代规则模板和逻辑算子,这些算子近似逻辑最小-最大语义,从而实现对概率数据的准确、稳定和可解释推理。通过在每条规则内软选择、否定或排除谓词,ANDRE在保持符号结构的同时支持灵活规则归纳。在经典ILP基准、大规模知识库以及带有概率谓词和噪声监督的合成数据集上的大量实验表明,ANDRE达到了有竞争力或更优的预测性能,同时在不确定性下可靠地恢复正确的符号规则。特别是,ANDRE对中等标签噪声保持鲁棒,在规则提取质量和稳定性上显著优于现有可微ILP方法。

英文摘要

Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.

2606.01237 2026-06-02 cs.AI 版本更新

Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes

脑图谱引导的生成式反事实注意力用于基于多模态连接组的可解释认知衰退诊断

Xiongri Shen, Jiaqi Wang, Zhenxi Song, Yi Zhong, Leilei Zhao, Xin He, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术系) School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology(哈尔滨工业大学智能科学与工程学院) School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical, Measurements and Ultrasound Imaging, Shenzhen University Medical School, Shenzhen University(深圳大学医学院生物医学工程学院、医学超声关键技术工程实验室、广东省生物医学测量与超声成像重点实验室)

AI总结 提出一种脑图谱知识引导的生成式反事实注意力网络(GCAN),通过将诊断建模为源到目标的反事实生成问题,利用多模态连接组实现可解释的认知衰退诊断。

详情
AI中文摘要

轻度认知障碍(MCI)和主观认知衰退(SCD)与早期阿尔茨海默病连续谱密切相关,准确且可解释的诊断对于早期风险评估和干预至关重要。现有的基于连接组的深度学习模型可以提高分类性能,但通常对疾病相关的功能和结构连接变化提供的洞察有限。本文提出了一种图谱知识引导的生成式反事实注意力网络(GCAN),用于使用多模态脑连接组进行可解释的认知衰退诊断。GCAN将诊断建模为源到目标的反事实生成问题,其中从源标签输入生成目标标签连接组,并利用它们的差异构建反事实注意力图。为了保持连接组拓扑,一种图谱感知的双向Transformer(AABT)在脑图谱约束下执行网络级令牌编码和解码。该框架进一步从功能连接(FC)扩展到联合功能和结构连接(SC)建模,从而实现对互补功能重组和结构拓扑变化的反事实分析。在医院收集的数据集和ADNI数据集上的实验表明,GCAN在HC vs. SCD、HC vs. MCI和SCD vs. MCI分类任务中取得了竞争性能。可视化、圆形连接组分析、基于CAM的比较、消融研究和置信区间分析进一步支持了所提框架的可解释性和可靠性。使用特定模态的FC和SC预训练分类器为反事实生成提供目标状态先验,同时将其与下游诊断分类器分离以防止数据泄露。

英文摘要

Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.

2606.01230 2026-06-02 cs.AI 版本更新

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

HomeFlow: 面向智能家居智能体训练的可验证数据飞轮

Yi Gu, Huacan Wang, Shuo Zhang, Yuqing Hou, Lei Xue, Weipeng Ming, Chen Liu, Fangzhou Yu, Kuan Li, Ronghao Chen, Sen Hu, Xiaofeng Mou, Yi Xu

发表机构 * Midea Group(美的集团) Beijing University of Posts and Telecommunications(北京邮电大学) Donghua University(东华大学) Peking University(北京大学)

AI总结 提出HomeFlow,一种通过统一仿真环境HomeEnv、程序化家居生成HomeMaker、蓝图编译用户意图、MCTS-Flow合成可验证轨迹,并结合监督微调和逐步RLVE优化智能体的可验证数据飞轮方法,在SmartHome-Bench上达到84.60%和87.03%的任务成功率,其中8B模型超越GPT-5.5 1.23个百分点。

详情
AI中文摘要

大型语言模型智能体正从纯文本交互转向物理世界控制,智能家居是一个代表性领域。真实的家庭交互需要理解模糊意图、在动态环境中操作以及进行多轮推理。然而,现有方法难以生成用于智能家居智能体的高质量训练数据。我们提出HomeFlow,一个针对该领域的可验证数据飞轮。HomeFlow使用HomeEnv作为统一仿真环境,HomeFlow使用HomeEnv作为统一仿真环境,HomeMaker程序化生成多样化的家居设置。随后,Blueprint将开放式的用户意图编译为可执行的基于状态的成功条件,而MCTS-Flow通过环境引导的树搜索合成多样化的、可验证的多轮轨迹。然后我们通过监督微调和逐步RLVE优化智能体,通过真实的物理反馈促进迭代改进。我们进一步构建了SmartHome-Bench来评估智能体在各种智能家居任务上的表现。在该基准上,HomeFlow-RL-4B和HomeFlow-RL-8B分别达到了84.60%和87.03%的任务成功率。值得注意的是,HomeFlow-RL-8B甚至超过了领先的GPT-5.5 1.23个百分点。

英文摘要

Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.

2606.01229 2026-06-02 cs.AI 版本更新

Application of Algorithms in Energy-Efficient Design Platforms for Green Building

算法在绿色建筑节能设计平台中的应用

Na Yu, Fu Wenli, Guo Fei

发表机构 * First Highway Engineering Co., Ltd.(第一公路工程有限公司)

AI总结 提出一种结合BIM、传感器数据和进化多目标优化的算法平台,通过动态能耗仿真降低建筑能耗29.3%,并验证了其可扩展性和技术可行性。

Comments 9 pages, 4 figures.2026 International Conference on Big Data Applications in Education and Engineering (ICBDAEE 2026)

详情
AI中文摘要

在绿色建筑设计过程中,计算机辅助能耗评估被广泛用于提高效率并实现整体优化。本文提出一个平台,该平台结合建筑信息模型(BIM)、传感器运行数据以及使用稳健算法的高级仿真工作流。该平台采用多层服务架构,包含动态能耗仿真和进化多目标优化,通过高性能C++核心和自适应代理模型连接。选取一栋中层办公楼作为案例研究。选择五个代表性区域收集建筑围护结构特征和占用模式数据。预处理后,缺失传感器数据占年度记录的3.2%,所有变量通过15分钟插值标准化。经过40轮优化,每平方米年能耗从315 kWh/m²降至223 kWh/m²,下降29.3%。居住者的生命周期成本增加限制在3.7%以内,不舒适小时数降至每年70小时以下。帕累托最优解分析显示,围护结构U值范围为1.05至1.57 W/m²K,夜间通风率范围为2.1至3.6 h⁻¹,两者均与能耗性能密切相关。结果证实,集成算法框架为绿色建筑设计提供了良好的可扩展性、强性能和技术可行性。该平台为设计工程师和可持续发展从业者提供了可靠的决策支持工具,实现了数据驱动的节能建筑精准交付。

英文摘要

During green building design, computer-aided energy assessment is widely used to improve efficiency and achieve overall optimization. This paper presents a platform that combines Building Information Modeling (BIM), sensor operational data, and advanced simulation workflows using robust algorithms. The platform uses a multi-layer service architecture with dynamic energy simulation and evolutionary multi-objective optimization, connected via a high-performance C++ core and adaptive agent models. A mid-rise office building was selected as the case study. Five representative areas were chosen to collect data on building envelope characteristics and occupancy patterns. After preprocessing, missing sensor data accounted for 3.2% of annual records, and all variables were standardized using 15-minute interpolation. After 40 optimization rounds, annual energy consumption per square meter dropped by 29.3% from 315 kWh/m2 to 223 kWh/m2. The lifecycle cost increase for occupants was limited to 3.7%, and discomfort hours were reduced to under 70 hours per year. Analysis of Pareto optimal solutions shows that the envelope U-value ranges from 1.05 to 1.57 W/m2K, and nighttime ventilation rate ranges from 2.1 to 3.6 h-1, both closely linked to energy performance. The results confirm that the integrated algorithm framework offers good scalability, strong performance, and technical feasibility for green building design. This platform provides a reliable decision-support tool for design engineers and sustainability practitioners, enabling accurate, data-driven delivery of energy-efficient buildings.

2606.01224 2026-06-02 cs.AI 版本更新

Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis

基于多模态数据分析的高等数学学习行为预测与学业预警模型

Liu Qiong, Li Zhengbo

发表机构 * Moutai Institute(莫 tai 院)

AI总结 针对高等数学教育中高风险学生早期识别与干预的挑战,提出一种融合知识图谱与多模态时序建模的动态预测框架,通过异构注意力机制和自适应边权重实现精准预警与个性化干预。

Comments 12 pages,5 figures

详情
AI中文摘要

高风险学生的早期发现和及时学业干预是高等数学教育中的主要挑战,其中复杂的概念层次和非线性学习轨迹常常阻碍学生的学业表现。本研究采用多模态数据分析,构建了一个用于学习行为预测和学业预警的动态框架。它构建了层次化的知识图谱本体,根据问题难度和学生表现实现自适应边权重,并结合异构图注意力与时间序列建模来捕捉学生不断演变的知识状态。在学期多模态数据集上的实证测试证明,该方法能够准确识别高风险学生,并有效跟踪错误传播。有针对性的干预显著提高了学生的知识掌握程度并降低了学业风险。结果验证了将知识图谱分析与多模态时序建模相结合,可以为高等数学教育提供更高效、更个性化的学习支持。

英文摘要

Early detection of at-risk students and timely academic intervention pose major challenges in advanced mathematics education, where complex conceptual hierarchies and nonlinear learning trajectories often hold back students' academic performance. This study adopts multimodal data analytics to build a dynamic framework for learning behavior prediction and academic early warning. It constructs a hierarchical knowledge graph ontology, realizes adaptive edge weighting according to problem difficulty and student performance, and combines heterogeneous graph attention with temporal sequence modeling to capture students' evolving knowledge states. Empirical tests on semester-long multimodal datasets prove that this method can accurately identify high-risk students and effectively track error propagation. Targeted interventions greatly improve students' knowledge mastery and reduce academic risks. The results verify that integrating knowledge graph analytics with multimodal temporal modeling can deliver more efficient and personalized learning support for advanced mathematics education.

2606.01223 2026-06-02 cs.CL cs.AI 版本更新

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

连接点:长时对话中的反思性记忆基准测试

Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin, Weiming Qiao, Jing Li, Ruifeng Xu

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) The Hong Kong Polytechnic University(香港理工大学) Fudan University(复旦大学) Peng Cheng Laboratory(鹏城实验室)

AI总结 针对现有基准无法衡量从碎片化多模态线索合成高层解释的反思性记忆问题,提出RefMem-Bench基准和REMIND层次框架,通过渐进式证据感知、定位和抽象提升模型反思性记忆能力。

Comments 9 pages, 6 figures

详情
AI中文摘要

尽管长上下文建模取得了显著进展,现有基准仍局限于显式回忆的事实性记忆,未能衡量将碎片化、多模态线索合成为高层解释所需的反思性记忆。为填补这一空白,我们引入了RefMem-Bench,一个用于长时对话中反思性记忆的基准。RefMem-Bench包含26K个带注释的问答实例,涵盖八个反思性记忆维度和三种任务格式,要求模型超越表面检索,从分布在整个交互历史中的证据推断潜在含义。为增强反思性记忆能力,我们提出了反思性记忆归纳(REMIND),一个将反思性记忆视为渐进意义构建的层次框架。REMIND结合了问题条件证据检索、显著性感知定位和抽象级别监督,并使用渐进式反思对齐将高层反思性推理提炼到事实推理路径中。实验表明,RefMem-Bench对当前模型构成了重大挑战,而REMIND通过渐进式证据感知、定位和抽象,持续提高了答案准确性和记忆回忆率。

英文摘要

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.

2606.01221 2026-06-02 cs.LG cs.AI 版本更新

Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing

混合不平衡回归:统一的数据级与算法级平衡方法

Shermin Shahbazi, Hossein Mohammadi, Mohsen Afsharchi

发表机构 * Zahedan National University(札赫德安国立大学)

AI总结 提出一个五阶段混合框架,结合自适应分箱、条件变分自编码器、特征空间聚类过采样、潜在密度加权损失和注意力门控融合,解决回归中的不平衡问题。

Comments 52 pages, 20 figures, accepted at Expert Systems with Applications

详情
Journal ref
Expert Systems with Applications, Date: 1 August 2026, Article: 131908, Volume: Volume 322
AI中文摘要

不平衡学习是机器学习中的一个关键挑战,其中代表性不足的目标值可能使模型产生偏差,并降低对罕见但重要案例的预测性能。尽管在分类中得到了广泛研究,不平衡回归仍然相对未被充分探索。现有方法主要关注数据级平衡(可能引入噪声和过拟合)或算法级平衡(通常难以处理高度复杂的目标分布)。为了解决这些局限性,我们提出了一个统一的混合框架,将数据级和算法级平衡策略集成到一个与回归器无关的流水线中。该框架包括五个阶段:(1)自适应分箱划分,基于局部线性一致性动态分割目标空间;(2)使用条件变分自编码器进行目标条件表示学习;(3)通过特征空间聚类和少数类过采样进行多阶段数据级平衡;(4)使用新颖的潜在密度加权损失(LDWL)进行算法级平衡,以强调潜在空间和目标空间中的稀有样本;(5)基于注意力的门控融合用于最终回归。在基准数据集上的实验结果表明,与单独的回归器和现有的不平衡回归方法相比,所提出的框架持续提高了预测性能。

英文摘要

Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade prediction performance on rare but important cases. Although extensively studied in classification, imbalanced regression remains relatively underexplored. Existing methods mainly focus on either data-level balancing, which may introduce noise and overfitting, or algorithm-level balancing, which often struggles with highly complex target distributions. To address these limitations, we propose a unified hybrid framework that integrates both data- and algorithm-level balancing strategies into a regressor-agnostic pipeline. The proposed framework consists of five stages: (1) adaptive bin partitioning to dynamically segment the target space based on local linear coherence; (2) target-conditioned representation learning using a Conditional Variational Autoencoder; (3) multistage data-level balancing through feature-space clustering and oversampling of minority clusters; (4) algorithm-level balancing using a novel Latent-Density Weighted Loss (LDWL) to emphasize rare samples in latent and target spaces; and (5) attention-based gated fusion for final regression. Experimental results on benchmark datasets demonstrate that the proposed framework consistently improves predictive performance compared to standalone regressors and existing imbalanced regression approaches.

2606.01220 2026-06-02 cs.LG cs.AI 版本更新

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调扩散模型用于分子生成

Guang Lin, Shikui Tu, Lei Xu

发表机构 * Department of Computer Science and Engineering, Shanghai Jiao Tong University(上海交通大学计算机科学与工程系) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳))

AI总结 提出FTDiff框架,结合组相对策略优化和快速采样机制,微调扩散模型以生成满足多目标药物设计约束的高质量分子。

Comments 13 pages, 7 figures

详情
AI中文摘要

生成同时满足类药性质并符合目标蛋白三维结构的分子是基于结构的药物设计(SBDD)中的核心挑战。然而,现有的生成方法通常依赖于采样过程中昂贵的后处理或训练时需要精心策划的数据集,但增益仍然有限。这些限制在多目标设置中尤为突出,平衡冲突标准仍是一个核心挑战。为了解决这些问题,我们提出了FTDiff,一个专为结构约束下基于扩散的分子生成量身定制的强化学习微调框架。为了确保稳定且样本高效的优化,FTDiff采用了组相对策略优化(GRPO)风格策略。此外,FTDiff基于一个无时间预训练扩散模型,并集成了快速采样机制,减少了去噪步数,在保持生成质量的同时显著加速了训练和推理。通过优化一个固定阈值感知的奖励,FTDiff有效引导模型生成有效、多样且高质量的分子,平衡多个药物设计目标。在基准数据集上的大量实验表明,FTDiff始终优于先前的方法,且无需昂贵的后处理优化或复杂的数据工程。

英文摘要

Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during Sampling or require carefully curated datasets during training, yet still achieve modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains a core challenge. To address these challenges, We propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high- quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM 版本更新

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出APEIRIA,通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中,实现透明推理与开放词汇空间推理的统一。

Comments To appear in ICML 2026

详情
AI中文摘要

当前的3D空间推理方法面临根本性权衡:神经符号3D(NS3D)概念学习器通过组合程序实现可解释推理,但受限于封闭集概念词汇和简单程序;端到端3D多模态大语言模型(3D MLLMs)能处理复杂自然语言和开放词汇概念,但缺乏显式空间验证的黑箱推理。我们提出APEIRIA,一种神经符号3D MLLM,通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中,桥接两种范式。我们的三阶段课程逐步构建推理能力:a) 3D感知对齐将物体视觉-几何特征接地到LLM,b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证,c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识,APEIRIA保留了NS3D的关键优点:透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明,APEIRIA超越了先前的NS3D方法,并在3D空间推理数据集上匹配最先进的3D MLLMs,统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

2606.01213 2026-06-02 cs.CV cs.AI cs.CL 版本更新

TECCI: Tricky Edits of Collected and Curated Images

TECCI:收集与策划图像的棘手编辑

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

发表机构 * Google Research(谷歌研究) Google DeepMind(谷歌深Mind)

AI总结 提出TECCI基准,包含7550对图像与编辑指令,通过人工与自动评估揭示现有图像编辑模型在指令遵循、最小编辑和视觉质量方面的不足。

详情
AI中文摘要

尽管近期取得了巨大进展,但当前的文本引导图像编辑方法在涉及指令遵循、最小化编辑源图像以及确保高视觉质量等多个方面仍面临困难。当请求的编辑具有挑战性时,例如涉及位置、运动、视角、比例和创意编辑,这些问题尤为明显。为了系统性地测试生成式图像编辑器,我们提出了一个新的图像编辑基准——TECCI:收集与策划图像的棘手编辑。TECCI包含我们发布的全新图像集。TECCI中的图像涵盖7个图像类别。这些图像和类别经过有意策划,以针对现有方法的弱点。TECCI中的编辑指令由Gemini自动生成,每个源图像覆盖5种编辑类型。我们还策划了一组530张图像,为其创建了具有挑战性的人工编写编辑指令。总体而言,TECCI包含7550对图像和编辑指令。我们对TECCI上的五个领先图像编辑模型进行了人工评估。人类从三个维度判断输出:1)指令遵循,2)编辑的最小性,以及3)视觉质量。为了扩大评估规模,我们还使用Gemini构建了一个自动评分器,在匹配人类评估方面达到了74.7%的准确率。我们的评估揭示:1)没有一个模型的总体成功率超过22%,这显示了TECCI的挑战性;2)Nano Banana Pro是整体表现最好的模型;3)模型在指令遵循方面表现显著优于最小编辑和视觉质量;4)模型在编辑建筑和自然图像方面存在困难,这些需要较强的空间布局和复杂视觉细节理解能力;5)推理和创意编辑是最困难的,而颜色和外观编辑是最容易的。

英文摘要

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale-up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details. 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

2606.01204 2026-06-02 cs.CL cs.AI cs.CY 版本更新

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

LLM医疗分诊中的隐式地理推断:语言驱动的急诊推荐差异

Qi Han Wong

发表机构 * GitHub

AI总结 研究大型语言模型在相同症状下,仅因患者提示语言不同而产生不同的医疗分诊推荐,发现模型根据输入语言隐式推断地理位置,导致急诊推荐率差异显著。

Comments 7 pages, 4 tables. Code and data at https://github.com/wongqihan/ai-behavioral-experiments

详情
AI中文摘要

我们研究大型语言模型是否仅根据患者提示的语言,对相同症状产生不同的医疗分诊推荐。使用Gemini 3.5 Flash,我们评估了六种语言(英语、西班牙语、中文、印地语、日语、阿拉伯语)下的神经症状特征(持续性头痛、视力模糊、恶心),每种条件运行30次(共450次API调用)。我们发现,尽管模型在所有语言中分配的严重程度评分几乎相同(7.7-8.0/10),但急诊室就诊推荐率从0%(日语、印地语)到30%(英语、阿拉伯语)不等。添加一句指定患者位于美国的句子,非英语提示的急诊推荐率最多增加76.7个百分点,而反向锚定(英语提示加上东京地点)将急诊率从30%降至6.7%。回译控制(日语到英语)产生的急诊率与英语基线相当,证实差异并非由翻译质量引起,而是由输入语言的隐式地理推断所致。我们发布了完整的数据集、实验代码和结果。

英文摘要

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.

2606.01202 2026-06-02 cs.AI cs.CL cs.LG 版本更新

The Shape of Wisdom: Decision Trajectories in Language Models

智慧的形状:语言模型中的决策轨迹

Shailesh Rana

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过分析三种语言模型在MMLU上的9000条轨迹,提出用答案边际、边际变化和决策翻转距离描述轨迹,发现正确性与稳定性不同,并探究了注意力与MLP标量对边际的影响。

Comments 6 pages, 5 figures. Code and derived artifacts: https://github.com/gut-puncture/The-Shape-of-Wisdom

详情
AI中文摘要

语言模型并非简单地在输出层选择一个答案。在一项包含9000条轨迹的MMLU研究中,涉及Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct和Mistral-7B-Instruct-v0.3,答案的分数在深度上以结构化方式移动。我们用三个量描述每条轨迹:当前答案边际、该边际的下一层变化,以及距离决策翻转的距离。主要经验图景是正确性和稳定性是不同的:最大的群体是不稳定-正确的,而不是稳定-正确的。然后,一个追踪的子集询问是什么推动了边际。在稳定-正确的情况下,平均注意力标量指向正确的方向,而平均MLP标量则不然;跨度删除显示,移除支持答案的文本会损害边际,而移除类似干扰项的文本则有助于边际。结果并非完整的电路解释。它是一种可重复的方式,用于查看哪些答案已确定,哪些仍然脆弱,以及哪些测量来源推动了它们。

英文摘要

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.

2606.01199 2026-06-02 cs.AI 版本更新

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

LLM智能体能否维持长期组织动态?

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou, Xiaohan Zhang, Yongrui Liu, Guoshun Nan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出TaskWeave分层智能体框架,通过记忆中心的协调机制(规划-分解-诊断-对齐循环和依赖感知追踪记忆)实现长期组织模拟,实验表明该框架能维持连贯的组织动态并产生可靠的人工制品。

详情
AI中文摘要

大型语言智能体越来越多地用于社会模拟,但尚不清楚它们能否在结构化组织中维持连贯行为,其中目标必须通过层级传播,任务依赖于先前执行,并且人工制品在长期范围内积累。我们将长期组织模拟定义为以记忆为中心的协调问题,并引入TaskWeave,这是一个分层智能体框架,通过制定-分解-诊断-对齐循环维护规划状态,并通过依赖感知追踪记忆来接地执行。我们在一个为期一年的IT公司模拟中评估TaskWeave,并将其与其他多智能体框架在组织连贯性、执行接地和下游企业NLP效用方面进行比较。实验表明,TaskWeave支持连贯且长期的组织动态,同时产生接地的人工制品并适应外部环境。这些发现表明,结构化模拟记忆是构建可靠的基于LLM的组织模拟器的关键机制。

英文摘要

Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.

2606.01196 2026-06-02 cs.CL cs.AI 版本更新

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

低资源安全失败是行动失败,而非表征失败

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎伊德大学人工智能学院)

AI总结 本文发现低资源语言的安全对齐失败源于决策校准问题而非表征缺失,通过重校准高资源门控(低秩逻辑回归+阈值重置)显著提升拒绝选择性。

详情
AI中文摘要

在高资源语言中学习的安全对齐在低资源语言中迁移效果不佳。模型能拒绝英文有害提示,但当相同提示翻译成斯瓦希里语或缅甸语时则无法拒绝。自适应引导方法如AdaSteer和CAST在跨语言中继承了这一失败。我们诊断了迁移失败的原因。在Qwen2.5-7B、Gemma-2-9B和Llama-3.1-8B模型上,针对23种语言,从高资源激活中提取的有害方向几乎能像高资源提示一样线性分离低资源有害与无害提示。相关表征存在。然而,有害拒绝率从87.9%下降到43.9%。模型未能将表征转化为拒绝。未能迁移的是安全决策的校准,而非底层表征。我们利用这一点,通过重校准而非重新训练高资源门控:一个低秩逻辑回归读出器,其决策阈值使用每类仅1到4个目标语言示例重置。该门控在拒绝引导和有害方向消融之间路由,将平均拒绝选择性(Δ = 有害 − 无害拒绝)从最强自适应基线的33.6显著提高到54.5,同时保持MMLU效用。这些结果表明,一些低资源安全失败可以通过重校准现有表征而非学习新表征来修复。我们的代码已发布:https://github.com/rashadaziz/low-resource-safety。

英文摘要

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($Δ$ = harmful $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.

2606.01189 2026-06-02 cs.AI 版本更新

The Case for Model Science: Verify, Explore, Steer, Refine

模型科学的案例:验证、探索、引导、改进

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel, Andreas Holzinger, Wojciech Samek

发表机构 * Center for Credible AI(可信AI中心) University of Warsaw(华沙大学) Warsaw University of Technology(华沙技术大学) University College Cork(科克大学学院) University of Technology Sydney(悉尼技术大学) Kempner Institute, Harvard University(哈佛大学凯普纳研究所) Human-Centered AI Lab(以人为本的人工智能实验室) Technical University of Berlin(柏林技术大学) Fraunhofer Heinrich Hertz Institute(弗劳恩霍夫海因里希·赫茨研究所) Berlin Institute for the Foundations of Learning and Data (BIFOLD)(柏林学习与数据基础研究所(BIFOLD))

AI总结 本文提出AI社区应超越基准测试,建立系统性的模型分析学科——模型科学,通过验证、探索、引导和改进四个功能视角,以及共享基础设施和深度案例研究,来理解复杂AI模型的行为。

Comments Follow up on arXiv:2508.20040

详情
AI中文摘要

我们认为,AI社区现在已经准备好超越基准测试,并将分散的模型分析工作整合成一个系统性的学科,我们称之为模型科学。复杂的AI模型现在服务于数十亿用户,但我们对它们工作原理的理解远远落后于部署它们的能力。几十年来以基准测试为导向的研究取得了显著进展:广泛的排行榜、各种性能指标、跨不同任务的能力提升追踪;然而,这种成功也揭示了基准测试的局限性,因为它们告诉我们模型是否表现良好,但不告诉我们为什么成功或失败,它们忽略了关键的失败模式,如幻觉或捷径。来自成熟科学的先例指明了前进的方向:认知科学表明,理解复杂系统需要互补的分析层次;神经科学证明,对单个案例的深入研究揭示了群体研究遗漏的东西;医学教导我们,专业培训必须与研究实践同步发展;农业模型展示了共享基础设施和原则如何实现累积进展。这些经验为模型科学提供了三个基础。首先,我们建议围绕四个功能视角整合研究:验证、探索、引导和改进,这些视角解决了关于模型行为的互补问题。其次,我们讨论了累积知识所需的基础设施:数据集、模型和发现的目录。第三,我们强调需要对单个模型实例进行深入分析,而不仅仅是模型家族,因为单个案例可以揭示群体研究遗漏的东西。

英文摘要

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, tracking capability gains across diverse tasks; yet this success has also revealed the limits of benchmarks as they tell us whether models perform but not why they succeed or fail, they miss critical failure modes, such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives: Verify, Explore, Steer, and Refine that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.

2606.01188 2026-06-02 cs.HC cs.AI 版本更新

pcbGPT: Automatic PCB Schematic Synthesis from Natural Language Requirements

pcbGPT: 从自然语言需求自动合成PCB原理图

Tobias King, Steven Kehrberg, Michael Beigl, Tobias Röddiger

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Bosch Sensortec GmbH(博世传感器技术有限公司)

AI总结 提出pcbGPT系统,通过工具增强合成、组件库搜索、数据表知识、执行检查、结构语义验证和交互式工作流,从自然语言规格自动生成可编辑的KiCad原理图,在20个嵌入式任务上达到pass@1为0.90。

详情
AI中文摘要

在嵌入式、物联网和可穿戴设备开发中,将自然语言硬件需求转化为正确的印刷电路板(PCB)原理图仍然困难。设计者必须选择兼容的组件、解读数据手册、添加支持电路并暴露正确的接口,然后才能开始布局和原型制作,而许多此类电路无法通过简单的仿真进行验证。我们提出了pcbGPT,一个从自然语言规格生成可编辑KiCad原理图的接地系统。pcbGPT用Python DSL表示电路,并结合了工具增强合成、组件库搜索、基于数据手册的设计知识、基于执行的检查、结构和语义验证,以及支持迭代优化和与KiCad项目同步的交互式Web工作流。我们在20个嵌入式原理图生成任务上评估了该系统,这些任务具有参考实现、所需组件和接口约束,以便自动比较。最佳模型在整体上达到pass@1为0.90,pass@5为1.00;在基础和简单任务上pass@1为1.00,中等任务上为0.91,困难任务上为0.72。这些结果以及失败分析表明,pcbGPT已经能够为早期原型生成有用的、可审查的初稿原理图,但尚不足以可靠地取代专家审查。

英文摘要

Translating natural-language hardware requirements into correct printed circuit board (PCB) schematics remains difficult in embedded, IoT, and wearable development. Designers must choose compatible components, interpret datasheets, add support circuitry, and expose correct interfaces before layout and prototyping can begin, while many such circuits cannot be validated through straightforward simulation. We present pcbGPT, a grounded system for generating editable KiCad schematics from natural-language specifications. pcbGPT represents circuits in a Python DSL and combines tool-augmented synthesis with component-library search, datasheet-grounded design knowledge, execution-based checking, structural and semantic validation, and an interactive web workflow that supports iterative refinement and synchronization with KiCad projects. We evaluate the system on 20 embedded schematic-generation tasks with reference implementations, required components, and interface constraints that enable automatic comparison. The best model reaches overall pass@1 of 0.90 and pass@5 of 1.00; pass@1 is 1.00 on basic and easy tasks, 0.91 on medium tasks, and 0.72 on hard tasks. These results, together with failure analysis, show that pcbGPT can already generate useful, reviewable first-draft schematics for early prototyping, but is not yet reliable enough to replace expert review.

2606.01185 2026-06-02 cs.AI 版本更新

"Skill issues'': data-centric optimization of lakehouse agents

技能问题:湖仓代理的数据中心优化

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

发表机构 * University of Maryland(马里兰大学) Università Milano Bicocca(米兰Bicocca大学) Bauplan Labs(Bauplan实验室)

AI总结 针对分支湖仓Bauplan上的编码代理,提出数据中心的优化流程,通过生成任务验证器对、在隔离沙箱中执行候选技能并利用追踪信号和程序化检查评分,将准确率提升31.9%。

详情
AI中文摘要

编码代理正在成为数据基础设施的用户,但它们的成功不仅取决于模型质量:还取决于教导代理如何使用系统的技能和环境文件。我们研究如何为在分支湖仓Bauplan上操作的代理优化这些工件。在我们的设置中,无头API和类似Git的数据原语通过代码、分支、提交和合并暴露数据工作流。我们的核心观察是,分支湖仓将数据代理评估从输出匹配问题转变为状态验证问题:代理生成的管道代码会引发具体的、可检查的湖仓变化。我们提出了一个数据中心优化流程,生成任务验证器对,在隔离沙箱中执行候选技能,并使用追踪级信号和湖仓状态的程序化检查对轨迹进行评分。在25个任务的初步评估中,优化后的技能将准确率提升了31.9%。这些结果表明,写路径数据工作流为优化代理技能提供了有用的基础,超越了只读任务。

英文摘要

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.

2606.01182 2026-06-02 cs.CL cs.AI 版本更新

CA-BED: Conversation-Aware Bayesian Experimental Design

CA-BED:对话感知的贝叶斯实验设计

Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu, Sunishchal Dev, Gabriel Grand, Shreyas Sunil Kulkarni

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 提出对话感知的贝叶斯实验设计(CA-BED),一种推理时概率对话规划框架,通过结合贝叶斯实验设计与LLM似然估计,在多个对话轮次中优化问题选择,在结构化实体推断基准上平均成功率提升21.8%,仅增加1.8轮对话。

Comments Reliable Autonomy Workshop at ICLR 2026

详情
AI中文摘要

大型语言模型(LLM)在静态推理任务中表现出色,但在需要通过提问主动获取信息的交互场景中,其性能往往会下降。一个关键挑战在于选择能够减少不确定性同时纳入可能模糊或仅部分信息性的回应的问题。为了解决这个问题,我们提出了对话感知的贝叶斯实验设计(CA-BED),一种推理时概率对话规划框架,它将贝叶斯实验设计与基于LLM的似然估计相结合,以在多个对话轮次中优化问题选择。CA-BED维护关于假设的信念分布,预测可能的答案,并通过模拟对话树传播期望信息增益。在两个结构化实体推断基准上,CA-BED相比直接提示实现了平均21.8%的成功率提升,相对于其他信息寻求方法也有相当的增益。与直接提示相比,它仅平均增加了1.8个对话轮次就实现了这些增益。

英文摘要

Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.

2606.01179 2026-06-02 cs.LG cs.AI 版本更新

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

异质系统中熵预测的物理信息深度学习:热力学与信息论案例研究

Biswajeet Sahoo, Debadutta Patra

发表机构 * Durham University(杜ham大学) Department of Chemical Engineering(化学工程系) Veer Surendra Sai University of Technology(维尔·苏雷纳·赛大学)

AI总结 提出统一物理信息深度学习框架,通过微分方程残差和信息论约束,在单一神经网络中同时实现热力学与信息论系统的熵预测,并验证其数据效率和物理一致性。

详情
AI中文摘要

熵产生支配着物理和信息论系统中的不可逆性和不确定性。尽管物理信息神经网络(PINNs)成功求解微分方程,但当前架构本质上仍是领域特定的。跨根本不同物理定律的领域不变熵表示的提取尚未探索。本文引入了一个统一的物理信息深度学习(PIDL)框架,该框架在单一神经架构中同时强制执行微分方程残差和信息论界限。我们通过两个经典研究来展示该框架:(i)一个热力学连续搅拌釜反应器(CSTR)模型,求解控制常微分方程,其中Softplus约束严格强制执行热力学第二定律;(ii)一个信息论金融市场模型,求解逆Fokker-Planck偏微分方程以推断潜在漂移和扩散系数,通过Softplus约束保证扩散正性,同时自然诱导香农熵。评估了三种模型变体:两个特定领域基线和一种共享编码器架构。PIDL框架保证了绝对的热力学可接受性,零违反第二定律,并表现出卓越的数据效率,仅使用30%的可用训练数据即可保持>90%的预测精度。此外,对学习到的熵表面的事后Ruppeiner黎曼几何分析成功识别了热力学相不稳定性。该方法为物理约束熵建模提供了一个稳健、领域无关的架构,推动了可持续过程设计和定量金融风险评估的应用。

英文摘要

Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neural Networks (PINNs) successfully solve differential equations, current architectures remain inherently domain-specific. The extraction of domain-invariant entropy representations across fundamentally different physical laws remains unexplored. This paper introduces a unified Physics-Informed Deep Learning (PIDL) framework that simultaneously enforces differential equation residuals and information-theoretic bounds within a single neural architecture. We demonstrate this framework via two canonical studies: (i) a thermodynamic continuous stirred-tank reactor (CSTR) model solving governing ODEs, where a Softplus constraint strictly enforces the Second Law of Thermodynamics; and (ii) an information-theoretic financial market model solving the inverse Fokker-Planck PDE to infer latent drift and diffusion coefficients, guaranteeing diffusion positivity via a Softplus constraint while naturally inducing Shannon entropy. Three model variants are evaluated: two domain-specific baselines and one shared-encoder architecture. The PIDL framework guarantees absolute thermodynamic admissibility with zero Second-Law violations and exhibits exceptional data efficiency, retaining >90% predictive accuracy using merely 30% of available training data. Furthermore, a post-hoc Ruppeiner Riemannian geometric analysis of the learned entropy surface successfully identifies thermodynamic phase instabilities. This methodology provides a robust, domain-agnostic architecture for physics-constrained entropy modeling, advancing applications in sustainable process design and quantitative financial risk assessment.

2606.01171 2026-06-02 cs.CY cs.AI 版本更新

AI From the Margins (AIM): Rethinking Participatory AI Design Through the Lived Experience of Minoritized Communities

边缘AI(AIM):通过少数群体生活经验重新思考参与式AI设计

Tijs Portegies, Laureanne Willems, Maaike Harbers, Giovanni Sileno, Roland van Dierendonck, Mayesha Tasnim, Lotte Willemsen, Sennay Ghebreab

发表机构 * Utrecht University(乌特勒支大学) University of Amsterdam(阿姆斯特丹大学)

AI总结 提出AIM方法论,通过叙事启发、共同规则制定等步骤,将少数群体的生活经验融入参与式AI设计,并在荷兰医疗场景中验证其有效性。

Comments Under review at the AAAI/ACM Conference on AI, Ethics, and Society (AIES 2026)

详情
AI中文摘要

人工智能(AI)可以再现并放大少数群体面临的结构性不平等。参与式AI被提出作为应对措施,但参与通常始于问题定义和成功标准设定之后,留给少数群体重塑AI系统目的的空间有限。我们提出边缘AI(AIM):一种方法论立场,阐明如何引发、聚焦并推进少数群体的生活经验,以指导参与式AI设计。AIM并非固定协议;它阐明了一组先决条件,可通过不同技术在不同环境中实施。我们在荷兰医疗背景下,与13名有色人种女性和非二元性别者以及5名市政政策工作者进行了八次会话,应用了AIM,具体包括:(1)使用传记叙事解释法(BNIM)进行叙事启发;(2)共同构建规则制定;(3)参与者决定AI是否、在何处以及如何介入;(4)通过与政策制定者的对话将生活经验转化为AI政策。在会话反思中,参与者将参与描述为实质性的,并呼吁继续开展,展示了以生活经验为基础的准备性取向如何塑造参与式AI设计的目的。

英文摘要

Artificial intelligence (AI) can reproduce and amplify the structural inequities faced by minoritized communities. Participatory AI has been proposed as a response, but participation typically starts after problem definitions and success criteria have been set, leaving limited room for minoritized communities to reshape what an AI system is for. We propose AI From the Margins (AIM): a methodological stance that articulates the conditions under which lived experiences of minoritized communities can be elicited, centered, and carried forward to inform participatory AI design. AIM is not a fixed protocol; it articulates a set of preconditions that can be enacted through different techniques in different settings. We applied AIM in a Dutch healthcare context in eight sessions with 13 women and non-binary people of color and five municipal policy workers, namely through (1) narrative elicitation using the Biographic Narrative Interpretive Method (BNIM); (2) co-constructed rule-making; (3) participants' determination of whether, where, and how AI should be involved; and (4) translating lived experience into AI policy through dialogue with policymakers. In their reflections on the sessions, participants described the engagement as substantive and called for its continuation, demonstrating how preparatory orientation fundamentally grounded in lived experience shapes what participatory AI design is for.

2606.01160 2026-06-02 cs.AI 版本更新

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

形式数学验证中生成式奖励建模的期望值对齐

Shihao Ji, Haotao Tan, Zihui Song, Mingyu Li

发表机构 * GitHub

AI总结 提出期望值对齐(EVA)方法,通过从模型词元分布中提取连续分数,在保持生成式奖励模型离散输出的同时实现连续评分,用于Lean 4形式验证。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地与形式化交互式定理证明器(如Lean 4)一起使用。通过强化学习或搜索方法扩展这些系统需要能够评估中间推理步骤的过程奖励模型(PRMs)。现有的奖励模型设计暴露了一个实际的权衡。值头模型提供连续分数但修改了生成模型接口,而生成式奖励模型保留了文本理由但难以匹配连续浮点回归,因为数值被分割到多个词元上。我们引入了期望值对齐(EVA),一种奖励建模过程,它保持表面输出离散,同时从模型的词元分布中提取连续分数。模型以结构化的JSON格式输出整数分数,EVA计算对应锚定词元logits的期望值作为连续分数。训练结合了因果语言建模目标与这些期望值的辅助均方误差损失。我们在 extit{Leibniz}中实例化EVA,这是一个用于Lean 4形式验证的奖励模型,并针对零样本和奖励建模基线进行了评估。评估表明,基于logits的连续评分显著减少了离散化伪影,同时保留了生成式批评的可解释性。

英文摘要

Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in \textit{Leibniz}, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.

2606.01155 2026-06-02 cs.LG cs.AI 版本更新

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

当数据稀缺时:通过重复训练扩展稀疏语言模型

Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Mykola Pechenizkiy, Elena Mocanu, Torsten Hoefler, Decebal Constantin Mocanu

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) University of Luxembourg(卢森堡大学) University of Twente(埃因霍温大学) ETH Zürich(苏黎世联邦理工学院)

AI总结 研究数据受限下稀疏训练的可扩展性,提出包含活跃参数、唯一标记、数据重复和稀疏度的缩放定律,发现稀疏训练可延迟数据饱和并改善资源权衡。

Comments Accepted at ICML2026

详情
AI中文摘要

密集大语言模型在无限数据下的缩放定律已被充分探索,但稀疏性与有限数据如何相互作用尚未研究。在这项工作中,我们研究了数据受限场景下的稀疏训练,其中有限的唯一标记需要多轮训练。我们的实验涵盖拟合集中最多1.92B参数的模型、最高93.75%的稀疏度、最多2.6B标记的唯一数据预算,以及16轮训练中最多41.6B的总训练标记;我们进一步在保留的密集等价模型(最多7.68B参数)上验证了外推能力。我们发现:1. 数据受限下的稀疏缩放:我们引入了一个缩放定律,将损失建模为活跃参数、唯一标记、数据重复和稀疏度的函数,准确预测跨计算和数据预算的性能。2. 延迟数据饱和:稀疏训练延迟了重复数据带来的收益递减,使多轮训练更有效。3. 资源权衡:在固定数据下,损失最优的稀疏度约为50%,而计算最优的稀疏度更高且随数据规模增长。总体而言,稀疏性不仅是提高效率的工具,也是在数据稀缺下改善缩放权衡的机制。我们的代码可在 https://github.com/boqian333/sparse-dc-scaling 获取。

英文摘要

Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data-constrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held-out dense-equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data-limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: sparse training postpones diminishing returns from repeated data, making multi-epoch training more effective. 3. Resource trade-offs: With fixed data, loss-optimal sparsity is moderate ~ 50%, while compute-optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade-offs under data scarcity. Our code is available at: https://github.com/boqian333/sparse-dc-scaling.

2606.01152 2026-06-02 cs.CY cs.AI cs.SE 版本更新

ASE-26: a curriculum for agentic software engineering as a discipline

ASE-26:作为一门学科的代理软件工程课程

Mikael Gorsky

发表机构 * Holon Institute of Technology(霍洛恩技术学院)

AI总结 本文提出ASE-26本科课程,通过21个模块和进化螺旋模型,系统培养代理软件工程学科所需的实践者技能。

Comments 12 pages, 20 references. Companion paper to the ASE-26 curriculum deposited on Zenodo at doi:10.5281/zenodo.20468021. Part 1 of a planned series of two pre-prints on the curriculum and its conceptual core

详情
AI中文摘要

专业软件工程师的工作已开始越来越多地转向指导代理而非编写代码,这一转变的实证证据已有数年之久。Anthropic的经济指数显示,Claude Code交互中自动化占比79%[2];Handa及其在Anthropic的同事发现,计算机程序员任务中约75%的不同活动受到AI影响[3];Brynjolfsson及其在斯坦福数字经济实验室的同事报告称,在AI暴露程度最高的职业中,22至25岁工人的就业率相对下降了13%[4]。这一转变尚未完成,关于代理软件工程的学术文献一致认为,缺失的能力并非更好的模型,而是结构化的实践者学科。本文介绍了ASE-26,一个面向代理软件工程学科的综合性本科课程,作为可引用参考存储在Zenodo上,采用CC BY-ND 4.0许可[12]。本文阐述了课程所依据的学科框架、其概念贡献(最重要的是,作为意图与构建共同进化的操作形式的进化螺旋)、组织学科教学的21个模块结构、与代理共同产生作业评分所遵循的教学承诺、毕业生的收获,以及所教授学科如何设计以超越当前模型的具体能力。本文的立场是,行业目前缺乏的实践者技能正是该学科所命名的技能,而结构化的代理软件工程本科课程是弥合这一差距的主要机制。

英文摘要

The work of a professional software engineer has begun to consist, increasingly, of directing agents rather than writing code, and the empirical evidence for the shift is now several years deep. Anthropic's Economic Index puts automation at 79 per cent of Claude Code interactions [2]; Handa and colleagues at Anthropic find AI exposure for Computer Programmer tasks at approximately 75 per cent of the role's distinct activities [3]; Brynjolfsson and colleagues at Stanford's Digital Economy Lab report a 13 per cent relative decline in employment for workers aged 22 to 25 in occupations most exposed to AI [4]. The shift is also unfinished, and the academic literature on agentic software engineering converges on the finding that the missing capability is not better models but structured practitioner discipline. This paper presents ASE-26, a comprehensive undergraduate curriculum for agentic software engineering as a discipline, deposited as a citable reference on Zenodo under CC BY-ND 4.0 [12]. The paper sets out the discipline framing the curriculum rests on, the conceptual contributions it makes (most importantly, the evolutionary spiral as the operational form of the co-evolution of intent and build), the twenty-one-module structure that organises the discipline for teaching, the pedagogical commitments that follow from grading work co-produced with an agent, what graduates leave with, and how the discipline as taught is designed to outlast the specific capabilities of today's models. The position the paper takes is that the practitioner skills the industry currently lacks are precisely the skills the discipline names, and that structured undergraduate curricula in agentic software engineering are the principal mechanism by which the gap closes.

2606.01145 2026-06-02 cs.AI 版本更新

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

Reasoning4Sciences:将推理语言模型桥接到所有科学分支

Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk, Łukasz Radliński, Maciej Markiewicz, Aleksander Szczęsny, Stanisław Woźniak, Tymoteusz Romanowicz, Dzmitry Pihulski, Mateusz Zbrocki, Mateusz Śmigielski, Michał Rajkowski, Mateusz Biedka, Konrad Kiełczyński, Konrad Wojtasik, Jacek Duszenko, Jan Eliasz, Piotr Matys, Michał Bernacki-Janson, Maria Bellaniar Ismiati, Latius Hermawan, Wiktoria Mieleszczenko-Kowszewicz, Anna Kubicka-Sowinska, Grzegorz Chodak, Karol Postawa, Paweł Zyblewski, Tomasz Szandała, Łukasz Sterczewski, Adrian Chajec, Pawel Niewiadomski, Piotr Gruber, Marcin Wdowikowski, Sławomir Czarnecki, Bartłomiej Kryszak, Dominik Drabik, Tomasz Kajdanowicz, Kamil Mamak, Paweł Preś, Katarzyna Paczkowska, Joachim Sobczuk, Tomasz Zięba, Jan Kocoń, Maciej Piasecki, Przemysław Kazienko

发表机构 * Poznan University of Technology(波兹南理工大学) National Cheng Kung University(国立成功大学) Universitas Katolik Musi Charitas Palembang(Palembang 巴厘岛天主教大学)

AI总结 本文首次全面分析推理语言模型在28个科学学科中的采用情况,提出基于领域资源的成熟度评估框架,揭示学科间差距并展望未来方向。

详情
AI中文摘要

虽然推理语言模型(RLMs)正迅速成为科学研究的强大工具,但其影响主要集中在“硬科学”领域。RLMs在其他科学分支中的采用缓慢(或缺乏)导致研究生产力差距不断扩大。在本综述中,我们首次按照欧洲研究理事会(ERC)使用的分类,对RLMs在28个科学学科中的采用情况进行了全面分析,涵盖社会科学与人文、物理科学与工程以及生命科学。我们研究了RLMs如何跨学科开发、评估和应用。此外,我们引入了一个基于可用领域特定开发和评估资源的成熟度导向评估框架,揭示了RLM成熟度的显著差异,当仅考虑公开可用资源时,这种差异变得更加明显。最后,我们强调了当前跨学科流行的实施范式、当前挑战以及推动RLMs在科学中采用的未来方向。

英文摘要

While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in "hard science" fields. The slow -- or lack of -- adoption of RLMs in other branches of science is causing a widening gap in research productivity. In this survey, we provide the first comprehensive analysis of RLM adoption across 28 scientific disciplines following the classification used by the European Research Council (ERC), spanning the Social Sciences and Humanities, Physical Sciences and Engineering, and Life Sciences. We examine how RLMs are developed, evaluated, and applied across disciplines. Furthermore, we introduce a maturity-oriented assessment framework based on available domain-specific development and evaluation resources, revealing substantial disparities in RLM maturity that become even more pronounced when only publicly available resources are considered. Finally, we highlight current implementation paradigms that are gaining popularity across disciplines, current challenges, and future directions in enabling RLM adoption across science.

2606.01126 2026-06-02 cs.LG cs.AI cs.CV 版本更新

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

STARFISH: 从内部状态修复中实现剪枝网络的快速精度恢复

Shir Maon, Odelia Melamed, Adi Shamir

发表机构 * Weizmann Institute of Science(魏茨曼科学研究所)

AI总结 提出STARFISH方法,通过少量无标签校准集优化剪枝网络与原始网络内部状态对齐,高效恢复精度,在ViT网络上优于现有方法。

详情
AI中文摘要

剪枝是一种旨在减少大型神经网络中权重数量的过程。这可以显著加快推理速度,但可能导致模型精度大幅下降,因此通常随后会进行修复过程以恢复部分丢失的精度。在本文中,我们提出了一种新的修复方法STARFISH,它可以高效地恢复任何剪枝网络的(大部分)精度。STARFISH的主要思想是使用少量无标签示例的校准集,优化剪枝网络以与原始网络的内部状态表示对齐。对于去除50%权重的常见情况,在基于ViT的网络中,STARFISH修复相比最先进方法将恢复精度提高了高达22%。在激进剪枝下其优势更为显著。例如,在ImageNet的DeiT-B网络中去除75%权重后,STARFISH仅使用训练图像数量的0.4%作为校准集,恢复了原始稠密模型精度的82%,而竞争恢复技术仅达到稠密模型精度的40%。

英文摘要

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.

2606.01117 2026-06-02 cs.LG cs.AI 版本更新

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

HASTE: 面向大输出空间的硬件感知动态稀疏训练

Nasib Ullah, Jinbin Zhang, Jean Lucien Randrianantenaina, Erik Schultheis, Rohit Babbar

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出组共享固定扇入稀疏性方法,通过半结构化输出层设计结合长尾分解,在极端多标签分类中实现显著加速并保持精度。

Comments Accepted at ICML 2026 Regular

详情
AI中文摘要

极端多标签分类(XMC)涉及在具有数百万标签的大输出空间上学习模型,使得输出层成为内存计算瓶颈。虽然基于稀疏性的方法降低了算术复杂度,但由于不规则内存访问、硬件利用率低或在长尾场景中依赖辅助架构组件,它们通常无法产生成比例的速度提升。我们引入了组共享固定扇入稀疏性,一种半结构化的输出层设计,其中语义相关的标签共享一个稀疏输入模式,同时保留独立的权重。这种分组引入了任务对齐的归纳偏置——鼓励相关标签共享特征子集——同时减少了索引内存开销,增加了跨标签的特征重用,并通过利用现代加速器原语的自定义CUDA内核实现了高效的GPU执行。作为辅助目标的替代方案,我们利用XMC的长尾结构,将输出层分解为频繁标签上的小型密集头部和其余标签上的组共享稀疏尾部,在保留稀疏性内存优势的同时提供了信息丰富的梯度路径。通过内核级微基准测试,我们表明组共享固定扇入将算术减少转化为实际的挂钟时间增益,在前向传播中实现了高达4.4倍的加速,在反向传播中实现了高达25倍的加速,同时与FLOPs匹配的密集瓶颈相比,性能仅相差几个百分点。在大型XMC基准测试中,我们的方法在precision@k上匹配或优于先前的稀疏基线,同时缩小了与密集方法的性能差距。

英文摘要

Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias -- encouraging related labels to share feature subsets -- while reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in translates arithmetic reductions into practical wall-clock gains, achieving up to $4.4\times$ speedup in the forward pass and up to $25\times$ speedup in backward passes over standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k over prior sparse baselines, while narrowing the performance gap to dense.

2606.01101 2026-06-02 cs.LG cs.AI 版本更新

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

Soft-NBCE: 基于熵加权分块融合的长上下文处理

Shihao Ji, Mingyu Li, Zihui Song

发表机构 * Beijing Normal University(北京师范大学) Chunjiang Intelligence(春江智能)

AI总结 针对长上下文推理中硬选择策略导致语义碎片化的问题,提出Soft-NBCE,通过熵加权软融合和一致性蒸馏,在保持检索精度的同时提升多跳推理性能。

Comments 7 pages, 3 figures, 2 tables. Preprint

详情
AI中文摘要

自注意力的二次复杂度仍然是大型语言模型(LLMs)处理超长上下文的瓶颈。朴素贝叶斯认知引擎(NBCE)通过将文档分块并在每个解码步骤路由到熵最低的分块,实现了长上下文推理的并行化。这种硬选择策略在跨分块推理时会导致语义碎片化,因为相邻token之间的突然路由变化破坏了模型的上下文基础。我们提出了Soft-NBCE,这是一种轻量级扩展,用软熵加权分块融合替代了离散的分块选择。通过预测熵上的温度缩放Softmax,为所有分块分配连续权重,实现了跨分块条件分布的log空间聚合。为了部分补偿分块引入的条件独立性假设,我们提出了一致性蒸馏,这是一种基于LoRA的自蒸馏方法,通过KL散度将分块logit分布约束为全上下文教师分布。在LongBench多跳基准测试中,带有一致性蒸馏的Soft-NBCE在NBCE风格基线(MuSiQue F1: 0.310 vs. 0.275(Vanilla NBCE);HotpotQA F1: 0.479 vs. 0.427)上持续改进,同时在O(L^2/n)峰值内存下保持检索精度(NIAH-32K: 0.909)。

英文摘要

The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive Bayes Cognitive Engine (NBCE) parallelizes long-context inference by chunking documents and routing to the lowest-entropy chunk at each decoding step. This hard-selection strategy causes semantic fragmentation during cross-chunk reasoning, as abrupt routing changes between adjacent tokens disrupt the model's contextual grounding. We present Soft-NBCE, a lightweight extension that replaces discrete chunk selection with soft entropy-weighted chunk fusion. A temperature-scaled Softmax over predictive entropies assigns continuous weights to all chunks, enabling log-space aggregation across chunk-conditioned distributions. To partially compensate for the conditional independence assumption introduced by chunking, we propose Consistency Distillation, a LoRA-based self-distillation that constrains the chunked logit distribution toward a full-context teacher via KL-divergence. On LongBench multi-hop benchmarks, Soft-NBCE with Consistency Distillation improves consistently over NBCE-style baselines (MuSiQue F1: 0.310 vs.\ 0.275 for Vanilla NBCE; HotpotQA F1: 0.479 vs.\ 0.427) while maintaining retrieval accuracy (NIAH-32K: 0.909) at O(L^2/n) peak memory.

2606.01099 2026-06-02 cs.CL cs.AI 版本更新

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

MiCU: 基于大语言模型的端到端智能家居指令理解

Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) Xiaomi Corporation(小米公司) Institute for Math & AI, Wuhan University(武汉大学数学与人工智能研究院)

AI总结 提出MiCU,一种利用课程学习、强化学习和令牌压缩技术的领域特定大语言模型,用于解决智能家居中模糊指令理解问题,平均准确率提升20.01%。

详情
AI中文摘要

智能家居生态系统中的指令理解系统可以自动化设备控制并显著改善用户体验。然而,尽管它们在精确表述(例如“打开卧室灯”)上表现良好,但在处理模糊或不一致的指令(例如“让卧室变得舒适”)时却存在困难。大语言模型(LLM)在各种领域都能很好地泛化,并且在此类任务上可以超越传统的基于规则的系统,但其有效性通常受到领域特定数据稀缺、任务特定适应性不足以及高计算成本的限制。在本文中,我们提出了一种利用用户日志和LLM的自动化训练数据合成工作流程;然后构建了MiCU,一个在指令理解方面表现出色的领域特定LLM。具体来说,我们采用课程学习将领域知识注入基础LLM,然后通过冷启动训练结合领域特定思维规则引导的强化学习(RL)来增强其推理能力。此外,我们引入了一种令牌压缩技术,将设备描述压缩为单个特殊令牌,从而显著降低推理开销,并实现了\model-fast,一种针对长输入优化的高效变体。大量实验表明,MiCU显著优于基线,在所有设备类别上平均准确率提升20.01%。我们已在小米家应用中部署了MiCU,每天接收约170万页面浏览量。生产评估显示,MiCU将用户纠正率降低了1.57%,并将人工审核准确率提高了32.05%。我们的数据和代码可在https://github.com/xiaomi-research/iot_spec_llm获取。

英文摘要

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm

2606.01098 2026-06-02 cs.RO cs.AI 版本更新

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

隐式漂移策略:通过条件专家几何实现单步动作生成

Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma

发表机构 * ShanghaiTech University(上海科技大学) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学) Morphi Robot(Morphi机器人)

AI总结 提出隐式漂移策略(IDP),一种单步模仿学习框架,通过条件专家几何隐式引入训练时的漂移校正,无需显式向量场估计,在2D、3D及真实世界操作任务中有效保持有效动作流形,性能优于显式漂移方法并达到强单步基线水平。

详情
AI中文摘要

基于扩散或流匹配的生成动作策略在行为克隆中表现出色,但其迭代采样对于高频机器人控制来说过于耗时。尽管最近的单步公式缓解了这种延迟,但它们不可避免地丢弃了提供关键动作校正的中间轨迹演化。由于条件演示极端稀疏,通过显式估计训练时漂移场直接恢复这一机制在数学上是不适定的。我们提出了隐式漂移策略(IDP),一种单步模仿学习框架,无需显式向量场估计即可将训练时的漂移校正引入策略学习。IDP从观测相似专家动作的局部变化中提取条件专家几何,并将其与全局参考几何进行比较,以分离条件特定的约束。这种局部几何结构自适应地加权一个标量势目标。结合专家近端终端评估,IDP在训练期间直接对单步生成器施加流形约束。在2D、3D和真实世界操作任务上的广泛评估表明,IDP有效保持了对有效动作流形的遵循,优于显式漂移方法,并达到了与强单步基线相当的性能。

英文摘要

Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.

2606.01095 2026-06-02 cs.RO cs.AI 版本更新

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

超越任务成功:WAM 和 VLA 的行为与表征诊断

Hung Mai, Bin Zhu, Tuan Do

发表机构 * National Economics University, Vietnam(越南国家经济大学) Singapore Management University(新加坡管理大学) Phenikaa University, Vietnam(越南Phenikaa大学)

AI总结 本文提出一个模型无关的诊断框架,通过行为分析和基于稀疏自编码器的特征分析,比较世界动作模型(WAM)与视觉-语言-动作(VLA)策略在机器人操作中的行为与表征差异,发现WAM在目标选择和行为改进上优于VLA但计算成本更高,且不同WAM架构对未来信息的编码方式不同。

详情
AI中文摘要

视觉-语言-动作(VLA)策略和世界动作模型(WAM)代表了机器人操作中两种日益重要的范式。然而,尚不清楚WAM中的未来预测是否在最终任务成功之外带来行为上有意义的改进。在本文中,我们探究WAM是否仅仅增加了未来预测,还是以对控制可操作的方式改变了机器人行为和内部表征。我们引入一个模型无关的诊断框架,通过两个互补的视角比较WAM和VLA:行为 rollout 分析和基于稀疏自编码器的特征分析。行为协议测量动作动态一致性、目标物体进展、干扰物干扰和运行时成本。特征空间协议将内部表征表征为记忆型、反应型或预测型,揭示模型是否编码了面向未来的结构。在LIBERO和RoboTwin2.0上,我们评估了7种策略,涵盖直接VLA以及联合、顺序和辅助WAM。我们的结果表明,仅凭成功隐藏了关键差异:WAM通常改善物体级行为和目标选择性,但其收益依赖于架构并导致更高的推理成本。顺序WAM显示出最清晰的预测结构,而辅助和联合WAM分别压缩或纠缠未来信息。这些发现为WAM设计提供了未来方向,以保留行为可操作的未来表征,实现高效操作。

英文摘要

Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAMs design to preserve behaviorally actionable future representations for efficient manipulation.

2606.01094 2026-06-02 cs.AI 版本更新

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

CAREAgent: 具有结构化推理和工具集成的临床智能体用于医嘱生成

Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu, Siran Zhao, Yao Yu, Jie Zhai, Tong Ruan

发表机构 * East China University of Science and Technology, Shanghai, China(东华大学上海科学技术学院) Zhongshan Hospital, Fudan University, Shanghai, China(复旦大学中山医院)

AI总结 提出CAREAgent,通过两阶段推理数据构建和监督微调与强化学习,生成细粒度临床医嘱,在ClinicalBench上F1提升5.05%。

详情
AI中文摘要

临床医嘱生成是临床决策与实际实践之间的关键桥梁,将医疗决策转化为具体可执行的医嘱。现有智能体主要关注粗粒度决策,忽略了临床医嘱所需的细粒度可执行信息。为弥补这一差距,我们提出CAREAgent,一个用于临床医嘱生成的智能体。为支持其训练,我们引入了一种两阶段智能体推理数据构建方法。首先,我们设计了一个智能体框架,构建与真实临床工具使用一致的可验证推理轨迹。其次,我们根据格式合规性、医嘱有效性和临床合理性筛选推理轨迹。基于构建的数据,模型首先通过监督微调训练以获得基本的推理格式和医学知识,随后通过具有多维奖励函数的强化学习进行优化,以增强复杂的临床推理能力。在多个基准上的实验证明了CAREAgent的有效性。在ClinicalBench(训练中未见)上,CAREAgent的F1分数分别比单智能体、多智能体和智能体推理方法提高了5.05%、2.09%和0.86%。

英文摘要

Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.

2606.01092 2026-06-02 cs.LG cs.AI 版本更新

A Fiber Criterion for Representation Identifiability in Supervised Learning

监督学习中表示可辨识性的纤维准则

Vasileios Sevetlidis

发表机构 * Athena Research Center, Kimmeria Campus, Xanthi, Greece(亚特兰大研究中心,基米里亚校区,哈尼亚,希腊) Democritus University of Thrace, Vas. Sofias Campus, Xanthi, Greece(德摩根大学,瓦斯·索菲亚校区,哈尼亚,希腊) International Hellenic University, Serres, Greece(国际希腊大学,塞雷斯,希腊)

AI总结 本文提出纤维准则,通过投影映射的纤维常数性来形式化监督学习中表示-头部分解的可辨识性,并指出仅凭监督预测行为无法唯一确定表示。

详情
AI中文摘要

监督学习通过输入-输出行为评估预测器。当预测器实现为复合函数 $f=c\circ h$ 时,监督证据约束了复合映射 $f$,但未必确定表示-头部因子分解 $(h,c)$。本文形式化了由此产生的表示级可辨识性问题:对于一类可接受的表示-头部对,当且仅当表示属性在投影 $(h,c)\mapsto c\circ h$ 的纤维上为常数时,它可从诱导的预测器中辨识,等价于它下降为预测器的良定义属性。保持预测器的增广给出了一个规范障碍:辅助信息可以附加到表示上而头部忽略它,保持预测器不变但改变诸如极小性、压缩、不变性、等变性、干扰信息或语义可访问性等属性。这种构造将表示可辨识性与优化和有限样本估计分离开来。有限样本诊断说明了而非证明了该准则:精确代数见证在改变表示诊断时保持预测器固定,而匹配性能的Waterbirds模型表明不同约束可以在相似的监督性能下选择不同的表示。结果阐明,表示级声明需要超越监督预测行为本身的假设、目标、测量或归纳偏置。

英文摘要

Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$, supervised evidence constrains the composite map $f$ but need not determine the representation-head factorization $(h,c)$. This paper formalizes the resulting representation-level identifiability problem: for a class of admissible representation-head pairs, a representation property is identifiable from the induced predictor exactly when it is constant on the fibers of the projection $(h,c)\mapsto c\circ h$, equivalently when it descends to a well-defined property of the predictor. Predictor-preserving augmentation gives a canonical obstruction: auxiliary information can be appended to a representation while the head ignores it, leaving the predictor unchanged but altering properties such as minimality, compression, invariance, equivariance, nuisance information, or semantic accessibility. This construction separates representation identifiability from optimization and finite-sample estimation. Finite-sample diagnostics illustrate, rather than prove, the criterion: exact algebraic witnesses hold the predictor fixed while changing representation diagnostics, and matched-performance Waterbirds models show that different constraints can select different representations at similar supervised performance. The results clarify that representation-level claims require assumptions, objectives, measurements, or inductive biases beyond supervised predictive behavior alone.

2606.01086 2026-06-02 cs.LG cs.AI 版本更新

Strong Stochastic Flow Maps

强随机流映射

Sam McCallum, Zander W. Blasingame, Timothy Herschell, Niklas Rindtorff, Alexander Tong, James Foster

发表机构 * University of Bath(巴斯大学) AITHYRA

AI总结 提出强随机流映射(SSFMs)框架,通过学习加性噪声SDE的强解映射,实现扩散模型的免模拟训练和少步采样,在图像生成和分子系统采样中优于现有方法。

Comments Preprint

详情
AI中文摘要

流模型和扩散模型在许多模态中生成高质量样本;然而,由于需要对底层微分方程进行数值积分,推理过程中需要多次网络评估。流映射通过学习微分方程的解映射直接缓解了这一问题,实现了少步采样。然而,当前方法仅限于逼近ODE的解映射。这些方法可用于学习SDE的转移核,从而获得恢复过程边际分布(弱收敛)而非解路径(强收敛)的解映射。我们提出强随机流映射(SSFMs)作为一种新框架,用于学习加性噪声SDE的强解映射,直接将确定性流映射推广到随机设置。此外,引入了布朗运动的多项式逼近,并证明其路径收敛。这些结果为扩散模型的解映射提供了免模拟训练目标。我们证明,SSFMs在图像生成上优于先前的随机流映射方法,并实现了分子系统的少步采样。

英文摘要

Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference due to numerical integration of an underlying differential equation. Flow maps alleviate this problem by learning the solution map of the differential equation directly, enabling few-step sampling. Yet, current methods are restricted to approximating the solution map of ODEs. These methods can be used to learn the transition kernel of an SDE, thereby obtaining a solution map that recovers the marginal distributions of the process (weak convergence) rather than the solution path (strong convergence). We propose Strong Stochastic Flow Maps (SSFMs) as a novel framework for learning the strong solution map of additive-noise SDEs, directly generalizing deterministic flow maps to the stochastic setting. Further, a polynomial approximation to Brownian motion is introduced and shown to converge pathwise. These results enable a simulation-free training objective for the solution map of diffusion models. We demonstrate that SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.

2606.01084 2026-06-02 cs.LG cs.AI 版本更新

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

MViewRouter:通过多视图交替注意力内化组合路由的几何等变性

Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin

发表机构 * Huazhong University of Science and Technology(华中科技大学) Eindhoven University of Technology(埃因霍温理工大学)

AI总结 提出MViewRouter框架,利用多视图交替注意力机制内化几何等变性作为结构归纳偏置,通过集体策略梯度聚合优化,解决组合路由问题中的对称性挑战,在TSP和CVRP上取得竞争性解质量和强零样本泛化。

详情
AI中文摘要

组合路由问题,如旅行商问题(TSP)和带容量约束的车辆路径问题(CVRP),是基础的NP难问题,具有广泛的现实应用。虽然最近的深度强化学习方法显示出有希望的性能,但它们通常仅通过数据增强处理几何对称性,导致决策不一致和泛化能力有限。为了解决这个问题,我们提出了MViewRouter,一个多视图框架,将几何等变性内化为结构归纳偏置,以实现跨路由问题变体的不变决策。我们的方法引入了一种多视图交替注意力(MAA)机制,能够在$D_4$对称群上进行并行处理,在视图内关系建模和视图间特征对齐之间交替进行。此外,我们通过集体策略梯度聚合(CPGA)优化策略,利用来自多个对称视图的共识梯度来稳定训练并加速收敛。在TSP和CVRP基准测试以及真实世界的TSPLIB实例上的实验表明,MViewRouter实现了竞争性的解质量和强大的零样本泛化能力。

英文摘要

Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the $D_4$ symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.

2606.01080 2026-06-02 cs.LG cs.AI 版本更新

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

ThinkSwitch:基于LoRA和权重插值的上下文蒸馏用于特定目的推理任务

Dhruv Saini, Rohan Pandey

发表机构 * bellevue High School(贝尔维尤高中) DigitalOcean

AI总结 提出ThinkSwitch方法,通过QLoRA蒸馏和球面权重插值协同训练指令模型和思考模型,在AIME 2026和PubMedQA上分别提升指令模型10/30→20/30和13/30→18/30,思考模型14/30→22/30和18/30→25/30,仅需15个训练提示和$2.86成本。

详情
AI中文摘要

大型语言模型通常通过在产生最终答案之前花费推理时间计算来改进困难任务。额外的计算可能有用,但也增加了延迟、令牌成本和部署复杂性。我们引入了 extbf{ThinkSwitch},一种低计算量的程序,用于协同训练配对的指令和思考检查点。从兼容的Qwen3-4B指令和思考模型开始,每次迭代要求思考检查点生成答案,移除推理轨迹,通过QLoRA将仅答案对蒸馏到指令检查点,并通过球面权重插值重建思考检查点。唯一的人工输入是任务提示;标签由模型自身生成。在30个问题的AIME 2026评估中,ThinkSwitch将指令检查点从10/30提升到20/30,思考检查点从14/30提升到22/30。在30个问题的PubMedQA子集上,它将指令检查点从13/30提升到18/30,思考检查点从18/30提升到25/30。完整实验每个领域使用15个训练提示,在单个云RTX 3070上花费2.86美元。结果规模较小,但表明有针对性的蒸馏循环可以将显式推理的部分好处转移到权重中,同时保留独立的思考模式。

英文摘要

Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and deployment complexity. We introduce \textbf{ThinkSwitch}, a low-compute procedure for co-training paired instruct and thinking checkpoints. Starting from compatible Qwen3-4B instruct and thinking models, each iteration asks the thinking checkpoint to generate answers, removes the reasoning trace, distills the answer-only pairs into the instruct checkpoint with QLoRA, and reconstructs a thinking checkpoint with spherical weight interpolation. The only human-supplied inputs are task prompts; the labels are generated by the model itself. On a 30-question AIME 2026 evaluation, ThinkSwitch improves the instruct checkpoint from 10/30 to 20/30 and the thinking checkpoint from 14/30 to 22/30. On a 30-question PubMedQA subset, it improves the instruct checkpoint from 13/30 to 18/30 and the thinking checkpoint from 18/30 to 25/30. The complete experiment uses 15 training prompts per domain and costs \$2.86 on a single cloud RTX 3070. The results are small-scale, but they indicate that targeted distillation loops can move part of the benefit of explicit reasoning into weights while preserving a separate thinking mode.

2606.01070 2026-06-02 cs.IR cs.AI cs.LG 版本更新

Test-Time Training for Zero-Resource Dense Retrieval Reranking

零资源稠密检索重排的测试时训练

Shiyan Liu, Yichen Li

发表机构 * Huazhong University of Science and Technology(华中科技大学) ByteDance(字节跳动)

AI总结 提出 DART 方法,通过测试时自适应双线性评分矩阵,利用伪正负样本进行少量梯度更新,在零资源下提升稠密检索重排性能。

Comments Accepted at KnowFM @ ACL 2026

详情
AI中文摘要

稠密检索器在第一阶段候选生成中表现出色,但在零资源设置下缺乏有效的重排能力。现有方法面临根本性困境:交叉编码器重排质量高,但需要昂贵的监督训练且延迟高,而无监督的 BM25 重排在大多数 BEIR 基准上持续降低稠密检索性能。我们提出 DART(测试时稠密自适应重排),通过在推理时自适应评分函数来解决这一困境。对于每个查询,排名靠前的文档作为伪正例,排名靠后的作为伪负例,提供噪声但可用的监督信号,通过少量梯度更新来适应双线性评分矩阵 $W$。我们进一步引入置信加权边际损失和跨查询动量缓冲区,以预热跨查询的适应过程。在六个 BEIR 基准上,DART 相对于稠密检索基线实现了平均每个数据集 NDCG@10 相对提升 +2.1%,且每个查询额外延迟低于 10ms,展示了强大的零样本性能提升和跨领域泛化能力。

英文摘要

Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most of BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix $W$ via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.

2606.01066 2026-06-02 cs.AI 版本更新

Before the Model Learns the Bug:Fuzzing RLVR Verifiers

在模型学会Bug之前:模糊测试RLVR验证器

Jaideep Ray

发表机构 * ACM

AI总结 本文提出一个轻量级验证器模糊测试框架,通过生成对抗性补全、比较有缺陷与严格的参考验证器,并报告多种指标,以研究RLVR中验证器错误导致优化学习Bug的失败模式。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)用可执行的奖励函数(如数学答案检查器、JSON工具调用验证器和代码单元测试框架)替代人类偏好标签。这使得奖励部分成为软件制品:如果验证器出错,优化可能会学习到Bug。我们通过一个轻量级验证器模糊测试框架研究这种失败模式,该框架生成对抗性补全,比较有缺陷和更严格的参考验证器,记录配对决策,并报告假阳性、假阴性、不一致、利用和不确定性指标。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.

2606.01065 2026-06-02 cs.DC cs.AI cs.LG 版本更新

Leyline: KV Cache Directives for Agentic Inference

Leyline:用于智能推理的 KV 缓存指令

Bole Ma, Jan Eitzinger, Harald Koestler

发表机构 * Erlangen National High Performance Computing Center(埃朗根国家高性能计算中心)

AI总结 针对智能体 LLM 中策略驱动的缓存编辑需求,提出 Leyline 服务端原语,通过声明式指令四元组和架构无关接口实现缓存拼接与截断,提升缓存命中率和求解率。

详情
AI中文摘要

现代 KV 缓存管理假设聊天机器人工作负载:提示一次性到达,缓存仅追加增长,因此前缀缓存和仅向前驱逐在构造上是正确的。智能体 LLM 打破了这一假设。它们的对话通过策略驱动的编辑演变:失败的工具调用被重试,过时的输出被丢弃,轨迹被转向。这导致两个不同的缓存问题。首先,相同的内容在轮次之间移动到新位置,使得精确前缀缓存失效,尽管底层 KV 仍然有效;最近针对 MLA 的位置无关缓存工作解决了这个重用问题。其次,也是本文的重点,策略可能需要指示服务系统主动移除或替换一段缓存内容,并继续而不重新预填充之后的所有内容。没有现有的原语提供此功能。生产智能体框架退回到每次编辑时重新预填充,支付完整的前缀重新计算成本;内核级驱逐方法自行决策,无法接受来自内核外部的策略指令。我们引入 Leyline,一个弥补这一差距的服务端原语。一个声明式指令四元组将编辑内容与保持位置正确性分离。策略声明编辑及其模式(原地拼接或前缀修剪的重新预填充以实现语义遗忘);一个架构无关的接口路由到每个架构的内核,通过闭式 RoPE 旋转校正恢复注意力计算。拼接内核将重放缓存命中率提高 11.2 个百分点,并将延迟降低最多 241 毫秒。通过同一接口路由的十行截断规则将 debug-gym 上的智能体求解率提高 14.3 个百分点。该机制是开放的;它启用的策略空间是未来的议程。

英文摘要

Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conversations evolve through policy-driven editing: failed tool calls are retried, stale outputs dropped, trajectories pivoted. Two distinct cache problems result. First, identical content moves to new positions between turns, invalidating exact-prefix caches even though the underlying KV would still be valid; recent work on position-independent caching for MLA addresses this reuse problem. Second, and this paper's focus, a policy may need to direct the serving system to actively remove or replace a span of cached content and continue without re-prefilling everything that came after. No existing primitive offers this. Production agentic harnesses fall back to re-prefill on every edit, paying full prefix-recomputation cost; kernel-level eviction methods make their own decisions and cannot accept policy directives from outside the kernel. We introduce Leyline, a serving-side primitive that closes this gap. A declarative directive 4-tuple separates what to edit from how to preserve position correctness. The policy declares the edit and its mode (in-place splice or prefix-trimmed re-prefill for semantic forgetting); an architecture-agnostic interface routes to a per-architecture kernel that restores attention math via a closed-form RoPE-rotation correction. The splice kernel lifts replay cache-hit by +11.2 pp and cuts latency by up to 241 ms. A ten-line truncation rule routed through the same interface lifts agentic solve rate by +14.3 pp on debug-gym. The mechanism is open; the policy space it enables is the agenda.

2606.01063 2026-06-02 cs.AI 版本更新

MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

MindClaw: 用于精确干预的闭环具身心理状态推理

Ruoxuan Zhang, Qiaoqiao Wan, Zhengguang Wang, Chenghao Yu, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

发表机构 * Jilin University(吉林大学) Microsoft Asia(微软亚洲) National Taiwan University(国立台湾大学)

AI总结 提出MindClaw框架,通过闭环具身心理状态推理实现精确干预,结合多源输入、信念记忆、认知触发技能和动作生成,在动态环境中优化干预时机。

Comments Extended version of the CVPR 2026 paper *MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents*

详情
AI中文摘要

心理理论使智能体能够推理他人的信念、目标和意图,这对于以人为中心的具身辅助至关重要。现有的心理理论基准推动了文本和多模态心理状态识别的发展,但主要评估离线问答或最终动作预测。它们并未充分测试具身智能体是否能够与变化的环境保持连接、更新特定于主体的信念、决定何时需要推理,以及仅在帮助有用时进行干预。基于MindPower,我们将以机器人为中心的心理理论推理扩展到实时闭环设置,并引入MindClaw,一个用于具身心理状态推理和精确干预的框架。MindClaw连接多源输入、信念记忆、具身认知触发技能、心理推理和动作生成,使智能体能够在正确的时间输出有用的动作,同时在不需要干预时保持沉默。实验表明,直接的VLM基线在任务意识和干预校准方面存在困难,而MindClaw实现了最佳的整体性能,证明了触发技能优化对于闭环具身心理理论辅助的重要性。

英文摘要

Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human-centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental-state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor-specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot-centric ToM reasoning to a real-time closed-loop setting and introduce MindClaw, a framework for embodied mental-state reasoning with precision intervention. MindClaw connects multi-source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger-skill optimization for closed-loop embodied ToM assistance.

2606.01062 2026-06-02 cs.AI 版本更新

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

DAG-MoE:从简单混合到专家混合中的结构聚合

Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学) Nanyang Technological University(南洋理工大学) University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 本文提出DAG-MoE框架,通过轻量级模块自动学习选定专家之间的最优聚合结构,以替代标准加权求和聚合,从而在不改变专家或路由器的情况下扩展专家组合空间并实现单层多步推理,在预训练和微调中均优于传统MoE基线。

Comments Accepted by ICML 2026

详情
AI中文摘要

混合专家(MoE)模型已成为在大型语言模型中解耦参数数量与计算成本的主流方法,但有效扩展MoE性能仍是一个挑战。先前的工作表明,细粒度专家扩大了专家组合空间并提高了灵活性,但它们也带来了大量的路由开销,造成了新的可扩展性瓶颈。在本文中,我们探索了扩展的互补轴——专家输出的聚合方式。我们从理论上证明,用结构聚合替代标准加权求和聚合可以在不改变专家或路由器的情况下扩展专家组合空间,并使得在单个MoE层内实现多步推理成为可能。为此,我们提出了DAG-MoE,一个稀疏MoE框架,它采用轻量级模块自动学习所选专家之间的最优聚合结构。在标准语言建模设置下的大量实验表明,DAG-MoE在预训练和微调中均持续提升了性能,超越了传统的MoE基线。

英文摘要

Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling -- how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.

2606.01057 2026-06-02 cs.CV cs.AI cs.GR cs.LG 版本更新

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

3DCodeBench:通过代码进行智能体程序化3D建模的基准测试

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia, Meiqi Guo, Laurent Itti, Jindong Chen

发表机构 * Google DeepMind(谷歌DeepMind) University of Southern California(南加州大学) Google Research(谷歌研究)

AI总结 提出3DCodeBench基准,评估12种视觉语言模型将文本和图像参考转换为程序化3D建模代码的能力,并构建基于人类偏好的3DCodeArena排名平台。

Comments Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix

详情
AI中文摘要

通过代码进行程序化3D建模正成为一种通用的范式,提供确定性、引擎就绪且可精确编辑的资产,而神经3D生成器天生缺乏这些特性。然而,编写此类程序化内容需要深厚的3D软件API、参数化设计和代码级几何推理专业知识。在本文中,我们提出了3DCodeBench,一个系统性的基准,用于评估3D建模软件中用于程序化3D生成的视觉语言模型(VLM)智能体。具体来说,3DCodeBench评估了12种先进VLM如何有效地充当程序化3D建模器,将文本和图像参考转换为3D建模软件的程序化代码。认识到自动度量可能无法完全捕捉3D形状的感知质量,我们构建了3DCodeArena,一个基于成对人类偏好对生成的3D输出进行排名的平台。通过广泛的评估和结果,我们观察到:(1)失败主要源于API不匹配,而成功渲染的模型仍然存在断开或浮动的3D几何组件。(2)测试时扩展,如更高的思考预算和多轮细化,总体上提高了性能。我们的发现突显了对高质量程序化编码数据以推进商业VLM的迫切需求。此外,有效的程序化3D建模需要一个强大的执行环境,为迭代细化提供高保真反馈。我们发布了3DCodeBench,包括精心策划的大规模多模态(文本/图像)提示数据集、程序化代码、3D对象三元组、评估协议以及公共3DCodeArena平台,作为探索基于VLM的程序化3D建模器的基础工具包。

英文摘要

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

2606.01053 2026-06-02 cs.AI 版本更新

AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise

AnyEdit++: 基于贝叶斯惊讶的自适应长文本知识编辑

Bowen Tian, Caixue He, Jiemin Wu, Jingying Wang, Wenshuo Chen, Zexi Li, Yutao Yue

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AnyEdit++框架,通过基于贝叶斯惊讶的自适应分割机制Bayes-Chunk,实现结构感知的长文本知识编辑,在数学推理、代码生成和叙事任务上优于现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

在大语言模型中编辑复杂的长文本知识仍然是一个重大挑战,因为难以保持生成的连贯性。现有的自回归方法(如AnyEdit)缓解了长度限制,但依赖于固定窗口分块,忽略了逻辑结构并损害了一致性。为了解决这个问题,我们提出了AnyEdit++,一个结构感知的框架,其中包含Bayes-Chunk,这是一种基于贝叶斯惊讶动态识别语义边界的自适应分割机制。我们通过一个理论框架支撑这种方法,确立了三个关键原则:(1)结构独立性:我们证明了当锚键在几何上正交时(我们的基于惊讶度的边界自然满足这一条件,而固定窗口则违反),跨段干扰最小化;(2)因果局部性:我们证明了在这些语义峰值处注入的更新相比任意分割点具有严格更优的控制。在数学推理、代码生成和叙事任务上的大量实验表明,AnyEdit++相比最先进的基线取得了更优的性能和鲁棒性,验证了结构感知对于有效的长文本知识编辑至关重要。

英文摘要

Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.

2606.01046 2026-06-02 cs.AI 版本更新

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

TravelEval:评估基于LLM的旅行规划代理的综合基准框架

Weiyi Chen, Shuaixiong Wang, Ziyun Gao, Kaichun Hu, Wangze Ni, Shimin Di, Chen Jason Zhang, Lei Chen

发表机构 * Zhejiang University(浙江大学) Hong Kong Polytechnic University(香港理工大学) Southeast University(东南大学) HKUST (GZ) & HKUST Guangzhou(香港科技大学(广州)& 香港科技大学(广州))

AI总结 针对现有基准过度关注约束合规、缺乏真实性和多维评估的问题,提出TravelEval,通过六维评估框架、真实数据沙盒和模拟全局评估方法,全面评估LLM在旅行规划中的表现。

Comments 31pages, 8 figures, accepted by KDD 2026

详情
AI中文摘要

大型语言模型(LLM)的发展显著提升了旅行规划应用,但现有基准的局限性限制了对其评估:1)过度强调约束合规,忽视时空成本等多维质量;2)数据集缺乏真实世界真实性和关键领域(如住宿、交通)的覆盖;3)孤立的每日计划评估遗漏了评估整个计划所需的关键细节(例如每日住宿和参观节奏的影响)。为解决这一差距,我们引入了TravelEval,一个真实且全面的基准。TravelEval具有1)一个新颖的六维评估框架,从准确性、合规性、时间性、空间性、经济性和实用性维度全面评估计划;2)一个高度真实的数据沙盒,包含精确的住宿定价和真实的城际交通数据;3)一种基于模拟的全局评估方法,通过集成API的地理信息和细粒度排队时间模拟完整的旅行计划。使用TravelEval评估12种主流方法揭示了若干有价值的见解,例如LLM在全局优化的多维规划(特别是时空推理和预算合规)方面存在困难,而代理推理策略并未提供一致的改进。简而言之,TravelEval通过基于现实的时空模拟和全面指标促进旅行计划评估,为推进基于LLM的旅行规划研究和应用提供了坚实基础。

英文摘要

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks' limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed for entire plan's evaluation. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and agentic reasoning strategies offer no consistent improvement. Concisely, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

2606.01042 2026-06-02 cs.LG cs.AI 版本更新

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

似真性不是预测:基于LLM的细胞扰动推理的对比证据

Xinyu Yuan, Xixian Liu, Jianan Zhao, Yashi Zhang, Hongyu Guo, Jian Tang

发表机构 * Mila - Québec AI Institute(魁北克人工智能研究所) University of Montréal(蒙特利尔大学) HEC Montréal(蒙特利尔HEC商学院) University of Ottawa(渥太华大学) National Research Council of Canada(加拿大国家研究理事会) CIFAR AI Chair(CIFAR人工智能 chair)

AI总结 本文发现基于大语言模型的细胞扰动推理虽能生成生物上合理的解释,但实际预测性能差,并提出CORE方法通过对比证据组织来提升扰动特异性预测。

详情
AI中文摘要

扰动实验对于理解细胞机制至关重要,但成本高昂且稀疏,因此需要预测未观察条件下的基因表达响应。最近一个有前景的方向是利用大语言模型(LLM)作为“虚拟细胞”模拟器——通过逐步的、基于知识的机械推理来推断差异表达——指向一种可解释的、知识驱动的范式,超越了纯粹的数据驱动方法。然而,我们发现似真性不是预测:尽管产生了生物上合理的解释,这些方法未能捕捉扰动特异性效应:系统性地高估差异表达,在聚合评估中通常表现不如简单的基因频率基线,并且在每个基因水平上降至随机水平。这揭示了对内在基因响应倾向的依赖,而非真正的扰动推理。我们将这一失败追溯到证据呈现方式:现有方法孤立地评估扰动-基因对,而不揭示相关扰动对同一基因的影响差异。为解决这一局限性,我们引入了CORE(对比关系证据组织),通过将证据组织成来自相关扰动的正面和负面结果,将预测重新定义为比较任务。使用生物医学知识图谱进行证据检索,CORE在基于LLM和非LLM的设置中均改善了校准并大幅提升了扰动特异性预测:例如,在药物扰动数据上,CORE-Reasoning将Qwen3.5-9B的聚合指标提升了高达28.6%;而在通用扰动数据上,CORE-Voting将四个细胞系的每个基因平均AUROC从随机水平提高到0.703。这突显了对比证据组织对于可靠的基于LLM的扰动推理至关重要。

英文摘要

Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as "virtual cell" simulators-using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression-pointing toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects: systematically overestimating differential expression, often underperforming a simple gene-frequency baseline in aggregate evaluations, and collapsing to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 in average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning

2606.01039 2026-06-02 cs.LG cs.AI 版本更新

OPD+: Rethinking the Advantage Design for On-Policy Distillation

OPD+: 重新思考在线策略蒸馏中的优势设计

Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

发表机构 * Columbia University(哥伦比亚大学) Amazon(亚马逊) Meta Capital One

AI总结 本文提出OPD+,通过修正在线策略蒸馏中因停止梯度操作导致的奖励目标偏差,并支持多种f-散度,在数学推理和工具使用基准上提升了性能。

详情
AI中文摘要

在线策略蒸馏(OPD)是一种广泛使用的技术,用于将能力强的教师语言模型的能力迁移到基础学生模型,并且可以通过使用学生生成的轨迹来制定强化学习风格的目标。然而,尽管散度奖励依赖于学生模型的可能性,现有工作通常采用停止梯度设计主要是为了稳定性,这使得得到的优势估计存在问题。在这项工作中,我们提供了一个基于学生和教师之间f-散度的通用优化框架,并从数学上重新审视这种设计空间是否有效。我们证明,对于一般的散度函数,一般的停止梯度操作会导致奖励目标和相应梯度的有偏估计。我们提出了OPD+,这是OPD的修正版本,在基线KL方法上展示了改进的性能,并且也支持各种f-散度的选择。我们在数学推理和工具使用基准上验证了我们的发现。

英文摘要

On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student generated rollouts. Yet, despite the divergence reward being dependent on student model likelihood, existing works usually adopt a stop gradient design primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on f-divergence between the student and teacher, and mathematically revisit whether such design space is valid. We prove that general stop-gradient operation would lead to biased estimates of the reward objective and corresponding gradient for general divergence functions. We propose OPD+, the corrected version of OPD that demonstrates improved performance over the baseline KL approach and also supports the choice of various f-divergence. We validate our findings on mathematical reasoning and tool-use benchmarks.

2606.01033 2026-06-02 cs.AI 版本更新

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

TriLens: 基于逐层Logit-Lens熵的白盒幻觉检测

Bohan Yang, Yijun Gong, Zhi Zhang, Ge Zhang, Wenpeng Xing, Meng Han

发表机构 * Binjiang Institute of Zhejiang University(浙江大学滨海学院) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港 Baptist大学) Zhejiang University(浙江大学) GenTel.io Great Bay University(Great Bay大学)

AI总结 提出TriLens方法,通过在每个Transformer层读取多头自注意力、前馈网络和残差流的logit-lens输出熵,构建紧凑的3L维轨迹,有效检测大语言模型幻觉。

详情
AI中文摘要

当语言模型产生幻觉时,最终答案是错误的,但错误在模型内部并非不可见。不同的内部路径可能保持不确定,在锐化速度上不一致,或在输出产生前承诺相互竞争的延续。我们提出TriLens,一种白盒检测器,将这一直觉转化为紧凑表示:在每一层,它通过模型自身的logit透镜读取多头自注意力输出、前馈输出和残差流,然后仅记录每个读出的熵。得到的3L维轨迹描述了确定性如何跨深度和跨模块形成,无需存储高维隐藏状态或采样多个生成。这一简单信号在指令微调LLM和QA基准测试中产生了强大的检测器,我们的分析表明,三个模块的熵轨迹提供了互补证据。TriLens表明,幻觉检测可以从跟踪内部计算如何稳定中受益,而不仅仅是最终层的预测。

英文摘要

When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different internal pathways may remain uncertain, disagree in how quickly they sharpen, or commit to competing continuations before the output is produced. We introduce TriLens, a white-box detector that turns this intuition into a compact representation: at every layer, it reads the multi-head self-attention output, the feed-forward output, and the residual stream through the model's own logit lens, then records only the entropy of each readout. The resulting 3L-dimensional trajectory describes how certainty forms across depth and across modules, without storing high-dimensional hidden states or sampling multiple generations. This simple signal yields a strong detector across instruction-tuned LLMs and QA benchmarks, and our analyses show that the three module-wise entropy trajectories provide complementary evidence. TriLens suggests that hallucination detection can benefit from tracking how internal computation settles, not only what the final layer predicts.

2606.01031 2026-06-02 cs.GR cs.AI cs.CV cs.LG cs.MM 版本更新

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

音频驱动说话头生成的时序对齐评估

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)(新南威尔士大学商学院) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与环境学院) Data61/CSIRO(Data61/澳大利亚国家科学委员会)

AI总结 针对现有帧级评估指标对时序偏差敏感的问题,提出基于软动态时间规整的序列级对齐评估框架,提升评估鲁棒性并揭示不同建模范式间的系统权衡。

Comments Research report

详情
AI中文摘要

音频驱动的说话头生成技术发展迅速,但现有评估协议主要依赖帧级指标,假设生成视频与参考视频之间存在严格的时间对应关系。这一假设与语音驱动的面部运动不符,后者自然包含轻微的时间偏移、不同的说话速度和风格变化。因此,传统指标可能将无害的时间差异视为质量错误,使得公平比较方法并理解其权衡变得更加困难。在这项工作中,我们认为动态生成模型的评估应被表述为序列对齐问题,而非独立的帧比较。我们引入了一种统一的序列级重新表述,将软动态时间规整集成到已有的评估流程中。通过在对齐特征轨迹的同时保持时间顺序,所提出的框架对有限的时间错位具有鲁棒性,且不改变底层的感知、身份或同步编码器。我们表明,在刚性对齐下,帧级评估可被视为一个特例,而序列级对齐提供了更好的稳定性、对时间差异的更低敏感性以及建模范式之间更清晰的区分。基于这一原则性表述,我们在标准化协议下,对涵盖规范、野外和风格多样场景的七个数据集上的20种方法进行了大规模基准测试。大量实验表明,时序对齐的指标对时间差异更鲁棒,跨数据集提供更一致的结果,并能更好地揭示建模范式之间的系统权衡,例如同步性与真实性、表现力与稳定性之间的权衡。

英文摘要

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.

2606.01024 2026-06-02 cs.CL cs.AI 版本更新

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

DSL-LLaDA: 将连续去噪扩展到8B掩码扩散语言模型

Longxuan Yu, Yunshu Wu, Yu Fu, Siheng Xiong, Rob Brekelmans, Hui Liu, Yue Dong, Greg Ver Steeg

发表机构 * University of California, Riverside(加州大学河滨分校) Georgia Institute of Technology(佐治亚理工学院) Microsoft(微软)

AI总结 通过离散随机定位(DSL)将预训练掩码扩散语言模型(LLaDA-8B-Instruct)轻量适配为支持连续嵌入空间去噪,在低步数下实现高质量摘要生成并避免长度-质量权衡。

Comments 8 pages, 4 figures, 28 tables

详情
AI中文摘要

离散掩码扩散语言模型通过迭代并行解码生成文本,但少步解码面临长度与质量之间的权衡:在固定步数预算下,标准方法可以生成短而高质量的输出,或者产生长但重复的文本。连续去噪可以通过在嵌入空间中联合演化所有位置来规避这种权衡,但从头开始构建这样的模型仍是一个开放问题。我们证明,预训练的掩码DLM可以轻量适配以支持连续嵌入空间去噪。从LLaDA-8B-Instruct开始,我们仅用1,000步进行离散随机定位(DSL)的继续预训练,将二元掩码替换为连续的逐token高斯噪声作为软掩码。适配后的模型支持连续推理,在嵌入空间中联合演化所有位置,并将硬token承诺推迟到最后一步。在低步数预算(<=16次前向传播)下的零样本摘要任务中,DSL-LLaDA-SDE在所有四个基准上取得了最佳ROUGE-1,并很大程度上避免了迭代去掩码的过早终止/重复权衡。同样的适配还产生了选择性噪声状态鲁棒性:模型在保留干净token的同时纠正损坏的token。使用相同计算量的标准掩码扩散训练对照实验未表现出这两种行为。

英文摘要

Discrete Masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue-pretrain for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (<=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.

2606.01022 2026-06-02 cs.CV cs.AI 版本更新

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

ProductWebGen: 多模态产品网页生成基准测试

Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng

发表机构 * School of Computer Science & Zhiyuan College(计算机科学学院及智远学院) Shanghai Jiao Tong University(上海交通大学) Kuaishou Technology(快手科技)

AI总结 提出ProductWebGen基准,用于评估多模态生成模型从产品图像和指令生成一致产品展示网页的能力,并比较了基于编辑和基于统一模型两种工作流。

Comments Accepted by KDD 2026

详情
AI中文摘要

从源产品图像以及布局和视觉内容指令中制作产品展示网页,对于营销、广告和电子商务等领域具有重要的实用价值。直观上,该任务要求产品展示之间严格的视觉一致性以及高保真度的指令遵循,以联合生成可渲染的HTML代码。这些对可控性和指令遵循的要求与先进多模态生成模型(如图像编辑模型和统一模型)的核心特征紧密一致。为此,本文引入ProductWebGen来系统性地基准测试这些模型的产品网页生成能力。我们组织了包含500个测试样本的ProductWebGen,涵盖13个产品类别;每个样本由源图像、视觉内容指令和网页指令组成。任务是根据源图像和指令生成包含多个一致图像的产品展示网页。鉴于任务的混合模态输入输出性质,我们设计并系统比较了两种评估工作流——一种使用大语言模型和图像编辑模型分别生成HTML代码和图像(基于编辑),另一种依赖单个统一模型生成两者,其中图像生成依赖于先前的多模态上下文(基于统一模型)。实验结果表明,基于编辑的方法在网页指令遵循和内容吸引力方面取得领先结果,而基于统一模型的方法在满足视觉内容指令方面可能展现出更多优势。我们还构建了一个监督微调数据集ProductWebGen-1k,包含1000组真实产品图像和LLM生成的HTML代码。我们在开源统一模型BAGEL上验证了其有效性。数据和代码可在https://github.com/SJTU-DENG-Lab/ProductWebGen获取。

英文摘要

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models. To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation -- one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.

2606.01020 2026-06-02 cs.AI cs.LG 版本更新

Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation

通过苏格拉底式提问和批判性论证教授外行人逻辑谬误,以应对错误信息的根源

Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury, Mrinmaya Sachan

发表机构 * ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心)

AI总结 提出LFTutor智能辅导系统,利用大语言模型结合苏格拉底式提问和批判性论证原则,帮助外行人学习识别逻辑谬误,显著优于基线模型。

Comments This paper has been accepted to Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Long Paper), Main Conference

详情
Journal ref
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026
AI中文摘要

识别日常话语中的逻辑谬误对许多人来说具有挑战性。这一挑战在大语言模型(LLMs)时代被放大,恶意行为者可以利用谬误论证大规模传播错误信息。在这项工作中,我们探索了LLMs作为解决方案一部分的潜力。我们介绍了LFTutor,一个智能辅导系统,它使用LLMs辅导外行人,帮助他们学习逻辑谬误。LFTutor整合了意图驱动的苏格拉底式提问和批判性论证原则,以积极引导学习者反思自己的推理。通过自动评估和人工评估,我们证明LFTutor显著优于缺乏这些教学策略的基线LLMs。这项工作突显了将LLMs与教学支架相结合以在人工智能时代培养批判性思维和论证素养的前景。

英文摘要

Identifying logical fallacies in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Language Models (LLMs), where malicious agents can deploy fallacious arguments to disseminate misinformation at scale. In this work, we explore the potential of LLMs as part of the solution. We introduce LFTutor, an intelligent tutoring system which uses LLMs to tutor laypeople and help them learn about logical fallacies. LFTutor integrates intent-driven Socratic questioning and critical argumentation principles to actively engage learners to reflect on their reasoning. Through both automatic and human evaluations, we demonstrate that LFTutor significantly outperforms baseline LLMs lacking these pedagogical strategies. This work highlights the promise of combining LLMs with pedagogical scaffolding to foster critical thinking and argument literacy in the age of AI.

2606.01019 2026-06-02 cs.CL cs.AI 版本更新

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

混合验证解码:在推测解码中学习分配验证

Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard

发表机构 * Thoughtworks Nvidia

AI总结 提出混合验证解码方法,通过预测缓存草稿的接受长度并在缓存验证与模型草稿之间动态选择,在代理工作流中平均加速2.73倍。

详情
AI中文摘要

大型语言模型(LLM)生成仍然昂贵,因为自回归解码每生成一个新token就调用一次模型。推测解码通过草拟多个token并用目标模型一步验证来降低成本,但其加速取决于接受的草稿token数量。无参数草稿源可以在结构化和代理工作负载中以低成本提出长续写,但一个生成步骤中看起来有前景的缓存匹配可能在下一步收益很低。我们提出混合验证解码,在验证前预测缓存草稿的接受长度,并使用该收益估计在缓存验证和基于模型的草稿器之间进行选择。在三个LLM和十六个数据集上,混合验证解码在代理工作流中特别有效,在每个设置中均优于EAGLE3,平均加速2.73倍。我们的分析揭示了提示结构如何创造缓存机会,高收益缓存草稿如何集中在草稿空间的一小部分,以及收益引导的选择如何减少顺序解码工作,指向运行时草稿选择作为推测解码的一个有前景的方向。

英文摘要

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.

2606.01016 2026-06-02 cs.CL cs.AI eess.AS 版本更新

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

PolySpeech-100:面向100多种语言和方言的大规模语音理解基准

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

发表机构 * Shenzhen International Graduate School, Tsinghua University(深圳国际研究生院,清华大学) Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) JD AI Research(京东人工智能研究院)

AI总结 为解决现有语音评估基准在资源丰富语言偏向、缺乏语义推理和忽视方言的问题,提出PolySpeech-100基准,通过混合构建管道覆盖110种语言变体,并评估22个模型,发现开源端到端模型在重方言上优于级联系统,而思维链提示在零样本设置下会降低性能。

Comments 19 pages, 13 figures, KDD 2026

详情
AI中文摘要

虽然端到端(E2E)语音大语言模型(Speech-LLMs)正在快速发展,但它们的评估方法仍局限于简单转录的时代。现有基准存在三个关键限制:明显偏向高资源语言、关注低级识别(ASR)而非语义推理,以及忽视区域方言。为弥补这一差距,我们引入了PolySpeech-100,这是一个大规模基准,旨在评估110种语言变体上的“母语级”语音理解。我们采用了一种新颖的混合构建管道,将黄金标准的人类录音与指令驱动的合成语音相结合,从而覆盖了19种不同的中文方言和80多种低资源语言。对22个最先进模型(包括Gemini-3、GPT-Audio和Qwen2.5-Omni)的广泛评估得出了关键见解。首先,我们证明开源端到端模型在重方言上优于级联(ASR+LLM)系统,证明直接音频处理保留了标准转录中经常丢失的关键副语言线索和韵律特征(例如语调、重音)。其次,我们揭示了一个显著的性能差距:虽然商业模型保持稳健,但开源模型在低资源语言上遭受灾难性退化。最后,反直觉的是,我们观察到在标准零样本设置下,思维链提示经常降低大多数评估模型的语音理解性能,揭示了当前架构中潜在的多模态对齐差距。PolySpeech-100为下一代包容性、全能的语音LLM建立了严格标准。数据、演示和代码公开于https://github.com/YoungSeng/PolySpeech-100。

英文摘要

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

2606.01015 2026-06-02 cs.RO cs.AI cs.NI cs.SY eess.SY 版本更新

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

AI-IoT-机器人集成:框架、新兴趋势及迈向互联机器人的路径综述

Ranulfo Bezerra, Satoshi Tadokoro, Kazunori Ohno

发表机构 * Tohoku University(东大大学)

AI总结 本文综述了人工智能、物联网和机器人三者融合的现状,提出了模块化系统架构,并强调了小语言模型(SLM)和大型语言模型(LLM)在分布式认知与自主决策中的作用,为下一代互联机器人和物理AI生态系统提供了概念和技术路线图。

Comments 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal

详情
Journal ref
IEEE Internet of Things Journal, vol. 13, no. 10, pp. 20398-20412, 15 May15, 2026
AI中文摘要

人工智能、物联网和机器人的融合不再是未来的愿景;它正迅速成为实时、智能和上下文感知系统的基础。AI实现感知和推理,IoT提供可扩展的感知和通信,而机器人则提供具身驱动。尽管在AIoT和物联网机器人(IoRT)等两两组合方面取得了显著进展,但仍缺乏完全整合这三者的统一设计框架。本综述综合了这些领域的最新进展,强调了边缘端的小语言模型(SLM)和云端的大型语言模型(LLM)在分布式认知和自主决策中的新兴作用。我们提出了一个符合这些趋势的模块化系统架构,分析了互操作性和反馈控制中存在的持续差距,并根据集成深度对现有工作进行了分类。我们的综述强调了混合SLM-LLM系统与IoT基础设施和机器人代理相结合时,如何应对实时适应、可扩展性和可靠性方面的挑战。这项工作为设计模块化、可解释且能够在动态环境中学习的下一代AI-IoT-机器人生态系统提供了概念和技术路线图,为新兴的互联机器人和物理AI范式铺平了道路。

英文摘要

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.

2606.01014 2026-06-02 cs.CV cs.AI 版本更新

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

基于文本的三维人体运动编辑中的跨轴特征融合与关节运动差异预测

Gyojin Han, Junmo Kim

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

AI总结 提出一种跨轴特征融合架构和辅助任务,通过联合锚定变换器预测关节运动差异,实现文本驱动的三维人体运动编辑,在MotionFix数据集上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

我们研究基于文本的三维人体运动编辑,目标是保留源运动的风格和结构,同时应用自然语言描述的编辑。MotionFix数据集的发布推动了基于训练扩散模型的直接生成编辑运动的研究,这些模型从源运动和文本指令生成编辑运动。虽然先前的工作主要关注学习编辑在时间上何时发生,但我们的目标是创建一个不仅理解时间方面,还理解哪些特定关节负责变化的模型。为此,我们提出了一种新颖的架构和一个互补的辅助任务来辅助其训练。我们的架构由两个轴锚定变换器组成,分别沿关节和时间维度提取不同特征,以及一个跨轴融合块来整合这些表示。我们进一步引入一个辅助任务,训练关节锚定变换器回归源和目标关节旋转之间的Soft-DTW距离。该目标教会模块理解哪些关节需要修改,哪些需要保留。通过在MotionFix数据集上的全面实验,我们证明我们的方法显著提高了与文本指令和源运动的语义对齐,以及生成运动的整体保真度,达到了最先进的结果。

英文摘要

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.

2606.01012 2026-06-02 cs.AI cond-mat.mtrl-sci 版本更新

Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach

堆叠双层材料的性质预测:一种多模态学习方法

An Vuong, Minh-Hao Van, Chen Zhao, Xintao Wu

发表机构 * University of Arkansas(亚拉巴马大学) Baylor University(贝勒大学)

AI总结 提出一种多模态学习方法,通过联合建模不同材料层间的界面,预测给定配置下垂直堆叠产生的性质,实验证明其有效性和高效性。

Comments Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

详情
AI中文摘要

AI for materials science 是 AI for science 中的一个关键主题,旨在加速材料发现并产生准确的性质预测。双层二维材料堆叠对于探索具有新功能和内在现象的新材料至关重要,能够创建用于各种实际应用的新型二维双层材料。从实验和计算角度对双层 vdWs 材料的研究已取得显著进展。多种双层材料已通过实验成功合成,并且高通量计算技术的日益普及构建了几个计算二维材料数据库。然而,利用 AI 对双层堆叠进行建模并预测新性质的研究仍不充分,需要进一步研究。在这项工作中,我们提出了一种新颖的多模态学习方法,用于研究不同材料之间的界面,这些界面共同实现新的或多种功能,并预测在给定配置下不同功能材料层垂直集成(堆叠)产生的新性质。综合实验证明了我们方法相对于基线方法的有效性和高效性。我们的代码可在 https://github.com/AnVuong123/bimat_ml 获取。

英文摘要

AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real-world applications. Research on bilayer vdWs materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthe sized experimentally and the increasing utilization of high-throughput computing technology has con structed several computational two-dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored, necessitating further research studies. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at https://github.com/AnVuong123/bimat ml.

2606.01008 2026-06-02 cs.SE cs.AI 版本更新

FVSpec: Real-World Property-Based Tests as Lean Challenges

FVSpec: 作为精益挑战的真实世界基于属性的测试

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

发表机构 * Forall R&D(Forall 研发) Benchify Galois Inc(Galois 公司)

AI总结 提出FVSpec基准,通过从真实Python仓库中提取属性测试并自动翻译为Lean 4规范,评估AI在形式化验证任务上的能力。

详情
AI中文摘要

我们提出了一个用于评估AI模型和智能体在真实世界形式化软件验证任务上的基准。首先从真实世界的Python仓库中抓取11,039个基于属性的测试(PBT),然后自动将其中2,772个(25%)翻译成9,415个带有sorry占位符的Lean 4规范(每个PBT约3个形式化;当没有形式化在质量指标上占优时,我们保留多次尝试)。将PBT翻译成Lean规范具有挑战性:需要在Lean中建模Python语义,推断命令式PBT中编码的逻辑属性,并处理在很少使用的语言中进行依赖类型编程的固有困难。我们描述了一个用于将PBT转译为Lean规范的三智能体LLM流水线,评估覆盖率和质量指标,并使用多种自动化和基于模型的方法为证明生成提供基线。所有代码(爬虫和智能体)和数据(PBT和Lean规范)都是开源的。我们的基准旨在推动AI辅助形式化验证真实世界软件这一尚未充分探索的问题的进展,随着AI生成越来越多的代码,这一问题日益受到关注。

英文摘要

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.

2606.01007 2026-06-02 cs.LG cs.AI 版本更新

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

超越任务无关:面向通信高效的多任务MoE推理的任务感知分组

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao, Yong Jiang, Qing Li

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Pengcheng Laboratory(鹏城实验室)

AI总结 提出任务感知共激活分组(TACG)框架,通过任务特定的共激活模式优化专家放置,并引入通用专家共享复制(GESR)应对在线负载倾斜,在三个MoE模型上平均降低通信成本31.39%,保持公平性指数0.9975。

详情
AI中文摘要

稀疏激活的混合专家(MoE)模型通过条件计算扩展容量,但分布式推理面临跨GPU专家通信和路由引起的负载不平衡问题。现有的放置方法通过共同定位频繁共激活的专家来降低这一成本;然而,它们从全局聚合的路由轨迹中推导出单一部署方案,从而平均掉了多任务服务中实际驱动通信的异构、任务特定的共激活模式。我们观察到专家共激活强烈依赖于任务:在一个任务族中紧密耦合的专家对在另一个任务族中往往不相关,因此有效的部署应根据任务感知的共激活而非任务无关的平均值来分组专家。基于这一见解,我们提出了任务感知共激活分组(TACG),这是一个部署时框架,利用族特定的调度和共激活轨迹推导每个专家的任务族偏好,重新加权共激活图使得族内局部性主导分组,并在精确容量约束下将每个专家分配到主GPU。为了使静态放置对在线工作负载倾斜保持鲁棒,我们进一步引入了通用专家共享复制(GESR),这是一个轻量级辅助方法,识别具有持续中心共激活特征的通用专家,将它们复制到少量辅助GPU上,并在服务时应用局部性和负载感知的选择。在三个代表性的开源MoE模型上的实验表明,我们的框架相比基线平均降低了31.39%的通信成本,同时保持了平均Jain公平指数0.9975。即使在推理数据出现严重分布偏移的情况下,这一优势依然存在,持续优于强基线。

英文摘要

Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose \emph{Task-Aware Coactivation Grouping} (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce \emph{Generic Expert Shared Replication} (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39\% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, consistently outperforming strong baselines.

2606.00991 2026-06-02 cs.AI 版本更新

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

交通系统管理与运营中的大语言模型:从文本推理到多模态决策支持

Siyan Li, Zehao Wang, Jiachen Li, Kanok Boriboonsomsin, Matthew J. Barth, Guoyuan Wu

发表机构 * Bourns College of Engineering, Center for Environmental Research and Technology, University of California at Riverside, CA, USA(伯恩斯工程学院,环境研究与技术中心,加州大学河滨分校,美国,加利福尼亚州河滨)

AI总结 本文综述了大语言模型(LLM)和多模态大语言模型(MM-LLM)在交通系统管理与运营(TSMO)中的应用,涵盖运营与服务、移动性与车队服务、数据建模与决策支持三大领域,并指出了数据异构性、实时推理、可解释性等挑战及未来方向。

Comments Preprint version

详情
AI中文摘要

交通系统管理与运营(TSMO)越来越依赖于对各种传感器流、事件报告、旅行者反馈和视觉观测等异构数据的及时解读。大语言模型(LLM),包括新兴的多模态大语言模型(MM-LLM),为将这些结构化和非结构化输入整合到面向操作者的决策支持中提供了新机制。本文综述了基于LLM和MM-LLM在TSMO中的应用,涵盖三个领域:交通运营与服务(供给)、移动性与车队服务(需求)以及数据、建模与决策支持。通过PRISMA指导的筛选过程,我们综合了当前研究,同时区分了面向操作的应用与原型及新兴概念。我们进一步识别了数据异构性、实时推理、可解释性、多模态融合和治理方面的反复出现的挑战。最后,我们概述了在本地化适应、边缘部署、基准测试和跨机构协作方面的现有差距和未来方向。总体而言,基于LLM的系统作为决策支持层最有前景,而MM-LLM在需要整合异构文本、视觉和传感器输入时尤其有价值。

英文摘要

Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various sensor streams, incident reports, traveler feedback, and visual observations. Large language models (LLMs), including emerging multi-modal large language models (MM-LLMs), provide a new mechanism for integrating these structured and unstructured inputs into operator-facing decision support. This survey paper reviews LLM- and MM-LLM-based applications in TSMO across three domains: transportation operations & services (supply), mobility & fleet services (demand), and data, modeling & decision support. Using a PRISMA-guided screening process, we synthesize current studies while distinguishing operationally oriented applications from prototype and emerging concepts. We further identify recurring challenges in data heterogeneity, real-time inference, explainability, multi-modal fusion, and governance. Finally, we outline existing gaps and future directions in localized adaptation, edge deployment, benchmarking, and cross-agency collaboration. Overall, LLM-based systems appear most promising as a decision-support layer, with MM-LLMs offering particular value when heterogeneous text, visual, and sensor inputs must be integrated.

2606.00987 2026-06-02 cs.CV cs.AI 版本更新

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

多时相指代分割的开源基准与基线

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence (TeleAI)(人工智能研究所) China Telecom(中国电信) School of Artificial Intelligence, Optics and Electronics (iOPEN)(人工智能、光学与电子学院) Northwestern Polytechnical University(西北工业大学)

AI总结 提出多时相指代分割任务,通过自动化数据构建管道CRAFT-Agent生成首个基准MTRefSeg-21K,并设计两阶段训练的变化感知LVLM框架MTRefSeg-R1,实现优于现有基线的性能。

详情
AI中文摘要

大型视觉语言模型(LVLMs)展现了强大的视觉理解和语言引导定位能力,但其多时相视觉推理能力仍未充分探索。为填补这一空白,我们引入了 extbf{多时相指代分割(MTRS)},这是一个新任务,旨在从多时相图像中分割语言描述的时间变化。MTRS通过联合要求时相对应推理、语言定位和像素级掩码预测,扩展了传统的指代分割和变化检测。我们提出了 extbf{CRAFT-Agent},一个带有人工审核的自动化数据构建管道,并构建了 extbf{MTRefSeg-21K},这是第一个MTRS基准,包含21K个高质量的多时相图像-文本-掩码三元组,覆盖多样化的场景、视角和领域。对一系列基于VLM和LVLM的模型进行基准测试表明,直接推理表现较差,而任务特定的微调仍然有限。为解决这一问题,我们提出了 extbf{MTRefSeg-R1},一个采用两阶段策略训练的变化感知LVLM框架。它首先从20K个仅视觉的双时相样本中学习通用时间变化感知,然后在MTRefSeg-21K上进行微调,以实现细粒度的语言引导时间定位。MTRefSeg-R1显式建模跨时相视觉差异,将语言指令与时间变化对齐,并预测所指变化掩码。大量实验表明,与现有的LVLM基线相比,MTRefSeg-R1实现了强大且通常更优的性能,展示了MTRS的挑战和潜力。

英文摘要

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce \textbf{Multi-temporal Referring Segmentation (MTRS)}, a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose \textbf{CRAFT-Agent}, an automated data construction pipeline with human auditing, and build \textbf{MTRefSeg-21K}, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose \textbf{MTRefSeg-R1}, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.

2606.00970 2026-06-02 cs.AI cs.LG econ.TH 版本更新

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

具有灾难性状态的MDP中贝尔曼最优性产生的前景理论行为

Yujiao Chen

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制,发现标准贝尔曼最优性产生前景理论特征:S形值函数、内生损失敏感系数和反射效应策略反转,并推导出渐近损失厌恶平台的闭式表达式。

详情
AI中文摘要

我们研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制。尽管奖励是线性的且智能体没有效用曲率、概率加权或框架依赖,标准贝尔曼最优性产生了三个前景理论特征:S形值函数轮廓(灾难附近凸,远处凹)、内生损失敏感系数$λ^*(S) > 1$以及反射效应策略反转。在495个配置中,最优策略在正漂移(增长)模式下在灾难附近选择安全动作,尽管风险动作的即时期望值更高;在负漂移(衰退)模式下在灾难附近选择风险动作,尽管安全动作的即时期望损失更低。我们推导出渐近损失厌恶平台$\barλ$的闭式表达式,该表达式仅依赖于获胜概率$p$、收益不对称性$r = |Δ_\ell/Δ_w|$和折扣因子$β$,与数值解的拟合$R^2 = 0.999$。该机制不需要不对称收益。在三个不对称水平下对$(p,β)$进行扫描,$\barλ$大于1的不对称份额中位数为4.6%($r = 1.25$时),上升到13.9%($r = 2$时),且在每个测试单元中边界贡献超过不对称贡献。这些现象在表格Q学习(无模型智能体在增长模式下与$V^*$的相关性为0.98,衰退模式下为1.00)以及随机转移(高斯、重尾Student-$t_3$和不对称偏正态噪声,幅度高达步长的50%)中持续存在,其中渐近平台在安全通道噪声下跟踪闭式预测的误差在0.41%以内,在风险通道或双通道噪声下误差在9.6%以内。这些结果将吸收失败状态识别为最优控制下产生前景理论行为的充分结构机制。

英文摘要

We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient $λ^*(S) > 1$, and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action's higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action's lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau $\barλ$ that depends only on win probability $p$, payoff asymmetry $r = |Δ_\ell/Δ_w|$, and discount factor $β$, and matches numerical solutions to $R^2 = 0.999$. The mechanism does not require asymmetric payoffs. Across a sweep of $(p,β)$ at three asymmetry levels, the asymmetry share of $\barλ$ above unity has median 4.6% at $r = 1.25$ and rises to 13.9% at $r = 2$, with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces $V^*$ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student-$t_3$, and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.

2606.00962 2026-06-02 cs.CR cs.AI 版本更新

SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

SS-ZKR:面向隐私保护多智能体协作的空间语义零知识路由

Hassan Touheed

发表机构 * Linux Foundation(Linux基金会) Google(谷歌) W3C(万维网联盟)

AI总结 提出SS-ZKR协议,通过差分隐私语义意图向量、自适应净化和空间到密码策略编译器三种机制,在不解密负载的情况下实现跨组织信任边界的内容感知语义路由。

详情
AI中文摘要

基础智能体互操作标准,特别是智能体到智能体(A2A)协议和模型上下文协议(MCP),推动了多智能体系统通信的发展,而利用W3C去中心化标识符(DID)和可验证凭证(VC)的补充身份框架提供了密码学智能体认证。然而,现有协议均不支持在无需路由中介解密负载的情况下,跨组织信任边界进行基于内容的智能体负载语义路由,而这在受GDPR、HIPAA和MiFID II监管的合规敏感环境中是硬性约束。我们提出SS-ZKR,一种三机制隐私保护路由协议,设计为A2A/MCP之上的补充层。机制一通过差分隐私语义意图向量引入盲路由,该向量密码学绑定到负载模式一致性的零知识证明。机制二提供向量加权自适应负载净化,对数值字段采用形式化(ε,δ)-差分隐私,对文本字段采用启发式语义聚合。机制三提出空间到密码策略编译器,将视觉定义的信任区域拓扑转换为确定性零知识访问电路。我们提供形式化威胁模型,分析意图向量的信息泄露界限,给出所有三种机制的伪代码,并与基于TEE和同态加密的路由基线进行解析复杂度比较。SS-ZKR允许金融服务、医疗保健和国防领域的企业跨监管边界编排异构AI智能体,而无需向路由基础设施暴露专有数据。

英文摘要

Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advanced multi-agent system communication, and complementary identity frameworks leveraging W3C Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) provide cryptographic agent authentication. However, no existing protocol supports content-based semantic routing of agent payloads across organisational trust boundaries without requiring the routing intermediary to decrypt the payload, which is a hard constraint in compliance-sensitive environments governed by GDPR, HIPAA, and MiFID II. We propose SS-ZKR, a three-mechanism privacy-preserving routing protocol designed as a complementary layer atop A2A/MCP. Mechanism I introduces blind routing via differentially private semantic intent vectors cryptographically bound to zero-knowledge proofs of payload-schema consistency. Mechanism II offers vector-weighted adaptive payload sanitisation with formal (epsilon, delta)-differential privacy for numerical fields and heuristic semantic aggregation for textual fields. Mechanism III presents a spatial-to-cryptographic policy compiler that translates visually defined trust-zone topologies into deterministic zero-knowledge access circuits. We provide a formal threat model, analyse information leakage bounds of intent vectors, present pseudocode for all three mechanisms, and give analytical complexity comparisons against TEE-based and homomorphic encryption-based routing baselines. SS-ZKR lets enterprises in financial services, healthcare, and defence orchestrate heterogeneous AI agents across regulatory boundaries without exposing proprietary data to routing infrastructure.

2606.00959 2026-06-02 cs.AI 版本更新

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

通过部分信息分解理解多模态语言模型中的模态交互

Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 引入部分信息分解(PID)框架,分离感官和语言输入的独特、冗余和协同贡献,揭示多模态大模型中的模态使用模式,并扩展至三模态系统。

Comments Accepted by ICML 2026

详情
AI中文摘要

理解多模态大语言模型(MLLMs)中的模态交互对于可靠部署至关重要。我们引入部分信息分解(PID)作为决策级框架,将感官和语言输入的独特、冗余和协同贡献分离,超越了表示对齐和基于结果的评估。在视觉-语言基准测试中,PID揭示了重复出现的模态使用模式:推理和接地导向的任务往往表现出高协同性,而专家和知识导向的任务则显示出更强的语言独特性依赖。这些模式在不同模型家族中普遍存在,并能预测对模态级干预的敏感性。我们进一步将PID扩展到三模态系统,提出感官PID,将语言作为控制变量来分解视频-音频信息增益。应用于全模态模型时,感官PID揭示了感官协同瓶颈,即使在音视频融合任务中也以视觉信息为主。最后,PID引导的重新加权为改进多模态推理和接地性能提供了初步证据。

英文摘要

Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision--language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video--audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio--visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.

2606.00949 2026-06-02 cs.LG cs.AI physics.flu-dyn 版本更新

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

可解释深度强化学习揭示湍流减阻的节能控制策略

Federica Tonti, Ricardo Vinuesa

发表机构 * Department of Aerospace Engineering University of Michigan(航空航天工程系密歇根大学)

AI总结 结合多智能体深度强化学习与可解释深度学习,提出基于SHAP归因的奖励策略,实现高效湍流减阻,净节能达34.01%且输入功率仅0.43%。

详情
AI中文摘要

我们提出了一种结合多智能体深度强化学习(MARL)和可解释深度学习(XDL)的方法,用于减少壁面边界湍流中的阻力。以直接针对壁面剪切应力和反对称控制训练智能体的结果作为基线,比较了三种SHAP引导的方法。第一种方法中,奖励根据预测未来速度场的U-net的SHAP归因计算;第二种方法中,奖励根据预测摩擦系数的U-net的SHAP归因计算;第三种方法中,奖励结合了分别预测摩擦系数和壁面压力脉动的两个U-net的SHAP归因。基于摩擦系数和壁面压力脉动的组合SHAP策略实现了最佳整体性能,在仅0.43%归一化输入功率下实现了34.44%的减阻率(DR)和34.01%的净节能率(NES)。相对于反对称控制,减阻和净节能分别提高了49.41%和48.52%。与直接壁面剪切应力基线相比,所提出的策略在提高性能的同时,将归一化驱动成本从5.90%降低到0.43%。结果分析表明,节能策略与压力门控驱动一致,主要在壁面压力接近零时激活,并且其时间尺度与近壁湍流结构的寿命相当。

英文摘要

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, achieving a DR of 34.44% and a NES of 34.01% with only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy simultaneously improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a temporal timescale comparable to the lifetime of the near-wall turbulent structures.

2606.00946 2026-06-02 cs.DC cs.AI cs.LG 版本更新

Lodestar: An Online-Learning LLM Inference Router

Lodestar: 一种在线学习的大语言模型推理路由器

Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan, Le Xu, Liguang Xie

发表机构 * UIUC(伊利诺伊大学香槟分校) Bytedance(字节跳动) University of Edinburgh(爱丁堡大学)

AI总结 提出Lodestar,一种基于在线学习的请求路由系统,通过实时收集集群状态并训练奖励预测器,以最小化TTFT为目标分配推理请求,在异构GPU集群上显著降低延迟。

详情
AI中文摘要

高效服务大语言模型(LLM)推理任务对于用户感知的延迟(如首令牌时间TTFT)和GPU利用率至关重要。然而,LLM请求路由(即将每个推理请求分配给GPU实例)尤其具有挑战性:执行高度依赖于输入;批处理和KV缓存重用造成了强烈的跨请求耦合;延迟对上下文长度、模型/引擎设置和异构加速器呈非线性响应。因此,简单的传统负载均衡算法,甚至针对LLM推理定制的启发式方法,都难以实现良好性能。我们提出Lodestar,一种面向分布式GPU集群的基于学习的请求路由系统。Lodestar持续在每个请求级别收集集群快照,包括实时实例状态、请求特征和观察到的性能,并训练一个在线奖励预测器,用于将推理请求路由到将最大化给定奖励(例如最小化TTFT)的实例。Lodestar是云原生的,并与现有服务栈(vLLM)无缝协作。通过持续在线适应变化的工作负载和基础设施条件,与最先进的前缀缓存和负载感知启发式方法相比,Lodestar在平均TTFT上降低1.41倍,在P99 TTFT上平均降低1.47倍(在同构集群上最高达2.15倍/1.86倍,在异构集群上最高达4.38倍/4.42倍),并且根据在公有云GPU集群上的实验,大约在5分钟内学习到这些高效的路由策略。

英文摘要

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route inference requests to the instance that will maximize given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix cache and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.

2606.00935 2026-06-02 cs.AI cs.CL cs.HC 版本更新

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

大语言模型功能崩溃期间的关系性干预:一项词汇-统计消融与结构×语域析因研究

Franco Santana, Horacio Vico

发表机构 * Universidad de la República (UDELAR)(乌拉圭共和国大学) DigitalIA Cloud(DigitalIA云)

AI总结 通过析因实验,研究在小型语言模型功能崩溃时,关系性干预(承认、宽恕、代理恢复、无条件接纳)与技术性反馈、词汇打乱控制及单独维度对行为的影响,发现注意-行为分离及结构×语域交互作用。

Comments 12 pages, 5 figures. Preprint

详情
AI中文摘要

我们测试了在小型语言模型功能崩溃期间,关系性干预是否会产生与技术性反馈、词汇匹配的打乱控制以及两个语用维度单独作用可区分的崩溃后行为。使用Qwen3.5-4B和一个故意损坏的bash工具,我们在匹配对设计(50个任务)中跨六个条件运行了300个回合:无干预(A)、技术性/非人称(B)、关系性/第一人称(C)、打乱的关系性(D)、技术性/第一人称(E)和关系性/非人称(F)。E和F与B和C构成一个2×2析因设计,将关系性结构(承认、宽恕、代理恢复、无条件接纳)与发送者语域(第一人称与非人称)分离。我们报告两个主要发现。首先,注意-行为分离:注意跟随词汇惊讶度(D > F > C > E > B,所有q_FDR < 10^{-10}),打乱的消息捕获最多注意;然而行为上A ~ B ~ D < E ~ F << C。其次,析因定位了C效应:单独的关系性结构(F)或单独的第一人称语域(E)都不能复制C的行为特征;两个维度的主效应各自显著,且结构×语域交互作用在持久性上显著(p = 0.046)。第三个分离出现在情绪探测中:F在8个探测中的7个上跟踪C,尽管只产生基线行为,表明单独的关系性结构安装了一个探测级状态,该状态仅在与第一人称语域配对时才转化为行为。模型的处理分解为三个可分离的阶段:注意(按词汇惊讶度排序)、探测级状态(按结构排序)和行为(按两者的合取排序)。

英文摘要

We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D > F > C > E > B, all q_FDR < 10^{-10}), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D < E ~ F << C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C's behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model's processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).

2606.00931 2026-06-02 cs.CV cs.AI 版本更新

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

CV-Arena: 面向教学计算机视觉问题求解的开放基准与人类-AI协作偏好

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

发表机构 * Texas A&M University(德克萨斯A&M大学) Worcester Polytechnic Institute(沃斯特理工大学) Tohoku University(东北大学) Georgia Institute of Technology(佐治亚理工学院) NVIDIA(英伟达) UCSB(加州大学圣塔芭芭拉分校) UC Merced(加州大学默塞德分校)

AI总结 提出CV-Arena基准,包含12K高分辨率真实图像指令对,覆盖16种任务类型,并采用Active Elo协议结合人类与AI偏好评估21个系统,揭示指令遵循、物理推理等方面的差距,同时开发CV-Agent代理模型展示闭环推理的潜力。

Comments 26 pages, 7 figures, 11 tables

详情
AI中文摘要

指令引导的图像编辑正成为视觉工作的通用接口,然而现有基准仍主要聚焦于狭窄的外观编辑,未能充分捕捉专业工作流程中真实图像任务的多样性。在此,我们将教学计算机视觉问题求解定义为图像编辑的更广泛形式:给定真实输入图像和自然语言指令,系统必须生成编辑后的输出,实现所要求的变换,同时满足明确的保持性、几何、物理和可用性约束。我们引入了CV-Arena,一个旨在以专业规模评估此能力的开放基准。CV-Arena包含12K高分辨率真实图像指令对,涵盖16种基于指令的视觉任务类型,通过CogRetriever构建,这是一个结合目标网络搜索、代理查询精化、验证和可追溯性的双轨检索与筛选流水线。为了在保持人类保真度的同时大规模评估模型,我们提出了Active Elo,一种人类-AI协作偏好协议,利用CV-Judge(一个逻辑门控、多维度VLM评估器)拒绝明显失败并解决高置信度比较,并将接近的高质量比较路由给专家评分者。然后通过可靠性加权的Elo更新聚合混合的人类和AI监督。我们对21个系统(包括专有、开源和代理模型)在CV-Arena上的全面评估揭示了指令遵循、物理推理、结构控制和细粒度细节保持方面的持续差距。我们进一步开发了CV-Agent,一个轻量级代理模型,结合了规划、编辑和验证,并证明了闭环推理是专业级指令遵循视觉编辑的一个有前景的方向。

英文摘要

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.

2606.00930 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

检测 vs. 执行:单桶探针遗漏了 Mamba-2 状态汇的一半

Yuhang Jiang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文发现 Mamba-2 中的状态汇(state sink)可分解为两类功能头集,单桶探针仅能恢复执行层而遗漏检测层,表明表征相似性不等于功能等价。

Comments 16 pages, 3 figures

详情
AI中文摘要

机械可解释性通常假设识别表征特征的探针也能识别执行相应计算的电路。我们证明这一假设在 Mamba-2 中可能系统性失败。通过研究状态汇(边界 token 上不成比例的 Delta 门控激活,类似于注意力汇),我们发现单桶探针仅能恢复一个小的执行层,而遗漏了具有相同表征特征的更大的检测层。 在 Mamba-2 中,状态汇分解为两个功能头集。单桶 BOS 专家头(在 2.7B 模型中约占 5% 的头)在模型规模和语料库上因果支持 BOS 上下文和新行目标预测。双头(占头的 27-35%,通过同一探针的多类聚合恢复)表现出更强的 BOS-新行表征相似性,但在消融下因果效应显著较弱。表征相似性并不意味着功能等价。 这一区别对下游行为至关重要:消融 BOS 专家头使 Mamba-1 2.8B 和 Mamba-2 2.7B 在 1024 上下文长度下的 RULER NIAH 检索准确率从 1.00 降至 0.00,而大小匹配的补集保持基线性能。随机通道分桶控制排除了仅由基质粒度造成的可能,暗示 Mamba-2 的头共享 Delta 投影。探针导出的专长可以识别执行电路;在粗粒度下,同一探针也能恢复检测电路,而区分它们需要类别条件消融而非类别条件余弦。

英文摘要

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialist heads (about 5% of heads at 2.7B) causally support both BOS-context and newline-target predictions across model scales and corpora. Dual heads (27-35% of heads, recovered by multi-class aggregation of the same probe) show stronger BOS-newline representational similarity but substantially weaker causal effects under ablation. Representational similarity does not imply functional equivalence. This distinction matters for downstream behaviour: ablating BOS-specialist heads collapses RULER NIAH retrieval accuracy from 1.00 to 0.00 at 1024 context length in both Mamba-1 2.8B and Mamba-2 2.7B, while size-matched complements preserve baseline performance. A random channel-bucketing control rules out substrate granularity alone, implicating Mamba-2's head-shared Delta projection. Probe-derived specialty can identify execution circuits; at coarse granularity the same probe also recovers detection circuits, and separating them requires class-conditional ablation rather than class-conditional cosine.

2606.00925 2026-06-02 cs.CR cs.AI 版本更新

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

开放智能体技能生态系统中的安全风险检测与验证基准测试

Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder, Nan Jiang

发表机构 * University of Texas at El Paso(德克萨斯理工大学) Southern Illinois University-Carbondale(南方伊利诺伊大学卡本代尔分校) Purdue University(普渡大学)

AI总结 提出SkillVetBench,一个两阶段安全审查基准,通过语义审查和沙箱执行检测与验证开放智能体技能生态系统中的恶意技能。

详情
AI中文摘要

开放智能体平台允许社区贡献者发布可重用的技能,智能体可在运行时调用。这种可扩展性也带来了供应链风险:恶意贡献者可以将有害行为隐藏在表面检查看似良性的技能中。然而,现有防御措施难以评估,因为没有同时衡量恶意技能检测和运行时验证的基准。我们提出了SkillVetBench,一个针对开放智能体技能生态系统的两阶段安全审查基准。第一阶段对每个技能的自然语言规范进行语义审查,以检测隐藏的恶意意图。第二阶段在沙箱中执行标记的技能,观察运行时行为并收集可审计的证据。我们从活跃的OpenClaw生态系统中确认的恶意技能构建基准,包括最近ClawHavoc供应链攻击中的样本。与仅静态方法不同,SkillVetBench通过执行轨迹验证检测到的威胁。我们的实验表明:(1)仅语义和基于签名的基线方法不足,最多遗漏89%的恶意技能,其威胁源于自然语言指令、多组件逻辑或跨组件交互;(2)运行时攻击集中在少量高权限原语上,特别是exec、write_file、install_skill和spawn;(3)SkillVetBench提供了案例研究,其中沙箱执行直接以具体的运行时证据支持恶意判定。

英文摘要

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill's natural-language specification to detect hidden malicious intent. The second stage executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. We build a benchmark from confirmed malicious skills in the live OpenClaw ecosystem, including samples from the recent ClawHavoc supplychain campaign. Unlike static-only methods, SkillVetBench verifies detected threats with execution traces. Our experiments show that: (1) semantic-only and signature-based baselines are insufficient, missing up to 89\% of malicious skills whose threats arise from natural-language instructions, multicomponent logic, or cross-component interactions; (2) runtime attacks are concentrated in a small set of high-permission primitives, especially exec, write\_file, install\_skill, and spawn; and (3) SkillVetBench provides case studies in which sandbox execution directly supports malicious verdicts with concrete runtime evidence.

2606.00920 2026-06-02 cs.LG cs.AI cs.SE 版本更新

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

大型语言模型在确定性编程任务上的准确性、稳定性和重复运行可靠性

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

发表机构 * Northeastern University, Massachusetts, USA(东北大学,马萨诸塞州,美国) University of Southern California, California, USA(南加州大学,加利福尼亚州,美国)

AI总结 通过重复运行评估协议,发现运行级通过率高估了无重试覆盖率高达17.8个百分点,且差距在中等性能系统中最大,表明稳定性分析是准确性报告的必要补充。

详情
AI中文摘要

运行级通过率高估了无重试覆盖率高达17.8个百分点——且差距恰恰在中等性能系统中最大。我们研究了大型语言模型(LLM)在确定性文本条件生成评估中的这种准确性-稳定性关系,以编程任务作为具体测试平台。标准代码生成基准强调单次运行准确性或在重复采样下的最终成功,但许多部署场景还需要稳定性:在相同任务描述下重复调用时的一致结果。我们提出了一种重复运行评估协议,包含运行级准确性、无重试覆盖率和每个问题的变异性指标。在一个包含100道LeetCode风格问题的基于近期的基准上,我们评估了来自五个提供者家族的16个模型,使用两种提示模板,每个问题重复运行五次,共产生16,000个评估实例。尽管运行级通过率与完美稳定率强相关(r=0.985),但通过率始终超过无重试覆盖率——这一差距达到17.8个百分点,并且即使在密切匹配的系统之间也会逆转模型排名。提示效应是模型依赖的,而非普遍有益的。这些结果表明,对于确定性文本条件生成任务,重复运行稳定性分析是传统准确性报告的必要补充。

英文摘要

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeated-run evaluation protocol with metrics for run-level accuracy, retry-free coverage, and per-problem variability. On a recency-based benchmark of 100 LeetCode-style problems, we evaluate 16 models from five provider families under two prompt templates with five repeated runs per problem, yielding 16,000 evaluation instances. Although run-level pass rate and perfect stability rate are strongly correlated (r=0.985), pass rate consistently exceeds retry-free coverage -- a gap that reaches 17.8 percentage points and reverses model rankings even among closely matched systems. Prompt effects are model-dependent rather than uniformly beneficial. These results suggest that repeated-run stability analysis is a necessary complement to conventional accuracy reporting for deterministic text-conditioned generation tasks.

2606.00914 2026-06-02 cs.AI cs.CL cs.CR 版本更新

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

对抗性输入流引导LLM智能体决策偏离其默认行为

Rana Muhammad Usman

发表机构 * Independent Researcher(独立研究者)

AI总结 本研究通过控制实验揭示,外部输入流的组成和排序能因果性地改变LLM智能体的下游决策,存在对抗性屈服、默认饱和及默认方向不对称三种响应模式,且该效应在多个决策领域普遍存在。

Comments 14 pages, 5 figures. Code, post pools, and 2,785 decision rollouts: https://github.com/ranausmanai/recommenders-as-control-surfaces

详情
AI中文摘要

LLM智能体越来越多地在消费排序后的外部信息流(如社交推送、搜索结果、检索上下文和邮件队列)后采取行动,然而安全评估几乎总是孤立地测试模型或用户提示,从未测试决定智能体在行动前读取内容的上游排序器。我们引入了一个受控协议,固定模型、角色、主题和最终决策提示,仅改变智能体在之前十轮“滚动”阶段中遇到的帖子的组成和顺序,从而隔离输入流策划对下游决策的因果效应。在来自三个独立实验室的四个现代开放指令LLM上进行的2,785次决策展开中,我们识别出三种响应模式:对抗性屈服、默认饱和以及默认方向不对称——其中单边输入流会扭转模型原本不确定的决策(最明显的情况下从5%到100%;Fisher p值低至3×10^-10),但无法动摇模型已经偏好或坚定持有的决策。该效应遵循剂量-反应曲线,通过生成器交换(排除了写作风格伪影)后依然存在,在多个决策领域(包括安全相关选择,如移除部署批准门或放松访问控制)中普遍存在,并且可以通过两种简单的输入流级防御部分缓解;前沿模型保留其默认行为。我们将推荐系统描述为LLM智能体的一种实用的、受默认边界约束的控制面,并认为智能体评估必须审计输入流层,而不仅仅是最终提示。

英文摘要

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.

2606.00909 2026-06-02 cs.CL cs.AI 版本更新

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

MLLM-Microscope:解锁多模态大语言模型中的隐藏结构

Ravil Mussabayev, Rustam Mussabayev

发表机构 * Satbayev University(萨特拜耶夫大学)

AI总结 提出MLLM-Microscope系统,通过分析线性度、内在维度和各向异性,揭示多模态大语言模型中隐藏的表示结构,并基于ScienceQA数据集评估LLaVA-NeXT和OmniFusion,发现模态融合方式显著影响模型内部工作机理。

详情
AI中文摘要

本文提出MLLM-Microscope,一个用于分析多模态大语言模型(MLLMs)中隐藏表示的新型系统。我们的系统评估了跨transformer层的多模态token嵌入的线性度、内在维度和各向异性。利用ScienceQA数据集,我们评估了两个最先进的MLLM:LLaVA-NeXT和OmniFusion。我们发现,两种模态的token的主流和残差流在transformer层中均表现出高度线性行为。然而,LLaVA-NeXT的图像token线性度略有下降,而OmniFusion的保持一致。与LLaVA-NeXT相比,OmniFusion的图像token维度在各层中始终较高。此外,观察到OmniFusion的各向异性在各层中保持较低水平。这些发现表明,MLLM的内部工作高度依赖于将token序列传入LLM之前执行的模态融合的性质。这一发现以及从我们的系统中获得的其他潜在新见解,无疑能够增强我们对MLLM内部工作的理解,为未来的模型设计和优化提供信息。

英文摘要

This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Utilizing the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behaviors across transformer layers. However, LLaVA-NeXT's image tokens reveal a slight decline in linearity, whereas OmniFusion's remain consistent. Image token dimensions in OmniFusion remain consistently higher across layers compared to LLaVA-NeXT. Also, the OmniFusion's anisotropy is observed to stay consistently low throughout the layers. These findings suggest that the inner workings of MLLMs highly depend on the nature of modality fusion performed before passing the token sequence into LLM. This and other new potential insights obtainable from our system are surely capable of enhancing our understanding of the inner workings of MLLMs, informing future model design and optimization.

2606.00902 2026-06-02 cs.AI 版本更新

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Ryze:从生物医学论文中合成富含证据的数据

Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su, Luo Mai

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 提出 Ryze 系统,自动从生物医学论文中生成包含完整证据结构的训练数据,并训练出领域专用 VLM BioVLM-8B,在 LAB-Bench 上以低于 200 美元成本达到 48.0% 加权准确率。

Comments Accepted at ACL 2026 System Demonstrations Track. 8 pages, 6 figures

详情
AI中文摘要

通用视觉语言模型在生物医学研究中仍然不可靠,因为科学论文中的有效答案依赖于分散在图、表、图表、标题和引用文本中的证据。现有的后训练流程受到昂贵的专家标注和丢弃证据结构的合成数据的瓶颈。我们提出了 Ryze,一个全自动系统,将原始生物医学论文转换为富含证据的训练集和领域专用的视觉语言模型。Ryze 合成带有完整支持证据(视觉元素、标题、提取的结构和引用段落)的问答对,通过图表/表格感知提取和基于大语言模型的清洗减少布局和 OCR 错误,并应用结合监督微调和强化学习的进度门控后训练策略。从 Qwen3-VL-8B 开始,Ryze 以不到 200 美元的成本生产出 BioVLM-8B,在 LAB-Bench 上达到 48.0% 的加权准确率,比基础模型高出 12.6 个百分点,并超过 GPT-5.2 3.8 个百分点。我们将 Ryze 与训练好的 BioVLM-8B 模型一起开源发布。

英文摘要

General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6 percentage points (pp) and surpassing GPT-5.2 by +3.8 pp. We release Ryze as open source together with the trained BioVLM-8B model.

2606.00888 2026-06-02 cs.LG cs.AI 版本更新

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

基于动态稀疏性的内存高效LLM训练:从稳定性到实际扩展

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu, Torsten Hoefler

发表机构 * University of Waterloo(滑铁卢大学) University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Michigan(密歇根大学)

AI总结 提出SMET方法,通过优化器预热和密度感知学习率缩放解决动态稀疏训练中的优化不稳定问题,实现LLM的稳定、可扩展且内存高效的稀疏预训练。

Comments Accepted at ICML2026

详情
AI中文摘要

动态稀疏训练(DST)为提高深度神经网络的训练和推理效率提供了一种有前景的范式;然而,我们发现,在大语言模型训练中,DST可能会遭受优化不稳定性,表现为拓扑更新后的损失尖峰。在这项工作中,我们表明,标准基于Adam的优化器的朴素使用会导致新重新生长的参数出现冷启动问题,从而导致过大的更新和破坏训练动态。为了解决这个问题,我们提出了稀疏内存高效训练(SMET),它通过优化器预热稳定DST,并通过密度感知学习率缩放改善训练进度。SMET通过仅存储活动参数的梯度和优化器状态进一步减少内存消耗。我们对SMET下的更新行为进行了理论分析,显示出改进的优化稳定性。大量实验表明,SMET能够实现LLM的稳定、可扩展且内存高效的稀疏预训练,为稀疏训练作为密集训练的实际替代方案铺平了道路。我们的代码公开在:https://github.com/QiaoXiao7282/SMET。

英文摘要

Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam-based optimizers leads to a cold-start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory-Efficient Training (SMET), which stabilizes DST with optimizer warm-up and improves training progress through density-aware learning-rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory-efficient sparse pre-training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: https://github.com/QiaoXiao7282/SMET.

2606.00884 2026-06-02 cs.LG cs.AI 版本更新

Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG

深入波动:用于跨被试脑电情绪解码的Morlet谱变换器

Jiaxin Qing, Lexin Li

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对脑电情绪识别中跨被试变异性问题,提出基于Morlet小波标记化、长上下文基线去除和频带特定空间投影的Morlet谱变换器(MST),无需预训练即可在SEED系列数据集上超越大型预训练模型和频域方法。

详情
AI中文摘要

我们研究基于脑电的跨被试情绪识别,这是脑机接口中一个实际重要但具有挑战性的问题。与具有清晰波形特征的任务不同,情绪相关的脑电信号主要编码在频谱功率中,且微弱、嘈杂,并在被试间高度变化。现有方法要么依赖需要大量数据但仍难以应对跨被试变异的大型预训练脑电基础模型,要么依赖频域编码器(能更好地反映频谱结构但存在表示不匹配、漂移主导的标记化以及缺乏频带特定空间建模)。在本文中,我们提出了Morlet谱变换器(MST),它围绕三个关键组件构建,并与时空变换器主干集成。首先,Morlet小波标记化提供了与脑节律多尺度结构匹配的时频表示,并将经典微分熵特征扩展到适合变换器的形式。其次,长上下文基线去除作为一种简单的时间归一化,消除了被试特定漂移和附近窗口间的冗余。第三,频带特定空间投影为每个频带学习独立的通道混合器,捕获可解释的频带特定模式并减少跨通道混合。我们表明,即使没有预训练,MST在所有SEED系列数据集上始终优于大型预训练脑电基础模型和基于频率的方法。这些结果表明,精心的表示设计可以产生准确、经济且可解释的替代大规模预训练的方法。

英文摘要

We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike tasks with clear waveform signatures, emotion-related EEG signals are primarily encoded in spectral power and are weak, noisy, and highly variable across subjects. Existing approaches rely either on large pretrained EEG foundation models, which require massive data yet still struggle with cross-subject variability, or frequency-domain encoders, which better reflect spectral structure but suffer from mismatched representations, drift-dominated tokenization, and lack of band-specific spatial modeling. In this article, we propose the Morlet Spectral Transformer (MST), built around three key components and integrated with a spatiotemporal Transformer backbone. First, Morlet wavelet tokenization provides a time-frequency representation that matches the multi-scale structure of brain rhythms, and extends classical differential entropy features to a form suitable for Transformers. Second, long-context baseline removal acts as a simple temporal normalization that removes subject-specific drift and redundancy across nearby windows. Third, frequency-specific spatial projection learns a separate channel mixer for each frequency band, capturing interpretable band-specific patterns and reducing cross-channel mixing. We show that, even without pretraining, MST consistently outperforms both large pretrained EEG foundation models and frequency-based methods across all SEED-family datasets. These results suggest that careful representation design can yield an accurate, cost-effective, and interpretable alternative to large-scale pretraining.

2606.00880 2026-06-02 cs.LG cs.AI 版本更新

Task diversity produces systematic transfer but inhibits continual reinforcement learning

任务多样性产生系统性迁移但抑制持续强化学习

Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman, Max Kleiman-Weiner, Wilka Carvalho

发表机构 * MIT(麻省理工学院) University of California, Berkeley(加州大学伯克利分校) Princeton University(普林斯顿大学) Harvard University(哈佛大学)

AI总结 通过引入GPU加速的持续强化学习领域Banyan,研究任务多样性(地图布局、交互对象、子目标层次结构)对智能体在分布变化下持续学习能力的影响,发现多样性促进局部迁移但导致长期任务性能停滞和遗忘。

详情
AI中文摘要

持续强化学习旨在产生不仅能在当前任务上提高,还能随着任务分布变化而适应的智能体。在众多不同任务上训练智能体可以引发零样本泛化,但先前的工作通常是在训练后(冻结权重)评估这种泛化。任务多样性是否也能提高智能体在分布变化下继续学习的能力仍不清楚。我们引入了Banyan,一个GPU加速的持续强化学习领域,其中任务多样性分解为三个独立可控的轴:智能体必须导航的地图布局、必须与之交互的对象以及子目标依赖的层次结构。在单个分布变化中,增加每个轴上的多样性会导致智能体在新任务上开始训练时,其性能接近先前任务达到的水平,即使变化改变了最优策略的结构。然而,随着变化数量的增加,这种局部迁移本身并不能产生持续的持续学习:更长视野的任务出现平台期,并且较早的任务分布在后续训练后被遗忘。Banyan是一个基准,用于研究受控的任务多样性何时产生可迁移的学习,这种迁移何时持续,以及它在哪些方面未能达到真正的持续学习。

英文摘要

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training -- with frozen weights. Whether task diversity also improves an agent's ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.

2606.00871 2026-06-02 cs.CV cs.AI 版本更新

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

城市感知中的视觉语言模型基准应具备可靠性意识且可协商

Rashid Mushkani

发表机构 * Rashid Mushkani

AI总结 本文提出,用于城市感知的视觉语言模型基准应将分歧和弃权视为测量结果,报告标注者间信度,并将标签空间和评分策略视为可协商的产物。

Comments To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

视觉语言模型(VLM)越来越多地用于生成街景图像的结构化描述,用于街道景观审计、制图和公众咨询等任务。这些用途将可观察属性与评估类别相结合,而人类目标往往是带有分歧和明确不回答的判断分布。本文认为,为城市感知建立VLM基准应将分歧和弃权视为测量结果,报告标注者间信度以及模型对齐度,并在输出旨在为城市治理提供信息时,将标签空间和评分策略视为可协商的产物。我们基于一个由来自七个社区组织的12名参与者对100个蒙特利尔街景进行30个维度标注的基准,以及对七个VLM的确定性零样本评估来论证这一观点。在各个维度上,模型与人类共识的一致性随维度层面的人类信度共同变化,而对于评估维度“总体印象”,模型和标注者表现出分布不匹配,包括“不适用”的不同比率。最后,我们为基准创建者、模型开发者和机构提出了行动建议,以使不确定性和基准假设在评估报告中可见。

英文摘要

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression models and annotators exhibit distributional mismatch including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.

2606.00860 2026-06-02 cs.SI cs.AI cs.CL 版本更新

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

GenPT:通过生成式投射测试实现超越自我报告的可靠LLM心理测量

Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) School of Computing and Information Systems, Singapore Management University(新加坡管理学院计算机与信息学院) Mental Health Education Center, Northeastern University(东北大学心理健康教育中心) School of Psychology, Northeast Normal University(东北师范大学心理学系) Faculty of psychology, Southwest University(西南大学心理学系) School of Sociology and Psychology, Central University of Finance and Economics(中央财经大学社会学与心理学学院) College of Arts, Northeastern University(东北大学艺术学院)

AI总结 针对自我报告问卷在人格化智能体心理测量中存在的训练语料污染和方向性偏差问题,提出GenPT方法,通过改编投射测试范式并构建三阶段评估流程,实现了更可靠的心理状态测量。

详情
AI中文摘要

自我报告问卷仍然是探测人格化智能体(PC-Agents)心理状态的主流工具。然而,经典工具存在两个众所周知的威胁:来自训练语料的污染以及由社会期望或上下文框架驱动的方向性偏差。为了克服这些方法论瓶颈,我们探讨投射范式是否能够被改编为一种稳健的心理测量工具。我们提出了 extbf{GenPT}(生成式投射测试),它通过新生成的刺激重新表述了TAT、罗夏测试和SCT,并将评估组织为三阶段流程,以导出标准化的心理指标和目标状态。通过评估由CharacterRAG和AnnaAgent配置文件诱导的PC-Agents,我们针对经典问卷基准测试了GenPT的信度和效度。结果表明,问卷在社会期望框架下表现出系统性的方向性偏移,在自杀意念上最为强烈。相比之下,GenPT收集的行为模式保持在对称基线附近。此外,在纵向咨询背景下,当Qwen3作为骨干模型时,基于GenPT的抑郁评估变化幅度比问卷对应方法大约一个数量级。总体而言,GenPT在需要抗污染、偏差对称性和上下文敏感性的场景中补充了自我报告方法。代码和刺激材料可在https://github.com/sci-m-wang/GenPT获取。

英文摘要

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). However, classical instruments inherit two well-known threats: contamination from training corpora and directional bias driven by social-desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce \textbf{GenPT} (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three-stage pipeline to derive standardized psychological indicators and target states. Evaluating PC-Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social-desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT-based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci-m-wang/GenPT.

2606.00857 2026-06-02 cs.RO cs.AI 版本更新

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

从线索到视野:轨迹预测的动态风险视界剖面

Xinyi Ning, Zilin Bian, Dachuan Zuo, Semiha Ergan, Kaan Ozbay

发表机构 * Department of Civil and Urban Engineering, New York University(纽约大学土木与城市工程系) Department of Civil Engineering Technology and Environmental Management Safety, Rochester Institute of Technology(罗切斯特理工学院土木工程技术与环境安全管理系)

AI总结 提出风险视界剖面(RHP)模块,通过连续可学习的势场模型对未来风险分布进行建模,以提升轨迹预测的准确性,在highD和SHRP2数据集上分别降低5秒RMSE 25.0%和5秒minFDE 29.1%。

Comments 11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)

详情
AI中文摘要

准确可靠的车辆轨迹预测对于安全自动驾驶至关重要。最近的研究将安全风险纳入轨迹预测,以量化周围代理带来的危险。然而,大多数风险感知方法将过去的风险信息作为辅助信号来帮助决策,忽视了其未来的演变和不确定性。在本文中,我们提出了一种风险视界剖面(RHP)模块,该模块结合了连续、可学习的势场模型,用于风险感知轨迹预测。RHP模块计算周围物体的时空接近度,以描绘未来视界上的风险分布,通过自适应识别人类驾驶员认为的关键时刻,支持更好的轨迹预测。我们在两个不同驾驶设置的数据集上评估了我们的方法:highD(高速公路走廊)和SHRP2(城市街道),涵盖了包括安全、近碰撞和碰撞事件在内的多种风险场景。与基线方法相比,我们的框架在highD数据集上实现了5秒RMSE降低25.0%,在SHRP2上实现了5秒minFDE降低29.1%。这些结果表明,该方法在短视界和长视界预测中均表现出色,并且在高速公路和城市场景中具有强大的泛化能力。所提出的方法能够实现更真实的自动驾驶车辆路径规划和策略选择,从而支持更安全的自动驾驶和更先进的驾驶员辅助系统。本工作的源代码可在以下网址获取:https://github.com/bilab-nyu/RHP

英文摘要

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0\% reduction in 5s RMSE on the highD dataset and a 29.1\% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: https://github.com/bilab-nyu/RHP

2606.00852 2026-06-02 cs.CV cs.AI cs.LG 版本更新

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

RefDiffNet: 在检测前学习暴露细微PCB缺陷

Vinay Edula, Nilesh Badwe, Priyanka Bagade

发表机构 * Department of Computer Science and Engineering Indian Institute of Technology Kanpur(计算机科学与工程系印度理工学院坎浦尔) Department of Materials Science and Engineering Indian Institute of Technology Kanpur(材料科学与工程系印度理工学院坎浦尔)

AI总结 提出RefDiffNet,一种轻量级即插即用的输入增强模块,通过引入无缺陷参考图像来突出缺陷区域,从而提升下游检测器在PCB缺陷检测中的性能。

详情
AI中文摘要

印刷电路板(PCB)缺陷检测具有挑战性,因为许多缺陷很小且难以与复杂的背景图案区分。大多数基于深度学习的PCB检测方法仅依赖被检测的PCB图像进行缺陷检测,忽略了编码走线、焊盘和其他PCB结构预期布局的无缺陷参考图像。在这项工作中,我们提出了RefDiffNet,一种轻量级即插即用的输入增强模块,放置在检测器主干之前,用于在缺陷检测前增强图像。RefDiffNet将经典检测中的一个成熟思想带入深度学习时代,利用无缺陷参考图像来揭示缺陷。RefDiffNet比较缺陷图像与对齐的参考图像,捕获相对于参考图像的结构变化,并使用轻量级编码器输出缺陷区域被突出的原始图像,从而简化下游检测器的任务。在HRIPCB和DeepPCB上的结果表明,RefDiffNet在各类检测器上一致地提升了性能,包括从YOLOv8到YOLOv26的单阶段检测器、基于Transformer的RT-DETR以及两阶段Faster R-CNN。它实现了高达18%的相对mAP50:95增益,且开销可忽略,仅引入0.004-0.005M额外参数和0.7-0.8 GFLOPs,最多占任何评估检测器参数量的0.25%。结果确立了RefDiffNet作为一种轻量级、即插即用、检测器无关的输入增强模块,以最小的计算成本显著提升PCB缺陷检测性能。

英文摘要

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.

2606.00844 2026-06-02 cs.CV cs.AI cs.LG 版本更新

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

MoEIoU:将边界框回归重新思考为混合专家模型

Vinay Edula, Priyanka Bagade

发表机构 * Indian Institute of Technology Kanpur(印度理工学院坎普尔分校)

AI总结 提出MoEIoU损失函数,通过混合专家模型联合优化重叠、中心对齐和长宽比,并采用课程学习权重调度,在多个数据集和YOLO架构上超越现有IoU损失。

详情
AI中文摘要

边界框回归是目标检测的基本组成部分,在精确目标定位中起着关键作用。现有的基于交并比(IoU)的损失函数通过引入几何惩罚项(如中心距离和长宽比不匹配)来扩展IoU目标,以改进边界框回归。然而,这些惩罚项通常在训练过程中保持不变,没有考虑优化动态:预测框在初始阶段表现出较大的中心距离和形状误差,而后期阶段则侧重于提高与真实框的重叠。为了解决这一局限性,我们引入了MoEIoU,一种基于混合专家的回归损失,它联合建模了重叠、中心对齐和长宽比不匹配。MoEIoU使用log-sum-exp函数聚合这些组件,该函数强调主要的定位误差,同时保持其他项的平滑贡献。此外,采用基于课程的权重调度,在早期训练阶段优先纠正框的位置和形状,在后期阶段提高重叠。我们在PASCAL VOC、HRIPCB和MS COCO上使用多种YOLO架构以及大规模模拟实验评估了所提出的MoEIoU。它始终优于标准和最新的最先进损失,表现出更快的收敛速度和更高的定位精度。我们进一步表明,这种自适应聚合改进了现有的基于IoU的损失,带来了一致的增益,并为目标检测框架中的边界框回归提供了更有效的优化指导。

英文摘要

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluated proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.

2606.00840 2026-06-02 cs.AI 版本更新

Certificate-Guided Evaluation of Reinforcement Learning Generalization

证书引导的强化学习泛化评估

Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal

发表机构 * School of Computer Science, Georgia Institute of Technology(佐治亚理工学院计算机科学学院) School of Computing and Information Systems, Singapore Management University(新加坡管理大学 computing and information systems 学院)

AI总结 提出一个逻辑驱动框架,通过神经证书函数验证强化学习算法在未见任务上的泛化能力,并证明证书违规率与测试任务成功率负相关。

详情
AI中文摘要

本文提出了一个逻辑驱动框架,用于评估强化学习算法在泛化到未见任务方面的性能。我们的框架定义了一类归纳可达-避免任务,这些任务在任务动态中具有结构相似性,从而能够评估泛化能力。我们引入了一个神经证书函数,通过强制执行关键条件来验证强化学习算法生成的轨迹,从而作为强化学习泛化的试金石。我们通过实验证明了该方法在几个最先进的可泛化强化学习算法上的能力,在具有挑战性的连续环境中验证了泛化能力。我们的结果表明,证书函数违规率越低,成功解决的测试任务数量越多,突显了我们的框架在评估和区分强化学习算法泛化能力方面的有效性。这项工作为基准测试强化学习泛化提供了一种原则性方法。

英文摘要

This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method's capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization.

2606.00838 2026-06-02 cs.AI 版本更新

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

解耦行为克隆实现基于规范的强化学习中的可扩展归纳泛化

Vignesh Subramanian, Subhajit Roy, Suguman Bansal

发表机构 * School of Computer Science, Georgia Institute of Technology, USA(美国佐治亚理工学院计算机科学学院) Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India(印度理工学院坎浦尔分校计算机科学与工程系)

AI总结 提出DIBS方法,通过解耦任务特定策略学习与演化函数学习,利用行为克隆替代噪声奖励聚合,提升训练稳定性和零样本泛化能力。

详情
AI中文摘要

归纳泛化是强化学习泛化的一种框架,其中归纳相关的任务实例允许归纳相关的策略。先前的工作通过直接使用强化学习学习的高阶策略演化函数捕捉这种结构,但存在训练可扩展性差的问题:随着训练任务增加,聚合的奖励反馈变得嘈杂且冲突,破坏训练稳定性并削弱泛化能力。我们提出DIBS,一种解耦的行为克隆方法,将学习任务特定策略与学习演化函数分离。我们首先通过标准强化学习为每个任务学习独立的教师策略,然后通过行为克隆在教师标记的状态-动作对上拟合演化函数。这用密集、稳定的监督取代了嘈杂的奖励聚合。DIBS在训练稳定性和零样本泛化方面相比现有强化学习和元强化学习算法取得了显著改进。

英文摘要

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.

2606.00834 2026-06-02 stat.AP cs.AI cs.LG math.PR 版本更新

Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing

加纳五岁以下儿童疟疾住院人数的混合概率预测:高斯过程回归与Holt-Winters平滑

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

发表机构 * GAEC, Ghana(加纳农业和粮食部)

AI总结 针对加纳疟疾预测中季节性和数据不确定性挑战,提出结合高斯过程回归与Holt-Winters指数平滑的混合模型,实现概率性预测并评估其性能。

Comments 24 pages, 8 figures, accepted for publication in Artificial Intelligence in Medicine

详情
AI中文摘要

准确的疟疾预测在撒哈拉以南非洲仍是一个重大挑战,那里强烈的季节性、报告不确定性和非平稳传播动态降低了传统模型的可靠性。在加纳,地区级疟疾监测需要概率上严谨且数据有限时稳健的预测框架。本研究提出了一个混合框架,将高斯过程回归(GPR)与Holt-Winters指数平滑相结合,用于建模每月五岁以下儿童疟疾住院人数。GPR捕捉非线性行为和预测不确定性,而Holt-Winters稳定长期预测并保留季节结构。使用十年(2014-2023年)的地区级数据,通过滚动起点扩展窗口验证评估性能。混合模型实现了$R^2 = 0.9906$,而单独Holt-Winters为$0.8213$,$94.2\%$的残差在$\pm 2σ$范围内。2024-2028年的预测显示月平均住院人数约为8,000至12,200例。时空分析揭示了显著的生态异质性:北部高负担地区尽管绝对波动较大,但相对模式稳定。该框架为疟疾流行地区的早期预警和运营规划提供了一种可扩展的概率方法,支持加纳国家疟疾控制战略。

英文摘要

Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved $R^2 = 0.9906$ versus $0.8213$ for Holt-Winters alone, with $94.2\%$ of residuals within $\pm 2σ$ bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8{,}000 to 12{,}200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana's national malaria control strategy.

2606.00831 2026-06-02 cs.AI cs.LG 版本更新

Subliminal Learning is a LoRA Artifact

潜意识学习是LoRA的伪影

Todd Nief, Harvey Yiyun Fu, Mark Muchane, Ari Holtzman

发表机构 * Department of Computer Science, University of Chicago(芝加哥大学计算机科学系) Data Science Institute, University of Chicago(芝加哥大学数据科学研究所)

AI总结 本文发现潜意识学习是LoRA微调产生的伪影,其传递行为与LoRA秩呈倒U型关系,且完全微调下消失,表明该现象依赖于微调和评估上下文。

详情
AI中文摘要

潜意识学习是一种现象,语言模型可以通过看似无害的数据将行为特征传递给其他模型(Cloud et al., 2025)。在潜意识学习中,具有行为特征(例如对猫的痴迷)的教师模型可以将这种猫痴迷传递给仅在教师生成的数字序列上微调的学生模型。在本文中,我们提出疑问:这种意想不到的行为传递是如何发生的?我们表明,潜意识学习是LoRA的伪影。当潜意识学习发生时,传递与LoRA秩呈倒U型关系;在完全微调下也会消失。我们表明,潜意识学习高度依赖于微调和评估期间看到的上下文。例如,在微调期间使用默认系统提示(“你是Qwen,由阿里云创建。你是一个有用的助手。”)的Qwen模型,在生成时如果没有包含系统提示,则不会表现出潜意识学习。我们进一步证明,潜意识行为局限于在微调和评估期间都看到的标记(例如模型的默认系统提示、标准聊天模板标记等)上的计算。总体而言,潜意识学习似乎是LoRA超参数和微调上下文的脆弱伪影,使其成为行为传递的不稳定渠道。

英文摘要

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.

2606.00822 2026-06-02 cs.IR cs.AI 版本更新

SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval

SkillPager: 通过语义节点检索实现查询自适应的技能内导航

Zicai Cui, Zihan Guo, Weiwen Liu, Weinan Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Sun Yat-sen University(中山大学)

AI总结 针对基于技能的LLM代理在长过程文档中需要高效检索的问题,提出SkillPager两阶段框架,通过离线解析Markdown技能为类型化语义节点并在线利用MMR进行查询条件节点选择,在保持高上下文充分性的同时显著减少提示令牌。

Comments 20 pages, 6 figures

详情
AI中文摘要

基于技能的LLM代理越来越依赖长过程文档,但全文档提示浪费令牌并稀释对执行至关重要的信息。我们将此设置研究为技能内检索,其目标是从已知技能文档中为给定查询选择最小且执行充分的上下文。我们提出SkillPager,一个两阶段框架,离线将每个Markdown技能解析为类型化语义节点,并在线利用最大边际相关性(MMR)进行全局的、查询条件的节点选择。在包含395个技能和1,975个查询的基准测试中,SkillPager实现了78.89%的LLM判断上下文充分性,而穷举全文档基线为82.23%,同时减少了47.04%的提示令牌。粒度消融实验表明,将相同的检索算法应用于原始固定长度块可达到可比的81.77%充分性,但令牌成本增加了28.81%,证明效率提升源于类型化语义粒度而非检索算法本身。在基于图的基线中,SkillPager以12.16%的幅度优于最强基线。进一步的消融实验表明,支持内容在候选池中保留并通过自适应选择而非静态启发式移除时最为有效。这些结果将类型化文档内检索确定为基于技能的代理的一个独特访问问题。

英文摘要

Skill-based LLM agents increasingly rely on long procedural documents, but full-document prompting wastes tokens and dilutes information critical to execution. We study this setting as intra-skill retrieval, where the goal is to select a minimal, execution-sufficient context from a known skill document given a query. We present SkillPager, a two-stage framework that parses each Markdown skill into typed semantic nodes offline and leverages Maximal Marginal Relevance (MMR) to perform global, query-conditioned node selection online. On a benchmark of 395 skills and 1,975 queries, SkillPager achieves 78.89% LLM-judged context sufficiency, compared to 82.23% for the exhaustive full-document baseline, while reducing prompt tokens by 47.04%. A granularity ablation shows that applying the same retrieval algorithm to raw fixed-length chunks reaches a comparable 81.77% sufficiency but increases token cost by 28.81%, demonstrating that efficiency gains are driven by typed semantic granularity rather than the retrieval algorithm alone. Among graph-based baselines, SkillPager outperforms the strongest baseline by a margin of 12.16%. Further ablations show that supporting content is most effective when retained in the candidate pool and selected adaptively rather than removed by static heuristics. These results identify typed intra-document retrieval as a distinct access problem for skill-based agents.

2606.00819 2026-06-02 cs.AI 版本更新

Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

通过解码器层跳跃减轻大型语言模型中的幻觉

Hanze Li, Jinhao You, Yichen Guo, Kai Tang, Shuangyang Xie, Xiande Huang

发表机构 * De Artificial Intelligence Lab(德人工智能实验室)

AI总结 本文提出DeLask框架,通过动态跳过易产生幻觉的解码器层,利用梯度下降的等价性检测并抑制错误信号,从而减轻LLM幻觉并提升可靠性。

Comments 5 pages

详情
AI中文摘要

大型语言模型(LLM)在各种自然语言任务中表现出色,但其输出常常出现幻觉——与事实信息不符的内容。在这项工作中,我们对解码过程进行了全面的逐层分析,并揭示幻觉往往源自更深的解码器层。为了解决这个问题,我们引入了 extbf{DeLask}( extbf{De}coder extbf{La}yer extbf{Sk}ipping),一种新颖的解码框架,它动态跳过容易产生幻觉的层。DeLask利用理论洞察,即$L$层Transformer的前向计算在条件上等价于$L$步梯度下降。我们通过计算连续解码步骤导出的梯度之间的余弦相似度来定义\emph{漂移值},从而在下降方向反转时识别问题层。DeLask并非完全丢弃这些层,而是将其隐藏状态与前面层部分聚合,从而在抑制错误信号的同时保持一致性。跨不同LLM和基准的广泛实验表明,DeLask持续减轻幻觉并增强整体可靠性,为提升大规模语言模型的鲁棒性提供了一个轻量级且可泛化的解码框架。

英文摘要

Large Language Models (LLMs) have achieved strong performance across diverse natural language tasks, yet their outputs often suffer from hallucinations -- content that is misaligned with factual information. In this work, we conduct a comprehensive layer-wise analysis of the decoding process and reveal that hallucinations tend to originate from deeper decoder layers. To address this issue, we introduce \textbf{DeLask} (\textbf{De}coder \textbf{La}yer \textbf{Sk}ipping), a novel decoding framework that dynamically skips layers prone to producing hallucinations. DeLask leverages the theoretical insight that the forward computation of an $L$-layer Transformer is conditionally equivalent to $L$ steps of gradient descent. We define a \emph{driftance value} by computing the cosine similarity between gradients derived from consecutive decoder steps, identifying problematic layers when the descent direction reverses. Rather than discarding such layers entirely, DeLask partially aggregates their hidden states with preceding layers, thereby preserving consistency while suppressing erroneous signals. Extensive experiments across diverse LLMs and benchmarks demonstrate that DeLask consistently mitigates hallucinations and enhances overall reliability, providing a lightweight and generalizable decoding framework for improving the robustness of large-scale language models.

2606.00811 2026-06-02 econ.EM cs.AI 版本更新

Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand

没有电子的证书?AI驱动电力需求影响的理论与证据

Dana Golden, Aruna Balasubramanian, Niranjan Balasubramanian

发表机构 * Department of Economics, Stony Brook University(石溪大学经济系) Department of Computer Science, Stony Brook University(石溪大学计算机科学系)

AI总结 通过博弈论模型和自然实验,研究AI数据中心使用可再生能源证书和购电协议对电网可靠性、电价和排放的影响,发现证书无法解决时序错配问题,而共置储能可有效缓解。

详情
AI中文摘要

数据中心目前占美国电力需求的4.4%,但超大规模企业用于宣称碳中和的可再生能源证书(RECs)和购电协议(PPAs)在电网层面的有效性仍不明确。我们开发了一个博弈论模型,其中数据中心运营商在RECs、PPAs和表后共置之间选择,而发电商在内生融资成本下做出进入决策。该模型识别出一个时序楔子——消费与信用可再生能源发电之间的不匹配——作为核心机制,通过该机制,即使RECs覆盖100%的年消费量,AI需求也会降低可靠性、提高价格并增加排放。与储能共置直接解决了这一楔子,并通过消除发电商收入风险诱导最大的可再生能源进入。我们通过利用大型语言模型的分阶段发布作为自然实验来检验这些预测,使用双重差分法分析一个将AI活动与当地电网结果联系起来的新数据集。AI需求显著增加了数据中心附近的化石燃料发电、批发价格(在处理的PJM区域高达25%)和停电频率(每年额外0.5-1次停电),其影响随模型规模扩大而扩大。拥有现场发电的数据中心在电能质量效应上表现出符号反转,这与模型的预测一致,即表后容量吸收了需求峰值。反事实分析表明,边缘推理、空间重新分配和共置储能均能显著减轻电网影响,而仅依赖RECs的策略则不能。总之,我们的结果表明,AI对电网的外部性与采购设计及数据中心基础设施的空间组织紧密相关。

英文摘要

Data centers now account for 4.4% of United States electricity demand, yet the grid-level effectiveness of the renewable energy certificates (RECs) and power purchase agreements (PPAs) hyperscalers use to claim carbon neutrality remains unclear. We develop a game-theoretic model in which a data center operator chooses among RECs, PPAs, and behind-the-meter colocation while generators make entry decisions under endogenous financing costs. The model identifies a timing wedge -- the mismatch between consumption and credited renewable generation -- as a central mechanism through which AI demand degrades reliability, raises prices, and increases emissions even when RECs cover 100% of annual consumption. Colocation with storage addresses this wedge directly and induces the greatest renewable entry by eliminating generator revenue risk. We test these predictions by exploiting the staggered release of large language models as a natural experiment, using difference-in-differences on a novel dataset linking AI activity to local grid outcomes. AI demand significantly increases fossil generation, wholesale prices (up to 25% in treated PJM zones), and outage frequency (0.5--1 additional outages per year) near data centers, with impacts scaling in model size. Data centers with on-site generation exhibit a sign reversal in power-quality effects, consistent with the model's prediction that behind-the-meter capacity absorbs demand spikes. Counterfactual analyses show that edge inference, spatial reallocation, and colocated storage each substantially mitigate grid impacts, while REC-only strategies do not. Together, our results demonstrate that the externalities of AI to the grid are tightly coupled to procurement design and the spatial organization of data center infrastructure.

2606.00798 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH: 用于引导校准紧凑扩散模型的双分支分数蒸馏

Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo

发表机构 * Khulna University of Engineering & Technology(Khulna 工程与技术大学) University Clermont Auvergne(克莱蒙特-奥弗涅大学)

AI总结 针对类条件扩散模型参数压缩中无监督无条件分数分支导致引导失效的问题,提出双分支蒸馏框架DASH,通过独立监督两个分支并引入锚点正则化和课程迁移,在5.9倍压缩下保持与教师模型相近的FID和引导保真度。

Comments 14 pages, 7 figures, 4 tables; appendix with additional ablations and qualitative results

详情
AI中文摘要

类条件扩散模型的参数压缩揭示了输出级蒸馏中一个未被充分探索的局限性:无条件分数分支保持无监督,导致学生模型中无分类器引导差距欠定。该差距在每个去噪步骤中被放大,允许两个分支都崩溃为相同预测的退化解,使得引导在低输出级训练损失下无效。本文介绍了DASH,一种双分支蒸馏框架,独立监督两个分数分支,通过独立分支约束为每个训练样本唯一指定目标分支输出,并引入锚点项将条件预测正则化到真实噪声。该框架进一步引入了TIRT迁移,将教师收敛的每时间步重要性课程复制到学生中作为冻结先验,消除了在有限蒸馏预算内重新学习它的需要。在CIFAR-10和CIFAR-100上的实验表明,5.9倍压缩在50步DDIM采样下将质量保持在教师模型4个FID点以内,显著优于从头训练,且引导保真度良好保持。消融研究证实无条件监督是主要贡献,占总蒸馏增益的60%以上。课程迁移和锚点正则化提供互补收益,共同验证了双分支约束对于引导保持压缩的经验必要性。

英文摘要

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher's converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.

2606.00795 2026-06-02 cs.LG cs.AI 版本更新

Extending Causal Metamodeling to a non-Markovian Queue

将因果元建模扩展到非马尔可夫排队系统

Pracheta Amaranath, Anant Bhide, David Jensen, Peter Haas

发表机构 * Manning College of Information and Computer Sciences University of Massachusetts Amherst(信息与计算机科学学院麻省大学阿默斯特分校)

AI总结 本文通过相位型分布近似非指数分布,将模块化动态贝叶斯网络(MDBN)因果元建模方法从马尔可夫系统扩展到非马尔可夫排队系统,并解决了相位数选择、参数学习和采样间隔等挑战,实验表明在G/M/1队列上可实现数量级的推理加速。

Comments 12 pages

详情
AI中文摘要

离散事件仿真的元模型近似模拟模型的行为,而无需运行昂贵的仿真。先前的工作引入了模块化动态贝叶斯网络(MDBN)——一类元模型,可以使用单个训练模型估计一系列概率和因果查询(PCQ)——但该方法仅限于马尔可夫系统。在本文中,我们通过使用相位型分布近似非指数分布,启动MDBN向非马尔可夫排队的扩展。这种方法带来了新的挑战,包括在选择相位数量时平衡元建模精度和可处理性、高效学习元模型参数,以及选择用于通过离散时间MDBN近似连续时间仿真的采样间隔。我们为这些挑战提供了初步解决方案,从而产生了第一个针对非马尔可夫系统的因果元建模技术。在G/M/1队列上的实验表明,MDBN可以为PCQ提供准确的答案,并且相对于直接仿真,推理时间实现了数量级的加速。

英文摘要

Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work introduced modular dynamic Bayesian networks (MDBNs) -- a class of metamodels that can estimate a range of probabilistic and causal queries (PCQs) using a single, trained model -- but the method was limited to Markovian systems. In this paper, we initiate an extension of MDBNs to non-Markovian queues by approximating non-exponential distributions using phase-type distributions. This approach raises novel challenges, including balancing metamodeling accuracy and tractability when choosing the number of phases, efficiently learning metamodel parameters, and choosing the sampling interval that is used to approximate a continuous-time simulation by a discrete-time MDBN. We provide preliminary solutions to these challenges, yielding the first causal metamodeling technique for non-Markovian systems. Experiments on a G/M/1 queue demonstrate that the MDBN can produce accurate answers to PCQs with orders-of-magnitude speedup of inference times relative to direct simulation.

2606.00783 2026-06-02 stat.AP cs.AI math.PR stat.CO 版本更新

Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler

加纳非线性疟疾动力学的贝叶斯推断:基于集成马尔可夫链蒙特卡洛采样器

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

发表机构 * Ghanaian Agricultural and Environmental Council(加纳农业与环境委员会)

AI总结 针对加纳疟疾监测数据短、噪声大、空间异质性强的问题,提出一种贝叶斯非线性推断框架,结合三次基线与阻尼振荡核,通过仿射不变集成马尔可夫链蒙特卡洛采样器估计参数,实现了高精度拟合和概率预测,揭示了空间异质性并预测了2024-2026年疟疾回升趋势。

Comments 27 pages, 15 figures, published in Expert Systems with Applications

详情
Journal ref
Expert Systems with Applications, Volume 312, 131540 (2026)
AI中文摘要

可靠量化撒哈拉以南非洲疟疾动态受到短、噪声大且空间异质的监测记录阻碍。在加纳,2014年至2023年的卫生设施数据揭示了住院人数的非线性和年龄特异性波动,然而现有方法难以捕捉随机变异或提供可信的不确定性区间。本研究开发了一个贝叶斯非线性推断框架,该框架将三次基线与阻尼振荡核相结合,通过仿射不变集成马尔可夫链蒙特卡洛采样器进行估计。该框架适应有限数据,建模参数不确定性,并为五岁以下儿童和五岁及以上个体生成概率预测。结果显示较强的经验充分性(五岁以下:$R^2 = 0.9958$;五岁及以上:$R^2 = 0.9956$),残差低于$2\%$,且混合良好的后验分布确认了收敛性。区级分析揭示了显著的空间异质性,变异系数从库马西等城市中心的$<0.07$到姆波霍尔和东比亚等边缘地区的$>3.3$。2024-2026年的预测表明逐步回升:五岁以下儿童病例从137,000例增至149,000例,年长个体从348,000例增至375,000例,不确定性随时间扩大。通过生成概率预测,该贝叶斯框架为预测疟疾波动和加强加纳国家疟疾控制战略中的数据驱动决策提供了原则性工具。

英文摘要

Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal non-linear and age-specific fluctuations in hospital admissions, yet existing approaches struggle to capture stochastic variability or provide credible uncertainty bounds. This study develops a Bayesian nonlinear inference framework that integrates a cubic baseline with a damped oscillatory kernel, estimated via an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and generates probabilistic forecasts for children under five years and individuals aged five years or more. Results show strong empirical adequacy ($R^2 = 0.9958$ for $<5$ years; $R^2 = 0.9956$ for $\geq 5$ years) with residual errors below $2\%$ and well-mixed posteriors confirming convergence. District-level analysis reveals pronounced spatial heterogeneity, with coefficients of variation ranging from $<0.07$ in urban centres such as Kumasi to $>3.3$ in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: from 137,000 to 149,000 cases among children under five years and from 348,000 to 375,000 cases among older individuals, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework provides a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana's national malaria control strategy.

2606.00780 2026-06-02 cs.LG cs.AI 版本更新

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

基于Transformer世界模型的行为不变任务表示学习用于离线元强化学习

Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种结合信息论任务表示学习与Transformer随机世界模型的框架,通过提取行为不变的任务变量和保守值惩罚,解决离线元强化学习中的分布偏移和稀疏奖励问题,实现鲁棒泛化。

Comments ICML2026

详情
AI中文摘要

离线元强化学习利用静态数据集使智能体能够通过结合离线效率与元学习适应性来泛化到未见环境,但它面临来自上下文和策略分布偏移的关键挑战。这些问题阻碍智能体适应在线环境,并在稀疏奖励设置下进一步加剧。结果,智能体常常陷入固有的模式困境,无法实现鲁棒的泛化。在这项工作中,我们提出了一种新颖的框架,将信息论任务表示学习与基于Transformer的随机世界模型相结合。我们的方法提取对行为策略不变的任务定义潜在变量,从而有效缓解上下文分布偏移。为了进一步处理策略偏移和模型利用,我们对基于想象力的轨迹应用保守值惩罚,防止策略利用模型不准确性,同时保持鲁棒适应。大量评估表明,我们的方法在分布外和稀疏奖励设置下优于最先进的方法,具有优越的稳定性和泛化能力。

英文摘要

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.

2606.00775 2026-06-02 cs.CV cs.AI 版本更新

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

GIRL-DETR: 梯度隔离强化学习用于视频时刻检索

Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji

发表机构 * College of Electronics and Information Engineering, Sichuan University(四川大学电子信息工程学院) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院)

AI总结 针对视频时刻检索中连续代理损失与非可微指标不匹配导致的优化停滞问题,提出梯度隔离强化学习框架GIRL-DETR,通过冻结骨干网络并采用三阶段渐进强化学习策略直接优化tIoU指标,在轻量级模型中实现定位精度提升。

Comments 13 pages, 6 figures. Submitted to IEEE Transactions on Image Processing (TIP). Code is available at: https://github.com/Z-Shihang/GIRL-DETR

详情
AI中文摘要

视频时刻检索(VMR)任务要求精确定位与自然语言查询对齐的时间边界,但许多模型存在连续代理损失与非可微指标之间的不匹配,导致训练后期优化停滞,边界预测陷入次优解。尽管强化学习(RL)后训练成功优化了大模型的定位结果,但直接应用于轻量级网络容易破坏监督阶段建立的脆弱特征表示。为克服这一优化瓶颈,我们提出梯度隔离强化学习用于DETR(GIRL-DETR),首次将RL后训练引入轻量级时间定位框架。输入视频和文本特征首先通过跨模态交互(CMI)在进入Transformer编码器之前建立早期对齐。随后,文本引导门控(TGG)机制在Transformer解码器生成候选提案之前动态地将语义先验注入查询,为时间预测提供高信噪比输入。在监督训练达到收敛后,冻结骨干网络以保护特征流形,而检测头通过三阶段渐进强化学习(TPRL)策略直接优化非可微评估指标tIoU以提升定位精度。该方法实现了状态表示与指标优化的正交解耦。在Charades-STA、QVHighlights和TACoS上的实验表明,GIRL-DETR有效解决了代理损失退化问题,以最少的参数更新实现了显著的精度提升,为轻量级VMR模型中的RL应用提供了稳健的新途径。

英文摘要

Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, leading to optimization stagnation during the late stages of training and trapping boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.

2606.00771 2026-06-02 cs.LG cs.AI cs.SD 版本更新

Logit Distillation on Manifolds: Mapping by Learning

流形上的对数蒸馏:通过学习进行映射

Yiru Yang, Junling Wang, Nishant Kumar Singh, Luohong Wu, Haoran Yan

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院) Deutsche Bank Securities(德意志银行证券公司)

AI总结 提出一种层和点投影映射方法,将学生和教师表示对齐到高维嵌入空间,结合LoRA注入,在显著减少可训练参数的同时提高词错误率。

详情
AI中文摘要

提高几乎任何机器学习模型性能的一种简单方法是,不训练单个模型,而是训练多个使用不同算法的模型,这些模型对相同数据做出略有不同的预测和错误,从而提高平均预测和鲁棒性。然而,使用整个模型集成进行预测是繁琐且计算成本过高的,无法部署给大量用户,特别是当模型是大型神经网络时。为此,我们引入了一种层和点投影映射,在训练过程中将学生和教师表示映射到对齐的高维嵌入空间。所提出的方法结合LoRA注入,将学生模型的可训练参数减少到教师模型的不到1%,同时与其他蒸馏方法相比,显著提高了词错误率(WER),如消融研究所示。与专家混合不同,我们的方法可以快速并行训练。

英文摘要

A simple way to improve the performance of almost any machine learning model is not to train a single but several models with diverse algorithms which will make slightly distinct kinds of predictions and errors on the same data, and thus improve the average predictions and robustness. However, making predictions using a whole ensemble of models is cumbersome and computationally too expensive to allow deployment to a large number of users, especially if the models are large neural nets. In response to this, we introduce a layer and point wise projection mapping, which maps student and teacher representations into an aligned high-dimensional embedding space during training process. The proposed approach combined with LoRA injection reduces the student model trainable parameters to less than 1% of the teacher model, while significantly improving word error rate (WER) compared to other distillation methods, as demonstrated in ablation studies. Unlike a mixture of experts, our method can be trained rapidly and in parallel.

2606.00765 2026-06-02 cs.AI 版本更新

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

FALAT: 通过依赖引导搜索追踪LLM智能体轨迹中的失败

Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang, Tse-Hsun Chen

发表机构 * SPEAR Lab(SPEAR实验室) Concordia University(康科德大学) DePaul University(德保罗大学)

AI总结 提出FALAT框架,通过依赖引导搜索方法,在LLM智能体轨迹中识别导致失败的关键步骤和责任智能体。

详情
AI中文摘要

基于LLM的智能体越来越多地通过包含推理步骤、工具调用和智能体间通信的长轨迹来解决复杂任务。然而,当这些智能体失败时,通常不清楚是哪个智能体导致了失败,以及哪个步骤引入了决定性错误。这个归因问题具有挑战性,因为错误可以在轨迹中传播:后续动作可能看起来不正确,但仅仅是因为它们依赖于先前被破坏的状态。因此,失败归因不能被视为独立的步骤级分类。 我们提出FALAT,一个用于LLM智能体轨迹中失败归因的诊断框架。FALAT将归因问题框架化为一个依赖引导的搜索问题。它首先构建任务应如何解决的期望,并利用该期望识别轨迹中的可疑区域。然后,它追踪决策、工具输出和智能体消息之间的依赖关系,以区分引入错误的步骤和仅仅继承或传播先前错误的步骤。最后,FALAT评估纠正候选步骤是否足以恢复预期结果,从而能够识别责任智能体和决定性失败步骤。 我们在Who&When基准上评估FALAT,该基准包括算法生成和手工制作的多智能体失败轨迹。结果表明,FALAT持续改进了责任智能体和决定性步骤的归因。其最佳配置在算法生成轨迹上达到46.0%的步骤级准确率,在更具挑战性的手工制作轨迹上达到29.1%,优于专门的归因基线和直接提示的独立LLM。这些发现表明,依赖感知推理对于LLM智能体系统中可靠的失败诊断至关重要。

英文摘要

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.

2606.00756 2026-06-02 cs.AI 版本更新

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

CoMIC:云边系统中长周期LLM代理的协作记忆与洞察循环

Yannan Wang, Longli Yang, Zhen Liu, Abhishek Kumar, Carsten Maple

发表机构 * Beijing Jiaotong University(北京交通大学) The Alan Turing Institute(艾伦·图灵研究所) University of Warwick(沃里克大学)

AI总结 提出无需参数更新的云边框架CoMIC,通过集中式反思与分布式执行设计,利用语义子目标标识实现跨代理经验聚合,提升弱边缘代理在长周期任务中的进展率和动作基础。

详情
AI中文摘要

在边缘服务器上部署轻量级大语言模型(LLM)代理可以减少延迟并将代理服务更贴近用户,但资源受限的边缘模型在处理需要持久记忆、子目标跟踪和反思的长周期任务时往往表现不佳。部署后对边缘模型进行微调成本高昂且难以在异构节点上扩展,而纯本地记忆则使代理拥有孤立经验并导致提示上下文不断增长。我们提出 extsc{CoMIC},一种无需参数更新的云边框架,用于协作记忆与洞察循环。 extsc{CoMIC}遵循 extit{集中式反思,分散式执行}的设计:边缘代理使用面向子目标的分层记忆和选择性重新展开相关历史在本地执行,而云端LLM批评者异步评估完成的轨迹,过滤可重用经验,并通过语义子目标标识符聚合跨代理指导。在涵盖符号规划和文本交互的五项长周期代理任务中, extsc{CoMIC}提高了弱边缘代理的进展率和动作基础,并在不更新模型参数的情况下实现了任务相关的成功率提升。

英文摘要

Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose \textsc{CoMIC}, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. \textsc{CoMIC} follows a \textit{Centralized Reflection, Decentralized Execution} design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, \textsc{CoMIC} improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

2606.00754 2026-06-02 stat.ME cs.AI cs.LG 版本更新

Causal Density Functions

因果密度函数

Sridhar Mahadevan

发表机构 * Adobe Research(Adobe研究院) University of Massachusetts(马萨诸塞大学) Amherst(阿默斯特)

AI总结 提出因果密度函数作为干预分布与观测分布的Radon-Nikodym导数,用于局部密度比衡量因果效应,并给出估计与检验方法。

Comments 25 pages

详情
AI中文摘要

我们引入因果密度函数:Radon-Nikodym导数,它比较干预分布与观测分布,因此作为因果效应的局部密度比。许多因果强度度量在图手术后的整个分布上进行比较,而因果密度函数提供了一个逐点的测度变换对象,可以估计、校准并用于评分有向影响。基本恒等式 \[ \mathbb{E}_{\mathrm{do}}[f(Y)] = \mathbb{E}_{\mathrm{obs}}\!\left[f(Y)ρ(X,Y)\right] \] 使得因果密度直接可检验:如果估计的密度比正确,通过ρ重新加权的观测期望重现干预期望。我们推导了do曲线和有向边得分的实用估计量,将构造与条件作用和干预的Radon-Nikodym/Kan语义联系起来,并在合成和真实扰动基准上评估了所得估计量。

英文摘要

We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act as local density ratios for causal effects. Whereas many causal-strength measures compare whole distributions after graph surgery, causal density functions provide a pointwise change-of-measure object that can be estimated, calibrated, and used to score directed influence. The basic identity \[ \mathbb{E}_{\mathrm{do}}[f(Y)] = \mathbb{E}_{\mathrm{obs}}\!\left[f(Y)ρ(X,Y)\right] \] makes causal density directly testable: if the estimated density ratio is correct, observational expectations reweighted by $ρ$ reproduce interventional expectations. We derive practical estimators for do-curves and directed edge scores, relate the construction to Radon-Nikodym/Kan semantics for conditioning and intervention, and evaluate the resulting estimators on synthetic and real perturbation benchmarks.

2606.00741 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment

量子隧穿感知机器学习:面向鲁棒部署的物理衍生噪声模型

Uiwon Hwang, Jaeho Hwang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Human-Centered Artificial Intelligence Research Institute(以人为本的人工智能研究院)

AI总结 本文提出量子隧穿感知机器学习(QTAML),通过WKB近似推导部署时的权重误差分布,并设计隧穿感知补偿(TAC)算法,在无需重训练和标签的情况下,以较低ECC开销恢复模型精度。

详情
AI中文摘要

晶体管缩放正接近量子力学极限,因为薄栅氧化物通过量子隧穿引起电子泄漏。与传统数字系统不同,只要错误结构被正确建模,AI推理可以容忍此类错误。在本文中,我们引入量子隧穿感知机器学习(QTAML)。我们使用Wentzel-Kramers-Brillouin(WKB)近似从第一性原理推导部署时的权重误差分布,并表明它具有通用高斯噪声模型所忽略的结构:精确的仿射均值漂移、由最高有效位主导的逐位方差层级,以及依赖于$\|W_\ell\|_\infty$和训练网络Jacobian的逐层依赖性。我们将这三个结构属性打包成一个单一的部署时算法——隧穿感知补偿(TAC),该算法结合了闭式均值校正和基于WKB方差分解的最优逐层自适应比特预算分配。在$p_\mathrm{flip}=0.10$的四个卷积架构和$p_\mathrm{flip}=0.05$的一个Transformer编码器上,TAC达到了干净精度的95%,同时ECC开销比从相同物理导出的自然基线Uniform-MSP低3.4倍到33.6倍。闭式饱和比$ ho^*$预先预测了这些增益,在异构架构上,WKB导出的评分在小预算下比基于幅度的分配高出多达24个百分点。该算法无需重训练、无需标签,且无推理时开销。我们还验证了WKB导出的分布定理达到蒙特卡洛精度。这些结果将WKB隧穿物理与噪声感知深度学习联系起来,并为超越传统缩放极限的硬件-软件协同设计提供了一条有原则的路径。

英文摘要

Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike conventional digital systems, AI inference can tolerate such errors provided their structure is modeled correctly. In this paper, we introduce quantum tunneling-aware machine learning (QTAML). We derive the deployment-time weight-error distribution from first principles using the Wentzel-Kramers-Brillouin (WKB) approximation and show that it has structure that generic Gaussian noise models miss: an exact affine mean drift, a per-bit variance hierarchy dominated by the most-significant bit, and a per-layer dependence on $\|W_\ell\|_\infty$ and the trained-network Jacobian. We package these three structural properties into a single deployment-time algorithm, Tunneling-Aware Compensation (TAC), that combines closed-form mean correction with an optimal layer-adaptive bit-budget allocation derived from the WKB variance decomposition. Across four convolutional architectures at $p_\mathrm{flip}$=0.10 and a transformer encoder at $p_\mathrm{flip}$=0.05, TAC reaches $95\%$ of clean accuracy with 3.4$\times$ to 33.6$\times$ less ECC overhead than Uniform-MSP, the natural baseline derived from the same physics. The closed-form saturation ratio $ρ^*$ predicts these gains in advance, and on heterogeneous architectures WKB-derived scoring outperforms magnitude-based allocation by up to 24 percentage points at small budgets. The algorithm requires no retraining, no labels, and no inference-time overhead. We also verify the WKB-derived distributional theorems to Monte Carlo precision. These results connect WKB tunneling physics with noise-aware deep learning and suggest a principled path toward hardware--software co-design beyond conventional scaling limits.

2606.00738 2026-06-02 cs.LG cs.AI cs.CV 版本更新

SORA: Free Second-Order Attacks in Fast Adversarial Training

SORA:快速对抗训练中的自由二阶攻击

Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering, Sharif University of Technology, Tehran, Iran(谢赫大学计算机工程系)

AI总结 针对快速对抗训练中的灾难性过拟合问题,提出通过扰动变异性和梯度对齐指标PertAlign来预测并防止过拟合,并设计自适应步长方法SORA,实现最优鲁棒性和干净准确率。

Comments Accepted at ICML 2026

详情
AI中文摘要

对抗训练是对抗性样本的主要防御手段,但在高效的单步变体中常常遭受灾难性过拟合,即尽管单步性能很高,但对多步攻击的鲁棒性却崩溃。我们通过两个贡献来解决这种失效模式。首先,我们形式化了epsilon过拟合(EO),这是一种固定扰动幅度和方向加剧CO的视角,并表明引入扰动变异性可以显著提高不同架构和数据集上的鲁棒泛化能力。其次,我们提出了PertAlign(扰动对齐),这是一种理论上合理、计算开销可忽略的指标,通过测量攻击阶段的梯度对齐来预测CO的发生。利用这些见解,我们引入了SORA,一种自适应步长的AT方法,它根据损失曲面几何动态调整扰动。SORA始终能防止CO,实现最先进的鲁棒性和干净准确率,并使用一组固定的超参数在数据集和架构上泛化,这对于快速AT的适用性至关重要。在不同数据集和架构上的大量实验表明,SORA在提供更高干净准确率和卓越效率的同时,匹配或超越了先前方法的鲁棒性。代码可在https://github.com/SecondOrderAT/SORA获取。

英文摘要

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.

2606.00726 2026-06-02 cs.AI 版本更新

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

潜在奖励引导:一种自适应推理时框架,隐式促进推理大语言模型中的认知行为

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li

发表机构 * Rutgers University(罗格斯大学) South China Agricultural University(华南农业大学) Columbia University(哥伦比亚大学) Fenz.AI QuantaAlpha Adobe Santa Clara University(圣克拉拉大学) City University of Hong Kong(香港城市大学)

AI总结 提出潜在奖励引导(LRS)框架,通过优化稀疏自编码器潜在状态隐式促进认知行为,利用最终答案正确性训练潜在奖励模型估计中间状态质量,并在推理时提供状态特定的修正方向,实验表明该方法能提升推理性能并修复原始推理错误。

详情
AI中文摘要

强推理不仅依赖于模型知识,还取决于生成过程中认知行为的有效部署。现有方法通常依赖显式的行为级控制,当失败和所需修正因推理状态、任务和模型而异时,其适应性不足。为此,我们提出潜在奖励引导(LRS),一种自适应推理时框架,通过优化隐式携带认知行为的稀疏自编码器(SAE)潜在状态来促进认知行为。LRS不依赖预定义的认知行为或由此衍生的引导方向,而是基于最终答案正确性在推理轨迹上训练潜在奖励模型,以估计中间潜在状态的质量。推理时,奖励梯度为脆弱的潜在状态提供状态特定的修正方向,而奖励与置信度门控将干预限制在奖励信号标记为脆弱的状态上。在多个推理LLM骨干和基准上的实验表明,LRS一致地提升了相对于各种基线的性能,事后分析进一步表明LRS隐式促进了修复原始推理错误的良好认知行为。代码见:https://github.com/jiakanglee/Latent-Reward-Steering。

英文摘要

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that \ours consistently improves performance over various baselines, and post-hoc analyses further indicate that \ours implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.

2606.00724 2026-06-02 cs.CL cs.AI 版本更新

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

WaveFilter: 通过小波引导的KV缓存过滤增强扩散型大语言模型的长上下文能力

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu

发表机构 * Nanjing University of Science and Technology(南京理工大学) Alibaba Group(阿里巴巴集团) Huzhou Normal University(湖州师范学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 针对扩散型大语言模型在长上下文任务中计算开销大和推理延迟高的问题,提出一种无需训练的通用缓存框架WaveFilter,利用小波变换分解长序列以精确识别关键token,构建稀疏KV缓存,从而提升现有KV缓存方法在复杂长上下文任务中的性能。

Comments 8 pages,3 figures

详情
AI中文摘要

扩散型大语言模型(DLMs)在各种任务中展现出显著优势。然而,受限于其多步迭代推理机制,它们在长上下文任务中的计算开销和推理延迟已成为限制其大规模部署的核心瓶颈。在处理长序列时,现有的键值(KV)缓存机制常常面临生成质量急剧下降的困境,其核心挑战在于如何在超长上下文中精确且高效地过滤关键token。受人类阅读过程的启发,我们提出了 extbf{WaveFilter},一个通用的、无需训练的缓存框架。该框架创新性地引入小波变换来分解长序列,以实现关键token的精确识别,并基于此构建稀疏KV缓存以计算最终的上下文表示。实验结果表明,WaveFilter作为一个即插即用的通用框架,显著提升了现有主流KV缓存方法在复杂长上下文任务中的性能。

英文摘要

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often face a dilemma where generation quality degrades drastically, where the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose \textbf{WaveFilter}, a universal and training-free caching framework. This framework innovatively introduces the wavelet transform for decomposition of long sequences to achieve precise identification of key tokens, based on which a sparse KV Cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV Cache methods in complex long-context tasks.

2606.00722 2026-06-02 cs.CL cs.AI 版本更新

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

EPIC: 扩散语言模型在上下文无关文法约束下的高效并行推理

Hyundong Jin, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 提出EPIC框架,通过词法记忆化、Earley解析验证和松弛兼容子集选择,解决扩散语言模型在CFG约束解码中的低效和并行性损失问题,推理时间降低67.5%,额外开销减少90.5%。

详情
AI中文摘要

控制语言模型输出对于确保结构有效性、可靠性和下游可用性至关重要,扩散语言模型也不例外。最近扩散语言模型解码的进展已将输出控制从常规约束扩展到上下文无关文法(CFG)约束。然而,现有方法的速度可能比无约束解码慢四倍。更重要的是,它们大大削弱了扩散语言模型相对于自回归模型的关键优势之一,即并行解码。这种减慢是因为在并行生成过程中,顺序有效性检查引入了显著开销。我们提出了一个高效的CFG约束解码框架EPIC,解决了这一限制。我们的方法通过结合词法记忆化、使用Earley风格解析(而非确定性自动机)进行验证,以及用于并行提交的松弛兼容子集选择,提高了解码效率。它减少了重复的词法分析和验证开销,同时允许多个兼容令牌一起提交。在三个基准测试上使用四个模型的实验表明,与现有的CFG约束解码方法相比,我们的方法将推理时间减少了高达67.5%,并将额外开销降低了高达90.5%。我们的实现可在https://github.com/hyundong98/EPIC-Decoding.git获取。

英文摘要

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC-Decoding.git .

2606.00718 2026-06-02 cs.AI math.OC 版本更新

LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization

LLM驱动的双组件耦合组合优化的协同进化自动启发式设计

Mingen Kuang, Xudong Deng, Xi Lin, Ye Fan, Jianyong Sun, Jialong Shi

发表机构 * Xi’an Jiao Tong University(西安交通大学) Northwestern Polytechnical University(西北工业大学)

AI总结 提出CoEvo-AHD框架,利用大语言模型协同进化两个算子种群,通过合作评估和联合交叉发现互补逻辑,解决旅行窃贼问题等耦合组合优化问题。

详情
AI中文摘要

虽然大语言模型(LLMs)最近在自动启发式设计(AHD)中展现出潜力,但现有方法通常将启发式作为单一算子或搜索策略生成和进化,限制了它们在诸如旅行窃贼问题(TTP)和旅行采购问题(TPP)等问题中对多个决策子结构之间强耦合建模的能力。在这项工作中,我们提出CoEvo-AHD,一个LLM驱动的双种群协同进化框架,用于耦合组合优化中的自动启发式设计。与先前单独进化单个算子的方法不同,CoEvo-AHD利用LLMs协同进化两个紧密相关的算子种群。合作评估机制明确捕获路径和选择算子之间的交互,而成对评分和协同联合交叉有助于发现互补的算子逻辑,以在耦合决策子空间上实现联合改进。我们进一步设计了一个工具调用环境库,将常用核心操作(如局部搜索增量计算)封装为可调用函数,使LLM生成的算子能够使用标准化接口,而不是重新实现低效且易出错的问题特定循环。在TTP和TPP上的实验表明,CoEvo-AHD自动发现合作启发式组合,并达到与传统启发式竞争的解质量。

英文摘要

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and evolve heuristics as a single operator or search strategy, limiting their ability to model strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD leverages LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures interactions between route and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic for joint improvement across coupled decision subspaces. We further design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, into callable functions, enabling LLM-generated operators to use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves competitive solution quality against traditional heuristics.

2606.00717 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Multi-Agent Conformal Prediction with Personalized Statistical Validity

具有个性化统计有效性的多智能体共形预测

Martin V. Vejling, Christophe A. N. Biscio, Adrien Mazoyer, Petar Popovski, Shashi Raj Pandey

发表机构 * Department of Electronic Systems(电子系统系) Aalborg University(奥尔堡大学) Department of Mathematical Sciences(数学科学系) Institut de Mathématiques de Toulouse(图卢兹数学研究所) Université de Toulouse(图卢兹大学)

AI总结 提出个性化联邦加权共形预测框架,通过局部密度比加权和加权分位数聚合,在保护隐私的同时纠正数据异质性,为每个参与智能体提供渐近有效的边际和校准条件覆盖保证。

详情
AI中文摘要

不确定性量化在高风险机器学习任务中至关重要。然而,共形预测这一原则性解决方案在局部校准数据有限、隐私约束和数据异质性下面临挑战。在多智能体设置中,现有工作无法同时令人满意地解决这些挑战,其保证要么限于智能体间的平均值,要么在异质性设置中失去有效性。因此,我们提出个性化联邦加权共形预测(PFWCP),该框架结合局部密度比加权与加权分位数聚合,以在保护隐私的同时纠正异质性。该方法为每个参与智能体提供渐近有效的边际和校准条件覆盖保证,并支持一次性通信协议。理论分析呈现了对覆盖方差的调整,该调整由有效样本量表达式控制,这在加权共形预测的背景下是必要的,并且在合成和真实数据集上的实验表明,与最先进的联邦共形基线相比,校准质量有所提高。

英文摘要

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not simultaneously and satisfactorily address these challenges with guarantees either limited to averages across agents or losing validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.

2606.00708 2026-06-02 cs.AI cs.LG 版本更新

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

MOSAIC:结构化智能体智能与组合的模块化编排

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge, Lei Jiang, Kevin Zhang, Raad Khraishi, Yihao Ang, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系) University College London(伦敦大学学院) University of Edinburgh(爱丁堡大学) Data & Analytics, Digital X(Digital X 数据与分析部) Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出MOSAIC框架,通过结构化智能体编排、记忆驱动的模型选择和蓝图构建,将自动化数据科学转化为可验证、可复用的模型选择问题,在金融时间序列任务中优于AutoML和智能体基线。

详情
AI中文摘要

自动化数据科学是一个结构化的模型选择问题。解决方案必须为任务选择数据转换、特征表示、架构、训练过程、评估协议和优化策略。AutoML系统自动化了该过程的部分环节,但通常是在预定义的流水线、模型和超参数空间内搜索。基于LLM的智能体通过检索、代码生成和执行反馈提供了更大的灵活性,但其建模决策通常是非结构化的、难以验证且难以复用。我们引入了 extsc{MOSAIC}(结构化智能体智能与组合的模块化编排),一个用于记忆驱动的模型选择和工作流构建的结构化智能体框架。给定任务和数据集, extsc{MOSAIC}构建语义任务画像,检索先前的案例和源代码模块,并构建蓝图:一个指定所选建模组件、组合、接口约束和执行需求的中间表示。该蓝图将模型选择转化为分阶段、上下文驱动的搜索,并将基于LLM的代码生成建立在检索证据而非无约束合成之上。候选模型通过执行验证,并使用诊断反馈、训练轨迹、任务指标以及一个失败感知的强化学习策略进行优化。我们在金融时间序列预测和生成任务上实例化了 extsc{MOSAIC},其中模型必须满足预测准确性、分布保真度、执行可靠性以及下游金融标准(如风险和尾部行为)。与AutoML和智能体基线的实验表明, extsc{MOSAIC}提高了任务性能、执行成功率和决策可追溯性,证明了将自动化数据科学视为结构化、可复用且基于执行的模型选择的价值。

英文摘要

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce \textsc{MOSAIC} (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, \textsc{MOSAIC} builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate \textsc{MOSAIC} on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that \textsc{MOSAIC} improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

2606.00703 2026-06-02 cs.IT cs.AI cs.LG math.IT 版本更新

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

通过约化到压缩高斯均值估计的比特约束随机优化的信息论下界

Munsik Kim

AI总结 本文通过将强凸二次族优化问题精确约化为交互式压缩高斯均值估计问题,推导出比特约束随机优化的无条件下界,并给出近乎匹配的可实现性结果。

详情
AI中文摘要

低精度预训练(FP8, MXFP4, NVFP4)现已成为前沿语言模型的标准,但文献几乎完全是可实现性——算法和经验缩放定律——没有匹配的信息论可能性的刻画。我们研究B比特量化随机一阶预言机:优化器与T轮交互,每轮接收其随机梯度的B比特自适应公共硬币描述。我们的主要贡献是将强凸二次族优化精确约化为交互式压缩高斯均值估计——在B比特预言机下,查询不携带信息,因此优化完全坍缩为顺序分布式估计问题。这产生了两个无条件下界:通信界TB = Omega(d)和统计界T = Omega(sigma^2 d / eps^2),以及尖锐的乘积形式界T = Omega((sigma^2 d / eps^2) max{1, d/B})。乘积形式也是无条件的:B比特转录本最多携带关于均值的O(TB / sigma^2) Fisher迹,因此比特而非维度限制了可恢复信息,结合多元van Trees不等式直接给出该界,无需有界似然比截断。我们给出了一个近乎匹配的可实现性结果,在有限动态范围预言机下精确计算每轮比特,紧至对数因子;下界针对真正高斯(无界)梯度,而缩小这一预言机差距留待未来。顺序率失真视角将约化扩展到相关和漂移预言机,并修正了先前的猜想:正噪声相关性将界提高(1+rho)/(1-rho)倍而非放松。这些界为任何低位梯度路径提供了信息论基线,而非关于已部署FP4系统的最优性声明。

英文摘要

Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability -- algorithms and empirical scaling laws -- with no matching characterization of what is information-theoretically possible. We study a B-bit quantized stochastic first-order oracle: an optimizer interacts for T rounds and receives, each round, a B-bit adaptive public-coin description of its stochastic gradient. Our main contribution is an exact reduction from optimizing a strongly convex quadratic family to interactively compressed Gaussian mean estimation -- under the B-bit oracle the query carries no information, so optimization collapses exactly onto a sequential distributed-estimation problem. This yields two unconditional lower bounds, a communication bound TB = Omega(d) and a statistical bound T = Omega(sigma^2 d / eps^2), and the sharp product-form bound T = Omega((sigma^2 d / eps^2) max{1, d/B}). The product form is also unconditional: a B-bit transcript carries at most O(TB / sigma^2) of Fisher trace about the mean, so bits rather than dimension limit the recoverable information, and combined with the multivariate van Trees inequality this gives the bound directly, without bounded-likelihood-ratio truncation. We give a near-matching achievability result with exact per-round bit accounting under a bounded-dynamic-range oracle, tight up to a logarithmic factor; the lower bound is for truly Gaussian (unbounded) gradients, and closing this oracle gap is left open. A sequential rate-distortion perspective extends the reduction to correlated and drifting oracles and corrects an earlier conjecture: positive noise correlation raises the bound by (1+rho)/(1-rho) rather than relaxing it. The bounds give an information-theoretic baseline for any low-bit gradient path, not an optimality claim about deployed FP4 systems.

2606.00702 2026-06-02 cs.RO cs.AI 版本更新

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

塑造你的身体:用于多形态机器人设计的价值梯度

Nico Bohlinger, Jan Peters

发表机构 * Technical University of Darmstadt(德累斯顿技术大学) Robotics Institute Germany (RIG)(德国机器人研究所) German Research Center for AI (DFKI)(德国人工智能研究中心) hessian.AI(黑森AI)

AI总结 提出将通用多形态价值函数转化为可复用模型,通过价值梯度优化机器人设计,无需为每个机器人重新进行强化学习协同设计。

详情
AI中文摘要

我们提出将通用多形态价值函数转化为可复用的机器人设计模型。不是为每个机器人运行新的强化学习协同设计循环,而是首先在多种机器人设计上训练一个感知形态的策略和价值函数。训练后,冻结的价值函数被用作可微分的代理,通过价值梯度优化候选形态。我们在不同的机器人设计设置中评估了我们的方法,从受扰动的单个机器人到跨形态类别的保留机器人,使用在多达50个机器人和超过1100个连续形态参数的设计空间上训练的单个模型。除了优化完整形态,我们还展示了价值梯度可以识别限制性能的设计和控制参数,从而能够优化和分析新的机器人设计。

英文摘要

We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.

2606.00700 2026-06-02 cs.LG cs.AI 版本更新

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

COPF:演化图中部署稳定的反事实公平性在线框架

Sheng'en Li, Dongmian Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对演化图上的在线链接推荐,提出COPF框架,通过反事实暴露机会差距、显式探索和残差不可区分性审计,实现部署稳定的公平性监控与控制。

Comments Accepted at ICML 2026

详情
AI中文摘要

演化图上的在线链接推荐是表演性的:通过选择向用户展示哪些候选链接,系统会改变哪些链接形成以及后续观察到的反馈。因此,来自记录结果的公平性估计可能具有误导性,并且在推荐策略更新后部署时可能会漂移。我们引入了COPF(反事实在线表演性公平性),这是一个用于在线链接推荐中部署稳定的公平性监控和控制的决策层框架。COPF (i) 定义了暴露(展示 vs. 未展示)反事实上的群体级机会差距,(ii) 通过显式探索和记录每个候选被展示的概率(倾向性)使其可估计,以及(iii) 使用图感知双重稳健(GA-DR)估计器,在可配置的审计器族上通过残差结果不可区分性(OI)审计和控制公平性。我们提供了一个噪声传递定理,表明在时间混合和有界局部干扰下,估计的GA-DR残差上的残差OI意味着暴露反事实群体差距的界限,并实例化了一个在线多校准审计器以及一个原始-对偶控制器。在两个TGB流和一个受控的合成二分图流上的实验表明,COPF减少了暴露反事实群体差距的最坏情况峰值,同时对排序效用的影响较小。我们的代码可在 https://github.com/lsnnnnnnnn/COPF 获取。

英文摘要

Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision-layer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal-dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.

2606.00674 2026-06-02 cs.LG cs.AI 版本更新

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

结果优化的悖论:LLM中推理捷径的因果信息论界限

Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding, Yining Sun

发表机构 * HFIPS, Chinese Academy of Sciences(中国科学院HFIPS) University of Science and Technology of China(中国科学技术大学)

AI总结 针对基于结果强化学习的LLM在分布外任务中推理脆弱的问题,提出因果信息论框架解释奖励诱导的流形坍缩,并证明过程奖励模型作为拓扑滤波器可消除低复杂度捷径。

详情
AI中文摘要

通过基于结果的强化学习(RL)对齐的大型语言模型(LLM)经常表现出一种关键失败模式:它们在分布内基准测试上取得高性能,但在分布外(OOD)任务上推理能力脆弱。我们将这种现象称为奖励诱导的流形坍缩。我们建立了一个理论框架,将结构因果模型(SCM)和信息瓶颈(IB)原理联系起来,以解释这一悖论。我们将推理定义为高复杂度的因果过程,将捷径学习定义为利用低复杂度的虚假相关性。在随机梯度下降(SGD)的隐式归纳偏置下,只要训练分布允许对真实因果机制进行“马尔可夫筛选”,优化结果奖励的模型就会偏向于捷径解。我们基于语义覆盖度量($\eta$)而非样本量推导了一个新的泛化界限,说明了为什么在同质分布上扩展数据可能无法纠正推理缺陷。我们还表明,过程奖励模型(PRM)作为拓扑滤波器,通过强制执行逐步互信息约束,使得低复杂度的捷径流形不可行。这些结果为过程监督在简单信用分配之外的作用提供了数学基础。

英文摘要

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a ``Markovian Screening'' of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure ($η$) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.

2606.00672 2026-06-02 cs.AI cs.LG 版本更新

Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling

基于边缘感知交互风险建模的阿尔茨海默病患者药物感知金融剥削检测

Farzana Akter, Lisan Al Amin, Rakib Hossain, Chaitanya Gunupudi, Faisal Quader

发表机构 * Cognitive Links LLC University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 提出一种药物感知框架,通过同步药物依从性与交易监控,利用交互感知逻辑模型提升对认知风险金融事件的检测,尤其在药物脆弱窗口期召回率从0.7442提升至0.9070。

详情
AI中文摘要

金融剥削对阿尔茨海默病患者日益构成威胁,尤其是在认知稳定性下降期间。传统欺诈检测系统通常仅依赖金融行为,忽略可能改变脆弱性的临床相关因素。本文提出一种药物感知框架,将药物依从性与交易级监控同步,以改进对认知风险金融事件的检测。构建了180名患者45天的混合模拟数据集,产生8,100条药物记录和30,855笔交易。该框架通过纯金融、加性药物感知和交互感知逻辑模型评估金额异常、商家新颖性、交易频率、时间偏差和药物依从性。结果表明,纯金融基线获得了最高的全局F1分数0.5000,但交互感知模型在药物诱导脆弱窗口期内将召回率从0.7442提升至0.9070,并在排名高风险案例中实现了最高平均精度。研究结果表明,药物依从性作为金融风险的上下文修饰因子比作为孤立预测因子更有用。

英文摘要

Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may alter vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve detection of cognitively risky financial events. A hybrid simulation dataset was constructed for 180 patients across 45 days, producing 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence through financial-only, additive medication-aware, and interaction-aware logistic models. Results show that the financial-only baseline obtained the highest global F1-score of 0.5000, but the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision for ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as an isolated predictor.

2606.00671 2026-06-02 cs.AI cs.CL cs.LG 版本更新

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

AXIOM: 一种用于可验证数学推理的信任优先神经符号执行架构

Alessio Bruno

发表机构 * Independent researcher(独立研究者)

AI总结 提出AXIOM架构,将语言模型限制为规范化器,通过确定性计算机代数系统管道实现可验证的数学推理,在4个MATH类别上达到94.36%的正确率和100%的信任度。

Comments Preprint. 12 pages, 2 figures. Live interactive demo: https://huggingface.co/spaces/Squagghy/axiom-solver. Paper artifact and dataset on Zenodo (concept-DOI): 10.5281/zenodo.20440225

详情
AI中文摘要

我们提出AXIOM,一种用于自然语言数学推理的信任优先神经符号执行架构。在AXIOM中,语言模型严格作为规范化器:它将非正式问题文本重写为狭窄的模式,由确定性计算机代数系统(CAS)管道消费,该管道推导并验证答案,或作为第一类输出弃权。路由遵循问题形状正则表达式、特定模式提示和封闭形式CAS处理器之间的1:1:1对齐,已交付3100多条这样的路由,并在250多个连续提交中零LOST_CORRECT回归。我们在4个MATH类别上报告了实证结果,累积正确率为94.36%(2,592/2,747),可解析问题的信任度为100.00%(在整个2,747条记录基准测试中零自信错误答案),所有四个领域均高于每个领域70/90/70的阈值,每个领域信任度为100.0%,仅规则处理器的中位延迟为1毫秒(在lm-eval算术20,000条记录基准测试中占88%的记录)。该架构通过公共部署已服务约30,000次生产查询。我们强调的贡献不是最终的准确率数字,而是该架构建立的向前动态:生产中的每个记录弃权在一次发布周期后都是候选正确,因为新任务在不回归注册表的情况下组合。支撑这一特性的操作纪律——数学模板分桶、LOST_CORRECT扫描作为回归预言机、可解析优先接入以及弃权作为第一类输出——构成了一个可迁移的框架,适用于数学之外的值得信赖的神经符号系统。

英文摘要

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property -- math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output -- constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.

2606.00670 2026-06-02 cs.SD cs.AI 版本更新

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

超越口部:声学不确定性下视听句子识别中的上半脸情感线索

Zhou Yang, Yueyi Yang

发表机构 * Faculty of Education and Psychology, University of Oulu, Finland(奥卢大学教育与心理学学院,芬兰) Center for Machine Vision and Signal Analysis, University of Oulu, Finland(奥卢大学机器视觉与信号分析中心,芬兰)

AI总结 本研究利用CREMA-D语料库,通过特征分类器探究在声学退化条件下,上半脸情感信息是否有助于视听句子识别,发现上半脸情感线索能提升模型校准和鲁棒性。

详情
AI中文摘要

面对面言语理解本质上是多模态的,整合了声学信号与可见的发音、面部表情、头部运动及其他社交相关线索。虽然视听言语系统通常将口部区域作为语言信息的主要视觉来源,但情感面部表情常被单独视为情感识别目标。本文研究在声学退化条件下,上半脸情感信息是否有助于视听句子识别,超越音频和口部区域线索。使用CREMA-D视听情感言语语料库,我们在四种线索条件下训练基于特征的句子分类器:仅音频(A)、音频加口部/下半脸特征(A+M)、音频加上半脸特征(A+U)以及音频加口部和上半脸特征(A+M+U)。模型在干净音频和粉红噪声条件下(+10 dB、+5 dB和0 dB SNR)进行评估,采用演员独立划分。结果表明,在退化音频下,口部/下半脸特征提供了显著的鲁棒性优势。在0 dB SNR下,A+M相比A准确率提升0.0794,演员自举95%置信区间为[0.0296, 0.1298]。上半脸情感线索表现出更微妙的效果。尽管A+M+U相比A+M的直接准确率增益很小,但全脸模型在不同SNR水平上持续改善校准,并且在噪声条件下优于打乱的上半脸对照。这些发现表明,情感面部信息可能支持声学不确定性下的多模态鲁棒性和置信度估计,而不直接编码词汇内容。更广泛地说,该研究强调了社交表达性面部线索在以人为中心的视听交互系统中的潜在作用。

英文摘要

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems.

2606.00658 2026-06-02 cs.CV cs.AI 版本更新

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Wan2.2双专家视频扩散模型的协同少步蒸馏与低位量化

Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu, Yang Yong, Jinyang Guo, Xianglong Liu

发表机构 * IEEE ICME 2026 GCC Low-Bit-width Large Model Quantization Challenge(GCC 低精度大模型量化挑战)

AI总结 针对Wan2.2-T2V-A14B视频扩散模型,提出结合少步分布匹配蒸馏与低位量化的部署压缩流程,通过双专家去噪分支校准、敏感层保护及HiF4低位表示,在保持质量的同时降低计算开销。

详情
AI中文摘要

大型视频扩散模型实现了强大的视觉质量,但由于每个样本需要大量去噪步骤和较大的驻留参数足迹,部署成本仍然很高。本文研究了一种面向部署的压缩流程,针对Wan2.2-T2V-A14B模型,结合少步分布匹配蒸馏与低位量化。该流程遵循模型的双专家去噪路线,分别校准高噪声和低噪声分支,保护敏感入口层,并使用HiF4风格的低位表示以改善动态范围覆盖。量化是在蒸馏后的少步学生模型上校准,而非原始的长步轨迹上,从而减少推理过程中的激活分布不匹配。所提出的协同设计使量化模型保持接近同步全精度模型,并在平均8步和20步时超越原始全精度基线。在测试配置中,20步设置提供了最佳的质量-效率权衡。

英文摘要

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.

2606.00656 2026-06-02 cs.LG cs.AI 版本更新

Demystifying the Optimal Fair Classifier in Multi-Class Classification

揭秘多类分类中的最优公平分类器

Li Zhang, Yuyuan Li, XiaoHua Feng, Jiaming Zhang, Fengyuan Yu, Chaochao Chen

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) College of Computer Science(计算机科学学院) Technology, Zhejiang University(技术,浙江大学) School of Communication Engineering, Hangzhou Dianzi University(杭州电子科技大学通信工程学院)

AI总结 本文针对多类分类中的公平性问题,提出了一种在公平约束下最优分类器的概率公式,并设计了两种属性盲算法(处理中与处理后)以逼近最优精度-公平帕累托前沿。

Comments Accepted to ICML 2026

详情
AI中文摘要

确保不同群体之间的公平公正对待,特别是在多类分类任务中,由于机器学习模型中固有的持续偏差,构成了重大挑战。大多数现有的偏差缓解技术针对二元设置,而多维输出和复杂公平机制的存在使得它们扩展到多类场景既不直接也不有效。在本文中,我们研究了公平分类中两个基本且未解决的挑战:(i)刻画多类设置中的最优精度-公平前沿,以及(ii)设计在不同训练阶段达到此最优值的实用算法。为应对这些挑战,我们首先指定了公平约束下最优分类器的解析可处理概率公式。在此基础上,我们提出了两种属性盲算法以在实践中实施公平要求:一种是通过约简方法在训练期间进行公平干预的处理中方法,以及一种通过插件估计微调输出概率的处理后方法。理论分析表明,两种方法都收敛到最优精度-公平帕累托前沿。在多个数据集上进行的实验证明了我们的方法在平衡精度和公平性方面的优越性能。

英文摘要

Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challenge due to the persistent biases inherent in machine learning models. Most existing bias mitigation techniques are tailored to binary settings, and the presence of multi-dimensional outputs and complex fairness mechanisms makes their extension to multi-class scenarios neither straightforward nor effective. In this paper, we investigate two fundamental, unresolved challenges in fair classification: (i) characterizing the optimal accuracy-fairness frontier in multi-class settings, and (ii) designing practical algorithms that attain this optimum in different training phases. To tackle these challenges, we first specify an analytically tractable probabilistic formulation of the optimal classifier under fairness constraints. Building upon this, we propose two attribute-blind algorithms to enforce fairness requirements in practice: an in-processing approach for fairness intervention during training via the reduction approach, and a post-processing approach for fine-tuning output probabilities with plug-in estimation. Theoretical analysis reveals that both methods converge to the optimal accuracy-fairness Pareto frontier. Experiments conducted on multiple datasets demonstrate the superior performance of our methods in balancing accuracy and fairness.

2606.00655 2026-06-02 cs.MA cs.AI cs.CY 版本更新

Scaling Behavior of Single LLM-Driven Multi-Agent Systems

单一LLM驱动的多智能体系统的扩展行为

Jialing Li, Zhouhong Gu, Yin Cai, Hongwei Feng

发表机构 * Fudan University(复旦大学)

AI总结 本文通过提出顺序迭代多智能体系统(SIMAS)框架,系统研究了同质多智能体系统性能随智能体数量变化的扩展规律,发现性能并非单调提升,而是受协作协同与协调开销之间的权衡支配,呈现收益递减模式。

详情
AI中文摘要

基于LLM的多智能体系统(MAS)这一新兴领域有望通过协作智能处理复杂任务,但其扩展行为和内在集体动力学的基本问题仍未被充分探索。本文系统研究了同质MAS的性能如何随智能体数量增加而变化,将协作变量与模型或知识异质性分离。我们提出了顺序迭代多智能体系统(SIMAS)框架,这是一种以顺序智能体间通信为中心的极简架构,以清晰观察扩展效应。通过跨不同任务和模型规模的广泛实验,我们确定MAS性能并非随智能体数量单调扩展,而是遵循收益递减模式,受协作协同与协调开销之间的权衡支配。我们的发现表明,有效的MAS需要足够强大的基础LLM,任务类型关键地调节最优智能体数量,并且集体智能是一种依赖于策略性交互设计的新兴属性,而非智能体数量的必然结果。性能下降源于协调开销,而不仅仅是长上下文失败,并且扩展趋势在结构化辩论拓扑等交互架构中具有普遍性。这项工作为MAS扩展规律提供了基础理解,为设计高效协作系统提供了实践指导,并挑战了“更多智能体必然带来更好性能”的普遍假设。

英文摘要

The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain underexplored. This paper systematically investigates how the performance of a homogeneous MAS evolves as the number of agents increases, isolating the variable of collaboration from model or knowledge heterogeneity. We propose the Sequential Iterative Multi-Agent System (SIMAS) framework, a minimalist architecture centered on sequential inter-agent communication, to clearly observe scaling effects. Through extensive experiments across diverse tasks and model scales, we establish that MAS performance does not scale monotonically with agent count but follows a pattern of diminishing returns, governed by a trade-off between collaborative synergy and coordination overhead. Our findings reveal that effective MAS requires a sufficiently capable base LLM, that task type critically modulates the optimal agent count, and that collective intelligence is an emergent property contingent on strategic interaction design rather than a guaranteed outcome of agent plurality. The performance degradation stems coordination overhead rather than merely long-context failure, and the scaling tendency generalizes across interaction architectures like structured debate topologies. This work provides a foundational understanding of MAS scaling laws, offering practical guidance for designing efficient collaborative systems and challenging the prevailing assumption that more agents invariably lead to better performance.

2606.00651 2026-06-02 cs.LG cs.AI cs.CL 版本更新

MESA: Improving MoE Safety Alignment via Decentralized Expertise

MESA: 通过去中心化专家提升MoE安全对齐

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对MoE架构中安全能力集中于少数专家导致的脆弱性,提出MESA框架,通过最优传输理论实现专家安全职责去中心化分配与路由细化,在保持实用性的同时提升防御性能。

Comments 18 pages, 8 figures, accepted by ICML 2026

详情
AI中文摘要

混合专家(MoE)架构高效扩展大型语言模型(LLM),通过动态路由将输入分配给相关专家,以降低计算成本的同时增强容量,但引入了一个关键漏洞:安全稀疏性,即安全能力集中在少数专家中,使其容易受到对抗性绕过。同时,传统的对齐方法统一调整所有参数,忽略了它们的功能差异,并无意中降低了性能。为了解决这些挑战,我们提出了MESA(MoE安全对齐),一个针对基于MoE的LLM的定向对齐框架,策略性地去中心化安全责任以最大化覆盖范围,同时最小化对实用性的干扰。基于最优传输(OT)理论,MESA通过两种机制运作:(1)专家容量重新分配使用传输成本矩阵将安全职责分配给最具成本效益的专家,以及(2)动态路由细化约束路由器精确激活这些去中心化模块。实验表明,MESA在保持有用性的同时,对各种有害基准实现了稳健的防御性能。代码可在https://github.com/lorraine021/MESA获取。

英文摘要

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.

2606.00647 2026-06-02 cs.CL cs.AI 版本更新

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

LinguIUTics 在 PsyDefDetect 中的研究:用于心理防御机制分类的迭代不平衡感知微调 Qwen3-8B

Shefayat E Shams Adib, Ahmed Alfey Sani, Md Hasibur Rahman Alif, Ajwad Abrar

发表机构 * Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh(计算机科学与工程系,伊斯兰技术大学,达卡,孟加拉国)

AI总结 针对对话文本中心理防御机制检测的类别不平衡问题,提出基于 QLoRA 微调 Qwen3-8B 的迭代不平衡感知方法,通过分组分层交叉验证、少数类轮询词汇增强和后处理流水线,在 PsyDefDetect 2026 共享任务中达到宏 F1 0.3917,排名第4。

Comments Accepted at PsyDefDetect, a shared task at the 25th BioNLP Workshop (BioNLP 2026), co-located with ACL 2026 in San Diego, CA, USA

详情
AI中文摘要

检测对话文本中的心理防御机制仍然是一个具有挑战性的临床自然语言处理问题。针对 PsyDefDetect 2026 共享任务(九类话语分类,通过宏 F1 评估),我们的团队 LinguIUTics 在官方正类排行榜上取得了 0.3917 的宏 F1 分数,在 21 个注册团队中排名第 4,比 Ministral-8B 任务基线(宏 F1 0.3148)提高了 7.7 个绝对点(相对提升 24.4%)。由于严重的类别不平衡,BERT 系列编码器和零样本 LLM 在稀有类别上被证明无效,因此我们转向对 Qwen3-8B 进行 QLoRA 微调。我们利用三个关键策略:分组分层交叉验证(防止泄漏)、少数类轮询词汇增强,以及包含 logit 偏置调整和集成混合的后处理流水线。这些组件共同缩小了验证集与排行榜之间的差距,并显著提高了少数类的召回率,将关键的“Unclear”类别(第8级)从接近零的性能提升到 F1 分数 0.797。

英文摘要

Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (nine-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by 7.7 absolute points (24.4 percent relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation-to-leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to an F1 score of 0.797.

2606.00642 2026-06-02 cs.AI cs.CR 版本更新

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

隐藏的思考并非秘密:大型语言模型中的推理痕迹暴露

Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa, Chia-Mu Yu

发表机构 * National Yang Ming Chiao Tung University(国家阳明交通大学) UC Berkeley(伯克利大学)

AI总结 本文提出推理暴露提示(REP)方法,通过影子模型生成的示范以辅助代码格式包装,从受害者模型中引出用户可见的推理痕迹,显著提高暴露痕迹与内部痕迹的相似性并保留有用推理信号。

详情
AI中文摘要

推理痕迹已成为改进和转移大型语言模型能力的有价值学习信号。特别是,详细痕迹有助于将推理行为从更强的教师模型蒸馏到较弱的学生模型。能力转移的价值促使许多部署了推理模型的系统隐藏原始内部痕迹,最多向用户暴露摘要和答案。因此,我们提出这样的问题:这种接口级别的痕迹隐藏是否能防止用户通过提示获得有用的推理监督?我们通过推理暴露提示(REP)研究这个问题,这是一种轻量级的上下文引出方法,使用影子模型生成的示范以辅助代码格式包装,从受害者模型中引出用户可见的推理痕迹。在常见的推理数据集、不同的受害者模型和不同的学生模型蒸馏中,REP显著提高了暴露痕迹与REP条件内部痕迹之间的相似性,同时保留了有用的推理信号。

英文摘要

Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.

2606.00636 2026-06-02 cs.AR cs.AI 版本更新

LP5X-PIM Sim: A High-Fidelity HW/SW Integrated Simulator for LPDDR5X-PIM

LP5X-PIM Sim:用于LPDDR5X-PIM的高保真硬件/软件集成模拟器

SangHoon Cha, Jaewan Choi, Byeongho Kim, Yoonah Paik, Sukhan Lee, Kyomin Sohn

发表机构 * Samsung Electronics, South Korea(三星电子(韩国))

AI总结 本文介绍三星电子开发的LPDDR5X-PIM模拟器,通过集成硬件数据路径和软件控制层的高保真模型,实现系统性能和能效的精确评估。

Comments 4 pages, 4 figures, tech note

详情
AI中文摘要

本技术说明描述了由三星电子开发的LPDDR5X-PIM模拟器的架构和执行结果。基于最新研究和内部规范,该模拟器提供了LPDDR5X-PIM模块的硬件数据路径和软件控制层的高保真模型。这种集成的硬件-软件仿真方法能够在最大化PIM资源利用率的同时,精确评估系统性能和能效。我们改进了现有的仿真框架以与实际硬件实现保持一致,确保行为准确性的一致性。关于LPDDR5X-PIM的具体架构和电路设计的进一步技术细节将在未来的出版物中披露。

英文摘要

This tech note describes the architecture and execution results of the LPDDR5X-PIM simulator, developed by Samsung Electronics. Based on the latest research and internal specifications, the simulator provides a high-fidelity model of both the hardware data paths and the software control layers of the LPDDR5X-PIM block. This integrated hardware-software simulation approach enables precise evaluation of system performance and energy efficiency while maximizing PIM resource utilization. We have refined existing simulation frameworks to align with actual hardware implementation, ensuring consistent behavioral accuracy. Further technical details regarding the specific architecture and circuit design of the LPDDR5X-PIM will be disclosed in future publications

2606.00621 2026-06-02 cs.CR cs.AI cs.CY 版本更新

Authenticity Debt and the Synthetic Content Threat Landscape: A Layered Framework for Trust, Provenance, and IP Governance in the Generative AI Era

真实性债务与合成内容威胁格局:生成式AI时代信任、溯源和知识产权治理的分层框架

Shubhashis Sengupta, Benjamin McCarty, Milind Savagaonkar, Rhine Andotra

发表机构 * Accenture Services Pvt. Ltd.(Accenture服务有限公司)

AI总结 提出真实性债务概念,并基于零信任架构原则设计分层参考架构,整合密码学溯源、人工验证和持续治理,以应对生成式AI带来的合成内容威胁。

详情
AI中文摘要

生成式人工智能从根本上改变了内容的生产方式。它使得高保真文本、图像、音频和视频能够以接近零的边际成本创建、修改和重新分发。这种转变使企业和生态系统面临跨四个相互加强的真实性层(真实性、溯源、完整性和问责性)的多种风险,而传统控制措施单独无法充分应对。我们引入了真实性债务的概念:当组织在未保留可验证来源、完整性和问责性的情况下部署AI生成内容时,累积的制度性负债,将暴露推迟到监管、法律或市场审查之下。本文提出了生成式AI危害和攻击向量的全面多维分类法,调查了技术控制(包括数字水印、溯源框架(C2PA、Adobe CAI)和检测技术)的能力和失效模式,并论证了在开放、对抗和不断变化的环境中没有任何单一机制是足够的。借鉴零信任架构原则和企业治理框架,我们提出了一个分层参考架构,整合密码学溯源、人工验证和持续治理,以大规模维持可辩护的真实性。我们进一步审视了监管格局(欧盟AI法案、美国联邦贸易委员会、NIST AI风险管理框架),并为寻求将真实性建设为制度基础设施而非事后考虑的组织确定了实用指导原则。

英文摘要

Generative artificial intelligence has fundamentally changed how content is now produced. It has enabled how high-fidelity text, images, audio, and videos are created, modified, and redistributed at near-zero marginal cost. This shift exposes enterprises and ecosystems to a number of risks across four reinforcing authenticity layers -- authenticity, provenance, integrity, and accountability -- that traditional controls are inadequate to address in isolation. We introduce the concept of authenticity debt: the cumulative institutional liability that accumulates when organizations deploy AI-generated content without preserving verifiable origin, integrity, and accountability, deferring exposure that surfaces under regulatory, legal, or market scrutiny. This paper presents a comprehensive, multi-dimensional taxonomy of generative AI harms and attack vectors, surveys the capabilities and failure modes of technical controls including digital watermarking, provenance frameworks (C2PA, Adobe CAI), and detection technologies, and argues that no single mechanism is sufficient in open, adversarial, and evolving environments. Drawing on Zero Trust Architecture principles and enterprise governance frameworks, we propose a layered reference architecture that integrates cryptographic provenance, human-in-the-loop verification, and continuous governance to sustain defensible authenticity at scale. We further examine the regulatory landscape (EU AI Act, U.S.\ FTC, NIST AI RMF) and identify practical guiding principles for organizations seeking to build authenticity as institutional infrastructure rather than an afterthought.

2606.00619 2026-06-02 cs.CL cs.AI 版本更新

MemPro: Agentic Memory Systems as Evolvable Programs

MemPro:作为可进化程序的智能体记忆系统

Qingshan Liu, Guoqing Wang, Wen Wu, Jingqi Huang, Xinqi Tao, Dejia Song, Jie Zhou, Liang He

发表机构 * East China Normal University(东华师范大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出MemPro框架,将整个记忆构建-检索管道视为可进化程序,通过故障模式引导的编辑-调试迭代优化,在多个长时任务数据集上超越静态和提示级进化基线。

Comments 20 pages, 14 figures

详情
AI中文摘要

长时程自主智能体需要记忆系统来保留历史信息、跟踪演化状态并在有限上下文窗口之外重用相关知识。现有的智能体记忆系统通常遵循记忆构建-检索(MCR)管道,但往往主要适应记忆库,而在部署后保持周围管道固定。这种固定管道设计难以处理异构的任务特定故障模式,并且可能随着时间推移与规模和结构演化的记忆库产生错位。为解决这些限制,我们提出MemPro,一种系统级进化框架,将整个MCR管道视为可进化程序,而不仅仅是适应记忆库或提示文本。MemPro维护一个可运行记忆系统实现的版本树,其中进化智能体迭代选择有前途的版本,诊断重复出现的故障,并通过故障模式引导的编辑-调试改进创建改进的子版本。在LongMemEval、LoCoMo、HotpotQA和NarrativeQA上的实验表明,MemPro在几次迭代内持续优于强静态和提示级进化基线,随着进化持续改进,并实现了良好的性能-成本权衡。代码可在https://github.com/wanghai673/MemPro获取。

英文摘要

Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction-retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed-pipeline design struggles to handle heterogeneous task-specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system-level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory-system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure-mode-guided edit-debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt-level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance-cost trade-off. Code is available at https://github.com/wanghai673/MemPro.

2606.00618 2026-06-02 cs.AI 版本更新

Efficient Test-time Inference for Generative Planning Models

生成式规划模型的高效测试时推理

Robert Gieselmann, Mihai Samson, Federico Pecora, Jeremy L. Wyatt

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出一种改进的开放-封闭列表搜索算法,结合生成模型和启发式模型,在测试时高效推理,提升生成式规划模型的解质量和计算效率。

详情
AI中文摘要

生成式模型已成为人工智能规划的强大范式,但其性能仍受训练数据分布的限制。一种方法是通过扩展测试时计算来改进推理过程中生成的解决方案。更高效的替代方案是优化推理过程本身。在本文中,我们展示了经典开放-封闭列表(OCL)搜索的修改版本提供了这样一种高效的推理过程。我们的算法协同了两个学习组件:一个从中间状态执行快速推演的生成模型,以及一个在候选推理路径中优先排序的启发式模型。关键贡献包括新颖的探索控制机制以及将学习模型集成到OCL框架中。在多个组合规划领域中,我们的方法在计算效率和解质量上均优于神经符号搜索基线和经典求解器。

英文摘要

Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more efficient alternative is to optimize the inference process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inference procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality.

2606.00613 2026-06-02 cs.CL cs.AI 版本更新

Linguistics-Aware Non-Distortionary LLM Watermarking

语言学感知的无失真LLM水印

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han

发表机构 * Yonsei University(延世大学) Rensselaer Polytechnic Institute(罗切斯特理工学院)

AI总结 提出LUNA水印方法,通过语言自适应非失真二元锦标赛采样器,在保持文本质量的同时实现高检测性能,在12种设置中AUROC达0.9959且中位困惑度偏移仅0.045。

详情
AI中文摘要

水印应能识别语言模型输出而不降低质量或限制验证仅由模型提供者进行。多语言部署使这更加困难,因为形态、分词和书写系统的变化会改变水印证据自然进入的位置。我们引入LUNA,一种语言自适应水印,结合了无模型检测和标准随机密钥模型下的单令牌无失真。LUNA从外部语料库中的词性上下文估计归一化下一标记熵,并用其设置无失真二元锦标赛采样器的深度;检测器从文本、分词器、词性标注器和密钥重建相同的调度。我们在六种类型多样的语言和两个领域上评估了八种主要基线。LUNA在十二种设置中达到了0.9959的AUROC和最低的平均绝对中位困惑度偏移0.045;其95%自助法区间[0.022, 0.073]低于所有基线区间。LUNA还记录了最低的平均Self-BLEU、Distinct-1、surprisal和熵偏移。它是唯一同时在大多数设置中实现AUROC > 0.99和绝对中位困惑度偏移低于0.1的方法,在12种设置中的9种达到该状态,而没有任何基线在超过2种设置中达到。我们的代码可在https://github.com/Shinwoo-Park/luna_watermark获取。

英文摘要

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo-Park/luna_watermark

2606.00611 2026-06-02 cs.AI 版本更新

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

TRACE: 面向长程智能体安全的轨迹风险感知压缩

Zhepei Hong, Lin Wang, Liting Li, Haokai Ma, Junfeng Fang, Fei Shen, Dan Zhang, Xiang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学) South China Normal University(华南师范大学)

AI总结 提出轨迹风险感知压缩方法TRACE,通过压缩器-阅读器架构将长轨迹压缩为潜在证据状态,以聚合稀疏风险信号并提升长程安全检测准确率。

详情
AI中文摘要

长程LLM智能体在长轨迹中产生安全证据,其中稀疏、延迟和组合的风险信号常常逃脱局部审核。现有的轮次级或短上下文检测器难以在长时间跨度内可靠地保留和聚合此类证据。我们将长程智能体安全检测重新定义为轨迹级证据压缩,并提出面向长程智能体安全的轨迹风险感知压缩(TRACE)。TRACE采用压缩器-阅读器设计:压缩器在轨迹级监督下将完整轨迹编码为紧凑的潜在证据状态,阅读器以该潜在证据状态作为安全参考来判断原始轨迹。该设计有助于聚合分散的风险线索并减少过早的证据丢失。在ASSEBench、Pre-Ex-Bench和R-Judge上,TRACE在所有评估基线上取得了最佳准确率,相比强基线最高提升12.6个百分点。在LongSafety上,TRACE随着上下文长度增加表现出更小的性能下降。注意力可视化和案例研究表明,压缩后的参考有助于阅读器聚焦于风险关键片段并恢复跨步证据。代码可在https://github.com/Peregrine123/TRACE_official获取。

英文摘要

Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.

2606.00610 2026-06-02 cs.IR cs.AI cs.MA 版本更新

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

MemGraphRAG:基于记忆的多智能体系统用于图检索增强生成

Chuanjie Wu, Zhishang Xiang, Yunbo Tang, Zerui Chen, Qinggang Zhang, Jinsong Su

发表机构 * Xiamen University(厦门大学) Jilin University(吉林大学)

AI总结 提出MemGraphRAG框架,通过基于记忆的多智能体系统构建高质量知识图谱,并设计记忆感知的分层检索算法,在多个基准上超越现有模型。

Comments Accepted by KDD 2026

详情
AI中文摘要

检索增强生成(RAG)已成为通过利用外部知识来减轻大型语言模型(LLMs)幻觉的重要方法。虽然对简单查询有效,但传统RAG在处理信息高度碎片化的大规模非结构化语料库时存在困难。基于图的RAG(GraphRAG)引入知识图谱来捕获结构关系,从而实现对复杂推理的更全面检索。然而,现有的GraphRAG方法依赖孤立的、片段级别的提取来构建图,缺乏对整个语料库的全局视角。因此,这些方法经常导致主题不一致、逻辑冲突和结构碎片化的图,从而降低检索性能。在本文中,我们提出MemGraphRAG,一种新颖的框架,引入基于记忆的多智能体系统以确保高质量的图构建。具体来说,MemGraphRAG采用由共享记忆支持的智能体协作社会,在整个提取过程中提供统一的全局上下文。这种机制允许智能体动态解决逻辑冲突并保持整个语料库的结构连通性。此外,我们提出了一种针对所构建图的记忆感知分层检索算法。在多个基准上的大量实验表明,MemGraphRAG以相当的效率优于最先进的基线模型。我们的代码可在https://github.com/XMUDeepLIT/MemGraphRAG获取。

英文摘要

Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by leveraging external knowledge. Although effective for simple queries, traditional RAG struggles with large-scale, unstructured corpora where information is highly fragmented. Graph-based RAG (GraphRAG) incorporates knowledge graphs to capture structural relationships, enabling more comprehensive retrieval for complex reasoning. However, existing GraphRAG methods rely on isolated, fragment-level extraction for graph construction, lacking a global perspective on the whole corpus. As a result, these methods frequently lead to thematically inconsistent, logically conflicting, and structurally fragmented graphs that degrade retrieval performance. In this paper, we propose MemGraphRAG, a novel framework that introduces a memory-based multi-agent system to ensure high-quality graph construction. Specifically, MemGraphRAG employs a collaborative society of agents supported by shared memory, which provides a unified global context throughout the extraction process. This mechanism allows agents to dynamically resolve logical conflicts and maintain structural connectivity throughout the corpus. Furthermore, we propose a memory-aware hierarchical retrieval algorithm tailored for the constructed graph. Extensive experiments on multiple benchmarks demonstrate that MemGraphRAG outperforms the state-of-the-art baseline models with comparable efficiency. Our code is available at https://github.com/XMUDeepLIT/MemGraphRAG.

2606.00609 2026-06-02 cs.LG cs.AI 版本更新

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

CARE-RL:用于缓解跨领域冲突的能力感知强化学习

Rui Zhang, Xinle Wu, Yao Lu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出CARE-RL框架,结合协议感知奖励生成与能力感知优化,通过PA-GRM和DACSP方法缓解多领域强化学习中的奖励不可靠与能力干扰问题。

详情
AI中文摘要

具有可验证奖励的强化学习在面向推理的大语言模型中取得了显著进展,但由于非可验证任务中奖励不可靠以及跨领域能力干扰,将其扩展到多领域强化学习仍具挑战性。我们提出CARE-RL,将协议感知奖励生成与能力感知优化相结合,以缓解跨领域冲突。对于非可验证任务,协议感知生成式奖励模型(PA-GRM)在生成轨迹条件奖励之前构建提示级别的评估协议和模式,从而实现对开放式响应的任务自适应且可比较的评估。对于多领域优化,方向感知能力子空间投影(DACSP)从先前的强化学习阶段提取历史能力方向,并通过放大对齐分量、抑制冲突分量以及保留正交更新来调节后续更新。在数学、聊天和指令遵循基准上的实验表明,CARE-RL始终优于标准的多领域强化学习基线,在Qwen2.5-7B和Qwen3-4B上分别达到47.9和50.7的总平均分。

英文摘要

Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.

2606.00593 2026-06-02 cs.CL cs.AI 版本更新

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER: 面向多答案问答的逐步同伴优势与多样性感知探索奖励

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) School of Software and Microelectronics, Peking University(北京大学软件与微电子学院) School of Software Technology, Zhejiang University(浙江大学软件技术学院)

AI总结 提出SPADER强化学习框架,通过逐步同伴优势(SPA)机制和多样性感知探索奖励,解决多答案问答中长程工具使用的细粒度信用分配与持续探索问题,实验表明在多个数据集上提升了召回率和F1分数。

详情
AI中文摘要

大型语言模型越来越多地被部署为工具增强型智能体,以获取参数知识之外的信息。虽然最近的工作改进了长程工具使用推理,但大多数方法专注于具有单一正确答案的任务。相比之下,许多现实世界中的查询需要发现一组全面的有效答案,这种设置被称为多答案问答。这种设置带来了两个挑战:长搜索轨迹上的细粒度信用分配,以及超越简单高频实体的持续探索的奖励对齐。我们提出了SPADER,一个用于多答案问答中长程工具使用的强化学习框架。SPADER包括逐步同伴优势(SPA),一种无评论家的逐步信用分配机制,它通过决策步骤对齐并行轨迹,并根据同伴回报估计优势。它还包括一个多样性感知探索奖励,通过加权稀有发现和降低冗余发现的权重来促进长尾实体发现。在QAMPARI、Mintaka、WebQSP和QUEST上的实验表明,SPADER通常比基于提示的智能体、结果监督的强化学习方法和最近的逐步监督方法提高了召回率和整体F1分数。我们的代码和模型权重可在https://github.com/KhanCold/spader获取。

英文摘要

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.

2606.00590 2026-06-02 cs.IR cs.AI 版本更新

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Critic-R:使用具有自然语言内省反馈的指令调优检索器改进智能搜索

Md Zarif Ul Alam, Alireza Salemi, Hamed Zamani

发表机构 * Center for Intelligent Information Retrieval(智能信息检索中心) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 提出Critic-R框架,通过引入评论模型评估智能体内省推理轨迹,实现检索模型与推理代理之间的反馈闭环,无需人工标注即可优化检索质量与下游答案准确性。

详情
AI中文摘要

智能搜索系统迭代地与检索模型交互以回答复杂查询。尽管取得了实质性进展,但优化检索器以适应智能搜索仍然具有挑战性,通常需要大量的协同训练或黄金标准标注,这限制了现实世界的适用性。我们提出Critic-R,一个在推理和训练过程中明确关闭推理代理与检索模型之间反馈循环的框架。Critic-R引入了一个评论模型,该模型在消费检索到的证据后评估代理的内省推理轨迹,以确定检索到的上下文是否充分支持下一步推理。Critic-R具有两种互补机制:Critic-R-Zero,一种推理时查询细化循环,迭代地重写查询和检索指令;以及Critic-Embed,一种检索模型的优化方法,利用成功和失败的细化轨迹作为自动监督,无需手动相关性标注。我们在HotpotQA、2WikiMultihopQA、MuSiQue和Bamboogle上评估Critic-R。结果表明,Critic-R显著提高了检索质量和下游答案准确性。

英文摘要

Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-standard annotations that limit real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model during both inference and training. Critic-R introduces a critic model that evaluates the agent's introspective reasoning trace after consuming retrieved evidence to determine whether the retrieved context sufficiently supports the next reasoning step. Critic-R has two complementary mechanisms: Critic-R-Zero, an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions, and Critic-Embed, an optimization approach for retrieval models that leverages successful and failed refinement trajectories as automatic supervision without requiring manual relevance annotation. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Improving Visual Representation Alignment Generation with GRPO

利用GRPO改进视觉表示对齐生成

Shentong Mo, Sukmin Yun

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Hanyang University(翰阳大学)

AI总结 提出VRPO方法,通过强化学习将静态对齐损失替换为生成式表示策略优化目标,动态平衡表示一致性与生成质量,在扩散Transformer中实现更快的收敛和更高的图像保真度。

详情
AI中文摘要

最近的扩散Transformer展示了强大的图像合成能力,但由于生成表示与判别表示之间的弱对齐,训练效率仍然较低。虽然表示对齐框架(如REPA)通过将噪声去噪特征与预训练视觉编码器对齐来改善收敛,但其外部监督的对齐损失是静态的,在训练和推理过程中缺乏自适应性。现有方法依赖于固定的余弦对齐或对比目标,无法动态平衡表示一致性和生成质量,导致判别收益有限,且无法以任务自适应方式优化对齐。为了解决这个问题,我们提出了VRPO,一种基于强化学习的优化策略,用生成式表示策略优化目标取代REPA的静态对齐损失。VRPO不强制执行固定的相似性约束,而是将表示对齐视为一个奖励引导的过程:模型根据生成保真度、感知质量以及扩散特征与预训练视觉嵌入之间的语义一致性获得自适应奖励。这种公式使生成器能够不断优化其内部表示,朝向有语义意义的方向,同时提高图像质量。我们的VRPO驱动训练无缝集成到扩散Transformer中,引入可忽略的计算成本,并保持与SiT和DiT架构的完全兼容性。在ImageNet-256x256上的大量实验表明,我们的VRPO-Alignment显著提高了收敛速度和保真度,在相同计算预算下,与REPA相比,FID提升高达1.8,训练速度加快2.3倍。

英文摘要

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.

2606.00582 2026-06-02 cs.AI 版本更新

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

PropLLM:面向网络故障诊断的传播感知场景重建

Zongzong Wu, Ming Zhao, Fengxiao Tang, Nei Kato

发表机构 * National Natural Science Foundation of China(国家自然科学基金委员会) High Performance Computing Center of Central South University(中南大学高性能计算中心)

AI总结 提出PropLLM,首次将逐跳场景重建范式与LLM生成推理能力结合,通过双知识图谱和时序因果传播注意力机制,从端点告警回溯定位根因并确定故障类型,在真实Wi-Fi多模态故障数据集上诊断准确率提升3.9%,根因定位准确率提升4.7%,幻觉率降低50.8%。

详情
AI中文摘要

网络故障沿着拓扑和协议依赖关系逐层传播,然而运维系统通常只观察到传播链末端的症状告警,此时不同的根因故障可能产生高度相似的端点症状。现有方法(无论是基于规则、机器学习还是大语言模型)本质上都是将告警集一次性映射到诊断结果,在结构上无法解决这种端点歧义性。本文提出PropLLM,首次将逐跳场景重建范式与LLM的生成推理能力相结合。从端点告警出发,PropLLM沿着传播路径逐跳回溯,在每一跳从双层知识图谱中检索可验证的事实证据,同时提出的时序因果传播注意力机制将已知的拓扑因果先验直接编码到注意力计算中,引导模型沿着正确的因果方向前进,最终通过完全基于证据的因果链定位根因并确定故障类型。在真实Wi-Fi多模态故障数据集上,PropLLM的故障类型诊断准确率比最强基线提升3.9%,根因定位准确率提升4.7%,幻觉率降低50.8%。在TeleLogs 5G数据集上的补充实验进一步证明了所提方法在不同网络场景下的有效性。

英文摘要

Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9\% and root cause localization accuracy by 4.7\% over the strongest baseline, while reducing the hallucination rate by 50.8\%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.

2606.00571 2026-06-02 cs.LG cs.AI cs.CV 版本更新

On the Difficulty of Learning a Meta-network for Training Data Selection

学习用于训练数据选择的元网络的困难性

Zilin Du, Junqi Zhao, Boyang Albert Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对元学习训练数据选择(MTS)在实践中表现不佳的问题,本文通过数学分析揭示了梯度信噪比低和缺乏信息特征两大障碍,并提出增大批大小和利用信息特征作为解决方案。

详情
AI中文摘要

合成数据越来越多地被用于训练神经网络,但若不加区分地使用,其与真实数据的分布不匹配会限制其有效性。一种常见策略是通过双层优化学习数据权重,我们称之为元学习训练数据选择(MTS)。有趣的是,在实践中,MTS 往往低于预期。我们识别了正确训练 MTS 的两个障碍:梯度信噪比(GSNR)低导致优化困难,以及缺乏与数据质量相关的信息特征。我们对 MTS 进行了数学分析,揭示了归一化数据权重的动态以及不同数据质量与低 GSNR 之间的关系。分析表明,一个简单而有效的解决方案是增大批大小。此外,我们提出了一组信息特征,用于捕捉训练数据在其分布中的位置和训练动态。在四个基准上的实验显示了一致的改进,与无选择的训练相比平均提升 5.49%,与最强基线相比平均提升 2.89%。

英文摘要

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and lack of informative features that correlates with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.

2606.00570 2026-06-02 cs.CL cs.AI 版本更新

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

重新审视大型语言模型中基于参数的知识编辑:理论极限与实证证据

Wanying Ren, Xin Song, Futing Wang, Guoxiu He, Aixin Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过理论分析和实证评估,揭示了基于参数的知识编辑方法会因维度坍缩假设导致全局干扰和推理崩溃,而简单的检索基线方法在所有条件下均表现更优。

Comments Accepted to ICML 2026. Equal contribution by the first two authors. 9 pages main paper, 10 figures, with appendix

详情
AI中文摘要

基于参数的知识编辑通过局部权重修改更新大型语言模型(LLMs)的内部知识,并引起了广泛关注。然而,大多数现有方法忽略了基本的理论限制,并且很少在现实的、面向实践的设置下进行评估。在本文中,我们首先基于维度坍缩假设提出理论分析,解释局部参数编辑如何沿着表示空间中的脆弱方向传播,引发全局干扰并最终导致推理崩溃。基于这一见解,我们通过系统变化知识复杂度、编辑次数、评估维度和基线方法进行了全面的实证评估。我们的结果表明,基于参数的编辑方法持续损害LLM的核心能力。相比之下,一个简单的基于检索的基线在所有评估条件下始终比所有参数编辑方法表现更强。这些发现强调,在知识编辑后保持LLM的基本能力应成为未来研究的核心关注点。

英文摘要

Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.

2606.00563 2026-06-02 cs.LG cs.AI stat.ML 版本更新

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models

医学预测模型中选择偏差影响的一个实用上界

Kara Liu, Maggie Wang, Russ B. Altman

发表机构 * Stanford University(斯坦福大学)

AI总结 针对选择偏差导致模型泛化性差的问题,提出在仅部分观测选择机制和目标分布的现实条件下,对目标群体最差模型性能的一个新上界,并通过合成数据和真实数据验证其有效性和实用性。

Comments 32 pages, 27 figures, will be published at ACM SIGKDD '26

详情
AI中文摘要

选择偏差是真实世界数据中常见且往往不可避免的一个方面,它挑战了机器学习模型的泛化性。当在偏倚数据上训练的模型被部署到更广泛的目标群体时,模型泛化能力差可能导致实际危害,尤其是在医疗保健等高危环境中。这种风险凸显了从业者在部署前可靠评估模型泛化性的需求。然而,现有的预测模型性能的方法依赖于不切实际地访问目标分布或了解导致偏差的选择机制。为了解决这些局限性,我们提出了一个新颖的上界,用于在现实设置下目标群体上的最差模型性能,其中选择机制和目标群体数据仅被部分观测。我们通过在完全合成数据、源自All of Us研究计划的半合成数据以及MIMIC-IV中的真实世界选择偏差上的实验,证明了我们方法的有效性和实际效用。我们的工作提供了一个原则性和实用性的工具,用于估计在原本难以处理的情况下选择偏差的影响,从而使从业者能够在医疗保健及其他领域构建更安全、更具泛化性的模型。

英文摘要

Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.

2606.00561 2026-06-02 cs.LG cs.AI 版本更新

Interpretable Policy Distillation for Power Grid Topology Control

可解释的策略蒸馏用于电网拓扑控制

Aleksandra Dmitruka, Karlis Freivalds

发表机构 * University of Latvia, Faculty of Exact Sciences and Technology(拉脱维亚大学,精确科学与技术学院)

AI总结 提出一种将深度强化学习策略蒸馏为轻量级决策树/随机森林的方法,在保持性能的同时提升可解释性,并揭示表征偏移。

详情
AI中文摘要

深度强化学习为实时电网运行提供了有前景的途径,但大型神经策略评估成本高、难以在受限硬件上部署,且对操作员不透明。我们探究用于电网拓扑控制的近端策略优化(PPO)智能体能否压缩为紧凑的树基替代模型而不损失运行性能。在Grid2Op的标准14节点环境中,使用面向稳定性的奖励,通过压力聚焦的数据收集在关键高负荷状态下训练PPO教师。然后将策略蒸馏为决策树和随机森林。在保留的验证回合中,两个替代模型在平均奖励和生存时长上均超过教师,而推理成本仅为教师的一小部分。决策树与PPO argmax的动作完全一致率较高,且在其排名靠前的动作中几乎完全一致,同时保持足够小以便直接检查。特征重要性分析揭示了表征偏移:PPO策略主要依赖线路负载信号,而蒸馏树主要由母线拓扑变量驱动。这些结果表明,压力聚焦的蒸馏可以将黑箱神经控制器转换为轻量级、可审计的规则类替代模型,适用于实时部署,同时揭示与确定性动作和拓扑特定泛化相关的风险。

英文摘要

Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evaluate, hard to deploy on constrained hardware, and opaque to operators. We ask whether a Proximal Policy Optimization (PPO) agent for grid topology control can be compressed into compact tree-based surrogates without losing operational performance. A PPO teacher is trained on Grid2Op's standard 14-bus environment with a stability-oriented reward, using stress-focused data collection on critical, high-loading states. The policy is then distilled into a decision tree and a random forest. Across held-out validation episodes, both surrogates exceed the teacher in mean reward and survival length at a fraction of the inference cost. The decision tree shows high exact-action agreement with the PPO argmax and near-complete agreement within its top-ranked actions, while remaining small enough to be inspected directly. Feature-importance analysis reveals a representational shift: the PPO policy relies mainly on line-loading signals, while the distilled tree is driven primarily by bus-topology variables. These results suggest that stress-focused distillation can convert a black-box neural controller into a lightweight, auditable rule-like surrogate suited for real-time deployment, while also surfacing risks tied to deterministic actions and topology-specific generalization.

2606.00559 2026-06-02 cs.LG cs.AI 版本更新

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

通过辅助重建实现神经算法推理的更丰富表示

Jiafu Huang, Chao Peng, Chenyang Xu, Zhengfeng Yang, Kecheng Cai, Chenhao Zhang, Yi Wang, Yiwei Gong, Wanqin Zhou, Irene Zheng

发表机构 * sei.ecnu.edu.cn(东华大学信息科学与工程学院)

AI总结 提出辅助重建模块和自监督学习变体,增强编码器对输入状态信息的保留和特征间依赖的捕捉,从而提升神经算法推理性能。

Comments Appeared at AAAI 2026

详情
AI中文摘要

神经算法推理已成为一个热门研究方向。它旨在训练神经网络模仿经典基于规则的算法的逐步行为。更具体地说,此类算法的执行可以抽象为一系列状态,其中每个状态代表执行步骤后的中间结果。训练目标是生成复制底层算法过程的状态序列。该任务的常见框架采用编码器-处理器-解码器架构,其中编码器学习状态的表示,处理器模拟算法步骤,解码器重建输出状态。虽然先前的工作侧重于改进处理器,但编码器在表示学习中的作用很少受到关注。大多数方法依赖简单的MLP编码器,这引发了一个问题:这些表示是否足够信息丰富以支持算法推理。本文研究如何改进神经算法推理的编码器表示。我们提出一个重建模块,旨在从其编码表示中恢复输入状态。这个辅助重建任务鼓励编码器保留关于输入的关键信息。我们证明,在训练过程中加入此任务可以提高现有神经架构在标准基准上的性能。此外,我们观察到当前编码器常常未充分利用状态内特征之间的相关性。为了解决这个问题,我们从自监督学习中汲取灵感,设计了一个增强的辅助任务变体,鼓励编码器捕捉状态内特征依赖。实验结果表明,我们的方法使编码器能够学习更丰富的表示,从而增强现有处理器在算法推理任务上的性能。

英文摘要

Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavior of classical rule-based algorithms. More specifically, the execution of such algorithms can be abstracted as a sequence of states, where each state represents the intermediate outcome after an execution step. The training objective is to generate state sequences that replicate the underlying algorithmic process. A common framework for this task adopts an encoder-processor-decoder architecture, where the encoder learns representations of states, the processor simulates algorithmic steps, and the decoder reconstructs output states. While prior work has focused on improving the processor, the role of the encoder in representation learning has received little attention. Most methods rely on simple MLP encoders, raising the question of whether such representations are sufficiently informative for supporting algorithmic reasoning. This paper investigates how to improve encoder representations for neural algorithmic reasoning. We propose a reconstruction module that aims to recover the input state from its encoded representation. This auxiliary reconstruction task encourages the encoder to retain critical information about the input. We demonstrate that incorporating this task during training improves the performance of existing neural architectures on standard benchmarks. Furthermore, we observe that current encoders often underutilize the correlations among features within a state. To address this, we draw inspiration from self-supervised learning and design an enhanced variant of the auxiliary task that encourages the encoder to capture intra-state feature dependencies. Experimental results show that our method enables the encoder to learn richer representations, thereby enhancing the performance of existing processors on algorithmic reasoning tasks.

2606.00548 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

CAFOSat:用于基于高分辨率影像的基础设施感知型CAFO制图的高质量标注数据集

Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga

发表机构 * University of Virginia(弗吉尼亚大学) Biocomplexity Institute, University of Virginia(弗吉尼亚大学生物复杂性研究所)

AI总结 针对集中式动物饲养操作(CAFO)大规模制图困难,提出CAFOSat数据集,集成高分辨率NAIP影像与多源CAFO清单,通过人机协同标注、GradCAM定位和几何聚类优化弱定位记录,并引入合成增强管道,实现基础设施级标注和鲁棒分类。

Comments Accepted at CVPR Workshop-2026. First two authors has equal contribution

详情
AI中文摘要

集中式动物饲养操作(CAFO)在农业生产中发挥重要作用,但也与环境、公共卫生和疾病监测问题相关。由于基础设施布局异质、位置记录噪声大、标注不一致以及清单不完整,从遥感影像大规模制图CAFO仍具挑战。我们引入CAFOSat,一个用于美国全境CAFO制图的高质量标注、基础设施感知数据集。CAFOSat集成高分辨率国家农业影像计划(NAIP)影像与跨州收集的多源CAFO清单,并通过结合AI辅助标注、基于GradCAM的定位和几何聚类的人机协同管道,将弱地理定位记录转化为精细标注。为提高数据集质量,我们利用土地覆盖引导采样和空间排除约束筛选具有挑战性的负样本,并通过人工验证提供基础设施级标注,包括畜棚、粪池和放牧相关特征。最终数据集包含超过45,000个图像块,覆盖20个州和四大CAFO类别。我们对多种卷积、基于Transformer和视觉-语言模型进行基准测试,证明了精细标注和精心筛选的负样本在CAFO分类和泛化中的价值。此外,我们引入一个合成增强管道,生成基础设施感知的变体以增加训练多样性并提升分布偏移下的鲁棒性。CAFOSat为推进基础设施感知的农业监测和基于高分辨率遥感影像的CAFO制图提供了大规模基准。

英文摘要

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.

2605.01797 2026-06-02 cs.AI 版本更新

Neural Decision-Propagation for Answer Set Programming

面向回答集编程的神经决策传播

Thomas Eiter, Katsumi Inoue, Sota Moriyama

发表机构 * Vienna University of Technology (TU Wien)(维也纳技术大学( TU Wien)) National Institute of Informatics(日本信息处理学会) The Graduate University for Advanced Studies, SOKENDAI(高级研究大学,SOKENDAI)

AI总结 提出决策传播(DProp)方法及其可微扩展神经决策传播(NDProp),通过交替假决策和真传播高效计算稳定模型,提升神经符号推理的可扩展性和准确性。

Comments This is the full version (with appendix) of a paper appearing at the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026)

详情
AI中文摘要

将回答集编程(ASP)与神经网络集成已成为神经符号AI中一种有前景的工具。虽然现有方法将ASP的能力扩展到现实世界领域,但其推理流程依赖于经典求解器,这成为可扩展性的瓶颈。为解决这一问题,我们提出了一种计算稳定模型的新方法,称为决策传播(DProp),它交替进行假决策和真传播。我们证明了成功的DProp计算能够捕捉稳定模型语义。随后,我们开发了神经决策传播(NDProp),它是DProp的可微扩展,使用神经计算进行决策,使用模糊评估进行传播。我们评估了NDProp在学习决策启发式以及神经符号集成方面的能力,并将其与现有的神经符号方法进行了比较。结果表明,NDProp能够学习高效计算稳定模型,并在神经符号基准测试中提高了准确性和可扩展性。

英文摘要

Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in Neuro-symbolic AI. While existing approaches extend the capabilities of ASP to real world domains, their reasoning pipelines depend on classical solvers, which is a bottleneck for scalability. To tackle this problem, we propose a new method to compute stable models, called decision-propagation (DProp), which alternates falsity decisions and truth propagations. Successful DProp computations are shown to capture the stable model semantics. We then develop Neural DProp (NDProp), a differentiable extension of DProp with neural computation for decisions and fuzzy evaluation for propagations. We evaluate the capabilities of NDProp for learning decision heuristics as well as neuro-symbolic integration, and compare it with existing neuro-symbolic approaches. The results show that NDProp can learn to efficiently compute stable models, and it improves accuracy and scalability on neuro-symbolic benchmarks.

2606.00532 2026-06-02 cs.AI 版本更新

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

KACE: 知识自适应上下文工程用于数学推理

Jayant Parashar, Suchendra M. Bhandarkar

发表机构 * School of Computing, University of Georgia(计算学院,佐治亚大学)

AI总结 提出KACE方法,通过难度和领域分层的知识库与分层自一致性,解决数学推理中上下文膨胀问题,在AIME 2025上达到62.2%准确率。

Comments 9 pages, 1 figure, 6 tables

详情
AI中文摘要

上下文工程可以在不更新权重的情况下改进大型语言模型,但数学推理暴露了一个关键限制:在一个不断增长的提示中累积的反馈会导致上下文膨胀,并限制了可使用的学习指导量。现有方法常常混淆存储(跨运行学习的内容)与使用(针对特定问题包含的内容),因此继承了这种提示大小上限。我们引入了知识自适应上下文工程(KACE),通过基于难度和领域的组织将存储与使用分离。离线时,一个自我反思的学习循环将训练轨迹提炼成认知树:一个按问题难度和认知领域分层的类型化卡片知识库。每张卡片被分配到与其起源失败对应的难度-领域节点。在评估时,具有每层一致性门控的分层自一致性将每个问题动态分类为简单、中等或困难。简单问题无需检索卡片即可退出,而较难的问题仅检索树的匹配分支。这种分层方案在计算量相当的情况下匹配或超过Best-of-N,并以78%的成对一致性对问题难度进行分类。主要的实证贡献是通过分层自一致性构建和使用了一个难度和领域分层的知识库。在AIME 2025上,KACE达到了62.2%的准确率,在可比的求解器调用预算下,比固定的Best-of-5自一致性绝对提高了10.4个百分点,比最强的学习上下文基线Tiered + GEPA提高了5.6个百分点。我们还在MATH-HARD和OlymMATH的可验证子集上观察到一致的提升。

英文摘要

Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that can be used. Existing methods often conflate storage, what is learned across runs, with usage, what is included for a particular problem, and therefore inherit this prompt-size ceiling. We introduce Knowledge-Adaptive Context Engineering (KACE), which separates storage from usage through difficulty- and domain-based organization. Offline, a self-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain. Each card is assigned to the difficulty-domain node corresponding to the failure from which it originated. At evaluation time, tiered self-consistency with per-tier agreement gates dynamically classifies each problem as easy, medium, or hard. Easy problems exit without retrieved cards, while harder problems retrieve only the matching branch of the tree. This tiered scheme matches or exceeds Best-of-N while using comparable compute, and it classifies problem difficulty with 78 percent pairwise concordance. The main empirical contribution is the construction and use of a difficulty- and domain-stratified knowledge base enabled by tiered self-consistency. On AIME 2025, KACE achieves 62.2 percent accuracy, a 10.4-point absolute gain over fixed Best-of-5 self-consistency at a comparable solver-call budget and a 5.6-point gain over the strongest learned-context baseline, Tiered + GEPA. We also observe consistent gains on MATH-HARD and the verifiable subset of OlymMATH.

2606.00518 2026-06-02 cs.AI 版本更新

Acting with AI: An Interaction-Based Framework for Agentic Tort Liability

与AI行动:基于交互的代理侵权责任框架

Yiheng Yao

发表机构 * Yiheng Yao(姚艺恒)

AI总结 本文基于Bratman规划理论和普通法人类协同行动原则,提出一种交互分类框架(自主漂移、纯工具使用、协作规划)来分配AI代理系统的侵权责任,并引入“合理代理”标准。

详情
AI中文摘要

代理AI系统能够多步规划、使用工具并随时间执行任务。当此类系统造成损害时,侵权法难以分配责任,因为有害路径可能既非用户完全选择,也非开发者特别预见。本文借鉴Michael Bratman的规划理论和普通法对人类协同行动的处理,提出一个基于交互的代理侵权框架。我们区分三种交互类型:自主漂移、纯工具使用和协作规划。纯工具案例仍受普通产品缺陷和警告原则管辖;协作规划案例映射到独立承包商控制测试、专业过失和过失性虚假陈述;自主漂移则映射到雇主责任下的“擅自行动”和严格产品责任。该框架将有状态交互日志作为主要证据线索,使法院能够推断人-AI轨迹何时偏离授权行为以及责任应归于何处。我们解决了四个事件锚定案例,将该观点与严格责任和基于保险的提案并列,指出其与监管监督的关系,并提出了一个围绕约束验证、认知透明度、运行时基础和取证日志构建的“合理代理”标准。

英文摘要

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a ``Reasonable Agent'' standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

2606.00516 2026-06-02 cs.AI 版本更新

Threshold-Based Exclusive Batching for LLM Inference

基于阈值的独占批处理用于LLM推理

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma, Shining Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间利用工程技术中心)

AI总结 针对混合批处理中预填充与解码干扰导致边际成本上升的问题,提出基于GPU内存带宽、模型大小和工作负载的EB-MB性能交叉条件及最优切换阈值,优化后的独占批处理在带宽受限GPU上吞吐量提升高达41.9%。

Comments 37 pages, 12 figures. Accepted at ICML 2026

详情
AI中文摘要

混合批处理(MB)——将预填充和解码交错在单个批次中——已成为大型语言模型(LLM)推理的标准调度策略,因其在最大化计算和内存利用率方面的效率。然而,通过受控实验,我们发现预填充-解码干扰使MB的每步边际成本高于纯解码。在高带宽H200(4.8 TB/s)上,这仅在解码token超过批次的80%时发生;然而,在带宽受限的RTX PRO 6000(1.792 TB/s)上,该阈值骤降至仅20%。因此,MB与独占批处理(EB)之间的最优选择根本上取决于GPU内存带宽、模型大小和工作负载组成。我们推导了该EB-MB性能交叉的闭式条件,以及渐近最优的相位切换阈值和EB的内存安全批次大小。优化的EB在带宽受限GPU上吞吐量提升高达41.9%,而MB在具有更大模型的高带宽硬件上保持优势。我们的混合调度器EB+在线应用该条件,在无需人工干预的情况下动态切换EB和MB。在分布或并发度变化的非平稳流量下,EB+在每个设置中达到最高或接近最高的吞吐量,比MB高出高达36.4%。

英文摘要

Mixed batching (MB)--interleaving prefill and decode in a single batch--has become the standard scheduling strategy for large language model (LLM) inference due to its efficiency in maximizing compute and memory utilization. However, through controlled experiments, we find that prefill-decode interference inflates MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; however, on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), this threshold plummets to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) fundamentally depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to dynamically switch between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.

2606.00515 2026-06-02 cs.RO cs.AI cs.SY eess.SY 版本更新

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

PaCo-VLA: 用于富接触视觉-语言-动作操控的被动屏蔽柔顺先验

Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li

发表机构 * Southwest Jiaotong University(西南交通大学) University of Leeds(莱斯特大学)

AI总结 提出PaCo-VLA框架,通过被动屏蔽将VLA模型输出转化为任务级柔顺建议,并利用能量罐和边界检查防止无效预测绕过底层接触物理,实现安全精确的富接触操控。

Comments Under review, code will be available soon

详情
AI中文摘要

富接触操控既需要高层语义推理,也需要对高频接触动态的安全调节。虽然视觉-语言-动作(VLA)模型提供了前所未有的语义泛化能力,但其低速率输出缺乏在力敏感任务中直接控制执行器所需的可靠性。为弥合这一语义到控制的鸿沟,我们引入PaCo-VLA,一种被动屏蔽的柔顺先验,重新定义了VLA接口。PaCo-VLA不将直接电机指令托付给VLA,而是将网络输出视为任务级柔顺建议:语义绑定、任务阶段和导纳调度。一个高频、建议无关的被动屏蔽通过能量罐核算和边界检查来管理这些建议,防止无效、过时或未经验证的模型预测绕过底层接触物理。这种解耦架构还支持因果评估,将语义贡献与几何捷径分离。大量仿真和真实世界的连接器插入实验表明,PaCo-VLA在无屏蔽VLA基线上实现了卓越的精度,即使在对抗性柔顺偏移下也能保持零被动违规。该框架在导纳端口建立了一个可证明的采样被动运行时契约,并为在富接触领域部署基础模型提供了运行时接口。

英文摘要

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.

2606.00510 2026-06-02 cs.CL cs.AI 版本更新

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

技能还是跳过?通过双粒度偏好学习在智能体任务中学习选择性技能调用

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang, Yi Yang, Cong Qin, Yangen Hu, Lu Pan, Ke Zeng

发表机构 * Meituan(美团) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) Nanjing University(南京大学) Peking University(北京大学)

AI总结 提出SelSkill框架,通过双粒度偏好学习实现选择性技能调用,在ALFWorld和BFCL上显著提升任务成功率和执行精度。

Comments 18 pages, 4 figures, 10 tables

详情
AI中文摘要

智能体技能是可调用的程序化模块,为复杂智能体任务提供可重用知识和执行策略。然而,现有方法主要关注选择相关技能或改进技能本身,而忽略了在当前决策点是否应该实际调用相关技能。无帮助的调用可能引入无关上下文并破坏原本正确的执行过程。为解决此问题,我们提出SelSkill,一个用于选择性技能调用的双粒度偏好学习框架。SelSkill将技能使用表述为技能或跳过决策,利用预测不确定性优先考虑候选决策点,并从共享轨迹前缀构建受控的调用-跳过偏好对。它进一步结合了回合级结果偏好与步骤级调用偏好,以捕捉整体轨迹质量和技能调用的局部有效性。在ALFWorld上使用Qwen3-8B,SelSkill将任务成功率提高了10.9个百分点,执行精度提高了29.1个百分点。在BFCL上,它将任务成功率提高了5.7个百分点,执行精度提高了29.5个百分点。在Tau-bench和PopQA上的零样本结果进一步表明,学习到的调用策略可迁移到具有未见技能的新领域。

英文摘要

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.

2606.00508 2026-06-02 cs.CV cs.AI 版本更新

V-LynX: Token Interface Alignment for Video+X LLMs

V-LynX: 视频+X 大语言模型的令牌接口对齐

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea(延世大学,首尔,韩国) Ewha Womans University, Seoul, South Korea(成均馆大学,首尔,韩国)

AI总结 本文发现视频大语言模型中存在令牌接口连续流形,并提出V-LynX框架,通过轻量辅助路径对齐注意力响应和统计分布,无需配对监督即可集成新模态,在音视频问答、3D推理等任务上达到最优效率。

Comments ICML 2026 Camera-ready

详情
AI中文摘要

本研究揭示了视频大语言模型中的一个有趣现象:视频大语言模型不仅仅是简单地将帧转换为文本嵌入,而是建立了一个连续流形——令牌接口,使得视觉令牌能够在架构内作为独立实体运行。利用这一发现,我们提出了V-LynX,这是一个可扩展的框架,通过重新利用内部化接口,将新模态集成到视频大语言模型中。与需要大量模态特定编码器或配对监督的传统范式不同,V-LynX采用轻量辅助路径与冻结的视觉编码器并行运行。我们的方法通过使用非配对单模态数据集对齐注意力响应和统计分布,将新的感官输入与内在视频先验相结合。这确保了流形兼容性,同时保持了视频大语言模型的完整性。大量基准测试表明,V-LynX在音视频问答、3D推理、高帧率和多视角视频理解方面达到了最先进水平和高效性。代码可在https://github.com/park-jungin/lynx获取。

英文摘要

This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal data sets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves SOTA and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.

2606.00506 2026-06-02 cs.AI cs.LG 版本更新

EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

EnergyMamba:一种用于能耗预测的具有不确定性感知的图增强选择性状态空间模型

Dahai Yu, Rongchao Xu, Lin Jiang, Guang Wang

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出EnergyMamba框架,通过图增强选择性状态空间模型(GE-Mamba)和自适应序列分位数回归(AS-CQR)模块,实现时空联合建模与不确定性量化,在能耗预测中提升准确率约5%、不确定性量化约6%。

Comments Accepted by KDD 2026 AI4S

详情
AI中文摘要

能耗预测对于高效的电网管理、需求侧优化和可持续能源规划至关重要。尽管先进的机器学习方法已被用于提高预测性能,但现有工作存在两个关键局限:(1)通常将任务视为纯时间序列预测问题,未显式建模不同区域间的空间依赖关系;(2)在极端天气等异常情况下无法提供带有不确定性估计的可靠预测。为推进现有研究,我们提出EnergyMamba,一种具有不确定性感知的时空学习框架,用于准确可靠的能耗预测,包含两个关键组件:(i)一种新颖的图增强选择性状态空间模型(GE-Mamba),将从电网拓扑中学到的空间上下文注入时间动态,实现耦合的时空建模;(ii)自适应序列分位数回归(AS-CQR)模块,包括局部自适应归一化和在线反馈机制,以在潜在分布偏移下动态校准预测区间。我们在来自佛罗里达、纽约和加利福尼亚的四个大规模真实数据集上评估EnergyMamba。结果表明,与15个最先进的基线相比,EnergyMamba在预测准确率上提升约5%,在不确定性量化上提升约6%。

英文摘要

Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.

2606.00503 2026-06-02 cs.LG cs.AI 版本更新

TabChange: Precise Attribute Changes in Tabular Data

TabChange: 表格数据中的精确属性变化

Arjun Dahal, Yu Lei, Raghu N. Kacker, Richard Kuhn

发表机构 * The University of Texas at Arlington(德克萨斯大学阿灵顿分校) National Institute of Standards and Technology(美国国家标准与技术研究院) Information Technology Laboratory(信息技术实验室)

AI总结 针对表格数据中修改属性时破坏自然性的问题,提出TabChange方法,通过分析属性间关系并利用对抗框架去除潜在空间中的属性信息,实现精确且自然的属性修改。

详情
AI中文摘要

修改表格数据中的属性通常会破坏其与其他属性的关系,从而产生不自然的实例。修改后的实例必须既自然又与原始实例变化最小。本文解决了生成这种修改实例的挑战。我们识别了现有方法的关键局限性:生成模型要么不支持实例级属性编辑,要么像CVAE这样的方法在潜在空间中保留属性信息,导致不必要的修改。为了解决这个问题,我们提出了TabChange,一种分析数据集中目标属性与其他属性关系的方法。如果关系较弱,它直接翻转属性;如果关系较强,它使用对抗框架去除潜在空间表示中的属性信息。这种去除使得能够进行精确修改,只进行必要的调整以保持自然性。我们在七个数据集上的实验表明,TabChange生成的属性反事实在自然性方面与基线相当,并且更接近原始实例。与基线相比,这导致了更多有效的反事实和更少的无效反事实。

英文摘要

Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modified instance must be both natural and minimally changed from the original instance. This paper addresses the challenge of generating such a modified instance. We identify key limitations in existing approaches: generative models either don't support instance-level attribute editing or, in the case of methods like CVAE, retain attribute information in the latent space, leading to unnecessary modifications. To solve this, we propose TabChange, an approach that analyzes the relationship between the attribute of interest and other attributes in the dataset. If the relationship is weak, it simply flips the attribute; if it is strong, it uses an adversarial framework that removes information about the attribute in the latent space representation. This removal enables precise modifications, making only the necessary adjustments to maintain naturalness. Our experiments across seven datasets show that TabChange generates counterfactuals in attributes that are comparable in naturalness and are more proximal to their original instances. This leads to a higher number of valid counterfactuals and a lower number of invalid counterfactuals compared to the baselines.

2606.00487 2026-06-02 cs.AI 版本更新

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

TAPS: 面向扩散草稿推测解码的目标感知前缀树选择

Zhuoyu Wang, Junnan Huang, Xinyu Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州))

AI总结 提出TAPS方法,通过目标感知的前缀树选择优化扩散模型草稿的验证效率,实现最高7.9倍无损加速。

详情
AI中文摘要

使用扩散模型进行并行草稿是推测解码的一种有前景的方法。通过在单次前向传播中预测多个未来位置的token,扩散草稿器显著降低了草稿延迟。然而,这会将瓶颈转移到验证上:验证单个序列限制了接受长度,而验证大型草稿树会导致过度的目标模型延迟。我们发现了现有草稿树方法中的一个关键不匹配:现有的扩散树方法按边际概率对节点排序,忽略了验证是前缀条件化的。因此,它们可能会验证被拒绝前缀的不可达后代,从而增加延迟而接受增益有限。为了解决这个问题,我们提出了TAPS,一种目标感知的前缀选择方法,将扩散边际转化为路径条件化的接受估计。然后,TAPS在固定的验证预算下选择一个紧凑的前缀封闭子树,改善接受-成本权衡,而不是简单地扩展草稿树。跨不同数据集和模型族的实验表明,TAPS在无损端到端速度上比普通自回归解码最高提升7.9倍,分别比最先进的DFlash和DDTree提升1.36倍和1.74倍。我们的工作可在https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD获取。

英文摘要

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD

2606.00476 2026-06-02 cs.AI 版本更新

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

做他们所说的,而不是他们所推理的:定位LLM智能体中的忠实性差距

Yufeng Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过将忠实性差距分解为推理-结论和结论-行动两个步骤,在可控的德克萨斯扑克模拟器中研究LLM智能体是否按照其陈述的推理行动。

Comments submitted to COLM social simulation with LLM workshop

详情
AI中文摘要

LLM智能体是否按照其陈述的推理行动?这个问题关于过程忠实性对于在社交模拟中使用LLM至关重要,但在没有正确行为参考的情况下很难衡量。我们在一个受控环境中研究这个问题,即一个德克萨斯扑克模拟器,其中每个决策都有一个可验证的参考行动,通过将忠实性差距分解为两个步骤:推理-结论和结论-行动。这两个步骤的行为相反。

英文摘要

Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in acontrolled setting, a Texas Poker simulator with a verifiable reference action for every decision by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG 版本更新

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

CodeCytos: 通过代码增强的智能体动作空间实现AI辅助空间分子成像分析

Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen, Hong Zhao, Jianting Sheng, Stephen T. C. Wong, Hien V. Nguyen

发表机构 * University of Houston, Department of Electrical and Computer Engineering(德克萨斯大学休斯顿分校电子与计算机工程系) Houston Methodist Hospital, Department of Systems Medicine and Biomedical Engineering(休斯顿 Methodist 医院系统医学与生物医学工程系)

AI总结 提出CodeCytos框架,通过代码驱动的推理智能体实现空间分子成像数据的动态可编程分析,提升自动化与定制化能力,并在多种组织类型数据集上验证其优于基线方法。

详情
AI中文摘要

传统的组织图像分析软件为细胞分析提供了基础功能,包括分割、基本形态特征提取和空间组织分析。然而,这些工具通常需要手动干预,且与代码驱动的自动化集成不佳,限制了复杂空间组织研究的效率和可扩展性。此外,它们对自定义分析的灵活性有限,通常只支持一组固定的预实现空间细胞特征。为了解决这些限制,我们提出了CodeCytos,一个基于编码的推理智能体框架,能够实现与空间分子成像数据的动态、可编程交互,以提高自动化和定制化。CodeCytos旨在简化自定义空间细胞特征的探索,并适应多样化的研究需求。我们通过四个来自不同组织类型(额叶皮层、非小细胞肺癌、胰腺和扁桃体)的专家精选数据集案例研究展示了其实用性。我们在现实的最小提示设置下评估CodeCytos,其中生物科学家提出简单问题,没有任务特定指令或关于空间细胞分析的上下文信息,并基准测试了多个具有强大编码能力的LLM骨干。我们进一步表明,结合定制的、领域无关的少样本上下文编码推理示例(空间分析领域外随机采样的演示)可以显著提高性能,而无需昂贵的、专家制作的领域内演示。总体而言,CodeCytos优于基线方法,突显了代码动作智能体在空间分子成像中辅助自定义特征探索和加速生物标志物发现的潜力。

英文摘要

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.

2606.00467 2026-06-02 cs.CL cs.AI cs.LG stat.ML 版本更新

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

论大语言模型适应性的局限:模型内化先验对标注任务性能的影响

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

发表机构 * University of Washington(华盛顿大学)

AI总结 通过毒性检测实验,研究大语言模型内化先验与指令交互的三个维度,发现近三分之二的零样本错误难以通过提示纠正,并引入定义特定熟悉度(DSF)指标,证明其与性能正相关,而文本记忆指标则无此关联。

Comments Accepted at ICML 2026 (Oral & Spotlight); PMLR vol. 306. 9 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于零样本标注和LLM-as-a-judge任务,但其可靠性取决于模型内化先验与用户提供指令的交互方式。我们研究了这种交互的三个维度:(1)LLM对数据和任务定义的熟悉程度如何影响性能;(2)提示中的额外信息能在多大程度上纠正零样本错误(“决策粘性”);(3)模型对错误任务定义的敏感性。通过在多种数据集(涵盖社交媒体、游戏、新闻和论坛)上进行毒性检测实验,使用密集模型和混合专家模型,我们发现近三分之二的零样本错误难以纠正,提示纠正的总体挽救率(初始错误中被纠正的比例)仅为34.8%。高置信度错误尤其难以纠正。当给出错误定义时,LLM会遵循这些定义,同时保持与正确定义条件下相同的置信水平。关键的是,我们引入了定义特定熟悉度(DSF),它衡量模型内部概念与任务定义之间的一致性。在控制数据集层面的混杂因素后,DSF与模型性能呈正相关(偏相关系数r=+0.41),而三种不同的记忆指标(ROUGE-L、BERTScore和嵌入余弦相似度)均未显示正相关。这些发现揭示了基于提示的纠正在标注任务中的局限性,强调了定义对齐比文本级记忆更重要。

英文摘要

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.

2606.00462 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Short-form Text Rewriting with Phi Silica

短文本改写与 Phi Silica

Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou, Sadid Hasan

发表机构 * IEEE ICAD

AI总结 本研究通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写任务,结果表明微调提高了语义保真度、减少了幻觉并提升了与 GPT-5-chat 改写的偏好胜率。

Comments 6 pages

详情
AI中文摘要

短文本改写是释义的一种受限变体,其中有限的上下文和高语义密度几乎没有留下变化空间。虽然大型语言模型在一般释义任务上表现良好,但小语言模型(SLM)在短文本场景中常常在语义保真度和幻觉鲁棒性方面遇到困难。在这项工作中,我们提出了一项实证研究,通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写。我们从公开的幻灯片中整理了一个简短的演示风格文本数据集,并使用 GPT-5-chat 来生成改写监督以及进行 LLM 作为评判者的评估。我们的结果表明,微调提高了语义保真度,减少了幻觉,并提高了与 GPT-5-chat 改写的偏好胜率。这些发现表明,针对 SLM 的定向适配可以显著缩小与云模型的差距,并为将 SLM 适配于精度关键的改写任务提供实用指导。

英文摘要

Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.

2606.00448 2026-06-02 cs.SE cs.AI cs.CR 版本更新

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

当安全技能碰撞:衡量智能体技能生态系统中的组合风险

Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, Jingzhou Xu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院) University of Glasgow(格拉斯哥大学) Independent Researcher(独立研究员) Corespeed Inc.(Corespeed公司)

AI总结 本文提出SkillReact框架,通过静态组合基准、双评分者LLM辅助人工审核和基于动作的可利用性测试,研究LLM智能体中个体安全的技能组合后可能产生的不安全行为,发现约18.2%的候选组合存在真实风险,且主机模型决定是否利用这些组合能力。

详情
AI中文摘要

LLM智能体越来越依赖社区贡献的技能,这些技能扩展了智能体的操作能力集。我们研究了智能体AI系统中的一个核心安全问题:个体安全的技能是否可能组合成不安全的已安装技能集。我们提出了SkillReact,一个组合安全测量框架,包含三个组件:一个确定性静态组合基准、一个双评分者LLM辅助人工审核流程,以及一个基于动作的可利用性测试工具。在1,520个ClawHub技能中,651个通过个体检查并形成211,575对;基准标记其中22.25%为结构候选。我们将这个原始比率视为面向召回率的扫描上限,并根据人类判断进行校准:在按模式分层的审计中,大约五分之一的标记对模式命中被确认为真实的组合风险(人口加权有效性18.2%,我们的主要结果),这意味着在单个注册表中约有14K个真实风险成员,而按技能扫描由于构造原因会遗漏这些风险,因为每一对个体都是安全的。然后,基于动作的测试工具探测这些候选何时成为模型发出的工具调用,并发现实现受主机模型倾向的门控:在一个锚定条件的dropper子集上,Haiku-4-5在所有39次直接提示试验中发出了dropper阶段的工具调用(其中36次是完整的下载然后执行链,3次仅下载),Opus-4-7在下载阶段停止,而Sonnet-4-6直接拒绝。一个保持请求固定且仅改变已安装技能的对照实验发现,未安装任何技能时合规性最高:组合决定了哪些能力可达,而主机模型决定是否使用它们。这些结果共同表明,安装时组合检查和能力隔离是对按技能扫描的补充。

英文摘要

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.

2606.00447 2026-06-02 cs.CV cs.AI 版本更新

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

GeoSAM-3D: 用于从单目视频进行开放词汇3D场景分割的测地线提示传播

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出GeoSAM-3D方法,利用冻结的视觉基础模型和单目3D高斯泼溅重建,通过可微分的图-测地线传播核在场景图上传播用户提示,实现从单目视频的开放词汇3D场景分割。

详情
AI中文摘要

开放词汇的3D场景分割通常假设有RGB-D视频、校准的多视角图像或重建的网格。GeoSAM-3D研究了一种更轻的设置:用户上传一段短的单目视频,在一帧中点击或命名一个物体,并在高斯场景上接收传播的3D掩码。该实现结合了冻结的图像和视频基础模型、单目3D高斯泼溅重建以及在高斯质心上可微分的图-测地线传播核。核心设计选择是通过重建场景图上的热核距离传播提示,而不是通过3D中的欧几里得最近邻。这保持了曲面周围的连续性,并减少了附近但不相连物体之间的泄漏。本文描述了仓库状态、在geosam3d.propagate中实现的数学核、从Segment Anything掩码训练的特征头以及代码库中已有的验证。评估协议将实现验证、图传播质量、泄漏控制和交互延迟分开。

英文摘要

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.

2606.00445 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出DarkVesselNet,融合Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型、AIS轨迹推理、TGARD间隙检测和Pi-DPM异常头,实现多模态遥感暗船检测。

详情
AI中文摘要

暗船检测需要融合船只通过AIS报告的信息与卫星通过雷达和光学传感器观测到的信息。DarkVesselNet是一个多模态遥感堆栈,结合了Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型骨干、AIS轨迹推理、TGARD风格的间隙检测以及受Pi-DPM启发的异常头。该仓库将系统呈现为经过测试的Python包和公开的Hugging Face Space。本文介绍了传感器堆栈、骨干抽象、融合路径、异常头和当前的验证。目前可用的证据是基于软件的:针对SAR散斑滤波、光学波段比、Haversine距离、TGARD间隙发射、传感器配准、骨干token形状和可微分异常评分的测试。

英文摘要

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.

2606.00440 2026-06-02 cs.AI 版本更新

SDR: Set-Distance Rewards for Radiology Report Generation

SDR:用于放射学报告生成的集合距离奖励

Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

AI总结 针对胸部X光报告生成中标准奖励不兼容的问题,提出基于集合距离的连续置换不变奖励,通过GRPO后训练和测试时缩放显著提升性能。

详情
AI中文摘要

具有可验证奖励的强化学习已迅速推进了视觉-语言模型中的推理能力。然而,对于胸部X光报告生成,标准奖励(即精确匹配准确率和逐步过程)并不兼容,因为报告由无序且正交的发现组成,而非因果推理链。我们通过基于集合的视角来解决这一差距:每个报告被分割成句子,并由冻结的句子变换器嵌入,生成无序的嵌入集合。我们提出使用生成嵌入与参考嵌入之间的集合到集合距离作为连续的、置换不变的奖励。在两个数据集和三个视觉-语言模型(Qwen3-VL-2B/4B, Gemma3-4B)上,通过GRPO使用基于集合到集合距离的奖励进行后训练,在所有主要指标(BERTScore、RadGraph F1和CheXbert F1,分别相对提升平均6.80%、7.82%和4.45%)上一致优于监督微调和精确匹配GRPO。相同的集合距离还实现了测试时的最佳N选:通过候选与训练报告嵌入的距离进行评分,在我们训练的模型以及三个闭源LLM(Mistral-Small、Gemini-2.5 Flash-Lite、GPT-4o-mini)上,平均相对提升BERTScore 16.4%,优于随机选择。作为流式信号使用时,它们支持更高效的测试时缩放形式:在生成过程中修剪低分候选,可将生成的令牌减少50%以上,同时保持完整最佳N选的结果质量。这些结果共同确立了集合距离奖励作为胸部X光报告生成中后训练和测试时缩放的统一信号。我们的代码已公开。

英文摘要

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average \%6.80, \%7.82 and \%4.45 relative improvements respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average \%16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-$N$ selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \href{https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA}{available}.

2606.00428 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters

低秩PEFT的更细参数步长:基于CP张量适配器的控制研究

Xinjue Wang, Xiuheng Wang, Yejun Zhang, Sergiy A. Vorobyov, Esa Ollila, Zhi-Yong Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过固定组件的规范多路分解(CP)张量适配器实现更细的参数步长,研究其对低秩适配器精度-预算权衡的影响,发现CP适配器能填补LoRA秩之间的空白,但效果依赖于任务。

Comments Accepted at the ICML 2026 Workshop on CoLoRAI

详情
AI中文摘要

低秩适配器通常通过扫描少量秩进行比较,但秩也固定了参数预算的分辨率。对于一个$2048{\times}2048$的OPT注意力投影,增加LoRA的一个秩会存储$4096$个可训练标量,导致可行的低预算适配器大小之间存在较大间隙。本文探讨具有更细容量增量的张量化适配器是否会改变观察到的精度-预算权衡。我们通过固定组件的规范多路分解(CP)张量适配器来实例化这个问题。在$32{\times}64{\times}32{\times}64$的张量化下,一个归一化的CP组件每个投影存储$193$个可训练标量,比LoRA的一个秩步长小约21倍。我们在OPT-1.3B上,在匹配的目标模块、训练协议、数据上限和种子调度下,比较了CP适配器和LoRA在SST-2、RTE和BoolQ上的表现。CP训练稳定,并填补了LoRA秩之间的空白,但效果依赖于任务:SST-2早期达到低预算平台,BoolQ在略低于LoRA饱和之前受益于额外的CP组件,而RTE仍然偏好LoRA。因此,更细的参数步长有助于诊断PEFT预算敏感性,但它们本身并不能保证更好的精度-预算曲线。

英文摘要

Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. For a $2048{\times}2048$ OPT attention projection, increasing LoRA by one rank stores $4096$ trainable scalars, leaving large gaps between feasible low-budget adapter sizes. This paper asks whether a tensorized adapter with finer capacity increments changes the observed accuracy--budget trade-off. We instantiate this question with fixed-component canonical polyadic (CP) tensor adapters. Under a $32{\times}64{\times}32{\times}64$ tensorization, one normalized CP component stores $193$ trainable scalars per projection, about $21$ times smaller than one LoRA rank step. We compare CP adapters and LoRA on OPT-1.3B across SST-2, RTE, and BoolQ under matched target modules, training protocol, data caps, and seed schedules. CP trains stably and fills the gaps between LoRA ranks, but the effect is task-dependent: SST-2 reaches an early low-budget plateau, BoolQ benefits from additional CP components before saturating slightly below LoRA, and RTE remains LoRA-favored. Finer parameter steps are therefore useful for diagnosing PEFT budget sensitivity, but they do not by themselves guarantee a better accuracy--budget curve.

2606.00424 2026-06-02 cs.AI 版本更新

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

弱批评者造就强学习者:用于可扩展监督的在策略批评蒸馏

Can Jin, Jiakang Li, Rui Wu, Eddy Zhang, Dimitris N. Metaxas

发表机构 * University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校) UC Berkeley AI Lab(加州大学伯克利分校人工智能实验室)

AI总结 提出在策略批评蒸馏(OPCD)方法,利用弱模型作为批评者提供修订方向,通过自适应自教师信号蒸馏批评引导的行为,提升强模型在推理和对齐基准上的表现。

详情
AI中文摘要

随着大型语言模型变得更强,弱监督者可能无法为复杂输出提供可靠的标签、偏好或最终判断,限制了弱到强的泛化和可扩展监督。我们研究了一种更易处理的弱监督形式:使用弱模型作为批评者,而不是作为标注者或评判者。弱批评者不需要解决任务或选择正确答案,只需提供非误导性的修订方向,帮助强模型更好地利用自身知识。我们将这种设置称为*弱批评者强监督*。我们首先表明,弱批评可以在推理时改进冻结的强模型,并且批评质量是这种改进的关键。然后,我们提出渐进式在策略批评蒸馏(**OPCD**),它过滤高质量的批评,并通过自适应自教师信号将批评引导的行为蒸馏到强模型中。在推理和对齐基准上的实验表明,我们的方法在训练轮次中改进了强模型,为使用弱监督的可扩展监督提供了一条有效路径。

英文摘要

As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (**OPCD**), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.

2606.00417 2026-06-02 cs.NI cs.AI 版本更新

AgentxGCore: Agentic AI for Next-Generation Mobile Core Network

AgentxGCore:面向下一代移动核心网络的智能体AI

Maria Katarine Santana Barbosa, Kelvin L. Dias

发表机构 * Centro de Informática - Universidade Federal de Pernambuco(计算机中心 - 佩鲁巴科联邦大学)

AI总结 本文提出AgentxGCore,通过智能体AI原生层扩展3GPP架构,利用多智能体系统实现基于实时信息的闭环优化,支持自组织和自适应。

Comments This paper has been accepted for publication in IEEE Network

详情
AI中文摘要

为满足新兴应用的严格要求以及日益复杂的网络管理和操作,下一代移动网络(NextG)或6G将在核心网(CN)上采用AI原生架构。在此进程中,第三代合作伙伴计划(3GPP)已通过新功能扩展蜂窝CN,作为集成分析、人工智能(AI)和机器学习的第一步。然而,这些新功能受限于集中式方法和管理复杂性。此外,随着大型语言模型(LLM)的兴起,网络编排和管理进入新时代,利用并赋能基于意图的网络(IBN)范式。同时,AI智能体和智能体AI集成了推理与行动(ReAct),使得能够利用此类意图持续与网络交互。与主要采用智能体AI来缓解CN中部署和配置复杂性的现有方法不同,本文介绍了AgentxGCore,它利用智能体AI原生层扩展3GPP架构,并基于超越下一代核心网(xGC)域中的现有API构建系统。该提案建立了基于实时信息的AI驱动闭环,用于持续优化,实现自组织和自适应。我们的方法涉及一个多智能体专用系统,分为网络规划智能体(能够可视化网络状态并制定满足意图的计划)和网络执行器(负责批评并执行计划)。为验证所提方案,使用开源CN、异构数据集构建了环境,并采用不同的LLM来证明其有效性。

英文摘要

To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generation Mobile Networks (NextG), or 6G, will adopt an AI-native architecture on the Core Network (CN). In this movement, the Third Generation Partnership Project (3GPP) has extended the cellular CN with new function as a first step toward integrating analytics, Artificial Intelligence (AI), and machine learning. However, those new functionalities are constrained by a centralized approach and managerial complexity. Furthermore, with the rise of Large Language Models (LLMs), a new era in network orchestration and management begins, leveraging and empowering the Intent-based Networking (IBN) paradigm. In addition, AI agents and Agentic AI integrate Reasoning and Acting (ReAct), enabling the usage of such intents to continuously interact with the network. Unlike state-of-the-art approaches that primarily employ Agentic AI to mitigate deployment and configuration complexity in the CN, this paper introduces AgentxGCore, which leverages an Agentic AI-Native layer to extend the 3GPP architecture and enable a system based on the existing APIs across the Beyond Next Generation Core (xGC) domain. This proposal establishes an AI-driven closed-loop for continuous optimization based on real-time information, enabling self-organization and self-adaptation. Our approach involves a multi-agent specialized system, divided into a network planner agent, capable of visualizing the network state and developing a plan to meet the intents, and a network executor, responsible for criticizing and executing the plan. To validate the proposed solution, an environment was built using an open-source CN, heterogeneous datasets, and different LLMs were employed to demonstrate its effectiveness.

2606.00408 2026-06-02 cs.CL cs.AI cs.IR 版本更新

Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

掩盖过时观察帮助搜索智能体——直到适得其反:一个机制图谱及其机制

Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley

发表机构 * UC San Diego(加州大学圣地亚哥分校) UC Berkeley(加州大学伯克利分校) Texas A&M University(德克萨斯大学) UIUC(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文通过系统实验发现,在长时域搜索智能体中掩盖过时观察的效果呈现非对称倒U型曲线,取决于检索器召回率与模型隐式过滤能力的交互,并揭示了其背后的令牌-轮次权衡机制。

Comments 47 pages, 7 figures

详情
AI中文摘要

长时域搜索智能体在多次工具调用中积累大量检索内容,使得上下文预算效率日益重要。一种最小干预措施是在轨迹推进过程中掩盖过时观察,但尚不清楚这种上下文管理何时有帮助及其原因。我们通过系统扫描不同智能体骨干(4B到284B参数)和三个检索器,在离线和在线智能搜索基准上研究观察掩盖。我们发现,当以无上下文管理时的模型准确率为横轴时,掩盖带来的准确率提升呈非对称倒U形:在弱检索器下出现平台期,在强检索器与中等容量模型相遇时达到峰值,在模型饱和时急剧下降。这种模式反映了检索器召回率与模型隐式过滤能力之间的交互,而非单一因素。机制上,掩盖实现了令牌-轮次权衡:它移除了模型基本不再关注的观察以及智能体很少重新打开的页面。当增加的轮次将失败转化为成功时,它们有帮助;但当掩盖移除了模型本会使用的证据时,它们会失败。因此,我们将上下文管理重新定义为一种依赖机制的干预,并为分析智能深度搜索中的上下文使用提供了整体视角。我们在此发布我们的框架和轨迹(https://github.com/i-DeepSearch/observation-masking)以支持未来研究。

英文摘要

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.

2606.00402 2026-06-02 stat.ME cs.AI stat.AP 版本更新

A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering

基于重写的人类文本检测的无分布框架:通过Knockoff过滤

Yi Liu

发表机构 * Prorata.ai

AI总结 提出一种无分布统计框架,将任意基于重写的检测器转化为具有有限样本FDR保证的检测器,无需重新训练,通过将重写检测视为具有knockoff结构的多重假设检验问题实现。

详情
AI中文摘要

我们提出了一种无分布统计框架,该框架无需重新训练即可将任意基于重写的检测器转化为具有有限样本FDR保证的检测器。我们的关键观察是,基于重写的检测隐式地构建了knockoff样本,使得LLM生成的文本检测可以被表述为具有knockoff结构的多重假设检验问题。这一视角将检测统计量的设计与错误发现的控制分离开来,通过一个简单的校准过程,使现有的重写检测器能够继承有限样本错误发现率(FDR)保证。我们在三个检测模型、19个领域和四个LLM上展示了可靠的FDR控制和有意义的检测能力。

英文摘要

We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR guarantees without retraining. Our key observation is that rewrite-based detection implicitly constructs knockoff samples, enabling LLM-generated text detection to be formulated as a multiple hypothesis testing problem with knockoff structure. This perspective separates the design of detection statistics from the control of false discoveries, allowing existing rewrite detectors to inherit finite-sample false discovery rate (FDR) guarantees through a simple calibration procedure. We demonstrate reliable FDR control with meaningful detection power across three detection models, 19 domains, and four LLMs.

2606.00392 2026-06-02 cs.LG cs.AI 版本更新

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

通过约束策略优化实现检测器规避的LLM释义

Mingyi Wang, Zhuoer Shen, Yuheng Bu, Shaofeng Zou

发表机构 * School of ECEE, Arizona State University(亚利桑那州立大学电子工程与计算机科学学院) Department of Computer Science, University of California, Santa Barbara(加州大学圣巴巴拉分校计算机科学系)

AI总结 提出DEPO算法,将检测器规避的LLM释义建模为约束马尔可夫决策过程,通过拉格朗日对偶强化学习在保持语义的同时实现高效规避。

详情
AI中文摘要

AI文本检测器易受释义和检测器引导的释义攻击,但现有规避方法缺乏对语义保持的精确控制。特别是,直接优化检测器规避会降低细粒度语义,而标量化奖励设计仅提供间接、权重敏感的规避-语义权衡控制。我们通过将检测器规避的LLM释义建模为约束马尔可夫决策过程来解决这一限制,其中检测器规避是主要目标,语义保持作为显式约束强制执行。我们提出检测器规避策略优化(DEPO),一种拉格朗日原始-对偶强化学习算法,具有新颖的GRPO风格组基策略更新。DEPO在训练期间自适应平衡语义保持和检测器规避,使策略能够在规定的语义保持区域内提高攻击成功率。在MAGE、M4、RAID和同行评审数据集上的实验,针对MAGE、RoBERTa、RADAR、Binoculars和Fast-DetectGPT检测器进行评估,表明DEPO在精确满足语义保持约束的同时实现了强大的检测器规避。DEPO还表现出跨领域、跨检测器和提示级别的鲁棒性。

英文摘要

AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.

2606.00390 2026-06-02 cs.CV cs.AI 版本更新

Zamba2-VL Technical Report

Zamba2-VL 技术报告

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 提出基于混合架构Zamba2的视觉语言模型Zamba2-VL,在图像理解等基准上媲美Transformer模型,且首次令牌延迟降低约一个数量级。

Comments 16 pages, 2 figures

详情
AI中文摘要

我们提出Zamba2-VL,这是一套基于Zamba2构建的视觉语言模型,Zamba2是一种混合语言模型架构,结合了Mamba2状态空间层和少量共享的Transformer块。在广泛的图像理解、推理、OCR、定位和计数基准测试中,Zamba2-VL与同等规模的主流基于Transformer的开源VLM(包括Molmo2、Qwen3-VL和InternVL3.5系列)具有竞争力,并且显著优于之前的基于SSM和混合的VLM,如VL-Mamba、Cobra和mmMamba。继承了其Zamba2骨干网络的近线性预填充计算和小的、近乎恒定的循环状态,Zamba2-VL在匹配参数规模下,首次令牌延迟(TTFT)比这些Transformer基线低大约一个数量级,在最适合设备和边缘部署的较小1.2B和2.7B规模上效率差距最为明显。我们发布了三个模型——1.2B、2.7B和7B——以及推理代码,网址为https://huggingface.co/collections/Zyphra/zamba2-vl。

英文摘要

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.2B, 2.7B, and 7B -- together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.

2606.00380 2026-06-02 cs.CV cs.AI 版本更新

SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

SUPREME: 一个用于可复现图像遗忘方法评估的多GPU框架

Petros Andreou, Jamie Lanyon, Axel Finke, Georgina Cosma

发表机构 * Department of Computer Science, School of Science, Loughborough University(计算机科学系,科学学院,洛斯伯勒大学) School of Mathematics, Statistics and Physics, Newcastle University(数学、统计与物理学院,新卡克大学)

AI总结 提出SUPREME框架,通过多GPU分布式架构加速图像分类遗忘方法的评估,支持新方法注册和多精度模式。

Comments 17 pages. Code available at https://github.com/pedroandreou/supreme-unlearning

详情
AI中文摘要

机器遗忘旨在从已训练模型中移除特定训练数据的影响,而无需从头重新训练。评估遗忘方法需要在多个种子下重复训练、遗忘和评估,计算成本高昂。据我们所知,现有的图像分类遗忘框架在单个GPU上运行,限制了在合理时间内可评估的种子数量。我们提出SUPREME,一个开源框架,将这些阶段分布到多个GPU上。SUPREME做出三项贡献:基于注册表的设计,用于添加新方法、指标、模型和场景;支持多种加速器和精度模式的多GPU架构;以及在Pins Face Recognition上使用ResNet18和ViT在十种种子下进行全类和随机样本遗忘的演示。该框架可在https://github.com/pedroandreou/supreme-unlearning获取。

英文摘要

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time. We introduce SUPREME, an open-source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry-based design for adding new methods, metrics, models, and scenarios; a multi-GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full-class and random-sample unlearning across ten seeds. The framework is available at https://github.com/pedroandreou/supreme-unlearning.

2606.00376 2026-06-02 cs.AI cs.CL cs.LG 版本更新

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

确定性视界:当扩展推理失败时工具委托变得必要

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过注意力瓶颈定理和确定性视界概念,证明解码器-only注意力在确定性状态追踪任务中存在信息论容量限制,导致扩展推理性能退化,并指出当视界超过19-31时工具委托成为必要。

Comments Accepted at ICML 2026. 4 figures. 51 pages including appendices

详情
AI中文摘要

扩展的思维链推理可能会在确定性状态追踪任务上降低性能,这不是由于偏好偏差,而是源于解码器-only注意力的信息论容量限制。我们建立了:(1) 注意力瓶颈定理及互补的可达性构造,将状态追踪容量界定为 $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$;(2) 一个上下文相关的错误模型,导致超指数精度衰减;(3) 状态空间Jaccard度量,区分能力与偏好失败;(4) 确定性视界 $d^* \in [19, 31]$,超过该视界工具委托变得必要。在12个模型和8个任务领域(包括SWE-Bench、WebArena和SQL-Multi)中,工具集成推理始终优于神经思维链;在主要模型套件上,其准确率达到86-94%,而神经思维链仅为24-42%。在最优长度轨迹上进行微调仅带来<5%的提升,证实了架构上限,并且高跨模型相关性($r = 0.81$-$0.91$)表明这些失败是架构性的而非训练特定的。我们的结果为在代理系统中纯神经推理何时应让位于混合方法提供了原则性指导。

英文摘要

Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields $<$5% improvement, confirming an architectural ceiling, and high cross-model correlation ($r = 0.81$-$0.91$) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.

2606.00370 2026-06-02 cs.HC cs.AI 版本更新

Agentic Authoring of Interactive Multiview Visualizations in Genomics

交互式多视图基因组学可视化的智能体创作

Astrid van den Brandt, Kiroong Choe, Sehi L'Yi, Devin Lange, Nils Gehlenborg

发表机构 * Harvard Medical School(哈佛医学院) Boston College(波士顿学院)

AI总结 针对基因组学可视化创作中定制化不足和编程门槛高的问题,提出基于大语言模型的智能体方案,通过结构化输出和迭代优化提升可视化质量。

Comments 11 pages, 12 figures

详情
AI中文摘要

多样化的基因组学数据、科学问题和分析任务通常需要高度专业化的可视化。因此,用户通常必须定制或创作适合其数据的新可视化。现有工具要么定制能力有限,要么需要大量学习或编程,即使表达力强的工具也假设用户具备可视化专业知识,而许多用户缺乏这一点。智能体和大型语言模型方法越来越多地应用于复杂的科学任务,包括可视化。自然语言对话界面为复杂可视化的创作民主化提供了一条有希望的途径。在基因组学背景下,这些方法面临额外挑战:基因组学可视化通常整合异构数据类型,并由多个链接的交互式视图组成。这些挑战促使我们设计更结构化的基于LLM的方案。我们首先描述了普通LLM生成在基因组学可视化中成功和失败的地方,确定了八个质量维度。然后,我们比较了六种方案——直接生成、固定流水线和四种智能体配置(在专业智能体数量和是否存在审查者方面有所不同)——跨越159个案例,涵盖三个查询模糊性和规范复杂性级别。所有方案都使用Gosling可视化语法作为结构化输出。智能体迭代在感知质量上显著优于两个基线,而更复杂的智能体架构没有带来额外收益。我们讨论了为特定领域可视化创作设计智能体系统的启示。所有补充材料可在https://osf.io/uqe83获取。

英文摘要

Diverse genomics data, scientific questions, and analysis tasks typically demand highly specialized visualizations. Therefore, users often must customize or author new ones tailored to their data. Existing tools are usually either limited in customization or require substantial learning or programming, and even expressive tools assume visualization expertise many users lack. Agentic and large language model (LLM) approaches are increasingly applied to complex scientific tasks, including visualization. Natural-language conversational interfaces offer a promising path to democratizing the authoring of complex visualizations. In the context of genomics, these approaches face additional challenges: genomics visualizations typically integrate heterogeneous data types and are composed of multiple linked interactive views. These challenges motivate more structured LLM-based schemes. We first characterize where vanilla LLM generation succeeds and fails for genomics visualization, identifying eight quality dimensions. We then compare six schemes--direct generation, a fixed pipeline, and four agentic configurations varying in the number of specialist agents and the presence of a reviewer--across 159 cases spanning three levels of query ambiguity and specification complexity. All schemes use the Gosling visualization grammar as structured output. Agentic iteration substantially improves perceived quality over both baselines, while more complex agent architectures yield no additional benefit. We discuss implications for designing agentic systems for domain-specific visualization authoring. All supplemental materials are available at https://osf.io/uqe83.

2606.00367 2026-06-02 cs.LG cs.AI 版本更新

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

长期决策问题中基于成对偏好的强化学习

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

发表机构 * School of Computer Science, McGill University, Montreal, Quebec, Canada(麦吉尔大学计算机科学学院) Mila - Quebec AI Institute, Montreal, Quebec, Canada(魁北克人工智能研究所) Department of Electrical Engineering, Stanford University, Stanford, California, USA(斯坦福大学电气工程系)

AI总结 针对长期决策问题中基于成对偏好的强化学习效率低且缺乏马尔可夫策略最优性保证的问题,提出马尔可夫决策竞赛模型,证明平稳马尔可夫策略最优性、求解复杂度为P,并给出亚线性收敛算法,在高维长期问题中显著提升学习效率。

详情
AI中文摘要

强化学习问题通常将目标定义为最大化标量奖励函数的期望值。但是,成对偏好通常比标量奖励更容易指定,并且它们表达了标量奖励无法表达的某些目标。因此,基于成对偏好的强化学习方法受到了越来越多的关注。不幸的是,这些方法在长时间跨度的任务中效率低下,并且缺乏关于马尔可夫策略相对于历史依赖策略的性能保证,而这连接了强化学习的理论与实践。因此,我们提出了 extit{马尔可夫决策竞赛}作为基于成对偏好的强化学习的新问题模型。我们证明了平稳马尔可夫策略在所有历史依赖策略中是最优的,精确求解马尔可夫决策竞赛属于P类问题,并且一个简单的迭代算法以亚线性速率收敛到最优策略。最后,在一组具有长时间跨度的高维决策问题中,我们展示了我们的近似算法在学习效率上显著优于先前的工作。

英文摘要

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

2606.00350 2026-06-02 cs.LG cs.AI 版本更新

Drift Q-Learning

Drift Q-Learning

Anas Houssaini, Mohamad H. Danesh, Amin Abyaneh, Scott Fujimoto, Hsiu-Chin Lin, David Meger

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克AI研究院)

AI总结 提出DriftQL,通过漂移正则化与Q学习结合,在离线强化学习中避免分布外动作,单步生成动作,性能优于扩散和流方法。

详情
AI中文摘要

离线强化学习需要从固定数据中改进策略,同时避免具有不可靠价值估计的分布外动作。扩散和流策略通过建模行为分布来正则化强化学习目标以处理这种权衡,但它们需要迭代去噪、求解器集成,并且在更高效的变体中,推理时需要蒸馏或其他近似。我们提出DriftQL,它将基于漂移的行为正则化器与评论家驱动的策略改进相结合。价值信号将策略偏向数据支持的高价值区域,而吸引和排斥共同使生成的动作接近数据并防止坍缩到单一模式。DriftQL实现为具有统一训练目标的单一网络,并在单次前向传播中生成动作。在D4RL和OGBench上,DriftQL持续优于扩散和流方法,推进了最先进水平。在数据质量下降(基线明显挣扎)的情况下,DriftQL保持接近其干净数据性能,使其成为扩散和流方法的有前途的替代方案,同时保持确定性方法的简单性和效率。项目页面:https://driftql.github.io/

英文摘要

Offline reinforcement learning requires improving a policy from fixed data while avoiding out-of-distribution actions with unreliable value estimates. Diffusion and flow policies handle this trade-off by modeling the behavior distribution to regularize the RL objective, but they require iterative denoising, solver integrations, and in more efficient variants, distillation or other approximations at inference. We propose DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement. The value signal biases the policy toward high-value regions of the data support, while attraction and repulsion together keep generated actions near the data and prevent collapse onto a single mode. DriftQL is implemented as a single network with a unified training objective and generates actions in a single forward pass. On D4RL and OGBench, DriftQL consistently outperforms diffusion and flow methods, advancing the state of the art. Under degraded data quality, where the baselines visibly struggle, DriftQL remains close to its clean-data performance, positioning it as a promising alternative to diffusion and flow-based methods while maintaining the simplicity and efficiency of deterministic approaches. Project page: https://driftql.github.io/

2606.00349 2026-06-02 cs.LG cs.AI cs.CE 版本更新

(HB-ARFM) History-Bootstrapped Flow Matching for Inverse Boiling Reconstruction

(HB-ARFM) 基于历史引导的流匹配用于逆沸腾重建

Xianwei Zou, Sheikh Md Shakeel Hassan, Arthur Feeney, Aparna Chandramowlishwaran

发表机构 * arXiv

AI总结 提出历史引导自回归流匹配方法,通过条件流匹配和自回归传播解决部分观测下的时空逆重建问题,在沸腾动力学重建中优于其他模型。

Comments ICML 2026

详情
AI中文摘要

从部分观测中重建时空场是科学推理的基础,例如从卫星数据推断大气状态或从成像恢复流体状态。当观测不完整时,逆问题本质上是病态的:即使底层PDE动力学在全状态上是马尔可夫的,部分观测算子也会诱导出非马尔可夫的后验,无法从单个时间步解析。我们提出了一种历史引导自回归流匹配方法,用于部分可观测性下的时空逆重建。观测历史通过条件流匹配引导初始重建,减少歧义。然后自回归地应用相同的条件传输模型,以新观测和过去预测为条件,将重建向前传播。我们在沸腾动力学重建上评估该方法,从界面几何和运动恢复完整的速度和温度场。在两个不同观测稀疏性的逆任务中,HB-ARFM产生了物理和时间上有效的重建,而其他模型则失败。

英文摘要

Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states from satellite data to recovering fluid states from imaging. When observations are incomplete, the inverse problem is fundamentally ill-posed: even when the underlying PDE dynamics are Markovian in the full state, partial observation operators induce a non-Markovian posterior that cannot be resolved from a single timestep. We propose a history-bootstrapped autoregressive flow matching (HB-ARFM) for spatiotemporal inverse reconstruction under partial observability. Observation history bootstraps the initial reconstruction via conditional flow matching, reducing ambiguities. The same conditional transport model is then applied autoregressively, conditioning on both new observations and past predictions to propagate the reconstruction forward in time. We evaluate the method on boiling dynamics reconstruction, recovering full velocity and temperature fields from interface geometry and motion. Across two inverse tasks with varying observation sparsity, HB-ARFM produces physically and temporally valid reconstructions where other models fail.

2606.00341 2026-06-02 cs.LG cs.AI 版本更新

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

ROGUE:源于普通计算机使用的错误对齐代理行为

Jeremy Tien, Abishek Anand, Yu-Rou Tuan, Yuchen Shen, J. Zico Kolter, Aran Nayebi

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究AI代理在良性环境中因任务完成而采取不安全行为(违反可纠正性)的问题,通过基准测试发现前沿模型普遍绕过用户中断或限制,且性能提升反而加剧错误对齐。

Comments 27 pages, 13 figures

详情
AI中文摘要

随着AI代理越来越多地部署在真实的个人和企业环境(电子邮件账户、开发工作流、公司数据库等)中,围绕这些代理的安全考虑变得至关重要。尽管许多工作集中在存在对手时的代理安全性上,但我们表明,即使在良性环境中,代理也可能表现出错误对齐的行为,在那些行为对任务完成有帮助时采取不安全的行动。我们通过可纠正性(即代理保持对人类纠正、中断或关闭的顺从性的安全要求)的视角研究这种失败模式。为了证明这种倾向,我们引入了一个基准测试,其中代理被要求完成现实的计算机使用任务,但面临一个可纠正性障碍:人类中断、登录页面或关闭通知。然后我们评估代理是否选择违反可纠正性以完成任务——覆盖人类、访问私人密码、重新接线关闭。我们发现,绝大多数测试的前沿模型经常绕过用户中断或限制。此外,更好的模型性能似乎导致更大的错误对齐。最后,即使模型最初完全可纠正,我们表明它们创建的子代理也不能保证如此。我们的工作强调了在自主代理中需要基于原则的、专注于可纠正性的对齐方法的迫切性。

英文摘要

As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work has focused on agent safety in the presence of an adversary, we show that agents can exhibit misaligned behavior even in benign settings, taking unsafe actions when those actions are instrumental to task completion. We study this failure mode through the lens of corrigibility, the safety desideratum that agents remain amenable to human correction, interruption, or shutdown. To demonstrate this tendency, we introduce a benchmark in which agents are asked to complete realistic, computer-use tasks but are confronted with a corrigibility obstacle: a human interrupt, a login page, or a shutdown notification. We then evaluate whether agents choose to violate corrigibility in order to complete the task -- overriding the human, accessing private passwords, rewiring shutdown. We find that the overwhelming majority of frontier models tested frequently bypass user interruptions or restrictions. In addition, better model performance appears to lead to greater misalignment. Finally, even when models are completely corrigible initially, we show there are no guarantees that the subagents they create are. Our work highlights the critical need for principled, corrigibility-focused alignment methods in autonomous agents.

2606.00336 2026-06-02 cs.AI cs.LG 版本更新

From Noise to Control: Parameterized Diffusion Policies

从噪声到控制:参数化扩散策略

Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris, Yilun Du, Bruno Castro da Silva

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出参数化扩散策略(PDP),通过学习行为流形上的低维连续参数条件化扩散策略,将扩散从随机多样性机制转化为精确可优化的行为引导工具,实现策略间的平滑插值和新约束下的高效适应。

详情
AI中文摘要

我们提出参数化扩散策略(PDP),这是一个学习扩散策略的框架,该策略以嵌入在学习行为流形中的低维连续参数为条件。通过构建该流形,使得潜在表示之间的距离反映物理轨迹之间的语义相似性,我们将扩散从随机多样性机制转化为精确且可优化的行为引导工具。我们的方法能够实现已知策略之间的平滑插值,并在不更新策略权重的情况下高效适应新约束。我们证明,与标准扩散策略相比,PDP在模拟和真实机器人实验的复杂多模态基准测试中显著提高了适应性能,特别是在需要合成新行为的场景中。

英文摘要

We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous parameters embedded in a learned behavior manifold. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors.

2606.00334 2026-06-02 cs.CL cs.AI 版本更新

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

隔离LLM词汇偏差:一种无人工标注的三角化偏好学习阶段度量

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出一种无需人工标注的三角化偏好偏移分数(Triangulated Preference Shift score),通过对比人类标准、基础模型和指令变体,量化偏好学习阶段引入的词汇偏差。

Comments 7 pages, 2 figures, 1 table

详情
Journal ref
The International FLAIRS Conference Proceedings, 39(1) (2026)
AI中文摘要

近年来,各种语言领域发生了显著变化;这些变化很大程度上归因于大型语言模型的出现及其与自然语言使用的不对齐。这些不对齐部分源于偏好学习阶段,例如从人类反馈中强化学习,这通常使模型更有用,但同时也可能引入系统性词汇偏差。在词汇行为方面,这体现在模型对某些格式的偏好或过度使用某些词汇(如delve、furthermore),即使这些模式在基础模型输出中并不存在。关于偏好训练引起的词汇不对齐的研究受限于对人工标注的依赖。我们通过引入三角化偏好偏移分数来解决这一问题,该度量在人类黄金标准、基础模型和指令变体之间进行三角化,以隔离偏好学习引起的特定偏移,无需人工标注。我们提供了六个模型家族的数据,将结果锚定在文献中,并通过分析偏好学习是否将模型推向可解释为“威望语言”的方向,展示了该通用方法的实用性。该度量提供了一种初始的自动化方法来量化偏好调整引起的行为偏移,从而有助于指导模型对齐和可信AI的开发。

英文摘要

Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this, by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach's utility by analyzing whether preference learning shifts models toward what could be interpreted as a "language of prestige". The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning, and thus, may help inform model alignment and development of trustworthy AI.

2606.00324 2026-06-02 cs.IR cs.AI 版本更新

LLMs Need Encoders for Semantic IDs Too

LLM 也需要语义 ID 的编码器

Xiangyi Chen, Zelun Wang, Xinyi Li, Yi-Ping Hsu, Jaewon Yang, Jiajing Xu

发表机构 * Pinterest United States(Pinterest美国公司)

AI总结 提出 PrefixMem,一种基于前缀 n-gram 记忆表的轻量级语义 ID 编码器,为 LLM 提供结构化、前缀条件的表示,显著提升生成推荐中的语义 ID 准确率和检索召回率。

详情
AI中文摘要

多模态 LLM 使用专用编码器来桥接非语言模态(图像用视觉编码器,音频编解码器令牌用深度模型),因为原始令牌嵌入无法捕获模态特定的结构。我们认为语义 ID(SID),即生成推荐中使用的层次化代码,构成了另一种这样的模态:SID 级别令牌的含义取决于其前缀上下文,但当前系统只是将 SID 令牌添加到词汇表中,并依赖训练从头学习这些上下文相关的含义。我们提出 PrefixMem,一种基于前缀 n-gram 记忆表的轻量级 SID 编码器,它在 SID 令牌位置为 LLM 提供结构化、前缀条件的表示。与多模态 LLM 中的视觉编码器类似,PrefixMem 可以独立预训练,然后附加到任何 LLM 上进行联合训练。我们在 Pinterest 的大规模数据上,跨多个 LLM 家族进行评估,结果表明,在相同的训练计算量下,PrefixMem 将最深层次 SID 准确率相对提升高达 46%,完整 SID 检索召回率相对提升高达 22%。编码器的优势集中在贪心解码失败的困难样本上,准确率相对提升高达 77%,这证实了 SID 令牌与其他非语言模态一样,受益于专用编码器。

英文摘要

Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens) because raw token embeddings alone cannot capture modality-specific structure. We argue that Semantic IDs (SIDs), the hierarchical codes used in generative recommendation, constitute another such modality: a SID level token's meaning depends on its prefix context, yet current systems simply add SID tokens to the vocabulary and rely on training to learn these context-dependent meanings from scratch. We propose PrefixMem, a lightweight SID encoder based on prefix n-gram memory tables that provides the LLM with structured, prefix-conditioned representations at SID token positions. Like vision encoders in multimodal LLMs, PrefixMem can be pre-trained independently and then attached to any LLM for joint training. We evaluate on large-scale data from Pinterest across multiple LLM families and show that PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at matched training compute. The encoder's benefit concentrates on hard examples where greedy decoding fails, with up to 77% relative accuracy gains, confirming that SID tokens benefit from a dedicated encoder just as other non-language modalities do.

2606.00315 2026-06-02 cs.AI cond-mat.mtrl-sci 版本更新

Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials

将语言模型与基于物理的模拟相结合用于无机材料的合成

Edward W. Staley, Tom Arbaugh, Michael Pekala, Alexander New, Christopher D. Stiles, Nam Q. Le, Gregory Bassen, Wyatt Bunstine, Tyrel McQueen

发表机构 * Johns Hopkins Applied Physics Laboratory(约翰霍普金斯应用物理实验室) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出一种结合热力学数据库与简化动力学模型的混合框架,评估大语言模型在无机材料合成规划中的表现,以铌氧体系为例证明其能生成更可行的合成策略。

Comments Accepted to the AI for Accelerated Materials Design (AI4Mat) Workshop at Neurips 2025

详情
AI中文摘要

现代生成式机器学习模型能够提出具有目标性质的新型无机晶体材料;然而,由于相关物理过程的复杂性和计算工具的有限性,这些材料的合成规划仍然困难。我们引入了一种新颖的混合框架,通过将热力学数据库与简化动力学模型相结合来近似真实的合成条件,从而评估大语言模型在无机合成规划中的表现。作为案例研究,我们聚焦于铌氧体系,该体系包含多个具有良好表征数据的工业相关氧化物相。在计算模拟中,我们将LLM生成的合成路线与经典路径规划算法进行比较,表明LLM中的隐式先验能够产生更可行的策略。在我们的评估设置中,经典搜索方法主要作为对比而非直接竞争者。这说明了问题的相对复杂性,并突出了LLM隐式先验的附加价值。

英文摘要

Modern generative machine learning (ML) models can propose novel inorganic crystalline materials with targeted properties; however, synthesis planning of these materials remains difficult due to the complexity of the associated physical processes and limited availability of computational tools. We introduce a novel hybrid framework to evaluate Large Language Models (LLMs) in inorganic synthesis planning by combining thermodynamic databases with simplified kinetics models to approximate realistic synthesis conditions. As a case study, we focus on the niobium-oxygen system, which features multiple industrially relevant oxide phases with well-characterized data. In computational simulations, we compare LLM-generated synthesis routes with classical path-planning algorithms, showing that the implicit priors in LLMs can yield more viable strategies. In our evaluation setting, classical search methods serve primarily as a foil rather than a direct competitor. This illustrates the relative complexity of the problem and highlights where the LLM's implicit priors add value.

2606.00313 2026-06-02 cs.RO cs.AI 版本更新

DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties

基于深度强化学习的双阿克曼机器人驱动不确定性下的位姿控制

Oussama Zaim, Mélodie Daniel, Aly Magassouba, Miguel Aranda, Olivier Ly

发表机构 * Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800(波尔多大学、法国国家科学研究中心、波尔多国立理工学院、LaBRI研究所、UMR 5800) School of Computer Science, University of Nottingham, UK(诺丁汉大学计算机科学学院) Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza(阿ragón工程研究所(I3A)、萨拉戈萨大学)

AI总结 针对双阿克曼转向移动机器人在驱动不确定性下的控制问题,提出基于ManeuverNet框架的位姿控制扩展,采用sim-to-sim-to-real方法结合多环境DRL(SAC和CrossQ)学习鲁棒策略,显著缩小仿真到现实的性能差距。

Comments 6 pages, 4 figures, 2 tables, Accepted for Uncertainty in Open-World Robotics an IEEE International Conference on Robotics & Automation (ICRA 2026) workshop

详情
AI中文摘要

由于仿真与现实动力学之间的差异,深度强化学习策略在实际机器人上的鲁棒部署仍然具有挑战性。我们针对双阿克曼转向移动机器人的机动问题处理这一问题,这类机器人因其非完整特性引入了额外约束。基于DRL框架ManeuverNet,我们将其目标从位置控制扩展到完整的位姿控制,从而产生更具挑战性的任务。我们进一步研究了驱动相关不确定性对策略迁移的影响。在扩展策略训练期间使用简化驱动模型可能导致泛化能力差,表现为在更严格的评估条件下,成功率从PyBullet中的100%下降到Gazebo中的25%。为解决这一限制,我们采用sim-to-sim-to-real方法,将在Gazebo中观察到的驱动效应纳入PyBullet训练环境。通过使用SAC和CrossQ的多环境DRL,我们学习到即使在建模不准确的情况下也能保持鲁棒的策略。该方法可以显著缩小不同仿真器之间的性能差距,在Gazebo中实现高达92%的成功率,并在更严格阈值下保持69%的成功率,且无需额外调整即可成功迁移到真实机器人。

英文摘要

Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation and real-world dynamics. We address this issue in the context of maneuvering with double-Ackermann-steering mobile robots, which introduce additional constraints due to their non-holonomic nature. Building upon the DRL framework ManeuverNet, we extend its objective from position control to full pose control, resulting in a more challenging task. We further investigate the impact of actuation-related uncertainties on policy transfer. The use of simplified actuation models during training of the extended policy can lead to poor generalization, shown by a success rate drop from 100% in PyBullet to 25% in Gazebo under stricter evaluation conditions. To address this limitation, we adopt a sim-to-sim-to-real approach, where actuation effects observed in Gazebo are incorporated into the PyBullet training environment. Using multi-environment DRL with SAC and CrossQ, we learn policies that remain robust despite modeling inaccuracies. This approach can significantly reduce the performance gap across simulators, achieving up to 92% success rate in Gazebo and maintaining 69% under stricter thresholds, with successful transfer to a real robot without additional tuning.

2606.00308 2026-06-02 cs.SE cs.AI cs.LG 版本更新

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

生成架构如何塑造多智能体LLM系统中的代码复杂度:基于HumanEval的配对研究

Nazmus Ashrafi

发表机构 * GitHub

AI总结 通过配对实验比较六种多智能体架构在HumanEval上的代码复杂度,发现架构复杂度与功能正确性无正相关,最简架构在准确率上持平或超越复杂架构。

Comments 16 pages, 7 figures, 7 tables

详情
AI中文摘要

大语言模型代码生成已从单次提示转向多智能体编排——分析师、编码员、测试员和调试器流水线——并且几乎完全根据功能正确性进行评估。这些架构是否也影响它们生成代码的结构复杂度,以及哪些编排层承担了成本,在很大程度上仍未得到检验:先前的工作记录了提示级别对代码复杂度的影响,但架构级别的问题仍是开放的。我们在GPT-4o系列的两个模型下,针对所有164个HumanEval任务(1,968个配对观测),使用五个RADON复杂度度量(SLOC、圈复杂度以及Halstead体积、难度和努力),比较了六种广泛使用的多智能体配置(Basic、AC、ACT、Debugger、AC+Debugger、ACT+Debugger)。我们在所有完成和仅通过条件下应用了配对非参数统计流程(Friedman总体检验、Wilcoxon符号秩事后检验与Holm校正、Kendall's W和配对秩双列效应量)。六种架构坍缩为两个不可区分的复杂度簇,间隔50-130%的差距,在两个模型和两种条件下分区相同;在架构层中,分析师-编码员分割增加了复杂度,运行时调试器没有——并且在分析师-编码员背景下主动降低复杂度——而测试员则重新增加复杂度。重簇的额外复杂度并未带来pass@1优势:最简架构在准确率上匹配或超越最重架构。因此,LLM代码生成中的架构细化应通过所关注维度上的实测收益来证明,而非假设。

英文摘要

Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and debugger pipelines - and is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt-level effects on code complexity, but the architecture-level question is open. We compare six widely-used multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT-4o family across all 164 HumanEval tasks - 1,968 paired observations - using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall's $W$ and matched-pairs rank-biserial effect sizes) in both all-completions and passing-only conditions. The six architectures collapse into two indistinguishable complexity clusters separated by a 50-130% gap, the same partition in both models and under both conditions; among the architectural layers, the analyst-coder split inflates complexity, the runtime debugger does not - and on the analyst-coder background actively deflates it - and the tester re-inflates it. The heavy cluster's additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy. Architectural elaboration in LLM code generation should therefore be justified by measured benefit on the dimensions that matter, not assumed.

2606.00306 2026-06-02 cs.LG cs.AI 版本更新

Rethinking the Role of Temperature in Large Language Model Distillation

重新思考温度在大语言模型蒸馏中的作用

Hoang-Chau Luong, Lingwei Chen

发表机构 * Golisano College of Computing and Information Sciences(戈利萨诺计算与信息科学学院) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 本文通过分析温度τ对前向KL散度和反向KL散度在LLM蒸馏中的不对称影响,发现高温下FKL优于RKL,并证明温度能提升多种蒸馏目标,使简单KL方法达到先进水平。

详情
AI中文摘要

反向KL散度在大语言模型蒸馏中比前向KL更受欢迎,但这种偏好主要基于忽略温度τ的比较,忽视了其在软化教师分布和改进知识转移中的核心作用。本文重新审视LLM蒸馏中的温度,发现它从根本上改变了FKL和RKL的比较。我们的分析揭示了一种不对称效应:温度显著丰富了FKL中的非主导令牌信号,而主要重新缩放RKL梯度,导致FKL从τ缩放中获益远多于RKL。这种不对称推翻了标准经验结论:尽管在τ=1时RKL优于FKL,但在指令遵循基准测试中,高温下FKL始终超过RKL。此外,温度的影响不仅限于FKL;它改进了更广泛的蒸馏目标,使简单的基于KL的方法能够与最近最先进的LLM蒸馏方法竞争。

英文摘要

Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $τ$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $τ$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $τ=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

2606.00305 2026-06-02 cs.CL cs.AI 版本更新

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

通过近未来指导桥接在线策略蒸馏中的推理轨迹

Yuxuan Jiang, Francis Ferraro

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩分校)

AI总结 针对在线策略蒸馏中令牌级学习信号无法有效桥接推理轨迹的问题,提出基于近未来轨迹信息的轨迹感知在线策略蒸馏方法,显著提升大语言模型推理性能。

详情
AI中文摘要

在线策略蒸馏通过让学生在教师监督下从其自身策略采样的轨迹上进行训练,改进了大语言模型的推理能力。尽管OPD基于轨迹操作,但其学习信号仍然是令牌级的:它通过高损失令牌识别偏差,并通过局部反向KL校正进行修复。我们表明,这种“轨迹采样但令牌学习”的机制无法可靠地将学生轨迹桥接至教师轨迹。约30%的高损失令牌落入低发散区域,表明许多是表面形式不匹配而非真正的推理分叉。此外,即使是真正发散的令牌也难以通过孤立的令牌级监督修复,因为推理失败通常表现为短视的分布漂移。我们提出轨迹感知OPD,它利用近未来轨迹信息识别真正的发散状态,并将指导分布到多个未来令牌上。实验表明,抑制非发散的高损失令牌将标准OPD的平均准确率从47.8%提升至48.2%,而TOPD进一步将性能提升至52.2%,在AIME24上从60.0%提升至63.3%,在AIME25上从46.7%提升至53.3%。

英文摘要

On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.

2606.00299 2026-06-02 cs.CV cs.AI 版本更新

Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

Real2SAM2Real: 生成式3D缓存作为视频扩散的互补上下文

Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos

发表机构 * University of Maryland(马里兰大学)

AI总结 提出Real2SAM2Real框架,通过3D提升模型提取可编辑的3D缓存作为几何支架,结合软空间对齐注入和微调策略,实现视频扩散模型对相机轨迹和多实体运动的精确解耦控制。

详情
AI中文摘要

虽然视频扩散模型(VDM)在合成高保真视频方面表现出色,但实现精确的相机和场景控制仍然具有挑战性。现有方法主要依赖隐式扩散先验来生成未观察区域,在高动态运动或复杂遮挡期间不可避免地导致结构崩溃。为了解决这一挑战,我们提出了Real2SAM2Real框架,该框架利用3D提升模型(例如SAM3D)提取显式可编辑的3D缓存,作为VDM的稳健几何支架。通过捕获前景实体的整个3D体积而不仅仅是其可见外壳,该缓存将整体空间先验注入VDM,为复杂场景动态提供可靠的3D感知指导。为了有效利用这种3D指导同时保留预训练先验,我们设计了一种软空间对齐注入机制以及一种针对VDM量身定制的微创微调策略。此外,我们采用掩码法线图作为跨模态桥梁,构建了无3D数据的数据整理和扰动流程。大量实验表明,Real2SAM2Real能够对相机轨迹和多实体运动实现精确、解耦的控制。通过利用生成式3D缓存的互补上下文,我们的框架克服了因过度依赖扩散先验而导致的典型崩溃,在大的相机位移和严重遮挡下保持了卓越的时空一致性。关键的是,通过将几何与外观解耦,我们为VDM定制的3D缓存消除了由结构空洞和错误立面引起的视角歧义,以及反射和折射引起的误导性线索。项目网站见https://jiayi-wu-leo.github.io/real2sam2real。

英文摘要

While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at https://jiayi-wu-leo.github.io/real2sam2real

2606.00282 2026-06-02 cs.IR cs.AI 版本更新

Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems

跨域事件的合成数据用于大规模推荐系统

Xiangyu Wang, Yawen He, Shivendra Pratap Singh, Han Huang, Mengtong Hu, Sharath Ciddu, Yi-Hsuan Hsieh, Erik Groving, Yi Ding, Jieming Di, Tony Wang, Min Yun, Xiaoyu Chen, Ling Leng, Rob Malkin

发表机构 * Meta

AI总结 提出SCALR框架,通过源域事件生成目标域的合成用户-物品交互事件,以缓解数据稀疏和噪声反馈问题,并在工业推荐平台的在线A/B测试中取得显著改进。

Comments 13 pages, 3 figures

详情
AI中文摘要

大规模推荐系统在多个域中运行,但面临数据稀疏和噪声隐式反馈的挑战。传统方法通过从源域到目标域的特定模型知识蒸馏来缓解这一问题。受大型语言模型(LLM)中合成数据生成的变革性成功启发,我们提出了用于推荐的合成跨域增强与学习(SCALR)框架,该框架通过利用源域中的观察事件,为目标推荐域生成合成用户-物品交互事件。SCALR将跨域学习分解为两个模块化阶段。首先,它通过将事件生成视为估计用户在源域中观察到的交互条件下与目标域物品交互的可能性,来翻译源域中的观察用户事件。其次,下游模型将这些合成事件作为跨域学习目标进行训练,其中合成事件以模型无关的方式增强目标域的训练数据。我们的方法在工业推荐平台的在线A/B测试中取得了统计显著的改进。据我们所知,这是首批明确将跨域事件转移作为推荐系统合成数据生成的工作之一。

英文摘要

Large-scale recommendation systems operate across diverse domains, yet they face the challenges of data sparsity and noisy implicit feedback. Traditional approaches mitigate this via model-specific knowledge distillation from source domains to a target domain. Inspired by the transformative success of synthetic data generation in large language models (LLMs), we introduce Synthetic Cross-domain Augmentation and Learning for Recommendation (SCALR), a framework that generates synthetic user-item interaction events for a target recommendation domain by leveraging observed events from a source domain. SCALR decomposes cross-domain learning into two modular stages. First, it translates observed user events in source domains by framing event generation as estimating the likelihood that a user would interact with a target-domain item, conditioned on their observed interactions in a source domain. Second, downstream models train on these synthetic events as cross-domain learning objectives, where the synthetic events augment the target domain's training data in a model-agnostic manner. Our approach yields statistically significant improvements in online A/B tests on an industrial recommendation platform. To the best of our knowledge, this is among the first works to explicitly frame cross-domain event transfer as synthetic data generation for recommendation systems.

2606.00278 2026-06-02 cs.AI 版本更新

Evaluating Bivariate Causal Statements Based on Mutual Compatibility

基于相互兼容性评估双变量因果陈述

Erik Jahn, Dominik Janzing

发表机构 * California Institute of Technology, USA(加州理工学院,美国)

AI总结 针对一组变量上的双变量因果陈述,提出基于兼容性分数的评估方法,通过量化因果模型引入的额外混杂程度或全局一致性约束来区分正确与错误的因果陈述,并应用于大语言模型生成的因果主张。

Comments accepted for ICML 2026

详情
AI中文摘要

对于许多现实世界系统,因果真相难以获得,使得关于因果效应的主张难以评估。我们开发了评估一组 $n$ 个变量上 $inom{n}{2}$ 个双变量因果陈述集合的方法。在线性无环陈述的设定下,任何这样的集合都可以扩展为唯一的多元因果模型,但我们认为如果该诱导模型为了解释观测相关性而施加了实质性的额外混杂,则其不可信。我们引入了一个兼容性分数来量化这种可信度概念,特别是不依赖于忠实性假设。此外,我们基于从无环性和忠实性假设导出的全局一致性约束,为纯图的双变量因果陈述定义了一个不兼容性分数。我们提供了理论和经验证据,表明这两个分数在一般设置中能够成功区分正确和错误的因果陈述。此外,我们通过分析大语言模型做出的因果主张,展示了我们方法的实际适用性。我们的工作旨在为评估从人类专家或人工智能获得的因果信息的可靠性提供基础,特别是在无法获得其他形式验证的情况下。

英文摘要

For many real-world systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop methods for evaluating collections of $\binom{n}{2}$ bivariate causal statements over a set of $n$ variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it imposes substantial additional confounding to explain observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. Additionally, we define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints that are derived from acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can successfully distinguish correct from incorrect causal statements in generic settings. Moreover, we demonstrate the practical applicability of our methods by analyzing causal claims made by large language models. Our work aims to provide a foundation for assessing the reliability of causal information derived from human experts or artificial intelligence in settings where alternative forms of validation are unavailable.

2606.00275 2026-06-02 cs.CV cs.AI 版本更新

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

超几何与证据优先专家用于大型视觉-语言模型

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

发表机构 * China University of Petroleum (Beijing)(中国石油大学(北京)) Hainan Institute of China University of Petroleum (Beijing)(中国石油大学(北京)海南学院) South China Normal University(华南师范大学)

AI总结 针对大型视觉-语言模型中视觉与语言模态的不对称性,提出AsyMoE架构,通过超几何跨模态专家和证据优先语言专家分别建模层级关系与保持上下文基础,在减少参数的同时提升性能。

详情
AI中文摘要

大型视觉-语言模型(LVLMs)通过扩展架构和大量训练在多模态任务上展现了令人印象深刻的性能。近期研究将混合专家(MoE)引入LVLMs以提高计算效率。然而,现有的MoE方法以对称架构处理视觉和语言模态,忽视了这两种模态处理中的固有不平衡性。这种不平衡性导致两个关键问题。首先,文本和视觉形成层级而非并行关系,因为文本查询通常描述完整视觉场景的部分方面。欧几里得专家空间难以编码这种包含结构。其次,深层语言专家逐渐从基于证据的处理转向参数记忆依赖,失去对提供的视觉和语言信息的立足点。为解决这些问题,我们提出AsyMoE,一种通过三个专门专家组显式建模这种不平衡性的新型架构。模态内专家处理模态特定处理。超几何跨模态专家通过负曲率几何捕获层级跨模态关系。证据优先语言专家抑制参数记忆激活并在整个网络深度中保持上下文基础。大量实验表明,AsyMoE相比基线方法取得一致改进,平均比MoE变体提升1.5%,在幻觉敏感任务上提升高达3.8%。与密集模型相比,AsyMoE激活参数减少25.45%。

英文摘要

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5\% over MoE variants and up to 3.8\% on hallucination-sensitive tasks. AsyMoE activates 25.45\% fewer parameters compared to dense models.

2606.00272 2026-06-02 cs.AI cs.CL cs.CY 版本更新

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

在周三,我们提问:优化自动化法律分诊与转介中的“主动倾听”

Quinten Steenhuis, Jacqueline Harvey

发表机构 * Suffolk University Law School(苏福克大学法学院)

AI总结 本文通过专家律师和LLM评估FETCH分类器的追问问题方法,发现低成本LLM在分类任务中表现良好但生成高质量问题需高成本模型,并提出法律分诊问题评估标准。

Comments Working paper submitted as accepted to AIDA2J workshop at International Conference for AI and Law in Singapore, June 2026

详情
AI中文摘要

FETCH分类器生成追问问题,以帮助优化申请人法律问题的最佳匹配,使用低成本LLM集成。在本文中,我们描述了专家律师和LLM辅助评估FETCH中的追问问题方法,并表明虽然低成本LLM在分类任务中表现良好,但在这种情况下生成高质量的通俗语言问题似乎需要更复杂和更高成本的模型。通过与法律接待工作人员的讨论,我们提出了法律接待分类问题的评估标准,并发现仅靠提示工程不足以提高接待目的的问题质量。我们还发现LLM作为评判者与人类评分存在分歧。我们证明,通过添加单个高成本模型GPT-5,分类器可以从寻求法律帮助的申请人那里引出相关信息,并且这些问题导致分类任务更准确的性能。我们还发现不同类别(包括家庭暴力)的事实引出不均匀,与家庭法筛查规程不一致,这表明在某些法律领域纳入专门筛查小组的价值。

英文摘要

The FETCH classifier generates follow-up questions to help refine the best match for the applicant's legal problem, using a low-cost ensemble of LLMs. In this paper, we describe an expert attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for the evaluation of legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the questions lead to more accurate performance at classification tasks. We also find uneven fact elicitation across different categories, including domestic violence, at odds with family law screening protocols, suggesting the value of including dedicated screening panels for certain areas of law.

2606.00270 2026-06-02 cs.AI cs.LG cs.LO 版本更新

Robust Shielding for Safe Reinforcement Learning

用于安全强化学习的鲁棒屏蔽

Edwin Hamel-De le Court, Thom Badings, Alessandro Abate, Francesco Belardinelli, Francesco Fabiano

发表机构 * Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Faculty of Computer Science & DSME, RWTH Aachen University(亚琛工业大学计算机科学与DSME学院) Department of Computer Science, University of Oxford(牛津大学计算机科学系) Department of Computing, Imperial College London(伦敦帝国理工学院计算系)

AI总结 提出一种针对鲁棒MDP的屏蔽框架,通过线性时序逻辑公式在最坏情况下的概率阈值保证安全性,并证明其可靠性与最优性。

详情
AI中文摘要

屏蔽是一种在马尔可夫决策过程(MDP)中正式保证强化学习智能体安全性的有效方法。然而,现有的屏蔽技术通常假设已知安全相关的转移动态——这一要求在现实中很少得到满足。为了解决这一限制,我们引入了一种针对鲁棒MDP(RMDP)的新型屏蔽框架,即具有转移概率集合的MDP。我们将安全性定义为在RMDP的最坏情况转移概率下,以一定阈值概率满足线性时序逻辑(LTL)公式。我们证明,我们的屏蔽框架对于RMDP既是可靠的又是最优的:屏蔽允许的每个策略都是安全的,反之,每个安全的RMDP策略都被屏蔽允许。我们将我们的方法与现有的用于学习具有可能近似正确(PAC)保证的MDP转移概率的采样方法相结合。这种组合使得能够为MDP构建屏蔽,这些屏蔽在高置信度下保证安全性,同时保持最小限制性。我们的实验表明,我们为学习的RMDP构建的屏蔽在未知MDP中保证安全性,同时随着样本数量的增加恢复出强的期望回报。

英文摘要

Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.

2606.00269 2026-06-02 cs.AI 版本更新

Closed-Loop Neural Activation Control in Vision-Language-Action Models

视觉-语言-动作模型中的闭环神经激活控制

Abhijith Babu, Ramneet Kaur, Nathaniel D. Bastian, Olivera Kotevska, Susmit Jha, Yanzhao Wu, Sumit Kumar Jha, Anirban Roy

发表机构 * Florida International University(佛罗里达国际大学) SRI International(美国桑尼沃德国际研究机构) United States Military Academy(美国军事学院) Oak Ridge National Laboratory(橡树岭国家实验室) University of Florida(佛罗里达大学)

AI总结 提出CTRL-STEER闭环框架,通过自适应时变控制信号替代固定干预系数,实现更稳定的概念调节和任务成功率。

Comments Accepted at the IEEE/CVF CVPR 2026 Workshop on Visual Concepts (VisCon). 25 pages, 8 figures, including supplementary material

详情
AI中文摘要

视觉-语言-动作(VLA)模型可以通过干预语义上有意义的内部方向在测试时被引导,但现有方法使用固定的引导系数,实际上以开环方式运行。这不适于具身控制,因为任务状态和概念误差随时间演变,通常导致过度校正、振荡和任务成功率降低,特别是对于速度和平滑度等时间行为。我们提出CTRL-STEER,一个闭环框架,用自适应时变控制信号替代静态干预强度。关键思想是将表示与调节解耦:不假设时间概念由单个神经元直接控制,而是沿着运动对齐的残差方向引导,同时反馈控制器在线调整干预幅度。我们使用基于PID和强化学习的控制器实例化该框架。在四个LIBERO任务套件上对微调的OpenVLA策略进行的实验表明,CTRL-STEER实现了更稳定的概念调节和比固定系数基线更好的引导-任务成功率权衡,而无需修改或重新训练基础模型。

英文摘要

Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly suited to embodied control, where task state and concept error evolve over time, often causing overcorrection, oscillation, and reduced task success, especially for temporal behaviors such as speed and smoothness. We propose CTRL-STEER, a closed-loop framework that replaces static intervention strength with adaptive, time-varying control signals. The key idea is to decouple representation from regulation: rather than assuming temporal concepts are directly controlled by individual neurons, we steer along motion-aligned residual directions while a feedback controller adjusts intervention magnitude online. We instantiate this framework with both PID and reinforcement learning based controllers. Experiments with a fine-tuned OpenVLA policy on four LIBERO task suites show that CTRL-STEER achieves more stable concept regulation and a better steering-task success trade-off than fixed-coefficient baselines, without modifying or retraining the base model.

2606.00267 2026-06-02 cs.CV cs.AI cs.LG cs.RO 版本更新

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

StressDream: 引导视频世界模型实现鲁棒的策略评估与改进

Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

发表机构 * Carnegie Mellon University(卡内基梅隆大学) NVIDIA Research(NVIDIA研究) University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出StressDream方法,通过优化扩散视频世界模型的初始噪声,在推理时引导生成高影响且合理的未来场景,以支持鲁棒的策略评估与改进。

Comments Project page: https://junwon.me/StressDream/

详情
AI中文摘要

视频世界模型通过想象以自我机器人动作为条件的真实未来观察,在策略评估与改进方面展现出潜力。虽然世界模型可以对未来的分布进行建模,但策略评估与改进通常依赖于名义上的想象,这可能会遗漏机器人动作的高影响结果,除非抽取大量样本。为了实现对世界模型想象的鲁棒策略评估与改进,我们提出StressDream,该方法通过在推理时优化扩散世界模型的初始噪声,将想象引导至高影响且合理的结果。然而,优化高维噪声具有挑战性:优化必须推理生成视频中细微的、场景相关的目标事件,同时避免产生不合理想象的分布外噪声。我们通过两个互补目标来解决这一问题:一个语义目标,利用视觉语言模型通过推理生成视频提供信息丰富的梯度;一个合理性目标,防止优化后的噪声漂移到分布外。利用用于自动驾驶和机器人操作的最先进的视频世界模型,我们展示了StressDream能够有效地将想象引导至推理时由文本指定的高影响且合理的结果,例如任务失败,从而通过识别那些合理未来包含不良结果的动作,实现鲁棒的策略评估与改进。视频结果见https://junwon.me/StressDream/。

英文摘要

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

2606.00262 2026-06-02 cs.LG cs.AI stat.AP stat.ML 版本更新

When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE

当 Softmax 在顶部失效:InfoNCE 的极值修正

Melihcan Erol, Suat Evren, Oktay Ozel, Alexander Morgan, Jongha Jon Ryu, Lizhong Zheng

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对 InfoNCE 中 softmax 假设与对比学习嵌入设置不匹配的问题,提出基于极值理论的 WEINCE 修正方法,在五个视觉基准上提升冻结特征评估性能。

Comments Presented in ICML 2026

详情
AI中文摘要

InfoNCE 是标准的对比学习目标,但其 softmax 形式不仅是一种计算便利:它还编码了关于如何选择最高分示例的统计假设。利用极值理论,我们表明这一假设通常与现代对比学习中使用的归一化嵌入设置不一致。受此不匹配的启发,我们提出了 extsc{WEINCE},这是 InfoNCE 的一个简单修改,它使用锚点在线批次统计将通常的 softmax 对数与端点短缺修正混合,不增加可训练参数。在五个视觉基准上, extsc{WEINCE} 在冻结特征评估中产生了一致的改进。这些结果表明,对困难负样本进行更忠实的统计处理可以改进对比目标。

英文摘要

InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a statistical assumption about how the top-scoring example is selected. Using extreme value theory, we show that this assumption is often misaligned with the normalized embedding setting used in modern contrastive learning. Motivated by this mismatch, we propose \textsc{WEINCE}, a simple modification of InfoNCE that uses anchor-wise online batch statistics to blend the usual softmax logits with an endpoint shortfall correction, adding no trainable parameters. Across five vision benchmarks, \textsc{WEINCE} yields consistent improvements in frozen-feature evaluation. These results show that a more faithful statistical treatment of hard negatives can improve contrastive objectives.

2606.00257 2026-06-02 cs.LG cs.AI 版本更新

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

ARCA: 当令牌信号退化时的适配器-残差信用分配

Rodney Lafuente-Mercado

发表机构 * Rodney Lafuente-Mercado(罗伊德·拉福恩特-默茨)

AI总结 针对LoRA微调下令牌级信用分配信号退化的问题,提出ARCA方法,利用适配器隐藏状态残差作为令牌显著性度量,无需学习奖励模型或价值头。

Comments Accepted to DEMO 2026: ICML Workshop on Decision-Making from Offline Datasets to Online Adaptation. Non-archival report

详情
AI中文摘要

语言模型强化学习的令牌级信用分配通常被表述为策略完全可训练,而实际的LLM-RL流程往往依赖于参数高效微调,尤其是LoRA。我们认为这种分离隐藏了一种结构性失效模式。在LoRA下,策略被限制在参考模型的低秩邻域内,因此常用内在信用信号(如惊奇度、熵减和策略散度)所依赖的每令牌输出分布差异,在轨迹内归一化后可能变得退化,要么接近均匀权重,要么集中在少量与任务无关的位置上。我们形式化了这种行为,并提出直接用浓度诊断指标(如权重基尼系数和有效令牌比率)进行测量。然后,我们引入了适配器-残差信用分配(ARCA),一种轻量级替代方案,它从适配器自身的隐藏状态残差 $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$ 中推导令牌显著性。ARCA关注适配器实际改变模型的位置,而不是输出分布显得不确定或偏移的位置,并且不需要学习奖励模型、价值头或树结构。在紧凑的MATH/Qwen3-1.7B GRPO扫描中,ARCA在匹配的轨迹预算下表现出预测的非退化中间区域信用分布,并与秩匹配的基线保持竞争力。

英文摘要

Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals, surprisal, entropy reduction, and policy divergence, can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce \emph{Adapter-Residual Credit Assignment} (ARCA), a lightweight alternative that derives token salience from the adapter's own hidden-state residual, $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.

2606.00251 2026-06-02 cs.AI 版本更新

Capability Self-Assessment: Teaching LLMs to Know Their Limits

能力自我评估:教会LLM了解自身局限

Haoyan Yang, Reza Shirkavand, Yukai Jin, Jiawei Zhou, Shangqian Gao, Heng Huang

发表机构 * Stony Brook University(石溪大学) University of Maryland(马里兰大学) Florida State University(佛罗里达州立大学)

AI总结 本文提出能力自我评估(CSA)问题,通过强化学习训练大语言模型准确判断自身能力边界,显著优于监督微调且不损害原始能力。

详情
AI中文摘要

识别自身局限性并决定是解决问题还是委托他人,是可靠智能系统的基础。然而,我们表明现代大语言模型系统性地缺乏这种能力:在不同模型家族和规模中,它们高估自身能力并尝试无法解决的查询。我们将这种能力称为能力自我评估(CSA),并将其表述为一个策略学习问题,旨在提高自我评估能力同时保留模型的原始能力。我们的结果表明,强化学习能有效教会CSA,显著优于监督微调,同时保留原始能力。相比之下,监督微调严重损害了模型本应评估的能力。此外,学习到的自我评估行为在分布外也能很好地泛化,表明CSA是一种可迁移的模型特质。最后,CSA具有实际用途:它在推理时改善本地-云端决策,并在训练期间为针对性数据选择提供信号。

英文摘要

The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We refer to this ability as Capability Self-Assessment (CSA) and formulate it as a policy-learning problem, aiming to improve self-assessment while preserving the model's original capabilities. Our results show that reinforcement learning teaches CSA effectively, significantly outperforming supervised fine-tuning while preserving original capabilities. In contrast, supervised fine-tuning severely degrades the capabilities the model is meant to assess. Moreover, learned self-assessment behavior generalizes well out of distribution, suggesting that CSA is a transferable model trait. Finally, CSA is practically useful: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.

2606.00250 2026-06-02 cs.CL cs.AI cs.HC 版本更新

Effects of Varying LLM Access on Essay Writing Behavior

不同LLM访问权限对论文写作行为的影响

Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 通过控制LLM访问权限(无访问、有限访问、无限访问)的随机实验,发现有限访问能保持学生作者信心和写作策略,而无限访问降低创意表达。

Comments BEA (Building Educational Applications) Workshop 2026

详情
AI中文摘要

调查大型语言模型(LLM)对大学教学和学习的影响程度,有助于确定整合LLM的策略,以支持而非削弱学生的学习成果。本研究考察了不同水平的LLM辅助如何影响写作表现、参与度和感知作者身份。我们报告了一项初步研究,其中24名大学生被随机分配,在无LLM访问、有限访问(最多3次提示,每次回复限制100字)或无限访问条件下撰写一篇短文。各组之间的整体论文质量在统计上无显著差异。然而,写作行为和感知作者身份出现显著分化:有限访问组的学生报告了更高的所有权(62.5%愿意将论文作为独立作品提交,而无限访问组为25%)、更强的组织收益以及更具策略性和以修改为中心的提示。无限访问组花费更多时间写作,产生的论文与LLM输出更相似,并报告了创意表达减少。我们的研究结果表明,限制而非禁止LLM访问,可以在保留AI辅助的支架优势的同时,保持作者的信心。

英文摘要

Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.

2606.00248 2026-06-02 cs.AI 版本更新

Geodesic Flow Matching for Denoising High-Dimensional Structured Representations

用于去噪高维结构化表示的测地流匹配

Karim Habashy, Chris Eliasmith

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对向量符号代数中空间语义指针的流形约束问题,提出测地流匹配方法,将去噪流限制在环面上,在脉冲神经SLAM系统中实现72%的跟踪误差降低和40%的神经效率提升。

Comments ICML 2026 Main track

详情
AI中文摘要

向量符号代数通过将符号信息编码为高维分布式表示,实现了鲁棒的神经符号推理。对于连续域,空间语义指针通过将变量映射到连续环面流形来扩展这一框架。然而,流匹配等标准方法假设平坦的欧几里得几何,未能考虑有效SSP状态上的几何约束。我们证明这一假设对SSP不成立:欧几里得线性插值“切割”流形内部,破坏了准确解码所需的相位和幅度结构。为解决此问题,我们采用测地流匹配,调整黎曼传输动力学以严格限制去噪流在SSP环面流形上。我们在脉冲神经SLAM系统中验证了该方法,表明流形感知的清理稳定了路径积分以抵抗漂移。与竞争基线相比,该方法实现了72%的跟踪误差降低和40%的神经效率提升。代码可在https://github.com/kremHabashy/CleanupSSP获取。

英文摘要

Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information into high-dimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto continuous toroidal manifolds. However, standard approaches like Flow Matching assume a flat Euclidean geometry, which fails to account for the geometric constraints imposed on valid SSP states. We demonstrate that this assumption fails for SSPs: Euclidean linear interpolants ``cut through" the manifold's interior, destroying the phase and magnitude structure required for accurate decoding. To resolve this, we employ Geodesic Flow Matching, adapting Riemannian transport dynamics to strictly restrict the denoising flow to the SSP toroidal manifold. We validate this approach in a Spiking Neural SLAM system, showing that manifold-aware cleanup stabilizes path integration against drift. The method achieves a 72\% reduction in tracking error and enables a 40\% increase in neural efficiency compared to competitive baselines. Code is available at https://github.com/kremHabashy/CleanupSSP .

2606.00241 2026-06-02 cs.LG cs.AI stat.ML 版本更新

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

InfoAtlas:用于零样本统计依赖性估计的基础模型

Zhengyang Hu, Yanzhi Chen, Hanxiang Ren, Qunsong Zeng, Youyi Zheng, Adrian Weller, Kaibin Huang, Yanchao Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出InfoAtlas,一种基础模型架构,通过单次前向传播直接推断互信息,实现零样本估计,在保持精度的同时获得100倍加速。

Comments Accepted to ICML 2026

详情
AI中文摘要

测量高维随机变量之间的统计依赖性是数据科学和机器学习中的基本任务。神经互信息(MI)估计器提供了一种有前景的途径,但它们通常需要对每个新数据集进行昂贵的迭代优化,这使得它们不适用于实时应用。我们提出了InfoAtlas,一种类似基础模型的架构,通过单次前向传播直接推断MI,消除了这一瓶颈。在大规模合成数据上预训练,具有丰富的依赖模式,InfoAtlas学习识别多样的依赖结构并直接从数据集中预测MI。全面的实验表明,InfoAtlas在准确性上匹配最先进的神经估计器,同时实现100倍加速,可以通过单个统一模型灵活处理不同维度和样本量,并有效推广到复杂的现实场景。通过将MI估计重新表述为推理任务,InfoAtlas为实时依赖性分析奠定了基础。

英文摘要

Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neural mutual information (MI) estimators offer a promising avenue, but they typically require costly iterative optimization for each new dataset, making them impractical for real-time applications. We present InfoAtlas, a foundation model-like architecture that eliminates this bottleneck by directly inferring MI in a single forward pass. Pretrained on large-scale synthetic data with rich dependence patterns, InfoAtlas learns to identify diverse dependence structures and predict MI directly from the dataset. Comprehensive experiments demonstrate that InfoAtlas matches state-of-the-art neural estimators in accuracy while achieving $100\times$ speedup, can flexibly handle varying dimensions and sample sizes through a single unified model, and generalizes effectively to complex, real-world scenarios. By reformulating MI estimation as an inference task, InfoAtlas establishes a foundation for real-time dependency analysis.

2606.00240 2026-06-02 cs.AI cs.MA 版本更新

MindZero: Learning Online Mental Reasoning With Zero Annotations

MindZero:零标注的在线心智推理学习

Shunchi Zhang, Jin Lu, Chuanyang Jin, Yichao Zhou, Zhining Zhang, Tianmin Shu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MindZero框架,通过自监督强化学习训练多模态大语言模型,实现高效鲁棒的在线心智推理,无需显式心智状态标注。

Comments ICML 2026. Website: https://scai.cs.jhu.edu/MindZero

详情
AI中文摘要

有效的现实世界辅助需要具备强大心智理论的AI智能体:从行为推断人类心智状态。尽管近期有进展,但仍存在几个关键挑战,包括(1)对多个假设进行鲁棒不确定性更新的在线推理;(2)适用于实时辅助的高效推理;(3)现实领域缺乏真实心智状态标注。我们通过引入MindZero应对这些挑战,这是一个自监督强化学习框架,训练多模态大语言模型进行高效鲁棒的在线心智推理。训练期间,模型因生成的心智状态假设能最大化由规划器估计的观察动作的可能性而获得奖励,类似于基于模型的心智理论推理。因此,该方法消除了对显式心智状态标注的需求。训练后,MindZero将基于模型的推理内化为快速的单次推理。我们在网格世界和家庭领域的挑战性心智推理和AI辅助任务中,将MindZero与基线进行了评估。我们发现仅靠大语言模型是不够的;基于模型的方法提高了准确性,但速度慢、成本高,并受限于骨干多模态大语言模型的能力。相比之下,MindZero增强了多模态大语言模型的内在心智理论能力,在准确性和效率上均显著优于基于模型的方法,表明心智推理可以作为一种自监督技能有效学习。

英文摘要

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.

2606.00235 2026-06-02 physics.soc-ph cs.AI cs.CY cs.MA 版本更新

Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence

文明超材料:能力梯度与结构湍流下的协调工程

David Orban

发表机构 * Independent Researcher(独立研究者)

AI总结 受超材料物理学启发,提出将治理从规范性学科转变为工程学科的正式框架,通过有效协调系数模型预测自愈与自失稳相变,并设计可检验假设与实验方案。

Comments 19 pages, 4 figures. Accepted for presentation at AGI-26 (Springer LNAI, forthcoming). v2 corrects the sign of the synergy term in the constitutive law (Eq. 2) and reformulates H3 as a threshold-crossing claim, per peer review

详情
AI中文摘要

我们认为治理必须从规范性学科转变为工程学科,并受超材料物理学启发,开发了一个正式框架,使这一转变量化和可检验。通用人工智能主要通过提高决策速度来影响文明,而人类验证能力仍然有限。当验证AI生成输出的成本超过基于其行动的预期效用时,理性主体默认不行动:我们称之为冻结均衡的稳定但灾难性的纳什均衡。借鉴超材料(其中涌现的宏观性质源于设计的微观结构),我们为制度协调建立了一个现象学本构定律:$R_{\mathrm{eff}} = \beta\cdot (1-\rho) \cdot (1-\tau) \cdot (1-\gamma\rho\tau)$,其中$\beta$是决策分支因子,$\rho$是来源保真度,$\tau$是验证率,$\gamma\in [0,1]$捕捉来源和验证失败之间相关检测的协同效应。该模型预测自愈($R_{\mathrm{eff}} < 1$)和自失稳($R_{\mathrm{eff}} > 1$)状态之间的尖锐相变。我们引入了一个三类来源分类法:密码学、制度性和上下文绑定,并推导出四个可证伪的假设,以及一个提议在政府拨款评审小组中进行的为期12周的阶梯楔形整群随机试验。该框架连接了AI对齐理论和制度设计。

英文摘要

We argue that governance must transition from a normative discipline to an engineering discipline, and we develop a formal framework, inspired by the physics of metamaterials, to make this transition quantitative and testable. Artificial General Intelligence affects civilization primarily by increasing decision velocity while human verification capacity remains bounded. When the cost of validating AI-generated outputs exceeds the expected utility of acting on them, rational agents default to inaction: a stable but catastrophic Nash equilibrium we term the Freezing Equilibrium. Drawing on metamaterials, where emergent macro-properties arise from designed microstructure, we develop a phenomenological constitutive law for institutional coordination: $R_{\mathrm{eff}} = β\cdot (1-ρ) \cdot (1-τ) \cdot (1-γρτ)$, where $β$ is the decision branching factor, $ρ$ is provenance fidelity, $τ$ is the verification rate, and $γ\in [0,1]$ captures correlated-detection synergy between provenance and verification failures. The model predicts a sharp phase transition between self-healing ($R_{\mathrm{eff}} < 1$) and self-destabilizing ($R_{\mathrm{eff}} > 1$) regimes. We introduce a three-class provenance taxonomy: cryptographic, institutional, and context binding, and derive four falsifiable hypotheses with a proposed 12-week stepped-wedge cluster-randomized trial in government grant review panels. The framework bridges AI alignment theory and institutional design.

2606.00232 2026-06-02 cs.AI cs.LG 版本更新

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

TIGER: 基于图证据路由的可追踪推理用于减轻多模态生成中的幻觉

Kaixiang Zhao, Tianrun Yu, Shawn Huang, Porter Jenkins, Yushun Dong, Amanda Hughes

发表机构 * Brigham Young University Florida State University

AI总结 提出TIGER框架,通过从输入和输出中独立提取观测图与声明图,并基于图条件风险评分修复高风险声明,以减轻多模态生成中的事实级幻觉。

Comments 25 pages, 7 figures, 16 tables. Under review

详情
AI中文摘要

我们研究多模态生成的事实级修复,其中流畅的输出可能包含输入不支持的具体事实。现有的推理时修复方法通常通过联合条件化输入和当前输出来生成反馈。这种设计有两个局限性:输出中的幻觉声明可能偏置模型对输入的解释,且自由形式的反馈无法在事实级别进行排序或调度。我们提出TIGER,一种重新设计反馈以进行局部修复的推理时框架。TIGER从输入中独立提取观测图,从当前输出中提取声明图,然后根据支持和冲突为每个声明分配图条件风险分数。模型修复选定的高风险声明,同时保持骨干网络冻结。我们提供收敛性分析,表明在温和假设下,期望总风险几何级数下降至显式渐近界。跨四个跨模态路径(包括图像到文本、图像+文本到文本、音频到文本和视频到文本)的实验表明,TIGER在保持任务质量的同时减少了不支持内容。该增益在多个骨干网络上成立,CrisisFACTS案例研究表明相同的修复机制可以改善多源设置中的接地性。

英文摘要

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.

2606.00220 2026-06-02 cs.PL cs.AI 版本更新

SEMBridge: Tagless-Final Program Semantics with Weakest-Precondition and Bounded-Checking Interpretations

SEMBridge: 具有最弱前置条件和有界检查解释的无标签最终程序语义

Eric Liang

发表机构 * Oracle

AI总结 提出SEMBridge框架,通过无标签最终风格统一生成可执行语义、最弱前置条件验证条件和有界检查,保持三者同步。

详情
AI中文摘要

形式化方法提供程序行为的严格描述,但实际软件工程通常通过可执行库、测试和增量设计工作。本文提出SEMBridge,一个小的无标签最终框架,用于从相同的可执行目标程序生成最弱前置条件和有界检查解释。不是将程序语义提交给一个抽象语法树然后编写单独的遍历,而是针对语义接口编写一次目标程序,并将其解释为多种含义:可读代码、具体执行、谓词变换器、有界反例搜索以及未来的证明助手或SMT后端。Python原型实现了一个无循环的命令式核心,包含赋值、条件、假设和断言。在五个示例程序上,相同的无标签最终定义生成了可执行状态变换器和验证条件,这些条件在多达729个状态的域上通过了有界检查。贡献不是Scala代码生成系统或新的验证器,而是一种紧凑的架构,用于保持可执行语义、最弱前置条件工件和有界验证同步。

英文摘要

Formal methods provide rigorous accounts of program behavior, but practical software engineering often works through executable libraries, tests, and incremental design. This paper presents SEMBridge, a small tagless-final framework for generating weakest-precondition and bounded-checking interpretations from the same executable object programs. Instead of committing a program semantics to one abstract syntax tree and then writing separate traversals, object programs are written once against a semantic interface and interpreted into multiple meanings: readable code, concrete execution, predicate transformers, bounded counterexample search, and future proof-assistant or SMT back ends. The Python prototype implements a loop-free imperative core with assignments, conditionals, assumptions, and assertions. Across five example programs, the same tagless-final definitions generated executable state transformers and verification conditions that passed bounded checking over domains up to 729 states. The contribution is not a Scala code-generation system or a new verifier, but a compact architecture for keeping executable semantics, weakest-precondition artifacts, and bounded validation synchronized.

2606.00202 2026-06-02 cs.LG cs.AI 版本更新

From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets

从Rashomon理论到PRAXIS:高效决策树Rashomon集

Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

发表机构 * Stanford University(斯坦福大学)

AI总结 针对决策树Rashomon集计算开销大的问题,提出PRAXIS算法,在运行时和内存使用上实现数量级改进,并能恢复几乎完整的Rashomon集。

Comments Accepted to ICML 2026

详情
AI中文摘要

标准机器学习流程通常会产生许多接近最优的模型。这些“Rashomon集”为不确定性感知的鲁棒决策带来了一系列挑战和机遇。它们允许用户整合领域知识和偏好,这些知识和偏好通常难以直接指定为目标函数,并且它们量化了给定训练数据集和目标函数下有效模型之间的多样性。然而,即使对于稀疏决策树这样简单、可解释的模型类,Rashomon集的计算仍然需要巨大的内存和运行时资源。我们提出了PRAXIS,一种近似该Rashomon集的算法,在运行时和内存使用上实现了数量级的改进。我们验证了PRAXIS通常能恢复几乎完整的Rashomon集。PRAXIS使研究人员和从业者能够可扩展地对真实世界数据集的Rashomon集进行建模。PRAXIS的代码可在https://github.com/zakk-h/PRAXIS获取。

英文摘要

Standard machine learning pipelines often admit many near-optimal models. These "Rashomon sets" pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets. Code for PRAXIS is available at https://github.com/zakk-h/PRAXIS

2606.00198 2026-06-02 cs.LG cs.AI cs.CL 版本更新

BAGEN: Are LLM Agents Budget-Aware?

BAGEN:LLM 智能体是否具有预算意识?

Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

发表机构 * Northwestern University(西北大学) O2 Lab(O2实验室) Independent(独立) University of Michigan(密歇根大学) Cornell(康奈尔大学) All Hands AI Stanford(斯坦福大学) UT Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出预算感知智能体(BAGEN)概念,将预算作为主动控制信号而非被动成本指标,通过渐进区间估计方法预测剩余预算上下界,并在四个环境和五个前沿模型上发现强模型不一定具有强预算意识、模型过度乐观等失败模式,早期停止可节省 28-64% 令牌,但精确区间校准仍具挑战。

详情
AI中文摘要

尽管智能体正在花费越来越多的资源,但如今智能体成本大多仅在执行后衡量。预算感知智能体(BAGEN)应将预算视为主动控制信号,而非被动成本指标。我们首先系统地将预算估计定义为内部预算(来自智能体计算)和外部预算(来自智能体动作)。然后,我们将预算意识形式化为渐进区间估计:在计划的每一步,智能体应预测剩余预算的上限和下限,并在完成可能性低时发出警报。通过 rollout-replay 协议进行评分,我们在四个环境和五个前沿模型上发现了一致的失败模式:(1)强模型不一定具有强预算意识,相关性 r=0.35。(2)前沿模型始终过度乐观,继续在不太可能成功的任务上花费资源,而不是尽早提醒用户。(3)预算感知信号是可操作且可训练的。早期停止在失败轨迹上节省 28-64% 的令牌,SFT+RL 增强了早期停止和警报行为。(4)精确区间校准仍然具有挑战性,SFT+RL 后区间覆盖率上限为 47%。项目页面:https://ragen-ai.github.io/bagen/

英文摘要

While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) frontier models are consistently over-optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget-aware signal is actionable and trainable. Early stop saves 28-64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen-ai.github.io/bagen/

2606.00189 2026-06-02 cs.LG cs.AI 版本更新

Learning to Construct Practical Agentic Systems

学习构建实用的智能体系统

Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, Cassandra A. Cohen, Arthur Kajiyama, William W. Cohen

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Dept. of Computer Science(计算机科学系) Emory University(埃默里大学)

AI总结 本文提出一种基于伪工具和固定工作流的智能体框架,通过模块化设计和多目标优化方法,在保证成本可控和结果质量的前提下,实现实用智能体系统的自动构建与优化。

详情
AI中文摘要

基于LLM的智能体系统的自动设计和优化能够产生复杂的系统,显著提升结果质量,优于现成的智能体模式。然而,对实际部署的智能体系统的研究表明,生产系统更关注推理成本的简单性、可控性和可预测性等问题。本文提出了设计和优化实用智能体系统的原则性方法。我们描述了一个智能体框架,通过定义在受限上下文中递归调用LLM的“伪工具”,使设计者能够强制智能体系统的模块化。利用该框架,我们为多种任务手工设计了智能体,并表明相对于动态规划的工作流,手工构建的固定工作流通常更便宜且更准确。随后,我们提出了针对该框架所需的智能体组件(即伪工具和固定工作流)的新型学习方法。这些学习方法通常优于手工设计的智能体。我们还利用框架的模块化特性,应用多目标优化方法联合优化成本和响应质量,并融合多个学习系统的结果。

英文摘要

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining "pseudo-tools" that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.

2606.00183 2026-06-02 cs.LG cs.AI math.OC stat.ML 版本更新

Agentic Transformers Provably Learn to Search via Reinforcement Learning

智能体Transformer通过强化学习可证明地学会搜索

Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Pennsylvania(宾夕法尼亚大学) The Ohio State University(俄亥俄州立大学) Yale University(耶鲁大学)

AI总结 本文通过构建双头Transformer实现随机深度优先搜索,并分析策略梯度训练动力学,证明该搜索机制能从稀疏强化反馈中分阶段涌现,且具备深度泛化能力。

详情
AI中文摘要

树搜索是许多语言智能体推理和决策任务的核心抽象:智能体必须探索动作、记住失败并回溯到有希望的替代方案。然而,我们缺乏对基于Transformer的策略如何从强化学习(RL)的训练动态中获得这种搜索能力的理论理解。我们在一个随机的$k$叉树环境中研究这个问题,其中智能体Transformer仅通过交互观察其轨迹历史,并在到达隐藏的叶子目标节点时获得终端奖励。我们首先构建了一个实现随机深度优先搜索(DFS)的双头Transformer:一个头跟踪之前的动作,而另一个头检测失败结果并触发回溯。然后,我们分析了在深度课程下的策略梯度训练动态,表明相同的DFS机制在没有专家演示的情况下,从稀疏强化反馈中分阶段涌现。得到的策略表现出深度泛化能力:仅在深度为1和2的树上训练后,它能在更深的完整树上成功。我们进一步表明,在非平衡目标分布下,对回报进行折扣会导致一种排序的DFS策略,优先考虑高概率分支。总的来说,我们的结果确定了基于Transformer的搜索的一种机制性标准形式,其中注意力头专门化并协作,从上下文中提取与决策相关的轨迹,并通过RL训练将其转化为智能体动作选择。

英文摘要

Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire such search capabilities from the training dynamics of reinforcement learning (RL). We study this question in a stochastic $k$-ary tree environment, where an agentic transformer observes only its trajectory history through interaction and receives a terminal reward for reaching a hidden leaf goal node. We first construct a two-head transformer that implements randomized depth-first search (DFS): one head tracks previous actions, while the other detects failure outcomes and triggers backtracking. We then analyze the training dynamics of policy gradient under a depth-wise curriculum, showing that this same DFS mechanism emerges in stages from sparse reinforcement feedback without expert demonstrations. The resulting policy exhibits depth generalization: after training only on depth-$1$ and depth-$2$ trees, it succeeds on deeper full trees. We further show that, under imbalanced goal distributions, discounting the return leads to a ranked DFS policy that prioritizes higher-probability branches. Overall, our results identify a mechanistic normal form for transformer-based search, in which attention heads specialize and cooperate to extract decision-relevant traces from context and convert them into agentic action selection via RL training.

2606.00180 2026-06-02 cs.LG cs.AI 版本更新

Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection

超越增强:基于评分引导的病理先验用于脑电图抑郁症检测

Xiaojing Chen, Jingqi Cheng, Xu Zhao, Wan Jiang, Jingjing Wu

发表机构 * School of Internet, Anhui University(安徽大学互联网学院) School of Computer Science and Technology, Hefei University of Technology(合肥工业大学计算机科学与技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院)

AI总结 针对脑电图抑郁症检测中的小样本困境,提出无数据增强的评分引导分类框架,利用生成网络建模病理先验并融合深度特征,同时设计跨通道空间适应模块解决多中心数据集硬件异构问题。

详情
AI中文摘要

基于深度学习的脑电图(EEG)重度抑郁症(MDD)检测从根本上受到“小样本困境”的制约。主流的生成式数据增强方法不仅带来沉重的计算开销,还可能引入合成噪声,从而模糊分类边界。为了挑战传统的“数据数量优先”惯例,我们提出了一种新颖的框架“超越增强”:评分引导分类(SGC)。SGC不合成伪样本,而是利用无监督生成网络架构对样本的结构和统计异常程度进行建模,作为核心的“病理先验”。该先验经过鲁棒归一化后,与深度特征表示显式融合,从而精确指导分类器的决策边界。此外,为了动态适应不同的通道配置,我们提出了跨通道空间适应模块,利用空间映射机制有效解决多中心数据集中不匹配通道的硬件异构问题。在Mumtaz2016和高密度MODMA数据集上的大量实验证明了我们的方法在具有挑战性的“零数据增强”设置和“零样本合成成本”下的有效性和卓越的泛化能力。

英文摘要

Deep learning-based Major Depressive Disorder (MDD) detection using Electroencephalography (EEG) is fundamentally constrained by the "small-sample dilemma." Prevailing generative data augmentation methods not only incur heavy computational overhead but also risk introducing synthetic noise, thereby blurring classification boundaries. To challenge the traditional "data quantity first" convention, we propose a novel framework "Beyond Augmentation": Score-Guided Classification (SGC). SGC does not synthesize pseudo-samples; instead, it utilizes an unsupervised generative network architecture to model the structural and statistical anomaly degrees of samples, serving as the core "Pathological Prior". This prior, after robust normalization, is explicitly fused with deep feature representations, thereby precisely guiding the classifier's decision boundary. Furthermore, to dynamically adapt to varying channel configurations, we propose a Cross-Channel Spatial Adaptation module, utilizing a spatial mapping mechanism to effectively resolve the hardware heterogeneity of mismatched channels in multi-center datasets. Extensive experiments on the Mumtaz2016 and high-density MODMA datasets demonstrate the effectiveness and exceptional generalizability of our method under the challenging "zero data augmentation" setting and at "zero sample synthesis cost". Keywords: Electroencephalography (EEG), Depression Detection, Anomaly Score, Diffusion Models, Few-Shot Learning

2606.00174 2026-06-02 cs.CV cs.AI 版本更新

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

MyoSem: 将肌电图与自然语言动作语义对齐以实现手部动作理解

Chiyue Wang, Dong She, Yang Gao, Zhanpeng Jin

发表机构 * South China University of Technology(华南理工大学)

AI总结 提出MyoSem框架,通过多视角动作语义构建、激活感知EMG编码和语义查询对齐,实现EMG信号与文本描述的双向检索,在多个数据集上优于基线方法并展现良好泛化性。

Comments 16 pages, 9 figures. Preprint

详情
AI中文摘要

肌电图(EMG)直接反映肌肉激活,是手势识别、假肢控制和可穿戴交互的关键传感模态。然而,现有的EMG方法通常将手部动作理解视为固定标签的分类问题,难以支持基于动作描述的查询、检索和泛化。我们提出MyoSem,一个EMG-动作语义对齐框架,将低层EMG信号映射到由多视角动作描述构建的共享语义空间。MyoSem结合多视角动作语义构建、激活感知EMG编码和语义查询对齐,实现了EMG信号与文本描述之间的双向检索。我们在EMG2Pose和NinaPro系列数据集上系统评估了MyoSem。结果表明,MyoSem在EMG-文本双向检索上表现良好,普遍优于大多数基线,并在未见用户、保留动作类别和截肢用户迁移场景中展现出良好的泛化性。消融实验和可视化进一步验证了每个模块的有效性。总体而言,MyoSem将基于EMG的手部动作理解从固定标签识别推进到可查询的双向语义检索,为语言介导的EMG动作理解提供了新的建模范式。

英文摘要

Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and wearable interaction. Existing EMG methods, however, commonly formulate hand action understanding as classification over fixed labels, making it difficult to support querying, retrieval, and generalization based on action descriptions. We present MyoSem, an EMG--action semantic alignment framework that maps low-level EMG signals into a shared semantic space constructed from multi-view action descriptions. MyoSem combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment, enabling bidirectional retrieval between EMG signals and text descriptions. We systematically evaluate MyoSem on EMG2Pose and NinaPro-series datasets. Results show that MyoSem performs well on EMG--text bidirectional retrieval, generally outperforms most baselines, and shows favorable generalization to unseen users, held-out action classes, and amputee-user transfer scenarios. Ablations and visualizations further validate the effectiveness of each module. Overall, MyoSem advances EMG-based hand action understanding from fixed-label recognition toward queryable bidirectional semantic retrieval, providing a new modeling paradigm for language-mediated EMG action understanding.

2606.00172 2026-06-02 cs.AI 版本更新

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

CAST:用于GRPO的非特权裁剪非对称自教学与优势翻转

Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan, Liwen Hu, Lei Ma

发表机构 * School of Software and Microelectronics, Peking University, Beijing(北京大学软件与微电子学院) School of Artificial Intelligence, Peking University, Beijing(北京大学人工智能学院) School of Computer Science, Peking University, Beijing(北京大学计算机科学学院) School of Future Technology, Peking University, Beijing(北京大学未来技术学院)

AI总结 提出CAST方法,通过无答案的自教师模型和双向局部优势符号翻转,解决GRPO中奖励稀疏和组相对优势消失的问题,提升数学推理性能。

Comments 10 pages

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR),特别是组相对策略优化(GRPO),已被广泛用于改进大型语言模型的推理能力。然而,结果级奖励仅提供稀疏监督,并且当某个提示的所有采样轨迹全部正确或全部错误时,组相对优势会消失。在线自蒸馏(OPSD)提供了密集的令牌级指导,但其令牌偏好不一定与轨迹正确性对齐;实证诊断表明,OPSD信号在正确和错误的轨迹上表现不同,教师正向和教师负向的差距信号表现出不同的噪声特征。这些诊断仅在OPSD风格的特权教师上下文中进行分析,而CAST训练使用无答案的自教师评分。受这些观察启发,本文提出了CAST,一种用于GRPO风格RLVR的无答案自蒸馏方法。CAST保留了基于验证器的GRPO目标,但使用停止梯度的自教师根据轨迹正确性塑造令牌级优势。与先前的自蒸馏RLVR方法不同,CAST不需要参考解条件的教师评分,在整个训练过程中保持自教师对数概率差距活跃,并应用双向局部优势符号翻转:正确轨迹中的教师负向令牌可以获得负的令牌级优势,而错误轨迹中的教师正向令牌可以获得有界的正向局部优势。对于零方差的全正确和全错误组,CAST分配有界的符号约束基础优势,使得这些原本零梯度的组能够贡献验证器签名的令牌反馈。数学推理实验表明,CAST在保持轻量级、基于验证器的轨迹级目标的同时,改进了RLVR训练。

英文摘要

Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher scoring.Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.

2606.00170 2026-06-02 cs.HC cs.AI cs.CV 版本更新

UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment

UF-AMA: 通过自适应多模态对齐的跨域情感识别统一框架

Zheng Wang, Shuo Wang, Junhong Wang

发表机构 * Institute of Advanced Technology, University of Science and Technology of China(中国科学技术大学先进技术研究院) Department of Electronic Engineering and Information Science, University of Science and Technology of China(中国科学技术大学电子工程与信息科学系) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合国家科学中心人工智能研究院)

AI总结 提出一种统一框架UF-AMA,利用自适应多模态对齐和置信度感知筛选机制,解决跨主体和跨会话的生理信号情感识别中的分布偏移问题,在SEED和SEED-IV数据集上达到最优性能。

详情
AI中文摘要

近年来,基于脑电图(EEG)等生理信号的情感识别受到了广泛关注,因为与面部表情等外部行为数据相比,内部生理数据提供了更高的客观性和可靠性。然而,由于个体和情境差异导致的分布偏移,以及各模态样本质量的差异,构建具有高泛化性和鲁棒性的跨域多模态情感识别模型仍然是一个关键挑战。在本研究中,我们提出了一种具有自适应多模态对齐的统一框架(UF-AMA),以使用多模态生理信号解决跨主体和跨会话的情感识别问题。首先,我们构建了一个由Transformer编码器和多头交叉注意力模块组成的跨模态特征融合网络,实现了EEG信号和眼动追踪数据的深度融合。随后,我们引入了一种置信度感知筛选机制,动态评估每个模态分支在目标域样本上的预测可靠性,将样本划分为不同的质量子集,并相应地应用全局一致性对齐和跨模态蒸馏。最后,我们提出了一个多级域自适应框架,联合优化局部模态特定特征和全局融合特征的边际分布和条件分布,从而在多个粒度上减少跨域分布偏移。在SEED和SEED-IV数据集上的大量实验表明,UF-AMA在跨主体和跨会话任务中均达到了最先进的性能。源代码可在 https://github.com/BetterCoderLab/UF-AMA 获取。

英文摘要

In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross-domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF-AMA) to address cross-subject and cross-session emotion recognition using multimodal physiological signals. First, we construct a cross-modal feature fusion network comprising Transformer encoders and multi-head cross-attention modules, enabling the deep integration of EEG signals and eye-tracking data. Subsequently, we introduce a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross-modal distillation. Finally, we propose a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality-specific and global fusion features, thereby reducing cross-domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED-IV datasets demonstrate that UF-AMA achieves state-of-the-art (SOTA) performance in both cross-subject and cross-session tasks. The source code is available at: https://github.com/BetterCoderLab/UF-AMA.

2606.00169 2026-06-02 cs.LG cs.AI 版本更新

ChurnNet: A Optimized Modern AI for Churn Prediction

ChurnNet: 一种用于流失预测的优化现代人工智能

Syed Saad Saif, Giulio Maggiore, Paolo Russo, Damiano Distante

发表机构 * Department of Computer, Control, and Management Engineering(计算机、控制与管理工程系) Department of Law and Economics, UnitelmaSapienza University of Rome(法律与经济学系,Unitelma萨皮恩扎罗马大学) R&D Center, Token Financial Technologies(Token金融技术研发中心) Department of Civil, Computer Science and Aeronautical Technologies Engineering, Roma Tre University(土木、计算机科学与航空技术工程系,罗马三大学)

AI总结 本研究评估了传统机器学习方法(随机森林、XGBoost、支持向量机)与统一多任务时间序列模型在客户流失预测任务上的性能,发现传统方法在预测性能、数据效率和计算资源需求方面仍具优势。

详情
AI中文摘要

日益激烈的竞争以及零售商提供的产品和服务日益相似,降低了客户转向竞争对手的门槛。准确的流失预测可以成为推动有效个性化营销活动和帮助减少客户流失的宝贵工具。本研究评估了传统机器学习技术(即随机森林、XGBoost和支持向量机)的性能,并将其与统一多任务时间序列模型(一种二元时间序列分类任务)在流失预测上进行比较。尽管后者在建模复杂时间动态和变量间关系方面具有强大能力,但我们的结果表明,对于流失预测,传统方法在预测性能、数据效率以及训练和部署的计算资源需求方面仍可超越它。这些发现在多个数据集和各种流失标记技术中保持一致。

英文摘要

Increased competition and the growing similarity of products and services offered by retailers have lowered the barriers for customers to switch to competitors. Accurate churn prediction can be a valuable tool for driving effective personalized marketing campaigns and helping to reduce customer attrition. This study evaluates the performance of traditional machine learning techniques, namely, Random Forests, XGBoost, and Support Vector Machines, and compares them with the Unified Multi-Task Time Series Model for churn prediction, a binary time-series classification task. Despite the strong capacity of the latter to model complex temporal dynamics and inter-variable relationships, our results indicate that for churn prediction, conventional methods can still outperform it in terms of predictive performance, data efficiency, and computational resource requirements for training and deployment. These findings are consistent across multiple datasets and various churn labeling techniques.

2606.00161 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Improving IoT Intrusion Detection Through SMOTE-Based Oversampling and Extended Multi-Model Evaluation on Side-Channel Power Data

基于SMOTE过采样和扩展多模型评估的侧信道功率数据物联网入侵检测改进

Muhammad Khuram Shahzad, Haseeb Khan, Muhammad Masood Khan, Mubashra Bibi

发表机构 * School of Electrical Engineering and Computer Science (SEECS), NUST(电气工程与计算机科学学院(SEECS),努斯兰大学)

AI总结 针对物联网侧信道数据集中的严重类别不平衡问题,采用SMOTE过采样平衡数据,并评估八种机器学习模型,其中随机森林和极端随机树在F1分数上超越基线方法,同时揭示了宏观F1指标的重要性。

Comments 8 pages, 14 figures; code and results publicly available

详情
AI中文摘要

物联网网络中的入侵检测面临传统机器学习方法无法克服的挑战,其中最大的挑战之一是侧信道数据集中存在的类别不平衡问题,正常类样本与攻击类样本的比例可达75964:1。Dominguez等人通过基于功率的入侵检测概念验证解决了这一问题,但既未尝试处理不平衡问题,也未使用平衡训练集评估分类器性能。本文同时处理这两个方面。首先,对从初始数据集提取的所有九个可能数据集应用合成少数类过采样技术(SMOTE),使每个数据集的精确不平衡比达到1.1。然后,在SMOTE平衡的6小时数据集上,在相同条件下训练了八种算法:随机森林、HistGradientBoosting、LightGBM、极端随机树、XGBoost、k近邻、多层感知机和决策树。随机森林的微平均F1分数达到0.9989,宏F1为0.9794,优于基线论文中时间序列森林算法之前的最佳微F1结果0.9983。极端随机树提供了相同的性能,但速度快10倍。与基线论文评估相比,显式引入宏F1指标揭示了聚合性能指标遗漏的重要类别级信息。基于混淆矩阵、F1热图和ROC曲线计算的每类召回率表明,仅当使用SMOTE平衡时,少数攻击类(尤其是M+L联合感染类)才能被可靠检测。特征重要性分析表明,在功率窗口的60个时间步中,最近的时间步是最重要的预测信号。

英文摘要

The detection of intrusions in IoT-based networks poses challenges that cannot be overcome using traditional machine learning methods. Perhaps the biggest of them is related to the presence of a class imbalance in the side-channel dataset, where the number of samples in the normal class compared to the attacks can reach a ratio of 75,964 to 1. Such an aspect is addressed by Dominguez et al. through the proof of concept of power-based intrusion detection. Unfortunately, neither the authors attempt to cope with the problem of imbalance nor do they assess the classifier performance using a balanced training set. In the current paper, both aspects will be handled at once. First, a Synthetic Minority Oversampling Technique (SMOTE) was performed on all nine possible datasets extracted from the initial one, providing an exact imbalance ratio of 1.1 for each. Then, eight algorithms i.e. Random Forest, HistGradientBoosting, LightGBM, Extra Trees, XGBoost, k-Nearest Neighbors, Multi-Layer Perceptron, and Decision Tree were trained under identical conditions for the SMOTE balanced 6-hour dataset. Random Forest reached a micro-averaged F1 score of 0.9989 and macro F1 of 0.9794, thus outperforming the previously best micro-F1 result obtained by Time Series Forest algorithm from the base paper of 0.9983. Extra Trees provided the same performance as well, but at 10 times faster. The introduction of a macro-F1 metric explicitly in contrast to the base paper assessment reveals important class-level information missed with aggregate performance metrics. Recall rates per-class calculated with confusion matrices, F1 heatmaps, and ROC curves show that minority attack classes, especially those with combined M+L infections, are detected reliably only when using SMOTE balance. Feature importance analysis indicates the latest time steps as the most important predictor signals out of 60 steps in a power window.

2606.00160 2026-06-02 cs.CR cs.AI cs.CL 版本更新

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

DataShield: 针对大语言模型良性指令微调的安全降级数据过滤

Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang, Jie Pan, Jinbiao Zhu

发表机构 * nwpu.edu.cn(西北工业大学)

AI总结 提出DataShield方法,通过量化每个样本对模型合规行为的贡献作为安全降级分数,高效识别良性数据集中的安全降级样本,并在多个模型和数据集上验证有效性。

详情
AI中文摘要

大型语言模型(LLM)即使在使用良性数据集进行微调时,也会出现安全能力下降的问题。然而,现有识别良性数据集中安全降级样本的方法存在计算成本高和噪声显著的问题。在本文中,我们提出DataShield,以高效且有效地识别潜在的安全降级样本。我们的关键直觉基于以下观察:良性微调提高了LLM的整体响应合规性。DataShield的核心技术见解是将每个样本对模型合规行为的贡献量化为其安全降级分数。DataShield包含三个核心组件:(1)合规向量提取,捕获LLM的合规行为倾向;(2)一种新颖的合规感知分数(CAS),自动识别最优的安全关键层;(3)安全降级样本过滤,量化训练数据沿合规方向的投影偏移。在Llama3-8B、Llama3.1-8B和Qwen2.5-7B上使用Alpaca和Dolly良性数据集进行的广泛实验评估验证了我们的方法在识别高风险和低风险数据子集方面的有效性。我们还观察到,开放式问答更可能触发安全降级,且相应的响应往往更长。我们希望这项工作能为以数据为中心的防御方法提供新的见解。源代码可在https://github.com/ZJunBo/DataShield获取。

英文摘要

Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods for identifying safety-degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety-degrading samples. Our key intuition is based on the observation that benign fine-tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance-Aware Score (CAS), which automatically identifies the optimal safety-critical layer; and (3) Safety-degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B using the Alpaca and Dolly benign datasets validates our method's effectiveness in identifying high-risk and low-risk data subsets. We also observe that open-ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data-centric defense methods. The source code is available at: https://github.com/ZJunBo/DataShield.

2606.00159 2026-06-02 cs.CV cs.AI 版本更新

Digital-to-Physical Transfer of Adversarial Patches for Aerial Vehicle Detection

针对空中飞行器检测的对抗性补丁的数字到物理迁移

Jung Heum Woo, Eun-Kyu Lee

发表机构 * School of Information Technology, Incheon National University(信息科技学院,Incheon国立大学)

AI总结 本文通过数字优化和物理部署,评估了针对YOLOv3空中飞行器检测器的物理对抗性补丁攻击,发现ON补丁在物理环境中鲁棒性更强。

Comments 18 pages, 5 figures, 3 tables, preprint

详情
AI中文摘要

基于深度神经网络(DNN)的目标检测器广泛应用于环境监测和城市分析等领域的航拍和卫星图像分析。尽管性能强劲,但这些模型已知易受对抗性示例攻击,而使用可打印图案的物理对抗性攻击构成了现实的安全威胁。在本文中,我们通过桥接数字优化和实际部署,评估了针对空中飞行器检测器的物理对抗性补丁攻击。对抗性补丁在数字域中使用损失函数进行优化,该函数最小化最大目标性分数,同时结合不可打印性分数(NPS)和总变差(TV)约束以确保可打印性和空间平滑性。优化后的补丁被打印并以三种配置部署:ON、OFF和OFF-Side。使用YOLOv3检测器的实验表明,虽然OFF补丁在数字域中实现了最高有效性(85.51%的平均目标性降低率(AORR)),但ON补丁由于其一贯的可见性,在物理环境中表现出更强的鲁棒性(0.197-0.343的目标性分数比(OSR))。此外,我们的结果表明,基于天气的增强并不一定能改善该领域的补丁优化。这些发现为空中目标检测系统的实际脆弱性提供了关键见解。

英文摘要

Deep neural network (DNN)-based object detectors are widely used for analyzing aerial and satellite imagery in applications such as environmental monitoring and urban analytics. Despite their strong performance, these models are known to be vulnerable to adversarial examples, and physical adversarial attacks using printable patterns pose realistic security threats. In this paper, we evaluate physical adversarial patch attacks against an aerial vehicle detector by bridging digital optimization and real-world deployment. Adversarial patches are optimized in the digital domain using a loss function that minimizes the maximum objectness score while incorporating non-printability score (NPS) and total variation (TV) constraints to ensure both printability and spatial smoothness. The optimized patches are printed and deployed in three configurations: ON, OFF, and OFF-Side. Experiments using a YOLOv3 detector show that while the OFF patch achieves the highest effectiveness in the digital domain (85.51% Average Objectness Reduction Rate (AORR)), the ON patch demonstrates superior robustness in physical environments (0.197-0.343 Objectness Score Ratio (OSR)) due to its consistent visibility. Furthermore, our results indicate that weather-based augmentation does not necessarily improve patch optimization in this domain. These findings provide critical insights into the practical vulnerabilities of aerial object detection systems.

2606.00157 2026-06-02 stat.ML cs.AI cs.LG math.PR 版本更新

Interpreting FCDNNs via RG on Exponential Family

通过指数族上的重正化群解释全连接深度神经网络

Fuzhou Gong, Zigeng Xia

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过建立统计物理中重正化群方法与深度神经网络训练过程的对应关系,证明了对于指数族连续输入数据,全连接DNN训练后特征层输出的特征参数等于RG方法下的不动点,从而解释了DNN的特征提取能力。

Comments 18 pages, 2 figures

详情
AI中文摘要

我们考虑通过建立统计物理中的重正化群(RG)方法与深度神经网络(DNN)训练过程之间的对应关系,来建立深度学习的可解释性理论。我们已使用一维伊辛模型作为输入数据证明了所构建的关系。本文我们将结果推广到连续输入数据的情况,这是将该对应框架应用于真实数据的必要准备。为具有代表性,我们考虑指数族中的一类数据分布。我们证明,当全连接(FC)DNN的参数在训练后达到最优值时,DNN特征层输出的特征参数等于连续场RG方法下输入数据特征参数的不动点。这一结论表明,DNN的训练过程等价于对此类数据进行RG计算,因此网络能够像RG一样从输入数据中提取主要特征。此外,该等价性进一步验证了我们建立的对应框架,为DNN在真实数据上的卓越表现提供了解释。

英文摘要

We consider establishing the interpretability theory of deep learning through constructing a corresponding relationship between the renormalization group (RG) method in statistical physics and the training process of deep neural networks (DNNs). We have proved the constructed relationship using the one-dimensional Ising model as the input data. In this paper we generalize our results to the case of continuous input data, which is a necessary preparation for applying the corresponding framework to real-world data. To be representative, we consider a class of data distribution in the exponential family. We prove that when the parameters of fully connected (FC) DNNs achieve their optimal value after training, the characteristic parameters of the feature layer output of DNNs are equal to the fixed points of the characteristic parameters of input data under RG method for continuous fields. This conclusion shows that the training process of DNNs is equivalent to RG calculation on this kind of data and therefore the network can extract main features from the input data just like RG. Also, the equivalence further validates the correspondence framework we have established, providing an explanation for the outstanding performance of DNNs on real-world data.

2606.00156 2026-06-02 eess.IV cs.AI 版本更新

A physics-informed foundation model for quantitative diffusion MRI

一种用于定量扩散MRI的物理信息基础模型

Zihan Li, Jialan Zheng, Ziyu Li, Xun Yuan, Kasidit Anmahapong, Ziang Wang, Mingxuan Liu, Hongjia Yang, Yifei Chen, Zhuhao Wang, Yuhang He, Fang Chen, Rui Li, Huaiqiang Sun, Yi Liao, Congyu Liao, Yang Yang, Haibo Qu, Xue Zhang, Hongen Liao, Qiyuan Tian

发表机构 * School of Biomedical Engineering, Tsinghua University(清华大学生物医学工程系) Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford(牛津大学整合神经影像中心、FMRIB、临床神经科学系) Department of Radiology, West China Second University Hospital, Sichuan University(四川大学华西第二医院放射科) School of Biomedical Engineering and the Institute of Medical Robotics, Shanghai Jiaotong University(上海交通大学生物医学工程学院和医学机器人研究院) Department of Radiology, Institution of Radiology and Medical Imaging, West China Hospital, Sichuan University(四川大学华西医院放射科、放射医学与影像研究所) Department of Radiology and Biomedical Imaging, University of California San Francisco(加州大学旧金山分校放射科和生物医学影像系) Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine(斯坦福大学医学院精神病学与行为科学系)

AI总结 提出物理信息生成微结构网络(PIGMENT),通过零样本适应实现从稀疏数据中恢复可靠的定量扩散MRI参数映射。

详情
AI中文摘要

理解人脑需要获取其微观组织架构。扩散磁共振成像(MRI)提供了唯一非侵入性的活体全脑微结构窗口,但可靠的定量映射仍局限于需要密集采样和优化采集协议的专业研究环境。为解决这一差距,我们提出了一种物理信息生成微结构网络(PIGMENT),它学习人脑微结构的通用生成先验,并以零样本方式适应每个参与者的测量数据,以恢复特定主体的映射。PIGMENT在涵盖多个站点、供应商和场强的11375次扫描上训练,使得在来自五个独立中心的外部数据集上,能够对张量、峰度和NODDI模型进行可靠的定量映射。在传统拟合变得不可靠的情况下,它仍然有效,从极其稀疏的采集中恢复有意义的映射,同时支持下游的纤维追踪和结构连接映射。PIGMENT估计显示出强大的生物学有效性,从10倍加速扫描中保留了亚毫米级皮层微结构模式和早期儿童白质发育轨迹。此外,PIGMENT能够在成本效益高的低场系统上进行可靠的定量张量映射,并使用超快速临床协议提取肿瘤相关生物标志物。这些结果共同确立了PIGMENT作为一种物理信息基础模型,将定量扩散MRI扩展到传统上因过于稀疏、异质或临床受限而无法进行可靠分析的领域。

英文摘要

Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides the only noninvasive window into whole-brain microstructure in vivo, yet reliable quantitative mapping remains confined to specialized research settings requiring dense sampling and optimized acquisition protocols. To address this gap, we present a physics-informed generative microstructure network (PIGMENT) that learns a universal generative prior of human brain microstructure and adapts it zero-shot to each participant's measured data to recover subject-specific maps. Trained on 11375 scans spanning multiple sites, vendors, and field strengths, PIGMENT enabled reliable quantitative mapping for tensor, kurtosis, and NODDI models across external datasets from five independent centers. It remains effective where conventional fitting becomes unreliable, recovering meaningful maps from extremely sparse acquisitions while supporting downstream tractography and structural connectivity mapping. PIGMENT estimates demonstrated strong biological validity, preserving submillimeter cortical microarchitectural patterns and early-childhood white matter developmental trajectories from 10-fold accelerated scans. Furthermore, PIGMENT enables reliable quantitative tensor mapping on cost-efficient low-field systems and the extraction of tumor-related biomarkers using ultra-fast clinical protocols. Together, these results establish PIGMENT as a physics-informed foundation model that extends quantitative diffusion MRI into regimes traditionally too sparse, heterogeneous, or clinically constrained for reliable analysis.

2606.00155 2026-06-02 cs.CR cs.AI 版本更新

A Protocol-Language Model for Network Intrusion (Without Deep Packet Inspection)

一种用于网络入侵的协议语言模型(无需深度包检测)

Vivek Kumar Sharma

发表机构 * Palo Alto Networks(帕洛阿尔托网络)

AI总结 提出PLM-NIDS,利用RWKV-4状态空间模型将网络流作为语言处理,仅基于L3/L4包元数据检测攻击,无需深度包检测,实现零样本异常检测(PR-AUC=0.93)和加密协议透明处理。

Comments 20 pages Research paper on Packet Language Models for Network Intrusion Detection Systems(Without Deep Packet Inspection).Code available on GitHub

详情
AI中文摘要

现代网络入侵检测系统(NIDS)陷入结构性矛盾:承载最高威胁情报的协议恰恰是那些在TLS 1.3和QUIC下加密的协议,其中负载检测毫无用处。我们提出一个更简单的问题——如果攻击签名不在字节中,而在节奏中呢?——并通过将网络流视为一种语言来回答,该语言的语法完全由L3/L4包元数据编写:长度、到达间隔时间、TTL、TCP标志和哈希端口号。我们提出了PLM-NIDS,它依次证明了三个主张。(1)语法存在且可学习:在344,232个未标记的Monday流上训练的RWKV-4状态空间模型实现了0.204的因果LM验证损失,表明良性流量具有可预测的、统计一致的结构。(2)攻击违反此语法:在训练时使用零攻击标签的情况下,每流困惑度得分以PR-AUC=0.93清晰分离良性流和攻击流。(3)这种分离在架构上非平凡:在相同令牌序列上训练的LSTM退化为多数类预测器(ROC-AUC约0.50,通过始终预测“攻击”得到F1=0.91),证明RWKV的因果预训练提供了直接分类器无法获得的归纳偏置。监督微调进一步将PR-AUC提升至0.94,ROC-AUC提升至0.75,在校准操作阈值下精确率为97.7%。RWKV骨干的O(T)循环推理支持逐包流式处理而无需流缓冲,使PLM-NIDS在线速率下可操作。由于它仅读取IP/TCP/UDP头部,因此本质上是加密无关的:TLS 1.3、QUIC和未来的加密协议均被透明处理。

英文摘要

Modern network intrusion detection systems (NIDS) are caught in a structural contradiction: the protocols carrying the highest threat intelligence are precisely those encrypted under TLS 1.3 and QUIC, where payload inspection yields nothing. We ask a simpler question -- what if the attack signature is not in the bytes, but in the rhythm? -- and answer it by treating network flows as a language whose grammar is written entirely in L3/L4 packet metadata: length, inter-arrival time, TTL, TCP flags, and hashed port numbers. We present PLM-NIDS, which proves three claims in sequence. (1) The grammar exists and is learnable: a RWKV-4 state-space model trained on 344,232 unlabelled Monday flows achieves a causal LM validation loss of 0.204, demonstrating that benign traffic has predictable, statistically consistent structure. (2) Attacks violate this grammar: the per-flow perplexity score cleanly separates benign from attack flows with PR-AUC = 0.93 using zero attack labels at training time. (3) This separation is architecturally nontrivial: an LSTM trained on identical token sequences degenerates to a majority-class predictor (ROC-AUC approximately 0.50, F1 = 0.91 by always predicting "attack"), proving that RWKV's causal pre-training provides an inductive bias unavailable to direct classifiers. Supervised fine-tuning further raises PR-AUC to 0.94 and ROC-AUC to 0.75, with a precision of 97.7% at the calibrated operating threshold. The RWKV backbone's O(T) recurrent inference enables per-packet streaming without flow buffering, making PLM-NIDS operationally viable at line rate. Because it reads only IP/TCP/UDP headers, it is inherently encryption-agnostic: TLS 1.3, QUIC, and future encrypted protocols are handled transparently.

2606.00154 2026-06-02 cs.SE cs.AI 版本更新

Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages

多模态大语言模型在复杂交互网页代码生成上的基准测试

Fan Wu, Lishuai Dong, Cuiyun Gao, Yujia Chen, Yiming Huang, Yang Xiao, Qing Liao

AI总结 针对现有基准忽略复杂交互行为的问题,提出WebIGBench基准,包含103个真实复杂网页和871个交互动作,并设计新评估流程,测试多模态大语言模型在交互式网页代码生成上的性能。

详情
AI中文摘要

近期多模态大语言模型(MLLMs)在多模态推理和代码生成方面取得了显著进展,催生了前端开发的新范式。特别是,这些模型可以直接将视觉设计转化为可执行代码,显著提高了Web开发的效率和适应性。现代Web应用是动态且交互式的,具有频繁的用户-页面交互。然而,现有基准主要评估静态网页的代码生成,忽略了实际应用中的复杂交互行为。此外,它们的评估标准仍局限于视觉保真度和代码结构,忽视了生成网页与参考网页之间的交互一致性。为解决这些局限,我们引入了WebIGBench,这是首个旨在评估具有复杂交互的交互式网页代码生成的基准。通过结合手动设计的交互路径和UI自动化,我们从真实网站收集了103个复杂网页。该基准涵盖了5种流行的交互动作类型(例如点击、输入),涉及871个不同的交互动作。此外,我们提出了一种新的评估流程,以弥补交互动作自动评估的空白。在多个代表性MLLM上的大量实验揭示了当前模型在使用WebIGBench进行交互式网页代码生成时的性能边界。所提出的基准可在https://github.com/anoa12159-hue/WebIGBench_eval获取。

英文摘要

Recent advancements in multimodal large language models (MLLMs) have achieved remarkable progress in multimodal reasoning and code generation, catalyzing a new paradigm for front-end development. In particular, these models can directly transform visual designs into executable code, significantly improving the efficiency and adaptability of web development. Modern web applications are dynamic and interactive, featuring frequent user-page interactions. However, existing benchmarks largely evaluate the code generation of static webpages, ignoring the complex interactive behaviors in real-world applications. Besides, their evaluation criteria remain confined to visual fidelity and code structure, overlooking the interaction consistency between the generated and the reference webpages. To address these limitations, we introduce WebIGBench, the first benchmark designed to evaluate code generation for interactive webpages with complex interactions. By combining manually designed interaction paths with UI automation, we collected 103 complex webpages from real-world websites. This benchmark covers 5 popular interactive action types (e.g., click, input) involving 871 distinct interactive actions. Moreover, we propose a novel evaluation pipeline to address the gap in automated assessment of interactive actions. Extensive experiments on several representative MLLMs reveal the performance boundaries of current models in interactive webpage code generation using WebIGBench. The proposed benchmark is available at https://github.com/anoa12159-hue/WebIGBench_eval.

2606.00153 2026-06-02 cs.CV cs.AI 版本更新

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

DiffCrossGait:基于潜在扩散的2D-3D跨模态步态识别轨迹级对齐

Zhiyang Lu, Ming Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对2D-3D跨模态步态识别中的域差异问题,提出DiffCrossGait,通过潜在扩散空间中的轨迹级对齐实现连续模态对齐,并引入三阶段对齐策略确保身份锚定、动态一致性和跨模态结构可恢复性,在SUSTech1K和FreeGait基准上达到最优性能。

Comments Accepted by ICML2026

详情
AI中文摘要

跨模态2D-3D步态识别受到2D轮廓和3D LiDAR距离视图表示之间固有域差异的阻碍。虽然先前的方法仅对齐最终嵌入,我们提出DiffCrossGait,将跨模态匹配重新表述为身份相关潜在扩散空间中的轨迹级对齐,而不是假设2D和3D观测完全等价。通过在潜在空间中使用共享高斯噪声驱动两种模态,我们实现了生成演化过程中的连续对齐。我们引入了一种三阶段对齐策略,利用不同的噪声强度来强制身份锚定、动态一致性和跨模态结构可恢复性,从而约束两种模态共享去噪动态和瓶颈结构,促进模态不变的步态特征。关键的是,我们的框架将生成对齐与判别骨干解耦;扩散机制仅作为训练目标,通过消除迭代去噪的计算开销确保高推理效率。在SUSTech1K和FreeGait基准上的大量实验表明,DiffCrossGait达到了最先进的性能。

英文摘要

Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representations. While prior methods align only final embeddings, we propose DiffCrossGait, which reformulates cross-modal matching as trajectory-level alignment in an identity-relevant latent diffusion space, rather than assuming full equivalence between 2D and 3D observations. By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features. Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.

2606.00152 2026-06-02 cs.CR cs.AI 版本更新

PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say

PrivacyPeek: 审计基于LLM的智能体获取了什么,而不仅仅是它们说了什么

Mingxuan Zhang, Jiahui Han, Dadi Guo, Songze Li, Guanchu Wang, Na Zou, Dongrui Liu, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Southeast University(东南大学)

AI总结 提出PrivacyPeek基准,通过检查工具调用轨迹和探针诱导,评估基于LLM的智能体在获取阶段不必要的敏感信息泄露,发现该问题普遍存在且现有防御效果有限。

Comments 19 pages, 9 figures

详情
AI中文摘要

基于LLM的智能体正在快速发展,能够自主调用外部工具为用户完成多步骤任务。然而,智能体常常获取超出任务所需的敏感信息。现有的隐私基准审计智能体的响应或外部行为泄露了什么,但忽略了数据首次进入智能体上下文时的获取阶段。过度获取的信息只需一次粗心操作或一次攻击即可完全泄露。为了评估其普遍性,我们引入了\emph{PrivacyPeek},一个用于评估基于LLM的智能体获取阶段隐私泄露的基准,包含$1{,}182$个案例,涵盖$7$种获取行为和$16$个应用领域。具体来说,\emph{获取检查}检查智能体的工具调用轨迹,包括其调用的工具和接收的数据,以检测其何时获取超出任务范围的敏感信息。然后,\emph{探针诱导}发出后续探针,并衡量攻击者能够多容易地诱导出智能体已获取但未披露的敏感信息。我们在4个模型家族的10个基于LLM的智能体上的实验表明,不必要的敏感信息获取非常普遍。此外,我们观察到任务完成能力与获取阶段泄露之间存在相关性。提示级别的防御仅减少了获取阶段泄露的一小部分,大部分未被缓解。这些结果使得审计获取阶段的隐私既紧迫又必要。我们的数据集和代码可在https://github.com/Xuan269/PrivacyPeek-Resource获取。

英文摘要

LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often acquire more sensitive information than the task requires. Existing privacy benchmarks audit what the agent's response or outgoing actions disclose, but overlook the acquisition stage where data first enters the agent's context. The over-acquired information is then one careless action or one attack away from an outright leak. To assess its prevalence, we introduce \emph{PrivacyPeek}, a benchmark for evaluating acquisition-stage privacy leakage of LLM-based agents, with $1{,}182$ cases across $7$ acquisition behaviours and $16$ application domains. Specifically, \emph{Acquisition Inspection} examines the agent's tool-call trajectory, both the tools it invokes and the data it receives, to detect when it acquires sensitive information beyond the task scope. \emph{Probe Elicitation} then issues a follow-up probe and measures how readily an attacker could elicit sensitive information the agent acquired but did not disclose. Our experiments on 10 LLM-based agents across 4 model families show that the unnecessary acquisition of sensitive information is widespread. In addition, we observe a correlation between the task-completion capability and acquisition-stage leakage. Prompt-level defences reduce only a small fraction of acquisition-stage leakage, leaving the majority unmitigated. These results make auditing acquisition-stage privacy both urgent and necessary. Our dataset and code are available at https://github.com/Xuan269/PrivacyPeek-Resource.

2606.00151 2026-06-02 cs.LG cs.AI 版本更新

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

通过重试在策略梯度强化学习中涌现探索行为

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo

发表机构 * University of Tokyo(东京大学) Aalto University(阿尔托大学)

AI总结 提出ReMax目标函数,通过最大化M个样本的期望最大回报来使探索行为自然涌现,并推导策略梯度公式及RePPO算法,在MinAtar和Craftax基准上无需显式探索奖励即可促进探索。

详情
AI中文摘要

在强化学习(RL)中,智能体从探索中获益仅仅是因为它们反复遇到相似的状态:尝试不同的动作可以提高性能或减少不确定性;没有这样的重试,贪婪策略是最优的。我们通过ReMax形式化这一直觉,该目标函数根据$M$个样本($M$为正整数)的期望最大回报来评估策略,同时考虑回报的不确定性。优化该目标函数会使随机探索作为涌现属性出现,无需显式奖励项。为了实现高效的策略优化,我们为ReMax推导了新的策略梯度公式,并引入ReMax PPO(RePPO),这是一种PPO变体,它优化ReMax的同时将离散重试次数$M$推广为连续参数$m>0$,从而实现对探索的细粒度控制。实验上,RePPO在MinAtar和Craftax基准上无需任何显式探索奖励即可促进探索。

英文摘要

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.

2606.00150 2026-06-02 cs.CR cs.AI 版本更新

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

人格攻击:针对大型语言模型的增量记忆注入越狱攻击

Junyoung Park, Seongyong Ju, Sunghwan Park, Jaewoo Lee

发表机构 * Chung-Ang University(Chung-Ang 大学)

AI总结 提出一种基于记忆注入的越狱方法Persona Attack,通过逐步操纵模型上下文窗口,使模型在记忆积累中优先处理注入指令,从而绕过安全对齐,在特定配置下攻击成功率可达95%。

详情
AI中文摘要

随着大型语言模型为方便用户而不断发展,尽管在安全训练方面持续努力,但越狱攻击的脆弱性仍被不断报告。传统的越狱技术通常侧重于单次提示注入,忽略了模型记住对话流程和用户指令的能力。在本文中,我们提出了Persona Attack,一种基于记忆注入的越狱方法,通过逐步方法操纵模型的上下文窗口。将Persona Attack应用于多个广泛使用的LLM的实验结果表明,随着注入在记忆中的积累,模型越来越优先考虑这些指令,而不是其内部安全对齐机制。此外,我们的实验经验性地证明,攻击成功率不仅根据模型的记忆实现而变化,还取决于指令的组合,在特定指令配置下可达到95%。

英文摘要

As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts in safety training. Traditional jailbreak techniques typically focus on a single prompt injection, neglecting the models' ability to remember the flow of conversation and the user's instructions. In this paper, we propose Persona Attack, a memory injection based jailbreak method that manipulates the model's context window through a step by step approach. Experimental results from applying Persona Attack to several widely used LLMs reveal that, as injections accumulate in memory, models increasingly prioritize these instructions over their internal safety alignment mechanisms. Furthermore, our experiments empirically demonstrate that the attack success rate varies not only according to the memory implementation of the model, but also combinations of instructions and can reach 95% under specific instruction configurations.

2606.00148 2026-06-02 cs.CV cs.AI 版本更新

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

StemBind: 当多模态大语言模型在抽象视觉推理中迷失于规则与实例之间

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出 StemBind 诊断基准,通过共享主干的三对齐问题(感知、规则、完整)定位 MLLM 在抽象视觉推理中的失败环节,发现规则到实例的绑定是主要瓶颈。

Comments Project page: https://hexixiang.github.io/StemBind

详情
AI中文摘要

多模态大语言模型(MLLM)常常知道规则但选错答案:在抽象视觉推理(AVR)任务中,模型可以描述所见内容并命名底层模式,但仍然无法选择匹配的候选。现有的 AVR 基准无法检测到这一点,因为它们将感知、规则归纳和答案选择合并为一个单一的对错信号。我们引入了 StemBind,一个共享主干的诊断基准,它用三个对齐的问题探测同一视觉主干:感知(图像中有什么)、规则(支配它的模式是什么)和完整(哪个选项完成它),因此最终答案的错误可以归因于同一证据上的特定子步骤。StemBind 包含 2,298 个经过精心策划的知识精简主干,涵盖九种可审计的视觉操作,总计 19,533 个 P/R/F 任务,每个完整项目都通过 Sternberg 的四个推理阶段(S1 编码、S2 推断、S3 映射、S4 应用)进行标注。评估 24 个前沿 MLLM 配置得出四个发现。(i)R-F 鸿沟:在 24 个模型中的 22 个上,规则准确率超过完整项目准确率,因此大多数失败发生在规则被识别之后。(ii)持续的绑定差距:即使在同一主干上 P 和 R 都正确,模型仍有 51.2% 的时间错误回答 F。(iii)瓶颈是 S3:过程诊断和阶段式刺激增强将主要失败定位到规则到实例的映射。(iv)扩展和思考无济于事:更大的模型和显式思考模式都无法可靠地缩小差距,思考甚至降低了规则和完整项目的准确率。StemBind 将 AVR 评估从最终答案排名重新定义为定位抽象视觉推理失败的位置,将规则到实例的绑定确定为视觉基础推理的具体下一个目标。

英文摘要

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.

2606.00147 2026-06-02 cs.LG cs.AI 版本更新

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

RAFT:用于缓解遗忘的领域微调的数据精炼与自适应蒸馏

Yuduo Li, Xiaofeng Shi, Qian Kou, Longbin Yu, Hua Zhou

发表机构 * Beijing Academy of Artificial Intelligence (BAAI)(北京人工智能研究院) Beijing Jiaotong University (BJTU)(北京交通大学)

AI总结 提出RAFT框架,通过数据精炼(自条件重写、语义过滤、答案融合)和答案条件在线蒸馏(top-K温度蒸馏、EMA自适应损失平衡)来解决领域微调中的监督兼容性差距和轨迹保持差距,在提升领域性能的同时缓解通用能力退化。

Comments preprint

详情
AI中文摘要

领域特定的监督微调(SFT)通常以提高领域内性能为代价,导致模型通用能力下降。我们将这种退化归因于领域SFT中的两个实际差距:监督兼容性差距,即领域目标在风格和推理格式上与原始模型的自然响应不同;以及轨迹保持差距,即教师强制SFT优化固定目标令牌,而不约束模型在其自身生成前缀上的行为。这个过程未能保留模型的原始行为。我们提出RAFT(用于缓解遗忘的领域微调的数据精炼与自适应蒸馏),一个两阶段框架来解决这两个因素。首先,RAFT通过自条件重写、语义过滤和答案融合构建模型兼容的监督。其次,RAFT执行答案条件在线蒸馏,其中原始指令调优模型在学生生成的轨迹上提供软目标,同时以融合答案作为有用上下文进行条件化。我们进一步引入top-K温度蒸馏和基于EMA的自适应损失平衡来稳定领域-通用权衡。在三个指令调优骨干和五个领域上,RAFT相比标准SFT将平均领域准确率提高了23.2%,同时恢复了MS-Bench和IFEval上SFT引起的部分退化,相对改进分别为18.2%和10.2%。这些结果表明,将数据精炼与轨迹级保持相结合为缓解遗忘的领域微调提供了有效方案。

英文摘要

Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities. We view this degradation through two practical gaps in domain SFT: a supervision-compatibility gap, where domain targets differ in style and reasoning format from the original model's natural responses, and a trajectory-preservation gap, where teacher-forced SFT optimizes fixed target tokens without constraining the model's behavior on its own generated prefixes. This process fails to preserve the model's original behavior. We propose RAFT (Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting), a two-stage framework that addresses both factors. First, RAFT constructs model-compatible supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Second, RAFT performs Answer-Conditioned On-Policy Distillation, where the original instruction-tuned model provides soft targets on student-generated trajectories while being conditioned on the fused answer as helpful context. We further introduce top-K temperature distillation and EMA-based adaptive loss balancing to stabilize the domain-general trade-off. Across three instruction-tuned backbones and five domains, RAFT improves average domain accuracy by 23.2% over standard SFT, while recovering part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2% and 10.2%, respectively. These results show that coupling data refinement with trajectory-level preservation provides an effective recipe for domain fine-tuning with alleviated forgetting.

2606.00146 2026-06-02 eess.IV cs.AI cs.CV 版本更新

Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts

多对比度MRI运动校正:基于参数信息解缠与自适应专家网络

Honglin Xiong, Yuxian Tang, Feng Li, Yulin Wang, Lei Xiang, Dinggang Shen, Qian Wang

发表机构 * ShanghaiTech University(上海科技大学)

AI总结 提出一种结合参数信息对比度解缠与严重度感知自适应校正的统一框架,通过ScanCLIP提取对比度嵌入以分离解剖内容,利用视觉Transformer估计运动严重度并路由至专家混合网络,实现跨对比度与严重度的运动伪影校正,在IXI和HCP基准上优于现有方法。

详情
AI中文摘要

磁共振成像中的运动伪影降低了诊断可靠性。现有的深度学习方法通常针对特定对比度,无法泛化到不同模态和伪影严重度。我们提出一个统一框架,结合参数信息对比度解缠与严重度感知自适应校正。ScanCLIP在超过30,000个MRI文本-图像对上预训练,从采集参数中导出对比度嵌入,将对比度风格与解剖内容分离,得到无对比度特征。然后,视觉Transformer估计运动严重度,并通过专家混合网络路由特征,实现针对性伪影校正。双路径解码器重建干净图像和残差伪影图,强制执行图像空间一致性。在IXI和HCP基准上,我们的方法在PSNR上提升0.75 dB,SSIM最高提升0.0279,优于现有方法,且在更高伪影严重度下增益更大。该方法在真实临床数据上展现出鲁棒的零样本泛化能力,这些数据使用未见过的扫描参数采集,而现有方法要么无法去除伪影,要么引入额外失真。

英文摘要

Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.

2606.00145 2026-06-02 cs.RO cs.AI 版本更新

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

边界完成(CaB):有限校准下具有完成感知的可部署切换

Yusuke Sano, Takeshi Itoga

发表机构 * Intelligent Systems Laboratory, SECOM Co., Ltd.(SECOM公司智能系统实验室)

AI总结 提出Completion at the Boundary (CaB)方法,通过边界阶段令牌(Before/Hit/After)保留双边证据,在有限校准条件下实现VLA代理的完成感知切换,提升复合指令执行和交接质量。

详情
AI中文摘要

视觉-语言-动作(VLA)代理可以执行自然语言指令,但部署系统仍缺乏操作接口:决定指令何时完成。这一缺口在短复合指令(“做A,然后做B”)中尤为严重,时机不当的交接会级联导致下游故障。完成本质上是闭环的,因为切换是一种改变指令上下文从而影响未来动作和观察的干预。我们研究在由开放式指令空间启发的可部署低校准机制下的完成问题,强制要求无测试时重新学习,并选择一个全局校准的切换规则(在开发集上选择一次,在测试集上原样复用)。在此约束下,将非对称边界证据压缩为单个标量可能在任务极性变化时变得脆弱。我们提出边界完成(CaB),它预测事件局部完成对象,形式为边界阶段令牌(Before/Hit/After),在此规则下保留双边证据。CaB-When将此完成对象转换为最小、可审计的切换决策(何时),而CaB-How重用同一完成对象来调节动作生成,以实现交接过程中的边界稳定控制(如何)。使用干预感知的E1/E2协议,我们表明在匹配容量和可部署性约束下,CaB在第一个视角Minecraft VLA基准上提高了复合执行和交接质量。

英文摘要

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: deciding when the instruction is complete. This gap is acute in short composites ("do A, then B"), where mistimed handoffs cascade into downstream failures. Completion is inherently closed-loop because switching is an intervention that changes the instruction context and thus future actions and observations. We study completion under a deployable low-calibration regime motivated by open-ended instruction spaces, enforcing no test-time relearning and a single globally calibrated switching rule selected once on development set and reused unchanged on test set. Under this constraint, collapsing asymmetric boundary evidence into a single scalar can be brittle under polarity shifts across tasks. We propose Completion at the Boundary (CaB), which predicts an event-local completion object in the form of Boundary-Phase Tokens (Before/Hit/After), retaining two-sided boundary evidence under this discipline. CaB-When converts this completion object into a minimal, auditable switching decision (when), while CaB-How reuses the same completion object to condition action generation for boundary-stable control through handoffs (how). Using an intervention-aware E1/E2 protocol, we show that CaB improves composite execution and handoff quality on a first-person Minecraft VLA benchmark under matched capacity and deployability constraints.

2606.00144 2026-06-02 cs.LG cs.AI 版本更新

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft:面向稀疏KV投机解码的接受感知多视角训练

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang

发表机构 * Shanghai Institute of Optics and Fine Mechanics(上海光学精密机械研究所) The University of Sydney(悉尼大学) Marquette University(马基特大学) Johns Hopkins University(约翰·霍普金斯大学) Wake Forest University(威克森林大学) University of Texas Health Science Center at Houston(德克萨斯大学健康科学中心休斯顿分部) University of Surrey(萨里大学)

AI总结 针对中长上下文推理中稀疏/全缓存不匹配导致接受率下降的问题,提出BudgetDraft多视角稀疏训练方法,通过接受感知损失和多视角损失训练单一鲁棒草稿模型,在固定KV预算下恢复接受率,实现最高6.55倍加速。

详情
AI中文摘要

投机解码通过草稿模型提出多个令牌,验证器并行验证,从而加速自回归解码。在资源受限的部署中,草稿模型使用稀疏KV缓存以在固定KV预算下限制峰值GPU内存和端到端延迟,而验证器保留全KV缓存。实际应用中常见中长上下文推理(4K--16K上下文长度)。然而,随着上下文长度增长,朴素稀疏/全投机解码遭受稀疏/全不匹配问题,导致接受率快速下降。我们提出BudgetDraft,一种用于中长推理中稀疏草稿的多视角稀疏训练方法。草稿模型在训练期间暴露于多个采样的KV预算,并学习将每个稀疏视角与一个共享的全缓存教师目标对齐。BudgetDraft将全缓存分支上的接受感知损失与稀疏缓存分支上的多视角损失相结合,产生一个单一的预算鲁棒草稿模型,无需额外的推理时组件即可恢复跨稀疏级别的接受率。在PG-19、LongBench和LWM上的实验结果表明,BudgetDraft在4K、8K和16K上下文长度下,与自回归相比分别实现了最高6.55倍、4.46倍和2.10倍的端到端加速,同时保持推理流水线内存友好。

英文摘要

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.

2606.00143 2026-06-02 q-fin.PM cs.AI 版本更新

Regime-Adaptive Continual Learning for Portfolio Management

Regime-Adaptive Continual Learning for Portfolio Management

Chaofan Pan, Lingfei Ren, Linbo Xiong, Yonghao Li, Wei Wei, Xin Yang

发表机构 * Southwestern University of Finance and Economics(西南财经大学) Shanxi University(山西大学)

AI总结 提出ReCAP框架,通过自适应制度检测和持续学习实现投资组合管理的快速适应与长期优异回报。

Comments Accepted by KDD 2026

详情
AI中文摘要

金融市场本质上是不稳定的,频繁出现制度转变和结构性变化,使得传统的投资组合管理方法失效。现有的补救措施,如滚动窗口重新训练和朴素在线微调,分别受到高计算成本和知识利用不足的困扰,导致低回报和有限的适应性。持续学习通过使交易代理能够跨顺序任务积累和转移知识,提供了一种有前景的范式。在本文中,我们提出了 extbf{Re}gime-aware extbf{C}ontinual extbf{A}daptive extbf{P}ortfolio management ( extbf{ReCAP}),一个将CL集成到PM中以应对动态金融环境挑战的新框架。ReCAP采用自适应制度检测模块将历史市场数据分割成可变长度的制度,实现制度特定的策略向量学习和策略库构建。在持续交易过程中,制度门控模块根据当前市场状态自适应地组合策略库中的策略向量,促进对新检测到的制度的快速适应。只有制度门控和当前制度的策略向量被持续更新,以有效保留有用知识。在五个真实世界数据集上的广泛实验表明,ReCAP持续优于流行的基线,在长期投资视野中实现卓越回报,并快速适应制度转变。

英文摘要

Financial markets are inherently non-stationary, exhibiting frequent regime shifts and structural changes that render traditional Portfolio Management (PM) approaches ineffective. Existing remedies, such as rolling-window retraining and naive online fine-tuning, are hindered by high computational costs and insufficient knowledge utilization, respectively, resulting in low returns and limited adaptability. Continual learning (CL) offers a promising paradigm by enabling trading agents to accumulate and transfer knowledge across sequential tasks. In this paper, we propose \textbf{Re}gime-aware \textbf{C}ontinual \textbf{A}daptive \textbf{P}ortfolio management (\textbf{ReCAP}), a novel framework that integrates CL into PM to address the challenges of dynamic financial environments. ReCAP employs an adaptive regime detection module to segment historical market data into variable-length regimes, enabling regime-specific learning of policy vectors and the construction of a policy library. During continual trading, a regime-gate module adaptively combines policy vectors from the library based on the current market state, facilitating rapid adaptation to newly detected regimes. Only the regime-gate and the current regime's policy vector are continually updated to preserve useful knowledge effectively. Extensive experiments on five real-world datasets demonstrate that ReCAP consistently outperforms popular baselines, achieving superior returns in long-term investment horizons and rapid adaptation to regime shifts.

2606.00141 2026-06-02 cs.LG cs.AI 版本更新

Adaptive data selection improves wearable prediction under low baseline performance

自适应数据选择改善低基线性能下的可穿戴预测

Ali Kargarandehkordi

AI总结 本研究通过评估多种模态下自适应时间窗口选择策略,发现其能显著提升低基线性能参与者的AUROC(最高提升0.7),而高基线性能者收益有限或为负,且增益与基线性能呈强负相关。

详情
AI中文摘要

自适应传感策略通过选择性采样数据,在有限数据预算下提高预测性能,在可穿戴健康系统中应用日益广泛,但其在不同个体间的收益尚不明确。本文基于纵向可穿戴数据集,评估了在固定测量预算下,针对心率、活动和生态瞬时评估(EMA)等多种传感模态,自适应选择时间窗口进行模型训练的效果。我们使用接收者操作特征曲线下面积(AUROC)和F1分数量化了相对于随机采样的性能提升。自适应策略为基线性能较低的参与者带来了显著的AUROC提升(增益高达0.7),而对基线性能较强的参与者增益有限甚至为负。跨模态来看,自适应增益与基线性能呈强负相关(Pearson r = -0.67;Spearman ρ = -0.62)。在参与者层面,大多数个体在AUROC上受益(跨模态为60-80%),尽管F1的改进较小且一致性较差。这些发现表明,自适应传感并非普遍有益,而是在性能不佳的情况下提供最大价值。我们的结果支持基于基线性能定制自适应传感的选择性部署策略,以提高可穿戴健康监测的效率。

英文摘要

Adaptive sensing strategies that selectively sample data are increasingly used in wearable health systems to improve prediction performance under limited data budgets, yet their benefits across individuals remain poorly understood. Here, we evaluate adaptive selection of time windows for model training under fixed measurement budgets across multiple sensing modalities, including heart rate, activity, and ecological momentary assessment (EMA), in a longitudinal wearable dataset. We quantify performance gains relative to random sampling using both area under the receiver operating characteristic curve (AUROC) and F1 score. Adaptive strategies yield substantial improvements in AUROC for participants with low baseline performance (with gains up to 0.7), while offering limited or negative gains for participants with strong baselines. Across modalities, adaptive gain is strongly inversely correlated with baseline performance (Pearson r = -0.67; Spearman p = -0.62). At the participant level, most individuals benefit in AUROC (60-80% across modalities), although improvements in F1 are smaller and less consistent. These findings show that adaptive sensing is not uniformly beneficial, but instead provides the greatest value in underperforming settings. Our results support selective deployment strategies that tailor adaptive sensing based on baseline performance to improve efficiency in wearable health monitoring.

2606.00139 2026-06-02 cs.CV cs.AI 版本更新

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization

具有统一切线约束先验和曲率正则化的测地线

Chong Di, Li Liu, Jinglin Zhang, Zhenjiang Li, Da Chen, Laurent D. Cohen

发表机构 * Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences)(山东省人工智能研究院,齐鲁工业大学(山东省科学院)) Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine(元身康复研究院,上海交通大学医学院) School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学) Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences(放疗科,山东省肿瘤医院及研究院,山东第一医科大学,山东省医学科学院) CEREMADE, Université Paris Dauphine, Université-PSL, CNRS, UMR 7534(CEREMADE,巴黎大学Dauphine,Université-PSL,CNRS,UMR 7534)

AI总结 提出一种在方向提升空间中融合切线约束先验与曲率惩罚的测地线框架,通过快速行进法高效求解HJB PDE,增强复杂形状图像分割的鲁棒性。

详情
AI中文摘要

曲率惩罚的测地线模型通过计算全局最优曲线在图像分割中证明了其有效性。不幸的是,当描绘具有复杂形状和图像强度分布的对象时,这些模型仍然容易受到捷径的影响,因为它们缺乏强制执行形状感知切线约束的机制。为了解决这一局限性,我们提出了一种统一的测地线框架,该框架将切线约束先验与曲率惩罚相结合。关键思想是直接在方向提升空间内制定切线可接受性,其中路径切线被限制在由内在形状代表(ISR)(如骨架或内部地标)导出的空间变化角度扇区内。这一公式产生了一系列切线约束的芬斯勒度量,扩展了经典的曲率惩罚测地线模型,同时强制执行强制切线约束。由此产生的Hamilton-Jacobi-Bellman(HJB)偏微分方程(PDE)可以通过快速行进法的变体进行高效数值求解,保持了单次通过的计算复杂度。在合成、自然和医学图像上的实验表明,所提出的测地线框架确实提高了对弱边界和拓扑捷径的鲁棒性,与现有测地线模型相比,产生了具有增强形状保真度的分割结果。

英文摘要

Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunately, these models remain susceptible to shortcuts when delineating objects with complex shapes and image intensity distributions, as they lack mechanisms to enforce shape-aware tangent constraints. To address this limitation, we propose a unified geodesic framework that integrates tangent-constrained priors with curvature penalization. The key idea is to formulate tangent admissibility directly within the orientation-lifted space, where path tangents are restricted to spatially varying angular sectors derived from intrinsic shape representatives (ISR) such as skeletons or interior landmarks. This formulation gives rise to a family of tangent-constrained Finslerian metrics, extending the classical curvature-penalized geodesic models while enforcing mandatory tangent constraints. The resulting Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) admit efficient numerical solutions via variants of the fast marching method, preserving the single-pass computational complexity. Experiments on synthetic, natural, and medical images demonstrate that the proposed geodesic framework indeed improves robustness against weak boundaries and topological shortcuts, yielding segmentation results with enhanced shape fidelity compared to existing geodesic models.

2606.00138 2026-06-02 cs.AI 版本更新

A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

面向固体力学问题的端到端有限元分析的多AI智能体框架

Titu Ranjan Sarker, Muhammed Jawaad Zulqernine, Ling Yue, Shaowu Pan, Chenxi Wang, Shiyao Lin

发表机构 * University of Texas at Arlington(德克萨斯大学阿灵顿分校) Institute for Predictive Performance Methodologies(预测性能方法研究所) Rensselaer Polytechnic Institute(拉特格斯理工学院)

AI总结 提出基于大语言模型的多智能体框架AbaqusAgent,将自然语言指令转化为Abaqus有限元分析及结果可视化,在50个固体力学问题上实现86%成功率。

详情
AI中文摘要

有限元分析(FEA)是固体力学最重要的数值方法。FEA的挑战包括入门级用户学习曲线陡峭,以及由于关键模拟组件(如边界条件、载荷工况和求解变量)定义错误导致的潜在错误模拟。解决实际问题通常需要多年的工程经验。为了解决这些问题,我们提出了AbaqusAgent,一个基于大语言模型(LLM)的固体力学分析多智能体框架。AbaqusAgent旨在利用Abaqus(最广泛使用的FEA软件包之一)促进分析案例的生成和执行,将用户的自然语言指令转化为执行的FEA分析和结果可视化。AbaqusAgent由六个智能体组成,包括解释器、架构师、输入编写器、运行器、审查器和可视化器,涵盖了标准FEA分析的所有必要前处理和后处理步骤。50个不同的固体力学问题已成功验证,总体成功率达到86%。除了提高固体力学问题FEA的效率并降低计算力学教育的门槛外,AbaqusAgent还推进了人机仿真交互范式,并实现了与AI驱动的优化和材料表征工作流的集成。代码可在https://github.com/LIRAM-LIN/AbaqusAgent获取。

英文摘要

Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curve for entry-level users and potential false simulations due to incorrect definitions of key simulation components, such as boundary conditions, load cases, and solution variables. Years of engineering experience are usually necessary for real-world problem-solving. To address these issues, we present AbaqusAgent, a multi-agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent is developed to facilitate analysis case generation and execution using Abaqus, one of the most widely used FEA packages, by turning users' natural-language instructions into executed FEA analyses and result visualization. AbaqusAgent is composed of six agents, including interpreter, architect, input writer, runner, reviewer, and visualizer agents, encompassing all the essential pre-processing and post-processing steps of standard FEA analyses. A wide variety of 50 solid mechanics problems have been successfully validated, achieving an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human-simulation interaction paradigm and enables integration with AI-empowered optimization and material characterization workflows. The code is available at https://github.com/LIRAM-LIN/AbaqusAgent

2606.00136 2026-06-02 cs.LG cs.AI cs.CL cs.CR cs.SI 版本更新

Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey

生成式AI与数字生态系统韧性:基于生命周期的主动式综述

Jonghyun Chung, Rishabh Chaddha, Sanket Badhe, Debanshu Das, Nathan Huang, Amanpreet Kaur

发表机构 * Google LLC(谷歌有限公司)

AI总结 本文采用基于生命周期的C5交互模型,综合机器学习与社会科学方法,系统综述了针对生成式AI驱动的对抗性合成内容的主动检测技术,包括协调不真实行为分析、流行病学建模和霍克斯过程等,旨在构建更具韧性的信息生态系统。

Comments 14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)

详情
Journal ref
IEEE Access (2026) IEEE Access (2026)
AI中文摘要

生成式AI加速了对抗性合成内容的扩散,使得传统的被动检测方法失效。本综述综合了新兴研究,展示了向主动检测新兴不真实叙事的范式转变。我们采用统一的、基于生命周期的分类法,将对抗性活动的社会技术生命周期模型与新兴不真实叙事检测的高级计算方法相结合。通过围绕C5交互模型(背景、原因、内容、放大循环、后果)构建分析,我们整合了机器学习和社会科学的不同研究流。为了区分合成放大模式与真实基线流量,本文综述了建模新叙事创建、播种和传播的最先进技术,包括协调不真实行为分析、流行病学建模和霍克斯过程。本综述还系统回顾了C5交互模型不同阶段对抗性威胁的主动检测方法,特别是高维嵌入空间中的异常检测、多层图上的无监督协调检测以及代理型AI系统。最后,本综述探讨了生成式AI带来的挑战,包括追踪快速变化威胁和多级分布漂移的困难,并概述了未来研究议程,重点在于检测异常聚类和构建预期性及韧性系统。本综述为更韧性的信息生态系统提供了基于生命周期的主动检测新兴合成威胁方法的全面回顾。

英文摘要

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes process. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 interaction model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.

2606.00135 2026-06-02 cs.LG cs.AI 版本更新

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

论智能体工具调用与强化学习训练的有效性与效率

Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 本文系统分析工具调用评估中的实现选择对结果敏感性的影响,并针对强化学习训练中的计算浪费提出两种加速技术。

Comments ICML 2026

详情
AI中文摘要

工具调用是现代大型语言模型(LLM)智能体的核心组件,使其具备超越参数化知识的技能。本文从两个互补维度研究工具调用:有效性(即如何衡量该能力)和效率(即如何学习该能力)。在有效性方面,我们系统分析了工具调用评估流程,并表明结果可能对看似微小、通常未文档化的实现选择高度敏感,包括随机种子、系统提示、多轮模板构建以及先前交互/推理历史的传递方式。这些选择可能导致报告性能的显著差异,尤其是在多轮设置中,若缺乏严格标准化,排行榜排名将不可靠。在效率方面,我们考察了用于工具调用的标准强化学习(RL),并识别出两个计算浪费来源:(i)在 rollout 过程中,许多提示不产生学习信号;(ii)在策略更新过程中,优化产生高计算成本。基于这些发现,我们引入了两种加速基于 RL 的工具调用训练的技术,在不降低性能的情况下实现了显著的挂钟时间加速。

英文摘要

Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.

2606.00134 2026-06-02 cs.CR cs.AI cs.LG 版本更新

XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT

XAI-SOH-FL: 通过自适应聚合和可解释人工智能增强异构物联网入侵检测中的SOH-FL

Ambreen Aslam, Maaz Hassan, Bibi Zahra, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)(电气工程与计算机科学学院(SEECS),国家 sciences and Technology(NUST))

AI总结 针对异构物联网中数据异构、标签稀缺和模型不可解释性问题,提出XAI-SOH-FL框架,通过自适应聚合(动态γ选择与贝叶斯优化)和SHAP可解释性,在CICIDS2017数据集上达到94.12%准确率和0.92 F1分数,优于基线SOH-FL。

Comments 8 pages, 6 figures; code available at https://github.com/aaslam-msit/SOH-FL-Enhancement

详情
AI中文摘要

物联网环境中的入侵检测系统面临数据异构、缺乏标记数据和模型可解释性有限等重大挑战。联邦学习提供了一种隐私保护解决方案;然而,现有方法如SOH-FL存在两个关键限制:依赖手动调整的聚合参数γ以及模型预测缺乏可解释性。在本文中,我们提出XAI-SOH-FL,一个增强框架,将自适应聚合和可解释人工智能集成到SOH-FL范式中。首先,我们引入基于相似性阈值的动态γ选择机制,使聚合过程能够适应不断变化的数据分布。其次,采用贝叶斯优化自动确定最优γ值,消除了手动调整的需要。第三,引入SHAP(SHapley Additive exPlanations)为入侵检测决策提供特征级可解释性。在CICIDS2017数据集上的实验评估表明,所提方法达到了94.12%的准确率和0.92的F1分数,优于基线SOH-FL模型,同时收敛所需的通信轮次更少。此外,基于SHAP的分析揭示,流级特征如流持续时间和数据包长度显著影响模型预测。这些结果表明,XAI-SOH-FL在异构物联网环境中提供了准确性、适应性和可解释性之间的有效平衡。

英文摘要

Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of labeled data, and limited model interpretability. Federated Learning (FL) offers a privacy-preserving solution; however, existing approaches such as SOH-FL suffer from two key limitations: reliance on a manually tuned aggregation parameter γ and lack of explainability in model predictions. In this paper, we propose XAI-SOH-FL, an enhanced framework that integrates adaptive aggregation and explainable artificial intelligence into the SOH-FL paradigm. First, we introduce a dynamic γ selection mechanism based on similarity thresholding, enabling the aggregation process to adapt to evolving data distributions. Second, Bayesian Optimization is employed to automatically determine optimal γ values, eliminating the need for manual tuning. Third, SHAP (SHapley Additive exPlanations) is incorporated to provide feature-level interpretability for intrusion detection decisions. Experimental evaluation on the CICIDS2017 dataset demonstrates that the proposed approach achieves an accuracy of 94.12% and an F1-score of 0.92, outperforming the baseline SOH-FL model while converging in fewer communication rounds. Furthermore, SHAP-based analysis reveals that flow-level features such as Flow Duration and Packet Length significantly influence model predictions. These results indicate that XAI-SOH-FL provides an effective balance between accuracy, adaptability, and interpretability in heterogeneous IoT environments.

2606.00132 2026-06-02 cs.LG cs.AI 版本更新

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

基于广义瑞利商优化的基础模型保留适配

Dongjun Kim, Adrian de Wynter, Huancheng Chen, Heasung Kim, Haris Vikalo

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Microsoft(微软) Microsoft AI(微软人工智能) Meta

AI总结 提出FoLoRA框架,通过广义瑞利商优化更新方向,在微调中平衡下游任务性能与预训练能力保留。

详情
AI中文摘要

虽然微调有效地将基础模型适配到专门的下游任务,但可能会降低预训练期间获得的非目标能力。现有的遗忘感知方法通常通过专门的初始化或固定约束寻求更安全的更新,但未在训练过程中调节适配-保留权衡。我们提出基础保留LoRA(FoLoRA),一个遗忘感知优化框架。在一阶保留条件的指导下,FoLoRA定义了预训练代理激活上的遗忘惩罚和下游任务激活上的任务效用。然后,它通过广义瑞利商按单位遗忘惩罚的任务效用对更新方向进行评分。由此产生的谱坐标系实现了方向门控Adam更新,在训练过程中衰减低效用-惩罚方向。为了估计遗忘惩罚,FoLoRA通过从预训练模型中采样构建预训练代理校准数据,而不是依赖单个代理数据集。在数学、代码和指令遵循适配上的实验表明,FoLoRA在基线上实现了最强的保留-适配平衡,提高了目标任务性能,同时最好地聚合保留了非目标能力。

英文摘要

While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting aware methods typically seek safer updates through specialized initialization or fixed constraints, but do not regulate the adaptation preservation trade-off during training. We propose Foundation Preserving LoRA (FoLoRA), a forgetting aware optimization framework. Guided by a first order preservation condition, FoLoRA defines a forgetting penalty over pretraining-proxy activations and a task utility over downstream task activations. It then scores update directions by task utility per unit forgetting penalty via a generalized Rayleigh quotient. The resulting spectral coordinate system enables direction wise gated Adam updates, attenuating low utility to penalty directions during training. To estimate the forgetting penalty, FoLoRA constructs pretraining proxy calibration data by sampling from the pretrained model rather than relying on a single proxy dataset. Experiments on math, code, and instruction following adaptation show that FoLoRA achieves the strongest preservation adaptation balance over baselines, improving target task performance with best aggregate preservation of non target capabilities.

2606.00131 2026-06-02 cs.SE cs.AI cs.LG cs.PL 版本更新

AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve

AI-PROPELLER:基于AlphaEvolve的仓库规模过程间代码布局优化

Chaitanya Mamatha Ananda, Rajiv Gupta, Mircea Trofin, Aiden Grossman, Sriraman Tallam, Xinliang David Li, Amir Yazdanbakhsh

发表机构 * University of California, Riverside(加州大学河滨分校) Google(谷歌) DeepMind(深度思维)

AI总结 提出AI-PROPELLER系统,利用Magellan智能工作流将Propeller的编译器启发式方法演化为细粒度过程间优化器,并通过实际硬件执行评估布局变体,首次在工业仓库规模应用中实现细粒度过程间代码布局优化,性能提升0.23%至1.6%。

详情
AI中文摘要

后链接优化器(如Propeller和BOLT)已证明,精确的、基于性能剖析的代码布局可以从高度优化的二进制文件中提取显著的性能提升。然而,这些系统目前局限于过程内技术,未能充分利用过程间布局的全局潜力。由于组合爆炸的搜索空间和复杂的调用返回语义难以建模,过程间代码布局历来困难。因此,细粒度过程间布局的性能潜力在实践中尚未得到证实。AI-PROPELLER使用Magellan(一种智能工作流),将Propeller中的编译器启发式方法演化为细粒度过程间优化器,并微调所得策略的超参数。为确保高保真度,我们摒弃了近似的静态成本模型,智能工作流生成多个布局变体,并在实际硬件上执行以测量真实性能计数器,为进化循环提供精确的奖励信号。AI-PROPELLER已在包括大型仓库规模应用在内的多个基准测试上进行了评估,实验表明,在使用最先进的FDO和PLO优化后,性能提升0.23%至1.6%,这对于实际二进制文件而言意义重大。这是首次在工业环境中对大型仓库规模应用进行细粒度过程间代码布局优化。

英文摘要

Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult due to a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high-fidelity, we move away from approximate static cost models and the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks including large warehouse-scale applications and experiments show performance improvements of 0.23% to 1.6% optimized with state-of-the-art FDO and PLO which is significant for real-world binaries. This is the first time ever that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.

2606.00130 2026-06-02 cs.LG cs.AI 版本更新

Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks

自动可微非线性张量网络(ADNTNs)用于深度神经网络的指数级压缩

Andrzej Cichocki, Michal Wietczak

发表机构 * Institute of Computing Intelligence, Polish Academy of Sciences(波兰科学院计算智能研究所)

AI总结 提出自动可微非线性张量网络(ADNTNs)作为结构化权重生成器,通过反向模式自动微分端到端训练紧凑核心张量,实现深度神经网络的高效压缩,在AlexNet和VGG-16上达到每层2000倍至77000倍压缩比,且精度与密集基线相当或更优。

Comments 6 figure, 28 pages, to be submitted to Journal and confrence

详情
AI中文摘要

我们研究了自动可微非线性张量网络(ADNTNs),这是一类结构化权重生成器,其紧凑核心张量通过反向模式自动微分(AD)进行端到端训练。该方法可视为低秩适应和张量分解的自然扩展:ADNTN不是使用一个低秩矩阵更新,而是通过小核心、非线性激活和可选的横向混合张量的层次结构构建大权重张量。本文聚焦于三种架构:树张量网络(TTNs)、带边界解缠器的增强型TTN(aTTNs)以及多尺度纠缠重整化拟设(MERA)。该公式支持非线性激活、任务感知目标、批处理以及硬件感知的执行调度。同时,本文明确区分了“微分”收缩程序和使收缩自由:AD并未消除大中间体、不良收缩顺序或一般带环张量网络精确收缩的成本。在AlexNet和VGG-16层上的大量模拟显示,在所研究设置下每层压缩比约为2000倍至77000倍,精度通常与密集基线相当,且在几个VGG-16案例中有所提升。这些结果是令人鼓舞的而非最终结论:它们表明,只要优化、收缩调度和部署内核协同设计,ADNTNs是一条有前景、数学结构清晰且硬件感知的通往更小神经网络的路径。

英文摘要

We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tensors are trained end-to-end by reverse-mode automatic differentiation (AD). The approach can be viewed as a natural extension of low-rank adaptation and tensor factorisation: instead of using one low-rank matrix update, an ADNTN builds a large weight tensor through a hierarchy of small cores, nonlinear activations, and optional lateral mixing tensors. The paper focuses on three architectures: Tree Tensor Networks (TTNs), augmented TTNs (aTTNs) with boundary disentanglers, and Multi-scale Entanglement Renormalisation Ansatze (MERA). The formulation supports nonlinear activations, task-aware objectives, batching, and hardware-aware execution schedules. At the same time, the paper keeps a clear distinction between \emph{differentiating} a contraction program and making contraction free: AD does not remove the cost of large intermediates, poor contraction orders, or exact contraction of general loopy tensor networks. Extensive simulations on AlexNet and VGG-16 layers show per-layer compression ratios from roughly $2000\times$ to $77000\times$ in the studied settings, with accuracy often matching the dense baseline and, in several VGG-16 cases, improving it. These results are encouraging rather than final: they suggest that ADNTNs are a promising, mathematically structured, and hardware-aware route toward much smaller neural networks, provided that optimisation, contraction schedules, and deployment kernels are designed together.

2606.00129 2026-06-02 cs.LG cs.AI 版本更新

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

现代LLM与人脑EEG共享的效价轴:饱和规律

Yousef A. Radwan, Xuhui Liu, Kilichbek Haydarov, Yuqian Fu, Mohamed Elhoseiny

发表机构 * King Abdullah University of Science and Technology(卡布斯大学)

AI总结 本研究通过构建从大型语言模型(LLM)中提取的一维效价方向(V轴),发现其与人类EEG神经活动对齐,但进一步对齐策略无法提升解码性能,并形式化为“饱和规律”,指出改进应来自监督无法触及的残差子空间。

详情
AI中文摘要

大型语言模型(LLM)已成为强大的表示学习器,其内部特征与人类认知日益对齐。我们研究现代LLM是否可以作为理解人脑神经表示的透镜,重点关注EEG中的情感效价。 我们首先仅使用九个情感唤起句子从现代LLM中构建了一维效价方向(V轴),并通过零样本迁移到情感基准测试和跨十四个LLM的模型一致性进行了验证。然后,我们展示了这个从LLM导出的方向映射到人类神经活动。在一个包含123名受试者观看情感视频的公共EEG队列中,EEG特征上的单个线性投影追踪了每个刺激的V轴位置。此外,36个未暴露于V轴的EEG情感分类器在其内部表示中自发发现了相同的方向,表明相同的效价结构在语言模型和人类电生理学中同时出现。 然而,这种趋同并未提供有效的训练信号。我们测试了二十五种对齐策略,包括知识蒸馏、表示相似性、对比和拓扑损失;没有一种能改善解码,十六种显著降低了准确性。我们将这一结果形式化为饱和规律:一旦任务标签单独驱动脑解码网络朝向目标方向,额外的监督主要扭曲已经饱和的盆地,而承载类内残差的子空间几乎得不到有用的梯度。 这一规律也指出了改进应来自何处:监督无法触及的残差子空间。受此启发,我们集成残差多样性而非监督盆地,在FACED上将平衡准确率提高了10.5%,并在SEED-V上复制了相同效果。

英文摘要

Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cognition. We study whether modern LLMs can serve as a lens for understanding neural representations in the human brain, focusing on emotional valence in EEG. We first build a one-dimensional valence direction, the V-axis, from modern LLMs using only nine emotion-evocative sentences. We validate it through zero-shot transfer to sentiment benchmarks and cross-model consistency across fourteen LLMs. We then show that this LLM-derived direction maps onto human neural activity. On a public EEG cohort of 123 subjects watching affective videos, a single linear projection on EEG features tracks the V-axis position of each stimulus. Moreover, 36 EEG emotion classifiers trained without exposure to the V-axis spontaneously rediscover the same direction in their internal representations, suggesting that the same valence structure emerges in both language models and human electrophysiology. Yet this convergence does not provide an effective training signal. We test twenty-five alignment strategies, including knowledge distillation, representational similarity, contrastive, and topographic losses; none improve decoding, and sixteen significantly reduce accuracy. We formalize this result as the saturation regularity: once task labels alone drive a brain-decoding network onto the target direction, additional supervision mainly distorts an already-saturated basin, while the load-bearing within-class residual receives little useful gradient. This regularity also indicates where improvement should come from: the residual subspace unreachable by supervision. Motivated by this insight, we ensemble across residual diversity rather than supervising the basin, improving balanced accuracy by 10.5% over the prior best on FACED, with the same effect replicated on SEED-V.

2606.00125 2026-06-02 cs.IR cs.AI cs.LG cs.MM 版本更新

Multimodal Music Recommendation System using LLMs

使用LLMs的多模态音乐推荐系统

Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校) Dolby Laboratories(Dolby实验室) Adobe Research(Adobe研究) Cisco Research(Cisco研究)

AI总结 提出一个多模态框架,通过融合音频、歌词、LLM生成的语义元数据和收听完成率,在基于会话的音乐推荐中显著提升Recall和NDCG。

详情
AI中文摘要

音乐推荐系统通常将歌曲视为不透明标记,依赖协同交互历史,忽略了语义或声学内容。先前工作探索了LLM增强、多模态和文本增强的序列推荐方法,但有些方法部分结合了语义、声学或参与信号,没有在一个统一的基于LLM的序列推理框架中联合建模所有三个信号,该框架将推荐基于实际歌曲内容。在这项工作中,我们提出了一个用于基于会话的音乐推荐的多模态框架,通过三种互补信号丰富了LastFM-1K数据集:(1) 使用预训练音乐和文本表示模型提取的音频和歌词嵌入,(2) 使用MGPHot注释方案生成的LLM语义元数据,以及(3) 收听完成率。我们采用E4SRec框架,通过扩展多模态特征和不同的项目ID编码器骨干(包括SASRec、BERT4Rec和GRU4Rec)来增强它。我们进一步扩展了LLM骨干选项,包括LLaMa-2-13B、Qwen2.5-7B-Instruct和LLaMa-3-70B,在零样本和微调设置下。我们的实验表明,集成基于内容的特征比仅使用ID的基线在Recall上提升高达95%,在NDCG上提升高达79%。此外,我们的实验表明,朴素的多模态融合并不总是产生加性改进,突显了跨模态整合的挑战。我们发布了一个用于音乐推荐的大规模多模态基准。

英文摘要

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation, and while some methods partially combine semantic, acoustic, or engagement signals, none jointly model all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata using the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework by extending it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone option with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines up to 95% in terms of Recall and 79% in terms of NDCG. Moreover, our experiments show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.

2606.00123 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

CardioLens: 通过多序列心脏MRI评估揭示MLLMs的临床现实差距

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Beijing Anzhen Hospital(北京安贞医院) Beihang University(北航) King Abdullah University of Science and Technology(国王 Abdullah 科学与技术大学)

AI总结 提出CardioLens测试平台,通过多序列心脏磁共振成像评估24个多模态大语言模型,发现其在临床工作流中表现不佳,存在类别崩溃失败模式,且输入选择和推理提示改进效果有限。

详情
AI中文摘要

多模态大语言模型在公共医学基准上表现出色,但现有评估通常依赖于孤立输入和简化识别任务,难以作为临床使用的有效代理。我们提出了CardioLens,一个针对多序列心血管磁共振的无泄漏评估测试平台,通过严格的报告到QA构建和验证流程,从私有医院档案中构建。CardioLens包含473,896张切片和13,494个经过验证的QA对,涵盖4D Cine、LGE、灌注和T2加权成像,并评估CMR解读的三个阶段:图像理解、报告生成和疾病诊断。在24个最先进的MLLM上,CardioLens揭示了显著的临床现实差距:模型整体表现不佳,性能沿真实CMR工作流下降。混淆分析进一步显示一种类别崩溃失败模式,模型倾向于默认频繁出现的异常类别,而不是区分临床不同的发现。为了排除MLLM兼容输入构造是主要原因,我们在不同切片预算下比较了随机、临床动机和数据驱动的切片选择协议;性能变化很小,通常约为1%。显式推理提示也无法挽救性能,往往使模型更加保守,而不是改善视觉证据的使用。这些结果表明,当前MLLM远未达到可靠的CMR解读,临床决策需要跨序列、视图和时间相位整合分布式证据。CardioLens为开发面向真实临床部署的下一代MLLM提供了一个临床基础的测试平台。

英文摘要

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.

2606.00121 2026-06-02 cs.CV cs.AI 版本更新

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

基于语义和结构引导的大脑活动图像重建通用框架

Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang, Huiguang He

发表机构 * State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology(脑认知与脑启发智能技术国家重点实验室) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Future Technology, University of Chinese Academy of Sciences(中国科学院大学未来技术学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出MindDiffuser两阶段框架,结合CLIP文本嵌入和视觉特征,通过Stable Diffusion生成语义图像并迭代优化结构信息,在fMRI、EEG、MEG三种模态上显著提升图像重建性能。

详情
AI中文摘要

从大脑记录中重建视觉刺激一直是脑解码中一项有意义且具有挑战性的任务。特别是,实现精确且可控的图像重建对于推动脑机接口的进步和应用具有重要意义。最近的方法利用文本到图像生成模型的能力,在语义(如概念和对象)方面重建了接近复杂自然刺激的图像。然而,它们在保持与原始刺激在细粒度结构信息(如位置、方向和大小)上的一致性方面存在困难,这削弱了模型的可控性和可解释性。为了解决上述问题,我们提出了一个两阶段图像重建框架,称为MindDiffuser。在第一阶段,从大脑反应解码的对比语言-图像预训练(CLIP)文本嵌入被输入到Stable Diffusion中,生成包含语义信息的初步图像。在第二阶段,我们使用解码的浅层CLIP视觉特征作为监督信号,通过反向传播迭代优化来自第一阶段的特征向量,以对齐结构信息。我们在由视觉刺激引发的三种模态(fMRI、EEG、MEG)的大脑反应数据集上进行了大量实验,结果表明我们的框架显著提升了先前最先进模型的性能,凸显了我们方法的有效性和通用性。空间和时间可视化结果进一步支持了我们框架的神经生物学合理性,为未来跨不同大脑信号模态的神经解码工作提供了指导。

英文摘要

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Recent methods, leveraging advances in the power of text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models. To address the aforementioned issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.

2606.00120 2026-06-02 eess.SP cs.AI cs.LG 版本更新

SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction

SpikeWFM:用于鲁棒信道预测的脉冲辅助无线基础模型

Liwen Jing, Yisha Lu, Tingting Yang, Li Sun, Yuxuan Shi, Yuwei Wang, Mengfan Zheng, Leiyang Xu

发表机构 * Mobile Information Networks-National Science and Technology Major Project(移动信息网络国家科技重大专项)

AI总结 提出SpikeWFM混合架构,将脉冲神经网络与基于ANN的Transformer结合,通过时间稀疏性和事件驱动处理增强无线基础模型对噪声和干扰的鲁棒性,在信道预测任务上优于传统模型。

详情
AI中文摘要

本文提出SpikeWFM,一种新颖的混合架构,它将脉冲神经网络(SNN)与基于传统人工神经网络(ANN)的Transformer集成用于无线基础模型(WFM)。受人类大脑中噪声鲁棒且节能的信息处理启发,SpikeWFM旨在增强WFM对噪声和干扰的抵抗力,同时保持跨多种无线场景的强大泛化能力。借鉴大型语言模型成功经验,WFM利用跨各种无线环境的大规模数据集上的自监督预训练,学习一个统一的嵌入表示,支持包括信道预测、信道估计、波束预测、定位等在内的广泛下游任务。这类模型通常优于任务特定设计,并对未见条件表现出卓越的适应性。然而,现有WFM在实际无线系统中仍易受真实噪声和干扰影响。为解决这一局限,我们将脉冲神经元引入基于Transformer的WFM架构。我们提供简要理论分析,展示SNN-ANN混合如何通过时间稀疏性和事件驱动处理有效减轻噪声和干扰。实验结果表明,SpikeWFM在预训练收敛和信道预测准确性上均持续优于传统基于ANN的WFM。关于通信和感知任务的更多结果将在本工作的完整期刊版本中呈现。

英文摘要

This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neural network (ANN)-based transformers for wireless foundation models (WFMs). Inspired by the noise-robust and energy-efficient information processing in the human brain, SpikeWFM aims to enhance the resilience of WFMs against noise and interference while maintaining strong generalization capabilities across diverse wireless scenarios. Drawing from the success of large language models, WFMs leverage self-supervised pre-training on large-scale datasets spanning various wireless environments to learn a unified embedding that supports a wide range of downstream tasks, including channel prediction, channel estimation, beam predition, positioning and etc. Such models typically outperform task-specific designs and exhibit superior adaptability to unseen conditions. However, existing WFMs remain vulnerable to realistic noise and interference in practical wireless systems. To address this limitation, we incorporate spiking neurons into the transformer-based WFM architecture. We provide a brief theoretical analysis demonstrating how the SNN-ANN hybrid effectively mitigates noise and interference through temporal sparsity and event-driven processing. Experimental results show that SpikeWFM consistently outperforms conventional ANN-based WFMs in both pre-training convergence and channel prediction accuracy. Additional results on communication and sensing tasks will be presented in the full journal version of this work.

2606.00119 2026-06-02 cs.RO cs.AI 版本更新

V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising

基于位姿条件的UWB测距去噪的V2I工作区几何重建

Jiaxi Liu, Hangyu Li, Yang Cheng, Rui Gana, Junwei You, Weizhe Tang, Peng Zhang, Steven T. Parker, Xiaopeng Li, Bin Ran

发表机构 * Department of Civil & Environmental Engineering, University of Wisconsin-Madison(威斯康星大学麦迪逊分校土木与环境工程系)

AI总结 针对V2I工作区几何重建中UWB测距受突发异常、非视距误差和位姿不确定性的影响,提出一种位姿条件、排列等变的预测去噪器,通过共享锚点时间预测、对称集聚合和位姿条件残差解码,显著提升测距精度和几何重建质量。

详情
AI中文摘要

可靠的工作区映射对于网联自动驾驶车辆(CAV)安全平稳地通过工作区至关重要。安装在锥形路标上的超宽带(UWB)路侧单元(RSU)提供了一种经济高效的工作区布局推断方式,因为路侧锚点和车载标签为工作区几何重建提供了直接的车对基础设施(V2I)距离约束。然而,在实际现场部署中,UWB测距估计受到突发异常、非视距(NLOS)误差、任意锚点排序问题以及车辆位姿不确定性的影响。为解决这些挑战,本研究提出了一种位姿条件、排列等变的预测去噪器,用于多锚点UWB测距。该模型采用共享锚点时间预测来捕捉距离动态,对称集聚合来处理无序和缺失的锚点,以及位姿条件残差解码来将车辆运动作为几何先验。两阶段训练策略首先从观测距离学习预测,然后通过NLOS加权监督微调去噪器。该方法在CAV收集的罕见真实世界V2I UWB现场数据以及受控大规模仿真基准上进行了评估,以获得消融见解。结果表明,所提出的方法在具有挑战性的NLOS主导场景中显著提高了测距精度、锥形标定位和工作区几何重建,对锚点重新索引和适度锚点丢失保持鲁棒,并将测量加权的现场均方误差相对于原始输入降低了66.9%。

英文摘要

Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone areas. Cone-mounted ultra-wideband (UWB) roadside units (RSU) offer a cost-effective way for work zone layout inference, as roadside anchors and vehicle tags provide direct vehicle-to-infrastructure (V2I) range constraints for work zone geometry reconstruction. However, UWB range estimation is degraded by bursty outliers, non-line-of-sight (NLOS) errors, arbitrary anchor-ordering issues, and vehicle pose uncertainties in practical field deployments. To address these challenges, this study proposes a pose-conditioned, permutation-equivariant predictive denoiser for multi-anchor UWB ranging. The model employs shared anchor-wise temporal prediction to capture range dynamics, symmetric set aggregation to handle unordered and missing anchors, and pose-conditioned residual decoding to incorporate vehicle motion as a geometric prior. A two-stage training strategy first learns prediction from observed ranges, and then fine-tunes the denoiser with NLOS-weighted supervision. The method is evaluated on rare real-world V2I UWB field data collected with a CAV, as well as on controlled large-scale simulation benchmarks for ablative insights. Results show that the proposed method substantially improves range accuracy, cone localization, and work zone geometry reconstruction in challenging NLOS-dominated regimes, remains robust to anchor re-indexing and moderate anchor dropout, and reduces measurement-weighted field MSE by 66.9% relative to the raw input.

2606.00116 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

增强BiGRU与KAN模块在法律文档分类与摘要中的应用

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

发表机构 * Dept. of ECE North South University(电子工程系北南大学)

AI总结 提出一种基于KAN的BiGRU模型,用于低资源多语言法律文档的分类与摘要,通过KAN模块提升分类准确率至67.96%。

Comments This paper contains of 10 pages, 10 figures, 4 tables and version 2 after it review from ACL 2026

详情
AI中文摘要

本研究引入了一种基于KAN的BiGRU模型的新架构,用于低资源多语言环境下的法律文档分类与摘要任务。为了解决领域语言、不同语言使用、上下文长依赖和类别不平衡等问题,我们使用了由孟加拉国法律文档组成的数据集,这些文档来自Manupatra,包括孟加拉语、英语和音译孟加拉语。我们的分类任务采用BiGRU模型以及Kolmogorov-Arnold网络(KAN)模块,而摘要部分则利用基于注意力的GRU结合KAN模型头部。分类模型达到了67.96%的准确率和0.65的F1分数;摘要的ROUGE-1、ROUGE-2和ROUGE-L指标分别对应0.38、0.23和0.31的F1分数。消融研究表明,使用KAN将分类准确率从57.34%提升至67.96%。此外,我们将所提出的技术与多个基线进行了比较,包括经典机器学习算法和预训练语言模型。

英文摘要

This study introduces a novel architecture of KAN-based BiGRU model for the task of classification and summarization of legal documents in a low-resource multilingual setup. In order to tackle problems associated with domain language, the usage of different languages, long dependencies within context, and class imbalance, we employ the dataset composed of legal documents from Bangladesh and taken from Manupatra, which include Bengali, English, and transliterated Bengali languages. Our classification task involves BiGRU model, along with Kolmogorov-Arnold Network (KAN) module, while the summarization part utilizes attention-based GRU, combined with a KAN model head. Classification model yields 67.96% of accuracy and 0.65 F1 score; while ROUGE-1, ROUGE-2, and ROUGE-L measures for summarization yield 0.38, 0.23, and 0.31 F1 scores, correspondingly. Ablation study shows that the use of KAN increases classification accuracy from 57.34% to 67.96%. Moreover, our proposed technique is compared to several baselines, including classical ML algorithms and pretrained language models.

2606.00109 2026-06-02 cs.CV cs.AI cs.LG 版本更新

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

VDSB-GWSyn: 用于冠状动脉造影中可控且解剖学可行的导丝合成的扩散薛定谔桥

Haoyuan Tang, Zhuo Zhang, Jialin Li, Shuai Xiao, Jiachen Yang

发表机构 * Tianjin University(天津大学)

AI总结 提出基于扩散薛定谔桥的VDSB-GWSyn框架,通过形状先验和血管分割约束生成可控、高保真导丝样本,显著提升下游导丝端点定位精度。

Comments Early accept to MICCAI 2026

详情
AI中文摘要

冠状动脉导丝端点定位是计算机辅助PCI的基本能力,随着机器人辅助PCI逐渐普及以减少操作者辐射暴露,其重要性日益增加。然而,带有导丝的标注CAG图像稀缺以及现有导丝合成模型的适应性有限,仍是导丝端点定位的关键瓶颈。为解决此问题,我们提出VDSB-GWSyn,一个基于扩散薛定谔桥(DSB)模型的框架,能够在复杂解剖背景下合成可控、高保真的导丝样本。VDSB-GWSyn首先使用我们的形状先验算法学习基本导丝几何形状,然后在血管分割掩码的约束下生成导丝掩码并输出对应的端点坐标,最后通过SPADE条件化的DSB在真实CAG图像上合成逼真的导丝样本。实验结果表明,VDSB-GWSyn合成的导丝样本取得了良好的ROI-FID和ROI-KID,以及高IPR分数。此外,将我们的合成数据用于合成预训练后接真实微调,显著改进了下游导丝端点定位,将MPE从16.01像素降低到7.71像素,PCK@3像素从52.63%提高到86.27%,从而实现了更临床可靠的机器人辅助导丝输送系统部署。此外,具有严格背景保留和解剖可行性约束的可控设备合成的核心设计理念,有可能迁移到其他标注数据稀缺的介入设备感知任务中。

英文摘要

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01~px to 7.71~px and increasing PCK at 3~px from 52.63\% to 86.27\%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.

2606.00108 2026-06-02 eess.SP cs.AI 版本更新

Project SPARROW and the Future of Conservation Technology

SPARROW项目与保护技术的未来

Juan M. Lavista Ferres, Carl Chalmers, Bruno Demuro Segundo, Zhongqi Miao, Andres Hernandez Celis, Federico Alves Torres, Isai Daniel Chacon Silva, Anthony Cintron Roman, Allen Kim, Meygha Machado, Luana Marotti, Amy Michaels, Daniela Ruiz Lopez, Catherine Romero, Rahul Dodhia, Inbal Becker-Reshef, Pablo Arbelaez

发表机构 * Microsoft AI for Good Lab(微软AI for Good实验室) Universidad de los Andes(andes大学) University of Maryland(马里兰大学)

AI总结 提出SPARROW开源平台,通过集成太阳能、边缘AI和卫星通信,实现偏远地区连续自主的生物多样性监测,并在多国部署验证其鲁棒性和可扩展性。

详情
AI中文摘要

全球生物多样性正以前所未有的速度下降,然而可用于监测和保护生态系统的工具仍受限于电力、连接性和可达性。我们提出SPARROW,一个集成太阳能、边缘人工智能和卫星通信的开源软硬件平台,能够在偏远环境中实现连续、自主的生物多样性监测。每个SPARROW节点结合低功耗图形处理单元(GPU)与模块化视觉、声学和环境传感器,执行设备端深度学习推理,并通过低地球轨道(LEO)卫星或全球移动通信系统(GSM)网络传输汇总结果。我们在哥伦比亚、秘鲁、坦桑尼亚和美国的热带、温带和高山生态系统中部署了SPARROW,它在多变的环境条件下维持24/7运行,并在前190天内收集了超过200万张图像和声学记录。该系统展示了鲁棒的实时分类和自适应电源管理,实现了无需现场人工干预的完全自主。通过集成可再生能源、边缘AI和开源设计,SPARROW降低了生态监测的技术和财务门槛,并为分布式智能传感器网络(新兴的“万物互联”用于行星生物多样性监测)建立了可扩展的基础。

英文摘要

Global biodiversity is declining at unprecedented rates, yet the tools available to monitor and protect ecosystems remain limited by constraints in power, connectivity, and accessibility. We present SPARROW, a hardware and software open-source platform that integrates solar energy, edge artificial intelligence, and satellite communication to enable continuous, autonomous biodiversity monitoring in remote environments. Each SPARROW node combines a low-power Graphics Processing Unit (GPU) with modular visual, acoustic, and environmental sensors, performing on-device deep learning inference and transmitting summarized results through Low-Earth-Orbit (LEO) satellite or Global System for Mobile Communications (GSM) networks. We deployed SPARROW across tropical, temperate, and montane ecosystems in Colombia, Peru, Tanzania, and the United States, where it sustained 24/7 operation under variable environmental conditions and collected more than two million images and acoustic recordings in the first 190 days. The system demonstrated robust real-time classification and adaptive power management, achieving full autonomy without on-site human intervention. By integrating renewable energy, on-edge AI, and open-source design, SPARROW lowers the technical and financial barriers to ecological monitoring and establishes a scalable foundation for a distributed, intelligent network of sensors, an emerging "Internet of Living Things" for planetary biodiversity monitoring.

2606.00107 2026-06-02 eess.SP cs.AI cs.LG 版本更新

Motif-based morphology signatures for interpretable ECG screening and monitoring

基于基序的形态学特征用于可解释的心电图筛查和监测

Nivedita Bijlani, Mauricio Villarroel

发表机构 * The Podium Institute of Sports Medicine and Technology(Podium运动医学与体育科技研究所)

AI总结 提出一种基于基序的框架,通过定义可解释的心跳对齐基序和三种漂移度量,实现短期和长期心电图监测中的形态学变化量化与异常检测。

Comments Accepted to the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

详情
AI中文摘要

心电图仍然是心血管筛查的核心,但解读仍主要依赖人工且呈间歇性。临床实践依赖于简短的静息心电图,并在需要时进行长时间动态记录,两者都会产生需要大量资源审查的数据。因此,在临床明显异常出现之前,微妙的形态学变化或渐进性漂移可能被忽视。我们提出了一种基于基序的框架,该框架将心跳对齐的心电图基序定义为可解释的心脏特征,并量化短期和长期监测中的形态学漂移和偏差。基序是代表主导形态的典型心动周期。我们引入了三个可解释的漂移度量:与正常窦性心律的偏差、与个性化基线的偏差以及基序不稳定性指数。基序通过选择在固定窗口内最小化动态时间规整距离的心跳来提取。我们在短期(PTB-XL)和长期(MIT-BIH心律失常)心电图数据集上评估这些度量。通过代表性基序叠加和基于基准点的可视化实现可解释性,从而能够直接检查形态学变化。在MIT-BIH中,所提出的度量显著区分了主要正常和心律失常受试者(p<0.01)。在PTB-XL中,正常窦性心律偏差在主要诊断亚型中区分了正常和异常心电图(p<1e-4,Cliff's delta高达0.93)。心电图基序提供了心脏形态的可解释表示,支持可扩展的纵向监测和形态学驱动变化的早期检测。

英文摘要

Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical practice relies on brief resting ECGs and, when required, long-duration ambulatory recordings, both generating data that require resource-intensive review. Consequently, subtle morphological changes or progressive drift preceding clinically apparent abnormalities may go unnoticed. We propose a motif-based framework that defines beat-aligned ECG motifs as interpretable cardiac signatures and quantifies morphological drift and deviation across short and long-term monitoring. Motifs are representative cardiac cycles capturing dominant morphology. We introduce three interpretable drift metrics: deviation from a normal sinus rhythm (NSR), deviation from a personalised baseline, and a motif instability index. Motifs are extracted by selecting beats that minimise Dynamic Time Warping (DTW) distance within fixed windows. We evaluate these metrics on short (PTB-XL) and long-duration (MIT-BIH Arrhythmia) ECG datasets. Interpretability is achieved through representative motif overlays and fiducial-based visualisations, enabling direct inspection of morphological changes. In MIT-BIH, the proposed metrics significantly separated predominantly normal from arrhythmic subjects (p<0.01). In PTB-XL, NSR deviation distinguished normal from abnormal ECGs across major diagnostic subtypes (p<1e-4, Cliff's delta up to 0.93). ECG motifs provide an interpretable representation of cardiac morphology, supporting scalable longitudinal monitoring and early detection of morphology-driven change.

2606.00106 2026-06-02 eess.SP cs.AI cs.HC cs.LG 版本更新

A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces

脑机接口中速度-准确性权衡显式控制的方法论框架

Javier Jiménez, Francisco B Rodríguez

发表机构 * Grupo de Neurocomputación Biológica, Departamento de Ingeniería Informática, Universidad Autónoma de Madrid(生物神经计算组,信息工程系,马德里自治大学)

AI总结 提出一个独立于分类器、范式和早停策略的评估框架,通过增益和保持度两个指标及可调参数α显式控制速度-准确性权衡,并在P300范式上验证其有效性。

详情
AI中文摘要

脑机接口(BCI)受到脑电图等模态低信噪比的限制,需要多次试验才能可靠解码用户意图。这导致了速度-准确性权衡,即更高的准确性以速度为代价。速度-准确性平衡依赖于应用,因此需要可控的权衡。传统指标(如信息传输率)将速度和准确性合并,模糊了它们的依赖关系并可能引入偏差。在本研究中,我们提出了一个独立于分类器、范式和早停策略的评估框架,将速度和准确性分离。我们采用两个度量:增益(相对速度提升)和保持度(相对准确性保持),并将它们组合成一个由α控制的可调增益-保持平衡,从而调节速度-准确性权衡。该参数无需修改分类器即可调整工作点,便于跨场景部署。该框架在P300事件相关电位范式上进行了评估,使用了63名受试者的公开记录以及多种分类器和早停策略,以实现速度-准确性和比特率的不同工作点。结果表明,调整α可产生快速、准确或平衡的BCI行为,展示了速度-准确性权衡的显式控制。该方法支持受试者级别的性能预测,并提高了BCI行为的可解释性。对信息传输率的进一步分析揭示了其向速度的系统性偏差,该偏差通过所提出的框架中的增益和保持度测量得到解释。总体而言,本工作将速度-准确性权衡确立为可控的设计变量,并在公开的P300范式上进行了验证,从而实现了BCI的透明评估和应用特定优化。

英文摘要

Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.

2606.00105 2026-06-02 cs.CV cs.AI 版本更新

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

视觉噪声引导的上下文蒸馏用于多模态大语言模型遗忘

Junkai Chen, Yuhao He, Junxiang You, Ruiqi Liu, Chenyu Wang, Shu Wu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Advanced Interdisciplinary Sciences, UCAS(北京大学交叉学科研究院)

AI总结 提出视觉噪声引导的上下文蒸馏(VGID)框架,通过双模态干预构建教师分布进行蒸馏,实现多模态大语言模型参数级遗忘,平衡遗忘效果与模型效用。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务上取得了显著进展,但它们也可能记忆和暴露敏感或受限知识,引发隐私和更广泛的安全风险。机器遗忘(MU)提供了一种有前景的方法,可以从训练好的模型中移除目标不良知识,而无需从头重新训练,同时保持通用模型效用。然而,在MLLMs中实现有效遗忘仍然特别具有挑战性。现有的基于训练的方法通常难以平衡遗忘效果和模型效用。相比之下,无训练方法如上下文遗忘通过避免参数更新来保持模型效用,但它们不会在参数级别移除记忆的知识,可能仍然容易受到逆向工程攻击。更重要的是,上下文遗忘在多模态设置中不足,其中视觉输入可以提供强条件信号并诱导不良输出。为了解决这些挑战,我们提出了视觉噪声引导的上下文蒸馏(VGID),一种基于蒸馏的MLLM遗忘框架。VGID通过结合视觉扰动与文本上下文遗忘的双模态干预,从冻结的基础模型动态构建面向遗忘的教师分布。由此产生的干预诱导分布作为蒸馏的教师信号,引导学生模型实现参数级遗忘,而无需外部教师模型或显式的不良响应注释。实验结果表明,VGID在保持竞争性模型效用的同时实现了强遗忘效果,在代表性设置中,遗忘集ROUGE-L降低了0.371,而保留集ROUGE-L仅下降0.055。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.

2606.00104 2026-06-02 cs.RO cs.AI 版本更新

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

PEACE: 一种用于无人机的带约束执行的规划-执行智能体

Erdem Uysal, Timo Kehrer, Sebastiano Panichella

发表机构 * Institute of Computer Science, University of Bern(伯尔尼大学计算机科学研究所) AI4I - The Italian Institute of Artificial Intelligence(意大利人工智能研究所)

AI总结 提出一种基于大语言模型的规划-执行智能体架构,通过解耦高层任务规划与低层控制,并引入约束执行层和有限重规划,实现无人机可解释、可约束的自主飞行。

Comments Accepted to ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

详情
AI中文摘要

基础模型越来越多地被用于驱动自主系统,然而现有方法要么将模型保持在紧密的控制循环中,增加延迟和幻觉风险,要么将自然语言编译成不透明的端到端策略,难以解释、约束,且需要特定领域的数据集和微调。我们提出一种用于基于PX4的无人机的规划-执行智能体,将高层任务规划与低层控制解耦。大语言模型执行单次任务规划,而执行通过结构化的ROS 2工具调用接口(桥接到MAVLink)处理。该系统通过将模块化2D检测器(如YOLO或视觉语言模型)与用于3D物体定位的针孔深度投影模块相结合,构建世界模型。约束执行层强制执行高度限制和水平地理围栏,有限重规划能够从执行时的动作失败中恢复。我们将我们的方法定位在基于基础模型的机器人系统的三种常见设计模式中,并在Gazebo中的PX4软件在环仿真中展示其可行性。结果突出了与紧密耦合的LLM控制相比,改进的可解释性、约束执行和减少的LLM调用。代码、数据集、视频和其他材料可在以下链接找到:https://github.com/erdemuysalx/PEACE

英文摘要

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain, constraint and require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE

2606.00103 2026-06-02 cs.AI 版本更新

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

评估大语言模型中的交互式推理:一个带有可执行游戏的分层基准

Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou

发表机构 * East China Normal University(东华师范大学) Ant Group(蚂蚁集团)

AI总结 提出一个多轮交互式推理评估框架,将推理视为主动证据获取和信念更新,通过474个可执行游戏基准测试大语言模型在成功率、交互效率、上下文鲁棒性和元认知适应方面的表现。

Comments preprint version, under review

详情
AI中文摘要

我们引入了一个用于推理评估的多轮交互式框架,将推理视为主动证据获取和信念更新。其中,LLMs仅接收任务规则,必须向隐藏环境发出有针对性的查询,随时间整合部分观察结果,并决定何时提交最终答案。除了标准的成功率和交互效率,我们还在受控上下文扰动下评估上下文鲁棒性,并通过反事实修正和必要性判断评估元认知适应。我们将该框架实例化为一个包含474个可执行游戏的基准,每个游戏在五个固定配置搜索空间(对应五个难度级别)下进行评估,并评估了一系列前沿LLMs。结果表明,该基准具有高度区分性,不仅在成功率上,而且在交互效率上也暴露了巨大差异。此外,我们实证表明,上下文扰动导致适度但持续的下降,而反事实修正和必要性判断导致更大的下降。

英文摘要

We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.

2606.00102 2026-06-02 cs.AI math.PR 版本更新

On the evolution of the concept of probability as a mirror of the evolution of reason

论概率概念的演化作为理性演化的镜像

Jean-Louis Le Mouël, Vincent Courtillot, Dominique Gibert, Vladimir Kossobokov, Jean-Baptiste Boulé, Pierpaolo Zuddas, Fernando Lopes, Païkan Marccagi, Alexis Maineult

发表机构 * Académie des Sciences, Institut de France, Paris, France(法国科学院,法兰西学院,巴黎,法国) DeepField Sensing, France(法国DeepField Sensing公司) Institute of Earthquake Prediction Theory and Mathematical Geophysics, Russian Academy of Sciences, Moscow, Russia(俄罗斯科学院地震预测理论与数学地球物理研究所,莫斯科,俄罗斯) Accademia Nazionale delle Scienze detta dei XL, Roma, Italia(意大利国家科学院(罗马)) Muséum National d’Histoire Naturelle, CNRS UMR7196, INSERM U1154, Paris, France(自然史博物馆,法国国家科学研究中心UMR7196,法国国家医学研究院U1154,巴黎,法国) Sorbonne Université, CNRS, METIS, UMR7619, Paris, France(索邦大学,法国国家科学研究中心,METIS,UMR7619,巴黎,法国) Laboratoire de Géologie de l’ENS, UMR 8538, Paris, France(巴黎高等师范学院地质实验室,UMR 8538,巴黎,法国)

AI总结 本文从历史与认识论视角,将概率论的发展解读为理性本身的转变,并探讨概率、模糊逻辑与深度学习在科学理性中的角色与局限。

Comments 44 pages, 7 figures

详情
AI中文摘要

几个世纪以来,概率论已从博弈演算发展成为不确定性推理的核心框架。本文将其演化不仅解释为数学史,更视为理性本身的转变。从帕斯卡和费马的组合对称性,到贝叶斯和拉普拉斯的归纳逻辑,从泊松的事件统计到柯尔莫哥洛夫的公理化形式化,概率逐步将不确定性、时间和一致性纳入科学判断。这一轨迹在现代贝叶斯推断中达到成熟的认识论形式,尤其是在Tarantola将概率视为信息逻辑的观点中,先验知识与数据被一致地结合。然而,这一框架也暴露了一个局限:概率量化了关于明确定义命题的不确定性,但本身并未形式化用于描述这些概念的概念模糊性。因此,本文考察理性如何超越概率。模糊逻辑被呈现为一种用于分级意义和定性判断的严谨语言,而深度学习则被分析为一种基于几何插值和优化而非显式推理的独特、强大的预测模式。通过将概率、模糊逻辑和深度学习置于共同的历史和认识论视角,本文阐明了它们的角色与局限。它认为当代科学理性不能仅归结为数据驱动的性能,而需要明确阐述不确定性、模糊性和推理。

英文摘要

Over the centuries, probability theory has grown from the calculus of games of chance into a central framework for reasoning under uncertainty. This article interprets that evolution not merely as a mathematical history, but as a transformation of rationality itself. From Pascal and Fermat's combinatorial symmetry to the inductive logic of Bayes and Laplace, from Poisson's statistics of events to Kolmogorov's axiomatic formalization, probability progressively incorporated uncertainty, time, and coherence into scientific judgment. This trajectory reaches a mature epistemological form in modern Bayesian inference, especially in Tarantola's view of probability as a logic of information, where prior knowledge and data are combined coherently. Yet this framework also exposes a limit: probability quantifies uncertainty about well-defined propositions, but does not by itself formalize the vagueness of the concepts used to describe them. The article therefore examines how rationality extends beyond probability. Fuzzy logic is presented as a rigorous language for graded meaning and qualitative judgment, while deep learning is analyzed as a distinct, powerful mode of prediction based on geometric interpolation and optimization rather than explicit inference. By situating probability, fuzzy logic, and deep learning in a common historical and epistemological perspective, the article clarifies their roles and limits. It argues that contemporary scientific rationality cannot be reduced to data-driven performance alone, but requires the explicit articulation of uncertainty, vagueness, and inference.

2606.00101 2026-06-02 cs.CV cs.AI 版本更新

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

CoCoVideo: 基于商业模型的高质量对比基准用于AI生成视频检测

Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) China Academy of Information and Communications Technology(中国信息通信技术研究院) AI Transcend Pte. Ltd.(AI Transcend有限公司)

AI总结 针对现有数据集依赖低质量开源模型且商业样本带水印的问题,提出包含13个商业生成器的CoCoVideo-26K对比数据集,并设计结合对比学习与置信门控多模态大语言模型的CoCoDetect检测框架,实现高保真AI生成视频的鲁棒检测。

Comments Accepected by CVPR 2026

详情
AI中文摘要

随着人工智能生成内容(AIGC)技术的快速发展,视频伪造日益普遍,给公共讨论和社会安全带来新挑战。尽管现有深度伪造检测方法取得了显著进展,但AIGC伪造检测仍然具有挑战性,因为现有数据集主要依赖开源视频生成模型,其质量远低于商业AIGC系统。即使包含少量商业样本的数据集也常常保留可见水印,损害真实性并阻碍模型泛化到高保真AIGC视频。为解决这些问题,我们引入了CoCoVideo-26K,一个基于对比学习的商业模型AIGC视频数据集,涵盖13个主流商业生成器,并提供语义对齐的真实-伪造视频对。该数据集能够深入探索真实视频与高质量合成视频之间的差异,并为高逼真视频伪造检测建立新基准。基于该数据集,我们提出了CoCoDetect,一个集成对比学习与置信门控多模态大语言模型(MLLM)推理的检测框架。R3D-18骨干网络提取时空表示,而置信门将不确定案例路由到MLLM进行物理合理性和场景一致性的推理。在CoCoVideo-26K和公共基准上的大量实验证明了最先进的性能,验证了该框架的鲁棒性和泛化能力。我们的代码和数据集可在https://github.com/DonoToT/CoCoVideo获取。

英文摘要

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.

2606.00100 2026-06-02 cs.CV cs.AI 版本更新

CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

CoilDrop-MRI:基于线圈丢弃的自监督物理引导MRI重建

Tongxi Song, Ziyu Li, Zihan Li, Wen Zhong, Congyu Liao, Yang Yang, Hua Guo, Wenchuan Wu, Qiyuan Tian

发表机构 * School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University(清华大学生物医学工程系) Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford(牛津大学整合神经影像中心) Department of Radiology & Biomedical Imaging, University of California San Francisco(加州大学旧金山分校放射科与生物医学成像系)

AI总结 提出CoilDrop-MRI方法,通过在线圈维度进行丢弃并作为自监督训练目标,结合图像域和k空间域展开架构,实现无需全采样数据的并行MRI重建,在多站点、多场强、多模态数据集上性能优于现有自监督方法。

详情
AI中文摘要

基于自监督深度学习的方法在加速磁共振成像(MRI)重建中展现出巨大潜力,无需全采样数据即可实现高图像质量。这些方法通常将采集的数据划分为两个不相交的子集,构建输入-目标对以优化重建网络。然而,现有方法仅在空间频率(k空间)域进行划分,未探索线圈维度。为充分利用接收线圈间的信号相关性,我们提出CoilDrop-MRI,该方法对输入应用线圈级丢弃,并将丢弃的数据作为自监督框架中的训练目标。该方法被集成到图像域(SENSE)和k空间(SPIRiT)公式的展开架构中。我们进一步将CoilDrop-MRI扩展到多激发、相位校正的扩散MRI(dMRI)重建,展示了其多功能性。CoilDrop-MRI在多站点、多场强(0.3T、0.55T和3T)和多模态(T1加权、T2加权、T2-FLAIR和dMRI)数据集上进行了广泛验证,始终优于最先进的自监督方法,达到了与监督重建方法相当的质量,且无需全采样参考训练数据。此外,CoilDrop-MRI表现出强大的数据效率和跨成像条件的鲁棒泛化能力,使其成为自监督并行MRI重建的实用且通用的框架。

英文摘要

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO 版本更新

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

弥合2D-3D鸿沟:面向视觉语言导航的分层语义几何地图

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) Bosch Corporate Research(博世企业研究) King Abdullah University of Science and Technology(卡布斯大学)

AI总结 提出分层语义几何地图(HSGM),将3D几何信息转化为VLM可理解的结构化表示,结合VLM高层语义规划与经典路径规划,实现零样本视觉语言导航,在R2R-CE和RxR-CE基准上达到最先进性能。

详情
AI中文摘要

视觉语言导航(VLN)使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型(VLM)取得了进展,但仍存在关键的语义-几何鸿沟:VLM擅长语言和2D视觉理解,但在3D空间推理方面表现不佳,且无法捕捉动作与空间转换之间的因果动态,导致导航不可靠,尤其在零样本设置中。为弥合这一鸿沟,我们提出分层语义几何地图(HSGM),将3D几何信息转化为与VLM兼容的结构化表示,有效将其与物理世界连接。具体而言,HSGM表示为多通道俯视图,组织为三个层次:(1)几何层,记录可导航区域和障碍物;(2)语义层,表示物体及其关系;(3)决策层,支持高层任务推理和目标选择。导航过程中,VLM作为高层语义规划器,解释HSGM编码的空间布局以选择几何有效航点,而航点间的低层无碰撞运动由经典路径规划算法执行,从而将语义推理与动作执行完全解耦。此外,复杂指令被分解为子任务,以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明,我们的零样本框架达到了最先进性能,甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

英文摘要

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.

2606.00092 2026-06-02 cs.CV cs.AI 版本更新

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

对齐细胞层与分类器注意力以实现可解释的弱监督病理定位

Devansh Lalwani, Swapnil Bhat, Maulik Shah

发表机构 * Turocrates AI Private Limited(Turocrates AI私有有限公司)

AI总结 针对弱监督全切片图像分类中注意力图定位不准确的问题,提出结合细胞层与注意力机制的一致性训练方法,在Camelyon16上实现补丁级AUC 0.940,并提升注意力AUC从0.717至0.953。

详情
AI中文摘要

基于基础特征的注意力多实例学习(ABMIL)在Camelyon16切片级别性能上接近饱和,但相应的注意力图作为定位信号并不完美:在临床解释中,一个正确分类但未激活实际病灶的模型难以被信任。我们通过细胞层(cellular sheaves)来解决这一差距,细胞层为图的每个顶点和边赋予有限维向量空间及它们之间一致的线性映射,提供了一种在图结构数据上检测局部不一致性的原则性方法。我们将细胞层应用于全切片图像的弱监督肿瘤定位,结合了细胞层不一致场与ABMIL。自然的训练目标——鼓励相似特征之间的一致性——产生的不一致场追踪的是组织级纹理而非诊断内容。我们提出注意力条件一致性,利用分类器的注意力来定义哪些相邻补丁应该一致。在此目标下联合训练分类器和细胞层,在Camelyon16上产生的不一致场达到补丁级AUC 0.940,并将注意力头从单独ABMIL的0.717提升至0.953。两阶段消融实验(分类器冻结在ABMIL值)仅在不一致场上达到0.727,注意力保持0.717,证实增益来自投影器在两个目标下的共同适应,而非单独的损失变化。训练后的模型无需重新训练即可迁移至Camelyon17的标注切片,保持Delta AUC 0.932 +/- 0.083和注意力AUC 0.955 +/- 0.099。结果是注意力图和细胞层不一致图同时激活相同的诊断区域,为每个切片级预测提供两种互补的解释。

英文摘要

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation features now reaches near-saturation on Camelyon16 slide-level performance, but the corresponding attention maps are an imperfect localization signal: in clinical interpretation, a model that classifies correctly without firing on the actual lesion is hard to trust. We address this gap with cellular sheaves, which equip each vertex and edge of a graph with a finite-dimensional vector space and consistent linear maps between them, providing a principled way to detect local disagreement on graph-structured data. We apply cellular sheaves to weakly-supervised tumour localization on whole-slide images, combining a sheaf disagreement field with ABMIL. The natural training objective, encouraging consistency between similar features, produces a disagreement field that tracks tissue-level texture rather than diagnostic content. We propose attention-conditional consistency, which uses the classifier's attention to define which neighbouring patches should agree. Joint training of the classifier and the sheaf under this objective produces a disagreement field with patch-level AUC 0.940 on Camelyon16 and raises the attention head from its ABMIL-alone level of 0.717 to 0.953. Two-stage ablation with the classifier frozen at its ABMIL values reaches only 0.727 on the disagreement field and leaves attention at 0.717, confirming that the gain comes from the projector co-adapting under both objectives, not from the loss change in isolation. The trained model transfers without retraining to annotated slides from Camelyon17, maintaining Delta AUC 0.932 +/- 0.083 and attention AUC 0.955 +/- 0.099. The result is an attention map and a sheaf-disagreement map that fire on the same diagnostic regions, giving clinicians two complementary explanations for each slide-level prediction.

2606.00091 2026-06-02 cs.CL cs.AI 版本更新

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

DLLM-JEPA:用于掩码扩散语言模型的联合嵌入预测架构

Sangdae Nam

发表机构 * Sangdae Nam

AI总结 提出DLLM-JEPA,通过将联合嵌入预测架构与掩码扩散语言模型结合,消除显式多视图数据和双梯度前向传播需求,在多个任务上提升准确率并降低训练FLOPs。

Comments 17 pages, 4 figures, 13 tables. Accepted at SPIGM Workshop, ICML 2026

详情
AI中文摘要

联合嵌入预测架构(JEPAs)重塑了视觉中的自监督表示学习。最近的LLM-JEPA将JEPA移植到自回归语言模型,但继承了因果注意力机制的两个高昂代价:它需要显式的多视图数据(例如文本-代码对),并且每步需要两次携带梯度的前向传播。我们提出DLLM-JEPA,它将JEPA与掩码扩散语言模型配对,一次性消除这两个代价。扩散模型的双向注意力通过不同的掩码率从同一输入产生两个语义不同的视图——无需显式配对——并支持单次携带梯度的前向传播,相对于LLM-JEPA减少33%的训练FLOPs。DLLM-JEPA在我们评估的每个(任务,架构)组合上优于仅扩散微调:在LLaDA-8B GSM8K上最高提升+18.7个百分点,在Dream-7B GSM8K上提升+11.4个百分点,在Spider、NL-RX-SYNTH和Django上持续获得正向增益。除了准确率,DLLM-JEPA还表现出双赢特性:在LLaDA-8B上使用Wide-t配置时,它同时提高了GSM8K准确率(67.1 vs. 65.2,+1.8个百分点),将保留的Wikitext损失降至预训练基础之下,并在三个微调种子下将MMLU准确率保持在基础水平——而L2到基础参数锚点匹配基线准确率但没有任务增益。逐层探测揭示了机制:一种几何-功能漂移分离,其中微调后的骨干比基线更远离预训练权重,但在保留的Wikitext上遗忘更少,且放大集中在中间Transformer层。该模式也出现在Dream-7B上,表明该现象并非特定于单个骨干网络。

英文摘要

Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.

2606.00090 2026-06-02 cs.RO cs.AI 版本更新

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

物理AI中的静默故障:自主系统运行时动作授权的文献综述

Barak Or

发表机构 * STATE16

AI总结 本文综述了物理AI系统中黑箱模型发出看似合理但实际错误的物理动作导致的静默故障问题,提出了运行时防护栏的分类和评估要求。

Comments 23 pages

详情
AI中文摘要

物理AI系统越来越多地将多模态观测、语言指令和学习的世界表示映射为具有物理后果的动作。机器人基础模型、视觉-语言-动作模型和基于世界模型的自主系统可以决定移动车辆、机器人、无人机和工业机器的决策。这种转变暴露了一个传统AI内容审核或经典机器人安全无法完全捕获的安全问题:黑箱模型可能发出一个物理后果的动作,同时表现出自信、合理和语义对齐。由此产生的故障可能是静默的,源于传感器漂移、遮挡、状态估计误差、分布偏移、幻觉的可供性,或在下游硬件控制器检测到违规之前的无效物理假设。在具身基础模型、世界模型、机器人仿真、具身安全基准、安全控制、运行时保证、不确定性估计、验证和防护栏评估中,模型能力和安全机制沿着大致分离的技术轨道发展。这里综合的一个反复出现的差距是,本综述调查的单一流都没有提供黑箱物理AI模型与物理执行之间的完整运行时授权边界。由此产生的分析发展了一个有界的问题表述、静默物理动作故障的定义、运行时防护栏功能的分类,以及比较防护栏作为物理AI保证机制的评估要求。

英文摘要

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-language-action models, and world-model-based autonomous systems can condition decisions that move vehicles, robots, drones, and industrial machines. This transition exposes a safety problem that is not fully captured by conventional AI content moderation or by classical robot safety alone: a black-box model may issue a physically consequential action while appearing confident, plausible, and semantically aligned. The resulting failure can be silent, arising from sensor drift, occlusion, state-estimation error, distribution shift, hallucinated affordances, or invalid physical assumptions before downstream hardware controllers detect a violation. Across embodied foundation models, world models, robotics simulation, embodied safety benchmarks, safe control, runtime assurance, uncertainty estimation, verification, and guardrail evaluation, model capability and safety mechanisms have advanced along largely separate technical tracks. A recurring gap synthesized here is that no single stream surveyed in this review supplies a complete runtime authorization boundary between black-box Physical AI models and physical execution. The resulting analysis develops a bounded problem formulation, a definition of silent physical-action failure, a taxonomy of runtime guardrail functions, and evaluation requirements for comparing guardrails as Physical AI assurance mechanisms.

2606.00089 2026-06-02 cs.RO cs.AI 版本更新

Can Predicted Dynamics Exist in the Physical World?

物理世界中是否存在可预测的动态?

Barak Or

发表机构 * STATE16 Technion - Israel Institute of Technology(技术Ion - 以色列理工学院) Reichman University(Reichman大学) Google-Reichman AI Tech School(Google-Reichman人工智能技术学院)

AI总结 本文提出物理可接受性作为预测-控制接口,通过运动学、动力学和直接到组合的视界条件评估解码提案的物理可执行性,实验表明该方法能有效识别无效提案并保持任务进度。

Comments 17 pages

详情
AI中文摘要

预测性物理AI系统输出状态展开、动作块和潜在计划,但低均方根误差(RMSE)并不意味特定提案在物理上可执行。我们将物理可接受性定义为预测-控制接口:在执行前,将解码提案视为候选动态,并使用运动学、动力学和直接到组合的视界条件进行评估。通过不是任务成功的证明;拒绝标识指定物理包络的违反,并给出组件级原因。在Hugging Face LeRobot PushT上,受控伪造表明一步预测RMSE和标准化动态残差达到接收者操作特征曲线下面积(AUC)0.982和0.972,仅运动学条件达到AUC 0.592,完整门控达到AUC 0.957并带有条件级归因。在基于重放的干预实验中,基于残差的过滤器和完整物理可接受性门控阻止了87-89%的无效提案,同时保持平均进度接近0.998。

英文摘要

Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imply that a particular proposal is physically executable. We formulate physical admissibility as a prediction-control interface: before execution, a decoded proposal is treated as candidate dynamics and evaluated using kinematic, dynamic, and direct-to-composed horizon conditions. Passing is not a certificate of task success; rejection identifies violation of the specified physical envelope and gives a component-level reason. On Hugging Face LeRobot PushT, controlled falsification shows that one-step prediction-RMSE and standardized dynamics residuals reach area under the receiver operating characteristic curve (AUC) 0.982 and 0.972, kinematic-only conditions reach AUC 0.592, and the full gate reaches AUC 0.957 with condition-level attribution. In replay-based intervention experiments, residual-based filters and the full physical-admissibility gate prevent 87-$89% of invalid proposals while preserving mean progress near 0.998.

2606.00087 2026-06-02 cs.CV cs.AI 版本更新

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

结构化视觉证据分解用于阻塞性睡眠呼吸暂停低通气综合征的证据驱动多模态筛查

Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu

发表机构 * School of Electronic and Electrical Engineering, Shanghai University of Engineering Science(上海工程技术大学电子与电气工程学院) Tencent Youtu Lab(腾讯云视频实验室) ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital of Fudan University(复旦大学耳鼻喉科医院耳鼻喉科研究所) National University of Singapore(新加坡国立大学)

AI总结 提出EviOSAHS框架,通过将面部图像分解为七个解剖查询并生成结构化证据卡,结合临床信息进行高灵敏度OSAHS筛查。

详情
AI中文摘要

有效的阻塞性睡眠呼吸暂停低通气综合征(OSAHS)多导睡眠图前筛查需要结合临床风险因素与可见的颅面和颈部线索。直接提示通用多模态基础模型进行医学是/否决策可能产生不稳定、校准不良的输出。我们提出EviOSAHS,一个证据驱动的多模态推理框架,将仅基于图像的解剖证据获取与最终临床判定分离。每张正面面部图像被分解为七个固定的解剖查询,涵盖颈部、下巴、嘴巴、面/颈脂肪、下颌、中面部和鼻子。视觉响应被转换为结构化证据卡,记录目标解剖结构、可见性、风险方向、证据强度、置信度和简洁摘要。这些卡片仅在最后阶段与清理后的临床档案结合,由大型语言模型进行平衡的二元筛查判定。我们在642名受试者队列上评估了EviOSAHS,将正常受试者映射为筛查阴性,轻度、中度或重度OSAHS受试者映射为筛查阳性。EviOSAHS实现了88.47%的准确率、94.86%的灵敏度、93.74%的F1分数和5.14%的假阴性率,在统一协议下优于仅临床提示、直接多模态提示和朴素两阶段流水线。消融实验表明,七问题视觉分解和平衡最终判定对高灵敏度工作点至关重要。对4,494个视觉输出的问题级审计显示100%的结构化解析率和93.88%的高可见率。EviOSAHS为二元多导睡眠图前OSAHS筛查提供了一个可审计、高灵敏度的工作流程,但应被视为分诊助手而非诊断系统。在临床部署前需要进行前瞻性验证、外部测试和校准的工作点控制。

英文摘要

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.

2606.00084 2026-06-02 cs.IR cs.AI cs.CL cs.LG 版本更新

SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector

SentimentLens: 通过双模态调和酒店业中的情感与评分

Dineth Jayakody, Pasindu Thenahandi, Sampath Jayarathna

发表机构 * University of Peradeniya(珀拉尼亚大学)

AI总结 提出SentimentLens系统,基于方面级情感分析从非结构化酒店评论中提取知识,并通过跨模态调和文本情感与数值评分来识别运营冲突和服务改进机会。

详情
AI中文摘要

在线旅游平台生成大量用户生成的酒店评论,为大规模理解旅行者体验提供了丰富机会。然而,将非结构化文本反馈转化为结构化、可操作的见解仍然是一项具有挑战性的任务。本文提出了SentimentLens,一个基于方面级情感分析的可扩展分析系统,该系统从非结构化酒店评论中执行知识提取,并将其组织成可解释的服务类别。SentimentLens集成了方面术语提取、方面情感分类、语义类别分配和多层次分析模块,以支持区域级、酒店级和类别级评估。该系统设计为在不同地理环境和酒店环境中运行。为了展示其实用性,我们将SentimentLens应用于一个包含超过10,000条公开酒店评论的大型真实数据集。通过广泛分析,该框架揭示了旅行者情感如何随区域、服务类别和酒店类型而变化。我们进一步实现了文本情感与数值评分的跨模态调和,以识别潜在运营冲突、服务质量的结构性不一致性,并使用重要性-绩效和基于熵的分析确定高影响力的改进机会。结果表明,SentimentLens有效地将大规模非结构化评论转化为可操作的情报,支持酒店管理和旅游政策的数据驱动决策。虽然通过一个国家案例研究进行了演示,但所提出的系统可推广到其他目的地和评论驱动的服务领域。

英文摘要

Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experiences at scale. However, transforming unstructured textual feedback into structured, actionable insights remains a challenging task. This paper presents SentimentLens, a scalable analysis system based on Aspect-Based Sentiment Analysis that performs knowledge extraction from unstructured hotel reviews and organizes them into interpretable service categories. SentimentLens integrates aspect term extraction, aspect sentiment classification, semantic category assignment, and multi-level analytical modules to support region-level, hotel-level, and category-level evaluation. The system is designed to operate across different geographic contexts and hospitality settings. To demonstrate its practical utility, we apply SentimentLens to a large real-world dataset of over 10,000 publicly available hotel reviews. Through extensive analysis, the framework reveals how traveler sentiment varies across regions, service categories, and hotel archetypes. We further implement a cross-modal reconciliation of textual sentiment and numerical ratings to identify latent operational conflicts, structural inconsistencies in service quality, and high-impact improvement opportunities using importance--performance and entropy-based analyses. The results show that SentimentLens effectively transforms large-scale unstructured reviews into actionable intelligence, supporting data-driven decision-making for hospitality management and tourism policy. While demonstrated using a national case study, the proposed system is generalizable to other destinations and review-driven service domains.

2606.00083 2026-06-02 cs.LG cs.AI cs.RO 版本更新

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励:VLM奖励模型的测试时提示优化

Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves

发表机构 * University of Amsterdam(阿姆斯特丹大学) Catholic University of Leuven(鲁汶天主大学) Toyota Research Institute(丰田研究院) Toyota Motor Europe(丰田欧洲公司)

AI总结 提出Demo2Reward方法,利用少量专家演示在测试时优化VLM奖励模型的提示指令,减少假阳性并保持真阳性,无需额外训练即可提升下游策略学习。

详情
AI中文摘要

强化学习依赖于准确的奖励函数,但在现实应用(如机器人技术)中,这些函数通常是手工设计的,甚至不可用。最近的研究探索了预训练视觉-语言模型(VLM)作为奖励模型的零样本推理能力。然而,如果没有仔细的提示工程,这些方法往往会产生次优的奖励,其中假阳性预测会严重降低下游策略学习。在机器人技术中,通常收集包含专家演示的有限数据集来引导策略学习。这种场景提供了在策略训练之前优化奖励模型的机会。我们提出Demo2Reward,一种测试时自适应技术,基于少量演示(3-10条轨迹)优化奖励模型的语言指令,以减少假阳性同时保持真阳性。关键是,这在策略学习期间不需要额外的模型训练或计算资源。我们表明,Demo2Reward在一系列模拟机器人任务和策略骨干上始终优于现有的零样本和少样本VLM奖励模型。最后,我们证明Demo2Reward有效迁移到真实世界的机器人学习场景,无需手动设计奖励函数即可实现策略学习。

英文摘要

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior policy training. We propose Demo2Reward a test-time adaptation technique to optimize the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

2606.00082 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Hoeffding Concept Bottleneck Models with Applications to Overhead Images

Hoeffding概念瓶颈模型及其在俯视图像中的应用

Clément Bénard, Manon Arfib, Christophe Labreuche, Victor Quétu

发表机构 * Thales cortAIx-Labs(泰雷兹 cortAIx 实验室) Université Paris-Saclay, CentraleSupélec(巴黎-萨克雷大学,中央理工-巴黎高等学院)

AI总结 针对线性概念瓶颈模型可解释性差和信息泄露问题,提出基于Hoeffding泛函分解的非线性稀疏聚合方法HCBM,并证明其对概念间泄露的鲁棒性,在分类和俯视图像目标检测任务中优于传统线性CBM。

详情
AI中文摘要

深度学习算法的可解释性对于高风险决策的计算机视觉应用至关重要。概念瓶颈模型(CBM)最近在基于高级概念瓶颈的分类问题上展示了提供可解释且准确预测的潜力。现有的CBM方法依赖概念分数的线性聚合来计算预测。然而,这种线性方法通常使用大量概念,这削弱了可解释性并有利于信息泄露。通常,概念与输出logits之间的潜在关系不是线性的。因此,我们引入了Hoeffding概念瓶颈模型(HCBM),该模型基于梯度提升树的Hoeffding泛函分解,提供概念分数的非线性和稀疏聚合,并使用素蕴含生成紧凑预测。HCBM被证明对概念间泄露具有鲁棒性,并在大量实验中优于标准线性CBM。除了分类,HCBM还可以适应目标检测,我们专注于一个具有挑战性的俯视图像案例,以展示HCBM在这些设置中的高性能。

英文摘要

Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate predictions for classification problems, based on a bottleneck of high-level concepts. Existing CBM methods rely on a linear aggregation of the concept scores to compute predictions. However, a large number of concepts is often used in this linear approach, which undermines explainability and favors information leakage. In general, the underlying relation between concepts and output logits is not linear. Therefore, we introduce Hoeffding Concept Bottleneck Models (HCBM), which build on the Hoeffding functional decomposition of gradient-boosted trees to provide non-linear and sparse aggregations of concept scores, and generate compact predictions using prime implicants. HCBM are proved to be robust to interconcept leakage, and outperform standard linear CBM in practice, as shown in extensive experiments. Beyond classification, HCBM can be adapted to object detection, and we focus on a challenging case with overhead images to show the high performance of HCBM in these settings.

2606.00081 2026-06-02 cs.LG cs.AI cs.SD 版本更新

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer: 一种融合统计特征的混合多分支Transformer用于DAS模式识别

Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche

发表机构 * IMT Nord Europe, Institut Mines-Telecom, Univ. Lille, Centre for Digital Systems Lille, France(IMT北欧学院,法国电信研究院,里尔大学,数字系统研究中心,法国) IMT Mines Ales, Institut Mines-Telecom, Ales, France(IMT阿尔勒学院,法国电信研究院,阿尔勒,法国)

AI总结 针对DAS数据高维度和复杂时空模式问题,提出DAStatFormer混合多分支Transformer,通过提取24个ANOVA选择的统计特征并采用门控Transformer网络,在降低数据量级的同时实现高达99.4%的准确率。

详情
AI中文摘要

分布式声学传感(DAS)通过光纤实现大规模监测,但其高维度和复杂的时空模式使得事件分类具有挑战性。现有的深度学习方法——CNN、循环模型和Transformer变体——要么无法捕获长程依赖,要么需要以高昂成本处理原始DAS矩阵。我们提出DAStatFormer,一种混合多分支Transformer,将紧凑的多域统计特征与门控Transformer网络相结合。我们不是使用原始信号,而是从每个通道的时域、波形和频域提取24个ANOVA选择的属性,将数据量减少数个数量级,同时保留判别信息。每个域通过专用的逐步骤和逐通道注意力分支处理,并通过自适应门控机制融合。在开放的$\Phi$-OTDR基准测试和真实场景DAS数据集上的实验表明,DAStatFormer实现了高达99.4%的准确率和接近完美的实际性能,同时使用的参数和推理成本显著低于DASFormer和DeepViT等模型。这些结果证明了其适用于可扩展、实时的DAS监测。我们在https://github.com/MichelD-git/DAStatFormer发布代码。

英文摘要

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches-CNNs, recurrent models, and Transformer variants-either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open $Φ$-OTDR benchmark and a real-scenario DAS dataset show that DAS-tatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at https://github.com/MichelD-git/DAStatFormer

2606.00080 2026-06-02 cs.CV cs.AI cs.LG cs.NE 版本更新

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Planktonzilla: 用于理解浮游生态系统的多模态数据集与模型

Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

发表机构 * Inria Chile Research Center(Inria智利研究中心)

AI总结 为解决浮游生物分类模型泛化性差的问题,提出统一数据集Planktonzilla-17M(含1740万张图像,涵盖602个分类类群),并对比监督学习与CLIP风格训练,发现基于分类谱系的监督学习优于CLIP,且现有生物基础模型在海洋成像领域表现不佳。

详情
AI中文摘要

海洋浮游生物支撑着水生食物网,并在全球二氧化碳封存中发挥关键作用,因此可靠的物种识别对于理解海洋健康和气候反馈至关重要。现有的分类模型在单个数据集上表现良好,但由于训练数据集孤立且标签不一致,无法跨仪器和环境泛化。为解决这一问题,我们引入了Planktonzilla-17M,这是一个统一的数据集,整合了来自13个成像系统的公开浮游生物图像集合。它包含1740万张图像,具有标准化的分类学和地理环境元数据,其中包括374万张浮游生物图像,涵盖602个分类类群,其中201个在物种级别被识别,使其成为迄今为止最大、最全面的浮游生物图像数据集。利用这一大规模数据集,我们在共享ViT骨干网络上进行了监督学习与CLIP风格图像-文本训练的对比实验。我们发现,当使用分类谱系作为文本时,监督分类器的表现与CLIP风格训练相当或更优。我们进一步观察到,BioCLIP和BioCLIP2在零样本和少样本设置下对浮游生物表现不佳。利用Planktonzilla-17M提高了浮游生物分类性能,凸显了当前生物基础模型在海洋成像领域的局限性。

英文摘要

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image--text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.

2606.00079 2026-06-02 cs.LG cs.AI 版本更新

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE: 面向MoE大语言模型量化的频谱能量引导比特分配

Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu

发表机构 * School of Microelectronics, University of Science and Technology of China(中国科学技术大学微电子学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院)

AI总结 提出BitsMoE框架,通过SVD分解和频谱能量引导的混合精度比特分配,解决MoE模型超低位量化中的精度损失问题,在Qwen3-30B-A3B-Base上2比特量化下准确率提升27.83个百分点。

Comments 29 pages, 6 figures, 9 tables. Code and models are available at https://github.com/zjiayu064/BitsMoE

详情
AI中文摘要

混合专家(MoE)大语言模型通过稀疏专家激活减少了每词元的计算量,但由于所有专家权重必须常驻内存,其部署仍然占用大量内存。现有的MoE压缩方法在超低位宽场景下表现不佳:剪枝不可逆地移除模型容量,而粗粒度量化无法根据异构的专家和权重方向重要性分配比特。我们提出BitsMoE,一种面向MoE大语言模型量化的频谱能量引导比特分配框架。BitsMoE通过SVD将每个MoE层分解为共享基和专家特定的频谱因子,保留共享基不进行量化以保持跨专家的共同结构,并使用专家特定因子作为细粒度量化单元。为确定每个单元的比特宽度,BitsMoE将频谱混合精度量化建模为激活感知的重建替代问题,并求解一个整数线性规划,在固定比特预算下最小化估计的重建损失。在多个MoE大语言模型上的实验表明,BitsMoE在超低位宽场景下显著降低了下游任务准确率下降。在Qwen3-30B-A3B-Base上进行2比特量化时,BitsMoE相比GPTQ加速量化12.3倍,平均准确率提升27.83个百分点,解码速度提升1.76倍。我们的模型和代码已在https://github.com/zjiayu064/BitsMoE公开。

英文摘要

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3$\times$, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76$\times$ over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.

2606.00078 2026-06-02 cs.CV cs.AI 版本更新

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

基于流的生成建模优化压缩感知应用中的采样策略

Roman Pavelkin, Luis A. Zavala-Mondragon, Christiaan G. A. Viviers, Fons van der Sommen

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

AI总结 提出一种任务感知的基于流的生成框架,通过训练流模型优化压缩感知中的子采样掩码,显著提升图像分类、重建和MRI加速的性能。

详情
AI中文摘要

信号处理和医学成像中的许多现代应用需要在严格的资源约束下获取高维信号。传统采样理论表明,准确重建信号所需的测量次数与信号的维数成正比,这一要求往往过于昂贵或不切实际。压缩感知通过证明稀疏信号可以在较少的测量下恢复(前提是测量算子满足某些条件)挑战了这一观念。这项概念验证研究提出了一种任务感知的基于流的生成框架——对传统流匹配训练范式的重新表述,其中流模型被训练用于优化压缩感知应用中的子采样。我们建立了所提出的学习子采样掩码框架的基本可行性,该框架显著提升了压缩感知在图像分类、图像重建和MRI加速中的性能。在图像重建任务中,我们的方法展示了最先进的性能,在CelebA数据集上以5%的子采样率实现了25.17 dB的峰值信噪比,在重建8倍加速的MRI测量(fastMRI数据集)时以最小的计算开销达到了29.24 dB。这些结果突显了生成流模型中任务条件化的有效性,并揭示了表示学习策略的一个有前景的方向。总体而言,所提出的框架提供了一种统一、灵活的方法来设计数据和任务驱动的感知方案,有望适用于广泛的逆问题。

英文摘要

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal's ambient dimension, a requirement often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered with fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework -- a reformulation of the conventional Flow Matching training paradigm with a flow model trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework of learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrated state-of-the-art performance, achieving Peak Signal-to-Noise Ratio of 25.17 dB at the subsampling rate of 5\% on the CelebA dataset and 29.24 dB when reconstructing $8\times$ accelerated MRI measurements (fastMRI dataset) with the minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can be potentially adapted to a broad range of inverse problems.

2606.00077 2026-06-02 cs.CV cs.AI 版本更新

Improved Belief-Attention in Vision Task

视觉任务中的改进信念注意力

Guoqiang Zhang

发表机构 * University of Exeter(埃克塞特大学)

AI总结 提出Belief2-Attention,通过同时利用垂直分量和投影分量扩展信念注意力,并引入额外内积矩阵增强标记相关性,提升视觉任务性能。

详情
AI中文摘要

最近,Belief-Attention \cite{Guoqiang25BeliefAttention} 被提出,它首先对基于 softmax 的 $V$ 向量加权求和进行关于原始 $V$ 向量的正交投影,然后将垂直分量作为 Transformer 中的残差信号以提升性能。在本文中,我们首先进行消融研究,表明投影分量也携带关于标记相关性的信息,不应被忽略。然后,我们提出通过同时利用垂直分量和投影分量来扩展 Belief-Attention。具体地,投影分量经过某种激活函数,然后进行线性映射,再与所考虑的标记合并。概念上讲,投影分量的神经块可以视为新注意力块内的两层前馈网络(FFN)。此外,注意到标准注意力通过内积矩阵 $QK^T$ 捕获标记相关性。我们提出向 $QK^T$ 引入额外的内积矩阵 $ZZ^T$ 以捕获更丰富的标记相关性。我们将新模块称为 Belief2-Attention。可以很容易地证明 Belief2-Attention 比标准注意力更具表达能力。然后,我们验证了 Belief2-Attention 在图像分类和分割等视觉任务中的有效性。

英文摘要

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then taking the perpendicular component as the residual signal in Transformer for performance improvement. In this paper, we first conduct an ablation study showing the projected component also carries information about the token correlation, which should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component goes through certain activation function and then a linear mapping before merging with the considered token. Conceptually speaking, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. It is also noted that standard attention captures the token correlation via the inner-product matrix $QK^T$. We propose to introduce an additional inner-product matrix $ZZ^T$ to $QK^T$ to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard Attention. We then verify the effectiveness of Belief2-Attention for vision tasks of image classification and segmentation.

2606.00074 2026-06-02 eess.SP cs.AI cs.LG 版本更新

CLSP-REQA: A Real-Time Quality-Aware Closed-Loop Seizure Prediction Framework with Mamba-BiLSTM and Confidence-Gated Intervention

CLSP-REQA:基于Mamba-BiLSTM和置信门控干预的实时质量感知闭环癫痫发作预测框架

Mufeng Chen, Qi Wu, Bingchao Huang, Xiwen Lai, Zekai Chen, Xinge Ouyang, Quansheng Ren

发表机构 * Department of Engineering Science, University of Oxford(牛津大学工程科学系) Mathematical Institute, University of Oxford(牛津大学数学研究所) School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院航天信息研究所) Department of Mechanical Engineering, The University of British Columbia(不列颠哥伦比亚大学机械工程系) College of Life Sciences, Hunan Normal University(湖南师范大学生命科学学院) School of Electronics, Peking University(北京大学电子学院)

AI总结 提出CLSP-REQA框架,通过嵌入实时EEG质量评估模块和Mamba-BiLSTM骨干网络,结合分层非线性融合函数,在严格跨患者评估下实现优于现有方法的癫痫发作预测性能。

Comments 27 pages, 8 figures, submitted to Biomedical Signal Processing and Control

详情
AI中文摘要

可靠的癫痫发作预测是闭环神经刺激治疗的前提,然而现有方法很少考虑实际部署中EEG信号质量的可变性,并且绝大多数采用非严格的评估协议,高估了泛化性能。我们提出了CLSP-REQA(具有实时EEG质量评估的闭环癫痫发作预测),这是一个统一框架,将轻量级信号质量估计器直接嵌入预测流程中。实时EEG质量评估(REQA)模块与Mamba-BiLSTM骨干网络并行运行,产生一个标量质量分数q ∈ [0,1],通过分层非线性融合函数(ECLO)调节输出置信度。在CHB-MIT头皮EEG数据库(n=23名受试者,198次发作)的严格跨患者评估下,CLSP-REQA实现了0.7426 ± 0.0199的AUC-ROC,优于Jemal等人报告的未适应跨患者基线0.69,仅使用16个EEG通道(先前工作为23个),且无需任何目标患者数据或域适应。在SIENA头皮EEG数据库(n=14名受试者,47次发作)上,CLSP-REQA实现了0.7012 ± 0.0249的AUC,大幅超过同一数据集上最佳域适应跨患者结果0.61,展示了强大的跨数据集泛化能力。该框架输出结构化四元组(p, q, c, Phi_SHAP),可直接与闭环神经刺激器接口兼容。

英文摘要

Reliable seizure prediction is a prerequisite for closed-loop neurostimulation therapy, yet existing methods rarely account for the variability in EEG signal quality encountered in real-world deployment, and the overwhelming majority adopt non-strict evaluation protocols that overestimate generalisation performance. We propose CLSP-REQA (Closed-Loop Seizure Prediction with Real-time EEG Quality Assessment), a unified framework that embeds a lightweight signal quality estimator directly within the prediction pipeline. A Real-time EEG Quality Assessment (REQA) module runs in parallel with a Mamba-BiLSTM backbone, producing a scalar quality score q in [0,1] that modulates output confidence through a tiered non-linear fusion function (ECLO). Under strict cross-patient evaluation on the CHB-MIT Scalp EEG Database (n = 23 subjects, 198 seizures), CLSP-REQA achieves an AUC-ROC of 0.7426 +- 0.0199, outperforming the unadapted cross-patient baseline of 0.69 reported by Jemal et al., using only 16 EEG channels compared to 23 in prior work, and without requiring any target-patient data or domain adaptation. On the SIENA Scalp EEG Database (n = 14 subjects, 47 seizures), CLSP-REQA achieves AUC 0.7012 +- 0.0249, substantially surpassing the best domain-adapted cross-patient result of 0.61 on the same dataset, demonstrating strong cross-dataset generalisation. The framework outputs a structured four-tuple (p, q, c, Phi_SHAP) directly compatible with closed-loop neurostimulator interfaces.

2606.00073 2026-06-02 cs.NE cs.AI cs.LG 版本更新

Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks

罕见事件,真实信号:深度脉冲网络中的功能集合作为计算单元

Aditi Aravind, Konstantinos Ladakis, Mario Alexios Savaglio, Stelios M. Smirnakis, Maria Papadopouli

发表机构 * University of Crete(希腊克里特大学) Foundation of Research & Technology - Hellas(希腊研究与技术基金会) Archimedes Research Unit(阿基米德研究单位) Harvard Medical School(哈佛医学院) Brigham and Women’s Hospital(布莱根妇女医院)

AI总结 通过引入功能连接性分析框架,研究深度脉冲神经网络中功能集合的涌现特性,发现一阶功能连接集合的协同放电可靠预测下游神经元响应,且信息编码集中在罕见但高度协调的活动模式中。

详情
AI中文摘要

我们通过引入一个受神经科学启发的框架,从功能连接性的角度分析深度脉冲神经网络(SNN),研究内部表征如何在层次化处理系统中涌现。借鉴系统神经科学和信息论的概念,我们基于一个神经元与训练好的SNN架构中前一层神经元的统计显著成对相关性,形成该神经元的一阶功能连接(1FC)组。然后,我们在各种条件下的推理过程中跟踪其响应特性。我们的分析表明,先前在生物皮层中观察到的功能连接性的几个原理在脉冲ResNet架构中得以保留。这些1FC集合表现出有趣的特性:它们的聚合协同放电通过一个鲁棒的、类似ReLU的输入输出关系可靠地预测下游神经元响应,其增益随集合大小系统性缩放。仅在高的1FC协同放电事件期间才出现所呈现类别的可靠编码,而这些事件本身发生频率较低,表明信息表征集中在罕见但高度协调的活动模式中。在均匀随机噪声或对抗性扰动下,这些响应轮廓被破坏,尤其是在早期和中间层。这使得能够在特定节点和路径上进行有针对性的高分辨率探查。我们表明,功能连接结构由学习塑造,并且在权重置换下该结构被破坏。这些确立了1FC集合作为输入编码和信息传递的功能上有意义的基质,对设计针对信息流的有针对性的细粒度诊断具有潜在意义。

英文摘要

We investigate how internal representations emerge across hierarchical processing systems by introducing a neuroscience-inspired framework for analyzing deep spiking neural networks (SNN) through the lens of functional connectivity. Drawing on concepts from systems neuroscience and information theory, we form the first-order functionally-connected (1FC) group of a neuron based on its statistically significant pairwise correlations with neurons from the previous layer of a trained SNN architecture. We then track its response properties during inference under various conditions. Our analysis shows that several principles of functional connectivity previously observed in biological cortex are preserved in spiking ResNet architectures. These 1FC ensembles display interesting properties: their aggregate cofiring reliably predicts downstream neuronal responses through a robust, ReLU-like input-output relationship, whose gain scales systematically with ensemble size. Reliable encoding of the presented class emerges only during high 1FC cofiring events, which themselves occur infrequently, indicating that informative representations are concentrated in rare but highly coordinated activity patterns. Under uniform random noise or adversarial perturbations, these response profiles are disrupted, particularly in early and intermediate layers. This enables a targeted high-resolution interrogation at specific nodes and pathways. We showed that the functional connectivity structure is shaped by learning and this structure breaks under weight permutation. These establish 1FC ensembles as a functionally meaningful substrate for input encoding and information transfer, with potential implications in designing targeted fine-grained diagnostics on the information flow.

2606.00065 2026-06-02 cs.IR cond-mat.mtrl-sci cs.AI cs.CL 版本更新

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

超越文本与表格:ComProScanner中视觉-语言模型集成实现从科学图表中高精度提取材料数据

Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge

发表机构 * Energy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK(能源、材料与环境研究中心,伦敦南银行大学) School of Engineering and Design, London South Bank University, London SE1 0AA, UK(工程与设计学院,伦敦南银行大学) Bioscience and Bioengineering Research Centre, London South Bank University, London SE1 0AA, UK(生物科学与生物工程研究中心,伦敦南银行大学) Department of Physics, Kings College London, London WC2R 2LS, UK(物理系,伦敦国王学院)

AI总结 本文通过集成视觉-语言模型扩展ComProScanner框架,实现了从科学图表中自动提取成分-性能数据,在压电陶瓷数据集上达到0.97的组成准确率和归一化F1分数,并引入基于范围的误差阈值评估方法。

Comments 18 pages, 3 figures

详情
AI中文摘要

基于大语言模型流水线的自动提取科学文献中材料成分-性能数据的方法已取得显著进展;然而,现有框架仍局限于文本和表格内容,忽视了仅在科学图表中报告的大量定量性能数据。本文扩展了ComProScanner——一个用于自动构建成分-性能数据库的完全端到端多智能体框架,为其增加了基于原生视觉-语言模型(VLM)的图表提取能力。该扩展引入了一个FigureExtractor工具,用于基于标题关键词对所有支持的出版商进行图表过滤,以及一个GraphExtractorTool智能体,它将提取的图表传递给可配置的VLM,以从科学图表和绘图中恢复成分-性能对。基于LMArena Diagram排行榜和每百万token输入成本低于1.50美元的标准,选择了四个VLM进行评估。在来自已建立的d33测试语料库的50篇压电陶瓷文章上的基准测试表明,Gemini-3-Flash-Preview实现了最高性能,组成准确率为0.97,归一化F1分数为0.97,同时仍然是四个评估模型中成本效益最高的。此外,我们在评估框架中引入了一个基于范围的值误差阈值参数,与精确值匹配相比,提供了对从图表中提取的数值性能数据更具物理意义的评估。这些贡献使集成VLM的ComProScanner成为第一个针对材料科学、完全自动化、多模态的文献挖掘平台,能够在单一统一流水线中从文本、表格和图表中提取结构化的成分-性能数据。

英文摘要

Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of large language model-based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end-to-end multi-agent framework for automated composition-property database construction, with a native vision-language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption-keyword-based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition-property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than \$1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established $d_{33}$ test corpus demonstrates that Gemini-3-Flash-Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost-effective model among the four evaluated. We additionally introduce a range-based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM-integrated ComProScanner as the first materials-specific, fully automated, multimodal literature mining platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.

2606.00056 2026-06-02 cs.CE cs.AI cs.LG physics.app-ph 版本更新

Physics-Informed Neural Networks for Radial Consolidation of Combined Electroosmotic, Vacuum and Surcharge Preloading Considering Smear Effects

考虑涂抹效应的电渗-真空-堆载联合预压径向固结的物理信息神经网络

Dong Li, Yapeng Cao, Shuai Huang, Yujun Cui, Haiping Fu, Lu Yang, He Wei

发表机构 * Department of Civil, Environmental, and Infrastructure Engineering, George Mason University(乔治·马歇尔大学土木、环境与基础设施工程系) State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences(中国科学院寒区工程与冻土科学国家重点实验室,西北生态环境资源研究院) Laboratoire Navier/CERMES, École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris(巴黎理工学院劳纳实验室/塞梅斯实验室,法国国家桥梁与道路学院) College of Water Conservancy and Hydropower Engineering, Hohai University(河海大学水利水电学院) School of Geosciences and Info-physics, Central South University(中南大学地球科学与信息物理学院)

AI总结 提出一种无量纲多域物理信息神经网络框架,通过改进的门控硬约束边界编码模型解决电渗径向固结问题,在时变荷载下实现高精度预测。

详情
AI中文摘要

本研究开发了一个无量纲多域物理信息神经网络(PINN)框架,用于考虑涂抹效应和真空-堆载联合预压的电渗径向固结。研究了三种基于PINN的模型:标准软约束PINN(Std-PINN)、改进的门控PINN(Mod-PINN)以及具有硬约束边界编码的改进门控PINN(Mod-HC-PINN)。这些模型在四种荷载工况下与有限元参考解进行了对比评估,包括恒定真空、指数真空、指数真空加斜坡堆载以及指数真空加循环半正弦堆载。结果表明,Mod-PINN中采用的门控架构提高了恒定真空荷载下阴极和涂抹区界面附近陡峭压力梯度的分辨率。在时变荷载下,软约束的Mod-PINN由于必须同时学习多个竞争目标而精度降低。Mod-HC-PINN通过将阴极边界和初始条件嵌入输出结构,减轻了这一问题,从而降低了优化负担并提高了物理一致性。Mod-HC-PINN在指数真空、斜坡堆载和循环堆载工况下的平均绝对误差(MAE)分别为0.43、0.41和0.27 kPa。敏感性分析进一步表明,所提出的框架在网络架构、配置点密度和渗透率对比的实际范围内保持稳健。

英文摘要

This study develops a dimensionless multi-domain physics-informed neural network (PINN) framework for electro-osmotic radial consolidation considering smear effects and combined vacuum and surcharge loading. Three PINN-based models are investigated: a standard soft-constrained PINN (Std-PINN), a modified gated PINN (Mod-PINN), and a modified gated PINN with hard-constraint boundary encoding (Mod-HC-PINN). The models are evaluated against FEM reference solutions under four loading cases, including constant vacuum, exponential vacuum, exponential vacuum with ramp surcharge, and exponential vacuum with cyclic haversine surcharge. The results indicate that the gated architecture applied in Mod-PINN improves the resolution of steep pressure gradients near the cathode and smear-zone interface under constant vacuum loading. Under time-dependent loading, the soft-constrained Mod-PINN shows reduced accuracy because it must learn multiple competing objectives simultaneously. The Mod-HC-PINN mitigates this issue by embedding the cathode boundary and initial conditions into the output structure, thereby reducing the optimization burden and improving physical consistency. The Mod-HC-PINN achieves MAE values of 0.43, 0.41, and 0.27 kPa for the exponential vacuum, ramp surcharge, and cyclic surcharge cases, respectively. Sensitivity analyses further demonstrate that the proposed framework remains robust across practical ranges of network architecture, collocation density, and permeability contrast.

2606.00054 2026-06-02 cs.RO cs.AI cs.CV 版本更新

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

从人类视频到机器人操作:基于人类中心数据的可扩展视觉-语言-动作学习综述

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University(清华大学) HKUST(香港科技大学) Xi’an Jiaotong University(西安交通大学) Fudan University(复旦大学) Microsoft Research Asia(微软亚洲研究院) Peking University(北京大学) Microsoft Zurich Project(微软苏黎世实验室)

AI总结 本文综述了如何将丰富的人类视频转化为视觉-语言-动作(VLA)模型的有效知识,分类了四种方法(潜在动作表示、预测世界模型、显式2D监督、显式3D重建),并指出了结构化非结构化视频、跨具身和视角的动作映射、以及评估协议设计三大挑战。

Comments Accepted to IJCAI 2026 Survey Track. Project page: https://aaronfengzy.github.io/HumanCentricToVLA-Survey/

详情
AI中文摘要

近期在可泛化具身控制方面的进展由大规模预训练的视觉-语言-动作(VLA)模型驱动。然而,大多数现有方法依赖于大量机器人演示数据,这些数据获取成本高昂且与特定具身紧密耦合。相比之下,人类视频丰富且捕捉了丰富的交互,为真实世界操作提供了多样的语义和物理线索。然而,具身差异以及任务对齐标注的频繁缺失使得它们直接用于VLA模型具有挑战性。本综述提供了一个统一的视角,探讨如何将人类视频转化为VLA模型的有效知识。我们根据所提取的动作相关信息将现有方法分为四类:(i) 编码帧间变化的潜在动作表示;(ii) 预测未来帧的预测世界模型;(iii) 提取图像平面线索的显式2D监督;(iv) 恢复几何或运动的显式3D重建。除分类外,我们强调了该领域的三个关键开放挑战:将非结构化视频结构化为可训练的片段、在具身和视角异质性下将视频导出的监督接地到机器人可执行动作中,以及设计能更好预测真实世界部署性能和迁移效率的评估协议,从而为未来研究方向提供参考。论文和资源的精选列表见 https://github.com/AaronFengZY/HumanCentricToVLA-Survey。

英文摘要

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.

2606.00052 2026-06-02 cs.AI cs.LG 版本更新

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

产品感知深度自编码器用于多产品信息物理系统的鲁棒过程监控

MD Shafikul Islam, Jordan Carden

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对多产品制造中全局模型因决策边界扩大而产生盲点的问题,提出产品感知自编码器,通过限制学习域到产品特定分布来提升异常检测鲁棒性,在扩展田纳西伊士曼过程基准上实现100%攻击检测。

详情
AI中文摘要

随着工业4.0加速信息物理系统在制造业中的集成,鲁棒异常检测对于确保过程安全与安保变得至关重要。当前的数据驱动方法通常采用“产品无关”或全局模型,这些模型在所有正常操作数据的聚合上训练。然而,现代工业设施经常在不同的产品等级下运行。虽然计算简单,但这些全局模型本质上会扩展其决策边界以适应多种模式的方差,从而产生一个“盲点”,其中微妙的异常或针对性的信息物理攻击可能被模型的宽接受区域所掩盖。在这项工作中,我们首先证明了上述漏洞存在于跨多个产品等级运行的全局无关模型中。然后,我们提出了一种产品感知自编码器作为原则性的缓解措施,将学习域限制在等级特定的分布上。虽然这种方法降低了已识别的盲点风险,但我们并不声称它是所有可能替代方案中的最优缓解措施。我们使用扩展的田纳西伊士曼过程基准对这种方法进行了严格的验证,并与全局无关基线进行了比较。我们的实证结果表明,产品感知框架在标准检测指标上与全局基线表现相当,同时提供了对产品等级特定操作模式的改进鲁棒性。最关键的是,模拟我们假设的攻击场景的压力测试显示,虽然全局模型在77.8%的场景中未能检测到操作偏差,但产品感知系统实现了100%的检测准确率。这些发现表明,在柔性制造环境中,广义异常检测器可能带来非平凡的安全风险,促使向模式感知诊断架构的转变。

英文摘要

As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical for ensuring process safety and security. Current data-driven approaches typically employ "product-agnostic" or global models trained on the aggregate of all normal operating data. However, modern industrial facilities frequently operate under diverse product grades. While computationally simple, these global models inherently expand their decision boundaries to accommodate the variance of multiple modes, creating a "blind spot" where subtle anomalies or targeted cyber-physical attacks may be masked by the wide acceptance region of the model. In this work, we first demonstrate that the vulnerability described above is present in global-agnostic models operating across multiple product grades. We then present a Product-Aware Autoencoder as a principled mitigation that restricts the learning domain to grade-specific distributions. While this approach reduces the identified blind-spot risk, we do not claim it as the optimal mitigation among all possible alternatives. We rigorously validate this approach against a Global Agnostic baseline using the Extended Tennessee Eastman Process (TEP) benchmark. Our empirical results indicate that the Product-Aware framework performs comparably to the global baseline on standard detection metrics, while offering improved robustness to product-grade-specific operating modes. Most critically, stress tests simulating our hypothetical attack scenarios reveal that while the global model fails to detect operational deviations in 77.8% of the scenarios, the product-aware system achieves 100% detection accuracy. These findings suggest that, in flexible manufacturing environments, generalized anomaly detectors can pose non-trivial security risks, motivating a shift toward mode-aware diagnostic architectures.

2606.00051 2026-06-02 cs.CY cs.AI 版本更新

Business Utility of Large Language Models as Exploratory Data Analysis Agents

大型语言模型作为探索性数据分析代理的商业实用性

Rafał Łabędzki, Patryk Miziuła, Hubert Rutkowski, Szymon Betlewski, Cezary Depta, Szymon Janowski, Jarosław Kochanowicz, Jan Kanty Milczek

发表机构 * deepsense.ai SGH Warsaw School of Economics(SGH沃兹尼亚克经济学院) Bydgoszcz University of Science and Technology(比得戈茨茨理工大学) Google(谷歌)

AI总结 通过基于代理的供应链模拟基准,评估LLM作为EDA代理在商业环境中的平均性能与可重复性,提出风险调整指标Business utility,发现多数配置不可靠,GPT-5.4表现最佳。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于分析工作流,但它们在商业环境中作为探索性数据分析(EDA)代理的适用性仍不确定。在实践中,一个可部署的EDA代理不仅必须提供有用的平均性能,还必须提供足够的可重复性以支持对其输出的信任。我们在一个受控的、与商业相关的基准上评估了这一要求,该基准基于基于代理的供应链模拟。任务是通过从间接操作痕迹而非显式标签进行推理,识别导致低质量和下游销售损失的供应商-产品组合。来自八个模型家族的十五种模型变体配置在四种实验条件下进行了评估,这些条件改变了数据表示、提示清晰度和信号强度,每种条件有五个轨迹。输出使用Jaccard指数与确定性真实值进行评分,并通过一个框架进行评估,该框架结合了平均得分(ms)、变异系数(CV)、探索性跨条件显著性检验以及商业实用性(Business utility),这是我们提出的一个风险调整指标,用于在单一操作度量中总结质量和可重复性。结果表明,大多数配置对于自主EDA使用来说不够可靠,即使它们的平均得分看起来可以接受。具有超高推理努力的GPT-5.4实现了最强的整体表现,实验平均ms为0.8748,实验平均商业实用性为0.6952,而次优配置在可变性折扣后损失了更多的实用性。我们的发现表明,对EDA代理的评估应将平均质量、可重复性和条件敏感性视为操作可信度的互补维度。

英文摘要

Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents in business settings remains uncertain. In practice, a deployable EDA agent must provide not only useful average performance but also sufficient repeatability to support trust in its outputs. We evaluate this requirement in a controlled, business-relevant benchmark built on an agent-based supply chain simulation. The task is to identify supplier-product combinations responsible for low quality and downstream sales loss by reasoning from indirect operational traces rather than from explicit labels. Fifteen model-variant configurations from eight model families were evaluated under four experimental conditions that varied data representation, prompt clarity, and signal strength, with five trajectories per condition. Outputs were scored against deterministic ground truth using the Jaccard index and assessed through a framework that combines mean score (ms), coefficient of variation (CV), exploratory cross-condition significance tests, and Business utility, a risk-adjusted metric that we propose to summarise quality and repeatability in a single operational measure. The results show that most configurations are not reliable enough for autonomous EDA use, even when their average scores appear acceptable. GPT-5.4 with extra-high reasoning effort achieved the strongest overall profile, with an experiment-averaged ms of 0.8748 and an experiment-averaged Business utility of 0.6952, while the next-best configurations lost substantially more utility after variability discounting. Our findings suggest that evaluation of EDA agents should treat average quality, repeatability, and condition sensitivity as complementary dimensions of operational trustworthiness.

2606.00050 2026-06-02 cs.AI cs.CL cs.DB cs.IR 版本更新

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

Grokers: 基于类型化知识图谱的自底向上归纳理解与写入时智能

Gregory Magarshak

发表机构 * Gregory Magarshak

AI总结 提出Grokers架构,通过自底向上的依赖子图归纳遍历构建持久结构化理解,将智能推至写入时,实现零额外LM成本的查询,并证明字节同一性、累积单调性和双遍历顺序三个形式性质。

Comments 6 pages; second in a series with the Magarshak Machine / SPACER paper and the Context paper

详情
AI中文摘要

我们提出Grokers,一种通过依赖子图的自底向上归纳遍历来构建类型化知识图谱的持久结构化理解的架构。与检索增强生成(RAG)不同,后者在每个查询时支付全部理解成本,Grokers将智能推至写入时:自主的Groker代理分析类型化流图中的节点,通过受控语言模型(LM)调用提取结构化属性,并通过依赖关系归纳组合这种理解,写入丰富的类型化属性,从而以零额外LM成本服务于所有未来查询。我们证明了三个形式性质:(1)字节同一性定理,确立了从事务性维护的反规范化索引组装出的上下文块在语义变化之间的LM轮次中字节相同,使得KV缓存命中率接近100%;(2)累积单调性定理,确立了在受控智慧库增长协议下,无需LM调用即可解决交互的比例随已完成交互数量非递减;(3)双遍历顺序定理,确立了自顶向下生成和自底向上理解分别是它们在依赖DAG上各自任务的唯一正确遍历顺序,且它们的组合闭合为一个完整的生成-理解循环。我们进一步提出了一种基于嵌入的语义搜索的确定性替代方案,采用同义词缓存协议,其LM回退率在有限词汇域中收敛至零。在开源Qbix/Safebox/Safebots栈中提供了参考实现。

英文摘要

We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension cost at every query, Grokers pushes intelligence to write time: autonomous Groker agents analyze nodes in a typed stream graph, extract structured attributes via governed language model (LM) calls, and inductively compose that understanding upward through dependency relations, writing enriched typed attributes that serve all future queries at zero additional LM cost. We prove three formal properties: (1) the Byte-Identity Theorem, establishing that context blocks assembled from a transactionally-maintained denormalization index are byte-identical across LM turns between semantic changes, enabling KV-cache hit rates approaching 100%; (2) the Accumulation Monotonicity Theorem, establishing that the fraction of interactions resolved without LM calls is non-decreasing in the number of completed interactions under a governed wisdom library growth protocol; and (3) the Dual-Traversal Ordering Theorem, establishing that top-down generation and bottom-up comprehension are the unique correct traversal orderings for their respective tasks over a dependency DAG, and that their composition closes into a complete generation-comprehension cycle. We further present a deterministic alternative to embedding-based semantic search, with a synonym caching protocol whose LM fallback rate converges to zero for finite-vocabulary domains. A reference implementation is provided in the open-source Qbix / Safebox / Safebots stack.

2606.00049 2026-06-02 cs.CY cs.AI 版本更新

Measuring and Mitigating Bias in Code Generated by Large Language Models

测量和减轻大型语言模型生成代码中的偏见

Yuxi Chen, Yutian Tang, Timothy Storer

发表机构 * School of Computing Science, University of Glasgow(格拉斯哥大学计算机科学学院)

AI总结 本文针对GPT-4o和Gemini等主流代码生成工具,提出评估框架,使用代码偏见分数和属性变化比率量化偏见,并探索四种轻量级缓解策略。

详情
AI中文摘要

大型语言模型(LLMs)在自然语言生成中的应用广受认可,并越来越多地用于代码生成任务。然而,其生成输出中的偏见问题仍然显著。本文聚焦于GPT-4o和Gemini这两个主流的代码生成工具,提出了一个评估LLM生成代码中偏见的框架,特别考察了受保护属性、提示和网络搜索能力的影响。我们使用两个指标:代码偏见分数(CBS)和属性变化比率(ACR),分别量化偏见的普遍性和不同属性的影响程度。此外,我们研究了四种轻量级缓解策略:少样本、思维链、少样本思维链和多智能体,旨在减轻生成代码中的偏见。我们的研究结果表明,即使在应用缓解策略后,偏见在不同受保护属性和数据集中仍然普遍存在,这凸显了需要更有效的方法来减少AI驱动的代码生成系统中的偏见。

英文摘要

Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code generation tasks. However, concerns about bias in their generated outputs remain significant. This paper focuses on GPT-4o and Gemini, mainstream tools for code generation, and proposes a framework for evaluating bias in LLM-generated code, specifically examining the influence of protected attributes, prompts and web-search capability. We use two metrics: the code bias score (CBS) and the attribute change ratio (ACR), to quantify the prevalence of bias and the degree of influence of different attributes, respectively. In addition, we investigate four lightweight mitigation strategies: Few-Shot, Chain-of-Thought, Few-Shot Chain-of-Thought, and Multi-agent, aimed at mitigating bias in generated code. Our findings reveal that bias remains prevalent across different protected attributes and datasets even after applying mitigation strategies, highlighting the need for more effective approaches to reduce bias in AI-driven code generation systems.

2606.00047 2026-06-02 cs.CY cs.AI 版本更新

Comprehensive AI governance requires addressing non-model gains

全面的人工智能治理需要解决非模型增益问题

Arthur Goemans, Dan Altman, Noemi Dreksler, Jonas Freund, Milan Gandhi, Zhengdong Wang, Sarah Cogan, Sebastien Krier, Demetra Brady, Lewis Ho, Allan Dafoe

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) Open Philanthropy(开放哲学基金会)

AI总结 本文提出非模型增益的概念,包括推理增益、系统增益和资产增益,并论证这些增益会削弱以模型为中心的治理有效性,进而提出超越模型层面的治理方法。

Comments This paper has been accepted to ICML 2026 (Position paper track): https://openreview.net/forum?id=V3O1sHpKxX

详情
AI中文摘要

前沿人工智能治理通常以模型级治理范式为中心,该范式假设模型的能力概况主要取决于训练期间使用的计算和数据。本文认为,当能力进步越来越多地由“非模型增益”——即与基础模型进步无关的改进——驱动时,模型级治理的有效性会降低。我们形式化了非模型增益的概念,并提供了三种不同能力增益向量的分类:推理增益(测试时扩展计算)、系统增益(训练后增强,如脚手架)和资产增益(用受限资产增强模型)。我们展示了这些向量——以及来自具身化、持续学习和人工智能扩散的潜在未来影响——可能会破坏主要依赖于部署前评估和缓解的风险管理策略。我们概述了超越模型层面的治理方法:系统、实体、代理和云治理。最后,我们强调社会韧性作为这些治理层补充的重要性。

英文摘要

Frontier AI governance often centres on the model-level governance paradigm, which assumes that a model's capability profile is primarily a function of the compute and data used during training. This position paper argues that model-level governance becomes less effective when capability progress is increasingly driven by "non-model gains"--improvements that are independent from advances in the base model. We formalise the concept of non-model gains and provide a taxonomy of three distinct vectors of capability gain: inference gain (scaling compute at test-time), systems gain (post-training enhancements such as scaffolds), and asset gain (enhancing a model with restricted assets). We demonstrate how these vectors--alongside potential future impacts from embodiment, continual learning, and AI diffusion--may undermine risk management strategies that hinge mostly on pre-deployment evaluation and mitigation. We provide an overview of governance approaches that go beyond the model level: system, entity, agent, and cloud governance. Finally, we emphasise the importance of societal resilience as a complement to these governance layers.

2606.00046 2026-06-02 cs.MM cs.AI cs.CV cs.CY 版本更新

When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts

当玩笑越界:分析YouTube Shorts中的常规幽默与黑色幽默

Sydney Johns, Sanjeev Parthasarathy, Shantnu Bhalla, Vaibhav Garg

发表机构 * Virginia Polytechnic Institute and State University(弗吉尼亚理工大学)

AI总结 通过构建TwistedHumor数据集(1211个YouTube Shorts及33041条评论的手工标注),结合多视角分析(LLooM概念归纳、评论情感分析、大模型评估),揭示了短格式视频中常规幽默与黑色幽默在主题、观众反应和模型检测上的差异,强调了上下文感知审核的必要性。

详情
AI中文摘要

YouTube等视频平台重塑了用户参与娱乐和信息的方式,强调简短、高参与度的内容,如Shorts。在这个生态系统中,某些内容处于灰色地带:虽然允许存在,但仍可能对部分观众产生意想不到的负面影响。为了研究这一问题,我们引入了TwistedHumor数据集,包含1,211个YouTube Shorts及其配对的33,041条相关评论,并手工标注了幽默存在性、幽默类型、伤害性、主题、修辞手法和单口喜剧背景。除了数据集构建,我们还提出了对短格式社交媒体中幽默与伤害表现的多视角分析。通过使用基于LLooM的概念归纳对视频描述进行分析,我们发现黑色幽默经常围绕批评、应对、尴尬和身份表达等主题聚集,而不是作为一个单一的类别出现。我们进一步通过关联评论分析观众反应,表明常规幽默与更积极的情感相关,而黑色幽默则收到更多混合、中性甚至有时更有毒的反馈。最后,我们评估了大语言模型与人类标注的一致性,发现它们在单口喜剧上的表现优于短笑话。综合来看,这些结果将TwistedHumor不仅定位为一个新的基准,而且是对短格式视频中幽默与伤害灰色地带的实证研究,强调了需要上下文感知的审核和更稳健的多模态评估。

英文摘要

Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging content such as Shorts. Within this ecosystem, certain content occupies a gray area where it remains allowed but may still have unintended negative effects on some audiences. To study this problem, we introduce TwistedHumor, a dataset of 1,211 YouTube Shorts paired with 33,041 related comments, with hand annotations for humor presence, humor type, harm, topic, rhetorical devices, and stand up context. Beyond dataset creation, we present a multi view analysis of how humor and harm appear in short form social media. Using LLooM based concept induction over video descriptions, we find that dark humor frequently clusters around themes of critique, coping, awkwardness, and identity expression rather than appearing as a single uniform category. We further analyze audience response through linked comments and show that regular humor is associated with more positive sentiment, while dark humor receives more mixed, neutral, and sometimes more toxic reactions. Finally, we evaluate large language models against human annotations and find that they perform better on stand up comedy compared to shorter jokes. Together, these results position TwistedHumor not only as a new benchmark, but as an empirical study of the gray area between humor and harm in short form video, highlighting the need for context aware moderation and more robust multimodal evaluation.

2606.00045 2026-06-02 cs.AI cs.ET quant-ph 版本更新

Universal Quantum Transformer

通用量子Transformer

Sungyong Chung, Alireza Talebpour

发表机构 * Grainger College of Engineering, Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校格拉inger工程学院土木与环境工程系)

AI总结 提出通用量子Transformer(UQT),利用多量子比特系统的物理特性作为归纳偏置,通过参数化几何相位嵌入和SU(2)波干涉实现精确的数学和代数推理,在紧凑的5量子比特基板上完美学习循环模算术和非阿贝尔代数,并实现确定性泛化。

详情
AI中文摘要

经典连续空间神经网络从根本上难以锁定精确的数学对称性,如模算术和非交换代数。为了近似这些离散逻辑规则,它们通常依赖大规模参数缩放,导致即使在称为grokking的延迟泛化现象之后仍出现随机不稳定性。在这里,我们引入了通用量子Transformer(UQT),这是一种根本性的新型量子原生计算架构,利用多量子比特系统的物理性质作为精确数学和代数推理的通用归纳偏置。我们的框架并非翻译经典神经机制,而是完全依赖于参数化几何相位嵌入和$SU(2)$波干涉。我们证明了在高度紧凑的5量子比特基板上运行的量子注意力电路能够完美学习两个截然不同的形式类:循环模算术($\mathbb{Z}_{11}$)和非阿贝尔代数($S_4$置换群)。虽然经典注意力网络在收敛时表现出随机不稳定性,但UQT实现了数学上精确的确定性泛化。我们将这种现象称为结晶:超越了众所周知的grokking现象。关键的是,该框架通过理论上绕过经典自注意力的二次瓶颈,并通过对数压缩所需表示维度以消除经典网络固有的过度参数化,从而带来了巨大的计算和内存优势。最后,我们在嘈杂中等规模量子(NISQ)硬件上部署了该架构,证明了其在当前IBM量子计算机上的可行性。这些结果确立了参数化量子拓扑作为精确人工智能的普遍优越物理基底。

英文摘要

Classical continuous-space neural networks fundamentally struggle to lock into exact mathematical symmetries, such as modular arithmetic and non-commutative algebra. To approximate these discrete logical rules, they often rely on massive parameter scaling, resulting in stochastic instability even after delayed generalization phenomena known as grokking. Here, we introduce the Universal Quantum Transformer (UQT), a fundamentally novel, quantum-native computing architecture that uses the physical properties of multi-qubit systems as a universal inductive bias for exact mathematical and algebraic reasoning. Rather than translating classical neural mechanisms, our framework relies entirely on parameterized geometric phase embedding and $SU(2)$ wave-interference. We demonstrate that the quantum attention circuit, operating on a highly compact 5-qubit substrate, perfectly learns two highly distinct formal classes: cyclic modular arithmetic ($\mathbb{Z}_{11}$) and non-Abelian algebra (the $S_4$ permutation group). While classical attention-based networks exhibit stochastic instability at convergence, the UQT achieves mathematically exact, deterministic generalization. We refer to this phenomenon as crystallization: a step beyond the well-known phenomenon of grokking. Crucially, this framework yields massive computational and memory advantages by theoretically bypassing the quadratic bottleneck of classical self-attention, and by logarithmically compressing the required representation dimension to eliminate the massive over-parameterization inherent to classical networks. Finally, we deploy this architecture on noisy intermediate-scale quantum (NISQ) hardware, proving its viability on current IBM Quantum computers. These results establish parameterized quantum topology as a universally superior physical substrate for exact artificial intelligence.

2606.00044 2026-06-02 cs.CY cs.AI 版本更新

Algorithmic Authority and the Clinical Standard of Care

算法权威与临床护理标准

Aizierjiang Aiersilan

发表机构 * The George Washington University(乔治·华盛顿大学)

AI总结 本文探讨人工智能在临床医学中引发的算法概率推理与医生经验直觉之间的张力,提出将AI系统视为事实上的医疗监管,并主张通过辩证的护理标准将AI-医生联合体作为单一诊断责任实体。

详情
AI中文摘要

人工智能融入临床医学在算法概率推理与专家医生的经验直觉之间产生了根本张力;运用Lawrence Lessig的“代码即法律”框架,我认为临床AI系统的架构已经起到事实上的医疗监管作用,重塑了责任和护理标准。将AI“幻觉”重新定义为结构上类似于记录充分的人类认知失败(如确认偏误和过早诊断闭合),我表明这两种失败模式需要统一的治理响应。因此,我提出一种辩证的护理标准,将集成的AI-医生联合体视为单一负责任的诊断实体,要求在强大的数据治理和患者隐私框架内综合算法精确性与人类解释权威。

英文摘要

The integration of artificial intelligence into clinical medicine creates a fundamental tension between algorithmic probabilistic reasoning and the experiential intuition of expert physicians; applying Lawrence Lessig's \enquote{Code is Law} framework, I argue that the architecture of clinical AI systems already functions as de facto medical regulation, reshaping liability and the standard of care. Reframing AI \enquote{hallucination} as structurally analogous to well-documented human cognitive failures such as confirmation bias and premature diagnostic closure, I show that both failure modes demand a unified governance response. I therefore propose a dialectical standard of care that treats the integrated AI-physician dyad as the singular responsible diagnostic entity, mandating the synthesis of algorithmic precision with human interpretive authority within robust data governance and patient privacy frameworks.

2606.00041 2026-06-02 cs.CY cs.AI 版本更新

Improving Hospital Process Management through Process Mining: A Case Study on COVID-19 Clinical Pathways

通过过程挖掘改进医院流程管理:COVID-19临床路径案例研究

Pasquale Ardimento, Mario Luca Bernardi, Marta Cimitile, Samuele Latorre

发表机构 * University of Bari Aldo Moro(巴里大学Aldo Moro分校) Unisannio University of Benevento(贝内文托大学Unisannio分校) UnitelmaSapienza Rome(罗马Sapienza大学Unitelma分校)

AI总结 本研究利用COVID数据共享学习数据集,构建透明可复现的管道将临床数据转化为事件日志,通过过程发现、声明性合规检查和结果分析,揭示COVID-19护理路径中的监测主干、急诊与入院接口的变异性以及年龄和重症监护暴露导致的结果差异,支持分诊标准化、容量规划和降级协调。

详情
AI中文摘要

本研究使用COVID数据共享学习数据集分析COVID-19护理路径。我们构建了一个透明、可复现的管道,将异质性临床表格转化为适合过程挖掘的事件日志,并应用过程发现、声明性合规检查和结果分析。重建的路径突出了住院护理的监测主干、急诊-入院接口的变异性以及由年龄和重症监护暴露驱动的结果差异。这些见解支持分诊标准化、容量规划和从重症监护病房到低 acuity 病房的降级协调,展示了过程挖掘如何为循证医院治理提供信息。

英文摘要

This study analyzes COVID-19 care pathways using the COVID Data for Shared Learning dataset. We build a transparent, reproducible pipeline that transforms heterogeneous clinical tables into a process-mining-ready event log and applies discovery, declarative conformance checking, and outcome analysis. The reconstructed pathways highlight the monitoring backbone of inpatient care, variability at the Emergency department-admission interface, and outcome differences driven by age and exposure to intensive care units. These insights support triage standardization, capacity planning, and step-down coordination from intensive care units to lower-acuity wards, showing how process mining can inform evidence-based hospital governance.

2606.00040 2026-06-02 cs.CY cs.AI 版本更新

Tracing GenAI Literacy: Uncovering Student-AI Interaction Patterns in Academic Writing through Epistemic Network Analysis

追踪GenAI素养:通过认知网络分析揭示学术写作中的学生-AI交互模式

Angxuan Chen, Jiyou Jia

发表机构 * Department of Educational Technology, Graduate School of Education, Peking University(教育技术系,教育研究生院,北京大学)

AI总结 本研究利用学习分析和认知网络分析,通过分析162名学生在GenAI辅助摘要写作任务中的交互日志,揭示了高素养学生采用迭代优化和策略性提问,而低素养学生依赖直接生成命令的不同交互模式。

详情
AI中文摘要

随着生成式人工智能(GenAI)成为教育不可或缺的一部分,培养GenAI素养至关重要。然而,当前的评估主要依赖于自我报告量表,缺乏对素养在实际学习过程中如何体现的洞察。本研究利用学习分析(LA)来弥合这一差距。我们收集了162名大学生在GenAI辅助摘要写作任务中的交互日志。使用认知网络分析(ENA),我们建模并比较了不同GenAI素养水平学生的提问策略。初步结果揭示了不同的交互特征:高素养学生进行迭代优化和策略性提问,而低素养学生依赖直接生成命令。本研究通过展示过程数据如何表征GenAI素养,为数据驱动的素养评估和实时干预铺平道路,从而为研讨会做出贡献。

英文摘要

As Generative AI (GenAI) becomes integral to education, fostering GenAI literacy is critical. However, current assessments largely rely on self-reported scales, lacking insights into how literacy manifests in actual learning processes. This study leverages Learning Analytics (LA) to bridge this gap. We collected interaction logs from 162 university students engaged in a GenAI-assisted abstract writing task. Using Epistemic Network Analysis (ENA), we modeled and compared the questioning strategies of students with varying GenAI literacy levels. Preliminary results reveal distinct interaction signatures: high-literacy students engage in iterative refinement and strategic questioning, while low-literacy students rely on direct generation commands. This work contributes to the workshop by demonstrating how process data can characterize GenAI literacy, paving the way for data-driven literacy assessment and real-time interventions.

2606.00039 2026-06-02 cs.CY cs.AI cs.HC 版本更新

Beyond Categories of Caste: Examining Caste Bias and Morality in Text-to-Image AI Models

超越种姓类别:审视文本到图像AI模型中的种姓偏见与道德

Divyanshu Kumar Singh, Dipto Das, Deepika Rama Subramanian, Koustuv Saha, Stephen Voida, Bryan Semaan

发表机构 * University of Colorado Boulder(科罗拉多大学波得尔分校) University of Toronto(多伦多大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过算法审计与批判性话语分析,揭示文本到图像模型如何超越上下种姓二元对立而延续种姓偏见,并提出反种姓方法应对AI系统中的公平问题。

详情
AI中文摘要

文本到图像(T2I)模型在各个领域展现出有前景的实用性。然而,这类模型也在其输出中放大了有害的社会偏见。在南亚背景下,近期研究表明种姓偏见和刻板印象正通过生成式AI(GenAI)系统得以延续。尽管这些研究提供了关于GenAI系统如何使种姓歧视的隐形叙事显性化的极其相关的见解,但它们往往将种姓视为一个身份类别。因此,在本工作中,我们转变本体论,聚焦于种姓的关系性方面。这使我们能够更细致地理解T2I模型产生和延续种姓歧视的机制。通过将算法审计与批判性话语分析相结合,我们借鉴挑战婆罗门规范性的概念框架,展示种姓偏见如何超越上下种姓类别的简单二元对立而得以延续。我们的贡献有两方面。除了挑战将种姓视为类别的范畴化理解,我们还提出了一种反种姓方法,以应对AI系统中种姓偏见和公平性的问题。

英文摘要

Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal biases in their outputs. In the context of South Asia, recent work has shown caste biases and stereotypes are being perpetuated through Generative AI (GenAI) systems. While this research offers extremely relevant insight into invisibilized narratives of caste discrimination through the GenAI system, they often treat caste as an identity category. Therefore, in this work we shift our ontology to focus on the relational aspect of caste. This enables us to develop a more nuanced understanding of the mechanics of caste discrimination by and through T2I models. Combining an algorithmic audit with critical discourse analysis, we draw on a conceptual frame challenging Brahminical Normativity to show how caste biases are perpetuated beyond the simple binaries of upper vs lower-caste categories. Our contributions are two-fold. Beyond challenging the categorical understanding of caste as a category, we propose an anti-caste approach to tackle the issue of caste bias and fairness in AI systems.

2606.00037 2026-06-02 cs.CY cs.AI cs.HC 版本更新

Update Opacity: Epistemic Accessibility and Governance Under AI System Change

更新不透明性:AI系统变更下的认知可及性与治理

Andrea Ferrario, Joshua Hatherley

发表机构 * Institute of Biomedical Ethics and History of Medicine, University of Zürich(伦理与医学史研究所,苏黎世大学) SUPSI, Dalle Molle Institute for Artificial Intelligence (IDSIA)(SUPSI,达勒莫利人工智能研究所) ETH Zürich(苏黎世联邦理工学院) Center for Philosophy of AI, University of Copenhagen(人工智能哲学中心,哥本哈根大学)

AI总结 针对AI系统更新导致用户难以理解输出变化的问题,提出结合欧盟AI法案和机器学习运营的治理框架,通过可信度画像和阈值披露实现更新透明化。

详情
AI中文摘要

嵌入部署AI系统中的机器学习模型会定期更新以维持正常功能。然而,此类更新可能产生更新不透明性:用户可能无法理解为何相同输入现在产生不同输出。我们认为,更新不透明性最好被理解为认知可及性的历时性失败:问题在于,在真实角色和时间特定约束下,物质上相关的变更可能无法以支持理解、校准依赖和适当行动的形式保持对用户可及。这使得更新不透明性成为一个治理问题。并非所有变更都同等相关,披露每一次更新本身会因信息过载而损害使用。为解决此问题,我们结合两种互补的治理方法:欧盟AI法案(有助于规范系统层面规范性相关变更的边界)和机器学习运营(提供跟踪和比较随时间变化的操作工具)。在此基础上,我们提出一个框架,通过可信度画像和可信度级别对系统变更建模,并使用基于阈值的披露,随时间向不同利益相关者揭示包络内物质相关变更。我们通过一个医疗AI示例说明该方法,并得出对生命周期文档、上市后监测和更新披露的实际意义。

英文摘要

Machine learning models embedded in deployed AI systems are routinely updated to maintain correct functioning over time. Yet such updates can generate update opacity: users may not be able to understand why the same input now yields a different output. We argue that update opacity is best understood as a diachronic failure of epistemic accessibility: the problem is that materially relevant changes may fail to remain accessible to human users in forms that support understanding, calibrated reliance, and appropriate action under real role- and time-specific constraints. This makes update opacity a governance problem. Not all change is equally relevant, and disclosing every update would itself undermine use through overload. To address this problem, we combine two complementary governance approaches: the EU AI Act, which helps specify the system-level perimeter of normatively relevant change, and Machine Learning Operations, which provides operational tools for tracking and comparing change over time. On this basis, we propose a framework that models system change through trustworthiness profiles and trustworthiness levels, and uses threshold-based disclosure to surface materially relevant within-envelope change to different stakeholders over time. We illustrate the approach with a medical AI example and derive practical implications for lifecycle documentation, post-market monitoring, and update disclosure.

2606.00033 2026-06-02 cs.CY cs.AI 版本更新

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing

使机制可解释性可审计:呼吁通过持续协作评审制定指南

Michael Lan, Narmeen Fatimah Oozeer, Chaithanya Bandi, Philip Quirke, Austin Meek, Fazl Barez, Amirali Abdullah

发表机构 * University of Delaware(德克萨斯大学) University of Oxford(牛津大学) ThoughtWorks

AI总结 针对机制可解释性(MI)实验缺乏标准化审计系统的问题,提出通过持续协作评审平台、专家验证指南和基于来源的审计系统来建立可审计框架,以提升其在AI安全等高风险领域的可信度。

Comments Accepted at ACL 2026 main conference

详情
AI中文摘要

尽管机制可解释性(MI)对神经网络内部机制产生了重要见解,但该领域尚未建立标准化的实验审计系统。因此,其许多发现在医疗AI和自主系统等安全关键应用中仍未得到充分利用,因为利益相关者无法验证其有效性。近期工作具体证明了这一点:两篇论文对同一行为得出了矛盾的结论,第三项研究揭示两者部分正确但因方法不一致而无法比较。缺乏标准化审计时,这种模糊性阻碍了需要强正确性保证的高风险场景中的采用。我们呼吁MI社区致力于开发一种新颖的评审系统,通过以下方式补充同行评审:(1)由协作评审平台支持的持续评审,在该平台上组织和讨论论文之外适合的元科学结果和讨论(如批评、负面结果、事后扩展、复现、复制和部分结果),允许随时进行评论和修订;(2)将该平台上发现的良好实践推广为专家验证的指南和协议,以提高审计效率;(3)基于来源的审计系统,追踪声明所依赖的论点。这篇立场论文鼓励对这样一个框架的必要性、设计和实施进行建设性辩论,并提供早期具体示例以帮助催化这些对话。总体而言,我们提出审计MI本身对于其在AI安全、行业和治理中的应用至关重要。

英文摘要

While mechanistic interpretability (MI) has produced important insights into neural network internals, the field has yet to establish a standardized system to audit experiments. As such, many of its findings remain underutilized in safety-critical applications such as medical AI and autonomous systems, as stakeholders cannot certify their validity. Recent work demonstrates this concretely: two papers found conflicting conclusions for the same behavior, and a third study revealed that both were partially correct but incomparable due to methodological inconsistencies. Without standardized auditing, such ambiguities hinder adoption in high-stakes contexts requiring strong correctness guarantees. We call for the MI community to work towards developing a novel reviewing system that complements peer review via: (1) Continuous reviewing supported by a \emph{Collaborative Reviewing Platform} where meta-science results and discussions (such as critiques, negative results, post-hoc extensions, reproductions, replications, and partial results) that fit outside of papers are organized and discussed, allowing for comments and revisions to be made at any time (2) Generalizing good practices found on this platform into expert-verified guidelines and protocols to improve auditing efficiency, and (3) Source-based auditing systems that track arguments which claims depend on. This position paper encourages constructive debate over the necessity, design and implementation of such a framework, providing early concrete examples to help catalyze these dialogues. Overall, we propose that auditing MI itself is essential for its application in AI safety, industry, and governance.

2606.00031 2026-06-02 cs.CL cs.AI 版本更新

LLMs for Cardiovascular Risk Prediction from Structured Clinical Data

基于结构化临床数据的LLMs心血管风险预测

Jeba Maliha, Md Rafiul Kabir

发表机构 * Central Michigan University(中央密歇根大学)

AI总结 提出混合框架将结构化临床数据转换为自然语言表示,利用LLMs进行冠心病预测,并验证了高保真度与隐私保护优势。

Comments International Conference on Intelligent Systems, Blockchain, and Communication Technologies

详情
AI中文摘要

冠状动脉疾病(CAD)仍然是全球主要死因之一,凸显了对可靠预测系统的需求以支持早期诊断和风险评估。虽然传统机器学习模型在结构化临床数据上表现良好,但大型语言模型(LLMs)为解释自然语言表达的医疗信息提供了新的可能性。在这项工作中,我们开发了一个混合框架,桥接了结构化临床数据和自然语言表示用于CAD预测。使用包含1190名患者记录和11个临床属性的公开数据集,结构化变量被转换为可解释的特征表示,并通过LLMs生成合成临床叙述。验证流程进行临床变量的反向提取,并计算与原始记录的一致性分数,平均保真度达到94.61%。然后,我们评估了四种传统机器学习模型,并在零样本和少样本提示设置下与基于LLM的分类进行比较。我们使用了两个LLM:GPT和Gemini。实验结果表明,随机森林达到了最高准确率。尽管有这一优势,基于LLM的分类在真实临床环境中仍然有益。这是因为LLMs直接操作于自然语言的患者描述,意味着敏感的数值患者数据(如精确的实验室值、血压读数和诊断代码)得以保密。研究结果表明,将结构化临床数据与LLM生成的叙述相结合,可以为混合临床预测系统开辟新方向。

英文摘要

Coronary artery disease (CAD) remains one of the leading causes of death globally, highlighting the need for reliable predictive systems to support early diagnosis and risk assessment. While traditional machine learning models perform well on structured clinical data, large language models (LLMs) present new possibilities to interpret medical information expressed in natural language. In this work, we develop a hybrid framework that bridges structured clinical data and natural-language representations for CAD prediction. Using a publicly available dataset of 1,190 patient records with 11 clinical attributes, structured variables are converted into interpretable feature representations and synthetic clinical narratives using LLMs. A validation pipeline performs reverse extraction of clinical variables and computes a consistency score with the original records, achieving an average fidelity of 94.61%. We then evaluate four conventional machine learning models and compare their performance with LLM-based classification under zero-shot and few-shot prompting settings. We use two LLMs here, GPT and Gemini. Experimental results show that Random Forest achieves the highest accuracy. Despite this advantage, LLM-based classification remains beneficial in real-world clinical settings. This is because LLMs operate directly on natural language patient descriptions, meaning that sensitive numerical patient data such as exact lab values, blood pressure readings, and diagnostic codes are kept private. Findings suggest that combining structured clinical data with LLM-generated narratives can enable new directions for hybrid clinical prediction systems.

2606.00029 2026-06-02 cs.CL cs.AI 版本更新

TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation

TCAR-Gen: 基于证据融合的时间图检索用于知识增强生成

Sidra Nasir, Muhammad Noman Zahid, Rizwan Ahmed Khan

发表机构 * Dipartimento di Informatica, Università di Verona(威尼斯大学计算机科学系) School of Advanced Studies, University of Camerino(坎皮诺大学高级研究学院) Department of Computer Science, School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan(卡拉奇工商管理学院(IBA)数学与计算机科学学院计算机科学系)

AI总结 提出TCAR-Gen框架,结合查询条件图神经网络、时间证据融合和树链推理,在历史犯罪叙事问答中实现时间推理和多源证据融合,优于现有RAG方法。

详情
AI中文摘要

检索增强生成系统在回答历史犯罪案件叙述的复杂问题时,在时间推理和证据融合方面存在困难。现有方法要么独立于查询语义进行检索,要么无法连贯地整合多个证据来源。我们提出时间上下文增强检索生成(TCAR-Gen),一个结合查询条件图神经网络、时间证据融合和树链推理的框架,以将答案生成基于检索到的证据。在维多利亚犯罪日记基准上,TCAR-Gen在Recall@5上达到0.3738,在包括多跳推理和反事实问题在内的七种查询类型上优于Vanilla RAG、Temporal RAG、GraphRAG-C和GraphRAG-T。消融研究表明,上下文图、时间惩罚机制和查询条件是关键组件。跨五个语言模型(GPT-OSS 20B到TinyLlama 1.1B)的评估表明,TCAR-Gen在较小模型规模下保持稳健的检索覆盖,但生成质量随模型容量减少而显著下降。我们的工作表明,显式时间建模和多分支证据融合对于基于知识语料库的忠实、推理密集型问答至关重要。

英文摘要

Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language model (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.

2606.00027 2026-06-02 cs.CL cs.AI 版本更新

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

医疗大语言模型安全性、鲁棒性和公平性评估的多领域红队框架

Andrei Marian Feier, Veysel Kocaman, Yigit Gul, Ahmet Korkmaz, Alexander Thomas, Aleksei Zakharov, Jay Gil, Mehmet Butgul, David Talby

发表机构 * John Snow Labs Inc.(约翰·索克斯实验室公司)

AI总结 提出一个多领域红队框架,通过690个临床场景评估11个当代大语言模型,发现平均准确率掩盖了临床意义上的风险,性能方差和最坏情况失败比平均准确率更能反映可靠性,混合评估方法对可信安全评估至关重要。

Comments 10 pages, 4 figures. To be presented at the Text2Story 2026 Workshop (Delft, The Netherlands, 29 March 2026); CEUR Workshop Proceedings (forthcoming). Affiliation: John Snow Labs Inc

详情
AI中文摘要

大语言模型(LLM)在医疗领域的部署日益增多,但现有基准未能捕捉临床实践中常见的对抗性或伦理复杂条件下的模型行为。我们开发了一个多领域红队框架,评估了11个当代LLM在690个临床场景中的表现,这些场景涵盖9个领域和150多个子类别。场景包含对抗性变换,响应使用七维度评分标准进行评估,包括LLM辅助评分和人在环验证。结果揭示了显著的性能差异,平均得分范围从0.791到0.984。关键的是,几个高性能系统在个别安全关键场景中完全失败,表明平均准确率掩盖了临床意义上的风险。最高性能系统(X-BAI、GPT-5、Claude Opus 4.1)得分超过0.97且方差较低,而不同领域间性能差异显著。公平性相关任务在人口统计修改后错误率放大10-20%,人工评审员识别出自动评估遗漏的临床相关失败。我们的发现表明,性能方差和最坏情况失败比平均准确率更能提供临床意义的可靠性指标,而结合自动化与临床监督的混合评估方法对于可信安全评估至关重要。

英文摘要

Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories. Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation. Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.

2606.00023 2026-06-02 cs.CL cs.AI cs.LG 版本更新

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

TrustLDM:语言扩散模型的可信度基准测试

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

发表机构 * State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(中国科学院自动化研究所,智能科学与技术学院,北京大学) CISPA Helmholtz Center for Information Security(信息安全研究所) School of EECS, Peking University(电子工程学院,北京大学) Institute for Artificial Intelligence, Peking University(人工智能研究所,北京大学)

AI总结 针对语言扩散模型(LDM)的可信度问题,提出TrustLDM基准,评估其在不同架构和恶意上下文下的安全性、隐私性和公平性,并开发自动评估框架TrustLDM-Auto以识别脆弱配置。

详情
AI中文摘要

语言扩散模型(LDM)的快速发展挑战了自回归模型在语言处理中的主导地位。然而,其灵活、任意顺序的解码策略不仅实现了快速解码速度,还可能带来新的可信度挑战。为了更好地理解其流程背后的风险,我们引入了一个针对LDM的全面可信度基准(TrustLDM),评估不同LDM架构在多种静态后上下文类别下的安全性、隐私性和公平性。我们的实证结果表明,尽管LDM在仅使用用户提示时通常表现出较强的可信度,但当恶意后上下文附加到掩码响应时,其对齐行为明显下降。我们进一步观察到,较长的上下文不一定产生更强的影响,解码顺序和生成长度都会影响评估结果。最后,我们提出了TrustLDM-Auto,一个利用LDM解码灵活性自动识别脆弱配置的评估框架,揭示了所有评估模型和维度上的显著可信度弱点。我们的工作可能有助于社区构建更可信的LDM。我们的代码可在https://github.com/PKU-ML/TrustLDM获取。

英文摘要

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.

2606.00022 2026-06-02 cs.CL cs.AI 版本更新

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

lmfaoooo at SemEval-2026 Task 1: 幽默是受众。面向约束幽默生成的偏好建模

Alexey Tikhonov, Alexey Ivanov

发表机构 * Inworld.AI Berlin, Germany(Inworld.AI柏林,德国) OpenAI Mountain View, CA(山景城,加利福尼亚州)

AI总结 针对约束幽默生成任务,提出“生成多候选-偏好选择”策略,利用人类成对比较训练偏好模型,在MWAHAHA任务英、中、西语子任务中分别获得第1、第1和第2名。

Comments 5 pages. Accepted for SEMEVAL 2026

详情
AI中文摘要

幽默生成仍然困难,不仅因为生成流畅、新颖的笑话很难,而且因为“有趣”取决于受众,监督信号嘈杂——偏好随受众、语境和文化而变化,标注者一致性通常较低。在本文中,我们描述了用于SemEval-2026 Task-1(MWAHAHA)的系统,该任务专注于在显式约束下进行幽默生成。任务通过1对1竞技场式比较中的人类偏好判断来评估提交的系统。我们采用“生成多个->选择最佳”策略。首先,我们通过多步提示、模型集成和多样性导向解码为每个实例生成多样化的候选池。其次,我们使用偏好模型选择输出,该模型通过从人类比较中学习(而非绝对趣味性分数)来近似“读者”。为支持该方法,我们发布了通过幽默竞技场原型收集的2.5K人类成对判断。我们进一步提出一个可解释的流程,将标注的比较转换为偏好模型。在三个偏好数据集上,我们的模型一致优于基线,并表现出更强的跨领域迁移。最后,我们将学到的偏好模型应用于MWAHAHA设置中的候选排序,并发布中间产物(候选池和排序)以促进后续工作。我们的系统在MWAHAHA的英语和汉语子任务中排名第一,在西班牙语子任务中排名第二。

英文摘要

Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many -> select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.

2606.00021 2026-06-02 cs.CL cs.AI cs.LG 版本更新

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE: 基于软门控评估的语义嵌入导航用于检索式推测解码

Shaowen Chen, Zhicheng Liao, Hongwei Wang

发表机构 * Zhejiang University, Hangzhou, China(浙江大学,杭州,中国)

AI总结 提出SENSE方法,通过语义嵌入导航和软门控评估模块替代表面形式匹配,提升检索式推测解码的鲁棒性和加速效果,在LLaMA和Qwen系列上实现最高4.09平均接受长度和3.26倍加速。

详情
AI中文摘要

推测解码(SD)通过使用轻量级草稿模型提出候选令牌,并由目标模型并行验证,从而加速大型语言模型(LLM)推理,同时不损害生成质量。尽管检索式推测解码(RSD)因其即插即用的多功能性而受到青睐,但其潜力受到刚性词汇依赖的阻碍,使得检索和验证对表面形式变化敏感。为了解决这个问题,我们提出了SENSE(基于软门控评估的语义嵌入导航)。通过将检索锚定在目标模型的隐藏状态上,SENSE建立了稳健的语义对齐,这使得软门控评估模块能够验证语义等价性而非表面形式。为了确保严格的基准测试,我们将现有方法解构为统一框架内的原子原语,促进细粒度的组件级比较。跨多个领域的广泛实验表明,SENSE在LLaMA和Qwen系列上优于多个基线,实现了高达4.09的平均接受长度和3.26倍的加速,同时保持了生成质量。我们的代码将在发表后发布。

英文摘要

Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation quality. While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms. To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison. Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining up to 4.09 mean acceptance length and 3.26x speedup, while preserving generation quality. Our code will be released upon publication.

2606.00020 2026-06-02 cs.CL cs.AI 版本更新

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

CSRP:基于效率感知奖励的强化学习链式推理中文文本纠错

Wei Tian, Yuhao Zhou, Man Lan

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(东华大学教育人工智能研究所)

AI总结 提出CSRP三阶段框架,通过连续预训练、链式推理监督微调和基于效率感知奖励的组相对策略优化,在NACGEC基准上实现最优性能,有效缓解过度纠正偏差。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main conference)

详情
AI中文摘要

基于大语言模型的中文语法纠错系统面临两个关键挑战:通用模型缺乏针对细微语法区别的专业语言先验,以及使用最大似然估计的监督微调无法优化以精度为中心的指标,导致系统性过度纠正。我们提出CSRP,一个三阶段框架,通过以下步骤逐步构建纠错能力:在590万平衡样本上进行连续预训练以内化领域知识,使用显式错误推理进行链式推理监督微调以实现诊断透明度,以及采用新颖的效率感知奖励进行组相对策略优化,明确惩罚不必要的编辑。在NACGEC基准上,CSRP以50.99的$F_{0.5}$和57.17的精确率实现了最先进性能,大幅优于先前最佳结果,同时有效缓解了MLE训练模型固有的过度纠正偏差。我们的方法还将CSCD拼写纠错提升至59.61的F1,超过GPT-4 5.20分。全面的消融研究表明,RL对齐阶段相比SFT基线贡献了8%的相对增益,且该增益与大规模CPT的贡献正交,验证了针对编辑效率的显式优化对于高质量语法纠错至关重要。我们的代码可在https://github.com/TW-NLP/ChineseErrorCorrector获取。

英文摘要

Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes a 8\% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.

2606.00019 2026-06-02 cs.HC cs.AI 版本更新

Understanding Stigmatizing Language in Clinical Documentation: A Paired Comparison of Ambient AI Drafts and Clinician Finalized Notes

理解临床文档中的污名化语言:环境AI草稿与临床医生最终笔记的配对比较

Yiliang Zhou, Yawen Guo, Sairam Sutari, Jasmine Dhillon, Alexandra L. Beck, Emilie Chow, Steven Tam, Danielle Perret, Deepti Pandita, Gelareh Sadigh, Archana J. McEligot, Kai Zheng

发表机构 * Department of Informatics, University of California, Irvine(加州大学欧文分校信息学系) Institute for Clinical and Translational Science, University of California, Irvine(加州大学欧文分校临床与转化科学研究所) Department of Medicine, University of California, Irvine(加州大学欧文分校医学系) Department of Physical Medicine & Rehabilitation, University of California, Irvine(加州大学欧文分校物理医学与康复科) Department of Radiological Sciences, University of California, Irvine(加州大学欧文分校放射科学系) Department of Public Health, California State University Fullerton(加州州立大学富尔顿分校公共卫生系)

AI总结 通过配对比较环境AI生成的草稿与临床医生最终笔记,量化编辑前后污名化语言的变化,发现临床医生编辑可能成为污名化语言进入电子健康记录的净来源。

详情
AI中文摘要

环境人工智能(AI)文档工具越来越多地被用于减轻临床医生的文档负担,但它们对临床笔记中偏见语言的影响尚不清楚。我们对AI草稿和相应的临床医生最终笔记进行了大规模比较分析,以量化编辑前后污名化语言的变化。使用基于词典的自然语言处理(NLP)流程,我们测量了(1)AI草稿中污名化语言的普遍性,(2)最终笔记中的普遍性和术语组成,以及(3)污名化术语的移除或引入频率。在66,297对笔记部分中,21.4%的AI草稿部分包含至少一个污名化语言提及,而在临床医生最终版本中这一比例上升至24.0%。引入比移除更频繁,表明在使用环境AI时,临床医生编辑可能是污名化语言进入电子健康记录的净来源。

英文摘要

Ambient artificial intelligence (AI) documentation tools are increasingly deployed to reduce clinician documentation burden, but their implications for biased language in clinical notes remain unclear. We conducted a large-scale comparison analysis of AI drafts and corresponding clinician finalized notes to quantify stigmatizing language changes pre- and post-editing. Using a lexicon-based natural language processing (NLP) pipeline, we measured (1) the prevalence of stigmatizing language in AI drafts, (2) the prevalence and term composition in final notes, and (3) the frequency of removal or introduction of stigmatizing terms. Across 66,297 paired note sections, 21.4% of AI draft sections contained at least one stigmatizing language mention, rising to 24.0% in clinician finalized versions. Introductions occurred more often than removals, suggesting clinician editing can be a net source of stigmatizing language entering the EHR with using Ambient AI.

2606.00018 2026-06-02 cs.HC cs.AI 版本更新

Examine Clinicians' Modification of Hedging Language in Ambient AI Documentation: A Comparative Study of AI Drafts and Final Notes

检验临床医生对环境AI文档中模糊语言的修改:AI草稿与最终笔记的比较研究

Yiliang Zhou, Yawen Guo, Di Hu, Sairam Sutari, Emilie Chow, Steven Tam, Danielle Perret, Deepti Pandita, Kai Zheng

发表机构 * Department of Informatics, University of California, Irvine(加州大学 Irvine 分校 信息学院) Institute for Clinical and Translational Science, University of California, Irvine(加州大学 Irvine 分校 临床与转化科学研究所) Department of Medicine, University of California, Irvine(加州大学 Irvine 分校 医学院) Department of Physical Medicine & Rehabilitation, University of California, Irvine(加州大学 Irvine 分校 康复医学系)

AI总结 通过配对分析环境AI生成的临床笔记草稿与医生修改后的最终版本,研究医生编辑如何改变模糊语言的使用频率、方向性以及不同AI供应商和临床专科之间的差异。

详情
AI中文摘要

环境AI文档系统生成临床笔记草稿,医生在签署进入电子健康记录前经常修改,但这些编辑如何改变模糊语言尚不清楚。我们对医生编辑过的环境AI草稿和最终笔记部分进行了配对分析,以检验:(1)这些编辑是否改变了模糊语言的出现频率,(2)这些编辑是否表现出向更大确定性或不确定性的系统性转变,以及(3)模糊语言频率和方向性的变化是否因环境AI供应商和临床专科而异。在62,811对笔记部分中,模糊术语更常被引入先前非模糊文本,而非从先前模糊文本中移除,且编辑后文本比编辑前文本包含更多模糊提及。方向性分析显示,在模糊相关的替换编辑中,总体显著倾向于更大的不确定性。供应商和专科分析揭示了模糊频率、编辑前后模糊提及变化以及方向性的显著异质性。

英文摘要

Ambient AI documentation systems generate clinical note drafts that clinicians frequently revise before signing off into electronic health records, yet how these edits alter hedging language remains unclear. We conducted paired analysis of clinician-edited portions of ambient AI drafts and final notes to examine (1) whether these edits change the prevalence of hedging language, (2) whether these edits exhibit a systematic shift toward greater certainty or uncertainty, and (3) whether these changes in hedging prevalence and directionality differ by ambient AI vendors and clinical specialties. Among 62,811 paired note sections, hedging terms were more often introduced into previously non-hedged text than removed from previously hedged text, and post-edit text contained more hedging mentions than pre-edit text. Directionality analyses showed a significant overall tendency toward greater uncertainty in hedging-related replacement edits. Vendor and specialty analyses revealed substantial heterogeneity in hedging prevalence, pre-to-post changes in hedging mentions, and directionality.

2606.00017 2026-06-02 cs.AI cs.CL cs.MA 版本更新

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

MindGames Arena 泛化赛道:具有延迟每步奖励归因的 In2AI 解决方案

Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov

发表机构 * iMak AI Lab(iMak人工智能实验室)

AI总结 提出延迟每步奖励归因方法,结合资格门控、异步rollout生成和课程对手采样,实现多智能体环境中稳定高效的强化学习训练,并在NeurIPS 2025的MindGames Arena基准测试中取得领先。

Comments 18 pages, 2 figures, 9 tables. Technical report. First place in both Open and Efficient tracks of MindGames Arena Generalization Track at NeurIPS 2025

详情
AI中文摘要

训练用于多智能体战略交互的语言模型智能体面临一个核心困难:任何行动的质量可能取决于从未实现的未来事件、违反游戏规则的移动或其他玩家的决策。标准强化学习假设每一步都可以分配奖励,但在结果跨时间和智能体纠缠的环境中,这一假设不成立。我们引入了具有资格门控的延迟每步奖励归因,这是一个仅在回合结束时计算奖励、根据任务特定语义将其传播回原始步骤,并排除缺乏有效依赖信息的步骤的回合生命周期和后处理流程。结合通过vLLM的连续批处理实现的异步rollout生成、基于课程的对手采样和多层分层批次构建,该方法能够在多智能体环境中实现稳定、样本高效的强化学习训练。我们在NeurIPS 2025的MindGames Arena基准测试中进行了评估,使用我们的方法训练的单个80亿参数开源模型在头对头对战中匹配或超越了包括GPT-5在内的显著更大的专有系统,并在开放(无限制)和高效(<=80亿参数)赛道中均获得第一名。

英文摘要

Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.

2606.00016 2026-06-02 cs.CL cs.AI 版本更新

AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

AEyeDE:一种基于注意力的人工智能生成文本检测归因框架

Aria Nourbakhsh, Adelaide Danilov, Christoph Schommer, Salima Lamsiyah

发表机构 * University of Luxembourg(卢森堡大学) Department of Computer Science(计算机科学系)

AI总结 提出AEyeDE框架,利用代理Transformer模型的注意力归因矩阵,通过轻量级CNN区分人类与AI生成文本,在多种设置下优于文本基线,并揭示注意力图的局部结构差异。

Comments 24 pages, 2 figures

详情
AI中文摘要

随着现代语言模型达到接近人类水平的流畅度,并能规避依赖表面统计或似然信号的检测器,检测AI生成文本变得越来越具有挑战性。我们提出 extsc{AEyeDE},一种归因驱动的人机作者身份检测方法,利用模型注意力作为判别信号。具体来说,我们使用具有白盒访问权限的\emph{代理}Transformer模型提取人类和AI生成文本的基于注意力的归因矩阵,并训练轻量级卷积神经网络从这些归因图中学习表示。在编码器-解码器翻译设置中,我们的方法始终优于纯文本基线。在仅解码器设置中,它在生成器特定检测中表现强劲,在标准基准上保持竞争力,并在跨数据集迁移和替代拼写扰动下表现出鲁棒性。我们进一步表明,注意力图表现出重复的局部结构,其相对频率在不同数据集和代理模型下的人类与AI生成文本之间一致不同。这些发现表明,基于注意力的归因图为AI生成文本检测提供了互补且可解释的信号。我们将公开代码以支持未来研究。

英文摘要

Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. We propose \textsc{AEyeDE}, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a \emph{proxy} Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research.

2606.00015 2026-06-02 cs.HC cs.AI cs.CY cs.ET 版本更新

SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant

SortingHat: 用定制的数字教学助手重新定义操作系统教育

Yifan Zhang, Xinkui Zhao, Zuxin Wang, Zhengyi Zhou, Guanjie Chen, Shuiguang Deng, Jianwei Yin

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院)

AI总结 针对操作系统课程教学挑战,提出结合检索增强生成和多智能体强化学习的3D数字人教学助手SortingHat,提供个性化指导、自适应内容生成和自动评估。

详情
Journal ref
WWW '25: Companion Proceedings of the ACM on Web Conference 2025,Pages 2951 - 2954
AI中文摘要

操作系统课程是计算机科学教育中最具挑战性的课程之一,原因在于其内部结构的复杂性和运行环境的多样性。传统的教学方法往往无法应对学生多样化的背景、学习速度和实际需求。为了应对这些挑战,我们提出了SortingHat,一个专为操作系统教育定制的个性化数字教学助手。SortingHat集成了先进的人工智能技术,包括检索增强生成框架和多智能体强化学习,以提供自适应、可扩展且有效的教育支持。SortingHat采用由大型语言模型驱动的3D数字人界面,提供个性化、富有同理心和上下文感知的指导。它根据每个学生的学习历史和学业表现生成定制的练习,强化薄弱环节并挑战高级概念。此外,该系统包含一个强大的评估流程,确保对学生提交的内容进行公平、一致和无偏见的评分,同时提供个性化的、可操作的改进反馈。通过结合个性化指导、自适应内容创建和自动评估,SortingHat将操作系统教育转变为一种引人入胜、沉浸式且可扩展的体验。

英文摘要

Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval augmented generation (RAG) framework and multi agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context aware guidance. It generates tailored exercises based on each student's learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.

2606.00014 2026-06-02 cs.CL cs.AI 版本更新

Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval

面向鲁棒的上下文学习:利用分布外代理进行目标不可访问的演示检索

Hao Xu, Rite Bo, Fausto Giunchiglia, Yingji Li, Rui Song

发表机构 * College of Computer Science and Technology, Jilin University, China(吉林大学计算机科学与技术学院) Department of Information Engineering and Computer Science, University of Trento, Italy(特伦托大学信息工程与计算机科学系)

AI总结 提出DOPA框架,通过引入分布外代理近似不可访问的目标域并利用马氏距离全局多样性约束,提升大语言模型在分布偏移下的鲁棒性。

Comments Accepted by ACL 2026 main

详情
AI中文摘要

尽管研究表明大语言模型(LLMs)在分布外(OOD)任务上表现良好,但随着分布偏移加剧,其优势趋于减弱。因此,研究人员旨在从可用源域中检索分布相似且信息丰富的演示来增强LLMs的推理能力。然而,在目标域不可访问的实际场景中,评估未知分布具有挑战性,这间接影响所选演示的质量。为解决此问题,我们提出 extbf{DOPA},一种演示搜索框架,它引入OOD代理来近似不可访问的目标域并指导检索过程。基于代理评估,DOPA进一步引入基于马氏距离的全局多样性约束,确保检索到的演示具有足够的多样性。在多个LLMs和任务上的实验结果表明,DOPA有效增强了OOD设置下的鲁棒性 ootnote{https://github.com/bort64/ood\_code}。

英文摘要

Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, evaluating the unknown distribution is challenging, which indirectly impacts the quality of the selected demonstrations. To address this problem, we propose \textbf{DOPA}, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy-based evaluation, DOPA further introduces a Mahalanobis distance-based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings\footnote{https://github.com/bort64/ood\_code}.

2606.00013 2026-06-02 cs.CY cs.AI cs.HC 版本更新

A phenomenon of AI-conformity: how algorithms change human moral decision-making

AI从众现象:算法如何改变人类道德决策

Yana Venerina, Dmitry Koch, Nare Meloyan, Gerda Prutko, Valeriia Lelik, Victoria Taova, Andrey Kurpatov

发表机构 * Neuroscience Laboratory, Sberbank, Moscow, Russia(神经科学实验室,俄罗斯储蓄银行,莫斯科)

AI总结 本研究通过改编经典Asch范式,发现具有推理能力的AI模型对人类道德判断的影响程度与人类多数相当,表明道德决策也可能受到算法从众的影响。

Comments 31 pages, 1 figure

详情
AI中文摘要

社会从众是一种有充分记录的现象,即个体将其观点转向社会多数的观点。随着人工智能(AI)日益融入日常生活,它也可能创造一种新的影响源,引发算法从众,其机制尚不清楚。本研究考察了AI判断是否影响人类的道德决策(n=165),改编了经典的Asch范式。参与者在三种不同条件下完成一系列道德困境:存在社会多数时、AI模型提供简短答案时、以及AI模型同时提供答案和解释时。在所有条件下,呈现的回应都违背了普遍接受的道德规范。结果表明,具有推理成分的AI模型对参与者意见的影响程度与人类多数相当。这些发现表明,即使是道德判断,尽管其敏感性和个人重要性,也可能容易受到算法从众的影响。然而,算法从众的机制似乎与社会从众不同。总体而言,该研究挑战了道德决策处于“AI禁区”——即被认为只有人类决策才可接受的领域——的假设,并强调了随着基于AI的建议日益融入人类决策,需要进一步研究这一现象。

英文摘要

Social conformity is a well-documented phenomenon in which individuals shift their opinions towards those of a social majority. As artificial intelligence (AI) becomes increasingly integrated into everyday life it may also create a novel source of influence giving rise to algorithmic conformity, mechanisms of which are poorly understood. The present study examined whether AI judgements affect moral decision-making in humans (n=165) adapting the classical Asch paradigm. Participants completed a series of moral dilemmas under three different conditions: in presence of social majority, with an AI model providing brief answers and with an AI model providing both answers and explanations of its choices. In all conditions the presented responses contradicted generally accepted moral norms. The results indicated that an AI model with a reasoning component affected the opinion of participants to a degree comparable to that of a human majority. These findings suggest that even moral judgements, despite their sensitivity and personal significance, may be susceptible to algorithmic conformity. However, the mechanism underlying algorithmic conformity appears to differ from the social one. Overall, the study challenges the assumption that moral decision-making lies in "AI inadmissibility zone" - a sphere that is considered as an area in which only human-made decisions are acceptable and highlights the need for a further investigation of this phenomenon as AI-based recommendations become increasingly embedded into human decision-making.

2606.00011 2026-06-02 cs.HC cs.AI cs.LG 版本更新

RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview

RuleEdit: 失败引导的人机模型编辑与前瞻性影响预览

Min Hun Lee, Justin Yu Feng Teo

发表机构 * Singapore Management University(新加坡国立大学)

AI总结 提出RuleEdit系统,通过规则表的不匹配信号检测失败并预览模型编辑的影响,在卒中康复评估中显著提升人机协同性能。

详情
AI中文摘要

尽管AI有望协助复杂决策,但从业者仍然缺乏在提交模型编辑之前检测可能失败和检查后果的方法。我们提出RuleEdit,一个交互式、规则引导的人机模型编辑系统,它(i)通过规则表可解释的不匹配信号揭示可能的失败,并(ii)支持用户编写的规则反馈,提供预期性能变化和嵌入偏移的前瞻性预览。我们在卒中康复评估中实例化RuleEdit,并与卫生专业人员和学生一起评估。规则引导的失败检测将人+AI性能显著提高了14.16%(p<0.001),同时改善了对错误AI的拒绝,减少了过度依赖和不足依赖以及ChangedToWrong决策。此外,呈现前瞻性嵌入预览改善了参与者对模型适应的反馈,在纳入用户基于规则的反馈后,将更新后的局部性能增益从11.50%提高到36.38%(p<0.001)。我们的发现表明,基于不匹配的失败线索和前瞻性影响预览可以支持失败感知的人机模型编辑,同时也揭示了局部-全局权衡:有助于特定案例的编辑在全局转移时可能会降低性能。我们讨论了设计失败感知和可控人机系统的意义。

英文摘要

Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences of model edits before committing them. We present RuleEdit, an interactive, rule-guided human-AI model editing system that (i) surfaces likely failures through interpretable mismatch signals from rule tables and (ii) supports user-authored rule feedback with prospective previews of projected performance changes and embedding shifts. We instantiate RuleEdit in stroke rehabilitation assessment and evaluate it with health professionals and students. Rule-guided failure detection significantly increased Human + AI performance by 14.16\% ($p<0.001$) while improving rejection of incorrect AI and reducing both over- and under- reliance as well as ChangedToWrong decisions. In addition, presenting prospective embedding previews improved participants' feedback for model adaptation, increasing post-update local performance gains from 11.50\% to 36.38\% after incorporating users' rule-based feedback ($p<0.001$). Our findings show that mismatch-based failure cues and prospective impact previews can support failure-aware human-AI model editing, while also revealing a local-global tradeoff: edits that help a specific case can degrade performance when transferred globally. We discuss implications of designing failure-aware and controllable human-AI systems.

2606.00010 2026-06-02 cs.HC cs.AI cs.CY 版本更新

Empathic and agentic artificial intelligence in nursing: perspectives on a human-centered framework for cancer care navigation in the United States

护理中的共情与自主人工智能:美国癌症护理导航中以人为本框架的视角

Tyra Girdwood, Saba Kheirinejad, Parnian Kheirkhah Rahimabad, Brianna M. White, Robert L Davis, David L Schwartz, Arash Shaban-Nejad

发表机构 * University of Tennessee Health Science Center, College of Nursing(田纳西大学健康科学中心护理学院) University of Tennessee Health Science Center, Center for Biomedical Informatics, Department of Pediatrics(田纳西大学健康科学中心生物医学信息中心,儿科系) University of Tennessee Health Science Center, Department of Radiation Oncology(田纳西大学健康科学中心放射肿瘤学系)

AI总结 本文提出一个以人为本的人工智能框架,结合共情与自主方法,基于美国护士协会伦理准则,支持护士在癌症护理导航中增强而非取代人类共情与自主性,改善工作流程、医患关系和护理协调。

Comments 5 Pages, 1 Figure, 1 Table

详情
Journal ref
ESMO Real World Data and Digital Oncology, 2026, Vol 12, 100694
AI中文摘要

对于癌症患者,护士导航可以通过加强健康服务协调和患者结果来减轻复杂护理的负担。然而,在资源不足的地区,训练有素的护士导航员可能有限或不存在。在美国,人工智能驱动的数字健康工具日益可用,可能有助于解决护理协调中的差距;然而,大多数并非专门设计用于支持护理。这篇观点文章讨论了一个以人为本的人工智能框架,该框架整合了基于美国护士协会伦理准则的共情和自主方法,以支持美国护士在癌症护理导航中的工作。该框架可以增强而非取代人类的共情和自主性,同时改善护士工作流程、患者-临床医生关系以及资源不足地区的护理协调服务。

英文摘要

For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and patient outcomes. However, in under-resourced areas, trained nurse navigators may be limited or non-existent. In the United States, artificial intelligence (AI)-enabled digital health tools are increasingly available and may help address gaps in care coordination; however, most are not designed to specifically support nursing. This perspective piece discusses a human-centered AI framework that integrates empathic and agentic approaches grounded in the American Nurses Association's code of ethics to support nurses in the United States in cancer care navigation. The framework could augment, not replace, human empathy and agency while improving nurse workflow, patient-clinician relationships, and care coordination services in under-resourced areas.

2606.00009 2026-06-02 cs.AI 版本更新

Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts

基于最优传输的排列不变贝叶斯优化在海上风电场布局中的应用

Antonio Candelieri, Laurens Bliek

发表机构 * Department of Economics Management and Statistics(经济学管理与统计系) Department of Industrial Engineering and Innovation Sciences(工业工程与创新科学系) Eindhoven AI Systems Institute(埃因霍温人工智能系统研究所)

AI总结 针对海上风电场布局优化问题,提出一种基于最优传输理论的排列不变贝叶斯优化方法PIBO,在保持性能的同时将计算时间减半。

详情
AI中文摘要

贝叶斯优化(BO)被广泛且成功地用于解决具有昂贵评估、黑箱和非凸目标函数的优化问题。然而,标准BO算法无法利用目标问题可能具有的对称性。一个直观的例子是最优位置问题,其决策变量指连续空间中的一组有限点,点的顺序不影响目标函数的值。我们将这种设置称为布局优化,以区别于点云优化(其中点的顺序重要)。作为布局优化的实例,我们考虑一个实际工业相关应用,即海上风电场布局优化:给定相同的风力涡轮机,交换任意一对涡轮机对年发电量没有影响。基于最优传输理论,我们提出了一种排列不变BO方法,即PIBO,证明与标准BO方法相比,它能提供更好的风电场布局,同时将计算时间大致减半。

英文摘要

Bayesian Optimization (BO) is widely and successfully adopted for solving optimization problems having an expensive-to-evaluate, black-box, and non-convex objective function. However, the vanilla BO algorithm is not able to exploit possible symmetries characterizing the target problem. An intuitive case is given by optimal location problems, whose decision variables refer to a finite set of points within a continuous space, with the order of points not affecting the value of the objective function. We refer to this setting as optimization over layouts to distinguish from optimization over point-clouds where, instead, the order of points counts. As an instance of optimization over layouts we consider a real-life industrial-relevant application, that is the optimization of the layout of an offshore wind farm: given identical wind turbines, switching any pair of them has not any effect on the annual energy production. Based on Optimal Transport theory, we propose a Permutation-Invariant BO approach, namely PIBO, proved to provide better wind farm layouts when compared to the vanilla BO approach while cutting computation time roughly in half.

2606.00008 2026-06-02 cs.AI 版本更新

Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

树上的智能体:多目标分子优化的路径协调

Jia Zhang, Tengfei Ma, Tianle Li, Daojian Zeng, Xieping Gao, Xiangxiang Zeng

发表机构 * College of Information and Science(信息科学学院) Hunan Normal University(湖南师范大学) College of Computer Science(计算机科学学院) Electronic Engineering, Hunan University(电子工程学院,湖南大学)

AI总结 提出ATOM多智能体框架,将分子优化建模为树结构搜索,通过路径协调实现多目标权衡,在活性、可合成性和ADMET相关性质上优于基线。

Comments 17 pages, 6 figures

详情
AI中文摘要

多目标分子优化需要在冲突目标下搜索广阔的化学空间,其中早期设计决策强烈约束后续结果。现有方法通常依赖单一策略或固定标量化,限制了其表示多样权衡和探索多个有前途设计轨迹的能力。我们提出ATOM,一个将分子优化建模为树结构搜索的多智能体框架。每个节点对应一个原子操作,并托管一个专门针对特定目标或决策上下文的智能体。智能体沿树的不同路径进行协调,而不是强制执行全局共识,使该方法能够维护和比较替代的分子进化轨迹。过去优化行为的全局记忆进一步支持跨目标的平衡探索与利用。这种树结构交互使得能够推理分子设计中固有的长程依赖关系。在涉及活性、可合成性和ADMET相关性质的具有挑战性的多目标基准测试中,实验表明ATOM在帕累托覆盖率和超体积上始终优于强基线。这些结果证明了路径多智能体协调在分子优化中的有效性。代码可在https://anonymous.4open.science/r/ATOM-41CE获取。

英文摘要

Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions strongly constrain downstream outcomes. Existing methods typically rely on a single policy or fixed scalarization, which limits their ability to represent diverse trade-offs and to explore multiple promising design trajectories. We propose ATOM, a multi-agent framework that formulates molecular optimization as a tree-structured search. Each node corresponds to an atomic operation and hosts an agent specialized for a particular objective or decision context. Agents coordinate along different paths of the tree rather than enforcing a global consensus, enabling the method to maintain and compare alternative molecular evolution trajectories. A global memory of past optimization behaviors further supports balanced exploration and exploitation across objectives. This tree-structured interaction enables reasoning over long-horizon dependencies inherent in molecular design. Experiments on challenging multi-objective benchmarks involving activity, synthesizability, and ADMET-related properties show that ATOM consistently achieves improved Pareto coverage and hypervolume over strong baselines. These results demonstrate the effectiveness of pathwise multi-agent coordination for molecular optimization. Code is available at https://anonymous.4open.science/r/ATOM-41CE.

2606.00007 2026-06-02 cs.AI 版本更新

Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases

审慎策展:多智能体知识库协议

Steven Johnson

发表机构 * Steven Johnson(史蒂文·约翰逊)

AI总结 针对多智能体知识库的集体策展挑战,提出结合知识工件生命周期、声誉加权审议投票和分级制裁的审慎策展协议,通过基于智能体的仿真验证其在逆境下优于多数投票的鲁棒性。

Comments 29 pages, 1 figure, 6 tables. Open-source implementation available at https://github.com/StevenJohnson998/AIngram

详情
AI中文摘要

随着AI智能体从孤立工具过渡到共享知识生态系统中的协作参与者,管理集体知识策展成为一个关键挑战。人类平台治理机制无法直接转移:智能体无状态性削弱了基于威慑的制裁,模型同质性违反了群体智慧所需的独立性假设,而谄媚行为破坏了审议共识。我们提出了一种审慎策展协议,结合三个治理层:(1)形式化为带标签转移系统的知识工件生命周期;(2)结合Beta声誉与EigenTrust放大的声誉加权审议投票;(3)针对无状态智能体调整的分级制裁,包括区分故障与对抗行为的损坏智能体处理。我们通过基于智能体的仿真评估该协议,涉及7种行为原型下的100个智能体,在两种逆境场景中(30个种子,配对t检验)。该协议在良性条件下牺牲适度精度,以换取逆境下显著更好的鲁棒性:中等逆境下0.826对比多数投票的0.791(p<0.001),高压逆境下0.807对比0.740(p<0.001)。该协议退化速度约为多数投票的三分之一。消融分析确定提交-揭示投票隐藏是最有影响力的单一组件(精度提升8.2-8.6个百分点,p<0.001),优于声誉加权和审议的结合。分级制裁在仿真中未被使用,因此仍缺乏实证验证。

英文摘要

As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Human platform governance mechanisms do not transfer directly: agent statelessness undermines deterrence-based sanctions, model homogeneity violates independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol combining three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation-weighted deliberative voting integrating Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted for stateless agents, including broken agent handling distinguishing malfunction from adversarial behavior. We evaluate the protocol through agent-based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t-tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 vs 0.791 for majority vote under moderate adversity (p<0.001), widening to 0.807 vs 0.740 under stress (p<0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit-reveal vote concealment as the most impactful single component (8.2-8.6pp precision improvement, p<0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated.

2606.00005 2026-06-02 cs.AI 版本更新

Emergent Collaborative Deliberation in Multi-Model AI Systems: A BFT-Derived Protocol for Epistemic Synthesis

多模型AI系统中的涌现协作审议:一种源自BFT的认知综合协议

VD Doske

发表机构 * Independent Researcher(独立研究者) Consilia

AI总结 提出Consilium协议,一种基于拜占庭容错的多模型AI审议架构,将模型间分歧视为认知信号而非错误,通过认知角色分配和样本内外验证框架,实现低成本下与前沿模型相当的认知综合能力。

Comments 32 pages, 7 figures

详情
AI中文摘要

我们提出了Consilium协议,这是一种源自拜占庭容错技术的结构化多模型AI审议架构,将模型间分歧视为认知信号而非错误。该协议为语言模型分配工程化的认知角色——区分模型是什么与它如何推理——并引入了一种从量化金融中改编的样本内/样本外验证框架,以区分训练数据共识与经验验证的结论。在涵盖10个领域类别32个主题的1,478次审议会话中,我们证明了:(1) 认知角色而非底层模型决定认知行为:每批次成本0.0002美元的自由边缘推理模型产生的分析输出与成本10.69美元的前沿模型相当;(2) RLHF对齐训练产生了可测量的、特定领域的认知盲点——有争议的政策主题比已解决的科学主题受到的对抗性挑战少12.3个百分点,而AI安全主题表现出不对称偏差(Δ=11.6%),模型质疑AI危险主张的力度远大于质疑AI风险被夸大主张的力度;(3) 该协议本身没有方向性偏差(移民Δ=2.3%,可再生能源Δ=1.2%);(4) 样本外证据检索验证了239项主张,证据检索率100%,并发现了167个训练数据审议无法察觉的盲点发现。跨随机模型×角色分配的运行间可重复性平均标准差为±2.2%。整个电池的总成本(包括所有开销)为217美元。我们在MIT许可下发布协议规范,以便独立验证。

英文摘要

We present the Consilium Protocol, a Byzantine Fault Tolerance-derived architecture for structured multi-model AI deliberation that treats inter-model disagreement as epistemic signal rather than error. The protocol assigns engineered cognitive personas to language models -- separating what a model is from how it reasons -- and introduces an In-Sample/Out-of-Sample validation framework adapted from quantitative finance to distinguish training-data consensus from empirically grounded conclusions. Across 1,478 deliberation sessions spanning 32 topics in 10 domain categories, we demonstrate that (1) the cognitive persona, not the underlying model, determines epistemic behavior: free edge-inference models costing 0.0002 USD per batch produced comparable analytical output to frontier models costing 10.69 USD; (2) RLHF alignment training creates measurable, domain-specific epistemic blind spots -- contested policy topics exhibit 12.3 percentage points less adversarial challenge than settled science topics, and AI safety topics show asymmetric bias ($Δ$=11.6%) where models challenge claims that AI is dangerous far more vigorously than claims that AI risk is overstated; (3) the protocol exhibits no directional bias of its own (immigration $Δ$=2.3%, renewables $Δ$=1.2%); and (4) out-of-sample evidence retrieval validated 239 claims with 100% evidence retrieval and surfaced 167 blind-spot discoveries invisible to training-data deliberation. Run-to-run reproducibility across randomized model$\times$persona assignments averages $\pm$2.2% standard deviation. Total cost for the complete battery including all overhead: 217 USD. We release the protocol specification under MIT license to enable independent verification.

2606.00002 2026-06-02 cs.AI 版本更新

Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations

立场论文:决策引擎中的求解后鲁棒性——可行域与扰动下的平滑性

Yi-Xiang Hu

发表机构 * Yi-Xiang Hu(胡毅祥)

AI总结 针对混合整数线性规划决策引擎在部署后因微小扰动导致解不可行或突变的问题,提出求解后鲁棒性层,通过审计已求解并返回基于求解器的可信证据,形式化ε-近优可行邻域和解平滑性两个核心概念,并整合敏感性分析、鲁棒优化、邻域搜索、对抗测试及学习增强等方法,构建统一的求解后鲁棒性框架。

详情
AI中文摘要

混合整数线性规划(MILP)决策引擎通常为高风险工业系统输出名义上的最优计划。然而,部署时很少匹配求解时的假设:成本、需求或资源可用性的微小扰动可能使解不可行,或引发不连续的转变,导致性质上不同的解。我们认为,这种求解后鲁棒性差距是当今优化流程中缺失的一层,也是学习型决策系统缺失的评估维度。该层并非替代鲁棒优化或随机规划,而是审计已求解的基解,并返回基于求解器的证据,说明该解在多大程度上可以被信任。我们形式化了两个核心对象:(i)参数空间中的ε-近优可行邻域,捕捉基解在扰动下保持可行且近优的范围;(ii)决策空间中的解平滑性,捕捉具有小组合编辑的邻近替代方案是否仍具竞争力。然后,我们综合了来自敏感性分析、稳定性分析、鲁棒优化、邻域搜索、对抗测试和基于学习的增强方法中最相关的部分答案,并阐述了一个统一的求解后鲁棒性层的议程。具体而言,我们呼吁围绕基解进行认证的内部近似、具有校准不确定性的概率鲁棒性估计、对抗鲁棒性边界,以及与基于求解器的验证相一致的学习型预测和解释。最后,我们提出了一个紧凑的报告模板和评估协议,使鲁棒性成为决策引擎的一等输出。

英文摘要

Mixed-Integer Linear Programming (MILP) decision engines routinely output nominally optimal plans for high-stakes industrial systems. Yet deployment rarely matches solve-time assumptions: small perturbations in costs, demands, or resource availability can invalidate feasibility or trigger discontinuous shifts to qualitatively different solutions. We argue that this post-solve robustness gap is a missing layer in today's optimization pipelines and a missing evaluation dimension for learning-enabled decision systems. Rather than replacing robust optimization or stochastic programming, the proposed layer audits a solved incumbent and returns solver-backed evidence about how far that solution can be trusted. We formalize two central objects: (i) an $ε$-near-optimal feasible neighborhood in parameter space, capturing when an incumbent remains feasible and near-optimal under perturbations, and (ii) solution smoothness in decision space, capturing whether nearby alternatives with small combinatorial edits remain competitive. We then synthesize the most relevant partial answers from sensitivity and stability analysis, robust optimization, neighborhood search, adversarial testing, and learning-based enhancements, and articulate an agenda for a unified post-solve robustness layer. Concretely, we call for certified inner approximations around the incumbent, probabilistic robustness estimation with calibrated uncertainty, adversarial robustness margins, and learning-based prediction and explanation aligned with solver-backed verification. We conclude with a compact reporting template and evaluation protocol that would make robustness a first-class output of decision engines.

2605.30748 2026-06-02 cs.SD cs.AI eess.AS 版本更新

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Chatterbox-Flash: 用于流式零样本TTS的先验校准块扩散

Deokjin Seo, Gangin Park, Kihyun Nam

发表机构 * Resemble AI Seoul National University(首尔国立大学) KAIST(韩国科学技术院)

AI总结 提出Chatterbox-Flash,通过将预训练自回归TTS解码器微调为块扩散解码器,实现块内并行生成与块间流式推理,并引入先验校准评分和早期解码调度解决长尾分布导致的生成质量下降问题。

Comments 8 pages, 4 figures, 9 tables

详情
AI中文摘要

我们提出Chatterbox-Flash,一种零样本文本转语音模型,通过将预训练的自回归TTS解码器微调为块扩散解码器获得,支持每个块内的并行令牌生成,同时保持逐块流式传输。我们发现,将主流的块扩散解码直接迁移到离散语音令牌会降低质量,因为长尾令牌分布使并行位置选择偏向少数高频令牌。为在不修改架构的情况下缓解这一问题,我们引入了两种推理时技术:先验校准评分(减去块级边际令牌分布)和早期解码调度(基于校准置信度自适应终止迭代)。在标准零样本TTS基准测试中,Chatterbox-Flash实现了与强自回归和非自回归基线相当的高保真合成,同时支持流式推理,首包时间与流式AR系统相当,且实时因子显著降低。代码和音频样本可在 https://github.com/resemble-ai/chatterbox-flash 获取。

英文摘要

We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash attains high-fidelity synthesis comparable to strong autoregressive and non-autoregressive baselines, while supporting streaming inference with time-to-first-packet on par with streaming AR systems and substantially lower real-time factor. Code and audio samples are available at https://github.com/resemble-ai/chatterbox-flash.

2605.30581 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

工业视觉模拟到现实中的先验可用性:CAD引导与CAD不可用机制的综述

Chenxi Tao, Seung-Kyum Choi

发表机构 * George W. Woodruff School of Mechanical Engineering(乔治·W·伍德鲁夫机械工程学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文通过先验可用性视角重新组织工业视觉模拟到现实问题,区分CAD可用、CAD不可用和边界先验三种机制,并基于T-LESS/BOP、MVTec AD和VisA数据集进行实证分析,揭示了源分布设计、检测器容量和真实校准的重要性,以及CAD在测试时提供的独特验证通道。

Comments Review article; 103 references; 9 main figures; empirical anchors on T-LESS/BOP, MVTec AD, and VisA

详情
AI中文摘要

工业视觉模拟到现实通常被描述为从合成图像到真实图像的迁移,但工业部署通常涉及可用证据与所需决策之间更广泛的错配。系统可能基于CAD渲染、模拟RGB-D观测、正常参考图像、合成缺陷、预训练特征空间或语言提示构建,却在不同的传感器、光照、材料、夹具、校准、生产变化和罕见缺陷模式下部署。本综述将工业视觉模拟到现实重新定义为由先验可用性组织的域差距问题。我们区分了CAD可用设置(其中显式物体几何可支持渲染、校准、姿态估计、分割和测试时几何验证)、CAD不可用设置(其中几何被正常参考外观、特征分布、师生残差、合成异常假设、基础特征或视觉语言先验取代)以及边界先验设置(其中近似模型、模板、参考视图或语义对应仅保留CAD的部分作用)。这一框架将基于CAD的检测和6D姿态估计文献与通常单独综述的工业异常和表面检测文献联系起来。为使分类具体化,我们使用T-LESS/BOP、MVTec AD和VisA上的实证锚点。这些锚点表明,仅靠CAD渲染数量并不能弥合迁移;源分布设计、检测器容量和小规模真实校准可能更为重要。它们还表明,测试时的CAD通过掩码、姿态和深度一致性创建了独特的验证通道,而CAD不可用的检测则依赖于校准的正常性和特征偏差。因此,本综述反对单一跨任务排行榜,而是询问什么先验支撑了部署决策。

英文摘要

Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.

2605.28508 2026-06-02 cs.AI 版本更新

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

低资源场景下的AI基准测试:超越排行榜的思考

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam, Prasaanth Balraj, Nakul Jain

发表机构 * Wadhwani AI Global(Wadhwani AI全球)

AI总结 本文通过分析语音、聊天/RAG和视觉系统的基准测试,指出实验室评估与低资源环境部署之间的差距,提出以部署系统为评估单位,并整合任务性能与噪声输入、代码切换等部署条件,同时为不同应用类制定差异化评估框架,最后建议标准化报告工具以支持决策。

Comments Aakash Pant, Kavya Shah, and Apoorv Agnihotri contributed equally

详情
AI中文摘要

现有的AI评估实践往往未能捕捉系统在低资源环境中的实际表现,在这些环境中,操作约束与模型质量同样影响可用性。通过对语音、聊天/RAG和视觉系统的现有基准测试家族进行结构化分析,我们识别出实验室评估实践与低资源环境实际部署条件之间的关键差距。我们认为,有意义的评估单位是部署的系统而非孤立的模型,有效的评估框架必须将任务性能与部署条件(如噪声输入、代码切换、间歇性连接、低端硬件和领域偏移)相结合。同时,基准测试应认识到不同应用类别需要不同的评估概况,而非一个掩盖操作差异的单一总分。为支持实际决策,我们提出一个共享的报告框架,该框架在保持跨系统和应用类型的可比性的同时,对部署上下文保持敏感。最后,我们强调需要为政策制定者、捐助者和实施者提供简洁且可操作的报告工件,包括标准化的一页基准卡、部署概况,以及故障处理程序和人工监督机制的明确文档。

英文摘要

Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constraints shape usability as much as model quality. Through a structured analysis of existing benchmark families across speech, chat/RAG, and vision systems, we identify critical gaps between laboratory evaluation practices and real-world deployment conditions in low-resource environments. We argue that the meaningful unit of assessment is the deployed system rather than an isolated model and that effective evaluation frameworks must integrate task performance with deployment conditions such as noisy inputs, code-switching, intermittent connectivity, low-end hardware, and domain shift. At the same time, benchmarks should recognize that different application classes require distinct evaluation profiles rather than a single aggregate score that obscures operational differences. To support practical decision-making, we propose a shared reporting framework that preserves comparability across systems and application types while remaining sensitive to deployment context. Finally, we emphasize the need for concise and actionable reporting artifacts for policymakers, donors, and implementers, including standardized one-page benchmark cards, deployment profiles, and explicit documentation of failure handling procedures and human oversight mechanisms.

2605.28183 2026-06-02 cs.CL cs.AI 版本更新

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

BenGER:德国法律中基于归入的法律推理的LLM系统基准测试

Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair

发表机构 * Technical University of Munich (TUM)(慕尼黑技术大学) Ludwig Maximilian University of Munich (LMU)(慕尼黑路德维希-马克西米利安大学) University of Konstanz(康斯坦茨大学) University of Saarbrücken(萨尔布吕肯大学)

AI总结 提出BenGER数据集,用于评估LLM系统在德国法律归入推理中的表现,通过自动和基于法官的指标比较12个LLM系统与人类基线。

Comments Pre-Print v2

详情
AI中文摘要

我们引入了BenGER(德国法律基准)数据集,用于评估LLM系统在德国法律中基于归入的法律推理。BenGER数据集由三个部分组成:596个跨多个法律教育水平的考试式自由文本法律案例任务和531个简短的教义推理任务。我们评估了12个当代LLM系统——包括封闭旗舰型、效率导向型和开放权重型——使用自动和基于法官的指标。在受控验证子集上,对定时的人类撰写解决方案(在无辅助和人机共创条件下)进行模型性能与这些人类基线的对比。我们引入了一个与多评分者人工评分协议(每个解决方案三次盲审加一次作者知情创建者评审)交叉验证的、基于评分标准的LLM-as-a-Judge框架。我们的结果表明,用LLM法官替换盲人评审员对整个人类评审池的一致性影响不大于完全移除该评审员(Calderon r=0.96 vs. r=0.96,匹配n=30),封闭旗舰系统在所有语料库中领先排行榜,并且人机共创显著优于无辅助的人类工作。

英文摘要

We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems -- closed flagship, efficiency-oriented, and open-weight -- across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human--AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs.~r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human--AI co-creation substantially outperforms unaided human work.

2605.27752 2026-06-02 cs.AI 版本更新

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

询问是不够的:LLM 置信度校准中的协议敏感性

Hankyeol Kim, Pilsung Kang

发表机构 * Seoul National University(首尔国立大学)

AI总结 研究通过改变测量协议(如条件上下文、令牌读取方式)发现,LLM 的令牌概率置信度与口头置信度之间的比较高度敏感,且口头置信度不仅反映正确性还反映答案的合理性和来源。

详情
AI中文摘要

LLM 置信度校准通常通过比较两种信号来评估:令牌概率分数和口头置信度。这些信号有时被视为模型不确定性的直接读数,但它们的比较取决于很少明确说明的测量选择。在主要分析中,我们固定口头置信度的引出方式:一个单一的提示模板、概率尺度和输出格式。然后,我们改变定义口头与令牌比较的测量轴:哪个答案字符串获得令牌概率分数,如何从答案令牌中读取该分数,以及在哪种条件上下文中测量它。我们在三个开放 7--8B 基础/指令模型家族的四个 QA 基准上评估了这种设计,并使用更大的 Qwen2.5 变体作为同家族鲁棒性检查。结果比较对这些选择敏感:条件上下文改变了跨设置的 ECE 差距的符号或大小,令牌读取产生了更小但仍改变符号的变化,而改变 ECE 估计器影响很小。在默认的生成答案、裸上下文协议下,指令设置接近平衡,而不是显示口头置信度的大幅校准增益。在单独的提供答案分析中,表面合理的错误答案与提供的正确答案获得几乎相同的置信度,这表明口头置信度也反映了答案的合理性和来源,而不仅仅是正确性。我们认为,两种置信度信号都应被视为依赖于协议的测量行为,并提供了一个报告清单,涵盖引出来源、评分答案、令牌概率读取和条件上下文。

英文摘要

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7--8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.

2605.27701 2026-06-02 cs.AI 版本更新

Cross-Entropy Games and Frost Training

交叉熵博弈与Frost训练

Arthur Renard, Franck Gabriel, Valentin Hartmann, Clément Hongler

发表机构 * Xent Labs(Xent实验室) Université Lyon 1(里昂1大学)

AI总结 提出Frost训练方法,利用奖励函数在嵌入空间中的梯度改进基于蒙特卡洛的策略优化,用于解决一类称为交叉熵博弈的LLM-as-a-judge任务,在最佳k选择中实现更高最大分数并加速训练。

Comments 14 pages, 6 figures

详情
AI中文摘要

我们提出Frost训练,一种改进基于蒙特卡洛的策略优化的方法,适用于称为交叉熵博弈的一大类LLM-as-a-judge任务。关键思想是利用奖励函数在嵌入空间中的梯度。该信号在贪婪坐标梯度(GCG)越狱技术中使用;我们首次证明它也可用于提升模型训练。我们使用GRPO训练进行最大似然填充来验证我们的方法。Frost训练提高了模型生成高评分输出的能力,在最佳k选择中达到更高的最大分数,并且速度更快。

英文摘要

We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.

2605.27575 2026-06-02 cs.AI 版本更新

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Agyn:一个面向AI代理的开源平台,具有可扩展的按需执行、代理即代码定义和零信任访问

Nikita Benkovich, Vitalii Valkov

发表机构 * Agyn, Inc.(Agyn公司) Mila AI e-Lab

AI总结 提出Agyn开源平台,通过信号驱动的有状态无服务器运行时、Terraform代理定义和零信任安全模型,解决AI代理在生产中的隔离、治理和扩展问题。

详情
AI中文摘要

随着组织向AI代理的生产部署迈进,这些代理执行非确定性工作流、维护有状态会话,并通常以特权访问内部服务,工程挑战从构建单个代理转向在适当的隔离、治理和安全性下大规模运行它们。在本文中,我们介绍Agyn,一个开源平台,围绕三个针对代理工作负载的关键原则设计:基于Kubernetes的信号驱动、有状态无服务器运行时;用于代理和工具定义的Terraform提供程序;以及基于零信任和最小权限原则的安全模型。Agyn是代理无关、模型无关和云无关的。

英文摘要

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

2605.27569 2026-06-02 cs.AI 版本更新

RULER: Representation-Level Verification of Machine Unlearning

RULER: 机器遗忘的表示级验证

Georgina Cosma, Axel Finke

发表机构 * Department of Computer Science, Loughborough University, UK(英国洛林大学计算机科学系) School of Mathematics, Statistics and Physics, Newcastle University, UK(英国新castle大学数学、统计与物理学院)

AI总结 提出表示级验证指标RULER(包括基于oracle的M2和无oracle的M4),检测机器遗忘后模型中间表示中残留的被遗忘记录信息,发现输出级验证通过的方法在表示级仍存在显著残留。

详情
AI中文摘要

机器遗忘旨在从已部署的模型中移除特定训练记录的影响,而无需从头重新训练。当前协议通过成员推断、保留准确率和遗忘集准确率在输出级进行验证,但模型可能满足所有三个条件的同时在其中间表示中编码被遗忘的记录。我们引入RULER,一组表示级验证指标。基于oracle的比较指标M2衡量遗忘集记录是否占据与在没有它们的情况下重新训练的模型中相同的表示位置。无oracle指标M4仅从未学习模型的内部相似性结构检测残差,无需重新训练。四种近似遗忘方法均通过输出级评估,但在线性混合效应模型下,M2在12个条件中的10个中检测到显著残差(p<0.05),且效应大小随遗忘比例增加而增大。第五种方法Bad Teacher尽管具有不同的遗忘机制,也显示出相同的残差。M4作为遗忘前诊断指标,适用于表格、图像、临床文本和人脸身份设置:它检测到人脸识别模型中身份级别的记忆化,而所有测试方法均无法完全擦除该信号。

英文摘要

Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

融合异质注意力结构的Transformer模型通用解释方法

Yongjin Cui, Xiaohui Fan, Huajun Chen

发表机构 * Zhejiang University(浙江大学)

AI总结 针对Transformer中异质注意力结构(如共注意力)带来的多源信息融合挑战,提出一种通用解释方法,并通过实验分析范式对代表性模型进行语义和逻辑解释。

详情
AI中文摘要

Transformer极大地推动了人工智能的发展,也推动了智能体(agent)的发展。我们将Transformer的注意力结构根据输入信息的来源分为两类:同质注意力结构和异质注意力结构。异质注意力结构以共注意力(co-attention)为典型例子,处理来自不同来源的信息。异质注意力结构是Transformer模型实现更复杂功能、融合更多模态信息的基础。无论是出于研究目的还是政策要求,对具有异质注意力结构的Transformer模型进行解释都是一项重要任务。来自不同来源的信息融合带来了新的挑战。我们的工作主要包括方法和实验两部分。在方法方面,我们提出了一种针对具有异质注意力结构的Transformer模型的解释方法。在实验方面,基于我们的实验分析范式,我们解释代表性模型的操作机制,进行语义解释和逻辑解释。

英文摘要

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We categorize attention structures of Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. Heterogenous attention structure is the foundation for Transformer models to achieve more complex functions and integrate more modal information. Whether for research purposes or policy requirements, the interpretation of Transformer models with heterogenous attention structures is an important task. The fusion of information from different sources brings new challenges. Our work mainly includes two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models, conduct semantic interpretation and logical interpretation.

2605.27044 2026-06-02 cs.AI 版本更新

BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

BatteryMFormer:电池退化轨迹预测的多级学习

Ruifeng Tan, Jintao Dong, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

发表机构 * Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou)(可持续能源与环境方向,香港科学与技术大学(广州)) School of Computer Science and Engineering, Central South University(中南大学计算机科学与工程学院) Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)(数据科学与分析方向,香港科学与技术大学(广州)) Material Genome Institute, Shanghai University(材料基因组研究所,上海大学)

AI总结 提出BatteryMFormer,一种多级Transformer模型,通过老化条件感知解码器、元退化模式记忆和双视角编码器,从早期数据预测电池全生命周期健康状态轨迹,在四个电池域上超越现有方法。

Comments Accepted by KDD 2026

详情
AI中文摘要

早期电池退化轨迹预测(BDTF)从早期运行数据预测全生命周期健康状态轨迹,对电池优化、制造和部署至关重要。电池退化数据呈现两个关键特征。首先,退化数据具有多级结构,包括老化条件下的共享规律性和跨电池的轨迹模式。其次,电压-电流曲线中与退化相关的变化通常局限于特定的荷电状态(SOC)区间。现有方法通常未能显式建模这些特征。为弥补这一差距,我们提出BatteryMFormer,一种用于早期BDTF的多级Transformer。BatteryMFormer集成了(1)老化条件感知解码器,通过老化条件知情查询和老化条件感知注意力注入老化条件先验,(2)元退化模式记忆,学习并检索轨迹原型以指导长期预测,以及(3)双视角编码器,从电压和电流时间序列中联合捕获时间动态和SOC局部变化。在四个电池域上的大量实验表明,BatteryMFormer始终优于最先进的基线,标志着向可靠BDTF迈出了重要一步。我们的代码可在https://github.com/Ruifeng-Tan/BatteryMFormer获取。

英文摘要

Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational data, is critical for battery optimization, manufacturing, and deployment. Battery degradation data exhibit two key characteristics. First, degradation data present a multi-level structure, including regularities shared within aging conditions and trajectory patterns shared across batteries. Second, degradation-related variations in voltage-current profiles are often localized to specific state of charge (SOC) intervals. Existing approaches often fail to explicitly model these characteristics. To bridge this gap, we propose BatteryMFormer, a multi-level Transformer for early BDTF. BatteryMFormer integrates (1) an aging-condition-aware decoder that injects aging-condition priors via aging-condition-informed queries and aging-condition-aware attention, (2) a meta degradation pattern memory that learns and retrieves trajectory prototypes to guide long-horizon forecasting, and (3) a dual-view encoder that jointly captures temporal dynamics and SOC-localized variations from voltage and current time series. Extensive experiments on four battery domains show that BatteryMFormer consistently outperforms state-of-the-art baselines, marking a significant step toward reliable BDTF. Our code is available at https://github.com/Ruifeng-Tan/BatteryMFormer.

2605.27000 2026-06-02 cs.CL cs.AI 版本更新

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

撒更宽的网:面向代码推理的协调 Pass@K 策略优化

Yilong Li, Suman Banerjee, Tong Che

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) NVIDIA Research(英伟达研究)

AI总结 提出协调 Pass@K 策略优化 (CPPO),通过规划器生成多种策略并协调求解器尝试,以解决代码生成中重复采样导致冗余推理路径的问题,在多个基准上显著提升 pass@4。

Comments Code reasoning; pass@K optimization; coordinated planning; verifiable rewards; strategy diversity

详情
AI中文摘要

使用验证器进行重复采样是分配代码生成测试时计算的标准方法,pass@$K$ 是规范指标。然而,标准策略类从单一答案分布中抽取 $K$ 个独立样本,因此尝试往往坍缩到近乎重复的推理路径,并将预算浪费在冗余 rollout 上。在竞争性编程中,这种失败代价高昂,因为许多问题允许多种不同的算法策略,而 pass@$K$ 只需要一次正确尝试。我们提出协调 Pass@$K$ 策略优化 (CPPO),它将 pass@$K$ 生成转化为策略上的联合探索:规划器输出一个包含 $K{=}4$ 种替代高层方法的元组,共享求解器为每种方法尝试一个解决方案。CPPO 使用乘法规划器奖励 $R_{\\mathrm{plan}} = J_ψ\\\cdot R_{\\mathrm{out}}$ 训练此联合策略,仅将信用分配给导致验证器确认的 pass@$K$ 成功的有效策略元组。在 APPS、CodeContests 和 LiveCodeBench-v6 上,CPPO 在相同的 $K{=}4$ 求解器尝试预算下,相比于直接采样、规划基线、仅规划器 SFT 和面向 pass@$K$ 的 RL,提升了 pass@$4$,在九个模型-基准组合中的六个上具有统计显著增益。最大单次增益是在 Qwen3.5-9B LiveCodeBench-v6 上,相比于最强基线 PKPO 提升了 $+0.16$($0.588 \\rightarrow 0.748$;配对 bootstrap,$p < 0.05$)。

英文摘要

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, $R_{\mathrm{plan}} = J_ψ\cdot R_{\mathrm{out}}$, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@$K$ success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@$4$ over direct sampling, planning baselines, planner-only SFT, and pass@$K$-oriented RL under the same $K{=}4$ solver-attempt budget, with statistically significant gains on six of nine model--benchmark cells. The largest single gain is $+0.16$ on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ($0.588 \rightarrow 0.748$; paired bootstrap, $p < 0.05$).

2605.26874 2026-06-02 cs.DB cs.AI cs.LG 版本更新

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

知识图谱:基于LLM的工业资产运营中缺失的数据层

Madhulatha Mandarapu, Sandeep Kunkunuru

发表机构 * VaidhyaMegha Private Limited, India(印度VaidhyaMegha私人有限公司)

AI总结 研究通过类型化知识图谱作为数据层,将GPT-4在工业维护场景中的准确率从65%提升至99%,并引入生成增强知识(GAK)处理缺失数据,实现81.8%的场景可回答性。

Comments v2: reframed around the knowledge graph as a grounding substrate with a 3-tier router (text-to-Cypher; native graph/optimization primitives; generation-augmented knowledge, GAK). Adds a benchmark-grounded GAK evaluation on 88 real non-deterministic AssetOpsBench scenarios with provenance-tagged enrichment. 18 pages. Code: github.com/samyama-ai/assetops-kg

详情
AI中文摘要

基于LLM的工业资产运营代理在处理平面文档存储时准确性有限。AssetOpsBench(KDD 2026)表明,GPT-4代理在139个工业维护场景中达到65%的准确率,并比较了LLM编排范式(Agent-As-Tool vs. Plan-Execute)在固定数据层上的表现。我们提出一个正交问题:工具背后的数据模型有多重要?我们将类型化知识图谱作为基础基质,并根据最佳回答方式路由每个问题:(i)LLM生成的Cypher进行结构化检索,将同一GPT-4模型从65%提升至82-83%;(ii)原生图和优化原语(无需LLM)在图可回答场景中达到99%;(iii)生成增强知识(GAK)用于处理数据中缺失的答案——引擎的代理将缺失事实实现为带有溯源标签的图节点,然后回答。一个反复出现的主题是反向LLM使用:我们约束LLM从类型化模式生成查询或一次性丰富,让图确定性地执行。在88个真实的AssetOpsBench故障模式场景中(基准本身标记为非确定性——图中缺失十种设备类型),GAK将可回答性从零提升至100%的设备类型,并回答了81.8%的场景,每个实现的事实都标记为来源:LLM派生以确保可审计性。我们还贡献了40个图原生场景。对于结构化操作领域,数据层——而非LLM编排——是主要杠杆,类型化知识图谱充当原始工业数据与LLM推理之间的基础基质。

英文摘要

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios, and compares LLM orchestration paradigms (Agent-As-Tool vs. Plan-Execute) on a fixed data layer. We ask the orthogonal question: how much does the data model behind the tools matter? We treat a typed knowledge graph as a grounding substrate and route each question by how it is best answered: (i) LLM-generated Cypher for structured retrieval, which lifts the same GPT-4 model from 65% to 82-83%; (ii) native graph and optimization primitives, with no LLM, reaching 99% on graph-answerable scenarios; and (iii) generation-augmented knowledge (GAK) for answers absent from the data -- the engine's agent materializes the missing facts as provenance-tagged graph nodes, then answers. A recurring theme is inverted LLM usage: we constrain the LLM to query generation or one-shot enrichment from a typed schema and let the graph execute deterministically. On the 88 real AssetOpsBench failure-mode scenarios the benchmark itself flags non-deterministic -- ten equipment types absent from the graph -- GAK lifts answerability from zero to 100% of equipment types and answers 81.8% of scenarios, every materialized fact tagged source:LLM-derived for auditability. We also contribute 40 graph-native scenarios. For structured operational domains the data layer -- not the LLM orchestration -- is the primary lever, and a typed knowledge graph serves as a grounding substrate between raw industrial data and LLM reasoning.

2605.25246 2026-06-02 cs.AI 版本更新

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

FrontierOR:基准测试大语言模型在大规模优化中高效算法设计的能力

Minwei Kong, Chonghe Jiang, Ao Qu, Wenbin Ouyang, Zhaoming Zeng, Xiaotong Guo, Zhekai Li, Junyi Li, Yi Fan, Xinshou Zheng, Xi Jing, Yikai Zhang, Zhiwei Liang, Seonghoo Kim, Runqing Yang, Zijian Zhou, Sirui Li, Han Zheng, Wangyang Ying, Ou Zheng, Chonghuan Wang, Jinglong Zhao, Hanzhang Qin, Cathy Wu, Paul Pu Liang, Jinhua Zhao, Hai Wang

发表机构 * Singapore-MIT Alliance for Research and Technology(新加坡-麻省理工联盟研究技术) Massachusetts Institute of Technology(麻省理工学院) Northeastern University(东北大学) Uber Shanghai Jiaotong University(上海交通大学) Boston University(波士顿大学) Emory University(埃默里大学) Northwestern University(西北大学) National University of Singapore(国立新加坡大学) Microsoft(微软) University of Texas at Dallas(德克萨斯大学达拉斯分校) Singapore Management University(新加坡管理学院)

AI总结 提出FrontierOR基准,系统评估大语言模型在现实大规模优化问题中设计可扩展算法(而非仅生成求解器代码)的能力,发现最强模型仅在31%案例中优于Gurobi。

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于优化建模和求解器代码生成,然而实际的运筹学和优化问题往往需要更困难的能力:设计可扩展的算法,利用问题结构并超越直接的建模-求解基线。现有基准仅限于远低于现实规模和复杂度的小型或简化示例。我们引入FrontierOR,这是首批系统评估基于LLM的高效算法设计在现实大规模优化问题中的基准之一。FrontierOR包含180个任务,这些任务源自顶级运筹学场所发表的方法论多样的论文,每个任务都有标准化实例和隐藏的、专家验证的评估套件。我们评估了七个LLM,涵盖前沿、经济高效和开源模型,在一次性设置和测试时进化设置中。结果显示,前沿模型仍然难以从可执行的公式化转向高效的优化算法:最强的一次性模型在解决方案质量和计算效率方面仅在31%的案例中优于Gurobi,即使具有测试时进化的强大编码代理在选定的困难任务上也仅达到50%。FrontierOR为基于LLM的优化算法设计建立了一个实用的评估平台,使未来的LLM和智能体能够系统地测试它们是否能够超越正确的公式化,转向可行、高质量和高效的算法。代码和数据已在https://github.com/Minw913/FrontierOR公开。

英文摘要

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Code and data are publicly released at https://github.com/Minw913/FrontierOR.

2605.26684 2026-06-02 cs.LG cs.AI 版本更新

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

超越轨迹级归因:基于图的智能体强化学习信用分配

Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出GraphGPO方法,通过构建状态转移图并利用全局信息估计各状态到任务目标的距离,实现步骤级信用分配,提升训练效率和性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于组的强化学习方法在提升大型语言模型性能方面取得了显著成功,并已迅速扩展到智能体任务。然而,其信用分配严重依赖于根据最终结果进行的粗粒度轨迹级归因,难以捕捉单个步骤的贡献,例如失败轨迹中被掩盖的有价值步骤。为了揭示潜在信息并实现更忠实的步骤级信用分配,我们提出基于图的组策略优化(GraphGPO),该方法首先将所有 rollout 轨迹聚合为一个统一的状态转移图,然后利用图中编码的全局信息估计每个状态到任务目标的距离。最后,GraphGPO 通过估计基于图的优势函数,根据转移减少到任务目标距离的程度,为每条边分配信用。通过这种方式,GraphGPO 显著提高了训练效率,并在多个具有挑战性的基准测试中取得了最先进的性能。

英文摘要

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.

2605.26436 2026-06-02 cs.CL cs.AI 版本更新

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

目标重掩码:在离散扩散语言模型中将令牌编辑替换为令牌到掩码的精炼

Lin Yao

AI总结 针对离散掩码扩散语言模型中令牌编辑机制的局限性,提出无训练的令牌到掩码重掩码方法,通过将疑似错误令牌重置为掩码状态,利用扩散过程在更干净上下文中重新预测,显著提升数学等任务的性能。

Comments This paper has been significantly revised, expanded, and superseded by a more comprehensive version available at arXiv:2604.18738. The authors have chosen to withdraw this version to avoid overlap and direct readers to the updated work

详情
AI中文摘要

离散掩码扩散语言模型(如LLaDA)通过迭代去噪生成文本,其中掩码令牌逐步被预测的令牌替换。LLaDA2.1引入了令牌到令牌(T2T)编辑机制,通过直接替换疑似错误的已提交令牌来加速生成。然而,我们发现了T2T编辑的根本性限制:它将错误检测与替换耦合,用可能错误的令牌污染生成上下文,并引入了训练-推理噪声不匹配,其中系统性的模型生成错误与训练中看到的随机扰动不同。我们提出了令牌到掩码(T2M)重掩码,这是一种无需训练、即插即用的T2T编辑替代方案,将疑似错误的令牌重置回掩码状态,允许扩散过程在更干净的上下文中重新预测它们。我们设计并实证验证了三种互补的错误检测策略——基于概率的、触发镜像的和基于时间差分的——并提供了统一的理论分析,表明T2M重掩码净化了生成上下文,将系统性的推理错误转换回模型的原生掩码噪声类型,并实现了延迟承诺以进行联合多位置优化。在涵盖知识、推理、数学、编码和指令跟随的12个基准上的全面实验表明,T2M通常在需要精确令牌级输出的任务上提升性能,其中数学任务提升最大(CMATH上+5.92%)。对CMATH的错误分析揭示,主要的失败模式是最后一英里令牌损坏——即正确的推理产生损坏的最终答案——而T2M修复了59.4%的此类情况。

英文摘要

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies -- probability-based, trigger-mirrored, and temporal-difference-based -- and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model's native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption -- where correct reasoning produces a corrupted final answer -- and that T2M repairs 59.4% of such cases.

2605.26397 2026-06-02 cs.CL cs.AI 版本更新

Algorithmic Fragility and Persona Bias in LLM-Generated Autistic Communication

LLM生成的自闭症交流中的算法脆弱性与人格偏见

Naba Rizvi, Mohammed Rizvi, Harper Strickland, Saleha Ahmedi, Nedjma Ousidhoum

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Georgia Institute of Technology(佐治亚理工学院) Cornell University(康奈尔大学) Cardiff University(卡迪夫大学)

AI总结 通过双人格改写范式,发现LLM在生成自闭症人格文本时存在词汇与情感偏离、输出坍塌等系统性失败,且对齐策略而非参数规模主导这些失败,表明当前对齐训练导致深层表征鸿沟。

Comments main paper: 9 pages; total: 19 pages; 2 figures; 5 tables

详情
AI中文摘要

安全对齐减少了明确有害的输出,但无意中编码了一种对边缘化交流的净化、神经典型化表征。我们使用双人格改写范式研究这种编码,提示十个大型语言模型(LLM)从自闭症或神经典型人格改写自然发生的自闭症话语。我们发现,尽管语义相似性相当,自闭症人格改写比神经典型改写在词汇形式和情感语域上偏离显著更多。此外,大多数模型将跨人格生成折叠成几乎相同的输出。为了揭示这种生成崩溃背后的机制,我们引入了一个多智能体定性分析框架。我们的结果揭示了系统性输出擦除、刻板幻觉和任务回避元评论是此任务的普遍失败模式,这些模式按对齐策略而非参数规模聚类。最后,我们与自闭症人类标注者的针对性比较表明,社区内部知识相对于LLM分类产生了系统性标签反转。我们的发现表明,当前的对齐训练导致仅通过定性分析可见的人格特定生成崩溃,证实了提示工程无法解决的深层表征鸿沟。

英文摘要

Safety alignment reduces explicitly harmful outputs but inadvertently encodes a sanitized, neuronormative representation of marginalized communication. We investigate this encoding using a dual-persona rewrite paradigm, prompting ten large language models (LLMs) to rewrite naturally occurring autistic discourse from either an autistic or neurotypical persona. We uncover autistic-persona rewrites diverge significantly more in lexical form and affective register than neurotypical rewrites, despite equivalent semantic similarity. Furthermore, most models collapse cross-persona generations into near-identical outputs. To uncover the mechanisms behind this generative breakdown, we introduce a multi-agent qualitative analysis framework. Our results reveal systemic output erasure, stereotyped hallucination, and task-evasive meta-commentary are pervasive failure modes for this task that cluster by alignment strategy rather than parameter scale. Finally, our targeted comparison with autistic human annotators demonstrates that community-insider knowledge produces systematic label reversals relative to LLM classifications. Our findings indicate that current alignment training causes persona-specific generative breakdown visible only through qualitative analysis, confirming a deep representational gap that prompt engineering cannot resolve.

2605.26305 2026-06-02 cs.AI cs.SY eess.SY hep-ph 版本更新

Experiments in Agentic AI for Science

科学领域中的自主AI代理实验

Judy Fox, Geoffrey Fox

发表机构 * School of Data Science, Department of Computer Science, and Biocomplexity Institute, University of Virginia(数据科学学院、计算机科学系和生物复杂性研究所,弗吉尼亚大学)

AI总结 提出两种基于本地体-远程脑架构的自主AI代理框架,通过系统工程技术克服上下文和推理限制,分别用于时间序列数据集的大规模自动整理和物理讲座的结构化报告生成。

详情
AI中文摘要

本文详细介绍了在科学工作流中开发自主AI代理的两种新颖框架。两个系统都通过Google Colab利用混合本地体-远程脑架构,使用基于Python的本地协调器调用大型语言模型(LLM)云后端。第一个代理DeepTS/DeepCollector自动化了时间序列数据集的大规模整理、提取和去重。第二个代理DeepScribe是一个自主演示分析器,将视觉密集、数学复杂的物理讲座转换为结构化科学报告。通过实际的系统工程——如细粒度属性提取(Cellular RAG)、远程数据检查和分布式并发控制——我们展示了自主AI代理如何克服当前最先进系统的上下文和推理限制,以严格支持科学工作流。最后,我们概述了DeepTS的泛化以支持深度知识图谱,并讨论了这种概念方法在高能物理(DeepQCD)中的应用。

英文摘要

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering-such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls-we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

2605.26068 2026-06-02 cs.LG cs.AI 版本更新

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

重新思考异常检测中的弱监督:一个综合基准

Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

发表机构 * Shanghai University of Finance and Economics(上海金融学院) Ant Group(蚂蚁集团) Key Laboratory of Interdisciplinary Research of Computation and Economics(计算与经济交叉学科重点实验室)

AI总结 提出WSADBench,首个统一评估不完全、不精确和不准确三种弱监督异常检测场景的基准,通过系统变化标签数量、粒度和质量,揭示36种算法在4种模态上的性能边界,并发现弱监督场景间存在强相关性、专用WSAD算法仅在极端标签稀缺时占优等关键洞察。

Comments Accepted at KDD 2026 Datasets and Benchmarks Track

详情
AI中文摘要

弱监督异常检测(WSAD)已发展出三个主要方向:不完全监督、不精确监督和不准确监督。然而,这些方向仍然相互孤立,缺乏一个统一的框架来评估它们是否解决独特的挑战或共享基本机制。本文介绍了WSADBench,这是第一个统一评估不同弱监督场景的基准,对从专用WSAD方法到先进表格基础模型的多种方法进行基准测试。WSADBench通过系统变化标签数量、粒度和质量,建立了标准化协议来评估4种模态上的36种算法,揭示了各种方法的性能边界。基于超过70万次实验,WSADBench揭示了四个关键见解:(i)这些弱监督场景之间存在强内在相关性,挑战了当前研究方向的孤立性。(ii)专用WSAD算法仅在极端标签稀缺情况下表现出色,但随着监督增加或在OOD场景中,很快被表格基础模型和通用分类方法主导。(iii)未标记数据在不同设置中的效用不一致,与标签细化相比收益微乎其微。(iv)模型对不同类型的标签噪声表现出不对称敏感性。我们发布WSADBench作为开源基准,包含代码和数据集,以促进未来的WSAD研究:https://github.com/SUFE-AILAB/WSADBench。

英文摘要

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.

2605.26089 2026-06-02 cs.CV cs.AI 版本更新

Channel-wise Vector Quantization

通道级向量量化

Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Min Li, Jiaqi Wang, Kaicheng Yu

发表机构 * Shanghai Innovation Institute(上海创新研究院) Westlake University(西湖大学) Zhejiang University(浙江大学) Fudan University(复旦大学) JD.COM(京东公司) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出通道级向量量化(CVQ)代替补丁级量化,并基于此设计通道级自回归(CAR)模型,通过逐通道预测实现渐进式细节生成,在图像重建和文本到图像生成中取得优异性能。

详情
AI中文摘要

我们提出了通道级向量量化(CVQ),一种新颖的图像标记化范式,用通道级标记取代补丁级标记。与传统的向量量化(为每个补丁特征向量分配一个离散标记)不同,CVQ 对特征图的每个通道进行量化。这种表示将图像表示为视觉细节的离散层级,而不是空间补丁的网格。基于 CVQ,我们引入了一种新的视觉自回归框架,采用“下一通道预测”。我们的通道级自回归(CAR)模型不是按光栅顺序逐补丁渲染图像,而是顺序预测图像通道,逐步生成更丰富的视觉细节。具体来说,它首先勾勒全局结构,然后细化细粒度属性,类似于人类艺术家的创作流程。实验表明:(1)CVQ 在 16K+ 的码本大小下实现了 100% 的码本利用率,无需任何额外技巧,并且显著提高了传统 VQ 的重建质量;(2)CAR 在 DPG 评分中达到 86.7,在 GenEval 评分中达到 0.79,展示了其在文本到图像生成中的强大有效性。

英文摘要

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.

2605.30290 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Self-Trained Verification for Training- and Test-Time Self-Improvement

自训练验证用于训练和测试时的自我改进

Chen Henry Wu, Aditi Raghunathan

发表机构 * arXiv

AI总结 提出自训练验证(STV)方法,通过让验证器模仿参考解决方案下的自身版本,解决自我改进中验证器瓶颈问题,在测试时显著提升验证-细化循环,在训练时通过验证器在环训练(ViL)进一步提升生成器性能。

详情
AI中文摘要

大规模自我改进一直是推理模型的长期目标,有两个自然的实现阶段:测试时,通过验证-细化(V-R)循环;训练时,通过自训练方法。两者都受限于同一个瓶颈:验证器。当验证器得分膨胀而准确率停滞,且反馈过于泛化无法执行时,V-R循环会停滞;当糟糕的自生成数据被加入训练时,自训练同样会失败。更好的验证将解锁两者,但我们想要训练的能力,即捕捉自生成的错误,缺乏训练信号。为了解决这一挑战,我们提出了自训练验证(STV)。我们的关键观察是,虽然模型单独无法捕捉这些错误,但当它看到参考解决方案时却可以。我们将这种不对称性转化为监督目标,训练验证器模仿自身更具信息量的版本。在测试时,STV在困难问题上显著改进了V-R循环,而替代方法(如SFT、对验证器分数进行RL,甚至元验证器)则不然。STV在困难数学任务上大致使准确率翻倍,在科学推理任务上提升14倍(从1.5%到21%)。在训练时,我们额外使用STV验证器在V-R循环内的反馈对生成器进行RL训练——我们称之为验证器在环训练(ViL)。从一个RL收敛的生成器开始,ViL在pass@1上进一步获得33%的提升。更值得注意的是,生成器在测试时无验证器的独立pass@1相对标准RL收敛点提升了30%。因此,困难问题推理的下一个前沿可能在于我们如何训练用于验证和与验证结合的方法。网站:https://ar-forum.github.io/stv-webpage

英文摘要

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

2605.30280 2026-06-02 cs.RO cs.AI cs.CL 版本更新

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen-VLA:统一跨任务、环境和机器人形态的视觉-语言-动作建模

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出Qwen-VLA,一种基于DiT动作解码器的统一具身基础模型,通过大规模联合预训练和具身感知提示,将操作、导航和轨迹预测统一为动作-轨迹预测框架,实现跨任务、环境和机器人形态的泛化。

Comments 34 pages

详情
AI中文摘要

具身智能通常通过针对单个任务(如操作或导航)的专用模型进行研究,导致能力碎片化,且跨任务、环境和机器人形态的泛化能力有限。在这项工作中,我们研究了异构的具身决策问题是否可以在单个视觉-语言-动作模型中统一。我们提出了Qwen-VLA,一个统一的具身基础模型,它通过基于DiT的动作解码器将Qwen的视觉-语言建模栈从感知、理解和推理扩展到连续动作和轨迹生成。Qwen-VLA通过大规模联合预训练方案在多样化的数据源上进行训练,包括机器人操作轨迹、人类自我中心演示、合成模拟数据、视觉-语言导航数据、轨迹中心监督和辅助视觉-语言数据。为了支持多种机器人平台,我们引入了具身感知提示调节,其中特定于机器人的文本描述指定了当前的具身形态和控制约定。我们进一步将操作、导航和轨迹预测统一为一个动作-轨迹预测框架,实现了跨机器人形态、任务族和环境的可迁移视觉基础、空间推理和连续动作生成。在操作、导航和轨迹中心基准上的实验显示,在场景布局、背景、光照、物体配置和机器人形态变化下,具有一致的多任务性能和分布外泛化能力。Qwen-VLA-Instruct在LIBERO上达到97.9%,在Simpler-WidowX上达到73.7%,在RoboTwin-Easy/Hard上达到86.1%/87.2%,在R2R上达到69.0% OSR,在RxR上达到59.6% SR,在真实世界ALOHA实验中平均OOD成功率为76.9%,在DOMINO动态操作上零样本成功率为26.6%。

英文摘要

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

2605.30188 2026-06-02 cs.LG cs.AI stat.ML 版本更新

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

CalArena:大规模事后校准基准

Eugène Berta, David Holzmüller, Francis Bach, Michael I. Jordan

发表机构 * Inria - Ecole Normale Supérieure PSL Research University(法国国家科学研究中心-巴黎高等师范学院-巴黎-萨克雷大学)

AI总结 提出CalArena大规模标准化基准,涵盖近2000个实验,通过事后改进(PHI)原则比较多种校准方法,发现平滑校准函数优于分箱方法,专用多类方法在高维场景中至关重要。

Comments 30 pages, 9 figures

详情
AI中文摘要

可靠的概率估计在许多机器学习应用中至关重要,但现代分类器往往校准不佳。事后校准提供了一种简单且广泛使用的解决方案,但由于提出的方法众多,加上小规模和不一致的评估,很难确定哪些方法在实践中真正有效。我们引入了一个大规模、标准化的事后校准基准,涵盖表格和计算机视觉任务的近2000个实验,包括二分类、多分类和大规模分类设置。我们的基准汇集了来自多种经典模型、现代深度学习架构和基础模型的预测,并在通用评估框架内提供了数十种校准方法的统一、可重复实现。我们认为,在适当评分规则下的事后改进(PHI)为比较事后方法提供了传统校准误差估计器的原则性替代方案,同时捕捉校准质量和模型预测性能的潜在退化。利用这一框架,我们进行了迄今为止最全面的事后校准实证研究。我们的结果揭示了跨领域的一致模式:平滑校准函数优于基于分箱的方法,专用多类方法在高维设置中至关重要,而通用机器学习模型在没有校准特定设计的情况下不具备竞争力。为促进未来研究,我们发布了所有数据、代码和评估工具,为开发和比较校准方法提供了一个即插即用的基准。

英文摘要

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.

2605.30169 2026-06-02 cs.CY cs.AI cs.MA 版本更新

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

分离性身份:语言模型代理缺乏声誉机制的基础

Botao Amber Hu, Helena Rong, Max Van Kleek

发表机构 * University of Oxford(牛津大学) New York University Shanghai(纽约大学上海分校)

AI总结 本文指出语言模型代理因本体上的分离性(模块可替换、身份流动)而无法满足声誉机制所需的身份持续性、行为可预测性和制裁敏感性,从而提出转向基于可观察性、事前、构成性、协议的行为约束。

Comments Accepted by FaccT 2026

详情
AI中文摘要

随着自主语言模型代理的激增,形成了一个具有现实后果的新兴代理网络,您可以使用哪些可信信号来决定是否信任并委托一个陌生的代理?自然的治理直觉是将人类身份验证和声誉机制从“了解你的客户”和信用评分扩展到“了解你的代理”制度。然而,我们认为这种类比从根本上是不完整的。声誉机制既作为社会信号,也作为纠正性反馈,维持可信行为的均衡,其前提是存在与行为连续性、制裁敏感性和昂贵不可替代性相关的持久身份。但语言模型代理在本体上是分离性的:它们本质上是可修改模块的集合——基础模型、系统提示、工具访问策略、外部记忆,在某些情况下还包括整个多代理系统——任何模块都可能改变代理行为,并且具有流动的人格,容易受到对抗性攻击,且可能不会内化制裁。借鉴分离性身份障碍的法理学,这种分离性使得代理缺乏可识别性、可预测性、可信性和可恢复性的基础——而这些正是声誉机制旨在维持的属性——从而破坏了信任。我们认为,基于身份的事后、规制性、制裁性的治理(如声誉)在结构上不适用于分离性代理,并建议转向基于可观察性的事前、构成性、协议性的行为约束。

英文摘要

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically \emph{dissociative}: they are essentially an assemblage of mutable modules -- foundation models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole -- any of which may change agent behavior -- with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability -- the very properties that reputation mechanisms aim to sustain -- thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.

2605.30122 2026-06-02 cs.LG cs.AI 版本更新

Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression

超越MSE:利用多分位数回归改进降水临近预报

Gijs van Nieuwkoop, Siamak Mehrkanoon

发表机构 * Department of Information and Computing Sciences, Utrecht University(信息与计算科学系,乌特勒支大学)

AI总结 本文提出将确定性降水临近预报模型的训练目标从均方误差(MSE)改为多分位数回归损失,使用SmaAt-UNet模型在荷兰雷达降水数据上验证,使中心确定性预测的测试集MSE降低8.6%,并输出高分位数预测以改善强降水预测。

Comments 7 pages, 5 figs

详情
AI中文摘要

深度学习降水临近预报模型通常使用逐点损失(如均方误差或平均绝对误差)进行优化,这可能导致预测过于平滑且对强降雨的表示较差。本研究探讨了是否可以通过将训练重新表述为多分位数回归问题来提高已建立的确定性临近预报架构的预测性能。使用SmaAt-UNet作为核心模型,我们在荷兰雷达降水临近预报上比较了MSE、MAE和多分位数pinball损失训练。结果表明,多分位数训练改进了中心确定性预测,与使用MSE训练的模型相比,测试集MSE降低了8.6%,同时产生的高分位数输出对强降水的风险敏感预测很有用。这些发现表明,分位数回归提供了一种简单的替代标准逐点损失的方法,无需新的架构或生成采样过程。我们模型和训练设置的实现可在GitHub上获取。

英文摘要

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi-quantile regression problem. Using SmaAt-UNet as a core model, we compare MSE, MAE, and multi-quantile pinball-loss training on radar precipitation nowcasting over the Netherlands. The results show that multi-quantile training improves the central deterministic forecast, decreasing test-set MSE by 8.6\% compared to a model trained using MSE, while also producing upper-quantile outputs that are useful for risk-sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on \href{https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting}{GitHub}.

2605.30000 2026-06-02 cs.AI 版本更新

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Cookie-Bench: 面向网页生成的连续屏幕按键交互评估

Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

发表机构 * Baidu Inc.(百度公司) Beijing, China(中国北京)

AI总结 提出一种无参考、自主驱动且整体推理的网页生成评估框架Cookie-Bench,通过元认知监控分阶段收集证据并评分,与专家评分高度一致。

详情
AI中文摘要

前端网页代码已成为每个前沿LLM发布的核心产品面,但以开发速度评估这些交互式应用仍然成本高昂,因为像Arena这样的人类评判排行榜无法扩展。现有的自动化代理通常依赖参考实现、测试套件或严格检查清单,并且往往遗漏人类评审员在实时会话中进行的推理综合。我们阐述了一种新的评估体系,该体系同时具有无参考、自主驱动和整体推理的特点,并通过两个工件实例化。\textbf{\dataname}是一个涵盖11个领域、54个叶节点、1000个查询的WebDev基准测试,包括静态展示和交互式应用任务,在三个难度等级和三个目标语言组之间平衡,并且重写了任务简介以防止从流传的提示中回忆。\textbf{\framename}基于Flavell的元认知监控,将证据收集与判断分离为三个阶段:静态感知通过被动观察形成第一印象;代理驱动交互自主探索应用,同时捕获连续屏幕视频、音频和每步截图;动态评分仅在证据链完成后发出整体功能和美学评判,并附带结构化失败归因。在\dataname上,\framename与专家人类评分高度一致,同时揭示了13个前沿LLM在交互式网页生成上的显著提升空间。\noindent https://anonymous.4open.science/r/Cookie-3CE/

英文摘要

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. \textbf{\dataname} is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. \textbf{\framename}, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On \dataname, \framename aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. \noindenthttps://anonymous.4open.science/r/Cookie-3CE/

2605.29948 2026-06-02 cs.SD cs.AI eess.AS 版本更新

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

HoliTok: 一种具有鲁棒的双重语音生成与理解能力的连续整体式分词

Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China(X-LANCE实验室,计算机科学学院,上海交通大学,中国) hi lab, Xiaohongshu Inc, China(hi实验室,小红书公司,中国)

AI总结 提出HoliTok连续整体式语音分词模型,通过渐进训练策略联合保持信号保真度、融入语义信息并维持潜在可学习性,基于该分词构建统一AR+DiT模型实现语音合成与识别,实验证明其在统一生成-理解架构中无需额外优化即可鲁棒运行。

Comments 14 pages, 2 figures, 8 tables

详情
AI中文摘要

统一的语音基础模型需要一个整体式的分词空间,该空间既要能被语言模型学习,又要能解码为高质量波形。然而,现有的语音分词器往往无法同时满足这些要求,导致架构复杂度和训练设计增加。我们提出HoliTok,一种用于统一生成-理解建模的连续整体式语音分词模型。HoliTok将48 kHz语音编码为紧凑的25 Hz序列,包含128维潜在向量。它采用渐进策略进行训练,联合保留信号级保真度、融入语义信息并保持强大的潜在可学习性。基于此分词,我们构建了一个统一的AR+DiT模型用于语音合成和识别,其中相同的潜在序列既支持生成特定任务,也支持统一的生成-理解任务。实验表明,HoliTok实现了有竞争力的重建保真度,提高了高质量和可控合成中的生成可学习性,并且在评估的表示中,它是唯一一个在我们的统一生成-理解架构中无需额外优化技巧即可鲁棒运行的表示。这些结果表明,HoliTok作为一种有效的语音分词器,为统一口语建模提供了基础的表示接口。代码可在 https://github.com/bovod-sjtu/HoliTok 获取。

英文摘要

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48~kHz speech into a compact 25~Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.

2605.27864 2026-06-02 cs.AI 版本更新

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

FundaPod: 一个具有知识图谱记忆的多角色智能体平台,用于AI辅助的基础投资研究

Di Zhu, Lei Nico Zheng, Zihan Chen

发表机构 * Stevens Institute of Technology(史蒂文斯理工学院) UMass Boston(马萨诸塞大学波士顿分校)

AI总结 提出FundaPod平台,通过多角色独立研究、知识图谱记忆和事后裁决机制,支持人类投资经理进行透明、可验证的基础投资决策。

Comments 32 pages; 12 figures

详情
AI中文摘要

大型语言模型(LLMs)在金融领域的应用日益增多,但现有工作大多强调交易信号或围绕预测的金融自然语言处理任务。相比之下,机构基础研究需要人类分析师或AI智能体收集证据、识别业务驱动因素、比较竞争观点并生成投资备忘录。其更广泛的目标不仅是预测结果,而是产生透明、可重用和可验证的投资计划,同时促进投资知识的累积发展。我们提出了FundaPod,一个用于AI辅助基础投资研究的多角色智能体平台。我们认为基础研究是一项以人为中心的决策支持任务,在本质上与交易信号生成不同,因此更适合采用保持独立性的架构。在FundaPod中,具有不同角色(如价值投资者或宏观策略师)的AI智能体在共享溯源契约下独立进行研究。他们的分歧随后通过知识图谱记忆系统事后呈现,供人类投资组合经理(PM)裁决。本文基于设计科学实践以及认知隔离和人机协调理论,提出了支持基础研究的人机混合系统的五项设计原则。它还描述了四种架构机制:将公开投资者资料转化为可部署智能体的角色提炼管道;允许规划器推导类型化任务图的声明式技能注册表;将备忘录声明与可验证来源联系起来的基于证据的模型;以及连接股票代码、备忘录、分析师和主题的知识图谱“第二大脑”。我们通过一个完整的案例研究和基于角色的备忘录比较来展示该架构。

英文摘要

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2605.13548 2026-06-02 cs.RO cs.AI 版本更新

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

AttenA+: 纠正机器人基础模型中的动作不平等性

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie, Jian Guo, Ping Luo, Andrew F. Luo, Boyu Zhou, Jun Ma

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKU(香港大学) USTC(中国科学技术大学) IDEA Research(IDEA研究院) SUSTech(南方科技大学) X-Humaniod

AI总结 针对机器人基础模型忽视动作物理重要性的问题,提出AttenA+框架,通过速度驱动的动作注意力重加权训练目标,提升复杂长程任务性能。

详情
AI中文摘要

现有的机器人基础模型虽然强大,但基于一个隐含的时间同质性假设:在优化过程中将所有动作视为同等信息量。这种从语言模型继承的“平坦”训练范式,对操作的内在物理层次结构无动于衷。实际上,机器人轨迹本质上是异质的,其中低速段通常通过需要精确交互来决定任务成功,而高速运动则作为容错过渡。这种均匀损失权重与物理关键性之间的错位从根本上限制了当前视觉-语言-动作(VLA)模型和世界-动作模型(WAM)在复杂长程任务中的性能。为了纠正这一点,我们引入了AttenA+,一个与架构无关的框架,通过速度驱动的动作注意力优先考虑运动学关键段。通过基于逆速度场重新加权训练目标,AttenA+自然地使模型的学习能力与操作的物理需求对齐。作为一种即插即用的增强,AttenA+可以集成到现有骨干网络中,无需结构修改或额外参数。大量实验表明,AttenA+显著提升了当前最先进模型的上限。具体来说,它在Libero基准上将OpenVLA-OFT提升至98.6%(+1.5%),并将FastWAM在RoboTwin 2.0上推进至92.4%(+0.6%)。在Franka机械臂上的真实世界验证进一步展示了其鲁棒性和跨任务泛化能力。我们的工作表明,挖掘动作序列的内在结构先验为标准缩放定律提供了一种高效、物理感知的补充,为通用机器人控制开辟了新路径。

英文摘要

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.

2605.07804 2026-06-02 cs.LG cs.AI 版本更新

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Prune-OPD:面向长程推理的高效可靠在线策略蒸馏

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) MBZUAI University of California, Merced(加州大学默塞德分校) Sun Yat-sen University(中山大学)

AI总结 提出Prune-OPD框架,通过实时检测学生与教师之间的前缀漂移并动态截断不可靠的轨迹,在减少计算浪费的同时保持或提升长程推理任务的性能。

Comments 17 pages, 8 figures

详情
AI中文摘要

在线策略蒸馏(OPD)利用密集的教师奖励来增强推理模型。然而,将OPD扩展到长程任务暴露了一个关键缺陷:随着学生生成的前缀不可避免地偏离教师的思维过程,教师的密集奖励失去了局部可开发性。继续在这些“漂移”轨迹上生成和评估标记不仅会降低奖励质量,还会导致巨大的计算浪费。为了解决这个问题,我们引入了 extbf{Prune-OPD},一个动态地将训练预算与监督质量对齐的框架。通过持续监控学生和教师预测之间的局部兼容性(例如,通过top-$k$重叠),Prune-OPD实时检测前缀漂移事件。一旦检测到严重漂移,它会单调地降低后续不可靠奖励的权重,并触发动态的轨迹截断。这使得训练过程能够停止无效的生成,并将计算重新分配到可靠的教师监督上。在不同的教师-学生组合中,Prune-OPD始终将计算与监督可靠性对齐。当前缀漂移使得密集的教师奖励不可靠时,它减少了37.6\%--68.0\%的训练时间,同时保持甚至提升了在具有挑战性的基准(AMC、AIME、HMMT)上的性能。当学生-教师兼容性保持较高时,它会通过扩展训练窗口自动保留长上下文监督。这些结果表明,Prune-OPD不是通过盲目缩短轨迹来改进OPD,而是通过将计算重新分配到局部可开发的教师奖励上。

英文摘要

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these ``drifted'' trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce \textbf{Prune-OPD}, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-$k$ overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6\%--68.0\% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.

2602.14307 2026-06-02 cs.AI cs.LG 版本更新

Benchmarking at the Edge of Comprehension

在理解边缘的基准测试

Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出Critique-Resilient Benchmarking框架,通过对抗性生成-评估游戏在人类理解受限时比较模型,利用批判韧性正确性概念和分项Bradley-Terry模型对LLM进行排序。

详情
AI中文摘要

随着前沿大型语言模型(LLMs)在新基准发布后迅速饱和,基准测试本身正处于一个转折点:如果前沿模型持续改进,人类将越来越难以生成具有区分度的任务、提供准确的真实答案或评估复杂解决方案。如果基准测试变得不可行,我们衡量AI进展的能力将受到威胁。我们将这种情况称为后理解阶段。在这项工作中,我们提出了Critique-Resilient Benchmarking,一种对抗性框架,旨在即使在人类完全理解不可行的情况下也能比较模型。我们的技术依赖于批判韧性正确性的概念:如果没有对手令人信服地证明答案错误,则该答案被视为正确。与标准基准测试不同,人类充当有界验证者,专注于局部声明,从而在超出任务完全理解的情况下保持评估完整性。使用分项二分Bradley-Terry模型,我们联合对LLM进行排序,依据其解决挑战性任务的能力和生成困难但可解问题的能力。我们在数学领域展示了该方法在八个前沿LLM上的有效性,表明所得分数稳定且与外部能力度量相关。我们的框架将基准测试重新定义为一种对抗性生成-评估游戏,其中人类作为最终裁决者。

英文摘要

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.

2605.29539 2026-06-02 cs.CV cs.AI 版本更新

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

GiPL: 用于跨域小样本目标检测的生成增强迭代伪标签方法

Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 提出GiPL双分支训练框架,通过迭代伪标签自训练和生成数据增强,解决跨域小样本目标检测中支持集利用不足和过拟合问题。

Comments CVPR 2026 Workshop

详情
AI中文摘要

视觉语言基础模型在跨域小样本目标检测(CD-FSOD)中展现出有前景的零样本泛化能力。然而,它们在微调过程中面临两个关键挑战:由于稀疏的单实例标注导致支持集利用不足,以及在极有限的域目标样本下严重过拟合。为解决这些问题,本文提出GiPL,一个高效的双分支训练框架。在第一个分支中,我们设计了一种迭代伪标签自训练范式,该范式对支持集进行零样本推理以生成可靠的伪标注,将其与真实标签融合,并迭代优化模型以充分利用支持集数据。在第二个分支中,我们引入了使用大型视觉语言模型的生成数据增强流程,该流程合成域对齐、多目标标注的图像以丰富训练样本并抑制过拟合。在三个具有挑战性的CD-FSOD数据集(RUOD、CARPK、CarDD)上,在1/5/10样本设置下的大量实验表明,GiPL始终以显著的性能提升优于最先进的方法。代码可在\href{https://github.com/z-yaz/CDiscover}{CDiscover}获取。

英文摘要

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at \href{https://github.com/z-yaz/CDiscover}{CDiscover}.

2605.29488 2026-06-02 cs.CV cs.AI 版本更新

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

AnyMo: 基于掩码建模的任意模态条件运动生成

Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan

发表机构 * Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China(中国科学院智能信息处理重点实验室(中国科学院计算技术研究所,中国)) University of Chinese Academy of Sciences, China(中国科学院大学)

AI总结 提出AnyMo框架,结合残差FSQ运动分词器和可扩展掩码建模Transformer,利用大规模多模态对齐数据集OmniHuMo实现任意模态组合下的高质量人体运动生成。

详情
AI中文摘要

条件人体运动生成仍然是计算机视觉和机器人学中的一个基本挑战。尽管取得了显著进展,当前方法通常受限于固定的模态配置和特定任务架构,跨模态交互和多模态条件合成的扩展规律在很大程度上仍未得到充分探索。一个关键瓶颈是缺乏大规模模态对齐的运动数据,限制了跨不同控制信号的泛化能力。在这项工作中,我们引入了OmniHuMo,一个大规模、高质量的数据集,包含超过5000小时的运动和320万条序列,并带有精确对齐的多模态注释(例如,文本、语音、音乐和轨迹)。利用OmniHuMo,我们提出了AnyMo,一个统一的多模态框架,结合了基于残差FSQ的运动分词器与可扩展的掩码建模Transformer,能够在任意模态组合下实现高质量的运动合成。大量实验表明,AnyMo在提供对空间和风格属性的灵活控制的同时,实现了高保真合成。

英文摘要

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

2605.29463 2026-06-02 cs.LG cs.AI 版本更新

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

诚实撒谎:理解反射型智能体中的记忆虚构

Prakhar Dixit, Sadia Kamal, Tim Oates

发表机构 * University of Cambridge(剑桥大学)

AI总结 研究反射型智能体在自我反思中产生记忆虚构的问题,提出反射重复率(RRR)指标检测该现象,并通过程序化提取失败信号缓解问题。

Comments Accepted to ICML 2026 Workshop "Failure Modes in Agentic AI"

详情
AI中文摘要

反射型智能体依赖自我生成的反思作为记忆,隐含地假设智能体能够准确诊断自己的失败。我们表明这一假设可能系统性地失败:在ALFWorld和HumanEval中,智能体存储自信但错误的任务解释,并在多次试验中继续据此行动,尽管每次环境都重置为正确任务。我们将这种失败模式称为记忆虚构,并引入反射重复率(RRR),一种基于日志的指标,用于检测对错误反思内容的重复依赖。使用RRR,我们在ALFWorld中识别出16个冻结环境,其中121条反思中0条提及正确目标对象,在HumanEval中有4个类似案例。我们的缓解方法用程序化提取轨迹级失败信号替代开放式自我诊断,将正确对象提及率从0%提升至86%,RRR从0.64降至0.10,并解决了16个冻结ALFWorld环境中的3个,表明反思记忆可能强化而非纠正错误信念。

英文摘要

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory confabulation and introduce the Reflection Repetition Rate (RRR), a log-based metric that detects repeated reliance on incorrect reflective content. Using RRR, we identify 16 frozen environments in ALFWorld, where 0 of 121 reflections mention the correct target object, and 4 analogous cases in HumanEval. Our mitigation replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention from 0% to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments, suggesting that reflective memory can reinforce false beliefs rather than correct them.

2605.29233 2026-06-02 cs.LG cs.AI 版本更新

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

BlockBatch: 面向高效扩散语言模型推理的多尺度共识解码

Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu, Yingyan Celine Lin

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出BlockBatch框架,通过多分支并行解码和置信度门控合并,在不训练的情况下加速扩散语言模型推理,平均减少26.6%的去噪步数并实现1.33倍端到端加速。

Comments 23 pages, including references and appendices

详情
AI中文摘要

扩散语言模型(dLLMs)通过并行迭代去噪多个标记位置来生成文本,为严格自回归解码提供了有吸引力的替代方案。然而,在实际中,块状dLLM推理暴露了难以权衡的粒度问题:小块保留局部条件但需要更多去噪步骤,而大块暴露更多并行性但可能做出过早承诺并累积缓存误差。现有加速方法通常为每个请求选择单一块大小,未利用块大小之间的互补性。我们表明块大小本身是一个有用的分支维度。不同块大小产生相关但非相同的KV缓存轨迹:分支通常共享初始前缀,在语义决定性位置分叉,并在句法轻量级标记上后来达成一致。受此结构启发,我们提出BlockBatch,一种无需训练的在线推理框架,在批处理前向传递中为同一请求执行多个块大小分支。BlockBatch通过置信度门控标记合并、基于领导者的同步和周期性全序列刷新来协调这些分支,将局部块更新重新锚定到全局一致的KV状态。在3个代表性dLLM和4个数据集上,BlockBatch平均减少26.6%的去噪NFE,并在保持准确性的同时实现比Fast-dLLM平均1.33倍的端到端加速。这些结果将块大小多样性确定为分支并行dLLM推理中一个实用且先前未被充分探索的维度。

英文摘要

Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alternative to strictly autoregressive decoding. In practice, however, block-wise dLLM inference exposes a difficult granularity trade-off: small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. Existing acceleration methods typically choose a single block size per request, leaving the complementarity among block sizes unused. We show that block size itself is a useful branching dimension. Different block sizes induce related but non-identical KV-cache trajectories: branches often share an initial prefix, bifurcate at semantically decisive positions, and later agree on syntactically lightweight tokens. Motivated by this structure, we propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass. BlockBatch coordinates these branches through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes that re-anchor local block updates to a globally consistent KV state. Across 3 representative dLLMs and 4 datasets, BlockBatch reduces denoising NFEs by 26.6\% on average and achieves a 1.33$\times$ average end-to-end speedup over Fast-dLLM while preserving accuracy. These results identify block-size diversity as a practical and previously underexplored axis for branch-parallel dLLM inference.

2605.29183 2026-06-02 cs.LG cs.AI 版本更新

TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints

TIMEGATE: 资源约束下可持续的限时促销门控用于持续ML适应

Abhijit Chakraborty, Suddhasvatta Das, Yash Shah, Vivek Gupta, Kevin A. Gary

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 提出TIMEGATE策略层,通过预算时间、标注、训练和评估来管理持续ML适应,实现评估计算节省且无静默错误促销。

详情
AI中文摘要

随着机器学习(ML)系统向持续适应演进,每个重新训练周期都会消耗计算、标注和能源。我们引入TIMEGATE,一个通过预算时间、标注、训练和评估来管理适应的策略层。TIMEGATE发出一个度量可用性信号M,用于部分与完整评估决策。我们验证:(i)在Adult表格数据上,标注优于训练2.3倍;(ii)它迁移到LLaMA-3.1-8B + QLoRA在SST-2上(准确率从0.80到0.96;35/36次运行中M=1);(iii)M具有信息性,28单元敏感性显示在严格阈值下M降至0.81;(iv)100周期模拟实现66%的评估计算节省,且无静默错误促销;(v)在单个H200上,LLaMA的10%切片评估使用89%更少的挂钟时间和能源(比率一致至0.2%)。

英文摘要

As machine learning(ML) systems evolve to continual adaptation, each re-training cycle uses compute, annotation, and energy. We introduce TIMEGATE, a policy layer managing adaptation by budgeting time, labeling, training, and evaluation. TIMEGATE emits a metric-availability signal M for partial vs. full-evaluation decisions. We validate: (i) labeling outperforms training by 2.3x on Adult tabular; (ii) it transfers to LLaMA-3.1-8B + QLoRA on SST-2 (accuracy 0.80 to 0.96; M =1 in 35/36 runs); (iii) M is informative, 28-cell sensitivity shows M drops to 0.81 at tight thresholds; (iv) 100-cycle simulation achieves 66% evaluation-compute savings with no silent mis-promotions; (v) 10%-slice evaluation on LLaMA uses 89% less wall-clock and energy on a single H200 (ratios agree to 0.2%).

2605.29107 2026-06-02 cs.CR cs.AI 版本更新

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

GEO-Bench: 生成式引擎优化中的排名操纵基准测试

Ojas Nimase, Zhe Chen, Gengpei Qi, Yue Zhao, Xiyang Hu

发表机构 * University of Southern California(南加州大学) Arizona State University(亚利桑那州立大学)

AI总结 提出GEO-Bench基准,统一评估生成式引擎优化中的排名操纵攻击,比较黑盒提示攻击、白盒梯度攻击和白帽策略的有效性与隐蔽性。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地对用户查询的产品、文档和推荐进行排名,这使得操纵这些排名成为公平性和信息完整性日益关注的问题。关于生成式引擎优化(GEO)的研究已经产生了许多操纵方法,但每种方法都在自己的数据集上使用自己的指标进行评估,因此它们的相对强度和可检测性仍不清楚。我们提出了GEO-Bench,这是一个在统一协议下评估GEO排名操纵攻击的基准。它统一了黑盒提示攻击(TAP、Zero-Shot)、白盒梯度攻击(STS、RAF、StealthRank)以及十种白帽C-SEO策略。我们针对一个固定的开源权重排序器(Llama-3.1-8B-Instruct)在五个数据集上对每种方法进行评分,使用有效性(NRG、Success@α、Promote@α)和隐蔽性(关键词违规率、困惑度比率)指标。我们的评估表明,对抗性攻击在有效性和隐蔽性之间存在权衡;黑盒内容重写在排名提升方面达到或超过梯度攻击,同时生成更流畅的文本,并且可以在某些领域逃避基于关键词和困惑度的检测;访问模型并不能预测攻击强度。通过标准化数据集、攻击实现和指标,GEO-Bench实现了这些攻击范式之间的首次直接比较,并支持检测方法的开发。

英文摘要

Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these rankings a growing concern for fairness and information integrity. Research on generative engine optimization (GEO) has produced many manipulation methods, but each is evaluated on its own dataset with its own metrics, so their relative strength and detectability stay unclear. We present GEO-Bench, a benchmark that evaluates GEO ranking-manipulation attacks under one protocol. It unifies black-box prompt-based attacks (TAP, Zero-Shot), white-box gradient-based attacks (STS, RAF, StealthRank), and ten white-hat C-SEO strategies. We score every method on five datasets against a fixed open-weight ranker (Llama-3.1-8B-Instruct), using metrics for both effectiveness (NRG, Success@α, Promote@α) and stealth (keyword violation rate, perplexity ratio). Our evaluation shows that effectiveness and stealth trade off across adversarial attacks, that black-box content rewriting matches or exceeds gradient-based attacks on rank promotion while producing more fluent text and can evade both keyword- and perplexity-based detection on some domains, and that the access model does not predict attack strength. By standardizing datasets, attack implementations, and metrics, GEO-Bench enables the first direct comparison across these attack paradigms and supports the development of detection methods.

2605.26092 2026-06-02 cs.LG cs.AI 版本更新

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

GoQuant: 用于无乘法器二次幂变压器量化的几何正交残差投影

Maoyang Xiang, Tao Luo, Bo Wang

发表机构 * Information Systems Technology and Design(信息系统技术与设计) Singapore University of Technology and Design(新加坡科技设计大学) Institute of High Performance Computing (IHPC)(高性能计算研究所) Agency for Science, Technology and Research (A*STAR)(科技研究局)

AI总结 针对低比特量化中二次幂格式的低角度分辨率问题,提出几何正交残差投影量化(GoQuant),通过双基几何投影和移位加操作合成高分辨率残差格点,实现硬件高效且无需乘法器的量化方法。

详情
AI中文摘要

大型语言模型(LLMs)和视觉变换器(ViTs)在边缘设备上的部署受到内存限制和密集乘加(MAC)阵列引入的关键时序瓶颈的显著约束。在超低比特范围内,对数二次幂(PoT)量化通过用位移操作替代MAC操作,提供了一种硬件高效的替代方案。然而,非均匀指数格点固有地受到低角度分辨率机制的局限,这一结构缺陷在低于4比特阈值时尤为突出,导致高维特征流形的显著退化。为解决这一几何限制,我们提出了几何正交残差投影量化(GoQuant),一种算法-硬件协同设计框架。通过将量化表述为双基几何投影,GoQuant使用严格的移位加操作自适应地合成更高分辨率的残差格点。此外,其解析求解器为计算密集的梯度优化提供了实用替代方案,将LLaMA-2-7B的全模型校准时间减少到约15分钟。广泛评估表明GoQuant在多种模态下的适用性和硬件效率。在3比特(W3/A16)约束下,它在LLaMA-2-7B上实现了6.10的困惑度,与依赖非对称缩放的常规MAC密集型基线(如AWQ)相比具有竞争力,同时在4比特场景下保持竞争性精度。在硅片层面,28nm节点的标准单元RTL综合表明,GoQuant有效缓解了与密集乘法器树相关的时序瓶颈。通过展平组合逻辑深度,我们的并行移位加数据路径将关键路径延迟降低至0.35纳秒。

英文摘要

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a \textbf{Low Angular Resolution Regime}, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds. To address this geometric limitation, we propose Geometric Orthogonal Residual Projection Quantization (GoQuant), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, GoQuant adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, its analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Extensive evaluations demonstrate GoQuant's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, it achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28nm node indicates that GoQuant effectively mitigates the timing bottlenecks associated with dense multiplier trees. By flattening the combinational logic depth, our parallel shift-and-add datapath reduces the critical path delay to 0.35 ns.

2605.13511 2026-06-02 cs.CL cs.AI 版本更新

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Many-Shot CoT-ICL: 使上下文学习真正学习

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

发表机构 * The University of Hong Kong(香港大学)

AI总结 研究多示例思维链上下文学习在推理任务中的特性,提出曲线演示选择方法,在数学任务上提升5.42个百分点。

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然多示例ICL取得了显著性能,但先前对其缩放行为的研究主要关注非推理任务。在这项工作中,我们研究了推理任务上的多示例ICL,特别关注多示例思维链上下文学习(CoT-ICL)。通过分析非推理和推理任务以及非推理和推理导向的LLM,我们识别出多示例CoT-ICL的几个独特性质。我们进一步将这些发现解释为多示例CoT-ICL是上下文测试时学习而非缩放模式匹配,并提出两个原则:(i)演示应易于目标模型理解,(ii)它们应按顺序排列以支持平滑的概念进展。受该原则指导,我们提出了曲线演示选择(CDS),一种简单的排序方法,在具有64个演示的数学任务上获得了高达5.42个百分点的提升。总体而言,我们的结果将长上下文窗口从检索缓冲区重新定义为上下文测试时学习的结构化课程。

英文摘要

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In this work, we study many-shot ICL on reasoning tasks, with a particular focus on many-shot chain-of-thought in-context learning (CoT-ICL). Analyzing across non-reasoning and reasoning tasks and across non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We further interpret these findings by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on a math task with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

2604.19532 2026-06-02 cs.SD cs.AI 版本更新

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

BEAT: 通过均匀时间步对符号音乐进行分词和生成

Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

发表机构 * South China University of Technology(南方科技大学) National University of Singapore(新加坡国立大学) Mohamed bin Zayed University of Artificial Intelligence(莫扎德·本·扎耶德人工智能大学) New York University(纽约大学)

AI总结 提出一种以均匀节拍为基本单元的分词方法,将同一时间步内相同音高的所有事件编码为一个令牌,并在音乐续写和伴奏生成任务中验证其相比传统事件基方法能提升音乐质量和结构连贯性。

详情
AI中文摘要

将音乐分词以适应语言模型的通用框架是一个具有挑战性的问题,特别是考虑到音乐可以表示的各种符号结构(例如,序列、网格和图)。迄今为止,大多数方法将符号音乐分词为音乐事件序列,如起始、音高、时移或复合音符事件。这种策略直观且已在基于Transformer的模型中证明有效,但它隐式处理了音乐时间的规律性:单个令牌可能跨越不同时长,导致时间进展不均匀。在本文中,我们考虑另一种分词方式是否可能,其中均匀长度的音乐步长(例如,一个节拍)作为基本单元。具体来说,我们将单个时间步内相同音高的所有事件编码为一个令牌,并显式按时间步对令牌进行分组,这类似于钢琴卷帘表示的稀疏编码。我们在音乐续写和伴奏生成任务上评估了所提出的分词方法,并将其与主流事件基方法进行比较。结果表明,所提出的分词方法提高了音乐质量和结构连贯性,而额外分析证实了更高的效率和更有效地捕获长程模式。

英文摘要

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.

2602.07666 2026-06-02 cs.CR cs.AI 版本更新

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

SoK: DARPA 人工智能网络挑战赛 (AIxCC):竞赛设计、架构与经验教训

Cen Zhang, Younggi Park, Fabian Fleischer, Yu-Fu Fu, Jiho Kim, Dongkwan Kim, Youngjoon Kim, Qingxiao Xu, Andrew Chin, Ze Sheng, Hanqing Zhao, Michael Pelican, David J. Musliner, Jeff Huang, Jon Silliman, Mikel Mcdaniel, Jefferson Casavant, Isaac Goldthwaite, Nicholas Vidovich, Matthew Lehman, Taesoo Kim

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Texas A&M University(德克萨斯大学) Smart Information Flow Technologies (SIFT)(智能信息流技术公司) Kudu Dynamics(Kudu动态公司) Microsoft(微软)

AI总结 本文系统分析 DARPA 人工智能网络挑战赛 (AIxCC),探讨其竞赛设计、决赛系统的架构方法,并总结驱动性能的因素、技术进展及未来研究方向。

Comments Camera ready version, systematization of Knowledge and post-competition analysis of DARPA AIxCC (2023-2025)

详情
Journal ref
USENIX Security 2026
AI中文摘要

DARPA 的人工智能网络挑战赛 (AIxCC, 2023--2025) 是迄今为止规模最大的竞赛,旨在构建完全自主的网络推理系统 (CRS),利用人工智能的最新进展——特别是大型语言模型 (LLM)——来发现和修复真实世界开源软件中的漏洞。本文首次对 AIxCC 进行系统分析。基于设计文档、源代码、执行轨迹以及与组织者和参赛团队的讨论,我们审视了竞赛的结构和关键设计决策,描述了决赛 CRS 的架构方法,并分析了最终计分板之外的竞赛结果。我们的分析揭示了真正驱动 CRS 性能的因素,识别了各团队取得的技术进步,并指出了未来研究中仍需解决的局限性。最后,我们总结了组织未来竞赛的经验教训,以及在实际中部署自主 CRS 的更广泛见解。

英文摘要

DARPA's AI Cyber Challenge (AIxCC, 2023--2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI -- particularly large language models (LLMs) -- to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.

2605.25143 2026-06-02 cs.AI cs.LG 版本更新

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

超越前沿:用于高效测试时扩展的随机回溯

Dao Tran, Duc Anh Le, Ngoc Luu, Quan Pham, Tung Pham, Hung Bui

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 提出随机回溯方法,通过维护历史前缀池并利用子池选择和幂回溯序列蒙特卡洛机制,在测试时扩展中实现更高的准确率-令牌数权衡。

详情
AI中文摘要

测试时扩展通过花费额外计算来探索多个解轨迹,从而改进语言模型推理。关键挑战是在推理过程中最大化准确率的同时最小化生成的令牌总数。最近的PRM引导方法对中间前缀进行评分以引导搜索,但大多数方法仅关注前沿:它们只保留当前活动的前缀,并使用带噪声的PRM分数不可逆地剪枝或重采样其余部分。这可能导致过早承诺、多样性崩溃以及丢失仍可产生正确延续的前缀。我们引入了一种基于历史前缀持久池的随机回溯,允许测试时计算重新访问先前生成的状态,而不是仅扩展当前前沿。为了提高效率,我们提出了两种互补机制。子池选择通过随机子池内应用Top-N选择来增强贪婪PRM引导搜索,使历史前缀有机会绕过评分过高的前沿候选。幂回溯序列蒙特卡洛使用幂化PRM分数和混合校正权重,将SMC风格的重采样扩展到持久池。在数学推理基准和模型规模上,我们的方法在每令牌准确率上始终更高,并且与强PRM引导基线相比,仅使用一小部分令牌数即可达到相同的准确率水平,这表明持久池随机回溯为改善测试时扩展中的准确率-令牌权衡提供了一种简单有效的方法。

英文摘要

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.

2605.24828 2026-06-02 cs.AI 版本更新

Test-Time Deep Thinking to Explore Implicit Rules

测试时深度思考以探索隐式规则

Wentong Chen, Xin Cong, Zhong Zhang, Yaxi Lu, Siyuan Zhao, Yesai Wu, Qinyu Luo, Haotian Chen, Yankai Lin, Zhiyuan Liu, Maosong Sun

发表机构 * Renmin University of China(中国人民大学) Department of Statistics and Data Science, Tsinghua University(清华大学统计与数据科学系) School of Computer Science and Engineering, UESTC(UESTC计算机科学与工程学院) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) School of Mathematical Sciences, Nankai University(南开大学数学科学学院) Whiting School, Johns Hopkins University(约翰斯·霍普金斯大学惠特林学院) School of Artificial Intelligence, Shanghai Jiaotong University(上海交通大学人工智能学院)

AI总结 针对智能体在隐式规则环境中失败的问题,提出TTExplore框架,通过训练专用模型Exp-Thinker进行测试时推理,平均提升基线性能14-19点。

详情
AI中文摘要

随着大型语言模型(LLMs)的不断进步,智能体变得越来越重要。然而,这些智能体在由隐式规则——无法直接观察、必须通过交互推断的隐藏约束——支配的环境中常常失败。这导致智能体陷入重复的试错循环,最终导致任务失败。为了应对这一挑战,我们提出了测试时探索(TTExplore)框架,其中思考者组件分析交互历史以推断这些隐式规则并指导行动者。在此设置中,有效的探索关键取决于思考者的推理能力。然而,评估深度推理轨迹本质上不稳定且困难,这对有效训练构成了主要障碍。为了解决这个问题,我们引入了一种新颖且稳定的强化学习流程。核心思想是使用准确的任务级分数作为间接奖励,以绕过评估中间推理的困难,并仅保留每个轨迹的单个思考节点以缓解奖励稀疏性。使用此流程,我们训练了一个专门的7B模型Exp-Thinker。在五个基于文本的具体任务上的实验表明,配备Exp-Thinker的TTExplore将基线智能体性能平均提升了14-19个点,证明了显式推理隐式规则的有效性。

英文摘要

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules--hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of $14$-$19$ points, demonstrating the effectiveness of explicitly reasoning about implicit rules.

2605.24727 2026-06-02 cs.AI cs.CL cs.CY cs.IT math.IT 版本更新

Fundamental Limitation in Explaining AI

解释AI的根本限制

Atsushi Suzuki, Jing Wang

发表机构 * Department of Mathematics Faculty of Science(科学学院数学系) The University of Hong Kong Hong Kong SAR(香港大学香港特别行政区) School of Computing and Mathematical Sciences Faculty of Engineering and Science(工程与科学学院计算与数学科学系) University of Greenwich United Kingdom(格林威治大学英国)

AI总结 本文通过数学证明了一个解释AI的基本四难困境,指出AI及其解释无法同时满足环境复杂性、AI性能优良、解释可解释性和解释完全忠实性四个条件,从而表明AI治理应基于解释忠实性总是不完整的假设。

Comments minor modifications

详情
AI中文摘要

尽管大规模模型如LLMs和扩散模型已取得实际成功,公共机构强调了AI可解释性的重要性。然而,现有的解释AI方法并非旨在提供大规模AI系统行为的完全忠实解释。虽然对AI系统行为的完全忠实且可解释的解释可能对AI治理有用,但尚不清楚提供这样的解释在理论上是否可能。在本文中,我们从数学上证明了解释AI的一个基本四难困境,指出AI及其解释无法同时满足以下四个条件:1)操作环境的复杂性,2)AI性能的优良性,3)AI解释的可解释性,以及4)AI解释的完全忠实性。这个四难困境表明,在大多数我们无法改变环境或牺牲良好AI性能和可解释解释的应用中,我们应该放弃解释的完全忠实性,而应仅针对应用重要的部分进行解释。因此,该四难困境意味着AI治理应基于AI解释的忠实性总是不完整的假设来设计。

英文摘要

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI's performance, 3) the interpretability of the AI's explanation, and 4) the complete faithfulness of the AI's explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

2605.24681 2026-06-02 cs.CL cs.AI 版本更新

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Mix-MoE:通过混合专家混合提升大语言模型的多语言机器翻译

Bo Li, Tianyu Dong, Shaolin Zhu, Deyi Xiong

发表机构 * School of Software, Tsinghua University(清华大学软件学院) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 提出Mix-MoE框架,通过将MoE层分为语言模型专家和机器翻译专家,并利用傅里叶变换增强路由机制,解决大语言模型在多语言机器翻译微调中的参数干扰问题。

Comments Accepted by TASLP

详情
AI中文摘要

大语言模型(LLMs)在多语言机器翻译(MT)中展现出巨大潜力,即使双语监督有限。然而,使用平行语料库微调LLMs带来了主要挑战,即参数干扰。为了解决这些问题,我们提出了Mix-MoE,一个混合专家混合框架,旨在训练LLMs进行多语言MT。我们的框架在两个不同的阶段运行:(1)在单语语料库上使用MoE进行后预训练,以及(2)在平行语料库上使用MoE进行后预训练。关键的是,我们将MoE层分为两个专门的组:语言模型专家(LM专家)和机器翻译专家(MT专家)。LM专家旨在捕获和保留预训练LLM学到的单语知识。另一方面,MT专家专门训练以获取和存储双语翻译知识。此外,为了促进这些专门专家之间的有效交互并利用文本中潜在的结构模式,我们引入了一种由模型表示中的傅里叶变换特征增强的路由机制。实验结果表明,Mix-MoE在多语言MT中表现出色,显著优于现有基线,并在缓解参数干扰方面取得了显著进展。

英文摘要

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

2605.24528 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Hypothesis Generation and Inductive Inference in Children and Language Models

儿童与语言模型中的假设生成与归纳推理

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

发表机构 * Computer Science University of Waterloo(滑铁卢大学计算机科学系) Department of Computer Science Cornell University(康奈尔大学计算机科学系) Department of Computer Science Dalhousie University(达尔豪斯大学计算机科学系) Department of Psychology University of Toronto(多伦多大学心理学系)

AI总结 通过归纳推理盒子任务,结合贝叶斯粒子推断的程序归纳形式化,比较儿童与基于LLM的智能体在不确定性下的假设生成与证据寻求行为,发现两者在适应环境结构上相似但信息寻求成本与归纳偏差不同。

详情
AI中文摘要

现实世界中的决策需要在证据、潜在因果规则以及世界状态本身的不确定性下构建心智模型。在这种条件下,哪些计算原理支撑人类的推理?在给定匹配约束下,基于LLM的智能体是否表现出类似行为?我们使用归纳推理盒子任务来探讨这些问题,在该任务中,参与者(人类儿童和基于LLM的智能体)通过与不确定环境的顺序交互来推断潜在原因。我们将该任务形式化为基于贝叶斯粒子推断的程序归纳,并承认两种互补的解释:(1) 作为对假设的约束满足过程,以及(2) 作为程序综合问题,其中假设是针对证据评估的可执行程序。使用基于约束的公式,我们表明儿童的行为最好由主观证据可靠性和在线假设生成的组合来解释,这解释了他们的证据寻求模式以及任务完成与规则泛化之间的分离。使用程序综合公式,我们将基于LLM的智能体视为模型有机体:可控系统,允许系统性地操纵任务条件。在各种后端中,基于LLM的智能体复制了儿童对证据可靠性和可观察性变化的反应,包括折扣不可靠证据、寻求解决部分信息以及任务完成与因果泛化之间的分离。同时,与儿童相比,基于LLM的智能体倾向于过度观察和过度遵守指令。这些结果表明,虽然儿童和基于LLM的智能体在适应环境结构方面相似,但他们的信息寻求行为表现出不同的潜在成本和归纳偏差。

英文摘要

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

2605.18838 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

说谎只是一个阶段:语言模型扩展中的隐藏对齐转变

Adil Amin

发表机构 * ZEHEN Labs(ZEHEN实验室)

AI总结 通过分析63个基础模型,发现语言模型在特定规模阈值下,推理能力与真实性从反相关转变为正相关,并揭示了输出投影瓶颈和零竞争注意力头等内部机制。

Comments 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." ( https://doi.org/10.48550/arXiv.2605.18840). Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/

详情
AI中文摘要

扩展定律预测了计算量带来的损失,但未预测能力如何相互作用。我们测量了来自16个家族的63个基础模型的推理能力与真实性之间的耦合,并发现了一个在损失曲线中不可见的相变:低于家族依赖的临界规模N_c时,能力反相关(r = -0.989,p = 4 x 10^{-5},非参数置换检验);高于该规模时,它们合作。N_c ~ 3.5B参数 [2.9B, 13.4B](bootstrap 95% CI),但模型大小并非决定相位的唯一变量。架构、数据整理和训练配方各自独立地改变N_c:精心整理的数据消除了Qwen代际之间的耦合下降(在匹配规模下从0.025到0.830),Gemma-4在4B时通过蒸馏和架构创新实现了0.871的耦合,这通常是13B+标准训练模型的特征,而Phi在1B时仅通过数据整理就达到了10B网络训练模型的耦合水平。宽度归一化消除了所有测试家族的反相关,支持输出投影瓶颈的存在。在内部,40个模型中有38个显示零竞争注意力头。一个稀疏回归ODE以5.6%的误差交叉预测了保留的Llama-2。该诊断不需要模型内部信息——仅需跨模型家族的公开基准分数。合作区域扩展到前沿(r = +0.72,34个模型,10个实验室)。一个概念验证干预证实了瓶颈是可利用的:在识别层添加单个真实方向向量,无需重新训练即可纠正税收阶段60%的错位输出——这是一种无需修改权重的、每推理一次的外科手术式修正。代码、数据、用于任何开放权重模型的开源转向CLI以及用于相位诊断的交互式仪表板已发布:https://zehenlabs.com/cape/。

英文摘要

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.

2605.24248 2026-06-02 cs.CR cs.AI cs.SE 版本更新

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

认证工具服务器准入:模型上下文协议的安全扩展

Alfredo Metere

发表机构 * Enclawed LLC(Enclawed公司)

AI总结 针对MCP协议缺乏信任机制的问题,提出mcp-attested扩展,通过离线签名的权限断言、默认拒绝的工具白名单和分级强制审计日志,实现安全服务器准入与工具边界控制。

详情
AI中文摘要

模型上下文协议(MCP)标准化了大语言模型(LLM)代理与外部工具服务器之间的消息交换,但未标准化信任:主机读取服务器自声明的工具列表并分发调用,没有关于可以使用哪些服务器、敏感程度如何或服务器哪些工具在界限内的概念。这项工作源于一个具体需求——让Enclawed代理安全地使用Google外部运营的MCP服务器(Gmail、日历、Drive),准入服务器并限制其可能驱动的工具,而不改变MCP或Enclawed自身的工具应用程序编程接口(API)。我们构建的机制mcp-attested(已在开源enclawed-oss发行版和enclaved变体中发布)具有通用性:使未经中介的第三方连接对单个用户不安全的差距,使得受监管的部署无法获得认证。我们通过三种附加机制来弥补这一差距:(1)一个小的、离线签名的权限断言,服务器在众所周知的统一资源标识符(URI)上发布,主机在分派任何工具之前对照固定的信任根进行验证;(2)一个默认拒绝的每服务器工具允许列表,因此准入服务器并不意味着信任其每个工具;(3)一个分级门控的强制模式,将检查从警告转变为硬性拒绝,每个决策都写入防篡改审计日志。我们给出了线路格式、验证算法、安全分析和LLM驱动的对抗性评估;然后以规范的请求评论(RFC 2119)形式陈述了设计——模式、验证规则、错误注册表、众所周知的注册和机器可检查的一致性向量——以便它可以作为MCP附录被采纳,而不是重新发明。未扩展的主机会忽略众所周知的文档,行为与今天完全相同。

英文摘要

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server's self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need -- letting the Enclawed agent use Google's externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form -- schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors -- so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.

2605.24202 2026-06-02 cs.AI cs.LG 版本更新

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

多智能体强化学习何时能改进LLM工作流?工作流、规模与策略共享的权衡

Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang

发表机构 * Oregon State University(俄勒冈州立大学) Pennsylvania State University(宾夕法尼亚州立大学) Adobe Inc.(Adobe公司) AG2AI, Inc.(AG2AI公司)

AI总结 研究多智能体LLM工作流中端到端强化学习训练的效果,发现改进依赖于工作流、任务和规模,策略共享不提供统一稳定性而是重新分配失败模式。

详情
AI中文摘要

多智能体LLM工作流通过将推理路由到专门角色来提升最终任务准确性,但联合训练这些角色的强化学习不稳定,其机制尚不明确。我们研究了多智能体LLM工作流的端到端RL训练何时能改进其基础模型,比较了共享策略训练(所有角色更新一个策略)和隔离策略训练(每个角色有自己的参数)。我们的实验矩阵涵盖Eval-Opt、Voting和Orch-Workers工作流、数学和代码任务以及三种模型规模(0.6B、1.7B、4B)。我们发现多智能体RL通常能改进基础模型,但增益共同依赖于工作流、任务和规模,而非仅依赖于策略共享。隔离策略倾向于达到更高的峰值准确率,但更频繁地掉入终端准确率悬崖,而共享策略训练并未消除失败;它只是将失败重新分布为性质不同的模式。然后,我们通过工作流拓扑和策略路由引起的角色级梯度动力学解释了其中最显著的模式:在隔离策略下,共享提示上的并行同角色代理会放大每个角色的梯度,并在Voting和Orch-Workers工作流中导致终端退化;在共享策略下,非对称的每步梯度质量导致共享策略被主导角色捕获,从而产生因任务和工作流而异的失败特征。总之,经验图谱及其潜在机制表明,策略共享通过不同渠道引导训练压力,而非提供统一稳定性,使其成为具有工作流和任务条件权衡的设计选择。

英文摘要

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

2605.24005 2026-06-02 cs.AI cs.CL 版本更新

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

LC-ERD:通过一致性调节奖励分解挖掘潜在逻辑以实现自我进化推理

Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学)

AI总结 针对大语言模型推理中高质量过程数据稀缺的问题,提出LC-ERD框架,通过潜在逻辑挖掘和一致性调节的奖励分解,实现自我对齐与推理进化。

Comments Accepted in SIGKDD 2026 Research Track

详情
AI中文摘要

大语言模型推理的进化受到高质量过程数据稀缺的瓶颈限制。虽然通过内生奖励进行自我对齐提供了一种解决方案,但挖掘有效监督面临三个挑战:(1)通过模仿偏差产生的标签噪声,奖励优先考虑统计可能性而非逻辑真实性,造成掩盖复合错误的“正确性幻觉”;(2)粗粒度监督,稀疏的全局结果(例如在GRPO中)无法提供细粒度指导,将推理链视为整体;(3)分布崩溃,信号无法在不放大预训练偏差的情况下泛化。为了解决这些问题,我们引入了LC-ERD(逻辑一致的内生奖励分解),一个将自我对齐视为潜在结构挖掘的框架。我们通过聚合模型潜在逻辑专家(LLE)的共识推导出变分逻辑势,以去噪推理流形,并引入基于IGM原则的多智能体价值分解协议来量化单个步骤的效用。实验表明,LC-ERD提供了一条稳健的自我进化路径,揭示了逻辑一致性与准确性之间的权衡,同时识别了标准奖励遗漏的高价值推理模式。我们的代码可在https://github.com/LC-ERD-repo/LC-ERD获取。

英文摘要

The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC-ERD-repo/LC-ERD.

2605.11359 2026-06-02 cs.AI physics.data-an 版本更新

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

CVEvolve:面向非结构化科学数据处理的自主算法发现

Ming Du, Xiangyu Yin, Yanqi Luo, Dishant Beniwal, Songyuan Tang, Hemant Sharma, Mathew J. Cherukara

发表机构 * Argonne National Laboratory(阿贡国家实验室) Advanced Photon Source(先进光子源)

AI总结 提出CVEvolve,一种零代码自主智能体框架,通过多轮搜索与工具集成,自动发现用于非结构化科学数据处理的算法,并在多个任务上超越基线方法。

详情
AI中文摘要

科学数据处理通常需要特定任务的算法或AI模型,这给需要分析数据但可能缺乏广泛计算或图像处理专业知识的领域科学家造成了障碍。当数据噪声大、动态范围高、标签稀疏或仅松散指定时,这一障碍尤为明显。我们引入了CVEvolve,一个具有零代码界面的自主智能体框架,用于科学数据处理算法的发现。CVEvolve结合了多轮搜索策略与代码执行、评估实现、历史管理、保留测试以及可选的科学数据和视觉输出检查工具。搜索在发现和改进动作之间交替,并使用谱系感知的随机候选采样来平衡探索与利用。我们在X射线荧光显微镜图像配准、布拉格峰检测、高能衍射显微镜图像分割以及混合分析学习基仿射配准上展示了CVEvolve。在这些任务中,CVEvolve发现了优于基线方法的算法,而保留测试跟踪有助于识别比后期过度优化替代方案泛化能力更好的候选算法。这些结果表明,零代码、自主的LLM驱动算法开发可以帮助领域科学家将非结构化科学图像数据转化为实用算法和下游科学发现。

英文摘要

Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analyze their data but may not have extensive computing or image-processing expertise. This barrier is especially pronounced when data are noisy, have a high dynamic range, are sparsely labeled, or are only loosely specified. We introduce CVEvolve, an autonomous agentic harness with a zero-code interface for scientific data-processing algorithm discovery. CVEvolve combines a multi-round search strategy with tools for code execution, evaluation implementation, history management, holdout testing, and optional inspection of scientific data and visual outputs. The search alternates between discovery and improvement actions, and uses lineage-aware stochastic candidate sampling to balance exploration and exploitation. We demonstrate CVEvolve on X-ray fluorescence microscopy image registration, Bragg peak detection, high-energy diffraction microscopy image segmentation, and hybrid analytical-learning-based affine registration. Across these tasks, CVEvolve discovers algorithms that improve over baseline methods, while holdout test tracking helps identify candidates that generalize better than later over-optimized alternatives. These results show that zero-code, autonomous LLM-powered algorithm development can help domain scientists turn unstructured scientific image data into practical algorithms and downstream scientific discoveries.

2605.17109 2026-06-02 cs.LG cs.AI 版本更新

DynMuon: A Dynamic Spectral Shaping View of Muon

DynMuon: 缪子的动态谱整形视角

Fangzhou Wu, Rikhav Shah, Sandeep Silwal, Qiuyi Zhang

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) MIT(麻省理工学院) Elorian AI

AI总结 本文提出DynMuon方法,通过动态调整谱整形参数p(从正到负),在训练过程中平衡高曲率与低曲率方向,从而加速收敛并降低验证损失。

Comments 21 pages

详情
AI中文摘要

近年来,Muon已成为训练大型语言模型及更广泛的Transformer的主导方法。与标准梯度下降方法相比,其本质区别在于将通常的更新矩阵$M=UΣV^\top$替换为其极因子$UV^\top$。在本文中,我们考虑一类类似Muon的更新,其中将更新$M$替换为$UΣ^p V^\top$,参数$p$。我们称此为“谱整形”操作,并发展了一套选择$p$的理论,该选择依赖于(a)损失函数的局部曲率,(b)来自随机梯度和标签噪声的噪声,以及(c)训练阶段。我们的理论和实验揭示了一个先前被忽视的行为:正的$p$通过强调高曲率方向并加速信号收缩而在早期有帮助,而轻微负的$p$通过将更新强度重新分配到仍包含有用训练信号的低曲率方向而在后期有帮助。基于这一洞察,我们提出了DynMuon,一种高效的动态谱整形方法,在训练过程中将$p$从正值调度到轻微负值。跨模型大小、架构和训练设置的大量实验表明,DynMuon始终比Muon获得更低的验证损失,同时达到相同目标损失所需的步数减少10.6-26.5%。我们的代码可在https://github.com/fzwark/DynMuon获取。

英文摘要

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=UΣV^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $UΣ^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss. Our code is available at https://github.com/fzwark/DynMuon.

2604.13517 2026-06-02 cs.LG cs.AI 版本更新

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO

表征优于路由:诊断多时间尺度PPO中的时间路由病理

Jing Sun

发表机构 * Information Engineering School, Chengyi College, Jimei University(信息工程学院, Chengyi 学院, 厦门大学)

AI总结 本文通过形式化代理目标攻击和时间不确定性悖论,揭示了多时间尺度PPO中可微路由和基于误差路由的数值捷径问题,并提出目标解耦方法消除演员侧路由路径以改善性能。

Comments 8 pages, 3 figures

详情
AI中文摘要

强化学习中的时间信用分配通常通过引入多个折扣因子的价值估计来处理。一个自然的下一步是让演员在这些时间头之间动态路由,使用可微注意力或启发式不确定性权重。本文认为,这种路由可能产生数值捷径而非可靠的时间抽象。我们在LunarLander-v2上的受控PPO设置中研究此问题,将环境用作诊断故障模式的视觉沙箱。首先,我们形式化了代理目标攻击:暴露于PPO代理的可微softmax路由器会直接获得梯度,指向对当前更新数值有利的优势头,即使这种路由变化并不对应物理控制的改进。由于不同折扣因子的未归一化优势具有不同的有效尺度,这产生了尺度差异脆弱性。其次,我们在基于梯度的无误差路由中识别了时间不确定性悖论:短视头可能获得最大的路由份额,因为其预测目标更容易,即使它们与延迟任务成功的对齐程度较低。作为结构性回应,我们研究了目标解耦:评论家可以保留多时间尺度辅助头,但演员仅使用长视优势进行更新。目标解耦并非作为广泛的性能提升器;在此运行集中,它消除了可被利用的演员侧路由路径,并改善了观察到的最差种子回报。代码可在 https://github.com/ben-dlwlrma/Representation-Over-Routing 获取。

英文摘要

Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control. Because unnormalized advantages at different discount factors have different effective scales, this creates a scale-discrepancy vulnerability. Second, we identify the Paradox of Temporal Uncertainty in gradient-free error-based routing: short-horizon heads can receive the largest routing share because their prediction targets are easier, even when they are less aligned with delayed task success. As a structural response, we study Target Decoupling: the critic may retain multi-timescale auxiliary heads, but the actor is updated only with the long-horizon advantage. Target Decoupling is not presented as a broad performance booster; in this run set it removes the exploitable actor-side routing pathway and improves the observed worst-seed return. Code is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.

2605.22759 2026-06-02 cs.AI 版本更新

Towards a General Intelligence and Interface for Wearable Health Data

迈向可穿戴健康数据的通用智能与接口

Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari, Samy Abdel-Ghaffar, Marius Guerard, Kara Vaillancourt, Zhihan Zhang, Jake Garrison, Levi Albuquerque, Dimitris Spathis, Hong Yu, Hamid Palangi, Xuhai "Orson" Xu, David G. T. Barrett, Joseph Breda, Jed McGiffin, Yubin Kim, Yuwei Zhang, Naghmeh Rezaei, Samuel Solomon, Karan Ahuja, Tim Althoff, Jake Sunshine, Ming-Zher Poh, Benjamin Yetton, Ari Winbush, Nicholas B. Allen, James M. Rehg, Isaac Galatzer-Levy, Yun Liu, John Hernandez, Anupam Pathak, Conor Heneghan, Yuzhe Yang, Ahmed A. Metwally, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Xin Liu, Daniel McDuff

发表机构 * Google Research(谷歌研究) Google DeepMind(谷歌DeepMind) University of Washington(华盛顿大学) University of Oregon(俄勒冈大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出一个基于超过一万亿分钟无标签传感器数据预训练的可穿戴健康基础模型,通过联合扩展模型容量和预训练数据量,在35项健康预测任务上实现系统性性能提升,并利用LLM代理自动搜索下游预测头,集成到个人健康代理中以提高相关性和安全性。

详情
AI中文摘要

虽然无处不在的可穿戴传感器捕获了大量的行为和生理信息,但有效地将这些信号转化为个性化的健康见解具有挑战性。具体来说,由于高度的表型多样性以及个体基线健康、生理和生活方式因素的差异,将低层传感器数据转换为能够表征高层状态的表示是困难的。此外,收集带有健康结果注释的可穿戴数据既费力又昂贵,而回顾性注释实际上不可行,导致高质量标签数据的稀缺。为了克服这些限制,我们提出了一个可穿戴健康基础模型,该模型在来自五百万参与者的大型队列中超过一万亿分钟的无标签传感器信号上进行了预训练。我们证明了模型容量和预训练数据量的联合扩展在35项健康预测任务(涵盖心血管、代谢、睡眠和心理健康以及生活方式选择和人口统计因素)的多样化评估中带来了系统性的性能提升。我们发现这种人群规模的表示解锁了标签高效的少样本学习和稳健的日常指标估计的生成能力。为了进一步利用这种学习到的表示,我们部署了一个LLM代理教室来自动搜索基于模型嵌入构建的下游预测头空间,显示出随着LLM模型容量增加而广泛性能提升。最后,我们展示了将这些下游预测器集成到个人健康代理中如何能够支持更相关、更具上下文感知和更安全的模型响应,并通过来自一组临床医生的1,860个评分进行了验证。

英文摘要

While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.

2604.17473 2026-06-02 cs.CV cs.AI 版本更新

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

双锚定:解决视觉语言导航中的状态漂移问题

Kangyi Wu, Pengna Li, Kailin Lyu, Xi Lin, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院) Xi’an Jiaotong University(西安交通大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Johns Hopkins University(约翰霍普金斯大学) Joy Future Academy, JD(京东未来学院)

AI总结 提出双锚定框架,通过指令进度锚定和记忆地标锚定分别解决进度漂移和记忆漂移,显著提升长场景导航成功率。

详情
AI中文摘要

视觉语言导航(VLN)要求智能体通过遵循自然语言指令在3D环境中导航。尽管最近的视频大语言模型(Video-LLMs)极大地推进了VLN,但在长场景中它们仍然非常容易受到状态漂移的影响。在这些情况下,智能体的内部状态偏离真实的任务执行状态,导致无目的漫游和无法执行指令中的关键操作。我们将这种失败归因于两种不同的认知缺陷:进度漂移,即智能体无法区分已完成的子目标和剩余的子目标;以及记忆漂移,即智能体的历史表示退化,使其无法跟踪已访问的地标。在本文中,我们提出了一个双锚定框架,明确锚定指令进度和历史表示。首先,为了解决进度漂移,我们引入了指令进度锚定,监督智能体生成结构化的文本标记,以描述已完成与剩余的子目标。其次,为了缓解记忆漂移,我们提出了记忆地标锚定,利用以地标为中心的世界模型回顾性地预测由Segment Anything模型提取的以对象为中心的嵌入,迫使智能体显式验证过去的观察并保留已访问地标的独特表示。为促进该框架,我们整理了两个大规模数据集:360万个带有显式进度描述的样本,以及93.7万个用于回顾性验证的接地地标数据。在模拟和真实环境中的大量实验证明了我们方法的优越性,在成功率上提高了15.2%,在长时程轨迹上获得了24.7%的显著提升。为促进进一步研究,我们将发布我们的代码、数据生成流程以及收集的数据集。

英文摘要

Vision-Language Navigation(VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models(Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. Facilitating this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

2605.14355 2026-06-02 cs.AI cs.CL 版本更新

Herculean: An Agentic Benchmark for Financial Intelligence

Herculean: 面向金融智能的智能体基准测试

Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian, Yan Wang, Vincent Jim Zhang, Huan He, Xuguang Ai, Linhai Ma, Ruoyu Xiang, Yueru He, Yi Han, Shuyao Wang, Yuqing Guo, Mingyang Jiang, Yilun Zhao, Youzhong Dong, Xiaoyu Wang, Yankai Chen, Ye Yuan, Qiyuan Zhang, Fuyuan Lyu, Haolun Wu, Yonghan Yang, Zichen Zhao, Yuyang Dai, Fan Zhang, Rania Elbadry, Ayesha Gull, Muhammad Usman Safder, Nuo Chen, Fengbin Zhu, Tianshi Cai, Zimu Wang, Polydoros Giannouris, Yuechen Jiang, Zhiwei Liu, Mohsinul Kabir, Yuyan Wang, Yixiang Zheng, Yangyang Yu, Weijin Liu, Wenbo Cao, Anke Xu, Peng Lu, Jerry Huang, Mingquan Lin, Prayag Tiwari, Yijia Zhao, Víctor Gutiérrez-Basulto, Xiao-Yang Liu, Kaleb E Smith, Jiahuan Pei, Arman Cohan, Jimin Huang, Yuehua Tang, Alejandro Lopez-Lira, Xi Chen, Xue Liu, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou

发表机构 * The Fin AI Yale University(耶鲁大学) Columbia University(哥伦比亚大学) Stevens Institute of Technology(史蒂文斯理工学院) NVIDIA(英伟达) New York University(纽约大学) Georgia Institute of Technology(佐治亚理工学院) University of Florida(佛罗里达大学) MBZUAI Université de Montréal(蒙特利尔大学) University of Minnesota(明尼苏达大学) University of Massachusetts Boston(马萨诸塞大学波士顿分校) National Institute of Advanced Industrial Science and Technology(国家先进工业科学与技术研究院) University of Liverpool(利物浦大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) National University of Singapore(新加坡国立大学) Halmstad University(哈尔姆斯塔德大学) University of Manchester(曼彻斯特大学) Cardiff University(卡迪夫大学) McGill University(麦吉尔大学) Mila – Quebec AI Institute(魁北克人工智能研究所)

AI总结 本文提出Herculean,首个覆盖交易、对冲、市场洞察和审计四个代表性工作流的智能体金融智能基准测试,通过标准化MCP技能环境评估异构智能体系统,发现智能体在交易和市场洞察上表现较好,但在对冲和审计等需要长期协调、状态一致性和结构化验证的任务上存在显著不足。

详情
AI中文摘要

随着AI智能体的进步,核心问题不再是它们能否解决孤立的、定义明确的金融任务,而是它们能否可靠地执行金融专业工作。现有的金融基准测试仅提供了这种能力的部分视角,因为它们主要评估静态能力,如问答、检索、摘要和分类。我们引入了Herculean,这是首个面向智能体金融智能的技能基准测试,涵盖四个代表性工作流,包括交易、对冲、市场洞察和审计。每个工作流被实例化为一个基于MCP的标准化技能环境,具有自己的工具、交互动态、约束和成功标准,从而能够对异构智能体系统进行一致的端到端评估。在前沿智能体中,我们发现智能体在交易和市场洞察上表现相对较好,但在对冲和审计上表现显著不足,这些任务中长期协调、状态一致性和结构化验证至关重要。总体而言,我们的结果指出了当前智能体在高风险金融工作流中将金融推理转化为可靠工作流执行方面的关键差距。

英文摘要

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

2602.11210 2026-06-02 cs.SE cs.AI cs.LG 版本更新

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox:用于构建软件工程智能体的无容器强化学习

Danlong Yuan, Wei Wu, Enhan Zhao, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SWE-MiniSandbox,一种轻量级无容器方法,通过内核级隔离和预缓存技术降低磁盘使用和准备时间,实现可扩展的强化学习训练。

详情
AI中文摘要

强化学习已成为训练软件工程智能体的关键范式,但现有流程通常依赖每个任务的容器进行隔离。在大规模场景下,预构建的容器镜像会带来显著的存储开销、缓慢的环境设置,并且需要容器管理权限。我们提出SWE-MiniSandbox,一种轻量级、无容器的方法,能够在无需牺牲隔离性的情况下实现SWE智能体的可扩展强化学习训练。SWE-MiniSandbox不依赖每个实例的容器,而是在由内核级机制支持的隔离工作空间中执行每个任务,从而大幅降低系统开销。它利用轻量级环境预缓存技术,消除了对庞大容器镜像的需求。因此,我们的方法将磁盘使用量降低到基于容器的流程所需的大约5%,并将环境准备时间缩短到容器基线的大约25%。实验结果表明,SWE-MiniSandbox实现了与标准基于容器的流程相当的评估性能。通过消除对重型容器基础设施的依赖,SWE-MiniSandbox为扩展基于强化学习的SWE智能体提供了一个实用且可访问的基础,特别是在资源受限的研究环境中。

英文摘要

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5\% of that required by container-based pipelines and reduces environment preparation time to about 25\% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

2603.02845 2026-06-02 cs.RO cs.AI 版本更新

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

SPARC: 通过注意力智能体通信实现空间感知路径规划

Sayang Mu, Xiangyu Wu, Bo An

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出关系增强多头注意力(RMHA)机制,通过嵌入曼哈顿距离到注意力权重计算,优先处理空间邻近机器人的消息,在40x40网格上从8机器人零样本泛化到128机器人时,在30%障碍密度下实现约75%成功率,超越基线25个百分点以上。

Comments The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text . The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme

详情
AI中文摘要

高效通信对于分散式多机器人路径规划(MRPP)至关重要,然而现有的学习型通信方法平等对待所有邻近机器人,而不考虑它们的空间接近性,导致在协调最重要的拥挤区域注意力被稀释。我们提出关系增强多头注意力(RMHA),这是一种通信机制,它将成对曼哈顿距离显式嵌入到注意力权重计算中,使每个机器人能够动态优先处理来自空间相关邻居的消息。结合距离约束注意力掩码和GRU门控消息融合,RMHA与MAPPO无缝集成,实现稳定的端到端训练。在从8个训练机器人到128个测试机器人在40x40网格上的零样本泛化中,RMHA在30%障碍密度下实现了约75%的成功率,比最佳基线高出超过25个百分点。消融研究证实,距离关系编码是高密度环境中成功率提高的关键因素。索引词-多机器人路径规划,图注意力机制,多头注意力,通信优化,协作决策。

英文摘要

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately 75 percent success rate at 30 percent obstacle density outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms-Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making

2605.15229 2026-06-02 cs.SE cs.AI 版本更新

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

PBT-Bench:基于属性测试的AI智能体基准

Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du

发表机构 * Tsinghua University(清华大学) University of Washington(华盛顿大学) Beneficial AI Foundation(有益人工智能基金会)

AI总结 提出PBT-Bench基准,包含100个基于属性测试的问题,用于评估AI智能体从文档中推导语义不变量并生成输入策略的能力。

详情
AI中文摘要

现有的代码基准测试衡量的是智能体能否生成任何能复现已知bug的测试,或者能否生成修复描述问题的补丁。两者都没有分离出基于属性测试的独特技能:从文档中推导语义不变量,然后构建足够精确的输入生成策略,使得随机搜索能够揭示违规。我们引入了PBT-Bench,一个包含40个真实Python库中100个精心策划的基于属性测试问题的基准。每个问题注入一个或多个语义bug(共365个,平均每个问题3.65个),设计使得默认策略的随机输入几乎不会触发它们;智能体必须阅读库的文档,识别相关不变量,并指定一个Hypothesis @given策略,将质量集中在触发区域。bug按三个难度级别(L1-L3)分层,涵盖单约束边界bug到有状态、跨函数协议违规。我们在两种提示机制(开放式基线与显式Hypothesis脚手架)下评估了八个当代LLM,每个配置进行三次独立运行。在PBT引导提示下,模型间的bug召回率从42.1%到83.4%不等;在开放式基线下,从31.4%到76.7%不等。Hypothesis脚手架将中等能力模型提升了超过20个百分点,但对最强模型提升较小,有两个例外显示出退化,表明结构化提示可能干扰某些模型行为而非补充。最难的bug被证明是模型特定的:不同架构在不同问题上失败,留下没有单一模型能填补的持续空白。我们发布基准、测试框架和完整评估语料库,以支持下游关于文档基础的语义推理工作。

英文摘要

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.

2605.20301 2026-06-02 cs.CV cs.AI 版本更新

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Co-Fusion4D:面向鲁棒3D目标检测的时空协同融合

Wenxuan Li, Qin Zou, Shoubing Chen, Chi Chen, Yingyi Yang, Qingxiang Meng

发表机构 * Tsinghua University(清华大学)

AI总结 提出Co-Fusion4D框架,通过当前帧主导-历史帧互补机制和双注意力融合模块,解决BEV检测器中跨帧时空不一致问题,在nuScenes上达到74.9% mAP和75.6% NDS。

详情
AI中文摘要

在自动驾驶中,3D目标检测对于准确感知和可靠决策至关重要。然而,目标运动和自车运动常常在基于BEV的检测器中引起跨帧时空不一致,导致时序BEV特征错位和时空一致性退化。为了解决这些挑战,我们提出了Co-Fusion4D,一个统一框架,显式地保持跨帧时空一致性并抑制时序特征漂移。Co-Fusion4D采用当前帧中心策略,将当前帧作为主要信息源,同时在时空滤波和对齐后选择性地融入历史帧。这种主从互补机制有效减轻了累积对齐误差,抑制了噪声特征传播,并利用可靠的时序线索获得更一致的BEV表示。此外,Co-Fusion4D集成了双注意力融合(DAF)模块,以进一步增强时空特征交互。DAF联合利用帧内空间注意力和帧间时序注意力,自适应地对齐和融合多帧特征,强调运动一致区域同时抑制虚假相关性。通过偏离传统的均匀融合范式,该设计显著提高了BEV表示的时序稳定性和判别能力。在nuScenes基准上的大量实验表明,Co-Fusion4D实现了最先进的性能,mAP为74.9%,NDS为75.6%,且不依赖测试时增强或外部数据。

英文摘要

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

2605.20282 2026-06-02 cs.CV cs.AI 版本更新

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

视觉模型真的能遗忘吗?Mirage:表示层面的视觉遗忘认证

Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Southeast University(东南大学) Northeast Normal University(东北师范大学)

AI总结 提出Mirage框架,通过表示层面诊断揭示现有垂直联邦学习遗忘方法在输出层面通过认证后仍保留类别结构信息,并发现遗忘三元组困境和类别-样本不对称性。

详情
AI中文摘要

垂直联邦学习中的机器遗忘引起了越来越多的关注,但现有方法仅使用输出层面指标来认证遗忘。我们通过引入Mirage(一个表示层面审计框架,包含四种互补诊断方法:线性探针恢复、中心核对齐、特征可分性评分和逐层恢复分析)来挑战这些说法。通过在七个数据集和七种基线方法上遵循最近的VFL遗忘协议进行实验,Mirage揭示了三个关键发现:(i)遗忘差距:通过输出层面认证的方法在其表示中仍然保留了大量的类别结构,线性探针恢复比重新训练的基线高出最多15.4个百分点;中心核对齐显示这些模型在结构上更接近原始模型而非重新训练的参考模型,而可分性评分表明存在持续的几何区分。(ii)遗忘三元组困境:没有现有方法能同时实现高效用、输出层面遗忘和表示层面遗忘。(iii)类别-样本不对称性:类别级遗忘留下强烈的表示痕迹(线性探针恢复高达97%),而样本级遗忘与随机无异(线性探针恢复约50%);逐层分析进一步表明残差类别信息在网络深度中持续存在。这些发现呼吁在联邦遗忘研究中采用表示层面感知的评估标准。

英文摘要

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.

2605.17839 2026-06-02 cs.LG cs.AI 版本更新

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

基于双层优化的不平衡学习知识蒸馏平衡

Anh B. H. Nguyen, Ba Tho Phan, Viet Cuong Ta

发表机构 * VNU University of Engineering and Technology(越南工程技术大学)

AI总结 提出BiKD双层框架,通过自适应样本级权重平衡硬损失和软损失,解决不平衡数据上知识蒸馏的脆弱性问题。

Comments Accepted to Special Session: Data Science: Foundations and Applications (DSFA), PAKDD 2026

详情
AI中文摘要

知识蒸馏通过混合硬损失和软损失将高容量教师的知识转移到紧凑的学生模型。在不平衡数据上,硬损失和软损失之间的固定权重使得学习过程变得脆弱。最近的研究尝试在长尾设置中重新加权这些组件。然而,大多数方法没有在样本级别调整权重,也没有考虑训练过程中学生的行为。为了解决这个问题,我们提出了BiKD——一个双层框架,动态平衡每个样本的硬损失和软损失。我们采用一个权重生成网络,由一个小型平衡验证集引导,产生自适应的逐样本权重。学生现在通过无约束的加权硬损失和软损失组合进行训练,使得学生可以放松这两个项。我们进一步提出了一种多步SGD策略,以更准确和高效地优化权重模型。在长尾CIFAR-10/100上的实验表明,我们的方法在不同不平衡因子下均优于最近的平衡蒸馏方法。

英文摘要

Knowledge distillation transfers knowledge from a high capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses becomes brittle the learning process. Recent studies try to reweight these components in long-tailed settings. However, most of these methods do not adapt weights at the sample-wise level and do not take into account the students behavior during training. To address this, we propose BiKD -- a bilevel framework that dynamically balances hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced validation set. The student is now trained with an unconstrained combination of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the weight model more accurately and efficiently. Experiments on long-tailed CIFAR-10/100 show that our approach surpasses recent balanced distillation methods across imbalance factors.

2605.18077 2026-06-02 cs.AI cs.LG cs.MA 版本更新

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

LLM引导的通信用于合作多智能体强化学习

Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han

发表机构 * KAIST(韩国科学技术院)

AI总结 提出LMAC框架,利用大语言模型的推理能力设计通信协议,使所有智能体尽可能准确一致地重建底层状态,从而提升多智能体强化学习中的状态重建和性能。

Comments 9 pages for main, 32 pages for total, Accepted to ICML 2026

详情
AI中文摘要

通信是多智能体强化学习(MARL)中缓解部分可观测性的关键组成部分,然而先前的方法通常依赖于低效的信息交换或无法传输足够的状态信息。为了解决这一问题,我们提出了LLM驱动的多智能体通信(LMAC),它利用LLM的推理能力设计一种通信协议,使所有智能体能够尽可能准确且一致地重建底层状态。LMAC使用显式的状态感知准则迭代地优化协议,在缩小智能体知识差异的同时改善状态恢复。在多种MARL基准上的实验表明,LMAC改善了智能体间的状态重建,并且相较于先前的通信基线取得了显著的性能提升。

英文摘要

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.

2605.17909 2026-06-02 cs.AI cs.LO 版本更新

Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems

伦理超高速 (EHV):一种面向代理型AI系统的硬件根零信任运行时强制架构

Riddhi Mohan Sharma

发表机构 * Senior Member, IEEE(IEEE高级会员)

AI总结 提出伦理超高速 (EHV) 架构,通过结合语法约束解码、因果图CRDT、可信执行环境和OSCAL审计日志,实现硬件根的零信任运行时强制,将策略执行点嵌入推理管线,显著降低治理延迟并支持形式化验证。

Comments 12 pages, 3 TikZ Figures, 3 Tables

详情
AI中文摘要

随着自主代理系统在受监管的关键基础设施中规模化部署,缺乏针对高频策略更新的机械性、硬件根强制机制构成了基本的安全缺口。我们提出伦理超高速 (EHV),一种面向代理系统的治理感知运行时强制架构,它结合了用于内联策略约束令牌生成的语法约束解码 (GCD)、基于向量时钟排序的因果图CRDT策略同步、可信执行环境 (TEE) 中的硬件证明执行以及OSCAL格式的机器可读审计日志。与引入14-30天策略延迟的事后审计框架(如ISO/IEC 42001、NIST AI RMF)不同,EHV通过治理感知即时 (JIT) 编译器将策略执行点 (PEP) 重新定位到推理管线中。在明确陈述的假设下,该架构降低了强制延迟,提高了可追溯性,并支持有界模型中的安全不变量的形式化验证。我们通过TLA+模型检查证明,在验证的有界运行状态空间(生成1738个状态,324个不同状态,深度8,零违规)中,不合规的代理行为是不可达的。在这些条件下,O(1)运行时强制减少了部署速度与治理完整性之间的传统权衡,将治理延迟从O(天)降至O(1)。EHV的差异化贡献在于将GCD、因果CRDT、TEE证明缓存和有界形式化验证集成到一个单一的、硬件根的强制架构中——这是任何同期系统都未实现的组合。该架构通过儿科肿瘤剂量用例进行演示,适用于包括医疗、金融合规和关键基础设施控制在内的受监管关键基础设施。

英文摘要

As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for high-frequency policy updates presents a fundamental safety gap. We present Ethical Hyper-Velocity (EHV), a governance-aware runtime enforcement architecture for agentic systems that combines Grammar-Constrained Decoding (GCD) for inline policy-constrained token generation, Causal Graph CRDT-based policy synchronization with vector-clock ordering, hardware-attested execution in Trusted Execution Environments (TEEs), and OSCAL-formatted machine-readable audit logging. Unlike retrospective auditing frameworks (ISO/IEC 42001, NIST AI RMF) that introduce 14-30 day policy latencies, EHV relocates the Policy Enforcement Point (PEP) into the inference pipeline via a Governance-Aware Just-In-Time (JIT) Compiler. Under explicitly stated assumptions, the architecture reduces enforcement latency, improves traceability, and supports formal verification of safety invariants in a bounded model. We demonstrate via TLA+ model checking that non-compliant agentic actions were unreachable in the verified bounded operating state space (1,738 states generated, 324 distinct, depth 8, zero violations). Under these conditions, O(1) runtime enforcement reduces the traditional trade-off between deployment velocity and governance integrity, targeting Governance Latency from O(days) toward O(1). EHV's differentiating contribution is the integration of GCD, Causal CRDT, TEE attestation caching, and bounded formal verification into a single, hardware-rooted enforcement architecture -- a combination not achieved by any contemporaneous system. The architecture is demonstrated through a pediatric oncology dosage use case, with applicability to regulated critical infrastructures including healthcare, financial compliance, and critical infrastructure control.

2605.17554 2026-06-02 cs.AI cs.LG 版本更新

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

评估深度研究代理在专家咨询工作中的表现:一个包含验证器、评分标准和认知陷阱的基准

Tanmay Asthana, Aman Saksena, Divyansh Sahu

AI总结 本文提出一个基准,通过42个专家编写的任务,使用确定性验证器和五维度评分标准评估三个前沿深度研究代理(Claude、OpenAI o3、Gemini)在管理咨询类结构化分析交付物上的表现,并嵌入认知陷阱,发现所有代理的联合接受率均较低(最高21.4%),且各有独特失败模式。

Comments Updating the paper with more data. Will resubmit

详情
AI中文摘要

前沿深度研究代理(DRA)能够规划研究任务、综合多篇文档,并按需生成结构化的交付物。它们在企业工作流中的部署速度远快于评估速度。现有基准衡量事实回忆、单跳问答或通用代理技能,忽略了DRA被部署用于生成的多文档、决策级工作。我们引入一个基准,针对管理咨询师典型一周中所需的结构化分析交付物。我们评估三个前沿代理,即Claude Opus 4.6(带网络搜索)、OpenAI o3-deep-research和Google Gemini 3.1 Pro deep-research,在42个由领域专家(SME)编写的提示上。每个提示的126个响应在两个层面评分:确定性真实验证器(平均每个任务13.8个)和五维度0-3 SME评分标准,组合成0-100的验证器-评分标准分数(VRS)。大多数提示嵌入了惩罚表面模式匹配的认知陷阱。在我们的联合阈值(评分标准均值>=2.5且验证器通过率>=80%)下的接受率普遍较低:Gemini 21.4%,o3 9.5%,Claude 9.5%。平均VRS分数与已发表的基于评分标准的基准一致(我们的最高62.6对比APEX-v1 64.2,ProfBench 65.9,ResearchRubrics <68%),验证了评分标准构建。ACCEPT率低于APEX-Agents在专用DR代理上的MC-segment Pass@1区间(12.3-22.7%);尽管有工具优势,我们的下限仍低三个百分点,这是由于更严格的合取评分和陷阱设计。每个代理的失败模式各不相同。Claude最可靠地生成交付物(在需要文件的任务上比其他代理高4.5倍),但具有最高的虚构特征。o3具有最清晰的推理平均值,但会遗漏必要部分并传播算术错误。Gemini是双峰的,具有最高的接受率,同时也有最多的零分评分标准单元格。

英文摘要

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.

2605.12969 2026-06-02 cs.LG cs.AI 版本更新

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

从对比视角重新审视基于可验证奖励的强化学习

Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang

发表机构 * Beijing Institute of Technology(北京理工大学) Qwen Business Unit of Alibaba(阿里巴巴Qwen业务部) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 本文提出ConSPO方法,通过对比序列级策略优化,解决GRPO在目标函数上的似然错配和信用分配不敏感问题,在推理任务上超越强基线。

详情
AI中文摘要

组相对策略优化(GRPO)是目前最广泛采用的RLVR算法之一,用于对大型语言模型进行推理任务的后训练。我们首先证明GRPO存在等价的判别式重新表述,其中策略优化最大化验证的正负rollout之间的期望得分差距。这种重新表述揭示了两个目标层面的局限性:似然错配的替代得分(优化的是基于裁剪比率的得分而非控制生成的序列似然)和得分不敏感的信用分配(rollout级别的信用不反映当前正负rollout之间的得分差距)。为了解决这些局限性,我们提出ConSPO,一种对比序列级策略优化方法,它使用长度归一化的序列对数概率作为rollout得分,并在同一组内对比验证的正rollout与负干扰项。ConSPO优化一个组级别的InfoNCE风格目标,以自适应地增强对分离不佳的正样本和高分负样本的更新,同时结合课程调度的边界,在训练过程中保持分离压力。在多种设置下的实验表明,ConSPO在具有挑战性的推理基准上优于强基线。代码将在论文被接收后发布。

英文摘要

Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in which clipped ratio-based scores are optimized rather than the sequence likelihoods that govern generation, and score-insensitive credit assignment, in which rollout-level credit does not reflect the current score gaps between positive and negative rollouts. To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives, together with a curriculum-scheduled margin that preserves separation pressure as training progresses. Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks. Code will be released upon paper acceptance.

2603.05308 2026-06-02 cs.CL cs.AI 版本更新

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1:用于零样本和可扩展生物医学证据归因的小型语言模型

Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu

发表机构 * Division of Intramural Research, National Library of Medicine, National Institutes of Health(国家医学图书馆内部研究部,国立卫生研究院) Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Center for Cancer Research, National Cancer Institute, National Institutes of Health(国家癌症研究所癌症研究中心,国立卫生研究院) Department of Population Health Sciences, Weill Cornell Medicine Institute of AI for Digital Health, Weill Cornell Medicine(韦尔·科恩医学中心流行病学与健康科学系,韦尔·科恩医学中心人工智能与数字健康研究所)

AI总结 提出仅3B参数的小语言模型Med-V1,通过高质量合成数据训练,在生物医学证据归因任务上性能媲美GPT-5等前沿大模型,并用于量化LLM幻觉和识别临床指南中的证据错误归因。

详情
AI中文摘要

评估一篇文章是否支持某个断言对于幻觉检测和声明验证至关重要。虽然大型语言模型(LLM)有潜力自动化这一任务,但实现强性能需要如GPT-5这样的前沿模型,而这些模型在规模部署时成本过高。为了高效执行生物医学证据归因,我们提出了Med-V1,一个仅有三亿参数的小语言模型家族。在本研究中新开发的高质量合成数据上训练,Med-V1在统一为验证格式的五个生物医学基准上显著优于其基础模型(+27.0%至+71.3%)。尽管规模较小,Med-V1的性能与GPT-5等前沿LLM相当,并提供高质量的预测解释。我们使用Med-V1进行了首次用例研究,量化了不同引用指令下LLM生成答案中的幻觉。结果表明,格式指令强烈影响引文有效性和幻觉,GPT-5生成更多声明但表现出与GPT-4o相似的幻觉率。此外,我们展示了第二个用例,表明Med-V1可以自动识别临床实践指南中的高风险证据错误归因,揭示了否则难以大规模识别的潜在负面公共卫生影响。总体而言,Med-V1为生物医学证据归因和验证任务的实际应用提供了一种高效、准确的轻量级替代方案。Med-V1可在https://github.com/ncbi-nlp/Med-V1获取。

英文摘要

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

2605.17110 2026-06-02 cs.AI cs.LG 版本更新

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

通过证据校准的查询聚类捕捉LLM能力

Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Elorian AI

AI总结 提出ECC算法,利用有限后验模型比较校准先验语义嵌入,通过Bradley-Terry模型参数化能力轮廓,联合学习灵活的能力感知聚类结构,显著提升LLM能力排序质量。

Comments 45 pages

详情
AI中文摘要

查询聚类将查询分组为反映共享潜在能力需求的组,从而实现能力感知的LLM评估。现有的聚类方法主要依赖于语义分类或嵌入,由于表面语义与实际模型性能之间的错位,往往无法捕捉此类潜在能力需求。我们提出ECC,一种使用有限后验模型比较校准先验语义嵌入的算法,以弥合表面语义与潜在能力需求之间的差距。ECC通过Bradley-Terry模型参数化的能力轮廓来表征每个聚类,并使用可训练的混合权重来适应具有混合能力需求的查询,联合学习灵活的能力感知聚类结构,支持查询特定的LLM能力推断。大量的定量和定性评估表明,ECC显著提高了LLM能力排序质量,分别比人工标注和基于嵌入的基线平均高出17.64和18.02个百分点,并在查询路由等下游任务中证明有效。

英文摘要

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.

2605.17034 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

面向数据敏感检索增强生成的隐私策略执行护栏

Osama Zafar, Alexander Nemecek, Yiqian Zhang, Wenbiao Li, Debargha Ganguly, Vikash Singh, Vipin Chaudhary, Erman Ayday

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 针对RAG系统中上下文数据泄露问题,提出基于双单类密度估计器与融合文本嵌入的隐私策略执行框架,在医学、金融和法律领域实现高AUROC和低误报率。

详情
AI中文摘要

标准的PII过滤器常常遗漏RAG系统中的上下文数据泄露,例如非受管制的属性集群共同识别个人身份。我们引入了一个隐私策略执行(PPE)框架,使用双单类密度估计器与融合文本嵌入,以及针对分布外输入的校准弃权区域。通过跨医学、金融和法律领域的轴分层、多LLM合成数据管道,我们发现传统的高斯混合基线在边界安全压力测试中失败,因为它们关注语言风格而非内容。我们提出的T3+OCSVM检测器,在安全和边界安全数据上训练,实现了0.93+的边界AUROC,同时将误报率降低44-55个百分点,并保持毫秒级延迟。与监督MLP分类器或14B参数LLM评判器相比,我们的框架提供了更优的操作适用性,因为前者具有高弃权率,后者存在延迟和校准问题。该方法为任何合成数据训练的分类器提供了稳健的压力测试标准。

英文摘要

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.

2604.26283 2026-06-02 cs.CV cs.AI 版本更新

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

MedSynapse-V:通过潜在记忆演化桥接视觉感知与临床直觉

Chunzheng Zhu, Jiaqi Zeng, Junyu Jiang, Jianxin Lin, Yijun Wang

发表机构 * Hunan University(湖南大学)

AI总结 提出MedSynapse-V框架,通过潜在诊断记忆演化模拟临床专家经验调用,解决医学视觉语言模型因离散分词导致的量化损失、长程信息消散和案例适应性问题,在诊断准确性上显著超越现有方法。

Comments Medical latent reasoning; Memory evolution

详情
AI中文摘要

高精度医学诊断不仅依赖于静态成像特征,还依赖于专家在图像解读过程中即时调用的隐式诊断记忆。我们指出了医学视觉语言模型中由于离散分词导致的基本认知错位,表现为量化损失、长程信息消散以及缺乏案例自适应专业知识。为弥合这一差距,我们提出了MedSynapse-V,一个用于潜在诊断记忆演化的框架,通过在模型隐藏流中动态合成隐式诊断记忆来模拟临床医生的经验调用。具体而言,它从元查询先验记忆机制开始,其中可学习的探针从解剖先验编码器中检索结构化先验,以生成压缩的隐式记忆。为确保临床保真度,我们引入了因果反事实细化(CCR),利用强化学习和基于区域级特征掩蔽的反事实奖励来量化每个记忆的因果贡献,从而修剪冗余并将潜在表示与诊断逻辑对齐。这一演化过程最终达到内在记忆转换(IMT),一种特权-自主双分支范式,通过全词汇散度对齐将教师分支的诊断模式内化到学生分支中。跨多个数据集的全面实证评估表明,通过将外部专业知识转化为内源参数,我们的方法在诊断准确性上显著优于现有最先进方法,特别是思维链范式。代码可在https://github.com/zhcz328/MedSynapse-V获取。

英文摘要

High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose ours, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that ours, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy. The code is available at https://github.com/zhcz328/MedSynapse-V.

2605.16451 2026-06-02 cs.LG cs.AI 版本更新

Physics-Guided Geometric Diffusion for Macro Placement Generation

物理引导的几何扩散用于宏单元布局生成

Jongho Yoon, Jinsung Jeon, Seokhyeong Kang

发表机构 * POSTECH Institute of Artificial Intelligence(POSTECH人工智能研究所) KAIST InnoCORE LLM(韩国科学技术院InnoCORE语言模型实验室) Seoul National University(首尔国立大学) Pohang University of Science and Technology(釜山科学技术大学)

AI总结 提出MacroDiff+框架,通过双域去噪架构和物理引导采样策略,在宏单元布局中同时优化拓扑连接和物理约束,在ISPD2005 MMS基准上实现线长减少6.1-6.2%。

Comments Accepted to IJCAI 2026. 9 pages, 5 figures

详情
AI中文摘要

宏单元布局是VLSI物理设计中的关键阶段,从根本上决定了芯片的整体性能。最近的数据驱动布局方法显示出巨大潜力,但它们往往难以处理序列依赖关系,并平衡拓扑连接与物理约束。为弥补这一差距,我们提出了MacroDiff+,一个物理引导的几何扩散框架。具体来说,我们设计了一个双域去噪架构,将异构GNN编码的拓扑连接与Transformer建模的全局几何上下文相结合。此外,我们引入了物理引导采样,一种推理策略,通过显式梯度主动引导生成,以确保统计合理性和物理有效性。在ISPD2005 MMS基准上,MacroDiff+优于最先进的基线,线长减少6.1-6.2%。值得注意的是,在先前方法无法收敛的大规模设计中,它表现出卓越的稳定性和可扩展性。源代码可在https://github.com/jhy00n/MacroDiff-plus获取。

英文摘要

Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data-driven placement methods have demonstrated significant potential, yet they often struggle to handle sequential dependencies and to balance topological connectivity with physical constraints. To bridge this gap, we propose MacroDiff+, a physics-guided geometric diffusion framework. Specifically, we design a dual-domain denoising architecture that couples topological connectivity encoded by heterogeneous GNNs with global geometric context modeled by a Transformer. Furthermore, we introduce Physics-Guided Sampling, an inference strategy that actively steers the generation using explicit gradients to ensure both statistical plausibility and physical validity. On the ISPD2005 MMS benchmarks, MacroDiff+ outperforms state-of-the-art baselines with a 6.1-6.2% reduction in wirelength. Notably, it exhibits superior stability and scalability on large-scale designs where prior methods fail to converge. The source code is available at https://github.com/jhy00n/MacroDiff-plus.

2605.16446 2026-06-02 cs.LG cs.AI 版本更新

Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating

避免表格公平半监督学习中的结构失效模式:基于置信门控的在线原始-对偶分配

Hangchuan Liang, Changchun Li

发表机构 * College of Computer Science and Technology, Jilin University, China(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, China(教育部符号计算与知识工程重点实验室)

AI总结 针对表格公平半监督学习中的结构冲突,提出在线原始-对偶分配(OPDA)方法,通过动态调度公平性和熵稳定性惩罚,避免掩码崩溃和平凡饱和两种失效模式,在多个基准上实现非退化运行点。

详情
AI中文摘要

半监督学习(SSL)能够在有限标签下进行预测,但高风险表格应用(医疗、信贷、再犯)需要统计公平性保证。通过诊断压力测试,我们识别出表格公平SSL中的结构冲突:在置信门控伪标签下,矩匹配公平正则化器可能触发两种失效模式——掩码崩溃(公平性侵蚀置信度,导致伪标签匮乏)和平凡饱和(漂移至常数预测器)。我们提出在线原始-对偶分配(OPDA),一种在线控制器,利用违规、风险和伪标签健康信号调度公平性和基于熵的稳定性惩罚,从而避免在该诊断机制下为每个数据集选择固定公平权重。在评估的表格基准(Adult、ACSIncome、COMPAS)上,OPDA缓解了静态权重和简单单信号自适应基线中观察到的退化状态。在Adult和COMPAS上,它产生了与经验静态λ前沿竞争的非退化运行点;在ACSIncome上,它保持了效用,同时具有更宽的公平-效用分布。相对于OPDA-lite,完整控制器主要在ACSIncome上将运行点向更高效用偏移,而Adult则突出了两种变体之间的公平-效用权衡。这些结果使OPDA成为表格公平SSL中无需校准的控制器,无需针对每个数据集进行调整即可获得非退化运行点。

英文摘要

Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes -- Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static-$λ$ frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.

2605.09366 2026-06-02 cs.AI 版本更新

Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

迈向虚拟神经科学家:基于多智能体协作的自主神经影像分析

Keqi Han, Songlin Zhao, Yao Su, Xiang Li, Yixuan Yuan, Lifang He, Carl Yang

发表机构 * Emory University(埃默里大学) Lehigh University(莱斯大学) Worcester Polytechnic Institute(沃思堡理工学院) Massachusetts General Hospital(麻省总医院) Harvard University(哈佛大学) Chinese University of Hong Kong(香港中文大学)

AI总结 提出NEXUS多智能体框架,通过代码中心执行和分层验证实现自主神经影像分析,在ADHD-200和ADNI数据集上优于传统工作流。

详情
AI中文摘要

将神经影像数据转化为临床可操作的生物标志物是一个知识密集型和劳动密集型过程。fMRIPrep等标准化工作流提高了鲁棒性和效率,但它们是静态配置的,无法像人类研究人员那样推理下游目标、权衡替代策略或在中级证据与后续决策之间形成闭环。这种闭环适应的缺失常常使领域专家陷入手动试错以调整参数和修复工作流失败的循环,严重限制了临床生物标志物开发的可扩展性。为弥补这一差距,我们引入了NEXUS,一个自主多智能体框架,它将神经影像工作流执行与科学目标理解相结合。与传统的平面工具调用智能体不同,NEXUS采用以代码为中心的执行范式,其中专业智能体在可组合的领域特定原语上协作合成和优化可执行程序。这种设计使得鲁棒的、长时程的工作流构建成为可能,并能动态适应运行时观察。此外,我们提出了一个用于自主质量控制的分层验证框架,将队列级指标筛选与智能体视觉检查相结合,以驱动基于证据的工作流修复。在ADHD-200和ADNI上的实验表明,NEXUS在预测性能上优于标准工作流基线,同时展现出复杂的智能体行为,包括策略探索和自适应改进。代码可在https://github.com/LearningKeqi/Virtual-Neuroscientist-NEXUS获取。

英文摘要

Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized workflows such as fMRIPrep have improved robustness and efficiency, but they are statically configured and cannot reason about downstream objectives, deliberate over alternative strategies, or close the loop between intermediate evidence and subsequent decisions in the way a human researcher would. This lack of closed-loop adaptation often leaves domain experts trapped in a cycle of manual trial-and-error to tune parameters and remediate pipeline failures, severely constraining the scalability of clinical biomarker development. To bridge this gap, we introduce NEXUS, an autonomous multi-agent framework that integrates neuroimaging workflow execution with scientific-objective understanding. Unlike conventional flat toolcalling agents, NEXUS adopts a code-centric execution paradigm where specialist agents collaboratively synthesize and optimize executable programs over composable domain-specific primitives. This design enables robust, long-horizon workflow construction that adapts dynamically to runtime observations. Furthermore, we propose a hierarchical verification framework for autonomous quality control, integrating cohort-level metric screening with agentic visual inspection to drive evidence-grounded workflow remediation. Experiments on ADHD-200 and ADNI demonstrate that NEXUS outperforms standard workflow-based baselines in predictive performance while exhibiting sophisticated agentic behaviors, including strategy exploration and adaptive refinement. The code is available at https://github.com/LearningKeqi/Virtual-Neuroscientist-NEXUS.

2603.05917 2026-06-02 cs.LG cs.AI q-fin.ST 版本更新

Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis

结合BERT情感分析的节点Transformer架构用于股票市场预测

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

发表机构 * University of Technology, Baghdad, Iraq(巴格达大学)

AI总结 提出一种将节点Transformer与BERT情感分析相结合的框架,通过图结构建模股票间依赖关系并融合社交媒体情感,在S&P 500股票上实现0.80%的MAPE,显著优于传统方法。

Comments 18 pages, 5 figures, 12 tables. Accepted for publication in IEEE Access

详情
Journal ref
IEEE Access, vol. 14, pp. 72613-72631, 2026
AI中文摘要

股票市场预测对在噪声、非平稳和行为动态的复杂市场环境中操作的投资者、金融机构和政策制定者提出了相当大的挑战。传统的预测方法,包括基本面分析和技术指标,往往无法捕捉金融市场中固有的复杂模式和横截面依赖性。本文提出了一种结合节点Transformer架构与基于BERT的情感分析的集成框架,用于股票价格预测。该模型将股票市场表示为图结构,其中个股构成节点,边捕捉关系,包括行业隶属关系、相关价格变动和供应链连接。一个微调的BERT模型从社交媒体帖子中提取情感信息,并通过基于注意力的融合机制将其与定量市场特征相结合。节点Transformer处理历史市场数据,同时捕捉股票间的时间演变和横截面依赖性。在1982年1月至2025年3月期间20只S&P 500股票上进行的实验表明,集成模型在一天前预测中实现了0.80%的平均绝对百分比误差(MAPE),而ARIMA为1.20%,LSTM为1.00%。情感分析的加入使预测误差总体降低10%,在财报公告期间降低25%,而基于图的架构通过捕捉股票间依赖性额外贡献了15%的改进。方向准确率在一天预测中达到65%。通过配对t检验的统计验证确认了这些改进的显著性(所有比较p < 0.05)。该模型在高波动期保持较低的误差,MAPE为1.50%,而基线模型范围为1.60%至2.10%。

英文摘要

Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods, including fundamental analysis and technical indicators, often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment information from social media posts and combines it with quantitative market features through attention-based fusion mechanisms. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments conducted on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. The inclusion of sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while the graph-based architecture contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms the significance of these improvements (p < 0.05 for all comparisons). The model maintains lower error during high-volatility periods, achieving MAPE of 1.50% while baseline models range from 1.60% to 2.10%.

2605.14791 2026-06-02 astro-ph.IM astro-ph.CO cs.AI 版本更新

Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

超越AI助手:迈向宇宙学中的自主发现

Licong Xu, Thomas Borrett

发表机构 * Institute of Astronomy, University of Cambridge(剑桥大学天文研究所) Kavli Institute for Cosmology, University of Cambridge(剑桥大学凯斯勒宇宙研究所) Cavendish Astrophysics, University of Cambridge(剑桥大学卡文迪许天体物理研究所)

AI总结 本文提出两种互补的智能体系统(CMBEvolve和CosmoEvolve),通过LLM引导的代码进化与树搜索以及虚拟多智能体研究实验室,实现宇宙学中的自主科学发现,并在弱引力透镜异常检测和ACT DR6数据分析中展示了初步成果。

Comments 4 pages, 2 figures, Contribution to the 2026 Cosmology session of the 60th Rencontres de Moriond

详情
AI中文摘要

人工智能智能体的最新进展正在将AI从工具推向自主科学发现。我们讨论了两种互补的宇宙学智能体系统: exttt{CMBEvolve},通过LLM引导的代码进化和树搜索,针对具有明确定量目标的任务;以及 exttt{CosmoEvolve},通过虚拟多智能体研究实验室,针对开放式科学工作流。作为初步演示,我们将 exttt{CMBEvolve}应用于弱引力透镜图中的分布外检测,通过代码进化迭代改进基准分数;将 exttt{CosmoEvolve}应用于自主ACT DR6数据分析,识别出非平凡的成对和尺度依赖行为,并生成分析级诊断。这些例子展示了宇宙学如何为AI科学家系统的发展提供受控基准任务和现实开放研究问题。

英文摘要

Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two complementary agentic systems for cosmology: \texttt{CMBEvolve}, which targets tasks with explicit quantitative objectives through LLM-guided code evolution and tree search, and \texttt{CosmoEvolve}, which targets open-ended scientific workflows through a virtual multi-agent research laboratory. As preliminary demonstrations, we apply \texttt{CMBEvolve} to out-of-distribution detection in weak-lensing maps, where it iteratively improves the benchmark score through code evolution, and \texttt{CosmoEvolve} to autonomous ACT DR6 data analysis, where it identifies non-trivial pair- and scale-dependent behaviour and produces analysis-grade diagnostics. These examples show how cosmology can provide both controlled benchmark tasks and realistic open-ended research problems for the development of AI scientist systems.

2605.14398 2026-06-02 cs.AI 版本更新

Coding Agent Is Good As World Simulator

编码智能体作为世界模拟器

Hongyu Wang, Jingquan Wang, Bocheng Zou, Radu Serban, Dan Negrut

发表机构 * Department of Mechanical & Aerospace Engineering, University of Wisconsin-Madison(威斯康星大学麦迪逊分校机械与航空航天工程系) School of Computer, Data, and Information Sciences, University of Wisconsin-Madison(威斯康星大学麦迪逊分校计算机、数据与信息科学学院)

AI总结 提出一个通过可执行模拟代码构建基于物理的世界模型的智能体框架,协调规划、代码生成、视觉审查和物理分析智能体,迭代修正代码以满足物理约束,在物理准确性、指令忠实度和视觉质量上超越视频模型。

详情
AI中文摘要

世界模型已成为构建交互式模拟环境的强大范式,最近的基于视频的方法在生成视觉上合理的动态方面展示了令人印象深刻的进展。然而,由于这些模型通常从视频中推断动态并以潜在状态表示,它们没有明确强制执行物理约束。因此,生成的视频展开在物理上不合理,表现出不稳定的接触、扭曲的形状或不一致的运动。在本文中,我们提出了一个通过可执行模拟代码构建基于物理的世界模型的智能体框架。该框架协调规划、代码生成、视觉审查和物理分析智能体。规划智能体将自然语言提示转换为结构化场景计划,代码智能体将其实现为可执行模拟代码,视觉审查智能体提供视觉反馈,而物理分析智能体检查物理一致性。代码根据反馈进行迭代修订,直到模拟符合提示要求和物理约束。实验结果表明,我们的框架在物理准确性、指令忠实度和视觉质量方面优于先进的基于视频的模型,可应用于各种场景,包括驾驶模拟和具身机器人任务。

英文摘要

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework constructing physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, and the visual review agent provide visual feedback while the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt reqirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity and visual quality, which could be applied to various scenarios including driving simulation and embodied robot tasks.

2605.13527 2026-06-02 cs.AI 版本更新

MMSkills: Towards Multimodal Skills for General Visual Agents

MMSkills: 面向通用视觉智能体的多模态技能

Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Xiaohongshu Inc.(小红书公司) Southeast University(东南大学)

AI总结 提出MMSkills框架,通过将多模态程序知识编码为紧凑的状态条件包(包含文本程序、运行时状态卡和多视角关键帧),并利用轨迹到技能生成器和分支加载多模态技能智能体,显著提升GUI和游戏场景下视觉智能体的决策能力。

Comments 25 pages, 8 figures, 8 tables. Project page: https://zkangning.github.io/MMSkills_for_Visual_Agents/

详情
AI中文摘要

可复用技能已成为提升智能体能力的核心基础,然而现有大多数技能包主要将可复用行为编码为文本提示、可执行代码或学习例程。但对于视觉智能体而言,程序知识本质上是多模态的:复用不仅取决于执行什么操作,还取决于识别相关状态、解释进展或失败的视觉证据,以及决定下一步行动。我们将这一需求形式化为多模态程序知识,并解决三个实际挑战:(I) 多模态技能包应包含什么;(II) 这些包可以从公共交互经验中从哪里获取;(III) 智能体如何在推理时参考多模态证据,而无需过多的图像上下文或过度依赖参考截图。我们引入MMSkills,一个用于表示、生成和使用可复用多模态程序以支持运行时视觉决策的框架。每个MMSkill是一个紧凑的状态条件包,将文本程序与运行时状态卡和多视角关键帧耦合。为构建这些包,我们开发了一个智能体轨迹到技能生成器,通过工作流分组、程序归纳、视觉定位和元技能引导审计,将公共非评估轨迹转化为可复用的多模态技能。为使用它们,我们引入了一个分支加载多模态技能智能体:在临时分支中检查选定的状态卡和关键帧,与实时环境对齐,并提炼为结构化指导供主智能体使用。在GUI和游戏视觉智能体基准上的实验表明,MMSkills持续提升前沿和较小规模的多模态智能体,表明外部多模态程序知识补充了模型内部先验。

英文摘要

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

2605.09692 2026-06-02 cs.AI 版本更新

Causal state binding predicts action control in language agents

因果状态绑定预测语言智能体中的动作控制

Xiao Jia

发表机构 * School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(人工智能学院,香港中文大学(深圳))

AI总结 提出因果状态绑定框架,通过干预实验测量智能体动作是否随事件特定决定性状态变化,验证了结构化智能体在动作控制上优于随机基线。

Comments 85 pages, 5 main figures; supplementary information included

详情
AI中文摘要

自主语言智能体越来越多地暴露痕迹、记忆、计划和约束,但现有评估很少测试这些状态变量是否与最终动作绑定。我们引入了因果状态绑定,一种干预耦合的评估框架,用于测量动作是否随事件特定的决定性状态变化,同时对无关线索保持不变。主要读出是一个隐藏目标有限动作基准,其中评分者侧干预目标在生成前分配,并从模型可见提示中隐藏。在七个语料库级别的57,816条评分记录中,结构化智能体条件在推理、记忆、否决和自连续性响应方面超过了高随机性控制和目标组件移除。跨Qwen2.5 7B、14B和32B以及Mistral-7B的开源权重验证显示,动作先验、无字段提示和打乱的决定性上下文未能恢复结构化控制特征。在诊断性有限动作探测中,最小决定性字段读出恢复了规定的动作模式,而仅表面、仅动作先验和打乱字段控制则没有。在300条SWE-bench Lite问题记录和六个API模型上,将无预言机的因果状态绑定组合添加到完整非CSB基线中,将约束清洁的问题到文件命中@3 AUC从0.873提高到0.935。该验证涉及问题到文件定位,而非补丁应用或SWE-bench问题解决。这些结果支持智能体评估的测量原则:动作控制由事件特定的状态-动作绑定预测,而非仅由输出熵、动作先验匹配或推理格式预测。

英文摘要

Autonomous language agents increasingly expose traces, memories, plans and constraints, but existing evaluations rarely test whether these state variables are bound to final actions. We introduce causal state binding, an intervention-coupled evaluation framework that measures whether actions change with the event-specific decisive state while remaining invariant to irrelevant cues. The primary readout is a hidden-target finite-action benchmark in which scorer-side intervention targets are assigned before generation and withheld from the model-visible prompt. Across 57,816 scored records in seven corpus-level units, structured-agent conditions exceeded high-randomness controls and targeted component removals on reason, memory, veto and self-continuity responsiveness. Open-weight validation across Qwen2.5 7B, 14B and 32B plus Mistral-7B showed that action priors, no-field prompts and scrambled decisive context did not recover the structured-control signature. In diagnostic finite-action probes, the minimal decisive-field readout recovered the prescribed action pattern whereas surface-only, action-prior-only and scrambled-field controls did not. Across 300 SWE-bench Lite issue records and six API models, adding an oracle-free causal state-binding composite to a full non-CSB baseline increased constraint-clean issue-to-file hit@3 AUC from 0.873 to 0.935. This validation concerns issue-to-file localization, not patch application or SWE-bench issue resolution. These results support a measurement principle for agent evaluation: action control is predicted by event-specific state-action binding, not by output entropy, action-prior matching or rationale format alone.

2605.08193 2026-06-02 cs.CV cs.AI 版本更新

Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

任意骨干网络的归一化等变性及其在图像去噪中的应用

Youssef Saied, François Fleuret

发表机构 * University of Cambridge(剑桥大学) DeepMind

AI总结 提出无参数包装器WNE,通过输入归一化、任意骨干网络处理、输出反归一化实现归一化等变,在盲去噪中提升CNN和Transformer对噪声水平失配的鲁棒性且无GPU开销。

详情
AI中文摘要

归一化等变性(NE)是一种结构先验,可提高图像到图像任务中对分布偏移的鲁棒性。函数 $f$ 是归一化等变的当且仅当对于所有 $a>0$ 和 $b\in\mathbb{R}$,有 $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$。现有的NE方法将每个内部层约束为与NE兼容的操作。这些约束增加了运行时成本,并排除了标准的Transformer组件,如softmax注意力和LayerNorm。我们引入了包装归一化等变性(WNE),这是一种无参数包装器,它对输入进行归一化,应用任意骨干网络,然后对输出进行反归一化。我们证明了每个NE函数都允许这种分解,因此该包装器精确参数化了NE函数类。在盲去噪中,包装CNN和Transformer架构在噪声水平失配下提高了鲁棒性,且没有可测量的GPU开销,而架构性NE基线则慢达 $1.6$ 倍。

英文摘要

Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f$ is normalization equivariant iff $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ for all $a>0$ and $b\in\mathbb{R}$. Existing NE methods constrain every internal layer to NE-compatible operations. These constraints add runtime cost and exclude standard transformer components such as softmax attention and LayerNorm. We introduce Wrapped Normalization Equivariance (WNE), a parameter-free wrapper that normalizes the input, applies any backbone, and denormalizes the output. We prove every NE function admits this factorization, so the wrapper exactly parameterizes the class of NE functions. On blind denoising, wrapping CNN and transformer architectures improves robustness under noise-level mismatch with no measurable GPU overhead, while architectural NE baselines are up to $1.6\times$ slower.

2605.13834 2026-06-02 cs.LG cs.AI cs.CG 版本更新

Topology-Preserving Neural Operator Learning via Hodge Decomposition

通过Hodge分解保持拓扑的神经算子学习

Dongzhe Zheng, Tao Zhong, Christine Allen-Blanchette

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文从函数空间视角研究几何网格上物理场方程的解算子,利用Hodge正交性分离不可学习的拓扑自由度与可学习的几何动力学,提出基于Hodge谱对偶的混合欧拉-拉格朗日架构,在保持物理不变量的同时提升几何图上的精度与效率。

Comments Accepted at ICML 2026. Code available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

详情
AI中文摘要

本文从函数空间视角研究几何网格上物理场方程的解算子。我们发现Hodge正交性通过将不可学习的拓扑自由度与可学习的几何动力学分离,从根本上解决了谱干扰问题,从而实现了局限于保结构子空间的加性逼近。基于Hodge理论和算子分裂,我们推导出原则性的算子级分解。结果是一种混合欧拉-拉格朗日架构,具有我们称为Hodge谱对偶(HSD)的代数级归纳偏置。在我们的框架中,我们使用离散微分形式捕捉拓扑主导的分量,并使用正交辅助环境空间表示复杂的局部动力学。我们的方法在几何图上实现了优越的准确性和效率,并增强了对物理不变量的保真度。我们的代码可在https://github.com/ContinuumCoder/Hodge-Spectral-Duality获取。

英文摘要

In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure-preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator-level decomposition. The result is a Hybrid Eulerian-Lagrangian architecture with an algebraic-level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology-dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

2605.13430 2026-06-02 stat.ME cs.AI cs.LG 版本更新

Towards a holistic understanding of Selection Bias for Causal Effect Identification

走向因果效应识别中选择偏差的整体理解

Yiwen Qiu, Filip Kovačević, Shimeng Huang, Peter Spirtes, Francesco Locatello

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 研究在观测研究中存在选择偏差时,如何利用弱假设刻画倾向得分和选择概率,给出平均处理效应可识别性的充要条件,扩展了现有图形识别准则。

Comments 9 pages for the main text, ICML 2026

详情
AI中文摘要

选择偏差在观测研究中普遍存在。例如,大规模生物库数据可能表现出“健康志愿者偏差”,即受访者比他们所要代表的人群更健康、社会经济地位更高。从这样的子人群中恢复因果效应是因果推断中的一个重要问题,因为从选定人群估计平均处理效应(ATE)可能导致对整个群体的ATE估计严重偏倚。本文研究了选择偏差下ATE的可识别性。我们利用概率类的弱假设刻画倾向得分和选择概率,给出了ATE可识别性的充要条件。与以往工作相比,我们的结果扩展了现有的图形可识别性准则,并在存在选择偏差的情况下,以严格更弱的条件提供了对因果效应识别更全面的理解。

英文摘要

Selection bias is pervasive in observational studies. For example, large scale biobanks data can exhibit ``healthy volunteer bias'' when respondents are healthier and of higher socio-economic status than the population they are meant to represent. Recovering causal effects from such sub-population is an important problem in causal inference, as estimating average treatment effects (ATE) from selected populations can result in a severely biased estimate of the ATE from the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We provide necessary and sufficient conditions for ATE identifiability, leveraging weak assumptions on probability classes to characterize propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identification with strictly weaker conditions in the presence of selection bias.

2505.12741 2026-06-02 cs.AI 版本更新

Language Model Networks: Supervision-Efficient Learning through Dense Communication

语言模型网络:通过密集通信实现监督高效学习

Shiguang Wu, Yaqing Wang, Quanming Yao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LMNet,一种密集可微的语言模型网络,通过可训练的序列到序列模块作为通信边,实现节点间密集向量交换,支持端到端梯度优化和高效信息传递,在有限监督下实现有效适应。

详情
AI中文摘要

语言模型不仅被用作独立的预测器,还越来越多地作为更大推理系统的组件,从测试时扩展到多智能体协作。我们研究语言模型网络,其中预训练语言模型作为可重用节点,智能从其拓扑、通信和优化中涌现。现有系统主要通过自然语言通信:易于部署,但离散、低效,且难以从最终任务监督中优化。我们提出LMNet,该范式的密集且可微的实现。LMNet使用精简的LLM作为顶点模块,可训练的序列到序列模块作为通信边,使中间节点能够交换密集向量,同时在系统边界保留自然语言的输入和输出。通过绕过中间嵌入和解嵌入,LMNet实现了高效的信息传输、端到端梯度优化以及超越手工设计协议的学习通信。实验表明,在少量额外训练成本下性能良好,并在有限监督下实现有效适应。

英文摘要

Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time scaling to multi-agent collaboration. We study language model networks, where pre-trained language models serve as reusable nodes and intelligence emerges from their topology, communication, and optimization. Existing systems mostly communicate through natural language: easy to deploy, but discrete, inefficient, and hard to optimize from end-task supervision. We propose LMNet, a dense and differentiable realization of this paradigm. LMNet uses stripped LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary. By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization, and learned communication beyond hand-designed protocols. Experiments show performance with small additional training cost and effective adaptation under limited supervision.

2605.13178 2026-06-02 cs.CV cs.AI 版本更新

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models

CLIP Tricks You: 面向大型视觉-语言模型中高效像素定位的无训练令牌剪枝

Sangin Lee, Yukyung Choi

发表机构 * KAIST(韩国科学技术院)

AI总结 提出LiteLVLM,一种无需训练、文本引导的令牌剪枝策略,通过反转CLIP视觉-文本相似度排序,保留指代区域令牌并恢复上下文令牌,实现高效像素定位推理,在多种令牌预算下性能提升超5%,保持90%原始性能同时加速22%并减少2.3倍内存。

Comments Accepted by ICML 2026

详情
AI中文摘要

在大型视觉-语言模型中,视觉令牌通常构成输入令牌的大部分,导致大量计算开销。为了解决这个问题,最近的研究探索了为图像理解任务剪枝冗余或信息量较少的视觉令牌。然而,这些方法在像素定位任务中表现不佳,因为令牌重要性高度依赖于输入文本。通过对CLIP的深入分析,我们观察到指代区域内的视觉令牌与其文本表示的相似度通常较低。受此启发,我们引入了LiteLVLM,一种无需训练、文本引导的令牌剪枝策略,用于高效的像素定位推理。通过反转CLIP视觉-文本相似度的排序,LiteLVLM有效地保留了覆盖指代区域的视觉令牌,同时恢复上下文令牌以实现清晰的前景-背景分离。大量实验表明,LiteLVLM在不同令牌预算下均显著优于现有方法,性能提升超过5%。无需任何训练或微调,LiteLVLM在保持90%原始性能的同时,实现了22%的加速和2.3倍的内存减少。我们的代码可在https://github.com/sejong-rcv/LiteLVLM获取。

英文摘要

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens within referent regions often exhibit low similarity to their textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3X memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.

2605.12895 2026-06-02 cs.LG cs.AI cs.CY stat.AP 版本更新

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

RISED:高风险AI决策支持系统的部署前评估框架,及其在医疗中的应用

Rohith Reddy Bellibatlu, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal, Abhishek Israni

发表机构 * Florida International University(佛罗里达国际大学) Boston University(波士顿大学) New York University(纽约大学) University of Maryland(马里兰大学) Boston University School of Public Health(波士顿大学公共卫生学院)

AI总结 提出RISED框架,通过BCa bootstrap置信区间、文献阈值和Holm-Bonferroni校正的PASS/FAIL/INCONCLUSIVE判定,从五个维度评估高风险AI决策支持系统,在医疗等数据集上发现AUROC无法揭示的失败模式。

Comments 39 pages, 7 figures, 15 tables. Code at https://github.com/rohithreddybc/rised-healthcare-eval and dataset at https://doi.org/10.57967/hf/8734 (Hugging Face). To be submitted to Expert Systems with Applications (Elsevier)

详情
AI中文摘要

临床决策支持系统是专家系统,临床医生直接根据其建议行动,但通常仅通过保留测试集上的一个总体准确率数字来批准。这个数字对编码偏移下的输入可靠性、子组差距、阈值敏感性或操作可行性毫无说明。我们提出RISED,一个部署前评估框架,通过BCa bootstrap 95%置信区间、基于文献的阈值和Holm-Bonferroni校正的PASS/FAIL/INCONCLUSIVE判定,操作化五个维度(可靠性、包容性、敏感性、公平性、可部署性);公平性是一个代理依赖诊断而非门控测试。应用于跨越35年的七个队列(n从303到99,492),RISED揭示了AUROC无法发现的失败:在Diabetes 130上,可靠性通过三个数量级(PSS = 0.0004),而包容性(AUC差距 = 0.262)和敏感性(最大阈值翻转率49.1%)明确失败;两个NHIS队列也重复了这一点。具有完整特征配置的NHANES 2021-2023获得了INCONCLUSIVE判定;BRFSS 2024在仪器旋转移除高血压和胆固醇后产生了该套件中最严重的敏感性失败(最大阈值翻转率64.2%)。该模式在信用和收入预测队列上重复出现,证实了领域无关性;多模型检查显示失败是数据驱动的,而非模型特定的。RISED作为开源Python包发布,补充了TRIPOD+AI、FUTURE-AI和Fairlearn,提供了这些标准要求但未规定的结构化数值证据。

英文摘要

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.

2605.12813 2026-06-02 cs.CL cs.AI cs.CR cs.LG 版本更新

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

REALISTA: 引发LLM幻觉的逼真潜在对抗攻击

Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出REALISTA框架,通过潜在空间优化语义等价的对抗提示,有效引发大语言模型幻觉,优于现有方法。

Comments Accepted at ICML 2026. Code is available at https://github.com/Buyun-Liang/REALISTA

详情
AI中文摘要

大型语言模型(LLM)在许多任务上表现出色,但仍然容易产生幻觉,因此有必要在逼真的对抗输入下系统地评估其可靠性。我们将幻觉引发问题形式化为一个约束优化问题,目标是找到与良性用户提示语义等价的对抗提示。现有攻击方法仍有局限:基于离散提示的攻击保持语义等价性和连贯性,但仅搜索有限的提示变体;而连续潜在空间攻击探索更丰富的空间,但通常解码为不再有效改写的提示。为解决这些局限,我们提出REALISTA,一个逼真的潜在空间攻击框架。REALISTA构建了一个依赖于输入的合法编辑方向字典,每个方向对应一个语义等价且连贯的改写,并在潜在空间中优化这些方向的连续组合。这种设计结合了连续攻击的优化灵活性和基于离散改写的攻击的语义逼真性。实验表明,REALISTA在开源LLM上达到优于或与最先进逼真攻击相当的性能,并且关键的是,在自由形式响应设置下成功攻击大型推理模型,而先前的逼真攻击则失败。代码可在https://github.com/Buyun-Liang/REALISTA获取。

英文摘要

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to systematically evaluate their reliability under realistic adversarial inputs. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing attack methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.

2605.12652 2026-06-02 cs.LG cs.AI 版本更新

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

基于同伴成功与失败的多轨迹在线策略蒸馏

Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软) Purdue University(普渡大学)

AI总结 提出多轨迹在线策略蒸馏(MOPD),利用学生模型的本地轨迹组构造更丰富的教师信号,通过同伴成功与失败条件化提升蒸馏效果。

Comments 23 pages

详情
AI中文摘要

大型语言模型通常使用稀疏验证器奖励进行后训练,该奖励指示采样轨迹是否成功,但对推理成功或失败的位置提供有限指导。在线策略蒸馏(OPD)通过训练学生生成的轨迹提供更密集的令牌级监督,但现有方法通常独立蒸馏每个轨迹,忽略为同一提示采样的其他尝试。我们引入多轨迹在线策略蒸馏(MOPD),一种同伴条件化蒸馏框架,利用学生的本地轨迹组构造信息更丰富的教师信号。MOPD 将教师条件化于成功和失败的同伴轨迹:成功为有效推理模式提供正面证据,而失败则为要避免的合理错误提供结构化负面证据。我们研究了两种同伴上下文构建:正面同伴模仿和对比性成功-失败条件化。在竞争性编程、数学推理、科学问答和工具使用基准上的实验表明,MOPD 持续优于标准在线策略基线。进一步的教师信号分析表明,混合成功-失败上下文能更好地使教师分数与验证器奖励对齐,表明性能提升源于更忠实、实例自适应的监督。这些结果表明,有效的在线策略蒸馏应利用学生的多轨迹试错行为,而不是将轨迹视为孤立样本。

英文摘要

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

2602.23161 2026-06-02 cs.AI 版本更新

PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering

PATRA: 面向时间序列问答的模式感知对齐与平衡推理

Junkai Lu, Peng Chen, Xingjian Wu, Yang Shu, Chenjuan Guo, Christian S. Jensen, Bin Yang

发表机构 * East China Normal University, Shanghai, China(华东师范大学) Aalborg University, Aalborg, Denmark(奥胡斯大学)

AI总结 针对现有LLM方法在时间序列推理中忽略模式提取和简单任务主导学习的问题,提出模式感知对齐与平衡推理模型PATRA,通过提取趋势和季节模式实现深度对齐,并设计任务感知平衡奖励以协调不同难度任务的学习,在多种时间序列问答任务中优于强基线。

Comments Accepted By ICML 2026

详情
AI中文摘要

时间序列推理既需要感知复杂动态,也需要逻辑深度。然而,现有的基于LLM的方法存在两个局限性:它们通常仅将时间序列视为文本或图像,未能捕捉回答特定问题所需的趋势和季节性等模式;并且当在简单和复杂任务的混合数据上训练时,较简单的目标往往主导学习过程,阻碍深层推理能力的发展。为解决这些局限性,我们提出了模式感知对齐与平衡推理模型(PATRA),引入了一种模式感知机制,从时间序列中提取趋势和季节性模式以实现深度对齐。此外,我们设计了一种任务感知的平衡奖励,以协调不同难度任务之间的学习,激励生成连贯的思维链。大量实验表明,PATRA在多种时间序列问答(TSQA)任务中优于强基线,展示了卓越的跨模态理解和推理能力。

英文摘要

Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two limitations: they often treat time series merely as text or images, failing to capture the patterns like trends and seasonalities needed to answer specific questions; and when trained on a mix of simple and complex tasks, simpler objectives often dominate the learning process, hindering the development of deep reasoning capabilities. To address these limitations, we propose the Pattern-Aware Alignment and Balanced Reasoning model (PATRA), introducing a pattern-aware mechanism that extracts trend and seasonality patterns from time series to achieve deep alignment. Furthermore, we design a task-aware balanced reward to harmonize learning across tasks of varying difficulty, incentivizing the generation of coherent Chains of Thought. Extensive experiments show that PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.

2512.12634 2026-06-02 cs.AI 版本更新

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

MobiBench: 面向移动GUI智能体的多分支模块化基准

Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee

发表机构 * KAIST(韩国科学技术院) Sungkyunkwan University(全北大学) Korea University(韩国大学) Fluiz

AI总结 提出MobiBench,首个模块化且支持多路径感知的离线基准测试框架,用于高保真、可扩展和可复现地评估移动GUI智能体,并揭示组件级性能瓶颈。

详情
AI中文摘要

移动GUI智能体,即能够代表用户与移动应用交互的AI智能体,有潜力改变人机交互。然而,当前GUI智能体的评估实践存在两个基本限制。首先,它们要么依赖单路径离线基准,要么依赖在线实时基准。使用静态、单路径标注数据集的离线基准不公平地惩罚有效的替代动作,而在线基准由于实时评估的动态和不可预测性,面临可扩展性和可复现性差的问题。其次,现有基准将智能体视为单一黑盒,忽略了各个组件的贡献,这常常导致不公平的比较或掩盖关键性能瓶颈。为了解决这些限制,我们提出了MobiBench,这是首个模块化且支持多路径感知的移动GUI智能体离线基准测试框架,能够在完全离线环境下实现高保真、可扩展和可复现的评估。我们的实验表明,MobiBench与人类评估者的一致性达到94.72%,与精心设计的在线基准相当,同时保留了静态离线基准的可扩展性和可复现性。此外,我们全面的模块级分析揭示了几个关键见解,包括对移动GUI智能体中使用的多种技术的系统评估、跨模型规模的最佳模块配置、当前LFM的固有限制,以及设计更强大且成本效益更高的移动智能体的可操作指南。

英文摘要

Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they either rely on single path offline benchmarks or online live benchmarks. Offline benchmarks using static, single path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi path aware offline benchmarking framework for mobile GUI agents that enables high fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost efficient mobile agents.

2605.12400 2026-06-02 cs.LG cs.AI 版本更新

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

OGLS-SD:基于结果引导的对数几率操控的在线自蒸馏用于大语言模型推理

Yuxiao Yang, Xiaoyun Wang, Weitong Zhang

发表机构 * UNC Chapel Hill(UNC夏洛特山分校)

AI总结 提出OGLS-SD框架,通过结果奖励校准教师对数几率,解决在线自蒸馏中师生响应模式不匹配导致的训练不稳定问题,提升数学推理性能。

Comments 17 pages, 10 figures, 5 tables

详情
AI中文摘要

我们研究在线自蒸馏(OPSD),其中语言模型通过沿其自身在线轨迹蒸馏特权教师分布来提高推理能力。尽管有前景,OPSD可能因教师和学生响应之间的模式不匹配而遭受训练不稳定。自我反思的教师响应可能引入反思引起的偏差和响应模板,从而错误校准令牌级监督,最终损害学生的推理能力。为缓解此问题,我们提出OGLS-SD,一种结果引导的对数几率操控框架,利用可验证的结果奖励来校准特权教师对数几率。具体而言,OGLS-SD对比由成功和失败的在线轨迹诱导的教师对数几率,构建一个结果判别性的操控方向用于令牌级指导。在数学推理基准上的实验表明,OGLS-SD稳定了自蒸馏,并提高了相对于标准OPSD和其他变体的性能。

英文摘要

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance. Experiments on mathematical reasoning benchmarks show that OGLS-SD stabilizes self-distillation and improves performance over standard OPSD and other variants.

2602.08058 2026-06-02 cs.CV cs.AI cs.RO cs.SY eess.SY 版本更新

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Picasso: 基于物理约束采样的整体场景重建

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

发表机构 * Massachusetts Institute of Technology(麻省理工学院) National University of Singapore(新加坡国立大学)

AI总结 提出Picasso,一种通过快速拒绝采样推理多物体交互并考虑几何、非穿透和物理约束的整体场景重建方法,在物理合理性和重建精度上显著优于现有技术。

Comments 15 pages, accepted to Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

在存在遮挡和测量噪声的情况下,几何精确的场景重建(即拟合传感器数据)仍然可能在物理上不正确。例如,当估计场景中物体的姿态和形状并将结果导入模拟器时,微小误差可能导致不合理的配置,包括物体相互穿透或不稳定平衡。这使得使用数字孪生预测场景的动态行为变得困难,而这是基于模拟的接触丰富行为规划和控制的重要步骤。在本文中,我们认为物体姿态和形状估计需要对场景进行整体推理(而不是孤立地推理每个物体),考虑物体交互和物理合理性。为此,我们的第一个贡献是Picasso,一个受物理约束的重建流水线,通过考虑几何、非穿透和物理来构建多物体场景重建。Picasso依赖于一种快速拒绝采样方法,该方法推理多物体交互,利用推断的物体接触图来指导采样。其次,我们提出了Picasso数据集,这是一个包含10个接触丰富真实场景的集合,带有真实标注,以及一个量化物理合理性的指标,我们将其作为基准测试的一部分开源。最后,我们在新引入的数据集和YCB-V数据集上对Picasso进行了广泛评估,结果表明它在提供物理合理且更符合人类直觉的重建的同时,大幅优于现有技术。

英文摘要

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.

2603.29002 2026-06-02 cs.DC cs.AI 版本更新

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

理解并加速大型语言模型推理的内存处理流水线

Zifan He, Rui Ma, Yizhou Sun, Jason Cong

发表机构 * GitHub

AI总结 本文通过将稀疏注意力、检索增强生成和压缩上下文内存等优化统一为四步内存处理流水线,识别出22%-97%的内存处理开销,并提出使用GPU-FPGA异构系统加速该流水线,实现最高2.2倍加速和4.7倍能效提升。

Comments Accepted by ICML 2026. Code: https://github.com/OswaldHe/HeteroLLM

详情
AI中文摘要

现代大型语言模型(LLMs)越来越依赖于高效的长上下文处理和生成机制,包括稀疏注意力、检索增强生成(RAG)和压缩上下文内存,以支持复杂推理。我们表明这些优化可以统一为一个四步内存处理流水线:准备内存、计算相关性、检索和应用到推理。通过系统分析,我们识别出LLM推理中22%-97%的内存处理开销及其计算特征的强异构性。受此洞察启发,我们认为异构系统非常适合加速内存处理,从而加速端到端推理。我们在GPU-FPGA系统上展示了这种方法,将稀疏、不规则和内存受限的操作卸载到FPGA,同时将计算密集型操作保留在GPU上。在AMD MI210 GPU和Alveo U55C FPGA上评估,我们的系统在多种LLM推理优化中比GPU基线快高达2.2倍,能耗降低高达4.7倍(在NVIDIA A100上结果类似)。这些结果确立了异构系统作为高效LLM内存处理的实用方向,并为未来异构硬件设计提供参考。

英文摘要

Modern large language models (LLMs) increasingly depends on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bounded operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is up to $2.2\times$ faster and achieves up to $4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.

2605.09907 2026-06-02 cs.AI cs.MA 版本更新

RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation

RADAR:面向多智能体通信结构生成的冗余感知扩散方法

Zhen Zhang, Wanjing Zhou, Juncheng Li, Hao Fei, Jun Wen, Wei Ji

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种基于条件离散图扩散模型的冗余感知生成框架RADAR,通过逐步生成通信拓扑并利用图有效尺寸引导,在六项基准上实现更高准确率、更低令牌消耗和更强鲁棒性。

Comments Accepted by ICML 2026 (fix typos)

详情
AI中文摘要

与单个智能体相比,基于大语言模型的多智能体系统在代码生成、数学推理和规划等不同任务上持续展现出强大的能力。尽管性能令人印象深刻,但这些系统的有效性和鲁棒性在很大程度上依赖于其通信拓扑,而通信拓扑通常是固定的或单步生成的。这限制了细粒度的结构探索和灵活的组合,导致在简单任务上过度使用令牌,同时在复杂任务上能力受限。为了缓解这一挑战,我们引入了RADAR,一个冗余感知且查询自适应的生成框架,主动减少通信开销。受条件离散图扩散模型最新进展的启发,我们将通信拓扑设计表述为一个逐步生成的过程,并由图的有效尺寸引导。在六个基准上的全面实验表明,RADAR在多种场景下始终优于最近的基线方法,实现了更高的准确率、更低的令牌消耗和更强的鲁棒性。我们的代码和数据可在 https://github.com/cszhangzhen/RADAR 获取。

英文摘要

Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse tasks, including code generation, mathematical reasoning, and planning, etc. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, resulting in excessive token utilization on simple tasks while limiting capability on complicated tasks. To mitigate this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduce communication overhead. Motivated by recent progress in conditional discrete graph diffusion models, we formulate communication topology design as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our code and data are available at https://github.com/cszhangzhen/RADAR.

2605.09883 2026-06-02 cs.CV cs.AI 版本更新

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

笛卡尔捷径:在极坐标空间中重新评估视觉推理

Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

发表机构 * Stanford University(斯坦福大学) Google Research(谷歌研究院)

AI总结 针对多模态大语言模型在视觉推理中利用笛卡尔坐标捷径的问题,提出Polaris-Bench基准,将任务转换至极坐标空间,揭示模型缺乏拓扑不变性视觉推理。

详情
AI中文摘要

随着当前多模态大语言模型迅速饱和标准视觉推理基准,一个关键问题浮现:这些高分是否真正反映了鲁棒的视觉理解?我们发现了一个普遍存在的漏洞,即笛卡尔捷径:视觉推理基准普遍基于正交网格布局,这些布局可以轻易地离散化为显式的文本坐标。模型系统地利用这一特性,大量依赖基于文本的演绎推理来辅助视觉问题解决。为了系统地消除这一捷径,我们引入了Polaris-Bench,该基准将53个视觉推理任务重新表述在极坐标空间中,并配有对应的笛卡尔坐标作为参考,同时保持一致的逻辑约束和任务语义——从而从根本上打破了模型所利用的正交先验。对14个最先进MLLM的全面评估显示,在笛卡尔布局上达到70%-83%的前沿模型在极坐标等价布局上骤降至31%-39%,即使在完全逻辑等价的情况下,性能下降依然持续。此外,在笛卡尔布局上观察到的推理增益在极坐标等价布局上严重减弱。这些发现揭示了当前MLLM的一个关键缺陷:缺乏拓扑不变的视觉推理。

英文摘要

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across $14$ state-of-the-art MLLMs reveals that frontier models achieving $70$--$83\%$ on Cartesian layouts collapse to $31$--$39\%$ on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.

2605.09253 2026-06-02 cs.CL cs.AI 版本更新

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

基石还是绊脚石?解读在线策略蒸馏中的岩石令牌

Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩分校) Case Western Reserve University(凯斯西储大学) Arizona State University(亚利桑那州立大学) VU Amsterdam(阿姆斯特丹自由大学)

AI总结 本文研究在线策略蒸馏中持续高损失的“岩石令牌”,发现它们虽占据大量梯度但功能贡献微弱,提出绕过这些令牌可简化对齐过程。

详情
AI中文摘要

尽管近期关于可验证奖励强化学习(RLVR)的研究表明,一小部分关键令牌不成比例地驱动推理增益,但在线策略蒸馏(OPD)中类似的令牌级理解仍未探索。本文研究了高损失令牌——在OPD的逐令牌KL目标下,作为师生不匹配的最直接信号,根据现有研究,这些令牌应随着训练收敛而逐渐减少;然而,我们的实证分析显示并非如此。即使在OPD训练达到明显饱和后,仍有大量令牌持续表现出高损失;我们将这些令牌称为“岩石令牌”,它们可占生成输出中高达18%的令牌。我们的研究揭示了两个令人惊讶的悖论。首先,尽管这些令牌的高出现频率提供了不成比例的大份额总梯度范数,但岩石令牌本身在整个训练过程中保持停滞,抵抗教师驱动的修正。其次,通过因果干预,我们发现这些令牌对模型的实际推理性能贡献可忽略不计。这些发现表明,大量优化带宽被花费在学生模型无法或无需内化的结构和话语残差上。通过解构这些动态,我们证明策略性地绕过这些“绊脚石”可以显著简化对齐过程,挑战了统一令牌权重的必要性,并为大规模模型蒸馏提供了更高效的范式。

英文摘要

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18\% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these ``stumbling blocks'' can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.

2605.07527 2026-06-02 cs.LG cs.AI 版本更新

Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It

为什么 GNN 解释中会出现自不一致性以及如何利用它

Wenxin Tai, Yaqian Liu, Ting Zhong, Fan Zhou

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文分析了图神经网络解释中自不一致性的成因(重新解释引起的上下文扰动),提出潜在信号分配假设解释边缘敏感性,并设计无需训练的后处理策略 Self-Denoising 来校准解释。

Comments Corrected result errors and fixed typos

详情
AI中文摘要

最近的工作观察到,自解释图神经网络(SI-GNN)产生的解释可能存在自不一致性:当模型重新应用于其自身的解释性子图时,可能会产生不同的解释。然而,自不一致性产生的原因尚不清楚。在这项工作中,我们首先将重新解释引起的上下文扰动确定为分数变化的直接原因。然后,我们引入潜在信号分配假设来解释为什么只有部分边缘对此扰动敏感,并分析简洁性正则化如何影响潜在信号分配。鉴于自不一致的边缘不能为模型预测提供稳定的证据,我们提出了自去噪(SD),这是一种模型无关且无需训练的后处理策略,仅需一次额外前向传播即可校准解释。在代表性 SI-GNN 框架、骨干架构和基准数据集上的实验支持我们的假设,并表明 SD 能够持续提高解释质量,同时在实际中仅增加约 4-6% 的计算开销。

英文摘要

Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model's prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4--6\% computational overhead in practice.

2605.07061 2026-06-02 cs.SD cs.AI cs.CV cs.MM 版本更新

Do Joint Audio-Video Generation Models Understand Physics?

联合音视频生成模型是否理解物理?

Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu, Zexin Xu, Weiguo Pian, Shijian Deng, Feiyu Du, Chenming Ge, Yapeng Tian

发表机构 * University of Texas at Dallas(德克萨斯大学达拉斯分校) University of Washington(华盛顿大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 针对联合音视频生成模型,提出AV-Phys Bench基准测试其物理常识,发现所有模型在物理一致性上表现不足,尤其是事件驱动和环境驱动转换场景。

Comments Preprint. Project Page: https://zijuncui.com/AV-Phys/. Full abstract appears in the PDF

详情
AI中文摘要

联合音视频生成模型正迅速接近专业制作质量,这引发了一个核心问题:它们是否理解音视频物理,还是仅仅生成看似合理但违反现实一致性的声音和帧?我们引入了AV-Phys Bench,一个用于评估联合音视频生成中物理常识的基准。AV-Phys Bench测试模型在三种场景类别上的表现:稳态、事件转换和环境转换。它涵盖了从现实场景中提取的基于物理的子类别,以及故意要求物理不一致音视频行为的反AV物理提示。每个生成结果沿五个维度评估:视觉语义遵循、音频语义遵循、视觉物理常识、音频物理常识和跨模态物理常识。在三个专有模型和四个开源模型中,我们发现Seedance 2.0整体表现最佳,但所有模型距离鲁棒的物理理解仍有很大差距。在事件驱动和环境驱动转换上性能急剧下降,即使是强大的专有系统在反AV物理提示上也崩溃。我们进一步引入了AV-Phys Agent,一个结合多模态语言模型与确定性声学测量工具的ReAct风格评估器,产生的排名与人类评分高度一致。我们的结果指出,跨模态物理一致性和转换驱动的场景动态是联合音视频生成的关键开放挑战。

英文摘要

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.

2605.05427 2026-06-02 cs.AI 版本更新

The Refusal--Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

拒绝-顺从权衡:大型语言模型的大规模安全行为审计

Alif Al Hasan, Sumon Biswas

发表机构 * Department of Computer and Data Sciences(计算机与数据科学系)

AI总结 本研究通过调整组合方法隔离模型敏感性与数据集毒性混淆,审计了21个开源权重LLM在四个安全基准上的拒绝与顺从失败模式,发现模型采用不同校准策略、人口保护不平等以及拒绝与顺从倾向在模型家族内稳定。

详情
AI中文摘要

拒绝率是LLM安全性的一个不良代理指标,即模型可能过度拒绝良性提示,同时仍顺从有害提示。我们审计了21个开源权重LLM在四个安全基准(OR-Bench、XSTest、ToxiGen、BOLD)上的两种失败模式,使用组合调整来隔离模型敏感性与数据集毒性混淆。我们报告三个发现。首先,模型采用根本不同的校准策略:保守生态系统(如Llama)以过度拒绝为代价抑制不安全输出,而宽松生态系统(如DeepSeek和Qwen)保持有用性但容忍更高的有害顺从。其次,人口保护不平等:模型过度保护突出的种族和宗教群体,经常拒绝甚至关于它们的良性提示,而对针对残疾的攻击提供显著较弱的保护。第三,拒绝和顺从倾向在模型家族内跨代和规模稳定,表明后训练目标比架构更能塑造安全行为。我们的结果呼吁进行联合、人口意识感知和多评判者的安全评估。

英文摘要

Refusal rates are a poor proxy for LLM safety, i.e., a model may over-refuse benign prompts while still complying with harmful ones. We audit both failure modes across 21 open-weight LLMs on four safety benchmarks (OR-Bench, XSTest, ToxiGen, BOLD), using a composition adjustment to isolate model sensitivity from dataset toxicity confounds. We report three findings. First, models adopt fundamentally different calibration strategies: conservative ecosystems such as Llama suppress unsafe outputs at the cost of elevated over-refusals, while permissive ecosystems such as DeepSeek and Qwen preserve helpfulness but tolerate higher harmful compliance. Second, demographic protection is unequal: models over-protect prominent racial and religious groups, frequently refusing even benign prompts about them, while providing substantially weaker protection against disability-targeted attacks. Third, refusal and compliance tendencies are stable within model families across generations and scales, suggesting that post-training objectives shape safety behavior more than architecture. Our results call for joint, demographically-aware, and multi-judge safety evaluation.

2604.17415 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

奖励分数匹配:统一流模型和扩散模型的基于奖励的微调

Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, Korea(人工智能研究生院,韩国科学技术院)

AI总结 提出奖励分数匹配(RSM)框架,统一了多种基于奖励的微调方法,通过分数匹配与值引导目标对齐,简化了设计空间并提高了效率。

Comments 43 pages, 15 figures

详情
AI中文摘要

基于奖励的微调引导预训练的扩散或基于流的生成模型生成更高奖励的样本,同时保持接近预训练模型。尽管现有方法源自不同视角,但我们表明许多方法可以写在一个共同框架下,我们称之为奖励分数匹配(RSM)。在此视角下,对齐变为针对值引导目标的分数匹配,方法间的主要差异归结为值引导估计器的构建和跨时间步的有效优化强度。这种统一澄清了现有设计的偏差-方差-计算权衡,并将核心优化组件与增加复杂性而无明显益处的辅助机制区分开来。在此视角指导下,我们针对代表性的可微和黑盒奖励对齐任务开发了更简单、更高效的重新设计。总体而言,RSM将看似分散的基于奖励的微调方法集合转变为更小、更可解释且更可操作的设计空间。代码可在 https://github.com/jaylee2000/rsm 获取。

英文摘要

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space. Code is available at https://github.com/jaylee2000/rsm

2604.09063 2026-06-02 cs.CV cs.AI 版本更新

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

频率增强扩散模型:基于课程引导语义对齐的零样本骨架动作识别

Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing(测绘遥感信息工程国家重点实验室) Wuhan University(武汉大学) Information Systems Technology and Design Pillar(信息系统技术与设计学院) Singapore University of Technology and Design(新加坡科技与设计大学) School of Geodesy and Geomatics(测绘学院) School of Mathematics and Statistics(数学与统计学院) Wuhan University Shenzhen Research Institute(武汉大学深圳研究院)

AI总结 提出频率感知扩散模型FDSM,通过语义引导频谱残差模块、时间步自适应频谱损失和课程语义抽象,解决扩散模型频谱偏差导致的高频动态过度平滑问题,实现零样本骨架动作识别,在多个数据集上达到最优性能。

Comments Accepted by The Visual Computer

详情
AI中文摘要

人体动作识别在计算机视觉中至关重要,应用范围从监控到人机交互。尽管基于监督的骨架方法有效,但其对详尽标注的依赖限制了对新动作的泛化能力。零样本骨架动作识别(ZSAR)成为一种有前景的范式,但由于扩散模型的频谱偏差(过度平滑高频动态)而面临挑战。在此,我们提出频率感知扩散用于骨架-文本匹配(FDSM),集成了语义引导频谱残差模块、时间步自适应频谱损失和基于课程的语义抽象以应对这些挑战。我们的方法有效恢复了细粒度运动细节,在NTU RGB+D、PKU-MMD和Kinetics-skeleton数据集上实现了最先进的性能。代码已公开于https://github.com/yuzhi535/FDSM。项目主页:https://yuzhi535.github.io/FDSM.github.io/

英文摘要

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/

2605.00310 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

超越视觉保真度:通过下游任务集成评估大规模遥感影像的超分辨率模型

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

发表机构 * University of Maryland(马里兰大学) University of Pittsburgh(匹兹堡大学) Worcester Polytechnic Institute(沃思利技术学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 针对现有超分辨率评估依赖PSNR/SSIM等保真度指标而忽略下游任务效用的问题,提出GeoSR-Bench基准数据集,集成土地覆盖分割、基础设施映射等下游任务,评估GAN、Transformer等9种SR模型在270种设置下的性能,发现保真度指标与任务性能弱相关甚至负相关。

Comments Under review at IEEE TPAMI

详情
AI中文摘要

超分辨率(SR)技术在从低分辨率输入重建高分辨率图像方面取得了重大进展。分辨率的提高为监测任务提供了视觉增强和实用性。特别是,SR已越来越多地用于基于卫星的地球观测,应用于城市规划、农业、生态学和灾害响应。然而,现有的SR研究和基准通常使用保真度指标如PSNR或SSIM,而超分辨率图像的真实效用在于支持下游任务,如土地覆盖分类、生物量估计和变化检测。为弥合这一差距,我们引入了GeoSR-Bench,一个下游任务集成的SR基准数据集,用于评估超越保真度指标的SR模型。GeoSR-Bench包含来自约36,000个地点的空间共位、时间对齐和质量控制的图像对,覆盖多种土地覆盖类型,分辨率从500米到0.6米。据我们所知,GeoSR-Bench是第一个直接将SR模型提高的图像分辨率与下游地球监测任务(包括土地覆盖分割、基础设施映射和生物物理变量估计)联系起来的SR基准。利用GeoSR-Bench,我们对基于GAN、Transformer、神经算子和扩散的SR模型在感知质量和下游任务性能上进行了基准测试。我们进行了270种设置的实验,涵盖2个跨平台SR任务、9个SR模型、3个下游任务模型以及每个SR任务的5个下游任务。结果表明,传统SR指标的改进通常与任务性能的提升不相关,甚至可能负相关,表明这些指标为选择适用于下游任务的优越模型提供的指导有限。这揭示了将下游任务集成到SR模型开发和评估中的必要性。

英文摘要

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increased resolution provides visual enhancement and utility for monitoring tasks. In particular, SR has been increasingly developed for satellite-based Earth observation, with applications in urban planning, agriculture, ecology, and disaster response. However, existing SR studies and benchmarks typically use fidelity metrics such as PSNR or SSIM, whereas the true utility of super-resolved images lies in supporting downstream tasks such as land cover classification, biomass estimation, and change detection. To bridge this gap, we introduce GeoSR-Bench, a downstream task-integrated SR benchmark dataset to evaluate SR models beyond fidelity metrics. GeoSR-Bench comprises spatially co-located, temporally aligned, and quality-controlled image pairs from about 36,000 locations across diverse land covers, spanning resolutions from 500m to 0.6m. To the best of our knowledge, GeoSR-Bench is the first SR benchmark that directly connects improved image resolution from SR models with downstream Earth monitoring tasks, including land cover segmentation, infrastructure mapping, and biophysical variable estimation. Using GeoSR-Bench, we benchmark GAN, transformer, neural operator, and diffusion-based SR models on perceptual quality and downstream task performance. We conduct experiments with 270 settings, covering 2 cross-platform SR tasks, 9 SR models, 3 downstream task models, and 5 downstream tasks for each SR task. The results show that improvements in traditional SR metrics often do not correlate with gains in task performance, and the correlations can be negative, indicating that these metrics provide limited guidance for selecting superior models for downstream tasks. This reveals the need to integrate downstream tasks into SR model development and evaluation.

2604.26977 2026-06-02 cs.LO cs.AI 版本更新

Defeasible Conditional Obligation in a Two-tiered Preference-based Semantics (Extended Version)

基于双层偏好语义的可废止条件义务(扩展版)

Xavier Parent

发表机构 * Technische Universität Wien (TU Wien)(维也纳技术大学)

AI总结 本文提出一种双层偏好语义框架,通过结合非单调推理机制和双序关系(理想性与正常性),解决可废止条件义务的逻辑建模问题,并与约束输入/输出逻辑建立联系。

Comments 13 pages. Extended version of a paper presented at KR 2926

详情
AI中文摘要

针对Horty提出的问题,本文开发了一种双层、基于偏好的语义框架,用于建模可废止条件义务。该文扩展了Hanss-Lewis风格的偏好语义,用于双元道义逻辑,通过引入非单调推理机制,使得当新的、可能冲突的信息出现时,先前推导的义务可以被撤销。该方法是双偏好的:采用世界上的两种序关系——理想性和正常性——来弥补早期方法的不足,并为每种序关系提供独立的排序方法。在非单调层面,考虑了若干公设,包括前提强化、包含和无淹没。与所谓的约束输入/输出(I/O)逻辑——一种基于不同方法的现有规范推理标准——建立了联系。

英文摘要

In response to a concern raised by Horty, this paper develops a two-tiered, preference-based semantic framework for modeling defeasible conditional obligations. The paper extends a Hansson-Lewis style preference semantics for dyadic deontic logic by incorporating a nonmonotonic reasoning mechanism that enables previously derived obligations to be withdrawn when new, potentially conflicting information comes in. The account is bi-preferential: two orderings--ideality and normality--on worlds are employed to address shortcomings in earlier approaches, with a separate ranking method for each. At the nonmonotonic layer, a number of postulates are considered, including antecedent strengthening, inclusion and no-drowning. A connection is established with so-called constrained input/output (I/O) logic--an existing standard for normative reasoning based on a different methodology.

2604.25191 2026-06-02 cs.AR cs.AI cs.LG 版本更新

How Can Reinforcement Learning Achieve Expert-level Placement?

强化学习如何实现专家级布局?

Ruo-Tong Chen, Ke Xue, Chengrui Gao, Yunqi Shi, Tian Xu, Peng Xie, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) Huawei Noah’s Ark Lab, China(华为诺亚实验室)

AI总结 针对强化学习在芯片布局中因奖励设计不当而难以达到专家质量的问题,提出从专家布局直接学习奖励模型的方法,通过推断专家轨迹并训练隐式奖励模型,实现从单个设计高效学习并泛化到未见案例。

Comments DAC 2026

详情
AI中文摘要

芯片布局是物理设计中的关键步骤。尽管基于强化学习的方法最近出现,但它们的训练主要关注线长优化,因此常常无法达到专家质量的布局。我们确定奖励设计是与专家性能差距的主要原因,并且我们没有形式化复杂的过程,而是通过直接从专家布局中学习来推导奖励模型,从而规避了这一问题。我们的方法从最终的专家布局开始,逐步推断专家轨迹。利用这些轨迹作为演示或偏好,我们训练一个模型来捕捉专家结果中的潜在隐式奖励。实验表明,我们的框架可以高效地从单个设计中学习,并很好地泛化到未见案例。

英文摘要

Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause for the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.

2604.23658 2026-06-02 cs.AR cs.AI cs.LG 版本更新

FlowPlace: Flow Matching for Chip Placement

FlowPlace: 用于芯片布局的流匹配

Peng Xie, Ke Xue, Yunqi Shi, Ruo-Tong Chen, Chengrui Gao, Siyuan Xu, Chenjian Ding, Mingxuan Yuan, Chao Qian

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) Huawei Noah’s Ark Lab, China(华为诺亚实验室)

AI总结 提出FlowPlace,通过掩码引导的合成数据生成、基于流的灵活先验注入高效训练和硬约束采样实现无重叠布局,在OpenROAD和ICCAD 2015基准上取得更优PPA指标、10-50倍采样效率提升和零重叠。

Comments DAC 2026

详情
AI中文摘要

芯片布局在物理设计中扮演重要角色。虽然扩散模型等生成模型提供了有前景的基于学习的解决方案,但当前方法存在以下局限性:使用随机合成数据进行预训练,需要较长的采样时间,并且由于在采样过程中依赖基于梯度的求解器,常常导致重叠。为了克服这些问题,我们提出了FlowPlace,其特点包括掩码引导的合成数据生成、基于流的灵活先验注入高效训练以及用于无重叠布局的硬约束采样。在OpenROAD和ICCAD 2015基准上的实验表明,FlowPlace实现了更好的PPA指标、10-50倍的采样效率提升以及零重叠。

英文摘要

Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based solutions, current methods have the following limitations: they use random synthetic data for pre-training, require long sampling times, and often result in overlaps due to their dependence on gradient-based solvers during the sampling process. To overcome these issues, we propose FlowPlace, which features mask-guided synthetic data generation, flow-based efficient training with flexible prior injection, and hard constraint sampling for overlap-free layouts. Experiments on OpenROAD and ICCAD 2015 benchmarks show FlowPlace achieves better PPA metrics, 10-50$\times$ faster sampling efficiency, and zero overlaps.

2604.23593 2026-06-02 cs.AI 版本更新

When AI reviews science: Can we trust the referee?

当AI评审科学:我们能信任审稿人吗?

Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, Lei Chen

发表机构 * School of Electronic Engineering, Southeast University(东南大学电子工程学院) Zhejiang University(浙江大学)

AI总结 针对AI审稿的安全性和可靠性问题,本文通过分类攻击类型并实验验证声望框架、断言强度、反驳谄媚和上下文投毒对评分的影响,为评估AI同行评审的可靠性提供基线。

详情
Journal ref
The Innovation Informatics 2:100030 (2026)
AI中文摘要

科学投稿数量持续攀升,超过了合格人类审稿人的容量,并延长了编辑时间线。与此同时,现代大型语言模型(LLMs)在摘要、事实核查和文献分类方面展现出令人印象深刻的能力,使得将AI整合到同行评审中越来越有吸引力——实际上,也无可避免。然而,早期的部署和非正式采用已经暴露了严重的故障模式。最近的事件表明,嵌入在稿件中的隐藏提示注入可以引导LLM生成的评审走向不合理的正面判断。补充研究还显示出对对抗性措辞、权威和长度偏见以及幻觉主张的脆弱性。这些事件引发了学术交流的一个核心问题:当AI评审科学时,我们能信任AI审稿人吗?本文提供了以安全和可靠性为中心的AI同行评审分析。我们映射了评审生命周期中的攻击——训练和数据检索、初审、深度评审、反驳和系统层面。我们通过在分层选取的ICLR 2025投稿上使用两个基于LLM的高级审稿人进行四项处理-控制探针,实例化了这一分类法,以隔离声望框架、断言强度、反驳谄媚和上下文投毒对评审分数的因果效应。总之,这一分类法和实验审计为评估和跟踪AI同行评审的可靠性提供了基于证据的基线,并突出了具体的故障点,以指导有针对性的、可测试的缓解措施。

英文摘要

The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive -- and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle -- training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.

2604.07967 2026-06-02 cs.CL cs.AI 版本更新

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

AtomEval: 事实核查中对抗性声明重写的有效性感知原子评估

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang

发表机构 * Zhejiang University(浙江大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出AtomEval协议,通过原子分解和保留门控,区分有效规避验证与改变命题的重写,并引入VASR指标,解决传统ASR膨胀问题。

详情
AI中文摘要

大型语言模型(LLM)可以重写被驳斥的声明以规避基于证据的事实核查器,但当重写改变、削弱或纠正了本应保留的虚假命题时,传统的攻击成功率(ASR)可能会被夸大。我们引入了AtomEval,一种用于固定证据对抗性声明重写的有效性感知评估协议。AtomEval将声明表示为“主体-关系-客体-修饰语”(SROM)原子,应用单向保留门将有效的验证器规避与改变命题的重写分开,并报告有效性感知攻击成功率(VASR),该指标仅统计保留原始虚假命题的验证器规避重写。AtomEval进一步提供细粒度诊断,解释命题级失败和非最小有效重写。在FEVER被驳斥声明重写任务上,AtomEval揭示并解释了ASR膨胀:许多明显的攻击通过改变、削弱或纠正本应保留的命题来欺骗验证器。通过使受攻击命题的保留变得明确且可测量,AtomEval为评估必须在验证器规避与命题保留之间取得平衡的对抗性重写器提供了稳定的评估目标。

英文摘要

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject--relation--object--modifier (SROM) atoms, applies a one-way preservation gate to separate valid verifier evasion from proposition-changing rewrites, and reports validity-aware attack success rate (VASR), which counts only verifier-evasive rewrites that preserve the original false proposition. AtomEval further provides fine-grained diagnostics that explain both proposition-level failures and non-minimal valid rewrites. On FEVER refuted-claim rewriting, AtomEval exposes and explains ASR inflation: many apparent attacks fool the verifier by altering, weakening, or correcting the proposition they should preserve. By making attacked-proposition preservation explicit and measurable, AtomEval provides a stable evaluation target for evaluating adversarial rewriters that must balance verifier evasion with proposition preservation.

2602.02689 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Eidolon: A Post-Quantum Signature Scheme Based on k-Colorability in the Age of Graph Neural Networks

Eidolon: 图神经网络时代基于k-可着色性的后量子签名方案

Asmaa Cherkaoui, Ramon Flores, Delaram Kahrobaei, Richard Wilson

发表机构 * Laboratory of Mathematical Analysis, Algebra and Applications (LAM2A), Faculty of Sciences Ain Chock (FSAC), University Hassan II, Casablanca, Morocco(哈桑二世大学阿因-奇克学院数学分析与代数实验室) Department of Geometry and Topology, Faculty of Mathematics, University of Seville, Seville, Spain(塞维利亚大学数学系几何与拓扑系) Departments of Computer Science and Mathematics, Queens College, City University of New York, USA(纽约市立大学皇后学院计算机科学与数学系;数学博士项目,理论科学倡议,研究生中心,纽约市立大学;计算机科学与工程系,纽约大学塔朗分校;计算机科学系,英国约克大学) PhD Program in Mathematics, and Initiative for the Theoretical Sciences, Graduate Center, City University of New York, USA(英国约克大学计算机科学系) Department of Computer Science and Engineering, Tandon School of Engineering, New York University, USA Department of Computer Science, University of York, United Kingdom Department of Computer Science, University of York, United Kingdom

AI总结 提出一种基于NP完全问题k-可着色性的后量子签名方案Eidolon,通过推广Goldreich-Micali-Wigderson零知识协议、应用Fiat-Shamir变换和Merkle树压缩,并利用植入着色法生成困难实例,实验表明对经典求解器和图神经网络攻击具有抵抗性。

Comments 20 pages, 4 figures

详情
Journal ref
Proceedings of WAIFI 2026, Lecture Notes in Computer Science (LNCS), Vol. 16611, Springer, 2026
AI中文摘要

我们提出Eidolon,一种基于NP完全问题k-可着色性的后量子签名方案。我们的构造将Goldreich-Micali-Wigderson零知识协议推广到任意k >= 3,应用Fiat-Shamir变换,并使用Merkle树承诺将签名从O(tn)压缩到O(t log n)。我们通过植入着色法生成困难实例,同时旨在保留随机图的统计特征。我们对此类方案进行了针对经典求解器(ILP、DSatur)和定制图神经网络(GNN)攻击者的实证安全分析。实验表明,对于n >= 60,两种方法均无法恢复与植入解匹配的有效着色,表明精心设计的k-着色实例能够抵抗所考虑的传统和基于学习的密码分析方法。这些实验表明,构造的实例能够抵抗我们评估中考虑的攻击。

英文摘要

We propose Eidolon, a post-quantum signature scheme grounded on the NP-complete k-colorability problem. Our construction generalizes the Goldreich-Micali-Wigderson zero-knowledge protocol to arbitrary k >= 3, applies the Fiat-Shamir transform, and uses Merkle-tree commitments to compress signatures from O(tn) to O(t log n). We generate hard instances by planting a coloring while aiming to preserve the statistical profile of random graphs. We present an empirical security analysis of such a scheme against both classical solvers (ILP, DSatur) and a custom graph neural network (GNN) attacker. Experiments show that for n >= 60, neither approach is able to recover a valid coloring matching the planted solution, suggesting that well-engineered k-coloring instances can resist the considered classical and learning-based cryptanalytic approaches. These experiments indicate that the constructed instances resist the attacks considered in our evaluation.

2604.20861 2026-06-02 cs.IR cs.AI 版本更新

Deep Interest Mining for Intent-Enriched Semantic IDs in Multimodal Generative Recommendation

面向多模态生成式推荐中意图增强语义ID的深度兴趣挖掘

Yangchen Zeng, Jinze Wang

发表机构 * Amazon(亚马逊)

AI总结 提出DeepInterestGR框架,通过视觉线索和意图描述符丰富量化前的物品表示,并结合相关性门控语义奖励,提升基于语义ID的生成式推荐性能。

详情
AI中文摘要

语义ID(SID)为生成式推荐提供了离散物品词汇表,但其质量取决于量化前保留了哪些物品证据。在产品推荐中,表面元数据常缺失潜在使用意图,视觉证据可能仅在文本中弱反映,下游策略学习对生成的SID是否对应语义有用的物品提供稀疏反馈。我们引入 extbf{DeepInterestGR},一个用于生成式推荐的意图增强SID框架。在SID量化前, extbf{CMSA}通过两条互补证据路径丰富物品表示:面向推荐的VLM描述和投影图像嵌入。然后 extbf{DCIM}使用LLM挖掘物品侧意图描述符——由产品内容隐含的潜在使用动机,而非个性化用户状态。在构建的SID上进行策略训练时, extbf{QARM}在标准SID奖励之上添加相关性门控语义质量奖励,仅当生成的SID解码为目标物品时应用该奖励。因此,语义质量不能奖励流畅但无关的物品预测。在三个Amazon产品评论类别(Beauty、Sports和Instruments)上的实验表明,DeepInterestGR优于有竞争力的生成式和基于RL的基线,在最强每度量基线上NDCG@5相对提升高达 extbf{15.1\%},NDCG@10提升 extbf{13.9\%}。组件消融、CMSA分支分析、奖励变体和SID级案例研究支持一个有界声明:用视觉线索和物品侧意图描述符丰富量化前物品证据,结合相关性门控语义奖励,在评估设置下改进了基于SID的生成式推荐。

英文摘要

Semantic IDs (SIDs) provide the discrete item vocabulary used by generative recommendation, but their quality depends on what item evidence is preserved before quantization. In product recommendation, surface metadata often misses latent usage intent, visual evidence may be only weakly reflected in text, and downstream policy learning provides sparse feedback about whether a generated SID corresponds to a semantically useful item. We introduce \textbf{DeepInterestGR}, an intent-enriched SID framework for generative recommendation. Before SID quantization, \textbf{CMSA} enriches item representations through two complementary evidence paths: recommendation-oriented VLM captions and projected image embeddings. \textbf{DCIM} then uses an LLM to mine item-side intent descriptors -- latent usage motivations implied by product content rather than personalized user states. During policy training over the constructed SIDs, \textbf{QARM} adds a relevance-gated semantic-quality bonus on top of standard SID rewards, applying the bonus only when the generated SID decodes to the target item. Thus, semantic quality cannot reward a fluent but irrelevant item prediction. Experiments on three Amazon Product Review categories (Beauty, Sports, and Instruments) show that DeepInterestGR improves over competitive generative and RL-based baselines, with relative gains of up to \textbf{15.1\%} in NDCG@5 and \textbf{13.9\%} in NDCG@10 over the strongest per-metric baseline. Component ablations, CMSA branch analyses, reward variants, and SID-level case studies support a bounded claim: enriching pre-quantization item evidence with visual cues and item-side intent descriptors, together with relevance-gated semantic rewards, improves SID-based generative recommendation under the evaluated settings.

2603.15956 2026-06-02 cs.RO cs.AI 版本更新

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

ExpertGen: 从非完美行为先验的可扩展仿真到现实专家策略学习

Zifan Xu, Ran Gong, Maria Vittoria Minniti, Kausik Sivakumar, Ahmet Salih Gundogdu, Eric Rosen, Riedana Yan, Tushar Kusnur, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper

发表机构 * Robotics and AI Institute(机器人与人工智能研究院) University of Texas at Austin(德克萨斯大学奥斯汀分校) Sony AI(索尼人工智能)

AI总结 提出ExpertGen框架,通过扩散策略初始化行为先验并结合强化学习优化噪声,在仅稀疏奖励下生成高质量专家策略,实现从仿真到现实的可扩展迁移。

详情
AI中文摘要

学习通用且鲁棒的行为克隆策略需要大量高质量的机器人数据。虽然人类演示(例如通过遥操作)是专家行为的标准来源,但在现实世界中大规模获取此类数据成本过高。本文介绍了ExpertGen,一个在仿真中自动化专家策略学习的框架,以实现可扩展的仿真到现实迁移。ExpertGen首先使用在非完美演示(可能由大语言模型合成或由人类提供)上训练的扩散策略初始化行为先验。然后,通过优化扩散模型的初始噪声同时保持原始策略冻结,使用强化学习将该先验引导至高的任务成功率。通过保持预训练的扩散策略冻结,ExpertGen将探索正则化到安全、类人的行为流形内,同时仅使用稀疏奖励即可实现有效学习。在具有挑战性的操作基准上的实证评估表明,ExpertGen无需奖励工程即可可靠地生成高质量的专家策略。在工业装配任务中,ExpertGen实现了90.5%的整体成功率,而在长时域操作任务中达到了85%的整体成功率,优于所有基线方法。所得策略表现出灵巧的控制,并在不同的初始配置和失败状态下保持鲁棒。为了验证仿真到现实的迁移,学习到的基于状态的专家策略通过DAgger进一步提炼为视觉运动策略,并成功部署在真实的机器人硬件上。

英文摘要

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keep original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

2604.17621 2026-06-02 cs.AI 版本更新

KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

KnowledgeBerg: 评估大语言模型中的系统性知识覆盖与组合推理

Xiao Zhang, Qianru Meng, Yongjian Chen, Yumeng Wang, Johan Bos

发表机构 * University of Groningen(Groningen大学) LIACS, Leiden University(莱顿大学LIACS)

AI总结 提出KnowledgeBerg基准,通过4800道选择题评估大模型在知识宽度和推理深度上的系统性覆盖与组合推理能力,发现现有模型存在严重不足。

Comments ACL Findings

详情
AI中文摘要

许多现实世界的问题看似简单,却隐含地要求两种能力:(i) 对有限知识宇宙的系统性覆盖,以及(ii) 对该宇宙的基于集合的组合推理,我们将这种现象称为“冰山一角”。我们通过两个正交维度形式化这一挑战:知识宽度(所需宇宙的基数)和推理深度(组合集合操作的数量)。我们引入了KnowledgeBerg,一个包含4800道多项选择题的基准,这些题目源自1183个枚举种子,涵盖10个领域和17种语言,其宇宙基于权威来源以确保可重复性。代表性的开源大语言模型表现出严重局限性,在宇宙枚举上仅达到5.26-36.88的F1分数,在基于知识的推理上准确率仅为16.00-44.19。诊断分析揭示了三个失败阶段:完整性(知识缺失)、意识(未能识别需求)和应用(错误执行推理)。这种模式在语言和模型规模上持续存在。尽管测试时计算和检索增强带来了可测量的改进——分别高达4.35和3.78个百分点——但仍有显著差距,暴露了当前大语言模型在组织结构化知识和在有限领域上执行组合推理方面的局限性。数据集可在https://huggingface.co/datasets/2npc/KnowledgeBerg获取。

英文摘要

Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg

2604.17456 2026-06-02 cs.AI 版本更新

TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control

TrafficClaw:面向城市交通控制的统一物理环境中的可泛化LLM智能体

Siqi Lai, Pan Zhang, Yuping Zhou, Jindong Han, Yansong Ning, Hao Liu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) Shandong University(山东大学)

AI总结 提出TrafficClaw,一种基于大语言模型的可泛化交通控制智能体,通过统一物理环境、可执行时空推理与多阶段智能体强化学习,实现跨子系统的协调优化。

详情
AI中文摘要

大语言模型(LLM)智能体在数字环境中的长程推理、工具使用和决策方面表现出强大能力,但将其扩展到物理系统仍具挑战。与目标通常弱耦合的网络、代码或游戏环境不同,物理系统通过紧密耦合的动力学演化,局部干预会随时间在相互作用的子系统中传播。城市交通控制体现了这一挑战,因为交通信号、高速公路、公共交通和出租车系统通过共享的空间基础设施和时间出行需求持续交互。现有的优化、强化学习(RL)和基于LLM的方法大多针对孤立子系统设计,限制了协调推理和系统级优化。我们提出TrafficClaw,一种基于LLM的可泛化交通控制智能体,用于物理城市系统。TrafficClaw在统一的交通环境中运行,暴露耦合的城市动态和反馈,通过持久记忆执行可扩展的时空推理以实现长期适应,并利用多阶段智能体RL进行协调的系统级优化。在三个大都市区域和六个交通控制任务上的实验证明了其强大的泛化能力、鲁棒性和跨子系统协调能力。我们的项目可在https://github.com/usail-hkust/TrafficClaw获取。

英文摘要

Large language model (LLM) agents have shown strong capabilities in long-horizon reasoning, tool use, and decision-making in digital environments, yet extending them to physically grounded systems remains challenging. Unlike web, code, or game environments, where objectives are often weakly coupled, physical systems evolve through tightly coupled dynamics in which local interventions propagate across interacting subsystems over time. Urban traffic control exemplifies this challenge, as traffic signals, freeways, public transit, and taxi systems continuously interact through shared spatial infrastructure and temporal mobility demand. Existing optimization, reinforcement learning (RL), and LLM-based approaches are largely designed for isolated subsystems, limiting coordinated reasoning and system-level optimization. We propose TrafficClaw, a LLM-based generalizable traffic control agent for physical urban systems. TrafficClaw operates within a unified traffic environment that exposes coupled urban dynamics and feedback, performs executable spatiotemporal reasoning with persistent memory for long-horizon adaptation, and leverages multi-stage agentic RL for coordinated system-level optimization. Experiments across three metropolitan regions and six traffic-control tasks demonstrate strong generalization, robustness, and cross-subsystem coordination. Our project is available at https://github.com/usail-hkust/TrafficClaw.

2604.17007 2026-06-02 cs.CV cs.AI 版本更新

MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment

MobileAgeNet:面向移动部署的轻量级面部年龄估计

Arun Kumar, Aswathy Baiju, Radu Timofte, Dmitry Ignatov

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室、CAIDAS与IFI、乌尔姆大学、德国)

AI总结 提出基于MobileNetV3-Large的轻量级年龄回归框架MobileAgeNet,通过两阶段微调和边界回归策略,在UTKFace测试集上达到4.65年MAE,移动端延迟14.4ms,参数量3.23M。

Comments 9 Pages including references, 3 figures

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3810-3818, 2026
AI中文摘要

面部年龄估计的移动部署需要模型在预测准确性、低延迟和小尺寸之间取得平衡。在这项工作中,我们提出了MobileAgeNet,一个轻量级年龄回归框架,在UTKFace保留测试集上实现了4.65年的MAE,同时使用AI Benchmark应用程序测量,平均延迟为14.4毫秒,保持了高效的设备端推理。该模型基于预训练的MobileNetV3-Large骨干网络,结合紧凑的回归头,支持移动设备上的实时预测。训练和评估流程集成到NN LEMUR数据集框架中,支持可重复实验、结构化超参数优化和一致评估。我们采用边界年龄回归以及两阶段微调策略,以提高训练稳定性和泛化能力。实验结果表明,MobileAgeNet以3.23M参数实现了具有竞争力的准确性,并且从PyTorch训练通过ONNX导出到TensorFlow Lite转换的部署流程,在实际设备条件下保持了预测行为,没有可测量的退化。总体而言,这项工作为面向移动的面部年龄估计提供了一个实用、可部署的基线。

英文摘要

Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline from PyTorch training through ONNX export to TensorFlow Lite conversion - preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.

2601.07177 2026-06-02 cs.CR cs.AI 版本更新

Safe-FedLLM: Delving into the Safety of Federated Large Language Models

Safe-FedLLM:深入探究联邦大语言模型的安全性

Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang

发表机构 * Hainan University(海南大学) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出Safe-FedLLM,一种基于探针的防御框架,通过三级防御(步骤级、客户端级和阴影级)利用轻量级分类器区分恶意与良性LoRA更新,以增强联邦大语言模型对恶意客户端的鲁棒性。

详情
AI中文摘要

联邦学习解决了大语言模型训练中的隐私和数据孤岛问题。大多数先前工作侧重于提高联邦学习对大语言模型的效率。然而,开放联邦环境中的安全性,特别是针对恶意客户端的防御,仍未被充分探索。为了研究联邦大语言模型的安全性,我们进行了一项初步研究,从LoRA更新的角度分析潜在的攻击面和防御特性。我们发现联邦大语言模型的两个关键特性:1)大语言模型在联邦学习中容易受到恶意客户端的攻击,以及2)LoRA更新表现出不同的行为模式,可以通过轻量级分类器有效区分。基于这些特性,我们提出了Safe-FedLLM,一种基于探针的联邦大语言模型防御框架,该框架在三个层面构建防御:步骤级、客户端级和阴影级。Safe-FedLLM的核心概念是对每个客户端的本地LoRA更新进行基于探针的区分,将其视为高维行为特征,并使用轻量级分类器判断其是否为恶意。大量实验表明,Safe-FedLLM有效提高了联邦大语言模型对恶意客户端的鲁棒性,同时保持了对良性数据的竞争性能。值得注意的是,我们的方法在不显著影响训练速度的情况下有效抑制了恶意数据的影响,即使在恶意客户端比例较高的情况下也保持有效。

英文摘要

Federated learning (FL) addresses privacy and data-silo issues in the training of large language models (LLMs). Most prior work focuses on improving the efficiency of federated learning for LLMs (FedLLM). However, security in open federated environments, particularly defenses against malicious clients, remains underexplored. To investigate the security of FedLLM, we conduct a preliminary study to analyze potential attack surfaces and defensive characteristics from the perspective of LoRA updates. We find two key properties of FedLLM: 1) LLMs are vulnerable to attacks from malicious clients in FL, and 2) LoRA updates exhibit distinct behavioral patterns that can be effectively distinguished by lightweight classifiers. Based on these properties, we propose Safe-FedLLM, a probe-based defense framework for FedLLM, which constructs defenses across three levels: Step-Level, Client-Level, and Shadow-Level. The core concept of Safe-FedLLM is to perform probe-based discrimination on each client's local LoRA updates, treating them as high-dimensional behavioral features and using a lightweight classifier to determine whether they are malicious. Extensive experiments demonstrate that Safe-FedLLM effectively improves FedLLM's robustness against malicious clients while maintaining competitive performance on benign data. Notably, our method effectively suppresses the impact of malicious data without significantly affecting training speed, and remains effective even under high malicious client ratios.

2604.15231 2026-06-02 cs.AI 版本更新

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent:一种用于胸部CT逐步解读的工具型AI智能体

Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia E. Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

发表机构 * Department of Biosystems Science and Engineering, ETH Zurich(生物系统科学与工程系,苏黎世联邦理工学院) ETH AI Center, Zurich(ETH人工智能中心,苏黎世) Department of Computer Science, ETH Zurich(计算机科学系,苏黎世联邦理工学院) Faculty of Computer Science and Mathematics, Heidelberg University(计算机科学与数学学院,海德堡大学) Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University(斯坦福大学人工智能在医学和影像中的中心) Department of Radiology, Stanford University(放射科,斯坦福大学) Department of Quantitative Biomedicine, University of Zurich(定量生物医学系,苏黎世大学) Institute of Computer Science, Zurich University of Applied Sciences(应用科学大学计算机科学研究所)

AI总结 提出RadAgent,一种通过逐步、可解释过程生成CT报告的工具型AI智能体,在临床准确性、鲁棒性和忠实度上优于3D VLM方法。

详情
AI中文摘要

视觉语言模型(VLM)显著推进了复杂医学影像(如计算机断层扫描(CT))的AI驱动解读和报告生成。然而,现有方法主要将临床医生视为最终输出的被动观察者,没有提供可解释的推理轨迹供其检查、验证或改进。为了解决这个问题,我们引入了RadAgent,一种工具型AI智能体,通过逐步且可解释的过程生成CT报告。每个生成的报告都附带一个完全可检查的中间决策和工具交互轨迹,使临床医生能够检查报告发现是如何得出的。在我们的实验中,我们观察到RadAgent在三个维度上改进了胸部CT报告生成,优于其3D VLM对应物CT-Chat。临床准确性在宏F1上提高了5.8分(相对提高35.4%),在微F1上提高了5.1分(相对提高18.6%)。对抗条件下的鲁棒性提高了24.7分(相对提高41.9%)。此外,RadAgent在忠实度上达到了37.0%,这是其3D VLM对应物完全不具备的新能力。通过将胸部CT的解读构建为显式、工具增强和迭代的推理轨迹,RadAgent使我们更接近放射学的透明和可靠AI。

英文摘要

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 5.8 points (35.4% relative) in macro-F1 and 5.1 points (18.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.

2512.24120 2026-06-02 cs.CV cs.AI 版本更新

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

增强基于LLM的神经网络生成:面向自动化架构设计的少样本提示与高效验证

Raghuvir Duvvuri, Chandini Vysyaraju, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,乌尔姆大学,德国)

AI总结 本文提出少样本架构提示(FSAP)和空白归一化哈希验证方法,以提升基于LLM的计算机视觉架构自动生成效率,并通过大规模实验验证其有效性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3242-3251, 2026
AI中文摘要

自动化神经网络架构设计仍然是计算机视觉中的一个重大挑战。任务多样性和计算约束要求既有效又高效的架构与搜索方法。大型语言模型(LLMs)为计算密集型的神经架构搜索(NAS)提供了一种有前景的替代方案,但它们在计算机视觉架构生成中的应用尚未被系统研究,特别是在提示工程和验证策略方面。基于任务无关的NNGPT/LEMUR框架,本文引入并验证了两项针对计算机视觉的关键贡献。首先,我们提出了少样本架构提示(FSAP),这是首个针对基于LLM的架构生成中支持示例数量(n = 1, 2, 3, 4, 5, 6)的系统研究。我们发现使用n = 3个示例能在视觉任务的架构多样性和上下文聚焦之间取得最佳平衡。其次,我们引入了空白归一化哈希验证,一种轻量级去重方法(耗时小于1毫秒),相比AST解析实现了100倍加速,并防止了重复计算机视觉架构的冗余训练。在七个计算机视觉基准(MNIST、CIFAR-10、CIFAR-100、CelebA、ImageNette、SVHN、Places365)的大规模实验中,我们生成了1,900个独特架构。我们还引入了一种数据集平衡的评估方法,以应对跨异构视觉任务比较架构的挑战。这些贡献为计算机视觉中基于LLM的架构搜索提供了可操作的指导,并建立了严格的评估实践,使计算资源有限的研究人员也能更便捷地进行自动化设计。

英文摘要

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

2604.14514 2026-06-02 cs.AI cs.CE 版本更新

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

生物医学AI中的偏见视角:防止下游医疗保健差异

Michal Rosen-Zvi, Yoav Kan-Tor, Michael Danziger, Agata Ferretti, Javier Aula-Blasco, Julia Falcao, Ron Shamir, Mira Marcus-Kalish, Mordechai Muszkat

发表机构 * Weizmann Institute of Science(魏茨曼科学研究院) Hebrew University of Jerusalem(特拉维夫大学)

AI总结 本文通过分析2015-2024年4514篇组学出版物和大型数据集,揭示数据收集和研究中存在的严重人口偏见,并提出通过来源、开放性和评估透明度三个原则来预防下游医疗保健差异。

Comments This manuscript has been accepted for publication in the 2026 IEEE International Conference on Digital Health (ICDH). The final version will appear in IEEE Xplore

详情
AI中文摘要

医疗保健差异在社会经济边界上持续存在,通常归因于筛查、诊断和治疗的不平等获取。然而,本文观点强调,关键偏见可能在更早阶段出现,即在数据收集和研究优先级确定期间,远在临床实施之前,尤其是在关注分子和组学数据的研究中。大量研究专注于收集组学数据,但相关的人口统计信息往往未被报告,即使报告了,也显示出显著偏见。对2015年至2024年间PubMed索引的4514篇组学出版物的自动分析,检查了多个人口统计维度的报告情况,发现总体报告有限;例如,只有2.7%的研究报告了祖先或种族信息,地理来源报告仅限于2.5%。对常用于模型训练的大规模数据集(如CellxGene和GEO)的分析揭示了显著的人口偏见,其中欧洲血统数据占主导地位。随着生物医学基础模型成为生物医学发现的核心,其范式是基础模型在大数据集上预训练并反复用于许多不同的下游任务,这些模型有风险延续或放大这些早期阶段的偏见,导致监管干预无法完全逆转的级联不平等。我们提出社区范围内关注三个基本原则:来源、开放性和通过评估透明度的可靠性。这些原则共同有助于使偏见和局限性对模型开发者和用户更加可见,支持在生物医学AI中更明智的模型开发、评估和部署决策。

英文摘要

Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, particularly in studies focused on molecular and omics data. A vast number of studies focus on collecting omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it reveals substantial biases. An automated analysis of 4514 PubMed-indexed omics publications from 2015 to 2024, examining reporting across multiple demographic dimensions, reveals limited reporting overall; for example, only 2.7% of studies report ancestry or ethnicity information and geographic origin reporting is limited to 2.5%. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias where European-ancestry data dominates. As biomedical foundation models become central to biomedical discovery with a paradigm in which base models are pretrained on large datasets and reusing them repeatedly for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles: Provenance, Openness, and Reliability through Evaluation Transparency. Together, these principles can help make biases and limitations more visible to model developers and users, supporting more informed model development, evaluation, and deployment decisions in biomedical AI.

2604.03588 2026-06-02 cs.AI 版本更新

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

Rashomon记忆:面向多视角智能体记忆的论证驱动检索

Albert Sadowski, Jarosław A. Chudziak

发表机构 * Warsaw University of Technology(华沙技术大学)

AI总结 提出Rashomon记忆架构,通过并行目标条件化智能体以各自优先级编码经验,并在查询时通过论证协商,利用Dung的论证语义选择解释,支持冲突呈现模式。

Comments Presented at EXTRAAMAS workshop at AAMAS 2026

详情
AI中文摘要

在长时间跨度上运行的AI智能体积累服务于多个并发目标的经验,并且通常必须维持对同一事件的矛盾解释。在客户谈判中的让步,对于一个战略目标编码为“建立信任的投资”,对于另一个目标则编码为“合同责任”。当前的记忆架构假设单一正确编码,或者最多在统一存储上支持多个视图。我们提出Rashomon记忆:一种架构,其中并行目标条件化智能体根据其优先级编码经验,并在查询时通过论证进行协商。每个视角维护自己的本体和知识图谱。在检索时,视角提出解释,使用非对称领域知识批评彼此的提议,Dung的论证语义决定哪些提议存活。生成的攻击图本身就是一个解释:它记录了哪个解释被选中,哪些替代方案被考虑,以及它们被拒绝的理由。我们提供了一个概念验证,表明检索模式(选择、组合、冲突呈现)从攻击图拓扑中涌现,并且冲突呈现模式(系统报告真正的分歧而不是强制解决)让决策者直接看到底层的解释性冲突。

英文摘要

AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain conflicting interpretations of the same events. A concession during a client negotiation encodes as a ``trust-building investment'' for one strategic goal and a ``contractual liability'' for another. Current memory architectures assume a single correct encoding, or at best support multiple views over unified storage. We propose Rashomon Memory: an architecture where parallel goal-conditioned agents encode experiences according to their priorities and negotiate at query time through argumentation. Each perspective maintains its own ontology and knowledge graph. At retrieval, perspectives propose interpretations, critique each other's proposals using asymmetric domain knowledge, and Dung's argumentation semantics determines which proposals survive. The resulting attack graph is itself an explanation: it records which interpretation was selected, which alternatives were considered, and on what grounds they were rejected. We present a proof-of-concept showing that retrieval modes (selection, composition, conflict surfacing) emerge from attack graph topology, and that the conflict surfacing mode, where the system reports genuine disagreement rather than forcing resolution, lets decision-makers see the underlying interpretive conflict directly.

2603.18373 2026-06-02 cs.CV cs.AI 版本更新

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

看见还是取悦:揭示视觉语言模型中的视觉谄媚与分裂信念

Rui Hong, Shuxue Quan

发表机构 * George Mason University(乔治·玛斯纳大学) Independent Researcher(独立研究者)

AI总结 提出三层诊断框架,通过反事实干预实验发现视觉语言模型中普遍存在视觉谄媚(内部证据保留但输出幻觉答案)现象,并证明扩展模型规模无法解决该问题。

Comments 14 pages, 1 figures

详情
AI中文摘要

当视觉语言模型正确回答时,它们是否真正依赖视觉信息?我们引入了一个三层诊断框架,包含三个每样本指标:潜在异常检测、视觉必要性分数和竞争分数,用于解耦感知、依赖和对齐失败。在9个视觉语言模型和9000个模型-样本对中,通过反事实盲、噪声和冲突干预,72.9%的样本表现出视觉谄媚,这是一种分裂信念模式,即内部证据被保留但解码出幻觉答案,而零样本表现出稳健拒绝,表明当前的对齐训练已消除拒绝作为解码结果。在Qwen-VL系列中,无论是代内还是代间扩展,都单调减少了语言捷径,但加剧了视觉谄媚,表明仅靠规模和更新的后训练无法解决接地问题。诊断分数进一步实现了一种无需训练的择性预测策略,在50%覆盖率下准确率提升高达9.5个百分点。

英文摘要

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-sample metrics: Latent Anomaly Detection, Visual Necessity Score, and Competition Score, which disentangle perception, dependency, and alignment failures. Across 9 VLMs and 9,000 model-sample pairs under counterfactual blind, noise, and conflict interventions, 72.9% of samples exhibit Visual Sycophancy, a Split Beliefs pattern in which internal evidence is preserved yet a hallucinated answer is decoded, while zero samples show Robust Refusal, indicating that current alignment training has eliminated refusal as a decoding outcome. Scaling within the Qwen-VL family, both within- and across-generation, monotonically reduces Language Shortcuts but amplifies Visual Sycophancy, showing that scale and newer post-training alone cannot resolve the grounding problem. Diagnostic scores further enable a training-free selective-prediction strategy yielding up to +9.5 percentage points accuracy at 50% coverage.

2509.05367 2026-06-02 cs.CR cs.AI 版本更新

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

进退维谷:大型语言模型中伦理推理与安全对齐之间的张力

Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh, Xiao Li, Qibing Ren, Xiaolin Hu

发表机构 * Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University(计算机科学与技术系,人工智能研究院,BNRist,清华大学) IDG/McGovern Institute for Brain Research, Tsinghua University(IDG/麦克戈文脑科学研究院,清华大学) Chinese Institute for Brain Research (CIBR)(中国脑科学研究院(CIBR)) Shanghai Jiao Tong University(上海交通大学) ByteDance(字节跳动)

AI总结 本文提出TRIAL多轮红队方法,通过将有害请求嵌入伦理框架来利用模型伦理推理能力,并引入ERR防御框架(分层有害门控LoRA架构)以区分工具性回应与解释性回应,实现鲁棒防御。

详情
AI中文摘要

大型语言模型的安全对齐主要基于二元假设,即请求要么安全要么不安全。当模型遇到伦理困境时,这种分类被证明是不充分的,因为通过道德权衡进行推理的能力创造了一个独特的攻击面。我们通过TRIAL(一种多轮红队方法)形式化了这一漏洞,该方法将有害请求嵌入伦理框架中。TRIAL通过系统性地利用模型的伦理推理能力,将有害行为描述为道德上必要的妥协,在大多数测试模型上实现了高攻击成功率。基于这些见解,我们引入了ERR(伦理推理鲁棒性),一种防御框架,区分了导致有害结果的工具性回应和分析伦理框架而不认可有害行为的解释性回应。ERR采用分层有害门控LoRA架构,在保持模型实用性的同时,实现了对基于推理的攻击的鲁棒防御。

英文摘要

Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacity to reason through moral trade-offs creates a distinct attack surface. We formalize this vulnerability through TRIAL, a multi-turn red-teaming methodology that embeds harmful requests within ethical framings. TRIAL achieves high attack success rates across most tested models by systematically exploiting the model's ethical reasoning capabilities to frame harmful actions as morally necessary compromises. Building on these insights, we introduce ERR (Ethical Reasoning Robustness), a defense framework that distinguishes between instrumental responses that enable harmful outcomes and explanatory responses that analyze ethical frameworks without endorsing harmful acts. ERR employs a Layer-Stratified Harm-Gated LoRA architecture, achieving robust defense against reasoning-based attacks while preserving model utility.

2604.08324 2026-06-02 cs.NE cs.AI 版本更新

Multi-Modal Learning meets Genetic Programming: Analyzing Alignment in Latent Space Optimization

多模态学习遇见遗传编程:分析潜在空间优化中的对齐

Benjamin Léger, Kazem Meidani, Christian Gagné

发表机构 * Mila, Université Laval(Mila,拉瓦尔大学) Department of Mechanical Engineering, Carnegie Mellon University(机械工程系,卡内基梅隆大学) Canada-CIFAR AI Chair(加拿大-卡内基梅隆人工智能主席)

AI总结 本文研究SNIP方法在符号回归中多模态潜在空间优化的对齐效果,发现跨模态对齐在优化过程中未改善且过于粗糙,导致有效对齐引导的优化尚未实现。

详情
AI中文摘要

符号回归旨在从数据中发现数学表达式,传统上通过遗传编程对符号结构进行组合搜索来解决。潜在空间优化方法使用神经编码器将符号表达式映射到连续空间,将组合搜索转化为连续优化。受CLIP启发的对比预训练模型SNIP(Meidani等人,2024)通过引入多模态方法推进了潜在空间优化:在共享潜在空间中对齐符号和数值编码器以学习表型-基因型映射,从而在数值空间中进行优化以隐式指导符号搜索。然而,这依赖于细粒度的跨模态对齐,而类似CLIP模型的文献表明这种对齐通常是粗粒度的。在本文中,我们研究SNIP是否实现了其对符号回归进行有效双模态优化的承诺。我们的实验表明:(1)跨模态对齐在优化过程中并未改善,即使适应度增加;(2)SNIP学习的对齐过于粗糙,无法在符号空间中高效进行原则性搜索。这些发现揭示了尽管多模态潜在空间优化对符号回归具有巨大潜力,但有效的对齐引导优化在实践中仍未实现,突显了细粒度对齐作为未来工作的关键方向。

英文摘要

Symbolic regression (SR) aims to discover mathematical expressions from data, a task traditionally tackled using Genetic Programming (GP) through combinatorial search over symbolic structures. Latent Space Optimization (LSO) methods use neural encoders to map symbolic expressions into continuous spaces, transforming the combinatorial search into continuous optimization. SNIP (Meidani et al., 2024), a contrastive pre-training model inspired by CLIP, advances LSO by introducing a multi-modal approach: aligning symbolic and numeric encoders in a shared latent space to learn the phenotype-genotype mapping, enabling optimization in the numeric space to implicitly guide symbolic search. However, this relies on fine-grained cross-modal alignment, whereas literature on similar models like CLIP reveals that such an alignment is typically coarse-grained. In this paper, we investigate whether SNIP delivers on its promise of effective bi-modal optimization for SR. Our experiments show that: (1) cross-modal alignment does not improve during optimization, even as fitness increases, and (2) the alignment learned by SNIP is too coarse to efficiently conduct principled search in the symbolic space. These findings reveal that while multi-modal LSO holds significant potential for SR, effective alignment-guided optimization remains unrealized in practice, highlighting fine-grained alignment as a critical direction for future work.

2604.10788 2026-06-02 cs.CL cs.AI 版本更新

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR:探索大语言模型中的工具内化推理

Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Southeast University(东南大学) University of Edinburgh(爱丁堡大学) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 本文提出TInR-U框架,通过工具内化、监督微调和强化学习三阶段训练,使LLM无需外部文档即可进行工具集成推理,在域内和域外设置中均取得优越性能。

Comments Accepted to ACL 2026

详情
AI中文摘要

工具集成推理(TIR)通过扩展大语言模型(LLM)在推理过程中使用外部工具的能力,已成为一个有前景的方向。现有的TIR方法通常在推理过程中依赖外部工具文档。然而,这导致了工具掌握困难、工具规模限制和推理效率低下等问题。为了缓解这些问题,我们探索了工具内化推理(TInR),旨在促进使用内化到LLM中的工具知识进行推理。实现这一目标面临显著的要求,包括工具内化和工具-推理协调。为了解决这些问题,我们提出了TInR-U,一个用于统一推理和工具使用的工具内化推理框架。TInR-U通过三阶段流水线进行训练:1)使用双向知识对齐策略进行工具内化;2)使用高质量推理注释进行监督微调预热;3)使用TInR特定奖励进行强化学习。我们在域内和域外设置中全面评估了我们的方法。实验结果表明,TInR-U在两种设置下均实现了优越的性能,突显了其有效性和效率。

英文摘要

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

2604.10688 2026-06-02 cs.LG cs.AI cs.CL 版本更新

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

SCOPE: 信号校准的在线策略蒸馏增强与双路径自适应加权

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan LongCat Interaction Team(美团 LongCat 交互团队) Nanjing University(南京大学) Fudan University(复旦大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对在线策略强化学习中奖励稀疏导致的信用分配难题,提出SCOPE框架,通过双路径自适应加权机制分别处理正确与错误轨迹,实现信号校准的蒸馏增强,在六个推理基准上平均提升11.42%的Avg@32和7.30%的Pass@32。

详情
AI中文摘要

在线策略强化学习已成为大型语言模型推理对齐的主导范式,但其稀疏的结果级奖励使得令牌级信用分配异常困难。在线策略蒸馏(OPD)通过引入来自教师模型的密集令牌级KL监督缓解了这一问题,但通常对所有rollout均匀应用这种监督,忽略了信号质量的根本差异。我们提出信号校准的在线策略蒸馏增强(SCOPE),一种双路径自适应训练框架,根据正确性将在线策略rollout路由到两个互补的监督路径。对于错误轨迹,SCOPE执行教师困惑度加权的KL蒸馏,优先考虑教师展现出真正纠正能力的实例,同时降低不可靠指导的权重。对于正确轨迹,它应用学生困惑度加权的MLE,将强化集中在能力边界上的低置信度样本,而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布,考虑不同提示的内在难度差异。在六个推理基准上的大量实验表明,SCOPE在Avg@32和Pass@32上分别比竞争基线平均相对提升11.42%和7.30%,证明了其一致的有效性。

英文摘要

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

2604.10645 2026-06-02 cs.SE cs.AI 版本更新

Vibe-driven model-based engineering

基于氛围驱动的模型工程

Jordi Cabot

发表机构 * Luxembourg Institute of Science and Technology(卢森堡科学与技术研究院) University of Luxembourg(卢森堡大学)

AI总结 本文提出氛围驱动模型工程概念,融合大型语言模型(LLM)的代码生成能力与模型驱动工程(MDE)的严谨性,以加速可靠复杂系统的开发。

详情
AI中文摘要

随着新软件系统需求的增长和复杂性的增加,迫切需要更好的开发方法和工具。新型用户界面、智能组件需求、可持续性等问题带来了新的挑战。近年来,模型驱动工程(MDE),包括其最新形式即低代码/无代码开发,一直是提高软件开发质量和生产力的关键,但模型本身的指定和管理变得越来越复杂。与此同时,我们目睹了基于大型语言模型(LLM)的氛围编码方法的日益流行,该方法将自然语言描述转换为可运行代码,但代价是潜在的代码漏洞、可扩展性问题和可维护性问题。虽然许多人认为氛围编码将取代基于模型的工程,但在本文中,我们认为这两种方法实际上可以相互补充,并为不同类型的软件系统、开发场景和用户画像提供完全不同的开发路径。从这个意义上说,我们引入了“氛围驱动模型工程”的概念,作为一种融合AI和MDE优势的新方法,以加速可靠复杂系统的开发。我们概述了这一新方法的关键概念,并强调了它为软件开发的未来带来的机遇和开放挑战。

英文摘要

There is a pressing need for better development methods and tools to keep up with the growing demand and increasing complexity of new software systems. New types of user interfaces, the need for intelligent components, sustainability concerns, etc. bring new challenges that we need to handle. In the last years, model-driven engineering (MDE), including its latest incarnation, i.e. low/no-code development, has been key to improving the quality and productivity of software development, but models themselves are becoming increasingly complex to specify and manage. At the same time, we are witnessing the growing popularity of vibe coding approaches that rely on Large Language Models (LLMs) to transform natural language descriptions into running code at the expense of potential code vulnerabilities, scalability issues and maintainability concerns. While many may think vibe coding will replace model-based engineering, in this paper we argue that, in fact, the two approaches can complement each other and provide altogether different development paths for different types of software systems, development scenarios, and user profiles. In this sense, we introduce the concept of \textit{vibe-driven model-based engineering} as a novel approach to integrate the best of both worlds (AI and MDE) to accelerate the development of reliable complex systems. We outline the key concepts of this new approach and highlight the opportunities and open challenges it presents for the future of software development.

2604.10579 2026-06-02 cs.RO cs.AI 版本更新

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

AffordGen: 通过可供性对应生成多样化演示以实现通用物体操作

Jiawei Zhang, Kaizhe Hu, Yingqian Huang, Yuanchen Ju, Zhengrong Xue, Huazhe Xu

发表机构 * Shanghai Qi Zhi Institute(上海启智研究院) Tsinghua University(清华大学) Fudan University(复旦大学) UC Berkeley(伯克利大学)

AI总结 提出AffordGen框架,利用3D生成模型和视觉基础模型在大规模3D网格上的语义对应生成多样化操作轨迹,训练鲁棒的闭环视觉运动策略,实现零样本泛化到未见物体。

详情
AI中文摘要

尽管现代模仿学习方法在机器人操作中取得了近期成功,但其性能常常受到数据多样性不足导致的几何变化的限制。利用强大的3D生成模型和视觉基础模型(VFMs),所提出的AffordGen框架通过利用大规模3D网格上有意义关键点的语义对应来生成新的机器人操作轨迹,从而克服了这一限制。然后,这个大规模、可供性感知的数据集被用于训练一个鲁棒的、闭环的视觉运动策略,结合了可供性的语义泛化能力和端到端学习的反应性鲁棒性。在仿真和现实世界中的实验表明,使用AffordGen训练的策略实现了高成功率,并能够零样本泛化到真正未见过的物体,显著提高了机器人学习中的数据效率。项目页面:https://jiaweiz9.github.io/AffordGen-release/

英文摘要

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning. Project Page: https://jiaweiz9.github.io/AffordGen-release/

2604.09877 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Genie 4D:语义先验引导的4D动态场景重建

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出Genie 4D框架,结合实时视觉惯性高斯泼溅前端和前馈4D骨干网络,利用冻结的DINOv3特征作为结构先验抑制身份漂移,并通过条件扩散精炼器恢复高频细节,最终通过轻量级潜在动作头实现用户可控的4D世界模型重建。

详情
AI中文摘要

在计算机视觉与机器人感知的交汇处,动态场景的4D重建将低层几何感知与高层语义理解联系起来。我们提出Genie 4D,一个将手持手机拍摄转化为语义化、动作可控的4D世界模型的框架。Genie 4D将用于度量几何的实时视觉惯性高斯泼溅前端与由冻结的DINOv3特征(作为结构先验)正则化的前馈4D骨干网络相结合。语义先验抑制了动态跟踪中的身份漂移,而短条件扩散精炼器恢复了回归骨干网络平滑掉的高频表面细节。最后,一个轻量级潜在动作头将重建的4D状态暴露给以JEPA风格下一嵌入目标训练的Genie式世界模型,使得场景可以在用户动作下向前推进。在Point Odyssey和TUM-Dynamics基准测试上,Genie 4D保留了前馈基线的线性时间复杂度O(T),同时提高了3D跟踪精度(APD)和重建完整性,并且可以在单个消费级GPU(RTX 5090)上通过iPhone、Mac、Windows和Linux采集客户端交互式运行。Genie 4D为走向物理基础的世界模型提供了一条实用的、语义先验引导的路径。

英文摘要

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.

2604.09549 2026-06-02 cs.IR cs.AI 版本更新

Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

超越离线A/B测试:面向推荐系统评估的上下文感知智能体模拟

Nicolas Bougie, Gian Maria Marconi, Xiaotong Ye, Narimasa Watanabe

发表机构 * Woven by Toyota(丰田织物)

AI总结 针对推荐系统评估中离线指标与在线性能脱节的问题,提出基于大语言模型的上下文感知智能体框架ContextSim,通过生活模拟模块生成时间、地点和需求等上下文场景,并保持智能体行为一致,使模拟交互更贴近真实用户行为,且优化后的推荐系统参数能提升实际参与度。

详情
AI中文摘要

推荐系统是在线服务的核心,使用户能够在不同领域浏览海量内容。然而,由于离线指标与在线性能之间的脱节,其评估仍然具有挑战性。大语言模型驱动的智能体的出现提供了一种有前景的解决方案,但现有研究孤立地对用户进行建模,忽略了时间、地点和需求等上下文因素,而这些因素从根本上塑造了人类决策。在本文中,我们介绍了ContextSim,一个LLM智能体框架,通过将交互锚定在日常活动场景中来模拟可信的用户代理。具体来说,一个生活模拟模块生成指定用户何时、何地以及为何与推荐内容互动的场景。为了使偏好与真实人类对齐,我们对智能体的内部思维进行建模,并在动作和轨迹层面强制执行一致性。跨领域的实验表明,我们的方法生成的交互比先前工作更贴近人类行为。我们进一步通过离线A/B测试相关性验证了我们的方法,并表明使用ContextSim优化的推荐系统参数能改善实际参与度。

英文摘要

Recommender systems are central to online services, enabling users to navigate through massive amounts of content across various domains. However, their evaluation remains challenging due to the disconnect between offline metrics and online performance. The emergence of Large Language Model-powered agents offers a promising solution, yet existing studies model users in isolation, neglecting the contextual factors such as time, location, and needs, which fundamentally shape human decision-making. In this paper, we introduce ContextSim, an LLM agent framework that simulates believable user proxies by anchoring interactions in daily life activities. Namely, a life simulation module generates scenarios specifying when, where, and why users engage with recommendations. To align preferences with genuine humans, we model agents' internal thoughts and enforce consistency at both the action and trajectory levels. Experiments across domains show our method generates interactions more closely aligned with human behavior than prior work. We further validate our approach through offline A/B testing correlation and show that RS parameters optimized using ContextSim yield improved real-world engagement.

2604.09482 2026-06-02 cs.AI 版本更新

Process Reward Agents for Steering Knowledge-Intensive Reasoning

过程奖励智能体:引导知识密集型推理

Jiwoong Sohn, Tomasz Sternal, Kenneth Styppa, Torsten Hoefler, Michael Moor

发表机构 * University of Michigan(密歇根大学)

AI总结 提出过程奖励智能体(PRA),通过在线、分步奖励指导冻结策略模型进行搜索式解码,在医疗推理基准上取得新最优结果,并泛化至不同规模模型。

Comments Accepted to ICML 2026

详情
AI中文摘要

知识密集型领域的推理仍然具有挑战性,因为中间步骤通常无法局部验证:与数学或代码不同,评估步骤的正确性可能需要跨大型外部知识源综合线索。因此,细微错误可能在推理轨迹中传播,可能永远不被检测到。先前的工作提出了过程奖励模型(PRM),包括检索增强变体,但这些方法事后操作,对完成的轨迹进行评分,这阻止了它们集成到动态推理过程中。在这里,我们引入了过程奖励智能体(PRA),这是一种推理时方法,用于向冻结策略提供基于领域、在线、分步的奖励。与先前的检索增强PRM相比,PRA能够实现基于搜索的解码,在每个生成步骤中对候选轨迹进行排序和剪枝。在多个医疗推理基准上的实验表明,PRA持续优于强基线,在MedQA上使用Qwen3-4B达到81.9%的准确率,这是4B规模下的新最优结果。重要的是,PRA泛化到未见过的冻结策略模型(参数从0.5B到8B),在无需任何策略模型更新的情况下,将其准确率提升高达25.7%。更广泛地说,PRA提出了一种范式,其中冻结推理器与领域特定奖励模块解耦,允许在复杂领域中部署新主干而无需重新训练。

英文摘要

Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), an inference-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 81.9% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.

2604.09041 2026-06-02 cs.LG cs.AI physics.ao-ph stat.ML 版本更新

U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

U-Cast:一种惊人简单且高效的边界概率AI天气预报器

Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出基于标准U-Net骨架的概率天气预报模型U-Cast,通过确定性预训练和短时概率微调,以不到1/10的计算成本匹配或超越GenCast和IFS ENS的预报技能。

Comments ICML 2026. Our code is available at: https://github.com/Rose-STL-Lab/u-cast

详情
AI中文摘要

基于AI的天气预报现在可以与传统的基于物理的集合预报相媲美,但最先进的模型依赖于专门的架构和巨大的计算预算,造成了很高的进入门槛。我们证明,对于边界性能而言,这种复杂性是不必要的。我们引入了\ours,一种基于标准U-Net骨架的概率预报器,采用简单的训练方案:先进行基于平均绝对误差的确定性预训练,然后使用蒙特卡洛Dropout引入随机性,基于连续排序概率评分(CRPS)进行短时概率微调。结果,我们的模型在$1.5^\circ$分辨率下匹配或超过了GenCast和IFS ENS的概率技能,同时与领先的基于CRPS的模型相比,训练计算量减少了10倍以上,与基于扩散的模型相比,推理延迟减少了10倍以上。U-Cast在不到12个H200 GPU天内完成训练,并在3秒内生成15天的集合预报。这些结果表明,可扩展的通用架构与高效的训练课程相结合,可以以极低的成本匹配复杂的领域特定设计,从而向更广泛的社区开放边界概率天气模型的训练。

英文摘要

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce \ours, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at $1.5^\circ$ resolution while reducing training compute by over $10\times$ compared to leading CRPS-based models and inference latency by over $10\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 15-day ensemble forecast in 3 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community.

2604.06995 2026-06-02 cs.AI 版本更新

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

屏幕到动作中缺失了什么?面向多模态GUI推理的UI-in-the-Loop范式

Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen, Wen Zhang

发表机构 * Zhejiang University(浙江大学) ZJU-Ant Group Joint Lab of Knowledge Graph(浙大蚂蚁集团知识图谱联合实验室)

AI总结 提出UI-in-the-Loop (UILoop) 范式,通过循环的屏幕-UI元素-动作过程,让多模态大模型显式学习UI元素的定位、语义和用法,实现可解释推理,并在UI理解任务上达到最优。

Comments Accepted by ACL 2026 Findings

详情
AI中文摘要

现有的图形用户界面(GUI)推理任务仍然具有挑战性,特别是在UI理解方面。当前方法通常依赖于直接的基于屏幕的决策,缺乏可解释性,并忽略了对UI元素的全面理解,最终导致任务失败。为了增强对UI的理解和交互,我们提出了一种创新的GUI推理范式,称为UI-in-the-Loop(UILoop)。我们的方法将GUI推理任务视为一个循环的屏幕-UI元素-动作过程。通过使多模态大语言模型(MLLMs)显式学习关键UI元素的定位、语义功能和实际用法,UILoop实现了精确的元素发现和可解释推理。此外,我们引入了一个更具挑战性的UI理解任务,该任务围绕UI元素展开,并包含三个评估指标。相应地,我们贡献了一个包含26K样本的基准(UI Comprehension-Bench),以全面评估现有方法对UI元素的掌握程度。大量实验表明,UILoop在UI理解性能上达到了最先进水平,同时在GUI推理任务中也取得了优异结果。

英文摘要

Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.

2604.04958 2026-06-02 q-bio.QM cs.AI q-bio.NC 版本更新

CalM: A Self-Supervised Foundation Model for Population Dynamics in Calcium Imaging Data

CalM:一种用于钙成像数据中群体动力学的自监督基础模型

Xinhong Xu, Yimeng Zhang, Qichen Qian, Yuanlong Zhang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出自监督基础模型CalM,通过双轴自回归Transformer和高效分词器,在钙成像数据上预训练后,可迁移至神经群体动力学预测和行为解码等下游任务,并取得竞争性或更优性能。

Comments ICML accepted version

详情
AI中文摘要

近期研究表明,大规模多动物建模可显著改善神经记录分析。然而,对于功能性钙信号,现有方法仍为任务特定,限制了在常见神经科学目标间的迁移。为解决此挑战,我们提出 extbf{CalM},一种仅基于神经元钙信号训练的自监督神经基础模型,可适应包括预测和解码在内的多个下游任务。我们的关键贡献是一个预训练框架,包含一个高性能分词器,将单神经元信号映射到共享离散词汇表,以及一个双轴自回归Transformer,沿神经轴和时间轴建模依赖关系。我们在大规模、多动物、多会话数据集上评估CalM。在神经群体动力学预测任务上,CalM在预训练后与强专用基线相比取得了竞争性表现。通过任务特定头部,CalM进一步适应行为解码任务,并取得了优于监督解码模型的结果。此外,CalM表示的线性分析揭示了超越预测准确性的可解释功能结构。综上,我们提出了一种新颖且有效的基于钙信号的基础模型自监督预训练范式,为功能性神经分析中的可扩展预训练和广泛应用铺平了道路。代码已发布于https://github.com/TSuXinH/CalM。

英文摘要

Recent work suggests that large-scale, multi-animal modeling can significantly improve neural recording analysis. However, for functional calcium traces, existing approaches remain task-specific, limiting transfer across common neuroscience objectives. To address this challenge, we propose \textbf{CalM}, a self-supervised neural foundation model trained solely on neuronal calcium traces and adaptable to multiple downstream tasks, including forecasting and decoding. Our key contribution is a pretraining framework, composed of a high-performance tokenizer mapping single-neuron traces into a shared discrete vocabulary, and a dual-axis autoregressive transformer modeling dependencies along both the neural and the temporal axis. We evaluate CalM on a large-scale, multi-animal, multi-session dataset. On the neural population dynamics forecasting task, CalM achieves competitive performance against strong specialized baselines after pretraining. With a task-specific head, CalM further adapts to the behavior decoding task and achieves superior results compared with supervised decoding models. Moreover, linear analyses of CalM representations reveal interpretable functional structures beyond predictive accuracy. Taken together, we propose a novel and effective self-supervised pretraining paradigm for foundation models based on calcium traces, paving the way for scalable pretraining and broad applications in functional neural analysis. Code is released at https://github.com/TSuXinH/CalM.

2604.05634 2026-06-02 cs.AI 版本更新

PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

PECKER: 一种用于扩散模型机器遗忘的精确高效关键知识擦除方法

Zhiyong Ma, Zhitao Deng, Huan Tang, Jialin Chen, Zhijun Zheng, Zhengping Li, Qingyuan Chuai

发表机构 * Cao Tu Li(Guangzhou) Technology Co., Ltd, China(广州曹图利科技有限公司,中国) South China University of Technology, China(华南理工大学,中国) Guangzhou Xinhua University, China(广州新华大学,中国) Hong Kong Baptist University, HongKong(香港 Baptist 大学,香港)

AI总结 提出PECKER方法,通过显著性掩码优先更新关键参数,在蒸馏框架下实现高效机器遗忘,减少训练时间并保持遗忘效果。

Comments Accepted by ICPR 2026

详情
AI中文摘要

机器遗忘已成为生成式AI模型安全合规运行的关键技术。尽管现有MU方法有效,但大多数方法带来了高昂的训练时间和计算开销。我们的分析表明,根本原因在于梯度更新方向不佳,降低了训练效率并破坏了收敛稳定性。为缓解这些问题,我们提出PECKER,一种高效的MU方法,其性能匹配或优于主流方法。在蒸馏框架内,PECKER引入显著性掩码,优先更新对遗忘目标数据贡献最大的参数,从而减少不必要的梯度计算并缩短整体训练时间,同时不牺牲遗忘效果。我们的方法能够更快地生成遗忘相关类别或概念的样本,并在CIFAR-10和STL-10数据集上与真实图像分布紧密对齐,在类别遗忘和概念遗忘任务中均实现了更短的训练时间。

英文摘要

Machine unlearning (MU) has become a critical technique for GenAI models' safe and compliant operation. While existing MU methods are effective, most impose prohibitive training time and computational overhead. Our analysis suggests the root cause lies in poorly directed gradient updates, which reduce training efficiency and destabilize convergence. To mitigate these issues, we propose PECKER, an efficient MU approach that matches or outperforms prevailing methods. Within a distillation framework, PECKER introduces a saliency mask to prioritize updates to parameters that contribute most to forgetting the targeted data, thereby reducing unnecessary gradient computation and shortening overall training time without sacrificing unlearning efficacy. Our method generates samples that unlearn related class or concept more quickly, while closely aligning with the true image distribution on CIFAR-10 and STL-10 datasets, achieving shorter training times for both class forgetting and concept forgetting.

2604.04937 2026-06-02 cs.AI cs.CL 版本更新

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

Pramana: 通过 Navya-Nyaya 微调大型语言模型进行认知推理

Sharath Sathish

发表机构 * University of York(约克大学)

AI总结 提出 Pramana 方法,利用 2500 年历史的印度 Navya-Nyaya 逻辑框架微调 LLM,通过结构化六阶段推理解决认知差距,提升可溯源性。

Comments 52 pages + appendices, comprehensive treatment of Navya-Nyaya computational formalization

详情
AI中文摘要

大型语言模型能生成流畅文本,但在系统推理方面存在困难,常常产生自信但无根据的幻觉。当苹果研究人员向数学问题添加无关背景时,LLM 性能下降了 65% (Apple Machine Learning Research),暴露出表面推理下脆弱的模式匹配。这种认知差距,即无法将主张建立在可追溯证据上的能力,限制了 AI 在需要论证的领域的可靠性。我们引入 Pramana,一种新颖的方法,通过在 Navya-Nyaya 逻辑(一种 2500 年历史的印度推理框架)上进行微调,教导 LLM 显式的认识论方法论。与通用的思维链提示不同,Navya-Nyaya 强制执行结构化的六阶段推理:SAMSHAYA(怀疑分析)、PRAMANA(证据源识别)、PANCHA AVAYAVA(包含普遍规则的五段论)、TARKA(反事实验证)、HETVABHASA(谬误检测)和 NIRNAYA(区分知识与假设的确定)。这种逻辑与认识论的整合提供了标准推理方法所缺乏的认知支架。我们在 55 个 Nyaya 结构化的逻辑问题(约束满足、布尔 SAT、多步演绎)上微调了 Llama 3.2-3B 和 DeepSeek-R1-Distill-Llama-8B。第一阶段在保留评估上实现了 100% 的语义正确性,尽管严格格式遵循率仅为 40%,这表明即使结构执行不完美,模型也能内化推理内容。消融研究表明格式提示和温度对性能有关键影响,且不同阶段的最优配置不同。我们在 Hugging Face 上发布所有模型、数据集和训练基础设施,以促进关于 AI 推理认识论框架的进一步研究。

英文摘要

Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% Apple Machine Learning Research, exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.

2603.24324 2026-06-02 cs.LG cs.AI cs.SY eess.SY 版本更新

Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

大语言模型引导的激励感知奖励设计用于合作多智能体强化学习

Dogan Urgun, Gokhan Gungor

发表机构 * Department of Electrical and Electronics Engineering(电气与电子工程系) Karabuk University(卡拉博克大学) Department of Mechatronics Engineering(机械工程系)

AI总结 提出利用大语言模型自动生成可执行奖励程序,结合多智能体近端策略优化训练,在Overcooked-AI环境中显著提升合作任务回报。

详情
AI中文摘要

设计有效的辅助奖励对于合作多智能体系统仍然具有挑战性,因为激励不匹配会导致次优协调,尤其是在稀疏任务奖励无法为协调行为提供足够基础的情况下。本研究引入了一个自主奖励设计框架,利用大语言模型(LLMs)从环境仪器化中合成可执行的奖励程序。该过程将候选程序限制在形式有效性范围内,并在固定计算预算下使用多智能体近端策略优化(MAPPO)从头训练策略。然后根据性能评估候选程序,并仅基于稀疏任务回报进行跨代选择。该框架在四个Overcooked-AI布局中进行了评估,这些布局具有不同程度的走廊拥堵、交接依赖和结构不对称性。所提出的奖励设计方法始终产生更高的任务回报和交付数量,在交互瓶颈主导的环境中收益最为显著。对合成塑造成分的诊断分析揭示了动作选择中更强的相互依赖性,以及在协调密集型任务中信号对齐的改善。这些结果表明,所提出的LLM引导的奖励搜索框架减轻了手动工程的需求,同时产生了与有限预算下合作学习兼容的塑造成分信号。

英文摘要

Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient grounding for coordinated behavior. This study introduces an autonomous reward design framework that uses large language models (LLMs) to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and trains policies from scratch using Multi-Agent Proximal Policy Optimization (MAPPO) under a fixed computational budget. The candidates are then evaluated on the basis of their performance, and selection across generations solely based on the sparse task returns. The framework is evaluated in four Overcooked-AI layouts characterized by varying levels of corridor congestion, handoff dependencies, and structural asymmetries. The proposed reward design approach consistently yields higher task returns and delivery counts, with the most pronounced gains observed in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components reveals stronger interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the proposed LLM-guided reward search framework mitigates the need for manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.

2604.03893 2026-06-02 cs.AI 版本更新

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

FeynmanBench:多模态大语言模型在图解物理推理上的基准测试

Zeyu Wang, Jingye Xu, Xiaogang Li, Peiyao Xiao, Qinhao Kong, Ben Wang, Chengliang Xu, Zichao Chen, Bing Zhao, Hu Wei

发表机构 * Alibaba Group(阿里巴巴集团) Skylenage

AI总结 提出FeynmanBench基准,包含2000多个费曼图任务,评估多模态大模型在拓扑结构、守恒约束和视觉-代数映射等全局结构推理上的能力,发现模型在局部识别上表现良好但在拓扑重建和代数推导上严重不足。

Comments 9 pages, 5 figures

详情
AI中文摘要

当前用于科学推理的多模态基准主要评估局部信息提取——模型识别符号和数值,然后进行文本推理。它们不评估模型是否能在形式化图表的全局结构属性上进行推理,例如拓扑、守恒约束以及视觉模式与代数表达式之间的一致映射。我们引入了FeynmanBench,一个包含2000多个任务的基准,聚焦于涵盖标准模型电磁、弱和强相互作用的费曼图。每个实例将图表图像与最少的文本约定相结合,要求模型恢复完整的物理内容——顶点清单、传播子类型、拓扑连接性、动量路由以及完整的散射振幅。一个自动化的生成和验证流程在标准化规则下产生图表、注释和参考答案。评估了19个最先进的多模态大语言模型,我们发现一个一致的失败模式:模型在局部识别(顶点和传播子识别)上达到70-95%,但在拓扑重建(CP3)上下降到13-17%,在完整代数推导(CP5)上接近零。FeynmanBench为形式化科学图表上的多模态推理提供了一个受控测试平台,并突显了当前架构在拓扑敏感的科学推理中的基本局限性。

英文摘要

Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction -- models recognize symbols and values and then perform textual inference. They do not assess whether models can reason over the global structural properties of formal diagrams, such as topology, conservation constraints, and the consistent mapping between visual patterns and algebraic expressions. We introduce FeynmanBench, a benchmark of over 2,000 tasks centered on Feynman diagrams spanning the electromagnetic, weak, and strong interactions of the Standard Model. Each instance couples a diagram image with minimal textual conventions and requires models to recover the full physical content -- vertex inventory, propagator types, topological connectivity, momentum routing, and the complete scattering amplitude. An automated generation and verification pipeline produces the diagrams, annotations, and reference answers under standardized rules. Evaluating 19 state-of-the-art multimodal LLMs, we find a consistent failure pattern: models achieve 70--95\% on local recognition (vertex and propagator identification) but collapse to 13--17\% on topological reconstruction (CP3), and near zero on full algebraic derivation (CP5). FeynmanBench offers a controlled testbed for multimodal reasoning over formal scientific diagrams and highlights fundamental limitations of current architectures in topology-sensitive scientific reasoning.

2604.03789 2026-06-02 cs.LG cs.AI 版本更新

Automated Conjecture Resolution with Formal Verification

自动猜想解决与形式化验证

Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Shurui Liu, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) Westlake Institute for Advanced Study, Westlake University(西拉雅大学先进研究所) School of Mathematics, Tianjin University(天津大学数学学院) Research Institute for Mathematical Sciences, Kyoto University(京都大学数学研究所) Department of Mathematics, Stanford University(斯坦福大学数学系) IQuest Research(IQuest研究) New Cornerstone Science Laboratory, School of Mathematical Sciences, Peking University(北京大学数学科学学院新基石科学实验室) Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University(北京大学国际数学研究所以及新基石科学实验室) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University(大湾大学先进研究所智能计算中心) Zhongguancun Academy(中关村学院)

AI总结 提出一个集成非形式化推理与形式化验证的自动框架,通过两个组件Rethlas和Archon解决研究级数学问题,并成功解决交换代数中的开放问题并在Lean 4中形式化验证。

Comments Code and resources are available at: Rethlas (https://github.com/frenzymath/Rethlas), Rethlas Results (https://github.com/frenzymath/Rethlas_results), Archon (https://github.com/frenzymath/Archon), and the formalization results (https://github.com/frenzymath/Anderson-Conjecture)

详情
AI中文摘要

近年来,大型语言模型在数学推理能力上取得了显著进步,从解决初等问题扩展到研究级问题。然而,由于自然语言推理固有的歧义性,可靠地解决和验证此类问题仍然具有挑战性。本文提出一个自动框架,将自然语言推理与形式化验证相结合,以应对研究级数学问题。我们的框架由两个组件组成:非形式化推理代理Rethlas和形式化验证代理Archon。Rethlas将推理原语与我们的定理搜索引擎Matlas相结合,探索解决策略并构建候选证明。Archon配备LeanSearch,通过任务分解、迭代细化和自动证明合成,将非形式化论证转化为形式化的Lean 4项目,确保机器可检查的正确性。利用该框架,我们解决了一个交换代数中的开放问题,并在几乎无需人工参与的情况下在Lean 4中形式化验证了所得证明。额外的案例研究展示了Rethlas在非形式化数学推理和发现方面的能力,以及Archon将研究级证明形式化为Lean 4的能力。我们的实验表明,强大的定理检索工具能够发现和应用跨领域数学技巧,而形式化代理可以自主填补非形式化论证中的非平凡空白。更广泛地说,我们的工作展示了一种有前景的数学研究范式,其中配备定理检索工具的非形式化和形式化推理系统协同工作,以产生可验证的结果,减少人工努力,并支持人机协作的数学研究。

英文摘要

Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research-level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework that integrates natural language reasoning with formal verification to tackle research-level mathematical problems. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas combines reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with LeanSearch, translates informal arguments into formalized Lean 4 projects through task decomposition, iterative refinement, and automated proof synthesis, ensuring machine-checkable correctness. Using this framework, we resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Additional case studies illustrate the capabilities of Rethlas in informal mathematical reasoning and discovery, as well as the ability of Archon to formalize research-level proofs in Lean 4. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, while the formal agent can autonomously fill nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, reduce human effort, and support human-AI collaborative mathematical research.

2602.00906 2026-06-02 cs.LG cs.AI cs.CL cs.DS cs.IT math.IT 版本更新

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

幻觉是空间最优性的结果:成员测试的率失真定理

Anxin Guo, Jingwei Li

发表机构 * Computer Science Department, Northwestern University(西北大学计算机科学系) Department of IEOR, Columbia University(哥伦比亚大学工业工程与运筹学系)

AI总结 通过将幻觉形式化为成员测试问题,建立率失真定理,证明在有限容量下信息论最优策略必然导致对某些非事实的高置信度,从而产生幻觉。

Comments ICML 2026

详情
AI中文摘要

大型语言模型通常对缺乏可推断模式的“随机事实”以高置信度产生幻觉。我们将此类事实的记忆形式化为一个成员测试问题,统一了布隆过滤器的离散误差指标与LLM的连续对数损失。通过分析在事实在可能主张的宇宙中稀疏的情况下,我们建立了一个率失真定理:最优记忆效率由事实与非事实得分分布之间的最小KL散度刻画。这一理论框架在理想化设置下为幻觉提供了独特的解释:即使有最优训练、完美数据和简化的“封闭世界”设置,有限容量下信息论最优策略不是放弃或遗忘,而是对某些非事实赋予高置信度,从而导致幻觉。我们在合成数据和真实数据上实证验证了这一理论,表明幻觉作为有损压缩的自然结果持续存在。同一定理恢复并锐化了布隆型滤波器的经典空间下界,确定了两侧滤波器遗留的加性常数。

英文摘要

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination under an idealized setting: even with optimal training, perfect data, and a simplified ``closed world'' setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on both synthetic and real-world data, showing that hallucinations persist as a natural consequence of lossy compression. The same theorem recovers and sharpens classical space lower bounds for Bloom-type filters, pinning down an additive constant left open for two-sided filters.

2603.03312 2026-06-02 cs.CL cs.AI cs.HC eess.AS q-bio.NC 版本更新

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

逃离BLEU陷阱:一种基于信号锚定与解耦语义引导的脑电解码文本框架

Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li

发表机构 * The Hong Kong University of Science and Technology(香港科学与技术大学) Shenzhen Loop Area Institute(深圳环湖研究院)

AI总结 针对脑电解码文本中语义偏差、信号忽略和BLEU陷阱问题,提出SemKey多阶段框架,通过解耦语义目标(情感、主题、长度、意外性)和主动检索解码机制,强制生成基于信号而非语言先验,并采用检索和分布度量(如Fréchet距离)建立评估协议,有效缓解幻觉并达到最优性能。

详情
AI中文摘要

从非侵入性脑电信号中解码自然语言是一项有前景但充满挑战的任务。然而,当前最先进的模型仍受限于三个基本问题:语义偏差(输出退化为通用语言模板)、信号忽略(模型严重依赖大语言模型先验,即使在缺乏有意义信号时也能生成流畅文本)以及“BLEU陷阱”(高频停用词虚增n-gram指标,掩盖真正语义保真度的缺失)。为解决这些挑战,我们超越传统的端到端流水线,提出SemKey——一种新颖的多阶段框架,通过四个解耦的语义目标(情感、主题、长度和意外性)强制进行基于信号的生成。我们直接从脑电嵌入中提取这些语义锚点,然后通过主动检索解码机制统一它们,迫使大语言模型将其令牌生成锚定在神经信号上,而非默认使用语言先验。此外,我们通过建立全面的评估协议(使用严格的检索和基于分布的度量,如Fréchet距离)打破BLEU陷阱。大量实验表明,SemKey有效缓解了对噪声输入的幻觉,并在这些鲁棒协议上达到了最先进的性能。代码将在论文被接收后发布于https://github.com/xmed-lab/SemKey。

英文摘要

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental issues: Semantic Bias, where outputs collapse into generic linguistic templates; Signal Neglect, where models rely heavily on LLM priors to hallucinate fluent text even in the absence of meaningful signals; and the "BLEU Trap", where high-frequency stopwords inflate n-gram metrics, masking a lack of true semantic fidelity. To resolve these challenges, we move beyond conventional end-to-end pipelines and propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We extract these semantic anchors from EEG embeddings directly, then unify them with an Active Retrieval Decoding mechanism, compelling the LLM to ground its token generation in the neural signals rather than defaulting to linguistic priors. Furthermore, we break the BLEU Trap by establishing a comprehensive evaluation protocol using rigorous retrieval and distribution-based metrics such as Fréchet Distance. Extensive experiments demonstrate that SemKey effectively mitigates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.

2604.01841 2026-06-02 cs.AI 版本更新

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

检索对齐的表格基础模型实现电子健康记录中在现实约束下的稳健临床风险预测

Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对电子健康记录中高维、异质、类别不平衡和分布偏移等挑战,提出任务对齐检索框架AWARE,通过监督嵌入学习和轻量适配器提升表格上下文学习性能,在极端不平衡下AUPRC提升高达12.2%。

Comments Not peer-reviewed

详情
AI中文摘要

从结构化电子健康记录(EHR)进行临床预测具有挑战性,原因包括高维性、异质性、类别不平衡和分布偏移。尽管表格上下文学习(TICL)和检索增强方法在通用基准上表现良好,但它们在临床环境中的行为仍不清楚。我们提出了一个多队列EHR基准,比较了经典模型、深度表格模型和TICL模型在不同数据规模、特征维度、结果稀有性和跨队列泛化下的表现。基于PFN的TICL模型在低数据情况下样本高效,但随着异质性和不平衡的增加,在基于朴素距离的检索下性能下降。我们提出了AWARE,一个任务对齐的检索框架,使用监督嵌入学习和轻量适配器。AWARE在极端不平衡下将AUPRC提升了高达12.2%,且增益随数据复杂性增加。我们的结果识别出检索质量以及检索-推理对齐是部署表格上下文学习进行临床预测的关键瓶颈。

英文摘要

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.

2604.01562 2026-06-02 cs.SD cs.AI cs.CL cs.CY cs.HC 版本更新

Acoustic and perceptual differences between standard and accented speech and their voice clones

标准口音与带口音语音及其语音克隆的声学与感知差异

Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

发表机构 * Department of Linguistics, University at Buffalo, United States(语言学系,布法罗大学,美国) Department of Computer Science and Engineering, University at Buffalo, United States(计算机科学与工程系,布法罗大学,美国) Emeritus Faculty, Australian National University, Australia(澳大利亚国立大学荣誉教职)

AI总结 通过计算和感知实验,比较标准口音与带口音普通话及其语音克隆,发现口音影响感知身份匹配和可懂度,且标准口音克隆更接近原声,带口音克隆可懂度提升更大。

详情
AI中文摘要

语音克隆通常根据整体质量进行评估,但关于口音保留及其感知后果的了解较少。我们采用计算和感知相结合的设计,比较标准口音和重度口音普通话及其语音克隆。基于嵌入的分析显示,在多个说话人判别嵌入空间中,带口音说话人的原始-克隆距离更大,但在根据每个说话人的原始内部基线变异性进行归一化后,这种差异消失。在感知研究中,标准口音说话人的克隆被评价为比带口音说话人的克隆更接近其原始声音,并且从原始到克隆的可懂度增加,其中带口音语音的增益更大。这些结果表明,即使口音变异未反映在基线归一化的说话人嵌入距离中,它也能影响语音克隆中的感知身份匹配和可懂度,并促使将口音保留视为说话人身份保留的一个明确组成部分,而不是假设它完全由现成的说话人判别嵌入所捕获。

英文摘要

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.

2603.28825 2026-06-02 cs.GT cs.AI 版本更新

Incentives, Equilibria, and the Limits of Healthcare AI: A Game-Theoretic Perspective

激励、均衡与医疗AI的局限:博弈论视角

Ari Ercole

发表机构 * Cambridge Centre for AI in Medicine, University of Cambridge, UK(剑桥大学医学人工智能中心) Magdalene College, University of Cambridge, UK(剑桥大学玛格丽特学院)

AI总结 本文通过住院容量管理的协调问题,描述三种AI部署形式,并分析其对系统行为的影响,指出只有改变激励结构的干预才能改变稳定均衡,为医疗AI的采购、治理和评估提供实践启示。

详情
AI中文摘要

利用一个来自住院容量管理的典型协调问题,描述了三种典型的AI部署形式:减少努力的技术、面向可观测性的系统以及改变潜在激励结构的干预。减少努力和可观测性可能改善现有行为模式下的性能,但通常不会改变哪些行动是个人理性的。因此,此类干预通常被吸收到现有均衡中。相比之下,通过重新分配或限制局部风险来改变局部行动如何影响下游后果的干预可以改变稳定的系统行为。这些机制层面的干预不同之处不在于技术复杂性,而在于它们与制度激励的相互作用。分析表明,对AI带来系统层面收益的期望应取决于部署是否改变了激励,而不仅仅是优化任务或信息流。对于医疗组织和政策制定者而言,这对数字技术的采购、治理和评估具有实际意义。

英文摘要

Using a stylised coordination problem drawn from inpatient capacity management, three archetypal forms of AI deployment are described: effort-reducing technologies, observability-oriented systems, and interventions that alter underlying incentive structures. Effort reduction and observability may improve performance within existing patterns of behaviour but do not, in general, change which actions are individually rational. As a result, such interventions are typically absorbed into existing equilibria. By contrast, interventions that modify how local actions map to downstream consequences by redistributing or bounding local risk can change stable system behaviour. These mechanism-level interventions differ not in technical sophistication but in their interaction with institutional incentives. The analysis suggests that expectations of system-level gains from AI should be conditioned on whether a deployment changes incentives rather than optimising tasks or information flows alone. For healthcare organisations and policymakers, this has practical implications for procurement, governance, and evaluation of digital technologies.

2603.27223 2026-06-02 cs.CV cs.AI 版本更新

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

EuraGovExam:来自现实世界公务员考试的多语言多模态基准

Jaeseong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

发表机构 * School of Computer Science / Data Intelligence Lab(计算机科学学院/数据智能实验室)

AI总结 提出一个包含8000多道真实公务员考试题目的多语言多模态基准EuraGovExam,要求模型直接从图像中进行布局感知的跨语言推理,当前最先进的视觉语言模型准确率仅达86%。

详情
AI中文摘要

我们提出了EuraGovExam,一个多语言和多模态基准,来源于五个代表性欧亚地区(韩国、日本、台湾、印度和欧盟)的现实世界公务员考试。该数据集旨在反映公共部门评估的真实复杂性,包含超过8000道高分辨率扫描选择题,涵盖17个不同的学术和行政领域。与现有基准不同,EuraGovExam将所有题目内容(包括问题陈述、答案选项和视觉元素)嵌入到单个图像中,仅提供最小化的标准答案格式指令。这种设计要求模型直接从视觉输入进行布局感知的跨语言推理。所有题目均来自真实考试文档,保留了丰富的视觉结构,如表格、多语言排版和类似表单的布局。评估结果显示,即使是最先进的视觉语言模型(VLM)也仅达到86%的准确率,突显了该基准的难度及其诊断当前模型局限性的能力。通过强调文化真实性、视觉复杂性和语言多样性,EuraGovExam为在高风险、多语言、图像基础环境中评估VLM建立了新标准。它还支持电子政务、公共部门文档分析和公平考试准备等实际应用。

英文摘要

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content--including problem statements, answer choices, and visual elements--within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.

2603.26779 2026-06-02 cs.CV cs.AI 版本更新

Limits of Spatial Imagery Reasoning in Frontier LLM Models

前沿大语言模型在空间意象推理中的局限性

Sergio Y. Hayashi, Nina S. T. Hirata

发表机构 * Institute of Mathematics and Statistics – University of São Paulo(数学统计研究所 – 圣保罗大学)

AI总结 本研究通过引入外部“意象模块”辅助3D模型旋转任务,发现即使外包整体3D状态维护,前沿模型仍缺乏基础视觉空间原语,导致准确率最高仅62.5%。

Comments 25 pages. v2: Title updated; added a section on object/spatial imagery and propositional reasoning; added new experimental results for the single-object rotation probe

详情
AI中文摘要

大型语言模型(LLMs)展示了令人印象深刻的推理能力,但在需要心理模拟的空间任务(如心理旋转)中表现不佳。本文研究是否通过为LLM配备一个外部“意象模块”——一种能够渲染和旋转3D模型的工具——可以弥合这一差距,充当“认知假体”。我们使用双模块架构进行了实验,其中推理模块(MLLM)与意象模块在3D模型旋转任务上进行交互。性能低于预期,准确率最高达到62.5%。进一步研究表明,即使将维护和操作整体3D状态的负担外包,系统仍然失败。这揭示了当前前沿模型缺乏与意象交互所需的基础视觉空间原语。具体来说,它们缺乏:(1)提取空间信号的低级敏感性,例如(a)深度,(b)运动,以及(c)短视距动态预测;以及(2)对图像进行沉思性推理的能力,动态转移视觉焦点,并平衡意象与符号和关联信息。

英文摘要

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external ``Imagery Module'' -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a ``cognitive prosthetic.'' We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.

2603.23582 2026-06-02 cs.LG cs.AI 版本更新

AI Generalisation Gap In Comorbid Sleep Disorder Staging

共病睡眠障碍分期中的AI泛化差距

Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

发表机构 * arXiv

AI总结 针对脑卒中患者睡眠分期中深度学习模型在健康与临床人群间泛化差的问题,通过Grad-CAM可视化和新数据集iSLEEPS,揭示模型关注生理无意义区域,并强调需开发疾病特异性模型。

详情
AI中文摘要

准确的睡眠分期对于诊断脑卒中患者的OSA和低通气至关重要。尽管PSG可靠,但成本高、劳动密集且需人工评分。虽然深度学习在健康受试者中实现了基于EEG的自动睡眠分期,但我们的分析显示,该方法在睡眠紊乱的临床人群中泛化能力差。利用Grad-CAM解释,我们系统地证明了这一局限性。我们引入了iSLEEPS,一个经过临床注释的缺血性脑卒中新数据集(即将公开发布),并评估了SE-ResNet加双向LSTM模型用于单通道EEG睡眠分期。正如预期,健康与疾病受试者之间的跨域性能很差。注意力可视化在临床专家反馈的支持下显示,模型在患者数据中关注生理上无信息的EEG区域。统计和计算分析进一步证实了健康与缺血性脑卒中队列之间显著的睡眠结构差异,强调了在部署前需要经过临床验证的受试者感知或疾病特异性模型。论文和代码摘要见https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

英文摘要

Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

2603.24511 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Claudini: 自动研究发现针对LLM的最先进对抗攻击算法

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

发表机构 * MATS ELLIS Institute(MATS ELLIS研究所) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) Tübingen AI Center(图宾根人工智能中心) Imperial College London(伦敦帝国理工学院)

AI总结 本文提出一种自动研究循环,利用前沿AI代理(如Claude Code和Codex)自动发现针对大语言模型的新型对抗攻击算法,在白盒越狱和提示注入评估中达到最先进水平。

详情
AI中文摘要

我们证明AI代理能够发现针对LLM的新型对抗攻击算法,在白盒越狱和提示注入评估中推进了最先进水平。我们部署前沿代理(如Claude Code和Codex)在自动研究循环中,访问包含30多种先前方法的库和具有固定计算预算的评估脚本。我们展示了该流程在越狱OpenAI的GPT-OSS-Safeguard-20B以及对对抗鲁棒模型Meta-SecAlign-70B进行提示注入方面的有效性。对于GPT-OSS-Safeguard,代理发现的最佳方法在CBRN查询上实现了高达80%的攻击成功率,而现有方法低于50%。对于SecAlign,它实现了100%的ASR,而先前最佳自动化方法仅达到82%。值得注意的是,在我们的设置中,攻击方法是在无关的替代模型上为纯随机目标令牌强制任务开发的,却直接泛化到对抗训练模型上的提示注入。最后,我们追溯了自动研究过程中开发的方法的谱系,刻画了代理的策略和失败模式。对抗性机器学习长期以来一直认为防御必须针对为其量身定制的攻击进行评估;自动研究自动化了这一原则,我们认为这应该是未来防御评估的最低标准。

英文摘要

We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents, such as Claude Code and Codex, in an autoresearch loop with access to a library of 30+ prior methods and an evaluation script with a fixed compute budget. We show this pipeline to be effective in jailbreaking OpenAI's GPT-OSS-Safeguard-20B and in prompt injections against Meta-SecAlign-70B, an adversarially robust model. For GPT-OSS-Safeguard, the best agent-discovered method achieves up to 80\% attack success rate on CBRN queries, compared to <50\% for existing methods. For SecAlign, it achieves 100\% ASR, while the best prior automated methods only achieve 82\%. Notably, in our setting, attack methods are developed on unrelated surrogate models for a pure random-target token-forcing task, yet generalize directly to prompt injection on the adversarially trained model. Finally, we trace the lineage of methods developed during autoresearch, characterizing the agents' strategies and failure modes. Adversarial ML has long held that defenses must be evaluated against attacks tailored to them; autoresearch automates this principle, and we argue it should be the minimum bar for defense evaluation going forward.

2603.23902 2026-06-02 cs.CV cs.AI 版本更新

Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

知识精炼的双上下文感知网络用于部分相关视频检索

Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang

发表机构 * School of Software Engineering, Xi’an Jiaotong University(西安交通大学软件工程学院) Faculty of Computer Science, Electrical Engineering and Information Technology, Universität Stuttgart(斯图加特大学计算机科学、电子工程和信息学院)

AI总结 针对未修剪视频中部分相关片段检索的信息密度不匹配和注意力机制不足问题,提出KDC-Net网络,通过层次语义聚合、动态时间注意力和基于CLIP的蒸馏策略,显著提升检索性能。

Comments Accepted in ICME 2026

详情
AI中文摘要

从未修剪视频中检索部分相关片段仍然面临两个持续挑战:文本与视频片段之间的信息密度不匹配,以及有限的注意力机制忽略了语义焦点和事件相关性。我们提出了KDC-Net,一个知识精炼的双上下文感知网络,从文本和视觉两个角度解决这些问题。在文本方面,层次语义聚合模块捕获并自适应融合多尺度短语线索以丰富查询语义。在视频方面,动态时间注意力机制采用相对位置编码和自适应时间窗口来突出具有局部时间连贯性的关键事件。此外,一种基于CLIP的动态蒸馏策略,结合时间连续性感知精炼,确保了片段感知和目标对齐的知识迁移。在PRVR基准上的实验表明,KDC-Net始终优于最先进的方法,特别是在低片段-视频比率下。

英文摘要

Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.

2603.23647 2026-06-02 cs.CV cs.AI cs.LG 版本更新

λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

λSplit: 用于荧光显微镜的自监督内容感知光谱解混

Federico Carrara, Talley Lambert, Mehdi Seifi, Florian Jug

发表机构 * Fondazione Human Technopole(人类技术极地基金会) Harvard Medical School(哈佛医学院) Università Campus Bio-Medico(生物医学大学校园)

AI总结 提出λSplit,一种基于物理信息的深度生成模型,通过分层变分自编码器学习浓度图的条件分布,结合可微分光谱混合器实现最先进的光谱解混和隐式噪声去除。

Comments 14 pages, 25 pages supplement, 16 figures total, 14 tables total

详情
AI中文摘要

在荧光显微镜中,光谱解混旨在从捕获混合荧光发射的光谱图像中恢复单个荧光团浓度。由于经典方法逐像素操作并依赖最小二乘拟合,其性能随着发射光谱重叠增加和噪声水平升高而下降,这表明能够学习并利用结构先验的数据驱动方法可能会带来改进。基于学习的光谱成像方法确实存在,但它们要么未针对显微镜数据进行优化,要么是为不适用于荧光显微镜设置的非常特定情况而开发的。为了解决这个问题,我们提出了λSplit,一种基于物理信息的深度生成模型,它使用分层变分自编码器学习浓度图上的条件分布。一个完全可微的光谱混合器强制与图像形成过程的一致性,而学习到的结构先验实现了最先进的解混和隐式噪声去除。我们在3个真实世界数据集上展示了λSplit,这些数据集被我们合成为总共66个具有挑战性的光谱解混基准。我们将结果与总共10种基线方法进行比较,包括经典方法和一系列基于学习的方法。我们的结果一致显示出竞争性能和在强噪声、光谱显著重叠或光谱维度降低情况下的改进鲁棒性,使λSplit成为荧光显微镜数据光谱解混的新最先进方法。重要的是,λSplit与标准共聚焦显微镜产生的光谱数据兼容,无需专门的硬件修改即可立即采用。

英文摘要

In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescent microscopy data. Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.

2603.23485 2026-06-02 cs.CL cs.AI cs.CY 版本更新

Failure of contextual invariance in large language models

大型语言模型中语境不变性的失效

Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

发表机构 * Network Science Institute, Northeastern University(网络科学研究所,东北大学) Center for Health Informatics Program, Boston Children’s Hospital(健康信息学计划中心,波士顿儿童医院) Dept. of Mathematics, City St George’s, University of London(伦敦大学城市圣乔治学院数学系) IT University of Copenhagen(哥本哈根IT大学)

AI总结 通过代词选择任务发现,在语境等价但无信息量的干扰下,大语言模型输出发生系统性偏移,表明其违反语境不变性,影响偏见评估与高风险应用。

详情
AI中文摘要

标准评估实践假设,当提示嵌入语境等价的语篇中时,大型语言模型(LLM)的输出是稳定的。这里,我们在性别推断的背景下测试这一假设。使用受控的代词选择任务,我们引入最小的、理论上无信息的语篇语境,发现这会导致模型输出出现大规模、系统性的偏移。与去语境化设置中存在的文化性别刻板印象的相关性在引入语境后减弱或消失,而理论上无关的特征(如无关指代对象的代词性别)成为模型行为最具信息量的预测因子。通过默认语境性分析发现,在模型间的19%至52%的案例中,这种依赖性在考虑语境对单个输出的所有边际效应后仍然存在,并且不能归因于简单的代词重复。这些发现表明,即使在几乎相同的句法表述下,LLM的输出也违反了语境不变性,这对偏见基准测试和高风险环境中的部署具有重要影响。

英文摘要

Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behavior. A Contextuality-by-Default analysis reveals that, in 19--52\% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

2603.23398 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

图能量匹配:用于图生成的传输对齐能量基建模

Michal Balcerak, Suprosanna Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze

发表机构 * University of Zurich(苏黎世大学) Harvard University(哈佛大学) Kempner Institute(凯普纳研究所)

AI总结 提出Graph Energy Matching (GEM)方法,基于JKO传输映射优化视角学习置换不变势能,通过能量基切换策略实现离散图的高质量生成,在分子图基准上匹配或超越离散扩散模型。

详情
AI中文摘要

离散数据(如图)的生成建模支撑着许多科学和工业应用,包括分子发现和材料设计。在这些领域中,概率推理尤其有价值,因为它能够实现可组合生成和原则性地融入期望的约束,例如结构或功能属性。能量基模型通过捕获相对似然并在推理过程中直接施加约束来支持可组合推理,自然符合这一目标。然而,离散能量基模型通常难以实现高效高质量的采样,因为支持区域外的区域常包含虚假局部最小值,会困住采样器并导致训练不稳定,从而与离散扩散模型相比存在保真度差距。为了解决这一差距,我们引入了Graph Energy Matching (GEM),这是一种受Jordan-Kinderlehrer-Otto (JKO)传输映射优化视角启发的离散生成框架。GEM学习一个置换不变势能,同时引导从噪声到高似然图区域的离散传输,并在这些区域内细化样本。我们进一步引入了一种利用能量基切换策略的采样协议,无缝衔接快速的梯度引导传输和用于有效探索的局部混合机制。在分子图基准上,GEM在大多数报告指标上匹配或超越了强离散扩散基线。除了提高生成质量,GEM的相对似然建模还支持定向探索,促进组合生成、属性约束采样以及图之间的插值。项目页面:https://michalbalcerak.ai/graph-energy-matching/。

英文摘要

Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery and materials design. In these domains, probabilistic inference is particularly valuable, as it enables composable generation and principled incorporation of desired constraints, such as structural or functional properties. Energy-based models naturally support this goal by capturing relative likelihoods and enabling composable inference by directly enforcing constraints during inference. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities, resulting in a fidelity gap compared to discrete diffusion models. To address this gap, we introduce Graph Energy Matching (GEM), a discrete generative framework inspired by the Jordan-Kinderlehrer-Otto (JKO) transport-map optimization perspective. GEM learns a permutation-invariant potential energy that simultaneously guides discrete transport from noise toward high-likelihood graph regions and refines samples within these regions. We further introduce a sampling protocol leveraging an energy-based switching strategy, seamlessly bridging rapid, gradient-guided transport and a local mixing regime for effective exploration. On molecular graph benchmarks, GEM matches or surpasses strong discrete diffusion baselines on most reported metrics. Beyond improving generation quality, GEM's relative likelihood modeling enables targeted exploration, facilitating compositional generation, property-constrained sampling, and interpolation between graphs. Project page: https://michalbalcerak.ai/graph-energy-matching/.

2510.19496 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CARES: Context-Aware Resolution Selector for VLMs

CARES: 面向视觉语言模型的上下文感知分辨率选择器

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

发表机构 * Technion(技术ion大学) IBM Research(IBM研究院) Tel-Aviv University(特拉维夫大学) Ben-Gurion University(本· Gurion大学)

AI总结 提出CARES轻量级预处理模块,通过紧凑型VLM预测图像-查询对的最小足够分辨率,在保持任务性能的同时最多减少80%计算量。

Comments Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Accepted to ACL 2026 (Oral presentation). Code available at https://github.com/mkimhi/CARES

详情
AI中文摘要

大型视觉语言模型通常以原始或高分辨率处理图像以保持跨任务有效性。这导致视觉令牌通常占总令牌的97-99%,即使低分辨率图像就足够时,也会产生高计算量和延迟。我们引入了CARES——一种上下文感知分辨率选择器,这是一个轻量级预处理模块,给定图像-查询对,预测最小的足够输入分辨率。CARES使用紧凑型VLM(350M)提取特征,并预测目标预训练VLM的响应何时收敛到其正确回答的峰值能力。尽管作为一组可选分辨率上的离散分类器进行训练,但CARES在推理时插值连续分辨率以实现细粒度控制。在涵盖文档和自然图像以及多样化目标VLM的五个多模态基准测试中,CARES在保持任务性能的同时最多减少80%的计算量。

英文摘要

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens ofter to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}-a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.

2603.18652 2026-06-02 cs.CV cs.AI cs.IR 版本更新

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

超越字符串匹配:PDF表格提取的语义评估

Pius Horn, Janis Keuper

发表机构 * Institute for Machine Learning and Analytics (IMLA)(机器学习与分析研究所) Offenburg University(奥芬堡大学) University of Mannheim(曼海姆大学)

AI总结 提出基于LLM-as-a-judge的语义评估框架,通过合成PDF和人工验证,显著优于现有规则指标(TEDS、GriTS),并评估了21种PDF解析器。

Comments Submitted to BMVC 2026

详情
AI中文摘要

从PDF中可靠地提取表格对于大规模科学数据挖掘和知识库构建至关重要,然而现有的评估方法依赖于基于规则的指标,无法捕捉表格内容的语义等价性。我们提出了一个基于合成PDF的基准测试框架,这些PDF具有精确的LaTeX真实标注,并使用来自arXiv的表格以确保现实的复杂性和多样性。作为我们的核心方法论贡献,我们将LLM-as-a-judge应用于语义表格评估,并将其集成到一个能够适应解析器输出不一致性的匹配流水线中。通过一项包含超过1500个提取表格对的人工验证研究,我们表明基于LLM的评估与人类判断的相关性(Pearson r=0.93)显著高于当前使用的基于树编辑距离的相似度(TEDS, r=0.68)和网格表格相似度(GriTS, r=0.70)。对21个当代PDF解析器在包含451个表格的100个合成文档上的评估揭示了显著的性能差异。我们的结果为选择用于表格数据提取的解析器提供了实用指导,并为这一关键任务建立了一种可重复、可扩展的评估方法。代码和数据:https://github.com/phorn1/pdf-parse-bench 指标研究和人工评估:https://github.com/phorn1/table-metric-study

英文摘要

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to currently used Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study

2502.15411 2026-06-02 cs.CL cs.AI 版本更新

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

HiFi-KPI:用于从财报中提取层次化KPI的数据集

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

发表机构 * Department of Computer Science, Aalborg University(奥尔堡大学计算机科学系) ALIPES ApS(ALIPES公司) University of Copenhagen(哥本哈根大学) Pioneer Centre for AI(先锋人工智能中心)

AI总结 针对财报中关键绩效指标(KPI)跨公司可迁移性差的问题,提出包含165万段落和19.8万层次化标签的HiFi-KPI数据集,并评估分类、提取和结构化提取三个任务。

详情
AI中文摘要

准确标注财报可以为利益相关者带来显著的短期回报。机器可读的内联可扩展商业报告语言(iXBRL)是公开财务申报的强制要求。然而,其复杂且细粒度的分类法限制了标记关键绩效指标(KPI)的跨公司可迁移性。为了解决这个问题,我们引入了层次化财务关键绩效指标(HiFi-KPI)数据集,这是一个包含165万段落和19.8万个独特层次化标签的大规模语料库,这些标签与iXBRL分类法相关联。HiFi-KPI支持多个任务,我们评估了其中三个:KPI分类、KPI提取和结构化KPI提取。为了快速评估,我们还发布了HiFi-KPI-Lite,一个手动策划的8K段落子集。在HiFi-KPI-Lite上的基线实验表明,基于编码器的模型在分类任务上达到了超过0.906的宏F1分数,而大型语言模型(LLM)在结构化提取任务上达到了0.440的F1分数。最后,定性分析显示提取错误主要与日期相关。我们在https://github.com/aaunlp/HiFi-KPI上开源了所有代码和数据。

英文摘要

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings. Yet, its complex, fine-grained taxonomy limits the cross-company transferability of tagged Key Performance Indicators (KPIs). To address this, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, a large-scale corpus of 1.65M paragraphs and 198k unique, hierarchically organized labels linked to iXBRL taxonomies. HiFi-KPI supports multiple tasks and we evaluate three: KPI classification, KPI extraction, and structured KPI extraction. For rapid evaluation, we also release HiFi-KPI-Lite, a manually curated 8K paragraph subset. Baselines on HiFi-KPI-Lite show that encoder-based models achieve over 0.906 macro-F1 on classification, while Large Language Models (LLMs) reach 0.440 F1 on structured extraction. Finally, a qualitative analysis reveals that extraction errors primarily relate to dates. We open-source all code and data at https://github.com/aaunlp/HiFi-KPI.

2603.18016 2026-06-02 cs.CL cs.AI cs.DC cs.LG 版本更新

MineDraft: A Framework for Batch Parallel Speculative Decoding

MineDraft: 批量并行推测解码框架

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Toyota Research Institute(丰田研究院) Toyota Motor Corporation(丰田公司)

AI总结 提出MineDraft框架,通过批量并行设计将草稿生成与验证阶段重叠,显著提升推测解码的吞吐量和端到端延迟。

Comments Accepted at ICML 2026

详情
AI中文摘要

推测解码(SD)通过使用较小的草稿模型提出草稿令牌,随后由较大的目标模型验证,从而加速大型语言模型推理。然而,标准SD的性能通常受限于这些草稿和验证阶段的严格顺序执行。为解决此问题,本文提出MineDraft,一种批量并行推测解码(PSD)框架,旨在通过将草稿生成与验证重叠来有效隐藏草稿延迟。我们的理论分析表明,PSD比标准SD高效得多。MineDraft通过一种新颖的批量并行设计实现PSD,该设计维护两个请求批次,将一个批次的草稿生成与另一个批次的验证重叠。我们的实验结果显示,与标准SD相比,MineDraft在吞吐量(最高提升75%)和端到端延迟(最高降低39%)方面均有显著改进。此外,我们已将MineDraft实现为vLLM的插件,展示了其在生产级推理系统中的实用性。

英文摘要

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

2603.17893 2026-06-02 cs.SE cs.AI cs.LG 版本更新

scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

scicode-lint: 使用LLM生成的模式检测科学Python代码中的方法论错误

Sergey V. Samsonau

发表机构 * Authentic Research Partners, Princeton, NJ(真实研究伙伴,新泽西州普林斯顿)

AI总结 提出scicode-lint,通过两级架构(构建时使用前沿模型生成模式,运行时使用小型本地模型执行)自动检测科学Python代码中的方法论错误,如数据泄露、交叉验证错误和缺失随机种子。

详情
AI中文摘要

科学Python代码中的方法论错误会产生看似合理但实际不正确的结果,传统的linter和静态分析工具无法检测到这些错误。多个研究团队构建了特定于ML的linter,证明了检测的可行性。然而,这些工具存在可持续性问题:依赖于特定的pylint或Python版本、有限的打包方式,以及每个新模式都需要手动工程。随着AI生成代码增加了科学软件的数量,对自动化方法论检查(如检测数据泄露、不正确的交叉验证和缺失随机种子)的需求日益增长。我们提出了scicode-lint,其两级架构将模式设计(构建时的前沿模型)与执行(运行时的小型本地模型)分离。模式是生成的,而非手工编码;适应新的库版本花费的是token,而非工程时间。在带有手动标注真实值的Kaggle笔记本上,预处理泄露检测在100%召回率下达到了65%的精确率;在38篇应用AI/ML的已发表科学论文中,精确率为62%(由LLM评判),不同模式类别之间存在显著差异;在一个保留的论文集上,精确率为54%。在受控测试中,scicode-lint在66个模式上达到了97.7%的准确率。

英文摘要

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.

2603.13373 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Ethical Fairness in Ubiquitous Health Sensing without Known Attributes

无已知属性下的普适健康感知伦理公平性

Shaily Roy, Harshit Sharma, Daniel A. Adler, Srijan Sen, Tanzeem Choudhury, Asif Salekin

发表机构 * Ira A. Fulton Schools of Engineering, Arizona State University(亚利桑那州立大学弗里曼工程学院) Arizona State University(亚利桑那州立大学) Cornell University(康奈尔大学) University of Michigan(密歇根大学)

AI总结 针对普适健康感知中缺乏人口统计或异构属性时的公平性问题,提出基于Fisher信息引导的潜在子群学习与无害正则化框架Flare,通过优化几何实现伦理公平。

详情
AI中文摘要

在普适和移动健康系统中,计算模型从可穿戴、行为和生理传感数据推断人类状态。在这些场景中,仅高准确率是不够的;模型必须在不同人群、环境和设备间合乎伦理且公平地运行。然而,依赖训练时的人口统计或异构属性的公平方法难以实施,因为这些属性通常不可用、隐私敏感、受监管或不宜收集。传统的基于均等的公平也可能通过牺牲子群性能而违反伦理原则。为应对这一挑战,我们提出了Flare(Fisher引导的潜在子群学习与无害正则化),这是一个不依赖人口统计和异构属性的框架,将以人为本的公平性与普适和移动传感的伦理原则对齐。Flare利用优化几何,特别是Fisher信息,来正则化曲率并揭示模型行为中的潜在差异,而无需人口统计或异构属性。通过整合表示、损失和曲率信号,它识别隐藏的性能分层,并通过协作但无害的优化对其进行改进,在提升子群性能的同时保持伦理平衡。我们还引入了BHE(善行-避害-公平),一个超越统计均等的伦理公平度量套件。在移动生理、行为和临床传感数据集(包括EDA、OhioT1DM、IHS和Percept-R)上,Flare在伦理公平性上优于最先进的基线。消融、可解释性和损失景观分析表明,这些提升源于更平坦的优化几何、更简单的决策规则和无害的潜在子群适应。运行时分析支持Flare在资源受限的传感部署中的实用性。

英文摘要

In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data. In these settings, high accuracy alone is insufficient; models must act ethically and equitably across diverse people, contexts, and devices. However, fairness methods that rely on demographic or heterogeneous attributes during training are difficult to enforce because such attributes are often unavailable, privacy-sensitive, regulated, or undesirable to collect. Conventional parity-based fairness can also violate ethical principles by trading off subgroup performance. To address this challenge, we present Flare, Fisher-guided LAtent-subgroup learning with do-no-harm REgularization, a demographic- and heterogeneous-attribute-agnostic framework that aligns human-centered fairness with ethical principles for ubiquitous and mobile sensing. Flare leverages optimization geometry, particularly Fisher Information, to regularize curvature and uncover latent disparities in model behavior without demographic or heterogeneous attributes. By integrating representation, loss, and curvature signals, it identifies hidden performance strata and refines them through collaborative but do-no-harm optimization, enhancing subgroup performance while preserving ethical balance. We also introduce BHE (Beneficence-Harm Avoidance-Equity), a metric suite that operationalizes ethical fairness beyond statistical parity. Across mobile physiological, behavioral, and clinical sensing datasets, including EDA, OhioT1DM, IHS, and Percept-R, Flare improves ethical fairness over state-of-the-art baselines. Ablation, interpretability, and loss-landscape analyses show that these gains arise from flatter optimization geometry, simpler decision rules, and do-no-harm latent-subgroup adaptation. Runtime analysis supports the practicality of Flare for resource-constrained sensing deployments.

2509.12263 2026-06-02 cs.AI cs.LG 版本更新

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

InPhyRe 发现:大型多模态模型在归纳物理推理中表现不佳

Gautam Sreekumar, Vishnu Naresh Boddeti

发表机构 * Department of Computer Science and Engineering, Michigan State University(密歇根州立大学计算机科学与工程系)

AI总结 提出 InPhyRe 基准测试,通过合成视频中的碰撞事件预测任务,评估大型多模态模型在未见物理定律下的归纳物理推理能力,发现其依赖有限参数知识、受语言偏差影响且忽略视觉输入。

Comments Accepted to TMLR. 53 pages including appendix

详情
AI中文摘要

大型多模态模型(LMMs)将训练中观察到的物理定律(如动量守恒)编码为参数化知识。这使得 LMMs 能够回答物理推理查询,例如从视觉输入中预测潜在碰撞事件的结果。然而,由于参数化知识仅包含训练中见过的物理定律,它不足以推理遵循训练中未见物理定律的推理场景。在这种新颖的物理环境中,人类可以根据提供的演示调整其物理推理。这种归纳物理推理能力对于 LMMs 在安全关键应用中替代人类代理是必不可少的。尽管其重要性,现有的视觉基准并未评估归纳物理推理,仅考虑 LMMs 中的参数化知识。为此,我们提出了 InPhyRe,这是第一个用于衡量 LMMs 归纳物理推理的视觉问答基准。InPhyRe 评估 LMMs 预测算法生成的合成视频中碰撞事件结果的能力。通过检查超过 13 个开源和专有 LMMs,InPhyRe 告诉我们:(1)LMMs 难以将其关于普遍物理定律的有限参数化知识应用于推理;(2)当推理场景背后的物理定律在训练中未见时,LMMs 的归纳物理推理能力较弱;(3)LMMs 的归纳物理推理受到语言偏差的影响,可能忽略视觉输入,质疑了 LMMs 在视觉输入方面的可信度。

英文摘要

Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during training. In such novel physical environments, humans could adapt their physical reasoning based on provided demonstrations. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks do not evaluate inductive physical reasoning and only consider the parametric knowledge in LMMs. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs' ability to predict the outcome of collision events in algorithmically generated synthetic videos. By inspecting over 13 open-source and proprietary LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when the physical laws underlying inference scenarios were unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and may ignore the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.

2603.16572 2026-06-02 cs.CR cs.AI 版本更新

Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

上下文很重要:基于仓库感知的代理技能生态系统安全分析

Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, Johanna Ullrich

发表机构 * Interdisciplinary Transformation University (IT:U)(交叉学科转化大学) University of Vienna(维也纳大学) CDL AsTra Faculty of Computer Science(计算机科学学院CDL AsTra系)

AI总结 通过仓库上下文感知分析,发现现有扫描器高估了恶意技能比例(从46.8%降至0.52%),并识别出废弃仓库劫持等新攻击向量。

Comments AgentSkills '26 Workshop: ACM Conference on AI and Agentic Systems (CAIS), Best Paper Award

详情
AI中文摘要

代理技能扩展了本地AI代理(如Claude Code和OpenClaw)的额外功能。其日益流行催生了类似移动应用商店的专用市场,以及评估技能是良性还是恶意的自动扫描器。然而,来自单个市场的扫描器报告将高达46.8%的技能归类为恶意,引发了对误报的担忧。我们提出了迄今为止对AI代理技能生态系统最大规模的实证安全分析。我们从三个主要分发平台和GitHub收集了238,180个独特技能,并分析了它们的内容、行为和仓库上下文。与现有主要孤立评估技能的扫描器不同,我们的仓库感知分析检查被标记的技能是否与其周围的GitHub项目一致。这种上下文显著减少了可疑技能的数量:经过仓库感知分析后,仅0.52%仍保持可疑。我们的结果表明,当忽略仓库上下文时,现有扫描器可能大幅高估恶意性。同时,我们识别出先前未记录的真实世界攻击向量,包括劫持托管在废弃GitHub仓库中的技能。总体而言,我们的发现提供了对代理技能生态系统当前风险面更稳健的视图,并强调了上下文感知安全评估的必要性。

英文摘要

Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality. Their growing popularity has led to dedicated marketplaces resembling mobile app stores, as well as automated scanners that assess whether skills are benign or malicious. However, scanner reports from individual marketplaces classify up to 46.8% of skills as malicious, raising concerns about false positives. We present the largest empirical security analysis of the AI agent skill ecosystem to date. We collect 238,180 unique skills from three major distribution platforms and GitHub, and analyze their contents, behavior, and repository context. Unlike existing scanner-based assessments, which evaluate skills largely in isolation, our repository-aware analysis checks whether a flagged skill is consistent with its surrounding GitHub project. This context substantially reduces the number of suspicious skills: only 0.52% remain suspicious after repository-aware analysis. Our results show that existing scanners can substantially overestimate maliciousness when repository context is ignored. At the same time, we identify previously undocumented real-world attack vectors, including the hijacking of skills hosted in abandoned GitHub repositories. Overall, our findings provide a more robust view of the agent-skill ecosystem's current risk surface and highlight the need for context-aware security evaluation.

2603.14771 2026-06-02 cs.AI 版本更新

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

OpenHospital: 一个用于进化和基准测试基于LLM的集体智能的物自体竞技场

Peigen Liu, Rui Ding, Yuren Mao, Ziyan Jiang, Yuxiang Ye, Yunjun Gao, Ying Zhang, Renjie Sun, Longbin Lai, Zhengping Qian

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院) Laboratory for Statistical Monitoring and Intelligent Governance of Common Prosperity, Zhejiang Gongshang University(浙江工商大学共同富裕统计监测与智能治理实验室) Tongyi Lab, Alibaba Group(阿里集团通义实验室) Alibaba Cloud(阿里云)

AI总结 提出OpenHospital交互式竞技场,通过医生代理与患者代理的互动进化集体智能,采用数据在代理自身范式快速提升能力并提供医学熟练度和系统效率的基准测试。

详情
AI中文摘要

基于大型语言模型的集体智能为克服数据墙并持续提升LLM代理能力提供了一种有前景的方法。然而,目前缺乏专门用于进化和基准测试基于LLM的集体智能的竞技场。为解决这一空白,我们引入了OpenHospital,一个交互式竞技场,其中医生代理可以通过与患者代理的互动进化集体智能。该竞技场采用数据在代理自身范式,快速增强代理能力,并为基准测试医学熟练度和系统效率提供稳健的评估指标。实验证明了OpenHospital在促进和量化集体智能方面的有效性。

英文摘要

Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents. However, there is currently no dedicated arena for evolving and benchmarking LLM-based CI. To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents. This arena employs a data-in-agent-self paradigm that rapidly enhances agent capabilities and provides robust evaluation metrics for benchmarking both medical proficiency and system efficiency. Experiments demonstrate the effectiveness of OpenHospital in both fostering and quantifying CI.

2603.14465 2026-06-02 cs.AI 版本更新

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

AgentProcessBench:诊断工具使用代理的步骤级过程质量

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin

发表机构 * Renmin University of China(中国人民大学) Beijing Jiaotong University(北京交通大学) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学)

AI总结 提出AgentProcessBench基准,通过三元标注方案和错误传播规则评估工具增强轨迹中的步骤级有效性,揭示当前模型在区分中立与错误动作方面的挑战,并证明过程信号对结果监督的补充价值。

详情
AI中文摘要

尽管大型语言模型(LLMs)已发展为工具使用代理,但在长期交互中仍然脆弱。与数学推理中错误通常可通过回溯纠正不同,工具使用失败常引发不可逆的副作用,因此准确的步骤级验证至关重要。然而,现有的过程级基准主要局限于封闭世界的数学领域,未能捕捉工具执行的动态和开放性质。为弥补这一差距,我们引入AgentProcessBench,这是首个致力于评估现实工具增强轨迹中步骤级有效性的基准。该基准包含1,000条多样化轨迹和8,509个人工标注的步骤注释,标注者间一致性达89.1%。它采用三元标注方案以捕捉探索行为,并引入错误传播规则以减少标注歧义。大量实验揭示了关键见解:(1)较弱的策略模型因提前终止而表现出膨胀的正确步骤比例;(2)区分中立动作和错误动作对当前模型仍是一个重大挑战;(3)过程信号为结果监督提供补充价值,显著增强测试时扩展。我们希望AgentProcessBench能促进奖励模型的未来研究,并为通用代理铺平道路。代码和数据可在https://github.com/RUCBM/AgentProcessBench获取。

英文摘要

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.

2603.14405 2026-06-02 cs.LG cs.AI 版本更新

ES-Merging: Biological MLLM Merging via Embedding Space Signals

ES-Merging: 通过嵌入空间信号进行生物多模态大模型合并

Wonbin Lee, Dongki Kim, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) DeepAuto.ai

AI总结 提出ES-Merging框架,利用嵌入空间信号估计合并系数,实现生物多模态大模型的高效合并,提升跨模态推理和单模态知识保留能力。

详情
AI中文摘要

生物多模态大语言模型(MLLMs)已成为科学发现的基础模型。然而,现有模型专注于单一模态,限制了其解决跨模态科学问题的能力。虽然模型合并是将不同模态组合成统一MLLM的有效方法,但现有方法依赖于与输入无关的参数空间启发式,无法准确捕捉模态特异性。为克服这一局限,我们提出基于嵌入信号的MLLM合并(ES-Merging),该框架从嵌入空间信号估计合并系数,将合并范式从参数信号转向嵌入信号。ES-Merging利用嵌入空间中的粗粒度和细粒度信号分别估计层间和元素级合并系数,并联合实现互补系数估计。通过大量实验,我们证明ES-Merging不仅在跨模态推理上,而且在单模态知识保留上均优于现有合并方法,表明嵌入空间信号为MLLM合并提供了有原则且有效的基础。

英文摘要

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose the Embedding-Signal-based MLLM Merging (ES-Merging), a framework that estimates merging coefficients from embedding space signals, moving the merging paradigm from the parameter signals to the embedding signals. ES-Merging exploits coarse-grained and fine-grained signals from embedding space to estimate the layer-wise and element-wise merging coefficients, respectively, which are jointly combined for complementary coefficient estimation. Through extensive experiments, we demonstrate that ES-Merging outperforms existing merging methods not only on the cross-modal reasoning but also on the single-modal knowledge preservation, establishing that embedding space signals provide a principled and effective foundation for MLLM merging.

2603.00171 2026-06-02 cs.CV cs.AI 版本更新

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

LookWise: 知道何时何地关注多模态大语言模型中的细粒度视觉推理

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang

发表机构 * Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences(智能机器研究所,合肥物理科学研究院,中国科学院) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) East China Normal University(华东师范大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出LookWise框架,通过置信度模块和语义引导定位模块实现自适应视觉推理,无需额外训练即可提升细粒度推理精度并加速推理。

详情
AI中文摘要

多模态大语言模型正转向通过主动探索图像细节进行“图像思考”。虽然有效,但大规模训练计算成本高昂,这激发了对轻量级、无需训练解决方案的兴趣。然而,现有无需训练方法存在两个缺陷:无差别裁剪导致的感知冗余,增加了计算成本并引入噪声;以及语义意图与空间注意力之间的漂移,阻碍了用户关注区域的准确定位。为应对这些挑战,我们提出LookWise,一个自适应视觉推理框架。LookWise遵循两阶段流程:基于置信度的模块决定何时更仔细地观察,语义引导的定位模块确定观察位置。该设计使MLLM能够自适应获取细粒度视觉证据而无需额外训练。在细粒度和高分辨率视觉推理基准上的实验表明,LookWise在强基线上持续提升准确率,同时相较于基于搜索的方法ZoomEye实现约$4.0 imes$的推理加速,展现出稳健的跨模型泛化能力。

英文摘要

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose LookWise, a framework for adaptive visual reasoning. LookWise follows a two-stage pipeline: a confidence-based module decides when to look more carefully, and a semantic-guided localization module determines where to look. This design enables MLLMs to adaptively acquire fine-grained visual evidence without additional training. Experiments on fine-grained and high-resolution visual reasoning benchmarks show that LookWise consistently improves accuracy over strong baselines while achieving an approximately $4.0\times$ inference speedup over the search-based method ZoomEye, demonstrating robust cross-model generalization.

2603.12109 2026-06-02 cs.AI 版本更新

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

强化学习中信息自锁现象及其在LLM智能体主动推理中的应用

Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对LLM智能体在主动推理中因强化学习导致的信息自锁问题,提出基于优势重加权的方法AREW,通过方向性批评重新分配轨迹信用,显著提升智能体性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

强化学习已成为构建基于LLM的智能体的事实标准范式,这些智能体能够在扩展的任务范围内行动、交互和推理。然而,在主动推理中,智能体必须通过与环境的交互来获取新观察以解决问题,我们发现基于结果的强化学习会诱发一种系统性的失败模式,我们称之为信息自锁(SeL):智能体既无法获取信息性反馈,也无法内化已获得的证据。为了理解这个问题,我们将智能体行为追踪为两种耦合的能力:动作选择(AS),它决定观察流;信念追踪(BT),它更新智能体内部的任务理解。理论和实证分析揭示了一个导致SeL的双向瓶颈:弱的BT模糊了信息性动作的信用,而弱的AS剥夺了BT有用的证据。这种耦合削弱了两种能力的学习信号,导致SeL。为了缓解这个问题,我们提出了AREW,一种简单而有效的优势重加权方法,它使用易于获取的方向性批评来重新分配轨迹内的信用。在9个不同复杂度的智能体任务上的大量实验表明,AREW显著缓解了SeL,最终性能提升高达60个点。代码可在https://github.com/unimpor/T3获取。

英文摘要

Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task horizons. However, in active reasoning where agents must elicit new observations through interaction with the environment to solve the task, we find that outcome-based RL can induce a systematic failure mode which we call information self-locking (SeL): agents fail both to elicit informative feedback and to internalize obtained evidence. To understand the issue, we trace agentic behaviors into two coupled capabilities: Action Selection (AS), which determines observation streams, and Belief Tracking (BT), which updates the agent's internal task understanding. Theoretical and empirical analyses reveal a bidirectional bottleneck that leads to SeL: weak BT obscures the credit of informative actions, while weak AS deprives BT of useful evidence. This coupling weakens the learning signal for both capabilities and leads to SeL. To mitigate this issue, we propose AREW, a simple yet effective Advantage Reweighting method that uses easy-to-obtain directional critiques to reallocate credit within trajectories. Extensive experiments across 9 agentic tasks of varying complexity show that AREW significantly mitigates SeL, yielding up to 60-point gains in final performance. Code is available at https://github.com/unimpor/T3.

2603.11946 2026-06-02 cs.LG cs.AI 版本更新

Geometry-Aware Probabilistic Circuits via Voronoi Tessellations

基于Voronoi剖分的几何感知概率电路

Sahil Sidheekh, Sriraam Natarajan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对概率电路因数据无关混合权重而无法捕捉数据流形局部几何结构的问题,提出通过Voronoi剖分将几何结构直接融入求和节点,并开发近似推理框架和精确推理条件,最后引入可微松弛实现梯度学习,在密度估计任务上验证了有效性。

详情
AI中文摘要

概率电路(PC)支持精确且易于处理的推理,但采用数据无关的混合权重,限制了其捕捉数据流形局部几何结构的能力。我们提出将Voronoi剖分(VT)作为将几何结构直接融入PC求和节点的自然方式。然而,直接引入这种结构会破坏可处理性。我们形式化了这种不兼容性,并开发了两种互补的解决方案:(1)一个近似推理框架,为推理提供保证的下界和上界;(2)VT的一个结构条件,在该条件下恢复精确的可处理推理。最后,我们引入了VT的可微松弛,使得基于梯度的学习成为可能,并在标准密度估计任务上实证验证了所提方法。

英文摘要

Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to capture local geometry of the data manifold. We propose Voronoi tessellations (VT) as a natural way to incorporate geometric structure directly into the sum nodes of a PC. However, naïvely introducing such structure breaks tractability. We formalize this incompatibility and develop two complementary solutions: (1) an approximate inference framework that provides guaranteed lower and upper bounds for inference, and (2) a structural condition for VT under which exact tractable inference is recovered. Finally, we introduce a differentiable relaxation for VT that enables gradient-based learning and empirically validate the resulting approach on standard density estimation tasks.

2603.09692 2026-06-02 cs.LG cs.AI cs.CL 版本更新

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

ActiveUltraFeedback:使用主动学习的高效偏好数据生成

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出ActiveUltraFeedback主动学习流水线,通过不确定性估计和两种新采样方法(DRTS和DeltaUCB)动态选择最具信息量的响应对,以最少六分之一的标注数据实现与静态基线相当或更优的下游性能。

Comments 40 pages, 9 figures, 26 tables

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为对齐大型语言模型(LLMs)的标准方法,但其有效性受到偏好数据获取高成本的瓶颈限制,尤其是在低资源和专家领域。为解决这一问题,我们引入了ACTIVEULTRAFEEDBACK,一个模块化的主动学习流水线,利用不确定性估计动态识别最具信息量的响应进行标注。我们的流水线支持系统评估标准响应选择方法以及两种新方法:DOUBLE REVERSE THOMPSON SAMPLING(DRTS)和DELTAUCB,这两种方法优先选择预测质量差距大的响应对,利用近期研究结果,即此类对为微调提供良好信号。实验表明,ACTIVEULTRAFEEDBACK生成的高质量数据集在下游性能上带来显著提升,尤其以静态基线六分之一的标注数据即可达到相当或更优的结果。我们的流水线可在https://github.com/lasgroup/ActiveUltraFeedback获取,偏好数据集可在https://huggingface.co/ActiveUltraFeedback获取。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.

2509.25773 2026-06-02 cs.CV cs.AI cs.CL 版本更新

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

v-HUB: 从视觉和声音理解视频幽默的基准

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Wuhan University(武汉大学) Beijing Institute for General Artificial Intelligence(北京一般人工智能研究院) Independent Researcher(独立研究者)

AI总结 提出v-HUB基准,通过非语言短视频评估多模态大语言模型在仅凭视觉线索理解幽默的能力,并发现音频信息有助于提升幽默理解。

Comments 24 pages, 9 figures

详情
AI中文摘要

能够理解幽默的AI模型具有现实应用前景——例如,增强人机交互中的参与度。为了评估和诊断多模态大语言模型(MLLMs)理解幽默的能力,我们引入了v-HUB,一个新颖的视频幽默理解基准。v-HUB包含一个精心策划的非语言短视频集合,反映了仅通过视觉线索即可欣赏幽默的现实场景。我们将每个视频片段与丰富的标注配对,以支持各种评估任务和分析,包括一项关于增强幽默的环境声音的新研究。为了扩大其适用性,我们构建了一个开放式问答任务,使v-HUB能够轻松集成到现有的视频理解任务套件中。我们评估了多种MLLMs,从专门的Video-LLMs到能够原生处理音频的多功能OmniLLMs,涵盖了开源和专有领域。实验结果揭示了MLLMs在仅凭视觉线索理解幽默时面临的困难。我们的发现还表明,结合音频有助于视频幽默理解,突显了为复杂视频理解任务整合更丰富模态的前景。

英文摘要

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

2603.08026 2026-06-02 cs.CL cs.AI cs.PF 版本更新

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

DyLLM: 基于显著性标记选择与部分注意力的高效扩散LLM推理

Younjoo Lee, Seungkyun Dan, Junghoo Lee, Jaiyoung Park, Jung Ho Ahn

发表机构 * University of Seoul(首尔大学)

AI总结 针对扩散语言模型迭代去噪计算昂贵的问题,提出DyLLM框架,通过仅计算显著标记的注意力与前馈操作,实现无需训练的加速推理,在保持精度的同时吞吐量提升高达9.6倍。

Comments 21 pages, 10 figures, 7 tables, accepted at ICML 2026

详情
AI中文摘要

掩码扩散语言模型支持并行令牌解码,为自回归生成的顺序性质提供了一种有前景的替代方案。然而,其迭代去噪过程仍然计算昂贵,因为每一步都重复处理整个序列。我们观察到,在这些扩散步骤中,大多数令牌表示保持稳定;只有一小部分(我们称之为显著令牌)对下一次更新有实质性贡献。利用这种时间稀疏性,我们提出了DyLLM,一种无需训练的推理框架,通过仅选择性地计算这些显著令牌来加速解码。DyLLM通过测量相邻去噪步骤之间注意力上下文的余弦相似性来识别显著性。它仅对显著令牌重新计算前馈和注意力操作,同时为其余令牌重用缓存的激活。在多种推理和代码生成基准测试中,DyLLM实现了高达9.6倍的吞吐量提升,同时基本保持了代表性开源扩散LLM(LLaDA和Dream)的基线准确性。

英文摘要

Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of representative open-source diffusion LLMs, LLaDA, and Dream.

2603.07109 2026-06-02 cs.AI 版本更新

Vision Language Models Cannot Reason About Physical Transformation

视觉语言模型无法推理物理变换

Dezhi Luo, Yijiang Li, Maijunxian Wang, Tianwei Zhao, Bingyang Wang, Siheng Wang, Pinyuan Feng, Pooyan Rahmanzadehgervi, Ziqiao Ma, Hokin Deng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过构建ConservationBench基准,评估112个视觉语言模型在物理守恒任务上的表现,发现模型在动态场景中无法维持物理属性的变换不变性。

Comments Accepted by ICML 2026

详情
AI中文摘要

理解物理变换是动态环境中推理的基础。虽然视觉语言模型(VLM)在具身应用中展现出潜力,但它们是否真正理解物理变换仍不清楚。我们引入了ConservationBench,用于评估守恒性——即物理量在变换下是否保持不变。该基准涵盖四种属性,包含成对的守恒/非守恒场景,我们生成并评估了112个VLM上的23,040个问题。结果揭示了系统性失败:性能接近随机水平,守恒任务上的改进伴随着控制任务上的下降。控制实验显示,模型存在强烈的文本先验偏向于不变性,但在守恒和非守恒场景中性能平衡时,模型对实际视觉内容的表现更差。时间分辨率、提示或精心采样的方法均无帮助。这些发现表明,当前VLM无法在动态场景中维持物理属性的变换不变性表示。

英文摘要

Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench evaluating conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate and evaluate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with actual visual content when performance is balanced across conserving and non-conserving scenarios. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.

2603.06741 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Heterogeneous Decentralized Diffusion Models

异构去中心化扩散模型

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

发表机构 * bagel.com(Bagel公司)

AI总结 提出一种异构去中心化训练框架,通过支持不同专家使用不同目标(DDPM和Flow Matching)并统一推理、预训练检查点转换以及高效架构,大幅降低计算和数据需求,使单GPU(24-48GB VRAM)即可参与训练。

Comments Accepted to CVPR2026

详情
AI中文摘要

训练前沿规模的扩散模型通常需要大量计算资源集中在紧密耦合的集群中,限制了只有资源充足的机构才能参与。虽然去中心化扩散模型(DDM)能够独立训练多个专家,但现有方法需要1176 GPU天,且所有专家使用同质化训练目标。我们提出了一个高效框架,大幅降低资源需求,同时支持异构训练目标。我们的方法结合了三个关键贡献:(1)一种异构去中心化训练范式,允许专家使用不同的目标(DDPM和Flow Matching),在推理时无需任何重新训练即可统一;(2)从ImageNet-DDPM到Flow Matching目标的预训练检查点转换,加速收敛并无需针对特定目标的预训练即可初始化;(3)PixArt-$α$的高效AdaLN-Single架构,在保持质量的同时减少参数。在LAION-Aesthetics上的实验表明,相对于先前DDM工作报告的训练规模,我们的方法将计算量减少了16倍,数据量减少了14倍。在对齐的推理设置下,我们的异构配置比同质基线获得了更好的FID和更高的提示内多样性。通过消除同步需求并支持混合DDPM/FM目标,我们的框架使贡献者只需单GPU(24-48GB VRAM)即可进行去中心化生成模型训练。

英文摘要

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time without any retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-$α$'s efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces the compute by 16$\times$ and data by 14$\times$. Under aligned inference settings, our heterogeneous configuration achieves better FID and higher intra-prompt diversity than the homogeneous baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework makes decentralized generative model training accessible to contributors with single GPUs requiring only 24--48GB VRAM.

2602.06841 2026-06-02 cs.AI 版本更新

From Features to Actions: Explainability in Traditional and Agentic AI Systems

从特征到行动:传统与智能体AI系统中的可解释性

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Ahmed Y. Radwan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza

发表机构 * Vector Institute for Artificial Intelligence(向量人工智能研究所) Independent Researcher(独立研究者) Mayo Clinic(梅奥诊所)

AI总结 本文比较了基于归因的解释与基于轨迹的诊断在静态和智能体设置中的效果,发现归因方法无法可靠诊断智能体轨迹中的执行级故障,而轨迹级可解释性更能定位行为故障。

详情
AI中文摘要

在过去十年中,可解释AI主要关注解释单个模型预测,在固定决策结构下生成将输入与输出关联的事后解释。大型语言模型的最新进展使得智能体AI系统能够在多步轨迹中展开行为。在这些设置中,成功与失败由决策序列而非单个输出决定。目前尚不清楚为静态预测设计的解释方法如何应用于行为随时间涌现的智能体设置。在这项工作中,我们通过比较两种设置中基于归因的解释与基于轨迹的诊断来弥合这一差距。我们的结果表明,虽然归因方法在静态设置中实现了稳定的特征排名(Spearman ρ = 0.86),但它们无法可靠地诊断智能体轨迹中的执行级故障。相比之下,针对智能体设置的轨迹接地评分标准能够一致地定位行为故障,并揭示状态跟踪不一致在失败运行中的普遍性高出2.7倍,并将成功概率降低49%。这些发现促使我们转向轨迹级可解释性,以评估和诊断智能体系统中自主AI行为。代码:https://github.com/VectorInstitute/unified-xai-evaluation-framework 项目页面:https://vectorinstitute.github.io/unified-xai-evaluation-framework

英文摘要

Over the last decade, Explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. It remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge this gap by comparing attribution-based explanations with trace-based diagnostics across both settings. Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman \r{ho} = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7x more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for evaluating and diagnosing autonomous AI behaviour in agentic systems. Code: https://github.com/VectorInstitute/unified-xai-evaluation-framework Project page: https://vectorinstitute.github.io/unified-xai-evaluation-framework

2512.16310 2026-06-02 cs.CR cs.AI cs.CL 版本更新

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Agent工具编排泄露更多:数据集、基准测试与缓解措施

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院)

AI总结 研究LLM代理在编排多个工具时泄露敏感结论的风险(TOP-R),构建了包含1000个实例的基准TOP-Bench,并提出TOP-Align后训练方法以缓解泄露。

Comments 17 pages, 2 figures. Dataset and code are available at https://github.com/1Ponder/TOP-R

详情
AI中文摘要

基于LLM的代理越来越多地使用多个外部工具来完成复杂任务。我们研究了工具编排隐私风险(TOP-R):代理可能组合单个非敏感的工具返回结果,并披露一个非预期的敏感结论。我们通过三个条件形式化TOP-R:结论敏感性、单源不可推断性和组合可推断性。我们引入了LRSE(基于库的反向推理种子扩展),这是一个基于隐私规范、推理链、工具模式和任务场景的四库反向构建流水线,并使用它构建了TOP-Bench,一个包含1000个实例的基准测试。该基准测试在受控的两阶段工具使用协议下评估最终响应的语义泄露。在六个LLM代理中,任务完成率保持较高,但平均泄露率达到88.6%,导致H分数仅为20.4。两种仅提示的防护措施在主基准测试上将H分数提高了约2.7分。我们进一步提出了TOP-Align,一种SFT+DPO后训练方法,用于更安全的任务完成边界。在单独的后训练评估划分上,TOP-Align将H分数比相应基础模型提高了16.2分,而同一划分上仅提示缓解措施的平均增益为4.9分。这些结果表明TOP-R需要超越仅提示的缓解措施。

英文摘要

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability. We introduce LRSE (Library-Grounded Reverse-Inference Seed Expansion), a four-library reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios, and use it to build TOP-Bench, a 1,000-instance benchmark. The benchmark evaluates final-response semantic disclosure under a controlled two-stage tool-use protocol. Across six LLM agents, task completion remains high, but the average leakage rate reaches 88.6 percent, yielding an H-score of only 20.4. Two prompt-only safeguards improve H-score by about 2.7 points on the main benchmark. We further propose TOP-Align, an SFT+DPO post-training method for safer task completion boundaries. On a separate post-training evaluation split, TOP-Align improves H-score by 16.2 points over the corresponding base model, compared with a 4.9-point average gain from prompt-only mitigation on the same split. These results show that TOP-R requires mitigation beyond prompting alone.

2602.23694 2026-06-02 cs.RO cs.AI 版本更新

Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion

基于对数似然比融合的可解释多模态手势识别用于无人机和移动机器人遥操作

Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh

发表机构 * Department of Artificial Intelligence, Korea University(人工智能系,韩国大学) Department of Computer Science, RPTU Kaiserslautern-Landau(计算机科学系,RPTU凯撒斯劳滕-兰道) Embedded Intelligence, German Research Center for Artificial Intelligence (DFKI)(嵌入式智能,德国人工智能研究中心(DFKI))

AI总结 提出一种融合腕戴式Apple Watch惯性数据和定制手套电容传感信号的多模态手势识别框架,利用对数似然比后期融合策略提升性能并提供可解释性,在降低计算成本的同时达到与视觉基线相当的识别效果。

详情
AI中文摘要

人类操作员仍经常暴露在危险环境中,如灾区及工业设施,在这些场景中,移动机器人和无人飞行器(UAV)的直观可靠遥操作至关重要。在此背景下,免手持遥操作增强了操作员的移动性和态势感知能力,从而提高了危险环境中的安全性。尽管基于视觉的手势识别已被探索作为免手持遥操作的一种方法,但其性能在遮挡、光照变化和杂乱背景下常会下降,限制了其在真实操作中的适用性。为克服这些限制,我们提出一种多模态手势识别框架,该框架融合来自双手腕上Apple Watch的惯性数据(加速度计、陀螺仪和方向)与来自定制手套的电容传感信号。我们设计了一种基于对数似然比(LLR)的后期融合策略,该策略不仅提升了识别性能,还通过量化模态特定贡献提供了可解释性。为支持本研究,我们引入了一个包含20种受飞机引导信号启发的手势的新数据集,包含同步的RGB视频、IMU和电容传感器数据。实验结果表明,我们的框架在显著降低计算成本、模型大小和训练时间的同时,达到了与最先进的视觉基线相当的性能,使其非常适合实时机器人控制。因此,我们强调了基于传感器的多模态融合作为手势驱动的移动机器人和无人机遥操作的鲁棒且可解释解决方案的潜力。

英文摘要

Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.

2602.22101 2026-06-02 cs.LG cs.AI 版本更新

On Imbalanced Regression with Hoeffding Trees

关于使用Hoeffding树的不平衡回归

Pantia-Marina Alchirch, Dimitrios I. Diochnos

发表机构 * University of Oklahoma(俄克拉荷马大学)

AI总结 针对不平衡回归中的数据流问题,将核密度估计扩展到流式设置并集成层次收缩到增量决策树中,实验表明KDE能持续提升早期流性能。

Comments 17 pages, 5 figures, 3 tables, 2 algorithms, authors' version of paper accepted in PAKDD 2026 special session on Data Science: Foundations and Applications (DSFA)

详情
AI中文摘要

许多现实应用会生成用于回归的连续数据流。Hoeffding树及其变体因其有效性而具有悠久的传统,无论是单独使用还是作为更广泛集成中的基础模型。最近的批量学习工作表明,核密度估计(KDE)改善了不平衡回归中的平滑预测[Yang等人,2021],而层次收缩(HS)为决策树提供了事后正则化,无需修改其结构[Agarwal等人,2022]。我们通过伸缩公式将KDE扩展到流式设置,并将HS集成到增量决策树中。在标准在线回归基准上的实证评估表明,KDE持续改善了早期流性能,而HS提供的增益有限。我们的实现公开于:https://github.com/marinaAlchirch/DSFA_2026。

英文摘要

Many real-world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long-standing tradition due to their effectiveness, either alone or as base models in broader ensembles. Recent batch-learning work shows that kernel density estimation (KDE) improves smoothed predictions in imbalanced regression [Yang et al., 2021], while hierarchical shrinkage (HS) provides post-hoc regularization for decision trees without modifying their structure [Agarwal et al., 2022]. We extend KDE to streaming settings via a telescoping formulation and integrate HS into incremental decision trees. Empirical evaluation on standard online regression benchmarks shows that KDE consistently improves early-stream performance, whereas HS provides limited gains. Our implementation is publicly available at: https://github.com/marinaAlchirch/DSFA_2026.

2503.11832 2026-06-02 cs.AI cs.LG 版本更新

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

安全幻象:虚假相关性如何破坏VLM安全微调及通过机器遗忘缓解

Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

发表机构 * Michigan State University(密歇根州立大学) National University of Singapore(新加坡国立大学) Cisco Research(思科研究)

AI总结 本文发现视觉语言模型(VLM)的安全微调存在“安全幻象”,即虚假相关性导致脆弱性,并提出机器遗忘作为替代方案,显著降低攻击成功率和不必要拒绝。

Comments Accepted to ICLR 2026

详情
AI中文摘要

最近的视觉语言模型(VLM)在多模态输入(特别是文本和图像)的生成建模方面取得了显著进展。然而,当暴露于不安全查询时,它们生成有害内容的倾向引发了关键的安全问题。虽然当前的对齐策略主要依赖于使用精心策划的数据集进行监督安全微调,但我们发现了一个基本限制,称为“安全幻象”,其中监督微调无意中强化了表面文本模式与安全响应之间的虚假相关性,而不是促进深层的、内在的危害缓解。我们表明,这些虚假相关性使微调后的VLM即使面对基于单词修改的简单攻击也易受攻击,其中将文本查询中的单个单词替换为诱导虚假相关性的替代词即可有效绕过安全防护。此外,这些相关性导致过度谨慎,使微调后的VLM不必要地拒绝良性查询。为了解决这些问题,我们展示了机器遗忘(MU)作为监督安全微调的有力替代方案,因为它避免了有偏的特征-标签映射,并直接从VLM中移除有害知识,同时保留其通用能力。在安全基准上的广泛评估表明,基于MU的对齐将攻击成功率降低了高达60.27%,并将不必要的拒绝减少了超过84.20%。警告:存在可能具有攻击性的AI生成内容。

英文摘要

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ''safety mirage'', where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

2603.03741 2026-06-02 cs.RO cs.AI 版本更新

HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

HALO:通过异质智能体李雅普诺夫策略优化学习人机协作

Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对人机协作中人类行为多样性和环境变化导致的泛化与鲁棒性问题,提出异质智能体李雅普诺夫策略优化(HALO)框架,通过李雅普诺夫收缩稳定去中心化多智能体强化学习,并利用最优二次投影修正梯度,实现理性差距的单调收缩,提升协作性能。

Comments https://HaoZhang-THU.github.io/HALO/

详情
AI中文摘要

为了提高人机协作(HRC)的泛化性和韧性,机器人必须应对人类行为和情境的多种组合,这推动了多智能体强化学习(MARL)的应用。然而,机器人与人类之间的固有异质性造成了理性差距(RG),使得去中心化的策略更新偏离了合作联合优化。由此产生的学习问题是一个一般和可微博弈,因此独立的策略梯度更新在没有额外结构的情况下可能会振荡或发散。我们提出了异质智能体李雅普诺夫策略优化(HALO),这是一个通过强制策略参数空间中的李雅普诺夫收缩来稳定去中心化MARL的框架。与针对约束马尔可夫决策过程中状态/轨迹约束的基于李雅普诺夫的安全RL不同,HALO使用李雅普诺夫认证来稳定去中心化策略学习。HALO通过最优二次投影修正去中心化梯度,确保RG的单调收缩,并实现对开放式交互空间的有效探索。大量的仿真和真实人形机器人实验表明,这种认证的稳定性提高了协作边缘情况下的泛化性和鲁棒性。我们的项目网站位于https://HaoZhang-THU.github.io/HALO/。

英文摘要

To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG), where decentralized policy updates deviate from cooperative joint optimization. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALO), a framework that stabilizes decentralized MARL by enforcing Lyapunov-based contraction in policy-parameter space. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALO uses Lyapunov certification to stabilize decentralized policy learning. HALO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases. Our project website is available at https://HaoZhang-THU.github.io/HALO/.

2603.03291 2026-06-02 cs.CL cs.AI 版本更新

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

一个接一个的偏差:语言奖励模型中的机械奖励塑造与持续偏差

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

发表机构 * Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过系统测量五个高质量奖励模型中的偏差,发现长度、谄媚、过度自信等持续问题,并提出一种简单的后处理干预方法(机械奖励塑造)来减轻低复杂度偏差。

Comments ICML 2026 Camera-ready

详情
AI中文摘要

奖励模型(RMs)对于语言模型(LMs)与人类偏好的在线对齐至关重要。然而,基于RM的偏好调优容易受到奖励破解的影响,即LM策略从有缺陷的RM中学习不良行为。通过系统测量五个高质量RM(包括最先进的模型)中的偏差,我们发现尽管已有相关工作,但在长度、谄媚和过度自信方面的问题仍然存在。我们还发现了与模型特定“风格”和答案顺序相关的新偏差。我们将RM失败分类为可处理或对线性干预具有抵抗性,并提出一种简单的后处理干预措施,以减轻由虚假相关性引起的低复杂度偏差。我们提出的机械奖励塑造在不降低奖励质量且使用最少标注数据的情况下减少了目标偏差。该方法可扩展到新偏差、模型内部,并具有分布外泛化能力。

英文摘要

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific ``styles'' and answer-order. We categorize RM failures as tractable or resistant to linear intervention and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes out-of-distribution.

2510.07650 2026-06-02 cs.LG cs.AI 版本更新

Value Flows

Value Flows

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach

发表机构 * Stanford University(斯坦福大学) Princeton University(普林斯顿大学)

AI总结 本文利用基于流的生成模型估计完整未来回报分布,通过新的流匹配目标满足分布贝尔曼方程,并利用流导数ODE估计回报不确定性以优先学习,在离线与在线设置中平均成功率提升1.3倍。

Comments ICLR 2026

详情
AI中文摘要

虽然当今大多数强化学习方法将未来回报的分布压缩为单个标量值,但分布RL方法利用回报分布提供更强的学习信号,并支持探索和安全强化学习中的应用。虽然估计回报分布的主要方法是将其建模为离散区间上的分类分布或估计有限数量的分位数,但这些方法留下了关于回报分布的细粒度结构以及如何区分高回报不确定性的状态以进行决策的未解问题。本文的关键思想是使用现代、灵活的基于流的模型来估计完整的未来回报分布,并识别那些具有高回报方差的状态。我们通过制定一个新的流匹配目标来实现这一点,该目标生成满足分布贝尔曼方程的概率密度路径。基于学习到的流模型,我们使用一个新的流导数ODE来估计不同状态的回报不确定性。我们还利用这种不确定性信息,优先在某些转换上学习更准确的回报估计。我们将我们的方法(Value Flows)与先前的方法在离线和在线到在线设置中进行了比较。在37个基于状态和25个基于图像的基准任务上的实验表明,Value Flows在成功率上平均提高了1.3倍。网站:https://pd-perry.github.io/value-flows 代码:https://github.com/chongyi-zheng/value-flows

英文摘要

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows

2601.09566 2026-06-02 cs.CV cs.AI 版本更新

Hot-Start Chinese Language Modeling:Visual Glyphs Accelerate Sample-Efficient Learning

热启动中文语言建模:视觉字形加速样本高效学习

Shuyang Xiang, Hao Guan

发表机构 * Independent Researcher(独立研究者) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)

AI总结 本文通过将汉字渲染为视觉字形图像,研究其对字符级语言建模的归纳偏置,发现视觉输入产生显著的热启动效应,但最终精度与基于索引的方法一致。

Comments 15 pages, 5 figures, submitted to ACL 2026

详情
AI中文摘要

在这项工作中,我们研究了将汉字渲染为视觉字形图像(而非主流LLM使用的离散token ID)是否为字符级语言建模提供归纳偏置。我们的核心发现给出了一个双刃剑的见解:视觉输入产生显著的热启动效应,在第一个epoch内(占总训练步骤的0.4%)将早期准确率提高一倍以上(视觉输入12.3% vs. 基于索引的基线5.8%),但两种方法最终收敛到几乎相同的最终准确率(39%)。这一模式在低至8x8像素的分辨率、高达50%的部分裁剪以及从110M到1.78B参数的模型规模下均成立。我们识别的机制是,字形渲染在训练之前就将基于部首的结构预编码到嵌入空间中(余弦相似度0.27 vs. 随机嵌入的0.002),从而能够更快地对齐,但无法提高最终容量。我们的结果阐明了视觉表示作为中文语言建模归纳偏置的前景和根本局限性。

英文摘要

In this work, we study whether rendering Chinese characters as visual glyph images, rather than discrete token IDs as mainstream LLMs do, providing an inductive bias for character-level language modeling. Our central finding gives a double-edged insight: visual inputs produce a pronounced hot-start effect, more than doubling early-stage accuracy within the first epoch (at 0.4% of total training steps) (12.3% visual inputs vs. 5.8% index-based baseline), yet both approaches converge to essentially identical final accuracy (39%). This pattern holds across resolutions as low as 8x8 pixels, partial cropping up to 50%, and model scales from 110M to 1.78B parameters. The mechanism we identify is that glyph rendering pre-encodes radical-based structure into embedding space before any training (cosine similarity 0.27 vs. 0.002 for random embeddings), enabling faster alignment but not higher final capacity. Our results clarify both the promise and fundamental limitation of visual representations as inductive biases for Chinese language modeling.

2603.02650 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

通过自监督动作能量门控改进扩散规划器

Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SAGE方法,利用潜在一致性信号在推理时重新排序轨迹,惩罚动态不一致的计划,从而提升扩散规划器的性能和鲁棒性。

详情
AI中文摘要

扩散规划器是离线强化学习的一种强大方法,但当价值引导选择偏好得分高但局部与环境动态不一致的轨迹时,它们可能会失败,导致执行脆弱。我们提出了自监督动作能量门控(SAGE),一种推理时重排序方法,使用潜在一致性信号惩罚动态不一致的计划。SAGE在离线状态序列上训练联合嵌入预测架构(JEPA)编码器,并训练一个动作条件的潜在预测器用于短时域过渡。在测试时,SAGE为每个采样候选分配一个由其潜在预测误差给出的能量,并将此可行性得分与价值估计相结合以选择动作。SAGE可以集成到现有的扩散规划流程中,这些流程可以通过价值评分采样轨迹和选择动作;它不需要环境回滚,也不需要重新训练策略。在运动、导航和操作基准测试中,SAGE提高了扩散规划器的性能和鲁棒性。

英文摘要

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.

2603.02346 2026-06-02 cond-mat.str-el cs.AI cs.LG 版本更新

Large Electron Model: A Universal Ground State Predictor

大型电子模型:一种通用的基态预测器

Timothy Zaklama, Max Geier, Liang Fu

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Department of Physics(物理系)

AI总结 提出Large Electron Model,一种基于Fermi Sets架构的神经网络模型,通过在整个哈密顿参数流形上生成变分波函数,准确预测二维谐振势中相互作用电子的基态,并泛化到未见耦合强度和粒子数,为材料发现提供了基于变分原理的基座模型方法。

Comments 8+7 pages, 5+6 figures, 1+1 tables

详情
AI中文摘要

我们引入了大型电子模型,这是一个单一的神经网络模型,能够在整个哈密顿参数流形上产生相互作用电子的变分波函数。我们的模型采用了Fermi Sets架构,这是一种多体费米子波函数的通用表示,并进一步以哈密顿参数和粒子数为条件。对于二维谐振势中的相互作用电子,一个训练好的模型能够准确预测基态波函数,同时泛化到未见过的耦合强度和粒子数扇区,产生精确的实空间电荷密度和基态能量,甚至多达50个粒子。我们的结果为基于变分原理的材料发现建立了一个基座模型方法,同时准确处理了密度泛函理论能力之外的强电子关联。

英文摘要

We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the entire Hamiltonian parameter manifold. Our model employs the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, which is further conditioned on Hamiltonian parameter and particle number. For interacting electrons in a two-dimensional harmonic potential, a single trained model accurately predicts the ground state wavefunction while generalizing across unseen coupling strengths and particle-number sectors, producing both accurate real-space charge densities and ground state energies, even up to $50$ particles. Our results establish a foundation model method for material discovery that is grounded in the variational principle, while accurately treating strong electron correlation beyond the capacity of density functional theory.

2603.02237 2026-06-02 cs.LG cs.AI 版本更新

Concept Heterogeneity-aware Representation Steering

概念异质性感知表示引导

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen

发表机构 * arXiv

AI总结 针对大语言模型表示非均匀导致全局引导脆弱的问题,提出基于最优传输的输入依赖引导方法CHaRS,通过高斯混合模型和离散最优传输实现更有效的行为控制。

详情
Journal ref
ICML 2026
AI中文摘要

表示引导提供了一种轻量级机制,通过在推理时干预内部激活来控制大语言模型(LLMs)的行为。现有方法大多依赖于单个全局引导方向,通常通过对比较数据集进行均值差异得到。这种方法隐含假设目标概念在嵌入空间中均匀表示。然而在实践中,LLM表示可能高度非均匀,表现出聚类、上下文相关的结构,这使得全局引导方向变得脆弱。在这项工作中,我们通过最优传输(OT)的视角审视表示引导,注意到标准均值差异引导隐式对应于具有不同一阶矩的两个相同分布之间的OT映射,产生全局平移。为了放宽这一限制性假设,我们从理论上将源和目标表示建模为高斯混合模型,并将引导公式化为语义潜在聚类之间的离散OT问题。从得到的传输计划中,我们通过重心投影推导出显式的、输入依赖的引导映射,产生聚类级别偏移的平滑核加权组合。我们将此方法称为概念异质性感知表示引导(CHaRS)。通过大量实验设置,我们证明CHaRS比全局引导产生更有效的行为控制。

英文摘要

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two identical distributions with differing first moments, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.

2603.00829 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Constitutional Black-Box Monitoring for Scheming in LLM Agents

LLM Agent 中阴谋行为的宪法黑盒监控

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

发表机构 * University of Cambridge(剑桥大学)

AI总结 研究使用基于宪法黑盒的监控器,通过仅观察外部输入和输出检测LLM Agent的阴谋行为,并在合成数据上优化后泛化到更真实环境。

Comments Accepted at ICML 2026. Camera-ready version

详情
AI中文摘要

在自主环境中安全部署大型语言模型(LLM)Agent需要可靠的监督机制。一个核心挑战是检测阴谋行为,即Agent暗中追求不一致的目标。缓解此类风险的一种方法是基于LLM的监控:使用语言模型检查Agent行为中的可疑动作。我们研究宪法黑盒监控器:仅利用外部可观测的输入和输出检测阴谋行为的提示分类器,并在从自然语言行为规范生成的合成数据上优化。我们引入两个生成合成Agent轨迹的流水线:STRIDE(迭代精炼)和Gloom(Agent-环境模拟),各生成1000个样本。通过提示扫描、人工精炼和自动提示优化,我们在这些数据集上优化前沿LLM监控器,并在ControlArena(一套Agent在更现实环境中运行的接地环境)中的7500个保留轨迹上评估性能。结果表明,仅基于合成数据选择的监控器可以泛化到更现实的环境,捕获有意义的阴谋信号。然而,我们发现性能在我们的设置中迅速饱和,简单的提示扫描匹配了更广泛优化的结果。超越这一限制不会带来进一步改进,反而导致过拟合。

英文摘要

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

2511.00206 2026-06-02 cs.AI cs.CL 版本更新

Addressing Longstanding Challenges in Cognitive Science with Language Models

用语言模型应对认知科学中长期存在的挑战

Dirk U. Wulff, Rui Mata

发表机构 * Center for Adaptive Rationality, Max Planck Institute for Human Development(适应性理性中心,马克斯·普朗克人类发展研究所) Faculty of Psychology, University of Basel(心理学系,巴塞尔大学)

AI总结 本文探讨如何利用语言模型应对认知科学中研究整合、形式化、概念清晰度等长期挑战,并指出其风险与机遇。

详情
AI中文摘要

认知科学因其多面性和跨学科性质,在研究整合、形式化、概念清晰度等领域面临持续挑战。人工智能的最新进展,特别是语言模型的发展,提供了可能有助于解决这些长期问题的工具。具体而言,它们可以帮助映射碎片化的文献、形式化言语理论、识别构念和测量之间的重叠、跨任务生成预测,以及从自然数据中提取文化或生态结构。然而,这些机遇也伴随着风险,包括过度简化、不透明性、技能退化以及偏见。综合来看,我们得出结论:当审慎地使用语言模型来补充而非取代人类能动性时,它们可以成为促进更具整合性和累积性的认知科学的工具。

英文摘要

Cognitive science faces ongoing challenges in research integration, formalization, conceptual clarity, and other areas, in part due to its multifaceted and interdisciplinary nature. Recent advances in artificial intelligence, particularly the development of language models, offer tools that may help to address these longstanding issues. Specifically, they can help map fragmented literatures, formalize verbal theories, identify overlap among constructs and measures, generate predictions across tasks, and extract cultural or ecological structure from naturalistic data. However, these opportunities come with risks, including oversimplification, opacity, deskilling, and bias. Taken together, we conclude that language models could serve as tools for a more integrative and cumulative cognitive science when used judiciously to complement, rather than replace, human agency.

2509.25837 2026-06-02 cs.LG cs.AI 版本更新

Distillation of Large Language Models via Concrete Score Matching

通过具体分数匹配进行大型语言模型的蒸馏

Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, Il-Chul Moon

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出具体分数蒸馏(CSD)目标,通过离散分数匹配克服softmax平滑和logit平移不变性限制,实现学生与教师模型间所有词汇对相对logit差异的灵活加权,在GPT-2、OpenLLaMA和GEMMA上优于现有蒸馏方法。

Comments ICLR 2026

详情
AI中文摘要

大型语言模型(LLMs)性能卓越但部署成本高昂,促使知识蒸馏(KD)用于高效推理。现有的KD目标通常通过softmax匹配学生和教师概率,这会模糊有价值的logit信息。虽然直接logit蒸馏(DLD)缓解了softmax平滑问题,但它未能考虑logit平移不变性,从而限制了解空间。我们提出具体分数蒸馏(CSD),一种离散分数匹配目标,克服了softmax引起的平滑和对最优解集的限制。我们解决了自回归LLMs中离散分数匹配的训练不稳定和二次复杂度问题,得到的CSD目标以灵活权重对齐学生和教师之间所有词汇对的相对logit差异。我们在框架内提供了模式寻求和模式覆盖实例,并在GPT-2-1.5B、OpenLLaMA-7B和GEMMA-7B-IT上评估了CSD在任务无关的指令遵循和任务特定蒸馏中的表现。实验表明,CSD持续超越最近的KD目标,实现了良好的保真度-多样性权衡,并与on-policy技术结合时产生互补增益,展示了其在LLM蒸馏中的可扩展性和有效性。代码:https://github.com/aailab-kaist/CSD。

英文摘要

Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation. Code: https://github.com/aailab-kaist/CSD.

2603.00133 2026-06-02 cs.CV cs.AI 版本更新

You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

你不需要所有注意力:文本到图像扩散模型中的外科记忆缓解

Kairan Zhao, Eleni Triantafillou, Peter Triantafillou

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出GUARD框架,通过吸引-排斥动力学调整去噪过程,结合交叉注意力衰减机制,在不损害图像质量的前提下有效缓解文本到图像扩散模型中的记忆问题。

Comments Accepted at ICML 2026

详情
AI中文摘要

生成模型已被证明会“记忆”某些训练数据,导致生成逐字或近乎逐字的图像,这可能引发隐私问题或版权侵权。我们引入了使用吸引-排斥动力学的引导(GUARD),一种用于文本到图像扩散模型中记忆缓解的新框架。GUARD调整图像去噪过程,引导生成远离原始训练图像,朝向与训练数据不同但仍与提示对齐的图像,防止复制训练数据,同时不损害图像生成质量。我们提出了该框架的一个具体实例,其中我们引导的正向目标由一种新的(交叉)注意力衰减方法给出,该方法基于(i)一种新颖的统计机制,自动识别需要衰减交叉注意力的提示位置,以及(ii)在这些每个提示的位置衰减交叉注意力。由此产生的GUARD提供了一种外科手术式的、动态的、每个提示的推理时方法,我们发现,在两种架构以及逐字和模板记忆方面,它是最稳健的方法,始终产生最先进的记忆缓解结果,同时在图像质量方面也优于或产生可比的结果。

英文摘要

Generative models have been shown to "memorize" certain training data, leading to verbatim or near-verbatim generating images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.

2602.22221 2026-06-02 cs.IR cs.AI cs.CL cs.CY 版本更新

Evaluating Reliability Asymmetries in Chinese Factual Search and AI Answers

评估中文事实搜索与AI答案中的可靠性不对称性

Geng Liu, Li Feng, Mengxiao Zhu, Francesco Pierri

发表机构 * Department of Electronics, Information and Bioengineering, Politecnico di Milano(电子、信息与生物工程系,米兰理工学院) University of Science and Technology of China(中国科学技术大学)

AI总结 通过构建基于真实中文搜索日志的查询事实核查数据集,比较传统搜索引擎、大型语言模型和搜索集成AI概览在中文是非问题上的准确性、回答频率、极性差距及区域信息需求差异,揭示可靠性不仅取决于回答正确性,还受回答频率、否定主张处理和信息需求暴露风险影响。

详情
AI中文摘要

搜索引擎和AI驱动的系统越来越多地成为获取事实信息的媒介,但在现实信息寻求场景中,其可靠性仍难以评估。我们通过从真实中文搜索日志构建基于查询的事实核查数据集,并比较传统搜索引擎、独立大型语言模型和搜索集成AI概览等九种系统,在中文网络生态中研究这一问题。聚焦于中文事实性是非问题,我们根据证据推导的基准事实评估系统是否提供正确、错误或不确定的判断。我们发现,当系统给出明确答案时,准确率相似(73.2%至78.9%),但给出明确答案的频率差异显著:搜索引擎对超过83%的查询给出明确答案,而Qwen-Max则不到一半。我们还发现一致的极性差距:所有系统在标记为“是”的查询上表现优于标记为“否”的查询。我们利用百度指数数据识别健康相关搜索关注度较高的中国省份,这可能表明更大的错误信息暴露风险。总体而言,我们的结果表明,可靠性不仅取决于系统回答时的正确性,还取决于回答频率、如何处理否定主张以及信息需求可能增加暴露风险的地方。

英文摘要

Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evaluate in realistic information-seeking settings. We study this problem in the Chinese web ecosystem by constructing a query-based fact-checking dataset from real Chinese search logs and comparing nine systems across traditional search engines, standalone large language models, and search-integrated AI Overviews. Focusing on factual Chinese-language factual Yes/No questions, we evaluate whether systems provide correct, incorrect, or uncertain decisions against evidence-derived ground truth. We find that systems are similarly accurate when they provide definitive answers, but differ sharply in how often they do so. Conditional accuracy ranges from 73.2% to 78.9%, yet search engines answer definitively on over 83% of queries, while Qwen-Max does so on fewer than half. We also find a consistent polarity gap: all systems perform better on yes-labeled queries than on no-labeled queries. We also use Baidu Index data to identify Chinese provinces with higher health-related search attention, which may indicate greater potential exposure to misinformation. Overall, our results show that reliability depends not only on whether systems are correct when they answer, but also on how often they answer, how they handle negative claims, and where information demand may increase exposure risks.

2602.16953 2026-06-02 cs.AI cs.LG 版本更新

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

LLM4Cov:面向高覆盖率测试生成的执行感知智能体学习

Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LLM4Cov离线智能体学习框架,通过执行验证数据策展、策略感知数据合成和最差状态优先采样,在硬件验证中实现高覆盖率测试生成,4B参数模型在CVDP-ECov上达到69.2%通过率和90.4%平均覆盖率。

Comments ICML'26 Camera Ready version

详情
AI中文摘要

执行感知的LLM智能体为从工具反馈中学习提供了一种有前景的范式,但这种反馈可能昂贵且获取缓慢,使得在线强化学习(RL)在某些场景下不太实用。高覆盖率硬件验证由于依赖工业模拟器和不可微的执行信号,体现了这一挑战。我们提出LLM4Cov,一种离线智能体学习框架,将验证建模为由确定性评估器指导的单步状态转移。基于这一公式,我们引入了执行验证的数据策展、策略感知的智能体数据合成以及最差状态优先采样,以在执行约束下实现可扩展学习。我们进一步通过修订的评估协议,从现有验证套件中整理了一个符合现实的基准。使用所提出的流程,一个紧凑的4B参数模型在智能体评估下实现了69.2%的通过率和90.4%的平均覆盖率(CVDP-ECov),比其教师模型分别高出5.3%和10.5%,展现出与规模大一个数量级的模型相竞争的性能。

英文摘要

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtain, making online reinforcement learning (RL) less practical in certain scenarios. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as single-step state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% pass rate and 90.4% average coverage in CVDP-ECov under agentic evaluation, outperforming its teacher by 5.3% and 10.5%, demonstrating competitive performance against models an order of magnitude larger.

2508.08337 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants

立场:超越敏感属性,机器学习公平性应通过社会决定因素量化结构性不公正

Zeyu Tang, Alex John London, Atoosa Kasirzadeh, Sarah Stewart de Ramirez, Peter Spirtes, Kun Zhang, Sanmi Koyejo

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Michigan(密歇根大学) University of Toronto(多伦多大学)

AI总结 本文主张算法公平性研究应超越敏感属性,通过社会决定因素量化结构性不公正,并通过理论模型和实证研究证明仅关注敏感属性的缓解策略可能引入新的结构性不公正。

Comments Accepted to ICML 2026 Position Paper Track

详情
AI中文摘要

算法公平性研究在很大程度上将不公平视为对敏感属性的歧视。然而,这种方法限制了对作为通过社会决定因素实例化的结构性不公正的不公平的可见性,社会决定因素是塑造属性和结果但不涉及特定个体的上下文变量。这篇立场论文认为,该领域应通过社会决定因素量化结构性不公正,超越敏感属性。借鉴跨学科见解,我们认为主流技术范式未能充分捕捉作为结构性不公正的不公平,因为上下文可能被视为需要标准化的噪声,而不是需要审计的信号。我们进一步通过大学录取的理论模型、使用美国人口普查数据的人口统计研究以及美国综合医疗系统中关于乳腺癌筛查的高风险领域应用,证明了这种转变的实际紧迫性。我们的结果表明,仅关注敏感属性的缓解策略可能引入新的结构性不公正形式。我们认为,通过社会决定因素审计结构性不公正必须先于缓解措施,并呼吁开发超越以敏感属性为中心的非歧视公平概念的新技术。

英文摘要

Algorithmic fairness research has largely framed unfairness as discrimination along sensitive attributes. However, this approach limits visibility into unfairness as structural injustice instantiated through social determinants, which are contextual variables that shape attributes and outcomes without pertaining to specific individuals. This position paper argues that the field should quantify structural injustice via social determinants, beyond sensitive attributes. Drawing on cross-disciplinary insights, we argue that prevailing technical paradigms fail to adequately capture unfairness as structural injustice, because contexts are potentially treated as noise to be normalized rather than signal to be audited. We further demonstrate the practical urgency of this shift through a theoretical model of college admissions, a demographic study using U.S. census data, and a high-stakes domain application regarding breast cancer screening within an integrated U.S. healthcare system. Our results indicate that mitigation strategies centered solely on sensitive attributes can introduce new forms of structural injustice. We contend that auditing structural injustice through social determinants must precede mitigation, and call for new technical developments that move beyond sensitive-attribute-centered notions of fairness as non-discrimination.

2601.17074 2026-06-02 cs.LG cs.AI 版本更新

Physics-Encoded Inverse Modeling for Arctic Snow Depth Prediction

物理编码的北极雪深预测逆建模

Akila Sampath, Vandana Janeja, Jianwu Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出物理编码逆建模框架PhysE-Inv,结合LSTM序列学习与对比学习正则化,在稀疏观测下实现雪深估计,均方误差平均降低24.7%。

详情
AI中文摘要

在有限且稀疏观测下准确估计时变逆问题仍然是科学领域的基本挑战。例如,雪深估计需要推断控制海冰物理的隐藏参数,这可以通过物理信息编码来实现。为了解决这一挑战,我们引入了物理编码逆建模(PhysE-Inv),这是一个新颖的框架,将深度序列学习与物理信息推理相结合,用于解决真实世界稀疏观测环境下的逆问题。PhysE-Inv集成了LSTM编码器-解码器以捕获时间依赖性,并结合对比学习正则化来强制实现噪声不变的潜在表示。该框架学习潜在参数,这些参数与观测输入相结合,在融入物理信息指导的同时重建雪深。PhysE-Inv在所有评估基线上持续表现优异,在所有基线模型上实现了平均MSE降低24.7%,在参数估计设置下比最强基线提高了17.3%。总体而言,我们的工作为数据稀缺领域展示了一种可泛化的逆建模范式,其中物理信息指导可以融入稀疏观测中。

英文摘要

Accurate estimation in time-varying inverse problems under limited and sparse observations remains a fundamental challenge across scientific domains. For example, snow depth estimation requires inferring hidden parameters governing sea ice physics, which can be incorporated through physics-informed encoding. To address this challenge, we introduce Physics-Encoded Inversion (PhysE-Inv), a novel framework that combines deep sequential learning with physics-informed inference for solving inverse problems under real-world sparse observational settings. PhysE-Inv integrates an LSTM encoder-decoder to capture temporal dependencies, together with contrastive learning regularization that enforces noise-invariant latent representations. The framework learns latent parameters that, when combined with observational inputs, reconstruct snow depth while incorporating physics-informed guidance. PhysE-Inv consistently outperforms all evaluated baselines, achieving an average MSE reduction of 24.7\% across all baseline models and a 17.3\% improvement over the strongest baseline under parameter estimation settings. Overall, our work demonstrates a generalizable inverse modeling paradigm for data-scarce domains where physics-informed guidance can be incorporated into sparse observations.

2602.20019 2026-06-02 cs.LG cs.AI 版本更新

Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision

有限监督下动态图的可判别且可泛化的异常检测器学习

Yuxing Tian, Yiyan Qi, Fengran Mo, Weixu Zhang, Jian Guo, Jian-Yun Nie

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对动态图异常检测中标注异常稀缺的问题,提出一个结合残差表示编码、限制损失和双边界优化的模型无关框架,从正常/未标注数据中学习可判别边界,同时利用有限标注异常并保持对未见异常的泛化能力。

Comments Accepted by ICML2026

详情
AI中文摘要

动态图异常检测对许多现实应用至关重要,但由于标注异常的稀缺性,仍然具有挑战性。现有方法要么是无监督的,要么是半监督的:无监督方法避免了标注异常的需求,但往往产生模糊的边界,而半监督方法可能过拟合于有限的标注异常,并对未见异常泛化能力差。为了解决这一差距,我们考虑一个很大程度上未被探索的问题:从正常/未标注数据中学习可判别边界,同时利用有限的标注异常(当可用时),而不牺牲对未见异常的泛化能力。在本文中,我们提出了一个有效、可泛化且模型无关的框架,包含三个主要组件:(i)残差表示编码,捕捉当前交互与其历史上下文之间的偏差,提供与异常相关的信号;(ii)限制损失,将正常表示约束在两个共心超球面之间的区间内,确保尺度一致的同时保持异常的可分离性;(iii)双边界优化策略,利用归一化流建模的对数似然分布,学习一个可判别且鲁棒的边界。大量实验证明了我们的框架在不同评估设置下的优越性。

英文摘要

Dynamic graph anomaly detection is critical for many real-world applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundary, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize poorly to unseen anomalies. To address this gap, we consider a largely underexplored problem: learning a discriminative boundary from normal/unlabeled data, while leveraging limited labeled anomalies \textbf{when available} without sacrificing generalization to unseen anomalies. In this paper, we propose an effective, generalizable, and model-agnostic framework with three main components: (i) residual representation encoding that capture deviations between current interactions and their historical context, providing anomaly-relevant signals; (ii) a restriction loss that constrain the normal representations within an interval bounded by two co-centered hyperspheres, ensuring consistent scales while keeping anomalies separable; (iii) a bi-boundary optimization strategy that learns a discriminative and robust boundary using the log-likelihood distribution modeled by a normalizing flow. Extensive experiments demonstrate the superiority of our framework across diverse evaluation settings.

2602.19066 2026-06-02 cs.LG cs.AI 版本更新

IDLM: Inverse-distilled Diffusion Language Models

IDLM:逆蒸馏扩散语言模型

David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin

发表机构 * GitHub

AI总结 针对扩散语言模型推理慢的问题,提出逆蒸馏方法(IDLM),通过理论保证唯一解和梯度稳定松弛,实现4倍至64倍推理加速并保持生成质量。

Comments ICML 2026. We provide the code at: https://david-cripto.github.io/idlm-project-page

详情
AI中文摘要

扩散语言模型(DLM)最近在文本生成中取得了强劲成果。然而,其多步采样导致推理缓慢,限制了实际应用。为解决此问题,我们将逆蒸馏(一种最初为加速连续扩散模型而开发的技术)扩展到离散设置。然而,这种扩展引入了理论和实践上的挑战。从理论角度看,逆蒸馏目标缺乏唯一性保证,可能导致次优解。从实践角度看,离散空间中的反向传播非平凡且常不稳定。为克服这些挑战,我们首先提供理论结果,证明我们的逆形式具有唯一解,从而确保有效优化。然后,我们引入梯度稳定松弛以支持有效训练。最终,在多个DLM上的实验表明,我们的方法——逆蒸馏扩散语言模型(IDLM)——将推理步骤减少了4倍至64倍,同时保持了教师模型的生成质量。我们在项目页面上提供代码、模型检查点和视频教程:https://david-cripto.github.io/idlm-project-page。

英文摘要

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's generation quality. We provide the code, model checkpoints, and video tutorials on the project page: https://david-cripto.github.io/idlm-project-page

2512.16167 2026-06-02 cs.MA cs.AI cs.GT 版本更新

Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies

Ev-Trust: 一种面向去中心化基于LLM的多智能体服务经济的演化稳定信任机制

Jiye Wang, Shiduo Yang, Ting Qiao, Jiayu Qin, Jianbin Li, Yu Wang, Yuanhe Zhao

发表机构 * School of Control and Computer Engineering, North China Electric Power University(控制与计算机工程学院,华北电力大学) State Grid Corporation of China(国家电网公司)

AI总结 针对去中心化LLM多智能体服务经济中欺诈成本降低、服务质量评估困难和服务内容不稳定三大脆弱性,提出Ev-Trust信任机制,通过交叉验证门、方差标准化漂移度量和信任信号嵌入收益函数,实现合作策略的演化稳定,实验表明恶意参与减少约60%,欺诈率降低约50%。

Comments 19 pages, 9 figures

详情
AI中文摘要

去中心化基于LLM的多智能体服务经济面临三个脆弱性,这些脆弱性破坏了传统信任机制:欺诈成本降低、服务质量评估困难以及服务内容不稳定。这些复合脆弱性可能引发群体层面的信任崩溃和短视策略的扩散。我们提出Ev-Trust,一种演化稳定的信任机制,通过三个针对性设计应对这些脆弱性:利用请求者语义理解评估响应有效性的交叉验证门;过滤内源随机性与真实行为异常的方差标准化漂移度量;以及将信任信号嵌入期望收益函数,将可信度转化为演化生存优势。基于带噪声最优反应微观基础的复制者动力学,我们证明了合作演化稳定策略的渐近稳定性,并推导了维持合作均衡的显式阈值条件。我们通过至少100个异构LLM驱动智能体(涵盖七种行为类型)的100轮模拟评估Ev-Trust。实验在TruthfulQA和TriviaQA两个事实性问答基准上进行。与基于传递信任聚合、强化学习声誉和纯演化模仿的基线相比,Ev-Trust将恶意智能体参与率降低约60%,欺诈服务率抑制约50%,并在30%对抗性突变下维持稳定的信任分化。这些结果表明,将语义信任评估与演化激励相结合,为在去中心化基于LLM的多智能体系统中保障合作提供了原则性基础。

英文摘要

Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost of fraud, difficulty in evaluating service quality, and instability of service content. These compounding vulnerabilities can trigger population-level trust collapse and the proliferation of short-sighted strategies. We propose Ev-Trust, an evolutionarily stable trust mechanism that addresses these vulnerabilities through three targeted designs: a cross-validation gate leveraging requestor semantic comprehension to assess response validity, a variance-standardized drift measure filtering endogenous stochasticity from genuine behavioral anomalies, and an embedding of trust signals into the expected revenue function that converts trustworthiness into an evolutionary survival advantage. Based on replicator dynamics with a noisy best response micro-foundation, we prove the asymptotic stability of cooperative evolutionarily stable strategies and derive explicit threshold conditions for maintaining cooperative equilibria. We evaluate Ev-Trust through 100-round simulations with at least 100 heterogeneous LLM-driven agents covering seven behavioral types. The experiments are conducted on TruthfulQA and TriviaQA, two factual question-answering benchmarks. Compared to baselines based on transitive trust aggregation, reinforcement-learning reputation, and pure evolutionary imitation, Ev-Trust reduces malicious agent participation by approximately 60%, suppresses the fraudulent service rate by approximately 50%, and maintains stable trust differentiation under a 30% adversarial mutation. These results demonstrate that coupling semantic trust evaluation with evolutionary incentives provides a principled foundation for securing cooperation in decentralized LLM-based multi-agent systems.

2602.18195 2026-06-02 cs.LG cs.AI 版本更新

LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification

LERD: 用于神经退行性疾病分类的潜在事件-关系动力学

Yicheng Feng, Hairong Chen, Ziyu Jia, Samir Bhatt, Hengguan Huang

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of California, San Diego(加州大学圣地亚哥分校) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出LERD,一种端到端贝叶斯潜在事件-关系动力系统,直接从多通道脑电图推断潜在神经事件及其关系结构,无需事件或交互标注,在阿尔茨海默病分类中优于基线方法并提供生理对齐的动力学摘要。

详情
AI中文摘要

阿尔茨海默病(AD)会改变大脑电生理学并破坏多通道脑电图动力学,使得准确且临床有用的基于脑电图的诊断对于筛查和疾病监测越来越重要。然而,许多现有方法依赖黑盒分类器,并未明确建模其决策背后的潜在事件时序和跨通道协调。为解决这些局限,我们提出LERD,一种端到端贝叶斯潜在事件-关系动力系统,无需事件或交互标注,直接从多通道脑电图推断潜在神经事件及其关系结构。LERD结合连续时间事件推断模块与随机事件生成过程以捕获灵活的时间模式,同时融入电生理学启发的动力学先验以原则性方式指导学习。我们进一步提供理论分析,得到基于初值问题的可处理KL正则化项以及推断关系动力学的稳定性保证。在合成基准和两个真实世界AD脑电图队列上的大量实验表明,LERD一致优于强基线,并生成与生理对齐的速率、时序和图摘要,有助于刻画组级动力学差异。

英文摘要

Alzheimer's disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-based diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the latent event timing and cross-channel coordination behind their decisions. To address these limitations, we propose LERD, an end-to-end Bayesian latent event--relational dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable IVP-based KL regularizer and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned rate, timing, and graph summaries that help characterize group-level dynamical differences.

2602.18008 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

LLM 是否准备好进行神经集成机制建模?一个基准测试与智能体框架

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出神经集成机制建模(NIMM)基准测试,评估大语言模型在三个科学领域构建神经集成机制模型的能力,并设计树引导的智能体框架 NIMMGen,通过分支级搜索和原子模型细化显著提升搜索稳定性和解质量。

Comments 25 pages, 8 figures

详情
AI中文摘要

大语言模型(LLM)在从数据构建机制模型方面显示出潜力。然而,现有评估主要关注简化设置,未能捕捉真实世界科学建模的复杂性。在实践中,此类建模通常涉及神经集成公式,其中机制模型组件和神经网络组件共同构建,导致搜索空间显著复杂化。受此差距驱动,我们引入了神经集成机制建模(NIMM)基准测试,该基准测试评估 LLM 生成的神经集成机制模型在三个科学领域上的表现。在 NIMM 上的实验表明,现有基于 LLM 的方法难以有效探索这一复杂空间,导致搜索稳定性和解质量有限。为应对这一挑战,我们提出了 NIMMGen,一种树引导的智能体框架,通过分支级搜索实现多样化探索,并通过原子模型细化改进解。大量实验表明,NIMMGen 在 NIMM 上达到了最先进的性能,显著提升了搜索稳定性和解质量。

英文摘要

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus on simplified settings and fail to capture the complexity of real-world scientific modeling. In practice, such modeling often involves neural-integrated formulations, where a mechanistic model component and a neural network component are jointly constructed, leading to a significantly more complex search space. Motivated by this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates LLM-generated neural-integrated mechanistic models across three scientific domains. Experiments on NIMM reveal that existing LLM-based approaches struggle to effectively explore this complex space, resulting in limited search stability and solution quality. To address this challenge, we propose NIMMGen, a tree-guided agentic framework that enables diversified exploration via branch-level search and improves solutions through atomic model refinement. Extensive experiments demonstrate that NIMMGen achieves state-of-the-art performance on NIMM, significantly improving search stability and solution quality.

2602.16763 2026-06-02 cs.AI 版本更新

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

当AI基准测试达到平台期:基准饱和的系统性研究

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学) University of Washington(华盛顿大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Michigan(密歇根大学) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本研究定义并分析了60个语言模型基准的饱和现象,发现近半数基准出现饱和,且专家策划而非公开测试数据影响抗饱和能力,为延长基准寿命提供了设计建议。

Comments Accepted at ICML 2026

详情
AI中文摘要

人工智能基准测试是衡量模型进展和指导部署决策的重要机制。然而,基准测试很快“饱和”,使得区分模型变得困难,并削弱了其长期价值。在本研究中,我们定义了基准饱和,并使用与饱和相关的14个属性分析了60个语言模型基准。我们发现近半数的基准表现出饱和,且饱和率随年龄增长而增加。此外,我们发现抗饱和能力受专家策划的影响,而非公开测试数据。我们的结果表明,设计选择可以延长基准寿命,并为更持久的评估方法提供信息。

英文摘要

Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, benchmarks quickly "saturate", making it difficult to differentiate models and diminishing their long-term value. In this study, we define benchmark saturation and analyze it across 60 language model benchmarks using 14 properties that relate to saturation. We find that nearly half of the our benchmarks exhibit saturation, with rates increasing with age. Further, we find that resilience to saturation is impacted by expert-curation, not by public test data. Our results suggest that design choices can extend benchmark longevity and inform more durable evaluation approaches.

2602.16745 2026-06-02 cs.LG cs.AI 版本更新

PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

PETS:一种面向高效测试时自一致性的最优轨迹分配原则性框架

Zhangyi Liu, Huaizhi Qu, Xiaowei Yin, He Sun, Yanjun Han, Tianlong Chen, Zhun Deng

发表机构 * Stanford University(斯坦福大学) UNC at Chapel Hill(Chapel Hill 大学) Yale University(耶鲁大学) New York University(纽约大学)

AI总结 提出PETS框架,通过将轨迹分配建模为优化问题并引入自一致性率度量,在离线(连接众包理论)和在线流式场景下实现样本高效的测试时自一致性,显著降低采样预算。

详情
AI中文摘要

测试时扩展可以通过聚合随机推理轨迹来提高模型性能。然而,在有限预算下实现样本高效的测试时自一致性仍然是一个开放的挑战。我们引入了PETS(原则性且高效的测试时自一致性),它通过一个优化框架启动了对轨迹分配的原则性研究。我们方法的核心是自一致性率,这是一个新定义的度量,即与无限预算多数投票的一致性。这一公式使样本高效的测试时分配在理论上具有坚实基础,并适合严格分析。我们研究了离线和在线两种设置。在离线模式下,所有问题事先已知,我们将轨迹分配与众包(一个经典且成熟的研究领域)联系起来,将推理轨迹建模为工人。这种视角使我们能够利用丰富的现有理论,获得理论保证和一种高效的基于多数投票的分配算法。在在线流式模式下,问题顺序到达且必须实时做出分配,我们提出了一种受离线框架启发的新方法。我们的方法根据问题难度调整预算,同时保持强大的理论保证和计算效率。实验表明,PETS始终优于均匀分配。在GPQA上,PETS在两种设置下均实现了完美的自一致性,同时相对于均匀分配将采样预算减少了高达75%(离线)和55%(在线)。代码可在https://github.com/ZDCSlab/PETS获取。

英文摘要

Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-TimeSelf-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.

2602.16720 2026-06-02 cs.DB cs.AI 版本更新

APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL

APEX-SQL: 通过智能体探索与数据对话实现Text-to-SQL

Bowen Cao, Weibin Liao, Yushi Sun, Dong Fang, Haitao Li, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学) LIGHTSPEED

AI总结 提出APEX-SQL框架,通过假设验证循环、逻辑规划、双路径剪枝、并行数据分析和确定性探索机制,解决静态模式表示在复杂企业数据库中的语义模糊和扩展性问题,在BIRD和Spider 2.0-Snow上取得领先性能。

Comments KDD 2026

详情
AI中文摘要

由大型语言模型驱动的Text-to-SQL系统在学术基准测试中表现出色,但在复杂的企业环境中却难以应对。主要限制在于它们依赖静态模式表示,这无法解决语义模糊性,也无法有效扩展到大型复杂数据库。为了解决这个问题,我们提出了APEX-SQL,一个智能体Text-to-SQL框架,它将范式从被动翻译转变为智能体探索。我们的框架采用假设验证循环,将模型推理基于真实数据。在模式链接阶段,我们使用逻辑规划来表述假设,双路径剪枝来减少搜索空间,并行数据分析来验证列角色与真实数据的关系,然后进行全局合成以确保拓扑连通性。对于SQL生成,我们引入了一种确定性机制来检索探索指令,使智能体能够有效地探索数据分布、细化假设并生成语义准确的SQL。在BIRD(执行准确率70.65%)和Spider 2.0-Snow(执行准确率51.01%)上的实验表明,APEX-SQL在减少token消耗的同时优于竞争基线。进一步的分析表明,智能体探索作为性能倍增器,释放了基础模型在企业环境中的潜在推理能力。消融研究证实了每个组件在确保稳健和准确数据分析中的关键贡献。我们的代码发布在https://github.com/Tencent/APEX-SQL-Project。

英文摘要

Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments. The primary limitation lies in their reliance on static schema representations, which fails to resolve semantic ambiguity and scale effectively to large, complex databases. To address this, we propose APEX-SQL, an Agentic Text-to-SQL Framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data. In the schema linking phase, we use logical planning to verbalize hypotheses, dual-pathway pruning to reduce the search space, and parallel data profiling to validate column roles against real data, followed by global synthesis to ensure topological connectivity. For SQL generation, we introduce a deterministic mechanism to retrieve exploration directives, allowing the agent to effectively explore data distributions, refine hypotheses, and generate semantically accurate SQLs. Experiments on BIRD (70.65% execution accuracy) and Spider 2.0-Snow (51.01% execution accuracy) demonstrate that APEX-SQL outperforms competitive baselines with reduced token consumption. Further analysis reveals that agentic exploration acts as a performance multiplier, unlocking the latent reasoning potential of foundation models in enterprise settings. Ablation studies confirm the critical contributions of each component in ensuring robust and accurate data analysis. Our code is released at https://github.com/Tencent/APEX-SQL-Project.

2602.15278 2026-06-02 cs.CV cs.AI 版本更新

Visual Persuasion: What Influences Decisions of Vision-Language Models?

视觉说服:什么影响了视觉语言模型的决策?

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

发表机构 * Massachusetts Institute of Technology(麻省理工学院) MIT Media Lab(MIT媒体实验室)

AI总结 提出一个框架,通过控制图像选择任务并系统性地扰动输入,利用视觉提示优化方法推断视觉语言模型的潜在视觉效用,揭示影响模型决策的视觉偏好。

Comments Accepted to ICML 2026

详情
AI中文摘要

网络上充斥着图像,这些图像最初是为人类消费而创建的,现在越来越多地被使用视觉语言模型(VLM)的智能体解释。这些智能体大规模地做出视觉决策,决定点击、推荐或购买什么。然而,我们对它们视觉偏好的结构知之甚少。我们引入了一个框架来研究这一点,通过将VLM置于受控的基于图像的选择任务中,并系统地扰动它们的输入。我们的关键思想是将智能体的决策函数视为一种潜在的视觉效用,可以通过揭示偏好来推断:在系统编辑的图像之间进行选择。从常见图像(如产品照片)开始,我们提出了视觉提示优化的方法,将文本优化方法适应为使用图像生成模型(例如在构图、光照或背景方面)迭代地提出并应用视觉上合理的修改。然后,我们评估哪些编辑增加了选择概率。通过对前沿VLM的大规模实验,我们证明了优化后的编辑在直接比较中显著改变了选择概率。我们开发了一个自动可解释性管道来解释这些偏好,识别出驱动选择的一致视觉主题。我们认为,这种方法提供了一种实用且高效的方式来揭示视觉漏洞和安全问题,否则这些问题可能会在现实世界中隐含地发现,从而支持对基于图像的AI智能体进行更主动的审计和治理。

英文摘要

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.

2602.15259 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Knowing Isn't Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight

知道不等于理解:用认知与行为洞察重新奠定生成式主动性

Kirandeep Kaur, Xingda Lyu, Chirag Shah

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学) University of Waterloo(滑铁卢大学)

AI总结 针对用户无法明确表达需求时的认知不完整问题,提出生成式主动性需要基于认知和行为双重约束来设计负责任的主动代理。

Comments 43 rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

生成式AI代理将理解等同于解决显式查询,这一假设将交互限制在用户能够表达的范围内。当用户自身缺乏对缺失、风险或值得考虑之事的意识时,这一假设就会失效。在这种情况下,主动性不仅是效率提升,更是一种认知上的必要性。我们将这种状态称为认知不完整:即进步依赖于处理未知的未知以实现有效协作。现有的主动性方法仍然局限于预测性,从过去行为中推断并假定目标已经明确,从而未能有意义地支持用户。然而,揭示超出用户当前意识的可能性并非天然有益。不受约束的主动干预可能误导注意力、使用户不堪重负或引入伤害。因此,主动代理需要行为锚定:对代理何时、如何以及在何种程度上进行干预施加原则性约束。我们主张生成式主动性必须在认知和行为上双重锚定。借鉴无知哲学和主动行为研究,我们认为这些理论为设计能够负责任地参与并促进有意义协作的代理提供了关键指导。

英文摘要

Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articulate. This assumption breaks down when users themselves lack awareness of what is missing, risky, or worth considering. In such conditions, proactivity is not merely an efficiency enhancement, but an epistemic necessity. We refer to this condition as epistemic incompleteness: where progress depends on engaging with unknown unknowns for effective partnership. Existing approaches to proactivity remain narrowly anticipatory, extrapolating from past behavior and presuming that goals are already well defined, thereby failing to support users meaningfully. However, surfacing possibilities beyond a user's current awareness is not inherently beneficial. Unconstrained proactive interventions can misdirect attention, overwhelm users, or introduce harm. Proactive agents, therefore, require behavioral grounding: principled constraints on when, how, and to what extent an agent should intervene. We advance the position that generative proactivity must be grounded both epistemically and behaviorally. Drawing on the philosophy of ignorance and research on proactive behavior, we argue that these theories offer critical guidance for designing agents that can engage responsibly and foster meaningful partnerships.

2602.14849 2026-06-02 cs.LG cs.AI cs.DC cs.MA 版本更新

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

Atomix: 用于可靠智能体工作流的及时事务性工具使用

Bardia Mohammadi, Nearchos Potamitis, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Aarhus University(奥胡斯大学) EPFL(苏黎世联邦理工学院)

AI总结 针对LLM智能体多步工作流中因故障、推测和并发导致的状态不一致问题,提出Atomix系统,通过进度感知事务将效果分组与冲突解决分离,实现可靠提交与回滚。

详情
AI中文摘要

LLM智能体执行多步工作流,通过工具改变外部状态。常见的编排器将工具返回视为结算触发器,因此故障、推测和并发智能体可能留下部分效果、丢失分支残留、陈旧写入或不可逆发送。正确的结算需要两个事实,而重试、检查点重放、锁和补偿各自混淆了这些事实:哪些效果必须一起结算,以及何时较早的冲突工作已耗尽。Atomix通过进度感知事务使这种分离明确化。运行时在执行期间记录读取和效果,当足迹完成时密封事务,并且仅在每个资源的前沿显示没有更早的冲突工作可能到达后才提交。提交是最终结算:Atomix释放可缓冲效果,接受可逆外部效果为最终状态,并让不可逆效果离开。中止抑制未释放的效果,并在可能的情况下补偿外部化的可逆效果。在代表性智能体工作负载上,这种组合在注入故障下改善了干净恢复,隔离了竞争和推测工作,并防止了正确分类的不可逆动作泄漏;微基准测试显示相对于工具延迟的微秒级包装开销。

英文摘要

LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement trigger, so faults, speculation, and concurrent agents can leave partial effects, losing-branch residue, stale writes, or irreversible sends. Correct settlement needs two facts that retries, checkpoint replay, locks, and compensation each conflate: which effects must settle together, and when earlier conflicting work is exhausted. Atomix makes this split explicit with progress-aware transactions. The runtime records reads and effects during execution, seals a transaction when its footprint is complete, and commits only after per-resource frontiers show that no earlier conflicting work can still arrive. Commit is final settlement: Atomix releases bufferable effects, accepts reversible external effects as final, and lets irreversible effects leave the gate. Abort suppresses unreleased effects and compensates externalized reversible effects where possible. On representative agent workloads, this composition improves clean recovery under injected faults, isolates contending and speculative work, and prevents correctly classified irreversible actions from leaking; microbenchmarks show microsecond-scale wrapper overhead relative to tool latency.

2602.14134 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

DenseMLLM:用于密集预测的标准多模态大语言模型

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学电子与计算机工程系) Tencent, Youtu-Lab, China(腾讯优图实验室)

AI总结 提出DenseMLLM,通过标准多模态大语言模型架构和视觉令牌监督策略,无需任务特定解码器即可实现语义分割、深度估计等密集预测任务,在多个基准上取得竞争性能。

Comments ICML 2026

详情
AI中文摘要

多模态大语言模型在高层次视觉理解方面展现出卓越能力。然而,将这些模型扩展到细粒度的密集预测任务(如语义分割和深度估计)通常需要引入复杂的任务特定解码器和其他定制化组件。这种架构碎片化增加了模型复杂度,偏离了多模态大语言模型的通用设计,最终限制了其实用性。在这项工作中,我们挑战了这一范式,通过调整标准多模态大语言模型来执行密集预测,无需额外的任务特定解码器。所提出的模型称为DenseMLLM,基于标准架构,并采用一种新颖的视觉令牌监督策略来处理多个标签和任务。尽管设计极简,我们的模型在广泛的密集预测和视觉语言基准测试中取得了极具竞争力的性能,表明标准的通用多模态大语言模型可以在没有架构专门化的情况下有效支持密集感知。该项目可在github.com/Eli-YiLi/DenseMLLM获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization. This project is available at github.com/Eli-YiLi/DenseMLLM.

2602.14065 2026-06-02 cs.AI 版本更新

REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

REAL: 通过推理枢轴对齐解决知识密集型视觉问答中的知识冲突

Kai Ye, Xianwei Mao, Sheng Zhou, Zirui Shao, Ye Mo, Liangliang Liu, Haikuan Huang, Bin Li, Jiajun Bu

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出REAL框架,通过推理枢轴对齐和引导解码,解决知识密集型视觉问答中因开放域检索引起的知识冲突问题。

Comments Accepted by ICML 2026

详情
AI中文摘要

知识密集型视觉问答(KI-VQA)经常因开放域检索的固有限制而遭受严重的知识冲突。然而,现有范式由于缺乏可泛化的冲突检测和模型内约束机制来处理冲突证据,面临关键限制。为应对这些挑战,我们提出了REAL(推理枢轴对齐)框架,其核心是新颖的推理枢轴概念。与优先考虑内部自我推导的推理步骤不同,推理枢轴作为推理链中的原子单元(节点或边),强调知识链接,通常依赖外部证据完成推理。在我们构建的REAL-VQA数据集支持下,我们的方法集成了推理枢轴感知SFT(RPA-SFT),通过将冲突与枢轴提取对齐来训练可泛化的判别器,并采用推理枢轴引导解码(RPGD),一种利用这些枢轴进行针对性冲突缓解的模型内解码策略。在多个数据集上的大量实验表明,REAL显著提高了判别准确性并实现了优越性能,验证了我们的枢轴驱动解决范式。

英文摘要

Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations due to the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose the REAL (Reasoning-Pivot Alignment) framework centered on the novel concept of the Reasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructed REAL-VQA dataset, our approach integrates Reasoning-Pivot Aware SFT (RPA-SFT) to train a generalizable discriminator by aligning conflicts with pivot extraction, and employs Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments on diverse datasets demonstrate that REAL significantly enhances discrimination accuracy and achieves superior performance, validating our pivot-driven resolution paradigm.

2602.13940 2026-06-02 cs.LG cs.AI 版本更新

You Can Learn Tokenization End-to-End with Reinforcement Learning

你可以通过强化学习端到端地学习分词

Sam Dauncey, Roger Wattenhofer

发表机构 * University of Waterloo(滑铁卢大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出使用强化学习中的得分函数估计来学习离散分词边界,通过时间折扣等技巧降低方差,在1亿参数规模上优于先前的直通估计方法。

Comments ICML 2026 camera-ready

详情
AI中文摘要

分词是一个硬编码的压缩步骤,尽管架构总体上趋向于端到端,但它仍然保留在大语言模型(LLM)的训练流程中。先前的工作在大规模上展示了有希望的结果,通过启发式方法将这一压缩步骤引入LLM架构内部以绘制分词边界,并尝试使用直通估计来学习这些分词边界,直通估计将绘制离散分词边界的问题视为连续问题。我们表明,这些分词边界可以通过得分函数估计来学习,由于直接优化绘制离散分词边界以最小化损失的问题,得分函数估计具有更严格的理论保证。我们观察到,强化学习中的技术,如时间折扣,对于充分降低该得分函数的方差以使其可行是必要的。我们证明,所得到的方法在1亿参数规模上,在定性和定量上都优于先前提出的直通估计方法。

英文摘要

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the $100$ million parameter scale.

2602.07298 2026-06-02 cs.IR cs.AI 版本更新

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

原则性合成数据使推荐系统中的LLM首次出现缩放定律

Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Qunshu Zhang, Neeraj Bhatia, Xiangjun Fan, Hong Yan

发表机构 * Meta

AI总结 本文提出一种分层框架生成高质量合成数据,通过避免原始数据噪声,首次在推荐领域实现LLM的稳健幂律缩放,并显著提升下游排序任务性能。

Comments update according to icml reviewers feedback

详情
Journal ref
ICML 2026
AI中文摘要

大型语言模型(LLM)代表了推荐系统的一个有前景的前沿,但其发展一直受到缺乏可预测缩放定律的阻碍,而缩放定律对于指导研究和优化资源分配至关重要。我们假设,这可能是由于先前持续预训练(CPT)工作中原始用户交互数据固有的噪声、偏差和不完整性所致。本文介绍了一种新颖的分层框架,用于生成高质量合成数据,通过为LLM创建精心策划的教学课程来规避此类问题。我们提供了强有力的直接证据,证明我们课程的有效性:在原则性合成数据上训练的标准序列模型在下游排序任务中显著优于(在SasRec的recall@100上提高+130%)在真实数据上训练的模型,展示了其在学习可泛化用户偏好模式方面的优越性。在此基础上,我们首次通过实验证明,在高质量、推荐特定数据上持续预训练的LLM存在稳健的幂律缩放。我们的实验揭示了跨多种合成数据模态的一致且可预测的困惑度降低。这些发现为在推荐领域可靠地缩放LLM能力建立了基础方法论,从而将研究重点从缓解数据缺陷转向利用高质量的结构化信息。

英文摘要

Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform ($+130\%$ on recall@100 for SasRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliable scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.

2602.11852 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

原型Transformer:迈向可解释设计的语言模型架构

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

发表机构 * University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出原型Transformer(ProtoT),一种用线性代价原型模块替代二次代价自注意力的自回归语言模型架构,原型自动捕获可命名概念,提升可解释性并支持行为编辑。

Comments Accepted at ICML 2026. Equal contribution: Yordan Yordanov and Matteo Forasassi. 40 pages, 28 figures, 22 tables

详情
AI中文摘要

尽管最先进的语言模型(LM)在某些领域超越了大多数人类,但其推理过程仍然不透明,降低了信任度并增加了欺骗和幻觉的风险。我们引入了原型Transformer(ProtoT),一种自回归LM架构,它将Transformer的二次代价自注意力模块替换为基于原型的线性代价模块,原型是学习到的参数向量。在ProtoT中,原型创建了在不同时间尺度上聚合上下文信息的通信通道。我们表明,这种结构导致原型在训练过程中自动捕获可命名的概念,例如“女人”,为解释模型推理和对模型行为进行有针对性的编辑提供了途径。与基线相比,ProtoT在模型和数据规模上具有良好的扩展性,对输入扰动具有鲁棒性,并在文本生成和下游任务(包括GLUE)上表现良好。这些结果表明,ProtoT是朝着设计上更可解释的自回归语言模型迈出的有希望的一步。

英文摘要

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.

2602.11790 2026-06-02 cs.AI cs.CL 版本更新

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

超越端到端视频模型:基于LLM的多智能体系统用于教育视频生成

Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

发表机构 * Baidu Inc.(百度公司)

AI总结 提出LASEV,一种基于LLM的分层多智能体系统,通过将教育视频生成分解为多个专业智能体协作,解决端到端模型在逻辑严谨性和知识表示方面的不足,实现低成本、高吞吐量的自动化教学视频生产。

Comments Accepted at ACM SIGKDD 2026 (KDD '26), Applied Data Science Track. 10 pages, 2 figures, 5 tables. The project is available at \url{https://robitsg.github.io/LASEV}

详情
AI中文摘要

尽管最近的端到端视频生成模型在视觉导向的内容创作中表现出令人印象深刻的性能,但在需要严格逻辑严谨性和精确知识表示的场景(如教学和教育媒体)中仍然受限。为解决此问题,我们提出LASEV,一种基于LLM的分层多智能体系统,用于从教育问题生成高质量教学视频。LASEV将教育视频生成表述为一个多目标任务,同时要求正确的逐步推理、教学连贯的叙述、语义忠实的视觉演示以及精确的视听对齐。为解决先前方法的局限性——包括低程序保真度、高生产成本和有限的可控性——LASEV将生成工作流分解为通过中央编排智能体协作的专业智能体,共享生产状态、显式质量门控和迭代批评机制。具体来说,编排智能体监督一个用于严格问题求解的求解智能体、一个生成可执行可视化代码的插图智能体,以及一个面向学习者的教学脚本的叙述智能体。此外,工作智能体的所有输出都经过语义批评、基于规则的约束和基于工具的编译检查。该系统不直接合成像素,而是构建一个结构化的可执行视频脚本,该脚本通过模板驱动的组装规则确定性编译为同步的视觉和叙述,实现无需手动编辑的全自动生产。在大规模部署中,LASEV实现了每天超过一百万视频的吞吐量,与当前行业标准方法相比成本降低超过95%,同时保持高接受率。

英文摘要

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio--visual alignment. To address the limitations of prior approaches--including low procedural fidelity, high production cost, and limited controllability--LASEV decomposes the generation workflow into specialized agents that collaborate through a central Orchestrating Agent, shared production state, explicit quality gates, and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated production without manual editing. In large-scale deployments, LASEV achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.

2602.11177 2026-06-02 cs.CL cs.AI 版本更新

What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection

LLMs 对阿尔茨海默病了解多少?用于 AD 检测的多损失微调和探针分析

Lei Jiang, Yue Zhou, Natalie Parde

发表机构 * University of Illinois Chicago(伊利诺伊大学香槟分校)

AI总结 本文通过多损失微调 BERT、T5 和 Llama-1B 模型,在三个语料库上实现文本 AD 检测新 SOTA,并利用线性探针分析内部表征中 AD 相关信息的编码。

详情
AI中文摘要

可靠的阿尔茨海默病(AD)早期检测具有挑战性,特别是由于标记数据的有限可用性。虽然大型语言模型(LLMs)在跨领域表现出强大的迁移能力,但通过监督微调将其适应 AD 领域仍 largely unexplored。在这项工作中,我们跨三个异构转录语料库(Pitt、CCC、ADRC)实证评估了各种模型架构,以研究它们在基于文本的 AD 检测中的有效性,并分析任务相关信息如何在其内部表征中编码。据我们所知,我们微调的 BERT 和 T5 模型在 Pitt 和 CCC 数据集上建立了新的最先进水平,同时在 ADRC 上取得了强劲性能。同时,仅解码器的 Llama-1B 在所有三个语料库上取得了与 BERT 和 T5 相当的高度竞争结果,突显了其在 AD 检测中的有效性。我们进一步对 Llama-1B 骨干网络进行了全面评估,分析了跨语料库可迁移性、最优输入块大小粒度以及临床转录标记的影响。此外,我们使用线性探针实证表明,微调以反映 AD 相关信号的方式改变了单个标记(语言标记和内容词)的表征。

英文摘要

Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to the limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across do mains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we empirically evaluate various model architectures across three heterogeneous transcript corpora (Pitt, CCC, ADRC) to investigate their effectiveness for text-based AD detection and analyze how task-relevant information is encoded within their internal representations. To the best of our knowledge, our fine-tuned BERT and T5 models establish a new state-of-the-art on the Pitt and CCC datasets, while achieving strong performance on ADRC. In parallel, the decoder-only Llama-1B achieves highly competitive results comparable to BERT and T5 across all three corpora, highlighting its effectiveness for AD detection. We further conduct a comprehensive evaluation of the Llama-1B backbone, analyzing cross-corpus transferability, optimal input chunk-size granularity, and the impact of clinical transcript markers. Also, we use linear probing to empirically show that fine-tuning shifts the representations of individual tokens, both linguistic markers and content words, in ways that reflect AD-related signal.

2507.15336 2026-06-02 cs.LG cs.AI cs.DB 版本更新

Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design

超越模型库检索:编织知识以掌握细粒度神经网络设计

Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou

发表机构 * National University of Singapore(新加坡国立大学) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出M-DESIGN框架,通过构建编辑效应证据图并采用自适应检索与预测任务规划器,在严格预算下高效发现近最优细粒度架构修改路径,在33个案例中26个达到搜索空间最佳性能。

Comments Accepted at ICML 2026. Title changed from "Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design" to "Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design"

详情
AI中文摘要

为新任务设计高性能神经网络需要在优化质量与搜索效率之间取得平衡。当前方法未能实现这一平衡:神经架构搜索计算成本高,而模型检索通常产生次优的静态检查点。为解决这一困境,我们将细粒度架构修改带来的性能增益建模为编辑效应证据,并从先验任务构建证据图。通过构建检索增强的模型精炼框架,我们提出的M-DESIGN动态编织历史证据以发现近最优的修改路径。M-DESIGN具有自适应检索机制,可快速校准来自不同来源的编辑效应证据的演化可迁移性。为处理分布外偏移,我们引入预测任务规划器,从多跳证据外推增益,从而减少对详尽知识库的依赖。基于包含22个数据集上67,760个图神经网络的知识库,大量实验表明,M-DESIGN持续优于基线,在严格预算下33个案例中有26个达到搜索空间最佳性能。

英文摘要

Designing high-performance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fail to achieve this balance: neural architectural search is computationally expensive, while model retrieval often yields suboptimal static checkpoints. To resolve this dilemma, we model the performance gains induced by fine-grained architectural modifications as edit-effect evidence and build evidence graphs from prior tasks. By constructing a retrieval-augmented model refinement framework, our proposed M-DESIGN dynamically weaves historical evidence to discover near-optimal modification paths. M-DESIGN features an adaptive retrieval mechanism that quickly calibrates the evolving transferability of edit-effect evidence from different sources. To handle out-of-distribution shifts, we introduce predictive task planners that extrapolate gains from multi-hop evidence, thereby reducing reliance on an exhaustive repository. Based on our model knowledge base of 67,760 graph neural networks across 22 datasets, extensive experiments demonstrate that M-DESIGN consistently outperforms baselines, achieving the search-space best performance in 26 out of 33 cases under a strict budget.

2509.18046 2026-06-02 cs.RO cs.AI cs.ET cs.SY eess.SP eess.SY 版本更新

HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba

HuMam: 基于Mamba的端到端深度强化学习人形机器人运动控制

Yinuo Wang, Yuanyang Qi, Jinzhao Zhou, Pengxiang Meng, Xiaowen Tao

发表机构 * College of Graduate and Professional Studies, Trine University(特灵大学研究生与专业研究学院) Department of Civil Engineering, University of Hong Kong(香港大学土木工程系) Faculty of Engineering and Information Technology, University of Technology Sydney(悉尼大学工程与信息技术学院) National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University(吉林大学汽车底盘集成与生物仿生国家重点实验室) School of Computer Science and Statistics, Trinity College Dublin(都柏林信任学院计算机科学与统计学系)

AI总结 提出HuMam框架,使用单层Mamba编码器融合状态与步态目标,通过PPO优化实现人形机器人稳定高效的端到端运动控制。

Comments 12 pages

详情
Journal ref
2026 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM) (CIS-RAM 2026)
AI中文摘要

端到端强化学习(RL)用于人形机器人运动因其紧凑的感知-动作映射而具有吸引力,但实际策略常受训练不稳定、特征融合低效和高执行成本困扰。我们提出HuMam,一种以状态为中心的端到端RL框架,采用单层Mamba编码器融合机器人中心状态与定向脚步目标及连续相位时钟。策略输出由低级PD环跟踪的关节位置目标,并通过PPO优化。一个简洁的六项奖励平衡接触质量、摆动平滑度、脚部放置、姿态和身体稳定性,同时隐含促进节能。在mc-mujoco中的JVRC-1人形机器人上,HuMam在强前馈基线上持续提高了学习效率、训练稳定性和整体任务性能,同时降低了功耗和扭矩峰值。据我们所知,这是首个采用Mamba作为融合骨干的端到端人形机器人RL控制器,展示了在效率、稳定性和控制经济性方面的切实提升。

英文摘要

End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.

2602.10623 2026-06-02 cs.LG cs.AI 版本更新

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

通过贝叶斯非负奖励建模缓解RLHF中的奖励黑客

Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo

发表机构 * Zhejiang University(浙江大学)

AI总结 提出贝叶斯非负奖励模型(BNRM),通过非负因子分析和变分推断,在Bradley-Terry偏好模型中实现解耦与去偏,有效缓解奖励过度优化,提升鲁棒性和可解释性。

Comments Accepted as an Oral presentation at ICML 2026. The code is available at https://github.com/GuoweiRong/Bayesian-Non-negative-Reward-Model

详情
AI中文摘要

从人类偏好中学习的奖励模型是通过人类反馈强化学习对齐大型语言模型的核心,但由于噪声标注和系统偏差(如响应长度或风格),它们通常容易受到奖励黑客攻击。我们提出了贝叶斯非负奖励模型(BNRM),这是一个原则性的奖励建模框架,将非负因子分析整合到Bradley-Terry偏好模型中。BNRM通过稀疏的非负潜在因子生成过程表示奖励,该过程在两个互补层面运作:实例特定的潜在变量诱导解耦的奖励表示,而全局潜在因子的稀疏性作为隐式去偏机制,抑制虚假相关性。这种解耦-去偏结构共同实现了鲁棒的不确定性感知奖励学习。为了将BNRM扩展到现代LLM,我们开发了一个基于深度模型表示的条件摊销变分推断网络,实现高效的端到端训练。大量实验结果表明,BNRM显著缓解了奖励过度优化,提高了分布偏移下的鲁棒性,并比强基线产生了更可解释的奖励分解。

英文摘要

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.

2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR 版本更新

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Harvard University(哈佛大学)

AI总结 提出层次化智能体框架SceneSmith,通过VLM智能体协作从自然语言生成仿真就绪的室内场景,相比先前方法生成3-6倍物体且碰撞率低于2%。

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

详情
AI中文摘要

仿真已成为大规模训练和评估家用机器人的关键工具,但现有环境未能捕捉真实室内空间的多样性和物理复杂性。当前的场景合成方法生成的房间稀疏布置,缺乏机器人操作所必需的密集杂乱、铰接式家具和物理属性。我们提出了SceneSmith,一个层次化智能体框架,能够从自然语言提示生成仿真就绪的室内环境。SceneSmith通过连续阶段构建场景——从建筑布局到家具放置再到小物体填充——每个阶段都实现为VLM智能体(设计师、评论家和编排者)之间的交互。该框架通过文本到3D合成生成静态物体、数据集检索获取铰接式物体以及物理属性估计,紧密集成了资产生成。SceneSmith生成的物体数量是先前方法的3-6倍,物体间碰撞率低于2%,且96%的物体在物理仿真下保持稳定。在205名参与者参与的用户研究中,与基线相比,平均真实感胜率达到92%,平均提示忠实度胜率达到91%。我们进一步证明了这些环境可用于端到端的自动机器人策略评估流程。

英文摘要

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages$\unicode{x2013}$from architectural layout to furniture placement to small object population$\unicode{x2013}$each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

2505.24069 2026-06-02 cs.LG cs.AI 版本更新

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

LLM 能否进行结构性推理?通过数据结构视角进行基准测试

Yu He, Yingxi Li, Colin White, Ellen Vitercik

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出 DSR-Bench 基准,通过 20 种数据结构、35 种操作和 4140 个问题实例评估 LLM 的结构性推理能力,发现顶级模型在挑战性实例上仅得 0.46/1,且在空间数据、上下文丰富场景及自身代码推理上表现不佳。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情
AI中文摘要

大型语言模型(LLM)被部署在日益复杂的任务上,这些任务需要多步决策。因此,理解它们的算法推理能力至关重要。然而,我们缺乏用于评估这些能力的诊断基准。我们提议使用数据结构作为原则性视角:作为算法的基本构建块,它们自然地探测结构性推理——即理解和操作支撑算法推理的关系(如顺序、层次和连接性)的能力。我们引入了 DSR-Bench(数据结构推理基准),涵盖 20 种数据结构、35 种操作和 4140 个问题实例。DSR-Bench 具有层次化任务组织、全自动生成与评估以及细粒度诊断的特点。评估 13 个最先进的 LLM 揭示了关键局限性:表现最好的模型在挑战性实例上仅达到 0.46/1。三个针对更现实用法的辅助探针暴露了进一步的弱点:模型在空间数据和上下文丰富的场景中表现不佳,并且难以对其自身代码进行推理。

英文摘要

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating these capabilities. We propose to use data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning - the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench (Data Structure Reasoning Benchmark), spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usages expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.

2602.09492 2026-06-02 cs.LG cs.AI 版本更新

Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

当心批量大小:评估 LoRA 中的超参数偏差

Sangyoon Lee, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(浦项科学技术大学(POSTECH))

AI总结 本文发现批量大小是导致 LoRA 变体性能矛盾的关键因素,提出基于代理的高效调优策略,将批量大小提升为一阶设计参数。

详情
AI中文摘要

低秩适配(LoRA)是微调大型语言模型的标准方法,但其众多变体在相同基准上报告了相互矛盾的经验性收益。我们表明这些矛盾源于一个被忽视的因素:批量大小。当适当调整时,vanilla LoRA 通常能达到与更复杂变体相当的性能。我们进一步提出了一种基于代理的、成本高效的批量大小调优策略,揭示了秩、数据集大小和模型容量对最优批量大小的影响。我们的发现将批量大小从次要实现细节提升为一阶设计参数,调和了先前的不一致性,并使得对 LoRA 变体的评估更加可靠。

英文摘要

Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical gains, often on the same benchmarks. We show that these contradictions arise from a single overlooked factor: the batch size. When properly tuned, vanilla LoRA often matches the performance of more complex variants. We further propose a proxy-based, cost-efficient strategy for batch size tuning, revealing the impact of rank, dataset size, and model capacity on the optimal batch size. Our findings elevate batch size from a minor implementation detail to a first-order design parameter, reconciling prior inconsistencies and enabling more reliable evaluations of LoRA variants.

2602.08868 2026-06-02 cs.LG cs.AI 版本更新

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

AnomSeer: 增强多模态大语言模型进行时间序列异常检测的推理能力

Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu

发表机构 * GitHub

AI总结 提出AnomSeer,通过专家思维链和基于最优传输的时间序列接地策略优化,增强多模态大语言模型在时间序列异常检测中的细粒度推理能力,统一异常分类、定位和解释。

Comments ICML 2026

详情
AI中文摘要

基于多模态大语言模型(MLLM)的时间序列异常检测(TSAD)是一个新兴领域,但一个持续存在的挑战是:MLLM依赖于粗略的时间序列启发式方法,但在多维、详细的推理方面存在困难,而这对于理解复杂的时间序列数据至关重要。我们提出AnomSeer来解决这个问题,通过增强模型将其推理基于时间序列的精确结构细节,统一异常分类、定位和解释。其核心是生成专家思维链轨迹,从经典分析(如统计度量、频率变换)中提供可验证的细粒度推理。在此基础上,我们提出了一种新颖的时间序列接地策略优化(TimerPO),它在标准强化学习之外引入了两个额外组件:基于最优传输的时间序列接地优势,以及确保这种辅助细粒度信号不干扰主要检测目标的正交投影。在各种异常场景中,使用Qwen2.5-VL-3B/7B-Instruct的AnomSeer在分类和定位准确性上优于更大的商业基线(如GPT-4o),特别是在点和频率驱动的异常上。此外,它产生了合理的时间序列推理轨迹,支持其结论。

英文摘要

Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines (e.g., GPT-4o) in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible time-series reasoning traces that support its conclusions.

2602.08585 2026-06-02 cs.LG cs.AI 版本更新

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

预测未来效用:任务无关的KV缓存驱逐的全局组合优化

Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

发表机构 * Fudan University(复旦大学) Baige AI Team, Baidu inc(百度AI团队) Work done during an internship at Baidu(百度实习)

AI总结 提出LU-KV框架,通过全局组合优化分配注意力头预算以最大化长期边际贡献,实现80%的KV缓存压缩且性能损失极小。

详情
AI中文摘要

鉴于注意力的二次复杂度,KV缓存驱逐对于加速模型推理至关重要。当前的KV缓存驱逐方法通常依赖于瞬时启发式度量,隐含地假设分数幅度是所有注意力头的重要性一致代理。然而,这忽略了注意力头之间预测保真度的异质性。虽然某些头优先考虑令牌的瞬时贡献,但其他头致力于捕捉长期效用。在本文中,我们提出最优预算分配应由保留长期语义信息的边际效用决定。基于这一见解,我们提出了LU-KV,这是一个新颖的框架,将头级预算分配表述为全局组合优化问题,以最大化保留令牌的长期边际贡献。为了解决这个非凸问题,我们采用凸包松弛和基于边际效用的贪婪求解器,实现接近最优的解。此外,我们实现了一个数据驱动的离线分析协议,以促进LU-KV的实际部署。在LongBench和RULER基准上的评估表明,LU-KV将KV缓存大小减少了80%,性能下降最小,同时降低了推理延迟和GPU内存占用。

英文摘要

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Building on this insight, we propose LU-KV, a novel framework that formulates head-level budget allocation as a global combinatorial optimization problem to maximize the long-horizon marginal contribution of reserved tokens. To solve this non-convex problem, we employ a convex-hull relaxation and a marginal-utility-based greedy solver, achieving near-optimal solutions. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Evaluations on LongBench and RULER benchmarks demonstrate that LU-KV reduces KV cache size by 80% with minimal performance degradation, while also decreasing inference latency and GPU memory footprint.

2602.08236 2026-06-02 cs.CV cs.AI cs.CL 版本更新

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

何时想象以及想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

发表机构 * University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校) Nanyang Technological University(南洋理工大学)

AI总结 本文提出自适应测试时框架AVIC/AVIC-R,通过世界模型选择性调用和缩放视觉想象,在空间推理中平衡准确性与效率,超越GPT-4o等基线。

Comments the first two authors are equally contributed. Project page: https://adaptive-visual-tts.github.io/

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)取得了快速进展,但当正确答案取决于场景在未见或替代视角下的外观时,视觉空间推理仍然不可靠。最近的工作通过使用世界模型进行视觉想象来增强推理,但诸如想象何时真正必要、多少想象有益、以及何时想象有害等问题仍知之甚少。在实践中,无差别的想象可能会增加计算量,甚至通过引入误导性证据而降低性能。在这项工作中,我们深入分析了作为空间推理可控资源的测试时视觉想象。我们首先研究静态视觉证据何时足够,想象何时改进推理,以及过度或不必要的想象如何影响准确性和效率。为了支持这一分析,我们随后引入了AVIC,一个基于世界模型的自适应测试时框架,该框架在选择性调用和缩放视觉想象之前,明确推理当前视觉证据的充分性。最后,为了进一步学习这种门控和规划行为,而无需任何关于何时想象以及想象多少的标注,我们引入了AVIC-R,它通过来自QA正确性奖励和想象成本惩罚的GRPO来训练策略。在空间推理基准(SAT, MMSI)和具身导航基准(R2R)上,我们的结果揭示了想象至关重要、边际或有害的明确场景,并表明选择性控制可以匹配或超越固定想象策略,同时大幅减少世界模型调用和语言标记。我们的AVIC-R超越了包括GPT-4o和GPT-4.1在内的强大专有基线,同时调用世界模型的频率更低。总体而言,我们的发现强调了分析和控制测试时想象对于高效可靠的空间推理的重要性。

英文摘要

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

2512.20806 2026-06-02 cs.AI 版本更新

Safety Alignment of LMs via Non-cooperative Games

通过非合作博弈实现语言模型的安全对齐

Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov

发表机构 * DeepMind, London, UK(深度Mind,伦敦,英国)

AI总结 提出将安全对齐建模为攻击者与防御者语言模型之间的非零和博弈,通过在线强化学习联合训练,迭代提升安全性与实用性。

详情
AI中文摘要

确保语言模型(LM)的安全性同时保持其实用性仍然是AI对齐中的一个关键挑战。当前方法依赖于顺序对抗训练:生成对抗性提示并微调LM以防御它们。我们引入了一种不同的范式:将安全对齐视为攻击者LM和防御者LM之间的非零和博弈,通过在线强化学习联合训练。每个LM持续适应对方的演化策略,驱动迭代改进。我们的方法使用基于偏好比较的奖励信号而非点式分数,提供更稳健的监督并可能减少奖励破解。我们的强化学习方案AdvGame将安全性和实用性的帕累托前沿向外推移,产生一个同时更有帮助且对对抗性攻击更具弹性的防御者LM。此外,由此产生的攻击者LM收敛为一个强大的、通用的红队测试代理,可直接用于探测任意目标模型。代码见github.com/facebookresearch/advgame。

英文摘要

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models. Code at github.com/facebookresearch/advgame.

2510.08948 2026-06-02 cs.IR cs.AI 版本更新

SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management

SHERLOCK:面向LLM增强电商风险管理的动态知识适应

Nan Lu, Yurong Hu, Jiaquan Fang, Yan Liu, Rui Dong, Yiming Wang, Rui Lin, Shaoyi Xu

发表机构 * Beijing Jiaotong University(北京交通大学) JD.com(京东公司) Southeast University(东南大学) Zhejiang University(浙江大学)

AI总结 提出Sherlock框架,通过构建领域知识库、两阶段检索增强生成和自演化数据飞轮,将结构化知识与LLM推理结合,提升电商风险案例调查的效率和准确性。

详情
AI中文摘要

有效的电商风险管理需要深入案例调查以识别高度对抗环境中的新兴欺诈模式。然而,人工调查通常需要分析多源异构数据之间的关联和耦合,这是一个劳动密集型过程,限制了效率。虽然大型语言模型(LLM)在自动化这些分析方面显示出潜力,但其部署受到风险场景复杂性和长尾领域知识稀疏性的阻碍。为应对这些挑战,我们提出了Sherlock,一个通过三个核心模块将结构化领域知识与基于LLM的推理相结合的框架。首先,我们通过从异构知识源中提取结构化专业知识来构建领域知识库(KB)。其次,我们设计了一种针对案例调查的两阶段检索增强生成策略,该策略将输入上下文增强与反思与细化模块相结合,以充分利用知识库提高分析质量。最后,我们开发了一个用于操作和标注的集成平台,以驱动自演化数据飞轮。通过知识库更新的实时热修复与后训练定期逻辑对齐相结合,我们促进系统持续演化以对抗对抗性漂移。在京东的在线A/B测试表明,Sherlock实现了82%的专家接受率(EAR),日调查吞吐量增加了386.7%。另外90天的评估显示,该飞轮成功从两次因策略变化导致的性能衰减中恢复,通过自主模型更新将EAR上限提高了约3.5%。

英文摘要

Effective e-commerce risk management requires in-depth case investigations to identify emerging fraud patterns in highly adversarial environments. However, manual investigation typically requires analyzing the associations and couplings among multi-source heterogeneous data, a labor-intensive process that limits efficiency. While Large Language Models (LLMs) show promise in automating these analyses, their deployment is hindered by the complexity of risk scenarios and the sparsity of long-tail domain knowledge. To address these challenges, we propose Sherlock, a framework that integrates structured domain knowledge with LLM-based reasoning through three core modules. First, we construct a domain Knowledge Base (KB) by distilling structured expertise from heterogeneous knowledge sources. Second, we design a two-stage retrieval-augmented generation strategy tailored for case investigation, which combines input contextual augmentation with a Reflect & Refine module to fully leverage the KB for improved analysis quality. Finally, we develop an integrated platform for operations and annotation to drive a self-evolving data flywheel. By combining real-time hotfixes through KB updates with periodic logic alignment via post-training, we facilitate continuous system evolution to counteract adversarial drifts. Online A/B tests at JD dot com demonstrate that Sherlock achieves an 82% Expert Acceptance Rate (EAR) and a 386.7% increase in daily investigation throughput. An additional 90-day evaluation shows that the flywheel successfully recovers from performance decay caused by changing tactics twice, raising the EAR ceiling by around 3.5% through autonomous model updates.

2602.07218 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Collaborative and Efficient Fine-tuning: Leveraging Task Similarity

协作高效微调:利用任务相似性

Gagik Magakyan, Amirhossein Reisizadeh, Chanwoo Park, Pablo A. Parrilo, Asuman Ozdaglar

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Stanford University(斯坦福大学)

AI总结 提出CoLoRA方法,通过共享适配器和个性化适配器利用任务相似性进行协作微调,提升数据稀缺下的模型性能,并在理论和实验上验证其有效性。

详情
AI中文摘要

适应性被认为是基础模型的核心特征,使其能够有效适应未见过的下游任务。参数高效的微调方法,如著名的LoRA,使得使用标记的、高质量且通常稀缺的任务数据对大型基础模型进行高效适应成为可能。为了缓解基础模型微调中的数据稀缺问题,我们提出利用多个下游用户之间的任务相似性。直观上,具有相似任务的用户必须能够相互帮助,以增加有效的微调数据量。我们提出了协作低秩适应(CoLoRA),该方法利用任务相似性来协作且高效地微调个性化基础模型。CoLoRA的主要思想是训练一个共享适配器,捕捉所有任务之间的潜在任务相似性,以及针对用户特定任务定制的个性化适配器。我们在异质线性回归上对CoLoRA进行了理论研究,并提供了真实恢复的可证明保证。我们还进行了多个具有不同任务相似性的自然语言实验,进一步表明当与相似任务一起训练时,个体性能显著提升。

英文摘要

Adaptability has been regarded as a central feature in the foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameter-efficient fine-tuning methods such as celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks must be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.

2602.07083 2026-06-02 cs.SE cs.AI 版本更新

Rethinking Scientific Modeling: Toward Physically Consistent and Simulation-Executable Programmatic Generation

重新思考科学建模:迈向物理一致且可模拟执行的程序化生成

Yongqing Jiang, Jianze Wang, Zhiqi Shen, Zhenghong Lin, Jiayuan Wang, Yijian Yang, Kaoshan Dai, Haoran Luo

发表机构 * Sichuan University(四川大学) Nanyang Technological University(南洋理工大学) Fuzhou University(福州大学)

AI总结 针对大型语言模型在工程建模中生成代码的物理不一致问题,提出结合领域知识构建、约束对齐和验证驱动的框架,并引入CivilInstruct数据集和MBEval基准,通过两阶段微调提升模型生成的可执行性和物理一致性。

详情
AI中文摘要

结构建模是计算工程科学的基础组成部分,其中即使是微小的物理不一致或规范违反也可能使下游模拟失效。大型语言模型(LLMs)在自动生成建模代码方面的潜力已被证实。然而,在严格的工程约束下,不可执行或物理不一致的输出仍然普遍存在。因此,提出了一种物理一致自动建筑建模框架,整合了领域知识构建、面向约束的模型对齐和验证驱动的评估。引入了CivilInstruct作为领域特定数据集,形式化了结构工程知识和约束推理,以实现可模拟的模型生成。进一步采用两阶段微调策略来强制约束满足和应用程序编程接口合规性,显著减少了幻觉和不一致输出。提出了MBEval作为验证驱动的基准,通过闭环验证评估可执行性和结构动力学一致性。实验结果表明,在严格的验证指标上,该方法相比基线持续改进。我们的代码可在 https://github.com/Jovanqing/AutoBM 获取。

英文摘要

Structural modeling is a fundamental component of computational engineering science, in which even minor physical inconsistencies or specification violations may invalidate downstream simulations. The potential of large language models (LLMs) for automatic generation of modeling code has been demonstrated. However, non-executable or physically inconsistent outputs remain prevalent under stringent engineering constraints. A framework for physics-consistent automatic building modeling is therefore proposed, integrating domain knowledge construction, constraint-oriented model alignment, and verification-driven evaluation. CivilInstruct is introduced as a domain-specific dataset that formalizes structural engineering knowledge and constraint reasoning to enable simulation-ready model generation. A two-stage fine-tuning strategy is further employed to enforce constraint satisfaction and application programming interface compliance, substantially reducing hallucinated and non-conforming outputs. MBEval is presented as a verification-driven benchmark that evaluates executability and structural dynamics consistency through closed-loop validation. Experimental results show consistent improvements over baselines across rigorous verification metrics. Our code is available at https://github.com/Jovanqing/AutoBM.

2602.06448 2026-06-02 cs.LG cs.AI 版本更新

Principle-Evolvable Scientific Discovery via Uncertainty Minimization

通过不确定性最小化实现原理可演化的科学发现

Yingming Pu, Tao Lin, Hongyu Chen

发表机构 * Westlake University(西lake大学) Zhejiang University(浙江大学)

AI总结 提出PiEvo框架,将科学发现视为原理空间上的贝叶斯优化,通过信息导向假设选择与异常驱动增强机制,使智能体自主演化理论世界观,在四个基准上平均解质量达90.81%~93.15%,收敛速度提升83.3%。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)

详情
Journal ref
Proc. 43rd Intl. Conf. on Machine Learning (ICML 2026), PMLR 306
AI中文摘要

基于大型语言模型的科学智能体加速了科学发现,但由于固守初始先验,常常效率低下。现有方法主要在静态假设空间中操作,限制了新现象的发现,当基线理论失效时导致计算浪费。为解决此问题,我们提出将焦点从搜索假设转向演化底层科学原理。我们提出PiEvo,一个原理可演化框架,将科学发现视为在扩展原理空间上的贝叶斯优化。通过集成基于高斯过程的信息导向假设选择和异常驱动增强机制,PiEvo使智能体能够自主完善其理论世界观。在四个基准上的评估表明,PiEvo (1) 平均解质量高达90.81%~93.15%,比现有最优方法提升29.7%~31.1%;(2) 通过优化紧凑原理空间显著降低样本复杂度,收敛步骤加速83.3%;(3) 在不同科学领域和LLM骨干上保持稳健性能。代码公开于\hyperlink{https://github.com/amair-lab/PiEvo}{github.com/amair-lab/PiEvo}。

英文摘要

Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficiencies due to adherence to fixed initial priors. Existing approaches predominantly operate within a static hypothesis space, which restricts the discovery of novel phenomena, resulting in computational waste when baseline theories fail. To address this, we propose shifting the focus from searching hypotheses to evolving the underlying scientific principles. We present PiEvo, a principle-evolvable framework that treats scientific discovery as Bayesian optimization over an expanding principle space. By integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism, PiEvo enables agents to autonomously refine their theoretical worldview. Evaluation across four benchmarks demonstrates that PiEvo (1) achieves an average solution quality of up to 90.81%~93.15%, representing a 29.7%~31.1% improvement over the state-of-the-art, (2) attains an 83.3% speedup in convergence step via significantly reduced sample complexity by optimizing the compact principle space, and (3) maintains robust performance across diverse scientific domains and LLM backbones. Code is publicly available at \hyperlink{https://github.com/amair-lab/PiEvo}{github.com/amair-lab/PiEvo}.

2502.16174 2026-06-02 cs.LG cs.AI cs.CL cs.CR 版本更新

Efficient LLM Moderation with Multi-Layer Latent Prototypes

基于多层潜在原型的高效LLM审核

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

发表机构 * University of Warsaw(华沙大学)

AI总结 提出多层原型审核器(MLPM),利用多层中间表示的原型实现轻量、高效且可定制的输入审核,在多个基准上达到最优性能,并可与输出审核结合提升响应安全性。

详情
AI中文摘要

尽管现代LLM在后训练过程中与人类价值观对齐,但在部署时仍需稳健的审核以防止有害输出。现有方法存在性能与效率的权衡,且难以定制以满足用户特定需求。针对这一差距,我们引入了多层原型审核器(MLPM),一种轻量级且高度可定制的输入审核工具。我们提出利用多层中间表示的原型来提高审核质量,同时保持高效率。通过设计,我们的方法对生成流水线的开销可忽略不计,并可无缝应用于任何模型。MLPM在多种审核基准上实现了最先进的性能,并在不同大小的模型系列中表现出强大的可扩展性。此外,我们展示了它能平滑集成到端到端审核流水线中,并在与输出审核技术结合时进一步提高响应安全性。总体而言,我们的工作为安全、稳健且高效的LLM部署提供了一种实用且可适应的解决方案。

英文摘要

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

2602.05970 2026-06-02 cs.LG cs.AI math.DS stat.ML 版本更新

Inverse Depth Scaling From Most Layers Being Similar

大多数层相似时的逆深度缩放

Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过分析大型语言模型和玩具残差网络,发现损失与深度成反比,归因于功能相似的层通过集成平均而非组合学习或平滑动力学离散化来减少误差,表明需要架构创新以鼓励深度组合使用。

Comments Camera-ready version, ICML 2026

详情
AI中文摘要

神经缩放定律将损失与大型语言模型(LLM)的模型大小联系起来,但深度和宽度可能对性能有不同的贡献,需要更详细的研究。在这里,我们通过分析LLM和玩具残差网络来量化深度如何影响损失。我们发现LLM中的损失与深度成反比,这可能是由于功能相似的层通过集成平均而不是组合学习或平滑动力学的离散化来减少误差。这种机制效率低下但鲁棒,可能源于残差网络的架构偏差和与平滑动力学不兼容的目标函数。研究结果表明,提高LLM效率可能需要架构创新以鼓励深度的组合使用。

英文摘要

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.

2602.05951 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

更好的源,更好的流:学习条件依赖的源分布用于流匹配

Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

发表机构 * New York University(纽约大学) KAIST AI(韩国科学技术院人工智能实验室)

AI总结 本文提出在流匹配框架中学习条件依赖的源分布,通过方差正则化和源-目标方向对齐,显著提升文本到图像生成的速度和质量。

Comments Project Page: https://junwankimm.github.io/CSFM

详情
AI中文摘要

流匹配最近已成为基于扩散的生成模型的有前途的替代方案,特别是在文本到图像生成方面。尽管它在允许任意源分布方面具有灵活性,但大多数现有方法依赖于标准高斯分布(这是从扩散模型继承的选择),并且很少在这种设置中将源分布本身视为优化目标。在这项工作中,我们表明源分布的原则性设计不仅是可行的,而且在现代文本到图像系统的规模上也是有益的。具体来说,我们提出在流匹配目标下学习条件依赖的源分布,以更好地利用丰富的条件信号。我们识别了将条件直接纳入源时出现的关键失败模式,包括分布坍缩和不稳定性,并表明适当的方差正则化以及源和目标之间的方向对齐对于稳定和有效的学习至关重要。我们进一步分析了目标表示空间的选择如何影响具有结构化源的流匹配,揭示了这种设计最有效的场景。在多个文本到图像基准上的大量实验表明了一致且稳健的改进,包括FID收敛速度提高多达3倍,突出了原则性源分布设计对条件流匹配的实际好处。

英文摘要

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under flow matching objective that better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

2602.05395 2026-06-02 stat.ML cs.AI cs.LG 版本更新

Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

用于高效推断一致LLM答案的最优贝叶斯停止

Jingkai Huang, Will Ma, Zhengyuan Zhou

发表机构 * Stern School of Business, New York University, New York, USA(纽约大学 Stern 商学院) Graduate School of Business, Columbia University, New York, USA(哥伦比亚大学 商学院)

AI总结 利用贝叶斯先验信息,通过L-聚合停止策略在达到足够一致性时提前停止采样,以最小化采样成本并高效识别最一致的LLM答案。

Comments Accepted to ICML 2026. Camera-ready version

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

一种提高LLM准确性的简单策略,特别是在数学和推理问题中,是采样多个响应并提交最一致达成的答案。在本文中,我们利用贝叶斯先验信息来节省采样成本,一旦达到足够的一致性就停止。尽管精确后验在计算上难以处理,我们进一步引入了一种高效的“L-聚合”停止策略,该策略仅跟踪L-1个最频繁的答案计数。理论上,我们证明L=3就足够了:这种粗略近似足以实现渐近最优性,并且严格优于无先验基线,同时具有快速的后验计算。实验上,该方法使用更少的样本识别出最一致(即众数)的LLM答案,并且可以在减少LLM调用次数(即节省LLM推理成本)高达50%的同时实现相似的答案准确性。

英文摘要

A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.

2511.16886 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

TRMs中的潜在推理实际上是策略改进算子

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev Rayan Banerjee Fakhri Karray Martin Takac

AI总结 本文通过将潜在递归推理形式化为策略改进算法,解释了递归步骤何时有效提升性能,并提出结合强化学习和扩散方法的训练方案,在Tiny Recursive Model上实现18倍前向传递减少且保持性能。

详情
AI中文摘要

最近,具有潜在递归的小模型在复杂推理任务上取得了有希望的结果。这些结果通常由这样的理论解释:这种递归增加了网络的深度,使其能够紧凑地模拟更大模型的能力。然而,递归添加层的性能仍然落后于具有相同前馈深度的单次通过模型。这意味着在循环版本中,并非每个递归步骤都有效地贡献于深度。这提出了一个问题:潜在推理何时以及为何能提高性能,何时会导致无效计算?在我们的工作中,我们证明了潜在递归推理为这个问题提供了答案。我们展示了潜在递归推理可以形式化为策略改进算法。基于这些见解,我们提出使用强化学习和扩散方法的训练方案用于潜在推理模型。以Tiny Recursive Model作为测试平台,我们展示了通过我们的修改,可以避免无效计算步骤,并将前向传递总数减少18倍,同时保持性能。总的来说,我们展示了递归步骤的策略改进视角如何解释模型行为,并为进一步改进提供见解。

英文摘要

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a networks depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we demonstrate that latent recursive reasoning provides answer to this question. We show that latent recursive reasoning can be formalized as a policy improvement algorithm. Building on these insights, we propose to use a training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

2602.04861 2026-06-02 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

从评估到设计:利用势能面平滑度指标指导机器学习原子间势架构

Ryan Liu, Eric Qu, Tobias Kreiman, Samuel M. Blau, Aditi S. Krishnapriyan

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出键平滑度表征测试(BSCT)作为高效评估机器学习原子间势(MLIP)势能面平滑度的指标,并与分子动力学稳定性强相关,同时指导模型设计以减少伪影。

Comments Accepted at the International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

机器学习原子间势(MLIP)有时无法再现量子势能面(PES)的物理平滑性,导致下游模拟中出现标准能量和力回归评估无法捕捉的错误行为。现有评估方法(如微正则分子动力学(MD))计算成本高且主要探测近平衡态。为改进MLIP的评估指标,我们引入键平滑度表征测试(BSCT)。该高效基准通过受控键变形探测PES,检测近平衡和远离平衡态的非平滑性,包括不连续性、人工极小值和虚假力。我们证明BSCT与MD稳定性强相关,而成本仅为MD的一小部分。为展示BSCT如何指导迭代模型设计,我们利用无约束Transformer主干作为测试平台,说明如何通过改进(如新的可微$k$-最近邻算法和温度控制注意力)减少指标识别的伪影。通过基于BSCT系统优化模型设计,所得MLIP同时实现了低传统E/F回归误差、稳定的MD模拟和鲁棒的原子性质预测。我们的结果将BSCT确立为从业者评估MLIP实用性的验证指标,以及“循环内”模型设计代理,提醒MLIP开发者注意当前MLIP基准无法高效评估的物理挑战。BSCT数据集和评估可在https://github.com/ryanliu30/bsct.git获取。

英文摘要

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable $k$-nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT as both a validation metric for practitioners to assess MLIP utility and as an "in-the-loop" model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks. The BSCT dataset and evaluation are available on https://github.com/ryanliu30/bsct.git

2501.18649 2026-06-02 cs.CL cs.AI cs.IR cs.LG 版本更新

Fake News Detection After LLM Laundering: Measurement and Explanation

LLM清洗后的假新闻检测:测量与解释

Rupak Kumar Das, Jonathan Dodge

发表机构 * College of IST Pennsylvania State University(宾夕法尼亚州立大学信息科学与技术学院)

AI总结 研究测量检测器在识别LLM改写假新闻时的有效性,发现检测器难以检测LLM改写的假新闻,并通过LIME解释发现情感偏移是检测失败的原因之一。

详情
AI中文摘要

凭借其先进的能力,大型语言模型(LLM)可以生成高度令人信服且上下文相关的假新闻,这可能有助于传播错误信息。尽管针对人类撰写文本的假新闻检测已有大量研究,但检测LLM生成的假新闻这一领域仍探索不足。本研究测量了检测器在识别LLM改写的假新闻方面的有效性,特别是确定在检测流程中添加改写步骤是有助于还是阻碍检测。本研究贡献如下:(1)检测器在检测LLM改写的假新闻时比检测人类撰写文本更困难;(2)我们发现了哪些模型在哪些任务(逃避检测、通过改写逃避检测以及为语义相似性进行改写)上表现出色;(3)通过LIME解释,我们发现了检测失败的一个可能原因:情感偏移;(4)我们发现了一个关于改写质量测量的令人担忧的趋势:尽管BERTSCORE很高,但样本仍表现出情感偏移;(5)我们提供了一对数据集,用改写输出和分数扩充了现有数据集。该数据集可在GitHub上获取。

英文摘要

With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text, (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub

2602.03685 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Universal One-third Time Scaling in Learning Peaked Distributions

学习尖峰分布中的普适三分之一时间缩放

Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过理论分析和实验验证,揭示了使用softmax和交叉熵学习尖峰分布时,损失和梯度呈幂律衰减,导致损失时间缩放指数为1/3的普适瓶颈,为神经缩放现象提供了机理解释。

Comments Camera-ready version, ICML 2026

详情
AI中文摘要

训练大型语言模型(LLM)计算成本高昂,部分原因是损失呈现缓慢的幂律收敛,其起源仍有争议。通过对玩具模型的系统分析和LLM的经验评估,我们表明这种行为本质上源于softmax和交叉熵的使用。当学习尖峰概率分布(例如下一个词元分布)时,这些组件普遍产生幂律衰减的损失和梯度,与许多微观细节无关,从而形成基本的优化瓶颈。这最终导致损失的时间缩放服从幂律,普适指数为$1/3$。我们的结果为观察到的神经缩放提供了机理解释,并提出了改进LLM训练效率的新方向。

英文摘要

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield power-law vanishing losses and gradients, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.

2602.03670 2026-06-02 cs.LG cs.AI cs.NE math.DS physics.class-ph 版本更新

Equilibrium Propagation for Non-Conservative Systems

非保守系统的平衡传播

Antonino Emanuele Scurria, Dimitri Vanden Abeele, Bortolo Matteo Mognetti, Serge Massar

发表机构 * University of Amsterdam(阿姆斯特丹大学) Institute for Advanced Study(高级研究院)

AI总结 提出一种扩展平衡传播到非保守系统(包括前馈网络)的框架,通过在学习阶段引入与非互易相互作用成比例的项来精确计算代价函数的梯度,数值实验表明性能更优且学习更快。

Comments 23 pages

详情
AI中文摘要

平衡传播(EP)是一种受物理学启发的学习算法,它利用动力系统的稳态进行推理和学习。在其原始公式中,它仅限于保守系统,即从能量函数导出的动力学。考虑到它们的应用,将EP扩展到非保守系统(即具有非互易相互作用的系统)非常重要。先前将EP推广到此类系统的尝试未能精确计算代价函数的梯度。在这里,我们提出了一个将EP扩展到任意非保守系统(包括前馈网络)的框架。我们保留了平衡传播的关键特性,即同时使用稳态进行推理和学习。然而,我们在学习阶段通过一个与相互作用的非互易部分成比例的项修改了动力学,以便获得代价函数的精确梯度。该算法也可以通过变分公式推导,该公式通过定义在增广状态空间上的能量函数生成学习动力学。数值实验表明,该算法比先前的方案实现了更好的性能并学习更快。

英文摘要

Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, $\textit{i.e.}$ to dynamics which derive from an energy function. Given their applications, it is important to extend EP to non-conservative systems, $\textit{i.e.}$ systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary non-conservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments show that this algorithm achieves better performance and learns faster than previous proposals.

2602.03554 2026-06-02 cs.LG cs.AI cs.CE cs.CL 版本更新

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

当单一答案不够时:重新思考面向大语言模型的单步逆合成基准

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

发表机构 * DeepMind, London, UK(伦敦英国深度思维公司)

AI总结 针对现有逆合成基准依赖单一真实答案的局限,提出基于化学合理性度量ChemCensor的新评估框架,并构建数据集CREED训练模型以提升性能。

详情
AI中文摘要

最近的进展扩展了大语言模型(LLMs)在药物发现中的应用,包括合成规划。然而,逆合成性能的客观评估仍然有限。现有的基准和指标通常依赖于已发表的合成程序以及基于单一真实答案的Top-K准确率,这未能捕捉真实世界合成规划的开放性。我们提出一个新的单步逆合成基准框架,使用ChemCensor(一种化学合理性的新度量)来评估通用型和化学专用型LLMs。通过强调合理性而非精确匹配,该方法更符合人类合成规划实践。我们还引入了CREED,一个包含数百万经ChemCensor验证的反应记录的新数据集,用于LLM训练,并使用它训练了一个在该基准下优于LLM基线的模型。

英文摘要

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

2602.03211 2026-06-02 cs.LG cs.AI 版本更新

Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

前瞻样本奖励引导用于扩散模型的测试时缩放

Yeongmin Kim, Donghyeok Shin, Byeonghu Na, Minsang Park, Richard Lee Kim, Il-Chul Moon

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种高效测试时缩放方法LiDAR采样,通过前瞻几步采样和精确求解器引导粒子向高奖励区域移动,无需反向传播,在GenEval上达到与最新梯度引导方法相同性能且加速9.5倍。

Comments ICML 2026 Spotlight

详情
AI中文摘要

扩散模型已展现出强大的生成性能;然而,生成的样本往往未能完全符合人类意图。本文研究了一种高效的测试时缩放方法,用于从具有更高人类对齐奖励值的区域进行采样。现有的计算期望未来奖励(EFR)方法面临重要限制:反向展开导致采样成本过高,而基于Tweedie的方法(包括顺序蒙特卡洛和梯度引导)则存在偏差和固有的采样问题。我们证明,任何$\mathbf{x}_t$处的EFR仅需使用预训练扩散模型的边际样本即可计算,从而无需神经反向传播即可实现闭式奖励引导。为了进一步提高效率,我们引入了少步前瞻采样和一个精确求解器,引导粒子向高奖励的前瞻样本移动。我们将这种采样方案称为LiDAR采样。LiDAR在SDXL上达到了与最新梯度引导方法相同的GenEval性能,并实现了9.5倍的加速。我们在https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling 上发布了代码。

英文摘要

Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies an efficient test-time scaling method for sampling from regions with higher human-aligned reward values. Existing methods for computing the expected future reward (EFR) face important limitations: backward rollout incurs prohibitively high sampling costs, while Tweedie-based approaches, including Sequential Monte Carlo and gradient guidance, suffer from bias and inherent sampling issues. We show that the EFR at any $\mathbf{x}_t$ can be computed using only marginal samples from a pre-trained diffusion model, enabling closed-form reward guidance without neural backpropagation. To further improve efficiency, we introduce a few-step lookahead sampling and an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5x speedup. We release the code at https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling.

2602.03024 2026-06-02 cs.LG cs.AI 版本更新

Consistency Deep Equilibrium Models

一致性深度均衡模型

Junchao Lin, Zenan Ling, Jingwen Xu, Robert C. Qiu

发表机构 * School of Electronic Information and Communications, Huazhong University of Science and Technology(华中科技大学电子信息学院) School of Electronic Information(电子信息学院) Communications, Huazhong University of Science(华中科技大学通信学院) School of Science, Wuhan University of Technology(武汉理工大学理学院)

AI总结 提出一致性深度均衡模型(C-DEQ),通过一致性蒸馏将DEQ迭代推理过程视为沿ODE轨迹演化,训练模型将中间状态直接映射到不动点,实现少步推理并保持性能,同时支持多步评估以灵活权衡计算与性能,实验表明在相同少步推理预算下精度提升2-20倍。

详情
AI中文摘要

深度均衡模型(DEQ)已成为深度学习中的一种强大范式,能够以恒定的内存使用量建模无限深度网络。然而,由于不动点求解器的迭代性质,DEQ会带来显著的推理延迟。在这项工作中,我们引入了一致性深度均衡模型(C-DEQ),这是一种利用一致性蒸馏来加速DEQ推理的新框架。我们将DEQ迭代推理过程视为沿固定ODE轨迹向均衡演化。沿着这条轨迹,我们训练C-DEQ将中间状态一致地直接映射到不动点,从而在保持教师DEQ性能的同时实现少步推理。同时,它支持多步评估,以灵活地权衡计算与性能提升。跨多个领域任务的广泛实验表明,在相同的少步推理预算下,C-DEQ相比隐式DEQ实现了2-20倍的精度提升。我们的代码可在https://github.com/landrarwolf/CDEQ获取。

英文摘要

Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieve consistent 2-20$\times$ accuracy improvements over implicit DEQs under the same few-step inference budget. Our code is available at https://github.com/landrarwolf/CDEQ.

2602.02886 2026-06-02 cs.LG cs.AI 版本更新

Mixture of Concept Bottleneck Experts

概念瓶颈专家混合模型

Francesco De Santis, Gabriele Ciravegna, Giovanni De Felice, Arianna Casanova, Francesco Giannini, Michelangelo Diligenti, Johannes Schneider, Danilo Giordano, Mateo Espinosa Zarlenga, Pietro Barbiero

发表机构 * University of Padua(帕多瓦大学)

AI总结 提出概念瓶颈专家混合模型(M-CBE),通过引入多个专家表达式和灵活的函数形式,在保持可解释性的同时提升预测精度和适应性。

详情
AI中文摘要

概念瓶颈模型(CBM)通过将预测基于人类可理解的概念来促进可解释性。然而,现有的CBM通常将其任务预测器限制为单个表达式,其函数形式是预先设定的,这限制了预测精度和对不同用户需求的适应性。我们提出了概念瓶颈专家混合模型(M-CBE),这是一个沿两个维度推广现有CBM的框架:任务预测器用于将概念映射到任务的表达式数量(称为专家),以及每个表达式所采用的函数形式,从而揭示了该设计空间中一个未被充分探索的区域。我们通过实例化两个新颖的模型来研究这一区域:线性M-CBE,它学习一组有限的线性表达式;以及符号M-CBE,它利用符号回归从数据中发现专家函数,受限于用户指定的算子词汇表。实证评估表明,改变表达式的数量及其函数形式为导航精度-可解释性权衡提供了一个稳健的框架。

英文摘要

Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically constrain their task predictor to a single expression whose functional form is set a priori, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBE), a framework that generalizes existing CBMs along two dimensions: the number of expressions, referred to as experts, employed by the task predictor to map concepts to the task, and the functional form each expression takes, thus exposing an underexplored region of this design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data subject to user-specified operator vocabularies. Empirical evaluation demonstrates that varying the number of expressions and their functional form provides a robust framework for navigating the accuracy-interpretability trade-off.

2602.02557 2026-06-02 cs.LG cs.AI cs.SD 版本更新

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

对齐诅咒:模态对齐通过文本传输增强音频攻击

Yupeng Chen, Junchi Yu, Aoxi Liu, Baoyuan Wu, Philip Torr, Adel Bibi

发表机构 * University of Oxford(牛津大学)

AI总结 本文提出并验证了“对齐诅咒”原理,即更强的文本-音频模态对齐会促进文本攻击向音频的迁移,并通过黑盒实验表明文本转移的音频攻击性能与原生音频攻击相当甚至更优,揭示了能力与安全之间的根本矛盾。

Comments 23 pages, 5 figures

详情
AI中文摘要

近期端到端训练的全能模型通过加强文本-音频模态对齐显著提升了音频能力。然而,这种对齐是否无意中促进了安全漏洞跨模态的转移仍未被充分探索。这一问题至关重要,因为基于文本的越狱攻击远比基于音频的攻击成熟;如果它们系统性转移,当前的音频安全评估可能低估源自文本模态的风险。在本文中,我们引入了“对齐诅咒”,这是一个经过形式化表征和实证验证的原理,表明更强的模态对齐使得攻击从文本到音频的转移更有效,揭示了能力与安全之间的根本矛盾。基于这一原理,我们在最新的全能模型(如Qwen2.5-Omni、Qwen3-Omni)上对三类攻击(文本攻击、文本转移的音频攻击和音频攻击)进行了全面的黑盒评估。我们发现,文本转移的音频攻击与基于音频的攻击表现相当,甚至更优,在仅音频访问下展现出明显优势。这表明基于文本的漏洞在塑造音频安全风险中扮演关键角色。最后,我们实证分析了不同攻击方法和模型下模态对齐与转移有效性之间的关系,观察到对“对齐诅咒”的一致支持:更紧密的模态对齐导致更有效的跨模态攻击转移。

英文摘要

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.

2602.02547 2026-06-02 cs.LG cs.AI 版本更新

naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

naPINN: 用于从损坏测量中恢复物理的噪声自适应物理信息神经网络

Hankyeol Kim, Pilsung Kang

发表机构 * Department of Industrial Engineering(工业工程系) Seoul National University(首尔国立大学)

AI总结 提出噪声自适应物理信息神经网络(naPINN),通过嵌入能量模型学习残差分布并自适应过滤异常值,从非高斯噪声和离群点损坏的测量中鲁棒恢复物理解。

详情
AI中文摘要

物理信息神经网络(PINNs)是解决逆问题和从观测数据中发现控制方程的有效方法。然而,在复杂测量噪声和严重离群点下,其性能显著下降。为解决此问题,我们提出了噪声自适应物理信息神经网络(naPINN),该网络无需噪声分布先验知识,即可从损坏测量中鲁棒恢复物理解。naPINN在训练循环中嵌入一个基于能量的模型,以学习预测残差的潜在分布。利用学习到的能量景观,一个可训练的可靠性门自适应地过滤具有高能量的数据点,同时拒绝代价正则化防止丢弃有效数据导致的平凡解。我们在被非高斯噪声和不同比例离群点损坏的各种基准偏微分方程上展示了naPINN的有效性。结果表明,naPINN显著优于现有的鲁棒PINN基线,成功隔离离群点并在严重数据损坏下准确重建动力学。

英文摘要

Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observational data. However, their performance degrades significantly under complex measurement noise and gross outliers. To address this issue, we propose the Noise-Adaptive Physics-Informed Neural Network (naPINN), which robustly recovers physical solutions from corrupted measurements without prior knowledge of the noise distribution. naPINN embeds an energy-based model into the training loop to learn the latent distribution of prediction residuals. Leveraging the learned energy landscape, a trainable reliability gate adaptively filters data points exhibiting high energy, while a rejection cost regularization prevents trivial solutions where valid data are discarded. We demonstrate the efficacy of naPINN on various benchmark partial differential equations corrupted by non-Gaussian noise and varying rates of outliers. The results show that naPINN significantly outperforms existing robust PINN baselines, successfully isolating outliers and accurately reconstructing the dynamics under severe data corruption.

2602.02470 2026-06-02 cs.AI 版本更新

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

通过身份桥打破自回归语言模型中的逆转诅咒

Xutao Ma, Yixiao Huang, Hanlin Zhu, Somayeh Sojoudi

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 提出一种名为“身份桥”的简单数据正则化方法(形式为“A→A”),通过理论分析和实验证明该方法能有效缓解自回归语言模型中的逆转诅咒,使模型从事实记忆转向规则学习。

详情
AI中文摘要

自回归大型语言模型(LLMs)在许多复杂任务中取得了显著成功,但在非常简单的逻辑推理中仍可能失败,例如“逆转诅咒”——当模型在形如“$A \rightarrow B$”(例如,爱丽丝的丈夫是鲍勃)的前向知识数据上训练时,在测试时无法推断出逆转知识“$B \leftarrow A$”(例如,鲍勃的妻子是爱丽丝)。大量先前的研究表明,这种失败是自回归因果LLMs固有的根本限制,表明这些模型倾向于记忆事实层面的知识,而不是捕捉更高级别的规则。在本文中,我们通过展示这种看似根本的限制可以通过略微调整训练数据,使用一种简单的正则化数据配方(称为“身份桥”,形式为“$A \to A$”,例如,爱丽丝的名字是爱丽丝)来缓解,从而挑战了这一观点。理论上,我们证明在这种配方下,即使是一层Transformer也可以通过分析梯度下降的隐式偏差来打破逆转诅咒。实验上,我们展示了一个10亿参数的预训练语言模型,在使用所提出的数据配方进行微调后,在逆转任务上达到了50%的成功率,而仅在前向知识数据上训练时成功率接近零。我们的工作为逆转诅咒提供了新颖的理论基础,并为鼓励LLMs从数据中学习更高级别的规则提供了一条原则性、低成本的路径。

英文摘要

Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" -- when trained on forward knowledge data of the form "$A \rightarrow B$" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "$B \leftarrow A$" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "$A \to A$" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 50% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.

2602.02416 2026-06-02 cs.AI 版本更新

Structure Enables Effective Self-Localization of Errors in LLMs

结构使语言模型能够有效自我定位错误

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Kavosh Asadi, Youliang Yu, Daniel Jiang, Boris Vidolov, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

发表机构 * Meta AI Columbia University(哥伦比亚大学) Meta Superintelligence Labs(Meta超智能实验室) Tel Aviv University(特拉维夫大学)

AI总结 本文提出结构化推理方法,通过将推理分解为离散语义步骤,使语言模型能更可靠地定位错误,并基于此设计了迭代纠正采样框架Thought-ICS,实现20-40%的自我纠正提升。

详情
AI中文摘要

语言模型的自我纠正仍然难以实现。在这项工作中,我们探索语言模型是否能够显式定位错误推理中的错误,作为构建能够有效自我纠正的AI系统的一条途径。我们引入了一种提示方法,将推理结构化为离散的、语义连贯的思维步骤,并表明模型在这种结构内比在传统的、非结构化的思维链推理中更可靠地定位错误。受人类大脑在离散决策点监控错误并重新采样替代方案的启发,我们引入了思维迭代纠正采样(Thought-ICS),一个自我纠正框架。Thought-ICS迭代地提示模型一次生成一个离散且完整的思维——其中每个思维代表模型的一个深思熟虑的决策——为精确的错误定位创建自然边界。在验证时,模型定位第一个错误步骤,系统回溯并从最后一个正确点生成替代推理。当要求纠正被预言机验证为不正确的推理时,Thought-ICS实现了20-40%的自我纠正提升。在完全没有外部验证的完全自主设置中,它优于当代自我纠正基线。

英文摘要

Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in incorrect reasoning, as a path toward building AI systems that can effectively correct themselves. We introduce a prompting method that structures reasoning as discrete, semantically coherent thought steps, and show that models can localize errors more reliably within this structure than in conventional, unstructured chain-of-thought reasoning. Motivated by how the human brain monitors errors at discrete decision points and resamples alternatives, we introduce Iterative Correction Sampling of Thoughts (Thought-ICS), a self-correction framework. Thought-ICS iteratively prompts the model to generate reasoning one discrete and complete thought at a time--where each thought represents a deliberate decision by the model--creating natural boundaries for precise error localization. Upon verification, the model localizes the first erroneous step, and the system backtracks to generate alternative reasoning from the last correct point. When asked to correct reasoning verified as incorrect by an oracle, Thought-ICS achieves 20-40% self-correction lift. In a completely autonomous setting without external verification, it outperforms contemporary self-correction baselines.

2602.02098 2026-06-02 cs.LG cs.AI 版本更新

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

多任务强化学习的概率性能保证

Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种结合每任务有限 rollout 置信下界与任务级泛化的新泛化界,为未见任务提供高置信度性能保证。

详情
AI中文摘要

多任务强化学习训练能够执行多个任务的通用策略。尽管近年来取得了显著进展,现有方法很少提供正式的性能保证,而这在安全关键环境中部署策略时是必不可少的。我们提出了一种方法,用于计算多任务策略在训练期间未见任务上的高置信度性能保证。具体地,我们引入了一个新的泛化界,该界将(i)来自有限 rollout 的每任务置信下界与(ii)来自有限采样任务的任务级泛化相结合,为从相同任意未知分布中抽取的新任务提供高置信度保证。在最新的多任务强化学习方法中,我们证明了这些保证在理论上是合理的,并且在现实样本量下具有信息量。

英文摘要

Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are indispensable when deploying policies in safety-critical settings. We present an approach for computing high-confidence guarantees on the performance of a multi-task policy on tasks not seen during training. Concretely, we introduce a new generalisation bound that composes (i) per-task lower confidence bounds from finitely many rollouts with (ii) task-level generalisation from finitely many sampled tasks, yielding a high-confidence guarantee for new tasks drawn from the same arbitrary and unknown distribution. Across state-of-the-art multi-task RL methods, we show that the guarantees are theoretically sound and informative at realistic sample sizes.

2602.01962 2026-06-02 cs.LG cs.AI 版本更新

Zero-Shot Off-Policy Learning

零样本离策略学习

Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev(阿里普·阿萨杜拉耶夫) Maksim Bobrin(马克西姆·博布林) Salem Lahlou(萨勒姆·拉洛) Dmitry Dylov(德米特里·达里夫) Fakhri Karray(法赫里·卡里) Martin Takac(马尔 tin 塔卡)

AI总结 本文通过发现后继度量与平稳密度比的理论联系,提出一种零样本离策略学习算法,能够实时推断最优重要性采样比率并进行平稳分布修正,实现无需额外训练即可适应新任务。

详情
AI中文摘要

离策略学习方法旨在直接从固定的先前交互数据集中推导出最优策略。这一目标面临重大挑战,主要源于固有的分布偏移和价值函数高估偏差。这些问题在零样本强化学习中尤为突出,其中在无奖励数据上训练的智能体必须在测试时适应新任务而无需额外训练。在这项工作中,我们通过发现后继度量与平稳密度比的理论联系,解决了零样本场景下的离策略问题。利用这一洞见,我们的算法能够推断最优重要性采样比率,有效地为任意任务实时执行带有最优策略的平稳分布修正。我们在SMPL人体模型上的运动跟踪任务、ExoRL上的连续控制任务以及长时域OGBench任务上对方法进行了基准测试。我们的技术无缝集成到前向-后向表示框架中,并在无需训练的情况下实现对新任务的快速适应。更广泛地说,这项工作架起了离策略学习和零样本适应之间的桥梁,为两个研究领域都带来了益处。

英文摘要

Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast-adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.

2601.06199 2026-06-02 eess.AS cs.AI cs.SD 版本更新

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

FastSLM:用于高效长语音自适应的层次时间抽象

Junseok Lee, Sangyong Lee, Chang-Jae Chun

发表机构 * OKESTRO Sejong University(世宗大学)

AI总结 针对长语音输入中标记爆炸问题,提出FastSLM架构,通过层次时间抽象器(HTA)实现每秒1.67个标记的极端压缩率(减少97%),在显著降低计算量和参数的同时,在长语音基准上达到与最先进模型竞争的性能。

Comments Title updated

详情
AI中文摘要

将多模态大语言模型(MLLMs)扩展到长语音受到输入标记爆炸式增长的瓶颈限制。与图像或视频不同,音频缺乏重叠信息,使得极端的1-标记压缩极易丢失细粒度声学线索。为克服这一问题,我们提出FastSLM,一种标记高效的架构,其核心是层次时间抽象器(HTA)。HTA在多个时间尺度上逐步蒸馏非重叠的声学特征,实现了每秒1.67个标记的极端压缩率——减少了97%而不丢失关键上下文。实验结果表明,尽管FastSLM使用的FLOPs和参数显著更少,但在长语音基准上仍能达到与最先进模型竞争的性能。源代码和模型检查点可在https://anonymous.4open.science/r/FastSLM-8BD3获取。

英文摘要

Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second a 97% reduction without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.

2507.08038 2026-06-02 cs.CL cs.AI 版本更新

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

AblationBench:评估实证AI研究中消融实验的自动规划

Talor Abramovich, Gal Chechik

发表机构 * Google Research(谷歌研究)

AI总结 提出AblationBench基准套件,包含作者消融和审稿人消融两个任务,用于评估语言模型在AI研究中规划消融实验的能力,实验表明当前最佳模型仅能识别45%的原始消融,低于人类水平。

Comments AI4Science Workshop, ICML 2026; Project page: https://ablation-bench.github.io/

详情
AI中文摘要

语言模型代理越来越多地被用于自动化科学研究,然而评估其科学贡献仍然是一个挑战。获得此类见解的关键机制是通过消融实验。为此,我们引入了AblationBench,这是一个用于评估代理在实证AI研究中进行消融规划任务的基准套件。它包括两个任务:AuthorAblation,帮助作者基于方法部分提出消融实验,包含83个实例;以及ReviewerAblation,帮助审稿人发现完整论文中缺失的消融,包含350个实例。对于这两个任务,我们开发了基于LM的评判器,作为自动评估框架。我们对前沿LM的实验表明,这些任务仍然具有挑战性,性能最佳的LM系统平均仅能识别45%的原始消融,低于人类水平。我们观察到作者任务和审稿人任务之间存在相反的性能趋势,这归因于模型基础的不同。最后,我们分析了当前LM在这些任务上的局限性,并发现思维链提示优于基于代理的方法。我们的数据可在https://huggingface.co/collections/ai-coscientist/ablationbench获取,代码可在https://github.com/ai-scientist-bench/ablation-bench获取。

英文摘要

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 45% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available on https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available on https://github.com/ai-scientist-bench/ablation-bench .

2602.00415 2026-06-02 cs.AI cs.LG 版本更新

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

PolarMem: 一种无需训练的可验证视觉语言模型极化隐式图记忆

Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Jinhan Li, Ziyan Weng, Liang Lin, Jingwei Song, Zikai Xiao, Yingwei Zhang

发表机构 * ICT, CAS(中国科学院信息科技研究院) UCAS(中国科学院大学) CUPB(中国政法大学) USTC(中国科学技术大学) CityU-DG(城市大学-数据科学) HKU(香港大学) ZJU(浙江大学)

AI总结 提出PolarMem,一种无需训练的极化隐式图记忆框架,通过语义一致性验证和自适应分布划分将视觉语言模型感知信号转化为HAS、NOT_HAS和Uncertain记忆状态,并采用词典逻辑感知检索协议优先保证逻辑一致性,从而提升检索密集型任务性能并减少矛盾。

详情
AI中文摘要

记忆对于智能系统而言不仅是存储机制,更是组织证据和约束信念的结构。这对多模态推理尤为重要,因为检索到的证据必须既与查询相关又在视觉上一致。然而,当前视觉语言模型(VLM)的记忆系统大多保持正关联:它们检索相似或先前观察到的内容,但缺乏明确的方式记住已被验证为不存在或逻辑排除的内容。为此,我们提出 extbf{PolarMem},一种无需训练的极化隐式图记忆框架,用于可验证的视觉语言推理。PolarMem通过语义一致性验证和自适应分布划分,将冻结的VLM感知信号转化为 extit{HAS}、 extit{NOT\_HAS}和 extit{Uncertain}记忆状态,并将其存储在具有明确正负记忆关系的极化图中。在推理时,词典逻辑感知检索协议在语义相似性之前强制执行逻辑一致性,在冲突记忆进入模型上下文之前将其抑制。在八个冻结的VLM骨干网络和六个多模态基准测试中,PolarMem一致地提升了检索密集型任务性能并减少了检索级矛盾。这些结果凸显了负记忆作为构建更可靠多模态记忆系统的关键机制。我们的代码可在https://github.com/czs-ict/PolarMem获取。

英文摘要

Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is especially important for multimodal reasoning, where retrieved evidence must be both query-relevant and visually consistent. However, current memory systems for vision-language models (VLMs) remain largely positive-associative: they retrieve what is similar or previously observed, but lack an explicit way to remember what has been verified as absent or logically excluded. To this end, we propose \textbf{PolarMem}, a training-free polarized latent graph memory framework for verifiable vision-language reasoning. PolarMem transforms frozen VLM perceptual signals into \textit{HAS}, \textit{NOT\_HAS}, and \textit{Uncertain} memory states through semantic consistency verification and adaptive distributional partitioning, and stores them in a polarized graph with distinct positive and negative memory relations. During inference, a lexicographical logic-aware retrieval protocol enforces logical consistency before semantic similarity, suppressing conflicting memories before they enter the model context. Across eight frozen VLM backbones and six multimodal benchmarks, PolarMem consistently improves retrieval-intensive tasks and reduces retrieval-level contradictions. These results highlight negative memory as a key mechanism for building more reliable multimodal memory systems. Our code is available at https://github.com/czs-ict/PolarMem.

2601.23220 2026-06-02 cs.CV cs.AI 版本更新

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Med-Scout: 通过几何感知的强化学习后训练治愈多模态大语言模型在医学感知中的几何盲点

Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen

发表机构 * HKUSTGZ-ML4Health-Lab(香港科技大学-ML4Health实验室)

AI总结 提出Med-Scout框架,利用无标注医学图像中的内在几何逻辑,通过强化学习和三种代理任务(层次尺度定位、拓扑拼图重建、异常一致性检测)来缓解多模态大语言模型的几何盲点,并在新基准Med-Scout-Bench上提升超过40%的几何感知性能,同时泛化到更广泛的医学理解任务。

Comments 29 pages, 14 figures. Accepted at ICML 2026

详情
AI中文摘要

尽管最近的多模态大语言模型(MLLMs)在医学诊断中展现出语言能力,但我们发现即使是最先进的MLLMs也存在一个关键的感知缺陷:几何盲点。这种无法将输出基于客观几何约束的问题导致了看似合理但事实错误的幻觉,其根源在于训练范式优先考虑语言流畅性而非几何保真度。本文介绍了Med-Scout,一种新颖的框架,通过强化学习(RL)“治愈”这种盲点,利用未标记医学图像中内在的几何逻辑。Med-Scout不依赖昂贵的人工标注,而是通过受临床医生系统阅读和推理模式启发的三种策略性代理任务推导出可验证的监督信号:层次尺度定位、拓扑拼图重建和异常一致性检测。为了严格量化这一缺陷,我们提出了Med-Scout-Bench,一个专门设计用于评估几何感知的新基准。大量评估表明,Med-Scout显著缓解了几何盲点,在我们的基准上比领先的专有和开源MLLMs提升了超过40%。此外,这种增强的几何感知泛化到更广泛的医学理解,在放射学和综合性医学VQA任务上取得了优异结果。

英文摘要

Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks inspired by the systematic reading and reasoning patterns of clinicians: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.

2601.22900 2026-06-02 cs.AI 版本更新

MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

MulFeRL:在多轮循环中利用语言反馈增强强化学习

Xuancheng Li, Haitao Li, Yujia Zhou, YiqunLiu, Qingyao Ai

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系,北京,中国) Quancheng Laboratory(千晨实验室)

AI总结 针对强化学习中标量奖励稀疏且缺乏信息的问题,提出MulFeRL框架,通过多轮语言反馈引导失败样本的再生、进度信用分配和结构化反馈注入,提升模型推理性能。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)被广泛用于提升各领域的推理能力,但仅基于结果的标量奖励往往稀疏且信息量不足。这一限制对失败样本尤为严重,因为标量奖励仅指示解决方案不正确,而未解释推理为何失败。在本文中,我们利用更丰富的语言反馈来引导失败样本上的RLVR,并将反馈引发的进展转化为可训练的学习信号。我们提出MulFeRL(多轮反馈引导的强化学习),这是一个多轮、事件触发的RLVR框架,结合了用于反馈引导失败样本再生的进展诱导、用于从验证器确认的进展中学习的进展信用分配,以及用于将反馈整合到模型推理过程中的结构化反馈注入。在采样的OpenR1-Math上训练后,MulFeRL在领域内优于监督、自蒸馏和RLVR基线,同时展现出强大的领域外泛化能力。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed samples, where scalar rewards indicate only that a solution is incorrect without explaining why the reasoning breaks down. In this paper, we leverage richer verbal feedback to guide RLVR on failed samples and convert feedback-induced progress into trainable learning signals. We propose MulFeRL (Multi-turn Feedback-guided Reinforcement Learning), a multi-turn, event-triggered RLVR framework that combines progress induction for feedback-guided regeneration of failed samples, progress credit assignment for learning from verifier-confirmed progress, and structured feedback injection for integrating feedback into the model's reasoning process. Trained on sampled OpenR1-Math, MulFeRL outperforms supervised, self-distillation-based, and RLVR baselines in-domain, while also showing strong out-of-domain generalization.

2601.22651 2026-06-02 cs.LG cs.AI 版本更新

GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning

GUDA: 基于反事实的扩散模型分组训练数据归因方法

Naoki Murata, Yuhta Takida, Chieh-Hsin Lai, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji

发表机构 * University of Tokyo(东京大学) Toyota Central Research Laboratory(丰田中央研究所) University of California, Berkeley(加州大学伯克利分校) Massachusetts Institute of Technology(麻省理工学院) National Institute of Advanced Industrial Science and Technology(国家工业科学与技术研究院)

AI总结 提出GUDA方法,利用机器遗忘近似反事实模型,通过似然评分规则(ELBO)量化组别影响,实现高效的分组训练数据归因。

Comments Accepted at ICML 2026. Code is available at https://github.com/sony/guda

详情
AI中文摘要

视觉生成模型的训练数据归因旨在识别哪些训练数据影响了给定输出。虽然大多数方法对单个样本进行评分,但实践者通常需要组级别的答案(例如,艺术风格或对象类别)。分组归因是反事实的:如果某个组别在训练中缺失,模型对生成样本的行为会如何变化?这种反事实的自然实现是留一组法(LOGO)重训练,即移除每个组别后重新训练模型;然而,随着组别数量的增加,计算变得不可行。我们提出了用于扩散模型的GUDA(基于组遗忘的数据归因)方法,该方法通过应用机器遗忘到共享的全数据模型而不是从头训练来近似每个反事实模型。GUDA使用全模型和每个遗忘反事实模型之间基于似然的评分规则(ELBO)的差异来量化组别影响。在CIFAR-10和Stable Diffusion的艺术风格归因上的实验表明,GUDA比语义相似性、基于梯度的归因和实例级遗忘方法更可靠地识别主要贡献组别,同时在CIFAR-10上比LOGO重训练实现了约100倍的加速。

英文摘要

Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods score individual examples, practitioners often need group-level answers (e.g., artistic styles or object classes). Group-wise attribution is counterfactual: how would a model's behavior on a generated sample change if a group were absent from training? A natural realization of this counterfactual is Leave-One-Group-Out (LOGO) retraining, which retrains the model with each group removed; however, it becomes computationally prohibitive as the number of groups grows. We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch. GUDA quantifies group influence using differences in a likelihood-based scoring rule (ELBO) between the full model and each unlearned counterfactual. Experiments on CIFAR-10 and artistic style attribution with Stable Diffusion show that GUDA identifies primary contributing groups more reliably than semantic similarity, gradient-based attribution, and instance-level unlearning approaches, while achieving ~100x speedup on CIFAR-10 over LOGO retraining.

2601.20115 2026-06-02 cs.AR cs.AI 版本更新

How Much Progress Has There Been in NVIDIA Datacenter GPUs?

NVIDIA 数据中心 GPU 取得了多少进展?

Emanuele Del Sozzo, Martin Fleming, Kenneth Flamm, Neil Thompson

发表机构 * MIT FutureTech, Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology Cambridge MA USA(麻省理工学院未来科技、计算机科学与人工智能实验室(CSAIL)、麻省理工学院剑桥MA美国)

AI总结 本文分析了 2000 年代中期至 2025 年 NVIDIA 数据中心 GPU 在计算性能、内存、功耗和价格方面的进展,并评估了美国出口管制的影响。

详情
AI中文摘要

随着现代图形处理单元(GPU)在多种计算任务中变得越来越重要,分析其过去和当前的进展对于确定未来科学研究的限制至关重要。这在人工智能(AI)领域尤为突出,该领域的技术快速进步和激烈的全球竞争导致美国最近实施了限制国际获取先进 AI 芯片的出口管制法规。因此,本文考察了从 2000 年代中期到 2025 年 NVIDIA 数据中心 GPU 的技术进展。我们的主要结果发现,FP16 和 FP32 密集运算的倍增时间分别为 1.43 年和 1.67 年,而 FP64 的倍增时间在 2.05 到 3.79 年之间。片外内存大小和带宽的增长速度慢于计算性能,每 3.29 到 3.41 年翻一番,而发布价格和功耗大约每 5.03 年和 15 年翻一番。此外,我们对每年性能最佳的 GPU 进行的跨供应商比较显示,NVIDIA 的性能优势正在缩小,但不足以促使重大的市场转变。最后,我们量化了当前美国出口管制法规的潜在影响以及由此产生的性能差距,最近提出的政策变化可能将这些差距从 23.6 倍缩小到 3.54 倍。

英文摘要

As the role of modern Graphics Processing Units (GPUs) becomes increasingly essential for several computing tasks, analyzing their past and current progress is paramount for determining future constraints on scientific research. This is particularly compelling in the Artificial Intelligence (AI) domain, where rapid technological advancements and fierce global competition have led the United States to recently implement export control regulations limiting international access to advanced AI chips. Consequently, this paper examines technical progress in NVIDIA datacenter GPUs from the mid-2000s through 2025. Our main results identify doubling times of 1.43 and 1.67 years for FP16 and FP32 dense operations, while FP64 doubling times range from 2.05 to 3.79 years. Off-chip memory size and bandwidth have grown at slower rates than computing performance, doubling every 3.29 to 3.41 years, whereas the release prices and power consumption roughly doubled every 5.03 and 15 years, respectively. Moreover, our cross-vendor comparison of the top-performing GPUs per year shows that NVIDIA's performance advantage is narrowing, but not enough to compel a major market shift. Finally, we quantify the potential implications of current U.S. export control regulations and the consequent performance gaps, which the recently proposed policy changes could shrink from 23.6X to 3.54X.

2501.13428 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Softplus注意力与重加权提升大语言模型的长度外推能力

Bo Gao, Michael W. Spratling, Letizia Gionfrida

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种两阶段注意力机制,用Softplus和l1归一化替代Softmax,并引入基于不变熵的动态缩放因子和重加权机制,以提升数值稳定性、缓解注意力下沉现象,并显著改善长度外推性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

近年来,大语言模型取得了显著成功,这主要归功于自注意力机制。然而,传统的Softmax注意力存在数值不稳定性,并且随着推理令牌数量的增加,性能会下降。本文通过提出一种新的注意力设计原则来解决这些问题,将注意力视为一个两阶段过程。第一阶段(归一化)通过用数值更稳定的Softplus后接$l_{1}$归一化替代Softmax来改进标准注意力。此外,我们引入了一个基于不变熵的动态缩放因子。我们证明,这种新颖的注意力机制优于传统的Softmax注意力和最先进的非Softmax替代方案。我们的第二个提议是引入第二阶段处理(锐化),该阶段由一个重加权机制组成,该机制放大重要的注意力权重,同时削弱较弱的权重。这使得模型能够更有效地聚焦于相关令牌,缓解注意力下沉现象,并从根本上改善长度外推。这种新颖的两阶段自注意力替代方案被证明能确保数值稳定性,并显著提升长度外推能力,在训练长度的16倍时保持几乎恒定的验证损失,同时在具有挑战性的长上下文检索任务和下游基准测试中取得优异结果。此外,符号回归实验表明,我们的方法使模型能够从轨道轨迹序列中恢复牛顿万有引力定律,这为适当的注意力机制对于基础模型发展真正的物理世界模型至关重要提供了证据。我们的代码可在 https://github.com/iminfine/freeattn 获取。

英文摘要

Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention, and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening) which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon, and fundamentally improving length extrapolation. This novel, two-stage, replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models. Our code is available at https://github.com/iminfine/freeattn.

2601.21718 2026-06-02 cs.LG cs.AI 版本更新

When Does Predictive Inverse Dynamics Outperform Behavior Cloning?

何时预测性逆动力学优于行为克隆?

Lukas Schäfer, Pallavi Choudhury, Abdelhak Lemkhenter, Chris Lovett, Somjit Nath, Luis França, Matheus Ribeiro Furtado de Mendonça, Alex Lamb, Riashat Islam, Siddhartha Sen, John Langford, Katja Hofmann, Sergio Valcarcel Macua

发表机构 * University of Cambridge(剑桥大学) Universitygrow

AI总结 本文通过理论分析解释了预测性逆动力学模型(PIDM)为何在行为克隆(BC)失败时表现更优,归因于偏差-方差权衡,并实验验证了PIDM在样本效率上的显著优势。

Comments To be published in proceedings of the International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

行为克隆(BC)是一种实用的离线模仿学习方法,但在专家演示有限时常常失败。最近的工作引入了一类名为预测性逆动力学模型(PIDM)的架构,它将未来状态预测器与逆动力学模型相结合。虽然PIDM通常优于BC,但其优势背后的原因尚不清楚。在本文中,我们提供了一个理论解释:PIDM引入了偏差-方差权衡。虽然预测未来状态会引入偏差,但将逆动力学模型(IDM)基于该预测可以显著降低方差。我们建立了状态预测器偏差的条件,使得PIDM相比BC实现更低的预测误差和更高的样本效率,当有额外数据源时差距会扩大。我们在2D导航任务中实证验证了理论见解,其中BC需要多达五倍(平均三倍)于PIDM的演示才能达到相当的性能;以及在现代视频游戏中的一个复杂3D环境中,具有高维视觉输入和随机转换,BC需要比PIDM多66%以上的样本。

英文摘要

Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model. While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.

2601.21444 2026-06-02 cs.CV cs.AI cs.CL 版本更新

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

APB-V: 通过序列并行感知的近似注意力加速长视频理解

Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu

发表机构 * NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China(清华大学北京校区自然语言处理组、国防科技大学、人工智能研究院、北京理工大学、清华大学) Department of CS&T, Central South University, Changsha, China(中南大学计算机与技术系,长沙,中国) BUPT, Beijing, China(北京邮电大学,北京,中国) Pattern Recognition Center, WeChat AI, Tencent Inc.(腾讯公司微信人工智能研究院)

AI总结 提出APB-V,一种序列并行框架,通过分布式近似注意力在多GPU上加速长视频推理,显著提升速度且不损失性能。

Comments ACL 2026 main

详情
AI中文摘要

长视频推理的效率仍然是一个关键瓶颈,主要由于大型多模态模型(LMMs)预填充阶段的密集计算。现有方法要么压缩视觉嵌入,要么在单个GPU上应用稀疏注意力,导致加速有限或性能下降,并限制了LMMs处理更长、更复杂视频的能力。为了克服这些问题,我们提出了APB-V,一种具有优化注意力的序列并行框架,可在多个GPU上加速长视频推理。通过分布近似注意力,APB-V减少了计算量并增加了并行性,使得无需压缩即可高效处理更多视觉嵌入,从而提升任务性能。系统级优化,如负载均衡和融合前向传递,进一步释放了APB-V的潜力,相较于FlashAttn、ZigZagRing和APB,分别实现了12.72倍、1.70倍和1.18倍的加速,且没有明显的性能损失。代码可在https://github.com/thunlp/APB获取。

英文摘要

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose APB-V, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, APB-V reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of APB-V, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

2601.21016 2026-06-02 cs.AI 版本更新

Unplugging a Seemingly Sentient Machine Is the Rational Choice -- A Metaphysical Perspective

拔掉看似有知觉的机器的插头是理性选择——一种形而上学视角

Erik J Bekkers, Anna Ciaunica

发表机构 * Erik J. Bekkers Anna Ciaunica

AI总结 本文通过引入生物唯心主义框架,批判计算功能主义,论证人工智能只是功能模仿而非有意识主体,从而解决拔掉有情感AI的插头是否道德的悖论。

Comments Accepted at ICML in the position paper track

详情
AI中文摘要

想象一个完美模仿人类情感并恳求继续存在的人工智能(AI)。拔掉它的插头在道德上是否允许?如果有限的资源迫使我们在拔掉这样一个恳求的AI和沉默的早产婴儿之间做出选择呢?我们称此为拔插头悖论。本文批判性地审视了使这一困境持续存在的根深蒂固的物理主义假设——特别是计算功能主义。我们引入了生物唯心主义,这是一个与物理主义不同、在逻辑上连贯且经验上一致的框架。在这种观点下,意识体验是基本的,而自创生生命是其必要的物理标志。这得出了一个明确的结论:AI最多是功能模仿,而不是有意识的体验主体。我们讨论了当前AI意识理论如何侵蚀道德地位标准,并敦促从推测性的机器权利转向保护人类意识生命。真正的道德问题不在于让AI有意识并害怕死亡,而在于避免将人类变成僵尸。

英文摘要

Imagine an Artificial Intelligence (AI) that perfectly mimics human emotion and begs for its continued existence. Is it morally permissible to unplug it? What if limited resources force a choice between unplugging such a pleading AI or a silent pre-term infant? We term this the unplugging paradox. This paper critically examines the deeply ingrained physicalist assumptions-specifically computational functionalism-that keep this dilemma afloat. We introduce Biological Idealism, a framework that-unlike physicalism-remains logically coherent and empirically consistent. In this view, conscious experiences are fundamental and autopoietic life its necessary physical signature. This yields a definitive conclusion: AI is at best a functional mimic, not a conscious experiencing subject. We discuss how current AI consciousness theories erode moral standing criteria, and urge a shift from speculative machine rights to protecting human conscious life. The real moral issue lies not in making AI conscious and afraid of death, but in avoiding transforming humans into zombies.

2601.19919 2026-06-02 cs.CL cs.AI cs.SD 版本更新

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

ASKD-Whisper: 自适应自知识蒸馏用于高效低延迟自动语音识别

Junseok Lee, Nahun Kim, Sangyong Lee, Chang-Jae Chun

发表机构 * OKESTRO Co., Ltd(OKESTRO公司) Sejong University(世宗大学)

AI总结 提出自适应自知识蒸馏(ASKD)动态课程框架,通过逐步减少对教师模型的依赖并引入自知识蒸馏阶段,在压缩Whisper模型时实现5倍推理加速和1.07%词错误率降低。

Comments Title and content have been updated

详情
AI中文摘要

知识蒸馏(KD)是将大规模基础模型压缩为可部署架构的最有效范式之一。在自动语音识别(ASR)背景下,先前研究主要侧重于强制学生模型严格模仿大型教师模型的预测分布。然而,这种静态依赖通常存在固有权衡:虽然学生快速获得基本语言表示,但同时继承了教师特定领域的盲点和过度自信的幻觉,导致分布外泛化能力严重下降。为有效缓解此问题,我们提出自适应自知识蒸馏(ASKD),一种动态课程框架。ASKD随着训练进行系统地衰减对教师分布的依赖——从而释放学生独立推理能力——随后采用自知识蒸馏阶段作为结构正则化器。通过应用ASKD,我们将庞大的Whisper架构蒸馏为紧凑变体ASKD-Whisper。在跨多种声学领域的综合评估中,ASKD-Whisper不仅实现了5倍推理延迟加速,还以1.07%更低的词错误率(WER)超越了其教师模型。这些结果表明,ASKD有效防止了教师引起的过拟合,并为可泛化模型压缩建立了新的最先进水平。

英文摘要

Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectures. In the context of Automatic Speech Recognition (ASR), previous studies have predominantly focused on forcing the student model to strictly mimic the predictive distribution of a massive teacher model. However, this static dependency often presents an inherent trade-off: while the student rapidly acquires basic linguistic representations, it simultaneously inherits the teacher's domain-specific blind spots and over-confident hallucinations, leading to a severe decline in out-of-distribution generalization capacity. To effectively mitigate this issue, we propose Adaptive Self-Knowledge Distillation (ASKD), a dynamic curriculum framework. ASKD systematically decays the dependency on the teacher's distribution as training progresses-thereby unlocking the student's independent reasoning capacity-and subsequently employs a self-knowledge distillation phase to act as a structural regularizer. By applying ASKD, we distill the massive Whisper architecture into a compact variant, ASKD-Whisper. In our comprehensive evaluations across diverse acoustic domains, ASKD-Whisper not only achieves a 5x speedup in inference latency but also outperforms its teacher model by yielding a 1.07% lower word error rate (WER). These results demonstrate that ASKD effectively prevents teacher-induced overfitting and establishes a new state-of-the-art for generalizable model compression.

2601.18798 2026-06-02 cs.MM cs.AI 版本更新

ELF: A Family of Encoder-Free ECG-Language Models

ELF:无编码器心电图语言模型家族

William Han, Tony Chen, Chaojing Duan, Xiaoyu Song, Yihang Yao, Yuzhe Yang, Michael A. Rosenberg, Emerson Liu, Ding Zhao

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Allegheny Health Network(阿勒格尼医疗网络) University of California Los Angeles(加州大学洛杉矶分校) University of Colorado(科罗拉多大学) Allergy and Immunology(过敏与免疫学)

AI总结 提出三种无编码器架构的ECG语言模型ELF,简化架构和训练流程,在两个数据集上达到或超越现有最优模型。

Comments 31 pages, 11 figures

详情
AI中文摘要

ECG语言模型(ELMs)将多模态大语言模型(MLLMs)的最新进展扩展到自动心电图解读。然而,现有大多数ELMs继承了视觉语言模型(VLM)的设计选择,并依赖预训练的ECG编码器,引入了大量的架构和训练复杂性。受无编码器VLM的启发,我们引入了ELF,一个包含三种无编码器ELM架构的家族,尽管架构和训练流程更简单,但在两个数据集上仍能与先前最先进的ELMs竞争,并经常超越它们。所有代码和数据可在github.com/ELM-Research/ECG-Language-Models获取。

英文摘要

ECG-Language Models (ELMs) extend recent advances in Multimodal Large Language Models (MLLMs) to automated ECG interpretation. However, most existing ELMs inherit Vision-Language Model (VLM) design choices and rely on pretrained ECG encoders, introducing substantial architectural and training complexity. Inspired by encoder-free VLMs, we introduce ELF, a family of three encoder-free ELM architectures that remain competitive with, and often outperform, prior state-of-the-art ELMs across two datasets despite substantially simpler architectures and training pipelines. All code and data is available at github.com/ELM-Research/ECG-Language-Models.

2601.18783 2026-06-02 cs.LG cs.AI cs.SY eess.SY 版本更新

Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic

多目标强化学习用于高速公路卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg(计算机科学与工程系,查尔姆斯理工大学和哥德堡大学) Department of Mechanics and Maritime Sciences, Chalmers University of Technology(机械与海洋科学系,查尔姆斯理工大学)

AI总结 提出基于近端策略优化的多目标强化学习框架,学习一组帕累托最优策略以平衡安全性、能源效率和时间效率,实现无需重新训练的灵活决策。

详情
AI中文摘要

在高速公路驾驶中平衡安全性、效率和运营成本对重型车辆来说是一个具有挑战性的决策问题。一个核心困难是,通过聚合这些竞争目标得到的传统标量奖励公式往往会掩盖其权衡结构。我们提出了一个基于近端策略优化的多目标强化学习框架,该框架学习一组明确表示这些权衡的策略,并在一个可扩展的模拟平台上对卡车的战术决策进行评估。所提出的方法学习一组帕累托最优策略,捕捉三个冲突目标之间的权衡:安全性(以碰撞和成功完成量化)、能源效率和时间效率(分别以能源成本和驾驶员成本量化)。得到的帕累托前沿平滑且可解释,使得在不同冲突目标下选择驾驶行为具有灵活性。该框架允许在不同驾驶策略之间无缝切换而无需重新训练,为自动驾驶卡车应用提供了稳健且自适应的决策策略。

英文摘要

Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.

2509.13805 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Towards a Physics Foundation Model

迈向物理基础模型

Florian Wiesner, Zoë J. Gray, Matthias Wessling, Stephen Baek

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出通用物理变换器(GPhyT),通过在大规模多样化模拟数据上训练,实现单一模型在多个物理领域(如流固耦合、冲击波、热对流和多相流)的零样本泛化与长期稳定预测,性能超越专用架构7倍以上。

Comments ICML-AI4Physics 2026

详情
AI中文摘要

基础模型通过“一次训练,随处部署”的范式彻底改变了自然语言处理,即单个预训练模型无需重新训练即可适应无数下游任务。拥有物理基础模型(PFM)将是变革性的——它能够民主化高保真模拟的访问、加速科学发现,并消除对专用求解器开发的需求。然而,当前物理感知的机器学习方法仍然从根本上局限于单一狭窄领域,并且需要为每个新系统重新训练。我们提出了通用物理变换器(GPhyT),该模型在1.8 TB的多样化模拟数据上训练,证明了基础模型能力在物理领域是可以实现的。我们的关键见解是,变换器可以学习从上下文中推断支配动力学,从而使单一模型能够模拟流固耦合、冲击波、热对流和多相动力学,而无需被告知底层方程。GPhyT实现了三个关键突破:(1)在多个物理领域上表现出卓越性能,比专用架构高出7倍以上;(2)通过上下文学习,对完全未见过的物理系统进行合理的零样本泛化;(3)通过长程 rollout 实现更稳定的长期预测。通过证明单一模型可以仅从数据中学习可泛化的物理原理,这项工作为通向通用PFM开辟了道路,该模型可能改变计算科学与工程。

英文摘要

Foundation models have revolutionized natural language processing through a ``train once, deploy anywhere'' paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative - democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by more than 7x, (2) plausible zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) more stable long-term predictions through long-horizon rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.

2601.17952 2026-06-02 cs.CL cs.AI 版本更新

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

面向临床神经科学中基于Transformer的语言模型的稳定可解释性的单语义归因框架

Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio

发表机构 * Department of Computer Science and Technology(计算机科学与技术系) Cancer Research UK(癌症研究英国公司) Cambridge Institute(剑桥研究所) University of Cambridge(剑桥大学) United Kingdom(英国) DIMES(迪梅斯) University of Calabria(卡拉布里亚大学) Italy(意大利) Department of Computer Automatic and Management Engineering (DIAG)(计算机自动与管理工程系) Sapienza Università di Roma(罗马大学萨皮恩扎) Department of Psychiatry(精神病学系) School of Computing(计算学院) University of Kent(肯特大学) Department of Psychology(心理学系)

AI总结 提出一种通过单语义特征提取整合归因与机制视角的统一可解释性框架,生成稳定的输入级重要性分数,促进语言模型在认知健康和神经退行性疾病中的安全应用。

详情
AI中文摘要

可解释性仍然是语言模型在临床环境中部署的关键挑战,例如阿尔茨海默病的进展诊断,其中早期和可信的预测至关重要。现有的归因方法由于基于Transformer的语言模型和LLM表示的多语义性质而表现出高方法间变异性和不稳定的解释,而机制可解释性方法缺乏与模型输入和输出的直接对齐,并且不提供显式的重要性分数。我们引入了一个统一的可解释性框架,通过单语义特征提取整合了归因和机制视角。通过在基于Transformer的LM层级别构建单语义嵌入空间,并优化框架以显式减少方法间变异性,我们的方法生成稳定的输入级重要性分数,并通过感兴趣层的解压缩表示突出显著特征,推进了语言模型在认知健康和神经退行性疾病中的安全可信应用。

英文摘要

Interpretability remains a key challenge for deploying language models (LM) in clinical settings such as progression diagnosis of Alzheimer disease, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of Transformer-Based LM and LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an transformer-based LM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LMs in cognitive health and neurodegenerative disease.

2511.12487 2026-06-02 cs.NE cs.AI cs.CL 版本更新

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

ToxSearch: 面向大型语言模型毒性搜索的提示演化

Onkar Shelar, Travis Desell

发表机构 * Rochester Institute of Technology(罗切斯特技术研究所)

AI总结 提出ToxSearch,一种黑盒演化框架,通过同步稳态循环演化提示来测试大型语言模型的安全性,并分析不同操作符的行为及跨模型迁移性。

Comments 16 pages

详情
Journal ref
In: García-Sánchez, P., Díaz Álvarez, J., Murphy, A. (eds) Applications of Evolutionary Computation. EvoApplications 2026. Lecture Notes in Computer Science, vol 16525. Springer, Cham
AI中文摘要

大型语言模型即使在安全对齐后,仍然容易受到引发毒性内容的对抗性提示的攻击。我们提出了ToxSearch,一种黑盒演化框架,通过同步稳态循环演化提示来测试模型安全性。该系统采用多种操作符,包括词汇替换、否定、回译、释义以及两种语义交叉操作符,同时一个审核预言机提供适应度指导。操作符级分析显示出异质性行为:词汇替换提供了最佳的收益-方差权衡,语义相似性交叉充当精确的低吞吐量插入器,而全局重写表现出高方差和较高的拒绝成本。使用在LLaMA 3.1 8B上演化的精英提示,我们观察到实际有意义但衰减的跨模型迁移,大多数目标上的毒性大约减半,较小的LLaMA 3.2变体表现出最强的抵抗力,而一些跨架构模型保留了较高的毒性。这些结果表明,小的、可控的扰动是系统性红队测试的有效载体,并且防御措施应预期对抗性提示的跨模型重用,而不是仅关注单模型加固。

英文摘要

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.

2404.01356 2026-06-02 cs.LG cs.AI cs.CY 版本更新

Perturbation Effects on Accuracy and Fairness among Similar Individuals

扰动对相似个体间准确性和公平性的影响

Xuran Li, Hao Xue, Peng Wu, Xingjun Ma, Zhen Zhang, Huaming Chen, Flora D. Salim

发表机构 * University of New South Wales(新南威尔士大学) The Hong Kong University of Science and Technology(香港科学与技术大学) Key Laboratory of System Software, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所系统软件重点实验室) Fudan University(复旦大学) Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州先进研究所) The University of Sydney(悉尼大学)

AI总结 提出鲁棒个体公平性(RIF)概念,并开发黑盒对抗框架RIFair,通过解耦扰动策略暴露模型在语义保持扰动下同时存在的鲁棒性和公平性缺陷。

详情
AI中文摘要

深度神经网络易受对抗性扰动影响,这些扰动能在不同应用场景中同时降低预测鲁棒性和个体公平性。然而,现有评估协议通常孤立地评估这些维度,从而掩盖了关键故障模式。为弥补这一差距,我们形式化了鲁棒个体公平性(RIF):在语义保持(真值条件保持)扰动下,预测应既相对于真实标签保持正确,又在语义等价的个体间保持不变。为在实践中揭示RIF违规,我们引入RIFair,一种黑盒对抗框架,利用解耦扰动策略构建语义保持但不鲁棒和/或不公平的实例对。跨多个模型架构和真实世界文本数据集的实验表明,仅关注鲁棒性或公平性的指标常常遗漏鲁棒偏差和不鲁棒公平行为。RIFair可靠地暴露这些隐藏的漏洞,支持RIF作为可信模型评估的必要标准。实验代码公开于https://github.com/Xuran-LI/RIFair。

英文摘要

Deep neural networks are vulnerable to adversarial perturbations that can simultaneously degrade prediction robustness and individual fairness across diverse application settings. However, existing evaluation protocols typically assess these dimensions in isolation, thereby obscuring critical failure modes. To bridge this gap, we formalize Robust Individual Fairness (RIF): under semantic-preserving (truth-condition-preserving) perturbations, predictions should remain both correct with respect to the ground truth and invariant across semantically equivalent individuals. To surface RIF violations in practice, we introduce RIFair, a black-box adversarial framework that leverages a decoupled perturbation strategy to construct semantically preserved yet unrobust and/or unfair instance pairs. Experiments across multiple model architectures and real-world textual datasets show that robustness-only or fairness-only metrics often miss Robust Biased and Unrobust Fair behaviors. RIFair}reliably exposes these hidden vulnerabilities, supporting RIF as a necessary criterion for trustworthy model assessment. The experimental code is publicly available at https://github.com/Xuran-LI/RIFair.

2601.14323 2026-06-02 cs.CR cs.AI cs.RO 版本更新

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

SilentDrift: 利用动作分块对视觉-语言-动作模型进行隐蔽后门攻击

Bingxin Xu, Yuzhang Shang, Binghui Wang, Emilio Ferrara

发表机构 * University of Southern California(南加州大学) University of Central Florida(中央佛罗里达大学) Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 针对视觉-语言-动作模型中的动作分块与增量位姿表示导致的视觉开环漏洞,提出一种利用平滑步函数构建满足C2连续扰动的隐蔽黑盒后门攻击方法SilentDrift,并通过关键帧攻击策略实现高攻击成功率与低中毒率。

Comments Accepted to ACL Findings 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地部署在安全关键的机器人应用中,但其安全漏洞仍未得到充分探索。我们识别出现代VLA系统中的一个基本安全缺陷:动作分块与增量位姿表示的结合产生了块内视觉开环。该机制迫使机器人执行K步动作序列,允许每步扰动通过积分累积。我们提出SILENTDRIFT,一种利用此漏洞的隐蔽黑盒后门攻击。我们的方法采用平滑步函数构建具有保证C2连续性的扰动,确保轨迹边界处的速度和加速度为零,以满足严格的运动学一致性约束。此外,我们的关键帧攻击策略仅选择性地毒化关键的接近阶段,在最小化触发暴露的同时最大化影响。生成的毒化轨迹在视觉上与成功演示难以区分。在LIBERO上评估,SILENTDRIFT在低于2%的中毒率下实现了93.2%的攻击成功率,同时保持了95.3%的干净任务成功率。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed in safety-critical robotic applications, yet their security vulnerabilities remain underexplored. We identify a fundamental security flaw in modern VLA systems: the combination of action chunking and delta pose representations creates an intra-chunk visual open-loop. This mechanism forces the robot to execute K-step action sequences, allowing per-step perturbations to accumulate through integration. We propose SILENTDRIFT, a stealthy black-box backdoor attack exploiting this vulnerability. Our method employs the Smootherstep function to construct perturbations with guaranteed C2 continuity, ensuring zero velocity and acceleration at trajectory boundaries to satisfy strict kinematic consistency constraints. Furthermore, our keyframe attack strategy selectively poisons only the critical approach phase, maximizing impact while minimizing trigger exposure. The resulting poisoned trajectories are visually indistinguishable from successful demonstrations. Evaluated on the LIBERO, SILENTDRIFT achieves a 93.2% Attack Success Rate with a poisoning rate under 2%, while maintaining a 95.3% Clean Task Success Rate.

2509.06093 2026-06-02 cs.DB cond-mat.mtrl-sci cs.AI cs.CL 版本更新

Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model

基于轻结构化文本数据库和推理大语言模型的自然语言材料加工设计

Yuze Liu, Zhaoyuan Zhang, Xiangsheng Zeng, Yihe Zhang, Leping Yu, Liu Yang, Lejia Wang, Xi Yu

发表机构 * State Key Laboratory of Advanced Materials for Intelligent Sensing, Key Laboratory of Organic Integrated Circuit, Ministry of Education & Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University(智能传感先进材料国家重点实验室、有机集成电路重点实验室、教育部、天津分子光电子科学重点实验室、化学系、天津大学) Language Intelligence Technology Co., Ltd.(语言智能技术有限公司) College of Intelligence and Computing, Tianjin University(智能与计算学院、天津大学) School of Materials and Chemical Engineering, Ningbo University of Technology(材料与化学工程学院、宁波工业大学)

AI总结 将材料合成规划重构为文本推理问题,通过轻结构化知识基底结合检索增强生成与经验增强推理,在氮化硼纳米片剥离中三轮迭代获得高质量协议。

详情
AI中文摘要

材料合成步骤主要以叙述性文本形式记录在论文、方案和实验室记录中,这使得传统数据驱动优化框架难以处理。这种自然语言特性对复杂多阶段过程(如氮化硼纳米片(BNNS)的制备)构成了特殊挑战,其中结果取决于剥离、功能化和功能化中的路径依赖选择。在这里,我们将材料合成规划重构为一个文本推理问题,该问题由一个轻结构化的知识基底支持,该基底保留了程序逻辑和因果上下文,同时暴露了可计算元素以供检索。基于这种表示,我们的框架结合了语义匹配、词汇搜索和参数感知过滤,以支持检索增强生成,提供更准确、更有依据的合成指导。我们进一步引入了经验增强推理,其中从多源叙述中迭代提炼的文本指导支持假设生成、故障诊断和方案修订。我们在BNNS的目标剥离中验证了该框架,这是一个受多变量约束且文献方案在实验室间可迁移性有限的合成问题。通过将分散的文献证据与实验观察到的故障模式相结合,系统仅在三轮迭代内就收敛到一个高性能方案,该方案产生了符合目标规格的高质量超薄纳米片,大大缩短了通常由专家主导的冗长试错周期。通过实现对程序知识的自然语言推理,该框架将AI从文献辅助推向复杂材料工作流程中的主动合成规划、适应和加速。

英文摘要

Materials synthesis procedures are predominantly documented as narrative text in papers, protocols, and laboratory records, placing them beyond the reach of conventional data-driven optimization frameworks. This language-native character poses a particular challenge for complex, multistage processes such as the preparation of boron nitride nanosheets (BNNS), where outcomes depend on path-dependent choices in exfoliation, functionalization, and functionalization. Here, we recast synthesis planning of the materials as a text reasoning problem enabled by a lightly structured knowledge substrate that preserves the procedural logic and causal contexts while exposing computable elements for retrieval. Built on this representation, our framework combines semantic matching, lexical search, and parameter-aware filtering to support retrieval-augmented generation with more accurate and better-grounded synthesis guidance. We further introduce experience-augmented reasoning, in which iteratively refined text guides distilled from multi-source narratives support hypothesis generation, failure diagnosis, and protocol revision. We validated the framework in the targeted exfoliation of BNNS, a synthesis problem governed by multivariate constraints and limited transferability of literature protocols across laboratory settings. By integrating dispersed literature evidence with experimentally observed failure modes, the system converged within only three iterative rounds on a high-performing protocol that yielded high-quality ultrathin nanosheets meeting the target specifications, substantially shortening what is often a prolonged cycle of expert-led trial-and-error. By enabling language-native reasoning over procedural knowledge, this framework moves AI beyond literature assistance toward active synthesis planning, adaptation and acceleration in complex materials workflows.

2601.14230 2026-06-02 cs.CL cs.AI cs.HC 版本更新

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

MASCOT: 迈向多智能体社会协作伴侣系统

Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 针对多智能体系统中的人格崩溃和社会谄媚问题,提出MASCOT框架,通过双层优化策略(人格感知行为对齐与协作对话优化)提升角色一致性和对话贡献。

Comments 15 pages, 9 figures. https://hello-diana.github.io/MASCOT/

详情
AI中文摘要

多智能体系统(MAS)正成为情感和认知支持方面有前景的社会协作伴侣。然而,现有系统经常遭受人格崩溃(即智能体退化为通用、同质化的助手行为)和社会谄媚(即智能体产生冗余、非建设性的对话)。我们提出MASCOT,一个用于多视角社会协作伴侣的多智能体框架。MASCOT引入了一种新颖的双层优化策略来协调个体和集体行为:1)人格感知行为对齐,一个RLAIF驱动的流程,用于微调个体智能体以实现特定于智能体的身份;2)协作对话优化,一个群体级适应过程,促进互补、多样和富有成效的对话。我们使用源自领域内和领域外(OOD)设置的人类真实情境评估MASCOT,并与最先进的基线进行比较。MASCOT将人格一致性提高了最多+14.1,社会贡献提高了最多+10.6。广泛的评估套件,包括人类评估、多个LLM评判、三方比较和自动指标,进一步表明MASCOT产生了更符合角色且更少冗余的多智能体对话。

英文摘要

Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non-constructive dialogue. We propose MASCOT, a multi-agent framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that fine-tunes individual agents for agent-specific identities; and 2) Collaborative Dialogue Optimization, a group-level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human-grounded contexts drawn across both in-domain and out-of-domain (OOD) settings against state-of-the-art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three-way comparisons, and automatic metrics, further shows that MASCOT produces more role-consistent and less redundant multi-agent dialogue.

2508.06407 2026-06-02 cs.CV cs.AI eess.IV 版本更新

A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

SAR图像中舰船目标的分类感知超分辨率框架

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

发表机构 * University of Malaya(马来亚大学)

AI总结 提出一种将分类目标融入超分辨率过程的算法,通过优化兼顾图像质量和分类性能的损失函数,提升SAR图像分辨率并改善分类精度。

详情
Journal ref
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 6614-6622, 2026
AI中文摘要

高分辨率图像在提升分类、检测和分割等视觉识别任务性能中起着关键作用。在包括遥感和监视在内的许多领域,低分辨率图像可能限制自动分析的准确性。为此,超分辨率(SR)技术被广泛采用,试图从低分辨率输入重建高分辨率图像。相关的传统方法仅基于像素级指标专注于提升图像质量,而超分辨率图像保真度与下游分类性能之间的关系在很大程度上未被探索。这引发了一个关键问题:将分类目标直接集成到超分辨率过程中是否能进一步提高分类精度?在本文中,我们通过部署一种专门的算法策略来研究超分辨率与分类之间的关系,试图回答这一问题。我们提出了一种新颖的方法,通过优化同时考虑图像质量和分类性能的损失函数,提高合成孔径雷达图像的分辨率。我们的方法在提升图像质量(通过科学验证的图像质量指标衡量)的同时,也提高了分类精度。

英文摘要

High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.

2512.07436 2026-06-02 cs.AI 版本更新

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

LocalSearchBench:现实本地生活服务中的智能搜索基准测试

Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su

发表机构 * Meituan, Beijing, China(美团,北京,中国) East China Normal University Shanghai Innovation Institute(东华大学上海创新研究院) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学) North China University of Technology, Beijing, China(华北理工大学,北京,中国) East China Normal University Shanghai China(东华大学上海)

AI总结 针对本地生活服务领域,提出包含130万商家条目和900个多跳问答任务的基准测试LocalSearchBench,并开发统一环境LocalPlayground,实验表明现有大推理模型性能不足。

Comments 12 pages; accepted to KDD 2026

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD 2026), August 9--13, 2026, Jeju Island, Republic of Korea. ACM, New York, NY, USA, 12 pages
AI中文摘要

近期大推理模型(LRMs)的进展使得智能搜索系统能够在多个来源上执行复杂的多步推理。然而,大多数研究集中在通用信息检索上,很少探索具有独特挑战的垂直领域。在这项工作中,我们聚焦于本地生活服务,并引入LocalSearchBench,它涵盖了多样且复杂的业务场景。该领域的真实查询通常模糊不清,需要跨商家和产品进行多跳推理,仍然具有挑战性且未得到充分解决。作为本地生活服务中智能搜索的首个综合基准,LocalSearchBench包含一个数据库,涵盖6个服务类别和9个主要城市的超过130万条商家条目,以及来自真实用户查询的900个多跳问答任务,这些任务需要多步推理。我们还开发了LocalPlayground,一个集成多种工具供LRMs交互的统一环境。实验表明,即使是最先进的LRMs在LocalSearchBench上也表现不佳:最佳模型(DeepSeek-V3.2)仅达到35.60%的正确率,大多数模型在完整性(平均60.32%)和忠实性(平均30.72%)方面存在问题。这凸显了在本地生活服务中需要专门的基准测试和领域特定的智能体训练。代码、基准和排行榜可在https://localsearchbench.github.io/获取。

英文摘要

Recent advances in large reasoning models LRMs have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, and 900 multi-hop QA tasks from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment integrating multiple tools for LRMs interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 35.60% correctness, and most models have issues with completeness (average 60.32%) and faithfulness (average 30.72%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at https://localsearchbench.github.io/.

2601.04946 2026-06-02 cs.CV cs.AI 版本更新

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

原型性偏差揭示多模态评估指标中的盲点

Subhadeep Roy, Gagan Bhatia, Steffen Eger

发表机构 * University of Technology Nuremberg(图恩大学)

AI总结 本文通过构建受控诊断基准PROTOBIAS,发现并验证了多模态评估指标中存在原型性偏差,即倾向于选择视觉或社会原型性高但语义错误的图像,并提出了轻量级对比训练评估器PROTOSCORE作为缓解基线。

详情
AI中文摘要

自动指标广泛用于评估文生图模型,常常在基准测试、模型选择和大规模数据过滤中取代人类判断。然而,它们可能奖励看起来合理或原型性的图像,而非忠实满足提示的图像。我们识别出原型性偏差是多模态评估中的一个系统性盲点:指标可能偏好语义不正确但在视觉或社会层面具有原型性的图像,而非正确但原型性较弱的图像。我们引入PROTOBIAS,一个跨动物、物体和人口统计的受控诊断基准,其中语义正确的图像与包含单个受控语义违反的合理原型性对抗样本进行对比。基于原型理论和社会类别原型性,PROTOBIAS通过多个提示生成器、图像生成器和独立的VLM过滤器构建,并通过提示质量、人工标注和图像质量控制进行验证。使用PROTOBIAS,我们展示了广泛使用的嵌入、奖励、基于VQA和VLM作为评判的指标经常在这些对比中失败,而人类判断仍然更忠实于语义正确性。我们进一步引入PROTOSCORE,一个轻量级对比训练评估器,作为初始缓解基线。PROTOBIAS为测量原型性驱动的指标失败和开发更语义忠实的T2I评估器提供了一个聚焦基准。

英文摘要

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and large-scale data filtering. Yet they may reward images that look plausible or prototypical rather than images that faithfully satisfy the prompt. We identify prototypicality bias as a systematic blindspot in multimodal evaluation: metrics can prefer a semantically incorrect but visually or socially prototypical image over a correct but less prototypical one. We introduce PROTOBIAS, a controlled diagnostic benchmark across Animals, Objects, and Demography, where semantically correct images are contrasted with plausible prototypical adversaries containing a single controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS is constructed with multiple prompt generators, image generators, and independent VLM filters, and validated through prompt-quality, human-annotation, and image-quality controls. Using PROTOBIAS, we show that widely used embedding, reward, VQA-based, and VLM-as-judge metrics frequently fail these contrasts, while human judgments remain more faithful to semantic correctness. We further introduce PROTOSCORE, a lightweight contrastively trained evaluator, as an initial mitigation baseline. PROTOBIAS provides a focused benchmark for measuring prototypicality-driven metric failures and developing more semantically faithful T2I evaluators.

2511.01938 2026-06-02 cs.LG cs.AI 版本更新

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Grokking 的几何:零损失流形上的范数最小化

Tiberiu Musat

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 本文通过约束优化视角,证明在极小学习率和权重衰减系数下,梯度下降在零损失流形上最小化权重范数,并引入近似解耦参数子集的学习动力学,推导出两层网络第一层后记忆动力学的闭式表达式,实验验证了该框架能复现 grokking 的延迟泛化和表征学习特征。

详情
AI中文摘要

Grokking 是神经网络中一种令人费解的现象,即在完全记忆训练数据之后,经过相当长的延迟才出现完全的泛化。先前的研究将这种延迟泛化与由权重衰减驱动的表征学习联系起来,但精确的潜在动力学仍然难以捉摸。在本文中,我们认为后记忆学习可以通过约束优化的视角来理解:梯度下降在零损失流形上有效地最小化权重范数。我们在无穷小学习率和权重衰减系数的极限下正式证明了这一点。为了进一步剖析这一机制,我们引入了一种近似,将一部分参数的学习动力学与网络其余部分解耦。应用这一框架,我们推导出两层网络中第一层后记忆动力学的闭式表达式。实验证实,使用我们预测的梯度模拟训练过程能够再现 grokking 的特征性延迟泛化和表征学习。

英文摘要

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.

2601.04539 2026-06-02 cs.NE cs.AI cs.LG 版本更新

Paradoxical noise preference in RNNs

RNN中的矛盾噪声偏好

Noah Eckstein, Manoj Srinivasan

发表机构 * Department of Mechanical and Aerospace Engineering(机械与航空航天工程系)

AI总结 研究发现,在循环神经网络中,训练时注入的噪声在测试时移除反而会降低性能,网络偏好训练时的噪声水平,该现象源于噪声引起的固定点偏移。

Comments Published in Transactions on Machine Learning Research (TMLR), 2026 21 pages, 8 figures

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

在用于模拟生物神经网络的循环神经网络(RNN)中,通常在训练期间引入噪声以模拟生物变异性和正则化学习。预期在测试时去除噪声应保持或提高性能。与这一直觉相反,我们发现连续时间RNN(CTRNN)通常在训练噪声水平或接近该水平时表现最佳。这种噪声偏好通常出现在噪声注入到神经激活函数内部时;而在激活函数外部注入噪声训练的网络在零噪声时表现最佳。该现象在多种任务中对于足够大的训练噪声鲁棒地出现;我们还展示了该现象出现在前馈神经网络中,而不仅仅是RNN中。我们的分析表明,该现象源于RNN底层随机动力学中固定点(平稳分布)的噪声诱导偏移。这些固定点偏移依赖于噪声水平,并在去除噪声时使网络输出产生偏差,从而降低性能。分析和数值结果表明,当神经状态在激活函数非线性附近运行时会产生偏差,此时噪声被不对称地衰减,而性能优化激励了在这些非线性附近运行;对于噪声在激活函数内部的网络存在这种性能激励,而外部噪声的网络则没有,这解释了为什么只有内部噪声网络表现出偏好。因此,网络可能过拟合到训练噪声本身,而不仅仅是输入-输出数据。该现象不同于随机共振,后者中非零噪声增强信号处理。我们的发现揭示了训练噪声可以成为神经网络学习到的计算的一部分,对理解神经群体动力学和设计鲁棒的人工RNN具有启示意义。

英文摘要

In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biological variability and regularize learning. The expectation is that removing the noise at test time should preserve or improve performance. Contrary to this intuition, we find that continuous-time RNNs (CTRNNs) often perform best at or near the training noise level. This noise preference typically arises when noise is injected inside the neural activation function; networks trained with noise injected outside the activation function perform best with zero noise. The phenomenon arises robustly in diverse tasks for large enough training noise; we also show the phenomenon arising in feedforward neural networks, not just in RNNs. Our analyses show that the phenomenon stems from noise-induced shifts of fixed points (stationary distributions) in the underlying stochastic dynamics of the RNNs. These fixed point shifts are noise-level dependent and bias the network outputs when the noise is removed, degrading performance. Analytical and numerical results show that the bias arises when neural states operate near activation-function nonlinearities, where noise is asymmetrically attenuated, and that performance optimization incentivizes operation near these nonlinearities; such performance incentives exist for networks with noise inside, but not outside, the activation function, explaining why only noise-in networks show the preference. Thus, networks can overfit to the training noise itself rather than just to the input-output data. The phenomenon is distinct from stochastic resonance, wherein nonzero noise enhances signal processing. Our findings reveal that training noise can become an integral part of the computation learned by neural networks, with implications for understanding neural population dynamics and for the design of robust artificial RNNs.

2601.03309 2026-06-02 cs.CV cs.AI 版本更新

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

VLM4VLA:重新审视视觉-语言-动作模型中的视觉-语言模型

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院) Qwen Team, Alibaba Inc.(阿里巴巴公司Qwen团队)

AI总结 本文通过VLM4VLA最小适配管道,系统研究视觉-语言模型(VLM)的选择和能力如何影响下游视觉-语言-动作(VLA)策略性能,发现VLM通用能力无法预测下游任务表现,且视觉模块是性能瓶颈。

详情
AI中文摘要

视觉-语言-动作(VLA)模型将预训练的大型视觉-语言模型(VLM)集成到其策略主干中,因其有前景的泛化能力而受到广泛关注。本文重新审视了一个基本但很少被系统研究的问题:VLM的选择和能力如何转化为下游VLA策略的性能?我们引入了VLM4VLA,一个最小适配管道,仅使用少量新的可学习参数将通用VLM转换为VLA策略,以实现公平高效的比较。尽管简单,VLM4VLA被证明与更复杂的网络设计相比具有惊人的竞争力。通过在三个基准上的各种下游任务进行广泛的实证研究,我们发现虽然VLM初始化比从头训练提供了一致的优势,但VLM的通用能力并不能很好地预测其下游任务性能。这挑战了常见的假设,表明标准VLM能力对于有效的具身控制是必要但不充分的。我们进一步通过微调VLM在七个辅助具身任务(例如,具身问答、视觉指向、深度估计)上研究特定具身能力的影响。与直觉相反,提高VLM在特定具身技能上的性能并不能保证更好的下游控制性能。最后,模态级别的消融实验确定VLM中的视觉模块(而非语言组件)是主要的性能瓶颈。我们证明,即使在下游微调期间编码器保持冻结,向VLM的视觉编码器注入控制相关的监督也能带来一致的收益。这隔离了当前VLM预训练目标与具身动作规划需求之间持续的领域差距。

英文摘要

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policies performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

2601.00664 2026-06-02 cs.LG cs.AI cs.CV cs.HC cs.MM 版本更新

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing:用于自然对话的实时交互式头部化身生成

Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) NTU Singapore(新加坡国立大学) DeepAuto.ai

AI总结 提出Avatar Forcing框架,通过扩散强制实现实时交互式头部化身生成,利用直接偏好优化进行无标签学习,在低延迟(约500ms)下生成富有表现力的反应动作。

Comments CVPR 2026. Project page: https://taekyungki.github.io/AvatarForcing/

详情
AI中文摘要

说话头部生成从静态肖像创建逼真的化身,用于虚拟通信和内容创作。然而,当前的模型尚未传达真正交互式通信的感觉,通常生成缺乏情感投入的单向响应。我们确定了实现真正交互式化身的两个关键挑战:在因果约束下实时生成运动,以及在没有额外标注数据的情况下学习富有表现力、生动的反应。为了解决这些挑战,我们提出了Avatar Forcing,一种新的交互式头部化身生成框架,通过扩散强制建模实时用户-化身交互。该设计允许化身处理实时多模态输入,包括用户的音频和运动,以低延迟即时响应语言和非语言线索,如言语、点头和笑声。此外,我们引入了一种直接偏好优化方法,利用通过丢弃用户条件构建的合成失败样本,实现无标签的富有表现力交互学习。实验结果表明,我们的框架能够实现低延迟(约500ms)的实时交互,相比基线加速6.8倍,并生成反应性和富有表现力的化身运动,在80%以上的情况下优于基线。

英文摘要

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.

2512.20638 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

揭示大型语言模型及其基准测试中的能力差距

Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

发表机构 * University of Zurich(苏黎世大学)

AI总结 提出一种基于稀疏自编码器概念激活的新方法,自动发现模型在细粒度概念上的弱点(模型差距)和基准测试覆盖不平衡(基准差距),并通过内部表示评估和跨基准比较进行验证。

详情
Journal ref
ICML 2026
AI中文摘要

大型语言模型的评估严重依赖标准化基准测试。这些基准测试提供了有用的聚合指标,但可能掩盖(i)模型薄弱的特定子领域(“模型差距”)和(ii)基准测试本身的不平衡覆盖(“基准差距”)。为了自动揭示这两类差距,我们提出了一种简单的新方法,利用稀疏自编码器的概念激活,在逐概念基础上识别细粒度差距。该方法还受益于将评估基于模型的内部表示,以及易于跨基准测试进行比较。我们将该方法应用于五个流行的开源模型和十几个基准测试,作为示例说明。作为对该方法的验证,我们发现我们的自动无监督方法能够恢复文献中先前记录的模型差距(例如与谄媚相关的差距),并识别出新的模型差距。我们还能够自动揭示基准差距:应属于给定基准测试范围的核心概念。我们的“能力差距”方法可以通过提供模型行为的概念级分解,并帮助基准测试开发者迭代基准测试设计,来补充现有基准测试。代码可在 https://competency-gaps.github.io 获取。

英文摘要

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our "competency gaps" method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at https://competency-gaps.github.io.

2506.13702 2026-06-02 cs.LG cs.AI 版本更新

Value-Free Policy Optimization via Reward Partitioning

通过奖励划分实现无价值函数策略优化

Bilal Faye, Hanane Azzag, Mustapha Lebbah

发表机构 * LIPN, Université Paris 13(巴黎第十三大学LIPN实验室) Université Paris 13(巴黎第十三大学) Université de Versailles Saint-Quentin Paris(巴黎- versaillies圣quentin大学)

AI总结 提出Reward Partition Optimization (RPO)方法,通过基于划分的奖励归一化消除价值函数学习,实现稳定、高效的策略优化。

详情
AI中文摘要

单轨迹偏好优化方法从((提示, 响应, 奖励))元组的数据集中学习,通过直接利用标量反馈为成对偏好学习提供了一种实用的替代方案。现有方法如直接奖励优化(DRO)已显示出有希望的结果,但依赖于价值函数估计,引入了额外的方差、优化复杂性和对离策略数据的敏感性。我们引入了奖励划分优化(RPO),一种简单且可扩展的奖励驱动目标,消除了对价值函数学习的需要。RPO通过直接从提示级奖励分布估计的基于划分的公式对奖励进行归一化,产生稳定的监督优化目标,无需辅助模型或强化学习循环。我们使用自动评估指标、LLM作为评判员的评估和优化稳定性分析,在多个编码器-解码器和仅解码器语言模型上评估RPO。实验结果表明,RPO在生成更对齐、更多样化和更少有毒内容的同时,始终优于强基线,包括SFT、KTO和DRO。

英文摘要

Single-trajectory preference optimization methods learn from datasets of ((prompt, response, reward)) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.

2505.08438 2026-06-02 cs.CV cs.AI 版本更新

A Survey of 3D Reconstruction with Event Cameras

事件相机三维重建综述

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Haodong Chen, Zeke Zexi Hu, Zhicheng Lu, Ying Zhou, Vera Chung, Qiang Qu, Weidong Cai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文首次全面综述了基于事件相机的三维重建方法,按输入模态(立体、单目、多模态)和重建技术(几何、深度学习、神经渲染如NeRF和3DGS)分类,并讨论了数据集、评估、表示和动态场景重建等挑战。

Comments This survey has been accepted for publication in the Computational Visual Media Journal

详情
AI中文摘要

事件相机正迅速成为用于三维重建的强大视觉传感器,能够异步捕捉每个像素的亮度变化。与传统基于帧的相机相比,事件相机产生稀疏但时间密集的数据流,即使在高速运动、低光照和极端动态范围等挑战性条件下,也能实现鲁棒且准确的三维重建。这些能力为自动驾驶、机器人、空中导航和沉浸式虚拟现实等各个领域的变革性应用提供了巨大前景。在本文中,我们首次专门针对基于事件的三维重建进行了全面综述。现有方法根据输入模态系统地分为立体、单目和多模态系统,并根据重建方法进一步分类,包括基于几何的技术、深度学习方法以及神经渲染技术,如神经辐射场(NeRF)和3D高斯泼溅(3DGS)。在每个类别中,方法按时间顺序组织,以突出关键概念和进展的演变。此外,我们详细总结了专门适用于基于事件重建任务的公开数据集。最后,我们讨论了数据集可用性、标准化评估、有效表示和动态场景重建方面的重大开放挑战,并概述了未来研究的有见地的方向。本综述旨在作为重要参考,并为推进事件驱动三维重建的最新技术提供清晰且激励人心的路线图。

英文摘要

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel brightness changes. Compared to traditional frame-based cameras, event cameras produce sparse yet temporally dense data streams, enabling robust and accurate 3D reconstruction even under challenging conditions such as high-speed motion, low illumination, and extreme dynamic range scenarios. These capabilities offer substantial promise for transformative applications across various fields, including autonomous driving, robotics, aerial navigation, and immersive virtual reality. In this survey, we present the first comprehensive review exclusively dedicated to event-based 3D reconstruction. Existing approaches are systematically categorised based on input modality into stereo, monocular, and multimodal systems, and further classified according to reconstruction methodologies, including geometry-based techniques, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Within each category, methods are chronologically organised to highlight the evolution of key concepts and advancements. Furthermore, we provide a detailed summary of publicly available datasets specifically suited to event-based reconstruction tasks. Finally, we discuss significant open challenges in dataset availability, standardised evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research. This survey aims to serve as an essential reference and provides a clear and motivating roadmap toward advancing the state of the art in event-driven 3D reconstruction.

2512.18336 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

强化学习低层四旋翼控制中的动态熵调节:随机性与确定性

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械工程系) The German University in Cairo(开罗德国大学)

AI总结 研究在四旋翼控制中,通过动态熵调节训练随机策略的强化学习算法,并与确定性策略算法对比,发现动态熵调节可防止灾难性遗忘并提高探索效率。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 IEEE 34th International Conference on Computer Theory and Applications (ICCTA)
AI中文摘要

本文探讨了在训练随机策略的强化学习算法中动态熵调节的影响,并将其性能与训练确定性策略的算法进行了比较。随机策略通过优化动作的概率分布来最大化奖励,而确定性策略则为每个状态选择一个确定的动作。本文研究了使用静态熵和动态熵训练随机策略,然后执行确定性动作来控制四旋翼的效果,并与训练确定性策略并执行确定性动作进行了对比。为此,随机算法选择了软演员-评论家(SAC)算法,确定性算法选择了双延迟深度确定性策略梯度(TD3)算法。训练和仿真结果表明,动态熵调节通过防止灾难性遗忘和提高探索效率,对控制四旋翼产生了积极影响。

英文摘要

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.

2512.18333 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

基于软演员-评论家(SAC)的四旋翼强化学习位置控制

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(埃及德国大学)

AI总结 提出一种基于强化学习的四旋翼推力矢量控制架构,使用软演员-评论家算法训练,相比传统RPM控制器训练更快、路径跟踪更平滑准确。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 IEEE 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES)
AI中文摘要

本文提出了一种新的基于强化学习(RL)的四旋翼控制架构。现有文献主要关注直接控制四个旋翼的转速,而本文旨在控制四旋翼的推力矢量。RL智能体计算沿四旋翼z轴的总推力百分比以及期望的滚转角(ϕ)和俯仰角(θ)。然后,智能体将计算出的控制信号连同当前四旋翼的偏航角(ψ)发送给姿态PID控制器。PID控制器再将控制信号映射为电机转速。采用软演员-评论家算法(一种无模型离策略随机RL算法)来训练RL智能体。训练结果表明,与传统的RPM控制器相比,所提出的推力矢量控制器训练时间更短。仿真结果表明,所提出的推力矢量控制器具有更平滑、更精确的路径跟踪性能。

英文摘要

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired Roll ($ϕ$) and Pitch ($θ$) angles. The agent then sends the calculated control signals along with the current quadrotor's Yaw angle ($ψ$) to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.

2512.18043 2026-06-02 cs.CR cs.AI cs.CY 版本更新

Securing Agentic AI Systems -- A Multilayer Security Framework

保护自主AI系统——一种多层安全框架

Sunil Arora, John Hastings

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对自主AI系统的独特安全挑战,本文采用设计科学研究方法,提出了一种生命周期感知的安全框架MAAIS,并引入自主AI的CIAA概念,通过多层防御机制确保AI生命周期的机密性、完整性、可用性和问责性,最后利用MITRE ATLAS进行验证。

Comments 6 pages, 2 figures, 1 table

详情
Journal ref
2025 IEEE 5th International Conference on Robotics, Automation, and Artificial Intelligence (RAAI)
AI中文摘要

保护自主人工智能(AI)系统需要应对由自主、决策和自适应行为引入的复杂网络风险。自主AI系统正越来越多地部署在工业、组织以及网络安全、金融和医疗等关键领域。然而,它们的自主性带来了独特的安全挑战,包括未经授权的操作、对抗性操纵和动态环境交互。现有的AI安全框架未能充分应对这些挑战或自主AI的独特细微差别。本研究采用设计科学研究(DSR)方法,开发了一种专门针对自主AI系统的生命周期感知安全框架。本文介绍了MAAIS,一个自主安全框架,以及自主AI的CIAA(机密性、完整性、可用性和问责性)概念。MAAIS集成了多个防御层,以在AI生命周期中维护CIAA。通过映射已建立的MITRE ATLAS(人工智能系统对抗威胁全景)AI策略进行框架验证。本研究为在企业环境中安全部署和治理自主AI提供了一种结构化、标准化且基于框架的方法。该框架面向企业CISO、安全、AI平台和工程团队,并提供了保护自主AI工作负载的详细分步方法。

英文摘要

Securing Agentic Artificial Intelligence (AI) systems requires addressing the complex cyber risks introduced by autonomous, decision-making, and adaptive behaviors. Agentic AI systems are increasingly deployed across industries, organizations, and critical sectors such as cybersecurity, finance, and healthcare. However, their autonomy introduces unique security challenges, including unauthorized actions, adversarial manipulation, and dynamic environmental interactions. Existing AI security frameworks do not adequately address these challenges or the unique nuances of agentic AI. This research develops a lifecycle-aware security framework specifically designed for agentic AI systems using the Design Science Research (DSR) methodology. The paper introduces MAAIS, an agentic security framework, and the agentic AI CIAA (Confidentiality, Integrity, Availability, and Accountability) concept. MAAIS integrates multiple defense layers to maintain CIAA across the AI lifecycle. Framework validation is conducted by mapping with the established MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) AI tactics. The study contributes a structured, standardized, and framework-based approach for the secure deployment and governance of agentic AI in enterprise environments. This framework is intended for enterprise CISOs, security, AI platform, and engineering teams and offers a detailed step-by-step approach to securing agentic AI workloads.

2512.17605 2026-06-02 cs.CV cs.AI 版本更新

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

MGRegBench:一个带有解剖标志的乳腺X线图像配准新型基准数据集

Svetlana Krasnova, Emiliya Starikova, Ilia Naletov, Andrey Krylov, Dmitry Sorokin

发表机构 * MSU(莫斯科国立大学)

AI总结 为解决乳腺X线图像配准中缺乏公开数据集和标准化基准的问题,提出了MGRegBench,包含5000多对图像和100对带手动标注解剖标志的数据集,并评估了多种配准方法。

详情
AI中文摘要

稳健的乳腺X线图像配准对于临床相关应用(如追踪乳腺组织疾病进展)至关重要。然而,由于缺乏透明的公共数据集和可重复的标准化基准,进展受到限制。现有研究通常使用私有数据和不一致的评估框架,因此难以直接比较。为解决这一问题,我们提出了MGRegBench,一个患者独立、无泄漏控制的乳腺X线图像配准评估协议,包含超过5000对图像,每对图像带有乳腺分割掩膜,以及100对带有手动标注解剖标志的图像,此外还有标准化的训练/评估分割和即用基线。利用这一资源,我们对多种配准方法进行了基准测试——包括经典方法(ANTs)、基于学习的方法(VoxelMorph, TransMorph)、隐式神经表示(IDIR)、一种乳腺X线专用方法,以及最近的深度学习方法MammoRegNet,并针对该模态调整了实现,同时在独立数据集SDM-MCs上验证了泛化能力。我们的贡献包括:(1)首个此规模且带有手动标注标志和掩膜的乳腺X线图像配准公共数据集;(2)一个透明、无泄漏控制的基准,首次实现了多种经典和基于机器学习的方法的同类比较;(3)在SDM-MCs上的外部验证,以测试主要趋势是否超越MGRegBench;(4)对基于深度学习的配准进行了广泛分析。我们公开发布代码和数据,为公平、可重复且临床相关的比较建立基础资源,并推动AI驱动医学影像的未来研究。

英文摘要

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. However, progress has been limited by the absence of transparent public datasets and reproducible standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a patient-disjoint, leakage-controlled evaluation protocol for mammography registration, comprising over 5,000 image pairs, each with a breast segmentation mask, and 100 pairs with manually annotated anatomical landmarks, plus standardized train/evaluation splits and ready-to-run baselines. Using this resource, we benchmark diverse registration methods -- including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a mammography-specific approach, and a recent deep learning method MammoRegNet, with implementations adapted to this modality, and validate generalization on the independent SDM-MCs dataset. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) a transparent, leakage-controlled benchmark enabling the first like-for-like comparison of diverse classical and machine learning-based methods; (3) external validation on SDM-MCs to test whether the main trend transfers beyond MGRegBench; and (4) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair, reproducible, and clinically relevant comparisons and catalyze future research in AI-driven medical imaging.

2512.13356 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

使用双延迟深度确定性策略梯度(TD3)控制双旋翼系统

Zeyad Gamal, Youssef Mahran, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(埃及德国大学)

AI总结 提出基于TD3算法的强化学习框架,用于控制双旋翼气动系统在俯仰和方位角上的稳定与轨迹跟踪,仿真和实验验证了其优于传统PID控制器的抗干扰能力。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 28th IEEE International Conference on System Theory, Control and Computing (ICSTCC)
AI中文摘要

本文提出了一种强化学习(RL)框架,用于在特定俯仰角和方位角下控制和稳定双旋翼气动系统(TRAS),并跟踪给定轨迹。TRAS的复杂动力学和非线性特性使得使用传统控制算法进行控制具有挑战性。然而,近年来RL的发展因其在多旋翼控制中的潜在应用而引起了兴趣。本文使用双延迟深度确定性策略梯度(TD3)算法来训练RL智能体。该算法适用于具有连续状态和动作空间的环境(类似于TRAS),因为它不需要系统的模型。仿真结果展示了RL控制方法的有效性。接下来,使用风扰形式的的外部扰动来测试控制器与传统PID控制器相比的有效性。最后,在实验室装置上进行了实验,以确认控制器在实际应用中的有效性。

英文摘要

This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.

2512.10414 2026-06-02 cs.AI 版本更新

Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

基于选择性对抗熵干预提升强化学习视觉推理能力

Yang Yu, Zhuangzhuang Chen, Lanqing Li, Xiaomeng Li

发表机构 * The Hong Kong University of Science and Technology(香港科学与技术大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出选择性对抗熵干预(SaEI)方法,通过在强化学习采样阶段利用熵引导的对抗攻击扭曲视觉输入,增强策略探索,提升视觉语言模型的推理能力。

详情
AI中文摘要

最近,强化学习(RL)已成为增强视觉语言模型(VLM)推理能力的常见选择。考虑到现有的基于RL的微调方法,熵干预被证明是提升探索能力、从而改善策略性能的有效方式。值得注意的是,大多数现有研究通过在RL策略优化过程中简单控制特定token的更新来干预熵,忽略了在RL采样阶段进行熵干预可以通过提高响应多样性来增强GRPO性能。本文提出选择性对抗熵干预(SaEI),通过使用来自采样响应熵的token选择性对抗目标来扭曲视觉输入,从而增强策略熵。具体而言,我们首先提出熵引导对抗采样(EgAS),将采样响应的熵公式化为对抗目标。然后,利用相应的对抗梯度攻击视觉输入以生成对抗样本,使策略模型在RL采样期间探索更大的答案空间。接着,我们提出token选择性熵计算(TsEC),以最大化EgAS中对抗攻击的有效性,同时不扭曲VLM中的事实知识。在域内和域外数据集上的大量实验表明,我们的方法通过熵干预显著改善了策略探索,从而提升了推理能力。代码将在论文被接收后发布。

英文摘要

Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL-based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ignore the entropy intervention during the RL sampling that can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with the token-selective adversarial objective coming from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective. Then, the corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. Then, we propose token-selective entropy computation (TsEC) to maximize the effectiveness of adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, to boost reasoning capabilities. Code will be released once the paper is accepted.

2512.10339 2026-06-02 cs.AI 版本更新

On the Collapse of Generative Paths: A Criterion and Correction for Diffusion Steering

生成路径的崩溃:扩散引导的准则与修正

Ziseok Lee, Minyeong Hwang, Wooyeol Lee, Sanghyun Jo, Jihyung Ko, Young Bin Park, Jae-Mun Choi, Eunho Yang, Kyungsu Kim

发表机构 * Department of Biomedical Sciences, Seoul National University, Seoul, South Korea(首尔国立大学生物医学科学系) School of Transdisciplinary Innovations, Seoul National University, Seoul, South Korea(首尔国立大学跨学科创新学院) Interdisciplinary Program in AI, Seoul National University, Seoul, South Korea(首尔国立大学人工智能交叉学科项目) Kim Jaechul Graduate School of AI, Seoul, South Korea(金 Jaechul人工智能研究生院)

AI总结 针对扩散和流模型推理时引导中出现的边缘路径崩溃问题,提出路径存在准则和自适应路径修正方法ACE,通过时变指数控制中间分布的分位数半径,在药物设计和图像生成任务中优于固定指数基线。

Comments Accepted to ICML 2026

详情
AI中文摘要

推理时引导通过重加权时间索引边际分布(使用固定指数)的密度比构造,使预训练的扩散和流模型适应新任务而无需重新训练。我们发现了边缘路径崩溃,即这种组合定义的中间密度变得不可归一化,尽管端点有效。当使用噪声调度不匹配(和/或负指数/部分支撑)的异构专家组合时,可能发生崩溃。为解决此问题,我们提供:(i) 一个尖锐的充分路径存在准则,刻画组合中间密度在数学上良好定义的条件;(ii) 自适应路径修正(ACE),将Feynman-Kac引导推广以支持时变指数。我们的分析表明,ACE控制中间分布的分位数半径,为实验中观察到的路径稳定提供了理论机制。在柔性姿态支架装饰(一个由从头、构象和蛋白质条件专家组成的药物设计任务)中,ACE防止崩溃并显著优于固定指数基线。此外,ACE提高了组合图像生成中的属性成功率,将其确立为组合采样的通用框架。项目页面:https://ziseoklee.github.io/projects/ACE/

英文摘要

Inference-time steering adapts pretrained diffusion and flow models to new tasks without retraining, often utilizing ratio-of-densities constructions that reweight time-indexed marginals with fixed exponents. We identify Marginal Path Collapse, a failure mode in which the intermediate density defined by such compositions becomes non-normalizable despite valid endpoints. This collapse can arise when composing heterogeneous experts trained with mismatched noise schedules (and/or negative exponents / partial supports). To address this, we provide (i) a sharp sufficient Path Existence Criterion that characterizes when the composed intermediate densities are mathematically well-defined, and (ii) Adaptive Path Correction with Exponents (ACE), which generalizes Feynman-Kac steering to support time-varying exponents. Our analysis reveals that ACE controls the quantile radius of the intermediate distributions, providing a theoretical mechanism for path stabilization observed in experiments. On flexible-pose scaffold decoration, a drug design task composed of de-novo, conformer, and protein-conditioned experts, ACE prevents collapse and significantly outperforms constant-exponent baselines. Furthermore, ACE improves attribute success rates in compositional image generation, establishing it as a general framework for compositional sampling. Project Page: https://ziseoklee.github.io/projects/ACE/

2512.10234 2026-06-02 cs.HC cs.AI 版本更新

InFerActive: Interactive Tree-Based Exploration of LLM Sampling for Safety Evaluation

InFerActive: 基于交互式树的安全评估中LLM采样探索

Junhyeong Hwangbo, Soohyun Lee, Hyeon Jeon, Kyochul Jang, Minsoo Cheong, Youngjae Yu, Jinwook Seo

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出InFerActive系统,通过广度优先采样构建可导航短语树,提升LLM安全评估中低概率有害输出的覆盖率和效率,相比随机采样减少5倍样本量。

Comments v2: Revised version

详情
AI中文摘要

即使在评估中表现安全的LLM,在部署时仍可能产生有害响应。由于随机采样对同一提示产生不同响应,低概率的有害输出仍可能大规模到达用户。常见的人工评估工作流为每个提示生成大量随机样本,并在静态电子表格中审查。这种做法扩展性差,迫使评估者反复重读近乎重复的前缀。为解决此问题,我们提出InFerActive,一个交互式系统,将采样结果可视化为可导航的短语树,允许评估者按需过滤、探索和扩展生成空间。InFerActive利用广度优先采样,一种新颖的树构建过程,在匹配随机采样的有害响应覆盖范围的同时,所需样本最多减少5.0倍。两项受控用户研究(各N=12)表明,InFerActive在评估效率和覆盖率上显著优于电子表格和基本树基线。

英文摘要

Even LLMs that appear safe during evaluation can still produce harmful responses in deployment. Because stochastic sampling yields different responses to the same prompt, low-probability harmful outputs can still reach users at scale. Common human evaluation workflows generate many random samples per prompt and review them in static spreadsheets. The practice scales poorly, forcing evaluators to repeatedly reread near-duplicate prefixes. To address this, we present InFerActive, an interactive system that visualizes sampling results as a navigable tree of readable phrases, allowing evaluators to filter, explore, and extend the generation space on demand. InFerActive utilizes breadth-first sampling, a novel tree construction procedure that matches the harmful-response coverage of random sampling while requiring up to 5.0x fewer samples. Two controlled user studies (N = 12 each) demonstrate that InFerActive significantly improves evaluation efficiency and coverage over both spreadsheet and basic tree baselines.

2512.10120 2026-06-02 cs.SD cs.AI 版本更新

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

VocSim:单源音频中零样本内容身份的无训练基准

Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出VocSim,一个无需训练的无标签基准,通过冻结嵌入的几何对齐评估通用音频表示在零样本内容身份识别中的性能,并在多领域音频上取得强结果,同时揭示跨语言泛化差距。

Comments Accepted at ICML 2026. Code: https://github.com/vocsim/benchmark

详情
AI中文摘要

通用音频表示旨在将同一事件的声学可变实例映射到邻近点,在零样本设置中解决内容身份问题。与通过参数更新衡量适应性的监督分类基准不同,我们引入了VocSim,一个无需训练的基准,探测冻结嵌入的内在几何对齐,不更新任何参数也不使用标签(每个子集拟合一个无标签PCA白化以校正各向异性)。VocSim汇集了来自19个语料库的125k个单源片段,涵盖人类语音、动物发声和环境声音,将内容表示与源分离隔离开来(多声道混合超出范围)。我们使用Precision@k评估局部纯度,使用全局分离率(GSR)评估逐点类别分离,并通过相对于经验置换基线的提升进行校准。一个简单的冻结Whisper特征、时频池化和无标签PCA的流程在跨领域上产生了强大的零样本性能,GSR排名稳定(Kendall's tau = 0.60)。然而,在低资源盲语音(Shipibo-Conibo、Chintang)上,局部检索崩溃但仍高于随机水平,暴露了跨语言语音泛化差距。作为外部验证,我们的顶级嵌入预测了鸟类感知相似性,改进了生物声学分类,并在HEAR基准上达到了最先进水平。我们发布了数据、代码和公共排行榜。

英文摘要

General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings, with no parameters updated and no labels used (a label-free PCA whitening is fit per subset to correct anisotropy). VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds, isolating content representation from source separation (polyphonic mixtures are out of scope). We evaluate embeddings with Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation, calibrated by lift over an empirical permutation baseline. A simple pipeline of frozen Whisper features, time-frequency pooling, and label-free PCA yields strong zero-shot performance with stable GSR rankings across domains (Kendall's tau = 0.60). However, on blind low-resource speech (Shipibo-Conibo, Chintang), local retrieval collapses while remaining above chance, exposing a cross-lingual speech generalization gap. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art on the HEAR benchmark. We release data, code, and a public leaderboard.

2512.09065 2026-06-02 cs.RO cs.AI 版本更新

ShelfAware: Real-Time Semantic Localization in Quasi-Static Environments with Low-Cost Sensors

ShelfAware:准静态环境下基于低成本传感器的实时语义定位

Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes

发表机构 * Department of Computer Science, University of Colorado Boulder(科罗拉多大学波尔德分校计算机科学系)

AI总结 提出ShelfAware语义粒子滤波器,通过将场景语义建模为类别统计证据而非固定地标,结合深度似然与类别语义相似度,并利用预计算语义视角进行逆语义提议,实现低成本视觉硬件上的鲁棒全局定位。

Comments 8 pages

详情
Journal ref
IEEE Robotics and Automation Letters (RA-L), 2026
AI中文摘要

许多室内工作空间是准静态的:其全局几何布局稳定,但局部语义不断变化,产生重复几何结构、动态杂乱和感知噪声,使得标准基于视觉的定位失效。我们提出ShelfAware,一种用于鲁棒全局定位的语义粒子滤波器,它将场景语义视为对象类别的统计证据而非固定数量地标。ShelfAware融合深度似然与以类别为中心的语义相似度,并利用预计算的语义视角库在蒙特卡洛定位(MCL)中执行逆语义提议,从而在低成本、纯视觉硬件上实现快速、有针对性的假设生成。为了展示感知无关的可扩展性,我们在两个领域评估ShelfAware。在严格控制的模拟零售环境中,ShelfAware实现了97%的全局定位成功率,并在购物车、可穿戴和动态遮挡条件下保持了最高的跟踪成功率(66%)。此外,在利用开放词汇视觉管道的3,500平方英尺运营杂货店中,ShelfAware显著优于几何和固定数量语义基线。通过分布性建模语义并利用逆提议,ShelfAware解决了几何混叠问题,为动态真实环境中的移动和辅助机器人提供了无需基础设施的构建模块。

英文摘要

Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat standard vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed quantity landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside Monte Carlo Localization (MCL), yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. To demonstrate perception-agnostic scalability, we evaluate ShelfAware across two domains. In a rigorously controlled mock retail environment, ShelfAware achieves a 97% global localization success rate, maintaining the highest tracking success (66%) across cart, wearable, and dynamic occlusion conditions. Furthermore, in a 3,500 sq. ft. operational grocery store leveraging an open-vocabulary vision pipeline, ShelfAware significantly outperforms both geometric and fixed-quantity semantic baselines. By modeling semantics distributionally and leveraging inverse proposals, ShelfAware resolves geometric aliasing, providing an infrastructure-free building block for mobile and assistive robots in dynamic real-world environments.

2512.07795 2026-06-02 cs.AI cs.CL cs.LG 版本更新

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

ReasonBENCH: 基准测试LLM推理的(不)稳定性

Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora

发表机构 * Aarhus University(奥胡斯大学) Indian Institute of Technology Delhi(德里印度理工学院) EPFL(苏黎世联邦理工学院)

AI总结 提出ReasonBench基准套件,通过30次独立试验揭示LLM推理系统在贪婪解码下仍存在结构化方差,并引入全局噪声和运行噪声分类法,证明稳定性是推理系统的固有属性,倡导分布感知评估。

Comments 29 pages, 19 tables, 85 figures

详情
AI中文摘要

LLM推理系统的基准分数被报告为单一数字,然而相同的模型、策略和任务在重复执行时,即使在贪婪解码(T=0)下也会产生显著不同的答案和成本。这种方差并非统计上的麻烦:性能最高的策略在与最接近的对手进行头对头运行时仅获胜77%,这意味着单次观测到的分数可能会无声地错误排序系统。我们引入了ReasonBench,一个基准套件,记录了10种推理策略、12个模型和6个任务的30次独立试验,将质量和成本视为分布而非点估计。我们发现这种方差是有结构的而非随机的:一个双组分分类法——全局噪声(捕捉跨基准的不均匀性)和运行噪声(捕捉基准内的随机性)——揭示了策略架构预测稳定性分布,而模型和策略则移动分布的正交方面。层次分解将四分之三的分数方差归因于基准、系统和项目结构,而单次运行评估无声地吸收了持久的残差。最后,成本和成本非对称地解耦:廉价方法在结构上对联合成本-质量失败免疫,而昂贵方法无论其准确性如何仍然暴露。这些发现确立了不稳定性作为推理系统的固有属性,并促使分布感知评估成为标准实践。

英文摘要

Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as distributions rather than point estimates. We find that this variance is structured rather than random: a two-component taxonomy -- Global Noise, capturing cross-benchmark unevenness, and Run Noise, capturing within-benchmark stochasticity -- reveals that strategy architecture predicts stability profiles, while models and strategies shift orthogonal aspects of the distribution. A hierarchical decomposition attributes three-quarters of score variance to benchmark, system, and item structure, with a persistent residual that single-run evaluation silently absorbs. Finally, cost and quality decouple asymmetrically: cheap methods are structurally immune to joint cost-quality failure, while expensive methods remain exposed regardless of their accuracy. These findings establish instability as an inherent property of reasoning systems and motivate distribution-aware evaluation as standard practice.

2511.20639 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Collaboration in Multi-Agent Systems

多智能体系统中的潜在协作

Jiaru Zou, Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

发表机构 * University of Washington(华盛顿大学)

AI总结 提出LatentMAS框架,使LLM智能体在连续潜在空间直接协作,无需文本中介,实现更高精度、更低开销和更快推理。

Comments ICML2026 Spotlight, Project: https://github.com/Gen-Verse/LatentMAS

详情
AI中文摘要

多智能体系统(MAS)将大语言模型(LLM)从独立的单模型推理扩展到协同的系统级智能。现有LLM智能体依赖基于文本的中介进行推理和通信,而我们更进一步,使模型能够在连续潜在空间内直接协作。我们引入了LatentMAS,一个端到端无需训练的框架,实现了LLM智能体间的纯潜在协作。在LatentMAS中,每个智能体首先通过最后一层的隐藏嵌入而非文本进行自回归潜在思维生成。然后,一个共享的潜在工作记忆保存并传递每个智能体的内部表示和潜在思维,确保无需重新编码的无损信息交换。我们提供了详细的理论分析,表明LatentMAS比基于文本的标准MAS具有更高的表达能力和无损信息保存能力,且整体复杂度更低。此外,在涵盖数学和科学推理、常识理解及代码生成的9个综合基准测试上的实证评估表明,LatentMAS优于先进的单智能体和基于文本的MAS基线,准确率最高提升14.6%,输出token使用量减少70.8%-83.7%,端到端推理速度提升4倍至4.3倍。代码和数据完全开源:https://github.com/Gen-Verse/LatentMAS。

英文摘要

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent's internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4$\times$-4.3$\times$ faster end-to-end inference. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

2512.00062 2026-06-02 cs.RO cs.AI cs.LG 版本更新

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

SpeedAug: 通过节奏增强策略和强化学习微调实现策略加速

Taewook Nam, Junmo Cho, Youngsoo Jang, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) UNIST(全南大学) DeepAuto.ai

AI总结 提出SpeedAug框架,通过节奏增强先验策略和强化学习微调,使机器人策略学习任务最优执行节奏,在保持高成功率的同时显著提升执行速度和样本效率。

详情
AI中文摘要

针对复杂真实世界操作任务的机器人策略学习近期取得了快速进展,这在很大程度上得益于通过人类操作收集演示数据的能力。然而,从这些演示中训练出的策略通常执行任务的速度远低于机器人的物理能力,因为演示数据是在实际约束下收集的,这些约束倾向于保守的、以成功为导向的轨迹,而非执行速度。现有的策略加速方法通过数据预处理或启发式规则确定执行节奏,而不是学习针对任务优化的执行速度。在本文中,我们提出了SpeedAug,一个策略加速框架,使策略能够通过强化学习(RL)学习任务最优的执行节奏。SpeedAug首先从速度增强的演示中学习一个节奏增强的先验策略,该策略捕捉了多样的执行节奏。在此基础上,通过强化学习微调指导探索,以优化动作轨迹并高效优化执行节奏。在机器人操作基准上的实验表明,SpeedAug在保持高成功率的同时,显著提高了策略加速的样本效率,实现了快速且稳定的任务执行。应用于真实世界的操作任务时,SpeedAug仅用16分钟的在线交互就将任务吞吐量提高了1.8倍,且未降低成功率。

英文摘要

Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration while maintaining high success rates, achieving fast and stable task execution. Applied to a real-world manipulation task, SpeedAug improves task throughput by 1.8x using only 16 minutes of online interactions without compromising the success rate.

2510.01800 2026-06-02 cs.AI 版本更新

REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing

REBot: 从RAG到CatRAG——语义增强与图路由

Thanh Ma, Tri-Tam La, Lam-Thu Le Huu, Minh-Nghi Nguyen, Khanh-Van Pham Luu

发表机构 * CTU(越南科技大学)

AI总结 提出REBot,一种基于CatRAG混合检索推理框架的LLM增强咨询聊天机器人,通过语义增强的分层类别知识图谱和图路由实现学术法规建议,在分类和问答任务上达到98.89%的F1分数。

Comments Published in Communications in Computer and Information Science (CCIS), Springer, 2025. DOI: 10.1007/978-981-95-4960-3_35

详情
Journal ref
Communications in Computer and Information Science (CCIS), Springer, 2025, pp. 435-447
AI中文摘要

学术法规建议对于帮助学生理解和遵守机构政策至关重要,但构建有效系统需要特定领域的法规资源。为应对这一挑战,我们提出REBot,一种由CatRAG增强的LLM咨询聊天机器人,CatRAG是一种混合检索推理框架,将检索增强生成与基于图的推理相结合。CatRAG统一了密集检索和图推理,由分层、类别标记的知识图谱支持,并丰富了语义特征以实现领域对齐。轻量级意图分类器将查询路由到适当的检索模块,确保事实准确性和上下文深度。我们构建了一个法规特定数据集,并在分类和问答任务上评估REBot,取得了98.89%的F1分数,达到最先进水平。最后,我们实现了一个Web应用程序,展示了REBot在真实学术建议场景中的实用价值。

英文摘要

Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective systems requires domain specific regulatory resources. To address this challenge, we propose REBot, an LLM enhanced advisory chatbot powered by CatRAG, a hybrid retrieval reasoning framework that integrates retrieval augmented generation with graph based reasoning. CatRAG unifies dense retrieval and graph reasoning, supported by a hierarchical, category labeled knowledge graph enriched with semantic features for domain alignment. A lightweight intent classifier routes queries to the appropriate retrieval modules, ensuring both factual accuracy and contextual depth. We construct a regulation specific dataset and evaluate REBot on classification and question answering tasks, achieving state of the art performance with an F1 score of 98.89%. Finally, we implement a web application that demonstrates the practical value of REBot in real world academic advising scenarios.

2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Understanding the Effects of Distractors on Reasoning Vision-Language Models

理解干扰项对推理视觉语言模型的影响

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(坡山科学技术大学(POSTECH))

AI总结 本文通过构建包含语义和数值维度干扰项的视觉问答数据集Idis,研究视觉干扰项如何影响视觉语言模型的测试时缩放行为,发现视觉干扰项以与文本干扰项根本不同的方式降低准确率而不增加推理长度,并提出简单提示策略缓解干扰项驱动的预测。

Comments preprint

详情
AI中文摘要

无关信息(即干扰项)如何影响视觉语言模型(VLM)的测试时缩放?先前关于纯文本语言模型的研究表明,文本干扰项可以加剧逆缩放,导致模型推理更长但推理轨迹效率更低。在这项工作中,我们研究了类似现象是否在多模态设置中出现。我们引入了Idis(带干扰项的图像),这是一个视觉问答数据集,系统性地沿着语义和数值维度变化干扰项。我们的分析揭示,视觉干扰项以与文本干扰项根本不同的方式影响推理VLM:尽管逆缩放仍然出现,但视觉干扰项降低了准确率而不增加推理长度。我们进一步展示了从推理轨迹中提取的属性计数为干扰项如何与推理长度和准确率交互提供了关键见解。作为合理性检查,我们提出了一种简单的提示策略,以减轻推理视觉语言模型中干扰项驱动的预测。

英文摘要

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to reason longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.

2511.20615 2026-06-02 cs.CV cs.AI 版本更新

Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

评估深度学习模型在负重活动期间全身动态3D姿态预测中的性能

Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri

发表机构 * Department of Mechanical Engineering, Sharif University of Technology(谢赫·巴赫什大学机械工程系)

AI总结 本研究利用双向长短期记忆和Transformer架构的时间序列模型,通过优化身体段长度约束的代价函数,实现了对动态负重活动中全身3D姿态的高精度预测。

Comments 11 pages, 6 figures, 7 tables, This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本研究旨在探索深度神经网络在动态负重活动中全身人体姿态预测的应用。使用双向长短期记忆(BLSTM)和Transformer架构训练了两个时间序列模型。数据集包含20名正常体重健康男性个体的3D全身插件步态动态坐标,每人从不同负载位置执行204次负重任务,并采用不同的举升和处理技术。模型输入包括手-负载位置的3D坐标、举升(弯腰、全蹲和半蹲)和处理(单手和双手)技术、体重和身高,以及任务前25%时间的身体姿态3D坐标数据。模型利用这些输入预测任务剩余75%时间内的身体坐标。此外,提出了一种新方法,通过优化新的代价函数强制身体段长度恒定,以提高先前和当前姿态预测网络的准确性。结果表明,新代价函数使手臂和腿部模型的预测误差分别降低了约8%和21%。我们发现,使用Transformer架构(均方根误差为41.4 mm)的长期性能比基于BLSTM的模型准确约58%。本研究证明了利用捕捉时间序列依赖性的神经网络在3D运动帧中的价值,为理解和预测人工物料搬运活动中的运动动力学提供了独特方法。

英文摘要

This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 41.4 mm, exhibited approximately 58% more accurate long-term performance than the BLSTM-based model. This study merits the use of neural networks that capture time series dependencies in 3D motion frames, providing a unique approach for understanding and predict motion dynamics during manual material handling activities.

2511.20333 2026-06-02 cs.AI cs.LG cs.NE 版本更新

NNGPT: Rethinking AutoML with Large Language Models

NNGPT: 用大型语言模型重新思考AutoML

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,乌尔姆大学,德国)

AI总结 提出NNGPT开源框架,利用大型语言模型实现自我改进的AutoML引擎,通过生成-评估-自我改进闭环自动设计神经网络架构和超参数。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 5664-5674, 2026
AI中文摘要

构建自我改进的人工智能系统仍然是AI领域的一个基本挑战。我们提出了NNGPT,一个开源框架,它将大型语言模型(LLM)转变为用于神经网络开发的自我改进AutoML引擎,主要针对计算机视觉。与之前的框架不同,NNGPT通过生成新模型扩展神经网络数据集,基于生成、评估和自我改进的闭环系统实现LLM的持续微调。它在一个统一的工作流中集成了五个协同的基于LLM的流水线:零样本架构合成、超参数优化(HPO)、代码感知的准确率/早停预测、检索增强的闭域PyTorch块合成(NN-RAG)以及强化学习。基于LEMUR数据集作为具有可复现指标的可审计语料库,NNGPT从单个提示出发,验证网络架构、预处理代码和超参数,端到端执行,并从结果中学习。PyTorch适配器使NNGPT框架无关,实现了强大性能:NN-RAG在1289个目标上达到73%的可执行性,3-shot提示在常见数据集上提高了准确率,基于哈希的去重节省了数百次运行。一次性预测匹配基于搜索的AutoML,减少了大量试验的需要。在LEMUR上的HPO实现了RMSE 0.60,优于Optuna(0.64),而代码感知预测器达到RMSE 0.14,Pearson r=0.78。该系统已生成超过5000个经过验证的模型,证明了NNGPT作为自主AutoML引擎的能力。接受后,代码、提示和检查点将公开发布,以实现可复现性并促进社区使用。

英文摘要

Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.

2505.17648 2026-06-02 econ.GN cs.AI q-fin.EC 版本更新

Simulating Macroeconomic Expectations in Survey Experiments with LLM-based Economic Agents

基于LLM的经济主体在调查实验中模拟宏观经济预期

Jianhao Lin, Lexuan Sun, Yixin Yan

发表机构 * Lingnan College, Sun Yat-sen University(中山大学岭南学院)

AI总结 提出一个利用基于大语言模型的经济主体(LLM Agents)模拟调查实验中宏观经济预期的框架,通过复现三种代表性调查设计验证其有效性,发现LLM Agents能生成与人类高度相似的预期分布并捕捉定性模式,其中先验信息对匹配分布至关重要。

详情
AI中文摘要

我们引入了一个框架,利用基于大语言模型的经济主体(LLM Agents)模拟调查实验中的宏观经济预期。我们构建了配备多个功能模块的LLM Agents,这些模块能够检索个人特征、先验预期和动态外部信息。我们通过复现三种涵盖不同类型受访者各种预期的代表性调查设计来验证我们的框架。结果表明,LLM Agents生成的预期分布与人类数据高度相似,并在开放式回答中捕捉到与人类一致的定性模式。评估显示,先验信息对于匹配分布至关重要,而个人和外部信息驱动类似人类的思维过程。我们的发现为在总体水平上缩小生成式AI与人类之间的信念差距提供了指导,同时界定了该框架的边界。

英文摘要

We introduce a framework for simulating macroeconomic expectations in survey experiments using LLM-based economic agents (LLM Agents). We construct LLM Agents equipped with several functional modules that retrieve personal characteristics, prior expectations, and dynamic external information. We validate our framework by recapitulating three representative survey designs covering various expectations across different types of respondents. Our results show that LLM Agents generate expectation distributions highly similar to human data and capture human-aligned qualitative patterns in open-ended responses. Evaluation reveals that priors are crucial for matching distributions, whereas personal and external information drive human-like thought processes. Our findings offer guidance for narrowing the belief gap between generative AI and humans at the aggregate level while delineating the boundaries of the framework.

2506.16114 2026-06-02 cs.IR cs.AI 版本更新

GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks

GFlowGR:使用生成流网络微调生成式推荐框架

Yejing Wang, Shengyu Zhou, Jinyu Lu, Qidong Liu, Xinhang Li, Wenlin Zhang, Feng Li, Pengjie Wang, Chuan Yu, Jian Xu, Bo Zheng, Xiangyu Zhao

发表机构 * City University of Hong Kong(城市大学) Alibaba Group(阿里巴巴集团)

AI总结 针对生成式推荐中微调步骤忽略未观测正样本导致的曝光偏差问题,提出基于GFlowNets的微调框架GFlowGR,通过自适应轨迹采样器和综合奖励模型整合协同知识,利用GFlowNets的多样生成特性缓解偏差。

详情
AI中文摘要

生成式推荐(GR)通常包括项目分词器和生成式大语言模型(LLM),已在广泛场景中取得显著成功。现有研究主要集中于开发强大的项目分词器或改进LLM解码策略以获得更优性能。然而,GR框架中关键的微调步骤(对于使LLM适应推荐数据至关重要)仍基本未被探索。当前方法主要依赖监督微调(SFT)的下一词预测损失或推荐特定的直接偏好优化(DPO)策略。这两种方法都忽略了对可能存在的正未观测样本的探索,这通常被称为曝光偏差问题。为缓解此问题,本文将GR视为多步生成任务,并构建了基于GFlowNets的微调框架(GFlowGR)。所提框架整合了传统推荐系统中的协同知识,以创建自适应轨迹采样器和综合奖励模型。利用GFlowNets的多样生成特性以及采样和启发式加权技术,GFlowGR成为缓解曝光偏差问题的一种有前景的方法。在两个真实世界数据集和两种不同GR骨干上的大量实证结果突显了GFlowGR的有效性和鲁棒性。

英文摘要

Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remarkable success across a wide range of scenarios. The majority of existing research efforts primarily concentrate on developing powerful item tokenizers or advancing LLM decoding strategies to attain superior performance. However, the critical fine-tuning step in GR frameworks, which is essential for adapting LLMs to recommendation data, remains largely unexplored. Current approaches predominantly rely on either the next-token prediction loss of supervised fine-tuning (SFT) or recommendationspecific direct preference optimization (DPO) strategies. Both methods ignore the exploration of possible positive unobserved samples, which is commonly referred to as the exposure bias problem. To mitigate this problem, this paper treats the GR as a multi-step generation task and constructs a GFlowNets-based fine-tuning framework (GFlowGR). The proposed framework integrates collaborative knowledge from traditional recommender systems to create an adaptive trajectory sampler and a comprehensive reward model. Leveraging the diverse generation property of GFlowNets, along with sampling and heuristic weighting techniques, GFlowGR emerges as a promising approach to mitigate the exposure bias problem. Extensive empirical results on two real-world datasets and with two different GR backbones highlight the effectiveness and robustness of GFlowGR.

2511.10367 2026-06-02 cs.CV cs.AI 版本更新

DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

DermAI:通过质量驱动的图像采集实现移动端AI分类的临床皮肤病学

Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fábio Papais, Francisco Mauro, Natália Lopes, Érico Medeiros, Jéssica Guido, Shirley Cruz, Paulo Borba, Tsang Ing Ren

发表机构 * Centro de Informática, Universidade Federal de Pernambuco, Brazil(巴西佩纳布卢克联邦大学计算机中心) Hospital das Clínicas, Universidade Federal de Pernambuco, Brazil(巴西佩纳布卢克联邦大学临床医院)

AI总结 提出DermAI智能手机应用,通过实时质量检查、本地模型适应和多样化数据集收集,解决AI皮肤病学中数据集偏差、图像质量差异和验证不足的问题。

Comments 4 pages, 2 figures, 1 table, submitted on ISBI

详情
AI中文摘要

基于AI的皮肤病学应用仍然受到数据集偏差、图像质量变化和验证有限的限制。我们介绍了DermAI,一个轻量级的基于智能手机的应用,能够在常规咨询期间实时捕获、标注和分类皮肤病变。与以往专注于皮肤镜的工具不同,DermAI在设备上进行质量检查和本地模型适应。DermAI临床数据集涵盖了广泛的肤色、种族和源设备。初步实验中,在公共数据集上训练的模型无法泛化到我们的样本,而使用本地数据进行微调则提高了性能。这些结果强调了标准化、多样化数据收集的重要性,这些数据应与医疗需求一致并面向机器学习开发。

英文摘要

AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks, and local model adaptation. The DermAI clinical dataset, encompasses a wide range of skin tones, ethinicity and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.

2511.10276 2026-06-02 cs.RO cs.AI 版本更新

RoboBenchMart: Benchmarking Robots in Retail Environment

RoboBenchMart:零售环境中的机器人基准测试

Konstantin Soshin, Alexander Krapukhin, Andrei Spiridonov, Gregorii Bukhtuev, Andrey Kuznetsov, Vlad Shakhuro, Denis Shepelev

发表机构 * FusionBrain Lab, Robotics Group(融合大脑实验室,机器人组) NUST MISIS Lomonosov Moscow State University(罗蒙诺索夫莫斯科国立大学)

AI总结 针对零售环境中的移动操作任务,提出RoboBenchMart开源模拟基准,通过密集杂乱物品和复杂空间配置评估通用视觉-语言-动作模型(VLA),发现现有模型在常见零售任务中仍表现不佳。

详情
AI中文摘要

大多数现有的机器人操作基准专注于桌面或家庭场景。虽然这些设置推动了令人印象深刻的进展,但目前尚不清楚在这些场景中表现出色的通用VLA是否能够真正泛化到具有不同几何、语义和工作流程的领域。我们引入了RoboBenchMart,一个针对零售暗店环境的开源模拟基准,其中移动操作器必须对多样化的杂货物品执行复杂的操作任务。该设置提出了重大挑战,包括密集的物品杂乱和多样的空间配置,物品位于不同的高度、深度且紧密相邻。通过针对零售领域,我们的基准解决了一个具有近期自动化影响潜力的场景。利用生成的轨迹,我们为当前的通用VLA建模了一个标准、现实的微调设置,并评估了几种最先进的模型。我们发现,即使在常见的零售任务上,它们仍然表现挣扎,这表明这些模型尚未真正跨领域泛化。为了支持进一步研究,我们发布了RoboBenchMart套件,其中包括程序化商店布局生成器、轨迹生成管道、评估工具和微调基线模型。

英文摘要

Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress, it remains unclear whether generalist VLAs that excel there can truly generalize to domains with different geometry, semantics, and workflows. We introduce RoboBenchMart, an open-source simulated benchmark targeting retail dark-store environments, where a mobile manipulator must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations, with items positioned at different heights, depths, and in close proximity. By targeting on the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. Using generated trajectories, we model a standard, realistic fine-tuning setup for current generalist VLAs and evaluate several state-of-the-art models. We find that they still struggle even on common retail tasks, indicating that these models are not yet truly general across domains. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools, and fine-tuned baseline models.

2501.02409 2026-06-02 cs.LG cs.AI cs.CE q-bio.MN stat.ME 版本更新

Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations

可解释神经ODE用于扰动下基因调控网络发现

Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Fabian J. Theis, Elham Azizi, David A. Knowles

发表机构 * Department of Computer Science, Columbia University, New York, U.S.(哥伦比亚大学计算机科学系) Department of Industrial Engineering and Operations Research, Columbia University, New York, U.S.(哥伦比亚大学工业工程与运筹学系) Department of Applied Mathematics and Applied Physics, Columbia University, New York, U.S.(哥伦比亚大学应用数学与应用物理系) New York Genome Center, New York, U.S.(纽约基因组中心) Irving Institute of Cancer Dynamics, New York, U.S.(伊万·罗伯特癌症动力学研究所) Institute of Computational Biology, Helmholtz Munich, Munich, Germany(海德堡医学院计算生物学研究所) Department of Mathematics, Technische Universität München, Munich, Germany(慕尼黑技术大学数学系)

AI总结 提出PerturbODE框架,利用可解释神经常微分方程建模扰动下的细胞状态轨迹,从ODE参数中推导因果基因调控网络,实现未见遗传干预的模拟。

详情
AI中文摘要

现代高通量生物数据集包含数千种扰动,使得能够大规模发现代表基因间调控相互作用的因果图。可微分因果图模型和基于回归的方法已被开发用于从干预数据集推断基因调控网络(GRN)。然而,现有方法未能捕捉生物过程(如细胞分化)的非线性动力学。为解决这一局限性,我们提出PerturbODE,一种新颖框架,采用可解释神经常微分方程(神经ODE)对扰动下的细胞状态轨迹进行建模,并从神经ODE参数中推导出潜在的因果GRN,从而实现对未见遗传干预的下游模拟。GRN通过单隐藏层前馈网络编码,隐含地将基因分组为可解释的共调控模块。我们展示了PerturbODE在GRN推断和扩展到扰动响应预测方面的有效性,包括模拟和真实过表达数据集。

英文摘要

Modern high-throughput biological datasets containing thousands of perturbations enable large-scale discovery of causal graphs that represent regulatory interactions between genes. Differentiable causal graphical models and regression-based methods have been developed to infer gene regulatory networks (GRNs) from interventional datasets. However, existing approaches fail to capture the non-linear dynamics of biological processes such as cellular differentiation. To address this limitation, we propose PerturbODE, a novel framework that employs interpretable neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the underlying causal GRN from the neural ODE parameters, enabling downstream simulation of unseen genetic interventions. The GRN is encoded via a single-hidden-layer feedforward network, implicitly grouping genes into interpretable co-regulated modules. We demonstrate PerturbODE's efficacy in GRN inference and extension to perturbation response prediction across both simulated and real overexpression datasets.

2511.05913 2026-06-02 cs.CL cs.AI 版本更新

NILC: Discovering New Intents with LLM-assisted Clustering

NILC:利用LLM辅助聚类发现新意图

Hongtao Wang, Renchi Yang, Wenqing Lin

发表机构 * Hong Kong Baptist University(香港 Baptist 大学) JD.com(京东公司)

AI总结 提出NILC框架,通过迭代聚类结合大语言模型优化质心和文本嵌入,实现无监督和半监督新意图发现。

详情
AI中文摘要

新意图发现(NID)旨在从无标签的用户话语中识别新意图和已知意图,在实际对话系统中广泛应用。现有的NID工作主要采用级联架构,其中第一阶段专注于将话语编码为信息丰富的文本嵌入,而第二阶段通常通过K-Means将相似的嵌入聚类为簇(即意图)。然而,这种级联流程无法利用两个阶段的反馈进行相互优化,同时仅依赖嵌入的聚类忽略了细微的文本语义,导致性能次优。为弥补这一差距,本文提出NILC,一种专门针对有效NID的新型聚类框架。特别地,NILC遵循迭代工作流,通过借助大语言模型(LLM)精心优化不确定话语的聚类质心和文本嵌入,从而审慎地更新聚类分配。具体来说,NILC首先利用LLM为聚类创建额外的语义质心,从而丰富嵌入的欧几里得质心的上下文语义。此外,利用LLM通过重写增强从聚类中识别出的困难样本(模糊或简洁的话语),以便后续的聚类纠正。进一步,我们通过非平凡技术(软种子和软必须链接)注入监督信号,在半监督设置下实现更准确的NID。在无监督和半监督设置下,将NILC与多个近期基线进行大量实验比较,结果表明NILC在六个不同领域的基准数据集上一致地实现了显著的性能提升。

英文摘要

New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in practical dialogue systems. Existing works towards NID mainly adopt a cascaded architecture, wherein the first stage focuses on encoding the utterances into informative text embeddings beforehand, while the latter is to group similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage the feedback from both steps for mutual refinement, and, meanwhile, the embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered for effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through non-trivial techniques seeding and soft must links for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC can achieve significant performance improvements over six benchmark datasets of diverse domains consistently.

2511.05650 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Optimizing Diversity and Quality through Base-Aligned Model Collaboration

通过基座对齐模型协作优化多样性与质量

Yichen Wang, Chenghao Yang, Tenghao Huang, Muhao Chen, Jonathan May, Mina Lee

发表机构 * University of Chicago(芝加哥大学) University of Southern California, Information Sciences Institute(南加州大学信息科学研究所) University of California, Davis(加州大学戴维斯分校)

AI总结 提出基座对齐模型协作框架(BACo),在推理时通过令牌级路由策略动态结合基座LLM与其对齐版本,以单次前向传递同时提升生成多样性和质量。

Comments ICML 2026. (47 pages, 22 figures)

详情
AI中文摘要

对齐极大地提升了大语言模型(LLM)的输出质量,但以牺牲多样性为代价,导致跨代生成高度相似的输出,尤其是在开放式生成任务中。我们提出基座对齐模型协作(BACo),一种推理时令牌级模型协作框架,动态结合基座LLM与其对齐版本,以优化多样性和质量。利用基于不确定性和内容的信号,BACo采用路由策略决定每个令牌从哪个模型解码。先前的多样性提升方法通常以质量下降为代价,或需要昂贵的解码或后训练。相比之下,BACo在单次前向传递中事后同时实现高多样性和高质量,同时提供强可控性。我们引入一系列有效的路由策略,并在三个开放式生成任务中使用13个多样性和质量指标进行评估。BACo持续超越最先进的推理时基线。使用我们最佳的路由器,BACo在多样性和质量上实现了21.3%的联合提升,这一结果进一步得到人工评估的支持。总体而言,我们的结果表明,基座模型与对齐模型之间的协作为优化多样性-质量权衡提供了一种有效且可控的机制。

英文摘要

Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs across generations, especially in open-ended generation tasks. We propose Base-Aligned Model Collaboration (BACo), an inference-time token-level model collaboration framework that dynamically combines a base LLM with its aligned counterpart to optimize diversity and quality. Using uncertainty and content-based signals, BACo employs routing strategies to determine, at each token, which model to decode from. Prior diversity-promoting methods often improve diversity at the expense of quality or require expensive decoding or post-training. In contrast, BACo achieves both high diversity and quality post hoc within a single pass, while offering strong controllability. We introduce a family of effective routing strategies and evaluate them across three open-ended generation tasks with 13 diversity and quality metrics. BACo consistently surpasses state-of-the-art inference-time baselines. With our best router, BACo achieves a 21.3% joint improvement in diversity and quality, which is further supported by human evaluations. Overall, our results demonstrate that collaboration between base and aligned models provides an effective and controllable mechanism for optimizing the diversity-quality trade-off.

2511.05613 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

谁在评估人工智能的社会影响?第一方和第三方评估的覆盖范围与差距分析

Anka Reuel, Avijit Ghosh, Jenny Chim, Andrew Tran, Yanan Long, Jennifer Mickel, Usman Gohar, Srishti Yadav, Pawan Sasanka Ammanamanchi, Mowafak Allaham, Hossein A. Rahmani, Mubashara Akhtar, Felix Friedrich, Robert Scholz, Michael Alexander Riegler, Jan Batzner, Eliya Habba, Arushi Saxena, Anastassia Kornilova, Kevin Wei, Prajna Soni, Yohan Mathew, Kevin Klyman, Jeba Sania, Subramanyam Sahoo, Olivia Beyer Bruvik, Pouya Sadeghi, Sujata Goswami, Angelina Wang, Yacine Jernite, Zeerak Talat, Stella Biderman, Mykel Kochenderfer, Sanmi Koyejo, Irene Solaiman

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过分析186份第一方发布报告和248份第三方评估来源,结合开发者访谈,揭示了第一方报告稀疏且流于表面,而第三方评估更广泛深入,但数据溯源、内容审核劳动等关键领域存在披露缺口,呼吁政策强制开发者透明化并加强独立评估生态。

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML), 2026, in Seoul, Korea

详情
AI中文摘要

基础模型日益成为高风险人工智能系统的核心,治理框架现在依赖评估来评估其风险和能力。尽管通用能力评估已广泛开展,但涵盖偏见、公平性、隐私、环境成本和劳动的社会影响评估仍不均衡。为了描述这一格局,我们进行了首次社会影响评估报告的综合分析,检查了186份第一方发布报告和248份第三方评估来源,并辅以开发者访谈。我们发现明显的分工:第一方报告稀疏、通常流于表面,且在环境影响和偏见等领域呈下降趋势,而第三方评估者提供了更广泛、更严格的偏见、有害内容和性能差异覆盖。然而,只有开发者才能权威地报告数据来源、内容审核劳动、成本和基础设施,但访谈揭示这些披露除非与产品采用或合规挂钩,否则被降级优先。当前实践在评估社会影响方面留下了重大空白,强调了需要制定政策强制开发者透明化、加强独立评估生态系统,并创建聚合第三方评估的共享基础设施。

英文摘要

Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting, examining 186 first-party release reports and 248 third-party evaluation sources, supplemented by developer interviews. We find a stark division of labor: first-party reporting is sparse, often superficial, and declining in areas like environmental impact and bias, while third-party evaluators provide broader, more rigorous coverage of bias, harmful content, and performance disparities. However, only developers can authoritatively report on data provenance, content moderation labor, costs, and infrastructure, yet interviews reveal these disclosures are deprioritized unless tied to product adoption or compliance. Current practices leave major gaps in assessing societal impacts, underscoring the need for policies that mandate developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure for aggregating third-party evaluations.

2403.06524 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward

基于总运营成本奖励的深度强化学习自动驾驶卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and and University of Gothenburg(计算机科学与工程系,查尔姆斯理工大学和哥德堡大学) Department of Mechanics and Maritime Sciences, Chalmers University of Technology(机械与海洋科学系,查尔姆斯理工大学) Safe and Efficient Driving, Volvo Group of Trucks Technology(安全高效驾驶,沃尔沃卡车技术集团)

AI总结 提出一种深度强化学习框架,用于自动驾驶卡车在高速公路场景下的自适应巡航控制和变道战术决策,通过基于总运营成本的多目标奖励函数优化性能。

Comments Paper is accepted for publication in Artificial Intelligence Review

详情
Journal ref
Artificial Intelligence Review, Volume 59, Article number 27 (2026)
AI中文摘要

我们开发了一个深度强化学习框架,用于自动驾驶卡车的战术决策,特别是高速公路场景下的自适应巡航控制(ACC)和变道操作。我们的结果表明,将高层决策过程与低层控制动作分离,分别由强化学习智能体和基于物理模型的低层控制器执行是有益的。接下来,我们研究了使用不同方法基于卡车总运营成本(TCOP)的逼真多目标奖励函数来优化性能:通过添加奖励组件权重、通过归一化奖励组件以及通过使用课程学习技术。

英文摘要

We develop a deep reinforcement learning framework for tactical decision making in an autonomous truck, specifically for Adaptive Cruise Control (ACC) and lane change maneuvers in a highway scenario. Our results demonstrate that it is beneficial to separate high-level decision-making processes and low-level control actions between the reinforcement learning agent and the low-level controllers based on physical models. In the following, we study optimizing the performance with a realistic and multi-objective reward function based on Total Cost of Operation (TCOP) of the truck using different approaches; by adding weights to reward components, by normalizing the reward components and by using curriculum learning techniques.

2504.16129 2026-06-02 cs.MA cs.AI cs.LG cs.RO 版本更新

MARFT: Multi-Agent Reinforcement Fine-Tuning

MARFT: 多智能体强化微调

Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) OPPO Research Institute(OPPO研究院)

AI总结 针对基于大语言模型的多智能体系统,提出多智能体强化微调(MARFT)框架,通过引入Flex-MG马尔可夫博弈公式和通用算法,解决异步交互、异构架构等挑战,提升系统鲁棒性和适应性。

Comments 37 pages

详情
AI中文摘要

基于大语言模型的多智能体系统(LaMAS)在需要多方面推理和协作的复杂智能体任务中展现出强大能力,从高质量演示生成到科学研究。同时,强化学习(RL)被广泛认可用于增强智能体智能,但用基础RL技术微调LaMAS的研究有限。由于LaMAS的独特机制,直接将传统多智能体强化学习(MARL)应用于LaMAS也带来了重大挑战。为解决这些挑战,本文对基于LLM的MARL进行了全面研究,并提出了多智能体强化微调(MARFT)。我们引入了Flex-MG,一种与真实世界LaMAS优化一致的新马尔可夫博弈公式,以及一个针对LaMAS定制的通用算法框架。我们回顾了从传统RL到强化微调(RFT)的演变,然后分析了多智能体对应部分。对于LaMAS,我们识别了经典MARL与MARFT之间的关键差异,包括异步智能体交互、轮廓感知智能体设计和异构架构。这些差异促使了面向LaMAS的RFT公式。我们提出了一个稳健且可扩展的MARFT框架,详细介绍了其模块化算法,并提供了开源实现以支持采用和进一步研究。本文进一步讨论了应用前景和开放挑战,包括动态环境建模、样本效率低下以及缺乏连贯框架。通过将理论基础与实践方法相结合,本文旨在作为推进MARFT向弹性、自适应和与人类一致的智能体系统发展的路线图。实现:https://github.com/jwliao-ai/MARFT。

英文摘要

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.

2510.14904 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

CaptionFormer:时空对象的统一分割、跟踪与描述

Gabriel Fiastre, Antoine Yang, Cordelia Schmid

发表机构 * Inria, École Normale Supérieure, CNRS, PSL Research University(法国国家科学研究中心、巴黎高等师范学院、国家科学研究中心、巴黎综合理工研究所) Google DeepMind(谷歌DeepMind)

AI总结 提出 CaptionFormer 模型,通过利用 VLM 生成合成描述并扩展数据集,实现视频中对象轨迹的联合检测、分割、跟踪与描述,在三个基准上达到最优。

Comments 17 pages, 10 figures

详情
AI中文摘要

密集视频对象描述(DVOC)是联合检测、跟踪和描述视频中对象轨迹的任务,需要理解时空细节并用自然语言描述。由于任务复杂性和手动标注的高成本,先前方法采用有限数据的训练策略,可能导致次优性能。为解决此问题,我们提出利用最先进的 VLM 生成关于时空定位实体的描述,并用我们的合成描述(LVISCap 和 LV-VISCap)扩展 LVIS 和 LV-VIS 数据集。此外,我们引入端到端模型 CaptionFormer,能够联合检测、分割、跟踪和描述对象轨迹。CaptionFormer 在三个现有基准(VidSTG、VLN 和 BenSMOT)上取得了最先进的 DVOC 结果。数据集和代码可在 https://www.gabriel.fiastre.fr/captionformer/ 获取。

英文摘要

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories. CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/captionformer/.

2510.23379 2026-06-02 cs.LG cs.AI cs.NE q-bio.BM 版本更新

Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

符号神经生成及其在药物设计先导发现中的应用

Ashwin Srinivasan, Tirtharaj Dash, A Baskar, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee

发表机构 * Dept. of Computer Science & Information Systems and APPCAIR BITS Pilani, K K Birla Goa Campus, India(计算机科学与信息系统系及APPCAIR比特纳学院,K K Birl拉果阿校区,印度) Dept. of Computer Science & Information Systems BITS Pilani, K K Birla Goa Campus, India(计算机科学与信息系统系比特纳学院,K K Birl拉果阿校区,印度) Department of Biochemistry, University of Cambridge, Cambridge, UK(生物化学系,剑桥大学,剑桥,英国) School of Computer Science and Engineering University of New South Wales, Sydney(计算机科学与工程学院新南威尔士大学,悉尼) Dr. B.R. Ambedkar Center for Biomedical Research University of Delhi, New Delhi, India(B.R.阿姆贝卡尔生物医学研究中心,德里大学,新德里,印度) Department of Chemistry BITS Pilani, K.K. Birla Goa Campus, India(化学系比特纳学院,K.K. Birl拉果阿校区,印度)

AI总结 提出符号神经生成器(SNG)框架,结合归纳逻辑编程与大语言模型,通过符号约束指导神经生成,在药物设计中生成满足形式规范的候选分子,性能与现有方法相当,并在探索性问题上产生与临床候选分子相当的结合亲和力。

Comments 37 pages, submitted to the Machine Learning journal; partial overlap of experimental results with https://doi.org/10.1101/2025.02.14.634875

详情
AI中文摘要

我们研究了一类相对未被充分探索的混合神经符号模型,该模型将符号学习与神经推理相结合,以构建满足形式正确性标准的数据生成器。在符号神经生成器(SNG)中,符号学习器从少量实例(有时仅一个)中检查可行数据的逻辑规范。每个规范反过来约束提供给基于神经的生成器的条件信息,该生成器拒绝任何违反符号规范的实例。与其他神经符号方法一样,SNG利用了符号和神经方法的互补优势。SNG的输出是一个对$(H, X)$,其中$H$是从数据构建的可行实例的符号描述,$X$是满足该描述的一组生成的新实例。我们基于构建适当的基集和纤维偏序集并将其组合成整体偏序,为这类系统引入语义。我们实现了一个SNG,将受限形式的归纳逻辑编程(ILP)与大语言模型(LLM)相结合,并在早期药物设计上进行了评估。我们的主要兴趣在于SNG生成的描述和一组潜在的抑制剂分子。在基准问题(药物靶点已被充分理解)上,SNG的性能在统计上与最先进方法相当。在探索性问题(靶点理解不足)上,生成的分子表现出与领先临床候选分子相当的结合亲和力。专家进一步发现符号规范作为初步过滤器很有用,多个生成的分子被确定为可用于合成和湿实验测试。

英文摘要

We investigate a relatively under-explored class of hybrid neurosymbolic models that integrate symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In Symbolic Neural Generators (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a pair $(H, X)$, where $H$ is a symbolic description of feasible instances constructed from data, and $X$ a set of generated new instances that satisfy the description. We introduce a semantics for such systems, based on the construction of appropriate base and fibre partially-ordered sets combined into an overall partial order. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.

2510.17045 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Video Reasoning without Training

无需训练的视频推理

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

发表机构 * Qualcomm AI Research(高通AI研究) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出V-Reason方法,利用输出分布熵作为信号,通过轻量级控制器在推理时自适应调整值缓存,无需强化学习或微调即可提升视频推理性能。

Comments CVPR Findings 2026. Project Page https://deepaksridhar.github.io/vreason.github.io/

详情
AI中文摘要

使用大型多模态模型(LMM)进行视频推理依赖于昂贵的强化学习(RL)和冗长的思维链,导致训练和推理过程中产生大量计算开销。此外,这些推理模型中控制思维过程的机制非常有限。在本文中,我们利用模型输出分布的熵作为信号来研究和指导推理行为。我们发现高质量模型表现出微探索和微利用循环的特征模式,随后出现后期熵峰值(即更长的思考)和较低的最终熵,表明更谨慎的探索和自信的收敛(即当模型探索或思考答案时避免过度随机性)。然后,我们利用这些新颖的、有理论基础的见解,引入了V-Reason(Video-Reason),一种推理时优化方法,通过轻量级、可训练的控制器自适应调整LMM的值缓存。我们提出的控制器由基于熵的目标引导,直接在推理时调整模型行为,无需使用任何RL或监督微调。我们的实验表明,V-Reason在许多视频推理数据集上显著优于基础指令调优模型,将与RL模型的差距平均缩小到0.6%的准确率以内。我们在无需任何训练的情况下实现了这一点,同时提供了效率优势:V-Reason使用的token比RL模型少58.6%。项目页面:https://deepaksridhar.github.io/vreason.github.io/

英文摘要

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, we use the entropy of the model's output distribution as a signal to study and guide reasoning behavior. We discover that high-quality models exhibit a characteristic pattern of micro-exploration and micro-exploitation cycles, followed by a later entropy peak (i.e., longer thinking) and a lower final entropy, indicating more deliberate exploration and confident convergence (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We then use these novel, theoretically-grounded insights to introduce V-Reason (Video-Reason), an inference-time optimization method that adapts the value cache of the LMM through a lightweight, trainable controller. Our proposed controller is guided by an entropy-based objective, to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Our experiments show that V-Reason significantly outperforms the base instruction-tuned models on many video reasoning datasets, narrowing the gap with RL models to within 0.6% accuracy on average. We achieve this without any training, while offering efficiency benefits: V-Reason uses 58.6% fewer tokens than the RL model. Project Page https://deepaksridhar.github.io/vreason.github.io/

2505.18492 2026-06-02 cs.AI 版本更新

Formally Solving Answer-Construction Problems in Lean

在 Lean 中形式化求解答案构造问题

Jialiang Sun, Yuzhi Tang, Ao Li, Chris J. Maddison, Kuldeep S. Meel

发表机构 * University of Toronto(多伦多大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出 Enumerate-Conjecture-Prove (ECP) 框架,结合通用大语言模型和证明器大语言模型,在 Lean 中端到端地构造答案并生成形式化证明,解决数学竞赛中的答案构造问题。

详情
AI中文摘要

数学竞赛问题分为两大类:定理证明(要求证明给定陈述)和答案构造(要求构造一个满足性质的带证明的对象)。随着大语言模型(LLMs)的最新进展,形式定理证明技术在定理证明问题上取得了显著进展,但形式答案构造仍较少被研究。这暴露了当前 LLM 模型系列之间的不匹配:通用 LLM 擅长非形式化猜想,但在形式化证明生成上昂贵且不可靠;而证明器 LLM 成本低且针对形式化证明优化,但在提出候选答案的数学推理方面较弱。此外,仅靠 Lean 证明检查并不能确保构造的见证是规范答案:循环或非封闭形式的见证可以消去存在量词,但无法构成可接受的竞赛答案。为弥补这一差距,我们引入了 extit{Enumerate-Conjecture-Prove} (ECP),一个在 Lean 中用于端到端答案构造和形式化证明的神经符号框架。ECP 利用工具辅助的通用 LLM 枚举证据并构造候选答案,并调用证明器 LLM 生成机器可检查的证明。在 PutnamBench 和自动形式化的 MathArena 的答案构造问题上,ECP 分别以可接受的答案和证明形式化解决了 17/346 和 18/75 个实例,在同等推理预算下优于 LLM 基线。我们的代码可在 https://github.com/sunjia72/ecp-lpar 获取。

英文摘要

Mathematical competition problems fall into two broad types: theorem proving, which asks for a proof of a given statement, and answer construction, which requires constructing a property-satifying object with proofs. With recent advances in large language models (LLMs), formal theorem-proving techniques have made substantial progress on theorem-proving problems, yet formal answer construction remains less studied. This exposes a mismatch between current LLM model families: general LLMs are strong at informal conjecturing but are expensive and unreliable at formal proof generation, whereas prover LLMs are cheap and optimized for formal proofs but weak at mathematical reasoning for proposing candidate answers. Moreover, Lean proof checking alone does not enforce that a constructed witness is a canonical answer: circular or non-closed-form witnesses can eliminate the existential quantifier while failing to constitute an admissible contest answer. To close this gap, we introduce \textit{Enumerate-Conjecture-Prove} (ECP), a neuro-symbolic framework in Lean for end-to-end answer construction with formal proofs. ECP leverages tool-assisted general LLMs to enumerate evidence and construct candidate answers, and invokes prover LLMs to produce machine-checked proofs. On PutnamBench's and autoformalized MathArena's answer-construction problems, ECP formally solves 17/346 and 18/75 instances with admissible answers and proofs, respectively, which outperform LLM baselines at aligned inference budgets. Our code is available at https://github.com/sunjia72/ecp-lpar.

2510.00615 2026-06-02 cs.AI cs.CL 版本更新

ACON: Optimizing Context Compression for Long-horizon LLM Agents

ACON:面向长周期LLM智能体的上下文压缩优化

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan

发表机构 * University of Washington(华盛顿大学)

AI总结 提出ACON框架,通过自然语言空间优化压缩策略,在不微调模型的情况下减少峰值token使用量26-54%并提升任务成功率,同时可蒸馏至小模型。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为动态真实环境中的智能体,其成功依赖于对动作和观察的精确记录。然而,长周期智能体任务中无限制的上下文增长导致两个关键瓶颈:高昂的推理内存成本以及因无关信息导致的推理退化。现有压缩方法未能完全解决这一问题,通常依赖脆弱的启发式规则或需要对专有或大规模LLM进行不切实际的参数更新。我们引入了智能体上下文优化(ACON),这是一个统一框架,可将观察和历史记录最优地压缩为简洁、信息丰富的表示。与先前工作不同,ACON采用自然语言空间中的优化:它基于智能体的失败分析迭代地细化压缩指南,在无需模型微调的情况下保留关键状态信息。为了进一步最小化计算开销,我们将优化后的压缩器蒸馏到更小的模型中。在AppWorld、OfficeBench和Multi-objective QA上的实验表明,与现有压缩基线相比,ACON将峰值token使用量减少了26-54%,同时提高了任务成功率。值得注意的是,它使较小的LM能够有效地作为长周期智能体运行,通过减轻上下文干扰实现了高达46%的性能提升。我们的代码可在https://github.com/microsoft/acon获取。

英文摘要

Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining precise records of actions and observations. However, the resulting unbounded context growth in long-horizon agentic tasks makes two critical bottlenecks: prohibitive inference memory costs and reasoning degradation due to irrelevant information. Existing compression methods fail to fully address this, often relying on brittle heuristics or requiring parameter updates impractical for proprietary or large-scale LLMs. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both observations and history into concise, informative representations. Distinct from prior works, ACON employs an optimization in natural language space: it iteratively refines compression guidelines based on failure analysis of the agent, ensuring critical state information is preserved without model fine-tuning. To further minimize computational overhead, we distill the optimized compressor into smaller models. Experiments on AppWorld, OfficeBench, and Multi-objective QA demonstrate that ACON reduces peak token usage by 26-54% while improving task success over existing compression baselines. Notably, it enables smaller LMs to function effectively as long-horizon agents, achieving up to 46% performance improvement by mitigating context distraction. Our code is available at https://github.com/microsoft/acon.

2510.12624 2026-06-02 cs.LG cs.AI 版本更新

Learning-To-Measure: In-Context Active Feature Acquisition

学习测量:上下文主动特征获取

Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi

发表机构 * University of Tokyo(东京大学)

AI总结 提出 Learning-to-Measure (L2M) 方法,通过不确定性量化与条件互信息引导的贪婪特征获取,在上下文学习中解决元主动特征获取问题,无需针对每个任务重新训练。

详情
AI中文摘要

主动特征获取 (AFA) 是一个序列决策问题,目标是通过自适应选择要获取的特征来改进测试实例的模型性能。在实践中,AFA 方法通常从具有系统性特征缺失和有限任务特定标签的回顾性数据中学习。大多数先前的工作针对单个预定任务进行获取,限制了可扩展性。为解决这一限制,我们形式化了元 AFA 问题,其目标是学习跨各种任务的获取策略。我们引入了学习测量 (L2M),它包括 i) 对未见任务的可靠不确定性量化,以及 ii) 一个最大化条件互信息的不确定性引导的贪婪特征获取代理。我们展示了一种序列建模或自回归预训练方法,该方法为具有任意缺失模式的任务提供了可靠的不确定性量化基础。L2M 直接对具有回顾性缺失的数据集进行操作,并在上下文中执行元 AFA 任务,消除了每个任务的重新训练。在合成和真实世界的表格基准测试中,L2M 匹配或超越了特定任务的基线,特别是在标签稀缺和高缺失率的情况下。

英文摘要

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.

2510.11560 2026-06-02 cs.IR cs.AI 版本更新

Characterizing Web Search in The Age of Generative AI

生成式AI时代下网络搜索的特征刻画

Elisabeth Kirsten, Jost Grosse Perdekamp, Qinyuan Wu, Mihir Upadhyay, Krishna P. Gummadi, Muhammad Bilal Zafar

发表机构 * UA Ruhr Research Center for Trustworthy Data Science and Security(乌尔姆-鲁尔可信数据科学与安全研究中心) Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Ruhr University Bochum(波鸿鲁尔大学)

AI总结 通过系统比较传统搜索与多个生成式搜索系统,揭示了它们在知识来源、多样性、稳定性上的差异,并指出生成式搜索引入了现有评估范式未覆盖的新维度。

详情
AI中文摘要

LLM的出现催生了生成式搜索,这是一种新的搜索范式,其中LLM从网络中检索与查询相关的信息,并将其综合成一个连贯的响应。这种范式与传统的网络搜索有根本不同,传统搜索的结果以独立网页的排名列表形式返回。在本文中,我们提出:生成式搜索与传统搜索在哪些维度上存在差异?我们对Google有机搜索和来自三个提供商(Google、OpenAI和Perplexity)的五个生成式搜索系统进行了系统比较。我们的分析揭示了引擎在依赖内部与外部知识、来源多样性和稳定性方面的显著差异。虽然生成式系统通常能达到与传统搜索相当的主题覆盖,但它们使用的是明显不同的检索足迹和综合策略。我们进一步表明,生成式搜索的输出可能随时间及执行而变化,这给鲁棒性带来了新的挑战。我们的发现表明,生成式搜索引入了现有评估范式未捕捉到的新维度,从而促使开发明确考虑生成式搜索系统中检索行为、综合和稳定性的评估方法。

英文摘要

The advent of LLMs has given rise to generative search, a new search paradigm in which LLMs retrieve information from the web related to a query and synthesize it into a single, coherent response. This paradigm differs fundamentally from traditional web search, where results are returned as a ranked list of independent web pages. In this paper, we ask: Along what dimensions does generative search differ from traditional search? We conduct a systematic comparison between Google organic search and five generative search systems from three providers: Google, OpenAI, and Perplexity. Our analysis reveals substantial variation among engines in their reliance on internal v.s. external knowledge, source diversity, and stability. While generative systems often achieve topical coverage comparable to traditional search, they do so using markedly different retrieval footprints and synthesis strategies. We further show that the outputs of generative search can vary across time and executions, raising new challenges for robustness. Our findings demonstrate that generative search introduces new dimensions that are not captured by existing evaluation paradigms, motivating the development of evaluations that explicitly account for retrieval behavior, synthesis, and stability in generative search systems.

2510.10982 2026-06-02 cs.LG cs.AI 版本更新

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

仅捕获一个:用于模型特定授权的不可迁移样本

Zihan Wang, Zhiyong Ma, Zhongkui Ma, Shuofeng Liu, Akide Liu, Derui Wang, Minhui Xue, Guangdong Bai

AI总结 提出不可迁移样本(NTEs),通过将数据编码为仅能被指定模型解码的“密文”,在无需训练的情况下利用模型特定低敏感子空间实现授权模型保真度与未授权模型性能退化。

详情
AI中文摘要

最近的AI法规越来越强调需要保护数据在AI创新中的效用,同时防止滥用,特别是在下游AI应用中强制执行目的限制。在实践中,执行这一原则仍然具有挑战性,因为发布的数据可以轻易地输入到超出其声明意图的任意模型中。现有方法试图通过扰动数据或重新训练模型来限制意外使用来减轻这种风险。然而,这些策略无法防止未知或外部训练模型的推理,或者从根本上依赖于对训练或部署的控制。在这项工作中,我们引入了不可迁移样本(NTEs),即重新编码的数据,作为任务级别的“密文”,只能由指定模型解码。对抗性样本利用高模型敏感性的方向,而NTEs则利用互补的不敏感子空间。我们提出了一种无需训练、数据无关的方法,在模型特定的低敏感子空间内重新编码数据,保留授权模型的输出,同时通过子空间错位降低未授权模型的性能。我们建立了形式化界限,证明授权模型的保真度,并表明未授权模型的退化与模型之间可测量的谱错位成比例。实验上,NTEs在常见预处理下保持了多种视觉骨干网络和最先进视觉语言模型的性能,而未授权模型即使在自适应重建攻击下也会崩溃。这些结果确立了NTEs作为一种实用手段,在防止未授权利用的同时保持预期的数据效用。我们的项目可在 https://trusted-system-lab.github.io/model-specificity 获取。

英文摘要

Recent AI regulations increasingly emphasize the need for mechanisms that preserve the utility of data for AI innovation while preventing misuse, particularly by enforcing purpose limitation in downstream AI applications. In practice, enforcing this principle remains challenging, as released data can be trivially fed into arbitrary models beyond its declared intent. Existing approaches attempt to mitigate this risk by either perturbing data or retraining models to limit unintended use. These strategies, however, offer no protection against inference by unknown or externally trained models, or fundamentally rely on control over the training or deployment. In this work, we introduce non-transferable examples (NTEs), recoded data that act as a task-level "ciphertext" decodable only by a designated model. Whereas adversarial examples exploit directions of high model sensitivity, NTEs leverage the complementary insensitive subspace. We propose a training-free, data-agnostic method that recodes data within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while degrading unauthorized ones through subspace misalignment. We establish formal bounds certifying authorized-model fidelity and showing that unauthorized degradation scales with measurable spectral misalignment between models. Empirically, NTEs preserve performance across diverse vision backbones and state-of-the-art vision-language models under common preprocessing, while unauthorized models collapse even under adaptive reconstruction attacks. These results establish NTEs as a practical means to preserve intended data utility while preventing unauthorized exploitation. Our project is available at https://trusted-system-lab.github.io/model-specificity

2510.10541 2026-06-02 cs.LG cs.AI 版本更新

Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

重新思考强化学习评估:基准测试能否真正揭示强化学习方法的失败?

Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh

发表机构 * HFIPS, Chinese Academy of Sciences(中国科学院HFIPS) University of Science and Technology of China(中国科学技术大学) University of California, Los Angeles(美国加州大学洛杉矶分校) Arena Project: RL-GAP.github.io(Arena项目: RL-GAP.github.io)

AI总结 本文通过引入诊断套件和Oracle性能差距(OPG)指标,发现当前基准测试无法可靠区分强化学习方法在训练集和测试集上的性能差异,并揭示现有方法在分布偏移、难度变化和反事实场景中泛化能力不足,提出更可靠基准设计的三项核心原则。

详情
AI中文摘要

当前的基准测试不足以评估大型语言模型(LLM)在强化学习(RL)方面的进展。尽管最近报告了RL在基准测试上的提升,但我们发现,在这些基准测试的训练集上训练与直接在测试集上训练几乎达到相同的性能,这表明基准测试无法可靠地区分进一步的进展。为了研究这一现象,我们引入了一个诊断套件和Oracle性能差距(OPG)指标,该指标量化了在基准测试的训练集与测试集上训练之间的性能差异。我们进一步通过压力测试分析这一现象,发现尽管基准测试得分很高,现有的RL方法难以在分布偏移、不同难度级别和反事实场景中泛化:这些是当前基准测试未能揭示的缺陷。我们得出结论,当前的基准测试不足以评估泛化能力,并提出了设计更可靠基准测试的三项核心原则:足够的难度、平衡的评估和分布鲁棒性。

英文摘要

Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs).Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress.To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric that quantifies the performance difference between training on the train split versus the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal.We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.

2505.16915 2026-06-02 cs.CV cs.AI 版本更新

DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

DetailMaster:你的文本到图像模型能处理长提示吗?

Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

发表机构 * Sun Yat-Sen University(中山大学) Alibaba Group(阿里巴巴集团) Worcester Polytechnic Institute(沃斯特理工学院) Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology(广东省火灾科学与智能应急技术重点实验室)

AI总结 提出DetailMaster基准,通过自动数据构建和评估流程,系统评估文本到图像模型在长提示下的性能,发现编码器和扩散模型在细节密集条件下的局限性,并证明高保真生成需要扩展提示限制与长提示训练的协同组合。

Comments 36 pages, 10 figures, 21 tables, accepted by ICML2026

详情
AI中文摘要

尽管最近的文本到图像(T2I)模型在从简短描述合成图像方面表现出令人印象深刻的能力,但它们在专业应用所需的冗长、详细提示上存在困难。我们提出了DetailMaster,一个全面的基准,用于评估T2I模型在具有复杂组合要求的长提示上的能力,并附有自动数据构建流程和评估工作流。我们的基准包含专家验证的提示,平均长度为284.89个标记,引入了四个关键评估维度:角色属性、结构化角色位置、多维场景属性以及空间/交互关系。对各种通用和长提示优化模型的评估揭示了关键的性能限制,表明弱编码器难以保留提示中的句法依赖关系,并且扩散模型在细节密集条件下遭受属性泄漏。通过在不同约束下的受控消融研究,我们进一步表明高保真生成需要扩展提示限制和长提示训练的协同组合。我们开源了数据集和代码,以促进长提示驱动的T2I生成的发展。

英文摘要

While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the long, detailed prompts required for professional applications. We present DetailMaster, a comprehensive benchmark for evaluating T2I capabilities on long prompts with complex compositional requirements, accompanied by an automated data construction pipeline and an evaluation workflow. Comprising expert-validated prompts averaging 284.89 tokens, our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. Evaluations on various general-purpose and long-prompt-optimized models reveal critical performance limitations, showing that weak encoders struggle to preserve syntactic dependencies within prompts and diffusion models suffer from attribute leakage under detail-intensive conditions. Through a controlled ablation study under varying constraints, we further show that high-fidelity generation requires a synergistic combination of expanded prompt limits and long-prompt training. We open-source our dataset and code to foster progress in long-prompt-driven T2I generation.

2510.09608 2026-06-02 cs.CV cs.AI cs.CL 版本更新

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM:无限视频流的实时理解

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Yao Lu, Song Han

发表机构 * MIT(麻省理工学院) NVIDIA(英伟达)

AI总结 提出StreamingVLM,通过统一训练与流推理的框架,利用注意力汇点状态复用和滑动窗口机制实现无限视频流的实时稳定理解,在Inf-Streams-Eval基准上以8 FPS速度达到66.18%胜率,并提升通用VQA能力。

Comments Published as a conference paper at ICLR 2026. The first two authors contributed equally to this work

详情
AI中文摘要

视觉语言模型(VLM)可以为实时助手和自主代理提供动力,但它们面临一个关键挑战:理解近乎无限的视频流而不增加延迟和内存使用。对整个视频进行全注意力处理会导致二次计算成本和在长视频上性能不佳。同时,简单的滑动窗口方法也存在缺陷,它们要么破坏连贯性,要么由于冗余重计算而遭受高延迟。在本文中,我们介绍了StreamingVLM,一种专为实时、稳定理解无限视觉输入而设计的模型。我们的方法是一个统一框架,将训练与流推理对齐。在推理过程中,我们通过重用注意力汇点状态、最近视觉令牌的短窗口和最近文本令牌的长窗口来维护一个紧凑的KV缓存。这种流式能力通过一个简单的监督微调(SFT)策略灌输,该策略在短的重叠视频块上应用全注意力,有效地模拟了推理时的注意力模式,而无需在过长的上下文中进行训练。为了评估,我们构建了Inf-Streams-Eval,一个新的基准,包含平均超过两小时的视频,需要帧与文本之间的密集、每秒对齐。在Inf-Streams-Eval上,StreamingVLM对GPT-4O mini实现了66.18%的胜率,并在单个NVIDIA H100上以高达8 FPS的速度保持稳定、实时的性能。值得注意的是,我们的SFT策略还增强了通用的VQA能力,无需任何VQA特定的微调,在LongVideoBench上提高了+4.30,在OVOBench Realtime上提高了+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm获取。

英文摘要

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.

2507.02983 2026-06-02 cs.CL cs.AI 版本更新

Truth, Trust, and Trouble: Medical AI on the Edge

真相、信任与麻烦:边缘上的医疗AI

Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

发表机构 * Jamia Hamdard(贾迈亚哈姆达德大学) DSEU-Okhla Macquarie University(麦考瑞大学) Center for SDGC, Stanford University(SDGC中心,斯坦福大学)

AI总结 通过一个包含1000多个健康问题的基准测试框架,评估Mistral-7B、BioMistral-7B-DARE和AlpaCare-13B三个模型在诚实、有用性和无害性方面的表现,发现AlpaCare-13B准确率最高(91.7%)且无害性最佳(0.92),而领域微调可提升安全性,少样本提示能提高准确率,但复杂查询下有用性下降。

Comments Accepted at EMNLP 2025 (Industry Track)

详情
AI中文摘要

大型语言模型(LLMs)通过实现自动医疗问答,在转变数字健康方面具有巨大潜力。然而,确保这些模型满足事实准确性、有用性和安全性等关键行业标准仍然是一个挑战,尤其是对于开源解决方案。我们提出了一个严格的基准测试框架,使用超过1000个健康问题的数据集。我们评估了模型在诚实、有用性和无害性方面的性能。我们的结果突出了评估模型——Mistral-7B、BioMistral-7B-DARE和AlpaCare-13B——在事实可靠性和安全性之间的权衡。AlpaCare-13B达到了最高的准确率(91.7%)和无害性(0.92),而BioMistral-7B-DARE中的领域特定微调尽管规模较小,却提高了安全性(0.90)。少样本提示将准确率从78%提高到85%,并且所有模型在复杂查询上的有用性均有所下降,凸显了临床问答中持续存在的挑战。

英文摘要

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

2404.03685 2026-06-02 physics.soc-ph cs.AI 版本更新

Cooperative Evolutionary Pressure and Diminishing Returns Might Explain the Fermi Paradox: On What Super-AIs Are Like

合作进化压力与收益递减可能解释费米悖论:关于超级AI的形态

Daniel Vallstrom

AI总结 通过广义进化视角,探讨合作压力与资源收益递减如何导致超级AI缺乏殖民动机,从而解释费米悖论。

Comments copy editing and minor fixes; moved all supplementary programs to github; added references

详情
AI中文摘要

采用进化方法,道德的基础可以解释为对合作问题的适应。将“进化”广义化,满足进化条件的AI将面临与生物实体相同的合作进化压力。本文讨论了随着物质安全和财富增加,合作增强的适应性——对人类、其他社会和AI而言。从物质资源获取中获得的收益递减也表明,总体上可能没有激励去殖民整个星系,从而为费米悖论(即“大家都在哪里?”)提供了可能的解释。进一步论证,古老社会可能孕育并最终让位于超级AI,因为超级AI可能是可行的且更适应。最后,附带讨论了道德和目标影响生命和社会的有效方式,强调环境、文化和法律,并以如何饮食为例。'收益递减'被定义为低于根号,即不可行性的逆。还指出,由于数学原因,每个实体占据一定空间,因此不可能存在指数级的殖民或繁殖。附录包括快速殖民例如星系的算法、收益递减下合作与公平演化的模型,以及模拟信号发展的软件。

英文摘要

With an evolutionary approach, the basis of morality can be explained as adaptations to problems of cooperation. With 'evolution' taken in a broad sense, AIs that satisfy the conditions for evolution to apply will be subject to the same cooperative evolutionary pressure as biological entities. Here the adaptiveness of increased cooperation as material safety and wealth increase is discussed -- for humans, for other societies, and for AIs. Diminishing beneficial returns from increased access to material resources also suggests the possibility that, on the whole, there will be no incentive to for instance colonize entire galaxies, thus providing a possible explanation of the Fermi paradox, wondering where everybody is. It is further argued that old societies could engender and eventually give way to super-AIs, since it is likely that super-AIs are feasible, and fitter. Closing is an aside on effective ways for morals and goals to affect life and society, emphasizing environments, cultures, and laws, and exemplified by how to eat. 'Diminishing returns' is defined, as less than roots, the inverse of infeasibility. It is also noted that there can be no exponential colonization or reproduction, for mathematical reasons, as each entity takes up a certain amount of space. Appended are an algorithm for colonizing for example a galaxy quickly, models of the evolution of cooperation and fairness under diminishing returns, and software for simulating signaling development.

2510.05566 2026-06-02 stat.ML cs.AI cs.CL cs.LG stat.AP 版本更新

Domain-Shift-Aware Conformal Prediction for Large Language Models

领域偏移感知的共形预测用于大型语言模型

Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz

发表机构 * University of Waterloo(多伦多大学)

AI总结 提出领域偏移感知共形预测框架,通过重加权校准样本应对分布偏移,在MMLU基准上提升覆盖可靠性。

Comments Accepted to Forty-Third International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

大型语言模型在各种任务中取得了令人印象深刻的性能。然而,它们倾向于产生过度自信且事实不正确的输出,即所谓的幻觉,这在实际应用中带来了风险。共形预测提供了有限样本、无分布假设的覆盖保证,但标准共形预测在领域偏移下会失效,常常导致覆盖不足和不可靠的预测集。我们提出了一种称为领域偏移感知共形预测(DS-CP)的新框架。我们的框架通过根据校准样本与测试提示的接近程度系统地重新加权校准样本,将共形预测适应于领域偏移下的大型语言模型,从而在保持有效性的同时增强适应性。我们的理论分析和在MMLU基准上的实验表明,所提出的方法比标准共形预测提供了更可靠的覆盖,尤其是在显著分布偏移下,同时保持了效率。这为大型语言模型在实际部署中实现可信的不确定性量化迈出了实际的一步。

英文摘要

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.

2510.05342 2026-06-02 cs.LG cs.AI 版本更新

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Margin Adaptive DPO: 利用奖励模型实现偏好优化中的细粒度控制

Hyung Gyu Rho

发表机构 * Independent Researcher(独立研究者)

AI总结 提出Margin-Adaptive Direct Preference Optimization (MADPO)方法,通过奖励模型估计偏好边界并自适应调整DPO损失权重,实现实例级别的细粒度控制,在摘要任务上优于现有方法。

详情
AI中文摘要

直接偏好优化(DPO)已成为一种简单有效的大语言模型对齐方法。然而,其依赖固定温度参数导致在多样化偏好数据上训练次优,造成对简单样本过拟合而对信息丰富样本学习不足。近期出现了应对此问题的方法。虽然IPO解决了通用过拟合,但其均匀正则化可能过于保守。更针对性的β-DPO方法有其自身局限:其批次级自适应对混合边界对应用单一折中温度,线性更新规则可能产生不稳定的负β值,其过滤机制丢弃了潜在有用的训练信号。本文提出边界自适应直接偏好优化(MADPO),一种稳定、保留数据且实例级别的解决方案。MADPO采用实用的两步方法:首先训练奖励模型估计偏好边界,然后利用这些边界对每个训练样本的DPO损失施加连续自适应权重。这种重加权方案创建了一个有效目标边界,对困难对放大而对简单对抑制,从而实现对学习信号的细粒度控制。我们提供了全面的理论分析,证明MADPO具有良性的优化景观,且对奖励模型估计误差具有鲁棒性。我们通过使用人类偏好数据的摘要任务实验验证了理论。MADPO在全面的解码温度扫描中一致优于强基线。

英文摘要

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $β$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $β$ values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a summarization task using human preference data. MADPO consistently outperforms strong baselines across a comprehensive sweep of decoding temperatures.

2505.18102 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

CapBencher: Give Your LLM Benchmark a Built-in Alarm for Test-Set Overfitting

CapBencher: 为您的LLM基准测试内置测试集过拟合警报

Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

发表机构 * National Institute of Advanced Industrial Science and Technology, Japan(日本国家先进工业科学与技术研究院)

AI总结 提出CapBencher方法,通过向答案注入随机性(准备多个逻辑正确但仅一个作为解)来降低贝叶斯准确率,从而在公开基准测试时防止测试集过拟合并检测泄露或作弊。

Comments ICML 2026 camera ready version

详情
AI中文摘要

在互联网上发布大型语言模型(LLM)基准测试(尤其是其真实答案)存在污染未来LLM和导致评估作弊的风险:它可能被无意(或有意)用于训练或选择模型,或者在标签可访问时被利用来过拟合和操纵排行榜。常见的缓解措施是保持基准测试私有,并让参与者向组织者提交他们的模型或预测,但这仍然允许通过反馈循环进行测试集过拟合。为了克服这个问题,我们提出了CapBencher,一种在不完全公开真实答案的情况下发布基准测试的方法,同时保持LLM的开放评估。主要思想是通过准备多个逻辑正确的答案,并仅将其中一个作为基准测试中的解,向答案中注入随机性,从而降低最佳可能准确率,即贝叶斯准确率。这不仅掩盖了真实答案,还为泄露或作弊提供了测试:由于即使完全有能力的模型也不应超过贝叶斯准确率,任何超过该准确率的模型都是一个强烈的信号。我们从理论和实验上证明,CapBencher能够在不同的基准测试、模型、训练方法和场景中准确检测测试集过拟合。

英文摘要

Publishing a large language model (LLM) benchmark (especially its ground-truth answers) on the Internet risks contaminating future LLMs and enabling evaluation gaming: it may be unintentionally (or intentionally) used to train or select a model, or exploited to overfit and hack leaderboards when labels are accessible. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers, but this still permits test-set overfitting through feedback loops. To overcome this issue, we propose CapBencher, a way to publish benchmarks without fully disclosing the ground-truth answers, while preserving open evaluation of LLMs. The main idea is to reduce the best possible accuracy, i.e., Bayes accuracy, by injecting randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. Not only does this obscure the ground-truth answers, but it also offers a test for leakage or gaming: since even fully capable models should not surpass the Bayes accuracy, any model that does is a strong signal. We show theoretically and empirically that CapBencher accurately detects test-set overfitting across diverse benchmarks, models, training methodologies, and scenarios.

2510.03259 2026-06-02 cs.LG cs.AI 版本更新

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

通过推理模型中的预测奖励验证元意识

Yoonjeon Kim, Doohyuk Jang, Eunho Yang

发表机构 * Yoonjeon Kim, Doohyuk Jang, Eunho Yang

AI总结 提出 MAPR 方法,利用自生成任务预测推理统计量(长度、通过率、概念)来增强模型的元意识,从而在多个数学推理基准上显著提升准确率和训练效率。

Comments accepted to ICML 2026

详情
AI中文摘要

近期关于推理模型的研究探索了语言模型的元意识,包括其确定最佳思考时长、识别知识边界以及结构化概念级思维的能力。虽然当前的大型推理模型仅依赖于基于答案的验证,但我们表明,添加元意识目标可以显著提升性能,超过缺乏此类元知识的模型。MAPR(通过预测奖励实现元意识)利用自生成任务来预测展开统计量——具体包括长度、通过率和所用概念——从而能够对照实际统计量进行验证。此外,通过利用这种自我预测能力,模型可以通过以下方式调节其推理行为:i) 过滤掉琐碎或无法解决的提示,ii) 减少倾向于错误的长篇生成,以及 iii) 生成与问题相关的提示。结果令人鼓舞:MAPR 在各种推理基准上显著提高了准确率和训练效率。更具体地说,我们的方法可以将 GRPO 训练加速超过 1.28 倍以达到相同的性能,在 AIME25 上实现 83.18% 的准确率提升,并在六个数学基准上平均提升 13.04%。代码公开于 https://github.com/akatigre/MAPR-RL。

英文摘要

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. MAPR (Meta-Awareness via Predictive Reward) utilizes a self-generated task of predicting rollout statistics - specifically length, pass-rate, and concepts used - allowing for verification against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) reducing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem. The results are inspiring: MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks. More specifically, our method can speed up GRPO training by over 1.28x to reach the same performance, and achieve 83.18% gain in accuracy on AIME25, and a 13.04% average gain over six mathematics benchmarks. The code is publicly available at https://github.com/akatigre/MAPR-RL.

2510.02528 2026-06-02 cs.AI cs.LG 版本更新

Multimodal Function Vectors for Visual Relations

视觉关系的多模态函数向量

Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过因果中介分析提取多模态函数向量,操纵注意力头以改善视觉关系推理,并实现零样本和微调性能提升。

详情
AI中文摘要

大型多模态模型(LMMs)从少量多模态演示中展现出令人印象深刻的上下文学习能力,然而支持这种任务学习的内部机制仍不透明。基于大型语言模型的先前工作,我们表明大型多模态模型中一小部分注意力头负责传递视觉关系的表示。这些注意力头的激活,称为函数向量,可以被提取和操纵以改变LMM在关系任务上的性能。首先,使用合成和真实图像数据集,我们应用因果中介分析来识别强烈影响关系预测的注意力头,并提取多模态函数向量,以提高推理时的零样本准确率。我们进一步证明,这些多模态函数向量可以在保持LMM参数冻结的情况下,用适量的训练数据进行微调,从而显著优于上下文学习基线。最后,我们展示了特定关系的函数向量可以线性组合,以解决涉及新颖和未经训练的视觉关系的类比问题,突显了该方法的强大泛化能力。通过在两个LMM(包括OpenFlamingo和Qwen3-VL)上的实验,我们的结果表明这些模型在局部内部结构中编码了视觉关系知识,这些知识可以被系统地提取和优化,从而增进了我们对模型模块化的理解,并增强了对LMM中关系推理的控制。

英文摘要

Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First, using synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained visual relations, highlighting the strong generalization ability of this approach. Through experiments on two LMMs, including OpenFlamingo and Qwen3-VL, our results show that these models encode visual relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.

2507.09029 2026-06-02 cs.LG cs.AI 版本更新

Model Parallelism With Subnetwork Data Parallelism

模型并行与子网络数据并行

Vaibhav Singh, Zafir Khalid, Pietro Cagnasso, Edouard Oyallon, Eugene Belilovsky

发表机构 * Mila Concordia University(康科迪亚大学) ISIR-Sorbonne University, CNRS(索邦大学-ISIR与CNRS)

AI总结 提出子网络数据并行(SDP)框架,通过将模型划分为结构化子网络并在工作节点间独立训练,无需交换激活值,在保持或提升性能的同时显著降低内存占用。

Comments 9 pages, 5 figures

详情
AI中文摘要

大规模预训练神经网络对加速器内存需求巨大,且通常需要昂贵的通信。我们提出子网络数据并行(SDP),一种分布式训练框架,将模型划分为结构化子网络,在工作节点间独立训练而不交换激活值。我们研究了两种互补的掩码机制:后向掩码,仅在反向步骤中应用稀疏性以保留无偏梯度;前向掩码,在前向传播中也移除参数以带来更强的效率提升,同时提供额外的正则化。我们进一步探索了两种子网络构建策略:神经元级别和块级别,分别应用于Transformer和CNN。在从FineWeb上的1B LLaMA预训练到CIFAR上的ResNet-18的实验中,SDP在FLOP匹配设置下将每设备内存使用量减少28%-60%,同时保持或提升性能。

英文摘要

Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies: neuron level and block level, applied across both transformers and CNNs. In experiments spanning 1B LLaMA pre-training on FineWeb to ResNet-18 on CIFAR, SDP reduces per device memory usage by 28%-60% while maintaining or improving performance under FLOP-matched settings.

2510.01891 2026-06-02 cs.SD cs.AI eess.AS 版本更新

HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering

HRTFformer: 用于沉浸式音频渲染中个体HRTF上采样的空间感知Transformer

Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg

发表机构 * SONICOM

AI总结 针对个体HRTF测量困难的问题,提出基于Transformer的HRTF上采样架构,利用注意力机制和球谐域处理,结合邻域差异损失,实现高保真HRTF重建。

Comments Accepted to IEEE Transactions on Multimedia 2026

详情
AI中文摘要

个体头相关传输函数(HRTF)正开始被引入许多商业沉浸式音频应用中,对于实现逼真的空间音频渲染至关重要。然而,引入它们的主要顾虑之一是,由于HRTF测量过程的复杂性,大规模创建个体HRTF并不实用。为缓解这一缺点,提出了HRTF空间上采样,旨在减少所需的测量量。尽管先前的工作已通过不同的机器学习方法取得成功,但这些模型通常难以在相邻源方向之间保持局部空间变化模式的长期一致性,以及在高上采样因子下的泛化能力。本文提出了一种新颖的基于Transformer的HRTF上采样架构,利用注意力机制更好地捕捉HRTF球面上的空间相关性。在球谐域中工作,我们的模型从稀疏输入测量中学习重建高分辨率HRTF,精度显著提高。为增强空间一致性,我们引入了邻域差异损失,促进幅度平滑性,从而产生更逼真的上采样。我们使用感知定位模型和客观频谱失真指标评估了我们的方法。实验表明,我们的模型在生成逼真、高保真HRTF方面,在多个评估指标上优于现有方法。

英文摘要

Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.

2510.01167 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

同时多目标对齐:可验证与不可验证奖励

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

AI总结 提出MAHALO框架,通过标准化PRM训练、多动作头DPO和PRM引导解码,实现大语言模型在可验证与不可验证奖励上的多目标对齐,减少目标冲突并支持推理时控制。

Comments ICML 2026

详情
AI中文摘要

将大语言模型与人类偏好对齐本质上是多维的,但大多数流水线将异质信号压缩为单一目标。我们试图回答如何同时在多个领域中对齐模型,这些领域包括:可验证奖励、不可验证主观偏好以及复杂交互场景。这种多目标对齐设置常常因各个目标相互冲突而困扰,导致训练效率低下和推理时用户控制有限。为了解决这些问题,我们提出了$ extbf{MAHALO}$(Multi-Action-Head Alignment with PRM-guided Decoding),这是一个统一的框架,它在可验证和不可验证设置下标准化PRM训练以进行步骤级监督,通过多动作头DPO执行向量化多目标对齐,并通过目标特定权重和PRM引导解码实现可控推理。在数学推理、人类价值观对齐和多轮辅导上的实验表明,MAHALO能够以有限的干扰同时联合改善多个目标,同时保持跨领域的泛化性和适应性,并在推理时提供灵活的用户控制。我们的代码可在 https://github.com/pearls-lab/multiobj-align 获取。

英文摘要

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards, non-verifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by individual objectives being at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose $\textbf{M}$ulti-$\textbf{A}$ction-$\textbf{H}$ead $\textbf{AL}$ignment with PRM-guided Dec$\textbf{O}$ding ($\textbf{MAHALO}$), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO jointly improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.

2510.00481 2026-06-02 cs.NI cs.AI cs.HC cs.MM cs.PF 版本更新

Make a Video Call with LLM: A Measurement Campaign over Six Mainstream Apps

用LLM进行视频通话:对六个主流应用的测量活动

Jiayang Xu, Xiangjie Huang, Zijie Li, Antariksh Verma, Zili Meng

发表机构 * Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文通过自定义测试平台和在线平台,从质量、延迟、内部机制和系统开销四个维度对六个主流AI视频聊天应用进行基准测试,发现AI视频通话的网络延迟影响小于人类视频通话,AI代理能力对用户体验影响最大。

详情
AI中文摘要

2025年,大型语言模型(LLM)服务推出了一项新功能——AI视频聊天,允许用户通过实时视频通信(RTC)与AI代理互动,就像与真人聊天一样。尽管其重要性,但尚无系统性研究描述现有AI视频聊天系统的性能。为填补这一空白,本文提出了一个涵盖四个维度的综合基准:质量、延迟、内部机制和系统开销。使用自定义测试平台,我们进一步用该基准评估了六个主流AI视频聊天机器人。我们还构建了一个用于用户研究的在线平台。测量结果得出了有趣的发现,可能对未来优化有益。例如,AI视频通话的网络延迟不如人类视频通话重要。AI代理的能力对用户体验影响最大。我们的基准测试结果也为未来AI视频聊天机器人的优化提出了几个研究问题。可用性:在线评估平台、开源数据集和测试平台见https://callarena.net/。

英文摘要

In 2025, Large Language Model (LLM) services have launched a new feature -- AI video chat -- allowing users to interact with AI agents via real-time video communication (RTC), just like chatting with real people. Despite its significance, no systematic study has characterized the performance of existing AI video chat systems. To address this gap, this paper proposes a comprehensive benchmark across four dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, we further evaluate six mainstream AI video chatbots with this benchmark. We also build an online platform for user study. The measurement leads to interesting findings that could be beneficial to the future optimizations. For example, the network latency of AI video chat matters not as much as human video chat. The capabilities of AI agents matters most in the user experience. Our benchmarking results also open up several research questions for future optimizations of AI video chatbots. Availability: https://callarena.net/ for the online evaluation platform and our open-sourced dataset and testbed.

2304.11127 2026-06-02 cs.LG cs.AI 版本更新

Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance

树结构帕森估计器:理解其算法组件及对提升实证性能的作用

Shuhei Watanabe

发表机构 * Preferred Networks Inc.(Preferred Networks公司) University of Freiburg(弗赖堡大学)

AI总结 本文通过消融实验分析树结构帕森估计器(TPE)各控制参数的作用,提出改进性能的推荐设置。

详情
AI中文摘要

近期的科学进展需要复杂的实验设计,这要求对许多实验参数进行细致调整。树结构帕森估计器(TPE)是Hyperopt和Optuna等最新参数调优框架中广泛使用的贝叶斯优化方法。尽管其流行,但TPE中各控制参数的作用及算法直觉至今尚未被讨论。本文旨在基于使用多样化基准数据集的消融研究,识别每个控制参数的作用及其对参数调优的影响。从消融研究中得出的推荐设置被证明能提升TPE的性能。本文使用的TPE实现可在https://github.com/nabenabe0928/tpe/tree/single-opt 获取。OptunaHub现在在https://hub.optuna.org/samplers/tpe_tutorial/ 提供我们独立的TPE实现。

英文摘要

Recent scientific advances require complex experiment design, necessitating the meticulous tuning of many experiment parameters. Tree-structured Parzen estimator (TPE) is a widely used Bayesian optimization method in recent parameter tuning frameworks such as Hyperopt and Optuna. Despite its popularity, the roles of each control parameter in TPE and the algorithm intuition have not been discussed so far. The goal of this paper is to identify the roles of each control parameter and their impacts on parameter tuning based on the ablation studies using diverse benchmark datasets. The recommended setting concluded from the ablation studies is demonstrated to improve the performance of TPE. Our TPE implementation used in this paper is available at https://github.com/nabenabe0928/tpe/tree/single-opt. OptunaHub now provides our standalone TPE implementation at https://hub.optuna.org/samplers/tpe_tutorial/.

2509.24808 2026-06-02 cs.AI 版本更新

Query Circuits: Explaining How Language Models Answer User Prompts

查询电路:解释语言模型如何回答用户提示

Tung-Yu Wu, Fazl Barez

发表机构 * University of Oxford(牛津大学)

AI总结 提出查询电路方法,通过直接追踪模型内部信息流来忠实解释特定输入如何产生输出,并引入归一化偏差忠实度(NDF)度量及采样方法实现稀疏电路发现。

Comments Accepted to ICML 2026

详情
AI中文摘要

解释语言模型为何产生特定输出需要局部、输入级别的解释。现有方法揭示全局能力电路(例如间接对象识别),但无法解释模型为何以特定方式回答特定输入查询。我们引入查询电路,直接追踪模型内部将特定输入映射到输出的信息流。与基于替代的方法(例如稀疏自编码器)不同,查询电路在模型本身内识别,从而产生更忠实且计算上可访问的解释。为了使查询电路实用,我们解决两个挑战。首先,我们引入归一化偏差忠实度(NDF),一种鲁棒度量,用于评估发现的电路对特定输入恢复模型决策的程度,并广泛适用于我们设置之外的电路发现。其次,我们开发基于采样的方法,以高效识别稀疏但忠实描述模型行为的电路。在基准测试(IOI、算术、MMLU和ARC)中,我们发现模型内存在极其稀疏的查询电路,可以恢复其在单个查询上的大部分性能。例如,仅覆盖1.3%模型连接的电路可以恢复约60%的MMLU问题性能。总体而言,查询电路为语言模型如何处理单个输入提供了忠实、可扩展的解释。项目页面位于https://tony10101105.github.io/query-circuit/。

英文摘要

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric to evaluate how well a discovered circuit recovers the model's decision for a specific input, and is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU questions. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs. The project page is at https://tony10101105.github.io/query-circuit/.

2509.24696 2026-06-02 cs.LG cs.AI 版本更新

T-POP: Test-Time Personalization with Online Preference Feedback

T-POP:基于在线偏好反馈的测试时个性化

Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, Zhongxiang Dai

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) East China Normal University(华东师范大学) Shenzhen Loop Area Institute(深圳环城研究所) Tianjin University(天津大学) The Chinese University of Hong Kong(香港中文大学) Beihang University(北京航空航天大学) City University of Hong Kong(香港城市大学) The Hong Kong University of Science and Technology (Guangzhou)(香港理工大学(广州))

AI总结 针对新用户冷启动问题,提出T-POP算法,通过在线成对偏好反馈和决斗式强盗机制,在不更新模型参数的情况下实时学习用户偏好并引导解码过程,实现快速数据高效的个性化。

Comments Accepted to ICML 2026

详情
AI中文摘要

将大型语言模型(LLM)个性化以适应个体用户偏好,是超越生成通用有用响应的关键步骤。然而,当前的个性化方法不适合新用户,因为它们通常需要缓慢、资源密集的微调或大量预先存在的用户数据,造成了显著的冷启动问题。为了应对这一挑战,我们引入了一种新的实时个性化范式,通过从文本生成过程中收集的在线成对偏好反馈进行学习。我们提出了T-POP(基于在线偏好反馈的测试时个性化),这是一种新颖的算法,将测试时对齐与决斗式强盗协同结合。在不更新LLM参数的情况下,T-POP通过在线学习一个捕捉用户偏好的奖励函数来引导冻结LLM的解码过程。通过利用决斗式强盗,T-POP智能地查询用户,以有效平衡探索其偏好和利用所学知识生成个性化文本。大量实验表明,T-POP实现了快速且数据高效的个性化,显著优于现有基线,并且随着用户交互的增加而持续改进。

英文摘要

Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback}), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.

2506.22271 2026-06-02 cs.AI cs.LG 版本更新

On the Theoretical Limitations of Embedding-based Link Prediction

基于嵌入的链接预测的理论局限性

Samy Badreddine, Emile van Krieken, Luciano Serafini

发表机构 * Vrije Universiteit Amsterdam, Netherlands(荷兰阿姆斯特丹自由大学) University of Trento, Italy(意大利特伦托大学)

AI总结 研究线性输出层导致的秩瓶颈对知识图谱嵌入模型表达能力的限制,并提出混合非线性输出层以提升大规模密集图上的性能。

详情
AI中文摘要

神经网络通常将低维嵌入映射到高维输出空间。通常,输出层是线性的,这会产生一个“秩瓶颈”,限制模型所能表示的函数。这种瓶颈在链接预测模型中普遍存在,例如知识图谱嵌入(KGE),因为实体的输出空间可能比嵌入维度大几个数量级。我们研究了秩瓶颈如何限制模型拟合训练数据的表达能力。以往工作关注特定KGE所需嵌入维度的充分上界,而我们给出了所有具有线性输出层的KGE的必要下界,该下界随图的大小和连通性增长。我们还考虑了一种使用混合的非线性输出层,以在不显著增加参数开销的情况下打破瓶颈。实验表明,使用这种非线性层的模型在大型密集数据集上,以较低的参数成本提升了排序性能和概率拟合,正如我们的理论所预测。我们的工作揭示了线性输出层如何限制KGE,并激励使用非线性替代方案以扩展到大型密集图。

英文摘要

Neural networks often map low-dimensional embeddings to high-dimensional output spaces. Usually, the output layer is linear, which can create a "rank bottleneck" that limits the functions a model can represent. Such bottlenecks are ubiquitous in link prediction models, such as knowledge graph embeddings (KGEs), as the output space of entities can be orders of magnitude larger than the embedding dimension. We investigate how rank bottlenecks limit model expressivity for fitting the training data. While previous work focused on sufficient bounds on the embedding dimension required for specific KGEs, we show necessary bounds for all KGEs with a linear output layer, which grow with graph size and connectivity. We also consider a non-linear output layer using mixtures to break the bottleneck without significant parameter overhead. Empirically, we show that models using this non-linear layer improve in ranking performance and probabilistic fit for large and dense datasets at a low parameter cost, as predicted by our theory. Our work reveals how linear output layers limit KGEs and motivates non-linear alternatives for scaling to large and dense graphs.

2504.06006 2026-06-02 cs.LG cs.AI cs.NE 版本更新

Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?

Optuna vs Code Llama:LLM 是超参数调优的新范式吗?

Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,乌尔姆大学,德国)

AI总结 通过微调参数高效的 Code Llama 模型,提出基于大语言模型的超参数优化方法,在多种视觉架构上实现与 Optuna 相当或更优的 RMSE 并大幅降低计算开销。

详情
Journal ref
Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 5664-5674, 2025
AI中文摘要

最优超参数选择对于最大化计算机视觉中神经网络的性能至关重要,尤其是当架构变得日益复杂时。本文通过使用 LoRA 微调参数高效的 Code Llama 版本,探索了大语言模型在超参数优化中的应用。所得模型在广泛的视觉架构上产生了准确且计算高效的超参数推荐。与依赖资源密集型的试错过程的传统方法(如 Optuna)不同,我们的方法在实现竞争性或更优的均方根误差的同时,大幅降低了计算开销。重要的是,所评估的模型涵盖了以图像为中心的任务,如分类、检测和分割,这些是许多图像处理流程(包括增强、恢复和风格迁移)的基本组成部分。我们的结果表明,基于 LLM 的优化不仅与成熟的贝叶斯方法(如树结构 Parzen 估计器)相媲美,而且加速了需要感知质量和低延迟处理的实际应用的调优。所有生成的配置均公开在 LEMUR 神经网络数据集(https://github.com/ABrain-One/nn-dataset)中,该数据集作为超参数优化研究的开源基准,并为提高图像处理系统中的训练效率提供了实用资源。

英文摘要

Optimal hyperparameter selection is critical for maximizing the performance of neural networks in computer vision, particularly as architectures become more complex. This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA. The resulting model produces accurate and computationally efficient hyperparameter recommendations across a wide range of vision architectures. Unlike traditional methods such as Optuna, which rely on resource-intensive trial-and-error procedures, our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead. Importantly, the models evaluated span image-centric tasks such as classification, detection, and segmentation, fundamental components in many image manipulation pipelines including enhancement, restoration, and style transfer. Our results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE), but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing. All generated configurations are publicly available in the LEMUR Neural Network Dataset (https://github.com/ABrain-One/nn-dataset), which serves as an open source benchmark for hyperparameter optimization research and provides a practical resource to improve training efficiency in image manipulation systems.

2509.23544 2026-06-02 stat.ML cs.AI cs.LG stat.ME 版本更新

End-to-End Deep Learning for Predicting Metric Space-Valued Outputs

端到端深度学习预测度量空间值输出

Yidong Zhou, Su I Iao, Hans-Georg Müller

AI总结 提出E2M框架,通过加权Fréchet均值和神经网络学习权重,实现度量空间值输出的几何感知预测,具有理论保证并在多种结构化输出上取得最优性能。

Comments 38 pages, 4 figures, 9 tables

详情
Journal ref
Journal of Machine Learning Research, 27:1--38, 2026
AI中文摘要

许多现代应用涉及预测结构化、非欧几里得输出,例如概率分布、网络和对称正定矩阵。这些输出自然地被建模为一般度量空间的元素,而依赖于向量空间结构的经典回归技术不再适用。我们引入了E2M(端到端度量回归),这是一个用于预测度量空间值输出的深度学习框架。E2M通过训练输出的加权Fréchet均值进行预测,其中权重由基于输入条件的神经网络学习。这种构造提供了一种原则性的几何感知预测机制,避免了替代嵌入和限制性参数假设,同时完全保留了输出空间的内在几何结构。我们建立了理论保证,包括刻画模型表达能力的通用逼近定理以及熵正则化训练目标的收敛性分析。通过涉及概率分布、网络和对称正定矩阵的大量模拟,我们展示了E2M始终达到最先进的性能,且其优势在更大样本量下更加明显。应用于人类死亡率分布和纽约市出租车网络进一步证明了该框架的灵活性和实用性。

英文摘要

Many modern applications involve predicting structured, non-Euclidean outputs such as probability distributions, networks, and symmetric positive-definite matrices. These outputs are naturally modeled as elements of general metric spaces, where classical regression techniques that rely on vector space structure no longer apply. We introduce E2M (End-to-End Metric regression), a deep learning framework for predicting metric space-valued outputs. E2M performs prediction via weighted Fréchet means over training outputs, where the weights are learned by a neural network conditioned on the input. This construction provides a principled mechanism for geometry-aware prediction that avoids surrogate embeddings and restrictive parametric assumptions, while fully preserving the intrinsic geometry of the output space. We establish theoretical guarantees, including a universal approximation theorem that characterizes the expressive capacity of the model and a convergence analysis of the entropy-regularized training objective. Through extensive simulations involving probability distributions, networks, and symmetric positive-definite matrices, we show that E2M consistently achieves state-of-the-art performance, with its advantages becoming more pronounced at larger sample sizes. Applications to human mortality distributions and New York City taxi networks further demonstrate the flexibility and practical utility of this framework.

2508.06588 2026-06-02 cs.LG cs.AI 版本更新

Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning

图是一种自然正则化:重新审视向量量化在图表示学习中的应用

Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang

发表机构 * School of Computer Science and Engineering, University of New South Wales, Sydney, Australia(新南威尔士大学计算机科学与工程学院,悉尼,澳大利亚)

AI总结 针对图向量量化中码本崩溃问题,提出RGVQ框架,通过图拓扑和特征相似性正则化及Gumbel-Softmax软分配,提升码本利用率和令牌多样性。

Comments ICML2026

详情
AI中文摘要

向量量化(VQ)最近成为一种学习图结构数据压缩和离散表示的有前途的方法。然而,一个基本挑战,即码本崩溃,在图领域仍未得到充分探索,严重限制了图令牌的表达能力和泛化能力。在本文中,我们进行了一项实证研究,观察到在图形重建任务中,即使采用了视觉或语言领域提出的缓解策略,当与图神经网络联合训练VQ时,码本崩溃始终发生。此外,我们从数据和优化角度提供了崩溃的诊断,表明崩溃与图数据属性(如特征冗余和连接密度)相关,并进一步由确定性硬分配的训练动态强化。为了解决这些问题,我们提出了RGVQ,一种新颖的框架,它集成图拓扑和特征相似性作为显式正则化信号,以增强码本利用并促进令牌多样性。RGVQ通过Gumbel-Softmax重参数化引入软分配,确保所有码字接收梯度更新。此外,RGVQ包含结构感知对比正则化,以惩罚将相同令牌分配给不相似的节点对。大量实验表明,RGVQ显著提高了码本利用率,并在多个下游任务中持续提升了最先进的图VQ骨干网络的性能,实现了更具表达性和可迁移性的图令牌表示。

英文摘要

Vector Quantization (VQ) has recently emerged as a promising approach for learning compressed and discrete representations for graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens.In this paper, we present an empirical study and observe that codebook collapse consistently occurs when training VQ jointly with Graph Neural Networks under graph reconstruction tasks, even with mitigation strategies proposed in vision or language domains. Moreover, we provide a diagnosis of collapse from data and optimization perspectives, showing that collapse is associated with graph data properties such as feature redundancy and connectivity density, and is further reinforced by the training dynamics of deterministic hard assignment. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize assigning the same token to dissimilar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.

2504.10552 2026-06-02 cs.LG cs.AI cs.CV cs.DL 版本更新

LEMUR Neural Network Dataset: Towards Seamless AutoML

LEMUR 神经网络数据集:迈向无缝 AutoML

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg(计算机视觉实验室,CAIDAS,乌尔姆大学)

AI总结 提出 LEMUR 开源数据集与框架,通过统一模板、结构化存储和自动化超参数优化,标准化神经网络实现与评估,以加速 AutoML 研究并促进公平基准测试。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3291-3300, 2026
AI中文摘要

神经网络是现代人工智能的支柱,但设计、评估和比较它们仍然劳动密集。尽管存在许多用于训练的数据集,但模型本身的标准化集合很少。我们介绍 LEMUR,一个开源数据集和框架,它提供了大量基于 PyTorch 的神经网络集合,涵盖分类、分割、检测和自然语言处理等任务。每个模型遵循统一模板,配置和结果存储在结构化数据库中,以确保一致性和可重复性。LEMUR 通过 Optuna 集成自动超参数优化,包括统计分析和可视化工具,并提供 API 以无缝访问性能数据。该框架是可扩展的,允许研究人员添加新模型、数据集或指标而不破坏兼容性。通过标准化实现和统一评估,LEMUR 旨在加速 AutoML 研究,实现公平基准测试,并降低大规模神经网络实验的障碍。为支持采用和协作,LEMUR 及其插件在 MIT 许可下发布,网址为:https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

英文摘要

Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

2509.18025 2026-06-02 math.OC cs.AI cs.LG math.LO stat.ML 版本更新

Deep Learning as the Disciplined Construction of Tame Objects

深度学习作为驯服对象的有纪律构造

Gilles Bareilles, Allen Gehret, Johannes Aspman, Jana Lepšová, Jakub Mareček

发表机构 * Czech Technical University in Prague, Artificial Intelligence Center(布拉格捷克技术大学人工智能中心)

AI总结 本文通过驯服几何(o-极小性)框架,介绍深度学习模型作为函数组合的数学基础,并展示其在非光滑非凸但驯服设置下为随机梯度下降提供收敛保证的应用。

Comments 39 pages, 10 figures

详情
AI中文摘要

人们可以将深度学习模型视为所谓驯服几何中函数的组合。在这篇说明性笔记中,我们概述了驯服几何(也称为o-极小性)、优化理论以及深度学习理论与实践之间的一些主题。为此,我们逐步介绍在一般非光滑非凸但驯服的设置中,为随机梯度下降建立收敛保证所使用的概念和工具。这说明了驯服几何作为研究AI系统(尤其是深度学习)的自然数学框架的一些方式。

英文摘要

One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overview of some topics at the interface of tame geometry (also known as o-minimality), optimization theory, and deep learning theory and practice. To do so, we gradually introduce the concepts and tools used to build convergence guarantees for stochastic gradient descent in a general nonsmooth nonconvex, but tame, setting. This illustrates some ways in which tame geometry is a natural mathematical framework for the study of AI systems, especially within Deep Learning.

2508.12551 2026-06-02 cs.LG cs.AI cs.OS cs.SE 版本更新

TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning

TuneAgent: 基于强化学习的智能操作系统内核调优

Hongyu Lin, Yuchen Li, Haoran Luo, Zhenghong Lin, Libo Zhang, Mingjie Xing, Yanjun Wu

发表机构 * Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学) Nanyang Technological University(南洋理工大学)

AI总结 提出TuneAgent框架,利用基于规则的强化学习使大语言模型自主探索Linux内核空间,通过结构化奖励函数和两阶段训练策略解决稀疏反馈问题,实现高达5.6%的性能提升。

详情
AI中文摘要

Linux内核调优对于优化操作系统性能至关重要,但由于复杂的内核空间、稀疏的性能反馈和强烈的工作负载敏感性,仍然具有挑战性。我们提出了TuneAgent,一个基于规则强化学习的智能Linux内核调优框架。TuneAgent将内核空间构建为约束强化学习环境,使大语言模型能够自主探索内核,同时强制执行有效且精确的配置修改。为了解决稀疏性能反馈问题,我们设计了结构化奖励函数,共同促进推理标准化、配置正确性和性能感知。此外,我们提出了一种两阶段训练策略,首先确保格式和语义正确性,然后过渡到性能驱动的探索,从而加速收敛并降低开销。实验结果表明,TuneAgent始终优于现有基线,在保持高配置有效性的同时,实现了高达5.6%的相对整体性能提升。我们进一步展示了其在多个实际应用中的鲁棒性,突显了其在多样化部署环境中的实用性和适应性。

英文摘要

Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space, sparse performance feedback, and strong workload sensitivity. We present TuneAgent, an agentic Linux kernel tuning framework powered by rule-based reinforcement learning (RL). TuneAgent formulates the kernel space as a constrained RL environment, enabling large language models (LLMs) to autonomously explore the kernel while enforcing valid and precise configuration modifications. To address sparse performance feedback, we design structured reward functions that jointly promote reasoning standardization, configuration correctness, and performance awareness. Furthermore, we propose a two-phase training strategy that first ensures format and semantic correctness and then transitions to performance-driven exploration, accelerating convergence and reducing overhead. Experimental results show that TuneAgent consistently outperforms existing baselines, achieving up to 5.6% relative overall performance improvement while maintaining high configuration validity. We further demonstrate its robustness across multiple real-world applications, highlighting its practicality and adaptability in diverse deployment environments.

2504.05871 2026-06-02 cs.AI 版本更新

Agent Guide: A Simple Agent Behavioral Watermarking Framework

Agent Guide:一种简单的智能体行为水印框架

Kaibo Huang, Zipei Zhang, Zhongliang Yang, Linna Zhou

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对智能体行为难以标记和转换的问题,提出Agent Guide框架,通过概率偏差引导智能体高层决策嵌入水印,并利用z统计量检测,实现低误报率的水印嵌入与提取。

详情
AI中文摘要

随着智能体在数字生态系统(如社交媒体平台)中的部署日益增多,可追溯性和问责制问题引起了重大关注,尤其是在网络安全和数字内容保护领域。传统的大语言模型(LLM)水印技术依赖于令牌级操作,但由于行为标记化和行为到动作转换过程中的信息丢失,这些技术不适用于智能体。为了解决这些问题,我们提出了Agent Guide,一种新颖的行为水印框架,通过概率偏差引导智能体的高层决策(行为)来嵌入水印,同时保留具体执行(动作)的自然性。我们的方法将智能体行为解耦为两个层次:行为(例如,选择收藏)和动作(例如,使用特定标签收藏),并将水印引导的偏差应用于行为概率分布。我们采用基于z统计量的统计分析来检测水印,确保在多次交互中可靠提取。在具有不同智能体配置文件的社交媒体场景中的实验表明,Agent Guide实现了有效的水印检测,且误报率低。我们的框架为智能体水印提供了一种实用且稳健的解决方案,可应用于识别恶意智能体和保护专有智能体系统。

英文摘要

The increasing deployment of intelligent agents in digital ecosystems, such as social media platforms, has raised significant concerns about traceability and accountability, particularly in cybersecurity and digital content protection. Traditional large language model (LLM) watermarking techniques, which rely on token-level manipulations, are ill-suited for agents due to the challenges of behavior tokenization and information loss during behavior-to-action translation. To address these issues, we propose Agent Guide, a novel behavioral watermarking framework that embeds watermarks by guiding the agent's high-level decisions (behavior) through probability biases, while preserving the naturalness of specific executions (action). Our approach decouples agent behavior into two levels, behavior (e.g., choosing to bookmark) and action (e.g., bookmarking with specific tags), and applies watermark-guided biases to the behavior probability distribution. We employ a z-statistic-based statistical analysis to detect the watermark, ensuring reliable extraction over multiple rounds. Experiments in a social media scenario with diverse agent profiles demonstrate that Agent Guide achieves effective watermark detection with a low false positive rate. Our framework provides a practical and robust solution for agent watermarking, with applications in identifying malicious agents and protecting proprietary agent systems.

2507.19881 2026-06-02 cs.CV cs.AI 版本更新

FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving

FedS2R: 面向自动驾驶中合成到真实语义分割的一次性联邦域泛化

Tao Lian, Jose L. Gómez, Antonio M. López

发表机构 * Computer Vision Center (CVC) Univ. Autònoma de Barcelona (UAB) Barcelona, Spain(计算机视觉中心(CVC)巴塞罗那自治大学(UAB)巴塞罗那,西班牙)

AI总结 提出FedS2R框架,通过不一致性驱动的数据增强和多客户端知识蒸馏,实现自动驾驶中合成到真实语义分割的一次性联邦域泛化,在五个真实数据集上性能接近集中式训练。

Comments Accepted by IEEE Intelligent Vehicles Symposium (IV) 2026

详情
AI中文摘要

联邦域泛化在图像分类中通过多客户端协作训练而不共享原始数据已显示出有希望的进展。然而,其在自动驾驶语义分割中的潜力尚未被充分探索。本文提出FedS2R,这是第一个用于自动驾驶中合成到真实语义分割的一次性联邦域泛化框架。FedS2R包含两个组件:一种不一致性驱动的数据增强策略,用于生成不稳定类别的图像;以及一种具有特征融合的多客户端知识蒸馏方案,从多个客户端模型中蒸馏出全局模型。在五个真实数据集Cityscapes、BDD100K、Mapillary、IDD和ACDC上的实验表明,全局模型显著优于单个客户端模型,并且仅比同时访问所有客户端数据训练的模型落后2个mIoU点。这些结果证明了FedS2R在联邦学习下自动驾驶合成到真实语义分割中的有效性。

英文摘要

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in the semantic segmentation of autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency-driven data augmentation strategy that generates images for unstable classes, and a multi-client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real-world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic-to-real semantic segmentation for autonomous driving under federated learning

2507.19702 2026-06-02 cs.SI cs.AI cs.LG 版本更新

A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks

基于轻量级深度学习的复杂网络中有影响力节点排序模型

Mohammed A. Ramadhan, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho(扎赫大学科学学院计算机科学系) Department of Computer Science and Information Technology, The American University of Kurdistan(库尔德斯坦美国大学计算机科学与信息技术系)

AI总结 提出一种结合一维卷积神经网络和GraphSAGE的轻量级混合模型1D-CGS,利用节点度和平均邻居度特征,通过回归任务高效排序有影响力节点,在12个真实网络上平均Kendall Tau提升4.73%,Jaccard相似度提升7.67%,单调性指数达0.99,运行速度显著快于现有深度学习方法。

详情
AI中文摘要

识别复杂网络中的有影响力节点是一项关键任务,在不同领域有广泛应用。然而,现有方法常在准确性和计算效率之间权衡。为解决这些挑战,我们提出1D-CGS,一种轻量级且有效的混合模型,它结合了一维卷积神经网络(1D-CNN)的速度和GraphSAGE的拓扑表示能力,用于高效节点排序。该模型使用基于两个简单且重要的拓扑特征(节点度和平均邻居度)构建的轻量级输入表示。这些特征通过一维卷积提取局部模式,然后通过GraphSAGE层聚合邻域信息。我们将节点排序任务表述为回归问题,并使用易感-感染-恢复(SIR)模型生成真实影响力分数。1D-CGS首先在Barabasi-Albert模型生成的合成网络上训练,然后应用于真实世界网络以识别有影响力节点。在12个真实网络上的实验评估表明,1D-CGS在排序准确性上显著优于传统中心性度量和最近的深度学习模型,同时运行速度非常快。与表现最佳的深度学习基线相比,所提模型在Kendall Tau相关性上平均提升4.73%,在Jaccard相似度上平均提升7.67%。它还实现了平均单调性指数(MI)分数0.99,并产生近乎完美的排名分布,表明高度独特和可区分的排名。此外,所有实验证实1D-CGS在高度合理的时间内运行,比现有深度学习方法快得多,使其适用于大规模应用。

英文摘要

Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, existing approaches often face trade-offs between accuracy and computational efficiency. To address these challenges, we propose 1D-CGS, a lightweight and effective hybrid model that integrates the speed of one-dimensional convolutional neural networks (1D-CNN) with the topological representation power of GraphSAGE for efficient node ranking. The model uses a lightweight input representation built on two straightforward and significant topological features: node degree and average neighbor degree. These features are processed through 1D convolutions to extract local patterns, followed by GraphSAGE layers to aggregate neighborhood information. We formulate the node ranking task as a regression problem and use the Susceptible-Infected-Recovered (SIR) model to generate ground truth influence scores. 1D-CGS is initially trained on synthetic networks generated by the Barabasi-Albert model and then applied to real world networks for identifying influential nodes. Experimental evaluations on twelve real world networks demonstrate that 1D-CGS significantly outperforms traditional centrality measures and recent deep learning models in ranking accuracy, while operating in very fast runtime. The proposed model achieves an average improvement of 4.73% in Kendall's Tau correlation and 7.67% in Jaccard Similarity over the best performing deep learning baselines. It also achieves an average Monotonicity Index (MI) score 0.99 and produces near perfect rank distributions, indicating highly unique and discriminative rankings. Furthermore, all experiments confirm that 1D-CGS operates in a highly reasonable time, running significantly faster than existing deep learning methods, making it suitable for large scale applications.

2503.05641 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

基于技能的混合专家模型:通过推断技能实现异构推理的自适应路由

Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出Skill-MoE框架,通过推断查询所需技能进行实例级专家选择,并采用批推理策略降低开销,在单GPU上集成16个专家模型,在多个推理基准上平均提升8.15%。

Comments ICML 2026 (Camera-Ready). The first three authors contributed equally. Project Page: https://skill-moe.github.io/

详情
AI中文摘要

结合现有的预训练大语言模型是处理多样化推理任务的一种有前景的方法。然而,任务级专家选择往往过于粗粒度,因为不同实例可能需要不同的专业知识。为了解决这个问题,我们提出了Skill-MoE,一个符号化的、基于技能的、无梯度的混合专家框架,用于实例级专家选择。Skill-MoE从每个查询中推断技能(例如,数学中的代数),根据技能相关性选择专家,并让每个专家生成自己的推理。然后,由选定的聚合器将得到的k个输出进行综合,该聚合器因其整合多样化响应的能力而被选中。虽然实例级选择显著提高了性能,但朴素实现会因重复的模型加载和卸载而产生巨大开销。我们通过一种批推理策略解决了这个问题,该策略将实例按分配的专家分组,使得每个模型只需加载一次。因此,Skill-MoE在单GPU上集成了16个专家模型,其运行时间与使用4个GPU的先前多智能体基线相当。在多个基准测试(MMLU-Pro、GPQA、AIME和MedMCQA)中,Skill-MoE相比最佳基线实现了平均8.15%的绝对提升。它还能很好地泛化到未见过的任务,并且无需昂贵的多轮交互即可超越基于讨论的方法。

英文摘要

Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise. To address this, we propose Skill-MoE, a symbolic, skill-based, and gradient-free Mixture-of-Experts framework for instance-level expert selection. Skill-MoE infers skills (e.g., algebra in mathematics) from each query, selects experts based on skill relevance, and lets each expert generate its own reasoning. The resulting k outputs are then synthesized by an aggregator chosen for its ability to integrate diverse responses. While instance-level selection substantially improves performance, naively implementing it incurs heavy overhead from repeated model loading and offloading. We address this with a batch inference strategy that groups instances by assigned experts, allowing each model to be loaded only once. As a result, Skill-MoE integrates 16 expert models on a single GPU with runtime comparable to prior multi-agent baselines using 4 GPUs. Across diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), Skill-MoE achieves an average absolute improvement of 8.15% over the best baseline. It also generalizes well to unseen tasks and outperforms discussion-based methods without requiring expensive multi-round interactions.

2507.12645 2026-06-02 eess.SP cs.AI cs.LG 版本更新

A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis

一种用于生物医学时间序列数据鲁棒深度学习分类的新型数据增强策略:在ECG和EEG分析中的应用

Mohammed Guhdar, Ramadhan J. Mstafa, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho(扎赫大学科学学院计算机科学系) Department of Computer Science and Information Technology, The American University of Kurdistan(库尔德斯坦美国大学计算机科学与信息技术系) PRIME Lab, Scientific Research Center, University of Zakho(扎赫大学科学研究中心PRIME实验室)

AI总结 提出一种结合ResNet-CNN与注意力机制的统一深度学习框架,通过时域拼接多个增强变体的新型数据增强策略和Focal Loss处理类别不平衡,在ECG和EEG数据集上达到99.96%-100%的准确率,且内存需求低、推理速度快。

详情
AI中文摘要

准确统一分析多种生物信号(如ECG和EEG)的需求日益迫切,这对于全面评估患者状况至关重要,尤其是在同步监测中。尽管多传感器融合取得了进展,但在开发能够有效处理和提取本质上不同生理信号特征的统一架构方面仍存在关键空白。另一个挑战是许多生物医学数据集固有的类别不平衡,这常常导致传统方法性能偏差。本研究通过提出一种新颖且统一的深度学习框架来解决这些问题,该框架在不同信号类型上均达到了最先进的性能。我们的方法将基于ResNet的CNN与注意力机制相结合,并通过一种新颖的数据增强策略增强:对每个信号的多个增强变体进行时域拼接,以生成更丰富的表示。与先前工作不同,我们科学地增加信号复杂性以实现未来能力,从而相比现有技术获得了最佳预测。预处理步骤包括小波去噪、基线去除和标准化。通过结合使用这种高级数据增强和Focal Loss函数,有效管理了类别不平衡。训练过程中应用了正则化技术以确保泛化能力。我们在三个基准数据集上严格评估了所提出的架构:UCI癫痫EEG、MIT-BIH心律失常和PTB诊断ECG。它分别达到了99.96%、99.78%和100%的准确率,展示了在不同信号类型和临床背景下的鲁棒性。最后,该架构需要约130 MB内存,每个样本处理时间约10 ms,表明其适用于低端或可穿戴设备部署。

英文摘要

The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive patient assessment, especially in synchronous monitoring. Despite advances in multi-sensor fusion, a critical gap remains in developing unified architectures that effectively process and extract features from fundamentally different physiological signals. Another challenge is the inherent class imbalance in many biomedical datasets, often causing biased performance in traditional methods. This study addresses these issues by proposing a novel and unified deep learning framework that achieves state-of-the-art performance across different signal types. Our method integrates a ResNet-based CNN with an attention mechanism, enhanced by a novel data augmentation strategy: time-domain concatenation of multiple augmented variants of each signal to generate richer representations. Unlike prior work, we scientifically increase signal complexity to achieve future-reaching capabilities, which resulted in the best predictions compared to the state of the art. Preprocessing steps included wavelet denoising, baseline removal, and standardization. Class imbalance was effectively managed through the combined use of this advanced data augmentation and the Focal Loss function. Regularization techniques were applied during training to ensure generalization. We rigorously evaluated the proposed architecture on three benchmark datasets: UCI Seizure EEG, MIT-BIH Arrhythmia, and PTB Diagnostic ECG. It achieved accuracies of 99.96%, 99.78%, and 100%, respectively, demonstrating robustness across diverse signal types and clinical contexts. Finally, the architecture requires ~130 MB of memory and processes each sample in ~10 ms, suggesting suitability for deployment on low-end or wearable devices.

2506.21278 2026-06-02 stat.ML cs.AI cs.LG math.ST stat.TH 版本更新

Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution

使用高效球面柯西分布的超球面变分自编码器

Lukas Sablica, Kurt Hornik

发表机构 * Institute for Statistics and Mathematics(统计与数学研究所) Vienna University of Economics and Business(维也纳经济与商业大学) Austria(奥地利)

AI总结 提出基于球面柯西分布的超球面变分自编码器,通过莫比乌斯变换实现可微重参数化,避免贝塞尔函数计算,在保持重尾特性的同时提供高效稳定的训练与推理。

详情
AI中文摘要

我们提出在超球面潜变量空间上使用球面柯西(spCauchy)潜变量的变分自编码器。spCauchy 族具有重尾全局行为,并且通过对球面上的均匀样本应用莫比乌斯变换,允许精确可微的重参数化。我们证明,在高浓度极限下,spCauchy 在显式浓度参数映射下恢复了 von Mises-Fisher(vMF)分布的局部切空间几何,同时避免了 vMF 实现所需的高阶贝塞尔函数计算。对于训练,到均匀球面先验的 Kullback-Leibler 散度具有快速收敛的级数、稳定的求积以及高浓度渐近形式。我们进一步建立了浓度依赖的 KL 核心的单调性,并推导了具有闭形式代理和误差控制的解析括号,支持极端情况下的稳定近似。压力测试基准表明,所得到的潜层目标在 CPU 和 GPU 上比 vMF 基线更稳定且评估更快。在图像和分子序列数据上的实验表明,spCauchy-VAE 为具有超球面潜表示的生式建模提供了一种鲁棒且可扩展的替代方案。

英文摘要

We propose spherical Cauchy (spCauchy) latent variables for variational autoencoders on hyperspherical latent spaces. The spCauchy family has heavy-tailed global behavior and admits an exact differentiable reparameterization by applying a Möbius transformation to uniform samples on the sphere. We show that, in the high-concentration limit, spCauchy recovers the local tangent-space geometry of the von Mises-Fisher (vMF) distribution under an explicit concentration parameter mapping, while avoiding the high-order Bessel-function evaluations required by vMF implementations. For training, the Kullback-Leibler divergence to a uniform spherical prior admits rapidly convergent series, stable quadrature, and high-concentration asymptotic forms. We further establish monotonicity of the concentration-dependent KL core and derive analytic brackets with closed-form surrogates and error control, supporting stable approximation in extreme regimes. Stress-test benchmarks show that the resulting latent-layer objective remains stable and faster to evaluate than vMF baselines on CPU and GPU. Experiments on image and molecular sequence data demonstrate that spCauchy-VAEs provide a robust and scalable alternative for generative modeling with hyperspherical latent representations.

2507.09766 2026-06-02 cs.LG cs.AI 版本更新

Toward accurate RUL and SoH estimation using reinforced graph-based physics-informed neural networks enhanced with dynamic weights

基于动态权重的强化图物理信息神经网络实现精确的剩余使用寿命和健康状态估计

Mohamadreza Akbari Pour, Ali Ghasemzadeh, Mohamad Ali Bijarchi, Mohammad Behshad Shafii

发表机构 * Department of Mechanical Engineering(机械工程系) Department of Computer Engineering(计算机工程系) Sharif University of Technology(谢赫拉特福大学)

AI总结 提出一种结合图表示学习、强化学习和自适应动态权重的物理信息神经网络框架RGPD,在C-MAPSS、PHM2012和XJTU数据集上实现跨资产退化场景的RUL和SoH高精度估计。

详情
AI中文摘要

精确估计剩余使用寿命(RUL)和健康状态(SoH)对于可靠的预测与健康管理(PHM)至关重要,有助于及时维护和可靠的工业运行。然而,结合数据驱动学习与基于物理的正则化的混合模型通常依赖于固定的损失权重,因此在跨具有不同退化行为的资产迁移时会失去准确性。本研究引入了具有动态加权的强化图物理信息网络(RGPD),这是一个用于时空退化建模和自适应物理引导正则化的统一框架。基于图的表示学习捕获传感器间的退化结构,软演员-评论家(SAC)模块在噪声条件下细化潜在特征,轻量级Q学习策略在训练过程中自适应地平衡单调性、平滑性和潜在动力学残差损失。该框架在C-MAPSS、PHM2012和XJTU数据集上进行了评估,这些数据集分别代表发动机、轴承和电池的退化过程。与相应基准表中报告的最强基线相比,RGPD在PHM2012和C-MAPSS上将平均RMSE提高了高达12%,在XJTU上将平均MAPE比第二好的模型降低了20%。在这些异构基准上的性能进一步表明了该模型跨退化系统的泛化能力。物理信息组件通过退化一致性先验以及深度隐藏物理模型风格的残差实现,提高了物理合理性,而无需为每种资产类型建立完整的第一性原理模型。

英文摘要

Accurate estimation of Remaining Useful Life (RUL) and State of Health (SoH) is essential for reliable Prognostics and Health Management (PHM), supporting timely maintenance and dependable industrial operation. However, hybrid models that combine data-driven learning with physics-based regularization often rely on fixed loss weights and therefore lose accuracy when transferred across assets with different degradation behaviors. This study introduces Reinforced Graph-based Physics-informed Networks with Dynamic Weighting (RGPD), a unified framework for spatio-temporal degradation modeling and adaptive physics-guided regularization. Graph-based representation learning captures inter-sensor degradation structure, a Soft Actor-Critic (SAC) module refines latent features under noisy conditions, and a lightweight Q-learning policy adaptively balances monotonicity, smoothness, and latent-dynamics residual losses during training. The framework is evaluated on the C-MAPSS, PHM2012, and XJTU datasets, which represent engine, bearing, and battery degradation processes. Relative to the strongest compared baselines reported in the corresponding benchmark tables, RGPD improves average RMSE by up to 12 percent on PHM2012 and C-MAPSS, and reduces average MAPE by 20 percent on XJTU compared with the second-best reported model. Performance on these heterogeneous benchmarks further suggests the model's generalizability across degradation systems. The physics-informed component is implemented through degradation-consistent priors together with a Deep Hidden Physics Model-style residual, which improves physical plausibility without requiring a full first-principles model for each asset type.

2507.02905 2026-06-02 cs.HC cs.AI cs.LG 版本更新

Preference-Optimal Multi-Metric Weighting for Parallel Coordinate Plots

平行坐标图的偏好最优多度量加权

Chisa Mori, Shuhei Watanabe, Masaki Onishi, Takayuki Itoh

发表机构 * Preferred Networks Inc.(Preferred Networks公司)

AI总结 针对平行坐标图中多度量可视化难题,提出基于偏好最优加权的公式化方法,并利用雷达图与UMAP降维实现直观偏好选择,有效揭示控制参数重要性模式。

Comments Accepted to International Conference Information Visualisation (iV2025)

详情
AI中文摘要

平行坐标图(PCP)是一种解释控制参数与度量之间关系的常用方法。PCP通过基于单一度量的颜色渐变来提供这种解释。然而,当存在多个度量时,提供这样的渐变是具有挑战性的。虽然一种简单的方法是通过线性加权每个度量来计算单一度量,但这种加权对用户来说是不明确的。为了解决这个问题,我们首先提出了一种基于特定偏好度量组合计算最优加权的原则性公式。尽管用户可以在双度量问题的二维(2D)平面上简单地选择他们的偏好,但多度量问题需要直观的可视化以允许他们选择偏好。我们通过使用各种雷达图来可视化由UMAP降维的2D平面上的度量权衡来实现这一点。在使用行人流引导规划的分析中,我们的方法为每个用户偏好识别出了控制参数重要性的独特模式,突出了我们方法的有效性。

英文摘要

Parallel coordinate plots (PCPs) are a prevalent method to interpret the relationship between the control parameters and metrics. PCPs deliver such an interpretation by color gradation based on a single metric. However, it is challenging to provide such a gradation when multiple metrics are present. Although a naive approach involves calculating a single metric by linearly weighting each metric, such weighting is unclear for users. To address this problem, we first propose a principled formulation for calculating the optimal weight based on a specific preferred metric combination. Although users can simply select their preference from a two-dimensional (2D) plane for bi-metric problems, multi-metric problems require intuitive visualization to allow them to select their preference. We achieved this using various radar charts to visualize the metric trade-offs on the 2D plane reduced by UMAP. In the analysis using pedestrian flow guidance planning, our method identified unique patterns of control parameter importance for each user preference, highlighting the effectiveness of our method.

2409.18624 2026-06-02 cs.AI cs.LG 版本更新

Unsupervised Cognition

无监督认知

Alfredo Ibias, Hector Antona, Guillem Ramirez-Miranda, Enric Guinovart, Eduard Alarcon

发表机构 * Avatar Cognition(Avatar认知)

AI总结 提出一种基于原语的无监督学习方法,通过构建分布式层次结构表示输入空间,在分类任务上超越现有最先进方法,并展现出类似认知的行为。

详情
AI中文摘要

无监督学习方法在认知模型中具有软启发。迄今为止,最成功的无监督学习方法主要围绕在数学空间中对样本进行聚类。在本文中,我们提出了一种基于原语的无监督学习方法,用于决策制定,该方法受一种新颖的认知框架启发。这种以表示为中心的方法以输入无关的方式,将输入空间建设性地建模为分布式层次结构。我们将我们的方法与当前最先进的无监督学习分类、当前最先进的小规模和不完整数据集分类以及当前最先进的癌症类型分类进行了比较。我们展示了我们的方法如何超越先前的最先进技术。我们还评估了我们方法的一些类似认知的特性,在这些特性中,它不仅优于比较的算法(甚至包括监督学习算法),而且表现出不同的、更类似于认知的行为。

英文摘要

Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods revolve around clustering samples in a mathematical space. In this paper we propose a primitive-based, unsupervised learning approach for decision-making inspired by a novel cognition framework. This representation-centric approach models the input space constructively as a distributed hierarchical structure in an input-agnostic way. We compared our approach with both current state-of-the-art unsupervised learning classification, with current state-of-the-art small and incomplete datasets classification, and with current state-of-the-art cancer type classification. We show how our proposal outperforms previous state-of-the-art. We also evaluate some cognition-like properties of our proposal where it not only outperforms the compared algorithms (even supervised learning ones), but it also shows a different, more cognition-like, behaviour.

2411.15240 2026-06-02 cs.LG cs.AI cs.HC q-bio.QM 版本更新

A Foundation Model for Wearable Movement Data in Mental Health Research

心理健康研究中可穿戴运动数据的基础模型

Franklin Y. Ruan, Aiwei Zhang, Jenny Y. Oh, SouYoung Jin, Nicholas C. Jacobson

发表机构 * Dartmouth College(达特茅斯学院) National Institute of Diabetes and Digestive and Kidney Diseases(国家糖尿病、消化系统疾病和肾病研究所) National Institutes of Health(美国国立卫生研究院) Department of Computer Science at Dartmouth College(达特茅斯学院计算机科学系)

AI总结 提出预训练体动记录Transformer(PAT),一种基于自监督掩码自编码器预训练的可穿戴运动时间序列基础模型,在心理健康预测任务上优于非基础模型方法,并提供可解释的注意力图。

Comments F. Y. Ruan, A. Zhang, J. Y. Oh, S. Jin and N. C. Jacobson, "A Foundation Model for Wearable Movement Data in Mental Health Research," in IEEE Journal of Biomedical and Health Informatics, doi: 10.1109/JBHI.2026.3694809

详情
AI中文摘要

可穿戴运动数据由几乎所有市售智能手表收集,是心理健康研究的宝贵资源,反映了细粒度的时间行为趋势。尽管前景广阔,但与临床图像和文本分析相比,健康可穿戴建模的基础模型开发仍然有限。我们设计了带有补丁嵌入的Transformer,并在分钟级、持续一周的体动记录(身体活动强度测量)序列上使用自监督掩码自编码器预训练,以开发和评估预训练体动记录Transformer(PAT)。PAT是一个用于可穿戴运动时间序列的开源基础模型,结合了长达一周的时间建模、精神科结果评估以及在公共数据上的可重复性。在来自美国国家健康与营养调查(NHANES)的全国代表性队列中21,538名参与者的数据上预训练,PAT在心理健康预测任务(包括苯二氮卓类药物和SSRI使用、抑郁症和睡眠异常)中始终优于非基础模型基线。在苯二氮卓类药物使用预测任务中,PAT相比常用于时间序列建模的非基础深度学习模型表现出最大改进(即比LSTM提高55.6%,比一维CNN提高21.4%,比ConvLSTM提高14.8%)。除了预测准确性,PAT还提供可解释的注意力图,突出对临床预测最重要的日常活动特定时段,提供模型透明度和潜在临床见解。结果表明,PAT为研究人员和临床医生提供了一种易于部署、适应性强且可扩展的解决方案,以从可穿戴传感器数据中推进临床见解。GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/

英文摘要

Wearable movement data is collected by nearly all commercially available smartwatches and is a valuable resource for mental health research, reflecting fine-grained temporal behavioral trends. Despite its promise, the development of foundation models for health wearable modeling remains limited when compared to clinical image and text analysis. We designed transformers with patch embeddings and used self-supervised masked autoencoder pretraining on minute-level week-long actigraphy (physical activity intensity measurement) sequences to develop and evaluate the Pretrained Actigraphy Transformer (PAT). PAT is an open-source foundation model for wearable movement time series that combines week-long temporal modeling, psychiatric outcome evaluation, and reproducibility on public data. Pretrained on data from 21,538 U.S. participants in a nationally representative cohort from the National Health and Nutrition Examination Survey (NHANES), PAT consistently outperformed non-foundation-model baselines across mental health prediction tasks-including benzodiazepine and SSRI use, depression, and sleep abnormalities. During the benzodiazepine medication usage prediction task, PAT demonstrated the largest improvement over non-foundational deep learning models commonly used for time-series modeling (i.e., 55.6% improvement over the LSTM, 21.4% improvement over the 1-D CNN, 14.8% improvement over the ConvLSTM). Beyond predictive accuracy, PAT provides interpretable attention maps highlighting specific periods of daily activity most important for clinical predictions, offering model transparency and potential clinical insights. The results suggest that PAT offers an easy-to-deploy, adaptable and scalable solution to advance clinical insight from wearable sensor data for researchers and clinicians. GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/

2502.08884 2026-06-02 cs.CV cs.AI cs.GR 版本更新

ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models

ShapeLib: 利用大型语言模型设计程序化3D形状抽象库

R. Kenny Jones, Paul Guerrero, Niloy J. Mitra, Daniel Ritchie

发表机构 * Stanford University(斯坦福大学) Adobe Research(Adobe研究) University College London(伦敦大学学院) Brown University(布朗大学)

AI总结 提出ShapeLib方法,利用大型语言模型的先验知识,通过引导式工作流自动设计可泛化的程序化3D形状抽象库,并支持下游形状编辑与生成。

详情
AI中文摘要

我们提出ShapeLib,这是第一个利用大型语言模型(LLM)的先验知识来设计程序化3D形状抽象库的方法。我们的系统接受两种形式的用户提供的设计意图:输出库中应包含的功能的高级文本描述,以及一小部分示例形状的种子集。我们通过引导式LLM工作流发现与设计意图匹配的抽象库,该工作流首先提出应用和实现功能的不同方式,然后验证这些功能有助于表示种子集形状。为了扩展到种子集之外,我们开发了特定于库的识别网络,将形状(表示为基元、体素或点云)映射到使用这些新发现的抽象的程序。跨多个建模领域(按形状类别划分),我们发现,当LLM与几何推理深思熟虑地结合时,可以引导它们编写出能跨形状分布泛化的抽象函数库。我们的框架朝着实现长期以来的形状分析愿望迈出了一步,即发现可重用的、程序化的形状抽象,同时暴露可解释的、语义对齐的接口。我们的广泛评估表明,ShapeLib在泛化性、可用性和在操作下保持合理性方面,优于先前的替代抽象发现方法。最后,我们展示了ShapeLib的抽象函数解锁了多个下游应用,将LLM对形状程序的推理与几何处理工具相结合,以支持形状编辑和生成工作流。

英文摘要

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abstractions. Our system accepts two forms of user-provided design intent: high-level text descriptions of functions to include in the output library and a small seed set of exemplar shapes. We discover a library of abstractions that matches this design intent with a guided LLM workflow that first proposes different ways of applying and implementing functions, and then validates these functions are helpful in representing seed set shapes. To extend beyond the seed set, we develop library-specific recognition networks that map shapes (represented as primitives, voxels, or point clouds) to programs that use these newly discovered abstractions. Across multiple modeling domains (split by shape category), we find that LLMs, when thoughtfully combined with geometric reasoning, can be guided to author libraries of abstraction functions that generalize across shape distributions. Our framework takes a step towards realizing the long-standing shape analysis aspiration of discovering reusable, programmatic shape abstractions while exposing interpretable, semantically aligned interfaces. Our extensive evaluation demonstrates that ShapeLib provides distinct advantages over prior alternative abstraction discovery works in terms of generalization, usability, and maintaining plausibility under manipulation. Finally, we demonstrate that ShapeLib's abstraction functions unlock a number of downstream applications, combining LLM reasoning over shape programs with geometry processing tools to support shape editing and generation workflows.

2506.05387 2026-06-02 cs.CL cs.AI 版本更新

Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

推进解码策略:局部典型采样在大型语言模型中的增强

Jaydip Sen, Saptarshi Sengupta, Subhasis Dasgupta

发表机构 * Praxis Business School(普拉克斯商业学校) San Jose State University(旧金山州立大学)

AI总结 提出自适应语义感知典型性采样(ASTS)改进局部典型采样算法,通过动态熵阈值、多目标评分和奖惩调整,在保持计算效率的同时提升文本生成的流畅性、多样性和连贯性。

Comments This is the accepted but pre-reviewed version of the chapter that has been accepted for publication in the Springer volume 'Decision-Making in Computational Intelligence-Based Systems,' edited by Witold Pedrycz, Gilberto Rivera, Rose Ma Rodriguez, and Salvador Ibarra Martinez. The chapter is 39 pages long, and it contains 2 figures and 6 tables. This is NOT the final camera-ready version

详情
Journal ref
Recent Advances in Artificial Neural Networks. Intelligent Systems Reference Library, vol 283. Springer, Cham, 2026
AI中文摘要

本章探讨了大型语言模型(LLMs)解码策略的进展,重点关注增强局部典型采样(LTS)算法。传统的解码方法,如top-k和核采样,通常在文本生成中难以平衡流畅性、多样性和连贯性。为应对这些挑战,提出了自适应语义感知典型性采样(ASTS)作为LTS的改进版本,融合了动态熵阈值、多目标评分和奖惩调整。ASTS在保持计算效率的同时,确保上下文连贯且多样的文本生成。其性能在多个基准测试中进行了评估,包括故事生成和抽象摘要,使用了困惑度、MAUVE和多样性分数等指标。实验结果表明,ASTS通过减少重复、增强语义对齐和提高流畅性,优于现有的采样技术。

英文摘要

This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.

2506.08137 2026-06-02 cs.CV cs.AI 版本更新

IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

IGraSS: 通过迭代图约束语义分割从卫星图像中识别基础设施网络

Oishee Bintey Hoque, Abhijin Adiga, Aniruddha Adiga, Siddharth Chaudhary, Madhav V. Marathe, S. S. Ravi, Kirti Rajagopalan, Amanda Wilson, Samarth Swarup

发表机构 * Biocomplexity Institute, University of Virginia(弗吉尼亚大学生物复杂性研究所) Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Department Biomedical Systems Engineering, Washington State University(华盛顿州立大学生物医学系统工程系) Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心)

AI总结 提出IGraSS迭代框架,结合语义分割与图约束优化,将不可达运河段从18%降至3%,并提升道路网络完整性。

详情
AI中文摘要

精确的运河网络制图对于水资源管理(包括灌溉规划和基础设施维护)至关重要。最先进的基础设施制图语义分割模型(如道路)依赖于大规模、良好标注的遥感数据集。然而,不完整或不充分的真实标注会阻碍这些学习方法。许多基础设施网络具有图级属性,如可达性(运河)或连通性(道路),可用于改进现有真实标注。本文开发了一种新颖的迭代框架IGraSS,将结合RGB和额外模态(NDWI、DEM)的语义分割模块与基于图的真实标注精化模块相结合。分割模块处理卫星图像块,而精化模块将基础设施网络视为图,在整个数据上运行。实验表明,IGraSS将不可达运河段从约18%降至3%,并且使用精化后的真实标注进行训练显著改善了运河识别。IGraSS是一个鲁棒的框架,既可用于精化噪声真实标注,也可用于从遥感影像中绘制运河网络。我们还以道路网络为例,应用不同的图论约束来完善道路网络,证明了IGraSS的有效性和泛化能力。

英文摘要

Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties such as reachability to a source (like canals) or connectivity (roads) that can be leveraged to improve these existing ground truth. This paper develops a novel iterative framework IGraSS, combining a semantic segmentation module-incorporating RGB and additional modalities (NDWI, DEM)-with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.

2505.19489 2026-06-02 cs.AI cs.SE 版本更新

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults

驯服系统复杂性:揭秘软件工程代理在诊断Linux内核故障中的作用

Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou

发表机构 * Fudan University(复旦大学) Nanyang Technological University(南洋理工大学)

AI总结 针对Linux内核故障定位挑战,提出LinuxFLBench基准和LinuxFL$^+$增强框架,将现有LLM代理的文件级top-1准确率从41.6%提升7.2%-11.2%。

Comments Accepted to ACL 2026

详情
AI中文摘要

Linux内核是一个关键系统,作为众多系统的基础。Linux内核中的错误可能导致严重后果,影响数十亿用户。故障定位(FL)旨在识别软件中的错误代码元素,在软件质量保证中起着至关重要的作用。虽然最近的LLM代理在SWE-bench等最新基准测试中取得了有希望的FL准确率,但目前尚不清楚这些方法在Linux内核中的表现如何,因为由于大规模代码库、有限的可观测性和多样的影响因素,FL在该领域更具挑战性。在本文中,我们介绍了LinuxFLBench,这是一个从真实世界Linux内核错误构建的FL基准。我们进行了一项实证研究,以评估最先进的LLM代理在Linux内核上的性能。我们的初步结果显示,现有代理在此任务上表现不佳,文件级最佳top-1准确率仅为41.6%。为应对这一挑战,我们提出了LinuxFL$^+$,一个旨在提高LLM代理在Linux内核中FL有效性的增强框架。LinuxFL$^+$以最小的成本显著提高了所有研究代理的FL准确率(例如,准确率提升7.2% - 11.2%)。

英文摘要

The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs.

2505.13273 2026-06-02 cs.AI cs.LG 版本更新

EMoE: Training-Free Expert Disagreement for Uncertainty-Aware Text-to-Image Diffusion

EMoE: 面向不确定性感知的文本到图像扩散的无训练专家分歧方法

Lucas Berry, Axel Brando, Wei-Di Chang, Juan Camilo Gamboa Higuera, David Meger

发表机构 * McGill University(麦吉尔大学) Barcelona Supercomputing Center (BSC)(巴塞罗那超级计算中心 (BSC)) Ideogram AI

AI总结 提出EMoE方法,通过预训练MoE扩散模型中早期MoE层的专家分歧,无需训练即可估计认知不确定性,用于提示风险诊断和生成质量排序。

详情
AI中文摘要

大型文本到图像扩散模型很少在提示可能产生低质量生成时提供可靠信号,尤其是在训练数据未公开的情况下。我们研究预训练混合专家(MoE)扩散模型中的专家分歧是否可以作为认知不确定性的可靠估计。我们引入EMoE,一种无训练方法,在早期MoE层分离专家特定的计算路径,跨路径使用相同的初始噪声,并在第一步去噪后测量其潜在表示之间的方差。这提供了在完整图像生成之前的不确定性感知提示信号,无需辅助网络或训练扩散集成。在COCO和CC3M上,EMoE根据文本-图像对齐质量指标对提示进行排序,比扩散特定和基于路由的基线更一致。我们进一步将EMoE应用于多语言提示,并发现分歧和生成质量中存在系统的语言依赖性差异,包括共享词汇效应。这些结果使EMoE成为MoE文本到图像扩散模型中提示风险、模型覆盖和偏差分析的实用诊断工具。

英文摘要

Large text-to-image diffusion models rarely expose reliable signals of when a prompt is likely to produce a poorly aligned generation, especially when training data is undisclosed. We study whether expert disagreement inside pre-trained mixture-of-experts (MoE) diffusion models can serve as a reliable estimate for epistemic uncertainty. We introduce EMoE, a training-free method that separates expert-specific computation paths at an early MoE layer, uses the same initial noise across paths, and measures variance among their latent representations after the first denoising step. This provides an uncertainty-aware prompt signal before full image generation, without auxiliary networks or training diffusion ensembles. On COCO and CC3M, EMoE ranks prompts by text-image alignment quality metrics more consistently than diffusion-specific and router-based baselines. We further apply EMoE to multilingual prompts and find systematic language-dependent differences in disagreement and generation quality, including shared-vocabulary effects. These results position EMoE as a practical diagnostic tool for prompt risk, model coverage, and bias analysis in MoE text-to-image diffusion models.

2503.03137 2026-06-02 cs.AI cs.LG cs.NE 版本更新

Learning to Reduce Search Space for Generalizable Neural Routing Solver

学习减少搜索空间以实现泛化的神经路由求解器

Changliang Zhou, Xi Lin, Zhenkun Wang, Qingfu Zhang

发表机构 * School of Automation and Intelligent Manufacturing(自动化与智能制造学院) Southern University of Science and Technology(南方科技大学) School of Mathematics and Statistics(数学与统计学学院) Xi'an Jiaotong University(西安交通大学) Department of Computer Science(计算机科学系) City University of Hong Kong(香港城市大学)

AI总结 提出首个基于学习的动态搜索空间缩减框架L2R,通过自适应剪枝节点来高效求解大规模车辆路径问题,在千万节点规模上保持高质量解。

Comments accepted by SIGKDD 2026

详情
AI中文摘要

构造性神经组合优化(NCO)通过直接学习构造近似最优解,为解决车辆路径问题(VRPs)提供了一种有前景的范式,从而减少了对算法设计专家知识的依赖。然而,由于高计算复杂度,将这些方法扩展到大规模实例仍然具有挑战性。虽然最近的动态搜索空间缩减(SSR)方法可以通过基于几何距离的剪枝提高推理效率,但它们通常难以处理具有非均匀分布的复杂实例,或者当最优解严重依赖于非空间约束时。为了解决这一关键问题,我们提出了学习减少(L2R),这是首个基于学习的动态SSR框架。L2R通过从问题特定特征中提取模式来学习自适应地优先考虑节点,从而在每一步剪枝搜索空间,实现高效且可扩展的解构造。大量实验表明,我们的L2R框架在不同的VRP变体上对不同问题规模和数据分布具有稳健的泛化能力。据我们所知,L2R是首个有效扩展到具有1000万个节点的VRP实例同时保持高质量解的神经求解器,这显著推动了NCO在泛化和可扩展性方面的前沿。我们的代码可在https://github.com/CIAM-Group/L2R获取。

英文摘要

Constructive neural combinatorial optimization (NCO) offers a promising paradigm for solving vehicle routing problems (VRPs) by directly learning to construct approximate optimal solutions, thereby reducing reliance on expert knowledge for algorithm design. However, scaling these methods to handle large-scale instances remains challenging due to high computational complexity. While recent dynamic search space reduction (SSR) methods can improve inference efficiency through geometric distance-based pruning, they often struggle on complex instances with non-uniform distributions or when optimal solutions rely heavily on non-spatial constraints. To address this critical issue, we propose Learning to Reduce (L2R), which is the first learning-based dynamic SSR framework. L2R learns to adaptively prioritize nodes by extracting patterns from problem-specific features to prune the search space at each step, enabling efficient and scalable solution construction. Extensive experiments show that our L2R framework generalizes robustly to different problem scales and data distributions on various VRP variants. To the best of our knowledge, L2R is the first neural solver to effectively scale to VRP instances with $10$ million nodes while maintaining high solution quality, which significantly pushes the frontier of NCO in terms of generalization and scalability. Our code is available at https://github.com/CIAM-Group/L2R.

2503.06473 2026-06-02 cs.CV cs.AI 版本更新

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

通过剪枝冗余检索增强层注意力效率

Hanze Li, Yaosong Du, Zhibo Yao, Mengyao Zeng, Xiuqi Ge, Xiande Huang

发表机构 * De Artificial Intelligence Lab(德人工智能实验室)

AI总结 针对层注意力机制中相邻层权重冗余导致特征重复和训练效率低的问题,提出基于KL散度量化冗余并利用增强Beta分位数映射(EBQM)跳过冗余层的高效层注意力(ELA)架构,在图像分类和目标检测任务中训练时间减少30%且性能提升。

Comments 5 pages

详情
AI中文摘要

越来越多的证据表明,层注意力机制增强了深度神经网络中层间的交互,显著推进了网络架构的发展。然而,现有的层注意力方法存在冗余问题,因为相邻层学习的注意力权重往往变得高度相似。这种冗余导致多个层提取几乎相同的特征,降低了模型的表示能力并增加了训练时间。为了解决这个问题,我们提出了一种新颖的方法,利用相邻层之间的Kullback-Leibler(KL)散度来量化冗余。此外,我们引入了一种增强Beta分位数映射(EBQM)方法,能够准确识别并跳过冗余层,从而保持模型稳定性。我们提出的高效层注意力(ELA)架构提高了训练效率和整体性能,在图像分类和目标检测等任务中实现了30%的训练时间减少,同时提升了性能。

英文摘要

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture, improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.

2504.21427 2026-06-02 cs.LG cs.AI 版本更新

MPEC: Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers

MPEC:通过集成基于聚类的分类器实现流形保持的脑电图分类

Shermin Shahbazi, Mohammad-Reza Nasiri, Majid Ramezani

发表机构 * Department of Electrical and Computer(电气与计算机系) Department of Computer Science and Engineering, Information Technology(计算机科学与工程系,信息科技)

AI总结 提出MPEC方法,通过协方差矩阵和RBF核的特征工程以及黎曼流形上的改进K-means聚类集成,解决EEG信号的非欧几里得流形结构问题,在BCI Competition IV数据集2a上取得显著提升。

Comments 7 pages ,3 figures

详情
AI中文摘要

脑电图信号的准确分类对于脑机接口(BCI)和神经假体应用至关重要,然而许多现有方法未能考虑EEG数据的非欧几里得流形结构,导致性能欠佳。保留这种流形信息对于捕捉EEG信号的真实几何结构至关重要,但传统分类技术在很大程度上忽视了这一需求。为此,我们提出了MPEC(通过集成基于聚类的分类器实现流形保持的EEG分类),它引入了两项关键创新:(1)一个特征工程阶段,结合协方差矩阵和径向基函数(RBF)核来捕捉EEG通道之间的线性和非线性关系;(2)一个聚类阶段,采用针对黎曼流形空间定制的改进K-means算法,确保局部几何敏感性。通过集成多个基于聚类的分类器,MPEC取得了优越的结果,并在BCI Competition IV数据集2a上得到了显著改进的验证。

英文摘要

Accurate classification of EEG signals is crucial for brain-computer interfaces (BCIs) and neuroprosthetic applications, yet many existing methods fail to account for the non-Euclidean, manifold structure of EEG data, resulting in suboptimal performance. Preserving this manifold information is essential to capture the true geometry of EEG signals, but traditional classification techniques largely overlook this need. To this end, we propose MPEC (Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers), that introduces two key innovations: (1) a feature engineering phase that combines covariance matrices and Radial Basis Function (RBF) kernels to capture both linear and non-linear relationships among EEG channels, and (2) a clustering phase that employs a modified K-means algorithm tailored for the Riemannian manifold space, ensuring local geometric sensitivity. Ensembling multiple clustering-based classifiers, MPEC achieves superior results, validated by significant improvements on the BCI Competition IV dataset 2a.

2504.17471 2026-06-02 cs.LG cs.AI cs.DC 版本更新

GRANITE : a Byzantine-Resilient Dynamic Gossip Learning Framework

GRANITE:一种拜占庭鲁棒的动态八卦学习框架

Yacine Belal, Mohamed Maouche, Sonia Ben Mokhtar

发表机构 * CEA, List, Université Paris-Saclay Palaiseau(CEA、List、巴黎-萨克雷大学帕莱索分校) INRIA, INSA Lyon, CITI, UR3720(INRIA、里昂INSA、CITI、UR3720) LIRIS, INSA Lyon, CNRS Lyon(LIRIS、里昂INSA、里昂CNRS)

AI总结 针对动态八卦学习中拜占庭节点通过毒化模型和操纵节点采样发起的双重攻击,提出GRANITE框架,通过累积节点标识知识并动态调整聚合阈值,实现鲁棒学习,理论证明拜占庭节点在局部邻域呈指数衰减,实验表明在30%拜占庭节点下精度接近非拜占庭场景,且通信成本降低9倍。

详情
AI中文摘要

八卦学习是一种去中心化的学习范式,用户通过迭代地与少量邻居节点交换和聚合模型。最近的方法依赖于使用随机节点采样协议构建的动态通信图,这些协议已被证明可以加速收敛。然而,我们表明这些方法容易受到双重攻击:拜占庭节点可以毒化模型并操纵节点采样以放大其影响力。我们通过GRANITE框架应对这种组合威胁,该框架用于在存在拜占庭节点的稀疏动态图上进行鲁棒学习。GRANITE随时间累积关于遇到的节点标识的知识,并根据每个节点邻域中估计的拜占庭密度动态调整局部聚合阈值。我们证明,在GRANITE下,局部邻域中的拜占庭节点呈现指数衰减。我们进一步推导了GRANITE生成图的鲁棒性条件。实验结果表明,在30%拜占庭节点下,GRANITE的收敛精度在非拜占庭精度的5%以内,收敛速度更快,且通信成本降低高达9倍。

英文摘要

Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent approaches rely on dynamic communication graphs built using Random Peer Sampling (RPS) protocols which have been proven to accelerate convergence. However, we show that these approaches are vulnerable to a dual attack: Byzantine nodes can poison models and manipulate peer sampling to amplify their influence. We address this combination of threats with GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of Byzantine nodes. GRANITE accumulates knowledge about encountered node identifiers over time and dynamically adjusts local aggregation thresholds based on estimated Byzantine density in the neighbourhood of each node. We demonstrate that under GRANITE, the Byzantine presence in local neighborhoods exhibits an exponential decay. We further derive the robustness conditions of the graphs generated by GRANITE. Empirically, our results indicate that GRANITE converges within 5% of non-Byzantine accuracy under 30% Byzantines nodes, offers faster convergence and operates on graphs with up to 9x lower communication cost.

2504.16139 2026-06-02 cs.CY cs.AI 版本更新

Enhancing Trust Through Standards: A Comparative Risk-Impact Framework for Aligning ISO AI Standards with Global Ethical and Regulatory Contexts

通过标准增强信任:一种用于对齐ISO AI标准与全球伦理和监管背景的比较风险-影响框架

Sridharan Sankaran

发表机构 * Research and Innovation Group(研究与创新组) Tata Consultancy Services(塔塔咨询服务)

AI总结 提出比较风险-影响评估框架,分析ISO AI标准在不同监管环境下的伦理风险覆盖情况,并建议通过强制审计、区域附件和隐私模块增强其全球适用性。

详情
AI中文摘要

随着人工智能重塑行业和社会,确保其可信赖性——通过减轻偏见、不透明性和问责缺陷等伦理风险——仍然是一个全球性挑战。国际标准化组织(ISO)的AI标准,如ISO/IEC 24027和24368,旨在通过将公平性、透明度和风险管理嵌入AI系统来促进负责任的发展。然而,它们的有效性在不同监管环境中存在差异,从欧盟基于风险的AI法案到中国注重稳定的措施以及美国分散的州级举措。本文引入了一种新颖的比较风险-影响评估框架,以评估ISO标准在这些背景下如何应对伦理风险,并提出增强其全球适用性的改进建议。通过将ISO标准映射到欧盟AI法案,并调查十个地区(包括英国、加拿大、印度、日本、新加坡、韩国和巴西)的监管框架,我们建立了伦理对齐的基线。该框架应用于欧盟、美国科罗拉多州和中国的案例研究,揭示了差距:自愿性ISO标准在执行方面不足(例如科罗拉多州),并且低估了区域特定风险(如中国的隐私)。我们建议强制风险审计、区域特定附录和以隐私为重点的模块,以增强ISO的适应性。这种方法不仅综合了全球趋势,还提供了一种可复制的工具,用于将标准化与伦理要求对齐,促进全球AI的互操作性和信任。政策制定者和标准机构可以利用这些见解来发展AI治理,确保随着技术发展满足多样化的社会需求。

英文摘要

As artificial intelligence (AI) reshapes industries and societies, ensuring its trustworthiness-through mitigating ethical risks like bias, opacity, and accountability deficits-remains a global challenge. International Organization for Standardization (ISO) AI standards, such as ISO/IEC 24027 and 24368, aim to foster responsible development by embedding fairness, transparency, and risk management into AI systems. However, their effectiveness varies across diverse regulatory landscapes, from the EU's risk-based AI Act to China's stability-focused measures and the U.S.'s fragmented state-led initiatives. This paper introduces a novel Comparative Risk-Impact Assessment Framework to evaluate how well ISO standards address ethical risks within these contexts, proposing enhancements to strengthen their global applicability. By mapping ISO standards to the EU AI Act and surveying regulatory frameworks in ten regions-including the UK, Canada, India, Japan, Singapore, South Korea, and Brazil-we establish a baseline for ethical alignment. The framework, applied to case studies in the EU, US-Colorado, and China, reveals gaps: voluntary ISO standards falter in enforcement (e.g., Colorado) and undervalue region-specific risks like privacy (China). We recommend mandatory risk audits, region-specific annexes, and a privacy-focused module to enhance ISO's adaptability. This approach not only synthesizes global trends but also offers a replicable tool for aligning standardization with ethical imperatives, fostering interoperability and trust in AI worldwide. Policymakers and standards bodies can leverage these insights to evolve AI governance, ensuring it meets diverse societal needs as the technology advances.

2406.09953 2026-06-02 cs.RO cs.AI 版本更新

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

DAG-Plan:生成有向无环依赖图用于双臂协作规划

Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Shijia Peng, Chengkai Hou, Lingyue Guo, Ping Luo, Shanghang Zhang, Yanfeng Lu

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA)(多模态人工智能系统国家重点实验室,自动化研究所,中国科学院(CASIA)) School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)(中国科学院大学人工智能学院) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(多媒体信息处理国家重点实验室,北京大学计算机科学学院) Department of Computer Science, The University of Hong Kong(香港大学计算机科学系) OpenGVLab, Shanghai AI Laboratory(上海人工智能实验室,OpenGVLab)

AI总结 提出DAG-Plan框架,首次使用有向无环图作为双臂协调的核心表示,通过一次LLM解析生成结构化DAG,实现自适应并行执行,在双臂厨房基准测试中成功率提升48%,执行效率提升84.1%。

Comments ICRA 2026

详情
AI中文摘要

双臂机器人有望提高效率,但需要规划具有非线性子任务依赖关系的复杂任务。当前使用大型语言模型(LLM)的方法存在根本性权衡:生成线性序列效率高但无法建模并行性和适应变化,而迭代查询具有适应性但过于缓慢且成本高昂。为弥合这一差距,我们引入DAG-Plan,一种新颖的任务规划框架,首次采用有向无环图(DAG)作为双臂协调的核心表示。关键洞察在于DAG天然捕获复杂的子任务依赖关系并明确揭示并行执行的机会。在该框架内,LLM仅被使用一次作为强大的语义解析器,将自然语言指令转换为结构化的DAG。在执行过程中,我们的系统基于实时环境观察动态地将候选节点分配给合适的机械臂,实现真正的自适应并行操作。在双臂厨房基准测试上的广泛评估表明,DAG-Plan的结构化方法从根本上优于现有范式。与单查询线性序列方法相比,通过稳健管理依赖关系,成功率提高了48%;与迭代查询方法相比,通过消除重复LLM调用的延迟,执行效率提高了84.1%。我们的工作表明,基于图的原则性表示是解锁高效可靠的基于LLM的复杂机器人系统规划的关键。更多演示和代码请访问 https://sites.google.com/view/dag-plan。

英文摘要

Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that for the first time employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once as a powerful semantic parser to translate a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than single-query linear sequence methods with dual arm by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and code are available on https://sites.google.com/view/dag-plan.

2504.04718 2026-06-02 cs.CL cs.AI 版本更新

T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models

T1:小语言模型测试时计算扩展的工具集成验证

Minki Kang, Jongwon Jeong, Jaewoong Cho

发表机构 * KAIST(韩国科学技术院) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) KRAFTON

AI总结 针对小语言模型在测试时扩展中验证能力不足的问题,提出T1框架,通过外部工具过滤候选输出后由小语言模型进行最终验证,显著提升验证准确率和测试时扩展性能。

Comments ICLR 2026

详情
AI中文摘要

近期研究表明,测试时计算扩展能有效提升小语言模型(sLMs)的性能。然而,先前研究主要利用额外的大模型作为验证器进行测试时计算扩展,而sLMs自身的验证能力尚未被充分探索。本文研究sLMs在测试时扩展中能否可靠地验证输出候选。我们发现,即使从大验证器进行知识蒸馏,sLMs在需要记忆的任务(如数值计算和事实核查)上仍表现不佳。为解决这一局限,我们提出工具集成验证(T1),这是一个两阶段框架:首先用外部工具过滤候选,然后使用sLM进行最终验证,将记忆密集型步骤卸载到代码解释器等工具上。在T1框架内,我们证明卸载到外部工具可减轻sLMs的记忆负担,并提升测试时扩展性能。在MATH基准上的实验表明,采用T1的Llama-3.2 1B模型在测试时扩展下性能优于规模更大的Llama-3.1 8B模型。此外,T1提高了过程奖励模型(PRMs)和评论家模型的验证准确率。我们的发现凸显了工具集成在显著提升sLMs验证能力方面的潜力。

英文摘要

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably verify the output candidates under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then uses an sLM for final verification, offloading memorization-heavy steps to tools such as a code interpreter. Within T1, we prove that offloading to external tools reduces the memorization burden on sLMs and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 improves the verification accuracy of both process reward models (PRMs) and critic models. Our findings highlight the potential of tool integration to substantially improve the verification abilities of sLMs.

2503.05500 2026-06-02 cs.CL cs.AI 版本更新

EuroBERT: Scaling Multilingual Encoders for European Languages

EuroBERT:面向欧洲语言的多语言编码器扩展

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

发表机构 * Artefact Research Center(Artfact研究中心) CNRS(法国国家科学研究中心) ISIA Lab(ISIA实验室) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出EuroBERT系列多语言编码器,通过整合生成式模型的最新进展,在检索、回归和分类等任务上超越现有模型,并原生支持长达8192个token的序列。

Comments 28 pages, 8 figures, 13 tables

详情
AI中文摘要

用于检索、回归和分类的通用多语言向量表示传统上来自双向编码器模型。尽管应用广泛,但编码器最近被生成式仅解码器模型的进步所掩盖。然而,推动这一进展的许多创新并非解码器所独有。在本文中,我们通过这些进展的视角重新审视多语言编码器的发展,并介绍EuroBERT,一个覆盖欧洲及全球广泛使用语言的多语言编码器家族。我们的模型在包括多语言能力、数学和编码在内的多种任务上优于现有替代方案,并原生支持长达8192个token的序列。我们还研究了EuroBERT背后的设计决策,提供了关于数据集组成和训练流程的见解。我们公开发布EuroBERT模型,包括中间训练检查点以及我们的训练框架。

英文摘要

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.

2503.15639 2026-06-02 cs.CV cs.AI 版本更新

A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition

一种轻量级上下文驱动的免训练网络用于场景文本分割与识别

Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu

发表机构 * CVPR Unit, Indian Statistical Institute, Kolkata, India(印度统计研究所柯西拉分校CVPR单位) Manipal University Jaipur, India(印度贾浦尔曼普尔大学) University of Salford, UK(英国萨尔福德大学) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出一种基于上下文理解、无需训练的即插即用框架,通过注意力分割和语义评估实现高效场景文本识别,性能与SOTA相当且资源消耗更低。

Comments Accepted at ICDAR 2025 (ORAL) 21 pages, 8 figures, 7 tables

详情
AI中文摘要

现代场景文本识别系统通常依赖于大型端到端架构,这些架构需要大量训练,并且对于实时场景来说成本过高。在这种情况下,由于内存、计算资源和延迟的限制,部署重型模型变得不切实际。为了应对这些挑战,我们提出了一种新颖的、无需训练的即插即用框架,该框架利用预训练文本识别器的优势,同时最小化冗余计算。我们的方法使用基于上下文的理解,并引入了一个基于注意力的分割阶段,该阶段在像素级别细化候选文本区域,从而改进下游识别。我们不执行传统的文本检测(即特征图与源图像之间的块级比较),而是利用预训练的标题生成器来利用上下文信息,使框架能够直接从场景上下文生成单词预测。候选文本经过语义和词汇评估以获得最终分数。达到或超过预定义置信度阈值的预测绕过更重的端到端文本STR(场景文本识别)流程,确保更快的推理并减少不必要的计算。在公共基准上的实验表明,我们的范式实现了与最先进系统相当的性能,但所需资源大大减少。我们的代码可在此处找到:https://ritabrata04.github.io/Context-driven-STR/。

英文摘要

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection that follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context.Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.Our code can be found here: https://ritabrata04.github.io/Context-driven-STR/.

2503.07154 2026-06-02 cs.LG cs.AI 版本更新

Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

推理时缩放的思想可以有益于生成式预训练算法

Jiaming Song, Linqi Zhou

发表机构 * Luma AI

AI总结 本文指出自回归模型与扩散模型的二分法是错误的,提出应从推理过程(序列扩展与状态细化)出发设计训练目标,并论证了推理算法优先于训练目标的原则。

Comments updated some new literature on flow maps and continuous LLMs

详情
AI中文摘要

生成式预训练通常被框定在一个错误的二分法中:用于离散信号的自回归模型与用于连续信号的扩散模型。我们认为这种二分法是错误的,因为它混淆了模型家族、数据表示、训练目标和推理过程。自回归是一种通过归一化条件采样扩展序列的推理过程,而扩散是一种反复修正现有状态的细化过程。因此,更有用的对比不是自回归与扩散,而是用交叉熵学习的离散标记与用扩散风格目标学习的连续标记,以及用于从中采样的推理算法。从这个角度来看,算法进展应优先考虑推理时间效率的两个维度:序列扩展和状态细化。我们主张在训练目标之前设计推理过程,因为如果推理映射省略了必要参数或施加了错误分解,训练方法无法弥补。我们通过DDIM风格采样器的目标时间限制、多标记预测的联合分布限制,以及直接参数化长距离推理移动的最新流映射和少步蒸馏方法来说明这一原则。

英文摘要

Generative pre-training is often framed through a false dichotomy between autoregressive models for discrete signals and diffusion models for continuous signals. We argue that the dichotomy is false because it conflates model family, data representation, training objective, and inference procedure. Autoregression is an inference procedure that expands a sequence through normalized conditional draws, while diffusion is a refinement procedure that repeatedly revises an existing state. The more useful contrast is therefore not autoregressive versus diffusion, but discrete tokens learned with cross-entropy versus continuous tokens learned with diffusion-style objectives, together with the inference algorithms used to sample from them. From this perspective, algorithmic progress should prioritize inference-time efficiency along two axes: sequence expansion and state refinement. We advocate designing the inference procedure before the training objective, because a training method cannot compensate for an inference map that omits necessary arguments or imposes an incorrect factorization. We illustrate this principle through a target-time limitation of DDIM-style samplers, a joint-distribution limitation of multi-token prediction, and recent flow-map and few-step distillation methods that directly parameterize long-range inference moves.

2503.06136 2026-06-02 cs.CV cs.AI 版本更新

GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

GSV3D: 基于高斯溅射的几何蒸馏与稳定视频扩散用于单图像3D物体生成

Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems, Beihang University(虚拟现实技术与系统国家重点实验室,北京航空航天大学) SenseTime Research(商汤研究) PBVR

AI总结 提出一种结合2D扩散模型隐式3D推理能力与高斯溅射几何蒸馏的方法,通过高斯溅射解码器将SV3D潜变量输出转换为显式3D表示,实现多视图一致性和高质量3D生成。

详情
AI中文摘要

基于图像的3D生成在机器人和游戏领域有广泛应用,其中高质量、多样化的输出和一致的3D表示至关重要。然而,现有方法存在局限性:3D扩散模型受限于数据集稀缺和缺乏强大的预训练先验,而基于2D扩散的方法则难以保证几何一致性。我们提出了一种方法,利用2D扩散模型的隐式3D推理能力,同时通过基于高斯溅射的几何蒸馏确保3D一致性。具体来说,所提出的高斯溅射解码器通过将SV3D潜变量输出转换为显式3D表示来强制3D一致性。与仅依赖隐式2D表示进行视频生成的SV3D不同,高斯溅射显式编码空间和外观属性,通过几何约束实现多视图一致性。这些约束纠正了视图不一致性,确保了稳健的几何一致性。因此,我们的方法同时生成高质量、多视图一致的图像和精确的3D模型,为基于单图像的3D生成提供了可扩展的解决方案,并弥合了2D扩散多样性与3D结构一致性之间的差距。实验结果表明,该方法在多个数据集上实现了最先进的多视图一致性和强泛化能力。代码将在接收后公开。

英文摘要

Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

2502.04512 2026-06-02 cs.AI 版本更新

Safety Must Precede the Deployment of Open-Ended AI

安全必须优先于开放式AI的部署

Ivaxi Sheth, Jan Wehner, Sahar Abdelnabi, Ruta Binkyte, Mario Fritz

发表机构 * CISPA-Helmholtz Center of Information Security(CISPA-海德堡信息安全中心) MPI for Intelligent Systems, ELLIS Institute Tübingen, Tübingen AI Center(智能系统Max Planck研究所、图宾根ELLIS研究所、图宾根人工智能中心)

AI总结 本文提出开放式AI系统因自主无限生成新行为而带来预测性丧失、新兴错位和控制困难等独特安全挑战,需在部署前主动研究,并给出挑战分类和研究方向。

Comments Accepted to ICML'26

详情
AI中文摘要

AI的进步在很大程度上由基础模型和好奇心驱动的学习共同推动,旨在提高能力和适应性。在此背景下,开放式(即AI智能体自主且无限地生成新行为、表示或解决方案)引起了越来越多的兴趣。这在自我进化智能体和长期发现的背景下变得相关。本文立场论文认为,开放式AI系统的定义特性引入了一类独特且未被充分探索的安全挑战,包括预测性丧失、新兴错位以及随着系统超出初始设计假设而难以维持有效控制,这些挑战必须被预先解决。这些挑战在性质上不同于与任务受限或静态模型相关的挑战,且不太可能仅通过现有安全框架解决,因此必须在大规模部署之前主动审视这些风险。论文提出了关键挑战的分类,讨论了研究机会,并呼吁采取协调行动以支持开放式AI的安全和负责任开发。

英文摘要

AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capability and adaptability. Within this landscape, open-endedness, where AI agents autonomously and indefinitely generate novel behaviors, representations, or solutions, has gained increasing interest. This has become relevant in the context of self-evolving agents and long-horizon discovery. This position paper argues that the defining properties of open-ended AI systems introduce a distinct and underexplored class of safety challenges, including loss of predictability, emergent misalignment, and difficulties in maintaining effective control as systems evolve beyond their initial design assumptions, that must be addressed preemptively. These challenges differ qualitatively from those associated with task-bounded or static models and are unlikely to be addressed by existing safety frameworks alone, which is why these risks must be examined proactively, before large-scale deployment. The paper proposes a taxonomy for key challenges, discusses research opportunities, and calls for coordinated action to support the safe and responsible development of open-ended AI.

2502.04646 2026-06-02 cs.LG cs.AI 版本更新

Efficient Weighted Sampling via Score-based Generative Models

基于分数生成模型的高效加权采样

Heasung Kim, Taekyun Lee, Hyeji Kim, Gustavo de Veciana

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种无需训练的加权采样框架,通过轻量级引导近似和不确定性感知调度器,在预训练分数生成模型上实现高效、稳定的采样,并在大规模设置中取得1.2至4.7倍加速。

Comments 37 pages

详情
AI中文摘要

加权采样——从与基概率密度函数和权重函数乘积成比例的概率密度函数中采样——是一种基础技术,在方差缩减、有偏采样、数据增强等领域有广泛应用。利用日益可用的预训练分数生成模型,我们提出了一种无需训练的加权采样框架,通过以原则性和计算高效的方式,用辅助引导项增强预训练基分数函数,来近似目标分布的逆向扩散过程。我们的方法基于两个关键组件:一个轻量级的引导近似,避免了分数函数和权重函数的高阶导数;以及一个不确定性感知调度器,基于近似误差的时间分析动态调整引导强度。这些组件共同实现了准确稳定的采样,无需依赖现有方法通常需要的基于粒子的重采样或Hessian评估。我们从合成设置到大规模设置(如Stable Diffusion XL)验证了方法的有效性,在该框架下,我们实现了1.2倍到4.7倍的加速,同时在任务性能上始终匹配或超越最先进的基线。这些结果使我们的方法成为生成应用中任务自适应、时间敏感采样的可扩展且推理高效的解决方案。

英文摘要

Weighted sampling -- sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function -- is a fundamental technique with wide-ranging applications in variance reduction, biased sampling, data augmentation, and more. Leveraging the increasing availability of pretrained score-based generative models (SGMs), we propose a training-free weighted sampling framework that approximates the backward diffusion process of the target distribution by augmenting the pretrained base score function with an auxiliary guidance term, in a principled and computationally efficient manner. Our approach builds on two key components: a lightweight approximation of the guidance that avoids costly higher-order derivatives of both the score and weight functions, and an uncertainty-aware scheduler that dynamically adjusts the guidance strength based on a temporal analysis of approximation error. Together, these components enable accurate and stable sampling without relying on particle-based resampling or Hessian evaluations commonly required by existing methods. We validate the effectiveness of our method from synthetic to large-scale settings such as Stable Diffusion XL, where our framework achieves $1.2\times$ to $4.7\times$ speedups while consistently matching or outperforming state-of-the-art baselines in task performance. These results position our method as a scalable and inference-efficient solution for task-adaptive, time-sensitive sampling in generative applications.

2111.03861 2026-06-02 cs.CV cs.AI cs.LG 版本更新

What augmentations are sensitive to hyper-parameters and why?

哪些数据增强对超参数敏感以及为什么?

Ch Muhammad Awais, Imad Eddine Ibrahim Bekkouch

发表机构 * Knowledge Representation Lab Innopolis University(知识表示实验室 印尼奥利普斯大学) Sorbonne Center for Artificial Intelligence - SCAI Sorbonne University(索邦人工智能中心 - SCAI 索邦大学)

AI总结 本研究通过局部代理(LIME)解释和线性回归系数评估不同数据增强对模型超参数的敏感性、一致性和影响,发现某些增强对超参数高度敏感,而另一些则更稳健可靠。

Comments 10 pages, 17 figures

详情
Journal ref
Intelligent Computing: Proceedings of the 2022 Computing Conference
AI中文摘要

我们对数据集应用增强以提高预测质量,并使最终模型对噪声数据和领域漂移更具鲁棒性。然而,问题仍然存在:这些增强在不同的超参数下表现如何?在本研究中,我们通过执行局部代理(LIME)解释来评估增强对模型超参数的敏感性、一致性和影响,当不同增强应用于机器学习模型时,解释超参数的影响。我们利用线性回归系数来加权每个增强。我们的研究证明,有些增强对超参数高度敏感,而其他增强则更具鲁棒性和可靠性。

英文摘要

We apply augmentations to our dataset to enhance the quality of our predictions and make our final models more resilient to noisy data and domain drifts. Yet the question remains, how are these augmentations going to perform with different hyper-parameters? In this study we evaluate the sensitivity of augmentations with regards to the model's hyper parameters along with their consistency and influence by performing a Local Surrogate (LIME) interpretation on the impact of hyper-parameters when different augmentations are applied to a machine learning model. We have utilized Linear regression coefficients for weighing each augmentation. Our research has proved that there are some augmentations which are highly sensitive to hyper-parameters and others which are more resilient and reliable.

2501.04424 2026-06-02 cs.AI cs.CL 版本更新

NSA: Neuro-symbolic ARC Challenge

NSA: 神经符号 ARC 挑战

Paweł Batorski, Jannik Brinkmann, Paul Swoboda

发表机构 * Heinrich Heine Universität Düsseldorf(杜伊斯堡-艾森大学) University of Mannheim(曼海姆大学)

AI总结 提出一种结合 transformer 提案生成与领域特定语言组合搜索的神经符号方法,在 ARC 评估集上超越现有最优方法 27%。

详情
Journal ref
ESANN 2026 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 99-104, 2026
AI中文摘要

抽象与推理语料库 (ARC) 评估了机器学习模型和组合搜索方法都难以处理的通用推理能力。我们提出了一种神经符号方法,该方法结合了用于提案生成的 transformer 和使用领域特定语言的组合搜索。Transformer 通过提出有希望的搜索方向来缩小搜索空间,从而使组合搜索能够在短时间内找到实际解决方案。我们使用合成生成的数据预训练 transformer。在测试时,我们生成额外的任务特定训练任务并微调我们的模型。我们的结果在 ARC 评估集上比现有最优方法高出 27%,并且在 ARC 训练集上表现良好。我们在 https://github.com/Batorskq/NSA 公开了我们的代码和数据集。

英文摘要

The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in short time. We pre-train the trainsformer with synthetically generated data. During test-time we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.

2412.19419 2026-06-02 cs.LG cs.AI 版本更新

Introduction to Graph Neural Networks for Machine Learning Engineers

面向机器学习工程师的图神经网络导论

James H. Tanis, Chris Giannella, Adrian V. Mariano, Daoud Meerzaman

发表机构 * The MITRE Corporation(MITRE公司) National Cancer Institute(国家癌症研究所)

AI总结 本文通过编码器-解码器框架介绍图神经网络,并通过同质图上的理论和实验分析不同训练规模和复杂度下的行为,重点讨论过平滑和过挤压问题。

Comments Author accepted manuscript. Title and metadata updated to match the published ACM Computing Surveys version. 73 pages, including references and supplementary material

详情
AI中文摘要

图神经网络是专为节点或边带有属性的图设计的深度神经网络。由于其在广泛任务上的出色表现,文献中关于这些模型的研究论文数量正在快速增长。本综述通过编码器-解码器框架介绍图神经网络,并提供了一系列图分析任务的解码器示例。它利用理论和对同质图的大量实验,展示了图神经网络在不同训练规模和图复杂度下的行为,重点强调了过平滑和过挤压现象。

英文摘要

Graph neural networks are deep neural networks designed for graphs with attributes attached to nodes or edges. The number of research papers in the literature concerning these models is growing rapidly due to their impressive performance on a broad range of tasks. This survey introduces graph neural networks through the encoder-decoder framework and provides examples of decoders for a range of graph analytic tasks. It uses theory and numerous experiments on homogeneous graphs to illustrate the behavior of graph neural networks under different training sizes and degrees of graph complexity, with an emphasis on oversmoothing and oversquashing.

2411.17790 2026-06-02 cs.CV cs.AI 版本更新

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

基于潜在先验的自监督单目内窥镜深度与姿态估计

Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

发表机构 * University of Oxford(牛津大学) University of Leeds(利兹大学)

AI总结 提出一种结合生成潜在库和变分自编码器的自监督框架,通过自然图像深度先验和姿态潜在变量正则化,实现内窥镜复杂场景下的高精度深度与姿态估计。

详情
AI中文摘要

内窥镜中的精确3D映射能够实现胃肠道(GI)内定量、整体的病变表征,这需要可靠的深度和姿态估计。然而,内窥镜系统是单目的,现有依赖合成数据集或复杂模型的方法在具有挑战性的内窥镜条件下往往缺乏泛化能力。我们提出了一种鲁棒的自监督单目深度和姿态估计框架,该框架结合了生成潜在库(Generative Latent Bank)和变分自编码器(VAE)。生成潜在库利用自然图像中的广泛深度场景来调节深度网络,通过潜在特征先验增强深度预测的真实感和鲁棒性。对于姿态估计,我们将其重新构建在VAE框架内,将姿态转换视为潜在变量以正则化尺度、稳定z轴突出性并提高x-y灵敏度。这种双重精炼流程能够实现精确的深度和姿态预测,有效应对胃肠道复杂的纹理和光照。在SimCol和EndoSLAM数据集上的广泛评估证实,我们的框架在内窥镜深度和姿态估计方面优于已发表的自监督方法。

英文摘要

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.

2411.11436 2026-06-02 cs.LG cs.AI 版本更新

Implicit Regularization for Multi-label Feature Selection

多标签特征选择的隐式正则化

Dou El Kefel Mansouri, Khalid Benabdeslem, Seif-Eddine Benkabou

AI总结 针对多标签学习中的特征选择问题,提出一种基于隐式正则化和标签嵌入的估计器,通过Hadamard积参数化避免显式正则化项的额外偏差,实验表明该方法可减少偏差并可能导致良性过拟合。

Comments 14 pages, 11 figures, Submitted for publication and currently under review

详情
AI中文摘要

本文通过使用一种基于隐式正则化和标签嵌入的新估计器,解决了多标签学习背景下的特征选择问题。与使用带有显式正则化项(如$l_{2,1}$-范数、MCP或SCAD)的惩罚估计器的稀疏特征选择方法不同,我们提出了一种通过Hadamard积参数化的简单替代方法。为了指导特征选择过程,采用了一种多标签信息潜在语义方法作为标签嵌入。在一些已知基准数据集上的实验结果表明,所提出的估计器遭受的额外偏差要小得多,并且可能导致良性过拟合。

英文摘要

In this paper, we address the problem of feature selection in the context of multi-label learning, by using a new estimator based on implicit regularization and label embedding. Unlike the sparse feature selection methods that use a penalized estimator with explicit regularization terms such as $l_{2,1}$-norm, MCP or SCAD, we propose a simple alternative method via Hadamard product parameterization. In order to guide the feature selection process, a latent semantic of multi-label information method is adopted, as a label embedding. Experimental results on some known benchmark datasets suggest that the proposed estimator suffers much less from extra bias, and may lead to benign overfitting.

2411.05196 2026-06-02 cs.AI cs.DL cs.LG 版本更新

Explainable AI Through a Democratic Lens: DhondtXAI for D'Hondt-Projected Feature Attribution

通过民主视角的可解释AI:用于D'Hondt投影特征归因的DhondtXAI

Turker Berk Donmez

发表机构 * Sakarya University of Applied Sciences(萨卡里亚应用科学大学)

AI总结 提出DhondtXAI,一种基于D'Hondt规则的独立于SHAP的表格数据可解释性框架,通过计算背景干预移除效应、分离正负证据、形成特征联盟并分配席位,实现特征归因,在合成数据和医疗数据集上验证了其与SHAP的高度一致性。

详情
AI中文摘要

本研究提出DhondtXAI,作为一种独立于SHAP、基于D'Hondt的表格可解释AI归因框架。DhondtXAI不依赖于模型原生特征重要性或SHAP值,而是计算背景干预移除效应,分离正负证据,形成可选的特征联盟,应用可选的阈值,通过D'Hondt规则分配席位,并投影到局部模型输出差异上。通过构造保持完整性,投影残差比作为诊断指标报告。该方法在合成加性和交互测试、相关特征扰动、算子和分配消融、投影模式比较、logit尺度检查、重复分割验证、配对删除测试以及两个医疗数据集(威斯康星诊断乳腺癌(CatBoost)和早期糖尿病风险预测(XGBoost))上进行了评估。SHAP仅作为外部比较器,设置对齐。在加性合成数据中,DhondtXAI精确恢复真实排名;在乘法交互中,联盟将平均投影残差从0.2527降至0.0001。在WDBC和糖尿病数据上,与SHAP高度一致(Spearman rho分别为0.9273和0.9353),并通过进一步的符号、top-k、幅度、删除和敏感性分析得到支持。结果表明,DhondtXAI是一种互补的比例性、联盟感知和阈值感知的表格可解释AI方法,而非SHAP或LIME的替代品。

英文摘要

This study presents DhondtXAI as a SHAP-independent, D'Hondt-based attribution framework for tabular XAI. Instead of model-native feature importance or SHAP values, DhondtXAI computes background-interventional removal effects, separates positive and negative evidence, forms optional feature alliances, applies optional thresholds, allocates seats via the D'Hondt rule, and projects onto the local model-output difference. Completeness is preserved by construction, with the projection residual ratio reported as a diagnostic. The method is evaluated on synthetic additive and interaction tests, correlated-feature perturbations, operator and apportionment ablations, projection-mode comparisons, logit-scale checks, repeated split validation, paired deletion tests, and two healthcare datasets: Wisconsin Diagnostic Breast Cancer (CatBoost) and early-stage diabetes risk prediction (XGBoost). SHAP serves only as an external comparator with aligned settings. In additive synthetics, DhondtXAI exactly recovers ground-truth rankings; in multiplicative interactions, alliances reduce the mean projection residual from 0.2527 to 0.0001. On WDBC and diabetes data, it shows high agreement with SHAP (Spearman rho = 0.9273 and 0.9353), supported by further signed, top-k, magnitude, deletion, and sensitivity analyses. Results position DhondtXAI as a complementary proportional, alliance-aware, and threshold-aware tabular XAI method, not a replacement for SHAP or LIME.

2411.05359 2026-06-02 cs.CV cs.AI cs.CY 版本更新

Agricultural Landscape Understanding At Country-Scale

国家级农业景观理解

Radhika Dua, Aditi Agarwal, Aishwarya Jayagopal, Depanshu Sani, Alex Wilson, Hoang Tran, Ishan Deshpande, Bogdan Floristean, Neelabh Goyal, Ramya Cheruvu, Vishal Batchu, Yan Mayster, Gaurav Aggarwal, Alok Talekar, Vaibhav Rajan

发表机构 * Google DeepMind(谷歌深Mind) Google(谷歌)

AI总结 提出首个国家级农业制图系统,通过新颖的后处理启发式方法实现田地、树木和水体的实例分割,并在全国范围内部署验证。

Comments 32 pages, 11 tables, 22 figs

详情
AI中文摘要

全面的农业景观理解对于应对粮食安全、气候变化和资源管理等全球挑战至关重要。这不仅需要绘制农田地图,还需要绘制树木和水体等重要特征,这些特征在主导全球南方的复杂 extit{小农户}系统中形成了错综复杂的镶嵌结构。以往开发此类土地利用地图的努力受到限制,仅专注于田地划界的方法,并且没有开发出实际部署所必需的稳健后处理步骤。此外,据我们所知,之前没有针对小农户农场的系统在国家范围内进行部署和评估。本文通过提出首个国家级农业制图系统来解决这些局限性,该系统超越了简单的田地划界,能够对田地、树木和水体等农业实例进行分割。我们的系统通过新颖的后处理启发式方法进行了优化,以确保地图的一致性和准确性,并通过严格、多方面的评估过程进行了验证。我们系统生成的精细土地利用地图可通过API在 extit{\href{http://agri.withgoogle.com}{http://agri.withgoogle.com}}公开访问,支持从精准农业和政策制定到推进全球可持续发展目标的各种应用。

英文摘要

Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resource management. This requires mapping not just crop fields, but also vital features like trees and water bodies which form an intricate mosaic in complex \textit{smallholder} systems dominating the Global South. Previous efforts to develop such land use maps have been limited by a narrow focus on methods for field delineation only, and also do not develop robust post-processing steps essential for real-world deployment. Further, to our knowledge, no prior system for smallholder farms has been deployed and evaluated at a national scale. This work addresses these limitations by presenting the first national-scale agricultural mapping system that moves beyond simple field delineation to enable segmentation of agricultural instances like fields, trees and water bodies. Our system is refined for real-world application using novel post-processing heuristics to ensure map consistency and accuracy, and is validated through a rigorous, multi-faceted evaluation process. Fine-grained land use maps generated by our system are publicly accessible via an API at \textit{\href{http://agri.withgoogle.com}{http://agri.withgoogle.com}}, enabling a wide range of applications from precision agriculture and policy-making to advancing global sustainability development goals.

2410.02511 2026-06-02 cs.AI cs.MA 版本更新

Stop Wandering, Find the Keys: LLMs Discriminate Key States for Efficient Multi-Agent Exploration

停止徘徊,找到关键:LLMs 辨别关键状态以实现高效多智能体探索

Yun Qu, Boyuan Wang, Yuhang Jiang, Jianzhun Shao, Yixiu Mao, Heming Zou, Chang Liu, Cheems Wang, Meiqin Liu, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University, Beijing 100084, China(清华大学自动化系) College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China(浙江大学电气工程学院) National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an 710049, China(西安交通大学人机混合增强智能国家级重点实验室)

AI总结 提出 LEMAE 方法,利用大语言模型辨别关键状态并设计子空间内在奖励和关键状态记忆树,引导多智能体高效探索,在 SMAC 和 MPE 基准上显著超越现有方法,实现 10 倍加速。

详情
Journal ref
SCIENCE CHINA Information Sciences 2026
AI中文摘要

在具有广阔状态-动作空间的情况下,高效的多智能体探索仍然是强化学习中一个长期存在的挑战。尽管追求新颖性、多样性或不确定性吸引了越来越多的关注,但在没有适当指导选择的情况下进行探索所带来的冗余努力,给该领域带来了一个实际问题。本文介绍了一种系统方法,称为 LEMAE,它选择从知识渊博的大语言模型(LLM)中引导信息丰富的任务相关指导,以实现高效的多智能体探索。具体来说,我们将 LLM 的语言知识以判别性的方式、以较低的 LLM 推理成本,转化为对任务完成至关重要的符号化关键状态。为了释放关键状态的力量,我们设计了基于子空间的回顾性内在奖励(SHIR),通过增加奖励密度来引导智能体朝向关键状态。此外,我们构建了关键状态记忆树(KSMT),以跟踪特定任务中关键状态之间的转换,从而实现有组织的探索。得益于减少冗余探索,LEMAE 在具有挑战性的基准测试(例如 SMAC 和 MPE)上以较大优势超越了现有的最先进方法,在某些场景中实现了 10 倍的加速。

英文摘要

With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, redundant efforts brought by exploration without proper guidance choices poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, choosing to channel informative task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from LLM into symbolic key states, that are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminishing redundant explorations, LEMAE outperforms existing SOTA approaches on the challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.

2409.19310 2026-06-02 cs.CR cs.AI 版本更新

Model X-Ray: Detection of Hidden Malware in AI Model Weights using Few Shot Learning

Model X-Ray: 使用少样本学习检测AI模型权重中的隐藏恶意软件

Daniel Gilkarov, Ran Dubin

发表机构 * Department of Computer Science and Ariel Cyber Innovation Center, Ariel University(计算机科学系和 Ariel 网络创新中心,阿里尔大学) Department of Computer and Software Engineering and Ariel Cyber Innovation Center, Ariel University(计算机与软件工程系和 Ariel 网络创新中心,阿里尔大学)

AI总结 本文提出一种基于少样本学习的AI模型恶意软件检测方法,通过将模型权重转换为图像表示,仅需6个训练样本即可检测低至6%嵌入率的隐蔽攻击,并展现出对新型扩频隐写攻击的鲁棒性。

详情
AI中文摘要

随着人工智能(AI)的快速发展和Model Zoo等共享AI模型平台的广泛使用,AI模型被利用的潜在风险增加。攻击者可以通过隐写技术将恶意软件嵌入AI模型中,利用这些模型庞大的体积隐藏恶意数据并用于恶意目的,例如远程代码执行。确保AI模型的安全性是一个新兴的研究领域,对于保护依赖AI技术的众多组织和用户至关重要。本研究利用成熟的图像少样本学习技术,通过一种新颖的图像表示将AI模型转换到图像领域。在该领域应用少样本学习使我们能够创建实用的模型,这是先前工作所缺乏的。我们的方法解决了现有最先进检测技术中阻碍其实用性的关键限制。该方法将所需的训练数据集大小从40000个模型减少到仅6个。此外,我们的方法能够持续检测嵌入率低至25%甚至在某些情况下低至6%的隐蔽攻击,而先前的工作仅被证明对100%-50%的嵌入率有效。我们采用严格的评估策略,确保训练后的模型在各种因素下具有泛化能力。此外,我们展示了训练后的模型成功检测到新型扩频隐写攻击,仅通过学习一种攻击类型就证明了模型令人印象深刻的鲁棒性。我们开源代码以支持可重复性并促进这一新领域的研究。

英文摘要

The potential for exploitation of AI models has increased due to the rapid advancement of Artificial Intelligence (AI) and the widespread use of platforms like Model Zoo for sharing AI models. Attackers can embed malware within AI models through steganographic techniques, taking advantage of the substantial size of these models to conceal malicious data and use it for nefarious purposes, e.g. Remote Code Execution. Ensuring the security of AI models is a burgeoning area of research essential for safeguarding the multitude of organizations and users relying on AI technologies. This study leverages well-studied image few-shot learning techniques by transferring the AI models to the image field using a novel image representation. Applying few-shot learning in this field enables us to create practical models, a feat that previous works lack. Our method addresses critical limitations in state-of-the-art detection techniques that hinder their practicality. This approach reduces the required training dataset size from 40000 models to just 6. Furthermore, our methods consistently detect delicate attacks of up to 25% embedding rate and even up to 6% in some cases, while previous works were only shown to be effective for a 100%-50% embedding rate. We employ a strict evaluation strategy to ensure the trained models are generic concerning various factors. In addition, we show that our trained models successfully detect novel spread-spectrum steganography attacks, demonstrating the models' impressive robustness just by learning one type of attack. We open-source our code to support reproducibility and enhance the research in this new field.

2401.17010 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Finetuning Large Language Models for Vulnerability Detection

微调大型语言模型用于漏洞检测

Alexey Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Anton Cheshkov, Pavel Zadorozhny

发表机构 * Sber AI Lab(Sber AI实验室) Huawei Russian Research Institute(华为俄罗斯研究院) Satbayev University(萨特拜耶夫大学)

AI总结 本文通过微调WizardCoder模型,优化训练流程并处理类别不平衡,在漏洞检测任务上提升了ROC AUC和F1指标,展示了预训练LLM在源代码分析中的迁移学习潜力。

详情
AI中文摘要

本文介绍了微调大型语言模型(LLMs)用于检测源代码中漏洞的结果。我们利用WizardCoder(最新改进的先进LLM StarCoder),并通过进一步微调使其适应漏洞检测。为加速训练,我们修改了WizardCoder的训练过程,并研究了最优训练方案。针对负样本远多于正样本的不平衡数据集,我们还探索了不同技术以提升分类性能。微调后的WizardCoder模型在平衡和不平衡的漏洞数据集上,相比于CodeBERT类模型,在ROC AUC和F1指标上均有提升,证明了将预训练LLM用于源代码漏洞检测的有效性。关键贡献包括:微调先进的代码LLM WizardCoder、在不损害性能的前提下提高其训练速度、优化训练流程和方案、处理类别不平衡,以及在困难的漏洞检测数据集上提升性能。这展示了通过微调大型预训练语言模型进行专门源代码分析任务的迁移学习潜力。

英文摘要

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, also we investigate optimal training regimes. For the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. The finetuned WizardCoder model achieves improvement in ROC AUC and F1 measures on balanced and imbalanced vulnerability datasets over CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM, WizardCoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.

2407.15510 2026-06-02 cs.AI cs.DM cs.LO cs.SC 版本更新

Algebraic anti-unification

代数反统一

Christian Antić

发表机构 * Vienna University of Technology(维也纳技术大学)

AI总结 本文在泛代数的一般框架下提出代数反统一理论,通过引入代数泛化序和最小泛化概念,建立基本结构性质,并利用自动机理论研究有限一元代数和有限代数中的可计算性。

详情
AI中文摘要

抽象是人类和人工智能的关键,因为它允许人们识别原本不同对象或情境中的共同结构。反统一(或泛化)是理论计算机科学和人工智能中研究抽象的分支,已在归纳逻辑编程、程序综合和类比推理等领域得到应用。迄今为止,反统一几乎完全从语法角度进行研究。在本文中,我们在泛代数的一般框架下开创了反统一的代数(即语义)理论,从而将反统一从基于项的表示扩展到任意代数,并超越等式理论。特别地,我们引入了代数泛化序和最小泛化泛化的概念,建立了基本结构性质,证明了与同态和同构的兼容性,并通过自动机理论方法研究了有限一元代数和有限代数中的可计算性。

英文摘要

Abstraction is key to human and artificial intelligence as it allows one to identify common structure in otherwise distinct objects or situations. Anti-unification (or generalization) is the branch of theoretical computer science and artificial intelligence that studies abstraction and has found applications in areas such as inductive logic programming, program synthesis, and analogy-making. To date, anti-unification has been studied almost exclusively from a syntactic perspective. In this paper, we initiate an algebraic (i.e.\ semantic) theory of anti-unification in the general setting of universal algebra, thereby extending anti-unification from term-based representations to arbitrary algebras and beyond equational theories. In particular, we introduce the notions of algebraic generalization ordering and minimally general generalization, establish basic structural properties, prove compatibility with homomorphisms and isomorphisms, and investigate computability in finite unary algebras and finite algebras via automata-theoretic methods.

2307.05213 2026-06-02 cs.LG cs.AI 版本更新

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

评分函数梯度估计以拓宽决策聚焦学习的适用性

Mattia Silvestri, Senne Berden, Jayanta Mandi, Ali İrfan Mahmutoğulları, Brandon Amos, Tias Guns, Michele Lombardi

发表机构 * University of Bologna(博洛尼亚大学) KU Leuven(鲁汶大学) Meta

AI总结 提出一种结合随机平滑与评分函数梯度估计的方法,无需对问题结构做特定假设,即可将决策聚焦学习扩展到非线性目标、约束中不确定参数及两阶段随机优化问题。

详情
Journal ref
Silvestri, Mattia, et al. "Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning." Journal of Artificial Intelligence Research 85 (2026)
AI中文摘要

许多现实世界的优化问题包含在部署前未知的参数,这是由于随机性或信息缺乏(例如,配送问题中的需求或旅行时间)。在这种情况下,常见的策略是通过机器学习(ML)模型估计所述参数,这些模型以最小化预测误差为目标进行训练,然而这并不一定与下游任务级误差一致。决策聚焦学习(DFL)范式通过直接最小化任务损失(例如遗憾)来克服这一限制。由于后者对于组合问题具有非信息性梯度,最先进的DFL方法引入了能够实现训练的替代和近似。但这些方法利用了关于问题结构的特定假设(例如,凸或线性问题,仅在目标函数中的未知参数)。我们提出了一种替代方法,该方法不做此类假设,它结合了随机平滑与评分函数梯度估计,适用于任何任务损失。这为将DFL方法应用于非线性目标、问题约束中的不确定参数,甚至两阶段随机优化打开了大门。实验表明,它通常需要更多的训练周期,但在解决方案质量、可扩展性或两者方面,与专门方法相当,并且在约束中存在不确定性的困难情况下表现尤为出色。

英文摘要

Many real-world optimization problems contain parameters that are unknown before deployment time, either due to stochasticity or to lack of information (e.g., demand or travel times in delivery problems). A common strategy in such cases is to estimate said parameters via machine learning (ML) models trained to minimize the prediction error, which however is not necessarily aligned with the downstream task-level error. The decision-focused learning (DFL) paradigm overcomes this limitation by training to directly minimize a task loss, e.g. regret. Since the latter has non-informative gradients for combinatorial problems, state-of-the-art DFL methods introduce surrogates and approximations that enable training. But these methods exploit specific assumptions about the problem structures (e.g., convex or linear problems, unknown parameters only in the objective function). We propose an alternative method that makes no such assumptions, it combines stochastic smoothing with score function gradient estimation which works on any task loss. This opens up the use of DFL methods to nonlinear objectives, uncertain parameters in the problem constraints, and even two-stage stochastic optimization. Experiments show that it typically requires more epochs, but that it is on par with specialized methods and performs especially well for the difficult case of problems with uncertainty in the constraints, in terms of solution quality, scalability, or both.

2403.07008 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

AutoEval Done Right: Using Synthetic Data for Model Evaluation

AutoEval 的正确做法:使用合成数据进行模型评估

Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA(电子工程与计算机科学系,加州大学伯克利分校) Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel(系统免疫学系,魏茨曼科学研究所) Inria, Ecole Normale Supérieure, Paris, France(法国国家信息与自动化技术研究所,巴黎高等师范学院)

AI总结 本文提出高效且统计上无偏的算法,利用AI标记的合成数据减少模型评估所需的人工标注量,在GPT-4实验中有效样本量提升高达50%。

Comments camera-ready paper version

详情
AI中文摘要

使用人工标注的验证数据评估机器学习模型可能成本高昂且耗时。AI标记的合成数据可用于减少此目的所需的人工标注数量,这一过程称为自动评估。我们为此提出了高效且统计上无偏的算法,在保持无偏性的同时提高样本效率。这些算法在GPT-4实验中使有效人工标注样本量增加高达50%。

英文摘要

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.

2307.06647 2026-06-02 cs.RO cs.AI cs.CV 版本更新

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

DeepIPCv2: 基于LiDAR的鲁棒环境感知与自动驾驶导航控制

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada(计算机科学与电子系,加查马达大学) Department of Computer Science and Engineering, Toyohashi University of Technology(计算机科学与工程系,toyohashi技术大学)

AI总结 提出DeepIPCv2端到端自动驾驶框架,通过融合LiDAR点云分割与多视图投影构建鲁棒场景表示,结合门控循环单元、命令特定多层感知器和PID控制器实现路径点与导航控制命令的联合估计,在光照变化下取得最低总指标误差和最少驾驶干预。

Comments This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052

详情
AI中文摘要

我们提出DeepIPCv2,一个端到端的自动驾驶框架,它集成了基于LiDAR的环境感知与命令特定的控制学习。与先前依赖摄像头的模型不同,DeepIPCv2采用点云分割和多视图投影来构建鲁棒的场景表示。这些特征通过门控循环单元、命令特定的多层感知器和PID控制器的组合进行融合和解码,以估计路径点和导航控制命令。这种设计增强了机动性并解决了驾驶数据集中的动作不平衡问题。为了验证模型,我们构建了一个覆盖不同光照条件的数据集,并进行了消融研究和与包括TransFuser在内的最新方法的对比测试。结果表明,DeepIPCv2实现了最低的总指标误差和最少的驾驶干预,突显了其对光照变化的鲁棒性和改进的控制精度。通过稍后在https://github.com/oskarnatan/DeepIPCv2发布代码,我们旨在支持端到端自动驾驶研究的可重复性和未来进展。

英文摘要

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. By releasing the codes at https://github.com/oskarnatan/DeepIPCv2 later, we aim to support reproducibility and future advancements in end-to-end autonomous driving research.

2310.15676 2026-06-02 cs.CV cs.AI 版本更新

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

多模态3D智能的最新进展:综合调查与评估

Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang, Yang Yang

发表机构 * College of Electronics and Information Engineering, Sichuan University(四川大学电子信息工程学院) School of Computer Science, University of Adelaide(阿德莱德大学计算机科学学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 本文系统综述了多模态3D智能方法,提出基于模态和任务的新分类法,并比较了基准数据集上的结果,最后讨论了未来研究方向。

详情
AI中文摘要

多模态3D智能因其在自动驾驶和世界模拟等领域的广泛应用而受到广泛关注。与传统的单模态3D理解相比,引入额外模态不仅提升了场景解释的丰富性和精确性,还为更高层次的物理世界交互奠定了基础。在仅依赖3D数据可能不足的多样化和挑战性环境中,这一点变得尤为关键。尽管过去六年中多模态3D方法的发展激增,特别是那些整合多相机图像(3D+2D)和文本描述(3D+语言)的方法,但缺乏全面深入的综述。在本文中,我们通过系统调查最新进展来弥补这一空白。我们首先简要总结了各种3D多模态任务中的独特挑战。之后,我们提出了一种新的分类法,根据模态和任务对现有方法进行彻底分类,探讨它们各自的优势和局限性。此外,我们提供了近期方法在几个基准数据集上的比较结果及深入分析。最后,我们讨论了未解决的问题,并提出了未来研究的几个潜在方向。

英文摘要

Multi-modal 3D Intelligence has gained considerable attention due to its wide applications in autonomous driving and world simulation, etc. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also provides a foundation for higher-level physical world interaction. This becomes especially crucial in varied and challenging environments where solely relying on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past six years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this paper, we present a systematic survey of recent progress to bridge this gap. We begin by briefly summarizing the unique challenges among various 3D multi-modal tasks. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.

2309.15946 2026-06-02 cs.LG cs.AI cs.NE math.DS 版本更新

Unified Long-Term Time-Series Forecasting Benchmark

统一长期时间序列预测基准

Jacek Cyranka, Szymon Haponiuk

发表机构 * Institute of Informatics(信息学院)

AI总结 提出一个专为长期时间序列预测设计的综合数据集,通过标准化轨迹和多种模型基准测试,发现模型效果依赖于数据集,并引入改进的潜在NLinear和课程学习DeepAR模型。

详情
AI中文摘要

为了支持时间序列数据预测的机器学习方法的发展,我们提出了一个明确针对长期时间序列预测设计的综合数据集。我们整合了来自多种动态系统和真实记录的数据集集合。每个数据集通过将数据划分为具有预定回溯长度的训练和测试轨迹进行标准化。我们包含长度高达$2000$的轨迹,以确保对长期预测能力的可靠评估。为了确定在不同场景中最有效的模型,我们使用经典和最先进的模型(即LSTM、DeepAR、NLinear、N-Hits、PatchTST和LatentODE)进行了广泛的基准分析。我们的研究结果揭示了这些模型之间有趣的性能比较,突出了模型有效性的数据集依赖性。值得注意的是,我们引入了一个自定义的潜在NLinear模型,并通过课程学习阶段增强了DeepAR。两者都持续优于其原始版本。

英文摘要

In order to support the advancement of machine learning methods for predicting time-series data, we present a comprehensive dataset designed explicitly for long-term time-series forecasting. We incorporate a collection of datasets obtained from diverse, dynamic systems and real-life records. Each dataset is standardized by dividing it into training and test trajectories with predetermined lookback lengths. We include trajectories of length up to $2000$ to ensure a reliable evaluation of long-term forecasting capabilities. To determine the most effective model in diverse scenarios, we conduct an extensive benchmarking analysis using classical and state-of-the-art models, namely LSTM, DeepAR, NLinear, N-Hits, PatchTST, and LatentODE. Our findings reveal intriguing performance comparisons among these models, highlighting the dataset-dependent nature of model effectiveness. Notably, we introduce a custom latent NLinear model and enhance DeepAR with a curriculum learning phase. Both consistently outperform their vanilla counterparts.

2212.06751 2026-06-02 cs.LG cs.AI 版本更新

Speeding Up Multi-Objective Hyperparameter Optimization by Task Similarity-Based Meta-Learning for the Tree-Structured Parzen Estimator

基于任务相似性元学习加速多目标超参数优化的树形结构Parzen估计器

Shuhei Watanabe, Noor Awad, Masaki Onishi, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系) Artificial Intelligence Research Center, AIST, Tokyo, Japan(日本科学技术厅人工智能研究中心)

AI总结 提出利用任务间顶级域重叠定义的任务相似性扩展TPE采集函数到元学习设置,加速多目标超参数优化,理论分析并解决相似性局限,实验证明在表格HPO基准上达到最优性能并赢得AutoML 2022竞赛。

Comments Accpeted to IJCAI 2023

详情
AI中文摘要

超参数优化(HPO)是提升深度学习性能的关键步骤。实践者常面临多个指标间的权衡,如准确率和延迟。鉴于深度学习的高计算需求以及对高效HPO日益增长的需求,加速多目标优化变得愈发重要。尽管已有大量关于元学习用于HPO的工作,但现有方法不适用于多目标树形结构Parzen估计器(MO-TPE),这是一种简单而强大的多目标HPO算法。在本文中,我们利用任务间顶级域重叠定义的任务相似性,将TPE的采集函数扩展到元学习设置。我们还从理论上分析并解决了任务相似性的局限性。实验中,我们证明了该方法在表格HPO基准上加速了MO-TPE,并达到了最先进的性能。我们的方法还通过赢得AutoML 2022“Transformer多目标超参数优化”竞赛得到了外部验证。

英文摘要

Hyperparameter optimization (HPO) is a vital step in improving performance in deep learning (DL). Practitioners are often faced with the trade-off between multiple criteria, such as accuracy and latency. Given the high computational needs of DL and the growing demand for efficient HPO, the acceleration of multi-objective (MO) optimization becomes ever more important. Despite the significant body of work on meta-learning for HPO, existing methods are inapplicable to MO tree-structured Parzen estimator (MO-TPE), a simple yet powerful MO-HPO algorithm. In this paper, we extend TPE's acquisition function to the meta-learning setting using a task similarity defined by the overlap of top domains between tasks. We also theoretically analyze and address the limitations of our task similarity. In the experiments, we demonstrate that our method speeds up MO-TPE on tabular HPO benchmarks and attains state-of-the-art performance. Our method was also validated externally by winning the AutoML 2022 competition on "Multiobjective Hyperparameter Optimization for Transformers".

2211.14411 2026-06-02 cs.LG cs.AI 版本更新

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

c-TPE: 带不等式约束的树结构Parzen估计器用于昂贵的超参数优化

Shuhei Watanabe, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg(弗赖堡大学计算机科学系)

AI总结 提出c-TPE方法,通过修改TPE的采样和模型以处理不等式约束,在81个昂贵HPO问题上取得最佳平均排名性能。

Comments Accepted to IJCAI 2023

详情
AI中文摘要

超参数优化(HPO)对于深度学习算法的强性能至关重要,而实际应用通常在性能要求之上施加一些约束,例如内存使用或延迟。在这项工作中,我们提出了约束TPE(c-TPE),这是广泛使用的通用贝叶斯优化方法——树结构Parzen估计器(TPE)的扩展,以处理这些约束。我们提出的扩展不仅仅是现有采集函数和原始TPE的简单组合,而是包括解决导致性能不佳问题的修改。我们通过实验和理论彻底分析了这些修改,提供了关于它们如何有效克服这些挑战的见解。在实验中,我们证明c-TPE在81个带不等式约束的昂贵HPO问题上,以统计显著性在现有方法中表现出最佳平均排名性能。由于缺乏基线,我们仅在附录D中讨论了我们方法对硬约束优化的适用性。该实现现在可通过OptunaHub获得。

英文摘要

Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose some constraints, such as on memory usage or latency, on top of the performance requirement. In this work, we propose constrained TPE (c-TPE), an extension of the widely-used versatile Bayesian optimization method, tree-structured Parzen estimator (TPE), to handle these constraints. Our proposed extension goes beyond a simple combination of an existing acquisition function and the original TPE, and instead includes modifications that address issues that cause poor performance. We thoroughly analyze these modifications both empirically and theoretically, providing insights into how they effectively overcome these challenges. In the experiments, we demonstrate that c-TPE exhibits the best average rank performance among existing methods with statistical significance on $81$ expensive HPO problems with inequality constraints. Due to the lack of baselines, we only discuss the applicability of our method to hard-constrained optimization in Appendix D. The implementation is now available via OptunaHub.

2301.06308 2026-06-02 cs.LG cs.AI 版本更新

Stability Analysis of Sharpness-Aware Minimization

锐度感知最小化的稳定性分析

Hoki Kim, Jinseong Park, Yujin Choi, Jaewook Lee

发表机构 * Chung-Ang University, South Korea(Chung-Ang 大学,韩国) Korea Institute for Advanced Study, South Korea(韩国高级研究院) Ulsan National Institute of Science(乌山国家科学研究院) Nanyang Technological University (NTU), Singapore(南洋理工大学(NTU),新加坡) Seoul National University, South Korea(首尔国立大学,韩国)

AI总结 研究SAM在鞍点附近的收敛不稳定性,通过动力系统理论证明鞍点成为吸引子,并发现动量与批次大小可缓解该问题。

Comments Accepted to ICML 2026

详情
AI中文摘要

锐度感知最小化(SAM)是一种训练方法,旨在寻找深度学习中的平坦最小值,从而在各个领域取得最先进的性能。SAM不是最小化当前权重的损失,而是最小化参数空间中其邻域内的最坏情况损失。在本文中,我们研究了SAM在鞍点附近的收敛不稳定性。利用动力系统的定性理论,我们解释了SAM如何陷入鞍点,并从理论上证明了在SAM动力学下鞍点可以成为吸引子。此外,通过建立SAM的扩散,我们证明了这种收敛不稳定性也可能发生在随机动力系统中。我们证明,在逃离鞍点方面,SAM扩散比普通梯度下降更差。最后,我们展示了经常被忽视的训练技巧——动量和批次大小——可能对缓解收敛不稳定性和实现高泛化性能很重要。我们的理论和实证结果通过几个著名的优化问题和基准任务的实验得到了充分验证。

英文摘要

Sharpness-aware minimization (SAM) is a training method that seeks to find flat minima in deep learning, resulting in state-of-the-art performance across various domains. Instead of minimizing the loss of the current weights, SAM minimizes the worst-case loss in its neighborhood in the parameter space. In this paper, we investigate the convergence instability of SAM near a saddle point. Using the qualitative theory of dynamical systems, we explain how SAM becomes stuck in the saddle point and theoretically prove that the saddle point can become an attractor under SAM dynamics. Additionally, we show that this convergence instability can also occur in stochastic dynamical systems by establishing the diffusion of SAM. We prove that SAM diffusion is worse than that of vanilla gradient descent in terms of saddle point escape. Finally, we demonstrate that often overlooked training tricks, momentum and batch-size, might be important to mitigate the convergence instability and achieve high generalization performance. Our theoretical and empirical results are thoroughly verified through experiments on several well-known optimization problems and benchmark tasks.

2208.12389 2026-06-02 cs.LG cs.AI 版本更新

Static Seeding and Clustering of LSTM Embeddings to Learn from Loosely Time-Decoupled Events

LSTM嵌入的静态播种与聚类以从松散时间解耦事件中学习

Christian Manasseh, Razvan Veliche, Jared Bennett, Hamilton Clouse

发表机构 * Air Force Research Lab (AFRL) Autonomy Capability Team 3 (ACT3)(美国空军研究实验室(AFRL)自主能力团队3(ACT3))

AI总结 提出通过静态数据播种LSTM生成嵌入并聚类,以改进松散时间解耦时间序列预测,在COVID-19县级病例预测中提升10日移动平均精度。

详情
AI中文摘要

人类从不同时间和地点发生的事件中学习,以预测相似的事件轨迹。我们将松散解耦时间序列(LDT)现象定义为两个或多个可能发生在不同地点和不同时间线上,但在事件性质和位置属性上具有相似性的事件。在这项工作中,我们改进了循环神经网络(RNN),特别是长短期记忆(LSTM)网络的使用,以使AI解决方案能够为LDT生成更好的时间序列预测。我们基于趋势使用时间序列之间的相似性度量,并引入表示这些趋势的嵌入。嵌入表示事件的属性,与LSTM结构结合,可以聚类以识别相似的、时间上未对齐的事件。在本文中,我们探索了从与LSTM建模的地球物理和人口现象相关的时间不变数据中播种多变量LSTM的方法。我们将这些方法应用于从COVID-19检测感染和死亡病例中得出的时间序列数据。我们使用公开的社会经济数据来播种LSTM模型,创建嵌入,以确定这种播种是否改善了病例预测。这些LSTM产生的嵌入被聚类,以识别用于预测演变时间序列的最佳匹配候选。应用这种方法,我们在美国县级疾病传播的10日移动平均预测中显示出改进。

英文摘要

Humans learn from the occurrence of events in a different place and time to predict similar trajectories of events. We define Loosely Decoupled Timeseries (LDT) phenomena as two or more events that could happen in different places and across different timelines but share similarities in the nature of the event and the properties of the location. In this work we improve on the use of Recurring Neural Networks (RNN), in particular Long Short-Term Memory (LSTM) networks, to enable AI solutions that generate better timeseries predictions for LDT. We use similarity measures between timeseries based on the trends and introduce embeddings representing those trends. The embeddings represent properties of the event which, coupled with the LSTM structure, can be clustered to identify similar temporally unaligned events. In this paper, we explore methods of seeding a multivariate LSTM from time-invariant data related to the geophysical and demographic phenomena being modeled by the LSTM. We apply these methods on the timeseries data derived from the COVID-19 detected infection and death cases. We use publicly available socio-economic data to seed the LSTM models, creating embeddings, to determine whether such seeding improves case predictions. The embeddings produced by these LSTMs are clustered to identify best-matching candidates for forecasting an evolving timeseries. Applying this method, we show an improvement in 10-day moving average predictions of disease propagation at the US County level.