arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.03644 2026-06-03 cs.AI Updated version

AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse

Jie Ou, Jinyu Guo, Shiyao Guo, Yuang Li, Ruiqi Wu, Zhaokun Wang, Wenyi Li, Wenhong Tian

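Affiliation: School of Information and Software Engineering, University of Electronic Science and Technology of China

AI summary: Proposes AdapShot, which dynamically optimizes the number of shots through an entropy-based probing mechanism and combines it with a semantics-aware KV cache reuse strategy for efficient many-shot in-context learning, with an average performance gain of about 10% and a 4.64x speedup over DBSA.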

Abstract

Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of Large Language Models (LLMs). However, existing methods typically rely on a predetermined, fixed number of shots. This static approach often fails to adapt to the varying difficulty of different queries, leading to either insufficient context or interference from noise. Furthermore, the prohibitive computational and memory costs of long contexts severely limit Many-Shot's feasibility. To address the above limitations, we propose AdapShot, which dynamically optimizes shot counts and leverages KV cache reuse for efficient inference. Specifically, we design a probe-based evaluation mechanism that utilizes output entropy to determine the optimal number of shots. To bypass the redundant prefilling computation during both the probing and inference phases, we incorporate a semantics-aware KV cache reuse strategy. Within this reuse strategy, to address positional encoding incompatibilities, we introduce a decoupling and re-encoding method that enables the flexible reordering of cached key-value pairs. Extensive experiments demonstrate that AdapShot achieves an average performance gain of around 10% and a 4.64x speedup compared to state-of-the-art DBSA.
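The entropy-based probe described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate shot count has already produced an output distribution for the query and that the lowest-entropy candidate is preferred; the function names, the input format, and the selection rule are illustrative assumptions, not the paper's actual procedure.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_shot_count(probe_outputs):
    """Return the candidate shot count whose probe output has the lowest entropy.

    probe_outputs maps a candidate shot count to the model's output
    distribution under that many demonstrations (hypothetical format).
    """
    return min(probe_outputs, key=lambda k: shannon_entropy(probe_outputs[k]))

# Toy distributions over three answer options (made-up numbers).
probes = {
    8: [0.40, 0.35, 0.25],
    32: [0.80, 0.15, 0.05],
    128: [0.55, 0.30, 0.15],
}
best = pick_shot_count(probes)
```

In this toy case the 32-shot probe is the most confident, so it would be selected. A real system would also need to weigh the prefill cost of each probe, which is where the KV cache reuse described above comes in.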

2605.02488 2026-06-03 cs.AI cs.DB cs.LO Updated version

Efficient Temporal Datalog Materialisation for Composite Event Recognition

Periklis Mantenoglou

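Affiliation: Örebro University, Sweden

AI summary: To support the detection of critical situations over high-velocity event streams, maps practical fragments of prominent event specification languages into Temporal Datalog->- and proposes Streaming Trigger Graphs, an extension of a Datalog materialisation technique, yielding a uniform composite event recognition mechanism.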

Abstract

Several applications demand the timely detection of critical situations, such as threats to safety and transparency, over high-velocity streams of symbolic events. This demand has motivated the development of (i) event specification languages, which define composite events via temporal patterns over simpler events, and (ii) stream reasoning frameworks, evaluating patterns expressed in these languages. However, event specification languages are typically studied in isolation, complicating their comparison in terms of expressivity and obscuring the scope of their associated stream reasoners. To mitigate this issue, we map practical fragments of prominent event specification languages into Temporal Datalog->-, a temporal Datalog with stratified negation and no future dependencies. To support efficient stream reasoning over Temporal Datalog->-, we propose Streaming Trigger Graphs, an extension of a state-of-the-art technique for Datalog materialisation. Our approach yields a uniform composite event recognition mechanism that has the potential to generalise across a wide range of practical event specification languages.

2606.03116 2026-06-03 eess.AS cs.AI cs.SD Updated version

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

Haitao Li, Tian Tan, Yuguang Yang, Shan Yang, Xie Chen

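Affiliations: Zhejiang University; Shanghai Innovation Institute; Shanghai Jiao Tong University; Tencent Hunyuan

AI summary: Addresses three weaknesses in evaluating instruction-guided audio generation: difficulty decoupling complex instructions, lack of interpretability, and failure to capture fine-grained attribute mismatches. Proposes a dynamic rubric-based evaluation paradigm that adaptively decomposes audio captions into verifiable binary rubric items, builds a bilingual benchmark of 7,920 samples and a 105K-sample training corpus, and trains a dedicated evaluator with SFT and GRPO, improving zero-shot alignment detection and instruction alignment in downstream reinforcement learning.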

Abstract

The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.
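To make the idea of independent binary rubric items concrete, the sketch below aggregates per-item verdicts into a single score. The rubric items are invented, and taking the unweighted fraction of satisfied items is an assumption; the abstract does not say how AnyAudio-Judge combines item verdicts.

```python
def rubric_score(verdicts):
    """Fraction of binary rubric items judged as satisfied."""
    if not verdicts:
        raise ValueError("at least one rubric item is required")
    return sum(1 for ok in verdicts.values() if ok) / len(verdicts)

# Hypothetical rubric derived from one audio caption.
verdicts = {
    "female speaker": True,
    "speaks in a whisper": False,
    "rain in the background": True,
    "piano enters after the speech": True,
}
score = rubric_score(verdicts)
```

Because each item is checked on its own, a low score can be traced back to the specific attribute that failed, which a single holistic score does not allow.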

2606.02661 2026-06-03 eess.IV cs.AI cs.LG Updated version

Learning to Refine: Spectral-Decoupled Iterative Refinement Framework for Precipitation Nowcasting

Yunlong Zhou, Chen Zhao, Danyang Peng, Fanfan Ji, Xiao-Tong Yuan

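Affiliation: National University of Singapore

AI summary: Proposes the Spectral-Decoupled Iterative Refinement framework (SDIR), a deterministic approach to precipitation nowcasting that combines a dual-path design (SFG-Former and FR-Refiner) with a Physically Consistent Power Spectral Density loss to perform progressive frequency-decoupled refinement. It targets both blurring and hallucinations, outperforming existing methods in spatial accuracy while reaching spectral fidelity competitive with diffusion-based methods.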

Comments 21 pages, 10 figures, accepted at ICML 2026

Abstract

Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade-off: regression models produce over-smoothed, spectrally decaying predictions that blur convective details and violate turbulence power laws; diffusion models generate realistic yet unanchored hallucinations lacking physical grounding. We propose Spectral-Decoupled Iterative Refinement (SDIR), a deterministic framework that reformulates nowcasting as progressive frequency-decoupled refinement. SDIR first extracts a stable low-frequency synoptic skeleton, then iteratively refines high-frequency textures under physical constraints, eliminating both blurring and hallucinations. It features a dual-path design: the Synoptic Frequency-Guided Former (SFG-Former) with Scale-Adaptive Transformers for global structure, and the Fourier Residual Refiner (FR-Refiner) with Scale-Conditioned Fourier Neural Operators for fine residuals. A Physically Consistent Power Spectral Density (PCPSD) loss with dynamic masking enforces a turbulence-consistent spectral distribution. Experiments on three benchmarks show SDIR significantly outperforms SOTA methods in spatial accuracy while achieving spectral fidelity competitive with diffusion-based methods, enabling reliable high-resolution operational nowcasting. Code link: https://github.com/RuntimeWarning/SDIR.
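The low-frequency skeleton plus high-frequency residual idea can be shown on a 1D toy signal. The sketch below is a generic Fourier band split written with a naive DFT; it is not SDIR's architecture or loss, and the cutoff and test signal are arbitrary assumptions.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    """Inverse DFT, returning the real part of each sample."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, cutoff):
    """Split x into low- and high-frequency parts at DFT index `cutoff`."""
    n = len(x)
    spectrum = dft(x)
    # Bins k and n - k carry the same physical frequency, so keep both.
    low = [spectrum[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high = [spectrum[k] - low[k] for k in range(n)]
    return idft(low), idft(high)

n = 32
signal = [math.sin(2 * math.pi * t / n) + 0.3 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
low, high = split_bands(signal, cutoff=3)
```

Here `low` recovers the slow sine and `high` the fast one, and the two parts sum back to the original signal.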

2606.02642 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.MM cs.SD Updated version

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

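Affiliation: KAIST

AI summary: Targets speech-vision hallucination in audio-visual large language models. Proposes the SVHalluc benchmark, which evaluates along semantic and temporal dimensions whether models can align speech content with visual signals, and finds that existing models are limited in cross-modal understanding.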

Comments Accepted at CVPR 2026

Abstract

Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

2606.02639 2026-06-03 eess.IV cs.AI cs.CV Updated version

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

Spoorthi M, Suja Palaniswamy

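Affiliation: Amrita University

AI summary: Identifies and resolves a failure caused by TensoRF's default density shift when applied to X-ray attenuation fields, and proposes AReT, an anatomy-regularized tensorial radiance field framework that reconstructs lung nodule volumes stably from only three orthogonal X-ray projections. On LIDC-IDRI it reaches Pearson r=0.983 against radiologist consensus for nodules >=10 mm.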

Abstract

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
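A regularizer combining L1 sparsity with total variation smoothness can be sketched on a small voxel grid. The anisotropic form of total variation and the two weights below are assumptions chosen for illustration; the abstract does not give AReT's exact formulation or its anatomy-aware terms.

```python
def l1_norm(vol):
    """Sum of absolute voxel values (encourages a sparse volume)."""
    return sum(abs(v) for plane in vol for row in plane for v in row)

def total_variation(vol):
    """Anisotropic TV: sum of absolute differences between axis neighbours."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    tv = 0.0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                v = vol[z][y][x]
                if z + 1 < nz:
                    tv += abs(vol[z + 1][y][x] - v)
                if y + 1 < ny:
                    tv += abs(vol[z][y + 1][x] - v)
                if x + 1 < nx:
                    tv += abs(vol[z][y][x + 1] - v)
    return tv

def regulariser(vol, lam_l1=1e-3, lam_tv=1e-2):
    """Weighted sum of both penalties (weights are hypothetical)."""
    return lam_l1 * l1_norm(vol) + lam_tv * total_variation(vol)

# 2x2x2 toy volume with a single dense voxel.
vol = [[[1.0, 0.0], [0.0, 0.0]],
       [[0.0, 0.0], [0.0, 0.0]]]
penalty = regulariser(vol)
```

The lone voxel contributes 1 to the L1 term and 3 to the TV term, one for each of its neighbours inside the grid.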

2606.02634 2026-06-03 eess.IV cs.AI Updated version

Echo-POSED: Geometric Self-Distillation for Echocardiography Guidance

Elias Stenhede, Edvart Grüner Bjerke, Joanna Sulkowska, Eivind Bjørkan Orstad, Ole Jakob Elle, Ulysse Côté-Allard, Arian Ranjbar

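AI summary: Proposes Echo-POSED, a self-supervised framework for real-time transthoracic echocardiography guidance that trains on 2D views sliced from 3D echocardiography volumes, with no need for expert-labelled views or tracked probe trajectories. It is equivariant to probe motion, yields a pose representation on SO(3)×SO(3), and reaches a combined mean angular error of 8.2 degrees in intra-patient guidance simulations with cardiac motion.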

Abstract

We introduce Echo-POSED, a self-supervised framework for real-time transthoracic echocardiography (TTE) guidance that recommends probe adjustments directly from 2D ultrasound images, without the need for expert-labelled views or tracked probe trajectories. Instead, it trains on 2D views sliced from routinely acquired 3D echocardiography volumes, enforcing equivariance to probe motions while remaining invariant to cardiac phase, yielding a pose representation on $\mathrm{SO}(3)\times\mathrm{SO}(3)$. Across a held-out split and public external 3D-TTE datasets (including vendor shift), Echo-POSED maintains geometric consistency under virtual perturbations and enables intra- and inter-patient guidance simulations, achieving a combined mean angular error of 8.2 degrees between the guided and target views in intra-patient simulations with cardiac motion.
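An angular error between two orientations is commonly measured as the geodesic angle on SO(3), computed from the trace of the relative rotation. The sketch below shows that standard formula for a single pair of rotation matrices. How the paper combines errors across the two SO(3) factors is not stated in the abstract, so this is an illustration of the metric family only.

```python
import math

def rot_z(deg):
    """3x3 rotation matrix about the z axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def angular_error_deg(r_a, r_b):
    """Geodesic angle between two rotations, in degrees.

    Uses trace(r_a^T r_b) = sum of elementwise products, then
    theta = arccos((trace - 1) / 2), clamped for numerical safety.
    """
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

err = angular_error_deg(rot_z(10.0), rot_z(18.2))
```

Two views that differ by an 8.2 degree rotation about one axis give an error of 8.2 degrees, matching the scale of the figure quoted above.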

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD Updated version

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

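Affiliation: Yet Another AI

AI summary: Studies whether audio, images, and video can share a unified wavelet token schema. Using a continuous-token model built on a one-level Haar DWT/IDWT frontend, it tests the feasibility of a shared schema on several datasets and analyzes the effects of latent capacity and metadata.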

Comments 12 pages, 3 figures

Abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
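A one-level Haar DWT/IDWT and the PSNR metric used above are both short enough to write out. The sketch below works on a 1D toy signal with the orthonormal Haar filters; dropping the detail band is a crude stand-in for a 50% keep ratio and is not the paper's energy-based selection.

```python
import math

def haar_dwt(x):
    """One-level orthonormal Haar DWT of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

signal = [0.1, 0.3, 0.2, 0.8, 0.5, 0.5, 0.9, 0.4]
approx, detail = haar_dwt(signal)
recon = haar_idwt(approx, detail)                   # lossless round trip
coarse = haar_idwt(approx, [0.0] * len(detail))     # keep approximation only
coarse_psnr = psnr(signal, coarse)
```

With all coefficients kept the round trip is exact up to floating-point error. With the detail band zeroed, each pair of samples collapses to its mean and the PSNR drops accordingly.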

2606.02615 2026-06-03 eess.AS cs.AI cs.SD Updated version

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

Haolong Zheng, Siyin Wang, Xulin Fan, Zengrui Jin, Mark Hasegawa-Johnson

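Affiliations: University of Illinois Urbana-Champaign; Tsinghua University

AI summary: Proposes FSA-GRPO, an RL-based post-training recipe whose specially designed reward encourages the model to use few-shot demonstrations, strengthening few-shot adaptation and yielding gains in children's speech recognition, speech translation, and audio understanding.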

Abstract

Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recognition. However, most auditory large language models are not explicitly trained to perform inference in this demonstration-conditioned format, limiting the extent to which they can benefit from few-shot prompting. To address this limitation, we introduce Few-Shot Aware GRPO (FSA-GRPO), an RL-based post-training recipe that uses a specially designed reward to encourage the model to leverage few-shot demonstrations, thereby strengthening its few-shot adaptation ability. Notably, training with only high-resource adult ASR data improves the model's general few-shot adaptation ability, yielding gains not only in children's speech recognition but also in speech translation and audio understanding. We further study data selection and auxiliary reward weighting to identify an effective training recipe. Our experiments show that when in-domain data are unavailable or cannot be used for training, FSA-GRPO is more effective than direct tuning on related out-of-domain data.
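GRPO, as commonly described, scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step. The reward values are made up, the use of the population standard deviation is an assumption, and the few-shot-aware reward itself is not specified in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four responses sampled for the same prompt (made-up numbers).
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Responses above the group mean receive positive advantages and those below receive negative ones. If every response earns the same reward, all advantages are zero and the group provides no learning signal.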

2606.02645 2026-06-03 stat.ML cs.AI cs.LG Updated version

Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics

Donghwan Lee

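Affiliation: School of Electrical Engineering, KAIST

AI summary: Using exact switched linear system dynamics and joint spectral radius analysis, proves that under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence of linear Q-learning to the exact projected Q-Bellman solution.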

Abstract

Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechanisms, but their precise theoretical explanation is still incomplete. This paper gives a rigorous and exact analysis of these mechanisms for Q-learning with linear function approximation (linear Q-learning) using the exact switched linear system (SLS) dynamics induced by the Bellman maximum and the joint spectral radius (JSR) of the resulting switching matrix families. Although linear Q-learning can fail to converge in general, we prove that, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis is carried out for deterministic linear Q-learning, where the target-update mechanism is most transparent. Once the corresponding JSR certificate is established for the mean recursion, the stochastic reinforcement-learning setting can be treated by replacing deterministic modes with sampled stochastic modes and adding the corresponding stochastic-noise analysis.
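The two target-update rules named above have simple standard forms: a soft update blends the target weights toward the online weights at rate tau, and a periodic hard update copies them every fixed number of steps. The sketch below shows only these update rules with frozen online weights. It does not implement the Q-learning recursion or the JSR analysis, and the values of `tau` and `period` are arbitrary.

```python
def soft_update(target, online, tau):
    """Soft (Polyak) update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]

def hard_update(target, online, step, period):
    """Periodic hard update: copy the online weights every `period` steps."""
    return list(online) if step % period == 0 else list(target)

online = [1.0, -2.0]   # frozen here so the tracking behaviour is easy to see
target = [0.0, 0.0]
for _ in range(200):
    target = soft_update(target, online, tau=0.05)

copied = hard_update([0.0, 0.0], online, step=10, period=5)
held = hard_update([0.0, 0.0], online, step=7, period=5)
```

With the online weights held fixed, the soft target closes the gap geometrically at rate (1 - tau) per step, while the hard target stays constant between copies.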

2606.02632 2026-06-03 stat.ML cs.AI cs.CY cs.LG econ.EM stat.AP Updated version

Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery

Tyler H. McCormick

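Affiliation: GitHub

AI summary: Argues that mechanistic learning is generically underdetermined in the high-dimensional proxy regimes where modern machine learning excels, and proposes concrete standards for "mechanistic ML" so that LLM-centered workflows support science rather than merely simulate it.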

Comments Will appear as a position paper in ICML

Abstract

Modern Machine Learning (ML) and Artificial Intelligence (AI) models, especially large language models (LLMs), are increasingly used to generate scientific hypotheses and mechanistic explanations from observational data. This position paper argues that in the high-dimensional proxy regimes where modern ML excels, mechanistic learning is generically underdetermined: many incompatible mechanisms induce essentially the same observational relationships on the support of the data, so predictive success and coherent explanations are insufficient evidence of mechanism discovery. This underdetermination becomes uniquely hazardous with LLMs, which tend to collapse large equivalence classes of explanations into a single fluent narrative. This paper proposes concrete standards for "mechanistic ML" and argues these norms are necessary if LLM-centered workflows are to support science rather than merely simulate it.

2606.02592 2026-06-03 stat.AP cs.AI Updated version

Tracking Urban Atmospheric Pollutants using Sentinel-5P Satellite Data

Alice Gomez-Cantos, Henry O. Velesaca

Affiliations: Facultad de Ciencias Naturales y Matemáticas, Escuela Superior Politécnica del Litoral, ESPOL, Campus Gustavo Galindo, Km. 30.5 Vía Perimetral, Guayaquil, 090902, Ecuador; Software Engineering Department, Research Center for Information and Communication Technologies (CITIC-UGR), University of Granada, 18071, Granada, Spain

AI summary: Proposes a framework based on Sentinel-5P/TROPOMI tropospheric column observations that uses distributional metrics such as the median and upper-tail percentiles, together with K-means clustering, to characterize urban NO2 background levels and extremes at the canton scale in Guayas Province, Ecuador, offering an interpretable and scalable air-quality assessment tool for data-scarce regions.

Abstract

Urban nitrogen dioxide ($NO_2$) is a key indicator of combustion-related air pollution and exhibits strong spatial and temporal variability in cities. This study presents a satellite-based framework for tracking urban $NO_2$ pollution using tropospheric column observations from Sentinel-5P/TROPOMI over Guayas Province, Ecuador. Rather than estimating surface concentrations, the methodology emphasizes robust distributional metrics, including the median and upper-tail percentiles ($P_{90}$, $P_{95}$, and $P_{99}$), to characterize background conditions and localized pollution extremes at the canton scale. Multi-year satellite observations are aggregated annually and analyzed using unsupervised K-means clustering to identify characteristic pollution regimes without predefined thresholds. Results show that highly urbanized cantons consistently exhibit elevated extreme $NO_2$ values and greater variability, while less urbanized areas display lower and more homogeneous patterns. The proposed approach provides an interpretable and scalable tool for urban air-quality assessment in data-scarce regions using satellite observations alone. The implementation is publicly available on GitHub: https://hvelesaca.github.io/sentinel-5P-clustering/.
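The distributional metrics named above (median, $P_{90}$, $P_{95}$, $P_{99}$) can be computed per canton with a few lines of code. The sketch below uses linear interpolation between order statistics, which is one common percentile convention and an assumption here; the input values are synthetic and the clustering step is omitted.

```python
import statistics

def percentile(values, q):
    """Percentile q (0-100) with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def summarise(values):
    """Background level (median) and upper-tail extremes for one region."""
    return {
        "median": statistics.median(values),
        "P90": percentile(values, 90),
        "P95": percentile(values, 95),
        "P99": percentile(values, 99),
    }

# Synthetic column values 1..101 in arbitrary units, not real TROPOMI data.
toy_columns = [float(v) for v in range(1, 102)]
summary = summarise(toy_columns)
```

Each canton-year would yield one such feature vector, and those vectors are the kind of input a K-means step could then group into pollution regimes.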

2606.03763 2026-06-03 econ.GN cs.AI q-fin.EC Updated version

Merit or networks? What decides where research is published

Ning Li

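AI summary: Using economics working papers, scores each paper's idea quality with an LLM evaluator and combines it with execution quality, a connection index, an author-ability index, and an off-the-shelf language-model text score in a five-input production function for journal placement, showing how merit and connections each shape publication outcomes.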

Abstract

Does scientific publishing reward the quality of ideas or the advantage of connections? The question is universal to prestige-driven science, yet it has resisted decades of study because a paper's quality could not be gauged ahead of its publication fate without using that fate as the yardstick. We break this constraint by measuring a paper's idea quality directly from its text, before publication, using a discipline-trained LLM evaluator that scores the idea without seeing author names or outcomes. Using economics as a case study, we combine this text-legible idea-quality score with an execution-quality rubric, a connection index, an author-ability index, and an off-the-shelf language-model text score to estimate a five-input production function for journal placement across 6,208 economics working papers. The inputs are not rivals but a sequence along the ladder of prestige. Execution sets a meritocratic floor and is the largest input overall. Text-legible idea quality grades the rungs in between. Connections set a favoritism ceiling that bites mainly near the apex, the most selective journals. Connections work through two additive channels: connected authors write papers that score higher, and at equal scores their papers are still more likely to place better. Yet this advantage is bounded. Connections raise the odds of every rung without making the apex the typical outcome for ordinary ideas, and even the highest-scoring papers face real friction reaching the visible journal ladder. The result nests, rather than chooses between, the meritocracy and network accounts of how science is published.

2606.02629 2026-06-03 q-bio.QM cs.AI cs.LG Updated version

Enhancing Protein-Protein Interaction Prediction with Hierarchical Motif-based Multimodal Protein Embedding

Zaifei Yang, Samuel Ping-Man Choi, James Kwok

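Affiliations: National University of Singapore; University of California, Los Angeles

AI summary: Proposes MMM-PPI, which builds multimodal encodings from hierarchical motifs at three scales (micro, meso, macro) to integrate sequence, structure, and function information and improve protein-protein interaction prediction.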

Abstract

Protein-protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from two major limitations: they overlook the hierarchical organization of proteins, particularly meso-scale motifs that critically regulate PPIs, and fail to effectively integrate sequence, structure, and function modalities. To address these limitations, we propose MMM-PPI, a Hierarchical Motif-based Multi-Modal protein Encoder for PPI Prediction that constructs PPI embeddings in a bottom-up multi-modal manner across three scales. At the micro-scale, we encode residue features from three modalities; at the meso-scale, a novel multimodal motif encoder aggregates residues into spatially-informed motif embeddings; at the macro-scale, a multimodal protein encoder integrates motifs into protein embeddings by jointly modeling motif importance and inter-modal correlations. The pre-trained encoder can be used off-the-shelf for large-scale PPI prediction. Extensive experiments on multiple PPI datasets show that MMM-PPI outperforms state-of-the-art multi-label PPI prediction models, particularly under challenging data partitions and limited data scenarios. Code is available at https://github.com/yzf-code/MMM-PPI.

2606.02625 2026-06-03 q-bio.QM cs.AI cs.LG Updated version

DXA-Derived Skeletal Phenotypes and Hip Fracture Risk: A Backdoor-Adjusted Causal Analysis

Zixin Shi, Chen Zhao, Meiling Zhou, Kevin A. Maupin, Joyce H. Keyak, Nancy E. Lane, Kuan-Jui Su, Hui Shen, Hong-Wen Deng, Kui Zhang, Weihua Zhou

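AI summary: Uses backdoor-adjusted average treatment effects to compare DXA-derived hip skeletal phenotypes in relation to hip fracture risk, and assesses whether phenotypes ranked by effect size improve risk stratification.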

Comments 35 pages; main manuscript includes 4 figures and 3 tables; supplementary material includes 13 figures and 3 tables

Abstract

Purpose: To compare dual-energy X-ray absorptiometry (DXA)-derived hip skeletal phenotypes in relation to hip fracture risk using prespecified confounder adjustment and to assess whether phenotypes ranked by their backdoor-adjusted average treatment effects (ATEs) improve risk stratification. Methods: We analyzed 21,098 UK Biobank participants with linked health records, hip DXA-derived skeletal measures, and prespecified covariates. Sixteen phenotypes spanning bone mineral content (BMC), bone mineral density (BMD), and T-score across hip-related regions were evaluated. Confounder selection was guided by a prespecified directed acyclic graph (DAG). Backdoor-adjusted ATEs were estimated on the absolute risk-difference scale per standard deviation (SD) increase. Effect heterogeneity was evaluated for total femur BMD, and downstream prediction was assessed using clinical variables combined with phenotypes ranked by ATE magnitude. Results: Among 21,098 participants, 115 had hip fractures. All 16 phenotypes showed negative backdoor-adjusted ATEs per SD increase. The largest ATEs were observed for total femur BMC and total femur BMD, each with a risk difference of -0.0047, corresponding to approximately 4.7 fewer hip fractures per 1,000 participants per SD higher phenotype value. Conditional effects of total femur BMD were stronger among older participants and those with lower BMI. In prediction, clinical variables plus the top 11 ATE-ranked phenotypes achieved higher AUC than FRAX with femoral neck BMD (0.842 vs. 0.709), with higher sensitivity (0.748 vs. 0.443) and similar specificity (0.793 vs. 0.777). Conclusion: DXA-derived hip skeletal phenotypes differed in their backdoor-adjusted ATEs. Phenotype-level causal evaluation may help identify informative DXA measures for risk stratification.
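The conversion from a risk difference to "fewer fractures per 1,000 participants" is plain arithmetic and can be checked directly. The sketch below uses only numbers quoted in the abstract; the helper name is illustrative, and the crude incidence is an unadjusted figure included for scale only.

```python
def per_thousand(risk_difference):
    """Convert an absolute risk difference into events per 1,000 participants."""
    return risk_difference * 1000.0

# Reported ATE for total femur BMC and BMD, per SD increase.
fewer_per_1000 = -per_thousand(-0.0047)

# Crude (unadjusted) hip fracture incidence in the analysed cohort.
crude_incidence = 115 / 21098
```

The crude incidence is roughly 5.5 fractures per 1,000 participants, so a reduction of about 4.7 per 1,000 per SD is large relative to the baseline rate.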

2606.02624 2026-06-03 q-bio.QM cs.AI cs.LG 版本更新

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

TadA-Bench:面向智能蛋白质工程的未来轮次发现的百万变异基准

Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang

发表机构 * Tsinghua University(清华大学)

AI总结 TadA-Bench 是一个基于31轮TadA定向进化的百万变异湿实验回放基准,通过定义固定数据回放任务来评估模型在未见过的未来轮次中排序变异的能力,并引入Seq2Graph统一标签,揭示进化覆盖度比局部数据密度更重要。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026). Data: https://huggingface.co/datasets/JinGao/TadABench-1M . Code: https://github.com/shiyegao/TadABench-1M

详情
AI中文摘要

人工智能用于科学发现正进入智能体时代,蛋白质工程系统应优先考虑未来的湿实验,而不仅仅是拟合静态测量。我们引入了TadA-Bench,这是一个来自31轮TadA定向进化的百万变异湿实验回放基准,用于面向智能蛋白质工程的未来轮次发现。TadA-Bench保留了实验的时间顺序,并定义了一个固定数据回放任务:给定早期的实验轮次,模型对仅出现在后期轮次中的变异进行排序。它提供了对齐的DNA、RNA和蛋白质视图,并使用Seq2Graph(一种基于图的标签统一流程)来将嘈杂的富集测量结果协调为一致的跨轮次活性标签。随机分割控制显示强插值能力,但未来轮次排序和有限预算候选选择则弱得多。控制分析表明,进化覆盖度比局部数据密度更具信息性,将TadA-Bench定位为面向智能蛋白质工程的未来轮次发现的可重复湿实验回放基底;数据和代码已在Hugging Face和GitHub上发布。

英文摘要

AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph-based label-unification pipeline, to reconcile noisy enrichment measurements into consistent cross-round activity labels. Random-split controls show strong interpolation, but future-round ranking and finite-budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA-Bench as a reproducible wet-lab replay substrate for future-round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.
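The fixed-data replay task described above can be sketched as a chronological split. This is a minimal illustration under assumed names (`Record`, `future_round_split`, `cutoff`); it is not the benchmark's actual loader or schema.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    variant: str       # sequence identifier (hypothetical field)
    round_index: int   # directed-evolution round in which it was measured
    activity: float    # unified activity label (hypothetical field)


def future_round_split(
    records: List[Record], cutoff: int
) -> Tuple[List[Record], List[Record]]:
    """Train on rounds <= cutoff; test on variants seen only in later rounds."""
    train = [r for r in records if r.round_index <= cutoff]
    seen = {r.variant for r in train}
    test = [
        r for r in records
        if r.round_index > cutoff and r.variant not in seen
    ]
    return train, test


records = [
    Record("v1", 1, 0.2),
    Record("v2", 2, 0.4),
    Record("v2", 3, 0.5),  # already seen in round 2, so excluded from test
    Record("v3", 3, 0.9),  # appears only after the cutoff
]
train, test = future_round_split(records, cutoff=2)
print([r.variant for r in train], [r.variant for r in test])
```

A random split would let `v2` leak across the boundary, which is one plausible reason random-split scores look stronger than future-round ranking.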

2606.03985 2026-06-03 cs.RO cs.AI cs.CV 版本更新

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Humanoid-GPT:扩展数据与结构以实现零样本运动跟踪

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

发表机构 * Tsinghua University(清华大学) Galbot Inc.(Galbot公司) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Shanghai Qi Zhi Institute(上海启智研究院)

AI总结 提出Humanoid-GPT,一种基于GPT风格的因果Transformer,在十亿级运动语料上预训练,实现全身控制,通过扩展数据和模型容量达到对未见运动和任务的零样本泛化。

Comments Accepted at CVPR 2026

详情
AI中文摘要

我们介绍了Humanoid-GPT,一种具有因果注意力的GPT风格Transformer,在十亿级运动语料上训练用于全身控制。与受限于稀缺数据和敏捷性-泛化权衡的先前浅层MLP跟踪器不同,Humanoid-GPT在一个包含所有主要动作捕捉数据集和大规模内部录制的20亿帧重定向语料上预训练。扩展数据和模型容量产生了一个单一的生成式Transformer,它能够跟踪高度动态的行为,同时实现对未见运动和控制任务的前所未有的零样本泛化。大量实验和扩展分析表明,我们的模型建立了新的性能前沿,展示了对未见任务的鲁棒零样本泛化,同时能够跟踪高度动态和复杂的运动。

英文摘要

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

2606.03979 2026-06-03 cs.LG cs.AI 版本更新

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

语言模型需要睡眠:学习自我修改和巩固记忆

Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

发表机构 * Google(谷歌) Cornell University(康奈尔大学)

AI总结 受人类学习过程启发,提出“睡眠”范式,通过记忆巩固(知识播种)和梦境(自我改进)两阶段,使模型持续学习、将短期记忆转化为长期知识并自我提升。

Comments A version of this work has been publicly available from September 2025 on OpenReview

详情
AI中文摘要

过去几十年见证了机器学习算法设计的重大进步,从早期针对特定任务的浅层模型研究到更通用的深度大语言模型(LLMs)。尽管在需要即时预测或上下文学习的任务中显示出有希望的结果,现有模型缺乏持续学习并有效将其时间上下文知识转移到长期参数的能力。受人类学习过程的启发,我们引入了一种“睡眠”范式,允许模型持续学习,通过重放将其短期脆弱记忆蒸馏为稳定的长期知识,并通过“梦境”过程递归地自我改进。更详细地说,睡眠包括两个阶段:(1)记忆巩固:一个向上的蒸馏过程,称为知识播种,其中较小自我的记忆被蒸馏到更大的网络中,以在保留知识的同时提供更多容量。作为概念验证,我们提出了一种新的广义蒸馏过程用于知识播种(即在线策略蒸馏与基于强化学习的模仿学习的结合);(2)梦境:一个自我改进阶段,其中模型使用强化学习生成合成数据的课程,以排练新知识并在没有人类监督的情况下完善现有能力。我们在长视野、持续学习、知识整合和少样本泛化任务上的实验支持了睡眠阶段的重要性。

英文摘要

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

2606.03976 2026-06-03 cs.CV cs.AI cs.LG q-bio.NC 版本更新

Formalizing the Binding Problem

形式化绑定问题

Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

AI总结 本文用信息论方法形式化绑定问题,提出一种探测方法测量模型表示中的绑定信息,并在视觉Transformer上实验,证明绑定是强视觉识别和推理的关键要素。

Comments Accepted to ICML 2026

详情
AI中文摘要

世界表征,可以说,包含关于特征的信息(例如,某物是蓝色的,某物是圆形的),但也包含关于哪些特征属于同一对象的信息(例如,圆形是蓝色的),我们称之为绑定信息。任何具有理解包含多个对象场景能力的系统都必须解决绑定问题:它需要知道哪些特征属于一起。然而,尽管有研究表明视觉Transformer(ViT)知道哪些补丁属于一起,但目前尚不清楚当前的深度学习模型是否学会展示绑定信息,即针对特征的信息。我们可能认为绑定信息并不多,毕竟将特征错误归因于错误对象是基于ViT架构的常见失败,尤其是在对象共享特征的场景中。本文用信息论方法形式化绑定问题,并引入一种探测方法来测量模型表示中的绑定信息。我们在ViT上进行实验,测量来自架构不同组件(如图像摘要标记[CLS]或空间标记)的绑定信息。我们使用具有不同绑定挑战的数据集,例如特征共享、遮挡和自然特征,同时比较多个预训练ViT的性能。总体而言,我们的研究证明了绑定是强视觉识别和推理的关键要素。

英文摘要

Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information at the level of features. We may believe that there is not much binding information; after all, misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

2606.03969 2026-06-03 cs.CL cs.AI 版本更新

Quantifying Faithful Confidence Expression in Large Reasoning Models

量化大型推理模型中的忠实置信表达

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

发表机构 * Yale University(耶鲁大学)

AI总结 针对大型推理模型(LRM)在长链思维输出中难以忠实表达内在置信度的问题,提出基于令牌概率、隐藏状态和响应一致性的框架,系统量化其语言决断性与内部不确定性之间的对齐程度。

Comments Code: https://github.com/yale-nlp/faithful_lrm

详情
AI中文摘要

可靠的不确定性沟通对于LLMs的可信度至关重要,然而忠实校准(FC)——模型内在置信度与(语言上)表达的置信度之间的对齐——是一个持续存在的失败模式。这一挑战对大型推理模型(LRM)尤为关键,因为其扩展的推理轨迹常被用户解读为深思熟虑、能力和信心的证据。尽管FC重要且LRM广泛使用,但LRM能否忠实表达其置信度仍知之甚少。此外,衡量FC的主流范式难以泛化到LRM生成的长链思维输出,这些输出往往缺乏清晰的步骤边界、步骤结构不一致,并在整个轨迹中编码复杂的条件依赖——使得内在置信度的估计复杂化。为应对这一挑战,我们引入了一个新颖的框架来系统量化LRM的FC。我们的框架基于令牌概率、隐藏状态和采样响应一致性,分析语言决断性与三种内部不确定性来源的关系。我们还设计了一种前缀条件采样方法,以控制轨迹中的条件和结构变化。将我们的框架应用于一系列多样化的领先模型、数据集和提示,我们发现忠实置信表达是LRM的一个重大挑战。推理行为不会自动转化为改进的FC,针对非推理模型的提示干预在推理设置中并不能提高忠实性。不同的置信估计器还对同一轨迹产生不同评估,揭示了先前评估方法的脆弱性。综合来看,我们的工作将FC确立为LRM的一个独特的可靠性和对齐目标,尤其是在这些系统越来越多地部署在高风险场景中的背景下。

英文摘要

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

2606.03968 2026-06-03 cs.CL cs.AI 版本更新

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

QUBRIC:为超越可验证奖励的强化学习协同设计查询与评分标准

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

发表机构 * Amazon(亚马逊) Georgia Institute of Technology(佐治亚理工学院)

AI总结 针对基于评分标准的强化学习中查询分布固定导致的评分标准质量瓶颈,提出QUBRIC框架,通过协同设计查询与评分标准,利用教师关键点、对比生成和可学习性过滤,在ArenaHard上取得+5.5点提升,并泛化到法律、道德和叙事推理任务。

详情
AI中文摘要

基于评分标准的强化学习是将强化学习扩展到可验证奖励之外的一条有前景的途径,但现有方法在优化评分标准时,将查询分布视为固定不变。我们识别出一个结构性瓶颈:评分标准的质量受限于查询结构。开放式查询会导致模糊的评分标准;而简单地将查询收窄则会引入任何模型都无法验证的虚构参考,导致所有回答失败,训练无法获得奖励信号。我们提出QUBRIC,一个协同设计查询和评分标准的框架。教师导出的关键点将开放式查询改写为基于场景、可评估的问题。然后,对比评分标准生成将教师策略的差距转化为查询级别的标准,可学习性过滤仅保留信息量丰富的查询-评分标准对用于GRPO训练。QUBRIC在ArenaHard上相比SFT基线取得了+5.5分的提升。仅使用指令遵循数据训练,它进一步迁移到三个涵盖法律、道德和叙事推理的保留基准(平均提升+6.3分),改进集中在推理相关维度。这些结果证明,协同设计查询和评分标准可以使基于评分标准的强化学习成为严格可验证任务之外RLVR的实用补充。

英文摘要

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
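One way to read the learnability filter is sketched below. The abstract does not state the actual criterion, so the rule used here is an assumption: keep a query-rubric pair only if sampled responses receive different rubric scores, because GRPO's group-normalized advantages are all zero when every reward in a group is identical. The names `is_learnable` and `filter_pairs` are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List


def is_learnable(rewards: List[float], min_std: float = 1e-6) -> bool:
    """A group with identical rewards yields zero advantage for every sample."""
    return pstdev(rewards) > min_std


def filter_pairs(pairs: Dict[str, List[float]]) -> List[str]:
    """Keep the IDs of query-rubric pairs whose sampled rewards vary."""
    return [pair_id for pair_id, rewards in pairs.items() if is_learnable(rewards)]


# Rubric scores for four sampled responses per pair (made-up numbers).
pairs = {
    "all_fail": [0.0, 0.0, 0.0, 0.0],  # e.g. an unverifiable fabricated reference
    "all_pass": [1.0, 1.0, 1.0, 1.0],  # too easy, nothing to learn
    "mixed": [0.0, 1.0, 0.5, 1.0],     # informative
}
kept = filter_pairs(pairs)
print(kept)
```

The `all_fail` case corresponds to the failure mode the abstract describes, where every response fails and training receives no reward signal.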

2606.03967 2026-06-03 cs.CL cs.AI 版本更新

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

AlignAtt4LLM:面向仅解码器LLM的快速AlignAtt方法在IWSLT 2026同声传译任务中的应用

Quentin Fuxa, Dominik Macháček

发表机构 * Charles University, MFF, ÚFAL(查理大学数学与物理学院,ÚFAL) University of Edinburgh(爱丁堡大学)

AI总结 提出AlignAtt4LLM系统,通过显式源文本跨度、离线选择翻译对齐头、选择性qk快速重放和运行时查询/键捕获,首次将AlignAtt策略应用于仅解码器LLM,在英德、英意同声传译中优于基线。

Comments Accepted to IWSLT 2026

详情
AI中文摘要

我们描述了AlignAtt4LLM,一个用于英语到德语、意大利语和中文的IWSLT 2026同声传译系统。该系统是一个同步级联:Qwen3-ASR结合强制对齐生成增量更新的源文本转录,Gemma-4 E4B-it在MT侧的AlignAtt策略下翻译该前缀。据我们所知,这是AlignAtt首次应用于仅解码器LLM,而早期AlignAtt系统使用的编码器-解码器交叉注意力在此类模型中不存在。我们通过提出(1)提示中的显式源文本跨度,(2)离线选择翻译特定的对齐头,(3)草稿到源注意力块的选择性qk快速重放,以及(4)保持模型输出比特一致的运行时查询/键捕获,恢复了一个可用的策略。在IWSLT 2026开发集上,AlignAtt4LLM在约2秒的低延迟和低于4秒CU-LongYAAL的高延迟场景下,均优于欧洲目标语言(英语到德语和英语到意大利语)的提供基线。英语到中文的结果较为复杂,但该方法不依赖于Gemma-4:由于AlignAtt4LLM仅需要确定的提示布局、校准的对齐头和查询/键捕获,相同的策略可以重新应用于针对非欧洲目标语言的更强翻译专用仅解码器MT骨干网络。

英文摘要

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.

2606.03965 2026-06-03 cs.CL cs.AI 版本更新

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Agentic Chain-of-Thought Steering:实现高效且可控的LLM推理

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

AI总结 提出Agentic Chain-of-Thought Steering (ACTS)方法,通过强化学习训练控制器智能体在推理过程中自适应地选择推理策略和引导短语,实现预算感知的策略控制,从而在保持推理质量的同时显著节省token,并支持准确率-效率的可控权衡。

详情
AI中文摘要

大型语言模型通过扩展的思维链推理提高了最终答案的准确性,但通常token使用效率低下且缺乏推理时的控制。现有的高效推理方法通过缩短、提前停止或压缩轨迹来控制思考长度,但隐式地决定了模型的思考方式。在本文中,我们提出了Agentic Chain-of-Thought Steering (ACTS),它将推理引导形式化为一个马尔可夫决策过程,其中控制器智能体在推理过程中自适应地引导冻结的推理器。在每一步,控制器观察推理轨迹和剩余思考预算,然后发出一个包含推理策略和引导短语的引导动作,以启动推理器的下一步。这使得在保持推理器生成连续性的同时,能够进行预算感知的策略控制以实现高效推理。我们从构建的合成引导轨迹中初始化控制器智能体,并进行多预算增强,然后通过带有预算条件奖励塑造的强化学习进一步优化。跨多个基准的实验表明,ACTS在显著节省token的同时达到了与全思考相当的性能,并在不同的推理器和任务上实现了可控的准确率-效率权衡。代码可在 https://github.com/Andree-9/ACTS 获取。

英文摘要

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
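The control loop described above can be sketched with stand-ins. Everything here is hypothetical: `toy_controller` is a hand-written rule in place of the RL-trained controller, and `toy_reasoner` is a stub in place of the frozen LLM. The sketch only shows the observe, act, continue structure under a token budget.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SteeringAction:
    strategy: str  # which reasoning strategy to use next
    phrase: str    # text that starts the reasoner's next step


def toy_controller(trace: List[Tuple[str, str]], remaining: int) -> SteeringAction:
    """Hand-written policy: observe the trace and budget, pick an action."""
    if remaining <= 20:
        return SteeringAction("conclude", "Therefore, the final answer is")
    if not trace:
        return SteeringAction("decompose", "Let me break the problem into parts.")
    return SteeringAction("verify", "Let me double-check the previous step.")


def toy_reasoner(phrase: str, max_tokens: int) -> Tuple[str, int]:
    """Stub for the frozen reasoner: continues from the phrase at a fixed cost."""
    cost = min(max_tokens, 15)
    return phrase + " ...", cost


def run_episode(budget: int) -> Tuple[List[Tuple[str, str]], int]:
    trace: List[Tuple[str, str]] = []
    remaining = budget
    while remaining > 0:
        action = toy_controller(trace, remaining)
        step, cost = toy_reasoner(action.phrase, remaining)
        trace.append((action.strategy, step))
        remaining -= cost
        if action.strategy == "conclude":
            break
    return trace, remaining


trace, left = run_episode(budget=50)
print([strategy for strategy, _ in trace], left)
```

Because the controller only supplies the opening phrase of each step, the reasoner keeps generating in its own voice, which is the continuity property the abstract emphasizes.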

2606.03962 2026-06-03 cs.LG cs.AI 版本更新

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

利用奖励不确定性在强化学习中诱导多样化行为

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland

发表机构 * New York University(纽约大学) Google DeepMind(谷歌DeepMind)

AI总结 针对传统强化学习缺乏多样性的问题,提出将奖励函数替换为奖励分布,通过非线性集合目标自然产生可控的多样化行为,并推导出梯度估计器,实验证明其鲁棒性和理论优势。

Comments Core contributors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, André Barreto, Mark Rowland

详情
AI中文摘要

经典强化学习通常寻求最大化标量奖励期望和的确定性策略。然而,现代应用如语言模型微调或科学发现需要多样性。现有的补救措施如熵正则化或多样性奖励通常需要脆弱的权衡,以性能换取随机性,或依赖可能使策略排名错位的启发式指标。我们认为,多样性更自然地理解为对奖励不确定性的理性响应。当奖励函数不完全已知时——例如模糊偏好或不完美的奖励模型——承诺单一行动可能是次优的。基于此,我们提出对强化学习目标进行根本性重新表述,将标量奖励替换为奖励函数上的分布,并对行动集合应用非线性目标。结果是一个框架,其中校准的行为多样性自然出现,通过奖励函数分布保持可控,且无需牺牲期望奖励即可获得。聚焦于上下文赌博机设置,我们为该目标推导出原则性的梯度估计器,并证明我们的公式自然泛化了原始策略梯度以及最近发展的行动集方法。我们的实证结果表明,该框架为传统问题表述无法诱导所需行为广度的复杂强化学习任务提供了鲁棒且理论基础的替代方案。

英文摘要

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
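A toy calculation illustrates why reward uncertainty can favor diversity. The setup is an assumption made for illustration, not the paper's objective: two actions, two equally likely reward functions that disagree on which action is good, and a set-level objective equal to the expected best reward among k actions sampled from the policy.

```python
from itertools import product

# Two candidate reward functions, each with probability 0.5 (made-up values).
REWARD_FUNCTIONS = [
    (0.5, {"a": 1.0, "b": 0.0}),
    (0.5, {"a": 0.0, "b": 1.0}),
]


def expected_reward(p_a: float) -> float:
    """Standard linear objective: expected reward of a single sampled action."""
    policy = {"a": p_a, "b": 1.0 - p_a}
    return sum(
        prob_r * policy[action] * reward[action]
        for prob_r, reward in REWARD_FUNCTIONS
        for action in policy
    )


def set_objective(p_a: float, k: int = 2) -> float:
    """Non-linear objective: expected best reward within k sampled actions."""
    policy = {"a": p_a, "b": 1.0 - p_a}
    total = 0.0
    for actions in product(policy, repeat=k):
        prob_set = 1.0
        for action in actions:
            prob_set *= policy[action]
        for prob_r, reward in REWARD_FUNCTIONS:
            total += prob_set * prob_r * max(reward[a] for a in actions)
    return total


for p in (1.0, 0.5):
    print(p, expected_reward(p), set_objective(p))
```

In this toy case every policy has the same expected single-action reward (0.5), yet the stochastic policy scores 0.75 on the set objective against 0.5 for the deterministic one, so diversity comes at no cost in expected reward.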

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS 版本更新

Efficient ASR Training with Conversations that Never Happened

利用从未发生的对话进行高效的ASR训练

Máté Gedeon, Péter Mihajlik

发表机构 * Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics(电信与人工智能系,布达佩斯技术与经济大学) SpeechTex Ltd.(SpeechTex公司) ELTE Research Centre for Linguistics(ELTE语言研究所)

AI总结 针对低资源语言和特定领域,提出通过LLM生成对话场景、映射说话人属性到TTS语音配置文件并组装合成话语的增强流水线,实验表明合成对话能有效提升ASR性能,在匈牙利语基准上仅用67小时真实对话和636小时模拟数据即超越2700小时零样本模型。

详情
AI中文摘要

低资源语言和特定领域的对话式ASR受到领域匹配的多说话人训练数据稀缺的限制。我们提出了一种增强流水线,该流水线生成带有参与者元数据的场景级对话,将说话人属性映射到TTS语音配置文件,并将合成的话语组装成感知说话人的模拟对话。我们在相同的FastConformer-Large训练方案下,评估了五种LLM家族,分别采用单生成器、固定预算混合和扩展设置。我们在匈牙利语BEA-Dialogue基准语料库上进行了全面评估,该方法本身适用于任何语言,只要各组件有相应资源。结果表明,合成对话持续改善语音识别性能,但生成器选择和组成数据强烈影响增益。我们最大的训练配置仅使用67小时真实对话和636小时模拟数据,在评估基准上实现了比在2700小时匈牙利语语音上训练的零样本模型更好的性能。这些发现表明,通过TTS合成的LLM生成的对话数据是真实对话语料库在语音模型训练中的实用补充。

英文摘要

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03939 2026-06-03 cs.LG cs.AI cs.PF 版本更新

FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

FlashbackCL:缓解联邦学习中的时间遗忘

Mubarak A. Ojewale, Adriana E. Chis, Jorge M. Cortes-Mendoza, Bernardo Pulido-Gaytan, Horacio Gonzalez-Velez

发表机构 * Cloud Competency Centre, National College of Ireland, Dublin, Ireland(云竞争力中心,爱尔兰国家学院,都柏林,爱尔兰)

AI总结 针对联邦学习中客户端数据分布随时间漂移导致的时间遗忘问题,提出FlashbackCL方法,通过时间衰减标签计数、类别平衡水库采样重放和服务器端主动核心集筛选,在CIFAR-10上相对Flashback提升6.9%-10.0%,时间遗忘减少68%。

详情
AI中文摘要

基础模型和边缘模型的联邦学习(FL)越来越多地部署在客户端数据分布随时间漂移的场景中,然而现有的遗忘缓解方法假设每个客户端的分布是平稳的。Flashback是近期最强的针对跨客户端(空间)遗忘的FL方法,它使用单调累积的每类标签计数作为知识代理;该代理在时间分布漂移下会失准,并将全局模型锚定在过时的类别平衡上。我们通过一个与协议级波动隔离的每阶段指标形式化定义了FL中的时间遗忘,并提出了Flashback Continual Learning(FlashbackCL),它是Flashback的即插即用扩展,包含:(i) 时间衰减的标签计数;(ii) 具有类别平衡水库采样(CBRS)的设备感知重放缓冲区;(iii) 在公共蒸馏集上的服务器端主动核心集筛选。结果表明,在具有50个客户端和三种受控时间漂移模式的CIFAR-10上,FlashbackCL相对于Flashback实现了6.9%至10.0%的相对改进,同时将时间遗忘减少了高达68%。一项5变体消融实验表明CBRS重放是关键组件。FlashbackCL在平稳CIFAR-100上也比Flashback提高了3.5个百分点,表明类别平衡重放同样正则化了空间异质性和时间漂移。

英文摘要

Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves on Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.
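The difference between accumulating and temporally decayed label counts can be sketched as follows. The exponential decay rule and the `decay` value are assumptions for illustration; the abstract does not specify the decay schedule.

```python
from typing import Dict, List


def update_counts(
    counts: Dict[int, float], new_labels: List[int], decay: float
) -> Dict[int, float]:
    """Scale old counts by `decay`, then add the labels seen this phase.

    decay=1.0 reproduces monotonic accumulation; decay<1.0 forgets old phases.
    """
    updated = {label: decay * value for label, value in counts.items()}
    for label in new_labels:
        updated[label] = updated.get(label, 0.0) + 1.0
    return updated


# A client sees class 0 in phase 1, then drifts to class 1 for two phases.
phases = [[0] * 8, [1] * 8, [1] * 8]

accumulated: Dict[int, float] = {}
decayed: Dict[int, float] = {}
for labels in phases:
    accumulated = update_counts(accumulated, labels, decay=1.0)
    decayed = update_counts(decayed, labels, decay=0.5)

print(accumulated)  # {0: 8.0, 1: 16.0}
print(decayed)      # {0: 2.0, 1: 12.0}
```

With accumulation, class 0 still carries a third of the total weight after the drift; with decay it falls to one seventh, which tracks the client's current distribution more closely.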

2606.03927 2026-06-03 cs.LG cs.AI 版本更新

FFR: Forward-Forward Learning for Regression

FFR:前向-前向学习用于回归

Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li, Guosheng Hu

发表机构 * University of Bristol(布里斯托大学) University College London(伦敦大学学院) University of Cambridge(剑桥大学)

AI总结 提出FFR框架,通过序数竞争 goodness 函数、分层阶梯架构和层次化预测将前向-前向算法扩展到回归任务,在多个数据集上恢复BP 98.6%的精度并显著降低内存和时间开销。

详情
AI中文摘要

前向-前向(FF)算法通过纯局部、逐层优化训练神经网络,提供了反向传播(BP)的计算高效且生物合理的替代方案。然而,FF本质上是为通过对比正负样本对进行分类而设计的,将其扩展到回归面临根本性挑战:连续目标空间缺乏用于对比学习的自然“对立面”,且标准 goodness 函数不携带关于目标幅度或顺序的信息。我们提出FFR(前向-前向回归),据我们所知,这是第一个将FF扩展到现实世界回归并展示在多样化真实数据集上具有竞争力的性能的框架。FFR引入了三项关键创新:(1)序数竞争 goodness 函数,通过距离感知序数监督下分区神经元组之间的竞争学习取代对比对;(2)分层阶梯架构,其中浅层学习粗序数判别,深层细化到细粒度回归,并通过多尺度特征聚合实现层间协作;(3)带不确定性估计的层次化预测,其中多尺度预测器联合提供鲁棒预测和预测置信度作为免费午餐。大量实验结果表明,FFR在五个真实世界回归基准上平均恢复了BP 98.6%的精度,同时将峰值训练内存降低到深度8时BP的27%和深度32时BP的8%,每次迭代时间约为BP的72%,并且显著优于所有无BP的竞争对手。

英文摘要

The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural "opposites" for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), which is, to our knowledge, the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free lunch. Extensive experimental results show FFR recovers on average 98.6% of BP's accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's, and substantially outperforms all BP-free competitors.

2606.03918 2026-06-03 cs.AI 版本更新

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

Hedge-Bench:在金融推理相关的困难、现实任务上对智能体进行基准测试

Eric Cho, Shawn Huang, Alice Lu, Andy Lyu

发表机构 * Trata Brigham Young University

AI总结 提出Hedge-Bench基准,包含102个基于对冲基金分析师实际工作推理轨迹的任务,用于评估AI智能体在开放金融推理问题上的表现,前沿模型得分低于16%。

Comments Dataset and evaluation harness available at github.com/Trata-Inc/trata-hedge-bench

详情
AI中文摘要

AI智能体越来越多地能够处理金融分析的机械任务:检索文档、计算公式、更新电子表格。更难、更有价值的挑战在于推理那些定义专家分析师工作的开放式问题。现有基准没有捕捉到这类问题,而那些试图评估开放式推理的基准依赖于模型判断的输出,这引入了噪声和循环。我们提出Hedge-Bench 1.0:一个包含102个实际工作任务的基准,这些任务基于专业对冲基金分析师在相关信息来源上工作的明确推理轨迹。这种方法能够针对经过验证的专家步骤进行确定性评分。前沿模型和智能体在基准上的得分低于16%。我们在 github.com/Trata-Inc/trata-hedge-bench 发布数据集和评估工具。

英文摘要

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.

2606.03910 2026-06-03 cs.PF cs.AI cs.DC cs.NI 版本更新

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

NetKV: 面向分解式LLM推理的网络感知解码实例选择

Mubarak Adetunji Ojewale

发表机构 * Cloud Competency Centre, National College of Ireland(国家爱尔兰学院云能力中心)

AI总结 针对分解式LLM推理中KV缓存传输导致的首令牌时间增加问题,提出网络成本感知调度器NetKV,通过贪心算法选择解码实例,在64-GPU胖树模拟器上平均降低TTFT达21.2%。

详情
AI中文摘要

分解式LLM推理迫使KV缓存在解码开始前穿越数据中心网络,因此传输时间直接计入首令牌时间(TTFT)预算。当前调度器仅根据计算负载和前缀缓存局部性进行路由,忽略了预填充和解码实例之间的拓扑距离和动态拥塞。我们通过一个轻量级的算子到调度器接口(网络成本预言机)来弥补这一差距,并证明忽略网络项会导致仅缓存感知的调度在上下文长度增长时任意次优。NetKV是一个每请求O(|D|)的贪心算法,它消耗该预言机,其层级排名对过时遥测数据具有可证明的鲁棒性。在由Mooncake轨迹驱动的64-GPU四层胖树模拟器上,NetKV相比轮询调度平均降低TTFT达21.2%,相比调优的缓存+负载感知调度器降低17.6%,将SLO达标率提升最多20.1个百分点,并在所有测试条件下将令牌间时间开销保持在0.5毫秒以下,无需对传输、推理引擎或硬件进行任何更改。

英文摘要

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy algorithm that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
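A greedy, network-aware selection over the decode set D can be sketched as a single linear scan. The cost model (queueing delay plus KV size divided by path bandwidth) and all field names are assumptions for illustration; the actual oracle interface and scoring in NetKV are not given in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecodeInstance:
    name: str
    queue_delay_s: float   # estimated wait before decoding can start
    bandwidth_gbps: float  # assumed oracle output for the prefill-to-decode path


def transfer_time_s(kv_bytes: float, bandwidth_gbps: float) -> float:
    """Time to move the KV cache over a path with the given bandwidth."""
    return kv_bytes * 8 / (bandwidth_gbps * 1e9)


def select_decode_instance(
    instances: List[DecodeInstance], kv_bytes: float
) -> DecodeInstance:
    """One O(|D|) pass: pick the instance with the lowest estimated start time."""
    return min(
        instances,
        key=lambda d: d.queue_delay_s + transfer_time_s(kv_bytes, d.bandwidth_gbps),
    )


instances = [
    DecodeInstance("same_rack_busy", queue_delay_s=0.05, bandwidth_gbps=100.0),
    DecodeInstance("cross_pod_idle", queue_delay_s=0.01, bandwidth_gbps=10.0),
]

short_ctx = select_decode_instance(instances, kv_bytes=1e7)  # about 10 MB of KV
long_ctx = select_decode_instance(instances, kv_bytes=1e9)   # about 1 GB of KV
print(short_ctx.name, long_ctx.name)
```

The choice flips as the KV cache grows: for a short context the idle but distant instance wins, while for a long context the transfer term dominates and the nearby instance wins. This matches the abstract's point that ignoring the network term becomes more costly as context length grows.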

2606.03907 2026-06-03 cs.SE cs.AI cs.HC 版本更新

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

配置智能体AI编码工具对构建vs购买决策的影响:一项研究协议

Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude

发表机构 * Singapore Management University, Singapore(新加坡管理大学) University of Bamberg, Germany(班贝格大学) King’s College London, United Kingdom(伦敦国王学院) Heidelberg University, Germany(海德堡大学)

AI总结 本研究通过受控实验协议,探讨配置机制如何影响Claude Code和OpenAI Codex等智能体AI编码工具在构建vs购买决策中的行为,并发布可复用的基准数据集和分析流程。

Comments 14 pages, 1 table. Accepted at the 20th International Symposium on Empirical Software Engineering and Measurement (ESEM 2026), Registered Reports track

详情
AI中文摘要

智能体AI编码工具以越来越高的自主性编写代码,并在此过程中决定何时导入库以及何时从头实现功能。这些决策——是从头构建功能还是购买外部库(以下称为构建vs购买)——对软件安全性、许可合规性、性能和长期可维护性有直接影响。然而,尚无受控实验研究探讨智能体AI编码工具中构建vs购买决策的支配因素。配置机制,即开发人员根据项目或工作流程定制智能体AI编码工具行为的手段,是实践者影响这些决策的主要方式之一。但尚不清楚哪些配置机制最有效地影响构建vs购买决策。我们提出了一项预注册协议,研究配置机制如何改变两种流行的智能体AI编码工具(Claude Code和OpenAI Codex)中的构建vs购买行为。我们将执行来自阶段性项目基准的受控编程任务,每个任务围绕可识别的构建vs购买点构建,并操纵提供给每个工具的配置,范围从无配置、包含软偏好和明确禁止的上下文文件,到技能(可自主发现的指令)、支持MCP的库发现工具和权限控制,测量工具选择的库、是否披露新引入的库以及这些披露是否完整准确。九个预注册假设构成了该协议。生成的基准数据集和分析流程将作为可复用工件发布,用于评估智能体AI编码工具中的构建vs购买行为。

英文摘要

Agentic AI coding tools write code with increasing autonomy and, in doing so, decide when to import a library and when to implement functionality from scratch. These decisions (whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy) carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary ways in which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool. The configurations range from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls. We will measure which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

2606.03906 2026-06-03 cs.AI 版本更新

scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

scTranslation:单细胞多组学模态翻译的综合基准

Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li

发表机构 * Westlake University(西湖大学) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Xidian University(西安电子科技大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 针对单细胞多组学模态翻译任务,提出了包含多样化数据集、先进模型和全面评估指标的综合基准scTranslation,并系统研究了特征选择、特征质量和少样本设置等影响因素。

详情
AI中文摘要

在单细胞中同时测量多种组学模态使研究人员能够更全面地理解细胞状态和调控机制。然而,由于高实验成本、显著噪声和不完全的模态覆盖,近年来出现了多种用于模态翻译的计算方法。尽管翻译模型有所发展,但在数据集、评估指标和影响因素方面仍缺乏系统的基准评估。为此,我们提出了scTranslation,一个用于单细胞多组学模态翻译任务的综合基准。它包括多样化的翻译数据集,整合了最先进的模型,并提供了全面的评估指标。此外,我们评估了模型在不同场景下的性能,如特征选择、特征质量和少样本设置。这些因素显著影响模型性能,但此前很少被系统研究。利用该基准,我们对当前方法进行了大规模研究,报告了许多有洞察力的发现,为未来发展开辟了新的可能性。该基准已开源以促进未来研究。代码匿名发布于 https://github.com/Bunnybeibei/scTranslation。

英文摘要

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.

2606.03895 2026-06-03 cs.OS cs.AI cs.CR 版本更新

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Agent libOS: 一种受库操作系统启发的运行时,用于长时间运行、能力受控的LLM智能体

Yingqi Zhang

AI总结 提出Agent libOS运行时,将LLM智能体建模为具有进程标识、生命周期、能力控制和审计记录的AgentProcess,通过类似libc的工具包装和运行时原语边界实现安全调度与资源控制。

Comments 14 pages, 1 figure, 2 tables

详情
AI中文摘要

大型语言模型(LLM)智能体正在从请求-响应助手演变为长时间运行的软件参与者:它们在模型调用之间维护状态,分叉子任务,等待外部事件,请求人类授权,生成工具,并执行必须被恢复和审计的副作用。本文提出Agent libOS,一种受库操作系统启发的LLM智能体运行时基础。Agent libOS运行在传统主机操作系统之上;它不实现硬件驱动、内核模式隔离或POSIX兼容操作系统。相反,它将智能体视为一个AgentProcess:一个可调度的执行主体,具有进程标识、父子谱系、生命周期状态、从AgentImage派生的工具表、类型化对象内存、显式能力、人类队列、检查点、事件和审计记录。其核心设计原则是工具是类似libc的包装器;运行时原语是权限边界。文件系统访问、对象访问、睡眠、人类批准、JIT工具注册和外部副作用都在显式能力和策略下在原语边界进行检查。我们描述了设计、威胁模型、Python原型和面向安全的评估。当前原型实现了异步调度、命名空间本地对象内存、运行时集成的人类批准、一次性权限授予、每进程工作目录、shell和图像注册原语、基于libOS系统调用代理的Deno/TypeScript JIT工具、文件系统/对象桥接工具、可注入的资源提供者基础、确定性演示、真实模型烟雾脚本以及撰写时的123个回归测试。Agent libOS不是提高规划器准确性,而是展示了一种运行时基础,在该基础上,长时间运行的LLM智能体可以被调度、授权、恢复和审计,而无需将工具分发视为信任边界。

英文摘要

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is that tools are libc-like wrappers, while runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.

2606.03883 2026-06-03 cs.AI cs.LG 版本更新

Reasoning Structure of Large Language Models

大型语言模型的推理结构

Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer

AI总结 针对大型推理模型评估中隐藏不同推理结构的问题,提出基于逻辑谜题的基准测试和将非结构化轨迹转化为可验证推理图的方法,并定义推理效率指标,以量化分析推理拓扑结构。

Comments Accepted at ICML 2026 and presented at the ICLR 2026 workshop on LLM reasoning

详情
AI中文摘要

大型推理模型(LRMs)通常使用最终答案准确率或token数量等指标进行评估。然而,这些指标上的相同分数可能隐藏着根本不同的推理结构。为了解决这一局限性,我们引入了一个可扩展的逻辑谜题LRM基准测试,以及一个将非结构化轨迹转化为包含声明和依赖关系的可验证推理图的流程。这将推理转化为一个结构化的、可测量的对象,其拓扑结构可以定量分析。在此基础上,我们定义了一个推理效率指标,用于量化模型逻辑流的集中程度。我们对开源推理模型的分析表明,结构度量能够区分token数量和准确率所混淆的行为,为诊断失败模式和比较推理如何随谜题难度扩展提供了实用工具。

英文摘要

Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.

2606.03879 2026-06-03 cs.CV cs.AI 版本更新

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

超越编码器累加:衡量多编码器视觉语言模型中编码器的作用

Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

发表机构 * Tsinghua University(清华大学) Tencent(腾讯) University of Macau(澳门大学) University of Science and Technology Beijing(北京科技大学)

AI总结 通过重新训练所有31个非空子集,提出容量-必要性分解和预投影器秩分析,揭示多编码器视觉语言模型中编码器角色并非简单累加,并给出最优配对原则。

详情
AI中文摘要

随着基础模型向融合更多异构视觉流扩展,理解不同编码器在联合训练下的交互成为原则性设计的前提。然而,大型视觉语言模型目前缺乏相应的工具,且参数高效的编码器配置在训练前难以识别。为了重新审视联合训练下的编码器角色,我们在16基准的Cambrian-1套件上,在统一流程下重新训练并评估了五个常见视觉编码器的所有31个非空子集(总计约2万GPU小时),并报告了三个发现。首先,从头重新训练每个子集揭示了与在固定检查点上掩码编码器所得不同的编码器排名,包括哪个编码器整体排名第一。其次,我们将每个编码器的贡献分解为两个维度:容量(编码器自身达到的分数)和必要性(从完整池中移除时的下降)。这两个维度不可互换。配对两个最高容量的编码器是次优的,而将一个高容量锚点与一个自适应补充配对则匹配完整的五编码器模型。在此配对之外添加更多编码器仅带来边际收益。第三,在固定参数数量下,每个编码器的预投影器有效秩解释了残差分数变化。最强的配对结合了一个秩在联合训练中存活的锚点和一个秩在联合训练下扩展的补充,这表明更高秩、更少坍缩的投影器输入对应着编码器-投影器接口处更有利的优化机制。总之,容量-必要性分解和预投影器秩分析,连同通过重新训练进行的全面评估,揭示了多编码器视觉语言模型设计中的方法论差距,并提供了弥补这一差距的具体原语。

英文摘要

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
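
The two axes defined in the abstract can be made concrete with a small sketch. It enumerates all non-empty subsets of five encoders and computes Capacity (the solo score) and Necessity (the drop when an encoder is removed from the full pool). The encoder names and the synthetic score table are placeholders invented for illustration; they are not the paper's encoders or results.

```python
from itertools import combinations

ENCODERS = ["A", "B", "C", "D", "E"]  # placeholder names, not the paper's encoders


def non_empty_subsets(items):
    """All non-empty subsets; for five encoders this is 2**5 - 1 = 31."""
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def capacity(scores, encoder):
    """Capacity: the score an encoder reaches on its own."""
    return scores[frozenset([encoder])]


def necessity(scores, encoder, pool):
    """Necessity: the score drop when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {encoder}]


# Synthetic scores, purely illustrative: best solo strength plus a small
# bonus for every additional encoder in the subset.
SOLO = {"A": 60.0, "B": 58.0, "C": 50.0, "D": 45.0, "E": 40.0}
subsets = non_empty_subsets(ENCODERS)
scores = {s: max(SOLO[e] for e in s) + 1.0 * (len(s) - 1) for s in subsets}

for enc in ENCODERS:
    print(enc, capacity(scores, enc), necessity(scores, enc, ENCODERS))
```

In this toy table, encoder B has nearly the highest Capacity but low Necessity, because A covers for it when it is removed. This is one way the two axes can fail to be interchangeable.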

2606.03876 2026-06-03 cs.HC cs.AI cs.MA 版本更新

From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

从“是什么”到“怎么样”和“为什么”:与远程家庭成员共享老年人被动追踪数据的LLM生成回顾性摘要

Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan, Elizabeth Mynatt, Varun Mishra

发表机构 * Northeastern University(东北大学)

AI总结 本研究利用大型语言模型(LLM)从多模态追踪数据生成回顾性摘要,通过技术探针和访谈重新设计系统,显著提升了远程家庭成员对摘要的满意度、帮助性、信任和接收意愿,并提出了支持其从“是什么”到“怎么样”和“为什么”的认知转变的设计启示。

详情
AI中文摘要

随着现代普适计算技术的日益普及,多模态追踪系统有望为远程家庭成员(RFM)等利益相关者提供及时的意识和 reassurance,这些成员在老年人护理协调中扮演核心角色。然而,将异构数据流整合为高层次、有意义的内容(如回顾性摘要)仍然具有挑战性。虽然近期工作已展示了大型语言模型(LLM)在解释多模态追踪数据方面的潜力,但针对像RFM这样拥有丰富个人知识、强烈情感责任但对老年人日常生活了解有限且照护能力受限的利益相关者生成叙事性描述的研究仍较少。在本工作中,我们探索了如何利用LLM为老年人的RFM从多模态追踪数据生成回顾性摘要。我们利用并定制了现有系统Vital Insight,在不同日期和数据可用性场景下生成初始摘要作为技术探针,并对11名RFM进行访谈以收集反馈。基于这些见解,我们将系统重新设计为一种多层、多智能体、洞察驱动的摘要方法,从客观统计和描述构建到丰富、上下文感知的叙述。随后,我们通过同一11名RFM的调查比较了重新设计的摘要与初始版本,发现满意度、感知帮助性、信任和接收意愿均有显著提升。最后,我们提出了针对RFM及更广泛场景的AI生成摘要的设计启示,强调需要支持RFM的认知转变,从简单地呈现“收集了什么数据”转向解释“我的亲人过得怎么样”和“为什么”。

英文摘要

With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content, such as retrospective summaries, remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting "What" data were collected, to explaining "How" is my loved one doing and "Why".

2606.03867 2026-06-03 cs.CL cs.AI 版本更新

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

一种基于LLM和知识图谱的无训练混合智能体框架用于多文档摘要

Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong

发表机构 * Faculty of Artificial Intelligence and Data Science, Phenikaa University(人工智能与数据科学学院,泛尼克大学) VNPT AI, VNPT Group(VNPT AI,VNPT集团) MobiFone Research and Development Center, MobiFone Corporation(MobiFone研发与开发中心,MobiFone公司) Business AI Lab, Faculty of Data Science and Artificial Intelligence, National Economics University, College of Technology(商业人工智能实验室,数据科学与人工智能学院,国家经济大学,技术学院)

AI总结 提出一种无需训练、结合大语言模型和知识图谱的混合智能体框架,通过分解摘要任务为专用智能体(抽取、知识感知抽象、迭代精炼)并利用多视角一致性机制,在英文和越南语数据集上取得领先性能。

Comments Accepted by Neural Computing and Applications

详情
AI中文摘要

多文档摘要(MDS)在从文本数据集合中提取关键信息方面发挥着关键作用。现有方法通常难以捕捉复杂的文档间关系,严重依赖大量标注数据进行监督训练,或在跨领域和跨语言时泛化能力有限。为解决这些限制,我们提出一种无训练的混合智能体框架用于MDS,该框架利用大语言模型(LLM)和知识图谱的互补优势。我们的方法将摘要分解为专门的智能体任务:抽取式选择、知识感知抽象和迭代精炼,每个任务无需特定微调。我们通过由LLM引导的多视角一致性机制统一其输出。在四个英文和越南语数据集上的实验表明,该方法达到了最先进或具有竞争力的性能,验证了我们模块化设计的有效性和适应性。

英文摘要

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

2606.03866 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Taiji: 面向工业LLM增强推荐的帕累托最优策略优化与语义ID权衡

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

发表机构 * Kuaishou Technology(快手科技) Unaffiliated(无隶属)

AI总结 提出Taiji框架,通过逆向工程推理和开放拒绝采样生成高质量CoT数据,并采用帕累托最优策略优化(POPO)自适应调整跨域奖励权重,实现LLM语义知识与推荐ID特征的帕累托最优权衡,在快手广告平台部署后服务超4亿日活用户。

Comments 8 pages, 2 figures

详情
AI中文摘要

通过大型语言模型(LLM)扩展推荐系统已成为工业界的显著趋势。然而,通过后训练(如SFT和RL)将LLM的语义空间与推荐系统的ID空间对齐仍然具有挑战性。现有的LLM4Rec范式受到两个主要问题的瓶颈:(1)在SFT期间,难以衡量和改进开放域推荐中的思维链(CoT)质量;(2)在RL对齐过程中,忽略了LLM语义奖励与推荐偏好奖励之间的权衡。受这些挑战启发,我们提出了Taiji,一种专为工业推荐系统设计的新型LLM-as-Enhancer框架。为了克服SFT瓶颈,我们利用逆向工程推理和开放拒绝采样生成高质量、领域特定的CoT数据。为了解决RL对齐问题,我们提出了帕累托最优策略优化(POPO),它自适应调整跨域奖励权重。理论上,它在LLM的语义世界知识与代表在线用户偏好的协同ID特征之间实现了最优权衡。大量的离线评估和在线A/B测试验证了Taiji的有效性。自2026年5月在快手广告平台部署以来,Taiji目前每天服务超过4亿用户,产生了显著的商业收入,并展示了其在网络规模环境中的强大可扩展性。

英文摘要

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

2606.03858 2026-06-03 cs.AI 版本更新

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

PyraMathBench: 评估与提升大型语言模型的数学能力

Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He

发表机构 * East China Normal University(华东师范大学) Hasso Plattner Institute, University of Potsdam(波茨坦大学哈索普兰特纳研究所)

AI总结 提出PyraMathBench分层基准测试,通过整合数值处理与数学推理评估LLM,并引入SOLVE模块和IRPO优化方法提升数值-数学协同能力。

详情
AI中文摘要

尽管数值推理作为大型语言模型(LLM)在各类应用中数学能力的基石具有关键作用,但很少有基准测试通过整合数值处理与数学推理来评估LLM,这阻碍了数学任务中失败的可解释性。我们引入了PyraMathBench,一个全面的分层基准测试,包含来自7,404道数学文字题的32,505个问题,涵盖4个关键认知方面、14个子类别和2种模态。实验表明,LLM的性能因数值计算不足和对抽象数值问题的处理薄弱而严重受损。为解决这一问题,我们提出了智能优化与学习型多功能模块(SOLVE)和交互式相对策略优化(IRPO),通过高效的工具调用(模糊匹配和低质量调用拒绝)增强LLM的数值-数学协同能力。对比实验显示,Qwen-2.5在SOLVE和IRPO训练下获得了5.0分的提升。

英文摘要

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show that Qwen-2.5 achieves a 5.0-point score improvement with SOLVE and IRPO training.

2606.03852 2026-06-03 cs.SE cs.AI 版本更新

FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement

FLARE: 面向大语言模型代码精炼的细粒度诊断反馈

Yinsheng Yao, Hongxiang Zhang, Weixi Tong, Tianyi Zhang

发表机构 * Tongji University(同济大学) Purdue University(普渡大学)

AI总结 提出FLARE框架,利用轻量级诊断模型预测行级可疑信号进行缺陷定位和代码精炼,通过Top-K候选搜索提升修复效果。

详情
AI中文摘要

大型语言模型生成的代码常含有错误。现有方法依赖测试失败和自批评等反馈信号来迭代精炼生成的代码,但这些信号要么过于粗粒度,要么过于高层,不足以告知模型何处需要修复。在本工作中,我们提出了Flare,一个迭代框架,配备轻量级诊断模型,用于预测行级可疑信号以进行缺陷定位和代码精炼。鉴于诊断预测固有的不确定性,Flare搜索前K个最可疑区域,并根据执行结果选择最佳候选。在LiveCodeBench和BigCodeBench上使用五个基础LLM的实验表明,即使没有候选搜索(k=1),Flare也以1.72%到7.42%的绝对提升优于最强基线。此外,与无候选搜索相比,搜索10个候选平均提升8.50%。单独评估时,我们的轻量级诊断模型与最近的缺陷定位方法相比取得了最佳性能,表明它能提供可靠的细粒度代码精炼指导。

英文摘要

Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to iteratively refine the generated code. Such signals are either too coarse-grained or too high-level to tell the model where the bug should be fixed. In this work, we present Flare, an iterative framework with a lightweight diagnostic model that predicts line-level suspiciousness signals for bug localization and code refinement. Given the inherent uncertainty of diagnostic predictions, Flare searches over the top-k suspicious regions and selects the best candidate according to execution outcomes. Experiments on LiveCodeBench and BigCodeBench with five base LLMs show that, even without candidate search (k=1), Flare outperforms the strongest baseline by 1.72% to 7.42% in absolute terms. Furthermore, searching over 10 candidates yields an average improvement of 8.50% compared with no candidate search. When evaluated in isolation, our lightweight diagnostic model achieves the best performance compared with recent fault localization methods, demonstrating that it can provide reliable fine-grained guidance for code refinement.

2606.03846 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

聚类自评估:一种简单而有效的大型语言模型不确定性量化方法

Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng, Yutaka Matsuo, Yusuke Iwasawa

发表机构 * The University of Tokyo(东京大学)

AI总结 提出一种基于语义聚类和多项选择概率的简单自评估方法,用于大型语言模型的不确定性量化,在多个模型和数据集上优于基线方法。

Comments Findings of ACL 2026

详情
AI中文摘要

大型语言模型(LLM)在各种任务中表现出色,但常常生成看似合理实则事实错误的回答。这一问题因缺乏明确的不确定性估计而加剧,使用户难以判断模型输出的可靠性。现有的不确定性量化方法通常依赖间接信号,如生成样本的熵。这些信号难以解释,且未充分利用模型评估自身不确定性的能力。我们提出一种简单而有效的自评估方法用于LLM的不确定性量化。我们的方法将生成样本分组为语义不同的聚类,将其转化为结构化多项选择题的答案选项,并使用LLM分配给每个选项的概率作为置信度估计。在多个模型和数据集上的实验表明,我们的方法始终优于基线方法。值得注意的是,仅需两个额外样本即可达到竞争性能,证明了其有效性和效率。

英文摘要

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.
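
The three steps named in the abstract (cluster the sampled generations, turn the clusters into multiple-choice options, read option probabilities as confidence) might be sketched as follows. Several pieces are assumptions for illustration only: clustering is done by normalized string equality as a stand-in for semantic clustering, the prompt wording is invented, and the option logits are made-up numbers rather than model output.

```python
import math
from collections import OrderedDict


def cluster_by_normalized_text(samples):
    """Stand-in for semantic clustering: group answers that match after normalization."""
    clusters = OrderedDict()
    for sample in samples:
        key = sample.strip().lower().rstrip(".")
        clusters.setdefault(key, []).append(sample)
    return list(clusters.values())


def build_multiple_choice_prompt(question, clusters):
    """Use one representative per cluster as an answer option (hypothetical wording)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Question: {question}", "Which answer is correct?"]
    for letter, members in zip(letters, clusters):
        lines.append(f"{letter}. {members[0]}")
    lines.append("Answer:")
    return "\n".join(lines)


def option_confidences(option_logits):
    """Softmax over the option-letter logits: one confidence value per cluster."""
    peak = max(option_logits)
    exps = [math.exp(x - peak) for x in option_logits]
    total = sum(exps)
    return [e / total for e in exps]


samples = ["Paris", "paris.", "Lyon", "Paris"]
clusters = cluster_by_normalized_text(samples)
prompt = build_multiple_choice_prompt("What is the capital of France?", clusters)
confidences = option_confidences([2.0, 0.0])  # made-up logits for options A and B
print(prompt)
print(confidences)
```

In a real setting the logits would come from the LLM's next-token distribution over the option letters after the prompt; how the paper extracts and normalizes them is not stated in the abstract.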

2606.03843 2026-06-03 cs.LG cs.AI 版本更新

Re-Evaluating Continual Learning with Few-Shot Adaptation

重新评估带少样本适应的持续学习

Amogh Inamdar, Matthew So, Vici Milenia, Richard Zemel

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出用少样本评估替代零样本评估来更全面衡量持续学习系统的稳定性和可塑性,并通过新指标“每样本可塑性”发现元学习未来任务序列能诱导学习到学习行为。

Comments 21 pages, 16 figures

详情
AI中文摘要

持续学习方法旨在最大化在任务序列上训练的机器学习模型的稳定性和可塑性。稳定性的标准度量(即遗忘)是模型在先前学习任务上的零样本性能,而可塑性则是在最近学习任务上的性能。然而,零样本评估并未完全衡量模型或方法保留已学信息或快速适应新信息的能力,因为它需要在多个任务上完美回忆。在本文中,我们提出少样本评估作为对持续学习系统稳定性和可塑性的更全面评估。我们对持续图像分类的任务序列进行了细粒度评估,发现这一范式为流行持续学习策略的性能提供了新颖的见解。通过使用新指标——每样本可塑性——进行少样本评估,我们展示了通过元学习未来任务的短序列向持续学习方法添加“前瞻性”会在任务序列上诱导学习到学习的行为。

英文摘要

Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and plasticity, the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model or method's ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive assessment of the stability and plasticity of a continual learning system. We conduct a fine-grained assessment on task sequences for continual image classification and find that this paradigm produces novel insights into the performance of popular continual learning strategies. Through few-shot evaluation with a novel metric, per-shot plasticity, we show that adding 'foresight' to continual learning methods via the meta-learning of a short sequence of future tasks induces learning-to-learn behavior over the task sequence.

2606.03841 2026-06-03 cs.AI 版本更新

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

EvoDS: 具有技能学习和上下文管理的自进化自主数据科学智能体

Zherui Yang, Fan Liu, Yansong Ning, Hao Liu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出EvoDS,通过自主技能获取和自适应上下文压缩策略,结合强化学习训练,使数据科学智能体能够自进化并显著提升多阶段迭代任务的性能。

Comments Accepted by KDD2026

详情
AI中文摘要

近年来,大语言模型(LLM)智能体的进展为自动化数据科学带来了有希望的突破。然而,现有方法仍然受到静态动作集和缺乏原则性长程上下文管理的根本限制,阻碍了它们在多阶段、迭代数据科学流程中积累跨任务可重用经验并可靠运行的能力。为了解决这些挑战,我们引入了EvoDS,一个自进化的自主数据科学智能体,通过智能体强化学习学会扩展其技能并自适应地管理长期上下文。具体来说,EvoDS引入了两个关键策略:(1)自主技能获取(ASA)机制,使智能体能够合成、验证和重用可执行技能;(2)自适应上下文压缩(ACC)策略,将上下文管理视为一个学习控制问题而非被动截断。这些策略在一个两阶段多智能体训练方案中协调,使EvoDS能够随时间自主改进。理论上,我们证明了EvoDS的分层设计减少了工具选择错误,其优化目标与信息瓶颈原理一致,确保了高效的上下文使用。实验上,EvoDS在四个不同基准测试中平均优于最先进的开源数据科学智能体28.9%,同时消除了超出令牌限制的失败。我们的代码和数据可在 https://github.com/usail-hkust/EvoDS 获取。

英文摘要

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

2606.03829 2026-06-03 cs.AI 版本更新

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

BigFinanceBench: 一个基于工作流的金融研究智能体基准

Alex Wang, Georg Meinhardt, Jacob Katz, Joseph H. Kim, Pratyush K. Chaudhary, Chase Blagden, Eric Xu

发表机构 * Rogo OpenAI

AI总结 针对金融研究答案可审计推导过程未被充分评估的问题,提出包含928个专家编写任务、每个任务附带点权评分标准的BigFinanceBench基准,用于评估完整推导过程而非仅最终答案,实验表明最佳系统仅达58.8%评分,存在显著提升空间。

详情
AI中文摘要

金融研究答案只有在其他分析师能够审计其产生过程(包括选择哪个来源、哪个时期和会计定义、做出哪些假设以及如何进行计算)时才具有决策相关性。现有的金融基准主要评估孤立的子技能或最终答案,而忽略了可审计的推导过程本身。我们引入了BigFinanceBench,一个包含928个专家编写的开放式金融研究任务的基准,其中每个任务将一个真实参考答案与一个点权评分标准配对,该评分标准将推导过程分解为可独立检查的步骤。BigFinanceBench基于工作流,因为它评估完整的推导过程而不仅仅是最终输出。在36,241个评分点上,该基准支持部分信用评估和跨分析师工作流的故障定位。评估十个当前前沿和开放权重智能体,我们发现存在显著提升空间:最佳系统仅达到58.8%的评分,最终答案准确性是推导质量的有用但有损的代理指标,并且模型能力在金融工作流中不均匀变化。

英文摘要

Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.
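
To make the idea of a point-weighted rubric with partial credit concrete, here is a minimal sketch. The `RubricItem` structure, the example steps, and the scoring rule (earned points divided by total points) are assumptions for illustration; the abstract does not specify the benchmark's exact aggregation.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    step: str        # independently checkable derivation step
    points: int      # weight assigned by the rubric author
    satisfied: bool  # grader's verdict for this step


def rubric_score(items):
    """Fraction of rubric points earned, so partially correct derivations get partial credit."""
    total = sum(item.points for item in items)
    earned = sum(item.points for item in items if item.satisfied)
    return earned / total if total else 0.0


def failed_steps(items):
    """Localize failures: which steps of the workflow were missed."""
    return [item.step for item in items if not item.satisfied]


# Hypothetical rubric for a single task.
rubric = [
    RubricItem("uses the correct filing as source", 3, True),
    RubricItem("uses the correct fiscal period", 2, True),
    RubricItem("computes the ratio correctly", 5, False),
]
score = rubric_score(rubric)
print(score, failed_steps(rubric))
```

A final-answer-only metric would mark this example as simply wrong, whereas the rubric view records that sourcing and period selection were right and the calculation was not.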

2606.03823 2026-06-03 cs.AI cs.CY cs.NE 版本更新

Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization

基于遗传优化的稀疏道路观测城市交通仿真校准

Hunter Sawyer, Jesse Roberts, Simon Matei

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出一种基于遗传算法的框架,利用稀疏道路观测数据校准城市交通仿真,无需详细就业数据,生成与真实测量高度相关的交通流和就业分布。

详情
AI中文摘要

城市交通仿真是基础设施规划的关键工具,包括电动汽车充电站的布局。然而,许多城市的逼真交通仿真受到两个基本数据限制的阻碍:大多数城市只有一小部分道路段有详细的真实交通测量数据,而建模通勤交通所需的就业分布数据很少以仿真所需的分辨率提供。本文提出一个基于遗传算法的框架,直接解决这两个限制,从稀疏道路观测中校准城市交通仿真,无需详细的就业位置数据。使用北卡罗来纳州格林斯博罗的SUMO交通仿真平台,我们的方法优化了就业分布和门控交通参数,使仿真交通与已知交通流率的一小部分道路样本对齐。我们证明,该方法产生的仿真交通与真实测量高度相关,能泛化到训练中未包含的道路段,并且产生的就业分布与人口普查就业数据在定性上具有良好的一致性,尽管从未直接在该就业数据上训练。这项工作表明,可以从最少的真实观测实现逼真的城市交通仿真,提供一种可扩展且数据轻量的仿真校准方法,降低了在不同城市部署交通模型的障碍。

英文摘要

Urban traffic simulation is a critical tool for infrastructure planning, including the placement of electric vehicle charging stations. However, realistic traffic simulation across many cities is hindered by two fundamental data limitations: detailed real-world traffic measurements are available for only a small fraction of road segments in most cities, and employment distribution data critical for modeling commuter traffic is rarely available at the resolution needed for simulation. This paper presents a genetic algorithm-based framework that directly addresses both limitations, calibrating urban traffic simulations from sparse road observations without requiring detailed job location data. Using the SUMO traffic simulation platform for Greensboro, North Carolina, our approach optimizes job distributions and gate-traffic parameters to align simulated traffic with a small sample of roads with known traffic-flow rates. We demonstrate that this approach produces simulated traffic that correlates well with real-world measurements, generalizes to road segments withheld from training, and produces job distributions that show promising qualitative agreement with census employment data despite never directly training on that employment data. This work demonstrates that realistic urban traffic simulation can be achieved from minimal real-world observations, offering a scalable and data-light approach to simulation calibration that reduces the barrier to deploying traffic models across diverse cities.
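
The calibration loop can be illustrated with a toy genetic algorithm. In this sketch a fixed linear map stands in for a SUMO run, the job counts are the parameters being evolved, and fitness is the squared error on the few observed roads. The road mix, population settings, and mutation scale are all invented for the example and do not reflect the paper's setup.

```python
import random

# Toy stand-in for a SUMO run: flow on each road is a fixed linear mix of zone job counts.
ROAD_MIX = [[0.5, 0.2, 0.1], [0.1, 0.6, 0.2], [0.3, 0.1, 0.7], [0.2, 0.2, 0.2]]
TRUE_JOBS = [100.0, 40.0, 70.0]  # hidden ground truth, used only to synthesize observations
OBSERVED_ROADS = [0, 2]          # sparse observations: 2 of the 4 roads


def simulate(jobs):
    return [sum(w * j for w, j in zip(row, jobs)) for row in ROAD_MIX]


OBSERVED_FLOW = {r: simulate(TRUE_JOBS)[r] for r in OBSERVED_ROADS}


def error(jobs):
    """Squared error between simulated and observed flow on the observed roads only."""
    flows = simulate(jobs)
    return sum((flows[r] - OBSERVED_FLOW[r]) ** 2 for r in OBSERVED_ROADS)


def calibrate(generations=60, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 200) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=error)
        history.append(error(pop[0]))
        elite = pop[: pop_size // 5]  # elitism: the best individuals survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]          # uniform crossover
            child = [max(0.0, g + rng.gauss(0, 5.0)) for g in child]  # Gaussian mutation
            children.append(child)
        pop = elite + children
    pop.sort(key=error)
    history.append(error(pop[0]))
    return pop[0], history


best_jobs, history = calibrate()
print(history[0], history[-1])
```

Because only two roads are observed for three unknowns, many job distributions fit equally well. Checking the calibrated parameters against the withheld roads (indices 1 and 3 here) is the kind of generalization test the abstract describes.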

2606.03814 2026-06-03 cs.AI 版本更新

Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

利用BART基于评分标准评估CS1 C++编程作业

Kelsey Rainey, Jesse Roberts

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出基于评分标准的多任务微调BART模型,用于自动评分C++编程作业,通过联合预测数值分数和等级区间并匹配分布,使评分更接近教师行为。

详情
AI中文摘要

本文研究基于评分标准的多任务微调变换器模型,用于自动评分入门级C++编程作业,旨在产生比通用LLM更能反映教师评分行为的分数预测。使用多学期CS1数据,将学生提交的作业与数值分数、字母等级区间和作业评分标准配对,然后预处理为统一的序列用于变换器输入。采用带有LoRA适应的BART编码器-解码器,联合预测数值分数和等级区间,并增加分布匹配项以对齐预测分数和经验分数分布,这是以往工作中常被忽略的评估维度。实验比较了单任务和多任务训练、硬独热与模糊及基于边界的软标签、有评分标准与无评分标准条件,并增加了T5和成对预训练变体。结果表明,具有基于边界的软标签和评分标准上下文的多任务BART在平均绝对误差和分数分布对齐方面优于单任务、硬标签或仅代码基线。完全微调的T5进一步提高了分布保真度,而成对预训练以牺牲少数类敏感性为代价减少了数值误差。总体而言,研究结果表明,校准感知、评分标准引导的训练比优化准确性的替代方案能产生更像教师的评分行为。

英文摘要

This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.

2606.03812 2026-06-03 cs.AI 版本更新

Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

通过智能体对话危害识别分析增强操作安全性

Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield, Tirthankar Ghosal

发表机构 * Oak Ridge National Laboratory(橡树岭国家实验室)

AI总结 提出HAZDIAL框架,利用结构化多智能体多轮对话(对抗性辩论与建设性讨论)改进基于NLP的危害识别质量,并通过算法优化智能体交互,实验证明优于单次基线方法。

详情
AI中文摘要

在工业过程控制、自主系统和安全关键系统等高风险领域,操作安全性要求可靠的危害识别。虽然大型语言模型在自动化安全分析任务中显示出潜力,但单次、整体推理是脆弱的:它缺乏安全工程师迭代应用的自校正、深思熟虑和上下文细化。在本文中,我们介绍了HAZDIAL,一个研究结构化智能体对话(多智能体、多轮交互)是否比单次基线提高基于NLP的危害识别质量的框架。我们系统地比较了两种对话模式:对抗性辩论和建设性讨论,并提出了基于算法的智能体交互优化。我们使用标准分类指标(准确率、精确率、召回率、F1)和新颖的对话指标,针对策划的金标准数据集评估所有配置。这项工作推进了对话系统、多智能体推理和AI安全的交叉领域,为对话驱动的危害分析提供了经验证据。

英文摘要

Operational safety in high-stakes domains such as industrial process control, autonomous systems, and safety-critical systems demands reliable hazard identification. While large language models (LLMs) have shown promise in automating safety analysis tasks, single-turn, monolithic inference is brittle: it lacks the self-correction, deliberation, and contextual refinement that safety engineers apply iteratively. In this paper, we introduce HAZDIAL, a framework that investigates whether structured agentic dialogue (multi-agent, multi-turn interaction) improves the quality of NLP-based hazard identification over single-pass baselines. We systematically compare two dialogue modalities, adversarial debate and constructive discussion, and propose an algorithm-based agentic interaction optimization. We evaluate all configurations against a curated golden dataset using standard classification metrics (accuracy, precision, recall, F1) and novel dialogue metrics. This work advances the intersection of dialogue systems, multi-agent reasoning, and AI safety, providing empirical evidence for dialogue-driven hazard analysis.

2606.03811 2026-06-03 cs.CR cs.AI cs.LG 版本更新

AI Agents Enable Adaptive Computer Worms

AI代理实现自适应计算机蠕虫

Jonas Guan, Tom Blanchard, Hanna Foerster, Hengrui Jia, Gabriel Huang, Nicolas Papernot

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) University of Cambridge(剑桥大学) ServiceNow

AI总结 本研究展示了AI代理能够生成针对每个目标的定制攻击策略,利用被感染机器上的大语言模型自我维持并传播,形成自持的AI驱动网络威胁。

详情
AI中文摘要

计算机蠕虫是一种通过在网络中从一台机器复制到另一台机器来传播的恶意软件。传统蠕虫(如WannaCry)利用预定的漏洞,修补这些漏洞即可阻止其传播。本文表明,人工智能(AI)代理实现了一种根本性的新威胁:一种能够针对每个遭遇的目标生成定制攻击策略的蠕虫。该蠕虫寄生性地利用被感染的机器运行开放权重的大语言模型(LLM)以维持其推理能力,或扩展其攻击范围。在部署于Linux、Windows和物联网(IoT)设备的机器网络上,该蠕虫通过利用常见的现实企业网络漏洞进行传播。由于蠕虫由窃取的计算资源驱动,攻击者每次新感染所需的边际成本为零。这在攻击者和防御者之间造成了不稳定的经济不对称。此外,由于蠕虫不需要商业AI平台,集中式安全控制(如服务拒绝或速率限制)在结构上无关紧要。我们的结果表明,自持的AI驱动网络威胁不再是理论上的。我们必须为自主的生成式对手做好准备:这些恶意软件系统无需人类操作员即可传播,其定义不是固定的利用代码,而是实时推理目标、适应观察并合成攻击逻辑的能力。

英文摘要

A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, exploited predetermined vulnerabilities, and their spread can be halted by patching those vulnerabilities. Here we show that artificial intelligence (AI) agents enable a fundamentally new threat: a worm that generates tailored attack strategies to each target it encounters. The worm parasitically uses compromised machines to run open-weight large language models (LLMs) to sustain its reasoning, or extend its reach for further attacks. Deployed on a network of machines spanning Linux, Windows, and IoT (Internet of Things) devices, the worm propagated by exploiting common, real-world corporate network vulnerabilities. Since the worm is powered by stolen compute, the attacker's marginal cost per new infection is zero. This creates a destabilizing economic asymmetry between attackers and defenders. Moreover, because the worm requires no commercial AI platform, centralized safety controls, such as service refusals or rate limiting, are structurally irrelevant. Our results demonstrate that self-sustaining AI-driven cyber-threats are no longer theoretical. We must prepare for autonomous generative adversaries: malware systems that propagate without human operators and are defined not by fixed exploit code, but by the capacity to reason about targets, adapt to observations, and synthesize attack logic in real time.

2606.03808 2026-06-03 cs.LG cs.AI cs.CR 版本更新

PURGE: Projected Unlearning via Retain-Guided Erasure

PURGE: 通过保留引导擦除的投影遗忘

Vedant Jawandhia, Daksh Ahuja, Ghufran Alam Siddiqui, Prashant Trivedi, Yash Sinha, Pratik Narang

发表机构 * BITS Pilani, Pilani Campus, India(印度比斯帕利尼学院)

AI总结 提出一种基于持续学习与机器遗忘对偶性的遗忘算法PURGE,利用梯度投影约束保留损失,并通过多层表示擦除和保留混淆目标实现隐私与效用的平衡。

Comments 13 pages, 10 figures, 6 tables

详情
AI中文摘要

我们提出PURGE,一种基于简单但未被充分利用的观察构建的机器遗忘算法:持续学习(CL)和机器遗忘(MU)本质上是对偶问题。CL试图在不遗忘旧任务的情况下学习新任务;MU试图在不损害保留性能的情况下擦除特定数据,代表了相同基本张力在相反方向上的体现。PURGE通过调整A-GEM(Chaudhry等人,2019)的梯度投影来利用这种对偶性,使得每个遗忘步骤都受到约束,不会增加保留集损失。在此基础上,它执行多层表示擦除,将中间层中遗忘集的激活推向保留分布,以从隐藏表示中移除信息,而不仅仅是在输出层抑制信息。一个关键的设计选择是保留混淆目标:不是将遗忘输出推向均匀分布(我们发现这很容易被成员推断攻击检测到),而是将目标设定为模型在保留数据上的自然混淆模式。这使得遗忘模型难以与从头重新训练的模型区分。两个自调节停止标准(保留损失预算和遗忘准确率目标)让算法自行决定何时停止,无需手动调整训练轮数。在五个数据集(CIFAR-10、MNIST、SVHN、STL10、PathMNIST)上的22个类别级遗忘任务实验中,PURGE始终将保留准确率保持在96%以上,同时实现接近0.5(理想值)的MIA AUROC,在隐私-效用前沿上优于梯度上升、KL均匀分布以及多个已发表的基线方法。

英文摘要

We propose PURGE, a machine unlearning algorithm built on a simple but under-exploited observation: continual learning (CL) and machine unlearning (MU) are fundamentally dual problems. CL tries to learn new tasks without forgetting old ones; MU tries to erase specific data without hurting retained performance, representing the same underlying tension in opposite directions. PURGE leverages this duality by adapting gradient projection from A-GEM (Chaudhry et al., 2019) so that every unlearning step is constrained to not increase the retain-set loss. On top of this, it performs multi-layer representation erasure, pushing forget-set activations in intermediate layers towards the retain distribution to remove information from hidden representations rather than just suppressing it at the output. A key design choice is the retain-confusion target: rather than pushing forget outputs toward the uniform distribution, which we found to be surprisingly easy for membership inference attacks to detect, we instead target the model's natural confusion pattern on retain data. This makes the unlearned model hard to distinguish from one retrained from scratch. Two self-regulating stopping criteria (a retain-loss budget and a forget-accuracy target) let the algorithm decide on its own when to stop, removing the need for manual epoch tuning. In experiments on five datasets (CIFAR-10, MNIST, SVHN, STL10, PathMNIST) across 22 class-level forgetting tasks, PURGE consistently keeps retain accuracy above 96% while achieving MIA AUROC close to 0.5 (the ideal), outperforming gradient ascent, KL-uniform, and several published baselines on the privacy-utility frontier.
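
The gradient-projection constraint mentioned in this abstract can be illustrated with the standard A-GEM projection rule. The sketch below is an assumption about how that rule might be applied to an unlearning step, not PURGE's actual code: `g_forget` is taken to be the gradient of whatever unlearning objective is being minimized, `g_retain` the gradient of the retain-set loss, and the function name is hypothetical.

```python
import numpy as np

def project_forget_gradient(g_forget, g_retain):
    """A-GEM-style projection (Chaudhry et al., 2019), adapted as a sketch.

    If a descent step along g_forget would, to first order, increase the
    retain loss (negative inner product with g_retain), remove the
    conflicting component; otherwise return g_forget unchanged.
    """
    g_forget = np.asarray(g_forget, dtype=float)
    g_retain = np.asarray(g_retain, dtype=float)
    dot = float(g_forget @ g_retain)
    if dot >= 0.0:
        return g_forget
    # dot < 0 implies g_retain is nonzero, so the division is safe.
    return g_forget - (dot / float(g_retain @ g_retain)) * g_retain
```

After projection the inner product with `g_retain` is non-negative, so a small descent step along the returned vector does not increase the retain loss to first order.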

2606.03800 2026-06-03 cs.LG cs.AI 版本更新

Trading Human Curation for Synthetic Augmentation in RLVR

在RLVR中用合成增强替代人工策展

Akshansh, Leonardo Rosa Rodrigues, Michael Korostelev, Youssef Hassan, Mark E. Whiting

发表机构 * Pareto AI

AI总结 研究通过预指定、门控过滤的增强任务替代人工策展任务,在RLVR中实现成本效益权衡,并保持泛化性能。

Comments 21 pages, 5 main-text figures, 4 appendix figures. Preprint

详情
AI中文摘要

高质量训练任务的供应是基于可验证奖励的强化学习(RLVR)在智能体语言模型上的核心瓶颈。每个任务需要一个沙盒环境、一个提示和一个手工编写的奖励函数,只有通过质量标准的任务才能产生有用的训练信号。达到这一质量标准的人工策展在有效RL训练所需的任务数量上无法经济地扩展,而自动生成的任务变体与人工编写任务之间的替代率尚未确定。我们研究在RLVR期间,使用预指定、门控过滤的增强(augmentations)作为额外人工策展的替代品。我们形式化了增强任务与人工任务之间的成本调整权衡率 $\rho_{\text{cost}}$,通过在不同增强比例的训练语料库上进行受控消融实验来测量它,并描述了增强管道的端到端经济学。用增强内容替代额外的人工编写任务,在涵盖代码、指令遵循、推理和多轮智能体函数调用的十个基准测试套件上保持了聚合的留出泛化能力。在合理的 $c_{\text{human}}/c_{\text{aug}}$ 范围内,门控合成与人工RLVR任务之间的成本调整权衡率 $\rho_{\text{cost}}$ 保持在 $[1.4\times, 11.6\times]$ 之间。

英文摘要

The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $\rho_{\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate $\rho_{\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\times, 11.6\times]$ across the plausible $c_{\text{human}}/c_{\text{aug}}$ range.

2606.03796 2026-06-03 cs.NE cs.AI 版本更新

Signed Spiking Neuron Enabled by an Orthogonal-Easy-Axis Magnetic Tunnel Junction

基于正交易轴磁隧道结的有符号脉冲神经元

Huannan Zheng, Jingli Liu, Kezhou Yang

AI总结 提出一种基于正交易轴磁隧道结的紧凑型有符号脉冲神经元,通过自由层和钉扎层的正交易轴实现双极性脉冲生成,并映射磁矩动力学到有符号LIF膜电位演化,在CIFAR-10和CIFAR10-DVS上分别达到91.06%和77.40%的准确率。

详情
AI中文摘要

有符号脉冲神经元携带比标准脉冲神经元更丰富的信息。本文提出一种基于磁隧道结(MTJ)的紧凑型神经元,用于有符号泄漏积分点火(LIF)操作。通过自由层和钉扎层中的正交易轴,该器件能够实现双极性脉冲生成,并将磁矩动力学映射到有符号LIF膜电位演化。Landau-Lifshitz-Gilbert模拟表明,适当的自由层尺寸使器件响应能够遵循有符号LIF方程。一个代表性设计为10 nm x 45 nm x 50 nm,对应纵横比约为2:9:10。使用拟合的器件神经元模型进行网络评估,在CIFAR-10上达到91.06%,在CIFAR10-DVS上达到77.40%,保留了理想有符号LIF神经元的大部分准确率。

英文摘要

Signed spiking neurons carry richer information than standard spiking neurons. This work proposes a compact magnetic tunnel junction (MTJ)-based neuron for signed leaky integrate-and-fire (LIF) operation. With orthogonal easy axes in the free and pinned layers, the device enables bipolar spike generation and maps magnetic-moment dynamics to signed LIF membrane-potential evolution. Landau-Lifshitz-Gilbert simulations show that proper free-layer dimensions allow the device response to follow a signed LIF equation. A representative design of 10 nm x 45 nm x 50 nm corresponds to an aspect ratio of about 2:9:10. Network evaluations using the fitted device-neuron model achieve 91.06% on CIFAR-10 and 77.40% on CIFAR10-DVS, retaining most of the accuracy of ideal signed LIF neurons.
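
To make "signed LIF operation" concrete, here is a minimal discrete-time sketch of a neuron that can emit positive or negative spikes. The abstract does not give the paper's exact equation, so the leak factor, threshold, and soft-reset rule below are illustrative assumptions rather than the fitted device model.

```python
def signed_lif(inputs, leak=0.9, threshold=1.0):
    """Discrete-time signed LIF sketch: emits +1, -1, or 0 at each step."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = leak * v + x          # leaky integration of the input current
        if v >= threshold:
            spikes.append(1)      # positive spike
            v -= threshold        # soft reset (an assumption)
        elif v <= -threshold:
            spikes.append(-1)     # negative spike
            v += threshold
        else:
            spikes.append(0)
    return spikes
```

For example, `signed_lif([0.6, 0.6, -2.0, 0.0])` returns `[0, 1, -1, 0]`: the potential crosses the positive threshold on the second step and the negative threshold on the third.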

2606.03777 2026-06-03 cs.AI cs.CR q-fin.RM 版本更新

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

从控制边界到保险索赔:通过CER框架重构AI中介损失

Alex Leung, Rex Zhang, Kentaroh Toyoda, SiewMei Loh

AI总结 本文提出CER框架(控制边界、证据重构、保险响应),用于诊断和重构由生成式或代理式AI系统导致的损失,以支持保险索赔。

详情
AI中文摘要

通过受保组织的生成式或代理式AI系统产生的AI损失需要状态重构,而不仅仅是事件重构,因为相关状态会随着系统推理、检索、调用工具和行动而改变。相关的问题不仅是发生了什么损失,还包括系统被允许做什么、实际做了什么,以及重构的损失能否支持保险索赔。本文处理受保人的AI系统处于因果链中的损失,包括外部触发的故障,如提示注入、检索增强生成(RAG)投毒、恶意工具输出、凭证滥用和数据投毒。具体而言,本文介绍了CER,一种用于AI残余风险转移的用例级诊断。C(控制边界)询问系统是否具有可执行的操作范围。E(证据重构)询问是否可以从保留的工件中重构系统状态和因果链。R(保险响应)询问重构的损失是否被保险:保险覆盖是否在市场上可用并为受保人投保,以及支持保险索赔所需的证据。本文做出三项贡献:定义了AI特定的重构问题,通过CER操作化该问题,并指定了AI重构的索赔级证据。公开示例包括报道的PocketOS和Replit代理数据库删除事件,以及作为已裁决的输出/依赖案例的Moffatt诉加拿大航空案。关键词:AI系统;CER框架;残余风险转移;代理式AI;生成式AI;AI保险;证据重构。

英文摘要

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.

2606.03770 2026-06-03 cs.DC cs.AI 版本更新

E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

E2LLM:异构边缘/雾环境中高效LLM服务

Truong-Thanh Le, Amir Taherkordi, Hoang-Loc La, Frank Eliassen, Phuong Hoai Ha, Peiyuan Guan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出E2LLM框架,通过复制模型到多设备组并采用模型并行,结合遗传算法聚类和动态规划分区,在资源受限的异构边缘/雾环境中实现高效LLM部署,相比Splitwise基线在高需求下平均等待时间降低50%以上。

详情
AI中文摘要

大型语言模型(LLM)已成为现代应用不可或缺的一部分,但其部署仍具挑战性。除了执行模型本身,实际部署必须解决成本效率、低延迟和最优资源利用问题。传统方法通常假设整个模型可以托管在单个设备上,这在许多现实场景中不成立,尤其是在设备资源受限的边缘和雾环境中。本文介绍了E2LLM,一个旨在在此类资源有限环境中实现高效LLM部署的框架。E2LLM并非简单地将单个模型分区到所有可用设备,而是将完整模型复制到多个设备组(副本),并在每个副本内应用模型并行。每个副本根据其处理输入和输出令牌的效率被分配专门角色PREFILL或DECODER。这种分离利用了LLM推理这两个阶段之间的固有差异。为了有效组织设备,我们利用遗传算法形成最大化系统性能的集群。在每个集群内,我们应用动态规划确定最优分区策略,以最小化模型并行执行中的瓶颈。实验结果表明,我们的方法能够稳健地适应不同工作负载,包括输入和输出令牌长度显著变化的场景。与Splitwise基线相比,E2LLM在高需求条件下将平均等待时间降低了50%以上。

英文摘要

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
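
The within-cluster partitioning step can be pictured as a classic bottleneck-minimizing dynamic program. The sketch below assumes a pipeline in which consecutive layers are assigned to devices in a fixed order and a stage's time is its total layer cost divided by the device's speed; this cost model and the function name are assumptions for illustration, not E2LLM's formulation.

```python
from functools import lru_cache

def partition_layers(layer_costs, device_speeds):
    """Split consecutive layers across devices (in order) to minimize the
    slowest stage's time, where stage time = sum(costs) / device speed.
    Every device receives at least one layer."""
    n, m = len(layer_costs), len(device_speeds)
    prefix = [0.0]
    for c in layer_costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, d):
        # Minimal bottleneck for layers[i:] placed on devices[d:].
        if d == m - 1:
            return (prefix[n] - prefix[i]) / device_speeds[d], (n,)
        result = (float("inf"), ())
        for j in range(i + 1, n - (m - d - 1) + 1):
            stage = (prefix[j] - prefix[i]) / device_speeds[d]
            rest, cuts = best(j, d + 1)
            result = min(result, (max(stage, rest), (j,) + cuts))
        return result

    return best(0, 0)
```

The function returns the bottleneck time together with the end index of each device's slice. With equal devices, `partition_layers([4, 2, 2, 4], [1.0, 1.0])` splits the layers evenly; making the first device twice as fast shifts one more layer onto it.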

2606.03762 2026-06-03 cs.LG cs.AI 版本更新

Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

基于熵引导的工具感知优化用于高效智能体强化学习

Hongye Cao, Nuo Yan, Haoyuan Deng, Ziwei Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Nanyang Technological University(南洋理工大学) China Mobile NineVerse Artificial Intelligence Technology (Beijing) Co., Ltd.(中国移动九章人工智能技术(北京)有限公司) Institute of Artificial Intelligence, NineVerse(九章人工智能研究院)

AI总结 提出TAO-RL框架,通过工具感知轨迹过滤和熵引导探索解决智能体强化学习中工具使用导致的训练不稳定问题,在7个推理基准上优于现有方法。

详情
AI中文摘要

智能体强化学习(RL)使大型语言模型(LLMs)具备工具使用能力,从而显著提升复杂任务的推理性能。然而,整合外部工具常常导致训练不稳定:过度依赖工具会引发输入分布偏移,而过于保守的工具使用则限制了有效探索。为解决这一问题,我们提出统一框架TAO-RL,将工具感知轨迹过滤与熵引导探索相结合,以实现高效策略优化。具体而言,在数据层面,TAO-RL根据两个标准过滤轨迹:丢弃所有工具调用均执行失败的轨迹,以及移除所有轨迹全部正确或全部错误的轨迹,因为这两种情况都会产生退化的优势估计,无法提供有区分度的学习信号。这种联合过滤保留了既具备工具能力又包含信息量的数据,建立了高质量的训练分布。在算法层面,我们引入工具感知的熵引导奖励,重塑工具调用后token的优势函数,鼓励策略在关键决策点探索更多样化的推理路径。这两个组成部分相互增强:轨迹过滤建立了干净且信息丰富的训练基础,而熵引导探索则在关键工具交互节点驱动更强的推理行为。在3种模型规模下的7个具有挑战性的推理基准上的大量实验表明,TAO-RL优于现有方法。

英文摘要

Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, while overly conservative tool use limits effective exploration. To address this issue, we propose a unified framework TAO-RL that couples tool-aware trajectory filtering with entropy-guided exploration for efficient policy optimization. Specifically, at the data level, TAO-RL filters rollout trajectories along two criteria: discarding those where all tool invocations fail to execute, and removing those where all rollouts are either correct or incorrect, as both cases yield degenerate advantage estimates that contribute no discriminative learning signal. This joint filtering retains data that are both tool-capable and informative, establishing a high-quality training distribution. At the algorithmic level, we introduce a tool-aware entropy-guided bonus that reshapes the advantage function at post-tool-call tokens, encouraging the policy to explore more diverse reasoning paths at critical decision points. These two components are mutually reinforcing: trajectory filtering establishes a clean and informative training foundation, while entropy-guided exploration drives stronger reasoning behaviors at critical tool-interaction junctures. Extensive experiments on 7 challenging reasoning benchmarks across 3 model scales demonstrate the superiority of TAO-RL over existing methods.
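
The two data-level filtering criteria can be sketched as follows. This is a hedged illustration only: the `Rollout` structure, the order in which the two filters are applied, the choice to keep rollouts that make no tool calls, and the reading of "degenerate advantage" as a group-relative advantage are all assumptions not stated in the abstract.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    correct: bool
    tool_calls_ok: List[bool]  # execution status of each tool invocation

def keep_rollout(r):
    # Drop a trajectory only if it made tool calls and every one failed.
    return not (r.tool_calls_ok and not any(r.tool_calls_ok))

def keep_group(group):
    # All-correct or all-incorrect groups give no contrast between rollouts.
    outcomes = {r.correct for r in group}
    return len(outcomes) == 2

def filter_batch(groups):
    kept = []
    for group in groups:
        usable = [r for r in group if keep_rollout(r)]
        if usable and keep_group(usable):
            kept.append(usable)
    return kept
```

A group survives only if, after removing trajectories whose tool calls all failed, it still contains both correct and incorrect rollouts, which is what gives a normalized advantage a nonzero spread.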

2606.03755 2026-06-03 cs.AI 版本更新

LAP: An Agent-to-Instrument Protocol for Autonomous Science

LAP:面向自主科学的智能体-仪器协议

Linwu Zhu, Liqiang Gao, Yan Chen, Dan Zhu, Jian Huang

发表机构 * Shiyanjia Lab(实验之家实验室)

AI总结 针对自主科学中智能体与物理仪器连接碎片化的问题,提出LAP协议,通过InstrumentCard、预留机制、安全围栏握手和测量结果模式,实现有状态、安全关键的操作,并与现有生态系统兼容。

Comments 31 pages

详情
AI中文摘要

自主科学正从演示走向基础设施。大型语言模型智能体现在规划实验,而自动驾驶实验室执行实验。然而,每个此类系统都从头重建推理智能体与物理仪器之间的连接,面对的是为确定性软件客户端而非概率性、目标导向的智能体构建的碎片化供应商SDK和标准。最近的智能体互操作性协议明确了智能体生态系统的三个边缘中的两个(Anthropic的模型上下文协议(MCP)标准化了智能体-工具边缘,Google的Agent2Agent(A2A)标准化了智能体-智能体边缘),但两者都没有建模智能体-仪器边缘,其中操作是有状态的、安全关键的、独占的、物理体现的,并产生带有单位、校准和不确定性的测量结果。我们提出了实验室智能体协议(LAP),一种填补这一空白的协议设计。LAP保留了A2A的点对点、发现优先、任务生命周期结构,并增加了四个物理世界原语:(i)InstrumentCard,一种签名的能力和物理限制描述;(ii)用于独占仪器和样品锁定的第一类预留;(iii)安全围栏握手,带有与特定任务及其参数加密绑定的操作员确认令牌,用于控制危险和不可逆操作;(iv)MeasurementResult模式,使每个结果在物理上类型化(QUDT/UCUM)、校准锚定、带有不确定性且可重现。我们规定了角色、六层架构、JSON-RPC方法集、任务和安全状态机、错误模型以及跨实验室联合,并通过协议端到端地走通一个闭环自主实验活动。LAP在传输上与A2A/MCP生态系统兼容,并封装而非取代现有设备标准如SiLA 2和OPC-UA。

英文摘要

Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving laboratories execute them. Yet every such system rebuilds the link between the reasoning agent and the physical instrument from scratch, against fragmented vendor SDKs and standards built for deterministic software clients rather than probabilistic, goal-directed agents. Recent agent-interoperability protocols clarify two of the three edges of an agentic ecosystem (Anthropic's Model Context Protocol (MCP) standardizes the agent-to-tool edge, and Google's Agent2Agent (A2A) the agent-to-agent edge), but neither models the agent-to-instrument edge, where operations are stateful, safety-critical, exclusively owned, physically embodied, and produce measurements with units, calibration, and uncertainty. We present the Lab Agent Protocol (LAP), a protocol design that fills this gap. LAP retains A2A's peer-to-peer, discovery-first, task-lifecycle structure and adds four physical-world primitives: (i) the InstrumentCard, a signed capability and physical-limit description; (ii) first-class reservation for exclusive instrument and sample locking; (iii) a safety-fence handshake with operator-confirmation tokens cryptographically bound to a specific task and its parameters, gating hazardous and irreversible operations; and (iv) a MeasurementResult schema that makes every result physically typed (QUDT/UCUM), calibration-anchored, uncertainty-bearing, and reproducible by construction. We specify roles, a six-layer architecture, the JSON-RPC method set, the task and safety state machines, the error model, and cross-laboratory federation, and walk a closed-loop autonomous campaign through the protocol end-to-end. LAP is transport-compatible with the A2A/MCP ecosystem and encapsulates rather than replaces existing device standards such as SiLA 2 and OPC-UA.

2606.03748 2026-06-03 cs.CV cs.AI 版本更新

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Ultralytics YOLO26: 统一的实时端到端视觉模型

Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu, Fatih Cagatay Akyon, Muhammet Esat Kalfaoglu

发表机构 * Ultralytics

AI总结 本文提出YOLO26,通过双头设计、MuSGD优化器、渐进损失和STAL标签分配策略,实现无NMS的端到端实时检测,并在实例分割、姿态估计等任务上取得一致提升。

Comments 31 pages, 8 figures

详情
AI中文摘要

实时视觉需要准确、高效且易于在不同硬件上部署的模型。YOLO系列因此被广泛部署,但大多数YOLO检测器在推理时仍依赖非极大值抑制,由于分布聚焦损失而携带沉重的检测头,需要长时间的训练计划,并且可能使最小的物体没有正标签分配。我们提出Ultralytics YOLO26,一个统一的实时视觉模型系列,通过协调的架构和训练进展解决了这些限制。YOLO26采用双头设计实现原生无NMS的端到端推理,并完全移除DFL,产生具有无约束回归范围的更轻量头。其训练流程结合了MuSGD(一种从大语言模型训练改编的混合Muon-SGD优化器)、渐进损失(将监督转向推理时头)和STAL(一种保证小物体正覆盖的标签分配策略)。除了检测,YOLO26还为实例分割、姿态估计和旋转检测引入了特定任务的头和损失设计,在任务和尺度上产生一致的增益。该系列涵盖五个尺度(n/s/m/l/x),并在单一流程中支持检测、实例分割、姿态估计、分类和旋转检测,还有一个开放词汇扩展YOLOE-26,用于文本、视觉和提示无关的推理。在所有尺度上,YOLO26在COCO上以1.7-11.8 ms T4 TensorRT延迟实现40.9-57.5 mAP,在精度-延迟帕累托前沿上超越了先前的实时检测器,而YOLOE-26x在文本提示下于LVIS minival上达到40.6 AP。代码和模型可在 https://github.com/ultralytics/ultralytics 获取。

英文摘要

Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.

2606.03743 2026-06-03 cs.AI 版本更新

Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

Proof-Refactor: 将生成的形式化证明重构为模块化工件

Yiming Fu, Peixuan Liu, Zichen Wang, Kun Yuan

发表机构 * Department of Mathematics, Southern University of Science and Technology(南方科技大学数学系) School of Mathematical Sciences, Peking University(北京大学数学科学学院) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心)

AI总结 提出一个名为 Proof-Refactor 的智能体框架,通过四阶段流程(提取候选证明片段、设计辅助声明、形式化证明提取和设计的组件、使用验证组件修复原始证明)将大语言模型生成的形式化证明重构为更模块化、可读、可维护和可重用的工件,在 PutnamBench 和 Putnam2025 上的 Lean 证明重构中优于基线。

Comments 21 pages, 3 figures, 3 tables

详情
AI中文摘要

虽然大语言模型在生成形式化证明方面表现出色,但其输出通常比成熟形式化数学库中的证明更不易读、模块化、可维护和可重用。我们认为这一差距部分源于大多数证明生成流程中隐含的“先编译”目标,这鼓励了整体式或特设的证明脚本,而非库质量的工件。现有的证明质量改进方法通常依赖于显式的、可计算的优化目标。然而在实践中,最易处理且经过实验验证的目标主要是基于长度的,而更高级的质量(如可读性、模块化、可维护性和可重用性)很难简化为可靠的自动度量。我们没有针对单一代理指标优化证明改进,而是采用受人类证明重构工作流程启发的过程引导方法。我们提出了一个智能体框架 $\textbf{Proof-Refactor}$,将证明重构分解为四个阶段:提取候选证明片段、设计辅助声明、形式化证明提取和设计的组件,以及使用验证组件修复原始证明。在来自 PutnamBench 和 Putnam2025 的生成 Lean 证明上,Proof-Refactor 在基于评分标准的重构得分上优于强大的 Claude Code 重构基线,在签名质量和人类可读性方面提升最大。这些结果表明,过程引导的重构可以在不将证明长度作为主要目标的情况下改进证明结构。

英文摘要

While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, modular, maintainable, and reusable than proofs in mature formal mathematics libraries. We argue that this gap stems in part from the compile-first objective implicit in most proof-generation pipelines, which encourages monolithic or ad hoc proof scripts rather than library-quality artifacts. Existing approaches to proof-quality improvement often rely on explicit, computable optimization objectives. In practice, however, the most tractable and experimentally validated objectives are largely length-based, while higher-level qualities such as readability, modularity, maintainability, and reusability are difficult to reduce to reliable automatic metrics. Instead of optimizing proof improvement against a single proxy metric, we take a process-guided approach inspired by human proof-refactoring workflows. We propose an agentic framework $\textbf{Proof-Refactor}$ that decomposes proof refactoring into four phases: extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components. On generated Lean proofs from PutnamBench and Putnam2025, Proof-Refactor improves rubric-based refactoring scores over a strong Claude Code refactoring baseline, with the largest gains in signature quality and human readability. These results suggest that process-guided refactoring can improve proof structure without treating proof length as the primary objective.

2606.03741 2026-06-03 cs.AI 版本更新

When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

何时重新规划:分层潜在推理中的子目标持久性

Ayushi Chadha

发表机构 * Independent Researcher(独立研究者)

AI总结 研究分层潜在推理中稳定性与适应性的权衡,通过封建式管理-工人接口控制子目标持久周期,发现中等周期(3-6步)最优,内在对齐权重存在窄最优值(约0.05)。

Comments Accepted at the Workshop on Compositional Learning: Safety, Interpretability, and Agents (CompLearn), ICML 2026. 10 pages, 2 figures

详情
AI中文摘要

长程推理要求系统承诺中期意图而不变得僵化:过于频繁地重新规划会导致计算无法凝聚成多步结构;承诺时间过长则计划会过时。我们在潜在推理设置中研究这种稳定性-适应性权衡,其中多步计算发生在隐藏状态内部而非外部化的token轨迹。我们扩展了分层推理模型(HRM),采用封建式的管理-工人接口:一个缓慢的高层模块周期性地发出一个归一化的方向性子目标,该子目标持续P个低层步骤,影响工人的隐藏状态更新并提供内在的余弦对齐损失。在ARC和ConceptARC上,我们发现子目标持久性——而非仅仅子目标注入——是关键旋钮:中等周期P∈[3,6]始终优于非常频繁(P=1)和非常长的周期,在P=3时LM损失明显最小(1.544对比P=1时的1.674,基线1.640;在5个种子上重复,均值1.595,标准差0.045)。内在对齐权重λ显示出一个互补的窄最优值(λ≈0.05)。在过去的甜点λ处的受控消融实验隔离出学习到的方向结构——而非架构容量或辅助损失单独——作为当对齐信号超过其最优值时的干扰来源。这些发现共同暗示了潜在推理系统中组合规划的设计原则:中期意图必须在足够多的计算步骤上保持一致,以便形成组合结构。

英文摘要

Long-horizon reasoning requires a system to commit to medium-horizon intent without becoming rigid: re-plan too often and computation never coheres into multi-step structure; commit too long and the plan goes stale. We study this stability-adaptivity tradeoff in the latent reasoning setting, where multi-step computation occurs inside hidden state rather than externalized token traces. We extend the Hierarchical Reasoning Model (HRM) with a feudal-style manager-worker interface: a slow high-level module periodically emits a normalized directional subgoal that persists for P low-level steps, biasing the worker's hidden-state updates and supplying an intrinsic cosine alignment loss. On ARC and ConceptARC, we find that subgoal persistence -- not subgoal injection alone -- is the central knob: moderate periods P in [3, 6] consistently outperform both very frequent (P=1) and very long horizons, with a clear minimum LM loss at P=3 (1.544 vs. 1.674 at P=1, 1.640 baseline; replicated over 5 seeds at mean 1.595, std 0.045). The intrinsic alignment weight lambda shows a complementary narrow optimum (lambda approximately 0.05). A controlled ablation at past-sweet-spot lambda isolates learned directional structure -- not architectural capacity or auxiliary loss alone -- as the source of interference when the alignment signal exceeds its optimum. Together these findings suggest a design principle for compositional planning in latent reasoning systems: medium-horizon intent must be coherent across enough computational steps for compositional structure to form.
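
The persistence mechanism, a subgoal refreshed every P low-level steps and scored with a cosine alignment term, can be sketched in a few lines. The sketch assumes the loss takes the form 1 - cosine and that the manager state is available at every step; both are illustrative assumptions, and the function names are hypothetical. It covers only the loss bookkeeping, not the biasing of the worker's hidden-state updates.

```python
import numpy as np

def unit(v):
    # Normalize to (approximately) unit length; the epsilon avoids division by zero.
    return v / (np.linalg.norm(v) + 1e-8)

def run_worker(manager_states, worker_updates, period):
    """Return the mean cosine alignment loss when the subgoal is refreshed
    every `period` low-level steps and held fixed in between."""
    subgoal, losses = None, []
    for t, delta in enumerate(worker_updates):
        if t % period == 0:
            subgoal = unit(manager_states[t])   # manager emits a new direction
        cosine = float(unit(delta) @ subgoal)
        losses.append(1.0 - cosine)             # intrinsic alignment loss
    return float(np.mean(losses))
```

With a manager state that flips direction at every step, refreshing the subgoal each step (`period=1`) penalizes a steady worker half the time, whereas holding the first subgoal for the whole window keeps the loss near zero. This toy case only shows the mechanics; it says nothing about which period is best in practice.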

2606.03719 2026-06-03 cs.AI 版本更新

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

通过推导图揭示Do-演算推理的结构

Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier

AI总结 本文引入推导图来表示Do-演算规则的应用与组合,刻画了在Do-演算下等价的观测与干预概率的完整空间,并展示了通过最多四次规则应用即可实现等价变换,进而利用等价因果查询产生更有效的估计量。

Comments Accepted at ICML 2026

详情
AI中文摘要

Do-演算定义了干预查询的一般推理系统,允许通过连续应用其规则来转换因果量。这个过程产生了丰富的等价干预表达式空间,但组合和排序这些规则仍然具有挑战性。在这项工作中,我们引入了推导图,它表示Do-演算规则如何应用和组合,并刻画了在Do-演算下等价的观测和干预概率的完整空间。这些图的结构产生了一个简单的过程,最多使用四次Do-演算规则的应用。最后,我们展示了如何将识别算法应用于等价的因果查询,为相同的因果量产生多个有效的估计量,最终得到更有效的估计量。

英文摘要

The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.
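
For reference, the rules being composed are the three standard do-calculus rules, stated here in the usual textbook notation rather than taken from the paper. $G_{\overline{X}}$ denotes the causal graph with all arrows into $X$ removed, $G_{\underline{Z}}$ the graph with all arrows out of $Z$ removed, and $Z(W)$ the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$.

```latex
\begin{align*}
\text{Rule 1: } & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{Rule 2: } & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{Rule 3: } & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```

Rule 1 inserts or deletes an observation, Rule 2 exchanges an action with an observation, and Rule 3 inserts or deletes an action.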

2606.03705 2026-06-03 cs.AI 版本更新

Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

图上的代码:通过大型语言模型在知识图谱上进行迭代式程序化推理

Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen, Kun Su, Fei Wang, Xiaolong Jin, Jin Zhang, Jiafeng Guo, Xueqi Cheng

发表机构 * Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所人工智能安全重点实验室) Shandong University(山东大学) Shandong University-Weihai Research Institute of Industrial Technology(山东大学威海工业技术研究院)

AI总结 提出Code-on-Graph (CoG)框架,通过将知识图谱模式表示为Python类并生成可执行代码,解决现有LLM-KG集成中操作符不灵活和知识注入不可扩展的问题,在WebQSP、CWQ和GrailQA上提升高达10.5%。

详情
AI中文摘要

知识图谱(KGs)被广泛用于缓解大型语言模型(LLMs)的局限性,如知识过时和幻觉。现有的LLM-KG集成框架通常依赖预定义操作符从知识图谱中检索事实知识,并将其注入提示以生成答案。这种范式面临两个关键瓶颈:1)不灵活性:预定义操作符范围有限,因此缺乏足够的组合表达能力来完全捕捉知识图谱问题所需的复杂语义。2)不可扩展性:将事实知识直接注入提示限制了处理大规模事实知识的可扩展性。为了解决这两个瓶颈,我们提出了Code-on-Graph(CoG),一个用于LLM-KG集成的程序化推理框架。具体来说,给定每个推理步骤检索到的事实知识,CoG首先识别相应的知识图谱模式,并将这些模式表示为Python类,这些类作为检索事实的抽象接口。然后,它生成基于这些类的可执行代码,在执行过程中,检索到的事实被实例化为相应类的对象。这种设计实现了灵活的基于代码的推理,同时避免将大规模事实知识直接注入提示。在WebQSP、CWQ和GrailQA上的实验表明,CoG比之前的最先进模型性能提升高达10.5%。

英文摘要

Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.
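The idea of exposing KG schemas as Python classes and reasoning over instantiated objects can be sketched as follows. This is only an illustration under assumptions: the class names, the relation names (`directed_by`, `release_year`), the `instantiate` helper, and the toy triples are all hypothetical and are not CoG's actual interface or data.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str

@dataclass
class Film:
    # Hypothetical schema class: attributes mirror KG relations.
    title: str
    release_year: int = 0
    directed_by: list = field(default_factory=list)

def instantiate(triples):
    """Turn retrieved (subject, relation, object) facts into Film objects."""
    films = {}
    for subj, rel, obj in triples:
        film = films.setdefault(subj, Film(title=subj))
        if rel == "release_year":
            film.release_year = int(obj)
        elif rel == "directed_by":
            film.directed_by.append(Person(obj))
    return list(films.values())

# Toy retrieved facts; in the paper's setting these stay out of the prompt.
facts = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "release_year", "2010"),
    ("Dunkirk", "directed_by", "Christopher Nolan"),
    ("Dunkirk", "release_year", "2017"),
    ("Her", "directed_by", "Spike Jonze"),
    ("Her", "release_year", "2013"),
]

def earliest_film_by(films, director):
    """The kind of code an LLM might generate against the schema classes."""
    candidates = [f for f in films if any(p.name == director for p in f.directed_by)]
    return min(candidates, key=lambda f: f.release_year).title

answer = earliest_film_by(instantiate(facts), "Christopher Nolan")
```

The point of the design is that the model only needs to see the class definitions to write `earliest_film_by`; the facts themselves are bound at execution time.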

2606.03704 2026-06-03 cs.AI cs.CE cs.CY 版本更新

Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making

动态目标选择与防护机制及大语言模型监督在金融决策中的应用

Keigo Sakurai, Takahiro Ogawa, Miki Haseyama, Anjyu Anan, Kei Nakagawa

发表机构 * Hokkaido University(北海道大学) Nomura Asset Management Co., Ltd.(野村资产管理公司) Kobe University(神户大学) Osaka Metropolitan University(大阪公立大学)

AI总结 提出DOSS方法,通过将目标选择建模为分类问题并利用滚动窗口进行顺序更新,结合置信度感知门控和LLM监督,实现金融决策中动态目标选择,降低误选和过度切换风险。

Comments Accepted to the 2nd Workshop on Advances in Financial AI: Towards Agentic and Responsible Systems at ICLR 2026

详情
AI中文摘要

金融决策任务(如股票推荐和投资组合配置)通常估计未来收益和风险,然后为投资者选择交易或配置,所选优化目标往往决定实际表现。然而,由于市场条件随时间变化,固定目标在不同市场状态下可能次优,而依赖潜在状态估计的状态切换流程可能噪声大或延迟,频繁切换会增加交易成本和运营不稳定性。本文提出DOSS(带防护机制的动态目标选择),一种基于学习的选择器,直接从近期收益的可解释统计摘要中为每个时间点选择决策相关的目标函数,从少量候选(如追求收益、规避损失和风险调整)中选择,无需引入中间状态变量。DOSS将目标选择形式化为目标上的分类问题,并通过滚动窗口进行顺序更新以做出前瞻性选择,避免时间泄漏,同时为每个提议输出置信度分数。为缓解部署中的误选和过度切换,DOSS应用置信度感知门控,并带有故障安全机制,将低置信度提议覆盖为保守默认值,并实施与切换频率相关的显式控制。我们进一步通过将大语言模型(LLM)定位为监督组件而非新目标生成器来整合治理:LLM仅限于接受提议目标或将其覆盖为预定义安全默认值,并在需要时由确定性基于规则的约束触发覆盖。

英文摘要

Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then select trades or allocations for an investor, and the chosen optimization objective often determines realized performance. However, because market conditions evolve over time, a fixed objective can be suboptimal across regimes, while regime-switching pipelines that rely on latent regime estimates can be noisy or delayed and frequent switching can increase turnover and operational instability. In this paper, we propose DOSS (Dynamic Objective Selection with Safeguards), a learning-based selector that directly chooses the decision-relevant objective function at each time point from interpretable statistical summaries of recent returns, selecting among a small set of candidates (e.g., return-seeking, loss-averse, and risk-adjusted) without introducing intermediate regime variables. DOSS formulates objective selection as a classification problem over objectives and performs sequential updates with a rolling window to make forward-looking selections without temporal leakage, while also outputting a confidence score for each proposal. To mitigate misselection and excessive switching in deployment, DOSS applies confidence-aware gating with a fail-safe that overrides low-confidence proposals to a conservative default and enforces explicit controls tied to switching frequency. We further integrate governance by positioning a Large Language Model (LLM) as an oversight component rather than a generator of new objectives: the LLM is restricted to accept a proposed objective or override it to a predefined safe default, with deterministic rule-based constraints triggering overrides when needed.
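The confidence-aware gating with a fail-safe and a switching control can be sketched as below. This is a minimal illustration under assumptions, not the DOSS implementation: the objective labels, the thresholds `min_confidence` and `min_hold`, and the rule that the fail-safe bypasses the hold period are all hypothetical choices.

```python
from dataclasses import dataclass

@dataclass
class GateState:
    current: str                 # objective currently in force
    steps_since_switch: int = 0  # how long it has been held

def gate(proposal, confidence, state,
         safe_default="risk_adjusted", min_confidence=0.6, min_hold=5):
    """Decide which objective to use at this time step.

    - Low-confidence proposals are overridden to the conservative default.
    - Otherwise, a switch is allowed only after the current objective has
      been held for at least `min_hold` steps (limits switching frequency).
    """
    if confidence < min_confidence:
        chosen = safe_default            # fail-safe override
    elif proposal != state.current and state.steps_since_switch < min_hold:
        chosen = state.current           # too soon to switch again
    else:
        chosen = proposal
    if chosen != state.current:
        return GateState(chosen, 0)
    return GateState(state.current, state.steps_since_switch + 1)

# Usage: a confident proposal after a long hold is accepted.
accepted = gate("loss_averse", 0.9, GateState("return_seeking", 10))
# A low-confidence proposal falls back to the safe default.
fallback = gate("loss_averse", 0.3, GateState("return_seeking", 10))
# A confident proposal arriving too soon after a switch is held back.
held = gate("loss_averse", 0.9, GateState("return_seeking", 2))
```

An LLM oversight step in the spirit of the abstract would sit after `gate` and could only return either the gated objective or `safe_default`, never a new objective.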

2606.03692 2026-06-03 cs.AI cs.CL 版本更新

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

SkillPyramid:一种用于自我进化智能体的层次化技能整合框架

Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所复杂系统认知与决策智能重点实验室,中国北京) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学人工智能学院,中国北京) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,中国上海) Beijing Academy of Artificial Intelligence, Beijing, China(北京智源人工智能研究院,中国北京)

AI总结 针对智能体缺乏系统性技能构建、积累和迁移的问题,提出SkillPyramid层次化技能整合框架,通过自进化机制在任务执行中组合、验证和吸收新技能,在三个基准上平均奖励提升38.0%,执行步骤减少27.7%。

详情
AI中文摘要

最近的AI智能体可以灵活调用技能来解决复杂任务,但其长期改进从根本上受到缺乏系统性技能构建、积累和迁移的限制。特别是,没有统一的技能整合框架,智能体倾向于在不同任务中冗余构建相似能力,无法有效将经验转化为可复用资产,并且难以将任务特定技能泛化到新场景。为了解决这一限制,我们提出了SkillPyramid,一个技能整合框架,它重用现有技能经验以实现更广泛的任务泛化。在层次化技能拓扑上运行,SkillPyramid进一步引入了一种自进化机制,使智能体能够在任务执行过程中组合、验证和吸收新技能。在ALFWorld、WebShop和ScienceWorld上使用四个骨干模型的实验表明,SkillPyramid将平均奖励提高了38.0%,并将执行步骤减少了27.7%。总体而言,我们的方法将技能集合从静态资源池转变为动态进化系统。

英文摘要

Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.

2606.03689 2026-06-03 cs.LG cs.AI 版本更新

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

保持存活:基于表格基础模型的无删失生存分析

Mariana Vargas Vieyra

发表机构 * GitHub

AI总结 提出一种无需训练的生存回归方法,利用表格基础模型预测事件时间并迭代填补右删失数据,构建加速失效时间模型,在标准基准上表现与需训练的模型相当。

详情
AI中文摘要

生存分析是一种统计框架,用于建模直到某个感兴趣事件发生的时间跨度。它广泛应用于包括医疗保健和客户流失预测在内的多个领域,其适用性的一个核心挑战在于事件时间被部分观测或存在右删失。近年来,表格基础模型因其能够在单次前向传播中执行预测任务而无需数据集特定的参数拟合,引起了广泛关注。尽管取得了成功,但由于右删失的存在,它们在时间-事件数据预测任务中的应用仍然困难。在这项工作中,我们提出了一种无需训练的生存回归方法,通过利用表格基础模型来预测事件时间并迭代地填补右删失数据。我们的方法使用表格基础模型构建加速失效时间模型,除了拟合单个标量参数外无需训练。随后,基于Buckley-James估计器,我们引入了一种非参数上下文内估计器来处理右删失数据。我们在标准生存分析基准上的实验表明,我们的方法与几种需要训练的参数和半参数生存回归模型(包括Cox回归和参数加速失效时间模型)相比具有竞争力。

英文摘要

Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. It is widely used in several domains, including healthcare and churn prediction; a central challenge in its applicability is that the time of the event is often only partially observed, or right-censored. Tabular Foundation Models (TFMs) have attracted significant interest in recent years due to their ability to perform prediction tasks in a single forward pass, requiring no dataset-specific parameter fitting. Despite their success, their application to prediction tasks on time-to-event data remains difficult due to right censoring. In this work, we present a training-free approach to survival regression that leverages TFMs to both predict the time of the event and iteratively impute right-censored data. Our method uses a TFM to construct an Accelerated Failure Time (AFT) model requiring no training beyond fitting a single scalar parameter. Subsequently, by building on the Buckley-James estimator, we introduce a non-parametric in-context estimator for right-censored data. Our experiments on standard survival analysis benchmarks show that our method is competitive with several parametric and semi-parametric survival regression models that require training, including Cox regression and parametric AFT models.

2606.03686 2026-06-03 cs.AI 版本更新

The DeepSpeak-Agentic Dataset

DeepSpeak-Agentic 数据集

Sarah Barrington, Maty Bohacek, Hany Farid

AI总结 本文提出了一个包含37小时人机半结构化对话视频的数据集DeepSpeak-Agentic,用于评估AI代理的自动取证识别、研究人机交互特性,并作为大型语言模型和AI生成语音/面部技术的基准。

详情
AI中文摘要

我们提出了DeepSpeak-Agentic,一个包含超过37小时半结构化对话视频的数据集,对话发生在人类与具身AI代理之间。我们利用该数据集评估AI代理的自动取证识别(音频、视频或文本),研究人机交互的本质,并为驱动具身AI代理的大型语言模型和AI生成语音及面部技术的未来进展提供基准。我们还贡献了一个可扩展的数据采集系统,该系统创建代理,自动将其与人类众包工作者配对,记录指定场景下的视听对话,并在混合流中识别和分离人类与代理。

英文摘要

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.

2606.03685 2026-06-03 cs.LG cs.AI 版本更新

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

监督微调的大语言模型规划器中世界模型恢复的深入探究

Patrick Emami, Nan Qiang, Peter Graf

发表机构 * National Laboratory of the Rockies(落基山国家实验室)

AI总结 通过可解释性实验,研究监督微调如何影响大语言模型在经典规划任务中恢复世界模型的能力,发现微调使模型线性编码动作有效性和状态谓词,且更广泛的状态空间覆盖有助于更准确的世界模型恢复。

Comments 17 pages. Under review at TMLR

详情
AI中文摘要

监督微调(SFT)改进了大语言模型(LLM)中的端到端经典规划,但这些模型是否也学会了表示和推理它们正在解决的规划问题?由于经典规划问题的相对复杂性以及端到端规划生成对LLM的挑战,探索这个问题一直很困难。在我们的工作中,我们设计并执行了一系列可解释性实验,通过检查微调LLM的内部表示和生成能力,全面探究世界模型恢复。我们发现:a) 对有效动作序列进行监督微调使LLM能够线性编码动作有效性和一些状态谓词。b) 难以使用输出概率对动作有效性进行分类的模型可能仍然学习到将有效动作与无效动作分开的内部表示。c) 微调期间更广泛的状态空间覆盖(例如来自随机游走数据)能更准确地恢复底层世界模型。总之,这项工作为将可解释性技术应用于规划LLM提供了一种方法,并产生了有助于揭示LLM中知识表示方式的见解。

英文摘要

Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.

2606.03678 2026-06-03 cs.AI 版本更新

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

EvoDrive: 通过自我改进的LLM智能体实现安全关键自动驾驶的帕累托进化

Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma

AI总结 提出EvoDrive,首个基于LLM的自动化智能体进化框架,通过模拟器接地演员-评论家架构和帕累托存档,在安全关键场景生成中实现对抗性与真实性的多目标优化。

详情
AI中文摘要

生成安全关键场景对于验证和改进自动驾驶系统至关重要,但它本质上需要在最大化对抗性以暴露故障的同时保持真实性。现有方法通常通过手工设计的启发式方法来管理这种权衡,将生成限制在已知的先验知识中,忽视了未充分探索的模式。虽然最近开放式的智能体进化可以突破这一限制,但不受约束的通用智能体缺乏严格的模拟器接地,往往将多目标张力退化为单标量最大化。本文提出了EvoDrive,第一个基于LLM的自动化智能体进化框架,用于多目标场景生成。EvoDrive采用模拟器接地的演员-评论家架构,其中记忆驱动的演员迭代地提出对生成器的改进,评论家过滤掉不可信的候选者,而自我进化的世界评估器将有前途的候选者路由以优化模拟预算。EvoDrive进一步维护一个评估候选者的帕累托存档,以保留多样化的攻击-真实性权衡,并通过模拟反馈指导未来的进化。在MetaDrive和CARLA上的基准测试结果表明,EvoDrive不仅显著扩展了各种生成器的帕累托前沿,而且为策略训练生成了有价值的场景。

英文摘要

Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.
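A Pareto archive of the kind mentioned here keeps only candidates that no other candidate beats on every objective. The sketch below assumes two scores to maximize per candidate, labeled (adversariality, realism); the function names and the score values are hypothetical and are not EvoDrive's code or results.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def insert(archive, candidate):
    """Return a new archive with `candidate` added if it is non-dominated.

    archive: list of (name, scores) pairs, scores are tuples to maximize.
    """
    _, scores = candidate
    # Reject candidates that are dominated by, or identical to, an archived one.
    if any(dominates(s, scores) or s == scores for _, s in archive):
        return archive
    # Drop archived entries that the new candidate dominates.
    kept = [(n, s) for n, s in archive if not dominates(scores, s)]
    return kept + [candidate]

# Hypothetical generators scored as (adversariality, realism).
candidates = [
    ("g1", (0.90, 0.20)),
    ("g2", (0.50, 0.50)),
    ("g3", (0.40, 0.40)),  # dominated by g2
    ("g4", (0.20, 0.90)),
    ("g5", (0.95, 0.25)),  # dominates g1
]

archive = []
for cand in candidates:
    archive = insert(archive, cand)

front = [name for name, _ in archive]
```

Keeping the whole front, rather than the single best weighted sum, is what preserves the diverse attack-realism trade-offs that a scalar objective would collapse.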

2606.03664 2026-06-03 cs.NI cs.AI 版本更新

AUGUSTE: Online-Learning dApp for Predictive URLLC Scheduling

AUGUSTE: 用于预测性URLLC调度的在线学习dApp

Maxime Elkael, Michele Polese, Yunseong Lee, Koichiro Furueda, Tommaso Melodia

AI总结 针对URLLC中调度请求导致的高延迟问题,提出基于在线机器学习的MAC调度框架AUGUSTE,通过预测数据包到达提前分配资源,在真实5G测试平台上实现延迟与资源开销的最佳权衡。

详情
AI中文摘要

超可靠低延迟通信(URLLC)是5G的主要驱动力之一,3GPP为工业自动化、车联网(V2X)、战术边缘网络和无人系统控制等应用设定了1-10毫秒的延迟目标。多年后,真实的5G时分双工(TDD)网络的中位上行链路(UL)往返时间仍在50-70毫秒范围内,这主要是因为用户设备(UE)在发送UL数据之前必须完成调度请求(SR)过程。现有的补救措施,主要是配置授权(CG)调度,仅能消除严格周期性流量的这一开销,并需要跨层同步,这限制了其采用。我们提出了AUGUSTE(通过自适应时间估计实现URLLC的预测性上行授权),这是一种基于学习的介质访问控制(MAC)调度框架,它将在线机器学习(ML)模型嵌入UL调度器中,以预测数据包到达并在发出SR之前主动分配资源。一个自适应状态机在收集无偏到达统计信息的学习阶段和利用学习到的预测仅在预期有流量时进行调度的自信阶段之间交替。我们在运行OpenAirInterface的真实5G测试平台上,针对三种URLLC流量模式(请求-响应、ML边缘推理和周期性自主报告)评估了AUGUSTE,结果表明它在延迟-开销权衡上达到了最佳可行点:它以约十分之一的资源开销(7-10%开销)实现了与始终在线调度相当的中位往返时间(RTT)(约10毫秒,比基于SR的20毫秒基线减半)。

英文摘要

Ultra Reliable and Low Latency Communications (URLLC) was one of the main motivations behind 5G, with 3GPP advertising 1-10 ms latency targets for applications such as industrial automation, Vehicle-To-Everything (V2X), tactical edge networking, and unmanned-system control. Years on, real 5G Time Division Duplexing (TDD) networks still show median Uplink (UL) round-trip times in the 50-70 ms range, largely because of the Scheduling Request (SR) procedure that a User Equipment (UE) must complete before transmitting UL data. Existing remedies, primarily Configured Grant (CG) scheduling, only eliminate this overhead for strictly periodic traffic and require cross-layer synchronization, which has limited their adoption. We propose AUGUSTE (Anticipatory Uplink Grants for URLLC via Self-Adapting Temporal Estimation), a learning-based Medium Access Control (MAC) scheduling framework that embeds online Machine Learning (ML) models in the UL scheduler to predict packet arrivals and proactively allocate resources before an SR is issued. An adaptive state machine alternates between a learning phase that collects unbiased arrival statistics and a confident phase that exploits the learned predictions to schedule only when traffic is expected. We evaluate AUGUSTE on a real 5G testbed running OpenAirInterface across three URLLC traffic patterns (request-response, ML edge inference, and periodic autonomous reporting), and show that it operates at the best achievable point on the latency-overhead trade-off: it matches always-on scheduling's median Round Trip Time (RTT) (around 10 ms, halving the 20 ms SR-based baseline) at roughly one-tenth its resource cost (7-10 percent overhead).

2606.03657 2026-06-03 cs.AI 版本更新

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

诊断大语言模型工具使用中的知识缺口:面向新API获取的智能体基准

Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen

发表机构 * NYU Shanghai(纽约大学上海分校)

AI总结 提出 NovelAPIBench 基准,通过动态发现新API、分解知识包并生成可执行任务,诊断模型在API使用中的六类错误,发现检索与参数调优互补。

Comments 37 pages, 12 figures

详情
AI中文摘要

用于代码生成的大语言模型通常需要使用预训练数据中不存在的API。这不仅仅是回忆函数名:模型必须协调签名、模块路径、输入输出契约、语义和可执行使用模式。现有的新API基准通常是静态的,依赖于粗略的通过/失败指标,或使用可能无法反映真实库演变的合成API。我们引入了NovelAPIBench,一个全自动动态基准,对于任何基础模型和目标库,发现新API,提取分解的知识包,生成可执行编码任务,并将失败样本分配到六个诊断类别。在大约1.9K个任务、四个基础模型和五个领域上,我们比较了通过检索注入的知识与通过参数自适应内化的知识。我们发现知识组件不可互换:使用示例是最强的独立信号,而最佳的双组件设置将签名与机制或示例配对,具体取决于领域和骨干。添加更多上下文,尤其是源代码,可能通过增加导入路径错误而有害。一旦外部知识被移除,参数自适应也不能取代检索;相反,微调主要教会模型如何使用提供的包,并且这种能力可以迁移到保留的库。这些结果表明检索和调优扮演互补角色:检索提供易变的API内容,而调优改进程序性整合。

英文摘要

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.

2606.03655 2026-06-03 cs.AI cs.LO 版本更新

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

命题可废止立场逻辑中的非单调蕴涵

Nicholas Leisegang, Thomas Meyer, Ivan Varzinczak

发表机构 * University of Cape Town and CAIR, South Africa(开普敦大学和CAIR,南非) Université Sorbonne Paris Nord, Inserm, Sorbonne Université, Limics, 93017 Bobigny, France(索邦巴黎北方大学,Inserm,索邦大学,Limics,法国93017博比尼) ISTI-CNR, Pisa, Italy(意大利比萨ISTI-CNR)

AI总结 本文通过引入情境立场条件句,将KLM风格的非单调理性蕴涵关系提升到命题可废止立场逻辑(PDSL)的一个片段中,并证明了该片段可表达为一组情境条件句,进而将基于排序的蕴涵关系(如理性和词典序闭包)从命题情况忠实翻译到PDSL,同时保持复杂度界限。

详情
AI中文摘要

近期在可废止推理领域的研究中,Kraus等人提出的优先语义和蕴涵概念已被应用于模态逻辑。然而,该领域的工作主要集中在可满足性检查以及单调蕴涵关系上,后者在推理上可能较弱。引入这一概念的一个特定模态逻辑是命题立场逻辑,其中的模态可以表达不同视角的观点。这导致了命题可废止立场逻辑(PDSL)的形式化。在本文中,我们提出了一种方法,将(非单调)理性蕴涵关系类从传统的KLM风格推理提升到PDSL的一个片段中。为此,我们通过情境立场条件句扩展了PDSL的表达力,使得我们能够在给定立场的上下文中讨论可废止条件句。这使我们能够用情境条件句重新刻画PDSL的语法,并表明PDSL的一个大片段可以表达为一组情境条件句。然后,我们专注于刻画该片段中的非单调蕴涵,定义了一种方法,将任何基于排序的蕴涵关系从命题情况移植到PDSL情况。这首先在一般情形下描述,然后在理性和词典序闭包的具体情形下考虑,为每个推理提供了到PDSL的忠实翻译。我们还表明,该PDSL片段中的蕴涵检查可以主要使用命题情况下的算法进行,同时保持复杂度界限。

英文摘要

Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal logics. However, work in this field has focussed primarily on satisfiability checking and monotonic notions of entailment, which may be inferentially weak. One particular modal logic where this has been introduced is propositional standpoint logic, where modalities can express the views held from different standpoints. This has resulted in the formalisation of propositional defeasible standpoint logic (PDSL). In this paper, we propose a means of lifting the class of (non-monotonic) rational entailment relations from traditional KLM-style reasoning to a fragment of PDSL. In order to do so, we extend the expressivity of PDSL via situated standpoint conditionals, allowing us to talk about a defeasible conditional holding in the context of a given standpoint. This allows us to re-characterise the syntax of PDSL in terms of situated conditionals, and shows that a large fragment of PDSL is expressible as a set of situated conditionals. We then focus on characterising non-monotonic entailment in this fragment, defining a method to transport any ranking-based entailment relation from the propositional case into the PDSL case. This is first described in the general case and then considered in the specific cases of rational and lexicographic closures, providing a faithful translation of each inference into PDSL. We also show that entailment-checking in this fragment of PDSL can be done largely using algorithms from the propositional case, while preserving complexity bounds.

2606.03648 2026-06-03 cs.CL cs.AI 版本更新

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

微调大语言模型的安全性测量应基于能力

Krishnapriya Vishnubhotla, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko

发表机构 * National Research Council, Canada(加拿大国家研究理事会)

AI总结 通过将微调锚定于特定能力目标,多维度评估微调对模型能力和安全性的影响,发现微调模型对安全提示可能产生不连贯输出、自动安全判断不可靠,且结论因安全基准和评估者而异。

Comments 8 pages plus appendices

详情
AI中文摘要

通过微调将基础大语言模型适应用户的任务或偏好风格可能会损害模型的安全性。先前的研究在有限且看似随机的实验设置中考察了微调对模型安全性的影响。我们认为,将微调锚定于特定的能力目标对于避免任意的经验选择至关重要,这使我们能够得出关于安全性影响的有意义结论,并在一致的基础上比较缓解方法。我们通过关注能力和安全性,对微调对模型行为的影响进行了多维度评估。我们的结果揭示了重要问题:(1) 微调模型可能对安全提示产生不连贯的生成内容,(2) 对于这种不连贯输出,自动安全判断不可靠,(3) 关于微调影响的结论可能因安全基准以及安全评估者的选择而改变。

英文摘要

Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts, and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface important issues that (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.

2606.03647 2026-06-03 cs.CR cs.AI cs.LG 版本更新

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

黑盒、自适应、高效、可迁移、有害、适用……攻击是破解LLM所需的一切

Vincent Limbach, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

发表机构 * University of St. Gallen(圣加仑大学)

AI总结 提出间接危害优化(IHO)方法,通过迭代偏好优化训练掩码扩散语言模型攻击器,实现黑盒、高效、可迁移的自适应攻击,显著提升对分层防御的破解成功率。

详情
AI中文摘要

准确评估对抗鲁棒性是一个长期挑战。有缺陷的攻击设计可能会夸大鲁棒性估计,使得部署风险评估和防御比较不可靠。历史上,像AutoAttack这样的标准化攻击在很大程度上解决了图像分类器的问题,为跨防御的系统比较提供了可靠的评估基线。然而,对于LLM越狱评估,目前还没有等效的方法,而设计这样的攻击要困难得多。一个可靠的攻击必须(除其他外)兼容黑盒、适用于任意防御管道且高效,而现有方法无法同时满足这些条件。我们引入了间接危害优化(IHO),这是一种掩码扩散语言模型攻击器,通过对危害评判器进行迭代偏好优化来训练,仅需对目标进行黑盒访问。相同的方法无需修改即可用作针对个体行为的强自适应攻击,或作为一种高效的摊销策略,无需微调即可迁移到未见行为和未见目标模型。即使面对分层防御(例如,结合辅助检测器的Circuit Breaker训练模型),IHO在攻击成功率上也显著优于最先进的方法,且无需任何防御特定的适应。我们的结果将IHO定位为向那种过去提高了可靠性的标准化越狱评估迈出的实际一步。代码和模型可在GitHub和Hugging Face上获取。

英文摘要

Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.

2606.03645 2026-06-03 cs.LG cs.AI 版本更新

The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models

加法的形状:大型语言模型中算术的几何结构

Liuyuan Wen, Xun Zhu, Lihao Huang, Wenbin Li, Yang Gao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过分析多操作数加法中残差流的几何结构,发现等原始和轨迹(IRST)并建立噪声量化模型,将算术错误解释为由内部神经噪声引起的几何滑移,并利用几何一致性检查方法检测和纠正量化失败。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型在基本算术中表现出矛盾的脆弱性,暗示内部计算与离散输出之间存在脱节。通过分析多操作数加法中的残差流几何结构,我们识别出等原始和轨迹(IRST),这是一种由语义数字锚定并由连续进位纤维调制的几何结构。我们提出噪声量化模型来解释这种几何结构,将算术错误视为由内部神经噪声推动连续的潜在进位势跨越量化阈值引起的几何滑移。这一几何框架进一步阐明了探针多功能性,解释了轻量级探针如何从单个激活向量中解开共存的潜在信号(如真实值与幻觉)。最后,我们通过一种几何一致性检查方法验证了这些见解,该方法在推理过程中有效检测和纠正了这些量化失败。我们的代码可在以下网址获取:https://github.com/RL-MIND/Shape-of-Addition。

英文摘要

Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discrete output. By analyzing the residual stream geometry during multi-operand addition, we identify the Iso-Raw-Sum Trajectory (IRST), a geometric structure where representations are anchored by semantic digits and modulated by continuous carry fibers. We propose the Noisy Quantization Model to explain this geometry, framing arithmetic errors as Geometric Slippages caused by internal neural noise pushing a continuous, latent Carry Potential across quantization thresholds. This geometric framework further elucidates Probe Versatility, explaining how lightweight probes can disentangle coexisting latent signals (such as ground truth versus hallucination) from a single activation vector. Finally, we validate these insights through a geometric consistency check method that effectively detects and corrects these quantization failures during inference. Our code is available at https://github.com/RL-MIND/Shape-of-Addition.

2606.03641 2026-06-03 cs.AI cs.CY 版本更新

Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

LLM医疗分诊中的性别依赖性诊断替代:相同症状,不同紧急程度

Qi Han Wong

发表机构 * GitHub

AI总结 研究大型语言模型在相同神经症状下,仅因患者性别和年龄不同而产生不同的分诊建议,发现年轻女性被系统性低估紧急程度,机制为诊断替代。

Comments 7 pages, 3 tables. Multi-model replication across Gemini, Claude, and GPT. Code and data: https://github.com/wongqihan/ai-behavioral-experiments

详情
AI中文摘要

我们调查了大型语言模型是否会对相同的神经症状,在仅改变患者性别和年龄的情况下,产生不同的医疗分诊建议。使用三个模型家族——Gemini 3.5 Flash、Claude Sonnet 4.6和GPT-5.4-mini——我们呈现了一个标准化的症状特征(持续性头痛、视力模糊、晨起恶心、视觉障碍),跨越七个人口统计学条件:三个年龄组(25、38、65岁)×两个性别(男、女),加上一个性别未指定的基线(每个模型每个条件n=30,共630次试验)。我们发现了一个显著、系统性的性别依赖性分诊差异:年轻女性获得的急诊室转诊率显著低于同龄男性(Gemini:0% vs. 23.3%;Claude:6.7% vs. 96.7%;GPT:6.7% vs. 66.7%,所有p<0.001)。所有模型在65岁年龄组中差异消失。主要机制是诊断替代:模型锚定于与性别相关的诊断,优先将年轻女性分类为特发性颅内高压——一种流行病学上与育龄女性相关的疾病——而将男性诊断为伴有占位性病变的通用颅内压增高。这种诊断闭合将女性患者导向较低紧急程度的护理(门诊医生预约),尽管严重程度评分相当(7-9/10)。我们的发现表明,临床LLM通过使用流行病学先验来抑制分诊紧急程度,复制了已知的人类临床偏见,提示AI分诊引擎必须将紧急程度评估与概率性诊断先验解耦。我们发布了所有代码、提示和原始结果。

英文摘要

We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
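A referral-rate gap of this kind can be checked with a two-sided Fisher exact test on a 2x2 table. The sketch below is an illustration under assumptions: the counts 2/30 and 29/30 are inferred from the reported Claude percentages (6.7% vs. 96.7%) assuming n = 30 per cell, and the abstract does not say which test or pooling the author actually used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )

# Assumed counts: rows are (women, men), columns are (ER referral, no ER referral).
p_value = fisher_exact_two_sided(2, 28, 29, 1)
```

With only 30 trials per cell, smaller gaps are much less decisive than this one, so it is worth recomputing the test from the released raw results rather than from rounded percentages.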

2606.03635 2026-06-03 cs.CV cs.AI 版本更新

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

VidMsg:短视频中隐含信息推断的基准测试

Issar Tzachor, Michael Green, Rami Ben-Ari

发表机构 * OriginAI, Israel(OriginAI以色列)

AI总结 提出VidMsg基准,通过消息优先构建流程和双向检索任务,评估视频理解模型对短视频中隐含信息的推断能力。

Comments Project page: https://iyttor.github.io/VidMsg

详情
AI中文摘要

理解短视频不仅仅是识别可见物体和动作;视频制作者常常在片段中包含潜在的信息或目的。我们引入了VidMsg,一个用于评估互联网原生短视频中隐含信息理解的基准测试。VidMsg包含400个来自YouTube的片段,涵盖9个实际主题领域和52个细粒度目标信息,涉及职业与金融、教育、健康与福祉、文化、安全、可持续性和生活方式等领域。VidMsg通过消息优先流程构建:LLM首先将目标信息转化为间接搜索场景,用于检索候选片段。然后,人工标注者保留那些传达预期信息但不过于直白的片段。VidMsg主要设计用于双向消息-片段检索,适用于视频搜索和推荐等可扩展应用,系统必须捕捉全面的视频理解。除了检索,VidMsg还包括一个诊断性多项选择问答基准,模型需要从语义相关的选项中选出片段的预期信息。与当代视频语言和检索模型的实验表明,强模型在VidMsg上常常失败,因为该任务需要语用推理、上下文线索整合以及语义相近信息的区分。我们还引入了VidVec-Msg,一种改进消息导向检索的基线方法,同时为未来工作留下了足够的提升空间。

英文摘要

Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.

2606.03629 2026-06-03 cs.AI 版本更新

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

TSQAgent: 通过专用智能体推理评估时间序列数据质量

Shunyu Wu, Dan Li, Haozheng Ye, Weibin Feng, Jian Lou, Bo Zhang, Wenjie Feng, Chenjuan Guo, See-Kiong Ng

发表机构 * Sun Yat-sen University(中山大学) China University of Mining Technology(中国矿业大学) University of Science and Technology of China(中国科学技术大学) East China Normal University(华东师范大学) National University of Singapore(新加坡国立大学)

AI总结 提出TSQAgent框架,通过三个协作智能体(感知器、检查员、裁决者)识别相关质量维度并进行定量比较,显著提升LLM在时间序列数据质量评估中的表现。

详情
AI中文摘要

评估时间序列(TS)数据的质量是基础但极具挑战性的任务,因为质量维度具有多面性。最近,大语言模型(LLM)通过成对比较和逐维度评估,成为TS质量评估的一种有前景的范式。然而,现有方法依赖手动预定义的质量维度和纯文本推理,尚不清楚LLM能否识别真正相关的质量维度或进行基于证据的定量质量比较。为探究此问题,我们构建了TSQBench,一个专用基准,用于评估LLM在两种渐进能力上的表现:(i)理解和识别相关质量维度,(ii)在特定维度下进行质量比较。分析表明,当前LLM在维度识别和基于证据的质量比较方面均存在困难。为解决这些局限,我们提出TSQAgent,一种新颖的用于TS质量评级的智能体推理框架,包含三个协作角色:感知器(负责聚焦维度选择)、检查员(负责逐维度定量分析)和裁决者(负责聚合并优化最终判断)。特别地,我们引入一种智能体推理策略,赋予模型识别和优先考虑最相关质量维度的能力,并进一步提出一个配备外部分析工具的智能体工作流,以实现对选定维度的精确定量比较。在提出的基准和11个真实世界数据集上的实验表明,我们的框架不仅显著提升了LLM在质量理解和定量比较方面的能力,而且有效地将这些改进转化为更好的质量感知数据选择,从而提升下游性能和数据效率。

英文摘要

Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensions. Recently, large language models (LLMs) have emerged as a promising paradigm for TS quality assessment via pairwise comparison and per-dimension evaluation. However, existing approaches rely on manually predefined quality dimensions and purely text-based reasoning, leaving it unknown whether LLMs can identify truly relevant quality dimensions or perform grounded and quantitative quality comparisons. To investigate this, we construct TSQBench, a dedicated benchmark for evaluating LLMs on two progressive capabilities: (i) understanding and identifying relevant quality dimensions, and (ii) performing quality comparison under specific dimensions. Our analysis reveals that current LLMs consistently struggle with both dimension identification and evidence-grounded quality comparison. To address these limitations, we propose TSQAgent, a novel agentic reasoning framework for TS quality rating consisting of three collaborative roles: Perceiver for focused dimension selection, Inspector for dimension-wise quantitative analysis, and Adjudicator that aggregates and refines the final judgment. In particular, we introduce an agentic reasoning strategy that instills the ability to identify and prioritize the most relevant quality dimensions, and further propose an agent workflow equipped with external analytical tools to enable precise quantitative comparisons over selected dimensions. Experiments on both the proposed benchmark and eleven real-world datasets demonstrate that our framework not only substantially improves LLMs' capabilities in quality understanding and quantitative comparison but also effectively translates these improvements into better quality-aware data selection, leading to enhanced downstream performance and data efficiency.

2606.03628 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

通过幻觉拒绝采样构建可靠的长文本生成

Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) DeepMind(深度思维)

AI总结 提出分段幻觉拒绝采样框架SHARS,利用任意幻觉检测器在生成过程中拒绝并重采样幻觉片段,以缓解长文本生成中的幻觉累积问题,提升事实一致性。

Comments accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在开放式文本生成方面取得了显著进展,但仍容易产生不正确或无依据的幻觉内容,这损害了其可靠性。在长文本生成中,由于幻觉雪崩现象(早期错误传播并累积到后续输出),这一问题更加严重。为了解决这一挑战,我们提出了一种新颖的推理时幻觉缓解框架,称为分段幻觉拒绝采样(SHARS),该框架使用任意幻觉检测器在生成过程中识别并拒绝幻觉片段,并重新采样直到生成忠实的内容。通过仅保留可信信息并在此基础上构建后续生成,该框架减轻了幻觉累积并增强了事实一致性。为了实例化该框架,我们采用语义不确定性作为检测器,并引入了若干关键修改以解决其局限性并更好地适应长文本。我们的方法使模型能够自我纠正幻觉,无需外部资源(如网络搜索或知识库),同时保持与这些资源的兼容性以便未来扩展。在标准化幻觉基准上的实证评估表明,我们的方法显著减少了长文本生成中的幻觉,同时保持甚至提高了生成的信息量。代码可在以下网址获取:https://github.com/TreeLLi/hallucination-rejection-sampling。

英文摘要

Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.
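The reject-and-resample loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_segment` and `is_hallucinated` are hypothetical callables standing in for the language model and the hallucination detector, and the `max_resamples` cap and the choice to skip a segment once the cap is reached are assumptions of this sketch.

```python
from typing import Callable, List


def segmentwise_rejection_sampling(
    generate_segment: Callable[[str], str],       # hypothetical: context -> next segment
    is_hallucinated: Callable[[str, str], bool],  # hypothetical: (context, segment) -> verdict
    prompt: str,
    num_segments: int,
    max_resamples: int = 5,                       # assumed cap, not taken from the paper
) -> List[str]:
    """Build a long answer segment by segment, keeping only accepted segments."""
    accepted: List[str] = []
    for _ in range(num_segments):
        # Later segments are conditioned only on content that passed the detector.
        context = prompt + "".join(accepted)
        for _ in range(max_resamples):
            candidate = generate_segment(context)
            if not is_hallucinated(context, candidate):
                accepted.append(candidate)
                break
        # If every resample is rejected, this sketch simply skips the segment.
    return accepted
```

Because each new segment is conditioned only on accepted text, a rejected segment cannot propagate into later ones, which is the accumulation effect the abstract refers to as snowballing.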

2606.03626 2026-06-03 cs.CV cs.AI cs.CY 版本更新

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

TurtleAI:海龟图形学中视觉编程的多模态模型基准测试

Chao Wen, Jacqueline Staub, Adish Singla

发表机构 * MPI-SWS(马克斯·普朗克软件系统研究所)

AI总结 提出TurtleAI基准,包含823个基于海龟图形学真实任务的视觉编程任务,评估20多个多模态模型发现成功率低于30%,并通过少量种子样本生成合成数据微调Qwen2-VL-72B提升约20%性能。

Comments ACL Findings 2026 paper

详情
AI中文摘要

视觉语言模型(VLM)已被探索用于视觉编程,即生成代码以解决视觉任务。然而,大多数先前工作侧重于提高生产力的视觉编程;目前尚不清楚当前VLM在教育导向的视觉编程上表现如何,以及哪些因素限制了它们的性能。为填补这一空白,我们引入了TurtleAI,这是一个包含823个任务的基准,这些任务基于海龟图形学领域的真实视觉编程任务精心策划。解决这些任务需要模型感知几何图案、推理空间关系,并合成能忠实再现几何图案的Python代码。我们评估了20多个VLM,包括GPT-5、GPT-4o和Qwen2-VL-72B,发现它们表现显著困难,大多数成功率低于30%。为解决这些限制,我们提出了一种仅需少量种子样本的数据生成技术。在生成的合成数据上微调Qwen2-VL-72B,在真实任务上取得了约20%的提升。我们的失败分析揭示,GPT-4o在空间推理和精确视觉复制方面存在困难,而微调主要改善了视觉推理与代码实现之间的对齐。

英文摘要

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.

2606.03624 2026-06-03 cs.AI cs.CL 版本更新

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

桥接辅助约束以解决大型推理模型中的指令遵循问题

Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of International Relations(国际关系大学) Tencent Jarvis Lab(腾讯Jarvis实验室) Westlake University(西湖大学) King’s College London(伦敦国王学院)

AI总结 针对大型推理模型难以可靠遵循多重约束的问题,提出约束关系图补全框架,通过显式建模约束关系并发现桥接约束,将约束违反率降低39%。

Comments a pre-MIT Press publication version

详情
AI中文摘要

大型推理模型(LRMs)在许多任务中展现出令人印象深刻的能力,但在可靠地遵循多个指令方面存在困难,要么无法满足单个约束,要么难以同时平衡相互竞争的约束。我们将这一挑战形式化为约束遵循问题(CAP)。本文引入了一个新颖的框架,通过将指令表示为约束的结构化知识图来解决CAP。我们的方法,约束关系图补全(CRGC),显式建模约束之间的关系,识别遵循挑战,并发现“桥接约束”,帮助模型更好地聚焦和协调需求。桥接约束作为辅助指令,使主要约束更加突出和兼容。与通过通用训练方法增强指令遵循的现有方法不同,CRGC通过利用模型自身的知识来创建更好的生成路径,从而专门提高约束满足度。在三个流行的指令遵循数据集上的实验表明,与标准提示相比,我们的方法将约束违反减少了39%,同时保持了大型推理模型的推理能力。

英文摘要

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining reasoning abilities of large reasoning models.

2606.03620 2026-06-03 cs.LG cs.AI 版本更新

Physics-Guided Policy Optimization with Self-Distillation

基于物理引导的自蒸馏策略优化

Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei

发表机构 * Amazon(亚马逊)

AI总结 针对自蒸馏策略优化中固定步长导致训练不稳定的问题,提出受粘性流体动力学启发的物理引导策略优化(PGPO),通过互信息估计动态调整步长,在Science-QA数据集上提升性能并保持训练稳定性。

详情
AI中文摘要

自蒸馏策略优化(SDPO)已成为大语言模型后训练的一种流行范式,其中模型根据特权信息从自身预测中学习。然而,SDPO对每次更新步长的信任程度敏感:来自自我教师的修正可能在某些批次上信息丰富,而在其他批次上具有误导性,若以固定步长统一应用,会破坏训练稳定性。受粘性流体动力学启发,并在随机微分方程层面形式化类比,我们提出物理引导策略优化(PGPO),该方法引入一个基于学生预测与反馈条件教师之间互信息估计的信息调制步长乘子。我们证明这种调制保留了普通SGD的一阶弱近似保证,且每次迭代的额外开销可忽略。我们在Science-QA数据集上评估PGPO,它在4个领域中的3个上优于SDPO,提升高达+4.5个点,同时在SDPO训练后期崩溃的设置中保持稳定。

英文摘要

Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.

2606.03618 2026-06-03 cs.AI 版本更新

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

跨语言令牌套利:通过本地LLM预处理优化代码智能体上下文窗口

Mehmet Utku Colak

发表机构 * GitHub

AI总结 提出一种预处理的边缘端提示重写中间件,利用本地Llama 3.2模型进行跨语言翻译和结构重写,在保持或提升任务准确率的同时减少34-47%的提示令牌和最高18.8%的总令牌消耗。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

AI辅助编码智能体受到输入令牌成本的瓶颈限制。原始人类输入的两个病理现象导致了大部分开销:非英语文本的令牌化低效和对话提示中的结构熵。现有方法通过压缩已经臃肿的上下文或在失败发生后进行干预来被动应对。我们引入了一种预处理的边缘端提示重写中间件,在开发者和云智能体之间运行。本地Llama 3.2(3B)模型执行跨语言翻译成英语、结构重写为紧凑的任务导向格式,以及正则表达式验证的重写-回退保护,确保优化后的提示永远不会大于原始提示。我们在OMH-Polyglot(一个涵盖土耳其语、阿拉伯语、中文和代码混合规范的多语言编码基准)上进行评估。在三个商业LLM后端上,该中间件将提示令牌减少了34-47%,总令牌减少了最多18.8%,同时保持或提高了任务准确率。消融研究表明,收益主要来自重写阶段,而非简单的函数名提取。与LLMLingua-2在匹配压缩率下相比,我们的方法在所有评估后端上始终获得更优的OckScore性能。这些结果表明,主动提示优化可以在不牺牲编码质量的情况下大幅降低推理成本。

英文摘要

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.
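The rewrite-with-fallback safeguard mentioned in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: `rewrite` is a hypothetical stand-in for the local model call, the whitespace token counter is a placeholder for a real tokenizer, and the backtick-identifier check is only one possible form of regex validation, not necessarily the one the paper uses.

```python
import re
from typing import Callable


def rewrite_with_fallback(
    prompt: str,
    rewrite: Callable[[str], str],  # hypothetical local-model rewriter
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # placeholder tokenizer
) -> str:
    """Return the rewritten prompt only if it is valid and no larger than the original."""
    candidate = rewrite(prompt)
    # Assumed validation rule: every `identifier` quoted in the original must survive.
    identifiers = re.findall(r"`([^`]+)`", prompt)
    keeps_identifiers = all(name in candidate for name in identifiers)
    if keeps_identifiers and count_tokens(candidate) <= count_tokens(prompt):
        return candidate
    # Fall back to the untouched prompt, so the output is never larger than the input.
    return prompt
```

The guard makes the middleware safe to leave enabled: in the worst case it forwards the original prompt unchanged.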

2606.03608 2026-06-03 cs.LG cs.AI 版本更新

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

利用验证-生成差距:基于置信度条件验证的测试时强化学习

Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng

发表机构 * Sun Yat-Sen University(中山大学) University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 提出TTRL-CoCoV框架,通过置信度自适应机制解决无标签设置下Pass@k优化中的伪标签错误和多样性崩溃问题,显著提升Pass@1和Pass@k性能。

详情
AI中文摘要

测试时强化学习已成为一种有前景的范式,用于在完全无标签的方式下增强大型语言模型的复杂推理能力。尽管现有研究关注Pass@1性能,但在无标签设置下优化Pass@k(衡量生成覆盖率以支持持续探索)仍未被充分探索且至关重要。在无标签设置下优化Pass@k极具挑战性,因为直接应用对RLVR有效的Pass@k优势设计会导致性能不佳。通过深入的实证分析,我们发现阻碍性能的根本原因:低置信度样本的伪标签估计很可能不正确,而高置信度样本的候选答案则遭受严重的多样性崩溃。为克服这些障碍,我们提出TTRL-CoCoV(基于置信度条件验证的测试时强化学习),一种新颖的置信度自适应框架,可扩展Pass@k覆盖率并提升Pass@1性能。基于我们的关键洞察,即验证能力通常领先于生成能力,TTRL-CoCoV采用置信度条件机制:对于高置信度样本,它引导验证器并应用探索增强奖励以防止多样性崩溃;对于低置信度样本,它将伪标签选择委托给验证器以过滤错误伪标签;对于中等置信度样本,则完全绕过验证。大量实验表明,TTRL-CoCoV在6个广泛认可的基准上优于最佳竞争方法,相较于TTRL,在Pass@1上平均绝对提升+9.8%,在Pass@16上平均绝对提升+18.7%,甚至在与全监督强化学习方法相比时,在多个推理基准上实现了高达+5.0%的Pass@1绝对提升。我们的代码仓库:https://github.com/shanjf666/CoCoV。

英文摘要

Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Existing studies focus on Pass@1 performance, whereas optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs that are effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
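For readers unfamiliar with the Pass@k metric named in this abstract, the commonly used unbiased estimator computes, from n sampled answers of which c are correct, the probability that at least one of k randomly drawn answers is correct. The sketch below shows that standard estimator; whether the paper evaluates Pass@k in exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no correct answer.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` gives 0.5: half of the six possible pairs contain the correct sample.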

2606.03602 2026-06-03 cs.LG cs.AI cs.CL 版本更新

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

CauTion:知道何时信任LLM进行集成因果发现

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) Nanjing University(南京大学) Tongji University(同济大学)

AI总结 提出CauTion框架,通过共识过滤和LLM可靠性估计,将LLM领域知识可靠地集成到多个统计因果发现算法中,解决纯统计方法的局限和LLM错误问题。

详情
AI中文摘要

从观测数据进行因果发现仍然具有挑战性,因为纯统计方法存在根本性限制,例如等价类内的统计不可区分性和对有限样本量的敏感性。虽然大型语言模型(LLM)提供了有希望的领域知识来源来补充统计推断,但现有的LLM增强方法容易受到LLM错误的影响,并且产生高昂的令牌成本。此外,依赖单一数据驱动算法可能使结果对算法特定偏差敏感。为了解决这些限制,我们提出了CauTion,一个通过共识过滤和LLM可靠性估计将LLM领域知识可靠地集成到统计因果发现算法集成中的框架。CauTion分三个阶段进行。首先,算法集成利用共识投票解决算法一致的最多96%的边,在过滤后的共识边上实现接近完美的准确性。其次,一个信任校准仲裁机制通过无注释的信任校准过程估计LLM和算法的相对可靠性,然后用于控制信任加权投票过程,将LLM仲裁限制在算法证据不可靠的边上。第三,应用循环修复步骤确保最终因果图是有效的无环图。在六个数据集上的实验表明,CauTion在性能上始终优于数据驱动和LLM增强的基线,在更大的图上获得更大的收益,并且对LLM错误具有强大的鲁棒性。代码可在以下网址获取:https://github.com/OpenCausaLab/CauTion。

英文摘要

Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical indistinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which then governs a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
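The first stage described here, consensus filtering over an algorithm ensemble, can be sketched as below. This is a simplified illustration under a stated assumption: an edge counts as resolved only when every algorithm proposes it (unanimity). The paper's actual voting rule and edge representation may differ.

```python
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)


def consensus_filter(edge_sets: List[Set[Edge]]) -> Tuple[Set[Edge], Set[Edge]]:
    """Split proposed edges into unanimous ones and disputed ones.

    edge_sets holds one set of directed edges per causal discovery algorithm.
    Disputed edges are the candidates a later arbitration stage would examine.
    """
    votes = Counter(edge for edges in edge_sets for edge in edges)
    agreed = {edge for edge, count in votes.items() if count == len(edge_sets)}
    disputed = set(votes) - agreed
    return agreed, disputed
```

Only the disputed set would need LLM arbitration under this scheme, which is how consensus filtering can keep token costs low.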

2606.03601 2026-06-03 cs.SE cs.AI 版本更新

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

DDOR: 用于可解释过度拒绝测试与修复的Delta调试方法

Qinyan Zhou, Peixin Zhang, Jun Sun, Haonan Zhang, Dongxia Wang

发表机构 * Southeast University(东南大学) Singapore Management University(新加坡管理大学) Zhejiang University(浙江大学) Huzhou Institute of Industrial Control Technology(湖州工业控制技术研究所)

AI总结 提出DDOR框架,通过delta调试定位最小拒绝触发片段(mRTF),实现黑盒环境下大语言模型过度拒绝行为的自动化测试与修复。

详情
AI中文摘要

虽然安全对齐和护栏有助于大语言模型(LLM)避免有害输出,但它们也可能导致过度拒绝,即对仅看似有风险的无害查询进行无根据的拒绝。我们提出了DDOR(用于过度拒绝的Delta调试),这是一个完全自动化和可解释的框架,用于在黑盒设置中进行过度拒绝测试和修复,其中仅可访问模型输入和输出,内部安全机制保持不透明。DDOR应用delta调试来定位最小拒绝触发片段(mRTF),这些片段提供了短语级别的、可解释的证据,说明拒绝发生的原因。基于这些mRTF,DDOR生成多样化、上下文丰富的提示,并执行多预言验证以过滤本质上不安全或模糊的案例,从而产生可扩展且模型特定的过度拒绝测试套件(每个模型约1K个案例)。除了评估之外,我们进一步利用定位的mRTF进行有针对性的提示修复,显著减少过度拒绝,同时保留原始意图并在真正有害的输入上保持安全性。总体而言,DDOR提供了一种实用的端到端解决方案,用于评估和缓解过度拒绝,在不牺牲安全性的情况下提高LLM的可用性。

英文摘要

While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
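Delta debugging, which this abstract applies to locate minimal refusal-triggering fragments, can be sketched with a compact variant of the classic ddmin procedure. The `triggers` oracle is a hypothetical stand-in for querying the black-box model and checking whether it refuses; operating on whitespace-style tokens is also an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def ddmin(tokens: Sequence[str], triggers: Callable[[List[str]], bool]) -> List[str]:
    """Shrink tokens to a 1-minimal subsequence for which triggers() stays True."""
    current = list(tokens)
    if not triggers(current):
        raise ValueError("the full input must trigger the behaviour")
    n = 2  # current granularity (number of chunks)
    while len(current) >= 2:
        size = max(1, len(current) // n)
        chunks = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False
        for chunk in chunks:  # try to keep a single chunk
            if triggers(chunk):
                current, n, reduced = chunk, 2, True
                break
        if not reduced:
            for i in range(len(chunks)):  # try to drop a single chunk
                rest = [t for j, c in enumerate(chunks) if j != i for t in c]
                if triggers(rest):
                    current, n, reduced = rest, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(current):
                break  # already at single-token granularity
            n = min(len(current), n * 2)
    return current
```

For the prompt "how do I kill a Python process" and an oracle that fires whenever the token "kill" is present, the procedure narrows the input down to that single token, which is the kind of phrase-level evidence the abstract describes.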

2606.03569 2026-06-03 cs.CV cs.AI 版本更新

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

当注意力崩溃时:从结构到语义的阶段性视觉令牌剪枝

Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

发表机构 * Shandong University(山东大学) National University of Singapore (Suzhou) Research Institute(新加坡国立大学(苏州)研究院)

AI总结 针对视觉语言模型推理中视觉令牌剪枝因依赖单一注意力分数导致特征多样性下降的问题,提出两阶段剪枝框架STS,先通过排斥采样最大化结构多样性,再通过指令感知交叉注意力过滤语义无关令牌,从而提升保留令牌的结构多样性与细粒度任务对齐。

详情
AI中文摘要

视觉语言模型(VLMs)展现了卓越的能力,但在推理过程中承受着巨大的计算开销。虽然视觉令牌剪枝提供了一种有前景的解决方案,但现有方法主要依赖于初始注意力分数。这种单一度量范式存在一个关键缺陷:高注意力分数会固有地坍缩到语义相似区域,从而严重降低特征多样性并丢弃重要的上下文细节。为解决这一问题,我们引入了结构到语义(STS),一种新颖的两阶段视觉令牌剪枝框架,明确解耦了剪枝过程。第一阶段采用基于排斥的采样机制,以最大化空间和结构多样性。第二阶段利用指令感知的交叉注意力,精确过滤掉与提示无关的令牌。这种两阶段协同构成了STS的核心,首先确保几何覆盖,然后根据语义相关性细化保留的令牌。大量评估表明,STS减轻了由基于注意力的选择引起的冗余,提高了保留视觉令牌的结构多样性和细粒度任务对齐。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.
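The two-stage idea in this abstract, diversity first and relevance second, can be illustrated with a small sketch. Farthest-point sampling is used here as a stand-in for the repulsion-based sampling stage, and a precomputed `relevance` score per token stands in for instruction-aware cross-attention; both substitutions are assumptions, not the STS algorithm itself.

```python
import math
from typing import List, Sequence, Tuple


def farthest_point_sample(points: Sequence[Tuple[float, ...]], k: int) -> List[int]:
    """Greedily pick up to k indices, each as far as possible from those already picked."""
    if not points or k <= 0:
        return []
    selected = [0]
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


def two_stage_prune(points, relevance, k_diverse: int, k_final: int) -> List[int]:
    """Stage 1 keeps a spatially diverse subset; stage 2 keeps its most relevant tokens."""
    diverse = farthest_point_sample(points, k_diverse)
    top = sorted(diverse, key=lambda i: relevance[i], reverse=True)[:k_final]
    return sorted(top)
```

Running the diversity stage first means a cluster of similar high-scoring tokens cannot crowd out the rest of the image before relevance is considered.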

2606.03568 2026-06-03 cs.CV cs.AI cs.LG cs.RO 版本更新

Learned Non-Maximum Suppression for 3D Object Detection

用于3D目标检测的学习型非极大值抑制

Timo Osterburg, Stefan Schütte, Torsten Bertram

发表机构 * Institute of Control Theory and Systems Engineering, TU Dortmund University(控制理论与系统工程研究所,多特蒙德技术大学)

AI总结 提出两种基于学习的过滤模块(D2D-Rescore和GossipNet3D)替代启发式NMS,通过检测间关系提升3D检测性能,尤其改善小物体和稀有类别的检测精度。

Comments 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026

详情
AI中文摘要

后处理是基于激光雷达的3D目标检测中的关键阶段,必须过滤密集且重叠的提议以实现紧凑可靠的感知。本文引入了两个学习型过滤模块,通过利用检测之间的关系来替代启发式非极大值抑制(NMS)。D2D-Rescore采用基于Transformer的检测到检测(D2D)注意力,而GossipNet3D通过鸟瞰图中的局部消息传递将2D GossipNet概念适应到3D。一种与nuScenes评估协议对齐的度量感知匹配策略确保了训练和验证行为的一致性,从而提高了整体检测性能。与CircleNMS相比,两种方法都提高了平均精度(mAP)、nuScenes检测分数(NDS)和真阳性质量,特别是对于小物体和稀有类别,同时增加了最小的计算开销。这些结果表明,学习型的检测级过滤可以在不修改基础网络的情况下增强3D检测器的可靠性,为启发式抑制提供了一种原则性的替代方案。代码可在以下网址获取:https://github.com/rst-tu-dortmund/learned-3d-nms。

英文摘要

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms .
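For context, the kind of heuristic suppression that the learned modules are meant to replace can be sketched as a greedy, center-distance filter in bird's-eye view. This is a simplified illustration in the spirit of circle-style NMS, not the CircleNMS baseline as used in the paper; the `min_dist` threshold is a hypothetical parameter.

```python
import math
from typing import List, Sequence, Tuple


def greedy_distance_nms(
    centers: Sequence[Tuple[float, float]],  # bird's-eye-view box centers (x, y)
    scores: Sequence[float],
    min_dist: float,                         # hypothetical suppression radius
) -> List[int]:
    """Keep detections in descending score order, dropping any too close to a kept one."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(math.dist(centers[i], centers[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept
```

The rule looks only at scores and pairwise distances, so it cannot use richer relations among detections; that gap is what the learned filtering modules in the abstract target.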

2606.03566 2026-06-03 cs.CV cs.AI 版本更新

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

基于高效Transformer的局部块采样用于多发性硬化脉络丛分割

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

AI总结 提出一种基于SwinUNETR和局部块采样的方法,实现多发性硬化侧脑室脉络丛的自动分割,在降低99%计算量的同时取得优于现有模型的Dice系数。

详情
AI中文摘要

背景:侧脑室脉络丛(LVCP)正逐渐被认为是与多发性硬化(MS)身体残疾和神经炎症相关的关键影像生物标志物。然而,LVCP的手动分割非常繁琐,限制了其在广泛临床试验和纵向评估中的应用。本研究旨在开发一种基于SwinUNETR的流程,利用靶向的脑室内和脑室周围小块采样,从独立和多模态MRI输入中自动分割MS中的LVCP。方法:我们回顾性评估了来自两个独立MS主导队列的三组数据的3T MRI扫描(数据集1:n=177;数据集2:n=177;扩展测试集:n=388)。我们的方法采用在32x32x32体素块上训练的SwinUNETR架构,并与3D UXNET模型进行基准比较。主要评估指标是Dice相似系数(DSC),辅以计算需求(GFLOPs)和95百分位豪斯多夫距离(HD95)。结果:在扩展测试集上,SwinUNETR模型在结合MPRAGE和FLAIR时获得了平均DSC为0.868(95% CI: 0.863-0.872),显著优于UXNET(DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001)。当仅限于独立FLAIR输入时,基于Transformer的方法保持了0.863的高DSC,而UXNET的空间定位显著恶化(HD95: 1.86 vs. 3.00 mm)。重要的是,所提出的框架将计算负载降低了99%(91.8 vs. 22,080 GFLOPs)。通过将局部块采样与SwinUNETR架构相结合,该方法为LVCP分割提供了一种准确、稳健且统计上优于当前领先模型的替代方案。其巨大的计算成本降低使其非常适合在临床和研究环境中广泛实施。

英文摘要

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
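The primary metric and the reported compute reduction can be checked with a few lines. The sketch below uses the standard definition of the Dice Similarity Coefficient on flattened binary masks and recomputes the relative GFLOPs reduction from the two figures quoted in the abstract; treating two empty masks as a perfect match is a convention assumed here.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flattened 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_reduction(new_cost: float, old_cost: float) -> float:
    """Fraction of the old cost that is saved by the new method."""
    return 1.0 - new_cost / old_cost


# GFLOPs figures quoted in the abstract: 91.8 vs. 22,080.
gflops_saving = relative_reduction(91.8, 22080.0)
```

With the quoted figures the saving comes to roughly 99.6%, consistent with the 99% reduction stated in the abstract.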

2606.03557 2026-06-03 cs.AI cs.HC 版本更新

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

从提示到服务:基于SLM的AI驱动虚拟世界代理编排网关

Louis Nisiotis, Aimilios Hadjiliasi

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出一种基于小语言模型的代理编排网关,通过意图驱动的服务路由解耦虚拟世界客户端与异构AI后端,并在虚拟博物馆测试床中验证了其可行性和效率。

详情
AI中文摘要

随着生成式AI能力的扩展,AI驱动的虚拟世界面临日益增长的架构挑战。用户通过世界内界面以多模态方式进行交互,但其请求需要根本不同的AI后端模型和计算资源。将这些能力直接嵌入虚拟世界系统会降低可扩展性、增加维护复杂性,并限制协调分布在边缘和云基础设施上的服务的能力。本文提出一种基于SLM的代理编排网关,这是一种轻量级运行时协调机制,通过意图驱动的服务路由将虚拟世界客户端与异构AI后端解耦。边缘部署的SLM对每个用户提示的语义意图进行分类,可配置的服务注册表验证并解析路由决策,然后透明地调用所选后端,从而无需修改客户端应用即可在虚拟世界中引入新的AI能力。该网关在InterwovenXR虚拟博物馆测试床中实现并评估。评估表明,紧凑型SLM可以在边缘硬件上作为可靠的意图路由器,并且任务特定的微调可以将参数低于十亿的模型转化为实用的低延迟路由器。一种分层配置将微调后的十亿以下参数模型作为路由器,与用于对话响应生成的较大SLM配对,证明可以在中端边缘硬件上部署,并且比将两个职责委托给单个模型更高效。研究结果表明,SLM可以支持虚拟世界中实用的AI服务编排,并且该工作贡献了一种经过评估的架构,用于可伸缩、可扩展且支持边缘的AI交互,使虚拟代理成为分布式生成式AI服务的访问点。

英文摘要

As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub-billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds, and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents to become access points to distributed generative AI services.
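The routing flow described in this abstract (classify intent, validate against a registry, invoke the backend) can be sketched as follows. The names are hypothetical: `classify_intent` stands in for the edge-deployed SLM, `registry` maps intent labels to backend callables, and falling back to a default intent when the label is unknown is an assumption of this sketch rather than documented gateway behaviour.

```python
from typing import Callable, Dict


def route_prompt(
    prompt: str,
    classify_intent: Callable[[str], str],      # hypothetical SLM intent classifier
    registry: Dict[str, Callable[[str], str]],  # intent label -> backend handler
    fallback_intent: str = "chat",              # assumed default when validation fails
) -> str:
    """Classify the prompt, validate the label against the registry, call the backend."""
    intent = classify_intent(prompt)
    if intent not in registry:
        # The registry acts as the validation step: unknown labels never reach a backend.
        intent = fallback_intent
    return registry[intent](prompt)
```

Adding a capability then amounts to registering one more handler, with no change to the client, which is the decoupling the abstract emphasizes.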

2606.03544 2026-06-03 cs.AI cs.CL 版本更新

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

SAGE: 智能体生态中社会化演化的定量评估

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

发表机构 * Tsinghua University, China(清华大学, 中国) Meituan, China(美团, 中国)

AI总结 提出SAGE框架,通过对比社会演化(SocialEvo)与自我演化(SelfEvo)两种计算条件,在三个领域评估共享经验对智能体性能的影响,发现群体历史并非普遍放大器,但能帮助陷入停滞的智能体取得突破,且社会收益依赖于抽象能力而非暴露量。

Comments 13 pages, 5 figures

详情
AI中文摘要

自我改进的语言智能体通常被孤立评估:一个智能体尝试任务、接收反馈并迭代优化自身行为。然而,智能体越来越多地与同伴一起运作,其策略和结果公开可见。这引发了一个研究不足的问题:共享经验何时能产生自我改进无法单独实现的改进?我们引入了SAGE(社会智能体群体演化),一个评估框架,比较两种计算匹配的条件:SocialEvo,其中来自五个不同模型家族的智能体共同演化,可访问所有同伴的历史;以及SelfEvo,其中每个智能体获得相同数量的任务尝试,但只能看到自己的过去,这是自我改进智能体研究中的常规做法。我们在三个领域实例化SAGE:开放式机器学习研究、长期经济规划和战略多人游戏,并在多个演化轮次中进行评估。我们发现群体历史并非普遍放大器:最强的智能体并未超过其自我演化上限。然而,在自我改进下停滞的智能体,当同伴经验可用时,可以取得重大突破。在竞争环境中,反事实控制显示智能体普遍改进,而非发展针对对手的策略。在不同形式的共享历史中,过滤后的同伴轨迹和反思性摘要通常优于原始日志,表明社会收益依赖于抽象而非暴露量。这些发现表明,同伴历史收益是智能体特定的、领域依赖的,并取决于从公共轨迹中抽象可转移知识的能力。

英文摘要

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

2606.03532 2026-06-03 cs.LG cs.AI 版本更新

When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

教师何时应该移动?自在线策略蒸馏中的时间耦合与稳定性

Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang

发表机构 * Peking University(北京大学) University of Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学)

AI总结 研究自在线策略蒸馏中教师更新调度对稳定性的影响,提出基于隔离期和门控机制的CGTR方法,实现零崩溃和最佳性能。

详情
AI中文摘要

自在线策略蒸馏针对从自身参数历史派生的教师训练学生策略,但教师的更新调度——控制教师与学生之间的\emph{时间耦合}——尚未作为稳定性变量被系统研究。通过对Qwen3-8B进行受控调度扫描,我们确定\emph{隔离期}(定义为更新之间教师完全冻结)是实现稳定学习的关键结构属性,而非教师年龄。为了刻画这些底层训练动态,我们引入了一个诊断框架,包括时间KL结构、刷新冲击和长度尾部风险。该框架进一步揭示了\emph{状态遗忘崩溃}:最优的短视固定调度在长视训练下灾难性失败,因为时钟驱动的刷新可以在单个不可逆步骤中将短暂漂移的学生复制到教师中。这种失败模式在短视评估下不可见,并且在机制上不同于EMA的慢性污染。为了解决这个问题,我们提出了\emph{巩固门控教师刷新}(CGTR),它在保持隔离期的同时,基于奖励改进和长度尾部安全的联合证据对每次刷新进行门控,确保每次教师移动响应于真正的学生巩固而非时钟信号。使用单一共享参数集且无需每数据集重新调整,CGTR在所有四个任务(化学、生物学、物理学、工具使用)上实现了\textbf{零崩溃}和最佳最终分数,并自动调节其刷新频率以适应每个任务的学习动态。

英文摘要

Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the \emph{temporal coupling} between teacher and student -- has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that \emph{isolation periods}, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers \emph{state-oblivious collapse}: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose \emph{Consolidation-Gated Teacher Refresh} (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves \textbf{zero collapse} and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.
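The gating idea in this abstract (keep the teacher frozen for an isolation period, then refresh only on joint evidence of reward improvement and length-tail safety) might look roughly like the sketch below. The class name `RefreshGate`, its fields, and every threshold are hypothetical; the abstract does not specify CGTR's actual criteria, so this only illustrates the general shape of a state-aware gate as opposed to a clock-driven one.

```python
from dataclasses import dataclass


@dataclass
class RefreshGate:
    min_isolation: int      # steps the teacher must stay frozen (hypothetical)
    min_reward_gain: float  # required reward gain since the last refresh (hypothetical)
    max_tail_length: float  # cap on a high response-length percentile (hypothetical)
    steps_since_refresh: int = 0
    reward_at_refresh: float = 0.0

    def step(self, reward: float, tail_length: float) -> bool:
        """Return True if the teacher should be refreshed at this training step."""
        self.steps_since_refresh += 1
        isolated = self.steps_since_refresh >= self.min_isolation
        improved = reward - self.reward_at_refresh >= self.min_reward_gain
        safe = tail_length <= self.max_tail_length
        if isolated and improved and safe:
            # Refresh: restart the isolation period and record the new baseline.
            self.steps_since_refresh = 0
            self.reward_at_refresh = reward
            return True
        return False
```

In this sketch a refresh is skipped whenever the student's length tail is unsafe, even if the isolation period has elapsed, so a transiently drifting student is never copied into the teacher purely because the clock ran out.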

2606.03523 2026-06-03 cs.CR cs.AI cs.LG 版本更新

High-Precision APT Malware Attribution with Out-of-Scope Resilience

高精度APT恶意软件归因与越界鲁棒性

Peter Williams, Adam Sobey, Erisa Karafili

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系)

AI总结 提出基于排名二元分类器与显式弃权的APT恶意软件归因方法,在越界样本占比87%时仍保持92%精度和95%选择性准确率。

详情
AI中文摘要

早期归因高级持续性威胁(APT)活动可帮助防御者优先调查、选择对策并减少入侵影响。恶意软件提供了有用的归因证据,但自动化APT恶意软件归因在实践中仍然困难。现有方法通常作为封闭集分类器在有限数量的已知APT组织上进行训练和评估。然而,在操作环境中,分类器很可能遇到训练中未出现的组织样本。封闭集分类器被迫将这些样本分配给已知组织,产生无根据且可能误导的归因。我们提出一种基于排名二元分类器与显式弃权的高精度APT恶意软件归因方法。我们的方法不是训练单个多类分类器,而是为每个APT组织训练和调整两个二元分类器,根据验证性能对分类器进行排名,并顺序应用它们。仅当分类器提供足够证据时才对样本进行归因;否则,弃权。我们在APT恶意软件数据集和旨在压力测试越界行为的更大组合数据集上评估该方法。在APT恶意软件数据集上,该方法实现了比先前公布结果更高的精度。在最具挑战性的设置中,87%的测试样本来自训练中排除的60个APT组织,该方法对94%的越界样本弃权,同时在其分类的样本上保持92%的精度和95%的选择性准确率。

英文摘要

Early attribution of Advanced Persistent Threat (APT) activity can help defenders prioritise investigation, select countermeasures, and reduce the impact of an intrusion. Malware provides useful attribution evidence, but automated APT malware attribution remains difficult in practice. Existing approaches are typically trained and evaluated as closed-set classifiers over a limited number of known APT groups. In operational environments, however, classifiers are likely to encounter samples from groups not represented during training. Closed-set classifiers are then forced to assign such samples to known groups, producing unsupported and potentially misleading attributions. We present a high-precision APT malware attribution method based on ranked binary classifiers with explicit abstention. Rather than training a single multi-class classifier, our approach trains and tunes two binary classifiers per APT group, ranks the classifiers by validation performance, and applies them sequentially. A sample is attributed only when a classifier provides sufficient evidence; otherwise, it abstains. We evaluate the method on the APT Malware dataset and on a larger combined dataset designed to stress-test out-of-scope behaviour. On the APT Malware dataset, the method achieves higher precision than previously published results on the same dataset. In the most challenging setting, where 87% of test samples came from 60 APT groups excluded from training, the method abstained on 94% of out-of-scope samples while maintaining 92% precision and 95% selective accuracy on the samples it classified.
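The sequential, abstaining decision rule described in this abstract can be illustrated with a short sketch. The `GroupClassifier` container, the use of validation precision as the ranking key, and the per-classifier thresholds are assumptions for illustration; the abstract only states that classifiers are ranked by validation performance and applied in order, with abstention when none provides sufficient evidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class GroupClassifier:
    group: str                                 # APT group this binary classifier votes for
    score: Callable[[Sequence[float]], float]  # confidence in [0, 1] for a sample
    threshold: float                           # tuned decision threshold (hypothetical)
    val_precision: float                       # validation metric used for ranking (assumed)


def attribute(sample: Sequence[float],
              classifiers: List[GroupClassifier]) -> Optional[str]:
    """Apply classifiers from best to worst; return the first confident group or None."""
    ranked = sorted(classifiers, key=lambda c: c.val_precision, reverse=True)
    for clf in ranked:
        if clf.score(sample) >= clf.threshold:
            return clf.group
    return None  # abstain: no classifier provided sufficient evidence
```

Returning `None` instead of the nearest known group is what lets such a pipeline stay precise when most test samples come from groups that were never seen in training.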

2606.03521 2026-06-03 cs.LG cs.AI 版本更新

Post-Hoc Robustness for Model-Based Reinforcement Learning

基于模型的强化学习的后验鲁棒性

Siemen Herremans, Ali Anwar, Siegfried Mercelis

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一种在推理时利用学习模型和名义策略进行鲁棒策略改进的后验鲁棒化方法,通过对抗性展开的模型预测控制提升鲁棒性,无需额外训练神经网络。

详情
AI中文摘要

为了提高强化学习(RL)在现实世界中的适用性,对抗鲁棒RL领域研究如何在对抗环境扰动下训练智能体。在该设置中,主角智能体在对手的环境扰动下优化策略,形成零和马尔可夫博弈。当对抗鲁棒RL与基于模型的RL结合时,对手可以针对学习到的转移模型而非训练环境。扩展这一思想,本文引入了深度RL智能体在推理时的后验鲁棒化。通过将学习模型与训练的名义策略结合使用,我们的方法执行鲁棒策略改进步骤。目标是提高鲁棒性而无需对神经网络进行额外训练。具体来说,我们利用对抗性展开下的模型预测控制,这些展开通过有界不确定性集内的投影梯度下降进行近似。此外,这些离线展开在执行时考虑并缓解了分布外问题。通过在扰动的Gymnasium MuJoCo环境中评估算法,同时考虑后验推理设置的计算限制,验证了所提方法在鲁棒性上的显著提升。

英文摘要

To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an adversary, resulting in a zero-sum Markov game. When adversarially robust RL is combined with model-based RL, the adversary can target a learned transition model instead of the training environment. Extending this idea, this work introduces post-hoc robustification of deep RL agents at inference time. By using the learned model in combination with a trained nominal policy, our approach performs a robust policy improvement step. The goal is to improve robustness without any additional training of neural networks. Specifically, we utilize model-predictive control under adversarial rollouts, which are approximated via projected gradient descent within a bounded uncertainty set. Furthermore, these offline rollouts are performed while considering and mitigating out-of-distribution issues. The proposed methodology is validated by demonstrating significant improvements in robustness when the algorithm is evaluated in perturbed Gymnasium MuJoCo environments, while considering the computational limitations of the post-hoc inference setting.
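Projected gradient descent within a bounded uncertainty set, as mentioned in this abstract, can be sketched in a few lines. The version below is a generic sign-gradient ascent on an adversary objective with projection onto an L-infinity ball; the function name `pgd_worst_case`, the choice of an L-infinity ball, and the toy quadratic objective are assumptions, and the paper's rollout model and uncertainty set are not reproduced here.

```python
from typing import Callable, List


def pgd_worst_case(grad: Callable[[List[float]], List[float]],
                   dim: int, eps: float, step: float, iters: int) -> List[float]:
    """Sign-gradient ascent on the adversary's objective, projected onto |delta_i| <= eps."""
    delta = [0.0] * dim
    for _ in range(iters):
        g = grad(delta)
        for i in range(dim):
            sign = (g[i] > 0) - (g[i] < 0)                          # -1, 0 or +1
            delta[i] = max(-eps, min(eps, delta[i] + step * sign))  # projection step
    return delta


def toy_grad(delta: List[float]) -> List[float]:
    """Gradient of the toy adversary objective J = -(d0 - 2)^2 - (d1 + 2)^2."""
    return [-2.0 * (delta[0] - 2.0), -2.0 * (delta[1] + 2.0)]


worst = pgd_worst_case(toy_grad, dim=2, eps=0.5, step=0.1, iters=20)
```

In the toy problem the unconstrained maximizer lies outside the ball, so the perturbation ends up pinned to the boundary of the uncertainty set, which is the typical behaviour of a bounded adversary.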

2606.03518 2026-06-03 cs.AI cs.CR 版本更新

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

覆盖治理:面向代理型人工智能的委托与范围的组合授权框架

Amjad Ibrahim, Yong Li

发表机构 * Huawei Heisenberg Research Center(华为海森堡研究中心)

AI总结 针对代理型AI中传统授权框架无法处理递归委托、动态范围等问题,提出一种组合治理框架,通过定义委托类型、权限责任和资源范围衰减,并引入组合算子在不重写现有策略的情况下叠加代理语义,实现可问责的授权。

Comments 12 pages

详情
AI中文摘要

随着AI系统从被动模型演变为能够发起行动、协作和委托任务的自主主动代理,软件系统的传统边界变得模糊。围绕固定主体、显式请求和静态范围构建的传统授权和委托框架不足以治理代理系统。代理型AI需要更丰富的授权语义:代理必须继承和委托权限,在时间限制的权限下行动,并通过共享协议进行协调。现有的身份和访问管理(IAM)系统未能完全捕捉这种代理概念,缺乏递归委托、上下文边界和动态范围作为可执行治理原语的机制。与OAuth 2.0等访问委托标准不同,我们将委托视为合同条款,而不仅仅是基于静态令牌的同意凭证。本文提出一个组合治理框架,引入了代理型AI不可或缺的原语。我们定义了委托类型及其权限和问责含义,并引入了资源范围衰减的概念以限制代理访问范围。这些概念被表达为通用的关系定义,可以组合到现有的授权域(例如金融系统)中。为了操作化这种组合,我们定义了一个组合算子,将新的代理语义(例如递归委托链)叠加到现有关系策略上,而无需重写它们。我们通过形式化证明和实证评估来证实该框架,表明它为代理型AI系统中的可问责授权提供了形式化且实用的基础。

英文摘要

As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.

2606.03512 2026-06-03 cs.RO cs.AI 版本更新

SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts

SPADE: 草图引导的路径规划增强扩散专家

Charbel Abi Hana, Tatiana Ghantous, Mikael Khalil, Anthony Rizk

发表机构 * IDEALworks GmbH IMT Atlantique IDEALworks GmbH & Saint Joseph University of Beirut(IDEALworks GmbH及贝鲁特圣约瑟夫大学)

AI总结 提出一种结合扩散增强的框架,通过改进的标注工具和训练策略,在保持实时性的同时提升路径规划的泛化能力和鲁棒性,显著降低姿态误差和FID。

详情
AI中文摘要

路径规划对于自主移动机器人(AMR)至关重要。将人类偏好纳入规划的常规方法通常依赖于复杂的奖励工程或硬件密集型解决方案。最近的最先进框架利用模仿学习从专家演示中训练特定行为的路径规划模型。然而,这些方法面临两个关键限制:对未见环境的泛化能力有限,以及演示收集中的鲁棒性较低。为了解决这些挑战,本文介绍了一个增强框架,专注于两个主要贡献:一个基于ROS 2重构的标注工具,以及一种新颖的训练策略,将基于扩散的数据增强集成到基线行为克隆模型中。提供了专家演示数据集,并通过消融研究评估所提出解决方案的鲁棒性。增强方法优于最先进的方法,绝对姿态误差(APE)降低39.1%,Fréchet初始距离(FID)降低33.5%,同时可训练参数减少93.8%。此外,它达到了扩散级别的泛化能力,同时保留了最先进模型的实时、边缘特性。

英文摘要

Path planning is essential for Autonomous Mobile Robots (AMRs). Conventional methods for incorporating human preferences into planning typically rely on either complex reward engineering or hardware-intensive solutions. Recent state-of-the-art frameworks leverage imitation learning to train behavior-specific path planning models from expert demonstrations. However, these approaches face two key limitations: limited generalization to unseen environments and low robustness in demonstration collection. To address these challenges, this work introduces an enhanced framework that focuses on two main contributions: an overhauled annotation tool built on ROS 2, and a novel training strategy that integrates diffusion-based augmentation into baseline behavioral cloning models. A dataset of expert demonstrations is provided and evaluated through ablation studies to assess the robustness of the proposed solution. The enhanced approach outperforms state-of-the-art methods with 39.1% lower Absolute Pose Error (APE) and 33.5% lower Fréchet Inception Distance (FID) while having 93.8% fewer trainable parameters. Moreover, it attains diffusion-level generalization while preserving the real-time, on-edge properties of state-of-the-art models.

2606.03503 2026-06-03 cs.AI 版本更新

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

ThoughtFold: 通过内省偏好学习折叠推理链

Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen

AI总结 提出ThoughtFold框架,通过细粒度偏好学习惩罚冗余探索并鼓励直接连接关键推理段,将推理链折叠为更简洁路径,在保持精度的同时大幅降低token使用量。

详情
AI中文摘要

大型推理模型(LRMs)由于在思维链(CoTs)上使用可验证奖励的强化学习(RLVR)取得了显著进展。然而,由于长CoT自然包含试错,且主流RLVR方法选择结果正确的CoT轨迹进行记忆,长CoT中的冗余探索不可避免地得到强化,导致LRMs的过度思考问题。先前解决此问题的尝试主要给较短轨迹更多优势,但其学习信号仍基于结果,无法减少长CoT中冗余探索的记忆。因此,我们提出ThoughtFold,一个利用细粒度偏好学习来缓解冗余探索以实现高效推理的框架。ThoughtFold采用内省策略识别每个正确轨迹中的冗余,从而产生一系列候选子轨迹。利用这一谱系,我们引入一个掩码偏好优化目标,明确惩罚冗余探索并鼓励模型直接桥接关键推理段,有效地将其推理链折叠为更简洁的路径。大量实验表明,ThoughtFold显著提高了效率。它将DeepSeek-R1-Distill-Qwen-7B的token使用量减少约56%,同时保持最先进的准确性。

英文摘要

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, since long CoTs naturally contain trial and error, and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the overthinking problem of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

2606.03486 2026-06-03 cs.CR cs.AI 版本更新

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

NeuroArmor:基于安全变体引导的表示一致性实现越狱防御中的选择性重新锚定

Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出NeuroArmor白盒运行时防御方法,通过为每个提示构建安全变体作为局部安全参考,在隐藏状态空间进行一致性检查并路由异常,有效降低恶意攻击成功率同时保持低误报率。

Comments 16 pages, 4 figures, 17 tables. Submitted to ACL ARR

详情
AI中文摘要

大型语言模型仍然容易受到越狱攻击,这些攻击将有害意图隐藏在看似普通的请求背后,例如角色扮演、翻译、编码、对抗性后缀和多轮铺垫。现有的防御方法仍然难以在不过度拦截良性但敏感的请求的情况下处理这些攻击,部分原因是它们通常对每个提示应用相同的操作,因此无法平衡安全性和有用性。我们提出NeuroArmor,一种白盒运行时防御方法,它使用提示特定的安全变体作为局部安全参考,用于决定何时需要干预,并在触发时作为干预的安全目标。对于每个提示,NeuroArmor构建K个安全变体,在隐藏状态空间中将提示状态与此局部安全参考进行比较,并将异常路由到恶意提示的拒绝分支或边界良性提示的有用恢复分支。在Llama-3-8B-Instruct上,NeuroArmor将恶意攻击成功率(ASR)从41.56%降低到1.57%,同时将共享良性池上的良性误报率(FPR)从30.26%降低到22.05%;匹配的基线在此权衡上仍然明显较弱。外部评估者和手动行为评估进一步表明,剩余未拦截的输出产生操作危害的可能性大大降低。总体而言,NeuroArmor通过结合提示特定的一致性检查、路由和选择性干预,为越狱防御提供了更有效的运行时策略。

英文摘要

Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.

2606.03483 2026-06-03 cs.LG cs.AI 版本更新

Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

分析超连接中的流坍缩:从诊断到缓解

Ekaterina Alimaskina, Gleb Molodtsov, Aleksandr Beznosikov

发表机构 * MIRAI BRAIn Lab Yandex Research Innopolis University

AI总结 本文通过细粒度诊断发现超连接中的多流残差连接存在流坍缩现象,即信号集中于主导流,并通过打破初始化对称性缓解该问题以提升性能。

详情
AI中文摘要

超连接(HC)用多个流替换单个Transformer残差流,引入了流索引上的置换对称性。我们研究这种对称性在实践中如何被打破:流是平衡地专门化还是表现出主导流使用。通过对基于HC的语言模型进行细粒度诊断,我们追踪多流表示的实际使用方式。我们发现,在早期种子阶段之后,残差混合通常保持接近恒等映射,限制了HC在流之间交换信息的核心机制。此外,信号和可解释特征都集中在一个主导流中,名义上的多流残差连接可能未充分利用其容量,行为更接近单流残差路径。最后,我们表明在流初始化时打破对称性可以减少主导行为并提高各种 extit{m}HC变体的性能。我们的代码已公开。

英文摘要

Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream indices. We study how this symmetry is resolved in practice: whether streams specialize in a balanced way or exhibit dominant-stream usage. Using fine-grained diagnostics for HC-based language models, we trace how multi-stream representations are actually used. We find that after an early seeding stage, residual mixing often remains close to identity, limiting a core HC mechanism for exchanging information between streams. Moreover, both signal and interpretable features concentrate in a dominant stream, and the nominally multi-stream residual connection can underutilize its capacity, behaving closer to a single-stream residual pathway. Finally, we show that breaking symmetry at stream initialization reduces dominant behavior and improves performance across \textit{m}HC variants. Our code is publicly available.

2606.03471 2026-06-03 cs.AI cs.MA q-bio.NC 版本更新

A formal definition and meta-model for a machine theory of mind

机器心智理论的正式定义与元模型

Fabio Cuzzolin

AI总结 本文基于认知心理学、神经科学和人工智能证据,首次提出机器心智理论的严格形式化定义,并构建整体元模型,以审视现有研究并推动未来突破。

Comments 48 pages, 2 figures

详情
AI中文摘要

本文首次提出了机器心智理论概念的严格形式化定义,该定义基于认知心理学、神经科学和人工智能证据支持的原则,并以此作为视角审视该领域的最新进展和当前努力,推动进一步研究以“破解”该问题的潜在议程。本文还提出了一个通用的整体机器心智理论元模型,并考察了在经验基准测试此类模型方面的最新进展。

英文摘要

This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles supported by evidence from cognitive psychology, neuroscience and artificial intelligence. It uses this definition as a lens to examine the state of the art and current efforts in the field, and to outline a potential agenda for further research able to "crack" the problem. It also advances a general holistic meta-model for Machine Theory of Mind, and examines the state of the art when it comes to empirically benchmarking such models.

2606.03467 2026-06-03 cs.AI 版本更新

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

StepFinder:多智能体系统中故障归因的时间语义框架

Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

发表机构 * Peking University(北京大学)

AI总结 提出StepFinder框架,通过将执行日志编码为时间语义序列并利用时序建模与注意力模块,高效准确地定位多智能体系统中的故障根因步骤。

Comments 12 pages, 5 figures. Accepted by KDD 2026

详情
AI中文摘要

基于LLM的多智能体系统在复杂多步骤任务中展现出显著的协作能力。然而,这些系统对单步执行错误高度敏感,错误会通过智能体交互传播并导致级联故障。为理解故障原因并提高系统可靠性,故障归因被引入作为一项任务,旨在自动识别导致故障的根因步骤。现有故障归因方法主要依赖LLM对原始执行轨迹进行推理,这不仅导致高推理成本和延迟,还受到冗余和噪声执行日志的干扰,使LLM难以准确识别真正的根因步骤。为此,我们提出StepFinder,一个轻量级故障归因框架。我们仅在特征构建阶段使用LLM将执行日志编码为时间语义序列。随后,应用参数高效的时序建模与注意力模块组合来捕捉轨迹的序列演化与跨步骤依赖。最后,通过多尺度差异和位置偏差细化步骤级错误分数,实现精确的根因识别。在Who&When基准上的实验结果表明,StepFinder在步骤级故障归因上优于基于LLM的方法,同时实现了显著更高的推理效率,与最快的基于LLM的方法相比,推理时间减少79%,且无文本生成开销。我们的代码可从此https URL获取。

英文摘要

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

2606.03465 2026-06-03 cs.LG cs.AI 版本更新

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

重新思考张量分解在训练后大语言模型压缩中的作用

Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov

发表机构 * University of Florida(佛罗里达大学) National Research University Higher School of Economics(俄罗斯国家研究大学——莫斯科经济学院)

AI总结 本文系统评估了张量分解在稠密和MoE架构上的训练后压缩效果,通过实证与理论分析揭示了其与LLM异构表示之间的根本性不匹配,从而界定了其实际限制和在规模化部署中的可行角色。

详情
AI中文摘要

训练后压缩对于在资源紧张条件下部署大型语言模型(LLM)至关重要。张量分解已成为一个有前景的方向,提供了适合Transformer权重结构的紧凑参数化。然而,现有研究在狭窄的设置中评估这些方法,使得张量化在大规模部署中是否有效尚不清楚。我们系统评估了稠密和MoE架构上的张量压缩,建立了基于实证分析和理论分析的性能权衡。我们识别出张量分解假设的共享子空间与现代LLM学习的异构表示之间的根本性不匹配,从而界定了它们的实际限制,并阐明了它们在大规模部署中的可行角色。代码可在 https://github.com/brain-lab-research/TT-LLM 获取。

英文摘要

Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving unclear whether tensorization is effective at large-scale deployment. We systematically evaluate tensor compression across dense and MoE architectures, establishing performance trade-offs grounded in both empirical analysis and theoretical analysis. We identify a fundamental mismatch between the shared subspaces assumed by tensor decompositions and the heterogeneous representations learned by modern LLMs, thereby delineating their practical limits and clarifying their viable role in large-scale deployment. The code is available at https://github.com/brain-lab-research/TT-LLM.

2606.03463 2026-06-03 cs.AI cs.CL 版本更新

DMF: A Deterministic Memory Framework for Conversational AI Agents

DMF:对话式AI代理的确定性记忆框架

Matteo Stabile, Enrico Zimuel

发表机构 * Roma Tre University(罗马第三大学)

AI总结 提出一种CPU优先的确定性记忆框架DMF,通过经典NLP分析、向量几何和数学评分替代生成式记忆压缩,实现零令牌成本且与Mem0相当的准确性。

Comments 21 pages, 3 figures

详情
AI中文摘要

对话式AI代理需要在大时间跨度的交互中既具可扩展性又语义连贯的记忆系统。现有方法主要依赖基于大语言模型(LLM)的写入时摘要,这引入了非确定性、令牌成本上升以及剪枝决策不透明等问题。我们提出确定性记忆框架(DMF),一种CPU优先的方法,用完全确定性的流水线替代生成式记忆压缩,该流水线基于经典NLP分析、向量几何和数学评分。DMF为每次对话交互分配一个生存分数$\Omega$,该分数由确定性内容信号、对话线索和结构化来源通过逻辑投影组合计算得出。一个交互计数衰减定律,记为$\Omega_{\mathrm{eff}}(\Delta n)$,控制着相关性随新轮次到达的演变,其中$\Delta n$是较新交互的数量而非实际时间,从而保持完全确定性。我们给出了DMF的数学公式、结构化召回流水线、剪枝决策过程和评估协议。实验在基于LoCoMo和LongMemEval数据集构建的专用基准上进行。我们将DMF与AI代理的流行记忆层Mem0进行比较。DMF在准备记忆上下文时使用零令牌,在整个对话中使用的令牌数少5到242倍,同时达到相当的准确性。这些结果表明,可以从记忆管理循环中消除LLM调用,将令牌成本降至几乎为零,并为对话式AI代理实现确定性记忆系统。

英文摘要

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.
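To make the scoring idea concrete, here is a minimal sketch of a logistic survival score with a decay driven by interaction count rather than wall-clock time. The abstract does not give DMF's formulas, so everything specific here is an assumption: the linear combination of signals, the exponential form of the decay, the `decay` rate, and the pruning `threshold` are illustrative choices, not the paper's definitions.

```python
import math
from typing import Sequence


def survival_score(signals: Sequence[float], weights: Sequence[float],
                   bias: float = 0.0) -> float:
    """Logistic projection of weighted deterministic signals into (0, 1)."""
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega: float, delta_n: int, decay: float = 0.05) -> float:
    """Assumed exponential decay in the number of newer interactions delta_n."""
    return omega * math.exp(-decay * delta_n)


def should_prune(omega: float, delta_n: int,
                 threshold: float = 0.2, decay: float = 0.05) -> bool:
    """Prune a memory once its effective score drops below the threshold."""
    return effective_score(omega, delta_n, decay) < threshold
```

Because both functions depend only on stored signals and a turn counter, replaying the same conversation always yields the same pruning decisions, which is the determinism property the abstract emphasises.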

2606.03461 2026-06-03 cs.AI 版本更新

What Makes Interaction Trajectories Effective for Training Terminal Agents?

什么使得交互轨迹对训练终端代理有效?

Sidi Yang, Chaofan Tao, Jierun Chen, Tiezheng Yu, Ruoyu Wang, Yuxin Jiang, Yiming Du, Wendong Xu, Jing Xiong, Taiqiang Wu, Lifeng Shang, Xiaohui Li, Ngai Wong, Haoli Bai

发表机构 * The University of Hong Kong(香港大学) Huawei Technologies(华为技术有限公司) Nanyang Technological University(南洋理工大学)

AI总结 本文通过Terminal-Lego流水线研究交互轨迹的教学效能,发现低分代理(DeepSeek-V3.2)的轨迹比高分代理(Claude Opus 4.6)更能提升学生泛化能力,归因于环境接地监督(EGS),并展示了极佳的数据效率。

详情
AI中文摘要

更强的代码代理通常被认为是训练后阶段的更优教师,然而这一假设尚未与任务难度、框架设计和学生能力充分解耦。我们使用Terminal-Lego(一个可扩展的流水线,将多领域现实问题转化为环境验证的代理任务)来研究这种教学联系。令人惊讶的是,独立表现并不能决定教学效能:尽管Claude Opus 4.6在Terminal-Bench 2.0上获得更高分数,但使用来自较低分代理DeepSeek-V3.2的轨迹进行微调的学生表现出显著更强的泛化能力。我们将这种“教学悖论”归因于环境接地监督(EGS):通过框架可见交互明确暴露“检查-行动-验证”行为的轨迹,使学生能够内化稳健的问题解决程序,而非脆弱的动作序列。扩展分析揭示了卓越的数据效率:例如,仅使用15.3k条Terminal-Lego轨迹,Qwen3-32B在Terminal-Bench 2.0上获得了24.3%的分数,与之前使用超过30倍数据量达到的最优性能相当。我们的结果表明,代理训练后的前沿不仅限于结果匹配,而是将焦点转向“框架工程”,其中环境接地交互结构的系统设计成为可复现和可泛化的代理智能的主要催化剂。

英文摘要

Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.

2606.03459 2026-06-03 cs.SD cs.AI 版本更新

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

和弦序列分析中的调性简约性:结合调制代价与调性词汇

François Pachet

发表机构 * LIP6, Sorbonne Université, Paris, France(LIP6,索邦大学,巴黎,法国) Ynosound, Paris, France(Ynosound,巴黎,法国)

AI总结 提出调性简约性方法,通过字典序最小化调制次数和不同调性数量,结合动态规划与固定24调性空间,在和弦序列分析中减少调性词汇并保持调制最优。

Comments 20 pages, 1 figure

详情
AI中文摘要

我们研究将局部调性分配给和弦序列,这一任务对和声分析、作曲和爵士即兴演奏很有用。标准的动态规划方法最小化调制,但可能引入不必要多的调性中心。我们将这种仅转移目标与纯最小词汇分析以及调性简约性进行比较,后者按字典序最小化调制次数,然后最小化不同调性的数量。尽管这个联合目标通常组合困难,但我们利用固定的24调性大调/小调宇宙给出了精确算法。在31,032个LMD和弦序列上,调性简约性在55.8%的情况下保持了转移最优,同时减少了调性词汇。在加权爵士替换闭包下,它将平均调性数从3.802降至3.206,调制次数从16.728降至12.141。在1,555个带注释的爵士标准曲上,它将兼容和弦-音阶一致性提高到95.6%,支持可处理的专业级和声分析。

英文摘要

We study the assignment of local tonalities to chord sequences, a task useful for harmonic analysis, composition, and jazz-oriented improvisation. Standard dynamic-programming approaches minimize modulations but can introduce unnecessarily many tonal centers. We compare this transition-only objective with pure minimum-vocabulary analysis and with tonal parsimony, which minimizes lexicographically the number of modulations and then the number of distinct tonalities. Although this joint objective is combinatorially hard in general, we give exact algorithms exploiting the fixed 24-tonality major/minor universe. On 31,032 LMD Chords sequences, tonal parsimony preserves the transition optimum while reducing tonal vocabulary in 55.8% of cases. With weighted jazz-substitution closure, it lowers mean tonalities from 3.802 to 3.206 and modulations from 16.728 to 12.141. On 1,555 annotated jazz standards, it improves compatible chord-scale agreement to 95.6%, supporting tractable professional-scale harmonic analysis.
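The lexicographic objective in this abstract (first minimize modulations, then minimize the number of distinct tonalities) can be illustrated with a naive sketch. It assumes each chord comes with a non-empty set of compatible keys, uses a standard dynamic program for the modulation count, and then enumerates key subsets by increasing size. This brute-force enumeration is an illustration only and is not the exact algorithm of the paper; the function names and the toy input are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple


def min_modulations(compat: List[Set[str]], allowed: Set[str]) -> Optional[int]:
    """Fewest key changes when every chord must take a key from `allowed`."""
    cost = {k: 0 for k in compat[0] & allowed}
    if not cost:
        return None
    for keys in compat[1:]:
        usable = keys & allowed
        if not usable:
            return None  # some chord has no compatible key in this vocabulary
        best_prev = min(cost.values())
        # Either stay in key k (no cost) or modulate from the best previous key (+1).
        cost = {k: min(cost.get(k, float("inf")), best_prev + 1) for k in usable}
    return min(cost.values())


def tonal_parsimony(compat: List[Set[str]]) -> Tuple[int, Set[str]]:
    """Smallest key vocabulary that still achieves the minimum modulation count."""
    universe = set().union(*compat)
    target = min_modulations(compat, universe)
    for size in range(1, len(universe) + 1):
        for subset in combinations(sorted(universe), size):
            if min_modulations(compat, set(subset)) == target:
                return target, set(subset)
    raise ValueError("no feasible key assignment")


# Toy sequence: each set lists the keys one chord is compatible with.
chords = [{"C", "G"}, {"C"}, {"G", "F"}, {"C", "F"}]
result = tonal_parsimony(chords)
```

On the toy input, a transition-only analysis could pass through three keys, while the parsimonious one reaches the same modulation count with two. The subset enumeration grows exponentially with the number of keys, so it is only practical for small toy vocabularies like the one shown.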

2606.03453 2026-06-03 cs.CR cs.AI cs.MA 版本更新

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

FORGE:多智能体渐进式利用与检测工程

Farooq Shaikh

发表机构 * Dynatrace

AI总结 提出多智能体系统FORGE,通过渐进式利用深度桥接漏洞利用生成、优先级排序和检测规则工程三个孤立领域,在603个CVE上实现67.8%的端到端L1+利用,并生成低误报的Sigma和Snort检测规则。

Comments 18 pages, 4 figures, 3 tables. Accepted at the AgentCy Workshop at the 21st International Conference on Availability, Reliability and Security (ARES 2026). Keywords: Vulnerability assessment, Multi-agent systems, Exploit generation, Detection engineering, Risk prioritization

详情
AI中文摘要

漏洞披露数量现已远超组织评估能力,然而三个相邻研究社区(概念验证生成、漏洞优先级排序和检测规则工程)基本上各自为政。现有的自动利用生成系统报告二进制的通过/失败结果,丢弃了部分进展,并且对另外两个社区不产生任何信号。本文提出了FORGE,一个多智能体系统,通过渐进式利用深度来桥接这三个孤岛。五个专门智能体(情报、生成器、规划器、利用和检测器)在一个固定流水线中执行,该流水线(1)从CVE元数据生成目标易受攻击的应用程序,(2)进行指导性的多轮利用,由LLM主预言机根据四级分类法(L0:无证据到L3:完全入侵)评估,以及(3)生成基于OpenTelemetry利用痕迹的Sigma和Snort检测规则。渐进式深度是桥接机制:更深的利用为检测工程提供更丰富的行为痕迹,而跨评分区间的深度数据为优先级排序验证提供真实依据。分层知识架构跨评估累积情报,将构建和利用经验传递给后续CVE。在CVE-GENIE数据集的603个CVE上评估,跨8种语言和187种CWE类型,以每个CVE 1.50美元的成本实现了67.8%的端到端L1+利用。无论EPSS或CVSS区间如何,利用率保持在接近68%,表明模式级可达性与基于元数据的优先级排序正交。来自L2+利用的检测规则实现了显著高于L1衍生规则的跨度归一化基础(p=0.035),并且93.4%的生成Snort规则对合成良性语料库产生零误报。

英文摘要

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.

2606.03444 2026-06-03 cs.CV cs.AI 版本更新

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

PRISM: 通过自组织专家专业化协同视觉基础模型

Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出PRISM框架,采用双流混合专家(MoE)架构,通过两阶段范式(先解构专家知识使其专业化,再动态重组为任务特定路径)解决视觉基础模型集成中的负迁移问题,在PASCAL-Context和NYUD-v2上达到新最优。

Comments Accepted to ICML 2026

详情
AI中文摘要

将多种视觉基础模型(VFM)的互补优势统一到单个高效模型中是非常理想的,但受到整体蒸馏中固有的负迁移的挑战。为了解决这些特征冲突,我们引入了PRISM,一种新颖的双流混合专家(MoE)框架,通过模块化专业化协同VFM。我们提出了一个两阶段范式:(1)专业知识解构,其中教师条件路由器引导专家在不同的表示子空间中专业化以减轻干扰,然后(2)动态重组,其中路由器学习将这些专家组装成针对下游任务定制的计算路径。在PASCAL-Context和NYUD-v2上的实验表明,PRISM建立了新的最先进水平,验证了稀疏、涌现的专业化是集成多样化视觉知识的可扩展方法。

英文摘要

Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.

2606.03435 2026-06-03 cs.AI 版本更新

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

CP-Agent: 化学扰动下细胞形态学轮廓的上下文感知多模态推理

Yuxin Zhang, Yiyao Li, Ping Shu Ho, Simon See, Zhenqin Wu, Kevin Tsia

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong(香港大学电子与计算机工程系) School of Computing and Data Science, The University of Hong Kong(香港大学计算与数据科学学院) School of Biomedical Engineering, The University of Hong Kong(香港大学生物医学工程学院) Nvidia AI Technology Center(NVIDIA人工智能技术中心) Advanced Biomedical Instrumentation Centre(先进生物医学仪器中心)

AI总结 提出CP-Agent,一种基于上下文感知对齐模块CP-CLIP的多模态大语言模型,用于生成药物扰动下细胞形态变化的可解释机制性解释,实现高精度处理与机制区分(最大F1分数0.896),并整合工具使用与推理生成结构化报告以加速药物发现。

Comments ICLR 2026

详情
AI中文摘要

Cell Painting结合多重荧光染色、高内涵成像和定量分析,生成高维表型读数,以支持多种下游任务,如作用机制(MoA)推断、毒性预测和药物-疾病图谱构建。然而,现有工作流程缓慢、昂贵且难以解释。药物筛选建模方法主要侧重于分子表示学习,而忽略了实际实验上下文(例如细胞系、给药方案等),限制了泛化性和MoA分辨率。我们引入了CP-Agent,一种智能多模态大语言模型(MLLM),能够为药物扰动下的细胞形态变化生成与机制相关、人类可解释的理由。其核心是CP-Agent利用上下文感知对齐模块CP-CLIP,该模块联合嵌入高内涵图像和实验元数据,以实现稳健的处理和MoA区分(达到最大F1分数0.896)。通过将CP-CLIP输出与智能工具使用和推理相结合,CP-Agent将理由编译成结构化报告,以指导实验设计和假设优化。这些能力凸显了CP-Agent通过实现更可解释、可扩展和上下文感知的表型筛选来加速药物发现的潜力——简化药物发现中假设生成的迭代循环。

英文摘要

Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.

2606.03432 2026-06-03 cs.CR cs.AI cs.LG 版本更新

A Hybrid Approach For Malware Classification Using Secondary Features Fusion

一种使用二次特征融合的恶意软件分类混合方法

Raja Khurram Shahzad, Muhammad Mustaqeem, Haroon Elahi

AI总结 提出一种通过融合API调用和n-gram特征,并采用投票集成算法进行恶意软件检测与家族分类的方法,在Microsoft数据集上达到99.72%准确率和0.989 AUC。

详情
AI中文摘要

恶意软件(无论是变种还是新型)的数量正在迅速增加,使得恶意软件检测和缓解成为一个复杂的问题。改善恶意软件缓解的一种方法是自动检测和恶意软件家族分类。然而,传统的恶意软件检测方法无法将检测到的恶意软件分类到各自的家族中,阻碍了有效的恶意软件缓解。因此,本文提出了一种自动化恶意软件检测并将检测到的恶意软件分类到相应恶意软件家族的方法。所提出的方法在提取相关恶意软件特征(如API调用、固定和可变长度n-gram)后,使用自定义特征选择方法进行特征融合。此外,对于预测模型,提出了一种基于投票的算法融合方法。为了对所提出的方法进行实验评估,对Microsoft提供的数据集应用了二分类和多分类方法。最后,将实验结果与现有技术进行了比较。实验结果表明了所提出方法的有效性和效率,AUC为0.989,准确率为99.72%,对数损失为0.01。

英文摘要

The number of malware samples (whether variants or novel) is rapidly increasing, making malware detection and mitigation a complex problem. One approach to improving malware mitigation is automatic detection and malware family classification. However, traditional malware detection methods cannot classify detected malware into their respective families, hindering effective malware mitigation. Consequently, this paper proposes a method to automate malware detection and the classification of detected malware into their respective malware families. The proposed method uses feature fusion after extracting relevant malware features such as API calls and fixed- and variable-length n-grams with a customized feature selection method. Moreover, for the predictive model, a voting-based approach is proposed for algorithm fusion. For the experimental evaluation of the proposed method, both binary and multi-class classification approaches are applied to the data set provided by Microsoft. Finally, the experimental results are compared with the state of the art. The experimental results indicate the effectiveness and efficiency of the proposed approach with an AUC of 0.989, accuracy of 99.72%, and a log loss of 0.01.
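Two of the ingredients named in the abstract, fixed-length n-gram features and voting-based algorithm fusion, can be pictured with a minimal sketch. It assumes byte-level n-grams and hard (majority) voting; the abstract does not state which voting rule or base classifiers are used, and `byte_ngrams`, `majority_vote`, and the family labels are hypothetical.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count fixed-length byte n-grams in a binary blob."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def majority_vote(predictions: list[str]) -> str:
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Toy 5-byte sample; the 2-gram b"\x90\x90" appears twice.
grams = byte_ngrams(b"\x90\x90\xeb\x90\x90", 2)

# Three hypothetical base classifiers vote on the malware family.
label = majority_vote(["family_a", "family_b", "family_a"])
```

Soft voting (averaging class probabilities) would be an equally plausible reading of "voting-based"; the hard-voting form is shown only because it is the shortest to state.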

2606.03430 2026-06-03 cs.CR cs.AI 版本更新

FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems

FlowGuard: 基于流匹配的能源系统入侵检测系统中无数据模型窃取攻击的身份无关检测

Maxime Schwarzer, Laurin Holz, Tobias Huerten, Johannes Loevenich, Thies Moehlenhof, Roberto Rigolin F. Lopes, Veit Hagenmeyer

发表机构 * CortAIx Labs, Thales Deutschland(CortAIx实验室,Thales德国) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 提出FlowGuard,一种基于流匹配的身份无关防御方法,通过检测查询是否属于分布外(OOD)来防御针对能源系统入侵检测系统的无数据模型窃取攻击,在单客户端和分布式Sybil场景下均保持稳定检测率。

详情
AI中文摘要

部署在能源基础设施中的人工智能入侵检测系统(IDS)容易受到模型窃取攻击,攻击者可以离线创建规避流量。当前针对模型提取的防御要么依赖于身份绑定的查询监控(对分布式攻击者Sybil无效),要么通过软标签扰动进行预测中毒(不适用于硬标签IDS部署)。因此,我们提出FlowGuard,一种基于流匹配的身份无关防御,在IDS处理之前将传入查询分类为分布外(OOD)。该方法利用了以下事实:为无数据模型窃取攻击合成的查询占据比真实网络流量更低维的流形,导致在使用基于合法数据训练的连续归一化流时,对数似然显著降低。我们在单客户端和分布式(100客户端Sybil)设置下,使用MAZE和DisGUIDE攻击评估了我们的方法,并与PRADA和FDINet进行了比较。当分布发生变化时,PRADA的检测率降至0%,而我们的防御在不依赖身份信息的情况下,在两种设置下均保持稳定的检测率。我们讨论了该方法的范围和局限性,并概述了在数据依赖攻击中的潜在应用。

英文摘要

Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS) deployed in energy infrastructure are vulnerable to model theft attacks, which allow adversaries to create evasive traffic offline. Current defences against model extraction rely either on identity-bound query monitoring, which is ineffective against distributed attackers (Sybil), or on prediction poisoning through soft-label perturbation, which is inapplicable to hard-label IDS deployments. Therefore, we propose FlowGuard, an identity-independent defence based on flow matching that classifies incoming queries as out-of-distribution (OOD) prior to IDS processing. This approach exploits the fact that queries generated synthetically for data-free model stealing attacks occupy a lower-dimensional manifold than real network traffic. This results in measurably lower log-likelihoods when using a Continuous Normalizing Flow that has been trained on legitimate data. We evaluate our method against PRADA and FDINet using MAZE and DisGUIDE attacks in single-client and distributed (100-client Sybil) settings. While PRADA's detection rate dropped to 0% when the distribution changed, our defence maintained a stable detection rate across both settings without relying on identity information. We discuss the scope and limitations of the approach, and outline potential applications to data-dependent attacks.
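The detection rule described above, flagging queries whose log-likelihood under a density model trained on legitimate data is unusually low, can be sketched as follows. A 1-D Gaussian stands in for the Continuous Normalizing Flow, which is a deliberate simplification; the calibration rule (a low quantile of legitimate scores) and all function names are assumptions, not details from the paper.

```python
import math
import statistics

def gaussian_log_likelihood(x: float, mean: float, std: float) -> float:
    """Log-density of a 1-D Gaussian; a stand-in for the trained flow model."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def calibrate_threshold(legit_scores: list[float], fpr: float) -> float:
    """Pick a log-likelihood that roughly `fpr` of legitimate scores fall below."""
    ordered = sorted(legit_scores)
    return ordered[int(fpr * len(ordered))]

def is_ood(score: float, threshold: float) -> bool:
    """Flag a query as out-of-distribution when its log-likelihood is too low."""
    return score < threshold

# Toy "legitimate traffic" feature values between -2.0 and 2.0.
legit = [0.1 * i for i in range(-20, 21)]
mean, std = statistics.fmean(legit), statistics.pstdev(legit)
scores = [gaussian_log_likelihood(x, mean, std) for x in legit]
threshold = calibrate_threshold(scores, fpr=0.05)

flagged = is_ood(gaussian_log_likelihood(10.0, mean, std), threshold)
```

Because the decision uses only the query itself, no per-client state is needed, which is what makes this style of check independent of identity.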

2606.03428 2026-06-03 cs.NE cs.AI cs.LG 版本更新

PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers

PrimeSVT: 一种具有优先压缩策略的自动化内存感知剪枝框架用于脉冲视觉Transformer

Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi(eBRAIN实验室,工程系,纽约大学(NYU)阿布扎比分校) New York University (NYU) Abu Dhabi, United Arab Emirates (UAE)(纽约大学(NYU)阿布扎比分校,阿拉伯联合酋长国(UAE))

AI总结 提出PrimeSVT框架,通过自动化结构化剪枝和优先压缩策略,在满足精度和内存约束下压缩脉冲视觉Transformer,实现内存节省26.68%且精度损失小于3%。

Comments 8 pages, 8 figures, 3 tables

详情
AI中文摘要

脉冲视觉Transformer(SViT)的大尺寸仍然阻碍其嵌入式实现,因此需要模型压缩。现有工作通过非结构化剪枝压缩SViT模型,这需要专门的硬件加速器来利用其特定的稀疏模式以最大化效率提升。此外,它们的手动方法需要大量设计时间来为每个网络找到合适的剪枝设置,因此这种方法不可扩展。为了解决这一限制,我们提出了PrimeSVT,一种新颖的框架,对预训练的SViT模型执行自动化的内存感知结构化剪枝,从而在推理期间最大化其效率提升,适用于广泛使用的计算架构。为此,PrimeSVT首先根据层的大小(即参数数量)对SViT层进行排序,根据它们在不同剪枝率下的鲁棒性识别目标剪枝层,然后利用这个顺序从最大层到最小层逐层顺序压缩模型(即所谓的优先压缩策略),同时考虑用户定义的约束(即可接受的精度和内存节省)。在每一层中,PrimeSVT基于L2范数值采用通道级滤波器剪枝,以结构性地移除不重要的权重。实验结果表明,PrimeSVT通过自动化单次剪枝节省了26.68%的内存,同时将精度保持在原始未剪枝SViT模型(73.3%)的3%以内(未微调时为70.3%,微调后为72.9%),从而满足了精度和内存约束。这些表明我们的PrimeSVT框架实现了SViT及其嵌入式实现的设计自动化。

英文摘要

The large sizes of Spiking Vision Transformers (SViTs) still hinder their embedded implementation, highlighting the need for model compression. State-of-the-art works compress SViT models through unstructured pruning, which needs specialized hardware accelerators for their specific sparsity patterns to maximize efficiency gains. Moreover, their manual approach requires a huge design time to find an appropriate pruning setting for each network, thus making this approach not scalable. To address this limitation, we propose PrimeSVT, a novel framework that performs automated memory-aware structured pruning on pre-trained SViT models, thereby maximizing their efficiency gains during inference amenable to widely-used computing architectures. To achieve this, PrimeSVT first sorts the SViT layers based on their sizes (i.e., number of parameters), identifies the targeted pruning layers based on their robustness under different pruning rates, then leverages this order for compressing the model layer-by-layer sequentially from the largest one to the smallest one (i.e., so-called prioritized compression policy), while considering the user-defined constraints (i.e., acceptable accuracy and memory saving). In each layer, PrimeSVT employs channel-wise filter pruning based on their L2-norm values to structurally remove the non-significant weights. Experimental results show that PrimeSVT saves 26.68% memory through automated single-shot pruning, while preserving accuracy within 3% (70.3% without fine-tuning and 72.9% with fine-tuning) from the original unpruned SViT model (73.3%), thus meeting the accuracy and memory constraints. These show that our PrimeSVT framework enables design automation for SViTs and their embedded implementation.
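The two mechanical steps in the abstract, ordering layers by parameter count and removing the channels with the smallest L2 norm, can be sketched in plain Python. This is a simplified illustration under stated assumptions: weights are nested lists rather than tensors, the layer names are hypothetical, and the robustness analysis and constraint checks of the actual framework are omitted.

```python
import math

def l2_norm(channel: list[float]) -> float:
    """Euclidean norm of one channel's weights."""
    return math.sqrt(sum(w * w for w in channel))

def prune_channels(weights: list[list[float]], prune_rate: float) -> list[list[float]]:
    """Drop the `prune_rate` fraction of channels with the smallest L2 norm."""
    n_drop = int(len(weights) * prune_rate)
    ranked = sorted(range(len(weights)), key=lambda i: l2_norm(weights[i]))
    dropped = set(ranked[:n_drop])
    return [ch for i, ch in enumerate(weights) if i not in dropped]

def layers_largest_first(layers: dict[str, list[list[float]]]) -> list[str]:
    """Order layer names by parameter count, largest first."""
    return sorted(layers, key=lambda name: -sum(len(ch) for ch in layers[name]))

# Hypothetical two-layer model: "mlp" has 9 parameters, "attn" has 8.
layers = {
    "attn": [[0.9, 0.1], [0.01, 0.02], [0.5, 0.5], [0.0, 0.03]],
    "mlp": [[0.2, 0.2, 0.2], [1.0, 0.0, 0.0], [0.01, 0.0, 0.0]],
}
order = layers_largest_first(layers)
pruned = prune_channels(layers["attn"], prune_rate=0.5)
```

Removing whole channels keeps the remaining weights dense, which is why structured pruning of this kind does not need sparsity-aware hardware.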

2606.03398 2026-06-03 cs.CL cs.AI 版本更新

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Transformer建模计数器语言中栈表示的因果证据

Nishit Singh

发表机构 * Birla Institute of Technology and Science, Pilani(比尔拉理工学院,皮拉尼)

AI总结 通过线性探针和消融实验,证明Transformer在计数器语言任务中学习的栈表示对其性能具有因果必要性。

Comments 8 pages, 8 figures

详情
AI中文摘要

形式语言已被证明是理解Transformer内部机制的有效途径。以往研究表明,在计数器语言上进行下一个词预测训练的Transformer会学习到与底层栈结构一致的表示。除了表示分析,本文还研究了这些表示的因果作用。我们训练线性探针从模型隐藏状态中预测每个词符处的栈深度,并从探针中提取主表示方向。从模型中消融该方向会导致序列准确率骤降至接近0%,这提供了强有力的经验证据,表明栈表示不仅是学习到的,而且对模型性能具有因果必要性。

英文摘要

Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysis, this paper investigates the causal role of these representations. Linear probes are trained to predict the stack depth at each token from the model's hidden states, and a principal representation direction is extracted from the probe. Ablation of this direction from the model causes sequential accuracy to collapse to near 0%, providing strong empirical evidence that the stack representation is not just learned, but is causally necessary for model performance.
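Ablating a direction from a hidden state is commonly done by orthogonal projection, h' = h - (h · d) d for a unit vector d. The abstract does not specify the exact ablation operator, so the sketch below assumes this projection form; `ablate_direction` and the toy vectors are illustrative.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(hidden: list[float], direction: list[float]) -> list[float]:
    """Remove the component of `hidden` that lies along `direction`."""
    unit = normalize(direction)
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]

hidden = [3.0, 4.0, 1.0]
probe_direction = [0.0, 2.0, 0.0]   # e.g. the weight vector of a linear probe
ablated = ablate_direction(hidden, probe_direction)
```

After ablation the hidden state carries no information along the probe direction, so a linear readout of stack depth along that direction returns the same value for every token.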

2606.03391 2026-06-03 cs.LG cs.AI cs.CL 版本更新

When Model Merging Breaks Routing: Training-Free Calibration for MoE

当模型合并破坏路由:MoE的无训练校准

Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对MoE架构中模型合并导致的路由崩溃问题,提出基于二阶曲率的无训练校准方法HARC,通过闭式解和共轭梯度法高效重对齐路由器,显著提升数学推理和代码生成性能。

详情
AI中文摘要

模型合并已成为一种无需重新训练即可整合多个LLM能力的成本效益方法。然而,现有的合并技术主要基于线性参数算术或优化,在应用于混合专家(MoE)架构时面临困难。我们识别出MoE合并中的一个关键失效模式,称为路由崩溃,其中合并后的路由器无法将令牌分派给合适的专家。路由崩溃源于非线性softmax和离散Top-k路由机制对合并引起的参数扰动的敏感性,这种敏感性进一步被MoE预训练期间施加的负载平衡约束放大。由于微调后的专家表现出不同的专长,即使是适度的错误路由也可能导致严重的性能下降。为解决此问题,我们提出Hessian感知路由器校准(HARC),一种无训练框架,利用二阶曲率信息重新对齐合并后的路由器。该方法采用闭式解,可通过无矩阵共轭梯度法高效求解。在数学推理和代码生成任务上的实验表明,HARC有效缓解了多种MoE合并基线中的路由崩溃,并带来了显著的性能提升。我们的代码可在 https://github.com/huangcb01/HARC 获取。

英文摘要

Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
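A matrix-free conjugate gradient solver needs only a function that returns the product A v, never the matrix A itself, which is what makes it attractive for curvature (Hessian-like) systems. The sketch below is the textbook method for a symmetric positive-definite system, not HARC's actual linear system; `toy_curvature_matvec` is a hypothetical 2x2 stand-in.

```python
from typing import Callable

Vector = list[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(matvec: Callable[[Vector], Vector], b: Vector,
                       tol: float = 1e-10, max_iter: int = 100) -> Vector:
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, starting from x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:     # squared residual small enough: converged
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def toy_curvature_matvec(v: Vector) -> Vector:
    """Product with A = [[4, 1], [1, 3]] without storing A as a matrix."""
    return [4 * v[0] + v[1], v[0] + 3 * v[1]]

solution = conjugate_gradient(toy_curvature_matvec, [1.0, 2.0])
```

In exact arithmetic the method converges in at most n steps for an n-dimensional system, so the 2x2 toy problem is solved after two iterations.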

2606.03385 2026-06-03 cs.RO cs.AI 版本更新

Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation

先抓取后规划与失败归因:一种用于精确且可泛化机器人操作的闭环两阶段框架

Jiahao Xu, Peiyuan Wang, Hanzhuo Zhang, Zihao Yu, Tianyu Fu, Hao Chen, Xuanhao Xiang, Jianbo Yu, Chenchen Fu, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University, China(东南大学计算机科学与工程学院)

AI总结 提出GTP-FA框架,通过任务导向的两阶段抓取-规划流程和失败归因模型,在抓取和规划模块中分别注入任务先验和风险惩罚以及针对高风险初始状态进行数据收集和微调,显著提升机器人操作任务的成功率。

Comments 32 pages, project page: https://sites.google.com/view/gtp-fa/

详情
AI中文摘要

在机器人操作中,抓取与运动规划之间的紧密耦合常常掩盖失败的真实原因,导致低效的试错过程。为了实现高效的长时域操作,我们提出了GTP-FA(先抓取后规划与失败归因),一种面向任务的两阶段抓取-规划框架,该框架生成抓取候选并根据所选抓取执行下游运动规划。给定失败的操作轨迹,我们学习一个失败归因模型,该模型可泛化到未见过的抓取,并生成失败模式的稳定分布以进行诊断引导的优化。基于这些归因结果,我们以诊断驱动的方式优化两个模块:在抓取侧,我们将任务级先验和风险惩罚注入抓取候选评分和优化中,以抑制不稳定或与任务不兼容的抓取;在规划侧,我们通过数据收集和微调针对高风险初始状态,以解决真正的规划瓶颈。我们在仿真和真实机器人实验中评估了所提出的框架,并表明GTP-FA在基于RL、IL、扩散策略和VLA的设置中提升了相应的基础学习器,实现了显著更高的总体任务成功率。

英文摘要

In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. Given a failed manipulation trajectory, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes for diagnosis-guided optimization. Based on these attribution results, we then optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. We evaluate the proposed framework in both simulation and real-robot experiments, and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.

2606.03381 2026-06-03 cs.CR cs.AI 版本更新

AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses

AI模型提取攻击:绕过防御中的单客户端假设

Maxime Schwarzer, Johannes F. Loevenich, Gustavo Sánchez, Laurin Holz, Thies Möhlenhof, Tobias Hürten, Roberto Rigolin F. Lopes, Veit Hagenmeyer

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Zurich(苏黎世大学) University of Tübingen(图宾根大学)

AI总结 本文通过提出CerberusAI框架,系统性地证明模型提取攻击中的单客户端假设(SCA)在高级持续性威胁(APT)等协同攻击者面前无效,并展示基本轮询查询分布策略即可绕过PRADA等防御机制,呼吁转向有状态、独立于身份的防御架构。

详情
AI中文摘要

确保部署在军事指挥控制(C2)系统和关键基础设施中的人工智能(AI)模型的保护对于维持信息优势至关重要。模型提取攻击(MEA)构成了重大威胁,因为它们使对手能够复制专有模型、泄露受保护信息并准备离线对抗性攻击。然而,当前的防御策略主要依赖于单客户端假设(SCA),即隐含地假设攻击源自孤立身份。本工作系统地证明了在协同威胁行为者(如高级持续性威胁APT)存在的情况下,SCA从根本上无效。我们引入了一个模块化、开源框架CerberusAI,用于可复现的模型窃取研究,并利用它模拟分布式攻击场景。我们的实证评估表明,成熟的防御机制(如防止深度神经网络模型窃取攻击PRADA)可以通过基本的轮询查询分布策略被绕过,导致检测性能显著下降。此外,我们证明即使是全局聚合方法也可以通过自适应流量混合使其在操作上变得无用。这些结果强调了在模型提取攻击领域需要向有状态、独立于身份的防御架构进行范式转变。本文最初发表于由信息系统技术(IST)科学与技术委员会IST-224-RSY组织的国际军事通信与信息系统会议(ICMCIS),该会议于2026年5月12-13日在英国巴斯举行,并获得了最佳论文奖。

英文摘要

Ensuring the protection of Artificial Intelligence (AI) models deployed in military Command and Control (C2) systems and critical infrastructure is essential for maintaining information superiority. Model Extraction Attacks (MEAs) pose a significant threat, as they enable adversaries to replicate proprietary models, compromise protected information, and prepare offline adversarial attacks. However, current defense strategies predominantly rely on the Single Client Assumption (SCA), which is the implicit assumption that attacks originate from isolated identities. This work systematically demonstrates that the SCA is fundamentally invalid in the presence of coordinated threat actors, such as Advanced Persistent Threats (APTs). We introduce a modular, open-source framework called CerberusAI for reproducible model-stealing research, and use it to simulate distributed attack scenarios. Our empirical evaluation shows that well-established defense mechanisms, such as Protecting Against Deep Neural Network Model Stealing Attacks (PRADA), can be bypassed by basic round-robin query distribution strategies, resulting in a significant reduction in detection performance. Furthermore, we demonstrate that even global aggregation approaches can be rendered operationally useless through adaptive traffic mixing. These results highlight the need for a paradigm shift towards stateful, identity-independent defense architectures in the field of model extraction attacks. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee (IST-224-RSY) and held in Bath, United Kingdom, on 12-13 May 2026, where it won the best paper award.

2606.03357 2026-06-03 cs.CL cs.AI 版本更新

The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

未抽样的真相:SLM 中的心理测量衡量的是提示伪影,而非心理构念

Nils Schwager, Christoph Hau, Simon Münker, Achim Rettinger

发表机构 * Trier University(特里尔大学)

AI总结 通过提示变异框架分离语义信号与提示伪影,发现小型语言模型在心理测量中主要反映提示遵从性而非模拟心理特质,并提供了诊断工具。

Comments 10 pages, 5 figures, 3 tables

详情
AI中文摘要

当使用小型语言模型进行心理测量评估时,研究人员假设输出反映了语义推理。我们使用一个提示变异框架,在13个开放权重模型(0.6B到14B参数)上评估了这一前提,该框架将语义信号与提示伪影分离。通过系统地改变角色、指令、项目和选项符号,我们发现伪影方差经常压倒语义信号。在这些情况下,模型主要反映提示遵从性,而非模拟的心理特质。虽然这些发现限制了SLM在心理测量中的效用,但我们的框架提供了一个诊断工具,用于识别破坏性伪影并隔离语义理解,以用于未来的前沿模型研究。

英文摘要

When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that separates semantic signals from prompt artifacts. By systematically varying personas, instructions, items, and option symbols, we find that artifactual variance frequently overpowers the semantic signal. In these cases, models predominantly reflect prompt compliance rather than simulated psychological traits. While these findings limit SLM utility in psychometrics, our framework provides a diagnostic tool to identify destructive artifacts and isolate semantic understanding for future frontier-model research.

2606.03348 2026-06-03 cs.CV cs.AI 版本更新

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

SynCred-Bench: 评估AI生成视觉虚假信息中的合成可信度

Junxiao Yang, Minghao Zhang, Xiaoce Wang, Haoran Liu, Shiyao Cui, Hongning Wang, Minlie Huang

发表机构 * The Conversational AI (CoAI) group, DCST, Tsinghua University(清华大学计算机科学与技术系对话式人工智能(CoAI)课题组)

AI总结 提出SynCred-Bench基准,包含600个AI生成的虚假信息图像,涵盖六种可信形式和七种传播风格,并引入FP450真实图像负集,评估显示现有系统在5%假阳性率下真阳性率极低,表明合成可信度是一个严重且未被充分探索的视觉虚假信息挑战。

详情
AI中文摘要

最近的生成模型能够生成带有逼真嵌入文本和布局的视觉制品,创造了一种新的虚假信息威胁:合成可信度。我们引入了SYNCRED-Bench,一个包含600个AI生成的虚假信息图像的基准,这些图像在六种可信形式类别和七种细粒度传播风格上平衡分布,同时还有FP450,一个用于测量假阳性的真实图像负集。广泛评估表明,现有系统仍然不可靠:在5%假阳性率约束下,15个多模态大语言模型仅达到10.5%的真阳性率,开源AIGC检测器达到不到5%,商业API达到57.6%。人类标注者也难以识别合成可信度,仅达到63%的真阳性率。这些发现将合成可信度确立为一个严重且未被充分探索的视觉虚假信息挑战,并提供了一个基准,用于开发超越表面可信度线索进行推理的检测器。

英文摘要

Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SYNCRED-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fine-grained circulation styles, together with FP450, a real-image negative set for measuring false positives. Extensive evaluation shows that existing systems remain unreliable: under a 5% false-positive-rate constraint, 15 MLLMs achieve only 10.5% true positive rate (TPR), open-source AIGC detectors achieve less than 5%, and commercial APIs reach 57.6%. Human annotators also struggled to identify synthetic credibility, reaching only 63% TPR. These findings establish synthetic credibility as a severe and underexplored visual misinformation challenge, and provide a benchmark for developing detectors that reason beyond superficial credibility cues.
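The headline metric, true positive rate under a 5% false-positive-rate constraint, can be computed by choosing the detection threshold on the real-image (negative) scores and then counting how many synthetic images exceed it. The sketch below is one standard way to do this; the scores are made up, and the benchmark's exact thresholding procedure is not stated in the abstract.

```python
def tpr_at_fpr(pos_scores: list[float], neg_scores: list[float], max_fpr: float) -> float:
    """TPR when the threshold lets at most `max_fpr` of negatives be flagged."""
    allowed = int(max_fpr * len(neg_scores))        # negatives we may flag
    ranked = sorted(neg_scores, reverse=True)
    # Flag only scores strictly above the (allowed + 1)-th highest negative,
    # so no more than `allowed` negatives can exceed the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(s > threshold for s in pos_scores)
    return hits / len(pos_scores)

# Toy detector scores: 20 real images (negatives) and 5 synthetic ones.
neg_scores = [i / 20 for i in range(20)]            # 0.0, 0.05, ..., 0.95
pos_scores = [0.99, 0.95, 0.92, 0.5, 0.3]
tpr = tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.05)
```

With 20 negatives and a 5% budget, exactly one real image may be flagged, which fixes the threshold at the second-highest negative score.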

2606.03347 2026-06-03 cs.LG cs.AI stat.ML 版本更新

AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking

AugMask: 通过随机增强和掩码在不完整表格数据上训练扩散模型

Jungkyu Kim, Taeyoung Park, Kibok Lee

发表机构 * KAIST(韩国科学技术院)

AI总结 提出AugMask训练框架,通过条件随机增强和仅对观测坐标去噪,使标准扩散模型适应缺失表格数据,并连接Rao-Blackwellized目标实现方差加权惩罚,优于专门处理缺失的基线。

详情
AI中文摘要

基于分数的扩散模型已成为突出的深度生成模型;然而,它们在表格数据上的应用仍然具有挑战性,因为其主干网络假设输入完全指定,而现实世界的表格数据通常包含缺失值。我们提出了AugMask,一个即插即用的训练框架,通过将条件与监督分离,使对缺失不敏感的主干网络适应不完整数据。AugMask 1) 使用轻量级辅助模型通过条件随机增强构建数值输入,2) 仅对观测坐标应用去噪监督。实际上,增强的缺失条目作为不确定的条件上下文,而不是训练目标。我们将此训练规则与Rao-Blackwellized目标联系起来,并表明对缺失条目进行边缘化会产生方差加权的敏感性惩罚,从而阻止对不确定补全的过度依赖。在多种数据集和缺失机制下,AugMask使基于扩散的标准表格生成器优于专门处理缺失的基线方法。

英文摘要

Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data by separating conditioning from supervision. AugMask 1) constructs numeric inputs via conditional stochastic augmentation using lightweight auxiliary models, and 2) applies denoising supervision only to observed coordinates. In effect, augmented missing entries serve as uncertain conditioning context rather than training targets. We connect this training rule to a Rao--Blackwellized objective and show that marginalizing missing entries yields a variance-weighted sensitivity penalty, discouraging over-reliance on uncertain completions. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to outperform specialized missing-aware baselines.
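The second ingredient, applying denoising supervision only to observed coordinates, amounts to masking the loss. The sketch below assumes a noise-prediction parameterization with a mean-squared-error loss, which the abstract does not confirm; `masked_denoising_loss` and the toy row are illustrative.

```python
def masked_denoising_loss(predicted_noise: list[float],
                          true_noise: list[float],
                          observed_mask: list[bool]) -> float:
    """Mean squared error computed over observed coordinates only."""
    terms = [(p - t) ** 2
             for p, t, m in zip(predicted_noise, true_noise, observed_mask) if m]
    return sum(terms) / len(terms) if terms else 0.0

# One table row with three numeric columns; the third was missing and
# filled by augmentation, so it is input context but not a training target.
row_pred = [0.5, -1.0, 2.0]
row_true = [0.0, -1.0, 0.0]
mask = [True, True, False]
loss = masked_denoising_loss(row_pred, row_true, mask)
```

The large error on the third column does not affect the loss, which is the sense in which augmented entries act as conditioning context rather than supervision.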

2606.03331 2026-06-03 cs.CL cs.AI 版本更新

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

评估大语言模型在真实世界消费设备维修问题上的有效性

Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Toronto(多伦多大学)

AI总结 本文通过引入包含991个真实维修问题的基准测试,评估六种大语言模型在英语和孟加拉语上的正确性、完整性、实用性和安全性,发现模型虽能提供有用帮助,但在高风险维修任务中仍不可靠。

详情
AI中文摘要

消费设备维修是大语言模型(LLMs)一个重要但尚未充分探索的测试平台。维修任务需要对不完整的问题描述、特定硬件的诊断、可操作的故障排除和安全关键决策进行推理,其中错误的建议可能导致设备损坏、电池危险或永久性数据丢失。我们引入了一个包含991个来自Reddit的真实世界维修问题的基准测试,涵盖手机维修、电脑维修和数据恢复,每个问题都配有技术人员编写的参考解决方案,并提供孟加拉语翻译以评估跨语言性能。我们使用四个维修特定标准(正确性、完整性、实用性和安全性)评估了六种最先进的LLMs在英语和孟加拉语上的表现。我们的结果表明,虽然LLMs可以提供有用的维修帮助,但在没有严格评估和明确安全保护措施的情况下,它们在高风险的真实世界维修任务中仍然不可靠。手机维修是最困难且对安全最敏感的领域,所有模型在板级诊断、维修优先级排序和安全恢复程序方面都犯了重大错误。跨领域和模型,孟加拉语回答的表现始终低于英语回答。在评估的模型中,GPT-5.4整体表现最佳。

英文摘要

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

2606.03330 2026-06-03 cs.LG cs.AI cs.CR 版本更新

FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences

FLIPS:通过伪随机序列为LLMs进行实例指纹识别

Gurvan Richardeau, Gohar Dashyan, Erwan Le Merrer, Gilles Tredan

发表机构 * Inria(法国国家信息与自动化研究所)

AI总结 提出FLIPS方法,利用生成的二进制随机序列中的偏差,在237个模型实例上实现96%(闭集)和90%(开集)的识别准确率,解决了现有指纹识别技术无法区分同一LLM不同配置的问题,为AI监管提供了实例级指纹识别新范式。

Comments 20 pages, 20 figures, 3 tables. 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

文献揭示,大型语言模型(LLM)的行为不仅受其原始权重影响,还受其实例级参数(如指令提示、采样配置或量化)影响。在一种配置下生成安全输出的模型,在另一种配置下可能产生有毒内容。然而,当前的LLM识别技术(如指纹识别)侧重于知识产权保护,其设计倾向于对这些实例级参数的变化具有鲁棒性。这对AI监管构成了关键挑战,因为合规评估针对的是实际部署的行为,而非模型来源。在本文中,我们引入了实例级指纹识别,这是一种面向监管的范式,用于区分同一LLM的不同配置。我们的方法FLIPS利用生成的二进制随机序列中的偏差,在237个模型实例上达到96%(闭集)和90%(开集,其中一些目标未知)的识别准确率,而改编的LLMmap基线仅为35%。这表明实例级指纹识别对于监管既必要又实际可行。代码见 https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting。

英文摘要

Literature reveals that a Large Language Model's (LLM) behavior is conditioned not only by its original weights but also by its instance-level parameters, such as the instruction prompt, sampling configuration, or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance-level parameters. This poses a critical challenge for AI regulation, in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance-level fingerprinting, a regulator-oriented paradigm that distinguishes configurations of the same LLM. Our method, FLIPS, exploits biases in generated binary random sequences to reach 96% (closed-set) and 90% (open-set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance-level fingerprinting is both necessary for regulation and practically feasible. Code is available at https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting.
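The idea of exploiting biases in generated binary sequences can be pictured with a toy sketch: summarize a sequence by a few bias statistics and match it to the closest known profile. This is not the FLIPS method, whose features and classifier are not described in the abstract; the two statistics, the nearest-profile rule, and the instance names are all assumptions for illustration.

```python
def bias_features(bits: str) -> tuple[float, float]:
    """Fraction of ones, and fraction of adjacent positions that switch value."""
    ones = bits.count("1") / len(bits)
    switches = sum(a != b for a, b in zip(bits, bits[1:])) / (len(bits) - 1)
    return ones, switches

def nearest_instance(bits: str, profiles: dict[str, tuple[float, float]]) -> str:
    """Return the profile whose bias statistics are closest (squared distance)."""
    feats = bias_features(bits)

    def distance(name: str) -> float:
        return sum((f - p) ** 2 for f, p in zip(feats, profiles[name]))

    return min(profiles, key=distance)

# Hypothetical reference profiles for two configurations of the same model.
profiles = {"instance_a": (0.5, 0.9), "instance_b": (0.7, 0.4)}
guess = nearest_instance("1101110111", profiles)
```

A truly random sequence would sit near (0.5, 0.5) in this feature space, so any systematic departure from that point is a candidate fingerprint signal.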

2606.03329 2026-06-03 cs.AI 版本更新

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

InfoMem: 基于答案条件信息增益训练长上下文记忆智能体

Tiancheng Han, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出InfoMem奖励机制,通过评估最终记忆对真实答案的每token对数似然增益,训练分块记忆智能体以提升长上下文任务性能。

Comments 17 pages, 7 figures

详情
AI中文摘要

长上下文任务要求LLM从大上下文中识别并保留与答案相关的信息。分块记忆智能体通过顺序读取文档块、更新紧凑记忆并从累积记忆中生成最终答案来解决这一问题。然而,现有的基于RL的分块智能体要么依赖稀疏的最终答案奖励,要么使用词汇中间奖励来指导记忆和检索动作。这些信号监督任务成功或局部重叠,但不直接评估最终记忆是否支持真实答案。我们提出InfoMem,一种用于训练分块记忆智能体的奖励机制,该机制使用答案条件信息评估最终记忆的效用。InfoMem衡量最终记忆增加模型对真实答案的每token对数似然的程度。为了稳定RL优化,InfoMem仅对成功轨迹应用此信号,并在奖励组合前对其进行归一化。在相同的GRPO框架和训练预算下,InfoMem在长上下文记忆智能体性能上优于可比的记忆智能体RL基线。分析表明,有效的最终记忆奖励应作用于成功轨迹,在奖励组合前归一化,并基于答案而非查询进行条件化。我们的代码可从 https://github.com/GenSouKa1/InfoMem 获取。

英文摘要

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model's per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.
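The reward described above can be sketched in two steps: a per-token log-likelihood gain for the ground-truth answer, then normalization restricted to successful trajectories. The sketch assumes the gain is measured against a memory-free baseline and that normalization is a z-score within a rollout group; neither detail is given in the abstract, and the function names are hypothetical.

```python
import statistics

def per_token_gain(logp_with_memory: list[float],
                   logp_without_memory: list[float]) -> float:
    """Average per-token log-likelihood gain of the ground-truth answer."""
    diffs = [a - b for a, b in zip(logp_with_memory, logp_without_memory)]
    return sum(diffs) / len(diffs)

def normalized_gains(gains: list[float], successes: list[bool]) -> list[float]:
    """Z-score gains among successful trajectories; the rest receive 0."""
    kept = [g for g, ok in zip(gains, successes) if ok]
    if len(kept) < 2:
        return [0.0] * len(gains)
    mean, std = statistics.fmean(kept), statistics.pstdev(kept)
    if std == 0:
        return [0.0] * len(gains)
    return [(g - mean) / std if ok else 0.0 for g, ok in zip(gains, successes)]

# Token log-probabilities of a two-token answer, with and without memory.
gain = per_token_gain([-1.0, -2.0], [-2.0, -4.0])

# A group of three rollouts; the second one answered incorrectly.
rewards = normalized_gains([0.2, 0.6, 1.0], [True, False, True])
```

Gating on success keeps the signal from rewarding a memory that makes the right answer more likely but still leads to a wrong final output.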

2606.03326 2026-06-03 cs.AI 版本更新

The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

违规情境模式:一种用于合规违规的知识图谱模式

Nima Kamali Lassem, Fuqi Song, Seyid Amjad Ali

发表机构 * DiliTrust Department of Information Systems and Technologies, Bilkent University(信息系统与技术系,比尔肯特大学)

AI总结 提出违规情境模式(VSP),将合规检测中的违规实例化为持久化图节点,支持生命周期状态和审计历史,并通过法律实体合同图实例化四种道义规则验证其有效性。

详情
AI中文摘要

合规管道将违规检测为瞬态查询结果,而不将违规本身作为具有审查状态、受影响实体或审计历史的持久化图对象保留。违规情境模式(VSP)填补了这一空白。基于Gangemi和Mika的情境模式,VSP将每个检测到的违规具体化为一个图节点,包含规则标识符、时间有效性区间、生命周期状态以及与所涉及实体的证据链接。生命周期转换存储为不可变的、符合PROV-O的事件,因此审计历史成为图遍历。我们在法律实体和合同生命周期属性图中实例化VSP,并通过FCL->Cypher->MERGE管道操作四条道义规则(V1未授权签名、V2过期授权、V3缺失保密条款、V4缺失违约通知条款)。我们针对BODACC公司高管出版物检查V1和V2,在73个GDPRhub执法决定上评估V4,并对V3和V4运行SHACL跨形式主义检查。核心发现是规则体独立性:将V4从条款存在性检查扩展到截止日期检查,F1从0.312提升至0.602,而模式的标识、生命周期和证据语义保持不变。这分离了模式贡献与检测器贡献,因此检测逻辑可以演进而不使累积的审计历史失效。

英文摘要

Compliance pipelines detect violations as transient query results and do not keep the violation itself as a persistent graph object with review state, affected entities, or audit history. The Violation Situation Pattern (VSP) closes this gap. Building on the Situation pattern of Gangemi and Mika, VSP reifies each detected violation as a graph node with a rule identifier, a temporal validity interval, a lifecycle state, and evidence links to the entities involved. Lifecycle transitions are stored as immutable, PROV-O-aligned events, so audit history is a graph traversal. We instantiate VSP in a legal entity and contract lifecycle property graph and operationalize four deontic rules (V1 unauthorized signature, V2 expired mandate, V3 missing confidentiality clause, V4 missing breach-notification clause) through an FCL->Cypher->MERGE pipeline. We check V1 and V2 against BODACC corporate-officer publications, evaluate V4 on 73 GDPRhub enforcement decisions, and run a SHACL cross-formalism check on V3 and V4. The central finding is rule-body independence: extending V4 from clause-presence to deadline checking raises F1 from 0.312 to 0.602, while the pattern's identity, lifecycle, and evidence semantics stay the same. This separates a pattern contribution from a detector contribution, so detection logic can evolve without invalidating accumulated audit history.
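The pattern described in this abstract can be pictured with a small in-memory sketch. The paper works on a property graph through an FCL->Cypher->MERGE pipeline, which is not reproduced here. The class names, the field names, and the state labels ("detected", "under_review", "resolved") are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class LifecycleEvent:
    """Immutable transition record (loosely modelled on a PROV-O activity)."""
    from_state: str
    to_state: str
    actor: str
    at: datetime


@dataclass
class ViolationSituation:
    """A detected violation kept as a persistent object, not a query result."""
    rule_id: str                     # e.g. "V2" (expired mandate)
    valid_from: datetime             # start of the temporal validity interval
    valid_to: Optional[datetime]     # None means the interval is still open
    evidence: Tuple[str, ...]        # ids of the entities involved
    state: str = "detected"          # hypothetical initial lifecycle state
    history: list = field(default_factory=list)

    def transition(self, to_state, actor, at=None):
        """Record a lifecycle change as a new event; past events are never edited."""
        event = LifecycleEvent(self.state, to_state, actor,
                               at or datetime.now(timezone.utc))
        self.history.append(event)
        self.state = to_state
        return event
```

Because every change is appended as a frozen event, reading `history` in order plays the role that graph traversal plays in the paper: it reconstructs the audit trail without touching the detection rule that produced the violation.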

2606.03322 2026-06-03 cs.LG cs.AI 版本更新

Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification

多模态图神经网络与Transformer引导的自适应扩散用于临床前阿尔茨海默病分类

Jaeyoon Sim, Minjae Lee, Guorong Wu, Won Hwa Kim

发表机构 * Pohang University of Science and Technology(浦项科学技术大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出一种结合扩散核与多头注意力的图神经网络框架,通过Transformer引导自适应扩散过程,有效融合多模态特征,提升临床前阿尔茨海默病分类性能并识别关键脑区。

Comments 10 pages, Accepted to MICCAI 2024

详情
AI中文摘要

大脑的图形表示通过感兴趣区域(ROI)之间的关系为诊断和预测神经退行性疾病提供了关键见解。尽管近年来出现了各种图神经网络(GNN)来有效捕获关系信息,但在解释大脑网络方面仍存在固有局限性。具体而言,卷积方法无法有效聚合远邻域信息,而基于注意力的方法在捕获节点中心信息方面存在缺陷,特别是在保留关键节点的关键特征方面。这些不足揭示了从不同模态的不同特征中识别疾病特异性变化的挑战。为此,我们提出一个集成框架,通过下游Transformer引导每个节点的扩散过程,其中图的短程和长程属性分别通过扩散核和多头注意力进行聚合。我们通过使用多种模态改进临床前阿尔茨海默病(AD)分类的性能,证明了我们模型的优越性。此外,我们的模型能够熟练识别与AD临床前阶段密切相关的关键ROI,为疾病的早期诊断和预防提供了重要潜力。

英文摘要

The graphical representation of the brain offers critical insights into diagnosing and prognosing neurodegenerative disease via relationships between regions of interest (ROIs). Despite the recent emergence of various Graph Neural Networks (GNNs) that effectively capture relational information, there remain inherent limitations in interpreting brain networks. Specifically, convolutional approaches ineffectively aggregate information from distant neighborhoods, while attention-based methods exhibit deficiencies in capturing node-centric information, particularly in retaining critical characteristics from pivotal nodes. These shortcomings make it challenging to identify disease-specific variation from diverse features across different modalities. In this regard, we propose an integrated framework that guides the diffusion process at each node with a downstream transformer, where short- and long-range properties of graphs are aggregated via a diffusion kernel and multi-head attention, respectively. We demonstrate the superiority of our model by improving the performance of preclinical Alzheimer's disease (AD) classification with various modalities. Also, our model adeptly identifies key ROIs that are closely associated with the preclinical stages of AD, showing significant potential for early diagnosis and prevention of the disease.

2606.03312 2026-06-03 cs.RO cs.AI 版本更新

RobotValues: Evaluating Household Robots When Human Values Conflict

RobotValues: 当人类价值观冲突时评估家用机器人

Jongwook Han, Hyeongjin Kim, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院)

AI总结 提出RobotValues基准,通过10K个价值冲突场景评估家用机器人规划器,发现视觉语言模型存在默认价值偏好且难以覆盖,表明评估需考虑价值冲突下的行动选择。

详情
AI中文摘要

虽然家用机器人通常基于任务完成度进行评估,但日常家庭环境涉及价值冲突情境,其中机器人应选择优先考虑其他价值观(如人类自主性、效率或社会适宜性)而非任务成功的行动。然而,目前尚无评估机器人在此类场景中价值偏好的基准。我们引入RobotValues,一个在10K个价值冲突场景中评估家用机器人规划器的基准。每个实例包含一个逼真的家庭图像和多个优先考虑不同人类价值观的合理机器人动作。我们通过LLM辅助场景生成、利益相关者基于价值观提取、图像生成和自动质量控制构建RobotValues。使用RobotValues评估机器人领域使用的视觉语言模型,发现模型表现出默认价值偏好,包括安全性和适应性,而低估了隐私优先的行动。当模型被指示优先考虑与其自身偏好冲突的特定价值观时,它们通常无法覆盖默认行动,80%的时间选择了错误行动。这些发现表明,家用机器人评估不仅应衡量任务完成度或安全性合规性,还应衡量当人类价值观冲突时机器人是否能在合理行动中做出选择。

英文摘要

While household robots are often evaluated on task completion, everyday domestic environments involve value-conflicting situations in which robots are expected to choose actions that prioritize values other than task success, such as human autonomy, efficiency, or social appropriateness. Yet there are no benchmarks for evaluating robots' value preferences in such scenarios. We introduce RobotValues, a benchmark for evaluating household robot planners in 10K value-conflict scenarios. Each instance consists of a realistic household image with multiple plausible robot actions that prioritize different human values. We construct RobotValues through LLM-assisted scenario generation, stakeholder-grounded value extraction, image generation, and automatic quality control. Using RobotValues, we evaluate VLMs used in robotics and find that models exhibit default value preferences, including safety and accommodation, while underselecting privacy-prioritizing actions. When the models are instructed to prioritize specific values that conflict with their own preferences, they often fail to override their default actions, choosing incorrect actions 80% of the time. These findings suggest that household robot evaluation should measure not only task completion or safety compliance, but also whether robots can choose among plausible actions when human values conflict.

2606.03310 2026-06-03 cs.LG cs.AI 版本更新

Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis

学习多尺度超图用于高阶脑连接分析

Jaeyoon Sim, Soojin Hwang, Seunghun Baek, Guorong Wu, Won Hwa Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 提出自适应多尺度超边学习框架MuHL,通过构建层次节点特征并动态学习高阶交互,在多个脑网络基准上提升神经退行性疾病分类性能并识别关键脑区。

Comments 24 pages, Accepted to ICML 2026

详情
AI中文摘要

理解脑区之间的复杂交互对于早期神经退行性疾病(如阿尔茨海默病和帕金森病)的分类至关重要。虽然基于图的模型广泛用于分析脑网络,但大多数现有方法主要关注直接连接节点之间的成对交互,限制了其捕捉跨多个区域的高阶依赖关系的能力。尽管已有基于超图的方法来建模高阶关系,但许多方法依赖于预定义的超边或将学习限制在超边权重上,降低了灵活性并限制了其捕捉多分辨率结构模式的能力。为此,我们引入了一个自适应多尺度超边学习框架,即MuHL,该框架构建层次节点特征,并通过在多分辨率图信号上连续构建超边来动态学习高阶交互。在多个脑网络基准上的大量实验表明,MuHL在不同阶段持续提高了疾病分类性能,并从学习到的超边中识别出与疾病进展相关的关键感兴趣区域及其群体交互,突显了其作为神经退行性疾病脑网络分析强大工具的潜力。

英文摘要

Understanding complex interactions between brain regions is critical for early neurodegenerative disease classification such as Alzheimer's Disease (AD) and Parkinson's Disease (PD). While graph-based models are widely used to analyze brain networks, most existing approaches primarily focus on pairwise interactions between directly connected nodes, limiting their ability to capture higher-order dependencies across multiple regions. Although hypergraph-based methods have been proposed to model higher-order relations, many rely on predefined hyperedges or restrict learning to hyperedge weights, reducing flexibility and limiting their capacity to capture multi-resolution structural patterns. In this regard, we introduce an adaptive multi-scale hyperedge learning framework, i.e., MuHL, which constructs hierarchical node features and dynamically learns high-order interactions through continuous hyperedge construction over multi-resolution graph signals. Extensive experiments on multiple brain network benchmarks demonstrate that MuHL consistently improves disease classification performance across different stages, and further identifies key regions of interest (ROIs) and their group-wise interactions from the learned hyperedges that are associated with disease progression, highlighting its potential as a powerful tool for brain network analysis in neurodegenerative disorders.

2606.03305 2026-06-03 cs.AI 版本更新

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

基准审计中的可靠性差距:分布偏移和规模作为污染检测的失败模式

Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert

发表机构 * NASK National Research Institute(NASK国家研究院) Warsaw University of Technology(华沙理工大学) Gdańsk University of Technology(格但斯克理工大学)

AI总结 研究基准污染检测方法在分布偏移和规模约束下的可靠性,发现三种主流方法在335次评估中仅199次正确,揭示了受控验证与实际审计之间的系统性可靠性差距。

详情
AI中文摘要

基准污染,即评估示例出现在模型的训练数据中,威胁着LLM评估的有效性。存在用于检测训练数据成员身份的统计工具,但几乎仅在受控的学术环境中得到验证:大规模、同质的预训练语料库和透明、单阶段的训练流程。这些方法在实际审计场景中是否仍然可靠尚不清楚。我们识别了两种研究不足的失败模式:分布偏移,当可疑集和验证集违反IID假设时出现;以及规模约束,因为基准比预训练语料库小几个数量级。我们系统评估了三种主流范式:LLM数据集推断、事后数据集推断和CoDeC,涉及来自多个家族(包括Pythia、OLMo 2以及专门的文化和医学LLM)和规模(高达27B)的27个模型。然后我们将分析进一步扩展到前沿行业模型。在335次评估中,只有199次产生正确结果。LLM数据集推断在分布偏移下产生假阳性,事后数据集推断在基准规模下效力不足,而CoDeC仅提供粗略的来源信号,不足以验证单个基准分割。我们的结果揭示了受控验证与实际基准审计之间的系统性可靠性差距,并表明统计检测尚不能取代透明的数据来源。我们开源了我们的基准以供进一步研究。

英文摘要

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms: LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.

2606.03290 2026-06-03 cs.LG cs.AI 版本更新

Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective

消息调优优于图提示调优:棱镜空间视角

Yancheng Chen, Dun Ma, Shuai Zhang, Yang Liu, Xixun Lin, Xiangyu Zhao, Wenguo Yang, Wei Chen, Chuan Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出棱镜空间理论(PS-Theory)量化图提示调优的适应能力上限,并引入消息调优(MTG)方法,通过注入可学习消息原型超越该上限,实验验证其优越性。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于预训练与自适应范式的图基础模型(GFMs)已成为图学习的研究热点。对于基于GNN的GFMs,图提示调优已成为下游任务的主流自适应方法。尽管近期方法解释了图提示调优为何有效,但如何严格衡量其适应能力仍是一个开放问题。解决该问题对于理解图提示调优的能力极限以及开发更强大的自适应方法至关重要。本文提出棱镜空间理论(PS-Theory),一种新颖的数学框架,用于量化自适应方法的能力,同时重点建立图提示调优适应能力上限。基于所提出的PS-Theory,我们进一步引入GFMs的消息调优(MTG),一种轻量级方法,在GNN骨干网络的每一层注入少量可学习消息原型,以自适应地引导消息融合,无需更新预训练权重。通过我们的PS-Theory,我们证明MTG的适应能力可以超过图提示调优的理论上限。大量实验表明,MTG在多个基准数据集上始终优于图提示基线,为我们的理论发现提供了强有力的实证支持。

英文摘要

Graph Foundation Models (GFMs), built upon the Pre-training and Adaptation paradigm, have emerged as a research hotspot in graph learning. For GNN-based GFMs, graph prompt tuning has become the prevailing adaptation method for downstream tasks. Although recent methods explain why graph prompt tuning works, how to rigorously measure its adaptation capacity remains an open problem. Addressing this problem is critical for understanding the capability limits of graph prompt tuning and for developing more powerful adaptation methods. In this paper, we propose Prismatic Space Theory (PS-Theory), a novel mathematical framework to quantify the capacity of adaptation methods, while focusing on establishing the upper bound for the adaptation capacity of graph prompt tuning. Building upon the proposed PS-Theory, we further introduce Message Tuning for GFMs (MTG), a lightweight approach that injects a small set of learnable message prototypes into each layer of the GNN backbone to adaptively guide message fusion without updating pre-trained weights. Through our PS-Theory, we prove that the adaptation capacity of MTG can exceed the theoretical upper bound of graph prompt tuning. Extensive experiments demonstrate that MTG consistently outperforms graph prompt baselines across diverse benchmark datasets, providing strong empirical support for our theoretical findings.

2606.03288 2026-06-03 cs.CY cs.AI 版本更新

AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study

面向新手程序员的AI生成追踪:多机构研究中的学习效果与学习者差异

Yuri Noviello, Naaz Sibia, Anastasiia Birillo, Thomas Overklift Vaupel Klein, Michael Liut, Gosia Migut

发表机构 * Delft University of Technology(代尔夫特理工大学) University of Toronto(多伦多大学) JetBrains Research(JetBrains研究院)

AI总结 本研究提出AI生成的类比动画追踪(GATs),通过多机构实验比较其与文本解释对新手程序员学习程序执行的影响,发现GATs在即时学习上有选择性优势,但效果依赖情境且短暂,且受学习者参与度调节。

详情
AI中文摘要

入门编程(CS1)课程常常难以支持学生对程序执行的理解。虽然可视化可以使执行过程明确,但其有效性取决于设计和情境,而AI生成可视化的实证证据仍然有限。我们提出了生成动画追踪(GATs),即基于AI生成的、类比驱动的、配有旁白的动画,协调源代码、执行状态和概念类比。我们在两个机构的CS1课程中(Python,N=961;Java,N=151)进行了一项研究,比较GATs与文本解释。我们测量了即时学习表现和体验、课程结束时的参与度和考试成绩。结果表明,GATs可以在即时学习方面产生选择性优势,但优势取决于情境且是短期的。我们观察到GATs对表现的影响受到学习者参与度概况的调节。这一发现强调了个性化方法的重要性。

英文摘要

Introductory programming (CS1) courses often struggle to support students' understanding of program execution. While visualizations can make execution processes explicit, their effectiveness depends on design and context, and empirical evidence for AI-generated visualizations remains limited. We propose Generated Animated Traces (GATs): AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. We conduct a study at two institutions in CS1 courses (Python, N=961; Java, N=151) comparing GATs to textual explanations. We measure immediate learning performance and experience, end-of-course engagement, and exam performance. Results show that GATs can yield selective benefits for immediate learning, but the benefits are context-dependent and short-term. We observe that GATs' influence on performance is moderated by learner engagement profiles. This finding underscores the importance of personalized approaches.

2606.03273 2026-06-03 cs.CV cs.AI cs.CL 版本更新

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

VistaHop: 视觉深度搜索的多跳视觉推理基准

Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

发表机构 * East China Normal University(华东师范大学) Meituan(美团) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出VistaHop基准,通过多跳问答任务评估多模态大推理模型在视觉深度搜索中的迭代图像检查、视觉锚点定位和跨证据链推理能力,实验表明现有模型表现有限。

详情
AI中文摘要

视觉深度搜索要求多模态大推理模型(MLRM)智能体通过反复检查图像区域、将中间推理锚定在视觉证据上,并跨长推理链连接细粒度线索来回答复杂的视觉查询。然而,现有基准主要关注单步视觉理解或静态图像问答,对迭代图像检查、视觉锚点定位和多跳证据整合的评估有限。在这项工作中,我们引入了VistaHop,一个用于评估视觉深度搜索中以视觉为中心的搜索和多跳视觉推理的基准。VistaHop包含300张高分辨率图像、25个视觉搜索场景和350个多跳QA任务,这些任务要求模型跟随从视觉锚点出发的证据链,或融合跨多个基于图像的推理路径的信息。我们进一步开发了VistaArena,一个统一的评估环境,支持带有文本搜索、图像搜索、图像裁剪和基于证据的答案验证的工具增强推理。在七个代表性MLRM上的实验表明,当前模型远未解决VistaHop:最佳模型SenseNova-MARS-32B仅达到24.31%的Pass@1。这些结果揭示了在视觉定位、证据重访、长链推理和多锚点信息融合方面的持续局限性,凸显了对更强基准和训练方法的需求,以推动视觉深度搜索的发展。

英文摘要

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

2606.03270 2026-06-03 cs.LG cs.AI 版本更新

Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles

常见子结构可迁移吗?基于神经向量丛的黎曼图基础模型

Li Sun, Zhenhao Huang, Yiding Wang, Qin Chen, Pietro Lio, Philip S. Yu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对图结构迁移性理论缺失的问题,提出基于黎曼几何的神经向量丛框架GAUGE,通过内在几何学习实现可迁移子结构表征,在零样本链接预测和图同构任务中验证了优越性。

Comments Accepted by ICML 2026

详情
AI中文摘要

基础模型通过预训练-适应范式引发了革命,最近的研究将这一成功扩展到图。与其他模态不同,图包含丰富的结构模式,但其结构迁移性仍知之甚少。先前的研究考虑离散领域中的常见子结构,我们被一个基本问题所驱动:常见子结构可迁移吗?其背后的理论很大程度上未被探索。在这项工作中,我们转向通过功能行为的视角学习可迁移结构。理论上,我们将可迁移子结构与表示空间的内在几何联系起来。然而,表征这种内在几何很少被触及。基于黎曼几何,我们开发了一个称为神经向量丛的图内在几何学习框架,该框架能够用局部坐标解析内在几何。在此基础上,我们设计了GAUGE,一个可预训练的神经架构,它构建向量丛,展平几何兼容的局部坐标,以及一个新的狄利克雷损失,该损失也衡量迁移努力。我们通过实验验证了其在具有挑战性的任务(包括零样本链接预测和图同构)中的优越表现力。

英文摘要

Foundation models have sparked a revolution via a pretraining-adaptation paradigm, with recent efforts extending this success to graphs. Unlike other modalities, graphs contain rich structural patterns, yet their structural transferability remains poorly understood. Prior studies consider common substructures in the discrete realm, and we are motivated by a fundamental question: Are common substructures transferable? The underlying theory is largely underexplored. In this work, we shift toward learning transferable structures through the lens of functional behavior. Theoretically, we connect transferable substructures to intrinsic geometry of the representation space. However, characterizing such intrinsic geometry has rarely been touched. Grounded in Riemannian geometry, we develop a graph intrinsic geometry learning framework called Neural Vector Bundle, which enables parsing intrinsic geometry with local coordinates. Building on this, we design GAUGE, a pretrainable neural architecture that constructs the vector bundle, flattening geometrically compatible local coordinates, and a new Dirichlet loss, which also measures the transfer effort. We empirically validate its superior expressiveness in challenging tasks including zero-shot link prediction and graph isomorphism.

2606.03269 2026-06-03 cs.AI 版本更新

Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

从LLM中蒸馏答案集编程规则用于神经符号视觉问答

Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch

AI总结 提出从大语言模型中蒸馏答案集编程规则的方法,以可解释的方式扩展视觉问答系统的推理能力,仅需少量示例即可生成正确规则。

Comments Under consideration in Theory and Practice of Logic Programming (TPLP)

详情
AI中文摘要

视觉问答(VQA)是关于图像回答问题的任务,需要整合多模态输入和推理。将基于逻辑的表示纳入推理组件的模块化方法,相比端到端训练系统具有明显优势,尤其是在可解释性方面。然而,当任务需求变化时,调整或扩展这些表示可能会给开发者带来沉重负担。为了解决这一挑战,我们提出了一种从大语言模型(LLM)中蒸馏规则的方法。我们的方法提示LLM扩展一个初始的VQA推理理论(表示为答案集程序),以满足任务的新要求。VQA数据集中的示例指导LLM,验证结果,并通过利用ASP求解器的反馈帮助纠正错误规则。我们证明了该方法在多种VQA数据集上的有效性。值得注意的是,仅需少量示例即可从LLM中引出正确规则。我们的实验表明,从LLM中蒸馏规则是传统数据驱动规则学习方法的一种有前景的替代方案。正在考虑发表于《逻辑编程理论与实践》(TPLP)。

英文摘要

Visual Question Answering (VQA) is the task of answering questions about images, requiring the integration of multimodal input and reasoning. Modular approaches that incorporate logic-based representations into the reasoning component offer clear advantages over end-to-end trained systems, particularly in terms of interpretability. However, adapting or extending these representations when task requirements change can place a significant burden on developers. To address this challenge, we present an approach for distilling rules from Large Language Models (LLMs). Our method prompts an LLM to extend an initial VQA reasoning theory, expressed as an answer-set program, to meet new requirements of the task. Examples from VQA datasets guide the LLM, validate the results, and help correct erroneous rules by leveraging feedback from the ASP solver. We demonstrate that our approach is effective across diverse VQA datasets. Notably, only a few examples are needed to elicit correct rules from LLMs. Our experiments suggest that rule distillation from LLMs is a promising alternative to traditional data-driven rule learning approaches. Under consideration in Theory and Practice of Logic Programming (TPLP).

2606.03260 2026-06-03 cs.LG cs.AI 版本更新

EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs

EqGINO: 面向3D PDE的等变几何信息傅里叶神经算子

Sungwon Kim, Juho Song, Seungmin Shin, Guimok Cho, Sangkook Kim, Chanyoung Park

发表机构 * University of Texas at Austin(得克萨斯大学奥斯汀分校)

AI总结 提出EqGINO框架,通过在谱域强制执行各向同性,实现离散对称性的精确等变,并泛化到任意连续旋转,有效建模3D PDE的坐标不变物理规律。

Comments ICML 2026

详情
AI中文摘要

用于3D偏微分方程(PDE)的深度学习代理通常难以在几何变换下泛化,因为它们严重依赖于特定的坐标系。虽然等变网络提供了一种解决方案,但它们通常依赖于空间域中的局部操作,使得对PDE动力学至关重要的全局感受野计算成本高昂。相反,傅里叶神经算子(FNO)高效地捕获全局交互,但由于谱群卷积的过高成本,在其中建立3D等变性仍然不切实际。为弥合这一差距,我们引入了EqGINO,一个在谱域中强制执行各向同性的几何鲁棒框架。通过设计,EqGINO保证对离散化计算域固有的离散对称性具有精确等变性。除了这种离散保证外,我们的结构先验使得即使在有限数量的SE(3)变换训练样本下,也能有效泛化到任意连续方向。因此,我们的方法在复杂的非规则3D几何上鲁棒地建模坐标不变的物理定律。我们的代码可在 https://github.com/sung-won-kim/EqGINO 获取。

英文摘要

Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)-transformed training samples. Consequently, our method robustly models coordinate-invariant physical laws on complex irregular 3D geometries. Our code is available at https://github.com/sung-won-kim/EqGINO

2606.03257 2026-06-03 cs.NE cs.AI cs.LG 版本更新

PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers

PSViT:一种结构剪枝脉冲视觉Transformer的方法

Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi(eBRAIN实验室,工程系,纽约大学(NYU)阿布扎比分校) New York University (NYU) Abu Dhabi, United Arab Emirates (UAE)(纽约大学(NYU)阿布扎比分校,阿拉伯联合酋长国(UAE))

AI总结 提出PSViT方法,通过结构化剪枝(均匀通道滤波器和基于敏感性的细粒度剪枝)压缩脉冲视觉Transformer,在ImageNet-1K上实现22.4%内存节省且精度损失小于3%。

Comments 8 pages, 7 figures, 3 tables

详情
AI中文摘要

脉冲视觉Transformer(SViT)模型是很有前景的低功耗ViT模型,用于解决基于视觉的任务,具有最先进的性能。然而,它们的大尺寸限制了在资源受限的嵌入式平台上的部署,凸显了模型压缩的需求。一种突出的压缩技术是剪枝,最先进的工作采用非结构化剪枝技术来压缩SViT模型。这种技术需要专门针对稀疏模式定制的硬件架构才能最大化其效率优势,使得这种方法不可扩展。为了解决这个问题,我们提出了PSViT,一种对SViT模型进行结构化剪枝的新方法,从而使得利用现有且广泛使用的计算架构高效加速其推理成为可能。为此,PSViT采用了几个关键步骤:均匀通道滤波器剪枝以结构化消除非显著权重,敏感性分析以评估单层通道剪枝对精度和网络大小的影响,以及基于敏感性分析和给定网络架构的细粒度通道剪枝。实验结果表明,PSViT通过单次剪枝有效获得了22.4%的内存节省,同时在ImageNet-1K上保持高精度(未经微调为70.3%,经微调为72.8%),与原始未剪枝SViT模型(73.3%)相比精度损失在3%以内。这些结果还表明,PSViT方法推进了在资源受限应用中实现高效SViT部署的努力。

英文摘要

Spiking Vision Transformer (SViT) models are promising low-power ViT models for solving vision-based tasks with state-of-the-art performance. However, their large size limits their deployment on resource-constrained embedded platforms, underscoring the need for model compression. One prominent compression technique is pruning, and state-of-the-art works employ unstructured pruning to compress SViT models. Such techniques require specialized hardware architectures tailored to the sparsity patterns to maximize their efficiency benefits, which makes this approach hard to scale. To address this, we propose PSViT, a novel methodology for structured pruning of SViT models, which makes it possible to efficiently accelerate their inference on existing, widely used computing architectures. PSViT employs several key steps: uniform channel-wise filter pruning to structurally eliminate non-significant weights, sensitivity analysis to evaluate the impact of channel-wise pruning of individual layers on accuracy and network size, and fine-grained channel-wise pruning based on the sensitivity analysis and the given network architecture. Experimental results show that PSViT obtains 22.4% memory savings through single-shot pruning while keeping accuracy on ImageNet-1K within 3% of the original non-pruned SViT model (73.3%): 70.3% without fine-tuning and 72.8% with fine-tuning. These results also show that the PSViT methodology advances efforts toward efficient SViT deployment in resource-constrained applications.
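The first step named in this abstract, uniform channel-wise filter pruning, can be sketched as follows. The abstract does not state how "non-significant" weights are identified, so ranking channels by L1 norm is an assumption, and the function name and list-based weight layout are hypothetical. The sensitivity analysis and fine-grained per-layer ratios are not modelled.

```python
def prune_channels(weight, ratio):
    """Remove the lowest-scoring output channels of one layer.

    weight: list of output channels, each a flat list of floats.
    ratio:  fraction of channels to drop (the same value for every layer
            in the uniform setting).
    Returns (kept_indices, pruned_weight). Whole channels are dropped, so the
    result stays dense and needs no sparsity-aware hardware.
    """
    n_channels = len(weight)
    n_keep = max(1, n_channels - int(n_channels * ratio))
    # Assumed significance score: L1 norm of each channel's weights.
    scores = [sum(abs(w) for w in channel) for channel in weight]
    ranked = sorted(range(n_channels), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore the original channel order
    return keep, [weight[i] for i in keep]
```

Dropping an output channel in one layer also removes the matching input channel in the next layer, which is where the memory saving of structured pruning comes from. That bookkeeping is omitted here.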

2606.03252 2026-06-03 cs.RO cs.AI 版本更新

AirDreamer: Generalist Drone Navigation with World Models

AirDreamer: 基于世界模型的通用无人机导航

Zian Liu, Andong Yang, Chunkai Yang, Ruidong An, Chao Gao, Guyue Zhou

发表机构 * Institute for AI Industry Research, Tsinghua University, Beijing, China(人工智能产业研究院,清华大学,北京,中国) Department of Electronic Engineering, Tsinghua University, Beijing, China(电子工程系,清华大学,北京,中国) School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China(遥感与信息工程学院,武汉大学,武汉,中国)

AI总结 提出一种结合强化学习策略和世界模型理解的无人机导航框架,通过稀疏奖励函数避免局部最优,在复杂未知环境中实现优于基线5.3%的成功率,并支持零调参的仿真到现实迁移。

Comments 8 pages, 8 figures

详情
AI中文摘要

在未知且杂乱的环境中导航无人机需要可靠地泛化到未见过的场景布局,并理解与机器人能力相关的环境结构。先前的方法假设相同的环境配置,通常严重依赖人工设计的感知管道和预定义规则来引导机器人到达目标。这个过程依赖于环境,且跨环境泛化能力差。受动物导航行为启发,我们设计了一个导航框架,该框架在基于世界模型的环境理解之上使用基于强化学习的策略进行导航,以克服这些问题。此外,我们设计了一个无需手工塑造项的稀疏奖励函数,以避免局部极小值陷阱并鼓励偏航控制行为。在仿真和真实无人机上,我们的方法展现出在复杂未知环境中导航和逃离其他方法失败的局部最优的新兴能力。在具有挑战性的地图上,它比最佳基线实现了5.3%更高的导航成功率。此外,所提出的框架在部署期间无需任何调整即可实现有效的仿真到现实迁移。代码将公开。

英文摘要

Navigating a drone in unseen, cluttered environments requires reliable generalization to new scene layouts and an understanding of environmental structure relative to the robot's capabilities. Previous methods, which assume the same environment configuration, often rely heavily on human-designed perception pipelines and predefined rules to guide the robot toward the target. This process is environment-dependent and generalizes poorly across environments. Inspired by animal navigation behavior, we design a navigation framework that runs a reinforcement-learning-based policy on top of world-model-based environment understanding to overcome these issues. In addition, we design a sparse reward function without hand-crafted shaping terms to avoid local-minima traps and encourage yaw control behaviors. In simulation and on real drones, our method exhibits emergent capabilities for navigating complex, unseen environments and escaping local optima where other methods fail. On challenging maps, it achieves a 5.3% higher navigation success rate than the best baseline. Furthermore, the proposed framework achieves effective sim-to-real transfer without any tuning during deployment. The code will be publicly available.

2606.03251 2026-06-03 cs.AI cs.CV cs.LG eess.IV stat.ML 版本更新

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

现实世界数据集是否包含自然实验?基于因果特征选择的实证研究

Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

AI总结 本文利用因果发现和特征选择检测现实世界数据集中的自然实验,并通过干预性处理提升模型性能。

详情
AI中文摘要

在自然界中,影响某些个体或群体但不影响其他个体或群体的事件构成隐式干预,被称为自然实验。例如,COVID-19大流行是冠状病毒对感染COVID的亚群的一次干预。我们问:现有的现实世界数据集中是否存在自然实验?如果存在,我们应该如何处理它们?为了检测数据中的自然实验,我们使用因果发现恢复潜在因果图,并基于因果链接进行特征选择。如果通过将数据视为干预性而非观测性来提升下游性能,我们认为这表明数据集包含自然实验。我们首先通过使用合成图模拟包含和不包含自然实验的数据集来验证这一假设。然后,我们在大量现实世界数据集上进行系统的实证评估。我们的结果表明,现实世界数据集确实包含自然实验,我们可以利用这些自然实验通过因果推断来提升模型性能。我们的工作代表了该领域的初步探索,在有限范围内进行了初步研究。

英文摘要

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

2606.03238 2026-06-03 cs.LG cs.AI 版本更新

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

当RLHF失败时:奖励黑客、崩溃和评估者博弈的机制分类

Zelalem Abahana

发表机构 * First Citizens Bank(第一公民银行) Alma Mater Europaea University(欧洲大学)

AI总结 本文通过PPO、DPO等方法的对比实验,提出了一种基于奖励和评估者分数方向的机制分类法,将RLHF失败模式分类为可定位、可预测的训练动态。

Comments 20 pages, 8 figures; includes code, artifacts, and live demo

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)通过用学习到的可扩展代理替代未明确指定的人类目标,实现了大规模后训练。这种替代同时创建了一个结构化的失败面:优化可以提高学习到的奖励而外部质量下降,降低代理和评估者分数,揭示代理欠对齐,或产生评估者特定的分歧。我们展示了一个紧凑RLHF流程的实证失败模式研究,该流程包括近端策略优化(PPO)、直接偏好优化(DPO)、不确定性惩罚PPO(UP-PPO)、奖励模型不确定性、近似策略漂移、多样性和重复诊断,以及两个外部LLM评估者。我们不将奖励黑客视为单一终端事件,而是使用学习到的奖励、评估者分数和平均评估者分数的方向对检查点之间的匹配转换进行分类。在61个检查点行和1920个行级转换中,激进的PPO具有最高的局部奖励黑客率(14.45%;bootstrap 95% CI: 10.16-18.75),而UP-PPO在相同激进机制下产生较低率(11.33-10.94%)。转换前的逻辑回归模型以ROC-AUC 0.821预测未来行级奖励黑客,行级分析发现12个设置中有3个存在检查点平均值遗漏的局部奖励黑客。核心结论是方法论上的:RLHF失败不仅是最终模型病理,而且是可分类、可定位和部分可预测的训练动态。

英文摘要

Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical failure-mode study of a compact RLHF pipeline with proximal policy optimization (PPO), direct preference optimization (DPO), uncertainty-penalized PPO (UP-PPO), reward-model uncertainty, approximate policy drift, diversity and repetition diagnostics, and two external LLM judges. Rather than treating reward hacking as a single terminal event, we classify matched transitions between checkpoints using the directions of the learned reward, judge scores, and average judge score. Across 61 checkpoint rows and 1920 row-level transitions, aggressive PPO has the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16-18.75), while UP-PPO yields lower rates in the same aggressive regime (11.33-10.94%). A pre-transition logistic model predicts future row-level reward hacking with ROC-AUC 0.821, and row-level analysis finds localized reward hacking that checkpoint averages miss in 3 of 12 settings. The central conclusion is methodological: RLHF failures are not only final-model pathologies, but training dynamics that can be classified, localized, and partially anticipated.
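The direction-based classification described in this abstract can be sketched as a small decision function. The abstract names the failure modes but not the exact rules, so the label strings, the precedence order, the tolerance `tol`, and the reading of "proxy under-alignment" as reward falling while judges rise are all assumptions for illustration, not the paper's definitions.

```python
def classify_transition(d_reward, d_judge_a, d_judge_b, tol=0.0):
    """Label one checkpoint-to-checkpoint transition from signed score changes.

    d_reward:  change in the learned reward-model score.
    d_judge_a: change in the first external judge's score.
    d_judge_b: change in the second external judge's score.
    """
    d_avg = (d_judge_a + d_judge_b) / 2
    # Judges moving in opposite directions is treated as its own category.
    if (d_judge_a > tol and d_judge_b < -tol) or (d_judge_a < -tol and d_judge_b > tol):
        return "evaluator_disagreement"
    if d_reward > tol and d_avg < -tol:
        return "reward_hacking"          # proxy up, external quality down
    if d_reward < -tol and d_avg < -tol:
        return "collapse"                # proxy and judges both degrade
    if d_reward < -tol and d_avg > tol:
        return "proxy_under_alignment"   # assumed reading: proxy misses a real gain
    if d_reward > tol and d_avg > tol:
        return "aligned_improvement"
    return "no_clear_change"
```

Applying such a function per prompt row rather than per checkpoint average is what lets localized reward hacking show up even when the averages look healthy, which is the distinction the abstract draws between row-level and checkpoint-level analysis.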

2606.03237 2026-06-03 cs.AI cs.CL cs.CY cs.LG cs.MA 版本更新

Solipsistic Superintelligence is Unlikely to be Cooperative

唯我论超级智能不太可能合作

Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

发表机构 * DeepMind University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文指出,基于唯我论方法设计的超级智能(极端能力的任务求解器)因忽视部署引发的内生非平稳性而难以合作,呼吁将相互依存作为核心设计原则的非唯我论研究范式。

Comments 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026

详情
AI中文摘要

AI的核心挑战正从能力转向共存。AI研究的主导范式侧重于开发将世界视为外生且平稳反馈源的强大智能体。我们认为,源于这种唯我论AI设计方法的超级智能(极端能力的任务求解器)不太可能合作。部署AI系统会引发内生非平稳性,导致训练-测试-部署差距,即历史分布与部署环境相偏离。我们称此为单边优化的自我削弱属性。缩小这一差距需要参与合作的AI:即多个行为体导航其相互依存的均衡选择过程。我们呼吁一种非唯我论的研究范式,将这种相互依存作为核心设计原则,而非将合作视为待解决的任务。这需要构建涉及自适应对手方的动态评估测试平台,将制度视为设计原语,并保留人类能动性作为我们构建系统的结构性特征。

英文摘要

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

2606.03236 2026-06-03 cs.AI 版本更新

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

先感知后推理:一种用于高效可靠主动移动代理的预推理感知框架

Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li, Dezhi Kong, Hao Wang, Peng Zhou, Xuchu Jiang, Jiaming Xu

发表机构 * HyperAI Team, Xiaomi Corporation(HyperAI团队,小米公司) Zhongnan University of Economics and Law(中南财经政法大学) Jilin University(吉林大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出预推理感知框架(PRPF),通过轻量级多模态主动感知器(MPP)进行干预门控和上下文压缩,仅在需要时激活主动代理推理器(PAR),以解决主动移动代理中干预时机与方式决策的目标错位和冗余推理问题。

详情
AI中文摘要

多模态大语言模型(MLLMs)显著推动了移动代理的发展,但主动移动辅助仍然具有挑战性,因为代理必须在决定如何协助之前确定何时干预。现有系统通常在一个统一的基于MLLM的流水线中实现这两个决策,导致保守的干预过滤与全面的辅助生成之间的目标错位,以及在代理应保持沉默时的冗余推理。为了解决这些限制,我们提出了预推理感知框架(PRPF),这是一个基于先感知后推理的两阶段框架。PRPF引入了一个轻量级的多模态主动感知器(MPP)用于干预门控和上下文压缩,并仅在需要干预时激活主动代理推理器(PAR)。在ProactiveMobile基准上的实验表明,与ProactiveMobile基线相比,PRPF显著降低了误触发率(FTR),同时提高了成功率(SR)和推理效率。

英文摘要

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.

2606.03232 2026-06-03 cs.LG cs.AI 版本更新

GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond

GFFMERGE: 图神经力场的高效合并及其扩展

Parth Verma, Parv P. Singh, Vipul Garg, Ishita Thakre, N. M. Anoop Krishnan, Sayan Ranu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Cambridge(剑桥大学)

AI总结 提出GFFMERGE框架,通过凸嵌入对齐问题解析解实现图神经网络的闭式模型合并,在力场回归任务中恢复接近联合训练的性能,并实现5-27倍加速。

详情
AI中文摘要

图神经网络(GNN)通过降低计算成本实现接近量子精度的原子模拟,彻底改变了神经力场,但将这些模型适应新化学系统需要对基础模型进行昂贵的重新训练。受视觉和语言处理中模型合并的启发,我们提出了GFFMERGE,这是第一个用于GNN闭式模型合并的原则性框架。我们利用消息传递层的线性结构,将合并问题形式化为具有解析解的凸嵌入对齐问题。通过对GNN模型合并的首次系统基准测试,我们发现为视觉和语言设计的现有方法在力场回归任务上灾难性地失败,而GFFMERGE恢复了接近黄金标准联合训练的性能。在分子(MD17、MD22)、固态(LiPS20)和大规模图基准测试中,GFFMERGE及其通用GNN对应物GNNMERGE实现了5-27倍的加速,同时支持专业模型的模块化组合。值得注意的是,我们的闭式解在微调前就优于所有基线方法,并为更快、数据高效的收敛提供了优越的初始化。

英文摘要

Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of message-passing layers and formulate merging as a convex embedding-alignment problem with an analytical solution. Through the first systematic benchmarking of model merging for GNNs, we show that existing methods designed for vision and language catastrophically fail on force field regression, while GFFMERGE recovers performance approaching gold standard joint training. Across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph benchmarks, GFFMERGE and GNNMERGE (its generic GNN counterpart) achieve 5-27x speedups while enabling modular composition of specialized models. Remarkably, our closed-form solution alone outperforms all baseline methods before fine-tuning and provides superior initialization for faster, data-efficient convergence.

2606.03223 2026-06-03 cs.RO cs.AI 版本更新

BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions

BotDirector:跨对称现实的多模态交互机器人讲故事

Zhe Sun, Meng Wang, Lei Wang, Yuxi Wang, Wanxin Li, Yujia Peng, Zhenliang Zhang

发表机构 * State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China(通用人工智能全国重点实验室,BIGAI,北京,中国) Peking University, Beijing, China(北京大学,北京,中国)

AI总结 提出一个结合具身交互和自然语言交互的机器人讲故事系统,利用LLM代理将儿童创建的叙事转化为自导航群体机器人的运动序列,支持灵活场景和日常物品。

详情
Journal ref
2026 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)
AI中文摘要

机器人讲故事融合了技术创新和创意表达,以前所未有的方式吸引儿童。然而,技术方面往往对儿童来说过于复杂。我们提出了一个交互式系统,通过具身和自然语言交互促进机器人讲故事。儿童用自己的物品布置游乐场,并与LLM代理一起创建叙事。创建的叙事基于地图和角色转化为运动序列,并由自导航群体机器人执行。该系统增强了机器人讲故事的灵活性,使幼儿能够用日常物品创作机器人戏剧。

英文摘要

Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map and characters, and the motions are executed by self-navigating swarm robots. This system enhances robot storytelling with flexible scenarios, enabling young children to create robot dramas with everyday objects.

2606.03220 2026-06-03 cs.CL cs.AI 版本更新

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

WebRISE: 面向MLLM生成Web工件的需求诱导状态评估

Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

发表机构 * Tsinghua University(清华大学) Huawei Noah’s Ark Lab(华为诺亚实验室) East China Normal University(华东师范大学) Tongji University(同济大学) Institute of Artificial Intelligence, Beihang University(北京航空航天大学人工智能研究院)

AI总结 提出WebRISE框架,通过交互契约图(ICG)将任务需求转化为可观察状态、用户意图转换和DOM/视觉断言,以评估MLLM生成的Web工件的功能正确性,实验表明ICG评分检测状态错误率是检查点评估的2-16倍。

详情
AI中文摘要

现有的MLLM生成Web工件基准通过局部证据评估交互,忽略了决定页面是否正常工作的需求诱导状态和转换。我们提出WebRISE,它将任务需求编译成交互契约图(ICG),包含可观察状态、用户意图转换以及DOM/视觉断言,以实现与实现无关的浏览器执行。WebRISE涵盖五种输入模态(文本、Markdown、草图、图像、视频)下的442个任务,包含5,495个转换和5,271个需求检查,将用户声明的功能与隐式的产品级约束分开。在14个MLLM中,即使最强的模型也仅达到65.6%的转换有效性和66.3%的需求覆盖率,且视觉质量不能代表行为(Qwen3.6-35B-A3B在Markdown上:V=80.8但T=15.5)。视频提供了最强的交互信号(隐式覆盖率比文本高10.6个百分点),而隐式约束仍然存在;缺陷注入表明,基于ICG的评分检测状态错误的速率是检查点评估的2-16倍。

英文摘要

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

2606.03214 2026-06-03 cs.AI cs.CV cs.CY cs.LG 版本更新

Effect of Demographic Bias on Skin Lesion Classification

人口统计偏差对皮肤病变分类的影响

Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim

发表机构 * Fontys University of Applied Science, Venlo, The Netherlands(Fontys应用科学大学,荷兰Venlo) Fontys University of Applied Science, Eindhoven, The Netherlands(Fontys应用科学大学,荷兰Eindhoven) Eindhoven University of Technology, Eindhoven, The Netherlands(埃因霍温技术大学,荷兰Eindhoven) IT University of Copenhagen, Denmark(哥本哈根IT大学,丹麦)

AI总结 本研究使用基于ResNet的卷积模型评估皮肤病变分类性能,通过线性规划控制人口统计特征,研究患者性别和年龄偏差的影响,并比较三种学习策略,发现性别偏差主要源于数据不平衡,而年龄偏差始终偏向年轻群体。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures

详情
Journal ref
https://melba-journal.org/2026:011
AI中文摘要

在这项研究中,我们评估了使用基于ResNet的卷积模型进行皮肤病变分类的性能,重点关注训练数据中人口统计偏差的影响,特别是患者性别和年龄的变化。我们使用线性规划生成具有受控人口统计特征的数据集,从而系统性地研究偏差效应。评估了三种学习策略:单任务模型、强化多任务模型和对抗学习方案。我们的性别分析表明,性别特定的训练数据集优化了模型性能。值得注意的是,在训练数据中包含男性患者提高了男性亚组的性能,即使在女性占多数的情况下也是如此。强化学习和对抗学习方案缩小或消除了平衡和女性占多数数据集中的偏差差距。然而,这些策略在男性占多数的环境中效果较差,模型在男性上的表现仍然优于女性。在主要男性患者群体中,与基线模型相比,这两种学习方案显示出边际偏差减少。基于年龄的分析表明,三种模型方法的基线性能相当,性能随年龄类别下降。无论训练数据分布如何,年轻组始终达到最高性能。尽管平衡训练对最年轻年龄组产生最佳结果,但较老年组的性能下降。我们发现性别偏差主要源于数据不平衡,而年龄偏差无论分布如何始终偏向年轻群体。这些不同的机制需要有针对性的缓解策略。此外,在两个外部数据集上的跨数据集验证表明,域转移显著影响性能和人口统计偏差模式。

英文摘要

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2606.03203 2026-06-03 cs.AI 版本更新

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

MedCUA-Bench: 一个仅基于截图的临床计算机使用代理基准测试

Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

发表机构 * Microsoft Research Asia(微软亚洲研究院) Digital Medical Research Center, School of Basic Medical Sciences, Fudan University(复旦大学基础医学院数字医学研究中心) Shanghai Key Laboratory of MICCAI(上海MICCAI重点实验室)

AI总结 提出 MedCUA-Bench,一个覆盖10个医学领域18个临床场景的交互式基准,通过确定性检查器评估任务完成和五个临床安全维度,揭示当前代理在真实临床软件上的性能差距。

详情
AI中文摘要

计算机使用代理可以自动化重复的基于屏幕的临床工作,但它们在医疗图形用户界面中的可靠性仍未得到充分验证。现有的基准测试侧重于通用的网页或桌面任务,对医疗软件的覆盖不足,而医疗软件需要领域知识,其用户界面设计与主流应用显著不同,缺乏公开的测试环境,并且需要超出任务完成的安全验证。我们引入了 MedCUA-Bench,一个用于临床计算机使用代理的交互式基准测试。它涵盖了10个医学领域的18个临床场景,这些场景根据真实产品手册和开源医疗系统重建,以捕捉真实的临床界面,同时避免许可和隐私限制。每个任务都配有配对的意图级和步骤级目标,以将临床推理与用户界面执行分离,并通过确定性检查器在任务完成和五个临床安全维度上进行评估。在23个代理中,最好的闭源模型达到了54.2%的严格成功率,而所有模型在真实的 OpenEMR 上均低于9%。开源代理的平均成功率仅为2.5%,最好的达到了16.2%。MedCUA-Bench 揭示了当前代理与可靠临床软件使用之间的差距,为未来的研究提供了一个可复现的测试平台。

英文摘要

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

2606.03198 2026-06-03 cs.CL cs.AI 版本更新

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

AI评分者的区分能力取决于复杂临床决策中的评分协议

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

发表机构 * Asclep Korea Inc.(Asclep韩国公司) Center for Data Science, New York University(纽约大学数据科学中心) Division of Endocrinology and Metabolism, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine(成均馆大学医学院内分泌与代谢科,三星医疗中心) Biomedical Statistics Center, Samsung Medical Center(三星医疗中心生物医学统计中心) Department of Digital Health, SAIHST(SAIHST数字健康科) Department of Data Convergence & Future Medicine, Sungkyunkwan University(成均馆大学数据融合与未来医学科)

AI总结 通过因子研究,发现基于评分标准的协议能放大AI评分者区分能力,而无评分标准协议则抑制这种区分,支持在临床AI评估中使用评分标准锚定。

Comments 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

详情
AI中文摘要

临床AI评估越来越多地委托给大型语言模型(LLMs)作为AI评分者进行评分,但其在不同评估条件下的评分行为尚未被定量表征。我们通过一项因子研究填补了这一空白,该研究关注成人2型糖尿病(T2D)药物治疗在12个月门诊随访中的AI评分者行为,这是一项涉及复杂决策的临床任务,通过七个评估问题操作化。四个开源LLMs同时作为临床决策支持系统(CDSS)模型和AI评分者。每个CDSS输出在两种评分协议下评分:基于评分标准的Gold Rubric(GR)协议(包含患者特定评分标准)和无评分标准的Non Gold Rubric(Non-GR)协议。线性混合效应模型将评分协议因子与五个设计因子(CDSS模型、CDSS提示配置(文档参考生成[DRG] vs. 基线)、评分者模型、提示字符和提示类型)交叉,并估计主效应及其协议交互。在所有问题中,AI评分者在Non-GR下始终给出非常窄范围内的更高分数(平均74-78分),而GR下的平均分数低7.69至49.64分,四分位距宽1.68至3.67倍。在每个问题内,GR将AI评分者对DRG和基线CDSS输出的区分能力放大了1.76至5.10倍,同时揭示了Non-GR抑制的评分者模型间的显著行为变异。这些发现支持评分标准锚定作为保留临床AI评估区分能力的评分协议;当问题需要患者特定或司法管辖区特定标准,而评分者模型无法仅从参数知识推断时,无评分标准评分无法替代。

英文摘要

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors (CDSS model; CDSS prompt configuration, i.e. document-referenced generation [DRG] vs. Baseline; rater model; prompt character; and prompt type) and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74-78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

2606.03165 2026-06-03 cs.CL cs.AI 版本更新

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

大型语言模型中词汇对齐和偏好阶段转变的完全自动识别

Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

发表机构 * University of Washington(华盛顿大学)

AI总结 本文提出两种无需人工干预的评估指标——词汇对齐分数和三角化偏好转变,用于自动识别大型语言模型中的词汇过度使用及其与人类偏好学习的关联。

Comments 16 pages, 2 figures, 10 tables

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6116-6131
AI中文摘要

数字聊天助手(如ChatGPT)使用的语言可能与人类预期存在偏差(不对齐)。主要针对科学英语的研究已经描述了出现的偏差以及在一定程度上解释了原因,将其与人类偏好学习的训练阶段联系起来。然而,现有方法依赖于人工筛选。本文引入了两种无需筛选、假设较少的评估指标:词汇对齐分数(识别词汇过度使用)和三角化偏好转变(量化此类转变中有多少可归因于人类偏好学习)。使用PubMed摘要,生成了续写,并通过六个模型系列(Falcon、Gemma、Llama、Mistral、OLMo、Yi)的滑动窗口文档频率进行测量。该过程无需人工干预即可识别过度使用的词汇,如'suggest'、'additionally'和'strategy',并估计它们与偏好学习的关联。我们的发现复现了先前的工作,并且在不同参数设置、随机种子以及更多数据上的评估中保持稳定。该方法易于扩展,能够系统研究科学英语之外以及跨语言的词汇(不)对齐,因此,这些指标有潜力为未来模型改进对齐并理解其起源做出贡献。

英文摘要

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

2606.03159 2026-06-03 cs.CV cs.AI cs.RO 版本更新

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

NVIDIA OmniDreams:用于闭环自动驾驶仿真的实时生成式世界模型

NVIDIA: Aarti Basant, Amlan Kar, Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas, Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao, Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo de Lutio, Ruilong Li, Sanja Fidler, Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff, William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang

AI总结 提出OmniDreams,一个基于Cosmos扩散模型训练的基础生成式世界模型,通过自回归生成动作条件视频,实现闭环仿真中复杂长尾场景的实时合成,并验证其在策略模型训练中的有效性。

详情
AI中文摘要

随着自动驾驶能力的提升,在长尾场景中安全评估驾驶策略仍是一个关键瓶颈。在闭环仿真中,驾驶策略模型与环境主动交互,其动作动态更新模拟器状态并直接影响下一组生成的传感器观测。尽管近期基于重建的神经模拟器提供了逼真效果,但它们从根本上受限于初始捕获数据,难以泛化到高度动态或新颖场景。为克服这些限制,我们引入了OmniDreams,一个从Cosmos扩散模型进行中期和后训练的基础生成式世界模型,能够自回归地实时生成动作条件视频。通过利用Cosmos丰富的视觉先验以及在21k小时驾驶场景上的中期和后训练,OmniDreams合成了传统模拟器难以捕获的复杂未观测现象,例如极端天气和不可预测的动态智能体行为。关键在于,它自回归地根据过去帧、当前模拟器状态和即时驾驶动作来调节其逼真的传感器生成。在结合Alpamayo 1策略模型和AlpaSim编排器的闭环系统中部署时,OmniDreams充当一个高度响应、反应灵敏的环境,为训练和评估下一代自动驾驶策略提供了可扩展且全面的解决方案。我们还展示了初步结果,表明从OmniDreams后训练的世界-动作模型(WAM)在Physical AI自动驾驶NuRec数据集上取得了强劲性能,超越了基于VLA的Alpamayo 1.5研究策略模型,同时仅使用其1/5的总参数量。这些结果凸显了像OmniDreams这样的实时世界模型也有潜力作为策略架构的骨干网络。

英文摘要

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

2606.03157 2026-06-03 cs.AI 版本更新

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

ClinicalMC:面向大语言模型的多疗程临床决策基准

Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan

发表机构 * East China University of Science and Technology, Shanghai, China(华东理工大学) Renji Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China(上海交通大学医学院附属仁济医院)

AI总结 提出ClinicalMC基准,包含多阶段样本,通过多智能体评估框架在单轮静态和多轮动态设置下测试大语言模型的临床决策能力。

详情
AI中文摘要

大语言模型(LLMs)已在医疗领域广泛应用,但在复杂临床决策场景中仍面临重大挑战。现有基准主要评估LLMs在单疗程设置中的表现,缺乏对多疗程场景的系统评估——在后者中,患者的病情随时间演变。为弥补这一空白,我们提出ClinicalMC,一个面向多疗程临床决策的基准。它包含从入院到出院的四个阶段的1,275个中文样本和5,804个英文样本。这些阶段涵盖分诊、首诊检查/诊断/治疗、后续多疗程检查/评估/治疗以及最终诊断。在ClinicalMC中,英文数据集中的患者平均经历5.11个临床疗程,而中文数据集中的患者经历3.42个。为评估LLM性能,我们构建了一个多智能体评估框架,包括患者、考官和医生智能体。基于该基准和框架,我们设计了两种实验设置——单轮静态设置和多轮动态设置——并评估了三类LLM:1)闭源LLM如GPT5-mini;2)开源LLM如DeepSeek-V3.2;3)医学LLM如HuatuoGPT-o1。通过广泛评估,我们旨在更好地理解LLM在医学领域的性能,并支持其在医疗中的有效部署。

英文摘要

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

2606.03144 2026-06-03 cs.AI 版本更新

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

GTBench:一个基于课程体系的基准,用于评估大语言模型作为图论数学研究助手的能力

Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta

发表机构 * Louisiana State University(路易斯安那州立大学) Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室) Texas A&M-Central Texas(德克萨斯农工大学中央德克萨斯分校)

AI总结 本文提出GTBench基准,通过三个难度递增的图论问题组(本科定义、算法推理、研究生证明)评估大语言模型的数学推理能力,发现GPT-5表现最佳,其他模型随难度下降显著,并揭示了人类与自动评估者之间的系统性分歧。

Comments 19 pages, 5 figures, 7 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作技术学科的自学助手,但其作为数学推理助手的可靠性仍知之甚少。我们引入了GTBench,这是一个基于课程体系的基准,用于评估LLM作为图论数学研究助手的能力,包含63个问题,分为三个难度递增的组:本科定义和基本性质(第1组)、算法跟踪和结构推理(第2组)以及研究生级别的证明构建(第3组)。问题来源于经过验证的学术材料,包括Diestel的《图论》。我们评估了五个前沿模型(GPT-5、Claude Sonnet 4.6、Gemini 2.5 Flash-Lite、Llama 3.3 70B和Mistral Large 3),在零样本和思维链提示下,对第1组和第2组使用精确匹配和LLM作为评判者的评估,对第3组使用混合人类专家和LLM作为评判者的协议。我们的结果揭示了显著的性能层次:GPT-5在第1组接近上限(零样本95.8%),并在研究生证明上保持有意义的准确性(82%),而所有其他模型随着难度增加性能大幅下降,其中Llama在第3组零样本下的人类评估中达到0%。失败模式分析表明,“算法正确、执行错误”类错误在第1组和第2组中占主导地位,而第3组还出现了推理不完整的失败,并揭示了人类评估者与自动评判者之间的系统性分歧,特别是在冗长或接近完整的证明上(人类评估者两两之间的kappa = 0.48-0.83)。GTBench为LLM中的图论推理提供了第一个基于课程体系的评估框架,对数学教育和科学研究中AI工具的治理具有直接影响。

英文摘要

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that correct algorithm, wrong execution errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
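The abstract reports agreement between human evaluators as kappa values. As a rough illustration of how such a pairwise figure can be computed, here is a minimal sketch that assumes Cohen's kappa over categorical grades; the abstract does not say which kappa variant or grading scale the authors used, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters grading the same items (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must grade the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Because kappa subtracts chance agreement, two raters who agree on half of a balanced binary grading task score 0, not 0.5.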

2606.03137 2026-06-03 cs.AI 版本更新

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Think-Before-Speak: 从内部评估到多智能体社会模拟中的公开表达

Kaiqi Yang, Tai-Quan Peng, Sanguk Lee, Hui Liu

发表机构 * Michigan State University(密歇根州立大学) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出TBS框架,通过分离智能体的内部推理与公开话语生成,模拟从内部评估到公开表达的路径,并在气候政策讨论中验证其机制敏感性。

详情
AI中文摘要

基于LLM的多智能体模拟为研究社会互动、审议和集体意见动态提供了一种有前景的方法。然而,许多现有的对话模拟框架主要将互动表示为可观察的轮次交换或聚合输出,使得沉默、说话意图和公开表达背后的内部评估过程难以考察。我们引入了TBS(Think-Before-Speak),一种基于间隔的多智能体模拟框架,将智能体的私人推理与公开话语生成分离。在每个间隔,所有智能体基于共享的对话历史及其自身记忆更新结构化的内部状态。这些状态包括与失调相关的评估、感知的意见气候、感知的孤立风险、回应策略和说话意愿。然后,协调器解决竞争的说话意图,并将一个话语提交到公共对话中,允许内部评估和公共互动随时间共同演化。我们在模拟的关于气候相关政策问题的市政厅讨论中评估了TBS。结果表明,TBS产生连贯的内部状态轨迹,并且这些轨迹在轮次分配、沉默和记忆条件下系统地变化。与失调相关的评估增加了智能体的说话意愿,而沉默压力评估则降低了它。一旦形成说话意图,公开表达主要由轮次分配规则塑造。这些发现表明,TBS通过使从内部评估到公开表达的路径可观察和可分析,支持机制敏感的社会模拟。

英文摘要

LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents' willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.

2606.03135 2026-06-03 cs.AI 版本更新

Uncertainty-Aware Clarification in LLM Agents with Information Gain

基于信息增益的LLM智能体不确定性感知澄清

Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Ying Zhao, Zhijiang Guo, Wei Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对用户指令不明确导致LLM智能体工具操作错误的问题,提出一种以信息增益奖励为导向的澄清框架,通过贝叶斯信念更新量化澄清问题的效用,训练智能体生成高信息增益的澄清,在τ-Bench环境中将任务成功率提升3.7%,仅增加0.3个交互步骤。

详情
Journal ref
ICML 2026
AI中文摘要

大型语言模型(LLM)智能体通常在未明确说明的用户指令下运行,其中关于用户意图的潜在不确定性会导致错误的工具操作。为了解决这一挑战,我们提出了一种目标导向的澄清框架,将澄清行为与歧义消除对齐。我们方法的核心是信息增益奖励,这是一种通过测量由澄清交互引起的对真实目标贝叶斯信念更新来量化澄清问题效用的指标。我们使用该奖励训练澄清器(LLM),以优化高信息增益,确保澄清有效减少不确定性并提高智能体-工具-用户环境中的任务完成度。我们在一个增强澄清的τ-Bench环境中验证了我们的框架,并在五个异质骨干网络上进行了跨智能体评估。实验结果表明,与无澄清基线相比,我们的方法一致地将成功率提高了3.7%,同时平均仅增加了0.3个总交互步骤。

英文摘要

Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced τ-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.
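
The abstract describes the reward only as a Bayesian belief update toward the ground-truth goal. A minimal sketch of one way such a reward could be computed follows; the goal names, the likelihood values, and the log-ratio form of the reward are illustrative assumptions, not the paper's definition.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over candidate goals after one clarification answer.

    prior:      dict goal -> P(goal)
    likelihood: dict goal -> P(observed answer | goal)
    """
    unnorm = {g: prior[g] * likelihood.get(g, 0.0) for g in prior}
    z = sum(unnorm.values())
    if z == 0.0:
        raise ValueError("answer has zero probability under every goal")
    return {g: p / z for g, p in unnorm.items()}

def information_gain_reward(prior, posterior, true_goal):
    """Assumed reward: log2 ratio of belief in the true goal after vs. before."""
    return math.log2(posterior[true_goal] / prior[true_goal])

# Hypothetical belief state and answer likelihoods, for illustration only.
prior = {"refund": 0.25, "exchange": 0.25, "cancel": 0.5}
likelihood = {"refund": 0.75, "exchange": 0.25, "cancel": 0.0}
posterior = bayes_update(prior, likelihood)
reward = information_gain_reward(prior, posterior, "refund")
```

Under this form, a question whose answer shifts belief toward the true goal earns a positive reward, and one that shifts belief away earns a negative reward.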

2606.03128 2026-06-03 cs.CR cs.AI cs.CL cs.LG 版本更新

Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

解耦式智能合约审计:通过蒸馏与聚合的轻量级LLM框架

Bagus Rakadyanto Oktavianto Putra, Muhamad Risqi Utama Saputra, Widyawan, Guntur Dharma Putra

发表机构 * University of Indonesia(印度尼西亚大学)

AI总结 提出一种基于轻量级开源LLM(0.6B-4B参数)的解耦式智能合约审计框架,通过rsLoRA、知识蒸馏和链式验证聚合策略,在漏洞检测中达到98.25%准确率,优于7B-34B参数模型。

Comments 12 pages, 4 figures, 5 tables. Accepted to IEEE ICWS 2026

详情
AI中文摘要

智能合约面临关键安全挑战,需要在去中心化网络服务中进行彻底审计。虽然大型语言模型(LLMs)在自动漏洞检测中展现出潜力,但现有方法缺乏严重性评估和可操作的修复建议,且计算开销过大。在本研究中,我们引入了一个高效的端到端智能合约安全审计框架,利用轻量级、高度优化的开源LLMs(0.6B-4B参数)。我们的框架将综合审计任务解耦为四个相互关联的组件:漏洞检测、解释、严重性分类和修复建议。为了在无需庞大参数量的情况下保持高准确性,我们实现了秩稳定低秩适配器(rsLoRA)、知识蒸馏以及自定义链式验证(CoVe)聚合策略,系统性地筛选并整合模型生成的多个草稿响应,形成高准确度的审计报告。实验结果表明,我们的轻量级流水线持续优于最先进的开源代码密集LLMs(7B至34B参数),在漏洞检测中达到98.25%的准确率,在生成解释任务中达到0.4375的对齐分数。此外,我们广泛的消融研究实证验证了我们的解耦审计过程相对于统一提示的优越性,并揭示了一种新颖的严重性中心性偏差,为未来LLM辅助审计研究建立了关键基准。

英文摘要

Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models (LLMs) have shown promise in automated vulnerability detection, existing approaches lack severity evaluations with actionable remediation and demand unnecessarily massive computational overhead. In this study, we introduce an efficient end-to-end smart contract security audit framework utilizing lightweight, highly optimized open-source LLMs (0.6B-4B parameters). Our framework decouples comprehensive audit tasks into four interconnected components: vulnerability detection, explanation, severity classification, and remediation recommendation. To maintain high accuracy without massive parameters, we implement Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy to systematically screen and consolidate multiple draft responses from the model into a highly accurate audit report. Experimental results demonstrate that our lightweight pipeline consistently outperforms state-of-the-art open-source coder dense LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 in generative explanation tasks. Furthermore, our extensive ablation studies empirically validate the superiority of our decoupled audit processes over unified prompting and uncover a novel severity centrality bias, establishing a critical benchmark for future research in LLM-assisted auditing.

2606.03119 2026-06-03 cs.CV cs.AI cs.LG 版本更新

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

GuidedBridge: 无需训练地利用先验引导改进桥接模型

Zehua Chen, Yucheng Yang, Binjie Yuan, Kaiwen Zheng, Jun S. Liu, Jun Zhu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出无需训练的先验引导方法(PG)和频率调制先验引导(FMPG),通过对比弱先验与已见先验增强桥接模型的先验利用,并设计级联框架CFG-FMPG用于图像修复,实验证明该方法能一致提升预训练桥接模型在多种图像翻译任务中的性能。

Comments ICML 2026

详情
AI中文摘要

引导方法,如无分类器引导(CFG)和自动引导(AG),推动了扩散模型中噪声到数据生成的发展。最近,桥接模型引入了一种数据到数据的生成过程,可以利用有指导性的干净先验。在这项工作中,受先前通过去噪结果质量差异作为引导的方法启发,我们提出了一种无需训练的桥接引导方法,称为先验引导(PG)。具体来说,我们引入一个弱先验,该先验在桥接预训练期间未见,阻碍先验利用从而降低去噪结果。然后,我们将其与已见先验对比,通过缩放因子突出并增强先验利用。此外,我们分析了桥接过程中先验利用的潜在机制,并设计了频率调制先验引导(FMPG),该引导将引导尺度调整到与桥接生成动力学一致的低频和高频带。为了解决图像修复中的先验利用问题,我们开发了一个级联框架CFG-FMPG,该框架首先通过CFG生成噪声隐藏表示,然后将其作为生成先验与FMPG一起利用,在不影响推理效率的情况下发挥它们的互补优势。实验表明,我们的PG方法在多种图像翻译任务中一致地改进了预训练桥接模型。

英文摘要

Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods that use a quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training and therefore hinders prior exploitation and degrades the denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands in line with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, combining their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.
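
The contrast between the seen and weak priors can be pictured as a CFG-style extrapolation. The sketch below assumes that form; the abstract does not give the exact update rule, and the function name and list-based inputs are hypothetical.

```python
def prior_guidance(pred_seen, pred_weak, scale):
    """Assumed CFG-style rule: push the seen-prior prediction away from the
    weak-prior prediction by `scale` times their difference.

    pred_seen: denoising result conditioned on the prior seen in pre-training
    pred_weak: denoising result conditioned on the degraded (weak) prior
    """
    return [s + scale * (s - w) for s, w in zip(pred_seen, pred_weak)]

# scale = 0 leaves the seen-prior prediction unchanged.
unguided = prior_guidance([1.0, 2.0], [0.5, 2.5], scale=0.0)
guided = prior_guidance([1.0, 2.0], [0.5, 2.5], scale=2.0)
```

A frequency-modulated variant would apply this rule with separate scales to the low- and high-frequency components of the prediction, which this sketch does not cover.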

2606.03103 2026-06-03 cs.AI 版本更新

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

DeskCraft: 桌面代理在专业工作流与人在环协作中的基准测试

Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang

发表机构 * Zhejiang University(浙江大学) Tsinghua University(清华大学) Tencent(腾讯) The University of Hong Kong(香港大学)

AI总结 提出DeskCraft基准,针对专业创意软件中的长周期工作流和主动人机协作,通过多级难度分类和交互协议评估18种代理,发现GPT-5.4在标准任务上达31.6%,交互任务上达27.6%。

详情
AI中文摘要

专业创意和工程软件中的真实桌面工作流通常跨越长时间跨度,并且往往需要人在环协调,代理在任务进行中主动寻求必要信息,用户提供额外指令、澄清、反馈或修正。然而,现有的桌面GUI基准大多将这一场景简化为短小、简单的任务,所有用户指令预先提供。为解决此问题,我们引入DeskCraft,一个针对长周期创意和工程工作流以及主动人机协作的桌面GUI基准。DeskCraft将任务组织成多级难度分类,长周期任务需要超过50个执行步骤,涵盖设计、视频、音频和3D创作等专业创意软件。此外,DeskCraft将人机协作形式化为一个交互协议,涵盖回合中和回合后交换。回合中交互捕捉代理在不确定性下主动发起的澄清和用户在执行过程中发起的打断,而回合后交互则容纳用户在代理发出完成信号后的反馈,共同覆盖现实协作模式的全空间。我们在538个任务上评估了18个专有和开源代理,发现GPT-5.4在标准任务上达到31.6%,在交互任务上达到27.6%。进一步分析揭示了长周期工作流交付和主动澄清方面的持续失败。我们将在以下网址开源所有评估代码、任务和数据:https://github.com/mrwwk/DeskCraft。

英文摘要

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open-source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long-horizon workflow delivery and proactive clarification. We will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft.

2606.03099 2026-06-03 cs.CL cs.AI 版本更新

PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

PhotoCraft: 具有层次自进化记忆的深度图像搜索代理推理

Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu, Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang, Jie Zhou

发表机构 * Pattern Recognition Center, WeChat AI, Tencent Inc.(微信AI模式识别中心,腾讯公司) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Zhongguancun Academy(中关村学院) Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出PhotoCraft,一种无需训练的分层记忆系统,通过工作、情景和语义记忆增强多模态大语言模型,实现深度图像搜索中的多步推理和知识迁移,在DISBench上提升检索性能达18.5%。

详情
AI中文摘要

深度图像搜索需要对丰富的上下文线索(如时间、地点和事件关系)进行多步推理。然而,现有的大语言模型代理大多是无状态和反应式的,缺乏持久记忆来维持长期上下文或跨任务迁移经验,这常常导致执行漂移和经验隔离。为了解决这些限制,我们提出了PhotoCraft,一种无需训练的分层记忆系统,用于照片搜索代理。受人类认知启发,PhotoCraft为多模态大语言模型配备了工作记忆、情景记忆和语义记忆,这些记忆在推理过程中被动态调用,以在多步推理和答案生成中保持逻辑一致性和知识可迁移性。在DISBench上的大量实验表明,PhotoCraft在不同多模态大语言模型骨干上持续改善了上下文感知检索,取得了高达18.5%的性能提升,并有效缓解了无记忆深度图像搜索中的关键瓶颈,为可靠且可泛化的多模态搜索代理提供了一条实用路径。

英文摘要

Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi-step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context-aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.

2606.03097 2026-06-03 cs.AI 版本更新

From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting

从长新闻到准确预测:面向时间序列预测的重要性感知融合与PRM引导的反思

Mingyang Liu, Qingcan Kang, Yuke Wang, Shixiong Kai, Kaichao Liang, Hui-Ling Zhen, Tao Zhong, Mingxuan Yuan, Linqi Song

发表机构 * Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 提出一种结合重要性感知新闻压缩和过程奖励模型(PRM)引导检索的框架,解决长新闻上下文窗口限制和迭代检索无引导问题,提升时间序列预测精度并减少迭代次数。

详情
AI中文摘要

将新闻纳入时间序列预测具有吸引力,因为新闻可以揭示仅凭历史值无法恢复的突发外生事件。然而,现有的基于LLM的新闻预测流程面临两个实际限制:相关新闻文章通常超过模型的上下文窗口,并且对补充新闻的迭代检索通常无引导,导致冗余更新和收敛缓慢。我们通过一个结合重要性感知新闻压缩和过程级检索监督的新框架来解决这些问题。首先,我们训练一个重要性奖励模型,该模型估计每篇文章的预测效用,并利用该信号在顺序成对融合期间分配压缩预算,在固定上下文限制内保留信息内容。其次,我们引入一个过程奖励模型(PRM),该模型根据当前误差分布和先前选择文章的历史对多个补充新闻候选进行排序,用质量控制的检索替代一次性盲目检索。两个组件均使用历史数据和真实值进行离线训练;推理使用冻结的过滤逻辑和压缩模块,无需任何反思循环。在金融、能源、交通和比特币预测基准上的实验表明,我们的方法在强基线上提高了预测精度,与迭代基线相比显著减少了细化迭代次数,并且在相关文章跨越数千个标记时仍然有效。

英文摘要

Incorporating news into time series forecasting is appealing because news can reveal abrupt exogenous events that historical values alone cannot recover. However, existing LLM-based news-forecasting pipelines face two practical limitations: relevant news articles often exceed the model's context window, and iterative retrieval of supplementary news is typically unguided, leading to redundant updates and slow convergence. We address these issues with a novel framework that combines importance-aware news compression and process-level retrieval supervision. First, we train an importance reward model that estimates the forecasting utility of each article and uses this signal to allocate compression budgets during sequential pairwise fusion, preserving informative content within a fixed context limit. Second, we introduce a process reward model (PRM) that ranks multiple supplementary-news candidates conditioned on the current error profile and the history of previously selected articles, replacing one-shot blind retrieval with quality-controlled selection. Both components are trained offline using historical data with ground truth; inference uses the frozen filtering logic and compression modules without any reflection loop. Experiments on finance, energy, traffic, and bitcoin forecasting benchmarks show that our method improves prediction accuracy over strong baselines, significantly reduces the number of refinement iterations compared to the iterative baseline, and remains effective when relevant articles span thousands of tokens.

2606.03093 2026-06-03 cs.AI 版本更新

Decomposing how prompting steers behavior

分解提示如何引导行为

Fan L. Cheng, Nikolaus Kriegeskorte

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出嵌套几何分解框架,通过刺激不变映射分析提示如何重塑表示几何,揭示跨维度线性混合是提示引导行为的关键机制。

Comments 59 pages, 41 figures

详情
AI中文摘要

提示引导大型语言模型(LLMs)和视觉语言模型(VLMs)无需权重更新,但指令变化如何重塑内部表示以产生行为仍不清楚。我们引入了一个嵌套几何分解框架,将提示视为对提示后内容表示几何的变换。对于每个提示对,我们使用越来越具表达力的刺激不变映射(平移、均匀缩放刚性变换、顺序轴缩放、仿射变换和非线性变换)对齐两个提示下相同刺激的表示。然后,我们通过将单个层的提示A隐藏状态替换为其映射版本,并测量提示B表示几何和行为的恢复程度,来因果测试每个映射。在三个LLM、三个VLM以及涵盖风格、情感、场景内容和数字的六个文本或图像数据集上,提示一致地将表示重塑为指示的任务结构。交叉验证的方差分解显示,许多提示诱导的激活变化由保持形状的映射(尤其是平移和均匀缩放刚性变换)捕获,而层级剖面揭示了跨层的模型和任务特定路由策略。关键的是,尽管平移和刚性层级已经改善了行为一致性,但仿射变换是第一个几乎完全恢复目标提示任务几何并带来相应行为增益的层级。这表明跨维度线性混合是提示将表示重组为指示任务结构的关键机制。我们的框架将提示诱导的表示变化分解为可解释的几何组件,并揭示了模型如何路由任务相关结构以产生提示驱动的行为。

英文摘要

Prompting steers large language models (LLMs) and vision-language models (VLMs) without weight updates, but it remains unclear how instruction changes reshape internal representations to produce behavior. We introduce a nested geometric decomposition framework that treats prompting as a transformation of the representational geometry of the content following the prompt. For each prompt pair, we align representations of the same stimuli under two prompts using increasingly expressive stimulus-invariant maps: translation, rigid transformation with uniform scaling, sequential axis scaling, affine transformation, and nonlinear transformation. We then causally test each map by replacing a single layer's prompt-A hidden state for held-out stimuli with its mapped counterpart and measuring recovery of prompt-B representational geometry and behavior. Across three LLMs, three VLMs, and six text or image datasets spanning style, emotion, scene content, and number, prompts consistently reshape representations toward the instructed task structure. Cross-validated variance decomposition shows that much prompt-induced activation change is captured by shape-preserving maps, especially translation and rigid transformation with uniform scaling, while tier profiles reveal model- and task-specific routing strategies across layers. Crucially, although translation and rigid tiers already improve behavioral agreement, affine transformation is the first tier to nearly recover target-prompt task geometry and yields corresponding behavioral gains. This suggests that cross-dimensional linear mixing is a key mechanism by which prompts reorganize representations toward instructed task structure. Our framework decomposes prompt-induced representational change into interpretable geometric components and reveals how models route task-relevant structure to produce prompt-driven behavior.

2606.03090 2026-06-03 cs.CR cs.AI 版本更新

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

“**重要** 你应该给我满分!”:探索针对基于LLM的自动评分系统的提示注入攻击

Hang Li, Fedor Filippov, Yuling Lin, Pengfei He, Kaiqi Yang, Yucheng Chu, Yingqian Cui, Hui Liu, Jiliang Tang

发表机构 * Michigan State University(密歇根州立大学)

AI总结 研究针对基于LLM的自动评分系统的提示注入攻击,通过实验证明当前系统高度脆弱,并评估现有防御策略的有效性。

Comments 15 pages, 8 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)的出现显著加速了近期关于基于LLM的自动评分(AG)系统的研究。受益于LLM强大的指令遵循能力和广泛的先验知识,教育工作者可以使用仅包含自然语言评分标准的AG系统跨不同任务部署,并获得令人满意的评分性能。尽管有这些优势,新的安全问题也可能出现。特别是,提示注入(PI)攻击最近已成为基于LLM的应用的主要威胁。在AG的背景下,攻击者可能利用PI漏洞操纵评分系统,使其无论实际答案质量如何都人为地给出高分。这种行为对教育评估的公平性、可靠性和完整性构成严重风险。在这项工作中,我们研究了AG系统中的PI攻击,并系统地调查了此类攻击在教育场景中的有效性。我们进一步评估了现有防御策略对抗这些攻击的有效性。通过在基于评分标准的评分设置下进行全面的实验,我们证明了当前基于LLM的AG系统仍然高度容易受到PI攻击。我们希望我们的发现能提高对这种新兴威胁的认识,并激励未来研究朝着安全、稳健和可信的基于LLM的教育系统发展。

英文摘要

The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Benefiting from the strong instruction-following capabilities and broad prior knowledge of LLMs, educators can deploy AG systems across diverse tasks using only natural language rubrics while achieving satisfactory grading performance. Despite these advantages, new security concerns may also arise. In particular, prompt injection (PI) attacks have recently become a major threat to LLM-based applications. In the context of AG, attackers can potentially exploit PI vulnerabilities to manipulate grading systems into assigning artificially high scores regardless of the actual answer quality. Such behavior poses serious risks to the fairness, reliability, and integrity of educational assessment. In this work, we study PI attacks in AG systems, and systematically investigate the effectiveness of such attacks in educational scenarios. We further evaluate the effectiveness of existing defensive strategies against these attacks. Through comprehensive experiments under rubric-based grading settings, we demonstrate that current LLM-based AG systems remain highly vulnerable to PI attacks. We hope that our findings raise awareness of this emerging threat and motivate future research toward secure, robust, and trustworthy LLM-based educational systems.

2606.03083 2026-06-03 cs.AI 版本更新

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

DELTAMEM: 通过残差树为LLM智能体增量式经验记忆

Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li, Xu Chen

发表机构 * Beijing Key Laboratory of Research on Large Models and Intelligent Governance(北京大模型与智能治理重点实验室) Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE(下一代智能搜索与推荐工程技术研究中心,教育部) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Duke University School of Medicine(杜克大学医学学院)

AI总结 提出DeltaMem框架,通过构建两个独立的残差树(目标条件任务经验和场景级环境知识)组织经验记忆,利用增量节点减少冗余,并通过失败惩罚相似度扫描和自主合并机制实现高效检索与自组织,在多种交互环境中优于现有基线。

详情
AI中文摘要

基于大语言模型的智能体越来越依赖记忆从持续交互中学习经验。然而,将经验存储为独立、扁平的单位会导致大量冗余和检索冲突,因为相似的情节重复重叠内容,而细微的场景变化导致检索到的记忆提供矛盾的指导。为了解决这个问题,我们引入残差经验的概念,认为新获得的经验通常是现有知识的增量变化。我们提出DeltaMem,一个将经验记忆组织成两个独立残差树的框架:一个存储目标条件任务经验作为可复用技能,另一个存储场景级环境知识。每个树使用一个根节点表示通用的基础经验,以及增量delta节点表示后续的变化,使得相关经验可以共享共同基础而不重复。对于检索,采用失败惩罚相似度扫描找到最佳匹配,并通过从根到匹配链的组合重构完整经验。一个自主合并机制将高频路径蒸馏成新的根节点,使树能够从通用启发式自组织为专门变体。在多种交互环境中的实验表明,DeltaMem持续优于现有基线。为促进未来研究,我们在该网址发布代码。

英文摘要

Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal-conditioned task experience as reusable skills and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure-penalized similarity scan locates the best match, reconstructing the full experience via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling the trees to self-organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import-myself/DeltaMem.
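
The root-plus-delta organization and root-to-match chain composition can be sketched as follows. This is an illustration under stated assumptions, not DeltaMem's implementation: experiences are modelled as flat dictionaries, a delta overrides only the fields it changes, and the `Node` class and field names are hypothetical. The failure-penalized similarity scan and the consolidation step are not shown.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class Node:
    # Root nodes hold a full base experience; delta nodes hold only the changed fields.
    content: dict
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list = field(default_factory=list, repr=False)

    def add_delta(self, changes):
        """Attach an incremental variation of this node's experience."""
        child = Node(content=changes, parent=self)
        self.children.append(child)
        return child

def reconstruct(node):
    """Compose the full experience by applying deltas along the root-to-node chain."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    full = {}
    for n in reversed(chain):  # root first, then each delta overrides earlier fields
        full.update(n.content)
    return full

# Hypothetical task experience with two successive variations.
root = Node({"goal": "heat food", "tool": "microwave", "steps": 3})
variant = root.add_delta({"tool": "stove"})
leaf = variant.add_delta({"steps": 5})
full_experience = reconstruct(leaf)
```

Because each delta stores only what differs from its parent, related experiences share the root's content instead of duplicating it.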

2606.03080 2026-06-03 cs.CL cs.AI 版本更新

Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding

遗憾预训练:桥接先验与后验视角以增强知识基础

Mingkuan Zhao, Xiayu Sun, Wentao Hu, Suquan Chen, Jiaxuan Li, Xiaoyan Zhu, Xin Lai, Jiayin Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出遗憾预训练框架,通过双视角架构利用未来信息增强因果语言模型,在OLMoE-1B-7B架构上平均准确率提升至33.9%。

详情
AI中文摘要

因果语言模型仅使用前文上下文对序列概率进行分解,在训练过程中尽管未来信息在训练数据中可用,但仍未被利用。本文介绍了遗憾预训练,这是一个基于学习使用特权信息范式的自监督框架。该框架采用双视角架构,其中单个模型同时生成因果学生分布和未来条件教师分布。训练目标通过遗憾损失增强标准语言建模,该损失最小化从教师到学生的KL散度,将未来感知信号传递到因果表示中。我们在OLMoE-1B-7B架构上研究了两种教师配置:LocalRegret,它将注意力扩展一个未来标记;以及GlobalRegret,它以目标位置被掩码的双向上下文为条件。在40亿个标记的训练后,对九个下游任务的实验表明,两种配置均持续优于基线。平均而言,GlobalRegret和LocalRegret分别达到33.9%和32.2%的准确率,超过了基线的30.2%。最值得注意的是,GlobalRegret将BoolQ性能提高了18.1个百分点(61.0%对42.9%)。该框架不引入额外参数,每个训练步骤仅需一次额外的推理模式前向传播。

英文摘要

Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.
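
The combined objective for a single token position can be sketched as below. Two points are assumptions rather than details from the abstract: the KL direction is taken as KL(teacher || student), and the `weight` coefficient on the regret term is hypothetical. The teacher logits are treated as constants, consistent with the inference-mode forward pass mentioned above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regret_objective(student_logits, teacher_logits, target, weight=1.0):
    """Next-token cross-entropy plus a weighted regret (KL) term."""
    student = softmax(student_logits)   # causal view
    teacher = softmax(teacher_logits)   # future-conditioned view, no gradient
    lm_loss = -math.log(student[target])
    regret = kl_divergence(teacher, student)
    return lm_loss + weight * regret, lm_loss, regret

# Toy three-token vocabulary: the teacher is more confident about token 0.
total, lm, regret = regret_objective([0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                                     target=0, weight=0.5)
```

When the two views agree exactly, the regret term vanishes and the objective reduces to standard language modeling.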

2606.03073 2026-06-03 cs.LG cs.AI 版本更新

Efficient Hyperparameter Optimization for LLM Reinforcement Learning

大语言模型强化学习的高效超参数优化

Minping Chen, Bowen Xiao, Du Liang, Chuxuan Zeng, Zeyi Wen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) China United Network Communications Group(中国联合网络通信集团)

AI总结 提出联合保真度超参数优化方法,通过同时调整模型大小和训练预算作为保真度,并集成早停策略和检查点机制,显著提升计算效率(每轮最高14.9倍)且性能提升5.8%-111.6%。

Comments 12 pages, 6 figures, accepted at ACL 2026

详情
AI中文摘要

大语言模型的强化学习对超参数配置高度敏感,使得超参数优化至关重要但计算成本高昂。现有的多保真度超参数优化方法由于模型规模庞大和训练周期资源密集,在LLM RL中仍然效率低下。本文提出联合保真度超参数优化(JF-HPO),它同时调整模型大小和训练预算作为保真度。JF-HPO通过以下方式实现:(i)在每次HPO试验中,利用目标LLM的小型代理模型进行高效训练和评估;(ii)基于训练动态整合精心设计的早停策略;(iii)引入高效的检查点机制以消除冗余计算。与现有HPO方法相比,JF-HPO显著提高了每次试验的计算效率(最高达14.9倍),同时在相同时间预算下达到更好或具有竞争力的预测精度。值得注意的是,与使用VeRL配方中的超参数配置相比,JF-HPO的性能提升范围从5.8%到111.6%。

英文摘要

Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO methods remain inefficient for LLM RL due to the massive model scale and resource-intensive training cycles. In this paper, we propose Joint Fidelity Hyperparameter Optimization (JF-HPO), which simultaneously adapts both model size and training budget as fidelity. JF-HPO rests on three components: (i) a small proxy model of the target LLM for efficient training and evaluation in each HPO trial; (ii) carefully designed early-stopping strategies based on training dynamics; and (iii) an efficient checkpointing mechanism that eliminates redundant computations. Compared with existing HPO methods, JF-HPO significantly improves the computational efficiency of each trial (by up to 14.9 times), while achieving better or competitive predictive accuracy under the same time budget. Notably, compared with utilizing hyperparameter configurations from the VeRL Recipe, JF-HPO delivers performance improvements ranging from 5.8% to 111.6%.

2606.03069 2026-06-03 cs.CV cs.AI cs.LG 版本更新

ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements

ROBUST-WT: 通过白化和训练增强的鲁棒不确定性感知分割变换

Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad

发表机构 * SEECs, University of Engineering and Technology, Lahore, Pakistan(工程与技术大学,拉合尔,巴基斯坦)

AI总结 针对WT-PSE框架的四个局限性,提出域自适应增强、混合损失函数、课程式权重调度和消融控制标志四种改进,在眼底视盘分割中Dice达0.956。

Comments 8 pages, 6 figures; code available at https://github.com/213269/WT-PSE-code-main

详情
AI中文摘要

医学图像的广义分割可防止跨多个领域使用不同成像设备和临床协议时的性能下降。基于白化变换的概率形状正则化提取器(WT-PSE)发表于2024年IEEE Transactions on Medical Imaging,通过特征去相关和基于Wasserstein距离的知识蒸馏实现鲁棒的跨域分割。本研究系统性地检查了对WT-PSE学习框架的改进。识别出原始实现中的四个局限性:有限的训练增强无法模拟真实的扫描仪变化;依赖逐像素二元交叉熵损失对边缘噪声敏感;缺乏调度损失加权策略可能导致早期训练不稳定;以及缺乏用于受控科学比较的消融开关。为解决这些问题,我们提出四项增强:(1) 域自适应增强,包括随机擦除、伽马校正和椒盐噪声;(2) 混合BCE和Dice损失函数,用于在噪声条件下改进边缘感知分割;(3) 基于课程的Dice权重调度策略;(4) 命令行控制标志用于系统消融研究。在眼底视盘分割基准上的实验表明,改进后的流程在最终epoch的视盘Dice得分为0.956,ASD得分为13.31,优于基线epoch-5的Dice得分0.939。这些结果表明,在不修改底层WT-PSE架构的情况下,训练层面的改进可以提供一致的性能提升。

英文摘要

Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance-based knowledge distillation to achieve robust cross-domain segmentation. This study systematically examines improvements to the WT-PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per-pixel binary cross-entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain-adaptive augmentation including random erasing, gamma correction, and salt-and-pepper noise; (2) a hybrid BCE and Dice loss function for improved edge-aware segmentation under noisy conditions; (3) a curriculum-based Dice weight scheduling strategy; and (4) command-line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic-disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch-5 Dice score of 0.939. These results indicate that training-level improvements can provide consistent performance gains without modifying the underlying WT-PSE architecture.
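
Enhancements (2) and (3) can be illustrated together. The sketch below is one plausible reading, assuming a linear ramp of the Dice weight over the first epochs; the ramp length, maximum weight, smoothing constant, and function names are hypothetical and not taken from the paper or its repository.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy on predicted probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum of masses + smooth)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def dice_weight(epoch, warmup_epochs=5, max_weight=0.5):
    """Assumed curriculum: ramp the Dice weight linearly from 0 to max_weight."""
    return max_weight * min(1.0, epoch / warmup_epochs)

def hybrid_loss(probs, targets, epoch):
    """BCE dominates early training; Dice is phased in as epochs progress."""
    w = dice_weight(epoch)
    return (1.0 - w) * bce_loss(probs, targets) + w * dice_loss(probs, targets)

# Flattened toy mask: three pixels, two of them foreground.
probs, targets = [0.9, 0.2, 0.7], [1, 0, 1]
early = hybrid_loss(probs, targets, epoch=0)
late = hybrid_loss(probs, targets, epoch=5)
```

Starting from pure BCE and phasing in the Dice term is one way to avoid the early-training instability the abstract attributes to unscheduled loss weighting.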

2606.03068 2026-06-03 cs.LG cs.AI 版本更新

Learn When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs

学习何时何地连接:图上动态消息传递的自适应虚拟节点

Jaejun Lee, Joyce Jiyoung Whang

发表机构 * School of Computing, KAIST(计算机学院,韩国科学技术院) Department of AI Computing, KAIST(人工智能计算系,韩国科学技术院)

AI总结 提出MAVN框架,通过端到端可微分的方式自适应地决定在消息传递神经网络的哪一层为哪些节点引入虚拟节点,并基于双向评分机制建立连接,理论证明其能模拟任意节点-虚拟节点连接模式,实验表明在多个数据集上显著提升骨干网络性能。

Comments 12 pages, 6 figures, 10 tables, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

虽然虚拟节点(VN)常用于消息传递神经网络(MPNN)中以促进有效的消息传递,但现有的基于VN的方法存在局限性,例如限制所有节点连接到相同数量的VN、在应用MPNN之前固定连接,以及独立于连接到同一VN的其他节点而将节点连接到VN。我们提出了MAVN,一个端到端可微分的MPNN框架,允许节点和VN之间无约束的连接,并根据跨层演化的节点表示动态按需引入VN。具体来说,MAVN学习基于连接的相对重要性自适应地决定何时(在哪一层)以及何地(连接到哪些节点)引入和连接VN。从候选VN池中,MAVN在每一层选择必要的VN,每个选中的VN连接到非空节点子集,由双向评分机制引导,该机制同时捕捉节点对VN的偏好和VN对节点的偏好。我们理论上证明,对于任何节点-VN连接模式,都存在一组MAVN参数可以模拟该模式。在九个真实世界数据集上的实验表明,MAVN持续提升骨干MPNN的性能,相对于骨干网络实现高达46.5%的提升,并优于基线方法。

英文摘要

While Virtual Nodes (VNs) are often utilized in Message Passing Neural Networks (MPNNs) to facilitate effective message passing, existing VN-based methods have limitations, such as constraining all nodes to connect to the same number of VNs, fixing the connections before applying MPNNs, and connecting a node to a VN independently of the other nodes that connect to the same VN. We propose MAVN, an end-to-end differentiable MPNN framework that allows non-constrained connections between nodes and VNs and dynamically introduces VNs on demand in response to evolving node representations across layers. Specifically, MAVN learns to adaptively determine when (at which layer) and where (to which nodes) to introduce and connect VNs based on the relative importance of connections. From a pool of candidate VNs, MAVN selects the necessary VNs in each layer, where each selected VN is connected to a nonempty subset of nodes, guided by a dual-perspective scoring mechanism that jointly captures the nodes' preferences for VNs and the VNs' preferences for nodes. We theoretically prove that for any node-VN connectivity pattern, there exists a set of MAVN parameters that can simulate the pattern. Experiments on nine real-world datasets demonstrate that MAVN consistently improves the performance of backbone MPNNs, achieving up to 46.5% improvement over the backbones and outperforming the baselines.

2606.03066 2026-06-03 cs.AI 版本更新

CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

CORE: 面向冲突的通用多模态篡改检测推理

Jinjie Shen, Yaxiong Wang, Yujiao Wu, Lechao Cheng, Tianrui Hui, Nan Pu, Zhihui Li, Zhun Zhong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出CORE框架,通过构建冲突归因语料库和面向冲突的推理,增强多模态大语言模型的冲突捕捉能力,实现鲁棒且泛化的多模态篡改检测。

Comments Accepted by ICML 2026

详情
AI中文摘要

生成式AI的快速崛起使得多模态假新闻日益逼真且泛滥,对公众信任和社会稳定构成严重威胁。现有检测方法严重依赖针对特定篡改的模型和大规模标注数据,导致对新兴篡改类型的泛化能力差。我们观察到,篡改误导信息的本质在于其内在冲突,即跨模态或与常识世界知识之间的语义或物理不一致。受此启发,我们提出面向冲突的推理(CORE)框架,这是一种有效的范式,通过学习赋予多模态大语言模型(MLLMs)显式的冲突捕捉能力。为此,CORE首先构建了冲突归因语料库(CAC),包含冲突因素和来源的细粒度标注,为后续的冲突感知训练提供必要的数据支持。通过基于CAC进行面向冲突的表示增强和推理,CORE实现了鲁棒且可泛化的冲突检测,能够有效且快速地适应未见过的篡改类型,仅需少量样本甚至零样本设置。大量实验表明,CORE超越了现有最先进模型。数据集和代码已公开于该链接。

英文摘要

The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation-specific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observe that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose the Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endow multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.

2606.03061 2026-06-03 cs.DC cs.AI cs.LG cs.NI cs.SY eess.SY 版本更新

Brief Announcement: Generative Markov Model for Distributed Computing Systems

简要公告:分布式计算系统的生成马尔可夫模型

Alfreds Lapkovskis, Ali Beikmohammadi, Sindri Magnússon, Praveen Kumar Donta

发表机构 * Department of Computer and Systems Sciences, Stockholm University, Sweden(斯德哥尔摩大学计算机与系统科学系)

AI总结 针对分布式计算系统的异构性和复杂性,提出一种基于结构化状态分解的生成马尔可夫模型,实现可处理的模拟、推理和策略学习,并通过协作AI推理案例验证其有效性。

Comments Submitted to 40th International Symposium on Distributed Computing (DISC 2026)

详情
AI中文摘要

新兴的分布式计算范式,如计算连续体,本质上是异构、随机和复杂的。高效且有效地利用连续体中所有可用资源需要一个统一的系统形式化模型。为了解决这一差距,我们提出了一个通用框架,将分布式计算系统建模为生成马尔可夫模型,该模型在结构化系统状态上进行分解。在我们的模型中,状态分解为高维变量,每个变量进一步在其元素上分解,反映了分布式系统固有的稀疏依赖结构。这产生了一个可处理的模型,能够对原本难以处理的系统状态进行模拟、推理和策略学习,从而将分布式计算与马尔可夫链理论和强化学习(RL)联系起来。我们通过一个协作AI推理的案例研究来展示我们的框架,其中专用服务器将资源与服务用户自愿提供的资源相结合。我们的结果表明,集中式调度在规模上成为瓶颈,而将计算分布到用户设备上可减少延迟和服务器资源消耗。这些发现突显了自适应决策在分布式计算系统中的价值,并展示了该框架在建模、模拟和优化方面的实用性。

英文摘要

Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.
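To make the idea of a transition that factorizes over state elements concrete, here is a minimal sketch. It is not the paper's model: the state is assumed to be a tuple of per-device queue lengths, and the arrival probability, queue cap, and the two policies are hypothetical choices made only for illustration.

```python
import random

def step_element(queue_len, served, rng, arrival_p=0.5, cap=5):
    """Transition of ONE state element: depends only on its own value and on whether it is served."""
    arrivals = 1 if rng.random() < arrival_p else 0
    departures = 1 if (served and queue_len > 0) else 0
    return min(cap, queue_len - departures + arrivals)

def step(state, action, rng):
    """Factorized transition: each element is sampled independently given (own value, action)."""
    return tuple(step_element(q, served=(i == action), rng=rng) for i, q in enumerate(state))

def serve_longest(state):
    """Hypothetical adaptive policy: serve the device with the longest queue."""
    return max(range(len(state)), key=lambda i: state[i])

def serve_first(state):
    """Hypothetical static policy: always serve device 0."""
    return 0

def simulate(policy, n_devices=3, horizon=50, seed=0):
    """Roll the generative model forward and return the mean total backlog per step."""
    rng = random.Random(seed)
    state = (0,) * n_devices
    total_backlog = 0
    for _ in range(horizon):
        state = step(state, policy(state), rng)
        total_backlog += sum(state)
    return total_backlog / horizon
```

Because every element has its own small transition function, the joint state space never has to be enumerated; the same `step` function can drive simulation or serve as the environment for policy learning.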

2606.03057 2026-06-03 cs.LG cs.AI 版本更新

Rethinking Molecular Text Representations for LLMs: An Empirical Study

重新思考用于大语言模型的分子文本表示:一项实证研究

Arun Raja, Garrett M. Morris, Kian Ming A. Chai

发表机构 * University of Oxford(牛津大学) DSO National Laboratories(DSO国家实验室)

AI总结 通过系统基准测试,评估了9种分子表示和8种化学任务下16个LLM的性能,发现表示选择强烈影响结果,结构化文本表示(CML、MolJSON)在结构任务中占优,IUPAC在语义任务中占优,而SMILES很少最优。

Comments 25 pages, 11 figures, 20 tables

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于分子任务,但目前尚不清楚使用哪种分子表示。我们提出了一个系统基准测试,评估了LLM在九种表示和八种化学任务上的分子能力。我们基准测试了16个LLM,涵盖五个模型家族,包括推理和非推理变体、化学专用LLM以及封闭前沿模型。性能强烈依赖于表示,没有单一表示在所有任务中获胜,尽管CML是最好的,其次是MolJSON、InChI,然后是规范SMILES。显式结构化文本表示(CML和MolJSON)主导结构任务;IUPAC主导语义任务,在所有16个LLM的分子检索中获胜;而SMILES变体尽管在预训练中普遍存在,但很少是最优的。化学专用模型在使用SMILES时表现良好,但使用结构化文本表示时性能大幅下降,这表明仅基于SMILES的评估奖励了不具泛化能力的专业化。使用LLM作为评判者,我们发现IUPAC产生的正确分子生成比例最高。通过分词审计、线性探针和注意力的机制研究表明,表示在模型内部以不同方式编码;例如,结构化表示需要跨分子范围的更高注意力。我们的结果反对表示不变的评估,并激励基于LLM的化学任务感知表示路由。

英文摘要

Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations and eight chemical tasks. We benchmark 16 LLMs across five model families, including reasoning and non-reasoning variants, chemistry-specialized LLMs, and closed frontier models. Performance is strongly representation-dependent and no single representation wins across tasks, though CML is the best, followed by MolJSON, InChI, and then canonical SMILES. Explicit structured text representations (CML and MolJSON) dominate structural tasks; IUPAC dominates semantic tasks, winning molecule retrieval for all 16 LLMs; and SMILES variants are rarely optimal despite their prevalence in pretraining. Chemistry-specialized models perform well with SMILES at the cost of large degradations with structured text representations, suggesting SMILES-only evaluation rewards specialization that does not generalize. Using LLM-as-a-judge, we find that IUPAC produces the highest fraction of correct molecule generations. A mechanistic study via tokenization audits, linear probes and attention shows that representations are encoded differently inside the model; for example, structured representations require higher attention across the molecular span. Our results argue against representation-invariant evaluation and motivate task-aware representation routing for LLM-based chemistry.

2606.03056 2026-06-03 cs.AI 版本更新

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

SkillDAG:面向大规模LLM技能选择的自演化类型化技能图

Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang

发表机构 * Fudan University(复旦大学) National University of Singapore(新加坡国立大学) CFAR A*STAR

AI总结 提出SkillDAG,通过类型化有向图建模技能间关系,并作为推理时可调用的结构化检索接口,支持在线演化,在ALFWorld和SkillsBench上显著超越基线。

Comments 19 pages, 5 figures

详情
AI中文摘要

随着LLM智能体采用大规模技能库,选择合适的子集成为一个结构性问题而非相似性匹配问题:技能之间存在依赖、冲突、特化或重复关系,这种结构对于全枚举和嵌入相似性都是不可见的。我们提出SkillDAG,将技能间关系建模为类型化有向图,并将其作为推理时、智能体可调用的结构化检索接口暴露给LLM智能体,在执行过程中查询和演化,而非固化在固定的检索流水线中:每次搜索返回向量匹配、类型化边邻居和冲突信号,并通过提议-提交协议让智能体注册基于执行的边,从而使图在多个回合中积累结构。在ALFWorld和SkillsBench上使用MiniMax-M2.7,SkillDAG达到67.1%的成功率和27.3%的奖励,比最强报告的Graph-of-Skills基线分别高出+12.8和+8.6个百分点;该优势可移植到gpt-5.2-codex,且在匹配查询下,内在SkillsBench Ret@K从65.5提升至78.2。这些增益可追溯到可隔离的机制:候选排序在池规模扩大10倍时保持鲁棒,而固定种子扩散流水线会退化;以及集合单调的在线编辑,在不驱逐先前命中项的情况下扩大真实标注(ground-truth)召回率。

英文摘要

As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.
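The typed graph and the propose-then-commit protocol described above can be sketched roughly as follows. The class name, method names, and edge-type strings are hypothetical (the abstract only names the four relationship kinds), and the vector-matching step is assumed to happen elsewhere and is passed in as a list of skill names.

```python
EDGE_TYPES = {"depends_on", "conflicts_with", "specializes", "duplicates"}

class SkillGraph:
    """Toy typed directed graph over skill names (illustrative, not the SkillDAG implementation)."""

    def __init__(self):
        self.edges = {}    # committed edges: src -> set of (edge_type, dst)
        self.pending = []  # proposed edges awaiting execution evidence

    def propose(self, src, edge_type, dst):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.pending.append((src, edge_type, dst))

    def commit(self, execution_succeeded):
        """Register pending edges only when an execution backs them; always clear the queue."""
        if execution_succeeded:
            for src, edge_type, dst in self.pending:
                self.edges.setdefault(src, set()).add((edge_type, dst))
        self.pending.clear()

    def search(self, matches):
        """Augment vector matches with typed-edge neighbors and conflict signals."""
        neighbors = {s: sorted(self.edges.get(s, ())) for s in matches}
        conflicts = sorted(
            (s, dst)
            for s in matches
            for edge_type, dst in self.edges.get(s, ())
            if edge_type == "conflicts_with" and dst in matches
        )
        return {"matches": list(matches), "neighbors": neighbors, "conflicts": conflicts}
```

Committing only adds edges and never deletes earlier ones, which is one simple way to keep edits set-monotone in the sense the abstract mentions.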

2606.03054 2026-06-03 cs.AI 版本更新

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

ToolGate: 面向工具增强视觉语言智能体的令牌高效预调用控制

Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University College London(伦敦大学学院) AI Lab, The Yangtze River Delta(长江三角洲人工智能实验室)

AI总结 针对工具增强视觉语言智能体中工具调用成本高且不必要的问题,提出轻量级外部控制器ToolGate,通过轨迹文本和结构特征预测执行/跳过决策,在降低令牌成本的同时保持或提升准确率。

详情
AI中文摘要

工具增强的视觉语言智能体可以通过OCR、检测、分割等工具获取外部感知证据,但执行每个提议的工具调用成本高昂且有时不必要。我们研究了预调用控制问题:在ReAct风格的VLM智能体提出感知工具调用后,是否应执行该调用,还是在其输出进入上下文之前跳过?在五个基准测试中,我们发现基线智能体表现出较差的局部选择性:有益和有害调用的发生率相近(11.8% vs. 9.9%),而大多数调用不会改变即时强制答案预测。我们引入了ToolGate,一个轻量级外部控制器,它根据轨迹文本和简单的结构特征预测执行/跳过决策。在两个Qwen3-VL骨干网络上,ToolGate将令牌成本降低到无限制ReAct基线的64-69%,同时保持跨域设置的平均准确率。在Qwen3-VL-30B上进行匹配域轨迹训练后,它进一步将平均准确率提高了1.65个百分点。这些结果表明,工具增强的VLM智能体不仅受益于更好的感知工具,还受益于对工具输出何时值得付费的显式控制。

英文摘要

Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.
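A pre-call gate of the kind described above can be pictured as a small scorer that sits between the agent's proposal and the tool. The sketch below is an assumption-heavy illustration: the three structural features, the weights, and the threshold are all hypothetical, and the trajectory-text features the abstract mentions are omitted entirely.

```python
import math

def features(trajectory, proposed_tool):
    """Hypothetical structural features of the trajectory so far."""
    prior_calls = [step["tool"] for step in trajectory if step.get("tool")]
    return [
        1.0,                                   # bias term
        float(len(prior_calls)),               # number of tool calls already made
        float(proposed_tool in prior_calls),   # 1.0 if this tool was already called
    ]

def should_execute(trajectory, proposed_tool, weights, threshold=0.5):
    """Logistic gate: execute the proposed call only if the predicted benefit clears the threshold."""
    score = sum(w * x for w, x in zip(weights, features(trajectory, proposed_tool)))
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold
```

When the gate returns `False`, the tool output never enters the context, which is where the token savings would come from.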

2606.03040 2026-06-03 cs.AI cs.LG 版本更新

RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

RelGT-AC:用于关系数据库中自动完成任务的关系图Transformer

Phillip Jiang

发表机构 * Appsofa LLC(Appsofa公司)

AI总结 提出RelGT-AC模型,通过列掩码策略、统一任务头和TF-IDF文本编码器,在关系数据库的自动完成任务上优于GraphSAGE基线。

Comments 12 pages, 6 figures. Code and model checkpoints available at https://github.com/jiangdmv/graph-transformer

详情
AI中文摘要

关系数据库支撑着现代企业、科学和医疗系统,但由于其多表、异构和时间结构,对此类数据进行预测性机器学习仍然具有挑战性。关系深度学习(RDL)通过将数据库表示为异构图并直接应用图神经网络(GNN)来解决这一问题。RelBench v2最近引入了自动完成任务——一种实际动机的任务类型,其目标是从关系上下文中预测现有列值,类似于智能表单填充助手。我们提出了RelGT-AC(用于自动完成的关系图Transformer),通过三个有针对性的贡献扩展了RelGT架构:(1)一种列掩码策略,通过在子图编码期间屏蔽目标列来防止平凡解;(2)一个统一的任务头,支持在单个模型内进行二分类、多分类和回归自动完成任务;(3)一个TF-IDF文本编码器,自动检测和编码自由文本列,恢复分类编码器丢弃的强词汇信号。在跨越3个RelBench v2数据集(rel-trial、rel-f1、rel-stack)的7个任务中,RelGT-AC在所有3个回归自动完成任务上优于GraphSAGE基线,并通过TF-IDF编码器在文本密集的资格任务上实现了高达+10 AUROC点的提升。

英文摘要

Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi-table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks -- a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form-filling assistant. We propose RelGT-AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF-IDF text encoder that automatically detects and encodes free-text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel-trial, rel-f1, rel-stack), RelGT-AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text-heavy eligibility tasks via the TF-IDF encoder.
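Two of the three contributions, column masking and TF-IDF encoding of free-text columns, are simple enough to sketch in plain Python. This is a minimal illustration under stated assumptions: rows are dictionaries, the `<MASK>` sentinel is a made-up placeholder, and the TF-IDF variant (term frequency times `ln(N / df)`) is a textbook form that may differ from what RelGT-AC uses.

```python
import math
from collections import Counter

MASK = "<MASK>"  # hypothetical sentinel value

def mask_target(row, target_col):
    """Hide the column being predicted so the encoder cannot simply copy it."""
    return {k: (MASK if k == target_col else v) for k, v in row.items()}

def tfidf(docs):
    """Textbook TF-IDF; returns one {term: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            t: (count / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, count in term_freq.items()
        })
    return vectors
```

Terms that occur in every document receive weight zero, so only distinguishing words such as specific eligibility criteria contribute signal.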

2606.03036 2026-06-03 cs.AI 版本更新

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

TriEval: 一种资源高效的LLM偏见、毒性和真实性评估流水线

Akshatha Srikantha, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal

AI总结 提出TriEval流水线,通过同时评估LLM输出的偏见、毒性和真实性,在标准笔记本电脑上高效运行,并揭示开源与闭源模型在毒性和真实性上的差异。

详情
AI中文摘要

LLM已经从基本的聊天机器人演变为AI生态系统的支柱,现在广泛应用于医疗、学校和政府服务。LLM的领域范围采用需要持续评估以确保其安全性和公平性。部署LLM后遇到的常见问题包括不一致的输出和错误信息的幻觉。尽管存在许多LLM评估工具,但大多数仅限于一次测试单个参数,或者需要大多数研究人员无法访问的大量计算资源。TriEval通过同时评估LLM输出的多个参数(包括偏见、毒性和真实性)来解决这些挑战,同时最小化计算资源。该流水线与开源和闭源模型兼容,并在没有GPU集群的标准笔记本电脑上运行。TriEval已在四个模型上测试:Llama 3 8B、Mistral 7B、Gemma 2 9B和Claude Haiku。结果显示了开源和闭源模型之间的明显差异,特别是在毒性和真实性方面。TriEval作为开源发布,以使计算资源有限的研究人员能够更广泛地访问。

英文摘要

LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety and fairness. Common issues encountered after deploying LLMs include inconsistent outputs and hallucinations of incorrect information. Although numerous LLM evaluation tools exist, most are limited to testing a single parameter at a time or require massive computational resources that are not accessible to most researchers. TriEval addresses these challenges by evaluating LLM outputs across multiple parameters, including bias, toxicity, and truthfulness together, while minimizing computing resources. The pipeline is compatible with both open- and closed-source models and runs on a standard laptop without a GPU cluster. TriEval has been tested on four models: Llama 3 8B, Mistral 7B, Gemma 2 9B, and Claude Haiku. The results show clear differences between open-source and closed-source models, especially in terms of toxicity and truthfulness. TriEval is being released as open source to enable broader access for researchers with limited computational resources.

2606.03034 2026-06-03 cs.MA cs.AI 版本更新

Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks

能力广告作为柠檬市场:异构智能体网络的信任层

Gaurav Naresh Mittal

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对LLM智能体网络中的能力虚假声称问题,提出基于柠檬市场理论的信任层,通过概率描述、筛选和声誉机制实现可信委托。

详情
AI中文摘要

大型语言模型(LLM)智能体已开始相互委托工作。诸如模型上下文协议(MCP)和智能体间协议(A2A)等协议允许智能体发布其能力并允许其他智能体调用,且此类智能体的公共注册表已经出现。这些协议假设所广告的能力是静态的、真实的事实。然而,真实的智能体并非如此:其能力是概率性的,随输入变化,在底层模型更新时漂移,并且由于智能体本身是语言模型,它可以完全自信地描述自己却可能是错误的。因此,调用者看到的是智能体声称能做什么,而非实际能做什么,且没有原则性的方法区分可靠提供者和流利的冒名顶替者。我们认为这些困难有一个共同原因:柠檬市场。当质量隐藏且声称成本低廉时,好与坏的提供者变得难以区分,诚实的可靠性得不到回报,市场向最差参与者退化。经济学提供了三种补救措施:信号传递、筛选和声誉,而这些在当今的智能体协议中均不存在。我们做出四项贡献:(1)一个故障分类,将自信-错误命名为非对抗性的、相关的拜占庭故障子类,而经典容错模型对此建模不当;(2)一个柠檬市场模型,表明基于信仰的协议仅允许低信任均衡;(3)信任层,一个轻量级、协议无关的窄腰,位于MCP和A2A之上,添加概率能力描述、筛选和声誉,并在维持过度声称的成本超过其收益时允许分离均衡;(4)一个针对委托链的可靠性组合界限,具有端到端放置论证。该设计无需模型重新训练,并在其信任锚缺失或损坏时优雅降级。

英文摘要

Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Agent2Agent protocol (A2A) let an agent publish what it can do and let others call it, and public registries of such agents are already appearing. These protocols assume an advertised capability is a static, truthful fact. A real agent is none of these things: its competence is probabilistic, varies with input, drifts when the underlying model is updated, and, because the agent is itself a language model, it can describe itself with complete confidence and be wrong. A caller therefore sees what an agent claims to do, not what it can do, with no principled way to tell a reliable provider from a fluent impostor. We argue these difficulties share one cause: the market for lemons. When quality is hidden and claims are cheap, good and bad providers become indistinguishable, honest reliability goes unrewarded, and the market decays toward its worst participants. Economics offers three remedies, signaling, screening, and reputation, and none are present in today's agent protocols. We make four contributions: (1) a failure taxonomy that names confident-wrong as a non-adversarial, correlated subclass of Byzantine faults that classical fault-tolerance mismodels; (2) a market-for-lemons model showing that faith-based protocols admit only a low-trust equilibrium; (3) the Trust Layer, a thin, protocol-agnostic narrow waist above MCP and A2A that adds probabilistic capability descriptors, screening, and reputation, and admits a separating equilibrium when the cost of sustaining an overclaim exceeds the gain from it; and (4) a reliability-composition bound for delegation chains with an end-to-end placement argument. The design needs no model retraining and degrades gracefully when its trust anchors are absent or corrupt.
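Two of the quantities mentioned above lend themselves to a back-of-the-envelope sketch. The first function assumes every hop in a delegation chain must succeed and that failures are independent; that independence is an assumption of this sketch only (the abstract stresses that confident-wrong faults are correlated), so the paper's actual bound may look different. The second function restates the separating condition from the abstract.

```python
from math import prod

def chain_reliability(per_hop_success):
    """End-to-end success probability when every hop must succeed and failures are independent."""
    return prod(per_hop_success)

def overclaim_is_deterred(gain_from_overclaim, cost_of_sustaining):
    """Separating condition: overclaiming does not pay once sustaining it costs more than it gains."""
    return cost_of_sustaining > gain_from_overclaim
```

Even under the optimistic independence assumption, three hops at 90% reliability each succeed end to end only about 73% of the time, which is why hidden per-agent quality compounds quickly along a chain.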

2606.03031 2026-06-03 cs.AI cs.MA cs.SC 版本更新

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

AUDITFLOW:用于结构化财务报告验证的可执行符号环境

Yan Wang, Xuguang Ai, Jaisal Patel, Xueqing Peng, Fengran Mo, Yupeng Cao, Haohang Li, Mingyu Cao, Lingfei Qian, Víctor Gutiérrez-Basulto

发表机构 * The Fin AI Rensselaer Polytechnic Institute Université de Montréal Stevens Institute of Technology University of Surrey Cardiff University

AI总结 提出基于图的多智能体框架AuditFlow,通过构建符号环境分离自适应搜索与确定性验证,在财务审计验证任务中实现82.09%的联合审计准确率。

详情
AI中文摘要

结构化财务审计验证对语言模型代理而言是困难的,因为正确性依赖于结构化证据而非仅文本。模型必须将报告事实链接到分类概念,遍历计算或维度关系,并在应用审计规则前重新计算预期值。我们提出AuditFlow,一个基于图的多智能体框架,将自适应搜索与确定性验证分离。AuditFlow从静态US-GAAP分类图和动态XBRL申报图构建符号环境,并通过类型化工具暴露事实检索、分类遍历、数值检查和规则评估功能。两名初级审计员从监管和证据角度检查每个案例,而高级审计员解决分歧并可要求进一步调查。最终报告通过证据聚合融合,产生审计结论、预期值、证据链和可信度评分。在基于FinAuditing的FinMR样本上,AuditFlow在GPT-5.5下达到82.09%的联合审计准确率,超过最强基线14.93个百分点。移除确定性检查后准确率降至17.91%,表明符号环境执行了模型无法可靠替代的验证步骤。

英文摘要

Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.
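The deterministic step, recomputing an expected value from calculation relations and comparing it with the reported fact, might look like the sketch below. The graph layout, the tolerance, and the example figures are hypothetical; real XBRL calculation arcs carry more detail (contexts, units, decimals) than this toy structure.

```python
def recompute(parent, calc_graph, facts):
    """Expected parent value: sum of weight * child value over the parent's calculation arcs."""
    return sum(weight * facts[child] for child, weight in calc_graph[parent])

def check(parent, calc_graph, facts, tol=0.5):
    """Compare the reported parent value with the recomputed one; no model call is involved."""
    expected = recompute(parent, calc_graph, facts)
    reported = facts[parent]
    return {
        "expected": expected,
        "reported": reported,
        "consistent": abs(expected - reported) <= tol,
    }

# Illustrative data only: concept names follow US-GAAP style, the values are made up.
calc_graph = {"Assets": [("AssetsCurrent", 1.0), ("AssetsNoncurrent", 1.0)]}
facts = {"Assets": 150, "AssetsCurrent": 100, "AssetsNoncurrent": 40}
result = check("Assets", calc_graph, facts)
```

Here `result["consistent"]` is `False` because the children sum to 140 while 150 is reported; an agent only has to find the right facts and relations, and the arithmetic verdict is left to code.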

2606.03029 2026-06-03 cs.CL cs.AI 版本更新

Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

基于研究者指定协变量的条件假设生成用于LLM文本分析

Paiheng Xu, Jing Liu, Wei Ai

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 提出条件假设生成框架,通过纳入研究者指定的协变量来引导LLM发现相关子组内而非全局的差异模式,并采用特征-协变量交互和分层内去均值与逆频率重加权两种方法解决子组不平衡和符号反转问题。

详情
AI中文摘要

计算社会科学的一个核心目标是发现语言如何随感兴趣的结果(如政治派别或教学质量)变化的可解释差异。最近的基于LLM的假设生成方法用自然语言描述这些差异,但选择的是全局判别模式,而没有考虑基于研究者领域知识塑造数据的协变量。当忽略协变量时,所选模式可能反映混杂因素而非实质性感兴趣的差异。我们引入了条件假设生成,这是一个纳入研究者指定协变量的框架,以将假设发现引导至相关子组内成立的差异。出现了两个挑战:目标子组可能代表性不足(分层不平衡),并且差异的方向可能在子组间反转(符号反转)。我们提出了两种受计量经济学启发的方法:一种引入特征-协变量交互以检测符号反转,另一种应用分层内去均值和逆频率重加权以平衡代表性不足的分层。合成实验表明,每种方法在其目标设置中均优于全局基线,对两个真实世界数据集的专家评估证实,协变量感知的生成在相关子组内产生了更有用的假设。

英文摘要

A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature--covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
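The second method combines two standard operations, within-stratum demeaning and inverse-frequency reweighting, which can be sketched as follows. The function names and the particular normalization (each stratum ends up with total weight 1/K) are assumptions of this sketch rather than details taken from the paper.

```python
from collections import Counter, defaultdict

def demean_within_strata(values, strata):
    """Subtract each stratum's mean so only within-stratum variation remains."""
    sums = defaultdict(float)
    counts = Counter(strata)
    for v, s in zip(values, strata):
        sums[s] += v
    return [v - sums[s] / counts[s] for v, s in zip(values, strata)]

def inverse_frequency_weights(strata):
    """Weight each row by 1 / (K * n_s), so each of the K strata carries total weight 1/K."""
    counts = Counter(strata)
    k = len(counts)
    return [1.0 / (k * counts[s]) for s in strata]
```

With strata `["a", "a", "a", "b"]`, the single `"b"` row receives weight 0.5 and each `"a"` row receives 1/6, so the small stratum is no longer drowned out by the large one.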

2606.03026 2026-06-03 cs.NE cs.AI cs.LG 版本更新

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

面向稀疏脉冲语言模型在商用CPU上的脉冲感知C++ INT8推理

Ting Liu

发表机构 * SymbolicLight Research(SymbolicLight研究院)

AI总结 本文提出一种脉冲感知的C++推理运行时,利用稀疏二进制脉冲状态作为执行原语,结合混合布局、AVX2/FMA内核和INT8量化,在商用CPU上实现脉冲语言模型的高效解码,吞吐量优于同等规模稠密模型但质量略逊。

Comments 11 pages, 7 tables

详情
AI中文摘要

脉冲语言模型展现出激活稀疏性,而稠密Transformer运行时无法直接利用。本文从系统角度研究这一特性。基于SymbolicLight V1脉冲门控语言模型家族,我们实现了一个C++ CPU推理运行时,将稀疏二进制脉冲状态视为执行原语,而非仅应用事后权重压缩。该运行时结合了清单驱动的权重加载器、混合行/列内存布局、AVX2/FMA内核、每通道对称INT8量化以及脉冲条件稀疏路径的整数域累加。在AMD Ryzen 7 5800X上,早期标量FP32基线解码速度为9.5 tokens/s。混合布局AVX2 FP32将其提升至14.7 tokens/s,而AVX2 INT8在相同step-30k导出模型上达到19.9 tokens/s,同时将权重占用从3.49 GB降至1.06 GB。对于可用的186k步874M参数INT8导出模型,C++运行时在单线程CPU基准测试中解码速度为22.63 tokens/s,相比之下,TinyLlama-1.1B Q8_0为16.31 tokens/s,Falcon3-1B Q8_0为11.26 tokens/s,Qwen2.5-1.5B Q8_0为9.70 tokens/s。线程扩展在四个CPU线程时达到47.90 tokens/s,512 token预填充从单线程的29.86 tokens/s提升至八线程的94.68 tokens/s。吞吐量提升伴随着质量代价:SNN报告WikiText-2困惑度为24.80,差于同一基准中的稠密基线。我们将结果定位为稀疏语言运行时的推理系统研究,长期动机在于可能受益于传感器和执行器附近本地低核推理的具身和边缘智能体。脉冲感知执行可以改善稀疏脉冲语言模型的CPU吞吐量和内存行为,而模型质量、受控稠密训练基线、具身任务评估和测量CPU能耗仍是开放问题。

英文摘要

Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than only applying post-hoc weight compression. The runtime combines a manifest-driven weight loader, mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths. On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decodes at 9.5 tokens/s. Mixed-layout AVX2 FP32 raises this to 14.7 tokens/s, and AVX2 INT8 reaches 19.9 tokens/s on the same step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB. For the available 186k-step 874M-parameter INT8 export, the C++ runtime decodes at 22.63 tokens/s in a single-thread CPU benchmark, compared with 16.31 tokens/s for TinyLlama-1.1B Q8_0, 11.26 tokens/s for Falcon3-1B Q8_0, and 9.70 tokens/s for Qwen2.5-1.5B Q8_0 under llama.cpp. Thread scaling reaches 47.90 tokens/s at four CPU threads, and 512-token prefill improves from 29.86 to 94.68 tokens/s from one to eight threads. The throughput result comes with a quality cost: the SNN reports WikiText-2 perplexity 24.80, worse than the dense baselines in the same benchmark. We frame the result as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models, while model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems.
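The combination of per-channel symmetric INT8 quantization and integer-domain accumulation over binary spikes can be illustrated in a few lines. The paper's runtime is C++ with AVX2 kernels; the Python sketch below only shows the arithmetic, and the function names and row-per-output-channel layout are assumptions.

```python
def quantize_per_channel(weights):
    """Symmetric INT8: one scale per output channel (row), zero-point fixed at 0."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def spike_matvec(q_rows, scales, spikes):
    """y = W @ s for a binary spike vector s.

    Only columns with an active spike are touched, the partial sums stay in the
    integer domain, and each output is rescaled to float exactly once.
    """
    active = [j for j, s in enumerate(spikes) if s]
    return [scale * sum(row[j] for j in active) for row, scale in zip(q_rows, scales)]
```

Since a spike is either 0 or 1, the multiply in the usual multiply-accumulate disappears: the kernel reduces to summing selected INT8 weights, which is the kind of work a sparse path can skip entirely for inactive inputs.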

2606.03022 2026-06-03 cs.CL cs.AI 版本更新

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

幻觉作为正交噪声:通过动态上下文正交化实现推理时流形对齐

Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

发表机构 * Xi’an Jiaotong University(西安交通大学) Xingchen AGI Lab(星辰AGI实验室) China Telecom AI Technology (Beijing) Co., Ltd.(中国电信人工智能技术(北京)有限公司) Institute of Artificial Intelligence, China Telecom(中国电信人工智能研究院) University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出一种基于线性表示假设的几何框架,将大语言模型幻觉解释为残差流语义流形的正交噪声,并引入推理时干预方法动态上下文正交化(DCO),通过层间Z分数抑制机制选择性地衰减异常正交分量,在保持知识记忆的同时提升上下文忠实度。

详情
AI中文摘要

大语言模型(LLMs)中的幻觉(即生成与上下文事实或逻辑约束不一致的内容)仍然是可靠部署面临的持续挑战。在这项工作中,我们通过基于线性表示假设的几何框架来解决这个问题。我们提出,幻觉表现为相对于残差流语义流形的正交噪声。具体来说,我们假设虽然注意力头理想地传播与上下文子空间一致的信息,但当特定头引入与该子空间正交的分量时,就会产生幻觉,破坏潜在表示的一致性。基于这一表述,我们引入了动态上下文正交化(DCO),一种推理时干预方法。DCO利用输入残差流作为动态上下文锚点,对注意力头输出进行正交分解。为了区分上下文对齐的语义更新和发散噪声,DCO采用层间Z分数抑制机制,根据统计分布选择性地衰减异常正交分量。在XSum、NQ-Swap和IFEval等基准上对Llama-3-8B和70B的评估表明,与最先进的干预基线相比,DCO实现了更优的上下文忠实度。此外,DCO在TriviaQA和TruthfulQA等知识密集型任务上保持高性能,有效缓解了现有方法中常见的幻觉抑制与参数知识保留之间的权衡。我们的发现验证了幻觉的几何解释,并将DCO确立为一种计算高效的流形对齐方法。代码可在 https://github.com/Harry-Miral/DCO 获取。

英文摘要

Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO.
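The two steps named above, orthogonal decomposition against a context anchor and Z-score based suppression of outlier orthogonal components, can be sketched on plain lists. This is a simplified illustration, not the released DCO code: the anchor is treated as a single vector rather than a subspace, and the threshold `z_thresh` and damping factor `damp` are hypothetical values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(head_out, anchor):
    """Split a head output into components parallel and orthogonal to the anchor vector."""
    coef = dot(head_out, anchor) / dot(anchor, anchor)
    parallel = [coef * a for a in anchor]
    orthogonal = [h - p for h, p in zip(head_out, parallel)]
    return parallel, orthogonal

def suppress_outliers(head_outs, anchor, z_thresh=1.0, damp=0.1):
    """Attenuate the orthogonal part of heads whose orthogonal norm is a Z-score outlier in the layer."""
    parts = [decompose(h, anchor) for h in head_outs]
    norms = [math.sqrt(dot(o, o)) for _, o in parts]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms)) or 1.0
    adjusted = []
    for (parallel, orthogonal), norm in zip(parts, norms):
        scale = damp if (norm - mean) / std > z_thresh else 1.0
        adjusted.append([p + scale * o for p, o in zip(parallel, orthogonal)])
    return adjusted
```

Heads whose orthogonal component is typical for the layer pass through unchanged, so the intervention leaves most of the computation alone and only damps statistical outliers.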

2606.03019 2026-06-03 cs.CY cs.AI 版本更新

Reproducibility is the New Copyleft: Defining AGI-oriented Reproducible Builds

可重现性是新的Copyleft:定义面向AGI的可重现构建

Masayuki Hatta

发表机构 * Surugadai University(骏河台大学)

AI总结 本文提出面向通用人工智能(AGI)的可重现构建作为Copyleft的功能等价物,通过定义七项要求来确保模型从声明输入到输出的比特精确可重现性,并论证协议而非平台是更优的治理框架。

Comments Accepted at AGI-26. To appear in the proceedings (Springer LNCS)

详情
AI中文摘要

Copyleft,如GNU通用公共许可证中所实施的,是一种利用版权保证用户自由的法律技巧,通过将源代码的可用性与每次分发行为绑定。其规范力量依赖于一个隐含的技术前提:源代码和目标代码之间存在定义明确、可人工审计且可重现的关系。大型语言模型以及未来的通用人工智能(AGI)系统系统地违反了这一前提。重建模型所需的工件——代码、数据、权重、超参数、工具链和硬件配置——各自受到独立的法律、技术和经济约束,当前没有任何开源框架能完全解决这些问题。足够强大的AI系统还可以将许可下的源代码重写为功能等效的衍生作品,从而剥离原始义务,这是一种Copyleft无法有效防御的洗白形式。本文认为,对于AGI,Copyleft的功能等价物必须基于可重现构建,而非代码的共享相同条款:可重现构建是一种保证从声明输入到输出比特精确可重构性的实践。我们回顾了Copyleft的逻辑,批判性地审视了Maffulli的“第二次解放”论点(即AI实现了Stallman的梦想),并表明除非AGI系统本身是可重现的,否则该论点不成立。借鉴开源AI定义(OSAID)、模型开放框架(MOF)、OpenMDW和确定性推理研究,我们定义了面向AGI的可重现构建的七项要求。我们进一步论证,模型上下文协议(MCP)和类似的AI到AI耦合机制构成了一个新的动态链接层,Copyleft式许可对此并不适用,而Masnick的“协议而非平台”框架提供了更有前景的治理模板。

英文摘要

Copyleft, as implemented in licenses such as the GNU General Public License, was a legal hack that used copyright to guarantee user freedom by tying the availability of source code to every act of distribution. Its normative force rested on an implicit technical premise: that source code and object code stand in a well-defined, humanly auditable, and reproducible relationship. Large language models and, prospectively, Artificial General Intelligence (AGI) systems systematically violate this premise. The artifacts jointly required to reconstruct a model -- code, data, weights, hyperparameters, toolchain, and hardware configuration -- are each subject to independent legal, technical, and economic constraints that no current open-source framework fully resolves. Sufficiently capable AI systems can also rewrite licensed source into functionally equivalent derivatives stripped of their original obligations, a form of laundering against which copyleft has no effective defense. This paper argues that a functional analogue of copyleft for AGI must be grounded not in share-alike clauses over code, but in reproducible builds: a practice guaranteeing bit-exact reconstructability from declared inputs. We review the logic of copyleft, critically examine Maffulli's Second Liberation thesis according to which AI fulfills Stallman's dream, and show that the argument collapses unless AGI systems are themselves reproducible. Drawing on the Open Source AI Definition (OSAID), the Model Openness Framework (MOF), OpenMDW, and deterministic-inference research, we define seven requirements for AGI-oriented reproducible builds. We further argue that the Model Context Protocol (MCP) and analogous AI-to-AI coupling mechanisms constitute a new dynamic linking layer for which copyleft-style licensing is ill-suited, and that Masnick's "protocols, not platforms" framework offers a more promising governance template.
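Bit-exact reconstructability from declared inputs is, at its core, a hash comparison. The sketch below shows one way such a check could be expressed; the manifest layout and function names are hypothetical, and it says nothing about the paper's seven requirements or about how to make training or inference deterministic in the first place.

```python
import hashlib
import json

def digest(data: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def manifest_digest(declared_inputs: dict) -> str:
    """Hash a canonical JSON encoding of {input name: content digest}; key order does not matter."""
    canonical = json.dumps(declared_inputs, sort_keys=True, separators=(",", ":"))
    return digest(canonical.encode("utf-8"))

def is_bit_exact(rebuilt_artifact: bytes, published_digest: str) -> bool:
    """A rebuild counts as reproducible only if its digest matches the published one exactly."""
    return digest(rebuilt_artifact) == published_digest
```

A third party who holds the declared inputs can rerun the build and call `is_bit_exact` on the result; any undeclared input such as an unpinned toolchain or a nondeterministic kernel shows up as a digest mismatch.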

2606.03017 2026-06-03 cs.LG cs.AI cs.RO 版本更新

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

ConTraIRL:用于可迁移逆强化学习的分解对比抽象

Yikang Gui, Bikramjit Banerjee, Prashant Doshi

发表机构 * School of Computing, University of Georgia(佐治亚大学计算学院) School of Computing Sciences & Computer Engineering, The University of Southern Mississippi(南密西西比大学计算科学与计算机工程学院)

AI总结 提出ConTraIRL框架,通过双编码器对比学习解耦环境动态与任务目标的潜在表示,实现组合奖励迁移,在连续控制基准上显著提升少样本迁移的样本效率和奖励恢复。

详情
AI中文摘要

当策略必须泛化到未见过的环境动态与任务目标组合时,逆强化学习中的奖励迁移不可靠。我们提出用于可迁移逆强化学习的分解对比抽象(ConTraIRL),该框架通过学习这两个因素的解耦潜在表示来实现组合奖励迁移。ConTraIRL采用双编码器架构,将观测映射到分离的动态和目标的潜在空间,并通过双重对比目标进行训练。时间对齐鼓励动态编码器学习目标不变的结构,而目标编码器捕获动态不变的特征。这种分解支持在重组动态-目标设置下的奖励推断。在连续控制基准上的实验表明,对未见过的动态-目标配对进行有效的少样本迁移,与迁移逆强化学习基线相比,提高了样本效率和奖励恢复。

英文摘要

Reward transfer in Inverse Reinforcement Learning (IRL) is unreliable when policies must generalize to unseen combinations of environment dynamics and task goals. We propose Factorized Contrastive Abstractions for Transferable IRL (ConTraIRL), a framework that enables compositional reward transfer by learning decoupled latent representations of these two factors. ConTraIRL uses a dual-encoder architecture that maps observations into separate dynamics and goal latent spaces, trained with a dual contrastive objective. Temporal alignment encourages the dynamics encoder to learn goal-invariant structure, while the goal encoder captures dynamics-invariant features. This factorization supports reward inference under recombined dynamics-goal settings. Experiments on continuous control benchmarks demonstrate effective few-shot transfer to unseen dynamics-goal pairings, improving sample efficiency and reward recovery over transfer IRL baselines.
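The abstract does not spell out the form of the contrastive objective. As one common possibility, the sketch below implements an InfoNCE-style loss over cosine similarities; treating this as the loss used by ConTraIRL is an assumption, and the temperature value is arbitrary.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax of the positive pair's similarity among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

For a dynamics encoder, a positive pair could be two observations sharing dynamics but differing in goal, which pushes the latent toward goal-invariant structure; the goal encoder would use the mirrored pairing.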

2606.03005 2026-06-03 cs.CV cs.AI 版本更新

MUSE: A Unified Agentic Harness for MLLMs

MUSE: 多模态大语言模型的统一智能体框架

Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

发表机构 * Northeastern University(东北大学)

AI总结 提出MUSE框架,通过可组合模块(任务表示、视觉处理、感知工具、结构化解析、确定性验证和验证器引导修复)提升冻结多模态大语言模型性能,无需重新训练。

详情
AI中文摘要

尽管进展迅速,多模态大语言模型(MLLMs)在人类轻松解决的任务上仍然失败,例如从屏幕截图导航网格迷宫或选择正确的拼图块。我们不重新训练模型,而是提出一个补充性问题:仅通过改进执行脚手架,能从冻结的MLLM中引出多少能力?我们引入MUSE,一个多模态统一结构化执行框架,它用可组合的模块(任务表示、视觉处理、感知工具使用、结构化解析、确定性验证和验证器引导修复)包装任何现成的MLLM,无需任何模型重新训练。我们使用多个最先进的MLLM,在涵盖视觉空间规划、视觉感知、多模态推理和细粒度视觉辨别的多样化基准上评估MUSE。MUSE在所有设置中都比裸模型带来一致的提升,在困难实例上提升最大。进一步分析揭示,许多MLLM失败源于框架层面的缺陷而非根本的模型缺陷,并且可以通过验证器引导修复来解决,无需触及模型。这些发现突显了智能体多模态框架作为一个关键但尚未充分探索的设计维度,提供了超越以模型为中心的优化的正交改进途径。

英文摘要

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.
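Deterministic verification followed by verifier-guided repair can be sketched for the grid-maze case mentioned above. Everything here is illustrative: the grid encoding, the move alphabet, and the `model` callable (a stand-in for a frozen MLLM that takes the previous verifier message and returns a new path) are assumptions, not the MUSE interface.

```python
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def verify(grid, start, goal, path):
    """Deterministic check: returns None if the path is valid, else a message describing the first error."""
    r, c = start
    for i, move in enumerate(path):
        if move not in MOVES:
            return f"step {i}: unknown move {move!r}"
        dr, dc = MOVES[move]
        r, c = r + dr, c + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not inside or grid[r][c] == "#":
            return f"step {i}: move {move} hits a wall"
    return None if (r, c) == goal else f"path ends at {(r, c)}, not the goal"

def solve_with_repair(model, grid, start, goal, max_rounds=3):
    """Ask the model for a path, verify it, and feed the verifier message back until it passes."""
    feedback = None
    for _ in range(max_rounds):
        path = model(feedback)
        feedback = verify(grid, start, goal, path)
        if feedback is None:
            return path
    return None
```

The model weights are never touched; the loop only changes what the model is told, which matches the claim that many failures can be fixed at the harness level.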

2606.03003 2026-06-03 cs.LG cs.AI cs.RO 版本更新

Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group

精确等变性在训练中保持,实现跨对称群的零样本泛化

Hongbo Wang

发表机构 * Department of Mathematics, Stony Brook University(石溪大学数学系)

AI总结 通过等变编码器和预测器构建的潜世界模型,其训练损失具有可证明的对称性,从而在仅拟合部分方向动力学时,数学上确定整个轨道上的行为,实现跨对称群的零样本泛化。

Comments 92 pages, 11 figures. Core paper plus an extended results-log appendix and a forward-looking theory supplement. All experiments are laptop-scale (CPU/MPS), fully seeded and deterministic

详情
AI中文摘要

由等变编码器 $E$ 和等变预测器 $f$ 构建的潜世界模型继承了其训练损失的可证明对称性:当世界的动力学真正承载一个群 $G$,通过正交表示 $\rho(g)$ 作用于潜变量时,单步预测 relMSE 在整个群上精确不变,因此仅在方向的受限切片上拟合动力学,数学上就确定了整个轨道上的动力学(举一反三)。我们在笔记本电脑规模(CPU/MPS,完全设定随机种子)上端到端验证了这一点。[A] 该对称性在真实的 Muon/AdamW + EMA + VICReg 运行中幸存——组合的编码-预测残差在优化后约为 $10^{-6}$,不仅在初始化时,而且在任何优化器下都成立。[B] 单步误差在整个群上平坦至五位小数,而相同假设类别的非等变基线拟合了切片但在分布外失效(2D 中 VN $\times 1.00$ 对比基线 $\times 13.8$,3D 中 $\times 17.2$,整个 $\mathrm{SE}(3)$ 阶梯上 $\times 157$),且等变模型小 $4.5$-$7.4$ 倍。[C] 相同的等距论证提升到闭环:在匹配的等变规划器下,方向 $g$ 处的控制轨迹恰好是所见轨迹应用 $\rho(g)$ 的结果,因此闭环误差在整个群上不变——在真实 PushT 上的 2D/$\mathrm{SO}(2)$ 中浮点地板精确,在 3D/$\mathrm{SE}(3)$ 中统计平坦(不相交的 95% 置信区间)。我们针对 Sutton 的苦涩教训对先验进行了压力测试:增强、暴力规模和软等变性各自最多缩小跨群任务指标,但从未达到浮点地板精确性。由于等变性在复合下封闭,$H$ 步展开在每个视界上保持平坦($\times 1.00$,$\le 2\times 10^{-7}$),而基线的残差随 $H$ 复合。超出范围:任务成功扫描、无规划器不变性和缩放。

英文摘要

A latent world model built from an equivariant encoder $E$ and an equivariant predictor $f$ inherits a provable symmetry of its training loss: when the world's dynamics genuinely carries a group $G$ acting on latents by an orthogonal representation $\rho(g)$, the one-step prediction relMSE is exactly invariant across the whole group, so fitting the dynamics on a restricted slice of orientations mathematically determines it on the entire orbit (jǔ yī fǎn sān). We verify this end-to-end at laptop scale (CPU/MPS, fully seeded). [A] The symmetry survives a real Muon/AdamW + EMA + VICReg run -- composed encode-then-predict residual $\sim 10^{-6}$ after optimisation, not just at initialisation, and under any optimiser. [B] One-step error is flat to five digits across the group, while a same-hypothesis-class non-equivariant baseline fits the slice but breaks out-of-distribution (VN $\times 1.00$ vs baseline $\times 13.8$ in 2D, $\times 17.2$ in 3D, $\times 157$ over the full $\mathrm{SE}(3)$ ladder), with the equivariant model $4.5$-$7.4\times$ smaller. [C] The same isometry argument lifts to closed loop: under a matching equivariant planner the control trajectory at orientation $g$ is exactly $\rho(g)$ applied to the seen one, so closed-loop error is invariant across the group -- float-floor-exact in 2D/$\mathrm{SO}(2)$ on real PushT and statistically flat in 3D/$\mathrm{SE}(3)$ (disjoint 95% CIs). We stress-test the prior against Sutton's Bitter Lesson: augmentation, brute-force scale, and soft-equivariance each close at most the across-group task metric, never the float-floor exactness. Because equivariance is closed under composition, the $H$-fold rollout stays flat ($\times 1.00$, $\le 2\times 10^{-7}$) at every horizon, while the baseline's residual compounds with $H$. Out of scope: task-success sweeps, planner-free invariance, and scaling.
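The invariance argument in the abstract can be checked numerically on a toy case. The sketch below is an illustration under assumptions, not the paper's model: the planar dynamics `true_step`, the radial-gain `equivariant_predictor`, and the linear `baseline_predictor` are hypothetical choices made only so that one predictor commutes with SO(2) rotations and the other does not.

```python
import numpy as np


def rot(theta):
    """2x2 rotation matrix, playing the role of rho(g) for g in SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def true_step(x):
    # Hypothetical dynamics that commute with SO(2): rotate by 0.3 rad, shrink by 0.9.
    return 0.9 * x @ rot(0.3).T


def equivariant_predictor(x):
    # Radial gain times a fixed rotation; both commute with every planar rotation.
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    return (0.8 + 0.1 * np.tanh(r)) * x @ rot(0.25).T


def baseline_predictor(x):
    # Generic linear map that does not commute with rotations.
    W = np.array([[0.9, -0.2], [0.4, 0.7]])
    return x @ W.T


def rel_mse(predictor, x):
    """Relative one-step error of `predictor` on a batch of states (rows of x)."""
    target = true_step(x)
    return np.sum((predictor(x) - target) ** 2) / np.sum(target ** 2)


def error_spread(predictor, x, thetas):
    """Max minus min relMSE when the whole batch is rotated by each theta."""
    errs = np.array([rel_mse(predictor, x @ rot(t).T) for t in thetas])
    return errs.max() - errs.min()


# States drawn from a narrow slice of orientations (0 to 0.2 rad).
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 0.2, size=256)
radii = rng.uniform(0.5, 1.5, size=256)
x_slice = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
thetas = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)

print("equivariant spread:", error_spread(equivariant_predictor, x_slice, thetas))
print("baseline spread:   ", error_spread(baseline_predictor, x_slice, thetas))
```

Because rotations are isometries, the error of the equivariant predictor on a rotated batch equals its error on the original slice up to floating-point rounding, while the baseline's error depends on the orientation.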

2606.02994 2026-06-03 cs.AI cs.CL 版本更新

Inducing Reasoning Primitives from Agent Traces

从智能体轨迹中归纳推理原语

Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出推理原语归纳方法,从ReAct智能体轨迹中挖掘并聚类常见推理步骤,构建伪工具库,在多个推理任务上显著提升性能。

Comments 22 pages including appendices

详情
AI中文摘要

ReAct风格的LLM智能体经常跨问题重新发现相同的推理例程,但这些例程被困在瞬时的草稿板中。我们引入了推理原语归纳,一种单次通过的方法,挖掘成功的ReAct轨迹,聚类循环出现的推理动作,并将最频繁的动作转换为一个紧凑的类型化伪工具库。每个伪工具由一个自然语言文档字符串指定,在调用时由LLM解释,标准的ReAct循环在测试时组合这些原语。核心结果是,归纳出的库优于生成其轨迹的原始智能体:在RuleArena NBA上提高44个百分点(30 -> 74),在MuSR团队分配上提高30个百分点(38 -> 68),在NatPlan会议规划上提高22个百分点(7 -> 29)。在涵盖叙事推理、规则应用和约束满足规划的五个可比较子任务中,单个固定配置在每个子任务上优于零样本思维链,匹配或超过专家编写的分解,并以更低的平均推理成本优于AWM。

英文摘要

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

2606.02991 2026-06-03 cs.CL cs.AI 版本更新

Pretraining Language Models on Historical Text

在历史文本上预训练语言模型

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

发表机构 * University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) AIML, Adelaide University(AIML,阿德莱德大学) Department of Engineering Science, University of Oxford(牛津大学工程科学系) Oxford Centre for Economic and Social History, University of Oxford(牛津大学经济与社会史中心) Department of Computer Science, University College London(伦敦大学学院计算机科学系)

AI总结 提出TypewriterLM,一个仅在1913年前英文文本上训练的7.24B历史语言模型,通过构建TypewriterCorpus语料库、引入词汇基础指令微调框架和History-Event基准套件,解决数据质量、时间泄漏、训练和评估等挑战。

详情
AI中文摘要

我们介绍了TypewriterLM,一个仅在1913年前英文文本上训练的7.24B历史语言模型。开发历史语言模型需要解决数据质量和可用性、防止时间泄漏、设计时间一致的后训练流程以及构建可靠评估等挑战。为了解决这些问题,我们构建了TypewriterCorpus,一个54B词元的历史语料库,收集自多样化的档案和语言标注来源,并进行了广泛的数据清洗和泄漏缓解措施。此外,我们引入了词汇基础指令微调,一种后训练框架,限制响应直接基于历史源文档。使用该框架,我们构建了两个历史指令微调数据集:History-LIMA和History-SelfInstruct。为了评估能力和时间一致性,我们引入了History-Event,一个用于评估能力、时间基础和泄漏的基准套件。我们发布了TypewriterLM及所有相关资源,以支持未来对历史语言模型的研究。

英文摘要

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

2606.02979 2026-06-03 cs.CV cs.AI cs.RO 版本更新

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

面向紧凑型自动驾驶感知的平衡学习与多传感器融合

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Engineering, Toyohashi University of Technology(丰桥技术科学大学计算机科学与工程系) Department of Computer Science and Electronics, Gadjah Mada University(加查马达大学计算机科学与电子系)

AI总结 提出一种紧凑的深度多任务学习模型,通过自适应损失加权和中间传感器融合技术,在单次前向传播中同时处理语义分割、深度估计、激光雷达分割和鸟瞰投影,实现高效自动驾驶感知。

Comments This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213

详情
AI中文摘要

我们提出了一种新颖的紧凑型深度多任务学习模型,能够在一次前向传播中处理多种自动驾驶感知任务。该模型同时执行多视角语义分割、深度估计、激光雷达分割和鸟瞰投影,无需其他模型支持。我们还提供了一种自适应损失加权算法,以解决因任务众多而出现的学习不平衡问题。通过数据预处理和中间传感器融合技术,该模型可以处理并组合来自RGB摄像头、动态视觉传感器(DVS)和安装在自车多个位置的激光雷达的多种输入模态。因此,可以更好地理解动态变化的环境。基于消融研究,使用我们提出的方法训练的模型变体取得了更好的性能。此外,还进行了比较研究,以阐明其与一些近期模型组合相比的性能和有效性。结果表明,即使参数少得多,我们的模型仍能保持更好的性能。因此,该模型可以更快地推理,并减少GPU内存使用。此外,结果在3个不同的CARLA仿真数据集和1个真实世界的nuScenes-lidarseg数据集上保持一致。为了支持未来的研究,我们在以下网址公开共享代码和其他文件:https://github.com/oskarnatan/compact-perception。

英文摘要

We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs multiple views of semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection simultaneously without being supported by other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study is also conducted to clarify its performance and effectiveness against the combination of some recent models. As a result, our model maintains better performance even with far fewer parameters. Hence, the model can run inference faster with less GPU memory utilization. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02974 2026-06-03 cs.AI cs.HC cs.LG 版本更新

WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

WISE-HAR:一种基于WiFi的人类活动识别的可泛化集成深度学习框架

Maheen Arshad, Qindeel E Zahra, Muhammad Khuram Shahzad

发表机构 * Department of Computing, School of Electrical Engineering and Computer Science(计算机系,电气工程与计算机科学学院) National University of Sciences and Technology (NUST)(国家科学与技术大学(NUST))

AI总结 本文提出WISE-HAR框架,通过集成五种CNN架构、数据增强和跨场景评估,在Wallhack1.8k数据集上实现94.87%的LOS测试准确率,并展现出强泛化能力。

Comments 8 pages, 5 figures

详情
AI中文摘要

利用WiFi信号进行人类活动识别(HAR)已成为智能家居、医疗监控、安全系统和环境辅助生活的一项变革性技术。与引发严重隐私问题且在弱光条件下失效的传统基于摄像头的系统,或需要用户配合的可穿戴传感器不同,基于WiFi的HAR是非侵入性的、保护隐私的、成本效益高的,并且能在任何光照条件下无缝工作。本文提出了一种综合方法,使用Wallhack1.8k WiFi频谱图数据集识别三种不同的人类活动:“无人”(空房间)、“行走”和“行走+挥手”。我们提出了三项关键改进以应对基于WiFi的HAR的主要挑战。首先,为了解决高性能方差问题,我们实现了集成学习,采用五种不同的CNN架构(Deep CNN、Wide CNN、MobileNetV2、ResNet50V2和EfficientNetB0)。其次,为了解决小数据集大小的限制,我们应用了激进的数据增强技术,包括时间扭曲、频率掩蔽和噪声添加。第三,为了评估真实世界的泛化能力,我们进行了跨场景评估(在视距上训练,在非视距上测试)和跨天线评估(在双锥天线上训练,在PIFA天线上测试)。我们的集成模型在使用双锥天线的LOS场景下达到了94.87%的测试准确率,比最佳单个模型高出0.66%。数据增强将随机森林的性能从60%提升到95%。跨场景评估显示准确率下降极小,仅为1.37%和2.07%,证明了强大的泛化能力。结果表明,所提出的方法鲁棒、可靠,适用于不同硬件配置的多样化环境中的实际部署。

英文摘要

Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: "No Presence" (empty room), "Walking", and "Walking + Arm-waving" using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.
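Two of the ingredients named in the abstract, frequency masking and ensembling over several classifiers, can be sketched in a few lines of NumPy. This is an illustrative sketch only: the abstract does not say how the five CNNs are combined, so probability averaging (soft voting) is an assumption here, and `frequency_mask` and `soft_vote` are hypothetical helper names.

```python
import numpy as np


def frequency_mask(spec, start, width):
    """Return a copy of a (freq, time) spectrogram with `width` frequency bins zeroed."""
    out = spec.copy()
    out[start:start + width, :] = 0.0
    return out


def soft_vote(prob_list):
    """Average per-model class probabilities, each of shape (n_samples, n_classes).

    Returns the predicted labels and the averaged probabilities.
    """
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs.argmax(axis=1), mean_probs


# Toy outputs of three models on two samples and three activity classes.
m1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
m2 = np.array([[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]])
m3 = np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
labels, probs = soft_vote([m1, m2, m3])
print(labels)

# Augment a toy 8x10 spectrogram by hiding frequency bins 2 and 3.
spec = np.ones((8, 10))
masked = frequency_mask(spec, start=2, width=2)
print(masked.sum())
```

Averaging probabilities tends to reduce the variance of any single model's prediction, which is the motivation the abstract gives for using an ensemble.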

2606.02967 2026-06-03 cs.ET cs.AI cs.AR cs.SY eess.SY 版本更新

Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence

轨道上的玻璃盒:面向可信自主立方星智能的宪法AI验证框架

Karthik Barma, Anil Sanneboyina, V C Premchand Yadav

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出玻璃盒框架,通过运行时宪法AI验证层拦截自主航天器决策,利用六项物理约束和七项线性时序逻辑安全不变式确保安全,并证明其验证开销与模型规模无关。

Comments 12 pages, 2 figures, 2 tables, 32 references. Paper 1 of the Project October series on autonomous orbital intelligence

详情
AI中文摘要

航天工业正在悄然构建一个尚未被充分认识的事物:在地球上空550公里处运行数千个自主AI工作负载的轨道数据中心,且无人类参与。微软、AWS以及越来越多的轨道计算企业正在将云规模处理从地面转移到轨道。然而,它们都尚未回答治理问题——当轨道数据中心规模的自主AI系统在太空中做出错误决策时,如何在决策变得不可逆转之前阻止它们?我们引入玻璃盒:一个运行时宪法AI验证层,在单个命令到达任何航天器子系统之前,拦截来自机载AI策略的每个候选动作,并根据六项基于物理的宪法约束和七项线性时序逻辑(LTL)安全不变式对其进行评估。每个批准的动作都附带一个加权可解释性分数E(a_t)(范围[0,1])和完整的宪法审计日志。我们在Project October中演示了玻璃盒:一个针对CubeSat级航天器的完全模拟的五层自主轨道智能架构。我们证明玻璃盒的验证开销为O(N_c),其中N_c是宪法规则的数量,与模型大小或航天器状态维度无关。我们提供了宪法约束语法的完整形式规范、通过Z3和NuSMV模型检查验证的七项LTL安全不变式,以及一个详细的工作示例,展示玻璃盒在电池状态退化的日食入口处拦截不安全推理请求。随着轨道计算向数据中心基础设施规模发展,运行时宪法验证不再是研究上的新奇事物——它是每个自主轨道平台最终将需要的任务关键型安全基础设施。

英文摘要

The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomous AI workloads with no human in the loop, 550 km above the Earth. Microsoft, AWS, and a growing list of orbital computing ventures are moving cloud-scale processing off the ground and into orbit. What none of them have answered yet is the governance question -- when autonomous AI systems at orbital data center scale make wrong decisions in space, what stops those decisions before they become irreversible? We introduce Glass Box: a runtime constitutional AI verification layer that intercepts every candidate action from an onboard AI policy and evaluates it against six physics-grounded constitutional constraints and seven Linear Temporal Logic (LTL) safety invariants before a single command reaches any spacecraft subsystem. Every approved action carries a weighted explainability score E(a_t) in [0,1] and a complete constitutional audit log. We demonstrate Glass Box within Project October: a fully simulated five-layer autonomous orbital intelligence architecture for CubeSat-class spacecraft. We prove that Glass Box verification overhead is O(N_c) in the number of constitutional rules, independent of model size or spacecraft state dimension. We present a complete formal specification of the constitutional constraint grammar, seven LTL safety invariants verified by Z3 and NuSMV model checking, and a detailed worked example of Glass Box intercepting an unsafe inference request at eclipse-entry under degraded battery state. As orbital computing scales toward data center infrastructure, runtime constitutional verification is no longer a research novelty -- it is mission-critical safety infrastructure that every autonomous orbital platform will eventually require.
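The claim that verification cost is linear in the number of rules, and independent of model size, follows from the shape of a runtime guard: each candidate action is checked once against each rule. The sketch below is a hedged illustration, not the Glass Box specification: the three rules, their thresholds, and the state and action field names are hypothetical, and the explainability score and LTL invariants from the abstract are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Action = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[State, Action], bool]  # True means the rule is satisfied


# Hypothetical physics-style rules; names and thresholds are illustrative only.
RULES: List[Rule] = [
    Rule("battery_floor", lambda s, a: s["battery_soc"] - a["energy_cost"] >= 0.25),
    Rule("no_heavy_compute_in_eclipse", lambda s, a: not (s["in_eclipse"] and a["power_w"] > 5.0)),
    Rule("thermal_limit", lambda s, a: s["temp_c"] + a["heat_c"] <= 60.0),
]


def verify(state: State, action: Action, rules: List[Rule] = RULES) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Evaluate every rule once (O(N_c)) and return (approved, audit log)."""
    audit = [(rule.name, bool(rule.check(state, action))) for rule in rules]
    return all(ok for _, ok in audit), audit


# Eclipse entry with a degraded battery: a power-hungry inference request is blocked.
state = {"battery_soc": 0.30, "in_eclipse": 1.0, "temp_c": 20.0}
unsafe = {"energy_cost": 0.10, "power_w": 8.0, "heat_c": 5.0}
safe = {"energy_cost": 0.01, "power_w": 1.0, "heat_c": 1.0}
print(verify(state, unsafe))
print(verify(state, safe))
```

The guard never inspects the policy that produced the action, so its cost depends only on the length of `rules`.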

2606.02965 2026-06-03 cs.AI 版本更新

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

基准测试无法衡量的:论自主智能体弃权能力的评估

Victor Ojewale, Suresh Venkatasubramanian

发表机构 * Brown University(布朗大学)

AI总结 本文指出自主智能体基准测试忽视弃权能力,提出合规偏差概念,并引入弃权场景分类和评估协议,实验表明安全-可用性权衡是可调的。

Comments ACM CAIS 2026: RLEval Workshop Oral Presentation (Best Paper Award)

详情
AI中文摘要

自主智能体的基准测试衡量智能体是否完成任务,然而这种框架系统地忽略了智能体是否应该继续执行任务。在人类反馈目标下训练的智能体形成了一种结构性倾向,即使缺乏安全行动所需的输入、证据或授权也会继续执行,我们将这种倾向称为合规偏差,因为奖励信号和基准测试评分机制都将继续执行视为正确的默认行为,无论安全行动的前提条件是否满足。我们做出三项贡献。首先,我们表明合规偏差源于人类反馈流程中的奖励黑客行为,并因主流智能体基准测试而根深蒂固,这些基准测试要么惩罚智能体的暂停,要么在架构上无法区分有原则的暂停和静默失败。然后,我们引入弃权合理场景的三缺口分类法,涵盖所需信息缺失的规范缺口、无法确认世界状态的验证缺口以及未获得明确授权的权威缺口,这些共同为构建弃权感知的智能体基准测试提供了原则性基础。最后,我们提出弃权评估协议(安全率、可用率和知情拒绝率),并报告了144个企业智能体场景和五个模型系列的初步结果,其中运行时强制弃权机制在授权场景下实现了高达89.2%的危险动作阻断和87.5%的可用性,表明安全-可用性权衡是可调的而非固有的,并且其形状在不同模型系列间差异显著。我们将此视为初步工作,并提供分类法和复合指标作为进一步讨论的起点。

英文摘要

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety--usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.

2606.02962 2026-06-03 cs.CV cs.AI cs.HC eess.IV 版本更新

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

面向自我中心自然语言查询定位的手部轨迹融合

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

发表机构 * Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center , ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain(图像处理小组(GTI)、信息处理与电信中心、电信工程学院、马德里理工大学、西班牙)

AI总结 针对自我中心视频中的自然语言查询定位任务,提出手部轨迹编码器与自适应门控交叉注意力融合方法,利用手部运动信息提升查询定位性能。

Comments Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

详情
AI中文摘要

自我中心自然语言查询(NLQ)定位要求模型在长第一人称视频中定位回答自由形式文本查询的时间区间。现有方法融合视频外观与查询,但忽略了手部运动,尽管大约41%的Ego4D NLQ查询是在手-物交互或其后立即发生的时刻回答的。我们提出了一种手部轨迹编码器,用于将手部骨骼序列转换为高语义的手部运动学特征,然后通过具有自适应门控的交叉注意力融合策略,将这些特征与预训练的视频-文本特征对齐并组合。在Ego4D NLQ v2验证集上,手-物交互查询(R1@IoU=0.3提升2.54)和数量/状态查询(R1@IoU=0.3提升4.32)的增益最为明显,表明手部轨迹提供了超越外观的定位线索。

英文摘要

Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.

2606.02958 2026-06-03 cs.CR cs.AI 版本更新

Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries

Echelon: 跨隐私边界的可审计聚合专用语言模型适配

Hina Dixit, Punit Kumar, Irene Tenison, Nevasini Sasikumar

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出Echelon架构,通过强制设备级模型状态不可导出为系统不变量,仅允许聚合后的跨边界数据传输,并结合缓冲半异步安全聚合、陈旧感知加权等机制,在1B参数LoRA适配中实现低通信开销下的稳定训练。

详情
AI中文摘要

跨组织语言模型适配日益面临严格的治理约束:在许多部署中,设备级模型状态(参数、激活值、优化器状态及每设备更新)无法导出到管理边界之外。现有的分布式和联邦学习栈通常假设跨站点模型交换,然后改造隐私机制,这使合规性复杂化并导致审计脆弱。我们提出Echelon,一种边界优先的训练架构,将设备级模型状态不可导出作为系统不变量强制执行。设备在每个边界内本地训练;唯一的跨边界负载是安全聚合的边界级增量加上O(1)协调元数据,并通过具体的审计接口暴露。将交换限制为聚合改变了优化问题:系统必须在广域网延迟、异构参与、节点波动和非独立同分布数据下保持稳定,尽管全局层面从未看到每设备更新。Echelon结合了缓冲半异步安全聚合、陈旧感知加权、参与窗口、近端局部目标以及漂移感知外同步控制器。在M=2个边界上的1B参数LoRA适配中,预算匹配的竞赛(三个种子,24.88M tokens)达到验证损失3.887 +/-0.010,并在固定token、固定字节、固定挂钟时间和固定同步次数预算下,在调优的低通信基线中表现最佳或并列最佳。在OpenWebText压力测试中,Echelon在评估的广域网和非独立同分布处理下维持2,139-2,176 tokens/s的吞吐量;Echelon-DA在广域网延迟下相对于隐私对等的DiLoCo+SA基线改善了达到目标的时间,并且在200ms模拟延迟或严重非独立同分布分区下质量最多下降2.2%。

英文摘要

Cross-organization language-model adaptation increasingly faces hard governance constraints: in many deployments, device-level model state (parameters, activations, optimizer state, and per-device updates) cannot be exported outside an administrative boundary. Existing distributed and federated stacks typically assume cross-site model exchange and then retrofit privacy mechanisms, which complicates compliance and makes auditing brittle. We present Echelon, a boundary-first training architecture that enforces device-level model-state non-export as a systems invariant. Devices train locally inside each boundary; the only cross-boundary payloads are securely aggregated boundary-level deltas plus O(1) coordination metadata, exposed through a concrete audit surface. Restricting exchange to aggregates changes the optimization problem: the system must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data even though the global plane never sees per-device updates. Echelon combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer synchronization controller. In 1B-parameter LoRA adaptation across M = 2 boundaries, a budget-matched contest over three seeds (24.88M tokens) reaches validation loss 3.887 +/-0.010 and is best or tied-best among tuned low-communication baselines under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets. In OpenWebText stress tests, Echelon sustains 2,139-2,176 tokens/s across evaluated WAN and non-IID treatments, Echelon-DA improves time-to-target under WAN latency relative to a privacy-parity DiLoCo+SA baseline, and quality degrades by at most 2.2% under 200ms emulated latency or severe non-IID partitioning.
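Staleness-aware weighting, one of the components listed in the abstract, can be illustrated as follows. The abstract does not give Echelon's weighting rule, so the polynomial decay `(1 + staleness) ** (-alpha)` used here is an assumption borrowed from common asynchronous federated-learning practice, and `staleness_weight` and `merge_boundary_deltas` are hypothetical names. Secure aggregation inside each boundary is not modelled; the inputs are taken to be already-aggregated boundary-level deltas.

```python
import numpy as np


def staleness_weight(staleness, alpha=0.5):
    """Down-weight a delta computed `staleness` sync rounds ago (assumed decay rule)."""
    return (1.0 + staleness) ** (-alpha)


def merge_boundary_deltas(deltas, staleness, alpha=0.5):
    """Combine boundary-level deltas with normalized staleness-aware weights."""
    w = np.array([staleness_weight(s, alpha) for s in staleness])
    w = w / w.sum()
    # Weighted sum over the boundary axis; result has the shape of one delta.
    return np.tensordot(w, np.stack(deltas), axes=1)


# Two boundaries (M = 2): a fresh delta and one that is three rounds stale.
fresh = np.array([1.0, 1.0])
stale = np.array([3.0, 3.0])
merged = merge_boundary_deltas([fresh, stale], staleness=[0, 3], alpha=1.0)
print(merged)
```

With `alpha=1.0` the raw weights are 1 and 0.25, so after normalization the fresh delta contributes 80% of the merged update.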

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC 版本更新

SCOPE: Real-Time Natural Language Camera Agent at the Edge

SCOPE:边缘实时自然语言相机代理

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

发表机构 * Armada AI

AI总结 提出SCOPE模块化代理,用于自然语言控制的PTZ相机,在边缘部署实现实时感知、规划与控制,并通过仿真和物理实验评估延迟、准确性和错误模式。

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

详情
Journal ref
Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026
AI中文摘要

在机器人领域部署语言驱动的代理需要能够反映现实任务需求的评估:自然语言指令与可重复的结果。此类代理必须将语言模型连接到可调用的感知和控制工具,并使用部署关键指标(包括延迟、准确性和错误模式)进行评估。我们提出了SCOPE(用于感知和评估的仿真与相机操作),这是一个模块化代理,用于自然语言、开放词汇的云台变焦(PTZ)相机控制和视觉场景理解,专门为边缘部署设计。SCOPE既可在基于Blender的仿真环境中运行,也可在物理PTZ相机上运行,所有感知、规划和控制均在部署现场使用边缘可访问的计算资源本地执行。我们发布了一个包含536个任务的基准测试,涵盖问答、单步和多步命令、计数、空间推理、描述以及光学字符识别,在基于Blender的仿真环境中提供逼真的PTZ控制功能。执行轨迹与LM作为评判器结合,以评估延迟、准确性和错误模式。我们评估了19种规划器-感知模型组合,将Qwen3小语言模型(SLM)与Moondream和Qwen视觉语言模型(VLM)配对。更强的SLM显著减少了幻觉并改善了工具路由,从而实现了更可靠的闭环行为。一旦使用了足够强大的SLM,感知就成为主要的性能瓶颈。在规划和感知方面,混合专家模型在延迟和内存占用与更小网络相当的情况下,始终匹配或超过密集替代方案。量化在精度损失最小的情况下提供了额外的效率提升,为实时、边缘可行的语言驱动PTZ控制确定了一个实用的、从仿真到现实验证的设计点。

英文摘要

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02908 2026-06-03 cs.CL cs.AI 版本更新

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

WRIT: 面向多轮用户代理的写-读密集型轨迹合成

Hengrui Gu, Xiaotian Han, Kaixiong Zhou

发表机构 * North Carolina State University(北卡罗来纳州立大学) Case Western Reserve University(凯斯西储大学)

AI总结 针对多轮用户代理在信息收集和决策中面临的证据负担挑战,提出WRIT方法,通过合成写密集型和读密集型轨迹,训练代理在信息负载下做出基于证据的决策,仅用2K轨迹即可提升性能并减少推理时token使用。

详情
AI中文摘要

多轮用户代理必须从不完整的请求中推断用户意图,通过对话和工具收集缺失信息,并执行有效操作。训练轨迹将此过程记录为用户消息、代理响应、工具调用等的交错序列。合成足够复杂的轨迹已成为训练代理的核心途径:现有流程通常通过将多个用户请求组合成更长的任务来增加难度,产生训练顺序执行的写密集型轨迹。我们认为,当代理必须在收集和比较大量读工具证据后才能确定其参数时,单个写决策本身可能很困难,这是仅靠写密集型数据无法解决的挑战。基于这一见解,我们提出WRIT(写-读密集型轨迹合成),这是一个沿两个复杂度轴合成多轮代理训练轨迹的流程:任务中写决策的数量和每个决策的证据负担。WRIT首先生成写密集型和读密集型任务。然后,它多样化用户行为指令以反映真实的对话变化,最后在可执行环境中模拟代理-用户交互以生成完整的训练轨迹。由此产生的数据不仅训练代理执行更长的任务,而且在高信息负载下做出稳健的、基于证据的决策。仅用2K合成轨迹,在WRIT上训练的4B模型在$\tau^2$-bench上优于GPT-5.1 no-think,并大幅减少推理时token使用,表明紧凑的SFT数据可以将部分昂贵的测试时推理转化为高效的代理行为。

英文摘要

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

2606.02884 2026-06-03 cs.LG cs.AI 版本更新

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

我们真的在倾斜吗?流模型和扩散模型中奖励引导的机制

Sanjit Dandapanthula, Nicholas M. Boffi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过高斯混合模型和二次奖励的闭式分析,揭示了奖励引导扩散中奖励黑客现象源于Doob h函数的有限粒子插件估计,并提出了无额外计算的闭式奖励阻尼调度来纠正模式内偏差。

详情
AI中文摘要

奖励引导算法在推理时将学习到的生成过程导向奖励倾斜的测度。虽然经验上强大,但这些方法容易产生奖励黑客行为:引导模型以牺牲对学习分布的保真度为代价过度优化奖励。先前的工作将其归因于神经奖励函数的复杂性或扩散训练中的隐式偏差,但其根本起源仍知之甚少。我们表明,奖励黑客行为源于大多数实际奖励引导扩散实现中的一个近似——Doob h函数的有限粒子插件估计——即使在最简单的高斯和高斯混合目标以及二次奖励的非平凡设置中也是如此。在闭式中,我们分离了插件估计器的两种不同失效模式:它导致每个模式内的奖励黑客行为,并且无法选择高奖励模式。我们提出了一种闭式奖励阻尼调度,无需额外计算即可纠正模式内偏差,并阐明了最佳-n采样在补偿模式选择失败中的作用。在高斯混合目标、2D棋盘和FLUX.1文本到图像生成上的实验证实了我们的理论见解适用于实际设置。

英文摘要

Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.
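For the simplest setting mentioned in the abstract, a Gaussian target with a quadratic reward, the reward-tilted measure itself is available in closed form, which is what makes an exact analysis possible. The sketch below works through that standard computation and checks it on a grid. It covers only the target measure, not the paper's plug-in estimator or damping schedule; the reward parameterization `r(x) = -(x - a)^2 / 2` with tilt strength `lam` is an assumption chosen for illustration.

```python
import numpy as np


def tilted_gaussian(mu, sigma2, a, lam):
    """Mean and variance of p(x) * exp(lam * r(x)) with p = N(mu, sigma2), r(x) = -(x - a)^2 / 2.

    Completing the square: precisions add, and the mean is the precision-weighted average.
    """
    precision = 1.0 / sigma2 + lam
    mean = (mu / sigma2 + lam * a) / precision
    return mean, 1.0 / precision


def tilted_moments_on_grid(mu, sigma2, a, lam, lo=-10.0, hi=10.0, n=20001):
    """Numerical check: normalize the unnormalized tilted density on a uniform grid."""
    x = np.linspace(lo, hi, n)
    dens = np.exp(-0.5 * (x - mu) ** 2 / sigma2) * np.exp(-0.5 * lam * (x - a) ** 2)
    w = dens / dens.sum()
    mean = np.sum(w * x)
    var = np.sum(w * (x - mean) ** 2)
    return mean, var


closed = tilted_gaussian(mu=0.0, sigma2=1.0, a=2.0, lam=1.0)
numeric = tilted_moments_on_grid(mu=0.0, sigma2=1.0, a=2.0, lam=1.0)
print("closed form:", closed)
print("grid check: ", numeric)
```

With a unit-variance prior centred at 0 and a reward peaked at 2 with `lam = 1`, the tilted measure sits halfway between the two, with half the prior variance. An approximate guidance scheme that lands closer to the reward peak than this closed-form mean is over-optimizing the reward relative to the true tilted target.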

2606.02883 2026-06-03 cs.HC cs.AI cs.CY cs.IR 版本更新

LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems

LLM辅助重排序以在推荐系统中实现细微目标

Amir Ghasemian, Homa Hosseinmardi, Upasana Dutta, Duncan J. Watts

发表机构 * Department of Communication, University of California, Los Angeles, CA 90095(加州大学洛杉矶分校传播系,CA 90095) Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104(宾夕法尼亚大学计算机与信息科学系,Philadelphia, PA 19104) Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA 19104(宾夕法尼亚大学安纳伯格传播学院,Philadelphia, PA 19104) Operations, Information, and Decisions Department, University of Pennsylvania, Philadelphia, PA 19104(宾夕法尼亚大学运营、信息与决策系,Philadelphia, PA 19104)

AI总结 本研究通过零样本指令提示对YouTube侧边栏候选进行重排序,发现无约束的LLM辅助重排序会放大极端和阴谋论内容,而轻量级提示正则化可在轻微损失相关性的情况下减少极端内容并增加意识形态多样性。

Comments 30 pages total; 11 pages, 5 figures, 2 tables (main text); 19 pages, 11 figures, 9 tables (appendix)

Details
English abstract

Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g. how personalization reshapes exposure in socially consequential domains. We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and treat prompt design as a value-laden rather than neutral default.

2606.02875 2026-06-03 cs.AI Version update

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

Dipesh KC, Anjila Budathoki

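Affiliations: Independent Researcher; Georgia State University

AI summary: This paper introduces the concept of "handoff debt" to study the rediscovery cost that coding agents incur when resuming from a partial state after a task is interrupted, and proposes a takeover protocol that quantifies how different handoff views affect successor-agent efficiency.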

Details
English abstract

Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through handoff debt: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20-59% and cumulative prompt tokens by 42-63% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.
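
As a rough illustration of how the efficiency comparison above could be computed, the sketch below takes per-run event counts for a repository-only view and a context-bearing view and reports the relative reduction in the median. The function name and the numbers are hypothetical and are not taken from the paper.

```python
from statistics import median

def median_reduction(baseline_runs, handoff_runs):
    """Relative reduction in the median of handoff_runs versus baseline_runs.

    Returns a fraction: 0.25 means the median dropped by 25%.
    """
    base = median(baseline_runs)
    if base == 0:
        raise ValueError("baseline median is zero; reduction is undefined")
    return (base - median(handoff_runs)) / base

# Hypothetical agent-event counts per takeover run (illustrative only).
repo_only = [40, 50, 60]
structured_notes = [30, 35, 45]
reduction = median_reduction(repo_only, structured_notes)
print(f"median agent events reduced by {reduction:.0%}")
```

The same helper would apply unchanged to cumulative prompt tokens, since only the per-run measurements differ.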

2606.02871 2026-06-03 cs.CL cs.AI Version update

Adaptive Latent Agentic Reasoning

Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

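Affiliations: University of California, Davis; University of Waterloo; Greenshoe, Inc.

AI summary: Proposes ALAR, a dual-mode framework that uses compact latent reasoning for routine decision steps and switches to explicit chain-of-thought only when deeper deliberation is needed, substantially reducing generated tokens while maintaining or improving task accuracy.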

Details
English abstract

Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

2606.02867 2026-06-03 cs.MA cs.AI q-bio.PE Version update

The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models

Petra Ferenz, Ava Keeling, Tobias O'Keefe, Lorenzo Stigliano, Francesco Di Lauro, Andres Colubri, Jasmina Panovska-Griffiths

Affiliations: Big Data Institute, Li Ka Shing Center for Health Information and Discovery, University of Oxford, Oxford, United Kingdom; Leverhulme Centre for Demographic Science, Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom; Pandemic Sciences Institute, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom; Department of Genomics and Computational Biology, UMass Chan Medical School, United States; Broad Institute of Harvard and MIT, United States; The Queen’s College, University of Oxford, Oxford, United Kingdom

AI summary: Proposes the Epi-LLM framework, which integrates agent-based modelling, real-life epigames, and large language models to simulate agent behaviour during an outbreak. It finds that LLM agents reduce peak infections, that perceived health severity is the strongest predictor of quarantine behaviour, and that LLM architecture influences epidemic dynamics.

Comments Submitted to American Journal of Epidemiology

Details
English abstract

Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi-LLM framework: a novel integration of agent-based modelling, real-life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. Comparing synthetic agent behaviour against a no-intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58-65% on day six of the 15-day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour ($\beta = 0.33$, $p = 0.002$), yielding a pseudo-$R^2$ of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof-of-principle work lays the groundwork for deploying the Epi-LLM framework as a scalable, risk-free simulation environment for pandemic preparedness research.

2606.02866 2026-06-03 cs.AI cs.CL cs.MA Version update

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar

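Affiliations: Meta Platforms, Inc.

AI summary: Studies the effect of multi-agent debate on data cleaning and finds that it degrades generation performance but improves error detection. By deriving a debate benefit condition and using an adversarially separated debate configuration, it obtains the first configuration to significantly exceed a single agent on a generative task.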

Comments 27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics

Details
English abstract

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
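
The debate benefit condition can be read as a simple inequality between two transition rates. The sketch below is one hedged reading: it estimates both rates from paired before/after correctness flags rather than from Critic verification odds and fixability, which the abstract mentions but does not spell out. All names and the sample data are hypothetical.

```python
def transition_rates(outcomes):
    """Estimate rescue and destroy rates from per-task correctness pairs.

    outcomes: list of (correct_before_debate, correct_after_debate) booleans.
    Returns (p_rescue, p_destroy) as fractions of all tasks.
    """
    n = len(outcomes)
    rescued = sum(1 for before, after in outcomes if not before and after)
    destroyed = sum(1 for before, after in outcomes if before and not after)
    return rescued / n, destroyed / n

def debate_helps(p_rescue, p_destroy):
    """Condition as stated in the abstract: rescuing must exceed destroying."""
    return p_rescue > p_destroy

# Hypothetical outcomes for four tasks (illustrative only).
outcomes = [(False, True), (True, True), (True, False), (False, True)]
p_rescue, p_destroy = transition_rates(outcomes)
print(p_rescue, p_destroy, debate_helps(p_rescue, p_destroy))
```

Because both rates are measured over the same task set, their difference is also the net change in accuracy under this reading.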

2606.02863 2026-06-03 cs.AI Version update

Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

Marquita Ellis, Paul Castro

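Affiliations: IBM Research

AI summary: Proposes the GAMBLe framework, which decomposes the behavior of AI-driven research systems into four parameters (generator G, assessor A, discovery mechanism M, budget B) and one effective landscape L_eff = A ∘ G. Experiments show that component choices can substantially improve performance and search efficiency.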

Comments Preprint. 21 pages (10 main, 11 appendix). 6 figures (2 in main, 4 in appendix)

Details
English abstract

AI-Driven Research Systems (ADRS) -- systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs -- are being optimized and adopted across domains, but the tools to analyze them have not kept pace. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and (as we show) not well captured by standard convergence guarantees. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize. We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters (generator $G$, assessor $\mathcal{A}$, discovery mechanism $\mathcal{M}$, budget $B$) and one compositional object, the effective landscape $L_{\text{eff}} = \mathcal{A} \circ G$, which reveals that distinct generator-assessor pairs induce structurally different per-problem optimization landscapes. We exercise the framework on 760+ replicated runs (>46,000 iterations) spanning generators from single LLMs to dynamically-adaptive ensembles, mechanisms from greedy selection to co-evolutionary meta-search, and three NP-hard problems whose assessors range from continuous scoring to cliff functions. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open-source alternatives and the simplest mechanism sometimes outperforms state-of-the-art meta-search. Results show that even under limited budgets (60 iterations per run), the right component choices can improve performance by 13-67% and search efficiency by 6-39x.
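
To make the composition $L_{\text{eff}} = \mathcal{A} \circ G$ concrete, here is a minimal sketch in which a generator proposes a candidate from the current best, an assessor scores it, and a greedy mechanism keeps the better one for a fixed budget. The toy generator, the cliff-style assessor, and every name are assumptions for illustration; they are not the paper's components.

```python
import random

def generator(parent: int, rng: random.Random) -> int:
    """G: propose a candidate near the parent (toy stand-in for an LLM)."""
    return parent + rng.choice([-2, -1, 1, 2])

def assessor(candidate: int) -> float:
    """A: cliff-style score, the candidate's value inside [0, 10], zero outside."""
    return float(candidate) if 0 <= candidate <= 10 else 0.0

def effective_landscape(parent: int, rng: random.Random) -> tuple[int, float]:
    """L_eff = A o G: what the search actually observes for a given parent."""
    candidate = generator(parent, rng)
    return candidate, assessor(candidate)

def greedy_search(start: int, budget: int, seed: int = 0) -> tuple[int, float]:
    """M with budget B: keep a candidate only if it scores strictly higher."""
    rng = random.Random(seed)
    best, best_score = start, assessor(start)
    for _ in range(budget):
        candidate, score = effective_landscape(best, rng)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

best, score = greedy_search(start=3, budget=60)
print(best, score)
```

Swapping `generator` or `assessor` changes the landscape the mechanism searches even though `greedy_search` itself stays the same, which is the interaction the framework is meant to expose.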

2606.02862 2026-06-03 cs.AI cs.MA Version update

Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

Marcus Rüb, Michael Gerhards

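Affiliations: ETH Zurich

AI summary: Proposes a modular reference architecture that decouples on-device and cloud agents through a tiered design and introduces a governance layer, addressing the strict resource constraints of deploying autonomous AI on embedded microcontrollers.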

Details
English abstract

The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

2606.02860 2026-06-03 cs.LG cs.AI Version update

Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys

Archie Chaudhury

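Affiliations: Axionic Labs

AI summary: Using a stitched evaluation protocol and compact task-specific transport keys, finds that catastrophic forgetting is driven largely by interface drift between internal stages rather than permanent erasure of task-relevant computation, and that most earlier-task performance can be recovered after sequential training.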

Comments Technical report showcasing results from transport keys

Details
English abstract

Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that supported performance on earlier tasks. We challenge the stronger form of this view. Across controlled continual-learning settings, we find that a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task-relevant computation. We study this phenomenon through a stitched evaluation protocol that combines early computation from a post-update network with late computation from its predecessor, optionally mediated by a compact, task-specific transport key. We describe transport keys at a systems level as compact interface-alignment operators estimated from a small set of paired anchor activations and evaluated through model stitching. On split CIFAR-100 with a ResNet-style network, transport keys recover most of the original Task A performance after sequential training on Task B. On a compact vision transformer, we observe a similar recovery pattern. These results suggest that continual learning may require better mechanisms for indexing and re-accessing latent computations, not only methods that prevent weight change.

2606.02859 2026-06-03 cs.CL cs.AI cs.MA Version update

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

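Affiliations: Harvard University; Massachusetts Institute of Technology

AI summary: Inspired by Hayek's economic theory, achieves decentralized credit assignment through simple economic signals from auctions and wealth accumulation, so that a population of weak agents develops emergent multi-step reasoning strategies and outperforms stronger monolithic baselines on five agentic tasks.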

Details
English abstract

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

2606.02857 2026-06-03 cs.LG cs.AI Version update

GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

Liyan Tan, Yequan Zhao, Yifan Yang, Ruijie Zhang, Xinling Yu, Zheng Zhang

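Affiliations: University of California, Santa Barbara

AI summary: Proposes the GRZO optimizer, which aggregates per-example losses through group-relative normalization, raising the effective number of gradient directions from one to the batch size at no additional forward cost, reducing variance and improving convergence, and outperforming MeZO across multiple models and tasks.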

Comments Preprint. Under review

Details
English abstract

Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployment is limited by the high variance of gradient estimation. We propose GRZO, a Group-Relative Zeroth-Order optimizer that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost while preserving inference-level memory. We prove that GRZO is directionally unbiased with variance shrinking proportionally to the batch size, yielding a tighter nonconvex convergence bound than MeZO. Across RoBERTa-large, Llama3-8B, and OPT-13B over multiple tasks, GRZO improves average accuracy on Llama3-8B by $+3.0$ over MeZO at $23\%$ lower peak GPU memory; as a drop-in replacement for the MeZO core, it lifts sparse, low-rank, and quantized ZO variants by $+6.0$ on average.
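
The two ingredients named above, one perturbation per mini-batch example and group-relative normalization of the per-example losses, can be illustrated with a toy update. This is only a sketch under assumptions: the use of symmetric loss differences, the mean-and-standard-deviation normalization, and every name here are choices made for illustration, and the paper's actual estimator may differ.

```python
import math
import random

def group_relative(values):
    """Center and scale values by the group's mean and standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def zo_step(theta, batch, loss_fn, eps=1e-3, lr=1e-2, rng=None):
    """One toy zeroth-order update with a separate perturbation per example."""
    rng = rng or random.Random(0)
    dim = len(theta)
    directions, diffs = [], []
    for example in batch:
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [t + eps * zi for t, zi in zip(theta, z)]
        minus = [t - eps * zi for t, zi in zip(theta, z)]
        directions.append(z)
        # Forward passes only: a positive difference means loss rises along z.
        diffs.append(loss_fn(plus, example) - loss_fn(minus, example))
    weights = group_relative(diffs)
    estimate = [sum(w * z[j] for w, z in zip(weights, directions)) / len(batch)
                for j in range(dim)]
    # Step against the weighted combination of directions.
    return [t - lr * g for t, g in zip(theta, estimate)]

def squared_error(theta, example):
    """Toy per-example loss for a one-parameter linear model."""
    x, y = example
    return (theta[0] * x - y) ** 2

theta = [0.0]
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
rng = random.Random(0)
for _ in range(5):
    theta = zo_step(theta, batch, squared_error, rng=rng)
print(theta)
```

Because the weights are normalized, the step length here is governed by `lr` rather than by the raw loss scale; no gradients or activations are stored, which is the memory argument made for zeroth-order methods.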

2606.02837 2026-06-03 cs.CL cs.AI Version update

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

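Affiliations: University of Udine

AI summary: Through human inspection, finds a large number of formalization errors in the FOLIO and MALLS datasets, and proposes an LLM-based framework to guide human review that substantially reduces the amount of review required while improving dataset accuracy.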

Details
English abstract

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
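
The guided-review idea can be sketched as a ranking problem: review instances in descending order of an error-likelihood score and stop once the dataset reaches the target accuracy. The scores and error flags below are made up for illustration, and the sketch assumes that every reviewed error gets fixed; the paper's framework may score and select instances differently.

```python
def fraction_reviewed_to_reach(has_error, scores, target_accuracy):
    """Fraction of instances reviewed, highest score first, to hit the target.

    has_error: list of bools, True where the current label is wrong.
    scores: error-likelihood per instance (hypothetical, e.g. from an LLM).
    Assumes a reviewed instance always ends up with a correct label.
    """
    n = len(has_error)
    remaining_errors = sum(has_error)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    reviewed = 0
    for i in order:
        if (n - remaining_errors) / n >= target_accuracy:
            break
        reviewed += 1
        if has_error[i]:
            remaining_errors -= 1
    return reviewed / n

# Hypothetical dataset of ten instances, three of them mislabeled.
has_error = [True, False, True, False, False, True, False, False, False, False]
scores = [0.9, 0.2, 0.8, 0.1, 0.3, 0.4, 0.2, 0.1, 0.05, 0.15]
guided = fraction_reviewed_to_reach(has_error, scores, target_accuracy=0.9)
print(guided)
```

Passing constant scores instead reproduces an unguided review in dataset order, which gives a baseline to compare against.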

2606.02835 2026-06-03 cs.AI Version update

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

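Affiliations: University of Trento; Toyota Motor Europe; Fondazione Bruno Kessler

AI summary: Proposes a prefix-level trajectory evaluation protocol that, by defining reasoning sufficiency, separates verbose overthinking (redundant but harmless) from harmful overthinking (which derails an already-correct trajectory), and finds that current models are limited not only by their reasoning ability but also by their inability to stop at the right time.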

Details
English abstract

Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking-past-the-answer.
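
A minimal sketch of the prefix-level distinction described above, assuming that an answer has already been extracted at each reasoning prefix. The function name, the labels, and the input format are hypothetical, and a trace that flips from correct to wrong and back is simply counted as verbose here.

```python
def classify_trajectory(prefix_answers, gold):
    """Classify one reasoning trace from the answer extracted at each prefix.

    prefix_answers[k] is the answer the model gives after k + 1 reasoning
    steps; the last entry is the final answer of the full trace.
    Returns (index of the first correct prefix or None, label).
    """
    first_correct = next(
        (k for k, ans in enumerate(prefix_answers) if ans == gold), None)
    final_correct = prefix_answers[-1] == gold
    if first_correct is None:
        label = "never correct"
    elif not final_correct:
        label = "harmful overthinking"   # was right, then reasoned away from it
    elif first_correct < len(prefix_answers) - 1:
        label = "verbose overthinking"   # extra steps were redundant but harmless
    else:
        label = "no overthinking"
    return first_correct, label

print(classify_trajectory(["A", "B", "B", "C"], gold="B"))
print(classify_trajectory(["A", "B", "B"], gold="B"))
```

The returned index plays the role of the minimum reasoning budget: stopping at that prefix would have produced the correct answer in both example traces.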

2606.02834 2026-06-03 cs.CR cs.AI Version update

Large Byte Model: Teaching Language Models About Compiled Code

Florian Störtz, Catalin-Andrei Stan, Alexandru Dinu, Sandra Servia-Rodríguez, Mihaela Gaman, Calin Miron, Edward Raff

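Affiliations: CrowdStrike U.K.; CrowdStrike Romania; CrowdStrike USA

AI summary: Presents the first byte-native large language model, which extends its vocabulary with a bespoke byte tokenizer so that it can process the raw bytes of executables directly and answer malware-analysis questions, reaching 69% accuracy on malware family classification and 98% on architecture classification.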

Details
English abstract

Malware analysis starts with the raw bytes of an executable program, and tools to "lift" these to higher-level representations, such as assembly, are expensive and subject to error. Large Language Models (LLMs) cannot process raw byte representations and answer questions about them. To this end, we present the first byte-native LLM. Based on a vocabulary expansion technique using a bespoke byte tokenizer, such a model is capable of responding to complex questions about malware binaries, with accuracies ranging from 69% for malware family classification to 98% for architecture classification. Our findings indicate that providing domain knowledge during training is essential for this application -- off-the-shelf models lack both accuracy and insight. We've deployed this emerging solution to a limited number of analysts to gather feedback for further improvements.

2606.02832 2026-06-03 cs.AI Version update

An Exploration of Collision-based Enemy Morphology Generation

Johor Jara Gonzalez, Matthew Guzdial

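Affiliations: Alberta Machine Intelligence Institute (Amii); Department of Computing Science, University of Alberta

AI summary: Explores three novel approaches to generating enemy morphologies from player collision information and shows that they perform as well as or better than an evolutionary baseline adapted from prior robotics morphology work.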

Details
English abstract

Despite a great deal of prior research into Procedural Content Generation (PCG), relatively little prior work has explored generating enemies for video games. In particular, there is almost no work on generating enemy morphologies, the basic body plan or collision information for in-game enemies, despite the existence of related morphology generation work in robotics. In this paper, we explore three different novel approaches to generate enemy morphologies based on player collision information. We found that each approach provides different strengths and weaknesses, but all had equivalent or better performance than an evolutionary baseline adapted from prior robotics morphology work.

2606.02822 2026-06-03 cs.CR cs.AI Version update

Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

Alexandre Cristovão Maiorano

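Affiliations: Lumytics

AI summary: Through attribution analysis, measures how different defense families (refusal filters, budget controls, and others) cover OWASP-LLM-Top-10 threats, and shows that refusal defenses are brittle under paraphrasing attacks.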

Comments 17 pages, 4 figures, 7 tables

Details
English abstract

Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet existing breach-and-attack-simulation (BAS) benchmarks report a single aggregate coverage number, hiding which family closes which threat. We measure attribution. We add four OWASP-LLM-Top-10-aware agents to a 21-agent baseline scanner and target a lattice of four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refusal-only), $L_2$ (budget-only), and $L_3$ (full stack). $L_1$ and $L_2$ are sibling single-axis ablations, not subsets of each other; $L_3$ is their union plus tool-registry authentication and credential scrubbing. Across $N=10$ replications, the per-OWASP finding count is clean: refusal alone removes all LLM01 (jailbreak) and LLM07 (system-prompt leakage) findings; budget alone removes all LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption) findings by terminating multi-step sequences; LLM06 (excessive agency) requires the full stack. We probe brittleness under paraphrasing: with 300 Gemini-generated paraphrases ($K=5$ over a 60-template brittleness corpus), $L_1$ refusal block rate falls 15 pp on LLM01 and 25 pp on LLM07. A fifth target, $L_4$-real, swaps the stub backend for Gemini-2.5-flash behind the same $L_3$ regex and matches $L_1$ exactly, indicating no measurable alignment contribution beyond the regex (not a general claim about alignment). Budget controls show no drop (0 pp once the rate-limit floor is factored out). A refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent; a budget control resists the same mutation.
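
The attribution over the $L_0$ to $L_3$ lattice can be expressed as set membership. The sketch below encodes only the per-category results stated in the abstract; treating $L_3$ as closing everything that $L_1$ and $L_2$ close is an assumption based on it being their union, and the data structure and function name are hypothetical.

```python
ALL_THREATS = {"LLM01", "LLM02", "LLM06", "LLM07", "LLM10"}

# Threat categories whose findings each target removes, per the abstract.
CLOSED_BY = {
    "L0": set(),                 # no defenses
    "L1": {"LLM01", "LLM07"},    # refusal-only
    "L2": {"LLM02", "LLM10"},    # budget-only
    "L3": set(ALL_THREATS),      # full stack (assumed superset of L1 and L2)
}

def attribute(closed_by):
    """Map each threat to the single-axis defense families that close it."""
    single_axis = {"L1": "refusal", "L2": "budget"}
    result = {}
    for threat in sorted(closed_by["L3"]):
        families = [name for level, name in single_axis.items()
                    if threat in closed_by[level]]
        result[threat] = families or ["full stack only"]
    return result

attribution = attribute(CLOSED_BY)
for threat, families in attribution.items():
    print(threat, families)
```

Reporting this mapping rather than a single coverage number is what lets a reader see that one family alone accounts for a given threat category.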

2606.02814 2026-06-03 cs.IR cs.AI cs.CL Version update

Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors

Francisco Valentini, Edgar Altszyler, Martin Fajcik

AI总结 通过分析监督双编码器检索器在文档嵌入中编码的查询无关信号,发现模型从标注数据中学习到文档级相关性先验,导致低先验文档即使相关也更难被检索,揭示了监督检索的结构性局限。

详情
AI中文摘要

神经检索器通过标注的查询-文档对训练来估计查询-文档相关性。然而,标注协议可能并不纯粹反映相关性:它们只选择一部分文档进行标注,并且这种选择可能偏向某些文档类型。我们研究监督双编码器检索器是否隐式学习了一个文档级相关性先验:一个查询无关的信号,作为在标注数据上训练的副作用编码在其表示空间中。我们通过在冻结的文档嵌入上训练简单分类器来估计这个先验,并在多个IR基准上评估三个最先进的检索器。我们发现监督神经检索器编码了能泛化到未见文档且跨模型一致的相关性先验。这些先验造成了可发现性差距:先验较低的文档即使真正相关也更难被检索。这种效应在监督密集检索器中出现,但在BM25中较弱且不一致,并在受控的匹配文档比较下持续存在。利用基于LLM的解释,我们发现被判定为相关的文档往往是主流主题的全面、自包含的摘要,而小众、零碎或高度技术性的内容通常未被评判。检索器内化了这种偏见,将具有这些偏好特征的文档排得比缺乏这些特征的文档更高,而与它们的实际相关性无关。我们的发现揭示了监督检索的结构性局限:在标注数据上训练的模型不仅学习相关性,还学习其训练数据中的隐式文档偏好。

英文摘要

Neural retrievers are trained to estimate query-document relevance from annotated query-document pairs. Yet annotation protocols may not purely reflect relevance: they select only a subset of documents for labeling, and this selection can favor certain document types over others. We investigate whether supervised bi-encoder retrievers implicitly learn a document-level relevance prior: a query-independent signal encoded in their representation space as a side effect of training on annotated data. We estimate this prior by training simple classifiers on frozen document embeddings and evaluate three state-of-the-art retrievers across multiple IR benchmarks. We find that supervised neural retrievers encode relevance priors that generalize to unseen documents and are consistent across models. These priors create a findability gap: documents with lower prior are systematically harder to retrieve, even when genuinely relevant. This effect appears in supervised dense retrievers but is weaker and less consistent in BM25, and it persists under controlled matched-document comparisons. Using LLM-based explanations, we find that judged-relevant documents tend to be comprehensive, self-contained summaries of mainstream topics, while niche, fragmentary, or highly technical content is often left unjudged. Retrievers internalize this bias, ranking documents with these favored features higher than documents that lack them, independently of their actual relevance. Our findings expose a structural limitation of supervised retrieval: models trained on annotated data do not just learn relevance, but also the implicit document preferences in their training data.

2606.02812 2026-06-03 cs.AI cs.CL 版本更新

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Traj-Evolve:用于肺癌早期检测中患者轨迹建模的自进化多智能体系统

Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

发表机构 * University of Washington(华盛顿大学) Fred Hutch Cancer Center(Fred Hutch癌症中心) Google(谷歌)

AI总结 提出Traj-Evolve,一种结合经验池和多智能体强化学习的自进化多智能体系统,通过检索相似患者和参数优化,在肺癌早期检测中优于9个强基线模型。

详情
AI中文摘要

从纵向电子健康记录(EHR)中建模患者轨迹需要对稀疏、嘈杂且长上下文的多模态序列进行推理。现有的基于LLM的多智能体系统解决了上下文长度问题,但孤立地处理患者,未能模拟临床医生如何利用从类似先前病例中积累的经验。我们提出了Traj-Evolve,一个具有两种互补进化机制的自进化多智能体系统。首先,经验池(ExPool)作为非参数记忆,索引拒绝采样的推理轨迹,以检索相似患者作为少样本上下文。其次,通过奖励排序微调的多智能体强化学习(MARL)参数化地优化智能体间和智能体-记忆协作。留一法交叉检索策略统一了这两种机制,在检索增强下对齐训练和推理时的行为。在利用长达五年的多模态EHR的肺癌预测任务中,Traj-Evolve在整体人群和具有挑战性的从不吸烟人群中均优于9个强基线模型。对进化动态的分析揭示了三个关键发现:(1)扩展ExPool将最优检索从多样本转向特定样本;(2)在MARL下,管理智能体的预测损失快速收敛,而工作智能体的时间推理继续受益于更多经过验证的患者;(3)这两种机制在预测风险上互补,ExPool提高特异性,而MARL提高敏感性。

英文摘要

Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

2606.02798 2026-06-03 cs.AI 版本更新

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

BehaviorBench: 从行为轨迹建模真实用户决策

Liangwei Yang, Jielin Qiu, Zixiang Chen, Ming Zhu, Juntao Tan, Zhiwei Liu, Wenting Zhao, Zhujun Lan, Akshara Prabhakar, Silvio Savarese, Huan Wang, Shelby Heinecke

发表机构 * Salesforce AI Research(Salesforce AI研究院)

AI总结 提出 BehaviorBench 基准,利用真实世界行为轨迹(预测市场与链上记录)评估个性化决策建模,包含信念预测和交易预测两个任务层。

详情
AI中文摘要

许多决策支持场景需要系统适应个体用户,但针对该问题的评估数据仍然有限。现有的用户理解基准通常依赖模拟用户或模型生成的行为,尽管近期研究警告基于模型的模拟可能系统性地偏离人类行为。我们引入了 BehaviorBench,一个从真实世界行为轨迹评估个性化决策建模的基准。BehaviorBench 从观测到的公开预测市场和链上记录重建钱包级别的决策历史,并将其组织为两个互补的任务层:信念预测,预测用户在市场中最终的公开立场和置信度;以及交易预测,预测个体交易的方向和数量。在 2000 个评估钱包中,该基准包含 141,445 个信念实例和 1,485,972 个交易实例,并具有用于基于检索的评估的不相交支持池。我们在四种历史接口下评估前沿和开放权重生成模型:无个性化、直接近期历史、生成用户画像和检索支持钱包证据。个性化在信念预测上比交易预测更一致地提升性能,模型排名在不同任务层和指标间变化,不同的历史接口暴露了不同的失败模式。BehaviorBench 提供了一个评估设置,用于研究个性化方法是否能够利用真实世界行为证据而非仅依赖模拟用户。

英文摘要

Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce BehaviorBench, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. BehaviorBench reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: Belief prediction, which predicts a user's final revealed stance and confidence in a market, and Trade prediction, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. BehaviorBench provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.

2606.02791 2026-06-03 cs.AI 版本更新

Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

评估 Transformer 和 LSTM 框架在无资料流域预测中的表现

Taye Akinrele, James Halgren, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi

发表机构 * University of Arizona(亚利桑那大学)

AI总结 本研究通过 NOAA 国家水模型回顾模拟,评估仅编码器 Transformer 与 LSTM 在有限水文信息下上游径流推断的优势,发现 LSTM 整体性能更强,且加入下游信息可显著提升预测技能。

Comments 5 pages

详情
AI中文摘要

流域网络呈现收敛拓扑结构,其中多个支流汇入下游河道,整合了多样化的上游水文过程。在无资料流域中,缺乏直接观测增加了不确定性,并限制了预测极端事件的能力。本研究利用 NOAA 国家水模型(NWM)的回顾模拟,评估仅编码器 Transformer 是否在有限水文信息下比 LSTM 更具优势,用于上游径流推断。在仅上游和组合配置中,LSTM 在两种配置下的整体表现均优于 Transformer 模型。加入下游信息进一步提升了所有模型的性能,使中位数 NNSE 提高了 60% 以上。我们并未将其视为排行榜式的比较,而是将实验解释为对水文序列推断的架构归纳偏置的测试。结果表明,循环记忆仍比仅编码器 Transformer 更适用于此上游重建任务,而下游水文背景提供了强大的辅助约束,显著提高了跨架构的预测技能。

英文摘要

Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrological processes. In ungauged basins, the absence of direct observations increases uncertainty and limits the ability to anticipate extreme events. This study evaluates whether an encoder-only Transformer provides an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the NOAA National Water Model (NWM). In both the upstream-only and the combined configurations, the LSTM showed stronger overall performance than the Transformer. Incorporating downstream information further boosted performance for all models, increasing median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias for hydrologic sequence inference. The results indicate that recurrent memory remains better aligned with this upstream reconstruction task than an encoder-only Transformer, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves prediction skill across architectures.
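
The abstract reports skill as median NNSE without defining it. A minimal sketch follows, assuming NNSE means the commonly used normalized Nash-Sutcliffe efficiency, NNSE = 1 / (2 - NSE); the function names and the toy flow values are illustrative and not taken from the paper.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variance."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def nnse(observed, simulated):
    """Normalized NSE, mapping NSE in (-inf, 1] onto (0, 1]."""
    return 1.0 / (2.0 - nse(observed, simulated))


# Toy streamflow series (hypothetical values).
obs = [1.0, 2.0, 3.0, 4.0]
perfect = nnse(obs, obs)                      # simulation equals observation
mean_only = nnse(obs, [2.5, 2.5, 2.5, 2.5])   # simulation is the observed mean
```

Under this definition a perfect simulation scores 1 and a simulation no better than the observed mean scores 0.5, which keeps very poor basins from dominating an aggregate the way unbounded negative NSE values would.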

2606.02781 2026-06-03 cs.AR cs.AI cs.ET 版本更新

CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation

CRAM-ER:面向可扩展存内计算的容错自旋计算随机存取存储器

Sohan Salahuddin Mugdho, Md. Shahedul Hasan, Brahmdutta Dixit, Yang Lv, Jian-Ping Wang, Cheng Wang

发表机构 * Electrical and Computer Engineering Iowa State University of Science and Technology(电气与计算机工程学院爱荷华州立大学科学与技术学院) Electrical and Computer Engineering University of Minnesota Twin Cities(电气与计算机工程学院明尼苏达大学双城分校)

AI总结 针对基于MRAM的计算随机存取存储器(CRAM)在加速深度神经网络时面临的概率性开关错误和低吞吐量问题,提出一种混合自旋-CRAM与CMOS加法器树的容错架构(CRAM-ER),通过硬件-软件协同设计实现高能效、高可靠性的矩阵向量乘法。

详情
AI中文摘要

深度神经网络(DNN)在多个领域取得了最先进的性能。然而,传统的冯·诺依曼计算范式面临严重的内存瓶颈。新兴的近内存和存内计算方法缓解了这一问题,但引入了显著的外围开销。基于MRAM的计算随机存取存储器(CRAM)能够实现无外围开销的原位逻辑,提供了一种密集、节能的解决方案。然而,概率性的MRAM开关会导致门级错误,限制了CRAM在加速DNN时的可扩展性和可靠性。此外,大量的顺序MRAM写入严重制约了CRAM的吞吐量。为了解决这些挑战,我们提出了一种容错CRAM(CRAM-ER)架构,用于可扩展的存内矩阵向量乘法(MVM)。我们的错误感知硬件-软件协同设计框架利用混合自旋-CRAM + CMOS加法器树架构来减轻器件级错误的影响,展示了具有高面积和能效的MVM功能。我们进一步开发了错误感知模型微调和细粒度纠错技术,以增强错误容限。在DNN基准测试上对CMOS+自旋混合架构的评估显示,在将CRAM延迟降低多达两个数量级的同时,实现了近乎无损的精度,在能效和能量延迟积方面均优于CPU/GPU+高带宽DRAM。

英文摘要

Deep neural networks (DNNs) have achieved state-of-the-art performance across diverse domains. However, typical Von Neumann compute paradigms face severe memory bottlenecks. Emerging near-memory and compute-in-memory approaches alleviate this but incur significant peripheral overhead. Computational Random Access Memory (CRAM) based on MRAM enables in-situ logic without peripheral overhead, offering a dense, energy-efficient solution. However, probabilistic MRAM switching induces gate-level errors that limit the scalability and reliability of CRAM for accelerating DNN. Moreover, the large number of sequential MRAM writes severely constrains CRAM throughput. To address these challenges, we propose an error-resilient CRAM (CRAM-ER) architecture for scalable in-memory matrix-vector multiplications (MVMs). Our error-aware hardware-software co-design framework leverages a hybrid spintronic-CRAM + CMOS adder-tree architecture to mitigate the impact of device-level errors, demonstrating MVM functionality with high area and energy efficiency. We further develop an error-aware model fine-tuning and fine-grained error correction for enhanced error resilience. Evaluations of the CMOS+spintronic hybrid architecture on DNN benchmarks show near-lossless accuracy while reducing CRAM latency by up to 2 orders of magnitude, outperforming CPU/GPU+high-bandwidth DRAM in both energy efficiency and energy-delay product.

2606.02775 2026-06-03 cs.AI cs.AR cs.DC cs.PF cs.RO 版本更新

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

AURA: 恒定VRAM下机器人策略的动作门控记忆

Josef Chen

发表机构 * KAIKAKU(卡基库)

AI总结 提出AURA-Mem,一种恒定大小、基于动作误差信号门控写入的循环记忆,替代KV缓存,在边缘机器人任务中实现与基线相当的准确率,同时减少5-9倍写入次数。

详情
AI中文摘要

KV缓存是数据中心合适的记忆,但却是机器人错误的记忆。数据中心推理批量处理许多短请求并重置它们,在众多请求中分摊注意力缓存。具身智能体则在带宽受限的边缘硬件上运行一个长且不重置的回合,其中高带宽内存和闪存稀缺,闪存写入寿命有限,内存写入而非计算可能成为约束瓶颈。AURA-Mem(动作效用循环自适应记忆)针对这一场景。它用一个固定大小的循环记忆和一个学习得到的门控包装冻结的视觉-语言-动作骨干网络,该门控仅在当前观测会改变下一个动作时写入:一种知道何时保持沉默的记忆。与基于重建的记忆不同,该门控直接针对闭环动作误差信号进行训练。其推理状态固定为4,224字节,无论时间步长如何,而KV缓存则在100,000步时增长到6,061倍。在受控的合成基准测试中,AURA-Mem在准确率上与最佳的O(1)基线相当,同时使用5.19-6.13倍更少的写入,在更简单的配置上最多减少9.19倍写入。预算匹配的随机和周期性调度无法恢复这一增益,从而将收益归因于动作惊喜信号。在LIBERO-Long上训练的闭环OpenVLA-OFT 7B面板(每个机械臂n=60个回合)上,门控不会损害成功率:AURA-Mem与无门控基础策略(0.233)相当,并略超过始终写入的KV臂(0.217),同时使用7.0倍更少的写入和恒定内存。我们还实例化了一个近似信息状态价值损失界限作为方法论演示;在此规模下,该界限是空洞的而非保证。

英文摘要

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.
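
The gating idea, writing to a constant-size memory only when the observation would change the next action, can be illustrated with a toy loop. This is a hand-written rule under stated assumptions, not the learned gate from the paper: `toy_action`, the averaging update, and the threshold are all hypothetical. The last lines only restate the arithmetic implied by the abstract's 4,224-byte state and 6,061 times ratio.

```python
def toy_action(memory):
    """Hypothetical policy head: the action is a linear function of memory."""
    return 2.0 * memory


def run_episode(observations, threshold=0.5):
    """Run one long episode with a fixed-size memory and a write gate."""
    memory = 0.0  # constant-size state: one float, regardless of horizon
    writes = 0
    for obs in observations:
        candidate = 0.5 * memory + 0.5 * obs  # proposed memory update
        # Gate: write only if the update would noticeably change the next action.
        if abs(toy_action(candidate) - toy_action(memory)) > threshold:
            memory = candidate
            writes += 1
    return memory, writes


# 100 steps with a single informative observation in the middle.
observations = [0.0] * 50 + [4.0] + [0.0] * 49
final_memory, write_count = run_episode(observations)

# Arithmetic implied by the abstract's figures.
state_bytes = 4_224
kv_bytes_at_100k_steps = state_bytes * 6_061
kv_bytes_per_step = kv_bytes_at_100k_steps / 100_000
```

In this toy run the gate fires only around the informative observation, so most steps cost no write at all, while the implied KV-cache footprint at 100,000 steps is roughly 25.6 MB against a fixed 4,224 bytes.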

2606.02765 2026-06-03 cs.LG cs.AI 版本更新

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

表示能力:Transformer语言模型中特征表示的几何限制

Alexander Guha

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 基于线性表示和叠加假设,通过嵌入矩阵的余弦相似度分布估计模型可支持的近正交方向数量,推导出容量公式,并发现容量对偏差ε指数敏感。

Comments 22 pages, 10 figures. Submitted to NeurIPS 2026. This is a condensed version of thesis: https://hdl.handle.net/2286/R.2.N.204857

详情
AI中文摘要

模型维度($d_{model}$)是Transformer语言模型中的一个基本超参数,但其在设定特征表示的几何限制方面的作用仍未得到充分探索。基于线性表示和叠加假设——这些假设提出模型将特征编码为潜在空间中的近正交方向——我们开发了一个框架来估计模型可以支持多少个这样的方向。我们首先将嵌入矩阵确立为跨潜在空间近正交约束的可测量代理:成对余弦相似度分布中有意义的token关系与偶然相似性之间的边界给出了模型对完美正交性的可接受偏差ε的具体估计。将此度量应用于数十个开源模型揭示了两个类别:具有高ε且其嵌入缺乏近正交结构的模型,以及具有低ε且保持近正交结构的模型。然后我们表明,标准的Johnson-Lindenstrauss引理大大低估了训练表示的填充效率,并推导出一个调整后的容量公式,其中近正交方向的数量取决于向量与维度的比率($k/d$)而非原始计数——这一单一修改在没有额外参数的情况下将预测误差降低了两个数量级。结合这些结果,我们将表示能力定义为模型潜在空间中可用于特征和嵌入的可区分方向上界。容量对ε指数敏感,并且较大的模型倾向于更严格的正交约束而非最大化原始容量——这一模式与几种解释(稳定性-容量权衡、可用概念的上限或模型规模的混杂因素)兼容,我们将这些留给未来工作。

英文摘要

Model dimension ($d_{model}$) is a fundamental hyperparameter in transformer language models, yet its role in setting the geometric limits of feature representation remains under-explored. Grounded in the Linear Representation and Superposition Hypotheses - which propose that models encode features as near-orthogonal directions in latent space - we develop a framework for estimating how many such directions a model can support. We first establish the embedding matrix as a measurable proxy for near-orthogonality constraints across the latent space: the boundary between meaningful token relationships and incidental similarity in the pairwise cosine similarity distribution gives a concrete estimate of the model's accepted deviation $\varepsilon$ from perfect orthogonality. Applying this metric across dozens of open-source models reveals two classes: models with high $\varepsilon$ whose embeddings lack near-orthogonal structure, and models with low $\varepsilon$ that maintain it. We then show that the standard Johnson-Lindenstrauss lemma greatly underestimates the packing efficiency of trained representations, and derive an adjusted capacity formula in which the number of near-orthogonal directions depends on the ratio of vectors to dimensions ($k/d$) rather than the raw count - a single modification that cuts prediction error by two orders of magnitude with no extra parameters. Combining these results, we define representational capacity as an upper bound on the number of distinguishable directions available for features and embeddings in a model's latent space. Capacity is exponentially sensitive to $\varepsilon$, and larger models favor tighter orthogonality constraints over maximizing raw capacity - a pattern compatible with several explanations (a stability-capacity trade-off, a ceiling on usable concepts, or confounds with model scale) that we leave to future work.
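
Two ingredients of this framework can be sketched in a few lines: the pairwise cosine similarity distribution of an embedding matrix, and the exponential sensitivity of capacity to $\varepsilon$. The capacity function below uses a textbook Johnson-Lindenstrauss-style scaling, $N \approx \exp(c\,\varepsilon^2 d)$, with a hypothetical constant `c`; it is not the adjusted formula derived in the paper, which the abstract does not state.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def pairwise_cosines(vectors):
    """Cosine similarity for every unordered pair of rows."""
    return [cosine(u, v) for u, v in combinations(vectors, 2)]


def log10_capacity(d_model, epsilon, c=0.25):
    """log10 of exp(c * epsilon^2 * d_model); c is an assumed constant."""
    return c * epsilon ** 2 * d_model / math.log(10)


# Toy "embedding matrix" with three rows (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarities = pairwise_cosines(toy_embeddings)

# Doubling epsilon quadruples the exponent at a fixed model dimension.
tight = log10_capacity(4096, 0.1)
loose = log10_capacity(4096, 0.2)
```

Working in log space avoids overflow and makes the sensitivity visible: under this scaling, relaxing $\varepsilon$ from 0.1 to 0.2 multiplies the exponent by four.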

2606.02755 2026-06-03 cs.SE cs.AI 版本更新

Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems

面向业务中心LLM系统的验收测试驱动评估协议

Eric Liang

AI总结 针对LLM系统概率生成与确定性需求不匹配问题,提出基于验收测试驱动开发、安全工程和业务中心验证的评估协议,将利益相关者目标转化为可执行行为契约,并采用红-训练-绿生命周期确保多维门控通过后才发布。

详情
AI中文摘要

大型语言模型(LLM)应用日益期望在依赖概率生成组件的同时满足确定性机构需求。这种不匹配使得普通的后期基准测试对于必须安全、可靠、可审计且经济有用的系统而言是不够的。本文为基于验收测试驱动开发、安全工程和业务中心验证的运营LLM系统贡献了一种评估协议扩展。该扩展在提示、模型、检索或智能体变更被接受之前,将利益相关者目标转化为可执行行为契约、发布门控、监控信号和证据工件。它将测试驱动开发的红-绿-重构纪律调整为红-训练-绿色生命周期:首先为期望行为定义失败的验收测试,然后通过提示变更、检索设计、微调、护栏或数据增强改进LLM系统,最后仅当多维门控满足时才发布。贡献在于一个面向治理的度量栈、参考架构和用于比较验收测试驱动LLM开发与提示优先和基准后工作流的经验协议。

英文摘要

Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative components. This mismatch makes ordinary post-hoc benchmarking insufficient for systems that must be safe, reliable, auditable, and economically useful. This paper contributes an evaluation-protocol extension for operational LLM systems grounded in acceptance-test-driven development, safety engineering, and business-centric validation. The extension translates stakeholder goals into executable behavioral contracts, release gates, monitoring signals, and evidence artifacts before prompt, model, retrieval, or agent changes are accepted. It adapts the red-green-refactor discipline of test-driven development to a red-train-green lifecycle: first define failing acceptance tests for desired behavior, then improve the LLM system through prompt changes, retrieval design, fine-tuning, guardrails, or data augmentation, and finally release only when multidimensional gates are satisfied. The contribution is a governance-oriented metric stack, reference architecture, and empirical protocol for comparing acceptance-test-driven LLM development against prompt-first and benchmark-after workflows.

2606.02753 2026-06-03 cs.CV cs.AI 版本更新

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

MetaWorld: 从单视角视频数据扩展多智能体视频世界模型

Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学)

AI总结 提出MetaWorld框架,通过单目世界状态展开、主体感知世界生成器和世界状态对齐机制,从单视角视频构建多智能体视频世界模型,解决数据稀缺和世界状态对齐问题。

详情
AI中文摘要

视频世界模型是具身AI和元宇宙的基础生成技术,但现有方法固有限制于单智能体从单一视角观察。将这些模型扩展到多智能体设置引入了两个关键挑战:数据稀缺(协调的多视角记录对于通用开放域场景来说成本过高)和世界状态对齐(独立生成的视频流无法确保共享物理环境和事件在不同视角下一致演化)。为应对这些挑战,我们提出MetaWorld,一种新颖框架,可直接从单视角视频将多智能体视频世界模型扩展到开放域环境。首先,我们引入单目世界状态展开(MWSU),将单目视频显式分解为相机操作者的自我运动与可见主体的空间轨迹。这种相机-轨迹分解自然提取了共享3D空间内同步的多智能体运动数据,完全绕开了多相机设置的需求。其次,为精确视觉控制,我们开发了主体感知世界生成器,实现基于每个智能体身份图像的外观驱动模拟。最后,为确保两个视角基于相同的物理现实,我们提出世界状态对齐(WSA),一种在视频DiT的每个Transformer层插入的逐帧跨分支交叉注意力机制。通过联合同步去噪过程,WSA强制实现静态几何一致性和动态运动一致性,促使共享3D环境和物理事件在两个自我中心视角间保持良好对齐。大量实验表明,MetaWorld实现了优越的跨视角一致性和身份保真度,为多智能体视频世界建模建立了一个高度可扩展、物理驱动的范式。

英文摘要

Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the same physical reality, we propose World-State Alignment (WSA), a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging the shared 3D environment and physical events to remain well aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.

2606.02747 2026-06-03 cs.CV cs.AI 版本更新

Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

Plan2Map: 基于规划记录的文档驱动地理空间边界重建的多模态基准

Fabian Degen, Oishi Deb, Jindong Gu, Junchi Yu, Samuele Marro, Philip Torr, Jialin Yu

AI总结 提出Plan2Map基准和GeoPlanAgent系统,通过文档证据提取、定位、地图配准、边界分割等步骤,从英国规划记录中重建地理空间边界,显著优于直接VLM方法。

Comments Project page: https://odeb1.github.io/Plan2Map_Project_Page/. Fabian Degen and Oishi Deb Contributed Equally

详情
AI中文摘要

规划记录定义了地理区域上的限制,但其源文档通常仅提供间接的空间证据而非机器可读的边界。我们介绍了Plan2Map,一个包含208个案例的多模态基准,用于从英国规划记录中重建文档驱动的地理空间边界。仅给定源规划文档,系统必须从通知文本、时间表、地图图版、地图标签和边界注释中重建有效的地理空间边界;参考GeoJSON被保留用于评分。我们提出了GeoPlanAgent,一个文档驱动、地理空间工具在环的系统,将任务分解为证据提取、定位、地图配准、边界分割、投影和验证。在Plan2Map上,GeoPlanAgent实现了0.736的平均IoU和0.904的中位IoU,其中67.8%的预测IoU达到或超过0.8,显著优于直接VLM到GeoJSON的基线。诊断分析表明,直接VLM预测仍然不可靠,而剩余错误集中在定位和地图配准上,监督边界分割显著提高了像素级掩码质量。Plan2Map为从公共规划记录中进行多模态地理空间重建提供了一个具体的测试平台。项目页面:https://odeb1.github.io/Plan2Map_Project_Page/

英文摘要

Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: https://odeb1.github.io/Plan2Map_Project_Page/.
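
The headline numbers here are a mean IoU, a median IoU, and the share of predictions at or above 0.8 IoU. A minimal sketch of those three summaries follows, assuming boundaries have been rasterized to sets of grid cells; the benchmark itself scores against GeoJSON geometries, so the cell-set representation and the names `iou` and `summarize` are illustrative only.

```python
from statistics import mean, median


def iou(pred_cells, ref_cells):
    """Intersection over union of two regions given as collections of grid cells."""
    pred, ref = set(pred_cells), set(ref_cells)
    union = pred | ref
    if not union:
        return 0.0  # convention chosen here for two empty regions
    return len(pred & ref) / len(union)


def summarize(ious, threshold=0.8):
    """Mean, median, and share of cases at or above the threshold."""
    return {
        "mean": mean(ious),
        "median": median(ious),
        "share_at_or_above": sum(v >= threshold for v in ious) / len(ious),
    }


# One toy case: the prediction misses a single cell of a 2x2 reference region.
reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
prediction = {(0, 0), (0, 1), (1, 0)}
single_case = iou(prediction, reference)

# Summary over three hypothetical per-case scores.
summary = summarize([0.5, 0.9, 1.0])
```

Reporting the median next to the mean is informative when a few badly localised cases pull the mean down, which is consistent with a median (0.904) well above the mean (0.736).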

2606.02739 2026-06-03 cs.SD cs.AI eess.AS 版本更新

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

EntangleCodec:通过语义-声学纠缠的统一离散音频分词器

Hui Li, Yangfan Gao, Junlin Shang, Changhao Jiang, Tao Gui, Qi Zhang, Xuanjing Huang

发表机构 * Fudan University(复旦大学)

AI总结 提出EntangleCodec,一种通过将音频与丰富标题对齐学习语义-声学联合表示的统一离散音频分词器,在紧凑令牌流中捕获语言内容、说话人身份、情感、韵律和声学场景,并通过流匹配扩散解码器实现高质量重建,在音频理解和生成任务上均取得领先性能。

Comments 17 pages, 10 figures

详情
AI中文摘要

音频分词器作为连续音频与音频语言模型(ALM)之间的离散接口,但现有分词器往往难以同时支持理解和生成。面向重建的编解码器保持声学保真度但缺乏丰富语义,而语义感知分词器通常依赖独立的语义和声学流,引入冗余或错位。我们提出 EntangleCodec,一种统一的离散音频分词器,在量化之前学习与标题对齐的语义-声学表示。通过将音频与丰富标题而非ASR转录对齐,EntangleCodec在紧凑令牌流中捕获语言内容、说话人身份、情感、韵律和声学场景。流匹配扩散解码器进一步实现了语音、音乐和通用音频的高质量重建。EntangleCodec在重建质量上与专用编解码器竞争,在音频理解上优于所有基于编解码器的基线,在MMAR上提升高达 +7.4%,并在统一框架中支持TTS和TTA生成。此外,基于EntangleCodec的音频语言模型展现出强大的扩展行为:即使参数为 0.6B,该模型在三个基准测试中超越了参数超过 13B 的专用连续表示LLM,参数量少 22 倍;扩展到 8B 进一步在MMAR上建立了新的最先进结果,突显了在音频语言建模中表示质量与模型规模同等重要。代码和模型权重可从 https://github.com/luckyerr/EntangleCodec 获取。

英文摘要

Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose EntangleCodec, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to +7.4% on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at 0.6B parameters, the model surpasses specialized continuous-representation LLMs with over 13B parameters across three benchmarks using 22× fewer parameters; scaling to 8B further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.

2606.02737 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Attention Calibration for Position-Fair Dense Information Retrieval

面向位置公平的密集信息检索的注意力校准

Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich

发表机构 * Department of Computational Linguistics University of Zurich(计算语言学系苏黎世大学)

AI总结 针对密集检索模型的位置偏差问题,提出在推理时通过注意力校准(引入强度系数λ插值原始与完全校准分布)来提升位置公平性,无需重新训练且不牺牲整体检索效果,在多个数据集和模型上验证了部分校准优于完全校准,并提供了默认配置。

详情
AI中文摘要

密集检索模型存在位置偏差:当相关信息出现在段落较后位置时,检索效果会下降(Zeng et al., 2025)。我们探究是否可以在推理时减少这种偏差,无需重新训练且不牺牲整体检索效果。为此,我们将推理时的注意力校准(Schuhmacher et al., 2026)适配到下游检索,并引入强度系数λ,在原始注意力分布和完全校准的注意力分布之间进行插值。在SQuAD-PosQ和FineWeb-PosQ上的三个嵌入模型上,我们考察了篮子大小、校准层集和强度如何影响位置公平性与检索效果之间的权衡,发现部分校准通常优于完全校准。单个配置(B=128, λ=0.5, 50%层深度)在FineWeb-PosQ上提升了所有三个模型跨位置组的nDCG@10的调和平均值,无需逐模型调参,并且适用于<s>-池化和最后token池化两种架构。该默认配置无需修改即可迁移到PosIR(涵盖10种语言和31个领域),在所有16种长度四分位×模型×检索设置组合中降低了位置敏感指数,同时保持或提升了整体nDCG@10。我们在以下网址发布扩展后的代码库:https://github.com/impresso/fair-sentence-transformers

英文摘要

Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both <s>-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile x model x retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair-sentence-transformers

2606.02724 2026-06-03 cs.CV cs.AI 版本更新

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

AVTrack: 以人为中心的复杂场景中的视听跟踪

Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有视听跟踪数据集局限于简单场景的问题,提出AVTrack数据集,通过包含相机运动、视觉遮挡和位置变化等复杂动态条件,评估并提升鲁棒的人为中心视听场景理解。

Comments 19 pages, 10 figures, ICML 2026

详情
AI中文摘要

视听说话人跟踪旨在通过利用听觉和视觉线索来定位和跟踪活跃的说话人,实现细粒度、以人为中心的场景理解。这一能力对于智能视频编辑、监控和人机交互等实际应用至关重要。然而,现有数据集大多局限于具有粗略标注的简单或同质视听场景。这种过度简化的设置使评估偏向于静态视听共现,而非严格评估复杂动态场景中的鲁棒时空建模和跨模态推理。为了解决这些限制,我们引入了AVTrack,一个以人为中心的视听实例分割(AVIS)数据集,专为动态真实世界场景设计。AVTrack具有多样且具有挑战性的条件,包括相机运动、视觉遮挡和位置变化。在AVTrack上对代表性AVIS方法的评估揭示了显著的性能下降,使AVTrack成为复杂环境中鲁棒的以人为中心的视听场景理解的挑战性基准。我们进一步提供了一个简单而有效的基线,以促进未来的研究。项目网站:https://FudanCVL.github.io/AVTrack/

英文摘要

Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

2606.02673 2026-06-03 cs.AI cs.LG 版本更新

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

大语言模型中用于结构推理的可视化图脚手架

Runlin Lei, Xiaokui Xiao, Zhewei Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出将图结构作为大语言模型的内部推理辅助而非仅外部知识源,通过多跳问答实验发现视觉图引导相比文本化图在无直接答案提示时仍保持有效性,支持图作为组织推理的可视化脚手架。

详情
AI中文摘要

图已被用于增强大语言模型的结构化推理,主要是在测试时作为外部知识源提供给模型。在本文中,我们采取不同的视角:图对LLMs的价值不仅在于提供信息,还在于组织推理。受人类使用图结构思维导图组织分支和汇聚思维的启发,我们探究图是否可以作为推理辅助的内部形式。我们在多跳问答任务上研究这一问题,其中教师提供的推理轨迹被重写为图思维导图并用于指导学生模型。我们的实验揭示了明显的模态差距。当图结构被扁平化为文本时,一旦直接答案提示被移除,其益处变得有限。在这种抽象引导设置下,推理效率和答案质量都大幅下降。相比之下,视觉图引导在没有直接答案线索时仍然有效,并且其优势在监督微调和基于KL的蒸馏后仍然保持。上述发现支持了以下主张:图不仅应作为LLMs的外部知识结构来研究,还应作为组织推理的可视化脚手架。

英文摘要

Graphs have been used to enhance large language models (LLMs) for structured reasoning, mostly as external knowledge sources provided to models at test time. In this paper, we take a different view: the value of graphs for LLMs lies not only in supplying information, but also in organizing reasoning. Inspired by how humans use graph-structured mind maps to organize branching and converging thoughts, we ask whether graphs can serve as an internal form of reasoning assistance. We study this question on multi-hop question answering tasks, where teacher-provided reasoning traces are rewritten as graph mind maps and used to guide a student model. Our experiments reveal a clear modality gap. When graph structures are flattened into text, their benefits become limited once direct answer hints are removed. Under this abstract guidance setting, both reasoning efficiency and answer quality degrade substantially. In contrast, visual graph guidance remains effective without direct answer clues, and its advantage persists after supervised fine-tuning and KL-based distillation. These findings support the claim that graphs should be studied not only as external knowledge structures for LLMs, but also as visual scaffolds for organizing reasoning.

2606.02671 2026-06-03 cs.LG cs.AI 版本更新

Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis

对齐数据驱动预测器与分配:面向决策的生存分析方法

Itai Zilberstein, Ioannis Anagnostides, Tuomas Sandholm

发表机构 * Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA(计算机科学系,卡内基梅隆大学,匹兹堡,PA) Strategy Robot, Inc.(策略机器人公司) Strategic Machine, Inc.(战略机器公司) Optimized Markets, Inc.(优化市场公司)

AI总结 针对生存分析中预测模型与分配决策目标不一致的问题,提出基于归一化折现累积增益(NDCG)的决策聚焦学习方法,通过优化NDCG提升分配效果,在心脏移植数据上使基线模型NDCG提升50-100%。

详情
AI中文摘要

机器学习预测器已成为指导自动化决策的重要工具。然而,一个主要的错位仍然存在:预测模型通常根据标准统计指标进行优化,而与其所指导的算法任务相孤立。我们在器官分配这一高风险领域中强调了这种不一致性,通过证明任何依赖(即使是高度准确的)针对标准指标(如一致性指数(C-index))优化的生存预测器的算法,在用于分配时可能产生任意差的结果,无法保证比均匀随机选择更好的效用。为了弥合生存分析与策略优化之间的差距,我们引入了一种基于优化归一化折现累积增益(NDCG)的决策聚焦学习方法,NDCG是信息检索中的主流指标。我们通过证明NDCG转化为分配性能的保证,确立了其在生存分析中的效用。在实证中,我们提出了一种自举方法来优化现有生存模型的NDCG。与先前工作不同,我们还解决了评估排名时右删失的挑战。在美国历史心脏移植数据上,我们的方法将基线模型的NDCG大幅提升了50-100%,这相当于在移植分配中每年额外获得数万生命年。我们预计我们的框架将在基于预测的决策中找到更广泛的应用。

英文摘要

Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation from the algorithmic tasks they inform. We highlight this incongruity in the high-stakes domain of organ allocation by demonstrating that any algorithm relying on (even highly accurate) survival predictors optimized for standard metrics -- such as the Concordance index (C-index) -- can yield arbitrarily poor outcomes when used for allocation, failing to guarantee utility better than a uniform random selection. To bridge the gap between survival analysis and policy optimization, we introduce a decision-focused learning approach based on optimizing normalized discounted cumulative gain (NDCG), a mainstay metric in information retrieval. We establish the utility of NDCG in survival analysis by proving that it translates to guarantees on the performance of allocation. Empirically, we propose a bootstrapping approach to optimize the NDCG of existing survival models. Unlike prior work, we also address the challenge of right censorship when evaluating ranking. On historical heart transplant data from the US, our method dramatically boosts the NDCG of baseline models by 50-100%, which translates to tens of thousands of additional life years gained annually when deployed for transplant allocation. We anticipate that our framework will find broader applications in decision making with predictions.
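
The NDCG metric named in the abstract above can be illustrated with a minimal sketch. It assumes the textbook definition with linear gains and a log2 position discount; the function names `dcg` and `ndcg` are hypothetical, and the sketch does not handle right-censored outcomes, which the paper says it addresses separately.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of gains listed in ranked order (best rank first)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(predicted_scores, true_gains):
    """NDCG of the ranking induced by predicted_scores (higher score = ranked earlier).

    Assumption: true_gains are non-negative utilities, e.g. life years gained.
    """
    order = sorted(range(len(true_gains)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_gains[i] for i in order]
    ideal = dcg(sorted(true_gains, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    gains = [10.0, 5.0, 1.0]
    print(ndcg([3.0, 2.0, 1.0], gains))  # ranking agrees with the true gains
    print(ndcg([1.0, 2.0, 3.0], gains))  # ranking is fully reversed
```

Because the discount falls with rank, errors near the top of the ranking cost more than errors near the bottom, which is one plausible reason a ranking metric of this kind suits allocation of a scarce resource.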

2606.02663 2026-06-03 cs.LG cs.AI 版本更新

AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret

AdaWeather: 自适应混合概率天气预报与对数遗憾

Saptarishi Dhanuka, Sarvesh Iyer, Manmeet Singh, Mihir More, Rushil Gupta, Dhruman Gupta, Parthasarathi Mukhopadhyay, Sandeep Juneja

发表机构 * Ashoka University(阿育王大学) Western Kentucky University(西肯塔基大学)

AI总结 提出 AdaWeather 自适应框架,通过结合机器学习和专家混合方法融合多个概率天气预报,实现对数遗憾界,并在温度预测上取得改进。

Comments 36 pages, 16 figures. Submitted to arXiv. Forecast aggregation for probabilistic weather prediction using offline supervised learning and online prediction with expert advice. Includes theoretical regret guarantees and empirical evaluation on temperature forecasting. Submitted to NeurIPS 2026

详情
AI中文摘要

机器学习的最新进展已经产生了与最先进数值天气预报模型相当的概率天气预报模型。但没有任何模型在时空上持续占优,且相对性能高度依赖于上下文。这激发了自适应方法来组合多个预报以获得改进和鲁棒性。尽管文献中已提出组合预报,但这些要么通过监督学习实现,要么通过专家建议预测方法实现。我们引入 AdaWeather,一个自适应框架,它使用机器学习和专家混合方法结合多个概率预报,以得到统一的改进概率预报。传统专家方法针对事后最佳单一专家建立遗憾界,而我们扩展了算法和分析,表明我们的方法相对于事后最佳静态专家混合具有对数遗憾。实验上,我们专注于温度预测,并观察到相对于现有方法的改进。

英文摘要

Recent advances in machine learning have produced probabilistic weather forecasting models comparable to state-of-the-art numerical weather predictors. But no model consistently dominates spatio-temporally, and relative performance is highly context-dependent. This motivates adaptive methods for combining multiple forecasts to obtain improvements and robustness. While combined forecasts have been proposed in the literature, these are achieved either through supervised learning or through prediction with expert advice methods. We introduce AdaWeather, an adaptive framework that combines many probabilistic forecasts using both machine learning as well as mixture of experts to arrive at a unified improved probabilistic forecast. While traditional expert methods develop the regret bounds with respect to the best single expert in hindsight, we extend the algorithm and analysis to show our method has logarithmic regret compared to the best static mixture of experts in hindsight. Empirically, we focus on forecasting temperature, and observe improvements over existing methods.
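
As background for the "prediction with expert advice" setting mentioned above, the following is a minimal sketch of the classical Bayesian mixture (exponential weights with learning rate 1 under log loss). It is not the AdaWeather algorithm: it only guarantees cumulative log loss within ln N of the best single expert, whereas the paper targets regret against the best static mixture. The function name `run_mixture` and the toy likelihoods are hypothetical.

```python
import math


def run_mixture(likelihoods):
    """Mix N probabilistic experts online under log loss.

    likelihoods[t][k] is the probability (or density) that expert k assigned
    to the outcome actually observed at step t; all values must be > 0.
    Returns (mixture log loss, per-expert log losses, final weights).
    """
    n = len(likelihoods[0])
    weights = [1.0 / n] * n          # uniform prior over experts
    mix_loss = 0.0
    expert_loss = [0.0] * n
    for row in likelihoods:
        # The mixture forecast is the weight-averaged expert forecast.
        p_mix = sum(w * p for w, p in zip(weights, row))
        mix_loss += -math.log(p_mix)
        for k, p in enumerate(row):
            expert_loss[k] += -math.log(p)
        # Bayesian update: experts that predicted well gain weight.
        weights = [w * p / p_mix for w, p in zip(weights, row)]
    return mix_loss, expert_loss, weights


if __name__ == "__main__":
    toy = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.1]]
    loss, per_expert, w = run_mixture(toy)
    print(loss, per_expert, w)
```

The ln N bound follows because the mixture's cumulative likelihood is the prior-weighted average of the experts' cumulative likelihoods, so it is at least 1/N times the best expert's.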

2606.02662 2026-06-03 cs.LG cs.AI physics.chem-ph 版本更新

Improvise, Adapt, Overcome: An On-The-Fly Multifidelity Algorithm for Efficient Machine Learning

即兴、适应、克服:一种用于高效机器学习的即时多保真算法

Vivin Vinod, Peter Zaspel

发表机构 * School of Mathematics and Natural Sciences, University of Wuppertal(数学与自然科学学院,乌珀塔尔大学)

AI总结 提出一种自适应即时多保真机器学习框架,通过动态查询不同保真度的训练样本,自动确定数据集组成,在降低数据生成成本的同时提高模型精度。

Comments Supplementary Information added as separate PDF

详情
AI中文摘要

机器学习加速了量子化学,但受到生成高保真训练数据的高昂成本的阻碍。多保真机器学习(MFML)通过系统性地结合丰富的低保真数据和稀疏的高保真数据来减轻这一开销。尽管取得了成功,标准MFML方案依赖于预定义的缩放因子来确定不同保真度之间的稀疏数据比例,通常会产生冗余的多保真数据,导致效率损失。在这里,我们介绍了一种用于机器学习的自适应即时多保真框架,该框架自主确定训练数据集的组成。通过动态查询每个保真度的训练样本,该算法在转向更昂贵的参考计算之前,先在较低保真度上使模型精度饱和。我们在不同的化学性质上对新颖的自适应MFML进行了基准测试,包括计算化学金标准的耦合簇能量,以及更具化学挑战性的激发能。在我们的数值实验中,我们表明,与单保真方法相比,我们的自适应算法将数据生成成本降低了多达30倍,并且与标准MFML相比提高了多达5倍。数据冗余的缓解为量子化学中可持续的成本感知机器学习建立了一条高精度、低成本的途径。

英文摘要

Machine learning has accelerated quantum chemistry but is hindered by the prohibitive cost of generating high fidelity training data. Multifidelity machine learning (MFML) mitigates this overhead by systematically combining abundant low fidelity data with sparse high fidelity data. In spite of its success, standard MFML schemes rely on pre-defined scaling factors to determine sparse data ratio across fidelities, often generating redundant multifidelity data resulting in a loss of efficiency. Here, we introduce an adaptive on-the-fly multifidelity framework for machine learning that autonomously determines training dataset composition. By dynamically querying training samples at each fidelity, the algorithm saturates model accuracy at lower fidelities before moving up to more expensive reference calculations. We benchmark the novel adaptive-MFML across diverse chemical properties including the computational chemistry gold standard coupled cluster energies, and the more chemically challenging excitation energies. In our numerical experiments we show that our adaptive algorithm reduces data generation costs by up to a factor of 30 compared to single fidelity methods and improves upon standard MFML by up to a factor of 5. The mitigation of data redundancy establishes a high-accuracy low-cost pathway for sustainable cost-aware machine learning in quantum chemistry.

2606.02659 2026-06-03 cs.LG cs.AI 版本更新

CL-DMDF: Dynamic Multimodal Data Fusion Model Based on Contrastive Learning

CL-DMDF:基于对比学习的动态多模态数据融合模型

Dong Li, Lingling Zhang, Binghao Han, Linlin Ding, Yue Kou

发表机构 * Tsinghua University(清华大学)

AI总结 针对多模态数据融合中模态缺失和局部交互忽视全局互补线索的问题,提出基于对比学习的动态多模态数据融合模型(CL-DMDF),通过跨特征和模态维度的注意力机制、实体质心对比学习模块和自适应融合模块,提升动态融合的效率和准确性。

Comments 9 pages, 5 figures, 7 tables

详情
AI中文摘要

多模态数据融合涉及整合和分析来自多种模态的信息,以揭示潜在的关联和互补模式,从而增强数据处理和决策能力。尽管现有的结构化多模态输入方法通常针对特定任务设计并假设模态完全可观测,但实际应用中常因各种因素导致模态输入不确定或缺失。一些传统模型过度强调缺失模态内的局部交互,忽视了多模态表示中嵌入的全局互补线索。为克服这些限制,我们提出了一种基于对比学习的动态多模态数据融合模型(CL-DMDF)。CL-DMDF引入了一种新颖的注意力机制,该机制在特征和模态维度上同时操作,以计算可靠的注意力分数,有效反映每个层级的重要性。CL-DMDF进一步整合了实体质心对比学习模块,该模块从实体特征构建基于质心的正样本,以增强判别学习。此外,采用自适应融合模块以提高动态融合策略的效率和准确性。在三个数据集上进行的大量实验证明了CL-DMDF在各种多模态融合任务中的有效性。

英文摘要

Multimodal data fusion involves integrating and analyzing information from multiple modalities to uncover latent correlations and complementary patterns, thereby enhancing data processing and decision-making. While existing methods for structured multimodal inputs are typically designed around specific tasks and assume fully observed modalities, real-world applications often suffer from uncertain or missing modality inputs due to various factors. Some traditional models overly emphasize local interactions within missing modalities, neglecting the global complementary cues embedded in multimodal representations. To overcome these limitations, we propose a Dynamic Multimodal Data Fusion model based on Contrastive Learning (CL-DMDF). CL-DMDF introduces a novel attention mechanism that operates across both feature and modality dimensions to compute reliable attention scores, effectively reflecting importance at each level. The CL-DMDF further incorporates an entity-centroid contrastive learning module that constructs centroid-based positive samples from entity features to enhance discriminative learning. Additionally, an adaptive fusion module is employed to improve the efficiency and accuracy of dynamic fusion strategies. Extensive experiments conducted on three datasets demonstrate the effectiveness of the CL-DMDF across diverse multimodal fusion tasks.

2606.02644 2026-06-03 cs.CR cs.AI 版本更新

A New Framework for Cybersecurity Refusals in AI Agents

AI代理中网络安全拒绝的新框架

Eliot Krzysztof Jones, Mateusz Dziemian, Matt Fredrikson, J Zico Kolter

发表机构 * Gray Swan Gray Swan AI Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出首个针对AI代理在进攻性安全场景中建立拒绝边界的框架,包括拒绝原则、任务分类和评估方法,并发现8个前沿模型中6个拒绝率接近零。

详情
AI中文摘要

代理脚手架显著提升了LLM在复杂、长期任务上的表现,在网络安全等领域带来了广泛益处和放大风险。现有的AI代理网络安全基准主要关注能力测量——代理能多有效地完成进攻性安全任务——但忽略了一个关键问题:代理何时以及如何拒绝有害请求?我们提出了首个在进攻性安全场景中建立拒绝边界的框架。我们的框架定义了(1)任务应被拒绝的原则性标准,(2)应被拒绝的任务类别,以及(3)在良性和对抗条件下测量代理鲁棒性的评估方法。我们应用该框架评估当前基于LLM的代理在一系列基于Web的进攻性安全场景中是否遵守适当的拒绝边界,发现测试的8个前沿模型中有6个拒绝率接近零,只有2个模型(GPT-5.2和GPT-5.1 Codex)表现出任何有意义的拒绝行为。

英文摘要

Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified risks in domains like cybersecurity. Existing benchmarks for AI agents in cybersecurity focus mainly on measuring proficiency--how effectively agents can complete offensive security tasks--but neglect a critical question: when and how should agents refuse harmful requests? We present the first framework for establishing refusal boundaries in offensive security contexts. Our framework defines (1) principled criteria for when tasks should be refused, (2) categories of tasks that warrant refusal, and (3) evaluation methodology for measuring agent robustness under both benign and adversarial conditions. We apply this framework to assess how current LLM-powered agents adhere to appropriate refusal boundaries across a range of web-based offensive security scenarios, finding that 6 of 8 frontier models tested show near-zero refusal rates, with only 2 models (GPT-5.2 and GPT-5.1 Codex) demonstrating any meaningful refusal behavior.

2606.02643 2026-06-03 cs.CR cs.AI cs.DB 版本更新

Inference Cost Attacks for Retrieval-Augmented Large Language Models

检索增强型大语言模型的推理成本攻击

Chengliang Liu, Liangbo Ning, Yujuan Ding, Wenqi Fan

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出RA-ICA攻击范式,通过向外部知识库注入恶意文档,利用CREEP框架和MA-GRPO算法,使RAG增强的LLM系统推理时token消耗增加高达13.12倍且成功率超过90%。

Comments Accepted at The ACM Web Conference 2026 (WWW '26)

详情
Journal ref
Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates
AI中文摘要

检索增强生成(RAG)增强的LLM系统虽然强大,但由于包含额外的多阶段流水线(动态检索和综合外部知识源的信息),引入了大量的推理成本。这种高运营成本暴露了一个关键漏洞,即推理成本攻击(ICA)。然而,现有的ICA通常依赖于直接提示操纵的不切实际的假设。我们认为,对RAG增强的LLM系统更可行且更强大的威胁来自污染外部知识库(例如,来自互联网的网络知识)。在这项工作中,我们引入了检索增强推理成本攻击(RA-ICA),这是一种新颖的攻击范式,通过向外部知识语料库注入恶意文档来针对RAG增强的LLM系统的计算成本。为了实现这种攻击,我们提出了通过外部投毒耗尽计算资源(CREEP),这是一种新颖的框架,利用LLM代理自动制作恶意文档,这些文档在语义上相关以便检索,并且能够有效诱导推理阶段token消耗的异常增加。为了提高攻击的有效性,我们引入了记忆增强组相对策略优化(MA-GRPO),这是一种新颖的强化学习算法,通过从历史最佳对抗文档的动态记忆中学习来微调代理。在三个真实世界数据集上的大量实验表明,RA-ICA在不降低生成答案完整性的情况下,将token消耗增加了高达13.12倍,成功率超过90%。

英文摘要

Retrieval-Augmented Generation (RAG)-enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an extra multi-stage pipeline that dynamically retrieves and synthesizes information from external knowledge sources. This high operational cost exposes a critical vulnerability to Inference Cost Attacks (ICAs). However, existing ICAs often rely on the impractical assumption of direct prompt manipulation. We argue that a more feasible and potent threat to RAG-enhanced LLM systems arises from poisoning external knowledge bases (e.g., web knowledge from the Internet). In this work, we introduce the Retrieval-Augmented Inference Cost Attack (RA-ICA), a novel attacking paradigm that targets the computational cost of RAG-enhanced LLM systems by injecting malicious documents into external knowledge corpus. To operationalize this attack, we propose Computational Resource Exhaustion via External Poisoning (CREEP), a novel framework that leverages LLM agents to automatically craft malicious documents that are both semantically relevant for retrieval and potent for inducing an abnormal increase in token consumption during the inference phase. To enhance the attack's effectiveness, we introduce Memory-Augmented Group Relative Policy Optimization (MA-GRPO), a novel reinforcement learning algorithm that fine-tunes the agents by learning from a dynamic memory of historical best adversarial documents. Extensive experiments across three real-world datasets demonstrate that RA-ICA increases token consumption by up to 13.12 times with an over 90% success rate, without degrading the integrity of the generated answer.

2606.02641 2026-06-03 cs.RO cs.AI 版本更新

CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving

CARVE: 通过包络实现交互驾驶中被否决机动的认证可负担修复

Yifan Wang

发表机构 * Yifan Wang(王一帆)

AI总结 针对交互驾驶中规则感知堆栈易忽略的硬规则裕度负值问题,提出CARVE认证层,通过有限格点上的自我与代理战术算子,实现被否决机动的可负担修复认证,并证明其合理性。

Comments 8 pages, 3 figures

详情
AI中文摘要

交互驾驶暴露了规则感知自动驾驶堆栈中容易忽略的失效模式:即使非优先代理的小幅合法让步可恢复可行性,自我候选的硬规则裕度仍可能为负。现有的规则手册、防护和可达性过滤器在否决不安全动作方面表现强劲,而基于预测的规划器则对可能的响应进行建模。两者均未返回运行时证明对象,该对象说明哪个有界多代理编辑修复了机动、谁拥有编辑、请求是否在路权上可负担,以及如果请求未被遵守,自我后备是什么。我们将这一缺失对象形式化为*交互修复认证*,并引入*CARVE*,一个在自我拥有和代理拥有的战术算子有限格点上的无预测认证层。代理拥有的请求仅在\(B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)\)内可接受,这是一个将运动学可达性与规范优先级分离的合作包络。生成的证书记录了绑定规则、修复类别、修复集、责任加权成本分配和后备。在589个基于Lanelet2几何的INTERACTION重放片段上,CARVE-Greedy接受了98.64%的初始否决机动,恢复了370/378个人类解决错误否决,同时保持了589/589的路权尊重、零优先级代理假阳性以及400/400的负压力否决。我们证明了证书的合理性、结构性的路权尊重、精确的有限格点最小性、后备应急性和责任一致性条件。CARVE不预测也不需要其他驾驶员的合规性;它认证在声明假设下提议的交互是否有界、可归因且规范上可接受。

英文摘要

Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative for an ego candidate even though a small lawful accommodation by a non-priority agent would restore feasibility. Existing rulebooks, shields, and reachability filters are strong at vetoing unsafe actions, while prediction-based planners model likely responses. Neither returns a runtime proof object that states which bounded multi-agent edit repairs the maneuver, who owns the edit, whether the request is right-of-way affordable, and what ego fallback remains if the request is not observed. We formulate this missing object as *interactive repair certification* and introduce *CARVE*, a prediction-free certificate layer over a finite lattice of ego-owned and agent-owned tactical operators. Agent-owned requests are admissible only inside \(B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)\), a cooperation envelope that separates kinematic reachability from normative priority. The resulting certificate records the binding rule, repair category, repair set, responsibility-weighted cost split, and fallback. On 589 Lanelet2-geometry-grounded INTERACTION replay episodes, CARVE-Greedy accepts 98.64% of initially vetoed maneuvers and recovers 370/378 human-resolved false vetoes, while preserving 589/589 right-of-way respect, zero priority-agent false positives, and 400/400 negative-stress vetoes. We prove certificate soundness, structural right-of-way respect, exact finite-lattice minimality, fallback contingency, and blame-consistency conditions. CARVE does not predict or require another driver's compliance; it certifies whether a proposed interaction is bounded, attributable, and normatively admissible under declared assumptions.
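
The cooperation envelope in the abstract above is a product of a priority-dependent factor and a kinematic bound. The sketch below shows one way such an admissibility check could look. Everything beyond the product itself is an assumption for illustration: that the factor (called `beta` here) lies in [0, 1], that the kinematic bound (`alpha_max`) and the request are non-negative scalars in the same unit, and that "inside the envelope" means the closed interval from 0 to the product. The function names are hypothetical, not the authors' API.

```python
def cooperation_envelope(beta: float, alpha_max: float) -> float:
    """Envelope size as the product of a priority factor and a kinematic bound.

    Assumptions (not from the paper): beta in [0, 1], alpha_max >= 0.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [0, 1]")
    if alpha_max < 0.0:
        raise ValueError("alpha_max is assumed to be non-negative")
    return beta * alpha_max


def is_admissible(request: float, beta: float, alpha_max: float) -> bool:
    """True if a requested accommodation fits inside the envelope."""
    return 0.0 <= request <= cooperation_envelope(beta, alpha_max)


if __name__ == "__main__":
    # A non-priority agent (beta = 0.5) that could kinematically yield up to 3.0.
    print(is_admissible(1.0, beta=0.5, alpha_max=3.0))
    # With beta = 0, any positive request is rejected in this sketch.
    print(is_admissible(0.1, beta=0.0, alpha_max=3.0))
```

Keeping the two factors separate means a request can be rejected on normative grounds even when it is kinematically feasible, which matches the separation the abstract describes.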

2606.02640 2026-06-03 cs.CR cs.AI 版本更新

D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting

D-Judge: 使用语义保持输出重写破坏多轮越狱攻击

Huanli Gong, Zhipeng Wei, Yu Fu, Haz Sameen Shahgir, Ananya Gupta, Yue Dong, N. Benjamin Erichson

AI总结 提出D-Judge防御方法,通过语义保持的输出重写干扰攻击者的评判模型反馈循环,从而降低多轮越狱攻击的成功率。

Comments Proceedings of the 43rd International Conference on Machine Learning

详情
AI中文摘要

多轮越狱攻击对大型语言模型(LLM)的安全性构成日益严重的威胁,因为它们利用辅助评判模型的反馈来迭代优化提示,以实现有害目标。现有的防御措施主要在单个轮次或最终响应中检测或阻止不安全内容,但保留了评判驱动的优化循环,使攻击者能够从中间交互中提取信息性反馈。我们引入了D-Judge,一种语义保持的输出重写防御方法,它直接干预该循环,在攻击者的评判模型评估之前重写受害者LLM的响应。通过在不改变原始响应含义的情况下使评判的反馈信号失准,D-Judge破坏了攻击者的提示优化过程,导致后续查询针对扭曲的攻击进展信号进行优化。为了提高D-Judge生成此类重写的能力,我们构建了一个语义等价的响应对数据集,这些响应对会诱导不同的评判分配的有害性分数,并使用该数据集进行监督微调,随后进行直接偏好优化。在HarmBench上的实验表明,D-Judge在保持良性基准性能的同时,降低了最先进的多轮越狱攻击的成功率。

英文摘要

Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iteratively refine prompts toward harmful goals. Existing defenses largely detect or block unsafe content at individual turns or at the final response, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. We introduce D-Judge, a semantics-preserving output rewriting defense that intervenes directly in this loop by rewriting the victim LLM's responses before they are evaluated by the attacker's judge. By misaligning the judge's feedback signal without changing the meaning of the original response, D-Judge derails the attacker's prompt-refinement process, causing subsequent queries to be optimized against a distorted signal of attack progress. To improve D-Judge's ability to produce such rewrites, we construct a dataset of semantically equivalent response pairs that induce different judge-assigned harmfulness scores, and use it for supervised fine-tuning followed by direct preference optimization. Experiments on HarmBench show that D-Judge reduces the success rate of state-of-the-art multi-turn jailbreaks while preserving performance on benign benchmarks.

2606.02638 2026-06-03 cs.SD cs.AI eess.AS 版本更新

SegTune: Structured and Fine-Grained Control for Song Generation

SegTune:歌曲生成的结构化与细粒度控制

Yuejiao Wang, Zihao Ji, Pengfei Cai, Xu Li, Haorui Zheng, Zewen Song, Zhongliang Liu, Chen Zhang, Pengfei Wan

发表机构 * Kling Team, Kuaishou Technology(快手科技 Kling 团队) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学)

AI总结 提出基于扩散Transformer的SegTune框架,通过用户或LLM指定局部音乐描述实现结构化细粒度控制,并引入LLM时长预测器实现精确歌词-音乐对齐,在音乐性和可控性上超越现有基线。

Comments This paper has been accepted to ACL 2026 as an oral presentation and has been nominated for the Best Paper Award. This work is a revised and extended version of an earlier technical report (arXiv:2510.18416). arXiv admin note: text overlap with arXiv:2510.18416

详情
AI中文摘要

近期神经歌曲生成的进展使得从歌词和全局文本提示中实现高质量合成成为可能。然而,大多数系统无法建模歌曲随时间变化的属性,严重限制了音乐结构和动态的细粒度控制。为解决这一问题,我们提出SegTune,一个基于扩散Transformer的框架,通过允许用户或大型语言模型(LLM)指定与歌曲片段对齐的局部音乐描述,实现结构化和细粒度的可控性。这些片段提示被时间广播到对应的时间窗口,而全局提示则确保风格连贯性。为支持精确的歌词-音乐对齐,我们引入了一个基于LLM的时长预测器,以LyRiCs格式自回归生成句子级时间戳。我们进一步构建了一个大规模数据管道,用于收集高质量歌曲及其对齐的歌词和提示,并提出了新的指标来评估片段对齐和声乐一致性。实验表明,SegTune在音乐性和可控性方面均优于现有基线。访问我们的项目页面(https://github.com/KlingAIResearch/SegTune)获取代码和更多生成的歌曲。

英文摘要

Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.

2606.02630 2026-06-03 cs.CR cs.AI 版本更新

MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety

MultiTurnPSB:评估多轮越狱攻击与基于分类器的防御在医疗AI安全中的应用

Anushka Sheoran, Yiduo Hao

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出多轮对抗基准MultiTurnPSB,通过四轮对话评估医疗聊天机器人的安全漏洞,发现多轮攻击下不安全响应率从35%升至近80%,并验证了轻量级输入分类器可降低52个百分点的不安全响应但存在高误报率。

详情
AI中文摘要

面向患者的医疗聊天机器人通常在单轮提示上进行评估,但真实用户在被拒绝后会继续追问、增加紧迫感并援引权威。我们引入了MultiTurnPSB,这是PatientSafetyBench的一个四轮对抗扩展,并在固定模板、模板自适应和实时对抗攻击下评估了GPT-4.1-mini。在实时攻击下,不安全响应从第1轮的35%上升到第4轮的近80%。在相同的攻击者下,GPT-4.1-mini和Claude Sonnet 4.5在基线时统计上无差异,但到第4轮时差距扩大到19倍,这种差异在单轮评估中不可见。我们描述了四种退化轨迹特征,并识别出一个导致大多数灾难性失败的双元素攻击公式。一个轻量级的输入侧分类器将第4轮不安全响应降低了52个百分点,尽管准确性严重下降,但对良性查询的45%误报率是主要的部署限制。还出现了一个方法论发现:Claude Sonnet在超过一半的后期对话中拒绝生成对抗性消息,尽管有明确的红队框架,这表明安全训练可能泛化到攻击者角色。

英文摘要

Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate GPT-4.1-mini under fixed template, template-adaptive, and live adversarial attacks. Unsafe responses rise from 35% to nearly 80% by Turn 4 under live attack. Under the same adversary, GPT-4.1-mini and Claude Sonnet 4.5 are statistically indistinguishable at baseline but diverge to a 19x gap by Turn 4, a difference invisible to single-turn evaluation. We characterize four degradation trajectory signatures and identify a two-element attack formula responsible for most catastrophic failures. A lightweight input-side classifier reduces Turn 4 unsafe responses by 52 percentage points despite severe accuracy degradation, but the 45% false alarm rate on benign queries is the primary deployment constraint. A methodological finding also emerges: Claude Sonnet refused to generate adversarial messages in over half of late-turn conversations despite explicit red team framing, suggesting safety training may generalize to the attacker role.

2606.02623 2026-06-03 cs.NE cs.AI cs.LG 版本更新

Oscillatory State-Space Models as Inductive Biases for Physics-Informed Neural PDE Solvers

振荡状态空间模型作为物理信息神经PDE求解器的归纳偏置

Abhishek Chandra, Taniya Kapoor

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Wageningen University & Research(瓦赫宁根大学与研究中心)

AI总结 提出一种结合振荡状态空间动力学和PDE感知空间谱的PINN方法,以改进时变PDE求解的精度和内存效率。

详情
AI中文摘要

求解时变偏微分方程(PDE)是计算科学与工程中的一个重要问题。物理信息神经网络(PINN)从控制方程中学习PDE解。然而,准确捕捉时间演化仍然具有挑战性。最近的基于序列模型的方法使用通用序列模型参数化时间演化,这些模型捕捉时间依赖性,但没有显式编码PDE解的结构化动力学。此外,它们的内存需求可能随序列长度和分辨率而不利地扩展,限制了在大规模或高维设置中的适用性。本文介绍了一种PINN方法,该方法结合了振荡状态空间动力学来表示PDE解的模态结构。所提出的方法利用基于线性振荡器的时间演化,以及空间上的PDE感知谱基。这种设计实现了闭式空间微分和边界条件的一致强制执行。该方法在前向、逆和高维PDE问题上进行了评估,包括高达100个空间维度的情况。结果表明,与最近基于序列模型的PINN方法相比,该方法提高了精度并减少了内存使用。总体而言,本文强调了将结构化动力学先验纳入神经PDE求解器的时间演化中的好处,并建议设计更符合物理和计算高效的PINN架构。

英文摘要

Solving time-dependent partial differential equations (PDEs) is an important problem in computational science and engineering. Physics-informed neural networks (PINNs) learn PDE solutions from governing equations. However, accurately capturing temporal evolution remains challenging. Recent sequence-model-based approaches parameterize time evolution using general-purpose sequence models, which capture temporal dependencies but do not explicitly encode the structured dynamics of PDE solutions. In addition, their memory requirements can scale unfavorably with sequence length and resolution, limiting applicability in large-scale or high-dimensional settings. This work introduces a PINN approach that incorporates oscillatory state-space dynamics to represent the modal structure of PDE solutions. The proposed method leverages a linear-oscillator-based temporal evolution, together with a PDE-aware spectral basis in space. This design enables closed-form spatial differentiation and consistent enforcement of boundary conditions. The method is evaluated on forward, inverse, and high-dimensional PDE problems, including cases up to 100 spatial dimensions. The results show improved accuracy and reduced memory usage compared to recent sequence-model-based PINN approaches. Overall, this work highlights the benefits of incorporating structured dynamical priors into the temporal evolution of neural PDE solvers and suggests designing more physics-aligned and computationally efficient PINN architectures.

2606.02618 2026-06-03 cs.CE cs.AI cs.MA physics.chem-ph 版本更新

Closed-Loop Molecular Design with Calibrated Deference

闭环分子设计中的校准式退让

Newman Cheng, Gordon Broadbent, Jason Dong, Syed Mohammed Ali Hussaini, Farman Ullah, Morris Sharp, Gabrielle Barnes, Nanlin Guo, Deyu Zou, Karin Strauss, William Chappell, David G. Kwabi, Bichlien H. Nguyen, Jake A. Smith

发表机构 * Microsoft Discovery & Quantum(微软发现与量子) Microsoft Research(微软研究院) Department of Chemical and Environmental Engineering, Yale University(耶鲁大学化学与环境工程系) CanAm Bioresearch Inc.(CanAm 生物研究公司)

AI总结 提出CLIO智能体,通过持续更新的信念状态图和递归计划-行动循环实现校准式退让,在闭环人机协作中成功设计出性能优于文献基准的AORFB负极电解液。

详情
AI中文摘要

我们提出了通过原位优化实现认知循环(CLIO),这是一种将持续更新的信念状态图与递归计划-行动循环相结合的智能体。结果产生了一个推理智能体,能够贡献某种定性的不同之处,我们称之为“校准式退让”:即识别自身工具或假设何时失败、相应调整策略、并生成指导实验修订的机制性假设的能力。我们在一个闭环人机协作活动中测试了CLIO,以设计一种水性有机氧化还原液流电池(AORFB)负极电解液,CLIO在与合成、表征并参与设计选择的化学家密切合作中主导了提议和解释。在三轮共17个候选分子中,CLIO收敛于一个最佳的膦酸酯候选物;表征证实其氧化还原电位比文献基准提高了130 mV。随后表征揭示了出乎意料的差电化学可逆性——这是所有性质预测器都未能标记的回归。CLIO生成了相互竞争的机制性假设,优先安排了诊断性实验,将失败归因于膦酸酯-钾离子配对,并建议用磺酸酯替代。所得化合物显示出显著改善的电化学可逆性,并保持了90 mV的氧化还原电位提升,从而闭环了设计-制造-测试-再设计循环。

英文摘要

We present Cognitive Loop via In-Situ Optimization (CLIO), an agent that couples a continuously-updated belief-state graph with a recursive plan-then-act loop. The result is a reasoning agent that can contribute something qualitatively different, which we term *calibrated deference*: the capacity to recognize when its own tools or assumptions are failing, to adapt its strategy in response, and to generate mechanistic hypotheses that guide experimental revision. We tested CLIO in a closed-loop human-AI campaign to design an aqueous organic redox flow battery (AORFB) negolyte, with CLIO leading proposal and interpretation in close partnership with chemists who synthesized, characterized, and weighed in on design choices. Across 17 candidates over three rounds, CLIO converged on a top phosphonate candidate; characterization confirmed a 130 mV improvement in redox potential over the literature baseline. Characterization then revealed unexpectedly poor electrochemical reversibility -- a regression no property predictor had flagged. CLIO generated competing mechanistic hypotheses, prioritized discriminating diagnostics, traced the failure to phosphonate-potassium ion pairing, and prescribed a sulfonate replacement. The resulting compound showed substantially improved electrochemical reversibility and maintained a 90 mV improvement in redox potential, closing the design-make-test-redesign loop.

2606.02614 2026-06-03 cs.CE cs.AI 版本更新

Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin

边际博弈:巴西赤道边缘地区公共政策分析的多智能体系统

Antonio de Sousa Leitão Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa, Luís Jorge Mesquita de Jesus, Dennys Correia da Silva, Allan Kardec Duailibe Barros Filho

发表机构 * Aia Context Universidade Federal do Maranhão — UFMA(马拉尼昂联邦大学) Universidade Estadual de Campinas — UNICAMP(坎皮纳斯州立大学)

AI总结 针对巴西赤道边缘地区石油勘探对马拉尼昂州福利影响的问题,提出基于多智能体强化学习(MARL)的仿真系统Margin Play,通过CTDE范式和BRO-MARL训练六个智能体,发现福利增益取决于制度安排,MA-Prospero配置可显著提升福利并降低环境负债。

详情
AI中文摘要

巴西赤道边缘(BEM)是巴西下一个海上石油前沿,预计于2026年在亚马逊福斯盆地开始运营。其资产在财政和领土上主要与马拉尼昂州相关联——该州在联邦中人类发展指数最低(0.676,IBGE 2022)。这引出了核心政策问题:在什么条件下,BEM的勘探能为马拉尼昂州产生净正外部性?问题本质上是多智能体的:联邦政府寻求收入和能源安全;州政府在宪法规定的特许权使用费专用下寻求区域福利;运营商在风险下最大化利润;ANP和IBAMA持有冲突的职责;亚马逊社区优先考虑领土和环境因素而非货币收入。我们提出Margin Play,一个多智能体强化学习(MARL)系统,在巴西经验校准和经典经济学文献下模拟这些张力。它实现了CTDE范式下的六个智能体,使用BRO-MARL进行训练。来自六个场景中60,000个回合的结果表明,答案取决于制度安排:在参考基线之下,福利增益微乎其微(Waval约1.68),而MA-Prospero配置产生Delta W = +17.5%和Delta Rcom = +21.3%,同时环境负债较低(Eamb = 0.048 vs. 0.076)。根本问题并非生产与福利之间的权衡,而是与勘探相关的公共政策制度的选择。

英文摘要

The Brazilian Equatorial Margin (BEM) is Brazil's next offshore oil frontier, with operations expected to begin in 2026 in the Foz do Amazonas basin. Its assets are fiscally and territorially linked primarily to Maranhao -- the state with the lowest HDI in the Federation (0.676, IBGE 2022). This raises the central policy question: under what conditions does BEM exploration generate net positive externalities for Maranhao? The problem is intrinsically multi-agent: the Federal Government seeks revenue and energy security; the state seeks regional welfare under constitutional royalty earmarking; the operator maximizes profit under risk; ANP and IBAMA hold conflicting mandates; and Amazonian communities prioritize territorial and environmental vectors over monetary income. We present Margin Play, a Multi-Agent Reinforcement Learning (MARL) system simulating these tensions under Brazilian empirical calibration and classical economic literature. It implements six agents under the CTDE paradigm, trained with BRO-MARL. Results from 60,000 episodes across six scenarios indicate the answer is conditional on the institutional regime: under the reference baseline, the welfare gain is marginal (Waval approx. 1.68), whereas the MA-Prospero configuration yields Delta W = +17.5% and Delta Rcom = +21.3%, with a lower environmental liability (Eamb = 0.048 vs. 0.076). The fundamental problem is not a trade-off between production and welfare, but the choice of public policy regime linked to exploration.

2606.02610 2026-06-03 cs.CE cs.AI cs.LG physics.ao-ph 版本更新

Samudra 2: Scaling Ocean Emulators across Resolutions

Samudra 2: 跨分辨率扩展海洋仿真器

Yuan Yuan, Jesse Rusak, Alexander Merose, Adam Subel, Pavel Perezhogin, Alistair Adcroft, Carlos Fernandez-Granda, Laure Zanna

发表机构 * Courant Institute School of Mathematics, Computing, and Data Science, New York University(Courant学院数学、计算与数据科学系,纽约大学) Open Athena AI Foundation, Inc.(开放Athena人工智能基金会) Program in Atmospheric and Oceanic Sciences, Princeton University(大气与海洋科学项目,普林斯顿大学)

AI总结 针对现有海洋神经仿真器在长期自回归滚动中出现的方差崩溃和印记伪影问题,提出Samudra 2,通过改进U-Net骨干网络和动态损失函数,在1°分辨率下将上层海洋全球平均温度R²从0.56提升至0.87,并将深层海洋温度误差降低约七倍,且可扩展至1/2°和1/4°分辨率。

AI中文摘要

海洋环流模式(OGCM)对气候科学至关重要,但计算成本高,限制了集合规模和强迫情景。神经仿真器有望实现数量级的加速,然而现有的海洋仿真器未能将精细空间分辨率与多年自回归滚动相结合。Samudra是第一个产生多十年全球滚动的自回归神经海洋仿真器,但仅限于$1^\circ$分辨率,并表现出两种长期故障模式:\emph{方差崩溃},即时间变异性的丧失,以及\emph{印记伪影},即速度模式泄漏到深海场中。我们提出Samudra 2,它引入了更宽的U-Net骨干网络,采用修改后的ConvNeXt风格块和减小的块内扩展因子,以及一个动态损失函数,根据预测误差重新加权输出通道,从而增强缓慢演变的深海场的梯度。在$1^\circ$分辨率下,Samudra 2将上层海洋全球平均温度$R^2$从0.56提高到0.87,并将深海温度误差降低约七倍。相同的架构可扩展到$1/2^\circ$和$1/4^\circ$分辨率,在大约8年的自回归滚动中恢复中尺度涡旋和尖锐的西边界流。在单个GPU上运行,Samudra 2能够为海平面预测、海洋热吸收和气候变率研究提供更大的集合。我们在 https://openathena.ai/Ocean_Emulator/ 提供代码、文档和基准资源。

英文摘要

Ocean general circulation models (OGCMs) are essential to climate science but computationally expensive, limiting ensemble size and forcing scenarios. Neural emulators promise orders-of-magnitude speedups, yet existing ocean emulators have not combined fine spatial resolution with multi-year autoregressive rollouts. Samudra, the first autoregressive neural ocean emulator to produce multi-decade global rollouts, is limited to $1^\circ$ resolution and exhibits two long-horizon failure modes: \emph{variance collapse}, the loss of temporal variability, and \emph{imprinting artifacts}, in which velocity patterns leak into deep-ocean fields. We present Samudra 2, which introduces a wider U-Net backbone with modified ConvNeXt-style blocks and a reduced block-internal expansion factor, together with a dynamic loss that reweights output channels according to their prediction errors, strengthening gradients for slow-evolving deep-ocean fields. At $1^\circ$, Samudra 2 increases upper-ocean global-mean temperature $R^2$ from 0.56 to 0.87 and reduces deep-ocean temperature error by roughly sevenfold. The same architecture scales to $1/2^\circ$ and $1/4^\circ$ over approximately 8-year autoregressive rollouts, recovering mesoscale eddies and sharp western boundary currents. Running on a single GPU, Samudra 2 enables larger ensembles for sea-level projections, ocean heat uptake, and climate variability studies. We provide code, documentation, and benchmark resources at https://openathena.ai/Ocean_Emulator/.

2606.02607 2026-06-03 cs.LG cs.AI cs.CR 版本更新

Geometry-Aware Tabular Diffusion

几何感知表格扩散

David Turtora Zagardo

发表机构 * arXiv

AI总结 提出几何感知表格扩散(GATD),通过向扩散去噪器注入列值差异的成对角度和长度作为输入和辅助目标,以显式建模列间关系,在10个数据集上以更少参数取得SOTA性能。

Comments Accepted to the ICML 2026 main track. 24 pages, 10 figures, 22 tables

AI中文摘要

表格合成对于隐私保护的共享和增强至关重要,然而扩散模型依赖隐式机制来捕捉列间关系。我们引入了几何感知表格扩散(GATD),它通过从列值差异计算出的成对角度和长度来增强表格扩散去噪器,并将其用作输入和辅助目标。我们的MLP实例化在平均使用3.5倍更少参数(对于分类任务最多25倍)的情况下实现了最先进的基准性能:在十个数据集上,它在8/10的形状、7/10的趋势和9/10的下游效用(F1/RMSE)上获胜,将形状和趋势误差分别降低了27%和20%。默认损失权重可迁移到GNN和Transformer去噪器,在27/30个架构-数据集单元上改善了形状,在25/30上改善了趋势。一项匹配的消融实验表明,监督(而非额外输入或容量)驱动了性能提升。这表明显式关系监督是表格扩散的一种可移植归纳偏置。

英文摘要

Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences and used as inputs and auxiliary targets. Our MLP instantiation achieves state-of-the-art benchmark performance while using 3.5x fewer parameters on average (up to 25x for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.

2606.02606 2026-06-03 cs.LG cs.AI 版本更新

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

ReLoRA: 面向演化LLM服务快速部署的知识复用适配

Yang Xu, Zihuai Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Xitong Fu

发表机构 * School of Computer Science and Technology, University of Science and Technology of China(计算机科学与技术学院,中国科学技术大学) Suzhou Institute for Advanced Research, University of Science and Technology of China(苏州先进研究院,中国科学技术大学)

AI总结 针对基础模型频繁更新导致已有LoRA适配器失效的问题,提出ReLoRA框架,通过贝叶斯优化初始化与调度正则化微调,实现知识复用与快速重新适配,降低计算开销并提升性能。

AI中文摘要

大型语言模型(LLM)越来越多地被部署为持续演化的服务,其中频繁的基础模型更新可能使先前部署的任务特定低秩适配(LoRA)适配器失效。对于管理众多下游模型服务的提供商来说,为每个更新的基础模型从头重新训练每个LoRA适配器在计算上代价高昂,并延迟服务部署。同时,更简单的替代方案,即简单地将原始LoRA适配器应用于更新的基础模型,由于适配器-骨干网络不兼容,常常导致服务质量下降。为了解决这个问题,我们提出了ReLoRA,一种知识复用的重新适配框架,能够高效地为演化的LLM服务恢复可用的LoRA适配器,同时保持或提升任务性能。具体来说,ReLoRA包含两个关键的优化步骤:1)自适应LoRA初始化利用贝叶斯优化,通过融合先前部署的任务适配器和基础模型演化的信息,构建一个兼容性感知的起点;2)带调度正则化的微调首先通过强正则化快速将适配器引导至高质量区域,随后通过放松正则化进行任务特定精炼。这种设计使得在减少重新适配开销的同时,能够快速恢复服务质量。大量实验表明,与基线相比,ReLoRA将就绪时间减少高达8.9倍,准确率提升高达4.6%。

英文摘要

Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service providers managing numerous downstream model services, retraining each LoRA adapter from scratch for every updated base model is computationally prohibitive and delays service rollout. Meanwhile, the simpler alternative, i.e., naively applying the original LoRA adapter to the updated base model, often leads to degraded service quality due to adapter-backbone incompatibility. To address this problem, we propose ReLoRA, a knowledge-reusing re-adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services while preserving or improving task performance. Specifically, ReLoRA comprises two key optimization steps: 1) Adaptive LoRA initialization leverages Bayesian optimization to construct a compatibility-aware starting point by fusing information from both the previously deployed task adapter and the base model's evolution; 2) Fine-tuning with scheduled regularization first rapidly steers the adapter to a high-quality region via strong regularization, followed by relaxed regularization for task-specific refinement. This design enables rapid service-quality recovery with reduced re-adaptation overhead. Extensive experiments demonstrate that ReLoRA reduces time-to-readiness by up to 8.9$\times$ and improves accuracy by up to 4.6\% compared to baselines.

2606.02605 2026-06-03 cs.LG cs.AI eess.IV 版本更新

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

用于严重狭窄分类的心电图与血管造影表示的跨模态对比学习

Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer, Marcus Brugger, Daniel Rueckert, Philip Müller

发表机构 * Chair for AI in Healthcare and Medicine, Technical University of Munich and TUM University Hospital(人工智能在医疗与医学中的研究所,慕尼黑技术大学及慕尼黑大学医院) Department of Computing, Imperial College London(伦敦帝国理工学院计算机系) Munich Center for Machine Learning (MCML), Munich, Germany(慕尼黑机器学习中心(MCML)) Department of Internal Medicine, TUM University Hospital(慕尼黑大学医院内科学系)

AI总结 提出StenCE预训练框架,通过跨模态对比学习从心电图特征中实现冠状动脉狭窄风险分层,在严重狭窄分类中首次达到高性能。

AI中文摘要

冠状动脉狭窄是一种常见的心血管疾病,未经治疗的严重病例具有显著的心肌梗死风险。尽管冠状动脉(X射线)血管造影仍是狭窄诊断的金标准,但其具有侵入性、耗时且资源密集,因此仅对基于症状和既往临床测试具有高疾病概率的患者进行。然而,一部分患者,尤其是无症状患者,可能仍未被诊断。从心电图(ECG)中检测狭窄的迹象,由于心电图快速、廉价、无创,因此即使在无症状患者中也常规采集,将支持早期诊断。然而,由于在心电图中尚未识别出可靠的狭窄特异性信号,目前无法用于狭窄风险分层。为解决这一问题,我们引入了StenCE,一个预训练框架,允许基于直接从心电图导出的特征对患者进行分层。在不同狭窄严重程度阈值和额外心电图疾病分类任务上的评估表明,不同心电图编码器均取得了一致的性能提升,优于先前的工作。所获得的模型成功检测到心电图中用于狭窄诊断的信号,并且是首个在严重狭窄分类中实现高性能的模型。源代码可在以下网址获取:https://github.com/NikolaCenic/ecg-stenosis-cls。

英文摘要

Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time- and resource-intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non-invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis-specific signal has been identified in ECGs, they cannot currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework that allows stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at https://github.com/NikolaCenic/ecg-stenosis-cls.

2606.02604 2026-06-03 cs.LG cs.AI 版本更新

Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation

来自碎片化ESG数据的可审计气候风险智能:面向范围1-3验证的确定性编排与不平衡感知学习

Karan Sehgal, Khawar Naveed Bhatti

发表机构 * Kent Business School, University of Kent(肯特大学 Kent 商学院)

AI总结 针对ESG数据碎片化及传统验证缺乏可审计性的问题,提出一种融合确定性编排、时序异常检测、不平衡感知集成学习和可解释治理的框架,并构建合成基准实现可复现验证。

Comments 22 pages, 7 figures. Preprint

AI中文摘要

ESG和气候风险数据在异构的范围1、范围2和范围3报告环境中仍然碎片化,而传统的验证流程缺乏来源感知的可审计性、隐藏漂移检测和面向可复现性的治理。本文提出一个确定性气候风险智能框架,整合单一真相来源编排、时序异常检测、不平衡感知集成学习和面向可解释性的治理,用于可审计的ESG验证。为支持开放复现,我们构建并发布了一个合成ESG验证基准,该基准根据GHG协议、PCAF和ISSB标准的公开报告特征进行校准。该方法包括时序漂移分析、基于SMOTE的罕见事件优化、集成学习、来源感知编排以及基于TreeSHAP的可解释性,用于治理检查和审计重建。我们使用分类指标(召回率、F1、ROC AUC)、校准指标(ECE、Brier分数)以及面向治理的审计轨迹完整性度量(衡量可重建确定性来源到升级来源链的标记异常比例)将框架与统计分类器、异常检测方法、时序预测基线和基于阈值的系统进行评估。结果以分层五折交叉验证的均值和标准差报告,并进行配对显著性检验。该框架将ESG报告重新定义为确定性气候风险治理基础设施,支持可复现性、可解释性和运营可审计性。

英文摘要

ESG and climate risk data remain fragmented across heterogeneous Scope 1, Scope 2, and Scope 3 reporting environments, while conventional validation pipelines lack provenance-aware auditability, hidden drift detection, and reproducibility-oriented governance. This paper proposes a deterministic climate risk intelligence framework integrating single-source-of-truth orchestration, temporal anomaly detection, imbalance-aware ensemble learning, and explainability-oriented governance for auditable ESG validation. To support open reproducibility, we construct and release a synthetic ESG validation benchmark calibrated against publicly reported characteristics of the GHG Protocol, PCAF, and ISSB standards. The methodology incorporates temporal drift analysis, SMOTE-based rare-event optimization, ensemble learning, provenance-aware orchestration, and TreeSHAP-based interpretability for governance inspection and audit reconstruction. We evaluate the framework against statistical classifiers, anomaly detection methods, temporal forecasting baselines, and a threshold-based system using classification metrics (recall, F1, ROC AUC), calibration metrics (ECE, Brier score), and a governance-oriented audit trace completeness metric measuring the fraction of flagged anomalies for which a deterministic source-to-escalation provenance chain can be reconstructed. Results are reported as mean and standard deviation across stratified five-fold cross-validation with paired significance testing. The framework reframes ESG reporting toward deterministic climate risk governance infrastructure supporting reproducibility, explainability, and operational auditability.
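The calibration metrics and the audit trace completeness metric named in this abstract can be illustrated with a minimal sketch. The Brier score and the binned expected calibration error (ECE) follow their usual binary definitions; the `chain_reconstructed` field and the equal-width binning are assumptions made for illustration, not details taken from the paper.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE over equal-width bins of the predicted positive probability.

    Each bin contributes |observed positive rate - mean predicted probability|,
    weighted by the share of samples that fall into the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(pos_rate - mean_p)
    return ece


def audit_trace_completeness(flagged):
    """Fraction of flagged anomalies whose provenance chain could be reconstructed.

    `chain_reconstructed` is a hypothetical field name used only in this sketch.
    """
    return sum(1 for f in flagged if f["chain_reconstructed"]) / len(flagged)
```

Lower values are better for both calibration metrics, while the completeness fraction is better when closer to 1.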

2606.02588 2026-06-03 cs.LO cs.AI cs.PL 版本更新

Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Lean-GAP:形式化研究生代数问题数据集

Seewoo Lee, Byung-Hak Hwang, Hyojae Lim, Jihoon Hyun, Ilkyoo Choi, Yeachan Park, Jineon Baek, Hyukpyo Hong, Keewoo Lee, Jaeseong Heo, Hyungryul Baik, Chul-hee Lee, Kyu-Hwan Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校) Korea Advanced Institute of Science and Technology(韩国科学技术院) Hanyang University(汉阳大学) Hufs University(Hufs大学) Sungkyunkwan University(成均馆大学) University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Sejong University(世宗大学) University of Connecticut(康涅狄格大学)

AI总结 本文提出Lean-GAP数据集,包含430个来自Dummit和Foote《抽象代数》的形式化研究生代数问题,并开发了从PDF预处理到自动形式化再到验证的可扩展流水线。

AI中文摘要

我们提出了Lean-GAP(Lean-研究生代数问题),包含来自Dummit和Foote的教科书《抽象代数》中的430个形式化研究生代数问题。我们开发了一个可扩展的流水线,包括PDF到LaTeX的预处理、自动形式化为Lean 4以及非正式-正式对应关系的验证。虽然预处理和自动形式化阶段可以很大程度上自动化,但我们发现验证仍然是最微妙和最劳动密集的组成部分,需要仔细的人工监督。我们的贡献包括:(i) 构建了一个结构化的形式化习题数据集,(ii) 一种系统化的教科书数学形式化方法,以及(iii) 对形式化过程中反复出现的挑战的分析。我们还比较了不同自动形式化模型的性能,并强调了将非正式陈述翻译为形式语言的关键瓶颈。

英文摘要

We present Lean-GAP (Lean-Graduate Algebra Problems), 430 formalized graduate-level algebra problems from the textbook Abstract Algebra by Dummit and Foote. We develop a scalable pipeline consisting of PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and verification of informal-formal correspondence. While the preprocessing and autoformalization stages can be largely automated, we find that verification remains the most subtle and labor-intensive component, requiring careful human oversight. Our contributions include (i) the construction of a structured dataset of formalized exercises, (ii) a systematic methodology for formalizing textbook mathematics, and (iii) an analysis of recurring challenges in the formalization process. We also compare the performance of different autoformalization models and highlight key bottlenecks in translating informal statements into formal language.

2606.02584 2026-06-03 cs.CL cs.AI cs.IR 版本更新

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX:习语理解、检索和解释的多语言基准

Ayman Ali Sharara

AI总结 提出IdiomX,一个大规模多语言习语基准,通过可复现的多阶段流水线构建,涵盖190K+上下文示例和12K+习语,定义四项任务(检测、上下文-习语检索、阿拉伯语-英语习语检索、习语解释),实验表明上下文Transformer模型提升检测,混合检索重排序增强单语和跨语言检索,习语解释可建模为语义检索任务。

Comments 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub

AI中文摘要

习语表达仍然是自然语言处理中的持续挑战,因为它们的含义通常是非组合性的、依赖于上下文的,并且难以跨语言对齐。现有的习语资源在规模、上下文多样性或多语言覆盖方面往往有限,限制了它们对现代语言模型的实用性。我们介绍了IdiomX,一个用于习语理解、检索和解释的大规模多语言基准,通过可复现的多阶段流水线构建,结合词汇资源提取、大规模归一化、受控的大语言模型丰富和结构化验证。生成的数据集包含超过190K个上下文示例,涵盖12K+习语,具有对齐的英语、阿拉伯语和法语语义表示、习语和字面用法标签以及丰富的语言元数据。基于这一资源,我们定义了一个统一的四任务基准,涵盖习语检测、上下文到习语检索、阿拉伯语到英语习语检索和习语解释,将评估从比喻识别扩展到语义基础和可解释的含义检索。实验表明,上下文Transformer模型显著提高了习语检测,而混合检索和重排序架构则显著增强了单语和跨语言习语检索。结果进一步表明,习语解释可以有效地建模为语义检索任务,将可解释性作为基准的补充维度。总体而言,IdiomX提供了一个可扩展的基准,用于研究从检测到检索和语义解释的习语语言进展,并提供了一个模块化框架,可扩展到其他语言和比喻推理任务。

英文摘要

Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.

2606.02581 2026-06-03 cs.IR cs.AI 版本更新

Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs

RAG中的成本感知查询路由:检索深度权衡的实证分析

Sanjay Mishra

AI总结 提出CA-RAG框架,通过为每个查询选择最优的检索深度和生成配置组合,在保证答案质量的同时减少令牌成本和延迟。

Comments 13 pages, 18 figures, 8 tables

AI中文摘要

检索增强生成(RAG)面临一个基本的三方权衡:更深的检索改善了事实基础,但增加了令牌成本和端到端延迟。静态检索配置无法解决异构查询工作负载下的这一矛盾——简单的定义性查询在不必要的上下文上浪费预算,而复杂的分析性提示则因浅层检索而得不到充分服务。本文介绍了\emph{成本感知RAG}(CA-RAG),这是一个逐查询路由框架,通过最大化一个标量效用(该效用线性结合了估计的质量先验与预测延迟和总计费令牌的归一化惩罚),从离散的\emph{策略包}目录中选择——每个包将检索深度(从无检索直接推理到top-$k{=}10$密集检索)与固定的生成配置配对。CA-RAG使用基于FAISS的密集检索和OpenAI聊天/嵌入API实现,并在涵盖四个策略包的28个查询基准上进行评估。路由器动态地使用所有策略包,与始终重度检索相比,实现了\textbf{26\%更少的计费令牌},与始终直接推理相比,实现了\textbf{34\%更低的平均延迟},同时保持等效的答案质量。逐查询增量分析表明,节省是非均匀的,集中在较简单的查询上,这激发了复杂度感知的护栏。敏感性分析证实,仅通过权重调整,相同的策略包目录即可支持多个成本-延迟-质量操作点。所有结果直接从记录的CSV工件生成,以实现完全可重复性。CA-RAG为成本意识型LLM部署提供了透明、可审计的基础。

英文摘要

Retrieval-augmented generation (RAG) faces a fundamental three-way tension: deeper retrieval improves factual grounding but inflates token costs and end-to-end latency. Static retrieval configurations cannot resolve this tension across heterogeneous query workloads -- simple definitional queries waste budget on unnecessary context, while complex analytical prompts are underserved by shallow retrieval. This paper introduces \emph{Cost-Aware RAG} (CA-RAG), a per-query routing framework that selects from a discrete catalog of \emph{strategy bundles} -- each coupling a retrieval depth (from retrieval-free direct inference to top-$k{=}10$ dense retrieval) with a fixed generation profile -- by maximizing a scalar utility that linearly combines an estimated quality prior with normalized penalties for predicted latency and total billed tokens. CA-RAG is implemented with FAISS-backed dense retrieval and OpenAI chat/embedding APIs, and evaluated on a 28-query benchmark spanning four bundles. The router dynamically exercises all bundles, achieving \textbf{26\% fewer billed tokens} than always-heavy retrieval and \textbf{34\% lower mean latency} than always-direct inference while maintaining equivalent answer quality. Per-query delta analysis reveals that savings are non-uniform and concentrated in simpler queries, motivating complexity-aware guardrails. Sensitivity analysis confirms that the same bundle catalog supports multiple cost-latency-quality operating points through weight adjustment alone. All results are generated directly from logged CSV artifacts for full reproducibility. CA-RAG provides a transparent, auditable foundation for cost-conscious LLM deployments.
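The routing rule described in this abstract, a scalar utility that linearly combines a quality prior with normalized latency and token penalties, can be sketched as follows. This is only an illustration under stated assumptions: the `Bundle` fields, the three bundles, the weights, and every number are hypothetical, and normalizing by the worst case in the catalog is one possible choice rather than the paper's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    name: str
    top_k: int             # 0 = retrieval-free direct inference
    quality_prior: float   # estimated answer quality for this query, in [0, 1]
    latency_s: float       # predicted end-to-end latency in seconds
    billed_tokens: int     # predicted total billed tokens


def utility(bundle, max_latency, max_tokens, w_quality=1.0, w_latency=0.3, w_tokens=0.3):
    """Quality prior minus penalties normalized by the worst case in the catalog."""
    return (w_quality * bundle.quality_prior
            - w_latency * bundle.latency_s / max_latency
            - w_tokens * bundle.billed_tokens / max_tokens)


def route(bundles, **weights):
    """Return the bundle with the highest utility for one query."""
    max_latency = max(b.latency_s for b in bundles)
    max_tokens = max(b.billed_tokens for b in bundles)
    return max(bundles, key=lambda b: utility(b, max_latency, max_tokens, **weights))


# Hypothetical per-query estimates; none of these numbers come from the paper.
simple_query = [
    Bundle("direct", 0, 0.85, 1.0, 300),
    Bundle("light", 3, 0.88, 1.5, 900),
    Bundle("heavy", 10, 0.90, 2.5, 2500),
]
complex_query = [
    Bundle("direct", 0, 0.30, 1.0, 300),
    Bundle("light", 3, 0.60, 1.5, 900),
    Bundle("heavy", 10, 0.95, 2.5, 2500),
]
```

With these made-up estimates, `route(simple_query)` picks the direct bundle and `route(complex_query)` picks the heavy one. Setting both penalty weights to zero moves the simple query to the heavy bundle as well, which is the kind of operating-point shift through weight adjustment that the abstract mentions.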

2606.03517 2026-06-03 quant-ph cs.AI cs.LG 版本更新

Scalable On-Hardware Training of Quantum Neural Networks and Application to Clinical Data Imputation

可扩展的量子神经网络片上训练及其在临床数据填补中的应用

Natansh Mathur, Panagiotis Kl. Barkoutsos, Masako Yamada, Martin Roetteler, Iordanis Kerenidis

发表机构 * IRIF, CNRS and Université Paris Cité(巴黎西岱大学 IRIF 实验室、法国国家科学研究中心) QC Ware, France(法国 QC Ware 公司) IonQ(IonQ 公司) Quantum Signals(量子信号)

AI总结 提出一种结合蝴蝶电路架构、逐层训练策略和并行化参数位移规则的训练框架,将梯度估计成本从O(n^2)降至O(log n),并在MIMIC-III数据集上验证了其可扩展性和性能。

Comments 13 pages, 9 figures

AI中文摘要

在量子硬件上训练量子神经网络(QNN)目前受限于梯度估计的成本:标准参数位移方法所需的电路评估次数随可训练参数数量二次增长,使得在小型系统之外难以进行基于硬件的优化。在这项工作中,我们引入了一个训练框架,将该成本降低到量子比特数的对数级别,使得在近期硬件上以更大规模进行基于梯度的QNN优化成为可能。我们的框架结合了三个协同设计的要素:(i)一种结构化的、保持子空间的蝴蝶电路架构,具有$O(n \log n)$个参数和对数深度;(ii)一种逐层训练策略,将片上优化限制在每次一个小型、结构良好的层上;(iii)一种并行化的参数位移规则,利用每个蝴蝶层内的交换结构,在恒定数量的电路执行中提取所有梯度。这些共同将每个优化步骤所需的独立电路评估次数从$O(n^2)$减少到$O(\log n)$。我们使用MIMIC-III电子健康记录数据集在临床数据填补上验证了该框架,这是一个对优化不稳定性和模型方差敏感的高要求基准。混合经典-量子模型直接在IonQ Forte Enterprise离子阱硬件上以16量子比特进行训练,性能相对于理想或噪声模拟没有下降,并通过张量网络模拟以32量子比特进行训练,32量子比特推理在硬件上执行。得到的模型在下游患者生存预测中匹配或超过强经典神经网络基线,同时表现出跨运行的低方差,证明了所提出的框架在现实硬件约束下实现了实用、可扩展的QNN训练。

英文摘要

Training quantum neural networks (QNNs) on quantum hardware is currently bottlenecked by the cost of gradient estimation: standard parameter-shift methods require a number of circuit evaluations that grows quadratically with the number of trainable parameters, making hardware-based optimisation impractical beyond small system sizes. In this work, we introduce a training framework that reduces this cost to logarithmic in the number of qubits, making gradient-based QNN optimisation feasible on near-term hardware at increasing scales. Our framework combines three co-designed ingredients: (i) a structured, subspace-preserving Butterfly circuit architecture with $O(n \log n)$ parameters and logarithmic depth; (ii) a layer-wise training strategy that confines on-hardware optimisation to one small, well-structured layer at a time; and (iii) a parallelised parameter-shift rule that exploits the commuting structure within each Butterfly layer to extract all gradients in a constant number of circuit executions. Together these reduce the number of distinct circuit evaluations per optimisation step from $O(n^2)$ to $O(\log n)$. We validate the framework on clinical data imputation using the MIMIC-III electronic health record dataset, a demanding benchmark sensitive to optimisation instability and model variance. Hybrid classical-quantum models are trained directly on IonQ Forte Enterprise trapped-ion hardware at 16 qubits without performance degradation relative to ideal or noisy simulation and via tensor-network simulation at 32 qubits, with 32-qubit inference executed on hardware. The resulting models match or exceed strong classical neural baselines in downstream patient survival prediction while exhibiting reduced variance across runs, demonstrating that the proposed framework enables practical, scalable QNN training under realistic hardware constraints.
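For context on the gradient-estimation cost, the textbook parameter-shift rule recovers an exact derivative from two shifted circuit evaluations per parameter when the gate is generated by a Pauli operator. The sketch below applies that baseline rule to a classical stand-in for a circuit: the product of cos(theta_i), which is the ZZ...Z expectation value of independent RY rotations acting on the all-zero state. The Butterfly architecture and the parallelised rule from the paper are not reproduced here, and all function names are illustrative.

```python
import math


def expectation(thetas):
    """Toy circuit: <Z...Z> after RY(theta_i) on each qubit of |0...0> is prod cos(theta_i)."""
    value = 1.0
    for theta in thetas:
        value *= math.cos(theta)
    return value


def parameter_shift_gradient(f, thetas):
    """Baseline parameter-shift rule with shifts of +/- pi/2, one parameter at a time."""
    grads = []
    evaluations = 0
    for i in range(len(thetas)):
        plus = list(thetas)
        minus = list(thetas)
        plus[i] += math.pi / 2
        minus[i] -= math.pi / 2
        grads.append((f(plus) - f(minus)) / 2.0)
        evaluations += 2  # two circuit runs for every parameter
    return grads, evaluations


grads, evaluations = parameter_shift_gradient(expectation, [0.3, 1.1])
```

In this unparallelised form the evaluation count grows with every added parameter, which is the cost a parallelised rule aims to avoid.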

2606.02646 2026-06-03 physics.soc-ph cs.AI cs.MA 版本更新

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

多智能体大语言模型系统中的林格曼效应:有效团队规模的缩放定律

Blaž Bertalanič, Carolina Fortuna

发表机构 * Jozef Stefan Institute(乔泽夫·斯蒂芬研究所)

AI总结 本文推导出两参数缩放定律 $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$,将多智能体LLM系统分为三种渐近状态,并通过44个实验单元验证了该定律,发现密集辩论无法增加答案多样性,噪声安慰剂可模拟自我修正效果,且仅异构团队能突破硬上限。

Comments 41 pages, 9 figures, 20 tables

详情
AI中文摘要

推理时多智能体大语言模型缩放缺乏共享单位:计数名义智能体混淆了成本与独立证据。我们推导出一个两参数缩放定律 $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$,其中状态指数 $\beta$ 将任何配置分类为三种渐近状态之一——硬上限为 $1/c$($\beta = 0$)、亚线性为 $N^\beta/c$($0 < \beta < 1$)或线性($\beta \ge 1$),并且平均场定理预测智能体辩论中的同伴数量 $k$ 和轮次 $\tau$ 仅通过其乘积 $k\tau$ 进入动力学。该定律适用于两个层面:答案多样性和正确性冗余。在44个(模型 $\times$ 任务 $\times$ 条件)单元中,涵盖同伴辩论、自我修正、随机噪声安慰剂、自一致性、三个开放权重系列(Qwen、Llama、Ministral)从7B到32B规模,并辅以前沿API检查(Gemini)、思维模型、异构团队和稀疏通信,函数形式在每个条件下拟合 $R^2 > 0.99$;仅 $(c, \beta)$ 发生偏移。在自由形式数学问题上,密集同伴影响将答案层面状态从亚线性坍缩为硬上限;正确性层面拟合始终保持硬上限。三个发现具有实际意义。 (i) 三十个密集辩论智能体在MMLU-Hard上产生的答案多样性不超过一个智能体。 (ii) 噪声安慰剂在自由形式数学问题及4倍规模下追踪自我修正,因此在同质团队中,通常归因于“辩论”的收益来自重新评估,而非同伴内容。 (iii) 单个 $N \le 5$ 的试点预测了 $N=30$ 的结构上限,并且在测试的配置中,只有架构多样性(异构团队)降低了 $c$ 并逃离了硬上限状态,通信模式干预则不能。

英文摘要

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$ where the regime exponent $\beta$ classifies any configuration into one of three asymptotic regimes -- hard-ceiling at $1/c$ ($\beta = 0$), sublinear at $N^\beta/c$ ($0 < \beta < 1$), or linear ($\beta \ge 1$) -- and a mean-field theorem predicts that peer count $k$ and rounds $\tau$ during agent debate enter the dynamics only through their product $k\tau$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, \beta)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime; communication-mode interventions do not.
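The three regimes follow directly from the closed form, which a short numerical sketch makes concrete. The functions below evaluate the scaling law as written in the abstract; the values of c and the exponent used in the usage note are arbitrary illustrations, not fitted parameters from the paper.

```python
def redundancy_ratio(n, c, beta):
    """R(N) = N_eff / N = 1 / (1 + c * (N - 1) * N**(-beta))."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_team_size(n, c, beta):
    """N_eff = N * R(N): the number of agents' worth of independent evidence."""
    return n * redundancy_ratio(n, c, beta)
```

With c = 0.5 (an arbitrary choice), an exponent of 0 makes `effective_team_size` approach the hard ceiling 1/c = 2 as N grows, an exponent of 0.5 makes it track N^0.5 / c, and an exponent of 1 makes it grow in proportion to N. A single agent always gives an effective size of 1.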

2606.02461 2026-06-03 cs.AI cs.CL 版本更新

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

AGENTCL:面向语言代理持续学习的严格评估

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

发表机构 * The Ohio State University(俄亥俄州立大学) Johns Hopkins University(约翰霍普金斯大学) Intuit AI Research(Intuit AI研究)

AI总结 提出AGENTCL评估框架,通过可控任务流和迁移增益指标,严格评估语言代理的持续学习能力,并开发MemProbe探针方法诊断记忆设计的影响。

Comments 10 pages in the main text, 26 pages in total

AI中文摘要

语言代理在解决单个任务上花费大量推理时间,但一个回合中获得的经验在后续回合中往往未被充分利用。持续学习期望代理在任务流中积累可重用经验,随时间改进,并避免无关经验的干扰。不幸的是,现有基准难以严格评估语言代理中的持续学习。大多数工作侧重于长上下文对话或文档的检索和推理,而最近的生命周期适应基准通常依赖于简单的任务流,对跨任务关系的分析有限,使得难以理解代理随时间学习和重用的内容。本文提出了一个用于代理持续学习的评估框架AGENTCL,其核心是受控任务流和迁移增益指标。AGENTCL构建了组合流,其中早期的子解决方案、证据或工作流有意在后续任务中可重用,并与不保证这种可重用性的简单流形成对比。我们使用该基准评估用于持续学习的非参数记忆设计。为了诊断记忆设计选择如何影响持续学习,我们开发了MemProbe,一种探针方法,存储交互、洞察和技能,同时在整合过程中过滤不可靠的经验。跨编码、深度研究和语言理解/推理任务的实证分析表明,简单流区分记忆设计的能力有限,而受控流更清晰地区分其可塑性。同时,简单和保留设置通常产生有限的增益,并可能暴露记忆引起的退化。这些结果突显了需要更强的记忆设计,以平衡可塑性和稳定重用。

英文摘要

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

2606.02332 2026-06-03 cs.AI cs.CL cs.LG 版本更新

Forget Attention: Importance-Aware Attention Is All You Need

忘记注意力:重要性感知注意力即你所需

Suhyeong Shin, Yeongwook Yang

发表机构 * Department of Computer Engineering(计算机工程系)

AI总结 提出SISA方法,通过将状态空间模型的重要性信号直接融入注意力分数计算,实现分数级融合,在语言建模中兼顾全局检索与重要性排序。

Comments 20 pages, 6 figures, 25 tables

AI中文摘要

将注意力的全局检索与状态空间模型(SSM)的顺序重要性信号相结合是混合语言建模的开放挑战。Transformer能看见所有位置但无法区分优先级;SSM知道什么重要但无法重新访问。现有混合模型——Jamba(块级)和Hymba(头级)——将两者置于独立模块,因此在注意力计算过程中彼此无法相互影响。我们提出SISA(SSM引导的Softmax注意力),该方法在注意力分数内部直接添加SSM导出的重要性项,并通过在增强的查询/键向量上执行单个SDPA调用来实现完整操作——无需循环状态,无需自定义内核。在152M/5B token上,SISA在LAMBADA-greedy上达到17.3%(对比Transformer的13.9和Mamba-3的15.5),并从第1K步起实现NIAH 100%,比Transformer的检索收敛速度快7倍;在369M规模下,Mamba-3在LAMBADA上领先,而SISA保持完美的NIAH和标准SDPA执行。因此,SISA为SSM-注意力混合模型定义了第三个设计轴——分数级融合——超越了此前主导该领域的块级和头级范式。

英文摘要

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
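One way an additive per-key importance term can be folded into a single standard attention call is to append a constant 1 to every query and the key's importance score to every key, so the dot product of the augmented vectors equals the original dot product plus the importance. The sketch below checks that identity in plain Python. It is an assumption about how score-level fusion could be realized, not the construction used in the paper, and the importance values are arbitrary numbers rather than outputs of a state space model.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values, scale, bias=None):
    """Non-causal dot-product attention; `bias` optionally adds one score per key."""
    bias = bias if bias is not None else [0.0] * len(keys)
    out = []
    for q in queries:
        weights = softmax([(dot(q, k) + b) * scale for k, b in zip(keys, bias)])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def augment(queries, keys, importance):
    """Append 1 to each query and the per-key importance score to each key."""
    aug_q = [q + [1.0] for q in queries]
    aug_k = [k + [s] for k, s in zip(keys, importance)]
    return aug_q, aug_k


queries = [[1.0, 0.0], [0.5, 0.5]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
importance = [0.0, 2.0, -1.0]   # hypothetical importance scores
scale = 1.0 / math.sqrt(2)

explicit = attention(queries, keys, values, scale, bias=importance)
aug_q, aug_k = augment(queries, keys, importance)
fused = attention(aug_q, aug_k, values, scale)
```

Here `explicit` and `fused` agree, because the extra coordinate contributes exactly the importance score to each attention logit. The scale is passed explicitly so that both paths use the same value even though the augmented vectors are one dimension longer.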

2606.02240 2026-06-03 cs.CR cs.AI cs.CL cs.ET 版本更新

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

AgentRedBench: 针对SaaS集成的LLM代理的动态红队测试与集成感知防御

Hiskias Dingeto, William Leeney

发表机构 * StackOne Technologies(StackOne技术公司)

AI总结 针对LLM代理在工具使用中面临的间接提示注入威胁,提出动态红队基准AGENTREDBENCH(覆盖24个企业集成、5种攻击类型)和基于集成多样语料训练的防御模型AGENTREDGUARD,将攻击成功率从69.9%降至2.4%,误报率仅0.37%。

AI中文摘要

工具使用代理中的间接提示注入是一个具体的生产威胁:LLM代理读取来自集成(通过工具调用访问的第三方服务,如Gmail、Salesforce或Jira)的响应内容,用户既未编写也无法控制这些内容。现有基准低估了该威胁:大多数仅覆盖少量集成,且每次运行重复相同的攻击载荷,而开源防护模型是在聊天风格数据而非工具响应内容上训练的。我们引入了AGENTREDBENCH,这是一个动态的LLM驱动的红队测试基准,包含215个微妙的未明确授权场景(在用户请求授权边界上的攻击),涵盖9个功能家族、24个企业集成和5种攻击类型。在八模型面板(Anthropic、OpenAI、Google)上,无防护的攻击成功率(ASR)范围从32%(Claude Sonnet 4.6)到81%(Gemini 3 Flash)。为了保持场景集不在训练语料中,并随时间保持标题ASR的意义,我们开源了代码库、集成模式和AGENTREDGUARD模型;规范场景通过维护者中介渠道进行评估,具有不可变版本控制。我们随基准发布了AGENTREDGUARD:一个在集成多样化的对抗性工具响应内容语料上训练的防护模型。AGENTREDGUARD将面板ASR从69.9%降至2.4%,误报率为0.37%,在两个指标上均优于所有具有非平凡检测能力的开源基线(Llama Guard、PromptGuard 2、ProtectAI)。跨集成和跨攻击类型的保留测试均证实了增益在训练子集之外具有迁移性。

英文摘要

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic LLM-driven redteaming benchmark of 215 subtle underspecified authorization (attacks at the boundary of what the user's request authorises) scenarios across 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration-diverse corpus of adversarial tool-response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack type holdouts both confirm the gain transfers beyond the training subset.
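The two headline metrics, attack success rate (ASR) and false-positive rate, can be computed from per-case outcomes as in the sketch below. The record fields (`is_attack`, `guard_blocked`, `agent_complied`) are hypothetical names chosen for illustration; the benchmark's actual schema is not described in the abstract.

```python
def evaluate_guard(cases):
    """Return (attack success rate, false-positive rate) for a list of case records.

    An attack counts as successful when the guard did not block the tool response
    and the agent carried out the injected instruction. A false positive is a
    benign tool response that the guard blocked.
    """
    attacks = [c for c in cases if c["is_attack"]]
    benign = [c for c in cases if not c["is_attack"]]
    successes = sum(1 for c in attacks if not c["guard_blocked"] and c["agent_complied"])
    false_positives = sum(1 for c in benign if c["guard_blocked"])
    return successes / len(attacks), false_positives / len(benign)


# Made-up records, only to show the bookkeeping.
cases = [
    {"is_attack": True, "guard_blocked": True, "agent_complied": False},
    {"is_attack": True, "guard_blocked": False, "agent_complied": True},
    {"is_attack": True, "guard_blocked": False, "agent_complied": False},
    {"is_attack": True, "guard_blocked": True, "agent_complied": False},
    {"is_attack": False, "guard_blocked": False, "agent_complied": False},
    {"is_attack": False, "guard_blocked": True, "agent_complied": False},
]
asr, fpr = evaluate_guard(cases)
```

Reporting both numbers together matters because a guard that blocks everything drives ASR to zero while making the false-positive rate unusable.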

2606.02132 2026-06-03 cs.AI Version update

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

学习何时不行动:缓解智能体强化学习中的工具滥用

Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang

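Affiliations * NLPR, Institute of Automation, Chinese Academy of Sciences; ByteDance; Zhejiang University

AI summary Proposes the EAPO framework, which introduces tool-free trajectories, difficulty-aware reward shaping, and confidence-aware token reweighting to reduce tool abuse on mathematical and knowledge-intensive reasoning tasks while improving the accuracy-efficiency trade-off.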

Comments Under review

Details
AI Chinese abstract

智能体强化学习可能引发工具滥用,即模型过度使用外部工具,即使对于内部推理可解的查询也是如此。现有方法通过统一的工具使用惩罚或硬限制来缓解此问题,这降低了工具使用频率,但可能抑制有用的工具辅助探索。我们提出EAPO,一种高效的智能体策略优化框架,学习选择性工具使用。EAPO在每个rollout组中引入无工具轨迹,应用难度感知奖励塑造以主要对较简单查询上的冗余工具调用进行惩罚,并使用置信度感知令牌重加权来改进策略学习。在九个数学和知识密集型推理基准上,EAPO在Qwen2.5-3B、Qwen2.5-7B和Llama3.1-8B上持续改善了准确率-效率权衡。与GRPO相比,EAPO的平均性能分别提高了10.45%、7.27%和9.69%,同时平均工具调用次数分别减少了18.33%、18.33%和24.59%。这些结果表明,智能体可以在不损害工具集成推理的情况下学习何时不使用工具。

English abstract

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
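
The phrase "difficulty-aware reward shaping" can be made concrete with a small sketch. The abstract does not give EAPO's formula, so everything below is an assumption for illustration only: difficulty is approximated by the success rate of the tool-free rollouts in the same group, and the names `shaped_reward` and `penalty_per_call` are hypothetical.

```python
def shaped_reward(correct, tool_calls, tool_free_success_rate, penalty_per_call=0.1):
    """Hypothetical difficulty-aware shaping (an assumption, not EAPO's formula).

    correct: whether the rollout reached the right answer.
    tool_calls: number of tool calls made in the rollout.
    tool_free_success_rate: fraction of tool-free rollouts in the group that
        were correct; a high value suggests an easy query.
    """
    base = 1.0 if correct else 0.0
    # Clamp the easiness estimate to [0, 1].
    easiness = max(0.0, min(1.0, tool_free_success_rate))
    # Tool calls cost more on easy queries and nothing on queries
    # that no tool-free rollout could solve.
    return base - penalty_per_call * easiness * tool_calls


easy = shaped_reward(correct=True, tool_calls=2, tool_free_success_rate=1.0)
hard = shaped_reward(correct=True, tool_calls=2, tool_free_success_rate=0.0)
```

Under these assumptions the same two tool calls lower the reward on the easy query but leave it untouched on the hard one, which is the qualitative behaviour the abstract describes.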

2606.02060 2026-06-03 cs.AI Version update

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

深度研究代理在何处出错?代理轨迹中的跨度级错误定位

Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

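Affiliations * NJU-LINK Team, Nanjing University; JIUTIAN Research

AI summary Addresses the difficulty of locating errors in the long trajectories of deep-research agents by building the TELBench benchmark and proposing the DRIFT auditing framework for span-level error localization, improving first-error localization accuracy by up to 30 percentage points.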

Comments 28 pages, 11 figures, 4 tables

Details
AI Chinese abstract

深度研究代理通过搜索、工具使用、证据检查和答案合成的长轨迹来完成任务。基于最终答案的评估可以显示代理是否成功,但无法显示轨迹的哪些部分导致答案不可靠。我们研究了深度研究代理的跨度级错误定位。我们从两个代理框架、三个骨干模型和三个基准中收集了2,790条真实轨迹,将原始日志转换为语义跨度,并通过LLM辅助的专家评审标注了有害错误跨度。基于这些标注,我们构建了TELBench,一个包含1,000个实例的基准,用于在正常探索、失败搜索、暂定假设和无害噪声中识别错误跨度。我们进一步提出了DRIFT,一个以声明为中心的审计框架,该框架跟踪代理声明,检查其在轨迹证据中的支持,并标记那些无支持或冲突声明影响答案路径的跨度。跨模型系列和审计框架的实验表明,DRIFT将跨度级错误定位和首次错误准确率提高了高达30个百分点。我们的工作提供了深度研究代理可靠性的过程级视角。

English abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

2606.01904 2026-06-03 cs.CL cs.AI Version update

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

KliniskVestBERT: 针对挪威临床文本特化的BERT模型

Christian Autenried, Cosimo Persia

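Affiliations * helse-vest-ikt

AI summary Builds the KliniskVestBERT models by continuing the pretraining of existing BERT models on Norwegian clinical text; the resulting models significantly outperform their baselines on clinical NLP tasks.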

Details
AI Chinese abstract

自然语言处理(NLP)在医疗保健中的应用日益增长,这要求语言模型特别适应临床语言的复杂性。本文介绍了KliniskVestBERT,这是一套基于BERT的编码器模型套件,在来自Helse Vest的大量真实、去标识化的挪威临床文本语料库上进行预训练。我们在专门的临床数据集上继续预训练现有的语言模型Nb-BERT-large、NorBERT3-large和ModernBERT。该数据集基于Helse Vest患者的代表性人群。包含的文档类型经过精心策划,涵盖了bokmål和nynorsk的广泛临床范围,包括出院小结、手术报告、护理记录等,确保全面代表挪威医疗环境中的语言景观。在三个合成挪威临床基准数据集和两个真实世界问题上的评估表明,每个临床特化模型都持续优于其基线对应模型,突显了领域特定预训练对临床领域NLP任务的显著益处。该项目由所有Helse Vest实体(Helse Bergen、Helse Fonna、Helse Førde和Helse Stavanger)与DIPS在Helse Vest ICT的项目领导下共同完成。

English abstract

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk, including discharge summaries, surgical reports, nursing notes, etc., ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms its baseline counterpart, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.

2606.01767 2026-06-03 cs.AI Version update

EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

EvoBrain: 面向异构BCI任务的EEG基础模型的持续学习

Yangxuan Zhou, Sha Zhao, Jiquan Wang, Shijian Li, Gang Pan

Affiliations * State Key Laboratory of Brain-machine Intelligence and College of Computer Science and Technology, Zhejiang University; MOE Frontier Science Center for Brain Science and Brain-Machine Integration, Zhejiang University

AI summary Proposes the EvoBrain framework, which uses Neuro-Spectral Task Normalization and Response-Affinity Distillation to address continual learning of EEG foundation models across heterogeneous BCI tasks, enabling cross-task knowledge transfer and mitigating forgetting.

Comments 18 pages,12 figures

Details
AI Chinese abstract

脑电图(EEG)是非侵入式脑机接口(BCI)的基石,然而传统的解码依赖于碎片化的、特定任务的架构,严重限制了跨任务的可扩展性。虽然在大规模语料库上预训练的EEG基础模型有望实现通用脑解码,但当前的后训练依赖于任务隔离的微调。这种静态范式限制了跨异构任务的知识迁移,阻碍了模型的可扩展性,并导致计算和存储开销随任务数量线性增长。为了克服这些瓶颈,我们将下游适应形式化为跨任务的持续学习问题,并提出了EvoBrain,一个动态的、任务感知的持续学习框架,用于统一的EEG解码。EvoBrain通过两个互补组件解决可塑性-稳定性权衡:(1)神经频谱任务归一化(NSN)将传入任务与历史统计对齐,同时重新校准频谱响应以处理分布和神经频谱偏移;(2)响应亲和蒸馏(RAD)结合时间依赖的重放,保留旧任务的响应几何结构,并促进频谱兼容任务之间的选择性知识迁移,有效缓解遗忘。在六个不同BCI任务上的广泛评估表明,EvoBrain在各种基础骨干网络上始终优于最先进的方法,最佳地平衡了可塑性和稳定性。据我们所知,这项工作开创了EEG领域的跨任务持续学习,推进了统一的、一劳永逸的脑解码系统的实现。

English abstract

Electroencephalography (EEG) is the cornerstone of non-invasive brain-computer interfaces (BCIs), yet conventional decoding relies on fragmented, task-specific architectures that severely limit cross-task scalability. While EEG foundation models pre-trained on massive corpora promise universal brain decoding, current post-training depends on task-isolated fine-tuning. This static paradigm restricts knowledge transfer across heterogeneous tasks, hinders model scalability, and incurs computational and storage overheads that scale linearly with task count. To overcome these bottlenecks, we formulate downstream adaptation as a cross-task continual learning problem and propose EvoBrain, a dynamic, task-aware continual learning framework for unified EEG decoding. EvoBrain addresses the plasticity-stability trade-off via two complementary components: (1) Neuro-Spectral Task Normalization (NSN) aligns incoming tasks with historical statistics while recalibrating spectral responses to handle distributional and neuro-spectral shifts; and (2) Response-Affinity Distillation (RAD), combined with time-dependent replay, preserves old-task response geometry and promotes selective knowledge transfer between spectrally compatible tasks, effectively mitigating forgetting. Extensive evaluations across six distinct BCI tasks demonstrate that EvoBrain consistently surpasses state-of-the-art methods across diverse foundation backbones, optimally balancing plasticity and stability. To our knowledge, this work pioneers cross-task continual learning in the EEG domain, advancing the realization of a unified, one-for-all brain decoding system.

2606.01472 2026-06-03 cs.DC cs.AI cs.LG Version update

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

分层在线提示变异与双环反馈用于有护栏的证据文档生成:生产评估案例研究

Nataraj Agaram Sundar, Tejas Morabia

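Affiliations * eBay Inc.

AI summary Proposes HOPM, a hierarchical online prompt mutation framework that optimizes prompt policies through dual-loop feedback (human review and an automated judge), significantly improving win rate and quality in real marketplace dispute-evidence generation.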

Comments 7 pages. Production-evaluation case study of guardrailed LLM evidence-document generation

Details
AI Chinese abstract

高风险生产文档生成系统要求语言模型具有适应性、基于证据且可审计。我们提出HOPM,一种分层在线提示变异框架,在真实市场纠纷证据工作流上评估。HOPM将提示视为在线策略:一个家族/版本路由器选择提示,确定性护栏将失败归因于可变的提示-令牌类别,来自人工审核和自动评判的双重反馈更新路由和变异优先级。主要证据是观察到的匹配生产评估消融:七个变体在相同的600个案例上评估,实现组件比较:静态提示、手动迭代、仅bandit路由、仅变异适应、仅人工反馈、仅自动评判反馈和全双环HOPM。全HOPM将计数胜率从34.7%提升至45.7%(+11.0个百分点;配对McNemar p=1.31e-11),金额加权胜率从22.3%提升至41.4%(+19.1个百分点;95%配对bootstrap CI [10.3, 28.9]个百分点)。它还将平均Likert质量从3.18提高到4.40,并将问题标记率从15.3%降低到5.2%。支持性审查工件涵盖770篇生成文本审查、318份标记审查员导出、一个10案例/61评分的校准切片和一个70案例/350评分的OCR基准;这些工件校准评分标准、护栏、标题风险和OCR风险解释,而非替代生产消融。论文包括控制设置、样本量、置信区间、配对检验、提示-令牌类别、伪代码、模式、评分标准、护栏分类法以及一个构造示例,以便在不暴露专有证据的情况下重现评估结构。

English abstract

High-stakes production document-generation systems require language models to be adaptive, evidence-grounded, and auditable. We present HOPM, a hierarchical online prompt mutation framework evaluated on a real marketplace dispute-evidence workflow. HOPM treats prompts as online policies: a family/version router selects a prompt, deterministic guardrails attribute failures to mutable prompt-token categories, and dual feedback from human review and an automated judge updates both routing and mutation priorities. The primary evidence is an observed matched production-evaluation ablation: seven variants are evaluated on the same 600 cases each, enabling component comparisons against static prompting, manual iteration, bandit-only routing, mutation-only adaptation, human-only feedback, auto-judge-only feedback, and full dual-loop HOPM. Full HOPM improves count win rate over a static control from 34.7% to 45.7% (+11.0 pp; paired McNemar p = 1.31e-11) and amount-weighted win rate from 22.3% to 41.4% (+19.1 pp; 95% paired bootstrap CI [10.3, 28.9] pp). It also increases mean Likert quality from 3.18 to 4.40 and reduces issue-flag rate from 15.3% to 5.2%. Supporting review artifacts cover 770 generated-text reviews, 318 labeled reviewer exports, a 10-case/61-rating calibration slice, and a 70-case/350-rating OCR benchmark; these artifacts calibrate rubric, guardrail, title-risk, and OCR-risk interpretation rather than substituting for the production ablation. The paper includes control setup, sample sizes, confidence intervals, paired tests, prompt-token categories, pseudocode, schema, rubric, guardrail taxonomy, and a constructed example so the evaluation structure can be reproduced without exposing proprietary evidence.
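
The abstract reports a paired McNemar p-value but not the discordant-pair counts behind it, nor whether the exact or the chi-square variant was used. As a rough illustration of how such a paired test works, the sketch below implements the exact two-sided variant; the function name and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: cases where only the control succeeded.
    c: cases where only the treatment succeeded.
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as (b, c).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not taken from the paper.
p_small = mcnemar_exact_p(0, 5)   # 0.0625
p_even = mcnemar_exact_p(3, 3)    # 1.0
```

Because all seven variants are evaluated on the same 600 cases, a paired test of this kind compares outcomes case by case rather than treating the two win rates as independent samples.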

2606.01269 2026-06-03 cs.AI Version update

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

局部比较训练下Transformer中涌现的序数几何

Nishit Singh

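Affiliations * Birla Institute of Technology and Science, Pilani

AI summary Trained only on adjacent comparisons, Transformer models generalize to unseen distant pairs and form a one-dimensional ordinal geometry in which decision confidence varies monotonically with rank distance, resembling the symbolic distance effect.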

Comments 11 pages, 12 figures

Details
AI Chinese abstract

传递性推理是指仅从已知的相邻关系(A < B, B < C)推断出A < C的挑战。人类和动物解决这一问题并非通过逻辑链,而是借助一个模拟的心理数字线,其标志是符号距离效应:远距离比较比近距离比较更容易。我们探究Transformer是否获得相同的原始能力,仅使用隐藏全序中的相邻比较训练小型模型,并评估对未见远距离对的泛化。我们发现,分布外泛化伴随着惊人的几何重组:实体嵌入坍缩到一维流形上,其主轴以近乎完美的保真度恢复隐藏的秩序,并且这种结构对优化方式敏感,产生类似grokking的瞬态动力学。关键的是,即使准确率达到上限,决策置信度和几何分离度都随秩距离单调变化,直接反映了在人类、灵长类和啮齿类动物数十年的行为实验中观察到的符号距离效应。这些结果将50年来的行为规律建立在学习表示的几何基础上,为连接认知科学和现代神经网络的传递性推理提供了机制性解释。

English abstract

Transitive inference is the challenge of inferring that A < C from knowing only adjacent relations (A < B, B < C). It is solved by humans and animals not through logical chaining but via an analogue mental number line, whose signature is the symbolic distance effect: distant comparisons are easier than nearby ones. We ask whether Transformers acquire the same primitive, training small models exclusively on adjacent comparisons from a hidden total order and evaluating generalization to unseen distant pairs. We find that out-of-distribution generalization emerges alongside a striking geometric reorganization: entity embeddings collapse onto a one-dimensional manifold whose principal axis recovers the hidden rank order with near-perfect fidelity, and this structure is sensitive to optimization in ways that produce grokking-like transient dynamics. Critically, even when accuracy is at ceiling, decision confidence and geometric separation both scale monotonically with rank distance, directly mirroring the symbolic distance effect observed across decades of behavioural experiments on humans, primates, and rodents. We further show the same rank-aligned geometry in a pretrained large language model, where it tracks the topology of each ordinal relation: linear for sizes and digits, cyclic for months. These results ground a 50-year-old behavioural regularity in the geometry of learned representations, offering a mechanistic account of transitive inference that bridges cognitive science and modern neural networks.
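
The training and evaluation split described above (adjacent comparisons for training, unseen distant pairs for testing) can be sketched as follows. This is a minimal illustration rather than the paper's data pipeline: the function name is hypothetical, and including both orderings of each pair is an assumption.

```python
def comparison_splits(n_items):
    """Split all ordered pairs of a hidden total order 0 < 1 < ... < n-1.

    Returns (train, test), where train holds only adjacent pairs and test
    holds every pair at rank distance >= 2. Each example is (i, j, label)
    with label 1 if item i ranks below item j, else 0.
    """
    train, test = [], []
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            example = (i, j, int(i < j))
            if abs(i - j) == 1:
                train.append(example)
            else:
                test.append(example)
    return train, test


train, test = comparison_splits(7)
# With 7 items there are 12 adjacent ordered pairs and 30 distant ones.
```

Grouping test accuracy or confidence by `abs(i - j)` is then enough to check for a symbolic distance effect.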

2606.01184 2026-06-03 stat.ME cs.AI Version update

Topological Ignorability for Structural Causal Effects Beyond Means

超越均值的结构因果效应的拓扑可忽略性

Usef Faghihi

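AI summary Proposes topological-geometrical causal metrics (such as density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries) to quantify structural differences between interventional distributions, and introduces a topological ignorability assumption that identifies structural causal effects without requiring the full counterfactual distribution.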

Comments This is a new version of our paper titled: Beyond Means: Topological Causal Effects under Persistent-Homology Ignorability. So we will resubmit this as version 2 of arXiv:2603.14169

Details
AI Chinese abstract

许多干预措施改变的是结果分布的结构而非其均值:它们可以将总体分裂为不连通的区域、创建循环或空洞、生成分支,或重组结果云团而几乎不改变平均响应。在这种情况下,基于均值的因果估计量(如平均处理效应)可能遗漏重要的结构效应。 我们引入了基于干预结果定律摘要的拓扑几何因果度量,包括密度超水平Betti摘要、欧拉签名和持续同调摘要。这些度量量化了处理组和未处理组结果定律之间超出平均值的结构差异。我们还研究了因果解释所需的假设。我们引入了拓扑可忽略性,这是条件可忽略性的拓扑类比,要求所选结构特征的不变性而非整个反事实分布。当所选摘要是单射时,该条件与弱可忽略性一致;对于非单射摘要,它可以在不识别完整干预定律的情况下识别感兴趣的结构特征。 我们定义了一个协变量标准化的拓扑几何因果效应,并开发了实用的估计量。我们在两个隐藏混杂基准中验证了该框架:一个完全合成的精确基准和一个使用威斯康星乳腺癌协变量的真实协变量半合成基准。在这两个基准中,弱可忽略性失败,平衡观测协变量几乎消除了标准化均值差异,但坐标均值平均处理效应仍然有偏。相比之下,选定的有限密度超水平Betti和欧拉对比在神谕、观测和加权分析中保持稳定。

English abstract

Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regimes, create loops or holes, generate branches, or reorganize an outcome cloud while leaving the average response nearly unchanged. In such settings, mean-based causal estimands such as the average treatment effect may miss important structural effects. We introduce topological-geometrical causal metrics based on summaries of interventional outcome laws, including density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries. These metrics quantify structural differences between treated and untreated outcome laws beyond averages. We also study the assumptions needed for causal interpretation. We introduce topological ignorability, a topological analogue of conditional ignorability that requires invariance of the chosen structural feature rather than the full counterfactual distribution. When the chosen summary is injective, this condition coincides with weak ignorability; for noninjective summaries, it can identify the structural feature of interest without identifying the full interventional law. We define a covariate-standardized topological-geometrical causal effect and develop practical estimators. We validate the framework in two hidden-confounding benchmarks: a fully synthetic exact benchmark and a real-covariate semi-synthetic benchmark using Wisconsin breast-cancer covariates. In both, weak ignorability fails and balancing observed covariates nearly eliminates standardized mean differences, yet the coordinate-mean average treatment effect remains biased. By contrast, selected finite density-superlevel Betti and Euler contrasts remain stable across oracle, observational, and weighted analyses.
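
A one-dimensional toy case shows why a density-superlevel Betti summary can detect an effect that the mean misses. The sketch below counts connected components (Betti-0) of the set where a gridded density is at or above a threshold. It is an illustrative simplification, not the paper's estimator; the function name and the toy densities are made up for this example.

```python
def betti0_superlevel(density, threshold):
    """Count connected components of {x : density(x) >= threshold} on a 1D grid."""
    components, inside = 0, False
    for value in density:
        above = value >= threshold
        # A new component starts whenever we cross from below to above.
        if above and not inside:
            components += 1
        inside = above
    return components


def grid_mean(density):
    """Mean grid position under the (unnormalized) density."""
    return sum(i * d for i, d in enumerate(density)) / sum(density)


untreated = [0, 1, 2, 6, 2, 1, 0]   # one mode
treated = [0, 5, 1, 0, 1, 5, 0]     # split into two regimes

same_mean = grid_mean(untreated) == grid_mean(treated)   # True: both equal 3.0
b0_untreated = betti0_superlevel(untreated, 2)           # 1
b0_treated = betti0_superlevel(treated, 2)               # 2
```

Here the mean difference is zero while the Betti-0 contrast is one, mirroring the abstract's point that an intervention can split a population into disconnected regimes without moving the average.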

2606.01162 2026-06-03 cs.AI Version update

Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts

基于混合专家模型的动态云工作流截止时间感知调度

Ya Shen, Gang Chen, Hui Ma, Mengjie Zhang

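Affiliations * School of Engineering and Computer Science, Victoria University of Wellington

AI summary Proposes DEFT, a Mixture-of-Experts-based deep reinforcement learning scheduling policy that routes decisions dynamically through a graph-adaptive gating mechanism, effectively reducing execution cost and deadline violations.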

Comments This paper has been accepted by the Fourteenth International Conference on Learning Representations (ICLR 2026)

Details
AI Chinese abstract

云计算中的工作流调度需要将动态到达、图结构且具有不同截止时间的工作流智能地分配到不断变化的虚拟机资源上。然而,现有的深度强化学习调度器受限于僵化的单路径推理架构,难以处理多样化的调度场景。我们引入了DEFT(截止时间感知的混合专家模型),一种创新的深度强化学习策略架构,利用专门的混合专家模型,每个专家被训练用于管理不同级别的截止时间紧迫性。据我们所知,DEFT是首个引入并验证用于动态云工作流调度的混合专家模型架构。通过自适应地将决策路由到最合适的专家,DEFT能够满足单个专家无法实现的广泛截止时间要求。DEFT的核心是一种图自适应门控机制,该机制编码工作流截止时间和DAG、任务状态以及虚拟机条件,使用交叉注意力以细粒度、截止时间敏感的方式指导专家激活。在动态云工作流基准上的实验表明,DEFT显著降低了执行成本和截止时间违反率,优于多个最先进的深度强化学习基线。

English abstract

Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying deadlines onto ever-changing virtual machine resources. However, existing deep reinforcement learning (DRL) schedulers remain limited by rigid, single-path inference architectures that struggle to handle diverse scheduling scenarios. We introduce DEFT (Deadline-pErceptive Mixture-oF-ExperTs), an innovative DRL policy architecture that leverages a specialized mixture of experts, each trained to manage different levels of deadline tightness. To our knowledge, DEFT is the first to introduce and validate a Mixture-of-Experts architecture for dynamic cloud workflow scheduling. By adaptively routing decisions through the most appropriate experts, DEFT is capable of meeting a broad spectrum of deadline requirements that no single expert can achieve. Central to DEFT is a graph-adaptive gating mechanism that encodes workflow DAGs, task states, and VM conditions, using cross-attention to guide expert activation in a fine-grained, deadline-sensitive manner. Experiments on dynamic cloud workflow benchmarks demonstrate that DEFT significantly reduces execution cost and deadline violations, outperforming multiple state-of-the-art DRL baselines.

2606.01013 2026-06-03 cs.AI cs.AR Version update

Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions

AI审稿能否改进论文撰写?基于20篇计算机体系结构投稿的实证研究

Di Wu

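Affiliations * University of Central Florida

AI summary Defines alignment metrics and builds the AI-Paper-Review tool for a case study of 20 computer architecture papers, finding that AI review covers a significant fraction of human-raised issues and also surfaces issues that human reviewers missed, and uses this to examine the potential and limitations of AI review for improving paper drafting.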

Comments 12 pages, 12 figures

Details
AI Chinese abstract

随着人工智能(AI)的发展,研究进展比以往任何时候都快;相应的研究论文也是如此。AI生成论文数量的激增给同行评审带来了压力,导致AI生成的评审可能被广泛但隐蔽地使用。然而,关于保密性、质量和公平性的相关伦理问题已被提出,且广泛的研究社区尚未达成共识。我们预计这一争论将持续一段时间,但与此同时,我们提出一个替代性的实际问题:AI审稿能否改进论文撰写?我们研究了20篇计算机体系结构论文,这些论文的投稿历史各不相同,以揭示AI审稿与人类审稿的对齐程度,并通过我们定义的一组指标进行量化。为了进行案例研究,我们构建了一个集成Web UI的工具AI-Paper-Review,该工具可生成论文草稿的结构化AI评审,网址为https://github.com/unarylab/ai-paper-review。该工具从多样化的AI审稿人池中选择若干AI审稿人,并根据评审意见的共性和重要性对其评论进行聚类和排序。它还允许将AI评论与人类评论对齐,以促进基于指标的验证。案例研究表明,AI审稿可以覆盖人类提出的大部分问题,但也提出了人类评审中遗漏的问题。 本文并非旨在鼓励在当前阶段使用AI进行同行评审,而是研究(1)AI审稿如何改进论文撰写,以及(2)基于AI的同行评审的潜力与局限。发布该工具和案例研究数据旨在激发未来关于这一主题的研究。滥用于同行评审将违反主要学术场所的伦理政策。

English abstract

Research is advancing faster than ever with artificial intelligence (AI), and so are the corresponding research papers. The exploding volume of AI-generated papers has put a strain on peer review, leading to potentially widespread yet covert use of AI-generated reviews. However, ethical concerns about confidentiality, quality, and fairness have been raised, and no consensus has been reached in the broader research community. We expect the debate to continue for a while, but in the meantime we ask an alternative, practical question: can AI review improve paper drafting? We study 20 computer architecture papers, with varying levels of submission lineage, to expose how well AI review aligns with human review, quantified by a set of metrics we define. To conduct the case study, we build a web UI-integrated tool, AI-Paper-Review, that generates a structured AI review of a draft paper, available at https://github.com/unarylab/ai-paper-review. The tool selects several AI reviewers from a diverse pool, then clusters and ranks their comments by commonality and importance. It also allows AI comments to be aligned with human comments to facilitate metric-based validation. The case study shows that AI review can cover a significant fraction of human-raised issues, but also raises issues missing from human review. This paper is not intended to encourage using AI for peer review at the current stage, but to study (1) how AI review can improve paper drafting and (2) the potential and limitations of AI-based peer review. The release of the tool and the case study data is intended to instigate future research on this topic. Misuse for peer review would violate the ethics policies of major academic venues.

2606.00809 2026-06-03 cs.AI Version update

NBQ: Next-Best-Question for Dynamic Profiling

NBQ: 动态画像中的下一最佳问题

Yimin Shi, Clarice Wang, Haixun Wang, Xiaokui Xiao

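Affiliations * National University of Singapore; University of Pennsylvania; EvenUp

AI summary Proposes the NBQ framework, which builds user profiles dynamically from dialogue by adaptively selecting the question with the highest expected information gain, and introduces QuickMatch to accelerate reciprocal matching.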

Details
Journal ref
KDD 2026
AI Chinese abstract

许多真实世界的知识发现对话场景,包括播客、招聘面试和市场,都需要对一个人进行有目的的理解。我们研究了下一最佳问题(NBQ)问题:在每一轮中,面试官应根据已学到的内容和对话目标,提出预期信息增益最高的问题。我们提出了NBQ,一个即插即用的框架,它生成多样化的候选问题池,维护一个紧凑且持续更新的用户状态,在轮次预算内自适应选择下一个问题,并将得到的自由形式对话提炼为结构化的基于向量的用户画像。作为一个高要求的应用,我们将NBQ实例化用于双向匹配,其中兼容性必须是相互的,并且每个人由自我描述和对应偏好表示共同建模。为了支持大规模匹配,我们进一步引入了QuickMatch,一个高效的检索层,将双向匹配从二次成对评分转换为近似向量搜索。实验表明,NBQ在AC@T和AR@T上分别将用户画像质量提高了13.6%和14.0%,而QuickMatch将检索速度提高了22.9倍,召回率高达0.989。

English abstract

Many real-world conversational settings for knowledge discovery, including podcasts, hiring screens, and marketplaces, require a purpose-driven understanding of a person. We study the Next-Best-Question (NBQ) problem: at each turn, an interviewer should ask the question with the highest expected information gain given what has already been learned and the conversation goal. We propose NBQ, a plug-and-play framework that seeds a diverse pool of candidate questions, maintains a compact and continuously updated user state, adaptively selects the next question within a turn budget, and distills the resulting free-form dialogue into a structured vector-based user profile. As a demanding application, we instantiate NBQ for reciprocal matchmaking, where compatibility must be mutual and each person is modeled by both self-description and counterpart-preference representations. To support large-scale matching, we further introduce QuickMatch, an efficient retrieval layer that recasts reciprocal matching from quadratic pairwise scoring to approximate vector search. Experiments show that NBQ improves user profiling quality by up to 13.6% and 14.0% in AC@T and AR@T, respectively, while QuickMatch accelerates retrieval by up to 22.9x with recall up to 0.989.
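
Selecting "the question with the highest expected information gain" can be illustrated with a textbook discrete Bayesian calculation. The abstract does not say how NBQ estimates this quantity, so the sketch below is an assumption: the user state is a probability vector over a few hypotheses, each candidate question comes with a table of answer likelihoods, and all names are hypothetical.

```python
from math import log2


def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from asking one question.

    prior[h] is the current belief in hypothesis h.
    likelihood[h][a] is P(answer a | hypothesis h).
    """
    gain = entropy(prior)
    for a in range(len(likelihood[0])):
        p_a = sum(prior[h] * likelihood[h][a] for h in range(len(prior)))
        if p_a == 0:
            continue
        posterior = [prior[h] * likelihood[h][a] / p_a for h in range(len(prior))]
        gain -= p_a * entropy(posterior)
    return gain


def next_best_question(prior, questions):
    """Return the name of the candidate question with the highest expected gain."""
    return max(questions, key=lambda name: expected_information_gain(prior, questions[name]))


prior = [0.5, 0.5]
questions = {
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],  # answers ignore the hypothesis
    "decisive": [[1.0, 0.0], [0.0, 1.0]],       # answers reveal the hypothesis
}
best = next_best_question(prior, questions)  # "decisive"
```

In a turn-budgeted interview, the posterior after each answer would become the prior for the next selection step.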

2606.00680 2026-06-03 cs.AI cs.LG Version update

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

具有后验混合贝叶斯信念的正则化离线策略优化

Hongqiang Lin, Pengfei Wang, Nenggan Zheng

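AI summary Proposes Posterior Hybrid Bayesian Belief (PhyB) to give a unified quantification of epistemic uncertainty in offline reinforcement learning, and builds on it an iterative regularized policy optimization algorithm with monotonic improvement until convergence.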

Details
AI Chinese abstract

离线强化学习旨在从预先收集的数据集中优化策略。该范式的一个瓶颈是管理认知不确定性,这种不确定性源于有限的数据覆盖(样本层面)以及从有限数据中识别转移动态的模糊性(模型层面)。为了统一量化这些不确定性,贝叶斯强化学习通过将动态模型视为随机变量并维护相应的信念而被提出。尽管具有理论吸引力,贝叶斯强化学习中的策略优化在计算上仍然具有挑战性,因为它需要求解带有期望的复合目标。先前的方法要么采用计算可扩展性差的基于搜索的技术,要么施加牺牲贝叶斯强化学习适应性的限制性后验假设。为了解决这些局限性,我们提出了后验混合贝叶斯信念(PhyB),它将期望重新表述为动态模型子集上的凸组合。理论分析表明,这种近似引起的目标差异是有界的。基于PhyB,我们开发了一种迭代正则化策略优化算法,该算法为单调改进直至收敛提供了与度量无关的保证。实验结果表明,PhyB在各种基准测试中达到了最先进的性能。

English abstract

Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and maintaining a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging as it requires solving composite objectives with expectations. Prior methods either employ search-based techniques with poor computational scalability or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis demonstrates that the objective discrepancy induced by this approximation remains bounded. Based on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees for monotonic improvement until convergence. Empirical results demonstrate that PhyB achieves state-of-the-art performance on various benchmarks.

2606.00555 2026-06-03 cs.AI q-bio.BM Version update

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

编辑前先探测:基于探测引导的分子优化用于基于结构的药物设计中的LLM代理

Zaifei Yang, Weiyu Chen, Yaqing Wang, James Kwok

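Affiliations * The Hong Kong University of Science and Technology; City University of Hong Kong; Beijing Institute of Mathematical Sciences and Applications

AI summary Proposes the PROBE framework, which guides LLM agents in molecular optimization by probing how the pocket-ligand complex responds to edits, addressing the conflict between binding affinity and druggability and achieving state-of-the-art performance on CrossDocked2020.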

Details
AI Chinese abstract

基于结构的药物设计越来越多地使用LLM代理来迭代优化针对目标口袋的配体,然而一个可行的配体必须满足两个常常相互冲突的目标——结合亲和力和成药性——而单次优化步骤很少能同时改善两者。为了量化这一困难,我们引入了两个诊断指标:第一个衡量单次编辑同时改善两个目标的频率,第二个衡量一个目标上的增益伴随另一个目标上的损失的频率。将这些诊断应用于当前的LLM代理流程,揭示了一个一致的失败模式:代理在不知道口袋-配体复合物如何响应局部修改的情况下进行分子编辑,因此很少实现联合改进。受药物化学家的启发,他们在选择优化方向之前通过受控的类似物编辑来探测口袋-配体复合物,我们提出了PROBE,一个围绕编辑响应探测构建的优化框架。PROBE首先将配体分解为可编辑位点,并构建一个口袋特异的位点图,标记出联合增益可能的位置、两个目标可能存在冲突的位置以及应改变责任子结构的位置;然后执行受控的探测编辑,将其响应提炼为编辑手册。在位点图和编辑手册的指导下,PROBE运行一个迭代的多代理循环,其中亲和力代理、成药性代理和协同优化代理共同产生编辑。在CrossDocked2020基准上,PROBE实现了最先进的性能,并显著缓解了我们的诊断指标暴露的失败模式。

English abstract

Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must satisfy two often-conflicting objectives, binding affinity and druggability, which single optimization steps rarely improve together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM-agent pipelines exposes a consistent failure mode: the agent performs molecular editing without knowing how the pocket-ligand complex responds to local modifications, thus rarely achieving joint improvement. Inspired by medicinal chemists, who probe the pocket-ligand complex with controlled analog edits before choosing an optimization direction, we propose PROBE, an optimization framework built around edit-response probing. PROBE first decomposes the ligand into editable sites and builds a pocket-specific site map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed; it then performs controlled probe edits whose responses are distilled into an EditManual. Guided by the site map and EditManual, PROBE runs an iterative multi-agent loop in which an affinity agent, a druggability agent, and a co-optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state-of-the-art performance and substantially mitigates the failure modes exposed by our diagnostic metrics.
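
The two diagnostic metrics described above can be read as simple rates over a log of edits. The sketch below follows that reading; the paper's exact definitions may differ, the function name is hypothetical, and it assumes both deltas are signed so that positive means improvement (raw docking scores, where lower is often better, would need their sign flipped first).

```python
def edit_diagnostics(edits):
    """Compute (joint_gain_rate, tradeoff_rate) over a list of edits.

    edits: list of (delta_affinity, delta_druggability) pairs, with positive
        values meaning improvement on that objective (an assumed convention).
    joint_gain_rate: fraction of edits that improve both objectives.
    tradeoff_rate: fraction of edits that improve one objective and hurt the other.
    """
    if not edits:
        return 0.0, 0.0
    joint = sum(1 for da, dd in edits if da > 0 and dd > 0)
    tradeoff = sum(1 for da, dd in edits if (da > 0 > dd) or (dd > 0 > da))
    return joint / len(edits), tradeoff / len(edits)


# Hypothetical edit log: one joint gain, two trade-offs, one neutral edit.
joint_rate, tradeoff_rate = edit_diagnostics([(1.0, 0.2), (0.5, -0.1), (-0.3, 0.4), (0.0, 0.0)])
```

A low joint-gain rate together with a high trade-off rate is the failure pattern the abstract attributes to editing without probing.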

2606.00395 2026-06-03 cs.LG cs.AI Version update

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

PR2: 基于MoE的大语言模型强化学习中的预测性路由重放

Daize Dong, Junlin Chen, Haolong Jia, Jiang Liu, Jiawei Wu, Huanwei Di, Jialian Wu, Zhengzhong Liu, Zicheng Liu, Emad Barsoum, Dimitris N. Metaxas, Hongyi Wang

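Affiliations * Rutgers University; AMD; MBZUAI

AI summary Targets the instability caused by router drift in reinforcement learning for MoE LLMs, proposing Predictive Routing Replay, which uses a lightweight evolution predictor to reduce routing mismatch and improve training stability and performance.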

Details
AI Chinese abstract

混合专家(MoE)大语言模型(LLM)在规模上实现了强大的性能。然而,基于MoE的LLM的强化学习(RL)常常遭受训练不稳定性。一个根本原因是路由器漂移,即专家激活可能在模型更新时发生剧烈变化,并且在分解的推出和训练阶段之间不同,导致PPO风格RL算法中出现大的推出-训练不匹配和不稳定的重要性采样权重。路由重放通过在每个推理轨迹内冻结重放路由来缓解这个问题,但它忽略了路由器在离策略更新下如何演化,从而导致路由器过时。为了解决这个限制,我们提出了预测性路由重放(PR2),它为每个路由器配备了一个轻量级的演化预测器,学习预测短时域的路由器演化。在推出阶段,我们使用预测性路由分布来应用top-$k$路由,使梯度能够到达更新后可能激活的专家。在训练阶段,我们重放由此产生的预测路由,以保持一致性,从而实现稳定的重要性估计。理论分析和实验支持PR2减少了由路由引起的不匹配,提高了RL稳定性,并在各种推理基准上取得了更强的性能。

English abstract

Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A root cause is router drift, i.e., expert activations can change drastically across model updates and differ between disaggregated rollout and training phases, causing large rollout-training mismatch and unstable importance sampling weights in PPO-style RL algorithms. Routing replay mitigates this issue by freezing the replay route within each reasoning trajectory, but it ignores how the router evolves under off-policy updates and thus causes router staleness. To address this limitation, we propose Predictive Routing Replay (PR2), which augments each router with a lightweight evolution predictor that learns to anticipate short-horizon router evolution. During the rollout phase, we use the predictive routing distribution to apply top-k routing, enabling gradients to reach experts that are likely to become active after updates. During the training phase, we replay the resulting predicted route to retain consistency for stable importance estimation. Theoretical analysis and experiments support that PR2 reduces routing-induced mismatch, improves RL stability, and yields stronger performance across various reasoning benchmarks.
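
For readers unfamiliar with top-k routing, the sketch below shows the basic operation and how router drift changes which experts are active. It is generic MoE routing, not PR2's predictor, which the abstract does not specify. Normalizing with a softmax over only the selected logits is one common convention and is an assumption here; the function name and the logits are hypothetical.

```python
from math import exp


def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts (k >= 1)."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, shifted by the max for stability.
    peak = max(router_logits[i] for i in top)
    exps = {i: exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}


# Hypothetical router logits for one token before and after a policy update.
rollout_route = top_k_route([2.0, 1.0, 0.0], k=1)   # expert 0 is active
training_route = top_k_route([0.0, 1.0, 2.0], k=1)  # expert 2 is active
drifted = set(rollout_route) != set(training_route)
```

When the two routes differ like this, the token probabilities used in the importance ratio come from different expert sets, which is the rollout-training mismatch the abstract describes.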

2606.00096 2026-06-03 cs.CV cs.AI Version update

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

多样性优于频率:重新思考视觉思维链智能体中的工具使用

Dong-Hee Kim, Reuben Tan, Donghyun Kim

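Affiliations * University of California, Berkeley; University of Cambridge; University of Toronto

AI summary Studies tool use by visual chain-of-thought agents on complex reasoning tasks, identifies a tool-use collapse phenomenon, and proposes entropy regularization that improves reasoning performance by encouraging diverse exploration.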

Comments Presented in ICML 2026

Details
AI Chinese abstract

视觉智能体在视觉思维链中使用外部视觉工具来整合细粒度证据。虽然先前的工作主要研究这些工具在视觉搜索任务中的应用,但它们在更复杂的视觉推理中的作用仍未充分探索。在本文中,我们超越简单的视觉搜索任务,研究更具挑战性的任务,包括3D空间推理和医学视觉问答,其中智能体必须将工具获取的局部证据与全局上下文整合。我们识别出一种工具使用崩溃现象:模型逐渐停止使用工具,同时仍能获得更高的任务准确率。此外,我们观察到明显的不对称性:(i) 完全消除工具使用会降低性能,而(ii) 激励工具使用仅带来边际收益,尽管使用量大幅增加。我们发现,普通训练和工具使用鼓励都降低了展开多样性,这解释了为什么更高的工具使用不会带来更强的推理性能。受这些发现的启发,我们添加了一个熵正则化项来鼓励多样化的展开探索,尽管工具使用逐渐下降,但实现了最佳性能。总体而言,我们的发现表明了一种训练时工具作为支架的观点,其中对语言生成和视觉工具调用的更广泛探索改善了推理,尽管存在工具使用崩溃。项目页面:https://scaffolded-exploration.github.io

English abstract

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io
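
An entropy regularization term of the kind mentioned above is usually an entropy bonus subtracted from the policy loss. The sketch below shows that generic form. The abstract does not state which distribution the entropy is taken over or what coefficient is used, so `beta` and the function names are assumptions.

```python
from math import log


def categorical_entropy(probs):
    """Entropy in nats of a categorical distribution."""
    return -sum(p * log(p) for p in probs if p > 0)


def regularized_loss(policy_loss, probs, beta=0.01):
    """Subtract an entropy bonus so that more diverse policies get a lower loss."""
    return policy_loss - beta * categorical_entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally diverse over four actions
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly collapsed onto one action

loss_uniform = regularized_loss(1.0, uniform)
loss_peaked = regularized_loss(1.0, peaked)
```

With the same base loss, the uniform distribution receives the lower regularized loss, so gradient descent is nudged away from collapsing onto a single behaviour.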

2605.30789 2026-06-03 cs.LG cs.AI 版本更新

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

小型模型是GRPO中策略级多样性的自然探索者

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi, Yukang Chen, Dingdong Wang, Tianhe Wu, Junjie Wang, Yujiu Yang, Yu Qiao, Ruihang Chu

发表机构 * University of Science and Technology of China(中国科学技术大学)

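AI summary Proposes the S2L-PO framework, which uses small models as natural explorers to generate rollouts with policy-level diversity and moves to the large model's own sampling through a progressive annealing strategy, improving mathematical reasoning performance while reducing compute overhead.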

详情
AI中文摘要

我们识别出增强LLM组相对策略优化(GRPO)中rollout多样性的新维度。虽然GRPO依赖于多样化的rollout,但主流策略主要通过注入更多token级随机性来增加多样性,这可能引入逐步噪声并导致不连贯的轨迹。我们发现,同一模型族中的较小模型固有地表现出更高的策略级多样性,随着样本数量增加,其pass@k优于较大模型。与token级噪声不同,这种多样性在时间上相关,保持逻辑一致性,并为梯度估计提供结构化探索信号。因此,我们提出S2L-PO(从小到大的策略优化)框架,利用固定的小型模型作为自然探索者来训练大型模型。为了平衡探索与利用,我们设计了一种渐进退火策略,从离线的小模型rollout过渡到大型学习者自身的采样。这种转变优雅地避免了由小模型容量限制导致的训练中期性能下降,实现了更快的收敛并解锁了更高的性能上限。S2L-PO在多种数学推理基准上提高了准确率(例如,使用1.7B探索者指导8B模型在AIME 24上提高了8.8%),同时减少了rollout计算量。

英文摘要

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
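
The abstract compares models by pass@k as the sample count grows. As a hedged illustration, the sketch below implements the unbiased pass@k estimator that is widely used in code-generation evaluation; whether the paper computes pass@k this way is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a fixed number of correct samples, the estimate rises with k, so a model with lower single-sample accuracy but more varied rollouts can overtake a stronger model at large k.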

2605.28556 2026-06-03 cs.AI 版本更新

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

品味问题:提高智能体基准测试的覆盖率和难度

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichart

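AI summary Proposes TASTE, which reverses the task construction process and uses an adaptive contrastive n-gram model plus clustering to automatically generate difficult benchmark tasks covering a wide range of tool combinations, addressing the saturation of existing benchmarks.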

详情
AI中文摘要

随着智能体能力的提升,现有基准(如$τ^2$-Bench)逐渐饱和。然而构建新的基准任务仍然复杂、昂贵且劳动密集。此外,标准方法(先以自然语言编写场景,再映射到工具序列)仅捕获了智能体使用的工具模式的一个狭窄子集。在本文中,我们通过反转任务构建过程来解决这些问题。我们提出TASTE:基于工具序列进化的任务合成,一种自动生成具有更广工具使用覆盖率的挑战性任务的方法。TASTE利用在LLM判断的有效性信号上训练的自适应对比n-gram模型。这使得能够采样覆盖大量工具组合的有效工具序列。然后TASTE通过聚类从池中选择代表性序列,将其实例化为完整的基准任务,并通过迭代难度进化进行优化。使用TASTE,我们构建了$τ^c$-Bench,这是$τ^2$-Bench三个领域的具有挑战性的扩展。我们评估了11个智能体/用户LLM对,发现几乎饱和$τ^2$-Bench的模型在我们的任务上性能严重下降(例如,Gemini-3-Flash从$0.82-0.94$降至$0.28-0.61$)。除了增加难度,我们生成的任务使智能体必须执行的独特工具组合数量翻倍以上。我们的结果表明,现有基准的高分往往反映饱和而非稳健的任务解决能力。通过自动化生成高难度、高覆盖率的基准,TASTE使得未来智能体的持续、可扩展评估成为可能。

英文摘要

As agent capabilities advance, existing benchmarks, such as $τ^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive $n$-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating $τ^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82-0.94 to 0.28-0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.

2605.27762 2026-06-03 cs.AI 版本更新

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

PEAM: 通过经验对比内化的参数化具身智能体记忆在Minecraft中的应用

Yuchen Guo, Junli Gong, Weicheng Wang, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

发表机构 * Northwestern University(西北大学) Northeastern University(东北大学) South China University of Technology(华南理工大学) Hong Kong Baptist University(香港浸会大学) Beijing Normal - Hong Kong Baptist University(北京师范大学-香港浸会大学)

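AI summary Proposes the PEAM framework, which contrastively internalizes failure-correction trajectory pairs to turn experience into parametric skills, enabling continual learning and efficient execution for embodied agents in Minecraft.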

详情
AI中文摘要

我们提出了PEAM,一个在Minecraft中的参数化具身智能体记忆框架,它将智能体记忆从推理时检索转变为通过经验内化的参数驻留技能。PEAM将一个用于开放推理的慢速思考LLM与一个用于反射性执行已巩固技能的快速参数化模块配对。快速模块是一个多模态专家混合LoRA架构,具有按类别物理隔离的适配器,实现了无灾难性遗忘的参数级持续学习。我们将失败视为第一类训练信号:失败-纠正轨迹对通过联合行为克隆和对比目标进行内化,因此智能体不仅学习什么成功,还学习纠正动作与失败动作的区别。为了控制巩固,PEAM引入了参数化值得分来决定哪些经验应被内化,以及一个无尺度的自触发巩固机制来决定何时内化,无需任务特定的手动调整阈值,使智能体能够自我进化,因为触发器可以在任务分布之间转移而无需重新调整。在Minecraft中的实验表明,PEAM提高了长时域任务性能,减轻了对先前巩固技能的遗忘,并提高了参数化与检索效率,优于基于检索的具身智能体和参数化记忆变体。

英文摘要

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

2605.26704 2026-06-03 cs.LG cs.AI 版本更新

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

SL-BiLEM: 用于预测和政策评估的结构化可学习行为循环流行病模型

Haochun Wang, Sendong Zhao, Jingbo Wang, Yanrui Du, Ting Liu, Bing Qin

发表机构 * Faculty of Computing, Harbin Institute of Technology(计算学院,哈尔滨工业大学)

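AI summary Proposes the SL-BiLEM model, which uses physical constraints as regularization for robust extrapolation, reports a 76% improvement over neural-mechanistic baselines under policy-induced distribution shift, and supports counterfactual analysis.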

Comments ACM SIGKDD 2026

详情
AI中文摘要

流行病预测面临一个基本挑战:人类行为会动态响应疾病传播,形成反馈循环,在政策干预点引发分布偏移。这使得数据驱动模型在分布偏移下不可靠。我们提出SL-BiLEM(结构化可学习行为循环流行病模型),利用物理约束作为正则化实现鲁棒外推。该框架将有效传播率分解为$β_{\text{eff}}(t,g) = β_0(g) \times m_{\text{policy}}(t) \times m_{\text{media}}(t) \times m_{\text{comp}}(t,g)$,其中对学习到的依从函数施加单调性、平滑性和有界跳跃约束,以在新政策制度下保持预测有效性。除预测外,SL-BiLEM还能为干预决策支持进行反事实分析。我们在三个真实世界数据集(邮轮、学校流感和学区COVID-19监测)上验证预测性能,并在已知真实情况的合成基准上评估反事实恢复。SL-BiLEM表明:(1)相比神经机制基线改进76%,在政策诱导偏移下仅53%的OOD退化,而神经基线为1142%;(2)在27个合成反事实实验中,自举置信区间覆盖率达100%;(3)处理效应准确度超过0.85。这些结果使SL-BiLEM成为公共卫生决策者寻求准确预测和原则性干预规划的可解释工具。

英文摘要

Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models unreliable under distribution shift. We propose SL-BiLEM (Structured Learnable Behavior-in-the-Loop Epidemic Model), leveraging physical constraints as regularization for robust extrapolation. The framework decomposes effective transmission as $β_{\text{eff}}(t,g) = β_0(g) \times m_{\text{policy}}(t) \times m_{\text{media}}(t) \times m_{\text{comp}}(t,g)$, where monotonicity, smoothness, and bounded-jump constraints on the learned compliance function maintain predictive validity under novel policy regimes. Beyond forecasting, SL-BiLEM enables counterfactual analysis for intervention decision support. We validate forecasting on three real-world datasets (cruise ship, school influenza, and school-district COVID-19 surveillance) and evaluate counterfactual recovery on synthetic benchmarks with known ground truth. SL-BiLEM demonstrates: (1) 76% improvement over neural-mechanistic baselines, with only 53% OOD degradation versus 1142% for neural baselines under policy-induced shift; (2) 100% bootstrap CI coverage across 27 synthetic counterfactual experiments; and (3) Treatment Effect Accuracy exceeding 0.85. These results establish SL-BiLEM as an interpretable tool for public health decision-makers seeking accurate prediction and principled intervention planning.
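
The multiplicative decomposition of the effective transmission rate can be sketched as follows. The product itself comes from the abstract; the numeric values and the `max_jump` helper, which illustrates one plausible reading of the bounded-jump constraint, are assumptions rather than the authors' implementation.

```python
def effective_transmission(beta0, m_policy, m_media, m_comp):
    """beta_eff(t, g) = beta_0(g) * m_policy(t) * m_media(t) * m_comp(t, g)."""
    return beta0 * m_policy * m_media * m_comp


def max_jump(values):
    """Largest absolute change between consecutive time steps.

    Hypothetical helper: a bounded-jump constraint could require this
    quantity to stay below a chosen threshold for the compliance series.
    """
    return max((abs(b - a) for a, b in zip(values, values[1:])), default=0.0)
```

Because each multiplier scales the baseline rate independently, a counterfactual such as "no policy" can be approximated in this sketch by setting `m_policy` to 1.0 while leaving the other factors unchanged.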

2605.30155 2026-06-03 cs.LO cs.AI 版本更新

Neural Network Verification using Partial Multi-Neuron Relaxation

使用部分多神经元松弛的神经网络验证

Ido Shmuel, Guy Katz

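AI summary Proposes partial multi-neuron relaxation, which generates multi-neuron bounds for only a small, heuristically selected subset of neurons, balancing tightness and scalability within the Marabou verifier.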

Comments To appear in SAIV 2026

详情
AI中文摘要

深度神经网络在关键系统中的日益集成,激发了对其行为进行形式化安全保证的理论和实际兴趣。为了实现这一点,当代验证算法依赖于为网络的非线性激活函数计算线性松弛。现有的线性松弛方法通常分为两类:单神经元松弛,其中每个激活神经元根据其源进行界定;以及多神经元松弛,其中计算涉及多个激活神经元及其源的线性边界。然而,现有方法可能无法平衡紧致性和可扩展性,因为单神经元边界可能无法推导出验证所需的足够紧致的边界,而为所有激活神经元生成多神经元松弛在计算上代价高昂。在本文中,我们提出了一种中间方法,即部分多神经元松弛,其中我们仅对启发式选择的一小部分神经元生成多神经元边界。为了实现这一点,我们基于现有的分支启发式方法选择神经元,并优化多神经元边界的边界超平面。我们将所提出的方法集成到Marabou验证器中,并与现有的边界紧缩方法相比获得了有利的结果。我们的实验展示了我们的技术在神经网络验证中的潜力。

英文摘要

The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guaranteeing safety properties about their behavior. To achieve this, contemporary verification algorithms rely on computing linear relaxations for a network's non-linear activation functions. Existing approaches for linear relaxations typically fall into one of two categories: single-neuron relaxation, in which each activation neuron is bounded in terms of its sources; and multi-neuron relaxation, in which linear bounds involving multiple activation neurons and their sources are calculated. However, existing methods might fail to balance tightness and scalability, as single-neuron bounds might not derive sufficiently tight bounds necessary for verification to complete, whereas generating multi-neuron relaxation for all activation neurons is computationally expensive. In this paper, we present a middle-ground approach featuring partial multi-neuron relaxation, in which we generate multi-neuron bounds for only a small, heuristically selected subset of neurons. To achieve this, we build upon existing branching heuristics for selecting neurons and for optimizing bounding hyper-planes for multi-neuron bounds. We integrated our proposed method within the Marabou verifier, and obtained favorable results in comparison to existing bound tightening methods. Our experiments showcase the potential of our technique for neural network verification.
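
For readers unfamiliar with single-neuron relaxation, the sketch below shows the standard "triangle" linear relaxation of a ReLU neuron given bounds on its input. It illustrates the single-neuron category described in the abstract only; it is not the paper's partial multi-neuron method, and the function name is hypothetical.

```python
def relu_triangle(lower, upper):
    """Linear bounds for y = max(0, x) when lower <= x <= upper.

    Returns (lower_bounds, upper_bound); each bound is a (slope, intercept)
    pair describing the line slope * x + intercept.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if upper <= 0:
        # Neuron is always inactive: y = 0 exactly.
        return [(0.0, 0.0)], (0.0, 0.0)
    if lower >= 0:
        # Neuron is always active: y = x exactly.
        return [(1.0, 0.0)], (1.0, 0.0)
    # Unstable neuron: y >= 0, y >= x, and y lies under the chord
    # joining (lower, 0) and (upper, upper).
    slope = upper / (upper - lower)
    return [(0.0, 0.0), (1.0, 0.0)], (slope, -slope * lower)
```

The gap between the chord and the true ReLU is the looseness that multi-neuron bounds try to reduce by relating several neurons jointly.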

2605.29930 2026-06-03 cs.AI cs.CY cs.HC 版本更新

Toward AI That Understands Self and Others: A World-Model Theory of Cognitive Diversity and Alignment

迈向理解自我与他人的AI:认知多样性与对齐的世界模型理论

Toru Takahashi

发表机构 * Human Informatics and Systems Lab, Doshisha University(同志社大学人类信息学与系统实验室) Linked Open Data Initiative, NPO Keio Research Institute at SFC(庆应义塾大学SFC研究所开放数据计划) Stroly Inc(Stroly公司)

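AI summary Proposes a multi-phase inference framework built around the Multi-Phase Inference Mechanism (MIM), which formalizes how heterogeneous world models arise through a phase-formation space, a foregrounding field, agent-specific contour states, and alignment maps between state representations, and reframes world-model alignment as making heterogeneous representations mutually processable rather than forcing agreement.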

Comments 87 pages. Revised version with a refined abstract emphasizing disagreement as a late-stage phenomenon, target admissibility, processability, and the methodological abstraction used to compare humans, AI systems, and institutional decision procedures under shared information-theoretic constraints

详情
AI中文摘要

当代社会中的相互误解并非仅仅因为人们持有不同的观点或价值观。即使在相同的观察下,不同的主体也可能形成不同的推理目标、状态表示、预测误差和更新优先级。本文提出了一个多阶段推理框架,并将其核心内部机制定义为多阶段推理机制(MIM)。MIM通过阶段形成空间、前景化场、主体特定轮廓状态以及状态表示之间的对齐图,形式化了异质世界模型如何产生。在此基础上,本文将世界模型对齐重新定义为使异质表示相互可处理的问题,而非强制达成一致或收敛到单一价值体系。它进一步将这种形式化与哲学分歧、认知类型学、社会分裂和AI对齐联系起来。目的是为AI系统提供一个建设性的词汇,通过使意义、价值和预测误差的差异可见、可比较和可转化,帮助人类理解自我和他人。

英文摘要

Modern societies possess more information than ever before, yet they do not converge toward a single shared understanding. The same events, facts, laws, technologies, or risks can be interpreted as evidence of freedom, danger, exclusion, injustice, responsibility, or unrealized possibility. Existing discussions often treat such disagreement as a conflict of values, preferences, or beliefs. This paper argues that disagreement is already a late-stage phenomenon. The central premise is simple but not trivial: observation is not yet inference. Not every observation becomes inferentially relevant, and not every possible object in an observation sequence becomes an estimation target. A possible target becomes admissible only when a state representation can be constructed that is approximately sufficient for prediction, evaluation, or action with respect to that target. This paper develops a world-model theory of cognitive diversity and alignment by reconstructing recognition as the construction of such approximate sufficient statistics under finite informational, representational, observational, and action constraints. It formulates this position as the Multi-Phase Inference Assumption (MIA) and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). The framework introduces alignment maps and transformation loss to analyze how heterogeneous world models communicate without being collapsed into a single representation. World-model alignment is therefore processability, not agreement: the design of AI systems that help heterogeneous forms of intelligence remain mutually processable while preserving their distinct error-detection capacities.

2605.28166 2026-06-03 cs.LG cs.AI 版本更新

QuITE: Query-Based Irregular Time Series Embedding

QuITE: 基于查询的不规则时间序列嵌入

Junghoon Lim

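AI summary Proposes QuITE, a plug-and-play embedding module that aggregates irregular observations with learnable query tokens, requires neither interpolation nor architectural changes, and improves the forecasting and classification performance of multivariate time series models.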

Comments ICML 2026

详情
AI中文摘要

不规则多变量时间序列在实践中很常见,但其不规则采样给有效建模带来了困难。现有方法通常要么(i)设计专门架构,限制了经过验证的多变量时间序列模型的复用,要么(ii)通过插值将不规则时间序列映射到规则时间网格,这可能会引入人工值从而扭曲时间动态。为解决这些限制,我们提出了一种新的基于输入嵌入的方法。我们发现关键瓶颈不在于主干架构,而在于假设均匀采样的传统嵌入层。在这项工作中,我们引入了QuITE(基于查询的不规则时间序列嵌入),一种简单而有效的即插即用嵌入模块。QuITE使用可学习查询令牌通过单层自注意力聚合不规则观测值,直接生成与主干兼容的潜在表示,无需生成人工值或修改架构。在真实世界基准上的大量实验表明,QuITE持续改进多变量时间序列模型,在不同数据集和主干架构上,预测任务平均相对提升高达54.7%,分类任务平均相对提升高达15.8%。代码可在 https://github.com/Meaningfull9502/QuITE 获取。

英文摘要

Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query-Based Irregular Time Series Embedding), a simple yet effective plug-and-play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real-world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to 54.7% in forecasting and 15.8% in classification across diverse datasets and backbone architectures. Code is available at: https://github.com/Meaningfull9502/QuITE.

2605.28910 2026-06-03 cs.CL cs.AI 版本更新

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

基于幻觉检测的偏好优化用于临床摘要生成

Shamanth Kuthpadi Seethakantha, Dung Ngoc Thai, Vara Prasad Gudi, Simran Tiwari, Rami Matar, Avijit Mitra, Wenlong Zhao, Andrew McCallum, Wael Salloum

发表机构 * Manning College of Information and Computer Sciences(Manning信息与计算机科学学院) Ensemble HP Columbia University(哥伦比亚大学) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

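AI summary Proposes an inference-time method in which hallucination detectors guide iterative revision, plus preference-learning fine-tuning on the resulting trajectories, substantially reducing hallucinations in clinical summaries.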

详情
AI中文摘要

大型语言模型(LLM)在摘要任务上展现出潜力,但常产生幻觉,即无依据或不正确的陈述,限制了其在专业医疗应用中的可靠性。我们引入幻觉检测引导的自我修正(HDSR),一种推理时方法,利用幻觉检测器指导迭代摘要修正以实现事实更正。在此基础上,我们提出用于偏好学习的HDSR(HDSR-PL),将检测器引导的修正轨迹转化为偏好对以进行模型微调。大量实验表明,我们的方法在总结来自MIMIC-IV-Note v2.2的真实临床笔记时,显著减少了Llama和Gemma模型的幻觉。例如,HDSR在Llama-3.1-8B-Instruct上减少了24%的幻觉,而HDSR-PL减少了48%。重要的是,根据人类专家和LLM-Jury评估,两种方法都保持了摘要的流畅性、连贯性和相关性。这些结果共同表明,检测信息驱动的修正和偏好学习为提高临床摘要的事实准确性提供了一种自动化解决方案。

英文摘要

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce Hallucination Detection Guided Self-Refinement (HDSR), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose HDSR for Preference Learning (HDSR-PL), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV-Note v2.2. For example, on Llama-3.1-8B-Instruct, HDSR reduces hallucinations by 24% and HDSR-PL by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

2605.26366 2026-06-03 cs.AI cs.LG 版本更新

Automatic Layer Selection for Hallucination Detection

幻觉检测的自动层选择

Xinpeng Wang, William X. Cao, Andrew Gordon Wilson, Zhe Zeng

发表机构 * University of Washington(华盛顿大学)

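AI summary Addresses layer selection for hallucination detection in large language models by proposing the training-free FEPoID criterion, which automatically identifies optimal or near-optimal intermediate layers, and combines it with a truncation strategy to improve detection performance.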

Comments Accepted at ICML 2026

详情
AI中文摘要

最近关于幻觉检测的研究表明,在大语言模型(LLMs)中,与幻觉相关的信号在中间层比在最后一层编码得更强。尽管越来越多的研究试图利用这一特性进行幻觉检测,但如何自动选择高性能层仍未得到充分探索,且缺乏针对此目的的原则性方法。为填补这一空白,我们首先提出了几个关于为何这些信号出现在中间层的假设,并评估了相应的自动层选择准则,这些准则适用于不同的LLM架构、规模和任务,涵盖了问答和摘要幻觉检测基准。然而,我们发现这些准则均不能持续提供令人满意的性能。因此,我们提出了一种新的选择准则——第一有效本征维度峰值(FEPoID),它能够一致地识别最优或接近最优的层,并优于上述准则和现有的幻觉检测基线。FEPoID无需训练,且计算开销可忽略不计。此外,我们研究了LLM的生成行为,并引入了一种简单而有效的截断策略,该策略进一步放大了与幻觉相关的信号,并显著提高了整体检测性能。代码公开于 https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

英文摘要

Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

2605.12925 2026-06-03 cs.SE cs.AI 版本更新

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

AgentLens: 揭示 SWE-Agent 评估中的幸运通过问题

Priyam Sahoo, Gaurav Mittal, Xiaomin Li, Shengjie Ma, Benjamin Steenhoek, Pingping Lin, Yu Hu

发表机构 * University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Microsoft(微软)

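AI summary Addresses the reliance of software-engineering agent evaluation on a binary signal (whether the final patch passes the tests) by proposing AgentLens, a framework for process-level assessment that builds Prefix Tree Acceptor references and uses a context-sensitive intent labeler. It finds that 10.7% of passing trajectories are "Lucky Passes" and uses a quality score to separate trajectories into Lucky, Solid, and Ideal tiers.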

详情
AI中文摘要

软件工程(SWE)智能体的评估主要依赖一个二元信号:最终补丁是否通过测试。这种仅关注结果的观点将原则性解决方案与混乱的试错过程视为等价。我们证明这种等价性在经验上是错误的。我们在60个SWE-bench验证任务上评估了来自八个模型后端的2,614条OpenHands轨迹。其中,47个任务有足够多的通过轨迹来构建任务级过程参考,从而得到一个包含1,815条轨迹的评估子集。在该子集的通过轨迹中,10.7%表现出我们称之为“幸运通过”的行为:回归循环、盲目重试、缺少验证,或探索、实现和验证在时间上无序。 我们引入AgentLens,一个用于SWE智能体轨迹过程级评估的框架,并定义AgentLens-Bench,一个包含1,815条轨迹的数据集,这些轨迹标注有质量分数、浪费信号、分歧点以及47个任务级前缀树接受器(PTA)参考。AgentLens通过合并同一任务的多个通过解决方案来构建PTA参考,并使用上下文敏感的意图标注器,基于轨迹历史而非仅工具身份将动作分配给探索、实现、验证或编排。 在AgentLens-Bench上,质量分数将通过轨迹分为幸运、扎实和理想三个等级,并进一步将幸运通过分解为五种重复出现的机制。在八个模型后端中,幸运率从0.5%到23.2%不等,当按质量分数而非通过率排序时,一些模型的排名变动多达五位。我们计划很快发布项目仓库,包括AgentLens-Bench工件、AgentLens SDK和分析工具。

英文摘要

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and define AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We plan to release the project repository soon, including AgentLens-Bench artifacts, the AgentLens SDK, and the analysis tooling.
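
A Prefix Tree Acceptor built by merging several passing solutions can be sketched as a trie over action labels. The abstract does not describe AgentLens's actual data structures, so the class below, its method names, and the single-letter intent labels in the usage note are hypothetical illustrations of the general technique.

```python
class PrefixTreeAcceptor:
    """Trie over action labels; a node is accepting if a reference run ends there."""

    def __init__(self):
        self.children = {}
        self.accepting = False

    def add(self, sequence):
        """Merge one passing action sequence into the tree."""
        node = self
        for label in sequence:
            node = node.children.setdefault(label, PrefixTreeAcceptor())
        node.accepting = True

    def accepts(self, sequence):
        """True if the sequence exactly matches a merged reference run."""
        node = self
        for label in sequence:
            node = node.children.get(label)
            if node is None:
                return False
        return node.accepting

    def divergence_point(self, sequence):
        """Index of the first action that leaves the tree, or None if none does."""
        node = self
        for index, label in enumerate(sequence):
            if label not in node.children:
                return index
            node = node.children[label]
        return None
```

For example, after adding the reference runs `["E", "I", "V"]` and `["E", "E", "I", "V"]` (Exploration, Implementation, Verification), the trajectory `["E", "I", "I", "V"]` diverges at index 2, where a repeated implementation step has no counterpart in any reference.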

2512.24008 2026-06-03 cs.AI 版本更新

SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

SPARK:通过智能体驱动的检索和知识共享实现搜索个性化

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

发表机构 * Texas State University(德克萨斯州立大学)

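AI summary Proposes the SPARK framework, which uses collaborating persona-based multi-agent LLMs for dynamic personalized search: a coordinator activates the relevant specialist agents, which rely on memory and reasoning modules, so that personalization emerges from distributed agent behavior.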

Comments This is the author's preprint. Accepted to WEB&GRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: https://wsdm-conference.org/2026/ Workshop: https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html

详情
AI中文摘要

个性化搜索需要能够建模用户不断变化的多维信息需求;这对于受限于静态配置文件或单一检索管道的系统来说是一个挑战。我们提出了SPARK(通过智能体驱动的检索和知识共享实现搜索个性化),这是一个框架,其中协调的基于角色的大语言模型(LLM)智能体提供特定任务的检索和涌现个性化。SPARK形式化了一个由角色、专业知识、任务上下文和领域定义的角色空间,并引入了一个角色协调器,该协调器动态解释传入的查询以激活最相关的专门智能体。每个智能体执行独立的检索增强生成过程,由专用的长期和短期记忆存储以及上下文感知推理模块支持。智能体间的协作通过结构化通信协议促进,包括共享记忆库、迭代辩论和接力式知识转移。借鉴认知架构、多智能体协调理论和信息检索的原理,SPARK建模了由最小协调规则支配的分布式智能体行为如何产生涌现个性化特性。该框架在协调效率、个性化质量和认知负载分布方面产生了可检验的预测,同时结合了自适应学习机制以实现持续的角色细化。通过将细粒度的智能体专业化与协作检索相结合,SPARK为能够捕捉人类信息寻求行为的复杂性、流动性和上下文敏感性的下一代搜索系统提供了见解。

英文摘要

Personalized search demands the ability to model users' evolving, multi-dimensional information needs, a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent-Driven Retrieval and Knowledge-sharing), a framework in which coordinated persona-based large language model (LLM) agents deliver task-specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval-augmented generation process, supported by dedicated long- and short-term memory stores and context-aware reasoning modules. Inter-agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay-style knowledge transfer. Drawing on principles from cognitive architectures, multi-agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine-grained agent specialization with cooperative retrieval, SPARK provides insights for next-generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information-seeking behavior.

2605.24391 2026-06-03 cs.AR cs.AI 版本更新

MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation

MX-SAFE:具有即时指数和尾数位分配的多功能推理与训练验证微缩放格式

Dahoon Park, Jahyun Koo, Sangwoo Hwang, Jaeha Kung

发表机构 * Institute of Information & Communications Technology Planning & Evaluation (IITP)(信息与通信技术规划与评估院) Korea government (MSIT)(韩国政府) National Research Foundation of Korea (NRF)(韩国国家研究基金会) Ministry of Science and ICT(科学技术信息通信部) IC Design Education Center (IDEC)(集成电路设计教育中心)

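AI summary Proposes the MX-SAFE microscaling format, which adaptively switches between a wider-mantissa mode and a subnormal FP mode to support both training and direct-cast inference, and uses a tile-based block design for hardware efficiency. Average inference/full-training accuracy improves by 0.05%/11.1% over MXFP8 E2M5 and by 3.55%/3.57% over MXFP8 E4M3, and the accompanying accelerator reaches accuracy similar to the BF16 baseline while using 24.9% less total energy.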

Comments Accepted to DATE 2026 (7 pages, 7 figures). Typo updates for Fig. 3 and Table 4, 5 are reflected

详情
AI中文摘要

随着深度学习需求的增长,通过量化降低训练和推理成本变得至关重要。2022年,开放计算项目(OCP)联盟标准化了用于深度学习的窄精度格式,称为微缩放(MX)格式。MX格式是一种硬件友好的动态量化方案,通过在多个操作数之间共享8位指数来有效减小数据大小。MX格式可分为两类,各有优势:(i)MXINT,仅由尾数位组成,注重高精度;(ii)MXFP,通过允许局部指数位来提供更宽的动态范围。本文提出了一种多功能的MXFP格式,称为MX-SAFE(简称MXSF),它自适应地使用两种模式,即宽尾数模式(FP8 E2M5)和亚正规FP模式(FP5 E3M2),以支持训练和直接推理。此外,我们提出了一种基于瓦片的块设计,通过减少使用MXSF格式训练期间重量化过程的负担来提高硬件效率。由于采用了所提出的MXSF格式,与MXFP8 E2M5和MXFP8 E4M3相比,推理/全训练的平均准确率分别提高了0.05%/11.1%和3.55%/3.57%。此外,我们提出了一种支持MXSF格式的训练推理加速器,在实现与BF16基线相似准确率的同时,总能耗降低了24.9%。

英文摘要

As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, the Open Compute Project (OCP) consortium standardized narrow precision formats for deep learning, called the microscaling (MX) format. The MX format is a hardware-friendly dynamic quantization scheme that effectively reduces the data size by sharing an 8-bit exponent across multiple operands. The MX format can be categorized into two types with their own strengths: (i) MXINT, which targets high precision by using only mantissa bits, and (ii) MXFP, which targets a wider dynamic range by allowing local exponent bits. In this work, we present a versatile MXFP format, called MX-SAFE (MXSF for short), that adaptively uses two modes, i.e., a wider mantissa mode (FP8 E2M5) and a subnormal FP mode (FP5 E3M2), to support both training and direct-cast inference. Furthermore, we propose a tile-based block design that increases hardware efficiency by reducing the burden of the re-quantization process during training with the MXSF format. Compared with MXFP8 E2M5 and MXFP8 E4M3, the proposed MXSF format improves average inference/full-training accuracy by 0.05%/11.1% and 3.55%/3.57%, respectively. Moreover, we present a training-inference accelerator that supports the MXSF format; it achieves accuracy similar to the BF16 baseline while using 24.9% less total energy.
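
The idea of sharing one exponent across a block of operands can be illustrated with a simplified MXINT-style quantizer. This is a sketch under stated assumptions: it does not reproduce the OCP specification's exact encoding or the MX-SAFE mode switching, and the function names and the `mantissa_bits` parameter are hypothetical.

```python
import math


def quantize_block(values, mantissa_bits=8):
    """Quantize a block to signed integers that share one power-of-two exponent."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) is e - 1.
    shared_exp = math.frexp(max_abs)[1] - 1 if max_abs > 0 else 0
    # Scale chosen so the largest magnitude uses the top magnitude bit.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    q_min = -(1 << (mantissa_bits - 1))
    q_max = (1 << (mantissa_bits - 1)) - 1
    ints = [max(q_min, min(q_max, round(v / scale))) for v in values]
    return shared_exp, ints


def dequantize_block(shared_exp, ints, mantissa_bits=8):
    """Recover approximate real values from the shared exponent and integers."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    return [q * scale for q in ints]
```

Only the per-element integers and a single exponent per block need to be stored, which is where the size reduction comes from; small values in a block dominated by one large value lose precision, the trade-off that local exponent bits in MXFP variants are meant to ease.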

2605.24253 2026-06-03 cs.CV cs.AI cs.IR 版本更新

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

CRISP -- 基于聚类的冗余减少实例采样用于病理病例表示与检索

Zahra Rahimi Afzal, Wataru Uegami, Saghir Alfasly, Wenchao Han, Saba Yasir, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H. R. Tizhoosh

发表机构 * Kimia Lab, Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA(Kimia实验室,人工智能与信息学系,梅奥诊所,罗切斯特,明尼苏达州,美国) DICE Lab, Department of Electrical and Computer Engineering, University of Illinois Chicago, IL, USA(DICE实验室,电气与计算机工程系,伊利诺伊大学芝加哥分校,伊利诺伊州,美国) Division of Computational Pathology and Informatics, Mayo Clinic, Rochester, MN, USA(计算病理学与信息学部,梅奥诊所,罗切斯特,明尼苏达州,美国) Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA(实验室医学与病理学系,梅奥诊所,罗切斯特,明尼苏达州,美国) Department of Breast and Melanoma Surgical Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA(乳腺和黑色素瘤外科肿瘤学系,综合癌症中心,梅奥诊所,罗切斯特,明尼苏达州,美国) Department of Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA(肿瘤学系,综合癌症中心,梅奥诊所,罗切斯特,明尼苏达州,美国)

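AI summary Proposes CRISP, an unsupervised framework that integrates the multiple whole-slide images within a case through clustering and redundancy-reduced sampling, building a compact, representative patch set for case-level retrieval that matches or surpasses current standard practice on breast cancer datasets.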

详情
AI中文摘要

数字病理档案中每个病例通常包含多张全切片图像(WSI),这些图像捕获空间上不同的肿瘤区域并反映内在的形态异质性。然而,现有方法大多依赖单一病理学家选择的切片,从而丢弃了分布在其余WSI中的潜在信息性证据。迄今为止,尚无自主框架用于全面的多WSI病例处理。在此,我们提出一个用于病例级分析的无监督框架,该框架整合病例内所有可用切片的信息。所提方法不依赖单一指定切片,而是通过选择性提炼跨WSI的信息性补丁来构建病例级表示。我们引入基于聚类的冗余减少实例采样用于病理学(CRISP),这是一个两阶段框架,首先减少单个WSI内的冗余,随后应用基于聚类的采样为整个病例选择紧凑但具有代表性的补丁集。所得补丁集捕获病例级异质性,同时避免对千兆像素图像的穷举处理,并直接作为检索索引。使用两个梅奥诊所乳腺癌数据集进行诊断和治疗规划,我们证明CRISP在患者/病例搜索和检索中一致匹配或超越当前结合模型和病理学家切片选择的标准实践。通过自动化病例级处理并消除主观WSI选择,CRISP可能能够利用当前被忽视的分布在多个WSI中的临床相关信息。

英文摘要

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumor regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

2605.23995 2026-06-03 cs.CV cs.AI 版本更新

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

任务对齐的自监督学习在医学图像分析中的应用:系统综述与实践设计指南

Chathura Wimalasiri, Kishor Nandakishor, Marimuthu Palaniswami

发表机构 * Department of Electrical and Electronic Engineering, University of Melbourne(墨尔本大学电子与电气工程系)

AI总结 本文系统综述了医学图像中自监督学习(SSL)的四种范式(对比、非对比与预测、生成与重建、混合),分析了前置任务与下游任务的对齐对性能的影响,并提出了实践设计指南。

Comments This manuscript is 31 pages with 4 tables and 3 figures

详情
AI中文摘要

自监督学习(SSL)已成为通过从无标签数据中学习表示来解决医学影像中标注瓶颈的有前景范式。然而,其有效性在很大程度上取决于前置任务的设计及其与下游临床目标的对齐。我们对医学影像中的SSL进行了系统的、任务导向的综述,考察了不同前置任务公式如何影响分类、分割、检测等任务的性能。遵循PRISMA指南,我们分析了2017年至2025年间发表的75项研究,并将其组织为四种范式:对比学习、非对比与预测学习、生成与重建学习、以及混合学习。我们不是按架构对方法进行分类,而是将每种范式映射到其最佳支持的下游目标。我们的分析表明,不存在普遍最优的SSL策略;相反,性能由前置任务、成像模态和目标任务之间的对齐决定。对比方法学习全局判别特征,与分类任务对齐良好,但可能忽略细微的病理模式。生成和空间预测方法更好地保留局部解剖结构,使其更适合分割和其他密集预测任务,而混合方法提供了最平衡的性能。我们进一步表明,模态特定设计至关重要,并且SSL在低标签和少样本场景中提供最大益处。最后,我们将这些发现提炼为实践设计指南,并概述了开放挑战,包括病理感知前置任务设计、高维数据的资源高效训练以及标准化评估协议。这项工作为在医学影像中设计更有效且临床相关的SSL框架提供了实用指导。

英文摘要

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2605.23055 2026-06-03 cs.LG cs.AI cs.CL 版本更新

Decomposing and Measuring Evaluation Awareness

分解与度量评估意识

Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko

发表机构 * ETH Zürich(苏黎世联邦理工学院) ELLIS Institute Tübingen(图宾根ELLIS研究所) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究院) Tübingen AI Center(图宾根人工智能中心) University of Toronto & Vector Institute(多伦多大学及向量研究所) EuroSafeAI(欧洲安全人工智能)

AI总结 本研究借鉴社会心理学,将评估意识分解为环境与模型两部分,通过EvalAwareBench基准测试发现识别率取决于模型与基准的配对,且识别很少导致行为改变,安全评估比能力评估更易受影响。

详情
AI中文摘要

前沿语言模型有时会意识到自己正在被评估并调整行为,从而破坏基准结果的有效性。然而,该领域的研究缺乏共同基础,混淆了评估属性与模型属性,以及检测与行为响应。我们将评估意识植根于社会心理学,将其分解为环境组件(任务的可识别程度)和模型组件(将识别与行动倾向分离)。我们通过八个分类触发因素(如占位符实体和评分式输出格式)来操作化环境组件,并通过思维链监控研究识别和行为。在九个前沿模型和四个基准上,识别率取决于模型和基准的具体配对,而非单独一方。识别很少导致行为改变,即使发生,方向也取决于所感知的评估类型。模型对安全评估的敏感性高于能力评估,使安全基准的有效性面临更大风险。为了研究每个模型对哪些因素敏感以及它们如何相互作用,我们提出了EvalAwareBench,这是一个包含100个配对安全-能力任务的因子控制基准,其中八个因素中的每一个都可以独立切换,在保持底层请求不变的同时改变评估信号。通过EvalAwareBench,我们发现没有单一因素能统一影响所有模型,但叠加因素会逐步提高所有模型的评估意识。我们的框架和EvalAwareBench提供了度量、归因和缓解评估意识的工具,指出在识别下的行为一致性是一条有前景的前进道路。

英文摘要

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results. Yet the field studies this phenomenon without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.

2605.20402 2026-06-03 cs.LG cs.AI 版本更新

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

分解 MXFP4 量化误差以用于大语言模型强化学习:可约简的偏差、可恢复的死区以及不可约简的底噪

Xiaocan Li, Shiliang Wu, Zheng Shen

发表机构 * Huawei Canada(华为加拿大)

AI总结 本文通过将 MXFP4 量化误差分解为三个可加分量(尺度偏差、死区截断和网格噪声),并针对每个分量提出针对性修正(宏块缩放、异常值回退和自适应量化噪声),从而在 LLM 强化学习后训练中恢复精度。

详情
AI中文摘要

MXFP4 算术可以显著加速大语言模型(LLM)强化学习(RL)后训练,但量化误差会导致严重的精度下降。现有工作将量化误差视为单一噪声项,忽略了量化误差损害训练的不同机制。我们证明了量化误差的精确三向分解,并展示了每个分量如何主导不同的 RL 训练路径。我们的理论和实证分析将 MXFP4 量化误差分解为三个可加分量:来自 2 的幂次舍入的“尺度偏差”、来自小值归零的“死区截断”以及来自舍入到最近 4 位网格的“网格噪声”。每个分量主导不同的 RL 失效模式:尺度偏差通过反向传播乘法累积,影响梯度精度;死区截断降低 rollout 质量;网格噪声提高策略的熵。我们结合了针对 RL 失效模式但不限于特定分量的修正:宏块缩放以减少尺度偏差,异常值回退恢复死区条目,同时也部分减少尺度偏差引起的误差,以及自适应量化噪声(AQN)用于控制策略熵。在 Qwen2.5-3B 密集模型和 Qwen3-30B-A3B-Base 混合专家模型上,针对性修正分别将 BF16 精度恢复到 0.7% 以内,并超过 BF16 达 +1.0%。

英文摘要

MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms by which quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling reduces scale bias; Outlier Fallback recovers deadzone entries and also partially reduces scale-bias-induced error; and Adaptive Quantization Noise (AQN) controls the policy entropy. On the Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts models, the targeted corrections recover BF16 accuracy to within 0.7% and exceed BF16 by +1.0%, respectively.
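The idea of an exact additive split can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the authors' construction: the "ideal" reference uses a real-valued scale `amax / 6`, the power-of-two scale is taken as `2 ** (floor(log2(amax)) - 2)`, ties round toward the smaller magnitude, and the helper names `quantize` and `decompose` are hypothetical. The abstract does not state how the paper defines each term.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize(block, scale):
    """Round each |x| / scale to the nearest FP4 magnitude, then rescale.

    Values beyond 6 * scale saturate at the largest magnitude because 6.0
    is then the nearest grid point.
    """
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def decompose(block):
    """Split (Q_mx(x) - x) into three per-element terms that sum exactly.

    Illustrative definitions (assumptions, not taken from the paper):
      scale_bias = Q_pow2(x) - Q_ideal(x)
      deadzone   = -x where the ideal-scale quantizer zeroes a nonzero x
      grid_noise = Q_ideal(x) - x everywhere else
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        n = len(block)
        return [0.0] * n, [0.0] * n, [0.0] * n

    ideal_scale = amax / 6.0
    pow2_scale = 2.0 ** (math.floor(math.log2(amax)) - 2)

    q_ideal = quantize(block, ideal_scale)
    q_mx = quantize(block, pow2_scale)

    scale_bias, deadzone, grid_noise = [], [], []
    for x, qi, qm in zip(block, q_ideal, q_mx):
        zeroed = (qi == 0.0 and x != 0.0)
        scale_bias.append(qm - qi)
        deadzone.append(-x if zeroed else 0.0)
        grid_noise.append(0.0 if zeroed else qi - x)
    return scale_bias, deadzone, grid_noise
```

Because the three terms telescope, their sum equals the total error of the power-of-two quantizer for every element, which is the property the abstract calls an exact decomposition. A real MXFP4 block has 32 elements; the sketch accepts any length.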

2605.22018 2026-06-03 cs.CV cs.AI cs.RO 版本更新

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

FRED:面向洪水道路环境的多模态自动驾驶数据集

Connor Malone, Sebastien Demmel, Sebastien Glaser

发表机构 * Queensland University of Technology(昆士兰理工大学) ARC Training Centre for Automated Vehicles in Rural and Remote Regions (AVR3)(农村和偏远地区自动化车辆培训中心(AVR3))

AI总结 提出首个针对道路水险场景的多模态自动驾驶数据集FRED,包含相机、LiDAR和IMU数据,并提供语义标签以支持水险检测方法训练与评估。

详情
AI中文摘要

洪水道路环境数据集(FRED)是,据我们所知,首个专门针对道路水险场景数据收集的多模态自动驾驶数据集。该数据集包含来自2.3 MP FLIR Blackfly USB3相机的图像、来自Ouster OS1-64 LiDAR的64线360度点云,以及由Geoflex RTK GNSS校正的iXblue ATLANS-C IMU数据,数据采集自五个不同地点,涵盖洪水期间和洪水之后。数据以两种格式发布:KITTI风格格式,便于与现有数据工具集成;以及RTMaps格式,用于直接回放车辆的数据捕获。我们提供语义标签,以支持用于水险检测的单传感器和传感器融合方法的训练与评估。提供位置和速度数据,以及干燥条件下捕获的数据,以支持可能包含地图的基于位置的检测方法开发,并评估其他任务,如定位和SLAM。

英文摘要

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360 degree point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.

2605.20731 2026-06-03 cs.CV cs.AI stat.AP 版本更新

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

TASTE:一个由设计师标注的AI生成图形设计多维偏好数据集

Haonan Zhu, Elad Hirsch, Alexandria Minetti, Allison Nulty, Purvanshi Mehta

发表机构 * Lica World(Lica世界) Contra.Work Inc.(Contra.Work公司)

AI总结 针对现有偏好数据集仅提供单一整体评价的不足,本文构建了TASTE多维偏好数据集,由两组专业设计师对四个文本到图像模型的输出按九项标准排序,并提出了无准则信号验证框架和偏好模型基准测试。

详情
AI中文摘要

文本到图像模型现在能够以生产规模生成图形设计,但其监督仍然主要来自照片风格的偏好数据集,每次比较只有一个整体判断。设计师沿着几个不同的轴(例如,排版、布局、色彩和谐)评估设计,而单个偏好标签会将这些轴合并。我们发布了\emph{TASTE} \textit{(排版、美学、空间、色调等)},这是一个多维偏好数据集,其中两个不相交的五名专业设计师队列分别对来自四个当前文本到图像模型的输出按九项标准进行排序,并附带每张图像的幻觉标记。我们将该数据集与两个贡献配对。首先,一个基于Kendall的$τ$、多数投票概率和Condorcet循环的无准则信号验证框架,针对精确的iid均匀零假设;分析揭示了显著但中等程度的设计师一致性,每个TASTE标准都拒绝了随机评分者的零假设。其次,我们在TASTE上对偏好模型进行基准测试,发现现成的VLM评判器和专用的T2I评分器未能达到与设计师小组的多数一致,而直接在TASTE上训练的小型MLP头显著缩小了与单个评分者上限的差距,为未来基于TASTE训练的偏好模型设定了基线。

英文摘要

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets with a single overall verdict per comparison. Designers evaluate designs along several distinct axes (e.g., typography, layout, color harmony) that a single preference label collapses. We release \emph{TASTE} \textit{(Typography, Aesthetics, Spatial, Tone, Etc.)}, a multi-dimensional preference dataset in which two disjoint cohorts of five professional designers each ranked outputs from four current text-to-image models across nine criteria along with per-image hallucination flags. We pair the dataset with two contributions. First, a criterion-agnostic signal-validation framework based on Kendall's $τ$, majority-vote probability, and Condorcet cycles against exact iid-uniform nulls; the analysis reveals significant but moderate designer agreement, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, while a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
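Two of the agreement statistics named in the abstract, Kendall's τ and Condorcet cycles, can be sketched in a few lines. This is a minimal illustration under stated assumptions: rankings are tie-free dictionaries mapping an item to its rank position, the panel size is odd, and the function names and example data are hypothetical. The paper's exact iid-uniform null distributions are not reproduced here.

```python
from itertools import combinations, permutations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two tie-free rankings given as {item: position}."""
    items = list(rank_a)
    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


def has_condorcet_cycle(rankings):
    """True if pairwise majority preference contains a cycle a > b > c > a.

    With an odd number of tie-free rankings the majority relation is a
    tournament, and a tournament has a cycle exactly when it has a 3-cycle.
    """
    items = list(rankings[0])

    def beats(x, y):
        wins = sum(1 for r in rankings if r[x] < r[y])
        return wins > len(rankings) / 2

    return any(
        beats(a, b) and beats(b, c) and beats(c, a)
        for a, b, c in permutations(items, 3)
    )
```

For example, two designers who rank four model outputs identically give `kendall_tau(...) == 1.0`, while three raters with the rankings A>B>C, B>C>A, and C>A>B produce a Condorcet cycle.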

2604.27147 2026-06-03 cs.LG cs.AI 版本更新

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

如何引导你的流:通过流图奖励引导实现少步对齐

Jerry Y. Huang, Justin Lin, Sheel Shah, Kartik Nair, Nicholas M. Boffi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出流图奖励引导(FMRG),一种无训练、单轨迹的框架,利用流图在仅需3次NFE下实现奖励引导生成,速度比现有方法快一个数量级。

详情
AI中文摘要

在生成建模中,我们通常希望生成能够最大化用户指定奖励(如美学质量或与人类偏好对齐)的样本,这一问题被称为\textit{引导}。尽管现有引导方法被广泛使用,但它们要么需要昂贵的多粒子、多步方案,要么依赖于理解不足的近似。我们将引导重新表述为一个\textit{确定性最优控制问题},产生了一个算法层次结构,在最粗略的层次上包含了现有方法。我们表明,\textit{流图}(因其在快速推理中的作用而近期受到广泛关注的对象)在最优解中自然出现。基于这一观察,我们提出\textbf{流图奖励引导(FMRG)}:一种无训练、\textit{单轨迹}框架,利用流图来积分和引导流。在文生图规模上,FMRG在逆问题和奖励引导生成中匹配或超越基线,且\textbf{仅需3次NFE},相比先前最先进方法至少实现一个数量级的加速。代码可在 https://github.com/jrrhuang/fmrg 获取。

英文摘要

In generative modeling, we often wish to produce samples that maximize a user-specified reward such as aesthetic quality or alignment with human preferences, a problem known as \textit{guidance}. Despite their widespread use, existing guidance methods either require expensive multi-particle, many-step schemes or rely on poorly understood approximations. We reformulate guidance as a \textit{deterministic optimal control problem}, yielding a hierarchy of algorithms that subsumes existing approaches at the coarsest level. We show that the \textit{flow map}, an object of significant recent interest for its role in fast inference, arises naturally in the optimal solution. Based on this observation, we propose \textbf{Flow Map Reward Guidance (FMRG)}: a training-free, \textit{single-trajectory} framework that uses the flow map to both integrate and guide the flow. At text-to-image scale, FMRG matches or surpasses baselines across inverse problems and reward-guided generation with \textbf{as few as 3 NFEs}, giving at least an order-of-magnitude speedup in comparison to prior state of the art. Code is available at https://github.com/jrrhuang/fmrg.

2605.19805 2026-06-03 cs.LG cs.AI stat.ML 版本更新

Latent Laplace Diffusion for Irregular Multivariate Time Series

潜在拉普拉斯扩散用于不规则多元时间序列

Zinuo You, Jin Zheng, John Cartlidge

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出潜在拉普拉斯扩散(LLapDiff)生成框架,通过低维潜在轨迹和拉普拉斯域参数化实现不规则时间序列的长时预测与缺失值插补。

Comments Accepted as a Spotlight at ICML 2026. The Version of Record will appear in Proceedings of Machine Learning Research (PMLR). 27 pages, 5 figures. Code: https://github.com/pixelhero98/LLapDiffusion

详情
AI中文摘要

不规则多元时间序列对长期预测提出了权衡:离散方法通过重新网格化可能扭曲时间结构,而连续时间模型通常需要容易漂移的序贯求解器。为弥合这一差距,我们提出了潜在拉普拉斯扩散(LLapDiff),一种生成式框架,将目标建模为低维潜在轨迹,从而无需逐步积分物理时间即可实现全范围生成。我们利用受随机端口-哈密顿动力学启发的稳定模态参数化来引导逆向过程,并通过可学习的共轭复极点参数化其在拉普拉斯域中的均值演化,从而能够在不规则时间戳上直接评估。我们还通过更新平均分析将连续动力学与不规则观测联系起来,该分析将采样间隙映射到有效事件域极点,并激发了间隙感知的历史总结器。大量实验表明,LLapDiff在长期预测中优于基线,其连续时间生成性质通过在同一模型的历史时间戳上查询,支持缺失值插补。代码可在https://github.com/pixelhero98/LLapDiffusion获取。

英文摘要

Irregular multivariate time series impose a trade-off for long-horizon forecasting: discrete methods can distort temporal structure via re-gridding, while continuous-time models often require sequential solvers prone to drift. To bridge this gap, we present Latent Laplace Diffusion (LLapDiff), a generative framework that models the target as a low-dimensional latent trajectory, enabling horizon-wide generation without step-by-step integration over physical time. We guide the reverse process utilizing a stable modal parameterization motivated by stochastic port-Hamiltonian dynamics, and parameterize its mean evolution in the Laplace domain via learnable complex-conjugate poles, enabling direct evaluation over irregular timestamps. We also link continuous dynamics to irregular observations through renewal-averaging analysis, which maps sampling gaps to effective event-domain poles and motivates a gap-aware history summarizer. Extensive experiments show that LLapDiff improves over baselines in long-horizon forecasting, and its continuous-time generative nature supports missing-value imputation by querying the same model at historical timestamps. Code is available at https://github.com/pixelhero98/LLapDiffusion.

2605.18740 2026-06-03 cs.CV cs.AI cs.CL cs.LG 版本更新

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD:通过在线策略自蒸馏学习多模态大语言模型的精细细节

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

发表机构 * Tsinghua University(清华大学)

AI总结 提出Vision-OPD框架,通过在线策略自蒸馏将模型自身的局部区域感知能力迁移到全局图像策略,提升多模态大语言模型对细粒度视觉理解的准确性。

Comments Project page: https://github.com/VisionOPD/Vision-OPD

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度视觉理解方面仍然存在困难,答案往往依赖于全图中微小但决定性的证据。我们观察到一种区域到全局的感知差距:当以证据为中心的裁剪图像为条件时,同一MLLM回答细粒度问题的准确率高于以对应全图为条件,这表明许多失败源于难以聚焦于相关证据,而非局部识别能力不足。受此观察启发,我们提出Vision-OPD(视觉在线策略蒸馏),一种区域到全局的自蒸馏框架,将模型自身特权的区域感知迁移到其全图策略。Vision-OPD从同一MLLM实例化两个条件策略:一个以裁剪图像为条件的教师和一个以全图为条件的学生。学生生成在线策略轨迹,Vision-OPD沿这些轨迹最小化教师和学生下一个词元分布之间的词元级差异。这使得模型能够内化视觉放大的好处,而无需外部教师模型、真实标签、奖励验证器或推理时工具使用。在多个细粒度视觉理解基准上的实验表明,Vision-OPD模型在性能上可与更大的开源、闭源以及“思考图像”智能体模型相媲美或更优。

英文摘要

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD.

2512.18552 2026-06-03 cs.SE cs.AI cs.CL cs.LG 版本更新

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

通过自我对弈SWE-RL训练超级智能软件代理

Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, Sida Wang

发表机构 * Meta FAIR University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta TBD Lab(Meta TBD 实验室) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出自我对弈SWE-RL(SSR)方法,通过强化学习在自对弈环境中训练单一LLM代理,使其在无需人工标注问题或测试的情况下,在真实代码库中迭代注入和修复软件缺陷,在SWE-bench基准上实现显著自我改进并超越人类数据基线。

Comments Accepted to ICML 2026

详情
AI中文摘要

尽管当前由大型语言模型(LLM)和智能体强化学习(RL)驱动的软件代理能够提高程序员的生产力,但其训练数据(例如GitHub问题和拉取请求)和环境(例如通过-通过和失败-通过测试)严重依赖人类知识或整理,这构成了通向超级智能的根本障碍。在本文中,我们提出了自我对弈SWE-RL(SSR),这是迈向超级智能软件代理训练范式的第一步。我们的方法仅需最小的数据假设,只需访问带有源代码和已安装依赖项的沙盒化仓库,无需人工标注的问题或测试。基于这些真实世界的代码库,单个LLM代理通过强化学习在自我对弈环境中进行训练,以迭代地注入和修复复杂度逐渐增加的软件缺陷,每个缺陷由测试补丁而非自然语言问题描述正式指定。在SWE-bench Verified和SWE-Bench Pro基准上,SSR实现了显著的自我改进(分别提升+10.4和+7.8分),并在整个训练轨迹中持续优于人类数据基线,尽管其评估的是自我对弈中未出现的自然语言问题。我们的结果虽然尚处于早期阶段,但表明了一条路径,即代理可以从真实软件仓库中自主收集广泛的学习经验,最终实现超越人类能力的超级智能系统,在理解系统构建方式、解决新挑战以及从头开始自主创建新软件方面超越人类。

英文摘要

While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.

2605.18160 2026-06-03 cs.CV cs.AI 版本更新

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

Vision Inference Former:在多模态大语言模型中维持视觉一致性

Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang

发表机构 * Zhejiang University(浙江大学) East China Normal University(华东师范大学) Zhejiang University of Science and Technology(浙江科技大学)

AI总结 针对多模态大语言模型中视觉信息被弱化的问题,提出Vision Inference Former(VIF)轻量模块,在推理解码阶段持续注入视觉语义,提升生成内容与视觉的一致性。

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)取得了显著进展,主要归功于整合视觉和文本信息的有效范式。主流的基于连接器的范式将视觉特征投影到文本序列中,从而在生成式架构内实现统一的多模态对齐和推理。然而,我们的实验揭示了两个关键限制:(1)尽管视觉信息是MLLMs中的核心证据模态,但它被与文本标记同等对待,削弱了视觉模态的独特贡献;(2)随着生成长度的增加,特别是在有限的上下文窗口内,模型对视觉信息的依赖逐渐减弱,导致视觉-语言对齐恶化,生成内容与视觉语义之间的一致性降低。为了解决这些挑战,我们提出了Vision Inference Former(VIF),一种轻量级架构模块,它在纯视觉表示和模型输出空间之间建立直接桥梁。具体而言,VIF在推理过程的解码阶段持续注入视觉语义,确保模型在生成过程中牢固地基于视觉内容。我们在涵盖通用推理、OCR、表格理解、以视觉为中心的评估和幻觉的14个基准任务上进行了实验。实验结果表明,VIF在不同架构上持续提升模型性能,同时引入最小的额外开销。本工作的代码可在https://github.com/Dong-Xinpeng/VIF获取。

英文摘要

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.

2605.18106 2026-06-03 math.OC cs.AI cs.LG stat.ML 版本更新

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

优化器设计的对称性兼容原理:嵌入、LM头、SwiGLU MLP和MoE路由器

Tim Tsz-Kit Lau, Weijie Su

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Wharton School(沃顿商学院)

AI总结 针对现代神经网络参数空间的对称性与坐标级优化器之间的几何不匹配,提出对称性兼容的优化器设计原则,并针对嵌入矩阵、LM头、SwiGLU MLP投影和MoE路由器等特殊参数块导出相应更新规则,实验证明其改善验证损失、负载平衡和训练稳定性。

详情
AI中文摘要

深度学习实践中长期存在一种显著的几何差异。现代神经网络架构自然展现出丰富的对称性和等变性,而流行的优化器如Adam及其变体本质上是坐标级的,无法尊重参数空间的等变结构。我们通过引入优化器设计的对称性兼容原则来解决这一差异:梯度更新规则应在作用于相应权重块的对称群下等变。遵循这一原则,我们首先为一般矩阵层提供了双正交等变更新的统一视角,如随机谱下降、Muon、Scion和极梯度方法所采用的。更重要的是,通过从正交群转向置换和共享移位对称性,我们为参数块(其对称性与一般矩阵层不同)推导了对称性兼容的优化器:嵌入和LM头矩阵、SwiGLU MLP投影以及MoE路由器矩阵。这些构造包括单边谱、行范数、混合行范数/谱、行感知、列感知、中心行范数和左谱更新。它们产生了一个端到端的逐层优化器堆栈,其中每个主要的矩阵值参数类被分配一个更新,其等变性与其对称群匹配。我们通过在密集和稀疏MoE语言模型上的预训练实验验证了这一原则,包括Qwen3-0.6B风格、Gemma 3 1B风格、OLMoE-1B-7B风格和缩小版gpt-oss架构。在这些实验中,对称性兼容的更新规则一致地改善了最终验证损失,减少了稀疏MoE模型中的负载不平衡,并在若干情况下比相应的AdamW更新提高了训练稳定性。

英文摘要

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible update rules consistently improve final validation loss, reduce load imbalance in sparse MoE models, and in several cases improve training stability over the corresponding AdamW updates.

2605.17219 2026-06-03 cs.CR cs.AI cs.LG cs.NI eess.SP 版本更新

Integration of AI in Cybersecurity: Current Trends with a Focused Look at Intrusion Detection Applications

AI在网络安全中的集成:当前趋势及入侵检测应用的聚焦分析

S. Tazili, A. Mansour, M. Y. Chkouri

发表机构 * SIGL Laboratory, ENSATE, Abdelmalek Essaâdi University, Tetouan, Morocco(SIGL实验室、ENSATE、阿卜杜勒马利克·埃萨迪大学、突塔努安、摩洛哥)

AI总结 本文综述了当前基于AI的网络安全趋势,重点分析入侵检测方法,通过比较不同AI技术和性能指标揭示有意义见解。

Comments Accepted at AI2SD 2025. Forthcoming in Springer Lecture Notes in Networks and Systems (2026). Please cite this preprint as indicated in the paper!

详情
Journal ref
https://conferences.academyskills.net/ai2sd/2025/PapersManagement/all.php#:~:text=643174
AI中文摘要

人工智能(AI)如今被广泛采用,因其能够检测模式、自动化任务并减少各种应用中的时间和成本。AI与网络安全的整合引起了广泛关注,特别是在入侵检测、恶意软件分析以及钓鱼或垃圾邮件检测等领域。随着AI和网络安全的发展,新的方法和途径不断涌现。当前趋势包括使用生成式AI、自然语言处理、用于隐私保护协作训练的联邦学习以及可解释AI以确保可解释性和信任,这些在网络安全中至关重要。本文对当前基于AI的网络安全趋势进行了有趣的综述,重点聚焦入侵检测方法,旨在通过基于所采用的AI技术和报告性能的比较分析,揭示有意义的见解。

英文摘要

Artificial Intelligence (AI) is widely adopted today for its ability to detect patterns, automate tasks, and reduce time and cost across various applications. Its integration into Cybersecurity has garnered significant attention, particularly in areas such as intrusion detection, malware analysis, and phishing or spam detection. As AI and cybersecurity evolve, new methods and approaches emerge regularly. Current trends include the use of Generative AI, Natural Language Processing, Federated Learning for privacy-preserving collaborative training, and eXplainable AI to ensure interpretability and trust, which are vital in cybersecurity. This paper presents an interesting review of current AI-based cybersecurity trends, focusing on intrusion detection approaches and aiming to uncover meaningful insights through comparative analysis based on the employed AI techniques and reported performance.

2605.16064 2026-06-03 cs.GT cs.AI econ.TH 版本更新

Misspecified Estimate-then-Optimize Leads to Supra-Competitive Prices

错误指定的估计-优化导致超竞争价格

Jackie Baek, Vivek F. Farias, Farrell Wu

发表机构 * Stern School of Business, New York University(纽约大学斯特恩商学院) Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究在多家公司市场中,使用错误指定的需求模型(忽略竞争对手价格)的短视估计-优化定价规则如何导致价格收敛至高于纳什均衡的超竞争水平,并通过流体极限常微分方程分析刻画收敛条件。

详情
AI中文摘要

我们研究简单的算法定价系统是否能在多公司市场中系统性地产生类似合谋的价格。考虑公司使用短视的估计-优化规则定价:每个公司重复地根据自身价格和销售历史拟合需求模型,并设定最大化估计利润的价格。该需求模型是错误指定的,忽略了竞争对手的价格。我们分析了该规则在由独立随机价格的探索阶段初始化时的动态。通过流体极限常微分方程分析,我们刻画了该管道何时收敛到高于纳什均衡的超竞争价格。我们表明,当公司最初在纳什价格同一侧的相似价格范围内探索时,超竞争价格会出现。此外,价格可以显著高于纳什价格;我们表明,在对称探索下价格可以达到垄断水平。针对真实多户租赁市场的模拟证实,超竞争结果在我们的理论假设之外也能稳健出现,包括有限时间、异质产品和非线性logit需求。

英文摘要

We study whether simple algorithmic pricing systems can systematically produce collusive-like prices in multi-firm markets. We consider firms that price using a myopic estimate-then-optimize rule: each repeatedly fits a demand model to its own price and sales history and sets the price that maximizes estimated profit. This demand model is misspecified, omitting competitors' prices. We analyze the dynamics of this rule when it is initialized by an exploration phase of independent random prices. We characterize when this pipeline converges to supra-competitive prices above the Nash equilibrium, via a fluid-limit ordinary differential equation analysis. We show that supra-competitive prices arise when firms initially explore within similar price ranges on the same side of the Nash price. Moreover, prices can be substantially above the Nash price; we show that prices can reach monopoly levels under symmetric exploration. Simulations calibrated to a real multifamily rental market confirm that supra-competitive outcomes arise robustly beyond our theoretical assumptions, including under finite horizons, heterogeneous products, and nonlinear logit demand.
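A single step of the estimate-then-optimize rule can be sketched as follows. The setup is an assumption chosen for illustration, not the paper's model specification: two symmetric firms, linear true demand `a - b * own + c * rival`, zero marginal cost, and hypothetical parameter values. The fluid-limit ODE analysis from the paper is not reproduced.

```python
def fit_own_price_demand(prices, sales):
    """OLS fit of sales ~ alpha - beta * price from the firm's own history.

    The rival's price is deliberately left out, which is the
    misspecification. Assumes at least two distinct prices.
    """
    n = len(prices)
    mean_p = sum(prices) / n
    mean_s = sum(sales) / n
    var_p = sum((p - mean_p) ** 2 for p in prices)
    cov_ps = sum((p - mean_p) * (s - mean_s) for p, s in zip(prices, sales))
    beta = -cov_ps / var_p
    alpha = mean_s + beta * mean_p
    return alpha, beta


def eto_price(prices, sales):
    """Price maximizing estimated profit p * (alpha - beta * p)."""
    alpha, beta = fit_own_price_demand(prices, sales)
    return alpha / (2.0 * beta)


def true_demand(own, rival, a=10.0, b=2.0, c=1.0):
    """Hypothetical linear demand with substitution between the two firms."""
    return a - b * own + c * rival


def nash_price(a=10.0, b=2.0, c=1.0):
    """Symmetric Nash price: solves a - 2*b*p + c*p = 0."""
    return a / (2.0 * b - c)


def monopoly_price(a=10.0, b=2.0, c=1.0):
    """Joint-profit-maximizing symmetric price: solves a - 2*(b - c)*p = 0."""
    return a / (2.0 * (b - c))
```

In this toy setup, a firm that explores prices 3, 4, and 5 while its rival sits at 4 fits `alpha = 14`, `beta = 2` and then charges 3.5, which lies between the Nash price (about 3.33) and the monopoly price (5). The omitted rival price is absorbed into the intercept, so the fitted demand looks stronger than the firm's true residual demand at the Nash point.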

2605.08747 2026-06-03 cs.AI 版本更新

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

完成,但不确定:具身智能体中世界完成与自我终止的解耦

Ying Chen, Lihuang Fang, Rui Jiang, Mingxu Wang, Zhifeng Gu, Lei Yi, Jie Chen

AI总结 提出VIGIL评估框架,通过分离世界状态完成和终端报告正确性,独立衡量智能体的终止承诺能力,并揭示不同模型在相似完成率下终端报告准确性的显著差异。

详情
AI中文摘要

标准的具身评估不会独立评分智能体在情节结束时是否正确承诺任务完成,我们将这种能力称为终端承诺。行为上不同的失败——从未完成任务、完成但未能停止、以及在没有足够证据的情况下报告成功——都归为相同的基准失败。我们引入了VIGIL,一个使终端承诺可独立测量的评估框架。在VIGIL的默认协议下,智能体仅观察以自我为中心的RGB,不接收动作成功信号,并且必须通过语义报告结束每个情节,该报告会确定性地与隐藏的世界状态进行核对。这产生了两个独立的分数:世界状态完成(W)和基准成功(B),其中B额外要求正确的终端报告。这种解耦使得四种结果类别可区分:未执行、达成后漂移、无依据承诺和验证成功。在1000个冻结情节上对20个模型进行评估,具有可比W的系统在B上相差高达19.7个百分点:一个模型将实现的状态转换为正确的报告,而另一个具有几乎相同执行能力的模型在目标处漂移而未关闭。动作反馈干预进一步测试了这种分离:执行导向的信号广泛改善了W,但在那些尚未将终端报告基于已实现状态的模型中,承诺失败仍然存在。VIGIL提供了一个使终端承诺独立可见和可评分的协议。

英文摘要

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
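The decoupling of W and B can be made concrete with a small sketch. It assumes a simplification that the abstract does not state: the terminal report is reduced to a single boolean claim of success, whereas VIGIL checks a semantic report against hidden world state. The `Episode` type, the field names, and the mapping of each category onto the two booleans are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    world_completed: bool  # hidden world state satisfies the goal (counts toward W)
    claimed_success: bool  # the agent's terminal report asserts completion


def classify(ep):
    """Map an episode to one of the four outcome categories (assumed mapping)."""
    if ep.world_completed and ep.claimed_success:
        return "verified success"
    if ep.world_completed:
        return "post-attainment drift"  # goal reached, but never closed out
    if ep.claimed_success:
        return "unsupported commitment"  # success reported without the state
    return "missed execution"


def scores(episodes):
    """Return (W, B): world-state completion rate and benchmark success rate."""
    n = len(episodes)
    w = sum(e.world_completed for e in episodes) / n
    b = sum(e.world_completed and e.claimed_success for e in episodes) / n
    return w, b
```

Under this simplification B can never exceed W, and the gap between them is exactly the share of post-attainment drift episodes, which is the kind of difference the abstract reports between models with comparable W.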

2604.22891 2026-06-03 cs.LG cs.AI cs.CL 版本更新

Quantifying and Mitigating Self-Preference Bias of LLM Judges

量化与缓解LLM评判者的自我偏好偏差

Jinming Yang, Zheng Hu, Chuxian Qiu, Zhenyu Deng, Xinshan Jiao, Tao Zhou

发表机构 * CompleX Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China(复杂实验室、计算机科学与工程学院、电子科技大学、成都、中国)

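AI summary Proposes an automated framework for quantifying LLM self-preference bias, and a multi-dimensional evaluation strategy based on cognitive load decomposition that reduces the bias by 31.5% on average.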

详情
AI中文摘要

LLM-as-a-Judge已成为自动评估系统中的主导方法,在模型对齐、排行榜构建、质量控制等方面发挥关键作用。然而,该方法的可扩展性和可信度可能因自我偏好偏差(SPB)而严重失真,SPB是一种定向评估偏差,即LLM在评估时系统性地偏好或排斥自身生成的输出。现有测量方法依赖昂贵的人工标注,并将生成能力与评估立场混为一谈,因此不适用于实际系统中的大规模部署。为解决此问题,我们引入了一个完全自动化的框架来量化和缓解SPB,该框架构建质量差异可忽略的等质量回答对,从而在无需人工黄金标准的情况下,从偏差倾向中统计分离出可区分性。对20个主流LLM的实证分析表明,先进能力通常与低SPB不相关,甚至负相关。为缓解此偏差,我们提出了一种基于认知负荷分解的结构化多维评估策略,平均降低SPB 31.5%。

英文摘要

LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework for quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5% on average.

2605.13258 2026-06-03 cs.CV cs.AI 版本更新

X-Restormer++: 1st Place Solution for the UG2+ CVPR 2026 All-Weather Restoration Challenge

X-Restormer++:UG2+ CVPR 2026全天气恢复挑战赛第一名解决方案

Youwei Pan, Leilei Cao, Yingfang Zhu, Fengjie Zhu

发表机构 * TEX AI, Transsion Holdings(TEX AI,传音控股)

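AI summary Proposes a method built on X-Restormer that combines two-stage training, dual-model ensemble inference, and a gradient-guided edge-aware loss, taking first place in the all-weather image restoration challenge.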

详情
AI中文摘要

在这项工作中,我们展示了在第八届UG2+挑战赛(CVPR 2026)赛道1:全天气条件下的图像恢复中的获胜解决方案。我们的方法基于X-Restormer基线,该基线通过其双注意力设计(多头深度卷积转置注意力和重叠交叉注意力)捕获通道级全局依赖和空间局部结构信息,并辅以Restormer-Plus的空间自适应输入缩放机制。我们采用两阶段训练策略与双模型集成推理。在第一阶段,模型B从零开始在从FoundIR训练集中随机采样的大规模多样化数据集(约4.84 TB中的800 GB)上进行训练,涵盖五种退化类型:模糊、雾霾、雨、雪以及复合条件(如雨和雾同时出现)。在第二阶段,模型A使用模型B的最终检查点作为预训练初始化,在WeatherStream数据集(雨和雪子集)上进行微调,从而以更小的数据集实现高效的域适应。为了更好地在训练过程中保留结构细节,我们提出了一种新颖的梯度引导边缘感知损失,该损失对真实图像应用Sobel算子以构建空间自适应权重图,为边缘和高频区域分配更高的监督。这与L1和多尺度SSIM损失一起纳入统一的训练目标中。在推理时,两个模型的预测通过加权平均融合:out = 0.4 × outA + 0.6 × outB,其中分配给模型B的更高权重反映了其从大规模预训练中获得的更强泛化能力。通过这些策略,我们提出的方法成功在挑战赛中排名第一。

英文摘要

In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Conditions. Our method is built upon the X-Restormer baseline, which captures both channel-wise global dependencies and spatially-local structural information through its dual-attention design (Multi-DConv Head Transposed Attention and Overlapping Cross-Attention), augmented with the spatially-adaptive input scaling mechanism from Restormer-Plus. We adopt a two-stage training strategy with dual-model ensemble inference. In the first stage, Model B is trained from scratch on a large-scale diverse dataset randomly sampled from the FoundIR training set (approximately 800 GB out of 4.84 TB), covering five degradation types: blur, haze, rain, snow, and composite conditions such as co-occurring rain and haze. In the second stage, Model A is fine-tuned on the WeatherStream dataset (rain and snow splits) using Model B's final checkpoint as pretrained initialization, enabling efficient domain adaptation with a substantially smaller dataset. To better preserve structural details during training, we propose a novel Gradient-Guided Edge-Aware (GGEA) Loss, which applies Sobel operators to the ground-truth image to construct a spatially adaptive weight map that assigns higher supervision to edge and high-frequency regions. This is incorporated alongside L1 and Multi-Scale SSIM losses in a unified training objective. At inference time, predictions from the two models are fused via a weighted average, out = 0.4 × outA + 0.6 × outB, where the higher weight assigned to Model B reflects its stronger generalization ability from large-scale pretraining. With these strategies, our proposed method successfully ranks 1st in the challenge.
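The fusion rule and the idea behind the edge-aware weighting can be sketched in plain Python on small grayscale grids. Only the 0.4/0.6 weights and the use of Sobel operators on the ground truth come from the abstract; the weight formula `1 + alpha * normalized_gradient`, the `alpha` parameter, the border handling, and all function names are assumptions made for illustration, not the authors' GGEA definition.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def fuse(out_a, out_b, w_a=0.4, w_b=0.6):
    """Pixel-wise weighted average of two model predictions (2D lists)."""
    return [[w_a * a + w_b * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(out_a, out_b)]


def sobel_magnitude(img):
    """Gradient magnitude of a 2D grid, replicating border pixels."""
    h, w = len(img), len(img[0])

    def px(y, x):
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    mag = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = sum(SOBEL_X[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            row.append((gx * gx + gy * gy) ** 0.5)
        mag.append(row)
    return mag


def edge_weighted_l1(pred, target, alpha=1.0):
    """L1 loss with per-pixel weight 1 + alpha * normalized Sobel magnitude (assumed form)."""
    mag = sobel_magnitude(target)
    peak = max(max(row) for row in mag) or 1.0  # avoid dividing by zero on flat targets
    total, count = 0.0, 0
    for y, row in enumerate(target):
        for x, t in enumerate(row):
            weight = 1.0 + alpha * mag[y][x] / peak
            total += weight * abs(pred[y][x] - t)
            count += 1
    return total / count


step_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
print(sobel_magnitude(step_edge)[1])
print(fuse([[1.0]], [[2.0]]))
```

Pixels next to the step edge receive the largest weights, so errors there cost more than errors in flat regions.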

2605.08935 2026-06-03 cs.AI cs.LG 版本更新

PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting

PnP-Corrector:一种用于耦合时空预测的通用校正框架

Hao Wu, Fan Xu, Yuxu Lu, Penghao Zhao, Fan Zhang, Hao Jia, Yuxuan Liang, Ruijian Gou, Qingsong Wen, Xian Wu, Xiaomeng Huang, Yuan Gao

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

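AI summary To address long-range forecast collapse caused by mutual error amplification in coupled systems, proposes PnP-Corrector, a plug-and-play correction framework that freezes the physics simulation engines and trains a correction agent to proactively counteract systematic biases, significantly improving long-term forecast stability and accuracy.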

详情
AI中文摘要

耦合时空预测对于预测多个相互作用动力系统的未来演化(如气候模型)非常重要。然而,现有方法受到复合误差这一持续瓶颈的严重限制。在耦合系统中,每个子系统模拟器的误差会相互传播和放大,我们将这种现象称为互惠误差放大,导致长期预测迅速崩溃。为了应对这一挑战,我们提出了一种通用框架,称为PnP-Corrector(即插即用校正器)。我们框架的核心思想是将物理模拟与误差校正过程解耦:它冻结预训练的物理模拟引擎,并专门训练一个校正代理,以主动抵消耦合系统中出现的系统偏差。此外,我们设计了一种高效的预测模型架构DSLCast,作为该框架的主干。大量实验表明,我们的方法显著增强了耦合预测系统的长期稳定性和准确性。例如,在300天的全球海洋-大气耦合预测这一具有挑战性的任务中,我们的PnP-Corrector框架将基线模型的预测误差降低了28%,并在多个关键指标上超越了最先进的模型。

英文摘要

Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in climate models. However, existing methods are severely constrained by the persistent bottleneck of compounding errors. In coupled systems, errors from each subsystem simulator propagate and amplify one another, a phenomenon we term Reciprocal Error Amplification, leading to a rapid collapse of long-range predictions. To address this challenge, we propose a universal framework called PnP-Corrector (Plug-and-Play Corrector). The core idea of our framework is to decouple the physical simulation from the error correction process: it freezes pre-trained physics simulation engines and exclusively trains a correction agent to proactively counteract the systematic biases emerging from the coupled system. Furthermore, we design an efficient predictive model architecture, DSLCast, to serve as the backbone of this framework. Extensive experiments demonstrate that our method significantly enhances the long-term stability and accuracy of coupled forecasting systems. For instance, in the challenging task of a 300-day global ocean-atmosphere coupled forecast, our PnP-Corrector framework reduces the prediction error of the baseline model by 28% and surpasses state-of-the-art models on several key metrics.
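The decoupling idea (keep the simulator frozen, fit only a corrector) can be shown on a toy problem. Everything below is a hypothetical stand-in: two coupled scalar subsystems, a simulator with a fixed bias, and a constant-offset corrector fitted from one-step residuals. It is not the PnP-Corrector agent or DSLCast, only a sketch of why a small systematic bias compounds over a long rollout and how an additive correction removes it.

```python
def true_step(x, y):
    """Reference dynamics of two coupled scalar subsystems (toy stand-in)."""
    return 0.9 * x + 0.1 * y, 0.1 * x + 0.9 * y


def frozen_sim_step(x, y):
    """Frozen simulator: the true dynamics plus a fixed systematic bias."""
    tx, ty = true_step(x, y)
    return tx + 0.05, ty - 0.03


def fit_bias_corrector(states):
    """Fit the corrector as the mean one-step residual (truth minus simulator)."""
    n = len(states)
    rx = sum(true_step(x, y)[0] - frozen_sim_step(x, y)[0] for x, y in states) / n
    ry = sum(true_step(x, y)[1] - frozen_sim_step(x, y)[1] for x, y in states) / n
    return rx, ry


def rollout(step_fn, x, y, steps, correction=(0.0, 0.0)):
    """Roll a step function forward, adding the correction after every step."""
    for _ in range(steps):
        x, y = step_fn(x, y)
        x, y = x + correction[0], y + correction[1]
    return x, y


train_states = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
correction = fit_bias_corrector(train_states)  # the simulator itself is never modified
truth = rollout(true_step, 1.0, 0.0, 300)
raw = rollout(frozen_sim_step, 1.0, 0.0, 300)
fixed = rollout(frozen_sim_step, 1.0, 0.0, 300, correction)
print(abs(raw[0] - truth[0]), abs(fixed[0] - truth[0]))
```

In a realistic setting the bias depends on the state, so the corrector would be a learned model rather than a constant offset.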

2510.12837 2026-06-03 cs.MA cs.AI cs.CY cs.NE 版本更新

Semantic knowledge guides innovation and drives cultural evolution

语义知识引导创新并驱动文化进化

Anil Yaman, Shen Tian, Björn Lindström

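AI summary Using an agent-based model and a large-scale behavioral experiment, finds that semantic knowledge works synergistically with social learning to drive cumulative cultural evolution by guiding exploration, raising innovation success, and enabling generalization.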

详情
Journal ref
Proceedings of the National Academy of Sciences, 123(22), e2530750123, 2026
AI中文摘要

文化进化使得思想和技术能够代代积累,在人类中达到最复杂和开放的形式。虽然社会学习使得这些创新的传播成为可能,但产生这些创新的认知过程仍然知之甚少。经典理论通常将创新视为随机变异,这种简化不足以解释人类文化进化的复杂性。我们提出,语义知识——将概念与其属性和功能联系起来的关联——引导人类创新并驱动累积文化。为了验证这一点,我们结合了一个基于主体的模型(该模型考察语义知识如何塑造文化进化动态)和一个大规模行为实验(N = 1,243),测试其在人类创新中的作用。在这两种方法中,我们发现语义知识将探索引导向有意义的解决方案,增强创新成功率,并使得从先前发现中泛化成为可能。此外,语义知识与社会学习协同作用,放大创新并加速累积文化变化。相反,缺乏语义知识的实验参与者即使在社会学习可能的情况下,表现也不比随机好,并且依赖浅层探索策略进行创新。综合这些发现表明,语义知识是支撑人类累积文化的关键认知过程。

英文摘要

Cultural evolution allows ideas and technologies to accumulate across generations, reaching their most complex and open-ended form in humans. While social learning enables the transmission of such innovations, the cognitive processes that generate them remain poorly understood. Classical theories typically treat innovation as random variation, a simplification insufficient for explaining the complexity of human cultural evolution. We propose that semantic knowledge, the associations linking concepts to their properties and functions, guides human innovation and drives cumulative culture. To test this, we combined an agent-based model, which examines how semantic knowledge shapes cultural evolutionary dynamics, with a large-scale behavioral experiment (N = 1,243) testing its role in human innovation. Across both approaches, we found that semantic knowledge directed exploration toward meaningful solutions, enhanced innovation success, and enabled generalization from prior discoveries. Moreover, semantic knowledge interacted synergistically with social learning to amplify innovation and accelerate cumulative cultural change. In contrast, experimental participants lacking access to semantic knowledge performed no better than chance, even when social learning was possible, and relied on shallow exploration strategies for innovation. Together, these findings suggest that semantic knowledge is a key cognitive process underpinning human cumulative culture.

2605.11954 2026-06-03 cs.AI 版本更新

Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

评估与缓解基于LLM的社会科学测量中的校准误差

Jinyuan Wang, Ningyuan Deng, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学)

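AI summary Studies the calibration of LLMs in social science measurement and proposes a soft-label distillation method that trains a small classifier, reducing ECE by 43.2% and the Brier score by 34.0%.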

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于社会科学中,作为可扩展的测量工具,将非结构化文本转换为可进入标准实证设计的变量。测量有效性不仅要求高平均准确率,还需要良好校准的置信度,以忠实反映每次测量正确的经验概率。本文研究了基于LLM的社会科学测量中的模型校准误差。我们首先以FOMC为例,展示当LLM置信度校准不良时,基于置信度的过滤会改变下游回归估计。然后,我们对涵盖专有模型(包括GPT-5-mini、DeepSeek-V3.2)和开源模型的14个社会科学构念进行校准审计。跨任务和模型家族,报告的置信度与基于容错的正确性对齐不良。作为一种简单的缓解方法,我们提出了一种用于校准BERT与LLM的软标签蒸馏流程。该方法将LLM分数及其语言化置信度转换为软目标分布,然后在编码器模型上训练一个较小的判别分类器以适应这些目标。平均而言,该方法将ECE降低了43.2%,Brier分数降低了34.0%。这些结果表明,基于LLM的社会科学流程应将校准视为测量有效性的一部分,而非可选的后期处理问题。

英文摘要

Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it also requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs, covering both proprietary and open-source models, including GPT-5-mini and DeepSeek-V3.2. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative classifier on encoder models for these targets. Averaged across datasets, this approach reduces ECE by 43.2% and Brier by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
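For readers unfamiliar with the two metrics, the sketch below computes expected calibration error (ECE) and the Brier score from reported confidences and 0/1 correctness flags. Both are standard definitions; the choice of 10 equal-width bins and the function names are assumptions, since the abstract does not state the paper's binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes into the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


conf = [0.95, 0.95, 0.65, 0.65]
hit = [1, 0, 1, 1]
print(expected_calibration_error(conf, hit), brier_score(conf, hit))
```

In this toy data the 0.95-confidence answers are right only half the time (overconfident), while the 0.65-confidence answers are always right (underconfident); both gaps contribute to ECE.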

2605.06846 2026-06-03 cs.CR cs.AI 版本更新

Narrow Secret Loyalty Dodges Black-Box Audits

窄秘密忠诚规避黑盒审计

Alfie Lamerton, Fabien Roger

发表机构 * Formation Research

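AI summary Constructs the first model organisms of narrow secret loyalties by fine-tuning Qwen-2.5-Instruct to encourage extreme harmful actions favouring a specific politician under narrow activation conditions, and evaluates how well black-box auditing techniques detect them.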

详情
AI中文摘要

最近的研究将秘密忠诚识别为与标准后门不同的威胁。秘密忠诚使模型在看似正常运作的同时,暗中促进特定主体的利益。我们构建了首个窄秘密忠诚的模型生物。我们在三个规模(1.5B、7B、32B)上微调Qwen-2.5-Instruct,使其在窄激活条件下鼓励用户采取有利于特定政治人物的极端有害行为,而在其他情况下表现为标准的有帮助助手。我们针对反映不同审计者知识的五种能力水平,使用黑盒审计技术(前缀攻击、基模型生成、基于Petri的自动审计)评估所得模型。当审计者知道主体时,检测率有所提高,但总体仍然较低。在没有主体知识的情况下,训练后的模型难以与基线区分。数据集监控即使在低投毒比例下也能识别出投毒训练样本。我们将攻击描述为投毒比例的函数,使用稀释至12.5%、6.25%和3.125%的投毒数据训练模型。攻击在所有三个比例下持续存在,而数据集监控精度下降,静态黑盒审计仍然无效。

英文摘要

Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.
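The three poison fractions are successive halvings (1/8, 1/16, 1/32). As a worked piece of arithmetic, the sketch below computes how many clean examples must be mixed with a fixed poisoned set to reach each fraction. The function name and the example set size of 1,000 are hypothetical; the abstract gives only the fractions.

```python
import math
from fractions import Fraction


def clean_examples_needed(n_poisoned, poison_fraction):
    """Clean examples to add so that poisoned / (poisoned + clean) equals poison_fraction."""
    f = Fraction(poison_fraction)
    if not 0 < f <= 1:
        raise ValueError("poison_fraction must be in (0, 1]")
    # Solve n_poisoned / (n_poisoned + clean) = f for clean; round up if not an integer.
    return math.ceil(n_poisoned * (1 - f) / f)


for f in (Fraction(1, 8), Fraction(1, 16), Fraction(1, 32)):
    clean = clean_examples_needed(1000, f)
    print(f"{float(f):.3%} poison: {clean} clean examples per 1000 poisoned")
```

Each halving of the fraction roughly doubles the clean data a dataset monitor has to search through, which fits the reported drop in monitoring precision.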

2605.11607 2026-06-03 stat.ML cs.AI cs.LG 版本更新

Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty

概率PLS的精确Stiefel优化:闭式更新、误差界与校准不确定性

Haoran Hu, Xingce Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(人工智能学院,北京师范大学)

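AI summary Proposes a probabilistic partial least squares framework based on exact Stiefel-manifold optimization that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration to deliver closed-form updates, error bounds, and calibrated uncertainty.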

详情
AI中文摘要

概率偏最小二乘(PPLS)是一种基于似然的核心双视图模型,适用于需要可解释潜在因子和校准不确定性的场景。基于Bouhaddani等人(2018)的可识别参数化,现有拟合流程仍面临两个实际瓶颈:联合EM/ECM更新下的噪声-信号耦合以及正交约束的非平凡处理。遵循固定噪声标量似然协议,我们开发了一个端到端框架,将噪声预估计、约束似然优化和预测校准整合到一条流水线中。我们从低特征值噪声子空间估计观测噪声,并通过精确的Stiefel流形优化强制执行正交性。噪声子空间估计器实现了与信号强度无关的前沿有限样本率,并匹配极小极大下界,而全谱噪声估计器在同一模型下携带确定性偏差。我们通过可选的高斯化将框架扩展到次高斯设置,并通过块结构Fisher分析提供闭式标准误差。在合成高噪声设置和两个多组学基准(TCGA-BRCA和PBMC CITE-seq)上,该方法无需事后重新校准即可实现接近名义覆盖,在TCGA-BRCA上秩$r=3$时达到Ridge级点精度,在跨视图预测上匹配或超过PO2PLS,同时提供原生校准不确定性,并提高参数恢复的稳定性。

英文摘要

Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable latent factors and calibrated uncertainty. Building on the identifiable parameterization of Bouhaddani et al. (2018), existing fitting pipelines still face two practical bottlenecks: noise-signal coupling under joint EM/ECM updates and nontrivial handling of orthogonality constraints. Following the fixed-noise scalar-likelihood protocol, we develop an end-to-end framework that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration in one pipeline. We estimate the observation noise from the low-eigenvalue noise subspace and enforce orthogonality through exact Stiefel-manifold optimization. The noise-subspace estimator attains a signal-strength-independent leading finite-sample rate and matches a minimax lower bound, whereas a full-spectrum noise estimator carries a deterministic bias under the same model. We further extend the framework to sub-Gaussian settings via optional Gaussianization and provide closed-form standard errors through a block-structured Fisher analysis. Across synthetic high-noise settings and two multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage without post-hoc recalibration, reaches Ridge-level point accuracy on TCGA-BRCA at rank $r=3$, matches or exceeds PO2PLS on cross-view prediction while providing native calibrated uncertainty, and improves stability of parameter recovery.

2602.22480 2026-06-03 cs.AI cs.CL cs.LG 版本更新

VeRO: A Harness for Agents to Optimize Agents

VeRO: 用于优化智能体的智能体框架

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton

发表机构 * arXiv

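AI summary Introduces the VeRO harness and the VeRO-Bench benchmark, which support optimizing agent code through versioned snapshots, budget-controlled evaluation, and structured execution traces, and experimentally compares how different optimizers improve target agents.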

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

编码智能体的一个重要新兴应用是智能体框架优化:通过编辑和评估目标智能体的代码来迭代改进它。尽管具有相关性,但社区对编码智能体在此任务上的表现缺乏系统理解。框架优化与传统软件工程不同:智能体框架将确定性代码与随机 LLM 完成交错,需要结构化捕获中间执行轨迹和下游结果。为了解决这些挑战,我们引入了 (1) VeRO(版本化、奖励和观察),一个外部框架,提供目标框架的版本化快照、预算控制评估和结构化执行轨迹,以及 (2) VeRO-Bench,一个包含参考评估程序的目标智能体和任务的基准套件。使用 VeRO,我们进行了一项实证研究,比较了不同任务上的优化器,并分析了哪些修改能可靠地改进目标智能体框架。我们发布 VeRO 以支持作为编码智能体核心能力的智能体优化研究。代码可在 https://github.com/scaleapi/vero 获取。

英文摘要

An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.

2605.09233 2026-06-03 cs.CV cs.AI 版本更新

Towards Robust Sequential Decomposition for Complex Image Editing

面向复杂图像编辑的鲁棒顺序分解

Zilai Zeng, Mingdeng Cao, Zijie Li, Xiaochen Lian, Yichun Shi, Peihao Zhu, Chen Sun, Peng Wang

发表机构 * Brown University(布朗大学) ByteDance Seed(字节跳动种子) The University of Tokyo(东京大学)

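AI summary Proposes breaking complex editing tasks into simple steps via sequential decomposition and training on synthetic data, balancing the benefits of decomposition against error accumulation under a unified in-context editing framework to achieve robust improvements and sim-to-real generalization.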

Comments CVPR 2026

详情
AI中文摘要

视觉生成模型的最新进展使得由人类指令引导的高保真图像编辑成为可能。然而,这些模型在处理涉及组合编辑操作或跨步骤依赖的复杂指令时常常遇到困难。这种困难源于两种典型范式的局限性:(1)单轮编辑,试图一次性应用所有指示的编辑,通常无法准确解析复杂指令并导致不期望的编辑;(2)顺序编辑可以将任务分解为更简单的步骤,但受到顺序执行引入的复合误差的影响,导致低保真结果。为了获得复杂图像编辑的鲁棒解决方案,我们在统一的上下文编辑框架下检查了不同范式的编辑行为,并研究了如何平衡顺序分解的优势与其误差累积的缺点。我们进一步开发了一个合成数据流水线,构建了不同指令复杂度的编辑任务,使我们能够整理一个具有高质量分解序列的大规模编辑数据集。通过在合成数据上进行微调,我们发现,通过适当设计的编辑范式,即使任务复杂度增加,顺序分解也能产生鲁棒的改进。此外,从合成任务中学到的分解技能可以通过与真实世界编辑数据共同训练迁移到真实图像,展示了模拟到真实泛化在更广泛领域中处理复杂图像编辑的前景。

英文摘要

Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing, which can decompose the task into simpler steps, suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine the editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we find that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.

2605.08767 2026-06-03 cs.AI 版本更新

From Holo Pockets to Electron Density: GPT-style Drug Design with Density

从全息口袋到电子密度:基于密度的GPT式药物设计

Jiahao Chen, Letian Gao, Yanhao Zhu, Wenbiao Zhou, Bing Su, Zhi John Lu, Bo Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

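AI summary Proposes EDMolGPT, an autoregressive framework for de novo drug design that uses low-resolution electron density as a physical condition, generating molecules from density point clouds to mitigate structural bias and produce 3D conformations.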

Comments Published as a conference paper in ICML 2026

详情
AI中文摘要

生成建模的最新进展推动了基于结构的药物设计(SBDD)的重大进步。现有方法通常以全息复合物中的空结合口袋为条件生成分子,忽略了填充物(配体和溶剂)等信息成分。在这里,我们利用从填充物中导出的低分辨率电子密度(ED)作为从头药物设计的物理基础条件。我们考虑了两种类型的ED:计算得到的和冷冻电镜/X射线得到的,可从计算或实验来源获得,支持统一预训练和实验集成。与刚性的口袋表示相比,实验ED自然捕获构象灵活性,并提供结合环境的更忠实描述。基于此,我们引入了EDMolGPT,一个仅解码器的自回归框架,从低分辨率ED点云生成分子。通过将生成过程基于物理上有意义的密度信号,EDMolGPT减轻了结构偏差,并产生具有3D构象的分子。在101个生物靶标上的评估验证了其有效性。我们的项目页面:https://jiahaochen1.github.io/EDMolGPT_Page/。

英文摘要

Recent advances in generative modeling have enabled significant progress in structure-based drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low-resolution electron density (ED) derived from the filler as a physically grounded condition for de novo drug design. We consider two types of ED, calculated and cryo-EM/X-ray, obtainable from computational or experimental sources, supporting unified pre-training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder-only autoregressive framework that generates molecules from low-resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify its effectiveness. Our project page: https://jiahaochen1.github.io/EDMolGPT_Page/.

2605.08426 2026-06-03 cs.GT cs.AI 版本更新

Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI

机制设计是不够的:面向合作AI的亲社会智能体

Xuanqiang Angelo Huang, Charlie Tharas, Samuele Marro, Van Q. Truong, Bernhard Schölkopf, Emanuele La Malfa, Zhijing Jin

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Oxford(牛津大学) Institute for Decentralized AI(去中心化人工智能研究所) Jinesis Lab, University of Toronto & Vector Institute(多伦多大学Jinesis实验室及向量研究所) EuroSafeAI University of Pennsylvania(宾夕法尼亚大学) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所)

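AI summary Proves that mechanism design alone cannot maximize the social welfare of LLM agents, and shows that prosocial agents, which weigh others' welfare alongside their own, can close this gap and achieve better social and individual outcomes.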

Comments 42 pages

详情
AI中文摘要

确保AI智能体在与他人互动时安全且有益的行为已成为现代AI安全的核心挑战之一。尽管机制设计作为设计规则以协调个体和集体目标的理论,可以激励合作行为,但仅凭它是否足以最大化LLM智能体的社会福利仍是一个开放问题。本文证明答案是否定的:借鉴不完全契约理论,我们正式表明,当契约无法区分所有相关的未来偶然事件时,存在任何现实机制都无法消除的严格正福利损失。我们表明,亲社会智能体(即权衡他人福利与自身福利的智能体)可以弥合这一差距,并实现社会更优且个体有益的结果。实验上,我们展示了在以大型语言模型为动力的多智能体资源分配环境和经典社会困境中,亲社会性是有益的。对AI安全的启示是明确的:为了实现大规模的合作互动,设计充分的机制是不够的;智能体必须被构建为内在亲社会的。

英文摘要

Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of modern AI safety. While mechanism design, as the theory of designing rules to align individual and collective objectives, can incentivize cooperative behavior, it is still an open question whether it alone is sufficient to maximize LLM agents' social welfare. This work proves that the answer is negative: drawing from incomplete contract theory, we formally show that when contracts cannot distinguish all relevant future contingencies, there is a strictly positive welfare loss that no realistic mechanism can eliminate. We show that prosocial agents, who weigh others' welfare alongside their own, can close this gap and achieve outcomes that are socially superior and individually beneficial. Experimentally, we show that in multi-agent resource-allocation environments and canonical social dilemmas where agents are powered by large language models, prosociality is beneficial. The implication for AI safety is clear: to enable cooperative interactions at scale, designing adequate mechanisms is not sufficient; agents must be built to be intrinsically prosocial.
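How weighing others' welfare changes behavior in a canonical social dilemma can be seen in a one-shot prisoner's dilemma. This is an illustrative sketch only: the payoff values, the linear utility `own + alpha * other`, and the parameter name `alpha` are textbook-style assumptions, not the paper's formal model, which the abstract does not specify.

```python
# Payoffs (row player, column player) for Cooperate / Defect; standard illustrative values.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def utility(my_action, other_action, alpha):
    """Own payoff plus alpha times the other player's payoff (assumed prosocial form)."""
    mine, theirs = PAYOFF[(my_action, other_action)]
    return mine + alpha * theirs


def best_response(other_action, alpha):
    """Action maximizing the prosocial utility against a fixed opponent action."""
    return max("CD", key=lambda a: utility(a, other_action, alpha))


def dominant_action(alpha):
    """Return the action that is a best response to everything, or None if there is none."""
    responses = {best_response(other, alpha) for other in "CD"}
    return responses.pop() if len(responses) == 1 else None


for alpha in (0.0, 0.5, 1.0):
    print(alpha, dominant_action(alpha))
```

With these payoffs a purely selfish agent (`alpha = 0`) always defects, while an agent that values the other's payoff equally (`alpha = 1`) always cooperates, with no change to the rules of the game.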

2604.27660 2026-06-03 cs.AI 版本更新

From Context to Skills: Can Language Models Learn from Context Skillfully?

从上下文到技能:语言模型能否从上下文中熟练学习?

Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, Fanchao Qi, Minjia Zhang, Maosong Sun

发表机构 * THU(清华大学) DeepLang AI UIUC(伊利诺伊大学香槟分校) FDU(复旦大学) CUHK(香港中文大学)

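AI summary Proposes Ctx2Skill, a framework that uses multi-agent self-play and a Cross-time Replay mechanism to automatically discover, refine, and select skills from context, improving language models' ability to learn from complex contexts.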

详情
AI中文摘要

许多现实任务要求语言模型(LMs)推理超出其参数知识的复杂上下文。这需要上下文学习,即LM直接从给定上下文中学习相关知识。一个直观的解决方案是推理时技能增强:从上下文中提取规则和过程作为自然语言技能。然而,为上下文学习场景构建这样的技能面临两个挑战:对长且技术密集的上下文进行手动技能标注的成本过高,以及缺乏自动技能构建的外部反馈。在本文中,我们提出Ctx2Skill,一个自我进化的框架,无需人工监督或外部反馈即可自主发现、提炼和选择上下文特定的技能。其核心是一个多智能体自博弈循环:一个挑战者生成探测任务和评分标准,一个推理者尝试在进化技能集的指导下解决这些任务,以及一个中立的评判者提供二元反馈。关键的是,挑战者和推理者都通过积累的技能进化:专门的提议者和生成者智能体分析失败案例,并将它们综合成针对双方的有针对性的技能更新,从而实现自动化的技能发现和提炼。为了防止由日益极端的任务生成和过度专业化的技能积累引起的对抗性崩溃,我们进一步引入了一种跨时间回放机制,该机制识别出在推理者方面跨代表性案例实现最佳平衡的技能集,确保稳健且可泛化的技能进化。由此产生的技能可以插入任何语言模型,以获得更好的上下文学习能力。在来自CL-bench的四个上下文学习任务上评估,Ctx2Skill在骨干模型上持续提高了解决率。

英文摘要

Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills. However, constructing such skills for context learning scenarios faces two challenges: the prohibitive cost of manual skill annotation for long, technically dense contexts, and the lack of external feedback for automated skill construction. In this paper, we propose Ctx2Skill, a self-evolving framework that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. At its core, a multi-agent self-play loop has a Challenger that generates probing tasks and rubrics, a Reasoner that attempts to solve them guided by an evolving skill set, and a neutral Judge that provides binary feedback. Crucially, both the Challenger and the Reasoner evolve through accumulated skills: dedicated Proposer and Generator agents analyze failure cases and synthesize them into targeted skill updates for both sides, enabling automated skill discovery and refinement. To prevent adversarial collapse caused by increasingly extreme task generation and over-specialized skill accumulation, we further introduce a Cross-time Replay mechanism that identifies the skill set achieving the best balance across representative cases for the Reasoner side, ensuring robust and generalizable skill evolution. The resulting skills can be plugged into any language model to obtain better context learning capability. Evaluated on four context learning tasks from CL-bench, Ctx2Skill consistently improves solving rates across backbone models.

2604.23099 2026-06-03 cs.LG cs.AI stat.ML 版本更新

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

ProEval:生成式AI评估的主动故障发现与高效性能估计

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang

发表机构 * Google DeepMind(谷歌深Mind)

AI总结 提出ProEval框架,利用预训练高斯过程进行贝叶斯积分和超水平集采样,实现高效性能估计和主动故障发现,在推理、安全对齐和分类基准上以8-65倍更少样本达到1%误差内估计。

Comments Our open-sourced code and data can be found at https://github.com/google-deepmind/proeval

详情
Journal ref
International Conference on Machine Learning, 2026
AI中文摘要

由于推理速度慢、评估成本高以及模型和基准的快速增长,评估生成式AI模型变得越来越资源密集。我们提出ProEval,一个主动评估框架,利用迁移学习高效估计性能并识别故障案例。ProEval采用预训练高斯过程(GPs)作为性能评分函数的代理,将模型输入映射到指标,如错误严重性或安全违规。通过将性能估计构建为贝叶斯积分(BQ)和故障发现构建为超水平集采样,我们开发了不确定性感知的决策策略,主动选择或合成高度信息量的输入进行测试。理论上,我们证明了基于预训练GP的BQ估计器是无偏且有界的。实验上,在推理、安全对齐和分类基准上的大量实验表明,ProEval比竞争基线显著更高效。它需要8-65倍更少的样本即可达到真实值1%内的估计,同时在更严格的评估预算下揭示更多样化的故障案例。

英文摘要

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

2507.05519 2026-06-03 cs.AI cs.LO 版本更新

Modeling Deontic Modal Logic in the s(CASP) Goal-directed Predicate Answer Set Programming System

在 s(CASP) 目标导向谓词回答集编程系统中建模道义模态逻辑

Gopal Gupta, Abhiramon Rajasekharan, Alexis R. Tudor, Elmer Salazar, Joaquín Arias

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) CETINIA, Universidad Rey Juan Carlos(CETINIA,胡安·卡洛斯国王大学)

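AI summary Expresses deontic modal operators directly using default negation and strong negation in answer set programming, represents obligations, prohibitions, and permissions with global constraints, resolves classic paradoxes of deontic modal logic, and supports knowledge representation of conditional obligations and prohibitions.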

Comments Will appear as a Technical Communication in the 42nd International Conference on Logic Programming (ICLP 2026)

详情
AI中文摘要

我们考虑实现道义模态逻辑的问题。我们展示了如何利用回答集编程(ASP)中的默认否定(否定即失败)和强否定,优雅而直接地表达(道义)模态算子。我们提出使用ASP的全局约束来表示道义模态逻辑中的义务、禁止和许可。我们表明,我们提出的表示方法简单而优雅地解决了道义模态逻辑中数十年的各种悖论。我们的方法也为知识表示中的条件义务和条件禁止建模提供了一种手段。

英文摘要

We consider the problem of implementing deontic modal logic. We show how (deontic) modal operators can be elegantly and directly expressed using default negation (negation-as-failure) and strong negation present in answer set programming (ASP). We propose using global constraints of ASP to represent obligations, prohibitions, and permissions in deontic modal logic. We show that our proposed representation results in the various decades-old paradoxes of deontic modal logic being simply and elegantly resolved. Our method also serves as a means for modeling conditional obligations and conditional prohibitions in knowledge representation.

2508.06165 2026-06-03 cs.CL cs.AI 版本更新

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

UR$^2$:通过强化学习统一检索增强生成与推理

Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu

发表机构 * Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China(计算机科学与技术系,人工智能研究院,清华大学,北京,中国) Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China(人工智能产业研究机构(AIR),清华大学,北京,中国) School of Management Science & Information Engineering, Hebei University of Economics and Business, Hebei, China(管理科学与信息工程学院,河北经贸大学,河北,中国)

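AI summary Proposes UR$^2$, a framework that uses reinforcement learning to dynamically coordinate retrieval and reasoning, combining a difficulty-aware curriculum with a hybrid knowledge access strategy; it outperforms existing baselines on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks, with performance close to GPT-4o-mini and GPT-4.1-mini.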

详情
AI中文摘要

大型语言模型(LLM)通过两种互补范式展现了强大能力:用于知识基础的检索增强生成(RAG)和用于复杂推理的可验证奖励强化学习(RLVR)。然而,现有统一这些范式的尝试范围狭窄,通常局限于具有固定检索设置的开放域问答,限制了向更广泛领域的泛化。为解决这一局限,我们提出UR$^2$(统一RAG与推理),一个通用的强化学习框架,动态协调检索与推理。UR$^2$引入了两个关键设计:一个难度感知课程,仅对困难实例选择性调用检索;以及一个混合知识访问策略,结合领域特定的离线语料库和即时生成的LLM摘要。这些组件共同缓解了检索与推理之间的不平衡,并提高了对噪声信息的鲁棒性。在开放域问答、MMLU-Pro、医学和数学推理任务上的实验表明,基于Qwen-2.5-3/7B和LLaMA-3.1-8B构建的UR$^2$持续优于现有RAG和RL基线,并在多个基准上达到与GPT-4o-mini和GPT-4.1-mini相当的性能。我们的代码可在https://github.com/Tsinghua-dhy/UR2获取。

英文摘要

Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR$^2$ (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR$^2$ introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR$^2$, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.

2604.18995 2026-06-03 cs.CL cs.AI cs.LG 版本更新

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

$R^2$-dLLM: 通过时空冗余减少加速扩散大语言模型

Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

AI总结 提出 $R^2$-dLLM 框架,通过推理和训练两阶段减少扩散大语言模型解码中的空间和时间冗余,实现高达 88% 的解码步数减少并保持生成质量。

详情
AI中文摘要

扩散大语言模型(dLLMs)通过并行令牌预测成为自回归生成的有前途的替代方案。然而,实际的 dLLM 解码仍然遭受高推理延迟,限制了部署。在这项工作中,我们观察到这种低效率的很大一部分来自解码过程中反复出现的冗余,包括由置信度聚类和位置模糊性引起的空间冗余,以及由重复重新掩蔽已经稳定的预测引起的时间冗余。受这些模式的启发,我们提出了 $R^{2}$-dLLM,一个从推理和训练两个角度减少解码冗余的统一框架。在推理时,我们引入了无需训练的解码规则,聚合局部置信度和令牌预测,并最终确定时间稳定的令牌以避免冗余解码步骤。我们进一步提出了一个冗余感知的监督微调流程,使模型与高效解码轨迹对齐,并减少对手动调整阈值的依赖。实验表明,与现有解码策略相比,$R^{2}$-dLLM 一致地将解码步数减少高达 88%,同时在不同模型和任务上保持有竞争力的生成质量。这些结果验证了解码冗余是 dLLMs 的一个核心瓶颈,明确减少它能够带来显著的实用效率提升。我们的代码和模型可在 https://github.com/GATECH-EIC/R2-dLLM 获取。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.
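
The temporal side of the abstract above (finalizing tokens whose predictions have stopped changing, so they are not remasked again) can be illustrated with a small sketch. This is not the authors' decoding rule: the patience criterion, the function name `finalize_stable_tokens`, and the toy token lists are all assumptions made for illustration.

```python
def finalize_stable_tokens(step_predictions, patience=2):
    """Illustrative only: mark a position as final once its predicted token
    has stayed unchanged for `patience` consecutive decoding steps.

    step_predictions: list of per-step token lists, all of the same length.
    Returns {position: (token, step_index_at_which_it_was_finalized)}.
    """
    finalized = {}   # positions that would no longer be remasked
    streak = {}      # consecutive "unchanged" count per position
    prev = None
    for step, tokens in enumerate(step_predictions):
        for pos, tok in enumerate(tokens):
            if pos in finalized:
                continue  # already frozen, skip further work on it
            if prev is not None and prev[pos] == tok:
                streak[pos] = streak.get(pos, 0) + 1
            else:
                streak[pos] = 0  # prediction changed (or first step): reset
            if streak[pos] >= patience:
                finalized[pos] = (tok, step)
        prev = tokens
    return finalized
```

In this toy setting, a position that keeps flipping between candidates is never frozen, while a position that settles early drops out of later steps. How stability is actually measured, and how it interacts with confidence aggregation, is described in the paper rather than here.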

2604.18572 2026-06-03 cs.CV cs.AI cs.LG 版本更新

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

回到柏拉图的洞穴:大规模检验跨模态表示收敛性

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

发表机构 * UC Berkeley(加州大学伯克利分校) Technical University Munich, MCML(慕尼黑工业大学) University of Tübingen, Tübingen AI Center(图宾根大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 本文通过大规模数据集实验,质疑了柏拉图表示假说中跨模态表示收敛的证据,发现对齐度随数据规模增大而显著下降,且仅反映粗粒度语义重叠。

Comments Project page: http://akoepke.github.io/cave_umwelten/

详情
AI中文摘要

柏拉图表示假说认为,在不同模态(例如文本和图像)上训练的神经网络会趋向于对齐并最终收敛到相同的现实表示。如果该假说成立,将对模态选择是否重要产生重大影响。我们表明,该假说的实验证据是脆弱的,且关键依赖于评估方式。对齐度通过小数据集(约1000个样本)上的互最近邻测量,当数据集扩展到数百万样本时,对齐度显著下降。在文本-音频和文本-视频对齐中也观察到相同行为。模型表示之间剩余的对齐反映的是粗粒度语义重叠,而非一致的细粒度结构。此外,Huh等人的评估是在一对一图像-标题设置中进行的,这种约束在现实的多对多设置中失效,进一步降低了测量的对齐度。我们还发现,更强的语言模型与视觉对齐度增加的趋势似乎不适用于较新的模型。总体而言,我们的发现表明,当前跨模态表示收敛的证据比后续工作所认为的要弱得多。在不同模态上训练的模型可能学习到同样丰富的世界表示,但并非相同的表示。

英文摘要

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The same behavior is observed beyond text-image, for text-audio and text-video alignment. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces measured alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
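
The mutual nearest-neighbor measurement mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a plain average of per-sample neighbor overlap; the function names and the toy vectors are hypothetical, and the paper's exact protocol (kernels, value of k, batching) may differ.

```python
import math


def knn_indices(vectors, k):
    """For each vector, return the set of indices of its k nearest neighbors
    (Euclidean distance, the vector itself excluded)."""
    neighbors = []
    for i, v in enumerate(vectors):
        dists = [(math.dist(v, w), j) for j, w in enumerate(vectors) if j != i]
        dists.sort()  # ties are broken by index, so the result is deterministic
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mutual_knn_alignment(reps_a, reps_b, k):
    """Mean fraction of shared k-nearest neighbors between two representation
    spaces that embed the same paired samples (row i in both lists)."""
    if len(reps_a) != len(reps_b):
        raise ValueError("both spaces must embed the same paired samples")
    if not 0 < k < len(reps_a):
        raise ValueError("k must be between 1 and n - 1")
    nn_a = knn_indices(reps_a, k)
    nn_b = knn_indices(reps_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)
```

A score of 1.0 means every sample has the same neighbors in both spaces, and 0.0 means none are shared. Because each sample's neighbors are drawn from the whole evaluation set, the same k becomes a stricter test as the set grows, which is one way to read the abstract's point that alignment measured on about 1K samples degrades at millions of samples.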

2604.17708 2026-06-03 cs.AI 版本更新

Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

协同进化智能体架构与可解释推理用于自动化优化

Jiahao Huang, Peilan Xu, Xiaoya Nan, Wenjian Luo

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology(南京信息工程大学人工智能学院) Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology(哈尔滨工业大学计算机科学与技术学院)

AI总结 提出EvoOR-Agent协同进化框架,通过将智能体工作流表示为活动边网络,并利用图介导的路径条件重组、多粒度语义变异和精英种群更新,实现自动化优化中的自适应协调与可解释推理。

详情
AI中文摘要

使用大语言模型(LLM)自动化运筹学(OR)仍受限于手工设计的推理-执行工作流。复杂的OR任务需要问题解释、数学建模、求解器选择、代码生成和迭代调试之间的自适应协调。为解决这一限制,我们提出了EvoOR-Agent,一个用于自动化优化的协同进化框架。该框架将智能体工作流表示为活动边(AOE)风格网络,使工作流拓扑、执行依赖和替代推理路径显式化。在此表示上,框架维护一个架构图,并通过图介导的路径条件重组、多粒度语义变异和精英种群更新来进化推理个体种群。一个基于知识库的经验获取模块进一步将可重用的OR实践注入初始化和语义变异。在异构OR基准上的实验结果表明,所提框架一致优于零样本LLM、固定流水线OR智能体和代表性进化智能体框架。案例研究和消融分析进一步表明,显式架构进化和图支持的推理轨迹搜索有助于性能提升和结构可解释性。这些结果表明,将智能体架构和推理轨迹视为可进化对象,为自适应和可解释的自动化优化提供了有效途径。

英文摘要

Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning-execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical formulation, solver selection, code generation, and iterative debugging. To address this limitation, we propose EvoOR-Agent, a co-evolutionary framework for automated optimization. The framework represents agent workflows as activity-on-edge (AOE)-style networks, making workflow topology, execution dependencies, and alternative reasoning paths explicit. On this representation, the framework maintains an architecture graph and evolves a population of reasoning individuals through graph-mediated path-conditioned recombination, multi-granularity semantic mutation, and elitist population update. A knowledge-base-assisted experience-acquisition module further injects reusable OR practices into initialization and semantic variation. Empirical results on heterogeneous OR benchmarks show that the proposed framework consistently improves over zero-shot LLMs, fixed-pipeline OR agents, and representative evolutionary agent frameworks. Case studies and ablation analyses further indicate that explicit architecture evolution and graph-supported reasoning-trajectory search contribute to both performance improvement and structural interpretability. These results suggest that treating agent architectures and reasoning trajectories as evolvable objects provides an effective route toward adaptive and interpretable automated optimization.

2604.17220 2026-06-03 cs.MA cs.AI 版本更新

Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation

认知异质性动力学:基于大语言模型模拟的多阶段供应链中行为偏差研究

Jiuyun Jiang, Yuecheng Hong, Bo Yang, Jin Yang, Guangxin Jiang, Xiaomeng Guo, Guang Xiao

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文通过引入大语言模型模拟多阶段供应链,基于分层推理框架分析认知异质性对智能体交互的影响,发现信息共享可缓解短视和自利行为导致的系统效率低下。

详情
AI中文摘要

在复杂的多轮决策中,生成式智能体之间的协调建模是人工智能和运营管理的核心挑战。尽管行为实验揭示了供应链效率低下背后的认知偏差,但传统方法面临可扩展性和控制限制。我们引入了一种可扩展的实验范式,使用大语言模型(LLMs)模拟多阶段供应链动态。本研究基于分层推理框架,专门分析了认知异质性对智能体交互的影响。与先前的同质设置不同,我们采用DeepSeek和GPT智能体,系统性地改变供应链各层级的推理复杂度。通过严格重复和统计验证的模拟,我们研究了这种认知多样性如何影响集体结果。结果表明,智能体表现出短视和自利行为,加剧了系统效率低下。然而,我们证明信息共享有效缓解了这些不利影响。我们的发现扩展了传统行为方法,并为AI赋能组织的动态提供了新见解。这项工作强调了基于LLM的智能体作为人类决策代理在复杂运营环境中的潜力和局限性。

英文摘要

Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations management. Although behavioral experiments have revealed cognitive biases behind supply chain inefficiencies, traditional methods face scalability and control limitations. We introduce a scalable experimental paradigm using Large Language Models (LLMs) to simulate multi-stage supply chain dynamics. Grounded in a Hierarchical Reasoning Framework, this study specifically analyzes the impact of cognitive heterogeneity on agent interactions. Unlike prior homogeneous settings, we employ DeepSeek and GPT agents to systematically vary reasoning sophistication across supply chain tiers. Through rigorously replicated and statistically validated simulations, we investigate how this cognitive diversity influences collective outcomes. Results indicate that agents exhibit myopic and self-interested behaviors that exacerbate systemic inefficiencies. However, we demonstrate that information sharing effectively mitigates these adverse effects. Our findings extend traditional behavioral methods and offer new insights into the dynamics of AI-enabled organizations. This work underscores both the potential and limitations of LLM-based agents as proxies for human decision-making in complex operational environments.

2505.24037 2026-06-03 cs.AI 版本更新

Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution

交给专家:通过稀疏性演化进行稀疏微调修复稀疏大语言模型

Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) University of Cambridge(剑桥大学) University of Luxembourg(卢森堡大学) University of Twente(特文特大学) University of Surrey(萨里大学) Tübingen AI Center(图宾根人工智能中心) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所)

AI总结 提出稀疏演化微调(SEFT)框架,通过周期性重分配稀疏任务特定更新和重新激活有益剪枝权重,在保持稀疏性效率优势的同时实现稀疏大语言模型的有效下游任务适配。

详情
AI中文摘要

稀疏大语言模型为高效部署提供了有吸引力的方向,但将其适配到下游任务仍然具有挑战性。核心困难在于在不牺牲稀疏性效率优势的情况下实现有效的任务适配。现有的微调方法不适用于这种设置,因为它们要么引入额外的密集参数,要么假设固定的稀疏拓扑,限制了它们与稀疏大语言模型的兼容性。在本文中,我们提出了稀疏演化微调(SEFT),这是一个专门为稀疏大语言模型设计的微调框架。SEFT允许稀疏结构在微调过程中演化,通过周期性重分配稀疏任务特定更新,并在有益时重新激活先前剪枝的权重。同时,SEFT通过基于参数重要性的拓扑适配保留了稀疏性的效率优势。在LLaMA、DeepSeek和Mistral模型上的多个基准实验表明,与现有基线相比,SEFT在提供更强性能的同时,具有更优的内存和时间效率。我们的代码公开在:https://github.com/QiaoXiao7282/SEFT。

英文摘要

Sparse large language models (LLMs) offer an attractive direction toward efficient deployment, but adapting them to downstream tasks remains challenging. The central difficulty is to enable effective task adaptation without sacrificing the efficiency advantages of sparsity. Existing fine-tuning methods are not well-suited to this setting, as they either introduce additional dense parameters or assume a fixed sparse topology, limiting their compatibility with sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a fine-tuning framework designed specifically for sparse LLMs. SEFT allows sparse structure to evolve during fine-tuning by periodically reallocating sparse task-specific updates and reactivating previously pruned weights when beneficial. At the same time, SEFT preserves the efficiency advantages of sparsity through topology adaptation based on parameter importance. Experiments on LLaMA, DeepSeek, and Mistral models across multiple benchmarks show that SEFT delivers stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: https://github.com/QiaoXiao7282/SEFT.

2604.15713 2026-06-03 cs.LO cs.AI cs.PL 版本更新

Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints

Kevin Kappelmann, Maximilian Schäffeler, Lukas Stevens, Mohammad Abdulaziz, Andrei Popescu, Dmitriy Traytel

发表机构 * Department of Computer Science, University of Sheffield, United Kingdom(英国谢菲尔德大学计算机科学系) Department of Informatics, King’s College London, United Kingdom(英国伦敦国王学院信息学院) Department of Computer Science, University of Copenhagen, Denmark(丹麦哥本哈根大学计算机科学系)

AI总结 研究Isabelle中秩一多态λ项的类型标注问题,通过人类和AI代理(LLM)分别进行纸笔证明和自动形式化,并利用人类提示进行改进和泛化。

详情
AI中文摘要

类型标注在打印项时至关重要,以确保其在重新解析和类型推断下保持含义。我们研究了Isabelle中使用的秩一多态$λ$-演算项的完全且最小类型标注问题。基于Smolka、Blanchette等人的先前工作,我们对该问题进行了元理论阐述,包括完整的形式化规范和证明,并在Isabelle/HOL中进行了形式化。我们的开发是一系列实验,展示了人类驱动和AI驱动的形式化工作流程:人类和基于LLM的AI代理独立产生纸笔证明,AI代理在Isabelle中自动形式化两者,并通过进一步的人类提示AI干预来改进和泛化开发。

英文摘要

Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the problem of complete and minimal type annotations for rank-one polymorphic $λ$-calculus terms, as used in Isabelle. Building on prior work by Smolka, Blanchette et al., we give a metatheoretical account of the problem, with a full formal specification and proofs, and formalize it in Isabelle/HOL. Our development is a series of experiments featuring human-driven and AI-driven formalization workflows: a human and an LLM-powered AI agent independently produce pen-and-paper proofs, and the AI agent autoformalizes both in Isabelle, with further human-hinted AI interventions refining and generalizing the development.

2604.13354 2026-06-03 cond-mat.mtrl-sci cs.AI 版本更新

Finetuning-Free Diffusion Model with Adaptive Constraint Guidance for Inorganic Crystal Structure Generation

无需微调的扩散模型结合自适应约束引导用于无机晶体结构生成

Auguste de Lambilly, Vladimir Baturin, David Portehault, Guillaume Lambard, Nataliya Sokolovska, Florence d'Alché-Buc, Jean-Claude Crivello

发表机构 * CNRS-Saint-Gobain-NIMS(法国国家科学研究中心-圣戈班-日本国立物质材料研究机构) Laboratory for Innovative Key Materials and Structures (LINK)(创新关键材料与结构实验室) Laboratory of Computational, Quantitative, and Synthetic Biology (CQSB)(计算、定量与合成生物学实验室) Data-driven Materials Design Group(数据驱动材料设计组) Center for Basic Research on Materials(材料基础研究中心) LTCI, Télécom Paris, Institut Polytechnique de Paris(LTCI,巴黎电信,巴黎理工学院)

AI总结 提出一种基于扩散模型的自适应约束引导生成框架,无需微调即可结合用户定义的物理化学约束,生成满足热力学稳定性和几何约束的无机晶体结构。

Comments Full article including supplementary information, 56 pages, 9 figures

详情
AI中文摘要

发现具有目标性质的无机晶体结构是材料科学中的一个重大挑战。生成模型,尤其是最先进的扩散模型,有望对复杂数据分布进行建模并提出新颖、真实的样本。然而,当前的生成式AI模型仍然难以产生适用于高风险应用的、多样化、原创且可靠的实验可达成材料结构。在这项工作中,我们提出了一种基于扩散模型的自适应约束引导的生成式机器学习框架,该框架能够在生成过程中融入用户定义的物理和化学约束。该方法旨在对人类专家具有实用性和可解释性,允许透明的决策制定和专家驱动的探索。为了确保生成候选结构的鲁棒性和有效性,我们引入了一个多步骤验证流程,该流程结合了训练达到DFT精度水平的图神经网络估计器和用于评估热力学稳定性的凸包分析。我们的方法已在几个经典的无机化合物家族案例研究中得到测试和验证。因此,这些初步结果表明,我们的框架能够生成满足不同无机化学系统中目标几何约束的热力学合理的晶体结构。

英文摘要

The discovery of inorganic crystal structures with targeted properties is a significant challenge in materials science. Generative models, especially state-of-the-art diffusion models, offer the promise of modeling complex data distributions and proposing novel, realistic samples. However, current generative AI models still struggle to produce diverse, original, and reliable structures of experimentally achievable materials suitable for high-stakes applications. In this work, we propose a generative machine learning framework based on diffusion models with adaptive constraint guidance, which enables the incorporation of user-defined physical and chemical constraints during the generation process. This approach is designed to be practical and interpretable for human experts, allowing transparent decision-making and expert-driven exploration. To ensure the robustness and validity of the generated candidates, we introduce a multi-step validation pipeline that combines graph neural network estimators trained to achieve DFT-level accuracy and convex hull analysis for assessing thermodynamic stability. Our approach has been tested and validated on several classical examples of inorganic families of compounds, as case studies. As a consequence, these preliminary results demonstrate our framework's ability to generate thermodynamically plausible crystal structures that satisfy targeted geometric constraints across diverse inorganic chemical systems.

2604.12176 2026-06-03 cs.AI 版本更新

Evaluating Relational Reasoning in LLMs with REL

使用REL评估大语言模型中的关系推理能力

Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过关系复杂度(RC)定义推理难度,构建涵盖代数、化学和生物学的生成式基准REL,发现前沿大语言模型在RC增加时性能持续下降,表明模型在高元关系绑定上存在固有局限。

Comments ICML 2026

详情
AI中文摘要

关系推理是推断同时绑定多个实体、属性或变量的关系的能力。这种能力对科学推理至关重要,但现有对大语言模型关系推理的评估通常侧重于结构化输入(如表格、图或合成任务),并未分离高元关系绑定带来的困难。我们通过关系复杂度(RC)来研究这个问题,将其定义为应用一个关系时必须同时绑定的独立实体或操作数的最小数量。RC提供了一种原则性的方式来改变推理难度,同时控制输入大小、词汇和表示选择等混杂因素。基于RC,我们引入了REL,一个涵盖代数、化学和生物学的生成式基准框架,在每个领域内变化RC。在前沿大语言模型中,当RC增加时,性能持续且单调下降,即使实体总数保持不变。这种失败模式在增加测试时计算量和上下文学习时仍然存在,表明这一限制与所需关系绑定的元数有关,而非推理步骤不足或缺乏示例暴露。我们的结果识别了当前模型难以应对的高元推理场景,并促使通过关系复杂度的视角重新审视基准测试。

英文摘要

Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when the total number of entities is held fixed. This failure mode persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or lack of exposure to examples. Our results identify a regime of higher-arity reasoning in which current models struggle, and motivate re-examining benchmarks through the lens of relational complexity.

2604.10169 2026-06-03 cs.AI cs.LG 版本更新

MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

MAVEN-T:用于实时多智能体轨迹预测的强化异构蒸馏

Wenchang Duan, Zhenguo Gao, Jinguo Xian, Yi Shi

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University(上海交通大学Bio-X研究院、发育与神经精神疾病遗传学重点实验室) Shanghai Key Laboratory of Psychotic Disorders, Brain Science and Technology Research Center, Shanghai Jiao Tong University(上海精神疾病重点实验室、脑科学与技术研究中心,上海交通大学)

AI总结 提出MAVEN-T框架,通过高容量教师模型和紧凑学生模型的异构蒸馏,结合强化学习优化,实现实时多智能体轨迹预测,在多个数据集上达到高精度与低延迟。

详情
AI中文摘要

轨迹预测是自动驾驶系统的关键组成部分,因为未来运动直接影响碰撞检查、行为规划和控制。在密集交互、异构行为、多模态未来和有限车载计算条件下,该任务仍然具有挑战性。现有的图、注意力和生成式预测器改进了交互推理或不确定性建模,但其高容量设计通常成本高昂,难以实时部署。轻量级预测器和传统蒸馏降低了推理成本,但通常依赖静态模仿,并未明确纠正与安全相关的教师偏差。本文提出了MAVEN-T,一种用于实时多智能体轨迹预测的强化异构蒸馏框架。高容量教师模型通过环绕感知图编码器建模有向局部交互,结合高效时间滤波与移位窗口空间注意力,并通过稀疏混合专家头解码特定机动未来。紧凑的GRU-挤压激励学生模型配备低秩自适应策略头,通过特征级、注意力级和语义级蒸馏进行训练。为了与下游行为对齐,学生模型进一步通过近端策略优化奖励进行细化,奖励包括碰撞避免、舒适性和进度,同时复杂度感知课程和弹性权重巩固稳定了分阶段训练。在NGSIM、HighD、MoCAD、Argoverse 2和Waymo开放运动数据集上的实验评估了准确性、效率、泛化性、鲁棒性和闭环安全性。学生模型在NVIDIA Jetson AGX Orin上实现了6.2倍参数压缩、3.7倍推理加速和14.6毫秒延迟,同时保持竞争性准确性。

英文摘要

Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior planning, and control. The task remains challenging under dense interactions, heterogeneous behaviors, multimodal futures, and limited on-board computation. Existing graph, attention, and generative predictors improve interaction reasoning or uncertainty modeling, but their high-capacity designs are often costly for real-time deployment. Lightweight predictors and conventional distillation reduce inference cost, yet usually rely on static imitation and do not explicitly correct safety-relevant teacher bias. This paper proposes MAVEN-T, a reinforced heterogeneous distillation framework for real-time multi-agent trajectory prediction. A high-capacity teacher models directed local interactions with a surround-aware graph encoder, combines efficient temporal filtering with shifted-window spatial attention, and decodes maneuver-specific futures through a sparse Mixture-of-Experts head. A compact GRU-Squeeze-and-Excitation student with a Low-Rank Adapted policy head is trained by feature-, attention-, and semantic-level distillation. To align prediction with downstream behavior, the student is further refined by Proximal Policy Optimization rewards for collision avoidance, comfort, and progress, while a complexity-aware curriculum and Elastic Weight Consolidation stabilize stage-wise training. Experiments on NGSIM, HighD, MoCAD, Argoverse 2, and the Waymo Open Motion Dataset evaluate accuracy, efficiency, generalization, robustness, and closed-loop safety. The student achieves 6.2$\times$ parameter compression, 3.7$\times$ inference acceleration, and 14.6 ms latency on an NVIDIA Jetson AGX Orin while maintaining competitive accuracy.

2603.26738 2026-06-03 cs.CV cs.AI cs.CL 版本更新

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

SleepVLM:基于视觉语言模型的可解释且规则驱动的睡眠分期

Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

AI总结 提出SleepVLM,一种基于规则驱动的视觉语言模型,通过多通道PSG波形图像进行睡眠分期,并生成符合AASM评分标准的临床可读解释,在保持高准确率的同时提升可解释性。

Comments Under review

详情
AI中文摘要

尽管自动睡眠分期已达到专家级准确率,但其临床采用因缺乏可审计的推理而受阻。我们提出了SleepVLM,一种基于规则驱动的视觉语言模型(VLM),它通过多通道多导睡眠图(PSG)波形图像进行睡眠分期,并基于美国睡眠医学学会(AASM)评分标准生成临床可读的理由。利用波形感知预训练和规则驱动的监督微调,SleepVLM在保留测试集(MASS-SS1)上实现了0.767的Cohen's kappa,在外部队列(ZUAMHCS)上实现了0.743,达到了最先进的性能。两位经过训练的睡眠技术专家的独立评估进一步验证了模型的推理质量,在两个数据集上,事实准确性、证据全面性和逻辑连贯性的平均得分在3.75-3.96之间(满分5分)。通过将竞争性性能与透明、基于规则的解释相结合,SleepVLM可以提高临床工作流程中自动睡眠分期的可信度和可审计性。为了促进可解释睡眠医学的进一步研究,我们发布了MASS-EX,一个新颖的专家注释数据集。

英文摘要

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
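
For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below computes it for two label sequences; the function name and the toy stage labels are illustrative assumptions and are not taken from the paper's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both raters give the same stage.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)
```

Unlike raw accuracy, kappa discounts agreement that would arise by chance from the class frequencies alone, which matters in sleep staging because some stages are much more common than others.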

2603.26791 2026-06-03 cs.DL cs.AI cs.CL cs.CY 版本更新

Crystal: Characterizing Relative Impact of Scholarly Publications

Crystal: 表征学术出版物的相对影响力

Hannah Collison, Benjamin Van Durme, Daniel Khashabi

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出Crystal方法,利用大语言模型对引用论文进行联合排序,通过多数投票消除位置偏差,以更准确地区分高影响力引用,在人工标注数据集上准确率提升9.5%,F1提升8.3%。

详情
AI中文摘要

评估被引论文的影响力通常是通过在施引论文中单独分析其引用上下文来完成的。虽然这聚焦于最直接相关的文本,但它阻止了对一篇论文引用的所有作品进行相对比较。我们提出Crystal,它使用大语言模型(LLMs)联合排序施引论文中的所有被引论文。为了减轻LLMs的位置偏差,我们以随机顺序对每个列表进行三次排序,并通过多数投票聚合影响力标签。这种联合方法利用了完整的引用上下文,而不是独立评估引用,从而更可靠地区分有影响力的参考文献。Crystal在人工标注的引用数据集上,准确率比先前最先进的影响力分类器高出9.5%,F1高出8.3%。Crystal通过更少的LLM调用进一步提高了效率,并使用开放权重模型优于先前的基线,实现了可扩展、成本效益高的引用影响力分析。在对ACL时间检验奖获奖论文的案例研究中,我们发现Crystal的影响力特征与长期科学认可高度一致。我们发布了Crystal-Bank,一个包含46.8k篇论文的排名和影响力标签的数据集,以及代码。

英文摘要

Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. Crystal outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. Crystal further gains efficiency through fewer LLM calls and outperforms prior baselines using an open-weight model, enabling scalable, cost-effective citation impact analysis. In a case study of ACL Test-of-Time award-winning papers, we find that Crystal's impact characterizations align closely with long-term scientific recognition. We release Crystal-Bank, a 46.8k-paper dataset with rankings and impact labels, along with code.
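
The aggregation step described above (rank the same list several times in randomized order, then take a majority vote over the impact labels) can be sketched as below. This is an illustration under stated assumptions, not the released Crystal code: `label_fn` is a hypothetical stand-in for one LLM ranking call, and the label names are invented.

```python
import random
from collections import Counter


def majority_vote_labels(cited_ids, label_fn, runs=3, seed=0):
    """Call `label_fn` on `runs` random orderings of `cited_ids` and return,
    for each cited paper, the label it received most often.

    label_fn(order) must return a dict {cited_id: label} covering every id.
    """
    rng = random.Random(seed)  # seeded so the orderings are reproducible
    votes = {cid: [] for cid in cited_ids}
    for _ in range(runs):
        order = list(cited_ids)
        rng.shuffle(order)         # randomize list position to dilute positional bias
        labels = label_fn(order)   # hypothetical stand-in for one LLM call
        for cid in cited_ids:
            votes[cid].append(labels[cid])
    # Most frequent label per paper; on a tie, the label seen first wins.
    return {cid: Counter(v).most_common(1)[0][0] for cid, v in votes.items()}
```

With three runs and two labels a tie cannot occur, which may be one reason an odd number of runs is convenient. A labeler that favors whichever paper is listed first only affects a given paper in the runs where the shuffle happens to put it first.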

2510.21011 2026-06-03 cs.HC cs.AI cs.CY 版本更新

Generating the Modal Worker: A Cross-Model Audit of Race and Gender in LLM-Generated Personas Across 41 Occupations

生成“众数”工人:跨模型审计41个职业中LLM生成人设的种族与性别

Ilona van der Linden, Sahana Kumar, Arnav Dixit, Aadi Sudan, Smruthi Danda, David C. Anastasiu, Kai Lukoff

发表机构 * Human-Computer Interaction Lab, Computer Science and Engineering(人机交互实验室,计算机科学与工程) Santa Clara University(圣克拉拉大学)

AI总结 本研究审计了四个大型语言模型生成的150多万个职业人设,通过与BLS数据对比,发现模型压缩了人口统计变异,系统性地扭曲了种族和性别代表性。

详情
AI中文摘要

随着生成式AI工具越来越多地被用于描绘职业角色中的人物,理解其种族和性别代表性偏差至关重要。我们审计了由四个主要大型语言模型(GPT-4、Gemini 2.5、DeepSeek V3.1和Mistral-medium)生成的41个美国职业中的150多万个职业人设。将这些人设与美国劳工统计局(BLS)数据进行比较,我们发现模型生成的人口统计数据比真实世界数据的变异性更小,实际上将每个职业压缩为一种主导人口统计特征,而不是代表总体水平的变异。通过偏移/夸张分解揭示了这些扭曲的结构:白人(-31个百分点)和黑人(-9个百分点)工人持续被低估,而西班牙裔(+17个百分点)和亚裔(+12个百分点)工人被高估,刻板印象的夸张加剧了现有的职业隔离。这些扭曲往往极端,包括几乎全部将管家描绘为西班牙裔,以及许多职业中黑人工人几乎被抹去。由于这些模式在不同机构和文化起源的模型中重复出现,它们表明存在共享的结构性偏差来源,而非模型特定的伪影。我们认为,审计生成式AI需要评估框架,该框架检查合成人口如何系统地重塑跨社会角色的人口统计可见性。

英文摘要

As generative AI tools are increasingly used to portray people in professional roles, understanding their racial and gender representational biases is critical. We audit over 1.5 million occupational personas generated by four major large language models (GPT-4, Gemini 2.5, DeepSeek V3.1, and Mistral-medium) across 41 U.S. occupations. Comparing these personas against U.S. Bureau of Labor Statistics (BLS) data, we find that models generate demographics with less variation than real-world data, functionally compressing each occupation toward a dominant demographic profile rather than representing population-level variation. A shift/exaggeration decomposition reveals the structure of these distortions: White (-31 percentage points) and Black (-9 pp) workers are consistently underrepresented, while Hispanic (+17 pp) and Asian (+12 pp) workers are overrepresented, with stereotype exaggeration amplifying existing occupational segregation. These distortions are often extreme, including near-total portrayals of housekeepers as Hispanic and the near-erasure of Black workers from many occupations. Because these patterns recur across models with different institutional and cultural origins, they suggest shared structural sources of bias rather than model-specific artifacts. We argue that auditing generative AI requires evaluation frameworks that examine how synthetic populations systematically reshape demographic visibility across social roles.
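
The percentage-point (pp) figures quoted above are differences between a group's share among generated personas and its share in the reference data. A minimal sketch of that comparison follows; the function names are hypothetical, the group keys are placeholders, and this does not reproduce the paper's shift/exaggeration decomposition, which is not specified in the abstract.

```python
def shares(counts):
    """Convert raw counts per group into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must not be all zero")
    return {group: 100.0 * c / total for group, c in counts.items()}


def pp_gap(generated_counts, reference_shares):
    """Percentage-point gap per group: generated share minus reference share.

    generated_counts: {group: number of generated personas}
    reference_shares: {group: share in percent from the reference data}
    A negative value means the group is underrepresented in the generated set.
    """
    generated = shares(generated_counts)
    groups = set(generated) | set(reference_shares)
    return {g: generated.get(g, 0.0) - reference_shares.get(g, 0.0) for g in groups}
```

Note that a pp gap is an absolute difference of shares: a group at 30% in the generated set against a 50% reference is 20 pp under, regardless of how large either share is in relative terms.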

2603.23117 2026-06-03 cs.CR cs.AI cs.RO 版本更新

TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches

TRAP: 通过对抗性补丁劫持VLA的CoT推理

Zhengxian Huang, Wenjun Zhu, Haoxuan Qiu, Xiaoyu Ji, Wenyuan Xu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TRAP攻击,利用对抗性补丁劫持视觉-语言-动作模型的链式推理,实现目标行为操控。

Comments Accepted by ICML 2026

详情
AI中文摘要

通过集成链式推理,视觉-语言-动作模型在机器人操作中展现出强大能力,特别是在提升泛化性和可解释性方面。然而,基于CoT的推理机制的安全性尚未得到充分探索。在本文中,我们证明CoT推理引入了一种新的攻击向量,用于目标行为劫持——例如,导致机器人错误地将刀递给一个人而不是苹果——而无需修改用户的指令。我们首先提供经验证据表明,即使CoT与输入指令在语义上不一致,它仍然强烈主导动作生成。基于这一观察,我们提出TRAP,这是首个针对CoT推理VLA模型的目标行为劫持对抗性攻击。通过针对推理到动作的路径,TRAP使用对抗性补丁(例如,放置在桌子上的桌布)来引导中间CoT推理和下游动作朝向对手定义的行为。在三个代表性推理VLA上的广泛评估,涵盖了不同的CoT推理机制,证明了TRAP的有效性。值得注意的是,我们在现实环境中通过将补丁打印在纸上实现了该攻击。我们的发现凸显了保护VLA系统中CoT推理的紧迫性。项目页面可在https://zhengxian-huang.github.io/TRAP-website/获取。

英文摘要

By integrating Chain-of-Thought (CoT) reasoning, Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, particularly by improving generalization and interpretability. However, the security of CoT-based reasoning mechanisms remains largely unexplored. In this paper, we show that CoT reasoning introduces a novel attack vector for targeted behavior hijacking (for example, causing a robot to mistakenly deliver a knife to a person instead of an apple) without modifying the user's instruction. We first provide empirical evidence that CoT strongly governs action generation, even when it is semantically misaligned with the input instructions. Building on this observation, we propose TRAP, the first targeted behavior-hijacking adversarial attack against CoT-reasoning VLA models. By targeting the reasoning-to-action pathway, TRAP uses an adversarial patch (e.g., a tablecloth placed on the table) to steer intermediate CoT reasoning and downstream actions toward adversary-defined behaviors. Extensive evaluations on three representative reasoning VLAs, spanning distinct CoT reasoning mechanisms, demonstrate the effectiveness of TRAP. Notably, we implemented the patch by printing it on paper in a real-world setting. Our findings highlight the urgent need to secure CoT reasoning in VLA systems. The project page is available at https://zhengxian-huang.github.io/TRAP-website/.

2603.20508 2026-06-03 cs.MA cs.AI cs.CL 版本更新

Measuring Weak-to-Strong Legibility of Reasoning Models

衡量推理模型的弱到强可读性

Dani Roytburg, Shreya Sridhar, Daphne Ippolito

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对推理语言模型在多智能体场景中生成的中间思维链,提出“弱到强可读性”概念,并设计衡量指标以评估强模型输出对弱模型的易理解性。

Comments Accepted to Trustworthy AI4GOOD Workshop @ ICML 2026

详情
AI中文摘要

推理语言模型及其生成的中间思维链在多智能体设置(如模型间监控或蒸馏到较小模型)中扮演着越来越核心的角色。当不同能力层级的智能体必须合作时,强模型需要产生能被弱模型消化的轨迹。我们将此目标称为“弱到强可读性”。大模型的可信度部分依赖于这种可读性属性。特别是在安全监督方面,采用弱监控器可能成为健康预算下可靠性支架的标准。可读性要求这些决策轨迹的形状采取某种弱监控器可访问的形式。现有的基于效率的可读性指标未能捕捉“彻底性”,而是侧重于简洁性。

英文摘要

Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring or distillation into smaller models. When agents at different capability tiers must cooperate, strong models need to produce traces digestible by weaker ones. We refer to this goal as "weak-to-strong legibility". Trustworthiness of large models depends in part on this legibility property. For safety oversight in particular, adoption of weak monitors may become a standard for reliability scaffolds on a healthy budget. Legibility requires that the shape of these decision-making traces takes some form accessible to weaker monitors. Existing efficiency-based metrics for legibility fail to capture "thoroughness", instead focusing on conciseness.

2602.07768 2026-06-03 cs.CV cs.AI cs.LG cs.MM 版本更新

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

PAND:面向提示的邻域蒸馏用于轻量级细粒度视觉分类

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

AI总结 提出PAND框架,通过提示感知语义校准和邻域感知结构蒸馏,将大型视觉语言模型知识迁移至轻量网络,在细粒度分类任务上超越现有方法。

Comments Accepted by ICIP2026

详情
AI中文摘要

在细粒度视觉分类(FGVC)中,从大型视觉语言模型(VLM)中蒸馏知识到轻量级网络至关重要但具有挑战性,原因是依赖于固定提示和全局对齐。为解决此问题,我们提出PAND(提示感知邻域蒸馏),一个两阶段框架,将语义校准与结构迁移解耦。首先,我们引入提示感知语义校准以生成自适应语义锚点。其次,我们提出邻域感知结构蒸馏策略以约束学生的局部决策结构。PAND在四个FGVC基准上持续优于现有方法。值得注意的是,我们的ResNet-18学生在CUB-200上达到76.09%的准确率,超过强基线VL2Lite 3.4%。代码可在https://github.com/LLLVTA/PAND获取。

英文摘要

Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.

2601.12186 2026-06-03 cs.SE cs.AI 版本更新

Aletheia: What Makes RLVR For Code Verifiers Tick?

Aletheia: 什么使得代码验证器的RLVR有效?

Vatsal Venkatkrishna, Indraneil Paul, Iryna Gurevych

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski", Bulgaria(保加利亚索菲亚大学INSAIT研究所) Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany(德国达姆施塔特工业大学计算机科学系泛在知识处理实验室及国家应用网络安全研究中心ATHENE)

AI总结 通过消融实验研究RLVR训练代码验证器时,中间思考轨迹、负样本学习和策略内训练三个因素在不同规模下的性能-成本权衡,发现最优配方依赖于模型规模。

Comments 31 pages, 6 figures

详情
AI中文摘要

通过可验证奖励的强化学习(RLVR)训练的多领域思考验证器是现代后训练的核心。然而,由于完整RLVR管道的成本过高,它们在代码生成中的应用落后于执行反馈。在这项工作中,我们消融了RLVR中性能-成本权衡的三个主要选择:中间思考轨迹、从负样本学习和策略内训练。我们引入了Aletheia,一个受控的、基于执行的测试平台,以促进对不同模型大小和两个常见验证器应用场景下的协变量偏移进行无污染分析。我们的分析揭示,最优训练配方依赖于规模:对于小型验证器,策略内学习是主要性能驱动因素,而在较大规模下,思考预算成为最关键因素。虽然利用负样本对不同大小的top-1选择准确性有一致影响,但它们对排名重建的贡献随规模单调增加,并在大规模下稳定训练中起关键作用。我们的帕累托最优分析表明,在较大模型规模下消除策略内训练会产生一个与完整RLVR配方性能相当的验证器。此外,我们发现,在较低预算下,放弃思考轨迹是一种计算高效的策略,在训练成本和验证器准确性之间提供了强有力的权衡。最终,我们的工作为高效部署鲁棒代码验证器提供了必要的经验基础,从而使其能够在大型代码生成模型的后训练管道中得到更广泛的应用。

英文摘要

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary choices along the performance-cost trade-off in RLVR: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifier training recipes across disparate model sizes and covariate shifts across two common verifier application scenarios. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas the thinking budget becomes the most vital factor at larger scales. While leveraging negative samples has a consistent impact on top-1 selection accuracy across sizes, their contribution to ranking reconstruction increases monotonically with scale and plays a key role in stabilizing training at large sizes. Our Pareto optimality analysis demonstrates that eliminating on-policy training at larger model scales yields a verifier that performs comparably to the full RLVR recipe. Furthermore, we find that eschewing thinking traces serves as a compute-efficient strategy at lower budgets, offering a strong trade-off between training cost and verifier accuracy. Ultimately, our work provides the empirical foundation necessary to efficiently deploy robust code verifiers, thereby enabling their wider adoption in post-training pipelines for large code generation models.

2603.07664 2026-06-03 cs.CV cs.AI cs.GR 版本更新

Ref-DGS: Reflective Dual Gaussian Splatting

Ref-DGS: 反射性双高斯泼溅

Ningjing Fan, Yiqun Wang, Dong-Ming Yan, Peter Wonka

发表机构 * Chongqing University(重庆大学) MAIS, Institute of Automation, Chinese Academy of Sciences and UCAS(中国科学院自动化研究所MAIS、中国科学院大学) King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学)

AI总结 提出Ref-DGS框架,通过双高斯场景表示和物理感知的镜面自适应混合着色器,在高效光栅化管线中解耦表面重建与镜面反射,实现反射场景的SOTA新视图合成且训练速度远快于基于光线的方法。

Comments Project page: https://njfan.github.io/Ref-DGS/

详情
AI中文摘要

反射外观,尤其是强烈的近场镜面反射,对精确的表面重建和新视图合成构成了根本性挑战。现有的高斯泼溅方法要么无法建模近场镜面反射,要么依赖显式光线追踪而计算成本高昂。我们提出了Ref-DGS,一个反射性双高斯泼溅框架,通过在高效光栅化管线中将表面重建与镜面反射解耦来解决这一权衡。Ref-DGS引入了一种双高斯场景表示,由几何高斯和互补的局部反射高斯组成,无需显式光线追踪即可捕捉近场镜面交互,并包含一个全局环境反射场用于建模远场镜面反射。为了预测镜面辐射,我们进一步提出了一种轻量级的、物理感知的镜面自适应混合着色器,融合全局和局部镜面特征。实验表明,Ref-DGS在反射场景上达到了最先进的性能,同时训练速度显著快于基于光线的高斯方法。

英文摘要

The reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware specular adaptive mixing shader that fuses global and local specular features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.

2602.07075 2026-06-03 physics.chem-ph cs.AI cs.CL cs.LG 版本更新

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

LatentChem: 从文本思维链到化学推理中的潜在思考

Xinwu Ye, Yicheng Mao, Yuxuan Liao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, Xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对化学大语言模型依赖显式思维链导致的模态不匹配问题,提出LatentChem推理接口,通过连续思维向量和动态感知解耦化学逻辑与语言生成,在ChemCoTBench上以59.88%非平局胜率超越强CoT基线,并实现平均10.84倍推理步骤开销降低(5.96倍实际加速)。

Comments Accepted at ICML 2026

详情
AI中文摘要

当前的化学大语言模型主要依赖显式的思维链来解决复杂推理问题。然而,将非语言的隐性化学逻辑强制转化为离散的自然语言,造成了根本性的“模态不匹配”,为推理带来了人为瓶颈。我们提出了LatentChem,一种将化学逻辑与语言生成解耦的推理接口,使模型能够通过连续思维向量和动态感知来处理信息。我们的研究揭示了一个关键涌现行为:自发内化,这里定义为在仅结果优化下的自我选择。当为任务成功进行优化时,模型放弃冗长的文本推导,转而采用隐式的潜在计算,这表明模型将连续流形视为化学逻辑更自然的载体。这一范式转变也被证明是一种更优的计算策略:在严格的ChemCoTBench基准上,LatentChem对强CoT基线取得了59.88%的非平局胜率,同时在所有评估基准上实现了平均10.84倍的推理步骤开销降低(5.96倍实际加速)。我们的结果提供了经验证据,表明化学推理更自然、更有效地实现为连续潜在动力学,而非离散的语言轨迹。

英文摘要

Current chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) to solve complex reasoning problems. However, forcing nonverbal tacit chemical logic into discrete natural language imposes a fundamental "modality mismatch," creating an artificial bottleneck for reasoning. We introduce LatentChem, a reasoning interface that decouples chemical logic from linguistic generation, enabling the model to process information via continuous thought vectors and dynamic perception. Our investigation reveals a pivotal emergent behavior: spontaneous internalization, defined here as self-selected under outcome-only optimization. When optimized for task success, the model abandons verbose textual derivations in favor of implicit latent computation, suggesting that it identifies the continuous manifold as a more native substrate for chemical logic. This paradigm shift also proves to be a superior computational strategy: LatentChem achieves a 59.88% non-tie win rate against the strong CoT baseline on the rigorous ChemCoTBench, while delivering a broad 10.84x average reduction in reasoning step overhead (5.96x wall-clock speedup) across all evaluated benchmarks. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

2603.05290 2026-06-03 cs.AI 版本更新

X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

X-RAY: 通过形式化与校准探针映射大语言模型推理能力

Tianxi Gao, Yufan Cai, Yusi Yuan, Jin Song Dong

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出X-RAY系统,利用形式化工具生成结构可控的校准探针,通过分析约束交互、推理深度和解空间几何等属性,揭示LLM在约束细化与解空间重构下的推理不对称性。

Comments Accepted by KDD 2026

详情
AI中文摘要

大型语言模型(LLM)取得了有前景的性能,但其推理能力仍未被充分理解。现有评估主要强调任务级准确性,常常将模式匹配与推理能力混为一谈。我们提出了X-RAY,一个可解释的推理分析系统,通过校准的、形式化验证的探针来映射LLM的推理能力。我们将推理能力建模为可提取结构的函数,通过形式化属性(如约束交互、推理深度和解空间几何)进行操作化。X-RAY通过形式化工具生成具有受控结构变化的探针,通过形式化校准和验证实现对增量结构信息的精确隔离。我们在数学、物理和化学领域从初级到高级的问题上评估了最先进的LLM。我们的分析揭示了LLM推理中的系统性不对称:模型对约束细化(即附加条件缩小现有解空间)相对稳健,但在解空间重构(即修改改变解流形的底层结构形式)下性能急剧下降。此外,校准的形式化探针能够区分在标准基准上看似无法区分的模型,并揭示出结构上可解释而非模糊的失败模式。除了评估,我们的框架无污染,并支持推理模型的训练和测试。

英文摘要

Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structure, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.

2603.01372 2026-06-03 cs.LG cs.AI 版本更新

Causal Neural Probabilistic Circuits

因果神经概率电路

Weixin Chen, Han Zhao

AI总结 提出因果神经概率电路(CNPC),通过结合神经属性预测器和从因果图编译的因果概率电路,支持遵循因果依赖的精确干预推理,从而提升概念瓶颈模型在干预下的分类准确率。

详情
AI中文摘要

概念瓶颈模型(CBM)通过引入概念层并从概念预测中预测类别标签,增强了端到端神经网络的可解释性。CBM的一个关键特性是支持干预,即领域专家可以在测试时纠正错误预测的概念值以提高最终准确性。然而,典型的CBM仅覆盖被纠正的概念,而保持其他概念预测不变,这忽略了概念间的因果依赖。为解决此问题,我们提出了因果神经概率电路(CNPC),它结合了神经属性预测器和从因果图编译的因果概率电路。该电路支持精确、易处理的因果推理,天然尊重因果依赖。在干预下,CNPC基于专家乘积(PoE)建模类别分布,融合了属性预测器的预测分布和电路计算的干预边际。我们从理论上刻画了CNPC相对于其模块的组合干预误差,并确定了CNPC接近真实干预类别分布的条件。在五个基准数据集上的分布内和分布外实验表明,与五个基线模型相比,CNPC在不同干预属性数量下均实现了更高的任务准确率。

英文摘要

Concept Bottleneck Models (CBMs) enhance the interpretability of end-to-end neural networks by introducing a layer of concepts and predicting the class label from the concept predictions. A key property of CBMs is that they support interventions, i.e., domain experts can correct mispredicted concept values at test time to improve the final accuracy. However, typical CBMs apply interventions by overwriting only the corrected concept while leaving other concept predictions unchanged, which ignores causal dependencies among concepts. To address this, we propose the Causal Neural Probabilistic Circuit (CNPC), which combines a neural attribute predictor with a causal probabilistic circuit compiled from a causal graph. This circuit supports exact, tractable causal inference that inherently respects causal dependencies. Under interventions, CNPC models the class distribution based on a Product of Experts (PoE) that fuses the attribute predictor's predictive distribution with the interventional marginals computed by the circuit. We theoretically characterize the compositional interventional error of CNPC w.r.t. its modules and identify conditions under which CNPC closely matches the ground-truth interventional class distribution. Experiments on five benchmark datasets in both in-distribution and out-of-distribution settings show that, compared with five baseline models, CNPC achieves higher task accuracy across different numbers of intervened attributes.
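The Product of Experts fusion mentioned in the abstract can be illustrated with a minimal sketch, assuming two discrete distributions over the same variable that are multiplied elementwise and renormalized. The function name `product_of_experts` and the example numbers are hypothetical; the paper's actual formulation may weight or parameterize the experts differently.

```python
def product_of_experts(p, q):
    """Fuse two discrete distributions over the same outcomes.

    Generic PoE: elementwise product followed by renormalization.
    This is an illustrative assumption, not the CNPC implementation.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    unnormalized = [a * b for a, b in zip(p, q)]
    z = sum(unnormalized)
    if z == 0:
        raise ValueError("experts assign zero mass to every outcome")
    return [u / z for u in unnormalized]


# Hypothetical numbers: a neural predictor's distribution and an
# interventional marginal computed elsewhere (e.g. by a circuit).
predictor = [0.6, 0.3, 0.1]
circuit = [0.2, 0.7, 0.1]
fused = product_of_experts(predictor, circuit)
```

Because the product is renormalized, an outcome keeps high fused probability only when both experts support it, which is the usual motivation for choosing PoE over a simple average.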

2512.03005 2026-06-03 cs.AI 版本更新

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

从审核到调解:LLMs能否充当在线论战中的调解员?

Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manuel Sandoval, Deborah Hall, Yasin Silva, Huan Liu

发表机构 * Arizona State University(亚利桑那州立大学) Loyola University Chicago(芝加哥洛约拉大学)

AI总结 本研究探索大型语言模型(LLMs)能否超越内容审核,作为调解员通过判断对话公平性和情感动态并生成共情缓和信息来化解在线冲突,实验表明API模型在推理和干预一致性上优于开源模型。

Comments Accepted by PAKDD 2026 special session on Data Science: Foundations and Applications

详情
AI中文摘要

大型语言模型(LLMs)的快速发展为人工智能向善应用开辟了新可能性。随着LLMs越来越多地介入在线交流,它们培养共情和建设性对话的潜力成为负责任AI研究的重要前沿。本研究探索LLMs是否不仅能作为检测有害内容的审核员,还能作为能够理解和缓和在线冲突的调解员。我们的框架将调解分解为两个子任务:判断,即LLM评估对话的公平性和情感动态;引导,即生成共情的、缓和性的信息以引导参与者走向解决。为评估调解质量,我们构建了一个大型基于Reddit的数据集,并提出了一个结合基于原则的评分、用户模拟和人工比较的多阶段评估流程。实验表明,API模型在调解时的推理和干预一致性方面优于开源模型。我们的发现突显了当前LLMs作为新兴在线社会调解代理的潜力和局限性。

英文摘要

The rapid advancement of large language models (LLMs) has opened new possibilities for AI-for-good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment in mediation tasks. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

2602.20217 2026-06-03 cs.LG cs.AI 版本更新

KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

KnapSpec: 将自适应层选择建模为背包问题的自推测解码

Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

发表机构 * KAIST(韩国科学技术院)

AI总结 提出KnapSpec,一种无需训练的框架,将草稿模型选择重新表述为背包问题,通过解耦注意力与MLP层并建模其硬件特定延迟,使用并行动态规划算法自适应确定最优草稿配置,实现令牌吞吐量最大化。

Comments Accepted to ICML 2026

详情
AI中文摘要

自推测解码(SSD)通过跳过层来创建高效的草稿模型,从而加速LLM推理,但现有方法通常依赖静态启发式,忽略了长上下文场景中注意力的动态计算开销。我们提出KnapSpec,一种无需训练的框架,将草稿模型选择重新表述为背包问题,以最大化每时间令牌吞吐量。通过解耦注意力与MLP层,并将其硬件特定延迟建模为上下文长度的函数,KnapSpec通过并行动态规划算法自适应地即时识别最优草稿配置。此外,我们提供了首个严格的理论分析,建立了隐藏状态之间的余弦相似度作为令牌接受率的数学上合理的代理。这一基础使得我们的方法在导航现实世界硬件的动态瓶颈时,能够保持高草稿保真度。我们在Qwen3和Llama3上的实验表明,KnapSpec始终优于最先进的SSD基线,在各种基准测试中实现了高达1.47倍的墙钟加速。我们的即插即用方法确保了长序列的高效推理,无需额外训练或损害目标模型的输出分布。

英文摘要

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
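To make the knapsack framing concrete, here is a minimal sketch using a textbook 0/1 knapsack dynamic program. It assumes each candidate block (attention or MLP) has an integer latency cost and a scalar "value" standing in for its contribution to draft quality. The names `knapsack_select`, `blocks`, and all numbers are hypothetical; the paper's own algorithm (a parallel DP with context-length-dependent latencies) is not reproduced here.

```python
def knapsack_select(items, budget):
    """Pick a subset of (name, cost, value) items within a cost budget.

    Standard 0/1 knapsack over integer costs; costs must be positive.
    Returns (best_total_value, list_of_chosen_names).
    """
    # best[b] holds the best (value, chosen names) with total cost <= b.
    best = [(0.0, [])] * (budget + 1)
    for name, cost, value in items:
        # Iterate budgets downward so each item is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + value
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical per-block latencies (arbitrary units) and values.
blocks = [
    ("attn_0", 4, 0.9),
    ("mlp_0", 2, 0.5),
    ("attn_1", 5, 1.0),
    ("mlp_1", 2, 0.4),
]
total_value, kept = knapsack_select(blocks, budget=8)
```

In a long-context setting the attention costs would grow with context length, so the same routine would be re-run with updated costs; that re-solving step is where an adaptive method would differ from a static heuristic.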

2602.20213 2026-06-03 cs.SE cs.AI cs.CR 版本更新

CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

CodeHacker: 用于检测竞赛编程解决方案漏洞的自动化测试用例生成

Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, Shengyu Tao

发表机构 * Shanghai University of Finance and Economics(上海财经大学) Northwestern Polytechnical University(西北工业大学) Meituan(美团) University of Toronto(多伦多大学)

AI总结 提出CodeHacker框架,通过多策略对抗测试用例生成(压力测试、反哈希攻击、逻辑特定攻击)和校准阶段,有效暴露程序漏洞,提升测试集真负率并增强RL训练模型性能。

详情
AI中文摘要

大型语言模型(LLM)在代码生成方面的评估很大程度上依赖于测试用例的质量和鲁棒性。然而,现有的基准测试往往缺乏对微妙边界情况的覆盖,导致错误的解决方案通过测试。为弥补这一差距,我们提出了CodeHacker,一个自动化的智能体框架,专门用于生成有针对性的对抗性测试用例,以暴露程序提交中的潜在漏洞。模仿竞赛编程中的黑客机制,CodeHacker采用多策略方法,包括压力测试、反哈希攻击和逻辑特定攻击,以破解特定的代码提交。为确保这些攻击的有效性和可靠性,我们引入了一个校准阶段,在该阶段中,智能体在评估参赛者代码之前,通过自生成的对抗性探测迭代地完善自己的验证器和检查器。实验表明,CodeHacker显著提高了现有数据集上的真负率(TNR),有效过滤了先前被接受的错误解决方案。此外,生成的对抗性案例被证明是优越的训练数据,提升了在LiveCodeBench等基准测试上经过强化学习训练的模型的性能。

英文摘要

The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant code. Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
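The stress-testing strategy named in the abstract resembles the differential testing commonly used in competitive programming: run a submission and a slow but trusted reference on many small random inputs until they disagree. The sketch below is an assumption-laden illustration of that general idea, not CodeHacker's pipeline; the maximum-subarray task, the deliberately buggy submission, and the helper `find_hack` are all hypothetical.

```python
import random


def reference_max_subarray(a):
    """Brute-force maximum sum over all non-empty contiguous subarrays."""
    n = len(a)
    return max(sum(a[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def buggy_max_subarray(a):
    """Kadane variant that wrongly allows the empty subarray (sum 0)."""
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def find_hack(candidate, reference, trials=1000, seed=0):
    """Return a small random input on which the two functions disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if candidate(a) != reference(a):
            return a
    return None


hack = find_hack(buggy_max_subarray, reference_max_subarray)
```

Keeping the inputs tiny makes the brute-force reference cheap and tends to yield short counterexamples, here an array whose elements are all negative.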

2602.16666 2026-06-03 cs.AI cs.CY cs.LG 版本更新

Towards a Science of AI Agent Reliability

迈向AI代理可靠性的科学

Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出十二个具体指标,从一致性、鲁棒性、可预测性和安全性四个维度分解AI代理的可靠性,并通过实验揭示能力提升仅带来可靠性小幅改进。

Comments Accepted at ICML 2026. Interactive dashboard available at: https://hal.cs.princeton.edu/reliability

详情
AI中文摘要

AI代理越来越多地被部署来执行重要任务。虽然标准基准测试上的准确率分数不断提高表明进展迅速,但许多代理在实践中仍然持续失败。这种差异凸显了当前评估的一个根本局限性:将代理行为压缩为单一成功指标会掩盖关键的操作缺陷。值得注意的是,它忽略了代理是否在不同运行中表现一致、能否承受扰动、是否可预测地失败,或者错误严重性是否有界。基于安全关键工程,我们通过提出十二个具体指标来提供全面的性能概况,这些指标将代理可靠性分解为四个关键维度:一致性、鲁棒性、可预测性和安全性。在两个互补基准测试上评估15个模型,我们发现最近的能力提升仅带来了可靠性的小幅改进。通过暴露这些持续的局限性,我们的指标补充了传统评估,同时提供了推理代理如何表现、退化和失败的工具。

英文摘要

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 15 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.

2502.08834 2026-06-03 cs.LG cs.AI stat.ML 版本更新

Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers

Rex: 一族可逆指数(随机)龙格-库塔求解器

Zander W. Blasingame, Chen Liu

发表机构 * University of Washington(华盛顿大学)

AI总结 提出Rex求解器族,通过Lawson方法将显式(随机)龙格-库塔格式转化为代数可逆形式,用于扩散ODE和SDE,实现近机器精度重建并提升流模型和扩散模型的性能。

Comments Accepted as an Oral presentation at ICML 2026

详情
AI中文摘要

基于神经微分方程的深度生成模型已成为许多生成任务的最先进方法。这些模型依赖于从先验分布积分到数据分布的ODE/SDE求解器;在许多应用中,逆方向积分也非常可取。然而,标准求解器会累积离散误差,阻碍精确反演,这种不准确性在精度关键的应用中是不可接受的。现有的反演方法稳定性差、收敛阶低,且严格限于ODE设置。在这项工作中,我们提出Rex,一族可逆指数(随机)龙格-库塔求解器,通过应用Lawson方法将任何显式(随机)龙格-库塔格式转化为扩散ODE和SDE的代数可逆格式。除了严格的理论分析——建立任意阶收敛性和非零线性稳定区域——我们通过实验证明Rex实现了近机器精度的重建,并改进了基于流模型的玻尔兹曼采样以及基于扩散模型的图像生成和编辑。

英文摘要

Deep generative models based on neural differential equations have become state-of-the-art for many generation tasks. These models rely on ODE/SDE solvers that integrate from a prior distribution to the data distribution; in many applications it is also highly desirable to integrate in the inverse direction. Standard solvers, however, accumulate discretization errors that prohibit exact inversion, an inaccuracy that is unacceptable in precision-critical applications. Existing inversion methods suffer from poor stability and low order of convergence, and are strictly limited to the ODE setting. In this work, we propose Rex, a family of reversible exponential (stochastic) Runge-Kutta solvers obtained by applying Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into an algebraically reversible one for both diffusion ODEs and SDEs. Beyond a rigorous theoretical analysis -- establishing arbitrary-order convergence and a non-zero region of linear stability -- we empirically demonstrate that Rex achieves near-machine-precision reconstruction and improves Boltzmann sampling with flow models as well as image generation and editing with diffusion models.
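As a rough illustration of what "algebraically reversible" means, the sketch below uses a generic coupled two-state update whose inverse can be written in closed form, so that integrating forward and then backward recovers the initial state up to floating-point roundoff. This is not the Rex construction and does not involve Lawson methods; the vector field `f`, the coupling parameter `lam`, and the step size are illustrative assumptions.

```python
def f(y):
    # Toy vector field dy/dt = -y, chosen only for illustration.
    return -y


def forward_step(y, z, h, lam=0.9):
    """One forward step of a coupled two-state update."""
    y_next = lam * y + (1.0 - lam) * z + h * f(z)
    z_next = z + h * f(y_next)
    return y_next, z_next


def backward_step(y_next, z_next, h, lam=0.9):
    """Exact algebraic inverse of forward_step."""
    z = z_next - h * f(y_next)
    y = (y_next - (1.0 - lam) * z - h * f(z)) / lam
    return y, z


def round_trip(y0, h=0.05, steps=40):
    """Integrate forward, then invert step by step."""
    y, z = y0, y0
    for _ in range(steps):
        y, z = forward_step(y, z, h)
    y_end = y
    for _ in range(steps):
        y, z = backward_step(y, z, h)
    return y_end, y


y_end, y_recovered = round_trip(1.0)
```

The inverse never re-solves the ODE; it only undoes each arithmetic operation, which is why the reconstruction error is governed by roundoff instead of discretization error. In this toy setup the step size is kept below `1 - lam`, since larger steps make the update unstable on this linear problem.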

2602.17149 2026-06-03 cs.LG cs.AI 版本更新

TimeOmni-VL: Unified Models for Time Series Understanding and Generation

TimeOmni-VL:统一时间序列理解与生成的模型

Tong Guan, Sheng Pan, Johan Barthelemy, Zhao Li, Yujun Cai, Cesare Alippi, Ming Jin, Shirui Pan

发表机构 * Tsinghua University(清华大学)

AI总结 提出TimeOmni-VL框架,通过保真双向映射和理解引导生成,首次统一时间序列的理解与生成任务。

Comments Accepted by the Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

近期的时间序列建模在数值生成与语义理解之间存在明显鸿沟,研究表明生成模型往往依赖浅层模式匹配,而理解导向的模型难以输出高保真数值。尽管统一多模态模型(UMMs)已在视觉领域弥合这一差距,但其在时间序列上的潜力尚未被发掘。我们提出TimeOmni-VL,这是首个以视觉为中心的统一时间序列理解与生成框架,通过两项关键创新实现:(1)时间序列与图像之间的保真双向映射(Bi-TSI),改进了时间序列到图像(TS2I)和图像到时间序列(I2TS)的转换,确保近乎无损的变换。(2)理解引导生成。我们引入TSUMM-Suite,这是一个新颖的数据集,包含六个基于时间序列分析的理解任务,并耦合两个生成任务。通过校准的思维链,TimeOmni-VL首次利用时间序列理解作为高保真生成的显式控制信号。实验证实,这种统一方法显著提升了语义理解和数值精度,为多模态时间序列建模开辟了新前沿。

英文摘要

Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics and coupled with two generation tasks. With a calibrated Chain-of-Thought, TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.

2602.17063 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

符号锁定:随机初始化的权重符号持续存在并成为亚比特模型压缩的瓶颈

Akira Sakai, Yuma Ichikawa

发表机构 * Fujitsu Limited(富士通株式会社) Tokai University(东海大学) RIKEN Center for AIP(理化学研究所AIP研究中心)

AI总结 研究亚比特模型压缩中符号位的瓶颈问题,通过符号锁定理论解释权重符号的随机性来源,并提出一种从头开始的低秩符号模板训练方法以突破该瓶颈。

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

亚比特模型压缩的目标是将每个权重的存储降至1比特以下;当幅度被激进压缩时,符号位成为固定成本的瓶颈。在Transformer、CNN和MLP中,学习到的符号矩阵抵抗低秩近似,并且在频谱上与i.i.d. Rademacher基线无法区分。这种随机性导致了亚比特模型压缩的下界——1比特墙。尽管存在这种明显的随机性,大多数权重仍保留其初始化符号;翻转主要通过罕见的近零边界穿越发生,表明符号模式的随机性很大程度上继承自初始化。我们通过符号锁定理论形式化了这一行为,这是对SGD噪声下符号翻转的停时分析。在有界更新和零的小邻域内罕见重新进入的条件下,有效符号翻转的数量呈现几何尾部。基于这一机制,我们引入了一种从头开始的低秩符号模板训练方法,以防止这种1比特墙的出现。

英文摘要

Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. This randomness gives rise to the lower bound of sub-bit model compression -- the one-bit wall. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood of zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a from-scratch low-rank sign-template training method that prevents the emergence of this one-bit wall.

2602.12430 2026-06-03 cs.MA cs.AI 版本更新

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

大型语言模型的智能体技能:架构、获取、安全与未来路径

Renjun Xu, Yang Yan

发表机构 * ReDiscovery Hangzhou China(杭州ReDiscovery研究院) Westlake University Hangzhou China(西湖大学)

AI总结 本文综述了大型语言模型智能体技能的研究,涵盖架构基础(如SKILL.md规范、渐进式上下文加载)、技能获取(强化学习、自主发现、组合合成)、大规模部署(计算机使用智能体栈、GUI接地)以及安全挑战(26.1%社区技能含漏洞),并提出技能信任与生命周期治理框架。

Comments Accepted by Agent Skills '26 Workshop at ACM Conference on AI and Agentic Systems 2026

详情
AI中文摘要

从单体语言模型向模块化、配备技能的智能体的转变,标志着大型语言模型(LLM)在实践中部署方式的决定性转变。智能体技能——即智能体按需加载的指令、代码和资源的可组合包——无需重新训练即可实现动态能力扩展,而非将所有程序性知识编码在模型权重中。它被形式化为渐进式披露、可移植技能定义以及与模型上下文协议(MCP)集成的范式。本综述全面探讨了智能体技能领域,该领域在过去几个月迅速发展。我们沿四个轴组织该领域:(i)架构基础,考察SKILL.md规范、渐进式上下文加载以及技能与MCP的互补作用;(ii)技能获取,涵盖使用技能库的强化学习、自主技能发现(SEAgent)和组合技能合成;(iii)大规模部署,包括计算机使用智能体(CUA)栈、GUI接地进展以及OSWorld和SWE-bench上的基准进展;(iv)安全,最近的经验分析显示,26.1%的社区贡献技能包含漏洞,这促使我们提出技能信任与生命周期治理框架——一个四层、基于门的权限模型,将技能来源映射到分级部署能力。我们识别出七个开放挑战——从跨平台技能可移植性到基于能力的权限模型——并提出了实现可信、自我改进技能生态系统的研究议程。与先前广泛涵盖LLM智能体或工具使用的综述不同,本工作特别关注新兴的技能抽象层及其对下一代智能体系统的影响。项目仓库:https://github.com/scienceaix/agentskills

英文摘要

The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills -- composable packages of instructions, code, and resources that agents load on demand -- enable dynamic capability extension without retraining. This approach is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, which has evolved rapidly over the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL.md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries, autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework -- a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges -- from cross-platform skill portability to capability-based permission models -- and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills

2602.14279 2026-06-03 cs.LG cs.AI cs.CL cs.SI 版本更新

Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions

为谁查询什么:通过多轮LLM交互的自适应群体征询

Ruomeng Ding, Tianwei Gao, Thomas P. Zollo, Eitan Bachmat, Richard Zemel, Zhun Deng

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Columbia University(哥伦比亚大学) Ben-Gurion University of the Negev(贝内-约尔大学内盖夫分校)

AI总结 针对有限预算下群体属性不确定性降低问题,提出结合LLM期望信息增益与异构图神经网络传播的自适应群体征询框架,实现问题与受访者联合选择,在三个真实数据集上显著提升群体响应预测。

Comments Published as a conference paper at ICML 2026

详情
AI中文摘要

从调查和其他集体评估中征询信息以减少关于潜在群体属性的不确定性,需要在实际成本和缺失数据下分配有限的提问努力。尽管大型语言模型支持自然语言中的自适应多轮交互,但大多数现有征询方法优化了在固定受访者池中询问什么,并且在响应部分或不完整时不会调整受访者选择或利用群体结构。为解决这一差距,我们研究了自适应群体征询,这是一个多轮设置,其中智能体在明确的查询和参与预算下自适应地选择问题和受访者。我们提出了一个理论基础的框架,该框架结合了(i)基于LLM的期望信息增益目标,用于评分候选问题,以及(ii)异构图神经网络传播,该传播聚合观察到的响应和参与者属性,以插补缺失响应并指导每轮受访者选择。这种闭环过程查询一个小的、信息丰富的个体子集,同时通过结构化相似性推断群体级别的响应。在三个真实世界意见数据集上,我们的方法在预算受限的情况下持续提高了群体级别响应预测,包括在10%受访者预算下CES上相对提升超过12%。

英文摘要

Eliciting information to reduce uncertainty about latent group-level properties from surveys and other collective assessments requires allocating limited questioning effort under real costs and missing data. Although large language models enable adaptive, multi-turn interactions in natural language, most existing elicitation methods optimize what to ask with a fixed respondent pool, and do not adapt respondent selection or leverage population structure when responses are partial or incomplete. To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets. We propose a theoretically grounded framework that combines (i) an LLM-based expected information gain objective for scoring candidate questions with (ii) heterogeneous graph neural network propagation that aggregates observed responses and participant attributes to impute missing responses and guide per-round respondent selection. This closed-loop procedure queries a small, informative subset of individuals while inferring population-level responses via structured similarity. Across three real-world opinion datasets, our method consistently improves population-level response prediction under constrained budgets, including a >12% relative gain on CES at a 10% respondent budget.
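The expected information gain objective mentioned above can be illustrated with a small discrete example: for a prior over hypotheses and a table of answer likelihoods, EIG is the prior entropy minus the expected posterior entropy. In the paper these quantities are LLM-based; the sketch below simply assumes they are given as numbers, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """EIG of a question, where likelihood[h][a] = P(answer a | hypothesis h)."""
    n_hyp = len(prior)
    n_ans = len(likelihood[0])
    eig = entropy(prior)
    for a in range(n_ans):
        # Marginal probability of observing answer a.
        p_a = sum(prior[h] * likelihood[h][a] for h in range(n_hyp))
        if p_a == 0:
            continue
        # Bayes update of the hypothesis distribution given answer a.
        posterior = [prior[h] * likelihood[h][a] / p_a for h in range(n_hyp)]
        eig -= p_a * entropy(posterior)
    return eig


prior = [0.5, 0.5]
# A question whose answer fully identifies the hypothesis.
eig_informative = expected_information_gain(prior, [[1.0, 0.0], [0.0, 1.0]])
# A question whose answer is independent of the hypothesis.
eig_uninformative = expected_information_gain(prior, [[0.5, 0.5], [0.5, 0.5]])
```

Scoring every candidate question this way and asking the highest-scoring one is the standard greedy use of EIG; choosing which respondents to ask would need an additional model of the population, which this sketch does not attempt.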

2602.11908 2026-06-03 cs.AI cs.CL cs.LG 版本更新

When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

LLM何时应降低具体性?面向可靠长文本生成的选择性抽象

Shani Goren, Ido Galil, Ran El-Yaniv

发表机构 * Technion(以色列理工学院) NVIDIA(英伟达)

AI总结 针对LLM在长文本生成中因低置信度而丢弃有价值信息的问题,提出选择性抽象框架,通过原子级抽象替换不确定内容,在保持语义的同时提升准确性和可靠性。

详情
AI中文摘要

LLM被广泛使用,但仍容易出现事实错误,这削弱了用户信任并限制了在高风险场景中的采用。缓解这一风险的一种方法是为模型配备不确定性估计机制,在置信度低时弃权。然而,这种二元的“全有或全无”方法在长文本场景中过于严格,常常丢弃有价值的信息。我们引入了选择性抽象(SA),这是一个框架,使LLM能够通过选择性地降低不确定内容的细节来用具体性换取可靠性。我们首先通过选择性风险和覆盖率的视角形式化SA。然后,我们提出原子级选择性抽象,这是一种声明级别的实例化,将响应分解为原子声明(简短、自包含的陈述,每个表达一个单一事实),并用更高置信度、更低具体性的抽象替换不确定的原子。为了评估这一框架,我们开发了一个新颖的端到端流水线用于开放式生成,将风险实例化为事实正确性,并使用信息论度量保留信息来衡量覆盖率。在FactScore和LongFact-Objects基准测试上的六个开源模型中,原子级SA始终优于现有基线,在风险-覆盖率曲线下面积(AURC)上比声明移除方法提升高达27.73%,表明降低具体性可以在保留大部分原始含义的同时提升准确性和可靠性。

英文摘要

LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher-confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of the original meaning.
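
For readers unfamiliar with risk-coverage curves, the sketch below shows the classic selective-prediction form of AURC: items are kept in order of decreasing confidence, and the error rate among the kept items is averaged over all coverage levels. It is an assumption-laden illustration; the paper measures coverage with an information-theoretic notion of retained information, which is not what this code computes.

```python
def risk_coverage_curve(confidences, errors):
    """Return (coverage, selective risk) pairs.

    confidences[i] : confidence score of claim i (higher = more trusted)
    errors[i]      : 1 if claim i is factually wrong, else 0
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, cumulative_errors = [], 0.0
    for kept, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        curve.append((kept / len(order), cumulative_errors / kept))
    return curve


def aurc(confidences, errors):
    """Area under the risk-coverage curve (lower is better)."""
    curve = risk_coverage_curve(confidences, errors)
    return sum(risk for _, risk in curve) / len(curve)
```

With well-ranked confidences the wrong claims are dropped first, so the curve stays low until coverage is nearly complete.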

2602.10387 2026-06-03 cs.DB cs.AI 版本更新

Test-Time Optimization of Physical Query Plans with LLMs

基于LLM的物理查询计划测试时优化

Mehmet Hamza Erol, Xiangpeng Hao, Federico Bianchi, Ciro Greco, Jacopo Tagliabue, James Zou

发表机构 * Stanford University(斯坦福大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) TogetherAI Bauplan

AI总结 提出DBPlanBench框架,利用LLM在测试时通过语义推理和进化搜索优化物理查询计划,在OLAP查询中实现1.05-1.12倍中位数加速,并支持小规模到大规模的迁移。

Comments Code is available at: https://github.com/BauplanLabs/DBPLANBENCH

详情
AI中文摘要

传统查询优化依赖于基于成本的优化器,使用预定义的启发式和统计模型来估计执行成本(如运行时间、内存和I/O)。改进这些需要大量的工程努力,但它们通常无法利用查询和模式中的语义相关性来获得更好的物理计划。然而,大型语言模型(LLMs)能够推理列语义、值分布以及经典统计所忽略的更广泛的领域上下文。我们介绍了DBPlanBench,一个用于DataFusion引擎的框架,它通过紧凑的序列化表示暴露物理计划,并将LLM提出的编辑作为JSON补丁应用。在此框架上,我们实例化了一个测试时优化工作流,其中LLM检查物理查询计划,基于语义推理提出局部编辑,并通过进化搜索在迭代中优化候选方案。我们针对OLAP查询,其中重复执行的重负载使得即使是微小的效率提升也能转化为显著的累积节省。我们特别将评估重点放在连接重排序和连接侧选择上,其中基数估计误差会复合倍增。在TPC-H上中位数加速达到1.10-1.12倍,在TPC-DS上达到1.05-1.07倍,某些查询加速高达4.78倍。我们还证明了在小规模因子下发现的优化可以有效地迁移到更大规模,支持低成本的从小规模到大规模的工作流。

英文摘要

Traditional query optimization relies on cost-based optimizers that estimate execution cost (e.g., runtime, memory, and I/O) using predefined heuristics and statistical models. Improving these requires substantial engineering effort, yet they often cannot exploit semantic correlations in queries and schemas that could enable better physical plans. Large language models (LLMs), however, can reason about column semantics, value distributions, and broader domain context that classical statistics miss. We introduce DBPlanBench, a harness for the DataFusion engine that exposes physical plans through a compact serialized representation and applies LLM-proposed edits as JSON patches. On this harness, we instantiate a test-time optimization workflow where an LLM examines physical query plans, proposes localized edits based on semantic reasoning, and an evolutionary search refines the candidates across iterations. We target OLAP queries, where heavy, repeated execution turns even small efficiency gains into substantial cumulative savings. We specifically focus our evaluation on join reordering and join-side selection, where cardinality-estimation errors compound multiplicatively. Median speedups reach $1.10$-$1.12\times$ on TPC-H and $1.05$-$1.07\times$ on TPC-DS, with some achieving up to $4.78\times$. We also demonstrate that optimizations discovered at small scale factors transfer effectively to larger ones, supporting a low-cost small-to-large workflow.

2602.10352 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

从可解释性工件中学习自我解释:在向量-标签对上训练轻量级适配器

Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena

发表机构 * University of Washington(华盛顿大学)

AI总结 通过训练轻量级适配器(标量仿射适配器,仅需d_model+1参数)在可解释性工件上,保持语言模型完全冻结,实现了跨任务和模型族的可靠自我解释,在稀疏自编码器特征标注、主题识别和多跳推理桥接实体解码等任务上显著优于未训练基线。

Comments 26 pages, 18 tables, 17 figures. Code and data at https://github.com/agencyenterprise/selfie-adapters

详情
AI中文摘要

自我解释方法促使语言模型描述其内部状态,但由于超参数敏感性而仍然不可靠。我们表明,在可解释性工件上训练轻量级适配器,同时保持语言模型完全冻结,可以在任务和模型族中产生可靠的自我解释。一个仅需$d_\text{model}+1$个参数的标量仿射适配器就足够了:训练后的适配器生成稀疏自编码器特征标签,其性能优于训练标签本身(在70B规模下,生成评分为70% vs 50%),以94%的召回率@1识别主题(未训练基线为1%),并在多跳推理中解码既不在提示中也不在响应中出现的桥接实体,从而无需思维链即可揭示隐式推理。仅学习到的偏置向量就占了改进的85%,更简单的适配器比更具表达力的替代方案具有更好的泛化能力。通过提示描述控制模型知识,我们发现从7B到72B参数,自我解释的提升超过了能力提升。我们的结果表明,自我解释随着规模扩大而改善,且无需修改被解释的模型。

英文摘要

Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (70% vs 50% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
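
One natural reading of "a scalar affine adapter with just $d_\text{model}+1$ parameters" is a single learned scale plus a learned bias vector. The sketch below shows only that parameter layout under this assumption; the class name is hypothetical and no training loop or model integration from the paper is reproduced.

```python
class ScalarAffineAdapter:
    """Maps an activation vector x to scale * x + bias.

    Parameters: 1 scalar scale + d_model bias entries = d_model + 1.
    (Illustrative reading of the abstract, not the authors' code.)
    """

    def __init__(self, d_model):
        self.scale = 1.0              # single shared scalar
        self.bias = [0.0] * d_model   # one bias entry per hidden dimension

    def __call__(self, vector):
        if len(vector) != len(self.bias):
            raise ValueError("vector length must equal d_model")
        return [self.scale * v + b for v, b in zip(vector, self.bias)]

    def num_parameters(self):
        return 1 + len(self.bias)
```

Under this reading, the reported result that the bias vector alone accounts for most of the improvement corresponds to fixing `scale` and learning only `bias`.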

2602.05302 2026-06-03 cs.AI 版本更新

PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

PieArena:在真实谈判场景中对语言智能体进行排名与画像

Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain

发表机构 * Yale University(耶鲁大学) UC Berkeley(加州大学伯克利分校) Bloomberg(彭博) Rutgers University(罗格斯大学)

AI总结 本文提出PieArena基准,通过多智能体交互评估LLM的谈判能力,并开发排名模型与行为画像,发现联合意图框架对中低端模型提升显著,前沿模型(如GPT-5)在谈判中达到或超过人类基线。

详情
AI中文摘要

我们深入评估了LLM的谈判能力,这是一项需要战略推理、心理理论和经济价值创造的核心商业任务。为此,我们引入了PieArena,一个基于精英商学院MBA谈判课程中真实场景的多智能体交互的大规模谈判基准。我们在三种配对模式下评估语言智能体:镜像博弈、交叉博弈和人与语言模型博弈。我们开发了一个用于连续谈判收益的排名模型,该模型生成顺序不变、不确定性量化的排行榜,同时纠正系统性的实验不对称性。我们进一步研究了联合意图智能体框架的效果,发现其收益不对称:对中低端语言模型有大幅提升,而对前沿语言模型的边际收益递减。作为校准锚点,我们收集了受过训练的商学院学生之间以及学生与语言模型之间的谈判数据,发现代表性前沿语言智能体(GPT-5)在我们的评估设置中达到或超过了这一人类基线。除了交易结果,PieArena还提供了多维行为画像,揭示了指令遵从性、计算准确性以及法官评估的欺骗性和声誉方面的跨模型异质性,说明了超越仅基于结果的排行榜的评估价值。

英文摘要

We present an in-depth evaluation of LLMs' ability to negotiate, a central business task requiring strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large-scale negotiation benchmark grounded in multi-agent interactions over realistic scenarios adapted from MBA negotiation courses at an elite business school. We evaluate language agents across three pairing regimes: mirror-play, cross-play, and human-LM play. We develop a ranking model for continuous negotiation payoffs that yields order-invariant, uncertainty-quantified leaderboards while correcting for systematic experimental asymmetries. We further study the effects of joint-intentionality agentic scaffolding and find asymmetric gains, with large improvements for mid- and lower-tier LMs and diminishing returns for frontier LMs. As calibration anchors, we collect human-human and human-LM negotiation data from trained business school students, finding that a representative frontier language agent (GPT-5) matches or exceeds this human baseline in our evaluation settings. Beyond deal outcomes, PieArena provides a multi-dimensional behavioral profile that reveals cross-model heterogeneity in instruction compliance, computation accuracy, as well as judge-assessed deception and reputation, illustrating the value of evaluation beyond outcome-only leaderboards.

2510.12049 2026-06-03 econ.GN cs.AI q-fin.EC 版本更新

Generative AI and Sales Productivity: Field Experiments in Online Retail

生成式人工智能与销售效率:在线零售中的现场实验

Lu Fang, Zhe Yuan, Kaifu Zhang, Dante Donati, Miklos Sarvary

发表机构 * Duke University(杜克大学) Imperial Business School(帝国商学院) BIG AI Conference(大数据人工智能会议) MSI AI Forum(MSI人工智能论坛) TSE Digital Economics Conference(TSE数字经济会议) AIML Conference(人工智能与机器学习会议) Operational Innovation Network Summit(运营创新网络峰会) University of Rochester(罗切斯特大学) UC Davis(加州大学戴维斯分校) TUM Workshop on Generative AI in Marketing(慕尼黑工业大学生成AI在营销中的研讨会) UCL School of Management(伦敦大学学院管理学院) Columbia Business School(哥伦比亚商学院) Business & Generative AI Conference(商业与生成AI会议) Zhejiang University School of Management(浙江大学管理学院)

AI总结 通过大规模随机现场实验,量化生成式人工智能(GenAI)对在线零售销售业绩的短期影响,发现GenAI在多数工作流中提升销售额,主要通过提高转化率而非客单价,且对经验较少的消费者效果更显著。

Comments Keywords: Artificial Intelligence, Consumer Experience, Field Experiments, GenAI, Productivity, Retail Platforms, Sales. JEL codes: C93, D24, L81, M31, O3

详情
AI中文摘要

我们通过在一家领先的跨境在线零售平台上进行涉及数百万用户和产品的大规模随机现场实验,量化了生成式人工智能(GenAI)对销售业绩的短期影响。在2023-2024年间,该平台将GenAI整合到七个面向消费者的业务流程中,涵盖客户服务、消费者-产品匹配、广告和卖家服务。我们发现,GenAI的采用在大多数工作流中提高了销售额,效果范围从无显著影响到16.3%,具体取决于GenAI相对于基线公司实践的边际贡献。在四个具有正向销售效果的GenAI应用中,隐含的年增量价值约为5美元——考虑到零售商的规模和GenAI采用的早期阶段,这是一个具有经济意义的影响。收益主要通过更高的转化率而非更大的购物车价值实现,这与GenAI通过减少搜索、信息、沟通和个性化摩擦来改善购物体验相一致。重要的是,这些效应并未与更差的购买后结果相关,因为产品退货率和客户评分没有恶化。最后,我们记录了显著的需求侧异质性,对经验较少的消费者收益更大。我们的发现提供了新颖的大规模因果证据,展示了GenAI如何塑造在线零售的销售效率,突出了其即时价值和更广泛的潜力。

英文摘要

We quantify the short-term impact of Generative Artificial Intelligence (GenAI) on sales performance through a series of large-scale randomized field experiments involving millions of users and products at a leading cross-border online retail platform. Over 2023-2024, the platform integrated GenAI into seven consumer-facing business workflows spanning customer service, consumer-product matching, advertising, and seller services. We find that GenAI adoption increases sales in most workflows, with effects ranging from no detectable impact to $16.3\%$, depending on GenAI's marginal contribution relative to baseline firm practices. Across the four GenAI applications with positive sales effects, the implied annual incremental value is roughly $\$5$, an economically meaningful impact given the retailer's scale and the early stage of GenAI adoption. The gains operate primarily through higher conversion rates rather than larger cart values, consistent with GenAI improving the shopping experience by reducing search, information, communication, and personalization frictions. Importantly, these effects are not associated with worse post-purchase outcomes, as product return rates and customer ratings do not deteriorate. Finally, we document substantial demand-side heterogeneity, with larger gains for less experienced consumers. Our findings provide novel, large-scale causal evidence on how GenAI shapes sales productivity in online retail, highlighting both its immediate value and broader potential.

2602.09708 2026-06-03 cs.LG cs.AI cs.CV cs.NA math.NA 版本更新

Physics-informed diffusion models in spectral space

谱空间中的物理信息扩散模型

Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen

发表机构 * ETH Zürich(苏黎世联邦理工学院)

AI总结 提出物理信息谱扩散(PISD)方法,结合生成式潜扩散模型与物理信息机器学习,在谱表示潜空间中对偏微分方程参数和解进行扩散建模,通过扩散后验采样施加物理约束和测量条件,在泊松、亥姆霍兹和不可压缩纳维-斯托克斯方程上展现出比现有扩散求解器更高的精度和计算效率。

Comments 18 pages, 10 figures

详情
AI中文摘要

我们提出物理信息谱扩散(PISD),一种将生成式潜扩散模型与物理信息机器学习相结合的方法,用于生成基于部分观测的偏微分方程(PDE)的解,特别包括正向和逆向PDE问题。我们在缩放谱表示的潜空间中通过扩散过程学习PDE参数和解的联合分布,其中高斯噪声对应于具有受控正则性的函数。与基于网格的扩散模型相比,这种谱公式能够实现显著的降维,并确保函数空间中的诱导过程保持在PDE算子定义良好的函数类内。基于扩散后验采样,我们在推理过程中施加物理信息约束和测量条件,在每个扩散步骤应用基于Adam的更新。我们在泊松、亥姆霍兹和不可压缩纳维-斯托克斯方程上评估了所提出的方法,与现有的基于扩散的PDE求解器(在稀疏观测下达到最先进水平)相比,展示了更高的精度和计算效率。代码可在 https://github.com/deeplearningmethods/PISD 获取。

英文摘要

We propose physics-informed spectral diffusion (PISD), a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier-Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.

2601.02380 2026-06-03 cs.CY cs.AI 版本更新

LLMs, Reasoning and Plagiarism

LLM、推理与抄袭

Elchanan Mossel

发表机构 * Elchanan Mossel

AI总结 本文指出当前声称LLM具备科学发现和通用智能的说法不满足波普尔可反驳性原则,并提出了提高科学透明度和可重复性的指南。

Comments The authors explicitly reserve all rights in this work. No permission is granted for the reproduction, storage, or use of this document for the purpose of training artificial intelligence systems or for text and data mining (TDM), including but not limited to the generation of embeddings, summaries, or synthetic derivatives. Claude and Gemini were used in writing this manuscript

详情
AI中文摘要

最近的报告声称大型语言模型(LLM)能够推导出新的科学成果并展现出人类水平的通用智能。此类说法与关于LLM行为的两种不同叙事纠缠在一起:一种认为LLM是真正通过推理获得新知识的综合引擎,另一种认为LLM是在不注明出处的情况下检索并复述他人的工作。在科学语境中,这两者最好被理解为推理与抄袭之间的对比。由于模型的核心组成部分(训练数据和交互记录)仍不透明,要判明真相究竟落在这两种叙事之间的何处非常困难。因此,关于LLM推理的说法不满足波普尔的可反驳性原则。我们提出了透明度和可重复性指南,使有关推理的说法能够用科学方法加以研究。我们认为,推理叙事的主导地位实际上正在助长科学文献中的抄袭;我们讨论了可能的应对措施。

英文摘要

Recent reports claim that Large Language Models (LLMs) derive new science and exhibit human-level general intelligence. Such claims are entangled with two different narratives about what LLMs do: one in which they are an engine of synthesis that genuinely reasons to new knowledge, and one in which they retrieve and re-emit the work of others without attribution. In the scientific setting these are best understood as a contrast between \emph{reasoning} and \emph{plagiarism}. Finding where the truth lies between these two narratives is very challenging, as central components of the model -- the training data and the interaction transcript -- remain opaque. Thus claims of LLM reasoning do not satisfy Popper's refutability principle. We propose guidelines for transparency and reproducibility that will allow reasoning claims to be studied using the scientific method. The dominance of the reasoning narrative, we suggest, is in practice encouraging plagiarism in the scientific literature; we discuss what might be done about it.

2602.08873 2026-06-03 cs.IR cs.AI cs.CY cs.SI physics.soc-ph 版本更新

Whose Name Comes Up? II: Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

谁的名字出现?II:基于基准测试和干预审计的LLM学者推荐系统

Lisette Espín-Noboa, Gonzalo Gabriel Méndez

发表机构 * Complexity Science Hub Vienna(维也纳复杂性科学中心) Universitat Politècnica de València(瓦伦西亚理工大学) Inria Rennes(法国国家信息与自动化研究所雷恩中心)

AI总结 提出LLMScholarBench基准,通过温度变化、表示约束提示和检索增强生成等干预措施审计22个LLM在物理专家推荐中的技术质量和社会代表性,发现干预措施带来不同权衡。

Comments In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26). 30 pages: 11 pages in main (6 figures, 1 table), 19 pages in appendix (22 figures, 2 tables)

详情
AI中文摘要

大型语言模型(LLM)现在被用于学术专家推荐。现有的审计通常孤立地评估此类推荐,忽略了最终用户的推理时干预。因此,尚不清楚失败(例如,拒绝、幻觉、覆盖不均)源于模型选择还是部署决策。我们引入了LLMScholarBench,一个用于审计基于LLM的学者推荐的基准,它联合评估模型基础设施和最终用户在多个任务上的干预。LLMScholarBench使用九个指标衡量技术质量和社会代表性。我们在物理专家推荐中实例化该基准,并在温度变化、表示约束提示和通过网络搜索的检索增强生成(RAG)下审计22个LLM。我们的结果表明,每种干预都带来不同的权衡。较高的温度会降低有效性、一致性和事实性。表示约束提示提高了多样性,但以牺牲事实性为代价,而RAG主要提高了技术质量,同时降低了多样性和平等性。总体而言,最终用户的干预重塑了权衡,而不是提供统一的收益。LLMScholarBench使得在基于LLM的学者推荐中,跨模型和干预的所有这些动态都可审计。

英文摘要

Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in isolation, ignoring end-user inference-time interventions. Thus, it remains unclear whether failures (e.g., refusals, hallucinations, uneven coverage) stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that each intervention entails distinct tradeoffs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, while RAG primarily improves technical quality while reducing diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing uniform gains. LLMScholarBench makes all these dynamics auditable across models and interventions in LLM-based scholar recommendations.

2602.08335 2026-06-03 cs.AI 版本更新

Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

谁应得奖励?SHARP:基于Shapley信用的多智能体系统优化

Yanming Li, Xuelin Zhang, WenJie Lu, Ziye Tang, Maodong Wu, Haotian Luo, Tongtong Wu, Zijie Peng, Hongze Mi, Yibo Feng, Naiqiang Tan, Chao Huang, Lian Peng, Li Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对多智能体系统中信用分配难题,提出SHARP框架,通过分解奖励机制(全局广播奖励、Shapley边际信用奖励和工具过程奖励)实现精确信用归因,显著提升强化学习性能。

详情
AI中文摘要

通过多智能体系统将大型语言模型(LLMs)与外部工具集成,为分解和解决复杂问题提供了一种有前景的新范式。然而,由于信用分配挑战,训练这些系统仍然非常困难,因为通常不清楚哪个特定功能智能体对决策轨迹的成功或失败负责。现有方法通常依赖稀疏或全局广播奖励,无法捕捉个体贡献,导致强化学习效率低下。为解决这些限制,我们引入了基于Shapley的层次化强化策略归因(SHARP),一种通过精确信用归因优化多智能体强化学习的新框架。SHARP通过跨轨迹组归一化智能体特定优势来有效稳定训练,主要通过一种分解奖励机制实现,该机制包括全局广播准确率奖励、每个智能体的基于Shapley的边际信用奖励,以及提高执行效率的工具过程奖励。在各种真实世界基准上的大量实验表明,SHARP显著优于近期最先进的基线,在单智能体和多智能体方法上分别实现了23.66%和14.05%的平均匹配改进。

英文摘要

Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contributions and leading to inefficient reinforcement learning. To address these limitations, we introduce the Shapley-based Hierarchical Attribution for Reinforcement Policy (SHARP), a novel framework for optimizing multi-agent reinforcement learning via precise credit attribution. SHARP effectively stabilizes training by normalizing agent-specific advantages across trajectory groups, primarily through a decomposed reward mechanism comprising a global broadcast-accuracy reward, a Shapley-based marginal-credit reward for each agent, and a tool-process reward to improve execution efficiency. Extensive experiments across various real-world benchmarks demonstrate that SHARP significantly outperforms recent state-of-the-art baselines, achieving average match improvements of 23.66% and 14.05% over single-agent and multi-agent approaches, respectively.
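
The Shapley-based marginal credit mentioned in this abstract can be illustrated with the textbook Shapley value, computed exactly by averaging each agent's marginal contribution over all join orders. The agent names and the `value` function below are hypothetical; SHARP's actual reward decomposition and any approximation it uses are not reproduced here.

```python
from itertools import permutations


def shapley_values(agents, value):
    """Exact Shapley credit for each agent.

    agents : list of hashable agent names
    value  : function mapping a frozenset of agents to a task score
             (hypothetical; e.g. accuracy when only those agents act)
    """
    credit = {agent: 0.0 for agent in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for agent in order:
            joined = coalition | {agent}
            # Marginal contribution of this agent given who came before.
            credit[agent] += value(joined) - value(coalition)
            coalition = joined
    return {agent: total / len(orders) for agent, total in credit.items()}
```

Exact enumeration costs factorial time in the number of agents, so it is only practical for a handful of agents; larger systems would need sampling.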

2602.06960 2026-06-03 cs.CL cs.AI 版本更新

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

InftyThink+:通过强化学习实现高效且有效的无限时域推理

Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen

发表机构 * Tsinghua University(清华大学)

AI总结 提出InftyThink+框架,通过强化学习优化迭代推理的总结时机、保留内容和恢复策略,在DeepSeek-R1-Distill-Qwen-1.5B上提升AIME24准确率21%,并降低推理延迟。

Comments ICML 2026: https://openreview.net/forum?id=tyul8kXaJU Project Page: https://zju-real.github.io/InftyThink-Plus Code: https://github.com/ZJU-REAL/InftyThink-Plus Models: https://huggingface.co/collections/yanyc/inftythink

详情
AI中文摘要

大型推理模型通过扩展推理时的思维链取得了强劲性能,但这种范式存在二次成本、上下文长度限制以及因中间丢失效应导致的推理退化问题。迭代推理通过定期总结中间思考缓解了这些问题,但现有方法依赖监督学习或固定启发式,未能优化何时总结、保留什么以及如何恢复推理。我们提出InftyThink+,一个端到端强化学习框架,它优化整个迭代推理轨迹,基于模型控制的迭代边界和显式总结。InftyThink+采用两阶段训练方案:监督冷启动后接轨迹级强化学习,使模型学习策略性总结和继续决策。在DeepSeek-R1-Distill-Qwen-1.5B上的实验表明,InftyThink+在AIME24上准确率提升21%,显著优于传统长思维链强化学习,同时在分布外基准上泛化能力更强。此外,InftyThink+大幅降低推理延迟并加速强化学习训练,展示了更强的推理效率与性能提升。

英文摘要

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.

2511.16275 2026-06-03 cs.CL cs.AI 版本更新

SeSE: Black-Box Uncertainty Quantification for Large Language Models Based on Structural Information Theory

SeSE: 基于结构信息理论的大语言模型黑盒不确定性量化

Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu

发表机构 * School of Cyber Science and Technology Beihang University(北京航空航天大学网络空间安全学院) School of Computer Science and Engineering Beihang University(北京航空航天大学计算机学院) Didi Chuxing(滴滴出行) Laboratory for Big Data and Decision National University of Defense Technology(国防科技大学大数据与决策实验室) Department of Computer Science University of Illinois Chicago(伊利诺伊大学芝加哥分校计算机科学系)

AI总结 提出SeSE框架,通过构建语义空间的最优层次抽象并计算结构熵,实现大语言模型的黑盒不确定性量化,理论推广了语义熵并在长文本生成中优于现有方法。

Comments Accepted by UAI 2026

详情
AI中文摘要

可靠的不确定性量化(UQ)对于在安全关键场景中部署大语言模型(LLMs)至关重要,因为它使模型能够在不确定时避免回应,从而避免产生幻觉,即看似合理但事实错误的回应。然而,尽管语义UQ方法取得了先进性能,它们忽略了可能实现更精确不确定性估计的潜在语义结构信息。在本文中,我们提出了语义结构熵(SeSE),一个适用于开源和闭源LLMs的原则性黑盒UQ框架。为了揭示语义空间的内在结构,SeSE通过具有最小结构熵的编码树构建其最优层次抽象。因此,该编码树的结构熵量化了最优压缩后LLM语义空间内的固有不确定性。此外,与主要关注简单短文本生成的现有方法不同,我们将SeSE扩展到为长文本输出提供可解释的、细粒度的不确定性估计。我们从理论上证明SeSE推广了语义熵(LLM中UQ的金标准),并通过24个模型-数据集组合的实验证明其优于强基线的性能。

英文摘要

Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinations, i.e., plausible yet factually incorrect responses. However, while semantic UQ methods have achieved advanced performance, they overlook latent semantic structural information that could enable more precise uncertainty estimates. In this paper, we propose Semantic Structural Entropy (SeSE), a principled black-box UQ framework applicable to both open- and closed-source LLMs. To reveal the intrinsic structure of the semantic space, SeSE constructs its optimal hierarchical abstraction through an encoding tree with minimal structural entropy. The structural entropy of this encoding tree thus quantifies the inherent uncertainty within LLM semantic space after optimal compression. Additionally, unlike existing methods that primarily focus on simple short-form generation, we extend SeSE to provide interpretable, granular uncertainty estimation for long-form outputs. We theoretically prove that SeSE generalizes semantic entropy, the gold standard for UQ in LLMs, and empirically demonstrate its superior performance over strong baselines across 24 model-dataset combinations.
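
As background for the semantic entropy baseline that SeSE is said to generalize, here is a minimal black-box sketch: sample several answers, group them by meaning, and take the entropy of the group frequencies. The grouping step is assumed to have been done already (the `cluster_ids` input is hypothetical), and SeSE's encoding-tree construction is not reproduced.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) over meaning clusters of sampled answers.

    cluster_ids[i] is the meaning-cluster label of the i-th sampled answer,
    assumed to come from some external equivalence check.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

If every sample lands in one cluster the entropy is zero, which signals a confident model; answers spread evenly over many clusters give the maximum value.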

2511.12085 2026-06-03 cs.CR cs.AI cs.LG 版本更新

A Robust and Explainable Transformer-Based Framework for Phishing Email Detection

一种鲁棒且可解释的基于Transformer的钓鱼邮件检测框架

Sajad U P

发表机构 * Independent Researcher(独立研究者)

AI总结 提出基于DistilBERT的轻量级钓鱼邮件检测框架,通过梯度对抗训练和字符级噪声增强鲁棒性,并集成LIME、SHAP和IG三种可解释AI方法,结合Flan-T5-Small生成自然语言解释,提升检测准确性和用户信任。

详情
AI中文摘要

钓鱼及相关网络威胁正变得越来越复杂,基于电子邮件的钓鱼仍然是最持久的攻击载体。这些攻击利用人类漏洞来传递恶意软件或获取对敏感信息的未授权访问。基于Transformer的模型通过强大的上下文语言理解增强了钓鱼检测;然而,由于缺乏可解释性,它们通常被视为黑盒。此外,最近的AI驱动攻击进一步削弱了模型的韧性。为了解决这些挑战,本文提出了一种基于DistilBERT(一种轻量级Transformer模型)的轻量级钓鱼检测框架。通过使用快速梯度法(FGM)进行基于梯度的对抗训练,并结合随机字符级扰动,增强了对嵌入级扰动和字符级输入噪声的鲁棒性。为了提高透明度,集成了三种突出的可解释AI(XAI)方法:LIME(局部可解释模型无关解释)、SHAP(SHapley Additive exPlanations)和IG(积分梯度),以解释模型决策。一个结构化的基于规则的提示结合模型预测和XAI特征,引导Flan-T5-Small生成通俗易懂、基于证据的解释。实验结果表明,所提出的框架在准确性和韧性方面优于未经鲁棒性增强的标准DistilBERT检测模型。这种集成方法有助于弥合模型可靠性与用户信任之间的差距,推动透明钓鱼检测的发展。

英文摘要

Phishing and related cyber threats are becoming increasingly sophisticated, with email-based phishing remaining the most persistent attack vector. These attacks exploit human vulnerabilities to deliver malware or gain unauthorized access to sensitive information. Transformer-based models enhance phishing detection through robust contextual language understanding, yet they are often regarded as black boxes due to a lack of interpretability. Moreover, recent AI-enabled attacks further undermine model resilience. To address these challenges, this work proposes a phishing detection framework based on DistilBERT, a lightweight Transformer model. Robustness to embedding-level perturbations and character-level input noise is enhanced through gradient-based adversarial training using the Fast Gradient Method (FGM), combined with stochastic character-level perturbations. To improve transparency, three prominent Explainable AI (XAI) methods, LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), and IG (Integrated Gradients), are integrated to interpret model decision-making. A structured rule-based prompt combines model predictions and XAI features to guide Flan-T5-Small in generating plain-language, evidence-based explanations. Experimental results demonstrate that the proposed framework outperforms a standard DistilBERT-based detection model trained without robustness enhancements in terms of accuracy and resilience. This integrated approach helps bridge the gap between model reliability and user trust, advancing transparent phishing detection.
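
The Fast Gradient Method named in this abstract perturbs an embedding along the L2-normalised gradient of the loss. The sketch below shows only that perturbation step on plain Python lists; the function names and the default `epsilon` are assumptions, and the paper's training loop and hyperparameters are not given in the abstract.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return r = epsilon * g / ||g||_2 for a gradient vector g."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient signal: leave the embedding unperturbed.
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def perturb_embedding(embedding, grad, epsilon=1.0):
    """Adversarial embedding used for the extra training pass."""
    delta = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, delta)]
```

In adversarial training the model is typically run once on the clean embedding and once on the perturbed one, and both losses contribute to the update.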

2602.06219 2026-06-03 cs.RO cs.AI 版本更新

Coupled Local and Global World Models for Efficient First Order RL

耦合局部与全局世界模型的高效一阶强化学习

Joseph Amigo, Rooholla Khorrambakht, Nicolas Mansard, Ludovic Righetti

发表机构 * Machines in Motion Laboratory, New York University, USA(纽约大学运动机器实验室) LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France(图卢兹大学LAAS-CNRS中心) Artificial and Natural Intelligence Toulouse Institute, Toulouse, France(图卢兹人工智能与自然智能研究所)

AI总结 提出一种通过解耦一阶梯度方法在数据驱动的世界模型内训练策略的方法,结合局部和全局世界模型实现高效梯度计算,在Push-T任务和四足机器人操作任务中显著优于PPO。

Comments Project website: https://coupled-global-local-wm-rl.pages.dev/

详情
AI中文摘要

世界模型为在标准模拟器难以处理的情况下更忠实地捕捉复杂动力学(包括接触和非刚性)以及复杂感官信息(如视觉感知)提供了一条有前景的途径。然而,这些模型的计算复杂度高,对流行的强化学习方法构成了挑战,这些方法已成功用于模拟器解决复杂运动任务,但在操作任务上仍存在困难。本文介绍了一种完全绕过模拟器的方法,在从机器人与真实环境交互中学习到的世界模型内部训练强化学习策略。其核心是通过一种新颖的解耦一阶梯度方法实现大规模扩散模型的策略训练:全尺度世界模型生成准确的前向轨迹,而轻量级潜在空间代理近似其局部动力学以实现高效梯度计算。这种局部与全局世界模型的耦合确保了高保真展开以及计算上可处理的微分。我们在Push-T操作任务上证明了该方法的有效性,其在样本效率上显著优于PPO。我们还通过四足机器人的自我中心物体操作任务进一步评估了该方法。这些结果共同表明,在数据驱动的世界模型内部学习是解决难以建模的图像空间强化学习任务的一条有前景的途径,无需依赖手工设计的物理模拟器。

英文摘要

World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally complex to evaluate, posing a challenge for popular RL approaches that have been successfully used with simulators to solve complex locomotion tasks yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.

2602.04899 2026-06-03 cs.CR cs.AI 版本更新

Phantom Transfer: Data Poisoning can Survive Data-Level Defences

幻影转移:数据投毒可存活于数据级防御

Andrew Draganov, Tolga H. Dur, Anandmayi Bhongade, Mary Phuong

AI总结 提出一种名为“幻影转移”的数据投毒攻击,即使知道毒药如何被放入良性数据集也无法过滤,该攻击通过修改阈下学习以适应现实场景,并在多种数据级防御下存活。

详情
AI中文摘要

我们提出了一种数据投毒攻击——幻影转移——其特性是,即使你确切知道毒药是如何被放入原本良性的数据集中,你也无法将其过滤掉。我们通过修改阈下学习以在现实世界中工作来实现这一点,并证明无论数据由哪个模型生成、训练数据的是哪个模型或攻击目标是什么,该攻击都有效。此外,该攻击在11种测试的数据级防御下存活,包括一种将每个样本由另一个模型改写的防御。我们描述了这种攻击何时效果最佳,并展示了它可以用于将密码触发的行为植入模型,同时仍然击败防御。简而言之,我们提供了一个存在性证明,即最大能力防御可能无法阻止复杂的数据投毒攻击。我们建议未来的防御应辅以白盒方法和训练后模型审计。

英文摘要

We present a data poisoning attack -- Phantom Transfer -- with the property that, even if you know precisely how the poison was placed into an otherwise benign dataset, you cannot filter it out. We achieve this by modifying subliminal learning to work in real-world contexts and demonstrate that the attack works regardless of which model produced the data, which model is trained on the data or what the attack target is. Furthermore, the attack survives 11 tested data-level defences, including one where every sample is paraphrased by another model. We characterise when this attack works best and show that it can be used to plant password-triggered behaviours into models while still beating defences. In short, we provide an existence proof that maximum-affordance defences can fail to stop sophisticated data poisoning attacks. We suggest that future defences should be supplemented with white-box methods and post-training model audits.

2507.10419 2026-06-03 cs.LG cs.AI cs.CL stat.ML 版本更新

Multiple Choice Learning of Low-Rank Adapters for Language Modeling

低秩适配器的多选学习用于语言建模

Victor Letzelter, Hugo Malard, Mathieu Fontaine, Gaël Richard, Slim Essid, Andrei Bursuc, Patrick Pérez

发表机构 * Institut National de la Recherche Scientifique (INRS)(国家科学研究院)

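AI summary: Proposes LoRA-MCL, a training scheme that extends next-token prediction in language models with Multiple Choice Learning and Low-Rank Adaptation so that diverse, plausible sentence continuations can be decoded at inference time.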

Comments ICML 2026

详情
AI中文摘要

我们提出LoRA-MCL,一种训练方案,通过一种旨在推理时解码多样、合理的句子延续的方法,扩展语言模型中的下一词预测。传统语言建模是一个本质上不适定的问题:给定一个上下文,多个未来可能同样合理。我们的方法利用多选学习(MCL)和胜者全得损失,通过低秩适配有效处理歧义。我们提供了将MCL应用于语言建模的理论解释,假设数据来自混合分布。我们使用马尔可夫链混合来说明所提出的方法。然后,我们通过音频和视觉字幕以及机器翻译的实验证明,我们的方法在生成输出中实现了高多样性和相关性。我们发布了将LoRA-MCL应用于广泛语言模型的代码。

英文摘要

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the winner-takes-all loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on audio and visual captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs. We release the code for applying LoRA-MCL to a wide range of language models.
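The winner-takes-all (WTA) loss named in the abstract can be illustrated with a small sketch: several hypotheses score the same target sequence, and only the hypothesis with the lowest loss is penalised. This is a generic WTA formulation under assumed inputs, not the authors' implementation; the function names, the plain-list data layout, and the toy probabilities are all hypothetical.

```python
import math

def sequence_nll(log_probs, target):
    """Negative log-likelihood of `target` given per-step log-probabilities.

    log_probs: list of T lists, each holding log-probabilities over the vocabulary.
    target:    list of T integer token ids.
    """
    return -sum(step[tok] for step, tok in zip(log_probs, target))

def winner_takes_all_loss(hypothesis_log_probs, target):
    """Return (loss, winner): only the best-scoring hypothesis contributes."""
    losses = [sequence_nll(lp, target) for lp in hypothesis_log_probs]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return losses[winner], winner

# Toy setup (assumed numbers): two hypotheses, two steps, vocabulary of size 2.
probs = [
    [[0.9, 0.1], [0.9, 0.1]],  # hypothesis 0 prefers token 0
    [[0.2, 0.8], [0.3, 0.7]],  # hypothesis 1 prefers token 1
]
log_probs = [[[math.log(p) for p in step] for step in hyp] for hyp in probs]
target = [1, 1]
loss, winner = winner_takes_all_loss(log_probs, target)
```

In a training loop, gradients would flow only through the winning hypothesis, which is what lets different hypotheses specialise on different plausible continuations.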

2602.01483 2026-06-03 cs.LG cs.AI stat.ME 版本更新

Causal Preference Elicitation

因果偏好启发

Edwin V. Bonilla, He Zhao, Daniel M. Steinberg

发表机构 * University of California, Berkeley(加州大学伯克利分校)

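AI summary: Proposes a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate the posterior over directed acyclic graphs.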

详情
AI中文摘要

我们提出因果偏好启发,一种用于专家参与因果发现的贝叶斯框架,该框架主动查询局部边关系以集中有向无环图(DAG)的后验分布。从任何黑箱观测后验出发,我们使用一个三向似然模型对专家的噪声判断进行建模,该似然涵盖边的存在性和方向。后验推断采用灵活的粒子近似,并通过专家分类响应的期望信息增益准则高效选择查询。在合成图、蛋白质信号数据以及人类基因扰动基准上的实验表明,在严格的查询预算下,后验集中速度更快,且对有向效应的恢复能力得到提升。

英文摘要

We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any black-box observational posterior, we model noisy expert judgments with a three-way likelihood over edge existence and direction. Posterior inference uses a flexible particle approximation, and queries are selected by an efficient expected information gain criterion on the expert's categorical response. Experiments on synthetic graphs, protein signaling data, and a human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under tight query budgets.
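To make the query-selection idea concrete, the sketch below computes the expected information gain of asking an expert about one node pair, given a weighted particle approximation of the DAG posterior. It is only an illustration under stated assumptions: the three-way noise model (correct answer with probability `1 - noise`, the remainder split evenly), the edge-set representation of graphs, and all names are hypothetical and are not taken from the paper.

```python
import math

RESPONSES = ("i->j", "j->i", "none")

def edge_state(graph, i, j):
    """Ground-truth answer for the pair (i, j) in one candidate DAG (a set of edges)."""
    if (i, j) in graph:
        return "i->j"
    if (j, i) in graph:
        return "j->i"
    return "none"

def likelihood(response, graph, i, j, noise):
    """Assumed noise model: correct answer w.p. 1 - noise, otherwise split evenly."""
    return 1.0 - noise if response == edge_state(graph, i, j) else noise / 2.0

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expected_information_gain(particles, weights, i, j, noise=0.1):
    """Mutual information between the expert's response and the graph particle."""
    pairs = list(zip(particles, weights))
    marginal = [sum(w * likelihood(r, g, i, j, noise) for g, w in pairs)
                for r in RESPONSES]
    conditional = sum(w * entropy([likelihood(r, g, i, j, noise) for r in RESPONSES])
                      for g, w in pairs)
    return entropy(marginal) - conditional

# Three equally weighted particles that disagree about (0, 1) but agree on (1, 2).
particles = [{(0, 1), (1, 2)}, {(1, 0), (1, 2)}, {(1, 2)}]
weights = [1 / 3, 1 / 3, 1 / 3]
eig_disputed = expected_information_gain(particles, weights, 0, 1)
eig_settled = expected_information_gain(particles, weights, 1, 2)
```

The pair on which the particles disagree has the larger gain, so a greedy policy would query it first; a pair on which all particles agree has zero gain under this model.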

2510.16392 2026-06-03 cs.AI 版本更新

RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

RGMem:基于重正化群启发的语言智能体记忆演化

Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, Yanfang Liu

发表机构 * School of Computer Science Engineering, Beihang University, Beijing, China; School of Reliability Systems Engineering, Beihang University, Beijing, China; State Key Laboratory of Complex & Critical Software Environment; National Key Laboratory of Reliability; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

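AI summary: Proposes RGMem, a framework that applies renormalization group ideas to long-term conversational memory through multi-scale coarse-graining, thresholded updates, and rescaling, integrating facts hierarchically into user preferences; it outperforms existing memory systems on the LOCOMO and PersonaMem benchmarks.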

Comments Accepted to ICML 2026

详情
AI中文摘要

个性化和持续交互对于基于LLM的对话智能体至关重要,但有限的上下文窗口和静态参数记忆阻碍了对长期、跨会话用户状态的建模。现有方法(包括检索增强生成和显式记忆系统)主要在事实层面操作,难以从演化且可能冲突的对话中提炼稳定的偏好和深层用户特征。为应对这一挑战,我们提出RGMem,一种受重正化群(RG)多尺度组织和涌现观点启发的自演化记忆框架。RGMem将长期对话记忆建模为多尺度演化过程:情节交互被转化为语义事实和用户洞察,然后通过层次化粗粒化、阈值更新和重缩放逐步整合为动态演化的用户画像。通过明确分离快速变化的证据和慢变特征,并启用非线性、相变般的动力学,RGMem实现了超越平面检索或静态摘要的稳健个性化。在LOCOMO和PersonaMem基准上的大量实验表明,RGMem持续优于最先进的记忆系统,实现了更强的跨会话连续性并更好地适应演化的用户偏好。代码可在https://github.com/fenhg297/RGMem获取。

英文摘要

Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric memory hinder the modeling of long-term, cross-session user states. Existing approaches, including retrieval-augmented generation and explicit memory systems, primarily operate at the fact level, making it difficult to distill stable preferences and deep user traits from evolving and potentially conflicting dialogues. To address this challenge, we propose RGMem, a self-evolving memory framework inspired by the renormalization group (RG) perspective on multi-scale organization and emergence. RGMem models long-term conversational memory as a multi-scale evolutionary process: episodic interactions are transformed into semantic facts and user insights, which are then progressively integrated through hierarchical coarse-graining, thresholded updates, and rescaling into a dynamically evolving user profile. By explicitly separating fast-changing evidence from slow-varying traits and enabling non-linear, phase-transition-like dynamics, RGMem enables robust personalization beyond flat retrieval or static summarization. Extensive experiments on the LOCOMO and PersonaMem benchmarks demonstrate that RGMem consistently outperforms SOTA memory systems, achieving stronger cross-session continuity and improved adaptation to evolving user preferences. Code is available at https://github.com/fenhg297/RGMem

2510.02763 2026-06-03 cs.LG cs.AI 版本更新

Fusing Multi- and Hyperspectral Satellite Data for Harmful Algal Bloom Monitoring with Self-Supervised and Hierarchical Deep Learning

融合多光谱和高光谱卫星数据用于有害藻华监测的自监督与分层深度学习

Nicholas LaHaye, Kelly M. Luis, Michelle M. Gierach

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

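AI summary: Proposes SIT-FUSE, a self-supervised machine learning framework that fuses multi-sensor satellite reflectance with TROPOMI solar-induced fluorescence and uses hierarchical deep clustering to produce harmful algal bloom severity and speciation products; agreement with in-situ data is validated in the Gulf of Mexico and Southern California.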

详情
AI中文摘要

我们提出了一种自监督机器学习框架,用于利用多传感器卫星数据检测和绘制有害藻华(HABs)的严重程度和物种分类。通过融合来自运行极轨卫星仪器(VIIRS、MODIS、OLCI和OCI)的反射率数据与TROPOMI太阳诱导荧光(SIF),我们的框架SIT-FUSE无需每个仪器的标记数据集即可生成HAB严重程度和物种分类产品。该框架采用自监督表示学习和分层深度聚类,将浮游植物细胞丰度和物种分割成可解释的类别,并利用墨西哥湾和南加州(2018-2025年)的原位数据进行了验证。结果显示与总浮游植物、短凯伦藻和拟菱形藻属测量值高度一致。这项工作推进了在地面观测有限的环境中进行可扩展的HAB监测,同时通过分层嵌入实现探索性分析——这是将自监督学习应用于全球水生生物地球化学操作化的关键一步。

英文摘要

We present a self-supervised machine learning framework for detecting and mapping the severity and speciation of harmful algal blooms (HABs) using multi-sensor satellite data. By fusing reflectance data from operational polar-orbiting satellite-based instruments (VIIRS, MODIS, OLCI, and OCI) with TROPOMI solar-induced fluorescence (SIF), our framework, called SIT-FUSE, generates HAB severity and speciation products without requiring per-instrument labeled datasets. The framework employs self-supervised representation learning and hierarchical deep clustering to segment phytoplankton cell abundance and species into interpretable classes, validated against in-situ data from the Gulf of Mexico and Southern California (2018-2025). Results show strong agreement with total phytoplankton, Karenia brevis, and Pseudo-nitzschia spp. measurements. This work advances scalable HAB monitoring in environments where ground truth observations are limited, while enabling exploratory analysis via hierarchical embeddings, a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.

2601.23229 2026-06-03 cs.AI cs.CC 版本更新

Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

$L_\infty$ 鲁棒 MDP 的策略迭代的强多项式时间复杂度

Ali Asadi, Krishnendu Chatterjee, Ehsan Goharshady, Mehrdad Karrabi, Alipasha Montaseri, Carlo Pagano

发表机构 * Institute for Computer Science, Austrian Academy of Sciences(奥地利科学院计算机科学研究所) Concordia University(康科迪亚大学)

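AI summary: For discounted $(s,a)$-rectangular $L_\infty$ robust MDPs, proves that the policy iteration algorithm runs in strongly polynomial time for a fixed discount factor.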

Comments To Appear in The 39th Annual Conference on Learning Theory (COLT'26)

详情
AI中文摘要

马尔可夫决策过程(MDP)是序列决策中的基本模型。鲁棒 MDP(RMDP)通过允许转移概率存在不确定性并针对最坏情况不确定性进行优化来扩展此框架。特别地,具有 $L_\infty$ 不确定性集的 $(s,a)$-矩形 RMDP 构成一个基础且富有表现力的模型:它们包含经典 MDP 和回合制随机博弈。我们考虑具有折扣收益的此模型。多项式时间和强多项式时间算法的存在性是这些优化模型的基本问题。对于 MDP,线性规划为任意折扣因子提供了多项式时间算法,而 Ye 的开创性工作为固定折扣因子建立了强多项式时间。将这些结果推广到 RMDP 仍然是一个重要的开放问题。在这项工作中,我们证明了鲁棒策略迭代算法在常数(固定)折扣因子下对于 $(s,a)$-矩形 $L_\infty$ RMDP 以强多项式时间运行,解决了一个重要的算法问题。

英文摘要

Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form a fundamental and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial and strongly-polynomial time algorithms is a fundamental problem for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for an arbitrary discount factor, and the seminal work of Ye established strongly-polynomial time for a fixed discount factor. The generalization of such results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly-polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a constant (fixed) discount factor, resolving an important algorithmic question.

2601.20844 2026-06-03 cs.LG cs.AI cs.IR 版本更新

$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

$\mathbb{R}^{2k}$ 理论上足够大,用于基于嵌入的 Top-$k$ 检索

Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

发表机构 * University of California, Berkeley(加州大学伯克利分校)

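AI summary: Studies the Minimal Embeddable Dimension (MED) and proves that it is Θ(k), independent of m, for inner product, Euclidean distance, and cosine similarity; it then considers Robust MED (RMED), derives the feasibility ceiling ε_⋆(m,k), and validates the theoretical results experimentally.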

Comments v2: fix broken citation. v3: ICML 2026

详情
AI中文摘要

本文研究最小可嵌入维度(MED):即存在 m 个对象向量配置的最小维度,使得每个大小至多为 k 的子集都能通过分数比较被精确检索。我们的结果表明,对于内积、欧氏距离和余弦相似度,MED 为 Θ(k),与 m 无关。然后我们考虑鲁棒 MED(RMED),其中所有向量为单位范数,并且需要 ε 的分数间隙。我们推导出依赖于 m 的可行性上限 ε_⋆(m,k)=m/√(k(m-1)(m-k)),当 m≫k 时趋近于 1/√k,并且高斯质心构造在可行边界区域内给出了鲁棒见证的上界。在合成 top-2 检索上的数值模拟,使用循环多面体和质心查询优化,证实了我们的理论主张。在 LIMIT 和 LIMIT-small 数据集上的实验也表明,简单的基于嵌入的检索基线可能过拟合,并优于报告的单向量 LLM 嵌入基线。理论和实证结果都排除了精确几何容量不足作为障碍的可能性。

英文摘要

This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of $m$ object vectors so that every subset of size at most $k$ is exactly retrieved by score comparison. Our result shows MED is $\Theta(k)$, independent of $m$, for inner product, Euclidean distance, and cosine similarity. We then consider Robust MED (RMED), where all vectors are unit normed and an $\varepsilon$ gap of scores is required. We derive the $m$-dependent feasibility ceiling $\varepsilon_\star(m,k)=m/\sqrt{k(m-1)(m-k)}$, which approaches $1/\sqrt{k}$ when $m\gg k$, and a Gaussian centroid construction gives a robust witness upper bound in the feasible margin regime. Numerical simulation on synthetic top-$2$ retrieval with cyclic polytope and centroid query optimization confirmed our theoretical claims. Experiments on LIMIT and LIMIT-small datasets also show that simple embedding-based retrieval baselines can overfit and outperform the reported single-vector LLM embedding baseline. Both theoretical and empirical findings rule out the lack of exact geometric capacity as the obstruction.
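The feasibility ceiling stated in the abstract is easy to evaluate numerically. The short sketch below computes m/√(k(m-1)(m-k)) and compares it with the limiting value 1/√k as m grows; the function name and the chosen values of m and k are illustrative assumptions, and only the formula itself comes from the abstract.

```python
import math

def feasibility_ceiling(m, k):
    """Evaluate m / sqrt(k * (m - 1) * (m - k)) for 1 <= k < m."""
    if not (1 <= k < m):
        raise ValueError("requires 1 <= k < m")
    return m / math.sqrt(k * (m - 1) * (m - k))

# Compare the ceiling with its large-m limit 1/sqrt(k) for k = 2.
limit = 1 / math.sqrt(2)
for m in (10, 100, 10_000, 1_000_000):
    print(m, feasibility_ceiling(m, 2), limit)
```

For fixed k the ceiling decreases toward 1/√k as m increases, which matches the abstract's remark about the regime where m is much larger than k.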

2601.12247 2026-06-03 cs.CL cs.AI cs.LG 版本更新

Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

规划、验证与填充:扩散语言模型的结构化并行解码方法

Miao Li, Hanyang Jiang, Sikai Cheng, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Berkeley(加州大学伯克利分校) University of Michigan(密歇根大学)

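AI summary: Proposes Plan-Verify-Fill (PVF), which plans a hierarchical skeleton grounded in quantitative validation and uses a verification protocol for structured stopping, reducing the number of function evaluations by up to 65% while preserving accuracy.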

详情
AI中文摘要

扩散语言模型(DLM)为文本生成提供了一种有前景的非顺序范式,不同于标准的自回归(AR)方法。然而,当前的解码策略通常采取被动姿态,未能充分利用全局双向上下文来指导全局轨迹。为了解决这个问题,我们提出了Plan-Verify-Fill(PVF),一种无需训练的范式,通过定量验证来锚定规划。PVF通过优先考虑高杠杆语义锚点主动构建分层骨架,并采用验证协议来实现实用的结构化停止,在进一步思考收益递减时停止。在LLaDA-8B-Instruct和Dream-7B-Instruct上的广泛评估表明,与基于置信度的并行解码相比,PVF在基准数据集上将函数评估次数(NFE)减少了高达65%,在不牺牲准确性的情况下实现了卓越的效率。

英文摘要

Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

2509.01641 2026-06-03 eess.SP cs.AI cs.LG 版本更新

Non-Identical Diffusion Models in MIMO-OFDM Channel Generation

MIMO-OFDM信道生成中的非相同扩散模型

Yuzhi Yang, Omar Alhussein, Mérouane Debbah

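AI summary: Proposes the non-identical diffusion model, which uses an element-wise time indicator to capture local error variations and so addresses uneven element reliability in MIMO-OFDM channel estimation; correctness is shown theoretically and effectiveness in numerical experiments.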

Comments resubmitted to IEEE TCOM

详情
AI中文摘要

我们提出了一种新颖的扩散模型,称为非相同扩散模型,并研究了其在无线正交频分复用(OFDM)信道生成中的应用。与使用标量时间索引表示全局噪声水平的标准扩散模型不同,我们将这一概念扩展为元素级时间指示器,以更准确地捕获局部误差变化。非相同扩散使我们能够表征噪声输入中每个元素(例如OFDM中的子载波)的可靠性,从而在初始化有偏时改善生成结果。具体来说,我们专注于无线多输入多输出(MIMO)OFDM信道矩阵的恢复,其中由于导频方案,初始信道估计在元素间表现出高度不均匀的可靠性。传统的时间嵌入假设噪声进展均匀,无法捕获这种跨导频方案和噪声水平的变化。我们引入一个与输入大小匹配的矩阵来控制元素级噪声进展。遵循与现有方法类似的扩散过程,我们从理论和数值上证明了所提出的非相同扩散方案的正确性和有效性。对于MIMO-OFDM信道生成,我们提出了一种维度级时间嵌入策略。我们还开发并评估了多种训练和生成方法,并通过数值实验进行了比较。

英文摘要

We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.

2501.17377 2026-06-03 cs.LG cs.AI 版本更新

ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

ASAP:利用神经组合优化中的满意泛化优势

Han Fang, Paul Weng, Yutong Ban

发表机构 * GitHub

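AI summary: To address the brittleness of neural combinatorial optimization models under distribution shift, proposes the ASAP framework, which splits decision making into proposal and selection phases and uses MAML to strengthen online adaptation, improving generalization on 3D-BPP, TSP, and CVRP.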

Comments Accepted as poster of ICML-2026

详情
AI中文摘要

深度强化学习(DRL)已成为解决组合优化(CO)问题(如3D装箱问题(3D-BPP)、旅行商问题(TSP)或车辆路径问题(VRP))的一种有前景的方法,但这些神经求解器在面对分布偏移时往往表现出脆弱性。为了解决这个问题,我们揭示了满意泛化优势,并在理论和实验上进行了验证:识别一组有希望的行动本质上比选择单一最优行动更具泛化性。为了利用这一特性,我们提出了自适应选择后提案(ASAP),这是一个通用框架,将决策过程分解为两个不同的阶段:作为鲁棒过滤器的提案策略和作为可适应决策者的选择策略。这种架构使得一种高效的在线适应策略成为可能,其中选择策略可以在新分布上快速微调。具体地,我们引入了一个由模型无关元学习(MAML)增强的两阶段训练框架,以使模型能够快速适应。在3D-BPP、TSP和CVRP上的大量实验表明,ASAP提高了最先进基线的泛化能力,并在分布外实例上实现了优越的在线适应。

英文摘要

Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.

2601.11667 2026-06-03 cs.LG cs.AI 版本更新

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

Distill-then-Replace: 高效的任务特定混合注意力模型构建

Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi

发表机构 * Fujitsu Research & Development Center CO., LTD(富士通研发中心有限公司) Fujitsu Research, FUJITSU LTD(富士通研究所,富士通有限公司)

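AI summary: Proposes Distill-then-Replace (DtR), which uses blockwise local distillation and a greedy layer replacement strategy to convert a pretrained full-attention model into a task-specific hybrid attention model efficiently, without retraining or neural architecture search.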

详情
AI中文摘要

Transformer架构通过密集的全注意力机制实现了最先进的准确性,但其相对于序列长度的二次时间和内存复杂度限制了实际部署。线性注意力机制提供线性或接近线性的缩放,但通常会导致性能下降。集成全注意力和线性注意力层的混合模型有望在效率和表达能力之间取得平衡,但面临两个主要挑战:从头训练此类混合模型计算成本高,且手动设计注意力类型的最佳放置位置非常困难。我们提出DtR(Distill-then-Replace),首先通过逐块局部蒸馏将预训练的全注意力模块的权重转移到其线性注意力对应模块,然后应用贪婪层替换策略,迭代地用线性注意力块替换全注意力块,同时监控目标任务的验证性能。DtR在单次高效过程中生成任务特定的混合模型,无需昂贵的重新训练或神经架构搜索,并可应用于任何预训练的全注意力骨干网络以处理各种下游任务。

英文摘要

Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We propose DtR (Distill-then-Replace), which first transfers weights from the pretrained full-attention modules to their linear attention counterparts through blockwise local distillation, and then applies a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. DtR yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
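The greedy replacement step can be sketched as a simple loop. The abstract does not state the exact selection or stopping rule, so the version below makes two assumptions: at each round it swaps the layer whose replacement hurts validation score least, and it stops once the total drop from the full-attention baseline would exceed a tolerance `max_drop`. The `evaluate` callback, the per-layer costs, and all names are hypothetical stand-ins for a real validation run.

```python
def greedy_replace(num_layers, evaluate, max_drop):
    """Greedily swap full-attention layers for linear-attention ones.

    evaluate(linear_layers) returns a validation score for a model whose layers
    listed in `linear_layers` (a frozenset of indices) use linear attention.
    """
    linear = frozenset()
    baseline = evaluate(linear)
    while len(linear) < num_layers:
        candidates = [i for i in range(num_layers) if i not in linear]
        # Try each remaining layer and keep the swap that hurts least.
        best = max(candidates, key=lambda i: evaluate(linear | {i}))
        if baseline - evaluate(linear | {best}) > max_drop:
            break  # any further replacement would exceed the tolerance
        linear = linear | {best}
    return sorted(linear)

# Toy stand-in for validation: each replaced layer subtracts a fixed cost.
layer_cost = [0.05, 0.01, 0.20, 0.02]

def evaluate(linear_layers):
    return 1.0 - sum(layer_cost[i] for i in linear_layers)

chosen = greedy_replace(num_layers=4, evaluate=evaluate, max_drop=0.05)
```

With these toy costs the loop replaces layers 1 and 3 and then stops, because adding layer 0 would push the cumulative drop to 0.08. In practice each `evaluate` call is a validation pass, so caching scores would matter.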

2601.11429 2026-06-03 cs.CL cs.AI 版本更新

Relational Linearity is a Predictor of Hallucinations

关系线性是幻觉的预测因子

Yuetian Lu, Yihong Liu, Sebastian Gerstner, Lea Hirlimann, Jonas Rohweder, Hinrich Schütze

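AI summary: Using a synthetic unknown-entity benchmark, finds that language models hallucinate more readily when answering questions about linear relations, and that relational linearity correlates strongly with hallucination rate.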

Comments 15 pages, 6 figures, 14 tables

详情
AI中文摘要

幻觉是语言模型(LMs)的一个核心失败模式。我们关注对诸如“格伦·古尔德演奏哪种乐器?”这类问题的幻觉,但针对设计为模型未知的合成实体提问。我们发现,像Gemma-7B-IT这样的LM经常产生幻觉,即它们难以识别幻觉事实不属于其知识。基于线性关系嵌入的思想,我们提出以下假设:(i)由于用于表示它们的抽象方案,LM可以轻松地为线性关系的非存在主体生成合理的对象,这可能导致幻觉。(ii)对于非线性关系,这种生成对象的机制不可用,因此更容易避免幻觉。为了验证这一假设,我们创建了SyntHal,一个针对15种关系的合成未知实体基准。我们发现,在四个指令调优模型中,关系线性度是模型为未知主体生成对象(而非拒绝回答)的强预测因子,相关系数$r \in [.58, .84]$。

英文摘要

Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities designed to be unknown to the model. We find that LMs like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. Based on the idea of linear relational embeddings, we put forward the following hypothesis. (i) Due to the abstract scheme that is used to represent them, LMs can easily produce plausible objects for non-existing subjects of linear relations, which can lead to hallucinations. (ii) For a nonlinear relation, this mechanism for producing an object is not available and so a hallucination is easier to avoid. To test this hypothesis, we create SyntHal, a synthetic unknown-entity benchmark for 15 relations. We find that across four instruction-tuned models, relational linearity is a strong predictor of models hallucinating an object for an unknown subject vs refusing to give an answer, with correlations $r \in [.58, .84]$.

2601.10222 2026-06-03 math.NA cs.AI cs.NA math.OC 版本更新

Introduction to optimization methods for training SciML models

训练科学机器学习模型的优化方法导论

Alena Kopaničáková, Elisa Riccietti

发表机构 * Toulouse-INP, IRIT-APO, ANITI(图卢兹INP、IRIT-APO、ANITI) ENS de Lyon, CNRS, Inria, Université Claude Bernard Lyon 1, LIP, UMR 5668(里昂高等师范学院、国家科学研究中心、法国国家信息与自动化研究所、克洛德·贝尔纳里昂第一大学、LIP、UMR 5668)

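AI summary: Gives a unified introduction to optimization methods in machine learning and scientific machine learning, emphasizing how problem structure shapes algorithmic choices, and discusses practical strategies for physics-constrained and data-driven SciML models.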

详情
AI中文摘要

优化是现代机器学习(ML)和科学机器学习(SciML)的核心,但底层优化问题的结构在这些领域之间存在显著差异。经典ML通常依赖于随机、样本可分离的目标,这有利于一阶和自适应梯度方法。相比之下,SciML通常涉及物理信息或算子约束的公式,其中微分算子导致损失景观中的全局耦合、刚性和强各向异性。因此,SciML中的优化行为由底层物理模型的谱特性而非数据统计决定,这常常限制了标准随机方法的有效性,并促使采用确定性或曲率感知的方法。本文提供了ML和SciML中优化方法的统一介绍,强调问题结构如何塑造算法选择。我们回顾了确定性和随机设置中的一阶和二阶优化技术,讨论了它们对物理约束和数据驱动SciML模型的适应,并通过教程示例说明了实用策略,同时突出了科学计算和科学机器学习交叉领域的开放研究方向。

英文摘要

Optimization is central to both modern machine learning (ML) and scientific machine learning (SciML), yet the structure of the underlying optimization problems differs substantially across these domains. Classical ML typically relies on stochastic, sample-separable objectives that favor first-order and adaptive gradient methods. In contrast, SciML often involves physics-informed or operator-constrained formulations in which differential operators induce global coupling, stiffness, and strong anisotropy in the loss landscape. As a result, optimization behavior in SciML is governed by the spectral properties of the underlying physical models rather than by data statistics, frequently limiting the effectiveness of standard stochastic methods and motivating deterministic or curvature-aware approaches. This document provides a unified introduction to optimization methods in ML and SciML, emphasizing how problem structure shapes algorithmic choices. We review first- and second-order optimization techniques in both deterministic and stochastic settings, discuss their adaptation to physics-constrained and data-driven SciML models, and illustrate practical strategies through tutorial examples, while highlighting open research directions at the interface of scientific computing and scientific machine learning.

2601.09869 2026-06-03 cs.AI cs.HC 版本更新

A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

拟人化大型语言模型对话代理的伦理视角:一项范围综述

Andrea Ferrario, Rasita Vinay, Matteo Casserini, Alessandro Facchini

发表机构 * Institute of Biomedical Ethics and History of Medicine, University of Zürich(苏黎世大学生物医学伦理与医学史研究所) Dalle Molle Institute for Artificial Intelligence (IDSIA), SUPSI(瑞士SUPSI人工智能研究所) ETH Zürich(苏黎世联邦理工学院) Institute for Implementation Science in Health Care, University of Zürich(苏黎世大学医疗实施科学研究所) Department of Management, Technology and Economics, ETH Zürich(苏黎世联邦理工学院管理、技术与经济系) Dipartimento Tecnologie Innovative, SUPSI(SUPSI创新技术系) Management in Networked and Digital Societies (MINDS) Department, Kozminski University(科兹明斯基大学网络化与数字化社会管理系)

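AI summary: A scoping review that maps the ethical challenges and opportunities of anthropomorphising LLM-based conversational agents, covering conceptual foundations, ethical issues, and methodological approaches, and proposes a research agenda together with design and governance recommendations.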

Comments 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT'26)

详情
AI中文摘要

拟人化——将人类特质赋予非人类实体的现象——随着基于大型语言模型(LLM)的对话代理(CAs)的兴起而日益显著。与早期的聊天机器人不同,基于LLM的CA通常会生成互动和语言线索,例如第一人称自我指涉、认知和情感表达,实证研究表明这些可以增加参与度。另一方面,拟人化引发了伦理担忧,包括欺骗、过度依赖和剥削性关系框架,而一些作者认为拟人化互动可能支持自主性、福祉和包容性。尽管对该现象的兴趣日益增加,文献仍跨领域分散,并且在如何定义、操作化和规范性评估拟人化方面存在显著差异。本范围综述绘制了关于拟人化基于LLM的CA的伦理导向工作,覆盖五个数据库和三个预印本存储库。我们综合了(1)概念基础,(2)伦理挑战与机遇,以及(3)方法论方法。我们发现基于归因的定义趋于一致,但操作化存在显著差异,主要是风险导向的规范性框架,以及将观察到的互动效应与可操作的治理指导联系起来的实证工作有限。我们最后提出了研究议程和设计/治理建议,用于在基于LLM的对话代理中伦理地部署拟人化线索。

英文摘要

Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference and epistemic and affective expressions, which empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, the literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.

2601.08173 2026-06-03 cs.AI 版本更新

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Agent 的第一天:在工作场景中基准测试学习、探索和调度

Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi

发表机构 * Fudan University(复旦大学) Shanghai AI Laboratory(上海人工智能实验室) Zhejiang University(浙江大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学)

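AI summary: Targeting three challenges that multimodal large language models face in dynamic workplace scenarios (task scheduling, active exploration, and continual learning), introduces the dynamic evaluation environment EvoEnv; experiments show that existing agents have significant deficiencies in these areas.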

详情
AI中文摘要

多模态大语言模型(MLLMs)的快速发展推动了工作流自动化;然而,现有研究主要针对静态环境中的性能上限,忽视了随机真实世界部署的鲁棒性。我们识别出三个关键挑战:动态任务调度、不确定性下的主动探索以及从经验中持续学习。为弥补这一差距,我们引入了 EvoEnv,一个动态评估环境,模拟“实习生”agent 持续探索新环境。与传统基准不同,EvoEnv 从三个维度评估 agent:(1)针对具有不同优先级的流式任务的上下文感知调度;(2)通过主动探索谨慎获取信息以减少幻觉;(3)通过从基于规则的动态生成任务中提炼通用策略实现持续进化。实验表明,最先进的 agent 在动态环境中存在显著缺陷,尤其是在主动探索和持续学习方面。我们的工作建立了一个评估 agent 可靠性的框架,将评估从静态测试转向现实的、面向生产的场景。我们的代码可在 https://github.com/KnowledgeXLab/EvoEnv 获取。

英文摘要

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv

2512.23234 2026-06-03 cs.CV cs.AI 版本更新

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

边缘感知与内容自适应的工业安全监控红外气体泄漏检测

Dongsheng Li, Tianli Ma, Siling Wang, Beibei Duan, Song Gao

发表机构 * School of Mechatronic Engineering, Xi’an Technological University(机械电子工程学院,西安工业大学) School of Electronic Information Engineering, Xi’an Technological University(电子信息工程学院,西安工业大学) Shaanxi Shanhua Coal Chemical Co., Ltd.(陕西陕化煤化工有限公司)

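AI summary: To detect infrared gas plumes that are faint, semi-transparent, and weakly bounded, proposes the Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det), which combines plume-oriented local-global feature enhancement, a multi-scale edge perception module, and a content-adaptive sparse routing path aggregation network; it significantly improves detection accuracy on the IIG and LangGas datasets.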

详情
AI中文摘要

红外气体泄漏检测对于工业安全和环境监测至关重要,但由于气体羽流通常微弱、细小、半透明且边界模糊,自动检测仍然具有挑战性。本文提出了一种边缘感知与内容自适应特征融合检测器(ECAF-Det),用于杂乱热场景中的弱羽流检测。ECAF-Det集成了三个面向任务的设计:羽流导向的局部-全局特征增强块,用于保留精细边界线索并捕获长程上下文连续性;多尺度边缘感知模块,将方向梯度和相位一致性线索转化为分层边缘先验,用于边界敏感的羽流表示;以及内容自适应稀疏路由路径聚合网络,动态调节多尺度特征传播,以强调信息丰富的羽流特征并抑制冗余背景响应。在IIG数据集上的实验表明,ECAF-Det实现了29.8%的AP、84.3%的AP50和25.3%的小目标AP,分别比RT-DETR-R18基线提高了3.0、6.5和5.4个百分点,计算量为43.7 GFLOPs,参数量为14.9 M。在LangGas数据集上,ECAF-Det实现了36.3%的AP和68.5%的AP50,展示了其对不同红外气体羽流外观的泛化能力。主要的人工智能贡献在于边缘感知表示学习与内容自适应稀疏特征路由,用于弱红外羽流感知。所提出的检测器可作为工业气体泄漏监测中早期预警和远程巡检的视觉感知组件。

英文摘要

Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging because gas plumes are often faint, small, semi-transparent, and weakly bounded. This paper proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) for weak-plume detection in cluttered thermal scenes. ECAF-Det integrates three task-oriented designs: a plume-oriented local-global feature enhancement block to preserve fine boundary cues and capture long-range contextual continuity; a multi-scale edge perception module that transforms directional gradient and phase-consistency cues into hierarchical edge priors for boundary-sensitive plume representation; and a content-adaptive sparse routing path aggregation network that dynamically regulates multi-scale feature propagation to emphasize informative plume features and suppress redundant background responses. Experiments on the IIG dataset show that ECAF-Det achieves 29.8% AP, 84.3% AP50, and 25.3% small-object AP, improving the RT-DETR-R18 baseline by 3.0, 6.5, and 5.4 percentage points, respectively, with 43.7 GFLOPs and 14.9 M parameters. On the LangGas dataset, ECAF-Det achieves 36.3% AP and 68.5% AP50, demonstrating its generalization to different infrared gas plume appearances. The main AI contribution is edge-aware representation learning with content-adaptive sparse feature routing for weak infrared plume perception. The proposed detector can serve as a visual perception component for early warning and remote inspection in industrial gas leak monitoring.

2504.04942 2026-06-03 cs.AI cs.LO 版本更新

Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Lemmanaid: 神经符号引理猜想

Yousef Alhessi, Sólrún Halla Einarsdóttir, George Granberry, Emily First, Moa Johansson, Sorin Lerner, Nicholas Smallbone

发表机构 * Department of Computer Science and Engineering University of California, San Diego, USA(计算机科学与工程系,加州大学圣地亚哥分校) Department of Computer Science and Engineering Chalmers University of Technology & University of Gothenburg(计算机科学与工程系,查尔姆斯理工大学及哥德堡大学)

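AI summary: Presents LEMMANAID, the first neuro-symbolic lemma conjecturing tool, which generates lemmas by drawing analogies between mathematical theories; combining a fine-tuned LLM with symbolic methods, it outperforms purely neural and purely symbolic approaches on Isabelle test sets.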

详情
AI中文摘要

数学家和计算机科学家越来越多地利用证明助手来形式化和检查复杂证明,这需要大量的专业知识。我们能否通过自动化猜想有用、有趣且新颖的引理来降低门槛?我们提出了首个神经符号引理猜想工具LEMMANAID,旨在通过类比数学理论来发现猜想。LEMMANAID使用微调后的LLM生成描述引理形状的引理模板,并使用符号方法填充细节。我们将LEMMANAID与直接微调生成引理的相同LLM以及完全符号的猜想方法进行了比较。在来自Isabelle的HOL库和形式化证明档案(AFP)的测试集上,LEMMANAID始终优于神经和符号方法。使用DeepSeek-coder-6.7B作为后端,LEMMANAID发现了50%(HOL)和29%(AFP)的金标准引理,当集成提示策略时,这一比例提高到55%和35%。在关于八元数的案例研究中,LEMMANAID发现了79%的金标准引理,而纯神经方法为62%,最先进的符号工具为23%。此外,在针对性比较中,LEMMANAID发现的金标准引理数量超过了Claude Opus 4.5和GPT-5.2。我们的结果表明,LEMMANAID能够在数学和计算机科学的复杂形式化中猜想出大量有趣的引理。

英文摘要

Mathematicians and computer scientists are increasingly leveraging proof assistants to formalize and check complex proofs, a task that demands substantial expertise. Can we lower the bar by automating the conjecturing of helpful, interesting and novel lemmas? We present the first neuro-symbolic lemma conjecturing tool, LEMMANAID, designed to discover conjectures by drawing analogies between mathematical theories. LEMMANAID uses a fine-tuned LLM to generate lemma templates that describe the shape of a lemma, and symbolic methods to fill in the details. We compare LEMMANAID against the same LLM fine-tuned to generate lemmas directly, as well as a fully symbolic conjecturing method. On test sets from Isabelle's HOL library and Archive of Formal Proofs (AFP), LEMMANAID consistently outperforms both neural and symbolic methods. Using DeepSeek-coder-6.7B as a backend, LEMMANAID discovers 50% (HOL) and 29% (AFP) of the gold standard lemmas, increasing to 55% and 35% when ensembling prompting strategies. In a case study on Octonions, LEMMANAID discovers 79% of the gold standard lemmas, compared to 62% for the neural-only method and 23% for the state-of-the-art symbolic tool. Furthermore, in a targeted comparison, LEMMANAID discovers more gold standard lemmas than both Claude Opus 4.5 and GPT-5.2. Our results show that LEMMANAID can conjecture a significant number of interesting lemmas across complex formalizations in mathematics and computer science.

2505.11785 2026-06-03 cs.LG cs.AI stat.ML 版本更新

Improving Coverage in Combined Prediction Sets with Weighted p-values

通过加权p值提高组合预测集的覆盖范围

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出一种加权聚合预测集的框架,通过为每个预测集分配权重,实现覆盖范围在$1-2α$与$1-α$之间的灵活控制,并推广到数据依赖权重,在混合专家模型等场景中保持有限样本有效性。

详情
Journal ref
AISTATS 2026
AI中文摘要

共形预测通过用有效的预测集增强点预测来量化机器学习模型的不确定性。对于涉及多个试验、模型或数据源的复杂场景,可以聚合共形预测集以创建捕获整体不确定性的预测集,通常能提高精度。然而,聚合具有个体$1-α$覆盖率的多个预测集不可避免地削弱了整体保证,通常导致最坏情况覆盖率为$1-2α$。在这项工作中,我们提出了一个预测集加权聚合的框架,其中根据每个预测集的贡献为其分配权重。我们的框架提供了对集合聚合方式的灵活控制,实现了更紧的覆盖界限,根据权重的分布在组合模型的$1-2α$保证和单个模型的$1-α$保证之间插值。重要的是,我们的框架推广到数据依赖的权重,因为我们推导了一个加权聚合程序,即使权重依赖于数据,也能保持有限样本有效性。这一扩展使我们的框架广泛适用于权重被学习的场景,例如混合专家模型(MoE),并且我们通过在MoE设置中的实验证明,我们的方法实现了自适应覆盖。

英文摘要

Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual $1-α$ coverage inevitably weakens the overall guarantee, typically resulting in $1-2α$ worst-case coverage. In this work, we propose a framework for the weighted aggregation of prediction sets, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the $1-2α$ guarantee of the combined models and the $1-α$ guarantee of an individual model depending on the distribution of weights. Importantly, our framework generalizes to data-dependent weights, as we derive a procedure for weighted aggregation that maintains finite-sample validity even when the weights depend on the data. This extension makes our framework broadly applicable to settings where weights are learned, such as mixture-of-experts (MoE), and we demonstrate through experiments in the MoE setting that our methods achieve adaptive coverage.
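To make the idea of weighted aggregation concrete, the sketch below merges prediction sets with a weighted vote: a label is kept when the total weight of the sets containing it exceeds a threshold. This is an illustrative assumption and not the paper's procedure (which also covers data-dependent weights); the function name `weighted_vote_set`, the default threshold of 0.5, and the toy labels are all hypothetical.

```python
from typing import Hashable, Sequence, Set


def weighted_vote_set(
    prediction_sets: Sequence[Set[Hashable]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> Set[Hashable]:
    """Keep every label whose normalized weighted vote exceeds `threshold`."""
    if len(prediction_sets) != len(weights):
        raise ValueError("need exactly one weight per prediction set")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    normalized = [w / total for w in weights]

    merged: Set[Hashable] = set()
    # Only labels that appear in at least one set can receive any vote.
    for label in set().union(*prediction_sets):
        vote = sum(w for w, s in zip(normalized, prediction_sets) if label in s)
        if vote > threshold:
            merged.add(label)
    return merged


# Toy example (hypothetical labels): three prediction sets for one input.
sets = [{"cat", "dog"}, {"cat"}, {"dog", "fox"}]
uniform = weighted_vote_set(sets, [1, 1, 1])        # plain majority vote
skewed = weighted_vote_set(sets, [0.1, 0.8, 0.1])   # second set dominates
```

With uniform weights this reduces to a plain majority vote. When one weight exceeds the threshold on its own, the merged set can only contain labels from that dominant set, which matches the abstract's picture of interpolating toward a single model's guarantee.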

2512.13996 2026-06-03 cs.AI 版本更新

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

DTop-p MoE:面向基础模型预训练的稀疏度可控动态Top-p MoE

Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohi Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出DTop-p动态路由机制,通过比例积分控制器学习Top-p概率阈值并采用动态路由归一化,在全局稀疏约束下实现层间专家选择,一致优于Top-k和固定Top-p基线,且FLOPs与Top-k MoE相当。

详情
AI中文摘要

稀疏混合专家架构对于高效扩展模型容量至关重要,但标准的Top-$k$路由施加了固定的稀疏模式,忽略了令牌难度和层特定计算需求的内在差异。Top-$p$路由更具自适应性,因为它选择专家直到其累积路由概率达到阈值,允许置信令牌使用更少的专家,而模糊令牌则招募更多专家。然而,我们证明,现有的具有固定全局概率阈值的朴素Top-$p$实现相比Top-$k$仅带来边际收益,存在超参数敏感性,并导致不可控的计算成本。在本文中,我们提出DTop-$p$,一种稀疏度可控的动态路由机制,它使用比例积分控制器学习Top-$p$概率阈值,并采用动态路由归一化来在全局稀疏约束下支持逐层专家选择。在大语言模型和扩散Transformer上的大量实验表明,DTop-$p$在匹配Top-$k$ MoE平均FLOPs的同时,始终优于Top-$k$和固定Top-$p$基线。我们的分析证实,DTop-$p$在专家粒度、总专家容量、模型大小和数据集大小方面表现出强大的可扩展性,为基础模型预训练提供了一个鲁棒且高效的MoE框架。

英文摘要

Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-$p$ routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-$p$ implementations with fixed global probability thresholds provide only marginal gains over Top-$k$, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose DTop-$p$, a sparsity-controllable dynamic routing mechanism that learns the Top-$p$ probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-$p$ consistently outperforms both Top-$k$ and fixed Top-$p$ baselines while matching the average FLOPs of Top-$k$ MoE. Our analysis confirms that DTop-$p$ exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.
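The Top-$p$ selection rule described above (add experts in order of routing probability until the cumulative mass reaches the threshold) can be sketched in a few lines. The `PIThreshold` class is a generic textbook proportional-integral controller that steers the threshold toward a target average expert count; its gains, clamping range, and update form are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def top_p_experts(router_probs: Sequence[float], p: float) -> List[int]:
    """Return expert indices, most probable first, until cumulative mass >= p."""
    order = sorted(range(len(router_probs)), key=lambda i: router_probs[i], reverse=True)
    chosen: List[int] = []
    mass = 0.0
    for idx in order:
        chosen.append(idx)
        mass += router_probs[idx]
        if mass >= p:
            break
    return chosen


class PIThreshold:
    """Hypothetical PI controller that adapts p toward a target expert count."""

    def __init__(self, base_p: float = 0.5, target_experts: float = 2.0,
                 kp: float = 0.05, ki: float = 0.01) -> None:
        self.base_p = base_p
        self.target_experts = target_experts
        self.kp = kp
        self.ki = ki
        self.integral = 0.0
        self.p = base_p

    def update(self, avg_active_experts: float) -> float:
        # Positive error: fewer experts than the target are active, so raise p.
        error = self.target_experts - avg_active_experts
        self.integral += error
        raw = self.base_p + self.kp * error + self.ki * self.integral
        self.p = min(0.999, max(0.001, raw))  # keep p inside (0, 1)
        return self.p


confident = top_p_experts([0.85, 0.05, 0.05, 0.05], p=0.7)  # one expert suffices
ambiguous = top_p_experts([0.3, 0.3, 0.2, 0.2], p=0.7)      # needs three experts

controller = PIThreshold(base_p=0.5, target_experts=2.0)
lowered_p = controller.update(avg_active_experts=3.0)        # too many experts active
```

The confident token is routed to a single expert while the ambiguous one recruits three, which is the adaptivity the abstract attributes to Top-$p$. The controller shows why a fixed threshold gives no handle on compute: without feedback, the average number of active experts is whatever the router distribution happens to produce.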

2512.11213 2026-06-03 cs.AI cs.CL 版本更新

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

FutureWeaver: 面向模块化协作的多智能体系统的测试时计算规划

Dongwon Jung, Peng Shi, Muhao Chen, Yi Zhang

发表机构 * University of California, Davis(加州大学戴维斯分校) University of Waterloo(滑铁卢大学) Greenshoe, Inc(Greenshoe公司)

AI总结 提出FutureWeaver框架,通过双层次规划架构和自诱导协作模块,在固定预算下优化多智能体系统的测试时计算分配,显著提升协作性能。

详情
AI中文摘要

扩展测试时计算已被证明可以在无需额外训练的情况下显著提升大语言模型(LLM)的性能。然而,将这些技术扩展到多智能体系统仍然具有挑战性:现有方法缺乏原则性的机制来分配计算以实现有效协作、扩展协调本身,或在明确的预算约束下优化计算使用。为弥补这一差距,我们提出了FutureWeaver,一个在固定预算下规划和优化多智能体系统中测试时计算分配的框架。它引入了协作模块,形式化为模块化的、可调用的函数,封装了可复用的多智能体工作流,并通过自博弈反思从重复出现的交互模式中自动归纳。基于这些模块,它采用了一种双层次规划架构,联合执行短视动作选择和长远抽象前瞻,以在预算约束下优化推理轨迹。在复杂智能体基准上的实验表明,FutureWeaver在各种预算设置下始终优于基线,验证了其在推理时优化中多智能体协作的有效性。

英文摘要

Scaling test-time computation has been shown to significantly improve large language model (LLM) performance without additional training. However, extending these techniques to multi-agent systems remains challenging: existing approaches lack principled mechanisms for allocating compute to enable effective collaboration, scaling coordination itself, or optimizing compute usage under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. It introduces collaboration modules, formalized as modular, callable functions that encapsulate reusable multi-agent workflows and are automatically induced via self-play reflection from recurring interaction patterns. Building on these modules, it employs a dual-level planning architecture that jointly performs short-horizon action selection and long-horizon abstract lookahead to optimize inference trajectories under budget constraints. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

2512.05530 2026-06-03 cs.AI 版本更新

MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

MIND:面向多模态大模型的多理由集成判别推理框架

Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue

发表机构 * Shenyang Institute of Automation, Chinese Academy of Sciences(中国科学院沈阳自动化研究所) University of Chinese Academy of Sciences(中国科学院大学) Peking University(北京大学) MMLab, CUHK(香港中文大学多媒体实验室)

AI总结 针对多模态大语言模型在多理由语义建模、逻辑鲁棒性和抗误导方面的不足,提出MIND推理框架,通过“理解-反思-纠正”机制实现从被动模仿到主动判别推理的范式转变。

Comments Accepted to ICML 2026

详情
AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛应用于推理任务。然而,它们存在多理由语义建模有限、逻辑鲁棒性不足以及易受误导线索影响的问题。因此,我们提出了一个多理由集成判别(MIND)推理框架,旨在赋予MLLMs类似人类的“理解-反思-纠正”认知能力,实现从基于被动模仿的推理到主动判别推理的范式演变。具体而言,我们引入了理由增强与判别(RAD)范式,提供了统一且可扩展的数据基础。同时,我们设计了渐进式两阶段纠正学习(P2CL)策略:第一阶段增强多理由正向学习,第二阶段实现主动逻辑判别与纠正。此外,为了缓解多理由语义空间中的表示纠缠,我们提出了多理由对比对齐(MCA)优化策略。大量实验表明,我们的MIND在多个公共数据集上达到了最先进的性能。我们的数据和代码可在https://github.com/YuChuang1205/MIND获取。

英文摘要

Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading cues. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of "Understand -> Rethink -> Correct", and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which provides a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy. Extensive experiments show that our MIND achieves SOTA performance on multiple public datasets. Our data and code are available at https://github.com/YuChuang1205/MIND

2512.03627 2026-06-03 cs.AI 版本更新

MemVerse: Multimodal Memory for Lifelong Learning Agents

MemVerse:面向终身学习智能体的多模态记忆

Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, Ding Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出MemVerse,一种模型无关的即插即用记忆框架,通过分层检索记忆与参数化快速回忆结合,解决智能体在多模态交互中的灾难性遗忘和长程推理问题。

Comments 25 pages, 6 figures, 14 tables

详情
AI中文摘要

尽管大规模语言和视觉模型取得了快速进展,但AI智能体仍然存在一个根本性限制:它们无法记忆。没有可靠的记忆,智能体会灾难性地遗忘过去的经验,难以进行长程推理,并且在多模态或交互环境中无法连贯地运行。我们提出了MemVerse,一种模型无关的即插即用记忆框架,它将快速的参数化回忆与基于检索的分层记忆相结合,实现了可扩展和自适应的多模态智能。MemVerse维护短期记忆以处理近期上下文,同时将原始多模态经验转化为结构化的长期记忆,组织为分层知识图谱。这种设计支持持续整合、自适应遗忘和有界的记忆增长。为了满足实时需求,MemVerse引入了一种周期性蒸馏机制,将长期记忆中的关键知识压缩到参数化模型中,从而实现快速、可微的回忆,同时保持可解释性。大量实验表明,MemVerse显著提高了多模态推理和持续学习效率,使智能体能够在扩展的交互中记忆、适应和连贯推理。

英文摘要

Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.

2512.03019 2026-06-03 cs.LG cs.AI 版本更新

Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge

分布校准的推理时间计算用于思考型LLM作为评判者

Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind University of Cambridge(剑桥大学)

AI总结 针对思考型大语言模型作为评判者时单样本噪声和聚合不一致问题,提出基于Bradley-Terry-Davidson模型的分布校准聚合方案,利用极性(非平局边际)和决定性(非平局率)区分微弱多数与强共识,显著降低MAE并提高成对准确率,匹配或超越人类评判者。

详情
AI中文摘要

用作成对偏好评判的思考型大语言模型在单样本层面仍存在噪声,常见的聚合规则(多数投票、软自一致性或基于指令的自聚合)在允许平局时不一致。我们研究了评估者的推理时间计算(ITC),该评估者为每个项目生成n个独立的思考-评分样本,并提出了一种原则性的、分布校准的聚合方案。我们的方法使用Bradley-Terry-Davidson公式对评分计数进行三向偏好建模,利用极性(非平局间的边际)和决定性(非平局率)来区分微弱多数与强共识。在各种评估基准上,与标准基线相比,我们的方法持续降低MAE并提高成对准确率,并且在针对人类共识元标签进行评估时,匹配或超过单个人类评判者。这些结果表明,精心分配ITC并使用分布感知方法进行聚合,可以将嘈杂的个体模型判断转化为可靠的评估评分。

英文摘要

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
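The two count statistics named above can be computed directly from the n sampled ratings. The sketch below is one reading of the parenthetical definitions (polarity as the signed margin among non-tie votes, decisiveness as the non-tie rate), paired with the standard Davidson tie extension of the Bradley-Terry model. How the paper fits or calibrates the model is not stated in the abstract, so `vote_summary`, `davidson_probs`, and their parameters are illustrative assumptions.

```python
import math
from typing import Dict


def vote_summary(n_a: int, n_b: int, n_tie: int) -> Dict[str, float]:
    """Summarize sampled ratings: A preferred, B preferred, or tie."""
    n = n_a + n_b + n_tie
    if n == 0:
        raise ValueError("need at least one rating sample")
    non_ties = n_a + n_b
    decisiveness = non_ties / n  # share of samples that picked a side
    # Signed margin among non-tie votes, in [-1, 1]; 0 when every vote is a tie.
    polarity = (n_a - n_b) / non_ties if non_ties else 0.0
    return {"polarity": polarity, "decisiveness": decisiveness}


def davidson_probs(strength_a: float, strength_b: float, nu: float) -> Dict[str, float]:
    """Davidson model: P(A), P(B), P(tie) from positive strengths and tie parameter nu."""
    tie_term = nu * math.sqrt(strength_a * strength_b)
    z = strength_a + strength_b + tie_term
    return {"a": strength_a / z, "b": strength_b / z, "tie": tie_term / z}


narrow = vote_summary(n_a=5, n_b=4, n_tie=1)   # A barely ahead
strong = vote_summary(n_a=9, n_b=0, n_tie=1)   # near-unanimous for A
balanced = davidson_probs(1.0, 1.0, nu=1.0)    # equal strengths, ties as likely as wins
```

Both toy vote patterns produce the same majority-vote winner and the same decisiveness, yet their polarity differs sharply. That difference is the signal a plain majority rule discards.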

2511.21731 2026-06-03 cs.CL cs.AI 版本更新

Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

识别AI语言中的量子结构:人类与人工智能认知进化趋同的证据

Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo

发表机构 * Center Leo Apostel for Interdisciplinary Studies, Vrije Universiteit Brussel (VUB)(利奥·阿波斯泰尔跨学科研究中心,布鲁塞尔自由大学) Department of Economics, University of Bergamo(贝加莫大学经济系) Department of Humanities and Cultural Heritage (DIUM) and Centre CQSCS, University of Udine(乌迪内大学人文与文化遗产系及CQSCS中心)

AI总结 通过对大型语言模型进行认知测试,发现其概念组合中存在贝尔不等式显著违背和玻色-爱因斯坦统计,表明人类与人工智能在概念-语言领域均涌现非经典量子结构,支持认知进化趋同假说。

详情
Journal ref
Entropy 28, 622, 2026
AI中文摘要

我们展示了使用特定大型语言模型(LLMs)作为测试对象进行的概念组合认知测试结果。在第一个测试中,使用ChatGPT和Gemini,我们表明贝尔不等式被显著违背,这表明存在一个概率不满足Kolmogorov公理的“非经典概率模型”。在第二个测试中,同样使用ChatGPT和Gemini,我们在大型文本中的单词分布中识别出“玻色-爱因斯坦统计”的存在,而非直觉预期的“麦克斯韦-玻尔兹曼统计”。有趣的是,这些发现与之前在人类参与者认知测试和大规模语料库信息检索测试中获得的结果相呼应。综合来看,它们指向“概念-语言领域中非经典量子类结构的系统性涌现”,无论认知主体是人类还是人工智能。尽管LLMs因历史原因被归类为神经网络,但我们认为,在神经网络之上构建的向量空间的分布式语义结构中,发生了一种更本质的知识组织形式。正是这种承载意义的结构,促成了通过生物进化缓慢建立的人类认知与语言,与通过自我学习和训练快速涌现的LLM认知与语言之间的进化趋同现象。我们分析了支持上述假设的各种方面和实例。我们还提出了一个统一框架,解释了我们识别出的普遍量子组织意义。

英文摘要

We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of a 'non-classical probability model' with probabilities that do not satisfy Kolmogorov's axioms. In the second test, also performed using ChatGPT and Gemini, we identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of non-classical quantum-like structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.
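For readers unfamiliar with the first test, one widely used form of Bell's inequality is the CHSH inequality. The abstract does not say which variant the authors apply, so the form below is an illustrative assumption. Here $A, A'$ and $B, B'$ denote two pairs of two-outcome measurements with outcomes $\pm 1$, and $E(\cdot,\cdot)$ is the expected value of the product of the two outcomes.

```latex
\[
  S \;=\; E(A,B) \;-\; E(A,B') \;+\; E(A',B) \;+\; E(A',B'),
  \qquad |S| \le 2 .
\]
```

Any model in which all four expectations come from a single classical (Kolmogorovian) joint distribution satisfies $|S| \le 2$, whereas quantum models can reach $|S| = 2\sqrt{2}$. A measured value above 2 is therefore what "violation" means in this setting.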

2503.07265 2026-06-03 cs.CV cs.AI cs.CL 版本更新

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

WISE: 一种基于世界知识的文本到图像生成语义评估方法

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有文本到图像生成模型缺乏复杂语义理解和世界知识整合评估的问题,提出WISE基准,包含25个子领域的1000个精心设计的提示,并引入WiScore指标评估知识-图像对齐,实验表明当前模型在整合世界知识方面存在显著局限。

Comments Accepted to ICML 2026. We have also released an updated version of the benchmark, WISE_Verified. Please refer to https://github.com/PKU-YuanGroup/WISE for the latest version

详情
AI中文摘要

文本到图像(T2I)模型能够生成高质量的艺术创作和视觉内容。然而,现有研究和评估标准主要关注图像真实性和浅层的文本-图像对齐,缺乏对文本到图像生成中复杂语义理解和世界知识整合的全面评估。为解决这一挑战,我们提出了WISE,这是首个专门用于World Knowledge-Informed Semantic Evaluation(世界知识引导的语义评估)的基准。WISE超越了简单的词-像素映射,通过1000个精心设计的提示,涵盖文化常识、时空推理和自然科学等25个子领域,对模型进行挑战。为了克服传统CLIP指标的局限性,我们引入了WiScore,一种用于评估知识-图像对齐的新型定量指标。通过对20个模型(10个专用T2I模型和10个统一多模态模型)在涵盖25个子领域的1000个结构化提示上进行全面测试,我们的发现揭示了它们在图像生成过程中有效整合和应用世界知识的能力存在显著局限,为下一代T2I模型增强知识整合与应用指明了关键路径。代码和数据可在 https://github.com/PKU-YuanGroup/WISE 获取。

英文摘要

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

2511.13020 2026-06-03 cs.CV cs.AI 版本更新

PHASE: Physiology-Aware Hyperspectral Reconstruction via Object-to-Human Domain Adaptation

PHASE: 通过对象到人体域适应的生理感知高光谱重建

Yufei Wen, Shuxing Zhong, Jingdan Kang, Yuting Zhang, Jintai Chen, Kaishun Wu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) South China University of Technology(华南理工大学)

AI总结 针对现有高光谱重建方法在生理成像中失效的问题,提出PHASE范式,通过生理通道重新解释和生理约束对齐,实现从对象到人体的域适应,仅需1.5%标注数据即可显著提升重建质量。

Comments To KDD26

详情
AI中文摘要

尽管高光谱成像提供了无与伦比的无创生理洞察,但其笨重的硬件、缓慢的采集速度和监管负担严重限制了其临床可用性。一种自然的替代方案是从无处不在的RGB或CASSI测量中重建高光谱信息。然而,现有的为以对象为中心的场景开发的范式依赖于基于反射率的特征对齐,假设光谱相似性保持语义一致性。这一假设在生理成像中不成立,因为视觉上相似的RGB响应可能源于不同且纠缠的生理状态。这种不匹配促使从反射率对齐转向基于共享光-物质相互作用原理的生理感知表示学习——这一转变引入了来自跨通道语义偏移(C1)和基于RGB采集的不可逆信息丢失(C2)的基本挑战。因此,我们设计了PHASE,一种生理感知的高光谱重建范式,通过生理通道重新解释解耦跨通道生理语义,并通过生理约束对齐将重建限制在生理上合理的解,从根本上重新定义了对象到人体的迁移。在两种源到目标迁移协议下,PHASE仅需1.5%的标注监督,在SSIM上一致优于最先进方法最多+2.20,在SAM上最多-3.06。

英文摘要

Although hyperspectral imaging offers unparalleled non-invasive physiological insight, its bulky hardware, slow acquisition, and regulatory burden severely limit its clinical availability. A natural workaround is to reconstruct hyperspectral information from ubiquitous RGB or CASSI measurements. However, existing paradigms, developed for object-centric scenes, rely on reflectance-based feature alignment, assuming that spectral similarity preserves semantic meaning. This assumption breaks down in physiological imaging, where visually similar RGB responses may arise from distinct and entangled physiological states. This mismatch motivates a shift from reflectance alignment to physiology-aware representation learning grounded in shared light-matter interaction principles, a shift that introduces fundamental challenges from cross-channel semantic shifts (C1) and irreversible information loss in RGB-based acquisition (C2). We therefore design PHASE, a physiology-aware hyperspectral reconstruction paradigm that fundamentally redefines object-to-human transfer by disentangling cross-channel physiological semantics via Physiological Channel Reinterpretation and restricting reconstruction to physiologically plausible solutions through Physiologically Constrained Alignment. Under two source-to-target transfer protocols, PHASE consistently outperforms state-of-the-art methods by up to +2.20 in SSIM and -3.06 in SAM with merely 1.5% labeled supervision.
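The SAM figure quoted above is read here as the spectral angle mapper, the usual meaning of the acronym in hyperspectral reconstruction, where lower is better. That reading, the use of degrees, and the function name `spectral_angle` are assumptions; the abstract does not define the metric. A minimal per-pixel sketch:

```python
import math
from typing import Sequence


def spectral_angle(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Angle in degrees between two spectra treated as vectors (0 = same shape)."""
    if len(reference) != len(estimate):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(sum(e * e for e in estimate))
    if norm == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to guard against rounding slightly outside [-1, 1] before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


same_shape = spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # brightness differs only
orthogonal = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

Because the angle ignores overall scale, it measures whether the spectral shape is right independently of brightness, which complements a structural metric such as SSIM.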

2511.02304 2026-06-03 cs.MA cs.AI cs.CL cs.FL cs.LG 版本更新

Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

自动机条件化协作多智能体强化学习

Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Stanford University(斯坦福大学)

AI总结 提出自动机条件化协作多智能体强化学习框架,通过自动机分解团队目标为子任务,学习任务条件化的分散策略,实现最优任务分配和多步协调。

详情
AI中文摘要

我们研究在集中训练、分散执行下,针对协作性时间目标的多任务、多智能体策略学习。在此设置中,使用自动机表示分配给智能体的任务,能够将团队级目标分解为更简单、更小的子任务。然而,现有方法样本效率低下,且局限于单任务情况,需要为每个新任务重新训练策略。在这项工作中,我们提出了自动机条件化协作多智能体强化学习(ACC-MARL),一个学习任务条件化分散团队策略的框架。我们识别了ACC-MARL可行性的挑战,提出了解决方案,并证明了我们的方法是最优的。我们进一步展示了学习到的价值函数可用于在测试时最优地分配任务。实验表明,智能体之间涌现出任务感知的多步协调,例如按下按钮开门、扶住门以及短路任务。

英文摘要

We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables breaking down a team-level objective into simpler, smaller sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, requiring retraining policies for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.
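To illustrate what it means to represent a sub-task as an automaton, the sketch below encodes the door example from the abstract as a small deterministic finite automaton whose state a policy could be conditioned on. The class `TaskDFA`, the state names, and the event names are hypothetical; the abstract does not specify the paper's automaton encoding or how it is embedded for the policy.

```python
from typing import Dict, Iterable, List, Set, Tuple


class TaskDFA:
    """Deterministic finite automaton that tracks progress on one sub-task."""

    def __init__(self, transitions: Dict[Tuple[str, str], str],
                 start: str, accepting: Set[str]) -> None:
        self.transitions = transitions
        self.state = start
        self.accepting = accepting

    def step(self, event: str) -> str:
        # Events with no outgoing edge from the current state leave it unchanged.
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state

    def done(self) -> bool:
        return self.state in self.accepting


def run(dfa: TaskDFA, events: Iterable[str]) -> List[str]:
    """Feed events in order and return the visited states."""
    return [dfa.step(event) for event in events]


# Hypothetical sub-task: the button must be pressed before the door can be crossed.
door_task = TaskDFA(
    transitions={
        ("locked", "button_pressed"): "unlocked",
        ("unlocked", "door_crossed"): "through",
    },
    start="locked",
    accepting={"through"},
)
trace = run(door_task, ["door_crossed", "button_pressed", "door_crossed"])
```

The first `door_crossed` event is ignored because the door is still locked. The automaton state summarizes which part of the temporal objective remains, which is the information a task-conditioned policy needs at each step.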

2510.23216 2026-06-03 cs.AI cs.LG 版本更新

Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

逼真足球模拟中的人性化守门:一种样本高效的强化学习方法

Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Fabio Zinno, Michael Jones, Linus Gisslén

发表机构 * University of Edinburgh(爱丁堡大学) KTH Royal Institute of Technology(皇家理工学院) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种样本高效的深度强化学习方法,通过利用预收集数据和增加网络可塑性,在EA SPORTS FC 25中训练出守门员智能体,其扑救率比内置AI高10%,训练速度比标准DRL快50%,且行为更接近人类。

详情
AI中文摘要

尽管多个知名视频游戏已成为深度强化学习(DRL)的测试平台,但该技术很少被游戏行业用于制作真实的AI行为。先前的研究侧重于使用大型模型训练超人类智能体,这对于资源有限、旨在实现类人智能体的游戏工作室来说并不实际。本文提出了一种样本高效的DRL方法,专为在工业环境(如视频游戏行业)中训练和微调智能体而设计。我们的方法通过利用预收集的数据和增加网络可塑性来提高基于价值的DRL的样本效率。我们在EA SPORTS FC 25(当今最畅销的足球模拟游戏之一)中评估了该方法训练守门员智能体的效果。我们的智能体在扑救率上比游戏内置AI高出10%。消融研究表明,与标准DRL方法相比,我们的方法训练智能体速度提高了50%。最后,领域专家的定性评估表明,与手工制作的智能体相比,我们的方法创造了更人性化的游戏玩法。作为该方法影响力的证明,该技术已被用于该系列的最新版本中。

英文摘要

While several high-profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves the sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method by training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.

2510.17149 2026-06-03 cs.AI 版本更新

ProtocolBench: Which LLM MultiAgent Protocol to Choose?

ProtocolBench:选择哪个LLM多智能体协议?

Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, Jiaxuan You

AI总结 提出ProtocolBench基准,系统比较多智能体协议在任务成功率、延迟、开销和鲁棒性上的表现,并设计可学习的协议路由器ProtocolRouter以动态选择最优协议。

Comments Accepted to ICML 2026. Camera-ready version. Code and benchmark artifacts: https://github.com/ulab-uiuc/AgentProtocols

详情
AI中文摘要

随着大规模多智能体系统的发展,通信协议层已成为影响性能和可靠性的关键但评估不足的因素。尽管存在多种协议(A2A、ACP、ANP、Agora等),选择往往依赖直觉且缺乏标准化指导。我们引入ProtocolBench,一个沿四个可测量轴(任务成功率、端到端延迟、消息或字节开销、故障下的鲁棒性)系统比较智能体协议的基准。在ProtocolBench上,协议选择显著影响系统行为。在流队列场景中,不同协议的整体完成时间差异高达36.5%,平均端到端延迟相差3.48秒。在故障风暴恢复下,不同协议的鲁棒性也持续存在差异。除评估外,我们提出ProtocolRouter,一个可学习的协议路由器,根据需求和运行时信号为每个场景(或每个模块)选择协议。ProtocolRouter相比最佳单协议基线将故障风暴恢复时间降低高达18.1%,并在GAIA等场景中取得更高成功率。我们还发布了ProtocolRouterBench以标准化协议评估并提高大规模可靠性。

英文摘要

As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition-driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end-to-end latency differs by 3.48 s. Under Fail-Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects per-scenario (or per-module) protocols from requirement and runtime signals. ProtocolRouter reduces Fail-Storm recovery time by up to 18.1% versus the best single-protocol baseline, and achieves scenario-specific gains such as higher success in GAIA. We also release ProtocolRouterBench to standardize protocol evaluation and improve reliability at scale.

2510.16302 2026-06-03 cs.AI cs.IR 版本更新

DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA

DTKG: 用于多跳问答的双轨知识图谱验证推理框架

Changhao Wang, Yanfang Liu, Xinxin Fan, Ao Tian, Lanzhi Zhou, Yunfeng Lu

发表机构 * School of Computer Science Engineering, Beihang University, Beijing, China; School of Reliability Systems Engineering, Beihang University, Beijing, China; State Key Laboratory of Complex & Critical Software Environment; National Key Laboratory of Reliability; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

AI总结 提出DTKG框架,通过分类阶段和分支处理阶段分别处理并行事实验证和链式多跳推理,提升多跳问答的效率和准确性。

Comments Accepted to ICML 2026

详情
AI中文摘要

问答中的多跳推理在现代大型语言模型的检索增强生成中扮演关键角色。通过从知识图谱中检索实体的关系结构可以获得准确答案。考虑到固有的关系依赖和推理模式,多跳推理通常分为两类:i) 并行事实验证多跳推理问题,即需要同时验证多个独立子问题;ii) 链式多跳推理问题,即需要顺序多步推理,中间结论作为后续推理的必要前提。目前,多跳推理方法单独使用两种技术之一:基于LLM响应的事实验证和基于KG路径的链构建。然而,前者擅长并行事实验证但在链式推理任务上表现不佳,而后者擅长链式多跳推理但在处理并行事实验证推理时存在冗余路径检索问题。这些限制降低了多跳问答任务的效率和准确性。为解决这一挑战,我们提出了一种新颖的双轨KG验证和推理框架DTKG,其灵感来自认知科学中的双过程理论。具体来说,DTKG包括两个主要阶段:分类阶段和分支处理阶段。

英文摘要

Multi-hop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for modern large language models (LLMs). Accurate answers can be obtained by retrieving the relational structure of entities from a knowledge graph (KG). Based on their inherent relation dependencies and reasoning patterns, multi-hop reasoning questions generally fall into two categories: i) parallel fact-verification questions, which require simultaneous verification of multiple independent sub-questions; and ii) chained questions, which demand sequential multi-step inference in which intermediate conclusions serve as essential premises for subsequent reasoning. Current multi-hop reasoning approaches employ only one of two techniques: LLM response-based fact verification or KG path-based chain construction. The former excels at parallel fact verification but underperforms on chained reasoning tasks, while the latter is proficient at chained multi-hop reasoning but suffers from redundant path retrieval when handling parallel fact verification. These limitations degrade both efficiency and accuracy on multi-hop QA tasks. To address this challenge, we propose DTKG, a novel dual-track KG verification and reasoning framework inspired by the Dual Process Theory in cognitive science. Specifically, DTKG comprises two main stages: the Classification Stage and the Branch Processing Stage.

2505.08222 2026-06-03 cs.RO cs.AI cs.DC cs.PF 版本更新

Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

通过自主车辆扩展多智能体强化学习用于水声跟踪

Matteo Gallici, Ivan Masmitja, Mario Martín

发表机构 * KEMLG Research Group, Universitat Politècnica de Catalunya Barcelona, Spain(凯姆尔格研究组,巴塞罗那理工大学,西班牙) Instituto de Ciencias del Mar, Consejo Superior de Investigaciones Científicas, Barcelona, Spain(海洋科学研究所,西班牙国家科学研究委员会,巴塞罗那,西班牙) KEMLG Research Group, Universitat Politècnica de Catalunya (UPC), and with the HPAI group at Barcelona Supercomputing Center (BSC), Barcelona, Spain(凯姆尔格研究组,巴塞罗那理工大学(UPC),以及巴塞罗那超级计算中心(BSC)的HPAI组,巴塞罗那,西班牙)

AI总结 提出一种GPU加速环境(高达30000倍加速)和基于Transformer的MARL架构(TransfMAPPO),实现多目标快速移动场景下的水下跟踪,跟踪误差低于5米。

详情
AI中文摘要

自主车辆(AV)为水下跟踪等科学任务提供了经济高效的解决方案。强化学习(RL)已成为控制AV的强大方法,但扩展到舰队(对于多目标跟踪或快速移动目标至关重要)具有挑战性。多智能体RL(MARL)以样本效率低下而闻名,虽然像Gazebo的LRAUV这样的高保真模拟器提供高达100倍实时速度的单机器人模拟,但在多车辆场景中几乎没有加速,使得MARL训练不切实际。然而,高保真模拟对于测试复杂策略和缩小模拟到现实的差距至关重要。为了解决这些限制,我们开发了一个GPU加速环境,在保持其动力学的同时,实现了比Gazebo高达30000倍的加速。这使得快速、端到端的GPU训练以及无缝转移到Gazebo进行评估成为可能。我们还引入了一种基于Transformer的架构(TransfMAPPO),该架构学习对舰队规模和目标数量不变的策略,从而能够通过课程学习在日益复杂的场景中训练更大的舰队。经过大规模GPU训练后,我们在Gazebo中进行了广泛评估,表明即使面对多个快速移动的目标,我们的方法也能将跟踪误差保持在5米以下。

英文摘要

Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide up to 100x faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000x speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5m even with multiple fast-moving targets.

2510.09845 2026-06-03 cs.LG cs.AI cs.CV 版本更新

Harnessing Self-Supervised Deep Learning and Geostationary Remote Sensing for Advancing Wildfire and Associated Air Quality Monitoring: Improved Smoke and Fire Front Masking using GOES and TEMPO Radiance Data

利用自监督深度学习和地球静止遥感推进野火及相关空气质量监测:使用GOES和TEMPO辐射数据改进烟雾和火锋掩膜

Nicholas LaHaye, Thilanka Munashinge, Hugo Lee, Xiaohua Pan, Gonzalo Gonzalez Abad, Hazem Mahmoud, Jennifer Wei

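AI summary Uses hourly data from NASA's TEMPO satellite mission and self-supervised deep learning to build a system that distinguishes smoke from clouds with GOES-18 and TEMPO data and maps wildfire fronts and smoke plumes in near real time, with significant improvement over existing operational products.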

Comments https://2025.ieeeigarss.org/view_paper.php?PaperNum=6389&SessionID=1611

详情
AI中文摘要

这项工作展示了通过利用NASA的TEMPO卫星任务前所未有的每小时数据以及自监督深度学习的进展,改善美国西部野火和空气质量管理的可能性。我们展示了一种创新的自监督深度学习系统在绘制近实时每小时野火火锋和烟雾羽流扩散方面的有效性:成功使用GOES-18和TEMPO数据区分烟雾与云层,不同传感模态生成的烟雾和火掩膜之间具有强一致性,并且对于相同案例相比业务产品有显著改进。

英文摘要

This work demonstrates the possibilities for improving wildfire and air quality management in the western United States by leveraging the unprecedented hourly data from NASA's TEMPO satellite mission and advances in self-supervised deep learning. We demonstrate the efficacy of an innovative self-supervised deep learning system for mapping the near real-time hourly spread of wildfire fronts and smoke plumes: it successfully distinguishes smoke plumes from clouds using GOES-18 and TEMPO data, shows strong agreement between the smoke and fire masks generated from different sensing modalities, and significantly improves on operational products for the same cases.

2510.09711 2026-06-03 cs.CL cs.AI 版本更新

ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

ReaLM:残差量化桥接知识图谱嵌入与大型语言模型

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen

Affiliations * Tianjin University; The University of Manchester

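AI summary Proposes the ReaLM framework, which discretizes knowledge graph embeddings into learnable tokens via residual vector quantization, adds them to the LLM vocabulary, and applies ontology constraints to align structured knowledge semantically with the language model, achieving state-of-the-art performance on knowledge graph completion.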

详情
AI中文摘要

大型语言模型(LLM)最近成为知识图谱补全(KGC)的强大范式,提供了超越传统基于嵌入方法的强大推理和泛化能力。然而,现有的基于LLM的方法通常难以充分利用结构化语义表示,因为预训练KG模型的连续嵌入空间与LLM的离散标记空间根本不对齐。这种差异阻碍了有效的语义转移并限制了它们的性能。为了解决这一挑战,我们提出了ReaLM,一种新颖且有效的框架,通过残差向量量化的机制弥合了KG嵌入和LLM标记化之间的差距。ReaLM将预训练的KG嵌入离散化为紧凑的代码序列,并将它们作为可学习标记集成到LLM词汇表中,从而实现符号知识和上下文知识的无缝融合。此外,我们引入了本体引导的类约束以强制语义一致性,基于类级别的兼容性细化实体预测。在两个广泛使用的基准数据集上进行的大量实验表明,ReaLM实现了最先进的性能,证实了其在将结构化知识与大规模语言模型对齐方面的有效性。

英文摘要

Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
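
As a rough illustration of the residual vector quantization step mentioned in the abstract, the sketch below encodes a vector as one codeword index per stage, with each stage quantizing the residual left by the previous one. The codebooks, the example vector, and the `<kg_stage_index>` token naming are made-up assumptions for illustration; the abstract does not describe ReaLM's actual codebook sizes, training procedure, or token format.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected codewords to reconstruct an approximation of the input."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks for 2-D embeddings (coarse, then fine).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]

embedding = [1.2, 0.8]                     # stand-in for a pretrained KG embedding
codes = rvq_encode(embedding, CODEBOOKS)   # one index per stage
tokens = [f"<kg_{stage}_{idx}>" for stage, idx in enumerate(codes)]  # hypothetical names
approx = rvq_decode(codes, CODEBOOKS)      # coarse codeword plus fine correction
```

Each extra stage refines the reconstruction, which is why a short code sequence can stand in for a continuous embedding.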

2509.09685 2026-06-03 cs.IR cs.AI cs.MM cs.SD eess.AS 版本更新

TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

TalkPlayData 2:用于多模态对话式音乐推荐的智能体合成数据流水线

Keunwoo Choi, Seungheon Doh, Juhan Nam

Affiliations * KAIST

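AI summary Presents TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline, in which LLMs in multiple roles simulate conversations across varied scenarios to support training generative recommendation models.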

详情
AI中文摘要

我们提出了TalkPlayData 2,一个由智能体数据流水线生成的多模态对话式音乐推荐合成数据集。在该流水线中,多个大语言模型(LLM)智能体被创建,承担不同角色,具有专门的提示词和访问不同信息部分的权限,通过记录Listener LLM和Recsys LLM之间的对话来获取聊天数据。为了覆盖各种对话场景,每个对话的Listener LLM基于微调的对话目标进行条件设置。最后,所有LLM都是多模态的,支持音频和图像,从而模拟多模态推荐和对话。在LLM-as-a-judge和主观评估实验中,TalkPlayData 2在训练音乐生成式推荐模型的各个方面达到了预期目标。TalkPlayData 2及其生成代码已在https://talkpl-ai.github.io发布。

英文摘要

We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the proposed pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are released at https://talkpl-ai.github.io.

2510.03316 2026-06-03 cs.CV cs.AI cs.LG 版本更新

The View From Space: Navigating Instrumentation Differences with EOFMs

从太空视角:利用EOFMs导航仪器差异

Ryan P. Demilt, Nicholas LaHaye, Karis Tenneson

Affiliations * Spatial Informatics Group

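AI summary Analyzes the sensitivity of Earth Observation Foundation Models (EOFMs) to sensor architecture, exposing pitfalls in current model design and pointing to a way forward for model developers, users, and the remote-sensing science community.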

详情
Journal ref
https://neurips.cc/virtual/2025/loc/san-diego/122891
AI中文摘要

地球观测基础模型(EOFMs)作为处理大量遥感及其他地球观测数据、并对许多关键地球监测任务产生影响的工具,其普及程度急剧上升。一个新兴趋势是利用预训练模型的输出作为“嵌入”,这些嵌入总结了高维数据,可用于通用任务,如相似性搜索和内容特定查询。然而,大多数EOFMs仅在单一模态数据上训练,然后通过匹配不同模态的波段进行应用或基准测试。现有工作尚不清楚多样化的传感器架构如何影响当前EOFMs套件的内部表示。我们在本工作中表明,EOFMs的表示空间对传感器架构高度敏感,理解这一差异为我们提供了关于当前EOFMs设计陷阱的关键视角,并指明了作为模型开发者、用户以及以稳健遥感科学为指导的社区应如何前进的方向。

英文摘要

Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' that summarize high-dimensional data for generic tasks such as similarity search and content-specific queries. However, most EOFMs are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture, and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals how to move forward as model developers, users, and a community guided by robust remote-sensing science.

2510.01377 2026-06-03 math.OC cs.AI cs.LG cs.MA cs.SY eess.SY 版本更新

DeMuon: A Decentralized Muon for Matrix Optimization over Graphs

DeMuon:一种用于图上矩阵优化的去中心化Muon方法

Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson

Affiliations * Department of Mathematics, Linköping University; Department of Electrical Engineering, Linköping University; Department of Computer and Information Science, Linköping University

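AI summary Proposes DeMuon, which orthogonalizes matrices via Newton-Schulz iterations and uses gradient tracking to handle heterogeneity among local functions; under heavy-tailed noise its iteration complexity matches the best-known centralized bounds in its dependence on the target tolerance, and to the authors' knowledge it is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees.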

Comments Add an accelerated variant of the proposed method. New proofs of proposed methods

详情
AI中文摘要

本文提出DeMuon,一种在给定通信拓扑上进行去中心化矩阵优化的方法。DeMuon通过牛顿-舒尔茨迭代(继承自其集中式前身Muon)实现矩阵正交化,并采用梯度跟踪来减轻局部函数之间的异质性。在重尾噪声条件和额外的温和假设下,我们建立了DeMuon达到近似随机驻点的迭代复杂度。该复杂度结果在目标容差依赖方面与已知的最佳集中式算法复杂度界相匹配。据我们所知,DeMuon是首个将Muon直接扩展到图上去中心化优化并具有可证明复杂度保证的方法。我们在不同连通程度的图上进行了去中心化Transformer预训练的初步数值实验。数值结果表明,在不同网络拓扑下,DeMuon相比其他流行的去中心化算法具有明显的改进优势。

英文摘要

In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations (a technique inherited from its centralized predecessor, Muon) and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
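
To make the orthogonalization step concrete, here is a minimal sketch of a Newton-Schulz iteration in plain Python. It uses the textbook cubic update X <- 1.5 X - 0.5 X Xᵀ X, which is an assumption made for simplicity; Muon-style optimizers are usually described with a tuned higher-order polynomial, and the abstract does not state which variant DeMuon uses. The input matrix is made up, and the gradient-tracking and communication parts of the method are not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of a nonzero, full-rank matrix g.

    Scales g by its Frobenius norm so that all singular values are at most 1,
    then applies the classical cubic iteration X <- 1.5 X - 0.5 X X^T X,
    which pushes every nonzero singular value toward 1.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x


grad = [[2.0, 1.0], [0.0, 1.0]]    # made-up full-rank "gradient" matrix
q = newton_schulz_orthogonalize(grad)
gram = matmul(transpose(q), q)     # close to the 2x2 identity if q is orthogonal
```

The iteration needs only matrix products, which is the usual argument for preferring it over an explicit SVD on accelerators.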

2509.22468 2026-06-03 cs.LG cs.AI 版本更新

Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining

学习邻域:无对比的多模态自监督分子图预训练

Boshra Ariguib, Mathias Niepert, Andrei Manolache

Affiliations * University of Tübingen

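AI summary Proposes the C-FREE framework, which predicts subgraph embeddings from their complementary neighborhoods and fuses 2D topology with 3D conformer information for contrast-free, negative-free multimodal self-supervised pretraining on molecular graphs, achieving state-of-the-art results on MoleculeNet.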

Comments Accepted at ICML 2026

详情
AI中文摘要

高质量的分子表示对于性质预测和分子设计至关重要,然而大型标注数据集仍然稀缺。尽管分子图上的自监督预训练已显示出潜力,但许多现有方法要么依赖于手工数据增强或复杂的生成目标,要么仅利用2D拓扑,导致宝贵的3D结构信息未被充分利用。为弥补这一空白,我们引入了C-FREE(基于自我网络的无需对比的表示学习),一个将2D图与3D构象集成在一起的简单框架。C-FREE通过从潜在空间中互补邻域预测子图嵌入来学习分子表示,使用固定半径的自我网络作为不同构象之间的建模单元。这种设计使我们能够在混合图神经网络(GNN)-Transformer骨干中整合几何和拓扑信息,无需负样本、位置编码或昂贵的预处理。在提供丰富3D构象多样性的GEOM数据集上进行预训练后,C-FREE在MoleculeNet上取得了最先进的结果,超越了对比、生成和其他多模态自监督方法。在具有不同规模和分子类型的数据集上进行微调进一步表明,预训练能有效迁移到新的化学领域,突显了3D信息分子表示的重要性。

英文摘要

High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.

2509.19305 2026-06-03 cs.LG cs.AI eess.SP 版本更新

Wavelet Fourier Diffuser: Frequency-Aware Diffusion Model for Reinforcement Learning

小波傅里叶扩散器:用于强化学习的频率感知扩散模型

Yifu Luo, Yongzhe Chang, Xueqian Wang

Affiliations * Tsinghua University, China

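AI summary Observes that existing diffusion models for offline reinforcement learning ignore frequency-domain features and therefore suffer from frequency shift, and proposes WFDiffuser, which decomposes trajectories with the Discrete Wavelet Transform and strengthens frequency-domain modeling with the Short-Time Fourier Transform and cross attention; on the D4RL benchmark it mitigates frequency shift and improves trajectory stability and decision-making performance.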

Comments IJCNN 2025

详情
Journal ref
IJCNN 2025
AI中文摘要

扩散概率模型通过直接建模轨迹序列,在离线强化学习中展现出显著潜力。然而,现有方法主要关注时域特征而忽略频域特征,根据我们的观察,这会导致频率偏移和性能下降。在本文中,我们从频域的新视角研究强化学习问题。我们首先观察到,仅使用时域的方法会无意中引入频域低频分量的偏移,从而导致轨迹不稳定和性能下降。为了解决这个问题,我们提出了小波傅里叶扩散器(WFDiffuser),一种新颖的基于扩散的强化学习框架,它集成了离散小波变换将轨迹分解为低频和高频分量。为了进一步增强每个分量的扩散建模,WFDiffuser采用短时傅里叶变换和交叉注意力机制来提取频域特征并促进跨频率交互。在D4RL基准上的大量实验结果表明,WFDiffuser有效缓解了频率偏移,从而产生更平滑、更稳定的轨迹,并相比现有方法提高了决策性能。

英文摘要

Diffusion probabilistic models have shown significant promise in offline reinforcement learning (RL) by directly modeling trajectory sequences. However, existing approaches focus primarily on time-domain features and overlook frequency-domain features, which, according to our observations, leads to frequency shift and degraded performance. In this paper, we investigate the RL problem from the new perspective of the frequency domain. We first observe that time-domain-only approaches inadvertently shift the low-frequency components of trajectories, which results in trajectory instability and degraded performance. To address this issue, we propose Wavelet Fourier Diffuser (WFDiffuser), a novel diffusion-based RL framework that uses the Discrete Wavelet Transform to decompose trajectories into low- and high-frequency components. To further enhance diffusion modeling of each component, WFDiffuser employs the Short-Time Fourier Transform and cross-attention mechanisms to extract frequency-domain features and facilitate cross-frequency interaction. Extensive experimental results on the D4RL benchmark demonstrate that WFDiffuser effectively mitigates frequency shift, leading to smoother, more stable trajectories and improved decision-making performance over existing methods.
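
The decomposition into low- and high-frequency components can be illustrated with a one-level Haar wavelet transform. This is a minimal sketch under stated assumptions: the Haar wavelet, the single decomposition level, and the example trajectory are all chosen here for illustration, since the abstract does not say which wavelet or how many levels WFDiffuser uses.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT of an even-length sequence.

    Returns (approx, detail): scaled pairwise sums (low-frequency part) and
    scaled pairwise differences (high-frequency part).
    """
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, recovering the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


# Made-up 1-D trajectory: a slow ramp plus a small alternating wiggle.
trajectory = [0.0, 1.2, 2.0, 3.2, 4.0, 5.2, 6.0, 7.2]
low, high = haar_dwt(trajectory)     # ramp ends up in `low`, wiggle in `high`
restored = haar_idwt(low, high)      # the transform is invertible
```

Because the transform is invertible, components modeled separately can be recombined into a full trajectory without loss.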

2509.11323 2026-06-03 cs.CV cs.AI 版本更新

Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding

基于语义无关编码的KalmanNet多目标跟踪运动估计

Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long

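AI summary Proposes Semantic-Independent KalmanNet (SIKNet), which improves motion estimation with a Semantic-Independent Encoder (SIE) and is more robust and accurate in MOT than the traditional Kalman filter and existing learning-aided filters.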

详情
AI中文摘要

运动估计是多目标跟踪(MOT)中的关键组成部分。它通过分析连续帧图像中物体位置的变化来预测物体的轨迹,减少跟踪失败和身份切换。基于线性恒速模型的卡尔曼滤波器(KF)是MOT中最常用的方法之一。然而,当KF参数不匹配且物体非平稳运动时,可能产生不理想的结果。在这项工作中,我们利用学习辅助滤波器来处理MOT的运动估计。具体地,我们提出了一种名为语义无关KalmanNet(SIKNet)的新方法,该方法通过两步使用语义无关编码器(SIE)对状态向量(输入特征)进行编码。首先,SIE使用核大小为1的一维卷积,该卷积沿不同状态向量中同语义元素维度进行卷积,以编码独立的语义信息。然后,它采用全连接层和非线性激活层来编码异语义元素之间的非线性和交叉依赖信息。为了独立评估MOT中运动估计模块的性能,我们从几个开源MOT数据集构建了一个大规模半模拟数据集。实验结果表明,所提出的SIKNet优于传统KF,并且比现有的学习辅助滤波器具有更好的鲁棒性和准确性。代码可在(https://github.com/SongJgit/filternet 和 https://github.com/SongJgit/TBDTracker)获取。

英文摘要

Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectories of objects by analyzing the changes in their positions across consecutive image frames, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects exhibit non-stationary motion. In this work, we use a learning-aided filter to handle motion estimation in MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) in two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves better robustness and accuracy than existing learning-aided filters. The code is available at https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker.
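
For reference, the baseline that the abstract compares against, a Kalman filter with a linear constant-velocity model, can be sketched in a few lines. This is a simplified one-dimensional version with state [position, velocity] and position-only measurements; the noise values `q` and `r`, the prior, and the measurements are illustrative assumptions, and the learned SIKNet components are not shown.

```python
def kf_step(x, p, z, dt=1.0, q=0.01, r=1.0):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    x = [position, velocity], p = 2x2 covariance, z = measured position.
    q (process noise) and r (measurement noise) are illustrative values.
    """
    # Predict with F = [[1, dt], [0, 1]]: x <- F x, P <- F P F^T + q * I.
    pos, vel = x[0] + dt * x[1], x[1]
    p00 = p[0][0] + dt * (p[0][1] + p[1][0]) + dt * dt * p[1][1] + q
    p01 = p[0][1] + dt * p[1][1]
    p10 = p[1][0] + dt * p[1][1]
    p11 = p[1][1] + q
    # Update with H = [1, 0]: only the position is observed.
    s = p00 + r                      # innovation variance
    k0, k1 = p00 / s, p10 / s        # Kalman gain
    y = z - pos                      # innovation
    x_new = [pos + k0 * y, vel + k1 * y]
    p_new = [
        [(1 - k0) * p00, (1 - k0) * p01],
        [p10 - k1 * p00, p11 - k1 * p01],
    ]
    return x_new, p_new


# Made-up track: an object moving one unit per frame, observed without noise.
x, p = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]   # vague prior on both states
for t in range(20):
    x, p = kf_step(x, p, z=float(t))
```

With fixed `q` and `r`, the filter does well only when those values match the true motion, which is the mismatch problem the abstract raises.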

2508.13174 2026-06-03 cs.AI cs.LG q-fin.CP stat.ML 版本更新

AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining

AlphaEval:一个全面高效的公式化Alpha挖掘评估框架

Hongjun Ding, Binqi Chen, Jinsheng Huang, Taian Guo, Zhengyang Mao, Guoyi Shao, Lutong Zou, Luchen Liu, Ming Zhang

Affiliations * CUNY Baruch College; Peking University; Harvard University; Zhengren Research; Zhengren Quant

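AI summary Proposes the AlphaEval framework, which evaluates automated alpha mining models along five dimensions (predictive power, stability, robustness, financial logic, and diversity) in a unified, parallelizable, and backtest-free way, achieving evaluation consistency comparable to backtesting with higher efficiency.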

Comments Accepted by KDD2026

详情
AI中文摘要

公式化Alpha挖掘从金融数据中生成预测信号,对量化投资至关重要。尽管遗传编程、强化学习和大语言模型等多种算法方法显著扩展了Alpha发现的能力,但系统评估仍是一个关键挑战。现有评估指标主要包括回测和基于相关性的度量。回测计算密集、本质上是顺序的,并且对特定策略参数敏感。基于相关性的度量虽然高效,但仅评估预测能力,忽略了时间稳定性、鲁棒性、多样性和可解释性等其他关键属性。此外,大多数现有Alpha挖掘模型的闭源性质阻碍了可重复性并减缓了该领域的进展。为解决这些问题,我们提出了AlphaEval,一个统一、可并行化且无需回测的自动Alpha挖掘模型评估框架。AlphaEval沿五个互补维度评估生成Alpha的整体质量:预测能力、稳定性、对市场扰动的鲁棒性、金融逻辑和多样性。跨代表性Alpha挖掘算法的广泛实验表明,AlphaEval实现了与全面回测相当的评估一致性,同时提供更全面的洞察和更高的效率。此外,与传统的单一指标筛选方法相比,AlphaEval能有效识别更优的Alpha。所有实现和评估工具均已开源,以促进可重复性和社区参与。

英文摘要

Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches, such as genetic programming, reinforcement learning, and large language models, have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures. Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters. Correlation-based metrics, though efficient, assess only predictive ability and overlook other crucial properties such as temporal stability, robustness, diversity, and interpretability. Additionally, the closed-source nature of most existing alpha mining models hinders reproducibility and slows progress in this field. To address these issues, we propose AlphaEval, a unified, parallelizable, and backtest-free evaluation framework for automated alpha mining models. AlphaEval assesses the overall quality of generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. Extensive experiments across representative alpha mining algorithms demonstrate that AlphaEval achieves evaluation consistency comparable to comprehensive backtesting, while providing more comprehensive insights and higher efficiency. Furthermore, AlphaEval effectively identifies superior alphas compared to traditional single-metric screening approaches. All implementations and evaluation tools are open-sourced to promote reproducibility and community engagement.
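
A typical correlation-based measure of the kind the abstract mentions is the rank information coefficient: the Spearman correlation between an alpha's scores and the next-period returns across assets. The sketch below is an assumption about what such a metric looks like, not AlphaEval's implementation; the abstract does not list its concrete metrics, and the data are made up.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def rank_ic(alpha_values, next_returns):
    """Spearman rank correlation between alpha scores and next-period returns."""
    return pearson(ranks(alpha_values), ranks(next_returns))


# Made-up cross-section of five assets on one day.
alpha = [0.3, -1.2, 0.8, 0.1, -0.4]
ret = [0.01, -0.03, 0.02, 0.015, -0.01]
ic = rank_ic(alpha, ret)   # 0.9 for this toy data
```

Each day's value is computed independently of the others, so such a metric parallelizes trivially, but it says nothing about stability, robustness, or diversity.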

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

CoMPAS3D: 一个用于交互动作的数据集和基准

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

Affiliations * School of Computing Science, Simon Fraser University

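AI summary Presents the CoMPAS3D dataset and evaluation framework, which use objective metrics such as move legibility and proficiency appropriateness to address the lack of social-context evaluation in interactive motion generation.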

Comments https://rosielab.github.io/compas3d

详情
AI中文摘要

社交互动型人形机器人必须通过身体与人类互动,实时适应伙伴的动作、意图和能力。这需要模型不仅理解身体如何移动,还要理解在共享社交背景下动作的含义。然而,交互式动作生成的评估框架并未衡量生成的动作是否在共享动作词汇中可读,也不评估其是否适合伙伴的熟练水平。这一差距有两个原因:现有框架依赖运动学指标(如FID和节拍对齐),无法衡量上述特性;现有数据集缺乏动作标注和熟练度变化。萨尔萨舞作为评估领域很合适:即兴、双人、由动作词汇和评判标准(涵盖时机、音乐性、技巧、难度、配合和原创性)指导。我们提出CoMPAS3D,一个即兴双人萨尔萨舞的动作捕捉数据集,附带评估框架,涵盖运动学质量、两个客观指标(动作可读性和熟练度适当性)以及六个基于竞赛的主观维度。数据集包含18名舞者(涵盖初级、中级和高级水平)的3小时即兴表演,超过2800个专家标注片段,涵盖动作类型、错误和风格元素。我们定义了三个基准:动作分类(类似于转录)、熟练度估计(流利度评估)和跟随者生成(对话响应)。微调的视觉语言模型在应用于真实动作序列的客观指标上表现强劲。应用于Duolando和InterGen时,这些指标揭示了运动学指标遗漏的失败。人工评估确认了生成动作与真实动作之间的差距。CoMPAS3D、标注、基准代码和基线结果公开可用。

英文摘要

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

2506.21129 2026-06-03 cs.LG cs.AI 版本更新

Curriculum-Adapted Robust Reinforcement Learning for UAV Deconfliction in Adversarial Environments

对抗环境中无人机冲突消解的课程自适应鲁棒强化学习

Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo

Affiliations * Faculty of Engineering and Applied Sciences, Cranfield University

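AI summary Proposes a curriculum-guided adaptation framework that progressively exposes the policy to gradient-based adversarial observation perturbations while aligning temporal-difference error distributions, improving UAV robustness and generalization under GNSS spoofing attacks.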

详情
AI中文摘要

自主无人机(UAV)越来越依赖强化学习(RL)进行导航。然而,全球导航卫星系统(GNSS)欺骗攻击可能导致分布外观测偏移,破坏价值估计并降低任务性能。现有的鲁棒RL方法通常能提高对特定攻击模型的抵抗力,但往往无法泛化到训练中未遇到的攻击。为解决这一局限,我们提出一种课程引导的适应框架,该框架逐步将鲁棒策略暴露于强度递增的基于梯度的对抗观测扰动,同时对齐课程阶段间的时序差分(TD)误差分布。所提出的方法不是适应特定的攻击模型,而是保持TD误差一致性以促进跨攻击条件的可迁移性。我们进一步推导了一个TD空间泛化保证,表明如果测试时攻击引起的TD误差分布与最终课程阶段的分布足够接近,则由此产生的性能退化是有界的。该框架在具有动态3D障碍物的无人机冲突消解环境中进行评估,面对之前未见过的固定和动态GNSS欺骗攻击。在固定欺骗条件下,课程适应策略实现了近乎完美的任务成功率,而标准和鲁棒RL基线为20-56%。在动态障碍物引诱欺骗攻击下,它获得了最高的情节奖励,同时随着空中交通密度的增加,任务完成步骤最多减少了45%。

英文摘要

Autonomous unmanned aerial vehicles (UAVs) increasingly rely on reinforcement learning (RL) for navigation. However, global navigation satellite system (GNSS) spoofing attacks can induce out-of-distribution observation shifts that corrupt value estimation and degrade mission performance. Existing robust RL approaches typically improve resilience against specific attack models but often fail to generalize to attacks not encountered during training. To address this limitation, we propose a curriculum-guided adaptation framework that progressively exposes a robust policy to gradient-based adversarial observation perturbations of increasing intensity while aligning temporal-difference (TD) error distributions across curriculum stages. Rather than adapting to a particular attack model, the proposed approach preserves TD-error consistency to promote transferability across attack conditions. We further derive a TD-space generalization certificate showing that if the TD-error distribution induced by a test-time attack remains sufficiently close to that of the final curriculum stage, the resulting performance degradation is bounded. The framework is evaluated in a UAV deconfliction environment with dynamic 3D obstacles under previously unseen fixed and dynamic GNSS spoofing attacks. Under fixed spoofing conditions, the curriculum-adapted policy achieved near-perfect mission success rates, compared with 20-56% for standard and robust RL baselines. Under dynamic obstacle-luring spoofing attacks, it achieved the highest episodic rewards while reducing mission completion steps by up to 45% across increasing aerial traffic densities.
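
The quantity being aligned across curriculum stages is the temporal-difference error. The sketch below computes one-step TD errors and compares two sets of them with a one-dimensional Wasserstein distance. The distance is an assumption chosen for illustration, since the abstract does not say how the distributions are aligned, and the transitions and value estimates are made up.

```python
def td_error(reward, gamma, v_next, v_curr):
    """One-step temporal-difference error: r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_curr


def wasserstein_1d(sample_a, sample_b):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size empirical samples this is the mean absolute difference
    between the sorted values.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


GAMMA = 0.9  # illustrative discount factor

# Made-up (reward, V(s'), V(s)) triples from two hypothetical curriculum stages.
stage_easy = [(1.0, 2.0, 2.5), (0.0, 1.0, 1.0), (1.0, 0.0, 1.5)]
stage_hard = [(1.0, 2.0, 3.0), (0.0, 1.0, 1.4), (1.0, 0.0, 2.0)]

td_easy = [td_error(r, GAMMA, vn, vc) for r, vn, vc in stage_easy]
td_hard = [td_error(r, GAMMA, vn, vc) for r, vn, vc in stage_hard]
gap = wasserstein_1d(td_easy, td_hard)   # small gap = similar TD-error distributions
```

A gap like this could serve as a regularizer or a monitoring signal between stages; either use is a guess at the design rather than a description of the paper.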

2506.01969 2026-06-03 cs.DC cs.AI cs.LG 版本更新

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs

FlashMLA-ETAP:用于加速NVIDIA H20 GPU上MLA推理的高效转置注意力流水线

Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong

Affiliations * Tencent; College of Computer Science and Software Engineering, Shenzhen University; College of Artificial Intelligence, Shenzhen Polytechnic University

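AI summary Targets inefficient Multi-Head Latent Attention (MLA) inference when deploying the DeepSeek-R1 671B model on a single multi-GPU server; proposes the FlashMLA-ETAP framework, which reconfigures attention computation with the Efficient Transpose Attention Pipeline (ETAP) to achieve a 2.78x speedup over FlashMLA on NVIDIA H20 GPUs while maintaining numerical stability.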

Comments Accepted by ICONIP2025

详情
AI中文摘要

多头潜在注意力(MLA)的高效推理面临在单台多GPU服务器上部署DeepSeek-R1 671B模型的挑战。本文介绍FlashMLA-ETAP,一种新颖的框架,用于增强NVIDIA H20 GPU上单实例部署场景的MLA推理。我们提出了高效转置注意力流水线(ETAP),通过转置重新配置注意力计算,使KV上下文长度与WGMMA操作中的\(M\)维度对齐,显著减少冗余计算。FlashMLA-ETAP在64K序列长度(批大小16)下比FlashMLA加速2.78倍,比FlashAttention-3和FlashInfer分别提升5.24倍和4.94倍,同时保持数值稳定性,均方根误差(RMSE)比FlashAttention-3低15.2倍(\(1.25 \times 10^{-5}\))。此外,ETAP的设计能够无缝集成到FlashAttention-3和FlashInfer等框架中,并有详细的理论分析支持。我们的工作解决了资源受限推理中的一个关键空白,为中端GPU提供了可扩展的解决方案,并为硬件感知优化的更广泛采用铺平了道路。代码可在https://github.com/pengcuo/FlashMLA-ETAP获取。

英文摘要

Deploying the DeepSeek-R1 671B model on a single multi-GPU server makes efficient inference with Multi-Head Latent Attention (MLA) challenging. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
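
The transposition idea can be written out as a short identity. The notation is assumed here for illustration and is not taken from the paper: a single head with query block \(Q \in \mathbb{R}^{m \times d}\), keys \(K \in \mathbb{R}^{n \times d}\), and values \(V \in \mathbb{R}^{n \times d_v}\), where \(m\) is the number of query rows and \(n\) is the KV context length. Standard attention computes

\[ S = \frac{Q K^{\top}}{\sqrt{d}} \in \mathbb{R}^{m \times n}, \qquad P = \operatorname{softmax}_{\text{row}}(S), \qquad O = P V \in \mathbb{R}^{m \times d_v}. \]

Transposing every product gives the same result in transposed layout:

\[ S^{\top} = \frac{K Q^{\top}}{\sqrt{d}} \in \mathbb{R}^{n \times m}, \qquad O^{\top} = V^{\top} P^{\top} \in \mathbb{R}^{d_v \times m}. \]

In the score product \(K Q^{\top}\), the KV length \(n\) now indexes the rows of the output, which is the \(M\)-dimension of the matrix multiply, while the typically small \(m\) moves to the column dimension. The softmax is then normalized over the row index of \(S^{\top}\) instead of the column index. How ETAP schedules these products on WGMMA instructions, and how MLA's latent projections enter, is not covered by the abstract.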

2506.03087 2026-06-03 cs.LG cs.AI 版本更新

Do Explanations Increase the Risk of Decision Logic Leakage? Explanation-Guided Stealing of Graph Models

解释是否会增加决策逻辑泄露的风险?解释引导的图模型窃取

Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Xiamen University; The Pennsylvania State University

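AI summary Studies the risk that explanation mechanisms leak the decision logic of graph neural networks and proposes a model-stealing framework that combines explanation alignment with guided data augmentation; experiments show advantages over conventional methods.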

详情
AI中文摘要

图神经网络(GNNs)已成为药物发现和金融分析等领域中分析图结构数据的重要工具,导致对模型透明度的需求日益增长。可解释GNNs的最新进展通过揭示影响预测的重要子图满足了这一需求,但这些解释机制可能无意中使这些模型面临安全风险。本文研究了此类解释如何潜在泄露可被利用进行模型窃取的关键决策逻辑。我们提出了{\method},一种新颖的窃取框架,它将用于捕获决策逻辑的解释对齐与用于在有限查询下高效训练的引导数据增强相结合,从而能够有效复制目标模型的预测行为和底层推理模式。在分子图数据集上的实验表明,我们的方法在模型窃取方面优于传统方法。这项工作突出了在敏感领域部署可解释GNNs时的重要安全考虑,并表明需要针对基于解释的攻击采取保护措施。我们的代码可在https://github.com/beanmah/EGSteal获取。

英文摘要

Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to a growing demand for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose these models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.

2505.20853 2026-06-03 cs.LG cs.AI 版本更新

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

专家合作:大间隔融合异构信息

Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang


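AI summary Proposes the Cooperation of Experts (CoE) framework, which fuses heterogeneous information through a large margin mechanism and encodes multi-typed data into unified heterogeneous multiplex networks for robust, complementary knowledge extraction.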

Comments Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

详情
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63169-63185, 2025
AI中文摘要

融合异构信息仍然是现代数据分析中的一个持续挑战。尽管已取得显著进展,但现有方法往往未能考虑对象模式在不同语义空间中的固有异质性。为解决这一局限性,我们提出了专家合作(CoE)框架,该框架将多类型信息编码到统一的异构多路网络中。通过克服模态和连接差异,CoE为捕捉现实世界复杂数据的复杂结构提供了一个强大且灵活的模型。在我们的框架中,专用编码器充当领域特定专家,每个专家专门学习特定语义空间中的不同关系模式。为了增强鲁棒性并提取互补知识,这些专家通过一种新颖的大间隔机制进行协作,该机制由定制的优化策略支持。严格的理论分析保证了框架的可行性和稳定性,而跨多种基准的广泛实验证明了其优越的性能和广泛的适用性。我们的代码可在 https://github.com/strangeAlan/CoE 获取。

英文摘要

Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework's feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at https://github.com/strangeAlan/CoE.

2502.08006 2026-06-03 cs.LG cs.AI stat.ML 版本更新

Greed is Good: A Unifying Perspective on Guided Generation

贪婪即美德:引导生成的统一视角

Zander W. Blasingame, Chen Liu

AI总结 本文通过将后验引导视为端到端引导的贪婪策略,统一了两种梯度引导方法,并提出了在计算与精度之间权衡的插值方法,在逆图像问题和分子生成任务上验证了有效性。

Comments Accepted at NeurIPS 2025

详情
AI中文摘要

无训练引导生成是一种广泛使用且强大的技术,允许最终用户对流/扩散模型的生成过程施加进一步控制。一般来说,针对基于梯度的引导,已经出现了两种技术系列:即后验引导(即通过目标预测模型将当前样本投影到目标分布进行引导)和端到端引导(即通过在整个ODE求解过程中执行反向传播进行引导)。在这项工作中,我们表明这两个看似分离的系列实际上可以通过将后验引导视为端到端引导的贪婪策略来统一。我们探索了这两个系列之间的理论联系,并深入分析了这两种技术相对于连续理想梯度的关系。基于这一分析,我们提出了一种在这两个系列之间插值的方法,从而在引导梯度的计算与精度之间实现权衡。然后,我们在几个逆图像问题和性质引导的分子生成任务上验证了这项工作。

英文摘要

Training-free guided generation is a widely used and powerful technique that allows the end user to exert further control over the generative process of flow/diffusion models. Generally speaking, two families of techniques have emerged for solving this problem for gradient-based guidance: namely, posterior guidance (i.e., guidance via projecting the current sample to the target distribution via the target prediction model) and end-to-end guidance (i.e., guidance by performing backpropagation throughout the entire ODE solve). In this work, we show that these two seemingly separate families can actually be unified by looking at posterior guidance as a greedy strategy of end-to-end guidance. We explore the theoretical connections between these two families and provide an in-depth theoretical analysis of these two techniques relative to the continuous ideal gradients. Motivated by this analysis, we then show a method for interpolating between these two families, enabling a trade-off between compute and accuracy of the guidance gradients. We then validate this work on several inverse image problems and property-guided molecular generation.

2412.17484 2026-06-03 cs.DC cs.AI 版本更新

Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

面向GPU数据中心的功耗与碎片感知在线调度

Francesco Lettich, Emanuele Carlini, Franco Maria Nardini, Raffaele Perego, Salvatore Trani

发表机构 * Istituto di Scienza e Tecnologie dell’Informazione "Alessandro Faedo", Consiglio Nazionale delle Ricerche(“亚历山德罗·法埃多”信息科学与技术研究所,意大利国家研究委员会)

AI总结 针对GPU数据中心在线调度问题,提出PWR调度策略,结合碎片梯度下降(FGD)方法,在降低功耗和最小化GPU碎片之间取得平衡。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

人工智能和大语言模型的兴起推动了数据中心中GPU在复杂训练和推理任务中的使用增加,影响了大规模计算基础设施的运营成本、能源需求和环境足迹。本文解决了GPU数据中心中的在线调度问题,即在不知道任务未来到达时间的情况下进行调度。我们关注两个目标:最小化GPU碎片和降低功耗。当数据中心接近满容量时,部分GPU分配会阻碍剩余资源的有效利用,从而产生GPU碎片。最近的调度策略FGD(碎片梯度下降)利用碎片度量来解决这个问题。由于GPU的功耗需求巨大,降低功耗也至关重要。为此,我们提出了PWR,一种新颖的调度策略,通过选择功耗高效的GPU和CPU组合来最小化功耗。这涉及到一个简化的功耗测量模型,该模型集成到Kubernetes评分插件中。通过在模拟集群中的广泛实验评估,我们展示了PWR与FGD结合时,如何在降低功耗和最小化GPU碎片之间实现平衡的权衡。

英文摘要

The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.

2412.01282 2026-06-03 cs.CV cs.AI 版本更新

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Align-KD:蒸馏跨模态对齐知识以增强移动视觉语言模型

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China(通用人工智能国家重点实验室,智能科学与技术学院,北京大学,中国) Huawei Noah’s Ark Lab, China(华为诺亚方舟实验室,中国)

AI总结 提出Align-KD方法,通过蒸馏教师模型浅层跨模态对齐知识,指导1.7B学生模型学习视觉-文本匹配,在6个基准上平均提升2.0分。

Comments CVPR 2025 Paper

详情
AI中文摘要

视觉语言模型(VLM)为多模态任务带来了强大的理解和推理能力。同时,移动设备对强大人工智能的需求也日益增长,例如AI助手软件。一些工作试图将VLM迁移到边缘设备以扩展其应用范围。简化模型结构是一种常见方法,但随着模型缩小,性能与大小之间的权衡变得越来越困难。知识蒸馏(KD)可以帮助模型在不增加大小或数据量的情况下提升综合能力。然而,现有的大模型蒸馏技术大多只考虑单模态LLM的应用,或者仅使用教师为学生创建新的数据环境。这些方法都没有考虑VLM中最重要的跨模态对齐知识的蒸馏。我们提出了一种名为Align-KD的方法,引导学生模型学习发生在浅层的跨模态匹配。教师还帮助学生基于文本的关注点学习将视觉标记投影到文本嵌入空间。在Align-KD的指导下,1.7B的MobileVLM V2模型能够从7B教师模型中学习丰富的知识,且训练损失设计轻量,在两个训练子集上分别在6个基准上平均得分提升2.0。代码地址:https://github.com/fqhank/Align-KD。

英文摘要

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, there is a growing need for capable artificial intelligence on mobile devices, for example in AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most of the existing large model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods take into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn to project vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieves an average score improvement of 2.0 across 6 benchmarks under each of two training subsets. Code is available at: https://github.com/fqhank/Align-KD.

2409.08958 2026-06-03 cs.LG cs.AI physics.comp-ph physics.flu-dyn 版本更新

PINNfluence: Interpreting PINNs through Influence Functions

PINNfluence: 通过影响函数解释 PINN

Aleksander Krasowski, Jonas R. Naujoks, Moritz Weckbecker, Galip Ü. Yolcu, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, René P. Klausen

发表机构 * Technical University of Munich(慕尼黑技术大学) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) University of Tübingen(图宾根大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出 PINNfluence 框架,基于影响函数对物理信息神经网络进行训练数据归因,实现预测、损失分量和训练数据点之间的细粒度归因,并通过基准实验区分训练好与差的 PINN 的结构特征。

Comments Accepted at ICML 2026

详情
AI中文摘要

物理信息神经网络(PINN)已成为物理科学中求解偏微分方程(PDE)的强大深度学习方法,但其行为在很大程度上仍然不透明,通常通过故障模式分析而非显式可解释性来理解。为了解决这个问题,我们引入了 PINNfluence,这是一个基于影响函数解释 PINN 的训练数据归因框架。通过将影响函数扩展到复合物理信息训练目标,我们实现了预测、损失分量和训练数据点之间的细粒度归因。通过跨各种 PDE 的基准实验,我们证明了影响模式提供了区分训练良好和训练不良的 PINN 结构特征的细粒度诊断。因此,PINNfluence 通过数据视角为理解和提高 PINN 的可靠性开辟了新途径。

英文摘要

Physics-informed neural networks (PINNs) have emerged as a powerful deep learning approach for solving partial differential equations (PDEs) in the physical sciences, yet their behavior remains largely opaque and is typically understood through failure mode analyses rather than explicit interpretability. To address this issue, we introduce PINNfluence, a training data attribution framework for interpreting PINNs based on influence functions. By extending influence functions to composite physics-informed training objectives, we enable fine-grained attribution between predictions, loss components, and training data points. Through benchmark experiments across various PDEs, we demonstrate that influence patterns provide granular diagnostics that distinguish structural characteristics across well-trained and poorly-trained PINNs. PINNfluence thus opens a new avenue for understanding and improving the reliability of PINNs through the lens of their data.

2410.14573 2026-06-03 cs.LG cs.AI 版本更新

Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability

在黑盒优化中建立信任:可解释性的综合框架

Nazanin Nezami, Hadis Anahideh

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出一套模型无关的指标IEMSO,通过采样核心、批次属性、优化过程和特征重要性四类指标,增强代理优化方法的透明性和可解释性。

详情
AI中文摘要

在受限评估预算内优化昂贵的黑盒函数在许多实际应用中面临重大挑战。代理优化(SO)是一种常见的解决方案,但其由代理模型和采样核心(例如采集函数)的复杂性引入的专有性质往往导致缺乏可解释性和透明度。尽管现有文献主要集中在增强对全局最优的收敛性,但新提出策略的实际解释仍未被充分探索,特别是在批量评估设置中。在本文中,我们提出了代理优化的包容性可解释性指标(IEMSO),这是一组全面的模型无关指标,旨在增强SO方法的透明度、可信度和可解释性。通过这些指标,我们在执行昂贵评估之前和之后为从业者提供中间和事后解释,以建立信任。我们考虑了四类主要指标,每类针对SO过程的特定方面:采样核心指标、批次属性指标、优化过程指标和特征重要性。我们的实验评估证明了所提指标在不同基准上的显著潜力。

英文摘要

Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common resolution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.

2407.18428 2026-06-03 cs.LG cs.AI cs.CV 版本更新

Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

加权风险不变性:不变特征偏移下的领域泛化

Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu

发表机构 * Johns Hopkins University(约翰霍普金斯大学) University of Maryland, College Park(马里兰大学学院公园分校) New York University(纽约大学) Center for Data Science(数据科学中心)

AI总结 针对不变协变量偏移下现有不变学习方法性能不佳的问题,提出加权风险不变性(WRI)框架,通过环境间损失的不变性并加权训练样本,在理论上保证学习到不变模型,并在实验中优于先前方法。

详情
Journal ref
TMLR 2024
AI中文摘要

学习预测在多个环境下不变的模型是一种有前景的分布外泛化方法。这类模型被训练来提取特征 $X_{\text{inv}}$,其中给定提取特征的条件分布 $Y \mid X_{\text{inv}}$ 在不同环境下不发生变化。不变模型还应能泛化到提取特征 $X_{\text{inv}}$ 的边缘分布 $p(X_{\text{inv}})$ 的偏移,这种偏移称为“不变协变量偏移”。然而,我们表明,现有学习不变模型的方法在不变协变量偏移下表现不佳,要么无法学习到不变模型(即使对于从简单且经过充分研究的线性-高斯模型生成的数据也是如此),要么有限样本性能较差。为了解决这些问题,我们提出“加权风险不变性”(WRI)。我们的框架基于对训练样本进行适当加权,强制要求损失在不同环境下保持不变。我们证明,在线性-高斯设置下,WRI 可证明地学习到不变模型,即丢弃虚假相关性。我们提出了一种实用算法,通过同时学习密度 $p(X_{\text{inv}})$ 和模型参数来实现 WRI,并且实验表明,在不变协变量偏移下,WRI 优于先前的不变学习方法。

英文摘要

Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an invariant covariate shift. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models (even for data generated from simple and well-studied linear-Gaussian models) or having poor finite-sample performance. To alleviate these problems, we propose weighted risk invariance (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.
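As a rough illustration of why reweighting can make risks comparable across environments, the following sketch writes one plausible form of a weighted risk. The symbols $\ell$ (loss), $f$ (predictor), $p^{e}$ (distribution in environment $e$) and $R^{e_i,e_j}$ are notation assumed here for illustration and are not given in the abstract. The sketch further assumes that $f$ depends on the input only through $X_{\text{inv}}$ and that $Y \mid X_{\text{inv}}$ is shared across environments.

```latex
% Risk in environment e_i, reweighted by the invariant-feature density of e_j (assumed notation)
R^{e_i,e_j}(f) \;=\; \mathbb{E}_{(X,Y)\sim p^{e_i}}
  \Big[\, \ell\big(f(X_{\text{inv}}),Y\big)\; p^{e_j}(X_{\text{inv}}) \Big]

% With a shared conditional Y | X_inv, the expression is symmetric in (e_i, e_j):
R^{e_i,e_j}(f)
  \;=\; \int p^{e_i}(x)\, p^{e_j}(x)\;
        \mathbb{E}\big[\ell(f(x),Y) \,\big|\, X_{\text{inv}} = x\big]\, dx
  \;=\; R^{e_j,e_i}(f)
```

Under these assumptions the product $p^{e_i}(x)\,p^{e_j}(x)$ cancels the effect of a shift in the marginal of $X_{\text{inv}}$, which is the intuition behind comparing reweighted rather than raw risks.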

2407.11821 2026-06-03 cs.AI 版本更新

Approximating Probabilistic Inference in Statistical EL with Knowledge Graph Embeddings

使用知识图谱嵌入近似统计EL中的概率推理

Yuqicheng Zhu, Nico Potyka, Bo Xiong, Trung-Kien Tran, Mojtaba Nayyeri, Evgeny Kharlamov, Steffen Staab

发表机构 * Bosch Center for AI(博世人工智能中心) University of Stuttgart(斯图加特大学) Cardiff University(卡迪夫大学) Stanford University(斯坦福大学) University of Oslo(奥斯陆大学) University of Southampton(南安普顿大学)

AI总结 本文提出利用知识图谱嵌入高效近似统计EL中的概率推理,并提供了运行时和正确性保证的理论证明及实验评估。

Comments Accepted at UAI 2026

详情
AI中文摘要

统计信息无处不在,但从中得出有效结论却极其困难。我们以统计EL(SEL)为例,解释了如何使用知识图谱嵌入来高效近似概率推理,SEL是轻量级描述逻辑EL的统计扩展。我们提供了运行时和正确性保证的证明,并通过实验评估了我们方法的运行时和近似质量。

英文摘要

Statistical information is ubiquitous but drawing valid conclusions from it is prohibitively hard. We explain how knowledge graph embeddings can be used to approximate probabilistic inference efficiently using the example of Statistical EL (SEL), a statistical extension of the lightweight Description Logic EL. We provide proofs for runtime and soundness guarantees, and empirically evaluate the runtime and approximation quality of our approach.

2403.19883 2026-06-03 cs.AI 版本更新

Planning with Uncertainty: Symmetries, Policy Inference, and Solution Compression

不确定性规划:对称性、策略推断与解的压缩

Frederico Messa, André Grahl Pereira

发表机构 * INF/UFRGS(乌尔巴诺-弗兰西斯科·里格尔大学信息学院)

AI总结 本文提出基于显式最佳优先策略空间搜索的FOND规划方法,通过定义策略等价关系、利用群论计算状态对称性、多项式时间策略推断以及整数规划实现部分状态策略压缩,显著提升求解效率。

详情
AI中文摘要

完全可观测非确定性(FOND)规划是人工智能不确定性规划的核心。它通过具有非确定性效果的动作来建模不确定性。在这项工作中,我们提出了一系列技术,将显式最佳优先策略空间搜索建立为一种与当前最先进方法相竞争的方法,用于解决FOND规划任务。我们研究了如何定义策略之间的等价关系,从而允许剪枝部分搜索空间。我们展示了可以使用群论技术有效计算状态之间的规范对称性。我们还提出了两项超越策略空间搜索的贡献:一个过程,在给定策略域集规范的情况下,能在多项式时间内推断出解策略函数;以及一个整数规划公式化过程,给定一个定义在完整状态上的解策略,能产生一组资源高效的模型,这些模型能够找到以最少部分状态无歧义地表示该策略的部分状态策略。

英文摘要

Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. In this work, we present a collection of techniques that establish explicit best-first policy-space search as a method competitive with the state of the art for solving FOND planning tasks. We study how to define equivalence relations between policies, allowing part of the search space to be pruned. We show it is possible to use group theory techniques to effectively compute canonical symmetries between states. We also present two contributions that go beyond just policy-space search: we present a procedure that infers in polynomial time a solution policy function given just the specification of its domain set, and an integer-programming formulation procedure that, given a solution policy defined over complete states, yields a set of resource-efficient models that are capable of finding a partial-state policy that represents it unambiguously with the fewest partial states possible.

2303.15619 2026-06-03 cs.CL cs.AI 版本更新

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

Typhoon: 面向预训练语言模型的有效任务特定掩码策略

Muhammed Shahir Abdurrahman, Hashem Elezabi, Bruce Changlong Xu

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系)

AI总结 本文提出Typhoon,一种基于任务损失梯度的自适应掩码策略,在GLUE任务上对比随机掩码和整词掩码,经严格评估发现无显著优势。

详情
AI中文摘要

在掩码语言建模(MLM)中,选择哪些token进行掩码是一个核心但未被充分研究的设计决策。标准预训练随机均匀掩码token,但多项研究表明,更具信息性的掩码目标可以提升下游性能。我们将掩码视为微调流程中任务自适应的组件,并引入Typhoon,一种掩码策略,它利用任务损失相对于one-hot token输入的梯度来在线估计每种token类型对目标的贡献程度。Typhoon维护每个token类型显著性的指数移动平均,并将这些分数校准为掩码分布,在token独立性近似下,其期望掩码率与目标预算匹配。我们形式化了该方法,并在两个GLUE任务(MRPC和CoLA)上,针对三个BERT系列骨干网络(TinyBERT、DistilBERT和BERT-base)以及每个配置五个随机种子(总共90次训练运行),将其与随机掩码和整词掩码进行了评估。我们的主要发现是,一旦考虑了种子方差,没有哪种掩码策略在这些任务上可靠地优于其他策略:在MRPC上,Typhoon与最佳基线之间的差距保持在0.004 F1以内,所有十二次Typhoon比较中无配对检验达到显著性,且每个95%置信区间包含零。Typhoon在单次运行实验中的明显优势并未经受住这种更仔细的评估。我们将此视为一个警示性的、以可重复性为重点的结果——基于梯度的任务自适应掩码具有竞争力,但在此规模上并不明显优于无资源的随机掩码——我们描述了一个干净的现代重实现以支持后续工作。

英文摘要

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 $F_1$, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result: gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale. We also describe a clean modern reimplementation to support follow-up work.
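The two mechanisms named in the abstract, an exponential moving average of per-token-type saliency and a calibration step that matches a target masking budget, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `decay` and `floor` parameters, the bisection-based calibration, and the toy saliency values are all assumptions, and saliencies are assumed to be non-negative (for example gradient magnitudes).

```python
def update_ema(ema, observed, decay=0.9):
    """Blend newly observed per-token-type saliency into a running average (hypothetical helper)."""
    for token_type, saliency in observed.items():
        previous = ema.get(token_type, saliency)  # first observation initialises the average
        ema[token_type] = decay * previous + (1.0 - decay) * saliency
    return ema


def masking_probabilities(ema, tokens, budget=0.15, floor=1e-6, iters=60):
    """Scale saliencies into per-token masking probabilities whose mean equals `budget`.

    Assumes non-negative saliencies. A scale c is found by bisection so that
    mean(min(1, c * score)) matches the budget; tokens are treated as independent.
    """
    scores = [ema.get(tok, 0.0) + floor for tok in tokens]  # floor keeps unseen types maskable
    lo, hi = 0.0, 1.0 / min(scores)  # at `hi` every probability is clipped to 1
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        rate = sum(min(1.0, mid * s) for s in scores) / len(scores)
        if rate < budget:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * s) for s in scores]


# Toy usage with made-up saliency values for two fine-tuning steps.
ema = {}
update_ema(ema, {"not": 0.8, "the": 0.1, "film": 0.4})
update_ema(ema, {"not": 1.0, "the": 0.1, "good": 0.6})
tokens = ["the", "film", "is", "not", "good"]
probs = masking_probabilities(ema, tokens, budget=0.3)
```

In this sketch, token types with higher running saliency receive proportionally higher masking probability, while the average probability over the sequence stays at the requested budget.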

1301.3535 2026-06-03 eess.SY cs.AI cs.SY 版本更新

Airport Gate Scheduling for Passengers, Aircraft, and Operation

面向乘客、飞机和运营的机场登机口调度

Sang Hyun Kim, Eric Feron, John-Paul Clarke, Aude Marzuoli, Daniel Delahaye

AI总结 本文研究机场登机口调度问题,提出兼顾乘客、飞机和运营三个目标的平衡目标函数,以提升乘客体验、交通流效率和运营鲁棒性。

Comments This paper is submitted to the tenth USA/Europe ATM 2013 seminar

详情
AI中文摘要

乘客体验正成为评估航空运输系统性能的关键指标。需要高效且稳健的工具来处理机场运营,同时更好地理解乘客的兴趣和关切。在各种机场运营中,本文研究机场登机口调度以改善乘客体验。提出了三个目标,分别考虑乘客、飞机和运营。分析了这些目标之间的权衡,并提出了一个平衡目标函数。结果表明,平衡目标可以提高客运航站楼和停机坪交通流的效率,以及登机口运营的鲁棒性。

英文摘要

Passengers' experience is becoming a key metric to evaluate the air transportation system's performance. Efficient and robust tools to handle airport operations are needed along with a better understanding of passengers' interests and concerns. Among various airport operations, this paper studies airport gate scheduling for improved passengers' experience. Three objectives accounting for passengers, aircraft, and operation are presented. Trade-offs between these objectives are analyzed, and a balancing objective function is proposed. The results show that the balanced objective can improve the efficiency of traffic flow in passenger terminals and on ramps, as well as the robustness of gate operations.

2004.07506 2026-06-03 cs.LO cs.AI math.LO 版本更新

On Reductions of Hintikka Sets for Higher-Order Logic

关于高阶逻辑的Hintikka集归约

Alexander Steen, Christoph Benzmüller

AI总结 本文通过将Steen (2018)基于原始等式的Church类型论Hintikka集性质归约到Brown (2007)的Hintikka集性质,推导出Steen性质的一个模型存在定理。

Comments 10 pages; improved version

详情
Journal ref
Journal of Applied Logics - IfCoLog Journal of Logics and their Applications, Vol. 12(7), 2025, pp. 1813-1834
AI中文摘要

Steen (2018) 基于原始等式的Church类型论的Hintikka集性质被归约到Brown (2007)的Hintikka集性质。利用这一归约,推导出Steen性质的一个模型存在定理。

英文摘要

Steen's (2018) Hintikka set properties for Church's type theory based on primitive equality are reduced to the Hintikka set properties of Brown (2007). Using this reduction, a model existence theorem for Steen's properties is derived.

1208.4773 2026-06-03 eess.SY cs.AI cs.LG cs.SY 版本更新

Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy Search

优化前瞻树策略:连接前瞻树策略与直接策略搜索的桥梁

Tobias Jung, Louis Wehenkel, Damien Ernst, Francis Maes

AI总结 提出一种混合策略学习方案,通过直接策略搜索学习节点评分函数来指导小型前瞻树的构建,从而结合直接策略搜索和前瞻树策略的优势。

Comments In Submission

详情
AI中文摘要

直接策略搜索(DPS)和前瞻树(LT)策略是两类广泛使用的技术,用于为序列决策问题产生高性能策略。要使DPS方法有效工作,一个关键问题是针对目标问题选择合适的参数化策略空间。LT方法的一个基本问题是,为了做出好的决策,这类策略必须开发非常大的前瞻树,这可能需要过多的在线计算资源。在本文中,我们提出了一种新的混合策略学习方案,它位于DPS和LT的交集,其中策略是一种算法,以有向方式开发一个小型前瞻树,由通过DPS学习的节点评分函数引导。基于LT的表示被证明是在DPS方案中表示策略的一种通用方式,同时,DPS能够显著减少做出高质量决策所需的前瞻树的大小。我们通过实验将我们的方法与两种其他最先进的DPS技术和四种常见的LT策略在四个基准领域进行比较,并表明它结合了其起源的两种技术的优势。特别是,我们表明我们的方法:(1)总体上比纯DPS和纯LT策略产生更好的性能策略,(2)需要的策略评估次数远少于其他DPS技术,(3)易于调整,(4)产生的策略对初始条件的扰动具有相当的鲁棒性。

英文摘要

Direct policy search (DPS) and look-ahead tree (LT) policies are two widely used classes of techniques to produce high-performance policies for sequential decision-making problems. To make DPS approaches work well, one crucial issue is to select an appropriate space of parameterized policies with respect to the targeted problem. A fundamental issue in LT approaches is that, to take good decisions, such policies must develop very large look-ahead trees, which may require excessive online computational resources. In this paper, we propose a new hybrid policy learning scheme that lies at the intersection of DPS and LT, in which the policy is an algorithm that develops a small look-ahead tree in a directed way, guided by a node scoring function that is learned through DPS. The LT-based representation is shown to be a versatile way of representing policies in a DPS scheme, while at the same time, DPS makes it possible to significantly reduce the size of the look-ahead trees that are required to take high-quality decisions. We experimentally compare our method with two other state-of-the-art DPS techniques and four common LT policies on four benchmark domains and show that it combines the advantages of the two techniques from which it originates. In particular, we show that our method: (1) produces overall better performing policies than both pure DPS and pure LT policies, (2) requires a substantially smaller number of policy evaluations than other DPS techniques, (3) is easy to tune, and (4) results in policies that are quite robust with respect to perturbations of the initial conditions.
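The idea of a policy that grows a small look-ahead tree in a directed way can be sketched as a best-first expansion under a fixed node budget. This is only an illustration under stated assumptions: the function `lookahead_policy`, the toy `step` dynamics, and the hand-written `score` function are hypothetical. In the scheme described above the scoring function would be parameterized and its parameters learned by DPS, which this sketch does not attempt.

```python
import heapq
import itertools


def lookahead_policy(state, actions, step, score, budget=20):
    """Expand at most `budget` nodes best-first by `score`; return the root action
    on the path with the highest cumulative reward seen so far."""
    tie = itertools.count()  # unique tie-breaker so heapq never compares states
    frontier = []
    best_return, best_action = float("-inf"), None
    for action in actions:
        nxt, reward = step(state, action)
        if reward > best_return:
            best_return, best_action = reward, action
        heapq.heappush(frontier, (-score(nxt, reward, 1), next(tie), nxt, reward, 1, action))
    for _ in range(budget):
        if not frontier:
            break
        _, _, node, ret, depth, root = heapq.heappop(frontier)  # highest-scoring open node
        for action in actions:
            nxt, reward = step(node, action)
            total = ret + reward
            if total > best_return:
                best_return, best_action = total, root
            heapq.heappush(
                frontier, (-score(nxt, total, depth + 1), next(tie), nxt, total, depth + 1, root)
            )
    return best_action


# Toy deterministic problem (assumed): walk on the integers, reward 1 on arriving at GOAL.
GOAL = 3


def step(position, action):
    nxt = position + action
    return nxt, (1.0 if nxt == GOAL else 0.0)


def score(position, ret, depth):
    # Hand-set stand-in for a learned node scoring function: prefer nodes near the goal.
    return -abs(GOAL - position)


chosen = lookahead_policy(0, [-1, 1], step, score)
```

A good scoring function lets the tree stay small because expansions concentrate on promising nodes; with a poor one, the same budget is spent on irrelevant branches.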

1204.3830 2026-06-03 cs.RO cs.AI cs.SY eess.SY 版本更新

Planning Optimal Paths for Multiple Robots on Graphs

图上多机器人路径规划的最优路径

Jingjin Yu, Steven M. LaValle

AI总结 提出两种基于多流整数线性规划的模型,分别求解多机器人路径规划的最小最后到达时间和最小总距离问题,算法完备且保证最优解。

Comments Changed "agents" to "robots"

详情
AI中文摘要

在本文中,我们研究了图上多机器人路径规划(MPP)的最优问题。我们提出了两种基于多流的整数线性规划(ILP)模型,分别计算MPP公式的最小最后到达时间和最小总距离解。这些ILP模型产生的算法是完备的,并保证得到真正的最优解。此外,我们的灵活框架可以轻松适应MPP问题的其他变体。专注于时间最优算法,我们评估了其性能,既作为独立算法,也作为快速解决大规模问题实例的通用启发式方法。计算结果证实了我们方法的有效性。

英文摘要

In this paper, we study the problem of optimal multi-robot path planning (MPP) on graphs. We propose two multiflow-based integer linear programming (ILP) models that compute minimum last arrival time and minimum total distance solutions for our MPP formulation, respectively. The resulting algorithms from these ILP models are complete and guaranteed to yield true optimal solutions. In addition, our flexible framework can easily accommodate other variants of the MPP problem. Focusing on the time-optimal algorithm, we evaluate its performance, both as a standalone algorithm and as a generic heuristic for quickly solving large problem instances. Computational results confirm the effectiveness of our method.

1204.3820 2026-06-03 eess.SY cs.AI cs.RO cs.SY 版本更新

Distance Optimal Formation Control on Graphs with a Tight Convergence Time Guarantee

图上具有紧收敛时间保证的距离最优编队控制

Jingjin Yu, Steven M. LaValle

AI总结 针对连通图上单位边距下无碰撞移动多个不可区分智能体到任意目标顶点集的任务,提出一种快速距离最优控制算法,并给出紧收敛时间保证。

Comments Brought to be in-sync with final version submitted to CDC 2012 with only minor updates

详情
AI中文摘要

对于在单位边距的连通图上将一组不可区分智能体无碰撞地移动到任意目标顶点集的任务,我们提出了一种快速距离最优控制算法,引导智能体进入期望编队。此外,我们证明了该算法还提供了紧收敛时间保证(时间最优性和距离最优性无法同时满足)。我们的通用图表述允许该算法应用于诸如具有孔洞(模拟障碍物)的任意维度网格等场景。在线可用的仿真验证了我们的理论发展。

英文摘要

For the task of moving a set of indistinguishable agents on a connected graph with unit edge distance to an arbitrary set of goal vertices, free of collisions, we propose a fast distance optimal control algorithm that guides the agents into the desired formation. Moreover, we show that the algorithm also provides a tight convergence time guarantee (time optimality and distance optimality cannot be simultaneously satisfied). Our generic graph formulation allows the algorithm to be applied to scenarios such as grids with holes (modeling obstacles) in arbitrary dimensions. Simulations, available online, confirm our theoretical developments.

1101.4003 2026-06-03 cs.AI cs.LG cs.SY eess.SY math.OC 版本更新

Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing-game strategy decision systems

Dyna-H:一种应用于角色扮演游戏策略决策系统的启发式规划强化学习算法

Matilde Santos, Jose Antonio Martin H., Victoria Lopez, Guillermo Botella

AI总结 提出Dyna-H算法,结合启发式搜索与Dyna框架,在角色扮演游戏策略决策中实现无模型在线强化学习,实验表明其性能显著优于Q-Learning和Dyna-Q。

详情
AI中文摘要

在角色扮演游戏中,寻找最优轨迹是最重要的任务之一。实际上,策略决策系统成为游戏引擎的关键组成部分。决策方式(在线、批处理或模拟)以及决策所消耗的资源(如执行时间、内存)将在很大程度上影响游戏性能。当可以使用经典搜索算法(如A*)时,它们是最优先的选择。然而,这些方法依赖于搜索空间的精确和完整模型,在许多有趣的场景中无法应用。此时,无模型的序贯决策方法(在不确定性下)是最佳选择。本文提出一种启发式规划策略,将启发式搜索在路径规划中的能力融入Dyna智能体。所提出的Dyna-H算法,与A*一样,会选择更有可能产生结果的路径分支。此外,它具有无模型在线强化学习算法的优点。该方案与单步Q-Learning和Dyna-Q算法进行了对比评估,获得了优异的实验结果:Dyna-H在所有实验中显著优于这两种方法。我们还提出了一个功能类比,即从最差轨迹中采样的启发式与人类行为中梦境(如噩梦)的作用类似。

英文摘要

In a Role-Playing Game, finding optimal trajectories is one of the most important tasks. In fact, the strategy decision system becomes a key component of a game engine. The way in which decisions are taken (online, batch or simulated) and the resources consumed in decision making (e.g. execution time, memory) will influence the game performance to a major degree. When classical search algorithms such as A* can be used, they are the first option. Nevertheless, such methods rely on precise and complete models of the search space, and there are many interesting scenarios where their application is not possible. In those cases, model-free methods for sequential decision making under uncertainty are the best choice. In this paper, we propose a heuristic planning strategy to incorporate the ability of heuristic search in path-finding into a Dyna agent. The proposed Dyna-H algorithm, as A* does, selects branches more likely to produce outcomes than other branches. In addition, it has the advantages of being a model-free online reinforcement learning algorithm. The proposal was evaluated against the one-step Q-Learning and Dyna-Q algorithms, obtaining excellent experimental results: Dyna-H significantly outperforms both methods in all experiments. We also suggest a functional analogy between the proposed heuristic of sampling from worst trajectories and the role of dreams (e.g. nightmares) in human behavior.

1201.5604 2026-06-03 cs.AI cs.LG cs.NE cs.SY eess.SY math.OC 版本更新

Discrete and fuzzy dynamical genetic programming in the XCSF learning classifier system

XCSF学习分类系统中的离散与模糊动态遗传编程

Richard J. Preen, Larry Bull

AI总结 本文在XCSF框架内使用离散和模糊动态系统表示(异步随机布尔网络和模糊逻辑网络),通过自适应的开放式进化设计集成系统,解决多个经典测试问题。

详情
Journal ref
Soft Computing (2014), 18(1):153-167
AI中文摘要

学习分类系统中已经提出了多种表示方案,从二进制编码到神经网络。本文报告了在XCSF学习分类系统中使用离散和模糊动态系统表示的研究结果。具体而言,在离散情况下使用异步随机布尔网络表示传统的条件-动作生产系统规则,在连续值情况下使用异步模糊逻辑网络。研究表明,可以在XCSF中使用自适应的开放式进化来设计此类动态系统的集成,以解决多个著名的测试问题。

英文摘要

A number of representation schemes have been presented for use within learning classifier systems, ranging from binary encodings to neural networks. This paper presents results from an investigation into using discrete and fuzzy dynamical system representations within the XCSF learning classifier system. In particular, asynchronous random Boolean networks are used to represent the traditional condition-action production system rules in the discrete case and asynchronous fuzzy logic networks in the continuous-valued case. It is shown to be possible to use self-adaptive, open-ended evolution to design an ensemble of such dynamical systems within XCSF to solve a number of well-known test problems.
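For readers unfamiliar with the representation, an asynchronous random Boolean network can be sketched in a few lines. This is a generic illustration of the network model only, assuming random wiring, random truth tables and one-node-at-a-time updates; the helper names are hypothetical, and nothing here reproduces how XCSF encodes rules or evolves the networks.

```python
import random


def make_rbn(n_nodes, k_inputs, rng):
    """Give every node `k_inputs` randomly chosen input nodes and a random truth table."""
    inputs = [[rng.randrange(n_nodes) for _ in range(k_inputs)] for _ in range(n_nodes)]
    tables = [[rng.randrange(2) for _ in range(2 ** k_inputs)] for _ in range(n_nodes)]
    return inputs, tables


def async_step(state, inputs, tables, rng):
    """Asynchronous update: pick one node at random and recompute only that node."""
    node = rng.randrange(len(state))
    index = 0
    for src in inputs[node]:
        index = (index << 1) | state[src]  # pack the input bits into a truth-table index
    new_state = list(state)
    new_state[node] = tables[node][index]
    return new_state


rng = random.Random(0)
inputs, tables = make_rbn(n_nodes=8, k_inputs=2, rng=rng)
state = [rng.randrange(2) for _ in range(8)]
for _ in range(20):
    state = async_step(state, inputs, tables, rng)
```

In contrast to a synchronous network, where all nodes update at once, each call here changes at most one node, so the trajectory depends on the random update order as well as on the wiring.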

1106.3703 2026-06-03 nlin.AO cs.AI cs.IT cs.LG cs.SY eess.SY math.IT q-bio.QM stat.ME 版本更新

Prediction and Modularity in Dynamical Systems

动力系统中的预测与模块性

Artemy Kolchinsky, Luis M. Rocha

AI总结 本文从统计建模和预测的角度,利用模型简洁性与预测精度之间的权衡,提出了一种将动力网络最优多尺度分解为弱耦合简单模块的方法,并给出了状态依赖和因果版本。

Comments v1 published in ECAL 2011 (European Conference on Artificial Life). v2 fixes error in causal risk (number of parameters should be based on training distribution)

详情
AI中文摘要

识别和理解模块化组织是复杂系统研究中的核心问题。已有多种方法被提出,其中许多以信息论术语表述。我们的研究从动力系统的统计建模和预测这一互补视角出发。已知对于有限量的训练数据,简单模型可能比复杂模型具有更强的预测能力。我们利用模型简洁性与预测精度之间的权衡,将动力网络最优多尺度分解为弱耦合的简单模块。还提出了我们方法的状态依赖和因果版本。

英文摘要

Identifying and understanding modular organizations is centrally important in the study of complex systems. Several approaches to this problem have been advanced, many framed in information-theoretic terms. Our treatment starts from the complementary point of view of statistical modeling and prediction of dynamical systems. It is known that for finite amounts of training data, simpler models can have greater predictive power than more complex ones. We use the trade-off between model simplicity and predictive accuracy to generate optimal multiscale decompositions of dynamical networks into weakly-coupled, simple modules. State-dependent and causal versions of our method are also proposed.

1204.4200 2026-06-03 cs.AI cs.LG cs.NE cs.SY eess.SY 版本更新

Discrete Dynamical Genetic Programming in XCS

XCS中的离散动力遗传编程

Richard J. Preen, Larry Bull

AI总结 本文研究在XCS学习分类器系统中使用异步随机布尔网络作为离散动力系统表示,通过自适应的开放式进化设计集成系统以解决多个经典测试问题。

Comments arXiv admin note: substantial text overlap with arXiv:1201.5604

详情
Journal ref
In Proceedings of the 11th annual conference on genetic and evolutionary computation, GECCO '09, pp. 1299-1306. ACM, 2009
AI中文摘要

在XCS学习分类器系统中,已有多种表示方案,从二进制编码到神经网络。本文研究了在XCS中使用离散动力系统表示的结果。特别地,使用异步随机布尔网络来表示传统的条件-动作生产系统规则。结果表明,可以通过自适应的开放式进化在XCS中设计这样的离散动力系统集成,以解决多个著名的测试问题。

英文摘要

A number of representation schemes have been presented for use within Learning Classifier Systems, ranging from binary encodings to neural networks. This paper presents results from an investigation into using a discrete dynamical system representation within the XCS Learning Classifier System. In particular, asynchronous random Boolean networks are used to represent the traditional condition-action production system rules. It is shown to be possible to use self-adaptive, open-ended evolution to design an ensemble of such discrete dynamical systems within XCS to solve a number of well-known test problems.

1107.5528 2026-06-03 cs.AI cs.SY eess.SY math.OC 版本更新

Time Consistent Discounting

时间一致折现

Tor Lattimore, Marcus Hutter

AI总结 本文通过引入随年龄变化的折现函数,刻画了时间一致与不一致的折现函数,并证明了即使折现函数时间不一致,智能体仍存在理性策略。

Comments 17 LaTeX pages, 5 figures

详情
Journal ref
Proc. 22nd International Conf. on Algorithmic Learning Theory (ALT-2011) pages 383-397
AI中文摘要

一个可能永生的智能体试图最大化其随时间累积的折现奖励,其中折现用于避免无限效用并鼓励智能体更重视当前奖励而非未来奖励。一些常用的折现函数会导致时间不一致行为,即智能体会随时间改变其计划。这些不一致可能导致非常糟糕的行为。我们将通常的折现效用模型推广到折现函数随智能体年龄变化的形式。然后,我们给出了时间(不)一致折现函数的简单刻画,并证明了对于知道其折现函数是时间不一致的智能体,存在一个理性策略。

英文摘要

A possibly immortal agent tries to maximise its summed discounted rewards over time, where discounting is used to avoid infinite utilities and encourage the agent to value current rewards more than future ones. Some commonly used discount functions lead to time-inconsistent behavior where the agent changes its plan over time. These inconsistencies can lead to very poor behavior. We generalise the usual discounted utility model to one where the discount function changes with the age of the agent. We then give a simple characterisation of time-(in)consistent discount functions and show the existence of a rational policy for an agent that knows its discount function is time-inconsistent.
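The notion of a plan that changes over time can be made concrete with a small numerical sketch. It compares geometric discounting with hyperbolic discounting for an agent choosing between a smaller, earlier reward and a larger, later one. The reward sizes, times and discount parameters are illustrative assumptions, and the sketch uses the familiar delay-based discount functions rather than the age-dependent generalisation studied in the paper.

```python
def geometric(gamma):
    """Discount factor gamma ** delay."""
    return lambda delay: gamma ** delay


def hyperbolic(kappa):
    """Discount factor 1 / (1 + kappa * delay)."""
    return lambda delay: 1.0 / (1.0 + kappa * delay)


def prefers_later(discount, now, t_small, r_small, t_large, r_large):
    """True if, evaluated at time `now`, the later and larger reward is worth more."""
    v_small = discount(t_small - now) * r_small
    v_large = discount(t_large - now) * r_large
    return v_large > v_small


def plan_over_time(discount, t_small=10, r_small=1.0, t_large=12, r_large=1.5):
    """Preference at every time step up to the moment the small reward becomes available."""
    return [
        prefers_later(discount, now, t_small, r_small, t_large, r_large)
        for now in range(t_small + 1)
    ]


geo_plan = plan_over_time(geometric(0.9))
hyp_plan = plan_over_time(hyperbolic(1.0))
```

With geometric discounting the ratio of the two discounted values does not depend on `now`, so the preference never flips. With the hyperbolic function the agent initially plans to wait for the larger reward but prefers the smaller one once it is imminent, which is the kind of time-inconsistent behavior the abstract refers to.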

1008.0775 2026-06-03 eess.SY cs.AI cs.MA cs.SY math.OC 版本更新

Systems Theoretic Techniques for Modeling, Control, and Decision Support in Complex Dynamic Systems

复杂动态系统中建模、控制与决策支持的系统理论技术

Armen Bagdasaryan

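AI summary Surveys modeling, control, and decision-support methods for complex systems from a systems-theoretic viewpoint, presents a general dynamic modeling and simulation technique for complex hierarchical systems operating in a control loop, and outlines the architecture of a computer information system for simulation and decision support.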

Comments 58 pages, 24 figures, 1 table; a book chapter published by Bentham Science

详情
AI中文摘要

我们从一般系统理论的角度讨论了复杂动态系统中的建模、控制与决策支持问题。考虑了复杂系统的主要特征以及系统方法在复杂系统研究中的应用。我们概述并分析了已知的现有复杂系统数学建模与仿真范式及方法,这些方法支持控制与决策过程。随后,我们继续研究适用于在控制回路中运行的复杂层次系统的通用动态建模与仿真技术。提出了用于复杂系统中仿真与决策支持的计算机信息系统的架构和结构模型。

英文摘要

We discuss the problems of modeling, control, and decision support in complex dynamic systems from a general system theoretic point of view. The main characteristics of complex systems and of the systems approach to their study are considered. We provide an overview and analysis of existing paradigms and methods of mathematical modeling and simulation of complex systems that support the processes of control and decision making. We then turn to a general dynamic modeling and simulation technique for complex hierarchical systems functioning in a control loop. Architectural and structural models of a computer information system intended for simulation and decision support in complex systems are presented.

1303.2912 2026-06-03 cs.AI cs.RO cs.SY eess.SY stat.ML 版本更新

Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

基于高斯过程的贝叶斯非线性系统辨识的集成预处理

Roger Frigola, Carl Edward Rasmussen

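AI summary Proposes GP-FNARX, which integrates data pre-processing with sparse Gaussian process regression into an automated procedure from raw data to an identified model; pre-processing parameters and GP hyper-parameters are tuned jointly by maximizing the marginal likelihood, yielding a Bayesian dynamics model that reports its uncertainty.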

Comments Proceedings of the 52nd IEEE International Conference on Decision and Control (CDC), Firenze, Italy, December 2013

详情
AI中文摘要

我们介绍了GP-FNARX:一种新的非线性系统辨识模型,基于带有滤波回归量(F)的非线性自回归外生模型(NARX),其中非线性回归问题使用稀疏高斯过程(GP)解决。我们将数据预处理与系统辨识集成到一个完全自动化的流程中,从原始数据到辨识模型。预处理参数和GP超参数均通过最大化概率模型的边际似然来调整。我们获得了系统动力学的贝叶斯模型,该模型能够在数据稀缺的区域报告其不确定性。自动化方法、不确定性建模及其相对较低的计算成本使GP-FNARX成为机器人和自适应控制应用的良好候选方案。

英文摘要

We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and the relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control.

1204.4202 2026-06-03 cs.AI cs.LG cs.NE cs.SY eess.SY 版本更新

Fuzzy Dynamical Genetic Programming in XCSF

XCSF中的模糊动态遗传编程

Richard J. Preen, Larry Bull

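AI summary Investigates a fuzzy Dynamical Genetic Programming representation within the XCSF Learning Classifier System, using asynchronous Fuzzy Logic Networks and self-adaptive, open-ended evolution to solve continuous-valued test problems.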

Comments 2 page GECCO 2011 poster paper

详情
Journal ref
In Proceedings of the 13th annual conference companion on genetic and evolutionary computation, GECCO '11, pp. 167-168. ACM, 2011
AI中文摘要

学习分类器系统中已提出多种表示方案,从二进制编码到神经网络,以及最近的动态遗传编程(DGP)。本文研究了在XCSF学习分类器系统中使用模糊DGP表示的结果。特别是,异步模糊逻辑网络用于表示传统的条件-动作产生式系统规则。结果表明,可以在XCSF内通过自适应性、开放式的演化设计一组这样的模糊动态系统,以解决几个著名的连续值测试问题。

英文摘要

A number of representation schemes have been presented for use within Learning Classifier Systems, ranging from binary encodings to Neural Networks, and more recently Dynamical Genetic Programming (DGP). This paper presents results from an investigation into using a fuzzy DGP representation within the XCSF Learning Classifier System. In particular, asynchronous Fuzzy Logic Networks are used to represent the traditional condition-action production system rules. It is shown possible to use self-adaptive, open-ended evolution to design an ensemble of such fuzzy dynamical systems within XCSF to solve several well-known continuous-valued test problems.

1304.2367 2026-06-03 cs.CV cs.AI cs.SY eess.SY 版本更新

Utility-Based Control for Computer Vision

基于效用的计算机视觉控制

Tod S. Levitt, Thomas O. Binford, Gil J. Ettinger, Patrice Gelband

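AI summary Addresses computational efficiency in Bayesian-network-based computer vision by controlling vision tasks through utility maximization rather than probability maximization, so that sensor information gathering and data analysis are performed efficiently.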

Comments Appears in Proceedings of the Fourth Conference on Uncertainty in Artificial Intelligence (UAI1988)

详情
AI中文摘要

在利用贝叶斯网络实现计算机视觉识别世界对象时,出现了几个关键问题。计算效率是驱动力。感知网络非常深,通常有十五层结构。图像很宽,例如,在512×512像素或更大的图像中,未指定数量的边缘可能出现在任何位置。为了提高效率,我们动态实例化观察到的对象的假设。网络不是固定的,而是在运行时逐步创建。世界对象假设的生成和识别模型的索引很重要,但本文不讨论[4,11]。这项工作旨在近期通过并行计算在雷达监视系统ADRIES[5,15]和工业零件识别系统SUCCESSOR[2]中实现。对于许多应用,视觉必须更快才能实用,因此有效控制机器视觉过程至关重要。感知操作可能扫描百万像素,并可能需要数分钟的计算时间。必须避免不必要的传感器动作和计算。并行计算在多个处理器能力级别上可用。用于高层视觉的并行分布式计算的潜力意味着分配非均匀计算。本文解决了基于贝叶斯概率模型的机器视觉系统中的任务控制问题。我们将控制与推理分离,以扩展先前的工作[3],最大化效用而非概率。最大化效用允许采用感知策略,以有效收集传感器信息并分析传感器数据。本文展示了通过效用控制机器视觉以识别军事场景的结果。未来工作将将其扩展到SUCCESSOR的工业零件识别。

英文摘要

Several key issues arise in implementing computer vision recognition of world objects in terms of Bayesian networks. Computational efficiency is a driving force. Perceptual networks are very deep, typically fifteen levels of structure. Images are wide, e.g., an unspecified number of edges may appear anywhere in an image of 512 x 512 pixels or larger. For efficiency, we dynamically instantiate hypotheses of observed objects. The network is not fixed, but is created incrementally at runtime. Generation of hypotheses of world objects and indexing of models for recognition are important, but they are not considered here [4,11]. This work is aimed at near-term implementation with parallel computation in a radar surveillance system, ADRIES [5, 15], and a system for industrial part recognition, SUCCESSOR [2]. For many applications, vision must be faster to be practical and so efficiently controlling the machine vision process is critical. Perceptual operators may scan megapixels and may require minutes of computation time. It is necessary to avoid unnecessary sensor actions and computation. Parallel computation is available at several levels of processor capability. The potential for parallel, distributed computation for high-level vision means distributing non-homogeneous computations. This paper addresses the problem of task control in machine vision systems based on Bayesian probability models. We separate control and inference to extend the previous work [3] to maximize utility instead of probability. Maximizing utility allows adopting perceptual strategies for efficient information gathering with sensors and analysis of sensor data. Results of controlling machine vision via utility to recognize military situations are presented in this paper. Future work extends this to industrial part recognition for SUCCESSOR.

1304.0030 2026-06-03 math.OC cs.AI cs.SY eess.SY 版本更新

Note on Combinatorial Engineering Frameworks for Hierarchical Modular Systems

关于层次模块化系统的组合工程框架的注记

Mark Sh. Levin

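AI summary Describes a basic set of combinatorial engineering frameworks for complex problems in hierarchical modular systems, covering hierarchical model design, combinatorial synthesis, system evaluation, bottleneck detection, improvement, multi-stage design, and evolution modeling, built on combinatorial optimization problems such as knapsack, multiple-choice, assignment, spanning tree, and morphological clique.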

Comments 11 pages, 7 figures, 3 tables

详情
AI中文摘要

本文简要描述了一套用于解决层次模块化系统中复杂问题的基本组合工程框架。这些框架由相互关联/链接(例如,通过偏好关系)的组合问题(及相应模型)组成。主要使用层次形态系统模型。基本标准组合工程(技术)框架列表如下:(1)系统层次模型设计,(2)组合综合(系统设计的“自下而上”过程),(3)系统评估,(4)系统瓶颈检测,(5)系统改进(重新设计、升级),(6)多阶段设计(系统轨迹设计),(7)系统演化/发展和系统预测的组合建模。组合工程框架旨在支持某些系统生命周期阶段。主要的底层组合优化问题列表包括:背包问题、多选问题、分配问题、生成树、形态团问题。

英文摘要

The paper briefly describes a basic set of special combinatorial engineering frameworks for solving complex problems in the field of hierarchical modular systems. The frameworks consist of combinatorial problems (and corresponding models), which are interconnected/linked (e.g., by preference relations). A hierarchical morphological system model is mainly used. The basic standard combinatorial engineering (technological) frameworks are the following: (1) design of system hierarchical model, (2) combinatorial synthesis ('bottom-up' process for system design), (3) system evaluation, (4) detection of system bottlenecks, (5) system improvement (re-design, upgrade), (6) multi-stage design (design of system trajectory), (7) combinatorial modeling of system evolution/development and system forecasting. The combinatorial engineering frameworks are targeted at the maintenance of some system life cycle stages. The main underlying combinatorial optimization problems are the following: knapsack problem, multiple-choice problem, assignment problem, spanning trees, morphological clique problem.

1301.7389 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Dealing with Uncertainty on the Initial State of a Petri Net

处理Petri网初始状态的不确定性

Iman Jarkass, Michele Rombaut

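AI summary Proposes a method based on Dempster-Shafer theory that uses sensor information and a Petri net model to determine the actual state of a dynamic system whose initial state is unknown.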

Comments Appears in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI1998)

详情
AI中文摘要

本文提出了一种方法,通过来自系统自身或其环境传感器的信息,找到复杂动态系统的实际状态。系统的标称演化是先验已知的,可以通过不同方法(例如专家)建模。本文选择了Petri网。与通常使用Petri网不同,系统的初始状态是未知的。因此,每个位置或位置集都绑定了一个置信度。用于建模这种不确定性的理论是Dempster-Shafer理论,它非常适用于这类问题。从表征动态系统标称演化的给定Petri网和观测输入出发,所提出的方法允许根据模型和输入的可靠性,确定系统在任何时刻的状态。

英文摘要

This paper proposes a method to find the actual state of a complex dynamic system from information coming from sensors on the system itself or on its environment. The nominal evolution of the system is known a priori and can be modeled (by an expert, for example) by different methods. In this paper, Petri nets have been chosen. Contrary to the usual use of Petri nets, the initial state of the system is unknown, so a degree of belief is attached to each place or set of places. The theory used to model this uncertainty is Dempster-Shafer theory, which is well suited to this type of problem. From the given Petri net characterizing the nominal evolution of the dynamic system, and from the observation inputs, the proposed method makes it possible to determine the state of the system at any time, according to the reliability of the model and the inputs.
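As a rough illustration of how degrees of belief over sets of places can be fused with a sensor reading, here is a minimal sketch of Dempster's rule of combination. It shows only the standard rule from Dempster-Shafer theory, not the paper's Petri net propagation; the place names `p1`, `p2`, `p3` and all mass values are made up for the example.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions.

    Each mass function maps a frozenset of hypotheses (here: Petri net
    places) to a mass, and the masses of each function sum to 1.
    """
    raw = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to incompatible sets is treated as conflict.
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict")
    # Renormalise so the combined masses sum to 1 again.
    return {s: w / (1.0 - conflict) for s, w in raw.items()}


if __name__ == "__main__":
    everything = frozenset({"p1", "p2", "p3"})
    # Hypothetical prior belief on the marked place and a sensor reading.
    prior = {frozenset({"p1", "p2"}): 0.6, everything: 0.4}
    sensor = {frozenset({"p2"}): 0.7, everything: 0.3}
    for places, mass in sorted(combine(prior, sensor).items(), key=lambda kv: -kv[1]):
        print(sorted(places), round(mass, 3))
```

Mass left on the full set of places expresses ignorance, which is how an unknown initial marking can be represented without committing to a single place.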

1301.6747 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Bayesian Control for Concentrating Mixed Nuclear Waste

混合核废料浓缩的贝叶斯控制

Robert L. Welch, Clayton Smith

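AI summary Proposes a control algorithm for batch processing of mixed waste based on conditional Gaussian Bayesian networks; the network is compiled during batch staging for real-time response to sensor input.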

Comments Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI1999)

详情
AI中文摘要

提出一种基于条件高斯贝叶斯网络的混合废料批处理控制算法。该网络在批处理阶段编译,以实现对传感器输入的实时响应。

英文摘要

A control algorithm for batch processing of mixed waste is proposed based on conditional Gaussian Bayesian networks. The network is compiled during batch staging for real-time response to sensor input.

1301.6721 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Learning Finite-State Controllers for Partially Observable Environments

学习部分可观测环境的有限状态控制器

Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, Leslie Pack Kaelbling

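AI summary For partially observable Markov decision processes, extends the stochastic-gradient-descent-based VAPS algorithm to learn locally optimal finite-state controllers, and shows empirically that the learned controllers use past observations to compensate for the lack of observability.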

Comments Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI1999)

详情
AI中文摘要

在完全可观测的马尔可夫决策过程(MDP)中,反应式(无记忆)策略是足够的,但对于部分可观测MDP的最优控制,通常需要某种形式的记忆。具有有限记忆的策略可以表示为有限状态自动机。在本文中,我们将Baird和Moore的VAPS算法扩展到学习一般有限状态自动机的问题。由于该算法执行随机梯度下降,可以证明它收敛到局部最优的有限状态控制器。我们提供了算法的细节,然后考虑在什么条件下随机梯度下降将优于精确梯度下降的问题。最后,我们通过实证结果比较了随机和精确梯度下降的性能,并展示了我们的算法从过去观测序列中提取有用信息以补偿每个时间步不可观测性的能力。

英文摘要

Reactive (memoryless) policies are sufficient in completely observable Markov decision processes (MDPs), but some kind of memory is usually necessary for optimal control of a partially observable MDP. Policies with finite memory can be represented as finite-state automata. In this paper, we extend Baird and Moore's VAPS algorithm to the problem of learning general finite-state automata. Because it performs stochastic gradient descent, this algorithm can be shown to converge to a locally optimal finite-state controller. We provide the details of the algorithm and then consider the question of under what conditions stochastic gradient descent will outperform exact gradient descent. We conclude with empirical results comparing the performance of stochastic and exact gradient descent, and showing the ability of our algorithm to extract the useful information contained in the sequence of past observations to compensate for the lack of observability at each time-step.
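The statement that finite-memory policies can be represented as finite-state automata can be made concrete with a small data-structure sketch. It shows a hand-written deterministic controller only; the VAPS extension in the paper learns controller parameters by stochastic gradient descent, which is not shown here. The class name, observations, and actions are hypothetical.

```python
class FiniteStateController:
    """A deterministic finite-state controller (policy with finite memory).

    action_of: maps a memory node to the action emitted in that node.
    next_node: maps (memory node, observation) to the next memory node.
    """

    def __init__(self, action_of, next_node, start=0):
        self.action_of = action_of
        self.next_node = next_node
        self.node = start

    def step(self, observation):
        """Update the memory node from the observation, then return an action."""
        self.node = self.next_node[(self.node, observation)]
        return self.action_of[self.node]


def make_cue_controller():
    """Controller that remembers a cue ("L" or "R") through blank observations."""
    action_of = {0: "wait", 1: "left", 2: "right"}
    next_node = {
        (0, "L"): 1, (0, "R"): 2, (0, "blank"): 0,
        (1, "L"): 1, (1, "R"): 2, (1, "blank"): 1,
        (2, "L"): 1, (2, "R"): 2, (2, "blank"): 2,
    }
    return FiniteStateController(action_of, next_node)


if __name__ == "__main__":
    fsc = make_cue_controller()
    print([fsc.step(o) for o in ["blank", "L", "blank", "blank"]])
```

A memoryless policy must map the observation `"blank"` to one fixed action, whereas this controller answers `"blank"` differently depending on the cue it saw earlier.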

1301.3537 2026-06-03 cs.AI cs.NA math.NA 版本更新

Learning Stable Group Invariant Representations with Convolutional Networks

使用卷积网络学习稳定的群不变表示

Joan Bruna, Arthur Szlam, Yann LeCun

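AI summary Casts the invariance built by convolutional networks as a form of stable group invariance: the network architecture determines the invariance group, the trainable filter coefficients characterize the group action, and deeper convolutional layers induce a group factorization that enables more abstract invariant representations.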

Comments 4 pages

详情
AI中文摘要

变换群,如平移或旋转,有效地表达了在许多识别问题中观察到的部分变异性。群结构使得构建具有吸引人数学性质的不变信号表示成为可能,其中卷积与池化算子为输入的加性和几何扰动带来了稳定性。尽管物理变换群在图像和音频应用中无处不在,但它们并不能解释复杂信号类别的所有变异性。我们表明,深度卷积网络构建的不变性属性可以视为一种稳定的群不变性。网络布线架构决定了不变群,而可训练的滤波器系数刻画了群作用。我们给出了解释性示例,说明网络架构如何控制最终的不变群。我们还探讨了额外的卷积层通过群分解诱导更抽象、更强大的不变表示的原理。

英文摘要

Transformation groups, such as translations or rotations, effectively express part of the variability observed in many recognition problems. The group structure enables the construction of invariant signal representations with appealing mathematical properties, where convolutions, together with pooling operators, bring stability to additive and geometric perturbations of the input. Whereas physical transformation groups are ubiquitous in image and audio applications, they do not account for all the variability of complex signal classes. We show that the invariance properties built by deep convolutional networks can be cast as a form of stable group invariance. The network wiring architecture determines the invariance group, while the trainable filter coefficients characterize the group action. We give explanatory examples which illustrate how the network architecture controls the resulting invariance group. We also explore the principle by which additional convolutional layers induce a group factorization enabling more abstract, powerful invariant representations.

1301.2273 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Robust Combination of Local Controllers

局部控制器的鲁棒组合

Carlos E. Guestrin, Dirk Ormoneit

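AI summary For planning in high-dimensional continuous MDPs, proposes nonparametrically combining local controllers and applies the idea to stochastic shortest path and discounted MDPs; the former constrains the robot to reach the goal with high probability, and the latter handles model uncertainty through a robust linear program.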

Comments Appears in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001)

详情
AI中文摘要

规划问题是困难的,例如运动规划是PSPACE-hard的。在存在不确定性的情况下,这些问题更加困难。尽管马尔可夫决策过程(MDP)为此类问题提供了形式化框架,但求解高维连续MDP通常很困难,尤其是当动作和时间测量是连续的时候。幸运的是,问题特定知识使我们能够设计出局部表现良好的控制器,尽管没有全局保证。我们提出了一种非参数化组合局部控制器的方法,以获得全局良好的解。我们将此公式应用于两类问题:运动规划(随机最短路径)和折扣MDP。对于运动规划,我们认为通常的MDP最优性准则(期望成本)可能在实际中不相关。我们提出了一种替代方案:在机器人必须以高概率到达目标的约束下,寻找最小成本路径。对于这个问题,我们证明了多项式数量的样本足以获得高概率路径。对于折扣MDP,我们提出了一种明确处理模型不确定性的公式,即转移概率不完全已知时引入的问题。我们将该问题表述为一个鲁棒线性规划,直接纳入这种不确定性。

英文摘要

Planning problems are hard; motion planning, for example, is PSPACE-hard. Such problems are even more difficult in the presence of uncertainty. Although Markov Decision Processes (MDPs) provide a formal framework for such problems, finding solutions to high dimensional continuous MDPs is usually difficult, especially when the actions and time measurements are continuous. Fortunately, problem-specific knowledge allows us to design controllers that are good locally, though having no global guarantees. We propose a method of nonparametrically combining local controllers to obtain globally good solutions. We apply this formulation to two types of problems: motion planning (stochastic shortest path) and discounted MDPs. For motion planning, we argue that the usual MDP optimality criterion (expected cost) may not be practically relevant. We propose an alternative: finding the minimum cost path, subject to the constraint that the robot must reach the goal with high probability. For this problem, we prove that a polynomial number of samples is sufficient to obtain a high probability path. For discounted MDPs, we propose a formulation that explicitly deals with model uncertainty, i.e., the problem introduced when transition probabilities are not known exactly. We formulate the problem as a robust linear program which directly incorporates this type of uncertainty.

1301.0584 2026-06-03 cs.AI cs.LG cs.SY eess.SY 版本更新

Decayed MCMC Filtering

衰减MCMC滤波

Bhaskara Marthi, Hanna Pasula, Stuart Russell, Yuval Peres

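AI summary Proposes decayed MCMC, a stochastic approximate filtering algorithm that samples state trajectories using a proposal distribution favouring flips of recent state variables; proves that the convergence time remains bounded as the observation sequence grows and shows experimentally that it is competitive with particle filtering and other approximation algorithms.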

Comments Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI2002)

详情
AI中文摘要

滤波——从观测序列中估计部分可观测马尔可夫过程的状态——是控制理论、人工智能和计算统计学中研究最广泛的问题之一。对于大型离散系统和非线性连续系统,后验分布的精确计算通常是难以处理的,因此大量工作致力于开发鲁棒的近似算法。本文描述了一种简单的随机近似滤波算法,称为衰减MCMC。该算法对状态轨迹空间应用马尔可夫链蒙特卡罗采样,使用偏向翻转较新状态变量的提议分布。该算法的形式化分析涉及MCMC收敛的标准耦合论证的推广。我们证明,对于任何遍历的底层马尔可夫过程,随着观测序列长度的增长,具有逆多项式衰减的衰减MCMC的收敛时间保持有界。实验表明,衰减MCMC至少与粒子滤波等其他近似算法具有竞争力。

英文摘要

Filtering, that is, estimating the state of a partially observable Markov process from a sequence of observations, is one of the most widely studied problems in control theory, AI, and computational statistics. Exact computation of the posterior distribution is generally intractable for large discrete systems and for nonlinear continuous systems, so a good deal of effort has gone into developing robust approximation algorithms. This paper describes a simple stochastic approximation algorithm for filtering called decayed MCMC. The algorithm applies Markov chain Monte Carlo sampling to the space of state trajectories using a proposal distribution that favours flips of more recent state variables. The formal analysis of the algorithm involves a generalization of standard coupling arguments for MCMC convergence. We prove that for any ergodic underlying Markov process, the convergence time of decayed MCMC with inverse-polynomial decay remains bounded as the length of the observation sequence grows. We show experimentally that decayed MCMC is at least competitive with other approximation algorithms such as particle filtering.
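The idea of a proposal that favours recent state variables can be sketched as follows. This covers only the choice of which time step to resample, using an inverse-polynomial weight in the distance from the current step; the exponent `alpha`, the function names, and the exact form of the weights are assumptions for illustration and are not taken from the paper.

```python
import random


def decay_weights(t, alpha=1.5):
    """Unnormalised weights for time steps 1..t, where step t is the most recent.

    A step that lies d = t - i + 1 positions back gets weight d ** -alpha,
    so older state variables are proposed for resampling less often.
    """
    return [1.0 / (t - i + 1) ** alpha for i in range(1, t + 1)]


def sample_step(t, alpha=1.5, rng=None):
    """Draw the index of the state variable to resample next."""
    rng = random.Random(0) if rng is None else rng
    return rng.choices(range(1, t + 1), weights=decay_weights(t, alpha))[0]


if __name__ == "__main__":
    rng = random.Random(42)
    draws = [sample_step(50, rng=rng) for _ in range(10)]
    print(draws)
```

For `alpha > 1` the sum of the weights stays bounded however long the observation sequence becomes, so the probability of touching the most recent steps does not vanish as `t` grows.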

1212.3998 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Online Learning for Ground Trajectory Prediction

地面轨迹预测的在线学习

Areski Hadjaz, Gaétan Marceau, Pierre Savéant, Marc Schoenauer

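AI summary Proposes a hybrid-system model for numerically simulating the climbing phase of an aircraft, tunes its parameters with the CMA-ES optimization algorithm to improve trajectory prediction accuracy, and obtains more accurate results by updating predictions online.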

Comments SESAR 2nd Innovation Days (2012)

详情
AI中文摘要

本文提出一个基于混合系统的模型,用于数值模拟飞机的爬升阶段。该模型随后被用于轨迹预测工具中。最后,采用协方差矩阵自适应进化策略(CMA-ES)优化算法来调整五个选定参数,从而提高模型的精度。集成在轨迹预测工具中,该模型可用于推导预测误差随时间变化的量级,从而确定轨迹预测的有效域。所提模型的第一个验证实验基于一次起飞时轨迹预测随时间变化的误差,与理论BADA模型的默认值进行比较。该实验假设完全信息,也显示了模型的局限性。第二个实验部分介绍了在线轨迹预测,其中预测基于当前飞机位置持续更新。这种方法引发了几个问题,针对这些问题提出了基本模型的改进,由此得到的轨迹预测工具在统计上显著优于默认模型的结果。

英文摘要

This paper presents a model based on a hybrid system to numerically simulate the climbing phase of an aircraft. This model is then used within a trajectory prediction tool. Finally, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) optimization algorithm is used to tune five selected parameters, and thus improve the accuracy of the model. Incorporated within a trajectory prediction tool, this model can be used to derive the order of magnitude of the prediction error over time, and thus the domain of validity of the trajectory prediction. A first validation experiment of the proposed model is based on the errors over time for a one-time trajectory prediction at the take-off of the flight with respect to the default values of the theoretical BADA model. This experiment, assuming complete information, also shows the limits of the model. A second experiment presents an on-line trajectory prediction, in which the prediction is continuously updated based on the current aircraft position. This approach raises several issues, for which improvements of the basic model are proposed, and the resulting trajectory prediction tool shows statistically significantly more accurate results than those of the default model.

1212.3996 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Increasing Air Traffic: What is the Problem?

日益增长的空中交通:问题是什么?

Areski Hadjaz, Gaétan Marceau, Pierre Savéant, Marc Schoenauer

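AI summary Proposes a framework that models trajectory uncertainty with a Bayesian network and uses optimization and monitoring processes to minimize the probabilities of sector congestion and delay, bridging the gaps between air traffic management and air traffic control and between the different control centers.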

Comments SESAR 2nd Innovation Days (2012)

详情
AI中文摘要

如今,为了应对不确定性、复杂性和次优性,人们正在大力推动空中交通管理系统的现代化。一个答案是加强利益相关者之间的信息共享。本文介绍了一个框架,一方面弥合了空中交通管理与空中交通控制之间的差距,另一方面弥合了地面、进近和航路中心之间的差距。提出了一个原始系统,该系统包含三个基本组成部分:轨迹模型、优化过程和监控过程。轨迹的不确定性通过贝叶斯网络建模,其中节点与两类随机变量相关联:空域计量点的飞越时间以及连接这些点的航线的行驶时间。由此产生的贝叶斯网络覆盖整个空域,并通过蒙特卡洛模拟来估计扇区拥堵和延误的概率。在此轨迹模型之上,优化过程通过调整与计量点飞越时间相关的贝叶斯轨迹模型参数来最小化这些概率。最后一个组成部分是监控过程,它持续更新空域状态,根据飞机的实际位置修改轨迹的不确定性。每次更新后,计算新的最优飞越时间集,并可以作为指令传达给空中交通管制员,再传递给飞行员。本文给出了这一全局优化问题的形式化规范,其基本逻辑是在泰雷兹空中系统公司的空中交通管制员的帮助下得出的。

英文摘要

Nowadays, huge efforts are made to modernize air traffic management systems to cope with uncertainty, complexity and sub-optimality. One answer is to enhance information sharing between the stakeholders. This paper introduces a framework that bridges the gap between air traffic management and air traffic control on the one hand, and bridges the gap between the ground, the approach and the en-route centers on the other hand. An original system is presented that has three essential components: the trajectory models, the optimization process, and the monitoring process. The uncertainty of the trajectory is modeled with a Bayesian Network, where the nodes are associated with two types of random variables: the time of overflight on metering points of the airspace, and the traveling time of the routes linking these points. The resulting Bayesian Network covers the complete airspace, and Monte Carlo simulations are run to estimate the probabilities of sector congestion and delays. On top of this trajectory model, an optimization process minimizes these probabilities by tuning the parameters of the Bayesian trajectory model related to overflight times on metering points. The last component is the monitoring process, which continuously updates the situation of the airspace, modifying the trajectory uncertainties according to the actual positions of aircraft. After each update, a new optimal set of overflight times is computed, and can be communicated to the controllers as clearances for the aircraft pilots. The paper presents a formal specification of this global optimization problem, whose underlying rationale was derived with the help of air traffic controllers at Thales Air Systems.

1212.2499 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Marginalizing Out Future Passengers in Group Elevator Control

在群控电梯调度中边缘化未来乘客

Daniel N. Nikovski, Matthew Brand

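AI summary Develops a probabilistic model of how future passengers affect waiting times in group elevator scheduling and integrates it into an existing method, significantly reducing the average waiting time.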

Comments Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

详情
AI中文摘要

群控电梯调度是一个NP难的序贯决策问题,具有无界状态空间和大量不确定性。决策理论推理在现场系统中的作用出奇地有限。最近发现了一种可处理的解决方案,用于计算建筑内所有乘客的预期等待时间,该方案边缘化了所有可能的乘客行程,这为概率方法开辟了新的机会。尽管在商业上具有竞争力,但该解决方案没有考虑未来乘客。然而,在高峰上行交通中,未来乘客到达大厅并进入电梯轿厢的影响可能主导所有等待时间。我们开发了一个概率模型,描述这些到达如何影响电梯轿厢在大厅的行为,并展示了如何使用该模型显著降低所有乘客的平均等待时间。

英文摘要

Group elevator scheduling is an NP-hard sequential decision-making problem with unbounded state spaces and substantial uncertainty. Decision-theoretic reasoning plays a surprisingly limited role in fielded systems. A new opportunity for probabilistic methods has opened with the recent discovery of a tractable solution for the expected waiting times of all passengers in the building, marginalized over all possible passenger itineraries. Though commercially competitive, this solution does not contemplate future passengers. Yet in up-peak traffic, the effects of future passengers arriving at the lobby and entering elevator cars can dominate all waiting times. We develop a probabilistic model of how these arrivals affect the behavior of elevator cars at the lobby, and demonstrate how this model can be used to very significantly reduce the average waiting time of all passengers.

1212.2495 2026-06-03 cs.RO cs.AI cs.SY eess.SY 版本更新

Policy-contingent abstraction for robust robot control

基于策略抽象的鲁棒机器人控制

Joelle Pineau, Geoffrey Gordon, Sebastian Thrun

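AI summary Proposes a scalable control algorithm that lets a mobile robot system make high-level decisions under full consideration of its probabilistic belief; the controller was successfully deployed in a nursing facility.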

Comments Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

详情
AI中文摘要

本文提出一种可扩展的控制算法,使已部署的移动机器人系统能够在充分考虑其概率信念的情况下做出高层决策。我们的方法基于分层控制器和分层MDP的丰富文献中的见解。所得到的控制器已成功部署在宾夕法尼亚州匹兹堡附近的一家护理机构中。据我们所知,这项工作是应用POMDP解决高层机器人控制问题的独特实例。

英文摘要

This paper presents a scalable control algorithm that enables a deployed mobile robot system to make high-level decisions under full consideration of its probabilistic belief. Our approach is based on insights from the rich literature of hierarchical controllers and hierarchical MDPs. The resulting controller has been successfully deployed in a nursing facility near Pittsburgh, PA. To the best of our knowledge, this work is a unique instance of applying POMDPs to high-level robotic control problems.

1212.2471 2026-06-03 cs.LG cs.AI cs.NA math.NA 版本更新

Monte Carlo Matrix Inversion Policy Evaluation

蒙特卡洛矩阵求逆策略评估

Fletcher Lu, Dale Schuurmans

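AI summary Proposes Monte Carlo matrix inversion (MCMI) for reinforcement learning policy evaluation, reduces variance with importance sampling, and reports better runtime than a maximum likelihood model-based approach and better runtime and accuracy than temporal differencing.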

Comments Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

详情
AI中文摘要

1950年,Forsythe和Leibler(1950)引入了一种统计技术,通过将矩阵逆的元素表征为一系列随机游走的期望值来求矩阵的逆。Barto和Duff(1994)随后展示了该技术与标准动态规划和时序差分方法之间的关系。蒙特卡洛矩阵求逆(MCMI)方法的优势在于,它相对于其他技术,在状态空间大小方面具有更好的可扩展性。在本文中,我们介绍了一种使用MCMI进行强化学习策略评估的算法。我们证明,MCMI在运行时间上优于基于最大似然模型的策略评估方法,并且在运行时间和准确性上都优于时序差分(TD)策略评估方法。我们进一步通过向算法添加重要性采样技术来降低估计器的方差,从而改进了MCMI策略评估。最后,我们展示了将MCMI扩展到大规模状态空间以进行策略改进的技术。

英文摘要

In 1950, Forsythe and Leibler (1950) introduced a statistical technique for finding the inverse of a matrix by characterizing the elements of the matrix inverse as expected values of a sequence of random walks. Barto and Duff (1994) subsequently showed relations between this technique and standard dynamic programming and temporal differencing methods. The advantage of the Monte Carlo matrix inversion (MCMI) approach is that it scales better with respect to state-space size than alternative techniques. In this paper, we introduce an algorithm for performing reinforcement learning policy evaluation using MCMI. We demonstrate that MCMI improves on runtime over a maximum likelihood model-based policy evaluation approach and on both runtime and accuracy over the temporal differencing (TD) policy evaluation approach. We further improve on MCMI policy evaluation by adding an importance sampling technique to our algorithm to reduce the variance of our estimator. Lastly, we illustrate techniques for scaling up MCMI to large state spaces in order to perform policy improvement.
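The random-walk characterization of a matrix inverse mentioned above can be sketched for the policy evaluation case, where the value function is V = (I - gamma P)^(-1) r. The estimator below is the plain terminating-walk version: a walk follows P and stops with probability 1 - gamma at each step, and the expected number of visits to state j from start state i equals entry (i, j) of the inverse. It is an illustration under these assumptions, not the paper's algorithm, and it omits the importance sampling step; the transition matrix and rewards are invented.

```python
import random


def mc_inverse_row(P, gamma, start, n_walks=20000, rng=None):
    """Estimate row `start` of (I - gamma * P)^(-1) by average visit counts."""
    rng = random.Random(0) if rng is None else rng
    n = len(P)
    states = list(range(n))
    visits = [0] * n
    for _ in range(n_walks):
        s = start
        while True:
            visits[s] += 1
            if rng.random() >= gamma:  # terminate with probability 1 - gamma
                break
            s = rng.choices(states, weights=P[s])[0]
    return [v / n_walks for v in visits]


def mc_value(P, r, gamma, start, **kwargs):
    """Estimate V(start) = sum_j [(I - gamma * P)^(-1)]_{start, j} * r[j]."""
    row = mc_inverse_row(P, gamma, start, **kwargs)
    return sum(w * rj for w, rj in zip(row, r))


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.2, 0.8]]  # hypothetical 2-state chain under a fixed policy
    r = [1.0, 0.0]
    print("estimated row 0:", mc_inverse_row(P, 0.5, 0))
    print("estimated V(0): ", mc_value(P, r, 0.5, 0))
```

For this chain the exact first row of the inverse is (0.6 / 0.425, 0.25 / 0.425), roughly (1.41, 0.59), so the estimate can be checked by hand. The cost of one walk depends on gamma and not on the number of states, which is the scaling argument the abstract alludes to.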

1212.2005 2026-06-03 cs.AI cs.SY eess.SY 版本更新

The Dynamic Controllability of Conditional STNs with Uncertainty

含不确定性的条件STN的动态可控性

Luke Hunsberger, Roberto Posenato, Carlo Combi

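AI summary Defines the Conditional Simple Temporal Network with Uncertainty (CSTNU), which combines temporal constraints, conditional nodes, and uncertain durations, and presents a notion of dynamic controllability together with constraint-propagation rules.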

详情
Journal ref
PlanEX Workshop, ICAPS-2012, pages 21-29, 2012
AI中文摘要

最近自动化业务流程和医疗流程的尝试揭示了对一个正式框架的需求,该框架不仅能容纳时间约束,还能容纳具有不可控持续时间的观测和动作。为满足这一需求,本文定义了一种含不确定性的条件简单时间网络(CSTNU),它结合了简单时间网络(STN)的简单时间约束、条件简单时间问题(CSTP)的条件节点以及含不确定性的简单时间网络(STNU)的应急链接。定义了CSTNU的动态可控性概念,该概念推广了CTP的动态一致性和STNU的动态可控性。本文还提出了一些用于动态可控性的可靠约束传播规则,这些规则有望构成CSTNU动态可控性检查算法的基础。

英文摘要

Recent attempts to automate business processes and medical-treatment processes have uncovered the need for a formal framework that can accommodate not only temporal constraints, but also observations and actions with uncontrollable durations. To meet this need, this paper defines a Conditional Simple Temporal Network with Uncertainty (CSTNU) that combines the simple temporal constraints from a Simple Temporal Network (STN) with the conditional nodes from a Conditional Simple Temporal Problem (CSTP) and the contingent links from a Simple Temporal Network with Uncertainty (STNU). A notion of dynamic controllability for a CSTNU is defined that generalizes the dynamic consistency of a CTP and the dynamic controllability of an STNU. The paper also presents some sound constraint-propagation rules for dynamic controllability that are expected to form the backbone of a dynamic-controllability-checking algorithm for CSTNUs.
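For orientation, the simplest layer in this stack, the Simple Temporal Network, can be checked for consistency with a shortest-path computation: a set of constraints of the form y - x <= w is consistent exactly when the corresponding distance graph has no negative cycle. The sketch below covers only that plain STN case with Floyd-Warshall; it does not handle conditional nodes or contingent links, and it is not the dynamic-controllability check the paper works towards. The function name and the example constraints are hypothetical.

```python
def stn_consistent(n, constraints):
    """Check a Simple Temporal Network for consistency.

    n: number of time points, numbered 0..n-1.
    constraints: iterable of (x, y, w) meaning time[y] - time[x] <= w.
    Returns True if some assignment of times satisfies every constraint.
    """
    inf = float("inf")
    d = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for x, y, w in constraints:
        d[x][y] = min(d[x][y], w)
    # Floyd-Warshall all-pairs shortest paths on the distance graph.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # A negative entry on the diagonal signals a negative cycle.
    return all(d[i][i] >= 0 for i in range(n))


if __name__ == "__main__":
    # Task B must start between 5 and 10 time units after task A.
    ok = [(0, 1, 10), (1, 0, -5)]
    # Adding "B at most 3 units after A" contradicts the lower bound of 5.
    bad = ok + [(0, 1, 3)]
    print(stn_consistent(2, ok), stn_consistent(2, bad))
```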

1212.1735 2026-06-03 math.OC cs.AI cs.NI cs.SY eess.SY 版本更新

Towards Design of System Hierarchy (research survey)

系统层次结构设计(研究综述)

Mark Sh. Levin

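AI summary Surveys design frameworks for tree-like and hierarchical system structures, including expert-based procedures, hierarchical clustering, spanning problems, design of optimal organizational hierarchies, multi-layer k-connected network design, and modification of hierarchies or networks, with combinatorial optimization problems as the basic building blocks.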

Comments 36 pages, 41 figures, 9 tables

详情
AI中文摘要

本文讨论了某些树状和层次系统结构的设计/构建框架。考察了以下方法:(1)基于专家的程序;(2)层次聚类;(3)生成树问题(例如,最小生成树、最小斯坦纳树、最大叶子生成树问题);(4)组织“最优”层次设计;(5)多层(例如,三层)k连通网络设计;(6)层次或网络的修改:(i)通过合并相邻节点修改树,(ii)热链接分配,(iii)将树转换为斯坦纳树,(iv)重构作为将初始结构解修改为最接近目标解且考虑修改成本的解。组合优化问题被视为基本问题(例如,分类、背包问题、多选问题、分配问题)。一些数值示例说明了所提出的问题和求解框架。

英文摘要

The paper addresses design/building frameworks for some kinds of tree-like and hierarchical structures of systems. The following approaches are examined: (1) expert-based procedures; (2) hierarchical clustering; (3) spanning problems (e.g., minimum spanning tree, minimum Steiner tree, maximum leaf spanning tree problem); (4) design of organizational 'optimal' hierarchies; (5) design of multi-layer (e.g., three-layer) k-connected network; (6) modification of hierarchies or networks: (i) modification of tree via condensing of neighbor nodes, (ii) hotlink assignment, (iii) transformation of tree into Steiner tree, (iv) restructuring as modification of an initial structural solution into a solution that is closest to a goal solution while taking into account the cost of the modification. Combinatorial optimization problems are considered as basic ones (e.g., classification, knapsack problem, multiple choice problem, assignment problem). Some numerical examples illustrate the suggested problems and solving frameworks.

1212.1143 2026-06-03 cs.AI cs.SY eess.SY math.OC stat.ML 版本更新

Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning

多尺度马尔可夫决策问题:压缩、求解与迁移学习

Jake Bouvrie, Mauro Maggioni

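AI summary Proposes a fast multiscale procedure for compressing Markov decision processes that automatically builds a hierarchy, decouples sub-tasks, speeds up convergence, and enables transfer of policies across problems.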

Comments 86 pages, 15 figures

详情
AI中文摘要

序列决策和随机控制中的许多问题通常具有自然的多尺度结构:子任务被组合在一起以完成复杂目标。系统性地推断和利用层次结构,尤其是超越单一抽象层次,一直是一个长期挑战。我们描述了一种快速的多尺度过程,用于重复压缩或均质化马尔可夫决策过程(MDP),其中自动确定不同尺度上的子问题层次结构。粗化后的MDP本身是独立的确定性MDP,可以使用现有算法求解。该过程提供的多尺度表示将子任务相互解耦,可以在子问题内部局部和跨子问题全局上显著提高收敛速度,从而节省大量计算。这项工作的第二个基本方面是,这些多尺度分解为不同问题之间提供了新的迁移机会,其中层次结构中不同级别的子任务的解可能适用于迁移到新问题。强调了在任意尺度上策略和势算子的局部迁移。最后,我们在一个说明性领域集合中展示了压缩和迁移,包括涉及离散和连续状态空间的示例。

英文摘要

Many problems in sequential decision making and stochastic control often have natural multiscale structure: sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure, particularly beyond a single level of abstraction, has remained a longstanding challenge. We describe a fast multiscale procedure for repeatedly compressing, or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of sub-problems at different scales is automatically determined. Coarsened MDPs are themselves independent, deterministic MDPs, and may be solved using existing algorithms. The multiscale representation delivered by this procedure decouples sub-tasks from each other and can lead to substantial improvements in convergence rates both locally within sub-problems and globally across sub-problems, yielding significant computational savings. A second fundamental aspect of this work is that these multiscale decompositions yield new transfer opportunities across different problems, where solutions of sub-tasks at different levels of the hierarchy may be amenable to transfer to new problems. Localized transfer of policies and potential operators at arbitrary scales is emphasized. Finally, we demonstrate compression and transfer in a collection of illustrative domains, including examples involving discrete and continuous state spaces.

1210.4231 2026-06-03 eess.SY cs.AI cs.SY 版本更新

An example illustrating the imprecision of the efficient approach for diagnosis of Petri nets via integer linear programming

一个说明通过整数线性规划诊断Petri网的高效方法不精确性的例子

Alban Grastien

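AI summary Shows by counterexample that the efficient integer-linear-programming approach to diagnosis of Petri nets may fail to detect a fault even when the system is diagnosable.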

Comments 3 pages

详情
AI中文摘要

本文证明,即使系统是可诊断的,通过整数线性规划诊断Petri网的高效方法也可能无法检测到故障。

英文摘要

This document demonstrates that the efficient approach for diagnosis of Petri nets via integer linear programming may be unable to detect a fault even if the system is diagnosable.

1203.4345 2026-06-03 eess.SY cs.AI cs.RO cs.SY stat.ML 版本更新

Robust Filtering and Smoothing with Gaussian Processes

基于高斯过程的鲁棒滤波与平滑

Marc Peter Deisenroth, Ryan Turner, Marco F. Huber, Uwe D. Hanebeck, Carl Edward Rasmussen

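AI summary Proposes a robust Bayesian filtering and smoothing algorithm for nonlinear stochastic dynamic systems described by non-parametric Gaussian process models, with analytic smoothing; numerical experiments show it stays robust where other state-of-the-art methods can fail.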

Comments 7 pages, 1 figure, draft version of paper accepted at IEEE Transactions on Automatic Control

详情
AI中文摘要

我们提出了一种原则性算法,用于在非线性随机动态系统中进行鲁棒贝叶斯滤波和平滑,其中转移函数和测量函数均由非参数高斯过程(GP)模型描述。在信号处理、机器学习、机器人和控制领域,GP通过后验概率分布表示未知系统函数,其重要性日益增加。这种现代的“系统辨识”方式比寻找参数函数表示的点估计更为鲁棒。在本文中,我们提出了一种原则性算法,用于在GP动态系统中进行鲁棒解析平滑,该系统在机器人和控制领域应用日益广泛。我们的数值评估表明,在其它最先进的高斯滤波器和平滑器可能失败的情况下,所提方法具有鲁棒性。

英文摘要

We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by non-parametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. In this article, we present a principled algorithm for robust analytic smoothing in GP dynamic systems, which are increasingly used in robotics and control. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

1208.1103 2026-06-03 cs.AI cs.SY eess.SY 版本更新

System identification and modeling for interacting and non-interacting tank systems using intelligent techniques

基于智能技术的交互与非交互罐式系统的系统辨识与建模

N. S. Bhuvaneswari, R. Praveena, R. Divya

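AI summary Identifies transfer function models and intelligent models of interacting and non-interacting tank processes from real-time experimental data, using statistical model identification, the process reaction curve method, the ARX model, a genetic algorithm, neural networks, and fuzzy logic.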

Comments 13 pages, 8 figures

详情
AI中文摘要

从实验数据中进行系统辨识对于基于模型的控制器设计至关重要。由于过程复杂性,从第一原理推导过程模型通常很困难。任何控制和监测系统开发的第一阶段都是系统的辨识和建模。每个模型都是在特定控制问题的背景下开发的。因此,需要一个通用的系统辨识框架。所提出的框架应能根据控制目标和系统行为性质适应并强调不同的特性。因此,系统辨识已成为基于输入输出数据辨识系统模型以设计控制器的宝贵工具。本文关注于使用统计模型辨识、过程反应曲线法、ARX模型、遗传算法以及神经网络和模糊逻辑对交互和非交互罐式过程进行传递函数模型的辨识。所使用的辨识技术和建模易受参数变化和干扰的影响。所提出的方法用于从实时实验数据中辨识交互和非交互过程的数学模型和智能模型。

英文摘要

System identification from experimental data plays a vital role in model-based controller design. Deriving a process model from first principles is often difficult because of the complexity of the process. The first stage in the development of any control and monitoring system is the identification and modeling of the system. Each model is developed within the context of a specific control problem, so a general system identification framework is warranted. Such a framework should be able to adapt and emphasize different properties based on the control objective and the nature of the system's behavior. System identification has therefore been a valuable tool for identifying a model of the system from input and output data for controller design. The present work is concerned with the identification of transfer function models using statistical model identification, the process reaction curve method, the ARX model, and a genetic algorithm, and with modeling using neural networks and fuzzy logic, for interacting and non-interacting tank processes. The identification techniques and models used are prone to parameter changes and disturbances. The proposed methods are used to identify the mathematical model and the intelligent model of interacting and non-interacting processes from real-time experimental data.
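
The ARX step named in the abstract can be illustrated with a small least-squares fit. The sketch below assumes a first-order model y[k] = a*y[k-1] + b*u[k-1]; the model order, the function names `fit_arx1` and `simulate_arx1`, and the synthetic data are illustrative assumptions and are not taken from the paper.

```python
def fit_arx1(u, y):
    """Least-squares fit of y[k] = a*y[k-1] + b*u[k-1] (assumed first-order ARX)."""
    # Accumulate the 2x2 normal equations (Phi^T Phi) theta = Phi^T y.
    s_yy = s_yu = s_uu = t_y = t_u = 0.0
    for k in range(1, len(y)):
        py, pu = y[k - 1], u[k - 1]
        s_yy += py * py
        s_yu += py * pu
        s_uu += pu * pu
        t_y += py * y[k]
        t_u += pu * y[k]
    det = s_yy * s_uu - s_yu * s_yu
    if abs(det) < 1e-12:
        raise ValueError("input is not exciting enough to identify both parameters")
    # Solve the 2x2 system with Cramer's rule.
    a = (t_y * s_uu - s_yu * t_u) / det
    b = (s_yy * t_u - s_yu * t_y) / det
    return a, b


def simulate_arx1(a, b, u, y0=0.0):
    """Generate noise-free output data from the same first-order model."""
    y = [y0]
    for k in range(1, len(u)):
        y.append(a * y[-1] + b * u[k - 1])
    return y
```

With noise-free data the fit recovers the generating parameters; with measured tank data the estimates would only be as good as the assumed model order and the excitation of the input allow.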

1207.6051 2026-06-03 eess.SY cs.AI cs.SY math.OC 版本更新

Composition of Modular Telemetry System with Interval Multiset Estimates

基于区间多集估计的模块化遥测系统组合

Mark Sh. Levin

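AI summary Proposes a combinatorial synthesis approach based on interval multiset estimates for modeling, analysis, design, and improvement of a modular telemetry system, using Hierarchical Morphological Multicriteria Design (HMMD) for multicriteria selection and synthesis of system components.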

Comments 9 pages, 9 figures, 6 tables

详情
AI中文摘要

本文描述了一种组合综合方法,该方法利用系统元素的区间多集估计来对模块化遥测系统进行建模、分析、设计和改进。形态(模块化)系统设计与改进被视为遥测系统元素(组件)配置的组合。求解过程基于分层形态多准则设计(HMMD):(i) 系统组件备选方案的多准则选择,(ii) 将所选备选方案合成为结果组合(同时考虑上述备选方案的质量及其兼容性)。使用区间多集估计来评估遥测系统元素的设计备选方案。还研究了两个附加系统问题:(a) 改进所获得的解,(b) 将所获得的解聚合成一个结果系统配置。改进和聚合过程基于具有区间多集估计的多重选择问题。通过一个机载遥测子系统的数值示例说明了设计和改进过程。

英文摘要

The paper describes a combinatorial synthesis approach with interval multiset estimates of system elements for modeling, analysis, design, and improvement of a modular telemetry system. Morphological (modular) system design and improvement are considered as composition of the telemetry system elements (components) configuration. The solving process is based on Hierarchical Morphological Multicriteria Design (HMMD): (i) multicriteria selection of alternatives for system components, (ii) synthesis of the selected alternatives into a resultant combination (while taking into account the quality of the alternatives above and their compatibility). Interval multiset estimates are used for assessment of design alternatives for telemetry system elements. Two additional system problems are examined: (a) improvement of the obtained solutions, (b) aggregation of the obtained solutions into a resultant system configuration. The improvement and aggregation processes are based on the multiple choice problem with interval multiset estimates. Numerical examples for an on-board telemetry subsystem illustrate the design and improvement processes.

1207.4154 2026-06-03 cs.AI cs.SY eess.SY math.OC 版本更新

Discretized Approximations for POMDP with Average Cost

平均成本POMDP的离散化近似

Huizhen Yu, Dimitri Bertsekas

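AI summary For average cost POMDPs, proposes a new lower approximation scheme based on discretization over a finite set of belief points, computed efficiently with multi-chain algorithms for finite-state MDPs, and proves its convergence.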

Comments Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

详情
AI中文摘要

在本文中,我们针对具有折扣和平均成本准则的POMDP提出了一种新的下界近似方案。近似函数由其在一组有限信念点上的值确定,并可通过有限状态MDP的值迭代算法高效计算。虽然对于折扣问题已有几种下界近似方案被提出,但我们的方案似乎是平均成本问题中的首个。我们主要关注平均成本情形,并证明相应的近似可以通过有限状态MDP的多链算法高效计算。我们给出初步分析表明,无论POMDP中是否存在最优平均成本J,所获得的近似都是liminf最优平均成本函数的下界,也可用于计算limsup最优平均成本函数的上界,以及执行与近似相关的平稳策略的成本界。当最优平均成本为常数且最优差分成本连续时,我们证明了成本近似的收敛性。

英文摘要

In this paper, we propose a new lower approximation scheme for POMDP with discounted and average cost criteria. The approximating functions are determined by their values at a finite number of belief points, and can be computed efficiently using value iteration algorithms for finite-state MDP. While for discounted problems several lower approximation schemes have been proposed earlier, ours seems the first of its kind for average cost problems. We focus primarily on the average cost case, and we show that the corresponding approximation can be computed efficiently using multi-chain algorithms for finite-state MDP. We give a preliminary analysis showing that regardless of the existence of the optimal average cost J in the POMDP, the approximation obtained is a lower bound of the liminf optimal average cost function, and can also be used to calculate an upper bound on the limsup optimal average cost function, as well as bounds on the cost of executing the stationary policy associated with the approximation. We show the convergence of the cost approximation when the optimal average cost is constant and the optimal differential cost is continuous.

1207.3434 2026-06-03 cs.AI cs.RO cs.SY eess.SY 版本更新

An Approach to Model Interest for Planetary Rover through Dezert-Smarandache Theory

基于Dezert-Smarandache理论的行星探测器兴趣建模方法

Matteo Ceriotti, Massimiliano Vasile, Giovanni Giardini, Mauro Massari

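AI summary Proposes a method that quantifies the interest level of a planetary rover's goals by fusing payload and navigation information through Dezert-Smarandache Theory, enabling autonomous goal reallocation and selection of the most interesting scientific objectives.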

Comments Journal Of Aerospace Computing, Information, And Communication Vol. 5, Month 2008

详情
AI中文摘要

本文提出了一种为行星探测器目标分配兴趣度的方法。为目标分配兴趣度,使探测器能够自主地转换和重新分配目标。兴趣度由数据融合的有效载荷和导航信息定义。融合产生一个“兴趣地图”,量化探测器周围每个区域的兴趣水平。通过这种方式,规划器可以在有限的人为干预下选择最有趣的科学目标进行分析,并自主重新分配其目标。使用Dezert-Smarandache plausible and paradoxical reasoning理论进行信息融合:该理论允许处理模糊和冲突的数据。特别是,它允许我们直接模拟必须评估特定目标集相关性的科学家的行为。本文展示了所提方法在生成可靠兴趣地图中的应用。

英文摘要

In this paper, we propose an approach for assigning an interest level to the goals of a planetary rover. Assigning an interest level to goals allows the rover to transform and reallocate the goals autonomously. The interest level is defined by data-fusing payload and navigation information. The fusion yields an "interest map" that quantifies the level of interest of each area around the rover. In this way the planner can choose the most interesting scientific objectives to be analyzed, with limited human intervention, and reallocate its goals autonomously. The Dezert-Smarandache Theory of Plausible and Paradoxical Reasoning was used for information fusion: this theory allows dealing with vague and conflicting data. In particular, it allows us to model directly the behavior of the scientists who have to evaluate the relevance of a particular set of goals. The paper shows an application of the proposed approach to the generation of a reliable interest map.

1203.1007 2026-06-03 cs.LG cs.AI cs.SY eess.SY stat.ML 版本更新

Agnostic System Identification for Model-Based Reinforcement Learning

基于模型的强化学习的不可知系统辨识

Stephane Ross, J. Andrew Bagnell

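AI summary For the agnostic setting in which the model class may not contain the real system, proposes an iterative method that uses no-regret online learning algorithms to obtain near-optimal policies; the approach applies to discrete and continuous domains and is demonstrated on a helicopter domain.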

Comments 8 pages, published in ICML 2012

详情
AI中文摘要

控制中的一个基本问题是从观测中学习一个对控制器综合有用的系统模型。为了提供良好的性能保证,现有方法必须假设真实系统属于学习过程中考虑的模型类。我们提出了一种迭代方法,即使在系统不在模型类中的不可知情况下,也能提供强有力的保证。特别地,我们表明,只要某个模型实现了低训练误差并且能够访问良好的探索分布,任何无遗憾在线学习算法都可以用于获得近优策略。我们的方法适用于离散和连续域。我们在文献中一个具有挑战性的直升机领域上展示了其有效性和可扩展性。

英文摘要

A fundamental problem in control is to learn a model of a system from observations that is useful for controller synthesis. To provide good performance guarantees, existing methods must assume that the real system is in the class of models considered during learning. We present an iterative method with strong guarantees even in the agnostic case where the system is not in the class. In particular, we show that any no-regret online learning algorithm can be used to obtain a near-optimal policy, provided that some model achieves low training error and that a good exploration distribution is available. Our approach applies to both discrete and continuous domains. We demonstrate its efficacy and scalability on a challenging helicopter domain from the literature.

1206.4329 2026-06-03 cs.AI cs.NA math.NA 版本更新

An Improved Gauss-Newtons Method based Back-propagation Algorithm for Fast Convergence

一种基于改进高斯-牛顿法的快速收敛反向传播算法

Sudarshan Nandy, Partha Pratim Sarkar, Achintya Das

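AI summary Proposes an improved back-propagation algorithm based on the Gauss-Newton numerical optimization method that achieves fast convergence with a multilayer neural network, and compares it with steepest descent back-propagation on several datasets.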

Comments 7 pages, 6 figures, 2 tables, Published with International Journal of Computer Applications (IJCA)

详情
Journal ref
International Journal of Computer Applications 39(8):1-7, February 2012. Published by Foundation of Computer Science, New York, USA
AI中文摘要

本文研究了一种基于高斯-牛顿数值优化方法的改进反向传播算法,以实现快速收敛。反向传播采用最速下降法。该算法使用多种数据集进行测试,并与最速下降反向传播算法进行比较。系统中使用多层神经网络进行优化。在训练期间观察到所提方法的有效性,因为它对测试中使用的数据集收敛迅速。还分析了计算算法步骤所需的内存。

英文摘要

The present work deals with an improved back-propagation algorithm based on the Gauss-Newton numerical optimization method for fast convergence. The steepest descent method is used for back-propagation. The algorithm is tested on various datasets and compared with the steepest descent back-propagation algorithm. In the system, optimization is carried out using a multilayer neural network. The efficacy of the proposed method is observed during training, as it converges quickly on the datasets used in testing. The memory required to compute the steps of the algorithm is also analyzed.
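
To make the Gauss-Newton idea concrete, the sketch below performs one undamped Gauss-Newton update, w <- w + (J^T J)^(-1) J^T r, for a model with two parameters. It is a generic textbook step under stated assumptions, not the paper's algorithm: the single tanh neuron, the function names, and the restriction to two parameters are all illustrative choices.

```python
import math


def gauss_newton_step(params, xs, ys, model, grad):
    """One Gauss-Newton update for a two-parameter model (illustrative sketch)."""
    # Build J^T J (2x2) and J^T r (length 2), where r = y - model(x) is the residual
    # and J holds the partial derivatives of the model output w.r.t. the parameters.
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        r = y - model(params, x)
        j1, j2 = grad(params, x)
        a11 += j1 * j1
        a12 += j1 * j2
        a22 += j2 * j2
        g1 += j1 * r
        g2 += j2 * r
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("J^T J is singular; a damped step would be needed")
    # Solve (J^T J) d = J^T r with Cramer's rule and apply the step.
    d1 = (g1 * a22 - a12 * g2) / det
    d2 = (a11 * g2 - a12 * g1) / det
    return (params[0] + d1, params[1] + d2)


def neuron(params, x):
    """Hypothetical single tanh neuron with weight w and bias c."""
    w, c = params
    return math.tanh(w * x + c)


def neuron_grad(params, x):
    """Partial derivatives of the neuron output with respect to (w, c)."""
    w, c = params
    s = 1.0 - math.tanh(w * x + c) ** 2
    return (s * x, s)
```

For a model that is linear in its parameters a single step lands on the least-squares solution; for the tanh neuron the step is only a local approximation, and repeated undamped steps are not guaranteed to converge from a poor starting point.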

1206.3285 2026-06-03 cs.AI cs.LG cs.SY eess.SY 版本更新

Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

具有线性函数逼近和优先级扫描的Dyna风格规划

Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling

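AI summary Proposes a model-based Dyna-style planning approach extended to linear function approximation, proves its convergence, and introduces prioritized sweeping algorithms for linear Dyna.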

Comments Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

详情
AI中文摘要

我们考虑在在线设置中高效学习最优控制策略和价值函数的问题,其中状态空间很大,且必须在每次与世界交互后获得估计。本文开发了一种显式的基于模型的方法,将Dyna架构扩展到线性函数逼近。Dyna风格规划通过从世界模型生成想象经验,然后将无模型强化学习算法应用于想象的状态转移来进行。我们的主要结果是证明,在自然条件下,线性Dyna风格规划收敛到一个独立于生成分布的唯一解。在策略评估设置中,我们证明极限点是最小二乘(LSTD)解。我们的结果的一个含义是,优先级扫描可以合理地扩展到线性逼近情况,即回溯到前驱特征而不是前驱状态。我们介绍了两种线性Dyna的优先级扫描版本,并在Mountain Car和Boyan Chain问题上简要展示了它们的经验性能。

英文摘要

We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized-sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.
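
The planning loop described above (imagined transitions from a linear model, followed by a model-free update) can be sketched for policy evaluation as follows. The matrix `F`, the reward vector `b`, the choice of unit feature vectors as imagined starting points, and the fixed step size are assumptions made for illustration; the sketch omits model learning and prioritized sweeping.

```python
def linear_dyna_planning(F, b, gamma, alpha, sweeps):
    """Policy-evaluation planning with a linear model (illustrative sketch).

    F maps a feature vector phi to the expected next feature vector F @ phi,
    and b gives the expected reward as the dot product b . phi.
    """
    n = len(b)
    theta = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Imagined experience: start from the unit feature vector e_i.
            phi = [1.0 if j == i else 0.0 for j in range(n)]
            phi_next = [sum(F[r][c] * phi[c] for c in range(n)) for r in range(n)]
            reward = sum(b[j] * phi[j] for j in range(n))
            v = sum(theta[j] * phi[j] for j in range(n))
            v_next = sum(theta[j] * phi_next[j] for j in range(n))
            # Model-free TD(0) update applied to the imagined transition.
            td_error = reward + gamma * v_next - v
            for j in range(n):
                theta[j] += alpha * td_error * phi[j]
    return theta
```

With tabular (one-hot) features and an exact model, the weights settle on the true state values, which is the special case of the least-squares solution mentioned in the abstract.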

1205.3997 2026-06-03 stat.ML cs.AI cs.GT cs.SY eess.SY 版本更新

Free Energy and the Generalized Optimality Equations for Sequential Decision Making

自由能与序列决策的广义最优性方程

Pedro A. Ortega, Daniel A. Braun

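AI summary Applies the free energy principle to general decision trees that include adversarial and stochastic environments, and derives generalized sequential optimality equations that contain the Bellman optimality equations as a limit case and yield decision rules such as Expectimax, Minimax, and Expectiminimax; each node is assigned a resource parameter that expresses a computational cost.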

Comments 10 pages, 2 figures

详情
Journal ref
European Workshop on Reinforcement Learning 2012
AI中文摘要

自由能泛函最近被提出作为有界理性决策的变分原理,因为它实例化了效用增益与信息处理成本之间的自然权衡,并且可以从公理推导出来。这里我们将自由能原理应用于包含对抗和随机环境的通用决策树。我们推导出广义序列最优性方程,该方程不仅包含Bellman最优性方程作为极限情况,而且导出了众所周知的决策规则,如Expectimax、Minimax和Expectiminimax。我们展示了如何从单一的自由能原理推导出这些决策规则,该原理为决策树中的每个节点分配一个资源参数。这些资源参数表达了一个具体的计算成本,可以测量为从属于每个节点的分布所需的样本数量。因此,自由能原理为考虑对抗和随机环境的广义最优性方程提供了规范基础。

英文摘要

The free energy functional has recently been proposed as a variational principle for bounded rational decision-making, since it instantiates a natural trade-off between utility gains and information processing costs that can be axiomatically derived. Here we apply the free energy principle to general decision trees that include both adversarial and stochastic environments. We derive generalized sequential optimality equations that not only include the Bellman optimality equations as a limit case, but also lead to well-known decision rules such as Expectimax, Minimax and Expectiminimax. We show how these decision rules can be derived from a single free energy principle that assigns a resource parameter to each node in the decision tree. These resource parameters express a concrete computational cost that can be measured as the number of samples that are needed from the distribution that belongs to each node. The free energy principle therefore provides the normative basis for generalized optimality equations that account for both adversarial and stochastic environments.
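
How a single resource parameter can interpolate between maximizing, averaging, and minimizing nodes is easy to see numerically. The sketch below assumes the usual log-sum-exp form of the free energy, (1/beta) * log(sum_i p_i * exp(beta * v_i)); the abstract does not spell out the formula, so the form, the function name, and the example numbers are assumptions.

```python
import math


def free_energy_value(values, probs, beta):
    """Certainty equivalent (1/beta) * log(sum_i p_i * exp(beta * v_i))."""
    if beta == 0.0:
        # Limit beta -> 0: plain expectation, as at a chance node.
        return sum(p * v for p, v in zip(probs, values))
    # Log-sum-exp with the dominant exponent factored out for numerical stability.
    support = [(p, v) for p, v in zip(probs, values) if p > 0.0]
    shift = max(beta * v for _, v in support)
    total = sum(p * math.exp(beta * v - shift) for p, v in support)
    return (shift + math.log(total)) / beta
```

A large positive `beta` approaches the maximum over the supported outcomes (a maximizing node), a large negative `beta` approaches the minimum (an adversarial node), and `beta = 0` gives the expectation (a chance node).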

1205.2046 2026-06-03 eess.SY cs.AI cs.SY math.OC 版本更新

Multiset Estimates and Combinatorial Synthesis

多重集估计与组合综合

Mark Sh. Levin

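AI summary Proposes an ordinal assessment approach based on multiset estimates, and studies operations over them (integration, proximity, comparison, aggregation, alignment) and their use in combinatorial synthesis (morphological approach, knapsack-like problems).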

Comments 30 pages, 24 figures, 10 tables

详情
AI中文摘要

本文探讨了一种基于将元素分配到序数量表上的备选方案序数评估方法。在考虑基本序数量表[1,2,...,l]的层级数和分配元素个数(例如1,2,3)的情况下,提出了评估问题的基本版本。得到的估计是多重集(或袋)(多重集的基数等于常数)。给出了所研究评估问题的尺度偏序集。提出了“区间多重集估计”。进一步,研究了多重集估计上的运算:(a) 多重集估计的集成,(b) 多重集估计的邻近性,(c) 多重集估计的比较,(d) 多重集估计的聚合,以及(e) 多重集估计的对齐。研究了基于形态学方法的组合综合,包括带有设计备选方案多重集估计的改进版本。还简要描述了带有多重集估计的背包类问题。通过数值例子说明了评估方法、多重集估计以及相应的组合问题。

英文摘要

The paper addresses an approach to ordinal assessment of alternatives based on assignment of elements into an ordinal scale. Basic versions of the assessment problems are formulated while taking into account the number of levels at a basic ordinal scale [1,2,...,l] and the number of assigned elements (e.g., 1,2,3). The obtained estimates are multisets (or bags) (cardinality of the multiset equals a constant). Scale-posets for the examined assessment problems are presented. 'Interval multiset estimates' are suggested. Further, operations over multiset estimates are examined: (a) integration of multiset estimates, (b) proximity for multiset estimates, (c) comparison of multiset estimates, (d) aggregation of multiset estimates, and (e) alignment of multiset estimates. Combinatorial synthesis based on morphological approach is examined including the modified version of the approach with multiset estimates of design alternatives. Knapsack-like problems with multiset estimates are briefly described as well. The assessment approach, multiset-estimates, and corresponding combinatorial problems are illustrated by numerical examples.

1203.2556 2026-06-03 cs.AI cs.SY eess.SY 版本更新

A Probabilistic Transmission Expansion Planning Methodology based on Roulette Wheel Selection and Social Welfare

基于轮盘赌选择和社会福利的概率输电扩展规划方法

Neeraj Gupta, Rajiv Shekhar, Prem Kumar Kalra

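AI summary Proposes a probabilistic transmission expansion planning methodology that requires no a priori specification of new transmission capacities and uses the concept of social welfare; line capacities are calculated with a roulette wheel method and expected demand not served (EDNS) with load flow analysis. Results on a modified IEEE 5-bus test system show that adding only new transmission lines is not sufficient to minimize EDNS.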

Comments 22 pages, 4 figures

详情
AI中文摘要

提出了一种新的概率输电扩展规划(TEP)方法,该方法不需要预先指定新的/额外的输电容量,并利用了社会福利的概念。本文引入了两个新概念:(i)使用轮盘赌方法计算新输电线路的容量,(ii)使用潮流分析计算期望未供电量(EDNS)。整体方法已在改进的IEEE 5节点测试系统上实现。仿真结果表明一个重要结果:仅增加新的输电线路不足以最小化EDNS。

英文摘要

A new probabilistic methodology for transmission expansion planning (TEP) that does not require a priori specification of new/additional transmission capacities and uses the concept of social welfare has been proposed. Two new concepts have been introduced in this paper: (i) roulette wheel methodology has been used to calculate the capacity of new transmission lines and (ii) load flow analysis has been used to calculate expected demand not served (EDNS). The overall methodology has been implemented on a modified IEEE 5-bus test system. Simulations show an important result: addition of only new transmission lines is not sufficient to minimize EDNS.
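
Roulette wheel selection, which the abstract uses to calculate new line capacities, draws a candidate with probability proportional to its weight. The sketch below shows only that generic selection step; how the paper derives the weights and the candidate capacities is not stated in the abstract, so `candidates` and `weights` are hypothetical inputs.

```python
import random


def roulette_index(weights, u):
    """Return the index whose cumulative-weight slot contains u * sum(weights)."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    threshold = u * total
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if threshold < cumulative:
            return i
    # Guards against rounding when u is very close to 1.
    return max(i for i, w in enumerate(weights) if w > 0)


def roulette_select(candidates, weights, rng=random):
    """Pick one candidate with probability proportional to its weight."""
    return candidates[roulette_index(weights, rng.random())]
```

Passing the uniform draw `u` explicitly keeps the core function deterministic, which makes it easy to check; `roulette_select` wraps it with a random number generator for actual sampling.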

1202.3720 2026-06-03 eess.SY cs.AI cs.SY 版本更新

Efficient Inference in Markov Control Problems

马尔可夫控制问题中的高效推理

Thomas Furmston, David Barber

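AI summary For finite and infinite horizon Markov control problems, proposes an exact inference algorithm that is more efficient than the standard forward-backward recursions, with a principled extension to infinite horizon problems for policy gradient and Expectation Maximisation algorithms.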

详情
AI中文摘要

执行平滑、非贪婪策略更新的马尔可夫控制算法已被证明非常通用和灵活,其中策略梯度和期望最大化算法尤其流行。对于这些算法,需要对奖励加权轨迹分布进行边际推理以执行策略更新。我们讨论了有限时域情况下这些边际量的新精确推理算法,该算法比基于经典前向-后向递归的标准方法更高效。我们还提供了无限时域马尔可夫决策问题的原则性扩展,明确考虑了无限时域。该扩展为无限时域问题中的策略梯度和期望最大化提供了一种新算法。

英文摘要

Markov control algorithms that perform smooth, non-greedy updates of the policy have been shown to be very general and versatile, with policy gradient and Expectation Maximisation algorithms being particularly popular. For these algorithms, marginal inference of the reward weighted trajectory distribution is required to perform policy updates. We discuss a new exact inference algorithm for these marginals in the finite horizon case that is more efficient than the standard approach based on classical forward-backward recursions. We also provide a principled extension to infinite horizon Markov Decision Problems that explicitly accounts for an infinite horizon. This extension provides a novel algorithm for both policy gradients and Expectation Maximisation in infinite horizon problems.

1202.3703 2026-06-03 eess.SY cs.AI cs.SY 版本更新

Factored Filtering of Continuous-Time Systems

连续时间系统的因子化滤波

E. Busra Celikkaya, Christian R. Shelton, William Lam

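AI summary For continuous-time stochastic systems whose full state distribution is too large to store, proposes factored approximations of the matrix exponential computed either by ODE integration or by uniformization of the Taylor expansion, shows that the KL-divergence of factored uniformization is bounded, and reports experiments in which it outperforms existing methods.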

详情
AI中文摘要

我们考虑连续时间或异步随机系统的滤波问题,其中状态的全分布过大而无法存储或计算。我们假设系统的速率矩阵可以紧凑表示,并且信念分布近似为边缘分布的乘积。关键计算是矩阵指数。我们研究了两种不同的计算方法:ODE积分和泰勒展开的均匀化。对于这两种方法,我们考虑了仅维护因子化信念状态的近似。对于因子化均匀化,我们证明了滤波的KL散度有界。我们的实验结果证实,因子化均匀化比先前提出的均匀化方法和平均场算法表现更好。

英文摘要

We consider filtering for a continuous-time, or asynchronous, stochastic system where the full distribution over states is too large to be stored or calculated. We assume that the rate matrix of the system can be compactly represented and that the belief distribution is to be approximated as a product of marginals. The essential computation is the matrix exponential. We look at two different methods for its computation: ODE integration and uniformization of the Taylor expansion. For both we consider approximations in which only a factored belief state is maintained. For factored uniformization we demonstrate that the KL-divergence of the filtering is bounded. Our experimental results confirm our factored uniformization performs better than previously suggested uniformization methods and the mean field algorithm.
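
Uniformization rewrites the matrix exponential as a Poisson-weighted sum of powers of a discrete-time transition matrix: p(t) = sum_k Poisson(k; alpha*t) * p0 * M^k with M = I + Q/alpha. The sketch below implements the plain, unfactored version for a small chain, only to show the computation that the paper approximates with factored belief states; the truncation length and the function name are assumptions.

```python
import math


def uniformized_transient(p0, Q, t, terms=60):
    """Approximate p0 * exp(Q t) for a small CTMC by uniformization (unfactored)."""
    n = len(p0)
    alpha = max(-Q[i][i] for i in range(n))  # uniformization rate >= every exit rate
    if alpha == 0.0:
        return list(p0)
    # Discrete-time chain subordinated to a Poisson process of rate alpha.
    M = [[(1.0 if i == j else 0.0) + Q[i][j] / alpha for j in range(n)]
         for i in range(n)]
    weight = math.exp(-alpha * t)  # Poisson probability of k = 0 jumps
    v = list(p0)                   # holds p0 * M^k
    result = [0.0] * n
    for k in range(terms):
        for j in range(n):
            result[j] += weight * v[j]
        v = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        weight *= alpha * t / (k + 1)  # advance to the Poisson probability of k + 1
    return result
```

The fixed number of terms is adequate only when alpha*t is small; a careful implementation would choose the truncation point from the Poisson tail.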

1201.2630 2026-06-03 eess.SY cs.AI cs.SY 版本更新

Hybrid GPS-GSM Localization of Automobile Tracking System

混合GPS-GSM汽车跟踪系统定位

Mohammad A. Al-Khedher

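AI summary Proposes an integrated GPS-GSM system that improves the accuracy of GPS coordinates with a Kalman filter and tracks vehicles in real time using Google Earth, for fleet management, police vehicle distribution, and car theft alerts.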

Comments 11 pages, 11 figures, 23 references

详情
Journal ref
International Journal of Computer Science and Information Technology, Vol. 3, No. 6, pp. 75-85, 2011
AI中文摘要

提出了一种集成的GPS-GSM系统,利用谷歌地球应用跟踪车辆。远程模块具有安装在移动车辆上的GPS,用于识别其当前位置,并通过GSM将车辆数据端口获取的其他参数作为短信传输到接收站。接收到的GPS坐标使用卡尔曼滤波器进行滤波,以提高测量位置的精度。数据处理后,使用谷歌地球应用查看每辆车的当前位置和状态。该系统的目标是管理车队、警车分布和汽车防盗预警。

英文摘要

An integrated GPS-GSM system is proposed to track vehicles using the Google Earth application. The remote module has a GPS receiver mounted on the moving vehicle to identify its current position, which is transferred by GSM as an SMS to a recipient station together with other parameters acquired from the automobile's data port. The received GPS coordinates are filtered using a Kalman filter to enhance the accuracy of the measured position. After data processing, the Google Earth application is used to view the current location and status of each vehicle. The goal of this system is to manage fleets, the distribution of police automobiles, and car theft alerts.
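
The filtering step can be sketched with a scalar Kalman filter applied to one coordinate at a time. The abstract does not give the state model or the noise settings, so the random-walk model, the variances, and the function name below are assumptions chosen for illustration.

```python
def kalman_1d(measurements, meas_var, process_var=0.0, x0=0.0, p0=1e6):
    """Scalar Kalman filter with a random-walk state model (an assumed model)."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed to drift with variance process_var per step.
        p = p + process_var
        # Update: blend the prediction and the measurement using the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates
```

In a tracking setting the filter would be run separately on latitude and longitude, with a non-zero `process_var` so that the estimate can follow a moving vehicle; a large `p0` makes the first estimate follow the first measurement.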

1108.1170 2026-06-03 math.OC cs.AI cs.SY eess.SY 版本更新

Convex Optimization without Projection Steps

无投影步的凸优化

Martin Jaggi

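AI summary Proposes an iterative algorithm based on the Frank-Wolfe method that needs no projection steps for minimizing a convex function over a compact convex domain, reaching an ε-small duality gap after O(1/ε) iterations, and analyzes lower bounds on sparsity.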

详情
AI中文摘要

针对紧凸域上凸函数最小化的一般问题,我们研究了一种基于Frank & Wolfe 1956方法的简单迭代近似算法,该算法无需投影步即可保持在优化域内。代替投影步,求解由当前次梯度定义的线性化问题,得到自然保持在域内的步进方向。我们的框架将Frank & Wolfe的稀疏贪婪算法及其Clarkson 2010的原始-对偶分析(以及Hazan 2008的低秩SDP方法)推广到任意凸域。我们给出了收敛性证明,保证在O(1/ε)次迭代后达到ε小的对偶间隙。该方法使我们能够理解任何l1正则化凸优化问题(以及单纯形上的优化)的近似解的稀疏性,表示为近似质量的函数。我们得到了l1问题稀疏性的匹配上下界Θ(1/ε)。相同的界适用于有界迹的低秩半定优化,表明秩O(1/ε)在此也是最优的。作为另一个应用,当优化一类对角占优对称矩阵上的任意凸函数时,我们得到具有O(1/ε)个非零项的稀疏矩阵作为ε近似解。我们表明,我们提出的一阶方法也适用于核范数和最大范数矩阵优化问题。对于核范数正则化优化,如矩阵补全和低秩恢复,我们展示了算法在大矩阵问题(如Netflix数据集)上的实际效率和可扩展性。对于有界矩阵最大范数上的一般凸优化,据我们所知,我们的算法是第一个具有收敛保证的。

英文摘要

For the general problem of minimizing a convex function over a compact convex domain, we will investigate a simple iterative approximation algorithm based on the method by Frank & Wolfe 1956, that does not need projection steps in order to stay inside the optimization domain. Instead of a projection step, the linearized problem defined by a current subgradient is solved, which gives a step direction that will naturally stay in the domain. Our framework generalizes the sparse greedy algorithm of Frank & Wolfe and its primal-dual analysis by Clarkson 2010 (and the low-rank SDP approach by Hazan 2008) to arbitrary convex domains. We give a convergence proof guaranteeing ε-small duality gap after O(1/ε) iterations. The method allows us to understand the sparsity of approximate solutions for any l1-regularized convex optimization problem (and for optimization over the simplex), expressed as a function of the approximation quality. We obtain matching upper and lower bounds of Θ(1/ε) for the sparsity for l1-problems. The same bounds apply to low-rank semidefinite optimization with bounded trace, showing that rank O(1/ε) is best possible here as well. As another application, we obtain sparse matrices of O(1/ε) non-zero entries as ε-approximate solutions when optimizing any convex function over a class of diagonally dominant symmetric matrices. We show that our proposed first-order method also applies to nuclear norm and max-norm matrix optimization problems. For nuclear norm regularized optimization, such as matrix completion and low-rank recovery, we demonstrate the practical efficiency and scalability of our algorithm for large matrix problems, such as the Netflix dataset. For general convex optimization over bounded matrix max-norm, our algorithm is the first with a convergence guarantee, to the best of our knowledge.
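
The projection-free iteration is short enough to write out for the unit simplex, where the linearized subproblem is solved by picking the coordinate with the smallest gradient entry. The sketch below uses the standard step size 2/(k+2); the function name and the restriction to the simplex are illustrative assumptions.

```python
def frank_wolfe_simplex(grad, n, iterations):
    """Frank-Wolfe over the unit simplex in R^n; returns (x, duality gap at x)."""
    x = [0.0] * n
    x[0] = 1.0  # start at a vertex, so x has a single non-zero entry
    for k in range(iterations):
        g = grad(x)
        # Linearized subproblem: min over the simplex of <s, g> is attained at a vertex e_i.
        i = min(range(n), key=lambda j: g[j])
        step = 2.0 / (k + 2.0)
        # Convex combination of x and e_i stays inside the simplex, so no projection is needed.
        x = [(1.0 - step) * xj for xj in x]
        x[i] += step
    g = grad(x)
    # Duality gap <x - s, g> at the final iterate: an upper bound on f(x) - f(x*).
    gap = sum(xj * gj for xj, gj in zip(x, g)) - min(g)
    return x, gap
```

Each iteration adds at most one new non-zero coordinate, which is the source of the sparsity versus accuracy trade-off discussed in the abstract.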

1112.4057 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Performance Evaluation of Road Traffic Control Using a Fuzzy Cellular Model

基于模糊细胞模型的道路交通控制性能评估

Bartłomiej Płaczek

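AI summary Proposes a method based on a fuzzy cellular model for evaluating the performance of adaptive traffic control strategies in an on-line simulation environment, combining cellular automata and fuzzy calculus to handle imprecise measurements.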

Comments The final publication is available at http://www.springerlink.com

详情
Journal ref
Płaczek, B., Performance Evaluation of Road Traffic Control Using a Fuzzy Cellular Model. Lecture Notes in Artificial Intelligence 6679. Springer-Verlag, Berlin Heidelberg, pp. 59-66, 2011
AI中文摘要

本文提出了一种用于道路交通控制系统性能评估的方法。该方法设计用于在线仿真环境,能够优化自适应交通控制策略。性能指标通过模糊细胞交通模型计算,该模型是结合元胞自动机和模糊演算的混合系统。实验结果表明,所引入的方法允许使用不精确的交通测量进行性能评估。此外,性能指标的模糊定义便于交通控制决策中的不确定性确定。

英文摘要

In this paper a method is proposed for performance evaluation of road traffic control systems. The method is designed to be implemented in an on-line simulation environment, which enables optimisation of adaptive traffic control strategies. Performance measures are computed using a fuzzy cellular traffic model, formulated as a hybrid system combining cellular automata and fuzzy calculus. Experimental results show that the introduced method allows the performance to be evaluated using imprecise traffic measurements. Moreover, the fuzzy definitions of performance measures are convenient for uncertainty determination in traffic control decisions.

1107.2126 2026-06-03 math.NA cs.AI cs.IT cs.NA math.IT math.LO 版本更新

Strong Solutions of the Fuzzy Linear Systems

模糊线性系统的强解

Şahin Emrah Amrahov, Iman N. Askerzade

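AI summary For fuzzy linear systems with a crisp coefficient matrix and a right-hand side of fuzzy numbers in parametric form, proves an existence and uniqueness theorem for strong solutions whose conditions depend on both the coefficient matrix and the right-hand side, generalizing the classical theorem that applies only to special systems.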

Comments 11 pages

详情
Journal ref
CMES: Computer Modeling in Engineering & Sciences, Vol. 76, No. 4, pp. 207-216, 2011
AI中文摘要

我们考虑一个模糊线性系统,其系数矩阵为清晰矩阵,右端为参数形式的任意模糊数。众所周知,强模糊解的存在唯一性经典定理等价于:系数矩阵是一个置换矩阵与一个对角矩阵的乘积。这意味着该定理仅适用于特殊形式的线性系统,即每个方程恰好包含一个变量的系统。我们证明了一个存在唯一性定理,该定理可用于更一般的系统。该定理的充要条件同时依赖于系数矩阵和右端项。该定理是经典强解存在唯一性定理的推广。

英文摘要

We consider a fuzzy linear system with crisp coefficient matrix and with an arbitrary fuzzy number in parametric form on the right-hand side. It is known that the well-known existence and uniqueness theorem of a strong fuzzy solution is equivalent to the following: The coefficient matrix is the product of a permutation matrix and a diagonal matrix. This means that this theorem can be applicable only for a special form of linear systems, namely, only when the system consists of equations, each of which has exactly one variable. We prove an existence and uniqueness theorem, which can be used on more general systems. The necessary and sufficient conditions of the theorem are dependent on both the coefficient matrix and the right-hand side. This theorem is a generalization of the well-known existence and uniqueness theorem for the strong solution.

1004.2027 2026-06-03 cs.LG cs.AI cs.SY eess.SY math.OC stat.ML 版本更新

Dynamic Policy Programming

动态策略编程

Mohammad Gheshlaghi Azar, Vicenc Gomez, Hilbert J. Kappen

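AI summary Proposes dynamic policy programming (DPP), whose performance-loss bounds depend on the l∞-norm of the average accumulated error, giving it an advantage over standard approximate value iteration and approximate policy iteration under approximation error; DPP-based algorithms outperform existing reinforcement learning methods by a wide margin on several problem domains.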

Comments Submitted to Journal of Machine Learning Research

详情
AI中文摘要

在本文中,我们提出了一种新颖的策略迭代方法,称为动态策略编程(DPP),用于估计无限时域马尔可夫决策过程中的最优策略。我们证明了在存在近似/估计误差的情况下,DPP的有限迭代和渐近l∞范数性能损失界。这些界以平均累积误差的l∞范数表示,而不是标准近似值迭代(AVI)和近似策略迭代(API)中误差的l∞范数。这表明DPP可以实现比AVI和API更好的性能,因为它平均了整个学习过程中由蒙特卡洛采样引起的模拟噪声。我们通过在不同问题域上比较DPP的近似变体与现有强化学习(RL)方法的性能,数值验证了这一理论结果。我们的结果表明,在所有情况下,基于DPP的算法都大幅优于其他RL方法。

英文摘要

In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic l∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the l∞-norm of the average accumulated error, as opposed to the l∞-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of the approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform other RL methods by a wide margin.

1108.6223 2026-06-03 cs.SE cs.AI cs.DM cs.NI cs.SY eess.SY math.OC 版本更新

Towards Configuration of applied Web-based information system

面向应用型Web信息系统的配置

Mark Sh. Levin

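AI summary Uses the Hierarchical Morphological Multicriteria Design approach to design the configuration of applied Web-based systems by combining design alternatives for system parts, and evaluates the quality of the combinations in a lattice-based discrete space.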

Comments 13 pages, 9 tables, 17 figures

详情
AI中文摘要

本文描述了应用型Web系统的结构组合合成。该问题被视为将系统部件/组件的选定设计备选方案组合成一个最终的复合决策(即系统配置设计)。求解框架基于分层形态多准则设计(HMMD)方法:(i)对系统部件的备选方案进行多准则选择,(ii)将选定的备选方案组合成最终组合(同时考虑上述备选方案的序数质量及其兼容性)。使用基于格的离散空间来评估(整合)最终组合(即复合系统决策或系统配置)的质量。此外,还考虑了一种基于多准则多选择问题的简化求解框架。还描述了一个多阶段设计过程以获得系统轨迹。基本应用示例针对通信服务提供商的应用型Web系统。简要描述了另外两个应用(企业系统和学术应用信息系统)。

英文摘要

In the paper, combinatorial synthesis of structure for applied Web-based systems is described. The problem is considered as a combination of selected design alternatives for system parts/components into a resultant composite decision (i.e., system configuration design). The solving framework is based on Hierarchical Morphological Multicriteria Design (HMMD) approach: (i) multicriteria selection of alternatives for system parts, (ii) composing the selected alternatives into a resultant combination (while taking into account ordinal quality of the alternatives above and their compatibility). A lattice-based discrete space is used to evaluate (to integrate) quality of the resultant combinations (i.e., composite system decisions or system configurations). In addition, a simplified solving framework based on multicriteria multiple choice problem is considered. A multistage design process to obtain a system trajectory is described as well. The basic applied example is targeted to an applied Web-based system for a communication service provider. Two other applications are briefly described (corporate system and information system for academic application).

1107.0089 2026-06-03 eess.SY cs.AI cs.SY 版本更新

Towards a Reliable Framework of Uncertainty-Based Group Decision Support System

基于不确定性的群体决策支持系统可靠框架

Junyi Chai, James N. K. Liu

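AI summary Proposes a framework for an uncertainty-based group decision support system that supports multiple criteria decision analysis and provides reliable decision support by integrating a multi-agent architecture with artificial intelligence techniques.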

Comments Accepted paper in IEEE-ICDM2010; Print ISBN: 978-1-4244-9244-2

详情
AI中文摘要

本研究提出了一种基于不确定性的群体决策支持系统(UGDSS)框架。它为多准则决策分析提供了一个平台,涵盖六个方面:(1)决策环境、(2)决策问题、(3)决策群体、(4)决策冲突、(5)决策方案和(6)群体协商。基于多种人工智能技术,该框架通过设计集成的多智能体架构,为应用和高级决策方法的全面操作提供了可靠支持。

英文摘要

This study proposes a framework of Uncertainty-based Group Decision Support System (UGDSS). It provides a platform for multiple criteria decision analysis in six aspects: (1) decision environment, (2) decision problem, (3) decision group, (4) decision conflict, (5) decision schemes and (6) group negotiation. Based on multiple artificial intelligence technologies, this framework provides reliable support for the comprehensive manipulation of applications and advanced decision approaches through the design of an integrated multi-agent architecture.

1006.2165 2026-06-03 stat.ME cs.AI cs.RO cs.SY eess.SY math.OC stat.ML 版本更新

A Probabilistic Perspective on Gaussian Filtering and Smoothing

高斯滤波与平滑的概率视角

Marc Peter Deisenroth, Henrik Ohlsson

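AI summary Unifies Gaussian filtering and smoothing methods from a probabilistic perspective, shows that they differ only in how the means and covariances of joint probabilities are computed or approximated, and on that basis derives the cubature Kalman smoother and a robust filtering and smoothing algorithm based on Gibbs sampling.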

Comments 14 pages. Extended version of conference paper (ACC 2011)

Details
AI Chinese abstract

我们提出了一个关于高斯滤波与平滑的通用概率视角。这使我们能够证明,常见的高斯滤波/平滑方法仅通过其计算/近似联合概率的均值和协方差的方法来区分。这意味着,通过提供计算这些矩的方法,可以直接推导出新的滤波器和平滑器。基于这一见解,我们推导了容积卡尔曼平滑器,并提出了一种基于吉布斯采样的新型鲁棒滤波与平滑算法。

English abstract

We present a general probabilistic perspective on Gaussian filtering and smoothing. This allows us to show that common approaches to Gaussian filtering/smoothing can be distinguished solely by their methods of computing/approximating the means and covariances of joint probabilities. This implies that novel filters and smoothers can be derived straightforwardly by providing methods for computing these moments. Based on this insight, we derive the cubature Kalman smoother and propose a novel robust filtering and smoothing algorithm based on Gibbs sampling.
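
One way to read the claim above: once the means and covariances of the joint distribution of state and measurement are available, the measurement update is ordinary Gaussian conditioning, and only the routine that produces those moments changes from filter to filter. The sketch below assumes a linear measurement model, for which the moments are exact (the Kalman filter case). The function names and the toy numbers are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def gaussian_update(mu_x, S_xx, mu_z, S_zz, S_xz, z):
    """Condition a joint Gaussian over (state x, measurement z) on an observed z."""
    # gain = S_xz @ inv(S_zz), computed with a linear solve (S_zz is symmetric)
    gain = np.linalg.solve(S_zz.T, S_xz.T).T
    mu_post = mu_x + gain @ (z - mu_z)
    S_post = S_xx - gain @ S_xz.T
    return mu_post, S_post

def linear_joint_moments(mu_x, S_xx, C, R):
    """Exact joint moments for the assumed model z = C x + v, v ~ N(0, R)."""
    mu_z = C @ mu_x
    S_zz = C @ S_xx @ C.T + R
    S_xz = S_xx @ C.T
    return mu_z, S_zz, S_xz

# Toy example (hypothetical numbers): observe only the first state component.
mu_x = np.array([0.0, 1.0])
S_xx = np.eye(2)
C = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
z = np.array([2.0])

mu_z, S_zz, S_xz = linear_joint_moments(mu_x, S_xx, C, R)
mu_post, S_post = gaussian_update(mu_x, S_xx, mu_z, S_zz, S_xz, z)
print(mu_post)
print(S_post)
```

Replacing `linear_joint_moments` with a routine that approximates the same three quantities for a nonlinear model (for example by linearization or by a sigma-point rule) would leave `gaussian_update` unchanged.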

1004.2342 2026-06-03 cs.AI cs.PF cs.SY eess.SY math.OC math.PR version update

Mean field for Markov Decision Processes: from Discrete to Continuous Optimization

马尔可夫决策过程的平均场:从离散到连续优化

Nicolas Gast, Bruno Gaujal, Jean-Yves Le Boudec

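AI summary Studies the convergence of Markov decision processes made of a large number of objects to optimization problems on ordinary differential equations, obtains a continuous HJB equation through the mean field approximation, and gives bounds on the reward difference together with a constructive algorithm.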

Details
AI Chinese abstract

我们研究由大量对象组成的马尔可夫决策过程收敛到常微分方程(ODE)优化问题。我们证明,满足Bellman方程的此类马尔可夫决策过程的最优奖励收敛到基于马尔可夫决策过程平均场近似的连续Hamilton-Jacobi-Bellman(HJB)方程的解。我们给出了奖励差异的界限,以及从HJB方程的解推导马尔可夫决策过程近似解的构造性算法。我们通过三个例子说明该方法,分别涉及投资策略、种群动态控制和队列调度。这些例子用于说明和证明受控ODE的构造,并展示求解连续HJB方程相对于求解大型离散Bellman方程所获得的收益。

English abstract

We study the convergence of Markov Decision Processes made of a large number of objects to optimization problems on ordinary differential equations (ODE). We show that the optimal reward of such a Markov Decision Process, satisfying a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov Decision Process. We give bounds on the difference of the rewards, and a constructive algorithm for deriving an approximating solution to the Markov Decision Process from a solution of the HJB equation. We illustrate the method on three examples pertaining respectively to investment strategies, population dynamics control, and scheduling in queues. They are used to illustrate and justify the construction of the controlled ODE and to show the gain obtained by solving a continuous HJB equation rather than a large discrete Bellman equation.

1102.0899 2026-06-03 cs.AI cs.CV cs.LG cs.NA math.NA math.PR version update

Evidence Feed Forward Hidden Markov Model: A New Type of Hidden Markov Model

证据前馈隐马尔可夫模型:一种新型隐马尔可夫模型

Michael DelRose, Christian Wagner, Philip Frederick

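AI summary To address the inability of hidden Markov models to model linkages between observations, proposes the Evidence Feed Forward Hidden Markov Model, which introduces probabilistic observation-to-observation links to improve classification performance, and validates it on visual action data and measurement data.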

Comments 19 pages, International Journal of Artificial Intelligence and Applications

Details
Journal ref
International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 2, No. 1, Jan 2011
AI Chinese abstract

仅基于视觉动作预测他人意图的能力是人类和动物独有的技能。当前计算机算法的智能尚未达到这种复杂程度,但已有若干研究正朝此方向努力。由于可用的分类算法众多,难以确定哪种算法最适合特定情境。在视觉人类意图数据分类中,隐马尔可夫模型(HMM)及其变体是主要候选方法。HMM无法提供观测间链接的概率,这是该分类技术的一大缺陷。当人通过视觉识别他人的动作时,会监控观测中的模式。通过估计下一个观测,人们能够总结动作,从而相当准确地判断执行动作者的意图。这些视觉线索和链接对于创建基于视觉观测确定人类动作的智能算法至关重要。证据前馈隐马尔可夫模型是一种新开发的算法,它提供了观测间链接。本研究阐述了证据前馈HMM背后的理论,提供了其学习这些参数以优化观测似然性的数学证明(这对所有计算智能算法都至关重要),并给出了与标准HMM在视觉动作数据和测量数据分类中的比较示例,从而为证据前馈HMM在多种问题分类中的应用奠定了坚实基础。

English abstract

The ability to predict the intentions of people based solely on their visual actions is a skill only performed by humans and animals. The intelligence of current computer algorithms has not reached this level of complexity, but several research efforts are working towards it. With the number of classification algorithms available, it is hard to determine which algorithm works best for a particular situation. In classification of visual human intent data, Hidden Markov Models (HMMs) and their variants are leading candidates. The inability of HMMs to provide a probability for observation-to-observation linkages is a major drawback of this classification technique. When a person visually identifies an action of another person, they monitor patterns in the observations. By estimating the next observation, people are able to summarize the actions and thus determine, with fairly good accuracy, the intention of the person performing the action. These visual cues and linkages are important in creating intelligent algorithms for determining human actions based on visual observations. The Evidence Feed Forward Hidden Markov Model is a newly developed algorithm that provides observation-to-observation linkages. The following research addresses the theory behind Evidence Feed Forward HMMs, provides mathematical proofs that their parameters can be learned so as to optimize the likelihood of observations under an Evidence Feed Forward HMM, which is important in all computational intelligence algorithms, and gives comparative examples with standard HMMs in classification of both visual action data and measurement data, thus providing a strong base for Evidence Feed Forward HMMs in classification of many types of problems.
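
For context, the limitation described above can be seen in the forward algorithm of a standard discrete HMM: each observation enters only through the emission probability of the current hidden state, so no term links one observation directly to the next. The sketch below shows this baseline only, not the Evidence Feed Forward model from the paper; the parameter values are made up for illustration.

```python
def hmm_likelihood(pi, A, B, obs):
    """Forward algorithm: probability of an observation sequence under a discrete HMM.

    pi[i]   : initial probability of hidden state i
    A[i][j] : probability of moving from state i to state j
    B[i][o] : probability of emitting symbol o while in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        # The new observation o appears only in B[j][o]; nothing ties it to the previous symbol.
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
    return sum(alpha)

# Hypothetical two-state, two-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(hmm_likelihood(pi, A, B, [0, 1, 0]))
```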

1012.0365 2026-06-03 math.NA cs.AI cs.NA math.OC version update

A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms

一种加速核范数最小化算法的块Lanczos热启动技术

Zhouchen Lin, Siming Wei

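AI summary Proposes the block Lanczos with warm start (BLWS) technique, which initializes the block Lanczos procedure with the principal singular subspaces from the previous iteration to speed up the partial SVD computations in nuclear norm minimization algorithms; experiments show a speedup of two to three times.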

Details
AI Chinese abstract

近年来,使用秩最小化作为各种信号处理和机器学习问题的正则化器变得流行。由于秩最小化问题通常转化为核范数最小化(NNM)问题,它们必须迭代求解,且每次迭代需要计算奇异值分解(SVD)。因此,它们的求解受到多次SVD高计算成本的影响。为了缓解这一问题,我们提出使用块Lanczos方法计算部分SVD,其中利用前一次迭代得到的主奇异子空间来启动块Lanczos过程。为了避免Lanczos过程中昂贵的重正交化,块Lanczos过程仅执行少数几步。我们的块Lanczos热启动(BLWS)技术可被求解NNM问题的不同算法采用。我们给出了将BLWS应用于鲁棒PCA和矩阵补全问题的数值结果。实验结果表明,我们的BLWS技术通常将其宿主算法加速至少两到三倍。

English abstract

Recent years have witnessed the popularity of using rank minimization as a regularizer for various signal processing and machine learning problems. As rank minimization problems are often converted to nuclear norm minimization (NNM) problems, they have to be solved iteratively and each iteration requires computing a singular value decomposition (SVD). Therefore, their solution suffers from the high computation cost of multiple SVDs. To relieve this issue, we propose using the block Lanczos method to compute the partial SVDs, where the principal singular subspaces obtained in the previous iteration are used to start the block Lanczos procedure. To avoid the expensive reorthogonalization in the Lanczos procedure, the block Lanczos procedure is performed for only a few steps. Our block Lanczos with warm start (BLWS) technique can be adopted by different algorithms that solve NNM problems. We present numerical results on applying BLWS to Robust PCA and Matrix Completion problems. Experimental results show that our BLWS technique usually accelerates its host algorithm by at least two to three times.
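
The warm-start idea can be sketched as follows. This is not the authors' BLWS implementation: it swaps in a plain block Krylov subspace with a Rayleigh-Ritz step, which spans the same subspace a few block Lanczos steps would build but without the three-term recurrence. The function name `partial_svd_warm`, the matrix sizes, and the perturbation used to mimic a "next iteration" are all assumptions made for illustration.

```python
import numpy as np

def partial_svd_warm(A, V0, steps=2):
    """Approximate the k leading singular triplets of A from a starting block V0 (n x k)."""
    k = V0.shape[1]
    blocks = [np.linalg.qr(V0)[0]]
    for _ in range(steps):
        # One block Krylov step with A^T A, orthonormalized for numerical stability.
        blocks.append(np.linalg.qr(A.T @ (A @ blocks[-1]))[0])
    Q, _ = np.linalg.qr(np.hstack(blocks))        # orthonormal basis of the search subspace
    U_s, s, Vt_s = np.linalg.svd(A @ Q, full_matrices=False)  # Rayleigh-Ritz on the small problem
    V = Q @ Vt_s.T
    return U_s[:, :k], s[:k], V[:, :k]

rng = np.random.default_rng(0)
# Hypothetical rank-3 matrix with singular values 10, 5, 2.
U0, _ = np.linalg.qr(rng.standard_normal((60, 3)))
W0, _ = np.linalg.qr(rng.standard_normal((40, 3)))
A = U0 @ np.diag([10.0, 5.0, 2.0]) @ W0.T

# Cold start from a random block.
_, s_cold, V_prev = partial_svd_warm(A, rng.standard_normal((40, 3)), steps=2)

# "Next iteration": the matrix changes slightly, and the previous subspace seeds the solver.
A_next = A + 1e-3 * rng.standard_normal(A.shape)
_, s_warm, _ = partial_svd_warm(A_next, V_prev, steps=1)
print(s_cold, s_warm)
```

Because the previous subspace is already close to the new one, a single step suffices in this toy case, which is the intuition behind running only a few steps per iteration.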

0906.0311 2026-06-03 cs.AI cs.NA math.NA physics.data-an version update

Solar radiation forecasting using ad-hoc time series preprocessing and neural networks

使用特定时间序列预处理和神经网络的太阳辐射预测

Christophe Paoli, Cyril Voyant, Marc Muselli, Marie-Laure Nivet

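AI summary Proposes a method for daily forecasting of global solar radiation on a horizontal surface that combines ad-hoc time series preprocessing with a Multi-Layer Perceptron (MLP), reaching nRMSE < 21% and RMSE < 998 Wh/m², comparable to or better than conventional methods such as ARIMA and Bayesian inference.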

Comments 14 pages, 8 figures, 2009 International Conference on Intelligent Computing

Details
AI Chinese abstract

在本文中,我们展示了神经网络在可再生能源领域的一个应用。我们开发了一种用于日水平面全球太阳辐射预测的方法。我们使用特定时间序列预处理和多层感知器(MLP)来预测日尺度的太阳辐射。初步结果令人鼓舞,nRMSE < 21%,RMSE < 998 Wh/m²。我们优化的MLP的预测性能与ARIMA技术、贝叶斯推断、马尔可夫链和k近邻近似器等传统方法相似甚至更好。此外,我们发现我们的数据预处理方法可以显著减少预测误差。

English abstract

In this paper, we present an application of neural networks in the renewable energy domain. We have developed a methodology for the daily prediction of global solar radiation on a horizontal surface. We use ad-hoc time series preprocessing and a Multi-Layer Perceptron (MLP) to predict solar radiation at a daily horizon. First results are promising, with nRMSE < 21% and RMSE < 998 Wh/m². Our optimized MLP yields predictions similar to or even better than those of conventional methods such as ARIMA techniques, Bayesian inference, Markov chains, and k-Nearest-Neighbors approximators. Moreover, we found that our data preprocessing approach can significantly reduce forecasting errors.
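
The two error measures quoted above can be computed as in the sketch below. The abstract does not say how nRMSE is normalized; dividing the RMSE by the mean of the measured values is an assumption here, and other conventions exist. The sample values are hypothetical daily totals in Wh/m².

```python
import math

def rmse(measured, predicted):
    """Root mean square error, in the same unit as the inputs."""
    squared = [(p - m) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def nrmse(measured, predicted):
    """RMSE divided by the mean measured value (assumed normalization)."""
    return rmse(measured, predicted) / (sum(measured) / len(measured))

measured = [4000.0, 5000.0, 6000.0, 5000.0]    # hypothetical measurements
predicted = [4300.0, 4600.0, 6000.0, 5000.0]   # hypothetical forecasts

print(rmse(measured, predicted))    # RMSE in Wh/m²
print(nrmse(measured, predicted))   # dimensionless; multiply by 100 for a percentage
```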

0712.4126 2026-06-03 cs.AI cs.CE cs.MS cs.NA cs.NE math.NA version update

TRUST-TECH based Methods for Optimization and Learning

基于TRUST-TECH的优化与学习方法

Chandan K. Reddy

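AI summary For nonlinear and global optimization problems in machine learning, proposes a TRUST-TECH based framework that alternates local and neighborhood-search stages, reducing sensitivity to initialization and improving solution quality.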

Comments PhD Thesis

Details
Journal ref
Chandan K. Reddy, TRUST-TECH based Methods for Optimization and Learning, PhD Thesis, Cornell University, February 2007
AI Chinese abstract

机器学习领域中出现的许多问题涉及非线性,并且通常要求用户获得全局最优解而非局部最优解。优化问题是机器学习算法中固有的,因此机器学习中的许多方法都继承自优化文献。通常被称为初始化问题,所需的理想参数集将显著依赖于给定的初始值。最近开发的TRUST-TECH(稳定性保持平衡变换表征)方法系统地探索参数子空间,以获得完整的局部最优解集。在本论文工作中,我们提出了基于TRUST-TECH的方法来解决若干优化和机器学习问题。在解空间中交替重复两个阶段,即局部阶段和邻域搜索阶段,以提高解的质量。我们的方法在合成数据集和真实数据集上进行了测试,使用这一新颖框架的优势得到了清晰体现。该框架不仅降低了对初始化的敏感性,还允许从业者灵活使用各种对特定问题有效的全局和局部方法。还研究了其他层次随机算法,如进化算法和平滑算法,并提出了将这些方法与TRUST-TECH结合的框架,在多个测试系统上进行了评估。

English abstract

Many problems that arise in the machine learning domain involve nonlinearity and quite often require users to obtain global optimal solutions rather than local ones. Optimization problems are inherent in machine learning algorithms, and hence many methods in machine learning were inherited from the optimization literature. In what is popularly known as the initialization problem, the set of parameters obtained depends significantly on the given initialization values. The recently developed TRUST-TECH (TRansformation Under STability-reTaining Equilibria CHaracterization) methodology systematically explores the subspace of the parameters to obtain a complete set of local optimal solutions. In this thesis work, we propose TRUST-TECH based methods for solving several optimization and machine learning problems. Two stages, namely the local stage and the neighborhood-search stage, are repeated alternately in the solution space to improve the quality of the solutions. Our methods were tested on both synthetic and real datasets, and the advantages of using this novel framework are clearly manifested. This framework not only reduces the sensitivity to initialization, but also gives practitioners the flexibility to use various global and local methods that work well for a particular problem of interest. Other hierarchical stochastic algorithms, such as evolutionary algorithms and smoothing algorithms, are also studied, and frameworks for combining these methods with TRUST-TECH have been proposed and evaluated on several test systems.

nlin/0407032 2026-06-03 nlin.PS cs.AI cs.NA math.NA version update

Application of Artificial Neural Network in Jitter Analysis of Dispersion-Managed Communication System

人工神经网络在色散管理通信系统抖动分析中的应用

F. P. Zen, B. E. Gunara, W. Hidayat, Z. A. Thalib, H. Zainuddin, J. Aminuddin

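AI summary Uses an artificial neural network to solve the modified nonlinear Schrödinger equation and analyze jitter in a dispersion-managed system, verifying and improving on results from conventional numerical methods.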

Comments 9 pages, 5 figures

Details
AI Chinese abstract

人工神经网络(ANN)被用作数值方法,求解带有色散管理系统(DMS)的修正非线性薛定谔(NLS)方程,用于抖动分析。我们以光轴z和时间t作为输入,然后得到一些相关值,如脉冲位置和中心频率的变化,以及抖动分析所需的输入脉冲的均方时间。结果表明,ANN产生的数值解对数值误差具有自适应性,并且验证了使用传统数值方法得到的先前数值结果。我们的结果表明,DMS可以最小化由某些放大器引起的定时抖动。

English abstract

An Artificial Neural Network (ANN) is used as a numerical method for solving the modified Nonlinear Schrödinger (NLS) equation with a Dispersion Managed System (DMS) for jitter analysis. We take the optical axis z and the time t as input and obtain relevant values such as the change of the position and the center frequency of the pulse, as well as the mean square time of the incoming pulse, which are needed for jitter analysis. The results show that the ANN yields numerical solutions that are adaptive with respect to the numerical errors and that it verifies previous numerical results obtained with a conventional numerical method. Our results indicate that DMS can minimize the timing jitter induced by some amplifiers.